I am trying to create a Java implementation of a maxent classifier. I need to classify sentences into n different classes.
I had a look at ColumnDataClassifier in the Stanford maxent classifier, but I am not able to understand how to create the training data. I need training data in a form that includes POS tags for the words of each sentence, so that the features used by the classifier can be things like the previous word, the next word, etc.
I am looking for training data that has sentences with POS tagging and the sentence class mentioned, for example:
My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS
Any help will be appreciated.
If I understand it correctly, you are trying to treat sentences as a set of POS tags.
In your example, the sentence "My name is XYZ" would be represented as a set of (PRP$, NN, VBZ, NNP).
That would mean every sentence is actually a binary vector of length 37 (there are 36 possible POS tags according to this page, plus the CLASS outcome feature for the whole sentence).
This can be encoded for OpenNLP Maxent as follows:
PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1
or simply:
PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
(For a working code snippet, see my answer here: Training models using openNLP maxent)
Some more sample data would be:
"By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
"In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
"As soon as she moved out, the mobile home was demolished, the suit said."
...
This would yield samples:
IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
...
However, I don't expect such a classification to yield good results. It would be better to make use of other structural features of the sentence, such as the parse tree or dependency tree, which can be obtained using, e.g., the Stanford parser.
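For illustration only, a rough sketch of how such structures can be pulled out with Stanford CoreNLP's CoreDocument API (the annotator list and the constituencyParse()/dependencyParse() accessors reflect newer CoreNLP versions and are my assumption, not something from the question):

import java.util.Properties;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.Tree;

public class SentenceFeatureSketch {
    public static void main(String[] args) {
        // Minimal pipeline: tokenize, split, tag, constituency parse, dependency parse.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,parse,depparse");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        CoreDocument document = new CoreDocument("My name is XYZ.");
        pipeline.annotate(document);

        CoreSentence sentence = document.sentences().get(0);
        Tree parseTree = sentence.constituencyParse();        // constituency structure
        SemanticGraph dependencies = sentence.dependencyParse(); // dependency structure

        System.out.println(parseTree);
        System.out.println(dependencies);
    }
}

From the returned Tree and SemanticGraph you could then derive features such as the root category or the set of dependency relation labels.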
Edited on 28.3.2016:
You can also use the whole sentence as a training sample. However, be aware that:
- two sentences might contain the same words but have different meanings
- there is a pretty high chance of overfitting
- you should use short sentences
- you need a huge training set
According to your example, I would encode the training samples as follows:
class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
...
Notice that the outcome variable comes as the first element on each line.
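For illustration only, a tiny hypothetical helper (toTrainingLine is my own name, not part of OpenNLP) that builds such a line from an outcome label and tokens already in word_TAG form:

import java.util.Arrays;
import java.util.List;

public class TrainingLineSketch {
    // Builds e.g. "class=Male My_PRP name_NN is_VBZ John_NNP"
    static String toTrainingLine(String outcome, List<String> taggedTokens) {
        return "class=" + outcome + " " + String.join(" ", taggedTokens);
    }

    public static void main(String[] args) {
        System.out.println(toTrainingLine("Male",
                Arrays.asList("My_PRP", "name_NN", "is_VBZ", "John_NNP")));
    }
}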
Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.
package my.maxent;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import opennlp.maxent.GIS;
import opennlp.maxent.io.GISModelReader;
import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
import opennlp.model.AbstractModel;
import opennlp.model.AbstractModelWriter;
import opennlp.model.DataIndexer;
import opennlp.model.DataReader;
import opennlp.model.FileEventStream;
import opennlp.model.MaxentModel;
import opennlp.model.OnePassDataIndexer;
import opennlp.model.PlainTextFileDataReader;
public class MaxentTest {

    public static void main(String[] args) throws IOException {

        String trainingFileName = "training-file.txt";
        String modelFileName = "trained-model.maxent.gz";

        // Training a model from data stored in a file.
        // The training file contains one training sample per line.
        DataIndexer indexer = new OnePassDataIndexer(new FileEventStream(trainingFileName));
        MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations

        // Storing the trained model into a file for later use (gzipped)
        File outFile = new File(modelFileName);
        AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
        writer.persist();

        // Loading the gzipped model from a file
        FileInputStream inputStream = new FileInputStream(modelFileName);
        InputStream decodedInputStream = new GZIPInputStream(inputStream);
        DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
        MaxentModel loadedMaxentModel = new GISModelReader(modelReader).getModel();

        // Now predicting the outcome using the loaded model
        String[] context = {"is_VBZ", "Gaby_NNP"};
        double[] outcomeProbs = loadedMaxentModel.eval(context);
        String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);

        System.out.println("=======================================");
        System.out.println(outcome);
        System.out.println("=======================================");
    }
}
And some dummy training data (stored as training-file.txt):
class=Male My_PRP name_NN is_VBZ John_NNP
class=Male My_PRP name_NN is_VBZ Peter_NNP
class=Female My_PRP name_NN is_VBZ Anna_NNP
class=Female My_PRP name_NN is_VBZ Gaby_NNP
This yields the following output:
Indexing events using cutoff of 0
Computing event counts... done. 4 events
Indexing... done.
Sorting and merging events... done. Reduced 4 events to 4.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 4
Number of Outcomes: 2
Number of Predicates: 7
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-2.772588722239781 0.5
2: ... loglikelihood=-2.4410105407571203 1.0
...
99: ... loglikelihood=-0.16111520541752372 1.0
100: ... loglikelihood=-0.15953272940719138 1.0
=======================================
class=Female
=======================================
Related problem:
I am getting a bunch of errors when I paste this sample code from the beginner's Stanford CoreNLP tutorial (https://stanfordnlp.github.io/CoreNLP/api.html) into Eclipse. I'm not sure what is wrong: I imported the external JAR files as mentioned in other tutorials, but I am still getting errors.
import edu.stanford.nlp.coref.data.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.ie.util.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.trees.*;
import java.util.*;
public class BasicPipelineExample {

    public static String text = "Joe Smith was born in California. " +
            "In 2017, he went to Paris, France in the summer. " +
            "His flight left at 3:00pm on July 10th, 2017. " +
            "After eating some escargot for the first time, Joe said, \"That was delicious!\" " +
            "He sent a postcard to his sister Jane Smith. " +
            "After hearing about Joe's trip, Jane decided she might go to France one day.";

    public static void main(String[] DEEPANSHA) throws InterruptedException {
        // set up pipeline properties
        Properties props = new Properties();
        // set the list of annotators to run
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,depparse,coref,kbp,quote");
        // set a property for an annotator; here the coref annotator is set to use the neural algorithm
        props.setProperty("coref.algorithm", "neural");
        // build pipeline
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        // create a document object
        CoreDocument document = new CoreDocument(text);
        // annotate the document
        pipeline.annotate(document);
        // text of the first sentence
        String sentenceText = document.sentences().get(0).text();
        System.out.println("Example: sentence");
        System.out.println(sentenceText);
        System.out.println();
    }
}
Errors shown :
Adding annotator tokenize
No tokenizer type provided. Defaulting to PTBTokenizer.
Adding annotator ssplit
Adding annotator pos
Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [2.2 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz ... done [5.6 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.muc.7class.distsim.crf.ser.gz ... done [1.3 sec].
Loading classifier from edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz ... done [3.1 sec].
TokensRegexNERAnnotator ner.fine.regexner: Read 580641 unique entries out of 581790 from edu/stanford/nlp/models/kbp/regexner_caseless.tab, 0 TokensRegex patterns.
TokensRegexNERAnnotator ner.fine.regexner: Read 4857 unique entries out of 4868 from edu/stanford/nlp/models/kbp/regexner_cased.tab, 0 TokensRegex patterns.
TokensRegexNERAnnotator ner.fine.regexner: Read 585498 unique entries from 2 files
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at edu.stanford.nlp.ling.tokensregex.parser.TokenSequenceParser.<init>(TokenSequenceParser.java:3446)
at edu.stanford.nlp.ling.tokensregex.TokenSequencePattern.getNewEnv(TokenSequencePattern.java:158)
at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.createPatternMatcher(TokensRegexNERAnnotator.java:343)
at edu.stanford.nlp.pipeline.TokensRegexNERAnnotator.<init>(TokensRegexNERAnnotator.java:295)
at edu.stanford.nlp.pipeline.NERCombinerAnnotator.setUpFineGrainedNER(NERCombinerAnnotator.java:209)
at edu.stanford.nlp.pipeline.NERCombinerAnnotator.<init>(NERCombinerAnnotator.java:152)
at edu.stanford.nlp.pipeline.AnnotatorImplementations.ner(AnnotatorImplementations.java:68)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$getNamedAnnotators$44(StanfordCoreNLP.java:546)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$14/501263526.apply(Unknown Source)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.lambda$null$69(StanfordCoreNLP.java:625)
at edu.stanford.nlp.pipeline.StanfordCoreNLP$$Lambda$36/277630005.get(Unknown Source)
at edu.stanford.nlp.util.Lazy$3.compute(Lazy.java:126)
at edu.stanford.nlp.util.Lazy.get(Lazy.java:31)
at edu.stanford.nlp.pipeline.AnnotatorPool.get(AnnotatorPool.java:149)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.construct(StanfordCoreNLP.java:495)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:201)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:194)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.<init>(StanfordCoreNLP.java:181)
at BasicPipelineExample.main(BasicPipelineExample.java:29)
Increase the virtual machine heap size to 3 or 4 GB for a full pipeline (pos, depparse, ner).
It should work after that.
https://wiki.eclipse.org/FAQ_How_do_I_increase_the_heap_size_available_to_Eclipse%3F
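As a rough illustration (the 4 GB figure is just the suggestion above, and the Eclipse menu path is from memory, so treat both as approximate): in Eclipse the flag goes under Run Configurations > Arguments > VM arguments, and on a plain command line it would look like:

java -Xmx4g -cp <your-classpath> BasicPipelineExample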
I have a question with regard to pdfbox 1.8.13. I am trying to read in the entire text from a one-page PDF document. Adobe Reader can do the job; pdfbox reads almost the entire page but scrambles the first two lines and the last two lines of the document so that letters are interchanged.
Does anybody know how to solve such an issue? First, where should I ask; second, how can I share the PDF with you; and third, could someone check whether the problem also exists in version 2.0.7 of pdfbox, which I understood is completely different and thus not straightforward to switch to?
Thank you in advance for your help.
Stephan
Adobe Reader:
ScalableCapitalHRB217778,AmtsgerichtMünchenSeite1von1
VermögensverwaltungGmbHUSt-IdNr.DE300434774
Prinzregentenstr.
48Geschäftsführung:80538München
ErikPodzuweit,FlorianPrucker
pdfbox:
SVecramlaöbgleenCsavpeitrawlaltung GmbH UHSRtB-I2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
8P0ri5n3zr8egMeünntcehnesntr. 48 GEreikscPhoädftzsufwüheritu,nFglo: rian Prucker
Link to the PDF (I have verified that the problem is the same with the unmodified and the modified PDF that I have uploaded):
https://wetransfer.com/downloads/5930649bce9a1d1a686a0da63f1b9bce20170808071518/9b9140
P.S.: In the meantime, I have also tried the PDDocument.loadNonSeq version in pdfbox 1.8.13, but this resulted in the same problem.
Thank you @tilman-hausherr for your helpful hints. With them, I managed to debug my problem.
You were right that leaving out the sorting option (I don't know why it was used before in the project that I now work on) resolved the scrambling issue even in pdfbox-1.8.13. And you were right that text extraction using pdfbox-2.0.7 gave even better results.
The relevant Java code snippets that I was using with pdfbox-1.8.13 were:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
textStripper.setSortByPosition(true);
String text = textStripper.getText(doc);
If I understand correctly, the API for simple text extraction is not identical going from pdfbox-1.8.13 to pdfbox-2.0.7, but it is very similar: PDFTextStripper has just been moved from util to text:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
...
PDDocument doc = PDDocument.load(file);
PDFTextStripper textStripper = new PDFTextStripper();
// textStripper.setSortByPosition(true);
String text = textStripper.getText(doc);
To find out about all of this, the command line tool was, as you said, very helpful. Here are the results of the text extraction with the different options (https://pdfbox.apache.org/1.8/commandline.html and https://pdfbox.apache.org/2.0/commandline.html):
java -jar pdfbox-app-1.8.13.jar ExtractText -sort "20170801 Rechnung.pdf":
SVecramlaöbgleenCsavpeitrawl HRBPrinzregentenstra.l4tu8ng GmbH GUSest-I
2d1N7r7.7D8E,3A0m0t4s3g4e7ri7c4ht München Seite 1 von 1
80538 München ErikcPhoädftzsufwüheritu,nFglo: rian Prucker
java -jar pdfbox-app-1.8.13.jar ExtractText "20170801 Rechnung.pdf":
Scalable CapitalVermögensverwaltung GmbHPrinzregentenstr. 4880538 München
HRB 217778, Amtsgericht MünchenUSt-IdNr. DE300434774Geschäftsführung:Erik
Podzuweit, Florian Prucker
Seite 1 von 1
java -jar pdfbox-app-2.0.7.jar ExtractText -sort "20170801 Rechnung.pdf":
Scalable Capital HRB 217778, Amtsgericht München Seite 1 von 1
Vermögensverwaltung GmbH USt-IdNr. DE300434774
Prinzregentenstr. 48 Geschäftsführung:
80538 München Erik Podzuweit, Florian Prucker
java -jar pdfbox-app-2.0.7.jar ExtractText "20170801 Rechnung.pdf"
Scalable Capital
Vermögensverwaltung GmbH
Prinzregentenstr. 48
80538 München
HRB 217778, Amtsgericht München
USt-IdNr. DE300434774
Geschäftsführung:
Erik Podzuweit, Florian Prucker
Seite 1 von 1
So I think pdfbox-2.0.7 gives the nicest results in this case, especially without the -sort option, even if I don't know why the algorithms behave differently, since pdfbox-1.8.13 gave the same result with or without the -nonSeq option.
I am working on a Hadoop map-reduce program where I am not setting the mapper and reducer, and not setting any other parameters in the job configuration from my program. I did so assuming that the job would send the same output as the input to the output file.
But what I found is that it is printing some dummy integer value in the output file, with every line separated by a tab (I guess).
Here is my code:
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MinimalMapReduce extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        String argg[] = {"/Users/***/Documents/hadoop/input/input.txt",
                "/Users/***/Documents/hadoop/output_MinimalMapReduce"};
        try {
            int exitCode = ToolRunner.run(new MinimalMapReduce(), argg);
            System.exit(exitCode);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
And here is the input:
2011 22
2011 25
2012 40
2013 35
2013 38
2014 44
2015 43
And here is the output:
0 2011 22
8 2011 25
16 2012 40
24 2013 35
32 2013 38
40 2014 44
48 2015 43
How can I get the same output as the input?
I did so assuming that the job would send the same output as the input to the output file
You were correct in assuming that. Technically, you are getting whatever you have in the file as the output. Remember that mappers and reducers take key-value pairs as input.
The input to a mapper is an input split of the file, and the input to a reducer is the output of the mapper(s).
But what I found is that it is printing some dummy integer value in the output file, with every line separated by a tab
These dummy integers are nothing but the byte offset of each line from the start of the file. Since each row you have consists of [4 DIGITS]<space>[2 DIGITS]<new-line>, your offsets are multiples of eight.
Why are you getting this offset when you haven't defined any mapper or reducer, you might ask? This is because a mapper always runs; it does the task of mapping each line to its offset, and is referred to as an identity mapper.
How can I get the same output as the input?
Well, you can define a mapper and just map the input lines to the output, stripping the offsets.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Some cool logic here
}
In the above code, key contains the dummy integer value, i.e. the offset, and value contains the contents of each line, one at a time.
You can write your own code to emit the value using the context.write function, use no reducer, and set job.setNumReduceTasks(0) to get the desired output, as sketched below.
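A minimal map-only sketch of that approach (the class name IdentityLineMapper and the configure helper are my own, not from the question; it assumes the new org.apache.hadoop.mapreduce API that the question already uses):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class IdentityLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset we want to drop; value is the original line.
        context.write(value, NullWritable.get());
    }

    // Hypothetical helper showing how this would be wired into the job from the question.
    public static void configure(Job job) {
        job.setMapperClass(IdentityLineMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(0); // map-only job, output goes straight to the output files
    }
}

With zero reducers the mapper output goes straight to the output files, and because the value is NullWritable, TextOutputFormat should write only the line itself, without the tab-separated offset.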
I agree with @philantrovert's answer, but here are more details I found.
According to Hadoop: The Definitive Guide, it is TextInputFormat that produces those offsets as keys. Here is the documentation about TextInputFormat:
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (e.g., newline or carriage return), and is packaged as a Text object. So, a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion. You have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.
However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this onto the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file’s name, it is unique within the filesystem. Of course, if all the lines are a fixed width, calculating the line number is simply a matter of dividing the offset by the width.
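To make that last remark concrete with the fixed-width rows from the question (each row "YYYY NN" plus a newline is 8 bytes), a throwaway sketch of the offset-to-line-number arithmetic:

public class OffsetArithmetic {
    public static void main(String[] args) {
        // Rows like "2011 22\n" are 8 bytes wide, so the TextInputFormat key
        // (a byte offset) converts to a line number by integer division.
        long lineWidth = 8;
        for (long offset : new long[] {0, 8, 16, 24, 32, 40, 48}) {
            System.out.println("offset " + offset + " -> line " + offset / lineWidth);
        }
    }
}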
I am a beginner in Python and I need help with manipulating CSV files in Python.
I am trying to implement a sliding window mechanism over the rows of a dataset.
For example, if the dataset is this:
timestamp | temperature | windspeed
965068200 9.61883 60.262
965069100 9.47203 60.1664
965070000 9.31125 60.0145
965070900 9.13649 59.8064
and if the user-specified window size is 3, the result should be something like:
timestamp | temperature-2 | temperature-1 |temperature-0 | windspeed-2 | windspeed-1 | windspeed-0
965070000 9.61883 9.47203 9.31125 60.262 60.1664 60.0145
965070900 9.47203 9.31125 9.13649 60.1664 60.0145 59.8064
I could do this in Java by using a List of object arrays, reading the CSV and generating a new CSV that contains the transformed dataset.
Here is the code
http://pastebin.com/cQnTBg8d #researh
I need to do this in Python; please help me solve it.
Thank you
This answer assumes you are using Python 3.x - for Python 2.x some changes are required (some obvious places are commented)
For the data format in the question, this could be a starting point in Python:
import collections

def slide(infile, outfile, window_size):
    queue = collections.deque(maxlen=window_size)
    # build the header row, e.g. "timestamp | temperature-2 | temperature-1 | temperature-0 | ..."
    line = infile.readline()
    headers = [s.strip() for s in line.split("|")]
    row = [headers[0]]
    for h in headers[1:]:
        for i in reversed(range(window_size)):
            row.append("%s-%i" % (h, i))
    outfile.write(" | ".join(row))
    outfile.write("\n")
    # slide the window over the data rows
    for line in infile:
        queue.append(line.split())
        if len(queue) == window_size:
            row = [queue[-1][0]]  # timestamp of the newest row in the window
            for j in range(1, len(headers)):
                for old in queue:
                    row.append(old[j])
            outfile.write("\t".join(row))
            outfile.write("\n")

ws = 3
with open("infile.csv", "r") as inf:
    with open("outfile.csv", "w") as outf:
        slide(inf, outf, ws)
Actually, this code is all about using a queue to keep just the input rows for the current window and nothing more; everything else is text-to-list-to-text.
With actual CSV data (see comment):
import csv
import collections

def slide(infile, outfile, window_size):
    r = csv.reader(infile)
    w = csv.writer(outfile)
    queue = collections.deque(maxlen=window_size)
    headers = next(r)  # r.next() on Python 2
    l = [headers[0]]
    for h in headers[1:]:
        for i in reversed(range(window_size)):
            l.append("%s-%i" % (h, i))
    w.writerow(l)
    hrange = range(1, len(headers))
    for row in r:
        queue.append(row)
        if len(queue) == window_size:
            l = [queue[-1][0]]
            for j in hrange:
                for old in queue:
                    l.append(old[j])
            w.writerow(l)

ws = 3
with open("infile.csv", "r", newline="") as inf:      # "rb" and no newline param on Python 2
    with open("outfile.csv", "w", newline="") as outf:  # "wb" and no newline param on Python 2
        slide(inf, outf, ws)
Why is it that the result from running the StringToWordVector filter in the Weka GUI is different from the equivalent Java code? I use the same attributes as in the GUI, but the tokenizer in Java doesn't seem to do a proper job! I was told by a Ph.D. student that this is common, and got no further answer from him.
Please help. My project is stalled.
Here is my code:
DataSource tempSource = new DataSource("/home/r_omio/Dataset.arff");
Instances temp = tempSource.getDataSet();
NumericToBinary nbTemp = new NumericToBinary();
nbTemp.setInputFormat(temp);
temp = Filter.useFilter(temp, nbTemp);
StringToWordVector stringFilterTemp = new StringToWordVector(2500);
stringFilterTemp.setOptions(
weka.core.Utils.splitOptions("-R 1,2,3,4 -W 2500 -prune-rate -1.0 -N 1 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?![]_\"")
);
stringFilterTemp.setInputFormat(temp);
temp = Filter.useFilter(temp, stringFilterTemp);
I suspect your delimiters are incorrectly escaped. Try using the default delimiters in the GUI and leaving the tokenizer out in Java, which will use the default, and see if you get the same value.
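For example, a hedged sketch of the Java side that leaves the tokenizer and delimiters at Weka's defaults (the class name is mine, and the setter names are from the StringToWordVector/NumericToBinary API as I recall it, so double-check them against your Weka version):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToBinary;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class StringToWordVectorDefaults {
    public static void main(String[] args) throws Exception {
        DataSource source = new DataSource("/home/r_omio/Dataset.arff");
        Instances data = source.getDataSet();

        NumericToBinary nb = new NumericToBinary();
        nb.setInputFormat(data);
        data = Filter.useFilter(data, nb);

        StringToWordVector s2wv = new StringToWordVector();
        s2wv.setAttributeIndices("1,2,3,4"); // same attribute range as in the GUI
        s2wv.setWordsToKeep(2500);
        // No explicit tokenizer or delimiters: the defaults match the GUI,
        // which avoids the escaping problem in the hand-built option string.
        s2wv.setInputFormat(data);
        data = Filter.useFilter(data, s2wv);

        System.out.println(data.numAttributes() + " attributes after filtering");
    }
}

If the results now match the GUI, the difference was almost certainly in how the delimiter string was escaped when passed through splitOptions.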