Memory requirements for Stanford NER retraining

Memory requirements for Stanford NER retraining - java

I am retraining the Stanford NER model on my own training data for extracting organizations. But, whether I use a 4GB RAM machine or an 8GB RAM machine, I get the same Java heap space error.
Could anyone tell what is the general configuration of machines on which we can retrain the models without getting these memory issues?
I used the following command :
java -mx4g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop newdata_retrain.prop
I am working with training data (multiple files - each file has about 15000 lines in the following format) - one word and its category on each line
She O
is O
working O
at O
Microsoft ORGANIZATION
Is there anything else we could do to make these models run reliably ? I did try with reducing the number of classes in my training data. But that is impacting the accuracy of extraction. For example, some locations or other entities are getting classified as organization names. Can we reduce specific number of classes without impact on accuracy ?
One data I am using is the Alan Ritter twitter nlp data : https://github.com/aritter/twitter_nlp/tree/master/data/annotated/ner.txt
The properties file looks like this:
#location of the training file
trainFile = ner.txt
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model-twitter.ser.gz
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
The error I am getting : the stacktrace is this :
CRFClassifier invoked on Mon Dec 01 02:55:22 UTC 2014 with arguments:
-prop twitter_retrain.prop
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk=true
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=ner-model-twitter.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=ner.txt
map=word=0,answer=1
useWord=true
useTypeSeqs=true
[1000][2000]numFeatures = 215032
setting nodeFeatureIndicesMap, size=149877
setting edgeFeatureIndicesMap, size=65155
Time to convert docs to feature indices: 4.4 seconds
numClasses: 21 [0=O,1=B-facility,2=I-facility,3=B-other,4=I-other,5=B-company,6=B-person,7=B-tvshow,8=B-product,9=B-sportsteam,10=I-person,11=B-geo-loc,12=B-movie,13=I-movie,14=I-tvshow,15=I-company,16=B-musicartist,17=I-musicartist,18=I-geo-loc,19=I-product,20=I-sportsteam]
numDocuments: 2394
numDatums: 46469
numFeatures: 215032
Time to convert docs to data/labels: 2.5 seconds
Writing feature index to temporary file.
numWeights: 31880772
QNMinimizer called on double function of 31880772 variables, using M = 25.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:923)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:885)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:879)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:91)
at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1911)
at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1718)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:759)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:747)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2937)

One way you can try reducing number of classes is to not use B-I notation. For example, club B-facility and I-facility into facility. Of course, another way it to use a bigger memory machine.

Shouldn't that be -Xmx4g not -mx4g?

Sorry for getting to this a bit late! I suspect the problem is the input format of the file; in particular, my first guess is that the file is being treated as a single long sentence.
The expected format of the training file is in the CoNLL format, which means each line of the file is a new token, and the end of a sentence is denoted by a double newline. So, for example, a file could look like:
Cats O
have O
tails O
. O
Felix ANIMAL
is O
a O
cat O
. O
Could you let me know if it's indeed in this format? If so, could you include a stack trace of the error, and the properties file you are using? Does it work if you run on just the first few sentences of the file?
--Gabor

If you are going to do analysis on non-transactional data sets you may want to use another tool like Elasticsearch (simpler) or Hadoop (exponentially more complicated). MongoDB is a good middleground as well.

First uninstall the existing java jdk and reinstall again.
Then you can use the heap size as much as you can based on your hard disk size.
In the term "-mx4g" 4g is not the RAM it is the heap size.
Even I Faced the same error initially. after doing this it is gone.
Even I misunderstood 4g as RAM initially.
Now I am able to start my server even with 100g of heap size.
Next,
Instead of using Customised NER model, I suggest you to use Custom RegexNER Model with which you can add millions of words of same entity name within in a single document too.
These 2 errors I faced initially.
For any queries comment below.

Related

Recursion in parser combinator running out of stack space

I've been creating a simple parser combinator library in Java and for a first attempt I'm using programatic strcutures to define both the tokens and the parser grammar, see below:
final Combinator c2 = new CombinatorBuilder()
/*
.addParser("SEXPRESSION", of(Option.of(new Terminal("LPAREN"), zeroOrMore(new ParserPlaceholder("EXPRESSION")), new Terminal("RPAREN"))))
.addParser("QEXPRESSION", of(Option.of(new Terminal("LBRACE"), zeroOrMore(new ParserPlaceholder("EXPRESSION")), new Terminal("RBRACE"))))
*/
.addParser("SEXPRESSION", of(Option.of(new Terminal("LPAREN"), new ParserPlaceholder("EXPRESSION"), new Terminal("RPAREN"))))
.addParser("QEXPRESSION", of(Option.of(new Terminal("LBRACE"), new ParserPlaceholder("EXPRESSION"), new Terminal("RBRACE"))))
.addParser("EXPRESSION", of(
Option.of(new Terminal("NUMBER")),
Option.of(new Terminal("SYMBOL")),
Option.of(new Terminal("STRING")),
Option.of(new Terminal("COMMENT")),
Option.of(new ParserPlaceholder("SEXPRESSION")),
Option.of(new ParserPlaceholder("QEXPRESSION"))
)).build()
If I take the first Parser "SEXPRESSION" defined using the builer I can explain the structure:
Parameters to addParser:
Name of parser
an ImmutableList of disjunctive Options
Parameters to Option.of:
An array of Elements where each element is either a Terminal, or a ParserPlaceholder which is later substituted for the actual Parser where the names match.
The idea is to be able to reference one Parser from another and thus have more complex grammars expressed.
The problem I'm having is that using the grammar above to parse a string value such as "(+ 1 2)" gets stuck in an infinite recursive call when parsing the RPAREN ')' as the "SEXPRESSIONS" and "EXPRESSION" Parsers have "one or many" cardinaltiy.
I'm sure I could get creative and come up with some way of limiting the depth of the recursive calls, perhaps by ensuring that when the "SEXPRESSION" parser hands off to the "EXPRESSION" parser which then hands off to the "SEXPRESSION" parser, and no token are taken, then drop out? But I don't want a hacky solution if a standard solution exists.
Any ideas?
Thanks

Not to dodge the question, but I don't think there's anything wrong with calling an application using VM arguments to increase stack size.
This can be done in Java by adding the flag -XssNm where N is the amount of memory the application is called with.
The default Java stack size is 512 KB which, frankly, is hardly any memory at all. Minor optimizations aside, I felt that it was difficult, if not impossible to work with that little memory to implement complex recursive solutions, especially because Java isn't the least bit efficient when it comes to recursion.
So, some examples of this flag, as as follows:
-Xss4M 4 MB
-Xss2G 2 GB
It also goes right after you call java to launch the application, or if you are using an IDE like Eclipse, you can go in and manually set the command line arguments in run configurations.
Hope this helps!

How can I traverse SNMP/MIB tree with names and not the numbers (snmp4j)

I am tying to understand snmp; I was able to run the example at http://www.shivasoft.in/blog/java/snmp/create-snmp-client-in-java-using-snmp4j/ to traverse the SNMP tree and get the following results
1.3.6.1.2.1.1.1.0 : OCTET STRING : Hardware: Intel64 Family bla bla bla
1.3.6.1.2.1.1.2.0 : OBJECT IDENTIFIER : 1.3.6.1.4.1.311.1.1.3.1.1
1.3.6.1.2.1.1.3.0 : TimeTicks : 2 days, 0:56:16.59
1.3.6.1.2.1.1.4.0 : OCTET STRING :
1.3.6.1.2.1.1.5.0 : OCTET STRING : KASHILI-LTP.internal.harmonia.com
1.3.6.1.2.1.1.6.0 : OCTET STRING :
1.3.6.1.2.1.1.7.0 : Integer32 : 76
The numbers in format 1.3.6.1.x.x are going to be the keys into hashmap; I want to make these numbers more user friendly. How would I do that? May be I can pass MIB file to my java code(?). I want the output where all the numbers 1.3.6.x.x are replaced with actual names so that I know what they are (in Linux shell, I can get that effect by passing -m switch to snmpwalk)

The numeric OIDs always have a translation into names, defined in a MIB file, as you've already realized. The snmpwalk command (from net-snmp) is able to load MIB files and display the human-readable variable names. If you walk some more equipment, however, you will soon discover that many machines use MIBs that even net-snmp doesn't know about. You can download those MIB files and load them into net-snmp (see net-snmp documentation).
Sadly, snmp4j does not support MIB loading in its free version. There seems to be a commercially available library from the same vendor, called SMIPro, which seems like it could do what you need. I haven't tried it, though. They seem to have a trial license available if you want.

In dalvik, what expression will generate instructions 'not-int' and 'const-string/jumbo'?

I am new on learning dalvik, and I want to dump out every instruction in dalvik.
But there are still 3 instructions I can not get no matter how I write the code.
They are 'not-int', 'not-long', 'const-string/jumbo'.
I written like this to get 'not-int' but failed:
int y = ~x;
Dalvik generated an 'xor x, -1' instead.
and I know 'const-string/jumbo' means that there is more than 65535 strings in the code and the index is 32bit. But when I decleared 70000 strings in the code, the compiler said the code was too long.
So the question is: how to get 'not-int' and 'const-string/jumbo' in dalvik by java code?

const-string/jumbo is easy. As you noted, you just need to define more than 65535 strings, and reference one of the later ones. They don't all need to be in a single class file, just in the same DEX file.
Take a look at dalvik/tests/056-const-string-jumbo, in particular the "build" script that generates a Java source file with a large number of strings.
As far as not-int and not-long go, I don't think they're ever generated. I ran dexdump -d across a pile of Android 4.4 APKs and didn't find a single instance of either.

any quick sorting for a huge csv file

I am looking some java implementation of sorting algorithm. The file could be HUGE, say 20000*600=12,000,000 lines of records. The line is comma delimited with 37 fields and we use 5 fields as keys. Is it possible to sort it quickly, say 30 minutes?
If you got other approach other than java, it is welcome if it can be easily integrated into java system. For example, unix utility.
Thanks.
Edit: The lines need to be sort is dispersed into 600 files, with 20000 lines each, 4mb for each file. Finally I would like them to be 1 big sorted file.
I am trying to time a unix sort, would update that afterwards.
Edit:
I appended all the files into a big one, and tried the unix sort function, it is pretty good. The time to sort a 2gb file is 12-13 minutes. The append action require 4 minutes for 600 files.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt

How does the data get in the CSV format? Does it come from a relational database? You can make it such that whatever process creates the file writes its entries in the right order so you don't have to solve this problem down the line.
If you are doing a simple lexicographic order you can try the unix sort, but I am not sure how that will perform on a file with that size.

Calling unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with java's Runtime, but the outputs of the process are redirected, so you have to some juggling to get the redirect to work right:
public static void sortInUnix(File fileIn, File sortedFile)
throws IOException, InterruptedException {
String[] cmd = {
"cmd", "/c",
// above should be changed to "sh", "-c" if on Unix system
"sort " + fileIn.getAbsolutePath() + " > "
+ sortedFile.getAbsolutePath() };
Process sortProcess = Runtime.getRuntime().exec(cmd);
// capture error messages (if any)
BufferedReader reader = new BufferedReader(new InputStreamReader(
sortProcess.getErrorStream()));
String outputS = reader.readLine();
while (outputS != null) {
System.err.println(outputS);
outputS = reader.readLine();
}
sortProcess.waitFor();
}

Use the java library big-sorter which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting to intermediate files, sorting and merging the intermediate files repeatedly until there is only one left. Note also that the max group size for a merge is configurable (useful for when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
CSVFormat.DEFAULT
.withFirstRecordAsHeader()
.withRecordSeparator("\n"),
StandardCharsets.UTF_8);
Comparator<CSVRecord> comparator = (x, y) -> {
int a = Integer.parseInt(x.get("number"));
int b = Integer.parseInt(y.get("number"));
return Integer.compare(a, b);
};
Sorter
.serializer(serializer)
.comparator(comparator)
.input(inputFile)
.output(outputFile)
.sort();
The result is:
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7GB file on the 11th column using big-sorter and it took 8 mins to do single-threaded on an i7 with SSD and max heap set at 512m (-Xmx512m).
See the project README for more details.

Java Lists can be sorted, you can try starting there.

Python on a big server.
import csv
def sort_key( aRow ):
return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
rdr= csv.DictReader( source )
data = [ row for row in rdr ]
data.sort( key=sort_key )
fields= rdr.fieldnames
with open('some_file_sorted.csv', 'wb') as target:
wtr= csv.DictWriter( target, fields }
wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
Decorate:
import csv
def sort_key( aRow ):
return aRow['this'], aRow['that'], aRow['the other']
with open('some_file.csv','rb') as source:
rdr= csv.DictReader( source )
with open('temp.txt','w') as target:
for row in rdr:
target.write( "|".join( map(str,sort_key(row)) ) + "|" + row )
Part 2 is the operating system sort using "|" as the field separator
Undecorate:
with open('sorted_temp.txt','r') as source:
with open('sorted.csv','w') as target:
for row in rdr:
keys, _, data = row.rpartition('|')
target.write( data )

You don't mention platform, so it is hard to come to terms with the time specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. Let's say 37 fields, say 100bytes/field would be 45GB? That's a bit much for most machines, but if the records average 10bytes/field your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: Break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect of causing any slow-down.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk
for each in chunk*
do
sort $each > $each.sorted
done
sort -m chunk*.sorted > original_file.sorted

As your data set is huge as you have mentioned. Sorting it all at one go will be time consuming depending on your machine (If you try QuickSort).
But since you would like it to be done within 30 mins. I would suggest that you have a look at Map Reduce using
Apache Hadoop as your application server.
Please keep in mind it's not an easy approach, but in the longer run you can easily scale up depending upon your data size.
I am also pointing you to an excellent link on Hadoop setup
Work your way through single node setup and move to Hadoop cluster.
I would be glad to help you if you get stuck anywhere.

You really do need to make sure you have the right tools for the job. ( Today, I am hoping to get a 3.8 GHz PC with 24 GB memory for home use. It been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data because its in 600 files already.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once) Its not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)

Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
waitlist=
for f in ${SOURCE}/*
do
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.

Use Map/Reduce Hadoop to do the sorting.. i recommend Spring Data Hadoop. Java.

Well since you're talking about HUGE datasets this means you'll need some external sorting algorithm anyhow. There are some for java and pretty much any other language out there - since the result will have to be stored on the disk anyhow which language you're using is pretty uninteresting.

How to convert a sample dataset from the R package "spatstat" into a shapefile

I have written a kernel density estimator in Java that takes input in the form of ESRI shapefiles and outputs a GeoTIFF image of the estimated surface. To test this module I need an example shapefile, and for whatever reason I have been told to retrieve one from the sample data included in R. Problem is that none of the sample data is a shapefile...
So I'm trying to use the shapefiles package's funciton convert.to.shapefile(4) to convert the bei dataset included in the spatstat package in R to a shapefile. Unfortunately this is proving to be harder than I thought. Does anyone have any experience in doing this? If you'd be so kind as to lend me a hand here I'd greatly appreciate it.
Thanks,
Ryan
References:
spatstat,
shapefiles

There are converter functions for Spatial objects in the spatstat and maptools packages that can be used for this. A shapefile consists of at least points (or lines or polygons) and attributes for each object.
library(spatstat)
library(sp)
library(maptools)
data(bei)
Coerce bei to a Spatial object, here just points without attributes since there are no "marks" on the ppp object.
spPoints <- as(bei, "SpatialPoints")
A shapefile requires at least one column of attribute data, so create a dummy.
dummyData <- data.frame(dummy = rep(0, npoints(bei)))
Using the SpatialPoints object and the dummy data, generate a SpatialPointsDataFrame.
spDF <- SpatialPointsDataFrame(spPoints, dummyData)
At this point you should definitely consider what the coordinate system used by bei is and whether you can represent that with a WKT CRS (well-known text coordinate reference system). You can assign that to the Spatial object as another argument to SpatialPointsDataFrame, or after create with proj4string(spDF) <- CRS("+proj=etc...") (but this is an entire problem all on its own that we could write pages on).
Load the rgdal package (this is the most general option as it supports many formats and uses the GDAL library, but may not be available because of system dependencies.
library(rgdal)
(Use writePolyShape in the maptools package if rgdal is not available).
The syntax is the object, then the "data source name" (here the current directory, this can be a full path to a .shp, or a folder), then the layer (for shapefiles the file name without the extension), and then the name of the output driver.
writeOGR(obj = spDF, dsn = ".", layer = "bei", driver = "ESRI Shapefile")
Note that the write would fail if the "bei.shp" already existed and so would have to be deleted first unlink("bei.shp").
List any files that start with "bei":
list.files(pattern = "^bei")
[1] "bei.dbf" "bei.shp" "bei.shx"
Note that there is no general "as.Spatial" converter for ppp objects, since decisions must be made as to whether this is a point patter with marks and so on - it might be interesting to try writing one, that reports on whether dummy data was required and so on.
See the following vignettes for further information and details on the differences between these data representations:
library(sp); vignette("sp")
library(spatstat); vignette("spatstat")

A general solution is:
convert the "ppp" or "owin" classed objects to appropriate classed objects from the sp package
use the writeOGR() function from package rgdal to write the Shapefile out
For example, if we consider the hamster data set from spatstat:
require(spatstat)
require(maptools)
require(sp)
require(rgdal)
data(hamster)
first convert this object to a SpatialPointsDataFrame object:
ham.sp <- as.SpatialPointsDataFrame.ppp(hamster)
This gives us a sp object to work from:
> str(ham.sp, max = 2)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
..# data :'data.frame': 303 obs. of 1 variable:
..# coords.nrs : num(0)
..# coords : num [1:303, 1:2] 6 10.8 25.8 26.8 32.5 ...
.. ..- attr(*, "dimnames")=List of 2
..# bbox : num [1:2, 1:2] 0 0 250 250
.. ..- attr(*, "dimnames")=List of 2
..# proj4string:Formal class 'CRS' [package "sp"] with 1 slots
This object has a single variable in the #data slot:
> head(ham.sp#data)
marks
1 dividing
2 dividing
3 dividing
4 dividing
5 dividing
6 dividing
So say we now want to write out this variable as an ESRI Shapefile, we use writeOGR()
writeOGR(ham.sp, "hamster", "marks", driver = "ESRI Shapefile")
This will create several marks.xxx files in directory hamster created in the current working directory. That set of files is the ShapeFile.
One of the reasons why I didn't do the above with the bei data set is that it doesn't contain any data and thus we can't coerce it to a SpatialPointsDataFrame object. There are data we could use, in bei.extra (loaded at same time as bei), but these extra data or on a regular grid. So we'd have to
convert bei.extra to a SpatialGridDataFrame object (say bei.spg)
convert bei to a SpatialPoints object (say bei.sp)
overlay() the bei.sp points on to the bei.spg grid, yielding values from the grid for each of the points in bei
that should give us a SpatialPointsDataFrame that can be written out using writeOGR() as above
As you see, that is a bit more involved just to give you a Shapefile. Will the hamster data example I show suffice? If not, I can hunt out my Bivand et al tomorrow and run through the steps for bei.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Memory requirements for Stanford NER retraining - java

One way you can try reducing number of classes is to not use B-I notation. For example, club B-facility and I-facility into facility. Of course, another way it to use a bigger memory machine.

Shouldn't that be -Xmx4g not -mx4g?

If you are going to do analysis on non-transactional data sets you may want to use another tool like Elasticsearch (simpler) or Hadoop (exponentially more complicated). MongoDB is a good middleground as well.

Related

Recursion in parser combinator running out of stack space

How can I traverse SNMP/MIB tree with names and not the numbers (snmp4j)

In dalvik, what expression will generate instructions 'not-int' and 'const-string/jumbo'?

any quick sorting for a huge csv file

How to convert a sample dataset from the R package "spatstat" into a shapefile

Categories

Resources