Running OPTICS clustering on ELKI using large geo-dataset - java

I am using OPTICSXi with the rstartree (R*-tree) index in ELKI to cluster a geo-dataset (latitude and longitude), Gowalla, which contains about 6 million records, but the MiniGUI always shows 'java heap space' and 'error: out of memory'.
I once saw an answer by Anony-Mousse in which 1.2 million location records were processed in 11 minutes using OPTICSXi on ELKI, so I'm confused. Why is ELKI reporting these errors?
Are there any parameters I need to modify on the Java platform or in ELKI?

This is a standard out of memory error.
You will have to add more memory, or decrease memory consumption somehow.
You could also try the cover tree (it should need much less memory than the current R*-tree implementation). Make sure to use an appropriate, and small, value of epsilon to benefit from indexing.
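The heap size is a JVM launch option, so it has to be given when starting the ELKI jar itself rather than inside the MiniGUI. A minimal sketch, assuming the bundled jar is called elki.jar (adjust the jar name and the heap value to your download and your machine):
java -Xmx8g -jar elki.jar
With about 6 million two-dimensional points the raw coordinates themselves are small; it is usually the index and the neighborhoods retrieved for a too-large epsilon that dominate memory, so a bigger heap plus a small epsilon is the first combination to try.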

Related

"main" java.lang.OutOfMemoryError: Java heap space Error in Stanford Custom Entity Recognition Model training

I'm trying to train a custom NER model to recognize 41 entities (the training set has around 6000 lines).
When I try to run the training command provided on the NLP site:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
this is the error I'm facing:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.AbstractCachingDiffFunction.ensure(AbstractCachingDiffFunction.java:136)
at edu.stanford.nlp.optimization.AbstractCachingDiffFunction.derivativeAt(AbstractCachingDiffFunction.java:151)
at edu.stanford.nlp.optimization.QNMinimizer.evaluateFunction(QNMinimizer.java:1150)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:898)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:856)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:850)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:93)
at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1935)
at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1742)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:785)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:756)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:3011)
I tried adding -Xmx4096m to my java command to specify the max heap space as 4 GB (that is the maximum available on my machine), but still no luck.
I also tried adding -Xms1024m to specify the minimum heap space, with no different result.
This same command worked flawlessly, without any heap space errors, when I used it to train a model for 20 entities (1500 lines).
Is this heap space related to RAM or to available disk space?
Should I try training on a machine with more RAM or more storage?
If you think you are running into a memory-availability issue, here are the guidelines from Stanford (refer back to them if you can); a command-line sketch of how to apply them follows below.
Ultimately, if you have tons of features and lots of classes, you need to have lots of memory to train a CRFClassifier. We frequently train models that require several gigabytes of RAM and are used to typing java -mx4g.
You can decrease the memory of the limited-memory quasi-Newton optimizer (L-BFGS). The optimizer maintains a number of past guesses which are used to approximate the Hessian. Having more guesses makes the estimate more accurate, and optimization is faster, but the memory used by the system during optimization is linear in the number of guesses. This is specified by the parameter qnSize. The default is 25. Using 10 is perfectly adequate. If you're short of memory, things will still work with much smaller values, even just a value of 2.
Decrease the order of the CRF. We usually use just first order CRFs (maxLeft=1 and no features that refer to the answer class more than one away - it's okay to refer to word features any distance away). While the code supports arbitrary order CRFs, building second, third, or fourth order CRFs will greatly increase memory usage and normally isn't necessary. Remember: maxLeft refers to the size of the class contexts that your features use (that is, it is one smaller than the clique size). A first order CRF can still look arbitrarily far to the left or right to get information about the observed data context.
Decrease the number of features generated. To see all the features generated, you can set the property printFeatures to true. CRFClassifier will then write (potentially huge) files in the current directory listing the features generated for each token position. Options that generate huge numbers of features include useWordPairs and useNGrams when maxNGramLeng is a large number.
Decrease the number of classes in your model. This may or may not be possible, depending on what your modeling requirements are. But time complexity is proportional to the number of classes raised to the clique size.
Use the flag useObservedSequencesOnly=true. This makes it so that you can only label adjacent words with label sequences that were seen next to each other in the training data. For some kinds of data this actually gives better accuracy, for other kinds it is worse. But unless the label sequence patterns are dense, it will reduce your memory usage.
Of course, shrinking the amount of training data will also reduce the memory needed, but isn't very desirable if you're trying to train the best classifier. You might consider throwing out sentences with no entities in them, though.
If you're concerned about runtime memory usage, some of the above items still apply (number of features and classes, useObservedSequencesOnly, and order of the CRF), but in addition, you can use the flag featureDiffThresh, for example featureDiffThresh=0.05. In training, CRFClassifier will train one model, drop all the features with weight (absolute value) beneath the given threshold, and then train a second model. Training thus takes longer, but the resulting model is smaller and faster at runtime, and usually has very similar performance for a reasonable threshold such as 0.05.

What do you do when you need more Java Heap Space?

Sorry if this has been asked before (though I can't really find a solution).
I'm not really too good at programming, but anyway, I am crawling a bunch of websites and storing information about them on a server. I need a Java program to process vector coordinates associated with each of the documents (about a billion or so documents, with a grand total of roughly 500,000 numbers, plus or minus, associated with each document). I need to calculate the singular value decomposition of that whole matrix.
Now Java, obviously, can't handle a matrix that big, to my knowledge. If I try making a relatively small array (about 44 million elements) then I get a heap error. I use Eclipse, and I tried changing the -xmx value to 1024m (it won't go any higher for some reason, even though I have a computer with 8 GB of RAM).
What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?
EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting the Xmx and Xms correctly (from run configurations in Eclipse; this is equivalent to running java -XmsXXXX -XmxXXXX ...... at the command prompt).
The Java heap space can be set with the -Xmx (note the initial capital X) option, and it can certainly go far beyond 1 GB, provided you are using a 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:
java -Xmx6144m ...
That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...
Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.
You should re-design your application to make use of any sparseness in the matrix, and possibly to make use of disk-based storage when possible. Having everything in memory is nice, but not always possible...
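A hypothetical back-of-the-envelope check of the 4 GB figure above, counting only primitive ints and ignoring every object and array header:
    public class MemoryEstimate {
        public static void main(String[] args) {
            long documents = 1_000_000_000L;  // one billion documents
            long bytesPerInt = 4;             // one primitive int per document, no overhead
            double gib = documents * bytesPerInt / (1024.0 * 1024 * 1024);
            System.out.printf("%.1f GiB just for one int per document%n", gib);  // ~3.7 GiB
        }
    }
Any object wrapper, String, or bookkeeping structure multiplies that figure several times over, which is why exploiting sparseness or moving to disk matters more than the exact -Xmx setting.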
Are you using a 32-bit JVM? These cannot have more than about 2 GB of heap; I never managed to allocate more than 1.5 GB. Instead, use a 64-bit JVM, as these can allocate much more heap.
Or you could apply some math to it and use a divide-and-conquer strategy. That is, split the problem into smaller problems that combine to the same result.
Don't know much about SVD but maybe this page can be helpful:
http://www.netlib.org/lapack/lug/node32.html
-Xms and -Xmx are different. The one containing s is the starting heap size and the one with x is the maximum heap size.
so
java -Xms512m -Xmx1024m
would give you 512 MB to start with and allow the heap to grow to 1 GB.
As other people have said, though, you may need to break your problem down to get this to work. Are you using 32- or 64-bit Java?
For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
Just for right now, pretend there are a billion entries with 3 words associated with each.
If you have one billion entries, you need one billion times the size of each entry. If you mean 3 ints per entry, that's at least 12 GB just for the data. If you meant the words as Strings, you would enumerate the words, as there are only about 100K distinct words in English, and it would take the same amount of space.
Given that 16 GB costs a few hundred dollars, I would suggest buying more memory.
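A small sketch of the enumeration idea above (the class and method names are made up for illustration): assign each distinct word a small integer id once, then store each entry as an int[] rather than as String references.
    import java.util.HashMap;
    import java.util.Map;

    public class WordDictionary {
        // word -> small integer id; with roughly 100K English words this map stays tiny
        private final Map<String, Integer> ids = new HashMap<String, Integer>();

        public int idOf(String word) {
            Integer id = ids.get(word);
            if (id == null) {
                id = ids.size();   // next unused id
                ids.put(word, id);
            }
            return id;
        }

        // an entry of 3 words becomes 3 ints (12 bytes of data) instead of 3 String references
        public int[] encode(String[] words) {
            int[] encoded = new int[words.length];
            for (int i = 0; i < words.length; i++) {
                encoded[i] = idOf(words[i]);
            }
            return encoded;
        }
    }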

solr spatial bad performance

I'm using SOLR-3.4, spatial filtering with the schema having LatLonType (subType=tdouble). I have an index of about 20M places. My basic problem is that if I do bbox filter with cache=true, the performance is reasonably good (~40-50 QPS, about 100-150ms latency), but a big downside is crazy fast old gen heap growth ultimately leading to major collections every 30-40 minutes (on a very large heap, 25GB). And at that point performance is beyond unacceptable. On the other hand I can turn off caching for bbox filters, but then my latency and QPS drops (the latency goes down from 100ms => 500ms). The NumericRangeQuery javadoc talks about the great performance you can get (sub 100 ms) but now I wonder if that was with filterCache enabled, and nobody bothered to look at the heap growth that results. I feel like this is sort of a catch-22 since neither configuration is really acceptable.
I'm open to any ideas. My last idea (untried) is to use geo hash (and pray that it either performs better with cache=false, or has more manageable heap growth if cache=true).
EDIT:
Precision step: default (8 for double I think)
System memory: 32GB (EC2 M2 2XL)
JVM: 24GB
Index size: 11 GB
EDIT2:
A tdouble with a precisionStep of 8 means that your doubles will be split into sequences of 8 bits. If all your latitudes and longitudes only differ in the last sequence of 8 bits, then tdouble would have the same performance as a normal double on a range query. This is why I suggested testing a precisionStep of 4.
Question: what does this actually mean for a double value?
Having a profile of Solr while it responds to your spatial queries would be of great help in understanding what is slow; see hprof, for example.
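For example, a sampling profile can be captured with the hprof agent that shipped with JDKs of that era; the exact options and the way Solr is launched (a Jetty start.jar is assumed here purely for illustration) will differ for your setup:
java -agentlib:hprof=cpu=samples,interval=10,depth=8,file=solr-cpu-profile.txt -jar start.jar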
Still, here are a few ideas on how you could (perhaps) improve latency.
First you could try testing what happens when decreasing the precisionStep (try 4, for example). If the latitudes and longitudes are too close to each other and the precisionStep is too high, Lucene cannot take advantage of having several indexed values.
You could also try to give a little bit less memory to the JVM in order to give the OS cache more chances to cache frequently accessed index files.
Then, if it is still not fast enough, you could try to replace TrieDoubleField as the sub field with a field type that would use a frange query for the getRangeQuery method. This would reduce the number of disk accesses while computing the range, at the cost of higher memory usage. (I have never tested it; it might give horrible performance as well.)

Scaling application that reads large XML files

I have an application which reads a large set of XML files (around 20-30 of them) periodically, say once every 10 minutes. Each XML file is at least 40-100 MB in size. Once each XML file has been read, a map is created from its contents, and the map is passed along a chain of 10-15 processors, each of which uses the data, performs some filtering, writes to a database, etc.
Now the application is running in a 32-bit JVM, and there is no intention of moving to a 64-bit JVM right now. The memory footprint is, as expected, very high, nearing the limit of a 32-bit JVM. For now, when we receive large files, we serialize the generated maps to disk and run at most 3-4 maps through the processor chain concurrently; trying to process all the maps at the same time quickly leads to OutOfMemoryError. Garbage collection activity is also quite high.
I have some ideas but wanted to see if there are some options which people have already tried/evaluated. So what are the options here for scaling this kind of application?
Yeah, to parrot @aaray and @MeBigFatGuy, you want to use an event-based parser for this: the dom4j approach mentioned, or SAX, or StAX.
As a simple example, that 100 MB XML consumes a minimum of 200 MB of RAM if you load it wholesale, as each character is immediately expanded to a 16-bit character.
Next, any tags or elements that you're not using are going to consume extra memory (plus all of the other baggage and bookkeeping of the nodes), and it's all wasted. If you're dealing with numbers, converting the raw string to a long will be a net win if the number has more than 2 digits.
IF (and this is a BIG IF) you are using a lot of a reasonably small set of Strings, you can save some memory by String.intern()'ing them. This is a canonicalization process that makes sure that if the string already exists in the JVM, it is shared. The downside is that it pollutes your permgen (once interned, always interned). PermGen is pretty finite, but on the other hand it's pretty much immune to GC.
Have you considered being able to run the XML through an external XSLT to remove all of the cruft that you don't want to process before it even enters your JVM? There are several standalone, command line XSL processors that you can use to pre-process the files to something perhaps more sane. It really depends on how much of the data that is coming in you're actually using.
By using an event based XML processing model, the XSLT step is pretty much redundant. But the event based models are all basically awful to use, so perhaps using the XSLT step would let you re-use some of your existing DOM logic (assuming that's what you're doing).
The flatter your internal structures, the cheaper they are in terms of memory. You actually have a small advantage running a 32-bit VM, since instance pointers are half the size. But still, when you're talking about thousands or millions of nodes, it all adds up, and quickly.
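For reference, a minimal StAX sketch of the event-based approach described above; the element name "record" and the command-line file argument are placeholders for whatever your 40-100 MB documents actually contain:
    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StreamingXmlReader {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            FileInputStream in = new FileInputStream(args[0]);
            try {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    // only element-start events are inspected; the document is never held in memory
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "record".equals(reader.getLocalName())) {
                        // pull out just the fields the processor chain needs, then discard
                    }
                }
                reader.close();
            } finally {
                in.close();
            }
        }
    }
Because nothing is materialized beyond the fields you keep, the memory footprint tracks what each processor actually needs rather than the size of the file.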
We had a similar problem processing large XML files (around 400 MB). We greatly reduced the memory footprint of the application using this:
http://dom4j.sourceforge.net/dom4j-1.6.1/faq.html#large-doc
You can insert the contents of each XML file into a temporary DB table and each chain link would fetch the data it needs. You will probably lose performance, but gain scalability.

determining java memory usage

Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (the Java Specialists' Newsletter and something from JavaWorld) and the built-in java.lang.instrument.Instrumentation.getObjectSize() method, which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
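A minimal sketch of that approach (the class and jar names here are made up): the agent jar's manifest needs a Premain-Class entry pointing at this class, and the application is started with -javaagent:objectsizer.jar.
    import java.lang.instrument.Instrumentation;

    public class ObjectSizer {
        private static volatile Instrumentation instrumentation;

        // invoked by the JVM before main() when the agent is on the command line
        public static void premain(String agentArgs, Instrumentation inst) {
            instrumentation = inst;
        }

        public static long sizeOf(Object o) {
            if (instrumentation == null) {
                throw new IllegalStateException("run with -javaagent:objectsizer.jar");
            }
            // an implementation-specific approximation of the shallow size of this one object
            return instrumentation.getObjectSize(o);
        }
    }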
As a rough guide, in Hotspot on 32-bit machines:
objects use 8 bytes for "housekeeping"
fields use what you'd expect them to use given their bit length (though booleans tend to be allocated an entire byte)
object references use 4 bytes
overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus -- if you're counting this -- the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by 1.5 whenever the list gets full. But either asking the VM or using the above metrics, looking at the source code to the collections and "working it through" will essentially get you to the answer.
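One practical consequence of the growth behaviour just described: if you know the element count up front, pre-sizing the ArrayList avoids both the intermediate copies and the unused tail of the final backing array (the count below is purely illustrative):
    import java.util.ArrayList;
    import java.util.List;

    public class PreSizedList {
        public static void main(String[] args) {
            int expected = 250_000;                                    // known element count (illustrative)
            List<Integer> values = new ArrayList<Integer>(expected);   // backing array allocated once
            for (int i = 0; i < expected; i++) {
                values.add(i);                                         // no resizing along the way
            }
            System.out.println(values.size());
        }
    }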
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned, though: this can require double the amount of memory for that object.
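A sketch of that serialization trick, with the caveat above in mind: the serialized length is only a loose proxy for heap usage, and the whole object graph is copied into the buffer while measuring.
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class SerializedSizeEstimate {
        public static int approxSize(Serializable obj) throws Exception {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(buffer);
            out.writeObject(obj);      // copies the whole object graph into the buffer
            out.close();
            return buffer.size();      // bytes in the serialized form, not bytes on the heap
        }

        public static void main(String[] args) throws Exception {
            System.out.println(approxSize("hello world"));
        }
    }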
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers: Other Question about Memory
I believe the profiler included in NetBeans can monitor memory usage as well; you can try that.
