Out of memory while indexing with Lucene - java

I'm using Lucene 4.9.0 to index 23k files, but now I'm receiving a java.lang.OutOfMemoryError: Java heap space message.
I don't want to increase the heap size because the number of files tends to increase every day.
How can I index all files without the OOM problem and increase "heap space"?

Your question is too vague and makes little sense.
First of all, 23K files can be 1 byte each or 1 GB each. How are we supposed to know what's inside and how heavyweight they are?
Secondly, you say
I don't want to increase "heap size" because <...>
and straight after you say
How can I index all files without the OOM problem and increase "heap space"
Can you make up your mind on whether you can increase heap space or not?
There's a certain amount of memory required to index the data, and there's not much you can do about it. That said, the most memory is required during the merging process, and you can play with the merge policy settings to see if that helps you.
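For illustration, here is a rough sketch of what tuning those knobs might look like with the Lucene 4.x API (the buffer and segment sizes below are arbitrary starting points, not recommendations):

```java
// Sketch only - class and method names are from the Lucene 4.x API.
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_9,
        new StandardAnalyzer(Version.LUCENE_4_9));

// Flush to disk once the in-memory buffer reaches 48 MB.
config.setRAMBufferSizeMB(48.0);

// Cap merged segment size so merges need less memory at once.
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setMaxMergedSegmentMB(512.0);
config.setMergePolicy(mergePolicy);

IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/path/to/index")), config);
```

Whether this helps at all depends entirely on the size and content of your files.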

Related

How To Append the Large Text Files to JTextArea in java Swing

I have implemented a Java Swing application, and in it I wrote an "Open File" feature. I have tried a lot of ways to read the file and write it into the JTextArea (I have tried the append(), setText() and read() methods). But it only works up to 100 MB. If I want to open a file over 100 MB, it raises an "OutOfMemoryError: Java heap space" at textarea.append(). Is there any way to append over 100 MB of data to a JTextArea, or any way to increase the memory capacity of the JTextArea? Please give suggestions for the above issue. Thank you.
Possibly a duplicate of Java using up far more memory than allocated with -Xmx, as your problem is really that your Java instance is running out of memory.
Java can open (theoretically) files of any size, as long as you have the memory for it to be read.
I would, however, recommend that you only read parts of the file into memory at a time. When you've finished with one part, move on to the next chunk of text.
Anyhow, for this instance, and if this is not a regular problem, you could use -Xmx800m, which would let Java use 800 MB of heap space.
If this is not a one-time thing, you really should look into reading only parts of the file at a time. http://www.baeldung.com/java-read-lines-large-file should point you in the right direction.
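As a minimal, self-contained sketch of that idea (the chunk size and file contents here are arbitrary), you can read a file in fixed-size chunks and process each chunk before reading the next, so the whole file is never in memory at once:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedReader {

    // Reads the file one buffer-sized chunk at a time and returns the total
    // number of characters, without ever holding the whole file in memory.
    static long countChars(Path file, int chunkSize) throws IOException {
        long total = 0;
        char[] chunk = new char[chunkSize];
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            int read;
            while ((read = reader.read(chunk, 0, chunk.length)) != -1) {
                total += read; // process the chunk here instead of appending it all
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("chunked", ".txt");
        Files.write(tmp, "hello world".getBytes());
        System.out.println(countChars(tmp, 4)); // prints 11
        Files.delete(tmp);
    }
}
```

For the JTextArea case specifically, you would process each chunk by displaying only the part the user is currently looking at, rather than appending everything.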

Java Heap Size Reduction

BACKGROUND
I recently wrote a Java application that consumes a specified amount of memory in MB. I am doing this purposefully to see how another Java application reacts to specific RAM loads (I am sure there are tools for this purpose, but this was the fastest). The memory consumer app is very simple: I enter the number of MB I want to consume and create a vector of that many bytes. I also have a reset button that removes the elements of the vector and prompts for a new number of bytes.
QUESTION
I noticed that the heap size of the java process never reduces once the vector is cleared. I tried clear(), but the heap remains the same size. It seems like the heap grows with the elements, but even though the elements are removed the size remains. Is there a way in java code to reduce heap size? Is there a detail about the java heap that I am missing? I feel like this is an important question because if I wanted to keep a low memory footprint in any java application, I would need a way to keep the heap size from growing or at least not large for long lengths of time.
Try garbage collection by making a call to System.gc().
This might help you - When does System.gc() do anything
Calling GC extensively is not recommended.
You should provide a max heap size with the -Xmx option and watch your app's memory allocation. Also, use weak references for objects that have a short lifecycle, and the GC will remove them automatically.
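To illustrate the weak-reference point, a minimal sketch (the cache-entry object here is just a placeholder):

```java
import java.lang.ref.WeakReference;

public class WeakRefDemo {
    public static void main(String[] args) {
        Object cacheEntry = new Object();
        // A weak reference does not keep its referent alive on its own.
        WeakReference<Object> ref = new WeakReference<>(cacheEntry);

        // While a strong reference (cacheEntry) still exists, get() returns it.
        System.out.println(ref.get() == cacheEntry); // prints true

        // Drop the strong reference; after a GC the referent becomes eligible
        // for collection and get() may start returning null. Collection is
        // not guaranteed at any particular moment, so don't rely on timing.
        cacheEntry = null;
        System.gc();
        System.out.println("after gc, cleared? " + (ref.get() == null));
    }
}
```

Note that weak references help the GC reclaim objects sooner; they do not by themselves shrink the heap the JVM has already reserved from the OS.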

Memory management with Java (how to use the data segment?)

I have very large data structures that I define as static fields in a class. I think they get pushed into the heap because my code fails with that error message (heap memory exceeded). Now, I think I recall there to be a memory segment besides heap and stack that is much larger, called data. Is it possible for me to push the variables in that segment? If so, how is this accomplished? I can't afford to increase the heap size because my program will be used by others.
Java does not expose a separate "data segment": static fields live on the heap like any other object. So the only thing you could really possibly mean is the disk -- actually writing things to files.

What is the best practice to grow a very large binary file rapidly?

My Java application deals with large binary data files using memory-mapped files (MappedByteBuffer, FileChannel and RandomAccessFile). It often needs to grow the binary file; my current approach is to re-map the file with a larger region.
It works, however there are two problems:
Growing takes more and more time as the file becomes larger.
If growth happens very rapidly (e.g. in a while(true) loop), the JVM hangs forever after the re-map operation has run about 30,000+ times.
What are the alternative approaches, and what is the best way to do this?
Also, I cannot figure out why the second problem occurs. Please share your opinion on that problem as well.
Thank you!
Current code for growing the file (Clojure interop), if it helps:
(set! data (.map ^FileChannel data-fc FileChannel$MapMode/READ_WRITE
0 (+ (.limit ^MappedByteBuffer data) (+ DOC-HDR room))))
You probably want to grow your file in larger chunks. Use a doubling each time you remap, like a dynamic array, so that the cost for growing is an amortized constant.
I don't know why the remap hangs after 30,000 times, that seems odd. But you should be able to get away with a lot less than 30,000 remaps if you use the scheme I suggest.
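A minimal, self-contained Java sketch of that doubling scheme (the file name and target size are arbitrary):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class GrowingMappedFile {

    // Remaps the file with at least `needed` bytes, doubling the current size
    // so the number of remaps stays logarithmic in the final file size.
    static MappedByteBuffer growToFit(FileChannel channel, long needed) throws IOException {
        long size = Math.max(channel.size(), 1);
        while (size < needed) {
            size *= 2; // amortized-constant growth, like a dynamic array
        }
        // Mapping a region larger than the file extends the file on disk.
        return channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("grow", ".bin");
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buf = growToFit(channel, 1000);
            System.out.println(buf.capacity() >= 1000); // prints true
        }
        Files.delete(tmp);
    }
}
```

With doubling, growing a file from 1 byte to 1 GB takes about 30 remaps in total, instead of tens of thousands.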
The JVM doesn't clean up memory mappings even if you call the cleaner explicitly. Thanks to @EJP for the correction.
If you create 32,000 of these, they could all be in existence at once. BTW, I suspect you might be hitting some 15-bit limit.
The only solution for this is: don't create so many mappings. You can map an entire 4 TB disk with fewer than 4K mappings.
I wouldn't create a mapping smaller than 16 to 128 MB if you know the usage will grow, and I would consider up to 1 GB per mapping. The reason you can do this at little cost is that main memory and disk space are not allocated until you actually use the pages, i.e. main memory usage grows 4 KB at a time.
The only reason I wouldn't create a 2 GB mapping is that Java doesn't support it due to the Integer.MAX_VALUE size limit :( If you have 2 GB or more, you have to create multiple mappings.
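A minimal sketch of splitting one large file across several mappings (the tiny chunk size here is only to keep the example small; in practice you would use tens or hundreds of MB per region, as suggested above):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MultiMapping {

    // Maps a file as a list of fixed-size regions. Keeping each region
    // under Integer.MAX_VALUE works around the per-buffer size limit.
    static List<MappedByteBuffer> mapInChunks(FileChannel channel, long chunkSize) throws IOException {
        List<MappedByteBuffer> maps = new ArrayList<>();
        long fileSize = channel.size();
        for (long offset = 0; offset < fileSize; offset += chunkSize) {
            long length = Math.min(chunkSize, fileSize - offset);
            maps.add(channel.map(FileChannel.MapMode.READ_ONLY, offset, length));
        }
        return maps;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("multi", ".bin");
        Files.write(tmp, new byte[10]);
        try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r");
             FileChannel channel = raf.getChannel()) {
            List<MappedByteBuffer> maps = mapInChunks(channel, 4);
            System.out.println(maps.size()); // prints 3: regions of 4, 4 and 2 bytes
        }
        Files.delete(tmp);
    }
}
```

Reads and writes then go through the buffer at index offset / chunkSize, at position offset % chunkSize.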
Unless you can afford exponential growth of the file, like doubling or some other constant multiplier, you need to consider whether you really need a MappedByteBuffer at all, given its limitations (unable to grow the file, no GC, etc.). I personally would either review the problem or use a RandomAccessFile in "rw" mode, probably with a virtual-array layer on top of it.

How can I avoid OutOfMemoryErrors when updating documents in a Lucene index?

I am trying to refresh a Lucene index incrementally, that is, updating documents that have changed and keeping other, unchanged documents as they are.
For updating changed documents, I am deleting those documents using IndexWriter.deleteDocuments(Query) and then adding updated documents using IndexWriter.addDocument().
The Query object used in the IndexWriter.deleteDocuments contains approx 12-15 terms. In the process of refreshing the index I also sometimes need to do a FULL refresh by deleting all the documents using IndexWriter.deleteDocuments and then adding the new documents.
The problem is that when I call IndexWriter.flush() after, say, approximately 100,000 document deletions, it takes a long time to execute and throws an OutOfMemoryError. If I disable flushing, the indexing goes fast up to, say, 2,000,000 document deletions, and then it throws an OutOfMemoryError. I have tried setting IndexWriter.setRAMBufferSizeMB to 500 to avoid the out-of-memory error, but with no luck. The index size is 1.8 GB.
First. Increasing the RAM buffer is not your solution. As far as I understand, it is a cache, and I would rather argue that it is increasing your problem. An OutOfMemoryError is a JVM problem, not a problem of Lucene. You can set the RAM buffer to 1 TB - if your VM does not have enough memory, you have a problem anyway. So you can do two things: increase JVM memory or decrease consumption.
Second. Have you already considered increasing heap memory settings? The reason why flushing takes forever is that the system is doing a lot of garbage collections shortly before it runs out of memory. This is a typical symptom. You can check that using a tool like jvisualvm. You need to install the GC details plugin first, but then you can select and monitor your crazy OutOfMemory app. If you have learned about your memory issue, you can increase maximum heap space like that:
java -Xmx512M MyLuceneApp (or however you start your Lucene application)
But, again, I would use tools to check your memory consumption profile and garbage collection behavior first. Your goal should be to avoid running low on memory, because that causes garbage collection to slow your application to a crawl.
Third. Now if you increase your heap, you have to be sure that you have enough native memory as well. Because if you do not (check with tools like top on Linux), your system will start swapping to disk, and this will hurt Lucene performance badly too. Lucene is optimized for sequential disk reads, and if your system starts to swap, your hard disk will do a lot of seeking, which is two orders of magnitude slower than sequential reading. So it will be even worse.
Fourth. If you do not have enough memory, consider deleting in batches. After 1,000 or 10,000 documents, do a flush, then again and again. The reason for this OutOfMemoryError is that Lucene has to keep everything in memory until you flush. So it might be a good idea anyway not to allow batches that are too big before flushing, to avoid problems in the future.
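A rough sketch of that batching pattern (UpdatedDoc, changedDocs, matchQuery() and toLuceneDoc() are hypothetical placeholders for however you produce your queries and documents; only deleteDocuments, addDocument and commit are actual IndexWriter methods):

```java
// Sketch: delete/re-add in batches and commit periodically so Lucene can
// release buffered deletes instead of holding everything for one big flush.
int batchSize = 10_000;                        // tune to your heap
int pending = 0;
for (UpdatedDoc doc : changedDocs) {           // hypothetical source of changed docs
    writer.deleteDocuments(doc.matchQuery());  // the same Query you build today
    writer.addDocument(doc.toLuceneDoc());
    if (++pending >= batchSize) {
        writer.commit();                       // flush buffered deletes/adds to disk
        pending = 0;
    }
}
writer.commit();                               // flush the final partial batch
```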
On the (rare) occasion that I want to wipe all docs from my Lucene index, I find it much more efficient to close the IndexWriter, delete the index files directly, and then basically start a fresh index. The operation takes very little time and is guaranteed to leave your index in a pristine (if somewhat empty) state.
Try using a smaller RAM buffer size for your IndexWriter.
IndexWriter flushes when the buffer is full (or when the number of documents reaches a certain level). By setting the buffer size to a large number, you are implicitly postponing the flush, which can result in too many documents being held in memory.
