I have 642 million records to load into a Java ConcurrentHashMap. The size of the entire data set on disk is only 30 GB, yet when I load it into memory, almost 250 GB is consumed.
I don't have any explanation for such an overhead. Can anyone explain it? Also, can you suggest any ideas for reducing the memory consumption?
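For scale, here is a rough back-of-envelope sketch of where such overhead can come from. Every number below is an assumption (a 64-bit JVM with compressed oops, smallish object keys and values), since the question does not say what the keys and values actually are:
public class EntryOverheadEstimate {
    public static void main(String[] args) {
        // All figures are rough assumptions for a 64-bit JVM with compressed oops.
        long nodeOverhead = 32;   // ConcurrentHashMap.Node: object header + hash + key/value/next references
        long tableShare   = 4;    // this entry's share of the bucket array
        long keyObject    = 48;   // e.g. a smallish String: header, fields, backing array overhead
        long valueObject  = 48;   // likewise for the value object
        long perEntry     = nodeOverhead + tableShare + keyObject + valueObject;
        long entries      = 642_000_000L;
        System.out.printf("~%d bytes per entry, ~%d GB before the actual payload bytes%n",
                perEntry, perEntry * entries / (1L << 30));
    }
}
On top of that bookkeeping come the payload bytes themselves, alignment padding, the table's load-factor headroom, and any boxing of primitive keys or values, which together can easily multiply the on-disk size several times over.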
What is the difference between a CPU cache and a memory cache?
When data is cached in memory there is also a higher probability that
this data is also cached in the CPU cache of the CPU executing the
thread. [1]
And how can we relate caching in the CPU to caching in memory?
To go into detail, your question relates to both the hardware and the software used in computing.
Cache
This is just a general term used to refer to sets of data that are accessed quite often.
In computing, a cache /ˈkæʃ/ (kash) is a hardware or software component that stores data so future requests for that data can be served faster.
Source
Memory Cache
Quite simply, this is a cache of frequently accessed data that is stored in a reasonably fast medium, e.g. in RAM or on a disk drive.
CPU Cache
This is a small block of RAM-like memory that is physically part of the CPU. It generally doesn't have a lot of capacity.
e.g. the Intel Core i7-920 has a cache of 8 MB
Source
The point of this cache is to store data that the CPU is using quite regularly in order to speed up transfer time, since the CPU cache is physically closer to the processor than RAM is.
According to Wikipedia:
In computing, a cache is a hardware or software component that stores
data so future requests for that data can be served faster
So basically it is a location where you store data so that the next time you want the data you can access it more quickly. This means the cache needs to be in a location that is faster than the original location.
Typically the hard disk is used to store most data in a persistent manner. This is the largest data store in a computer system and is normally slow.
All the "work" is however done by the CPU. So in order to do processing of the data the CPU needs to first read the data, then process it, then write it out. As the CPU has a very limited memory/data registers then it does a lot of reading and writing.
Ideally your CPU would have a large enough data registers to store everything you need. But memory on the CPU is very expensive, so this is not practical.
So you have main memory, where applications store some data temporarily while running to make access quicker.
The way that applications work means that you tend to have a lot of data which is accessed very frequently, often referred to as hot data.
So the purpose of the cache is to store such hot data so that you can refer to it more quickly and easily when needed.
So the closer to the CPU core your data is, the quicker it can be accessed and hence the better the performance, but also the more expensive the storage is.
A typical memory-hierarchy diagram shows these different levels together with their approximate access times.
It can vary slightly depending on the CPU architecture (and has changed over time), but generally the L1 and L2 caches are per core, while the L3 cache is shared between multiple cores. The L1 cache is also often split into a data cache and an instruction cache.
So your CPU cache will contain the data which is accessed the most at that time, so there is a relation of sorts to either the main memory or the HDD where the data was fetched from. But because it is small, the cache will quickly switch to other data if you do something else, or if something else is running in the background.
It is therefore not really possible to control the CPU cache. Plus, if you did, you would effectively slow down everything else (including the OS), because you would be denying them the ability to use the cache.
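Even though you cannot control the CPU cache directly, the way your code walks through memory still determines how well the cache is used. The following is a purely illustrative sketch (the array size is arbitrary and the timings will vary by machine):
public class CacheTraversal {
    public static void main(String[] args) {
        int n = 4_096;
        int[][] data = new int[n][n];
        long sum = 0;

        // Row-major traversal touches memory sequentially and is cache-friendly.
        long start = System.nanoTime();
        for (int row = 0; row < n; row++)
            for (int col = 0; col < n; col++)
                sum += data[row][col];
        System.out.println("row-major:    " + (System.nanoTime() - start) / 1_000_000 + " ms");

        // Column-major traversal jumps between rows, missing the cache far more often.
        start = System.nanoTime();
        for (int col = 0; col < n; col++)
            for (int row = 0; row < n; row++)
                sum += data[row][col];
        System.out.println("column-major: " + (System.nanoTime() - start) / 1_000_000 + " ms (sum=" + sum + ")");
    }
}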
Every time your application reads data and stores it in main memory, it is effectively creating its own cache, assuming you then access the data from this location and don't read it from disk (or another location) every time you need it.
So this can mean that part of it is also in the CPU cache, but not necessarily, since your application may have data in main memory while it is not doing anything, or has not accessed that data for a long time.
Remember also that the data in the CPU caches is very small in comparison to the data in main memory. For example, the Broadwell Intel Xeon chips have:
L1 Cache = 64 KB (per core)
L2 Cache = 256 KB (per core)
L3 Cache = 2 - 6 MB (shared).
The "memory cache" appears to really just be talking about anywhere in memory. Sometimes this is a cache of data stored on disk or externally. This is a software cache.
The CPU cache is a hardware cache and is faster, more localised but smaller.
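To make the software side concrete, here is a minimal sketch of such a memory cache in Java (the class name and capacity are arbitrary choices): it keeps hot data in RAM and evicts the least recently used entry when it is full.
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal software cache: keeps the most recently used entries in RAM so that
// repeated lookups avoid going back to the slower original source (e.g. disk).
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 1_000;

    LruCache() {
        super(16, 0.75f, true); // access order, so "hot" entries stay in the cache
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > MAX_ENTRIES; // evict the least recently used entry
    }
}
Whether any of this data also ends up in the CPU cache is then entirely up to the hardware, as described above.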
I have an application that uses a lot of memory diff'ing the contents of two potentially huge (100k+) directories. It makes sense to me that such an operation would use a lot of memory, but once my diff'ing operation is done, the heap remains the same size.
I basically have code that instantiates a class to store the filename, file size, path, and modification date for each file on the source and target. I save the additions, deletions, and updates in other arrays. I then clear() my source and target arrays (which could be 100k+ each by now), leaving relatively small additions, deletions, and updates arrays left.
After I clear() my target and source arrays though, the memory usage (as visible via VisualVM and Windows Task Manager) doesn't drop. I'm not experienced enough with VisualVM (or any profiler, for that matter) to figure out what is taking up all this memory. VisualVM's heap dump lists the top few objects with a retained size of a few megabytes.
Anything to help point me in the right direction?
If the used heap goes down after a garbage collection, then it likely works as expected. Java increases its heap when it needs more memory, but does not free it; it prefers to keep it in case the application uses more memory again. See Is there a way to lower Java heap when not in use? for a discussion of why the heap is not reduced after the amount of used heap goes down.
The VM grows or shrinks the heap based on the command-line parameters -XX:MinHeapFreeRatio and -XX:MaxHeapFreeRatio. It will shrink the heap when the percentage of free space exceeds -XX:MaxHeapFreeRatio, whose default is 70.
There is a short discussion of this in Oracle's bug #6498735.
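For example, a hedged illustration of passing those flags (the ratio values and the jar name are placeholders; suitable numbers depend on your application and garbage collector):
java -XX:MinHeapFreeRatio=10 -XX:MaxHeapFreeRatio=30 -jar your-diff-tool.jar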
Depending on your code, you might be creating memory leaks that the garbage collector just can't free up.
I would suggest instrumenting your code in order to find potential memory leaks. Once this is ruled out or fixed, I would start looking at the code itself for possible improvements.
Note, for instance, that if you use a try/catch/finally block, the finally block might not be executed at all in some situations (or at least not immediately). If you do your resource freeing in a finally block, this might be the answer.
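For instance, a sketch of freeing a resource in a finally block, plus the equivalent try-with-resources form available since Java 7 (the file name is only a placeholder):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FinallyCleanup {
    public static void main(String[] args) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader("source-listing.txt"));
        try {
            System.out.println(reader.readLine());
        } finally {
            reader.close(); // releases the file handle whether or not readLine() threw
        }

        // Equivalent and less error-prone: try-with-resources closes automatically.
        try (BufferedReader r = new BufferedReader(new FileReader("source-listing.txt"))) {
            System.out.println(r.readLine());
        }
    }
}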
Nevertheless, read up on the subject, for instance here: http://www.toptal.com/java/hunting-memory-leaks-in-java
I'm writing a parser that loads rather large files (400+ MB) and parses them in about 32 MB chunks, then saves the parsed data to disk. I save the data by having a thread with a synchronised list; the thread checks the list periodically and saves anything that's there, and I then delete that element from the list. However, the VM memory use continues to grow.
It's very fast when the Java virtual machine (VM) memory size is very big, but it obviously slows down when it reaches a cap. I can load a 400 MB file in 300 MB of memory, but this is really slow.
Why do objects that I no longer use persist in memory, even though they are perfectly fine to be deallocated by the garbage collector (which is really slow)?
How do I prevent the heap from becoming huge?
Yes, you can call the garbage collector explicitly:
System.gc();
Note, however, that this is only a request; the JVM is free to ignore it. You can also do a simple:
var = null;
which drops your reference to the object, so the memory it uses becomes eligible for garbage collection (it is not freed immediately).
I hope this helps.
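Putting that (corrected) advice together, here is a minimal sketch in the spirit of the question's chunked parser; the list, chunk size, and comments are placeholders, not the asker's actual code:
import java.util.ArrayList;
import java.util.List;

public class GcHintDemo {
    public static void main(String[] args) {
        List<byte[]> parsedChunks = new ArrayList<>();
        parsedChunks.add(new byte[32 * 1024 * 1024]); // stand-in for a parsed 32 MB chunk
        // ... the saver thread would write the chunk to disk here ...
        parsedChunks.clear();   // drop the list's references to the chunks
        parsedChunks = null;    // drop the list itself if it is no longer needed
        System.gc();            // only a hint; the JVM is free to ignore it
    }
}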
I have very large data structures that I define as static fields in a class. I think they get pushed onto the heap, because my code fails with that error message (heap memory exceeded). Now, I think I recall there being a memory segment besides the heap and the stack that is much larger, called the data segment. Is it possible for me to push the variables into that segment? If so, how is this accomplished? I can't afford to increase the heap size, because my program will be used by others.
The only thing you could really mean here is the disk: actually writing things to files.
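In that spirit, one option for keeping a large structure off the Java heap is a memory-mapped file. This is only a sketch under assumptions not stated in the question (the file name and size are placeholders, and the data has to be laid out manually as bytes):
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedData {
    public static void main(String[] args) throws Exception {
        long size = 1L << 30; // 1 GB lives in the file / OS page cache, not on the Java heap
        try (FileChannel channel = FileChannel.open(Paths.get("big-data.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            buffer.putLong(0, 42L);                // write without growing the heap
            System.out.println(buffer.getLong(0)); // read it back
        }
    }
}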
I am trying to refresh a Lucene index in incremental mode, updating documents that have changed and keeping other, unchanged documents as they are.
For updating changed documents, I am deleting those documents using IndexWriter.deleteDocuments(Query) and then adding updated documents using IndexWriter.addDocument().
The Query object used in the IndexWriter.deleteDocuments contains approx 12-15 terms. In the process of refreshing the index I also sometimes need to do a FULL refresh by deleting all the documents using IndexWriter.deleteDocuments and then adding the new documents.
The problem is that when I call IndexWriter.flush() after, say, approximately 100,000 document deletions, it takes a long time to execute and throws an OutOfMemoryError. If I disable flushing, the indexing goes fast up to, say, 2,000,000 document deletions, and then it throws an OutOfMemoryError. I have tried setting IndexWriter.setRAMBufferSizeMB to 500 to avoid the out-of-memory error, but with no luck. The index size is 1.8 GB.
First: increasing the RAM buffer is not your solution. As far as I understand, it is a cache, and I would rather argue that it is increasing your problem. An OutOfMemoryError is a JVM problem, not a problem of Lucene. You can set the RAM buffer to 1 TB; if your VM does not have enough memory, you have a problem anyway. So you can do two things: increase JVM memory or decrease consumption.
Second: have you already considered increasing the heap memory settings? The reason why flushing takes forever is that the system is doing a lot of garbage collections shortly before it runs out of memory. This is a typical symptom. You can check that using a tool like jvisualvm. You need to install the GC details plugin first, but then you can select and monitor your crazy OutOfMemory app. Once you have learned about your memory issue, you can increase the maximum heap space like this:
java -Xmx512M MyLuceneApp (or however you start your Lucene application)
But, again, I would use tools to check your memory consumption profile and garbage collection behaviour first. Your goal should be to avoid running low on memory, because that causes garbage collection to slow your application down to the point where there is no performance left.
Third: if you increase your heap, you have to be sure that you have enough native memory as well. Because if you do not (check with tools like top on Linux), your system will start swapping to disk, and this will hit Lucene performance like crazy as well. Lucene is optimized for sequential disk reads, and if your system starts to swap, your hard disk will do a lot of disk seeking, which is about two orders of magnitude slower than sequential reading. So it will be even worse.
Fourth: if you do not have enough memory, consider deleting in batches. After 1,000 or 10,000 documents do a flush, then again and again. The reason for this OutOfMemoryError is that Lucene has to keep everything in memory until you do the flush, so it might be a good idea anyway not to let the batches between flushes get too big, to avoid problems in the future.
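As a sketch of that batched approach (the helper name and batch size are arbitrary, the IndexWriter is assumed to be configured elsewhere, and commit() is used here to force the buffered deletes out):
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.Query;

public class BatchDeleter {
    static void deleteInBatches(IndexWriter writer, List<Query> deleteQueries, int batchSize)
            throws IOException {
        int count = 0;
        for (Query query : deleteQueries) {
            writer.deleteDocuments(query);      // buffer the deletion
            if (++count % batchSize == 0) {
                writer.commit();                // flush buffered deletes before they pile up
            }
        }
        writer.commit();                        // flush whatever remains
    }
}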
On the (rare) occasions that I want to wipe all docs from my Lucene index, I find it much more efficient to close the IndexWriter, delete the index files directly, and then basically start a fresh index. The operation takes very little time and is guaranteed to leave your index in a pristine (if somewhat empty) state.
Try using a smaller RAM buffer size for your IndexWriter.
IndexWriter calls flush when the buffer is full (or when the number of documents reaches a certain level). By setting the buffer size to a large number, you are implicitly postponing the flush, which can result in having too many documents in memory.
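For example, using the same IndexWriter.setRAMBufferSizeMB call the question already mentions (newer Lucene versions configure this on IndexWriterConfig instead; 32 is just an illustrative value):
writer.setRAMBufferSizeMB(32.0); // flush roughly every 32 MB of buffered changes instead of 500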