Memory management in Java, loading large files, how to deallocate objects? - java

I'm writing a parser that loads rather large files (400+ MB) and parses them in roughly 32 MB chunks, then saves the parsed data to disk. I save the data by having a thread with a synchronised list; the thread checks the list periodically and saves whatever is there, then deletes that element from the list. However, the VM's memory use continues to grow.
It's very fast when the Java virtual machine (VM) memory size is very big, but it obviously slows down when it reaches a cap. I can load a 400 MB file in 300 MB of memory, but this is really slow.
Why do objects that I no longer use persist in memory, even though they are perfectly eligible to be reclaimed by the garbage collector (which is really slow)?
How do I prevent the heap from becoming huge?

Yes, you can call:
System.gc();
to request a garbage collection explicitly, although the JVM is free to ignore this hint. And you can do a simple:
var = null;
which drops that reference, making the object eligible for collection once no other references to it remain; it does not free the memory immediately.
I hope this helps.
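A firmer way to keep the heap bounded in the scenario described in the question is to hand chunks to the writer thread through a bounded queue, so the parser blocks instead of piling up parsed-but-unwritten data. A minimal sketch under those assumptions (the class name, queue capacity, and chunk sizes are invented for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ChunkPipeline {
    static final byte[] POISON = new byte[0]; // sentinel marking end of input

    // Parses "chunks" and hands them to a writer thread through a bounded
    // queue. put() blocks when the queue is full, so parsed-but-unwritten
    // chunks can never pile up on the heap.
    public static int run(int chunkCount, int chunkSize) throws InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4); // at most 4 chunks in flight
        List<Integer> written = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    byte[] chunk = queue.take();
                    if (chunk == POISON) break;
                    written.add(chunk.length); // stand-in for writing to disk
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        for (int i = 0; i < chunkCount; i++) {
            queue.put(new byte[chunkSize]); // blocks if the writer falls behind
        }
        queue.put(POISON);
        writer.join();
        return written.size();
    }
}
```

With the queue capped, at most a handful of chunks are reachable at once, so the heap stays small regardless of file size.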

Related

how to optimize large short-lived objects in jvm gc

We have a problem with JVM GC. We run a high-QPS application whose JVM heap usage grows very fast: it uses more than 2 GB of heap within a few seconds, and then a GC triggers that collects more than 2 GB each time, also very frequently. (The original post included a chart of this GC pattern.) So this gives two problems:
GC takes some time and, what is more, it is frequent.
The system will not be stable.
I abstracted the problem into the code below. The system allocates short-lived objects quickly.
public static void fun1() {
    for (int i = 0; i < 5000; i++) {
        // Note: Byte[] is an array of 5 million boxed-Byte *references*, not a 5 MB byte[]
        Byte[] bs = new Byte[1024 * 1024 * 5];
        bs = null; // no effect: the array becomes unreachable at the end of the iteration anyway
    }
}
so, I have some questions:
Many say that setting an object reference to null makes it easier for the GC to collect. What does that mean? We all know that a minor GC is triggered when the JVM is unable to allocate space for a new object. Thus, whether or not a reference is null, GC is triggered only when space runs out, so setting references to null seems meaningless.
How can this be optimized when there are large short-lived objects? I mean, how can these objects be collected at times other than when the young generation runs out of space?
Any suggestions would help me.
I would say you need to cache them. Instead of recreating them every time, try to find a way to reuse already-created ones.
Try to allocate more memory to your app with the options
-Xms8g -Xmx8g
GC runs when there is not enough memory, so if you have more, GC won't be called so often.
It's hard to suggest something valuable without knowing the nature of the huge objects. Try to add an example of such an object, as well as how you create/call them.
If you have big byte arrays inside the short-lived objects, try to keep them outside of Java (e.g. in a file, keeping only the file path in memory).
Many say that setting an object reference to null makes it easier for the GC to collect.
This question is covered by several others on Stack Overflow, so I won't repeat those here. E.g. Is unused object available for garbage collection when it's still visible in stack?
How can this be optimized when there are large short-lived objects? I mean, how can these objects be collected at times other than when the young generation runs out of space?
Increase the size of the young generation so that it can contain the objects. Or use G1GC, which can selectively collect regions containing such objects instead of the entire old generation.
You could also try using direct byte buffers, which allocate memory outside the heap, thus reducing GC pressure. But that memory stays allocated until the buffers pointing to it get collected.
Other than that, you should redesign your code to avoid such frequent, large allocations. Often one can use object pools or thread-local variables as scratch memory.
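The last two suggestions can be combined: a thread-local direct buffer gives each thread a reusable, off-heap scratch area, so the hot path allocates nothing. A rough sketch under those assumptions (the class name and the 5 MB size are invented):

```java
import java.nio.ByteBuffer;

public class ScratchBuffers {
    // One reusable off-heap buffer per thread; allocated once, reused on every call.
    private static final ThreadLocal<ByteBuffer> SCRATCH =
        ThreadLocal.withInitial(() -> ByteBuffer.allocateDirect(5 * 1024 * 1024));

    // Fills the scratch buffer instead of allocating a fresh array each call.
    public static int process(int payloadSize) {
        ByteBuffer buf = SCRATCH.get();
        buf.clear();                    // reuse: reset position/limit, no new allocation
        for (int i = 0; i < payloadSize; i++) {
            buf.put((byte) i);
        }
        return buf.position();          // bytes "processed" this call
    }
}
```

Because the buffer lives outside the heap and is never re-allocated, repeated calls add no GC pressure at all.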

Java program keeps getting bigger

I've written a program in Java that detects when the computer it's being run on is idle. When the idle time is reset (in other words, the mouse or keyboard is used), the program locks the computer. This program is designed to run when the computer starts and continue to run while the machine is on. My problem is that the program takes up more and more space as it runs longer. I don't see any reason why it should; there's nothing like an ArrayList that's being added to constantly. The program "expands" in memory by about 10 megabytes per hour. Is there some sort of garbage collection I should be doing?
Try setting the heap size to a lower value... the garbage collector should then kick in earlier. Manually calling System.gc() from time to time should also help. If this results in an OutOfMemoryError after a while and/or the memory still increases constantly, then you really have a memory leak somewhere.
It doesn't sound like you even have a problem. 10 MB really isn't that large. It could be that the garbage collector simply hasn't "decided" to run in a while. You can try to call the GC directly by calling System.gc(), but really, I wouldn't worry too much unless you're running out of memory or having performance issues.
Any time your program uses the new operator, the runtime allocates new memory that may not be freed until the garbage collector decides it is time to reclaim available space. So even if you are not "leaking" memory by adding to a collection that is never cleared, you are still using memory, and your usage will grow over time.
If memory consumption is a concern, consider eliminating calls to new (e.g. by reusing existing objects) or tuning the JVM's heap-size settings so that the garbage collector runs more frequently.
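One way to eliminate calls to new, as suggested above, is a small free-list of reusable buffers. A minimal sketch (a hypothetical class, not from the original post; note it is not thread-safe as written):

```java
import java.util.ArrayDeque;

public class BufferPool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();
    private final int bufferSize;

    public BufferPool(int bufferSize) {
        this.bufferSize = bufferSize;
    }

    // Hand out a pooled buffer if one is available, otherwise allocate a new one.
    public byte[] acquire() {
        byte[] b = free.pollFirst();
        return (b != null) ? b : new byte[bufferSize];
    }

    // Return a buffer for reuse instead of letting it become garbage.
    public void release(byte[] b) {
        free.addFirst(b);
    }
}
```

Once the pool is warm, steady-state operation allocates nothing, so the heap stops growing between collections.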

Garbage Collection behavior

During start-up of my application, the database is queried, objects are created from the result of the query, and they are inserted into an ArrayList. The ArrayList is later looped over and another data structure is created from it. The ArrayList (which is huge) is afterwards garbage collected. My question is: is it a strain on the garbage collector to collect such a big object at once? What if I created a queue data structure instead of an ArrayList? Reading the objects from the queue would make them eligible for GC as they are consumed. Is that less strain on the GC? I am aware that GC can run at any time and that there are no guarantees about its execution. Beyond the timing of execution, what I would like to understand is: is it more work for the GC to collect from a contiguous region of memory (ArrayList) than from a queue, in which memory allocation is not contiguous?
is it more work for the GC to collect from a contiguous region of memory (ArrayList) than from a queue, in which memory allocation is not contiguous?
It is more work to clean up a linked-list-based queue than an ArrayList. This is because an ArrayList is only two objects (the list and its backing array), while such a queue has one node object per element.
If you want to reduce the GC load, process the data as you read it. That way you won't need a queue or a list, and you might find you have processed all the data by the time it has downloaded, i.e. it could be quite a bit faster too.
The biggest strain here comes from keeping objects that are "huge in size" in memory. It can cause GC to run more frequently if other objects need to be created on the heap, or even lead to an "out of memory" error as the size of your DB and ArrayList grows.
Any solution that decreases the amount of memory allocated to the "huge" objects will help. If you can build your queue in such a way that queue elements are released quickly, without waiting for all the other objects to be read from the DB, go for it.
As Peter mentioned in his answer, it would be even better to process each object as soon as it is read from the DB, without queuing it or adding it to a list.
One possible solution would be to redesign your data access layer and use ResultSet (http://docs.oracle.com/javase/7/docs/api/java/sql/ResultSet.html), which is available on any Java platform I can think of. Since the ResultSet is kept on the DB side, you can read records one at a time and decrease the strain on your memory significantly.
Another approach would be to implement pagination, e.g. by changing your original query so that only a portion of the results is read from the DB at a time.
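A sketch of the row-at-a-time ResultSet approach, assuming a JDBC driver that honors setFetchSize; the "orders"/"amount" query in the usage method is made up for illustration:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RowStreamer {
    // Processes rows one at a time; only the current row (plus the driver's
    // fetch buffer) is ever held in memory, instead of a huge ArrayList.
    public static long sumColumn(ResultSet rs) throws SQLException {
        long sum = 0;
        while (rs.next()) {         // the cursor advances row by row
            sum += rs.getLong(1);   // process the row immediately, keep nothing
        }
        return sum;
    }

    // Hypothetical usage: "orders" and "amount" are made-up names.
    public static long sumAmounts(Connection conn) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement("SELECT amount FROM orders")) {
            ps.setFetchSize(1000);  // hint: stream ~1000 rows per round trip
            try (ResultSet rs = ps.executeQuery()) {
                return sumColumn(rs);
            }
        }
    }
}
```

The key point is that nothing accumulates: memory use is constant in the number of rows.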

Can you create objects faster than the JVM GC can handle, thus causing an OOM?

I am just wondering: if I create/delete objects in the following manner, would the GC be able to handle it?
create millions of objects.
wait for 5 minutes.
delete the objects.
create them again right away after deletion (without wait/delay).
I think the CPU usage would go up as the GC runs, but is it possible that the GC would not be able to keep up as it tries to reclaim the heap?
If the heap survives the initial surge, you're fine.
You can OOM if you allocate too many objects; speed has nothing to do with it.
But if you don't have more live objects than the heap can hold, you won't go OOM. Part of allocating a new object is checking whether a GC is necessary, so you can't "outrace" it.

How does the heap manager in java or C++ keep track of all the memory locations used by the threads or processes?

I wanted to understand what data structures the heap managers in Java (or the OS, in the case of C++ or C) use to keep track of the memory locations used by threads and processes. One way is to use a map from objects to memory addresses, together with a reverse map from each starting address to the size of the object at that location.
But then it cannot serve new memory requests in O(1) time. Is there a better data structure for this?
Note that unmanaged languages allocate/free memory through system calls and generally do not manage it themselves. Still, regardless of the level of abstraction (from the OS up to the runtime), something has to deal with this:
One method is called buddy block allocation, described well with an example on Wikipedia. It essentially keeps track of the usage of memory blocks of varying sizes (typically powers of 2). This can be done with a number of arrays and clever indexing, or perhaps more intuitively with a binary tree, where each node tells whether a certain block is free and all nodes on one level represent blocks of the same size.
This suffers from fragmentation: rounding requests up to a power of two wastes space inside blocks (internal fragmentation), and as allocations come and go your data might end up scattered rather than efficiently consolidated, making it harder to fit in large data. This could be countered by a more complicated, dynamic system, but buddy blocks have the advantage of simplicity.
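The binary-tree formulation above can be sketched as a toy buddy allocator, where each tree node records the largest free chunk beneath it (a simulation over abstract offsets, not real memory; sizes are assumed to be powers of two):

```java
public class BuddySim {
    private final int[] longest; // longest[i] = largest free chunk under tree node i
    private final int total;

    public BuddySim(int size) {              // size must be a power of two
        total = size;
        longest = new int[2 * size];
        int nodeSize = 2 * size;
        for (int i = 1; i < 2 * size; i++) {
            if ((i & (i - 1)) == 0) nodeSize /= 2; // descended one tree level
            longest[i] = nodeSize;
        }
    }

    // Allocate a block of at least `size` units; returns its offset, or -1.
    public int alloc(int size) {
        int s = 1;
        while (s < size) s <<= 1;            // round up to a power of two
        if (longest[1] < s) return -1;
        int index = 1, nodeSize = total;
        while (nodeSize != s) {              // walk down to a node of exactly size s
            index = (longest[2 * index] >= s) ? 2 * index : 2 * index + 1;
            nodeSize /= 2;
        }
        longest[index] = 0;                  // mark the block used
        int offset = index * nodeSize - total;
        while ((index /= 2) >= 1)            // update ancestors' bookkeeping
            longest[index] = Math.max(longest[2 * index], longest[2 * index + 1]);
        return offset;
    }

    // Free the power-of-two block at `offset`, merging buddies where possible.
    public void free(int offset, int size) {
        int index = (offset + total) / size;
        int nodeSize = size;
        longest[index] = size;
        while ((index /= 2) >= 1) {
            nodeSize *= 2;
            int left = longest[2 * index], right = longest[2 * index + 1];
            // if both halves are fully free, the buddies merge back into one block
            longest[index] = (left + right == nodeSize) ? nodeSize : Math.max(left, right);
        }
    }

    public int largestFree() { return longest[1]; }
}
```

Both alloc and free walk one root-to-leaf path, so they run in O(log n) rather than O(1), illustrating the trade-off raised in the question.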
The OS keeps track of the process's memory allocation in an overall view - 4KB pages or bigger "lumps" are stored in some form of list.
In the typical Windows implementation (Microsoft's C runtime library), at least in recent versions, all memory allocations are done through the HeapAlloc() call, so every single heap allocation goes through to the OS. Whether the OS actually tracks every single allocation or just keeps a map of "what is free, what is used" is another matter. It is my understanding that the heap-management code has no list of "current allocations", just a list of freed memory lumps.
In Linux/Unix, the C library will typically avoid calling the OS for every little allocation, and instead uses a large lump of memory, and splits that up into smaller pieces per allocation. Again, no tracking of allocated memory inside the heap management.
This is done at a process level. I'm not aware of an operating system that differentiates memory allocations on a per-thread level (other than TLS - thread local storage, but that is typically a very small region, outside of the typical heap code management).
So, in summary: the OS and/or C/C++ runtime doesn't actually keep a list of all the used allocations; it keeps a list of freed memory [and when another lump is freed, it will typically join it with the previous and next consecutive free lumps to reduce fragmentation]. When the allocator is first started, it's given a large lump, which is recorded as a single free region. When a request is made, a section is split off the lump and the remainder stays on the free list. When that lump is not sufficient, another big lump is carved off using the underlying OS allocations.
There is a small amount of metadata stored with each allocation, which contains things like "how much memory is allocated", and this metadata is used when freeing the memory. In the typical case, this data is stored immediately before the allocated memory. But there is no way to find the allocation metadata without knowing about the allocations in some other way.
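The free-list-with-coalescing scheme just described can be simulated in a few lines. This toy version tracks only the free regions (offset → size), carves allocations from the first region that fits, and joins a freed region with its free neighbours (illustrative only, not any real allocator):

```java
import java.util.Map;
import java.util.TreeMap;

public class FreeListSim {
    // Free regions only, keyed by offset; used allocations are not tracked,
    // mirroring the description above.
    private final TreeMap<Integer, Integer> free = new TreeMap<>();

    public FreeListSim(int heapSize) {
        free.put(0, heapSize);               // one big free lump at start-up
    }

    // First-fit allocation; returns the offset, or -1 if nothing fits.
    public int alloc(int size) {
        for (Map.Entry<Integer, Integer> e : free.entrySet()) {
            if (e.getValue() >= size) {
                int off = e.getKey(), len = e.getValue();
                free.remove(off);
                if (len > size) free.put(off + size, len - size); // remainder stays free
                return off;
            }
        }
        return -1;
    }

    // Freeing re-inserts the region and joins it with adjacent free regions.
    public void free(int off, int size) {
        Map.Entry<Integer, Integer> prev = free.lowerEntry(off);
        if (prev != null && prev.getKey() + prev.getValue() == off) {
            off = prev.getKey();             // coalesce with the region before
            size += prev.getValue();
            free.remove(off);
        }
        Integer nextLen = free.get(off + size);
        if (nextLen != null) {               // coalesce with the region after
            free.remove(off + size);
            size += nextLen;
        }
        free.put(off, size);
    }

    public int largestFreeBlock() {
        int max = 0;
        for (int len : free.values()) max = Math.max(max, len);
        return max;
    }
}
```

Note that free() here needs the size passed in; a real allocator recovers it from the metadata header stored just before the allocation, as described above.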
There is no automatic garbage collection in C++. You need to call free/delete for each malloc/new heap allocation. That's where tools like Valgrind (to check for memory leaks) come in handy. There are also concepts like auto_ptr, which automatically frees heap memory, that you can look into.
