String.split() temporary objects and garbage collection - Java

In my project, we have a requirement to read a very large file, where each line has identifiers separated by a special character ("|"). Unfortunately I can't use parallelism, since it is necessary to make a validation between the last character of a line and the first of the next line, to decide whether or not it will be extracted. Anyway, the requirement is very simple: break the line into tokens, analyze them and store only some of them in memory. The code is very simple, something like below:
final LineIterator iterator = FileUtils.lineIterator(file);
while (iterator.hasNext()) {
    final String[] tokens = iterator.nextLine().split("\\|");
    // process
}
But this little piece of code is very, very inefficient. The split() method generates too many temporary objects that are not being collected (as best explained here: http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr).
For comparison purposes: a 5 MB file was using around 35 MB of memory by the end of processing.
I tested some alternatives like:
Using a pre-compiled pattern (Performance of StringTokenizer class vs. split method in Java) -- sketched in code below this list
Use Guava's Splitter (Java split String performances)
Optimize String storage (http://java-performance.info/string-packing-converting-characters-to-bytes/)
Use of optimized collections (http://blog.takipi.com/5-coding-hacks-to-reduce-gc-overhead)
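For reference, a minimal sketch of the first alternative, with the pattern compiled once and reused for every line (class and method names are mine):

import java.io.File;
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.commons.io.FileUtils;
import org.apache.commons.io.LineIterator;

class PrecompiledSplit {
    // compiled once instead of being re-processed by split() per line
    private static final Pattern PIPE = Pattern.compile("\\|");

    static void parse(File file) throws IOException {
        final LineIterator iterator = FileUtils.lineIterator(file);
        while (iterator.hasNext()) {
            final String[] tokens = PIPE.split(iterator.nextLine());
            // process tokens
        }
    }
}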
But none of them appears to be efficient enough. Using JProfiler, I could see that the amount of memory used by temporary objects is too high (35 MB used, but only 15 MB is actually being used by valid objects).
Then I decided to make a simple test: after every 50,000 lines read, make an explicit call to System.gc(). At the end of the process, the memory usage had decreased from 35 MB to 16 MB. I tested many, many times, and always got the same result.
I know that invoking System.gc() is bad practice (as indicated in Why is it bad practice to call System.gc()?). But is there any other alternative in a scenario where the split() method could be invoked millions of times?
[UPDATE]
I use a 5 MB file only for testing purposes, but the system should process much larger files (500 MB ~ 1 GB).

The first and most important thing to say here is: don't worry about it. The JVM is consuming 35 MB of RAM because its configuration says that's a low enough amount. When its highly efficient GC algorithm decides it's time, it will sweep all those objects away, no problem.
If you really want to, you can invoke Java with memory management options (e.g. java -Xmx...) -- I suggest it's not worth doing unless you're running on very limited hardware.
However, if you really want to avoid allocating an array of String each time you process a line, there are many ways to do so.
One way is to use a StringTokenizer:
StringTokenizer st = new StringTokenizer(line, "|");
while (st.hasMoreTokens()) {
    process(st.nextToken()); // nextToken() returns a String, no array is built
}
You could also avoid consuming a line at a time. Get your file as a stream, use a StreamTokenizer, and consume one token at a time in this way.
Read the API docs for Scanner, BufferedInputStream, Reader -- there are lots of choices in this area, because you're doing something fundamental.
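A token-at-a-time loop with StreamTokenizer might look something like the sketch below. The syntax-table setup and file name are assumptions, and note that it deliberately treats line breaks as token separators -- you would need extra handling to keep the line-boundary validation from the question:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.StreamTokenizer;

class TokenAtATime {
    static void parse(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            StreamTokenizer st = new StreamTokenizer(reader);
            st.resetSyntax();               // start from an empty syntax table
            st.wordChars(0, 255);           // everything belongs to a token...
            st.ordinaryChar('|');           // ...except the field delimiter
            st.whitespaceChars('\r', '\r'); // ...and line terminators
            st.whitespaceChars('\n', '\n');
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_WORD) {
                    process(st.sval);       // one token at a time, no String[] per line
                }
            }
        }
    }

    static void process(String token) { /* analyze and maybe keep the token */ }
}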
However, none of these will cause Java to GC sooner or more aggressively. If the JRE doesn't consider itself short of memory, it won't collect any garbage.
Try writing something like this:
import java.util.Random;

public static void main(String[] args) {
    Random r = new Random();
    Integer x;
    while (true) {
        x = Integer.valueOf(r.nextInt()); // allocates a new Integer for most values
    }
}
Run it and watch your JVM's heap size as it runs (put a sleep in if the usage shoots up too quickly to see). Each time around the loop, Java creates what you call a 'temporary object' of type Integer. All of these stay in the heap until the GC decides it needs to clear them away. You'll see that it won't do this until it reaches a certain level. But when it reaches that level, it will do a good job of ensuring that its limits are never exceeded.

You should adjust your way of analyzing situations. While the article about the regex compilation under the hood is correct in general, it doesn't apply here. When you look at the source code of String.split(String), you'll see that it just delegates to String.split(String,int), which has a special code path for patterns consisting of just one literal character, including escaped ones like your \|.
The only temporary object created within that code path is an ArrayList. The regex package is not involved at all; this fact might help you understand why precompiling a regex pattern did not improve the performance here.
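If you want to avoid even the per-line String[] and ArrayList, about the only option is a hand-rolled indexOf loop. A sketch (the Consumer-based callback is my own choice, not anything prescribed by the question):

import java.util.function.Consumer;

class ManualSplit {
    // Walks the line once and hands each token to the callback directly,
    // without allocating a String[] or an ArrayList per line.
    static void split(String line, char delimiter, Consumer<String> processToken) {
        int start = 0;
        int end;
        while ((end = line.indexOf(delimiter, start)) != -1) {
            processToken.accept(line.substring(start, end));
            start = end + 1;
        }
        processToken.accept(line.substring(start)); // trailing token
    }
}

The token substrings themselves are still allocated, of course; only the per-line container objects go away.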
When you use a profiler to come to the conclusion that there are too many objects, you should also use it to find out what kinds of objects there are and where they originate, instead of guessing wildly.
But it's not clear why you are complaining at all. You can configure the JVM to use a certain maximum amount of memory. As long as that maximum has not been reached, the JVM just does what you told it to, using that memory rather than wasting CPU cycles just to avoid using the available memory. Where's the sense in not using the available memory?

Related

Java high memory usage - GC Strings

I'm trying to write code that will have a minimal impact on resources, and I have come across GC behavior I don't understand.
Apparently, Strings are not cleared from memory immediately, even though they are not in use anymore.
for (int i = 0; i < 999999999; i++)
    System.out.println("Test");
[Memory usage graph]
According to the graph, I assume that a new String object is created on every run of the loop, but it is not cleared automatically on the next run of the loop. If that is the case, I would like to know why it is happening; and in case I'm misreading the situation, I would like to know what is really happening "behind the curtains".
When I add a sleep to the code I presented above, the graph becomes stable. What is the reason for that?
for (int i = 0; i < 999999999; i++) {
    System.out.println("Test");
    try {
        Thread.sleep(1);
    } catch (Exception e) {}
}
[Stable graph]
Also, I have a few questions about the given case:
Can GC be forced to be more aggressive? I mean shortening object lifetimes, not reducing the memory allocated by the JVM.
If I plug in a null value to the variable, will it affect the time until it's cleared by the GC?
What is the correct way to work with Strings when I need to run a large number of regex matches on them?
What is the best way to declare a String object "obsolete" so the GC will clear it?
Does the above situation occur because Java does an automatic intern for Strings and if so is there a way to cancel it?
Thank you very much!
I assume that a new String object is created on every run of the loop
No, if it were creating a new String on each iteration, you would get far more garbage.
At this garbage rate, it could be the profiler that is allocating some objects.
A String literal is created only once (per JVM).
but it is not cleared automatically on the next run of the loop
Correct; even if it were created on each iteration, the GC only runs when it needs to. Collecting on every iteration would be insanely expensive.
When I add a sleep to the code I presented above, the graph becomes stable. What is the reason for that?
You have dramatically slowed down your application.
Can GC be forced to be more aggressive?
You can make the Eden space much smaller, but this would slow down your application.
If I plug in a null value to the variable, will it affect the time until it's cleared by the GC?
No, this rarely does anything.
What is the correct way to work with Strings when I need to run a large number of regex matches on them?
Regexes create a lot of garbage. If you want to reduce allocations and speed up your application, avoid using regexes.
I recently sped up an application by 3x by replacing some commonly used regexes with direct String handling.
What is the best way to declare a String object "obsolete" so the GC will clear it?
Use it in a limited scope. When the scope ends, so does the reference to it, and it can be GCed.
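A trivial illustration (doSomethingWith is a placeholder):

static void handle(byte[] raw) {
    String s = new String(raw); // s is referenced only inside this method
    doSomethingWith(s);
}   // once the method returns, s is unreachable and eligible for collection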
Does the above situation occur because Java does an automatic intern
Once a String is interned it is not recreated.
for Strings and if so is there a way to cancel it?
Sure: force it to create a new String each time. This of course creates more garbage and is much slower (and the code is longer), but you can do it if you want.
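For instance (purely an illustration of the difference, not something you would want in real code):

for (int i = 0; i < 999999999; i++) {
    System.out.println("Test");             // reuses the single interned literal
    System.out.println(new String("Test")); // allocates a fresh String every iteration
}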
The Garbage Collector collects when it's time to collect, more or less.
Yes, depending on what collector you are using. There are literally dozens of VM properties you can set, some of them influencing each other.
I don't think it does in 'newer' JDKs.
Normally you do not care. When it comes to GC, it's more about not loading tons of gigabytes of data into your memory. One specialty about Strings is interning, but Strings will be GC'd like other objects, too.
When there's no reference to the string/intern anymore (when you exit the braces)
No, the situation occurs because Java's GCs work this way...
I can explain the GC effects based on CMS/ParNew (since I know this combo best). It works like this:
The heap is split into two regions (I exclude PermGen for now):
Young and Old
Young is split into 'eden' and 'copy' (or survivor)
When you generate a new object, it will go to Young->Eden. At some point, Eden will reach its maximum size; then unused objects will be removed, and objects that still have references will be copied to Young->Copy.
As the program keeps running, Young->Copy will reach its maximum size, and its contents will be copied again into another Young->Copy memory space.
At some point it can't do that anymore, so some objects will be moved from Young->Copy to Old, depending on a copy counter (I think). Same story for the old heap.
So what can you tune? First of all, you normally distinguish throughput (batching) from low latency (web pages); the ParNew/CMS combo was used for low latency.
Since I know ParNew/CMS best, I'll explain what you can consider tuning first:
You can tune the maximum memory (more memory means more managing; the less memory an application needs to run, the better... in general)
You can tune the heap ratio between young and old
You can tune the ratios between eden and copy within young
You can tune the time when CMS starts its collection cycle
And then there's a lot more. From my personal experience, for large applications, we generally used the following settings (sketched as a command line after the list):
Fix min and max memory to the same size (no change of max heap)
A new-to-old ratio of about 1:4 to 1:7
Disable System.gc()
Log a lot of gc stuff
put an alert on OutOfMemory
Do weekly analysis on the logs and decide on tuning parameters (only one parameter at a time ;)
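As a command-line sketch on a pre-Java-9 HotSpot VM (the heap size, ratio, and class name are example values, not recommendations -- the real numbers come out of the weekly log analysis):

# fixed min = max heap, young/old ratio, ParNew/CMS, CMS start threshold,
# System.gc() disabled, and verbose GC logging
java -Xms4g -Xmx4g \
     -XX:NewRatio=5 \
     -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
     -XX:+DisableExplicitGC \
     -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log \
     com.example.MyApp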
If you really want to know what's behind everything, I'd recommend reading a book, because there's really, really, really a lot going on.

Why does a big unreferenced HashMap increase performance in Java?

I have a performance problem that I can't get my head around. I am writing a Java application that parses huge (> 20 million lines) text files and stores certain information in a Set.
I measure the performance in seconds per million lines. Since I need a lot of memory, I usually run the program with -Xmx6000m and -Xms4000m.
If I just run the program, it parses 1 million lines in about 6 seconds. However, I realized after some performance investigation that if I add this code before the actual parsing routine, performance increases to under 3 seconds per 1 million lines:
BufferedReader br = new BufferedReader(new FileReader("graphs.nt"));
HashMap<String, String> foo = new HashMap<String, String>();
String line;
while ((line = br.readLine()) != null) {
    foo.put(line, "foo");
}
foo = null;
br.close();
br = null;
The graphs.nt file is about 9 million lines long. The performance increase persists even if I do not set foo to null; that assignment is mainly there to demonstrate that the map is in fact not used by the program.
The rest of the code is completely unrelated. I use a parser from OpenRDF Sesame to read a different (not the graphs.nt) file and store extracted information in a new HashSet, created by another object.
In the rest of the code, I create a Parser object, to which I pass a Handler object.
This really confuses me. My guess is that this somehow drives the JVM to allocate more memory for my program, which I can see hints of when I run top. Without the HashMap, it will allocate about 1 GB of memory. If I initialize the HashMap, it will allocate > 2 GB.
My question is whether this sounds at all reasonable. Is it possible that creating such a big object will make more memory available for the program to use afterwards? Shouldn't -Xmx and -Xms control the memory allocation, or are there further arguments that may play a role here?
I am aware that this may seem like an odd question and that information is scarce, but this is all the information that I found related to the issue. If there is any more information that may be helpful, I am more than happy to provide it.
Memory and GC can definitely impact performance. If possible, you should run with -Xms == -Xmx to disable resizing, and give the JVM plenty of room at the start. Your app could exit before any major GC is needed.
Unless you go out of your way to make it otherwise, "foo" will eventually pass out of scope and be collected, even if you don't null the reference, and even if the method containing the above code is never exited. But it will have forced the heap to grow larger, and this will reduce the relative overhead of GC.
(It would be an interesting experiment to reference "foo" at the end of your program, to keep it in scope.)
This sounds like file caching. Your file "graphs.nt" is probably cached in RAM, either by the OS or by the JVM. GC will allow memory consumption to go up for performance reasons; if you add a forced collection right after your preload, System.gc(), you'll be able to tell whether the caching happens in the JVM or in the OS.
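A sketch of that experiment, reusing the preload loop from the question (the actual parsing and timing are left as comments):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;

class CacheProbe {
    public static void main(String[] args) throws IOException {
        // preload, exactly as in the question
        BufferedReader br = new BufferedReader(new FileReader("graphs.nt"));
        HashMap<String, String> foo = new HashMap<String, String>();
        String line;
        while ((line = br.readLine()) != null) {
            foo.put(line, "foo");
        }
        br.close();
        foo = null;

        System.gc(); // reclaim what the preload allocated

        // ... now run the real parsing and compare timings:
        // still fast   -> the win likely comes from the OS file cache
        // speedup gone -> it came from the JVM's enlarged heap
    }
}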

Repeated disk writes

I need to write a list of words to a file and then save the file on disk. Is one of the following two ways better than the other? The second one obviously uses more main memory, but is there a difference in speed?
(this is just pseudocode)
for i = 0 to n:
    word = generateWord()
    fileWriter.println(word)
end loop
versus
listOfWords = new List()
for i = 0 to n:
    word = generateWord()
    listOfWords.add(word)
end loop
for i = 0 to n:
    fileWriter.println(listOfWords[i])
end loop
These two methods you show are exactly the same in terms of disk usage efficiency.
When thinking about speed of disk writes, you must always take into account what kind of writer object you are using. There are many types of writer objects and each of them may behave differently when it comes to actual disk writes.
If the writer you are using is one of those that writes the exact data you tell it to, then your way of writing is very inefficient. You should consider switching to another writer (BufferedWriter, for example) or building a longer string before writing it.
In general, you should try to write data in chunks that fit the disk's chunk size.
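For example, a buffered version of the first variant might look like this (the file name and generateWord() stand in for the question's pseudocode):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

class BufferedWords {
    static void writeWords(int n) throws IOException {
        // BufferedWriter accumulates output (8 KB by default) and writes it
        // to disk in large chunks instead of one call per println().
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("words.txt")))) {
            for (int i = 0; i < n; i++) {
                out.println(generateWord());
            }
        }
    }

    static String generateWord() { return "word"; } // placeholder
}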
Between your code and the disk, you have a stack something like: Java library code, a virtual machine runtime, the C runtime library, the operating system file cache/virtual memory subsystem, the operating system I/O scheduler, a device driver and the physical disk firmware.
Just do the simplest thing possible unless profiling shows a problem. Several of those layers will already be tuned to handle buffering, batching and scheduling sequential writes since they're such a common use case.
From FileWriter's standpoint, you are doing exactly the same thing in both examples, so clearly there cannot be any difference regarding file I/O. And, as you say, the first one's space complexity is O(1), as opposed to the second one's O(n).

How to estimate whether a given task would have enough memory to run in Java

I am developing an application that allows users to set the maximum data set size they want me to run their algorithm against.
It has become apparent that array sizes around 20,000,000 cause an 'out of memory' error. Because I am invoking this via reflection, there is not really a great deal I can do about this.
I was just wondering, is there any way I can check / calculate what the maximum array size could be based on the users heap space settings and therefore validate user entry before running the application?
If not, are there any better solutions?
Use Case:
The user provides a data size they want to run their algorithm against; we generate a scale of numbers to test it against, up to the limit they provided.
We record the time it takes to run and measure the values (in order to work out the O-notation).
We need to somehow limit the user's input so as to not exceed or hit this error. Ideally we want to measure n^2 algorithms on array sizes as big as we can (runs could last for days), so we really don't want one running for two days and then failing, as that would have been a waste of time.
You can use the result of Runtime.freeMemory() to estimate the amount of available memory. However, it might be that actually a lot of memory is occupied by unreachable objects, which will be reclaimed by GC soon. So you might actually be able to use more memory than this. You can try invoking the GC before, but this is not guaranteed to do anything.
The second difficulty is to estimate the amount of memory needed for a number given by the user. While it is easy to calculate the size of an ArrayList with so many entries, this might not be all. For example, which objects are stored in this list? I would expect that there is at least one object per entry, so you need to add this memory too. Calculating the size of an arbitrary Java object is much more difficult (and in practice only possible if you know the data structures and algorithms behind the objects). And then there might be a lot of temporary objects created during the run of the algorithm (for example boxed primitives, iterators, StringBuilders etc.).
Third, even if the available memory is theoretically sufficient for running a given task, it might be practically insufficient. Java programs can get very slow if the heap is repeatedly filled with objects, then some are freed, some new ones are created and so on, due to a large amount of Garbage Collection.
So in practice, what you want to achieve is very difficult and probably next to impossible. I suggest just running the algorithm and catching the OutOfMemoryError.
Usually, catching errors is something you should not do, but this seems like an occasion where it's OK (I do this in some similar cases). You should make sure that as soon as the OutOfMemoryError is thrown, some memory becomes reclaimable for GC. This is usually not a problem: as the algorithm aborts, the call stack is unwound and some (hopefully a lot of) objects are no longer reachable. In your case, you should probably ensure that the large list is among the objects which immediately become unreachable in the case of an OOM. Then you have a good chance of being able to continue your application after the error.
However, note that this is not a guarantee. For example, if you have multiple threads working and consuming memory in parallel, the other threads might as well receive an OutOfMemoryError and not be able to cope with this. Also the algorithm needs to support the fact that it might get interrupted at any arbitrary point. So it should make sure that the necessary cleanup actions are executed nevertheless (and of course you are in trouble if those need a lot of memory!).
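A sketch of that pattern (the class, the long[] payload, and the result strings are stand-ins for the real measurement code):

class SizeProbe {
    static String runWithSize(int size) {
        long[] data = null;
        try {
            data = new long[size];       // the allocation that may blow up
            return runAlgorithm(data);
        } catch (OutOfMemoryError oome) {
            data = null;                 // make the big array reclaimable immediately
            return "too large: " + size; // record the failure and carry on
        }
    }

    static String runAlgorithm(long[] data) { return "ok: " + data.length; }
}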

Java: enough free heap to create an object?

I recently came across this in some code - basically someone trying to create a large object, coping when there's not enough heap to create it:
try {
    // try to perform an operation using a huge in-memory array
    byte[] massiveArray = new byte[BIG_NUMBER];
} catch (OutOfMemoryError oome) {
    // perform the operation in some slower but less
    // memory-intensive way...
}
This doesn't seem right, since Sun themselves recommend that you shouldn't try to catch Error or its subclasses. We discussed it, and another idea that came up was explicitly checking for free heap:
if (Runtime.getRuntime().freeMemory() > SOME_MEMORY) {
    // quick memory-intensive approach
} else {
    // slower, less demanding approach
}
Again, this seems unsatisfactory - particularly in that picking a value for SOME_MEMORY is difficult to relate to the job in question: for some arbitrary large object, how can I estimate how much memory its instantiation might need?
Is there a better way of doing this? Is it even possible in Java, or is any idea of managing memory below the abstraction level of the language itself?
Edit 1: in the first example, it might actually be feasible to estimate the amount of memory a byte[] of a given length might occupy, but is there a more generic way that extends to arbitrary large objects?
Edit 2: as @erickson points out, there are ways to estimate the size of an object once it's created, but (ignoring a statistical approach based on previous object sizes) is there a way of doing so for yet-uncreated objects?
There also seems to be some debate as to whether it's reasonable to catch OutOfMemoryError - anyone know anything conclusive?
freeMemory isn't quite right. You'd also have to add maxMemory()-totalMemory(). e.g. assuming you start up the VM with max-memory=100M, the JVM may at the time of your method call only be using (from the OS) 50M. Of that, let's say 30M is actually in use by the JVM. That means you'll show 20M free (roughly, because we're only talking about the heap here), but if you try to make your larger object, it'll attempt to grab the other 50M its contract allows it to take from the OS before giving up and erroring. So you'd actually (theoretically) have 70M available.
To make this more complicated, the 30M it reports as in use in the above example includes stuff that may be eligible for garbage collection. So you may actually have more memory available, if it hits the ceiling it'll try to run a GC to free more memory.
You can try to get around this a bit by manually triggering a System.gc(), except that that's not such a terribly good thing to do, because:
- it's not guaranteed to run immediately
- it will stop everything in its tracks while it runs
Your best bet (assuming you can't easily rewrite your algorithm to deal with smaller memory chunks, or write to a memory-mapped file, or something less memory-intensive) might be to do a safe rough estimate of the memory needed and ensure that it's available before you run your function.
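Putting that arithmetic into code (the safety margin is an arbitrary slop value, in the spirit of the "slop space" advice in a later answer):

class HeapEstimate {
    // "Potentially available" heap: what is free right now, plus what the JVM
    // may still claim from the OS up to its -Xmx ceiling.
    static long potentiallyAvailable() {
        Runtime rt = Runtime.getRuntime();
        long unclaimed = rt.maxMemory() - rt.totalMemory(); // not yet taken from the OS
        return rt.freeMemory() + unclaimed;
    }

    static boolean roughlyEnoughFor(long bytesNeeded) {
        long safetyMargin = 64L * 1024 * 1024; // leave some slop
        return potentiallyAvailable() > bytesNeeded + safetyMargin;
    }
}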
There are some kludges that you can use to estimate the size of an existing object; you could adapt some of these to predict the size of a yet-to-be created object.
However, in this case, I think it might be best to catch the Error. First of all, asking for the free memory doesn't account for what's available after garbage collection, which will be performed before raising an OOME. And, requesting a garbage collection with System.gc() isn't reliable. It's often explicitly disabled because it can wreck performance, and if it's not disabled… well, it can wreck performance when used unnecessarily.
It is impossible to recover from most errors. However, recoverability is up to the caller, not the callee. In this case, if you have a strategy to recover from an OutOfMemoryError, it is valid to catch it and fall back.
I guess that, in practice, it really comes down to the difference between the "slow" and "fast" way. If the "slow" method is fast enough, I'd stick with that, as it's safer and simpler. And, it seems to me, allowing it to be used as a fall back means that it is "fast enough." Don't let small optimizations derail the reliability of your application.
The "try to allocate and handle the error" approach is very dangerous.
What if you barely get your memory? A later OOM exception might occur because you brought things too close to the limits. Almost any library call will allocate memory at least briefly.
During your allocation a different thread may receive an OOM exception while trying to allocate a relatively small object. Even if your allocation is destined to fail.
The only viable approach is your second one, with the corrections noted in other answers. But you have to be sure to leave extra "slop space" in the heap when you decide to use your memory-intensive approach.
I don't believe that there's a reasonable, generic approach to this that could safely be assumed to be 100% reliable. Even the Runtime.freeMemory approach is vulnerable to the fact that you may actually have enough memory after a garbage collection, but you wouldn't know that unless you force a gc. But then there's no foolproof way to force a GC either. :)
Having said that, I suspect that if you really did know approximately how much you needed, did run a System.gc() beforehand, and you're running a simple single-threaded app, you'd have a reasonably decent shot at getting it right with the freeMemory() call.
If any of those constraints fail, though, and you get the OOM error, you're back at square one, and therefore probably no better off than just catching the Error subclass. While there are some risks associated with this (Sun's VM does not make a lot of guarantees about what happens after an OOM... there's some risk of internal state corruption), there are many apps for which just catching it and moving on with life will leave you with no serious harm.
A more interesting question in my mind, however, is why are there cases where you do have enough memory to do this and others where you don't? Perhaps some more analysis of the performance tradeoffs involved is the real answer?
Definitely, catching the error is the worst approach. An Error happens when there is NOTHING you can do about it - maybe not even create a log entry: puff, "...Houston, we lost the VM".
I didn't quite get the second reason. Was it bad because it is hard to relate SOME_MEMORY to the operations? Could you rephrase it for me?
The only alternative I see is to use the hard disk as the memory (RAM/ROM as in the old days). I guess that is what you're pointing at with your "else slower, less demanding approach".
Every platform has its limits; Java supports as much RAM as your hardware is willing to give (well, actually as much as you allow by configuring the VM). In Sun's JVM implementation that can be done with the -Xmx option, for instance:
java -Xmx8g some.name.YourMemConsumingApp
Of course, you may end up trying to perform an operation that takes 10 GB of RAM. If that's your case, then you should definitely swap to disk.
Additionally, using the strategy pattern could make for nicer code, although here it looks like overkill:
if (isEnoughMemory(SOME_MEMORY)) {
    strategy = new InMemoryStrategy();
} else {
    strategy = new DiskStrategy();
}
strategy.performTheAction();
But it may help if the "else" involves a lot of code and looks bad. Furthermore, if somehow you can use a third approach (like using a cloud for processing), you can add a third Strategy:
...
strategy = new ImaginaryCloudComputingStrategy();
...
:P
EDIT
After the problem with the second approach was cleared up: if there are times when you don't know how much RAM is going to be consumed but you do know how much you have left, you could use a mixed approach (RAM when you have enough, disk when you don't).
Suppose this theoretical problem: you receive a file from a stream and don't know how big it is. Then you perform some operation on that stream (encrypt it, for instance). If you used RAM only it would be very fast, but if the file is large enough to consume all your app's memory, then you have to perform some of the operation in memory, then swap to a file and save temporary data there.
The VM will GC when it is running out of memory; you get more memory back and then you process the next chunk. And this repeats until the big stream is processed.
while (!isDone()) {
    if (isMemoryLow()) { // e.g. Runtime.getRuntime().freeMemory() < SOME_MEMORY, plus some other validations
        swapToDisk(); // and make sure resources are GC'able
    }
    byte[] array = new byte[PREDEFINED_BUFFER_SIZE];
    process(array);
}
cleanUp();
