Support for Compressed Strings being Dropped in HotSpot JVM? - java

On this Oracle page Java HotSpot VM Options, it lists -XX:+UseCompressedStrings as being available and on by default. However, in Java 6 update 29 it is off by default, and in Java 7 update 2 it reports a warning:
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option UseCompressedStrings; support was removed in 7.0
Does anyone know the thinking behind removing this option?
Using the example from sorting lines of an enormous file.txt in java with -mx2g, this test took 4.541 seconds with the option on and 5.206 seconds with it off in Java 6 update 29, so it is hard to see any performance impact.
Note: Java 7 update 2 requires 2.0 GB, whereas Java 6 update 29 requires 1.8 GB without compressed strings and only 1.0 GB with them.
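For reference, a rough sketch of the kind of measurement involved (my own reconstruction, not the linked example's code; the file path argument and the read-all-then-sort approach are assumptions):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;

public class SortLinesBenchmark {
    // Read every line into memory, sort, and report the elapsed time.
    // Heap usage is dominated by the String objects holding the lines,
    // which is exactly what UseCompressedStrings affects.
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.ISO_8859_1);
        Collections.sort(lines);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Sorted %,d lines in %.3f s%n", lines.size(), seconds);
    }
}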

Originally, this option was added to improve SPECjBB performance. The gains are due to reduced memory bandwidth requirements between the processor and DRAM. Loading and storing bytes in the byte[] consumes 1/2 the bandwidth versus chars in the char[].
However, this comes at a price. The code has to determine if the internal array is a byte[] or char[]. This takes CPU time and if the workload is not memory bandwidth constrained, it can cause a performance regression. There is also a code maintenance price due to the added complexity.
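As an illustration only (this is not the actual HotSpot/JDK source, and the class and field names are made up), the per-access cost looks roughly like this:

final class DualString {
    // Exactly one of these is non-null, depending on whether the
    // characters fit in a single byte each.
    private final byte[] compact;
    private final char[] wide;

    DualString(byte[] compact, char[] wide) {
        this.compact = compact;
        this.wide = wide;
    }

    char charAt(int index) {
        // This representation check on every access is the CPU cost described above.
        return compact != null ? (char) (compact[index] & 0xFF) : wide[index];
    }
}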
Because there weren't enough production-like workloads that showed significant gains (except perhaps SPECjBB), the option was removed.
There is another angle to this. The option reduces heap usage. For applicable Strings, it reduces the memory usage of those Strings by 1/2. This angle wasn't considered at the time of option removal. For workloads that are memory capacity constrained (i.e. have to run with limited heap space and GC takes a lot of time), this option can prove useful.
If enough memory capacity constrained production-like workloads can be found to justify the option's inclusion, then maybe the option will be brought back.
Edit 3/20/2013: An average server heap dump uses 25% of the space on Strings. Most Strings are compressible. If the option is reintroduced, it could save half of this space (i.e. roughly 12%)!
Edit 3/10/2016: A feature similar to compressed strings is coming back in JDK 9 JEP 254.

Just to add, for those interested...
The java.lang.CharSequence interface (which java.lang.String implements), allows more compact representations of Strings than UTF-16.
Apps which manipulate a lot of strings should probably be written to accept CharSequence, so that they work with java.lang.String or with more compact representations.
Strings encoded in 8 bits (UTF-8), 7 bits, or even 5 or 6 bits, or even compressed strings, can be represented as CharSequences.
CharSequences can also be a lot more efficient to manipulate - subsequences can be defined as views (pointers) onto the original content for example, instead of copying.
For example in concurrent-trees, a suffix tree of ten of Shakespeare's plays, requires 2GB of RAM using CharSequence-based nodes, and would require 249GB of RAM if using char[] or String-based nodes.
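A minimal sketch of such a view (my own illustration; the CharView name is hypothetical): subSequence() returns an object that points into the original text rather than copying characters.

public final class CharView implements CharSequence {
    private final CharSequence source;
    private final int start, end;

    public CharView(CharSequence source, int start, int end) {
        this.source = source;
        this.start = start;
        this.end = end;
    }

    public int length() { return end - start; }

    public char charAt(int index) { return source.charAt(start + index); }

    public CharSequence subSequence(int from, int to) {
        return new CharView(source, start + from, start + to); // another view, no copy
    }

    public String toString() {
        return source.subSequence(start, end).toString();
    }
}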

Since there were upvotes, I figure I wasn't missing something obvious, so I have logged it as a bug (at the very least an omission in the documentation):
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7129417
(Should be visible in a couple of days)

Java 9 runs the sorting lines of an enormous file.txt in java example twice as fast on my machine as Java 6, and it needs only 1 GB of memory, as it has -XX:+CompactStrings enabled by default. Also, in Java 6 the compressed strings only worked for 7-bit ASCII characters, whereas Java 9 supports Latin-1 (ISO-8859-1). Some operations like charAt(idx) might be slightly slower though. With the new design, other encodings could also be supported in the future.
I wrote a newsletter about this on The Java Specialists' Newsletter.

In OpenJDK 7 (1.7.0_147-icedtea, Ubuntu 11.10), the JVM simply fails with an
Unrecognized VM option 'UseCompressedStrings'
when JAVA_OPTS (or command line) contains -XX:+UseCompressedStrings.
It seems Oracle really removed the option.

Related

Running OPTICS clustering on ELKI using large geo-dataset

I am using OPTICSXi with the rstartree index on ELKI to cluster a geo-dataset (latitude & longitude), Gowalla, which includes about 6 million records, but the MiniGUI always shows 'java heap space' and 'error: out of memory'.
I once saw an answer by Anony-Mousse in which 1.2 million location records were processed in 11 minutes using OPTICSXi on ELKI, so I'm confused. Why is ELKI reporting these errors?
Are there any parameters I need to modify on the Java platform or in ELKI?
This is a standard out of memory error.
You will have to add more memory, or decrease memory consumption somehow.
You could also try the cover tree (it should need much less memory than the current R*-tree implementation). Make sure to use an appropriate, and small, value of epsilon to benefit from indexing.

When performing mmap, would C or Java have any significant performance differences?

I have a 50GB file that is a sorted csv file.
Would it in theory make any difference if I was performing lookups on this file using memory mapped access using C or java?
I'm guessing that, since file access is pushed down to the operating-system level, it shouldn't really make much of a difference, correct?
In theory, Java will be infinitesimally slower because of the need for additional indirections due to Java's object-oriented method invocation, and possibly due to the need to cross the Java/JNI boundary.
In practice, the Hotspot compiler optimizes direct ByteBuffer access, and the cost of page faults will far exceed the extra memory indirection.
To give a direct answer to the question:
C's mmap() and Java's FileChannel.map() are considered to be pretty much equivalents and won't have significant performance differences.
Java can only map 2 GB at a time. This is because ByteBuffer uses 32-bit integers for length, size, etc. So you'd need 25 mmaps for your 50 GB file. C can just create a single mmap, although it won't be portable to 1990s computers (if you care about that).
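A minimal sketch of working around that limit by mapping the file in chunks (assumptions: read-only access, and records that straddle a chunk boundary are handled elsewhere by the lookup code):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ChunkedMmap {
    // Map a large file as a series of windows, each at most Integer.MAX_VALUE
    // bytes, since a single MappedByteBuffer cannot exceed ~2 GB.
    public static MappedByteBuffer[] mapChunks(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long chunk = Integer.MAX_VALUE;
            int count = (int) ((size + chunk - 1) / chunk);
            MappedByteBuffer[] maps = new MappedByteBuffer[count];
            for (int i = 0; i < count; i++) {
                long position = i * chunk;
                long length = Math.min(chunk, size - position);
                maps[i] = channel.map(FileChannel.MapMode.READ_ONLY, position, length);
            }
            return maps;
        }
    }
}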

What do you do when you need more Java Heap Space?

Sorry if this has been asked before (though I can't really find a solution).
I'm not really too good at programming, but anyway, I am crawling a bunch of websites and storing information about them on a server. I need a Java program to process vector coordinates associated with each of the documents (about a billion or so documents, with a grand total of 500,000 numbers, plus or minus, associated with each document). I need to calculate the singular value decomposition of that whole matrix.
Now Java, obviously, can't handle a matrix that big, to my knowledge. If I try making a relatively small array (about 44 million elements), I get a heap error. I use Eclipse, so I tried changing the -Xmx value to 1024m (it won't go any higher for some reason, even though I have a computer with 8 GB of RAM).
What solution is there to this? Another way of retrieving the data I need? Calculating the SVD in a different way? Using a different programming language to do this?
EDIT: Just for right now, pretend there are a billion entries with 3 words associated with each. I am setting Xmx and Xms correctly (from the run configurations in Eclipse, which is equivalent to running java -XmsXXXX -XmxXXXX ... at the command prompt).
The Java heap space can be set with the -Xmx (note the initial capital X) option and it can certainly reach far more than 1 GB, provided you are using a 64-bit JVM and the corresponding physical memory is available. You should try something along the lines of:
java -Xmx6144m ...
That said, you need to reconsider your design. There is a significant space cost associated with each object, with a typical minimum somewhere around 12 to 16 bytes per object, depending on your JVM. For example, a String has an overhead of about 36-40 bytes...
Even with a single object per document with no book-keeping overhead (impossible!), you just do not have the memory for 1 billion (1,000,000,000) documents. Even for a single int per document you need about 4 GB.
You should re-design your application to make use of any sparseness in the matrix, and possibly to make use of disk-based storage when possible. Having everything in memory is nice, but not always possible...
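As an illustration of exploiting sparseness (a sketch under my own assumptions, not the poster's design), store only the non-zero entries of each document row:

public final class SparseRow {
    final int[] columns;    // indices of non-zero entries, sorted ascending
    final double[] values;  // values at those indices

    SparseRow(int[] columns, double[] values) {
        this.columns = columns;
        this.values = values;
    }

    // Dot product of two sparse rows via a merge over the sorted index arrays.
    double dot(SparseRow other) {
        double sum = 0;
        int i = 0, j = 0;
        while (i < columns.length && j < other.columns.length) {
            if (columns[i] == other.columns[j]) {
                sum += values[i++] * other.values[j++];
            } else if (columns[i] < other.columns[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}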
Are you using a 32-bit JVM? These cannot have more than 2 GB of heap; I never managed to allocate more than 1.5 GB. Instead, use a 64-bit JVM, as it can allocate a much larger heap.
Or you could apply some math to it and use a divide-and-conquer strategy. This means splitting the problem into smaller problems to get to the same result.
Don't know much about SVD but maybe this page can be helpful:
http://www.netlib.org/lapack/lug/node32.html
-Xms and -Xmx are different. The one containing s is the starting heap size and the one with x is the maximum heap size.
so
java -Xms512m -Xmx1024m
would give you 512 MB to start with.
As other people have said though you may need to break your problem down to get this to work. Are you using 32 or 64 bit java?
For data of that size, you should not plan to store it all in memory. The most common scheme to externalize this kind of data is to store it all in a database and structure your program around database queries.
Just for right now, pretend there are a billion entries with 3 words associated with each.
If you have one billion entries, you need 1 billion times the size of each entry. If you mean 3 ints as the words, that's at least 12 GB just for the data. If you meant the words as Strings, you could enumerate the words, as there are only about 100K words in English, and it would take about the same amount of space.
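A minimal sketch of that enumeration idea (hypothetical code, not the poster's): assign each distinct word a small int ID once, so documents can be stored as int[] rather than String[].

import java.util.HashMap;
import java.util.Map;

public final class WordDictionary {
    private final Map<String, Integer> ids = new HashMap<String, Integer>();

    // Returns a stable small integer for each distinct word.
    public int idFor(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size();    // next free ID
            ids.put(word, id);
        }
        return id;
    }
}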
Given 16 GB cost a few hundred dollars, I would suggest buying more memory.

How much data can I store in Java Session?

We are working in a Tomcat/J2EE application.
In this application we store a lot of data in the session, and I'm wondering how much data we can store without problems.
What is the limiting factor? The memory of Tomcat? The JVM?
How can I calculate whether I can store, say, 200k strings?
For #1 - You can store as much data as the heap size allocated to the JVM. Of course, Tomcat runs inside the JVM, so it will also use some part of the allocated memory.
For #2 - It really depends on the size of the strings - 2 bytes are required per char. Take the average size of your strings, multiply it by 200k, and then make sure you have enough memory allocated.
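A back-of-the-envelope calculation (the average length of 100 characters and the ~40 bytes of per-String overhead are assumptions):

public class SessionSizeEstimate {
    public static void main(String[] args) {
        long count = 200000;
        long avgChars = 100;                   // assumed average string length
        long perString = avgChars * 2 + 40;    // 2 bytes per char plus rough object/array overhead
        System.out.printf("~%.0f MB%n", count * perString / 1e6);  // prints ~48 MB
    }
}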
Tomcat runs in a JVM. So if you have a 32-bit JRE, you can have a maximum heap size of about 1.7 GB. If you want more, you should switch to a 64-bit JRE.
As for string allocation, Java's internal character encoding is UTF-16, so each char takes 2 bytes. To save space, you could compress those strings (e.g. with GZIP) and store the resulting byte arrays, as if they were files.
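A minimal sketch of that idea (my own example, not part of the original answer): GZIP-compress a string to a byte[] before putting it in the session.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class SessionStringCompressor {
    // Trades CPU time for a smaller heap footprint; a matching
    // GZIPInputStream-based method is needed to read the value back.
    public static byte[] compress(String value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bytes);
        try {
            gzip.write(value.getBytes("UTF-8"));
        } finally {
            gzip.close();
        }
        return bytes.toByteArray();
    }
}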

determining java memory usage

Hmmm. Is there a primer anywhere on memory usage in Java? I would have thought Sun or IBM would have had a good article on the subject but I can't find anything that looks really solid. I'm interested in knowing two things:
at runtime, figuring out how much memory the classes in my package are using at a given time
at design time, estimating general memory overhead requirements for various things like:
how much memory overhead is required for an empty object (in addition to the space required by its fields)
how much memory overhead is required when creating closures
how much memory overhead is required for collections like ArrayList
I may have hundreds of thousands of objects created and I want to be a "good neighbor" to not be overly wasteful of RAM. I mean I don't really care whether I'm using 10% more memory than the "optimal case" (whatever that is), but if I'm implementing something that uses 5x as much memory as I could if I made a simple change, I'd want to use less memory (or be able to create more objects for a fixed amount of memory available).
I found a few articles (Java Specialists' Newsletter and something from Javaworld) and the built-in method java.lang.instrument.Instrumentation.getObjectSize(), which claims to measure an "approximation" (??) of memory use, but these all seem kind of vague...
(and yes I realize that a JVM running on two different OS's may be likely to use different amounts of memory for different objects)
I used JProfiler a number of years ago and it did a good job, and you could break down memory usage to a fairly granular level.
As of Java 5, on Hotspot and other VMs that support it, you can use the Instrumentation interface to ask the VM the memory usage of a given object. It's fiddly but you can do it.
In case you want to try this method, I've added a page to my web site on querying the memory size of a Java object using the Instrumentation framework.
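For what it's worth, a minimal sketch of the agent side (the class name is mine; it must be declared as Premain-Class in the agent jar's manifest and loaded with -javaagent):

import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {
    private static volatile Instrumentation instrumentation;

    // Called by the JVM before main() when the jar is loaded with -javaagent.
    public static void premain(String agentArgs, Instrumentation inst) {
        instrumentation = inst;
    }

    // Shallow size approximation of a single object, in bytes.
    public static long sizeOf(Object o) {
        if (instrumentation == null) {
            throw new IllegalStateException("Agent was not loaded with -javaagent");
        }
        return instrumentation.getObjectSize(o);
    }
}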
As a rough guide in Hotspot on 32-bit machines:
- objects use 8 bytes for "housekeeping"
- fields use what you'd expect given their bit length (though booleans tend to be allocated an entire byte)
- object references use 4 bytes
- overall object size has a granularity of 8 bytes (i.e. if you have an object with 1 boolean field it will use 16 bytes; if you have an object with 8 booleans it will also use 16 bytes)
There's nothing special about collections in terms of how the VM treats them. Their memory usage is the total of their internal fields plus - if you're counting this - the usage of each object they contain. You need to factor in things like the default array size of an ArrayList, and the fact that that size increases by 1.5x whenever the list gets full. But either by asking the VM, or by using the above metrics and looking at the source code of the collections and "working it through", you will essentially get to the answer.
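A rough worked example of that arithmetic (all per-object numbers are the 32-bit Hotspot estimates above; the 1.5x capacity is the worst case just after a growth step):

public class ArrayListFootprint {
    public static void main(String[] args) {
        int n = 1000;                                 // elements in an ArrayList<Integer>
        int capacity = (int) (n * 1.5);               // backing array just after growing
        long listObject = 8 + 4 + 4 + 4;              // housekeeping + size, modCount, array reference
        long backingArray = 8 + 4 + 4L * capacity;    // housekeeping + length + one 4-byte reference per slot
        long boxedIntegers = 16L * n;                 // each Integer: housekeeping + int field, rounded to 8 bytes
        System.out.println("~" + (listObject + backingArray + boxedIntegers) + " bytes");
    }
}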
If by "closure" you mean something like a Runnable or Callable, well again it's just a boring old object like any other. (N.B. They aren't really closures!!)
You can use JMP, but it's only caught up to Java 1.5.
I've used the profiler that comes with newer versions of Netbeans a couple of times and it works very well, supplying you with a ton of information about memory usage and runtime of your programs. Definitely a good place to start.
If you are using a pre-1.5 VM, you can get the approximate size of objects by using serialization. Be warned though: this can require double the amount of memory for that object.
See if PerfAnal will give you what you are looking for.
This might not be the exact answer you are looking for, but the posts at the following link will give you very good pointers. Other Question about Memory
I believe the profiler included in NetBeans can monitor memory usage as well; you can try that.
