JVM Getting the largest objects in the heap programmatically


How do I programmatically (from within the Java application or agent) get a "live" summary of the largest objects in the heap (including their instance counts and sizes)?
Similar to what profilers do.
For example, here is a screenshot from JProfiler:
I used to work with heap dumps in the cases where I really needed that, but now I would like to figure out how exactly profilers retrieve this kind of information from the running JVM without actually taking a heap dump.
Is it possible to get this kind of information by using the Java API? If it's impossible, what is the alternative in native code? A code example would be best for my needs, because this specific part of the Java universe is really new to me.
I believe that if I were interested in information about some really specific classes I could use instrumentation or something, but here, as far as I understand, profilers use sampling, so there should be another way.
I'm currently using the HotSpot VM running Java 8 on Linux; however, the more "generic" the solution, the better.

There is no standard Java API for heap walking. However, there is a HotSpot-specific diagnostic command that can be invoked through JMX:
String histogram = (String) ManagementFactory.getPlatformMBeanServer().invoke(
new ObjectName("com.sun.management:type=DiagnosticCommand"),
"gcClassHistogram",
new Object[]{null},
new String[]{"[Ljava.lang.String;"});
This will collect the class histogram and return the result as a single formatted String:
num #instances #bytes class name
----------------------------------------------
1: 3269 265080 [C
2: 1052 119160 java.lang.Class
3: 156 92456 [B
4: 3247 77928 java.lang.String
5: 1150 54104 [Ljava.lang.Object;
6: 579 50952 java.lang.reflect.Method
7: 292 21024 java.lang.reflect.Field
8: 395 12640 java.util.HashMap$Node
...
The contents are equivalent to the output of the jmap -histo command.
The only standard API for heap walking is the native JVM TI IterateThroughHeap function, but it's not so easy to use, and it is much slower than the above diagnostic command.
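Since the diagnostic command returns plain text, getting the "largest" classes out of it comes down to parsing the rows. The following is a minimal sketch, assuming the column layout shown above (rank, instance count, bytes, class name); the class and method names (ClassHistogram, HistogramRow, fetch, parse, top) are made up for this example, and fetch() only works on a HotSpot JVM that exposes the DiagnosticCommand MBean.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;
import javax.management.ObjectName;

/** One histogram row: instance count, total bytes and class name. */
final class HistogramRow {
    final long instances;
    final long bytes;
    final String className;

    HistogramRow(long instances, long bytes, String className) {
        this.instances = instances;
        this.bytes = bytes;
        this.className = className;
    }
}

final class ClassHistogram {
    /** Invokes the HotSpot-specific gcClassHistogram diagnostic command through JMX. */
    static String fetch() throws Exception {
        return (String) ManagementFactory.getPlatformMBeanServer().invoke(
                new ObjectName("com.sun.management:type=DiagnosticCommand"),
                "gcClassHistogram",
                new Object[]{null},
                new String[]{String[].class.getName()});
    }

    /** Parses data rows such as "1: 3269 265080 [C"; header, separator and total lines are skipped. */
    static List<HistogramRow> parse(String histogram) {
        List<HistogramRow> rows = new ArrayList<>();
        for (String line : histogram.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            // Data rows start with a rank such as "1:" and have at least four columns.
            if (fields.length < 4 || !fields[0].matches("\\d+:")) {
                continue;
            }
            rows.add(new HistogramRow(
                    Long.parseLong(fields[1]), Long.parseLong(fields[2]), fields[3]));
        }
        return rows;
    }

    /** Returns the n classes that occupy the most bytes, largest first. */
    static List<HistogramRow> top(String histogram, int n) {
        List<HistogramRow> rows = parse(histogram);
        rows.sort((a, b) -> Long.compare(b.bytes, a.bytes));
        return rows.subList(0, Math.min(n, rows.size()));
    }

    public static void main(String[] args) throws Exception {
        for (HistogramRow row : top(fetch(), 5)) {
            System.out.println(row.bytes + " bytes, " + row.instances
                    + " instances: " + row.className);
        }
    }
}
```

Note that these are per-class totals, as in jmap -histo, not individual objects; the histogram does not identify which single instance is the largest.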

Related

Command to show the content of biggest strings and char arrays on Java 6. How?

I'm trying to find a way to dump the contents of the biggest String and char[] instances in the Java heap, using some Java tool. I'm using Java 6.
I'm working on an application with some memory issues. Most of the problem comes from the Strings and char arrays created for the queries generated by JPA Hibernate (version 3.3.0).
I can use the jmap tool to show the high memory consumption:
jmap -histo:live <PID_OF_JAVA_PROCESS> | head -n 20
It prints something like this (the first entry is char[]):
num #instances #bytes class name
----------------------------------------------
1: 3286152 1412140072 [C
2: 10198187 239513896 [Ljava.lang.String;
3: 3937983 126015456 java.lang.String
4: 662242 47550936 [[Ljava.lang.String;
I can inspect the contents of the individual char arrays and Strings using Eclipse Memory Analyzer, one by one, but this is not productive.
I would like to get this information more easily, using some Java tool that collects all the information on these Strings and char arrays and saves it to a text file.
With this information, I can see the whole picture and prioritize the most problematic queries.
Thanks!

With JVisualVM, it's easy. Run your program, then launch JVisualVM.
Then, in the process browser of JVisualVM, find your Java process, right-click it and select "Heap Dump"; this creates a dump of the heap used by your program.
Open the generated heap dump node in the process browser, then click the "Classes" option to list the classes.
Search for the char[] class name and double-click it; the instances table then shows what you are looking for (all instances of char[] in your heap). You can sort all the tables.

How does an interpreter interpret the code?

For simplicity, imagine this scenario: we have a 2-bit computer with a pair of 2-bit registers called r1 and r2, which only works with immediate addressing.
Let's say the bit sequence 00 means add to our CPU. Also, 01 means move data to r1 and 10 means move data to r2.
So there is an assembly language for this computer and an assembler, where sample code would be written like
mov r1,1
mov r2,2
add r1,r2
When I assemble this code to native code, the file will be something like:
0101 1010 0001
The 12 bits above are the native code for:
Put decimal 1 into R1, put decimal 2 into R2, add the data and store the result in R1.
So this is basically how compiled code works, right?
Let's say someone implements a JVM for this architecture. In Java I will be writing code like:
int x = 1 + 2;
How exactly will the JVM interpret this code? I mean, eventually the same bit pattern must be passed to the CPU, mustn't it? Every CPU has a set of instructions that it can understand and execute, and they are after all just some bits. Let's say the compiled Java bytecode looks something like this:
1111 1100 1001
or whatever. Does this mean that interpreting changes this code to 0101 1010 0001 when executing? If so, it is already native code, so why is it said that the JIT only kicks in after a number of executions? If it does not convert it exactly to 0101 1010 0001, then what does it do? How does it make the CPU do the addition?
Maybe there are some mistakes in my assumptions.
I know interpreting is slow, compiled code is faster but not portable, and a virtual machine "interprets" a code, but how? I am looking for "how exactly/technically interpreting" is done. Any pointers (such as books or web pages) are welcome instead of answers as well.

The CPU architecture you describe is unfortunately too restricted to make this really clear with all the intermediate steps. Instead, I will write pseudo-C and pseudo-x86-assembler, hopefully in a way that is clear without being terribly familiar with C or x86.
The compiled JVM bytecode might look something like this:
ldc 0 # push the first constant (== 1)
ldc 1 # push the second constant (== 2)
iadd # pop two integers and push their sum
istore_0 # pop result and store in local variable
The interpreter has (a binary encoding of) these instructions in an array, and an index referring to the current instruction. It also has an array of constants, and a memory region used as stack and one for local variables. Then the interpreter loop looks like this:
while (true) {
    switch (instructions[pc]) {
    case LDC:
        sp += 1; // make space for constant
        stack[sp] = constants[instructions[pc+1]];
        pc += 2; // two-byte instruction
        break; // without this, execution would fall through into IADD
    case IADD:
        stack[sp-1] += stack[sp]; // add to first operand
        sp -= 1; // pop other operand
        pc += 1; // one-byte instruction
        break;
    case ISTORE_0:
        locals[0] = stack[sp];
        sp -= 1; // pop
        pc += 1; // one-byte instruction
        break;
    // ... other cases ...
    }
}
This C code is compiled into machine code and run. As you can see, it's highly dynamic: it inspects each bytecode instruction each time that instruction is executed, and all values go through the stack (i.e. RAM).
While the actual addition itself probably happens in a register, the code surrounding the addition is rather different from what a Java-to-machine code compiler would emit. Here's an excerpt from what a C compiler might turn the above into (pseudo-x86):
.ldc:
incl %esi # increment the variable pc, first half of pc += 2;
movb %ecx, program(%esi) # load byte after instruction
movl %eax, constants(,%ecx,4) # load constant from pool, indexed by the byte just loaded
incl %edi # increment sp
movl %eax, stack(,%edi,4) # write constant onto stack
incl %esi # other half of pc += 2
jmp .EndOfSwitch
.addi:
movl %eax, stack(,%edi,4) # load first operand
decl %edi # sp -= 1;
addl stack(,%edi,4), %eax # add
incl %esi # pc += 1;
jmp .EndOfSwitch
You can see that the operands for the addition come from memory instead of being hardcoded, even though for the purposes of the Java program they are constant. That's because for the interpreter, they are not constant. The interpreter is compiled once and then must be able to execute all sorts of programs, without generating specialized code.
The purpose of the JIT compiler is to do just that: Generate specialized code. A JIT can analyze the ways the stack is used to transfer data, the actual values of various constants in the program, and the sequence of calculations performed, to generate code that more efficiently does the same thing. In our example program, it would allocate the local variable 0 to a register, replace the access to the constant table with moving constants into registers (movl %eax, $1), and redirect the stack accesses to the right machine registers. Ignoring a few more optimizations (copy propagation, constant folding and dead code elimination) that would normally be done, it might end up with code like this:
movl %ebx, $1 # ldc 0
movl %ecx, $2 # ldc 1
movl %eax, %ebx # (1/2) addi
addl %eax, %ecx # (2/2) addi
# no istore_0, local variable 0 == %eax, so we're done
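The interpreter loop from this answer can also be written as a small runnable Java program, which may make the dispatch cost easier to see. This is only a sketch: the opcode numbers and the HALT instruction are assumptions invented for the example and do not match real JVM opcodes, and the stack and locals sizes are arbitrary.

```java
final class TinyInterpreter {
    // Hypothetical opcode numbers chosen for this sketch; real JVM opcodes differ.
    static final int LDC = 0;
    static final int IADD = 1;
    static final int ISTORE_0 = 2;
    static final int HALT = 3; // not a JVM instruction, only ends the loop here

    /** Interprets the program and returns the local variable array. */
    static int[] run(int[] code, int[] constants) {
        int[] stack = new int[16];
        int[] locals = new int[4];
        int sp = -1; // index of the top of the operand stack
        int pc = 0;  // index of the current instruction
        while (true) {
            // Every instruction is inspected again each time it is executed.
            switch (code[pc]) {
                case LDC:
                    stack[++sp] = constants[code[pc + 1]]; // push constant
                    pc += 2; // opcode plus one operand
                    break;
                case IADD:
                    stack[sp - 1] += stack[sp]; // add to first operand
                    sp--;                       // pop second operand
                    pc += 1;
                    break;
                case ISTORE_0:
                    locals[0] = stack[sp--]; // pop into local variable 0
                    pc += 1;
                    break;
                case HALT:
                    return locals;
                default:
                    throw new IllegalStateException("unknown opcode " + code[pc]);
            }
        }
    }

    public static void main(String[] args) {
        int[] constants = {1, 2};
        int[] code = {LDC, 0, LDC, 1, IADD, ISTORE_0, HALT}; // int x = 1 + 2;
        System.out.println(run(code, constants)[0]); // prints 3
    }
}
```

The CPU never sees a bit pattern derived from the bytecode here. It only runs the compiled code of run(), and the bytecode is plain data that steers which branch of the switch is taken.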

Not all computers have the same instruction set. Java bytecode is a kind of Esperanto - an artificial language to improve communication. The Java VM translates the universal Java bytecode to the instruction set of the computer it runs on.
So how does the JIT figure in here? The main purpose of the JIT compiler is optimization. There are often different ways to translate a certain piece of bytecode into the target machine code. The most performance-ideal translation is often non-obvious because it might depend on the data. There are also limits to how far a program can analyze an algorithm without executing it - the halting problem is a well-known such limitation but not the only one. So what the JIT compiler does is try different possible translations and measure how fast they are executed with the real-world data the program processes. So it takes a number of executions until the JIT compiler has found the best translation.

One of the important steps in Java is that the compiler first translates the .java code into a .class file, which contains the Java bytecode. This is useful, as you can take .class files and run them on any machine that understands this intermediate language, by then translating it on the spot line-by-line, or chunk-by-chunk. This is one of the most important functions of the java compiler + interpreter. You can directly compile Java source code to native binary, but this negates the idea of writing the original code once and being able to run it anywhere. This is because the compiled native binary code will only run on the same hardware/OS architecture that it was compiled for. If you want to run it on another architecture, you'd have to recompile the source on that one. With the compilation to the intermediate-level bytecode, you don't need to drag around the source code, but the bytecode. It's a different issue, as you now need a JVM that can interpret and run the bytecode. As such, compiling to the intermediate-level bytecode, which the interpreter then runs, is an integral part of the process.
As for the actual realtime running of code: yes, the JVM will eventually interpret/run some binary code that may or may not be identical to natively compiled code. And in a one-line example, they may seem superficially the same. But the interpreter typically doesn't precompile everything; it goes through the bytecode and translates to binary line-by-line or chunk-by-chunk. There are pros and cons to this (compared to natively compiled code, e.g. C and C compilers) and lots of resources online to read up further on. See my answer here, or this, or this one.

Simplified, an interpreter is an infinite loop with a giant switch inside.
It reads Java bytecode (or some internal representation) and emulates a CPU executing it.
This way the real CPU executes the interpreter code, which emulates the virtual CPU.
This is painfully slow. A single virtual instruction adding two numbers requires three function calls and many other operations.
A single virtual instruction takes a couple of real instructions to execute.
This is also less memory efficient, as you have both real and emulated stacks, registers and instruction pointers.
while (true) {
    Operation op = methodByteCode.get(instructionPointer);
    switch (op) {
    case ADD:
        stack.pushInt(stack.popInt() + stack.popInt());
        instructionPointer++;
        break;
    case STORE:
        memory.set(stack.popInt(), stack.popInt());
        instructionPointer++;
        break;
    ...
    }
}
When some method has been interpreted multiple times, the JIT compiler kicks in.
It reads all the virtual instructions and generates one or more native instructions that do the same.
Here I'm generating a string of textual assembly, which would still need to be assembled into native binary code.
for (Operation op : methodByteCode) {
    switch (op) {
    case ADD:
        compiledCode += "popi r1\n";
        compiledCode += "popi r2\n";
        compiledCode += "addi r1, r2, r3\n";
        compiledCode += "pushi r3\n";
        break;
    case STORE:
        compiledCode += "popi r1\n";
        compiledCode += "storei r1\n";
        break;
    ...
    }
}
After the native code is generated, the JVM copies it somewhere, marks this region as executable and instructs the interpreter to invoke it instead of interpreting the bytecode the next time this method is invoked.
A single virtual instruction might still take more than one native instruction, but this will be nearly as fast as ahead-of-time compilation to native code (like in C or C++).
Compilation is usually much slower than interpreting, but has to be done only once and only for chosen methods.

Memory requirements for Stanford NER retraining

I am retraining the Stanford NER model on my own training data for extracting organizations. But, whether I use a 4GB RAM machine or an 8GB RAM machine, I get the same Java heap space error.
Could anyone tell me what general machine configuration is needed to retrain the models without running into these memory issues?
I used the following command :
java -mx4g -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop newdata_retrain.prop
I am working with training data (multiple files - each file has about 15000 lines in the following format) - one word and its category on each line
She O
is O
working O
at O
Microsoft ORGANIZATION
Is there anything else we could do to make these models run reliably? I did try reducing the number of classes in my training data, but that hurts the accuracy of extraction. For example, some locations or other entities are getting classified as organization names. Can we reduce the number of classes without an impact on accuracy?
One data set I am using is the Alan Ritter twitter nlp data: https://github.com/aritter/twitter_nlp/tree/master/data/annotated/ner.txt
The properties file looks like this:
#location of the training file
trainFile = ner.txt
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model-twitter.ser.gz
#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1
#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
The stack trace of the error I am getting is:
CRFClassifier invoked on Mon Dec 01 02:55:22 UTC 2014 with arguments:
-prop twitter_retrain.prop
usePrevSequences=true
useClassFeature=true
useTypeSeqs2=true
useSequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk=true
useTypeySequences=true
useDisjunctive=true
noMidNGrams=true
serializeTo=ner-model-twitter.ser.gz
maxNGramLeng=6
useNGrams=true
usePrev=true
useNext=true
maxLeft=1
trainFile=ner.txt
map=word=0,answer=1
useWord=true
useTypeSeqs=true
[1000][2000]numFeatures = 215032
setting nodeFeatureIndicesMap, size=149877
setting edgeFeatureIndicesMap, size=65155
Time to convert docs to feature indices: 4.4 seconds
numClasses: 21 [0=O,1=B-facility,2=I-facility,3=B-other,4=I-other,5=B-company,6=B-person,7=B-tvshow,8=B-product,9=B-sportsteam,10=I-person,11=B-geo-loc,12=B-movie,13=I-movie,14=I-tvshow,15=I-company,16=B-musicartist,17=I-musicartist,18=I-geo-loc,19=I-product,20=I-sportsteam]
numDocuments: 2394
numDatums: 46469
numFeatures: 215032
Time to convert docs to data/labels: 2.5 seconds
Writing feature index to temporary file.
numWeights: 31880772
QNMinimizer called on double function of 31880772 variables, using M = 25.
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:923)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:885)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:879)
at edu.stanford.nlp.optimization.QNMinimizer.minimize(QNMinimizer.java:91)
at edu.stanford.nlp.ie.crf.CRFClassifier.trainWeights(CRFClassifier.java:1911)
at edu.stanford.nlp.ie.crf.CRFClassifier.train(CRFClassifier.java:1718)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:759)
at edu.stanford.nlp.ie.AbstractSequenceClassifier.train(AbstractSequenceClassifier.java:747)
at edu.stanford.nlp.ie.crf.CRFClassifier.main(CRFClassifier.java:2937)

One way you can try reducing the number of classes is to not use B-I notation. For example, merge B-facility and I-facility into facility. Of course, another way is to use a machine with more memory.
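Dropping the B-I notation can be done as a preprocessing step on the training file. The sketch below assumes the two-column "word label" layout shown in the question; the class and method names (LabelCollapser, collapse, rewriteLine) are made up for this example. With the 21 classes listed in the log, this would leave 11 (O plus ten entity types).

```java
final class LabelCollapser {
    /** Turns "B-facility" or "I-facility" into "facility"; other labels such as "O" are unchanged. */
    static String collapse(String label) {
        if (label.startsWith("B-") || label.startsWith("I-")) {
            return label.substring(2);
        }
        return label;
    }

    /** Rewrites one "word label" line; blank lines (sentence separators) pass through untouched. */
    static String rewriteLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return line;
        }
        // The label is everything after the last whitespace character.
        int i = trimmed.length() - 1;
        while (i >= 0 && !Character.isWhitespace(trimmed.charAt(i))) {
            i--;
        }
        if (i < 0) {
            return line; // no label column, leave the line alone
        }
        return trimmed.substring(0, i + 1) + collapse(trimmed.substring(i + 1));
    }

    public static void main(String[] args) {
        System.out.println(rewriteLine("Microsoft\tB-company")); // prints "Microsoft" TAB "company"
        System.out.println(rewriteLine("She O"));                // prints "She O"
    }
}
```

The trade-off is that two adjacent entities of the same type can no longer be told apart, which may or may not matter for the data at hand.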

Shouldn't that be -Xmx4g not -mx4g?
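Whichever spelling of the heap flag is used, the numbers in the question's log hint at why 4 GB may not be enough. The sketch below is an inference from the log, not taken from the Stanford NER source: counting one weight per class for each node feature and one per pair of classes for each edge feature happens to reproduce the logged numWeights exactly. The memory estimate additionally assumes the usual L-BFGS bookkeeping of about 2 * M history vectors of doubles for M = 25, which may not match QNMinimizer precisely.

```java
final class CrfWeightEstimate {
    /** One weight per class for node features, one per pair of classes for edge features (inferred from the log). */
    static long numWeights(long nodeFeatures, long edgeFeatures, long numClasses) {
        return nodeFeatures * numClasses + edgeFeatures * numClasses * numClasses;
    }

    /** Bytes needed for the given number of double[] vectors with one entry per weight. */
    static long bytesFor(long numWeights, int vectors) {
        return numWeights * 8L * vectors;
    }

    public static void main(String[] args) {
        // Sizes taken from the log: nodeFeatureIndicesMap, edgeFeatureIndicesMap, numClasses.
        long weights = numWeights(149877, 65155, 21);
        System.out.println(weights);               // prints 31880772
        System.out.println(bytesFor(weights, 1));  // prints 255046176
        System.out.println(bytesFor(weights, 50)); // prints 12752308800

        // Same feature counts with 11 classes (an assumption: the counts would change too).
        System.out.println(numWeights(149877, 65155, 11)); // prints 9532402
    }
}
```

Under these assumptions a single weight vector takes roughly 255 MB and the optimizer history alone could approach 12.8 GB, so the quadratic dependence on the number of classes is the main lever.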

Sorry for getting to this a bit late! I suspect the problem is the input format of the file; in particular, my first guess is that the file is being treated as a single long sentence.
The expected format of the training file is in the CoNLL format, which means each line of the file is a new token, and the end of a sentence is denoted by a double newline. So, for example, a file could look like:
Cats O
have O
tails O
. O

Felix ANIMAL
is O
a O
cat O
. O
Could you let me know if it's indeed in this format? If so, could you include a stack trace of the error, and the properties file you are using? Does it work if you run on just the first few sentences of the file?
--Gabor

If you are going to do analysis on non-transactional data sets, you may want to use another tool like Elasticsearch (simpler) or Hadoop (exponentially more complicated). MongoDB is a good middle ground as well.

First, uninstall the existing Java JDK and reinstall it.
Then you can make the heap as large as you like, based on your hard disk size.
In "-mx4g", 4g is not the machine's RAM; it is the heap size.
I faced the same error initially. After doing this, it was gone.
I also misunderstood 4g as the RAM at first.
Now I am able to start my server even with a 100g heap.
Next,
instead of using a customised NER model, I suggest you use a custom RegexNER model, with which you can add millions of words with the same entity name within a single document, too.
These are the two errors I faced initially.
For any queries, comment below.

In dalvik, what expression will generate instructions 'not-int' and 'const-string/jumbo'?

I am new to Dalvik, and I want to dump every Dalvik instruction.
But there are still 3 instructions I cannot get, no matter how I write the code.
They are 'not-int', 'not-long' and 'const-string/jumbo'.
I wrote this to get 'not-int', but it failed:
int y = ~x;
Dalvik generated an 'xor x, -1' instead.
I know 'const-string/jumbo' means that there are more than 65535 strings in the code and the index is 32 bits wide. But when I declared 70000 strings in the code, the compiler said the code was too long.
So the question is: how do I get 'not-int' and 'const-string/jumbo' in Dalvik from Java code?

const-string/jumbo is easy. As you noted, you just need to define more than 65535 strings, and reference one of the later ones. They don't all need to be in a single class file, just in the same DEX file.
Take a look at dalvik/tests/056-const-string-jumbo, in particular the "build" script that generates a Java source file with a large number of strings.
As far as not-int and not-long go, I don't think they're ever generated. I ran dexdump -d across a pile of Android 4.4 APKs and didn't find a single instance of either.

Maximum size of a method in Java 7 and 8

I know that a method cannot be larger than 64 KB with Java. The limitation causes us problems with generated code from a JavaCC grammar. We had problems with Java 6 and were able to fix this by changing the grammar. Has the limit been changed for Java 7 or is it planned for Java 8?
Just to make it clear: I don't need a method larger than 64 KB myself, but I wrote a grammar which compiles to a very large method.

According to JVMS7 :
The fact that end_pc is exclusive is a historical mistake in the
design of the Java virtual machine: if the Java virtual machine code
for a method is exactly 65535 bytes long and ends with an instruction
that is 1 byte long, then that instruction cannot be protected by an
exception handler. A compiler writer can work around this bug by
limiting the maximum size of the generated Java virtual machine code
for any method, instance initialization method, or static initializer
(the size of any code array) to 65534 bytes.
But this is about Java 7. There are no final specs for Java 8, so nobody (except its developers) could answer this question.
UPD (2015-04-06) According to JVMS8, it is also true for Java 8.

Good question. As always, we should go to the source to find the answer ("The Java® Virtual Machine Specification"). The section does not state a limit as explicitly as the Java 6 VM spec did, but puts it somewhat indirectly:
The greatest number of local variables in the local variables array of a frame created upon invocation of a method (§2.6) is limited to 65535 by the size of the max_locals item of the Code attribute (§4.7.3) giving the code of the method, and by the 16-bit local variable indexing of the Java Virtual Machine instruction set.
Cheers,

It has not changed. The limit of code in methods is still 64 KB in both Java 7 and Java 8.
References:
From the Java 7 Virtual Machine Specification (4.9.1 Static Constraints):
The static constraints on the Java Virtual Machine code in a class file specify how
Java Virtual Machine instructions must be laid out in the code array and what the
operands of individual instructions must be.
The static constraints on the instructions in the code array are as follows:
The code array must not be empty, so the code_length item cannot have the
value 0.
The value of the code_length item must be less than 65536.
From the Java 8 Virtual Machine Specification (4.7.3 The Code Attribute):
The value of the code_length item gives the number of bytes in the code array
for this method.
The value of code_length must be greater than zero (as the code array must
not be empty) and less than 65536.

Andremoniy has already answered the Java 7 part of this question, but at that time it was too soon to tell for Java 8, so I am completing the answer to cover that part:
Quoting from the JVMS:
The fact that end_pc is exclusive is a historical mistake in the design of the Java Virtual Machine: if the Java Virtual Machine code for a method is exactly 65535 bytes long and ends with an instruction that is 1 byte long, then that instruction cannot be protected by an exception handler. A compiler writer can work around this bug by limiting the maximum size of the generated Java Virtual Machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes.
As you can see, this historical problem has not been remedied, at least in this version (Java 8).

As a workaround, and if you have access to the parser's code, you could modify it to work within whatever limits are imposed by the JVM compiler ...
(Assuming it doesn't take forever to find the portions of the parser code to modify)
