How are constant pool indexes generated when Java compiles?

I'm interested in the Java class file (.class) format.
If we view a .class file using javap, we can see the constant pool information:
#4 = Utf8 java/lang/Object
#5 = Utf8 <init>
#6 = Utf8 ()V
#7 = Utf8 Code
There are indexes #1, #2, #3, #4, #5, #6, ...
The Java compiler generates these indexes.
Are there rules for generating the index numbers? Are they random numbers?

Are there rules for generating the index numbers?
If you mean, are the rules specified (in the JVM spec), then the answer is no.
Are they random numbers?
No. If you delved deeply into the compiler source code, you would in theory have sufficient information to predict the index values of the constant pool entries. The allocation of indexes looks random, but it is (I think) totally deterministic and repeatable.
However, predicting the indexes for an arbitrary Java program (without compiling it!) is unlikely to be practical.
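What the class file format does fix is how the table is numbered: entries are counted from 1 in the order they appear, and a long or double constant takes up two index slots. Which constant ends up at which index is up to the compiler. A minimal sketch that lists the entries of a class file in index order, similar to the javap output in the question; `ConstantPoolLister` and its methods are hypothetical names, not part of any JDK API:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a JDK API): lists constant pool entries by index. */
class ConstantPoolLister {

    /** Reads the start of a class file and returns one line per constant pool entry. */
    static List<String> list(InputStream classFile) throws IOException {
        DataInputStream in = new DataInputStream(classFile);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a class file");
        }
        in.readUnsignedShort();              // minor_version
        in.readUnsignedShort();              // major_version
        int count = in.readUnsignedShort();  // constant_pool_count: valid indexes are 1..count-1
        List<String> entries = new ArrayList<>();
        for (int index = 1; index < count; index++) {
            int tag = in.readUnsignedByte();
            String detail = "";
            switch (tag) {
                case 1: detail = " " + in.readUTF(); break;        // Utf8 (u2 length + modified UTF-8)
                case 3: case 4: in.readFully(new byte[4]); break;  // Integer, Float
                case 5: case 6: in.readFully(new byte[8]); break;  // Long, Double
                case 7: case 8: case 16: case 19: case 20:
                    in.readFully(new byte[2]); break;              // Class, String, MethodType, Module, Package
                case 9: case 10: case 11: case 12: case 17: case 18:
                    in.readFully(new byte[4]); break;              // member refs, NameAndType, Dynamic, InvokeDynamic
                case 15: in.readFully(new byte[3]); break;         // MethodHandle
                default: throw new IOException("unknown constant pool tag " + tag);
            }
            entries.add("#" + index + " = tag " + tag + detail);
            if (tag == 5 || tag == 6) {
                index++; // long and double constants occupy two index slots
            }
        }
        return entries;
    }

    /** Convenience wrapper: lists the pool of an already loaded class. */
    static List<String> listFor(Class<?> type) {
        String resource = "/" + type.getName().replace('.', '/') + ".class";
        try (InputStream in = type.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("class bytes not found: " + resource);
            }
            return list(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        listFor(Object.class).forEach(System.out::println);
    }
}
```

Compiling the same source with different compilers (or compiler versions) and comparing the listings is one way to see that the order is an implementation detail rather than something the spec dictates.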

JVM Getting the largest objects in the heap programmatically

How programmatically (from within the java application/agent) do I get a "live" summary of the largest objects in the heap (including their instances count and size)?
Similarly to what Profilers do.
For example, here is a screenshot from JProfiler:
Usually I used to work with heap dumps in the cases where I really needed that, but now I would like to figure out how exactly profilers retrieve this kind of information from the running JVM without actually taking a heap dump.
Is it possible to get this kind of information by using the Java API? If it's impossible, what is the alternative in native code? A code example would be best for my needs, because this specific part of the Java universe is really new to me.
I kind of believe that if I were interested in finding information about some really specific classes I could use instrumentation or something, but here, as far as I understand, it uses sampling, so there should be another way.
I'm currently using the HotSpot VM running Java 8 on Linux; however, the more generic the solution I find, the better.

There is no standard Java API for heap walking. However, there is a HotSpot-specific diagnostic command that can be invoked through JMX:
import java.lang.management.ManagementFactory;
import javax.management.ObjectName;

String histogram = (String) ManagementFactory.getPlatformMBeanServer().invoke(
        new ObjectName("com.sun.management:type=DiagnosticCommand"),
        "gcClassHistogram",
        new Object[]{null},
        new String[]{"[Ljava.lang.String;"});
This will collect the class histogram and return the result as a single formatted String:
num #instances #bytes class name
----------------------------------------------
1: 3269 265080 [C
2: 1052 119160 java.lang.Class
3: 156 92456 [B
4: 3247 77928 java.lang.String
5: 1150 54104 [Ljava.lang.Object;
6: 579 50952 java.lang.reflect.Method
7: 292 21024 java.lang.reflect.Field
8: 395 12640 java.util.HashMap$Node
...
The content is equivalent to the output of the jmap -histo command.
The only standard API for heap walking is the native JVM TI IterateThroughHeap function, but it's not so easy to use, and it works much slower than the above diagnostic command.
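Since the diagnostic command returns plain text, getting instance counts and sizes programmatically means parsing that text. A minimal sketch of such a parser, assuming the row layout shown above (rank, instance count, bytes, class name); the format is not a documented API and may differ between JVM versions, and `HistogramParser` and `Row` are hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the text returned by the gcClassHistogram command. */
class HistogramParser {

    /** One histogram row: instance count, total bytes, class name. */
    static final class Row {
        final long instances;
        final long bytes;
        final String className;

        Row(long instances, long bytes, String className) {
            this.instances = instances;
            this.bytes = bytes;
            this.className = className;
        }
    }

    // Matches rows such as "   1:   3269   265080  [C"; header and total lines do not match.
    private static final Pattern ROW =
            Pattern.compile("^\\s*\\d+:\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)");

    /** Returns the rows in the order the JVM printed them. */
    static List<Row> parse(String histogram) {
        List<Row> rows = new ArrayList<>();
        for (String line : histogram.split("\\R")) {
            Matcher m = ROW.matcher(line);
            if (m.find()) {
                rows.add(new Row(Long.parseLong(m.group(1)),
                                 Long.parseLong(m.group(2)),
                                 m.group(3)));
            }
        }
        return rows;
    }
}
```

Passing the `histogram` string from the JMX call to `HistogramParser.parse` would give a list that can be truncated to the first few rows for a "largest classes" summary.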

Creating a char array results in an Object in java bytecode

I've hit kind of a wall, trying to write a simple compiler in Java, using ASM. Basically, I am trying to add strings of characters together, and cannot work out why my code fails to do so. The problem lies with how the following lines of code compile:
char[] p;
p = "Hi";
p = p + i[0];
Where i is an initialized array. The line p = "Hi"; compiles as:
bipush 2;
newarray t_char;
dup;
bipush 0;
ldc h;
castore;
dup;
bipush 1;
ldc i;
castore;
Note that I am deliberately treating the string "Hi" as a char array, instead of directly as a String object. When decompiled, it reads as:
Object localObject1 = { 'H', 'i'};
And thus, as {'H', 'i'} is not a proper constructor for Object, the program does not execute. Now, my confusion, and the reason I came to Stack Overflow with this, is that when the line p = p + i[0]; is removed from the program, or replaced with one not using an array, such as p = p + 5;, the line p = "Hi"; compiles, again, in the exact same way:
bipush 2;
newarray t_char;
dup;
bipush 0;
ldc h;
castore;
dup;
bipush 1;
ldc i;
castore;
And when decompiled, the same line reads as:
char[] arrayOfChar1 = {'H', 'i'};
The program runs just fine. I have absolutely no idea what is going on here, nor any idea how to solve it.
To decompile the .class files, I am using this decompiler.
I would like to know why the exact same bytecode decompiles differently in these 2 cases.

In general, you can not expect to be able to recompile decompiled code. Compilation and decompilation are both lossy processes. In particular, bytecode does not have to contain explicit types like Java source code does, and the type checking rules for bytecode are much laxer than the source level type system.
This means that when decompiling the code, the decompiler has to guess at the type of local variables (unless the optional debugging metadata was included with the compiled class). In some cases, it guessed Object, which led to a compilation error. In other cases, it guessed char[]. If you want a more in depth explanation, you could dive into the decompiler source code, but the real issue is expecting the decompiler to magically give good results in the absence of type information in the first place.
Anyway, if you want to edit already compiled code, you shouldn't use a decompiler. Your best bet is to use an assembler/disassembler pair like Krakatau, which allows you to edit classfiles losslessly at the bytecode level (assuming you understand bytecode).
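To illustrate why either guess is plausible for a decompiler: at the source level a char array can legally be held in a variable of type Object or of type char[], so the array-creation code alone does not settle the variable's type. A small sketch (the class and method names are made up for illustration):

```java
/** Hypothetical illustration: the same array fits two different declared types. */
class LocalTypeGuess {

    static char[] build() {
        // Legal Java: every array is also an Object, so this declaration is valid.
        Object guessedAsObject = new char[] {'H', 'i'};
        // The more specific type needs a cast when starting from Object.
        char[] guessedAsArray = (char[]) guessedAsObject;
        return guessedAsArray;
    }

    public static void main(String[] args) {
        System.out.println(new String(build())); // prints Hi
    }
}
```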

Java Class File Specification statement

The Java Class File Specification states that:
The code array gives the actual bytes of Java Virtual Machine code that implement the method.
When the code array is read into memory on a byte-addressable machine, if the first byte of the array is aligned on a 4-byte boundary, the tableswitch and lookupswitch 32-bit offsets will be 4-byte aligned. (Refer to the descriptions of those instructions for more information on the consequences of code array alignment.)
(https://docs.oracle.com/javase/specs/jvms/se7/html/jvms-4.html#jvms-4.7.3)
How would I interpret this statement?
The wikipedia page for those 2 instructions mentions this: (https://en.wikipedia.org/wiki/Java_bytecode_instruction_listings)
Tableswitch additional bytes:
4+: [0–3 bytes padding], defaultbyte1, defaultbyte2, defaultbyte3, defaultbyte4, lowbyte1, lowbyte2, lowbyte3, lowbyte4, highbyte1, highbyte2, highbyte3, highbyte4, jump offsets...
Lookupswitch additional bytes:
4+: <0–3 bytes padding>, defaultbyte1, defaultbyte2, defaultbyte3, defaultbyte4, npairs1, npairs2, npairs3, npairs4, match-offset pairs...
I think the <0–3 bytes padding> is relevant to the Class File Specification statement, I just don't know how exactly.

The tableswitch and lookupswitch instructions are defined to have between 0 and 3 bytes of padding, depending on their offset within the method's bytecode. The actual definition of the padding can be found in section 6.5 where the formats of each instruction are listed.
Immediately after the lookupswitch opcode, between zero and three
bytes must act as padding, such that defaultbyte1 begins at an address
that is a multiple of four bytes from the start of the current method
(the opcode of its first instruction).
The statement you highlighted explains the motivation for this design choice, which might otherwise seem odd or pointless.
This allows for a more efficient implementation of a Java interpreter, because if the code is loaded at a 4-byte aligned address, the offsets and keys in the switch tables can be read with aligned access.
Of course, it isn't that important nowadays, because we have fancy JITs, but back in the early days of Java, the JVM probably was implemented as a naive interpreter where this would make a big difference in performance.
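The padding rule can be worked out with a little arithmetic. A minimal sketch, where `opcodeOffset` is the offset of the tableswitch or lookupswitch opcode from the start of the method's code array (`SwitchPadding` is a hypothetical name):

```java
/** Hypothetical helper: computes the padding after a tableswitch/lookupswitch opcode. */
class SwitchPadding {

    /** Returns the 0..3 padding bytes needed so that defaultbyte1 lands on a multiple of 4. */
    static int paddingAfter(int opcodeOffset) {
        int firstByteAfterOpcode = opcodeOffset + 1;
        return (4 - firstByteAfterOpcode % 4) % 4;
    }

    public static void main(String[] args) {
        for (int offset = 0; offset < 8; offset++) {
            System.out.println("opcode at " + offset + " -> padding " + paddingAfter(offset));
        }
    }
}
```

For example, an opcode at offset 0 needs 3 padding bytes (defaultbyte1 at offset 4), while an opcode at offset 3 needs none.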

How does an interpreter interpret the code?

For simplicity, imagine this scenario: we have a 2-bit computer, which has a pair of 2-bit registers called r1 and r2 and only works with immediate addressing.
Let's say the bit sequence 00 means add to our CPU. Also, 01 means move data to r1 and 10 means move data to r2.
So there is an assembly language for this computer and an assembler, where sample code would be written like
mov r1,1
mov r2,2
add r1,r2
Simply put, when I assemble this code to native language, the file will be something like:
0101 1010 0001
The 12 bits above are the native code for:
Put decimal 1 to R1, Put decimal 2 to R2, Add the data and store in R1.
So this is basically how compiled code works, right?
Let's say someone implements a JVM for this architecture. In Java I will be writing code like:
int x = 1 + 2;
How exactly will the JVM interpret this code? I mean, eventually the same bit pattern must be passed to the CPU, mustn't it? Every CPU has a set of instructions that it can understand and execute, and they are after all just some bits. Let's say the compiled Java bytecode looks something like this:
1111 1100 1001
or whatever. Does it mean that interpreting changes this code to 0101 1010 0001 when executing? If so, it is already native code, so why is it said that the JIT only kicks in after a number of executions? If it does not convert it exactly to 0101 1010 0001, then what does it do? How does it make the CPU do the addition?
Maybe there are some mistakes in my assumptions.
I know interpreting is slow, compiled code is faster but not portable, and a virtual machine "interprets" a code, but how? I am looking for "how exactly/technically interpreting" is done. Any pointers (such as books or web pages) are welcome instead of answers as well.

The CPU architecture you describe is unfortunately too restricted to make this really clear with all the intermediate steps. Instead, I will write pseudo-C and pseudo-x86-assembler, hopefully in a way that is clear without being terribly familiar with C or x86.
The compiled JVM bytecode might look something like this:
ldc 0 # push the first constant (== 1)
ldc 1 # push the second constant (== 2)
iadd # pop two integers and push their sum
istore_0 # pop result and store in local variable
The interpreter has (a binary encoding of) these instructions in an array, and an index referring to the current instruction. It also has an array of constants, and a memory region used as stack and one for local variables. Then the interpreter loop looks like this:
while (true) {
    switch (instructions[pc]) {
    case LDC:
        sp += 1; // make space for constant
        stack[sp] = constants[instructions[pc+1]];
        pc += 2; // two-byte instruction
        break;
    case IADD:
        stack[sp-1] += stack[sp]; // add to first operand
        sp -= 1; // pop other operand
        pc += 1; // one-byte instruction
        break;
    case ISTORE_0:
        locals[0] = stack[sp];
        sp -= 1; // pop
        pc += 1; // one-byte instruction
        break;
    // ... other cases ...
    }
}
This C code is compiled into machine code and run. As you can see, it's highly dynamic: it inspects each bytecode instruction each time that instruction is executed, and all values go through the stack (i.e. RAM).
While the actual addition itself probably happens in a register, the code surrounding the addition is rather different from what a Java-to-machine code compiler would emit. Here's an excerpt from what a C compiler might turn the above into (pseudo-x86):
.ldc:
incl %esi # increment the variable pc, first half of pc += 2;
movb %ecx, program(%esi) # load byte after instruction
movl %eax, constants(,%ebx,4) # load constant from pool
incl %edi # increment sp
movl %eax, stack(,%edi,4) # write constant onto stack
incl %esi # other half of pc += 2
jmp .EndOfSwitch
.addi:
movl %eax, stack(,%edi,4) # load first operand
decl %edi # sp -= 1;
addl stack(,%edi,4), %eax # add
incl %esi # pc += 1;
jmp .EndOfSwitch
You can see that the operands for the addition come from memory instead of being hardcoded, even though for the purposes of the Java program they are constant. That's because for the interpreter, they are not constant. The interpreter is compiled once and then must be able to execute all sorts of programs, without generating specialized code.
The purpose of the JIT compiler is to do just that: Generate specialized code. A JIT can analyze the ways the stack is used to transfer data, the actual values of various constants in the program, and the sequence of calculations performed, to generate code that more efficiently does the same thing. In our example program, it would allocate the local variable 0 to a register, replace the access to the constant table with moving constants into registers (movl %eax, $1), and redirect the stack accesses to the right machine registers. Ignoring a few more optimizations (copy propagation, constant folding and dead code elimination) that would normally be done, it might end up with code like this:
movl %ebx, $1 # ldc 0
movl %ecx, $2 # ldc 1
movl %eax, %ebx # (1/2) addi
addl %eax, %ecx # (2/2) addi
# no istore_0, local variable 0 == %eax, so we're done
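For readers who want to run the interpreter loop described in this answer, here is a minimal Java sketch of it. It is a toy, not how a real JVM is implemented: the opcode values happen to match the real JVM encodings, but the constant table is a plain zero-based array, the stack and locals have arbitrary fixed sizes, and a `RETURN` opcode is added so that the loop terminates. `TinyInterpreter` is a hypothetical name.

```java
/** Toy interpreter for four bytecode-like instructions; a sketch, not a real JVM. */
class TinyInterpreter {
    static final int LDC = 0x12;      // push constants[operand]
    static final int IADD = 0x60;     // pop two ints, push their sum
    static final int ISTORE_0 = 0x3b; // pop into local variable 0
    static final int RETURN = 0xb1;   // stop interpreting

    /** Executes the program and returns the local variable array. */
    static int[] run(int[] code, int[] constants) {
        int[] stack = new int[16];
        int[] locals = new int[4];
        int sp = -1; // index of the top of the stack; -1 means empty
        int pc = 0;  // index of the current instruction
        while (true) {
            switch (code[pc]) {
                case LDC:
                    stack[++sp] = constants[code[pc + 1]];
                    pc += 2; // opcode plus one operand
                    break;
                case IADD:
                    stack[sp - 1] += stack[sp];
                    sp--;
                    pc++;
                    break;
                case ISTORE_0:
                    locals[0] = stack[sp--];
                    pc++;
                    break;
                case RETURN:
                    return locals;
                default:
                    throw new IllegalStateException("unknown opcode " + code[pc]);
            }
        }
    }

    public static void main(String[] args) {
        // ldc 0; ldc 1; iadd; istore_0; return
        int[] program = {LDC, 0, LDC, 1, IADD, ISTORE_0, RETURN};
        int[] locals = run(program, new int[] {1, 2});
        System.out.println(locals[0]); // prints 3
    }
}
```

Note that the CPU never sees anything like a translated `add 1, 2`: it only ever executes the compiled code of `run`, which reads the operands from arrays at run time.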

Not all computers have the same instruction set. Java bytecode is a kind of Esperanto - an artificial language to improve communication. The Java VM translates the universal Java bytecode to the instruction set of the computer it runs on.
So how does the JIT figure in here? The main purpose of the JIT compiler is optimization. There are often different ways to translate a certain piece of bytecode into the target machine code. The most performance-ideal translation is often non-obvious because it might depend on the data. There are also limits to how far a program can analyze an algorithm without executing it - the halting problem is a well-known such limitation but not the only one. So what the JIT compiler does is try different possible translations and measure how fast they are executed with the real-world data the program processes. So it takes a number of executions until the JIT compiler has found the perfect translation.

One of the important steps in Java is that the compiler first translates the .java code into a .class file, which contains the Java bytecode. This is useful, as you can take .class files and run them on any machine that understands this intermediate language, by then translating it on the spot line-by-line, or chunk-by-chunk. This is one of the most important functions of the java compiler + interpreter. You can directly compile Java source code to native binary, but this negates the idea of writing the original code once and being able to run it anywhere. This is because the compiled native binary code will only run on the same hardware/OS architecture that it was compiled for. If you want to run it on another architecture, you'd have to recompile the source on that one. With the compilation to the intermediate-level bytecode, you don't need to drag around the source code, but the bytecode. It's a different issue, as you now need a JVM that can interpret and run the bytecode. As such, compiling to the intermediate-level bytecode, which the interpreter then runs, is an integral part of the process.
As for the actual realtime running of code: yes, the JVM will eventually interpret/run some binary code that may or may not be identical to natively compiled code. And in a one-line example, they may seem superficially the same. But the interpreter typically doesn't precompile everything; instead it goes through the bytecode and translates it to binary line-by-line or chunk-by-chunk. There are pros and cons to this (compared to natively compiled code, e.g. C and C compilers) and lots of resources online to read up further on. See my answer here, or this, or this one.

Simplifying, an interpreter is an infinite loop with a giant switch inside.
It reads Java bytecode (or some internal representation) and emulates a CPU executing it.
This way the real CPU executes the interpreter code, which emulates the virtual CPU.
This is painfully slow. A single virtual instruction adding two numbers requires three function calls and many other operations.
A single virtual instruction takes a couple of real instructions to execute.
This is also less memory-efficient, as you have both real and emulated stacks, registers and instruction pointers.
while (true) {
    Operation op = methodByteCode.get(instructionPointer);
    switch (op) {
        case ADD:
            stack.pushInt(stack.popInt() + stack.popInt());
            instructionPointer++;
            break;
        case STORE:
            memory.set(stack.popInt(), stack.popInt());
            instructionPointer++;
            break;
        ...
    }
}
When some method has been interpreted multiple times, the JIT compiler kicks in.
It reads all the virtual instructions and generates one or more native instructions that do the same thing.
Here I'm generating a string of textual assembly, which would require an additional conversion from assembly to native binary.
for (Operation op : methodByteCode) {
    switch (op) {
        case ADD:
            compiledCode += "popi r1\n";
            compiledCode += "popi r2\n";
            compiledCode += "addi r1, r2, r3\n";
            compiledCode += "pushi r3\n";
            break;
        case STORE:
            compiledCode += "popi r1\n";
            compiledCode += "storei r1\n";
            break;
        ...
    }
}
After the native code is generated, the JVM copies it somewhere, marks this region as executable and instructs the interpreter to invoke it instead of interpreting the bytecode the next time this method is invoked.
A single virtual instruction might still take more than one native instruction, but this will be nearly as fast as ahead-of-time compilation to native code (like in C or C++).
Compilation is usually much slower than interpreting, but it has to be done only once and only for chosen methods.

Maximum number of enum elements in Java

What is the maximum number of elements allowed in an enum in Java?
I wanted to find out the maximum number of cases in a switch statement. Since the largest primitive type allowed in switch is int, we have cases from -2,147,483,648 to 2,147,483,647 and one default case. However enums are also allowed... so the question..

From the class file format spec:
The per-class or per-interface constant pool is limited to 65535 entries by the 16-bit constant_pool_count field of the ClassFile structure (§4.1). This acts as an internal limit on the total complexity of a single class or interface.
I believe that this implies that you cannot have more than 65535 named "things" in a single class, which would also limit the number of enum constants.
If I see a switch with 2 billion cases, I'll probably kill anyone that has touched that code.
Fortunately, that cannot happen:
The amount of code per non-native, non-abstract method is limited to 65536 bytes by the sizes of the indices in the exception_table of the Code attribute (§4.7.3), in the LineNumberTable attribute (§4.7.8), and in the LocalVariableTable attribute (§4.7.9).

The maximum number of enum elements is 2746. Reading the spec was very misleading and caused me to create a flawed design with the assumption I would never hit the 64K or even 32K high-water mark. Unfortunately, the number is much lower than the spec seems to indicate. As a test, I tried the following with both Java 7 and Java 8: I ran the following code, redirected its output to a file, then compiled the resulting .java file.
System.out.println("public enum EnumSizeTest {");
int max = 2746;
for (int i = 0; i < max; i++) {
    System.out.println("VAR" + i + ",");
}
System.out.println("VAR" + max + "}");
Result: 2746 works, and 2747 does not.
After 2746 entries, the compiler throws a "code too large" error, like
EnumSizeTest.java:2: error: code too large
Decompiling this Enum class file, the restriction appears to be caused by the code generated for each enum value in the static constructor (mostly).
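To repeat this experiment on another compiler or JDK version, the generator can be wrapped in a small standalone program. The sketch below is an assumption-laden variant rather than the exact code above: `EnumSourceGenerator` is a hypothetical name, and `generate(name, count)` emits exactly `count` constants named VAR0 to VAR(count-1).

```java
/** Hypothetical generator for enum source files of a chosen size. */
class EnumSourceGenerator {

    /** Returns the source text of an enum with exactly `count` constants. */
    static String generate(String name, int count) {
        StringBuilder source = new StringBuilder("public enum " + name + " {\n");
        for (int i = 0; i < count; i++) {
            source.append("    VAR").append(i);
            source.append(i < count - 1 ? ",\n" : "\n"); // no comma after the last constant
        }
        return source.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Usage (assumed): java EnumSourceGenerator 2746 > EnumSizeTest.java
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 2746;
        System.out.print(generate("EnumSizeTest", count));
    }
}
```

Compiling the generated file with increasing counts shows where a given compiler starts reporting "code too large".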

Enums definitely have limits, with the primary (hard) limit around 32K values. They are subject to Java class maximums, both of the 'constant pool' (64K entries) and -- in some compiler versions -- to a method size limit (64K bytecode) on the static initializer.
Enum initialization internally uses two constants per value -- a FieldRef and a Utf8 string. This gives the "hard limit" at ~32K values.
Older compilers (Eclipse Indigo at least) also run into an issue with the size of the static initializer method. With 24 bytes of bytecode required to instantiate each value and add it to the values array, a limit around 2730 values may be encountered.
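Both figures follow from simple division. A rough sketch of the arithmetic, taking the per-value costs from this answer (2 constants and 24 bytes of bytecode per value) and ignoring fixed overhead such as the other pool entries and the code that creates the values array; `EnumLimitEstimate` is a hypothetical name:

```java
/** Back-of-the-envelope estimates for the two enum size limits discussed above. */
class EnumLimitEstimate {
    // constant_pool_count is a 16-bit field and equals the number of entries plus one.
    static final int MAX_POOL_ENTRIES = 65535 - 1;
    // The code_length of a method must be less than 65536.
    static final int MAX_METHOD_BYTES = 65535;

    static int byConstantPool(int constantsPerValue) {
        return MAX_POOL_ENTRIES / constantsPerValue;
    }

    static int byInitializerSize(int bytesPerValue) {
        return MAX_METHOD_BYTES / bytesPerValue;
    }

    public static void main(String[] args) {
        System.out.println(byConstantPool(2));     // prints 32767
        System.out.println(byInitializerSize(24)); // prints 2730
    }
}
```

The real limits are somewhat lower than these upper bounds, because the class needs constant pool entries and initializer bytecode for things other than the enum values themselves.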
Newer compilers (JDK 7 at least) automatically split large static initializers off into methods named " enum constant initialization$2", " enum constant initialization$3" etc so are not subject to the second limit.
You can disassemble bytecode via javap -v -c YourEnum.class to see how this works.
[It might be theoretically possible to write an "old-style" Enum class as handcoded Java, to break the 32K limit and approach close to 64K values. The approach would be to initialize the enum values by reflection to avoid needing string constants in the pool. I tested this approach and it works in Java 7, but the desirability of such an approach (safety issues) is questionable.]
Note to editors: Utf8 was an internal type in the Java classfile IIRC, it's not a typo to be corrected.

Well, on JDK 1.6 I hit this limit. Someone has an enum with 10,000 values in an XSD, and when we generate code from it we get a 60,000-line enum file and a nice Java compiler error:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.0.2:compile (default-compile) on project framework: Compilation failure
[ERROR] /Users/dhiller/Space/ifp-core/framework/target/generated-sources/com/framework/util/LanguageCodeSimpleType.java:[7627,4] code too large
So quite possibly the limit is much lower than the other answers here suggest, or maybe the generated comments and such are taking up too much space. Notice that the line number in the compiler error is 7627; if the limit on lines is 7627, I wonder what the line length limit is ;) which may be similar. That is, the limits may be based not on the number of enum values but on line length or on the number of lines in the file, so you would have to rename the values to A, B, etc. to fit more of them into the enum.
I can't believe someone wrote an XSD with a 10,000-value enum. They must have generated this portion of the XSD.

Update for Java 15+
In JDK 15, the maximal number of constants in enums was raised to about 4103: https://bugs.openjdk.java.net/browse/JDK-8241798
This was achieved by splitting the static initializer into two parts:
Before Java 15 (pseudocode):
enum E extends Enum<E> {
...
static {
C1 = new E(...);
C2 = new E(...);
...
CN = new E(...);
$VALUES = new E[N];
$VALUES[0] = C1;
$VALUES[1] = C2;
...
$VALUES[N-1] = CN;
}
}
Since Java 15:
enum E extends Enum<E> {
...
static {
C1 = new E(...);
C2 = new E(...);
...
CN = new E(...);
$VALUES = $values();
}
private static E[] $values() {
E[] array = new E[N];
array[0] = C1;
array[1] = C2;
...
array[N-1] = CN;
return array ;
}
}
This allowed the static initializer to contain more code (until it hits the 64 kilobytes limit) and therefore to initialize more enum constants.

The maximum size of any method in Java is 65536 bytes. While you can theoretically have a large switch or more enum values, it's the maximum size of a method you are likely to hit first.

The Enum class uses an int to track each value's ordinal, so the max would be the same as int at best, if not much lower.
And as others have said, if you have to ask, you're Doing It Wrong.

There is no maximum number per se for any practical purposes. If you need to define thousands of enums in one class you need to rewrite your program.
