I know that a method cannot be larger than 64 KB with Java. The limitation causes us problems with generated code from a JavaCC grammar. We had problems with Java 6 and were able to fix this by changing the grammar. Has the limit been changed for Java 7 or is it planned for Java 8?
Just to make it clear. I don't need a method larger than 64 KB by myself. But I wrote a grammar which compiles to a very large method.
According to JVMS7 :
The fact that end_pc is exclusive is a historical mistake in the
design of the Java virtual machine: if the Java virtual machine code
for a method is exactly 65535 bytes long and ends with an instruction
that is 1 byte long, then that instruction cannot be protected by an
exception handler. A compiler writer can work around this bug by
limiting the maximum size of the generated Java virtual machine code
for any method, instance initialization method, or static initializer
(the size of any code array) to 65534 bytes.
But this is about Java 7. There is no final specs for Java 8, so nobody (except its developers) could answer this question.
UPD (2015-04-06) According to JVM8 it is also true for Java 8.
Good question. As always we should go to the source to find the answer ("The Java® Virtual Machine Specification"). The section does not explicitly mention a limit (as did the Java6 VM spec) though, but somewhat circumspectly:
The greatest number of local variables in the local variables array of a frame created upon invocation of a method (§2.6) is limited to 65535 by the size of the max_locals item of the Code attribute (§4.7.3) giving the code of the method, and by the 16-bit local variable indexing of the Java Virtual Machine instruction set.
Cheers,
It has not changed. The limit of code in methods is still 64 KB in both Java 7 and Java 8.
References:
From the Java 7 Virtual Machine Specification (4.9.1 Static Constraints):
The static constraints on the Java Virtual Machine code in a class file specify how
Java Virtual Machine instructions must be laid out in the code array and what the
operands of individual instructions must be.
The static constraints on the instructions in the code array are as follows:
The code array must not be empty, so the code_length item cannot have the
value 0.
The value of the code_length item must be less than 65536.
From the Java 8 Virtual Machine Specification (4.7.3 The Code Attribute):
The value of the code_length item gives the number of bytes in the code array
for this method.
The value of code_length must be greater than zero (as the code array must
not be empty) and less than 65536.
Andremoniy has answered the java 7 part of this question already, but seems at that time it was soon to decide about java 8 so I complete the answer to cover that part:
Quoting from jvms:
The fact that end_pc is exclusive is a historical mistake in the design of the Java Virtual Machine: if the Java Virtual Machine code for a method is exactly 65535 bytes long and ends with an instruction that is 1 byte long, then that instruction cannot be protected by an exception handler. A compiler writer can work around this bug by limiting the maximum size of the generated Java Virtual Machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes.
As you see seems this historical problem doesn't seem to remedy at least in this version (java 8).
As a workaround, and if you have access to the parser's code, you could modify it to work within whatever 'limits are imposed by the JVM compiler ...
(Assuming it den't take forever to find the portions in the parser code to modify)
Related
As far as I understand, Java Strings are just an array of characters, with the maximum length being an integer value.
If I understand this answer correctly, it is possible to cause an overflow with a String - albeit in "unusual circumstances".
Since Java Strings are based on char arrays and Java automatically checks array bounds, buffer overflows are only possible in unusual scenarios:
If you call native code via JNI
In the JVM itself (usually written in C++)
The interpreter or JIT compiler does not work correctly (Java bytecode mandated bounds checks)
Correct me if I'm wrong, but I believe this means that you can write outside the bounds of the array, without triggering the ArrayIndexOutOfBounds (or similar) exception.
I've encountered issues in C++ with buffer overflows, and I can find plenty of advice about other languages, but none specifically answering what would happen if you caused a buffer overflow with a String (or any other array type) in Java.
I know that Java Strings are bounds-checked, and can't be overflowed by native Java code alone (unless issues are present in the compiler or JVM, as per points 2 and 3 above), but the first point implies that it is technically possible to get a char[] into an... undesirable position.
Given this, I have two specific questions about the behaviour of such issues in Java, assuming the above is correct:
If a String can overflow, what happens when it does?
What would the implications of this behaviour be?
Thanks in advance.
To answer you first question, I had the luck of actually causing a error of such, and the execution just stopped throwing one of these errors:
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
So that was my case, I don't know if that represents a security problem as buffer overflow in C and C++.
A String in Java is immutable, so once created there is no writing to the underlying array of char or array of byte (it depends on the Java version and contents of the String whether one or the other is used). Ok, using JNI could circumvent that, but with pure Java, it is impossible to leave the bounds of the array without causing an ArrayOutOfBoundsException or alike.
The only way to cause a kind of an overflow in the context of String handling would be to create a new String that is too long. Make sure that your JVM will have enough heap (around 36 GB), create a char array of Integer.MAX_VALUE - 1, populate that appropriately, call new String( byte [] ) with that array, and then execute
var string = string.concat( new String( array ) );
But the result is just an exception telling you that it was attempted to create a too large array.
How are keywords represented in binary form?
For ex:: In java, how is the sin() represented in binary? How is sqrt() and other functions represented.
If not only in java, in any language how is it represented?? because ultimately everything is translated into binary and then into on and off signals.
Thanks in advance.
Firstly, sin is not a keyword in Java. It is an identifier. Keywords are things like if, class, and so on.
It depends on when you are asking about.
In the source code, the sin identifier is represented as characters, and those characters are represented as bits (i.e. binary) .... if you want to look at it that way.
In the classfile that is output by the javac compiler, the word sin is represented as string in the Constant Pool. (The JVM spec specifies the format of classfiles in great detail.)
When the classfile is first loaded by a JVM, the word sin becomes a Java String object.
When the code is linked by the JVM, the reference to the String is resolved to some kind of reference to a method. (The details are implementation specific. You'd need to read the JVM source code to find out more.)
When the code is JIT compiler, the reference to the method (typically) turns into the address in memory of the first native instruction of the JIT compiled method. (Strictly speaking, this is not "assembly language". But the native instructions could be represented as assembly language. Assembly language is really just a "human friendly" textual representation of the instructions.)
so how does the computer know that when sin is written it has to do the sine of a number.
What happens is that the Java runtime loads that class containing the method. Then it looks for the sin(double) method in the class that it loaded. What typically happens is that the named method resolves to some bytecodes that are the instructions that tell the runtime what the method should do. But in the case of sin, the method is a native method, and the instructions are actually native instructions that are part of one of the JVM's native libraries.
If not of methods, Can we have binary representation of Keywords?? Like int, and float etc??
It depends on the actual keywords. But generally speaking, genuine Java keywords are transformed by the compiler into a form that doesn't have a distinct / discrete representation for the individual keywords.
If not only in java, in any language how is it represented?? because ultimately everything is translated into binary and then into on and off signals.
This tells me that you probably have a fundamental misunderstanding of how programming languages are implemented. So instead of answering this question (it doesn't really have a proper answer other than "well they're not represented at all"), I will try to help you understand why this question is the wrong one to ask.
Your computer runs machine code, and only machine code. You can feed it any random sequence of bytes, it doesn't matter what they were intended to be, as soon as you point the program counter to it it will be interpreted as if it is machine code (of course giving it bytes that were not intended to be machine code is probably a bad idea). As a running example, I'll use this x64 code:
48 01 F7 48 89 F8 C3
If you have no idea what's going on, that's normal at this level. Most people don't read machine code (but they could if they learned it, it's not magic). This is where the zeroes and ones are, to the processor it's not even in hexadecimal, that's just what humans like to read.
At a level above that there is assembly, which is in most cases really just a different way of looking at machine code, in such a way that humans find it easier to read. The example from earlier looks more sensible in assembly:
add rdi, rsi
mov rax, rdi
ret
Still not very clear what's going on to someone who doesn't know x64 assembly, but at least it gives some sort of clue: there's an add in it. It probably adds things.
At a yet higher level, you could have java bytecode or java, but I think the java aspect of this question misses the point, it's probably there because OP doesn't realize that java is different from "the classic picture". Java just complicates matters without explaining the big picture. Let's use C instead. The example in C could look like:
int64_t foo_or_whatever(int64_t x, int64_t y)
{
return x + y;
}
If you don't know C but you do know Java, the only strange thing here is int64_t, which is roughly the equivalent of a long in Java.
So yes, things were added, as the assembly code suggested. Now where did the keywords go?
That question doesn't make as much sense as you thought it did. The compiler understands keywords, and uses them to create machine code that implements your program. After that point they stop being relevant. They only mean something in the context of the high level language that you wrote the code in, you could say that at that level, they are stored as ASCII or UTF8 string in a file. They have nothing to do with machine code, they do not appear in any form there, and you can write machine code without having translated it from a high level language that has keywords. That return and ret looks vaguely similar is a bit of a red herring, they have something to do with each other but the relation is far from simple (that it worked out simply in the example I'm using is of course no accident).
The int64_t has perhaps not entirely disappeared (mostly it has, though). The fact that the addition operates on 64bit integers is encoded in the instruction 48 01 F7. Not the keyword int64_t (which isn't even a keyword, but let's not get into that), "the fact that what you have there is an addition between 64bit integers", which is an conceptually different thing though caused here by the use of int64_t. To split that instruction out while skipping some of the detail (because this is a beginner question), there's
48 = 01001000 encoding REX.W, meaning this instruction is 64bit
01 = 00000001 encoding add rm64, r64 in this case
D1 = 11010001 encoding the operands rdi and rsi
To learn more about what the processor does with machine code (in case your follow-up question is "but how does it know what to do with something like 48 01 F7"), study computer architecture. If you want a book, I recommend Computer Architecture, Fifth Edition: A Quantitative Approach, which is quite accessible to beginners and commonly used in first-year courses about computer architectures.
To learn more about the journey from high level language to machine code, study compiler construction. If you want a book, I recommend Compilers: Principles, Techniques, and Tools, but it may be hard to get through it as a beginner. If you want a free course, you could follow Compilers on Coursera (the first few lectures especially will give you an overview of what compilers do without getting too technical yet).
Incidentally, if you give the example C code to GCC, it makes
lea rax, [rdi + rsi]
ret
It's still doing the same thing, but in a way that didn't fit my story, so I took the liberty of doing it in a slightly different way.
sin() is a function so it's represented as a memory address where its code block is.
Keywords (like for) aren't represented as binary, for for example is converted to a list of byte code jump instructions which are compiled into assembly instructions which are represented as binary.
My point is that you cannot convert most keywords directly into binary. You can unroll them into bytecode which you could then convert to native machine code and binary but not directly to binary.
Here, read this then after you understand it move onto how bytecode is converted to native code.
Keywords and Functions
That said, a keyword in Java (and most languages) is a reserved word like for, while or return but your examples are not keywords, they are function names like sin() and sqrt()
Not really sure what you want to know here; so let's go "bytecode"...
Both the .sin() and .sqrt() methods are static methods from the Math class; therefore, the compiler will generate a call site with both arguments, a reference to the method and then call invokestatic.
Other than invokestatic, you have invokevirtual, invokespecial, invokeinterface and (since Java 7) invokedynamic.
Now, at runtime, the JIT will kick in; and the JIT may end up producing pure native code, but this is not a guarantee. In any event, the code will be fast enough.
And the same goes for the JDK libraries themselves; the JIT will kick in and maybe turn the byte code into native code given a sufficient time to analyze it (escape analysis, inlining etc).
And since the JIT does "whatever it wants", you reliably cannot have a "binary" representation of any method from any class.
I am new on learning dalvik, and I want to dump out every instruction in dalvik.
But there are still 3 instructions I can not get no matter how I write the code.
They are 'not-int', 'not-long', 'const-string/jumbo'.
I written like this to get 'not-int' but failed:
int y = ~x;
Dalvik generated an 'xor x, -1' instead.
and I know 'const-string/jumbo' means that there is more than 65535 strings in the code and the index is 32bit. But when I decleared 70000 strings in the code, the compiler said the code was too long.
So the question is: how to get 'not-int' and 'const-string/jumbo' in dalvik by java code?
const-string/jumbo is easy. As you noted, you just need to define more than 65535 strings, and reference one of the later ones. They don't all need to be in a single class file, just in the same DEX file.
Take a look at dalvik/tests/056-const-string-jumbo, in particular the "build" script that generates a Java source file with a large number of strings.
As far as not-int and not-long go, I don't think they're ever generated. I ran dexdump -d across a pile of Android 4.4 APKs and didn't find a single instance of either.
So far as I know, when JRE executes an Java application,
the string will be seen as a USC2 byte array internally.
In wikipedia, the following content can be found.
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
With the new release version of Java (Java 7) ,
what is its internal character-encoding?
Is there any possibility that Java start to use UCS-4 internally ?
Java 7 still uses UTF-16 internally (Read the last section of the Charset Javadoc), and it's very unlikely that will change to UCS-4. I'll give you two reasons for that:
Changing from UCS-2=>UCS-4 would most likely meant that they would have to change the char primitive from a 16 bits type to a 32 bits type. Looking in the past at how high Sun/Oracle have valued backwards compatibility, a change like this is very unlikely.
A UCS-4 takes a lot more memory than a UTF-16 encoded String for most use cases.
Q: So far as I know, when JRE executes an Java application, the string
will be seen as a (16-bit Unicode) byte array
A: Yes
Q: With the new release version of Java (Java 7) , what is its
internal charater-encoding?
A: Same
Q: Is there any possibility that Java start to use UCS-4 internally?
A: I haven't heard anything of the kind
However, you can use "code-points" to implement UTF-32 characters in Java 5 and higher:
http://www.ibm.com/developerworks/java/library/j-unicode/
http://jcp.org/en/jsr/detail?id=204
This question came up in Spring class, which has some rather long class names. Is there a limit in the language for class name lengths?
The Java Language Specification states that identifiers are unlimited in length.
In practice though, the filesystem will limit the length of the resulting file name.
65535 characters I believe. From the Java virtual machine specification:
The length of field and method names,
field and method descriptors, and
other constant string values is
limited to 65535 characters by the
16-bit unsigned length item of the
CONSTANT_Utf8_info structure (§4.4.7).
Note that the limit is on the number
of bytes in the encoding and not on
the number of encoded characters.
UTF-8 encodes some characters using
two or three bytes. Thus, strings
incorporating multibyte characters are
further constrained.
here:
https://docs.oracle.com/javase/specs/jvms/se6/html/ClassFile.doc.html#88659
With JDK 1.5, the practical limit for class names on Windows XP with 255 -- longer names gave errors in the file system. This was the full name (directory+package+class).
I have not tried JDK 1.6 on Vista or windows 7, hopefully Sun fixed it to be the NTFS limit of 8000 or so.
No. Java doesn't impose any limit on the class name. But if you interfacing with other systems (e.g. JNI) its better to be on the safe side.