Internal character encoding of Java 7

As far as I know, when the JRE executes a Java application,
strings are treated internally as UCS-2 byte arrays.
Wikipedia says the following:
Java originally used UCS-2, and added UTF-16 supplementary character support in J2SE 5.0.
With the new release of Java (Java 7),
what is its internal character encoding?
Is there any possibility that Java will start to use UCS-4 internally?

Java 7 still uses UTF-16 internally (read the last section of the Charset Javadoc), and it's very unlikely that this will change to UCS-4. I'll give you two reasons for that:
Changing from UTF-16 to UCS-4 would most likely mean changing the char primitive from a 16-bit type to a 32-bit type. Given how highly Sun/Oracle have valued backwards compatibility in the past, a change like this is very unlikely.
A UCS-4 encoded String takes a lot more memory than a UTF-16 encoded String for most use cases.
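To put rough numbers on the memory argument, here is a minimal sketch that counts only the bytes of the character data (object overhead is ignored). The class and method names are made up for illustration.

```java
final class EncodingSize {
    // Bytes needed for the text as UTF-16: two bytes per char (code unit).
    static int utf16Bytes(String s) {
        return s.length() * 2;
    }

    // Bytes needed for the text as UCS-4/UTF-32: four bytes per code point.
    static int ucs4Bytes(String s) {
        return s.codePointCount(0, s.length()) * 4;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String withEmoji = "hi\uD83D\uDE00"; // "hi" plus U+1F600, a supplementary character
        System.out.println(utf16Bytes(ascii) + " vs " + ucs4Bytes(ascii));         // 10 vs 20
        System.out.println(utf16Bytes(withEmoji) + " vs " + ucs4Bytes(withEmoji)); // 8 vs 12
    }
}
```

For text made only of characters in the Basic Multilingual Plane, UCS-4 doubles the size; it only breaks even for text made entirely of supplementary characters.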

Q: As far as I know, when the JRE executes a Java application, strings
are seen as (16-bit Unicode) byte arrays
A: Yes
Q: With the new release of Java (Java 7), what is its
internal character encoding?
A: Same
Q: Is there any possibility that Java will start to use UCS-4 internally?
A: I haven't heard anything of the kind
However, you can use "code points" (32-bit int values) to work with characters outside the 16-bit range in Java 5 and higher:
http://www.ibm.com/developerworks/java/library/j-unicode/
http://jcp.org/en/jsr/detail?id=204
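As a rough illustration of the code point API added in Java 5, the sketch below walks a String and collects one int per Unicode character, combining surrogate pairs. The class name and helper are hypothetical; the String and Character methods are the standard ones.

```java
final class CodePointDemo {
    // Returns the Unicode code points of s, one int per character,
    // combining each surrogate pair into a single value.
    static int[] toCodePoints(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int i = 0;
        for (int offset = 0; offset < s.length(); ) {
            int cp = s.codePointAt(offset);
            result[i++] = cp;
            offset += Character.charCount(cp); // 1 for BMP characters, 2 for supplementary ones
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "a\uD835\uDD38"; // 'a' followed by U+1D538 (stored as a surrogate pair)
        System.out.println(s.length()); // 3 chars (UTF-16 code units)
        for (int cp : toCodePoints(s)) {
            System.out.printf("U+%04X%n", cp); // U+0061, then U+1D538
        }
    }
}
```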

Related

Check if a Java version is greater than a certain iteration in Java?

I wish to check if a user's Java version is at least 1.8.0_171. I mean that specific iteration or higher, meaning 1.8.0_151, for instance, would not work.
I planned to originally use org.apache.commons.lang3.SystemUtils' isJavaVersionAtLeast(JavaVersion requiredVersion) method, but it seems that you cannot specify the iteration number.
Based on this and Java's changing way of representing version numbers in Java (e.g. 1.8 then 9), what is the best way to check the Java version of the user in the Java program?
Edit:
This was marked as a duplicate of this question; however, I think it is different in that it asks how to compare the java version with a certain version given the changes in format of how the java version is shown.

Even with the versioning change, I think the solution is still as simple as using the following boolean expression:
"1.8.0_171".compareTo(System.getProperty("java.version")) <= 0
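If the user's java.version property is lower than 1.8.0_171, the above expression returns false; otherwise it returns true. It also gives the right result when java.version is "9" or "10". Note that this is a lexicographic string comparison, so update numbers with a different number of digits are misordered: "1.8.0_20", for example, compares as greater than "1.8.0_171".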
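A numeric comparison avoids depending on character order. The following is only a sketch: the class and method names are hypothetical, and it assumes java.version consists of numeric fields separated by '.' or '_', optionally followed by a '-' or '+' suffix such as "-ea".

```java
final class JavaVersionCheck {
    // Parses strings such as "1.8.0_171", "9" or "11.0.2" into numeric parts.
    // Anything after the first '-' or '+' (for example "-ea") is ignored.
    static int[] parse(String version) {
        String core = version.split("[-+]", 2)[0];
        String[] fields = core.split("[._]");
        int[] parts = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            parts[i] = Integer.parseInt(fields[i]);
        }
        return parts;
    }

    // True if 'actual' is the same as or newer than 'required', comparing field by field.
    // Missing trailing fields are treated as 0.
    static boolean isAtLeast(String actual, String required) {
        int[] a = parse(actual);
        int[] r = parse(required);
        for (int i = 0; i < Math.max(a.length, r.length); i++) {
            int x = i < a.length ? a[i] : 0;
            int y = i < r.length ? r[i] : 0;
            if (x != y) {
                return x > y;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAtLeast(System.getProperty("java.version"), "1.8.0_171"));
    }
}
```

Because "9" parses to a first field of 9 and "1.8.0_171" to a first field of 1, the old and new version schemes still compare in the expected order.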

Have the NFC Normalization semantics changed between Java 6 and 7?

The Unicode character U+FA8E CJK COMPATIBILITY IDEOGRAPH-FA8E is a compatibility character mapped to U+641C [CJK Unified Ideographs]. In Java 6, NFC normalization leaves it as U+FA8E, while in Java 7 it is mapped to U+641C.
When running this small snippet:
import java.text.Normalizer;
String fancyChar = "\uFA8E";
String normalized = Normalizer.normalize(fancyChar, Normalizer.Form.NFC);
System.out.printf("%04x == %04x\n", (int)(fancyChar.charAt(0)), (int)(normalized.charAt(0)));
System.out.println(fancyChar.equals(normalized));
In Java 6 (latest versions of both Sun/Oracle and OpenJDK):
fa8e == fa8e
true
In Java 7 (latest versions of both Sun/Oracle and OpenJDK):
fa8e == 641c
false
So my question is, why has this changed?
Reading Unicode Normalization Forms, it seems that NFC should not decompose characters with a compatibility mapping?
But the fact that both Oracle and OpenJDK have switched this for Java 7 makes me wonder.

The character U+FA8E has a canonical mapping to U+641C. The authoritative reference on this is the UnicodeData.txt file in the Unicode Character Database. Thus, the correct NFC form of U+FA8E is U+641C.
So this is apparently a bug fix. It seems to affect other characters in the same group, too.
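The distinction between canonical and compatibility mappings can be seen with a small sketch like the one below, assuming Java 7 or later. The class name is made up, and U+FB01 (LATIN SMALL LIGATURE FI) is picked here only as an example of a character with a compatibility mapping.

```java
import java.text.Normalizer;

final class NormalizationDemo {
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String canonical = "\uFA8E";     // has a canonical (singleton) mapping to U+641C
        String compatibility = "\uFB01"; // has only a compatibility mapping to "fi"

        System.out.println(nfc(canonical).equals("\u641C"));          // NFC applies canonical mappings
        System.out.println(nfc(compatibility).equals(compatibility)); // NFC leaves compatibility characters alone
        System.out.println(nfkc(compatibility).equals("fi"));         // NFKC also applies compatibility mappings
    }
}
```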

Maximum size of a method in Java 7 and 8

I know that a method cannot be larger than 64 KB with Java. The limitation causes us problems with generated code from a JavaCC grammar. We had problems with Java 6 and were able to fix this by changing the grammar. Has the limit been changed for Java 7 or is it planned for Java 8?
Just to make it clear: I don't need a method larger than 64 KB myself, but I wrote a grammar that compiles to a very large method.

According to JVMS7:
The fact that end_pc is exclusive is a historical mistake in the design of the Java virtual machine: if the Java virtual machine code for a method is exactly 65535 bytes long and ends with an instruction that is 1 byte long, then that instruction cannot be protected by an exception handler. A compiler writer can work around this bug by limiting the maximum size of the generated Java virtual machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes.
But this is about Java 7. There are no final specs for Java 8 yet, so nobody (except its developers) can answer this question.
UPD (2015-04-06): According to JVMS8, this is also true for Java 8.

Good question. As always, we should go to the source to find the answer ("The Java® Virtual Machine Specification"). Unlike the Java 6 VM spec, the section does not explicitly mention a limit, but states it somewhat indirectly:
The greatest number of local variables in the local variables array of a frame created upon invocation of a method (§2.6) is limited to 65535 by the size of the max_locals item of the Code attribute (§4.7.3) giving the code of the method, and by the 16-bit local variable indexing of the Java Virtual Machine instruction set.

It has not changed. The limit of code in methods is still 64 KB in both Java 7 and Java 8.
References:
From the Java 7 Virtual Machine Specification (4.9.1 Static Constraints):
The static constraints on the Java Virtual Machine code in a class file specify how
Java Virtual Machine instructions must be laid out in the code array and what the
operands of individual instructions must be.
The static constraints on the instructions in the code array are as follows:
The code array must not be empty, so the code_length item cannot have the
value 0.
The value of the code_length item must be less than 65536.
From the Java 8 Virtual Machine Specification (4.7.3 The Code Attribute):
The value of the code_length item gives the number of bytes in the code array
for this method.
The value of code_length must be greater than zero (as the code array must
not be empty) and less than 65536.

Andremoniy has already answered the Java 7 part of this question, but at that time it was too soon to say anything about Java 8, so I'll complete the answer to cover that part.
Quoting from the JVMS:
The fact that end_pc is exclusive is a historical mistake in the design of the Java Virtual Machine: if the Java Virtual Machine code for a method is exactly 65535 bytes long and ends with an instruction that is 1 byte long, then that instruction cannot be protected by an exception handler. A compiler writer can work around this bug by limiting the maximum size of the generated Java Virtual Machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes.
As you can see, this historical problem has not been remedied, at least in this version (Java 8).

As a workaround, if you have access to the parser's code, you could modify it to work within whatever limits are imposed by the JVM ...
(Assuming it doesn't take forever to find the portions of the parser code to modify.)

how to protect against Null Byte Injection in a java webapp

How can null byte injection be done on a Java webapp, or rather, how does one protect against it?
Should I look at each byte of the request parameter and check whether its value is 0? I can't imagine a 0 byte sneaking into a request parameter... can it?
My main aim is to make sure the filename used for saving the file is safe enough. And for now, I am not looking for answers that recommend (for example) replacing ALL non-word characters with underscores.

Allowing the user to store files with arbitrary names is dangerous. What happens if the user provides "../../../WINDOWS/explorer.exe"? You should restrict filenames to only contain characters known to be harmless.
'\0' is not known to be harmless. As far as Java is concerned, '\0' is a character like any other. However, the operating system is likely to interpret '\0' as the end of a string. If a string is passed from Java to the operating system, that different interpretation could result in exploitable bugs. Consider:
if (filename.endsWith(".txt")) {
    store(filename, data);
}
where filename is "C:\Windows\explorer.exe\0.txt", which ends with ".txt" to Java, but with ".exe" to the operating system.
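One way to act on this advice is to validate the name before it ever reaches the file system. The sketch below is only an illustration: the class name, the method name and the exact set of allowed characters are assumptions, and a real application would choose its own whitelist.

```java
final class FilenameCheck {
    // Rejects null, empty and NUL-containing names explicitly, then accepts only names
    // that start with an ASCII letter or digit and otherwise consist of ASCII letters,
    // digits, '.', '_' and '-'. Path separators never match, and ".." is refused.
    static boolean isSafeFilename(String name) {
        if (name == null || name.isEmpty() || name.indexOf('\0') >= 0) {
            return false;
        }
        return name.matches("[A-Za-z0-9][A-Za-z0-9._-]*") && !name.contains("..");
    }

    public static void main(String[] args) {
        System.out.println(isSafeFilename("report.txt"));                        // true
        System.out.println(isSafeFilename("C:\\Windows\\explorer.exe\0.txt"));   // false
        System.out.println(isSafeFilename("../../../WINDOWS/explorer.exe"));     // false
    }
}
```

The explicit check for '\0' is redundant with the whitelist, but it keeps the intent visible if the allowed character set is later relaxed.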

I'm not sure why you're concerned with null byte injection. Java isn't like C/C++, where strings are null-terminated character arrays.
You ought to bind and validate parameters and values coming in from the web tier. How do you define "safe enough"?

You have 2 choices:
1 Scan the string (convert it to a char array first) for null bytes.
2 Upgrade to Java 8 or Java 7u40 and you are protected. (Yes, I tested it, and it works!)
In May 2013 Oracle fixed the problem: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8014846

Null byte injection in filenames was fixed in Java 7 update 40 (released around Sept. 2013). So, it's been fixed for a while now, but it WAS a problem for over a decade, and it was a NASTY vulnerability in Java. The fix is documented here: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8014846
-Dave Wichers

Does Java have a limit on the class name length?

This question came up in Spring class, which has some rather long class names. Is there a limit in the language for class name lengths?

The Java Language Specification states that identifiers are unlimited in length.
In practice though, the filesystem will limit the length of the resulting file name.

65535 characters, I believe. From the Java Virtual Machine Specification:
The length of field and method names, field and method descriptors, and other constant string values is limited to 65535 characters by the 16-bit unsigned length item of the CONSTANT_Utf8_info structure (§4.4.7). Note that the limit is on the number of bytes in the encoding and not on the number of encoded characters. UTF-8 encodes some characters using two or three bytes. Thus, strings incorporating multibyte characters are further constrained.
Source:
https://docs.oracle.com/javase/specs/jvms/se6/html/ClassFile.doc.html#88659
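Since the limit counts encoded bytes rather than characters, a quick check could look like the sketch below. The class name, constant and method are hypothetical, and the code uses standard UTF-8 as an approximation: the class file format uses a modified UTF-8 that encodes U+0000 and supplementary characters with more bytes, so this check can be slightly optimistic for such names.

```java
import java.nio.charset.StandardCharsets;

final class NameLengthCheck {
    // Largest value that fits in the 16-bit unsigned length item.
    static final int MAX_UTF8_BYTES = 65535;

    // Approximates the encoded size of the name with standard UTF-8 (an assumption;
    // the class file format's modified UTF-8 can need more bytes for some characters).
    static boolean fitsInConstantPool(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length <= MAX_UTF8_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(fitsInConstantPool("com.example.AVeryLongButPerfectlyFineClassName"));
    }
}
```

A name of 40000 two-byte characters such as 'é' would already exceed the limit, even though it has fewer than 65535 characters.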

With JDK 1.5, the practical limit for class names on Windows XP was 255 characters; longer names gave errors in the file system. This limit applied to the full name (directory + package + class).
I have not tried JDK 1.6 on Vista or Windows 7; hopefully Sun fixed it to use the NTFS limit of 8000 or so.

No. Java doesn't impose any limit on the class name. But if you are interfacing with other systems (e.g. JNI), it's better to be on the safe side.
