Java String UTF-8 limits

I'm trying to deserialize Strings directly from files, and I have a question about very long Strings: Java Strings have a character-count limit equal to Integer.MAX_VALUE, which is 2^31 - 1.
But here comes my question: what happens when I have a UTF-8 String just a little under that size, but formed by characters that take more than one byte each, and I then ask Java for the byte array?
To make it clearer, what would happen if I could run this code? (I don't have enough RAM to try):
String toPrint = "";
String string100 = "";
int max = Integer.MAX_VALUE -100;
for (int i = 0; i < 100; i += 10) {
string100 += "1234567ñ90";
}
for (int i = 0; i < max; i += 100) {
toPrint += string100;
}
System.out.println("String complete!");
byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
System.out.println(byteArray.length);
System.exit(0);
Does it print "String complete!"? Or does it break before?

Fundamentally, the limit on Strings is that the char arrays inside them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max. Strings store their characters in UTF-16, so it is the UTF-16 representation of a string that can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter.
Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 slots of a String's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8, because the UTF-8 representation is longer than the maximum array length: "ñ" takes two bytes in UTF-8, so roughly 2^31 characters of which every tenth is "ñ" encode to about 1.1 × 2^31 bytes, well over Integer.MAX_VALUE.
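If you want to predict that failure without allocating anything, a small helper along these lines (my own sketch, not part of the original answer) adds up UTF-8 bytes per code point and returns the total as a long:
// Compute a String's UTF-8 byte length without materializing the byte array.
// Assumes well-formed (correctly paired) surrogates.
static long utf8Length(String s) {
    long bytes = 0;
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (cp < 0x80) bytes += 1;         // ASCII
        else if (cp < 0x800) bytes += 2;   // e.g. 'ñ'
        else if (cp < 0x10000) bytes += 3; // rest of the BMP
        else bytes += 4;                   // supplementary planes
        i += Character.charCount(cp);
    }
    return bytes;
}
If the result is greater than Integer.MAX_VALUE, getBytes(StandardCharsets.UTF_8) cannot possibly succeed.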

Array size is also limited to Integer.MAX_VALUE (which is why String size is limited; after all, there's a char[] backing it), so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.
The end result would be an OutOfMemoryError, but creating the String in the first place would succeed.

Related

What is the maximum char value in a java program in Netbeans IDE/ what is wrong with my program?

What is the maximum Unicode value of a char in Java (in particular in the NetBeans IDE, if that makes any difference)? I've been trying to write a program that, as part of its operation, multiplies a char by a random number. According to what I've heard about the maximum Unicode value, I should be able to multiply the highest-value char I'm using (the tilde) by at least 8000 without causing overflow; however, overflow does occur in my program. Is there a difference between the maximum Unicode char value and the maximum that is available in Netbeans? In case that isn't the case, I have included my code below:
EDIT: What I want to do with this portion of the program is "encrypt" the password by multiplying each char by a random number, and I included a separate section meant to "decrypt" that result; testing with smaller numbers, I found that that part worked.
public static void main(String[] args) {
    String pass = "Password";
    String pwE = "";
    int key[] = new int[pass.length()];
    for (int i = 0; i < pass.length(); i++)
    {
        key[i] = (int) (Math.random() * 8000 + 1); /* EDIT: changed the placeholder to the actual function I'm using */
        System.out.println(key[i]);
    }
    for (int i = 0; i < pass.length(); i++)
    {
        pwE += (char) (pass.charAt(i) * key[i]);
    }
    System.out.println(pwE);
    pass = "";
    for (int i = 0; i < pwE.length(); i++)
    {
        pass += (char) (pwE.charAt(i) / key[i]);
    }
    System.out.println(pass);
}
"Is there a difference between the maximum Unicode char value and the maximum that is available in Netbeans [sic]?"
No, of course not. NetBeans doesn't have its own private, non-compliant version of Java. The maximum value of a char is always Character.MAX_VALUE, as documented.
http://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#MAX_VALUE
Your problem is very likely caused by your use of String to drive "encryption" and "decryption". You don't bother to control the string encoding, and that could conceivably create strangeness with respect to surrogate pairs and the like. You're mixing the numeric nature of char with String's use of the type to represent characters.
Since you didn't bother to share inputs, expected outputs, and actual outputs with us, we can only guess. Perhaps if you were to share sufficient information ...
A char is a 16-bit unsigned type in Java.
Its maximum value is 65535.
Your multiplication of a char by an element of key looks suspect to me. Casting the result (which will be an int) back to char causes wraparound modulo 65536.
Your suspicion of NetBeans is a red herring.
Very crudely, if your string only uses ASCII characters, then a maximum multiplier of 512 would work.
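To make the wraparound concrete, here is a minimal illustration (mine, not the answerer's):
// '~' (the tilde) has code point 126. Multiplying by 8000 exceeds 65535,
// and the cast back to char keeps only the low 16 bits.
char c = '~';                      // 126
int product = c * 8000;            // 1008000
char wrapped = (char) product;     // 1008000 % 65536 == 24960
System.out.println((int) wrapped); // prints 24960, not 1008000
Dividing 24960 by 8000 afterwards yields 3, not 126, which is exactly why the "decrypt" loop produces garbage.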

Can Java String handle 10^15 characters? [duplicate]

I'm trying The Next Palindrome problem from Sphere Online Judge (SPOJ) where I need to find a palindrome for an integer of up to a million digits. I thought about using Java's functions for reversing Strings, but would they allow for a String to be this long?
You should be able to get a String of length:
Integer.MAX_VALUE, always 2,147,483,647 (2^31 - 1)
(defined by the Java specification as the maximum size of an array, which the String class uses for internal storage)
OR
half your maximum heap size (since each character is two bytes), whichever is smaller.
I believe they can be up to 2^31-1 characters, as they are held by an internal array, and arrays are indexed by integers in Java.
While you can in theory have Integer.MAX_VALUE characters, the JVM is limited in the size of the array it can use.
public static void main(String... args) {
    for (int i = 0; i < 4; i++) {
        int len = Integer.MAX_VALUE - i;
        try {
            char[] ch = new char[len];
            System.out.println("len: " + len + " OK");
        } catch (Error e) {
            System.out.println("len: " + len + " " + e);
        }
    }
}
on Oracle Java 8 update 92 prints
len: 2147483647 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483646 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483645 OK
len: 2147483644 OK
Note: in Java 9, Strings will use a byte[], which means that characters outside Latin-1 will use more than one byte each and reduce the maximum further. If all your code points need four bytes, e.g. emoji, you will only get around 500 million characters.
Have you considered using BigDecimal instead of String to hold your numbers?
Integer.MAX_VALUE is the maximum size of a String; it also depends on your memory size. But for that problem on Sphere Online Judge, you don't have to use those functions.
Java 9 uses a byte[] to store String.value, so in Java 9 a String tops out at about 1G of (non-Latin-1) characters; Java 8, on the other hand, can have 2G-character Strings.
By characters I mean chars; some characters are not representable in the BMP (like some of the emoji), so they take more (currently 2) chars.
The heap part gets worse, my friends: a UTF-16 code point isn't guaranteed to fit in 16 bits and can expand to 32 (a surrogate pair).
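A quick illustration of that point (my example, not from the thread):
String s = "😀";                                     // U+1F600, outside the BMP
System.out.println(s.length());                      // 2 (UTF-16 code units, i.e. chars)
System.out.println(s.codePointCount(0, s.length())); // 1 (logical character)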

Represent long in least amount of characters

I need to represent both very large and very small numbers in the shortest string possible. The numbers are unsigned. I have tried just straight Base64 encoding, but for some smaller numbers, the encoded string is longer than just storing the number as a string. What would be the best way to store a very large or very small number in the shortest possible URL-safe string?
I have tried just straight Base64 encoding, but for some smaller numbers, the encoded string is longer than just storing the number as a string
Base64 encoding of binary byte data will make it longer, by about a third. It is not supposed to make it shorter, but to allow safe transport of binary data in formats that are not binary safe.
However, base 64 is more compact than decimal representation of a number (or of byte data), even if it is less compact than base 256 (the raw byte data). Encoding your numbers in base 64 directly will make them more compact than decimal. This will do it:
private static final String base64Chars =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static String encodeNumber(long x) {
    char[] buf = new char[11];
    int p = buf.length;
    do {
        buf[--p] = base64Chars.charAt((int) (x % 64));
        x /= 64;
    } while (x != 0);
    return new String(buf, p, buf.length - p);
}

static long decodeNumber(String s) {
    long x = 0;
    for (char c : s.toCharArray()) {
        int charValue = base64Chars.indexOf(c);
        if (charValue == -1) throw new NumberFormatException(s);
        x *= 64;
        x += charValue;
    }
    return x;
}
Using this encoding scheme, Long.MAX_VALUE will be the string H__________, which is 11 characters long, compared to its decimal representation 9223372036854775807, which is 19 characters long. Numbers up to about 16 million will fit in a mere 4 characters. That's about as short as you'll get it. (Technically there are two other characters which do not need to be encoded in URLs: . and ~. You can incorporate those to get base 66, which would be a smidgen shorter for some numbers, although that seems a bit pedantic.)
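A quick round trip with those two methods (my example; the output follows from the digit alphabet above):
String s = encodeNumber(1234567890L); // "BJlgLS": 6 chars vs. 10 decimal digits
long back = decodeNumber(s);          // 1234567890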
To extend Stephen C's answer, here is a piece of code to convert to base 62. You can increase the base by adding more characters to the digits String; just pick whatever characters are valid for you:
public static String toString(long n) {
    String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    int base = digits.length();
    if (n == 0) return "0";       // the loop below would return "" for 0
    String s = "";
    while (n > 0) {
        int d = (int) (n % base); // cast needed: charAt takes an int index
        s = digits.charAt(d) + s;
        n = n / base;
    }
    return s;
}
This will never make the base-62 representation longer than the decimal one.
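For instance, with the digits string above (my example):
System.out.println(toString(1234567890L)); // "1LY7VK": 6 chars vs. 10 decimal digits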
Assuming that you don't do any compression, and that you restrict yourself to URL safe characters, then the following procedure will give you the most compact encoding possible.
Make a list of all URL safe characters
Count them. Suppose you have N.
Represent your number in base N, representing 0 by the first character, 1 by the 2nd and so on.
So, what about compression ...
If you assume that the numbers you are representing are uniformly distributed across their range, then there is no real opportunity for compression.
Otherwise, there is potential for compression. If you can reduce the size of the common numbers then you can typically achieve a saving by compression. This is how Huffman encoding works.
But the downside is that compression at this level is not perfect across the range of numbers. It reduces the size of some numbers, but it inevitably increases the size of others.
So what does this mean for your use-case?
I think it means that you are looking at the problem the wrong way. You should not be aiming for a minimal encoded size for every number. You should be aiming to minimize the size on average ... averaged over the actual distribution of your numbers.
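As a concrete illustration of that trade-off (my sketch, not part of this answer), a protobuf-style base-128 varint spends a single byte on values below 128 but up to nine bytes on the largest longs, so it only pays off if small values dominate your distribution:
// Encode x as a base-128 varint: 7 payload bits per byte, with the high bit
// set on every byte except the last. Small values need few bytes.
static byte[] toVarint(long x) {
    java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
    while ((x & ~0x7FL) != 0) {
        out.write((int) ((x & 0x7F) | 0x80)); // low 7 bits + continuation bit
        x >>>= 7;
    }
    out.write((int) x);                       // final byte, continuation bit clear
    return out.toByteArray();                 // 1 byte for x < 128, 9 for Long.MAX_VALUE
}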

Efficient way to calculate byte length of a character, depending on the encoding

What's the most efficient way to calculate the byte length of a character, taking the character encoding into account? The encoding would only be known at runtime. In UTF-8, for example, characters have a variable byte length, so each character needs to be measured individually. So far I've come up with this:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
// ...
int length = new String(new char[] { c }).getBytes(encoding).length;
But this is clumsy and inefficient in a loop, since a new String needs to be created every time. I can't find another, more efficient way in the Java API. There's String#valueOf(char), but according to its source it does basically the same as the above. I imagine that this can be done with bitwise operations like bit shifting, but that's my weak point and I'm unsure how to take the encoding into account here :)
If you question the need for this, check this topic.
Update: the answer from #Bkkbrad is technically the most efficient:
char c = getCharSomehow();
String encoding = getEncodingSomehow();
CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
// ...
int length = encoder.encode(CharBuffer.wrap(new char[] { c })).limit();
However as #Stephen C pointed out, there are more problems with this. There may for example be combined/surrogate characters which needs to be taken into account as well. But that's another problem which needs to be solved in the step before this step.
Use a CharsetEncoder and reuse a CharBuffer as input and a ByteBuffer as output.
On my system, the following code takes 25 seconds to encode 100 million single characters:
Charset utf8 = Charset.forName("UTF-8");
char[] array = new char[1];
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        int len = new String(array).getBytes(utf8).length;
    }
}
However, the following code does the same thing in under 4 seconds:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
char[] array = new char[1];
CharBuffer input = CharBuffer.wrap(array);
ByteBuffer output = ByteBuffer.allocate(10);
for (int reps = 0; reps < 10000; reps++) {
    for (array[0] = 0; array[0] < 10000; array[0]++) {
        output.clear();
        input.clear();
        encoder.encode(input, output, false);
        int len = output.position();
    }
}
Edit: Why do haters gotta hate?
Here's a solution that reads from a CharBuffer and keeps track of surrogate pairs:
Charset utf8 = Charset.forName("UTF-8");
CharsetEncoder encoder = utf8.newEncoder();
CharBuffer input = // allocate in some way, or pass as parameter
ByteBuffer output = ByteBuffer.allocate(10);

int limit = input.limit();
while (input.position() < limit) {
    output.clear();
    input.mark();
    // Window of up to two chars: a potential surrogate pair.
    // (Math.min, not max: the window must not run past the data.)
    input.limit(Math.min(input.position() + 2, limit));
    if (Character.isHighSurrogate(input.get()) && !Character.isLowSurrogate(input.get())) {
        // Malformed surrogate pair; do something!
    }
    input.limit(input.position());
    input.reset();
    encoder.encode(input, output, false);
    int encodedLen = output.position();
}
If you can guarantee that the input is well-formed UTF-8, then there's no reason to find code points at all. One of the strengths of UTF-8 is that you can detect the start of a code point from any position in the string: simply search backwards until you find a byte such that (b & 0xC0) != 0x80, and you've found the start of a character. Since a UTF-8 encoded code point is at most 4 bytes (at most 6 under the obsolete original definition), you can copy the intermediate bytes into a fixed-length buffer.
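A minimal sketch of that backward scan (mine; it assumes the byte array holds well-formed UTF-8):
// Continuation bytes match the bit pattern 10xxxxxx, so step backwards
// past them; the byte we stop on is the sequence's lead byte.
static int startOfCodePoint(byte[] utf8, int pos) {
    while ((utf8[pos] & 0xC0) == 0x80) {
        pos--;
    }
    return pos;
}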
Edit: I forgot to mention, but even if you don't go with this strategy, it is not sufficient to use a Java char to store arbitrary code points, since code point values can exceed 0xFFFF. You need to store code points in an int.
It is possible that an encoding scheme could encode a given character as a variable number of bytes, depending on what comes before and after it in the character sequence. The byte length you get from encoding a single-character String is therefore not the whole answer.
(For example, you could theoretically receive Baudot / teletype characters encoded as 4 characters every 3 bytes, or you could theoretically treat UTF-16 plus a stream compressor as an encoding scheme. Yes, it's all a bit implausible, but ...)
Try Charset.forName("UTF-8").encode("string").limit(); it might be a bit more efficient, maybe not.

String's Maximum length in Java - calling length() method

In Java, what is the maximum size a String object may have, referring to the length() method call?
I know that length() returns the size of a String as a char[];
Considering that the String class' length method returns an int, the maximum length that would be returned by the method would be Integer.MAX_VALUE, which is 2^31 - 1 (or approximately 2 billion).
In terms of lengths and indexing of arrays, (such as char[], which is probably the way the internal data representation is implemented for Strings), Chapter 10: Arrays of The Java Language Specification, Java SE 7 Edition says the following:
The variables contained in an array have no names; instead they are referenced by array access expressions that use nonnegative integer index values. These variables are called the components of the array. If an array has n components, we say n is the length of the array; the components of the array are referenced using integer indices from 0 to n - 1, inclusive.
Furthermore, the indexing must be by int values, as mentioned in Section 10.4:
Arrays must be indexed by int values;
Therefore, it appears that the limit is indeed 2^31 - 1, as that is the maximum value for a nonnegative int value.
However, there probably are going to be other limitations, such as the maximum allocatable size for an array.
java.io.DataInput.readUTF() and java.io.DataOutput.writeUTF(String) say that a String object is represented by two bytes of length information followed by the modified UTF-8 representation of every character in the string. It follows that the length of a String is limited by the number of bytes of its modified UTF-8 representation when it is used with DataInput and DataOutput.
In addition, the specification of CONSTANT_Utf8_info, found in the Java Virtual Machine Specification, defines the structure as follows.
CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}
You can see that the size of the length field is two bytes.
That the return type of a certain method (e.g. String.length()) is int does not always mean that its allowed maximum value is Integer.MAX_VALUE. Instead, in most cases, int is chosen just for performance reasons. The Java language specification says that integers whose size is smaller than that of int are converted to int before calculation (if my memory serves me correctly) and it is one reason to choose int when there is no special reason.
The maximum length at compilation time is at most 65536. Note again that the length is the number of bytes of the modified UTF-8 representation, not the number of characters in a String object.
String objects may be able to hold many more characters at runtime. However, if you want to use String objects with the DataInput and DataOutput interfaces, it is better to avoid overly long String objects. I found this limitation when I implemented Objective-C equivalents of DataInput.readUTF() and DataOutput.writeUTF(String).
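A minimal demonstration of that limit (my sketch, based on the documented writeUTF behavior):
import java.io.*;

public class WriteUtfLimit {
    public static void main(String[] args) throws IOException {
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        char[] big = new char[70_000];
        java.util.Arrays.fill(big, 'a'); // 70,000 bytes in modified UTF-8
        try {
            out.writeUTF(new String(big));
        } catch (UTFDataFormatException e) {
            System.out.println(e.getMessage()); // reports the encoded string is too long
        }
    }
}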
Since arrays must be indexed with int values, the maximum length of an array is Integer.MAX_VALUE (2^31 - 1, or 2,147,483,647). This is assuming you have enough memory to hold an array of that size, of course.
I have a 2010 iMac with 8GB of RAM, running Eclipse Neon.2 Release (4.6.2) with Java 1.8.0_25. With the VM argument -Xmx6g, I ran the following code:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    try {
        sb.append('a');
    } catch (Throwable e) {
        System.out.println(i);
        break;
    }
}
System.out.println(sb.toString().length());
This prints:
Requested array size exceeds VM limit
1207959550
So, it seems that the max array size is ~1,207,959,549. Then I realized that we don't actually care if Java runs out of memory: we're just looking for the maximum array size (which seems to be a constant defined somewhere). So:
for (int i = 0; i < 1_000; i++) {
    try {
        char[] array = new char[Integer.MAX_VALUE - i];
        Arrays.fill(array, 'a');
        String string = new String(array);
        System.out.println(string.length());
    } catch (Throwable e) {
        System.out.println(e.getMessage());
        System.out.println("Last: " + (Integer.MAX_VALUE - i));
        System.out.println("Last: " + i);
    }
}
Which prints:
Requested array size exceeds VM limit
Last: 2147483647
Last: 0
Requested array size exceeds VM limit
Last: 2147483646
Last: 1
Java heap space
Last: 2147483645
Last: 2
So, it seems the max is Integer.MAX_VALUE - 2, or (2^31) - 3.
P.S. I'm not sure why my StringBuilder maxed out at 1,207,959,550 while my char[] maxed out at (2^31) - 3. AbstractStringBuilder grows its internal char[] with newCapacity = oldCapacity * 2 + 2; starting from the default capacity of 16, that sequence hits exactly 18 * 2^26 - 2 = 1,207,959,550, and the next doubling would exceed the VM's array size limit.
Apparently it's bound to an int, which is 0x7FFFFFFF (2,147,483,647).
The return type of the length() method of the String class is int.
public int length()
Refer to http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()
So the maximum value of int is 2,147,483,647.
A String is internally backed by a char array, so indexing is done within that maximum range. This means we cannot index the 2,147,483,648th member, so the maximum length of a String in Java is 2,147,483,647.
The primitive data type int is 4 bytes (32 bits) in Java. As 1 bit (the MSB) is used as the sign bit, the range is constrained to -2^31 to 2^31 - 1 (-2,147,483,648 to 2,147,483,647).
We cannot use negative values for indexing, so the usable range is 0 to 2,147,483,647.
As mentioned in Takahiko Kawasaki's answer, Java represents Unicode strings in the form of modified UTF-8, and in the JVM spec's CONSTANT_Utf8_info structure, 2 bytes are allocated to the length (not to the number of characters of the String).
To extend the answer, the putUTF8 method of the ASM JVM bytecode library contains this:
public ByteVector putUTF8(final String stringValue) {
    int charLength = stringValue.length();
    if (charLength > 65535) {
        // If the number of characters > 65535, then the UTF-8 encoded
        // length won't fit in 2 bytes no matter what.
        throw new IllegalArgumentException("UTF8 string too large");
    }
    for (int i = 0; i < charLength; ++i) {
        char charValue = stringValue.charAt(i);
        if (charValue >= '\u0001' && charValue <= '\u007F') {
            // This code point's UTF-8 encoding fits in 1 byte.
            currentData[currentLength++] = (byte) charValue;
        } else {
            // Doesn't fit in 1 byte.
            length = currentLength;
            return encodeUtf8(stringValue, i, 65535);
        }
    }
    ...
}
But when a code point needs more than 1 byte, it calls the encodeUtf8 method:
final ByteVector encodeUtf8(final String stringValue, final int offset, final int maxByteLength /* = 65535 */) {
    int charLength = stringValue.length();
    int byteLength = offset;
    for (int i = offset; i < charLength; ++i) {
        char charValue = stringValue.charAt(i);
        if (charValue >= 0x0001 && charValue <= 0x007F) {
            byteLength++;
        } else if (charValue <= 0x07FF) {
            byteLength += 2;
        } else {
            byteLength += 3;
        }
    }
    ...
}
In this sense, the maximum string length in a class file is 65535 bytes of UTF-8 encoding, not a count of chars.
You can find the modified-UTF-8 code-point ranges the JVM uses at the CONSTANT_Utf8_info link above.

Categories

Resources