String's Maximum length in Java - calling length() method

In Java, what is the maximum size a String object may have, referring to the length() method call?
I know that length() returns the size of a String as a char[].

Considering that the String class's length() method returns an int, the maximum length the method can return is Integer.MAX_VALUE, which is 2^31 - 1 (approximately 2 billion).
In terms of array lengths and indexing (such as for char[], which is probably how the internal data representation of Strings is implemented), Chapter 10 (Arrays) of The Java Language Specification, Java SE 7 Edition says the following:
The variables contained in an array have no names; instead they are referenced by array access expressions that use nonnegative integer index values. These variables are called the components of the array. If an array has n components, we say n is the length of the array; the components of the array are referenced using integer indices from 0 to n - 1, inclusive.
Furthermore, the indexing must be by int values, as mentioned in Section 10.4:
Arrays must be indexed by int values;
Therefore, it appears that the limit is indeed 2^31 - 1, as that is the maximum value for a nonnegative int value.
However, there probably are going to be other limitations, such as the maximum allocatable size for an array.

java.io.DataInput.readUTF() and java.io.DataOutput.writeUTF(String) say that a String object is represented by two bytes of length information followed by the modified UTF-8 representation of every character in the string. This means that, when a String is used with DataInput and DataOutput, its length is limited by the number of bytes in its modified UTF-8 representation.
In addition, the specification of CONSTANT_Utf8_info in the Java Virtual Machine Specification defines the structure as follows:
CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}
You can see that the length field is two bytes (u2).
That the return type of a certain method (e.g. String.length()) is int does not always mean that its allowed maximum value is Integer.MAX_VALUE. Instead, in most cases, int is chosen just for performance reasons. The Java Language Specification says that integer types smaller than int are promoted to int before calculation (if my memory serves me correctly), and that is one reason to choose int when there is no special reason to pick something else.
The maximum length at compilation time is at most 65535. Note again that this length is the number of bytes of the modified UTF-8 representation, not the number of characters in a String object.
String objects may be able to hold many more characters at runtime. However, if you want to use String objects with the DataInput and DataOutput interfaces, it is better to avoid overly long Strings. I found this limitation when I implemented Objective-C equivalents of DataInput.readUTF() and DataOutput.writeUTF(String).
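To see that limit in practice, here is a small sketch using DataOutputStream (the standard DataOutput implementation): a string whose modified UTF-8 form needs more than 65535 bytes is rejected with a UTFDataFormatException. The class name is just for the example.
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

public class WriteUtfLimit {
    public static void main(String[] args) throws IOException {
        // 70,000 ASCII chars -> 70,000 bytes of modified UTF-8, which is > 65535
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70_000; i++) {
            sb.append('a');
        }
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            out.writeUTF(sb.toString());
        } catch (UTFDataFormatException e) {
            // Exception message wording may vary by JDK version
            System.out.println(e);
        }
    }
}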

Since arrays must be indexed with int values, the maximum length of an array is Integer.MAX_VALUE (2^31 - 1, or 2,147,483,647). This assumes you have enough memory to hold an array of that size, of course.

I have a 2010 iMac with 8GB of RAM, running Eclipse Neon.2 Release (4.6.2) with Java 1.8.0_25. With the VM argument -Xmx6g, I ran the following code:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < Integer.MAX_VALUE; i++) {
    try {
        sb.append('a');
    } catch (Throwable e) {
        System.out.println(i);
        break;
    }
}
System.out.println(sb.toString().length());
This prints:
Requested array size exceeds VM limit
1207959550
So, it seems that the max array size is ~1,207,959,549. Then I realized that we don't actually care if Java runs out of memory: we're just looking for the maximum array size (which seems to be a constant defined somewhere). So:
for (int i = 0; i < 1_000; i++) {
    try {
        char[] array = new char[Integer.MAX_VALUE - i];
        Arrays.fill(array, 'a');
        String string = new String(array);
        System.out.println(string.length());
    } catch (Throwable e) {
        System.out.println(e.getMessage());
        System.out.println("Last: " + (Integer.MAX_VALUE - i));
        System.out.println("Last: " + i);
    }
}
Which prints:
Requested array size exceeds VM limit
Last: 2147483647
Last: 0
Requested array size exceeds VM limit
Last: 2147483646
Last: 1
Java heap space
Last: 2147483645
Last: 2
So, it seems the max is Integer.MAX_VALUE - 2, or (2^31) - 3
P.S. I'm not sure why my StringBuilder maxed out at 1207959550 while my char[] maxed out at (2^31)-3. It seems that AbstractStringBuilder doubles the size of its internal char[] to grow it, so that probably causes the issue.

Apparently it's bound to an int, which is 0x7FFFFFFF (2147483647).

The return type of the length() method of the String class is int.
public int length()
Refer to http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()
So the maximum value of an int is 2147483647.
A String is backed by a char array internally, so indexing is done within that maximum range. This means we cannot index the 2147483648th member, and so the maximum length of a String in Java is 2147483647.
The primitive data type int is 4 bytes (32 bits) in Java. As 1 bit (the MSB) is used as the sign bit, the range is constrained to -2^31 to 2^31 - 1 (-2147483648 to 2147483647).
We cannot use negative values for indexing, so the usable range is 0 to 2147483647.
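A quick illustration of the 32-bit bounds described above (just the standard Integer constants; nothing here is specific to String):
System.out.println(Integer.SIZE);                               // 32 bits
System.out.println(Integer.toBinaryString(Integer.MAX_VALUE));  // 31 ones; the sign bit is 0
System.out.println(Integer.MAX_VALUE);                          // 2147483647
System.out.println(Integer.MIN_VALUE);                          // -2147483648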

As mentioned in Takahiko Kawasaki's answer, Java represents Unicode strings in the form of modified UTF-8, and in the JVM spec's CONSTANT_Utf8_info structure, 2 bytes are allocated to the length (not to the number of characters of the String).
To extend the answer, the putUTF8 method of the ASM JVM bytecode library contains this:
public ByteVector putUTF8(final String stringValue) {
    int charLength = stringValue.length();
    if (charLength > 65535) {
        // If the number of characters > 65535, the UTF-8 encoded length won't fit in 2 bytes.
        throw new IllegalArgumentException("UTF8 string too large");
    }
    for (int i = 0; i < charLength; ++i) {
        char charValue = stringValue.charAt(i);
        if (charValue >= '\u0001' && charValue <= '\u007F') {
            // Unicode code-point encoding in UTF-8 fits in 1 byte.
            currentData[currentLength++] = (byte) charValue;
        } else {
            // Doesn't fit in 1 byte.
            length = currentLength;
            return encodeUtf8(stringValue, i, 65535);
        }
    }
    ...
}
But when a code point needs more than 1 byte, it calls the encodeUtf8 method:
final ByteVector encodeUtf8(final String stringValue, final int offset, final int maxByteLength /* = 65535 */) {
    int charLength = stringValue.length();
    int byteLength = offset;
    for (int i = offset; i < charLength; ++i) {
        char charValue = stringValue.charAt(i);
        if (charValue >= 0x0001 && charValue <= 0x007F) {
            byteLength++;
        } else if (charValue <= 0x07FF) {
            byteLength += 2;
        } else {
            byteLength += 3;
        }
    }
    ...
}
In this sense, the maximum string length is 65535 bytes of UTF-8 encoding, not 65535 characters.
You can find the modified UTF-8 code-point ranges the JVM uses at the CONSTANT_Utf8_info link above.
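The same 1/2/3-byte accounting can be written as a small standalone helper (a sketch, not ASM code) to estimate how many bytes a String takes in the JVM's modified UTF-8, and therefore whether it fits in the two-byte length field:
// Byte length of a String in the JVM's modified UTF-8, per the ranges used above:
// U+0001..U+007F -> 1 byte; U+0000 and U+0080..U+07FF -> 2 bytes; everything else
// (including surrogate halves) -> 3 bytes each, so supplementary characters cost 6.
static int modifiedUtf8Length(String s) {
    int byteLength = 0;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c >= 0x0001 && c <= 0x007F) {
            byteLength += 1;
        } else if (c <= 0x07FF) {
            byteLength += 2;
        } else {
            byteLength += 3;
        }
    }
    return byteLength;
}
A String can only be stored as a single CONSTANT_Utf8_info entry (or written with writeUTF) if this value is at most 65535.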

Related

StringBuilder constructor implementation vulnerable to exception

The initial capacity of a StringBuilder when initialized with an existing String or CharSequence is the length of the original text + 16 from the code in StringBuilder constructor:
super(str.length() + 16);
My query: what if the original length is close to Integer.MAX_VALUE?
Will it throw NegativeArraySizeException, or will it change int to long for proper execution?
The NegativeArraySizeException is expected here:
As per the String implementation, a String internally uses a char[] to hold the individual characters, so the maximum String length is actually dependent on the maximum char[] size.
Java internally uses int (not Integer) to index the individual locations of an array, and hence the maximum length of a String is Integer.MAX_VALUE; anything greater than that should throw an exception, as the JVM cannot index locations beyond the maximum int value.
Due to this constraint, NegativeArraySizeException is thrown for any array whose requested size overflows past the maximum allowed limit, which in this case is the maximum int value.
It will throw a NegativeArraySizeException since the integer will have wrapped around.
Effectively, it's the same as:
int len = Integer.MAX_VALUE;
// here we are trying to create an array of size -2147483633
char [] value = new char[len + 16];
it will throw NegativeArraySizeException

Java String UTF-8 limits

I'm trying to deserialize Strings from files directly and I have a question about very long Strings: Java Strings have a character count limit equal to Integer.MAX_VALUE, which is 2^31 - 1.
But here comes my question: what happens when I have a UTF-8 String with little less than that size but formed by characters with size more than 1 byte and then I ask Java to give me the byte array?
To make it clearer, what happens if I could run this code? (I haven't got enough RAM):
String toPrint = "";
String string100 = "";
int max = Integer.MAX_VALUE - 100;
for (int i = 0; i < 100; i += 10) {
    string100 += "1234567ñ90";
}
for (int i = 0; i < max; i += 100) {
    toPrint += string100;
}
System.out.println("String complete!");
byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
System.out.println(byteArray.length);
System.exit(0);
Does it print "String complete!"? Or does it break before?
Fundamentally, the limit on Strings is that the char arrays inside of them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max. Strings store their characters in UTF-16 and therefore the UTF-16 representation of a string can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter.
Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 values of a String's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8 because the UTF-8 representation of the string is longer than the maximum array length, since "ñ" requires more than one byte.
Array size is also limited to Integer.MAX_VALUE (which is why String size is limited; after all, there's a char[] backing it), so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.
The end result would be an OutOfMemoryError, but creating the String in the first place would succeed.
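On a small scale the same mismatch is easy to see (an illustrative sketch that shrinks the question's numbers drastically): 'ñ' is one char in UTF-16 but two bytes in UTF-8, so the byte[] is longer than length().
import java.nio.charset.StandardCharsets;

public class Utf8VersusChars {
    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            sb.append("1234567ñ90");
        }
        String s = sb.toString();
        System.out.println(s.length());                                 // 30 chars (UTF-16 units)
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);  // 33 bytes; each 'ñ' is 2 bytes
    }
}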

Can Java String handle 10^15 characters ? [duplicate]

I'm trying The Next Palindrome problem from Sphere Online Judge (SPOJ), where I need to find a palindrome for an integer of up to a million digits. I thought about using Java's functions for reversing Strings, but would they allow for a String to be this long?
You should be able to get a String of length:
Integer.MAX_VALUE, always 2,147,483,647 (2^31 - 1) (defined by the Java specification as the maximum size of an array, which the String class uses for internal storage),
OR
half your maximum heap size (since each character is two bytes), whichever is smaller.
I believe they can be up to 2^31-1 characters, as they are held by an internal array, and arrays are indexed by integers in Java.
While you can in theory have Integer.MAX_VALUE characters, the JVM is limited in the size of the array it can use.
public static void main(String... args) {
    for (int i = 0; i < 4; i++) {
        int len = Integer.MAX_VALUE - i;
        try {
            char[] ch = new char[len];
            System.out.println("len: " + len + " OK");
        } catch (Error e) {
            System.out.println("len: " + len + " " + e);
        }
    }
}
on Oracle Java 8 update 92 prints
len: 2147483647 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483646 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483645 OK
len: 2147483644 OK
Note: in Java 9, Strings will use a byte[], which means that multi-byte characters will use more than one byte and reduce the maximum further. If you have all four-byte code points, e.g. emojis, you will only get around 500 million characters.
Have you considered using BigDecimal instead of String to hold your numbers?
Integer.MAX_VALUE is the max size of a String; it also depends on your memory size. But for the problem on Sphere Online Judge you don't have to use those functions.
Java 9 uses byte[] to store String.value, so you can only get about 1 GB Strings in Java 9. Java 8, on the other hand, can have 2 GB Strings.
By character I mean "char"s; some characters are not representable in the BMP (like some of the emojis), so they take more (currently 2) chars.
The heap part gets worse, my friends. UTF-16 isn't guaranteed to be limited to 16 bits and can expand to 32
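To make the char-versus-code-point distinction concrete, a tiny sketch (the emoji is written with explicit surrogate escapes so the source stays ASCII):
String s = "\uD83D\uDE00";                             // U+1F600, outside the BMP
System.out.println(s.length());                        // 2: stored as a surrogate pair of chars
System.out.println(s.codePointCount(0, s.length()));   // 1: a single Unicode code point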

How to get the value of Integer.MAX_VALUE in Java without using the Integer class

I have this question that has completely stumped me.
I have to create a variable that equals Integer.MAX_VALUE... (in Java)
// The answer must contain balanced parentheses
public class Exercise {
    public static void main(String[] arg) {
        [???]
        assert (Integer.MAX_VALUE == i);
    }
}
The challenge is that the source code cannot contain the words "Integer", "Float", "Double" or any digits (0 - 9).
Here's a succinct method:
int ONE = "x".length();
int i = -ONE >>> ONE; //unsigned shift
This works because the max integer value in binary is all ones except the top (sign) bit, which is zero. But -1 in two's complement binary is all ones, so by shifting -1 one bit to the right (unsigned), you get the max value.
11111111111111111111111111111111 // -1 in two's complement
01111111111111111111111111111111 // max int (2147483647)
As others have said.
int i = Integer.MAX_VALUE;
is what you want.
Integer.MAX_VALUE, is a "static constant" inside of the "wrapper class" Integer that is simply the max value. Many classes have static constants in them that are helpful.
Here's a solution:
int ONE = "X".length();
int max = ONE;
while (max < max + ONE) {
max = max + ONE;
}
or lots of variants.
(The trick you were missing is how to "create" an integer value without using a numeric literal or a number wrapper class. Once you have created ONE, the rest is simple ...)
A bit late, but here goes:
int two = "xx".length();
int thirtyone = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx".length();
System.out.println(Math.pow(two, thirtyone)-1);
How did I go? :p
I do like that bitshift one though...
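Note that Math.pow returns a double, so the println above shows 2.147483647E9 rather than an int; to actually store the result in an int you would need a narrowing cast, something like:
int two = "xx".length();
int thirtyone = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx".length();
int i = (int) (Math.pow(two, thirtyone) - 1);   // narrowing cast yields 2147483647
assert Integer.MAX_VALUE == i;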
The issue is that the answer cannot contain: "Integer", "Float", "Double", and digits (0 - 9)
There are other things in Java which can be represented as an Integer, for example a char:
char aCharacter = 'a';
int asInt = (int) aCharacter;
System.out.println(asInt); //Output: 97
You can also add chars together in this manner:
char aCharacter = 'a';
char anotherCharacter = 'b';
int sumOfCharacters = aCharacter + anotherCharacter;
System.out.println(sumOfCharacters); //Output: 195
With this information, you should be able to work out how to get to 2147483647 on your own.
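One possible way to finish from there (a sketch, not the only route): derive 1 from a character difference, then reuse the unsigned-shift trick from the earlier answer.
int one = 'b' - 'a';    // 1, produced without any digit literals
int i = -one >>> one;   // -1 is all ones; an unsigned shift by one clears the sign bit
assert Integer.MAX_VALUE == i;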
OK, so an Integer can only take certain values. This is from MIN_VALUE to MAX_VALUE where the minimum value is negative.
If you increase an integer past this upper bound the value will wrap around and become the lowest value possible. e.g. MAX_VALUE+1 = MIN_VALUE.
Equally, if you decrease an integer past the lower bound it will wrap around and become the largest possible value. e.g. MIN_VALUE-1 = MAX_VALUE.
Therefore a simple program that instantiates an int, decrements it until it wraps around, and prints that value should give you the same value as Integer.MAX_VALUE:
public static void main(String[] arg) {
    int i = -1;
    while (i < 0) {
        i--;
    }
    System.out.println(i);
}

Bit-wise efficient uniform random number generation

I recall reading about a method for efficiently using random bits in an article on a math-oriented website, but I can't seem to get the right keywords in Google to find it anymore, and it's not in my browser history.
The gist of the problem that was being asked was to take a sequence of random numbers in the domain [domainStart, domainEnd) and efficiently use the bits of the random number sequence to project uniformly into the range [rangeStart, rangeEnd). Both the domain and the range are integers (more correctly, longs and not Z). What's an algorithm to do this?
Implementation-wise, I have a function with this signature:
long doRead(InputStream in, long rangeStart, long rangeEnd);
in is based on a CSPRNG (fed by a hardware RNG, conditioned through SecureRandom) that I am required to use; the return value must be between rangeStart and rangeEnd, but the obvious implementation of this is wasteful:
long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
    long retVal = 0;
    long range = rangeEnd - rangeStart;
    // Fill until we get to range
    for (int i = 0; (1L << (8 * i)) < range; i++) {
        int nextByte;
        do {
            nextByte = in.read();
            // but be sure we don't exceed range
        } while (retVal + ((long) nextByte << (8 * i)) >= range);
        retVal += (long) nextByte << (8 * i);
    }
    return retVal + rangeStart;
}
I believe this is effectively the same idea as (rand() * (max - min)) + min, only we're discarding bits that push us over max. Rather than use a modulo operator, which may incorrectly bias the results towards the lower values, we discard those bits and try again. Since hitting the CSPRNG may trigger re-seeding (which can block the InputStream), I'd like to avoid wasting random bits. Henry points out that this code biases against 0 and 256; Banthar demonstrates it in an example.
First edit: Henry reminded me that summation invokes the Central Limit Theorem. I've fixed the code above to get around that problem.
Second edit: Mechanical snail suggested that I look at the source for Random.nextInt(). After reading it for a while, I realized that this problem is similar to the base conversion problem. See answer below.
Your algorithm produces biased results. Let's assume rangeStart = 0 and rangeEnd = 257. If the first byte is greater than 0, that will be the result. If it's 0, the result will be either 0 or 256 with 50/50 probability. So 0 and 256 are half as likely to be chosen as any other number.
I did a simple test to confirm this:
p(0)=0.001945
p(1)=0.003827
p(2)=0.003818
...
p(254)=0.003941
p(255)=0.003817
p(256)=0.001955
I think you need to do the same as java.util.Random.nextInt and discard the whole number, instead of just the last byte.
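For reference, a self-contained sketch of the kind of test that produces numbers like the ones above. It simulates the simplified model for rangeStart = 0, rangeEnd = 257 (a non-zero first byte is used directly; a 50/50 tie-break between 0 and 256 when it is zero), using java.util.Random in place of the CSPRNG:
import java.util.Random;

public class BiasDemo {
    public static void main(String[] args) {
        Random rng = new Random();
        long[] counts = new long[257];
        int trials = 1_000_000;
        for (int t = 0; t < trials; t++) {
            int firstByte = rng.nextInt(256);
            // Non-zero first byte is accepted as-is; otherwise the surviving second
            // byte (0 or 1 after rejection) picks 0 or 256 with equal probability.
            int value = (firstByte > 0) ? firstByte : (rng.nextBoolean() ? 256 : 0);
            counts[value]++;
        }
        System.out.printf("p(0)=%f%n", (double) counts[0] / trials);
        System.out.printf("p(1)=%f%n", (double) counts[1] / trials);
        System.out.printf("p(256)=%f%n", (double) counts[256] / trials);
    }
}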
After reading the source to Random.nextInt(), I realized that this problem is similar to the base conversion problem.
Rather than converting a single symbol at a time, it would be more effective to convert blocks of input symbols at a time through an accumulator "buffer" which is large enough to represent at least one symbol in the domain and in the range. The new code looks like this:
public int[] fromStream(InputStream input, int length, int rangeLow, int rangeHigh) throws IOException {
    int[] outputBuffer = new int[length];
    // buffer is initially 0, so there is only 1 possible state it can be in
    int numStates = 1;
    long buffer = 0;
    int alphaLength = rangeHigh - rangeLow;
    // Fill outputBuffer from 0 to length
    for (int i = 0; i < length; i++) {
        // Until buffer has sufficient data filled in from input to emit one symbol in the output alphabet, fill buffer.
        fill:
        while (numStates < alphaLength) {
            // Shift buffer by 8 (*256) to mix in new data (of 8 bits)
            buffer = buffer << 8 | input.read();
            // Multiply by 256, as that's the number of states that we have possibly introduced
            numStates = numStates << 8;
        }
        // Spits out the least significant symbol in alphaLength
        outputBuffer[i] = (int) (rangeLow + (buffer % alphaLength));
        // We have consumed the least significant portion of the input.
        buffer = buffer / alphaLength;
        // Track the number of states we've introduced into buffer
        numStates = numStates / alphaLength;
    }
    return outputBuffer;
}
There is a fundamental difference between converting numbers between bases and this problem, however: to convert between bases, one needs enough information about the number to perform the calculation, since successive divisions by the target base produce remainders which are used to construct the digits in the target alphabet. In this problem, I don't really need all of that information, as long as I'm not biasing the data, which means I can do what I did in the loop labeled "fill".
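A possible usage sketch (assuming the fromStream method above is in scope and, for simplicity, callable from the same context; needs java.io.ByteArrayInputStream, java.security.SecureRandom, and java.util.Arrays, and should run in a method that declares IOException): draw 16 values in [0, 6) from a buffer of CSPRNG bytes.
SecureRandom seedSource = new SecureRandom();   // stand-in for the hardware-backed CSPRNG
byte[] raw = new byte[64];                      // far more input than 16 small outputs need
seedSource.nextBytes(raw);
int[] rolls = fromStream(new ByteArrayInputStream(raw), 16, 0, 6);  // 16 symbols in [0, 6)
System.out.println(Arrays.toString(rolls));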
