Can Java String handle 10^15 characters ? [duplicate]

I'm trying The Next Palindrome problem from Sphere Online Judge (SPOJ), where I need to find a palindrome for an integer of up to a million digits. I thought about using Java's functions for reversing Strings, but would they allow for a String to be this long?

You should be able to get a String of length
Integer.MAX_VALUE, which is always 2,147,483,647 (2^31 - 1)
(defined by the Java specification as the maximum size of an array, which the String class uses for internal storage)
OR
half your maximum heap size (since each character is two bytes), whichever is smaller.

I believe they can be up to 2^31-1 characters, as they are held by an internal array, and arrays are indexed by integers in Java.
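For the million-digit input in the question, these limits are far away: a String of 10^6 chars is tiny compared with 2^31-1. A minimal sketch, assuming hypothetical class and method names (MillionDigits, digits, reverse) that are not part of the question, which builds such a string and reverses it with StringBuilder.reverse():

```java
final class MillionDigits {
    // Builds a string of n decimal digits (0123456789 repeated); illustrative input only.
    static String digits(int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append((char) ('0' + i % 10));
        }
        return sb.toString();
    }

    // Reverses a string using the standard library.
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    public static void main(String[] args) {
        String s = digits(1_000_000);
        String r = reverse(s);
        // Both lengths are one million; no limit is hit.
        System.out.println(s.length() + " " + r.length());
    }
}
```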

While in theory you can have Integer.MAX_VALUE characters, the JVM limits the size of the array it can allocate.
public static void main(String... args) {
for (int i = 0; i < 4; i++) {
int len = Integer.MAX_VALUE - i;
try {
char[] ch = new char[len];
System.out.println("len: " + len + " OK");
} catch (Error e) {
System.out.println("len: " + len + " " + e);
}
}
}
on Oracle Java 8 update 92 prints
len: 2147483647 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483646 java.lang.OutOfMemoryError: Requested array size exceeds VM limit
len: 2147483645 OK
len: 2147483644 OK
Note: in Java 9, Strings use a byte[]. A string containing only Latin-1 characters takes one byte per char; any other string takes two bytes per char, which halves the maximum length to roughly 1 billion chars. If every code point is outside the BMP (e.g. emojis, which take two chars each), you will only get around 500 million characters.

Have you considered using BigDecimal instead of String to hold your numbers?

Integer.MAX_VALUE is the maximum size of a String, and it also depends on your memory size, but for the problem on Sphere Online Judge you don't have to use those functions.

Java 9 uses a byte[] to store String.value, so a String stored as UTF-16 can only hold about 1G chars in Java 9. Java 8, on the other hand, can hold about 2G chars.
By character I mean "char"s; some characters are not representable in the BMP (like some emojis), so they take more (currently 2) chars.

The heap part gets worse, my friends. A character in UTF-16 isn't guaranteed to fit in 16 bits; it can expand to 32.

Related

Java String UTF-8 limits

I'm trying to deserialize Strings from files directly and I have a question about very long Strings: Java Strings have a character count limit equal to Integer.MAX_VALUE, which is 2^31-1.
But here comes my question: what happens when I have a UTF-8 String with little less than that size but formed by characters with size more than 1 byte and then I ask Java to give me the byte array?
To make it clearer, what happens if I could run this code? (I haven't got RAM enough):
String toPrint = "";
String string100 = "";
int max = Integer.MAX_VALUE -100;
for (int i = 0; i < 100; i += 10) {
string100 += "1234567ñ90";
}
for (int i = 0; i < max; i += 100) {
toPrint += string100;
}
System.out.println("String complete!");
byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
System.out.println(byteArray.length);
System.exit(0);
Does it print "String complete!"? Or does it break before?

Fundamentally, the limit on Strings is that the char arrays inside of them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max. Strings store their characters in UTF-16 and therefore the UTF-16 representation of a string can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter.
Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 values of a String's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8 because the UTF-8 representation of the string is longer than the maximum array length, since "ñ" requires more than one byte.

Array size is also limited to Integer.MAX_VALUE (which is why String size is limited; after all, there's a char[] backing it), so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.
The end result would be an OutOfMemoryError, but creating the String in the first place would succeed.
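One way to see the problem in advance is to count the UTF-8 bytes as a long before calling getBytes. The following is a sketch only: the class and method names (Utf8Size, utf8Length) are made up for illustration, and it assumes that an unpaired surrogate is encoded as a single replacement byte, which is how String.getBytes(StandardCharsets.UTF_8) is documented to treat malformed input.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Size {
    // Counts the bytes that getBytes(StandardCharsets.UTF_8) would produce,
    // without allocating the array. The result is a long, so it can exceed Integer.MAX_VALUE.
    static long utf8Length(CharSequence s) {
        long bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // e.g. U+00F1 (n with tilde)
            } else if (Character.isSurrogate(c)) {
                if (Character.isHighSurrogate(c) && i + 1 < s.length()
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    bytes += 4;                   // valid surrogate pair: one 4-byte code point
                    i++;
                } else {
                    bytes += 1;                   // assumption: unpaired surrogate becomes '?'
                }
            } else {
                bytes += 3;                       // rest of the BMP
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String block = "1234567\u00f190";         // the 10-char block from the question
        long perBlock = utf8Length(block);
        // Projected size if the block were repeated to about Integer.MAX_VALUE - 100 chars.
        long projected = (Integer.MAX_VALUE - 100L) / block.length() * perBlock;
        System.out.println(perBlock + " bytes per block, fits in an array: "
                + (projected <= Integer.MAX_VALUE));
    }
}
```

The real array limit on a given JVM may be slightly below Integer.MAX_VALUE, so the comparison in main is optimistic.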

Represent long in least amount of characters

I need to represent both very large and small numbers in the shortest string possible. The numbers are unsigned. I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string. What would be the best way to most optimally store a very large or short number in the shortest string possible with it being URL safe?

I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string
Base64 encoding of binary byte data will make it longer, by about a third. It is not supposed to make it shorter, but to allow safe transport of binary data in formats that are not binary safe.
However, base 64 is more compact than decimal representation of a number (or of byte data), even if it is less compact than base 256 (the raw byte data). Encoding your numbers in base 64 directly will make them more compact than decimal. This will do it:
private static final String base64Chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
static String encodeNumber(long x) {
char[] buf = new char[11];
int p = buf.length;
do {
buf[--p] = base64Chars.charAt((int)(x % 64));
x /= 64;
} while (x != 0);
return new String(buf, p, buf.length - p);
}
static long decodeNumber(String s) {
long x = 0;
for (char c : s.toCharArray()) {
int charValue = base64Chars.indexOf(c);
if (charValue == -1) throw new NumberFormatException(s);
x *= 64;
x += charValue;
}
return x;
}
Using this encoding scheme, Long.MAX_VALUE will be the string H__________, which is 11 characters long, compared to its decimal representation 9223372036854775807 which is 19 characters long. Numbers up to about 16 million will fit in a mere 4 characters. That's about as short as you'll get it. (Technically there are two other characters which do not need to be encoded in URLs: . and ~. You can incorporate those to get base 66, which would be a smidgin shorter for some numbers, although that seems a bit pedantic.)
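Since the question says the numbers are unsigned, values at or above 2^63 arrive as negative longs, where x % 64 would yield a negative index. A hedged variant of the same idea that treats the long as an unsigned 64-bit value by using bit masks and unsigned shifts (the class name UnsignedBase64 and its method names are hypothetical):

```java
final class UnsignedBase64 {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Encodes x, interpreted as an unsigned 64-bit value, in at most 11 URL-safe characters.
    static String encode(long x) {
        char[] buf = new char[11];
        int p = buf.length;
        do {
            buf[--p] = ALPHABET.charAt((int) (x & 63)); // lowest 6 bits
            x >>>= 6;                                   // unsigned shift, so the loop terminates
        } while (x != 0);
        return new String(buf, p, buf.length - p);
    }

    // Decodes a string produced by encode. Inputs longer than 11 characters
    // would overflow silently; this sketch does not check for that.
    static long decode(String s) {
        long x = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = ALPHABET.indexOf(s.charAt(i));
            if (v < 0) throw new NumberFormatException(s);
            x = (x << 6) | v;
        }
        return x;
    }

    public static void main(String[] args) {
        long big = -1L; // 2^64 - 1 when read as unsigned
        String e = encode(big);
        System.out.println(e + " -> " + Long.toUnsignedString(decode(e)));
    }
}
```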

To extend on Stephen C's answer, here is a piece of code to convert to base 62 (you can increase the base by adding more characters to the digits String; just pick what characters are valid for you):
public static String toString(long n) {
    String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    int base = digits.length();
    // The loop below emits nothing for zero, so handle it explicitly
    if (n == 0) return "0";
    String s = "";
    while (n > 0) {
        // n % base is always below 62, so the cast to int is safe
        int d = (int) (n % base);
        s = digits.charAt(d) + s;
        n = n / base;
    }
    return s;
}
This will never result in the string representation being longer than the decimal one.

Assuming that you don't do any compression, and that you restrict yourself to URL safe characters, then the following procedure will give you the most compact encoding possible.
Make a list of all URL safe characters
Count them. Suppose you have N.
Represent your number in base N, representing 0 by the first character, 1 by the 2nd and so on.
So, what about compression ...
If you assume that the numbers you are representing are uniformly distributed across their range, then there is no real opportunity for compression.
Otherwise, there is potential for compression. If you can reduce the size of the common numbers then you can typically achieve a saving by compression. This is how Huffman encoding works.
But the downside is that compression at this level is not perfect across the range of numbers. It reduces the size of some numbers, but it inevitably increases the size of others.
So what does this mean for your use-case?
I think it means that you are looking at the problem the wrong way. You should not be aiming for a minimal encoded size for every number. You should be aiming to minimize the size on average ... averaged over the actual distribution of your numbers.

Bit-wise efficient uniform random number generation

I recall reading about a method for efficiently using random bits in an article on a math-oriented website, but I can't seem to get the right keywords in Google to find it anymore, and it's not in my browser history.
The gist of the problem that was being asked was to take a sequence of random numbers in the domain [domainStart, domainEnd) and efficiently use the bits of the random number sequence to project uniformly into the range [rangeStart, rangeEnd). Both the domain and the range are integers (more correctly, longs and not Z). What's an algorithm to do this?
Implementation-wise, I have a function with this signature:
long doRead(InputStream in, long rangeStart, long rangeEnd);
in is based on a CSPRNG (fed by a hardware RNG, conditioned through SecureRandom) that I am required to use; the return value must be between rangeStart and rangeEnd, but the obvious implementation of this is wasteful:
long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
    long retVal = 0;
    long range = rangeEnd - rangeStart;
    // Fill until we get to range
    for (int i = 0; (1L << (8 * i)) < range; i++) {
        // named b so that it does not clash with the InputStream parameter "in"
        long b = 0;
        do {
            b = in.read();
            // but be sure we don't exceed range
        } while (retVal + (b << (8 * i)) >= range);
        retVal += b << (8 * i);
    }
    return retVal + rangeStart;
}
I believe this is effectively the same idea as (rand() * (max - min)) + min, only we're discarding bits that push us over max. Rather than use a modulo operator which may incorrectly bias the results to the lower values, we discard those bits and try again. Since hitting the CSPRNG may trigger re-seeding (which can block the InputStream), I'd like to avoid wasting random bits. Henry points out that this code biases against 0 and 256 when rangeEnd is 257; Banthar demonstrates it in an example.
First edit: Henry reminded me that summation invokes the Central Limit Theorem. I've fixed the code above to get around that problem.
Second edit: Mechanical snail suggested that I look at the source for Random.nextInt(). After reading it for a while, I realized that this problem is similar to the base conversion problem. See answer below.

Your algorithm produces biased results. Let's assume rangeStart=0 and rangeEnd=257. If the first byte is greater than 0, that will be the result. If it's 0, the result will be either 0 or 256 with 50/50 probability. So 0 and 256 are half as likely to be chosen as any other number.
I did a simple test to confirm this:
p(0)=0.001945
p(1)=0.003827
p(2)=0.003818
...
p(254)=0.003941
p(255)=0.003817
p(256)=0.001955
I think you need to do the same as java.util.Random.nextInt and discard the whole number, instead of just the last byte.
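A minimal sketch of that whole-number rejection approach, under stated assumptions: the class name RejectionRead is hypothetical, the range must satisfy rangeStart < rangeEnd without overflowing a long, and masking the candidate down to the bit length of the range is a choice made here to reduce rejections, not something taken from Random.nextInt.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

final class RejectionRead {
    // Returns a value in [rangeStart, rangeEnd). Every candidate is built from fresh bytes,
    // and a candidate outside the range is discarded as a whole, so no value is favoured.
    static long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
        long range = rangeEnd - rangeStart;
        if (range <= 0) throw new IllegalArgumentException("empty or overflowing range");
        int bits = 64 - Long.numberOfLeadingZeros(range - 1); // bits needed for range - 1
        int bytes = (bits + 7) / 8;
        long mask = (bits == 0) ? 0 : (1L << bits) - 1;       // bits is at most 63 here
        while (true) {
            long candidate = 0;
            for (int i = 0; i < bytes; i++) {
                int b = in.read();
                if (b < 0) throw new EOFException("random source exhausted");
                candidate = (candidate << 8) | b;
            }
            candidate &= mask;                                // drop the unused high bits
            if (candidate < range) return rangeStart + candidate;
            // otherwise reject the whole candidate and read fresh bytes
        }
    }

    public static void main(String[] args) throws IOException {
        // Fixed bytes stand in for the CSPRNG stream, purely for demonstration.
        InputStream in = new ByteArrayInputStream(new byte[] {(byte) 0xFF, (byte) 0xFF, 0x01, 0x00});
        System.out.println(doRead(in, 0, 257));
    }
}
```

With the mask, fewer than half of the candidates are rejected on average, at the cost of throwing away the masked-off bits of the last byte.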

After reading the source to Random.nextInt(), I realized that this problem is similar to the base conversion problem.
Rather than converting a single symbol at a time, it would be more effective to convert blocks of input symbols at a time through an accumulator "buffer" which is large enough to represent at least one symbol in the domain and in the range. The new code looks like this:
public int[] fromStream(InputStream input, int length, int rangeLow, int rangeHigh) throws IOException {
int[] outputBuffer = new int[length];
// buffer is initially 0, so there is only 1 possible state it can be in
int numStates = 1;
long buffer = 0;
// Size of the output alphabet: high minus low, so that it is positive
int alphaLength = rangeHigh - rangeLow;
// Fill outputBuffer from 0 to length
for (int i = 0; i < length; i++) {
// Until buffer has sufficient data filled in from input to emit one symbol in the output alphabet, fill buffer.
fill:
while(numStates < alphaLength) {
// Shift buffer by 8 (*256) to mix in new data (of 8 bits)
buffer = buffer << 8 | input.read();
// Multiply by 256, as that's the number of states that we have possibly introduced
numStates = numStates << 8;
}
// spits out least significant symbol in alphaLength
outputBuffer[i] = (int) (rangeLow + (buffer % alphaLength));
// We have consumed the least significant portion of the input.
buffer = buffer / alphaLength;
// Track the number of states we've introduced into buffer
numStates = numStates / alphaLength;
}
return outputBuffer;
}
There is a fundamental difference between converting numbers between bases and this problem, however; in order to convert between bases, I think one needs to have enough information about the number to perform the calculation - successive divisions by the target base result in remainders which are used to construct the digits in the target alphabet. In this problem, I don't really need to know all that information, as long as I'm not biasing the data, which means I can do what I did in the loop labeled "fill."

Suggestions for compression library to get byte[] as small as possible without considering cpu expense?

Correct me if I'm approaching this wrong, but I have a queue server and a bunch of Java workers that I'm running in a cluster. My queue has work units that are very small, but there are many of them. So far my benchmarks and review of the workers have shown that I get about 200mb/second.
So I'm trying to figure out how to get more work units through my bandwidth. Currently my CPU usage is not very high (40-50%) because it can process the data faster than the network can send it. I want to get more work through the queue and am willing to pay for it via expensive compression/decompression (since half of each core is idle right now).
I have tried Java LZO and gzip, but was wondering if there was anything better (even if it's more CPU expensive)?
Updated: data is a byte[]. Basically the queue only takes it in that format, so I am using ByteArrayOutputStream to write two ints and an int[] to a byte[] format. The values in the int[] are all ints between 0 and 100 (or 1000, but the vast majority of the numbers are zeros). The lists are quite large, anywhere from 1000 to 10,000 items (again, majority zeros; never more than 100 non-zero numbers in the int[]).

It sounds like using a custom compression mechanism that exploits the structure of the data could be very efficient.
Firstly, using a short[] (16 bit data type) instead of an int[] will halve (!) the amount of data sent, you can do this because the numbers are easily between -2^15 (-32768) and 2^15-1 (32767). This is ridiculously easy to implement.
Secondly, you could use a scheme similar to run-length encoding: a positive number represents that number literally, while a negative number represents that many zeros (after taking absolute values). e.g.
[10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0] <=> [10, 40, -3, 30, -1, 100, -4]
This is harder to implement than just substituting short for int, but will provide ~80% compression in the very worst case (1000 numbers, 100 non-zero, none of which are consecutive).
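The scheme can be sketched in Java roughly as follows. This is an illustration under assumptions, not a tuned implementation: the class and method names (ZeroRunCodec, compress, expand) are hypothetical, all non-zero values are assumed to be positive, and a run of zeros is capped at Short.MAX_VALUE per entry.

```java
import java.util.Arrays;

final class ZeroRunCodec {
    // Positive entries are literal values; a negative entry -k stands for k consecutive zeros.
    static short[] compress(short[] values) {
        short[] out = new short[values.length]; // the output never exceeds the input length
        int n = 0;
        int i = 0;
        while (i < values.length) {
            if (values[i] == 0) {
                int run = 0;
                while (i < values.length && values[i] == 0 && run < Short.MAX_VALUE) {
                    run++;
                    i++;
                }
                out[n++] = (short) -run;
            } else {
                if (values[i] < 0) throw new IllegalArgumentException("negative values are not supported");
                out[n++] = values[i++];
            }
        }
        return Arrays.copyOf(out, n);
    }

    static short[] expand(short[] packed) {
        int length = 0;
        for (short s : packed) length += (s < 0) ? -s : 1;
        short[] out = new short[length];        // zero-filled by default
        int pos = 0;
        for (short s : packed) {
            if (s < 0) pos += -s;               // skip over a run of zeros
            else out[pos++] = s;
        }
        return out;
    }

    public static void main(String[] args) {
        short[] data = {10, 40, 0, 0, 0, 30, 0, 100, 0, 0, 0, 0};
        short[] packed = compress(data);
        System.out.println(Arrays.toString(packed));
        System.out.println(Arrays.equals(expand(packed), data));
    }
}
```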
I just did some simulations to work out the compression ratios. I tested the method I described above, and the one suggested by Louis Wasserman and sbridges. Both performed very well.
Assuming the length of the array and the number of non-zero numbers are both uniformly between their bounds, both methods save about 5400 ints (or shorts) on average with a compressed size of about 2.5% the original! The run-length encoding method seems to save about 1 additional int (or average compressed size that is 0.03% smaller), i.e. basically no difference, so you should use the one that is easiest to implement. The following are histograms of the compression ratios for 50000 random samples (they are very similar!).
Summary: using shorts instead of ints and one of the compression methods, you will be able to compress the data to about 1% of its original size!
For the simulation, I used the following R script:
SIZE <- 50000
lengths <- sample(1000:10000, SIZE, replace=T)
nonzeros <- sample(1:100, SIZE, replace=T)
f.rle <- function(len, nonzero) {
indexes <- sort(c(0,sample(1:len, nonzero, F)))
steps <- diff(indexes)
sum(steps > 1) + nonzero # one short per run of zeros, and one per non-zero value
}
f.index <- function(len, nonzero) {
nonzero * 2
}
# using the [value, -1 * number of zeros,...] method
rle.comprs <- mapply(f.rle, lengths, nonzeros)
print(mean(lengths - rle.comprs)) # average number of shorts saved
rle.ratios <- rle.comprs / lengths * 100
print(mean(rle.ratios)) # average compression ratio
# using the [(index, value),...] method
index.comprs <- mapply(f.index, lengths, nonzeros)
print(mean(lengths - index.comprs)) # average number of shorts saved
index.ratios <- index.comprs / lengths * 100
print(mean(index.ratios)) # average compression ratio
par(mfrow=c(2,1))
hist(rle.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Run length encoding")
hist(index.ratios, breaks=100, freq=F, xlab="Compression ratio (%)", main="Store indices")

Try encoding your data as two varints, the first varint is the index of the number in the sequence, the second is the number itself. For entries which are 0, write nothing.
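A sketch of what this could look like, with several assumptions that are not in the answer above: the class and method names are hypothetical, the varint is the common 7-bits-per-byte little-endian form, and the array length is written first as an extra varint so that the decoder knows how large an array to allocate.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class SparseVarintCodec {
    // Writes v using 7 data bits per byte; the high bit marks "more bytes follow".
    static void writeVarint(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Reads one varint from data starting at pos[0] and advances pos[0].
    static int readVarint(byte[] data, int[] pos) {
        int result = 0;
        int shift = 0;
        while (true) {
            int b = data[pos[0]++] & 0xFF;
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
        }
    }

    // Layout: varint(length), then one (varint index, varint value) pair per non-zero entry.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVarint(out, values.length);
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0) {
                writeVarint(out, i);
                writeVarint(out, values[i]);
            }
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] data) {
        int[] pos = {0};
        int[] values = new int[readVarint(data, pos)];
        while (pos[0] < data.length) {
            int index = readVarint(data, pos);
            values[index] = readVarint(data, pos);
        }
        return values;
    }

    public static void main(String[] args) {
        int[] values = new int[1000];
        values[3] = 5;
        values[500] = 100;
        byte[] packed = encode(values);
        System.out.println(packed.length + " bytes, round trip ok: "
                + Arrays.equals(decode(packed), values));
    }
}
```

Writing the gap since the previous non-zero entry instead of the absolute index would keep more indices within one byte, at the cost of slightly more bookkeeping.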

I wrote an implementation of an RLE algorithm. This operates on a byte array, so could be used as an in-line filter with your existing code. It should safely handle large or negative values should your data change in the future.
It encodes a sequence of zeros as {0}{qty} where {qty} is in the range 1..255. All other bytes are stored as the byte itself. You squish your byte array before sending, and bloat it back to full size when receiving.
public static byte[] squish(byte[] bloated) {
int size = bloated.length;
// 4 bytes for the length header plus at most 2 bytes per input byte
ByteBuffer bb = ByteBuffer.allocate(2 * size + 4);
bb.putInt(size);
int zeros = 0;
for (int i = 0; i < size; i++) {
if (bloated[i] == 0) {
if (++zeros == 255) {
bb.putShort((short) zeros);
zeros = 0;
}
} else {
if (zeros > 0) {
bb.putShort((short) zeros);
zeros = 0;
}
bb.put(bloated[i]);
}
}
if (zeros > 0) {
bb.putShort((short) zeros);
zeros = 0;
}
size = bb.position();
byte[] buf = new byte[size];
bb.rewind();
bb.get(buf, 0, size).array();
return buf;
}
public static byte[] bloat(byte[] squished) {
ByteBuffer bb = ByteBuffer.wrap(squished);
byte[] bloated = new byte[bb.getInt()];
int pos = 0;
while (bb.position() < bb.capacity()) {
byte value = bb.get();
if (value == 0) {
bb.position(bb.position() - 1);
pos += bb.getShort();
} else {
bloated[pos++] = value;
}
}
return bloated;
}

I've been impressed with BZIP2, compared with 7z and gzip. I haven't personally tried this Java implementation, but it looks like it would be easy to substitute your GZIP call for this one and verify the results.
http://www.kohsuke.org/bzip2

You should probably try all the major ones on your data stream and see which works best. You should also consider that some algorithms will take longer to run, adding more latency to the queue. This may or may not be a problem depending on your application.
You can sometimes get better compression if you know something about the data. (dbaupp's answer covers this approach nicely)
This comparison of compression algorithms might be useful. From the article:

String's Maximum length in Java - calling length() method

In Java, what is the maximum size a String object may have, referring to the length() method call?
I know that length() returns the size of a String as the length of its char[].

Considering the String class' length method returns an int, the maximum length that would be returned by the method would be Integer.MAX_VALUE, which is 2^31 - 1 (or approximately 2 billion.)
In terms of lengths and indexing of arrays, (such as char[], which is probably the way the internal data representation is implemented for Strings), Chapter 10: Arrays of The Java Language Specification, Java SE 7 Edition says the following:
The variables contained in an array have no names; instead they are referenced by array access expressions that use nonnegative integer index values. These variables are called the components of the array. If an array has n components, we say n is the length of the array; the components of the array are referenced using integer indices from 0 to n - 1, inclusive.
Furthermore, the indexing must be by int values, as mentioned in Section 10.4:
Arrays must be indexed by int values;
Therefore, it appears that the limit is indeed 2^31 - 1, as that is the maximum value for a nonnegative int value.
However, there probably are going to be other limitations, such as the maximum allocatable size for an array.

java.io.DataInput.readUTF() and java.io.DataOutput.writeUTF(String) say that a String object is represented by two bytes of length information and the modified UTF-8 representation of every character in the string. This implies that the length of a String is limited by the number of bytes of the modified UTF-8 representation of the string when used with DataInput and DataOutput.
In addition, the specification of CONSTANT_Utf8_info found in the Java virtual machine specification defines the structure as follows.
CONSTANT_Utf8_info {
u1 tag;
u2 length;
u1 bytes[length];
}
You can find that the size of 'length' is two bytes.
That the return type of a certain method (e.g. String.length()) is int does not always mean that its allowed maximum value is Integer.MAX_VALUE. Instead, in most cases, int is chosen just for performance reasons. The Java language specification says that integers whose size is smaller than that of int are converted to int before calculation (if my memory serves me correctly) and it is one reason to choose int when there is no special reason.
The maximum length at compilation time is at most 65535, the largest value a two-byte length can hold. Note again that the length is the number of bytes of the modified UTF-8 representation, not the number of characters in a String object.
String objects may be able to have many more characters at runtime. However, if you want to use String objects with the DataInput and DataOutput interfaces, it is better to avoid using String objects that are too long. I found this limitation when I implemented Objective-C equivalents of DataInput.readUTF() and DataOutput.writeUTF(String).
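The DataOutput limit can be observed directly, because DataOutputStream.writeUTF throws UTFDataFormatException when the encoded form exceeds 65535 bytes. A small sketch, with hypothetical helper names (WriteUtfLimit, repeat, fitsInWriteUTF):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class WriteUtfLimit {
    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    // True if writeUTF accepts s, i.e. its modified UTF-8 form fits in the two-byte length field.
    static boolean fitsInWriteUTF(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return true;
        } catch (UTFDataFormatException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fitsInWriteUTF(repeat('a', 65535)));      // 65535 bytes
        System.out.println(fitsInWriteUTF(repeat('a', 65536)));      // 65536 bytes
        System.out.println(fitsInWriteUTF(repeat('\u00f1', 32768))); // 32768 chars, 65536 bytes
    }
}
```

The last case shows that the limit is counted in encoded bytes: 32768 two-byte characters are already too many.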

Since arrays must be indexed with integers, the maximum length of an array is Integer.MAX_VALUE (2^31-1, or 2 147 483 647). This is assuming you have enough memory to hold an array of that size, of course.

I have a 2010 iMac with 8GB of RAM, running Eclipse Neon.2 Release (4.6.2) with Java 1.8.0_25. With the VM argument -Xmx6g, I ran the following code:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < Integer.MAX_VALUE; i++) {
try {
sb.append('a');
} catch (Throwable e) {
System.out.println(i);
break;
}
}
System.out.println(sb.toString().length());
This prints:
Requested array size exceeds VM limit
1207959550
So, it seems that the max array size is ~1,207,959,549. Then I realized that we don't actually care if Java runs out of memory: we're just looking for the maximum array size (which seems to be a constant defined somewhere). So:
for (int i = 0; i < 1_000; i++) {
try {
char[] array = new char[Integer.MAX_VALUE - i];
Arrays.fill(array, 'a');
String string = new String(array);
System.out.println(string.length());
} catch (Throwable e) {
System.out.println(e.getMessage());
System.out.println("Last: " + (Integer.MAX_VALUE - i));
System.out.println("Last: " + i);
}
}
Which prints:
Requested array size exceeds VM limit
Last: 2147483647
Last: 0
Requested array size exceeds VM limit
Last: 2147483646
Last: 1
Java heap space
Last: 2147483645
Last: 2
So, it seems the max is Integer.MAX_VALUE - 2, or (2^31) - 3
P.S. I'm not sure why my StringBuilder maxed out at 1207959550 while my char[] maxed out at (2^31)-3. It seems that AbstractStringBuilder doubles the size of its internal char[] to grow it, so that probably causes the issue.

Apparently it's bound to an int, whose maximum value is 0x7FFFFFFF (2147483647).

The return type of the length() method of the String class is int.
public int length()
Refer to http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#length()
So the maximum value of int is 2147483647.
String is treated as a char array internally, so indexing is done within the maximum range.
This means we cannot index the 2147483648th member, so the maximum length of a String in Java is 2147483647.
The primitive data type int is 4 bytes (32 bits) in Java. As 1 bit (the MSB) is used as a sign bit, the range is constrained to -2^31 to 2^31-1 (-2147483648 to 2147483647).
We cannot use negative values for indexing, so obviously the range we can use is from 0 to 2147483647.

As mentioned in Takahiko Kawasaki's answer, Java represents Unicode strings in class files in the form of modified UTF-8, and in the JVM specification's CONSTANT_Utf8_info structure, 2 bytes are allocated to the length (which is the number of bytes, not the number of characters, of the String).
To extend the answer, the putUTF8 method of the ASM JVM bytecode library contains this:
public ByteVector putUTF8(final String stringValue) {
int charLength = stringValue.length();
if (charLength > 65535) {
// If the number of characters exceeds 65535, the UTF-8 encoded length cannot fit in 2 bytes either.
throw new IllegalArgumentException("UTF8 string too large");
}
for (int i = 0; i < charLength; ++i) {
char charValue = stringValue.charAt(i);
if (charValue >= '\u0001' && charValue <= '\u007F') {
// Unicode code-point encoding in utf-8 fits in 1 byte.
currentData[currentLength++] = (byte) charValue;
} else {
// doesnt fit in 1 byte.
length = currentLength;
return encodeUtf8(stringValue, i, 65535);
}
}
...
}
But when a code point needs more than 1 byte, it calls the encodeUtf8 method:
final ByteVector encodeUtf8(final String stringValue, final int offset, final int maxByteLength /*= 65535 */) {
int charLength = stringValue.length();
int byteLength = offset;
for (int i = offset; i < charLength; ++i) {
char charValue = stringValue.charAt(i);
if (charValue >= 0x0001 && charValue <= 0x007F) {
byteLength++;
} else if (charValue <= 0x07FF) {
byteLength += 2;
} else {
byteLength += 3;
}
}
...
}
In this sense, the maximum string length is 65535 bytes, i.e. the length of the UTF-8 encoding, not the char count.
You can find the modified-Unicode code-point range of JVM, from the above utf8 struct link.
