I need to represent both very large and small numbers in the shortest string possible. The numbers are unsigned. I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string. What is the best way to store a very large or very small number in the shortest possible URL-safe string?
I have tried just straight Base64 encode, but for some smaller numbers, the encoded string is longer than just storing the number as a string
Base64 encoding of binary byte data will make it longer, by about a third. It is not supposed to make it shorter, but to allow safe transport of binary data in formats that are not binary safe.
However, base 64 is more compact than decimal representation of a number (or of byte data), even if it is less compact than base 256 (the raw byte data). Encoding your numbers in base 64 directly will make them more compact than decimal. This will do it:
private static final String base64Chars =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
static String encodeNumber(long x) {
char[] buf = new char[11];
int p = buf.length;
do {
buf[--p] = base64Chars.charAt((int)(x % 64));
x /= 64;
} while (x != 0);
return new String(buf, p, buf.length - p);
}
static long decodeNumber(String s) {
long x = 0;
for (char c : s.toCharArray()) {
int charValue = base64Chars.indexOf(c);
if (charValue == -1) throw new NumberFormatException(s);
x *= 64;
x += charValue;
}
return x;
}
Using this encoding scheme, Long.MAX_VALUE will be the string H__________, which is 11 characters long, compared to its decimal representation 9223372036854775807 which is 19 characters long. Numbers up to about 16 million will fit in a mere 4 characters. That's about as short as you'll get it. (Technically there are two other characters which do not need to be encoded in URLs: . and ~. You can incorporate those to get base 66, which would be a smidgin shorter for some numbers, although that seems a bit pedantic.)
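Since the question says the numbers are unsigned, here is a minimal sketch of the same idea that treats the long as an unsigned 64-bit value by using bit operations instead of % and /. The class name UnsignedBase64 and its method names are made up for this illustration; the alphabet is the one from the code above.

```java
final class UnsignedBase64 {
    private static final String CHARS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    // Encodes x as an unsigned 64-bit number; 64 bits need at most 11 base-64 digits.
    static String encode(long x) {
        char[] buf = new char[11];
        int p = buf.length;
        do {
            buf[--p] = CHARS.charAt((int) (x & 63)); // lowest 6 bits = one digit
            x >>>= 6;                                // unsigned shift, so negative longs terminate
        } while (x != 0);
        return new String(buf, p, buf.length - p);
    }

    // Decodes back to the same bit pattern. This sketch does not detect
    // 11-character inputs whose leading digit exceeds 15 (they silently overflow).
    static long decode(String s) {
        if (s.isEmpty() || s.length() > 11) throw new NumberFormatException(s);
        long x = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = CHARS.indexOf(s.charAt(i));
            if (v < 0) throw new NumberFormatException(s);
            x = (x << 6) | v;
        }
        return x;
    }
}
```

With this variant, the all-ones bit pattern (2^64 - 1 when read as unsigned) still fits in 11 characters, and non-negative values encode exactly as before.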
To extend on Stephen C's answer, here is a piece of code to convert to base 62 (you can increase the base by adding more characters to the digits String; just pick whichever characters are valid for you):
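public static String toString(long n) {
String digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
int base = digits.length();
// Zero has no digits to emit in the loop below, so handle it explicitly
if (n == 0) return "0";
String s = "";
while (n > 0) {
// n % base is always below 62, so the cast to int is safe
int d = (int) (n % base);
s = digits.charAt(d) + s;
n = n / base;
}
return s;
}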
This will never result in the string representation being longer than the decimal one.
Assuming that you don't do any compression, and that you restrict yourself to URL safe characters, then the following procedure will give you the most compact encoding possible.
Make a list of all URL safe characters
Count them. Suppose you have N.
Represent your number in base N, representing 0 by the first character, 1 by the 2nd and so on.
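The three steps above can be sketched as follows. This is only an illustration under some assumptions: it takes "URL safe" to mean the 66 unreserved characters of RFC 3986 (letters, digits, and - . _ ~), it uses BigInteger so that arbitrarily large unsigned values work, and the class name UrlSafeBase and the ordering of the alphabet are arbitrary choices made here.

```java
import java.math.BigInteger;

final class UrlSafeBase {
    // Step 1: the list of URL-safe characters (assumed: RFC 3986 unreserved set).
    static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-._~";
    // Step 2: count them; N is 66 for this alphabet.
    static final BigInteger BASE = BigInteger.valueOf(ALPHABET.length());

    // Step 3: write the number in base N, most significant digit first.
    static String encode(BigInteger n) {
        if (n.signum() < 0) throw new IllegalArgumentException("unsigned values only");
        if (n.signum() == 0) return ALPHABET.substring(0, 1);
        StringBuilder sb = new StringBuilder();
        while (n.signum() > 0) {
            BigInteger[] qr = n.divideAndRemainder(BASE);
            sb.append(ALPHABET.charAt(qr[1].intValue())); // remainder = next digit
            n = qr[0];
        }
        return sb.reverse().toString();
    }

    static BigInteger decode(String s) {
        BigInteger n = BigInteger.ZERO;
        for (int i = 0; i < s.length(); i++) {
            int v = ALPHABET.indexOf(s.charAt(i));
            if (v < 0) throw new NumberFormatException(s);
            n = n.multiply(BASE).add(BigInteger.valueOf(v));
        }
        return n;
    }
}
```

If some of these characters cause trouble in your URLs (a lone "." as a path segment, for example), remove them from ALPHABET and the rest of the code adapts to the smaller base.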
So, what about compression ...
If you assume that the numbers you are representing are uniformly distributed across their range, then there is no real opportunity for compression.
Otherwise, there is potential for compression. If you can reduce the size of the common numbers then you can typically achieve a saving by compression. This is how Huffman encoding works.
But the downside is that compression at this level is not perfect across the range of numbers. It reduces the size of some numbers, but it inevitably increases the size of others.
So what does this mean for your use-case?
I think it means that you are looking at the problem the wrong way. You should not be aiming for a minimal encoded size for every number. You should be aiming to minimize the size on average ... averaged over the actual distribution of your numbers.
The Unicode code point of the UCS-4 character '🤣' is 0001f923; it is automatically changed to the corresponding value \uD83E\uDD23 when pasted into Java code in IntelliJ IDEA.
Java only supports UCS-2, so a transformation from UCS-4 to UCS-2 takes place.
I want to know the logic of this transformation, but I haven't found any material about it.
https://en.wikipedia.org/wiki/UTF-16#U+010000_to_U+10FFFF
U+010000 to U+10FFFF
0x10000 is subtracted from the code point (U), leaving a 20-bit number (U') in the range 0x00000–0xFFFFF. U is defined to be no greater than 0x10FFFF.
The high ten bits (in the range 0x000–0x3FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate (W1), which will be in the range 0xD800–0xDBFF.
The low ten bits (also in the range 0x000–0x3FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate (W2), which will be in the range 0xDC00–0xDFFF.
Now with input code point \U1F923:
\U1F923 - \U10000 = \UF923
\UF923 = 1111100100100011 = 00001111100100100011 = [0000111110][0100100011] = [\U3E][\U123]
\UD800 + \U3E = \UD83E
\UDC00 + \U123 = \UDD23
The result: \UD83E\UDD23
Programming:
public static void main(String[] args) {
int input = 0x1f923;
int x = input - 0x10000;
int highTenBits = x >> 10;
int lowTenBits = x & ((1 << 10) - 1);
int high = highTenBits + 0xd800;
int low = lowTenBits + 0xdc00;
System.out.println(String.format("[%x][%x]", high, low));
}
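As a cross-check, the JDK already exposes this calculation through java.lang.Character. The sketch below (the class name SurrogateCheck is made up for this example) uses Character.highSurrogate and Character.lowSurrogate, available since Java 7, and formats the result the same way as the manual computation.

```java
final class SurrogateCheck {
    // Returns the UTF-16 surrogate pair of a supplementary code point as "[high][low]" in hex.
    static String surrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("code point is not above U+FFFF");
        }
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("[%x][%x]", (int) high, (int) low);
    }
}
```

Calling SurrogateCheck.surrogates(0x1f923) returns "[d83e][dd23]", matching the manual computation, and Character.toCodePoint(high, low) reverses the transformation.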
Though String contains Unicode as a char array where char is a two byte UTF-16BE encoding, there also is support for UCS4.
UCS4: UTF-32, "code points":
Unicode code points, UCS4, are represented in java as int.
int[] ucs4 = new int[] {0x0001_f923};
String s = new String(ucs4, 0, ucs4.length);
ucs4 = s.codePoints().toArray();
There are encodings, transformations, of code points to UTF-16 and UTF-8 which require longer sequences of respectively 2-byte or 1-byte values.
The encoding is chosen such that the 2/1-byte values will be different from any other value. That means that such a value will not erroneously match "/" or any other string search. That is realized by high bits starting with 1... and then bits of the code point in big-endian format (most significant first).
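To see that property for UTF-8, here is a small sketch (the class name Utf8Bytes is made up for this example) that prints the UTF-8 bytes of a code point in hex. For a non-ASCII code point every byte has its top bit set, so none of them can be mistaken for an ASCII character such as "/".

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bytes {
    // Returns the UTF-8 encoding of one code point as space-separated hex bytes.
    static String hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }
}
```

For U+1F923 this gives "f0 9f a4 a3": a lead byte starting with 11110 followed by three continuation bytes starting with 10, whereas "/" (U+002F) stays the single byte "2f".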
Rather than searching for UCS4 and UCS2 a search for UTF-16 will yield info on the algorithms used.
I'm trying to deserialize Strings from files directly and I have a question about very long Strings: Java Strings have a character count limit equal to Integer.MAX_VALUE, which is 2^31-1.
But here comes my question: what happens when I have a UTF-8 String with little less than that size but formed by characters with size more than 1 byte and then I ask Java to give me the byte array?
To make it clearer, what happens if I could run this code? (I haven't got RAM enough):
String toPrint = "";
String string100 = "";
int max = Integer.MAX_VALUE -100;
for (int i = 0; i < 100; i += 10) {
string100 += "1234567ñ90";
}
for (int i = 0; i < max; i += 100) {
toPrint += string100;
}
System.out.println("String complete!");
byte[] byteArray = toPrint.getBytes(StandardCharsets.UTF_8);
System.out.println(byteArray.length);
System.exit(0);
Does it print "String complete!"? Or does it break before?
Fundamentally, the limit on Strings is that the char arrays inside of them can't be longer than the maximum array length, which is roughly Integer.MAX_VALUE and greater than your variable max. Strings store their characters in UTF-16 and therefore the UTF-16 representation of a string can't exceed the maximum array length. The number of bytes in UTF-8 and the number of logical characters (Unicode code points, or UTF-32 characters) ultimately don't matter.
Now let's move to your particular example. Since each of the 10 characters in "1234567ñ90" is a single UTF-16 value, that string takes up 10 values of a String's char array. Despite your code's horrible performance and high memory requirement, it should eventually get to "String complete!" if there is sufficient available memory. However, it will break when converting to UTF-8 because the UTF-8 representation of the string is longer than the maximum array length, since "ñ" requires more than one byte.
Array size is also limited to Integer.MAX_VALUE (which is why String size is limited, after all there's a char[] backing it) , so it's impossible to get the byte array if the encoding uses more bytes than that, no matter what the size of the String is in characters.
The end result would be an OutOfMemoryError, but creating the String in the first place would succeed.
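If you need to know in advance whether getBytes could succeed, one option is to count the UTF-8 bytes as a long without allocating the array. This is only a sketch: the class and method names are made up here, and it assumes the string contains no unpaired surrogates (an unpaired surrogate is counted as 3 bytes, which differs from what getBytes actually emits for it).

```java
final class Utf8Length {
    // Counts how many bytes the UTF-8 encoding of s would need, as a long,
    // so the result can exceed Integer.MAX_VALUE without overflowing.
    static long utf8Length(CharSequence s) {
        long total = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = Character.codePointAt(s, i);
            if (cp < 0x80) total += 1;          // ASCII
            else if (cp < 0x800) total += 2;    // e.g. "ñ"
            else if (cp < 0x10000) total += 3;  // rest of the Basic Multilingual Plane
            else total += 4;                    // supplementary code points
            i += Character.charCount(cp);
        }
        return total;
    }
}
```

Each 10-character chunk "1234567ñ90" needs 11 bytes, so a string close to the maximum length made of such chunks needs about 10% more bytes than a byte[] can hold.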
I'm using below method to encode given text.
static long encodeText(String text) {
long l = 31;
for (int i = 0; i < text.length(); i++) {
l = l * 47 + text.getBytes()[i] % 97;
}
return l;
}
When I call the above method as encodeText("stackoverflow"), it returns the encoded value 3818417496786881978.
Now I want to provide the encoded value and get the String back. For example, if I pass 3818417496786881978 to decodeText(long encoded), I need to get stackoverflow as output.
static String decodeText(long encoded) {
String str = null;
// decode steps here
return str;
}
How can I do this?
Think this through logically: the clear text "stackoverflow", when represented as 7-bit ASCII, carries 13 times 7 bits (= 91 bits) of information. That's more than a long (64 bits) can hold. So your encoding will lose information, making decoding impossible.
That should be also quite apparent from the formula you use:
l = l * 47 + text.getBytes()[i] % 97;
For each character you get a number between 0 and 96. You are already losing information in the modulo, which reduces the input to 97 possible values (e.g. you can no longer distinguish between the bytes 1 and 98 after the modulo). Then you multiply your long by a number less than 97 (47), so two consecutive characters will overlap in terms of information distribution in the cyphertext.
Finally, after adding more and more characters, the long simply overflows and the topmost bits are lost.
In conclusion: If you want to decode the cyphertext ever again, fix the loss of information in these three places.
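For illustration, here is a sketch of what a lossless variant could look like if the input is restricted. It rests on assumptions that are not in the question: only the letters a-z are allowed and at most 13 of them. Each letter becomes a digit 1..26 in base 27, so the multiplier exceeds every digit and 27^13 - 1 still fits in a long, which addresses all three sources of loss. The class name LowercaseCodec is made up for this example.

```java
final class LowercaseCodec {
    static final int MAX_LENGTH = 13; // 27^13 - 1 is still below Long.MAX_VALUE

    static long encodeText(String text) {
        if (text.length() > MAX_LENGTH) throw new IllegalArgumentException("too long: " + text);
        long l = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 'a' || c > 'z') throw new IllegalArgumentException("only a-z supported");
            // Digits run from 1 to 26, so a leading 'a' is not lost the way a leading 0 would be.
            l = l * 27 + (c - 'a' + 1);
        }
        return l;
    }

    // Assumes encoded was produced by encodeText; other values may decode to garbage.
    static String decodeText(long encoded) {
        StringBuilder sb = new StringBuilder();
        while (encoded > 0) {
            sb.append((char) ('a' + encoded % 27 - 1)); // last base-27 digit = last letter
            encoded /= 27;
        }
        return sb.reverse().toString();
    }
}
```

With these restrictions decodeText(encodeText("stackoverflow")) gives back "stackoverflow"; longer or richer inputs need a wider type such as BigInteger or a byte array.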
I recall reading about a method for efficiently using random bits in an article on a math-oriented website, but I can't seem to get the right keywords in Google to find it anymore, and it's not in my browser history.
The gist of the problem that was being asked was to take a sequence of random numbers in the domain [domainStart, domainEnd) and efficiently use the bits of the random number sequence to project uniformly into the range [rangeStart, rangeEnd). Both the domain and the range are integers (more correctly, longs and not Z). What's an algorithm to do this?
Implementation-wise, I have a function with this signature:
long doRead(InputStream in, long rangeStart, long rangeEnd);
in is based on a CSPRNG (fed by a hardware RNG, conditioned through SecureRandom) that I am required to use; the return value must be between rangeStart and rangeEnd, but the obvious implementation of this is wasteful:
long doRead(InputStream in, long rangeStart, long rangeEnd) throws IOException {
long retVal = 0;
long range = rangeEnd - rangeStart;
// Fill until we get to range
for (int i = 0; (1L << (8 * i)) < range; i++) {
// next byte from the stream (named b so it does not shadow the stream parameter "in")
long b = 0;
do {
b = in.read();
// but be sure we don't exceed range
} while (retVal + (b << (8 * i)) >= range);
retVal += b << (8 * i);
}
return retVal + rangeStart;
}
I believe this is effectively the same idea as (rand() * (max - min)) + min, only we're discarding bits that push us over max. Rather than use a modulo operator which may incorrectly bias the results to the lower values, we discard those bits and try again. Since hitting the CSPRNG may trigger re-seeding (which can block the InputStream), I'd like to avoid wasting random bits. Henry points out that this code biases against 0 and 256; Banthar demonstrates it in an example.
First edit: Henry reminded me that summation invokes the Central Limit Theorem. I've fixed the code above to get around that problem.
Second edit: Mechanical snail suggested that I look at the source for Random.nextInt(). After reading it for a while, I realized that this problem is similar to the base conversion problem. See answer below.
Your algorithm produces biased results. Let's assume rangeStart=0 and rangeEnd=257. If the first byte is greater than 0, that will be the result. If it's 0, the result will be either 0 or 256 with 50/50 probability. So 0 and 256 are half as likely to be chosen as any other number.
I did a simple test to confirm this:
p(0)=0.001945
p(1)=0.003827
p(2)=0.003818
...
p(254)=0.003941
p(255)=0.003817
p(256)=0.001955
I think you need to do the same as java.util.Random.nextInt and discard the whole number, instead of just the last byte.
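A minimal sketch of that "discard the whole number" approach, reading from an InputStream, might look like this. It is not the code from java.util.Random; the class name RejectionSampler and the helper sampleFrom (which exists only to make the example easy to try with a fixed byte array) are made up here. It is unbiased if the stream bytes are uniform, but it does waste the bits of rejected candidates.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class RejectionSampler {
    // Returns a value in [rangeStart, rangeEnd), rejecting whole candidates that fall outside.
    static long read(InputStream in, long rangeStart, long rangeEnd) throws IOException {
        long range = rangeEnd - rangeStart;
        if (range <= 0) throw new IllegalArgumentException("empty or overflowing range");
        int bits = 64 - Long.numberOfLeadingZeros(range - 1); // bits needed for 0..range-1
        if (bits == 0) return rangeStart;                     // only one possible value
        int bytes = (bits + 7) / 8;
        long mask = (1L << bits) - 1;                         // bits is at most 63 here
        while (true) {
            long candidate = 0;
            for (int i = 0; i < bytes; i++) {
                int b = in.read();
                if (b < 0) throw new EOFException("random source exhausted");
                candidate = (candidate << 8) | b;
            }
            candidate &= mask;                                // drop the unused high bits
            if (candidate < range) return rangeStart + candidate;
            // otherwise discard the whole candidate and draw a fresh one
        }
    }

    // Demonstration helper: runs read() over a fixed byte array.
    static long sampleFrom(byte[] data, long rangeStart, long rangeEnd) {
        try {
            return read(new ByteArrayInputStream(data), rangeStart, rangeEnd);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the range [0, 257) each candidate uses 9 bits taken from 2 bytes, so roughly half of the candidates are rejected; that is the price for every value, including 0 and 256, being equally likely.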
After reading the source to Random.nextInt(), I realized that this problem is similar to the base conversion problem.
Rather than converting a single symbol at a time, it would be more effective to convert blocks of input symbols at a time through an accumulator "buffer" which is large enough to represent at least one symbol in the domain and in the range. The new code looks like this:
public int[] fromStream(InputStream input, int length, int rangeLow, int rangeHigh) throws IOException {
int[] outputBuffer = new int[length];
// buffer is initially 0, so there is only 1 possible state it can be in
int numStates = 1;
long buffer = 0;
int alphaLength = rangeHigh - rangeLow;
// Fill outputBuffer from 0 to length
for (int i = 0; i < length; i++) {
// Until buffer has sufficient data filled in from input to emit one symbol in the output alphabet, fill buffer.
fill:
while(numStates < alphaLength) {
// Shift buffer by 8 (*256) to mix in new data (of 8 bits)
buffer = buffer << 8 | input.read();
// Multiply by 256, as that's the number of states that we have possibly introduced
numStates = numStates << 8;
}
// spits out least significant symbol in alphaLength
outputBuffer[i] = (int) (rangeLow + (buffer % alphaLength));
// We have consumed the least significant portion of the input.
buffer = buffer / alphaLength;
// Track the number of states we've introduced into buffer
numStates = numStates / alphaLength;
}
return outputBuffer;
}
There is a fundamental difference between converting numbers between bases and this problem, however; in order to convert between bases, I think one needs to have enough information about the number to perform the calculation - successive divisions by the target base result in remainders which are used to construct the digits in the target alphabet. In this problem, I don't really need to know all that information, as long as I'm not biasing the data, which means I can do what I did in the loop labeled "fill."
I'm currently trying to parse some long values stored as Strings in java, the problem I have is this:
String test = "fffff8000261e000";
long number = Long.parseLong(test, 16);
This throws a NumberFormatException:
java.lang.NumberFormatException: For input string: "fffff8000261e000"
However, if I knock the first 'f' off the string, it parses it fine.
I'm guessing this is because the number is large and what I'd normally do is put an 'L' on the end of the long to fix that problem. I can't however work out the best way of doing that when parsing a long from a string.
Can anyone offer any advice?
Thanks
There are two different ways of answering your question, depending on exactly what sort of behavior you're really looking for.
Answer #1: As other people have pointed out, your string (interpreted as a positive hexadecimal integer) is too big for the Java long type. So if you really need (positive) integers that big, then you'll need to use a different type, perhaps java.math.BigInteger, which also has a constructor taking a String and a radix.
Answer #2: I wonder, though, if your string represents the "raw" bytes of the long. In your example it would represent a negative number. If that's the case, then Java's built-in long parser doesn't handle values where the high bit is set (i.e. where the first digit of a 16 digit string is greater than 7).
If you're in case #2, then here is one (pretty inefficient) way of handling it:
String test = "fffff8000261e000";
long number = new java.math.BigInteger(test, 16).longValue();
which produces the value -8796053053440. (If your string is more than 16 hex digits long, it would silently drop any higher bits.)
If efficiency is a concern, you could write your own bit-twiddling routine that takes the hex digits off the end of the string two at a time, perhaps building a byte array, then converting to long. Some similar code is here:
How to convert a Java Long to byte[] for Cassandra?
The primitive long variable can hold values in the range from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 inclusive.
The calculation shows that fffff8000261e000 hexadecimal is 18,446,735,277,656,498,176 decimal, which is obviously out of bounds. By contrast, ffff8000261e000 hexadecimal (the same string with the first 'f' removed) is 1,152,912,708,553,793,536 decimal, which is just as obviously within bounds.
As everybody here proposed, use BigInteger to account for such cases. For example, BigInteger bi = new BigInteger("fffff8000261e000", 16); will solve your problem. Also, new java.math.BigInteger("fffff8000261e000", 16).toString() will yield 18446735277656498176 exactly.
The number you are parsing is too large to fit in a Java long. Adding an L wouldn't help. If long had been an unsigned data type, it would have fit.
One way to cope is to divide the string in two parts and then use bit shift when adding them together:
String s= "fffff8000261e000";
long number;
long n1, n2;
if (s.length() < 16) {
number = Long.parseLong(s, 16);
}
else {
String s1 = s.substring(0, 1);
String s2 = s.substring(1, s.length());
n1=Long.parseLong(s1, 16) << (4 * s2.length());
n2= Long.parseLong(s2, 16);
number = (Long.parseLong(s1, 16) << (4 * s2.length())) + Long.parseLong(s2, 16);
System.out.println( Long.toHexString(n1));
System.out.println( Long.toHexString(n2));
System.out.println( Long.toHexString(number));
}
Note:
If the number is bigger than Long.MAX_VALUE the resulting long will be a negative value, but the bit pattern will match the input.
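On Java 8 and later, the same bit pattern can also be obtained directly with Long.parseUnsignedLong, which accepts the full unsigned 64-bit range. A minimal sketch (the wrapper class UnsignedHex is made up for this example; the two Long methods are part of the JDK):

```java
final class UnsignedHex {
    // Parses up to 16 hex digits as an unsigned 64-bit value and returns its bit pattern as a long.
    static long parse(String hex) {
        return Long.parseUnsignedLong(hex, 16);
    }

    // Renders the bit pattern of a long as an unsigned decimal string.
    static String toUnsignedDecimal(long bits) {
        return Long.toUnsignedString(bits);
    }
}
```

For "fffff8000261e000", parse returns -8796053053440, and toUnsignedDecimal of that value gives 18446735277656498176, the value the hex string has when read as unsigned.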