I've seen it suggested that Base 40 encoding can be used to compress Strings (in Java to send to a Redis instance FWIW) and a quick test shows it more efficient for some of the data I'm using than an alternative I'm considering; Smaz.
Is there any reason to prefer base 32 or 64 encoding over 40? Any disadvantage, is encoding like this potentially lossless?
40 provides letters (probably lower case unless your application tends to use upper case most of the time) and digits for 36, and then four more for punctuation and shifts. You can make it lossless by making one of the remaining an escape so the next one or two characters represent a byte not in the other 39. Also a good approach is to have a shift-lock character that toggles between upper and lower case, if you tend to have strings of upper case characters.
40 is a convenient base, since three base-40 digits fit nicely in two bytes. 40^3 (64000) is a smidge less than 2^16 (65536).
What you should use depends on the statistics of your data.
Related
Basically what the title says. I'm aware that i could use char as type if i only had one letter, but i need a datatype for 2 letters, e.g "XY". Is there anything that uses less storage (bit) or is smaller than a String? Or are multiple letters generally just saved as Strings? Thanks!
If you are sure that there are no higher-unicode characters (i.e. characters that use more than 1 char to store) in use, there are a few options:
As mentioned by #rdas, you could use an array: char[2]. This would be a bit more memory-efficient than a String, as the String has additional members. If it's only ASCII-characters, you could even use byte[2].
As one char is 16 bits, 2 chars are 32 bits, so could also try to encode the 2 characters into 1 int, as this also uses only 32 bytes, and you would not have the object overhead of the array. Clearly, this requires some additional steps to encode/decode when you need to show the stored information as actual characters, e.g. when presenting it to the user.
If your characters are only ASCII codes, i.e. every character fits into 1 byte, you could even fit it into a short.
Depending on the number of two-character combinations that you actually need to support, you could actually just enumerate all the possible combinations, use a lookup Map or sorted Array, and then only store the number/index of the code. Again, depending on the number of combinations, use a byte, short or int.
No it is not possible
This is why::
String s = "ab" // uses only 4 bytes of data as each character reserves 2 bytes
And other data types uses >= 4 bytes except short and byte but here short and byte cannot store characters
I have a question about using the bit-vector approach that is common to finding whether a string has unique characters. I have seen those solutions out there (one of them) work well for ASCII and UTF-16 character set.
However, how will the same approach work for UTF-32? The longest continuous bit vector can be a long variable in Java right? UTF-16 requires 1024 such variables. If we take the same approach it will require 2^26 long variables (I think). Is it possible to solve for such a big character set using bit-vector?
I think you are missing something important here. UTF-32 is an encoding for Unicode. Unicode actually fits within a 21 bit space. As the Unicode FAQ states:
"The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space."
Any UTF-32 "characters" that are outside of the Unicode code space are invalid ... and you should never see them in a UTF-32 encoded String. So 2^15 longs should be enough.
In practice, you are unlikely to see code points outside of the Basic Linguistic Plane (plane 0). So it makes sense to use a bitmap for the BMP (i.e. codes up to 65535) and a sparse data structure (e.g. a HashSet<Integer>) for the other panes.
You could also consider using BitSet instead for "rolling your own" bit-set data structures using long or long[].
Finally, I should not that some of the code in the Q&A that you linked to is NOT appropriate for looking for unique characters in UTF-16 for a couple of reasons:
The idea of using N variables of type long and a switch statement does not scale. The code of the switch statement gets large and unmanageable ... and possibly gets bigger than the JVM spec can cope with. (The maximum size of a compiled method is 2^16 - 1 bytes of bytecode, so it clearly isn't viable for implementing a bit-vector for all of the Unicode code space.)
It is a better idea to use an array of long and get rid of the need for a switch ... which is only really there because you have N distinct long variables.
In UTF-16, each code unit (16 bit value) encodes either 1 code point (character) or half a code point. If you simply create a bitmap of the code units, you won't detect unique characters properly.
Well, a long contains 64 bits of information, and the set of UTF-32 characters contains approximately 2^21 elements, which would require 2^21 bits of information. You would be right that it would require 2^26 long variables if the UTF-32 dataset used all bits. However, as it is, you only require 2^13 long variables (still a lot).
If you assume that the characters are evenly distributed over the dataset, this inefficiency is unavoidable and the best solution would be to use something else like a Set<Long> or something. However, English plaintext tends to have a majority of its characters in the ASCII range (0-127), and most Western languages are fairly constrained to a specific range, so you could use a bit vector for the high-frequency regions and a Set or other order-independent, high-efficiency contains data structure to represent the rest of the regions.
My project takes a String s and passes an all lower case version s.toLowerCase() to a lossless encoder.
I can convert encode/decode the lower case string just fine, but this obviously would not be practical, so I need to be able to preserve the original String's capitalization somehow.
I was thinking of using Character.isUpperCase() to get an array of integers UpperCaseLetters[] that represents the locations of all capital letters in s. I would then use this array to place a ^ at all locations UpperCaseLettes[i] + 1 in the encoded string. When decoding the string, I would know that every character preceding a ^ is capital. (By the way, for this encoder will never generate ^ when encoding).
This method seems sloppy to me though. I was also thinking of using bit strings to represent capitalization, but the over all goal of the application is compression, so that would not be very efficient.
Is there any easier way to get and apply capitlization masks for strings? If there is, how much "storage" would it need?
Your options:
Auto-capitalize:
Use a general algorithm for capitalization, use one of the below techniques to only record the letters that differ between the generated and the actual capitalization. To regenerate, just run the algorithm again and flip the capitalization of all the recorded letters. Assuming there are capital letters where there should be (e.g. start of sentences), this will slow the algorithm down slightly (only by a small constant factor of n, and decent compression is generally much slower than that) and always reduce the amount of storage space required by a few.
Bitmap of capital positions:
You've already covered this one, not particularly efficient.
Prefix capitals with identifying character:
Also already covered, except that you described postfix, but prefix is generally better and, for a more generic solution, you can also escape the ^ with ^^. Not a bad idea. Depending on the compression, it might be a good idea to instead use a letter that already appears in the dataset. Either the most or least common letter, or you may have to look at the compression algorithm and do quite a bit of processing to determine the ideal letter to use.
Store distance of capital from start in any format:
Has no advantage over distance to next capital (below).
Distance to next capital - non-bitstring representation:
Generally less efficient than using bitstrings.
Bit string = distance to next capital:
You have a sequence of lengths, each indicating, in sequence, the distances between capitals. So if we have distances 0,3,1,0,5 capitalization would be as follows: AbcdEfGHijklmNo (skip 0 characters to the first, 3 character to the second, 1 character to the 3rd, etc.). There are some options available to store this:
Fixed length: Not a good idea since it needs to be = the longest possible distance. An obvious alternative is having some sort of overflow into the next length, but this still uses too much space.
Fixed length, different settings: Best explained with an example - the first 4 bits indicate the length, 00 means there are 2-bits following to indicate the distance, 01 means 4-bits, 10 means 8-bits, 11 means 16-bits, if there's a chance of more than 16-bits, you may want to do something like - 110 means 16-bits, 1110 means 32-bits, 11110 means 64-bits, etc. (this may sound similar to determining the class of a IPv4 address). So 0001010100 would split into 00-01, 01-0100, thus distances 1, 4. Note that the lengths don't have to increment in powers of 2. 16-bits = 65535 characters is a lot and 2-bits = 3 is very little, you can probably make it 4, 6, 8, (16?), (32?), ??? (unless there are a few capitals in a row, then you probably want 2-bits as well).
Variable length using escape sequence: Say the escape sequence is 00, we want to use all strings that doesn't contain 00, so the bit value table will look as follows:
Bits Value
1 1
10 2
11 3
101 4 // skipped 100
110 5
111 6
1010 7 // skipped 1000 and 1001
10100101010010101000101000010 will split into 101, 10101, 101010, 101, 0, 10. Note that ...1001.. just causes a split ending at the left 1 and a split starting at the right 1, and ...10001... causes a split ending at the first 0 and a split starting at the right 1, and ...100001... indicates a 0-valued distance in between. The pseudo-code is something like:
if (current value == 1 && zeroCount < 2)
add to current split
zeroCount = 0
else if (current value == 1) // after 00...
if (zeroCount % 2 == 1) { add zero to current split; zeroCount--; }
record current split, clear current split
while (zeroCount > 2) { record 0-distance split; zeroCount -= 2; }
else zeroCount++
This looks like a good solution for short distances, but once the distances become large I suspect you start skipping too many values and the length increases to quickly.
There is no ideal solution, it greatly depends on the data, you'll have to play around with prefixing capitals and different options for bit string distances to see which is best for your typical dataset.
I noticed that the max limit for a radix in Java is base 36.
Is this an arbitrary limit, or does Java have reason for limiting the radix in this way?
It's the number of decimal digits (10), plus the number of letters in the alphabet (26).
If a radix of 37 were allowed, a new character would have to be picked to represent the 37th digit. While it certainly would have been possible to pick some character, there is no obvious choice. It makes sense to just disallow larger radixes.
Very simple: 26 letters + 10 digits = 36.
In order to represent a number, traditionally digits and Latin letters are used.
For completeness, I would add that there are two constants defined in JDK:
Character.MIN_RADIX
Character.MAX_RADIX
Radix limit make sense if the output has to be readable.
In various case, the output does NOT need to be readable.
Thus indeed, a higher limit would help in such cases.
And the java langage radix limit is a weak point for java.
You can use the Base64 encoding scheme as specified in RFC 4648 and RFC 2045.
Just generate the byte representation of you int number according you needs, to be compatible with the majority of the libraries that implement Base64.
I'd like to know if there is a simple way to "cast" a byte array containing a data-structure of a known layout to an Object. The byte[] consists of BCD packed values, 1 or 2-byte integer values and character values. I'm obtaining the byte[] via reading a file with a FileInputStream.
People who've worked on IBM-Mainframe systems will know what I mean right away - the problem is I have to do the same in Java.
Any suggestions welcome.
No, because the object layout can vary depending on what VM you're using, what architecture the code is running on etc.
Relying on an in-memory representation has always felt brittle to me...
I suggest you look at DataInputStream - that will be the simplest way to parse your data, I suspect.
Not immediately, but you can write one pretty easily if you know exactly what the bytes represent.
To convert a BCD packed number you need to extract the two digits encoded. The four lower bits encode the lowest digit and you get that by &'ing with 15 (1111 binary). The four upper bits encode the highest digit which you get by shifting right 4 bits and &'ing with 15.
Also note that IBM most likely have tooling available if you this is what you are actually doing. For the IBM i look for the jt400 IBM Toolbox for Java.