How to compress a String in Java?

How to compress a String in Java? - java

I use GZIPOutputStream or ZIPOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some site, I found some friends said that this is because my original string is too short, GZIPOutputStream can be used to compress longer strings.
so, can somebody give me a help to compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.*;
//ZipUtil
public class ZipUtil {
public static String compress(String str) {
if (str == null || str.length() == 0) {
return str;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes());
gzip.close();
return out.toString("ISO-8859-1");
}
public static void main(String[] args) throws IOException {
String string = "admin";
System.out.println("after compress:");
System.out.println(ZipUtil.compress(string));
}
}
The result is :

Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not too easy, and it is not always possible. If you have repetition, Huffman Coding or simple run-length encoding might be able to compress, but probably not by very much.

When you create a String, you can think of it as a list of char's, this means that for each character in your String, you need to support all the possible values of char. From the sun docs
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support you can write a simple compression algorithm, which is analogous to binary->decimal->hex radix converstion. You go from 65,536 (or however many characters your target system supports) to 26 (alphabetical) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target 36 +, source 10) - just make sure you have plenty of unit tests!

If the passwords are more or less "random" you are out of luck, you will not be able to get a significant reduction in size.
But: Why do you need to compress the passwords? Maybe what you need is not a compression, but some sort of hash value? If you just need to check if a name matches a given password, you don't need do save the password, but can save the hash of a password. To check if a typed in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int you will be able to store all 20 password-hashes in 80 bytes).

Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.

Huffman Coding might help, but only if you have a lot of frequent characters in your small String

The ZIP algorithm is a combination of LZW and Huffman Trees. You can use one of theses algorithms separately.
The compression is based on 2 factors :
the repetition of substrings in your original chain (LZW): if there are a lot of repetitions, the compression will be efficient. This algorithm has good performances for compressing a long plain text, since words are often repeated
the number of each character in the compressed chain (Huffman): more the repartition between characters is unbalanced, more the compression will be efficient
In your case, you should try the LZW algorithm only. Used basically, the chain can be compressed without adding meta-informations: it is probably better for short strings compression.
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no not saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.

If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes("UTF-8");
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);

You don't see any compression happening for your String, As you atleast require couple of hundred bytes to have real compression using GZIPOutputStream or ZIPOutputStream. Your String is too small.(I don't understand why you require compression for same)
Check Conclusion from this article:
The article also shows how to compress
and decompress data on the fly in
order to reduce network traffic and
improve the performance of your
client/server applications.
Compressing data on the fly, however,
improves the performance of
client/server applications only when
the objects being compressed are more
than a couple of hundred bytes. You
would not be able to observe
improvement in performance if the
objects being compressed and
transferred are simple String objects,
for example.

Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with sequence of bits, depending on their frequency in the text (the more frequent, the smaller the sequence).
You can read your entire text and build a table of codes, for example:
Symbol Code
a 0
s 10
e 110
m 111
The algorithm builds a symbol tree based on the text input. The more variety of characters you have, the worst the compression will be.
But depending on your text, it could be effective.

Related

Converting Java byte array to Buffer in Node.js

In an Android app I have a byte array containing data in the following format:
In another Node.js server, the same data is stored in a Buffer which looks like this:
I am looking for a way to convert both data to the same format so I can compare the two and check if they are equal. What would be the best way to approach this?

[B#cbf1911 is not a format. That is the result of invoking the .toString() method on a java object which doesn't have a custom toString implementation (thus, you get the default implementation written in java.lang.Object itself. The format of that string is:
binary-style-class-name#system-identity-hashcode.
[B is the binary style class name. That's JVM-ese for byte[].
cbf1911 is the system identity hashcode, which is (highly oversimplified and not truly something you can use to look stuff up) basically the memory address.
It is not the content of that byte array.
Lots of java APIs allow you to pass in any object and will just invoke the toString for you. Where-ever you're doing this, you wrote a bug; you need to write some explicit code to turn that byte array into data.
Note that converting bytes into characters, which you'll have to do whenever you need to put that byte array onto a character-based comms channel (such as JSON or email), is tricky.
<Buffer 6a 61 ...>
This is listing each byte as an unsigned hex nibble. This is an incredibly inefficient format, but it gets the job done.
A better option is base64. That is merely highly inefficient (but not incredibly inefficient); it spends 4 characters to encode 3 bytes (vs the node.js thing which spends 3 characters to encode 1 byte). Base64 is a widely supported standard.
When encoding, you need to explicitly write that. When decoding, same story.
In java, to encode:
import android.util.Base64;
class Foo {
void example() {
byte[] array = ....;
String base64 = Base64.encodeToString(array, Base64.DEFAULT);
System.out.println(base64);
}
}
That string is generally 'safe' - it has no characters in it that could end up being interpreted as control flow (so no <, no ", etc), and is 100% ASCII which tends to survive broken charset encoding transitions, which are common when tossing strings about the interwebs.
How do you decode base64 in node? I don't know, but I'm sure a web search for 'node base64 decode' will provide hundreds of tutorials.
Good luck!

What is the minimum test to verify that a component can save/retrieve UTF8 encoded strings

I am integration testing a component. The component allows you to save and fetch strings.
I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?
I think that doing something like this is a good start:
// This is the ☺ character
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save( id, toSave );
// Retrieve from Database
String fromComponent = myComponent.retrieve( id );
// Verify they are same
org.junit.Assert.assertEquals( toSave, fromComponent );
One mistake I have made in the past is I have set String toSave = "è". My test passed because the string was saved and retrieved properly to/from the DB. Unfortunately the application was not actually working correctly because the app was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.
Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?

A code and/or documentation review is probably your best option here. But, you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to figure what a sufficient test is, based only on speculation of what the threat would be, but here's my suggestion: all codepoints, including U+0000, proper handling of "combining characters."
The method you want to test has a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text—It's used by Java, .NET, JavaScript, VB6, VBA,…. UTF-8 is commonly used for streams and storage, so it makes sense that you should ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.
"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF—21 bits. Some codepoints are not assigned (aka "defined"), depending on the Unicode Standard revision. Java datatypes can handle any codepoint, but information about them is limited by version. For Java 8, "Character information is based on the Unicode Standard, version 6.2.0.". You can limit the test to "defined" codepoints or go all possible codepoints.
A codepoint is either a base "character" or a "combining character". Also, each codepoint is in exactly one Unicode Category. Two categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to layout graphemes graphically (see Zalgo text) but for text storage all that it is needed to not mangle the sequence of codepoints (and byte order, if applicable).
So, here is a non-minimal, somewhat comprehensive test:
final Stream<Integer> codepoints = IntStream
.rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
.filter(cp -> Character.isDefined(cp)) // optional filtering
.boxed();
final int[] combiningCategories = {
Character.COMBINING_SPACING_MARK,
Character.ENCLOSING_MARK
};
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
.collect(Collectors.partitioningBy(cp ->
Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
.toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
.toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
graphemes.append(Character.toChars(baseCodepoints[i]));
graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
final byte[] testUTF8 = StandardCharsets.UTF_8.encode(test).array();
// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length()); // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units

If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion to and from the Unicode strings of java and the UTF-8 strings that the component stores.
That would involve checking with at least one character from each UTF-8 code point length. So, I would suggest check with:
One character from the US-ASCII set, (1-byte long code point,) then
One character from Greek, (2-byte long code point,) and
One character from Chinese (3-byte long code point.)
In theory you would also want to check with an emoji (4-byte long code point,) though these cannot be represented in java's Unicode strings, so it's moot point.
A useful extra test would be to try a string combining at least one character from each of the above cases, so as to make sure that characters of different code-point lengths can co-exist within the same string.
(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)
I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:
Introduce additional functions to the interface that might help with testing without disclosing anything about the internal implementation and without requiring that the testing code must have knowledge of the internal implementation of the component.
Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what component it is constructing, so it is intimately familiar with the nature of the component, so it is okay to pass something implementation-specific there.
An example of what you could do with any of the above techniques would be to artificially severely limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure that a certain string you are planning to store will fit. So, you could limit the internal size to no more than 9 bytes, and then make sure that a java unicode string containing 3 chinese characters gets properly stored and retrieved.

String instances use a predefined and unchangeable encoding(16-bit words).
So, returning only a String from your service is probably not enough to do this check.
You should try to return the byte representation of the persisted String (a byte array for example) and compare the content of this array with the "\u263A" String that you would encode in bytes with the UTF-8 charset.
String toSave = "\u263A";
int id = 123;
// Saves to Database
myComponent.save(id, toSave );
// Retrieve from Database
byte[] actualBytes = myComponent.retrieve(id );
// assertion
byte[] expectedBytes = toSave.getBytes(Charset.forName("UTF-8"));
Assert.assertTrue(Arrays.equals(expectedBytes, actualBytes));

Java Compress Multiple strings with the same rule

I'm creating an android application that needs a massive database (70mb but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me very much since compression libraries store the rules decompress the strings as part of the compressed string.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
byte[] data = queryDatabase(keyword);
return decompresser.decompress(keyword);
}

The "rules" as you call them is not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, and so is not significant compared to your 2000 byte string.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().

Store a number as ASCII text in Java?

It's probably a stupid question but here's the thing. I was reading this question:
Storing 1 million phone numbers
and the accepted question was what I was thinking: using a trie. In the comments Matt Ball suggested:
I think storing the phone numbers as ASCII text and compressing is a very reasonable suggestion
Problem: how do I do that in Java? And ASCII text does stand for String?

For in-memory storage as indicated in the question:
ByteArrayOutputStream baos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(
new GZIPOutputStream(baos), "US-ASCII");
for(String number : numbers){
out.write(number);
out.write('\n');
}
byte[] data = baos.toByteArray();
But as Pete remarked: this may be good for memory efficiency, but you can't really do anything with the data afterwards, so it's not really very useful.

Yes, ASCII means Strings in this case. You can store compressed data in Java using the java.util.zip.GZIPOutputStream.

In answer to an implied, but different question;
Q: You have 1 billion phones numbers and you need to send these over a low bandwidth connection. You only need to send whether the phone number is in the collection or not. (No other information required)
A: This is the general approach
First sort the list if its not sorted already.
From the lowest number find regions of continuous numbers. Send the start of the region and the phones which are taken. This can be stored a BitSet (1-bit per possible number) Send the phone number at the start and the BitSet whenever the gap is more than some threshold.
Write the stream to a compressed data set.
Test this to compare with a simple sending of all numbers.
You can use Strings in a sorted TreeMap. One million numbers is not very much and will use about 64 MB. I don't see the need for a more complex solution.
The latest version of Java can store ASCII text efficiently by using a byte[] instead of a char[] however, the overhead of your data structure is likely to be larger.
If you need to store a phone numbers as a key, you could store them with the assumption that large ranges will be continous. As such you could store them like
NavigableMap<String, PhoneDetails[]>
In this structure, the key would define the start of the range and you could have a phone details for each number. This could be not much bigger than the reference to the PhoneDetails (which is the minimum)
BTW: You can invent very efficient structures if you don't need access to the data. If you never access the data, don't keep it in memory, in fact you can just discard it as it won't ever be needed.
Alot depending on what you want to do with the data and why you have it in memory at all.
You can Use DeflatorOutputStream to a ByteArrayOutputStream, which will be very small, but not very useful.
I suggest using DeflatorOutputStream as its more light weight/faster/smaller than GZIPOutputStream.

Java String are by default UTF-8 encoded, you have to change the encoding if you want to manipulate ASCII text.

In Java, what's the fastest way to "build" and use a string, character by character?

I have a Java socket connection that is receiving data intermittently. The number of bytes of data received with each burst varies. The data may or may not be terminated by a well-known character (such as CR or LF). The length of each burst of data is variable.
I'm attempting to build a string out of each burst of data. What is the fastest way (speed, not memory), to build a string that would later need to be parsed?
I began by using a byte array to store the incoming bytes, then converting them to a String with each burst, like so:
byte[] message = new byte[1024];
...
message[i] = //byte from socket
i++;
...
String messageStr = new String(message);
...
//parse the string here
The obvious disadvantage of this is that some bursts may be longer than 1024. I don't want to arbitrarily create a larger byte array (what if my burst is larger?).
What is the best way of doing this? Should I create a StringBuilder object and append() to it? That way I don't have to convert from StringBuilder to String (since the former has most of the methods I need).
Again, speed of execution is my biggest concern.
TIA.

I would probably use an InputStreamReader wrapped around a BufferedInputStream, which in turn wraps the socket. And write code that processes a message at a time, potentially blocking for input. If the input is bursty, I might run on a background thread and use a concurrent queue to hold the messages.
Reading a buffer at a time and trying to convert it to characters is exactly what BufferedInputStream/InputStreamReader does. And it does so while paying attention to encoding, something that (as other people have noted) your solution does not.
I don't know why you're focused on speed, but you'll find that the time to process data coming off a socket is far less than the time it takes to transmit over that socket.

Note that as you're transmitting across network layers, your speed of conversion may not be the bottleneck. It would be worth measuring, if you believe this to be important.
Note (also) that you're not specifying a character encoding in your conversion from bytes to String (via characters). I would enforce that somehow, otherwise your client/server communication can become corrupted if/when your client/server run in different environments. You can enforce that via JVM runtime args, but it's not a particularly safe option.
Given the above, you may want to consider StringBuilder(int capacity) to configure it in advance with an appropriate size, such that it doesn't have to reallocate on the fly.

First of all, you are making a lot of assumptions about charachter encoding that you receive from your client. Is it US-ASCII, ISO-8859-1, UTF-8?
Because in Java string is not a sequence of bytes, when it comes to building portable String serialization code you should make explicit decisions about character encoding. For this reason you should NEVER use StringBuilder to convert bytes to String. If you look at StringBuilder interface you will notice that it does not even have an append( byte ) method, and that's not because designers just overlooked it.
In your case you should definetly use a ByteArrayOutputStream. The only drawback of using straight implementation of ByteArrayOutputStream is that its toByteArray() method returns a copy of the array held by the object internaly. For this reason you may create your own subclass of ByteArrayOutputStream and provide direct access to the protected buf member.
Note that if you don't use default implementation, remember to specify byte array bounds in your String constructor. Your code should look something like this:
MyByteArrayOutputStream message = new MyByteArrayOutputStream( 1024 );
...
message.write( //byte from socket );
...
String messageStr = new String(message.buf, 0, message.size(), "ISO-8859-1");
Substitute ISO-8859-1 for the character set that's suitable for your needs.

StringBuilder is your friend. Add as many characters as needed, then call toString() to obtain the String.

I would create a "small" array of characters and append characters to it.
When the array is full (or transmission ends), use the StringBuilder.append(char[] str) method to append the content of the array to your string.
Now for the "small" size of the array - you will need to try various sizes and see which one is fastest for your production environment (performance "may" depend on the JVM, OS, processor type and speed and so on)
EDIT: Other people mentioned ByteArrayOutputStream, I agree it is another option as well.

You may wish to look at ByteArrayOutputStream depending if you are dealing with Bytes instead of Characters.
I generally will use a ByteArrayOutputStream to assemble a message then use toString/toByteArray to retrive it when the message is finished.
Edit: ByteArrayOutputStream can handle various Character set encoding through the toString call.

Personally, independent of language, I would send all characters to an in-memory data stream and once I need the string, I would read all characters from this stream into a string.
As an alternative, you could use a dynamic array, making it bigger whenever you need to add more characters. Even better, keep track of the actual length and increase the array with additional blocks instead of single characters. Thus, you would start with 1 character in an array of 1000 chars. Once you get at 1001, the array needs to be resized to 2000, then 3000, 4000, etc...
Fortunately, several languages including Java have a special build-in class that specializes in this. These are the stringbuilder classes. Whatever technique they use isn't that important but they have been created to boost performance so they should be your fastest option.

Have a look at the Text class. It's faster (for the operations you perform) and more deterministic than StringBuilder.
Note: the project containing the class is aimed at RTSJ VMs. It is perfectly usable in standard SE/EE environments though.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.