Compressing string to specified max chars in Java

Compressing string to specified max chars in Java - java

I want to compress any string, if it has more than 50chars in it. If char size if less than 50, then let it be untouched, else it has to be compressed to required limit (50).
I will be inserting this compressed/uncompressed output in DB. So, while fetching back the data from DB, i want the compressed string to be easily differentiated from uncompressed (some common pattern for compressed string).
Please suggest some best compressing lib/algorithm ?

Please suggest some best compressing lib/algorithm ?
there is no such algorithm or library so that any string can be compressed to a given max characters (e.g. 50) without any loss of information.

If your compression is allowed to be lossy, you can use the abbreviate function from Apache Commons lang

I would use Inflater / Deflater from java.util.zip package. They use native ZLIB compression library and are very efficient. It's easy to differentiate compressed text from uncompressed - try to decompress and you will get an exception if it's a plain text

Related

Java decompress HTTP GZIP content from json attribute

We are working with packetbeat, a network packet analyzer tool to capture http requests and http responses. Packebeat persists this packet events in json format. The problem comes when the server supports gzip compression, packetbeat could not unzip content and save it directly the gzip content as json attribute. As you can see (Note: json has been simplified);
{
{
... ,
"content-type":"application/json;charset=UTF-8",
"transfer-encoding":"chunked",
"content-length":6347,
"x-application-context":"proxy-service:pre,native:8080",
"content-encoding":"gzip",
"connection":"keep-alive",
"date":"Mon, 18 Dec 2017 07:18:23 GMT"
},
"body": "\u001f\ufffd\u0008\u0000\u0000\u0000\u0000\u0000\u0000\u0003\ufffd]k\ufffd\u0014Ǳ\ufffd/\ufffdYI\ufffd#\ufffd*\ufffdo\ufffd\ufffd\ufffd\u0002\t\u0010^\ufffd\u001c\u000eE=\ufffd{\ufffdb\ufffd\ufffdE\ufffd\ufffdC\ufffd\ufffdf\ufffd,\ufffd\u003e\ufffd\ufffd\ufffd\u001ef\u001a\u0008\u0005\ufffd\ufffdg\ufffd\ufffd\ufffdYYU\ufffd\ufffd;\ufffdoN\ufffd\ufffd\ufffdg\ufffd\u0011UdK\ufffd\u0015\u0015\ufffdo\u000eH\ufffd\u000c\u0015Iq\ndC\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd ... "
}
We are thinking in preprocess packet json files to unzip content. Could someone tell me what i need to decompress zipped "body" json attribute using java?

Your data is irrecoverably broken. Generally I would suggest using Base64 encoding for transferring binary data packed into JSON, but you can read about possible alternatives in Binary Data in JSON String. Something better than Base64 if you like experimenting.
Otherwise, in theory you could just use a variant of String.getBytes() to get an array of bytes, and wrap the result into the mentioned (in the other answer) streams:
byte bodyBytes[]=body.getBytes();
ByteArrayInputStream bais=new ByteArrayInputStream(bodyBytes);
GZipInputStream gis=new GZipInputStream(bais);
<do something with gis here, perhaps use an additional DataInputStream>
Apart from the String-thing (which is usually not a good idea), this is how you unpack a gzip-compressed array of bytes.
However valid gzip data starts with a magic number 0x1F,0x8B (see Wikipedia, or you can also dig up the actual specification). Your data starts with 0x1F (the \u001F part), but continues with a \ufffd Unicode character, which a replacement character (see Wikipedia again).
Some tool was encoding the binary data and did not like the 0x8B, most probably because it was >=0x80. If you further read in your JSON, there are many \ufffd-s in it, all values above (or equal to) 0x80 have been replaced with this. So the data at the moment is irrecoverably broken even if JSON would support raw binary data inside (but it does not).

In Java you can use the GZIPInputStream class to decode the GZIP data, I think you would need to turn the value into a ByteArrayInputStream first.

Java Compress Multiple strings with the same rule

I'm creating an android application that needs a massive database (70mb but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me very much since compression libraries store the rules decompress the strings as part of the compressed string.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
byte[] data = queryDatabase(keyword);
return decompresser.decompress(keyword);
}

The "rules" as you call them is not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, and so is not significant compared to your 2000 byte string.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().

Best Practice for Writing Data Using protobuf

We need our protobuf messages to contain as little data as possible. So what are the best practices we can follow in order to gain the maximum out of it. As an example writing byte[] as a String or ByteString ? What makes the difference? And adding a list of Integers as a repeated list or something else ?

As an example writing byte[] as a String or ByteString ?
If you want to write binary data, use a bytes fields (so ByteString). A string field is UTF-8-encoded text, so can't be used for all possible byte sequences.
And adding a list of integers as a repeated list or something else ?
Yes, use a repeated list - but with the [packed=true] option.
Basically, look over the whole encoding documentation and work out what's most appropriate for you. In particular, choose carefully between the various numeric representations, based on what your actual data will be. (If you're writing 32-bit values which are typically very large, consider using the fixed32 format instead of just int32 for example.)

Mapping java variables data to C Structures and write a c compatible file

I have a java class file with three arrayLists, one with type String, one with type Integer and other is ArrayList with type (ArrayList(String)). I have to write these these arraylists to a structure in C with character arrays, integers and short and output a file in a specefic format extension. The file has to be readable again by the same application. What is the best way to trasnfer the data from java to c structure and then output the c structure in a file. Thank you

There is no "C compatible file" format. If you have C structs written to disk file directly, then those are in an ad-hoc binary format. Exact format depends on things like packing and padding of the struct, byte order, word size of the CPU (like, 32 or 64 bit), etc.
So, start by defining the format, then forget it is produced by C.
Once you have the format defined, you can write a program to parse it in Java. If it is short with fixed length records, I'd probably create a class, which internally has just a private byte[] array, and then methods to manipulate it, save it and load it.

I suggest you write/read the data to a ByteBuffer using native byte ordering. The rest is up to you are to how you do it.
A library which might help is Javolution's Struct library which helps you map C structs onto ByteBuffers. This can help with C's various padding rules i.e.the exact layout might not be obvious.

How to compress a String in Java?

I use GZIPOutputStream or ZIPOutputStream to compress a String (my string.length() is less than 20), but the compressed result is longer than the original string.
On some site, I found some friends said that this is because my original string is too short, GZIPOutputStream can be used to compress longer strings.
so, can somebody give me a help to compress a String?
My function is like:
String compress(String original) throws Exception {
}
Update:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;
import java.util.zip.*;
//ZipUtil
public class ZipUtil {
public static String compress(String str) {
if (str == null || str.length() == 0) {
return str;
}
ByteArrayOutputStream out = new ByteArrayOutputStream();
GZIPOutputStream gzip = new GZIPOutputStream(out);
gzip.write(str.getBytes());
gzip.close();
return out.toString("ISO-8859-1");
}
public static void main(String[] args) throws IOException {
String string = "admin";
System.out.println("after compress:");
System.out.println(ZipUtil.compress(string));
}
}
The result is :

Compression algorithms almost always have some form of space overhead, which means that they are only effective when compressing data which is sufficiently large that the overhead is smaller than the amount of saved space.
Compressing a string which is only 20 characters long is not too easy, and it is not always possible. If you have repetition, Huffman Coding or simple run-length encoding might be able to compress, but probably not by very much.

When you create a String, you can think of it as a list of char's, this means that for each character in your String, you need to support all the possible values of char. From the sun docs
char: The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive).
If you have a reduced set of characters you want to support you can write a simple compression algorithm, which is analogous to binary->decimal->hex radix converstion. You go from 65,536 (or however many characters your target system supports) to 26 (alphabetical) / 36 (alphanumeric) etc.
I've used this trick a few times, for example encoding timestamps as text (target 36 +, source 10) - just make sure you have plenty of unit tests!

If the passwords are more or less "random" you are out of luck, you will not be able to get a significant reduction in size.
But: Why do you need to compress the passwords? Maybe what you need is not a compression, but some sort of hash value? If you just need to check if a name matches a given password, you don't need do save the password, but can save the hash of a password. To check if a typed in password matches a given name, you can build the hash value the same way and compare it to the saved hash. As a hash (Object.hashCode()) is an int you will be able to store all 20 password-hashes in 80 bytes).

Your friend is correct. Both gzip and ZIP are based on DEFLATE. This is a general purpose algorithm, and is not intended for encoding small strings.
If you need this, a possible solution is a custom encoding and decoding HashMap<String, String>. This can allow you to do a simple one-to-one mapping:
HashMap<String, String> toCompressed, toUncompressed;
String compressed = toCompressed.get(uncompressed);
// ...
String uncompressed = toUncompressed.get(compressed);
Clearly, this requires setup, and is only practical for a small number of strings.

Huffman Coding might help, but only if you have a lot of frequent characters in your small String

The ZIP algorithm is a combination of LZW and Huffman Trees. You can use one of theses algorithms separately.
The compression is based on 2 factors :
the repetition of substrings in your original chain (LZW): if there are a lot of repetitions, the compression will be efficient. This algorithm has good performances for compressing a long plain text, since words are often repeated
the number of each character in the compressed chain (Huffman): more the repartition between characters is unbalanced, more the compression will be efficient
In your case, you should try the LZW algorithm only. Used basically, the chain can be compressed without adding meta-informations: it is probably better for short strings compression.
For the Huffman algorithm, the coding tree has to be sent with the compressed text. So, for a small text, the result can be larger than the original text, because of the tree.

Huffman encoding is a sensible option here. Gzip and friends do this, but the way they work is to build a Huffman tree for the input, send that, then send the data encoded with the tree. If the tree is large relative to the data, there may be no not saving in size.
However, it is possible to avoid sending a tree: instead, you arrange for the sender and receiver to already have one. It can't be built specifically for every string, but you can have a single global tree used to encode all strings. If you build it from the same language as the input strings (English or whatever), you should still get good compression, although not as good as with a custom tree for every input.

If you know that your strings are mostly ASCII you could convert them to UTF-8.
byte[] bytes = string.getBytes("UTF-8");
This may reduce the memory size by about 50%. However, you will get a byte array out and not a string. If you are writing it to a file though, that should not be a problem.
To convert back to a String:
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
...
String s = new String(bytes, UTF8_CHARSET);

You don't see any compression happening for your String, As you atleast require couple of hundred bytes to have real compression using GZIPOutputStream or ZIPOutputStream. Your String is too small.(I don't understand why you require compression for same)
Check Conclusion from this article:
The article also shows how to compress
and decompress data on the fly in
order to reduce network traffic and
improve the performance of your
client/server applications.
Compressing data on the fly, however,
improves the performance of
client/server applications only when
the objects being compressed are more
than a couple of hundred bytes. You
would not be able to observe
improvement in performance if the
objects being compressed and
transferred are simple String objects,
for example.

Take a look at the Huffman algorithm.
https://codereview.stackexchange.com/questions/44473/huffman-code-implementation
The idea is that each character is replaced with sequence of bits, depending on their frequency in the text (the more frequent, the smaller the sequence).
You can read your entire text and build a table of codes, for example:
Symbol Code
a 0
s 10
e 110
m 111
The algorithm builds a symbol tree based on the text input. The more variety of characters you have, the worst the compression will be.
But depending on your text, it could be effective.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Compressing string to specified max chars in Java - java

Please suggest some best compressing lib/algorithm ? there is no such algorithm or library so that any string can be compressed to a given max characters (e.g. 50) without any loss of information.

If your compression is allowed to be lossy, you can use the abbreviate function from Apache Commons lang

I would use Inflater / Deflater from java.util.zip package. They use native ZLIB compression library and are very efficient. It's easy to differentiate compressed text from uncompressed - try to decompress and you will get an exception if it's a plain text

Related

Java decompress HTTP GZIP content from json attribute

Java Compress Multiple strings with the same rule

Best Practice for Writing Data Using protobuf

Mapping java variables data to C Structures and write a c compatible file

How to compress a String in Java?

Categories

Resources