UTF-8 String class for java


I need to hold lots of string objects in memory (hundreds of MB) and I want to hold them in UTF-8 format, since in most cases that will require half of the memory the default implementation uses.
The default String class requires 60 bytes for a 12-character string (see http://blog.griddynamics.com/2010/01/java-tricks-reducing-memory-consumption.html).
Most of my Strings are 10-20 characters long.
I wonder if there is some open source library which offers a wrapper for such strings?
I know how to convert String to UTF-8 byte array but I'm looking for a wrapper class which will provide all needed utilities functions (Hash, Equal, toString, fromString, etc).

Apache Avro has a Utf8 wrapper class which implements CharSequence, but I don't know the memory consumption of such objects.
Hadoop has the Text class, which has much the kind of interface you desire.
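If pulling in Avro or Hadoop just for this is not an option, a hand-rolled wrapper is not much code. The following is only a minimal sketch: the class name Utf8String and its methods are made up for illustration and are not the API of either library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical immutable wrapper around UTF-8 bytes; not the Avro or Hadoop class. */
final class Utf8String implements Comparable<Utf8String> {
    private final byte[] bytes; // UTF-8 encoded content, never exposed directly

    private Utf8String(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Encodes a regular String as UTF-8. */
    static Utf8String fromString(String s) {
        return new Utf8String(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Number of encoded bytes, which differs from the char count for non-ASCII text. */
    int byteLength() {
        return bytes.length;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Utf8String && Arrays.equals(bytes, ((Utf8String) o).bytes);
    }

    @Override
    public int hashCode() {
        return Arrays.hashCode(bytes); // recomputed on each call to avoid an extra field
    }

    /** Compares bytes as unsigned values, which matches code point order for UTF-8. */
    @Override
    public int compareTo(Utf8String other) {
        int n = Math.min(bytes.length, other.bytes.length);
        for (int i = 0; i < n; i++) {
            int diff = (bytes[i] & 0xFF) - (other.bytes[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return bytes.length - other.bytes.length;
    }

    /** Decodes back to a regular String; allocates a new object each time. */
    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Each instance still pays for one object header plus one array header, so the saving comes only from the 1-byte-per-character payload of mostly-ASCII text.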

If you want a distinct object for each string and you want them as compact as possible then use byte arrays. For mostly-ASCII text that will be 1 byte per char vs 2, and you won't have the overhead of the String header (which adds probably 32 bytes per object).
But of course you wouldn't be able to use any String methods on these without first converting to String.
But if you really want to save space, store the strings back-to-back in a few larger arrays, with "dope vectors" to locate the individual strings.
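A rough sketch of the back-to-back layout, to make the idea concrete. The class PackedStrings and its method names are invented for this example; it uses a single byte array and a single int array of offsets playing the role of the "dope vector".

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical append-only store: all strings share one byte[], located via an offsets array. */
final class PackedStrings {
    private byte[] data = new byte[1024]; // UTF-8 bytes of all strings, back to back
    private int[] offsets = new int[17];  // offsets[i] = start of string i, offsets[count] = end of data
    private int count = 0;

    /** Appends a string and returns its index. */
    int add(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        int start = offsets[count];
        int end = start + utf8.length;
        if (end > data.length) {
            data = Arrays.copyOf(data, Math.max(end, data.length * 2));
        }
        if (count + 2 > offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        System.arraycopy(utf8, 0, data, start, utf8.length);
        offsets[count + 1] = end;
        return count++;
    }

    /** Decodes string number index back into a regular String. */
    String get(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        int start = offsets[index];
        return new String(data, start, offsets[index + 1] - start, StandardCharsets.UTF_8);
    }

    int size() {
        return count;
    }
}
```

The per-string cost here is the payload plus one 4-byte offset. A single Java array cannot exceed 2^31 - 1 elements and this sketch does not guard against that, so hundreds of MB fit but a real version would need several such chunks.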

Related

Converting Java byte array to Buffer in Node.js

In an Android app I have a byte array containing data in the following format:
In another Node.js server, the same data is stored in a Buffer which looks like this:
I am looking for a way to convert both data to the same format so I can compare the two and check if they are equal. What would be the best way to approach this?

[B@cbf1911 is not a format. That is the result of invoking the .toString() method on a Java object which doesn't have a custom toString implementation (thus, you get the default implementation written in java.lang.Object itself). The format of that string is:
binary-style-class-name@system-identity-hashcode.
[B is the binary style class name. That's JVM-ese for byte[].
cbf1911 is the system identity hashcode, which is (highly oversimplified and not truly something you can use to look stuff up) basically the memory address.
It is not the content of that byte array.
Lots of Java APIs allow you to pass in any object and will just invoke toString for you. Wherever you're doing this, you wrote a bug; you need to write some explicit code to turn that byte array into data.
Note that converting bytes into characters, which you'll have to do whenever you need to put that byte array onto a character-based comms channel (such as JSON or email), is tricky.
<Buffer 6a 61 ...>
This lists each byte as two hexadecimal digits (one per nibble), separated by spaces. This is an incredibly inefficient format, but it gets the job done.
A better option is base64. That is merely highly inefficient (but not incredibly inefficient); it spends 4 characters to encode 3 bytes (vs the node.js thing which spends 3 characters to encode 1 byte). Base64 is a widely supported standard.
When encoding, you need to explicitly write that. When decoding, same story.
In java, to encode:
import android.util.Base64;

class Foo {
    // Encodes the raw bytes as base64 text and prints the result.
    void example(byte[] array) {
        String base64 = Base64.encodeToString(array, Base64.DEFAULT);
        System.out.println(base64);
    }
}
That string is generally 'safe' - it has no characters in it that could end up being interpreted as control flow (so no <, no ", etc), and is 100% ASCII which tends to survive broken charset encoding transitions, which are common when tossing strings about the interwebs.
How do you decode base64 in node? I don't know, but I'm sure a web search for 'node base64 decode' will provide hundreds of tutorials.
Good luck!

Java Compress Multiple strings with the same rule

I'm creating an Android application that needs a massive database (70 MB, but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me very much, since compression libraries store the rules to decompress the strings as part of the compressed string.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
    DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
    byte[] data = queryDatabase(keyword);
    return decompresser.decompress(data);
}

The "rules", as you call them, are not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, and so is not significant compared to your 2000 byte string.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().
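In Java the same zlib feature is reachable through java.util.zip: Deflater.setDictionary() and Inflater.setDictionary(). Below is a sketch of how the two sides might fit together. The class and method names are made up for the example, and how to build a good dictionary from your definitions is left open.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DictionaryCompression {
    /** Compresses input with a preset dictionary that acts as shared history. */
    static byte[] compress(byte[] input, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        try {
            deflater.setDictionary(dictionary); // must happen before any data is compressed
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[input.length + 64];
            int length = 0;
            while (!deflater.finished()) {
                if (length == buffer.length) buffer = Arrays.copyOf(buffer, buffer.length * 2);
                length += deflater.deflate(buffer, length, buffer.length - length);
            }
            return Arrays.copyOf(buffer, length);
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Decompresses data produced by compress(); the same dictionary must be supplied. */
    static byte[] decompress(byte[] compressed, byte[] dictionary) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] buffer = new byte[Math.max(64, compressed.length * 4)];
            int length = 0;
            while (!inflater.finished()) {
                if (length == buffer.length) buffer = Arrays.copyOf(buffer, buffer.length * 2);
                int n = inflater.inflate(buffer, length, buffer.length - length);
                length += n;
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        inflater.setDictionary(dictionary); // the stream asks for its dictionary
                    } else if (inflater.needsInput()) {
                        throw new DataFormatException("truncated input");
                    }
                }
            }
            return Arrays.copyOf(buffer, length);
        } finally {
            inflater.end();
        }
    }
}
```

The dictionary bytes would ship once with the application (the "rules file" from the question), while each database row holds only the output of compress().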

File size vs. in memory size in Java

If I take an XML file that is around 2kB on disk and load the contents as a String into memory in Java and then measure the object size it's around 33kB.
Why the huge increase in size?
If I do the same thing in C++ the resulting string object in memory is much closer to the 2kB.
To measure the memory in Java I'm using Instrumentation.
For C++, I take the length of the serialized object (e.g string).

I think there are multiple factors involved.
First of all, as Bruce Martin said, objects in Java have an overhead of 16 bytes per object; C++ does not.
Second, Strings in Java might be 2 Bytes per character instead of 1.
Third, it could be that Java reserves more Memory for its Strings than the C++ std::string does.
Please note that these are just ideas where the big difference might come from.

Assuming that your XML file contains mainly ASCII characters and uses an encoding that represents them as single bytes, then you can expect the in-memory size to be at least double, since Java uses UTF-16 internally (I've heard of some JVMs that try to optimize this, though). Added to that will be overhead for 2 objects (the String instance and an internal char array) with some fields, IIRC about 40 bytes overall.
So your "object size" of 33 kB is definitely not correct, unless you're using a weird JVM. There must be some problem with the method you use to measure it.

In Java, a String object has some extra data that increases its size.
This is the object header, the array data and some other fields, which can include the array reference, offset, length, etc.
Visit http://www.javamex.com/tutorials/memory/string_memory_usage.shtml for details.

String: a String's memory growth tracks its internal char array's growth. However, the String class adds another 24 bytes of overhead.
For a nonempty String of size 10 characters or less, the added overhead cost relative to useful payload (2 bytes for each char plus 4 bytes for the length), ranges from 100 to 400 percent.
More:
What is the memory consumption of an object in Java?

Yes, you should run the GC and give it time to finish. Just call System.gc() and print the total memory (Runtime.getRuntime().totalMemory()) in a loop. You would also do better to create a million copies of the string in an array (measure the size of the empty array and then of the array filled with strings), to be sure that you are measuring the size of the strings and not of other service objects which may be present in your program. A String alone cannot take 32 kB, but a hierarchy of XML objects can.
That said, I cannot resist noting the irony that nobody cares about memory (and cache hits) in the world of Java. We all know that the JIT is improving and that it can outperform native C++ code in some cases, so there is no need to bother about memory optimization. Premature optimization is the root of all evil.

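As stated in other answers, Java's String adds overhead. If you need to store a large number of strings in memory, I suggest storing them as byte[] instead. For text that was stored in a single-byte-per-character encoding on disk, the size in memory should then be about the same as the size on disk (plus a small array header).
String -> byte[]:
String a = "hello";
byte[] aBytes = a.getBytes(StandardCharsets.UTF_8); // java.nio.charset.StandardCharsets; an explicit charset avoids platform-default surprises
byte[] -> String:
String b = new String(aBytes, StandardCharsets.UTF_8);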

Java: String.substring() with long type parameters

I have a large string (an RSS Article to be more precise) and I want to get the word in a specific startIndex and endIndex. String provides the substring method, but only using ints as its parameters. My start and end indexes are of type long.
What is the best way to get the word from a String using start and end indexes of type long?
My first solution was to start trimming the String and get it down so I can use ints. Didn't like where it was going. Then I looked at Apache Commons Lang but didn't find anything. Any good solutions?
Thank you.
Update:
Just to provide a little more information.
I am using a tool called General Architecture for Text Engineering (GATE), which scans a String and returns a list of Annotations. An annotation holds the type of a word (Person, Location, etc.) and the start and end indexes of that word.
For the RSS, I use ROME, which reads an RSS feed and contains the body of the article in a String.

There is no point in doing this on a String, because a String can hold at most 2^31 - 1 characters. Internally the string's characters are held in a char[], and all of the API methods use int as the type for lengths, positions and offsets.
The same restriction applies to StringBuffer and StringBuilder; i.e. an int length.
A StringReader is backed by a String, so that won't help.
Both CharBuffer and ByteBuffer have the same restriction; i.e. an int length.
A bare array of a primitive type is limited to an int length.
In short, you are going to have to implement your own "long string" type that internally holds its characters in (for example) an array of arrays of characters.
(I tried a Google search but I couldn't spot an existing implementation of long strings that looked credible. I guess there's not a lot of call for monstrously large strings in Java ...)
By the way, if you anticipate that the strings are never going to be this large, you should just convert your long offsets to int. A cast would work, but you might want to check the range and throw an exception if you ever get an offset >= 2^31.
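For the range-checked conversion, Math.toIntExact (Java 8 and later) already does the check and throws on overflow. A small sketch, where the helper class and method names are just for illustration:

```java
final class LongOffsets {
    /** Returns text.substring(start, end) for long offsets, failing loudly if they do not fit in an int. */
    static String substring(String text, long start, long end) {
        // Math.toIntExact throws ArithmeticException on overflow instead of silently wrapping like a cast.
        return text.substring(Math.toIntExact(start), Math.toIntExact(end));
    }
}
```

With the long start and end offsets of an annotation, the call would simply be LongOffsets.substring(articleBody, start, end), where articleBody stands for the String holding the article.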

A String is backed by a char[], and arrays can only be indexed with ints (and can consequently hold at most 2^31 - 1 characters). If you have long indexes, just cast them to ints - if they're larger than Integer.MAX_VALUE, your program is broken.

You'd better use a java.io.Reader. This class supports the methods skip(long n) and read(char[] cbuf). But please note that they return a count (skip returns a long, read returns an int) of how many characters were actually skipped or read, which may be fewer than requested, so you need to call those methods in a loop.
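The loop might look roughly like this. It is a sketch only: the helper name slice is invented, and it assumes the Reader is positioned at the start of the text and that a single word still fits in a char array.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

final class ReaderSlices {
    /** Reads the chars in [start, end) from a Reader positioned at offset 0. */
    static String slice(Reader reader, long start, long end) throws IOException {
        long toSkip = start;
        while (toSkip > 0) {
            long skipped = reader.skip(toSkip); // may skip fewer chars than requested
            if (skipped <= 0) throw new EOFException("stream ended before offset " + start);
            toSkip -= skipped;
        }
        char[] buffer = new char[Math.toIntExact(end - start)]; // one slice must still fit in an array
        int filled = 0;
        while (filled < buffer.length) {
            int n = reader.read(buffer, filled, buffer.length - filled); // -1 at end of stream
            if (n < 0) throw new EOFException("stream ended before offset " + end);
            filled += n;
        }
        return new String(buffer);
    }
}
```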

Probably it would be better not to use String but StringReader.

In Java, what's the fastest way to "build" and use a string, character by character?

I have a Java socket connection that is receiving data intermittently. The number of bytes of data received with each burst varies. The data may or may not be terminated by a well-known character (such as CR or LF). The length of each burst of data is variable.
I'm attempting to build a string out of each burst of data. What is the fastest way (speed, not memory), to build a string that would later need to be parsed?
I began by using a byte array to store the incoming bytes, then converting them to a String with each burst, like so:
byte[] message = new byte[1024];
...
message[i] = //byte from socket
i++;
...
String messageStr = new String(message);
...
//parse the string here
The obvious disadvantage of this is that some bursts may be longer than 1024. I don't want to arbitrarily create a larger byte array (what if my burst is larger?).
What is the best way of doing this? Should I create a StringBuilder object and append() to it? That way I don't have to convert from StringBuilder to String (since the former has most of the methods I need).
Again, speed of execution is my biggest concern.
TIA.

I would probably use an InputStreamReader wrapped around a BufferedInputStream, which in turn wraps the socket. And write code that processes a message at a time, potentially blocking for input. If the input is bursty, I might run on a background thread and use a concurrent queue to hold the messages.
Reading a buffer at a time and trying to convert it to characters is exactly what BufferedInputStream/InputStreamReader does. And it does so while paying attention to encoding, something that (as other people have noted) your solution does not.
I don't know why you're focused on speed, but you'll find that the time to process data coming off a socket is far less than the time it takes to transmit over that socket.
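To illustrate the wrapping described above, here is a sketch. The class and method names are made up, and it assumes UTF-8 text with LF-terminated messages, which the question does not guarantee; the stream passed to open() would be socket.getInputStream().

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class MessageReader {
    /** Wraps a raw byte stream with buffering and explicit character decoding. */
    static Reader open(InputStream in) {
        return new InputStreamReader(new BufferedInputStream(in), StandardCharsets.UTF_8);
    }

    /** Reads one LF-terminated message; returns null at end of stream. Blocks while waiting for input. */
    static String readMessage(Reader reader) throws IOException {
        StringBuilder message = new StringBuilder(1024);
        int c;
        while ((c = reader.read()) != -1) {
            if (c == '\n') return message.toString();
            message.append((char) c);
        }
        // End of stream: hand back any unterminated tail, otherwise signal completion.
        return message.length() > 0 ? message.toString() : null;
    }
}
```

A background thread could call readMessage in a loop and put each result on a concurrent queue for the parser.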

Note that as you're transmitting across network layers, your speed of conversion may not be the bottleneck. It would be worth measuring, if you believe this to be important.
Note (also) that you're not specifying a character encoding in your conversion from bytes to String (via characters). I would enforce that somehow, otherwise your client/server communication can become corrupted if/when your client/server run in different environments. You can enforce that via JVM runtime args, but it's not a particularly safe option.
Given the above, you may want to consider StringBuilder(int capacity) to configure it in advance with an appropriate size, such that it doesn't have to reallocate on the fly.

First of all, you are making a lot of assumptions about the character encoding that you receive from your client. Is it US-ASCII, ISO-8859-1, UTF-8?
Because a Java string is not a sequence of bytes, when it comes to building portable String serialization code you should make explicit decisions about character encoding. For this reason you should NEVER use StringBuilder to convert bytes to String. If you look at the StringBuilder interface you will notice that it does not even have an append( byte ) method, and that's not because the designers just overlooked it.
In your case you should definitely use a ByteArrayOutputStream. The only drawback of using the stock implementation of ByteArrayOutputStream is that its toByteArray() method returns a copy of the array held by the object internally. For this reason you may create your own subclass of ByteArrayOutputStream and provide direct access to the protected buf member.
Note that if you don't use default implementation, remember to specify byte array bounds in your String constructor. Your code should look something like this:
MyByteArrayOutputStream message = new MyByteArrayOutputStream( 1024 );
...
message.write( /* byte from socket */ );
...
String messageStr = new String(message.buf, 0, message.size(), "ISO-8859-1");
Substitute ISO-8859-1 for the character set that's suitable for your needs.
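The subclass itself can be tiny. This sketch takes a slightly different route from the snippet above: instead of exposing buf, it adds a hypothetical decode method that builds the String straight from the protected buf and count fields.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

/** Sketch of such a subclass; buf and count are protected fields of ByteArrayOutputStream. */
final class MyByteArrayOutputStream extends ByteArrayOutputStream {
    MyByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    /** Decodes the valid bytes in place, skipping the defensive copy that toByteArray() makes. */
    String decode(Charset charset) {
        return new String(buf, 0, count, charset);
    }
}
```

The stream grows on its own past the initial 1024 bytes, so longer bursts are not a problem. Note that the inherited toString(String charsetName) also decodes without going through toByteArray(), so the subclass may not be needed if that is all you want.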

StringBuilder is your friend. Add as many characters as needed, then call toString() to obtain the String.

I would create a "small" array of characters and append characters to it.
When the array is full (or transmission ends), use the StringBuilder.append(char[] str) method to append the content of the array to your string.
Now for the "small" size of the array - you will need to try various sizes and see which one is fastest for your production environment (performance "may" depend on the JVM, OS, processor type and speed and so on)
EDIT: Other people mentioned ByteArrayOutputStream, I agree it is another option as well.

You may wish to look at ByteArrayOutputStream, depending on whether you are dealing with bytes rather than characters.
I generally use a ByteArrayOutputStream to assemble a message, then use toString/toByteArray to retrieve it when the message is finished.
Edit: ByteArrayOutputStream can handle various character set encodings through the toString call.

Personally, independent of language, I would send all characters to an in-memory data stream and once I need the string, I would read all characters from this stream into a string.
As an alternative, you could use a dynamic array, making it bigger whenever you need to add more characters. Even better, keep track of the actual length and grow the array by additional blocks instead of single characters. Thus, you would start with 1 character in an array of 1000 chars. Once you get to 1001, the array needs to be resized to 2000, then 3000, 4000, etc...
Fortunately, several languages, including Java, have a built-in class that specializes in this. These are the StringBuilder classes. Whatever technique they use isn't that important, but they have been created to boost performance, so they should be your fastest option.

Have a look at the Text class. It's faster (for the operations you perform) and more deterministic than StringBuilder.
Note: the project containing the class is aimed at RTSJ VMs. It is perfectly usable in standard SE/EE environments though.
