Use Java Unsafe to point a char array to a memory location

Some analysis of a Java application showed that it is spending a lot of time decoding UTF-8 byte arrays into String objects. The UTF-8 bytes come from an LMDB database, and the values in the database are Protobuf messages, which is why it decodes UTF-8 so much. A second problem is that the Strings take up a large chunk of memory, because every value is decoded from the memory map into a new String object in the JVM.
I want to refactor this application so it does not allocate a new String every time it reads a message from the database. I want the underlying char array in the String object to simply point to the memory location.
    package testreflect;
    import java.lang.reflect.Field;
    import sun.misc.Unsafe;
    public class App {
        public static void main(String[] args) throws Exception {
            // Obtain the Unsafe singleton through reflection
            Field field = Unsafe.class.getDeclaredField("theUnsafe");
            field.setAccessible(true);
            Unsafe UNSAFE = (Unsafe) field.get(null);
            char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };
            // Encoding to a byte array; asBytes would be an LMDB entry
            byte[] asBytes = new byte[sourceChars.length * 2];
            UNSAFE.copyMemory(sourceChars,
                    UNSAFE.arrayBaseOffset(sourceChars.getClass()),
                    asBytes,
                    UNSAFE.arrayBaseOffset(asBytes.getClass()),
                    sourceChars.length * (long) UNSAFE.arrayIndexScale(sourceChars.getClass()));
            // Copying the byte array to the char array works, but is there a way to
            // have the char array simply point to the byte array without copying?
            char[] test = new char[sourceChars.length];
            UNSAFE.copyMemory(asBytes,
                    UNSAFE.arrayBaseOffset(asBytes.getClass()),
                    test,
                    UNSAFE.arrayBaseOffset(test.getClass()),
                    asBytes.length * (long) UNSAFE.arrayIndexScale(asBytes.getClass()));
            // Allocate a String object, but set its underlying
            // char array manually to avoid the extra memory copy
            long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
            String stringTest = (String) UNSAFE.allocateInstance(String.class);
            UNSAFE.putObject(stringTest, stringOffset, test);
            System.out.println(stringTest);
        }
    }
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
However, this does not solve the memory problem. Is there a way to have a char array point to a memory location and avoid a memory allocation altogether? Avoiding the copy altogether will reduce the number of unnecessary allocations the JVM is making for these strings, leaving more room for the OS to cache entries from the LMDB database.

I think you are taking the wrong approach here.
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
Erm ... no.
Using memory copy to copy from a byte[] to char[] is not going to work. Each char in the destination char[] will actually contain 2 bytes from the original. If you then try to wrap the char[] as a String, you will get a weird kind of mojibake.
What a real UTF-8 to String conversion does is convert the 1 to 4 bytes (code units) that represent a code point in UTF-8 into the 1 or 2 16-bit code units that represent the same code point in UTF-16. That cannot be done using a plain memory copy.
If you aren't familiar with it, it would be worth reading the Wikipedia article on UTF-8 so that you understand how the text is encoded.
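To make the difference concrete, here is a minimal sketch that avoids Unsafe altogether. The class and method names are made up for illustration; reinterpret only mimics what a raw copy into a char[] amounts to, and decode is the ordinary conversion.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class Utf8CopyDemo {
    // Treat every pair of raw bytes as one 16-bit char (native byte order),
    // which is the effect of a plain memory copy into a char[].
    static String reinterpret(byte[] bytes) {
        char[] chars = new char[bytes.length / 2];
        ByteBuffer.wrap(bytes).order(ByteOrder.nativeOrder()).asCharBuffer().get(chars);
        return new String(chars);
    }

    // The real conversion: UTF-8 code units in, UTF-16 code units out.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] utf8 = "bar\u2018".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                // 6 (3 ASCII bytes + 3 bytes for U+2018)
        System.out.println(decode(utf8).length());      // 4
        System.out.println(reinterpret(utf8).length()); // 3, and the content is garbage
    }
}
```

The byte count, the decoded char count and the reinterpreted char count all differ, so no fixed-size copy can stand in for the decoder.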
The solution depends on what you intend to do with the text data.
If the data must really be in the form of String (or StringBuilder or char[]) objects, then you really have no choice but to do the full conversion. Try anything else and you are liable to mess up; e.g. garbled text and potential JVM crashes.
If you want something that is "string like", you could conceivably implement a custom CharSequence that wraps the bytes in the messages and decodes the UTF-8 on the fly. But doing that efficiently may be a problem, especially implementing the charAt method as an O(1) operation.
If you simply want to hold and/or compare the (entire) texts, this could possibly be done by representing them as, or within, byte[] objects. These operations can be performed on the UTF-8 encoded data directly.
If the input text could actually be sent in a character encoding with a fixed 8-bit character size (e.g. ASCII, Latin-1, etc.) or as UTF-16, that simplifies things.
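As a rough illustration of the fixed-width case, the sketch below wraps ISO-8859-1 (Latin-1) bytes in a CharSequence without allocating a char[]. Latin1View is a hypothetical name, and the sketch assumes the data is available as a byte[] that nobody modifies while the view is in use; an LMDB value would more likely arrive as a ByteBuffer, which would need the same treatment.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical read-only view over Latin-1 bytes; shares the array, copies nothing.
final class Latin1View implements CharSequence {
    private final byte[] bytes;
    private final int offset;
    private final int length;

    Latin1View(byte[] bytes, int offset, int length) {
        if (offset < 0 || length < 0 || offset > bytes.length - length) {
            throw new IndexOutOfBoundsException("offset " + offset + ", length " + length);
        }
        this.bytes = bytes;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    // O(1): in Latin-1 the unsigned byte value is the Unicode code point.
    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) (bytes[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start " + start + ", end " + end);
        }
        return new Latin1View(bytes, offset + start, end - start);
    }

    // Only this method allocates a real String.
    @Override
    public String toString() {
        return new String(bytes, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

The same shape does not carry over to UTF-8 directly, because there the position of the n-th character depends on the width of every character before it.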

Related

Converting Java byte array to Buffer in Node.js

In an Android app I have a byte array containing data in the following format:
In another Node.js server, the same data is stored in a Buffer which looks like this:
I am looking for a way to convert both data to the same format so I can compare the two and check if they are equal. What would be the best way to approach this?
[B@cbf1911 is not a format. That is the result of invoking the .toString() method on a Java object which doesn't have a custom toString implementation (thus, you get the default implementation written in java.lang.Object itself). The format of that string is:
binary-style-class-name@system-identity-hashcode.
[B is the binary style class name. That's JVM-ese for byte[].
cbf1911 is the system identity hashcode, which is (highly oversimplified and not truly something you can use to look stuff up) basically the memory address.
It is not the content of that byte array.
Lots of Java APIs allow you to pass in any object and will just invoke toString for you. Wherever you're doing this, you wrote a bug; you need to write some explicit code to turn that byte array into data.
Note that converting bytes into characters, which you'll have to do whenever you need to put that byte array onto a character-based comms channel (such as JSON or email), is tricky.
<Buffer 6a 61 ...>
This is listing each byte as a pair of hex digits (two nibbles). This is an incredibly inefficient format, but it gets the job done.
A better option is base64. That is merely highly inefficient (but not incredibly inefficient); it spends 4 characters to encode 3 bytes (vs the node.js thing which spends 3 characters to encode 1 byte). Base64 is a widely supported standard.
When encoding, you need to explicitly write that. When decoding, same story.
In java, to encode:
    import android.util.Base64;
    class Foo {
        void example() {
            byte[] array = ....;
            String base64 = Base64.encodeToString(array, Base64.DEFAULT);
            System.out.println(base64);
        }
    }
That string is generally 'safe' - it has no characters in it that could end up being interpreted as control flow (so no <, no ", etc), and is 100% ASCII which tends to survive broken charset encoding transitions, which are common when tossing strings about the interwebs.
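For comparing the two sides, a sketch along the following lines may help. It assumes java.util.Base64 is available (Java 8+, which is not the android.util.Base64 class used above), and ByteText with its method names is purely illustrative.

```java
import java.util.Base64;

final class ByteText {
    // Space-separated lowercase hex pairs, similar to what Node prints inside <Buffer ...>.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 3);
        for (int i = 0; i < data.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            int b = data[i] & 0xFF; // bytes are signed in Java; mask to 0..255
            sb.append(Character.forDigit(b >> 4, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // 4 characters for every 3 bytes, ASCII only.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

Once both sides produce the same textual form (hex or base64), equality is a plain string comparison; alternatively, decode on the receiving side and compare the bytes.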
How do you decode base64 in node? I don't know, but I'm sure a web search for 'node base64 decode' will provide hundreds of tutorials.
Good luck!

conversion of byte array to string causing OOM

In my application I am storing strings using RandomAccessFile, and when reading a string back I need to convert a byte array to a String, which is causing an OOM. Is there a better way to convert than this?
str = new String(b, "UTF-8");
where b is the byte array.
Is there a better way to convert other than new String(bytes, "UTF-8") ?
This is actually a rather complicated question.
This constructor cannot simply incorporate the byte[] into the string:
Prior to Java 9, it is always necessary to decode the byte array to a UTF-16 coded array of char. So the constructor is liable to allocate roughly double the memory used by the source byte[].
With Java 9 you have the option of using a new compact representation for String. If you do, and if the UTF-8 encoded byte array only contains code points in the Latin-1 range (\u0000 to \u00ff), then the String value is a byte[]. However, even in this case the constructor must copy the bytes to a new byte[].
In both cases, there is no more space-efficient way to create a String from a byte[]. Furthermore, I don't think there is a more space-efficient way to do the conversion starting with a stream of bytes and a character count. (I am excluding things like modifying the java.lang.* implementation, or breaking abstraction using reflection.)
Bottom line: when converting a byte[] to a String you should allow at least twice as much contiguous free memory as the original byte[] if you want your code to work on older JVMs.
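The "twice as much" figure follows from the fact that each UTF-8 byte produces at most one 16-bit code unit. A small sketch of that arithmetic, with made-up names and counting only the character payload (object headers are ignored):

```java
import java.nio.charset.StandardCharsets;

final class DecodeSizeEstimate {
    // Upper bound, in bytes, for the UTF-16 payload of the decoded text:
    // no UTF-8 byte can yield more than one 16-bit code unit.
    static long maxUtf16Bytes(byte[] utf8) {
        return 2L * utf8.length;
    }

    // Actual UTF-16 payload; note that this performs the decode it measures.
    static long actualUtf16Bytes(byte[] utf8) {
        return 2L * new String(utf8, StandardCharsets.UTF_8).length();
    }
}
```

For pure ASCII input the bound is reached exactly; text with multi-byte characters needs less than the bound.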

How can I get byteSize of String Array other than traversing the Array

I want to optimize my code by using ByteBuffer in place of String. What I am getting is a String[], and I am formatting each element of it.
e.g. String strAry[] = {"Help", "I", "am", "trapped", "in", "a", "fortune", "cookie", "factory"};
is my String array. I am writing its content to a .csv file in the
format "StrArray[0]";"StrArray[1]";"StrArray[2]";"StrArray[3]"; and so on...
This internally creates multiple Strings, and the code sometimes runs in a loop hundreds of thousands of times.
I want to use a ByteBuffer instead. When creating it with
ByteBuffer bbuf = ByteBuffer.allocate(bufferSize); I need to specify the buffer size.
I don't want to iterate over each element of the String[] to calculate its byte size.
Any help is appreciated.
Couple of notes:
Data structure usage
I think you should be using CharBuffer and not ByteBuffer. CharBuffer is sized by the number of characters, not bytes.
Buffers from Java NIO are always used as buffers, that means there is a possibility that you will need to read into them multiple times.
If you need to have the whole content in memory, buffers are not the data structure for this use case.
You don't have to know the exact size for a buffer, the allocated size is the maximal capacity of the buffer.
StringBuilder is a mutable data structure for string processing. You might consider using it instead.
You don't have to know the exact size.
Computation of final size
The final size might be computed using the Stream API (Java 8) or similar utility methods.
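A minimal sketch of both suggestions together, assuming the quoted, semicolon-separated layout from the question with no trailing separator and no escaping of embedded quotes. CsvLine and its method names are made up for illustration.

```java
import java.util.Arrays;

final class CsvLine {
    // Characters needed for "a";"b";"c": the elements themselves,
    // two quotes per element, and one separator between neighbours.
    static int lineLength(String[] values) {
        if (values.length == 0) {
            return 0;
        }
        int chars = Arrays.stream(values).mapToInt(String::length).sum();
        return chars + 2 * values.length + (values.length - 1);
    }

    // Builds the line in one pre-sized StringBuilder instead of many temporary Strings.
    static String format(String[] values) {
        StringBuilder sb = new StringBuilder(lineLength(values));
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(';');
            }
            sb.append('"').append(values[i]).append('"');
        }
        return sb.toString();
    }
}
```

The stream still visits every element, but String.length() is a constant-time call, so the pass is cheap compared with the formatting itself.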

Does String(byte[]) create a deep copy of the byte array?

Or is it that it just gets a reference to it?
I have a byte array that gets re-written by an external library - is it safe to pass it into a String constructor, or should I create a clone first?
byte[] b = MagicLib.getData();
String s = new String(b);
// actually a pointer to previous memory, just with different data
b = MagicLib.getMoreData();
A String contains an array of chars, not bytes. Therefore, the String cannot share the byte array's storage.
Additionally, note that the byte[] will be decoded into characters according to the platform default charset (per the documentation on String(byte[])), which implies further that a decoded version of the byte[] array has to be separately constructed.
In Oracle Java it returns a new char[] depending on the decoding charset used
Java Strings are immutable, so the entire array has to be copied. Otherwise, you could change the contents of the String by modifying the byte array.
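This is easy to check. The sketch below (class and method names are illustrative, and an explicit charset is used instead of the platform default) builds a String from a buffer and then overwrites the buffer, the way an external library might:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class StringCopyDemo {
    // Decode the buffer, then overwrite it in place.
    static String snapshotThenOverwrite(byte[] buffer) {
        String s = new String(buffer, StandardCharsets.US_ASCII);
        Arrays.fill(buffer, (byte) 'x'); // simulates the library reusing its buffer
        return s;                        // still holds the original text
    }

    public static void main(String[] args) {
        byte[] b = "abc".getBytes(StandardCharsets.US_ASCII);
        System.out.println(snapshotThenOverwrite(b));                 // abc
        System.out.println(new String(b, StandardCharsets.US_ASCII)); // xxx
    }
}
```

So no defensive clone is needed before calling the constructor, as long as the library is not writing to the array while the constructor runs.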

UTF-8 String class for java

I need to hold lots of string objects in memory (hundreds of MB) and I want to hold them in UTF-8 format, since in most cases that will require half of the memory the default implementation uses.
The default String class requires for a 12 characters string 60 bytes (See http://blog.griddynamics.com/2010/01/java-tricks-reducing-memory-consumption.html).
Most of my Strings are 10-20 characters long.
I wonder if there is some open source library which offers a wrapper for such strings?
I know how to convert String to UTF-8 byte array but I'm looking for a wrapper class which will provide all needed utilities functions (Hash, Equal, toString, fromString, etc).
Apache Avro has a Utf8 wrapper class which implements CharSequence, but I don't know the memory consumption of such objects.
Hadoop has the Text class, which has much the kind of interface you desire.
If you want a distinct object for each string and you want them as compact as possible then use byte arrays. That will be 1 byte per char vs 2, and you won't have the overhead of the String header (which adds probably 32 bytes per object).
But of course you wouldn't be able to use any String methods on these without first converting to String.
But if you really want to save space, store the strings back-to-back in a few larger arrays, with "dope vectors" to locate the individual strings.
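One way the back-to-back layout could look is sketched below. PackedStrings is a hypothetical name, the offsets array plays the role of the "dope vector", and a String is only materialised when one is actually requested.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PackedStrings {
    private final byte[] data;   // all strings, UTF-8 encoded, back to back
    private final int[] offsets; // offsets[i] = start of string i; last entry = data.length

    PackedStrings(String[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        offsets = new int[values.length + 1];
        for (int i = 0; i < values.length; i++) {
            offsets[i] = out.size();
            byte[] encoded = values[i].getBytes(StandardCharsets.UTF_8);
            out.write(encoded, 0, encoded.length);
        }
        offsets[values.length] = out.size();
        data = out.toByteArray();
    }

    int size() {
        return offsets.length - 1;
    }

    // Number of UTF-8 bytes used by string i.
    int byteLength(int i) {
        return offsets[i + 1] - offsets[i];
    }

    // Decodes string i on demand; nothing is cached.
    String get(int i) {
        return new String(data, offsets[i], byteLength(i), StandardCharsets.UTF_8);
    }
}
```

Per string this costs the UTF-8 bytes plus one int, with no per-string object header. The price is that every get pays for a decode, and a single byte[] caps the total at about 2 GB, which is presumably why the suggestion above mentions "a few larger arrays".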
