In my application I'm storing strings using RandomAccessFile, and when reading a string back I need to convert a byte array to a String, which is causing an OutOfMemoryError. Is there a better way to convert than this?
str = new String(b, "UTF-8");
where b is the byte array.
Is there a better way to convert other than new String(bytes, "UTF-8") ?
This is actually a rather complicated question.
This constructor cannot simply incorporate the byte[] into the string:
Prior to Java 9, it is always necessary to decode the byte array to a UTF-16 coded array of char. So the constructor is liable to allocate roughly double the memory used by the source byte[].
With Java 9 you have the option of using a new compact representation for String. If you do this AND the UTF-8 encoded byte array only contains code points in the Latin-1 range (\u0000 to \u00ff), then the String's value is a byte[]. However, even in this case the constructor must copy the bytes to a new byte[].
In both cases, there is no more space-efficient way to create a String from a byte[]. Furthermore, I don't think there is a more space-efficient way to do the conversion starting with a stream of bytes and a character count. (I am excluding things like modifying the java.lang.* implementation, or breaking abstraction using reflection.)
Bottom line: when converting a byte[] to a String you should allow at least twice as much contiguous free memory as the original byte[] if you want your code to work on older JVMs.
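If the OOM is caused by one huge allocation spike, one mitigation (a sketch, not a drop-in fix) is to decode the bytes incrementally through an InputStreamReader and process the text chunk by chunk, so the full char[] never has to exist alongside the full byte[]. The sample input and buffer size below are illustrative assumptions:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class StreamDecode {
    public static void main(String[] args) throws IOException {
        // Stand-in for bytes read from the RandomAccessFile.
        byte[] b = "hello, \u00e9\u00e8".getBytes(StandardCharsets.UTF_8);

        // Instead of new String(b, "UTF-8") on the whole array,
        // decode incrementally from a stream.
        Reader reader = new InputStreamReader(
                new ByteArrayInputStream(b), StandardCharsets.UTF_8);
        char[] chunk = new char[8192];
        StringBuilder sb = new StringBuilder();
        int n;
        while ((n = reader.read(chunk)) != -1) {
            // If you can consume each chunk here instead of appending,
            // the peak memory use drops further.
            sb.append(chunk, 0, n);
        }
        System.out.println(sb);
    }
}
```

This only helps if the consumer can work on chunks; appending everything into one StringBuilder, as above, still materializes the full text eventually.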
Related
In an Android app I have a byte array containing data in the following format:
In another Node.js server, the same data is stored in a Buffer which looks like this:
I am looking for a way to convert both data to the same format so I can compare the two and check if they are equal. What would be the best way to approach this?
[B#cbf1911 is not a format. That is the result of invoking the .toString() method on a Java object which doesn't have a custom toString implementation (so you get the default implementation defined in java.lang.Object itself). The format of that string is:
binary-style-class-name#system-identity-hashcode.
[B is the binary style class name. That's JVM-ese for byte[].
cbf1911 is the system identity hashcode, which is (highly oversimplified and not truly something you can use to look stuff up) basically the memory address.
It is not the content of that byte array.
Lots of Java APIs allow you to pass in any object and will just invoke toString for you. Wherever you're doing this, you wrote a bug; you need to write some explicit code to turn that byte array into data.
Note that converting bytes into characters, which you'll have to do whenever you need to put that byte array onto a character-based comms channel (such as JSON or email), is tricky.
<Buffer 6a 61 ...>
This is listing each byte as two hex digits. It's an incredibly inefficient format, but it gets the job done.
A better option is base64. That is merely highly inefficient (but not incredibly inefficient); it spends 4 characters to encode 3 bytes (vs the node.js thing which spends 3 characters to encode 1 byte). Base64 is a widely supported standard.
When encoding, you need to explicitly write that. When decoding, same story.
In Java, to encode:
import android.util.Base64;

class Foo {
    void example() {
        byte[] array = ....;
        String base64 = Base64.encodeToString(array, Base64.DEFAULT);
        System.out.println(base64);
    }
}
That string is generally 'safe' - it has no characters in it that could end up being interpreted as control flow (so no <, no ", etc), and is 100% ASCII which tends to survive broken charset encoding transitions, which are common when tossing strings about the interwebs.
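For the decoding side on a plain JDK (8+) rather than Android, java.util.Base64 does the same job; the sample bytes below are made up for the demo:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    public static void main(String[] args) {
        byte[] array = {0x6a, 0x61, 0x76, 0x61};  // "java" in ASCII

        // Encode: 4 characters per 3 bytes.
        String base64 = Base64.getEncoder().encodeToString(array);
        System.out.println(base64);               // prints amF2YQ==

        // Decode back to the original bytes.
        byte[] back = Base64.getDecoder().decode(base64);
        System.out.println(new String(back, StandardCharsets.US_ASCII));  // prints java
    }
}
```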
How do you decode base64 in node? I don't know, but I'm sure a web search for 'node base64 decode' will provide hundreds of tutorials.
Good luck!
Some analysis on a Java application showed that it's spending a lot of time decoding UTF-8 byte arrays into String objects. The stream of UTF-8 bytes are coming from a LMDB database, and the values in the database are Protobuf messages, which is why it's decoding UTF-8 so much. Another problem being caused by this is Strings are taking up a large chunk of memory because of the decoding from the memory-map into a String object in the JVM.
I want to refactor this application so it does not allocate a new String every time it reads a message from the database. I want the underlying char array in the String object to simply point to the memory location.
package testreflect;

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class App {
    public static void main(String[] args) throws Exception {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe UNSAFE = (Unsafe) field.get(null);

        char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };

        // Encoding to a byte array; asBytes would be an LMDB entry
        byte[] asBytes = new byte[sourceChars.length * 2];
        UNSAFE.copyMemory(sourceChars,
                UNSAFE.arrayBaseOffset(sourceChars.getClass()),
                asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                sourceChars.length * (long) UNSAFE.arrayIndexScale(sourceChars.getClass()));

        // Copying the byte array to the char array works, but is there a way to
        // have the char array simply point to the byte array without copying?
        char[] test = new char[sourceChars.length];
        UNSAFE.copyMemory(asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                test,
                UNSAFE.arrayBaseOffset(test.getClass()),
                asBytes.length * (long) UNSAFE.arrayIndexScale(asBytes.getClass()));

        // Allocate a String object, but set its underlying
        // array manually to avoid the extra memory copy
        long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        String stringTest = (String) UNSAFE.allocateInstance(String.class);
        UNSAFE.putObject(stringTest, stringOffset, test);
        System.out.println(stringTest);
    }
}
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
However, this does not solve the memory problem. Is there a way to have a char array point to a memory location and avoid a memory allocation altogether? Avoiding the copy altogether will reduce the number of unnecessary allocations the JVM is making for these strings, leaving more room for the OS to cache entries from the LMDB database.
I think you are taking the wrong approach here.
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
Erm ... no.
Using memory copy to copy from a byte[] to char[] is not going to work. Each char in the destination char[] will actually contain 2 bytes from the original. If you then try to wrap the char[] as a String, you will get a weird kind of mojibake.
What a real UTF-8 to String conversion does is convert between 1 and 4 bytes (code units) representing a UTF-8 code point into 1 or 2 16-bit code units representing the same code point in UTF-16. That cannot be done using a plain memory copy.
If you aren't familiar with it, it would be worth reading the Wikipedia article on UTF-8 so that you understand how the text is encoded.
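To make the variable-length nature concrete, here is what the encoded sizes look like for code points of increasing size (the sample characters are chosen purely for illustration):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    public static void main(String[] args) {
        // 1 byte in UTF-8: plain ASCII
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);              // 1
        // 2 bytes: U+00E9 (é)
        System.out.println("\u00e9".getBytes(StandardCharsets.UTF_8).length);         // 2
        // 3 bytes: U+20AC (€)
        System.out.println("\u20ac".getBytes(StandardCharsets.UTF_8).length);         // 3
        // 4 bytes: U+1F600 (emoji), which is ALSO 2 chars in UTF-16
        System.out.println("\ud83d\ude00".getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println("\ud83d\ude00".length());                                  // 2
    }
}
```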
The solution depends on what you intend to do with the text data.
If the data must really be in the form of String (or StringBuilder or char[]) objects, then you really have no choice but to do the full conversion. Try anything else and you are liable to mess up; e.g. garbled text and potential JVM crashes.
If you want something that is "string like", you could conceivably implement a custom CharSequence that wraps the bytes in the messages and decodes the UTF-8 on the fly. But doing that efficiently may be a problem, especially implementing the charAt method as an O(1) method.
If you simply want to hold and/or compare the (entire) texts, this could possibly be done by representing them as byte[] objects. Those operations can be performed on the UTF-8 encoded data directly.
If the input text could actually be sent in character encoding with a fixed 8-bit character size (e.g. ASCII, Latin-1, etc) or as UTF-16, that simplifies things.
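The hold-and-compare option, for example, never needs a String at all; the sample values below stand in for LMDB entries:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CompareBytes {
    public static void main(String[] args) {
        // Stand-ins for two UTF-8 values read from the database.
        byte[] a = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] b = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Equality works directly on the encoded form; no decode, no String.
        System.out.println(Arrays.equals(a, b));   // true
        // A handy property of UTF-8: unsigned lexicographic byte order
        // matches Unicode code point order, so sorting works on the raw bytes too.
    }
}
```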
I have a byte array that gets re-written by an external library - is it safe to pass it into a String constructor, or should I create a clone first? Does the constructor copy the array's contents, or does it just keep a reference to it?
byte[] b = MagicLib.getData();
String s = new String(b);
// actually a pointer to previous memory, just with different data
b = MagicLib.getMoreData();
A String contains an array of chars, not bytes (at least prior to Java 9's compact strings). Therefore, the String cannot share the byte array's storage.
Additionally, note that the byte[] will be decoded into characters according to the platform default charset (per the documentation on String(byte[])), which implies further that a decoded version of the byte[] array has to be separately constructed.
In Oracle's JDK the constructor builds a new char[], with contents depending on the decoding charset used.
Java Strings are immutable, so the entire array has to be copied. Otherwise, you could change the contents of the String by modifying the byte array.
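A quick way to convince yourself of this copy-on-construct behaviour:

```java
import java.nio.charset.StandardCharsets;

public class CopyDemo {
    public static void main(String[] args) {
        byte[] b = "ABCD".getBytes(StandardCharsets.US_ASCII);
        String s = new String(b, StandardCharsets.US_ASCII);

        b[0] = (byte) 'Z';          // mutate the array after construction
        System.out.println(s);      // prints ABCD: the String took its own copy
    }
}
```

So it is safe to let the external library re-write the array after the String has been constructed; no clone is needed.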
I need to parse a binary file created by C++ and overwrite a 4 char long char array in that file, for example change the original char array of ABCD to WXYZ.
I know exactly the position, in bytes, of that char array. I tried RandomAccessFile, which lets me go to the position easily, but I cannot make the rest work right now.
Is the RandomAccessFile a right way to go?
I know I have to do some conversion from 2 bytes char to one byte char.
Anybody has a good way to do this?
As always, start with the JavaDoc: RandomAccessFile.
long position = ...;
byte[] bytes = new byte[] { (byte)'W', ... };
raf.seek(position);
raf.write(bytes);
RandomAccessFile is fine. As you have already figured out, in C++ char is a single byte, whereas Java uses UTF-16.
The easiest option might be to use a byte[4] in your code to represent the 4-character ASCII string.
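A minimal sketch of the whole round trip; the temp file and the byte offset are made up for the demo, standing in for the C++-written file and the known position:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PatchFile {
    public static void main(String[] args) throws IOException {
        // Stand-in for the binary file written by the C++ program.
        Path p = Files.createTempFile("demo", ".bin");
        Files.write(p, "xxABCDxx".getBytes(StandardCharsets.US_ASCII));

        long position = 2;  // known byte offset of "ABCD" in the file
        try (RandomAccessFile raf = new RandomAccessFile(p.toFile(), "rw")) {
            raf.seek(position);
            // getBytes(US_ASCII) yields one byte per character,
            // matching the single-byte C++ char.
            raf.write("WXYZ".getBytes(StandardCharsets.US_ASCII));
        }

        System.out.println(new String(Files.readAllBytes(p), StandardCharsets.US_ASCII));
        // prints xxWXYZxx
        Files.delete(p);
    }
}
```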
How can I pass a byte array to getReader without changing the data?
byte_msg = Some array byte
println(">>>" + byte_msg)
HttpServletRequest.getReader returns new BufferedReader(
new InputStreamReader(new ByteArrayInputStream(byte_msg)))
And the post receiver:
byte_msg = IOUtils.toByteArray(post.request.getReader)
println("<<<" + byte_msg)
And this is what gets printed. Why do I get different values?
>>>[B#38ffd135
<<<[B#60c0c8b5
You're printing out the result of byte[].toString() - which isn't the value of the byte array... it's just the value returned by Object.toString() - [B for "byte array", # and then the hash code. You need to convert the data to hex or something like that - which you need to do explicitly. For example, you could use the Hex class from Apache Commons Codec:
String hex = new String(Hex.encodeHex(byte_msg));
Note that if this is arbitrary binary data, you should not use InputStreamReader to convert it to a string in the first place. InputStreamReader is designed for binary data which is encoded text data - and IMO you should specify the encoding, too.
If you want to transfer arbitrary binary data, you should either transfer it without any conversion into text (so see whether your post class allows that) or use something like hex or base64 to convert to/from binary data safely.
IOUtils.toByteArray creates a new ByteArrayOutputStream, then uses toByteArray(), which creates a new byte[]; this array, being a new object, has a new identity hash code (the hash code you see, which is different). And this happens even if the content of the array was not changed.
In this case the mere observation (via IOUtils.toByteArray) has altered the output, because this check creates a new byte[] ;)
As Jon said, check the content of the array to see if there are any changes.
In order to print the content arrays you can convert the content of array to string using :
java.util.Arrays.toString(byte[])
and then print the result to stdout.
println(">>>" + Arrays.toString(byte_msg));
See the java.util.Arrays documentation for details.