Or is it that it just gets a reference to it?
I have a byte array that gets re-written by an external library - is it safe to pass it into a String constructor, or should I create a clone first?
byte[] b = MagicLib.getData();
String s = new String(b);
// actually a pointer to previous memory, just with different data
b = MagicLib.getMoreData();
A String contains an array of chars, not bytes. Therefore, the String cannot share the byte array's storage.
Additionally, note that the byte[] will be decoded into characters according to the platform default charset (per the documentation on String(byte[])), which implies further that a decoded version of the byte[] array has to be separately constructed.
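To illustrate why decoding has to build a separate result, here is a minimal sketch that decodes the same two bytes with two explicitly chosen charsets. The class name ExplicitCharsetDemo and the helper decode are made up for this example; passing a Charset avoids depending on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Hypothetical helper: decode with an explicit charset instead of the platform default.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] bytes = { (byte) 0xC3, (byte) 0xA9 };

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 1: both bytes form one character
        System.out.println(asLatin1.length()); // 2: each byte becomes its own character
    }
}
```

The same input yields strings of different lengths, so the String's contents cannot simply be the original byte[].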
In Oracle's Java, the constructor builds a new char[] whose contents depend on the decoding charset used.
Java Strings are immutable, so the entire array has to be copied. Otherwise, you could change the contents of the String by modifying the byte array.
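A small sketch of that behaviour, using a locally overwritten array as a stand-in for the external library (MagicLib is not modelled here; the class and method names are made up for the example):

```java
import java.nio.charset.StandardCharsets;

class StringCopyDemo {
    // Hypothetical helper: build a String from the buffer's current contents.
    static String snapshot(byte[] buffer) {
        return new String(buffer, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] buffer = "first".getBytes(StandardCharsets.US_ASCII);
        String s = snapshot(buffer);

        // Simulate the library re-writing the same array in place.
        byte[] next = "other".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(next, 0, buffer, 0, next.length);

        System.out.println(s);                // first
        System.out.println(snapshot(buffer)); // other
    }
}
```

The String created before the overwrite keeps its original text, so no clone of the array is needed before calling the constructor.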
Some analysis of a Java application showed that it spends a lot of time decoding UTF-8 byte arrays into String objects. The stream of UTF-8 bytes comes from an LMDB database, and the values in the database are Protobuf messages, which is why it decodes UTF-8 so much. A second problem is that the Strings take up a large chunk of memory, because each value is decoded from the memory map into a new String object in the JVM.
I want to refactor this application so it does not allocate a new String every time it reads a message from the database. I want the underlying char array in the String object to simply point to the memory location.
package testreflect;

import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class App {
    public static void main(String[] args) throws Exception {
        Field field = Unsafe.class.getDeclaredField("theUnsafe");
        field.setAccessible(true);
        Unsafe UNSAFE = (Unsafe) field.get(null);

        char[] sourceChars = new char[] { 'b', 'a', 'r', 0x2018 };

        // Encoding to a byte array; asBytes would be an LMDB entry
        byte[] asBytes = new byte[sourceChars.length * 2];
        UNSAFE.copyMemory(sourceChars,
                UNSAFE.arrayBaseOffset(sourceChars.getClass()),
                asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                sourceChars.length * (long) UNSAFE.arrayIndexScale(sourceChars.getClass()));

        // Copying the byte array to the char array works, but is there a way to
        // have the char array simply point to the byte array without copying?
        char[] test = new char[sourceChars.length];
        UNSAFE.copyMemory(asBytes,
                UNSAFE.arrayBaseOffset(asBytes.getClass()),
                test,
                UNSAFE.arrayBaseOffset(test.getClass()),
                asBytes.length * (long) UNSAFE.arrayIndexScale(asBytes.getClass()));

        // Allocate a String object, but set its underlying
        // byte array manually to avoid the extra memory copy
        long stringOffset = UNSAFE.objectFieldOffset(String.class.getDeclaredField("value"));
        String stringTest = (String) UNSAFE.allocateInstance(String.class);
        UNSAFE.putObject(stringTest, stringOffset, test);
        System.out.println(stringTest);
    }
}
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
However, this does not solve the memory problem. Is there a way to have a char array point to a memory location and avoid a memory allocation altogether? Avoiding the copy altogether will reduce the number of unnecessary allocations the JVM is making for these strings, leaving more room for the OS to cache entries from the LMDB database.
I think you are taking the wrong approach here.
So far, I've figured out how to copy a byte array to a char array and set the underlying array in a String object using the Unsafe package. This should reduce the amount of CPU time the application is wasting decoding UTF-8 bytes.
Erm ... no.
Using memory copy to copy from a byte[] to char[] is not going to work. Each char in the destination char[] will actually contain 2 bytes from the original. If you then try to wrap the char[] as a String, you will get a weird kind of mojibake.
What a real UTF-8 to String conversion does is convert the 1 to 4 bytes (code units) that represent a code point in UTF-8 into the 1 or 2 16-bit code units that represent the same code point in UTF-16. That cannot be done using a plain memory copy.
If you aren't familiar with it, it would be worth reading the Wikipedia article on UTF-8 so that you understand how the text is encoded.
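As a rough illustration of the difference, the sketch below compares a real UTF-8 decode with a raw reinterpretation of the same bytes as 16-bit chars. It uses ByteBuffer.asCharBuffer as a safe stand-in for the Unsafe memory copy; the class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class Utf8VersusRawCopy {
    // Treats each pair of bytes as one big-endian 16-bit char, as a raw copy would
    // (a trailing odd byte is dropped).
    static String reinterpret(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    // The real conversion: UTF-8 code units to UTF-16 code units.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "bar\u2018";
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(utf8.length);                         // 6: 'b', 'a', 'r' take 1 byte each, U+2018 takes 3
        System.out.println(decode(utf8).equals(original));       // true
        System.out.println(reinterpret(utf8).length());          // 3: six bytes squeezed into three chars
        System.out.println(reinterpret(utf8).equals(original));  // false: the text is garbled
    }
}
```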
The solution depends on what you intend to do with the text data.
If the data must really be in the form of String (or StringBuilder or char[]) objects, then you really have no choice but to do the full conversion. Try anything else and you are liable to mess up; e.g. garbled text and potential JVM crashes.
If you want something that is "string like", you could conceivably implement a custom CharSequence that wraps the bytes in the messages and decodes the UTF-8 on the fly. But doing that efficiently may be a problem, especially implementing the charAt method as an O(1) method.
If you simply want to hold and/or compare the (entire) texts, this could possibly be done by representing them as, or in, byte[] objects. These operations can be performed on the UTF-8 encoded data directly.
If the input text could actually be sent in a character encoding with a fixed 8-bit character size (e.g. ASCII, Latin-1, etc) or as UTF-16, that simplifies things.
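For the fixed 8-bit case, a "string like" wrapper can be sketched as follows. This assumes the bytes are ISO-8859-1 (Latin-1) or ASCII, where every byte maps to exactly one char; it does not work for UTF-8. Latin1View is a hypothetical class written for this example, not part of any library.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical read-only view over Latin-1 bytes; no copy is made until toString().
final class Latin1View implements CharSequence {
    private final byte[] data;
    private final int offset;
    private final int length;

    Latin1View(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        // One byte is one char in Latin-1, so this lookup is O(1).
        return (char) (data[offset + index] & 0xFF);
    }

    @Override
    public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        return new Latin1View(data, offset + start, end - start);
    }

    @Override
    public String toString() {
        // Only here is a real String (and a copy) created.
        return new String(data, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

Unlike a String, such a view is not immutable: if the underlying array is overwritten later, the view's contents change with it.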
In my application I'm storing strings using RandomAccessFile, and when reading a string back I need to convert a byte array to a String, which is causing an OOM. Is there a better way to convert than this?
str = new String(b, "UTF-8");
where b is byte array
Is there a better way to convert other than new String(bytes, "UTF-8") ?
This is actually a rather complicated question.
This constructor cannot simply incorporate the byte[] into the string:
Prior to Java 9, it is always necessary to decode the byte array to a UTF-16 coded array of char. So the constructor is liable to allocate roughly double the memory used by the source byte[].
With Java 9 and later, String can use a new compact representation (it is on by default). If it is in use AND the UTF-8 encoded byte array only contains code points in the Latin-1 range (\u0000 to \u00ff), then the String value is a byte[]. However, even in this case the constructor must copy the bytes to a new byte[].
In both cases, there is no more space-efficient way to create a String from a byte[]. Furthermore, I don't think there is a more space-efficient way to do the conversion starting with a stream of bytes and a character count. (I am excluding things like modifying the java.lang.* implementation, or breaking abstraction using reflection.)
Bottom line: when converting a byte[] to a String you should allow at least twice as much contiguous free memory as the original byte[] if you want your code to work on older JVMs.
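The "twice as much" figure follows from the encodings: a UTF-8 sequence of 1 to 3 bytes decodes to one char, and a 4-byte sequence decodes to two chars, so the decoded String never has more chars than the input has bytes. A small sketch that checks this on a few sample strings (the samples and the names DecodeSizeDemo and decodedChars are chosen for illustration only):

```java
import java.nio.charset.StandardCharsets;

class DecodeSizeDemo {
    // Hypothetical helper: number of UTF-16 chars produced by decoding UTF-8 bytes.
    static int decodedChars(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8).length();
    }

    public static void main(String[] args) {
        String[] samples = { "hello", "h\u00e9llo", "\u2018quoted\u2019", "\uD83D\uDE00" };
        for (String sample : samples) {
            byte[] utf8 = sample.getBytes(StandardCharsets.UTF_8);
            int chars = decodedChars(utf8);
            // With 2 bytes per char, the decoded form needs at most 2 * utf8.length bytes.
            System.out.println(utf8.length + " bytes -> " + chars + " chars");
        }
    }
}
```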
What's the difference between
"hello world".getBytes("UTF-8");
and
Charset.forName("UTF-8").encode("hello world").array();
?
The second code produces a byte array with 0-bytes at the end in most cases.
Your second snippet uses ByteBuffer.array(), which just returns the array backing the ByteBuffer. That may well be longer than the content written to the ByteBuffer.
Basically, I would use the first approach if you want a byte[] from a String :) You could use other ways of dealing with the ByteBuffer to convert it to a byte[], but given that String.getBytes(Charset) is available and convenient, I'd just use that...
Sample code to retrieve the bytes from a ByteBuffer:
ByteBuffer buffer = Charset.forName("UTF-8").encode("hello world");
byte[] array = new byte[buffer.remaining()]; // bytes between position and limit
buffer.get(array);
System.out.println(array.length); // 11
System.out.println(array[0]); // 104 (encoded 'h')
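To compare the two approaches side by side, here is a minimal sketch. The exact length of the backing array is an implementation detail, so the example only checks that it is at least as long as the encoded content; the class and method names are made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodeComparison {
    // Copies exactly the encoded bytes out of the buffer, ignoring any spare capacity.
    static byte[] viaBuffer(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] exact = new byte[buffer.remaining()];
        buffer.get(exact);
        return exact;
    }

    public static void main(String[] args) {
        String text = "hello world";
        byte[] direct = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(direct, viaBuffer(text))); // true

        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        if (buffer.hasArray()) {
            // The backing array may include unused capacity after the content.
            System.out.println(buffer.array().length >= direct.length); // true
        }
    }
}
```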
Is
char buf[] = "test";
in C equivalent to
String buf = new String("test");
in Java?
And is
char *buf;
buf = "test";
equivalent to
String buf = "test";
?
It's difficult to say they're equivalent, although I understand what you're driving at.
Your C version is a sequence of 8-bit chars. The Java variant is Unicode-aware.
Secondly, in Java you're creating an object with behaviour, rather than just a sequence of chars.
Finally, the Java variant is immutable. You can change the reference, but not the underlying set of characters (this is a function of being wrapped by the String object)
For something largely equivalent you could use an array of bytes in Java. Note that this wouldn't be null-terminated, however. Java arrays are aware of their length rather than using a convention of null-termination. Alternatively, a closer C++ equivalent would probably be std::string.
This question can't really be answered - you're comparing apples to oranges.
In C, a "string" is really just a char array that is null-terminated (that is, it has a '\0' byte at the end, placed by the compiler and expected by the str__() library functions).
In Java, String is an object, that holds (possibly among other things), an array of characters, and an integer count.
They are different things, and they are used differently. Is there something specific you are trying to accomplish and having trouble with? If so, ask that, and we will try to answer it. Otherwise, this isn't really a valid question, IMO.
The first two are not equivalent. In Java, the String object, besides storing a char array, contains also other things (e.g. the length field). The java version is, of course, more OO.
The second ones are equivalent with the same observations as above. They are both pointers to containers of characters. The c container is a simple char array, while the string is a full-fledged object.
No. These are different data types. char buf[] is an array and String buf is an object. The String buf will be dynamically sized and have plenty of helpful methods with it. char buf[] is a static sized chunk of memory holding 5 8-bit characters.
Below will create an array of characters
char buf[] = "test";
Whereas String buf = new String("test"); creates a String object. Internally it is still a char[], but the String object wrapper makes it immutable. So the difference between the two languages is one of representation.
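The Java side of the comparison can be sketched as follows: the literal and the new String("test") object are different objects with equal contents, and the characters of a String can only be changed in a copy. The class and method names are made up for the example.

```java
class StringVersusCharArray {
    // Hypothetical helper: returns a modified copy; the String itself is untouched.
    static char[] withFirstChar(String s, char c) {
        char[] copy = s.toCharArray(); // toCharArray() always returns a fresh array
        copy[0] = c;
        return copy;
    }

    public static void main(String[] args) {
        String literal = "test";
        String constructed = new String("test");

        System.out.println(literal == constructed);          // false: two distinct objects
        System.out.println(literal.equals(constructed));     // true: same characters
        System.out.println(literal == constructed.intern()); // true: intern() returns the pooled literal

        char[] chars = withFirstChar(literal, 'b');
        System.out.println(new String(chars)); // best
        System.out.println(literal);           // test
    }
}
```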
How can I pass a byte array through getReader without the data being changed?
byte_msg = Some array byte
println(">>>" + byte_msg)
HttpServletRequest.getReader returns new BufferedReader(
new InputStreamReader(new ByteArrayInputStream(byte_msg)))
And the post receiver:
byte_msg = IOUtils.toByteArray(post.request.getReader)
println("<<<" + byte_msg)
And print return. Why do I get different answers?
>>>[B@38ffd135
<<<[B@60c0c8b5
You're printing the result of calling toString() on a byte[], which isn't the content of the byte array. It's just the value returned by Object.toString(): [B for "byte array", then @, then the hash code. You need to convert the data to hex or something like that, which you need to do explicitly. For example, you could use the Hex class from Apache Commons Codec:
String hex = Hex.encodeHexString(byte_msg);
Note that if this is arbitrary binary data, you should not use InputStreamReader to convert it to a string in the first place. InputStreamReader is designed for binary data that is encoded text, and IMO you should specify the encoding, too.
If you want to transfer arbitrary binary data, you should either transfer it without any conversion into text (so see whether your post class allows that) or use something like hex or base64 to convert to/from binary data safely.
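A minimal sketch of the base64 route using the JDK's java.util.Base64 (available since Java 8). The class and method names are made up for the example, and how the resulting text is attached to the request is left out.

```java
import java.util.Arrays;
import java.util.Base64;

class Base64Transfer {
    // Turns arbitrary bytes into ASCII-safe text.
    static String toText(byte[] binary) {
        return Base64.getEncoder().encodeToString(binary);
    }

    // Recovers the original bytes on the receiving side.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Values that a charset-based Reader could easily mangle.
        byte[] payload = { 0, (byte) 0xFF, (byte) 0x80, 0x7F, 10, 13 };

        String text = toText(payload);
        byte[] restored = fromText(text);

        System.out.println(Arrays.equals(payload, restored)); // true
    }
}
```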
IOUtils.toByteArray creates a new ByteArrayOutputStream and then calls toByteArray(), which creates a new byte[]. Being a new object, this array has a new identity (the hash code you see, which is different). This happens even if the content of the array was not changed.
In this case the mere observation (via IOUtils.toByteArray) has altered the output, because this check creates a new byte[] ;)
As Jon said, check the content of the array to see if there are any changes.
To print the content of an array, you can convert it to a string using:
java.util.Arrays.toString(byte[])
and then print the result to stdout.
println(">>>" + Arrays.toString(byte_msg));
j.u.Arrays documentation is here.
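To check whether the content actually changed, compare the arrays with Arrays.equals rather than by their printed identity. A small sketch, with a cloned array standing in for the one returned by IOUtils.toByteArray (the class and method names are made up for the example):

```java
import java.util.Arrays;

class ArrayContentDemo {
    // Hypothetical helper: printable form of the array's content.
    static String describe(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    public static void main(String[] args) {
        byte[] sent = { 1, 2, 3 };
        byte[] received = sent.clone(); // a new array object with the same content

        System.out.println(sent == received);              // false: different objects
        System.out.println(Arrays.equals(sent, received)); // true: same content
        System.out.println(describe(received));            // [1, 2, 3]
    }
}
```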