Java : Char vs String byte size - java

I was surprised to find that the following code
System.out.println("Character size:"+Character.SIZE/8);
System.out.println("String size:"+"a".getBytes().length);
outputs this:
Character size:2
String size:1
I would assume that a single character string should take up the same (or more ) bytes than a single char.
In particular I am wondering.
If I have a java bean with several fields in it, how its size will increase depending on the nature of the fields (Character, String, Boolean, Vector, etc...) I'm assuming that all java objects have some (probably minimal) footprint, and that one of the smallest of these footprints would be a single character. To test that basic assumption I started with the above code - and the results of the print statements seem counterintuitive.
Any insights into the way java stores/serializes characters vs strings by default would be very helpful.

getBytes() outputs the String with the default encoding (most likely ISO-8859-1) while the internal character char has always 2 bytes. Internally Java uses always char arrays with a 2 byte char, if you want to know more about encoding, read the link by Oded in the question comments.

I would like to say what i think,correct me if i am wrong but you are finding the length of the string which is correctly it is showing as 1 as you have only 1 character in the string. length shows the length not the size . length and size are two different things.
check this Link.. you are finding the number of bytes occupied in the wrong way

well, you have that 1 char in char array has the size of 2 bytes and that your String contains is 1 character long, not that it has 1 byte size.
The String object in Java consists of:
private final char value[];
private final int offset;
private final int count;
private int hash;
only this should assure you that anyway the String object is bigger then char array.
If you want to learn more about how object's size you can also read about the object headers and multiplicity factor for char arrays. For example here or here.

I want to add some code first and then a bit of explanation:
import java.nio.charset.Charset;
public class Main {
public static void main(String[] args) {
System.out.println("Character size: " + Character.SIZE / 8);
final byte[] bytes = "a".getBytes(Charset.forName("UTF-16"));
System.out.println("String size: " + bytes.length);
sprintByteAsHex(bytes[0]);
sprintByteAsHex(bytes[1]);
sprintByteAsHex(bytes[2]);
sprintByteAsHex(bytes[3]);
}
static void sprintByteAsHex(byte b) {
System.out.print((Integer.toHexString((b & 0xFF))));
}
}
And the output will be:
Character size: 2
String size: 4
feff061
So what you are actually missing is, you are not providing any parameter to the getBytes method. Probably, you are getting the bytes for UTF-8 representation of the character 'a'.
Well, but why did we get 4 bytes, when we asked for UTF-16? Ok, Java uses UTF-16 internally, then we should have gotten 2 bytes right?
If you examine the output:
feff061
Java actually returned us a BOM: https://en.wikipedia.org/wiki/Byte_order_mark.
So the first 2 bytes: feff is required for signalling that following bytes will be UTF-16 Big Endian. Please see the Wikipedia page for further information.
The remaining 2 bytes: 0061 is the 2 byte representation of the character "a" you have. Can be verified from: http://www.fileformat.info/info/unicode/char/0061/index.htm
So yes, a character in Java is 2 bytes, but when you ask for bytes without a specific encoding, you may not always get 2 bytes since different encodings will require different amount of bytes for various characters.

The SIZE of a Character is the storage needed for a char, which is 16 bit. The length of a string (also the length of the underlying char-array or bytes-array) is the number of characters (or bytes), not a size in bit.
That's why you had do to the division by 8 for the size, but not for the length. The length needs to be multiplied by two.
Also note that you will get other lengths for the byte-array if you specify a different encoding. In this case a transformation to a single- or varying-size encoding was performed when doing getBytes().
See: http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#getBytes(java.nio.charset.Charset)

Related

Is there a datatype that uses less storage for 2 letters than String?

Basically what the title says. I'm aware that i could use char as type if i only had one letter, but i need a datatype for 2 letters, e.g "XY". Is there anything that uses less storage (bit) or is smaller than a String? Or are multiple letters generally just saved as Strings? Thanks!
If you are sure that there are no higher-unicode characters (i.e. characters that use more than 1 char to store) in use, there are a few options:
As mentioned by #rdas, you could use an array: char[2]. This would be a bit more memory-efficient than a String, as the String has additional members. If it's only ASCII-characters, you could even use byte[2].
As one char is 16 bits, 2 chars are 32 bits, so could also try to encode the 2 characters into 1 int, as this also uses only 32 bytes, and you would not have the object overhead of the array. Clearly, this requires some additional steps to encode/decode when you need to show the stored information as actual characters, e.g. when presenting it to the user.
If your characters are only ASCII codes, i.e. every character fits into 1 byte, you could even fit it into a short.
Depending on the number of two-character combinations that you actually need to support, you could actually just enumerate all the possible combinations, use a lookup Map or sorted Array, and then only store the number/index of the code. Again, depending on the number of combinations, use a byte, short or int.
No it is not possible
This is why::
String s = "ab" // uses only 4 bytes of data as each character reserves 2 bytes
And other data types uses >= 4 bytes except short and byte but here short and byte cannot store characters

I wanted to Convert any length String to fixed 32 Bytes

I want to convert any length of String to byte32 in Java.
Code
String s="9c46267273a4999031c1d0f7e40b2a59233ce59427c4b9678d6c3a4de49b6052e71f6325296c4bddf71ea9e00da4e88c4d4fcbf241859d6aeb41e1714a0e";
//Convert into byte32
From the comments it became clear that you want to reduce the storage space of that string to 32 bytes.
The given string can easily be compressed from the 124 bytes to 62 bytes by doing a hexadecimal conversion.
However, there is no algorithm and there will not be an algorithm that can compress any data to 32 bytes. Imagine that would be possible: it would have been implemented and you would be able to get ZIP files of just 32 bytes for any file you compress.
So, unfortunately, the answer is: it's not possible.
You can not convert any length string to a byte array of length 32.
Java uses UTF-16 as it's string encoding, so in order to store 100% of the string, 1:1 as a fixed length byte array, you would be at a surface glance be limited to 16 characters.
If you are willing to live with the limitation of 16 characters, byte[] bytes = s.getBytes(); should give you a variable length byte array, but it's best to specify an explicit encoding. e.g. byte [] array2 = str.getBytes("UTF-16");
This doesn't completely solve your problem. You will now likely have to check that the byte array doesn't exceed 32 bytes, and come up with strategies for padding, possible null termination (which may potentially eat into your character budget)
Now, if you don't need the entire UTF-16 string space that Java uses for strings by default, you can get away with longer strings, by using other encodings.
IF this is to be used for any kind of other standard or something ( I see references to etherium being thrown around) then you will need to follow their standards.
Unless you are writing your own library for dealing with it directly, I highly recommend using a library that already exists, and appears to be well tested, and used.
You can achieve with the following function
byte[] bytes = s.getBytes();

Array of chars vs. array of bytes

I've found a few answers about this but none of them seem to apply to my issue.
I'm using the NDK and C++ is expecting an unsigned char array of 1024 elements, so I need to create this in java to pass it as a parameter.
The unsigned char array is expected to contain both numbers and characters.
I have tried this:
byte[] lMessage = new byte[1024];
lMessage[4] = 'a';
The problem is that then the 4th element gets added as a numerical value instead of maintaining the 'a' character.
I have also tried
char[] lMessage = new char[1024];
lMessage[4] = 'a';
While this retains the character, it does duplicate the amount of bytes in the array from 8 to 16.
I need the output to be a 8 bit ASCII unsigned array.
Any suggestions?
Thanks.
It is wrong to say that the element "gets added as a numerical value". The only thing that you can say for sure is that it gets added as electrostatic charges in eight cells of your RAM.
How you choose to represent those eight bits (01100001) in order to visualize them has little to do with what they really are, so if you choose to see them as a numerical value, then you might be tricked into believing that they are in fact a numerical value. (Kind of like a self-fulfilling prophecy (wikipedia).)
But in fact they are nothing but 8 electrostatic charges, interpretable in whatever way we like. We can choose to interpret them as a two's complement number (97), we can choose to interpret them as a binary-coded decimal number (61), we can choose to interpret them as an ASCII character ('a'), we can choose to interpret them as an x86 instruction opcode (popa), the list goes on.
The closest thing to an unsigned char in C++ is a byte in java. That's because the fundamental characteristic of these small data types is how many bits long they are. Chars in C++ are 8-bit long, and the only type in java which is also 8-bits long is the byte.
Unfortunately, a byte in java tends to be thought of as a numerical quantity rather than as a character, so tools (such as debuggers) that display bytes will display them as little numbers. But this is just an arbitrary convention: they could have just as easily chosen to display bytes as ASCII (8-bit) characters, and then you would be seeing an actual 'a' in byte[] lMessage[4].
So, don't be fooled by what the tools are showing, all that counts is that it is an 8-bit quantity. And if the tools are showing 97 (0x61), then you know that the bit pattern stored in those 8 memory cells can just as legitimately be thought of as an 'a', because the ASCII code of 'a' is 97.
So, finally, to answer your question, what you need to do is find a way to convert a java string, which consists of 16-bit unicode characters, to an array of ASCII characters, which would be bytes in java. You can try this:
String s = "TooManyEduardos";
byte[] bytes = s.getBytes("US-ASCII");
Or you can read the answers to this question: Convert character to ASCII numeric value in java for more ideas.
Will work for ASCII chars
lMessage[4] = new String('a').getBytes()[0];

Converting US-ASCII encoded byte to integer and back

I have a byte array that can be of size 2,3 or 4. I need to convert this to the correct integer value. I also need to do this in reverse, i.e an 2,3 or 4 character integer to a byte array.
e.g., raw hex bytes are : 54 and 49. The decoded string US-ASCII value is 61. So the integer answer needs to be 61.
I have read all the conversion questions on stackoverflow etc that I could find, but they all give the completely wrong answer, I dont know whether it could be the encoding?
If I do new String(lne,"US-ASCII"), where lne is my byte array, I get the correct 61. But when doing this ((int)lne[0] << 8) | ((int)lne[1] & 0xFF), I get the complete wrong answer.
This may be a silly mistake or I completely don't understand the number representation schemes in Java and the encoding/decoding idea.
Any help would be appreciated.
NOTE: I know I can just parse the String to integer, but I would like to know if there is a way to use fast operations like shifting and binary arithmetic instead?
Here's a thought on how to use fast operations like byte shifting and decimal arithmetic to speed this up. Assuming you have the current code:
byte[] token; // bytes representing a bunch of ascii numbers
int n = Integer.parseInt(new String(token)); // current approach
Then you could instead replace that last line and do the following (assuming no negative numbers, no foreign langauge characters, etc.):
int n = 0;
for (byte b : token)
n = 10*n + (b-'0');
Out of interest, this resulted in roughly a 28% speedup for me on a massive data set. I think this is due to not having to allocate new String objects and then trash them after each parseInt call.
You need two conversion steps. First, convert your ascii bytes to a string. That's what new String(lne,"us-ascii") does for you. Then, convert the string representation of the number to an actual number. For that you use something like Integer.parseInt(theString) -- remember to handle NumberFormatException.
As you say, new String(lne,"US-ASCII") will give you the correct string. To convert your String to an integer, use int myInt = Integer.parseInt(new String(lne,"US-ASCII"));

Reading a UTF-8 String from a ByteBuffer where the length is an unsigned int

I am trying to read a UTF8 string via a java.nio.ByteBuffer. The size is an unsinged int, which, of course, Java doesn't have. I have read the value into a long so that I have the value.
The next issue I have is that I cannot create an array of bytes with the long, and casting he long back to an int will cause it to be signed.
I also tried using limit() on the buffer, but again it works with int not long.
The specific thing I am doing is reading the UTF8 strings out of a class file, so the buffer has more in it that just the UTF8 string.
Any ideas on how to read a UTF8 string that has a potential length of an unsigned int from a ByteBuffer.
EDIT:
Here is an example of the issue.
SourceDebugExtension_attribute {
u2 attribute_name_index;
u4 attribute_length;
u1 debug_extension[attribute_length];
}
attribute_name_index
The value of the attribute_name_index item must be a valid index into the constant_pool table. The constant_pool entry at that index must be a CONSTANT_Utf8_info structure representing the string "SourceDebugExtension".
attribute_length
The value of the attribute_length item indicates the length of the attribute, excluding the initial six bytes. The value of the attribute_length item is thus the number of bytes in the debug_extension[] item.
debug_extension[]
The debug_extension array holds a string, which must be in UTF-8 format. There is no terminating zero byte.
The string in the debug_extension item will be interpreted as extended debugging information. The content of this string has no semantic effect on the Java Virtual Machine.
So, from a technical point of view, it is possible to have a string in the class file that is the full u4 (unsigned, 4 bytes) in length.
These won't be an issue if there is a limit to the size of a UTF8 string (I am no UTF8 expert so perhaps there is such a limit).
I could just punt on it and go with the reality that there is not going to be a String that long...
Unless your array of bytes is more than 2GB (the largest positive value of a Java int), you won't have a problem with casting the long back into a signed int.
If your array of bytes needs to be more than 2GB in length, you're doing it wrong, not least because that's way more than the default maximum heapsize of the JVM...
Having signed int won't be your main problem. Say you had a String which was 4 billion in length. You would need a ByteBuffer which is at least 4 GB, a byte[] which is at least 4 GB. When you convert this to a String, you need at least 8 GB (2 bytes per character) and a StringBuilder to build it. (Of at least 8 GB)
All up you need, 24 GB to process 1 String. Even if you have a lot of memory you won't get many Strings of this size.
Another approach is to treat the length as signed and if unsigned treat as a error as you won't have enough memory to process the String in any case. Even to handle a String which is 2 billion (2^31-1) in length you will need 12 GB to convert it to a String this way.
Java arrays use a (Java, i.e. signed) int for access as per the languge spec, so it's impossible to have an String (which is backed by a char array) longer than Integer.MAX_INT
But even that much is way too much to be processing in one chunk - it'll totally kill performance and make your program fail with an OutOfMemoryError on most machines if a sufficiently large String is ever encountered.
What you should do is process any string in chunks of a sensible size, say a few megs at a time. Then there's no practical limit on the size you can deal with.
I guess you could implement CharSequence on top of a ByteBuffer. That would allow you to keep your "String" from turning up on the heap, although most utilities that deal with characters actually expect a String. And even then, there is actually a limit on CharSequence as well. It expects the size to be returned as an int.
(You could theoretically create a new version of CharSequence that returns the size as a long, but then there's nothing in Java that would help you in dealing with that CharSequence. Perhaps it would be useful if you would implement subSequence(...) to return an ordinary CharSequence.)

Categories

Resources