Array of chars vs. array of bytes - java

I've found a few answers about this but none of them seem to apply to my issue.
I'm using the NDK and C++ is expecting an unsigned char array of 1024 elements, so I need to create this in java to pass it as a parameter.
The unsigned char array is expected to contain both numbers and characters.
I have tried this:
byte[] lMessage = new byte[1024];
lMessage[4] = 'a';
The problem is that then the 4th element gets added as a numerical value instead of maintaining the 'a' character.
I have also tried
char[] lMessage = new char[1024];
lMessage[4] = 'a';
While this retains the character, it doubles the size of each element from 8 to 16 bits, so the array takes twice the memory.
I need the output to be an 8-bit unsigned ASCII array.
Any suggestions?
Thanks.

It is wrong to say that the element "gets added as a numerical value". The only thing that you can say for sure is that it gets added as electrostatic charges in eight cells of your RAM.
How you choose to represent those eight bits (01100001) in order to visualize them has little to do with what they really are, so if you choose to see them as a numerical value, then you might be tricked into believing that they are in fact a numerical value. (Kind of like a self-fulfilling prophecy.)
But in fact they are nothing but 8 electrostatic charges, interpretable in whatever way we like. We can choose to interpret them as a two's complement number (97), we can choose to interpret them as a binary-coded decimal number (61), we can choose to interpret them as an ASCII character ('a'), we can choose to interpret them as an x86 instruction opcode (popa), the list goes on.
The closest thing to an unsigned char in C++ is a byte in java. That's because the fundamental characteristic of these small data types is how many bits long they are. Chars in C++ are 8 bits long, and the only type in java which is also 8 bits long is the byte.
Unfortunately, a byte in java tends to be thought of as a numerical quantity rather than as a character, so tools (such as debuggers) that display bytes will display them as little numbers. But this is just an arbitrary convention: they could have just as easily chosen to display bytes as ASCII (8-bit) characters, and then you would be seeing an actual 'a' in lMessage[4].
So, don't be fooled by what the tools are showing, all that counts is that it is an 8-bit quantity. And if the tools are showing 97 (0x61), then you know that the bit pattern stored in those 8 memory cells can just as legitimately be thought of as an 'a', because the ASCII code of 'a' is 97.
So, finally, to answer your question, what you need to do is find a way to convert a java string, which consists of 16-bit unicode characters, to an array of ASCII characters, which would be bytes in java. You can try this:
String s = "TooManyEduardos";
byte[] bytes = s.getBytes("US-ASCII");
Or you can read the answers to this question: Convert character to ASCII numeric value in java for more ideas.
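Putting the pieces together for the NDK buffer in the question, here is a minimal sketch (the offsets are made up for illustration; StandardCharsets.US_ASCII avoids the checked UnsupportedEncodingException that the "US-ASCII" string literal forces you to handle):
import java.nio.charset.StandardCharsets;

byte[] lMessage = new byte[1024];

// A single ASCII character: the cast keeps the same 8-bit pattern (0x61 for 'a').
lMessage[4] = (byte) 'a';

// A raw numeric value occupies a byte just as legitimately: same bit pattern as 'a'.
lMessage[5] = 97;

// Copy an ASCII-encoded string into the buffer at a made-up offset.
byte[] text = "hello".getBytes(StandardCharsets.US_ASCII);
System.arraycopy(text, 0, lMessage, 10, text.length);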

Will work for ASCII chars:
lMessage[4] = String.valueOf('a').getBytes()[0]; // or simply: lMessage[4] = (byte) 'a';

Related

Is there a datatype that uses less storage for 2 letters than String?

Basically what the title says. I'm aware that I could use char as the type if I only had one letter, but I need a datatype for 2 letters, e.g. "XY". Is there anything that uses less storage (bits) or is smaller than a String? Or are multiple letters generally just saved as Strings? Thanks!
If you are sure that there are no higher-unicode characters (i.e. characters that use more than 1 char to store) in use, there are a few options:
As mentioned by @rdas, you could use an array: char[2]. This would be a bit more memory-efficient than a String, as the String has additional members. If it's only ASCII characters, you could even use byte[2].
As one char is 16 bits, 2 chars are 32 bits, so you could also try to encode the 2 characters into 1 int, as this also uses only 32 bits, and you would not have the object overhead of the array (see the sketch below this list). Clearly, this requires some additional steps to encode/decode when you need to show the stored information as actual characters, e.g. when presenting it to the user.
If your characters are only ASCII codes, i.e. every character fits into 1 byte, you could even fit it into a short.
Depending on the number of two-character combinations that you actually need to support, you could actually just enumerate all the possible combinations, use a lookup Map or sorted Array, and then only store the number/index of the code. Again, depending on the number of combinations, use a byte, short or int.
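A minimal sketch of the int-packing option above (the method names are invented for illustration; it assumes plain 16-bit chars with no surrogate pairs):
// Pack two chars into one int: first char in the high 16 bits, second in the low 16 bits.
static int pack(char first, char second) {
    return (first << 16) | second;
}

static char first(int packed) {
    return (char) (packed >>> 16);
}

static char second(int packed) {
    return (char) (packed & 0xFFFF);
}

// Usage: int xy = pack('X', 'Y'); then "" + first(xy) + second(xy) gives "XY" back.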
No, it is not possible.
This is why:
String s = "ab"; // the character data alone takes 4 bytes, as each char reserves 2 bytes
Other data types use >= 4 bytes, except short and byte, but short and byte cannot store characters.

Get least significant bytes from an integer

I need to sum all data bytes in ByteArrayOutputStream, adding +1 to the result and taking the 2 least significant bytes.
int checksum = 1;
for(byte b : byteOutputStream.toByteArray()) {
    checksum += b;
}
Any input on taking the 2 least significant bytes would be helpful. Java 8 is used in the environment.
If you really mean least significant bytes then:
checksum & 0xFFFF
If you meant that you want to take the 2 least significant bits of checksum, then:
checksum & 0x3
Add
checksum &= 0x0000ffff;
That will zero out everything to the left of the 2 least significant bytes.
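For completeness, a minimal sketch combining the loop from the question with the mask above (the method name is made up for illustration; note that b is signed, so use (b & 0xFF) instead if each byte should contribute its unsigned value):
// Sums all bytes, adds 1, and keeps only the 2 least significant bytes of the result.
static int checksum(byte[] data) {
    int checksum = 1; // starting at 1 covers the "+1 to the result"
    for (byte b : data) {
        checksum += b;
    }
    return checksum & 0xFFFF; // zero out everything above the low 16 bits
}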
Your question is a bit underspecified. You said neither what you want to do with these two bytes nor how you want to store them (which depends on what you want to do).
To get to individual bytes, you can use
byte lowest = (byte)checksum, semiLowest=(byte)(checksum>>8);
In case you want to store them in a single integer variable, you have to decide how these bytes are to be interpreted numerically, i.e. signed or unsigned.
If you want a signed interpretation, the operation is as simple as
short lowest2bytes = (short)checksum;
If you want an unsigned interpretation, there’s the obstacle that Java has no dedicated type for that. There is a 2 byte sized unsigned type (char), but using it for numerical values can cause confusion when other code tries to interpret it as character value (i.e. when printing). So in that case, the best solution is to use an int variable again and only initialize it with the unsigned char value:
int lowest2bytes = (char)checksum;
Note that this is semantically equivalent to
int lowest2bytes = checksum&0xffff;
seen in other solutions.
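A quick illustration with a made-up value, showing that the char cast and the mask agree:
int checksum = 0x12345; // hypothetical result of the summing loop

byte lowest = (byte) checksum;            // 0x45
byte semiLowest = (byte) (checksum >> 8); // 0x23

short signed2 = (short) checksum;         // low two bytes, signed interpretation
int unsigned2 = (char) checksum;          // low two bytes, unsigned interpretation (0x2345 = 9029)

System.out.println(unsigned2 == (checksum & 0xffff)); // true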

Java: Why does this string of 4 characters create a byte[] with 6 data?

I need to be able to convert an int into a string which represents a series of bytes, and back. To do this, I came up with this code:
Int -> Byte[] -> String
new String(ByteBuffer.allocate(5).putInt(num).array())
String -> Byte[] -> Int
ByteBuffer.allocate(4).put(team.getBytes()).getInt(0)
One of my test cases is the number 4231. When viewed as a string, none of the characters are visible, but that's not completely unusual, and when I invoke its .length() method, it returns 4. But when I used .getBytes(), I got [0, 0, 16, -17, -65, -67], which causes a BufferOverflowException. Can someone explain this result to me?
Without knowing the platform default encoding of your machine, it's slightly hard to say - and you should avoid calling String.getBytes without specifying an encoding, IMO.
However, basically a String represents a sequence of characters, encoded as a sequence of UTF-16 code units. Not every character is representable in one byte, in many encodings - and you certainly shouldn't assume it is. (You shouldn't even assume there's one character per char, due to surrogate pairs used to represent non-BMP characters.)
Fundamentally, you shouldn't treat a string like this - if you want to encode non-text data in a string, use hex or base64 to encode the binary data, and then decode it appropriately. Otherwise you can easily get invalid strings, and lose data - and more importantly, you're simply not treating the type for the purpose it was designed.
When you convert a byte[] into a String, you're saying "This is the binary representation of some text, in a particular encoding" (either explicitly or using the platform default). That's simply not the case here - there's no text to start with, just a number... the binary data isn't encoded text, it's an encoded integer.
First, the integer is converted to 4 bytes. Converting 4231 to hex gives 00001087, so the four bytes of the int are [0, 0, 16, -121]: the zeroes are obviously zero, 10 (hex) is 16, and 87 (hex) is 135, which as a signed 8-bit two's complement byte is 135 - 256 = -121.
So the only real mystery is where the -17, -65, -67 at the end came from. Those three bytes are 0xEF 0xBF 0xBD, the UTF-8 encoding of the Unicode replacement character: 0x87 is not a valid byte on its own in UTF-8 (evidently the platform default here), so the round trip through String replaced it. That is exactly the information loss described in the answer above.
If this kind of stuff isn't obvious to you, you probably shouldn't mess with native encodings and should instead separate and combine the numbers yourself using things like multiplication and division. (Just use decimal numbers and strings.)
What you are trying is viewing bytes as characters. That concept became invalid with the introduction of multi-byte characters in operating systems and languages.
In java, Strings are composed of characters, not bytes. A mistake often made is assuming that a conversion byte[] -> String -> byte[] using new String(byte[]) / getBytes() will yield the original bytes. That's simply not true: depending on the encoding, byte[] -> String may already lose information (if the byte[] contains values invalid for that encoding). Likewise, not every encoding can encode every possible character.
So you are chaining two possibly lossy operations and wonder why information is lost.
Proper way to encode the information contained in the int is to select a specific representation for the int (e.g. decimal or hexadecimal) and encode/decode that.
Try this for encoding/decoding:
String hex = Integer.toString(i, 16);
int decoded = Integer.parseInt(hex, 16);
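If you would rather carry the raw 4 bytes of the int instead of a decimal or hex rendering, base64 (as suggested above) keeps the round trip lossless. A minimal sketch, not part of the original answer:
import java.nio.ByteBuffer;
import java.util.Base64;

// Encode the 4 raw bytes of the int as printable text.
String encoded = Base64.getEncoder()
        .encodeToString(ByteBuffer.allocate(4).putInt(4231).array());

// Decode back: wrap the bytes and read the int out again.
int decodedInt = ByteBuffer.wrap(Base64.getDecoder().decode(encoded)).getInt(); // 4231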

Why does writeBytes discard each character's high eight bits?

I wanted to use DataOutputStream#writeBytes, but was running into errors. Description of writeBytes(String) from the Java Documentation:
Writes out the string to the underlying output stream as a sequence of bytes. Each character in the string is written out, in sequence, by discarding its high eight bits.
I think the problem I'm running into is due to the part about "discarding its high eight bits". What does that mean, and why does it work that way?
Most Western programmers tend to think in terms of ASCII, where one character equals one byte, but Java Strings are 16-bit Unicode. writeBytes just writes out the lower byte, which for ASCII/ISO-8859-1 is the "character" in the C sense.
The char data type is a single 16-bit Unicode character. It has a minimum value of '\u0000' (or 0) and a maximum value of '\uffff' (or 65,535 inclusive). But the byte data type is an 8-bit signed two's complement integer. It has a minimum value of -128 and a maximum value of 127 (inclusive). That is why this function writes the low-order byte of each char in the string, from first to last. Any information in the high-order byte is lost. In other words, it assumes the string contains only characters whose value is between 0 and 255.
You may look into the writeUTF(String s) method, which retains the information in the high-order byte as well as the length of the string. First it writes the length of the encoded string onto the underlying output stream as a 2-byte unsigned value between 0 and 65,535. Next it encodes the string in (modified) UTF-8 and writes the bytes of the encoded string to the underlying output stream. This allows a data input stream reading those bytes to completely reconstruct the string.
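A minimal round-trip sketch with writeUTF/readUTF (the file name is made up for illustration):
import java.io.*;

// Write: a 2-byte length prefix followed by the (modified) UTF-8 bytes, so nothing is lost.
try (DataOutputStream out = new DataOutputStream(new FileOutputStream("msg.bin"))) {
    out.writeUTF("héllo");
}

// Read: reconstructs exactly the same String.
try (DataInputStream in = new DataInputStream(new FileInputStream("msg.bin"))) {
    String s = in.readUTF(); // "héllo"
}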

Converting US-ASCII encoded byte to integer and back

I have a byte array that can be of size 2, 3 or 4. I need to convert this to the correct integer value. I also need to do this in reverse, i.e. a 2-, 3- or 4-digit integer to a byte array.
e.g., the raw bytes are 54 and 49 (0x36 and 0x31). The decoded US-ASCII string is "61", so the integer answer needs to be 61.
I have read all the conversion questions on stackoverflow etc. that I could find, but they all give the completely wrong answer; I don't know whether it could be the encoding?
If I do new String(lne,"US-ASCII"), where lne is my byte array, I get the correct 61. But when doing ((int)lne[0] << 8) | ((int)lne[1] & 0xFF), I get a completely wrong answer.
This may be a silly mistake or I completely don't understand the number representation schemes in Java and the encoding/decoding idea.
Any help would be appreciated.
NOTE: I know I can just parse the String to integer, but I would like to know if there is a way to use fast operations like shifting and binary arithmetic instead?
Here's a thought on how to use fast operations like decimal arithmetic directly on the bytes to speed this up. Assuming you have the current code:
byte[] token; // bytes representing a bunch of ascii numbers
int n = Integer.parseInt(new String(token)); // current approach
Then you could instead replace that last line and do the following (assuming no negative numbers, no foreign language characters, etc.):
int n = 0;
for (byte b : token) {
    n = 10 * n + (b - '0');
}
Out of interest, this resulted in roughly a 28% speedup for me on a massive data set. I think this is due to not having to allocate new String objects and then trash them after each parseInt call.
You need two conversion steps. First, convert your ascii bytes to a string. That's what new String(lne,"us-ascii") does for you. Then, convert the string representation of the number to an actual number. For that you use something like Integer.parseInt(theString) -- remember to handle NumberFormatException.
As you say, new String(lne,"US-ASCII") will give you the correct string. To convert your String to an integer, use int myInt = Integer.parseInt(new String(lne,"US-ASCII"));
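For the reverse direction mentioned in the question (an integer back to its ASCII digit bytes), a minimal sketch using the same kind of arithmetic, assuming a non-negative value (the method name is invented for illustration):
// Converts a non-negative int to its ASCII decimal digits, e.g. 61 -> {0x36, 0x31}.
static byte[] toAsciiDigits(int value) {
    int len = 1;
    for (int v = value; v >= 10; v /= 10) {
        len++; // count the digits
    }
    byte[] out = new byte[len];
    for (int i = len - 1; i >= 0; i--) {
        out[i] = (byte) ('0' + value % 10); // fill from the last digit backwards
        value /= 10;
    }
    return out;
}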
