Why do I get java.nio.BufferUnderflowException - java

I get a BufferUnderflowException from the following code.
int length = mBuf.remaining();
char[] charBuff = new char[length];
for (int i = 0; i < length; ++i) {
charBuff[i] = mBuf.getChar();
}
mBuf is a ByteBuffer. The line "charBuff[i] = mBuf.getChar();" throws the exception.
What could be causing this?

You have mistakenly assumed that a char is one byte in size. In Java, a char is two bytes, so mBuf.getChar() consumes two bytes. The documentation even states that the method reads the next two bytes.
If you use the CharBuffer returned by mBuf.asCharBuffer(), that buffer's remaining() method will give you the number you expect.
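To make the byte count versus char count difference concrete, here is a minimal sketch. The class name CharViewDemo and the helper readChars are made up for illustration, and it assumes the buffer really holds two-byte chars in the buffer's byte order.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;

public class CharViewDemo {
    // Hypothetical helper: reads every complete 2-byte char left in the buffer.
    static char[] readChars(ByteBuffer buf) {
        // The view's remaining() counts chars, not bytes.
        CharBuffer view = buf.asCharBuffer();
        char[] out = new char[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        buf.putChar('J').putChar('a').putChar('v').putChar('a');
        buf.flip();

        System.out.println(buf.remaining());                // 8 (bytes)
        System.out.println(buf.asCharBuffer().remaining()); // 4 (chars)
        System.out.println(new String(readChars(buf)));     // Java
    }
}
```

Note that reading through the view does not advance the position of the original ByteBuffer.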
Update: Based on your comment, I understand now that your buffer does in fact contain one-byte characters. Since Java deals with the entire Unicode repertoire (which contains well over a hundred thousand characters), you must tell it which Charset (character-to-byte encoding) you are using:
// This is a pretty common one-byte charset.
Charset charset = StandardCharsets.ISO_8859_1;
// This is another common one-byte charset. You must use the same charset
// that was used to write the bytes in your Objective-C program.
//Charset charset = Charset.forName("windows-1252");
CharBuffer c = charset.newDecoder().decode(mBuf);
char[] charBuff = new char[c.remaining()];
c.get(charBuff);

Related

How many bytes do English and Chinese characters take in Java?

import java.io.UnsupportedEncodingException;
public class TestChar {
public static void main(String[] args) throws UnsupportedEncodingException {
String cnStr = "龙";
String enStr = "a";
byte[] cnBytes = cnStr.getBytes("UTF-8");
byte[] enBytes = enStr.getBytes("UTF-8");
System.out.println("bytes size of Chinese:" + cnBytes.length);
System.out.println("bytes size of English:" + enBytes.length);
// In Java, a char takes two bytes. The question is:
char cnc = '龙'; // will '龙' take two or three bytes?
char enc = 'a'; // will 'a' take one or two bytes?
}
}
Output :
bytes size of Chinese:3
bytes size of English:1
Here, my JVM's default encoding is UTF-8. From the output, we know that the Chinese character '龙' takes 3 bytes and the English character 'a' takes one byte. My question is:
In Java, a char takes two bytes. So with char cnc = '龙'; char enc = 'a'; does cnc take only two bytes instead of three? And does 'a' take two bytes instead of one?
The codepoint value of 龙 is 40857. That fits inside the two bytes of a char.
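It takes 3 bytes to encode in UTF-8 because UTF-8 spends some bits of every byte on framing: a 2-byte sequence carries only 11 payload bits (code points up to U+07FF), so code points from U+0800 to U+FFFF, including 龙 (U+9F99), need 3 bytes.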
UTF-8 is a variable-length character encoding, where characters take up 1 to 4 bytes.
A Java char is 16 bits. See 3.1 Unicode in the Java Language Specification to understand how exactly Java handles Unicode.
As far as the language is concerned, Strings/chars are UTF-16, so it'll be the same for both: each char will be 16 bits.
byte[] cnBytes = cnStr.getBytes("UTF-8");
UTF-8 is a variable length encoding, so the Chinese char takes more bits because it's out of the ASCII character range.
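A small sketch of the difference between the fixed size of a char and the variable size of its UTF-8 encoding. The class and method names are hypothetical; '龙' is written as the escape \u9f99 so the example does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class CharSizeDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes when encoded as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String cn = "\u9f99"; // the character 龙, code point 40857
        String en = "a";

        System.out.println(Character.BYTES);   // 2: every char is 16 bits
        System.out.println(utf8Length(cn));    // 3
        System.out.println(utf8Length(en));    // 1
        System.out.println(utf16Length(cn));   // 2
        System.out.println(utf16Length(en));   // 2
    }
}
```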

How to read a byte back in Java?

I need to read in bytes from a file, turn them into a string, do something with the string, then get the bytes back from the string, so I have the following code:
byte[] bFile=readFileBytes(filePath);
StringBuilder massageBuilder=new StringBuilder();
for (int i=0;i<bFile.length;i++) massageBuilder.append(bFile[i]);
String x=massageBuilder.charAt(n)+"";
...
byte b=x.getBytes();
But the last step doesn't give me the byte back. What's wrong? I want to get back the byte behind "massageBuilder.charAt(n)".
You can't get back to the original bytes given how you're adding them to your string builder.
Take this example:
byte[] bFile = "This is the input string".getBytes();
StringBuilder massageBuilder = new StringBuilder();
for (int i = 0; i < bFile.length; i++)
massageBuilder.append(bFile[i]);
When you print massageBuilder, you get
8410410511532105115321161041013210511011211711632115116114105110103
The result is a run of concatenated decimal values with nothing to show where one byte ends and the next begins. Each input byte contributes one or more characters to the resulting string. Even if you knew the character set of the original text, you'd still have trouble because of ambiguous sequences: for example, "84104" could be 84 followed by 104, or 8, 41 and 04.
It might be possible if you used a delimiter of some sort...
massageBuilder.append(bFile[i]).append("~");
//84~104~105~115~32~105~115~32~116~104~101~32~105~110~112~117~116~...
In which case you can split by it and rebuild your byte array.
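The split-and-rebuild idea could look roughly like this. It is only a sketch: the class DelimitedBytes and its two methods are invented for illustration, and '~' is used as the delimiter because a byte's decimal form can itself contain '-' when the value is negative.

```java
import java.util.Arrays;

public class DelimitedBytes {
    // Writes each byte as a decimal number followed by '~'.
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(b).append('~');
        }
        return sb.toString();
    }

    // Splits on '~' and parses each piece back into a byte.
    static byte[] decode(String text) {
        if (text.isEmpty()) {
            return new byte[0];
        }
        String[] parts = text.split("~"); // trailing empty piece is dropped by split
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Byte.parseByte(parts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = {84, 104, 105, 115, -1};
        String text = encode(original);
        System.out.println(text); // 84~104~105~115~-1~
        System.out.println(Arrays.equals(original, decode(text))); // true
    }
}
```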

Java String from byte array

I am currently reading in a UDP byte array that I know is a string and I know the MAXIMUM possible length of said string. So I print out a string (which is usually shorter than the max length). I am able to print it out but it prints out the text then junk characters. Is there a way to trim the junk binary data without knowing the actual length of the valid text?
String result = new String(input, Charset.forName("US-ASCII"));
I'll try to give more detail for those asking. Here is how the UDP message is read:
sock.receive(incoming);
byte[] data = incoming.getData();
String s = new String(data, 0, incoming.getLength());
The UDP message itself contains a header of fixed size and then a set of data (max size of 1024 bytes). This data may be int, string, byte etc., as determined by the header data. So depending on the type, I chop the data into chunks of the appropriate size. The problem I am focusing on is the String type of data. I know that the max size of a string will be 128 bytes, so I read the data in chunks of that size, where dataArray is the byte array:
for (int i = 0; i < msg.length; i = i + readSize)
{
dataArray = Arrays.copyOfRange(msg, i, i + readSize);
}
Then I use the code from the first snippet in this post to place the data into a String object. The thing is, the text that is sent is usually shorter than the 128 bytes allocated as the max size. So when I print the string, I get the valid text and then whitespace and non-printable ASCII characters (junk data). Hope this addition helps.
An example of the output is here. Everything up to the .mof is valid:
https://1drv.ms/i/s!Ai0t7Oj1PUFBpRP9K_2RlocAK4B7
Is there a way to trim the junk binary data without knowing the actual
length of the valid text?
Yes, you can simply call trim(); it will remove the trailing null characters. trim() removes all leading and trailing characters less than or equal to \u0020 (the space character), which includes \u0000 (the null character).
byte[] bytes = "foo bar".getBytes();
// Simulate message with a size bigger than the actual encoded String
byte[] msg = new byte[32];
System.arraycopy(bytes, 0, msg, 0, bytes.length);
// Decode the message
String result = new String(msg, Charset.forName("US-ASCII"));
// Trim the result
System.out.printf("Result: '%s'%n", result.trim());
Output:
Result: 'foo bar'
Ok here is how I was able to get it to work. It's a rather manual method but before using
String result = new String(input, Charset.forName("US-ASCII"));
to combine the byte array into a string, I looked at each byte and made sure it was within the printable range of 0x20 - 0x7e. If not, I replaced the value with a space (0x20). Then finished off with a .trim on the string.
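If the sender terminates or pads the text with a zero byte (an assumption; the post does not say how the field is filled), another option is to decode only up to the first zero byte, so that anything after it never reaches the String. A rough sketch with a made-up helper name:

```java
import java.nio.charset.StandardCharsets;

public class NulTerminated {
    // Decodes the bytes before the first 0x00, or the whole field if there is none.
    // Assumes the valid text is followed by a zero byte, as in a C-style string.
    static String decodeField(byte[] field) {
        int end = 0;
        while (end < field.length && field[end] != 0) {
            end++;
        }
        return new String(field, 0, end, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] field = new byte[16];
        byte[] text = "foo.mof".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(text, 0, field, 0, text.length);
        field[10] = 0x7f; // simulated junk after the terminator
        System.out.printf("Result: '%s'%n", decodeField(field)); // Result: 'foo.mof'
    }
}
```

Unlike trim(), this also discards junk bytes that are not whitespace or control characters, as long as they come after the terminator.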

Convert a byte to an ASCII char in Java

I have a byte array, and each byte in the array corresponds to an ASCII character (8-bit ASCII character). I am trying to get the whole list of ASCII chars from the array.
byte[] data;
ArrayList<Character> qualAr = new ArrayList<>();
for (int i = 0; i < data.length; i++) {
qualAr.add((char)data[i]);
}
The above method did not print all the ASCII chars properly: many of the printed chars showed up as square boxes or empty space. If the issue is that I am not setting the encoding, how do I set the encoding to ASCII in the above method? Most of the examples I saw were for UTF-8.
Update: Thank you all. The problem was not with the encoding. I found new documentation stating that the values need to be converted using ASCII+33; without that, the raw values printed as the first ASCII characters (control characters), which made no sense.
Try using the following code:
String dataConverted = new String(data, "UTF-8");
ArrayList<Character> qualAr = new ArrayList<>();
for (char c : dataConverted.toCharArray()) {
qualAr.add(c);
}
I convert your byte array to a String, and then generate the list of characters. ASCII characters should be represented as one byte codes in UTF-8.
Keep in mind that the first 32 or so ASCII characters may render as boxes or blank spaces.
Here is a link to the basic ASCII table.
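For the "ASCII+33" conversion mentioned in the question's update, a sketch might look like the following. It assumes that the rule means adding 33 to each raw byte value to land in the printable range; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class AsciiOffset {
    // Assumption: each raw value is shifted by 33 to get a printable ASCII char.
    static List<Character> shiftToPrintable(byte[] data) {
        List<Character> out = new ArrayList<>();
        for (byte b : data) {
            // Mask to treat the byte as unsigned before adding the offset.
            out.add((char) ((b & 0xFF) + 33));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 40};
        System.out.println(shiftToPrintable(data)); // [!, ", I]
    }
}
```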

Java String to byteArray conversion issue

I am trying to encode/decode a ByteArray to String, but input/output are not matching. Am I doing something wrong?
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The output is:
130021000061f8f0001a
130021000061efbfbd
Complete code:
String[] arr = {"13", "00", "21", "00", "00", "61", "F8", "F0", "00", "1A"};
byte[] by = new byte[arr.length];
for (int i = 0; i < arr.length; i++) {
by[i] = (byte)(Integer.parseInt(arr[i],16) & 0xff);
}
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(by));
String s = new String(by, Charsets.UTF_8);
System.out.println(org.apache.commons.codec.binary.Hex.encodeHexString(s.getBytes(Charsets.UTF_8)));
The problem here is that f8f0001a isn't a valid UTF-8 byte sequence.
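First of all, f8 was the opening byte of a 5-byte sequence in the original UTF-8 design, and you've only got four bytes; modern UTF-8 (RFC 3629) does not allow f8 as a lead byte at all. Secondly, even under the old rules f8 could only be followed by continuation bytes of the form 8x, 9x, ax or bx, which f0 is not.
Therefore it gets replaced with a Unicode replacement character (U+FFFD), whose byte sequence in UTF-8 is efbfbd.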
And there (rightly) is no guarantee that the conversion of an invalid byte sequence to and from a string will result in the same byte sequence. (Note that even with two, seemingly identical strings, you might get different bytes representing them in Unicode, see Unicode equivalence. )
The moral of the story is: if you want to represent bytes, don't convert them to characters, and if you want to represent text, don't use byte arrays.
My UTF-8 is a bit rusty :-), but the sequence F8 F0 is imho not a valid utf-8 encoding.
Look at http://en.wikipedia.org/wiki/Utf-8#Description.
When you build the String from the array of bytes, the bytes are decoded.
Since the bytes from your code do not represent valid characters, the bytes that finally compose the String are not the same as the ones you passed as a parameter.
public String(byte[] bytes)
Constructs a new String by decoding the specified array of bytes using the platform's default charset. The length of the new String is a function of the charset, and hence may not be equal to the length of the byte array.
The behavior of this constructor when the given bytes are not valid in the default charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
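As a sketch of what "more control" with CharsetDecoder can mean, the decoder can be told to report malformed input instead of silently substituting U+FFFD. The class StrictDecode and the method isValidUtf8 are illustrative names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // Thrown instead of inserting a replacement character.
            return false;
        }
    }

    public static void main(String[] args) {
        // The byte sequence from the question.
        byte[] by = {0x13, 0x00, 0x21, 0x00, 0x00, 0x61,
                     (byte) 0xF8, (byte) 0xF0, 0x00, 0x1A};
        System.out.println(isValidUtf8(by)); // false
        System.out.println(isValidUtf8("abc".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

A check like this lets the caller reject the input or keep it as raw bytes, rather than discovering later that the round trip changed the data.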
