I have a byte array in which each byte corresponds to an ASCII character (8-bit ASCII). I am trying to get the whole list of ASCII chars from the array.
byte[] data;
ArrayList<Character> qualAr = new ArrayList<>();
for (int i = 0; i < data.length; i++) {
qualAr.add((char)data[i]);
}
The above method did not print all the ASCII chars properly: many of the printed chars showed up as square boxes or empty spaces. If the issue is that the encoding is not set, how do I set the encoding to ASCII in the above method? Most of the examples I saw were for UTF-8.
Update: Thank you all. The problem was not with the encoding. I found new documentation stating that the values need to be converted by adding 33 (ASCII+33); without that offset, the values mapped to the first ASCII characters (the control characters), which made no sense when printed.
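For anyone who lands here with the same symptom, this is a minimal sketch of the +33 conversion described in the update. The class name, method name, and sample values are made up for illustration, and it assumes every raw value is small enough that value + 33 stays inside the printable ASCII range (33 to 126):

```java
import java.util.ArrayList;
import java.util.List;

class OffsetChars {
    // Adds the offset 33 to each raw value so it lands on a printable ASCII character.
    // Assumption: value + 33 never exceeds 126 for the data being converted.
    static List<Character> toPrintable(byte[] data) {
        List<Character> chars = new ArrayList<>(data.length);
        for (byte b : data) {
            chars.add((char) ((b & 0xFF) + 33)); // & 0xFF treats the byte as unsigned
        }
        return chars;
    }

    public static void main(String[] args) {
        byte[] data = {0, 7, 40}; // hypothetical raw values
        System.out.println(toPrintable(data)); // prints [!, (, I]
    }
}
```

Without the offset, the same three values would be cast to U+0000, U+0007 and '(', and the first two are control characters that a console typically shows as boxes or blanks.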
Try using the following code:
String dataConverted = new String(data, "UTF-8");
ArrayList<Character> qualAr = new ArrayList<>();
for (char c : dataConverted.toCharArray()) {
qualAr.add(c);
}
I convert your byte array to a String, and then generate the list of characters. ASCII characters should be represented as one byte codes in UTF-8.
Keep in mind that the first 32 or so ASCII characters may render as boxes or blank spaces.
Here is a link to the basic ASCII table.
I would like to combine a "\u" with a String that contains a Hex-Code so that I can print out a unicode character in the console.
I've tried something like this, but the console only prints regular text, eg \uf600:
ArrayList<String> arr = new ArrayList<String>();
emoji.codePoints()
.mapToObj(Integer::toHexString)
.forEach((n) -> arr.add(n)); // arr will contain hex strings
for (int i = 1; i < arr.size(); i += 2) {
System.out.println("\\u" + arr.get(i));
}
In Java, \u exists only in the compiler, as a convenience to help you add unicode character literals in your source code. If at run time you create a string that contains \ u followed by hex digits, there is no mechanism in place to transform it into a single char.
It sounds like you want to transform each code point separately to a string. Here is one way you can do that: use Character.toChars to transform the code point to a char array, and then build a new string from the char array:
ArrayList<String> arr = new ArrayList<String>();
emoji.codePoints().mapToObj(Character::toChars).map(String::new)
.forEach(arr::add);
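Here is a self-contained sketch of the same idea, plus the reverse direction the question was really after: turning a hex string into the character it names at run time. The class and method names are only illustrative, and fromHex assumes its argument is a valid hexadecimal code point:

```java
import java.util.List;
import java.util.stream.Collectors;

class CodePointStrings {
    // Splits a string into one String per code point (not per char).
    static List<String> split(String text) {
        return text.codePoints()
                .mapToObj(Character::toChars)
                .map(String::new)
                .collect(Collectors.toList());
    }

    // Builds the character for a hex code point such as "1f600".
    // Assumption: hex is a valid code point; otherwise an exception is thrown.
    static String fromHex(String hex) {
        return new String(Character.toChars(Integer.parseInt(hex, 16)));
    }

    public static void main(String[] args) {
        String emoji = fromHex("1f600");
        System.out.println(emoji.length());             // prints 2 (a surrogate pair)
        System.out.println(split("a" + emoji).size());  // prints 2 (two code points)
    }
}
```

Whether the console actually draws the emoji depends on its font and output encoding, which this sketch does not control.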
I need to read in bytes from a file, turn them into a string, do something with the string, then get the bytes back from the string, so I have the following code :
byte[] bFile=readFileBytes(filePath);
StringBuilder massageBuilder=new StringBuilder();
for (int i=0;i<bFile.length;i++) massageBuilder.append(bFile[i]);
String x=massageBuilder.charAt(n)+"";
...
byte b=x.getBytes();
But the last step doesn't give me back the byte. What's wrong? I want to get back the "massageBuilder.charAt(n)".
You can't get back to the original bytes given how you're adding them to your string builder.
Take this example:
byte[] bFile = "This is the input string".getBytes();
StringBuilder massageBuilder = new StringBuilder();
for (int i = 0; i < bFile.length; i++)
massageBuilder.append(bFile[i]);
When you print massageBuilder, you get
8410410511532105115321161041013210511011211711632115116114105110103
The result is a run of decimal digits with no way of telling where one byte's value ends and the next begins. Each input byte contributes between one and four characters to the string (for example 84 for 'T', or -61 for a negative byte). Even if you knew the character set of the original text, you'd still have trouble because of ambiguous sequences.
It might be possible if you used a delimiter of some sort...
massageBuilder.append(bFile[i]).append("-");
//84-104-105-115-32-105-115-32-116-104-101-32-105-110-112-117-116-...
In which case you can split by it and rebuild your byte array.
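A rough sketch of that round trip, with made-up class and method names. One deliberate change from the snippet above: the bytes are written as unsigned values (0 to 255), because a negative byte would otherwise print a minus sign that collides with the "-" delimiter:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DelimitedBytes {
    // Writes each byte as an unsigned decimal number, separated by '-'.
    static String encode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append('-');
            }
            sb.append(bytes[i] & 0xFF);
        }
        return sb.toString();
    }

    // Splits on '-' and narrows each number back to a byte.
    static byte[] decode(String text) {
        if (text.isEmpty()) {
            return new byte[0];
        }
        String[] parts = text.split("-");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = (byte) Integer.parseInt(parts[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] original = "This".getBytes(StandardCharsets.US_ASCII);
        String text = encode(original);
        System.out.println(text);                                   // prints 84-104-105-115
        System.out.println(Arrays.equals(original, decode(text)));  // prints true
    }
}
```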
I am currently reading in a UDP byte array that I know is a string and I know the MAXIMUM possible length of said string. So I print out a string (which is usually shorter than the max length). I am able to print it out but it prints out the text then junk characters. Is there a way to trim the junk binary data without knowing the actual length of the valid text?
String result = new String(input, Charset.forName("US-ASCII"));
I'll try to add detail for those asking for more data. Here is how the UDP message is read:
sock.receive(incoming);
byte[] data = incoming.getData();
String s = new String(data, 0, incoming.getLength());
The UDP message itself will contain a header of fixed size and then a set of data (max size of 1024 bytes). This data may be int, string, byte etc., as determined by the header data. So depending on the type, I chop the data into chunks of the appropriate size. The problem I am focusing on is the String type of data. I know that the max size of a string will be 128 bytes per string, so I read that amount in chunks as follows, where dataArray is the byte array:
for (int i = 0; i < msg.length; i = i + readSize)
{
dataArray = Arrays.copyOfRange(msg, i, i + readSize);
}
Then I use the original code in the first code set in this post to place the data into a string object. Thing is, the text that is usually sent is less than the 128 bytes allocated for max size. So when I print the string, I get the valid text and then whitespace and non-normal ascii characters (junk data). Hope this addition helps.
An example of the output is here. Everything up to the .mof is valid:
https://1drv.ms/i/s!Ai0t7Oj1PUFBpRP9K_2RlocAK4B7
Is there a way to trim the junk binary data without knowing the actual
length of the valid text?
Yes, you can simply call trim(); it will remove the trailing null characters. trim() removes all leading and trailing characters whose code is less than or equal to \u0020 (the space character), which includes \u0000 (the null character).
byte[] bytes = "foo bar".getBytes();
// Simulate message with a size bigger than the actual encoded String
byte[] msg = new byte[32];
System.arraycopy(bytes, 0, msg, 0, bytes.length);
// Decode the message
String result = new String(msg, Charset.forName("US-ASCII"));
// Trim the result
System.out.printf("Result: '%s'%n", result.trim());
Output:
Result: 'foo bar'
Ok here is how I was able to get it to work. It's a rather manual method but before using
String result = new String(input, Charset.forName("US-ASCII"));
to combine the byte array into a string, I looked at each byte and made sure it was within the printable range of 0x20 - 0x7e. If not, I replaced the value with a space (0x20). Then finished off with a .trim on the string.
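In case it helps someone, that manual approach could look roughly like this. The class and method names are invented for the example, and it works on a copy so the received buffer is left untouched:

```java
import java.nio.charset.StandardCharsets;

class PrintableOnly {
    // Replaces every byte outside the printable ASCII range 0x20..0x7e with a space,
    // decodes the result as US-ASCII, and trims the padding that is left over.
    static String clean(byte[] input) {
        byte[] copy = input.clone();
        for (int i = 0; i < copy.length; i++) {
            // Java bytes are signed, so values 0x80..0xff are negative and also fail this test.
            if (copy[i] < 0x20 || copy[i] > 0x7e) {
                copy[i] = 0x20;
            }
        }
        return new String(copy, StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) {
        byte[] msg = {'f', 'o', 'o', 0, (byte) 0xC3, 1, 0, 0};
        System.out.println("'" + clean(msg) + "'"); // prints 'foo'
    }
}
```

Note that junk in the middle of the text becomes a space rather than disappearing, so this only tidies the string; it does not recover the real length.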
How can I get the UTF8 code of a char in Java ?
I have the char 'a' and I want the value 97
I have the char 'é' and I want the value 233
here is a table for more values
I tried Character.getNumericValue(a) but for a it gives me 10 and not 97, any idea why?
This seems very basic but any help would be appreciated!
char is actually a numeric type containing the unicode value (UTF-16, to be exact - you need two chars to represent characters outside the BMP) of the character. You can do everything with it that you can do with an int.
Character.getNumericValue() tries to interpret the character as a digit.
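To make the difference concrete, here is a small sketch (the class and method names are only for illustration). For letters, getNumericValue returns the value the letter would have as a digit in a high radix, which is why 'a' gives 10:

```java
class NumericValueDemo {
    // The char's UTF-16 code unit as an int; a plain widening conversion.
    static int codeOf(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(Character.getNumericValue('7')); // prints 7
        System.out.println(Character.getNumericValue('a')); // prints 10
        System.out.println(codeOf('a'));                    // prints 97
        System.out.println(codeOf('\u00e9'));               // prints 233 (e-acute)
    }
}
```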
You can use the codePointAt(int index) method of java.lang.String for that. Here's an example:
"a".codePointAt(0) --> 97
"é".codePointAt(0) --> 233
If you want to avoid creating strings unnecessarily, the following works as well and can be used for char arrays:
Character.codePointAt(new char[] {'a'},0)
Those "UTF-8" codes are no such thing. They're actually just Unicode values, as per the Unicode code charts.
So an 'é' is actually U+00E9 - in UTF-8 it would be represented by two bytes { 0xc3, 0xa9 }.
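If you want to see those two bytes for yourself, a short sketch along these lines should do it (the names are illustrative; the character is written as an escape so the result does not depend on the source file encoding):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8Bytes {
    // Encodes the string as UTF-8 and returns each byte as an unsigned value.
    static int[] utf8(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            out[i] = bytes[i] & 0xFF;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf8("\u00e9"))); // prints [195, 169], i.e. 0xc3 0xa9
        System.out.println(Arrays.toString(utf8("a")));      // prints [97]
    }
}
```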
Now to get the Unicode value - or to be more precise the UTF-16 value, as that's what Java uses internally - you just need to convert the value to an integer:
char c = '\u00e9'; // c is now e-acute
int i = c; // i is now 233
This produces good result:
int a = 'a';
System.out.println(a); // outputs 97
Likewise:
System.out.println((int)'é');
prints out 233.
Note that both examples perform the same widening conversion from char to int, so they work for any char value, not only ASCII. A character outside the Basic Multilingual Plane does not fit in a single char and needs codePointAt instead. You can achieve the same result by multiplying the char by 1.
System.out.println( 1 * 'é');
Your question is unclear. Do you want the Unicode codepoint for a particular character (which is the example you gave), or do you want to translate a Unicode codepoint into a UTF-8 byte sequence?
If the former, then I recommend the code charts at http://www.unicode.org/
If the latter, then the following program will do it:
public class Foo
{
public static void main(String[] argv)
throws Exception
{
char c = '\u00E9';
ByteArrayOutputStream bos = new ByteArrayOutputStream();
OutputStreamWriter out = new OutputStreamWriter(bos, "UTF-8");
out.write(c);
out.flush();
byte[] bytes = bos.toByteArray();
for (int ii = 0 ; ii < bytes.length ; ii++)
System.out.println(bytes[ii] & 0xFF);
}
}
(there's also an online Unicode to UTF8 page, but I don't have the URL on this machine)
My method to do it is something like this:
char c = 'c';
int i = Character.codePointAt(String.valueOf(c), 0);
// testing
System.out.println(String.format("%c -> %d", c, i)); // c -> 99
You can create a simple loop to list the characters for a range of code values (these are Unicode code points, not UTF-8 bytes) like this:
public class UTF8Characters {
public static void main(String[] args) {
for (int i = 12; i <= 999; i++) {
System.out.println(i +" - "+ (char)i);
}
}
}
There is an open source library, MgntUtils, that has a utility class StringUnicodeEncoderDecoder. That class provides static methods that convert any String into a Unicode sequence and vice versa. Very simple and useful. To convert a String you just do:
String codes = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(myString);
For example a String "Hello World" will be converted into
"\u0048\u0065\u006c\u006c\u006f\u0020
\u0057\u006f\u0072\u006c\u0064"
It works with any language. Here is the link to the article that explains all the details about the library: MgntUtils. Look for the subtitle "String Unicode converter". The article gives you links to Maven Central, where you can get the artifacts, and to GitHub, where you can get the project itself. The library comes with well-written Javadoc and source code.
If I use any ASCII characters from 33 to 127, the codePointAt method gives the correct decimal value, for example:
String s1 = new String("#");
int val = s1.codePointAt(0);
This returns 35 which is the correct value.
But if I try use ASCII characters from 128 to 255 (extended ASCII/ISO-8859-1), this method gives wrong value, for example:
String s1 = new String("ƒ"); // Latin small letter f with hook
int val = s1.codePointAt(0);
This should return 159 as per this reference table, but instead returns 402. Why is this?
But if I try use ASCII characters from 128 to 255
ASCII doesn't have values in this range. It only uses 7 bits.
Java chars are UTF-16 (and nothing else!). If you want to represent ASCII using Java, you need to use a byte array.
The codePointAt method returns the 32-bit code point. 16-bit chars can't contain the entire Unicode range, so some code points must be split across two chars (as per the encoding scheme for UTF-16). The codePointAt method helps resolve chars to code points.
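As an illustration of that split, here is a sketch using U+1F600, a code point outside the 16-bit range that I picked only as an example (the class and method names are likewise made up):

```java
class SurrogateDemo {
    // Builds a string holding the single code point U+1F600.
    static String grin() {
        return new String(Character.toChars(0x1F600));
    }

    public static void main(String[] args) {
        String s = grin();
        System.out.println(s.length());                             // prints 2 (two chars)
        System.out.println(s.codePointCount(0, s.length()));        // prints 1 (one code point)
        System.out.println(Integer.toHexString(s.charAt(0)));       // prints d83d (high surrogate)
        System.out.println(Integer.toHexString(s.charAt(1)));       // prints de00 (low surrogate)
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // prints 1f600
    }
}
```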
I wrote a rough guide to encoding in Java here.
Java chars are not encoded in ISO-8859-1. They use UTF-16 which has the same values for 7bit ASCII characters (only values from 0-127).
To get the correct value for ISO-8859-1 you have to convert your string into a byte[] with String.getBytes("ISO-8859-1"); and look in the byte array.
Update
ISO-8859-1 is not the extended ASCII encoding, use String.getBytes("Cp437"); to get the correct values.
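A hedged sketch of that lookup follows. The class and method names are mine, and it assumes the string is non-empty. "IBM437" is the canonical name for the Cp437 charset; it is not one of the charsets every Java runtime must provide, so the sketch checks for it first and returns -1 when it is missing:

```java
import java.nio.charset.Charset;

class LegacyByteValue {
    // Returns the first encoded byte as an unsigned value,
    // or -1 if this runtime does not ship the requested charset.
    static int byteValue(String s, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        String fHook = "\u0192"; // Latin small letter f with hook
        System.out.println(byteValue(fHook, "IBM437"));     // 159 when the charset is installed
        System.out.println(byteValue(fHook, "ISO-8859-1")); // prints 63: not encodable, replaced by '?'
        System.out.println(fHook.codePointAt(0));           // prints 402
    }
}
```

The ISO-8859-1 line shows why that charset cannot give 159 here: the character simply does not exist in it, so getBytes substitutes a question mark.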
in Unicode
ƒ 0x0192 LATIN SMALL LETTER F WITH HOOK
String.codePointAt returns the Unicode-Codepoint at this specified index.
The Unicode-Codepoint of ƒ is 402, see
http://www.decodeunicode.org/de/u+0192/properties
So
System.out.println("ƒ".codePointAt(0));
printing 402 is correct.
If you are interested in the representation in other charsets, you can print out the byte representation of the character in each of them via getBytes(String charsetName):
final String s = "ƒ";
for (final String csName : Charset.availableCharsets().keySet()) {
try {
final Charset cs = Charset.forName(csName);
final CharsetEncoder encode = cs.newEncoder();
if (encode.canEncode(s))
{
System.out.println(csName + ": " + Arrays.toString(s.getBytes(csName)));
}
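} catch (final UnsupportedOperationException uoe) {
// this charset has no encoder (decode-only); skip it
} catch (final UnsupportedEncodingException e) {
// not expected for names taken from availableCharsets(); skip it
}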
}