Check if a String is valid UTF-8 encoded in Java

How can I check if a string is in valid UTF-8 format?

Only byte data can be checked. If you constructed a String, then it's already in UTF-16 internally.
Also, only byte arrays can be UTF-8 encoded.
Here is a common case of UTF-8 conversion.
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);
byte[] myBytes = null;
try
{
myBytes = myString.getBytes("UTF-8");
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
System.exit(-1);
}
for (int i=0; i < myBytes.length; i++) {
System.out.println(myBytes[i]);
}
If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.
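Since validity is a property of the bytes, one way to test it is to run the bytes through a strict UTF-8 decoder and see whether it reports an error. The following is a minimal sketch of that idea; the class name Utf8Check and the method name isValidUtf8 are made up for illustration and are not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Hypothetical helper: returns true only if the whole array decodes as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // 0xC3 announces a two-byte sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) would not help here, because it replaces malformed input instead of signalling it.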

The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.
The StringConverter program starts by creating a String containing
Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and
specify the appropriate encoding identifier as a parameter. The
getBytes method returns an array of bytes in UTF-8 format. To create a
String object from an array of non-Unicode bytes, invoke the String
constructor with the encoding parameter. The code that makes these
calls is enclosed in a try block, in case the specified encoding is
unsupported:
try {
byte[] utf8Bytes = original.getBytes("UTF8");
byte[] defaultBytes = original.getBytes();
String roundTrip = new String(utf8Bytes, "UTF8");
System.out.println("roundTrip = " + roundTrip);
System.out.println();
printBytes(utf8Bytes, "utf8Bytes");
System.out.println();
printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and
defaultBytes arrays to demonstrate an important point: The length of
the converted text might not be the same as the length of the source
text. Some Unicode characters translate into single bytes, others into
pairs or triplets of bytes.
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file,
UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
for (int k = 0; k < array.length; k++) {
System.out.println(name + "[" + k + "] = " + "0x" +
UnicodeFormatter.byteToHex(array[k]));
}
}
The output of the printBytes method follows. Note that only the first
and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
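The excerpt relies on UnicodeFormatter.byteToHex, whose source is not shown here. As a rough stand-in, the sketch below formats a byte as two hex digits and reproduces the length difference for the UTF-8 case. The class HexSketch and its byteToHex are assumptions for illustration, not the tutorial's actual code.

```java
import java.nio.charset.StandardCharsets;

final class HexSketch {
    // Assumed behavior: two lowercase hex digits per byte.
    // The mask turns the signed byte into its unsigned value 0..255 first.
    static String byteToHex(byte b) {
        return String.format("%02x", b & 0xFF);
    }

    public static void main(String[] args) {
        String original = "A\u00ea\u00f1\u00fcC";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println("chars = " + original.length() + ", utf8 bytes = " + utf8Bytes.length);
        for (int k = 0; k < utf8Bytes.length; k++) {
            System.out.println("utf8Bytes[" + k + "] = 0x" + byteToHex(utf8Bytes[k]));
        }
    }
}
```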

Related

Java - What is the proper way to convert a UTF-8 String to binary?

I'm using this code to convert a UTF-8 String to binary:
public String toBinary(String str) {
byte[] buf = str.getBytes(StandardCharsets.UTF_8);
StringBuilder result = new StringBuilder();
for (int i = 0; i < buf.length; i++) {
int ch = (int) buf[i];
String binary = Integer.toBinaryString(ch);
result.append(("00000000" + binary).substring(binary.length()));
result.append(' ');
}
return result.toString().trim();
}
Before I was using this code:
private String toBinary2(String str) {
StringBuilder result = new StringBuilder();
for (int i = 0; i < str.length(); i++) {
int ch = (int) str.charAt(i);
String binary = Integer.toBinaryString(ch);
if (ch<256)
result.append(("00000000" + binary).substring(binary.length()));
else {
binary = ("0000000000000000" + binary).substring(binary.length());
result.append(binary.substring(0, 8));
result.append(' ');
result.append(binary.substring(8));
}
result.append(' ');
}
return result.toString().trim();
}
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that is because the bytes of è are negative while the corresponding char is not (because char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is the correct one and why?
Thanks in advance.

Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit * below 256 in a single byte and all others in 2 bytes. Unfortunately that is not a useful encoding, since for decoding you'd have to know whether a single byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 solve this by marking, in the highest bits, which kind of unit it is).
tl;dr toBinary seems correct; toBinary2 will produce output that can't be uniquely decoded back to the original string.
* You might be wondering where the mention of UTF-16 comes from: that's because all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which just so happen to be equal to the Unicode code point for all characters that fit into the Basic Multilingual Plane).
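To illustrate why a real encoding matters, here is a small sketch that writes the UTF-8 bytes as 8-bit groups and decodes them back. The class BinaryRoundTrip and the method fromBinary are hypothetical names introduced for this example; masking with 0xFF is one way to sidestep the negative-byte issue raised in the question.

```java
import java.nio.charset.StandardCharsets;

final class BinaryRoundTrip {
    // Encodes text as space-separated 8-bit groups of its UTF-8 bytes.
    static String toBinary(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            // b & 0xFF yields 0..255, so toBinaryString gives at most 8 digits.
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int pad = bits.length(); pad < 8; pad++) sb.append('0');
            sb.append(bits);
        }
        return sb.toString();
    }

    // Reverses toBinary: parses each group as one byte, then decodes as UTF-8.
    static String fromBinary(String binary) {
        if (binary.isEmpty()) return "";
        String[] groups = binary.split(" ");
        byte[] bytes = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            bytes[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String binary = toBinary("\u00e8");
        System.out.println(binary);
        System.out.println(fromBinary(binary).equals("\u00e8"));
    }
}
```

Because every group is exactly one byte of a self-describing encoding, the decoder never has to guess where a character starts or ends.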

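This code snippet might help.
String s = "Some String";
// name the charset explicitly instead of relying on the platform default
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    // emit the 8 bits of this byte, most significant bit first
    for (int i = 0; i < 8; i++) {
        binary.append((val & 128) == 0 ? 0 : 1);
        val <<= 1;
    }
}
System.out.println(s + " to binary: " + binary);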

Is ISO-8859-1 encoding binary-safe in Java?

If I read a binary stream into a String using an ISO-8859-1 encoding, and subsequently convert it back to a binary stream, would I always get exactly the same bytes? And if not, when would I not get the same bytes?
public byte[] toStringAndBack(byte[] binaryData) throws Exception {
String s = new String(binaryData, "ISO-8859-1");
return s.getBytes("ISO-8859-1");
}
=== EDIT ===
Test:
byte[] d = {0, 1, 2, 3, 4, (byte)128, (byte)129, (byte)130}; // some not defined values
byte[] dd = toStringAndBack(d);
for (byte b : dd)
System.out.print((b&0xFF) + " ");
Output:
0 1 2 3 4 128 129 130
So even undefined bytes seem to be converted properly.

The constructor you're using says:
The behavior of this constructor when the given bytes are not valid in the given charset is unspecified.
So in theory it could fail for any value ISO-8859-1 doesn't assign characters to, such as 0-31 and 127-159.
That means even if it works on a given JVM's String implementation (or Charset implementation for ISO-8859-1), you cannot rely on it working on another JVM's String/Charset implementation (whether that's just a different dot-rev of a JVM from the same vendor, or a different vendor's JVM).
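If the unspecified behavior of the constructor is a concern, the Javadoc for String points to CharsetDecoder for cases that need more control. The sketch below uses a decoder and an encoder configured to report errors, so a byte or char that cannot be converted raises an exception instead of being handled in an implementation-specific way. The class name StrictLatin1 is an assumption made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictLatin1 {
    // Decodes bytes as ISO-8859-1, throwing if any input cannot be decoded.
    static String decode(byte[] data) throws CharacterCodingException {
        return StandardCharsets.ISO_8859_1.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    // Encodes a String as ISO-8859-1, throwing if a char has no mapping.
    static byte[] encode(String s) throws CharacterCodingException {
        ByteBuffer out = StandardCharsets.ISO_8859_1.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(s));
        byte[] bytes = new byte[out.remaining()];
        out.get(bytes);
        return bytes;
    }
}
```

With this variant a failed conversion is visible to the caller, which the plain constructor does not guarantee.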

Let's test it:
// all possible bytes
byte[] bin = new byte[256];
for (int i=0; i<bin.length; i++)
bin[i] = (byte)i;
// convert to string
String s = new String(bin, "ISO-8859-1");
for (int i=0; i<s.length(); i++)
{
if (s.charAt(i) != i)
System.out.println(i + " s[i]=" + s.charAt(i));
}
// convert back to byte[]
byte[] bout = s.getBytes("ISO-8859-1");
for (int i=0; i<bin.length; i++)
{
if (bin[i] != bout[i])
System.out.println(i + " in=" + bin[i] + " bout=" + bout[i]);
}
System.out.println("done");
It prints only done.
Therefore at least for the current ISO-8859-1 implementation the operations are binary safe as defined in the question.
EDIT:
the current implementation is sun.nio.cs.ISO_8859_1.
Looking at the source it only checks if a char is < 256 to decide if it can be encoded.

Convert byte to string in Java

I use the code below to convert a byte to a string:
System.out.println("string " + Byte.toString((byte)0x63));
Why does it print "string 99"?
How can I modify it so that it prints "string c"?

System.out.println(new String(new byte[]{ (byte)0x63 }, "US-ASCII"));
Note especially that converting bytes to Strings always involves an encoding. If you do not specify it, you'll be using the platform default encoding, which means the code can break when running in different environments.

The string ctor is suitable for this conversion:
System.out.println("string " + new String(new byte[] {0x63}));

Use char instead of byte:
System.out.println("string " + (char)0x63);
Or if you want to be a Unicode puritan, you use codepoints:
System.out.println("string " + new String(new int[]{ 0x63 }, 0, 1));
And if you like the old skool US-ASCII "every byte is a character" idea:
System.out.println("string " + new String(new byte[]{ (byte)0x63 },
StandardCharsets.US_ASCII));
Avoid using the String(byte[]) constructor recommended in other answers; it relies on the default charset. Circumstances could arise where 0x63 actually isn't the character c.

You can use printf:
System.out.printf("string %c\n", 0x63);
You can as well create a String with such formatting, using String#format:
String s = String.format("string %c", 0x63);

The character equivalent of 0x63 is 'c', but its byte value is 99. You can cast it to a char:
System.out.println("byte "+(char)0x63);

You have to construct a new string out of a byte array. The first element in your byteArray should be 0x63. If you want to add any more letters, make the byteArray longer and add them to the next indices.
byte[] byteArray = new byte[1];
byteArray[0] = 0x63;
try {
System.out.println("string " + new String(byteArray, "US-ASCII"));
} catch (UnsupportedEncodingException e) {
// TODO: Handle exception.
e.printStackTrace();
}
Note that the String(byte[], String) constructor declares the checked UnsupportedEncodingException, so you must handle it even for a charset such as US-ASCII that every Java platform supports.

If it's a single byte, just cast the byte to a char and it should work out to be fine i.e. give a char entity corresponding to the codepoint value of the given byte. If not, use the String constructor as mentioned elsewhere.
char ch = (char)0x63;
System.out.println(ch);

String str = "0x63";
int temp = Integer.parseInt(str.substring(2, 4), 16);
char c = (char)temp;
System.out.print(c);

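This is my version:
public String convertBytestoString(InputStream inputStream) throws IOException
{
    byte[] buffer = new byte[1024];
    // read() may fill only part of the buffer, and returns -1 at end of stream
    int bytes = inputStream.read(buffer);
    if (bytes <= 0) {
        return "";
    }
    // assumes UTF-8 input; only the first chunk of up to 1024 bytes is converted
    return new String(buffer, 0, bytes, StandardCharsets.UTF_8);
}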

Using StringBuilder class in Java:
StringBuilder str = new StringBuilder();
for (byte aByte : bytesArray) {
    if (aByte != 0) {
        // mask so that bytes above 127 are not sign-extended by the cast
        str.append((char) (aByte & 0xFF));
    } else {
        break;
    }
}

convert EBCDIC String to ASCII format?

I have a flat file that was pulled from a Db2 table. The flat file contains records in both char format and packed decimal format. How do I convert the packed data to a Java String? Is there any way to convert the entire flat file to ASCII format?

EBCDIC is a family of encodings. You'll need to know in more detail which EBCDIC encoding you're after.
Java has a number of supported encodings, including:
IBM500/Cp500 - EBCDIC 500V1
x-IBM834/Cp834 - IBM EBCDIC DBCS-only Korean (double-byte)
IBM1047/Cp1047 - Latin-1 character set for EBCDIC hosts
Try those and see what you get. Something like:
// BufferedReader supplies readLine(); InputStreamReader alone does not
BufferedReader rdr = new BufferedReader(new InputStreamReader(new FileInputStream("<your file>"), java.nio.charset.Charset.forName("IBM500")));
String line;
while ((line = rdr.readLine()) != null) {
    System.out.println(line);
}

Read the file as a String, write it as EBCDIC. Use OutputStreamWriter and InputStreamReader and give the encoding in the constructor.

Following from PAP, CP037 is US EBCDIC encoding.
Also have a look at JRecord Project. It allows you to read a file with either a Cobol or Xml description and will handle EBCDIC and Comp-3.
Finally, there is a routine to convert packed decimal bytes to a String: see the method getMainframePackedDecimal in Conversion.
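For readers unfamiliar with packed decimal (Comp-3), the sketch below shows roughly what such a conversion involves. It is not the JRecord routine. It assumes the common layout of two decimal digits per byte, with the low nibble of the last byte holding the sign (0xB or 0xD negative, anything else positive or unsigned). The class PackedDecimal, the method unpack and the scale parameter are hypothetical names.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

final class PackedDecimal {
    // Converts packed decimal bytes to a plain decimal string.
    // scale = number of digits after the implied decimal point (an assumption:
    // the real value comes from the record layout, e.g. a COBOL copybook).
    static String unpack(byte[] packed, int scale) {
        if (packed.length == 0) {
            throw new IllegalArgumentException("empty field");
        }
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < packed.length; i++) {
            int hi = (packed[i] & 0xF0) >>> 4;
            int lo = packed[i] & 0x0F;
            boolean last = (i == packed.length - 1);
            if (hi > 9 || (!last && lo > 9)) {
                throw new IllegalArgumentException("not a decimal digit at byte " + i);
            }
            digits.append((char) ('0' + hi));
            if (!last) {
                digits.append((char) ('0' + lo)); // the final low nibble is the sign
            }
        }
        int sign = packed[packed.length - 1] & 0x0F;
        BigDecimal value = new BigDecimal(new BigInteger(digits.toString()), scale);
        if (sign == 0x0D || sign == 0x0B) {
            value = value.negate();
        }
        return value.toPlainString();
    }

    public static void main(String[] args) {
        byte[] field = {0x12, 0x34, 0x5C}; // digits 12345, positive sign
        System.out.println(unpack(field, 2));
    }
}
```

Packed fields must be converted from the raw bytes like this; passing them through an EBCDIC charset decoder first would corrupt them, since they are not character data.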

Sharing a sample code by me for your reference:
package mypackage;
import java.io.UnsupportedEncodingException;
import java.math.BigInteger;
public class EtoA {
public static void main(String[] args) throws UnsupportedEncodingException {
System.out.println("########");
String edata = "/ÂÄÀ"; //Some EBCDIC string ==> here the OP can provide the content of flat file which the OP pulled from DB2 table
System.out.println("ebcdic source to ascii:");
System.out.println("ebcdic: " + edata);
String ebcdic_encoding = "IBM-1047"; //Setting the encoding in which the source was encoded
byte[] result = edata.getBytes(ebcdic_encoding); //Getting the raw bytes of the EBCDIC string by mentioning its encoding
String output = asHex(result); //Converting the raw bytes into hexadecimal format
byte[] b = new BigInteger(output, 16).toByteArray(); //Now its easy to convert it into another byte array (mentioning that this is of base16 since it is hexadecimal)
String ascii = new String(b, "ISO-8859-1"); //Now convert the modified byte array to normal ASCII string using its encoding "ISO-8859-1"
System.out.println("ascii: " + ascii); //This is the ASCII string which we can use universally in JAVA or wherever
//Inter conversions of similar type (ASCII to EBCDIC) are given below:
System.out.println("########");
String adata = "abcd";
System.out.println("ascii source to ebcdic:");
System.out.println("ascii: " + adata);
String ascii_encoding = "ISO-8859-1";
byte[] res = adata.getBytes(ascii_encoding);
String out = asHex(res);
byte[] bytebuff = new BigInteger(out, 16).toByteArray();
String ebcdic = new String(bytebuff, "IBM-1047");
System.out.println("ebcdic: " + ebcdic);
//Converting from hexadecimal string to EBCDIC if needed
System.out.println("########");
System.out.println("hexcode to ebcdic");
String hexinput = "81828384"; //Hexadecimal which we are converting to EBCDIC
System.out.println("hexinput: " + hexinput);
byte[] buffer = new BigInteger(hexinput, 16).toByteArray();
String eout = new String(buffer, "IBM-1047");
System.out.println("ebcdic out:" + eout);
//Converting from hexadecimal string to ASCII if needed
System.out.println("########");
System.out.println("hexcode to ascii");
String hexin = "61626364";
System.out.println("hexin: " + hexin);
byte[] buff = new BigInteger(hexin, 16).toByteArray();
String asciiout = new String(buff, "ISO-8859-1");
System.out.println("ascii out:" + asciiout);
}
//This asHex method converts the given byte array to a String of Hexadecimal equivalent
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}

How to convert Strings to and from UTF8 byte arrays in Java

In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?

Convert from String to byte[]:
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);
Convert from byte[] to String:
byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);
You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.

Here's a solution that avoids performing the Charset lookup for every conversion:
import java.nio.charset.Charset;
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
String decodeUTF8(byte[] bytes) {
return new String(bytes, UTF8_CHARSET);
}
byte[] encodeUTF8(String string) {
return string.getBytes(UTF8_CHARSET);
}

String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");

You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
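The multibyte pitfall can be shown with a short sketch. readAll decodes through a Reader, which carries partial sequences over between reads, while decodeChunkwise decodes each chunk of bytes on its own, which is the pattern warned against above. Both method names and the class StreamDecode are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

final class StreamDecode {
    // Decodes the whole stream through a Reader; the Reader buffers
    // incomplete multibyte sequences until the remaining bytes arrive.
    static String readAll(InputStream in, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[chunkSize];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Decodes each chunk independently; a character split across two
    // chunks is turned into replacement characters.
    static String decodeChunkwise(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u20acb"; // the euro sign takes three bytes in UTF-8
        byte[] data = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(data), 2).equals(text));
        System.out.println(decodeChunkwise(data, 2).equals(text));
    }
}
```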

My tomcat7 implementation is accepting strings as ISO-8859-1, despite the content-type of the HTTP request. The following solution worked for me when trying to correctly interpret characters like 'é'.
byte[] b1 = szP1.getBytes("ISO-8859-1");
System.out.println(java.util.Arrays.toString(b1)); // print the byte values, not the array's identity
String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);
When trying to interpret the string as US-ASCII, the byte info wasn't correctly interpreted.
b1 = szP1.getBytes("US-ASCII");
System.out.println(java.util.Arrays.toString(b1));

As an alternative, StringUtils from Apache Commons can be used.
byte[] bytes = {(byte) 1};
String convertedString = StringUtils.newStringUtf8(bytes);
or
String myString = "example";
byte[] convertedBytes = StringUtils.getBytesUtf8(myString);
If you have non-standard charset, you can use getBytesUnchecked() or newString() accordingly.

I can't comment but don't want to start a new thread. But this isn't working. A simple round trip:
byte[] b = new byte[]{ 0, 0, 0, -127 }; // 0x00000081
String s = new String(b,StandardCharsets.UTF_8); // UTF8 = 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8); // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081
I need b[] to be the same array before and after encoding, which it isn't (this refers to the first answer).
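The round trip above fails because the byte 0x81 on its own is not valid UTF-8, so the decoder substitutes U+FFFD, which then encodes as EF BF BD. When arbitrary binary data has to travel as text, a binary-to-text encoding is a common alternative. The sketch below uses java.util.Base64 for that; the class BinaryAsText and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BinaryAsText {
    // Represents arbitrary bytes as ASCII text without loss.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recovers the original bytes from the Base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        byte[] b = {0, 0, 0, -127};
        // Lossy: the invalid byte is replaced during decoding.
        byte[] viaUtf8 = new String(b, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(b, viaUtf8));
        // Lossless: Base64 maps every byte sequence to text and back.
        System.out.println(Arrays.equals(b, fromText(toText(b))));
    }
}
```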

For decoding a series of bytes to a normal string message I finally got it working with UTF-8 encoding with this code:
/* Convert a list of UTF-8 numbers to a normal String.
* Useful for decoding a JMS message that is delivered as a sequence of bytes instead of plain text.
*/
public String convertUtf8NumbersToString(String[] numbers){
int length = numbers.length;
byte[] data = new byte[length];
for(int i = 0; i< length; i++){
data[i] = Byte.parseByte(numbers[i]);
}
return new String(data, Charset.forName("UTF-8"));
}

If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format) then you don't have to create a new java.lang.String at all. It's much more performant to simply cast the byte into a char:
Full working example:
for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
    // mask first: byte is signed, so a plain (char) b would turn 215 into U+FFD7 instead of U+00D7
    char c = (char) (b & 0xFF);
    System.out.print(c);
}
If you are not using extended characters like Ä, Æ, Å, Ç, Ï, Ê and can be sure that the only transmitted values are among the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252).

Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"א\"}";
System.out.println(strISO);
byte[] b = strISO.getBytes(UTF8_CHARSET); // encode with the same charset that is used to decode below
for (byte c: b) {
System.out.print("[" + c + "]");
}
String str = new String(b, UTF8_CHARSET);
System.out.println(str);

Reader reader = new BufferedReader(
new InputStreamReader(
new ByteArrayInputStream(
string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));

//query is your json
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");
StringEntity input = new StringEntity(query, "UTF-8");
input.setContentType("application/json");
postRequest.setEntity(input);
HttpResponse response = httpClient.execute(postRequest);

terribly late but i just encountered this issue and this is my fix:
private static String removeNonUtf8CompliantCharacters( final String inString ) {
if (null == inString ) return null;
byte[] byteArr = inString.getBytes();
for ( int i=0; i < byteArr.length; i++ ) {
byte ch= byteArr[i];
// byte is signed, so this keeps only the values 32..127 plus tab, newline and carriage return;
// every other byte, including each byte of a non-ASCII character, is replaced with a space
if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
byteArr[i]=' ';
}
}
return new String( byteArr );
}
