I am trying to convert codepoints from one charset to another in Java.
For example character ř is 248 in windows-1250, 345 in unicode.
So I have source charset and source codepoint and target charset and want to calculate target codepoint.
This may sound easy as windows-1250 is single byte,
but I want it to work on any charset, like GB2312.
I guess it can be done somehow with Charset class,
but it seems that it only converts bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; //吧 chinese character
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked Charset class for methods codepoint related, but there's only decode and encode, which works with bytes. I tried googling something related but without success.
Thanks in advance for any help.
At least in Java there is no notion of codepoints for character sets other than Unicode. You have to convert the integer to byte array and then to unicode.
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 chinese character
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543
I have to read JPEG pictures from a database. They are stored as Hex Strings.
For tests I have saved the original hex String to a file, opened it with Notepad++ and applied Convert -> HEX --> ASCII and saved the result. This is a valid JPEG and can be rendered in a browser.
I have tried to reproduce this in java.
private String hexToASCII(String hex) {
StringBuilder output = new StringBuilder();
for (int i = 0; i < hex.length(); i+=2) {
String str = hex.substring(i, i+2);
output.append((char)Integer.parseInt(str, 16));
}
return output.toString();
}
When I save the result to disk, the resulting file is no jpg
it begins with
ÿØÿà JFIF ÿÛ C
If have also tried to convert the original hex string with
https://www.rapidtables.com/convert/number/hex-to-ascii.html
this gives the same result as my code. The resulting file is no jpg.
What is Notepad++ doing and how can I reproduce this in Java?
Any advise would be greatly appreciated.
Do not convert the file contents to ASCII, UTF-8, ISO-8859-1 or other character encodings! JPEG files are binary files, and don't need encoding.
I suggest:
private void hexToBin(String hex, OutputStream stream) {
for (int i = 0; i < hex.length(); i += 2) {
String str = hex.substring(i, i + 2);
stream.write(Integer.parseInt(str, 16));
}
}
And perhaps:
private byte[] hexToBin(String hex) {
ByteArrayOutputStream stream = new ByteArrayOutputStream(hex.length() / 2);
hexToBin(hex, stream);
return stream.toByteArray();
}
I'm trying to read a shortcod-file binary file that could be found here.
The method i'm using to print the content of this file:
public void read3RegularGraphs( String pathFile ) throws IOException {
InputStream reader = new FileInputStream(pathFile);
byte [] fileBytes = Files.readAllBytes(new File(pathFile).toPath());
char singleChar;
for(byte b : fileBytes) {
singleChar = (char) b;
System.out.print(singleChar);
}
}
Unfortunatly, I'm getting an incorrect output format, i'm getting symbols in place of chars.
How can I convert the binary content to character format.
Thank you
You need to pass Charset to use decoding. Char and byte are two different things
List<String> stringList = Files.readAllLines(new File(pathFile).toPath(), Charset.forName("UTF-8"));
When you convert among Char, Strings and byte arrays, declare explicitly the Charset
byte[] byteArray= stringTest.getBytes(Charset.forName("UTF-8"));
String stringTest = new String(byteArray, Charset.forName("UTF-8"));
How can I check if a string is in valid UTF-8 format?
Only byte data can be checked. If you constructed a String then its already in UTF-16 internally.
Also only byte arrays can be UTF-8 encoded.
Here is a common case of UTF-8 conversions.
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);
byte[] myBytes = null;
try
{
myBytes = myString.getBytes("UTF-8");
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
System.exit(-1);
}
for (int i=0; i < myBytes.length; i++) {
System.out.println(myBytes[i]);
}
If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.
The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.
The StringConverter program starts by creating a String containing
Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and
specify the appropriate encoding identifier as a parameter. The
getBytes method returns an array of bytes in UTF-8 format. To create a
String object from an array of non-Unicode bytes, invoke the String
constructor with the encoding parameter. The code that makes these
calls is enclosed in a try block, in case the specified encoding is
unsupported:
try {
byte[] utf8Bytes = original.getBytes("UTF8");
byte[] defaultBytes = original.getBytes();
String roundTrip = new String(utf8Bytes, "UTF8");
System.out.println("roundTrip = " + roundTrip);
System.out.println();
printBytes(utf8Bytes, "utf8Bytes");
System.out.println();
printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and
defaultBytes arrays to demonstrate an important point: The length of
the converted text might not be the same as the length of the source
text. Some Unicode characters translate into single bytes, others into
pairs or triplets of bytes.
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file,
UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
for (int k = 0; k < array.length; k++) {
System.out.println(name + "[" + k + "] = " + "0x" +
UnicodeFormatter.byteToHex(array[k]));
}
}
The output of the printBytes method follows. Note that only the first
and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
Convert from String to byte[]:
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);
Convert from byte[] to String:
byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);
You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.
Here's a solution that avoids performing the Charset lookup for every conversion:
import java.nio.charset.Charset;
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
String decodeUTF8(byte[] bytes) {
return new String(bytes, UTF8_CHARSET);
}
byte[] encodeUTF8(String string) {
return string.getBytes(UTF8_CHARSET);
}
String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");
You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
My tomcat7 implementation is accepting strings as ISO-8859-1; despite the content-type of the HTTP request. The following solution worked for me when trying to correctly interpret characters like 'é' .
byte[] b1 = szP1.getBytes("ISO-8859-1");
System.out.println(b1.toString());
String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);
When trying to interpret the string as US-ASCII, the byte info wasn't correctly interpreted.
b1 = szP1.getBytes("US-ASCII");
System.out.println(b1.toString());
As an alternative, StringUtils from Apache Commons can be used.
byte[] bytes = {(byte) 1};
String convertedString = StringUtils.newStringUtf8(bytes);
or
String myString = "example";
byte[] convertedBytes = StringUtils.getBytesUtf8(myString);
If you have non-standard charset, you can use getBytesUnchecked() or newString() accordingly.
I can't comment but don't want to start a new thread. But this isn't working. A simple round trip:
byte[] b = new byte[]{ 0, 0, 0, -127 }; // 0x00000081
String s = new String(b,StandardCharsets.UTF_8); // UTF8 = 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8); // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081
I'd need b[] the same array before and after encoding which it isn't (this referrers to the first answer).
For decoding a series of bytes to a normal string message I finally got it working with UTF-8 encoding with this code:
/* Convert a list of UTF-8 numbers to a normal String
* Usefull for decoding a jms message that is delivered as a sequence of bytes instead of plain text
*/
public String convertUtf8NumbersToString(String[] numbers){
int length = numbers.length;
byte[] data = new byte[length];
for(int i = 0; i< length; i++){
data[i] = Byte.parseByte(numbers[i]);
}
return new String(data, Charset.forName("UTF-8"));
}
If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format) then you don't have to create a new java.lang.String at all. It's much much more performant to simply cast the byte into char:
Full working example:
for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
char c = (char) b;
System.out.print(c);
}
If you are not using extended-characters like Ä, Æ, Å, Ç, Ï, Ê and can be sure that the only transmitted values are of the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252).
Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"א\"}";
System.out.println(strISO);
byte[] b = strISO.getBytes();
for (byte c: b) {
System.out.print("[" + c + "]");
}
String str = new String(b, UTF8_CHARSET);
System.out.println(str);
Reader reader = new BufferedReader(
new InputStreamReader(
new ByteArrayInputStream(
string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));
//query is your json
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");
StringEntity input = new StringEntity(query, "UTF-8");
input.setContentType("application/json");
postRequest.setEntity(input);
HttpResponse response=response = httpClient.execute(postRequest);
terribly late but i just encountered this issue and this is my fix:
private static String removeNonUtf8CompliantCharacters( final String inString ) {
if (null == inString ) return null;
byte[] byteArr = inString.getBytes();
for ( int i=0; i < byteArr.length; i++ ) {
byte ch= byteArr[i];
// remove any characters outside the valid UTF-8 range as well as all control characters
// except tabs and new lines
if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
byteArr[i]=' ';
}
}
return new String( byteArr );
}