How to convert a codepoint from one charset to another in Java?

I am trying to convert codepoints from one charset to another in Java.
For example, the character ř is 248 in windows-1250 and 345 in Unicode.
So I have a source charset and a source codepoint, plus a target charset, and I want to calculate the target codepoint.
This may sound easy since windows-1250 is single-byte,
but I want it to work with any charset, like GB2312.
I guess it can be done somehow with the Charset class,
but it seems that it only converts bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; //吧 chinese character
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked the Charset class for codepoint-related methods, but there are only decode and encode, which work with bytes. I tried googling for something related but without success.
Thanks in advance for any help.

At least in Java, there is no notion of codepoints for character sets other than Unicode. You have to convert the integer to a byte array and then decode those bytes to Unicode.
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 chinese character
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543
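If you need this to work for arbitrary charsets, the two snippets above can be folded into one helper. Here is a sketch (convertToUnicode is a made-up name; it assumes the "source codepoint" is simply the character's encoded bytes packed big-endian into an int, as in the examples above):
import java.nio.charset.Charset;

static int convertToUnicode(int sourceCodePoint, Charset sourceCharset) {
    // how many bytes the value occupies (at least one, at most four)
    int length = Math.max(1, 4 - Integer.numberOfLeadingZeros(sourceCodePoint) / 8);
    byte[] bytes = new byte[length];
    for (int i = 0; i < length; i++) {
        bytes[i] = (byte) (sourceCodePoint >>> (8 * (length - 1 - i))); // big-endian order
    }
    // decode the byte sequence and take the resulting Unicode code point
    return new String(bytes, sourceCharset).codePointAt(0);
}
With this, convertToUnicode(248, Charset.forName("windows-1250")) returns 345, and convertToUnicode(45257, Charset.forName("GB2312")) returns 21543.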

Related

Java - What is the proper way to convert a UTF-8 String to binary?

I'm using this code to convert a UTF-8 String to binary:
public String toBinary(String str) {
    byte[] buf = str.getBytes(StandardCharsets.UTF_8);
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < buf.length; i++) {
        int ch = (int) buf[i];
        String binary = Integer.toBinaryString(ch);
        // keep only the low 8 bits: for negative bytes toBinaryString
        // returns 32 digits, and this substring trick trims both cases to 8
        result.append(("00000000" + binary).substring(binary.length()));
        result.append(' ');
    }
    return result.toString().trim();
}
Before I was using this code:
private String toBinary2(String str) {
    StringBuilder result = new StringBuilder();
    for (int i = 0; i < str.length(); i++) {
        int ch = (int) str.charAt(i);
        String binary = Integer.toBinaryString(ch);
        if (ch < 256)
            result.append(("00000000" + binary).substring(binary.length()));
        else {
            binary = ("0000000000000000" + binary).substring(binary.length());
            result.append(binary.substring(0, 8));
            result.append(' ');
            result.append(binary.substring(8));
        }
        result.append(' ');
    }
    return result.toString().trim();
}
These two methods can return different results; for example:
toBinary("è") = "11000011 10101000"
toBinary2("è") = "11101000"
I think that's because the bytes of è are negative while the corresponding char is not (char is a 2-byte unsigned integer).
What I want to know is: which of the two approaches is correct, and why?
Thanks in advance.
Whenever you want to convert text into binary data (or into text representing binary data, as you do here) you have to use some encoding.
Your toBinary uses UTF-8 for that encoding.
Your toBinary2 uses something that's not a standard encoding: it encodes every UTF-16 code unit* below 256 in a single byte and all others in 2 bytes. Unfortunately that is not a useful encoding, since when decoding you would have to know whether a given byte stands alone or is part of a 2-byte sequence (UTF-8 and UTF-16 indicate which with their highest-level bits).
tl;dr toBinary seems correct, toBinary2 will produce output that can't uniquely be decoded back to the original string.
* You might be wondering where the mention of UTF-16 comes from: all String objects in Java are implicitly encoded in UTF-16. So if you use charAt you get UTF-16 code units (which just so happen to be equal to the Unicode code point for all characters that fit into the Basic Multilingual Plane).
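To make the ambiguity concrete, compare these two calls against the toBinary2 above:
// both print "00000001 00000000": the single character U+0100 is
// indistinguishable from the two-character sequence U+0001 U+0000
System.out.println(toBinary2("\u0100"));
System.out.println(toBinary2("\u0001\u0000"));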
This code snippet might help.
String s = "Some String";
byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
StringBuilder binary = new StringBuilder();
for (byte b : bytes) {
    int val = b;
    for (int i = 0; i < 8; i++) {
        binary.append((val & 128) == 0 ? 0 : 1); // test the top bit
        val <<= 1;                               // then shift left
    }
    binary.append(' ');
}
System.out.println(s + " to binary: " + binary);

Convert string to byte[] do an operation and back to byte[]

I'm converting an old VB.net project to Java (I barely know any VB).
Dim asciis As Byte() = System.Text.Encoding.ASCII.GetBytes(name)
For i As Int32 = 0 To asciis.Length - 1
asciis(i) = CByte(asciis(i) + 1)
Next
Dim encryptedName As String = StrReverse(Uri.EscapeDataString(System.Text.Encoding.ASCII.GetString(asciis, 0, asciis.Count())))
I converted it to:
byte[] asciis = name.getBytes();
for (int i = 0; i < asciis.length - 1; i++) {
    asciis[i] = (byte) (asciis[i] + 1);
}
String encryptedName = StringUtils.reverse(asciis.toString()).substring(0, asciis.length);
I converted the name 29384 and the .Net gives 594A3%3 while my Java code gives d9354.
What am I missing?
asciis.toString() is not correct: it gives you the array's type and identity hash code, not its contents. You need new String(asciis, StandardCharsets.UTF_8) to create the String from the array of bytes. You also need to apply URLEncoder.encode(newString, StandardCharsets.UTF_8.name()) to get the same URI encoding your VB code does. Finally, use name.getBytes(StandardCharsets.UTF_8) instead of plain name.getBytes(), because otherwise you get the default charset of the operating system it's running on, which might not be ASCII compatible.
Alright, as @Nyamiou said, I had to give the charset to the String and encode it with a URLEncoder.
byte[] asciis = number.getBytes(Charset.forName("US-ASCII"));
for (int i = 0; i < asciis.length; i++) {
    asciis[i] = (byte) (asciis[i] + 1);
}
String asciiString = new String(asciis, Charset.forName("US-ASCII"));
String encryptedNumber = StringUtils.reverse(URLEncoder.encode(asciiString, "US-ASCII"));
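With number = "29384" this now prints 594A3%3, matching the .Net output: adding 1 to each byte gives "3:495", URL-encoding turns the colon into %3A, and reversing "3%3A495" yields 594A3%3.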

Java convert String UTF-8 to UTF-16

I'm trying to convert String a = "try" to a UTF-16 String.
I did this:
try {
    String ulany = new String("357810087745445");
    System.out.println(ulany.getBytes().length);
    String string = new String(ulany.getBytes(), "UTF-16");
    System.out.println(string.getBytes().length);
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}
ulany.getBytes().length is 15, and string.getBytes().length is 24, but I think it should be 30. What did I do wrong?
String (and char) already hold Unicode, so for the String itself nothing is needed.
However, if you want bytes, binary data, in some encoding like UTF-16, you need a conversion:
ulany.getBytes("UTF-16") // big endian, preceded by a byte-order mark
ulany.getBytes("UTF-16LE") // little endian, without a byte-order mark
However, System.out uses the operating system's encoding, so one cannot just pick some different encoding for it.
In fact char is UTF-16 encoded.
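As a quick check of the byte counts involved (a sketch using StandardCharsets; note that Java's "UTF-16" encoder writes a byte-order mark):
String ulany = "357810087745445"; // 15 characters
System.out.println(ulany.getBytes(StandardCharsets.UTF_16).length);   // 32 = 2 BOM bytes + 15 * 2
System.out.println(ulany.getBytes(StandardCharsets.UTF_16BE).length); // 30, the number the OP expected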
What happens
//String ulany = new String("357810087745445");
String ulany = "357810087745445";
The String copy constructor is a leftover from Java's C++-influenced beginnings and is pointless here.
System.out.println(ulany.getBytes().length);
This behaves differently on different platforms, as getBytes() uses the default Charset. Better:
System.out.println(ulany.getBytes("UTF-8").length);
String string = new String(ulany.getBytes(), "UTF-16");
This interprets those bytes pairwise; having an odd count of 15 bytes is already wrong.
One gets 7 garbage characters plus a replacement character for the dangling byte, as the high bytes are not zero.
System.out.println(string.getBytes().length);
Now getting 24 means 8 chars at 3 bytes each. Hence the default platform encoding is probably UTF-8, which encodes each of these characters as a 3-byte sequence.
The string will contain something like:
String string = "\u3335\u3738\u3130\u3038\u3737\u3435\u3434\uFFFD";
You can also pass an encoding to getBytes(). For example:
String string = new String(ulany.getBytes("UTF-8"), "UTF-16");

Convert EBCDIC String to ASCII format?

I have a flat file which was pulled from a DB2 table. The flat file contains records in both char format and packed-decimal format. How do I convert the packed data to a Java String? Is there any way to convert the entire flat file to ASCII format?
EBCDIC is a family of encodings. You'll need to know more in details which EBCDIC encoding you're after.
Java has a number of supported encodings, including:
IBM500/Cp500 - EBCDIC 500V1
x-IBM834/Cp834 - IBM EBCDIC DBCS-only Korean (double-byte)
IBM1047/Cp1047 - Latin-1 character set for EBCDIC hosts
Try those and see what you get. Something like:
BufferedReader rdr = new BufferedReader(new InputStreamReader(
        new FileInputStream(<your file>), java.nio.charset.Charset.forName("IBM500")));
String line;
while ((line = rdr.readLine()) != null) {
    System.out.println(line);
}
Read the file as a String, write it as EBCDIC. Use OutputStreamWriter and InputStreamReader and give the encoding in the constructor.
Following on from PAP, CP037 is the US EBCDIC encoding.
Also have a look at the JRecord Project. It allows you to read a file with either a Cobol or XML description and will handle EBCDIC and COMP-3.
Finally, here is a routine to convert packed-decimal bytes to a String: see the method getMainframePackedDecimal in Conversion.
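For illustration, here is a minimal unpacker in that spirit. It is only a sketch: it assumes the standard COMP-3 layout of two decimal digits per byte with the sign in the low nibble of the last byte, and it ignores any implied decimal point:
// unpack COMP-3 (packed decimal) bytes into a digit string;
// the low nibble of the last byte is the sign: 0xC positive,
// 0xD negative, 0xF unsigned
static String unpackComp3(byte[] packed) {
    StringBuilder digits = new StringBuilder();
    for (int i = 0; i < packed.length; i++) {
        int hi = (packed[i] >> 4) & 0x0F;
        int lo = packed[i] & 0x0F;
        digits.append(hi);
        if (i < packed.length - 1) {
            digits.append(lo); // an ordinary digit nibble
        } else if (lo == 0x0D) {
            digits.insert(0, '-'); // the sign nibble
        }
    }
    return digits.toString();
}
For example, unpackComp3(new byte[]{0x12, 0x34, 0x5D}) returns "-12345".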
Sharing some sample code of mine for reference:
package mypackage;
import java.io.UnsupportedEncodingException;
public class EtoA {
    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println("########");
        String edata = "/ÂÄÀ"; // some EBCDIC string ==> here the OP can provide the content of the flat file pulled from the DB2 table
        System.out.println("ebcdic source to ascii:");
        System.out.println("ebcdic: " + edata);
        String ebcdic_encoding = "IBM-1047"; // the encoding in which the source was encoded
        byte[] result = edata.getBytes(ebcdic_encoding); // the raw bytes of the EBCDIC string
        String output = asHex(result); // the raw bytes as a hexadecimal string
        byte[] b = fromHex(output); // parse the hex string back into a byte array
        String ascii = new String(b, "ISO-8859-1"); // reinterpret the bytes via "ISO-8859-1" to get an ASCII-compatible string
        System.out.println("ascii: " + ascii); // this is the string we can use universally in Java or wherever

        // The inverse conversion (ASCII to EBCDIC) works the same way:
        System.out.println("########");
        String adata = "abcd";
        System.out.println("ascii source to ebcdic:");
        System.out.println("ascii: " + adata);
        String ascii_encoding = "ISO-8859-1";
        byte[] res = adata.getBytes(ascii_encoding);
        String out = asHex(res);
        byte[] bytebuff = fromHex(out);
        String ebcdic = new String(bytebuff, "IBM-1047");
        System.out.println("ebcdic: " + ebcdic);

        // Converting from a hexadecimal string to EBCDIC if needed
        System.out.println("########");
        System.out.println("hexcode to ebcdic");
        String hexinput = "81828384"; // hexadecimal which we are converting to EBCDIC
        System.out.println("hexinput: " + hexinput);
        byte[] buffer = fromHex(hexinput);
        String eout = new String(buffer, "IBM-1047");
        System.out.println("ebcdic out:" + eout);

        // Converting from a hexadecimal string to ASCII if needed
        System.out.println("########");
        System.out.println("hexcode to ascii");
        String hexin = "61626364";
        System.out.println("hexin: " + hexin);
        byte[] buff = fromHex(hexin);
        String asciiout = new String(buff, "ISO-8859-1");
        System.out.println("ascii out:" + asciiout);
    }

    // converts the given byte array to a String of its hexadecimal equivalent
    public static String asHex(byte[] buf) {
        char[] HEX_CHARS = "0123456789abcdef".toCharArray();
        char[] chars = new char[2 * buf.length];
        for (int i = 0; i < buf.length; ++i) {
            chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
            chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
        }
        return new String(chars);
    }

    // parses a hex string back into bytes, two digits at a time
    // (new BigInteger(hex, 16).toByteArray() is unsafe here: it prepends a
    // sign byte when the first digit is 8 or higher and drops leading zeros)
    public static byte[] fromHex(String hex) {
        byte[] buf = new byte[hex.length() / 2];
        for (int i = 0; i < buf.length; i++) {
            buf[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return buf;
    }
}

How best to convert a byte[] array to a string buffer

I have a number of byte[] array variables I need to convert to string buffers.
Is there a method for this type of conversion?
Thanks
Thank you all for your responses. However, I didn't make myself clear...
I'm using some byte[] arrays pre-defined as public static "under" the class declaration for my Java program. These "fields" are reused during the "life" of the process.
As the program issues status messages (written to a file), I've defined a string buffer (mesg_data) that is used to format a status message.
So as the program executes I tried msg2 = String(byte_array2) and I get a compiler error:
cannot find symbol
symbol : method String(byte[])
location: class APPC_LU62.java.LU62XnsCvr
convrsID = String(conversation_ID) ;
example:
public class LU62XnsCvr extends Object
.
.
static String convrsID ;
static byte[] conversation_ID = new byte[8] ;
So I can't use a "dynamic" definition of a string variable, because the same variable is used in multiple occurrences.
I hope I made myself clear.
Thanks ever so much
Guy
String s = new String(myByteArray, "UTF-8");
StringBuilder sb = new StringBuilder(s);
There is a String constructor that takes a byte array and an encoding:
byte[] bytes = new byte[200];
//...
String s = new String(bytes, "UTF-8");
In order to translate bytes to characters you need to specify an encoding: the scheme by which sequences of byte values (typically of length 1, 2, or 3) are mapped to characters. UTF-8 is probably the best default bet.
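A quick sketch of why the encoding has to match the one that produced the bytes:
byte[] bytes = "é".getBytes(StandardCharsets.UTF_8); // two bytes: C3 A9
String ok = new String(bytes, StandardCharsets.UTF_8);           // "é"
String garbled = new String(bytes, StandardCharsets.ISO_8859_1); // "Ã©" (mojibake)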
You can turn it into a String directly
byte[] bytearray
....
String mystring = new String(bytearray);
and then convert it to a StringBuffer
StringBuffer buffer = new StringBuffer(mystring);
You may use
str = new String(bytes)
By the way, what the code above does is create a Java String (i.e. UTF-16) using the default platform character encoding.
If the byte array was created from a string encoded in the platform default character encoding, this will work well.
If not, you need to specify the correct character encoding (Charset):
String str = new String(bytes, charset);
It depends entirely on the character encoding, but you want:
String value = new String(bytes, "US-ASCII");
This would work for US-ASCII values.
See Charset for other valid character encodings (e.g., UTF-8)
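Note that since Java 7 you can pass a Charset constant from java.nio.charset.StandardCharsets instead of a name, which avoids the checked UnsupportedEncodingException:
String value = new String(bytes, StandardCharsets.US_ASCII);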
