convert EBCDIC String to ASCII format? - java

I am having a flat file which is pulled from a Db2 table ,the flat file contains records in both the char format as well as packed decimal format.how to convert the packed data to a java string.is there any way to convert the entire flat file to ASCII format.

EBCDIC is a family of encodings. You'll need to know more in details which EBCDIC encoding you're after.
Java has a number of supported encodings, including:
IBM500/Cp500 - EBCDIC 500V1
x-IBM834/Cp834 - IBM EBCDIC DBCS-only Korean (double-byte)
IBM1047/Cp1047 - Latin-1 character set for EBCDIC hosts
Try those and see what you get. Something like:
InputStreamReader rdr = new InputStreamReader(new FileInputStream(<your file>), java.nio.Charset.forName("ibm500"));
while((String line = rdr.readLine()) != null) {
System.out.println(line);
}

Read the file as a String, write it as EBCDIC. Use the OutputStreamWriter and InputStreamWriter and give the encoding in the constructor.

Following from PAP, CP037 is US EBCDIC encoding.
Also have a look at JRecord Project. It allows you to read a file with either a Cobol or Xml description and will handle EBCDIC and Comp-3.
Finally here is a routine to convert packed decimal bytes to String see method getMainframePackedDecimal in Conversion

Sharing a sample code by me for your reference:
package mypackage;
import java.io.UnsupportedEncodingException;
import java.math.BigInteger;
public class EtoA {
public static void main(String[] args) throws UnsupportedEncodingException {
System.out.println("########");
String edata = "/ÂÄÀ"; //Some EBCDIC string ==> here the OP can provide the content of flat file which the OP pulled from DB2 table
System.out.println("ebcdic source to ascii:");
System.out.println("ebcdic: " + edata);
String ebcdic_encoding = "IBM-1047"; //Setting the encoding in which the source was encoded
byte[] result = edata.getBytes(ebcdic_encoding); //Getting the raw bytes of the EBCDIC string by mentioning its encoding
String output = asHex(result); //Converting the raw bytes into hexadecimal format
byte[] b = new BigInteger(output, 16).toByteArray(); //Now its easy to convert it into another byte array (mentioning that this is of base16 since it is hexadecimal)
String ascii = new String(b, "ISO-8859-1"); //Now convert the modified byte array to normal ASCII string using its encoding "ISO-8859-1"
System.out.println("ascii: " + ascii); //This is the ASCII string which we can use universally in JAVA or wherever
//Inter conversions of similar type (ASCII to EBCDIC) are given below:
System.out.println("########");
String adata = "abcd";
System.out.println("ascii source to ebcdic:");
System.out.println("ascii: " + adata);
String ascii_encoding = "ISO-8859-1";
byte[] res = adata.getBytes(ascii_encoding);
String out = asHex(res);
byte[] bytebuff = new BigInteger(out, 16).toByteArray();
String ebcdic = new String(bytebuff, "IBM-1047");
System.out.println("ebcdic: " + ebcdic);
//Converting from hexadecimal string to EBCDIC if needed
System.out.println("########");
System.out.println("hexcode to ebcdic");
String hexinput = "81828384"; //Hexadecimal which we are converting to EBCDIC
System.out.println("hexinput: " + hexinput);
byte[] buffer = new BigInteger(hexinput, 16).toByteArray();
String eout = new String(buffer, "IBM-1047");
System.out.println("ebcdic out:" + eout);
//Converting from hexadecimal string to ASCII if needed
System.out.println("########");
System.out.println("hexcode to ascii");
String hexin = "61626364";
System.out.println("hexin: " + hexin);
byte[] buff = new BigInteger(hexin, 16).toByteArray();
String asciiout = new String(buff, "ISO-8859-1");
System.out.println("ascii out:" + asciiout);
}
//This asHex method converts the given byte array to a String of Hexadecimal equivalent
public static String asHex(byte[] buf) {
char[] HEX_CHARS = "0123456789abcdef".toCharArray();
char[] chars = new char[2 * buf.length];
for (int i = 0; i < buf.length; ++i) {
chars[2 * i] = HEX_CHARS[(buf[i] & 0xF0) >>> 4];
chars[2 * i + 1] = HEX_CHARS[buf[i] & 0x0F];
}
return new String(chars);
}
}

Related

How to convert codepoint of one charset to another in Java?

I am trying to convert codepoints from one charset to another in Java.
For example character ř is 248 in windows-1250, 345 in unicode.
So I have source charset and source codepoint and target charset and want to calculate target codepoint.
This may sound easy as windows-1250 is single byte,
but I want it to work on any charset, like GB2312.
I guess it can be done somehow with Charset class,
but it seems that it only converts bytes, not actual code points.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; //吧 chinese character
Charset targetCharset = Charset.forName("UTF-8");
int targetCodePoint = ...; //???
I checked Charset class for methods codepoint related, but there's only decode and encode, which works with bytes. I tried googling something related but without success.
Thanks in advance for any help.
At least in Java there is no notion of codepoints for character sets other than Unicode. You have to convert the integer to byte array and then to unicode.
Charset sourceCharset = Charset.forName("windows-1250");
int sourceCodePoint = 248; // ř
byte[] bytes = {(byte)sourceCodePoint};
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = ř
targetCodePoint = 345
Chinese characters in GB2312 are represented by 2 bytes, so you need to store them in a byte array of length 2.
Charset sourceCharset = Charset.forName("GB2312");
int sourceCodePoint = 45257; // 吧 chinese character
byte[] bytes = ByteBuffer.allocate(2).putShort((short)sourceCodePoint).array();
String targetString = new String(bytes, sourceCharset);
int targetCodePoint = targetString.codePointAt(0);
System.out.println("targetString = " + targetString);
System.out.println("targetCodePoint = " + targetCodePoint);
output:
targetString = 吧
targetCodePoint = 21543

Java JPEG hex to ASCII

I have to read JPEG pictures from a database. They are stored as Hex Strings.
For tests I have saved the original hex String to a file, opened it with Notepad++ and applied Convert -> HEX --> ASCII and saved the result. This is a valid JPEG and can be rendered in a browser.
I have tried to reproduce this in java.
private String hexToASCII(String hex) {
StringBuilder output = new StringBuilder();
for (int i = 0; i < hex.length(); i+=2) {
String str = hex.substring(i, i+2);
output.append((char)Integer.parseInt(str, 16));
}
return output.toString();
}
When I save the result to disk, the resulting file is no jpg
it begins with
ÿØÿà JFIF ÿÛ C
If have also tried to convert the original hex string with
https://www.rapidtables.com/convert/number/hex-to-ascii.html
this gives the same result as my code. The resulting file is no jpg.
What is Notepad++ doing and how can I reproduce this in Java?
Any advise would be greatly appreciated.
Do not convert the file contents to ASCII, UTF-8, ISO-8859-1 or other character encodings! JPEG files are binary files, and don't need encoding.
I suggest:
private void hexToBin(String hex, OutputStream stream) {
for (int i = 0; i < hex.length(); i += 2) {
String str = hex.substring(i, i + 2);
stream.write(Integer.parseInt(str, 16));
}
}
And perhaps:
private byte[] hexToBin(String hex) {
ByteArrayOutputStream stream = new ByteArrayOutputStream(hex.length() / 2);
hexToBin(hex, stream);
return stream.toByteArray();
}

Read and convert binary file to char file using Java

I'm trying to read a shortcod-file binary file that could be found here.
The method i'm using to print the content of this file:
public void read3RegularGraphs( String pathFile ) throws IOException {
InputStream reader = new FileInputStream(pathFile);
byte [] fileBytes = Files.readAllBytes(new File(pathFile).toPath());
char singleChar;
for(byte b : fileBytes) {
singleChar = (char) b;
System.out.print(singleChar);
}
}
Unfortunatly, I'm getting an incorrect output format, i'm getting symbols in place of chars.
How can I convert the binary content to character format.
Thank you
You need to pass Charset to use decoding. Char and byte are two different things
List<String> stringList = Files.readAllLines(new File(pathFile).toPath(), Charset.forName("UTF-8"));
When you convert among Char, Strings and byte arrays, declare explicitly the Charset
byte[] byteArray= stringTest.getBytes(Charset.forName("UTF-8"));
String stringTest = new String(byteArray, Charset.forName("UTF-8"));

Check if a String is valid UTF-8 encoded in Java

How can I check if a string is in valid UTF-8 format?
Only byte data can be checked. If you constructed a String then its already in UTF-16 internally.
Also only byte arrays can be UTF-8 encoded.
Here is a common case of UTF-8 conversions.
String myString = "\u0048\u0065\u006C\u006C\u006F World";
System.out.println(myString);
byte[] myBytes = null;
try
{
myBytes = myString.getBytes("UTF-8");
}
catch (UnsupportedEncodingException e)
{
e.printStackTrace();
System.exit(-1);
}
for (int i=0; i < myBytes.length; i++) {
System.out.println(myBytes[i]);
}
If you don't know the encoding of your byte array, juniversalchardet is a library to help you detect it.
The following post is taken from the official Java tutorials available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html.
The StringConverter program starts by creating a String containing
Unicode characters:
String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");
When printed, the String named original appears as:
AêñüC
To convert the String object to UTF-8, invoke the getBytes method and
specify the appropriate encoding identifier as a parameter. The
getBytes method returns an array of bytes in UTF-8 format. To create a
String object from an array of non-Unicode bytes, invoke the String
constructor with the encoding parameter. The code that makes these
calls is enclosed in a try block, in case the specified encoding is
unsupported:
try {
byte[] utf8Bytes = original.getBytes("UTF8");
byte[] defaultBytes = original.getBytes();
String roundTrip = new String(utf8Bytes, "UTF8");
System.out.println("roundTrip = " + roundTrip);
System.out.println();
printBytes(utf8Bytes, "utf8Bytes");
System.out.println();
printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
e.printStackTrace();
}
The StringConverter program prints out the values in the utf8Bytes and
defaultBytes arrays to demonstrate an important point: The length of
the converted text might not be the same as the length of the source
text. Some Unicode characters translate into single bytes, others into
pairs or triplets of bytes.
The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file,
UnicodeFormatter.java. Here is the printBytes method:
public static void printBytes(byte[] array, String name) {
for (int k = 0; k < array.length; k++) {
System.out.println(name + "[" + k + "] = " + "0x" +
UnicodeFormatter.byteToHex(array[k]));
}
}
The output of the printBytes method follows. Note that only the first
and last bytes, the A and C characters, are the same in both arrays:
utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43
defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43

How to convert Strings to and from UTF8 byte arrays in Java

In Java, I have a String and I want to encode it as a byte array (in UTF8, or some other encoding). Alternately, I have a byte array (in some known encoding) and I want to convert it into a Java String. How do I do these conversions?
Convert from String to byte[]:
String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);
Convert from byte[] to String:
byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);
You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.
Here's a solution that avoids performing the Charset lookup for every conversion:
import java.nio.charset.Charset;
private final Charset UTF8_CHARSET = Charset.forName("UTF-8");
String decodeUTF8(byte[] bytes) {
return new String(bytes, UTF8_CHARSET);
}
byte[] encodeUTF8(String string) {
return string.getBytes(UTF8_CHARSET);
}
String original = "hello world";
byte[] utf8Bytes = original.getBytes("UTF-8");
You can convert directly via the String(byte[], String) constructor and getBytes(String) method. Java exposes available character sets via the Charset class. The JDK documentation lists supported encodings.
90% of the time, such conversions are performed on streams, so you'd use the Reader/Writer classes. You would not incrementally decode using the String methods on arbitrary byte streams - you would leave yourself open to bugs involving multibyte characters.
My tomcat7 implementation is accepting strings as ISO-8859-1; despite the content-type of the HTTP request. The following solution worked for me when trying to correctly interpret characters like 'é' .
byte[] b1 = szP1.getBytes("ISO-8859-1");
System.out.println(b1.toString());
String szUT8 = new String(b1, "UTF-8");
System.out.println(szUT8);
When trying to interpret the string as US-ASCII, the byte info wasn't correctly interpreted.
b1 = szP1.getBytes("US-ASCII");
System.out.println(b1.toString());
As an alternative, StringUtils from Apache Commons can be used.
byte[] bytes = {(byte) 1};
String convertedString = StringUtils.newStringUtf8(bytes);
or
String myString = "example";
byte[] convertedBytes = StringUtils.getBytesUtf8(myString);
If you have non-standard charset, you can use getBytesUnchecked() or newString() accordingly.
I can't comment but don't want to start a new thread. But this isn't working. A simple round trip:
byte[] b = new byte[]{ 0, 0, 0, -127 }; // 0x00000081
String s = new String(b,StandardCharsets.UTF_8); // UTF8 = 0x0000, 0x0000, 0x0000, 0xfffd
b = s.getBytes(StandardCharsets.UTF_8); // [0, 0, 0, -17, -65, -67] 0x000000efbfbd != 0x00000081
I'd need b[] the same array before and after encoding which it isn't (this referrers to the first answer).
For decoding a series of bytes to a normal string message I finally got it working with UTF-8 encoding with this code:
/* Convert a list of UTF-8 numbers to a normal String
* Usefull for decoding a jms message that is delivered as a sequence of bytes instead of plain text
*/
public String convertUtf8NumbersToString(String[] numbers){
int length = numbers.length;
byte[] data = new byte[length];
for(int i = 0; i< length; i++){
data[i] = Byte.parseByte(numbers[i]);
}
return new String(data, Charset.forName("UTF-8"));
}
If you are using 7-bit ASCII or ISO-8859-1 (an amazingly common format) then you don't have to create a new java.lang.String at all. It's much much more performant to simply cast the byte into char:
Full working example:
for (byte b : new byte[] { 43, 45, (byte) 215, (byte) 247 }) {
char c = (char) b;
System.out.print(c);
}
If you are not using extended-characters like Ä, Æ, Å, Ç, Ï, Ê and can be sure that the only transmitted values are of the first 128 Unicode characters, then this code will also work for UTF-8 and extended ASCII (like cp-1252).
Charset UTF8_CHARSET = Charset.forName("UTF-8");
String strISO = "{\"name\":\"א\"}";
System.out.println(strISO);
byte[] b = strISO.getBytes();
for (byte c: b) {
System.out.print("[" + c + "]");
}
String str = new String(b, UTF8_CHARSET);
System.out.println(str);
Reader reader = new BufferedReader(
new InputStreamReader(
new ByteArrayInputStream(
string.getBytes(StandardCharsets.UTF_8)), StandardCharsets.UTF_8));
//query is your json
DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost postRequest = new HttpPost("http://my.site/test/v1/product/search?qy=");
StringEntity input = new StringEntity(query, "UTF-8");
input.setContentType("application/json");
postRequest.setEntity(input);
HttpResponse response=response = httpClient.execute(postRequest);
terribly late but i just encountered this issue and this is my fix:
private static String removeNonUtf8CompliantCharacters( final String inString ) {
if (null == inString ) return null;
byte[] byteArr = inString.getBytes();
for ( int i=0; i < byteArr.length; i++ ) {
byte ch= byteArr[i];
// remove any characters outside the valid UTF-8 range as well as all control characters
// except tabs and new lines
if ( !( (ch > 31 && ch < 253 ) || ch == '\t' || ch == '\n' || ch == '\r') ) {
byteArr[i]=' ';
}
}
return new String( byteArr );
}

Categories

Resources