We are facing a challenge reading COMP-3 data in Java embedded inside a Pentaho ETL. A few float values are stored as packed decimals in a flat file alongside other plain text. While the plain-text fields are read properly, the packed fields are not: we tried using Charset.forName("CP500"), but it never worked and we still get junk characters.
Since Pentaho scripts don't support COMP-3, their forums suggested going with a User Defined Java Class. Has anyone come across and solved such a problem?
Is it a Cobol file? Do you have a Cobol copybook?
Possible options include:
As Bill said, convert the Comp-3 to text on the source machine
Write your own conversion code
Use a library like JRecord (note: I am the author of JRecord)
Converting Comp-3
In Comp-3, the sign is held in the final half-byte (nibble):

Value    Comp-3 (signed)    Comp-3 (unsigned)    Zoned-Decimal
 123     x'123c'            x'123f'              "12C"
-123     x'123d'                                 "12L"
There is more than one way to convert a comp-3 to a decimal integer. One way is to:
Convert x'123c' -> String "123c"
Drop the last character and test it for the sign
Java code to convert Comp-3 from a byte array:
public static String getMainframePackedDecimal(final byte[] record,
                                               final int start,
                                               final int len) {
    String hex = getDecimal(record, start, start + len);
    //Long.toHexString(toBigInt(start, len).longValue());
    String ret = "";
    String sign = "";
    if (! "".equals(hex)) {
        switch (hex.substring(hex.length() - 1).toLowerCase().charAt(0)) {
        case 'd' : sign = "-";   // negative sign nibble; fall through to strip it
        case 'a' :
        case 'b' :
        case 'c' :
        case 'e' :
        case 'f' :
            ret = sign + hex.substring(0, hex.length() - 1);
            break;
        default:
            ret = hex;
        }
    }
    if ("".equals(ret)) {
        ret = "0";
    }
    return ret;
}
public static String getDecimal(final byte[] record, final int start, final int fin) {
    int i;
    String s;
    StringBuilder ret = new StringBuilder("");
    int b;
    for (i = start; i < fin; i++) {
        b = toPositiveByte(record[i]);
        s = Integer.toHexString(b);
        if (s.length() == 1) {
            ret.append('0');   // pad each byte to two hex digits
        }
        ret.append(s);
    }
    return ret.toString();
}

private static int toPositiveByte(final byte b) {
    return b & 0xFF;   // mask to the unsigned value 0..255
}
JRecord
In JRecord, if you have a Cobol copybook, there is:
Cobol2Csv - a program to convert a Cobol data file to CSV using a Cobol copybook
Data2Xml - convert a Cobol data file to XML using a Cobol copybook
Read a Cobol data file with a Cobol copybook
Read a fixed-width file with an XML description
Define the fields in Java
Reading with Cobol Copybook in JRecord
ICobolIOBuilder ioBldr = JRecordInterface1.COBOL
.newIOBuilder(copybookName)
.setDialect( ICopybookDialects.FMT_MAINFRAME)
.setFont("cp037")
.setFileOrganization(Constants.IO_FIXED_LENGTH)
.setDropCopybookNameFromFields(true);
AbstractLine saleRecord;
AbstractLineReader reader = ioBldr.newReader(salesFile);
while ((saleRecord = reader.read()) != null) {
....
}
reader.close();
Defining the File in Java with JRecord
AbstractLineReader reader = JRecordInterface1.FIXED_WIDTH.newIOBuilder()
.defineFieldsByLength()
.addFieldByLength("Sku" , Type.ftChar, 8, 0)
.addFieldByLength("Store", Type.ftNumRightJustified, 3, 0)
.addFieldByLength("Date" , Type.ftNumRightJustified, 6, 0)
.addFieldByLength("Dept" , Type.ftNumRightJustified, 3, 0)
.addFieldByLength("Qty" , Type.ftNumRightJustified, 2, 0)
.addFieldByLength("Price", Type.ftNumRightJustified, 6, 2)
.endOfRecord()
.newReader(this.getClass().getResource("DTAR020_tst1.bin.txt").getFile());
AbstractLine saleRecord;
while ((saleRecord = reader.read()) != null) {
}
Zoned Decimal
Another mainframe Cobol numeric format is zoned-decimal. It is a text format where the sign is over-punched on the last digit: in zoned-decimal, 123 is "12C" while -123 is "12L".
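As a rough illustration (my own sketch, not from the original answer), a zoned-decimal string can be decoded by mapping the over-punched last character back to a digit and a sign; the over-punch tables below are an assumption based on the usual EBCDIC sign characters:

```java
// Sketch only: decode an EBCDIC zoned-decimal string such as "12C" (+123) or "12L" (-123).
// POSITIVES/NEGATIVES are assumed to be the standard EBCDIC over-punch characters.
public class ZonedDecimal {
    private static final String POSITIVES = "{ABCDEFGHI"; // over-punch for +0 .. +9
    private static final String NEGATIVES = "}JKLMNOPQR"; // over-punch for -0 .. -9

    public static long decode(String s) {
        char last = s.charAt(s.length() - 1);
        long prefix = s.length() > 1 ? Long.parseLong(s.substring(0, s.length() - 1)) : 0;
        int p = POSITIVES.indexOf(last);
        if (p >= 0) return prefix * 10 + p;      // positive over-punch
        int n = NEGATIVES.indexOf(last);
        if (n >= 0) return -(prefix * 10 + n);   // negative over-punch
        return Long.parseLong(s);                // plain digits, no over-punch
    }
}
```

With these tables, decode("12C") gives 123 and decode("12L") gives -123.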
Hi, I am currently trying to get the bytes of strings 1 and 2 so I can XOR each byte of string 1 with a byte of string 2 (indexed modulo string 2's length), but I keep getting the error "String cannot be converted to bytes". This Java code actually came from a decompiled APK: the dex files were converted to a jar and then viewed in jd-gui, so I am wondering if some logic was lost in the conversion.
=================================================
public class A {
    // initializing
    public static String a(String paramString1, String paramString2)
    {
        // declaring string 1 and 2
        paramString1 = "i]\rD\004\025\027\004_~\002\006`HZ#UBY\\Ku\002O2\003_MQB\020\007G~\004Q";
        paramString2 = "prodkey";
        // getting the bytes of string 1 and 2
        // (the decompiler reassigned the String parameters here, which caused the
        // "String cannot be converted to bytes" error; they need to be byte[] variables)
        byte[] bytes1 = paramString1.getBytes();
        byte[] bytes2 = paramString2.getBytes();
        // XOR check while iterating through the string 1 byte array
        byte[] arrayOfByte = new byte[bytes1.length];
        int i = 0;
        while (i < bytes1.length)
        {
            arrayOfByte[i] = (byte) (bytes1[i] ^ bytes2[i % bytes2.length]);
            i += 1;
        }
        // after this I should be able to see the result in this format:
        // "1227a5aa-ce82-4da8-hFba6-c4hFe80hFba720"
        return new String(arrayOfByte);
    }
}
====================================
I also tried converting the code to Python, but the bytes() function just converts it to a UTF-8 b'string'.
After a week of work I designed a binary file format and made a Java reader for it. It's just an experiment, which works fine unless I use the GZip compression function.
I called my binary type MBDF (Minimal Binary Database Format), and it can store 8 different types:
Integer (There is nothing like a byte, short or long; integers are stored in a flexible amount of space, so bigger numbers take more space)
Float-32 (32-bits floating point format, like java's float type)
Float-64 (64-bits floating point format, like java's double type)
String (A string in UTF-16 format)
Boolean
Null (Just specifies a null value)
Array (Something like java's ArrayList<Object>)
Compound (A String - Object map)
I used this data as test data:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xml: STRING "two length compound"
int: INTEGER 23
}
string1: STRING "Hello world!"
string2: STRING "3"
arr1: ARRAY [
STRING "Hello world!"
INTEGER 3
STRING "3"
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING "one length compound"
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
The xml key in a compound does matter!!
I made a file from it using this java code:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(false)
);
Here, the variable b is a MBDFBinary object, containing all the data given above. With the makeMBDF function it generates the ISO 8859-1 encoded string and if the given boolean is true, it compresses the string using GZip. Then, when writing, an extra information character is added at the beginning of the file, containing information about how to read it back.
Then, after writing the file, I read it back into java and parse it
MBDF mbdf = MBDFFile.readMBDFFromFile("/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf");
System.out.println(mbdf.getBinaryObject().parse());
This prints exactly the information mentioned above.
Then I try to use compression:
MBDFFile.writeMBDFToFile(
"/Users/<anonymous>/Documents/Java/MBDF/resources/file.mbdf",
b.makeMBDF(true)
);
I do exactly the same to read it back as I did with the uncompressed file, which should work. It prints this information:
COMPOUND {
float1: FLOAT_32 3.3
bool2: BOOLEAN true
float2: FLOAT_64 3.3
int1: INTEGER 3
compound1: COMPOUND {
xUT: STRING 'two length compound'
int: INTEGER 23
}
string1: STRING 'Hello world!'
string2: STRING '3'
arr1: ARRAY [
STRING 'Hello world!'
INTEGER 3
STRING '3'
FLOAT_32 3.29
FLOAT_64 249.2992
BOOLEAN true
COMPOUND {
str: STRING 'one length compound'
}
BOOLEAN false
NULL null
]
bool1: BOOLEAN false
null1: NULL null
}
Comparing it to the initial information, the name xml changed into xUT for some reason...
After some research I found little differences in binary data between before the compression and after the compression. Such patterns as 110011 change into 101010.
When I make the name xml longer, like xmldm, it is just parsed as xmldm for some reason.
I currently saw the problem only occur on names with three characters.
Directly compressing and decompressing the generated string (without saving it to a file and reading that) does work, so maybe the bug is caused by the file encoding.
As far as I know, the string output is in ISO 8859-1 format, but I couldn't get the file encoding right. When a file is read, it is read as it has to be read, and all the characters are read as ISO 8859-1 characters.
I have some ideas about what could be the cause, but I don't know how to test them:
The GZip output has a different encoding than the uncompressed encoding, causing small differences while storing as a file.
The file is stored as UTF-8 format, just ignoring the order to be ISO 8859-1 encoding ( don't know how to explain :) )
There is a little bug in the java GZip libraries.
But which one is true? And if none of them is right, what is the real cause of this bug? I can't figure it out.
The MBDFFile class, reading and storing the files:
/* MBDFFile.java */
package com.redgalaxy.mbdf;
import java.io.*;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class MBDFFile {
public static MBDF readMBDFFromFile(String filename) throws IOException {
// FileInputStream is = new FileInputStream(filename);
// InputStreamReader isr = new InputStreamReader(is, "ISO-8859-1");
// BufferedReader br = new BufferedReader(isr);
//
// StringBuilder builder = new StringBuilder();
//
// String currentLine;
//
// while ((currentLine = br.readLine()) != null) {
// builder.append(currentLine);
// builder.append("\n");
// }
//
// builder.deleteCharAt(builder.length() - 1);
//
//
// br.close();
Path path = Paths.get(filename);
byte[] data = Files.readAllBytes(path);
return new MBDF(new String(data, "ISO-8859-1"));
}
private static void writeToFile(String filename, byte[] txt) throws IOException {
// BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
//// FileWriter writer = new FileWriter(filename);
// writer.write(txt.getBytes("ISO-8859-1"));
// writer.close();
// PrintWriter pw = new PrintWriter(filename, "ISO-8859-1");
FileOutputStream stream = new FileOutputStream(filename);
stream.write(txt);
stream.close();
}
public static void writeMBDFToFile(String filename, MBDF info) throws IOException {
writeToFile(filename, info.pack().getBytes("ISO-8859-1"));
}
}
The pack function generates the final string for the file, in ISO 8859-1 format.
For all the other code, see my MBDF Github repository.
I commented out the code I've already tried, to show what I attempted.
My workspace:
- Macbook Air '11 (High Sierra)
- IntelliJ IDEA Community 2017.3
- JDK 1.8
I hope this is enough information, this is actually the only way to make clear what I'm doing, and what exactly isn't working.
Edit: MBDF.java
/* MBDF.java */
package com.redgalaxy.mbdf;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
public class MBDF {
private String data;
private InfoTag tag;
public MBDF(String data) {
this.tag = new InfoTag((byte) data.charAt(0));
this.data = data.substring(1);
}
public MBDF(String data, InfoTag tag) {
this.tag = tag;
this.data = data;
}
public MBDFBinary getBinaryObject() throws IOException {
String uncompressed = data;
if (tag.isCompressed) {
uncompressed = GZipUtils.decompress(data);
}
Binary binary = getBinaryFrom8Bit(uncompressed);
return new MBDFBinary(binary.subBit(0, binary.getLen() - tag.trailing));
}
public static Binary getBinaryFrom8Bit(String s8bit) {
try {
byte[] bytes = s8bit.getBytes("ISO-8859-1");
return new Binary(bytes, bytes.length * 8);
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return new Binary(new byte[0], 0);
}
}
public static String get8BitFromBinary(Binary binary) {
try {
return new String(binary.getByteArray(), "ISO-8859-1");
} catch( UnsupportedEncodingException ignored ) {
// This is not gonna happen because encoding 'ISO-8859-1' is always supported.
return "";
}
}
/*
 * Adds leading zeroes to the binary string, so that the final amount of bits is 16 (or 8 when is16 is false)
 */
private static String addLeadingZeroes(String bin, boolean is16) {
int len = bin.length();
long amount = (long) (is16 ? 16 : 8) - len;
// Create zeroes and append binary string
StringBuilder zeroes = new StringBuilder();
for( int i = 0; i < amount; i ++ ) {
zeroes.append(0);
}
zeroes.append(bin);
return zeroes.toString();
}
public String pack(){
return tag.getFilePrefixChar() + data;
}
public String getData() {
return data;
}
public InfoTag getTag() {
return tag;
}
}
This class contains the pack() method. data is already compressed here (if it should be).
For other classes, please watch the Github repository, I don't want to make my question too long.
Solved it by myself!
It turned out to be the reading and writing system. When I exported a file, I made a string using the ISO-8859-1 table to turn bytes into characters, and then wrote that string to a text file, which was UTF-8. The big problem was that I used FileWriter instances to write it, which are meant for text files.
Reading used the inverse system: the complete file was read into memory as a string (memory consuming!!) and was then decoded.
I didn't realize that a file is just binary data, and that specific formats of that data form text; ISO-8859-1 and UTF-8 are two of those formats. I had problems with UTF-8 because it splits some characters into two bytes, which I couldn't manage...
My solution was to use streams. Java has FileInputStream and FileOutputStream, which can be used for reading and writing binary files. I hadn't used the streams because I thought there was no big difference ("files are text, so what's the problem?"), but there is... I implemented this (by writing a new, similar library) and I'm now able to pass any input stream to the decoder and any output stream to the encoder. To make uncompressed files, you pass a FileOutputStream. GZipped files can use a GZIPOutputStream wrapping a FileOutputStream. If someone wants a string with the binary data, a ByteArrayOutputStream can be used. The same rules apply to reading, where the InputStream variants of the mentioned streams should be used.
No UTF-8 or ISO-8859-1 problems anymore, and it seemed to work, even with GZip!
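A minimal sketch of that stream-based approach (class and method names here are illustrative, not from my library): wrap a FileOutputStream in a GZIPOutputStream for compressed output, and use the InputStream counterparts for reading, so raw bytes never pass through a String.

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class StreamDemo {
    // Write raw bytes to a GZip-compressed file.
    static void writeCompressed(File f, byte[] data) throws IOException {
        try (OutputStream out = new GZIPOutputStream(new FileOutputStream(f))) {
            out.write(data);
        }
    }

    // Read a GZip-compressed file back into a byte array.
    static byte[] readCompressed(File f) throws IOException {
        try (InputStream in = new GZIPInputStream(new FileInputStream(f));
             ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toByteArray();
        }
    }
}
```

Swapping GZIPOutputStream for a plain FileOutputStream (or a ByteArrayOutputStream) gives the uncompressed and in-memory variants described above.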
I use the beaglebuddy Java library in an Android project for reading/writing ID3 tags of mp3 files. I'm having an issue with reading the text that was previously written using the same library and could not find anything related in their docs.
Assume I write the following info:
MP3 mp3 = new MP3(pathToFile);
mp3.setLeadPerformer("Jon Skeet");
mp3.setTitle("A Million Rep");
mp3.save();
Looking at the source code of the library, I see that UTF-16 encoding is explicitly set, internally it calls
protected ID3v23Frame setV23Text(String text, FrameType frameType) {
return this.setV23Text(Encoding.UTF_16, text, frameType);
}
and
protected ID3v23Frame setV23Text(Encoding encoding, String text, FrameType frameType) {
ID3v23FrameBodyTextInformation frameBody = null;
ID3v23Frame frame = this.getV23Frame(frameType);
if(frame == null) {
frame = this.addV23Frame(frameType);
}
frameBody = (ID3v23FrameBodyTextInformation)frame.getBody();
frameBody.setEncoding(encoding);
frameBody.setText(encoding == Encoding.UTF_16?Utility.getUTF16String(text):text);
return frame;
}
At a later point, I read the data and it gives me some weird Chinese characters:
mp3.getLeadPerformer(); // 䨀漀渀 匀欀攀攀琀
mp3.getTitle(); // 䄀 䴀椀氀氀椀漀渀 刀攀瀀
I took a look at the built-in Utility.getUTF16String(String) method:
public static String getUTF16String(String string) {
String text = string;
byte[] bytes = string.getBytes(Encoding.UTF_16.getCharacterSet());
if(bytes.length < 2 || bytes[0] != -2 || bytes[1] != -1) {
byte[] bytez = new byte[bytes.length + 2];
bytes[0] = -2;
bytes[1] = -1;
System.arraycopy(bytes, 0, bytez, 2, bytes.length);
text = new String(bytez, Encoding.UTF_16.getCharacterSet());
}
return text;
}
I'm not quite getting the point of setting the first 2 bytes to -2 and -1 respectively, is this a pattern stating that the string is UTF-16 encoded?
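For reference, (byte) -2 and (byte) -1 are 0xFE and 0xFF, which together form the UTF-16 big-endian byte order mark (BOM). This quick check (my own, not from the library) shows Java's "UTF-16" charset producing the same two leading bytes:

```java
import java.nio.charset.StandardCharsets;

public class BomCheck {
    public static void main(String[] args) {
        // Java's "UTF-16" charset encodes big-endian with a leading BOM.
        byte[] bytes = "A".getBytes(StandardCharsets.UTF_16);
        // (byte) -2 == (byte) 0xFE and (byte) -1 == (byte) 0xFF
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]); // prints "FE FF"
    }
}
```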
However, I tried to explicitly call this method when reading the data; that makes the text readable, but it always prepends some cryptic characters at the start:
Utility.getUTF16String(mp3.getLeadPerformer()); // ��Jon Skeet
Utility.getUTF16String(mp3.getTitle()); // ��A Million Rep
Since the count of those characters seems to be constant, I created a temporary workaround by simply cutting them off.
Fields like "comments" where the author does not explicitly enforce UTF-16 when writing are read without any issues.
I'm really curious about what's going on here and appreciate any suggestions.
I have some problems compressing Excel files using the Huffman algorithm. My code seems to work with .txt files, but when I try to compress .xlsx (or older versions of Excel files) an error occurs.
First of all, I read my file like this:
File file = new File("fileName.xlsx");
byte[] dataOfFile = new byte[(int) file.length()];
DataInputStream dis = new DataInputStream(new FileInputStream(file));
dis.readFully(dataOfFile);
dis.close();
To check this (if everything seems OK) I use this code:
String entireFileText = new String(dataOfFile, "UTF-8");
for(int i=0;i<dataOfFile.length;i++)
{
System.out.print(dataOfFile[i]);
}
By doing this to a .txt file I get something like this (which seems to be OK):
"7210110810811132119111114108100331310721111193297114101321211111173"
But when I use this on .xlsx file I get this and I think the hyphen makes errors that might occur later in the compression:
"8075342006080003301165490-90122100-1245001908291671111101161011101169584121112101115934612010910832-944240-96020000000000000"... and so on
Anyway, by using a string I can map this into a HashMap, where I count the frequency of each character. I have a HashMap:
public static HashMap map;
public static boolean countHowOftenACharacterAppear(String s1) {
String s = s1;
for(int i = 0; i < s.length(); i++){
char c = s.charAt(i);
Integer val = map.get(new Character(c));
if(val != null){
map.put(c, new Integer(val + 1));
}
else{
map.put(c,1);
}
}
return true;
}
When I compress my string I use:
public static String compress(String s) {
String c = new String();
for(int i = 0; i < s.length(); i++)
c = c + fromCharacterToCode.get(s.charAt(i));
return c;
}
fromCharactertoCode is another HashMap of type :
public static HashMap fromCharacterToCode;
(I'm traversing through the table I've built. I don't think this is the problem.)
Anyway, the results from this using the .txt file is:
"01000110110111011011110001101110011011000001000000000"... (PERFECT)
From the .xlsx file:
"10101110110001110null0010000null0011000nullnullnull10110000null00001101011111" ...
I really don't get why I'm getting the null values for the .xlsx files. I would be very happy to get some help solving this. Many thanks!!
Your problem is java I/O, well before getting to compression.
First, you don't really need DataInputStream here, but leave that aside. You then convert to String entireFileText assuming the contents of the file is text in UTF-8, whereas data files like .xlsx aren't text at all and many text files even on Windows aren't UTF-8. But you don't seem to use entireFileText, so that may not matter. If you do, and the file isn't plain ASCII text, your compressor will "lose" chunks of it and the output of decompression will be only a fraction of the compression input; that is usually considered unsatisfactory.
Then you extract each byte from dataOfFile. byte in Java is signed; plain ASCII text files will have only "positive" bytes 0x00 to 0x7F (and usually all 0x20 to 0x7E plus 0x09 0x0D 0x0A), but everything else (UTF-8 text, UTF-16 text, data, and executables) will have "negative" bytes 0x80 to 0xFF which come out as -0x80 to -0x01.
Your printout "7210110810811132119111114108100331310721111193297114101321211111173" for "the .txt file" is almost certainly the byte sequence 72=H 101=e 108=l 108=l 111=o 32=space 119=w 111=o 114=r 108=l 100=d 33=! 13=CR 10=LF 72=H 111=o 119=w 32=space 97=a 114=r 101=e 32=space 121=y 111=o 117=u 3=(ETX aka ctrl-C) (how did you get a ctrl-C into a file?! or was it really 30=ctrl-Z? that's somewhat usual for Windows text files)
Someone more familiar with .xlsx format might be able to reconstruct that one, but I can tell you right off the hyphens are due to bytes with negative values, printed in decimal (by default) as -128 to -1.
For a general-purpose compressor, you shouldn't ever convert to Java chars and Strings; those are designed for text, and not all files are text. Just work with bytes, but if you want them consistently positive, mask with & 0xFF.
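To make that concrete, here is a sketch (mine, not the asker's code) of a byte-frequency count for Huffman coding done entirely on bytes, with the & 0xFF mask turning each signed byte into an array index 0..255:

```java
public class ByteFreq {
    // Count how often each of the 256 possible byte values occurs in the data.
    public static int[] byteFrequencies(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            freq[b & 0xFF]++; // mask to get the unsigned value 0..255
        }
        return freq;
    }
}
```

Building the Huffman table over these 256 counts avoids both the UTF-8 decoding step and the null lookups, since every possible input byte has a slot.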
I am trying to figure out how to create a .ov2 file to add POI data to a TomTom GPS device. The format of the data needs to be as follow:
An OV2 file consists of POI records. Each record has the following data format.
1 BYTE, char, POI status ('0' or '2')
4 BYTES, long, denotes length of the POI record.
4 BYTES, long, longitude * 100000
4 BYTES, long, latitude * 100000
x BYTES, string, label for POI, x == total length - (1 + 3 * 4)
Terminating null byte.
I found the following PHP code that is supposed to take a .csv file, go through it line by line, split each record, and write it into a new file in the proper format. I was hoping someone could help me translate this to Java. I really only need the line I marked with the '--->' arrow. I don't know PHP at all; everything other than that one line is basic enough that I can look at it and translate it, but I don't know what the PHP functions on that one line are doing. Even an explanation would help me figure it out in Java; if you can translate it directly, please do. Thanks.
<?php
$csv = file("File.csv");
$nbcsv = count($csv);
$file = "POI.ov2";
$fp = fopen($file, "w");
for ($i = 0; $i < $nbcsv; $i++) {
$table = split(",", chop($csv[$i]));
$lon = $table[0];
$lat = $table[1];
$des = $table[2];
--->$TT = chr(0x02).pack("V",strlen($des)+14).pack("V",round($lon*100000)).pack("V",round($lat*100000)).$des.chr(0x00);
#fwrite($fp, "$TT");
}
fclose($fp);
?>
Load a file into an array, where each element is a line from the file.
$csv = file("File.csv");
Count the number of elements in the array.
$nbcsv = count($csv);
Open output file for writing.
$file = "POI.ov2";
$fp = fopen($file, "w");
While $i < number of array items, $i++
for ($i = 0; $i < $nbcsv; $i++) {
Right-trim the line (remove trailing whitespace), and split the string on ','. $table is an array of values from the CSV line.
$table = split(",", chop($csv[$i]));
Assign component parts of the table to their own variables by numeric index.
$lon = $table[0];
$lat = $table[1];
$des = $table[2];
The tricky bit.
chr(02) is literally character code number 2.
pack is a binary processing function. It takes a format and some data.
V = unsigned long (always 32 bit, little endian byte order).
I'm sure you can work out the maths bits, but you need to convert them into little endian order 32 bit values.
. is a string concat operator.
Finally it is terminated with chr(0). Null char.
$TT = chr(0x02).
pack("V",strlen($des)+14).
pack("V",round($lon*100000)).
pack("V",round($lat*100000)).
$des.chr(0x00);
Write it out and close the file.
#fwrite($fp, "$TT");
}
fclose($fp);
The key in Java is to apply the proper byte order, ByteOrder.LITTLE_ENDIAN, to the ByteBuffer.
The whole function:
private static boolean getWaypoints(ArrayList<Waypoint> geopoints, File f)
{
    try {
        FileOutputStream fs = new FileOutputStream(f);
        for (int i = 0; i < geopoints.size(); i++)
        {
            fs.write((byte) 0x02);                              // POI status
            String desc = geopoints.get(i).getName();
            int poiLength = desc.length() + 14;                 // record length: 1 + 3 * 4 + label + null
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(poiLength).array());
            int lon = (int) Math.round((geopoints.get(i).getLongitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lon).array());
            int lat = (int) Math.round((geopoints.get(i).getLatitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lat).array());
            fs.write(desc.getBytes());                          // label
            fs.write((byte) 0x00);                              // terminating null byte
        }
        fs.close();
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}