Iterating Text in hadoop

I am trying to iterate over a Hadoop Text and print its contents. This is the code I tried:
Text text = new Text();
text.set("Hadoop");
ByteBuffer buf = ByteBuffer.wrap(text.getBytes(), 0, text.getLength());
int cp = 0;
// -1 signals that no valid code point could be decoded
while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
    System.out.println(Integer.toHexString(cp));
}
This prints the code points. How can I print the characters instead?
EDIT
For the input "Hadoop", casting the int cp to a char inside the while loop works. But when the text is something like \u0041\u00DF\u6771\uD801\uDC00, the same code prints a couple of ? characters in the console. Is there a specific reason for this? Please suggest.

I guess the easiest way would be for you to just cast your ints to chars. Like so:
int[] chars = { 0x41, 0xdf, 0x6671, 0x10400 };
for (int c : chars) {
    String out = String.format("%d -> %s", c, (char) c);
    System.out.println(out);
}
My output is:
65 -> A
223 -> ß
26225 -> 晱
66560 -> Ѐ
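One caveat with the cast: a Java char is 16 bits wide, so casting a code point above U+FFFF drops the high bits. That is why 0x10400 comes out as Ѐ (U+0400) in the output above. A minimal sketch of an alternative follows, using Character.toChars from the standard library. The class and method names are made up for illustration.

```java
class CodePointText {
    // Converts one Unicode code point to a String. Supplementary code points
    // (above U+FFFF) become a surrogate pair instead of being truncated.
    static String fromCodePoint(int cp) {
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        int[] codePoints = { 0x41, 0xdf, 0x6771, 0x10400 };
        for (int cp : codePoints) {
            System.out.println(Integer.toHexString(cp) + " -> " + fromCodePoint(cp));
        }
    }
}
```

Whether the console then shows the character or a ? still depends on the console's encoding and font.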

Related

COMP-3 data unpacking in Java (Embedded in Pentaho)

We are facing a challenge reading COMP-3 data in Java embedded inside Pentaho ETL. A few float values are stored as packed decimals in a flat file alongside plain text. The plain text is read properly, but the packed fields are not: we tried Charset.forName("CP500"), and we still get junk characters.
Since Pentaho scripts don't support COMP-3, their forums suggest going with a User Defined Java Class. Could anyone help us if you have come across this and solved it?

Is it a Cobol file? Do you have a Cobol copybook?
Possible options include:
As Bill said, convert the Comp-3 to text on the source machine
Write your own conversion code
Use a library like JRecord. Note: I am the author of JRecord
Converting Comp-3
In Comp-3, values are stored as follows:
Value   Comp-3 (signed)   Comp-3 (unsigned)   Zoned-Decimal
123     x'123c'           x'123f' ??          "12C"
-123    x'123d'                               "12L"
There is more than one way to convert a Comp-3 to a decimal integer. One way is to:
Convert x'123c' to the String "123c"
Drop the last character and test it for the sign
Java code to convert Comp-3 (from a byte array):
public static String getMainframePackedDecimal(final byte[] record,
                                               final int start,
                                               final int len) {
    // Hex digits of the field, e.g. bytes 0x12 0x3c -> "123c"
    String hex = getDecimal(record, start, start + len);
    //Long.toHexString(toBigInt(start, len).longValue());
    String ret = "";
    String sign = "";
    if (! "".equals(hex)) {
        // The last hex digit is the sign nibble
        switch (hex.substring(hex.length() - 1).toLowerCase().charAt(0)) {
            case 'd' : sign = "-";   // negative; deliberately falls through
            case 'a' :
            case 'b' :
            case 'c' :
            case 'e' :
            case 'f' :
                ret = sign + hex.substring(0, hex.length() - 1);
                break;
            default:
                ret = hex;
        }
    }
    if ("".equals(ret)) {
        ret = "0";
    }
    return ret;
}
public static String getDecimal(final byte[] record, final int start, final int fin) {
    int i;
    String s;
    StringBuffer ret = new StringBuffer("");
    int b;
    for (i = start; i < fin; i++) {
        b = toPostiveByte(record[i]);
        s = Integer.toHexString(b);
        if (s.length() == 1) {
            ret.append('0');   // keep two hex digits per byte
        }
        ret.append(s);
    }
    return ret.toString();
}
// Not shown in the original answer; assumed to map a signed byte to 0..255
private static int toPostiveByte(final byte b) {
    return b & 0xFF;
}
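Since the question mentions float values, here is a rough sketch of unpacking a Comp-3 field straight into a BigDecimal. It assumes the number of implied decimal places (the scale parameter) is known from the copybook, for example 2 for PIC S9(3)V99. The class and method names are hypothetical. It also treats sign nibble B as negative alongside D, following the usual IBM convention, whereas the code above only checks for D.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class PackedDecimal {
    // Unpacks a Comp-3 field: two digits per byte, with the sign in the
    // low nibble of the last byte. 'scale' is the number of implied decimal
    // places (an assumption taken from the copybook, not stored in the data).
    static BigDecimal unpack(byte[] record, int start, int len, int scale) {
        StringBuilder digits = new StringBuilder();
        int last = start + len - 1;
        for (int i = start; i <= last; i++) {
            int b = record[i] & 0xFF;
            digits.append((char) ('0' + (b >> 4)));        // high nibble
            if (i < last) {
                digits.append((char) ('0' + (b & 0x0F)));  // low nibble
            }
        }
        int signNibble = record[last] & 0x0F;
        // Throws NumberFormatException if a digit nibble is not 0..9
        BigInteger unscaled = new BigInteger(digits.toString());
        if (signNibble == 0x0D || signNibble == 0x0B) {
            unscaled = unscaled.negate();
        }
        return new BigDecimal(unscaled, scale);
    }
}
```

For example, the bytes 0x12 0x34 0x5c with a scale of 2 unpack to 123.45.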
JRecord
In JRecord, if you have a Cobol Copybook, there is:
Cobol2Csv, a program to convert a Cobol data file to CSV using a Cobol Copybook
Data2Xml, to convert a Cobol data file to XML using a Cobol Copybook
Read Cobol-Data File with a Cobol Copybook.
Read a Fixed width file with a Xml Description
Define the Fields in Java
Reading with Cobol Copybook in JRecord
ICobolIOBuilder ioBldr = JRecordInterface1.COBOL
        .newIOBuilder(copybookName)
        .setDialect( ICopybookDialects.FMT_MAINFRAME)
        .setFont("cp037")
        .setFileOrganization(Constants.IO_FIXED_LENGTH)
        .setDropCopybookNameFromFields(true);
AbstractLine saleRecord;
AbstractLineReader reader = ioBldr.newReader(salesFile);
while ((saleRecord = reader.read()) != null) {
    ....
}
reader.close();
Defining the File in Java with JRecord
AbstractLineReader reader = JRecordInterface1.FIXED_WIDTH.newIOBuilder()
        .defineFieldsByLength()
            .addFieldByLength("Sku"  , Type.ftChar, 8, 0)
            .addFieldByLength("Store", Type.ftNumRightJustified, 3, 0)
            .addFieldByLength("Date" , Type.ftNumRightJustified, 6, 0)
            .addFieldByLength("Dept" , Type.ftNumRightJustified, 3, 0)
            .addFieldByLength("Qty"  , Type.ftNumRightJustified, 2, 0)
            .addFieldByLength("Price", Type.ftNumRightJustified, 6, 2)
        .endOfRecord()
        .newReader(this.getClass().getResource("DTAR020_tst1.bin.txt").getFile());
AbstractLine saleRecord;
while ((saleRecord = reader.read()) != null) {
}
Zoned Decimal
Another mainframe Cobol numeric format is Zoned-Decimal. It is a text format where the sign is over-typed (overpunched) on the last digit. In zoned decimal, 123 is "12C" while -123 is "12L".
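As a rough illustration, a zoned-decimal value that has already been converted to text could be decoded as sketched below. This assumes the common EBCDIC-to-ASCII translation in which positive overpunches appear as { and A to I, and negative ones as } and J to R. The class and method names are hypothetical.

```java
class ZonedDecimal {
    // Decodes a zoned-decimal string that is already text,
    // e.g. "12C" -> 123 and "12L" -> -123.
    static long parse(String text) {
        int last = text.length() - 1;
        char c = text.charAt(last);
        int digit;
        boolean negative = false;
        if (c >= '0' && c <= '9') {          // plain digit, no overpunch
            digit = c - '0';
        } else if (c == '{') {               // +0
            digit = 0;
        } else if (c >= 'A' && c <= 'I') {   // +1 .. +9
            digit = c - 'A' + 1;
        } else if (c == '}') {               // -0
            digit = 0;
            negative = true;
        } else if (c >= 'J' && c <= 'R') {   // -1 .. -9
            digit = c - 'J' + 1;
            negative = true;
        } else {
            throw new NumberFormatException("Not a zoned-decimal sign: " + c);
        }
        // Re-attach the decoded last digit to the leading digits
        long value = Long.parseLong(text.substring(0, last) + digit);
        return negative ? -value : value;
    }
}
```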

Parsing A Text File, So That Every Line Is Stored As An Array Value

Basically, I want to parse a text file line by line, so that every line is in its own array element.
E.g.
Hi There,
My Name's Aiden,
Not Really.
Array[0] = "Hi There"
Array[1] = "My Name's Aiden"
Array[2] = "Not Really"
But all the examples I have read so far just confuse and frustrate me. Maybe it's the way I approach it.
I don't know how to go about it. A point in the right direction would be most appreciated.

My suggestion is to use List<String> instead of String[] as arrays have fixed size, and that size is unknown before reading. Afterward one could make an array out of it, but to no real purpose.
For reading one has to know the encoding of the file.
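import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
Path path = Paths.get("C:/Users/Me/list.txt");
//Charset encoding = StandardCharsets.UTF_8;
Charset encoding = Charset.defaultCharset();
List<String> lines = Files.readAllLines(path, encoding);
// Iterate over the lines...
for (String line : lines) {
    ...
}
// ...or walk them by index, e.g. to modify each line in place
for (int i = 0; i < lines.size(); ++i) {
    String line = lines.get(i);
    lines.set(i, "-- " + line);
}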
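If an actual String[] is still wanted, as the question asks, the list can be converted once reading is done. A minimal sketch, where the class and method names are hypothetical and UTF-8 is an assumed encoding:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class LinesToArray {
    // Reads the whole file and returns one array element per line.
    static String[] readLines(Path path) throws IOException {
        List<String> lines = Files.readAllLines(path, StandardCharsets.UTF_8);
        return lines.toArray(new String[0]);
    }
}
```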

Print all Unicode characters within a specific range

I can't find the right API for this. I tried this:
public static void main(String[] args) {
    for (int i = 2309; i < 3000; i++) {
        String hex = Integer.toHexString(i);
        System.out.println(hex + " = " + (char) i);
    }
}
In the Eclipse IDE, this code only prints the following:
905 = ?
906 = ?
907 = ?
...
How can I make use of these decimal and hex values to get the Unicode characters?

It prints like that because all consoles use a monospaced font. Try it on a JLabel in a frame and it should display fine.
EDIT:
Try creating a Unicode PrintStream:
PrintStream out = new PrintStream(System.out, true, "UTF-8");
And then print to it.

I had forgotten to save the file in UTF-8 format. I changed the encoding under
File > Properties > Select the text file encoding
After that, the right characters print properly in the Eclipse console. The default is cp1252, which prints only ? for the characters it does not understand.
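Putting the two answers together, a small sketch of the original loop writing through a UTF-8 PrintStream might look like this. The class and method names are made up for illustration, and the console itself still has to be set to UTF-8 with a font that covers the range.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class UnicodeRange {
    // Formats one output line: the hex value, then the character itself.
    // Character.toChars also copes with code points above U+FFFF.
    static String describe(int codePoint) {
        return Integer.toHexString(codePoint) + " = " + new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Encode the output as UTF-8 instead of the platform default (e.g. cp1252)
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (int i = 2309; i < 3000; i++) {
            out.println(describe(i));
        }
    }
}
```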

Generating a .ov2 file with Java

I am trying to figure out how to create a .ov2 file to add POI data to a TomTom GPS device. The format of the data needs to be as follows:
An OV2 file consists of POI records. Each record has the following data format.
1 BYTE, char, POI status ('0' or '2')
4 BYTES, long, denotes length of the POI record.
4 BYTES, long, longitude * 100000
4 BYTES, long, latitude * 100000
x BYTES, string, label for POI, x == total length - (1 + 3 * 4)
Terminating null byte.
I found the following PHP code that is supposed to take a .csv file, go through it line by line, split each record and then write it into a new file in the proper format. I was hoping someone would be able to help me translate this to Java. I really only need the line I marked with the '--->' arrow. I do not know PHP at all, but everything other than that one line is basic enough that I can look at it and translate it, but I do not know what the PHP functions are doing on that one line. Even if someone could explain it well enough then maybe I could figure it out in Java. If you can translate it directly, please do, but even an explanation would be helpful. Thanks.
<?php
$csv = file("File.csv");
$nbcsv = count($csv);
$file = "POI.ov2";
$fp = fopen($file, "w");
for ($i = 0; $i < $nbcsv; $i++) {
    $table = split(",", chop($csv[$i]));
    $lon = $table[0];
    $lat = $table[1];
    $des = $table[2];
--->$TT = chr(0x02).pack("V",strlen($des)+14).pack("V",round($lon*100000)).pack("V",round($lat*100000)).$des.chr(0x00);
    #fwrite($fp, "$TT");
}
fclose($fp);
?>

Load a file into an array, where each element is a line from the file.
$csv = file("File.csv");
Count the number of elements in the array.
$nbcsv = count($csv);
Open output file for writing.
$file = "POI.ov2";
$fp = fopen($file, "w");
Loop while $i is less than the number of array items, incrementing $i each time.
for ($i = 0; $i < $nbcsv; $i++) {
Right trim the line (remove trailing whitespace), and split the string by ','. $table is an array of values from the csv line.
$table = split(",", chop($csv[$i]));
Assign component parts of the table to their own variables by numeric index.
$lon = $table[0];
$lat = $table[1];
$des = $table[2];
The tricky bit.
chr(0x02) is literally character code number 2.
pack is a binary processing function. It takes a format and some data.
V = unsigned long (always 32 bit, little endian byte order).
I'm sure you can work out the maths bits, but you need to convert them into little endian order 32 bit values. The + 14 covers the fixed parts of the record: 1 status byte, 3 * 4 bytes of integers and the terminating null.
. is the string concatenation operator.
Finally it is terminated with chr(0), the null char.
$TT = chr(0x02).
pack("V",strlen($des)+14).
pack("V",round($lon*100000)).
pack("V",round($lat*100000)).
$des.chr(0x00);
Write it out and close the file.
#fwrite($fp, "$TT");
}
fclose($fp);
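The tricky line can be sketched in Java by building one record in a single little-endian ByteBuffer, where putInt plays the role of pack("V", ...). This is only an illustration: the class and method names are hypothetical, and encoding the label as ISO-8859-1 is an assumption, since the format description above does not name a charset.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class Ov2Record {
    // Builds one POI record: status byte, record length, longitude, latitude,
    // label bytes and a terminating null. Integers are written little-endian.
    static byte[] encode(double lon, double lat, String label) {
        byte[] text = label.getBytes(StandardCharsets.ISO_8859_1); // charset is an assumption
        int total = text.length + 14; // 1 status + 3 * 4 ints + label + 1 null byte
        ByteBuffer buf = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        buf.put((byte) 0x02);
        buf.putInt(total);
        buf.putInt((int) Math.round(lon * 100000));
        buf.putInt((int) Math.round(lat * 100000));
        buf.put(text);
        buf.put((byte) 0x00);
        return buf.array();
    }
}
```

The returned array can be written to the output file as-is, one call per CSV line.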

The key in Java is to set the byte order of the ByteBuffer to ByteOrder.LITTLE_ENDIAN.
The whole function:
import java.io.File;
import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
private static boolean getWaypoints(ArrayList<Waypoint> geopoints, File f)
{
    // try-with-resources closes the stream even if a write fails
    try (FileOutputStream fs = new FileOutputStream(f))
    {
        for (int i = 0; i < geopoints.size(); i++)
        {
            fs.write((byte) 0x02);
            // The record length counts bytes, so measure the encoded label, not the String
            byte[] desc = geopoints.get(i).getName().getBytes();
            int poiLength = desc.length + 14;
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(poiLength).array());
            int lon = (int) Math.round((geopoints.get(i).getLongitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lon).array());
            int lat = (int) Math.round((geopoints.get(i).getLatitudeE6() / 1E6) * 100000);
            fs.write(ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN).putInt(lat).array());
            fs.write(desc);
            fs.write((byte) 0x00);
        }
        return true;
    }
    catch (Exception e)
    {
        return false;
    }
}

Nibble Hex from Java to PHP

I'm translating an app from Java to PHP and I'm running into some trouble.
I have a string like this: 98191107990D0000EF050000789C65970BCCD75318C7CFEFFC ... In Java there is this function, to which I pass the string as a parameter:
private static byte[] decodeNibbleHex(String input)
{
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    char[] chars = input.toCharArray();
    for (int i = 0; i < chars.length - 1; i += 2) {
        char[] bChars = new char[2];
        bChars[0] = chars[i];
        bChars[1] = chars[(i + 1)];
        int val = Integer.decode("0x" + new String(bChars)).intValue();
        baos.write((byte)val);
    }
    return baos.toByteArray();
}
But I don't know how to translate this function to PHP. I have tried many times and it is driving me crazy! I tried a for loop with eval("\$hex = 0x" . $dati[$i].$dati[$i+1] . ";"); and with $binary_string = pack("h*" , $dati[$i].$dati[$i+1]); and many other functions...
If someone understands Java and can help me, I would appreciate it!
Thanks, guys!

Take a look here:
http://www.php.net/manual/de/function.hexdec.php#100578
Is this not exactly what you were looking for?

If my understanding of your Java function is correct, it takes the string's chars in pairs, treats each pair as one byte and puts the bytes in a byte array. PHP has no such thing as a byte array, but an ordinary string can hold arbitrary binary data. This is my take on your function (I have not compared it with the Java code's output).
$str = '98191107990D0000EF050000789C65970BCCD75318C7CFEFFC';
$output = array();
for ($i = 0, $c = strlen($str) - 1; $i < $c; $i += 2) {
    $output[] = chr(intval($str[$i].$str[$i+1], 16));
}
print join($output); // binary string, not really useful in ascii terminal (-:
In summary, this seems to be a base16_decode() function. With base16_encode() written as follows, you get back the input string:
function base16_encode($str) {
    $byteArray = str_split($str);
    foreach ($byteArray as &$byte) {
        // %02X gives upper-case hex digits, matching the case of the input
        $byte = sprintf('%02X', ord($byte));
    }
    return join($byteArray);
}
print base16_encode(join($output)); // should print back the original input.
