Parsing a complicated CSV file


I need to write a parser for a formatted document exported from Tekla so that the data can be loaded into a database.
The .CSV file contains this:
,SMS-PW-BM31,,1,,,,287.9
,,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9
,------------,--------------,----,---------------,--------,------------,---------
,SMS-PW-BM32,,1,,,,405.8
,,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2
,,SMSPW-EN12,1,PLT12x175,SS400,500,8.2
,,SMSPW-EN14,1,PLT16x175,SS400,500,11
,------------,--------------,----,---------------,--------,------------,---------
That is the document generated by the Tekla software. The output I expect is something like this:
HEAD-MARK COMPONENT-TYPE QUANTITY PROFILE GRADE LENGTH WEIGHT
SMS-PW-BM31 1 287.9
SMS-PW-BM31 SMS-PW-BM31 1 H350*175*7*11 SS400 5805 287.9
SMS-PW-BM32 1 405.8
SMS-PW-BM32 SMSPW-H707 1 H350*175*7*11 SS400 6697 332.2
SMS-PW-BM32 SMSPW-EN12 1 PLT12X175 SS400 500 8.2
SMS-PW-BM32 SMSPW-EN14 1 PLT16X175 SS400 500 11
Where do I start in Java? The hardest part is distributing the head mark to every component row of its block; the blocks are separated by the rows of '-' characters.

The CSV format is quite simple: the column delimiter is a comma (,) and the row delimiter is a newline (\n). Some columns may be surrounded by double quotes (") when the data itself contains a delimiter, but it looks like you won't have to worry about that given your current file.
Look at String.split: split each line on the comma, remember the most recent non-empty head mark, and skip the rows of dashes.
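A minimal sketch of that approach, assuming every data line has exactly the eight comma-separated columns shown in the question and that no field is quoted. The class name TeklaCsvParser and the method parse are made up for illustration; they are not part of any Tekla or Java API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TeklaCsvParser {

    /**
     * Returns one 7-column row per data line:
     * head mark, component, quantity, profile, grade, length, weight.
     */
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String headMark = "";
        for (String line : lines) {
            // A limit of -1 keeps empty trailing columns.
            String[] c = line.split(",", -1);
            // Skip blank or short lines and the rows of dashes.
            if (c.length < 8 || c[1].startsWith("-")) {
                continue;
            }
            // An assembly row carries the head mark in column 1; remember it.
            if (!c[1].isEmpty()) {
                headMark = c[1];
            }
            // Component rows leave column 1 empty, so reuse the remembered head mark.
            rows.add(new String[] {headMark, c[2], c[3], c[4], c[5], c[6], c[7]});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                ",SMS-PW-BM32,,1,,,,405.8",
                ",,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2",
                ",------------,--------------,----,---------------,--------,------------,---------");
        for (String[] row : parse(sample)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

Each returned array maps directly onto the seven output columns, so it can be bound to a database insert statement as it stands.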

Related

Amazon S3 Select Issue : not supporting line break occurring inside fields

I am trying to use Amazon S3 Select to read records from a CSV file. If a field contains a line break (\n), the record is not parsed as a single record, even though the field with the line break is enclosed in double quotes as required by the standard CSV format.
For example, the below CSV file
Id,Name,Age,FamilyName,Place
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
is being parsed as
Line 1 : Id,Name,Age,FamilyName,Place
Line 2 : p1,Albert Einstein,25,"Einstein
Line 3 : Cambridge",Cambridge
Line 4 : p2,Thomas Edison,30,"Edison
Line 5 : Cardiff",Cardiff
Ideally it should have been parsed as given below:
Line 1:
Id,Name,Age,FamilyName,Place
Line 2:
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
Line 3:
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
I'm setting AllowQuotedRecordDelimiter to TRUE in the SelectObjectContentRequest as given in their documentation. It's still not working.
Does anyone know whether Amazon S3 Select supports line breaks inside fields, as in the case described above? Or are there other parameters I need to change or set to make this work?

This is being parsed and printed correctly. The confusion arises because the literal newline inside the field is printed in the output. You can test this by running the following expression on the given CSV:
SELECT COUNT(*) FROM s3Object s
Output: 2
Note that if you select only the fourth column (FamilyName), you get only that field's value:
SELECT s._4 FROM s3Object s
The output contains only the quoted field from each record:
"Einstein
Cambridge"
"Edison
Cardiff"
What's happening is that the newline character in the field is the same as the default CSVOutput.RecordDelimiter value (\n), which causes the clash. If you want the records to be separated in a different way, you could add the following to the CSVOutput part of the OutputSerialization:
"RecordDelimiter": "\r\n"
or use some other sequence of one or two characters in place of \r\n.
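To see why the record count is 2 even though the data spans four physical lines, here is a small stand-alone sketch of quote-aware record splitting. It is not how S3 Select is implemented and it uses no AWS API; the class name QuotedRecordSplitter is made up, and the code only illustrates the rule that a newline inside double quotes does not end a record.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedRecordSplitter {

    /** Splits CSV text into records, ignoring newlines that sit inside double quotes. */
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char ch = csv.charAt(i);
            if (ch == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
                current.append(ch);
            } else if (ch == '\n' && !inQuotes) {
                // An unquoted newline ends the current record.
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // Keep a final record that has no trailing newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "Id,Name,Age,FamilyName,Place\n"
                + "p1,Albert Einstein,25,\"Einstein\nCambridge\",Cambridge\n"
                + "p2,Thomas Edison,30,\"Edison\nCardiff\",Cardiff\n";
        // The header plus two data records, although the text has five physical lines.
        System.out.println(splitRecords(csv).size());
    }
}
```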

Spark adding extra space when record contain a "comma"

My input is a "|" (pipe) separator file. I can't change the input file.
The format is
HEADER_A|HEADER_B|HEADER_C
A|B|C
A D|B| => records without comma generates output like "A D|B|"
A,D|B| => records with comma generates output like " A,D|B| "
Spark config is :
sparkSession.read()
.option("header","true")
.option("delimiter","|")
.schema(schema) // assume this is valid and represents the correct schema
.csv(fileName)
.cache();
I've tried using the "sep" option instead, but that didn't work either.
If my delimiter is "|", why does Spark treat records containing a comma differently?

I found my error. Because the record contains a comma, I should not use .csv(path) when writing the file.
Changing from
dataset.write()...
.csv(path)
to
dataset.write()...
.text(path)
solved it.

How to read a CSV file column wise using hadoop?

I am trying to read a CSV file that does not contain comma-separated values. The columns hold NASDAQ stock data, and I want to read one particular column (say, the 3rd), but I do not know how to get the column items. Is there any method to read column-wise data in Hadoop? Please help.
My CSV File Format is:
exchange stock_symbol date stock_price_open stock_price_high stock_price_low stock_price_close stock_volume stock_price_adj_close
NASDAQ ABXA 12/9/2009 2.55 2.77 2.5 2.67 158500 2.67
NASDAQ ABXA 12/8/2009 2.71 2.74 2.52 2.55 131700 2.55
Edited Here:
Column A : exchange
Column B : stock_symbol
Column C : date
Column D : stock_price_open
Column E : stock_price_high
and so on.
These are columns, not comma-separated values. I need to read this file column-wise.

In Pig it will look like this (assuming the columns are separated by tabs):
Q1 = LOAD 'file.csv' USING PigStorage('\t') AS (exchange, stock_symbol, stock_date:chararray, stock_price_open, stock_price_high, stock_price_low, stock_price_close, stock_volume, stock_price_adj_close);
Q2 = FOREACH Q1 GENERATE stock_date;
DUMP Q2;
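If you would rather stay in Java, the per-line logic is the same as what a mapper would apply to each input line. The following is a plain-Java sketch without any Hadoop API; the class name ColumnReader is made up, and it assumes the fields are separated by whitespace and contain no whitespace themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ColumnReader {

    /** Returns the values of one zero-based column, skipping the header line. */
    static List<String> column(List<String> lines, int index) {
        List<String> values = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) { // start at 1 to skip the header
            String[] fields = lines.get(i).trim().split("\\s+");
            if (index < fields.length) {
                values.add(fields[index]);
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "exchange stock_symbol date stock_price_open",
                "NASDAQ ABXA 12/9/2009 2.55",
                "NASDAQ ABXA 12/8/2009 2.71");
        // Index 2 is the third column (date).
        System.out.println(column(lines, 2));
    }
}
```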

You can also reformat the Excel sheet by combining the columns into a single text cell with a formula like:
=CONCATENATE(A2,";",B2,";",C2,";",D2,";",E2,";",F2,";",G2,";",H2,";",I2)
This joins the columns with the separator you need. I have used ; here; substitute whatever separator you want.

Create Excel files in java(invalid number)

I have a string like "2,345" that I want to put into an Excel cell. I managed to do that, but in my Excel file "2,345" ends up as a string. Please suggest how I can get "2,345" as a numeric value while keeping the same comma-separated format.
Thanks in advance.

Remove the comma from the string (String.replace), parse the result as a number before inserting it into Excel, then format the column to show the comma.
In Excel, the VBA code to format a Range with commas is:
SomeRange.Style = "Comma" 'or, recorded version
SomeRange.NumberFormat = "_-* #,##0_-;-* #,##0_-;_-* ""-""??_-;_-#_-"
'a simpler version..
SomeRange.NumberFormat = "#,##0"
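On the Java side, the first two steps can be sketched with the standard library alone. The class name CommaNumber is made up, and the sketch assumes the comma is a thousands separator (US-style). How the numeric value and the "#,##0" format are then handed to the cell depends on the Excel library you use, which the question does not name.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

class CommaNumber {

    /** Strips thousands separators and parses the rest as a number. */
    static double parse(String text) {
        return Double.parseDouble(text.replace(",", ""));
    }

    /** Formats a number with a thousands separator, mirroring the "#,##0" cell format. */
    static String format(double value) {
        DecimalFormat pattern =
                new DecimalFormat("#,##0", DecimalFormatSymbols.getInstance(Locale.US));
        return pattern.format(value);
    }

    public static void main(String[] args) {
        double value = parse("2,345");      // store this numeric value in the cell
        System.out.println(value);
        System.out.println(format(value));  // what the "#,##0" format displays
    }
}
```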

Arranging text inside saved txt file

I have made an app that takes some values and adds them to a txt file.
It writes something like this (the values are String[] arrays):
product[1] quantity[1] price[1]
product[2] quantity[2] price[2]
.....
product[n] quantity[n] price[n]
The problem is that, most of the time, product[1] won't have the same length as product[2] or the other products, and the same goes for quantities and prices. This results in a messy text layout, something like this:
ww 2 4
wwww 1 2.5
w 1.2 1.1
Is there any way I can make it tidier? Something like creating a table or columns?
Thanks!
EDIT: To make it a bit clearer, I want to find a way for the contents of the txt file to be arranged like this, instead of how they are in the example above:
ww      2      4
wwww    1      2.5
w       1.2    1.1
At the moment I'm using this:
pw.println(prod[n]+" "+cant[n]+" "+pret[n]);}
But this leaves the text in the txt file unaligned (as in the first example).

Use the format method of the String class, like this.
Declare a String with the format:
String yourFormat = "%-10s %-10s %-10s%n"; // choose widths that fit your data.
// if a value exceeds its width, there will still be exactly one space
// before the next column
Write the output with that format:
output.write(String.format(yourFormat, firstString, secondString, thirdString));
The first string holds your w's; the second and third are the columns with numbers.
For your example:
String myFormat = "%-10s %-10s %-10s%n";
for (int i = 0; i < prod.length; i++) {
    // print, not println: the format already ends with %n
    pw.print(String.format(myFormat, prod[i], cant[i], pret[i]));
}
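If you don't want to guess the widths, you can derive them from the data. This is a sketch under the assumption that the three arrays have the same length; the class name ColumnAligner is made up, and the arrays are named after the ones in the question.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnAligner {

    /** Builds left-aligned rows whose column widths match the longest value in each column. */
    static List<String> align(String[] prod, String[] cant, String[] pret) {
        // The last column needs no padding, so only two widths are computed.
        String format = "%-" + maxLength(prod) + "s  %-" + maxLength(cant) + "s  %s";
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < prod.length; i++) {
            rows.add(String.format(format, prod[i], cant[i], pret[i]));
        }
        return rows;
    }

    /** Longest string length in the array, at least 1 because a width of 0 is not a valid format. */
    static int maxLength(String[] values) {
        int max = 1;
        for (String value : values) {
            max = Math.max(max, value.length());
        }
        return max;
    }

    public static void main(String[] args) {
        String[] prod = {"ww", "wwww", "w"};
        String[] cant = {"2", "1", "1.2"};
        String[] pret = {"4", "2.5", "1.1"};
        for (String row : align(prod, cant, pret)) {
            System.out.println(row); // with a PrintWriter, use pw.println(row) instead
        }
    }
}
```

The trade-off is that all rows must be known before the first one is written, whereas a fixed width such as %-10s lets you write each row as soon as it is available.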
