I am trying to use Amazon S3 Select to read records from a CSV file. If a field contains a line break (\n), the record is not parsed as a single record, even though the field with the line break is properly enclosed in double quotes as per the standard CSV format.
For example, the CSV file below
Id,Name,Age,FamilyName,Place
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
is being parsed as
Line 1 : Id,Name,Age,FamilyName,Place
Line 2 : p1,Albert Einstein,25,"Einstein
Line 3 : Cambridge",Cambridge
Line 4 : p2,Thomas Edison,30,"Edison
Line 5 : Cardiff",Cardiff
Ideally it should have been parsed as given below:
Line 1:
Id,Name,Age,FamilyName,Place
Line 2:
p1,Albert Einstein,25,"Einstein
Cambridge",Cambridge
Line 3:
p2,Thomas Edison,30,"Edison
Cardiff",Cardiff
I'm setting AllowQuotedRecordDelimiter to TRUE in the SelectObjectContentRequest as described in the documentation, but it's still not working.
Does anyone know whether Amazon S3 Select supports line breaks inside fields as in the case above? Or are there other parameters I need to change or set to make this work?
This is being parsed and printed correctly. The confusion comes from the literal newline inside the field being printed in the output as well. You can verify this by running the following expression on the given CSV:
SELECT COUNT(*) from s3Object s
Output: 2
Note that if you select only the fourth column (FamilyName, the one that contains the line breaks), you get just that value:
SELECT s._4 from s3Object s
The output contains only the quoted field from each record:
"Einstein
Cambridge"
"Edison
Cardiff"
What's happening is that the line break inside the field is the same character as the default CSVOutput.RecordDelimiter value (\n), so the two cannot be told apart in the output. If you want the records to be separated in a different way, you could add the following to the CSVOutput part of the OutputSerialization:
"RecordDelimiter": "\r\n"
or use some other sequence of one or two characters in place of \r\n.
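To check locally that the sample file really holds two data records, a minimal quote-aware splitter can be sketched in Java. This only illustrates the quoting rule and is not how S3 Select is implemented; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class QuotedCsvRecords {

    // Splits CSV text into records, treating \n inside double quotes as data.
    static List<String> splitRecords(String csv) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < csv.length(); i++) {
            char c = csv.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, so the state ends up unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String csv = "Id,Name,Age,FamilyName,Place\n"
                + "p1,Albert Einstein,25,\"Einstein\nCambridge\",Cambridge\n"
                + "p2,Thomas Edison,30,\"Edison\nCardiff\",Cardiff\n";
        List<String> records = splitRecords(csv);
        // Number of data records, header excluded.
        System.out.println(records.size() - 1);
    }
}
```

The count matches the COUNT(*) result above: the newline inside the quotes belongs to the field and does not end the record.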
My input is a pipe-separated ("|") file, and I can't change the input file.
The format is
HEADER_A|HEADER_B|HEADER_C
A|B|C
A D|B| => a record without a comma produces output like "A D|B|"
A,D|B| => a record with a comma produces output like " A,D|B| "
The Spark config is:
sparkSession.read()
    .option("header", "true")
    .option("delimiter", "|")
    .schema(schema) // assume this is valid and represents the correct schema
    .csv(fileName)
    .cache();
I've tried using the "sep" option, but that didn't work either.
If my delimiter is "|", why does Spark treat records with a comma differently?
I found my error. Because the record contains a comma, I should not use .csv(path) when writing the file: the CSV writer uses a comma as its default separator and quotes any value that contains one.
Changing from
dataset.write()...
.csv(path)
to
dataset.write()...
.text(path)
solved it
I have a CSV file that contains a CRLF in the middle of a record (I checked in Notepad++). Because of it, the data for that column ends up in the next column, and the same happens to the following column values.
I want to remove it with ReplaceText in Apache NiFi. Any leads?
Example below:
Name,id,product,product_id,email,phone_no,fax_no
John,1,2,3,CRLF
x#p.com, +212 -909-9008, +212 -909-9009 -- it is coming in next line.
You can use the ReplaceText processor with Evaluation Mode set to Line-by-Line.
Replace \n (or \r\n, since the file uses CRLF) with '' or any other value.
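Before configuring the processor, the search pattern can be tried out locally. The sketch below uses plain Java regex, not NiFi, and assumes a narrower, hypothetical rule than the one above: only a line break that directly follows a comma is removed, so the line breaks that end complete records are kept. The class and method names are illustrative, and how ReplaceText exposes line terminators in Line-by-Line mode should be verified separately.

```java
final class JoinBrokenLines {

    // Hypothetical rule: a line break right after a comma is treated as spurious.
    static String joinAfterTrailingComma(String text) {
        return text.replaceAll(",\\r?\\n", ",");
    }

    public static void main(String[] args) {
        String input = "Name,id,product,product_id,email,phone_no,fax_no\r\n"
                + "John,1,2,3,\r\n"
                + "x#p.com, +212 -909-9008, +212 -909-9009\r\n";
        // The header line is untouched; the two halves of the record are joined.
        System.out.print(joinAfterTrailingComma(input));
    }
}
```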
I am using text fields to display column names. To show the name of each column I tried the following method:
Method 1:
textField.setX(currentXPos);
textField.setY(0);
textField.setWidth(columnWidth);
textField.setPrintWhenDetailOverflows(false);
textField.setHeight(colDtlBandHeight);
textField.setStretchWithOverflow(true);
textField.setStretchType(StretchTypeEnum.RELATIVE_TO_BAND_HEIGHT);
textField.setStyle(normalFont);
textField.setBlankWhenNull(true);
JRDesignExpression expression = new JRDesignExpression();
expression.setValueClass(columnClass);
expression.setText("$F{" + columnName + "}");
But on using the above method it throws an exception saying:
net.sf.jasperreports.engine.JRException: Errors were encountered when compiling report expressions class file:
1. Syntax error on token "ID", delete this token
value = SHIFT ID; //$JR_EXPR_ID=44$
2. Syntax error, insert ";" to complete BlockStatements
value = BILL NO.; //$JR_EXPR_ID=45$
3. Syntax error on token ".", invalid VariableDeclarator
value = BILL NO.; //$JR_EXPR_ID=45$
4. Syntax error on token "DATE", delete this token
value = BILL DATE; //$JR_EXPR_ID=46$
But with the line below, the column names are set correctly.
Method 2:
textField.setExpression(new JRDesignExpression("new String(\""+colTitle+"\")"));
My questions are:
1. The first method is also used for displaying the data. Why are there no exceptions in that case?
2. Why did it throw those exceptions when the same method was used for displaying column names?
3. How did the second method work?
1.:
I suppose the data is properly enclosed in quotes.
2.:
Judging by the exception explanation (e.g. Syntax error on token "ID", delete this token) the interpreter sees two values, SHIFT and ID. It seems here that quotes are missing, e.g.
"SHIFT ID"
"BILL NO."
3.:
In your first example, you create a JRDesignExpression, set the value class and set the text.
The field isn't enclosed in quotes as seen in your lower example. It should look like this:
expression.setText("\"$F{" + columnName + "}\"");
Also, you didn't assign the expression to your textField:
textField.setExpression(expression);
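The quoting point can be illustrated without JasperReports at all. The sketch below only builds the expression text; the helper name asStringLiteral is hypothetical. It wraps a column title in quotes and escapes embedded quotes and backslashes, so the report compiler would see a string constant rather than loose tokens such as SHIFT and ID.

```java
final class ColumnTitleExpression {

    // Turns a title such as BILL NO. into the text of a Java string literal.
    static String asStringLiteral(String title) {
        String escaped = title.replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // Unquoted text is what produced "value = SHIFT ID;" in the generated class.
        System.out.println("SHIFT ID");
        // Quoted text compiles as a string constant.
        System.out.println(asStringLiteral("SHIFT ID"));
        System.out.println(asStringLiteral("BILL NO."));
    }
}
```

The returned text could then be passed to the JRDesignExpression(String) constructor already used in Method 2.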
I am in a difficult situation: I need to write a parser for a formatted document exported from Tekla so that it can be processed into a database.
In the .csv I have this:
,SMS-PW-BM31,,1,,,,287.9
,,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9
,------------,--------------,----,---------------,--------,------------,---------
,SMS-PW-BM32,,1,,,,405.8
,,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2
,,SMSPW-EN12,1,PLT12x175,SS400,500,8.2
,,SMSPW-EN14,1,PLT16x175,SS400,500,11
,------------,--------------,----,---------------,--------,------------,---------
That is the document generated by the Tekla software. The output I expect is something like this:
HEAD-MARK    COMPONENT-TYPE  QUANTITY  PROFILE        GRADE  LENGTH  WEIGHT
SMS-PW-BM31                  1                                       287.9
SMS-PW-BM31  SMS-PW-BM31     1         H350*175*7*11  SS400  5805    287.9
SMS-PW-BM32                  1                                       405.8
SMS-PW-BM32  SMSPW-H707      1         H350*175*7*11  SS400  6697    332.2
SMS-PW-BM32  SMSPW-EN12      1         PLT12X175      SS400  500     8.2
SMS-PW-BM32  SMSPW-EN14      1         PLT16X175      SS400  500     11
Where do I start in Java? The most complicated part is carrying the head mark down to every component row in its group, where the groups are separated by the rows of '-' characters.
CSV format is quite simple: there is a column delimiter, the comma (,), and a row delimiter, the newline (\n). Some columns may be surrounded by quotes (") to contain column data, but it looks like you won't have to worry about that given your current file.
Look at String.split and you will find your answer after pondering it a bit.
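As a starting point, here is a minimal sketch built on String.split. It assumes every row has eight comma-separated fields as in the sample, that the head mark sits in the second field of the first row of a group, and that a second field made only of '-' ends the group. The class name and the row layout are illustrative, and values are copied as-is (no case conversion).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TeklaReportParser {

    // Returns rows of {headMark, component, quantity, profile, grade, length, weight}.
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String headMark = "";
        for (String line : lines) {
            // A limit of -1 keeps trailing empty fields.
            String[] f = line.split(",", -1);
            if (f.length < 8) {
                continue; // not a data row
            }
            if (f[1].matches("-+")) {
                headMark = ""; // dashed separator: the current group ends
                continue;
            }
            if (!f[1].isEmpty()) {
                headMark = f[1]; // first row of a new group
            }
            rows.add(new String[] {headMark, f[2], f[3], f[4], f[5], f[6], f[7]});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                ",SMS-PW-BM31,,1,,,,287.9",
                ",,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9",
                ",------------,--------------,----,---------------,--------,------------,---------",
                ",SMS-PW-BM32,,1,,,,405.8",
                ",,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2");
        for (String[] row : parse(lines)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Each returned array could then be bound to the parameters of an INSERT statement for the database.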
I'm using the SQL command LOAD DATA to insert data from a CSV file into a MySQL database. The problem is that at the end of the file there are a few lines like ",,,,,,,,,,,,,,,,,," (the CSV file was converted from an Excel file). When MySQL gets to those lines it returns: #1366 - Incorrect integer value: '' for column 'Bug_ID' at row 661.
'Bug_ID' is an INT and I have 32 columns.
How can I tell it to ignore those lines, considering that the number of filled lines is variable?
Thanks for your help.
MySQL supports a LINES STARTING BY 'xxxx' clause when reading delimited text files. If you can, require your specific .csv file to have a prefix on each data line and no prefix on non-data lines. This also gives you the benefit of being able to put comments into the .csv if desired.
MySQL Doc: Load Data InFile
You can:
step 1 - (optionally) export data:
SELECT *
INTO OUTFILE "myFile.csv"
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES STARTING BY 'DATA:'
TERMINATED BY '\n'
FROM some_table
step 2 - import data
LOAD DATA INFILE "myFile.csv"
INTO TABLE some_table
COLUMNS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES STARTING BY 'DATA:'
Effectively you can modify the .csv file to look like this:
# Comment for humans
// Comment for humans
Comments for us humans.
DATA:1,3,4,5,6,"asdf","abcd"
DATA:4,5,6,7,8,"qwerty","zxcv"
DATA:9,8,7,6,5,"yuio","hjlk"
# Comments for humans
// Comments for humans
Comments for humans
DATA:13,15,64,78,54,"bla bla","foo bar"
Only the lines with 'DATA:' prefix will be interpreted/read by the construct.
I used this technique to create a 'config' file for a SQL script that needed external control information. But there was a human element that needed to be able to easily manipulate the .csv file and understand its contents.
-- J Jorgenson --
I fixed it:
I just added a condition on the line in my CSV parser.
while ((line = is.readLine()) != null) {
    if (!line.equals(",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,")) {
        Iterator e = csv.parse(line).iterator();
        ......
    }
}
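The comparison above depends on the exact number of commas. A slightly more tolerant check, sketched here as an alternative rather than as the code actually used, treats any line made up only of commas and whitespace as empty; the helper name isBlankRecord is hypothetical.

```java
final class BlankCsvLine {

    // True when the line holds nothing but commas and whitespace.
    static boolean isBlankRecord(String line) {
        return line.matches("[,\\s]*");
    }

    public static void main(String[] args) {
        System.out.println(isBlankRecord(",,,,,,,,,,,,,,,,,,")); // only separators
        System.out.println(isBlankRecord("661,,open,,"));        // has data
    }
}
```

In the loop above, the condition would become if (!isBlankRecord(line)), which keeps working if the column count changes.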