Spring Batch: How to parse a horked CSV file? - java

One of the common tasks Spring Batch is used for is parsing CSV files, but CSV files come in all shapes and forms… and some of them are syntactically incorrect. I am currently trying to parse an invalid CSV file:
My CSV file data.CSV:
"1;13/4/1992;16/7/2006"
"2;31/5/1993;1/8/2009"
"3;21/4/1990;27/4/2010"
I used this code here.
This works for me, but I want to use a CSV file of my own. Any suggestions, please?

As I said in the comment on the answer you linked, you can use a FlatFileItemReader instead of the ListItemReader and it should work. I used the ListItemReader in order to provide a self-contained example you can run (without the need to upload a CSV file, etc). The point of the other question was how to trim the leading/trailing " before parsing the lines, and the trick was to use a composite item processor as shown in that example.
You don't need to create a new reader, here is an example of a FlatFileItemReader you can use with the same example:
@Bean
public FlatFileItemReader<String> itemReader() {
    return new FlatFileItemReaderBuilder<String>()
            .name("itemReader")
            .resource(new FileSystemResource("/absolute/path/to/file"))
            .lineMapper(new PassThroughLineMapper())
            .build();
}
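For the quote-trimming step mentioned above, here is a minimal sketch of an item processor that strips the wrapping " before the line is split (the class name is illustrative, not part of the original example):
import org.springframework.batch.item.ItemProcessor;

// Hypothetical processor: removes the leading/trailing quote so a downstream
// processor or tokenizer can split the line on ';'.
public class QuoteTrimmingItemProcessor implements ItemProcessor<String, String> {
    @Override
    public String process(String line) {
        return line.replaceAll("^\"|\"$", "");
    }
}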
Hope this helps.

Related

How to keep leading zeros in strings when exporting data using the opencsv library

I am using the opencsv library in Java to export a CSV file, but I have a problem. When I have a string beginning with zero, such as 0123456, the export removes the 0 and my CSV shows 123456; the zero is missing. I tried this:
"\"\t"+"0123456"+ "\"";
but then the exported CSV shows "0123456", which I don't want. I want 0123456. I don't want end users to have to edit it in Excel, because some of them don't know how. How can I export a CSV with opencsv and keep the leading zero? Please help.
I think the problem is not really in generating the CSV but in the way Excel treats the data when the file is opened via Explorer.
Try this code and view the CSV in a text editor (not Excel); notice that it shows up correctly, though when it is opened in Excel the leading 0s are lost!
import com.opencsv.CSVWriter;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;

CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array)
String[] entries = "0123131#21212#021213".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
// don't apply quotes
writer.writeAll(a, false);
writer.close();
If you really want users to see the leading 0s for numeric values when the file is opened in Excel, then each cell entry should be in the ="dataHere" format; see the code below:
// same imports as above
CSVWriter writer = new CSVWriter(new FileWriter("yourfile.csv"));
// feed in your array (or convert your data to an array);
// each value is wrapped as an Excel formula: ="0123131"
String[] entries = "=\"0123131\"#=\"21212\"#=\"021213\"".split("#");
List<String[]> a = new ArrayList<>();
a.add(entries);
writer.writeAll(a);
writer.close();
With this change, Excel shows the leading zeros when the file is opened from Windows Explorer (by double-clicking). But if you open the CSV in a text editor, you will see the data wrapped in the ="..." formula syntax that was added to "suit" Excel.
Also see this link:
format-number-as-text-in-csv-when-open-in-both-excel-and-notepad
Have you tried using a String like this: "'"+"0123456"? The ' character marks the number as text when it is parsed into Excel.
For me OpenCSV works correctly (version 5.6).
For example, my CSV file has a row like the following extract:
"999739059";;;"abcdefgh";"001024";
and opencsv reads the field "001024" as 001024 correctly. Of course, I have mapped the field to a String, not a Double.
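A minimal sketch of the kind of mapping meant here, assuming opencsv's bean binding (the class and field names are illustrative):
import com.opencsv.bean.CsvBindByPosition;

public class Row {
    // Bound as a String, so a value like "001024" keeps its leading zeros;
    // binding it to a Double would parse it as a number and drop them.
    @CsvBindByPosition(position = 4)
    private String code;
}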
But if you still have problems, you can grab a simple yet powerful parser that fully adheres to the RFC 4180 standard:
mykong.com
Mykong shows some examples of using opencsv directly and, at the end, writes a simple parser you can use if you don't want to import OpenCSV. The parser works very well, and since its source code is easy to understand, you can modify it as you like if you still have problems or want to customize it for your needs.

How to validate values in a YAML configuration file while loading it?

Is there a way to validate values in a YAML file while loading it in code? The requirement is that some elements in the YAML file must have values; if the validation fails, the YAML should not be loaded.
I'm using snakeyaml library and heard there is a way to do this via Representer.
Code I'm currently using to load the YAML:
Reader in = new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8);
Yaml yaml = new Yaml();
yaml.setBeanAccess(BeanAccess.FIELD);
return yaml.loadAs(in, School.class);
Since you can have any value in a YAML file, you should load the file in a function, test the values, and raise an error if they are not what you want; return the loaded data if they are.
This may have side effects if your YAML has tags that create arbitrary objects, but checking during loading will not prevent that either, as such an object might have been created before you get to the value you want to check.
If you do have tags in your YAML and that is a real problem, then you would have to make a safe_load-er for the YAML file that can handle the tags (by creating normal mapping objects), then check the values and reload with full tag support.
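A minimal sketch of that load-then-validate approach, reusing the snakeyaml setup from the question (the required-field check on School is an assumption for illustration):
import java.io.Reader;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.introspector.BeanAccess;

public static School loadValidated(Reader in) {
    Yaml yaml = new Yaml();
    yaml.setBeanAccess(BeanAccess.FIELD);
    School school = yaml.loadAs(in, School.class);
    // Hypothetical check: reject the document if a mandatory element has no value.
    if (school.getName() == null || school.getName().isEmpty()) {
        throw new IllegalArgumentException("YAML rejected: 'name' must have a value");
    }
    return school;
}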

Reading JSON file with BigQuery to make table

I'm new to Google Dataflow and can't get this thing to work with JSON. I've been reading through the documentation but can't solve my problem.
Following the WordCount example, I figured out how data is loaded from a .csv file with the following line:
PCollection<String> input = p.apply(TextIO.Read.from(options.getInputFile()));
where inputFile is a .csv file from my gcloud bucket. I can transform the lines read from the .csv with:
PCollection<TableRow> table = input.apply(ParDo.of(new ExtractParametersFn()));
(ExtractParametersFn is defined by me.) So far so good!
But then I realized my .csv file was too big and I had to convert it to JSON (https://cloud.google.com/bigquery/preparing-data-for-bigquery).
Since BigQueryIO is supposedly better for reading JSON, I tried with the following code:
PCollection<TableRow> table = p.apply(BigQueryIO.Read.from(options.getInputFile()));
(inputFile is then a JSON file, and the expected output when reading with BigQuery is a PCollection of TableRows). I tried TextIO too (which returns a PCollection of Strings), and neither of the two IO options works.
What am I missing? The documentation is really not detailed enough to find an answer there, but perhaps some of you have already dealt with this problem?
Any suggestions would be very appreciated. :)
I believe there are two options to consider:
Use TextIO with TableRowJsonCoder to ingest the JSON files (e.g., as is done in the TopWikipediaSessions example); see the sketch after this list.
Import the JSON files into a BigQuery table (https://cloud.google.com/bigquery/loading-data-into-bigquery), and then use BigQueryIO.Read to read from the table.
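For option 1, a minimal sketch assuming the same (pre-Beam) Dataflow SDK used in the question; the gs:// path is a placeholder:
// Each line of the input is newline-delimited JSON; the coder turns each
// line directly into a TableRow, as in the TopWikipediaSessions example.
PCollection<TableRow> rows = p.apply(
        TextIO.Read.from("gs://your-bucket/input.json")
                .withCoder(TableRowJsonCoder.of()));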

saveAsTextFile() to write the final RDD as single text file - Apache Spark

I am working on a batch application using Apache Spark. I wanted to write the final RDD as a text file; currently I am using the saveAsTextFile("filePath") method available on RDD.
My text file contains fields delimited with the \u0001 delimiter, so in the model class's toString() method I appended all the fields separated by the \u0001 delimiter.
Is this the correct way to handle this, or is there another, better approach?
Also, what if I iterate over the RDD and write the file content using the FileWriter class available in Java?
Please advise on this.
Regards,
Shankar
To write a single file there are a few options. If you're writing to HDFS or a similar distributed store, you can first coalesce your RDD down to a single partition (note that your data must then fit on a single worker), or you can collect the data to the driver and then use a FileWriter.
// Note: hadoopConfiguration() lives on JavaSparkContext, not on SparkConf,
// so the context is taken as the parameter here; awsAccessKey/awsSecretKey
// are your S3 credentials.
public static boolean copyMerge(JavaSparkContext sc, JavaRDD<String> rdd, String dstPath)
        throws IOException, URISyntaxException {
    Configuration hadoopConfig = sc.hadoopConfiguration();
    hadoopConfig.set("fs.s3.awsAccessKeyId", awsAccessKey);
    hadoopConfig.set("fs.s3.awsSecretAccessKey", awsSecretKey);
    String tempFolder = "s3://bucket/folder";
    rdd.saveAsTextFile(tempFolder); // step 1: write part files to a temp folder
    FileSystem hdfs = FileSystem.get(new URI(tempFolder), hadoopConfig);
    // step 2: merge the part files into a single destination file
    return FileUtil.copyMerge(hdfs, new Path(tempFolder), hdfs, new Path(dstPath),
            false, hadoopConfig, null);
}
This solution works for S3 or any HDFS-compatible file system. It is achieved in two steps:
Save the RDD with saveAsTextFile; this generates multiple part files in the folder.
Run Hadoop's copyMerge.
Instead of doing collect and pulling everything to the driver, I would rather suggest using coalesce, which is better for avoiding memory problems.
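A minimal sketch of the coalesce approach (the output path is a placeholder):
// Coalesce down to one partition so Spark writes a single part file;
// note the whole dataset must then fit on a single worker.
rdd.coalesce(1).saveAsTextFile("hdfs:///output/final");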

How can I get content from Exchange.In:Body object from a ProcessDefinition in Camel

I am integrating data between two systems using Apache Camel. I want the resulting XML to be written to an XML file, and I want to base the name of that file on some data that is unknown when the integration chain starts.
After the first enrich step, the necessary data is in the Exchange object.
So the question is: how can I get data from the exchange.getIn().getBody() method outside of the process chain, in order to generate a desirable filename for my output file and, as a final step, write the XML to this file? Or is there some other way to accomplish this?
Here is my current process chain from the RouteBuilder's configure() method:
from("test_main", "jetty:server")
.process(new PiProgramCommonProcessor())
.enrich("piProgrammeEnricher", new PiProgrammeEnricher())
// after this step I have the data available in exchange.in.body
.to(freeMarkerXMLGenerator)
.to(xmlFileDestination)
.end();
best regards
RythmiC
The file component takes the file name from a header, if present, so you can just add a header to your message with the desired file name.
The header should use the key "CamelFileName", which is also defined as the constant Exchange.FILE_NAME.
See more details at: http://camel.apache.org/file2
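A minimal sketch of that idea (the endpoints and the way the id is taken from the body are illustrative assumptions, not from the original route):
from("direct:writeXml")
    .process(exchange -> {
        // Assumption: the enriched body carries the value the file name is based on.
        String id = exchange.getIn().getBody(String.class);
        exchange.getIn().setHeader(Exchange.FILE_NAME, "programme-" + id + ".xml");
    })
    .to("file:/tmp/output");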
