I have many CSV files with different column headers. Currently I read those CSV files and map them to different POJO classes based on their column headers. Some of the CSV files have around 100 column headers, which makes it difficult to create a POJO class for each.
Is there a technique that lets me map all of those CSV files to a single POJO class? Or should I read each CSV file line by line and parse it accordingly, or create the POJO at runtime (Javassist)?
If I understand your problem correctly, you can use uniVocity-parsers to process this and get the data in a map:
//First create a configuration object - there are many options
//available and the tutorial has a lot of examples
CsvParserSettings settings = new CsvParserSettings();
settings.setHeaderExtractionEnabled(true);
CsvParser parser = new CsvParser(settings);
parser.beginParsing(new File("/path/to/your.csv"));
// you can also apply some transformations:
// NULL year should become 0000
parser.getRecordMetadata().setDefaultValueOfColumns("0000", "Year");
// decimal separator in prices will be replaced by comma
parser.getRecordMetadata().convertFields(Conversions.replace("\\.00", ",00")).set("Price");
Record record;
while ((record = parser.parseNextRecord()) != null) {
Map<String, String> map = record.toFieldMap(/*you can pass a list of column names of interest here*/);
//for performance, you can also reuse the map and call record.fillFieldMap(map);
}
Or you can even parse the file and get beans of different types in a single step. Here's how you do it:
CsvParserSettings settings = new CsvParserSettings();
//Create a row processor to process input rows. In this case we want
//multiple instances of different classes:
MultiBeanListProcessor processor = new MultiBeanListProcessor(TestBean.class, AmountBean.class, QuantityBean.class);
// we also need to grab the headers from our input file
settings.setHeaderExtractionEnabled(true);
// configure the parser to use the MultiBeanProcessor
settings.setRowProcessor(processor);
// create the parser and run
CsvParser parser = new CsvParser(settings);
parser.parse(new File("/path/to/your.csv"));
// get the beans:
List<TestBean> testBeans = processor.getBeans(TestBean.class);
List<AmountBean> amountBeans = processor.getBeans(AmountBean.class);
List<QuantityBean> quantityBeans = processor.getBeans(QuantityBean.class);
See an example here and here
If your data is too big and you can't hold everything in memory, you can stream the input row by row by using the MultiBeanRowProcessor instead. The method rowProcessed(Map<Class<?>, Object> row, ParsingContext context) will give you a map of instances created for each class in the current row. Inside the method, just call:
AmountBean amountBean = (AmountBean) row.get(AmountBean.class);
QuantityBean quantityBean = (QuantityBean) row.get(QuantityBean.class);
...
//perform something with the instances parsed in a row.
Hope this helps.
Disclaimer: I'm the author of this library. It's open-source and free (Apache 2.0 license)
To me, creating a POJO class is not a good idea in this case, as neither the number of columns nor the number of files is constant.
Therefore, it is better to use something more dynamic, so that you do not have to change your code much just to support more columns or files.
I would go for a List (or Map) of Maps, e.g. List<Map<String, String>>, for a given CSV file,
where each map represents a row in your CSV file, keyed by column name.
You can easily extend it to multiple CSV files.
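As a rough illustration of the list-of-maps idea, here is a minimal sketch in plain Java. The class name CsvRowMaps and the method toRowMaps are hypothetical, and the naive split on commas assumes that no field contains a quoted comma; a real CSV parser would be needed for that.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvRowMaps {
    // Turns CSV lines (first line = header) into one map per data row.
    // Assumption: fields contain no quoted commas or embedded line breaks.
    static List<Map<String, String>> toRowMaps(List<String> lines) {
        List<Map<String, String>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        String[] headers = lines.get(0).split(",", -1);
        for (String line : lines.subList(1, lines.size())) {
            String[] values = line.split(",", -1);
            // LinkedHashMap keeps the columns in file order.
            Map<String, String> row = new LinkedHashMap<>();
            for (int i = 0; i < headers.length; i++) {
                // Missing trailing cells become null instead of failing.
                row.put(headers[i].trim(), i < values.length ? values[i].trim() : null);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("name,price", "Apple,2.5", "Orange,3");
        for (Map<String, String> row : toRowMaps(lines)) {
            System.out.println(row.get("name") + " costs " + row.get("price"));
        }
    }
}
```

Because the keys come from the header row, the same method works for any file regardless of how many columns it has.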
I want to replace some of the columns in one CSV file with the column values from other CSV files, and the files cannot fit in memory together. Language constraints: Java or Scala. No framework constraints.
One of the files has a key-value kind of mapping, and the other file has a large number of columns. We have to replace the values in the large CSV file with the values from the file that holds the key-value mapping.
Assuming that all the key-value mappings fit in memory, you can load them and then process the big file in a streaming fashion:
import java.io.{File, PrintWriter}
import scala.io.Source
val kv_file = Source.fromFile("key_values.csv")
// Construct a simple key value map
val kv : Map[String,String] = kv_file.getLines().map { line =>
  val cols = line.split(";")
  cols(0) -> cols(1)
}.toMap
kv_file.close()
// the big file is read lazily, one line at a time ("big_file.csv" is a placeholder name)
val big_file = Source.fromFile("big_file.csv")
val writer = new PrintWriter(new File("processed_big_file.csv"))
big_file.getLines().foreach { line =>
  // this is the key-value replace logic (as I understood)
  val processed_cols = line.split(";").map { s => kv.getOrElse(s, s) }
  // println adds the line terminator that write() would omit
  writer.println(processed_cols.mkString(";"))
}
// close files
big_file.close()
writer.close()
If you cannot fully load the key-value mapping, you could load it into memory in parts and still process the big file in a streaming fashion. Of course, you then have to iterate over the big file several times, once per part, until all the keys have been processed.
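A minimal sketch of that multi-pass idea in Java follows. All names here (ChunkedReplace, replaceInChunks, chunkSize, the temporary file names) are hypothetical, and it assumes ";" as the separator as in the Scala snippet. Note one caveat of this approach: a value written by an earlier pass can be replaced again by a later pass if it also happens to be a key, so the sketch assumes that replacement values are not themselves keys.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ChunkedReplace {
    // Replaces cell values of bigFile using the mappings in kvFile while holding
    // at most chunkSize mappings in memory. Each chunk costs one pass over the data.
    static void replaceInChunks(Path kvFile, Path bigFile, Path outFile, int chunkSize) {
        try {
            Path current = Files.createTempFile("pass", ".csv");
            Files.copy(bigFile, current, StandardCopyOption.REPLACE_EXISTING);
            try (BufferedReader kvReader = Files.newBufferedReader(kvFile)) {
                Map<String, String> chunk = new HashMap<>();
                String line;
                while ((line = kvReader.readLine()) != null) {
                    String[] cols = line.split(";", -1);
                    if (cols.length >= 2) {
                        chunk.put(cols[0], cols[1]);
                    }
                    if (chunk.size() >= chunkSize) {
                        current = applyChunk(chunk, current);
                        chunk.clear();
                    }
                }
                if (!chunk.isEmpty()) {
                    current = applyChunk(chunk, current);
                }
            }
            Files.move(current, outFile, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams one pass over the input and writes the result to a new temp file.
    private static Path applyChunk(Map<String, String> chunk, Path input) throws IOException {
        Path output = Files.createTempFile("pass", ".csv");
        try (BufferedReader in = Files.newBufferedReader(input);
             BufferedWriter out = Files.newBufferedWriter(output)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split(";", -1);
                for (int i = 0; i < cols.length; i++) {
                    cols[i] = chunk.getOrDefault(cols[i], cols[i]);
                }
                out.write(String.join(";", cols));
                out.newLine();
            }
        }
        Files.delete(input); // the previous pass is no longer needed
        return output;
    }

    // Small end-to-end run on temporary files; returns the processed lines.
    static List<String> demo() {
        try {
            Path kv = Files.createTempFile("kv", ".csv");
            Path big = Files.createTempFile("big", ".csv");
            Path out = Files.createTempFile("out", ".csv");
            Files.write(kv, List.of("a;1", "b;2", "c;3"));
            Files.write(big, List.of("a;x;b", "c;c;y"));
            replaceInChunks(kv, big, out, 2); // two mappings per chunk, so two passes
            return Files.readAllLines(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

The trade-off is I/O for memory: the smaller the chunk, the more times the big file is rewritten.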
I am new to Apache Beam. As per our requirement, I need to pass a JSON file containing five to ten JSON records as input, read this JSON data from the file line by line, and store it in BigQuery. Can anyone please help me with my sample code below, which tries to read JSON data using Apache Beam:
PCollection<String> lines =
pipeline
.apply("ReadMyFile",
TextIO.read()
.from("C:\\Users\\Desktop\\test.json"));
if(null!=lines) {
PCollection<String> words =
lines.apply(ParDo.of(new DoFn<String, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
String line = c.element();
}
}));
pipeline.run();
}
Let's assume that we have JSON strings in the file as below:
{"col1":"sample-val-1", "col2":1.0}
{"col1":"sample-val-2", "col2":2.0}
{"col1":"sample-val-3", "col2":3.0}
{"col1":"sample-val-4", "col2":4.0}
{"col1":"sample-val-5", "col2":5.0}
In order to store these values from the file in BigQuery through Dataflow/Beam, you might follow the steps below:
Define a TableReference to refer the BigQuery table.
Define TableFieldSchema for every column you expect to store.
Read the file using TextIO.read().
Create a DoFn to parse Json string to TableRow format.
Commit the TableRow objects using BigQueryIO.
You may refer to the code snippets below for these steps.
For TableReference and TableFieldSchema creation:
TableReference tableRef = new TableReference();
tableRef.setProjectId("project-id");
tableRef.setDatasetId("dataset-name");
tableRef.setTableId("table-name");
List<TableFieldSchema> fieldDefs = new ArrayList<>();
fieldDefs.add(new TableFieldSchema().setName("column1").setType("STRING"));
fieldDefs.add(new TableFieldSchema().setName("column2").setType("FLOAT"));
For the Pipeline steps:
Pipeline pipeLine = Pipeline.create(options);
pipeLine
.apply("ReadMyFile",
TextIO.read().from("path-to-json-file"))
.apply("MapToTableRow", ParDo.of(new DoFn<String, TableRow>() {
@ProcessElement
public void processElement(ProcessContext c) {
Gson gson = new GsonBuilder().create();
HashMap<String, Object> parsedMap = gson.fromJson(c.element().toString(), HashMap.class);
TableRow row = new TableRow();
row.set("column1", parsedMap.get("col1").toString());
row.set("column2", Double.parseDouble(parsedMap.get("col2").toString()));
c.output(row);
}
}))
.apply("CommitToBQTable", BigQueryIO.writeTableRows()
.to(tableRef)
.withSchema(new TableSchema().setFields(fieldDefs))
.withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(WriteDisposition.WRITE_APPEND));
pipeLine.run();
The BigQuery table might look as below,
The answer is it depends.
TextIO reads files line by line, so in your test.json each line needs to contain a separate JSON object.
The ParDo you have will then receive those lines one by one, i.e. each call to @ProcessElement gets a single line.
Then in your ParDo you can use something like Jackson ObjectMapper to parse the JSON from the line (or any other JSON parser you're familiar with, but Jackson is widely used, including in a few places in Beam itself).
Overall the approach to writing a ParDo is this:
get the c.element();
do something to the value of c.element(), e.g. parse it from json into a java object;
send the result of what you did to c.element() to c.output();
I would recommend starting by looking at the Jackson extension to the Beam SDK; it adds PTransforms to do exactly that, see this and this.
Please also take a look at this post, it has some links.
There's also the JsonToRow transform, which you can look at for similar logic; the difference is that it doesn't parse the JSON into a user-defined Java object but into a Beam Row instead.
Before writing to BQ you need to convert the objects you parsed from JSON into BQ rows, which will be another ParDo after your parsing logic, and then apply BigQueryIO as yet another step. You can see a few examples in the BQ test.
I am going to make an application that compares two .csv lists using OpenCSV. It should work like this:
Open 2 .csv files (each file has columns: Name, Emails)
Save results (and here is the problem: I don't know if they should be saved to a table or something else)
Compare the values of the "Emails" column from List1 and List2.
If an email from List1 appears in List2, delete it (from List1)
Export results to a new .csv file
I don't know if it's a good algorithm. Please tell me which option for storing the results of reading a .csv file is best in this case.
Kind Regards
You can get around this more easily with univocity-parsers as it can read your data into columns:
CsvParserSettings parserSettings = new CsvParserSettings(); //parser config with many options, check the tutorial
parserSettings.setHeaderExtractionEnabled(true); // uses the first row as headers
// To get the values of all columns, use a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setRowProcessor(rowProcessor);
CsvParser parser = new CsvParser(parserSettings);
//This will parse everything and pass the data to the column processor
parser.parse(new FileReader(new File("/path/to/your/file.csv")));
//Finally, we can get the column values:
Map<String, List<String>> columnValues = rowProcessor.getColumnValuesAsMapOfNames();
Let's say you parsed the second CSV with that. Just grab the emails and create a set:
Set<String> emails = new HashSet<>(columnValues.get("Email"));
Now just iterate over the first CSV and check if the emails are in the emails set.
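That last step could look roughly like the sketch below. It is plain Java and independent of any CSV library; the class EmailFilter, the method removeKnownEmails and the sample rows are made up for illustration, and each row is assumed to be a String[] with the email at a known index.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class EmailFilter {
    // Keeps only the rows of list 1 whose email (at emailIndex) is absent from list 2.
    static List<String[]> removeKnownEmails(List<String[]> list1,
                                            Set<String> emailsInList2,
                                            int emailIndex) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : list1) {
            // HashSet lookups are constant time on average, so this is one linear pass.
            if (!emailsInList2.contains(row[emailIndex])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String[]> list1 = List.of(
                new String[] {"Ann", "ann@example.com"},
                new String[] {"Bob", "bob@example.com"});
        Set<String> emails = new HashSet<>(List.of("bob@example.com"));
        for (String[] row : removeKnownEmails(list1, emails, 1)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

The remaining rows can then be written out to the new .csv file.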
Disclosure: I am the author of this library. It's open-source and free (Apache V2.0 license).
If you have a hard requirement to use openCSV then here is what I believe is the easiest solution:
First off, I like Jeronimo's suggestion about the HashSet. Read the second CSV file first using CsvToBean and save the email addresses in the HashSet.
Then create a filter class that implements the CsvToBeanFilter interface. Pass the set in through the constructor, and in the allowLine method look up the email address and return true if it is not in the set (so you have a quick lookup).
Then pass the filter to CsvToBean.parse when reading/parsing the first file, and all you will get are the records from the first file whose email addresses are not in the second file. The CsvToBeanFilter javadoc has a good example that shows how this works.
Lastly, use BeanToCsv to create a file from the filtered list.
In the interest of fairness: I am the maintainer of the openCSV project, and it is also open source and free (Apache V2.0 license).
Let's say I have a class, Car, and I'm trying to import a large set of data to create multiple instances of "Car".
My CSV file is laid out like so:
Car Manufacturer,Model,Color,Owner,MPG,License Plate,Country of Origin,VIN,... etc
The point is, there is a lot of data that needs to be in the constructor. If there's only a few of these, it wouldn't be that bad to manually instantiate it by writing Car FordFocus = new Car(Ford,Focus,Blue,John Doe,108-J1AZ,USA,194241-12e1...), but if I have hundreds of these, is there any way to import all this data to make the classes?
As George mentions, you need a tool. I have used opencsv before to achieve this.
opencsv provides you three mapping strategies (which can be further extended) for mapping a CSV row to bean. The simplest is ColumnPositionMappingStrategy. So if your CSV format is fixed, e.g. the header row looks like:
Car Manufacturer,Model,Color,Owner,MPG,License Plate,Country of Origin,VIN,... etc
This code snippet will help you. I have also used HeaderColumnNameTranslateMappingStrategy which lets you map CSV header names to bean field names e.g. "Car Manufacturer" -> carManufacturer.
CSVReader csvReader = new CSVReader(new FileReader(csvFile));
ColumnPositionMappingStrategy<Car> strategy = new ColumnPositionMappingStrategy<Car>();
strategy.setType(Car.class);
String[] columns = new String[] {"CarManufacturer","Model","Color","Owner","MPG","LicensePlate","CountryOfOrigin","VIN"}; // the fields to bind to in your JavaBean
strategy.setColumnMapping(columns);
CsvToBean<Car> csv = new CsvToBean<Car>();
List<Car> list = csv.parse(strategy, csvReader);
A self contained sample program can be found here
Reflection is a possibility.
You can associate an attribute with a position in your CSV file (a column).
See for example of setting attribute with reflection : https://docs.oracle.com/javase/tutorial/reflect/member/fieldValues.html
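A minimal sketch of the reflection idea, assuming a bean with a no-argument constructor and String fields only. The names ReflectiveMapper and fromRow, and the trimmed-down Car with three fields, are hypothetical; the array of field names plays the role of the column-to-attribute association.

```java
import java.lang.reflect.Field;

class ReflectiveMapper {
    // Hypothetical bean, trimmed to three columns for illustration.
    static class Car {
        private String manufacturer;
        private String model;
        private String color;

        @Override
        public String toString() {
            return manufacturer + " " + model + " (" + color + ")";
        }
    }

    // fieldNames[i] names the attribute that receives the value of column i.
    // Assumption: every mapped field is a String; other types would need conversion.
    static <T> T fromRow(Class<T> type, String[] fieldNames, String[] values) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (int i = 0; i < fieldNames.length && i < values.length; i++) {
                Field field = type.getDeclaredField(fieldNames[i]);
                field.setAccessible(true); // allow writing private fields
                field.set(bean, values[i]);
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot map row to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        String[] fields = {"manufacturer", "model", "color"};
        Car car = fromRow(Car.class, fields, "Ford,Focus,Blue".split(","));
        System.out.println(car);
    }
}
```

This avoids a long constructor call, at the cost of losing compile-time checks on field names.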
You can read the csv file line by line and can create the Car object by constructor in loop.
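As a sketch of that loop, assuming a simplified, hypothetical Car with only three of the columns and values that contain no quoted commas (CarLoader and load are illustrative names, not part of any library):

```java
import java.util.ArrayList;
import java.util.List;

class CarLoader {
    // Hypothetical Car with only three of the columns, for brevity.
    static class Car {
        final String manufacturer;
        final String model;
        final String color;

        Car(String manufacturer, String model, String color) {
            this.manufacturer = manufacturer;
            this.model = model;
            this.color = color;
        }
    }

    // Builds one Car per data line; the first line is treated as the header.
    static List<Car> load(List<String> lines) {
        List<Car> cars = new ArrayList<>();
        for (int i = 1; i < lines.size(); i++) { // start at 1 to skip the header
            String[] cols = lines.get(i).split(",", -1);
            if (cols.length < 3) {
                continue; // skip rows that do not have enough columns
            }
            cars.add(new Car(cols[0].trim(), cols[1].trim(), cols[2].trim()));
        }
        return cars;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Car Manufacturer,Model,Color",
                "Ford,Focus,Blue",
                "Fiat,Panda,Red");
        for (Car car : load(lines)) {
            System.out.println(car.manufacturer + " " + car.model + " " + car.color);
        }
    }
}
```

In a real program the lines would come from a reader over the file rather than from an in-memory list.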
For a project I need to deal with CSV files where I do not know the columns before runtime. The CSV files are perfectly valid; I only need to perform a simple task on several different files over and over again. I do need to analyse the values of the columns, which is why I need to use a library for working with CSV files. For simplicity, let's assume that I need to do something simple like appending a date column to all files, regardless of how many columns they have. I want to do that with Super CSV, because I use the library for other tasks as well.
What I am struggling with is more of a conceptual issue. I am not sure how to deal with the files if I do not know in advance how many columns there are. I am not sure how I should define POJOs that map arbitrary CSV files, or how I should define the cell processors if I do not know which and how many columns will be in the file. How can I dynamically create cell processors that match the number of columns? How would I define POJOs, for instance, based on the header of the CSV file?
Consider the case where I have two CSV files: product.csv and address.csv. Let's assume I want to append a date column with today's date to both files, without having to write two different methods (e.g. addDateColumnToProduct() and addDateColumnToAddress()) that do the same thing.
product.csv:
name, description, price
"Apple", "red apple from Italy","2.5€"
"Orange", "orange from Spain","3€"
address.csv:
firstname, lastname
"John", "Doe"
"Coole", "Piet"
Based on the header information of the CSV files, how could I define a POJO that maps the product CSV? The same question for Cell Processors? How could I define even a very simple cell processor that just basically has the right amount of parameters for the constructor, e.g. for the product.csv
CellProcessor[] processor = new CellProcessor[] {
null,
null,
null
};
and for the address.csv:
CellProcessor[] processor = new CellProcessor[] {
null,
null
};
Is this even possible? Am I on the wrong track to achieve this?
Edit 1:
I am not looking for a solution that can deal with a variable number of columns within one file. I am trying to figure out whether it is possible to deal with arbitrary CSV files at runtime, i.e. whether I can create POJOs based only on the header information contained in the CSV file, without knowing in advance how many columns a CSV file will have.
Solution
Based on the answer and comments from @baba
private static void readWithCsvListReader() throws Exception {
ICsvListReader listReader = null;
try {
listReader = new CsvListReader(new FileReader(fileName), CsvPreference.TAB_PREFERENCE);
listReader.getHeader(true); // skip the header (can't be used with CsvListReader)
int amountOfColumns=listReader.length();
CellProcessor[] processor = new CellProcessor[amountOfColumns];
List<Object> customerList;
while( (customerList = listReader.read(processor)) != null ) {
System.out.println(String.format("lineNo=%s, rowNo=%s, customerList=%s", listReader.getLineNumber(),
listReader.getRowNumber(), customerList));
}
}
finally {
if( listReader != null ) {
listReader.close();
}
}
}
Maybe a little bit late, but this could be helpful...
// one Optional processor per column, so empty cells are accepted
CellProcessor[] processors = new CellProcessor[properties.size()];
for (int i = 0; i < properties.size(); i++) {
    processors[i] = new Optional();
}
return processors;
This is a very common issue and there are multiple tutorials on the internet, including the Super CSV page:
http://supercsv.sourceforge.net/examples_reading_variable_cols.html
As this line says:
As shown below you can execute the cell processors after calling
read() by calling the executeProcessors() method. Because it's done
after reading the line of CSV, you have an opportunity to check how
many columns there are (using listReader.length()) and supplying the
correct number of processors.