MultiResourceItemReader - Skip entire file if header is invalid - java

My Spring Batch job reads a list of csv files containing two types of headers. I want the reader to skip the entire file if its header does not match one of the two possible header types.
I've taken a look at "Spring Boot batch - MultiResourceItemReader : move to next file on error", but I don't see how to validate the header tokens to ensure they match in both count and content.

I was able to figure this out by doing the following:
public FlatFileItemReader<RawFile> reader() {
    return new FlatFileItemReaderBuilder<RawFile>()
            .skippedLinesCallback(line -> {
                // Verify the file header is what we expect
                if (!StringUtils.equals(line, header)) {
                    throw new IllegalArgumentException(String.format("Bad header: %s", line));
                }
            })
            .name("myReader")
            .linesToSkip(1)
            .lineMapper(new DefaultLineMapper<RawFile>() {
                {
                    setLineTokenizer(lineTokenizer);
                    setFieldSetMapper(fieldSetMapper);
                }
            })
            .build();
}
I call the reader() method when setting the delegate in my MultiResourceItemReader.
Note that header, lineTokenizer, and fieldSetMapper are all variables that I set depending on which type of file (and hence which set of headers) my job is expected to read.
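For reference, a minimal sketch of how the reader() above can be plugged in as the delegate; the resources array (the list of CSV files) is an assumption for illustration:
@Bean
public MultiResourceItemReader<RawFile> multiResourceItemReader() {
    return new MultiResourceItemReaderBuilder<RawFile>()
            .name("multiResourceItemReader")
            .resources(resources) // Resource[] of the CSV files, injected or built elsewhere
            .delegate(reader())   // the FlatFileItemReader shown above
            .build();
}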

Can we do this in XML-based configuration?

Related

Transform a large jsonl file with unknown json properties into csv using apache beam google dataflow and java

How to convert a large JSONL file with unknown JSON properties into CSV using Apache Beam, Google Dataflow, and Java.
Here is my scenario:
A large JSONL file is in Google Cloud Storage.
The JSON properties are unknown, so a Beam Schema cannot be defined in the pipeline.
Use Apache Beam, Google Dataflow, and Java to convert the JSONL to CSV.
Once the transformation is done, store the CSV in Google Cloud Storage (the same bucket where the JSONL is stored).
Notify by some means, like transformation_done=true if possible (REST API or event).
Any help or guidance would be appreciated, as I am new to Apache Beam, though I am reading the Apache Beam documentation.
I have edited the question with example JSONL data:
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
{"Name":"May", "Session":"2012B", "Score":"14", "Completed":"false"}
{"Name":"Deloise", "Session":"2012A", "Score":"19", "Completed":"true"}
The JSON keys are present in the input file, but they are not known at transform time.
I'll explain with an example: suppose I have three clients and each has its own Google Cloud Storage bucket, so each uploads its own JSONL file with different JSON properties.
Client 1: Input Jsonl File
{"city":"Mumbai", "pincode":"2012A"}
{"city":"Delhi", "pincode":"2012N"}
Client 2: Input Jsonl File
{"Relation":"Finance", "Code":"2012A"}
{"Relation":"Production", "Code":"20XXX"}
Client 3: Input Jsonl File
{"Name":"Gilbert", "Session":"2013", "Score":"24", "Completed":"true"}
{"Name":"Alexa", "Session":"2013", "Score":"29", "Completed":"true"}
Question: How could I write a generic Beam pipeline which transforms all three as shown below?
Client 1: Output CSV file
["city", "pincode"]
["Mumbai","2012A"]
["Delhi", "2012N"]
Client 2: Output CSV file
["Relation", "Code"]
["Finance", "2012A"]
["Production","20XXX"]
Client 3: Output CSV file
["Name", "Session", "Score", "true"]
["Gilbert", "2013", "24", "true"]
["Alexa", "2013", "29", "true"]
Edit: Removed the previous answer as the question has been modified with examples.
There is no generic, out-of-the-box way to achieve such a result. You have to write the logic yourself depending on your requirements and how you are handling the pipeline.
Below are some examples, but you need to verify them for your case, as I have only tried them on a small JSONL file.
TextIO
Approach 1
If you can collect the header value of the output CSV beforehand, it will be much easier. But getting the header beforehand is itself another challenge.
//pipeline
pipeline.apply("ReadJSONLines",
        TextIO.read().from("FILE URL"))
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processLines(@Element String line, OutputReceiver<String> receiver) {
                String values = getCsvLine(line, false);
                receiver.output(values);
            }
        }))
        .apply("WriteCSV",
                TextIO.write().to("FileName")
                        .withSuffix(".csv")
                        .withoutSharding()
                        .withDelimiter(new char[] { '\r', '\n' })
                        .withHeader(getHeader()));

private static String getHeader() {
    String header = "";
    // your logic to get the header line.
    return header;
}
Probable ways to get the header line (only assumptions, they may not work in your case):
You can have a text file in GCS which stores the header of a particular JSON file. In your logic you can fetch the header by reading that file; check this SO thread about how to read files from GCS.
You can try to pass the header as a runtime argument, but that depends on how you are configuring and executing your pipeline (see the sketch below).
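A minimal sketch of the runtime-argument idea; the option name csvHeader is purely illustrative:
// hypothetical options interface; pass the header e.g. with --csvHeader="Name,Session,Score,Completed"
public interface CsvOptions extends PipelineOptions {
    @Description("Comma-separated header line for the output CSV")
    String getCsvHeader();

    void setCsvHeader(String value);
}

// in the pipeline setup:
CsvOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(CsvOptions.class);
Pipeline pipeline = Pipeline.create(options);
// ... later, instead of getHeader():
// TextIO.write().to("FileName").withSuffix(".csv").withHeader(options.getCsvHeader());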
Approach 2
This is a workaround I found for small JSON files (~10k lines). The example below may not work for large files.
final int[] count = { 0 };
pipeline.apply(//read file)
        .apply(ParDo.of(new DoFn<String, String>() {
            @ProcessElement
            public void processLines(@Element String line, OutputReceiver<String> receiver) {
                // check if this is the first element processed; if yes, emit the header first
                if (count[0] == 0) {
                    String header = getCsvLine(line, true);
                    receiver.output(header);
                    count[0]++;
                }
                String values = getCsvLine(line, false);
                receiver.output(values);
            }
        }))
        .apply(//write file)
FileIO
As mentioned by Saransh in the comments, with FileIO all you have to do is read the JSONL line by line manually and then convert each line into comma-separated format. For example:
pipeline.apply(FileIO.match().filepattern("FILE PATH"))
        .apply(FileIO.readMatches())
        .apply(FlatMapElements
                .into(TypeDescriptors.strings())
                .via((FileIO.ReadableFile f) -> {
                    List<String> output = new ArrayList<>();
                    try (BufferedReader br = new BufferedReader(Channels.newReader(f.open(), "UTF-8"))) {
                        String line = br.readLine();
                        while (line != null) {
                            if (output.size() == 0) {
                                String header = getCsvLine(line, true);
                                output.add(header);
                            }
                            String result = getCsvLine(line, false);
                            output.add(result);
                            line = br.readLine();
                        }
                    } catch (IOException e) {
                        throw new RuntimeException("Error while reading", e);
                    }
                    return output;
                }))
        .apply(//write to gcs)
In the above examples I have used a getCsvLine method (created for code reusability) which takes a single line from the file and converts it into comma-separated format. To parse the JSON object I have used GSON.
/**
 * @param line     a single JSONL line
 * @param isHeader true: returns output combining the JSON keys; false:
 *                 returns output combining the JSON values
 **/
public static String getCsvLine(String line, boolean isHeader) {
    List<String> values = new ArrayList<>();
    // convert the line into a JsonObject
    JsonObject jsonObject = JsonParser.parseString(line).getAsJsonObject();
    // iterate over the JSON object and collect all keys or values
    for (Map.Entry<String, JsonElement> entry : jsonObject.entrySet()) {
        if (isHeader)
            values.add(entry.getKey());
        else
            values.add(entry.getValue().getAsString());
    }
    String result = String.join(",", values);
    return result;
}

SuperCSV skips first line while reading CSV file

I'm using the SuperCSV API to read CSV files and validate their entries.
For some reason, it seems that every read skips the first row.
I do not have headers on my CSVs, and I do need the first row.
I have tried using CsvMapReader and CsvListReader, but every execution starts printing only from line 2.
Any help would be appreciated.
Thanks
Here is a snippet of the code I use to read the files:
listReader = new CsvListReader(new FileReader(CSV_FILENAME), CsvPreference.STANDARD_PREFERENCE);
listReader.getHeader(true);
final CellProcessor[] processors = getProcessors();
List<Object> customerList;
while ((customerList = listReader.read(processors)) != null) {
    System.out.println(String.format("lineNo=%s, rowNo=%s, customerList=%s",
            listReader.getLineNumber(), listReader.getRowNumber(), customerList));
}
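For reference, getHeader(true) reads (and consumes) the first line of the file as a header, which is why printing only starts from line 2. A minimal sketch without that call, assuming the file really has no header row:
ICsvListReader listReader = new CsvListReader(new FileReader(CSV_FILENAME), CsvPreference.STANDARD_PREFERENCE);
// no getHeader(...) call, so the first row is returned by read() like any other row
final CellProcessor[] processors = getProcessors();
List<Object> customerList;
while ((customerList = listReader.read(processors)) != null) {
    System.out.println(String.format("lineNo=%s, rowNo=%s, customerList=%s",
            listReader.getLineNumber(), listReader.getRowNumber(), customerList));
}
listReader.close();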

Spring Batch creating multiple files .Gradle based project

I need to create 3 separate files.
My Batch job should read from Mongo, then parse the information and find the "business" column (3 types of business: RETAIL, HPP, SAX), then create a file for each respective business. The file name should be RETAIL + formattedDate, HPP + formattedDate, or SAX + formattedDate, and the information found in the DB should go inside a .txt file. Also, I need to change .resource(new FileSystemResource("C:\filewriter\index.txt")) into something that will send the information to the right location; right now hard-coding works but only creates one .txt file.
example:
@Bean
public FlatFileItemWriter<PaymentAudit> writer() {
    LOG.debug("Mongo-writer");
    FlatFileItemWriter<PaymentAudit> flatFile = new FlatFileItemWriterBuilder<PaymentAudit>()
            .name("flatFileItemWriter")
            .resource(new FileSystemResource("C:\\filewriter\\index.txt"))
            // trying to create a path instead of hard coding it
            .lineAggregator(createPaymentPortalLineAggregator())
            .build();
    String exportFileHeader = "CREATE_DTTM";
    StringHeaderWriter headerWriter = new StringHeaderWriter(exportFileHeader);
    flatFile.setHeaderCallback(headerWriter);
    return flatFile;
}
My idea would be something like the following, but I'm not sure where to go from here:
public Map<String, List<PaymentAudit>> getPaymentPortalRecords() {
    List<PaymentAudit> recentlyCreated =
            PaymentPortalRepository.findByCreateDttmBetween(yesterdayMidnight, yesterdayEndOfDay);
    List<PaymentAudit> retailList = new ArrayList<>();
    List<PaymentAudit> saxList = new ArrayList<>();
    List<PaymentAudit> hppList = new ArrayList<>();
    // String exportFilePath = "C://filewriter/";??????
    recentlyCreated.parallelStream().forEach(paymentAudit -> {
        if (paymentAudit.getBusiness().equalsIgnoreCase(RETAIL)) {
            retailList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(SAX)) {
            saxList.add(paymentAudit);
        } else if (paymentAudit.getBusiness().equalsIgnoreCase(HPP)) {
            hppList.add(paymentAudit);
        }
    });
To create a file for each business object type, you can use the ClassifierCompositeItemWriter. In your case, you can create a writer for each type and add them as delegates in the composite item writer.
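A minimal sketch of that approach, assuming retailWriter(), hppWriter() and saxWriter() are FlatFileItemWriter beans built like the writer() above, each pointing at its own file:
@Bean
public ClassifierCompositeItemWriter<PaymentAudit> classifierWriter() {
    ClassifierCompositeItemWriter<PaymentAudit> writer = new ClassifierCompositeItemWriter<>();
    // route each PaymentAudit to the delegate writer for its business type
    writer.setClassifier(paymentAudit -> {
        switch (paymentAudit.getBusiness().toUpperCase()) {
            case "RETAIL": return retailWriter();
            case "HPP":    return hppWriter();
            default:       return saxWriter();
        }
    });
    return writer;
}
Note that the composite does not open its delegates itself, so each delegate writer also needs to be registered as a stream on the step.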
As for creating the filename dynamically, you need to use a step scoped writer. There is an example in the Step Scope section of the reference documentation.
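A minimal sketch of a step scoped writer whose file name is built at runtime; the business and runDate job parameters are assumptions for illustration:
@Bean
@StepScope
public FlatFileItemWriter<PaymentAudit> businessWriter(
        @Value("#{jobParameters['business']}") String business,
        @Value("#{jobParameters['runDate']}") String formattedDate) {
    return new FlatFileItemWriterBuilder<PaymentAudit>()
            .name("businessWriter")
            // e.g. C:\filewriter\RETAIL20230101.txt
            .resource(new FileSystemResource("C:\\filewriter\\" + business + formattedDate + ".txt"))
            .headerCallback(new StringHeaderWriter("CREATE_DTTM"))
            .lineAggregator(createPaymentPortalLineAggregator())
            .build();
}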
Hope this helps.

Spring batch filtering data inside item reader

I'm writing a batch job that reads log files which can come in many types (formats of log file). I want to read every file based on some characters inside the log lines, for example:
15:31:44,437 INFO <NioProcessor-32> Send to <SLE-
15:31:44,437 INFO <NioProcessor-32> [{2704=5, 604=1, {0=023pdu88mW00007z}]
15:31:44,437 DEBUG <NioProcessor-32> SCRecord 2944
In such a log file I want to read only the log lines which contain ' [{}] ' and ignore all the others. I have tried to read them in the item reader and split each line into an object, but I can't figure out how. I think I should create a custom item reader or something like that; my LogLine class is quite simple:
public class LogLine {
    String idOrder;
    String time;
    String tags;
}
and my item reader looks like:
public FlatFileItemReader<LogLine> customerItemReader() {
    FlatFileItemReader<LogLine> reader = new FlatFileItemReader<>();
    reader.setResource(new ClassPathResource("/data/customer.log"));
    DefaultLineMapper<LogLine> customerLineMapper = new DefaultLineMapper<>();
    DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
    tokenizer.setNames(new String[] {"idOrder", "date", "tags"});
    customerLineMapper.setLineTokenizer(tokenizer);
    customerLineMapper.setFieldSetMapper(new CustomerFieldSetMapper());
    reader.setLineMapper(customerLineMapper);
    return reader;
}
How can I add a filter in this item reader so that it reads only lines which contain '[{', without doing the job in the item processor?
Filtering should be the responsibility of the processor, not the reader. You can use a composite item processor and add the first processor as a filtering step.
The filtering processor should return null for log lines which do not contain ' [{}] '.
These rows will then be automatically ignored by the next processor and by the writer.
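A minimal sketch of such a filtering processor, assuming direct access to the tags field (or a getter); the class name is illustrative:
public class TagLineFilterProcessor implements ItemProcessor<LogLine, LogLine> {

    @Override
    public LogLine process(LogLine item) {
        // returning null filters the item out: it never reaches the next processor or the writer
        if (item.tags == null || !item.tags.contains("[{")) {
            return null;
        }
        return item;
    }
}
It can then be registered as the first delegate of a CompositeItemProcessor, followed by whatever mapping processor you already have.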
Alternatively, you can implement a custom file reader extending FlatFileItemReader, with the partition number or filter criteria passed in the constructor from the configuration, and override the read() method -> https://github.com/spring-projects/spring-batch/blob/main/spring-batch-infrastructure/src/main/java/org/springframework/batch/item/support/AbstractItemCountingItemStreamItemReader.java#L90
Every slave step would then be instantiated with a different constructor parameter.
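A rough sketch of that idea, assuming read() may be overridden and that the marker string (e.g. "[{") is supplied from the configuration:
public class FilteringLogLineReader extends FlatFileItemReader<LogLine> {

    private final String marker;

    public FilteringLogLineReader(String marker) {
        this.marker = marker;
    }

    @Override
    public LogLine read() throws Exception {
        LogLine item = super.read();
        // keep reading until a line containing the marker (or the end of input) is found
        while (item != null && (item.tags == null || !item.tags.contains(marker))) {
            item = super.read();
        }
        return item;
    }
}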

Spring Integration : Transformer : file to Object

I am new to Spring Integration and I am trying to read a file and transform it into a custom object which has to be sent to a JMS queue wrapped in a jms.Message.
It all has to be done using annotations.
I am reading the files from the directory using the code below.
@Bean
@InboundChannelAdapter(value = "filesChannel", poller = @Poller(fixedRate = "5000", maxMessagesPerPoll = "1"))
public MessageSource<File> fileReadingMessageSource() {
    FileReadingMessageSource source = new FileReadingMessageSource();
    source.setDirectory(new File(INBOUND_PATH));
    source.setAutoCreateDirectory(false);
    /*source.setFilter(new AcceptOnceFileListFilter());*/
    source.setFilter(new CompositeFileListFilter<File>(getFileFilters()));
    return source;
}
The next step is transforming the file content into an Invoice object (assume).
I want to know what the incoming message type for my transformer would be and how I should transform it. Could you please help here? I am not sure what the incoming data type would be and what the transformed object type should be (should it be wrapped inside a Message?).
@Transformer(inputChannel = "filesChannel", outputChannel = "jmsOutBoundChannel")
public ? convertFiletoInvoice(? fileMessage){
}
The payload is a File (java.io.File).
You can read the file and output whatever you want (String, byte[], Invoice etc).
Or you could use some of the standard transformers (e.g. FileToStringTransformer, JsonToObjectTransformer etc).
The JMS adapter will convert the object to TextMessage, ObjectMessage etc.
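For example, a minimal sketch of the transformer, assuming the file contains JSON and that an Invoice class plus Jackson's ObjectMapper are available (the method name is illustrative):
@Transformer(inputChannel = "filesChannel", outputChannel = "jmsOutBoundChannel")
public Invoice convertFileToInvoice(File fileMessage) throws IOException {
    // the payload arriving from the file inbound channel adapter is a java.io.File
    return new ObjectMapper().readValue(fileMessage, Invoice.class);
}
The framework extracts the File payload from the incoming Message for you, and the returned Invoice is wrapped into a new Message sent to jmsOutBoundChannel.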
