How to process larger files from JSON To CSV using Spring batch

How to process larger files from JSON To CSV using Spring batch - java

I am trying to implement a batch job for the following use-case. (New to spring batch)
Use-case
From one source system every day I will get 200+ compressed(.gz) files. Each(.gz) file will gives a 1GB of file on unzip. Which means 200GB of files in my input directory. Here the content type is JSON.
Sample Format of JSON File
{"name":"abc1","age":20}
{"name":"abc2","age":20}
{"name":"abc3","age":20}
.....
I need to process these files from JSON TO CSV to output directory. And the these csv generation should be similar like size based rolling in Log4J. After writing I need to remove the each file from input directory.
Question 1
Does the spring batch can handle this huge data? Because for single day I am getting nearly 200GB?
Question 2
I am thinking Spring batch can handle.So Implemented a code with partitioner using spring batch . But while reading I am seeing some dirty lines with out any end of line.
Faulty lines structure
{"name":"abc1","age":20,....}
{"name":"abc2","age":20......}
{"name":"abc3","age":20
{"name":"abc1","age":20,....}
{"name":"abc1","age":20,....}
.....
For this I have written a skip policy but its not working as expected. Its skipping all line from the error line on-wards instead one line. How to skip only that error line?
I am sharing my sample snippet below please give some suggestions or corrections on my code and to above questions and issues.
JobConfig.java
#Bean
public Job myJob() throws Exception {
return joubBuilderFactory.get(COnstants.JOB.JOB_NAME)
.incrementer(new RunIdIncrementer())
.listener(jobCompleteListener())
.start(masterStep())
.build();
//master
#Bean
public Step masterStep() throws Exception{
return stepBuilderFactory.get("step")
.listener(new UnzipListener())
.partitioner(slaveStep())
.partitioner("P",partitioner())
.gridSize(10).
taskExecutor(executor())
.build();
}
//slaveStep
#Bean
public Step slaveStep() throws Exception{
return stepBuilderFactory.get("slavestep")
.reader(reader(null))
.writer(customWriter)
.faultTolerant()
.skipPolicy(fileVerificationSkipper())
.build();
}
#Bean
public SkipPolicy fileVerificatoinSkipper(){
return new FileVerficationSkipper();
}
#Bean
#StepScop
public Partitioner partitioner() throws Exception{
MutliResourcePartitioner part = new MultiResourcePartitioner();
PathMatching ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
Resource[] res = resolver.getResource("...path of files...");
part.setResoruces(res);
part.partition(20);
return part;
}
Skip Policy Code
public class LineVerificationSkipper implements SkipPolicy {
#Override
public boolean shouldSkip(Throwable exception, int skipCount) throws SkipLimitExceededException {
if (exception instanceof FileNotFoundException) {
return false;
} else if (exception instanceof FlatFileParseException && skipCount <= 5) {
FlatFileParseException ffpe = (FlatFileParseException) exception;
StringBuilder errorMessage = new StringBuilder();
errorMessage.append("An error occured while processing the " + ffpe.getLineNumber()
+ " line of the file. Below was the faulty " + "input.\n");
errorMessage.append(ffpe.getInput() + "\n");
System.err.println(errorMessage.toString());
return true;
} else {
return false;
}
}
Question 3
How to delete the input source files are processing each file?. Because I am not getting any info like file path or name in ItemWriter.?

Related

Spring batch doesn't seem to be closing item writers properly

I have a job that writes each item in one separated file. In order to do this, the job uses a ClassifierCompositeItemWriter whose the ClassifierCompositeItemWriter returns a new FlatFileItemWriter for each item (code bellow).
#Bean
#StepScope
public ClassifierCompositeItemWriter<ProcessorResult> writer(#Value("#{jobParameters['outputPath']}") String outputPath) {
ClassifierCompositeItemWriter<MyItem> compositeItemWriter = new ClassifierCompositeItemWriter<>();
compositeItemWriter.setClassifier((item) -> {
String filePath = outputPath + "/" + item.getFileName();
BeanWrapperFieldExtractor<MyItem> fieldExtractor = new BeanWrapperFieldExtractor<>();
fieldExtractor.setNames(new String[]{"content"});
DelimitedLineAggregator<MyItem> lineAggregator = new DelimitedLineAggregator<>();
lineAggregator.setFieldExtractor(fieldExtractor);
FlatFileItemWriter<MyItem> itemWriter = new FlatFileItemWriter<>();
itemWriter.setResource(new FileSystemResource(filePath));
itemWriter.setLineAggregator(lineAggregator);
itemWriter.setShouldDeleteIfEmpty(true);
itemWriter.setShouldDeleteIfExists(true);
itemWriter.open(new ExecutionContext());
return itemWriter;
});
return compositeItemWriter;
}
Here's how the job is configured:
#Bean
public Step step1() {
return stepBuilderFactory
.get("step1")
.<String, MyItem>chunk(1)
.reader(reader(null))
.processor(processor(null, null, null))
.writer(writer(null))
.build();
}
#Bean
public Job job() {
return jobBuilderFactory
.get("job")
.incrementer(new RunIdIncrementer())
.flow(step1())
.end()
.build();
}
Everything works perfectly. All the files are generated as I expected. However, one of the file cannot be deleted. Just one. If I try to delete it, I get a message saying that "OpenJDK Platform binary" is using it. If I increase the chunk to a size bigger that the amount of files I'm generating, none of the files can be deleted. Seems like there's an issue to delete the files generated in the last chunk, like if the respective writer is not being closed properly by the Spring Batch lifecycle or something.
If I kill the application process, I can delete the file.
Any I idea why this could be happening? Thanks in advance!
PS: I'm calling this "itemWriter.open(new ExecutionContext());" because if I don't, I get a "org.springframework.batch.item.WriterNotOpenException: Writer must be open before it can be written to".
EDIT:
If someone is facing a similar problem, I suggest reading the Mahmoud's answer to this question Spring batch : ClassifierCompositeItemWriter footer not getting called .

Probably you are using the itemwriter outside of the step scope when doing this:
itemWriter.open(new ExecutionContext());
Please check this question, hope that this helps you.

Hive UDF in Java fails when creating a table

What is the difference between those two queries:
SELECT my_fun(col_name) FROM my_table;
and
CREATE TABLE new_table AS SELECT my_fun(col_name) FROM my_table;
Where my_fun is a java UDF.
I'm asking, because when I create new table (second query) I receive a java error.
Failure while running task:java.lang.RuntimeException: java.lang.RuntimeException: Map operator initialization failed
...
Caused by: org.apache.hadoop.hive.ql.exec.UDFArgumentException: Unable to instantiate UDF implementation class com.company_name.examples.ExampleUDF: java.lang.NullPointerException
I found that the source of error is line in my java file:
encoded = Files.readAllBytes(Paths.get(configPath));
But the question is why it works when table is not created and fails if table is created?

The problem might be with the way you read the file. Try to pass the file path as the second argument in the UDF, then read as follows
private BufferedReader getReaderFor(String filePath) throws HiveException {
try {
Path fullFilePath = FileSystems.getDefault().getPath(filePath);
Path fileName = fullFilePath.getFileName();
if (Files.exists(fileName)) {
return Files.newBufferedReader(fileName, Charset.defaultCharset());
}
else
if (Files.exists(fullFilePath)) {
return Files.newBufferedReader(fullFilePath, Charset.defaultCharset());
}
else {
throw new HiveException("Could not find \"" + fileName + "\" or \"" + fullFilePath + "\" in inersect_file() UDF.");
}
}
catch(IOException exception) {
throw new HiveException(exception);
}
}
private void loadFromFile(String filePath) throws HiveException {
set = new HashSet<String>();
try (BufferedReader reader = getReaderFor(filePath)) {
String line;
while((line = reader.readLine()) != null) {
set.add(line);
}
} catch (IOException e) {
throw new HiveException(e);
}
}
The full code for different generic UDF that utilizes file reader can be found here

I think there are several points unclear, so this answer is based on assumptions.
First of all, it is important to understand that hive currently optimize several simple queries and depending on the size of your data, the query that is working for you SELECT my_fun(col_name) FROM my_table; is most likely running locally from the client where you are executing the job, that is why you UDF can access your config file locally available, this "execution mode" is because the size of your data. CTAS trigger a job independent on the input data, this job runs distributed in the cluster where each worker fail accessing your config file.
It looks like you are trying to read your configuration file from the local file system, not from the HDSFS Files.readAllBytes(Paths.get(configPath)), this means that your configuration has to either be replicated in all the worker nodes or be added previously to the distributed cache (you can use add file from this, doc here. You can find another questions here about accessing files from the distributed cache from UDFs.
One additional problem is that you are passing the location of your config file through an environment variable which is not propagated to worker nodes as part of your hive job. You should pass this configuration as a hive config, there is an answer for accessing Hive Config from UDF here assuming that you are extending GenericUDF.

why do I receive an empty string when mapping Mono<Void> to Mono<String>?

I am developing an API REST using Spring WebFlux, but I have problems when uploading files. They are stored but I don't get the expected return value.
This is what I do:
Receive a Flux<Part>
Cast Part to FilePart.
Save parts with transferTo() (this return a Mono<Void>)
Map the Mono<Void> to Mono<String>, using file name.
Return Flux<String> to client.
I expect file name to be returned, but client gets an empty string.
Controller code
#PostMapping(value = "/muscles/{id}/image")
public Flux<String> updateImage(#PathVariable("id") String id, #RequestBody Flux<Part> file) {
log.info("REST request to update image to Muscle");
return storageService.saveFiles(file);
}
StorageService
public Flux<String> saveFiles(Flux<Part> parts) {
log.info("StorageService.saveFiles({})", parts);
return
parts
.filter(p -> p instanceof FilePart)
.cast(FilePart.class)
.flatMap(file -> saveFile(file));
}
private Mono<String> saveFile(FilePart filePart) {
log.info("StorageService.saveFile({})", filePart);
String filename = DigestUtils.sha256Hex(filePart.filename() + new Date());
Path target = rootLocation.resolve(filename);
try {
Files.deleteIfExists(target);
File file = Files.createFile(target).toFile();
return filePart.transferTo(file)
.map(r -> filename);
} catch (IOException e) {
throw new RuntimeException(e);
}
}

FilePart.transferTo() returns Mono<Void>, which signals when the operation is done - this means the reactive Publisher will only publish an onComplete/onError signal and will never publish a value before that.
This means that the map operation was never executed, because it's only given elements published by the source.
You can return the name of the file and still chain reactive operators, like this:
return part.transferTo(file).thenReturn(part.filename());
It is forbidden to use the block operator within a reactive pipeline and it even throws an exception at runtime as of Reactor 3.2.
Using subscribe as an alternative is not good either, because subscribe will decouple the transferring process from your request processing, making those happen in different execution sequences. This means that your server could be done processing the request and close the HTTP connection while the other part is still trying to read the file part to copy it on disk. This is likely to fail in subtle ways at runtime.

FilePart.transferTo() returns Mono<Void> that is a constant empty. Then, map after that was never executed. I solved it by doing this:
private Mono<String> saveFile(FilePart filePart) {
log.info("StorageService.saveFile({})", filePart);
String filename = DigestUtils.sha256Hex(filePart.filename() + new Date());
Path target = rootLocation.resolve(filename);
try {
Files.deleteIfExists(target);
File file = Files.createFile(target).toFile();
return filePart
.transferTo(file)
.doOnSuccess(data -> log.info("do something..."))
.thenReturn(filename);
} catch (IOException e) {
throw new RuntimeException(e);
}
}

Reading Data From FTP Server in Hadoop/Cascading

I want to read data from FTP Server.I am providing path of the file which resides on FTP server in the format ftp://Username:Password#host/path.
When I use map reduce program to read data from file it works fine. I want to read data from same file through Cascading framework. I am using Hfs tap of cascading framework to read data. It throws following exception
java.io.IOException: Stream closed
at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
at java.io.FilterInputStream.close(Unknown Source)
at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Below is the code of cascading framework from where I am reading the files:
public class FTPWithHadoopDemo {
public static void main(String args[]) {
Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd#xx.xx.xx.xx//input1");
Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
Pipe pipe = new Pipe("First");
pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
pipe = new GroupBy(pipe);
Pipe tailpipe = new Every(pipe, new Count());
FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
new HadoopFlowConnector().connect(flowDef).complete();
}
}
I tried to look in Hadoop Source code for the same exception. I found that in the MapTask class there is one method runOldMapper which deals with stream. And in the same method there is finally block where stream gets closed (in.close()). When I remove that line from finally block it works fine. Below is the code:
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
throws IOException, InterruptedException, ClassNotFoundException {
InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
updateJobWithSplit(job, inputSplit);
reporter.setInputSplit(inputSplit);
RecordReader<INKEY, INVALUE> in = isSkipping()
? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
: new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
job.setBoolean("mapred.skip.on", isSkipping());
int numReduceTasks = conf.getNumReduceTasks();
LOG.info("numReduceTasks: " + numReduceTasks);
MapOutputCollector collector = null;
if (numReduceTasks > 0) {
collector = new MapOutputBuffer(umbilical, job, reporter);
} else {
collector = new DirectMapOutputCollector(umbilical, job, reporter);
}
MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(),
job);
try {
runner.run(in, new OldOutputCollector(collector, conf), reporter);
collector.flush();
} finally {
// close
in.close(); // close input
collector.close();
}
}
please assist me in solving this problem.
Thanks,
Arshadali

After some efforts I found out that hadoop uses org.apache.hadoop.fs.ftp.FTPFileSystem Class for FTP.
This class doesn't supports seek, i.e. Seek to the given offset from the start of the file. Data is read in one block and then file system seeks to next block to read. Default block size is 4KB for FTPFileSystem. As seek is not supported it can only read data less than or equal to 4KB.

Reading and writing files using Java 7 nio

I have files which consist of json elements in an array.
(several file. each file has json array of elements)
I have a process that knows to take each json element as a line from file and process it.
So I created a small program that reads the JSON array, and then writes the elements to another file.
The output of this utility will be the input of the other process.
I used Java 7 NIO (and gson).
I tried to use as much Java 7 NIO as possible.
Is there any improvement I can do?
What about the filter? Which approach is better?
Thanks,
public class TransformJsonsUsers {
public TransformJsonsUsers() {
}
public static void main(String[] args) throws IOException {
final Gson gson = new Gson();
Path path = Paths.get("C:\\work\\data\\resources\\files");
final Path outputDirectory = Paths
.get("C:\\work\\data\\resources\\files\\output");
DirectoryStream.Filter<Path> filter = new DirectoryStream.Filter<Path>() {
#Override
public boolean accept(Path entry) throws IOException {
// which is better?
// BasicFileAttributeView attView = Files.getFileAttributeView(entry, BasicFileAttributeView.class);
// return attView.readAttributes().isRegularFile();
return !Files.isDirectory(entry);
}
};
DirectoryStream<Path> directoryStream = Files.newDirectoryStream(path, filter);
directoryStream.forEach(new Consumer<Path>() {
#Override
public void accept(Path filePath) {
String fileOutput = outputDirectory.toString() + File.separator + filePath.getFileName();
Path fileOutputPath = Paths.get(fileOutput);
try {
BufferedReader br = Files.newBufferedReader(filePath);
User[] users = gson.fromJson(br, User[].class);
BufferedWriter writer = Files.newBufferedWriter(fileOutputPath, Charset.defaultCharset());
for (User user : users) {
writer.append(gson.toJson(user));
writer.newLine();
}
writer.flush();
} catch (IOException e) {
throw new RuntimeException(filePath.toString(), e);
}
}
});
}
}

There is no point of using Filter if you want to read all the files from the directory. Filter is primarily designed to apply some filter criteria and read a subset of files. Both of them may not have any real difference in over all performance.
If you looking to improve performance, you can try couple different approaches.
Multi-threading
Depending on how many files exists in the directory and how powerful your CPU is, you can apply multi threading to process more than one file at a time
Queuing
Right now you are reading and writing to another file synchronously. You can queue content of the file using Queue and create asynchronous writer.
You can combine both of these approaches as well to improve performance further.

Don't put the I/O into the filter. That's not what it's for. You should get the complete list of files and then process it. For example if the I/O creates another file in the directory, the behaviour is undefined. You might miss a file, or see the new file in the accept() method.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

How to process larger files from JSON To CSV using Spring batch - java

Related

Spring batch doesn't seem to be closing item writers properly

Hive UDF in Java fails when creating a table

why do I receive an empty string when mapping Mono<Void> to Mono<String>?

Reading Data From FTP Server in Hadoop/Cascading

Reading and writing files using Java 7 nio

Categories

Resources