We have to process large CSV files. We use Apache Camel to read the files from an SFTP location (but we are open to Java-based solutions if there are better approaches).
One of the requirements is to resume processing from the point of failure. That is, if an exception happens while processing line 1000, we should start processing again from line 1000 rather than from the beginning. We should also not process any record twice.
We are using Apache ActiveMQ to store the records in queues and to manage the pipeline, but the initial loading of the file from the SFTP location can also fail.
To track the state, we are using a database that is updated at every step through Apache Camel.
We are open to ideas and suggestions. Thanks in advance.
As far as I know, the Camel file component cannot resume from the point of failure.
It depends on your configuration (see the moveFailed option) whether a failed file is moved away or reprocessed on the next attempt (but from the beginning).
To read a CSV file, you need to split it into individual lines. Because your files are big, you should use the streaming option of the Splitter. Otherwise the whole file is read into memory before splitting!
To decrease the probability of failures and of reprocessing the whole file, you can simply send every single CSV line to ActiveMQ (without parsing it). The simpler the splitter, the lower the probability that you need to reprocess the whole file because of a problem in a single record.
The decoupled consumer of the queue can then parse and process the CSV records without affecting the file import. This way, you can handle errors for each individual record.
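A minimal sketch of such a route pair in Camel's Java DSL, assuming the camel-ftp, ActiveMQ and camel-csv components are available; the endpoint URIs, the queue name and the csvRecordProcessor bean are placeholders:

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch of the two routes described above. Endpoint URIs, the queue name and
// the csvRecordProcessor bean are placeholders, not a definitive configuration.
public class CsvImportRoutes extends RouteBuilder {

    @Override
    public void configure() {
        // Import route: stream the file and push raw lines to a queue.
        from("sftp://user@host/inbound?move=.done&moveFailed=.error")
            .split(body().tokenize("\n")).streaming()    // never load the whole file into memory
                .to("activemq:queue:csv-lines")
            .end();

        // Decoupled consumer: parse and process each line individually.
        from("activemq:queue:csv-lines")
            .unmarshal().csv()
            .to("bean:csvRecordProcessor");              // hypothetical bean doing the real work
    }
}
```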
If you nevertheless have file import failures, the file is reprocessed from the beginning. Therefore you should design your processing pipeline to be idempotent. For example, check for an existing record and, if there already is one, update it instead of blindly inserting every record.
In a messaging environment you have to deal with at-least-once delivery semantics. The only solution is to have idempotent components. Even if Camel tried to resume at the point of failure, it could not guarantee that every record is read only once.
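For illustration, an upsert-style writer keeps the consumer idempotent under redelivery; this assumes a MySQL-style database and a hypothetical records table with a unique key on record_id:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Idempotent write sketch: re-delivered records update the existing row
// instead of creating a duplicate. Assumes a MySQL-style database and a
// hypothetical "records" table with a unique key on record_id.
public class IdempotentRecordWriter {

    private static final String UPSERT =
        "INSERT INTO records (record_id, payload) VALUES (?, ?) " +
        "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";

    public void write(Connection connection, String recordId, String payload) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(UPSERT)) {
            statement.setString(1, recordId);
            statement.setString(2, payload);
            statement.executeUpdate();   // insert or update, safe under at-least-once delivery
        }
    }
}
```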
I have a Dataflow pipeline that takes many files from a GCS bucket, extracts the records, applies some transformations, and finally outputs them into Parquet files. It continuously watches the bucket for new files, which makes this a streaming pipeline, though for now we have a termination condition that stops the pipeline once one minute has passed since the last new file. We are testing with a fixed set of files in the bucket.
I initially ran this pipeline in batch mode (no continuous file watching), and from querying the Parquet files in BigQuery there were about 36 million records. However, when I enabled continuous file watching and reran the pipeline, the Parquet files only contained ~760k records. I double-checked that in both runs the input bucket had the same set of files.
The metrics on the streaming job details page do not match up at all with what was output. Going by the section Elements added (Approximate), it says ~21 million records (which is wrong) were added to the input collection of the final Parquet writing step, even though the files contained ~760k records.
The same step on the batch job had the correct number (36 million) for Elements added (Approximate), and that matched the number of records in the output Parquet files.
I haven't seen anything unusual in the logs.
Why is Cloud Dataflow marking the streaming job as Succeeded even though a ton of records have been dropped while writing the output?
Why is there an inconsistency in the metrics reporting between batch and streaming jobs on Cloud Dataflow with the same input?
For both jobs I have set 3 workers with a machine type of n1-highmem-4. I pretty much reached my quota for the project.
I suspect this might be due to the way you have configured windows and triggers for your streaming pipeline. By default, Beam/Dataflow triggers data when the watermark passes the end of the window, and the default window configuration sets allowed lateness to zero, so any late data will be discarded by the pipeline. To change this behavior, you can try setting the allowed-lateness value or setting a different trigger. See here for more information.
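As an illustration with the Beam Java SDK, a window configuration along these lines keeps late elements instead of dropping them; the window size, lateness bound and trigger are placeholder choices, not values taken from your pipeline:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// Sketch: keep late records instead of silently dropping them. "records"
// stands in for whatever PCollection feeds the Parquet write step; the window
// size, lateness bound and trigger are placeholder choices.
public class LatenessConfig {

    static PCollection<String> keepLateData(PCollection<String> records) {
        return records.apply(
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withLateFirings(AfterPane.elementCountAtLeast(1)))   // fire again for late elements
                .withAllowedLateness(Duration.standardHours(1))           // tolerate data up to 1 hour late
                .discardingFiredPanes());
    }
}
```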
We use the Quartz scheduler in our application to scan a particular folder for any new files and, if there is a new file, kick off the associated workflow in the application to process it. For this, we have created a custom listener object associated with a job and a trigger that polls the file location every 5 minutes.
The requirement is to process only new files that arrive in that folder location while ignoring already processed files. Also, we don't want the folder location to grow enormously with a large number of files (otherwise it will slow down the folder scanning), so at the end of the workflow we actually delete the source file.
In order to implement it, we decided to maintain the list of processed files in the job metadata. So at each polling, we fetch the list of processed files from the job metadata, compare it against the current list of files, and if a file has not yet been processed, kick off the associated process flow.
The problem with the above approach is that over the years (and depending on the number of files received per day, which could be in the range of 100K per day), the job metadata that persists the list of processed files grows very large, and it started giving us data truncation errors (while persisting the job metadata in the Quartz table) and slowness.
To address this problem, we decided to refresh the list of processed files in the job metadata with the current snapshot of the folder. Since we delete each processed file from the folder location at the end of its workflow, the list of processed files stays within limits. But then we started having the problem of processing duplicate files when a file arrives with the same name the next day.
What is the best approach for implementing this requirement while ensuring that we don't process duplicate files that arrive with the same name? Should we consider persisting the processed-file list in an external database instead of the job metadata? I am looking for a recommended approach for implementing this solution. Thanks!
We had a similar request recently with our scheduler. If you are on Linux, why not use a solution such as inotify? Other systems may have other ways to monitor file system events.
Our solution was to trigger the file processing on each creation event and then remove the older files every x days (similar to Walen's DB suggestion). That way the list does not inflate too much, and duplicate files can be handled in their own specific case.
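If you need it to be portable across operating systems, the JVM's java.nio.file.WatchService exposes the same creation events; a minimal sketch (the folder path and the processing call are placeholders):

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

// Minimal sketch: react to file-creation events instead of polling and
// comparing against a stored list. The folder path and the processing call
// are placeholders.
public class FolderWatcher {

    public static void main(String[] args) throws IOException, InterruptedException {
        Path folder = Paths.get("/data/inbound");
        try (WatchService watchService = folder.getFileSystem().newWatchService()) {
            folder.register(watchService, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watchService.take();                  // blocks until an event arrives
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                        Path newFile = folder.resolve((Path) event.context());
                        System.out.println("Kick off workflow for " + newFile);
                        // process(newFile); then delete or archive it when done
                    }
                }
                key.reset();
            }
        }
    }
}
```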
(Sorry I do not have the rights to comment yet.)
I've recently started introducing myself to the Big Data world by experimenting with Apache Storm. I have faced the following problem, and have thought a lot about how to solve it, but all my approaches seem naïve.
Technologies
Apache Storm 0.9.3, Java 1.8.0_20
Context
There is a big XML file (~400MB) that has to be read line by line (xml-file-spout). Each line read from the file is then emitted and processed by a chain of bolts.
It has to be guaranteed message processing (emitting with anchoring...).
Problem
Since the file is pretty big (it contains about 20 billion lines), I read it with a scanner based on a buffered stream so as not to load the whole file into memory. So far so good. The problem emerges when there is an error somewhere in the middle of processing: the xml-file-spout itself dies, or there is some internal issue...
1. Nimbus will restart the spout, but the whole processing starts from the very beginning;
2. This approach does not scale at all.
Solution Thoughts
An initial idea for solving the first problem was to save the current state somewhere: a distributed cache, a JMS queue, or a local disk file. When a spout opens, it should find such a store, read the state, and proceed from the recorded file line. Here I also thought about storing the state in Storm's Zookeeper, but I don't know whether it is possible to access Zookeeper from the spout (is there such an ability)? Could you please suggest a best practice for this?
For problem 2, I thought about breaking the initial file into a set of subfiles and processing them in parallel. That could be done by introducing a new 'breaking' spout, where each subfile would be processed by a dedicated bolt. In this case a big problem arises with guaranteed processing, because in case of an error the subfile that contains the failed line has to be fully reprocessed (the ack/fail methods of the spout)... Could you please suggest a best practice for solving this problem as well?
Update
Ok, here is what I did so far.
Prerequisites
The following topology works because all its parts (spouts and bolts) are idempotent.
Introduced a separate spout that reads file lines (one by one) and sends them to an intermediate ActiveMQ queue ('file-line-queue') so that failed file lines can be replayed easily (see the next step);
Created a separate spout for the 'file-line-queue' queue that receives each file line and emits it to the subsequent bolts. Since I use guaranteed message processing, in case of any bolt's failure the message is reprocessed, and if the bolt chain succeeds the corresponding message is acknowledged (CLIENT_ACKNOWLEDGE mode).
In case of a failure of the first (file-reading) spout, a RuntimeException is thrown, which kills the spout. Later on, a dedicated supervisor restarts the spout, causing the input file to be re-read. This will produce duplicated messages, but since everything is idempotent, it is not a problem. Also, it is worth thinking here about a state repository to produce fewer duplicates...
New Issue
In order to make the intermediate JMS layer more reliable, I've added an exception listener that restores the connection and session for both the consumer and the producer. The problem is with the consumer: if the session is restored while a JMS message is still unacked in the middle of bolt processing, I need to ack it after successful processing, but since the session is new, I get the 'can't find correlation id' issue.
Could somebody please suggest how to deal with this?
To answer your questions first:
Yes, you can store state somewhere like Zookeeper and use a library like Apache Curator to handle that (see the sketch below).
Breaking the files up might help, but it still doesn't solve your problem of having to manage state.
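As a rough sketch of the Zookeeper option using Curator (the connect string and znode path are made up), the spout could persist the last successfully emitted line number and read it back on open():

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.nio.charset.StandardCharsets;

// Sketch: persist the last successfully emitted line number in Zookeeper so a
// restarted spout can resume near the failure point. The connect string and
// znode path are placeholders.
public class LineOffsetStore {

    private static final String OFFSET_PATH = "/xml-import/last-line";
    private final CuratorFramework client;

    public LineOffsetStore(String zkConnectString) {
        client = CuratorFrameworkFactory.newClient(zkConnectString, new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    public void saveOffset(long lineNumber) throws Exception {
        byte[] data = Long.toString(lineNumber).getBytes(StandardCharsets.UTF_8);
        if (client.checkExists().forPath(OFFSET_PATH) == null) {
            client.create().creatingParentsIfNeeded().forPath(OFFSET_PATH, data);
        } else {
            client.setData().forPath(OFFSET_PATH, data);
        }
    }

    public long loadOffset() throws Exception {
        if (client.checkExists().forPath(OFFSET_PATH) == null) {
            return 0L;    // nothing stored yet, start from the beginning
        }
        return Long.parseLong(new String(client.getData().forPath(OFFSET_PATH), StandardCharsets.UTF_8));
    }
}
```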
Let's talk a bit about design here. Storm is built for streaming, not for batch. It seems to me that a Hadoop-family technology, which works better for batch, would be a better fit here: MapReduce, Hive, Spark, etc.
If you are intent on using Storm, then it will help to stream the data somewhere that is easier to work with. You could write the file to Kafka or a queue to help with your problems of managing state, ack/fail, and retries.
One of the steps in our job involves running an external process (R in this case) to do some processing on large files in the file system. The external process then outputs files which are fed back into the Spring Batch system.
The external process can take several minutes for each task to complete. We would effectively launch the external process for every file to be processed, so there could easily be on the order of dozens or hundreds of these executions during the life of the overall job. We would like to scale this execution horizontally (and vertically).
Using Spring Batch, would either Remote Chunking or Remote Partitioning be a viable solution for this step? The system really just needs to say "For each of these input files, launch an R script to process it", so there really is not any item or chunk-oriented processing involved.
Remote Chunking/Partitioning has been proving difficult to implement sensibly for this without seeming like overkill. I have thought about instead making this task run "out of band". For example, in the Processor I would put each "external task" on a JMS queue, let something pull it off and process it, and wait for a response that it has finished. This seems like it would be a lot easier than using Remote Chunking/Partitioning.
Other alternative solutions besides Spring Batch are welcome too, but I would like to focus on integrating this solution with Spring Batch for now.
What you are describing is exactly what partitioning does. Even your "out of band" option still falls into what partitioning does.
Let's walk through what I would expect the job to look like.
Job and Master Step
The job, as you noted, is a single-step job. What I would envision is that the single step is a partitioned step. With a partitioned step, the two main pieces you need to configure are the Partitioner (the component that knows how to divide the work up) and the PartitionHandler (the component that knows how to send the work to the workers). For the Partitioner, I'd expect the MultiResourcePartitioner to work. This Partitioner implementation, provided by Spring Batch, creates one partition per file as defined by its configuration.
The PartitionHandler is where you choose whether you're going to execute the slaves locally (via the TaskExecutorPartitionHandler) or remotely (via the MessageChannelPartitionHandler). The PartitionHandler is also responsible for aggregating the results of the executing slaves into a single status so the step's result can be evaluated.
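A rough Java-config sketch of the master side, assuming Spring Batch's MultiResourcePartitioner; the file pattern and key name are placeholders:

```java
import org.springframework.batch.core.partition.support.MultiResourcePartitioner;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.Resource;
import org.springframework.core.io.support.PathMatchingResourcePatternResolver;

import java.io.IOException;

// Rough sketch of the master-side configuration: one partition per input file.
// The file pattern and key name are placeholders.
public class PartitionerConfig {

    @Bean
    public MultiResourcePartitioner partitioner() throws IOException {
        Resource[] inputFiles = new PathMatchingResourcePatternResolver()
            .getResources("file:/data/input/*.csv");
        MultiResourcePartitioner partitioner = new MultiResourcePartitioner();
        partitioner.setKeyName("inputFile");   // each partition's ExecutionContext gets this key
        partitioner.setResources(inputFiles);
        return partitioner;
    }
}
```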
Slave Step
For the slave step, there are two pieces. The first is the configuration of the step itself. This is no different than if you were running the step inline. In this example, I'd expect you to use the SystemCommandTasklet to run your R process (unless you're running it on the JVM).
How the step is launched depends on remote vs. local partitioning but is also straightforward.
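A sketch of the worker-side tasklet, assuming SystemCommandTasklet; the Rscript command, timeout and working directory are placeholders, and in a real configuration the input file would come from the partition's ExecutionContext (e.g. via a @StepScope bean):

```java
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.context.annotation.Bean;

// Sketch of the worker-side tasklet: shell out to R for one input file.
// The script path, argument, timeout and working directory are placeholders.
public class WorkerStepConfig {

    @Bean
    public SystemCommandTasklet rTasklet() {
        SystemCommandTasklet tasklet = new SystemCommandTasklet();
        tasklet.setCommand("Rscript /scripts/process.R /data/input/file-001.csv");
        tasklet.setTimeout(30 * 60 * 1000L);          // fail the step if R runs longer than 30 minutes
        tasklet.setWorkingDirectory("/data/input");
        return tasklet;
    }
}
```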
For the record, I did a talk a while back demonstrating remote partitioning that's available on YouTube here: https://www.youtube.com/watch?v=CYTj5YT7CZU The code for that demo is also available on Github here: https://github.com/mminella/Spring-Batch-Talk-2.0
My program receives large CSV files and transforms them into XML files. In order to get better performance, I would like to split these files into smaller segments of (for example) 500 lines. What Java libraries are available for splitting text files?
I don't understand what you'd gain by splitting the CSV file into smaller ones. With Java, you can read and process the file as you go; you don't have to read it all at once...
What do you intend to do with the data?
If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing will be the way to go. For record-by-record processing, an existing "pipeline" toolkit may be applicable.
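To illustrate the read-as-you-go point for the CSV-to-XML case, here is a minimal sketch that streams through the CSV line by line and writes XML with StAX as it goes; the file names, the naive comma split and the element layout are assumptions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Sketch: each CSV line is converted to an XML element as soon as it is read,
// so the whole file never sits in memory. File names and element layout are
// placeholders; the comma split does not handle quoted fields.
public class StreamingCsvToXml {

    public static void main(String[] args) throws IOException, XMLStreamException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("input.csv"));
             Writer out = Files.newBufferedWriter(Paths.get("output.xml"))) {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("records");
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",");          // naive CSV split, no quoting support
                xml.writeStartElement("record");
                for (String field : fields) {
                    xml.writeStartElement("field");
                    xml.writeCharacters(field);
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```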
You can pre-process your file with a splitter function like this one or this Splitter.java.
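If you still want to split first, no external library is strictly needed; a minimal plain-Java sketch (paths and chunk naming are placeholders, and CSV headers are not handled):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Minimal splitter sketch: copies the input CSV into numbered chunk files of
// at most "linesPerChunk" lines. Paths are placeholders; headers are not handled.
public class CsvSplitter {

    public static void split(Path input, Path outputDir, int linesPerChunk) throws IOException {
        Files.createDirectories(outputDir);
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            int lineCount = 0;
            int chunkIndex = 0;
            BufferedWriter writer = null;
            while ((line = reader.readLine()) != null) {
                if (writer == null || lineCount == linesPerChunk) {
                    if (writer != null) {
                        writer.close();
                    }
                    writer = Files.newBufferedWriter(outputDir.resolve("chunk-" + chunkIndex++ + ".csv"));
                    lineCount = 0;
                }
                writer.write(line);
                writer.newLine();
                lineCount++;
            }
            if (writer != null) {
                writer.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        split(Paths.get("input.csv"), Paths.get("chunks"), 500);
    }
}
```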
How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. The grid computing stuff really is only for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool into which you dispatch "jobs" after the file is split could work, as sketched below.
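For the local variant, a sketch of dispatching the split chunks to a fixed thread pool (the directory, file pattern and pool size are placeholders):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the local, non-grid variant: each chunk file produced by the split
// is handed to a fixed-size thread pool. The directory, file pattern and the
// processing call are placeholders.
public class ChunkProcessor {

    public static void main(String[] args) throws IOException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try (DirectoryStream<Path> chunks = Files.newDirectoryStream(Paths.get("chunks"), "chunk-*.csv")) {
            for (Path chunk : chunks) {
                pool.submit(() -> System.out.println("Processing " + chunk));   // replace with real work
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```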