In my project, we have the following processes:
1. A Spring Batch job that reads X records from a DB table and publishes them to RabbitMQ as a topic.
2. A Spring XD stream that takes the messages from the queue and writes them to a file.
3. Another stream that takes the same records as above from the queue and puts them in a table.
4. An independent Spring Batch job that runs about 6 hours later and sends the file generated in (2) to a third-party vendor.
I want to make sure that the stream in (2) has finished processing. I was thinking of two options:
Have a dummy record at the end of the records in the queue which indicates completion of records (hacky, would prefer not to do this)
Have some sort of a batch identifier and verify that the queue does not contain any message with that batch identifier (How will this work?)
Any alternative suggestions on this problem? Thanks in advance!
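To make option 2 a bit more concrete, this is roughly what I had in mind, assuming job (1) records the number of records it published per batch identifier in a control table (the BATCH_CONTROL table, its columns, and the one-line-per-record file format are all just assumptions for this sketch), so that job (4) can verify the file is complete before sending it:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class BatchCompletionChecker {

    private final JdbcTemplate jdbcTemplate;

    public BatchCompletionChecker(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    // Returns true once the file written by stream (2) contains every record of the batch.
    public boolean isBatchComplete(String batchId, Path outputFile) throws Exception {
        // Count recorded by job (1) when it published the batch
        Long expected = jdbcTemplate.queryForObject(
                "SELECT record_count FROM BATCH_CONTROL WHERE batch_id = ?",
                Long.class, batchId);

        // Lines written so far by stream (2)
        long written;
        try (Stream<String> lines = Files.lines(outputFile)) {
            written = lines.count();
        }
        return expected != null && written >= expected;
    }
}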
So I have a Spark application that reads DB records (let's say 1,000 records), processes them, and writes a CSV file (with 1,000 lines) out to cloud object storage. So, three questions here:
1. Is the DB read request sent to the executors? If so, in the case of 1,000 DB records, would each executor read part of the data (for example, 500 records each) and send the records back to the driver? Or does it write to a central cache that the driver reads from?
2. The next step, processing the DB records (a fold job), is sent to 2 executors. Let's say each executor gets 500 records or so. Once an executor finishes processing its partition, does it send all 500 processed (formatted) rows back to the driver? Or does it write to some central cache that the driver reads from? How does the data exchange between driver and executors happen?
3. The last step is the .save CSV call in my main() function. In this code I am doing a repartition(1) with the idea that the file will only be saved from one executor. If so, how is the data collected into that one executor? Remember, earlier we had two executors processing 500 records each. How does a total of 1,000 records get sent to one executor and saved into object storage by that one executor? How is the data collected from all executors and shared into the one executor executing the .save?
dataset.repartition(1)
       .write()
       .format("csv")
       .option("header", "true")
       .save(filepath);
If I don't do repartition(1), will the save happen from multiple executors, and would they overwrite each other? I don't think there is a way to specify a unique filename using Spark. Do I have to save the file to a temp location and rename it later, and all that?
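For reference, this is the kind of "save to a temp location and rename" workaround I am considering (the temp and final paths are placeholders, and I am assuming the Hadoop FileSystem API is available for the object storage):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SingleCsvWriter {

    public static void writeSingleCsv(SparkSession spark, Dataset<Row> dataset,
                                      String tempDir, String finalFile) throws Exception {
        // repartition(1) shuffles all rows into a single partition, so a single task
        // (on one executor) writes exactly one part-*.csv file under tempDir
        dataset.repartition(1)
               .write()
               .format("csv")
               .option("header", "true")
               .save(tempDir);

        // Spark always writes part-* files (plus _SUCCESS); rename the lone part file
        FileSystem fs = FileSystem.get(spark.sparkContext().hadoopConfiguration());
        for (FileStatus status : fs.listStatus(new Path(tempDir))) {
            String name = status.getPath().getName();
            if (name.startsWith("part-") && name.endsWith(".csv")) {
                fs.rename(status.getPath(), new Path(finalFile));
            }
        }
    }
}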
Are there any articles or YouTube videos that explain how data is distributed and collected or shared across executors? I can understand how .count() works, but how does .save work, and how are large results like millions of DB records or rows shared across executors? I have been looking for resources to read but can't seem to find one that answers my questions. I am very new to Spark, like 3 weeks new.
I have a scenario where my Spring Batch job runs every 3 minutes.
The steps should be:
Each user's records should be processed in parallel. Each user can have a maximum of 150k records.
Every user can have update and delete records. Update records should run before delete.
Update/delete sets should run in parallel on their own, but strictly all updates should complete before the deletes.
Can anyone suggest the best approach to achieve parallelism at multiple levels while keeping the ordering between updates and deletes?
I am looking at something around Spring's async ExecutorService, parallel streams, and other Spring libraries; Rx only if it gives some glaring performance benefit that the options above can't provide.
Glaring performance comes down to the design of the Spring Batch implementation, and we are sure you will get it with Spring Batch, as we are processing millions of records with selects, deletes, and updates.
Each user's records should be processed in parallel. Each user can have a maximum of 150k records.
"Partition the selection based on user, and each user's partition will run as a parallel step."
Every user can have update and delete records. Update records should run before delete.
"Create a CompositeItemWriter and add the update writer as the first delegate and the delete writer as the second."
Update/delete sets should run in parallel on their own, but strictly all updates should complete before the deletes.
"Each writer (update and delete) runs within the chunk's transaction, and the delegate order makes sure the updates execute first."
Please refer to the links below:
Spring Batch multiple process for heavy load with multiple thread under every process
Composite Writer Example
Spring Batch - Read a byte stream, process, write to 2 different csv files convert them to Input stream and store it to ECS and then write to Database
I currently have a Spring Batch job with one single step that reads data from Oracle, passes the data through multiple Spring Batch processors (CompositeItemProcessor), and writes the data to different destinations such as Oracle and files (CompositeItemWriter):
<batch:step id="dataTransformationJob">
    <batch:tasklet transaction-manager="transactionManager" task-executor="taskExecutor" throttle-limit="30">
        <batch:chunk reader="dataReader" processor="compositeDataProcessor" writer="compositeItemWriter" commit-interval="100"/>
    </batch:tasklet>
</batch:step>
In the above step, the compositeItemWriter is configured with 2 writers that run one after another and write 100 million records to Oracle as well as a file. Also, the dataReader has a synchronized read method to ensure that multiple threads don't read the same data from Oracle. This job takes 1 hour 30 mins to complete as of today.
I am planning to break down the above job into two parts such that the reader/processors produce data on 2 Kafka topics (one for data to be written to Oracle and the other for data to be written to a file). On the other side of the equation, I will have a job with two parallel flows that read data from each topic and write the data to Oracle and file respectively.
With the above architecture in mind, I wanted to understand how I can refactor a Spring Batch job to use Kafka. I believe the following areas are what I would need to address:
In the existing job that doesn't use Kafka, my throttle limit is 30; however, when I use Kafka in the middle, how does one decide the right throttle-limit?
In the existing job I have a commit-interval of 100. This means that the CompositeItemWriter will be called for every 100 records and each writer will unpack the chunk and call the write method on it. Does this mean that when I write to Kafka, there will be 100 publish calls to Kafka?
Is there a way to club multiple rows into one single message in Kafka to avoid multiple network calls?
On the consumer side, I want to have a Spring batch multi-threaded step that is able to read each partition for a topic in parallel. Does Spring Batch have inbuilt classes to support this already?
The consumer will use a standard JdbcBatchItemWriter or FlatFileItemWriter to write the data that was read from Kafka, so I believe this should be standard Spring Batch in action.
Note: I am aware of Kafka Connect but don't want to use it because it requires setting up a Connect cluster, and I don't have the infrastructure available to support that.
Answers to your questions:
No throttling is needed in your Kafka producer; data should be available in Kafka for consumption ASAP. Your consumers could be throttled (if needed) as per the implementation.
The Kafka producer is configurable. 100 messages do not necessarily mean 100 network calls: you could write 100 messages to the Kafka producer (which may or may not buffer them, depending on the config) and flush the buffer to force the network call. This would lead to (almost) the same behaviour as today. See the producer sketch after these answers.
Multiple rows can be clubbed into a single message, since the payload of a Kafka message is entirely up to you. But your reasoning ("club multiple rows into one single message in Kafka to avoid multiple network calls") is invalid, since multiple messages (rows) can be produced/consumed in a single network call. For your first draft, I would suggest keeping it simple by having a single row correspond to a single message.
Not as far as I know. (but I could be wrong on this one)
Yes I believe they should work just fine.
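For point 2, here is a minimal producer sketch (broker address and topic name are placeholders) showing how batch.size and linger.ms let 100 sends ride on far fewer network calls:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class BatchedProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024); // buffer up to 64 KB per partition
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);         // wait up to 20 ms to fill a batch

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 100; i++) {
                // send() only appends to an in-memory batch; the I/O thread ships full batches
                producer.send(new ProducerRecord<>("oracle-records", "key-" + i, "row-" + i));
            }
            producer.flush(); // force any remaining buffered records onto the wire
        }
    }
}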
I have 3 executors in my Spark Streaming job, which consumes from Kafka. The executor count depends on the partition count of the topic. When a message is consumed from this topic, I start a query on Hazelcast. Every executor runs a filtering operation on Hazelcast and returns duplicated results, because the data's status is not updated by the time one executor returns its data and another executor finds the same data.
My question is: is there a way to combine all the results found by the executors during streaming into a single list?
Spark executors are distributed across the cluster, so you are trying to deduplicate data across the cluster, and that is difficult. You have the following options:
Use accumulators. The problem here is that accumulators are not consistent while the job is running, and you may end up reading stale data.
The other option is to offload this work to an external system: store your output in some external storage that can deduplicate it (probably HBase). The efficiency of this storage system becomes key here.
I hope this helps
To avoid reading duplicate data, you need to maintain the offsets somewhere, preferably in HBase. Every time you consume data from Kafka, read the offsets from HBase, check which offset of each topic has already been consumed, and only then start reading and writing. After each successful write, you must update the offset count (a rough sketch follows).
Do you think that would solve the issue?
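A rough sketch of that offset bookkeeping, assuming an HBase table named kafka_offsets with a single column family o (all names are placeholders):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OffsetStore {

    private static final byte[] CF = Bytes.toBytes("o");
    private static final byte[] QUAL = Bytes.toBytes("offset");

    private final Connection connection;

    public OffsetStore(Connection connection) {
        this.connection = connection;
    }

    // Last committed offset for a topic-partition; -1 means "never consumed"
    public long readOffset(String topic, int partition) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("kafka_offsets"))) {
            Result result = table.get(new Get(Bytes.toBytes(topic + ":" + partition)));
            byte[] value = result.getValue(CF, QUAL);
            return value == null ? -1L : Bytes.toLong(value);
        }
    }

    // Persist the offset only after the corresponding write has succeeded
    public void commitOffset(String topic, int partition, long offset) throws Exception {
        try (Table table = connection.getTable(TableName.valueOf("kafka_offsets"))) {
            Put put = new Put(Bytes.toBytes(topic + ":" + partition));
            put.addColumn(CF, QUAL, Bytes.toBytes(offset));
            table.put(put);
        }
    }
}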
We have a Servlet-based application that can upload files.
Uploaded files are saved on the server; after that they are delegated for file processing and later inserted into our MongoDB.
Files are large (greater than 50 MB), and processing takes 30 minutes to 1 hour depending on server load.
The problem happens when multiple files get processed at the same time, in separate threads of course; this eventually slows down the system, and finally one of the threads gets aborted, which we can never trace.
So we are now planning a multiple-producer / single-consumer approach, where file jobs are queued one by one and the consumer consumes them from the queue one by one, sequentially.
We also need clustering capability to be implemented in the application later on.
For this approach we are planning to implement the below process.
1. When a file job comes in, we will put it in a Mongo collection with status NEW.
2. Next, it will trigger the consumer thread immediately.
3. The consumer will check whether there is already a running task with status RUNNING.
4. If there is no running task, it will start the task.
5. Upon completion, before ending, the consumer will check the collection again for tasks with status NEW; if there are any, it will take the next task in FIFO order by checking the timestamp, and the process continues.
6. If there is a currently running task, it will simply insert the new task into the DB; since there is already a running consumer, that thread will take care of the new job inserted into the DB while the current process ends.
This way, we can also ensure that it runs smoothly in a clustered environment without any additional configuration (a rough sketch of the consumer's claim step follows).
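For illustration, a rough sketch of the consumer's claim step (collection and field names are assumptions; using findOneAndUpdate is my suggestion for making the "check for RUNNING and start" step atomic, so two cluster nodes can never pick up the same job):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

public class FileJobConsumer {

    private final MongoCollection<Document> jobs;

    public FileJobConsumer(MongoClient client) {
        // hypothetical database/collection names
        this.jobs = client.getDatabase("uploads").getCollection("fileJobs");
    }

    // Atomically take the oldest NEW job and flip it to RUNNING in one operation
    private Document claimNextJob() {
        return jobs.findOneAndUpdate(
                eq("status", "NEW"),
                set("status", "RUNNING"),
                new FindOneAndUpdateOptions().sort(Sorts.ascending("createdAt")));
    }

    public void run() {
        Document job;
        while ((job = claimNextJob()) != null) {
            try {
                process(job); // long-running file processing + inserts into MongoDB
                jobs.updateOne(eq("_id", job.get("_id")), set("status", "DONE"));
            } catch (Exception e) {
                jobs.updateOne(eq("_id", job.get("_id")), set("status", "FAILED"));
            }
        }
    }

    private void process(Document job) { /* parse the file and insert the results */ }
}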
There are message-queue-based solutions with RabbitMQ or ActiveMQ, but we need to minimize additional component configuration.
Let me know if our approach is correct, or whether there is a better solution out there.
Thanks,