Stream a database recordset to multiple thread workers - java

I have a process which requires streaming data from a database and passing the records off to an external server for processing before returning the results to store back in the database.
Get database row from table A
Hand off to external server
Receive result
Insert database row into table B
Currently this is a single-threaded operation. The bottleneck is the external server process, so I would like to improve performance by using other instances of the external server process to handle requests:
Get 100 database rows from table A
For each row
    Hand off to external server 1
    Receive result
    Insert database row into table B
In parallel, get 100 database rows from table A
For each row
    Hand off to external server 2
    Receive result
    Insert database row into table B
Problem 1
I have been investigating Java thread pools and dispatching records to the external servers this way; however, I'm not sure how to fetch records from the database as quickly as possible without the workers rejecting new tasks. Can this be done with thread pools? What architecture should be used to achieve this?
Problem 2
At present I have optimised the database inserts by using batch statements and executing them only once 2000 records have been processed. Would it be possible to adopt a similar approach within the workers?
Any help in structuring a solution to this problem would be greatly appreciated.

Based on your comments, I think the key point is controlling the number of pending tasks. You have several options:
Estimate the number of records in your data set, then decide on a batch size that will produce a reasonable number of tasks. For example, if you want to limit the pending-task count to 100 and you have 100K records, you can use a batch size of 1K; if you have 1M records, set the batch size to 10K.
Supply your own bounded BlockingQueue to the thread pool (a sketch follows at the end of this answer). If you haven't done this before, you should probably study the java.util.concurrent package carefully first.
Or you can use a java.util.concurrent.Semaphore, which is a simpler facility than a user-supplied queue:
Declare a semaphore with your pending-task count limit:
Semaphore mySemaphore = new Semaphore(max_pending_task_count);
Since your task generation is fast, you can use a single thread to generate all tasks. In your task generating thread:
while (hasMoreTasks()) {
    // this will block if you've reached the count limit
    mySemaphore.acquire();
    // generate a new task only after a successful acquire;
    // the new task must hold a reference to the semaphore
    Task task = new Task(..., mySemaphore);
    threadpool.submit(task);
}
// now that you've generated all tasks,
// it's time to wait for them to finish
// (you may have a better way to detect completion, however)
while (mySemaphore.availablePermits() < max_pending_task_count) {
    Thread.sleep(some_time);
}
// now, go ahead and deal with the results
In your Task thread:
public void run() {
    try {
        ...
    } finally {
        // when finished, release a permit; this increases the permit count
        // by 1 and lets the generator thread produce 1 more task
        // (the finally block guarantees the release even if the task fails)
        mySemaphore.release();
    }
}
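For the bounded-queue option, here is a minimal sketch; the pool size of 10 and queue limit of 100 are placeholders. CallerRunsPolicy makes the submitting thread execute a task itself whenever the queue is full, which throttles your database fetch loop without any task being rejected:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor threadpool = new ThreadPoolExecutor(
        10, 10,                                     // core and max pool size (placeholders)
        0L, TimeUnit.MILLISECONDS,                  // keep-alive, unused with equal sizes
        new ArrayBlockingQueue<Runnable>(100),      // the bounded pending-task queue
        new ThreadPoolExecutor.CallerRunsPolicy()); // submitter runs the task when full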

Related

Flink block process event with the same id at the same time

I have a flink application that processes a stream of data and writes some results to a database. The stream is keyed by id. A database operation can take quite a long time (e.g. 3 minutes), and only one operation may run at a time for a given id key, to protect against locks. At the moment this sink operation cannot be processed in parallel and the parallelism has to be set to 1.
process
    .keyBy(new ProductKeySelector())
    .addSink(new ProductSink())
    .setParallelism(1)
I want to lock the stream on the id that is currently being processed, take another event out of order, and wait until processing of the same id has finished before processing the next event for it. It would work like a blocking queue.
Update:
example:
kafkaKeyedStream
    .map(new MapToProductType())
    .keyBy(new ProductKeySelector())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(new ProductAggregateFunction())
    .addSink(new ProductSink());
From the Kafka source I received data:
(screenshot of the sample records, not reproduced here)
As you can see, the data are grouped by the window function (the first value in each record is the key) and the results are processed by the sink function. For this example, let's say that processing takes 20s per batch of data. So if I have 1 thread it's not a problem, because the next batch waits to be processed; but if I set parallelism=2 then the first batch will still be being processed by one thread, and after 10s another thread will start processing the next batch with the same key as the first. This creates a lock on the database.
I would like a situation where, while one thread is already processing data for a specific key, a second thread does not take data for the same key, and either takes a different one or does nothing if there is nothing else.
If your DB operation can take 3 minutes, you don't want to use a regular JDBC sink. Instead, look at Flink's Async I/O support. You'd want to keyBy(id), and then inside your custom RichAsyncFunction operator you can keep track of whether you've got an active DB request for a given id.
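A minimal sketch of the async wiring, reusing Product, MapToProductType and ProductKeySelector from the question; writeToDb() is a hypothetical helper for the slow call, and the pool size, timeout and capacity are assumptions. Note that this only moves the DB call off the task thread; the per-id "active request" tracking mentioned above still has to be added inside asyncInvoke():

import java.util.Collections;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class DbWriteFunction extends RichAsyncFunction<Product, Product> {

    private transient ExecutorService dbExecutor;

    @Override
    public void open(Configuration parameters) {
        // run the slow DB calls off the Flink task thread
        dbExecutor = Executors.newFixedThreadPool(10);
    }

    @Override
    public void asyncInvoke(final Product input, final ResultFuture<Product> resultFuture) {
        dbExecutor.submit(new Runnable() {
            @Override
            public void run() {
                writeToDb(input); // hypothetical helper for the slow (up to 3 min) operation
                resultFuture.complete(Collections.singleton(input));
            }
        });
    }

    @Override
    public void close() {
        dbExecutor.shutdown();
    }
}

// wiring inside the job: orderedWait preserves input order and
// the capacity of 10 bounds the number of in-flight requests
DataStream<Product> written = AsyncDataStream.orderedWait(
        kafkaKeyedStream.map(new MapToProductType()).keyBy(new ProductKeySelector()),
        new DbWriteFunction(),
        5, TimeUnit.MINUTES,  // generous timeout for 3-minute operations
        10);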

Best Practice to process multiple rows from database in different thread

I would like to ask about the best way to do the following. Currently I have many rows being inserted into a database with some status like 'NEW'.
One thread (ThreadA) reads 20 rows from the table with the following query: select * from TABLE where status = 'NEW' order by some_date asc and puts the data it reads into a queue. It only reads data when the number of elements in the queue is less than 20.
Another thread (ThreadB) reads data from the queue and processes it; during processing it changes the status of the row to something like 'IN PROGRESS'.
My fear is that while ThreadB is processing one row but has not yet updated its status, the number of elements in the queue may drop below 20, so ThreadA will fetch another 20 rows and put them into the queue; there is a possibility of duplicates in the queue.
Since the rows would come back with a status still 'NEW', I thought I could mark the data read with some flag (something like 'fetched'), and reset the flag after processing.
I feel like I am missing something, so I would like to ask if there is a best practice for handling tasks like this.
PS: the number of threads that read the data might be increased in the future; this is what I try to keep in mind.
Right, since no-one seems to be picking this one up, I'll continue here what was started in the comments:
There are lots of solutions to this one. In your case, with just one processing thread, you might for example want to store just the record IDs in the queue. ThreadB can then fetch the row itself to make sure the status is indeed NEW. Or use optimistic locking with update table set status='IN_PROGRESS' where id=rowId and status='NEW' and quit processing the row if the update affects no rows.
Optimistic locking is fun, and you could also use it to get rid of the producer thread altogether. Imagine a few threads processing records from the database. Each could pick up a record and try to set the status with optimistic locking as in the first example. It's quite possible to get a lot of contention for records this way, so each thread could instead fetch N rows, where N is the number of threads, or twice that much, and then process the first row that it succeeded in setting to IN_PROGRESS. This solution makes for a less complicated system, and one less thing to take care of/synchronize with.
And you can have the threads pick up not only records that are NEW, but also those that are IN_PROGRESS with started_date older than some timeout. That would include records that were not processed because of a system failure (e.g. a thread managed to set one row to IN_PROGRESS and then your system went down), so you get some resilience here.
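A minimal JDBC sketch of that optimistic claim, combining the NEW case and the stale IN_PROGRESS rescue; the table and column names (tasks, id, status, started_date) are assumptions:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

// returns true only if this thread won the race for the row
boolean tryClaim(Connection conn, long rowId, Timestamp staleBefore) throws SQLException {
    String sql = "UPDATE tasks SET status = 'IN_PROGRESS', started_date = CURRENT_TIMESTAMP"
               + " WHERE id = ? AND (status = 'NEW'"
               + "    OR (status = 'IN_PROGRESS' AND started_date < ?))";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, rowId);
        ps.setTimestamp(2, staleBefore); // e.g. now minus your processing timeout
        return ps.executeUpdate() == 1;  // 0 rows updated = another thread claimed it first
    }
}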

Need suggestion about java thread pool execution queue processing

In my application we have a number of client databases; every hour we get new data for processing in those databases.
A cron job checks the data in these databases and picks it up, then it creates a thread pool and starts executing 30 threads in parallel, with the remaining tasks stored in a queue.
It takes several hours to process all of these tasks.
So while an execution is running, newly arrived data has to wait, because the cron job will not pick it up until the current execution has finished.
Sometimes we have priority data for processing, but in this situation those clients also have to wait several hours for their data to be processed.
Please give me a suggestion for avoiding this wait state for newly arrived data.
(I am working with Java 1.7, Tomcat 7 and SQL Server 2012.)
Thank you in advance.
Please let me know if more information is needed.
Each of your threads should process data in bulk (for example 100/1000 records), and these records should be selected from the DB by priority: each time you select new records for processing, the data with the highest priority goes first.
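A sketch of that priority-first bulk fetch; conn is an open java.sql.Connection, the table and column names are assumptions, and TOP is SQL Server syntax, matching the environment in the question:

import java.sql.PreparedStatement;
import java.sql.ResultSet;

String sql = "SELECT TOP 1000 id, payload FROM incoming_data"
           + " WHERE status = 'NEW'"
           + " ORDER BY priority DESC, arrival_time ASC";
try (PreparedStatement ps = conn.prepareStatement(sql);
     ResultSet rs = ps.executeQuery()) {
    while (rs.next()) {
        // hand each record to a worker; enqueueForProcessing is hypothetical
        enqueueForProcessing(rs.getLong("id"), rs.getString("payload"));
    }
}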
I can't create a comment yet :(
For this problem we are thinking about two solutions:
1. Create more than one thread pool, for processing normal- and high-priority data separately.
2. Create more than one Tomcat instance with the same code, for processing normal and priority data.
But I don't understand which solution is best for my case, 1 or 2.
Please give me suggestions about the above solutions so that I can make a decision.
You can use an ExecutorService created with Executors.newCachedThreadPool().
Benefits of using a cached thread pool:
The pool creates new threads if needed but reuses previously constructed threads if they are available.
Only if no threads are available for reuse will a new thread be created and added to the pool.
Threads that have not been used for more than sixty seconds are terminated and removed from the cache. Hence a pool that remains unused for long enough will not consume any resources.
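A minimal usage sketch (anonymous class rather than a lambda, since the question mentions Java 1.7); processClientData and clientDb are hypothetical:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

ExecutorService pool = Executors.newCachedThreadPool();

// submit work as soon as it arrives; the pool grows on demand,
// so newly arrived data is not stuck behind a fixed-size pool
pool.submit(new Runnable() {
    @Override
    public void run() {
        processClientData(clientDb); // hypothetical method and (final) variable
    }
});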

Multi-threaded file processing and database batch insertions

I have a Java main application which reads a file line by line. Each line represents subscriber data.
name, email, mobile, ...
A subscriber object is created for each line being processed, and this object is then persisted to the database using JDBC.
PS: The input file has around 15 million subscriber records and the application takes around 10-12 hours to process it. I need to reduce this to around 2-3 hours, as this task is a migration activity and the down-time we get is around 4-5 hours.
I know I need to use multiple threads / a thread pool, maybe Java's native ExecutorService. But I have also been asked to do batch updates: say a thread pool of 50 or 100 worker threads and batch updates of 500-1000 subscribers.
I am familiar with ExecutorService but can't see an approach where I can also have the batch-update logic in it.
My overall application code looks like:
while (null != (line = getNextLine())) {
    Subscriber sub = getSub(line); // creates subscriber object by parsing the line
    persistSub(sub); // JDBC - PreparedStatement insert query executed
}
I need an approach that lets me process this faster with multiple threads and batch updates, or any existing frameworks or Java APIs that can be used for such cases.
persistSub(sub) should not immediately access the database. Instead, it should store sub in an array of length 500-1000, and only when the array is full, or the input file is exhausted, wrap the array in a Runnable and submit it to a thread pool. The Runnable then accesses the database via JDBC as described in JDBC Batching with PrepareStatement Object.
UPDATE
If writing to the database is slow and reading the input file is fast, many arrays of data can pile up waiting to be written to the database, and the system can run out of memory. So persistSub(sub) should keep track of the number of allocated arrays. The easiest way is to use a Semaphore initialized with the allowed number of arrays: before a new array is allocated, persistSub(sub) calls Semaphore.acquire(), and each Runnable task, before it ends, calls Semaphore.release().
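Putting both parts together, a sketch under these assumptions: getNextLine(), getSub() and the Subscriber type come from the question, getConnection() and the column layout are hypothetical, and the pool size, batch size and permit count are placeholders:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;

abstract class BatchLoader {
    static final int BATCH_SIZE = 1000;
    final Semaphore permits = new Semaphore(10);      // at most 10 buffers in flight
    final ExecutorService pool = Executors.newFixedThreadPool(50);

    // these come from the existing application (see the question)
    abstract String getNextLine();
    abstract Subscriber getSub(String line);
    abstract Connection getConnection() throws Exception;

    void load() throws Exception {
        List<Subscriber> buffer = new ArrayList<>(BATCH_SIZE);
        String line;
        while (null != (line = getNextLine())) {
            buffer.add(getSub(line));
            if (buffer.size() == BATCH_SIZE) {
                submitBatch(buffer);                  // blocks while 10 buffers are pending
                buffer = new ArrayList<>(BATCH_SIZE);
            }
        }
        if (!buffer.isEmpty()) submitBatch(buffer);   // flush the tail of the file
        pool.shutdown();
    }

    void submitBatch(final List<Subscriber> batch) throws InterruptedException {
        permits.acquire();                            // throttles the reader thread
        pool.submit(new Runnable() {
            @Override
            public void run() {
                try (Connection conn = getConnection();
                     PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO subscriber (name, email, mobile) VALUES (?, ?, ?)")) {
                    for (Subscriber s : batch) {
                        ps.setString(1, s.getName());
                        ps.setString(2, s.getEmail());
                        ps.setString(3, s.getMobile());
                        ps.addBatch();
                    }
                    ps.executeBatch();                // one DB round-trip per 1000 rows
                } catch (Exception e) {
                    e.printStackTrace();              // real code needs proper handling
                } finally {
                    permits.release();                // the reader may allocate a new buffer
                }
            }
        });
    }
}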

BlockingQueue inside Transaction java

I am in the process of building a system which has a basic producer-consumer flavor to it, but the producer part needs to run in a transaction.
Here is my exact scenario :
Poller Thread -
[Transaction START]
* Polls the DB for, say, 10 records
* Sets the status in the DB for those records to IN-PROGRESS
* Puts the above 10 records in a LinkedBlockingQueue - workqueue
[Transaction END]
Worker Thread Pool of 5
* Polls the workqueue for tasks, does some lookups and updates the same records in the DB with the looked-up values.
Now my problem is with Part 1 of the process: if for some reason extracting and updating from the DB succeeds but inserting into the queue fails for one record, I can roll back the whole transaction, and all my records in the DB will be in state NOT processed; but some elements may already have been inserted into the work queue, and my worker thread pool can pick them up and start processing, which should not happen.
I am trying to find out whether there is a way to make the writes to the blocking queue transactional.
I am thinking of adding some writeLock()/readLock() mechanism, so that I can stop the worker threads from reading while something is being written to the queue.
Any thoughts on a better approach?
Thanks,
Consider the worst-case scenarios: the unplug case (database connection lost) and the crash case (program out of memory). How would you recover from those?
Some hints:
If you can think of a reason why inserting into the queue would fail (e.g. the queue is full), don't start the transaction; just skip one poll.
First commit the in-progress transaction, then add all records to the workqueue (see the sketch after this list). Or use one transaction per record so you can add the records to the workqueue one by one.
Maintain an in-memory HashSet of the IDs of all records being processed. If an ID is in the set but the record is not in-progress, or vice versa, something is very wrong (e.g. the task for a record did not complete or crashed).
Set a timestamp when in-progress is set, and have another background process check for records that have been in-progress for too long. Reset the in-progress state if the ID is not in the in-progress HashSet, and your normal process will retry the operation.
Make your tasks idempotent: see if you can find a way for tasks to recognize that the work for a record has already been done. This might be a relatively expensive operation, but it gives you the guarantee that work is done only once in case of retries.
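A sketch of the second hint (commit first, enqueue afterwards); the table and column names are assumptions, and LIMIT ... FOR UPDATE is MySQL/PostgreSQL syntax. If the JVM dies between commit and put(), the timeout check from the fourth hint rescues the rows:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

void pollOnce(Connection conn, BlockingQueue<Long> workqueue) throws Exception {
    List<Long> claimed = new ArrayList<>();
    conn.setAutoCommit(false);
    try (PreparedStatement select = conn.prepareStatement(
                 "SELECT id FROM records WHERE status = 'NEW' LIMIT 10 FOR UPDATE");
         ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            claimed.add(rs.getLong(1));
        }
    }
    try (PreparedStatement update = conn.prepareStatement(
                 "UPDATE records SET status = 'IN_PROGRESS', started_at = CURRENT_TIMESTAMP"
               + " WHERE id = ?")) {
        for (Long id : claimed) {
            update.setLong(1, id);
            update.executeUpdate();
        }
    }
    conn.commit();             // the transaction ends BEFORE the queue is touched
    for (Long id : claimed) {
        workqueue.put(id);     // past this point no DB rollback can be needed
    }
}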
