I want to read records (about 1000k) from one table and push them to a service.
I have grouped 200 records (the service's limit) into one event and, using the executor framework, created 10 executors, so 10 events (i.e. 10*200 records) are processed in parallel.
Now I want to maintain the status of these events, like statistics on how many were processed successfully and how many failed.
So I was thinking of
Approach 1:
Before starting the execution,
writing each event id + record id with status
event1 + record1 -> start
and on completion
event1 + record1-> end
Later I will check how many entries have both start and end in the file and how many have no end.
Approach 2 :
Write all record ids to one file with status pending,
write all successful records to another file,
and then find the records missing from the success file by using a pivot.
Is there a better way to maintain the status of the records?
In my view, if you want to process items in parallel, it is better to create one log file per batch of records. Why? Because a single shared file is a bottleneck for multithreading: you have to lock the file to prevent a race condition, so each thread must wait until the log file is released, and that waiting cancels out the benefit of multithreading.
So each batch should have its own log file.
Create an array and start threads with the passed id so they can write to the array cell by their id.
The main thread will read this array and print it.
You can use ReadWriteLock (threads will hold the read lock to write and the main thread will hold the write lock while reading the entire array).
You can store anything in this array, it can be very useful.
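The array idea above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `BatchStatusBoard`, the status constants, and the `pushBatch` callback (a stand-in for the real service call) are all hypothetical. Workers share the read lock because each one writes only to its own cell, and the reporting thread takes the write lock to read a consistent snapshot.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.function.IntPredicate;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical names throughout; one status cell per batch (event).
class BatchStatusBoard {
    static final int PENDING = 0, SUCCESS = 1, FAILED = 2;

    private final int[] status;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    BatchStatusBoard(int batchCount) {
        status = new int[batchCount]; // every cell starts as PENDING (0)
    }

    // Workers hold the shared read lock: each writes only its own cell,
    // so they do not block one another.
    void mark(int batchId, int newStatus) {
        lock.readLock().lock();
        try {
            status[batchId] = newStatus;
        } finally {
            lock.readLock().unlock();
        }
    }

    // The reporting thread holds the exclusive write lock while it scans the array.
    int count(int wanted) {
        lock.writeLock().lock();
        try {
            int n = 0;
            for (int s : status) {
                if (s == wanted) n++;
            }
            return n;
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Runs all batches on 10 threads. pushBatch returns true on success (assumption).
    static BatchStatusBoard runAll(int batchCount, IntPredicate pushBatch) {
        BatchStatusBoard board = new BatchStatusBoard(batchCount);
        ExecutorService pool = Executors.newFixedThreadPool(10);
        for (int i = 0; i < batchCount; i++) {
            final int batchId = i;
            pool.execute(() -> {
                boolean ok;
                try {
                    ok = pushBatch.test(batchId);
                } catch (RuntimeException e) {
                    ok = false; // an exception counts as a failed batch
                }
                board.mark(batchId, ok ? SUCCESS : FAILED);
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("batches did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return board;
    }

    public static void main(String[] args) {
        // Hypothetical push: pretend every 5th batch fails.
        BatchStatusBoard board = runAll(50, batchId -> batchId % 5 != 0);
        System.out.println("succeeded=" + board.count(SUCCESS)
                + " failed=" + board.count(FAILED));
    }
}
```

Because the statistics live in memory, they are lost if the process dies; if the counts must survive a crash, the array would still need to be flushed to a file or table from time to time.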
I have a Flink application that processes a stream of data and writes some results to a database. The stream is keyed by id. A database operation can take quite a long time (e.g. 3 min), and only one operation may run per id key at a time, to avoid database locks. At the moment this sink cannot run in parallel and its parallelism has to be set to 1.
process
    .keyBy(new ProductKeySelector())
    .addSink(new ProductSink())
    .setParallelism(1);
I want to lock the stream on the id of the event currently being processed, take another event out of order, and wait until processing for that id has finished before running the next event with the same id. It would behave like a blocking queue.
Update:
example:
kafkaKeyedStream
    .map(new MapToProductType())
    .keyBy(new ProductKeySelector())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(new ProductAggregateFunction())
    .addSink(new ProductSink());
From the Kafka source I received this data:
[Image: data received from the Kafka source, grouped by window]
As you can see, the data is grouped by the window function (the first value in each record is the key) and the results are processed by the sink function. For this example, let's say processing takes 20 s for each part of the data. With 1 thread this is not a problem, because the next part waits its turn. But if I set parallelism = 2, the first part is still being processed by one thread when, after 10 s, another thread starts processing the next part with the same key as the first. This creates a lock on the database.
What I would like is that, when one thread is already processing data for a specific key,
a second thread does not take data for the same key, and either takes data for a different key or does nothing if there is nothing else.
If your DB operation could take 3 minutes, you don't want to use a regular JDBC sink. Instead, look at Flink's Async IO support. You'd want to keyBy(id), and then inside of your custom RichAsyncFunction operator you can keep track of whether you've got an active DB request for a given id.
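The bookkeeping part of that idea (knowing whether a request for a given id is still active, and holding back the next one until it finishes) can be illustrated in plain Java. This sketch deliberately does not use the Flink API: `PerKeySerializer`, `submit`, and `demo` are hypothetical names, and `CompletableFuture` chaining stands in for whatever tracking you would do inside the async operator. Operations for the same key run one after another, while different keys run in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Plain-Java illustration only; none of these names come from Flink.
class PerKeySerializer {
    // For each key, the future of the most recently queued operation.
    private final Map<String, CompletableFuture<Void>> tails = new ConcurrentHashMap<>();
    private final Executor executor;

    PerKeySerializer(Executor executor) {
        this.executor = executor;
    }

    // Queues dbOperation behind any operation still pending for the same key.
    CompletableFuture<Void> submit(String key, Runnable dbOperation) {
        return tails.compute(key, (k, tail) -> {
            CompletableFuture<Void> previous =
                    (tail == null) ? CompletableFuture.<Void>completedFuture(null) : tail;
            // handle() swallows a failure of the previous operation so the chain continues.
            return previous.handle((result, error) -> null)
                           .thenRunAsync(dbOperation, executor);
        });
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Queues three slow operations for each of two keys and returns the event log.
    static List<String> demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        PerKeySerializer serializer = new PerKeySerializer(pool);
        List<String> log = new CopyOnWriteArrayList<>();
        List<CompletableFuture<Void>> pending = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            for (String key : new String[] {"A", "B"}) {
                String label = key + i;
                pending.add(serializer.submit(key, () -> {
                    log.add(label + "-start");
                    pause(20); // stand-in for a slow database call
                    log.add(label + "-end");
                }));
            }
        }
        CompletableFuture.allOf(pending.toArray(new CompletableFuture[0])).join();
        pool.shutdown();
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In the log, the entries for one key never interleave with each other, although entries for key A and key B may. The map here keeps one entry per key forever; a real implementation would remove a key once its chain has drained.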
I would like to ask about the best way to do the following. Currently many rows are being inserted into a database with a status such as 'NEW'.
One thread (ThreadA) reads 20 rows from the table with the following query: select * from TABLE where status = 'NEW' order by some_date asc and puts the rows into a queue. It only reads data when the number of elements in the queue is less than 20.
Another thread (ThreadB) reads data from the queue and processes it; during processing it changes the status of the row to something like 'IN PROGRESS'.
My fear is that while ThreadB is processing a row but has not yet updated its status, the number of elements in the queue may drop below 20, so ThreadA will fetch another 20 rows and put them into the queue, which may produce duplicates in the queue.
The data might come back with a status like 'NEW'. I thought I could mark the rows I read with some flag (something like fetched) and reset the flag to not read after processing.
I feel like I am missing something, so I would like to ask whether there is a best practice for handling tasks like this.
PS. The number of threads that read the data might be increased in the future; this is something I try to keep in mind.
Right, since no-one seems to be picking this one up, I'll continue here what was started in the comments:
There are lots of solutions to this one. In your case, with just one processing thread, you might, for example, store just the record ids in the queue. Then ThreadB can fetch the row itself to make sure the status is indeed NEW. Or use optimistic locking with update table set status='IN_PROGRESS' where id=rowId and status='NEW' and stop processing the row if the update fails (it matches no row, or throws, depending on your data-access layer).
Optimistic locking is fun, and you could also use it to get rid of the producer thread altogether. Imagine a few threads processing records from the database. They could each pick up a record and try to set the status with optimistic locking as in the first example. It's quite possible to get a lot of contention for records this way, so each thread could fetch N rows, where N is the number of threads, or twice that. It then tries to process the first row that it managed to set to IN_PROGRESS. This solution makes for a less complicated system, with one less thing to take care of and synchronize.
And you can have the threads pick up not only records that are NEW, but also those that are IN_PROGRESS with started_date < sysdate - timeout. That would include records that were not processed because of a system failure (for example, a thread managed to set a row to IN_PROGRESS and then your system went down). So you get some resilience here.
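To make the claim step concrete, here is a small in-memory sketch. It is an analogue, not database code: a `ConcurrentHashMap` stands in for the table, and `replace(id, "NEW", "IN_PROGRESS")` plays the role of the conditional UPDATE, succeeding for exactly one caller per row. The names `OptimisticClaim`, `tryClaim`, and `workerLoop` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// In-memory stand-in for the table; hypothetical names throughout.
class OptimisticClaim {
    final Map<Integer, String> statusById = new ConcurrentHashMap<>();
    final AtomicInteger processed = new AtomicInteger();

    // Analogue of: update table set status='IN_PROGRESS' where id=? and status='NEW'
    // With JDBC, the equivalent check would be that executeUpdate() returns 1.
    boolean tryClaim(int id) {
        return statusById.replace(id, "NEW", "IN_PROGRESS");
    }

    // Every worker looks at every row, but only the winner of the claim processes it.
    void workerLoop() {
        for (Integer id : new ArrayList<>(statusById.keySet())) {
            if (tryClaim(id)) {
                processed.incrementAndGet(); // stand-in for the real work
                statusById.put(id, "DONE");
            }
        }
    }

    static OptimisticClaim run(int rows, int workers) {
        OptimisticClaim table = new OptimisticClaim();
        for (int id = 0; id < rows; id++) {
            table.statusById.put(id, "NEW");
        }
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            pool.execute(table::workerLoop);
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return table;
    }

    public static void main(String[] args) {
        OptimisticClaim table = run(1000, 4);
        System.out.println("processed=" + table.processed.get());
    }
}
```

However many workers compete, each row is processed once, because only one `replace` call per row can see the status NEW.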
I have the following scenario:
The server sends a lot of information over a socket, so I need to read this information and validate it. The idea is to use 20 threads and batches: each time the batch size reaches 20, the thread must send the information to the database and keep reading from the socket, waiting for more.
I don't know what the best way to do this would be. I was thinking:
Create a Socket that will read the information.
Create an Executor (Executors.newFixedThreadPool(20)) and validate the information, adding each line to a list; when the size reaches 20, execute the Runnable class that sends the information to the database.
Thanks in advance for your help.
You don't want to do this with a whole bunch of threads. You're better off using a producer-consumer model with just two threads.
The producer thread reads records from the socket and places them on a queue. That's all it does: read record, add to queue, read next record. Lather, rinse, repeat.
The consumer thread reads a record from the queue, validates it, and writes it to the database. If you want to batch the items so that you write 20 at a time to the database, then you can have the consumer add the record to a list and when the list gets to 20, do the database update.
You probably want to look up information on using the Java BlockingQueue in producer-consumer programs.
You said that you might get a million records a day from the socket. That's only 12 records per second. Unless your validation is hugely processor intensive, a single thread could probably handle 1,200 records per second with no problem.
In any case, your major bottleneck is going to be the database updates, which probably won't benefit from multiple threads.
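A minimal sketch of the two-thread model described above, under some loud assumptions: the socket is replaced by any `Iterable<String>`, the database write is replaced by collecting batches in a list, the validation rule (skip empty records) is invented for illustration, and the end of input is signalled with a sentinel object. `SocketToDbPipeline` and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical names; the socket and the database are replaced by stand-ins.
class SocketToDbPipeline {
    static final int BATCH_SIZE = 20;
    // Unique sentinel object marking end of input (compared by identity below).
    private static final String END = new String("<end>");

    static List<List<String>> run(Iterable<String> source) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        List<List<String>> writtenBatches = new ArrayList<>(); // stand-in for the DB

        // Producer: read record, add to queue, read next record.
        Thread producer = new Thread(() -> {
            try {
                for (String record : source) {
                    queue.put(record); // blocks if the consumer falls behind
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Consumer: validate, collect 20 records, then "write" the batch.
        Thread consumer = new Thread(() -> {
            List<String> batch = new ArrayList<>(BATCH_SIZE);
            try {
                while (true) {
                    String record = queue.take();
                    if (record == END) break;       // identity check is intentional
                    if (record.isEmpty()) continue; // hypothetical validation rule
                    batch.add(record);
                    if (batch.size() == BATCH_SIZE) {
                        writtenBatches.add(batch);  // a real consumer would run a batch insert here
                        batch = new ArrayList<>(BATCH_SIZE);
                    }
                }
                if (!batch.isEmpty()) {
                    writtenBatches.add(batch);      // flush the final partial batch
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return writtenBatches;
    }

    public static void main(String[] args) {
        List<String> input = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            input.add("r" + i);
        }
        System.out.println(run(input).size() + " batches written");
    }
}
```

The bounded queue gives back-pressure for free: if the database is slow, `put` blocks the producer instead of letting memory grow without limit.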
I have a process which requires streaming data from a database and passing the records off to an external server for processing before returning the results to store back in the database.
Get database row from table A
Hand off to external server
Receive result
insert database row into table B
Currently this is a single-threaded operation, and the bottleneck is the external server process and so I would like to improve performance by using other instances of the external server process to handle requests.
Get 100 database rows from table A
For each row
Hand off to external server 1
Receive Result
insert database row into table B
In parallel get 100 database rows from table A
For each row
Hand off to external server 2
Receive Result
insert database row into table B
Problem 1
I have been investigating Java thread pools, and dispatching records to the external servers this way, however I'm not sure how to fetch records from the database as quickly as possible without the workers rejecting new tasks. Can this be done with thread pools? What architecture should be used to achieve this?
Problem 2
At present I have optimised the database inserts by using batch statements and only executing once 2000 records have been processed. Would it be possible to adopt a similar approach within the workers?
Any help in structuring a solution to this problem would be greatly appreciated.
Based on your comments, I think the key point is controlling the count of pending tasks. You have several options:
Estimate the number of records in your data set. Then decide on a batch size that will produce a reasonable number of tasks. For example, to limit the pending task count to 100: with 100K records, use a batch size of 1K; with 1M records, set the batch size to 10K.
Supply your own bounded BlockingQueue to the threadpool. If you haven't done it before, you probably should study the java.util.concurrent package carefully before doing this.
Or you can use a java.util.concurrent.Semaphore, which is a simpler facility than a user supplied queue:
Declare a semaphore with your pending task count limit
Semaphore mySemaphore = new Semaphore(max_pending_task_count);
Since your task generation is fast, you can use a single thread to generate all tasks. In your task generating thread:
while (hasMoreTasks()) {
    // this will block if you've reached the count limit
    mySemaphore.acquire();
    // generate a new task only after acquire
    // The new task must have a reference to the Semaphore
    Task task = new Task(..., mySemaphore);
    threadpool.submit(task);
}
// now that you've generated all tasks,
// time to wait for them to finish.
// you may have a better way to detect that, however
while (mySemaphore.availablePermits() < max_pending_task_count) {
    Thread.sleep(some_time);
}
// now, go ahead dealing with the results
In your Task thread:
public void run() {
    try {
        ...
    } finally {
        // release even if the task fails: this returns the permit
        // and lets the task generator thread produce 1 more task
        mySemaphore.release();
    }
}
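For reference, here is a self-contained version of the semaphore approach that can be compiled and run as is. It is a sketch under assumptions: `BoundedSubmitter`, the pool size of 4, and the trivial task body are all made up for illustration. Instead of polling `availablePermits()` in a sleep loop, it waits for completion by acquiring all permits at once, which is one possible alternative.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names; the task body is a placeholder for real work.
class BoundedSubmitter {
    // Returns {completed task count, highest number of pending tasks observed}.
    static int[] run(int totalTasks, int maxPending) {
        Semaphore permits = new Semaphore(maxPending);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AtomicInteger completed = new AtomicInteger();
        AtomicInteger pending = new AtomicInteger();
        AtomicInteger peakPending = new AtomicInteger();
        try {
            for (int i = 0; i < totalTasks; i++) {
                permits.acquire(); // blocks once maxPending tasks are queued or running
                peakPending.accumulateAndGet(pending.incrementAndGet(), Math::max);
                pool.submit(() -> {
                    try {
                        completed.incrementAndGet(); // stand-in for the real work
                    } finally {
                        pending.decrementAndGet();
                        permits.release(); // always hand the permit back
                    }
                });
            }
            // All permits are available again only when every task has finished.
            permits.acquire(maxPending);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return new int[] {completed.get(), peakPending.get()};
    }

    public static void main(String[] args) {
        int[] result = run(200, 8);
        System.out.println("completed=" + result[0] + " peakPending=" + result[1]);
    }
}
```

The pending count can never exceed `maxPending`, because a task is only counted as pending while it holds a permit.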
I am in the process of building a system that has a basic Producer-Consumer flavor to it, but the Producer part needs to run in a transaction.
Here is my exact scenario :
Poller Thread -
[Transaction START]
Polls the DB for say 10 records
* Sets the status in DB for those records as IN-Progress
* Puts the above 10 records in a LinkedBlockingQueue - workqueue
[Transaction END]
Worker Thread Pool of 5
* Polls the workqueue for the tasks, do some lookups and update the same records in DB with some looked up values.
My problem is with part 1 of the process. Say extracting and updating the records in the DB succeeds, but inserting into the queue fails for one record. I can roll back the whole transaction, and all my records in the DB will be back in the state NOT Processed, but some elements may already have been inserted into the work queue, and my worker thread pool can pick them up and start processing them, which should not happen.
I am trying to find out whether there is a way to make the writes to the blocking queue transactional.
I am thinking of adding some writelock()/readlock() mechanism, so that I can stop the worker threads from reading while something is being written to the queue.
Any thoughts for a better approach.
Thanks,
Consider the worst case scenario: the unplug case (database connection lost) and the crash case (program out of memory). How would you recover from that?
Some hints:
If you can think of a reason why inserting in the queue would fail (queue full), don't start the transaction. Just skip one poll.
First commit the in-progress transaction, then add all records to the workqueue. Or use one transaction per record so you can add the records to the workqueue one by one.
Maintain an in-memory HashSet of IDs of all records being processed. If the ID is in the set but the record is not in-progress, or vice versa, something is very wrong (e.g. task for a record did not complete/crashed).
Set a timestamp when in-progress was set. Have another background process check for records that have been in-progress for too long. Reset the in-progress state if the ID is not in the in-progress HashSet, and your normal process will retry the operation.
Make your tasks idempotent: see if you can find a way for tasks to recognize that work for the record has already been done. This might be a relatively expensive operation, but it guarantees that work is only done once in case of retries.
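The HashSet and timestamp hints can be combined roughly like this. It is a sketch only: `StaleSweeper` and its methods are hypothetical names, the `dbInProgress` map stands in for the result of a query that returns in-progress record IDs with the time the status was set, and a concurrent set is used instead of a plain HashSet because several worker threads would touch it.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names; the database is represented by a plain map.
class StaleSweeper {
    // IDs of records some worker in this process is handling right now.
    private final Set<Long> activeIds = ConcurrentHashMap.newKeySet();

    // Returns false if the record is already being processed (a duplicate).
    boolean begin(long recordId) {
        return activeIds.add(recordId);
    }

    void finish(long recordId) {
        activeIds.remove(recordId);
    }

    // dbInProgress: record ID -> time (millis) at which in-progress was set in the DB.
    // A record is reset only if it has been in-progress too long AND no worker holds it.
    List<Long> idsToReset(Map<Long, Long> dbInProgress, long nowMillis, long timeoutMillis) {
        List<Long> toReset = new ArrayList<>();
        for (Map.Entry<Long, Long> entry : dbInProgress.entrySet()) {
            boolean tooOld = nowMillis - entry.getValue() > timeoutMillis;
            if (tooOld && !activeIds.contains(entry.getKey())) {
                toReset.add(entry.getKey());
            }
        }
        Collections.sort(toReset); // stable order for logging and testing
        return toReset;
    }

    public static void main(String[] args) {
        StaleSweeper sweeper = new StaleSweeper();
        sweeper.begin(1L); // record 1 is slow but still being worked on
        Map<Long, Long> dbInProgress = new HashMap<>();
        dbInProgress.put(1L, 0L);
        dbInProgress.put(2L, 0L);     // abandoned, e.g. after a crash
        dbInProgress.put(3L, 9_000L); // set recently
        System.out.println(sweeper.idsToReset(dbInProgress, 10_000L, 5_000L));
    }
}
```

A worker that gets `false` from `begin` should skip the record, which also guards against the same record reaching the work queue twice.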