Flink: block processing of events with the same id at the same time - java

I have a Flink application that processes a stream of data and writes some results to a database. The stream is keyed by id. A database operation can take quite a long time (e.g. 3 minutes), and only one operation may run per id key to protect against locks. At the moment this sink operation cannot be processed in parallel and has to be set to a parallelism of 1.
process
    .keyBy(new ProductKeySelector())
    .addSink(new ProductSink())
    .setParallelism(1);
I want to lock the stream for the id that is currently being processed, take another event that is out of order, and wait until the same id finishes processing before running it. It would behave like a blocking queue.
Update:
example:
kafkaKeyedStream
    .map(new MapToProductType())
    .keyBy(new ProductKeySelector())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(new ProductAggregateFunction())
    .addSink(new ProductSink());
From the Kafka source I received data:
(screenshot: sample records from the Kafka source, grouped by key; the first value in each record is the key)
As you can see, the data are grouped by the window function (the first value in each record is the key) and the results are processed by the sink function. For this example, let's say processing takes 20 s for each chunk of data. With one thread this is not a problem, because the next chunk waits its turn; but if I set parallelism = 2, the first chunk is still being processed by one thread when, after 10 s, another thread starts processing the next chunk with the same key as the first. And this creates a lock on the database.
I would like that, while one thread is already processing data for a specific key, a second thread does not take data for that same key, but instead takes a different key or does nothing if there is nothing else to do.

If your DB operation could take 3 minutes, you don't want to use a regular JDBC sink. Instead, look at Flink's Async IO support. You'd want to keyBy(id), and then inside of your custom RichAsyncFunction operator you can keep track of whether you've got an active DB request for a given id.
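A minimal sketch of that idea, assuming a hypothetical Product type with a getId() accessor and a blocking saveToDb(...) helper (neither is from the question). It serializes writes per id by chaining each new request behind the previous one for the same key:

import java.util.Collections;
import java.util.concurrent.*;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class ProductAsyncWriter extends RichAsyncFunction<Product, Product> {

    private transient ExecutorService dbExecutor;
    // tail of the chain of pending DB calls per id; a new call for an id only
    // starts once the previous call for that same id has finished
    private transient ConcurrentMap<String, CompletableFuture<Void>> tailById;

    @Override
    public void open(Configuration parameters) {
        dbExecutor = Executors.newFixedThreadPool(10);
        tailById = new ConcurrentHashMap<>();
    }

    @Override
    public void asyncInvoke(Product product, ResultFuture<Product> resultFuture) {
        tailById.compute(product.getId(), (id, previous) -> {
            CompletableFuture<Void> base =
                    previous != null ? previous : CompletableFuture.completedFuture(null);
            return base.thenRunAsync(() -> {
                saveToDb(product);                                     // the slow, minutes-long call
                resultFuture.complete(Collections.singleton(product)); // frees the async slot
            }, dbExecutor);
        });
    }

    @Override
    public void close() {
        dbExecutor.shutdown();
    }

    private void saveToDb(Product product) {
        // blocking JDBC work goes here; a real job also needs error handling
    }
}

It would be wired in roughly like this; the timeout must exceed the slowest write, and the capacity bounds how many elements are in flight per subtask:

AsyncDataStream.unorderedWait(
        process.keyBy(new ProductKeySelector()),
        new ProductAsyncWriter(),
        10, TimeUnit.MINUTES,   // worst-case DB latency plus some headroom
        100);                   // max in-flight elements per subtask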

Related

Maintain the status of events in multithreading

I want to read records (1000k) from 1 table and push them to some service.
So I have clubbed 200 records (based on the service limitations) into 1 event, used the executor framework, and created 10 executors. 10 events (i.e. 10*200 records) will be processed in parallel.
Now I want to maintain the status of these events, like statistics on how many were processed successfully and how many failed.
So I was thinking of
Approach 1:
Before starting the execution, write each event id + record id with a status:
event1 + record1 -> start
and on completion:
event1 + record1 -> end
Later, check how many have both start and end in the file and how many do not have end.
Approach 2 :
Write all record ids in one file with status pending, and write all successful records in another file.
Then check for the records missing from the successful file by using a pivot.
Is there a better way to maintain the status of the records?
In my view, if you want to process items in parallel, it is better to create one log file per batch of records. Why? Because a single file is a bottleneck for multithreading: you need to lock the file to prevent a race condition, and if you lock the file, each thread has to wait until the log file is released; that waiting would nullify all the benefit of multithreaded processing.
So one batch should have one log file.
Create an array and start the threads with a passed id, so that each thread writes to the array cell indexed by its id.
The main thread reads this array and prints it.
You can use a ReadWriteLock (worker threads hold the read lock while writing their own cell, and the main thread holds the write lock while reading the entire array).
You can store anything in this array; it can be very useful.
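A small sketch of that scheme, with a hypothetical Status enum and a dummy workload (none of this is from the answer). Each worker writes only its own cell under the shared read lock, while the reporting thread takes the write lock to snapshot the whole array consistently:

import java.util.Arrays;
import java.util.concurrent.*;
import java.util.concurrent.locks.*;

public class EventStatusBoard {

    enum Status { PENDING, RUNNING, SUCCESS, FAILED }

    private final Status[] statuses;
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    EventStatusBoard(int workerCount) {
        statuses = new Status[workerCount];
        Arrays.fill(statuses, Status.PENDING);
    }

    // each worker touches only its own cell, so the shared *read* lock is enough
    void update(int workerId, Status status) {
        lock.readLock().lock();
        try {
            statuses[workerId] = status;
        } finally {
            lock.readLock().unlock();
        }
    }

    // the reporter takes the *write* lock to get a consistent view of all cells
    Status[] snapshot() {
        lock.writeLock().lock();
        try {
            return statuses.clone();
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int workers = 10;
        EventStatusBoard board = new EventStatusBoard(workers);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            final int id = i;
            pool.submit(() -> {
                board.update(id, Status.RUNNING);
                // ...push this event's 200 records to the service here...
                board.update(id, Status.SUCCESS);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(Arrays.toString(board.snapshot()));
    }
}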

Apache flink multi-threading/parallel execution

The input stream consists of data in JSON array of objects format.
Each object has one field/key named state by which we need to separate the input stream, see below example
Object1 -> "State":"Active"
Object2 -> "State":"Idle"
Object3 -> "State":"Blocked"
Object4 -> "State":"Active"
We have to start processing (a thread) as soon as we receive a particular state and keep consuming the data; if a new state is the same as the previous state, let the previous thread handle it, otherwise start a new thread for the new state. It is also required that each thread runs for a finite time and that all threads run in parallel.
Please suggest how I can do this in Apache Flink. Pseudocode and links would be helpful.
This can be done with Flink's Datastream API. Each JSON object can be treated as a tuple, which can be processed with any of the Flink Operators.
               /----- * ----- | Active
--- (KeyBy) --+------ * ----- | Idle
               \----- * ----- | Blocked
Now you can split the single data stream into multiple streams using the KeyBy operator. This operator groups all the tuples with a particular key (State in your case) into a keyed stream, which is processed in parallel. Internally, this is implemented with hash partitioning.
Any new keys (states) are handled dynamically, as new keyed streams are created for them.
Explore the documentation for implementation purposes.
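A rough sketch of that keyBy, assuming the JSON has already been parsed into a minimal Event POJO (the class, its field and the print sink are illustrative, not from the question):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class SplitByState {

    // minimal POJO standing in for one parsed JSON object
    public static class Event {
        public String state;
        public Event() { }
        public Event(String state) { this.state = state; }
        public String getState() { return state; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Event> events = env.fromElements(
                new Event("Active"), new Event("Idle"),
                new Event("Blocked"), new Event("Active"));

        events
                .keyBy(Event::getState)   // one keyed partition per State value
                .process(new KeyedProcessFunction<String, Event, String>() {
                    @Override
                    public void processElement(Event e, Context ctx, Collector<String> out) {
                        // events with the same State always land on the same parallel
                        // instance, so each state is handled by a single "thread"
                        out.collect("handled " + e.getState());
                    }
                })
                .print();

        env.execute("split-by-state");
    }
}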
From your description, I believe you'd need to first have an operator with a parallelism of 1, that "chunks" events by the state, and adds a "chunk id" to the output record. Whenever you get an event with a new state, you'd increment the chunk id.
Then key by the chunk id, which will parallelize downstream processing. Add a custom function which is keyed by the chunk id, and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
And as @narush noted above, you should read through the documentation he linked to, so you understand how windows work in Flink.
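A hedged sketch of that chunking step, reusing the Event stream from the sketch above; ChunkedEvent, its accessors and MyChunkWindowFunction are placeholders, not code from the answer:

// assigns a chunk id serially, so this operator must run with parallelism 1
DataStream<ChunkedEvent> chunked = events
        .process(new ProcessFunction<Event, ChunkedEvent>() {
            private String lastState;
            private long chunkId;

            @Override
            public void processElement(Event e, Context ctx, Collector<ChunkedEvent> out) {
                if (!e.getState().equals(lastState)) {
                    chunkId++;                       // a new state starts a new chunk
                    lastState = e.getState();
                }
                out.collect(new ChunkedEvent(chunkId, e));
            }
        })
        .setParallelism(1);

chunked
        .keyBy(ChunkedEvent::getChunkId)             // parallelism comes back downstream
        .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))
        .process(new MyChunkWindowFunction());       // bulk of the data processing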

Is there a way to retry a Bolt in Storm?

We have an app that does database saves. If the save fails, is there a way to retry just the bolt that failed? We don't want to fail all the way back to the spout.
You could add an output "scorpion tail" stream to the bolt. The stream would be read by whichever bolt would begin the retry process. This would create a loop in the topology. The idea is that when a failure occurs, you can write a packet of information to this stream and have the tuple delivered to the upstream bolt that would begin the retry. The packet contains whatever state is needed for the retry.
There is no built-in support for this in Storm. However, you can code your own solution:
Do not ack (or fail) the failing tuple; buffer it in an internal data structure (i.e. a member variable, maybe a List) and return from execute().
Keep processing further tuples in execute() until you want to retry (perhaps based on a timer, i.e. the current timestamp, or on a retry counter).
On retry, before processing the new input tuple, take the failed tuple from your buffer and try to insert it into the DB. If it fails again, put it back into the buffer. If the insert succeeds, ack the buffered tuple and resume processing the current input tuple.
You only need to consider Storm's MESSAGE_TIMEOUT. Retrying cannot take longer than this value, because if a tuple is not acked within the timeout, Storm fails the tuple at the source automatically.
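A rough sketch of that buffering approach (DbClient and its insert(...) method are placeholders; a real bolt would also add a retry timer and respect MESSAGE_TIMEOUT as noted above):

import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class RetryingDbBolt extends BaseRichBolt {

    private OutputCollector collector;
    private final List<Tuple> failedTuples = new LinkedList<>();
    private transient DbClient db;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.db = new DbClient();          // placeholder for your DB access layer
    }

    @Override
    public void execute(Tuple input) {
        // first, retry anything that failed earlier
        Iterator<Tuple> it = failedTuples.iterator();
        while (it.hasNext()) {
            Tuple buffered = it.next();
            if (tryInsert(buffered)) {
                collector.ack(buffered);   // ack only once the insert finally succeeds
                it.remove();
            }
        }
        // then process the current tuple
        if (tryInsert(input)) {
            collector.ack(input);
        } else {
            failedTuples.add(input);       // neither ack nor fail: keep it for later
        }
    }

    private boolean tryInsert(Tuple t) {
        try {
            db.insert(t.getValues());
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // no output stream: this bolt only writes to the database
    }
}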

Stream a database recordset to multiple thread workers

I have a process which requires streaming data from a database and passing the records off to an external server for processing before returning the results to store back in the database.
Get database row from table A
Hand off to external server
Receive result
insert database row into table B
Currently this is a single-threaded operation, and the bottleneck is the external server process and so I would like to improve performance by using other instances of the external server process to handle requests.
Get 100 database rows from table A
For each row
    Hand off to external server 1
    Receive result
    Insert database row into table B

In parallel, get 100 database rows from table A
For each row
    Hand off to external server 2
    Receive result
    Insert database row into table B
Problem 1
I have been investigating Java thread pools, and dispatching records to the external servers this way, however I'm not sure how to fetch records from the database as quickly as possible without the workers rejecting new tasks. Can this be done with thread pools? What architecture should be used to achieve this?
Problem 2
At present I have optimised the database inserts by using batch statements and only executing once 2000 records have been processed. Would it be possible to adopt a similar approach within the workers?
Any help in structuring a solution to this problem would be greatly appreciated.
Based on your comments, I think the key point is controlling the count of pending tasks. You have several options:
Do an estimate of the number of records in your data set, then decide on a batch size that will produce a reasonable number of tasks. For example, if you want to limit the pending task count to 100 and you have 100K records, you can use a batch size of 1K; if you have 1M records, set the batch size to 10K.
Supply your own bounded BlockingQueue to the threadpool (a small sketch of this option follows after the semaphore example below). If you haven't done this before, you should probably study the java.util.concurrent package carefully first.
Or you can use a java.util.concurrent.Semaphore, which is a simpler facility than a user supplied queue:
Declare a semaphore with your pending task count limit
Semaphore mySemaphore = new Semaphore(max_pending_task_count);
Since your task generation is fast, you can use a single thread to generate all tasks. In your task generating thread:
while (hasMoreTasks()) {
    // this will block if you've reached the count limit
    mySemaphore.acquire();

    // generate a new task only after acquire;
    // the new task must have a reference to the Semaphore
    Task task = new Task(..., mySemaphore);
    threadpool.submit(task);
}

// now that you've generated all tasks, wait for them to finish
// (you may have a better way to detect that, however)
while (mySemaphore.availablePermits() < max_pending_task_count) {
    Thread.sleep(some_time);
}

// now, go ahead dealing with the results
In your Task thread:
public void run() {
    ...
    // when finished, release a permit, which increases the available count
    // by 1 and lets the task generator thread produce 1 more task
    mySemaphore.release();
}
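For the bounded-queue option mentioned earlier, a minimal sketch (not from the original answer) could look like this; the queue caps the number of pending tasks, and CallerRunsPolicy makes the submitting thread run a task itself instead of having it rejected when the queue is full:

import java.util.concurrent.*;

public class BoundedPoolFactory {
    public static ExecutorService create() {
        return new ThreadPoolExecutor(
                10, 10,                                     // fixed pool of 10 workers
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(100),              // at most 100 queued tasks
                new ThreadPoolExecutor.CallerRunsPolicy()); // producer slows down when full
    }
}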

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of event processing has to be assured: I need to process in timestamp order, where the timestamp is generated by an embedded device inside the vehicle.
What I have now?
A REST servlet that is called by the consumer and creates a Task (event data is in the payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?
You need to approach this with three separate steps:
Implement a Sharding Counter to generate a monotonically
increasing ID. As much as I like to use the timestamp from
Google's server to indicate task ordering, it appears that timestamps
between GAE servers might vary more than what your requirement is.
Add your tasks to a Pull Queue instead of a Push Queue. When
constructing your TaskOption, add the ID obtained from Step #1 as a tag.
After adding the task, store the ID somewhere on your datastore.
Have your worker servlet lease Tasks by a certain tag from the Pull Queue.
Query the datastore to get the earliest ID that you need to fetch, and use the ID as
the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumptions on the Backend.
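A hedged sketch of steps #2 and #3 with the App Engine task queue API; the queue name, the nextIdToProcess() datastore lookup and handle(...) are assumptions, not part of the answer:

import java.util.List;
import java.util.concurrent.TimeUnit;

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;

public class OrderedEventQueue {

    private static final Queue PULL_QUEUE = QueueFactory.getQueue("event-pull-queue");

    // Step 2: enqueue the event as a pull task, tagged with its ordering ID
    public static void enqueue(long orderingId, byte[] eventPayload) {
        PULL_QUEUE.add(TaskOptions.Builder
                .withMethod(TaskOptions.Method.PULL)
                .tag(Long.toString(orderingId))
                .payload(eventPayload));
        // ...also persist orderingId in the datastore so the worker can find it
    }

    // Step 3: the worker leases the task whose tag matches the earliest stored ID
    public static void processNext() {
        long nextId = nextIdToProcess();   // datastore query for the smallest pending ID
        List<TaskHandle> tasks = PULL_QUEUE.leaseTasksByTag(
                60, TimeUnit.SECONDS, 1, Long.toString(nextId));
        for (TaskHandle task : tasks) {
            handle(task.getPayload());
            PULL_QUEUE.deleteTask(task);   // delete only after successful processing
        }
    }

    private static long nextIdToProcess() { return 0L; /* datastore query goes here */ }

    private static void handle(byte[] payload) { /* deserialize and store the event */ }
}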
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be viable for generating monotonically increasing IDs, and another source of IDs would then be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use those in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.
I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.
Ok. This is how I've done it.
1) A REST servlet that is called by the consumer:
    If the event sequence doesn't match the vehicle sequence (from the datastore)
        Create a task on a "wait" queue to call me again
    else
        Validate the state
        Create a task on the "regular" queue (event data is in the payload)
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.
Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follow:
when creating the task, don't put the data in the payload but in the datastore, in a kind ordered by timestamp and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
run a generic task that fetches the data for the earliest timestamp, processes it, and then deletes it, inside a transaction.
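A rough sketch of that pattern with the low-level datastore API; the "VehicleEvent" kind, the property names and handle(...) are illustrative assumptions:

import java.util.Iterator;

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Transaction;

public class NextEventProcessor {

    private final DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

    public void processOldestEvent(Key vehicleKey) {
        Transaction txn = datastore.beginTransaction();
        try {
            // ancestor query: the child event with the smallest timestamp for this vehicle
            Query q = new Query("VehicleEvent")
                    .setAncestor(vehicleKey)
                    .addSort("timestamp", Query.SortDirection.ASCENDING);
            Iterator<Entity> it = datastore.prepare(txn, q)
                    .asIterator(FetchOptions.Builder.withLimit(1));
            if (it.hasNext()) {
                Entity oldest = it.next();
                handle(oldest);                          // apply the event to the vehicle
                datastore.delete(txn, oldest.getKey());  // consumed, so remove it
            }
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }

    private void handle(Entity event) {
        // update the Vehicle entity from the event's properties here
    }
}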
Following this thread, I am unclear as to whether the strict FIFO requirement applies to all transactions received, or only on a per-vehicle basis. The latter leaves more options than the former.
