How to implement a lightweight database-backed FIFO queue in Java

We have a servlet-based application that uploads files.
Uploaded files are saved on the server; after that, each file is delegated for processing and later inserted into our MongoDB.
Files are large (greater than 50 MB) and processing takes 30 minutes to 1 hour depending on server load.
The problem happens when multiple files get processed at a time (in separate threads, of course): it eventually slows down the system, and finally one of the threads gets aborted, which we can never trace.
So we are now planning a multiple-producer / single-consumer approach, where file jobs are queued one by one, and the consumer takes them from the queue one by one, sequentially.
We also need clustering capability to be implemented in the application later on.
For this approach we are planning to implement the process below:
When a file job comes in, we will put it in a Mongo collection with status NEW.
Next, it will call the consumer thread immediately.
The consumer will check whether there is already a task with status RUNNING.
If there is no RUNNING task, it will start the new task.
Upon completion, before ending, the consumer will check the collection again; if there are any tasks with status NEW, it will take the oldest one in FIFO order by checking the timestamp, and the process continues.
If there is a currently running task, it will simply insert the new task into the DB. Since there is already a running consumer, that thread will take care of the new job inserted into the DB when the current process ends.
This way, we can also ensure that it will run smoothly in a clustered environment without any additional configuration.
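A minimal sketch of the consumer's claim step with the MongoDB Java driver, assuming a jobs collection with status and createdAt fields (all names here are illustrative, not from the actual code). findOneAndUpdate is atomic on a single document, so the NEW-to-RUNNING transition stays safe even when several cluster nodes poll at once:

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.FindOneAndUpdateOptions;
import com.mongodb.client.model.Sorts;
import com.mongodb.client.model.Updates;
import org.bson.Document;

public class FileJobConsumer {

    // Atomically claim the oldest NEW job, or return null if none is waiting.
    static Document claimNextJob(MongoCollection<Document> jobs) {
        return jobs.findOneAndUpdate(
                Filters.eq("status", "NEW"),
                Updates.set("status", "RUNNING"),
                new FindOneAndUpdateOptions()
                        .sort(Sorts.ascending("createdAt"))); // FIFO by insertion timestamp
    }

    public static void main(String[] args) {
        MongoCollection<Document> jobs = MongoClients.create("mongodb://localhost")
                .getDatabase("app").getCollection("fileJobs"); // illustrative names

        // Drain the queue sequentially; exits when no NEW jobs remain,
        // matching the "consumer ends when nothing is pending" design above.
        Document job;
        while ((job = claimNextJob(jobs)) != null) {
            processFile(job); // the existing file-processing + Mongo insert logic
            jobs.updateOne(Filters.eq("_id", job.get("_id")),
                    Updates.set("status", "DONE"));
        }
    }

    static void processFile(Document job) { /* ... */ }
}

Because every node claims work through the same atomic operation, at worst two nodes race and the loser simply receives the next job or null.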
There are message-queue-based solutions such as RabbitMQ or ActiveMQ, but we need to minimize additional component configuration.
Let me know if our approach is correct, or whether there is a better solution out there.
Thanks,

Related

How to safely unsubscribe from a topic in Kafka

I have a simple Java program (dockerized) deployed in Kubernetes (as a pod).
This Java program is just a normal Java project that listens to and consumes from a specific topic, e.g. SAMPLE-SAFE-TOPIC.
I have to unsubscribe from this topic safely, meaning no data will be lost even if I delete this pod (the Java consumer).
This is the code that I saw from searching:
public static void unsubscribeSafelyFromKafka() {
    logger.debug("Safely unsubscribing from topic..");
    if (myKafkaConsumer != null) {
        myKafkaConsumer.unsubscribe();
        myKafkaConsumer.close();
    }
}
I need to run this via the command line, where the Java program already has an existing static main method.
My questions are:
Does the code above guarantee that no records will be lost?
How can I trigger the code above via the command line when there is already an existing static main()?
Note: I am running the Java project via the command line, e.g. java -jar MyKafkaConsumer.jar, as this is the requirement.
Please help.
If I understand question 1 correctly, you are concerned that, after unsubscribing via one thread (triggered by a console command), there is a risk that the polling consumer is processing a batch of records that might be lost if the pod is killed.
If you have other pods that are consuming as part of the same consumer group, or if this or any pod subscribes again with the same group ID then the last committed offset will ensure that no records are lost (though some could be processed more than once) as that is where the consumer that takes over will start from.
If you use auto-commit, that is safest, as each commit happens in a subsequent poll, so you cannot possibly commit records that haven't been processed (as long as you don't spawn additional threads to do the processing). Manual commit leaves it to you to decide when records have been dealt with and hence when it is safe to commit.
However, calling close after unsubscribe is a good idea and should ensure a clean completion of the current polled batch and commit of the final offsets as long as that all happens within a timeout period.
Re question 2, if you need to manually unsubscribe, then I think you'd need JMX, or to expose an API or similar, to call a method on the running JVM. However, if you are just trying to ensure a safe shutdown when the pod terminates, you could unsubscribe in a shutdown hook, or just not worry, given the safety provided by offset commits.
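For the shutdown-hook route, a common pattern looks roughly like the sketch below (myKafkaConsumer is assumed to be the already-subscribed consumer from the question). Kubernetes sends SIGTERM when a pod is deleted, which triggers JVM shutdown hooks; since KafkaConsumer is not thread-safe, the hook only calls wakeup(), and the unsubscribe/close happen on the polling thread itself:

import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.errors.WakeupException;

public class SafeShutdownConsumer {

    // Assumed to be the already-subscribed consumer from the question.
    static KafkaConsumer<String, String> myKafkaConsumer;

    public static void main(String[] args) {
        final Thread pollingThread = Thread.currentThread();

        Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
            public void run() {
                // wakeup() is the only KafkaConsumer method that is safe
                // to call from another thread.
                myKafkaConsumer.wakeup();
                try {
                    pollingThread.join(); // let the poll loop finish its batch
                } catch (InterruptedException ignored) { }
            }
        }));

        try {
            while (true) {
                ConsumerRecords<String, String> records =
                        myKafkaConsumer.poll(Duration.ofMillis(500));
                // process records here...
            }
        } catch (WakeupException e) {
            // expected: wakeup() was called by the shutdown hook
        } finally {
            myKafkaConsumer.unsubscribe();
            myKafkaConsumer.close(); // commits final offsets before the JVM exits
        }
    }
}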

Need suggestions about Java thread pool execution queue processing

In my application we have a number of client databases, and every hour we get new data for processing in those databases.
A cron job checks these databases and picks up the data, then creates a thread pool that starts executing 30 threads in parallel, with the remaining tasks stored in a queue.
It takes several hours to process all these tasks.
So during execution, if new data arrives, it has to wait, because the cron job will not pick up the newly arrived data until its current execution has finished.
Sometimes we have priority data for processing, but in this case those clients also need to wait several hours for their data to be processed.
Please give me a suggestion to avoid this wait state for newly arrived data.
(I am working with Java 1.7, Tomcat 7 and SQL Server 2012.)
Thank you in advance.
Please let me know if more information is needed.
Each of your threads should process data in bulk (for example 100/1000 records), and these records should be selected from the DB by priority. Each time you select new records for processing, the data with the highest priority goes first.
I can't create a comment yet :(
For this problem we are thinking about two solutions:
1. Create more than one thread pool, for processing normal and high-priority data separately.
2. Create more than one Tomcat instance with the same code, for processing normal and priority data.
But I don't understand which solution is best for my case, 1 or 2.
Please give me suggestions about the above solutions, so that I can make a decision.
You can use Executors.newCachedThreadPool(), which returns an ExecutorService.
Benefits of using a cached thread pool :
The pool creates new threads if needed but reuses previously constructed threads if they are available.
Only if no threads are available for reuse will a new thread be created and added to the pool.
Threads that have not been used for more than sixty seconds are terminated and removed from the cache. Hence a pool which has not been used long enough will not consume any resources.
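If the real goal is to let priority data jump ahead of the backlog, a third option worth considering is a single pool backed by a PriorityBlockingQueue instead of the default FIFO queue. A sketch, compatible with Java 1.7 (class and field names are illustrative):

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PriorityJobPool {

    // A runnable that carries a priority; lower value = runs sooner.
    static class PrioritizedTask implements Runnable, Comparable<PrioritizedTask> {
        final int priority;
        final Runnable work;

        PrioritizedTask(int priority, Runnable work) {
            this.priority = priority;
            this.work = work;
        }

        public void run() { work.run(); }

        public int compareTo(PrioritizedTask other) {
            return Integer.compare(this.priority, other.priority);
        }
    }

    public static void main(String[] args) {
        // 30 workers, with the backlog ordered by priority instead of FIFO.
        BlockingQueue<Runnable> backlog = new PriorityBlockingQueue<Runnable>();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                30, 30, 0L, TimeUnit.MILLISECONDS, backlog);

        // Use execute(), not submit(): submit() wraps tasks in a FutureTask,
        // which is not Comparable and would break the priority queue.
        pool.execute(new PrioritizedTask(0, new Runnable() {
            public void run() { System.out.println("high-priority client"); }
        }));
        pool.execute(new PrioritizedTask(5, new Runnable() {
            public void run() { System.out.println("normal client"); }
        }));

        pool.shutdown();
    }
}

Compared with two separate pools, this keeps all 30 workers busy on whatever is most urgent, instead of reserving capacity that may sit idle.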

Producer - consumer using MySQL DB

My requirement is as follows
Maintain a pool of records in a table (MySQL DB).
A job acts as a producer and fills up this pool if the number of entries goes below a certain threshold. The job runs every 15 mins.
There can be multiple consumers with each consumer picking up just one record each. Two consumers coming in at the same time should get two different records.
The producer should not block the consumer. So while the producer job is running consumers should be able to pick up any available rows.
The producer / consumer is part of the application code, which in turn is a JBoss application.
In order to ensure that each consumer picks a distinct record (in case of concurrency), we do the following:
We use an integer column as an index.
A consumer will first update the record with the lowest index value with its own name.
It will then select and pick up that record and proceed with that.
This approach ensures that two consumers do not end up with the same record.
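That claim-then-select step might look like the JDBC sketch below (table and column names are illustrative). MySQL executes the UPDATE ... ORDER BY ... LIMIT 1 as a single statement, which is what keeps the stamping atomic:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class RecordClaimer {

    // Stamp the lowest-index unclaimed row with our name, then read it back.
    // Because the UPDATE is a single statement, two consumers can never
    // stamp (and therefore pick up) the same row.
    static ResultSet claim(Connection conn, String consumerName) throws SQLException {
        PreparedStatement update = conn.prepareStatement(
                "UPDATE record_pool SET owner = ? " +
                "WHERE owner IS NULL ORDER BY idx ASC LIMIT 1");
        update.setString(1, consumerName);
        int claimed = update.executeUpdate();
        if (claimed == 0) {
            return null; // pool is empty
        }

        PreparedStatement select = conn.prepareStatement(
                "SELECT * FROM record_pool WHERE owner = ? ORDER BY idx ASC LIMIT 1");
        select.setString(1, consumerName);
        return select.executeQuery();
    }
}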
One problem we are seeing is that while the producer is filling up the pool, consumers get blocked. Since the producer can take some time to complete, all consumers in that period are blocked, as the consumer's update waits for the producer's insert to complete.
Is there any way to resolve this scenario? Any other approach to design this is also welcome.
Is it a hard requirement that you use a relational database as a queue? This seems like a bad approach to the problem, especially since the problem has already been addressed by message queues. You could use MySQL to persist the state of your queue, but it won't make a good queue itself.
Take a look at ActiveMQ or JBoss Messaging (given that you are using JBoss)

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 sec, 1 minute or 1 hour). The order of event processing has to be assured: I need to process in timestamp order, with the timestamp generated by an embedded device inside the vehicle.
What I have now?
A REST servlet that is called by the consumer and creates a Task (Event data is in the payload).
After this, a worker servlet gets this Task and:
Deserialize Event data;
Put Event on Datastore;
Update Vehicle On Datastore.
So, again, is there any way to assure just FIFO behavior? Or how can I improve this solution to get this?
You need to approach this with three separate steps:
Implement a sharding counter to generate a monotonically increasing ID. As much as I'd like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than your requirement allows.
Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from step #1 as a tag. After adding the task, store the ID somewhere in your datastore.
Have your worker servlet lease tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use that ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you have finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumption on a Backend.
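Steps 2 and 3 might look roughly like this with the App Engine Task Queue API (the queue name and the tag scheme are illustrative):

import com.google.appengine.api.taskqueue.Queue;
import com.google.appengine.api.taskqueue.QueueFactory;
import com.google.appengine.api.taskqueue.TaskHandle;
import com.google.appengine.api.taskqueue.TaskOptions;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class FifoTaskQueue {

    static final Queue PULL_QUEUE = QueueFactory.getQueue("vehicle-events"); // illustrative name

    // Step 2: enqueue a pull task tagged with the monotonically increasing ID.
    static void enqueue(long id, byte[] eventPayload) {
        PULL_QUEUE.add(TaskOptions.Builder
                .withMethod(TaskOptions.Method.PULL)
                .tag(Long.toString(id))
                .payload(eventPayload));
    }

    // Step 3: lease the task carrying the earliest outstanding ID
    // (the ID itself comes from the datastore query, not shown here).
    static void consume(long earliestId) {
        List<TaskHandle> tasks = PULL_QUEUE.leaseTasksByTag(
                60, TimeUnit.SECONDS, 1, Long.toString(earliestId));
        for (TaskHandle task : tasks) {
            // process task.getPayload() ...
            PULL_QUEUE.deleteTask(task); // don't forget to delete after processing
        }
    }
}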
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be viable for generating monotonically increasing IDs, and another source of IDs would then be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use those in lieu of a monotonically increasing ID?
Regardless of the way the IDs are generated, the basic idea is to use the datastore's query mechanism to produce a FIFO ordering of tasks, and use the task tag to pull a specific task from the TaskQueue.
There is a caveat, though. Due to the eventual-consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should; the M/S configuration is deprecated as of April 4th, 2012), there might be some stale data returned by the query in step #2.
I think the simple answer is "no"; however, partly in order to improve the situation, I am using a pull queue, pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them, put them into the datastore, and then complete a batch at a time. You still have to work out what to do with the tasks at the beginning and end of each batch, because they might be out of order relative to interleaving tasks in other batches.
OK. This is how I've done it.
1) REST servlet that is called by the consumer:
    if the Event sequence doesn't match the Vehicle sequence (from the datastore):
        create a task on a "wait" queue to call me again
    else:
        validate state
        create a task on the "regular" queue (Event data is in the payload)
2) A worker servlet gets the task from the "regular" queue, and so on (same pseudocode).
This way I can pause the "regular" queue in order to do data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
You can put the work to be done in a row in the datastore with a creation timestamp and then fetch work tasks by that timestamp, but if your tasks are created too quickly you will run into latency issues.
Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function execute in the order submitted. You would likely need an engineer from Google to get an answer. Pull queues, as suggested, seem a good alternative; plus, this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.
The best way to handle this, the distributed way or "App Engine way", is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
When creating the task, don't put the data in the payload but in the datastore, in a Kind ordered on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
Run a generic task that fetches the data for the earliest timestamp, processes it, and then deletes it, inside a transaction.
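A sketch of that fetch-process-delete task with the low-level datastore API (kind, key and property names are illustrative, and it assumes at least one pending event exists); the ancestor query is what makes the read strongly consistent inside the transaction:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;
import com.google.appengine.api.datastore.Query;
import com.google.appengine.api.datastore.Transaction;

public class EventDrainTask {

    // Fetch and process the earliest pending event for one vehicle,
    // inside a transaction on the vehicle's entity group.
    static void processEarliestEvent(String vehicleId) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        Key vehicleKey = KeyFactory.createKey("Vehicle", vehicleId);
        Transaction txn = ds.beginTransaction();
        try {
            Query q = new Query("Event", vehicleKey) // ancestor query: strongly consistent
                    .addSort("timestamp", Query.SortDirection.ASCENDING);
            Entity earliest = ds.prepare(txn, q)
                    .asList(FetchOptions.Builder.withLimit(1)).get(0);
            // process earliest.getProperty("payload") here...
            ds.delete(txn, earliest.getKey());
            txn.commit();
        } finally {
            if (txn.isActive()) {
                txn.rollback();
            }
        }
    }
}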
Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. The latter has more options than the former.

How to design a system that queues requests & processes them in batches?

I have at my disposal a REST service that accepts a JSON array of image URLs, and will return scaled thumbnails.
Problem
I want to batch up image URLs sent by concurrent clients before calling the REST service.
Obviously if I receive 1 image, I should wait a moment in case other images trickle in.
I've settled on a batch of 5 images. But the question is, how do I design it to take care of these scenarios:
If I receive x images, such that x < 5, how do I time out from waiting when no new images arrive in the next few minutes?
If I use a queue to buffer incoming image URLs, I will probably need to lock it to prevent clients from concurrently writing while I'm busy reading my batches of 5. What data structure is good for this? A BlockingQueue?
The data structure is not what's missing. What's missing is an entity: a timer task, I'd say, which you stop and restart every time you send a batch of images to your service. You do this whether you send them because you had 5 (incidentally, I assume that 5 is just your starting number and it'll be configurable, along with your timeout), or because the timeout task fired.
So there are two entities running: a main thread which receives requests, queues them, checks the queue depth, and if it's 5 or more, sends the oldest 5 to the service (and restarts the timer task); and the timer task, which picks up incomplete batches and sends them on.
Side note: that main thread seems to have several responsibilities, so some decomposition might be in order.
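A sketch of that timer-plus-queue design (the batch size, timeout, and service call are placeholders):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class ImageBatcher {

    private final int batchSize = 5;         // configurable
    private final long timeoutSeconds = 120; // configurable
    private final BlockingQueue<String> urls = new LinkedBlockingQueue<String>();
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private ScheduledFuture<?> pendingFlush;

    // Called by concurrent clients; the BlockingQueue handles the locking.
    public synchronized void submit(String imageUrl) {
        urls.add(imageUrl);
        if (urls.size() >= batchSize) {
            flush();
        } else {
            restartTimer(); // incomplete batch: (re)arm the timeout
        }
    }

    private void restartTimer() {
        if (pendingFlush != null) {
            pendingFlush.cancel(false);
        }
        pendingFlush = timer.schedule(new Runnable() {
            public void run() { flushIncomplete(); }
        }, timeoutSeconds, TimeUnit.SECONDS);
    }

    private synchronized void flushIncomplete() { flush(); }

    private void flush() {
        List<String> batch = new ArrayList<String>();
        urls.drainTo(batch, batchSize); // take the oldest 5 (or fewer on timeout)
        if (!batch.isEmpty()) {
            callThumbnailService(batch); // your REST call goes here
        }
    }

    private void callThumbnailService(List<String> batch) { /* ... */ }
}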
Well, what you could do is have the clients send a special string to the queue, indicating that they are done sending image URLs. So if the last element in the queue is that string, you know that there are no URLs left.
If you have multiple clients and you know the number of clients, you can count the number of indicators in the queue to check whether all of the clients are finished.
1- As an example, if your Java web app is running on Google App Engine, you could write each client request to the datastore, have a cron job (i.e. a scheduled task, in GAE speak) read the datastore, build a batch and send it.
2- For the concurrency/locking aspect, then again you could rely on the GAE datastore to provide atomicity.
Of course, feel free to disregard my proposal if GAE isn't an option.
