Java Fork/Join Pool with priority queue?


I need to process many PDF files. I have a list of files (from a folder or a zip file) and want one subtask per PDF. Each of those then creates one subtask per page, so the pages can be processed.
I was thinking of using a fork/join pool, but it just keeps creating more subtasks to read more files, and I run out of memory.
Sometimes I get many small files, sometimes I get large files with many pages. It makes no sense to load more documents when there are already many pages queued up to be processed.
1) Each PDF file from a folder is read and a subtask (2) is created, forked, and joined.
2) For each page a subtask (3) is created, forked, and joined.
3) Process this page.
There's ForkJoinTask.helpQuiesce(), which might be good enough in some situations. I can just call ForkJoinTask.helpQuiesce() after creating some subtasks. This way the subtasks are more likely to be processed before more data is loaded.
But I can't find anything to set the priority of a subtask. Wouldn't that be a lot easier? If I understand the documentation correctly, there is one submission queue and then one task queue per worker thread. Is there no way to control which tasks from the submission queue are processed first? I can pass a factory for the worker threads, but not for the submission queue.
Like in the divide-and-conquer metaphor: It might make more sense to plunder all cities before you invade a new country or even a new continent, so you get enough resources needed for those tasks. But how is this controlled?
I know Fork/Join uses work stealing and you usually don't have to bother. But I need to build a batch processing tool, and I can't have it load gigabytes of data into memory before it even begins processing any of the pages. I also don't need a framework like Hadoop for a bunch of PDF files. That would be overkill.
I could use a PriorityQueue<E>, but that seems to be a lot more work as this is only a simple data structure, while Fork/Join is a framework.
Is there no way of controlling the order in which tasks are processed? What am I missing? Is there some other priority queue based solution available in Java?
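One priority-queue-based option that exists in the JDK is a ThreadPoolExecutor backed by a PriorityBlockingQueue. The following is only a minimal sketch under assumptions: the class name PriorityPdfDemo, the two priority levels, and the fake "load" and "page" work are made up for illustration, and a single worker thread is used so the resulting order is easy to follow. Page tasks outrank file tasks, so a new file is loaded only when no page is waiting.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class PriorityPdfDemo {
    static final int PAGE = 0; // pages already in memory run first
    static final int FILE = 1; // another file is loaded only when no page is waiting
    static final AtomicLong SEQ = new AtomicLong();

    /** Runnable ordered by priority, then by creation order (the queue does not break ties itself). */
    static final class Prioritized implements Runnable, Comparable<Prioritized> {
        final int priority;
        final long seq = SEQ.getAndIncrement();
        final Runnable body;

        Prioritized(int priority, Runnable body) {
            this.priority = priority;
            this.body = body;
        }

        @Override public void run() { body.run(); }

        @Override public int compareTo(Prioritized other) {
            int byPriority = Integer.compare(priority, other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    /** Simulates a batch and returns the order in which the tasks actually ran. */
    static List<String> run(List<String> files, int pagesPerFile) {
        List<String> log = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(files.size() * (1 + pagesPerFile));
        // A single worker keeps the demo deterministic; a real tool would use several.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new PriorityBlockingQueue<>());
        for (String file : files) {
            // Use execute(), not submit(): submit() wraps tasks in a non-Comparable FutureTask.
            pool.execute(new Prioritized(FILE, () -> {
                log.add("load " + file); // hypothetical: open the PDF here
                for (int p = 1; p <= pagesPerFile; p++) {
                    int page = p;
                    pool.execute(new Prioritized(PAGE, () -> {
                        log.add("page " + file + "#" + page); // hypothetical: process one page
                        done.countDown();
                    }));
                }
                done.countDown();
            }));
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return log;
    }

    public static void main(String[] args) {
        run(List.of("a.pdf", "b.pdf"), 2).forEach(System.out::println);
    }
}
```

Note that this only controls the order of execution. The queue itself is unbounded, so it helps with memory only because queued file tasks hold nothing but a file name until they run.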

Related

Load data into threads of Java ExecutorService

I am writing a server that responds to queries of the same type. As I will have several clients, I want to perform tasks in parallel. Before performing a task I need to load static data from a file, and this takes most of the time. Once that is done I can answer any number of queries using the loaded data.
I want to use Executors.newFixedThreadPool(n) as my ExecutorService, so it manages all the multithreading stuff for me. If I understand correctly, the threads are created once and then all my tasks are run on these threads. So it would fit my problem ideally if it is possible to load the data into every thread when it is created and use it for all tasks that run on that thread.
Is this possible?
Another idea is to create an array of several copies of the same data, each with a boolean isInUse. As there will be a fixed number of tasks running in parallel, a task just selects a copy that is free at the moment, marks it as taken, and marks it as free again when the task ends.
But I think I will need to synchronise this boolean flag between threads somehow.
(I need several copies of the data because it can be modified while a task runs, but it is returned to its initial state after the task is finished.)
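One way to get "one copy of the data per pool thread" is a ThreadLocal that loads the data lazily the first time each worker thread needs it. This is a sketch under assumptions: loadStaticData() is a hypothetical stand-in for the expensive file load, the int array stands in for the real data, and the LOADS counter exists only to show how often the load runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

class PerThreadDataDemo {
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for the expensive load from file.
    static int[] loadStaticData() {
        LOADS.incrementAndGet();
        return new int[] {1, 2, 3};
    }

    // Each worker thread lazily gets its own private copy on first use.
    static final ThreadLocal<int[]> DATA =
            ThreadLocal.withInitial(PerThreadDataDemo::loadStaticData);

    static int answer(int query) {
        int[] data = DATA.get();
        int saved = data[0];
        data[0] += query; // the task may modify its thread's copy...
        try {
            return data[0] + data[1] + data[2];
        } finally {
            data[0] = saved; // ...and restores it before the next task runs
        }
    }

    static List<Integer> runQueries(int threads, int queries) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int q = 0; q < queries; q++) {
                int query = q;
                futures.add(pool.submit(() -> answer(query)));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because every thread only ever touches its own copy, no isInUse flag and no synchronisation of the data are needed in this variant.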

Java - Thread-safe hash collection with no duplicates that is recoverable after crash

I am new to Java, and I'm developing an application that needs to do the following. I need to read a high volume of files on multiple threads and perform a series of tasks on each file. I cannot process duplicate files, and I need to know when I've attempted a duplicate file so that I can rename/move the file from our watch folder and send a notification of the attempt. Additionally, any of these tasks could result in the file being aborted, so to speak, in which case we can make another attempt after changes to the file have been made. This also needs to be recoverable in the event of a crash so that no file is missed or mishandled. The tasks could take anywhere from a few milliseconds to multiple seconds or possibly even minutes to complete.
My application currently uses multiple ExecutorServices and PriorityBlockingQueues, one for each task to be performed, with an in-memory HashSet to keep track of which files we've already seen and to prevent duplicates. At present, with each new file, I check whether HashSet.add() returns true or false. If false, I rename the duplicate. Each file that is added to the HashSet is added to the first task's queue, where multiple threads churn away on that queue and then either abort the file or pass it to the next task's queue, where multiple threads churn away, and so on.
My thought was that I would continue down this road but with the following changes. 1.) The HashSet will be loaded from disk at startup and then the workflow will proceed as I've outlined above. 2.) Once all tasks are completed, I'll write any file that isn't aborted to the HashSet on disk, and it will remain in the in-memory HashSet as well.
There will be multiple threads that add to the HashSet on disk and multiple threads that add to the HashSet in memory. As I stated, at times we'll want to reattempt a file that was previously aborted. For that step we may require a change to one of the file's properties that our contains() check evaluates, so we don't necessarily need the ability to remove aborted files from the HashSet. We may instead require that such a file is presented to us as what we perceive to be a new file.
All that said, I'm not looking for anyone to write or review my code so I won't post it here. Can I have multiple threads call HashSet.add() with no repercussions?
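For context, java.util.HashSet is not thread-safe, so unsynchronized add() calls from several threads can corrupt it. A concurrent set such as the one returned by ConcurrentHashMap.newKeySet() keeps the same "add() returns false for a duplicate" contract. The sketch below is illustrative only: the class name, the file names, and the counter are assumptions, and it does not cover the on-disk part.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class DuplicateFilterDemo {
    /** Returns how many of the given file names were accepted as "first seen". */
    static int countAccepted(List<String> fileNames, int threads) {
        Set<String> seen = ConcurrentHashMap.newKeySet(); // thread-safe set view
        AtomicInteger accepted = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String name : fileNames) {
            pool.execute(() -> {
                // add() is atomic: exactly one thread sees true for a given name.
                if (seen.add(name)) {
                    accepted.incrementAndGet(); // hand the file to the first task queue here
                }
                // else: duplicate, rename/move it and send the notification
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for workers");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return accepted.get();
    }
}
```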

java threads and memory mgmt - short vs long lived

I'm chasing some memory issues in an app that pulls file names from a Kafka queue and does some processing on each. The app runs in Docker with one instance per partition.
Each instance has a single consumer handle that retrieves the next file name and puts it into an ArrayBlockingQueue. Meanwhile there are several threads that take the next file from this queue and do the processing. I'm using this secondary queuing because each file can take some time to copy and process (there are instances of "exponential backoff", i.e. a thread may be sleeping), so it seemed prudent to have several 'in the pipeline' simultaneously.
My question is about the relative benefits (with regard to memory management) of doing it this way (several 'permanent' threads reading from a shared queue) vs launching a new thread for each file as it gets pulled from the queue. In this alternative track I would imagine a FixedThreadPool that would generate a new thread as each file was pulled from Kafka.
Is there any advantage to one method vs the other?
Edit:
My primary concern is minimizing GC time. I want to avoid having anything substantial promoted to old-gen. This makes me think that the second model is the better way to go.

Reading huge file in Java

I read a huge file (almost 5 million lines). Each line contains a date and a request, and I must parse the requests between two specific dates. I use a BufferedReader to read the file up to the start date and then start parsing lines. Can I use threads for parsing the lines, because it takes a lot of time?

It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
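The "seek to the end of the last record that was parsed" idea can be sketched with a RandomAccessFile and a remembered byte offset. This is a minimal illustration, not a tuned implementation: the class name and the sample lines are assumptions, RandomAccessFile.readLine() is unbuffered and treats each byte as one character, and a partially written last line would be returned as if it were complete.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

class IncrementalReaderDemo {
    private long offset = 0; // byte position just past the last line already parsed

    /** Returns only the lines appended since the previous call. */
    List<String> readNewLines(Path file) {
        List<String> lines = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(offset); // skip everything parsed earlier
            String line;
            while ((line = raf.readLine()) != null) {
                lines.add(line); // update the database or in-memory structure here
            }
            offset = raf.getFilePointer();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Writes two lines, reads them, appends a third, and reads again. */
    static List<List<String>> demo() {
        try {
            Path file = Files.createTempFile("requests", ".log");
            IncrementalReaderDemo reader = new IncrementalReaderDemo();
            Files.writeString(file, "2020-01-01 GET /a\n2020-01-02 GET /b\n");
            List<String> first = reader.readNewLines(file);
            Files.writeString(file, "2020-01-03 GET /c\n", StandardOpenOption.APPEND);
            List<String> second = reader.readNewLines(file);
            Files.delete(file);
            return List.of(first, second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```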

Firstly, 5 million lines of 1000 characters is only 5 GB, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits, then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map based on the date. So every date is a key in the map and points to a list of line numbers which contain the requests. You can then go direct to the relevant line numbers.
Something of the form
TreeMap<Date, ArrayList<Integer>>
would do nicely. The line numbers alone take on the order of 5,000,000 * 32 / 8 bytes = 20 MB (boxed Integer objects add some overhead on top of that), which should be fine.
You could also use the FileChannel class to keep the I/O handle open as you jump from one line to another. It also allows memory mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
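The ordered date-to-line-numbers map described above could look roughly like this. It is a sketch under assumptions: each line is assumed to start with an ISO date (yyyy-MM-dd), java.time.LocalDate is used in place of Date, and the class and method names are made up.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

class DateIndexDemo {
    /** Maps each date to the (0-based) numbers of the lines that carry it. */
    static NavigableMap<LocalDate, List<Integer>> buildIndex(List<String> lines) {
        NavigableMap<LocalDate, List<Integer>> index = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            // Assumption: the first 10 characters are the date, e.g. "2020-01-02".
            LocalDate date = LocalDate.parse(lines.get(i).substring(0, 10));
            index.computeIfAbsent(date, d -> new ArrayList<>()).add(i);
        }
        return index;
    }

    /** Line numbers of all requests with from <= date <= to, in date order. */
    static List<Integer> linesBetween(NavigableMap<LocalDate, List<Integer>> index,
                                      LocalDate from, LocalDate to) {
        List<Integer> result = new ArrayList<>();
        // subMap is a cheap range view; only dates inside the range are visited.
        for (List<Integer> numbers : index.subMap(from, true, to, true).values()) {
            result.addAll(numbers);
        }
        return result;
    }
}
```

A sorted map is what makes the range lookup cheap; a plain HashMap would require checking every key.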

A good way to parallelize a lot of small tasks is to wrap the processing of each task in a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initialized with the number of CPU cores your system has available.
When you call executor.execute(future), the future will be queued for background processing. To avoid creating and destroying too many threads, the ThreadPoolExecutor will only create as many threads as you specified and execute the futures one after another.
To retrieve the result of a future, call future.get(). If the future hasn't completed yet (or hasn't even started), this method blocks until it has. Other futures keep executing in the background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it otherwise keeps around until the keepalive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
    create new FutureTask which parses that line
    pass future task to executor
    add future task to a list
for each entry in task list
    call entry.get() to retrieve result
executor.shutdown()
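The pseudocode above translates fairly directly into Java. In this sketch, parse() is a hypothetical placeholder that just returns the trimmed line length, and the class name is made up. Executors.newFixedThreadPool is used as a convenient way to obtain a ThreadPoolExecutor sized to the number of cores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.FutureTask;

class ParallelParseDemo {
    // Hypothetical parser: a real one would extract the date and the request.
    static int parse(String line) {
        return line.trim().length();
    }

    static List<Integer> parseAll(List<String> lines) {
        ExecutorService executor =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        try {
            List<FutureTask<Integer>> tasks = new ArrayList<>();
            for (String line : lines) {
                FutureTask<Integer> task = new FutureTask<>(() -> parse(line));
                executor.execute(task); // queued for background processing
                tasks.add(task);
            }
            List<Integer> results = new ArrayList<>();
            for (FutureTask<Integer> task : tasks) {
                results.add(task.get()); // blocks until this task has finished
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown(); // let the worker threads terminate
        }
    }
}
```

Collecting the results in submission order keeps them aligned with the input lines even though the parsing itself may finish out of order.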

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?

Is there a way to assure FIFO (first in, first out) behavior with Task Queues on GAE?
GAE Documentation says that FIFO is one of the factors that affect task execution order, but the same documentation says that “the system's scheduling may 'jump' new tasks to the head of the queue” and I have confirmed this behavior with a test. The effect: my events are being processed out of order.
The docs say:
https://developers.google.com/appengine/docs/java/taskqueue/overview-push
The order in which tasks are executed depends on several factors:
The position of the task in the queue. App Engine attempts to process tasks based on FIFO (first in, first out) order. In general, tasks are inserted into the end of a queue, and executed from the head of the queue.
The backlog of tasks in the queue. The system attempts to deliver the lowest latency possible for any given task via specially optimized notifications to the scheduler. Thus, in the case that a queue has a large backlog of tasks, the system's scheduling may "jump" new tasks to the head of the queue.
The value of the task's etaMillis property. This property specifies the earliest time that a task can execute. App Engine always waits until after the specified ETA to process push tasks.
The value of the task's countdownMillis property. This property specifies the minimum number of seconds to wait before executing a task. Countdown and eta are mutually exclusive; if you specify one, do not specify the other.
What do I need to do? In my use case, I'll process 1-2 million events/day coming from vehicles. These events can be sent at any interval (1 second, 1 minute or 1 hour). The order of event processing has to be assured. I need to process events in timestamp order; the timestamp is generated on an embedded device inside the vehicle.
What do I have now?
A REST servlet that is called by the consumer and creates a Task (the Event data is in the payload).
After this, a worker servlet gets this Task and:
Deserializes the Event data;
Puts the Event in the Datastore;
Updates the Vehicle in the Datastore.
So, again, is there any way to assure plain FIFO behavior? Or how can I improve this solution to achieve it?

You need to approach this with three separate steps:
1) Implement a Sharding Counter to generate a monotonically increasing ID. As much as I like to use the timestamp from Google's server to indicate task ordering, it appears that timestamps between GAE servers might vary more than what your requirement is.
2) Add your tasks to a Pull Queue instead of a Push Queue. When constructing your TaskOptions, add the ID obtained from Step #1 as a tag. After adding the task, store the ID somewhere on your datastore.
3) Have your worker servlet lease Tasks by a certain tag from the Pull Queue. Query the datastore to get the earliest ID that you need to fetch, and use the ID as the lease tag. In this way, you can simulate FIFO behavior for your task queue.
After you finished your processing, delete the ID from your datastore, and don't forget to delete the Task from your Pull Queue too. Also, I would recommend you run your task consumptions on the Backend.
UPDATE:
As noted by Nick Johnson and mjaggard, sharding in step #1 doesn't seem to be a viable way to generate monotonically increasing IDs, so another source of IDs would be needed. I seem to recall you were using timestamps generated by your vehicles; would it be possible to use these in lieu of a monotonically increasing ID?
Regardless of the way to generate the IDs, the basic idea is to use datastore's query mechanism to produce a FIFO ordering of Tasks, and use task Tag to pull specific task from the TaskQueue.
There is a caveat, though. Due to the eventual consistency read policy on high-replication datastores, if you choose HRD as your datastore (and you should, the M/S is deprecated as of April 4th, 2012), there might be some stale data returned by the query on step #2.

I think the simple answer is "no", however partly in order to help improve the situation, I am using a pull queue - pulling 1000 tasks at a time and then sorting them. If timing isn't important, you could sort them and put them into the datastore and then complete a batch at a time. You've still got to work out what to do with the tasks at the beginning and ends of the batch - because they might be out of order with interleaving tasks in other batches.

Ok. This is how I've done it.
1) REST servlet that is called from the consumer:
    If Event sequence doesn't match Vehicle sequence (from datastore)
        Creates a task on a "wait" queue to call me again
    else
        State validation
        Creates a task on the "regular" queue (Event data is on payload).
2) A worker servlet gets the task from the "regular" queue, and so on... (same pseudo code)
This way I can pause the "regular" queue in order to do a data maintenance without losing events.
Thank you for your answers. My solution is a mix of them.
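The sequence check can be illustrated with a small in-memory model. This is only a sketch under assumptions: plain Java collections stand in for the datastore and both queues, events are assumed to carry a per-vehicle sequence number starting at 0, and out-of-order events are parked and released later instead of being re-enqueued as retry tasks. All names are hypothetical and no App Engine API is used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SequenceGateDemo {
    // Stand-in for the Vehicle sequence kept in the datastore.
    private final Map<String, Integer> expectedSeq = new HashMap<>();
    // Stand-in for the "wait" queue: parked events per vehicle, sorted by sequence.
    private final Map<String, TreeMap<Integer, String>> waiting = new HashMap<>();
    // Stand-in for the "regular" queue, in the order events were released.
    private final List<String> regular = new ArrayList<>();

    void receive(String vehicle, int seq, String payload) {
        TreeMap<Integer, String> parked =
                waiting.computeIfAbsent(vehicle, v -> new TreeMap<>());
        parked.put(seq, payload);
        int expected = expectedSeq.getOrDefault(vehicle, 0);
        // Release every event that is now next in line for this vehicle.
        while (parked.containsKey(expected)) {
            regular.add(vehicle + ":" + parked.remove(expected));
            expected++;
        }
        expectedSeq.put(vehicle, expected);
    }

    List<String> regularQueue() {
        return regular;
    }
}
```

Ordering here is per vehicle only: events of different vehicles may interleave freely, which is usually what keeps such a gate from becoming a global bottleneck.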

You can put the work to be done in a row in the datastore with a create timestamp and then fetch work tasks by that timestamp, but if your tasks are being created too quickly you will run into latency issues.

Don't know the answer myself, but it may be possible that tasks enqueued using a deferred function might execute in order submitted. Likely you will need an engineer from G. to get an answer. Pull queues as suggested seem a good alternative, plus this would allow you to consider batching your put()s.
One note about sharded counters: they increase the probability of monotonically increasing ids, but do not guarantee them.

The best way to handle this, the distributed way or "App Engine way" is probably to modify your algorithm and data collection to work with just a timestamp, allowing arbitrary ordering of tasks.
Assuming this is not possible or too difficult, you could modify your algorithm as follows:
when creating the task, don't put the data in the payload but in the datastore, in a Kind with an ordering on timestamps and stored as a child entity of whatever entity you're trying to update (Vehicle?). The timestamps should come from the client, not the server, to guarantee the same ordering.
run a generic task that fetches the data for the first timestamp, processes it, and then deletes it, inside a transaction.

Following this thread, I am unclear as to whether the strict FIFO requirement is for all transactions received, or on a per-vehicle basis. The latter offers more options than the former.
