Launching a Runnable from Writer of Spring Batch partitioned step - java

I have a Spring Batch job consisting of a partitioned step, and the partitioned step does its processing in chunks.
Can I launch additional threads (implementing Runnable) from the method public void write(List<? extends VO> itemsToWrite)?
Basically, the writer here builds indexes using Lucene, and since the writer receives a List of chunk-size items, I thought of dividing that List into segments and passing each segment to a new Runnable.
Is that a good approach?
I coded a sample and it works most of the time, but it gets stuck a few times.
Is there anything I need to worry about? Or is there something built into Spring Batch to achieve this?
I don't want the write for a whole chunk to happen on a single thread; I'd like to divide the chunk up further.
Lucene's IndexWriter is thread safe, and an approach is listed here.
Sample code - the writer gets a List of items for which I spawn threads from a thread pool. Will there be any concern, even if I wait for the pool to terminate for each chunk?
@Override
public void write(List<? extends IndexerInputVO> inputItems) throws Exception {
    int docsPerThread = Constants.NUMBER_OF_DOCS_PER_INDEX_WRITER_THREADS;
    int docSize = inputItems.size();
    int remainder = docSize % docsPerThread;
    int poolSize = docSize / docsPerThread;
    ExecutorService executor = Executors.newFixedThreadPool(poolSize + 1);
    int fromIndex = 0;
    int toIndex = docsPerThread;
    if (docSize < docsPerThread) {
        executor.submit(new IndexWriterRunnable(this.luceneObjects, service, inputItems));
    } else {
        for (int i = 1; i <= poolSize; i++) {
            executor.submit(new IndexWriterRunnable(this.luceneObjects, service, inputItems.subList(fromIndex, toIndex)));
            fromIndex += docsPerThread;
            toIndex += docsPerThread;
        }
        if (remainder != 0) {
            toIndex = docSize;
            executor.submit(new IndexWriterRunnable(this.luceneObjects, service, inputItems.subList(fromIndex, toIndex)));
        }
    }
    executor.shutdown();
    while (!executor.isTerminated()) {
        // busy-wait until all index-writing tasks for this chunk have finished
    }
}

I'm not sure that launching new threads in the writer is a good idea.
These threads are outside the scope of the Spring Batch framework, so you will need to implement a shutdown and cancellation policy for them. If processing of one segment fails, it can lead to the failure of the entire queue of segments.
As an alternative approach, I can suggest promoting your custom list segments from the writer to the next step, as described in the official docs under passingDataToFutureSteps.
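A minimal sketch of that promotion pattern, reusing the IndexerInputVO type from the question; the key name "indexSegments" and the partition helper are illustrative assumptions, not part of the original code:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemWriter;

// The writer stores its segments in the step ExecutionContext; an
// ExecutionContextPromotionListener registered on the step (configured with
// setKeys(new String[] {"indexSegments"})) then copies them to the job
// ExecutionContext so a following, multi-threaded step can pick them up.
public class SegmentingWriter implements ItemWriter<IndexerInputVO> {

    private StepExecution stepExecution;

    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        this.stepExecution = stepExecution;
    }

    @Override
    public void write(List<? extends IndexerInputVO> items) throws Exception {
        List<List<IndexerInputVO>> segments = partition(items);
        stepExecution.getExecutionContext().put("indexSegments", segments);
    }

    private List<List<IndexerInputVO>> partition(List<? extends IndexerInputVO> items) {
        // placeholder segmentation: split however you like (e.g. Guava's Lists.partition)
        return Collections.singletonList(new ArrayList<>(items));
    }
}
The next step can then read "indexSegments" from the job ExecutionContext and hand the segments to a multi-threaded or partitioned step, keeping all threading under Spring Batch's control.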

Related

Using lmax Disruptor (3.0) in java to process millions of documents

I have the following use-case:
When my service starts, it may need to deal with millions of documents in as short a burst as possible. There will be three sources of data.
I have set up the following:
// batchSize = 100, bufferSize = 2^30
public MyDisruptor(@NonNull final MyDisruptorConfig config) {
    batchSize = config.getBatchSize();
    bufferSize = config.getBufferSize();
    this.eventHandler = config.getEventHandler();
    ThreadFactory threadFactory = createThreadFactory("disruptor-threads-%d");
    executorService = Executors.newSingleThreadExecutor(threadFactory);
    ringBuffer = RingBuffer.createMultiProducer(new EventFactory(), bufferSize, new YieldingWaitStrategy());
    sequenceBarrier = ringBuffer.newBarrier();
    batchEventProcessor = new BatchEventProcessor<>(ringBuffer, sequenceBarrier, eventHandler);
    ringBuffer.addGatingSequences(batchEventProcessor.getSequence());
    executorService.submit(batchEventProcessor);
}
public void consume(@NonNull final List<Document> documents) {
    List<List<Document>> subLists = Lists.partition(documents, batchSize);
    for (List<Document> subList : subLists) {
        log.info("publishing sublist of size {}", subList.size());
        long high = ringBuffer.next(subList.size());
        long low = high - (subList.size() - 1);
        long position = low;
        for (Document document : subList) {
            ringBuffer.get(position++).setEvent(document);
        }
        ringBuffer.publish(low, high);
        lastPublishedSequence.set(high);
    }
}
Each of my sources calls consume; I use Guice to create a singleton disruptor.
My eventHandler routine is
public void onEvent(Event event, long sequence, boolean endOfBatch) throws Exception {
    Document document = event.getValue();
    handler.processDocument(document); // send the document to the handler
    if (endOfBatch) {
        handler.processDocumentsList(); // tell the handler to process all documents so far
    }
}
I am seeing in my logs that the producer (consume) is stalling at times. I assume that this is when the ringBuffer is full, and the eventHandler is not able to process quickly enough. I see that the eventHandler is processing documents (from my logs) and then after a while the producer starts publishing more documents to the ring buffer.
Questions:
Am I using the correct Disruptor pattern? I see there are quite a few ways to use it. I chose to use the batchEventProcessor so it would signal endOfBatch.
How can I increase the efficiency of my EventHandler? processDocumentsList can be slow.
Should I use parallel EventHandlers? The lmax user-guide mentions that this is possible, and the FAQ has a question on it. But how do I use this with the batchEventProcessor? It only takes one eventHandler.
Is your handler stateful? If not, you can use multiple parallel event handlers to process the documents. You could implement a basic sharding strategy where only one of the handlers processes each event.
endOfBatch is usually used to speed up processing by optimising I/O operations that benefit from batching, e.g. writing to a file on each event but only flushing on endOfBatch.
It's hard to give any more advice without knowing what happens in your document processor.
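That said, a minimal sketch of the sharding idea mentioned above, assuming one handler instance per shard (each on its own BatchEventProcessor) and reusing the Event type from the question; the Handler type name is assumed:
import com.lmax.disruptor.EventHandler;

// Each handler only processes sequences belonging to its shard, so every
// document is handled exactly once while the shards run in parallel.
public class ShardedDocumentHandler implements EventHandler<Event> {

    private final long ordinal;       // which shard this handler owns
    private final long totalHandlers; // total number of parallel handlers
    private final Handler handler;    // the document handler from the question (type name assumed)

    public ShardedDocumentHandler(long ordinal, long totalHandlers, Handler handler) {
        this.ordinal = ordinal;
        this.totalHandlers = totalHandlers;
        this.handler = handler;
    }

    @Override
    public void onEvent(Event event, long sequence, boolean endOfBatch) throws Exception {
        if (sequence % totalHandlers != ordinal) {
            return; // another shard owns this sequence
        }
        handler.processDocument(event.getValue());
        if (endOfBatch) {
            handler.processDocumentsList(); // flush whatever this shard has accumulated
        }
    }
}
Note that each handler still sees every sequence; it simply skips the ones it does not own, and endOfBatch then only flushes that shard's accumulated documents.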

Massive tasks alternative pattern for Runnable or Callable

For massively parallel computing I tend to use executors and callables. When I have thousands of objects to compute, I don't feel good about instantiating thousands of Runnables, one for each object.
So I have two approaches to solve this:
I. Split the workload into a small number of x workers, giving them y objects each (splitting the object list into x partitions of y/x size each).
public static <V> List<List<V>> partitions(List<V> list, int chunks) {
    final ArrayList<List<V>> lists = new ArrayList<List<V>>();
    final int size = Math.max(1, list.size() / chunks + 1);
    final int listSize = list.size();
    for (int i = 0; i <= chunks; i++) {
        final List<V> vs = list.subList(Math.min(listSize, i * size), Math.min(listSize, i * size + size));
        if (vs.size() == 0) break;
        lists.add(vs);
    }
    return lists;
}
II. Create x workers which fetch objects from a queue (rough sketch below).
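A rough sketch of what II could look like (not part of my actual code): x workers draining a shared queue that holds all of the objects; compute(V) stands in for the real per-object work:
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueWorkers<V> {

    public void process(List<V> objects, int workers) throws InterruptedException {
        BlockingQueue<V> queue = new LinkedBlockingQueue<>(objects);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                V item;
                // poll() returns null once the pre-filled queue is drained
                while ((item = queue.poll()) != null) {
                    compute(item);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void compute(V item) {
        // placeholder for the per-object computation
    }
}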
Questions:
Is creating thousands of Runnables really expensive and to be avoided?
Is there a generic pattern/recommendation for how to do it with solution II?
Are you aware of a different approach?
Creating thousands of Runnables (objects implementing Runnable) is no more expensive than creating normal objects.
Creating and running thousands of Threads can be very heavy, but you can use Executors with a pool of threads to solve this problem.
As for the different approach, you might be interested in java 8's parallel streams.
Combining various answers here :
Is creating thousands of Runnables really expensive and to be avoided?
No, it's not in and of itself. It's how you will make them execute that may prove costly (spawning a few thousand threads certainly has its cost).
So you would not want to do this :
List<Computation> computations = ...
List<Thread> threads = new ArrayList<>();
for (Computation computation : computations) {
    Thread thread = new Thread(new Computation(computation));
    threads.add(thread);
    thread.start();
}
// If you need to wait for completion:
for (Thread t : threads) {
    t.join();
}
Because it would 1) be unnecessarily costly in terms of OS resources (native threads, each having a stack on the heap), 2) spam the OS scheduler with a vastly concurrent workload, most certainly leading to plenty of context switches and associated cache invalidations at the CPU level, and 3) be a nightmare for catching and dealing with exceptions (your threads should probably define an UncaughtExceptionHandler, and you'd have to deal with it manually).
You'd probably prefer an approach where a finite Thread pool (of a few threads, "a few" being closely related to your number of CPU cores) handles many many Callables.
List<Computation> computations = ...
ExecutorService pool = Executors.newFixedThreadPool(someNumber);
List<Future<Result>> results = new ArrayList<>();
for (Computation computation : computations) {
    results.add(pool.submit(new ComputationCallable(computation)));
}
for (Future<Result> result : results) {
    doSomething(result.get());
}
The fact that you reuse a limited number of threads should yield a really nice improvement.
Is there a generic pattern/recommendation for how to do it with solution II?
There are. First, your partition code (turning a List into a List<List>) can be found in collection tools such as Guava, with more generic and fail-proofed implementations.
But more than this, two patterns come to mind for what you are trying to achieve:
Use the Fork/Join Pool with Fork/Join tasks (that is, spawn a task with your whole list of items, and each task will fork sub tasks with half of that list, up to the point where each task manages a small enough list of items). It's divide and conquer. See: http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ForkJoinTask.html
If your computation were to be "add integers from a list", it could look like (there might be a boundary bug in there, I did not really check) :
public static class Adder extends RecursiveTask<Integer> {
    protected List<Integer> globalList;
    protected int start;
    protected int stop;

    public Adder(List<Integer> globalList, int start, int stop) {
        super();
        this.globalList = globalList;
        this.start = start;
        this.stop = stop;
        System.out.println("Creating for " + start + " => " + stop);
    }

    @Override
    protected Integer compute() {
        if (stop - start > 1000) {
            // Too many arguments, we split the list
            Adder subTask1 = new Adder(globalList, start, start + (stop - start) / 2);
            Adder subTask2 = new Adder(globalList, start + (stop - start) / 2, stop);
            subTask2.fork();
            return subTask1.compute() + subTask2.join();
        } else {
            // Manageable size of arguments, we deal with them in place
            int result = 0;
            for (int i = start; i < stop; i++) {
                result += globalList.get(i);
            }
            return result;
        }
    }
}
public void doWork() throws Exception {
    List<Integer> computation = new ArrayList<>();
    for (int i = 0; i < 10000; i++) {
        computation.add(i);
    }
    ForkJoinPool pool = new ForkJoinPool();
    RecursiveTask<Integer> masterTask = new Adder(computation, 0, computation.size());
    Future<Integer> future = pool.submit(masterTask);
    System.out.println(future.get());
}
Use Java 8 parallel streams in order to launch multiple parallel computations easily (under the hood, Java parallel streams are backed by the common Fork/Join pool).
Others have shown what this might look like.
Are you aware of a different approach?
For a different take on concurrent programming (without explicit task / thread handling), have a look at the actor pattern. https://en.wikipedia.org/wiki/Actor_model
Akka comes to mind as a popular implementation of this pattern...
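Purely as an illustration (the Computation message type and the worker count are invented for the example), a classic-Akka sketch might look like this:
import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.RoundRobinPool;

// Illustrative only: a pool of worker actors, each processing one Computation message at a time.
public class ActorExample {

    // the message type is assumed; replace with your own computation payload
    static class Computation {
        final int value;
        Computation(int value) { this.value = value; }
    }

    static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .match(Computation.class, c -> {
                        // do the actual work here
                    })
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("computations");
        // a round-robin router spreading messages over a handful of workers
        ActorRef workers = system.actorOf(new RoundRobinPool(4).props(Props.create(Worker.class)), "workers");
        for (int i = 0; i < 10_000; i++) {
            workers.tell(new Computation(i), ActorRef.noSender());
        }
        // shut the system down once processing is done (omitted here for brevity)
    }
}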
@Aaron is right, you should take a look at Java 8's parallel streams:
void processInParallel(List<V> list) {
    list.parallelStream().forEach(item -> {
        // do something
    });
}
If you need to specify chunks, you could use a ForkJoinPool as described here:
void processInParallel(List<V> list, int chunks) {
    ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
    forkJoinPool.submit(() -> {
        list.parallelStream().forEach(item -> {
            // do something with each item
        });
    });
}
You could also have a functional interface as an argument:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
    ForkJoinPool forkJoinPool = new ForkJoinPool(chunks);
    forkJoinPool.submit(() -> {
        list.parallelStream().forEach(item -> processor.accept(item));
    });
}
Or in shorthand notation:
void processInParallel(List<V> list, int chunks, Consumer<V> processor) {
    new ForkJoinPool(chunks).submit(() -> list.parallelStream().forEach(processor::accept));
}
And then you would use it like:
processInParallel(myList, 2, item -> {
    // do something with each item
});
Depending on your needs, note that ForkJoinPool#submit() returns an instance of ForkJoinTask, which is a Future, so you can use it to check the status of your task or wait for it to finish.
You'd most probably want the ForkJoinPool instantiated only once (not instantiated on every method call) and then reused, to prevent choking the CPU if the method is called multiple times.
Is creating thousands of Runnables really expensive and to be avoided?
Not at all. The Runnable/Callable interfaces each have only one method to implement, and the amount of "extra" code in each task depends on the code you are running; it is certainly no fault of the Runnable/Callable interfaces.
Is there a generic pattern/recommendation for how to do it with solution II?
Pattern 2 is more favorable than pattern 1. This is because pattern 1 assumes that each worker will finish at exactly the same time. If some workers finish before other workers, they could just be sitting idle, since they are only able to work on the y/x-size queues you assigned to each of them. In pattern 2, however, you will never have idle worker threads (unless the end of the work queue is reached and numWorkItems < numWorkers).
An easy way to use the preferred pattern, pattern 2, is to use the ExecutorService invokeAll(Collection<? extends Callable<T>> list) method.
Here is an example usage:
List<Callable<Object>> workList = ...; // a single list of all of your work
ExecutorService es = Executors.newCachedThreadPool();
es.invokeAll(workList);
Fairly readable and straightforward usage, and the ExecutorService implementation will automatically use solution 2 for you, so you know that each worker thread's time is used to the fullest.
Are you aware of a different approach?
Solutions 1 and 2 are two common approaches for generic work. Now, there are many different implementations available for you to choose from (such as java.util.concurrent, Java 8 parallel streams, or Fork/Join pools), but the concept behind each implementation is generally the same. The only exception is if you have specific tasks in mind with non-standard running behavior.

Best way to write huge number of files

I am writing lots of files like below.
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0)
        throws Exception {
    // TODO Auto-generated method stub
    while (arg0.hasNext()) {
        Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        System.out.println(tuple2._1().toString());
        PrintWriter writer = new PrintWriter("/home/suv/junk/sparkOutPut/" + tuple2._1().toString(), "UTF-8");
        writer.println(new String(tuple2._2().getBytes()));
        writer.close();
    }
}
Is there any better way to write the files, without closing or creating a PrintWriter every time?
There is no significantly better way to write lots of files. What you are doing is inherently I/O intensive.
UPDATE - @Michael Anderson is right, I think. Using multiple threads to write the files will (probably) speed things up considerably. However, the I/O is still going to be the ultimate bottleneck, in a couple of respects:
Creating, opening and closing files involves file & directory metadata access and update. This entails non-trivial CPU.
The file data and metadata changes need to be written to disc. That is possibly multiple disc writes.
There are at least 3 syscalls for each file written.
Then there are thread switching overheads.
Unless the quantity of data written to each file is significant (multiple kilobytes per file), I doubt that the techniques like using NIO, direct buffers, JNI and so on will be worthwhile. The real bottlenecks will be in the kernel: file system operations and low-level disk I/O.
... without closing or creating printwriter every time.
No. You need to create a new PrintWriter ( or Writer or OutputStream ) for each file.
However, this ...
writer.println(new String(tuple2._2().getBytes()));
... looks rather peculiar. You appear to be:
calling getBytes() on a String (?),
converting the byte array to a String
calling the println() method on the String, which will copy it and then convert it back into bytes before finally outputting them.
What gives? What is the point of the String -> bytes -> String conversion?
I'd just do this:
writer.println(tuple2._2());
This should be faster, though I wouldn't expect the percentage speed-up to be that large.
I'm assuming you're after the fastest way. Because everyone knows fastest is best ;)
One simple way is to use a bunch of threads to do your writing for you.
However, you're not going to get much benefit by doing this unless your filesystem scales well. (I use this technique on Lustre-based cluster systems, and in cases where "lots of files" could mean 10k - in this case many of the writes will be going to different servers / disks.)
The code would look something like this: (Note I think this version is not right as for small numbers of files this fills the work queue - but see the next version for the better version anyway...)
public void call(Iterator<Tuple2<Text, BytesWritable>> arg0) throws Exception {
    int nThreads = 5;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);
    int nJobs = 0;
    while (arg0.hasNext()) {
        ++nJobs;
        final Tuple2<Text, BytesWritable> tuple2 = arg0.next();
        ecs.submit(new Callable<Void>() {
            @Override
            public Void call() throws Exception {
                System.out.println(tuple2._1().toString());
                String path = "/home/suv/junk/sparkOutPut/" + tuple2._1().toString();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(tuple2._2().getBytes()));
                }
                return null;
            }
        });
    }
    for (int i = 0; i < nJobs; ++i) {
        ecs.take().get();
    }
}
Better yet is to start writing your files as soon as you have data for the first one, not when you've got data for all of them - and for this writing to not block the calculation thread(s).
To do this you split your application into several pieces communicating over a (thread safe) queue.
Code then ends up looking more like this:
public void main() {
    SomeMultithreadedQueue<Data> queue = ...;
    int nGeneratorThreads = 1;
    int nWriterThreads = 5;
    int nThreads = nGeneratorThreads + nWriterThreads;
    ExecutorService threadPool = Executors.newFixedThreadPool(nThreads);
    ExecutorCompletionService<Void> ecs = new ExecutorCompletionService<>(threadPool);
    AtomicInteger completedGenerators = new AtomicInteger(0);

    // Start some generator threads.
    for (int i = 0; i < nGeneratorThreads; ++i) {
        ecs.submit(() -> {
            while (...) {
                Data d = ...;
                queue.push(d);
            }
            if (completedGenerators.incrementAndGet() == nGeneratorThreads) {
                queue.push(null);
            }
            return null;
        });
    }

    // Start some writer threads
    for (int i = 0; i < nWriterThreads; ++i) {
        ecs.submit(() -> {
            Data d;
            while ((d = queue.take()) != null) {
                String path = d.path();
                try (PrintWriter writer = new PrintWriter(path, "UTF-8")) {
                    writer.println(new String(d.getBytes()));
                }
            }
            return null;
        });
    }

    for (int i = 0; i < nThreads; ++i) {
        ecs.take().get();
    }
}
Note I've not provided an implementation of the queue class; you can easily wrap the standard Java thread-safe ones to get what you need (a rough sketch follows).
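For instance, a minimal sketch of such a wrapper, assuming a generic payload type; java.util.concurrent queues reject nulls, so a private sentinel stands in for the null end-of-stream marker that the writer loop checks for:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SomeMultithreadedQueue<T> {

    private static final Object POISON = new Object();

    private final BlockingQueue<Object> delegate = new LinkedBlockingQueue<>();

    public void push(T item) throws InterruptedException {
        // a null push from the generators means "end of stream"
        delegate.put(item != null ? item : POISON);
    }

    @SuppressWarnings("unchecked")
    public T take() throws InterruptedException {
        Object item = delegate.take();
        if (item == POISON) {
            delegate.put(POISON); // re-insert so every writer thread sees the marker
            return null;
        }
        return (T) item;
    }
}
Re-inserting the sentinel means a single null push is enough to shut down all of the writer threads in turn.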
There's still lots more that can be done to reduce latency, etc. - here are some of the further things I've used to get the times down ...
don't even wait for all the data to be generated for a given file. Pass another queue containing packets of bytes to write.
Watch out for allocations - you can reuse some of your buffers.
There's some latency in the nio stuff - you can get some performance improvements by using C writes and JNI and direct buffers.
Thread switching can hurt, and the latency in the queues can hurt, so you might want to batch up your data slightly. Balancing this with the first point can be tricky.

Java multithreading and iterators, should be simple, beginner

First I'd like to say that I'm working my way up from python to more complicated code. I'm now on to Java and I'm extremely new. I understand that Java is really good at multithreading which is good because I'm using it to process terabytes of data.
The data input is simply fed into an iterator, and I have a class that encapsulates a run function that takes one line from the iterator, does some analysis, and then writes the analysis to a file. The only bit of info the threads have to share with each other is the name of the object they are writing to. Simple, right? I just want each thread executing the run function simultaneously so we can iterate through the input data quickly. In Python it would be simple.
from multiprocessing import Pool

f = open('someoutput.csv','w');

def run(x):
    f.write(analyze(x))

p = Pool(8);
p.map(run, iterator_of_input_data);
So in Java, I have my 10K lines of analysis code and can very easily iterate through my input, passing each item to my run function, which in turn calls all my analysis code and sends the results to an output object.
public class cool {
    ...
    public static void run(Input input, File output) {
        Analysis an = new Analysis(input, output);
    }

    public static void main(String args[]) throws Exception {
        Iterator iterator = new Parser(new File(input_file)).iterator();
        File output = new File(output_object);
        while (iterator.hasNext()) {
            cool.run(iterator.next(), output);
        }
    }
}
All I want to do is get multiple threads taking the iterator objects and executing the run statement. Everything is independent. I keep looking at Java multithreading material, but it's all about talking over networks, sharing data, etc. Is this as simple as I think it is? If someone can just point me in the right direction I would be happy to do the leg work.
thanks
An ExecutorService (ThreadPoolExecutor) would be the Java equivalent.
ExecutorService executorService =
    new ThreadPoolExecutor(
        maxThreads, // core thread pool size
        maxThreads, // maximum thread pool size
        1, // time to wait before resizing pool
        TimeUnit.MINUTES,
        new ArrayBlockingQueue<Runnable>(maxThreads, true),
        new ThreadPoolExecutor.CallerRunsPolicy());

ConcurrentLinkedQueue<ResultObject> resultQueue = new ConcurrentLinkedQueue<>();

while (iterator.hasNext()) {
    executorService.execute(new MyJob(iterator.next(), resultQueue));
}
Implement your job as a Runnable.
class MyJob implements Runnable {
    /* collect useful parameters in the constructor */
    public MyJob(...) {
        /* omitted */
    }

    public void run() {
        /* job here, submit result to resultQueue */
    }
}
The resultQueue is there to collect the results of your jobs.
See the Java API documentation for detailed information.

Which ThreadPool in Java should I use?

There is a huge number of tasks.
Each task belongs to a single group. The requirement is that each group of tasks should be executed serially, just as if they were executed in a single thread, and the throughput should be maximized in a multi-core (or multi-CPU) environment. Note: the number of groups is also huge, proportional to the number of tasks.
The naive solution is to use a ThreadPoolExecutor and synchronize (or lock). However, threads would block each other and throughput would not be maximized.
Any better ideas? Or does there exist a third-party library satisfying the requirement?
A simple approach would be to "concatenate" all of a group's tasks into one super task, thus making the sub-tasks run serially. But this will probably cause delays in other groups, which will not start until some other group completely finishes and makes some space in the thread pool.
As an alternative, consider chaining a group's tasks. The following code illustrates it:
public class MultiSerialExecutor {
    private final ExecutorService executor;

    public MultiSerialExecutor(int maxNumThreads) {
        executor = Executors.newFixedThreadPool(maxNumThreads);
    }

    public void addTaskSequence(List<Runnable> tasks) {
        executor.execute(new TaskChain(tasks));
    }

    private void shutdown() {
        executor.shutdown();
    }

    private class TaskChain implements Runnable {
        private List<Runnable> seq;
        private int ind;

        public TaskChain(List<Runnable> seq) {
            this.seq = seq;
        }

        @Override
        public void run() {
            seq.get(ind++).run(); // NOTE: No special error handling
            if (ind < seq.size())
                executor.execute(this);
        }
    }
}
The advantage is that no extra resource (thread/queue) is being used, and that the granularity of tasks is better than in the naive approach. The disadvantage is that all of a group's tasks must be known in advance.
--edit--
To make this solution generic and complete, you may want to decide on error handling (i.e. whether a chain continues even if an error occurs), and it would also be a good idea to implement ExecutorService and delegate all calls to the underlying executor.
I would suggest using task queues:
For every group of tasks you have, create a queue and insert all tasks from that group into it.
Now all your queues can be executed in parallel while the tasks inside one queue are executed serially.
A quick Google search suggests that the Java API has no task/thread queues by itself; however, there are many tutorials available on coding one (a sketch of one such per-group serial executor follows). Everyone feel free to list good tutorials/implementations if you know some.
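One well-known building block for this is the SerialExecutor example from the java.util.concurrent.Executor javadoc; roughly, you create one of these per group, all sharing the same backing pool:
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Executor;

// Tasks submitted to one SerialExecutor run one after another, while
// different SerialExecutors run in parallel on the shared backing pool.
class SerialExecutor implements Executor {
    private final Queue<Runnable> tasks = new ArrayDeque<>();
    private final Executor backingPool;
    private Runnable active;

    SerialExecutor(Executor backingPool) {
        this.backingPool = backingPool;
    }

    @Override
    public synchronized void execute(Runnable r) {
        tasks.add(() -> {
            try {
                r.run();
            } finally {
                scheduleNext(); // hand the next task of this group to the pool
            }
        });
        if (active == null) {
            scheduleNext();
        }
    }

    private synchronized void scheduleNext() {
        if ((active = tasks.poll()) != null) {
            backingPool.execute(active);
        }
    }
}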
I mostly agree with Dave's answer, but if you need to slice CPU time across all "groups", i.e. all task groups should progress in parallel, you might find this kind of construct useful (using removal as a "lock"; this worked fine in my case, although I imagine it tends to use more memory):
class TaskAllocator {
    private final ConcurrentLinkedQueue<Queue<Runnable>> entireWork
            = childQueuePerTaskGroup();

    public Queue<Runnable> lockTaskGroup() {
        return entireWork.poll();
    }

    public void release(Queue<Runnable> taskGroup) {
        entireWork.offer(taskGroup);
    }
}
and
class DoWork implements Runnable {
    private final TaskAllocator allocator;

    public DoWork(TaskAllocator allocator) {
        this.allocator = allocator;
    }

    public void run() {
        for (;;) {
            Queue<Runnable> taskGroup = allocator.lockTaskGroup();
            if (taskGroup == null) {
                // No more work
                return;
            }
            Runnable work = taskGroup.poll();
            if (work == null) {
                // This group is done
                continue;
            }
            // Do work, but never forget to release the group to
            // the allocator.
            try {
                work.run();
            } finally {
                allocator.release(taskGroup);
            }
        } // for
    }
}
You can then use the optimum number of threads to run the DoWork tasks. It's a kind of round-robin load balancing.
You can even do something more sophisticated by using this instead of a simple queue in TaskAllocator (task groups with more tasks remaining tend to get executed first):
ConcurrentSkipListSet<MyQueue<Runnable>> sophisticatedQueue =
        new ConcurrentSkipListSet<>(new SophisticatedComparator());
where SophisticatedComparator is
class SophisticatedComparator implements Comparator<MyQueue<Runnable>> {
    public int compare(MyQueue<Runnable> o1, MyQueue<Runnable> o2) {
        int diff = o2.size() - o1.size();
        if (diff == 0) {
            // This is crucial. You must assign unique ids to your
            // Subqueue and break the equality if they happen to have same size.
            // Otherwise your queues will disappear...
            return o1.id - o2.id;
        }
        return diff;
    }
}
Actors are another solution for this type of problem.
Scala has actors, and Java does too, provided by Akka.
I had a problem similar to yours, and I used an ExecutorCompletionService, which works with an Executor to complete collections of tasks.
Here is an extract from the java.util.concurrent API (since Java 7):
Suppose you have a set of solvers for a certain problem, each returning a value of some type Result, and would like to run them concurrently, processing the results of each of them that return a non-null value, in some method use(Result r). You could write this as:
void solve(Executor e, Collection<Callable<Result>> solvers)
        throws InterruptedException, ExecutionException {
    CompletionService<Result> ecs = new ExecutorCompletionService<Result>(e);
    for (Callable<Result> s : solvers)
        ecs.submit(s);
    int n = solvers.size();
    for (int i = 0; i < n; ++i) {
        Result r = ecs.take().get();
        if (r != null)
            use(r);
    }
}
So, in your scenario, every task will be a single Callable<Result>, and tasks will be grouped in a Collection<Callable<Result>>.
Reference:
http://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorCompletionService.html
