java.lang.IllegalStateException while trying to use MongoDB BulkWriteOperation - java

I have this code that dumps documents into MongoDB once an ArrayBlockingQueue fills its quota. When I run the code, it seems to run only once and then gives me a stack trace. My guess is that the BulkWriteOperation somehow has to be 'reset' or started over again.
Also, I create the BulkWriteOperations in the constructor...
bulkEvent = eventsCollection.initializeOrderedBulkOperation();
bulkSession = sessionsCollection.initializeOrderedBulkOperation();
Here's the stacktrace.
10 records inserted
java.lang.IllegalStateException: already executed
at org.bson.util.Assertions.isTrue(Assertions.java:36)
at com.mongodb.BulkWriteOperation.insert(BulkWriteOperation.java:62)
at willkara.monkai.impl.managers.DataManagers.MongoDBManager.dumpQueue(MongoDBManager.java:104)
at willkara.monkai.impl.managers.DataManagers.MongoDBManager.addToQueue(MongoDBManager.java:85)
Here's the code for the Queues:
public void addToQueue(Object item) {
    if (item instanceof SakaiEvent) {
        if (!eventQueue.offer((SakaiEvent) item)) {
            dumpQueue(eventQueue);
        }
    }
    if (item instanceof SakaiSession) {
        if (!sessionQueue.offer((SakaiSession) item)) {
            dumpQueue(sessionQueue);
        }
    }
}
And here is the code that reads from the queues and adds the items to a BulkWriteOperation (initializeOrderedBulkOperation) to execute it and then dump them to the database. Only 10 documents get written and then it fails.
private void dumpQueue(BlockingQueue q) {
    Object item = q.peek();
    Iterator itty = q.iterator();
    BulkWriteResult result = null;
    if (item instanceof SakaiEvent) {
        while (itty.hasNext()) {
            bulkEvent.insert(((SakaiEvent) itty.next()).convertToDBObject());
            // It's failing at that line ^^
        }
        result = bulkEvent.execute();
    }
    if (item instanceof SakaiSession) {
        while (itty.hasNext()) {
            bulkSession.insert(((SakaiSession) itty.next()).convertToDBObject());
        }
        result = bulkSession.execute();
    }
    System.out.println(result.getInsertedCount() + " records inserted");
}

The general documentation applies to all driver implementations in this case:
"After execution, you cannot re-execute the Bulk() object without reinitializing."
The .execute() method effectively "drains" the current list of operations that have been sent to it, and the object now contains state information about how the commands were actually sent. So you cannot add more entries or call .execute() again on the same instance without reinitializing.
So after you call execute on each "Bulk" object, you need to call the initialize method again:
bulkEvent = eventsCollection.initializeOrderedBulkOperation();
bulkSession = sessionsCollection.initializeOrderedBulkOperation();
Place each of those lines again, respectively, after each .execute() call in your function. Further calls to those instances can then add operations and call execute again, continuing the cycle.
Note that "Bulk" operations objects will store as many items as you want to put into them but will break up requests to the server into maximum amounts of 1000 items. After execution the state of the operations list will reflect exactly how this is done should you want to inspect that.

Related

Can ChronicleQueue tailers for two different queues be interleaved?

I have two separate ChronicleQueues that were created by independent threads that monitor web socket streams in a Java application. When I read each queue independently in a separate single-thread program, I can traverse each entire queue as expected - using the following minimal code:
final ExcerptTailer queue1Tailer = queue1.createTailer();
final ExcerptTailer queue2Tailer = queue2.createTailer();
while (true)
{
    try( final DocumentContext context = queue1Tailer.readingDocument() )
    {
        if ( isNull(context.wire()) )
            break;
        counter1++;
        queue1Data = context.wire()
                            .bytes()
                            .readObject(Queue1Data.class);
        queue1Writer.write(String.format("%d\t%d\t%d%n", counter1, queue1Data.getEventTime(), queue1Data.getEventContent()));
    }
}
while (true)
{
    try( final DocumentContext context = queue2Tailer.readingDocument() )
    {
        if ( isNull(context.wire()) )
            break;
        counter2++;
        queue2Data = context.wire()
                            .bytes()
                            .readObject(Queue2Data.class);
        queue2Writer.write(String.format("%d\t%d\t%d%n", counter2, queue2Data.getEventTime(), queue2Data.getEventContent()));
    }
}
In the above, I am able to read all the Queue1Data objects, then all the Queue2Data objects, and access values as expected. However, when I try to interleave reading the queues (read an object from one queue, then, based on a property of the Queue1Data object (a time stamp), read Queue2Data objects until the first object after that time stamp (the limit variable below) is found, then do something with it), an exception is thrown after only one object is read from queue2Tailer: DecoratedBufferUnderflowException: readCheckOffset0 failed. The simplified code that fails is below (I have tried putting the outer while(true) loop both inside and outside the queue2Tailer try block):
final ExcerptTailer queue1Tailer = queue1Queue.createTailer("label1");
try( final DocumentContext queue1Context = queue1Tailer.readingDocument() )
{
    final ExcerptTailer queue2Tailer = queue2Queue.createTailer("label2");
    while (true)
    {
        try( final DocumentContext queue2Context = queue2Tailer.readingDocument() )
        {
            if ( isNull(queue2Context.wire()) )
            {
                terminate = true;
                break;
            }
            queue2Data = queue2Context.wire()
                                      .bytes()
                                      .readObject(Queue2Data.class);
            while(true)
            {
                queue1Data = queue1Context.wire()
                                          .bytes()
                                          .readObject(Queue1Data.class); // first read succeeds
                if (queue1Data.getFieldValue() > limit) // if this fails the inner loop continues
                {                                       // but the second read fails
                    // cache a value
                    break;
                }
            }
            // continue working with queue2Data object and cached values
        } // end try block for queue2 tailer
    } // end outer while loop
} // end outer try block for queue1 tailer
I have tried as above, and also with both tailers created at the beginning of the function which does the processing (a private function executed when a button is clicked in a relatively simple Java application). Basically I took the loop which worked independently and put it inside another loop in the function, expecting no problems. I think I am missing something crucial in how tailers are positioned and used to read objects, but I cannot figure out what it is, since the same basic code works when reading the queues independently. The use of isNull(context.wire()) to determine when there are no more objects in a queue I got from one of the examples, though I am not sure it is the proper way when processing a queue sequentially.
Any suggestions would be appreciated.
You're not writing it correctly in the first instance.
There's a hardcore way of achieving what you are trying to achieve (that is, doing everything explicitly, at a lower level), and there's the MethodReader/MethodWriter magic provided by Chronicle.
Hardcore way
Writing
// write first event type
try (DocumentContext dc = queueAppender.writingDocument()) {
dc.wire().writeEventName("first").text("Hello first");
}
// write second event type
try (DocumentContext dc = queueAppender.writingDocument()) {
dc.wire().writeEventName("second").text("Hello second");
}
This will write different types of messages into the same queue, and you will be able to easily distinguish those when reading.
Reading
StringBuilder reusable = new StringBuilder();
while (true) {
    try (DocumentContext dc = tailer.readingDocument()) {
        if (!dc.isPresent()) {
            continue;
        }
        dc.wire().readEventName(reusable);
        if ("first".contentEquals(reusable)) {
            // handle first
        } else if ("second".contentEquals(reusable)) {
            // handle second
        }
        // optionally handle other events
    }
}
The Chronicle Way (aka Peter's magic)
This works with any marshallable types, as well as any primitive types, CharSequence subclasses (e.g. Strings), and Bytes. For more details, have a read of the MethodReader/MethodWriter documentation.
Suppose you have some data classes:
public class FirstDataType implements Marshallable { // alternatively - extends SelfDescribingMarshallable
    // data fields...
}
public class SecondDataType implements Marshallable { // alternatively - extends SelfDescribingMarshallable
    // data fields...
}
Then, to write those data classes to the queue, you just need to define the interface, like this:
interface EventHandler {
    void first(FirstDataType first);
    void second(SecondDataType second);
}
Writing
Then, writing data is as simple as:
final EventHandler writer = appender.methodWriterBuilder(EventHandler.class).get();
// assuming firstDatum and secondDatum are created earlier
writer.first(firstDatum);
writer.second(secondDatum);
What this does is the same as in the hardcore section - it writes the event name (which is taken from the method name in the method writer, i.e. "first" or "second" correspondingly), and then the actual data object.
Reading
Now, to read those events from the queue, you need to provide an implementation of the above interface, that will handle corresponding event types, e.g.:
// you implement this to read data from the queue
private class MyEventHandler implements EventHandler {
    public void first(FirstDataType first) {
        // handle first type of events
    }
    public void second(SecondDataType second) {
        // handle second type of events
    }
}
And then you read as follows:
EventHandler handler = new MyEventHandler();
MethodReader reader = tailer.methodReader(handler);
while (true) {
    reader.readOne(); // readOne returns a boolean which can be used to determine if there's no more data, and pause if appropriate
}
Misc
You don't have to use the same interface for reading and writing. In case you want to read only events of the second type, you can define another interface:
interface OnlySecond {
    void second(SecondDataType second);
}
Now, if you create a handler implementing this interface and give it to the tailer#methodReader() call, the readOne() calls will only process events of the second type while skipping all others.
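For example, a small sketch reusing the OnlySecond interface defined above (the anonymous handler and loop are mine):
OnlySecond onlySecond = new OnlySecond() {
    @Override
    public void second(SecondDataType second) {
        // handle only the second type of events; all others are skipped
    }
};
MethodReader secondReader = tailer.methodReader(onlySecond);
while (secondReader.readOne()) {
    // keep reading while there is data
}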
This also works for MethodWriters, i.e. if you have several processes writing different types of data and one process consuming all that data, it is not uncommon to define multiple interfaces for writing data and then a single interface extending all others for reading, e.g.:
interface FirstOut {
    void first(String first);
}
interface SecondOut {
    void second(long second);
}
interface ThirdOut {
    void third(ThirdDataType third);
}
interface AllIn extends FirstOut, SecondOut, ThirdOut {
}
(I deliberately used different data types for method parameters to show how it is possible to use various types)
With further testing, I have found that nested loops reading multiple queues which contain data in different POJO classes are possible. The problem with the code in the question above is that queue1Context is obtained once, OUTSIDE the loop in which I expected to read queue1Data objects. My fundamental misconception was that DocumentContext objects managed stepping through the objects in a queue, whereas actually ExcerptTailer objects manage stepping (maintaining indices) when reading a queue sequentially.
In case it might help someone else just getting started with ChronicleQueues, the inner loop in the original question should be:
while(true)
{
    try (final DocumentContext queue1Context = queue1Tailer.readingDocument() )
    {
        queue1Data = queue1Context.wire()
                                  .bytes()
                                  .readObject(Queue1Data.class); // first read succeeds
        if (queue1Data.getFieldValue() > limit) // if this fails the inner loop continues as expected
        {                                       // and second and subsequent reads now succeed
            // cache a value
            break;
        }
    }
}
And of course the outer-most try block containing queue1Context (in the original code) should be removed.

Multiple threads writing to same list in java and return that list to a function

So I have a really large list of zip codes (about 80,000) that I want to pass to a URL, getting the JSON data from that URL for each zip code.
I am running a query on that JSON to see if it has the end_lat field, and if it does, I want to save that zip code to a list.
As I am fetching and matching JSON for a lot of zip codes, it's taking forever.
So I tried a few different methods to make it a multithreaded application. I tried the good old Thread approach with the Runnable interface.
I tried executor services. But everything stops abruptly, which makes me believe that I should be making synchronized writes to that list.
public void breakingZipCodesForThreads() {
    List<String> zip_Codes = Serenity.sessionVariableCalled("zipCodes");
    int size = (int) Math.ceil(zip_Codes.size() / 5.0);
    ExecutorService executor = Executors.newFixedThreadPool(4);
    for (int start = 0; start < zip_Codes.size(); start += size) {
        int end = Math.min(start + size, zip_Codes.size());
        Runnable worker = new MyRunnable(zip_Codes.subList(start, end));
        executor.execute(worker);
    }
}

// run() method basically has this code for a function
for (String zipCode : zip_Codes) {
    currentPage = pageUrl + zipCode;
    Response response = given().urlEncodingEnabled(false)
        .when()
        .get(currentPage);
    try {
        Object end_lat = response.getBody().path("end_lat");
        if (end_lat != null && !end_lat.toString().isEmpty()) {
            resultantZipCode.add(zipCode);
        }
    } catch (Exception e) {
        // Something else
    }
}
So essentially I want all my threads to concurrently write to the list resultantZipCode and in the end give me a single list of all the zip codes that satisfy my condition.
So how do I break my zip codes into pieces, run the run function on them in parallel, and collect all the resultant zip codes into one list that is returned to me? What am I missing?
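One common fix (a sketch, not a drop-in: hasEndLat is a hypothetical stand-in for the REST-assured call above) is to have each worker return its own partial list through a Callable and merge the futures at the end, so no thread ever writes to a shared list:
ExecutorService executor = Executors.newFixedThreadPool(4);
int size = (int) Math.ceil(zip_Codes.size() / 4.0);
List<Future<List<String>>> futures = new ArrayList<>();
for (int start = 0; start < zip_Codes.size(); start += size) {
    List<String> chunk = zip_Codes.subList(start, Math.min(start + size, zip_Codes.size()));
    futures.add(executor.submit(() -> {
        List<String> matches = new ArrayList<>(); // thread-local, no shared writes
        for (String zipCode : chunk) {
            if (hasEndLat(zipCode)) { // hypothetical: the GET + "end_lat" check from the question
                matches.add(zipCode);
            }
        }
        return matches;
    }));
}
List<String> resultantZipCode = new ArrayList<>();
for (Future<List<String>> f : futures) {
    resultantZipCode.addAll(f.get()); // blocks until each worker finishes and surfaces exceptions
}
executor.shutdown();
Alternatively, Collections.synchronizedList or a ConcurrentLinkedQueue would make the shared writes safe, but merging per-thread results avoids contention and makes "return that list to a function" trivial.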

Tracking the progress between Queues in a Map

I currently have two queues and items traveling between them. Initially, an item gets put into firstQueue, then one of three dedicated threads moves it to secondQueue, and finally another dedicated thread removes it. These moves obviously include some processing. I need to be able to get the status of any item (IN_FIRST, AFTER_FIRST, IN_SECOND, AFTER_SECOND, or ABSENT) and I implemented it manually by updating a statusMap wherever a queue gets modified, like
while (true) {
    Item i = firstQueue.take();
    statusMap.put(i, AFTER_FIRST);
    process(i);
    secondQueue.add(i);
    statusMap.put(i, IN_SECOND);
}
This works, but it's ugly and leaves a time window where the status is inconsistent. The inconsistency is no big deal and it'd be solvable by synchronization, but this could backfire as the queue is of limited capacity and may block. The ugliness bothers me more.
Efficiency hardly matters as the processing takes seconds. Dedicated threads are used in order to control concurrency. No item should ever be in multiple states (but this is not very important and not guaranteed by my current racy approach). There'll be more queues (and states) and they'll be of different kinds (DelayQueue, ArrayBlockingQueue, and maybe PriorityQueue).
I wonder if there's a nice solution generalizable to multiple queues?
Does it make sense to wrap the queues with logic to manage the Item status?
public class QueueWrapper<E> implements BlockingQueue<E> {
    private BlockingQueue<E> myQueue = new LinkedBlockingQueue<>();
    private Map<E, Status> statusMap;

    public QueueWrapper(Map<E, Status> statusMap) {
        this.statusMap = statusMap;
    }

    [...]

    @Override
    public E take() throws InterruptedException {
        E result = myQueue.take();
        statusMap.put(result, Status.AFTER_FIRST);
        return result;
    }
}
That way status management is always related to (and contained in) queue operations...
Obviously statusMap needs to be synchronized, but that would be an issue anyway.
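A usage sketch (my own, not from the question; a ConcurrentHashMap keeps the map itself thread-safe, though each status transition is still only as atomic as the queue operation it rides on):
Map<Item, Status> statusMap = new ConcurrentHashMap<>();
BlockingQueue<Item> firstQueue = new QueueWrapper<>(statusMap);
// consumer thread: the status update happens inside take()
Item i = firstQueue.take();
process(i);
// status queries just read the shared map
Status s = statusMap.get(i);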
I think your model might be improved in consistency, state control, and scaling.
One way to implement this is to couple the item with its state, enqueue and dequeue that pair, and create a mechanism that ensures consistent state changes.
Following this model and your example, we can do:
package stackoverflow;

import java.util.concurrent.LinkedBlockingQueue;

import stackoverflow.item.ItemState;
import stackoverflow.task.CreatingTask;
import stackoverflow.task.FirstMovingTask;
import stackoverflow.task.SecondMovingTask;

public class Main {
    private static void startTask(String name, Runnable r) {
        Thread t = new Thread(r, name);
        t.start();
    }

    public static void main(String[] args) {
        // create queues
        LinkedBlockingQueue<ItemState> firstQueue = new LinkedBlockingQueue<ItemState>();
        LinkedBlockingQueue<ItemState> secondQueue = new LinkedBlockingQueue<ItemState>();
        // start three threads
        startTask("Thread#1", new CreatingTask(firstQueue));
        startTask("Thread#2", new FirstMovingTask(firstQueue, secondQueue));
        startTask("Thread#3", new SecondMovingTask(secondQueue));
    }
}
Each task runs its op() operation on ItemState, in accordance with the statement below:
one of three dedicated threads moves it to secondQueue and finally
another dedicated thread removes it.
ItemState is an immutable object that contains the Item and its State. This ensures consistency between the Item and State values.
ItemState knows about the next state, creating a self-controlled state mechanism:
public class FirstMovingTask {
    // other code

    protected void op() {
        try {
            // dequeue
            ItemState is0 = new ItemState(firstQueue.take());
            System.out.println("Item " + is0.getItem().getValue() + ": " + is0.getState().getValue());
            // process here
            // enqueue
            ItemState is1 = new ItemState(is0);
            secondQueue.add(is1);
            System.out.println("Item " + is1.getItem().getValue() + ": " + is1.getState().getValue());
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    // other code
}
With the ItemState implementation:
public class ItemStateImpl implements ItemState {
    private final Item item;
    private final State state;

    public ItemStateImpl(Item i) {
        this.item = i;
        this.state = new State();
    }

    public ItemStateImpl(ItemState is) {
        this.item = is.getItem();
        this.state = is.getState().next();
    }

    // gets attrs
}
This way it is possible to build more elegant, flexible, and scalable solutions.
Scalable because you can control more states just by changing next() and generalizing the moving task to increase the number of queues.
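The State class itself is not shown in the answer; a minimal sketch consistent with the calls used above (new State() starting at IN_FIRST, next() advancing, getValue() for printing) could be:
public class State {
    private static final String[] ORDER = { "IN_FIRST", "AFTER_FIRST", "IN_SECOND", "AFTER_SECOND" };
    private final int index;

    public State() {
        this.index = 0; // every item starts in IN_FIRST
    }

    private State(int index) {
        this.index = index;
    }

    public State next() {
        // advance one step; the final state is terminal
        return index < ORDER.length - 1 ? new State(index + 1) : this;
    }

    public String getValue() {
        return ORDER[index];
    }
}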
Results:
Item 0: AFTER_FIRST
Item 0: IN_FIRST
Item 0: IN_SECOND
Item 0: AFTER_SECOND
Item 1: IN_FIRST
Item 1: AFTER_FIRST
Item 1: IN_SECOND
Item 1: AFTER_SECOND
Item 2: IN_FIRST
Item 2: AFTER_FIRST
Item 2: IN_SECOND
... others
UPDATE (06/07/2018): analysing the use of a map for search
Searching a map using value equality as the comparator might not work, because the mapping between values and identity (key/hash) is usually not one-to-one. You would then need to keep a sorted list to search for values, which results in O(n) (worst case).
With Item.getValuesHashCode():
private int getValuesHashCode() {
    return new HashCodeBuilder().append(value).hashCode();
}
In this case, you must keep a Vector<ItemState> instead of an Item and use the result of getValuesHashCode as the key. Change the state-control mechanism to keep the first reference to the Item plus the current state. See below:
// Main.class
public static void main(String[] args) {
    ... other code ...
    // references repository
    ConcurrentHashMap<Integer, Vector<ItemState>> statesMap = new ConcurrentHashMap<Integer, Vector<ItemState>>();
    // start three threads
    startTask("Thread#1", new CreatingTask(firstQueue, statesMap));
    ... other code ...
}

// CreatingTask.class
protected void op() throws InterruptedException {
    // create item
    ItemState is = new ItemStateImpl(new Item(i++, NameGenerator.name()));
    // put in monitor and enqueue
    int key = is.getHashValue();
    Vector<ItemState> items = map.get(key);
    if (items == null) {
        items = new Vector<>();
        map.put(key, items);
    }
    items.add(is);
    // enqueue
    queue.put(is);
}

// FirstMovingTask.class
protected void op() throws InterruptedException {
    // dequeue
    ItemState is0 = firstQueue.take();
    // process
    ItemState is1 = process(is0.next());
    // enqueue
    secondQueue.put(is1.next());
}

// ItemState.class
public ItemState next() {
    // required for consistent state change
    synchronized (state) {
        state = state.next();
        return this;
    }
}
To search, you must use concurrentMapRef.get(key). The result will be the reference to the updated ItemState.
Results in my tests:
# key = hash("a")
# concurrentMapRef.get(key)
...
Item#7#0 : a - IN_FIRST
... many others lines
Item#7#0 : a - AFTER_FIRST
Item#12#1 : a - IN_FIRST
... many others lines
Item#7#0 : a - IN_SECOND
Item#12#1 : a - IN_FIRST
... many others lines
Item#7#0 : a - AFTER_SECOND
Item#12#1 : a - IN_FIRST
More details in code: https://github.com/ag-studies/stackoverflow-queue
UPDATE (06/09/2018): redesign
Generalizing this project, I understand that the state machine works as follows: I decoupled the workers from the queues to improve the concepts, and I used a MemoryRep to keep the unique reference for each item across the overall processing.
Of course, you can use event-based strategies if you need to keep ItemState in a physical repository.
This keeps the previous idea and makes the concepts more legible. Each job has two queues (input/output) and a relationship with a business model; the searcher will always find the most updated and consistent state of an Item.
So, answering your question:
I can find the consistent state of an Item anywhere using the MemoryRep (basically a Map), wrapping the state and item in ItemState, and controlling the state change in the job when it enqueues or dequeues.
Performance is preserved, except when running next().
The state is always consistent (for your problem).
In this model it is possible to use any queue type, any number of jobs/queues, and any number of states.
Additionally, this is beautiful!!
As previously answered, wrapping the queues or the items (or both) would be viable solutions.
public class ItemWrapper<E> {
    E item;
    Status status;

    public ItemWrapper(E i, Status s) { ... }

    public void setStatus(Status s) { ... }

    // not necessary if you use a queue wrapper (see queue wrapper)
    @Override
    public boolean equals(Object obj) {
        if (obj instanceof ItemWrapper)
            return item.equals(((ItemWrapper) obj).item);
        return false;
    }

    @Override
    public int hashCode() {
        return item.hashCode();
    }
}
...
process(item) // process updates the status in the item
...
Probably a better way, as already answered, is to have a QueueWrapper that updates the status. For fun I don't use a status map but the ItemWrapper from above; it seems cleaner (a status map works too).
public class QueueWrapper<E> implements Queue<E> {
    private Queue<ItemWrapper<E>> myQueue;
    static private Status inStatus;  // FIRST
    static private Status outStatus; // AFTER_FIRST

    public QueueWrapper(Queue<E> myQueue, Status inStatus, Status outStatus) { ... }

    @Override
    public boolean add(E e) {
        return myQueue.add(new ItemWrapper<>(e, inStatus));
    }

    @Override
    public E remove() {
        ItemWrapper<E> result = myQueue.remove();
        result.setStatus(outStatus);
        return result.item;
    }
    ...
}
You can also use AOP to inject the status updates into your queues without changing them (a status map would be more appropriate than the ItemWrapper there).
Maybe I didn't answer your question well, because an easy way to know where your item is could be to check each queue with the contains function.
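That simpler idea could look like this (a sketch; the queue fields and the Status enum are assumed from the question):
public Status statusOf(Item item) {
    if (firstQueue.contains(item)) return Status.IN_FIRST;
    if (secondQueue.contains(item)) return Status.IN_SECOND;
    // items currently being processed (AFTER_FIRST / AFTER_SECOND) are in neither queue,
    // which is exactly why the wrapper approaches above track them explicitly
    return Status.ABSENT;
}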
Here's something different from what others have said. Taking from the world of queue services and systems, we have the concept of message acknowledgement. This is nice, because it also gives you some built-in retry logic.
I'll lay out how it would work from a high level, and if you need I can add code.
Essentially you'll have a Set to go with each of your queues. You'll wrap your queues in an object so that when you dequeue an item a few things happen:
- The item is removed from the queue
- The item is added to the associated set
- A task (a lambda holding an atomic boolean, default false) is scheduled; when run, it removes the item from the set and, if the boolean is still false, puts the item back in the queue
- The item and a wrapper around the boolean are returned to the caller
Once process(i); completes, your code will indicate receipt acknowledgement to the wrapper, and the wrapper will set the boolean to true so the scheduled task will not re-queue the item.
A method to return status would simply check which queue or set the item is in.
Note that this gives "at least once" delivery, meaning an item will be processed at least once, but potentially more than once if the processing time is too close to the timeout.
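A sketch of such a wrapper (all names and the timeout parameter are my own, not from an actual library):
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicBoolean;

public class AckQueue<E> {
    private final BlockingQueue<E> queue = new LinkedBlockingQueue<>();
    private final Set<E> inFlight = ConcurrentHashMap.newKeySet();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long timeoutSeconds;

    public AckQueue(long timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    public void put(E item) throws InterruptedException {
        queue.put(item);
    }

    // dequeue: move the item into the in-flight set and schedule the redelivery check
    public Ack<E> take() throws InterruptedException {
        E item = queue.take();
        inFlight.add(item);
        AtomicBoolean acked = new AtomicBoolean(false);
        timer.schedule(() -> {
            inFlight.remove(item);
            if (!acked.get()) {
                queue.offer(item); // not acknowledged in time: retry
            }
        }, timeoutSeconds, TimeUnit.SECONDS);
        return new Ack<>(item, acked);
    }

    public boolean isInFlight(E item) {
        return inFlight.contains(item); // for status queries
    }

    public static final class Ack<E> {
        public final E item;
        private final AtomicBoolean acked;

        private Ack(E item, AtomicBoolean acked) {
            this.item = item;
            this.acked = acked;
        }

        public void ack() {
            acked.set(true); // processing finished; the timer task will not re-queue it
        }
    }
}
Here the timer task does the set cleanup in both cases; acknowledging early only prevents the re-queue.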

How to process data in chunks in java using Multi Threading?

I am working on a task in which I need to process data in chunks. I have a properties file in which I define the chunk size, suppose 500, and the data that I am getting from the database is, suppose, 1000 records. I want to process the 1000 records in chunks of 500 each using multithreading.
This is the first time I am implementing this, so please let me know if I can achieve the same using another technique. The main purpose behind this is that I am generating an Excel file in which I populate the data keeping the chunk size in mind. So probably the first thread processes 500 records and the second thread the next 500.
Partial code (the rest parses the XML and writes to Excel using POI):
public List<NYProgramTO> getNYPPAData() throws Exception {
    this.getConfiguration();
    List<NYProgramTO> to = dao.getLatestNYData();
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    Document document = null;
    // Returns chunkSize
    List<NYProgramTO> myList = getNextChunk(to);
    ExecutorService executor = Executors.newFixedThreadPool(myList.size());
    myList.stream()
          .forEach((NYProgramTO nyTo) ->
          {
              executor.execute(new NYExecutorThread(nyTo, migrationConfig, appContext, dao));
          });
    executor.shutdown();
    executor.awaitTermination(300, TimeUnit.SECONDS);
    System.gc();
The dao.getLatestNYData() method returns the total set of records from the database, and this is how I populate the list to.
I have the following method which gives me the next chunk, so if 500 records have been processed this method should give me the next 500 records to process (hope this makes sense).
private static List<NYProgramTO> getNextChunk(List<NYProgramTO> list) {
    currentIndex = 0; // This is a static int class variable
    List<NYProgramTO> nyList = new ArrayList<>();
    if (list.size() == 0) {
        return list;
    }
    int totalCount = list.size();
    for (int i = currentIndex; i < (currentIndex + chunkSize); i++) {
        if (i == totalCount) break;
        nyList.add(list.get(i));
    }
    return nyList;
}
In my first method I create the threads, but here I am not sure how many threads I need to create. Currently I am passing the size of the list that I receive from the getNextChunk() method.
NYExecutorThread simply implements Runnable and I don't have any logic in it yet. Currently I just pass parameters to the constructor to be able to get the configurations and create threads.
It is a little confusing, so if anyone has implemented such logic, please let me know how I can go ahead with this.
Thanks
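For what it's worth, one common way to structure this (a sketch using the question's names; processChunk is a hypothetical stand-in for the POI-writing logic) is to partition the list up front with subList and submit one task per chunk. Note that getNextChunk as written can never advance, because currentIndex is reset to 0 on every call:
List<NYProgramTO> all = dao.getLatestNYData();
int chunkSize = 500; // from the properties file
List<List<NYProgramTO>> chunks = new ArrayList<>();
for (int start = 0; start < all.size(); start += chunkSize) {
    chunks.add(all.subList(start, Math.min(start + chunkSize, all.size())));
}
// the pool size bounds concurrent threads; there is one task per chunk
ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, Math.min(chunks.size(), 4)));
for (List<NYProgramTO> chunk : chunks) {
    executor.execute(() -> processChunk(chunk)); // hypothetical: processes one block of records
}
executor.shutdown();
executor.awaitTermination(300, TimeUnit.SECONDS);
Keep in mind that an Apache POI workbook is not thread-safe, so each task should prepare its rows and let a single thread do the actual workbook writes.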

Spring Batch Processor

I have a requirement in Spring Batch where I have a file with thousands of records coming in sorted order. The key field is product code.
The file may have multiple records for the same product code. The requirement is that I have to group the records that have the same product code into a collection (i.e. a List) and then send them over to a method, i.e. validateProductCodes(List prodCodeList).
I am looking for the best way to do this. The approach I thought of was to read every record in the processor and build up a collection of records for the same product code there. If at any point in the processor the product code in a record is different, that would imply that the productCode grouping is complete and validateProductCodes() can be called for that group of records with the same product code. Also, I am using a Step. Does that automatically mean that the process is multithreaded, i.e. that groups of records with the same productCode will be processed in a multithreaded way? Please advise.
Thanks
There are two questions in your question: first, you want to know how to group the items together, and second, how they are processed.
In order to group them, you could create a group reader as Luca suggested, or something like:
public class GroupReader<I> implements ItemReader<List<I>>, InitializingBean {

    private SingleItemPeekableItemReader<I> reader;
    private ItemReader<I> peekReaderDelegate;

    private enum State { NEW, READING, COMPLETE }

    public void setReader(ItemReader<I> reader) {
        peekReaderDelegate = reader;
    }

    @Override
    public void afterPropertiesSet() throws Exception {
        Assert.notNull(peekReaderDelegate, "The 'itemReader' may not be null");
        this.reader = new SingleItemPeekableItemReader<I>();
        this.reader.setDelegate(peekReaderDelegate);
    }

    @Override
    public List<I> read() throws Exception {
        State state = State.NEW;
        List<I> group = null;
        I item = null;
        while (state != State.COMPLETE) {
            item = reader.read();
            switch (state) {
                case NEW: {
                    if (item == null) {
                        // end reached
                        state = State.COMPLETE;
                        break;
                    }
                    group = new ArrayList<I>();
                    group.add(item);
                    state = State.READING;
                    I nextItem = reader.peek();
                    if (isItAKeyChange(item, nextItem)) {
                        state = State.COMPLETE;
                    }
                    break;
                }
                case READING: {
                    group.add(item);
                    // peek and check if the peeked entry has a new key
                    I nextItem = reader.peek();
                    if (isItAKeyChange(item, nextItem)) {
                        state = State.COMPLETE;
                    }
                    break;
                }
                default: {
                    throw new IllegalStateException("ParsingError: Reader is in an invalid state");
                }
            }
        }
        return group;
    }
}
For every key, this reader will return a list with all elements matching this key. Therefore, the grouping is done directly in the reader.
You cannot do that with a processor, as you described.
Your second question is about multithreading.
Using a step does not necessarily mean that the step is processed with several threads.
In order to do that, you need to set an AsyncTaskExecutor and you have to set the throttle limit.
But if you do that, your reader must be thread-safe, or otherwise your grouping won't work. You could do that by simply defining the read method above as synchronized.
Another way could be to write a small SynchronizedWrapperReader, as suggested in this question: Parellel Processing Spring Batch StaxEventItemReader
Please note, depending on the target you are writing to, you probably also have to synchronize the writer and, if necessary, reorder the result.
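For illustration, a minimal sketch of such a synchronized delegating reader (the naming is mine; newer Spring Batch versions also ship a SynchronizedItemStreamReader for the same purpose):
public class SynchronizedReader<T> implements ItemReader<T> {

    private final ItemReader<T> delegate;

    public SynchronizedReader(ItemReader<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public synchronized T read() throws Exception {
        // only one thread at a time may advance the delegate,
        // so grouped reads are never interleaved
        return delegate.read();
    }
}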
