How to distribute work correctly in JSR-352? - java

So I have been using Java Batch Processing for some time now. My jobs were either import/export jobs which chunked from a reader to a writer, or I would write Batchlets that would do some more complex processing. Since I am beginning to hit memory limits I need to rethink the architecture.
So I want to better leverage the chunked Reader/Processor/Writer pattern, but I am unsure how to distribute the work over the three components. Only during processing does it become clear whether zero, one or several records have to be written for a given input record.
The reader is quite clear: It reads the data to be processed from the DB. But I am unsure how to write the records back to the database. I see these options:
Make the processor itself store the variable number of records in the DB.
Make the processor pass a variable number of records to the writer, which then performs the writing.
Place the entire logic into the writer.
Which way would be the best for this kind of task?

Looking at https://www.ibm.com/support/pages/system/files/inline-files/WP102706_WLB_JSR352.002.pdf, especially the chapters Chunk/The Processor and Chunk/The Writer, it becomes obvious that this is up to me.
The processor can return an object, and the writer will have to understand and write this object. So for the above case where the processor has zero, one or many items to write per input record, it should simply return a list. This list can contain zero, one or several elements. The writer has to understand the list and write its elements to the database.
Since the logic is divided this way, the code is still pluggable and can easily be extended or maintained.
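For illustration, a minimal processor sketch along these lines (InputRecord, OutputRecord and relevantParts() are placeholder names, not actual API):

import java.util.ArrayList;
import java.util.List;
import javax.batch.api.chunk.ItemProcessor;
import javax.inject.Named;

@Named
public class ExpandingItemProcessor implements ItemProcessor {

    @Override
    public Object processItem(Object item) throws Exception {
        InputRecord input = (InputRecord) item;      // placeholder domain type
        List<OutputRecord> results = new ArrayList<>();

        // The business logic decides how many records this input expands to;
        // zero, one or several are all valid outcomes.
        for (String part : input.relevantParts()) {  // hypothetical helper
            results.add(new OutputRecord(input.getKey(), part));
        }
        return results;                              // possibly empty
    }
}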
Addendum: Since both the reader and the writer connect to the same database in this case, I ran into the problem that the commit at the end of each chunk also invalidated the reader's connection. The solution was to use a non-JTA datasource for the reader.

Typically, an item processor processes an input item passed from the item reader, and the processing result is either null or a domain object, so it is not directly suited to cases where the result may be split into multiple objects. I would assume that even in your case, producing multiple objects from one processing iteration is not common. So I would suggest returning a list (or any other collection type) only when necessary; in the more common cases, the item processor should still return null (to skip the current item) or a single domain object.
When the item writer iterates through the accumulated items, it can check whether an item is a collection and, if so, write out all contained elements; a plain domain object is written as usual.
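A rough writer sketch of that check might look like this (the insert() helper is just a placeholder for the actual JDBC or JPA write):

import java.util.Collection;
import java.util.List;
import javax.batch.api.chunk.AbstractItemWriter;
import javax.inject.Named;

@Named
public class FlatteningItemWriter extends AbstractItemWriter {

    @Override
    public void writeItems(List<Object> items) throws Exception {
        for (Object item : items) {
            if (item instanceof Collection) {
                // One input record was expanded into several rows by the processor.
                for (Object element : (Collection<?>) item) {
                    insert(element);
                }
            } else {
                // Common case: a single domain object.
                insert(item);
            }
        }
    }

    private void insert(Object row) {
        // Placeholder for the actual INSERT via JDBC or JPA.
    }
}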
Using a non-JTA datasource for the reader is fine. You will want to keep the reader's connection open from start to end so you can keep reading from the result set. In an item writer, the connection is typically acquired at the beginning of the write operation and released when the chunk transaction commits or rolls back.
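As a sketch of that pattern, a reader holding one non-JTA connection for the whole step could look roughly like this (the JNDI name, SQL and InputRecord type are made-up placeholders):

import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.annotation.Resource;
import javax.batch.api.chunk.AbstractItemReader;
import javax.inject.Named;
import javax.sql.DataSource;

@Named
public class NonJtaCursorReader extends AbstractItemReader {

    @Resource(lookup = "java:app/jdbc/sourceNonJta") // assumed non-JTA datasource name
    private DataSource dataSource;

    private Connection connection;
    private PreparedStatement statement;
    private ResultSet resultSet;

    @Override
    public void open(Serializable checkpoint) throws Exception {
        // The connection stays open for the whole step, so the chunk commits
        // on the JTA side do not invalidate the cursor.
        connection = dataSource.getConnection();
        statement = connection.prepareStatement("SELECT id, payload FROM source_table");
        resultSet = statement.executeQuery();
    }

    @Override
    public Object readItem() throws Exception {
        if (!resultSet.next()) {
            return null; // end of data ends the step
        }
        return new InputRecord(resultSet.getLong("id"), resultSet.getString("payload"));
    }

    @Override
    public void close() throws Exception {
        resultSet.close();
        statement.close();
        connection.close();
    }
}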
Some resources that may be of help:
Jakarta Batch API,
jberet-support JdbcItemReader,
jberet-support JdbcItemWriter

Related

Java Program Architecture: IO Buffer

I am new to Java and wrote a program which is very difficult to update because business logic, model and repository data are combined in the same classes. I have since researched many tutorials, but I cannot find an answer on how to write my program in the most resource-efficient manner.
The program reads one line of a CSV via the BufferedReader class and creates a model object instance reflecting each column read. A second CSV is then read via BufferedReader to check whether any of its data matches the first CSV, and the model object is updated with this new information.
Once updated, the model objects are added to an ArrayList.
Which architecture is more efficient, version 1 or version 2, and why?
VERSION 1
READ LINE
CREATE OBJECT INSTANCE
READ LINE
UPDATE OBJECT INSTANCE (& RUN LOGIC)
WRITE LINE (OBJECT) TO NEW FILE VIA BUFFEREDWRITER
REPEAT 3000 TIMES
VERSION 2
READ LINE
ADD TO ARRAYLIST A
READ LINE
ADD TO ARRAYLIST B
RUN LOGIC BY COMPARING ARRAYLIST A AND B
WRITE FINAL ARRAYLIST TO NEW FILE VIA BUFFEREDWRITER
Please note that although this currently uses CSV data with a limited number of lines (3,000), in the future the CSVs will grow to over 50,000 lines of data. So is it better to add to ArrayLists and run the logic on the complete ArrayLists, or to run the logic on each object first and then add the completed objects to an ArrayList?
Version 2 is more efficient because you're batching operations and not repeating expensive tasks. A classic example is a database connection: establishing a connection is generally expensive, so opening one connection and doing 100 updates through it is always more efficient (for the caller) than opening a connection, doing an update, and closing the connection 100 times over.
The trade-off with that example is that databases can only hold a finite number of connections, so doing 100 updates through one connection is more efficient for the caller but might block other callers to the database.
Setting up a buffer is another "expensive" operation; that's why version 1 is so much slower: you're setting it up 3,000 (eventually 50,000) times rather than once.
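To illustrate the connection-reuse point with plain JDBC (the items table and its columns are made up for this sketch), compare a single batched update against opening a fresh connection per row:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

public class ConnectionReuseDemo {

    // Version-2 style: one connection, one statement, all updates in a single batch.
    static void updateBatched(String jdbcUrl, Map<Long, String> newValues) throws Exception {
        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = con.prepareStatement(
                     "UPDATE items SET value = ? WHERE id = ?")) {
            for (Map.Entry<Long, String> e : newValues.entrySet()) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }

    // Version-1 style: a fresh connection per update pays the expensive
    // connection setup on every single iteration.
    static void updateOneAtATime(String jdbcUrl, Map<Long, String> newValues) throws Exception {
        for (Map.Entry<Long, String> e : newValues.entrySet()) {
            try (Connection con = DriverManager.getConnection(jdbcUrl);
                 PreparedStatement ps = con.prepareStatement(
                         "UPDATE items SET value = ? WHERE id = ?")) {
                ps.setString(1, e.getValue());
                ps.setLong(2, e.getKey());
                ps.executeUpdate();
            }
        }
    }
}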

Spark streaming maintain state over window

For Spark Streaming, is there a way to maintain state only for the current window? I understand that updateStateByKey works, but that maintains the state forever unless we purge it. Is it possible to store and reset the state per window?
To give more context. I'm trying to convert one type of object into another within a windowed stream. However, the conversion is the following:
Object 1 is either an invocation or a response.
Object 2 is not considered complete until we see both an invocation and a response.
However, since the response for an object could arrive in a separate batch, I need to maintain state across batches.
But I only wish to maintain the state for the current window. Is there any way I could achieve this with Spark?
Thank you!
You can use the mapWithState transformation instead of updateStateByKey and set a timeout on the StateSpec with the duration of your batch interval. This way you keep state only for the last batch each time. However, this only works if your invocation and response depend only on the last batch; otherwise, when you try to update a key that has already been removed, an exception is thrown.
mapWithState also performs better than updateStateByKey.
You can find a sample code snippet below:
import org.apache.spark.streaming._

// updateUserEvents is your state-update function; set the timeout to your
// batch (or window) interval so state from older batches is dropped.
val stateSpec = StateSpec
  .function(updateUserEvents _)
  .timeout(Minutes(5))

Spring Batch Chunk processing

When processing a step with chunk processing (specifying a commit-interval) in Spring Batch, is there a way to know inside the writer when all the records in a file have been read and processed? My idea was to pass the collection of records read from the file to the ExecutionContext once all the records have been read.
Please help.
I don't know if there is a pre-built CompletionPolicy that does what you want, but if not, you can write a custom CompletionPolicy that marks a chunk as completed when the reader returns null; this way you hold all items read from the file in a single chunk.
That said, are you sure this is exactly what you want? Storing all items in the ExecutionContext is not good practice; you would also lose chunk processing, restartability, and all the other Spring Batch features...
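If you do go that route anyway, a sketch of such a policy could look roughly like this (the class name is made up; check the CompletionPolicy contract of your Spring Batch version before relying on it):

import org.springframework.batch.repeat.RepeatContext;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.batch.repeat.policy.CompletionPolicySupport;

public class EndOfFileCompletionPolicy extends CompletionPolicySupport {

    @Override
    public boolean isComplete(RepeatContext context, RepeatStatus result) {
        // The repeat operation reports FINISHED once the reader returns null,
        // so the whole file ends up in one chunk handed to the writer at once.
        return result == null || !result.isContinuable();
    }

    @Override
    public boolean isComplete(RepeatContext context) {
        return false; // never complete on item count alone
    }
}

It would then be wired into the step in place of the commit-interval (in XML configuration the chunk element accepts a completion policy instead of a commit-interval).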

Using A BlockingQueue With A Servlet To Persist Objects

First, this may be a stupid question, but I'm hoping someone will tell me so, and why. I also apologize if my explanation of what/why is lacking.
I am using a servlet to upload a HUGE (247MB) file, which is pipe (|) delimited. I grab about 5 of 20 fields, create an object, then add it to a list. Once this is done, I pass the list to an OpenJPA transactional method called persistList().
This would be okay, except for the size of the file. It's taking forever, so I'm looking for a way to improve it. An idea I had was to use a BlockingQueue in conjunction with the persist/persistList method in a new thread. Unfortunately, my skills in Java concurrency are a bit weak.
Does what I want to do make sense? If so, has anyone done anything like it before?
Servlets should respond to requests within a short amount of time. In this case, persisting the file contents needs to be an asynchronous job, so:
The servlet should respond with some text about the upload job, expected time to complete or something like that.
The uploaded content should be written to some temp space in binary form, rather than keeping it all in memory. This is the usual way the multi-part POST libraries do their work.
You should have a separate service that blocks on a queue of pending jobs. Once it gets a job, it processes it.
The 'job' is simply some handle to the temporary file that was written when the upload happened... and any metadata like who uploaded it, job id, etc.
The persisting service needs to insert a large number of rows while making the result appear 'atomic': either model the intermediate state as part of the table model(s), or write to temp spaces.
If you are writing to temp tables, and then copying all the content to the live table, remember to have enough log space and temp space at the database level.
If you have a full J2EE stack, consider modelling the job queue as a JMS queue, so recovery makes sense. Once again, remember to have proper XA boundaries, so all the row persists fall within an outer transaction.
Finally, consider also having a status check API and/or UI, where you can determine the state of any particular upload job: Pending/Processing/Completed.
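A bare-bones sketch of that queue-and-worker idea (UploadJob and persist() are placeholders for your own job handle and OpenJPA code):

import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical job handle: the temp file written during upload plus some metadata.
class UploadJob {
    final Path tempFile;
    final String uploadedBy;

    UploadJob(Path tempFile, String uploadedBy) {
        this.tempFile = tempFile;
        this.uploadedBy = uploadedBy;
    }
}

public class PersistWorker implements Runnable {

    private final BlockingQueue<UploadJob> queue = new LinkedBlockingQueue<>();

    // Called from the servlet after it has streamed the upload to disk;
    // returns immediately so the HTTP response is not held up.
    public void submit(UploadJob job) {
        queue.add(job);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                UploadJob job = queue.take(); // blocks until a job arrives
                persist(job);                 // parse the temp file and batch-insert rows
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    private void persist(UploadJob job) {
        // Placeholder: parse job.tempFile and hand batches to the OpenJPA layer.
    }
}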

Retrieving Large Lists of Objects Using Java EE

Is there a generally-accepted way to return a large list of objects using Java EE?
For example, if you had a database ResultSet that had millions of objects how would you return those objects to a (remote) client application?
Another example -- that is closer to what I'm actually doing -- would be to aggregate data from hundreds of sources, normalize it, and incrementally transfer it to a client system as a single "list".
Since all the data cannot fit in memory, I was thinking that a combination of a stateful SessionBean and some sort of custom Iterator that called back to the server would do the trick.
So, in other words, if I have an API like Iterator<Data> getData() then what's a good way to implement getData() and Iterator<Data>?
How have you successfully solved this problem in the past?
Definitely don't duplicate the entire DB into Java's memory. That makes no sense and only makes things unnecessarily slow and memory-hogging. Rather, introduce pagination at the database level: query only the data you actually need to display on the current page, as Google does.
If you have a hard time implementing this properly and/or figuring out the SQL query for your specific database, have a look at this answer. For the JPA/Hibernate equivalent, have a look at this answer.
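As a rough idea of database-level pagination with plain JDBC (table and column names are invented; the LIMIT/OFFSET syntax is MySQL/PostgreSQL style and differs per database):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import javax.sql.DataSource;

public class PagedQuery {

    private final DataSource dataSource;

    public PagedQuery(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Fetches only the rows needed for one page instead of the whole table.
    public List<String> fetchPage(int page, int pageSize) throws Exception {
        // LIMIT/OFFSET as used by MySQL/PostgreSQL; other databases need their own dialect.
        String sql = "SELECT name FROM items ORDER BY id LIMIT ? OFFSET ?";
        List<String> names = new ArrayList<>();
        try (Connection con = dataSource.getConnection();
             PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, page * pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString("name"));
                }
            }
        }
        return names;
    }
}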
Update as per the comments (which actually changes the entire question subject...), here's a basic (pseudo) kickoff example:
List<Source> inputSources = createItSomehow();
Source outputSource = createItSomehow();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        outputSource.write(inputSource.read());
    }
}
This way you effectively end up with a single entry in Java's memory instead of the entire collection as in the following (inefficient) example:
List<Source> inputSources = createItSomehow();
List<Entry> entries = new ArrayList<Entry>();

for (Source inputSource : inputSources) {
    while (inputSource.next()) {
        entries.add(inputSource.read());
    }
}

Source outputSource = createItSomehow();

for (Entry entry : entries) {
    outputSource.write(entry);
}
Pagination is a good solution when working with a web-based UI. Sometimes, however, it is much more efficient to stream everything in one call. The rmiio library was written explicitly for this purpose, and is already known to work in a variety of app servers.
If your list is huge, you must assume that it can't fit in memory, or at least that if your server has to handle many such requests concurrently, you run a high risk of OutOfMemoryError.
So basically, what you do is paging and batch reading: say you load a thousand objects from your database, send them in the response to the client request, and loop until you have processed all objects (see the response from BalusC).
The problem is the same on the client side, and you'll likely need to stream the data to the file system to prevent OutOfMemory errors.
Please also note: it is okay to load millions of objects from a database for an administrative task, like performing a backup or an export in some exceptional case, but you should not offer it as a request any user could make. It would be slow and drain server resources.
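Putting the batch-reading loop behind the Iterator<Data> API from the question could look roughly like this (RemoteDataService, fetchNextBatch() and Data are hypothetical; the only assumption is that the server returns an empty list when nothing is left):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Client-side iterator that pulls data in fixed-size batches from a remote
// facade (for example a stateful session bean).
public class BatchedDataIterator implements Iterator<Data> {

    private final RemoteDataService service; // hypothetical remote facade
    private final int batchSize;
    private final Deque<Data> buffer = new ArrayDeque<>();
    private boolean exhausted;

    public BatchedDataIterator(RemoteDataService service, int batchSize) {
        this.service = service;
        this.batchSize = batchSize;
    }

    @Override
    public boolean hasNext() {
        if (buffer.isEmpty() && !exhausted) {
            List<Data> batch = service.fetchNextBatch(batchSize); // hypothetical call
            if (batch.isEmpty()) {
                exhausted = true;
            } else {
                buffer.addAll(batch);
            }
        }
        return !buffer.isEmpty();
    }

    @Override
    public Data next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return buffer.poll();
    }
}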
