Spring Batch read and update from same table and multithreaded setup? - java

I am facing an issue similar to the one described in this question:
Spring batch jpaPagingItemReader why some rows are not read?
I have to read records from a table with a query like
SELECT * FROM TABLE1 WHERE COLUMN1 IS NULL
and, based on the values of other columns, do some processing (call a REST service), fetch some data, and write it back into COLUMN1.
Since I am using a RepositoryItemReader with a taskExecutor, the paginated fetch does not work correctly and around half of the eligible records are skipped.
To avoid this, I reduced the query to SELECT * FROM TABLE1 so that the result set does not change as records are updated and pagination is not affected, and added a check in code to skip records whose column is not null. Even with this setup, records are still being skipped.
Also, this problem only occurs when the page size is smaller than the actual number of records.
If I keep the page size greater than or equal to the total number of eligible records, I don't face any issues. I am not sure if such a large page size (~100000) is a wise thing to have.
I noticed that in that case a single query retrieves all the records, and the taskExecutor threads then process and write them.
Due to high volume of data, I cannot avoid multi-threading as single-threaded mode is dreadfully slow.
Any pointers what can be done?

You are basically trying to implement the process indicator pattern in a multi-threaded step. There is a sample for that here: Parallel sample. The idea is to use a staging table with the process indicator instead of modifying the original table.
That said, I am not sure if the process indicator pattern can be implemented with a paging technique. A partitioned step where each partition is read in sequence with a cursor-based reader is a better option in my opinion.
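For illustration, here is a minimal sketch of what the reading side of such a partitioned step could look like. This assumes Spring Batch 4.x, a hypothetical Record domain class, a numeric ID primary key, and a Partitioner that places minId/maxId range boundaries into each partition's step execution context; none of this is taken from the original question.

import javax.sql.DataSource;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

// declared inside the batch @Configuration class
@Bean
@StepScope
public JdbcCursorItemReader<Record> partitionReader(
        DataSource dataSource,
        @Value("#{stepExecutionContext['minId']}") Long minId,
        @Value("#{stepExecutionContext['maxId']}") Long maxId) {
    return new JdbcCursorItemReaderBuilder<Record>()
            .name("partitionReader")
            .dataSource(dataSource)
            // each partition reads its own key range sequentially with a cursor,
            // so updates to COLUMN1 can no longer shift pages under the reader
            .sql("SELECT * FROM TABLE1 WHERE COLUMN1 IS NULL AND ID BETWEEN ? AND ?")
            .preparedStatementSetter(ps -> {
                ps.setLong(1, minId);
                ps.setLong(2, maxId);
            })
            .rowMapper(new BeanPropertyRowMapper<>(Record.class))
            .build();
}

The partitions themselves are then executed in parallel by the step's TaskExecutor, so each thread gets its own reader instead of sharing one paginated reader.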

Related

Sorting functionality optimization using MySQL and Java

I am going to generate a simple CSV report in Java using Hibernate and MySQL.
I am using Hibernate's native SQL support to fetch the data (the query is too complex for HQL or Criteria, but that doesn't matter here) and simply writing it out with a CSVWriter API (which also doesn't matter here).
As far all is well, but the problem starts now.
Requirements:
The report can contain 5000K to 15000K records (5 to 15 million), each with 25 fields.
It can be run in real time.
There is one report column (let's say finalValue) that I want to sort by; it is computed like this: (sum(b.quantity*c.unit_gross_price) - COALESCE(sum(pai.value),0)).
Problem:
A MySQL index cannot be used for the finalValue column (mentioned above), as it is a complex combination of aggregate functions. So if I execute the query (with or without a limit) with sorting, it takes 40 seconds; without sorting it takes 0.075 seconds.
The Solutions:
These are some solutions I can think of, but each has limitations.
Sorting using java.util.TreeSet: it will throw an OutOfMemoryError, which is expected, since heap space will be exceeded if I load 15000K heavy objects.
Using LIMIT in the MySQL query and writing the file in chunks: it will take a long time, since every query takes around the same 50 seconds (LIMIT cannot be used without the sort, so the sort runs on every query).
So the main problem here is to balance two constraints: memory and time.
Any ideas, suggestions?
NOTE: I have not included any code snippets here; that doesn't mean the question lacks detail. Code simply isn't required here.
I think you can use a streaming ResultSet here, as documented on this page under the ResultSet section.
Here are the main points from the documentation.
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement. If you are working with ResultSets that have a large number of rows or large values and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
Statement stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                                      java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach. You must read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
The earliest the locks these statements hold can be released (whether they be MyISAM table-level locks or row-level locks in some other storage engine such as InnoDB) is when the statement completes.
If using streaming results, process them as quickly as possible if you want to maintain concurrent access to the tables referenced by the statement producing the result set.
So, with a streaming result set, run your ORDER BY query and start writing the results into your CSV file.
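For instance, here is a minimal end-to-end sketch with plain JDBC; the connection URL, source table, and column names are made up, and with Hibernate you can reach the underlying Connection via Session#doWork.

import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingCsvExport {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/reportdb", "user", "password");
             Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                   ResultSet.CONCUR_READ_ONLY);
             PrintWriter out = new PrintWriter(new FileWriter("report.csv"))) {

            // Integer.MIN_VALUE tells the MySQL driver to stream rows one at a time
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT col1, col2, finalValue FROM report_source ORDER BY finalValue")) {
                while (rs.next()) {
                    // write each row as it arrives instead of materializing millions of rows
                    out.println(rs.getString("col1") + ","
                            + rs.getString("col2") + ","
                            + rs.getBigDecimal("finalValue"));
                }
            }
        }
    }
}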
This still probably doesn't solve the sorting issue, but I think if you can't pre-generate that value and put an index on it, the sorting is going to take some time.
However, there might be some server config variables that you can use to optimize the sorting performance.
From the MySQL Order-By optimization page
One is the read_rnd_buffer_size value; according to the docs:
Setting the variable to a large value can improve ORDER BY performance by a lot
Another one is sort_buffer_size, for which the docs say the following:
If you see many Sort_merge_passes per second in SHOW GLOBAL STATUS output, you can consider increasing the sort_buffer_size value to speed up ORDER BY or GROUP BY operations that cannot be improved with query optimization or improved indexing.
Another variable that can probably help is innodb_buffer_pool_size, which allows InnoDB to keep as much table data in memory as possible and avoid some disk seeks.
However, all of these variables require some tuning. Some trial-and-error and probably some kind of benchmarking to get right.
There are some other suggestions on that MySQL Order-By optimization page as well.
Use a temporary table to store your select result with an index on finalValue. This will store and index your intermediate result.
CREATE TEMPORARY TABLE my_temp_table (INDEX my_index_name (finalValue))
SELECT ... -- your select
Note that complex expressions will require an alias in your SELECT to be used as a part of a CREATE TABLE SELECT. I assume that your SELECT has the alias finalValue (the column you mentioned).
Then select the temporary table ordered by the finalValue (the index will be used).
SELECT * FROM my_temp_table ORDER BY finalValue;
And finally drop the temporary table (or reuse it if you want, but remember that temporary data is automatically deleted when the client session terminates).
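Since a temporary table is visible only to the session that created it, all three steps have to run on the same JDBC connection. A rough sketch of the flow (the report query itself is left as a placeholder):

// your real report query goes here, with the expression aliased AS finalValue
String reportSelect = "SELECT ...";

try (Statement stmt = conn.createStatement()) {
    // 1. materialize and index the intermediate result
    stmt.execute("CREATE TEMPORARY TABLE my_temp_table "
            + "(INDEX my_index_name (finalValue)) " + reportSelect);

    // 2. read it back ordered by the indexed column
    try (ResultSet rs = stmt.executeQuery(
            "SELECT * FROM my_temp_table ORDER BY finalValue")) {
        while (rs.next()) {
            // stream each row into the CSV writer
        }
    }

    // 3. clean up (this also happens automatically when the session ends)
    stmt.execute("DROP TEMPORARY TABLE my_temp_table");
}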
Summary tables. (Let's see more details to be sure this is Data Warehouse type data.) Summary tables are augmented periodically with subtotals and counts. Then when the report is needed, the data is readily available almost directly from the summary table, rather than scanning lots of raw data and doing aggregates.
My blog on Summary Tables. Let's see your schema and report query; we can discuss this in more detail.

Optimization for fetching from a bulky table

I have a PostgreSQL table that has millions of records. I need to process every row, and for that I am using a column in that table named 'isProcessed'; it defaults to false and I set it to true once the row is processed.
Now the problem is that there are too many records, and due to exceptions the code skips some records, leaving them with isProcessed=false, which makes the execution really slow.
I was thinking of adding an index, but an index on a boolean column does not help.
Please provide some optimization technique or some better practice.
UPDATE:
I don't have the code; it's just a problem my colleagues were asking my opinion on.
Normally an index on a boolean isn't a good idea, but in PostgreSQL you can create an index that contains entries for only one value using a partial index: http://www.postgresql.org/docs/9.3/interactive/indexes-partial.html. It ends up being a queue of things for you to process; items drop off once done.
CREATE INDEX "yourtable_isProcessed_idx" ON "public"."yourtable"
    USING btree ("isProcessed")
    WHERE ("isProcessed" IS NOT TRUE);
This will make life easier when it is looking for the next item to process. Ideally you should be processing more than one at a time, particularly if you can do it in a single query, though doing millions at once may be prohibitive. In that situation, you might be able to do
update yourtable
set ....
where id in (select id from yourtable where isProcessed = false limit 100 )
If you have to do things one at a time, I'd still limit what you retrieve, so potentially retrieve
select id from yourtable where isProcessed = false limit 1
Without seeing your code, it would be tough to say what is really going on. Doing any processing row by row, which it sounds like is what is going on, is going to take a VERY long time.
Generally speaking, the best way to work with data is in sets. At the end of your process, you're ultimately going to have a set of records where isProcessed needs to be true (the operation succeeded) and a set where isProcessed needs to be false (the operation failed).

As you process the data, keep track of which records were updated successfully and which were not, for example in a list or array of primary keys (or whatever other data you use to identify the rows). Then, after you're done processing, run one update to flag the records that succeeded and one to flag the records that failed. This is a bit more code, but updating each row individually right after you process it is going to be awfully slow.
Again, seeing code would help, but if you're updating each record after you process it, I suspect this is what's slowing you down.
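To make that concrete, here is a hedged sketch of the set-based flagging with plain JDBC against PostgreSQL; the table and column names are assumptions, not taken from the question.

import java.sql.Array;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ProcessedFlagWriter {

    // one UPDATE per outcome instead of one UPDATE per processed row
    public static void markProcessed(Connection conn, List<Long> succeededIds) throws SQLException {
        if (succeededIds.isEmpty()) {
            return;
        }
        // pass the whole id list as a SQL array and update in a single statement
        Array idArray = conn.createArrayOf("bigint", succeededIds.toArray());
        try (PreparedStatement ps = conn.prepareStatement(
                "UPDATE yourtable SET isProcessed = true WHERE id = ANY(?)")) {
            ps.setArray(1, idArray);
            ps.executeUpdate();
        }
    }
}

A second, analogous statement (or an extra error column, as suggested in the next answer) can flag the rows that failed.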
Here is the approach I use. You should be able to store the processing state, including errors. It can be one column with the values PENDING, PROCESSED, ERROR, or two columns, is_processed and is_error.
This lets you skip records that couldn't be processed successfully, which would otherwise slow down the processing of good tasks. You can try to reprocess them later, or give DevOps the possibility to move tasks from ERROR back to PENDING if the failure was caused by, for example, a temporarily unavailable resource.
Then you create a partial (conditional) index on the table that includes only PENDING tasks.
Processing is done using the following algorithm (using Spring; transaction and nestedTransaction are Spring TransactionTemplate instances):
List<Element> batch;
while (!(batch = getNextBatch()).isEmpty()) {
    final List<Element> currentBatch = batch;
    transaction.execute((TransactionStatus status) -> {
        for (Element element : currentBatch) {
            try {
                // the nested transaction maps to a PostgreSQL SAVEPOINT
                nestedTransaction.execute((TransactionStatus nestedStatus) -> {
                    processElement(element);
                    markAsProcessed(element);
                    return null;
                });
            } catch (Exception e) {
                markAsFailed(element);
            }
        }
        return null;
    });
}
Several important notes:
records are fetched in batches - this at least saves round trips to the database and is quicker than one-by-one retrieval
individual elements are processed in a nested transaction (implemented using PostgreSQL SAVEPOINTs). This is quicker than processing each element in its own transaction, but still has the benefit that a failure while processing one element does not lose the results of the other elements in the batch.
This is useful when the processing is complex enough that it cannot be done in SQL with a single query per batch. If processElement is just a simple update of the element, the whole batch can be updated with a single UPDATE statement instead.
the elements of a batch may be processed in parallel. This requires propagating the transaction to the worker threads.

efficient db operations

Here is the scenario I am researching a solution for at work. We have a table in Postgres which stores events happening on the network. Currently it works like this: rows are inserted as network events arrive, and at the same time older records matching a specific timestamp are deleted in order to keep the table size limited to some 10,000 records; basically the same idea as log rotation.

Network events come in bursts of thousands at a time, so the transaction rate is very high and causes performance degradation; after some time the server either crashes or becomes very slow. On top of that, the customer is asking to keep the table size at up to a million records, which will only accelerate the degradation (since we have to keep deleting records matching a specific timestamp) and cause space-management issues. We are using plain JDBC to read/write the table. Can the tech community out there suggest a better-performing way to handle the inserts and deletes in this table?
I think I would use partitioned tables, perhaps 10 x total desired size, inserting into the newest, and dropping the oldest partition.
http://www.postgresql.org/docs/9.0/static/ddl-partitioning.html
This makes the load from "dropping the oldest" much smaller than querying and deleting rows.
Update: I agree with nos' comment though, the inserts/deletes may not be your bottleneck. Maybe some investigation first.
Some things you could try -
Write to a log, and have a separate batch process write to the table.
Keep the writes as they are, do the deletes periodically or at times of lower traffic.
Do the writes to a buffer/cache, have the actual db writes happen from the buffer.
A few general suggestions -
Since you're deleting based on timestamp, make sure the timestamp is indexed. You could also do this with a counter / auto-incremented rowId (e.g. delete where id < currentId - 1000000).
Also, JDBC batch write is much faster than individual row writes (order of magnitude speedup, easily). Batch writing 100 rows at a time will help tremendously, if you can buffer the writes.
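For example, a minimal sketch of such a batched insert; the table, columns, and NetworkEvent class are made up for illustration:

String sql = "INSERT INTO network_events (event_time, source, payload) VALUES (?, ?, ?)";
conn.setAutoCommit(false);
try (PreparedStatement ps = conn.prepareStatement(sql)) {
    int count = 0;
    for (NetworkEvent e : bufferedEvents) {
        ps.setTimestamp(1, e.getTime());
        ps.setString(2, e.getSource());
        ps.setString(3, e.getPayload());
        ps.addBatch();
        if (++count % 100 == 0) {
            ps.executeBatch();   // one round trip per 100 rows
        }
    }
    ps.executeBatch();           // flush whatever is left
    conn.commit();
}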

Is there a good patterns for distributed software and one backend database for this problem?

I'm looking for a high-level answer, but here are some specifics in case they help: I'm deploying a J2EE app to a WebLogic cluster, with one Oracle database at the backend.
A normal flow of the app is
- users feed data (to be inserted as rows) to the app
- the app waits for the data to reach a certain size and does a batch insert into the database (only 1 commit)
There's a constraint in the database preventing "duplicate" data insertions. If the app gets a constraint violation, it will have to roll back and re-insert one row at a time, so the duplicate rows can be "renamed" and inserted.
Suppose I had 2 running instances of the app. Each of the instances is about to insert 1000 rows. Even if there is only 1 duplicate, one instance will have to roll back and insert rows one by one.
I can easily see that it would be smarter to re-insert the non-conflicting 999 rows as a batch in this instance, but what if I had 3 running apps and the 999 rows also had a chance of duplicates?
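For reference, the current flow described above looks roughly like the following JDBC sketch; the table, columns, and the insertSingle/rename helpers are hypothetical, and the exact exception reported for a duplicate depends on the Oracle driver.

try {
    conn.setAutoCommit(false);
    try (PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO items (name, value) VALUES (?, ?)")) {
        for (Row row : buffered) {
            ps.setString(1, row.getName());
            ps.setString(2, row.getValue());
            ps.addBatch();
        }
        ps.executeBatch();
    }
    conn.commit();                        // happy path: one commit for the whole batch
} catch (SQLException batchFailed) {
    conn.rollback();                      // constraint violation somewhere in the batch
    for (Row row : buffered) {
        try {
            insertSingle(conn, row);      // hypothetical helper: single-row insert + commit
        } catch (SQLIntegrityConstraintViolationException duplicate) {
            insertSingle(conn, rename(row)); // "rename" the duplicate and retry
        }
    }
}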
So my question is this: is there a design pattern for this kind of situation?
This is a long question, so please let me know where to clarify. Thank you for your time.
EDIT:
The 1000 rows of data are in memory in each instance, but the instances cannot see each other's rows. The only way they know a row is a duplicate is when it is inserted into the database.
And if the current application design doesn't make sense, feel free to suggest better ways of tackling this problem. I would appreciate it very much.
http://www.oracle-developer.net/display.php?id=329
The simplest approach would be to avoid parallel processing of the same data. For example, your size- or time-based event could run on only one node, or post a message to a JMS queue so that only one of the nodes processes it (for instance, by using a similar duplicate check, e.g. based on a timestamp of the message/batch).

speed up operation on mysql

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
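If you do go down that route, the Java side shrinks to a single call (the procedure name here is hypothetical):

try (Connection conn = dataSource.getConnection();
     CallableStatement cs = conn.prepareCall("{call recalculate_metrics()}")) {
    cs.execute();   // all the row crunching stays inside MySQL; nothing crosses the wire
}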
Assume the table you want to process (A) has 10 million rows. Create a table B in the database to store the ranges of rows claimed by each node. Write the Java program so that a node first fetches the last row claimed by the other nodes and then adds an entry to table B telling the others which range of rows it is going to process (you decide the range size; let's assume each node processes 1,000 rows at a time).

Node 1 reads table B, finds it empty, and inserts a row ('Node1', 1000), declaring that it will process rows with primary key <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along, sees that the first 1,000 keys are taken, and inserts ('Node2', 2000), declaring that it will process rows 1001 to 2000. Note that access to table B must be synchronized, i.e. only one node can work on it at a time.
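A sketch of that claim step, using MySQL's named locks (GET_LOCK/RELEASE_LOCK) to synchronize access to table B; the table and column names are assumptions:

// claims the next range of primary keys and records it in table B (assumes auto-commit)
long claimNextRange(Connection conn, String nodeName, int rangeSize) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
        // serialize access to table B; a real implementation should check the return
        // value of GET_LOCK and release the lock in a finally block
        stmt.execute("SELECT GET_LOCK('claim_range', 10)");

        long lastClaimed;
        try (ResultSet rs = stmt.executeQuery("SELECT COALESCE(MAX(end_id), 0) FROM table_b")) {
            rs.next();
            lastClaimed = rs.getLong(1);
        }

        long newEnd = lastClaimed + rangeSize;
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO table_b (node_name, end_id) VALUES (?, ?)")) {
            ps.setString(1, nodeName);
            ps.setLong(2, newEnd);
            ps.executeUpdate();
        }

        stmt.execute("SELECT RELEASE_LOCK('claim_range')");
        return newEnd;   // this node now owns rows (lastClaimed, newEnd] of table A
    }
}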
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This increases the chance of query-cache hits and reduces the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It also reduces the time a row lock is held, decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to transparently offload batch processing from MySQL back to the cluster.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or to use a stored procedure, so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.
