Pattern for batch query crawler operations - java

I am trying to create an abstraction for a batch query crawler operation. The idea is that a query is executed, a result set is obtained, and for each row an operation is performed that either commits or rolls back. The requirements are that all rows are processed regardless of individual failures and that the result set is not loaded into memory beforehand.
The problem boils down to the fact that it is not possible to maintain an open result set after a rollback. This is as per the spec: cursor holdability can be maintained over commit (using ResultSet.HOLD_CURSORS_OVER_COMMIT), but not over rollback.
A naive implementation with JTA/JDBC semantics, providing two extension points, one for specifying the query and one for specifying the actual operation for each row, would look something like this:
UserTransaction tx = getUserTransaction();
tx.begin();
ResultSet rs = executeQuery(); // extension point
tx.commit();
while (rs.next()) {
    tx.begin();
    try {
        performOperationOnCurrentRow(rs); // extension point
        tx.commit();
        logSuccess();
    } catch (Exception e) {
        tx.rollback();
        logFailure(e);
    }
}
This does not seem such a far-fetched scenario, yet I've found very little relevant information on the web. The question is: has this been addressed elegantly by any of the popular frameworks? I do not necessarily need an out-of-the-box solution; I just wonder whether there is a known, generally accepted approach to this scenario.
One solution would be to keep track of the row that failed and re-open a cursor just after that point, which generally requires enforcing some rules on the extensions (e.g. an ordered result set, a query that uses the last failed row id in its WHERE clause, etc.).
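A minimal sketch of that first option, using plain JDBC rather than JTA for brevity: after a rollback closes the result set, the query is re-opened positioned after the last processed id. The table name, columns and helper methods here are placeholders, not part of the original design.
// Hypothetical sketch: resume the crawl after the last processed id whenever
// a per-row rollback closes the result set. work_item, getConnection(),
// performOperationOnCurrentRow() and log*() are placeholders.
long lastProcessedId = 0L;
boolean finished = false;
while (!finished) {
    try (Connection con = getConnection();
         PreparedStatement ps = con.prepareStatement(
                 "SELECT id, payload FROM work_item WHERE id > ? ORDER BY id")) {
        ps.setLong(1, lastProcessedId);
        try (ResultSet rs = ps.executeQuery()) {
            finished = true; // stays true if we reach the end without a failure
            while (rs.next()) {
                lastProcessedId = rs.getLong("id");
                try {
                    performOperationOnCurrentRow(rs); // begins/commits its own transaction
                    logSuccess();
                } catch (Exception e) {
                    logFailure(e); // operation rolled back; the result set may now be closed
                    finished = false;
                    break; // re-open the cursor just after the failed row
                }
            }
        }
    } catch (SQLException e) {
        throw new IllegalStateException(e);
    }
}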
Another would be to use two different threads, one for the query and one for the row operations.
What would you do?

Since this has not been answered for a couple years now, I will go on and answer it myself.
The solution we worked out revolves around the idea that the process (called BatchProcess) executes a Query (not limited to SQL, mind you) and adds its results to a concurrent Queue. The BatchProcess spawns a number of QueueProcessor objects (run on new Threads) that consume entries of the Queue and execute an Operation that uses the entry as input. Each Operation execution is performed as a single unit of work. The underlying transaction manager is a JTA implementation.
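A stripped-down sketch of that producer/consumer structure follows. The real BatchProcess is considerably richer; the class, interface and method names here are placeholders, not the Bo2 API.
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified illustration of the BatchProcess idea: a producer fills a queue
// from the query, and several consumers execute the per-row Operation, each
// row in its own unit of work. Termination handling (poison pills etc.) is omitted.
public class BatchProcessSketch<T> {

    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(1000);

    public void run(Query<T> query, Operation<T> operation, int workers) throws InterruptedException {
        Thread producer = new Thread(() -> query.forEach(row -> {
            try {
                queue.put(row); // blocks when the queue is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
        producer.start();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    while (true) {
                        T row = queue.take(); // blocks until an entry is available
                        executeInNewTransaction(operation, row); // commit or roll back per row
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
        producer.join();
        // real implementation: completion signalling, monitoring, retry queues, etc.
    }

    private void executeInNewTransaction(Operation<T> operation, T row) {
        // begin JTA transaction, operation.execute(row), commit; on failure roll back and log
    }

    interface Query<T> { void forEach(java.util.function.Consumer<T> consumer); }
    interface Operation<T> { void execute(T row); }
}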
A bit dated, but at some point the implementation was this https://bo2.googlecode.com/svn/trunk/Bo2ImplOpen/main/gr/interamerican/bo2/impl/open/runtime/concurrent/BatchProcess.java
There is even a GUI for BatchProcess monitoring and management somewhere in that repo ;)
Disclaimer: This guy had A LOT to do with this design and implementation https://stackoverflow.com/users/2362356/nakosspy

Related

save() performance - will saveAndFlush() be better?

I would like to ask you about the performance of save() in CrudRepository.
Firstly, an example of code.
for (int i = 0; i < 5000; i++) {
    Example example = new Example(0, true, false, i, "example");
    example = exampleRepository.save(example);
    List<ChildExample> childExamples = new ArrayList<>();
    ChildExample childExample = new ChildExample(0, i, true, example);
    childExamples.add(childExample);
    childExampleRepository.saveAll(childExamples);
}
This is just an example, but everything has to stay as it is (e.g. creating a list of all examples first and then using saveAll, or using cascades, is not allowed).
What have I observed? The first 2000 objects were saved quite fast, let's say in 10 minutes. But the next 2000 took much longer, about 30 minutes. Why is that? Why does saving each subsequent batch take longer? What if I use JpaRepository and saveAndFlush()? Will the process be shortened if I use saveAndFlush()?
When you call save(), which is the equivalent of entityManager.persist(), the persistence provider does not implicitly perform an INSERT on the physical database. It simply stores the given entity in its persistence context; the entity becomes managed in the current session cache (first-level cache).
This is to prevent unnecessary overhead from CRUD operations. By default, the changes are flushed on commit of the current transaction (or upon reaching a certain threshold of managed entities, as in your case). An implicit flush may also be triggered when a SELECT is performed during the transaction that involves persisted entities somewhere in its JOINs (not the case here, though).
When you use flush, the persistence provider is obliged to perform a physical save on the database at that moment.
But will it increase the performance? There is no clear answer to that question and it totally depends on each unique scenario. It is an option though and you need to perform a set of tests in order to find out.
You may also fiddle around with hibernate.jdbc.batch_size. You may gain a lot if you get this configuration right for your particular circumstance.
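For illustration, a hedged sketch of combining JDBC batching with periodic flush/clear, written directly against the EntityManager. The injected entityManager and the batch size of 50 are assumptions, not taken from the question.
// Assumes hibernate.jdbc.batch_size=50 is set in the persistence configuration
// and that an EntityManager is available; 50 is just an example value.
int batchSize = 50;
for (int i = 0; i < 5000; i++) {
    Example example = new Example(0, true, false, i, "example");
    entityManager.persist(example);
    if (i > 0 && i % batchSize == 0) {
        entityManager.flush(); // push the pending INSERTs to the database now
        entityManager.clear(); // detach the entities so the persistence context stays small
    }
}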

Concurrent Conflicting statements with db

I am attempting to make a website (using HTML, JavaScript and JSP) that sends modification and selection queries to a DB at the same time. MySQL apparently doesn't like that (ConcurrentModificationExceptions everywhere).
I thought of creating something that receives SQL statements concurrently, orders them into a queue based on some property, and then executes the queue one by one after making sure that they don't contradict each other (an insert statement after one that deletes the table would contradict).
The problem is that I'm not sure how to check whether two statements conflict. What I had in mind is checking what the tables would theoretically look like if the statements were executed (by running them on a duplicated table), and if an error is thrown, the statement conflicts with another statement. But this means I would have to duplicate the table many times, and I highly doubt it would work.
So, How can I check if two statements conflict?
For example:
String sql1 = "DELETE FROM users WHERE id=3625036";
String sql2 = "UPDATE users SET displayName=\"FOO\" WHERE id=3625036";
If these two are received concurrently and then ordered in some way, then sql2 might be executed after sql1 and that would throw an exception. How can I check for conflict in the given example?
MySQL, like all full DB systems, supports lots of concurrent operations within normal transactional and locking restrictions. You're best off solving your particular problem by asking a question on Stack Overflow.
I think you shouldn't set students the task of managing queueing etc. The complexity of what you're describing is significant, and more importantly, that's what database systems are for. They should be taught not to reinvent the wheel when they can make use of something that's far better than they could build, unless you specifically want to teach such low-level DB construction.
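As a rough illustration of "let the database do it": each request can simply run its statement in its own short transaction and MySQL's row locking will serialize conflicting writes. This is a sketch only; dataSource is assumed to be a configured javax.sql.DataSource and the surrounding method is assumed to declare SQLException.
// Each web request executes its own statement in a short transaction;
// InnoDB row locks serialize conflicting writes, so no application-level queue is needed.
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);
    try (PreparedStatement ps =
                 con.prepareStatement("UPDATE users SET displayName=? WHERE id=?")) {
        ps.setString(1, "FOO");
        ps.setLong(2, 3625036L);
        int updated = ps.executeUpdate(); // 0 rows if another request deleted the user first
        con.commit();
    } catch (SQLException e) {
        con.rollback();
        throw e;
    }
}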
It may be a MySQL driver issue; try updating the MySQL driver.
Another workaround is to implement table-level synchronization in your code.
example:
class UserDAO {

    public void updateUsers(String sql) {
        synchronized (UserDAO.class) {
            // do update operations
        }
    }

    public void deleteUser(String sql) {
        synchronized (UserDAO.class) {
            // do delete operations
        }
    }
}

JPA using EclipseLink and java.sql: when does it connect to the DB?

Let me explain my question, since the title may not make it very clear, but I didn't find a way to summarize the problem in a few words. Basically I have a web application whose DB has 5 tables. 3 of these are managed using JPA with the EclipseLink implementation. The other 2 tables are managed directly with SQL using the java.sql package. When I say "managed" I just mean that queries, insertions, deletions and updates are performed in two different ways.
Now the problem is that I have to monitor the response time of each call to the DB. For this I have a library that uses aspects, so at runtime I can monitor the execution time of any code snippet. The question is: if I want to monitor the response time of a DB request (let's suppose the DB is remote, so the response time also includes network latency, which is fine), what are, in the two distinct cases described above, the instructions whose execution time has to be measured?
Let me give an example to make this clearer.
Suppose the case of using JPA to execute a DB update. I have the following code:
EntityManagerFactory emf = Persistence.createEntityManagerFactory(persistenceUnit);
EntityManager em = emf.createEntityManager();
EntityToPersist e = new EntityToPersist();
em.persist(e);
Is it correct to assume that only the em.persist(e) instruction connects and makes a request to the DB?
The same question applies to using java.sql:
Connection c = dataSource.getConnection();
Statement statement = c.createStatement();
statement.executeUpdate(stm);
statement.close();
c.close();
In this case, is it correct to assume that only statement.executeUpdate(stm) connects and makes a request to the DB?
In case it is useful to know, the remote DBMS is MySQL.
I tried searching the web, but it is a rather specific problem and I'm not sure what to look for, short of reading the full JPA or java.sql specifications.
Please if you have any question or if there is something that is not clear from my description, don't hesitate to ask me.
Thank you a lot in advance.
In JPA (and therefore also in EclipseLink) you have to differentiate between SELECT queries, which do not need any transaction, and queries that change data (DELETE, INSERT, UPDATE), which all need a transaction. When you select data, it is enough to measure the time of Query.getResultList() (and similar calls). For the other operations (EntityManager.persist(), merge() or remove()) there is a flushing mechanism, which forces the queue of queries (or a single query) from the cache to hit the database. The question is when the EntityManager is flushed: usually on transaction commit, or when you call EntityManager.flush(). And that raises another question: when is the transaction committed? The answer depends on your connection setup (whether autocommit is true or not), but a proper setup is autocommit=false with transactions begun and committed explicitly in your code.
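As an illustration of where such a probe would sit in the JPA case: the explicit resource-local transaction and the System.nanoTime() calls below are my additions for the sake of the example, not part of the original snippet.
EntityManager em = emf.createEntityManager();
em.getTransaction().begin();

EntityToPersist e = new EntityToPersist();
em.persist(e); // usually only registers the entity in the persistence context

long start = System.nanoTime();
em.flush();    // this is where the INSERT actually hits the database
long flushTime = System.nanoTime() - start;

start = System.nanoTime();
em.getTransaction().commit(); // the commit itself may also do work on the connection
long commitTime = System.nanoTime() - start;

em.close();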
When working with statement.executeUpdate(stm) it is enough to measure only such calls.
PS: usually you do not connect directly to any database; that is done by a pool (even if you work with a DataSource), which simply gives you an already established connection, but that again depends on your setup.
PS2: for EclipseLink, probably the most reliable way would be to take a look at the source code to find where the internal flush is made and measure that part.

Multithreaded (or async) calculation on Spring Framework

I am learning Spring Framework, and it is pretty awesome.
I want to use Java multithreading, but I don't know how to do it with the Spring Framework.
Here is the code for service:
//StudentService.java
public List<Grade> loadGradesForAllStudents(Date date) {
    try {
        List<Grade> grades = new ArrayList<Grade>();
        List<Student> students = loadCurrentStudents(); // LOAD FROM THE DB
        for (Student student : students) { // I WANT TO USE MULTITHREADING FOR THIS PART
            // LOAD FROM DB (MANY JOINS)
            History studentHistory = loadStudentHistory(student.getStudentId(), date);
            // CALCULATION PART
            Grade calculatedGrade = calcStudentGrade(studentHistory, date);
            grades.add(calculatedGrade);
        }
        return grades;
    } catch (Exception e) {
        ...
        return null;
    }
}
And without multithreading, it is pretty slow.
I guess the for loop causes the slowness, but I don't know how to approach this problem. If you could give me a useful link or example code, I'd appreciate it.
I figured out that the method loadStudentHistory is pretty slow (around 300ms) compared to calcStudentGrade (around 30ms).
Using multithreading for this is a bad idea in an application with concurrent users, because instead of each request using one thread and one connection, each request would now use multiple threads and multiple connections. It doesn't scale as the number of users grows.
When I look at your example I see two possible issues:
1) You have too many round trips between the application and the database, where each of those trips takes time.
2) It's not clear whether each query uses a separate transaction (you don't say where the transactions are demarcated in the example code). If each query creates its own transaction, that could be wasteful, because each transaction has overhead associated with it.
Using multithreading will not do much to help with #1 (and if it does help, it will put more load on the database) and will either have no effect on #2 or make it worse (depending on the current transaction scopes; if the queries were in the same transaction before, with multiple threads they will have to be in different transactions). And, as already explained, it won't scale up.
My recommendations:
1) Make the service transactional, if it is not already, so that everything it does happens within one transaction. Remove the exception-catching/null-returning stuff (which interferes with how Spring uses exceptions to roll back transactions) and introduce an exception handler so that anything thrown from controllers is caught and logged. That will minimize your overhead from creating transactions and make your exception handling cleaner.
2) Create one query that brings back the list of your students. That way the query is sent to the database once, and the result set is read back in chunks (according to the fetch size on the result set). You can customize the query to bring back only what you need, so you don't have an excessive number of joins. Run explain-plan on the query and make sure it uses indexes. You will have a faster query and a much smaller number of round trips, which will make a big speed improvement.
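A hedged sketch of how those two recommendations might look together; the entity model, the repository method and the JPQL are assumptions about the schema, not code from the question, and imports are omitted for brevity.
// Sketch only: names and the JPQL are assumptions about the underlying schema.
@Service
public class StudentService {

    private final StudentRepository studentRepository;

    public StudentService(StudentRepository studentRepository) {
        this.studentRepository = studentRepository;
    }

    @Transactional(readOnly = true) // one transaction for the whole calculation
    public List<Grade> loadGradesForAllStudents(Date date) {
        // one query with a fetch join instead of one history query per student
        List<Student> students = studentRepository.findAllWithHistory(date);
        List<Grade> grades = new ArrayList<>();
        for (Student student : students) {
            grades.add(calcStudentGrade(student.getHistory(), date)); // calcStudentGrade as in the question
        }
        return grades;
    }
}

// in a separate file:
public interface StudentRepository extends JpaRepository<Student, Long> {
    // the fetch join pulls the history rows back in the same round trip
    @Query("select distinct s from Student s join fetch s.history h where h.date = :date")
    List<Student> findAllWithHistory(@Param("date") Date date);
}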
The simple solution is called streams; these enable you to iterate in parallel, for example:
students.stream().parallel().forEach(student -> doSomething(student));
This will give you a noticeable performance boost, but it won't remove the database query overhead. If your DB management system takes about 300ms to return results, you're either using an ORM on big databases or your queries are highly inefficient; I recommend re-analyzing your current solution.

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..
Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable result set would only transfer rows to the client on an as-required basis. Unfortunately the MySQL Connector/J actually fakes it: it executes the entire query and transports it to the client, so the driver actually has the entire result set loaded in RAM and will drip-feed it to you (evidenced by your out-of-memory problems). You had the right idea; it's just a shortcoming of the MySQL Java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 is the best you're going to get unless you break out of MySQL Connector/J. Though there's no reason you can't raise the limit on the query. Provided you have enough RAM to hold the index, this should be a somewhat cheap operation. I'd modify it slightly to grab a batch at a time, and use the highest id of that batch to grab the next batch.
Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>
You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(hql); // hql is the HQL string of the query
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// you may need to evict() as well
}
results.close();
This allows you to stream over the result set; however, Hibernate will still cache results in the Session, so you'll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
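For instance, a hedged variant of the quoted loop that clears the session periodically; the interval of 1000 is arbitrary.
// Same scroll loop as above, with a periodic clear; 1000 is an arbitrary interval.
int count = 0;
while (results.next()) {
    Object row = results.get()[0];
    // process row, e.g. write it to the file
    if (++count % 1000 == 0) {
        session.clear(); // drop the first-level cache so memory usage stays flat
    }
}
results.close();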
Set fetch size in query to an optimal value as given below.
Also, when caching is not required, it may be better to use StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize(1000) // <<--- !!!!
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
The fetch size must be Integer.MIN_VALUE, otherwise it won't work.
This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.
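A rough JDBC sketch of that count-then-LIMIT loop; the table name, the batch size and the connection handling are assumptions, and the surrounding method is assumed to declare SQLException.
// Illustrative only: page through Person in fixed-size batches with LIMIT offset, count.
int batchSize = 1000;
long total;
try (Statement st = connection.createStatement();
     ResultSet countRs = st.executeQuery("SELECT COUNT(*) FROM Person")) {
    countRs.next();
    total = countRs.getLong(1);
}
for (long offset = 0; offset < total; offset += batchSize) {
    try (PreparedStatement ps =
                 connection.prepareStatement("SELECT * FROM Person LIMIT ?, ?")) {
        ps.setLong(1, offset);
        ps.setInt(2, batchSize);
        try (ResultSet rows = ps.executeQuery()) {
            while (rows.next()) {
                // process the row, e.g. write it to the file
            }
        }
    }
}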
The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session once you are done writing each object to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.
Rather than sample code, I propose a query template based on Hibernate that does this workaround for you (pagination, scrolling and clearing the Hibernate session).
It can also easily be adapted to use an EntityManager.
I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to, based on JDBC's dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE), and from searching around it seems other people have used it. Make sure it's not caching the Person objects in the session; I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure, or test with a SQL query. Also play around with setFetchSize to optimize the number of trips to the server.
Recently I worked on a problem like this and wrote a blog post about how to face it; I hope it is helpful to someone.
I use a lazy-list approach with partial acquisition: I replaced the query's LIMIT/OFFSET pagination with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporal table":
create or replace function load_records()
returns VOID as $$
BEGIN
    drop sequence if exists temp_seq;
    create temp sequence temp_seq;
    insert into tmp_table
    SELECT linea.*
    FROM (
        select nextval('temp_seq') as ROWNUM, * from table1 t1
        join table2 t2 on (t2.fieldpk = t1.fieldpk)
        join table3 t3 on (t3.fieldpk = t2.fieldpk)
    ) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get() method can use a data access interface to fetch the next chunk of data and release the memory already consumed:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        lastIndexRoulette = index;
        bufferParcial.removeAll(bufferParcial);
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other side, the data access interface uses the query to paginate and implements a method to iterate progressively, 25,000 records at a time, until all of them are consumed.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
Another option if you're "running out of RAM" is to request, say, just one column instead of the entire object: How to use hibernate criteria to return only one element of an object instead the entire object? (it also saves a lot of CPU processing time).
For me it worked properly only when setting useCursors=true; otherwise the scrollable result set ignores the fetch size. In my case it was 5000, but the scrollable result set fetched millions of records at once, causing excessive memory usage. The underlying DB is MSSQL Server.
jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true
