Read 3 million records with hibernate - java

I'm a noob with Hibernate and I have to read 2 million records from a DB2 z/OS database with Hibernate in Java (over JDBC).
My problem is that I run into an OutOfMemoryError after about 150,000 records.
I've heard about batching etc., but I only find solutions for inserting new records. What I want to do is read these records into an ArrayList for further use.
So I'm actually selecting just one column per row to reduce the amount of data:
getEntityManager().createQuery("select t.myNumber from myTable t").getResultList();
It would also be interesting to know whether there is a better way to read such a huge number of records (maybe without Hibernate?).

The following is one way to do batch processing with Hibernate. Keep in mind this is not 100% tested; it's pseudo logic.
int i = 0;
int batch = 100;
// assuming myNumber maps to a Long
List<Long> numList = getEntityManager().createQuery("select t.myNumber from myTable t")
        .setFirstResult(i).setMaxResults(batch).getResultList();
while (!numList.isEmpty()) {      // also handles a final batch smaller than 'batch'
    // process numList
    i += batch;
    numList = getEntityManager().createQuery("select t.myNumber from myTable t")
            .setFirstResult(i).setMaxResults(batch).getResultList();
}
Hibernate documentation for setFirstResult() and setMaxResults()

A better approach is to use a StatelessSession (no dealing with the cache) and bulk operations with ScrollableResults:
StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
try {
    ScrollableResults scrollableResults = statelessSession.createQuery("from Entity")
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (scrollableResults.next()) {
        if (++count % 100 == 0) {
            System.out.println("Fetched " + count + " entities");
        }
        Entity entity = (Entity) scrollableResults.get()[0];
        // process and write the result
    }
    scrollableResults.close();
} finally {
    statelessSession.close();
}

You should not load all the records into memory but process them in batches, e.g. loop over 1000 records at a time using
createQuery(...).setFirstResult(i*1000).setMaxResults(1000);
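A minimal sketch of that loop (the entity and property names come from the question; the process() call and the choice to clear the persistence context each round are assumptions):
// Hypothetical sketch: page through the result 1000 rows at a time and
// process each page immediately instead of collecting everything in one list.
int pageSize = 1000;
for (int page = 0; ; page++) {
    List<Long> numbers = getEntityManager()
            .createQuery("select t.myNumber from myTable t", Long.class)
            .setFirstResult(page * pageSize)
            .setMaxResults(pageSize)
            .getResultList();
    if (numbers.isEmpty()) {
        break;                        // no more rows
    }
    process(numbers);                 // hypothetical processing step
    getEntityManager().clear();       // release references so the GC can reclaim memory
}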

You have found the upper limit of your heap. Have a look here to learn how to properly size your heap:
Increase heap size in Java
However, I cannot imagine why you would need to have a List of 3 million records in memory. Perhaps with more information we could find an alternative solution to your algorithm?

Yes, of course.
You may use Apache Hadoop for a large project like this. It is open-source software for reliable, scalable, distributed computing, designed to scale up from single servers to thousands of machines.
hadoop.apache.org

This is basically a design question for the problem you are working on. Forget Hibernate; even if you were doing the same thing in plain JDBC you would hit the memory issue, maybe a bit later. Loading such a huge amount of data and keeping it in memory is not suitable for applications requiring short request-response cycles, and it is not good for scalability either. As others have suggested, you can try batching or paging behaviour, or if you want to be more exotic you can try parallel processing via a distributed data grid (like Infinispan) or a map-reduce framework like Hadoop.
Going by the description of your problem, it seems that you need to keep the data around in memory.
If you must keep the huge data set in memory, then you can query the data in batches and keep storing it in a distributed cache (like Infinispan) which can span multiple JVMs on a single machine or multiple machines forming a cluster. This way your data will reside partially on each node; here Infinispan is used as a distributed cache.
There are frameworks like Spring Batch which solve such problems by dividing the work into chunks (batches) and then processing them one by one. It even has built-in JPA-based readers and writers which perform this work in batches, as sketched below.
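For illustration, a minimal Spring Batch style sketch of such a JPA-based reader (the entity name MyEntity and the injected EntityManagerFactory are assumptions; the reader would then be plugged into a chunk-oriented step with a processor and writer):
import javax.persistence.EntityManagerFactory;
import org.springframework.batch.item.database.JpaPagingItemReader;

// Hypothetical sketch: Spring Batch pages through the table 1000 rows at a
// time instead of loading everything into one list.
public JpaPagingItemReader<MyEntity> myEntityReader(EntityManagerFactory emf) throws Exception {
    JpaPagingItemReader<MyEntity> reader = new JpaPagingItemReader<>();
    reader.setEntityManagerFactory(emf);
    reader.setQueryString("select e from MyEntity e");
    reader.setPageSize(1000);        // rows fetched per page
    reader.afterPropertiesSet();     // validates the configuration
    return reader;
}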

Related

Is SQL IN Query better for performance or Java method ContainsAll

I have a scenario where a user will select a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies other conditions. Should I use a complex Oracle SQL query with a composite IN (id, column) to validate it, OR
should I fetch the data for this user satisfying the conditions into application memory and use List.containsAll, by first getting all the data (with all the other conditions) for this particular user, populating it in a dbList, and then validating dbList.containsAll(inputList)?
Which one will perform better: a composite IN sending the bulk input to the DB, versus fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment; it takes around 70-90 seconds, which is too slow. It might be better in prod, but I still feel the DB has to sort through a huge amount of data even though it is indexed by user ID.
In the DB I am using count(*) with IN like below:
SQL query:
select count(*) from user_table where user_id='X123' and <X conditions> and user_input IN (
    ('id','12344556'),
    ('id','789954334'),
    ('id','343432443'),
    ('id','455543545')
    -- ... 50k entries
);
There are also other AND conditions for validating that the user_input values are valid entries.
Sample Java code:
String getConditionedQuery = "select user_backend_id from user_table where user_id='X123' AND <X complex conditions>";
List<String> userInputList = request.getInputList();
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
Boolean validData = userDBList.containsAll(userInputList);
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments; however, the Java containsAll approach looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because bulk-inserting the input into the DB is again a hassle. I am using the ATG framework and the module is RESTful, so performance matters most here.
I personally believe that you should apply all filters on the database side, for many reasons. First, exchanging that much data over the network consumes unnecessary bandwidth. Second, bringing all that data into the JVM and processing it consumes more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give him the query and ask him to run an analysis. The analysis will tell you whether you need to add any indexes to optimise the query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod: although the PROD machines are much faster, the amount of data in PROD is much higher than in SIT, so it will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Plus, the JVM's heap memory is much smaller than the database's memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged. For example, if your application is in the cloud and the database is on premises, imagine the amount of data you will move back and forth just to filter out 10 rows from a million.
I recommend that you write a good query, optimise it and process as many conditions as possible on the database side. Hope it helps!
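For what it's worth, a minimal sketch of pushing the check to the database in chunks rather than one 50K-entry IN list (plain JDBC; the table and column names follow the question, the 1000-entry chunk matches Oracle's IN-list limit, Java 8's String.join is used for brevity, and the input is assumed to contain no duplicates):
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: count matches per chunk of 1000 inputs; the input is
// valid only if every entry matched for this user.
public static boolean belongsToUser(Connection conn, String userId, List<String> inputs) throws Exception {
    int chunkSize = 1000;
    long matched = 0;
    for (int from = 0; from < inputs.size(); from += chunkSize) {
        List<String> chunk = inputs.subList(from, Math.min(from + chunkSize, inputs.size()));
        String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
        // add the other X conditions to the where clause as needed
        String sql = "select count(*) from user_table where user_id = ? and user_input in (" + placeholders + ")";
        PreparedStatement ps = conn.prepareStatement(sql);
        try {
            ps.setString(1, userId);
            for (int i = 0; i < chunk.size(); i++) {
                ps.setString(i + 2, chunk.get(i));
            }
            ResultSet rs = ps.executeQuery();
            rs.next();
            matched += rs.getLong(1);
        } finally {
            ps.close();
        }
    }
    return matched == inputs.size();   // every input entry was found for this user
}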
In general it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, it is generally well optimised and can work over large amounts of data faster than you can in application code.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.

Cassandra terminates connection in the middle?

I'm using the Cassandra driver for Java from DataStax. I know that I have 20 million rows in one table. When I use
Select * from table
The process stops after around 800000 rows have been fetched.
In my Java code
futureResults = session.executeAsync(statement);
ResultSet results = futureResults.getUninterruptibly();
for (Row row : results) {
}
Maybe I did something wrong?
What you are doing there is a fairly common anti-pattern with Cassandra. Since each partition of data lives in different parts of your cluster, that query will create a massive scatter/gather, centered around one coordinator. Eventually things start timing out and the coordinator will throw an error. A quick look in the logs should find it.
Almost always, a select query should include a partition key for locality. If that's not possible, switching to a batch-oriented approach that efficiently scans each node is best. The Spark connector for Cassandra is perfect for an access pattern like this.
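Short of Spark, a minimal sketch of at least letting the driver page through the table instead of buffering it behind one async call (DataStax Java driver 3.x style; the fetch size of 1000 is an arbitrary choice):
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;

// Hypothetical sketch: with automatic paging the driver fetches 1000 rows per
// round trip as the iteration advances, instead of materialising 20M rows.
Statement statement = new SimpleStatement("SELECT * FROM table");
statement.setFetchSize(1000);              // rows per page
ResultSet results = session.execute(statement);
for (Row row : results) {
    // process the row; the next page is fetched on demand
}
This still performs the full scatter/gather scan described above, but at least it avoids holding the whole result set in memory at once.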

MySql slow with concurrent queries

I'm running queries in parallel against a MySQL database. Each query takes less than a second to execute and another half a second to a second to fetch the results.
That is acceptable for me. But when I run 10 of these queries in parallel and then attempt another set in a different session, everything slows down and a single query can take 20-plus seconds.
My ORM is Hibernate and I'm using C3P0 with <property name="hibernate.c3p0.max_size">20</property>. I'm sending the queries in parallel using Java threads. But I don't think these are related, because the slowdown also happens when I run the queries in MySQL Workbench. So I'm assuming something in my MySQL config is missing, or the machine is not fast enough.
This is the query:
select *
from schema.table
where site = 'sitename'
  and (description like '% family %' or title like '% family %')
limit 100 offset 0;
How can I make this go faster when facing let's say 100 concurrent queries?
I'm guessing that this is slow because the where clause is doing a wildcard search (LIKE with a leading %) on the description and title columns; this forces the database to scan the entire field on every record, and that's never going to scale.
Each of those 10 concurrent queries must read every row in the table to fulfill the query. If you have a bottleneck anywhere in the system - disk I/O, memory, CPU - you may not hit that bottleneck with a single query, but you do hit it with 10 concurrent queries. OS- and database-level monitoring tools can tell you which bottleneck you're hitting.
Most of the time, those bottlenecks (CPU, memory, disk) are too expensive to upgrade - especially if you need to scale to 100 concurrent queries. So it's better to optimize the query/ORM approach.
I'd consider using Hibernate's built-in free-text capability here - it requires some additional configuration, but works MUCH better when looking for arbitrary strings in a textual field.
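A different route, not the one the answer above recommends but simpler to sketch, is a MySQL FULLTEXT index queried natively through Hibernate (table and column names follow the question; the index name is made up, and InnoDB FULLTEXT needs MySQL 5.6+):
// Hypothetical sketch: a FULLTEXT index lets MATCH ... AGAINST replace the
// leading-wildcard LIKEs, which can never use an index.
// DDL, run once:  ALTER TABLE schema.table ADD FULLTEXT INDEX ft_desc_title (description, title);
List<?> rows = session.createSQLQuery(
        "select * from schema.table "
      + "where site = :site "
      + "and match(description, title) against (:term in natural language mode) "
      + "limit 100")
    .setParameter("site", "sitename")
    .setParameter("term", "family")
    .list();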

Splitting Long running SQL query into multiple smaller queries

I am using SQL Server 2008 and Java 6 / Spring jdbc.
We have a table with a record count of ~60 million.
We need to load this entire table into memory, but firing a select * on this table takes hours to complete.
So I am splitting the query as below
String query = " select * from TABLE where ";
for (int i = 0; i < 10; i++) {
    StringBuilder builder = new StringBuilder(query).append(" (sk_table_id % 10) = ").append(i);
    service.submit(new ParallelCacheBuilder(builder.toString(), namedParameters, jdbcTemplate));
}
Basically, I am splitting the query by adding a WHERE condition on the primary key column;
the above code snippet splits the query into 10 queries running in parallel. This uses Java's ExecutorCompletionService.
I am not a SQL expert, but I guess the above queries will each need to scan the same data before applying the modulo operator on the primary key column.
Is this a good/bad/best/worst way? Is there any other way? Please post.
Thanks in advance!
If you do need all 60M records in memory, select * from ... is the fastest approach. Yes, it's a full scan; there's no way around it. It's disk-bound, so multithreading won't help you any. Not having enough memory available (swapping) will kill performance instantly. Data structures that take significant time to expand will hamper performance, too.
Open the Task Manager and see how much CPU is spent; probably little. If not, profile your code, or just comment out everything but the reading loop. Or maybe it's a bottleneck in the network between the SQL Server machine and yours.
Maybe SQL Server can offload data faster to an external dump file of known format using some internal pathway (e.g. Oracle can). I'd explore the possibility of dumping the table into a file and then parsing that file in your application; it could be faster, e.g. because it won't interfere with other queries the SQL Server is serving at the same time.
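If the rows do have to come through JDBC, a minimal plain-JDBC sketch of streaming the full scan instead of buffering it (the DataSource is assumed to be the one behind the jdbcTemplate; try-with-resources needs Java 7+, so adapt to finally blocks on Java 6; whether the driver really streams depends on settings such as responseBuffering=adaptive for the Microsoft driver or useCursors=true for jTDS):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.sql.DataSource;

// Hypothetical sketch: a forward-only, read-only cursor with a modest fetch
// size reads the table row by row instead of materialising all 60M rows.
void streamTable(DataSource dataSource) throws Exception {
    try (Connection conn = dataSource.getConnection();
         Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                               ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(10000);                       // rows per round trip
        try (ResultSet rs = stmt.executeQuery("select * from TABLE")) {
            while (rs.next()) {
                // read the columns you need and feed them to the cache builder
            }
        }
    }
}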

Using Hibernate's ScrollableResults to slowly read 90 million records

I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
    storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..
Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable result set would only transfer rows to the client on an as-required basis. Unfortunately MySQL Connector/J actually fakes it: it executes the entire query and transports it to the client, so the driver has the entire result set loaded in RAM and will drip-feed it to you (evidenced by your out-of-memory problems). You had the right idea; it's just a shortcoming of the MySQL Java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 is the best you're going to get unless you break out of MySQL Connector/J. Though there's no reason you can't raise the limit on the query. Provided you have enough RAM to hold the index, this should be a somewhat cheap operation. I'd modify it slightly to grab a batch at a time, and use the highest id of that batch to grab the next batch.
Note: this will only work if other_conditions use equality (no range conditions allowed) and have the last column of the index as id.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>
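A minimal sketch of that keyset-style loop in Java (the entity name and storeInFile come from the question; <other_conditions>, the batch size and getId() are placeholders):
// Hypothetical sketch: page by "id greater than the last id seen" instead of
// setFirstResult, so each batch is an index range scan rather than an offset scan.
int batchSize = 10000;
long lastId = -1;
while (true) {
    List<Person> batch = session.createQuery(
            "from Person p where p.id > :lastId and <other_conditions> order by p.id asc")
        .setParameter("lastId", lastId)
        .setMaxResults(batchSize)
        .list();
    if (batch.isEmpty()) {
        break;
    }
    for (Person person : batch) {
        storeInFile(person);
    }
    lastId = batch.get(batch.size() - 1).getId();
    session.clear();                 // drop the batch from the first-level cache
}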
You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(hql);   // hql: the HQL string for your query
query.setReadOnly(true);
// MIN_VALUE gives a hint to the JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
    Object row = results.get();
    // process row then release reference
    // you may need to evict() as well
}
results.close();
This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
Set the fetch size in the query to an optimal value, as shown below.
Also, when caching is not required, it may be better to use a StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setFetchSize(1000)   // <<--- !!!!
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
With MySQL Connector/J the fetch size must be Integer.MIN_VALUE, otherwise it won't work. This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing.
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.
The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session after you are done writing each object to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.
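A minimal sketch of that eviction idea (the query and storeInFile come from the question; the periodic clear() interval is an arbitrary choice):
// Hypothetical sketch: evict each Person after writing it so the Session
// never accumulates 90M managed entities.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true)
        .setCacheable(false)
        .scroll(ScrollMode.FORWARD_ONLY);
int count = 0;
while (results.next()) {
    Person person = (Person) results.get()[0];
    storeInFile(person);
    session.evict(person);           // release it from the first-level cache
    if (++count % 1000 == 0) {
        session.clear();             // drop anything still lingering
    }
}
results.close();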
Rather than just sample code, I propose a query template based on Hibernate that does this workaround for you (pagination, scrolling and clearing the Hibernate session).
It can also easily be adapted to use an EntityManager.
I've used the Hibernate scroll functionality successfully before without it reading the entire result set into memory. Someone said that MySQL does not do true scroll cursors, but it claims to, based on JDBC's dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE), and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure, or test with a SQL query. Also play around with setFetchSize to optimize the number of trips to the server.
Recently I worked on a problem like this and wrote a blog post about how to face it; I hope it is helpful to someone.
I use a lazy-list approach with partial acquisition: I replaced the query's limit/offset pagination with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records()
returns VOID as $$
BEGIN
    drop sequence if exists temp_seq;
    create temp sequence temp_seq;
    insert into tmp_table
    SELECT linea.*
    FROM (
        select nextval('temp_seq') as ROWNUM, *
        from table1 t1
        join table2 t2 on (t2.fieldpk = t1.fieldpk)
        join table3 t3 on (t3.fieldpk = t2.fieldpk)
    ) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, i.e. a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release the memory heap:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        lastIndexRoulette = index;
        bufferParcial.clear();
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses a query to paginate and implements a method to iterate progressively, 25,000 records at a time, until everything has been consumed.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
Another option if you're "running out of RAM" is to request just one column instead of the entire object (see How to use hibernate criteria to return only one element of an object instead the entire object?), which saves a lot of CPU processing time to boot.
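A minimal sketch of that projection idea combined with a scroll (assuming only the id and a name column are actually needed for the file; both column names are illustrative):
// Hypothetical sketch: select only the needed columns; each row is a plain
// Object[] instead of a fully managed Person entity.
ScrollableResults results = session.createQuery("select p.id, p.name from Person p")
        .setReadOnly(true)
        .scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    Object[] row = results.get();
    Long id = (Long) row[0];
    String name = (String) row[1];
    // write id and name to the file
}
results.close();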
For me it worked properly only when setting useCursors=true; otherwise the scrollable result set ignored all fetch-size settings (in my case 5000) and fetched millions of records at once, causing excessive memory usage. The underlying DB is MS SQL Server.
jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true
