I'm using the DataStax Cassandra driver for Java. I have 20 million rows in one table. When I run
Select * from table
the process stops after around 800,000 rows have been fetched.
In my Java code:
ResultSetFuture futureResults = session.executeAsync(statement);
ResultSet results = futureResults.getUninterruptibly();
for (Row row : results) {
    // process row
}
Did I do something wrong?
What you are doing there is a fairly common anti-pattern with Cassandra. Since each partition of data lives in different parts of your cluster, that query will create a massive scatter/gather, centered around one coordinator. Eventually things start timing out and the coordinator will throw an error. A quick look in the logs should find it.
Almost always, a select query should include a partition key for locality. If that's not possible, switching to a batch approach that efficiently scans each node is better. The Spark connector for Cassandra is a good fit for an access pattern like this.
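If you really must walk the whole table from Java, driver-side paging at least keeps memory bounded and spreads the load over time. A minimal sketch, assuming the DataStax Java driver 2.x with automatic paging (the table name and page size are placeholders):
Statement stmt = new SimpleStatement("SELECT * FROM table");
stmt.setFetchSize(1000);                   // rows per page, tune as needed
ResultSet results = session.execute(stmt);
for (Row row : results) {
    // process row; the driver transparently fetches the next page as you iterate
}
Even with paging, a full-table scan still funnels every page through a coordinator, so a partition-key-restricted query or the Spark connector remains the better option at this scale.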
Related
I have a cluster of three Cassandra nodes with more or less default configuration. On top of that, I have a web layer consisting of two nodes for load balancing, both web nodes querying Cassandra all the time. After some time, with the data stored in Cassandra becoming non-trivial, one and only one of the web nodes started getting ReadTimeoutException on a specific query. The web nodes are identical in every way.
The query is very simple (? is placeholder for date, usually a few minutes before the current moment):
SELECT * FROM table WHERE time > ? LIMIT 1 ALLOW FILTERING;
The table is created with this query:
CREATE TABLE table (
user_id varchar,
article_id varchar,
time timestamp,
PRIMARY KEY (user_id, time));
CREATE INDEX articles_idx ON table(article_id);
When it times out, the client waits a bit more than 10s, which, not surprisingly, is the timeout configured in cassandra.yaml for most connections and reads.
There are a couple of things that are baffling me:
the query only times out when one particular web node executes it - one node always fails, the other always succeeds.
the query returns instantaneously when I run it from cqlsh (although it seems it only hits one node when I run it from there)
there are other queries which take 2-3 minutes (a lot longer than the 10s timeout) that do not time out at all
I cannot trace the query in Java because it times out. Tracing the query in cqlsh didn't provide much insight. I'd rather not change the Cassandra timeouts as this is a production system and I'd like to exhaust non-invasive options first. The Cassandra nodes all have plenty of heap, their heap is far from full, and GC times seem normal.
Any ideas/directions will be much appreciated, I'm totally out of ideas. Cassandra version is 2.0.2, using com.datastax.cassandra:cassandra-driver-core:2.0.2 Java client.
A few things I noticed:
While you are using time as a clustering key, it doesn't really help you because your query is not restricting by your partition key (user_id). Cassandra only orders by clustering keys within a partition. So right now your query is pulling back the first row which satisfies your WHERE clause, ordered by the hashed token value of user_id. If you really do have tens of millions of rows, then I would expect this query to pull back data from the same user_id (or same select few) every time.
"although it seems it only hits one node when I run it from there" Actually, your queries should only hit one node when you run them. Introducing network traffic into a query makes it really slow. I think the default consistency in cqlsh is ONE. This is where Carlo's idea comes into play.
What is the cardinality of article_id? Remember, secondary indexes work the best on "middle-of-the-road" cardinality. High (unique) and low (boolean) are both bad.
The ALLOW FILTERING clause should not be used in (production) application-side code. Like ever. If you have 50 million rows in this table, then ALLOW FILTERING is first pulling all of them back, and then trimming down the result set based on your WHERE clause.
Suggestions:
Carlo might be on to something with the suggestion of trying a different (lower) consistency level. Try setting a consistency level of ONE in your application and see if that helps.
Either perform an ALLOW FILTERING query, or a secondary index query. They both suck, but definitely do not do both together. I would not use either. But if I had to pick, I would expect a secondary index query to suck less than an ALLOW FILTERING query.
To solve this adequately at the scale you are describing, I would duplicate the data into a query table. It looks like you are concerned with organizing time-sensitive data and getting the most recent data, so a query table like this should do it:
CREATE TABLE tablebydaybucket (
user_id varchar,
article_id varchar,
time timestamp,
day_bucket varchar,
PRIMARY KEY (day_bucket , time))
WITH CLUSTERING ORDER BY (time DESC);
Populate this table with your data, and then this query will work:
SELECT * FROM tablebydaybucket
WHERE day_bucket='20150519' AND time > '2015-05-19 15:38:49-0500' LIMIT 1;
This will partition your data by day_bucket, and cluster your data by time. This way, you won't need ALLOW FILTERING or a secondary index. Also your query is guaranteed to hit only one node, and Cassandra will not have to pull all of your rows back and apply your WHERE clause after-the-fact. And clustering on time in DESCending order helps your most-recent rows come back quicker.
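On the application side, something along these lines should issue that query at consistency ONE; a rough sketch assuming the DataStax Java driver 2.x, with the bucket value and the five-minute window as placeholder assumptions:
PreparedStatement ps = session.prepare(
    "SELECT * FROM tablebydaybucket WHERE day_bucket = ? AND time > ? LIMIT 1");
Date fiveMinutesAgo = new Date(System.currentTimeMillis() - 5 * 60 * 1000L);
BoundStatement bound = ps.bind("20150519", fiveMinutesAgo);
bound.setConsistencyLevel(ConsistencyLevel.ONE);   // per the consistency suggestion above
Row latest = session.execute(bound).one();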
I'm a noob in Hibernate and I have to read 2 million records from a DB2 z/OS database with Hibernate in Java (over JDBC).
My problem is that I run out of memory (OutOfMemoryError) after about 150,000 records.
I've heard about batching etc., but I only find solutions for actually inserting new records. What I want to do is to read these records into an ArrayList for further usage.
So I'm actually just selecting a single column to reduce the data:
getEntityManager().createQuery("select t.myNumber from myTable t").getResultList();
Also, it would be interesting to know if there is a better way to read such a huge number of records. (Maybe without Hibernate?)
The following is one way to do batch processing using Hibernate. Keep in mind this is not 100% tested; it is more a sketch than finished code.
int i = 0;
final int batchSize = 100;
List<?> numList;
do {
    numList = getEntityManager()
            .createQuery("select t.myNumber from myTable t")
            .setFirstResult(i)
            .setMaxResults(batchSize)
            .getResultList();
    // process numList here; the last batch may be smaller than batchSize
    i += batchSize;
} while (numList.size() == batchSize);
Hibernate documentation for setFirstResult() and setMaxResults()
A better approach is to use a StatelessSession (no session cache to deal with) and bulk operations via the ScrollableResults API:
StatelessSession statelessSession = sessionFactory.openStatelessSession(connection);
try {
    ScrollableResults scrollableResults = statelessSession
            .createQuery("from Entity")
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (scrollableResults.next()) {
        Entity entity = (Entity) scrollableResults.get()[0];
        // process and write result
        if (++count % 100 == 0) {
            System.out.println("Fetched " + count + " entities");
        }
    }
} finally {
    statelessSession.close();
}
You should not load all records into memory but process them in batches, e.g. loop over 1000 records at a time using
createQuery(...).setFirstResult(i*1000).setMaxResults(1000);
You have found the upper limit of your heap. Have a look here to learn how to properly size your heap:
Increase heap size in Java
However, I cannot imagine why you would need to have a List of 3 million records in memory. Perhaps with more information we could find an alternative solution to your algorithm?
Yes, of course.
You may use Apache Hadoop for a large project like this. It is open-source software for reliable, scalable, distributed computing, designed to scale up from single servers to thousands of machines.
hadoop apache
This is basically a design question for the problem you are working on. Forget Hibernate; even if you do the same thing in plain JDBC you will hit the memory issue, maybe a bit later. The idea of loading such huge data and keeping it in memory is not suitable for applications requiring short request-response cycles, and it is not good for scalability either. As others have suggested you can try the batch or paging behaviour, or if you want to be more exotic you can try parallel processing via a distributed data grid (like Infinispan) or a map-reduce framework like Hadoop.
Going by the description of your problem it seems that you need to keep the data around in memory.
If you must keep the huge data around in memory then you can query the data in batches and keep storing it in a distributed cache (like Infinispan) which can span multiple JVMs on a single machine, or multiple machines forming a cluster. This way your data will reside partially on each node; here Infinispan is used as a distributed cache.
There are frameworks like Spring Batch which take the route of solving such problems by dividing the work into chunks (batches) and then processing them one by one. Spring Batch even has built-in JPA-based readers and writers which perform this work in batches.
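As a rough illustration of what a chunk-oriented Spring Batch step looks like (a sketch only; the step name, item type and reader/writer beans are assumptions, and the exact builder API depends on your Spring Batch version):
@Bean
public Step exportStep(StepBuilderFactory steps,
                       ItemReader<MyRow> reader,
                       ItemWriter<MyRow> writer) {
    return steps.get("exportStep")
            .<MyRow, MyRow>chunk(1000)   // read/process/write 1000 items per transaction
            .reader(reader)
            .writer(writer)
            .build();
}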
I am trying to write a database-independent application with JDBC. I now need a way to fetch the top N entries out of some table. I saw there is a setMaxRows method in JDBC, but I don't feel comfortable using it, because I am scared the database will push out all results and only the JDBC driver will reduce the result set. If I need the top 5 results in a table with a billion rows this will break my neck (the table has a usable index).
Writing special SQL-statements for every kind of database isn't very nice, but will let the database do clever query planning and stop fetching more results than necessary.
Can I rely on setMaxRows to tell the database not to do too much work?
I guess in the worst case I can't rely on this working the way I hope. I'm mostly interested in Postgres 9.1 and Oracle 11.2, so if someone has experience with these databases, please step forward.
will let the database do clever query planning and stop fetching more results than necessary.
If you use
PostgreSQL:
SELECT * FROM tbl ORDER BY col1 LIMIT 10; -- slow without index
Or:
SELECT * FROM tbl LIMIT 10; -- fast even without index
Oracle:
SELECT *
FROM (SELECT * FROM tbl ORDER BY col1 DESC)
WHERE ROWNUM <= 10;
.. then only 10 rows will be returned. But if you sort your rows before picking the top 10, basically all qualifying rows will be read before they can be sorted.
Matching indexes can prevent this overhead!
If you are unsure what JDBC actually sends to the database server, run a test and have the database engine log the statements it receives. In PostgreSQL you can set this in postgresql.conf:
log_statement = all
(and reload) to log all statements sent to the server. Be sure to reset that setting after the test or your log files may grow huge.
The thing which could/may kill you with billion(s) of rows is the (highly likely) ORDER BY clause in your query. If this order cannot be established using an index then . . . it'll break your neck :)
I would not depend on the JDBC driver here. As a previous comment suggests, it's unclear what it really does across different RDBMSs.
If you are concerned about the speed of your query you can use a LIMIT clause as well. If you use LIMIT you can at least be sure that it's passed on to the DB server.
Edit: Sorry, I was not aware that Oracle doesn't support LIMIT.
In direct answer to your question regarding PostgreSQL 9.1: Yes, the JDBC driver will tell the server to stop generating rows beyond what you set.
As others have pointed out, depending on indexes and the plan chosen, the server might scan a very large number of rows to find the five you want. Proper server configuration can help accurately model the costs to prevent this, but if the value distribution is unusual you may need to introduce an optimization barrier (like a CTE) to coerce the planner into producing a good plan.
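To see both options side by side, a minimal plain-JDBC sketch (PostgreSQL LIMIT syntax, placeholder table/column names, conn assumed to be an open Connection; whether setMaxRows actually reaches the planner depends on the driver):
// 1) Driver-level cap: ask for at most 5 rows via the JDBC API.
try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM tbl ORDER BY col1")) {
    ps.setMaxRows(5);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) { /* process row */ }
    }
}
// 2) Explicit LIMIT: guaranteed to be seen by the server's planner.
try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM tbl ORDER BY col1 LIMIT 5");
     ResultSet rs = ps.executeQuery()) {
    while (rs.next()) { /* process row */ }
}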
I simply need to read each row in a table in my MySQL database using Hibernate and write a file based on it. But there are 90 million rows and they are pretty big. So it seemed like the following would be appropriate:
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
while (results.next())
storeInFile(results.get()[0]);
The problem is the above will try and load all 90 million rows into RAM before moving on to the while loop... and that will kill my memory with OutOfMemoryError: Java heap space exceptions :(.
So I guess ScrollableResults isn't what I was looking for? What is the proper way to handle this? I don't mind if this while loop takes days (well I'd love it to not).
I guess the only other way to handle this is to use setFirstResult and setMaxResults to iterate through the results and just use regular Hibernate results instead of ScrollableResults. That feels like it will be inefficient though and will start taking a ridiculously long time when I'm calling setFirstResult on the 89 millionth row...
UPDATE: setFirstResult/setMaxResults doesn't work, it turns out to take an unusably long time to get to the offsets like I feared. There must be a solution here! Isn't this a pretty standard procedure?? I'm willing to forgo Hibernate and use JDBC or whatever it takes.
UPDATE 2: the solution I've come up with which works ok, not great, is basically of the form:
select * from person where id > <offset> and <other_conditions> limit 1
Since I have other conditions, even all in an index, it's still not as fast as I'd like it to be... so still open for other suggestions..
Using setFirstResult and setMaxResults is your only option that I'm aware of.
Traditionally a scrollable result set would only transfer rows to the client on an as-required basis. Unfortunately MySQL Connector/J actually fakes it: it executes the entire query and transports the result to the client, so the driver has the entire result set loaded in RAM and will drip-feed it to you (evidenced by your out-of-memory problems). You had the right idea; it's just a shortcoming of the MySQL Java driver.
I found no way to get around this, so went with loading large chunks using the regular setFirst/max methods. Sorry to be the bringer of bad news.
Just make sure to use a stateless session so there's no session level cache or dirty tracking etc.
EDIT:
Your UPDATE 2 is the best you're going to get unless you break out of MySQL Connector/J. There's no reason you can't raise the limit on the query, though. Provided you have enough RAM to hold the index, this should be a fairly cheap operation. I'd modify it slightly: grab a batch at a time, and use the highest id of that batch to grab the next batch (a rough JDBC sketch follows the query below).
Note: this will only work if other_conditions use equality (no range conditions allowed) and the last column of the index is id.
select *
from person
where id > <max_id_of_last_batch> and <other_conditions>
order by id asc
limit <batch_size>
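A rough sketch of that keyset-style loop in plain JDBC (table/column names are placeholders, other_conditions are omitted for brevity, and conn is assumed to be an open Connection):
long lastId = 0;              // highest id seen so far
final int batchSize = 10000;  // tune to available RAM
boolean more = true;
while (more) {
    more = false;
    try (PreparedStatement ps = conn.prepareStatement(
            "select * from person where id > ? order by id asc limit ?")) {
        ps.setLong(1, lastId);
        ps.setInt(2, batchSize);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                more = true;
                lastId = rs.getLong("id");
                // process the row / write it to the file
            }
        }
    }
}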
You should be able to use a ScrollableResults, though it requires a few magic incantations to get working with MySQL. I wrote up my findings in a blog post (http://www.numerati.com/2012/06/26/reading-large-result-sets-with-hibernate-and-mysql/) but I'll summarize here:
"The [JDBC] documentation says:
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
This can be done using the Query interface (this should work for Criteria as well) in version 3.2+ of the Hibernate API:
Query query = session.createQuery(hql);  // hql is your query string
query.setReadOnly(true);
// MIN_VALUE gives hint to JDBC driver to stream results
query.setFetchSize(Integer.MIN_VALUE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
// iterate over results
while (results.next()) {
Object row = results.get();
// process row then release reference
// you may need to evict() as well
}
results.close();
This allows you to stream over the result set, however Hibernate will still cache results in the Session, so you’ll need to call session.evict() or session.clear() every so often. If you are only reading data, you might consider using a StatelessSession, though you should read its documentation beforehand."
Set the fetch size on the query to an optimal value, as shown below.
Also, when caching is not required, it may be better to use StatelessSession.
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
.setReadOnly(true)
.setFetchSize( 1000 ) // <<--- !!!!
.setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
With MySQL, the fetch size must be Integer.MIN_VALUE, otherwise it won't work.
This is taken literally from the official reference: https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
Actually you could have gotten what you wanted -- low-memory scrollable results with MySQL -- if you had used the answer mentioned here:
Streaming large result sets with MySQL
Note that you will have problems with Hibernate lazy-loading because it will throw an exception on any queries performed before the scroll is finished.
With 90 million records, it sounds like you should be batching your SELECTs. I've done this with Oracle when doing the initial load into a distributed cache. Looking at the MySQL documentation, the equivalent seems to be using the LIMIT clause: http://dev.mysql.com/doc/refman/5.0/en/select.html
Here's an example:
SELECT * from Person
LIMIT 200, 100
This would return rows 201 through 300 of the Person table.
You'd need to get the record count from your table first and then divide it by your batch size and work out your looping and LIMIT parameters from there.
The other benefit of this would be parallelism - you can execute multiple threads in parallel on this for faster processing (a rough sketch follows this answer).
Processing 90 million records also doesn't sound like the sweet spot for using Hibernate.
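A rough sketch of that count/LIMIT batching with a small thread pool (plain JDBC; the pool size, batch size, conn and dataSource are all assumptions to adapt):
final int batchSize = 100000;
ExecutorService pool = Executors.newFixedThreadPool(4);   // assumption: 4 workers

long total;
try (Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM Person")) {
    rs.next();
    total = rs.getLong(1);
}
for (long offset = 0; offset < total; offset += batchSize) {
    final long start = offset;
    pool.submit(() -> {
        // each worker uses its own connection
        try (Connection c = dataSource.getConnection();
             PreparedStatement ps = c.prepareStatement("SELECT * FROM Person LIMIT ?, ?")) {
            ps.setLong(1, start);
            ps.setInt(2, batchSize);
            try (ResultSet rows = ps.executeQuery()) {
                while (rows.next()) { /* process row */ }
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    });
}
pool.shutdown();
Keep in mind that large OFFSETs get progressively slower in MySQL, which is exactly what the question's UPDATE ran into, so the keyset approach (id > last_id) shown earlier tends to scale better.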
The problem could be that Hibernate keeps references to all objects in the session until you close the session. That has nothing to do with query caching. Maybe it would help to evict() the objects from the session after you are done writing each object to the file. If they are no longer referenced by the session, the garbage collector can free the memory and you won't run out of memory anymore.
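For example, inside the scroll loop from the question, something like this (a sketch; the 1000-row clear interval is an arbitrary choice):
ScrollableResults results = session.createQuery("SELECT person FROM Person person")
        .setReadOnly(true).setCacheable(false).scroll(ScrollMode.FORWARD_ONLY);
int count = 0;
while (results.next()) {
    Person person = (Person) results.get()[0];
    storeInFile(person);
    session.evict(person);      // drop the session's reference to this entity
    if (++count % 1000 == 0) {
        session.clear();        // or clear the whole session periodically
    }
}
results.close();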
I propose not just sample code, but a query template based on Hibernate that does this workaround for you (pagination, scrolling and clearing the Hibernate session).
It can also easily be adapted to use an EntityManager.
I've used the Hibernate scroll functionality successfully before without it reading the entire result set in. Someone said that MySQL does not do true scroll cursors, but it claims to, based on JDBC's dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE), and searching around it seems like other people have used it. Make sure it's not caching the Person objects in the session - I've used it on SQL queries where there was no entity to cache. You can call evict at the end of the loop to be sure, or test with a SQL query. Also play around with setFetchSize to optimize the number of trips to the server.
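To check what your driver claims, a quick sketch (connection assumed to be an open java.sql.Connection); keep in mind this only reports the driver's claim, not whether rows actually stream:
DatabaseMetaData dmd = connection.getMetaData();
boolean scrollable = dmd.supportsResultSetType(ResultSet.TYPE_SCROLL_INSENSITIVE);
System.out.println("Scroll-insensitive result sets supported: " + scrollable);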
I recently worked on a problem like this and wrote a blog post about how I faced it. It is very similar; I hope it is helpful to someone.
I use a lazy-list approach with partial acquisition. I replaced the limit/offset pagination of the query with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as ROWNUM,* from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get() method uses a data access interface to fetch the next set of data and release heap memory:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette))
    {
        lastIndexRoulette = index;
        bufferParcial.clear();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty())
        {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses a query to paginate and implements a method to iterate progressively, 25,000 records at a time, until everything is read.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
Another option if you're "running out of RAM" is to request, say, one column instead of the entire object: How to use hibernate criteria to return only one element of an object instead the entire object? (saves a lot of CPU processing time to boot).
For me it worked properly when setting useCursors=true; otherwise the scrollable ResultSet ignores the fetch size. In my case it was 5000, but the scrollable ResultSet fetched millions of records at once, causing excessive memory usage. The underlying DB is MS SQL Server.
jdbc:jtds:sqlserver://localhost:1433/ACS;TDS=8.0;useCursors=true
I'm trying to create a Java program to clean up and merge rows in my table. The table is large, about 500k rows, and my current solution is running very slowly. The first thing I want to do is simply get an in-memory array of objects representing all the rows of my table. Here is what I'm doing:
pick an increment of say 1000 rows at a time
use JDBC to fetch a resultset on the following SQL query
SELECT * FROM TABLE WHERE ID > 0 AND ID < 1000
add the resulting data to an in-memory array
continue querying all the way up to 500,000 in increments of 1000, each time adding results.
This is taking way too long. In fact it's not even getting past the second increment from 1000 to 2000. The query takes forever to finish (although when I run the same thing directly through a MySQL browser it's decently fast). It's been a while since I've used JDBC directly. Is there a faster alternative?
First of all, are you sure you need the whole table in memory? Maybe you should consider (if possible) selecting rows that you want to update/merge/etc. If you really have to have the whole table you could consider using a scrollable ResultSet. You can create it like this.
// make sure autocommit is off (postgres)
con.setAutoCommit(false);
Statement stmt = con.createStatement(
ResultSet.TYPE_SCROLL_INSENSITIVE, //or ResultSet.TYPE_FORWARD_ONLY
ResultSet.CONCUR_READ_ONLY);
ResultSet srs = stmt.executeQuery("select * from ...");
It enables you to move to any row you want by using 'absolute' and 'relative' methods.
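For instance, continuing the snippet above (the column name is an assumption):
srs.absolute(1000);                      // jump straight to row 1000
String name = srs.getString("name");     // read a column from that row
srs.relative(50);                        // then move 50 rows forward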
One thing that helped me was Statement.setFetchSize(Integer.MIN_VALUE). I got this idea from Jason's blog. This cut down execution time by more than half. Memory consumed went down dramatically (as only one row is read at a time.)
This trick doesn't work for PreparedStatement, though.
Although it's probably not optimal, your solution seems like it ought to be fine for a one-off database cleanup routine. It shouldn't take that long to run a query like that and get the results (I'm assuming that since it's a one-off, a couple of seconds would be fine). Possible problems:
Is your network (or at least your connection to MySQL) very slow? If so, you could try running the process locally on the MySQL box, or somewhere better connected.
Is there something in the table structure that's causing it? Pulling down 10k of data for every row? 200 fields? Calculating the id values to fetch based on a non-indexed column? You could try finding a more DB-friendly way of pulling the data, e.g. just the columns you need, having the DB aggregate values, etc. (see the sketch after this list).
If you're not getting through the second increment something is really wrong - efficient or not, you shouldn't have any problem dumping 2000, or 20,000 rows into memory on a running JVM. Maybe you're storing the data redundantly or extremely inefficiently?
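As a small illustration of that last point, here is a sketch that pulls only the columns the merge actually needs (placeholder column names, conn assumed to be an open Connection):
try (PreparedStatement ps = conn.prepareStatement(
        "SELECT id, name, updated_at FROM TABLE WHERE id >= ? AND id < ?")) {
    ps.setInt(1, 0);
    ps.setInt(2, 1000);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // keep only the fields the merge logic needs
        }
    }
}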