java select large table and export to file

I have a table with approximately 62,000,000 rows, and I need to select data from it and export the result to a .txt or .csv file.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it consumes all available memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for my DAO, but I can switch to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(Params param); //program crash here
initFile();
for(Map map : list){
util.append(map); //this transform row to file
}
closeFile();
How would you suggest I write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get a Map for each row instead of an entity.

You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine and database driver support them. For this to work, be sure to set the following properties:
query.setReadOnly(true);
query.setCacheable(false);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
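SomeEntity entity = (SomeEntity) results.get()[0]; // get() returns Object[], so a cast is needed
// process the entity here, e.g. append it to the output file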
}
results.close();
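Whichever way the rows are fetched, memory use stays flat only if each row is written out as soon as it is read, rather than collected into a List first. Below is a minimal sketch of the writing side using only the JDK. The class name CsvRowWriter and its methods are hypothetical (they are not part of Hibernate or of the question's code), and the quoting follows the common CSV convention of doubling embedded quotes.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper: writes one row at a time so nothing accumulates in memory.
public class CsvRowWriter {

    // Quote a field if it contains a comma, a quote, or a line break; null becomes empty.
    public static String escape(Object value) {
        if (value == null) {
            return "";
        }
        String s = value.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    // Append one row (in the map's iteration order) followed by a newline.
    public static void writeRow(Writer out, Map<String, ?> row) throws IOException {
        StringJoiner line = new StringJoiner(",");
        for (Object value : row.values()) {
            line.add(escape(value));
        }
        out.write(line.toString());
        out.write('\n');
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for one row returned by the query.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "Smith, John");
        StringWriter out = new StringWriter();
        writeRow(out, row);
        System.out.print(out); // prints: 1,"Smith, John"
    }
}
```

In real use the Writer would probably be a BufferedWriter wrapping a FileWriter, and writeRow would be called inside the while (results.next()) loop. If column order matters, it is safer to iterate over a fixed list of column names than to rely on the map's iteration order.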

Lock the table, then select and export the data in subsets, appending each subset to the results file. Make sure you unlock the table unconditionally when done.
It's not elegant, but the task will run to completion even on servers or clients with limited resources.

Related

Why Spark dataframe cache doesn't work here

I just wrote a toy class to test Spark dataframe (actually Dataset since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions here, "insertInto" and "count".
I debugged the code step by step. When running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe like above, and I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the SQL query to be re-executed?
2) If I understand the log correctly, both actions trigger HDFS file reading. Does that mean ds.cache() doesn't actually work here? If so, why not?
Many thanks.

It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See, for example, this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try running ds.count() before inserting into the table.

I found that the other answer doesn't work. What I had to do was break the lineage so that the df I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the df (PySpark) using
copy_of_df = sql_context.createDataFrame(df.rdd)

Mongo Java Driver: How to create a cursor in MongoDB from the cursor id returned by db.runCommand

I am using db.runCommand(document) from the MongoDB Java driver API.
Here is the sample code I am using:
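// Build the find command as a Document; the Java driver does not accept shell-style JSON literals
Document command = new Document("find", "collectionName")
.append("filter", new Document("startDate",
new Document("$gte", "#startDate").append("$lte", "#endDate")))
.append("projection", new Document("_id", 0));
Document resultDocument = db.runCommand(command);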
I am using the find command. My query returns only 101 records, because the default batch size is 101. I want to create a cursor as mentioned in the API documentation below.
Snippet from the MongoDB documentation:
https://docs.mongodb.org/manual/reference/command/find/#dbcmd.find
Executes a query and returns the first batch of results and the cursor id, from which the client can construct a cursor.
I don't want to specify a batchSize, because I am not sure how many records my query will return. So I want to create a cursor and iterate over it.
Can anyone explain how to create a cursor from the id returned by db.runCommand in the MongoDB Java driver, so that I can iterate over all the records?

You can fetch the next batches using the getMore command. From the MongoDB documentation:
getMore
Use in conjunction with commands that return a cursor, e.g. find and aggregate, to return subsequent batches of documents currently pointed to by the cursor.

Update all objects in JPA entity

I'm trying to update all my 4000 Objects in ProfileEntity but I am getting the following exception:
javax.persistence.QueryTimeoutException: The datastore operation timed out, or the data was temporarily unavailable.
This is my code:
public synchronized static void setX4all()
{
em = EMF.get().createEntityManager();
Query query = em.createQuery("SELECT p FROM ProfileEntity p");
List<ProfileEntity> usersList = query.getResultList();
int a,b,x;
for (ProfileEntity profileEntity : usersList)
{
a = profileEntity.getA();
b = profileEntity.getB();
x = func(a,b);
profileEntity.setX(x);
em.getTransaction().begin();
em.persist(profileEntity);
em.getTransaction().commit();
}
em.close();
}
I'm guessing that querying all of the records from ProfileEntity takes too long.
How should I do it?
I'm using Google App Engine, so no UPDATE queries are possible.
Edited 18/10
In the last 2 days I tried:
using Backends as Thanos Makris suggested, but I hit a dead end. You can see my question here.
reading the DataNucleus suggestion on Map-Reduce, but I really got lost.
I'm looking for a different direction. Since I am only going to do this update once, maybe I can update manually, 200 objects or so at a time.
Is it possible to query for the first 200 objects, then the second 200 objects, and so on?

Given your scenario, I would advise running a native update query:
Query query = em.createNativeQuery("update ProfileEntity pe set pe.X = 'x'");
query.executeUpdate();
Please note: Here the query string is SQL i.e. update **table_name** set ....
This will work better.

Change the update process to use something like Map-Reduce. This means all is done in datastore. The only problem is that appengine-mapreduce is not fully released yet (though you can easily build the jar yourself and use it in your GAE app - many others have done so).

If you want to set x for all objects, it is better to use an update statement (i.e. native SQL) through the JPA entity manager instead of fetching all the objects and updating them one by one.

Maybe you should consider using the Task Queue API, which enables you to execute tasks that run for up to 10 minutes. If you need to update so many entities that Task Queues do not fit, you could also consider the use of Backends.

Put the transaction outside of the loop:
em.getTransaction().begin();
for (ProfileEntity profileEntity : usersList) {
...
}
em.getTransaction().commit();
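The question's idea of handling 200 objects at a time can be combined with this: fetch one page, update it inside a single transaction, then move on to the next page. Below is a minimal sketch of the paging loop alone, using only the JDK. PagedProcessor and processInPages are hypothetical names. With JPA, the fetch function would typically wrap query.setFirstResult(offset).setMaxResults(limit).getResultList(), and the page handler would begin a transaction, update the entities, and commit. Whether offset-based paging performs acceptably on the App Engine datastore is an assumption this sketch does not verify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

// Hypothetical helper: pulls fixed-size pages until one comes back short or empty.
public class PagedProcessor {

    // fetchPage receives (offset, limit) and must return at most `limit` items.
    // Returns the total number of items handed to handlePage.
    public static <T> int processInPages(BiFunction<Integer, Integer, List<T>> fetchPage,
                                         int pageSize,
                                         Consumer<List<T>> handlePage) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(offset, pageSize);
            if (page.isEmpty()) {
                break; // nothing left
            }
            handlePage.accept(page); // e.g. begin transaction, update, commit
            offset += page.size();
            if (page.size() < pageSize) {
                break; // a short page means this was the last one
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        // Stand-in data source: 450 integers instead of a database table.
        List<Integer> all = new ArrayList<>();
        for (int i = 0; i < 450; i++) {
            all.add(i);
        }
        int total = processInPages(
                (offset, limit) -> all.subList(offset, Math.min(offset + limit, all.size())),
                200,
                page -> System.out.println("processing " + page.size() + " items"));
        System.out.println("total " + total);
    }
}
```

With 450 items and a page size of 200, the handler is called three times (200, 200, and 50 items). Committing once per page keeps each transaction short while avoiding one transaction per entity.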

Your class does not behave very well. JPA is not suitable for bulk updates done this way: you are just starting a lot of transactions in rapid sequence and producing a lot of load on the database. A better solution for your use case would be a scalar query that sets the value on all the objects without loading them into the JVM first (depending on your object structure and lazy-loading settings, you may load much more data than you think).
See the Hibernate reference:
http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-direct

file (not in memory) based JDBC driver for CSV files

Is there an open source, file-based (NOT in-memory) JDBC driver for CSV files? My CSV files are dynamically generated from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows indexed searches if we specify an index, but I won't be able to provide a unique column that can be used as an index, so it performs the SQL operations in memory.
Edit:
I've tried CSVJDBC, but it doesn't support simple operations like ORDER BY and GROUP BY. It is still unclear whether it reads from the file or loads everything into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel files, not CSV. Plus, it's no longer developed or supported.
I've tried H2, but as far as I can tell it only reads the CSV and doesn't support running SQL against it.

You can solve this problem using the H2 database.
The following groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" sql query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
import groovy.sql.Sql
// Create the database (the H2 driver jar must be on the classpath)
def sql = Sql.newInstance("jdbc:h2:db/csv", "user", "pass", "org.h2.Driver")
// Load CSV file
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")
// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 5
// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0

If you try the SourceForge project csvjdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page

This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of hsqldb: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, you will have to run a command line program for your query).
sqlite is another solution but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.

To do anything with a file you have to load it into memory at some point. What you could do is just open the file and read it line by line, discarding the previous line as you read in a new one. Only downside to this approach is its linearity. Have you thought about using something like memcache on a server where you use Key-Value stores in memory you can query instead of dumping to a CSV file?
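The line-by-line approach described above can be sketched with only the JDK. CsvLineScanner and countMatching are hypothetical names, and the parsing assumes simple CSV with no quoted fields, embedded commas, or embedded line breaks. Only the current line and a running count are held in memory at any time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper: scans CSV text one line at a time, keeping only a running count.
public class CsvLineScanner {

    // Count rows whose field at columnIndex equals `wanted`.
    // Assumes simple CSV: no quoted fields, embedded commas, or embedded line breaks.
    public static long countMatching(Reader source, int columnIndex, String wanted)
            throws IOException {
        long matches = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            // Each call to readLine() replaces the previous line, so memory use stays flat.
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1);
                if (columnIndex < fields.length && fields[columnIndex].equals(wanted)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        String csv = "0,hello world,0\n1,hello world,1\n2,hello world,0\n";
        System.out.println(countMatching(new StringReader(csv), 2, "0")); // prints 2
    }
}
```

For a real file, a reader from java.nio.file.Files.newBufferedReader(path) can be passed in place of the StringReader. Sorting or grouping still needs either extra memory or a database, which is the linearity trade-off mentioned above.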

You can either use a specialized JDBC driver, like CsvJdbc (http://csvjdbc.sourceforge.net), or configure a database engine such as MySQL to treat your CSV as a table and then manipulate the CSV through a standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to the CSV via CsvJdbc (or similar) allows very quick operations on big data volumes, but without the ability to sort or group records using SQL commands.
The MySQL CSV engine provides a rich set of SQL features, but at the cost of performance.
So if the size of your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without any need for grouping or sorting, go with CsvJdbc.
If you need both, i.e. to handle very big files and to manipulate them using SQL, then the best course of action is to load the CSV into a normal database table (e.g. in MySQL) first and then handle the data as a usual SQL table.

Hibernate ScrollableResults Do Not Return The Whole Set of Results

Some of the queries we run have 100'000+ results and it takes forever to load them and then send them to the client. So I'm using ScrollableResults to have a paged results feature. But we're topping at roughly 50k results (never exactly the same amount of results).
I'm on an Oracle9i database, using the Oracle 10 drivers and Hibernate is configured to use the Oracle9 dialect. I tried with the latest JDBC driver (ojdbc6.jar) and the problem was reproduced.
We also followed some advice and added an ordering clause, but the problem was reproduced.
Here is a code snippet that illustrates what we do:
final int pageSize = 50;
Criteria crit = sess.createCriteria(ABC.class);
crit.add(Restrictions.eq("property", value));
crit.setFetchSize(pageSize);
crit.addOrder(Order.asc("property"));
ScrollableResults sr = crit.scroll();
...
...
ArrayList page = new ArrayList(pageSize);
do{
for (Object entry : page)
sess.evict(entry); //to avoid having our memory just explode out of proportion
page.clear();
for (int i =0 ; i < pageSize && ! metLastRow; i++){
if (sr.next())
page.add(sr.get(0));
else
metLastRow = true;
}
metLastRow = metLastRow?metLastRow:sr.isLast();
sendToClient(page);
}while(!metLastRow);
So, why does the result set tell me it's at the end when there should be many more results?

Your code snippet is missing important pieces, like the definitions of resultSet and page. But I wonder anyway, shouldn't the line
if (resultSet.next())
be rather
if (sr.next())
?
As a side note, AFAIK superfluous objects can be cleaned out of the persistence context simply by calling
session.flush();
session.clear();
instead of looping through the collection of objects to evict each one separately. (Of course, this requires that the query is executed in its own independent session.)
Update: OK, next round of guesses :-)
Can you actually check which rows are sent to the client and compare them against the result of the equivalent SQL query run directly against the DB? It would be good to know whether this code retrieves (and sends to the client) all rows up to a certain limit, or only some rows (like every 2nd) from the whole result set, or something else. That could shed some light on the root cause.
Another thing you could try is
crit.setFirstResult(0).setMaxResults(200000);

I had the same issue in a large project whose code is based on List<E> instances, so I wrote a deliberately limited List implementation that supports only iteration. It lets you browse a ScrollableResults without refactoring all the service implementations and method prototypes.
This implementation is available in my IterableListScrollableResults.java Gist.
It also regularly flushes Hibernate entities from the session. Here is one way to use it, for instance when exporting all non-archived entities from the DB to a text file with a for loop:
Criteria criteria = getCurrentSession().createCriteria(LargeVolumeEntity.class);
criteria.add(Restrictions.eq("archived", Boolean.FALSE));
criteria.setReadOnly(true);
criteria.setCacheable(false);
List<E> result = new IterableListScrollableResults<E>(getCurrentSession(),
criteria.scroll(ScrollMode.FORWARD_ONLY));
for(E entity : result) {
dumpEntity(file, entity);
}
I hope this helps.
