Reading a large number of records from MySQL into Java - java

I have a MySQL database with over 8 million records that I need to process (the processing can't be done in the database itself), and I run into problems when trying to read them into my Java application.
I have already tried some solutions from people with similar problems (e.g., link), but none of them have worked for me. I tried setting the fetch size and so on, but no luck. My application is built around a BlockingQueue: the producer continuously reads data from the database and stores it in the queue so the consumer can process it. This way I limit the number of records held in main memory at the same time.
My code works for a small number of records (I tested with 1000 records), so I suspect the step from the database into my application is what needs to be fixed.
Edit 1
connection = ConnectionFactory.getConnection(DATABASE);
// resultSetType first, then resultSetConcurrency
preparedStatement = connection.prepareStatement(query,
        java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
preparedStatement.setFetchSize(1000);
rs = preparedStatement.executeQuery();
Edit 2
Now I at least get some output other than just watching my memory shrink. I get this error:
Exception in thread "Thread-0" java.lang.OutOfMemoryError: Java heap space
at com.mysql.jdbc.Buffer.<init>(Buffer.java:59)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:2089)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3554)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:491)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3245)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2413)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2836)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2828)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2777)
at com.mysql.jdbc.StatementImpl.executeQuery(StatementImpl.java:1651)
at razoralliance.dao.DataDAOImpl.getAllDataRS(DataDAOImpl.java:38)
at razoralliance.app.DataProducer.run(DataProducer.java:34)
at java.lang.Thread.run(Thread.java:722)
Edit 3
I did some more research into the producer-consumer pattern, and it turns out that when the consumer cannot keep up with the producer, an unbounded queue grows automatically and eventually runs out of memory. So I switched to an ArrayBlockingQueue, which has a fixed capacity. However, I still see memory problems: Eclipse Memory Analyzer says the ArrayBlockingQueue occupies 65.31% of my memory, even though it only holds 1000 objects, each with 4 fields, all text.

You will need to stream your results. With the MySQL driver you have to create the ResultSet as TYPE_FORWARD_ONLY and CONCUR_READ_ONLY, and set the fetch size accordingly: stmt.setFetchSize(Integer.MIN_VALUE);
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
        java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
The combination of a forward-only, read-only result set, with a fetch size of Integer.MIN_VALUE serves as a signal to the driver to stream result sets row-by-row. After this, any result sets created with the statement will be retrieved row-by-row.
There are some caveats with this approach...
Streaming large result sets with MySQL
http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html
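For completeness, here is a minimal sketch of how such a streaming statement could feed the producer/consumer setup from the question. The class name and the queue element type (a plain String[] of four columns) are assumptions, not taken from the original code:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.BlockingQueue;

// Illustrative producer: streams rows from MySQL one at a time and hands them
// to a bounded queue, so only a small window of rows is ever held in memory.
public class StreamingProducer implements Runnable {

    private final Connection connection;          // obtained elsewhere, e.g. from ConnectionFactory
    private final BlockingQueue<String[]> queue;  // bounded queue shared with the consumer
    private final String query;

    public StreamingProducer(Connection connection, BlockingQueue<String[]> queue, String query) {
        this.connection = connection;
        this.queue = queue;
        this.query = query;
    }

    @Override
    public void run() {
        // Forward-only + read-only + fetch size Integer.MIN_VALUE is the signal
        // for Connector/J to stream rows instead of buffering the whole result set.
        try (PreparedStatement stmt = connection.prepareStatement(query,
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            stmt.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    String[] row = { rs.getString(1), rs.getString(2),
                                     rs.getString(3), rs.getString(4) };
                    queue.put(row); // blocks when the queue is full, throttling the producer
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop quietly if the consumer shuts us down
        } catch (SQLException e) {
            throw new RuntimeException("Streaming query failed", e);
        }
    }
}

One caveat from the linked documentation: while streaming, the result set must be read to completion before any other statement can be issued on the same connection.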

Why don't you try the approach from this answer:
Problem exporting a lot of data from database to .csv with java
Instead of fetching the entire result set, records can be fetched one at a time and then used for processing. The linked answer reads records one by one and writes them to a file, but you can apply the same pattern to your processing. That is one approach you can use.
Another approach is to use multiple threads that fetch records on demand and process them separately.

Related

Reading records in parallel from a database in Java

I have a database with millions of records. We read the records one by one in Java and insert them into another system on a daily basis, after end of day. We have been told to make this faster.
I suggested creating multiple threads with a thread pool so that the threads read data in parallel and push it into the other system, but I don't know how to prevent the threads from reading the same data twice. How can I make it faster and still achieve data consistency? In other words, how can we speed this process up with multithreading in Java, or is there another way, other than multithreading, to achieve it?
One possible solution for your task would be to take the ids of the records in your database, split them into chunks (e.g. with 1000 ids each) and call JpaRepository.findAllById(Iterable<ID>) inside Runnables submitted via ExecutorService.submit().
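A rough sketch of that idea, with the repository, record and target-system types reduced to minimal stand-ins (with Spring Data, RecordRepository would just be a JpaRepository<Record, Long>):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustrative: split the id list into fixed-size chunks and let a small
// thread pool load and push each chunk independently.
public class ChunkedMigration {

    // Stand-ins for the real types; findAllById has the same shape as Spring Data's method.
    interface RecordRepository { List<Record> findAllById(Iterable<Long> ids); }
    interface TargetSystemClient { void insert(Record record); }
    static class Record { long id; String payload; }

    private static final int CHUNK_SIZE = 1000;

    private final RecordRepository repository;
    private final TargetSystemClient target;

    public ChunkedMigration(RecordRepository repository, TargetSystemClient target) {
        this.repository = repository;
        this.target = target;
    }

    public void migrate(List<Long> allIds) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8); // pool size needs tuning
        for (int from = 0; from < allIds.size(); from += CHUNK_SIZE) {
            // subList gives each task a disjoint slice, so no id is read twice
            List<Long> chunk = allIds.subList(from, Math.min(from + CHUNK_SIZE, allIds.size()));
            pool.submit(() -> {
                List<Record> records = repository.findAllById(chunk); // one bulk SELECT per chunk
                records.forEach(target::insert);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}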
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.
I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of round trips between the Java application (coming from the driver) and the database: stop reading records one by one and move to bulk reading. That is, read, say, 2000 records at once from the database into memory and process the whole bulk (see the sketch after this list). Consider even larger numbers (like 5000), but you really should measure this; it depends on the memory of the Java application and other factors. In any case, if there is an issue, discard the bulk.
The data itself might not be organized correctly: when you read a bulk of data you may need to order it by some criteria, so make sure the query doesn't do a full table scan, define indexes properly, etc.
If applicable, talk to your DBA; he or she might provide additional insights about the data management itself: partitioning, storage-related optimizations, etc.
If all this fails and reading from the db is still a bottleneck, consider redesigning the flow (for instance, push messages to Kafka if you have it); these can be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.
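As a sketch of the bulk-reading point, assuming a MySQL-style LIMIT clause and an indexed numeric primary key (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative keyset-style bulk reader: 2000 rows per round trip, using the
// last seen id as the lower bound instead of an ever-growing OFFSET.
public class BulkReader {

    private static final int BULK_SIZE = 2000;
    // Hypothetical table and column names.
    private static final String SQL =
            "SELECT id, col1, col2 FROM records WHERE id > ? ORDER BY id LIMIT " + BULK_SIZE;

    public void readAll(Connection connection) throws SQLException {
        long lastId = 0;
        int rowsInBulk;
        do {
            rowsInBulk = 0;
            try (PreparedStatement stmt = connection.prepareStatement(SQL)) {
                stmt.setLong(1, lastId);
                try (ResultSet rs = stmt.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        process(rs.getString("col1"), rs.getString("col2"));
                        rowsInBulk++;
                    }
                }
            }
        } while (rowsInBulk == BULK_SIZE); // a short bulk means we reached the end
    }

    private void process(String col1, String col2) {
        // placeholder for the per-record processing / push to the other system
    }
}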

Fastest way to read data from a single MS SQL table into Java under a single transaction

I need to read a large result set from MS SQL Server into a Java program. I need to read a consistent data state, so it runs under a single transaction; I don't want dirty reads.
I can split the read using OFFSET and FETCH NEXT and have each set of rows processed by a separate thread.
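Roughly, each worker does something like this (table and column names are made up for illustration; each worker holds its own connection):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Illustrative worker: reads one page of rows using OFFSET / FETCH NEXT.
public class PageReader implements Runnable {

    // Hypothetical table and columns; Id is assumed to be indexed.
    private static final String SQL =
            "SELECT Id, Payload FROM dbo.Records ORDER BY Id "
          + "OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";

    private final Connection connection; // one connection per worker
    private final long offset;
    private final int pageSize;

    public PageReader(Connection connection, long offset, int pageSize) {
        this.connection = connection;
        this.offset = offset;
        this.pageSize = pageSize;
    }

    @Override
    public void run() {
        try (PreparedStatement stmt = connection.prepareStatement(SQL)) {
            stmt.setLong(1, offset);
            stmt.setInt(2, pageSize);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // consume rs.getLong("Id"), rs.getString("Payload"), ...
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException("Page read failed at offset " + offset, e);
        }
    }
}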
However, when doing this, the overall performance is about 30k rows read per second, which is pretty lame. I'd like to get to about 1M rows per second.
I've checked with VisualVM that I have no memory pressure. There are no GC pauses. Looking at machine utilisation, there seems to be no CPU limitation either.
I believe that the upstream source (MS SQL) is the limiting factor.
Any ideas on what I should look at?

How to enhance the performance of a Java application that makes many calls to the database?

As per my understanding, the execution of Java programs is pretty fast; the things that slow down an application are mainly network and IO operations.
For example, if I have a for loop running 10,000 times which opens a file, processes some data and saves the data back into the file, and the application is slow, it is not because the loop executes 10,000 times but because of the file opening and closing inside the loop.
I have an MVC application where, before a page is rendered, I go through a Controller which in turn calls Services, which finally call some DAO methods.
The problem is that so many queries are fired before the page loads that the page load time is 2 minutes, which is pathetic.
Since the service calls various DAO methods and each DAO method uses a different connection object, I thought of doing this: "Create only one DAO method that the Service would call, and have this DAO method fire all queries on one Connection object."
This would save the time spent connecting to and disconnecting from the database.
But the connection object in my application comes from a connection pool, and most connection pools don't actually close connections; they just return them to the pool. So my solution above would not have any effect, as there is no real opening and closing of connections anyway.
How can I enhance the performance of my application?
First, you should accurately determine where the time is spent, using a tool such as a profiler.
Once the root cause is known, you can see whether the operations can be optimized, i.e. whether unnecessary steps can be removed. If not, you can see whether the result of the operations can be cached and reused.
Without an accurate understanding of the processing that is taking the time, it will be difficult to make any reasonable optimization.
If you reuse connection objects from the pool, then connecting and disconnecting should not be the source of the performance problem.
I agree with Ashwinee K Jha that a profiler would give you some clear information about what you could optimize.
Meanwhile, some other ideas/suggestions:
Could you maintain some cache of answers? I guess that not all of the 10,000 queries are distinct! (See the sketch after this list.)
Try tuning the number of Connection objects in the Pool. There should be an optimal number.
Is your query execution already multi-threaded? I guess it is, so try tuning the number of threads. Generally the number of cores is a good number of threads, but in the case of I/O a much larger number is optimal (the big cost is the I/O, not the CPU).
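As a sketch of the caching idea, a tiny request-scoped cache could look like this (the key type and loader are placeholders for whatever the DAO layer actually uses):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative cache: identical queries within one page load run only once.
public class QueryCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // e.g. key -> dao.runQuery(key)

    public QueryCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        // computeIfAbsent invokes the loader at most once per distinct key
        return cache.computeIfAbsent(key, loader);
    }
}

A page request would create one such cache, pass it down to the DAOs and discard it afterwards, so there is no invalidation to worry about.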
Some suggestions:
Scale your database. Perhaps the database itself is just slow.
Use second-level caching or application session caching to potentially speed things up and reduce the need to query the database.
Change your queries, application or schema to reduce the number of calls made.
You can use Apache Commons DBCP, which provides a connection pool. Database IO is costly, but it is mostly opening and closing connections that takes a good chunk of time; a minimal configuration sketch is shown below.
You can also increase maxIdle (the maximum number of connections that can remain idle in the pool).
You could also look into an in-memory data grid, e.g. Hazelcast.
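For the DBCP suggestion, a minimal pool configuration could look like this (URL, credentials and pool sizes are placeholders):

import org.apache.commons.dbcp2.BasicDataSource;

// Minimal Apache Commons DBCP 2 pool configuration (values are illustrative).
public class PoolConfig {

    public static BasicDataSource createDataSource() {
        BasicDataSource ds = new BasicDataSource();
        ds.setUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder JDBC URL
        ds.setUsername("user");
        ds.setPassword("secret");
        ds.setInitialSize(5);  // connections opened up front
        ds.setMaxTotal(20);    // hard cap on open connections
        ds.setMaxIdle(10);     // connections kept idle in the pool instead of being closed
        return ds;
    }
}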

Pros and cons of opening multiple InputStreams?

While coding the solution to the problem of downloading a huge dynamic zip with low RAM impact, an idea started besieging me, and it led to this question, asked out of pure curiosity / hunger for knowledge:
What kind of drawbacks could I run into if, instead of loading the InputStreams one at a time (with separate queries to the database), I loaded all the InputStreams in a single query, returning a List of (n, potentially thousands of, "open") InputStreams?
Current (safe) version: n queries, one InputStream instantiated at a time
for (long id : ids) {
    InputStream in = getMyService().loadStreamById(id);
    IOUtils.copyStream(in, out);
    in.close();
}
Hypothetical version: one query, n instantiated InputStreams
List<InputStream> streams = getMyService().loadAllStreams();
for (InputStream in : streams) {
    IOUtils.copyStream(in, out);
    in.close();
    in = null;
}
What are the pros and cons of the second approach, excluding the (I suppose small) amount of memory used to keep multiple Java InputStreams instantiated?
Could it lead to some kind of network freeze or database stress (or locks, or problems if others read/write the same BLOB field the stream points to, etc.) more than multiple queries would?
Or are they smart enough to be almost invisible until asked for data, so that 1 query + 1000 active streams could be better than 1000 queries + 1 active stream?
The short answer is that you risk hitting a limit of your operating system and/or DBMS.
The longer answer depends on the specific operating system and DBMS, but here are a few things to think about:
On Linux there is a maximum number of open file descriptors that any process can hold. The default is/was 1024, but it's relatively easy to increase. The intent of this limit, IMO, is to kill a poorly written process, as the amount of memory required per file/socket is minimal (on a modern machine).
If the open stream represents an individual socket connection to the database, there's a hard limit on the total number of client sockets that a single machine may open to a single server address/port. This is driven by the client's dynamic port address range, and it's either 16 or 32k (but can be modified). This limit is across all processes on the machine, so excessive consumption by one process may starve other processes trying to access the same server.
Depending on how the DBMS manages the connections used to retrieve BLOBs, you may run into a much smaller limit enforced by the DBMS. Oracle, for example, defaults to a total of 50 "cursors" (active retrieval operations) per user connection.
Aside from these limits, you won't get any benefit given your code as written, since it runs through the connections sequentially. If you were to use multiple threads to read, you may see some benefit from having multiple concurrent connections. However, I'd still open those connections on an as-needed basis. And lest you think of spawning a thread for each connection (and running into the physical limit of number of threads), you'll probably reach a practical throughput limit before you hit any physical limits.
I tested it in PostgreSQL, and it works.
Since PostgreSQL does not seem to have a predefined max cursor limit, I still don't know whether simply assigning a cursor/pointer from a BLOB field to a Java InputStream object via java.sql.ResultSet.getBinaryStream("blob_field") counts as an active retrieval operation or not (I guess not, but who knows...);
Loading all the InputStreams at once with something like SELECT blob_field FROM table WHERE length(blob_field)>0 produced a very long query execution time, but very fast access to the binary content afterwards (sequentially, as above).
With a test case of 200 MB, made up of 20 files of 10 MB each:
The old way took roughly 1 second per query, plus 0.XX seconds for the other operations (reading each InputStream and writing it to the OutputStream, and so on);
Total elapsed time: 35 seconds
The experimental way took roughly 22 seconds for the big query and 12 seconds in total for iterating and performing the other operations.
Total elapsed time: 34 seconds
This makes me think that while assigning the binary stream from the database to the Java InputStream object, the complete read is already being performed :/ which makes using an InputStream similar to using a byte[] (but worse in this case, because of the memory overhead of having all the items instantiated);
Conclusion
Reading everything at once is a bit faster (about 1 second faster per 30 seconds of execution),
but it can easily make the big query time out, not to mention cause excessive RAM usage and, potentially, max cursor hits.
Do not try this at home, just stick with one query at a time...

HBase Multithreaded Scan is really slow

I'm using HBase to store some time series data. Following the suggestion in the O'Reilly HBase book, I am using a row key that is the timestamp of the data with a salted prefix. To query this data I spawn multiple threads which implement a scan over a range of timestamps, with each thread handling a particular prefix. The results are then placed into a concurrent hash map.
Trouble occurs when the threads attempt to perform their scans. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).
I've tried using HTablePool to get around what I thought was an issue with HTable not being thread-safe, but this did not result in any better performance.
In particular, I notice a significant slowdown when I hit this portion of my code:
for (Result res : rowScanner) {
    // add Result to HashMap
}
Through logging I noticed that every time through the loop condition I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.
I assume there is some kind of resource locking issue, but I just can't see it.
Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.
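A rough sketch of that tuning, assuming the Table API and a made-up column family name:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative scan setup: caching controls rows per RPC, batch controls
// columns per Result for very wide rows. The right values need measuring.
public class SaltedRangeScanner {

    public void scanRange(Table table, byte[] startRow, byte[] stopRow) throws IOException {
        Scan scan = new Scan(startRow, stopRow);
        scan.setCaching(1000);              // rows fetched per round trip to the RegionServer
        scan.setBatch(100);                 // only relevant for rows with many columns
        scan.addFamily(Bytes.toBytes("d")); // hypothetical column family; restricts what is read
        try (ResultScanner rowScanner = table.getScanner(scan)) {
            for (Result res : rowScanner) {
                // add Result to the concurrent map, as in the original loop
            }
        }
    }
}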
EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.
Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.
Have you considered using MapReduce, with perhaps just a mapper, to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libraries. The Result class is not thread-safe. TableMapReduceUtil makes it easy to set up jobs.
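For illustration, a mapper-only scan job could be wired up roughly like this (the table name "metrics" and the emitted Text value are made up; a real mapper would extract the fields you need from each Result):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Illustrative MapReduce scan job: one mapper per region, no client-side threading.
public class TimeSeriesScanJob {

    static class ScanMapper extends TableMapper<ImmutableBytesWritable, Text> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // extract the fields you need from the Result and emit them
            context.write(row, new Text(value.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "time-series-scan");
        job.setJarByClass(TimeSeriesScanJob.class);

        Scan scan = new Scan();
        scan.setCaching(1000);      // same tuning advice as above
        scan.setCacheBlocks(false); // avoid polluting the BlockCache during full scans

        // "metrics" is a hypothetical table name
        TableMapReduceUtil.initTableMapperJob("metrics", scan, ScanMapper.class,
                ImmutableBytesWritable.class, Text.class, job);
        job.setNumReduceTasks(0); // mapper-only job
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}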
