Copy data from Excel file to Database using Mutlithreading

Copy data from Excel file to Database using Mutlithreading - java

I have a requirement to copy a huge data from excel(5,00,000) rows to Database. Should I go with the Blocking Queue method of multi threading or is there any other way to leverage multi threading on a more efficient scale?

if you want to use multithreading to improve the performance better you can create one thread per table and try to perform db-operation over the table. for thread management you can use executor service, threadpoolexecutor etc.

500.000 rows nowadays is not that huge amount for the database.
I think you should, first of all, optimize the DB access and if you don't have the desired performance, go with more advanced techniques.
You've stated Java, 2 optimizations like this come to mind:
Use Prepared Statement and not Statement from JDBC (or if you use any abstraction over JDBC, make sure that that's the case under the hood). This will allow the DB to not reparse query every time
Use batch operations. There alone will boost the speed in an order of magnitude or so depending on your RDBMS setup:
PreparedStatement pstmt = connection.prepareStatement(<YOUR_INSERT_SQL>);
for(...) { // chose batch size like 500 to 2000
pstmt.setXXX(<bind the parameters here>)
pstmt.addBatch(); // add to the batch
}
pstmt.executeBatch(); // does bunch of inserts at once
It can take less than a minute to perform all these operations, or 1-2 minutes, but not, say hours (of course depending on where do you insert the data and the network quality, but typically this is a case).
If it's not enough, you can go with parallel access of course, using a number of connections simultaneously. But again, if its one-time operation, I wouldn't have bothered, after all, it will take you more time to write this multithreaded code than the difference in performance you'll gain :)

Related

Reading parallel record in java from database

I have one database and in this we have millions of records. We are reading the record one by one using java and inserting those record to another system on daily basis after end of day. We have been told to make it faster.
I told them we will create multiple thread using thread pool and these thread will read data parallelly and inject into another system but I dont know how can we stop our thread to read same data again. how can make it faster and achieve data consistency as well. I mean how can we make this process faster using multithreading in java or is there any other way ,other than multithreading to achieve it?

One possible solution for your task would be taking the ids of records in your database, splitting them into chunks (e.g. with size 1000 each) and calling JpaRepository.findAllById(Iterable<ID>) within Runnables passed to ExecutorService.submit().
If you don't want to do it manually then you could have a look into Spring Batch. It is designed particularly for bulk transformation of large amounts of data.

I think you should identify the slowest part in this flow and try to optimize it step by step.
In the described flow you could:
Try to reduce the number of "roundtrips" between the java application (in coming from the driver driver) and the database: Stop reading records one by one and move to bulk reading. Namely, read, say, 2000 records at once from the db into memory and process the whole bulk. Consider even larger numbers (like 5000) but you should measure this really, it depends on the memory of the java application and other factors. Anyway, if there is an issue - discard the bulk.
The data itself might not be organized correctly: when you read the bulk of data you might need to order it by some criteria, so make sure it doesnt make a full table scan, define indices properly etc
If applicable, talk to your DBA, he/she might provide additional insights about data management itself: partitioning, storage related optimizations, etc.
If all this fails and reading from the db is still a bottleneck, consider the flow redesign (for instance - throw messages to kafka if you have one), these might be naturally partitioned so you could scale out the whole process, but this might be beyond the scope of this question.

Does Java Berkeley DB have an upper limit for concurrent reading?

Java Berkeley DB is used in my system to store persistent data.
Since I have a large amount of data to be loaded, I attempt to do that with a number of threads. When the number of threads is low, e.g., 10, it works fine. However, when it is set to a higher value, e.g., 30, the reading processes get stuck. It looks like the Java Berkeley DB has an upper limit for concurrency reading? Am I right? How would I update the limit?

Did you say ... "to be loaded?"
Uh huh, thought you did!
Therefore, the threads that you speak of are not "reading" threads: they are "writing" threads!
And, guess what: they are competing with one another, and, guess what, they are losing!
Unfortunately, your "attempt" to speed things up through the use of threads was (IMHO ...) "sincere, but misguided." Ultimately, a Berkely DB is "a single, on-disk data structure," and so there are (IMHO...) no opportunities for speeding-up the process through the use of multi-threading.
Various other strategies, though, might work. For example, you might find that if you sort the records that are to be inserted, through some appropriate external command, the process of inserting those records might well become “usefully (much?) faster.” Enough of a speed-difference, in other words, to more-than(!) make up for the time spent sorting. (However, there will only be one way to find out if this is true in your situation: "benchmarking, your using actual data, your actual sort-command, and so on.")

How to enhance the performance of a Java application that makes many calls to the database?

As per my understanding the execution of java programs is pretty fast, things that slow down an application are mainly Network and IO operations.
For example, if I have a for loop running 10000 times which opens a file, processes some data and saves the data back into the file. If at all the application is slow, it not because of the loop executing 10000 times but because of the file opening and closing within the loop.
I have an MVC application where before I view a page I go through a Controller which in turn calls Services, which finally calls some DAO methods.
The problem is that there are so many queries being fired before the page loads and hence the page load time is 2 mins, which is pathetic.
Since the service calls various DAO methods and each DAO method uses a different connection object, I thought of doing this: "Create only one DAO method that the Service would call and this DAO method would fire all queries on one Connection object."
So this would save the time of connecting and disconnecting to the database.
But, the connection object in my application is coming from a connection pool. And most connection pools don't close connections they just send them back to the connection pools. So my above solution would not have any effect as anyways there is no opening and closing of connections.
How can I enhance the performance of my application?

Firstly you should accurately determine where the time is spent using tools like Profiler.
Once the root cause is known you can see if the operations can be optimized, i.e remove unnecessary steps. If not then you can see if the result of the operations can be cached and reused.
Without accurate understanding of processing that is taking time, it will be difficult to make any reasonable optimization.

If you reuse connection objects from the pool, this means that the connection/disconnection does not create any performance problem.
I agree with Ashwinee K Jha that a Profiler would give you some clear information of what you could optimize.
Meanwhile some other ideas/suggestions:
Could you maintain some cache of answers? I guess that not all of the 10,000 queries are distinct!
Try tuning the number of Connection objects in the Pool. There should be an optimal number.
Is your query execution already multi-threaded? I guess it is, so try tuning the number of threads. Generally, the number of cores is a good number of threads BUT, in the case of I/Os a much larger number is optimal (the big cost is the I/Os, not the CPU)

Some suggestions :
Scale your database. Perhaps the database itself is just slow.
Use 'second level caching' or application session caching to
potentially speed things up and reduce the need to query the
database.
Change your queries, application or schemas to reduce the number of
calls made.

You can use The Apache DBCP which use connection pool, calling database IO is costly, but mostly db connection openning and closing take good chunk of time.
You can also increase the maxIdle time (The maximum number of connections that can remain idle in the pool)
Also you can look into in memory data grid eg hazelcast etc

One reader thread, one writer thread, n worker threads

I am trying to develop a piece of code in Java, that will be able to process large amounts of data fetched by JDBC driver from SQL database and then persisted back to DB.
I thought of creating a manager containing one reader thread, one writer thread and customizable number of worker threads processing data. The reader thread would read data to DTOs and pass them to a Queue labled 'ready for processing'. Worker threads would process DTOs and put processed objects to another queue labeld 'ready for persistence'. The writer thread would persist data back to DB. Is such an approach optimal? Or perhaps I should allow more readers for fetching data? Are there any ready libraries in Java for doing this sort of thing I am not aware of?

Whether or not your proposed approach is optimal depends crucially on how expensive it is to process the data in relation to how expensive it is to get it from the DB and to write the results back into the DB. If the processing is relatively expensive, this may work well; if it isn't, you may be introducing a fair amount of complexity for little benefit (you still get pipeline parallelism which may or may not be significant to the overall throughput.)
The only way to be sure is to benchmark the three stages separately, and then deside on the optimal design.
Provided the multithreaded approach is the way to go, your design with two queues sounds reasonable. One additional thing you may want to consider is having a limit on the size of each queue.

I hear echoes from my past and I'd like to offer a different approach just in case you are about to repeat my mistake. It may or may not be applicable to your situation.
You wrote that you need to fetch a large amount of data out of the database, and then persist back to the database.
Would it be possible to temporarily insert any external data you need to work with into the database, and perform all the processing inside the database? This would offer the following advantages:
It eliminates the need to extract large amounts of data
It eliminates the need to persist large amounts of data
It enables set-based processing (which outperforms procedural)
If your database supports it, you can make use of parallel execution
It gives you a framework (Tables and SQL) to make reports on any errors you encounter during the process.
To give an example. A long time ago I implemented a (java) program whose purpose was to load purchases, payments and related customer data from files into a central database. At that time (and I regret it deeply), I designed the load to process the transactions one-by-one , and for each piece of data, perform several database lookups (sql) and finally a number of inserts into appropriate tables. Naturally this did not scale once the volume increased.
Then I made another misstake. I deemed that it was the database which was the problem (because I had heard that the SELECT is slow), so I decided to pull out all data from the database and do ALL processing in Java. And then finally persist back all data to the database. I implemented all kinds of layers with callback mechanisms to easily extend the load process, but I just couldn't get it to perform well.
Looking in the rear mirror, what I should have done was to insert the (laughably small amount of) 100,000 rows temporarily in a table, and process them from there. What took nearly half a day to process would have taken a few minutes at most if I played to the strength of all technologies I had at my disposal.

An alternative to using an explicit queue is to have an ExecutorService and add tasks to it. This way you let Java manager the pool of threads.

You're describing writing something similar to the functionality that Spring Batch provides. I'd check that out if I were you. I've had great luck doing operations similar to what you're describing using it. Parallel and multithreaded processing, and several different database readers/writers and whole bunch of other stuff are provided.

Use Spring Batch! That is exactly what you need

Java Large database inserts

I have a database in which I need to insert batches of data (around 500k records at a time). I was testing with derby and was seeing insert times of about 10-15minutes for this many records (I was doing a batch insert in Java).
Does this time seem slow (working on your average laptop)? And are there approaches to speeding it up?
thanks,
Jeff

This time seems perfectly reasonable, and is in agreement with times I have observed. If you want it to go faster, you need use bulk insert options and disable safety features:
Use PreparedStatements and batches of 5,000 to 10,000 records unless it MUST be one transaction
Use bulk loading options in the DBMS
Disable integrity checks temporarily for insert
Disable indexes temporarily or delete indexes and re-create them post-insert
Disable transaction logging and re-enable afterward.
EDIT: Database transactions are limited by disk I/O, and on laptops and most hard drives, the important number is seek time for the disk.
Laptops tend to have rather slow disks, at 5400 rpm. At this speed, seek time is about 5 ms. If we assume one seek per record (an over-estimate in most cases), it would take 40 minutes (500000 * 5 ms) to insert all rows. Now, the use of caching mechanisms and sequencing mechanisms reduces this somewhat, but you can see where the problem comes from.
I am (of course) vastly oversimplifying the problem, but you can see where I'm going with this; it's unreasonable to expect databases to perform at the same speed as sequential bulk I/O. You've got to apply some sort of indexing to your record, and that takes time.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.