Is JDBC multi-threaded insert possible? - java

I'm currently working on a Java project in which I need to prepare a big (to me) MySQL database. I have to do web scraping using Jsoup and store the results in my database as well. As I estimate it, I will have roughly 1,500,000 to 2,000,000 records to insert. In my first trial, I just used a loop to insert these records, and it took me one week to insert about 1/3 of the required records, which I think is too slow. Is it possible to make this process multi-threaded, so that I can split my records into 3 sets, say 500,000 records per set, and then insert them into one database (one table, specifically)?

Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch inserts; they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: As @Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will go at the speed of the slowest operation. You could have multiple threads fetching web pages, which add the data to a queue data structure, and then have a single thread draining the queue into the database using batch inserts.
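For illustration, here is a minimal sketch of that layout: fetcher threads put scraped records on a BlockingQueue and a single writer thread drains it with a batched PreparedStatement. The table records(url, content), the JDBC URL and the flush size of 1,000 are made-up example values, not anything taken from the question.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ScrapeAndStore {

        // Hypothetical holder for one scraped record.
        static class Record {
            final String url, content;
            Record(String url, String content) { this.url = url; this.content = content; }
        }

        // Poison pill that tells the writer thread to stop.
        private static final Record STOP = new Record(null, null);

        public static void main(String[] args) throws Exception {
            BlockingQueue<Record> queue = new ArrayBlockingQueue<>(10_000);

            // Single writer thread: drains the queue and inserts in batches.
            Thread writer = new Thread(() -> {
                try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/scrape", "user", "password");
                     PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO records (url, content) VALUES (?, ?)")) {
                    conn.setAutoCommit(false);
                    int pending = 0;
                    while (true) {
                        Record r = queue.take();
                        if (r == STOP) break;
                        ps.setString(1, r.url);
                        ps.setString(2, r.content);
                        ps.addBatch();
                        if (++pending == 1_000) {   // flush every 1,000 rows
                            ps.executeBatch();
                            conn.commit();
                            pending = 0;
                        }
                    }
                    if (pending > 0) {              // flush whatever is left
                        ps.executeBatch();
                        conn.commit();
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            writer.start();

            // The Jsoup fetcher threads would call queue.put(new Record(url, html))
            // for every page they scrape; only a placeholder is shown here.
            queue.put(new Record("http://example.com", "<html>...</html>"));

            queue.put(STOP);   // signal end of work
            writer.join();
        }
    }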

Just make sure two (or more) threads don't use the same connection at the same time; using a connection pool resolves that. C3P0 and Apache DBCP come to mind...

You can insert these records from different threads provided they use different primary key values.
You should also look at Spring Batch, which I believe will be useful in your case.

You can chunk your record set into batches and do this, but perhaps you should think about other factors as well.
Are you doing a network round trip for each INSERT? If yes, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If yes, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.

I think a multi-threaded approach is useful for your problem, but you have to use a connection pool such as C3P0 or the Tomcat 7 connection pool to get more performance.
Another solution is to use a batch-operation framework such as Spring Batch; other utilities for batch operations exist as well.
Another solution is to use a PL/SQL procedure that takes a structured collection as an input parameter.

Related

java - jdbc performance

I need some help from you guys regarding JDBC performance optimization. One of our POJOs uses JDBC to connect to an Oracle database and retrieve records. Basically the records are email addresses, based on which emails will be sent to the users. The problem here is the performance. This process happens every weekend and the records are very large in number, around 100k.
The performance is very slow and it worries us a lot. Only 1,000 records seem to be fetched from the database every hour, which means that it will take 100 hours for this process to complete (which is very bad). Please help me with this.
The database server and the Java process are on two different remote servers. We have used rs_email.setFetchSize(1000) hoping that it would make a difference, but there was no change at all.
The same query executed on the server takes 0.35 seconds to complete. Any quick suggestion would be of great help to us.
Thanks,
Aamer.
First look at your queries. Analyze them. See if the SQL could be made more efficient (i.e., ask the database for what you want, not for what you don't want -- makes a big difference). Also check to see if there are indexes on any fields in your where and join clauses. Indexes make a big difference, but it can't be just any indexes. They have to be good indexes (i.e., the fields that make up the index provide enough uniqueness for the database to retrieve things appropriately). Work with your DBA on this. Look for either high run time against the db or check for queries with high CPU usage (even if the queries run sub-second). These are the things that can kill your database.
Also, from a code perspective, check to see if you are opening and closing your connections or if you are re-using them. That can make a big difference too.
It would help to post your code, queries, table layouts, and any indexes you have.
Use log4jdbc to get the real SQL used to fetch a single record. Then check the speed and execution plan for that SQL. You may need a proper index or even DB defragmentation.
Not sure about the Oracle driver, but I do know that the MySQL driver supports two different result retrieval methods: "streaming" and "wait until you've got it all".
The streaming method lets you start processing the results the moment you've got the first row returned from the query, whereas the other method retrieves the entire result set before you can start working on it. In cases where you deal with huge record sets, the latter often leads to memory exceptions or slow performance, because Java hits the "memory roof" and the garbage collector can't throw away "used" records the way it can in streaming mode.
The streaming mode doesn't let you navigate/scroll the result set the way the "normal"/"wait until you've got it all" mode does...
Anyway, not sure if this is of any help, but it might be worth checking out.
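Since the question is about Oracle, take this only as an illustration of the streaming mode described above, using MySQL Connector/J (where that behaviour is documented); the table, column and connection details are invented. With Oracle you would instead rely on setFetchSize() to tune the row prefetch count.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class StreamingFetch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/mail", "user", "password");
                 Statement stmt = conn.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

                // With MySQL Connector/J this combination switches the driver to
                // row-by-row streaming instead of buffering the whole result set.
                stmt.setFetchSize(Integer.MIN_VALUE);

                try (ResultSet rs = stmt.executeQuery(
                        "SELECT email_address FROM subscribers")) {
                    while (rs.next()) {
                        String email = rs.getString(1);
                        // process each address as it arrives
                    }
                }
            }
        }
    }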
My answer to your question, in summary is:
1. Check network
2. Check SQL
3. Check Java code.
It sounds very slow. The first thing to check is whether you have a slow network. You can do this pretty quickly by just pinging the database server, or by running the database server on the same machine as your JVM. If it is not the network, get an explain plan for your SQL and ensure you are not doing table scans when you don't need to be. If it is not the network or the SQL, then it's time to check your Java code. Are you doing anything like blocking when you shouldn't be?

What kind of Java / SQL solution do I need for 10 reads/writes every second on a million-row table?

Say I have an SQL table that consists of one million rows; let's say a user table.
What type of software do I need in order to handle 10 reads/writes every second? I was thinking of using a Java NIO server to handle the connections.
But how does the back-end database work? Could I simply use MySQL on the same computer?
Any insight would be great. Links, reading, examples, books?
I know SQL. I have done a lot of SQLite but never created a scalable system to handle this kind of load.
Edit/update, regarding helios' comment:
How many reads vs. writes? 50/50
Do you need up-to-date reads (no delay)? YES
How big is each item? 10% are 10-15 columns and the rest are 1-3 columns
Are you accessing them individually? No, none of the user threads interact, but there can be simultaneous DB reads/writes on the same row (just make it synchronized?)
So you need 10 transactions/second on a table with a million rows.
That is really neither a huge data set nor high performance.
MySQL (currently 5.5+, InnoDB engine), running on a single server, can easily handle that.
You may want to read the first five chapters of 'High Performance MySQL', published by O'Reilly.
For a NoSQL DB, I suggest MongoDB; see http://www.mongodb.org/
If you make use of JDBC connection pooling (like C3P0, DBCP, etc.) you would be able to have parallel inserts, and you could have 10 threads (or more) simultaneously inserting data. Your limit would then be your platform resources (memory, I/O, etc.).
All this holds, however, only if the data insertion process itself can run in parallel threads (i.e. you do not have a specific requirement to insert records sequentially) and if what you are doing are simple inserts, not something complex that locks the table or causes other transactions to wait.
Also consider using JDBC prepared statements, and commit in batches rather than after each record. This will speed things up greatly.
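As a rough sketch of the pool-plus-batches idea, assuming Apache Commons DBCP2 is on the classpath; the table, URL and chunking details are made up for the example.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.commons.dbcp2.BasicDataSource;

    public class PooledParallelInsert {

        public static void main(String[] args) throws Exception {
            BasicDataSource pool = new BasicDataSource();
            pool.setUrl("jdbc:mysql://localhost:3306/scrape");
            pool.setUsername("user");
            pool.setPassword("password");
            pool.setMaxTotal(10);                    // at most 10 physical connections

            // Placeholder: in the real program this would be the records to insert,
            // split into roughly equal chunks, one chunk per worker task.
            List<List<String[]>> chunks = List.of();

            ExecutorService workers = Executors.newFixedThreadPool(4);
            for (List<String[]> chunk : chunks) {
                workers.submit(() -> {
                    // Each task borrows its own connection, so no two threads share one.
                    try (Connection conn = pool.getConnection();
                         PreparedStatement ps = conn.prepareStatement(
                             "INSERT INTO records (url, content) VALUES (?, ?)")) {
                        conn.setAutoCommit(false);
                        for (String[] row : chunk) {
                            ps.setString(1, row[0]);
                            ps.setString(2, row[1]);
                            ps.addBatch();
                        }
                        ps.executeBatch();
                        conn.commit();               // one commit per chunk, not per row
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            workers.shutdown();
        }
    }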

Hibernate Batch Insert. Would it ever use one insert instead of multiple inserts?

I've been looking around trying to determine some Hibernate behavior that I'm unsure about. In a scenario where Hibernate batching is properly set up, will it only ever use multiple insert statements when a batch is sent? Is it not possible to use a DB-independent multi-row insert statement?
I guess I'm trying to determine if I actually have the batching set up correctly. I see the multiple insert statements but then I also see the line "Executing batch size: 25."
There's a lot of code I could post but I'm trying to keep this general. So, my questions are:
1) What can you read in the logs to be certain that batching is being used?
2) Is it possible to make Hibernate use a multi-row insert versus multiple insert statements?
Hibernate uses multiple insert statements (one per entity to insert), but sends them to the database in batch mode (using Statement.addBatch() and Statement.executeBatch()). This is the reason you're seeing multiple insert statements in the log, but also "Executing batch size: 25".
The use of batched statements greatly reduces the number of roundtrips to the database, and I would be surprised if it were less efficient than executing a single statement with multiple inserts. Moreover, it also allows mixing updates and inserts, for example, in a single database call.
I'm pretty sure it's not possible to make Hibernate use multi-row inserts, but I'm also pretty sure it would be useless.
I know that this is an old question, but I had the same problem: I thought that Hibernate batching means Hibernate would combine multiple inserts into one statement, which it doesn't seem to do.
After some testing I found this answer: a batch of multiple inserts is just as good as a multi-row insert. I did a test inserting 1,000 rows, once using Hibernate batching and once without. Both tests took about 20 s, so there was no performance gain from using Hibernate batching.
To be sure, I tried the rewriteBatchedStatements option of MySQL Connector/J, which actually combines multiple inserts into one statement. It reduced the time to insert 1,000 records down to 3 s.
So after all, Hibernate batching seems to be useless and a real multi-row insert seems to be much better. Am I doing something wrong, or what causes my test results?
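For reference, this is roughly how batching is usually wired up on the Hibernate side; hibernate.jdbc.batch_size=25 is assumed in the configuration, and for the multi-row rewrite mentioned above the MySQL JDBC URL would carry rewriteBatchedStatements=true. The method and entity list are invented for the example.

    import java.util.List;

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class HibernateBatchInsert {

        // Assumes hibernate.jdbc.batch_size=25 in the Hibernate configuration and,
        // for MySQL, a JDBC URL such as
        // jdbc:mysql://localhost:3306/scrape?rewriteBatchedStatements=true
        static void insertAll(SessionFactory sessionFactory, List<?> entities) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            for (int i = 0; i < entities.size(); i++) {
                session.save(entities.get(i));
                if (i > 0 && i % 25 == 0) {   // same value as the JDBC batch size
                    session.flush();          // push the current batch to the driver
                    session.clear();          // detach entities so the session stays small
                }
            }
            tx.commit();
            session.close();
        }
    }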
Oracle bulk insert collects an array of entities and passes it to the database in a single block, associated with a single cyclic insert/update/delete.
It is the only way to speed up network throughput.
Oracle suggests doing it by calling a stored procedure from Hibernate and passing it an array of data.
http://biemond.blogspot.it/2012/03/oracle-bulk-insert-or-select-from-java.html?m=1
It is not only a software problem but an infrastructural one!
The problem is network data flow optimization and TCP stack fragmentation.
MySQL has a similar facility.
You have to do something like what is described in this article.
Transferring the right volume of data over the network in each round trip is the solution.
You also have to verify the network MTU and the Oracle SDU/TDU settings with respect to the data transferred between the application and the database.

speed up operation on mysql

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc., you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program in such a way that it first fetches the last row processed by the other nodes and then adds an entry to the same table informing the other nodes what range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1,000 rows at a time. Node 1 fetches table B and finds it empty. Node 1 then inserts a row ('Node1', 1000), announcing that it is processing up to primary key 1000 of A (assuming the primary key of table A is numeric and in ascending order). Node 2 comes along and finds that 1,000 primary keys are being processed by some other node, so it inserts a row ('Node2', 2000) informing the others that it is processing rows 1001 to 2000. Please note that access to table B must be synchronized, i.e. only one node can work on it at a time.
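A minimal sketch of one way a node could claim its next range; the work_claims table, the column names and the chunk size are invented, and the locking behaviour assumes InnoDB (SELECT ... FOR UPDATE serializes the claims).

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class RangeClaimer {

        // Claims the next block of chunk rows of table A for this node and
        // returns the upper bound of the claimed range.
        static long claimNextRange(Connection conn, String nodeName, long chunk) throws Exception {
            conn.setAutoCommit(false);
            long upper;
            try (Statement st = conn.createStatement();
                 // Lock the coordination table so only one node claims a range at a time.
                 ResultSet rs = st.executeQuery(
                     "SELECT COALESCE(MAX(upper_bound), 0) FROM work_claims FOR UPDATE")) {
                rs.next();
                upper = rs.getLong(1) + chunk;
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO work_claims (node_name, upper_bound) VALUES (?, ?)")) {
                ps.setString(1, nodeName);
                ps.setLong(2, upper);
                ps.executeUpdate();
            }
            conn.commit();
            return upper;   // this node now owns rows (upper - chunk, upper]
        }
    }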
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chance of query cache hits, as well as reduce the overall workload on the backend, offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance and the changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental Spider engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or use a stored procedure so that all processing can take place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on some application key.
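If the stored-procedure route fits, the Java side shrinks to a single call; recalc_scores below is a hypothetical MySQL procedure that would do the whole read-calculate-write cycle on the server.

    import java.sql.CallableStatement;
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class CallRecalc {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/app", "user", "password");
                 // Hypothetical procedure doing the read-calculate-write cycle in the server.
                 CallableStatement cs = conn.prepareCall("{call recalc_scores(?)}")) {
                cs.setInt(1, 42);        // e.g. the partition or batch to process
                cs.execute();
            }
        }
    }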

Tips on Speeding up JDBC writes?

I am writing a program that does a lot of writes to a Postgres database. In a typical scenario I would be writing, say, 100,000 rows to a table that's well normalized (three integer foreign keys, the combination of which is the primary key and the index of the table). I am using PreparedStatements and executeBatch(), yet I can only manage to push those 100k rows in about 70 seconds on my laptop, when the embedded database we're replacing (which has the same foreign key constraints and indexes) does it in 10.
I am new to JDBC and I don't expect it to beat a custom embedded DB, but I was hoping it would be only 2-3x slower, not 7x. Anything obvious that I might be missing? Does the order of the writes matter? (i.e., say, if it's not the order of the index?) Things to look at to squeeze out a bit more speed?
This is an issue that I have had to deal with often on my current project. For our application, insert speed is a critical bottleneck. However, we have discovered that for the vast majority of database users select speed is the chief bottleneck, so you will find more resources dealing with that issue.
So here are a few solutions that we have come up with:
First, all solutions involve using the Postgres COPY command. Using COPY to import data into Postgres is by far the quickest method available. However, the JDBC driver by default does not currently support COPY across the network socket. So, if you want to use it you will need to do one of two workarounds:
A JDBC driver patched to support COPY, such as this one.
If the data you are inserting and the database are on the same physical machine, you can write the data out to a file on the filesystem and then use the COPY command to import the data in bulk.
Other options for increasing speed are using JNI to hit the Postgres API so you can talk over the Unix socket, removing indexes, and the pg_bulkload project. However, in the end, if you don't implement COPY you will always find the performance disappointing.
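One caveat to the "no COPY over the socket" point: newer versions of the official PostgreSQL JDBC driver do expose COPY through org.postgresql.copy.CopyManager. A minimal sketch, assuming such a driver version and made-up table and column names:

    import java.io.StringReader;
    import java.sql.Connection;
    import java.sql.DriverManager;

    import org.postgresql.PGConnection;
    import org.postgresql.copy.CopyManager;

    public class CopyInsert {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {

                CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();

                // Rows are streamed as CSV text straight into the COPY protocol.
                String rows = "1,2,3\n4,5,6\n";
                long inserted = copy.copyIn(
                        "COPY measurements (a_id, b_id, c_id) FROM STDIN WITH (FORMAT csv)",
                        new StringReader(rows));

                System.out.println("Inserted " + inserted + " rows");
            }
        }
    }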
Check if your connection is set to autoCommit. If autoCommit is true, then when you have 100 items in the batch and you call executeBatch, it will issue 100 individual commits. That can be a lot slower than calling executeBatch() followed by a single explicit commit().
I would avoid the temptation to drop indexes or foreign keys during the insert. It puts the table in an unusable state while your load is running, since nobody can query the table while the indexes are gone. Plus, it seems harmless enough, but what do you do when you try to re-enable the constraint and it fails because something you didn't expect to happen has happened? An RDBMS has integrity constraints for a reason, and disabling them even "for a little while" is dangerous.
You can obviously try to change the size of your batch to find the best size for your configuration, but I doubt that you will gain a factor of 3.
You could also try to tune your database structure. You might get better performance when using a single field as a primary key instead of a composite PK. Depending on the level of integrity you need, you might save quite some time by deactivating integrity checks on your DB.
You might also change the database you are using. MySQL is supposed to be pretty good for high-speed simple inserts... and I know there is a fork of MySQL around that tries to cut functionality to get very high performance on highly concurrent access.
Good luck!
Try disabling indexes and re-enabling them after the insert. Also, wrap the whole process in a transaction.
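A rough sketch of that approach against Postgres (where DDL is transactional), with invented table and index names; see the earlier answer for the risks of leaving constraints or indexes disabled.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class LoadWithoutIndexes {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/mydb", "user", "password")) {
                conn.setAutoCommit(false);

                try (Statement ddl = conn.createStatement()) {
                    // Hypothetical secondary index on the target table.
                    ddl.execute("DROP INDEX IF EXISTS measurements_b_idx");
                }

                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO measurements (a_id, b_id, c_id) VALUES (?, ?, ?)")) {
                    for (int i = 0; i < 100_000; i++) {
                        ps.setInt(1, i);
                        ps.setInt(2, i % 50);
                        ps.setInt(3, i % 7);
                        ps.addBatch();
                        if (i % 1_000 == 0) ps.executeBatch();
                    }
                    ps.executeBatch();
                }

                try (Statement ddl = conn.createStatement()) {
                    ddl.execute("CREATE INDEX measurements_b_idx ON measurements (b_id)");
                }

                conn.commit();   // the load and the index rebuild go in one transaction
            }
        }
    }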
