in according to my question above, I need to insert over 20000 thousands rows in one table of my database. In according to the performance I'm searching for a way to increase the efficiency of this process. My first Idea was to realize this with Java.Thread but i not quite sure if this is save enough. Has someone any good advices for me?
Edit: I already use preparedStatement.addBatch()
Looks safe to me, provided that every thread uses a different Connection object, and then disposes of the resources properly.
Anyway, note that the DB itself has a limit on concurrently running requests (max_connections in MySQL), so it doesn't help creating more threads than this number. Also, consider other optimizations such as batch inserts.
There is nothing wrong with accessing a database from different threads. As Eyal wrote, just make sure things like Connections only get used with a single thread and properly disposed.
The other question is, if this will actually help your performance. I'd make sure that you did everything else before resorting to multiple threads. Especially using batch statements seems to be the first option to look into if you haven't already.
If I understand you correctly, you need a way to write thousands of query's to your Database.
Do you know that you can execute a batch of MySQL queries?
There is a question about this already on Stackoverflow,
Java: Insert multiple rows into MySQL with PreparedStatement
you can use: PreparedStatement#addBatch() http://docs.oracle.com/javase/6/docs/api/java/sql/PreparedStatement.html#addBatch%28%29
and
PreparedStatement#executeBatch()
http://docs.oracle.com/javase/6/docs/api/java/sql/Statement.html#executeBatch%28%29
Check out #BalusC 's answer in the link above for more detail.
Related
I recently got into an interview and I was asked a question
We have a table employee(id, name). And in our java code, we are writing a logic to fetch data from this table and display it in UI. The query is
Select id,name from employee
Query was that during debugging, we found that this jdbc call to fire the query and get the output is taking say 20 secs and we want to reduce this to say 5 seconds or to the optimal time. How can we you do that, or how will I tackle this problem?
As there is no where clause in the query, I didn't suggest to index the column.
As this logic is taking 20 secs every time, so, some other code getting a lock on this table is also out of question.
I suggested that limiting the number of records fetched from the table should help but the interviewer didn't look convinced
Is there anything else we can do as a developer to optimize the call. I guess DBA might tune database setting to improve the performance of this query, but is there any other way
OK, so this is an interview question, so both the problem and the solutions are hypothetical. The interviewer is asking for possible optimizations and / or approaches. Here are some that are most likely to help:
Modify the query to page the data rather than fetching the whole lot. This looks applicable for the example query. Note that this is not just "limiting the number of rows selected from the table" ... which is probably why the interviewer looked doubtful when you said that!
If you do need to display the entire selected record set but in a reduced form (e.g. summed, averaged, sorted, collated etc), do the reduction in the query rather than by fetching the records and doing it in the client.
Tune the fetchSize() as suggested by Ivan.
Here are some other ideas that are less likely to help and / or will require extensive reworking.
Look at the network configs. For example you may be able to get better throughput by OS-level tuning TCP buffer, or optimizing physical or virtual network paths.
Run the query on the database server itself (to eliminate network overheads)
Use an in-memory table
Query a secondary database server; e.g. a readonly snapshot or a slave
You can try to increase fetchSize() for Statement/PreparedStatement to decrease number of network roundtrips between application server/desktop and database server.
You can start several threads that will query some piece of data and then merge all data from several threads.
EDIT: doesn't apply to this situation because id and name are the only columns on this table, but still useful for other readers to note.
If you create an index covering both id and name, then the database can use that index to read the data faster since it wont even have to even read the table.
See this link for a more thorough explanation.
if the index contains all the columns you’re requesting it doesn’t even need to look in the table. That concept is known as index coverage.
I am using jdbc mysql. Let's assume there is a table in my db called Test. And there is a 700k rows. But fetching all rows are taking huge time. I am using preparedStatement. But I want to use multi threading in such a way that think there is 10 threads. for. eg 1st thread will fetch 70k rows then 2nd will fetch next 70k and so on. How to implement this?
Forgive me if this is too obvious and you tried it or it won't work in your situation, but caching might be very helpful here.
Regarding actually doing it with multi-threading, It might make sense to have some procedure you run (might need a new column in your table to do this) that would assign ids that you can query - something like " WHERE id BETWEEN value1 AND value2". Each Thread would query a different range. This would be faster than using order by, since this way avoids the need for the database to sort.
If you do want to go the order by route though, consider indexing your database so that that ordering doesn't take extra time.
I'm currently working on a Java project which i need to prepare a big(to me) mysql database. I have to do web scraping using Jsoup and store the results into my database as well. As i estimated, i will have roughly 1,500,000 to 2,000,000 records to be inserted. In my first trial, i just use a loop to insert these records and it takes me one week to insert about 1/3 of my required records, which is too slow i think. Is it possible to make this process multi-threaded, so that i can have my records split into 3 sets, say 500,000 records per set, and then insert them into one database( one table specifically)?
Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch-inserts instead, they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: As #Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will go at the speed of the slowest operation. You could have multiple threads fetching web pages, which add the data to a queue data structure, and then have a single thread draining the queue into the database using a batch insert.
Just make sure two (or more) threads doesn't use the same connection at the same time, using a connection pool resolves that. c3po and apache dbcp comes to mind ...
You can insert these records in different threads provided they do use different primary key values.
You should also look at Spring Batch which I believe will be useful in your case.
You can chunk your record set into batches and do this, but perhaps you should think about other factors as well.
Are you doing a network round trip for each INSERT? If yes, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If yes, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.
I think multi thread approch usefull for your issue but you have to using a connection pool such as C3P0 or Tomca 7 Connetcion pool for more performance.
Another solution is using a batch-operation provider such as Spring-batch, exist anothers utility for batch operation also.
Another solution is using a PL/SQl Procedure with a input structure parameter.
I need one help from you guys regarding JDBC performance optimization. One of our pojo is using jdbc to connect to a oracle database and retrieve the records. Basically the records are email addresses basing upon which emails will be sent to the users. The problem here is the performance. This process happens every weekend and the records are very huge in number, around 100k.
The performance is very slow and it worries us a lot. Only 1000 records seem to be fetched from the database every 1 hour, which means that it will take 100 hours for this process to complete (which is very bad). Please help me on this.
The database server and the java process are in two different remote servers. We have used rs_email.setFetchSize(1000); hoping that it would make any difference but no change at all.
The same query executed on server takes 0.35 seconds to complete. Any quick suggestion would of great help to us.
Thanks,
Aamer.
First look at your queries. Analyze them. See if the SQL could be made more efficient (ie, ask the database for what you want, not for what you don't want -- makes a big difference). Also check to see if there are indexes on any fields in your where and join clauses. Indexes make a big difference. But it can't be just any indexes. They have to be good indexes (ie, that the fields that make up the index provide enough uniqueness for the database to retrieve things appropriately). Work with your DBA on this. Look for either high run time against the db or check for queries with high CPU usage (even if the queries run sub-second). These are the thing that can kill your database.
Also from a code perspective, check to see if you are opening and closing your connections or if you are re-using them. Can make a big difference too.
It would help to post your code, queries, table layouts, and any indexes you have.
Use log4jdbc to get the real sql for fetching single record. Then check speed and plan for that sql. You may need a proper index or even db defragmentation.
Not sure about the Oracle driver, but I do know that the MySQL driver supports two different results retrieval methods: "stream" and "wait until you've got it all".
The streaming method lets you start process the results the moment you've got the first row returned from the query, whereas the other method retrieves the entire resultset before you can start work on it. In cases where you deal with huge recordsets, this often leads to memory exceptions, or slow performance because java hit the "memory roof" and the garbage collector can't throw away "used" records like it can in the streaming mode.
The streaming mode doesn't let you navigate/scroll the resultset the way the "normal"/"wait until you've got it all" mode...
Anyway, not sure if this is of any help but it might be worth checking out.
My answer to your question, in summary is:
1. Check network
2. Check SQL
3. Check Java code.
It sounds very slow. First thing to check would be to see if you have a slow network. You can do this pretty quickly by just pinging the database server. Or run the database server on the same machine as your JVMM. If it is not the network, get an explain plan for your SQL and ensure you are not doing table scans when you don't need to be. If it is not the network or the SQL, then it's time to check your Java code. Are you doing anything like blocking when you shouldn't be?
I've been looking around trying to determine some Hibernate behavior that I'm unsure about. In a scenario where Hibernate batching is properly set up, will it only ever use multiple insert statements when a batch is sent? Is it not possible to use a DB independent multi-insert statement?
I guess I'm trying to determine if I actually have the batching set up correctly. I see the multiple insert statements but then I also see the line "Executing batch size: 25."
There's a lot of code I could post but I'm trying to keep this general. So, my questions are:
1) What can you read in the logs to be certain that batching is being used?
2) Is it possible to make Hibernate use a multi-row insert versus multiple insert statements?
Hibernate uses multiple insert statements (one per entity to insert), but sends them to the database in batch mode (using Statement.addBatch() and Statement.executeBatch()). This is the reason you're seeing multiple insert statements in the log, but also "Executing batch size: 25".
The use of batched statements greatly reduces the number of roundtrips to the database, and I would be surprised if it were less efficient than executing a single statement with multiple inserts. Moreover, it also allows mixing updates and inserts, for example, in a single database call.
I'm pretty sure it's not possible to make Hibernate use multi-row inserts, but I'm also pretty sure it would be useless.
I know that this is an old question but i had the same problem that i thought that hibernate batching means that hibernate would combine multiple inserts into one statement which it doesn't seem to do.
After some testing i found this answer that a batch of multiple inserts is just as good as a multi-row insert. I did a test inserting 1000 rows one time using hibernate batch and one time without. Both tests took about 20s so there was no performace gain in using hibernate batch.
To be sure i tried using the rewriteBatchedStatements option from the MySQL Connector/J which actually combines multiple inserts into one statement. It reduced the time to insert 1000 records down to 3s.
So after all hibernate batch seems to be useless and a real multi-row insert to be much better. Am i doing something wrong or what causes my test results?
The Oracle bulk insert collect an array of entyty and pass in a single block to the db associating to it a unic ciclic insert/update/delete.
Is unic way to speed network throughput .
Oracle suggest to do it calling a stored procedure from hibernate passing it an array of datas.
http://biemond.blogspot.it/2012/03/oracle-bulk-insert-or-select-from-java.html?m=1
Is not only a software problem but infrastructural!
Problem is network data flow optimization and TCP stack fragmentation.
Mysql have function.
You have to do something like what is described in this article.
Normal transfer on network the correct volume of data is the solution
You have also to verify network mtu and Oracle sdu/tdu utilization respect data transferred between application and database