I thought I would be clever and am using an ExecutorCompletionService to parallelize tasks that insert a bunch of records into a Postgres database. Mostly it works great and I can see an increase in performance. However, now and then it fails with a primary key exception, most likely because concurrent threads try to create records at the same time. Is there a graceful way to handle this situation?
You need a coordinated, thread-safe way to generate primary keys. If your primary key is numeric, the best option is to use a database sequence: it is safe and efficient.
Turns out my original problem was that the sequence had fallen out of step with the record count of the table. In fact Postgres can create new unique IDs concurrently as far as I can tell. No additional coding needs to be done.
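For anyone who hits the same thing, here is a rough sketch of how the sequence could be brought back in line with the table. It assumes the key column is a serial/identity column that owns its sequence (so pg_get_serial_sequence can find it), and the table and column names passed in are hypothetical, trusted constants rather than user input.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class SequenceResyncSketch {

    // Builds a statement that makes the next nextval() return MAX(column) + 1,
    // or 1 for an empty table. The names are concatenated into the SQL, so only
    // pass trusted constants here.
    static String resyncSql(String table, String column) {
        return "SELECT setval(pg_get_serial_sequence('" + table + "', '" + column + "'), "
             + "COALESCE((SELECT MAX(" + column + ") FROM " + table + "), 0) + 1, false)";
    }

    // Runs the statement and returns the value the sequence was set to.
    static long resync(Connection conn, String table, String column) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(resyncSql(table, column))) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

Run it once while no inserts are in flight; after that the concurrent inserts can keep relying on the sequence default.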
I am working on a Java service (Hibernate) in which I run a count query and then, right after it, a query that fetches the corresponding records (both native queries). There are cases where the count differs from the number of records actually returned by the query that retrieves the data.
I would like to ensure that both queries see the same dataset.
Any ideas on this?
I don't think relying on counts is a good idea.
Think about what the primary key on a record stands for, or which other fields identify the records you need.
The dataset retrieved on the client reflects what was in the DB at the time you ran your query.
It is possible to lock a table or records until your transaction commits, but that is dangerous and I do not recommend it if the DB is used by multiple services, clients, or threads in parallel. I guess you have such a system, since the counts change while your queries run.
Locks need very careful handling, and they can easily slow down or hang other threads.
I am accessing a GridGain cache for a large number of keys. I have two options for getting the values:
Access the GridGain cache and get the value for each key in an IgniteClosure, then return the result.
Execute an org.apache.ignite.cache.query.SqlQuery on the cache and then get the result.
Below are my questions:
What is the recommended/optimal way in this scenario?
Why might one be slower than the other (for example, query parsing might be an extra overhead)?
Have you considered doing a getAll(Set<K> keys) operation? Sounds like it suits your use case perfectly.
If you have even more data, consider collocated processing with local ScanQuery or map/reduce ExecuteTask/ExecuteJob.
If primary keys are known in advance, then use key-value APIs such as cache.get or cache.getAll. If those records are then going to be used as part of a calculation, try to turn the calculation into a compute task and execute it on the nodes that store the primary copies of the keys. You can use the compute.affinityRun methods for that.
SQL is favorable if the primary keys are not known in advance or if you need to filter data with the WHERE clause or do joins between tables.
I am using JDBC with MySQL. Let's assume there is a table in my DB called Test with 700k rows. Fetching all the rows takes a huge amount of time, even though I am using a PreparedStatement. I want to use multithreading with, say, 10 threads: the 1st thread fetches the first 70k rows, the 2nd fetches the next 70k, and so on. How do I implement this?
Forgive me if this is too obvious and you tried it or it won't work in your situation, but caching might be very helpful here.
As for actually doing it with multithreading, it might make sense to have some procedure (it might need a new column in your table) that assigns ids you can query on, something like "WHERE id BETWEEN value1 AND value2". Each thread would query a different range. This would be faster than paging with ORDER BY, since it avoids the need for the database to sort.
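A minimal sketch of the range idea, in case it helps. It assumes a numeric id column on Test with roughly contiguous values (big gaps would make the ranges uneven), and a DataSource that hands each thread its own connection; the class and method names are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

final class IdRangeSketch {

    // Splits [minId, maxId] into at most `parts` contiguous {from, to} ranges.
    static List<long[]> split(long minId, long maxId, int parts) {
        if (parts <= 0 || maxId < minId) {
            throw new IllegalArgumentException("need parts > 0 and maxId >= minId");
        }
        long total = maxId - minId + 1;
        long base = total / parts;
        long extra = total % parts; // the first `extra` ranges get one more id
        List<long[]> ranges = new ArrayList<>();
        long start = minId;
        for (int i = 0; i < parts && start <= maxId; i++) {
            long size = base + (i < extra ? 1 : 0);
            long end = start + size - 1;
            ranges.add(new long[] {start, end});
            start = end + 1;
        }
        return ranges;
    }

    // Fetches one range on its own connection and returns how many rows it saw.
    static long fetchRange(DataSource ds, long from, long to) throws SQLException {
        String sql = "SELECT * FROM Test WHERE id BETWEEN ? AND ?";
        try (Connection c = ds.getConnection();
             PreparedStatement ps = c.prepareStatement(sql)) {
            ps.setLong(1, from);
            ps.setLong(2, to);
            try (ResultSet rs = ps.executeQuery()) {
                long n = 0;
                while (rs.next()) {
                    n++; // process the row here
                }
                return n;
            }
        }
    }

    // Runs one fetch task per range and sums the row counts.
    static long fetchAll(DataSource ds, long minId, long maxId, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (long[] r : split(minId, maxId, threads)) {
                futures.add(pool.submit(() -> fetchRange(ds, r[0], r[1])));
            }
            long rows = 0;
            for (Future<Long> f : futures) {
                rows += f.get();
            }
            return rows;
        } finally {
            pool.shutdown();
        }
    }
}
```

With ids 1 to 700000 and 10 threads, split gives ranges of 70k ids each, which matches the layout in the question. Whether this is actually faster depends on where the time goes; if the network or the client-side processing is the bottleneck, more threads may not help much.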
If you do want to go the order by route though, consider indexing your database so that that ordering doesn't take extra time.
This is a use case in member enrollment via a web application/web service. We have a complex algorithm for checking whether a member is a duplicate, which looks at multiple tables such as phone, address, etc. The algorithm varies based on the member's country, so this restriction cannot be implemented using a primary key/unique key constraint.
So we have the checks in Java code. But if there are 2 duplicate concurrent requests, the 2 Java threads see that the member doesn't exist and they both insert the record resulting in duplicates. How can I prevent such duplicate inserts?
I can prevent conflicting updates by using row-level locks or Hibernate's optimistic concurrency. I can think of table-level locks to prevent such inserts, but that limits application performance because it also blocks updates. Another option, I think, would be to create a lock table containing a record with id='memberInsert', and force all inserts via JDBC to obtain a row-level lock on this record first.
If it's going to be anywhere, I'd expect it to be in a write trigger, not in the Java code. Some other application or some other area of the application could do something badly.
Offloading this on the database gives you two advantages. 1) It prevents the race condition you mention up there and 2) It protects the integrity of the data by not allowing some errant application to modify records putting them in an illegal state.
Can't you hash the outcome of the algorithm or something and simply use that as a unique primary key?
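To make the hashing suggestion concrete, here is a rough sketch. It only works if your duplicate rule can be boiled down to a canonical string per member, which may not be true for every country's algorithm. The fields used here (country, phone, address) and the normalization rules are placeholders, not your actual algorithm.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

final class DedupKeySketch {

    // Collapses the fields that define "same member" into one canonical string
    // and hashes it. Store the result in a column with a unique constraint so
    // the database rejects the second of two concurrent duplicate inserts.
    static String dedupKey(String country, String phone, String address) {
        String canonical = String.join("|",
                normalize(country),
                phone.replaceAll("\\D", ""), // keep digits only
                normalize(address));
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    // Trims, collapses runs of whitespace, and lower-cases.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }
}
```

The losing thread then gets a unique-constraint violation on insert, which you can catch and report as "member already exists".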
As long as the database is not aware of your requirements, it will not help you. And then you probably have no other choice than table level locking.
I am writing a program that does a lot of writes to a Postgres database. In a typical scenario I would be writing say 100,000 rows to a table that's well normalized (three foreign integer keys, the combination of which is the primary key and the index of the table). I am using PreparedStatements and executeBatch(), yet I can only manage to push in say 100k rows in about 70 seconds on my laptop, when the embedded database we're replacing (which has the same foreign key constraints and indices) does it in 10.
I am new to JDBC and I don't expect it to beat a custom embedded DB, but I was hoping for it to be only 2-3x slower, not 7x. Is there anything obvious that I may be missing? Does the order of the writes matter (i.e. if it's not the order of the index)? What should I look at to squeeze out a bit more speed?
This is an issue that I have had to deal with often on my current project. For our application, insert speed is a critical bottleneck. However, we have found that for the vast majority of database users, select speed is the chief bottleneck, so you will find that there are more resources dealing with that issue.
So here are a few solutions that we have come up with:
First, all solutions involve using the Postgres COPY command. Using COPY to import data into Postgres is by far the quickest method available. However, the JDBC driver by default does not currently support COPY across the network socket. So if you want to use it, you will need one of two workarounds:
A JDBC driver patched to support COPY, such as this one.
If the data you are inserting and the database are on the same physical machine, you can write the data out to a file on the filesystem and then use the COPY command to import the data in bulk.
Other options for increasing speed are using JNI to hit the Postgres API so you can talk over the Unix socket, removing indexes, and the pg_bulkload project. However, in the end, if you don't implement COPY you will always find performance disappointing.
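For the same-machine file workaround, something along these lines could serve as a starting point. The table name link and its three integer columns are invented for the example, and the sketch assumes the database server process can read the file and that the connecting role is allowed to run a server-side COPY FROM a file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

final class CopyFileSketch {

    // Writes one "a,b,c" line per row; plain integers need no CSV quoting.
    static Path writeCsv(Path file, List<int[]> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int[] r : rows) {
            lines.add(r[0] + "," + r[1] + "," + r[2]);
        }
        return Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Hypothetical table and columns. The path is concatenated into the SQL,
    // so it must not contain single quotes.
    static String copySql(Path file) {
        return "COPY link (a_id, b_id, c_id) FROM '" + file.toAbsolutePath()
             + "' WITH (FORMAT csv)";
    }

    // Asks the server to bulk-load the file in a single statement.
    static void load(Connection conn, Path file) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute(copySql(file));
        }
    }
}
```

The file is read by the server, not by the JDBC client, which is why this only works when both sit on the same machine (or share a filesystem).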
Check whether your connection is set to autoCommit. If autoCommit is true and you have 100 items in the batch when you call executeBatch(), it will issue 100 individual commits. That can be a lot slower than calling executeBatch() followed by a single explicit commit().
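Roughly what I mean, as a sketch rather than a drop-in: the table name link, its columns, and the batch size are all placeholders for whatever you actually have.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {

    // Hypothetical table with three integer foreign keys, as in the question.
    static final String INSERT_SQL =
            "INSERT INTO link (a_id, b_id, c_id) VALUES (?, ?, ?)";

    // Sends the rows in batches of batchSize and commits once at the end.
    static int insertAll(Connection conn, List<int[]> rows, int batchSize) throws SQLException {
        boolean previous = conn.getAutoCommit();
        conn.setAutoCommit(false); // otherwise every statement commits on its own
        int sent = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            for (int[] row : rows) {
                ps.setInt(1, row[0]);
                ps.setInt(2, row[1]);
                ps.setInt(3, row[2]);
                ps.addBatch();
                if (++sent % batchSize == 0) {
                    ps.executeBatch(); // flush a full batch
                }
            }
            if (sent % batchSize != 0) {
                ps.executeBatch(); // flush the remainder
            }
            conn.commit(); // one commit for the whole load
        } catch (SQLException e) {
            conn.rollback(); // nothing is kept if any batch fails
            throw e;
        } finally {
            conn.setAutoCommit(previous);
        }
        return sent;
    }
}
```

A single transaction also means a failure halfway through leaves the table untouched, which is usually what you want for a bulk load.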
I would avoid the temptation to drop indexes or foreign keys during the insert. It leaves the table effectively unusable while your load is running, since queries that rely on those indexes slow to a crawl while the indexes are gone. Plus, it seems harmless enough, but what do you do when you try to re-enable the constraint and it fails because something you didn't expect has happened? An RDBMS has integrity constraints for a reason, and disabling them even "for a little while" is dangerous.
You can obviously try changing the size of your batch to find the best size for your configuration, but I doubt that you will gain a factor of 3.
You could also try to tune your database structure. You might get better performance with a single field as the primary key than with a composite PK. Depending on the level of integrity you need, you might save quite some time by deactivating integrity checks on your DB.
You might also change the database you are using. MySQL is supposed to be pretty good for high-speed simple inserts, and I know there is a fork of MySQL around that tries to cut functionality to get very high performance under highly concurrent access.
Good luck!
Try disabling indexes and re-enabling them after the insert. Also, wrap the whole process in a transaction.