I have a database in which I need to insert batches of data (around 500k records at a time). I was testing with Derby and was seeing insert times of about 10-15 minutes for this many records (I was doing a batch insert in Java).
Does this time seem slow (working on your average laptop)? And are there approaches to speeding it up?
thanks,
Jeff
This time seems perfectly reasonable, and is in agreement with times I have observed. If you want it to go faster, you need to use bulk insert options and disable safety features:
Use PreparedStatements and batches of 5,000 to 10,000 records unless it MUST be one transaction
Use bulk loading options in the DBMS
Disable integrity checks temporarily for insert
Disable indexes temporarily or delete indexes and re-create them post-insert
Disable transaction logging and re-enable afterward.
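As a rough illustration of the PreparedStatement point above, here is a minimal sketch of batched inserts over plain JDBC. The table "records", its two columns and the class and method names are made up for illustration; the batch size is whatever the caller passes in. It commits after every batch, so it does not apply when the whole load must be one transaction.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInsertSketch {

    /** Number of executeBatch() calls needed for rowCount rows. */
    static int batchCount(int rowCount, int batchSize) {
        return (rowCount + batchSize - 1) / batchSize;
    }

    /**
     * Inserts all rows, sending one JDBC batch and one commit per batchSize rows.
     * The table "records" and its columns are hypothetical.
     */
    static void insertAll(Connection conn, List<String[]> rows, int batchSize) throws SQLException {
        String sql = "INSERT INTO records (id, payload) VALUES (?, ?)";
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();   // one round trip for the whole batch
                    conn.commit();
                    pending = 0;
                }
            }
            if (pending > 0) {           // flush the final partial batch
                ps.executeBatch();
                conn.commit();
            }
        } catch (SQLException e) {
            conn.rollback();             // discard the batch that failed
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

With 500,000 rows and a batch size of 5,000 this issues 100 batches instead of 500,000 single-row statements.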
EDIT: Database transactions are limited by disk I/O, and on laptops and most hard drives, the important number is seek time for the disk.
Laptops tend to have rather slow disks, at 5400 rpm. At this speed, seek time is about 5 ms. If we assume one seek per record (an over-estimate in most cases), it would take about 42 minutes (500,000 * 5 ms = 2,500 s) to insert all rows. Now, the use of caching mechanisms and sequencing mechanisms reduces this somewhat, but you can see where the problem comes from.
I am (of course) vastly oversimplifying the problem, but you can see where I'm going with this; it's unreasonable to expect databases to perform at the same speed as sequential bulk I/O. You've got to apply some sort of indexing to your record, and that takes time.
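The back-of-the-envelope estimate can be written out as a tiny calculation. It keeps the same deliberate over-estimate of one seek per record; the class and method names are only for illustration.

```java
final class SeekTimeEstimate {

    /** Worst-case insert time in minutes, assuming one disk seek per record. */
    static double minutes(long records, double seekMillis) {
        double totalMillis = records * seekMillis;
        return totalMillis / 1000.0 / 60.0;
    }

    public static void main(String[] args) {
        // 500,000 records at 5 ms per seek
        System.out.println(minutes(500_000, 5.0) + " minutes");
    }
}
```

Halving the seek time, or getting two records per seek through caching, halves the estimate, which is roughly where the observed 10-15 minutes comes from.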
Related
I need to read a large result set from MS SQL Server into a Java program. I need to read a consistent data state, so it's running under a single transaction. I don't want dirty reads.
I can split the read using offset and fetch next and having each set of rows processed by a separate thread.
However, when doing this, it seems that the overall performance is ~30k rows read / sec, which is pretty lame. I'd like to get ~1m / sec.
I've checked that I have no memory pressure using VisualVM. There are no GC pauses. Looking at machine utilisation, there seems to be no CPU limitation either.
I believe that the upstream source (MS SQL) is the limiting factor.
Any ideas on what I should look at?
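For reference, the offset / fetch next split described in this question might look roughly like the sketch below. The table and column names are placeholders, and the helper names are invented; it only builds the paging query and the per-worker offsets, it does not open connections or start threads.

```java
import java.util.ArrayList;
import java.util.List;

final class OffsetFetchPagingSketch {

    /** SQL Server (2012+) paging query; table and column names are placeholders. */
    static String pageQuery(String table, String orderColumn) {
        return "SELECT * FROM " + table
             + " ORDER BY " + orderColumn
             + " OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
    }

    /** Starting offset of each page; one entry per worker task. */
    static List<Long> pageOffsets(long totalRows, int pageSize) {
        List<Long> offsets = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(pageQuery("big_table", "id"));
        System.out.println(pageOffsets(25, 10));
    }
}
```

Each worker would bind its offset and the page size to the two parameters. The ORDER BY has to be on a unique, stable key, otherwise pages can overlap or skip rows.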
My current project includes an archiving function where data from an in-memory database is transferred to a relational database.
I stream over the results from the in-memory database, create Hibernate entities and persist the data to the database in batches of 5000. These entities have a couple of relations, so per entity I write to several tables.
As a reference you can assume that 1 million insert queries are executed in the entire archiving process.
This process was really slow in the beginning so I looked online and implemented some common suggestions for writing in batches with Hibernate:
I set hibernate.jdbc.batch_size to a good size and hibernate.order_inserts to true.
To prevent memory issues, every now and then I flush and clear the hibernate session.
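For readers who want to see the two settings spelled out, a minimal sketch follows. The property names are the ones mentioned above; the batch size of 50 in the usage note is an arbitrary example value, not the value used in this project, and the class name is invented.

```java
import java.util.Properties;

final class HibernateBatchSettingsSketch {

    /** Builds the two batching-related settings mentioned in the text. */
    static Properties batchSettings(int batchSize) {
        Properties settings = new Properties();
        // How many inserts Hibernate groups into one JDBC batch
        settings.setProperty("hibernate.jdbc.batch_size", Integer.toString(batchSize));
        // Group inserts by entity type so consecutive statements can share a batch
        settings.setProperty("hibernate.order_inserts", "true");
        return settings;
    }
}
```

Something like batchSettings(50) could then be merged into whatever configuration mechanism the application already uses.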
Here is a small example of the batching:
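RedisServiceImpl.Cursor<Contract> ctrCursor = contractAccessService.getCursor("*", taskId);
// Split the cursor into chunks of BATCH_SIZE and persist them one at a time
Iterators.partition(ctrCursor, BATCH_SIZE).forEachRemaining(chunk -> {
    portfolioChunkSaver.saveContractChunk(chunk, taskId);
    // Push pending inserts to the database and detach entities to bound memory use
    em.flush();
    em.clear();
});
ctrCursor.close();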
This process works but it is incredibly slow. Inserting the 1 million records in Oracle took about 2 hours to finish, which is roughly 140 inserts per second.
Currently this entire archiving function is wrapped in 1 transaction, which doesn't feel right at all. The big benefit is that you can be sure if the archive was successfully completed or not without having to provide some additional checking system for that. (Everything is either in the DB or it isn't)
As a speedup experiment I modified the code to create a database transaction per chunk of entities (5000) instead of wrapping everything in 1 big transaction.
That change had a huge impact: the process is now about 10-15x as fast as before.
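The experiment boils down to something like the sketch below. The Tx interface is a hypothetical stand-in for whatever transaction API is in use (for example an EntityTransaction), and saveChunk stands in for the chunk saver plus flush and clear; neither is the project's real code.

```java
import java.util.List;
import java.util.function.Consumer;

final class ChunkedTransactionSketch {

    /** Hypothetical minimal transaction API, standing in for the real one. */
    interface Tx {
        void begin();
        void commit();
        void rollback();
    }

    /** Saves items in chunks, one transaction per chunk; returns the number of commits. */
    static <T> int saveInChunks(List<T> items, int chunkSize, Tx tx, Consumer<List<T>> saveChunk) {
        int committed = 0;
        for (int from = 0; from < items.size(); from += chunkSize) {
            List<T> chunk = items.subList(from, Math.min(from + chunkSize, items.size()));
            tx.begin();
            try {
                saveChunk.accept(chunk);
                tx.commit();
                committed++;
            } catch (RuntimeException e) {
                tx.rollback();   // only the current chunk is undone
                throw e;
            }
        }
        return committed;
    }
}
```

The trade-off is the one already mentioned: a failure part-way leaves earlier chunks committed, so the all-or-nothing guarantee of the single big transaction is gone.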
When profiling I saw this behavior before the change:
Before:
Java - very low CPU
Oracle - very high CPU, low disk write activity
After:
Java - high CPU
Oracle - Low CPU, very high disk write activity
The second behavior makes a lot of sense: Java is sending over as many queries as possible and the database server is constrained by the disk write speed on my local system.
Here comes my question: why is the impact so huge? What is Oracle doing differently when I send over everything in a bigger transaction?
As a side-note: I never had this issue with MySQL, so Oracle (or the Oracle JDBC driver) must be doing something differently.
I can imagine that guaranteeing ACID compliance causes the overhead, but I wouldn't expect this huge speed difference.
You should make sure you have enough UNDO space (also known as UNDO segments) as a large transaction will consume a lot of it.
When a ROLLBACK statement is issued, undo records are used to undo
changes that were made to the database by the uncommitted transaction.
For data integrity, it's always preferable to commit only when you're done, and a properly tuned Oracle database can support large transactions without any performance issue.
I have a (very) complicated application which translates a GET-Request to a number of Hibernate queries to an Oracle DB.
It basically retrieves attributes of an object which are scattered in ~100 tables.
I have to stay under a maximum request time even for edge cases (= big result sets).
In the edge cases, the performance is extremely slow on the first call (i.e. after some time has passed).
After that, the query is much faster, even when I flush both the buffer cache and shared pool.
This applies to the SAME GET-Request, i.e. the same object requested. Request of another object, but same attributes again takes a long time on the first call.
For example, same query, same conditions, total of rows fetched is in the (low) thousands:
first call: 26,000ms
first call after flush of buffer cache/shared pool: 2800ms
second call after flush: 1200ms
From researching the web, I already discovered that flushing the pool does not necessarily really flush it, so I cannot rely on that.
As a caveat, I am a developer and have good working knowledge of Oracle, but am not a DBA and do not have access to a full DBA.
I suspect the following reasons for the slow first execution:
Oracle does hard parses which take a long time (the queries executed may contain multiple thousand parameters): I was unable to find out how long a "bad" hard parse could take. Enterprise Manager tells me it only did 1 hard parse on my queries for multiple executions though, so it seems unlikely.
the queries themselves take a long time, but get cached and the caches are not emptied by my actions (maybe disk caching?): Again, Enterprise Manager disagrees and shows very low query times overall.
I did suspect Hibernate/Java reasons at first (lots of objects to create after all), but it seems unlikely with the huge differences in performance
I am at a loss on how to proceed with performance tuning and am looking for helpful reading material and/or different ideas on why the first execution is so slow.
The first query frequently takes much more time than any subsequent ones in Oracle DB.
Relying solely on the Oracle cache does not seem to be good practice in such circumstances. That said, it may come in handy if you can warm it up by executing a dummy query first (perhaps right after the application launches). That may reduce the execution time of any subsequent identical call.
Although such a solution might help boost performance, the more reliable way would be to introduce a programmatic cache at the application level. It can be used for entities or any other non-persistent objects that are fetched repeatedly.
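A programmatic cache at the application level can be as small as the sketch below. It is a bare-bones illustration with invented names: no expiry, no size limit, and the loader function stands in for whatever actually runs the Hibernate queries.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class LoadingCacheSketch<K, V> {

    private final ConcurrentHashMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    /** loader is called once per key, the first time that key is requested. */
    LoadingCacheSketch(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, loading it on the first request. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Drops a key, e.g. after the underlying rows have changed. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

In a real application an existing caching library or Hibernate's second-level cache would probably be a better fit; the point is only that repeated requests for the same object never reach the database.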
Please note: if the scope of the problem is limited to the database, it would be a perfect candidate for a question at Database Administrators Stack Exchange.
I have a database table with 3 million records. A Java thread reads 10,000 records from the table and processes them. After processing, it jumps to the next 10,000 and so on. To speed things up, I have 25 threads doing the same task (reading + processing), and I have 4 physical servers running the same Java program. So effectively I have 100 threads doing the same work (reading + processing).
The strategy I have used is a SQL procedure which grabs the next 10,000 records and marks them as being processed by a particular thread. However, I have noticed that the threads seem to wait for some time when invoking the procedure and getting a response back. What other strategy can I use to speed up this process of data selection?
My database server is MySQL and the programming language is Java.
The idiomatic way of handling such a scenario is the producer-consumer design pattern, and the idiomatic way of implementing it in Java land is by using JMS.
Essentially you need one master server reading records and pushing them to JMS queue. Then you'll have arbitrary number of consumers reading from that queue and competing with each other. It is up to you how you want to implement this in detail: do you want to send a message with whole record or only ID? All 10000 records in one message or record per message?
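To show the shape of the pattern without a JMS broker, here is a single-JVM sketch that uses a BlockingQueue in place of the JMS queue. That substitution is mine, not part of the suggestion above: with real JMS the consumers would live on separate servers. It sends one record ID per message; all names are invented.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class ProducerConsumerSketch {

    private static final long POISON = -1L;   // tells a consumer to stop

    /** One producer pushes record IDs, consumerCount workers compete for them. */
    static long run(int recordCount, int consumerCount) {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(1_000);
        AtomicLong processed = new AtomicLong();
        ExecutorService consumers = Executors.newFixedThreadPool(consumerCount);
        for (int i = 0; i < consumerCount; i++) {
            consumers.execute(() -> {
                try {
                    while (true) {
                        long id = queue.take();
                        if (id == POISON) {
                            return;
                        }
                        processed.incrementAndGet();   // real processing would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        try {
            for (long id = 1; id <= recordCount; id++) {
                queue.put(id);                         // blocks when consumers fall behind
            }
            for (int i = 0; i < consumerCount; i++) {
                queue.put(POISON);                     // one stop signal per consumer
            }
            consumers.shutdown();
            if (!consumers.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("consumers did not finish in time");
            }
        } catch (InterruptedException e) {
            consumers.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while producing", e);
        }
        return processed.get();
    }
}
```

Only the single producer touches the table, so the workers never contend on the database for their next batch.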
Another approach is map-reduce; check out Hadoop. But the learning curve is a bit steeper.
Sounds like a job for Hadoop to me.
I would suspect that you are majorly database IO bound with this scheme. If you are trying to increase performance of your system, I would suggest partitioning your data across multiple database servers if you can do so. MySQL has some partitioning modes that I have no experience with. If you do partition yourself, it can add a lot of complexity to a database schema and you'd have to add some sort of routing layer using a hash mechanism to divide up your records across the multiple partitions somehow. But I suspect you'd get a significant speed increase and your threads would not be waiting nearly as much.
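The routing layer mentioned above can be very small at its core. The sketch below assumes records have a numeric ID and that the number of partitions is fixed; both are assumptions for illustration, and the names are invented.

```java
final class ShardRouterSketch {

    /** Maps a record ID to a partition index in the range 0..shardCount-1. */
    static int shardFor(long recordId, int shardCount) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(Long.hashCode(recordId), shardCount);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(10L, 4));
    }
}
```

A plain modulo like this reshuffles almost every record when the partition count changes, which is part of the added complexity the answer warns about.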
If you cannot partition your data, then I suspect moving your database to an SSD would be a huge win; anything to increase the IO rates on those partitions helps. Stay away from RAID5 because of its inherent performance issues. If you need a reliable file system, then mirroring or RAID10 would have much better performance, with RAID50 also being an option for a large partition.
Lastly, you might find that your application performs better with fewer threads if you are thrashing your database IO bus. This depends on a number of factors including concurrent queries, database layout, etc. You might try dialing down the per-client thread count to see if that makes a difference. The effect may be minimal, however.
I have an application that has to insert about 13 million rows of about 10 average length strings into an embedded HSQLDB. I've been tweaking things (batch size, single threaded/multithreaded, cached/non-cached tables, MVCC transactions, log_size/no logs, regular calls to checkpoint, ...) and it still takes 7 hours on a 16 core, 12 GB machine.
I chose HSQLDB because I figured I might have a substantial performance gain if I put all of those cores to good use but I'm seriously starting to doubt my decision.
Can anyone show me the silver bullet?
With CACHED tables, disk IO is taking most of the time. There is no need for multiple threads because you are inserting into the same table. One thing that noticeably improves performance is the reuse of a single parameterized PreparedStatement, setting the parameters for each row insert.
On your machine, you can improve IO significantly by using a large NIO limit for memory-mapped IO. For example SET FILES NIO SIZE 8192. A 64-bit JVM is required for larger sizes to have an effect.
http://hsqldb.org/doc/2.0/guide/management-chapt.html
To reduce IO for the duration of the bulk insert use SET FILES LOG FALSE and do not perform a checkpoint until the end of the insert. The details are discussed here:
http://hsqldb.org/doc/2.0/guide/deployment-chapt.html#dec_bulk_operations
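A sketch of wrapping a bulk insert in these statements is below. SET FILES LOG FALSE and the single checkpoint at the end come from the advice above; turning the log back on afterwards with SET FILES LOG TRUE and the exact ordering are my assumptions, so check them against the linked guide. The SqlExecutor and BulkWork interfaces are invented so the wrapper does not depend on a live database.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class HsqldbBulkLoadSketch {

    /** Runs one SQL statement. */
    interface SqlExecutor {
        void execute(String sql) throws SQLException;
    }

    /** The bulk insert itself, supplied by the caller. */
    interface BulkWork {
        void run() throws SQLException;
    }

    /** Adapts a JDBC connection to SqlExecutor. */
    static SqlExecutor forConnection(Connection conn) {
        return sql -> {
            try (Statement st = conn.createStatement()) {
                st.execute(sql);
            }
        };
    }

    /** Disables the log, runs the inserts, checkpoints once, then restores logging. */
    static void loadWithLoggingOff(SqlExecutor db, BulkWork work) throws SQLException {
        db.execute("SET FILES LOG FALSE");
        try {
            work.run();
            db.execute("CHECKPOINT");          // single checkpoint, only after the inserts
        } finally {
            db.execute("SET FILES LOG TRUE");  // assumption: restore normal logging
        }
    }
}
```

With the log off, a crash in the middle of the load can lose the inserted data, so this only suits loads that can be rerun from scratch.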
UPDATE: An insert test with 16 million rows (output below) resulted in a 1.9 GB .data file and took just a few minutes on an average 2-core processor and 7200 RPM disk. The key is a large NIO allocation.
connection time -- 47
complete setup time -- 78 ms
insert time for 16384000 rows -- 384610 ms -- 42598 tps
shutdown time -- 38109
Check what your application is doing. The first thing to do would be to look at resource utilization in Task Manager (or the OS-specific equivalent) and VisualVM.
Good candidates for causing bad performance:
disk IO
garbage collector
H2Database may give you slightly better performance than HSQLDB (while maintaining syntax compatibility).
In any case, you might want to try using a higher delay for syncing to disk to reduce random access disk I/O (i.e. SET WRITE_DELAY <num>).
Hopefully you're doing bulk INSERT statements, rather than a single insert per row. If not, do that if possible.
Depending on your application requirements, you might be better off with a key-value store than an RDBMS. (Do you regularly need to insert 1.3*10^7 entries?)
Your main limiting factor is going to be random access operations to disk. I highly doubt that anything you're doing will be CPU-bound. (Take a look at top, then compare it to iotop!)
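On the bulk INSERT suggestion above: one way to read it is a single statement carrying several rows. The sketch below only builds such a statement with placeholders; the table name is a placeholder, the helper is invented, and whether multi-row VALUES actually beats JDBC batching on your database is something to measure rather than assume.

```java
import java.util.Collections;

final class MultiRowInsertSketch {

    /** Builds "INSERT INTO table VALUES (?, ?), (?, ?), ..." for the given shape. */
    static String multiRowInsert(String table, int columns, int rows) {
        String tuple = "(" + String.join(", ", Collections.nCopies(columns, "?")) + ")";
        return "INSERT INTO " + table + " VALUES "
             + String.join(", ", Collections.nCopies(rows, tuple));
    }

    public static void main(String[] args) {
        System.out.println(multiRowInsert("t", 2, 3));
    }
}
```

The resulting string can be prepared once and reused, binding columns * rows parameters per execution.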
With so many records, maybe you could consider switching to a NoSQL DB. It depends on the nature/format of the data you need to store, of course.