I understand the performance gain of bulk inserts over regular inserts.
My question is how best to run multiple bulk operations, where each operation inserts a large number of records (at least 15,000). Or is there a better approach?
I am not sure I fully understand the specific problem, but in general terms I would do it the following way:
I would define a setting for the maximum number of rows to insert at once
I would define a setting for the maximum time a queued row may wait before it is inserted
I would have a thread with a queue of record parameters to be inserted
whenever a new row needs to be inserted, I would add it to the queue
whenever a row is added to the queue, if the number of rows reaches the maximum, I would issue the bulk insert and clear the queue
when the queue is initialized or cleared, I would reset the timer to 0
when the elapsed time reaches the maximum time defined in the setting, I would execute the bulk insert and clear the queue
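The steps above could be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name BatchBuffer and the bulkInsert callback are hypothetical, and the callback stands in for whatever actually performs the bulk insert.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Collects rows and hands them to a bulk-insert callback by size or by elapsed time. */
class BatchBuffer<T> implements AutoCloseable {
    private final int maxRows;                  // setting 1: maximum rows per bulk insert
    private final long maxWaitNanos;            // setting 2: maximum time between flushes
    private final Consumer<List<T>> bulkInsert; // hypothetical hook for the real insert
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private List<T> queue = new ArrayList<>();
    private long timerStart = System.nanoTime(); // "timer set to 0" on initialization

    BatchBuffer(int maxRows, long maxWaitMillis, Consumer<List<T>> bulkInsert) {
        this.maxRows = maxRows;
        this.maxWaitNanos = TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        this.bulkInsert = bulkInsert;
        // Check the elapsed time a few times per interval.
        // Note: if the callback throws here, the executor stops further checks.
        long period = Math.max(1, maxWaitMillis / 4);
        timer.scheduleAtFixedRate(this::flushIfExpired, period, period, TimeUnit.MILLISECONDS);
    }

    /** Adds a row; flushes immediately once the row limit is reached. */
    synchronized void add(T row) {
        queue.add(row);
        if (queue.size() >= maxRows) {
            flush();
        }
    }

    private synchronized void flushIfExpired() {
        if (System.nanoTime() - timerStart >= maxWaitNanos) {
            flush();
        }
    }

    /** Resets the timer and, if any rows are queued, passes them to the callback. */
    synchronized void flush() {
        timerStart = System.nanoTime();
        if (queue.isEmpty()) {
            return;
        }
        List<T> batch = queue;
        queue = new ArrayList<>();
        bulkInsert.accept(batch); // runs under the lock, so add() waits during the insert
    }

    /** Stops the timer thread and flushes whatever is still queued. */
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
    }
}
```

A caller would create one buffer, for example new BatchBuffer<>(15000, 5000, rows -> insertAll(rows)) with insertAll being its own (hypothetical) bulk-insert routine, call add for each row, and call close on shutdown so the remaining rows are written and the timer thread stops.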
I have 4 files of 200 MB each. I created 4 threads that run in parallel; each thread processes one file and adds records to an ArrayBlockingQueue.
Another thread takes records from the ArrayBlockingQueue, processes them, and adds them to a batch. The batch size is 5000; each full batch is executed to insert the records into the database. Even so, processing all 4 files takes around 6 minutes.
How can I increase performance in this case?
1) Make sure you have enough memory for queue+processor buffers+db buffers.
2) A batch size of 5k is a bit more than needed; in general you get up to speed at around 100, not that it makes much difference here though.
3) You can push data into Oracle from multiple threads. If you fetch the sequence values for the ID fields ahead of time, you'll be able to insert into one table in parallel, provided it does not have many indexes. Otherwise consider disabling and rebuilding the indexes, or inserting into a temporary table and then moving everything into the main one.
4) Go over the Oracle settings with a DB admin. Things like extent size and its increment can change performance.
I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records, and I am opening 5 different threads that run select queries like this:
start = threadnum;
while (start*20000<=totalRecords){
select * from tableName order by colname limit 20000 offset start*20000.
start +=5;
}
The above loop assigns distinct chunks of 20K records to each thread.
For example, the first thread will read the first 20K records, then the 20K records starting at position 100,000, and so on.
But I am not getting any performance improvement. In fact, if a single thread takes x seconds to read all 20 million records, each of the 5 threads also takes almost x seconds to read its share from the database.
Shouldn't there be some improvement over x seconds (I was expecting x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure, but it could be due to a silly implementation in the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign each row an incrementing unique integer id with no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id < (start + 20000), which will definitely be executed on the server rather than on the client.
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1000 goes of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
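The id-range idea mentioned above could be sketched like this. It is only an illustration under stated assumptions: ids are assumed to start at 0 and to have no gaps, the class and method names are hypothetical, and the ranges are half-open so that adjacent chunks neither overlap nor skip a row.

```java
import java.util.ArrayList;
import java.util.List;

class IdRangeQueries {
    /**
     * Splits ids 0 .. totalRows-1 into half-open ranges {from, to},
     * each covering at most chunkSize ids (from inclusive, to exclusive).
     */
    static List<long[]> ranges(long totalRows, long chunkSize) {
        List<long[]> out = new ArrayList<>();
        for (long from = 0; from < totalRows; from += chunkSize) {
            out.add(new long[] {from, Math.min(from + chunkSize, totalRows)});
        }
        return out;
    }

    /** Builds the parameterized range query; bind from and to as the two parameters. */
    static String query(String table, String idColumn) {
        return "SELECT * FROM " + table
                + " WHERE " + idColumn + " >= ? AND " + idColumn + " < ?";
    }
}
```

Each range would then be bound to the two placeholders of a PreparedStatement created from query("tableName", "unique_id"), so the filtering is part of the SQL that the server executes.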
I will not repeat what Mike Nakis says, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000.
on the client side you handle the result of the query, and that handling is something you could speed up by using multiple threads.
But on the database side you have no control over how the query is processed, and Vertica is probably designed to execute your query using parallel tasks, as far as the machine allows.
So whether you split the execution of your query across one, two, or three parallel threads on the client side, it should not change much in the end, as a professional database is designed to optimize response time according to the number of requests it receives and the capabilities of the machine.
No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much of a memory and processing overhead.
PreparedStatement pstmt = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
pstmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = pstmt.executeQuery();
while(rs.next()) {
// Do the thing
}
Using OFFSET x LIMIT 20000 will result in the same query being executed again and again. For 20 million records and for 20K records per execution, the query will get executed 1000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 will itself take a lot of time, because the query is executed fully and the database then has to skip the first 19980000 records before returning the last 20000.
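But using the ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE will result in the query being executed only ONCE, and the records will be streamed and can be processed in a single thread. Note that Integer.MIN_VALUE is the value MySQL Connector/J uses to enable streaming; other JDBC drivers typically expect a positive fetch size instead.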
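To put a rough number on the OFFSET cost described above, the following sketch adds up how many rows are skipped across all pages. It assumes, as this answer does, that the database walks past every skipped row on each query; a real engine may behave differently, and the class name is hypothetical.

```java
class OffsetCost {
    /** Number of OFFSET/LIMIT queries needed to page through totalRows. */
    static long queryCount(long totalRows, long pageSize) {
        return (totalRows + pageSize - 1) / pageSize; // ceiling division
    }

    /** Total rows read and discarded across all pages. */
    static long rowsSkipped(long totalRows, long pageSize) {
        long skipped = 0;
        for (long offset = 0; offset < totalRows; offset += pageSize) {
            skipped += offset; // each query walks past 'offset' rows before returning a page
        }
        return skipped;
    }
}
```

For 20 million rows in pages of 20,000, queryCount gives 1000 and rowsSkipped gives 9,990,000,000, so under this assumption the paged approach touches about 500 times as many rows as the single streamed query.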
I run JDBC queries in a sequence: INSERT, DELETE, INSERT, DELETE, etc. I insert one million records in batches of 1000, then delete those million records in a single query, then insert them again. In this case I am interested only in insertion performance.
When I run it in a loop of, say, 10 iterations, the first iteration is the fastest (about 11 seconds), and each subsequent iteration's insert is a few seconds slower than the previous one. However, when I run it outside a loop, the insertion times are very similar.
Any idea why?
for(number of iterations){
//process insert of million records here, batch size is 1000
//prepared statement is used and clearBatch is called after every 1000
//inserts,
//at the end prepared statement is closed and connection.commit() is
//called
//Thread.sleep(1000) called here
//everything is inserted now in the DB so delete what has been inserted
//in single query. connection.commit() called again after delete.
//Thread.sleep(1000) and repeat the same actions until loop finishes.
}
Sorry, I don't have the code with me.
Any idea why the insertion gets slower with every iteration?
I can't be sure without the code, but I think you have a memory leak so that the extra time is due to garbage collection. That would explain why it is faster to run the same code several times without a loop. Run the program with GC logging enabled (-XX:+PrintGC) and see what happens.
To eliminate database issues you may want to test with another (new) table and replace the delete with truncate table, just in case.
I have a query returning ~200K hits from 7 different indices distributed across our cluster. I process my results as:
while (true) {
scrollResp = client.prepareSearchScroll(scrollResp.getScrollId()).setScroll(new TimeValue(600000)).execute().actionGet();
for (SearchHit hit : scrollResp.getHits()){
//process hit
}
//Break condition: No hits are returned
if (scrollResp.hits().hits().length == 0) {
break;
}
}
I'm noticing that the client.prepareSearchScroll line can hang for quite some time before returning the next set of search hits. This seems to get worse the longer I run the code for.
My setup for the search is:
SearchRequestBuilder searchBuilder = client.prepareSearch( index_names )
.setSearchType(SearchType.SCAN)
.setScroll(new TimeValue(60000)) //TimeValue?
.setQuery( qb )
.setFrom(0) //?
.setSize(5000); //number of jsons to get in each search, what should it be? I have no idea.
SearchResponse scrollResp = searchBuilder.execute().actionGet();
Is it expected that scanning and scrolling just takes a long time when examining many results? I'm very new to Elasticsearch, so keep in mind that I may be missing something very obvious.
My query:
QueryBuilder qb = QueryBuilders.boolQuery().must(QueryBuilders.termsQuery("tweet", interesting_words));
.setSize(5000) means that each client.prepareSearchScroll call is going to retrieve 5000 records per shard. You are requesting the source back, and if your records are big, assembling 5000 records in memory might take a while. I would suggest trying a smaller number. Try 100 and 10 to see if you get better performance.
.setFrom(0) is not necessary.
I'm going to add another answer here, because I was very puzzled by this behaviour and it took me a long time to find the answer in the comments by @AaronM.
This applies to ES 1.7.2, using the java API.
I was scrolling/scanning an index of 500m records, but with a query that returns about 400k rows.
I started off with a scroll size of 1,000 which seemed to me a reasonable trade-off in terms of network versus CPU.
This query ran terribly slowly, taking about 30 minutes to complete, with very long pauses between fetches from the cursor.
I worried that maybe it was just the query I was running and did not believe that decreasing the scroll size could help, as 1000 seemed tiny.
However, seeing AaronM's comment above, I tried a scroll size of 10.
The whole job completed in 30 seconds (and this was whether I had restarted ES or not, so presumably nothing to do with caching) - a speed-up of about 60x!!!
So if you're having performance problems with scroll/scan, I highly recommend trying a smaller scroll size. I couldn't find much about this on the internet, so I posted it here.
Query a data node, not a client node or master node
Select only the fields you need with the filter_path property
Set the scroll size according to your document size; there is no magic rule, so you must set a value, try it, and adjust
Monitor your network bandwidth
If that's not enough, try some multithreading:
Remember that an Elasticsearch index is composed of multiple shards. This design means you can parallelize operations.
Let's say your index has 3 shards and your cluster has 3 nodes (it is good practice to have more nodes than shards per index).
You could run 3 Java "workers", each in a separate thread, each of which scrolls a different shard on a different node, and use a queue to "centralize" the results.
This way, you will have a good performance!
This is what the elasticsearch-hadoop library does.
To retrieve shards/nodes details about an index, use the https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html API.
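The worker-and-queue arrangement described above might look roughly like this. It is a sketch only and makes no Elasticsearch calls: the scrollShard function is a hypothetical stand-in for code that scrolls a single shard and yields its hits, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.IntFunction;

class ShardFanIn {
    /**
     * Runs one worker thread per shard and gathers every hit through a shared queue.
     * scrollShard is a hypothetical callback that returns the hits of one shard.
     */
    static List<String> collect(int shardCount, IntFunction<Iterable<String>> scrollShard) {
        BlockingQueue<Optional<String>> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(shardCount);
        for (int shard = 0; shard < shardCount; shard++) {
            final int shardId = shard;
            pool.execute(() -> {
                try {
                    for (String hit : scrollShard.apply(shardId)) {
                        queue.add(Optional.of(hit));
                    }
                } finally {
                    queue.add(Optional.empty()); // end-of-shard marker, sent even on failure
                }
            });
        }
        pool.shutdown(); // accept no new tasks; running workers finish normally

        List<String> results = new ArrayList<>();
        int finishedShards = 0;
        try {
            // Drain the queue until every worker has sent its end-of-shard marker.
            while (finishedShards < shardCount) {
                Optional<String> item = queue.take();
                if (item.isPresent()) {
                    results.add(item.get());
                } else {
                    finishedShards++;
                }
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while collecting hits", e);
        }
        return results;
    }
}
```

The queue here is unbounded for simplicity; with large result sets a bounded queue would keep the workers from running far ahead of the consumer. Hits arrive in whatever order the workers produce them.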
You can read the documentation here: SearchScrollRequestBuilder
I think TimeValue is the time to keep the scroll alive:
setScroll(TimeValue keepAlive)
If set, will enable scrolling of the search request for the specified timeout.
You can read more here: Scrolling
I have a database into which I need to insert batches of data (around 500k records at a time). I was testing with Derby and was seeing insert times of about 10-15 minutes for this many records (I was doing a batch insert in Java).
Does this time seem slow (working on your average laptop)? And are there approaches to speeding it up?
thanks,
Jeff
This time seems perfectly reasonable, and is in agreement with times I have observed. If you want it to go faster, you need to use bulk insert options and disable safety features:
Use PreparedStatements and batches of 5,000 to 10,000 records unless it MUST be one transaction
Use bulk loading options in the DBMS
Disable integrity checks temporarily for insert
Disable indexes temporarily or delete indexes and re-create them post-insert
Disable transaction logging and re-enable afterward.
EDIT: Database transactions are limited by disk I/O, and on laptops and most hard drives, the important number is seek time for the disk.
Laptops tend to have rather slow disks, at 5400 rpm. At this speed, seek time is about 5 ms. If we assume one seek per record (an over-estimate in most cases), it would take a little over 40 minutes (500000 * 5 ms = 2500 s) to insert all rows. Now, the use of caching mechanisms and sequencing mechanisms reduces this somewhat, but you can see where the problem comes from.
I am (of course) vastly oversimplifying the problem, but you can see where I'm going with this; it's unreasonable to expect databases to perform at the same speed as sequential bulk I/O. You've got to apply some sort of indexing to your record, and that takes time.
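The back-of-envelope estimate above can be reproduced directly. This small sketch uses only the answer's own figures (one seek per record, 5 ms per seek); the class name is hypothetical, and the result is an upper-bound illustration rather than a measurement.

```java
class SeekEstimate {
    /** Minutes needed if every record costs exactly one disk seek. */
    static double minutes(long records, double seekMillis) {
        double totalSeconds = records * seekMillis / 1000.0;
        return totalSeconds / 60.0;
    }
}
```

With 500000 records and a 5 ms seek, this gives 2500 seconds, or roughly 41.7 minutes, which is the same order of magnitude as the 10-15 minutes observed once caching is taken into account.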