I have a query returning ~200K hits from 7 different indices distributed across our cluster. I process my results as:
while (true) {
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(600000))
            .execute().actionGet();
    for (SearchHit hit : scrollResp.getHits()) {
        // process hit
    }
    // Break condition: no hits are returned
    if (scrollResp.getHits().getHits().length == 0) {
        break;
    }
}
I'm noticing that the client.prepareSearchScroll line can hang for quite some time before returning the next set of search hits. This seems to get worse the longer I run the code for.
My setup for the search is:
SearchRequestBuilder searchBuilder = client.prepareSearch(index_names)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000)) // TimeValue?
        .setQuery(qb)
        .setFrom(0) // ?
        .setSize(5000); // number of JSON docs to get in each search; what should it be? I have no idea.
SearchResponse scrollResp = searchBuilder.execute().actionGet();
Is it expected that scanning and scrolling just takes a long time when examining many results? I'm very new to Elasticsearch, so keep in mind that I may be missing something very obvious.
My query:
QueryBuilder qb = QueryBuilders.boolQuery().must(QueryBuilders.termsQuery("tweet", interesting_words));
.setSize(5000) means that each client.prepareSearchScroll call is going to retrieve 5000 records per shard. You are requesting the source back, and if your records are big, assembling 5000 records in memory might take a while. I would suggest trying a smaller number. Try 100 and 10 to see if you get better performance.
.setFrom(0) is not necessary.
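For example, here is roughly what the same setup looks like with a much smaller page size (a sketch only, reusing the asker's client, index_names and qb; 100 is just an illustrative starting point, and the scroll loop stays the same as in the question):

    // Sketch: same scan + scroll setup as the question, just with a much smaller page size.
    // index_names and qb are the asker's variables; 100 is only an illustrative value to benchmark.
    SearchResponse scrollResp = client.prepareSearch(index_names)
            .setSearchType(SearchType.SCAN)
            .setScroll(new TimeValue(60000))
            .setQuery(qb)
            .setSize(100)   // hits returned per shard per scroll round-trip; also try 10
            .execute().actionGet();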
I'm going to add another answer here, because I was very puzzled by this behaviour and it took me a long time to find the answer in the comments by @AaronM.
This applies to ES 1.7.2, using the java API.
I was scrolling/scanning an index of 500m records, but with a query that returns about 400k rows.
I started off with a scroll size of 1,000 which seemed to me a reasonable trade-off in terms of network versus CPU.
This query ran terribly slowly, taking about 30 minutes to complete, with very long pauses between fetches from the cursor.
I worried that maybe it was just the query I was running and did not believe that decreasing the scroll size could help, as 1000 seemed tiny.
However, seeing AaronM's comment above, I tried a scroll size of 10.
The whole job completed in 30 seconds (and this was whether I had restarted ES or not, so presumably nothing to do with caching) - a speed-up of about 60x!!!
So if you're having performance problems with scroll/scan, I highly recommend trying a smaller scroll size. I couldn't find much about this on the internet, so I posted it here.
Query a data node, not a client node or a master node.
Select the fields you need with the filter_path property.
Set the scroll size according to your document size; there is no magic rule, you have to pick a value, test, and iterate.
Monitor your network bandwidth.
If that's not enough, let's go for some multi-threading:
Remember that an Elasticsearch index is composed of multiple shards. This design means you can parallelize operations.
Let's say your index has 3 shards, and your cluster 3 nodes (it's good practice to have more nodes than shards per index).
You could run 3 Java "workers", each in a separate thread, each search-scrolling a different shard on a different node, and use a queue to "centralize" the results, as sketched below.
This way, you will get good performance!
This is what the elasticsearch-hadoop library does.
To retrieve shard/node details about an index, use the search shards API: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html
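A rough sketch of that worker/queue idea with the 1.x Java API (the fixed shard count of 3, the "_shards:N" preference and the queue wiring are illustrative assumptions; client, index_names and qb are reused from the question above):

    // Sketch: one scan/scroll per shard, each in its own thread, results pushed to a shared queue.
    BlockingQueue<SearchHit> results = new LinkedBlockingQueue<SearchHit>();
    ExecutorService pool = Executors.newFixedThreadPool(3);

    for (int shard = 0; shard < 3; shard++) {
        final int shardId = shard;
        pool.submit(() -> {
            SearchResponse resp = client.prepareSearch(index_names)
                    .setSearchType(SearchType.SCAN)
                    .setScroll(new TimeValue(60000))
                    .setPreference("_shards:" + shardId)   // restrict this scroll to a single shard
                    .setQuery(qb)
                    .setSize(100)
                    .execute().actionGet();
            while (true) {
                resp = client.prepareSearchScroll(resp.getScrollId())
                        .setScroll(new TimeValue(60000))
                        .execute().actionGet();
                if (resp.getHits().getHits().length == 0) {
                    break;   // this shard is exhausted
                }
                for (SearchHit hit : resp.getHits()) {
                    results.add(hit);   // "centralize" the results
                }
            }
        });
    }
    pool.shutdown();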
You can read the documentation for SearchScrollRequestBuilder.
I think the TimeValue is the time to keep the scroll alive:
setScroll(TimeValue keepAlive)
"If set, will enable scrolling of the search request for the specified timeout."
You can read more in the Scrolling documentation.
I am reading data from a Vertica database using multiple threads in Java.
I have around 20 million records and I am opening 5 different threads that run select queries like this:
start = threadnum;
while (start * 20000 <= totalRecords) {
    // select * from tableName order by colname limit 20000 offset start*20000
    start += 5;
}
The above query assigns 20K distinct records to each thread to read from the db.
For example, the first thread will read the first 20K records, then the 20K records starting at position 100,000, etc.
But I am not getting any performance improvement. In fact, if a single thread takes x seconds to read 20 million records, then each thread is also taking almost x seconds to read from the database.
Shouldn't there be some improvement over x seconds (I was expecting around x/5 seconds)?
Can anybody pinpoint what is going wrong?
Your database presumably lies on a single disk; that disk is connected to a motherboard using a single data cable; if the database server is on a network, then it is connected to that network using a single network cable; so, there is just one path that all that data has to pass through before it can arrive at your different threads and be processed.
The result is, of course, bad performance.
The lesson to take home is this:
Massive I/O from the same device can never be improved by multithreading.
To put it in different terms: parallelism never increases performance when the bottleneck is the transferring of the data, and all the data come from a single sequential source.
If you had 5 different databases stored on 5 different disks, that would work better.
If transferring the data was only taking half of the total time, and the other half of the time was consumed in doing computations with the data, then you would be able to halve the total time by desynchronizing the transferring from the processing, but that would require only 2 threads. (And halving the total time would be the best that you could achieve: more threads would not increase performance.)
As for why reading 20 thousand records appears to perform almost as badly as reading 20 million records, I am not sure why this is happening, but it could be due to a silly implementation of the database system that you are using.
What may be happening is that your database system is implementing the offset and limit clauses on the database driver, meaning that it implements them on the client instead of on the server. If this is in fact what is happening, then all 20 million records are being sent from the server to the client each time, and then the offset and limit clauses on the client throw most of them away and only give you the 20 thousand that you asked for.
You might think that you should be able to trick the system to work correctly by turning the query into a subquery nested inside another query, but my experience when I tried this a long time ago with some database system that I do not remember anymore is that it would result in an error saying that offset and limit cannot appear in a subquery, they must always appear in a top-level query. (Precisely because the database driver needed to be able to do its incredibly counter-productive filtering on the client.)
Another approach would be to assign an incrementing unique integer id to each row which has no gaps in the numbering, so that you can select ... where unique_id >= start and unique_id <= (start + 20000) which will definitely be executed on the server rather than on the client.
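For example (just a sketch; unique_id is the hypothetical gap-free column described above, and conn/tableName stand in for the asker's connection and table):

    // Sketch: a range predicate on a gap-free unique_id column instead of LIMIT/OFFSET,
    // so the server only ever reads the 20,000 rows that were actually asked for.
    String sql = "SELECT * FROM tableName WHERE unique_id >= ? AND unique_id < ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, start * 20000L);
        ps.setLong(2, start * 20000L + 20000L);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // process one row
            }
        }
    }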
However, as I wrote above, this will probably not allow you to achieve any increase in performance by parallelizing things, because you will still have to wait for a total of 20 million rows to be transmitted from the server to the client, and it does not matter whether this is done in one go or in 1,000 batches of 20 thousand rows each. You cannot have two streams of rows simultaneously flying down a single wire.
I will not repeat what Mike Nakis said, as it is true and well explained:
I/O from a physical disk cannot be improved by multithreading
Nevertheless I would like to add something.
When you execute a query like this:
select * from tableName order by colname limit 20000 offset start*20000
on the client side you handle the result of the query, and that part you could improve by using multiple threads.
But on the database side you have no control over how the query is processed, and the Vertica database is probably already designed to execute your query with parallel tasks according to what the machine allows.
So from the client side you may split the execution of your query across one, two or three parallel threads, but in the end it should not change much, as a professional database is designed to optimize response time according to the number of requests it receives and the machine's capabilities.
No, you shouldn't get x/5 seconds. You are not thinking about the fact that you are getting 5 times the number of records in the same amount of time. It's about throughput, not about time.
In my opinion, the following is a good solution. It has worked for us to stream and process millions of records without much of a memory and processing overhead.
PreparedStatement pstmt = conn.prepareStatement(sql,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
pstmt.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    // Do the thing
}
Using OFFSET x LIMIT 20000 will result in the same query being executed again and again. For 20 million records and for 20K records per execution, the query will get executed 1000 times.
OFFSET 0 LIMIT 20000 will perform well, but OFFSET 19980000 LIMIT 20000 itself will take a lot of time, because the query will be executed fully and then, starting from the top, it will have to skip 19,980,000 records and return the last 20,000.
But using the ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY options and setting the fetch size to Integer.MIN_VALUE will result in the query being executed only ONCE and the records will be streamed in chunks, and can be processed in a single thread.
I have a web application and have to fetch 1000 records using a REST API. Each record is around 500 bytes.
What is the best way to do it from the following and why? Is there another better way to do it?
1. Fetch one record at a time. Trigger 1000 calls in parallel.
2. Fetch in groups of 20. Trigger 50 calls in parallel.
3. Fetch in groups of 100. Trigger 10 calls in parallel.
4. Fetch all 1000 records together.
As @Dima said in the comments, it really depends on what you are trying to do.
How are the records being consumed?
Is it back-end, process-to-process or program-to-program communication? If so, then it depends on the difficulty of processing once the client receives the data. Is it going to take a long time to process each record: 1 ms per record, or 100 ms per record? This option depends entirely on the possible processing time per record.
Is there a front end consuming this for human users? If so, batch requesting would be good for reasons like paginating results. In such cases, I would go with option 2 or 3 personally.
In general though, depending upon the sheer volume of records, I would recommend considering batching requests (by triggering fewer calls). Heuristically speaking, you are likely to get better overall network throughput that way.
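As a rough sketch of that batching (the endpoint URL, the batch size of 100 and the pool size are assumptions, not part of the question):

    // Sketch: fetch 1000 records as 10 batches of 100, a few batches in flight at a time.
    ExecutorService pool = Executors.newFixedThreadPool(4);
    List<Future<String>> futures = new ArrayList<Future<String>>();

    for (int offset = 0; offset < 1000; offset += 100) {
        final int from = offset;
        futures.add(pool.submit(() -> {
            // Hypothetical endpoint returning records [from, from + 100)
            URL url = new URL("https://api.example.com/records?offset=" + from + "&limit=100");
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            return body.toString();
        }));
    }

    for (Future<String> f : futures) {
        String batchJson = f.get();   // blocks until that batch has arrived
        // parse and consume the batch here
    }
    pool.shutdown();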
If you add more specifics, I'll happily update my answer, but until then, general will have to do!
Best for what case? What are you trying to optimize?
I did some tests a while back on a similar situation, with slightly larger payloads (images), where my goal was to utilize network efficiently on a high-latency setup (across continents).
My results were that after a minimal amount of parallelism (like 3-4 threads), the network was almost perfectly saturated. We compared it to specific (proprietary) UDP-based transfer protocols, and there was no measurable difference.
Anyway, it may be not what you are looking for, but sometimes having a "dumb" http endpoint is good enough.
I'm trying to improve query performance. It takes an average of about 3 seconds for simple queries which don't even touch a nested document, and it's sometimes longer.
curl "http://searchbox:9200/global/user/_search?n=0&sort=influence:asc&q=user.name:Bill%20Smith"
Even without the sort it takes seconds. Here are the details of the cluster:
1.4TB index size.
210m documents that aren't nested (about 10KB each)
500m documents in total. (nested documents are small: 2-5 fields).
About 128 segments per node.
3 nodes, m2.4xlarge (-Xmx set to 40g, machine memory is 60g)
3 shards.
Index is on Amazon EBS volumes.
Replication 0 (have tried replication 2 with only little improvement)
I don't see any noticeable spikes in CPU/memory etc. Any ideas how this could be improved?
Garry's points about heap space are true, but it's probably not heap space that's the issue here.
With your current configuration, you'll have less than 60GB of page cache available, for a 1.5 TB index. With less than 4.2% of your index in page cache, there's a high probability you'll be needing to hit disk for most of your searches.
You probably want to add more memory to your cluster, and you'll want to think carefully about the number of shards as well. Just sticking to the default can cause skewed distribution. If you had five shards in this case, you'd have two machines with 40% of the data each, and a third with just 20%. In either case, you'll always be waiting for the slowest machine or disk when doing distributed searches. This article on Elasticsearch in Production goes a bit more in depth on determining the right amount of memory.
For this exact search example, you can probably use filters, though. You're sorting, thus ignoring the score calculated by the query. With a filter, it'll be cached after the first run, and subsequent searches will be quick.
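With the 1.x-era Java API that could look roughly like this (a sketch only; it assumes user.name is not_analyzed so a term filter matches, otherwise swap in a query filter on the analyzed field):

    // Sketch: the name match expressed as a filter (cached after the first run),
    // since the sort on "influence" ignores scoring anyway.
    SearchResponse resp = client.prepareSearch("global")
            .setTypes("user")
            .setQuery(QueryBuilders.filteredQuery(
                    QueryBuilders.matchAllQuery(),
                    FilterBuilders.termFilter("user.name", "Bill Smith")))  // assumes not_analyzed
            .addSort("influence", SortOrder.ASC)
            .execute().actionGet();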
Ok, a few things here:
Decrease your heap size. You have a heap of over 32GB dedicated to each Elasticsearch instance on each node, and the JVM can't use compressed object pointers (compressed oops) with heaps larger than about 32GB. Drop your nodes to only 32GB and, if you need to, spin up another instance.
If spinning up another instance isn't an option and 32GB on 3 nodes isn't enough to run ES, then you'll have to bump your heap memory to somewhere over 48GB!
I would probably stick with the default settings for shards and replicas. 5 shards, 1 replica. However, you can tweak the shard settings to suit. What I would do is reindex the data in several indices under several different conditions. The first index would only have 1 shard, the second index would have 2 shards, I'd do this all the way up to 10 shards. Query each index and see which performs best. If the 10 shard index is the best performing one keep increasing the shard count until you get worse performance, then you've hit your shard limit.
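For example, with the 1.x Java API, creating the test indices might look something like this (the index names and the loop are illustrative, not a recommendation about which counts to try):

    // Sketch: create otherwise-identical indices that differ only in shard count,
    // index the same data into each, and compare query latency.
    for (int shards = 1; shards <= 10; shards++) {
        client.admin().indices().prepareCreate("perftest_" + shards + "_shards")
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("number_of_shards", shards)
                        .put("number_of_replicas", 1)
                        .build())
                .execute().actionGet();
    }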
One thing to think about though, sharding might increase search performance but it also has a massive effect on index time. The more shards the longer it takes to index a document...
You also have quite a bit of data stored, maybe you should look at Custom Routing too.
I'm using HBase to store some time series data. Using the suggestion in the O'Reilly HBase book I am using a row key that is the timestamp of the data with a salted prefix. To query this data I am spawning multiple threads which implement a scan over a range of timestamps with each thread handling a particular prefix. The results are then placed into a concurrent hashmap.
Trouble occurs when the threads attempt to perform their scan. A query that normally takes approximately 5600 ms when done serially takes between 40000 and 80000 ms when 6 threads are spawned (corresponding to 6 salts/region servers).
I've tried to use HTablePools to get around what I thought was an issue with HTable being not thread-safe, but this did not result in any better performance.
In particular, I am noticing a significant slowdown when I hit this portion of my code:
for (Result res : rowScanner) {
    // add Result to HashMap
}
Through logging I noticed that every time through the loop's condition check I experienced delays of many seconds. These delays do not occur if I force the threads to execute serially.
I assume that there is some kind of issue with resource locking but I just can't see it.
Make sure that you are setting the BatchSize and Caching on your Scan objects (the object that you use to create the Scanner). These control how many rows are transferred over the network at once, and how many are kept in memory for fast retrieval on the RegionServer itself. By default they are both way too low to be efficient. BatchSize in particular will dramatically increase your performance.
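For example (a sketch; the values are starting points to tune, and startRow, stopRow and table stand in for the asker's salted-prefix range and HTable):

    // Sketch: tune scanner caching/batching before opening the scanner.
    Scan scan = new Scan(startRow, stopRow);
    scan.setCaching(500);   // rows shipped to the client per RPC (the default is very low)
    scan.setBatch(100);     // max columns per Result, mainly useful for very wide rows
    ResultScanner rowScanner = table.getScanner(scan);
    try {
        for (Result res : rowScanner) {
            // add Result to the concurrent map
        }
    } finally {
        rowScanner.close();
    }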
EDIT: Based on the comments, it sounds like you might be swapping either on the server or on the client, or that the RegionServer may not have enough space in the BlockCache to satisfy your scanners. How much heap have you given to the RegionServer? Have you checked to see whether it is swapping? See How to find out which processes are swapping in linux?.
Also, you may want to reduce the number of parallel scans, and make each scanner read more rows. I have found that on my cluster, parallel scanning gives me almost no improvement over serial scanning, because I am network-bound. If you are maxing out your network, parallel scanning will actually make things worse.
Have you considered using MapReduce, with perhaps just a mapper to easily split your scan across the region servers? It's easier than worrying about threading and synchronization in the HBase client libs. The Result class is not threadsafe. TableMapReduceUtil makes it easy to set up jobs.
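A minimal sketch of that setup (the mapper class, table name and job wiring below are illustrative, not taken from the question):

    // Sketch: a map-only job where HBase hands each region to its own mapper.
    public static class MyScanMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context) {
            // process the row here (aggregate it, or emit it for a downstream step)
        }
    }

    // In the job driver:
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "timeseries-scan");
    job.setJarByClass(MyScanMapper.class);

    Scan scan = new Scan();
    scan.setCaching(500);   // see the caching note above

    TableMapReduceUtil.initTableMapperJob(
            "my_table",            // illustrative table name
            scan,
            MyScanMapper.class,
            ImmutableBytesWritable.class,
            Result.class,
            job);
    job.setNumReduceTasks(0);      // map-only
    job.waitForCompletion(true);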
We have a SELECT statement which will take approx. 3 secs to execute. We are calling this DB2 query inside a nested While loop.
Ex:
while (hashmap1.hasNext()) {
    while (hashmap2.hasNext()) {
        // SQL query
    }
}
The problem is, the outer while loop will execute approx. 1200 times and the inner while loop 200 times, which means the SQL will be called 1200 * 200 = 240,000 times. Each iteration of the outer while loop takes approx. 150 secs, so 1200 * 150 secs = 50 hrs.
We can afford only around 12-15hrs of time, before we kick off the next process.
Is there any way to do this process quickly? Is there any new technology that can help us fetch these records from DB2 faster?
Any help would be highly appreciated.
Note: We have already looked into all possible ways to cut down the number of iterations.
Sounds to me like you're trying to use the middle tier for something that the database itself is better suited for. It's a classic "N+1" query problem.
I'd rewrite this logic to execute entirely on the database as a properly indexed JOIN. That'll not only cut down on all that network back and forth, but it'll bring the database optimizer to bear and save you the expense of bringing all that data to the middle tier for processing.
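As a sketch of that direction (the table and column names are hypothetical, since the question doesn't show the actual query or the contents of the two maps):

    // Sketch: one set-based query with JOINs instead of 240,000 per-row lookups.
    // driver_table_a/driver_table_b stand in for whatever the two maps are built from.
    String sql =
        "SELECT a.key1, b.key2, d.value " +
        "FROM driver_table_a a " +
        "JOIN driver_table_b b ON b.group_id = a.group_id " +
        "JOIN detail_table d ON d.key1 = a.key1 AND d.key2 = b.key2";

    try (PreparedStatement ps = conn.prepareStatement(sql);
         ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            // consume the joined rows here instead of issuing a query per iteration
        }
    }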