how to do pagination with elasticsearch? from vs scroll API - java

I'm using Elasticsearch as a DB to store a large batch of log data.
I know there are 2 ways to do pagination:
Use size and from API
Use scroll API
Now I'm using 'from' to do pagination. I get the page and size parameters from the front end, and at the back end (Java):
searchSourceBuilder.size(size);
searchSourceBuilder.from(page * size);
However, if page * size > 10000, an exception is thrown by ES.
Can I use scroll API to do pagination?
I know that if I use scroll API, the searchResponse object will return me a _scroll_id, which looks like a base64 string.
How can I control page and size?
It seems the scroll API only supports successive pages?

There is nothing in Elasticsearch which allows a direct jump to a specific page, because the results have to be collected from different shards. So in your case search_after will be a better option. You can reduce the amount of data returned for the intermediate queries and then, once you reach the page that was actually requested, get the complete data.
Example: let's say you have to jump to the 99th page. You can reduce the amount of data returned for the first 98 pages' requests, and once you're at page 99 you can get the complete data.
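A minimal sketch of that idea with the Java high-level REST client, assuming an index called "logs" and two sort fields ("timestamp" plus a unique tiebreaker like "log_id") that may not match your mapping:
import java.io.IOException;

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;

public class SearchAfterPager {

    /** Walks forward page by page with search_after and returns the requested page. */
    static SearchHit[] fetchPage(RestHighLevelClient client, int page, int size) throws IOException {
        Object[] lastSortValues = null;
        SearchHit[] hits = new SearchHit[0];
        for (int p = 0; p <= page; p++) {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .size(size)
                    // Skipped pages only need the sort values, not the documents.
                    .fetchSource(p == page)
                    .sort(SortBuilders.fieldSort("timestamp").order(SortOrder.ASC))
                    .sort(SortBuilders.fieldSort("log_id").order(SortOrder.ASC)); // unique tiebreaker
            if (lastSortValues != null) {
                source.searchAfter(lastSortValues); // continue after the last hit of the previous page
            }
            SearchResponse response = client.search(
                    new SearchRequest("logs").source(source), RequestOptions.DEFAULT);
            hits = response.getHits().getHits();
            if (hits.length == 0) {
                break; // fewer results than the requested page
            }
            lastSortValues = hits[hits.length - 1].getSortValues();
        }
        return hits;
    }
}
The skipped pages only carry sort values over the wire, so walking to the requested page stays cheap compared to deep from/size requests.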

What you said is correct!
You can't do traditional pagination using the scroll API.
I'd suggest you look at the Search After API.
The following might not fully cover your requirement, but it's worth knowing:
The from/size default max result size is 10,000.
As mentioned here:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000
So if you increase the index.max_result_window setting (persistently), it will raise the maximum number of search results. But this isn't a real solution; it only loosens the limitation.
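For reference, a hedged sketch of raising that setting from Java with the high-level REST client; the index name "logs" and the 50,000 value are just placeholders:
// Raise index.max_result_window for an existing index (placeholder name/value).
// `client` is an org.elasticsearch.client.RestHighLevelClient.
UpdateSettingsRequest request = new UpdateSettingsRequest("logs");
request.settings(Settings.builder()
        .put("index.max_result_window", 50000)
        .build());
client.indices().putSettings(request, RequestOptions.DEFAULT);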
Remember that the above workaround might hamper the performance of your ES server. Read through all the posts here.
My suggestion is to use the scroll API and change the pagination style.
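For completeness, a rough sketch of a forward-only scroll with the Java high-level REST client (index name and keep-alive are placeholders). Note that there is no way to jump to page N; you can only walk forward:
// Forward-only scroll: walk all results in batches of `size`.
SearchRequest searchRequest = new SearchRequest("logs")
        .source(new SearchSourceBuilder().size(size))
        .scroll("1m"); // keep the search context alive between batches
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = response.getScrollId();

while (response.getHits().getHits().length > 0) {
    // process response.getHits() here ...
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId).scroll("1m");
    response = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = response.getScrollId();
}

// Free the server-side search context when done.
ClearScrollRequest clearScroll = new ClearScrollRequest();
clearScroll.addScrollId(scrollId);
client.clearScroll(clearScroll, RequestOptions.DEFAULT);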

Related

Paging Library - Load Specific Page By Key

I'm using the Android Paging Library, and probably don't use it 100% correctly.
I have a RecyclerView and a SnapHelper to basically implement a ViewPager.
My pages are per date, e.g. 2019-03-21, so there can be an infinite number of pages.
What I implemented is an ItemKeyedDataSource<String,String> which uses the date as its key, and in loadAfter and loadBefore all I do is add/subtract a day.
This currently works just fine. If, for example, I load 2019-03-21, then I can easily cycle to its neighbors 2019-03-20 and 2019-03-22.
However, I'd like to add a feature to load a specific date, and then scroll there.
Using PositionalDataSource doesn't sound good either, since I can't say there's a finite count of items in my list.
I feel like I'm doing it wrong. Just not really sure what.
Also, if there's another way that doesn't include paging (sounds reasonable, since my paging is just doing some calculations but doesn't retrieve data), that's good too.
Well, I went for the alternative approach.
I'm using a regular list and adapter.
I'm using a list padded on both sides (a list of 50 dates, with the initial date at index 24), and when reaching index 10 or 40 (depending on whether we scroll up or down) I calculate a new list and post it instead of the old one.
It looks like an infinite list and everything looks good enough, so it works for me.
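In case it helps, a rough sketch of the idea. It assumes a ListAdapter<LocalDate, ...> called dateAdapter, a recyclerView field, and java.time being available (API 26+ or via desugaring); the window sizes are the ones from above:
// Assumes imports of java.time.LocalDate, java.util.ArrayList, java.util.List.
private static final int WINDOW = 50;    // total dates kept in the list
private static final int CENTER = 24;    // index of the anchor date
private static final int LOW_EDGE = 10;  // rebuild when we get this close to the start
private static final int HIGH_EDGE = 40; // ... or this close to the end

private List<LocalDate> buildWindow(LocalDate anchor) {
    List<LocalDate> dates = new ArrayList<>(WINDOW);
    for (int i = 0; i < WINDOW; i++) {
        dates.add(anchor.plusDays(i - CENTER));
    }
    return dates;
}

// Called from a RecyclerView.OnScrollListener (or the SnapHelper callback)
// with the position currently snapped to.
private void maybeRecenter(int snappedPosition) {
    if (snappedPosition <= LOW_EDGE || snappedPosition >= HIGH_EDGE) {
        LocalDate current = dateAdapter.getCurrentList().get(snappedPosition);
        dateAdapter.submitList(buildWindow(current));
    }
}

// Jumping to an arbitrary date becomes trivial: build a window around it
// and scroll to the middle.
private void jumpTo(LocalDate date) {
    dateAdapter.submitList(buildWindow(date));
    recyclerView.scrollToPosition(CENTER);
}
Since submitList diffs asynchronously, in practice you may want to scroll only after the new list is committed (submitList has an overload with a commit callback).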
If anyone wants more details, please comment and I'll try to help.

Neo4j - Reading large amounts of data with Java

I'm currently trying to read large amounts of data into my Java application using the official Bolt driver. I'm having issues because the graph is fairly large (~17k nodes, ~500k relationships) and of course I'd like to read this in chunks for memory efficiency. What I'm trying to get is a mix of fields between the origin and destination nodes, as well as the relationship itself. I tried writing a pagination query:
MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel)
WITH r.some_date AS some_date, r.arrival_times AS arrival_times,
r.departure_times AS departure_times, r.path_ids AS path_ids,
n.node_id AS origin_node_id, m.node_id AS dest_node_id
ORDER BY id(r)
RETURN some_date, arrival_times, departure_times, path_ids,
origin_node_id, dest_node_id
LIMIT 5000
(I changed some of the label and field naming so it's not obvious what the query is for)
The idea was that I'd use SKIP on subsequent queries to read more data. However, at 5000 rows per read this takes roughly 7 seconds per read, presumably because of the full scan that ORDER BY forces, and if I add SKIP the execution time and memory usage go up significantly. This is far too slow to read the whole graph. Is there any way I can speed up the query, or stream the results in chunks into my app? In general, what is the best approach to reading large amounts of data?
Thanks in advance.
Instead of SKIP: from the second call onwards you can filter with id(r) > <last received id(r)>. It should actually reduce the processing time as you go.
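A rough sketch of that keyset-style paging with the Neo4j Java (Bolt) driver 4.x; the connection details and the 5000 batch size are placeholders, and the Cypher mirrors the query from the question:
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

public class ChunkedRelationshipReader {

    private static final String CYPHER =
            "MATCH (n:NodeLabel)-[r:RelationshipLabel]->(m:NodeLabel) " +
            "WHERE id(r) > $lastId " +
            "WITH r, n, m ORDER BY id(r) LIMIT $batch " +
            "RETURN id(r) AS rel_id, r.some_date AS some_date, " +
            "       r.arrival_times AS arrival_times, r.departure_times AS departure_times, " +
            "       r.path_ids AS path_ids, n.node_id AS origin_node_id, m.node_id AS dest_node_id";

    public static void main(String[] args) {
        int batch = 5000;
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                     AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            long lastId = -1;
            while (true) {
                Result result = session.run(CYPHER,
                        Values.parameters("lastId", lastId, "batch", batch));
                int rows = 0;
                while (result.hasNext()) {
                    Record row = result.next();
                    lastId = row.get("rel_id").asLong(); // remember where this chunk ended
                    rows++;
                    // process the other columns of `row` here
                }
                if (rows < batch) {
                    break; // last chunk reached
                }
            }
        }
    }
}
Because each chunk starts from the last relationship id seen, no rows have to be skipped server-side, unlike SKIP which re-walks everything before the offset.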

ELK stack for storing metering data

In our project we're using an ELK stack for storing logs in a centralized place. However, I've noticed that recent versions of Elasticsearch support various aggregations. In addition, Kibana 4 supports nice graphical ways to build graphs. Even recent versions of Grafana can now use an Elasticsearch 2 datasource.
So, does all this mean that the ELK stack can now be used for storing metering information gathered inside the system, or can it still not be considered a serious competitor to existing solutions such as Graphite, InfluxDB and so forth?
If so, does anyone use ELK for metering in production? Could you please share your experience?
Just to clarify the notions: I consider metering data to be something that can be aggregated and shown in a graph 'over time', as opposed to a regular log message whose main use case is searching.
Thanks a lot in advance
Yes, you can use Elasticsearch to store and analyze time-series data.
To be more precise, it depends on your use case. For example, in my use case (financial instrument price tick history data, in development) I am able to get 40,000 documents inserted per second (~125-byte documents with 11 fields each: 1 timestamp, strings and decimals, meaning 5 MB/s of useful data) for 14 hrs/day, on a single node (a big modern server with 192 GB RAM) backed by a corporate SAN (which is backed by spinning disks, not SSDs!). I plan to store up to 1 TB of data, but I predict that 2-4 TB could also work on a single node.
All this is with default config file settings, except for an ES_HEAP_SIZE of 30 GB. I suspect it would be possible to get significantly better write performance on that hardware with some tuning (e.g. I find it strange that iostat reports device util at 25-30%, as if Elasticsearch was capping it / conserving I/O bandwidth for reads or merges... but it could also be that %util is an unreliable metric for SAN devices).
Query performance is also fine - queries / Kibana graphs return quickly as long as you restrict the result dataset with time and/or other fields.
In this case you would not use Logstash to load your data, but would bulk-insert big batches directly into Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
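For illustration, a hedged sketch of such a bulk insert with the newer Java high-level REST client (the original answer predates this client, and the index name, field names and Metric class here are made up):
// Bulk-load one batch of metric documents; `client` is a RestHighLevelClient
// and Metric is your own value class.
BulkRequest bulk = new BulkRequest();
for (Metric m : batch) {
    bulk.add(new IndexRequest("metrics-2016.01.01")
            .source(XContentType.JSON,
                    "timestamp", m.timestamp,
                    "instrument", m.instrument,
                    "price", m.price));
}
BulkResponse response = client.bulk(bulk, RequestOptions.DEFAULT);
if (response.hasFailures()) {
    // Inspect response.buildFailureMessage() and retry or log as appropriate.
}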
You also need to define a mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) to make sure Elasticsearch parses your data as you want it (numbers, dates, etc.) and creates the desired level of indexing.
Other recommended practices for this use case are to use a separate index for each day (or week/month, depending on your insert rate), and to make sure each index is created with just enough shards to hold one day of data (by default new indexes are created with 5 shards, and shard performance starts degrading once a shard grows beyond a certain size - usually a few tens of GB, but it may differ for your use case - you need to measure/experiment).
Using Elasticsearch aliases (https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html) helps with dealing with multiple indexes, and is a generally recommended best practice.
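And a hedged sketch of creating one such daily index with an explicit mapping, a single shard and an alias, again with the newer high-level REST client (index, alias and field names are illustrative only):
// Create a daily index with an explicit mapping, one shard and an alias.
CreateIndexRequest create = new CreateIndexRequest("metrics-2016.01.01");
create.settings(Settings.builder()
        .put("index.number_of_shards", 1)      // size shards to roughly one day of data
        .put("index.number_of_replicas", 1));
create.mapping(
        "{ \"properties\": { "
      + "  \"timestamp\":  { \"type\": \"date\" }, "
      + "  \"instrument\": { \"type\": \"keyword\" }, "
      + "  \"price\":      { \"type\": \"double\" } } }",
        XContentType.JSON);
create.alias(new Alias("metrics"));            // query "metrics" to search across all days
client.indices().create(create, RequestOptions.DEFAULT);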

how to improve MappedByteBuffer get performance for my use case?

I have several large double and long arrays of 100k values each that need to be accessed for computation at a given time. Even with largeHeap requested, the Android OS doesn't give me enough memory and I keep getting OutOfMemoryError on most of the tested devices. So I went researching for ways to overcome this, and according to an answer I got from Waldheinz in my previous question I implemented an array that is backed by a file, using RandomAccessFile to get a channel to it, then mapping it with MappedByteBuffer as suggested, and using the MappedByteBuffer asLongBuffer or asDoubleBuffer. This works perfectly; I 100% eliminated the OutOfMemoryError. But the performance is very poor. I get lots of calls to get(some index) that take about 5-15 milliseconds each, and therefore the user experience is ruined.
Some useful information:
I am using binary search on the arrays to find the start and end indices, and then I have a linear loop from start to end.
I added a print statement for any get() call that takes more than 5 milliseconds to finish (printing the time it took, the requested index and the last requested index). It seems like all of the binary-search get requests were printed, and a few of the linear requests were too.
Any suggestions on how to make it go faster?
Approach 1
Index your data - add pointers for quick searching
Split your sorted data into 1,000 buckets of 100 values each
Maintain an in-memory index referencing each bucket's start and end
The algorithm is to first find your bucket in this in-memory index (even a loop is fine for this) and then to jump to that bucket in the memory-mapped file
This results in a single jump into the file (one bucket to find) and an iteration over at most 100 elements, as sketched below.
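A minimal sketch of Approach 1 for one sorted array of doubles, with a bucket size of 100 as above; the file layout and class shape are assumptions:
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.DoubleBuffer;
import java.nio.channels.FileChannel;

class BucketedDoubles {
    private static final int BUCKET = 100;
    private final DoubleBuffer values;   // the whole file, mapped once
    private final double[] bucketStarts; // in-memory index: first value of each bucket

    BucketedDoubles(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            values = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asDoubleBuffer();
        }
        int buckets = (values.capacity() + BUCKET - 1) / BUCKET;
        bucketStarts = new double[buckets];
        for (int b = 0; b < buckets; b++) {
            bucketStarts[b] = values.get(b * BUCKET); // one read per bucket, done once
        }
    }

    /** Index of the first element >= key, touching only one bucket of the file. */
    int lowerBound(double key) {
        int b = 0; // the in-memory index is tiny, so a simple loop is fine here
        while (b + 1 < bucketStarts.length && bucketStarts[b + 1] <= key) {
            b++;
        }
        int end = Math.min((b + 1) * BUCKET, values.capacity());
        for (int i = b * BUCKET; i < end; i++) {
            if (values.get(i) >= key) {
                return i;
            }
        }
        return end;
    }
}
The point is that the binary search now runs against an ordinary heap array, and the mapped file is only touched for the final bucket scan of at most 100 elements.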
Approach 2
Utilize a lightweight embedded database, e.g. MapDB, which supports Android.

Solr paging performance

I have read (http://old.nabble.com/using-q%3D--,-adding-fq%3D-to26753938.html#a26805204):
FWIW: limiting the number of rows per request to 50, but not limiting the start doesn't make much sense -- the same amount of work is needed to handle start=0&rows=5050 and start=5000&rows=50.
Then he continues:
There are very few use cases for allowing people to iterate through all the rows that also require sorting.
Is that right? Is that true just for sorted results?
How many pages of 10 rows each do you recommend to allow the user to iterate?
Does Solr 1.4 suffer the same limitation?
Yes that's true, also for Solr 1.4. That does not mean that start=0&rows=5050 has the same performance as start=5000&rows=50, since the former has to return 5050 documents while the latter only 50. Less data to transfer -> faster.
Solr doesn't have any way to get ALL results in a single page since it doesn't make much sense. As a comparison, you can't fetch the whole Google index in a single query. Nobody really needs to do that.
The page size of your application should be user-definable (e.g. the user might choose to see 10, 25, 50, or 100 results at once).
The default page size depends on what kind of data you're paging and how relevant the results really are. For example, when searching on Google you usually don't look beyond the first few results, so 10 elements are enough. eBay, on the other hand, is more about browsing the results, so it shows 50 results per page by default, and it doesn't even offer 10 results per page.
You also have to take scrolling into account. Users would probably get lost when trying to browse through a 200-result page, not to mention that it takes considerably longer to load.
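For concreteness, a hedged SolrJ sketch of start/rows paging (using a newer SolrJ client than the Solr 1.4 in the question; URL, sort field and offsets are placeholders). Less data crosses the wire with start=5000&rows=50, but the server still has to collect and sort 5050 documents:
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

// Exception handling omitted for brevity.
HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

SolrQuery q = new SolrQuery("*:*");
q.setSort("price", SolrQuery.ORDER.asc); // the sort is what makes deep offsets expensive
q.setStart(5000);                        // page 101 with 50 rows per page
q.setRows(50);
QueryResponse rsp = solr.query(q);
rsp.getResults().forEach(doc -> System.out.println(doc.getFieldValue("id")));
solr.close();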
start=0&rows=5050 and start=5000&rows=50
Depends on how you jump to start=5000. If you scroll through all results from 0 to 4999, ignoring them all, and then continue scrolling from 5000 to 5050, then yes, the same amount of work is done. The best thing to do is to limit the rows fetched from the database itself by using something like ROWNUM in Oracle.
iterate through all the rows that also require sorting
Few, but yes, there are use cases with this requirement. Examples would be CSV/Excel/PDF exports.
