I have a case in which I shouldn't make requests to get the scroll_id - I have to manage it somehow so I can get the URL for next pages offline (I am making GET requests against a certain site that exposes their Elasticsearch instance)
So basically, I have a certain URL containing an Elasticsearch query, and it returns only 20 results out of 40 (20 per request is the maximum size). I want to get a URL for the next pages. If I had a connection to the Internet, I would just get the scroll_id from the first request and use it to make the next ones.
But I want to avoid that and see whether I can have a helper class that builds scroll ids by itself.
Is it possible?
Thanks in advance.
The scroll_id ties directly to some internal state (i.e. the context of the initial query) managed by ES, which eventually times out after a given period of time.
Once that period elapses, the search context is cleared and the scroll id is no longer valid. I'm afraid there's no way to craft a scroll id by hand.
But if the result set contains 40 results and you can only retrieve 20 at a time, I suggest you simply set from: 20 in your second query and you'll be fine.
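Since from/size pagination needs no server-side state, the page URLs can be built offline in advance. A minimal sketch (the base URL and total are illustrative placeholders, not taken from the original question):

```python
def page_urls(base_url, total_hits, page_size=20):
    """Build one search URL per page using from/size pagination."""
    return [
        f"{base_url}&size={page_size}&from={offset}"
        for offset in range(0, total_hits, page_size)
    ]

# 40 hits at 20 per page -> two URLs, one with from=0 and one with from=20
urls = page_urls("https://example.com/_search?q=foo", total_hits=40)
```

Unlike a scroll id, these URLs stay valid as long as the query itself does, though deep from values get expensive on large result sets.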
I'm trying to get 2000 change results from a specific branch with a query request using Gerrit REST API in Java. The problem is that I'm only getting 500 results no matter what I add to the query search.
I have tried the options listed here but I'm not getting the 2000 results that I need. I also read that an admin can increase this limit but would prefer a method that doesn't require this detour.
So what I'm wondering is:
Is it possible to increase the limit without the need to contact the admin?
If not, is it possible to continue/repeat the query in order to get the remaining 1500 results that I want, using a loop that performs the query on the next 500 results from the previous query until I finally get 2000 results in total?
When using the list changes REST API, the results are returned as a list of ChangeInfo Elements. If there are more results than were returned, the last entry in that list will have a _more_changes field with value true. You can then query again and set the start option to skip over the ones that you've already received.
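Sketched in Python, the loop looks like this; fetch_page is a stand-in for whatever HTTP client actually calls the list changes endpoint with a start offset:

```python
def fetch_all_changes(fetch_page, limit=2000):
    """Keep querying until _more_changes disappears or the limit is hit.

    fetch_page(start) is assumed to return a list of ChangeInfo dicts,
    where the last dict carries "_more_changes": True if more exist.
    """
    changes = []
    while len(changes) < limit:
        page = fetch_page(start=len(changes))
        if not page:
            break
        changes.extend(page)
        if not page[-1].get("_more_changes"):
            break
    return changes[:limit]
```

Setting start to the number of changes already received is what skips over the previous pages.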
I want to add a minor workaround to David's great answer.
If you want to crawl Gerrit instances hosted on Google servers (such as Android, Chromium, Golang), you will notice that they block queries with more than 10000 results. You can check this e.g. with
curl "https://android-review.googlesource.com/changes/?q=status:closed&S=10000"
I solved the problem by splitting up the list of changes with after: and before: in the query string, for example:
_url_/changes/?q=after:{2018-01-01 00:00:00.000} AND before:{2018-01-01 00:59:59.999}
_url_/changes/?q=after:{2018-01-01 01:00:00.000} AND before:{2018-01-01 01:59:59.999}
_url_/changes/?q=after:{2018-01-01 02:00:00.000} AND before:{2018-01-01 02:59:59.999}
and so on. I think you get the idea. ;-) Please note that both limits (before: and after:) are inclusive! Within each slice I use the pagination described by David.
A nice side effect is that you can track the progress of the crawling.
I wrote a small Python tool named "Gerry" to crawl open source instances. Feel free to use it, adapt it, and send me pull requests!
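The hourly slicing above can be generated programmatically; here is a sketch, keeping the _url_ placeholder and the millisecond endpoints from the examples above:

```python
from datetime import datetime, timedelta

def hourly_slice_urls(day, base="_url_/changes/"):
    """One query URL per hour of the given day, with inclusive
    after:/before: limits one millisecond apart at the seams."""
    fmt = "%Y-%m-%d %H:%M:%S.%f"
    urls = []
    for h in range(24):
        lo = day + timedelta(hours=h)
        hi = lo + timedelta(hours=1, milliseconds=-1)
        # strftime emits microseconds; trim the last 3 digits for ms
        q = (f"after:{{{lo.strftime(fmt)[:-3]}}} "
             f"AND before:{{{hi.strftime(fmt)[:-3]}}}")
        urls.append(f"{base}?q={q}")
    return urls
```

Smaller slices (e.g. per hour instead of per day) keep each result set under the 10000 cap on busy instances.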
I almost had the same problem, but there is no way around it: as you mentioned, you don't want the admin to increase the query limit, and you don't want to fire the REST query in a loop with a counter. Still, I suggest you follow the second approach, firing the query in a loop with a counter set. That's how I implemented the REST client in Java.
I am trying to get all the recent media by tag using this Instagram endpoint. The purpose here is to track all the recent media for tags. I have configured a scheduled task (with Java and Spring) (to execute every hour) that sends requests and gets data. Below is the execution sequence:
Send the GET request to Instagram with the previously stored max_tag_id (send null if there's no previous id)
Iterate through results, extract next_max_tag_id from pagination element and store it in the database against corresponding tag
Send GET request again with new max_tag_id and continue
Stop if next_url in the result is null or number of media returned is less than 20 (configured)
Once the execution finishes, next execution (after let's say an hour) will start with previously stored max_tag_id.
The issue I am seeing is that I never get 'recent' documents in subsequent executions. As per the documentation, passing max_tag_id in the request should return all the media after that id; however, that's not happening. I keep getting old media.
If I want to get the recent documents in every execution, do I need to pass null max_id in the first request of every execution? If I do that, will I not get redundant documents in every execution? I tried asking Instagram but haven't got any response. Also, their documentation explains little about pagination. Not sure whether pagination for recent media endpoint works backwards.
If you want the most recent media, don't use max_tag_id. If you use max_tag_id, it will return all media dated before that.
You need to get the min_tag_id and store it. In the next hour, start by making a call with only min_tag_id; if there is a pagination.next_url, use that to get the next set of 20, until pagination.next_url no longer exists. Then use the newly stored min_tag_id to make calls the hour after.
The very first time, make the call without max_tag_id or min_tag_id.
You can also set &count=32 to get 32 posts with every API call (32 is the max from my experience).
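Putting the answer together as a sketch: fetch stands in for the actual HTTP call returning the decoded JSON, and the field names follow the pagination object described above.

```python
def crawl_recent(fetch, min_tag_id=None):
    """Collect media since min_tag_id; return (media, new_min_tag_id).

    The min_tag_id from the first response is returned so it can be
    stored and used as the starting point of the next hourly run.
    """
    resp = fetch(min_tag_id=min_tag_id)
    new_min = resp.get("pagination", {}).get("min_tag_id", min_tag_id)
    media = []
    while True:
        media.extend(resp["data"])
        next_url = resp.get("pagination", {}).get("next_url")
        if not next_url:
            break
        resp = fetch(next_url=next_url)
    return media, new_min
```

On the very first run, call it with min_tag_id=None, matching the "first call without max_tag_id or min_tag_id" rule.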
I'm looking for a way to split a big request like:
rest/api/2/search?jql=(project in (project1, project2, project3....project10)) AND issuetype = Bug AND (component not in (projectA, projectB) OR component = EMPTY)
The result will contain > 500 bugs, which is very, very slow. I want to get them with several requests (the method performing the request will be annotated with #Asynchronous), but the JQL needs to stay the same. I don't want to search separately for project1, project2...project10. It would be nice if someone had an idea to resolve my problem.
Thank you :)
You need to calculate pagination. First get the metadata.
rest/api/2/search?jql=[complete search query]&fields=*none&maxResults=0
you should get something like this:
{"startAt":0,"maxResults":0,"total":100,"issues":[]}
so completely without fields, just pagination metadata.
Then create the search URIs like this:
rest/api/2/search?jql=[complete search query]&startAt=0&maxResults=10
rest/api/2/search?jql=[complete search query]&startAt=10&maxResults=10
..etc
Beware: the data may change between calls, so be prepared that you won't receive all the data, and the pagination metadata (especially "total") may not be present if its calculation is expensive. See also: paged APIs.
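The two steps above (the zero-result metadata probe, then the paged requests) can be sketched like this; total comes from the probe's JSON and the JQL string is whatever complete query you already have:

```python
def paged_search_urls(jql, total, page_size=10, base="rest/api/2/search"):
    """Build the startAt/maxResults URLs covering `total` issues."""
    return [
        f"{base}?jql={jql}&startAt={start}&maxResults={page_size}"
        for start in range(0, total, page_size)
    ]
```

Since the URLs are independent of each other, they are easy to fire in parallel from an asynchronous method.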
Can you not break it into 2 parts? If you are displaying in a web page, display what you can without a performance hit. If it's a report, then get all the objects gradually and show them once completed.
Get the total count for the JQL and just get the minimum information needed for step 2 - assume it's 900.
Use the pagination feature (maxResults=100) to make multiple calls.
Work on each request.
If you don't want to run all the requests at once and need paging of bugs on user request, you can:
Make a request with the 'maxResults' property set to how much you need.
On the next request, set the same 'maxResults' and set 'startAt' to that value.
If you need to fetch more data, make a new request with the same 'maxResults' but update 'startAt' to the count of bugs you fetched in the previous requests.
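A sketch of that bookkeeping (the class and method names are illustrative, not from any JIRA client library):

```python
class BugPager:
    """Tracks how many bugs were fetched so that the next request's
    startAt continues where the previous one left off."""

    def __init__(self, page_size):
        self.page_size = page_size
        self.fetched = 0

    def next_params(self):
        # Parameters for the next user-triggered page request.
        return {"startAt": self.fetched, "maxResults": self.page_size}

    def record(self, results):
        # Call after each response with the list of bugs received.
        self.fetched += len(results)
```

Counting what was actually received (rather than assuming full pages) keeps the offsets correct when a page comes back short.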
I need the implementation logic where my logic is failing. Let me explain the functionality.
a) There are two searches on the screen, SearchA and SearchB. When I perform SearchA it gives me some results and I display them on the screen.
b) When I perform SearchB, the application searches the data and displays it on the screen.
Expected result:
When SearchB is performed, both search results should be shown in the application.
Present result:
When SearchB is performed, the search result of SearchA disappears and only SearchB's result is displayed, and vice versa.
Please don't suggest placing the search results in the session; it's a huge amount of data (millions of records). So please suggest some other apt implementation.
I am assuming that both searches cause the page to be reloaded, and with the search results being request-scoped, the first response gets wiped out when you search the second time (SearchB).
You could paint the result portion of the page using Ajax instead of full page refresh, thereby retaining the search results.
Say you have 2 divs, SearchAResults and SearchBResults. SearchA will perform an Ajax request and populate the SearchAResults div once it gets results from the server. There is no page load at this time. Similarly, SearchB will just paint SearchBResults.
Also, if memory (and not design) is the only criterion stopping you from storing results in the session, and the search results need to be reused, you could look into external cache solutions like Memcached, whereby dedicated servers handle this humongous data.
I've developed a module with the Spring framework, and for the view I've used some Spring JSTL tags like <form:hidden>.
I have a table on the JSP which I store using an ArrayList.
Now when I do some other action, I have to maintain the table, and since we are not using Ajax (the client doesn't want it!), what I've done is put all the list elements one by one into <form:hidden>. Now every time I select one of the elements of the list, I have to maintain the list, and that is taken care of via the tag.
But when I go on selecting multiple records one by one, I've noticed (System.out.println("Request Size : " + request.getContentLength())) that the size increases every time, and when it reaches 3MB the system crashes. Is there any way I can increase the size limit of the POST method, in Eclipse or WebSphere? Or is there any way I can clear the request so that the size doesn't increase? Please help.
Instead of using <form:hidden> to submit all the array values, maybe you could use <form:hidden> to submit only the indexes of the elements in the array.
You should maintain the state on the server side, possibly in the HttpSession. Whenever the state changes on the page and has to be committed, only the state changes should be POSTed back to the server. Sending 3 MB worth of data per request will not scale.