Query past the 500 limit in Gerrit REST API - java

I'm trying to get 2000 change results from a specific branch with a query request using Gerrit REST API in Java. The problem is that I'm only getting 500 results no matter what I add to the query search.
I have tried the options listed here but I'm not getting the 2000 results that I need. I also read that an admin can increase this limit but would prefer a method that doesn't require this detour.
So what I'm wondering is:
Is it possible to increase the limit without the need to contact the admin?
If not, is it possible to continue/repeat the query in order to get the remaining 1500 results that I want, using a loop that queries the next 500 results after the previous batch until I finally have 2000 results in total?

When using the list changes REST API, the results are returned as a list of ChangeInfo Elements. If there are more results than were returned, the last entry in that list will have a _more_changes field with value true. You can then query again and set the start option to skip over the ones that you've already received.
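To make this concrete, here is a minimal Java sketch of that pagination loop, assuming the Java 11 HTTP client and Gson for JSON parsing; the base URL, query string and page size are placeholders to adapt to your instance. It relies on Gerrit's n (limit) and S (start) query parameters and on the XSSI prefix )]}' that Gerrit prepends to JSON responses.

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class GerritChangeCrawler {
    public static void main(String[] args) throws Exception {
        // Placeholder host and query; adjust to your Gerrit server and branch.
        String base = "https://gerrit.example.com";
        String query = "branch:my-branch+status:merged";
        int pageSize = 500;
        int wanted = 2000;

        HttpClient client = HttpClient.newHttpClient();
        List<JsonObject> changes = new ArrayList<>();
        int start = 0;
        boolean more = true;

        while (more && changes.size() < wanted) {
            String url = base + "/changes/?q=" + query + "&n=" + pageSize + "&S=" + start;
            HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofString());

            // Gerrit prefixes JSON bodies with )]}' to prevent XSSI; strip it before parsing.
            String body = resp.body().replaceFirst("^\\)\\]\\}'", "");
            JsonArray page = JsonParser.parseString(body).getAsJsonArray();
            for (int i = 0; i < page.size(); i++) {
                changes.add(page.get(i).getAsJsonObject());
            }

            // _more_changes is only set on the last entry when the server truncated the list.
            more = page.size() > 0
                    && page.get(page.size() - 1).getAsJsonObject().has("_more_changes");
            start += page.size();
        }
        System.out.println("Fetched " + changes.size() + " changes");
    }
}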

I want to add a minor workaround to David's great answer.
If you want to crawl Gerrit instances hosted on Google servers (such as Android, Chromium, Golang), you will notice that they block queries with more than 10000 results. You can check this e.g. with
curl "https://android-review.googlesource.com/changes/?q=status:closed&S=10000"
I solved the problem by splitting the list of changes into time windows using after: and before: in the query string, for example:
_url_/changes/?q=after:{2018-01-01 00:00:00.000} AND before:{2018-01-01 00:59:59.999}
_url_/changes/?q=after:{2018-01-01 01:00:00.000} AND before:{2018-01-01 01:59:59.999}
_url_/changes/?q=after:{2018-01-01 02:00:00.000} AND before:{2018-01-01 02:59:59.999}
and so on. I think you get the idea. ;-) Please note that both limits (before: and after:) are inclusive! For each time window I use the pagination described by David.
A nice side effect is that you can track the progress of the crawling.
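If you want to generate those hourly windows programmatically, a small java.time sketch like the one below produces the after:/before: pairs; the date range and one-hour step are just example values, and you still need to URL-encode the query before appending it to /changes/?q=.

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimeWindowQueries {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");
        // Example crawl range: one day, split into one-hour windows.
        LocalDateTime start = LocalDateTime.of(2018, 1, 1, 0, 0);
        LocalDateTime end = LocalDateTime.of(2018, 1, 2, 0, 0);

        for (LocalDateTime from = start; from.isBefore(end); from = from.plusHours(1)) {
            // Subtract one millisecond so consecutive inclusive windows don't overlap.
            LocalDateTime to = from.plusHours(1).minusNanos(1_000_000);
            String q = "after:{" + fmt.format(from) + "} AND before:{" + fmt.format(to) + "}";
            System.out.println(q);
        }
    }
}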
I wrote a small Python tool named "Gerry" to crawl open source instances. Feel free to use it, adapt it, and send me pull requests!

I had almost the same problem. There is really no way around it: as you mentioned, either an admin increases the query limit, or you fire the REST query in a loop with a counter. I suggest you follow the second approach and fire the query in a loop with a counter set; that's how I implemented the REST client in Java.

Related

Is it better to count in a server-side API using Java stream() than using count query calls repeatedly in Spring JPA

I want to count the number of rows in a table three times, depending on three filters/conditions. I want to know which of the following two ways is better for performance and cost-efficiency. We are using AWS as our server, Java Spring for the server-side API, and MySQL for the database.
Use the COUNT feature of MySQL and query the database three times, once per filtering criterion, to get the three count results.
Fetch all the rows of the table from the database first using only one query. Then use Java streams three times, based on the three filtering criteria, to get the three count results.
It'll be better to go with option (1), even in extreme cases. If it's slow to execute SELECT COUNT(*) FROM table, then you should consider some tweaks on the SQL side. Not sure which database you're using, but I found this example for SQL Server.
Assuming you go with option (2) and you have hundreds of thousands of rows, I suspect that your application will run out of memory (especially under high load) before you have time to worry about the slow response time of running SELECT COUNT(*). Not to mention that you'll fetch lots of unnecessary rows and slow down the transfer between the database and the application.
A basic argument against doing counts in the app is that hauling lots of data from the server to the client is time-consuming. (There are rare situations where it is worth the overhead.) Note that your client and AWS may be quite some distance apart, thereby exacerbating the cost of shoveling lots of data. I am skeptical of what you call "server-side API". But even if you can run Java on the server, there is still some cost of shoveling between MySQL and Java.
Sometimes this pattern lets you get 3 counts with one pass over the data:
SELECT
SUM(status='ready') AS ready_count,
SUM(status='complete') AS completed_count,
SUM(status='unk') AS unknown_count,
...
The trick here is that a Boolean expression has a value of 0 (for false) or 1 (for true). Hence the SUM() works like a 'conditional count'.
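If you go with the single-pass SQL above, reading the three counts from Java is a one-row query. Here is a rough JDBC sketch; the JDBC URL, credentials, and the table name my_table are assumptions you would replace with your own.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StatusCounts {
    public static void main(String[] args) throws Exception {
        // Assumed JDBC URL, credentials and table name.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                "SELECT SUM(status='ready') AS ready_count, " +
                "       SUM(status='complete') AS completed_count, " +
                "       SUM(status='unk') AS unknown_count " +
                "FROM my_table");
             ResultSet rs = ps.executeQuery()) {
            if (rs.next()) {
                // One round trip, one row: all three counts at once.
                long ready = rs.getLong("ready_count");
                long completed = rs.getLong("completed_count");
                long unknown = rs.getLong("unknown_count");
                System.out.printf("ready=%d complete=%d unknown=%d%n", ready, completed, unknown);
            }
        }
    }
}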

Split a big Jira-Rest-Request

I'm looking for an opportunity to split a big request like:
rest/api/2/search?jql=(project in (project1, project2, project3....project10)) AND issuetype = Bug AND (component not in (projectA, projectB) OR component = EMPTY). The result will contain > 500 bugs, so it's very, very slow. I want to get them with multiple requests (the method performing the request will be annotated with @Asynchronous), but the JQL needs to stay the same. I don't want to search separately for project1, project2, ..., project10. It would be nice if someone had an idea for resolving my problem.
Thank you :)
You need to calculate pagination. First get the metadata.
rest/api/2/search?jql=[complete search query]&fields=*none&maxResults=0
you should get something like this:
{"startAt":0,"maxResults":0,"total":100,"issues":[]}
so completely without fields, just pagination metadata.
Then create search URIs like this:
rest/api/2/search?jql=[complete search query]&startAt=0&maxResults=10
rest/api/2/search?jql=[complete search query]&startAt=10&maxResults=10
..etc
Beware that the data may change between requests, so be prepared that you won't receive all of it; also, pagination metadata (especially "total") may not be present if its calculation is expensive. More: Paged API
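As a rough Java sketch of that two-step approach (first fetch only the pagination metadata with maxResults=0, then page through with startAt), assuming the Java 11 HTTP client, basic auth, and Gson for parsing; the base URL, credentials, JQL, and page size are placeholders.

import com.google.gson.JsonObject;
import com.google.gson.JsonParser;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JiraPagedSearch {
    public static void main(String[] args) throws Exception {
        // Assumed base URL, credentials and JQL; replace with your own.
        String base = "https://jira.example.com";
        String jql = URLEncoder.encode("project = FOO AND issuetype = Bug", StandardCharsets.UTF_8);
        String auth = "Basic " + Base64.getEncoder()
                .encodeToString("user:password".getBytes(StandardCharsets.UTF_8));
        HttpClient client = HttpClient.newHttpClient();

        // Step 1: only pagination metadata, no fields, no issues.
        JsonObject meta = get(client, base + "/rest/api/2/search?jql=" + jql
                + "&fields=*none&maxResults=0", auth);
        int total = meta.get("total").getAsInt();

        // Step 2: page through the result set.
        int pageSize = 50;
        for (int startAt = 0; startAt < total; startAt += pageSize) {
            JsonObject page = get(client, base + "/rest/api/2/search?jql=" + jql
                    + "&startAt=" + startAt + "&maxResults=" + pageSize, auth);
            System.out.println("Got " + page.getAsJsonArray("issues").size()
                    + " issues starting at " + startAt);
        }
    }

    private static JsonObject get(HttpClient client, String url, String auth) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", auth).GET().build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        return JsonParser.parseString(resp.body()).getAsJsonObject();
    }
}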
Can you not break it into 2 parts? If you are displaying in a web page, display what you can without a performance hit. If it's a report, then get all objects gradually and show it once completed.
Get the total count for the JQL and just get the minimum information needed for step 2 - assume it's 900
Use the pagination feature (maxResults=100) and make multiple calls.
Work on each request.
If you don't want to run the two requests at once and need paging of bugs by user request, you can:
Make a request with the 'maxResults' property set to how much you need.
On the next request, set the 'maxResults' property and set 'startAt' to that same value.
If you need to fetch more data, make new request with the same 'maxResults' but update 'startAt' to be the count of bugs you fetched in the previous requests.

What does percolator mean/do in elasticsearch?

Even though I read the Elasticsearch documentation to understand what a percolator is, I still have difficulty understanding what it means and where it is used, in simple terms. Can anyone provide me with more details?
What you usually do is index documents and get them back by querying. What the percolator allows you to do in a nutshell is index your queries and percolate documents against the indexed queries to know which queries they match. It's also called reversed search, as what you do is the opposite to what you are used to.
There are different use cases for the percolator, the first one being any platform that stores users' interests in order to send the right content to the right users as soon as it comes in.
For instance, a user subscribes to a specific topic, and as soon as a new article for that topic comes in, a notification is sent to the interested users. You can express a user's interests as an Elasticsearch query, using the query DSL, and register it in Elasticsearch as if it were a document. Every time a new article is issued, without needing to index it, you can percolate it to know which users are interested in it. At this point you know who needs to receive a notification containing the article link (sending the notification is not done by Elasticsearch, though). An additional step would be to index the content itself, but that is not required.
Have a look at this presentation to see a couple of other use cases and other features available in combination with the percolator, starting from Elasticsearch 1.0.
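To make that workflow concrete, here is a hedged Java sketch of the three REST calls involved (create an index with a percolator field, register a user's interest as a stored query, then percolate an incoming article). The index name "alerts", the document id "user-42" and the field "body" are made up, the request bodies follow the standard Elasticsearch 7.x percolate syntax, and a plain HTTP client is used instead of an official Elasticsearch client.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PercolatorSketch {
    private static final String ES = "http://localhost:9200";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Index with a percolator field plus the document fields the queries refer to.
        send(client, "PUT", ES + "/alerts", "{"
                + "\"mappings\": {\"properties\": {"
                + "\"query\": {\"type\": \"percolator\"},"
                + "\"body\": {\"type\": \"text\"}}}}");

        // 2. Register a user's interest as a stored query.
        send(client, "PUT", ES + "/alerts/_doc/user-42", "{"
                + "\"query\": {\"match\": {\"body\": \"elasticsearch\"}}}");

        // 3. Percolate a new article: which stored queries match this document?
        String hits = send(client, "POST", ES + "/alerts/_search", "{"
                + "\"query\": {\"percolate\": {\"field\": \"query\","
                + "\"document\": {\"body\": \"A new article about Elasticsearch percolation\"}}}}");
        System.out.println(hits);
    }

    private static String send(HttpClient client, String method, String url, String json)
            throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(url))
                .method(method, HttpRequest.BodyPublishers.ofString(json))
                .header("Content-Type", "application/json")
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }
}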
In Simple terms percolator does this:
User: Hey Percolator! How can you help me?
Percolator: Hi User! I can help you get alerts for your interests.
User: That's great! What should I do next?
Percolator: Please let me know your interests in the form of queries indexed in Elasticsearch.
User: I've prepared all my interests as queries and indexed them into Elasticsearch. Is it that simple?
Percolator: Yes! It is that simple! I'll watch all incoming documents and get back to you with the documents that match any of your interests (queries)!
User: That's awesome! I'm just curious and want to know how you can figure out which documents match my interests.
Percolator: That's a good question! The answer is very simple! You indexed your interests as queries into Elasticsearch, right? I run all of those queries (not exactly all, but for simplicity let's assume all) against incoming documents (these docs need not be indexed and can just be sent for percolation!). In fact, this process is called percolation! If any document matches any of your queries, I'll send that result to the client (which could be you)!
Under the hood, a percolate query will take what you want to percolate (e.g. that news article that you want to alert on) and Elasticsearch will create a tiny in-memory index with that document.
You'd have a bunch of registered queries (e.g. one for each user's preferences). Initially, Elasticsearch will pre-filter queries that are likely to match, then run those likely ones. Much like Luwak used to do (now Lucene Monitor).
The rule of thumb, for the alerting use-case at least, is:
have lots of incoming documents and few queries (e.g. alert on logs)? Simply run queries at a scheduled interval
have fewer documents and lots of queries? Then percolate these documents
I've also seen people using percolator to tag documents, but implementing something custom in the indexing pipeline to do that sounds more logical.

How to get the number of results in an App Engine query before actually iterating through them all

In my Google App Engine app I need to fetch and return a potentially large number of entities from a datastore query in response to a service call GET request. This call may return potentially thousands of entities and MBs of serialized data.
The first portion of the response packet communicates how many entities are in the serialized results, followed by all of the serialized entities. Currently I am iterating through all the entities in the query with a QueryResultIterator up to a maximum page size limit, after which I return a cursor from which can be used to continue fetching where the previous call left off (if the maximum was reached and there are still results in the query). As I iterate through the results, I save them in a list. Once I've either exhausted the query results or reached the maximum page size, I can then get the number of entities from the size of this list. But then I have to iterate through this list again to serialize each of the entities and write the results to the response output stream.
I don't know that this is the most efficient method to perform this operation. Is there a way I can get the number of entities in a query's results before actually iterating through them all or fetching them directly into a list? (The list method doesn't work anyway because I'm using cursors, which requires the use of QueryResultIterator).
QueryResultIterator has a method getIndexList(). Would this be a less costly way to get the number of entities in the query's results? I'm assuming this list would contain exactly one index object for each entity in the query's results. Also, I'd need this list to only contain the indexes for the entities after the current cursor position for the iterator. Is my understanding correct, or would this method not do what I think it would?
A list of just indexes would require much less memory than loading a list of whole entities. Although, I don't know if this list would be limited at all by the query's prefetch or chunk sizes, or if I'd want to use the query's limit parameter at all because I would only be interested in knowing how many entities were in the results up to the maximum page size plus one (to know there are still more results and provide a cursor to continue).
Currently I'm setting the prefetch and chunk size (to the size of my page limit), but I'm not using the limit or offset parameters since I'm using cursors instead. From what I understand cursors are preferable to offset/limit. Would setting the limit parameter affect continuing a query with a cursor?
Clearly I have quite a few questions as to how GAE datastore queries work and how they're affected by changing parameters. So any insights are appreciated. The documentation for App Engine APIs is often sparse, as in one sentence descriptions of methods stating pretty much what can be deduced from the method signature. They don't generally go into much detail otherwise. Maybe the way I'm doing it currently is just fine after all. It works as is, but I'm trying to optimize the service call to get the best response time possible for my client application.
UPDATE: By the way, I am using Objectify v3 in my app and to perform this query. There are several places I am required to use the low-level datastore API, including for geo-location queries (with geomodel) and projection queries (which aren't supported in Objectify v3). So if there is a good way to do this using Objectify, that would be ideal. Otherwise I can use the low-level API, but it's always messier this way.
Both the low-level api and Objectify have a count() method (look at the javadocs for details). However, counting can be a very expensive and lengthy operation - it costs 1 small op for every number returned. For example, count() returning 5000 costs 5000 small ops (plus 1 read for the query), and takes as long as it would take to do a keys-only scan of all 5000 (which is what GAE actually does).
If you absolutely must have an exact count, you probably need to aggregate this value yourself by incrementing/decrementing a (possibly sharded) counter. This gets very tricky when you are dealing with filtered queries.
There is no one right solution here. Google searches give you totals like "About 119,000,000 results" which are deliberately inexact and almost certainly precalculated. For smaller result sets, using count() can be acceptable - but you might want to apply a limit() so that you never break the bank. You can always say "More than 500 results..."
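As a small illustration of the "apply a limit()" advice using the low-level API, the sketch below counts at most 501 entities so the cost is bounded and you can still report "More than 500 results...". The entity kind is a parameter and the 501 cutoff is just an example.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.FetchOptions;
import com.google.appengine.api.datastore.Query;

public class BoundedCount {
    public static String describeCount(String kind) {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // Count at most 501 entities so the operation never costs more than 501 small ops.
        int count = ds.prepare(new Query(kind))
                .countEntities(FetchOptions.Builder.withLimit(501));
        return count > 500 ? "More than 500 results..." : count + " results";
    }
}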
If you want to fetch the number of records, you can use the following code:
// Counts all entities of kind "EntityName"; note this does a keys-only scan and costs one small op per entity.
com.google.appengine.api.datastore.Query qry = new com.google.appengine.api.datastore.Query("EntityName");
com.google.appengine.api.datastore.DatastoreService datastoreService = DatastoreServiceFactory.getDatastoreService();
int totalCount = datastoreService.prepare(qry).countEntities(FetchOptions.Builder.withDefaults());
And if you want a filter, you can use:
qry.addFilter("firstName", FilterOperator.EQUAL, firstName);
I hope it will help you.

java - jdbc performance

I need some help from you guys regarding JDBC performance optimization. One of our POJOs is using JDBC to connect to an Oracle database and retrieve records. Basically, the records are email addresses, based on which emails will be sent to the users. The problem here is performance. This process happens every weekend and the records are very large in number, around 100k.
The performance is very slow and it worries us a lot. Only 1000 records seem to be fetched from the database every hour, which means that it will take 100 hours for this process to complete (which is very bad). Please help me on this.
The database server and the Java process are on two different remote servers. We have used rs_email.setFetchSize(1000), hoping it would make a difference, but there was no change at all.
The same query executed on the server takes 0.35 seconds to complete. Any quick suggestion would be of great help to us.
Thanks,
Aamer.
First look at your queries. Analyze them. See if the SQL could be made more efficient (i.e., ask the database for what you want, not for what you don't want; it makes a big difference). Also check whether there are indexes on the fields in your WHERE and JOIN clauses. Indexes make a big difference, but they can't be just any indexes: they have to be good indexes (i.e., the fields that make up the index provide enough uniqueness for the database to retrieve things appropriately). Work with your DBA on this. Look for either high run time against the database or for queries with high CPU usage (even if the queries run sub-second). These are the things that can kill your database.
Also from a code perspective, check to see if you are opening and closing your connections or if you are re-using them. Can make a big difference too.
It would help to post your code, queries, table layouts, and any indexes you have.
Use log4jdbc to get the real SQL for fetching a single record. Then check the speed and the execution plan for that SQL. You may need a proper index or even database defragmentation.
Not sure about the Oracle driver, but I do know that the MySQL driver supports two different results retrieval methods: "stream" and "wait until you've got it all".
The streaming method lets you start processing the results the moment you've got the first row returned from the query, whereas the other method retrieves the entire result set before you can start working on it. In cases where you deal with huge record sets, the latter often leads to memory exceptions or slow performance, because Java hits the "memory roof" and the garbage collector can't throw away "used" records the way it can in streaming mode.
The streaming mode doesn't let you navigate/scroll the result set the way the "normal"/"wait until you've got it all" mode does...
Anyway, not sure if this is of any help but it might be worth checking out.
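For what it's worth, here is a rough sketch of how the two modes are typically selected from Java: the Oracle part just raises the row prefetch via setFetchSize (Oracle's default is 10 rows per round trip), while MySQL's Connector/J streaming mode needs a forward-only, read-only statement plus a fetch size of Integer.MIN_VALUE. The connection URLs, credentials, and the subscribers table are placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class FetchSizeDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder Oracle connection; use your real URL and credentials.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
             Statement st = con.createStatement()) {
            // Oracle: a larger fetch size means fewer network round trips.
            st.setFetchSize(1000);
            try (ResultSet rs = st.executeQuery("SELECT email FROM subscribers")) {
                while (rs.next()) {
                    String email = rs.getString(1);
                    // queue the email for sending...
                }
            }
        }

        // MySQL streaming mode (Connector/J): forward-only, read-only statement,
        // fetch size Integer.MIN_VALUE; rows are streamed instead of buffered in memory.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "password");
             Statement st = con.createStatement(
                     ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            st.setFetchSize(Integer.MIN_VALUE);
            try (ResultSet rs = st.executeQuery("SELECT email FROM subscribers")) {
                while (rs.next()) {
                    String email = rs.getString(1);
                    // process row by row without loading the whole result set
                }
            }
        }
    }
}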
My answer to your question, in summary, is:
1. Check network
2. Check SQL
3. Check Java code.
It sounds very slow. The first thing to check would be whether you have a slow network. You can do this pretty quickly by just pinging the database server. Or run the database server on the same machine as your JVM. If it is not the network, get an explain plan for your SQL and ensure you are not doing table scans when you don't need to be. If it is not the network or the SQL, then it's time to check your Java code. Are you doing anything like blocking when you shouldn't be?
