Elasticsearch Scan&scroll with JEST API - java

I am currently working with JEST:
https://github.com/searchbox-io/Jest
Is it possible to do scan&scroll with this API?
http://www.elasticsearch.org/guide/reference/api/search/search-type/
I am currently using the Search command:
Search search = new Search("{\"size\" : "+RESULT_SIZE+", \"query\":{\"match_all\":{}}}");
but I am worried about large result sets. If you use the Search command for this, how do you set the "search_type=scan&scroll=10m&size=50" arguments?

Is it possible to do scan&scroll with this API?
Yes, it is. My implementation works like this.
Start the scroll search on elastic search:
public SearchResult startScrollSearch(String type, Long size) throws IOException {
    String query = ConfigurationFactory.loadElasticScript("my_es_search_script.json");
    Search search = new Search.Builder(query)
            // multiple indices or types can be added
            .addIndex("myIndex")
            .addType(type)
            .setParameter(Parameters.SIZE, size)
            .setParameter(Parameters.SCROLL, "1m")
            .build();
    SearchResult searchResult = EsClientConn.getJestClient().execute(search);
    return searchResult;
}
The SearchResult object returns the first (size) items of the search as usual, but also returns a scrollId parameter, which is a reference to the remaining result set that Elasticsearch keeps in memory for you. Parameters.SCROLL defines how long this search context is kept in memory.
To read the scrollId:
scrollId = searchResult.getJsonObject().get("_scroll_id").getAsString();
To read more items from the result set, use something like the following:
public JestResult readMoreFromSearch(String scrollId, Long size) throws IOException {
    SearchScroll scroll = new SearchScroll.Builder(scrollId, "1m")
            .setParameter(Parameters.SIZE, size)
            .build();
    JestResult searchResult = EsClientConn.getJestClient().execute(scroll);
    return searchResult;
}
Don't forget that each time you read from the result set, a new scrollId is returned by Elasticsearch.
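Putting the two methods together, a full drain loop looks roughly like this (a minimal sketch assuming the two methods above; the type name and page size are illustrative):
// Sketch: drain an entire result set using startScrollSearch/readMoreFromSearch.
SearchResult first = startScrollSearch("myType", 50L);
String scrollId = first.getJsonObject().get("_scroll_id").getAsString();
JestResult result = first;
while (result.getJsonObject().getAsJsonObject("hits").getAsJsonArray("hits").size() > 0) {
    // ... process the current page of hits here ...
    result = readMoreFromSearch(scrollId, 50L);
    // A new scrollId comes back with every page; always keep the latest one.
    scrollId = result.getJsonObject().get("_scroll_id").getAsString();
}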
Please let me know if you have any questions.

Agreed, we need to catch up; however, please open an issue if you need a feature.
Please check https://github.com/searchbox-io/Jest/blob/master/jest/src/test/java/io/searchbox/core/SearchScrollIntegrationTest.java at master

It doesn't appear that JEST currently supports the "Scan" search type.
EDIT: In a wicked fast turnaround, it appears that JEST now supports Scan type searches! Props to @Ferhat for the quick turnaround! See JEST - SearchType.java.
Have you considered just using the ElasticSearch Transport client? I could understand if you like the JEST API a little better, but as new features roll out for ElasticSearch (Exhibit A: ElasticSearch 0.90 is fantastic!), you'll get to have them as soon as they pop out instead of waiting for JEST to catch up.
My $0.02.

Related

Convert DirectoryObject to User

Given a query for members of a particular directory role, I would like to return a list of corresponding users. What I have is this:
IDirectoryObjectCollectionWithReferencesRequest request =
        graphServiceClient.directoryRoles(roleId).members().buildRequest();
IDirectoryObjectCollectionWithReferencesPage page = request.select(USER_FIELDS_TO_RETURN).get();
List<DirectoryObject> objects = page.getCurrentPage();
IDirectoryObjectCollectionWithReferencesRequestBuilder builder = page.getNextPage();
while (builder != null) {
    request = builder.buildRequest();
    page = request.select(USER_FIELDS_TO_RETURN).get();
    objects.addAll(page.getCurrentPage());
    builder = page.getNextPage();
}
return objects.stream().filter(o -> o.oDataType.equals("#microsoft.graph.user")).map(o -> new User()).collect(Collectors.toList());
The question lies in the return statement: filter on only the user objects (I couldn't find a more elegant way of doing this than comparing the oDataType) and return User objects populated with the contents of o:
objects.stream().filter(o -> o.oDataType.equals("#microsoft.graph.user")).map(o -> {
    // the only thing that I could think of is to do some weird
    // serialization/deserialization logic here, which is a bad solution
    // for anything other than a small number of elements
}).collect(Collectors.toList());
What is the correct way of converting DirectoryObject to User?
Microsoft Graph does not currently support this requirement.
If you're checking a specific directoryRole, you could come at this from the other direction. The /members endpoint does support filtering by member id:
v1.0/directoryRoles/{role-id}/members?$filter=id eq '{user-id}'
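In the Java SDK used in the question, that filter can presumably be applied on the request builder (a hedged sketch; whether this exact filter() overload is available depends on the SDK version, and userId is a placeholder):
// Hedged sketch: the same $filter expressed with the Java Graph SDK.
// Assumes the members collection request supports .filter() the way it
// supports .select() above.
IDirectoryObjectCollectionWithReferencesPage page = graphServiceClient
        .directoryRoles(roleId)
        .members()
        .buildRequest()
        .filter("id eq '" + userId + "'")
        .get();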
Please check the answers and workarounds provided in this thread. How to get admin roles that I am a member of, from Microsoft Graph using .Net Client SDK?
I know this is an old question, but I had the same problem and found a better solution.
You can actually convert it to a user after you have the list. So if you are iterating through the list:
var myDirectoryList = (List<DirectoryObject>)myRetrievedList;
foreach (var item in myDirectoryList)
{
    var myUser = (User)item;
    Console.WriteLine($"My name is {myUser.GivenName}");
}
Where DirectoryObject is Microsoft.Graph.DirectoryObject and User is Microsoft.Graph.User.
I just had the same problem, so for anyone getting here, this is what I did (I could not find any other simple solution...).
What you call "some weird serialization/deserialization logic" can actually be done using the DefaultSerializer:
private ISerializer serializer = new DefaultSerializer(new DefaultLogger());
...
objects.stream().filter(o -> o.oDataType.equals("#microsoft.graph.user")).map(o -> {
    return serializer.deserializeObject(o.getRawObject().toString(), User.class);
}).collect(Collectors.toList());

Lucene 6 - How to influence ranking with numeric value?

I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well: a blend of the default similarity and a numeric value representing the circulation of their titles. The problem with the default results is that they return authors nobody is interested in, and while I can rank by circulation alone, the top result is then generally not a great match in terms of name. I have been looking for days for a solution to this.
This is how I am building my index:
IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
        new IndexWriterConfig(new StandardAnalyzer()));
writer.deleteAll();
for (Contributor contributor : contributors) {
    Document doc = new Document();
    doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
    doc.add(new StoredField("contribId", contributor.getContribId()));
    doc.add(new NumericDocValuesField("sum", sum));
    writer.addDocument(doc);
}
writer.close();
The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.
Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!
As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.
Using a FunctionScoreQuery
A FunctionScoreQuery can modify a score based on a field.
Let's say you are usually performing a search like this:
Query q = ....; // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results
Then you can modify the query in between like this:
Query q = .....;
// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");
// Create a query, based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);
// Search as usual, but with the modified query
TopDocs hits = searcher.search(modifiedQuery, 10);
This will modify the score based on the value of the field. Sadly, however, there isn't a possibility to control the influence of the DoubleValuesSource (besides scaling the values during indexing) - at least none that I know of.
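If you need to tame the influence with this approach, the scaling therefore has to happen at index time, for example (a sketch reusing the indexing code from the question; the 0.01f factor is purely illustrative, not a recommended value):
// Scale the raw sum when indexing so it doesn't dominate the text score.
long scaledSum = (long) (sum * 0.01f);
doc.add(new NumericDocValuesField("sum", scaledSum));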
To have more control, consider using the CustomScoreQuery.
Using a CustomScoreQuery
Using this kind of query will allow you to modify a score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:
doc.add(new StoredField("sum", sum));
Then we will have to create our very own query class:
private static class MyScoreQuery extends CustomScoreQuery {
    public MyScoreQuery(Query subQuery) {
        super(subQuery);
    }

    // The CustomScoreProvider is what actually alters the score
    private class MyScoreProvider extends CustomScoreProvider {
        private LeafReader reader;
        private Set<String> fieldsToLoad;

        public MyScoreProvider(LeafReaderContext context) {
            super(context);
            reader = context.reader();
            // We create a HashSet which contains the name of the field
            // which we need. This allows us to retrieve the document
            // with only this field loaded, which is a lot faster.
            fieldsToLoad = new HashSet<>();
            fieldsToLoad.add("sum");
        }

        @Override
        public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
            // Get the result document from the index
            Document doc = reader.document(doc_id, fieldsToLoad);
            // Get boost value from index
            IndexableField field = doc.getField("sum");
            Number number = field.numericValue();
            // This is just an example on how to alter the current score
            // based on the value of "sum". You will have to experiment
            // here.
            float influence = 0.01f;
            float boost = number.floatValue() * influence;
            // Return the new score for this result, based on the
            // original lucene score.
            return currentScore + boost;
        }
    }

    // Make sure that our CustomScoreProvider is being used.
    @Override
    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
        return new MyScoreProvider(context);
    }
}
Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:
Query q = .....;
// Create a query, based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);
// Search as usual, but with the modified query
TopDocs hits = searcher.search(modifiedQuery, 10);
Final remarks
Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember however that the method customScore is called for each search result - so don't perform any expensive computations there, as this would severely slow down the search process.
I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a

Displaying more than 10000 rows using Core Reporting Google API v4 ( Java)

I'm fetching Google Analytics data using Core Reporting API v4. I'm able to capture at most 10,000 records for a given combination of Dimensions & Metrics.
My question is that if my query can produce more than 10,000 search results then how can I fetch all those records? I have gone through the documentation and found that in a single request we can't access more than 10,000 records by setting the properties of ReportRequest object.
ReportRequest request = new ReportRequest()
        .setDateRanges(Arrays.asList(dateRange))
        .setViewId(VIEW_ID)
        .setDimensions(Arrays.asList(dimension))
        .setMetrics(Arrays.asList(metric))
        .setPageSize(10000);
How can we issue multiple requests in a single run, depending on the number of search results?
For example: if my query can return 35,000 records, then there should be 4 requests (10,000, 10,000, 10,000 & 5,000) managed internally.
Please look into this and provide some guidance. Thanks in advance.
The Analytics Core Reporting API returns a maximum of 10,000 rows per request, no matter how many you ask for.
If the request you are making will generate more than 10,000 rows, then there will be additional rows you can request. The response returned from the first request will contain a parameter called nextPageToken, which you can use to request the next set of data.
You will have to dig around the Java library; the only documentation on how to do this that I have found is for HTTP.
POST https://analyticsreporting.googleapis.com/v4/reports:batchGet
{
  "reportRequests": [
    {
      ...
      # Taken from `nextPageToken` of a previous response.
      "pageToken": "XDkjaf98234xklj234",
      "pageSize": "10000",
    }
  ]
}
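For completeness, the same pageToken/nextPageToken pair is exposed in the v4 Java client through ReportRequest.setPageToken and Report.getNextPageToken. A minimal iterative sketch, reusing the view id, dimension and metric variables from the question:
// Iterative pagination sketch for the v4 Java client.
String pageToken = null;
do {
    ReportRequest request = new ReportRequest()
            .setViewId(VIEW_ID)
            .setDateRanges(Arrays.asList(dateRange))
            .setDimensions(Arrays.asList(dimension))
            .setMetrics(Arrays.asList(metric))
            .setPageSize(10000)
            .setPageToken(pageToken);      // null on the very first call
    GetReportsResponse response = service.reports()
            .batchGet(new GetReportsRequest().setReportRequests(Arrays.asList(request)))
            .execute();
    Report report = response.getReports().get(0);
    // ... process report.getData().getRows() here ...
    pageToken = report.getNextPageToken(); // null when there are no more rows
} while (pageToken != null);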
Here's a stable and extensively tested solution in Java. It is a recursive solution that stores every batch of 10,000 results (if any) and calls itself until it finds a null nextToken. In this specific solution, every batch of 10,000 results is saved to a CSV file before the recursive call is performed. Note that the first time this function is called from outside, the pageToken passed in must also be null! Focus on the recursive rationale and the null-value check!
private static int getComplexReport(AnalyticsReporting service, int reportIndex,
        String startDate, String endDate, ArrayList<String> metricNames,
        ArrayList<String> dimensionNames, String pageToken) throws IOException {
    ReportRequest req = createComplexRequest(startDate, endDate, metricNames, dimensionNames, pageToken);
    ArrayList<ReportRequest> requests = new ArrayList<>();
    requests.add(req);
    // Create the GetReportsRequest object.
    GetReportsRequest getReport = new GetReportsRequest().setReportRequests(requests);
    // Call the batchGet method.
    GetReportsResponse response = service.reports().batchGet(getReport).execute();
    //printResponse(response);
    saveBatchToCsvFile("dummy_" + startDate + "_" + endDate + "_" + Integer.toString(reportIndex) + ".csv",
            startDate + "_" + endDate, response, metricNames, dimensionNames);
    String nextToken = response.getReports().get(0).getNextPageToken();
    //System.out.println(nextToken);
    if (nextToken != null)
        return getComplexReport(service, reportIndex + 1, "2016-06-21", "2016-06-21",
                metricNames, dimensionNames, nextToken);
    return reportIndex;
}
var reportRequest = new ReportRequest
{
    DateRanges = new List<DateRange> { dateRange },
    Dimensions = new List<Dimension> { date, UserId, DeviceCategory },
    Metrics = new List<Metric> { sessions },
    ViewId = view,
    PageSize = 400000
};

Get IAM users list by java aws sdk

I want to get a list of all IAM users using the AWS Java SDK. The class we are using is AmazonIdentityManagementClient and the method used is listUsers(). The API doc suggests passing the parameters MaxItems and Marker, whereas the method does not recognize those parameters. Can anyone suggest how to do pagination here?
AmazonIdentityManagementClient amazonidentitymanagmentclient = new AmazonIdentityManagementClient();
ListUsersResult listuserresult = new ListUsersResult();
try {
    listuserresult = amazonidentitymanagmentclient.listUsers();
    List<User> listuser = new ArrayList<User>();
    listuser = listuserresult.getUsers(); // need to pass maxitems, marker here
} catch (Exception e) {
    return null;
}
You need to use
ListUsersResult listUsers(ListUsersRequest listUsersRequest)
throws AmazonServiceException,
AmazonClientException
to use the marker feature.
You can set the marker in the ListUsersRequest. You need to get the marker from the result (ListUsersResult) of the previous call to listUsers; ListUsersResult has a getMarker method which returns the marker to be used for the next call. Then create a ListUsersRequest, set the marker with the value obtained from getMarker, and call listUsers again. Do this in a loop until the isTruncated method on the ListUsersResult indicates there are no more elements to return. If you don't set MaxItems, it will return 100 items by default per the documentation. You can set it in your ListUsersRequest to a different value based on how much you want to display in a page. A sketch of the loop follows.
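Put together, the marker loop looks roughly like this (a sketch against the v1 Java SDK; the client construction and MaxItems value are illustrative):
// Marker-based pagination sketch for IAM listUsers (AWS SDK for Java v1).
AmazonIdentityManagementClient client = new AmazonIdentityManagementClient();
List<User> allUsers = new ArrayList<User>();
ListUsersRequest request = new ListUsersRequest().withMaxItems(100);
ListUsersResult result;
do {
    result = client.listUsers(request);
    allUsers.addAll(result.getUsers());
    // getMarker() returns the value to pass on the next call.
    request.setMarker(result.getMarker());
} while (result.isTruncated());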

Delete all files in 'folder' or with prefix in Google Cloud Bucket from Java

I know the idea of 'folders' is sort of non existent or different in Google Cloud Storage, but I need a way to delete all objects in a 'folder' or with a given prefix from Java.
The GcsService has a delete function, but as far as I can tell it only takes one GcsFilename object and does not honor wildcards (i.e., "folderName/**" did not work).
Any tips?
The API only supports deleting a single object at a time. You can only request many deletions using many HTTP requests or by batching many delete requests. There is no API call to delete multiple objects using wildcards or the like. In order to delete all of the objects with a certain prefix, you'd need to list the objects, then make a delete call for each object that matches the pattern.
The command-line utility, gsutil, does exactly that when you ask it to delete the path "gs://bucket/dir/**". It fetches a list of objects matching that pattern, then it makes a delete call for each of them.
If you need a quick solution, you could always have your Java program exec gsutil.
Here is the code that corresponds to the above answer in case anyone else wants to use it:
public void deleteFolder(String bucket, String folderName) throws CouldNotDeleteFile {
    try {
        ListResult list = gcsService.list(bucket, new ListOptions.Builder()
                .setPrefix(folderName).setRecursive(true).build());
        while (list.hasNext()) {
            ListItem item = list.next();
            gcsService.delete(new GcsFilename(bucket, item.getName()));
        }
    } catch (IOException e) {
        // Error handling
    }
}
Extremely late to the party, but here's for current google searches. We can delete multiple blobs efficiently by leveraging com.google.cloud.storage.StorageBatch.
Like so:
public static void rmdir(Storage storage, String bucket, String dir) {
    StorageBatch batch = storage.batch();
    Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
            Storage.BlobListOption.prefix(dir));
    for (Blob blob : blobs.iterateAll()) {
        batch.delete(blob.getBlobId());
    }
    batch.submit();
}
This should run MUCH faster than deleting one by one when your bucket/folder contains a non-trivial number of items.
Edit: since this is getting a little attention, I'll demo error handling:
public static boolean rmdir(Storage storage, String bucket, String dir) {
    List<StorageBatchResult<Boolean>> results = new ArrayList<>();
    StorageBatch batch = storage.batch();
    Page<Blob> blobs = storage.list(bucket, Storage.BlobListOption.currentDirectory(),
            Storage.BlobListOption.prefix(dir));
    for (Blob blob : blobs.iterateAll()) {
        results.add(batch.delete(blob.getBlobId()));
    }
    // The batched results are only populated once submit() has run.
    batch.submit();
    return results.stream().allMatch(r -> r != null && r.get());
}
This method will delete every blob in the given folder of the given bucket, returning true on success and false otherwise. One can look into the return value of batch.delete() for a better understanding and error-proofing.
To ensure ALL items are deleted, you could call this like:
boolean success = false;
while (!success) {
    success = rmdir(storage, bucket, dir);
}
I realise this is an old question, but I just stumbled upon the same issue and found a different way to resolve it.
The Storage class in the Google Cloud Java Client for Storage includes a method to list the blobs in a bucket, which can also accept an option to set a prefix to filter results to blobs whose names begin with the prefix.
For example, deleting all the files with a given prefix from a bucket can be achieved like this:
Storage storage = StorageOptions.getDefaultInstance().getService();
Iterable<Blob> blobs = storage.list("bucket_name", Storage.BlobListOption.prefix("prefix")).iterateAll();
for (Blob blob : blobs) {
    blob.delete(Blob.BlobSourceOption.generationMatch());
}
