Getting all records from Elasticsearch using Java API

I am trying to get all the records from Elasticsearch using the Java API, but I receive the error below:
[Wild Thing][localhost:9300][indices:data/read/search[phase/dfs]];
nested: QueryPhaseExecutionException[Result window is too large, from + size must be less than or equal to: [10000] but was [10101]]
My code is as follows:
Client client;
try {
    client = TransportClient.builder().build()
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));
    int from = 1;
    int to = 100;
    while (from <= 131881) {
        SearchResponse response = client
            .prepareSearch("demo_risk_data")
            .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
            .setFrom(from)
            .setQuery(QueryBuilders.boolQuery().mustNot(QueryBuilders.termQuery("user_agent", "")))
            .setSize(to)
            .setExplain(true)
            .execute().actionGet();
        if (response.getHits().getHits().length > 0) {
            for (SearchHit searchData : response.getHits().getHits()) {
                JSONObject value = new JSONObject(searchData.getSource());
                System.out.println(value.toString());
            }
        }
        from += to; // advance the window by one page (missing in the original paste)
    }
}
The total number of records currently present is 131881, so I start with from = 1 and to = 100 and fetch 100 records at a time while from <= 131881. Is there a way I can fetch records in sets of, say, 100 until there are no further records in Elasticsearch?

Yes, you can do so using the scroll API, which the Java client also supports.
You can do it like this:
Client client;
try {
    client = TransportClient.builder().build()
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));
    QueryBuilder qb = QueryBuilders.boolQuery().mustNot(QueryBuilders.termQuery("user_agent", ""));
    SearchResponse scrollResp = client.prepareSearch("demo_risk_data")
        .addSort(SortParseElement.DOC_FIELD_NAME, SortOrder.ASC) // sort by _doc: the most efficient order for scrolling
        .setScroll(new TimeValue(60000))
        .setQuery(qb)
        .setSize(100)
        .execute().actionGet();
    // Scroll until no hits are returned
    while (true) {
        // Break condition: no hits are returned
        if (scrollResp.getHits().getHits().length == 0) {
            break;
        }
        // otherwise read this page of results
        for (SearchHit hit : scrollResp.getHits().getHits()) {
            JSONObject value = new JSONObject(hit.getSource());
            System.out.println(value.toString());
        }
        // fetch the next page
        scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    }
}
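Once the loop exits (or if you stop scrolling early), it is good practice to release the scroll context on the server instead of letting it wait out the 60-second keep-alive. A small addition along these lines should work with the same 2.x client; this is a sketch reusing the client and scrollResp variables from above:

// Explicitly free the server-side scroll context rather than
// waiting for the keep-alive to expire.
ClearScrollResponse clearResp = client.prepareClearScroll()
    .addScrollId(scrollResp.getScrollId())
    .execute().actionGet();
System.out.println("Scroll context cleared: " + clearResp.isSucceeded());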

Related

How to read huge data from ES 1.7 to index into ES 6.7

I need to read data from ES 1.7 and index it into ES 6.7.
As there is no direct upgrade path available, almost 5 TB of data (about 200 million records) needs to be reindexed. We are using the REST high-level client (6.7.2) with the search-and-scroll approach, but we are not able to scroll using the scroll id. The other approach we tried uses from with a batch size; initially the reads are fast, but as the from offset increases the reads become really slow. What is the best approach?
1st approach: search and scroll.
SearchRequest searchRequest = new SearchRequest("indexname"); // declaration missing from the original snippet
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.size(10);
searchRequest.source(searchSourceBuilder);
searchRequest.scroll(TimeValue.timeValueMinutes(2));
SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
String scrollId = searchResponse.getScrollId();
boolean run = true;
while (run) {
    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueSeconds(60));
    SearchResponse searchScrollResponse = client.scroll(scrollRequest, RequestOptions.DEFAULT);
    scrollId = searchScrollResponse.getScrollId();
    SearchHits hits = searchScrollResponse.getHits();
    if (hits.getHits().length == 0) {
        run = false;
    }
}
Exception
Exception in thread "main" ElasticsearchStatusException[Elasticsearch exception [type=exception, reason=ElasticsearchIllegalArgumentException[Failed to decode scrollId]; nested: IOException[Bad Base64 input character decimal 123 in array position 0]; ]]
at org.elasticsearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:177)
at org.elasticsearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2050)
at org.elasticsearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2026)
:
2nd approach: paging with from and size.
int offset = 0;
boolean run = true;
while (run) {
    SearchRequest searchRequest = new SearchRequest("indexname");
    SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
    searchSourceBuilder.from(offset);
    searchSourceBuilder.size(500);
    searchRequest.source(searchSourceBuilder);
    long start = System.currentTimeMillis();
    SearchResponse searchResponse = client.search(searchRequest, RequestOptions.DEFAULT);
    long end = System.currentTimeMillis();
    SearchHits hits = searchResponse.getHits();
    System.out.println(" Total hits : " + hits.totalHits + " time : " + (end - start));
    offset += 500;
    if (hits.getHits().length == 0) {
        run = false;
    }
}
Is there any other approach to read the data?
Generally the best solution would be a remote reindex: https://www.elastic.co/guide/en/elasticsearch/reference/6.7/docs-reindex.html#reindex-from-remote
I'm not sure the REST clients are still compatible with 1.x, while remote reindex explicitly supports it. Your scroll error looks like exactly such an incompatibility: the 6.x client sends the scroll id wrapped in a JSON body, and a 1.7 server tries to Base64-decode that body directly (decimal 123 is the '{' character).
Deep pagination is very expensive, which is why it should be avoided; your second approach shows exactly why, as reads get slower the deeper the offset goes.
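If you go the remote-reindex route, the request is issued against the new 6.7 cluster, which pulls documents from the old 1.7 cluster (the old host must be listed in reindex.remote.whitelist on the new cluster). Here is a rough sketch using the low-level REST client; the host and index names are placeholders:

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

RestClient restClient = RestClient.builder(new HttpHost("new-cluster-host", 9200, "http")).build();

// POST _reindex on the 6.7 cluster; it pulls batches from the remote 1.7 cluster.
// "size" is the per-batch scroll size used while copying.
Request request = new Request("POST", "/_reindex");
request.setJsonEntity(
    "{" +
    "  \"source\": {" +
    "    \"remote\": { \"host\": \"http://old-cluster-host:9200\" }," +
    "    \"index\": \"indexname\"," +
    "    \"size\": 1000" +
    "  }," +
    "  \"dest\": { \"index\": \"indexname\" }" +
    "}");
Response response = restClient.performRequest(request);
System.out.println(EntityUtils.toString(response.getEntity()));
restClient.close();

For 5 TB you would likely add ?wait_for_completion=false to the endpoint and track progress through the Tasks API rather than holding one HTTP request open.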

TFS JAVA SDK - How to run shared query

I have an application which uses the TFS Java SDK 14.0.3.
I have a shared query on my TFS server; how can I run the shared query and get the response back using TFS SDK 14.0.3?
Also, I could see that the query URL expires every 90 days, so is there a better way to execute the shared query?
Right now I have a method to run a WIQL query; I want a method to run a shared query as well.
public void getWorkItem(TFSTeamProjectCollection tpc, Project project) {
    WorkItemClient workItemClient = project.getWorkItemClient();
    // Define the WIQL query.
    String wiqlQuery = "Select ID, Title, Assigned from WorkItems where (State = 'Active') order by Title";
    // Run the query and get the results.
    WorkItemCollection workItems = workItemClient.query(wiqlQuery);
    System.out.println("Found " + workItems.size() + " work items.");
    System.out.println();
    // Write out the heading.
    System.out.println("ID\tTitle");
    // Output the first few results of the query, allowing the TFS SDK
    // to page in data as required.
    final int maxToPrint = 5;
    for (int i = 0; i < workItems.size(); i++) {
        if (i >= maxToPrint) {
            System.out.println("[...]");
            break;
        }
        WorkItem workItem = workItems.getWorkItem(i);
        System.out.println(workItem.getID() + "\t" + workItem.getTitle());
    }
}
A shared query is a query which has been run and saved, so what you need first is to get the shared query's definition, then run its stored query text. You could refer to the case Access TFS Team Query from Client Object API (C#):
/// Handles nested query folders
private static Guid FindQuery(QueryFolder folder, string queryName)
{
    foreach (var item in folder)
    {
        if (item.Name.Equals(queryName, StringComparison.InvariantCultureIgnoreCase))
        {
            return item.Id;
        }
        var itemFolder = item as QueryFolder;
        if (itemFolder != null)
        {
            var result = FindQuery(itemFolder, queryName);
            if (!result.Equals(Guid.Empty))
            {
                return result;
            }
        }
    }
    return Guid.Empty;
}

static void Main(string[] args)
{
    var collectionUri = new Uri("http://TFS/tfs/DefaultCollection");
    var server = new TfsTeamProjectCollection(collectionUri);
    var workItemStore = server.GetService<WorkItemStore>();
    var teamProject = workItemStore.Projects["TeamProjectName"];
    var x = teamProject.QueryHierarchy;
    var queryId = FindQuery(x, "QueryNameHere");
    var queryDefinition = workItemStore.GetQueryDefinition(queryId);
    var variables = new Dictionary<string, string>() { { "project", "TeamProjectName" } };
    var result = workItemStore.Query(queryDefinition.QueryText, variables);
}
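If you need the same thing in Java rather than C#, the TEE SDK exposes a similar query hierarchy. The sketch below is untested: the types (QueryHierarchy, QueryFolder, QueryItem, QueryDefinition from com.microsoft.tfs.core.clients.workitem.queryhierarchy) and method names should be verified against your 14.0.3 SDK, so treat it as an outline rather than confirmed API:

// Hypothetical sketch: walk the query hierarchy to find a shared query
// by name, then run its stored WIQL text. Verify these types/methods
// against your SDK version before relying on them.
private static QueryDefinition findQuery(QueryFolder folder, String queryName) {
    for (QueryItem item : folder.getItems()) {
        if (item instanceof QueryDefinition && item.getName().equalsIgnoreCase(queryName)) {
            return (QueryDefinition) item;
        }
        if (item instanceof QueryFolder) {
            QueryDefinition found = findQuery((QueryFolder) item, queryName);
            if (found != null) {
                return found;
            }
        }
    }
    return null;
}

public void runSharedQuery(Project project, String queryName) {
    WorkItemClient workItemClient = project.getWorkItemClient();
    QueryDefinition definition = findQuery(project.getQueryHierarchy(), queryName);
    if (definition != null) {
        // Stored query text may contain macros such as @project, which need
        // a query context or manual substitution before running.
        WorkItemCollection workItems = workItemClient.query(definition.getQueryText());
        System.out.println("Found " + workItems.size() + " work items.");
    }
}

Because this looks the query up by name rather than by URL, it also sidesteps the 90-day query-URL expiry mentioned in the question.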
By the way, you could also check the REST API in the following link:
https://learn.microsoft.com/en-us/rest/api/vsts/wit/queries/get

Neo4j Java bolt driver: how to convert the result to Json?

I am using the Java Bolt driver (1.0.1) and I am wondering whether there is a way to convert the result to JSON (possibly in the same format as the REST API)?
I tried to use Gson in this way:
Result r = null;
try ( Transaction tx = graphDb.beginTx() ) {
    r = graphDb.execute("MATCH...");
    tx.success();
} catch (Exception e) { /* ... */ }
new Gson().toJson(r);
but what I get is:
java.lang.StackOverflowError
at com.google.gson.internal.$Gson$Types.canonicalize($Gson$Types.java:98)
at com.google.gson.reflect.TypeToken.<init>(TypeToken.java:72)
etc...
The API you show is not the Bolt driver; it's the embedded Java API. With the Bolt driver you can do:
Driver driver = GraphDatabase.driver( "bolt://localhost", AuthTokens.basic( "neo4j", "neo4j" ) );
Session session = driver.session();
StatementResult result = session.run( "MATCH (a:Person) WHERE a.name = 'Arthur' RETURN a.name AS name, a.title AS title" );
Gson gson = new Gson();
while ( result.hasNext() ) {
    Record record = result.next();
    System.out.println(gson.toJson(record.asMap()));
}
session.close();
driver.close();
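If you would rather end up with a single JSON array for the whole result instead of one JSON document per record, collect the records first. A small variation on the loop above (java.util List/ArrayList/Map assumed; same driver and Gson as before):

// Collect every record as a Map, then serialize the list once
// so the output is one JSON array.
List<Map<String, Object>> rows = new ArrayList<>();
while (result.hasNext()) {
    rows.add(result.next().asMap());
}
String json = new Gson().toJson(rows);
System.out.println(json);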
I am developing an app in Flask and need to do the same, then put it into a response, but in Python. I'm using jsonify instead of Gson. Any suggestions? Code right here:
@concepts_api.route('/concepts', methods=['GET'])
def get_concepts_of_conceptgroup():
    try:
        _json = request.json
        _group_name = _json['group_name']
        if _group_name and request.method == 'GET':
            rows = concepts_service.get_concepts_of_conceptgroup(_group_name)
            resp = jsonify(rows)
            resp.status_code = 200
            return resp
        return not_found()
    except:
        message = {
            'status': 500,
            'message': 'Error: Impossible to get concepts of conceptgroup.',
        }
        resp = jsonify(message)
        resp.status_code = 500
        return resp

Fetching all the document URIs in MarkLogic using the Java Client API

I am trying to fetch all the documents from a database without knowing the exact URIs. I found this snippet:
DocumentPage documents = docMgr.read();
while (documents.hasNext()) {
    DocumentRecord document = documents.next();
    System.out.println(document.getUri());
}
But I do not have specific URIs; I want all the documents.
The first step is to enable your uris lexicon on the database.
You could eval some XQuery and run cts:uris() (or server-side JS and run cts.uris()):
ServerEvaluationCall call = client.newServerEval()
    .xquery("cts:uris()");
for ( EvalResult result : call.eval() ) {
    String uri = result.getString();
    System.out.println(uri);
}
Two drawbacks are: (1) you'd need a user with privileges to run eval, and (2) there is no pagination.
If you have a small number of documents, you don't need pagination. But for a large number of documents pagination is recommended. Here's some code using the search API and pagination:
// do the next eight lines just once
String options =
    "<options xmlns='http://marklogic.com/appservices/search'>" +
    "  <values name='uris'>" +
    "    <uri/>" +
    "  </values>" +
    "</options>";
QueryOptionsManager optionsMgr = client.newServerConfigManager().newQueryOptionsManager();
optionsMgr.writeOptions("uriOptions", new StringHandle(options));

// run the following each time you need to list all uris
QueryManager queryMgr = client.newQueryManager();
long pageLength = 10000;
queryMgr.setPageLength(pageLength);
ValuesDefinition query = queryMgr.newValuesDefinition("uris", "uriOptions");
// the following "and" query just matches all documents
query.setQueryDefinition(new StructuredQueryBuilder().and());
int start = 1;
boolean hasMore = true;
Transaction transaction = client.openTransaction();
try {
    while ( hasMore ) {
        CountedDistinctValue[] uriValues =
            queryMgr.values(query, new ValuesHandle(), start, transaction).getValues();
        for (CountedDistinctValue uriValue : uriValues) {
            String uri = uriValue.get("string", String.class);
            //System.out.println(uri);
        }
        start += uriValues.length;
        // this is the last page if uriValues is smaller than pageLength
        hasMore = uriValues.length == pageLength;
    }
} finally {
    transaction.commit();
}
The transaction is only necessary if you need a guaranteed "snapshot" list isolated from adds/deletes happening concurrently with this process. Since it adds some overhead, feel free to remove it if you don't need such exactness.
Another option: find out the page length, and specify the starting point of each search via queryMgr. Keep increasing the starting point and loop through all the URIs. I was able to fetch all URIs this way. It may not be the cleanest approach, but it works.
List<String> uriList = new ArrayList<>();
QueryManager queryMgr = client.newQueryManager();
StructuredQueryBuilder qb = new StructuredQueryBuilder();
StructuredQueryDefinition querydef = qb.and(qb.collection("xxxx"), qb.collection("whatever"), qb.collection("whatever")); // matches 241152 documents
SearchHandle results = queryMgr.search(querydef, new SearchHandle(), 10);
long pageLength = results.getPageLength();
long totalResults = results.getTotalResults();
System.out.println("Total results: " + totalResults);
// step the 1-based start position forward one page at a time
// (the original loop compared the counter against totalResults / pageLength
// while incrementing by pageLength, which exits far too early)
for (long start = 1; start <= totalResults; start += pageLength) {
    System.out.println("Printing results from: " + start + " to: " + (start + pageLength - 1));
    results = queryMgr.search(querydef, new SearchHandle(), start);
    MatchDocumentSummary[] summaries = results.getMatchResults(); // one page of results
    for (MatchDocumentSummary summary : summaries) {
        uriList.add(summary.getUri());
    }
    if (start >= 1000) { // optional cap on how many URIs to retrieve
        break;
    }
}
uriList = uriList.stream().distinct().collect(Collectors.toList());
return uriList;

How to disable page query in Spring-data-elasticsearch

I use the spring-data-elasticsearch framework to get query results from an Elasticsearch server; the Java code looks like this:
public void testQuery() {
    SearchQuery searchQuery = new NativeSearchQueryBuilder()
        .withFields("createDate", "updateDate")
        .withQuery(matchAllQuery())
        .withPageable(new PageRequest(0, Integer.MAX_VALUE))
        .build();
    List<Entity> list = template.queryForList(searchQuery, Entity.class);
    for (Entity e : list) {
        System.out.println(e.getCreateDate());
        System.out.println(e.getUpdateDate());
    }
}
The raw query logged on the server looks like this:
{"from":0,"size":10,"query":{"match_all":{}},"fields":["createDate","updateDate"]}
As the query log shows, spring-data-elasticsearch adds a size limit to the query ("from":0,"size":10). How can I prevent it from adding the size limit?
You don't want to disable paging altogether. You could use the findAll functionality on a repository, which returns an Iterable, but I think the best way to obtain all items is the scan/scroll functionality. Maybe the following code block can put you in the right direction:
SearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(QueryBuilders.matchAllQuery())
    .withIndices("customer")
    .withTypes("customermodel")
    .withSearchType(SearchType.SCAN)
    .withPageable(new PageRequest(0, NUM_ITEMS_PER_SCROLL))
    .build();
String scrollId = elasticsearchTemplate.scan(searchQuery, SCROLL_TIME_IN_MILLIS, false);
boolean hasRecords = true;
while (hasRecords) {
    Page<CustomerModel> page = elasticsearchTemplate.scroll(scrollId, SCROLL_TIME_IN_MILLIS, CustomerModel.class);
    if (page != null) {
        // do something with the records
        hasRecords = (page.getContent().size() == NUM_ITEMS_PER_SCROLL);
    } else {
        hasRecords = false;
    }
}
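As for the findAll route mentioned above, a minimal sketch (the repository interface, entity, and repository variable are placeholders; Spring Data generates the implementation):

// Hypothetical repository; Spring Data provides findAll() automatically
public interface CustomerModelRepository extends ElasticsearchRepository<CustomerModel, String> {
}

// returns every document as an Iterable, without hand-rolled paging
Iterable<CustomerModel> all = repository.findAll();
for (CustomerModel customer : all) {
    // process each document
}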
