Elastic search query for getting array of values - java

I have written a query for getting the average of the values at a position in Elasticsearch.
Elasticsearch payload: "userData": [ { "sub": 1234, "value": 678, "condition": "A" }, { "sub": 1234, "value": 678, "condition": "B" } ]
{
"aggs": {
"student_data": {
"date_histogram": {
"field":"#timestamp",
"calendar_interval":"minute"
},
"aggs": {
"user_avg": {
"avg": {
"field":"value"
}
}
}
}
}
}
What I want is the array of elements from which the average is computed.
For example, if the average of the values for condition 'A' is 42, computed from the values {20, 10, 40, 60, 80},
the output should contain a field that holds the array [20, 10, 40, 60, 80].

I don't think you can obtain an array formatted like [20, 10, 40, 60, 80] directly in the response of a query; I can't think of a way to get it with aggregations or scripted fields. Nevertheless, you can easily (1) get that information from the same query that specifies the aggregations and the filter logic, and then (2) post-process the query response to collect all the values of the value field used to calculate the average, formatting them in the way you prefer. How you post-process the response depends on the client/script you use to send queries to Elasticsearch.
For example, you can output the values used to calculate the average as query hits.
{
"size": 100, <-- adjust this upper limit to your use case
"_source": "value", <-- include only the `value` field in the response
"query": {
"match": {
"condition": "A"
}
},
"aggs": {
"user_avg": {
"avg": {
"field": "value"
}
}
}
}
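The post-processing step for this variant can be sketched in plain Java. This is a minimal sketch, assuming the response has already been parsed by whatever JSON library your client uses and that each hit's _source is available as a Map; the class name HitValues, the method collectValues and the sample data are hypothetical, and List.of/Map.of need Java 9 or later.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class HitValues {
    // Collects the `value` field of each hit's _source into a list, in hit order.
    static List<Double> collectValues(List<Map<String, Object>> sources) {
        return sources.stream()
                .map(source -> ((Number) source.get("value")).doubleValue())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical _source objects, as they might look after JSON parsing.
        List<Map<String, Object>> sources = List.of(
                Map.<String, Object>of("value", 20),
                Map.<String, Object>of("value", 10),
                Map.<String, Object>of("value", 40),
                Map.<String, Object>of("value", 60),
                Map.<String, Object>of("value", 80));

        List<Double> values = collectValues(sources);
        double avg = values.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
        System.out.println(values + " avg=" + avg); // [20.0, 10.0, 40.0, 60.0, 80.0] avg=42.0
    }
}
```

Keep in mind that only the first `size` hits are returned, while the avg aggregation covers every matching document, so the array is complete only when `size` is at least the number of matches.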
Or you can output the values used to calculate the average in a more compact way, by using terms aggregations.
{
"size": 0,
"_source": "value",
"query": {
"match": {
"condition": "A"
}
},
"aggs": {
"group_by_values": {
"terms": {
"field": "value",
"size": 100 <-- adjust this upper limit to your use case
}
},
"user_avg": {
"avg": {
"field": "value"
}
}
}
}
The result of the latter will be something like:
"aggregations" : {
"group_by_values" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : 50,
"doc_count" : 2
},
{
"key" : 60,
"doc_count" : 1
},
{
"key" : 100,
"doc_count" : 1
}
]
},
"user_avg" : {
"value" : 65.0
}
}
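If you need the flat array rather than the buckets, the terms result can be expanded on the client. The following is a minimal sketch, assuming the buckets have already been parsed into a key to doc_count map and that each document holds a single value; BucketExpander and expand are hypothetical names. The original document order is not recoverable from buckets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BucketExpander {
    // Repeats each bucket key doc_count times, rebuilding the multiset of values.
    static List<Long> expand(Map<Long, Long> buckets) {
        List<Long> values = new ArrayList<>();
        for (Map.Entry<Long, Long> bucket : buckets.entrySet()) {
            for (long i = 0; i < bucket.getValue(); i++) {
                values.add(bucket.getKey());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        // Buckets taken from the sample response above: key -> doc_count.
        Map<Long, Long> buckets = new LinkedHashMap<>();
        buckets.put(50L, 2L);
        buckets.put(60L, 1L);
        buckets.put(100L, 1L);

        List<Long> values = expand(buckets);
        double avg = values.stream().mapToLong(Long::longValue).average().orElse(Double.NaN);
        System.out.println(values + " avg=" + avg); // [50, 50, 60, 100] avg=65.0
    }
}
```

The recomputed average matches the user_avg value of 65.0 in the sample response, which is a quick sanity check that no bucket was cut off by the terms `size` limit.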

Related

How can I find the number of duplicates for a field using MongoDB Java?

How can I find the number of duplicates in each document in Java-MongoDB?
I have a collection like this.
Collection example:
{
"_id": {
"$oid": "5fc8eb07d473e148192fbecd"
},
"ip_address": "192.168.0.1",
"mac_address": "00:A0:C9:14:C8:29",
"url": "https://people.richland.edu/dkirby/141macaddress.htm",
"datetimes": {
"$date": "2021-02-13T02:02:00.000Z"
}
}
{
"_id": {
"$oid": "5ff539269a10d529d88d19f4"
},
"ip_address": "192.168.0.7",
"mac_address": "00:A0:C9:14:C8:30",
"url": "https://people.richland.edu/dkirby/141macaddress.htm",
"datetimes": {
"$date": "2021-02-12T19:00:00.000Z"
}
}
{
"_id": {
"$oid": "60083d9a1cad2b613cd0c0a2"
},
"ip_address": "192.168.1.5",
"mac_address": "00:0A:05:C7:C8:31",
"url": "www.facebook.com",
"datetimes": {
"$date": "2021-01-24T17:00:00.000Z"
}
}
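Example query:
import com.mongodb.BasicDBObject;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import java.util.Date;

// table1 is the DBCollection that holds the documents above
BasicDBObject whereQuery = new BasicDBObject(); // empty query: matches every document
DBCursor cursor = table1.find(whereQuery);
while (cursor.hasNext()) {
DBObject obj = cursor.next();
String ip_address = (String) obj.get("ip_address");
String mac_address = (String) obj.get("mac_address");
Date datetimes = (Date) obj.get("datetimes");
String url = (String) obj.get("url");
// println takes a single argument, so join the fields into one string
System.out.println(ip_address + ", " + mac_address + ", " + datetimes + ", " + url);
}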
In Java, how can I find out which "url" values are duplicated, and how many times each one occurs?
In MongoDB you can solve this problem with an aggregation pipeline, which you then need to implement with the MongoDB Java driver. The pipeline below returns only the duplicated URLs, together with their duplicate counts.
db.getCollection('table1').aggregate([
{
"$group": {
// group by url and calculate count of duplicates by url
"_id": "$url",
"url": {
"$first": "$url"
},
"duplicates_count": {
"$sum": 1
},
"duplicates": {
"$push": {
"_id": "$_id",
"ip_address": "$ip_address",
"mac_address": "$mac_address",
"url": "$url",
"datetimes": "$datetimes"
}
}
}
},
{ // keep only the groups whose duplicates count is greater than 1
"$match": {
"duplicates_count": {
"$gt": 1
}
}
},
{
"$project": {
"_id": 0
}
}
]);
Output Result:
{
"url" : "https://people.richland.edu/dkirby/141macaddress.htm",
"duplicates_count" : 2.0,
"duplicates" : [
{
"_id" : ObjectId("5fc8eb07d473e148192fbecd"),
"ip_address" : "192.168.0.1",
"mac_address" : "00:A0:C9:14:C8:29",
"url" : "https://people.richland.edu/dkirby/141macaddress.htm",
"datetimes" : {
"$date" : "2021-02-13T02:02:00.000Z"
}
},
{
"_id" : ObjectId("5ff539269a10d529d88d19f4"),
"ip_address" : "192.168.0.7",
"mac_address" : "00:A0:C9:14:C8:30",
"url" : "https://people.richland.edu/dkirby/141macaddress.htm",
"datetimes" : {
"$date" : "2021-02-12T19:00:00.000Z"
}
}
]
}
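To make clear what the $group and $match stages compute, here is the same logic mirrored in plain Java over an in-memory list of URLs. It is only an illustration of the grouping, not a MongoDB Java driver call; DuplicateCounts and duplicateCounts are hypothetical names, and List.of needs Java 9 or later.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class DuplicateCounts {
    // Groups by URL and counts ($group), then drops URLs that occur once ($match with $gt: 1).
    static Map<String, Long> duplicateCounts(List<String> urls) {
        Map<String, Long> counts = urls.stream()
                .collect(Collectors.groupingBy(u -> u, LinkedHashMap::new, Collectors.counting()));
        counts.values().removeIf(count -> count <= 1);
        return counts;
    }

    public static void main(String[] args) {
        // The url values of the three sample documents.
        List<String> urls = List.of(
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "www.facebook.com");
        // Prints {https://people.richland.edu/dkirby/141macaddress.htm=2}
        System.out.println(duplicateCounts(urls));
    }
}
```

Running the grouping inside MongoDB is usually preferable for large collections, because only the duplicated groups travel over the network.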
If I understand your question correctly, you're trying to find the number of duplicate entries for the field url. You could iterate over all your documents and add each url to a Set. A Set stores only unique values, so a value that is already in the Set will not be added again. The difference between the number of documents and the number of entries in the Set is therefore the number of duplicate entries for the given field.
If you wanted to know which URLs are non-unique, you could evaluate the return value from Set.add(Object) which will tell you, whether or not the given value has been in the Set beforehand. If it has, you got yourself a duplicate.
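The Set-based idea can be sketched as follows. This is a minimal sketch, assuming the url values have already been read from the cursor into a List; UrlDuplicates, findDuplicates and countRedundant are hypothetical names.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UrlDuplicates {
    // Returns the URLs that occur more than once, in the order they first repeat.
    static Set<String> findDuplicates(List<String> urls) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String url : urls) {
            // Set.add returns false when the value was already present.
            if (!seen.add(url)) {
                duplicates.add(url);
            }
        }
        return duplicates;
    }

    // Number of documents minus number of distinct URLs.
    static int countRedundant(List<String> urls) {
        return urls.size() - new HashSet<>(urls).size();
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "www.facebook.com");
        System.out.println(findDuplicates(urls)); // [https://people.richland.edu/dkirby/141macaddress.htm]
        System.out.println(countRedundant(urls)); // 1
    }
}
```

This loads every url into memory on the client, so it suits small collections; it does not give per-URL counts, only which URLs repeat and how many entries are redundant in total.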

Elasticsearch composite group by queries across the documents

We have Elasticsearch documents that have a dimension called city. Each document has exactly one value for the city field. I have a scenario where I need to query persons by one or more cities.
Documents in Elasticsearch
{
person_id: "1",
property_value : 25000,
city: "Bangalore"
}
{
person_id: "2",
property_value : 100000,
city: "Bangalore"
}
{
person_id: "1",
property_value : 15000,
city: "Delhi"
}
Note: The aggregation should be performed on property_value, grouped by person_id.
For example:
If I query for Bangalore, it should return the documents with person_id 1 and 2.
If I query for both Delhi and Bangalore, it should return this:
{
person_id: "1",
property_value : 40000,
city: ["Bangalore", "Delhi"]
}
Looking at your data, I've come up with a sample mapping, request query and the response.
Mapping:
PUT my_index_city
{
"mappings": {
"properties": {
"person_id":{
"type": "keyword"
},
"city":{
"type":"text",
"fields":{
"keyword":{
"type": "keyword"
}
}
},
"property_value":{
"type": "long"
}
}
}
}
Sample Request:
Note that I've used a query string query to filter the documents that have Bangalore or Delhi as city.
For the aggregation I've used a Terms Aggregation on person_id with a Sum Aggregation on the property_value field nested inside it; min_doc_count is set to 2 so that only persons matching both cities are returned.
POST my_index_city/_search
{
"size": 0,
"query": {
"query_string": {
"default_field": "city",
"query": "Bangalore Delhi"
}
},
"aggs": {
"my_person": {
"terms": {
"field": "person_id",
"size": 10,
"min_doc_count": 2
},
"aggs": {
"sum_property_value": {
"sum": {
"field": "property_value"
}
}
}
}
}
}
Sample Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"my_person" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 2,
"sum_property_value" : {
"value" : 40000.0
}
}
]
}
}
}
Note: This query only works if each document of a given person_id has a different city value.
In other words, if a person_id has multiple documents with the same city, min_doc_count can be reached without matching every city, and the aggregation will not give the right answer.
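The limitation can be reproduced with a small in-memory imitation of the query. This is only a sketch of the filter, terms and sum logic in plain Java, not an Elasticsearch client call; CityAggregation, Doc and sumByPerson are hypothetical names, and person "3" with two Bangalore documents is made-up data added to trigger the problem.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

class CityAggregation {
    static final class Doc {
        final String personId;
        final long propertyValue;
        final String city;

        Doc(String personId, long propertyValue, String city) {
            this.personId = personId;
            this.propertyValue = propertyValue;
            this.city = city;
        }
    }

    // Mimics: filter by city, terms on person_id with min_doc_count, sum of property_value.
    static Map<String, Long> sumByPerson(List<Doc> docs, Set<String> cities, int minDocCount) {
        Map<String, List<Doc>> byPerson = docs.stream()
                .filter(d -> cities.contains(d.city))
                .collect(Collectors.groupingBy(d -> d.personId, TreeMap::new, Collectors.toList()));
        Map<String, Long> sums = new TreeMap<>();
        byPerson.forEach((person, group) -> {
            if (group.size() >= minDocCount) {
                sums.put(person, group.stream().mapToLong(d -> d.propertyValue).sum());
            }
        });
        return sums;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc("1", 25000, "Bangalore"),
                new Doc("2", 100000, "Bangalore"),
                new Doc("1", 15000, "Delhi"),
                new Doc("3", 10000, "Bangalore"),  // hypothetical person with
                new Doc("3", 20000, "Bangalore")); // two properties in one city
        // Person 3 reaches min_doc_count 2 without any Delhi document.
        System.out.println(sumByPerson(docs, Set.of("Bangalore", "Delhi"), 2)); // {1=40000, 3=30000}
    }
}
```

Person 3 appears in the result although it has no property in Delhi, which is exactly the case the note above warns about.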
Updated Answer:
There is no direct way to achieve what you are looking for unless you modify the mapping. What I've done is use the nested datatype and ingest all the properties of a person_id as a single document.
Mapping:
PUT my_sample_city_index
{
"mappings": {
"properties": {
"person_id":{
"type": "keyword"
},
"property_details":{
"type":"nested", <------ Note this
"properties": {
"city":{
"type": "text",
"fields":{
"keyword":{
"type":"keyword"
}
}
},
"property_value":{
"type": "long"
}
}
}
}
}
}
Sample Documents:
POST my_sample_city_index/_doc/1
{
"person_id": "1",
"property_details":[
{
"property_value" : 25000,
"city": "Bangalore"
},
{
"property_value" : 15000,
"city": "Delhi"
}
]
}
POST my_sample_city_index/_doc/2
{
"person_id": "2",
"property_details":[
{
"property_value" : 100000,
"city": "Bangalore"
}
]
}
Aggregation Query:
POST my_sample_city_index/_search
{
"size": 0,
"query": {
"nested": {
"path": "property_details",
"query": {
"query_string": {
"default_field": "property_details.city",
"query": "bangalore delhi"
}
}
}
},
"aggs": {
"persons": {
"terms": {
"field": "person_id",
"size": 10
},
"aggs": {
"property_sum": {
"nested": { <------ Note this
"path": "property_details"
},
"aggs": {
"total_sum": {
"sum": {
"field": "property_details.property_value"
}
}
}
}
}
}
}
}
Note that I first apply a Terms Aggregation on person_id, then a Nested Aggregation inside it, and finally a Sum metric aggregation on the nested property_value field.
This should also work correctly if a person has multiple properties in the same city.
Response:
{
"took" : 31,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"persons" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "1",
"doc_count" : 1,
"property_sum" : {
"doc_count" : 2,
"total_sum" : {
"value" : 40000.0
}
}
},
{
"key" : "2",
"doc_count" : 1,
"property_sum" : {
"doc_count" : 1,
"total_sum" : {
"value" : 100000.0
}
}
}
]
}
}
}
Let me know if this helps!

How to filter documents by embedded arrays in MongoDB using Java

I have the following document:
{
"_id": {
"$oid": "5d4037f811b787414dcfb3a5"
},
"id": 1,
"seats": [{
"available": true
}, {
"available": true
}, {
"available": true
}, {
"available": false
}]
}
How can I query, using mongodb-java-driver, only the documents in which the number of available seats is greater than 3?
What I want is something like this: (seats.available eq true) gt 3
Is it possible?
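I think this might be a duplicate of "Query for documents where array size is greater than 1".
So something like this, I would guess (seats is an array, so the available entries have to be counted rather than read through seats.available):
{ $where: "this.seats.filter(function (s) { return s.available; }).length > 3" }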
To count the number of available seats, we can use the $reduce aggregation operator inside an $addFields stage. It counts the seats whose available flag is true. The next pipeline stage then keeps only the documents whose count of available seats is greater than 3.
db.checks.aggregate([
{
$addFields:{
"availableSeats":{
$reduce:{
"input":"$seats",
"initialValue": 0,
"in":{
$sum:[
"$$value",
{
$cond:[
{
$eq:["$$this.available",true]
},
1,
0
]
}
]
}
}
}
}
},
{
$match:{
"availableSeats":{
$gt: 3
}
}
}
]).pretty()
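The counting and filtering logic of the pipeline can be checked in plain Java before translating it to the driver. This is an in-memory sketch, assuming the available flags of one document have been read into a List; SeatFilter, availableSeats and matches are hypothetical names and no MongoDB driver API is used.

```java
import java.util.List;

class SeatFilter {
    // Counts the true entries, as the $reduce expression does for one document.
    static long availableSeats(List<Boolean> seats) {
        return seats.stream().filter(Boolean::booleanValue).count();
    }

    // Mirrors the $match stage: strictly more than `threshold` available seats.
    static boolean matches(List<Boolean> seats, long threshold) {
        return availableSeats(seats) > threshold;
    }

    public static void main(String[] args) {
        // The seats array of the sample document: three available, one not.
        List<Boolean> sample = List.of(true, true, true, false);
        System.out.println(availableSeats(sample)); // 3
        System.out.println(matches(sample, 3));     // false
    }
}
```

The sample document has exactly 3 available seats, so it is excluded when the condition is strictly greater than 3.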

Elasticsearch how sum values after aggregation result

I have a lot of documents like the one below in an Elasticsearch index:
{
"_index": "f2016-07-17",
"_type": "trkvjadsreqpxl.gif",
"_id": "AVX2N3dl5siG6SyfyIjb",
"_score": 1,
"_source": {
"time": "1468714676424",
"meta": {
"cb_id": 25681,
"mt_id": 649,
"c_id": 1592,
"revenue": 2.5,
"mt_name": "GMS-INAPP-EN-2.5",
"c_description": "COULL-INAPP-EN-2.5",
"domain": "wv.inner-active.mobi",
"master_domain": "649###wv.inner-active.mobi",
"child_domain": "1592###wv.inner-active.mobi",
"combo_domain": "25681###wv.inner-active.mobi",
"ip": "52.42.87.73"
}
}
}
I want to run a date histogram/range aggregation on multiple fields and store the result in another collection/index,
so that I can sum the doc_count values with a query/aggregation over a range of hours.
The Aggregation is:
{
"aggs": {
"hour":{
"date_histogram": {
"field": "time",
"interval": "hour"
},
"aggs":{
"hourly_M_TAG":{
"terms":{
"field":"meta.mt_id"
}
}
}....
}
}
}
The Result as expected:
"aggregations": {
"hour": {
"buckets": [
{
"key_as_string": "2016-07-17T00:00:00.000Z",
"key": 1468713600000,
"doc_count": 94411750,
"hourly_M_TAG": {
"doc_count_error_upper_bound": 1485,
"sum_other_doc_count": 30731646,
"buckets": [
{
"key": 10,
"doc_count": 10175501
},
{
"key": 649,
"doc_count": 200000
}....
]
}
},
{
"key_as_string": "2016-07-17T01:00:00.000Z",
"key": 1468717200000,
"doc_count": 68738743,
"hourly_M_TAG": {
"doc_count_error_upper_bound": 2115,
"sum_other_doc_count": 22478590,
"buckets": [
{
"key": 559,
"doc_count": 8307018
},
{
"key": 649,
"doc_count" :100000
}...
Let's assume that I parse the response and store the result in another index/collection.
My Question
What is the best way to store the aggregated results,
so that I can run another query/aggregation that sums the "doc_count" over different hour ranges?
For example: between "2016-07-17T00:00:00.000Z" and "2016-07-17T01:00:00.000Z" I want to see the total doc_count for each key.
EXPECTED RESULT:
{
"range_sum": {
"buckets": [
{
"key": 649,
"doc_count": 300000 // (200000+100000)
},
{
"key": 588,
"doc_count": 2928548 // ... + ...
}....
]
}
}
Thanks!
I might have your end goal wrong, but it seems to me that you want
the total doc_count for each value of meta.mt_id over configurable ranges of time?
If that is the case, I don't think you really need to store the result of the first aggregation; you just need to change the interval value to reflect the bucket sizes you want. If you want totals for each value of meta.mt_id, it might help to flip the aggregation around so that you aggregate first on terms and then on dates:
{
"size": 0,
"aggs": {
"hourly_M_TAG": {
"terms": {
"field": "meta.mt_id"
},
"aggs": {
"hour": {
"date_histogram": {
"field": "time",
"interval": "2h"
}
}
}
}
}
}
This will give you results for each meta.mt_id. If you wish to get totals for a particular time range, just change the interval to reflect that.
EDIT:
There might be some smart Elasticsearch way of doing this, but I think I would just do it like this:
Do your original aggregation.
foreach hour bucket, foreach hourly_M_TAG bucket inside it:
index:
{
"id" : {key of the hourly_M_TAG bucket, i.e. meta.mt_id},
"timestamp" : {key_as_string of the hour bucket},
"count" : {doc_count of the hourly_M_TAG bucket}
}
You should then have an index with one document per meta.mt_id and timestamp, each holding its doc_count; the granularity of the interval depends on what you need.
Then you can just do a terms -> sum aggregation on the new index, with a range filter on the dates (assuming Elasticsearch 2.x):
{
"size": 0,
"query": {
"bool": {
"filter": {
"range": {
"timestamp": {
"gte": "now-1h",
"lte": "now"
}
}
}
}
},
"aggs": {
"termName": {
"terms": {
"field": "id"
},
"aggs": {
"sumCounts": {
"sum": {
"field": "count"
}
}
}
}
}
}
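What the terms -> sum aggregation over the flattened index returns can be sketched in plain Java. This is an in-memory illustration only, not an Elasticsearch client call; HourlyRollup, HourlyCount and sumByKey are hypothetical names, and the rows are the four buckets visible in the question's sample response.

```java
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class HourlyRollup {
    // One flattened document: terms bucket key, hour bucket timestamp, doc_count.
    static final class HourlyCount {
        final long id;
        final String timestamp;
        final long count;

        HourlyCount(long id, String timestamp, long count) {
            this.id = id;
            this.timestamp = timestamp;
            this.count = count;
        }
    }

    // Sums count per id over the rows whose timestamp lies in [from, to], both inclusive.
    static Map<Long, Long> sumByKey(List<HourlyCount> rows, String from, String to) {
        Instant start = Instant.parse(from);
        Instant end = Instant.parse(to);
        Map<Long, Long> totals = new TreeMap<>();
        for (HourlyCount row : rows) {
            Instant ts = Instant.parse(row.timestamp);
            if (!ts.isBefore(start) && !ts.isAfter(end)) {
                totals.merge(row.id, row.count, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<HourlyCount> rows = List.of(
                new HourlyCount(10, "2016-07-17T00:00:00.000Z", 10175501),
                new HourlyCount(649, "2016-07-17T00:00:00.000Z", 200000),
                new HourlyCount(559, "2016-07-17T01:00:00.000Z", 8307018),
                new HourlyCount(649, "2016-07-17T01:00:00.000Z", 100000));
        // Prints {10=10175501, 559=8307018, 649=300000}
        System.out.println(sumByKey(rows, "2016-07-17T00:00:00.000Z", "2016-07-17T01:00:00.000Z"));
    }
}
```

Key 649 sums to 300000 (200000 + 100000), which matches the expected result in the question. Note that the stored counts inherit any approximation of the original terms aggregation (see doc_count_error_upper_bound in the sample response).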
Sorry if that is still not what you are looking for, I think there are many different ways of doing this.

Term not found in Search but present in a term vector in Elasticsearch

I have a term in my dataset that does not return any search results even though it is present in a document.
If I request a term vector:
GET index_5589b14f3004fb6be70e4724/document_set/382.txt/_termvector
{
"fields" : ["plain_text", "pdf_text"],
"term_statistics" : true,
"field_statistics" : true
}
The term vector has this word:
...
"advis": { //porter stemmed version of the word "advising"
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 81,
"start_offset": 412,
"end_offset": 420
}
]
},...
"air": {
But when I search this word to retrieve all the documents where it has occurred I get zero hits:
GET index_5589b14f3004fb6be70e4724/document_set/_search
{
"query": {
"multi_match": {
"query": "advis",
"fields": ["plain_text", "pdf_text"]
}
},
"explain": true
}
Why is this happening?
This happens because the search term is analyzed as well. Most probably, in the above example, advis is being stemmed again to advi, which does not match the indexed term advis.
You can explicitly specify the keyword analyzer in the query so that the term is searched as-is, and you should then get the documents:
GET index_5589b14f3004fb6be70e4724/document_set/_search
{
"query": {
"multi_match": {
"query": "advis",
"fields": ["plain_text", "pdf_text"],
"analyzer" : "keyword"
}
},
"explain": true
}
