Elasticsearch for logging - need architectural advice - java

I am trying to come up with an optimized architecture to store event logging messages on Elasticsearch.
Here are my specs/needs:
Messages are read-only; once entered, they are only queried for reporting.
No free text search. User will use only filters for reporting.
Must be able to do timestamp range queries.
Mainly need to filter by agent and customer interactions (in addition to other fields).
Customers and agents belong to the same location.
So the most frequently executed query will be: get all LogItems given a client_id, a customer_id, and a timestamp range.
Here is what a LogItem looks like:
"_source": {
"agent_id" : 14,
"location_id" : 2,
"customer_id" : 5289,
"timestamp" : 1320366520000, //Java Long millis since epoch
"event_type" : 7,
"screen_id" : 12
}
I need help indexing my data.
I have been reading "What is an Elasticsearch index?" and "Using Elasticsearch to serve events for customers" to get an idea of a good indexing architecture, but I need assistance from the pros.
So here are my questions:
The article suggests creating "one index per day". How would I do range queries with that architecture? (e.g., is it possible to query across a range of indices?)
Currently I'm using one big index. If I create one index per location_id, how do I use shards for further organization of my records?
Given the specs above, is there a better architecture you can suggest?
What fields should I filter with vs query with?
EDIT: Here's a sample query run from my app:
{
  "query" : {
    "bool" : {
      "must" : [
        { "term" : { "agent_id" : 6 } },
        { "range" : {
            "timestamp" : {
              "from" : 1380610800000,
              "to" : 1381301940000,
              "include_lower" : true,
              "include_upper" : true
            }
          }
        },
        { "terms" : { "event_type" : [ 4, 7, 11 ] } }
      ]
    }
  },
  "filter" : {
    "term" : { "customer_id" : 56241 }
  }
}

You can definitely search across multiple indices: pass a comma-separated list of index names or use wildcards. Keep in mind, though, that index names are strings, not dates, so your application has to work out which daily indices a given timestamp range covers.
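For instance, assuming a hypothetical daily naming scheme such as logs-2013.10.01 (the index names here are illustrative, not from the question), a timestamp range spanning several days can be searched against several indices at once:
request (comma-separated list of daily indices):
GET /logs-2013.10.01,logs-2013.10.02,logs-2013.10.03/_search
request (wildcard covering a whole month):
GET /logs-2013.10.*/_search
The search body (your bool query with the timestamp range) stays the same; only the index list in the URL changes.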
Shards are not for organizing your data; they are for distributing it and, eventually, scaling out. How you shard is driven by your data and by what you do with it. Have a look at this talk: http://vimeo.com/44716955 .
Regarding your question about filters vs. queries, have a look at this other question.
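As a sketch only (field names come from the sample query above; the filtered-query syntax shown is the older style that was current when this was asked), the exact-value conditions could all be moved into a filter so they are cached and skip scoring:
{
  "query" : {
    "filtered" : {
      "query" : { "match_all" : {} },
      "filter" : {
        "bool" : {
          "must" : [
            { "term" : { "agent_id" : 6 } },
            { "term" : { "customer_id" : 56241 } },
            { "terms" : { "event_type" : [ 4, 7, 11 ] } },
            { "range" : { "timestamp" : { "gte" : 1380610800000, "lte" : 1381301940000 } } }
          ]
        }
      }
    }
  }
}
On later Elasticsearch versions the equivalent is a bool query with these clauses under its filter section.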

Take a good look at Logstash (and Kibana); they are built to solve exactly this problem. Even if you decide to roll your own architecture, you can copy parts of their design, such as the one-index-per-day naming convention.
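A minimal sketch of that convention, assuming a hypothetical logs-* naming scheme and the legacy _template API (the type name and field types are illustrative, derived from the LogItem above):
PUT /_template/logs
{
  "template" : "logs-*",
  "mappings" : {
    "logitem" : {
      "properties" : {
        "agent_id" : { "type" : "integer" },
        "location_id" : { "type" : "integer" },
        "customer_id" : { "type" : "integer" },
        "timestamp" : { "type" : "date" },
        "event_type" : { "type" : "integer" },
        "screen_id" : { "type" : "integer" }
      }
    }
  }
}
Each day's events are then written to an index named for that day (for example logs-2013.10.01), which keeps old data cheap to drop and lets range queries hit only the relevant indices.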

Related

Connecting two DatabaseReferences in Android

I have a Firebase Realtime Database where, for every user, I have stored the IDs of the groups they are a member of. The names of the groups are stored separately. Now what I'm trying to do is display the user's groups ordered alphabetically by group name (GroupA, GroupB, GroupH, GroupX, ...). But I can't really figure out how to connect the two references. Does anyone know a way to do it? I don't think it is possible to filter children based on a list of valid keys in the Realtime Database; there's only equalTo, startAt, etc. Or do I have to just load the IDs, get the corresponding group names, and order them myself?
Here's my database structure:
"group_profiles" : {
"-MAz5iuen-BpsLWP1TR0" : { //GID
"name" : "GroupA"
},
"-MAkiUQ7UnIttXy0ZgZx" : { //GID
"name" : "GroupB"
}
},
"groups" : {
"iwfcfGR4TNatxwxpqEAx7ycNfT43" : { //UID
"-MAz5iuen-BpsLWP1TR0" : { //GID
"key" : "..."
},
"-MAkiUQ7UnIttXy0ZgZx" : { //GID
"key" : "..."
}
},
...
Or do I have to just load the ids, get the corresponding group names, and order them myself?
That's one way to do it.
The other way is to duplicate the required data (here, the group name) into the node you run the query against, purely for the purpose of performing that query. This is common in NoSQL databases and is called "denormalization".
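A sketch of what that could look like with this structure (the duplicated "name" entries are the only addition; everything else mirrors the data above), so that each user's groups can be read with orderByChild("name"):
"groups" : {
  "iwfcfGR4TNatxwxpqEAx7ycNfT43" : {   // UID
    "-MAz5iuen-BpsLWP1TR0" : {         // GID
      "key" : "...",
      "name" : "GroupA"                // duplicated from group_profiles
    },
    "-MAkiUQ7UnIttXy0ZgZx" : {         // GID
      "key" : "...",
      "name" : "GroupB"                // duplicated from group_profiles
    }
  }
}
The cost is that both copies must be updated whenever a group is renamed, which is the usual trade-off of denormalization.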

How to handle race-condition (Synchronisation) in Java project using ElasticSearch as persistence

Problem: We have data stored in a DB; for example, take a USER object. CRUD operations can occur on any USER.
Is there an efficient way to make these CRUD operations thread-safe when simultaneous operations arrive for the same USER?
My Solution
We use a cache (a concurrent map) that holds a lock on a specific OID while it is being worked on.
The issue with this approach is that we now have an additional cache, which has to be handled in various scenarios such as a system restart.
Is there another, more efficient way to achieve this? Can anything be done in the persistence layer itself?
PS: I am not using any framework.
I think you need to use _seq_no and _primary_term.
You should read the document from Elasticsearch and retrieve its sequence number and primary term:
request:
GET example-index/_doc/FOjCPXdy5uFjy80o4vxPI4kzlRQo
response:
{
  "_index" : "example-index",
  "_type" : "_doc",
  "_id" : "FOjCPXdy5uFjy80o4vxPI4kzlRQo",
  "_version" : 2,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "data" : "text"
  }
}
then pass them with your update query:
request:
POST example-index/_update/FOjCPXdy5uFjy80o4vxPI4kzlRQo?if_seq_no=1&if_primary_term=1
{
  "doc" : {
    "data" : "updated text"
  }
}
If another agent has updated the document in the meantime, those values will have changed and your update will fail with a 409 error:
response:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "version_conflict_engine_exception",
        "reason" : "[FOjCPXdy5uFjy80o4vxPI4kzlRQo]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [2] and primary term [1]",
        "index_uuid" : "GUZKnab3T5aBlzXEuOPI7Q",
        "shard" : "0",
        "index" : "example-index"
      }
    ],
    "type" : "version_conflict_engine_exception",
    "reason" : "[FOjCPXdy5uFjy80o4vxPI4kzlRQo]: version conflict, required seqNo [1], primary term [1]. current document has seqNo [2] and primary term [1]",
    "index_uuid" : "GUZKnab3T5aBlzXEuOPI7Q",
    "shard" : "0",
    "index" : "example-index"
  },
  "status" : 409
}
documentation: elastic.co/optimistic-concurrency-control

When profiling a Mongo query, what does "millis" mean?

We are working on an application where Java code talks to Mongo and streams the results back with Spring Data. We have been looking at the profiler output and I am not 100% sure what it means.
https://docs.mongodb.com/manual/reference/database-profiler/
{
  "op" : "query",
  "ns" : "test.c",
  "query" : {
    "find" : "c",
    "filter" : {
      "a" : 1
    }
  },
  "keysExamined" : 2,
  "docsExamined" : 2,
  "cursorExhausted" : true,
  ...
  "responseLength" : 108,
  "millis" : 0,
The documentation's description is:
system.profile.millis
The time in milliseconds from the perspective of the mongod from the beginning of the operation to the end of the operation.
OK, but what is the operation? If I am executing a query and pulling 1000 results back, is the "millis" time just for the query plan? Or does it include the ENTIRE time it spends pulling the results back and sending them to the driver?
Will this give different answers when streaming vs non-streaming?
The operation is the query; the query does not return documents, but instead returns a cursor that points to the locations of the documents on disk:
https://docs.mongodb.com/v3.0/core/cursors/
The "millis" result is the time it takes MongoDB to search for the query results (perform index or collection scan, identify all documents that meet the query criteria and perform sorts if necessary) and return the corresponding cursor to the driver.
I'm not certain what you mean by "streaming", but presumably it refers to the driver iterating over the cursor to access the results of the query.

elasticsearch match/term query not returning exact match

I am using Elasticsearch in my Java project, with a document mapping like
/index/type/_mapping
{
"my_id" : "string"
}
Now, suppose the my_id values are
A01, A02, A01.A1, A012.AB0
For the query,
{
  "query" : {
    "term" : {
      "my_id" : "a01"
    }
  }
}
Observed : the documents returned are for A01, A01.A1, A012.AB0
Expected : I need the A01 document only.
I looked for a solution and found that I would have to use a custom analyzer for the my_id field. I do not want to change my mapping for the document.
Also, I tried "index": "not_analyzed" in the query, but there was no change in the output.
Yes, you can use "not_analyzed", but note that it is a mapping setting, not something you put in the query, which is why adding it to the query had no effect. Also try a term filter instead of a term query.
Also check the current mapping of the field.
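A sketch, assuming the 1.x-era string mapping and filtered-query syntax that matches the question (index and type are the placeholders from the question):
request (map my_id as not_analyzed; changing how an existing field is indexed generally means reindexing):
PUT /index/type/_mapping
{
  "type" : {
    "properties" : {
      "my_id" : { "type" : "string", "index" : "not_analyzed" }
    }
  }
}
request (exact match with a term filter; the value is now case-sensitive, so "A01", not "a01"):
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : { "my_id" : "A01" }
      }
    }
  }
}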

importing mongodb data to solr in optimal way

I have millions of documents in my MongoDB database, and to implement searching I am using Apache Solr. I googled how to import data from MongoDB to Solr, but found that there is no straightforward approach using the Data Import Handler. So I decided to insert each document into Solr at the same time as I insert it into MongoDB, using the Solr client for Java, SolrJ. My documents are in the following format in MongoDB:
{
  "_id" : ObjectId("51cc52c9e4b04f75b27542ba"),
  "article" : {
    "summary" : "As more people eschew landlines, companies are offering technologies like\npersonal cell sites and signal boosters to augment cellphones, Eric A. Taub\nwrites in The New York Times.\n\n",
    "author" : "By THE NEW YORK TIMES",
    "title" : "Daily Report: Trying to Fix a Big Flaw in Cellphone Technology (NYT)",
    "source" : "NYT",
    "publish_date" : "Thu, 27 Jun 2013 12:01:00 +0100",
    "source_url" : "http://bits.blogs.nytimes.com/feed/",
    "url" : "http://news.feedzilla.com/en_us/stories/top-news/315578403?client_source=api&format=json"
  },
  "topics" : [
    { "categoryName" : "Technology Internet", "score" : "94%" }
  ],
  "socialTags" : [
    { "originalValue" : "Cell site", "importance" : "1" },
    { "originalValue" : "Cellular repeater", "importance" : "1" },
    { "originalValue" : "Technology Internet", "importance" : "1" }
  ],
  "entities" : [
    { "_type" : "PublishedMedium", "name" : "The New York Times" },
    { "_type" : "Company", "name" : "The New York Times" },
    { "_type" : "Person", "name" : "Eric A. Taub" }
  ]
}
I want to index two fields: 'summary' and 'title' from the 'article' sub-document.
So far, what I have learned is that putting the entire document into Solr does not make sense, as it would increase the index size and slow down searches. So I decided to store the following fields in Solr: 'docId', 'title', and 'summary'. While searching, I will retrieve only the docId from Solr and then fetch the other details from MongoDB, because that is faster than retrieving the data from Solr with its analyzers, tokenizers and so on.
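For illustration only (the field names follow the plan above and the values are copied from the sample document; post this JSON body to your collection's update handler), the document actually sent to Solr stays this small:
[
  {
    "docId" : "51cc52c9e4b04f75b27542ba",
    "title" : "Daily Report: Trying to Fix a Big Flaw in Cellphone Technology (NYT)",
    "summary" : "As more people eschew landlines, companies are offering technologies like personal cell sites and signal boosters to augment cellphones, Eric A. Taub writes in The New York Times."
  }
]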
FIRST:
So, I need to maintain a unique field 'docId'; shall I use the default '_id' generated by mongod? But for that the document has to be inserted first, so that mongod can generate the '_id'. So I need to retrieve the document after inserting it into MongoDB, fetch the '_id', and then insert the fields 'docId', 'summary', and 'title' into Solr. Can this be improved?
SECOND:
I need to define a schema in Solr for this, where I map fields from MongoDB to fields in Solr. I have the default instance of Solr running, from solr/example/start.jar.
It has a default schema and a default collection called 'collection1'. How can I create my own collection? I cannot find anything in the admin interface for this. I want to create a collection for my project and then write a schema as I have described above.
Whatever tutorials I have found simply add documents to Solr. So do I need to override the default schema?
Note: I am new to Solr, as you will already have inferred from reading the question :D So please help!
