We are working on an application where Java code talks to MongoDB and streams the results back with Spring Data. We have been looking at the profiler output and I am not 100% sure what it means.
https://docs.mongodb.com/manual/reference/database-profiler/
{
    "op" : "query",
    "ns" : "test.c",
    "query" : {
        "find" : "c",
        "filter" : {
            "a" : 1
        }
    },
    "keysExamined" : 2,
    "docsExamined" : 2,
    "cursorExhausted" : true,
    ...
    "responseLength" : 108,
    "millis" : 0,
The documentation's description is:
system.profile.millis
The time in milliseconds from the perspective of the mongod from the beginning of the operation to the end of the operation.
OK, but what is the operation? If I am executing a query and I am pulling 1000 results back, is the "millis" time just for the query plan? Or does it include the ENTIRE time it spends pulling the results back and sending them to the driver?
Will this give different answers when streaming vs non-streaming?
The operation is the query; the query does not return documents, but instead returns a cursor that points to the locations of the documents on disk:
https://docs.mongodb.com/v3.0/core/cursors/
The "millis" result is the time it takes MongoDB to search for the query results (perform index or collection scan, identify all documents that meet the query criteria and perform sorts if necessary) and return the corresponding cursor to the driver.
I'm not certain what you mean by "streaming", but it could be the driver iterating over the cursor to access the results of the query.
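To make that concrete, here is a minimal sketch with the synchronous Java driver (connection string, database and collection assumed from the profile output above): the initial find is profiled as one operation, and each additional batch the driver pulls while you iterate should show up as its own getmore entry with its own millis, so the time spent draining a large result set is not all attributed to the first entry.
MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> c = client.getDatabase("test").getCollection("c");
// The initial batch comes back with the profiled "query"/find operation;
// later batches fetched during iteration appear as separate "getmore" entries.
try (MongoCursor<Document> cursor = c.find(Filters.eq("a", 1)).batchSize(100).iterator()) {
    while (cursor.hasNext()) {
        Document doc = cursor.next();   // may trigger a getmore when the current batch is exhausted
        // process doc ...
    }
}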
I'm trying to filter the data from my database using this code:
fdb.orderByChild("title").startAt(searchquery).endAt(searchquery+"\uf8ff").addValueEventListener(valuelistener2);
My database is like this:
"g12" : {
"Books" : {
"-Mi_He4vHXOuKHNL7yeU" : {
"title" : "Technical Sciences P1"
},
"-Mi_He50tUPTN9XDiVow" : {
"title" : "Life Sciences"
},
"-Mi_He51dhQfl3RAjysQ" : {
"title" : "Technical Sciences P2"
}}
While the code works, it only returns the first value that matches the query and doesn't fetch the rest of the data even though it matches.
If I put a "T" as my search query, I just get the first title "Technical Sciences P1 " and don't get the other one with P2
(Sorry for the vague and common question title, it's just I've been looking for a solution for so long)
While the code works, it only returns the first value that matches the query
That's the expected behavior since Firebase Realtime Database does not support native indexing or search for text fields in database properties.
When you are using the following query:
fdb.orderByChild("title").startAt(searchquery).endAt(searchquery+"\uf8ff")
It means that you are trying to get all elements that start with searchquery. For example, if you have a title called "Don Quixote" and you search for "Don", your query will return the correct results. However, searching for "Quix" will yield no results.
You might consider downloading the entire node and searching the fields client-side, but this solution isn't really practical. To enable full-text search of your Firebase Realtime Database data, I recommend using a third-party search service like Algolia or Elasticsearch.
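For completeness, a rough sketch of that client-side approach with the Android SDK, assuming fdb points at the Books node and title is a direct child as in the data above; it downloads every book once and filters in memory, which is why it doesn't scale:
fdb.addListenerForSingleValueEvent(new ValueEventListener() {
    @Override
    public void onDataChange(DataSnapshot snapshot) {
        for (DataSnapshot book : snapshot.getChildren()) {
            String title = book.child("title").getValue(String.class);
            // "contains" matching: a search for "Quix" would now find "Don Quixote" too
            if (title != null && title.toLowerCase().contains(searchquery.toLowerCase())) {
                // add the book to your results
            }
        }
    }
    @Override
    public void onCancelled(DatabaseError error) {
        // handle the error
    }
});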
If at some point you consider trying Cloud Firestore, please see the following example:
Is it possible to use Algolia query in FirestoreRecyclerOptions?
It shows how this works with Cloud Firestore, but you can use the same approach with the Firebase Realtime Database.
I am having a problem with MongoDB (version 4.2), specifically with the Change Streams functionality.
I have a replica set consisting of 1 primary, 1 secondary and 1 arbiter, and in my Java code, using mongodb-driver-sync 4.2.0-beta1, I have a .watch() on a collection of interest.
Like below:
MongoClient mongoClient = MongoClients.create("mongodb://localhost:27017,localhost:27018,localhost:27019/?replicaSet=replica");
MongoDatabase database = mongoClient.getDatabase("test");
MongoCollection<Document> collectionStream = database.getCollection("myCollection");
List<Bson> pipeline = Arrays.asList(Aggregates.match(Filters.and(Filters.in("operationType", Arrays.asList("insert", "update", "replace", "invalidate")))));
MongoCursor<ChangeStreamDocument<Document>> cursor = collectionStream.watch(pipeline).fullDocument(FullDocument.UPDATE_LOOKUP).iterator();
ChangeStreamDocument<Document> streamedEvent = cursor.next();
System.out.println("Streamed event: " + streamedEvent);
Basically, the stream works fine. When an insert/update operation occurs, the event is recognized and the document is streamed correctly.
However, when one of the two nodes (either the primary or the secondary, a data-bearing one) goes down, the watcher stops streaming anything.
Update/insert operations continue fine on the database, but the stream is blocked. As soon as I restart one of the two nodes, the stream immediately resumes correctly and also shows me the events that were not streamed previously.
If, on the other hand, the arbiter node goes down, the stream keeps working fine.
Below is my rs.conf() output; as you can see, the default write concern (getLastErrorDefaults) is w: 1.
{
    "_id" : "replica",
    "version" : 31,
    "protocolVersion" : NumberLong(1),
    "writeConcernMajorityJournalDefault" : true,
    "members" : [
        {
            "_id" : 0,
            "host" : "host1:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1000,
            "tags" : {
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 1,
            "host" : "host2:27017",
            "arbiterOnly" : false,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 1,
            "tags" : {
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        },
        {
            "_id" : 4,
            "host" : "host2:27018",
            "arbiterOnly" : true,
            "buildIndexes" : true,
            "hidden" : false,
            "priority" : 0,
            "tags" : {
            },
            "slaveDelay" : NumberLong(0),
            "votes" : 1
        }
    ],
    "settings" : {
        "chainingAllowed" : true,
        "heartbeatIntervalMillis" : 500,
        "heartbeatTimeoutSecs" : 3,
        "electionTimeoutMillis" : 3000,
        "catchUpTimeoutMillis" : -1,
        "catchUpTakeoverDelayMillis" : 30000,
        "getLastErrorModes" : {
        },
        "getLastErrorDefaults" : {
            "w" : 1,
            "j" : false,
            "wtimeout" : 0
        },
        "replicaSetId" : ObjectId("5f16a4e1e1c622bbea578576")
    }
}
Can anyone help me to solve this problem?
Update:
To work around this behavior, I set replication.enableMajorityReadConcern to false in each node's .conf file, in order to disable majority read concern.
However, with this setting, when I stop either the primary or the secondary node, I always get the following exception in the console:
Exception in thread "main" com.mongodb.MongoExecutionTimeoutException: Error waiting for snapshot not less than { ts: Timestamp(1605805914, 1), t: -1 }, current relevant optime is { ts: Timestamp(1605805864, 1), t: 71 }. :: caused by :: operation exceeded time limit
at com.mongodb.internal.connection.ProtocolHelper.createSpecialException(ProtocolHelper.java:239)
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:171)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:359)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:280)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:100)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:490)
at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:259)
at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:202)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:118)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:110)
at com.mongodb.internal.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:345)
at com.mongodb.internal.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:336)
at com.mongodb.internal.operation.CommandOperationHelper.executeCommandWithConnection(CommandOperationHelper.java:222)
at com.mongodb.internal.operation.CommandOperationHelper$5.call(CommandOperationHelper.java:208)
at com.mongodb.internal.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:583)
at com.mongodb.internal.operation.CommandOperationHelper.executeCommand(CommandOperationHelper.java:205)
at com.mongodb.internal.operation.AggregateOperationImpl.execute(AggregateOperationImpl.java:189)
at com.mongodb.internal.operation.ChangeStreamOperation$1.call(ChangeStreamOperation.java:325)
at com.mongodb.internal.operation.ChangeStreamOperation$1.call(ChangeStreamOperation.java:321)
at com.mongodb.internal.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:583)
at com.mongodb.internal.operation.ChangeStreamOperation.execute(ChangeStreamOperation.java:321)
at com.mongodb.internal.operation.ChangeStreamOperation.execute(ChangeStreamOperation.java:60)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:178)
at com.mongodb.client.internal.ChangeStreamIterableImpl.execute(ChangeStreamIterableImpl.java:204)
at com.mongodb.client.internal.ChangeStreamIterableImpl.cursor(ChangeStreamIterableImpl.java:158)
at com.mongodb.client.internal.ChangeStreamIterableImpl.iterator(ChangeStreamIterableImpl.java:153)
at com.softstrategy.ProvaWatcher.ProvaWatcherApplication.main(ProvaWatcherApplication.java:34)
On the other hand, if I leave enableMajorityReadConcern commented out in every node's .conf file, as it is by default, that exception does not appear.
Hence my two questions are:
Why is that exception raised only when enableMajorityReadConcern is set to false and the node that goes down is a data-bearing one?
Why is the exception not raised when the arbiter node goes down, regardless of the read concern setting?
Thanks!
With a PSA architecture, if either of the data-bearing nodes is unavailable, you no longer have a majority of data-bearing nodes present. That means you can still insert with w:1 but not with w:majority, and you can't perform majority reads.
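A minimal sketch of that difference with the sync Java driver (reusing the database variable from your code; while one data-bearing node of the PSA set is down, the second insert below cannot be acknowledged and times out):
MongoCollection<Document> coll = database.getCollection("myCollection");
coll.withWriteConcern(WriteConcern.W1)
    .insertOne(new Document("probe", "w1"));                       // still acknowledged
coll.withWriteConcern(WriteConcern.MAJORITY.withWTimeout(5, TimeUnit.SECONDS))
    .insertOne(new Document("probe", "majority"));                 // fails: no majority of data-bearing nodes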
Per https://docs.mongodb.com/manual/reference/read-concern-majority/#disable-read-concern-majority, change streams use majority read concern:
Disabling "majority" read concern disables support for Change Streams for MongoDB 4.0 and earlier. For MongoDB 4.2+, disabling read concern "majority" has no effect on change streams availability.
This is also implied by https://www.mongodb.com/blog/post/an-introduction-to-change-streams, given the following:
Total ordering
MongoDB 3.6 has a global logical clock that enables the server to order all changes across a sharded cluster. Applications will always receive changes in the order they were applied to the database.
The total ordering is only possible with a majority read.
Use a PSS architecture if you want change streams to continue producing events when one of the data-bearing nodes is unavailable.
You can also try disabling majority read concern on 4.2+, but this has other issues, as described in the first link.
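Separately from the topology question, if you keep this watcher running in production you might also hold on to the last resume token so the client can restart the stream after a transient error. A rough sketch, reusing collectionStream and pipeline from your code (illustration only; it does not remove the PSA limitation described above):
BsonDocument resumeToken = null;
while (true) {
    ChangeStreamIterable<Document> stream = collectionStream
            .watch(pipeline)
            .fullDocument(FullDocument.UPDATE_LOOKUP);
    if (resumeToken != null) {
        stream = stream.resumeAfter(resumeToken);
    }
    try (MongoChangeStreamCursor<ChangeStreamDocument<Document>> cursor = stream.cursor()) {
        while (cursor.hasNext()) {
            ChangeStreamDocument<Document> event = cursor.next();
            resumeToken = event.getResumeToken();   // remember where we got to
            System.out.println("Streamed event: " + event);
        }
    } catch (MongoException e) {
        // Transient failure (e.g. during a failover): loop around and resume from the
        // last token. This does not help while the majority commit point cannot advance.
        System.err.println("Change stream interrupted, retrying: " + e.getMessage());
    }
}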
I have inserted some test records to the mongo database with following structure.
{
    "_id" : ObjectId("5563fe96a826638b48c77c26"),
    "date" : ISODate("2015-05-02T07:00:00.326Z"),
    "createdDate" : ISODate("2015-05-26T05:03:18.899Z"),
    "updatedDate" : ISODate("2015-05-26T05:03:18.899Z"),
    "status" : 0
}
Now when I try to query it using Spring Data or directly against MongoDB, the returned result list size is always 0.
Calendar calendar = Calendar.getInstance();
calendar.set(2015, 4, 2, 0, 0, 0);   // month is zero-based: 4 = May
Query query = new Query();
query.addCriteria(Criteria.where("date").is(calendar.getTime()));
List<DateRecord> attendanceList = findAll(query, DateRecord.class);
System.out.println(attendanceList.size());
I am getting a very similar result with BasicDBObject: a cursor size of 0.
DBCursor cursor;
BasicDBObject query1 = new BasicDBObject();
query1.append("date", calendar.getTime());
cursor = collection.find(query1);
System.out.println("Total objects returned "+cursor.size());
Any pointers would be highly appreciated. All I want is for the data to be matched on year, month and day, with the time-of-day portion of the field ignored.
I suggest using a different query - look for date greater than 2015-4-2 00:00:00 and explicitly less than 2015-4-3 00:00:00.
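In Spring Data terms that could look roughly like this (assuming a MongoTemplate named mongoTemplate and the DateRecord mapping from the question; note that Calendar months are zero-based, so the question's set(2015, 4, 2) already means May 2):
Calendar start = Calendar.getInstance();
start.clear();                               // zero out hour/minute/second/millisecond
start.set(2015, Calendar.MAY, 2);
Calendar end = (Calendar) start.clone();
end.add(Calendar.DAY_OF_MONTH, 1);           // exclusive upper bound: midnight of the next day
Query query = new Query();
query.addCriteria(Criteria.where("date").gte(start.getTime()).lt(end.getTime()));
List<DateRecord> attendanceList = mongoTemplate.find(query, DateRecord.class);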
Another approach, which I'm less enthusiastic about, would be to add a field to the document just for search purposes (e.g. "dateWithoutHour", calculated in Java just before saving a document, and assuming data doesn't arrive from other sources). I don't like it, because I prefer my data model to stay clean and not change every time someone comes up with a new search requirement... but sometimes I have had to resort to it.
And as always, when facing a difficult query it's tempting to consider $where, but I don't recommend it because it can't use indexes.
We run a nightly query against BigQuery via the Java REST API that specifies a destination table for the results to be pushed to (write disposition=WRITE_TRUNCATE). Today's query appeared to run without errors but the results were not pushed to the destination table.
This query has been running for a few weeks now and we've had no issues. No code changes were made either.
Manually running it a second time after it "failed" worked fine. It was just this one glitch that we spotted and we're concerned it may happen again.
Our logged JSON response from the "failed" query looks fine (I've obfuscated any sensitive data):
INFO: Job finished successfully: {
    "configuration" : {
        "dryRun" : false,
        "query" : {
            "createDisposition" : "CREATE_IF_NEEDED",
            "destinationTable" : {
                "datasetId" : "[REMOVED]",
                "projectId" : "[REMOVED]",
                "tableId" : "[REMOVED]"
            },
            "priority" : "INTERACTIVE",
            "query" : "[REMOVED]",
            "writeDisposition" : "WRITE_TRUNCATE"
        }
    },
    "etag" : "[REMOVED]",
    "id" : "[REMOVED]",
    "jobReference" : {
        "jobId" : "[REMOVED]",
        "projectId" : "[REMOVED]"
    },
    "kind" : "bigquery#job",
    "selfLink" : "[REMOVED]",
    "statistics" : {
        "creationTime" : "1390435780070",
        "endTime" : "1390435780769",
        "query" : {
            "cacheHit" : false,
            "totalBytesProcessed" : "12546"
        },
        "startTime" : "1390435780245",
        "totalBytesProcessed" : "12546"
    },
    "status" : {
        "state" : "DONE"
    }
}
Using the "try it!" for Jobs/GET here and plugging in the job id also shows the job was indeed successful and matches our logged output (pasted above).
Checking the web console shows the destination table has been truncated but not updated. Weirdly, the "Last Modified" has not been updated (I did try refreshing the page numerous times):
http://i.stack.imgur.com/384NL.png
Has anyone experienced this before with BigQuery: a query appearing to run successfully, but, even though a destination table was specified, the results were not pushed to it, yet the table was truncated?
I am a developer on the BigQuery team. I've looked up the details of your job from the breadcrumbs you left (your query was the only one that started at that start time).
It looks like your destination table was truncated at 4:09 pm today PST, which is the time your job ran, but it was left empty -- the query that truncated it didn't actually fill in any information.
I'm having a little bit of trouble piecing together the details, because one of the source tables appears to have been overwritten (the left table in your left outer join was created at 4:20 PM).
However, there is a clue in the "total bytes processed" field -- it says that the query only processed 12K of data. The internal statistics say that only 384 rows were involved in the query among both tables that were involved.
My guess is that the query legitimately returned 0 rows, so the table was cleared.
There is a bug in that deleting all of the data in a table doesn't update the last-modified time. We use last modified to mean either the last time the metadata was updated (like description, schema, etc.) or the last time the table had data added to it. But if you just truncate the table, that doesn't update the metadata or add data, so we end up with a stale last-modified time.
If this doesn't sound like a reasonable chain of events, we'll need more information from you about how to debug it (especially since it looks like the tables involved have been modified since you ran this query), and a way that we can reproduce it would be great.
So, we figured out what the problem is with this. It failed again a few times over the last few days so we dug in further.
The query being executed is dependent on another query which is executed immediately before it. Although we do wait for the first query to finish (job status = "DONE"), it appears that behind the scenes it's actually not fully complete and its data is not yet available to be used.
Current process is:
Fetch data from another data source and stream the results into table A
When (1) is complete (poll the job id until it reports status "DONE"), submit another query which joins on the results in table A to create table B
Table A's data is not yet available, so the query from (2) results in an empty table
We've noticed it takes about 5-10 seconds for the data to actually appear and be available in BigQuery when using streaming for the first query.
We used a fairly ugly workaround - simply wait a few seconds after the first query before running the next one. Not exactly elegant but it works.
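For illustration only, the workaround looks roughly like this (runStreamingLoad, pollUntilDone and runQueryJob are hypothetical wrappers around our existing REST calls, not BigQuery API methods; the sleep is the only real change):
String loadJobId = runStreamingLoad();     // step 1: stream rows into table A
pollUntilDone(loadJobId);                  // returns once the job reports status "DONE"
try {
    Thread.sleep(10_000L);                 // crude buffer: streamed rows can take several seconds to become queryable
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
String queryJobId = runQueryJob();         // step 2: query that joins on table A to build table B
pollUntilDone(queryJobId);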
I am trying to come up with an optimized architecture to store event logging messages on Elasticsearch.
Here are my specs/needs:
Messages are read-only; once entered, they are only queried for reporting.
No free text search. User will use only filters for reporting.
Must be able to do timestamp range queries.
Mainly need to filter by agent and customer interactions (in addition to other fields).
Customers and agents belong to the same location.
So the most frequently executed query will be: get all LogItems given client_id, customer_id, and timestamp range.
Here is what a LogItem looks like:
"_source": {
"agent_id" : 14,
"location_id" : 2,
"customer_id" : 5289,
"timestamp" : 1320366520000, //Java Long millis since epoch
"event_type" : 7,
"screen_id" : 12
}
I need help indexing my data.
I have been reading what is an elasticsearch index? and using elasticsearch to serve events for customers to get an idea of a good indexing architecture, but I need assistance from the pros.
So here are my questions:
The article suggests creating "one index per day". How would I do range queries with that architecture? (e.g. is it possible to query across a range of indices?)
Currently I'm using one big index. If I create one index per location_id, how do I use shards for further organization of my records?
Given the specs above, is there a better architecture you can suggest?
What fields should I filter with vs query with?
EDIT: Here's a sample query run from my app:
{
    "query" : {
        "bool" : {
            "must" : [ {
                "term" : {
                    "agent_id" : 6
                }
            }, {
                "range" : {
                    "timestamp" : {
                        "from" : 1380610800000,
                        "to" : 1381301940000,
                        "include_lower" : true,
                        "include_upper" : true
                    }
                }
            }, {
                "terms" : {
                    "event_type" : [ 4, 7, 11 ]
                }
            } ]
        }
    },
    "filter" : {
        "term" : {
            "customer_id" : 56241
        }
    }
}
You can definitely search on multiple indices. You can use wildcards or a comma-separated list of indices for instance, but keep in mind that index names are strings, not dates.
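For example, with one-index-per-day names like logs-2013.10.06 (a naming scheme assumed here for illustration, not taken from your setup) and a RestHighLevelClient named client, a query for a date range can simply target several indices at once; a rough sketch:
// Wildcard covers all of October's daily indices; a comma-separated list of specific
// days works as well, e.g. new SearchRequest("logs-2013.10.06", "logs-2013.10.07").
SearchRequest request = new SearchRequest("logs-2013.10.*");
request.source(new SearchSourceBuilder()
        .query(QueryBuilders.boolQuery()
                .must(QueryBuilders.termQuery("agent_id", 6))
                .must(QueryBuilders.rangeQuery("timestamp")
                        .gte(1380610800000L)
                        .lte(1381301940000L))));
SearchResponse response = client.search(request, RequestOptions.DEFAULT);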
Shards are not for organizing your data but for distributing it and eventually scaling out. How you do that is driven by your data and what you do with it. Have a look at this talk: http://vimeo.com/44716955.
Regarding your question about filters vs. queries, have a look at this other question.
Take a good look at logstash (and kibana). They are all about solving this problem. If you decide to roll your own architecture for this, you might copy some of their design.