How to query nested mongodb fields efficiently in Java? - java

I'm not very experienced with running Mongo queries from Java, so I'm no expert at commands. I have a Mongo collection with ~6500 documents, each containing multiple fields (some of which have sub-fields), like below:
"_id" : NumberLong(714847),
"franchiseIds" : [
NumberLong(714848),
NumberLong(714849)
],
"profileSettings" : {
"DISCLAIMER_SETUP" : {
"settingType" : "DISCLAIMER_SETUP",
"attributeMap" : {
...
I want to have an operation which will go through the entire collection from time to time and calculate how many franchiseIds are present, since different documents could have anywhere from 1 to 4 franchiseIds.
From the Mongo shell, I did a very simple script to get this, and it calculated the result immediately:
rs_default:SECONDARY> var totalCount = 0;
rs_default:SECONDARY> db.profiles.find().forEach( function(profile) { totalCount += profile.franchiseIds.length } );
rs_default:SECONDARY> totalCount
However, when I attempted to do the same thing in Java, which is where this would run on the server from time to time, it was much less performant, taking around 15 seconds to complete:
int result = 0;
List<Profile> allProfiles = mongoTemplate.findAll(Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
result += profile.getFranchiseIds().size();
}
return results
I realize the above isn't performant in Java as it's having to allocate memory for all of the Profiles being loaded in. In the Mongo shell script, is Mongo simply taking care of this itself?
Any ideas how I can do something similar in Java?
EDIT:
I returned only the franchiseIds field on the response from Mongo, and that helped significantly. Below is the improved code:
final Query query = new Query();
query.fields().include(FRANCHISE_IDS);
final List<Profile> allProfiles = mongoTemplate.find(query, Profile.class, PROFILE_COLLECTION);
for (Profile profile : allProfiles) {
result += profile.getFranchiseIds().size();
}
return result;

Related

Mongo CDC throws : BSONObjectTooLarge . How to ignore this and proceed further?

I would liketo listen only to 3 collections in a database: c1, c2, c3. I was not able to figureout how to limit listening to these 3 collections only. Below is my code.
i would like to ignore this error and proceed further. How to do it? In this case the cursor itself is not getting created.
Like i said previously, is there a way to limit the listening to the collections c1, c2 c3 collections only?-- on the db side. Below code is listening to the full db and then filtering the collections on the java side.
List<Bson> pipeline = singletonList(match(in("operationType", asList("insert", "delete", "update"))));
MongoChangeStreamCursor<ChangeStreamDocument<Document>> cursor;
String resumeTokenStr = getResumeTokenFromS3(cdcConfig);
if (resumeTokenStr == null) {
cursor = mongoClient.watch(pipeline).fullDocument(FullDocument.UPDATE_LOOKUP).cursor();
} else {
BsonDocument resumeToken = BsonDocument.parse(resumeTokenStr);
cursor = mongoClient.watch(pipeline).batchSize(1).maxAwaitTime(60, TimeUnit.SECONDS).startAfter(resumeToken).fullDocument(FullDocument.UPDATE_LOOKUP).cursor();
}
return cursor;
The above code throws the below error
com.mongodb.MongoCommandException: Command failed with error 10334 (BSONObjectTooLarge): 'BSONObj size: 16795345 (0x10046D1) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: { _data: "826337A73B0000000A2B022C0100296E5A1004B317A529F739433BA840730515AC0EAC46645F6964006462624E8146E0FB000934F6560004" }' on server crm-mongo-report01.prod.phenom.local:27017. The full response is {"operationTime": {"$timestamp": {"t": 1664707966, "i": 25}}, "ok": 0.0, "errmsg": "BSONObj size: 16795345 (0x10046D1) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: { _data: \"826337A73B0000000A2B022C0100296E5A1004B317A529F739433BA840730515AC0EAC46645F6964006462624E8146E0FB000934F6560004\" }", "code": 10334, "codeName": "BSONObjectTooLarge", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1664707966, "i": 26}}, "signature": {"hash": {"$binary": {"base64": "NZDJKhCse19Eud88kNh7XRWRgas=", "subType": "00"}}, "keyId": 7113062344413937666}}}
at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:198)
at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:413)
at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:337)
at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:116)
at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:644)
at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:240)
at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:226)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:126)
at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:116)
at com.mongodb.internal.connection.DefaultServer$OperationCountTrackingConnection.command(DefaultServer.java:345)
at com.mongodb.internal.operation.CommandOperationHelper.createReadCommandAndExecute(CommandOperationHelper.java:232)
at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$4(CommandOperationHelper.java:214)
at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$2(OperationHelper.java:575)
at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$3(OperationHelper.java:574)
at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
at com.mongodb.internal.operation.OperationHelper.withSourceAndConnection(OperationHelper.java:573)
at com.mongodb.internal.operation.CommandOperationHelper.lambda$executeRetryableRead$5(CommandOperationHelper.java:211)
at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:65)
at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:217)
at com.mongodb.internal.operation.CommandOperationHelper.executeRetryableRead(CommandOperationHelper.java:197)
at com.mongodb.internal.operation.AggregateOperationImpl.execute(AggregateOperationImpl.java:195)
at com.mongodb.internal.operation.ChangeStreamOperation$1.call(ChangeStreamOperation.java:347)
at com.mongodb.internal.operation.ChangeStreamOperation$1.call(ChangeStreamOperation.java:343)
at com.mongodb.internal.operation.OperationHelper.withReadConnectionSource(OperationHelper.java:538)
at com.mongodb.internal.operation.ChangeStreamOperation.execute(ChangeStreamOperation.java:343)
at com.mongodb.internal.operation.ChangeStreamOperation.execute(ChangeStreamOperation.java:58)
at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:191)
at com.mongodb.client.internal.ChangeStreamIterableImpl.execute(ChangeStreamIterableImpl.java:221)
at com.mongodb.client.internal.ChangeStreamIterableImpl.cursor(ChangeStreamIterableImpl.java:174)
at com.company.cdc.services.CDCMain.getCursorAtResumeToken(CdcServiceMain.java:217)
line 217 points to the line : cursor = mongoClient.watch(pipeline).batchSize(1).maxAwaitTime(60, TimeUnit.SECONDS).startAfter(resumeToken).fullDocument(FullDocument.UPDATE_LOOKUP).cursor();
This is how i ended up solving the problem. (just incase somebody else is searching for solution to-this problem)
removed FullDocument.UPDATE_LOOKUP when creating the cursor. So now my code looks like cursor = mongoClient.watch(pipeline).batchSize(20000).cursor(); --Now this avoided the gigantic column that may endup in the document which was eventually error-ing out. This worked.
In my case i didnt have to listen to the collection updates which had this bad data. So i modified my cursor to listen only on the databases and collections of my interest --instead of listening on the entire database and then ignoring the unwanted collections later. Below is the code
When writing to the destination i had done bulk lookup on mongo db, constructed the full document and then written it --This approach of lazy lookup reduced a lot of memory footprint of the java program.
private List<Bson> generatePipeline(CdcConfigs cdcConfig) {
List<String> whiteListedCollections = getWhitelistedCollection(cdcConfig);
List<String> whiteListedDbs = getWhitelistedDbs(cdcConfig);
log.info("whitelisted dbs:" + whiteListedDbs + " coll:" + whiteListedCollections);
List<Bson> pipeline;
if (whiteListedDbs.size() > 0)
pipeline = singletonList(match(and(
in("ns.db", whiteListedDbs),
in("ns.coll", whiteListedCollections),
in("operationType", asList("insert", "delete", "update")))));
else
pipeline = singletonList(match(and(
in("ns.coll", whiteListedCollections),
in("operationType", asList("insert", "delete", "update")))));
return pipeline;
}
private MongoChangeStreamCursor<ChangeStreamDocument<Document>> getCursorAtResumeToken(CdcConfigs cdcConfig, MongoClient mongoClient) {
List<Bson> pipeline = generatePipeline(cdcConfig);
MongoChangeStreamCursor<ChangeStreamDocument<Document>> cursor;
String resumeTokenStr = getResumeTokenFromS3(cdcConfig);
if (resumeTokenStr == null) {
// cursor = mongoClient.watch(pipeline).fullDocument(FullDocument.UPDATE_LOOKUP).cursor();
cursor = mongoClient.watch(pipeline).batchSize(20000).cursor();
log.warn("RESUME TOKEN IS NULL. READING CDC FROM CURRENT TIMESTAMP FROM MONGO DB !!! ");
} else {
BsonDocument resumeToken = BsonDocument.parse(resumeTokenStr);
cursor = mongoClient.watch(pipeline).batchSize(20000).maxAwaitTime(30, TimeUnit.MINUTES).startAfter(resumeToken).cursor();
// cursor = mongoClient.watch(pipeline).startAfter(resumeToken).fullDocument(FullDocument.UPDATE_LOOKUP).cursor();
}
return cursor;
}
The solutions available sofar are more inclined towards python code. So it was a challange translating them into Java.
My case is a little bit different. I have a trigger used to align documents from a collection to another, with the option to include the fullDocument in the change event. Problem is that the metadata added to the event, combined with the fullDocument, exceeds the BSON max size (16MB).
I solved removing the fullDocument from the trigger and getting the document from the collection itself performing an aggregate:
collection1.aggregate([
{$match: {_id: changeEvent.documentKey._id}},
{ $merge: { into: "collection", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
]);

Spring data aggregation query elasticsearch

I am trying to make the below elasticsearch query to work with spring data. The intent is to return unique results for the field "serviceName". Just like a SELECT DISTINCT serviceName FROM table would do comparing to a SQL database.
{
"aggregations": {
"serviceNames": {
"terms": {
"field": "serviceName"
}
}
},
"size":0
}
I configured the field as a keyword and it made the query work perfectly in the index_name/_search api as per the response snippet below:
"aggregations": {
"serviceNames": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "service1",
"doc_count": 20
},
{
"key": "service2",
"doc_count": 8
},
{
"key": "service3",
"doc_count": 8
}
]
}
}
My problem is the same query doesn't work in Spring data when I try to run with a StringQuery I get the error below. I am guessing it uses a different api to run queries.
Cannot execute jest action , response code : 400 , error : {"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [aggregations]","line":2,"col":19}],"type":"parsing_exception","reason":"no [query] registered for [aggregations]","line":2,"col":19} , message : null
I have tried using the SearchQuery type to achieve the same results, no duplicates and no object loading, but I had no luck. The below sinnipet shows how I tried doing it.
final TermsAggregationBuilder aggregation = AggregationBuilders
.terms("serviceName")
.field("serviceName")
.size(1);
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withIndices("index_name")
.withQuery(matchAllQuery())
.addAggregation(aggregation)
.withSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.withSourceFilter(new FetchSourceFilter(new String[] {"serviceName"}, new String[] {""}))
.withPageable(PageRequest.of(0, 10000))
.build();
Would someone know how to achieve no object loading and object property distinct aggregation on spring data?
I tried many things without success to print queries on spring data, but I could not, maybe because I am using the com.github.vanroy.springdata.jest.JestElasticsearchTemplate implementation.
I got the query parts with the below:
logger.info("query:" + searchQuery.getQuery());
logger.info("agregations:" + searchQuery.getAggregations());
logger.info("filter:" + searchQuery.getFilter());
logger.info("search type:" + searchQuery.getSearchType());
It prints:
query:{"match_all":{"boost":1.0}}
agregations:[{"serviceName":{"terms":{"field":"serviceName","size":1,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}]
filter:null
search type:DFS_QUERY_THEN_FETCH
I figured out, maybe can help someone. The aggregation don't come with the query results, but in a result for it self and is not mapped to any object. The Objects results that comes apparently are samples of the query elasticsearch did to run your aggregation (not sure, maybe).
I ended up by creating a method which can do a simulation of what would be on the SQL SELECT DISTINCT your_column FROM your_table, but I think this will work only on keyword fields, they have a limitation of 256 characters if I am not wrong. I explained some lines in comments.
Thanks #Val since I was only able to figure it out when debugged into Jest code and check the generated request and raw response.
public List<String> getDistinctField(String fieldName) {
List<String> result = new ArrayList<>();
try {
final String distinctAggregationName = "distinct_field"; //name the aggregation
final TermsAggregationBuilder aggregation = AggregationBuilders
.terms(distinctAggregationName)
.field(fieldName)
.size(10000);//limits the number of aggregation list, mine can be huge, adjust yours
SearchQuery searchQuery = new NativeSearchQueryBuilder()
.withIndices("your_index")//maybe can be omitted
.addAggregation(aggregation)
.withSourceFilter(new FetchSourceFilter(new String[] { fieldName }, new String[] { "" }))//filter it to retrieve only the field we ar interested, probably we can take this out.
.withPageable(PageRequest.of(0, 1))//can't be zero, and I don't want to load 10 results every time it runs, will always return one object since I found no "size":0 in query builder
.build();
//had to use the JestResultsExtractor because com.github.vanroy.springdata.jest.JestElasticsearchTemplate don't have an implementation for ResultsExtractor, if you use Spring defaults, you can probably use it.
final JestResultsExtractor<SearchResult> extractor = new JestResultsExtractor<SearchResult>() {
#Override
public SearchResult extract(SearchResult searchResult) {
return searchResult;
}
};
final SearchResult searchResult = ((JestElasticsearchTemplate) elasticsearchOperations).query(searchQuery,
extractor);
final MetricAggregation aggregations = searchResult.getAggregations();
final TermsAggregation termsAggregation = aggregations.getTermsAggregation(distinctAggregationName);//this is where your aggregation results are, in "buckets".
result = termsAggregation.getBuckets().parallelStream().map(TermsAggregation.Entry::getKey)
.collect(Collectors.toList());
} catch (Exception e) {
// threat your error here.
e.printStackTrace();
}
return result;
}

How to add a sort query to my StructuredQueryBuilder before talking to Marklogic

I am trying to add a sort/ordering query.
At my java:
StructuredQueryBuilder qb = new StructuredQueryBuilder();
QueryDefinition queryDef = qb.and(qb.value(qb.jsonProperty("status"), "Active"));
SearchHandle resultsHandle = new SearchHandle();
queryManager.setPageLength(PAGE_SIZE_TEN);
int start = PAGE_SIZE_TEN * (pageNumber - 1) + 1;
queryManager.search(queryDef, resultsHandle, start);
The above will return the resultsHandle with 10 json files found for each page specified for the variable "start", with status "Active".
My question is how do I include a sorting query like maybe something along the line of the following:
QueryDefinition queryDef = qb.and(qb.value(qb.jsonProperty("status"), "Active"),
qb.sort?(qb.jsonProperty("dateCreated"));
I want it to get me the 1st 10 json files in order of latest date. It is too late to do a Comparator after getting the result, as the result returns a random 10 json files not in any particular order.
A few samples of the json files will look as such:
1.json
[
{
"id":"1",
"dateCreated":"2017-10-01 12:00:00",
"status":"Active"
"body":"This is a test"
}
]
2.json
[
{
"id":"2",
"dateCreated":"2017-10-02 12:00:00",
"status":"Active"
"body":"This is a test 2"
}
]
I realized there's a enum StructuredQueryBuilder.Ordering, how do I use it?
StructuredQueryBuilder.Ordering is specifically for use with near-query and is unrelated to what you want to do. You need to use query options to define a sort order for your search results. See the sort-order query option:
http://docs.marklogic.com/guide/search-dev/appendixa#id_44212
Options can be pre-defined and installed on MarkLogic and then referenced in your search, or you can define them at runtime and combine them with your structured query in a combined query.
Predefined: http://docs.marklogic.com/guide/java/query-options#chapter
Dynamic: http://docs.marklogic.com/guide/java/searches#id_76144

Neo4j ExecutionEngine does not return valid results

Trying to use a similar example from the sample code found here
My sample function is:
void query()
{
String nodeResult = "";
String rows = "";
String resultString;
String columnsString;
System.out.println("In query");
// START SNIPPET: execute
ExecutionEngine engine = new ExecutionEngine( graphDb );
ExecutionResult result;
try ( Transaction ignored = graphDb.beginTx() )
{
result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
// END SNIPPET: execute
// START SNIPPET: items
Iterator<Node> n_column = result.columnAs( "n" );
for ( Node node : IteratorUtil.asIterable( n_column ) )
{
// note: we're grabbing the name property from the node,
// not from the n.name in this case.
nodeResult = node + ": " + node.getProperty( "Name" );
System.out.println("In for loop");
System.out.println(nodeResult);
}
// END SNIPPET: items
// START SNIPPET: columns
List<String> columns = result.columns();
// END SNIPPET: columns
// the result is now empty, get a new one
result = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n, n.Name" );
// START SNIPPET: rows
for ( Map<String, Object> row : result )
{
for ( Entry<String, Object> column : row.entrySet() )
{
rows += column.getKey() + ": " + column.getValue() + "; ";
System.out.println("nested");
}
rows += "\n";
}
// END SNIPPET: rows
resultString = engine.execute( "start n=node(*) where n.Name =~ '.*79.*' return n.Name" ).dumpToString();
columnsString = columns.toString();
System.out.println(rows);
System.out.println(resultString);
System.out.println(columnsString);
System.out.println("leaving");
}
}
When I run this in the web console I get many results (as there are multiple nodes that have an attribute of Name that contains the pattern 79. Yet running this code returns no results. The debug print statements 'in loop' and 'nested' never print either. Thus this must mean there are not results found in the Iterator, yet that doesn't make sense.
And yes, I already checked and made sure that the graphDb variable is the same as the path for the web console. I have other code earlier that uses the same variable to write to the database.
EDIT - More info
If I place the contents of query in the same function that creates my data, I get the correct results. If I run the query by itself it returns nothing. It's almost as the query works only in the instance where I add the data and not if I come back to the database cold in a separate instance.
EDIT2 -
Here is a snippet of code that shows the bigger context of how it is being called and sharing the same DBHandle
package ContextEngine;
import ContextEngine.NeoHandle;
import java.util.LinkedList;
/*
* Class to handle streaming data from any coded source
*/
public class Streamer {
private NeoHandle myHandle;
private String contextType;
Streamer()
{
}
public void openStream(String contextType)
{
myHandle = new NeoHandle();
myHandle.createDb();
}
public void streamInput(String dataLine)
{
Context context = new Context();
/*
* get database instance
* write to database
* check for errors
* report errors & success
*/
System.out.println(dataLine);
//apply rules to data (make ContextRules do this, send type and string of data)
ContextRules contextRules = new ContextRules();
context = contextRules.processContextRules("Calls", dataLine);
//write data (using linked list from contextRules)
NeoProcessor processor = new NeoProcessor(myHandle);
processor.processContextData(context);
}
public void runQuery()
{
NeoProcessor processor = new NeoProcessor(myHandle);
processor.query();
}
public void closeStream()
{
/*
* close database instance
*/
myHandle.shutDown();
}
}
Now, if I call streamInput AND query in in the same instance (parent calls) the query returns results. If I only call query and do not enter ANY data in that instance (yet web console shows data for same query) I get nothing. Why would I have to create the Nodes and enter them into the database at runtime just to return a valid query. Shouldn't I ALWAYS get the same results with such a query?
You mention that you are using the Neo4j Browser, which comes with Neo4j. However, the example you posted is for Neo4j Embedded, which is the in-process version of Neo4j. Are you sure you are talking to the same database when you try your query in the Browser?
In order to talk to Neo4j Server from Java, I'd recommend looking at the Neo4j JDBC driver, which has good support for connecting to the Neo4j server from Java.
http://www.neo4j.org/develop/tools/jdbc
You can set up a simple connection by adding the Neo4j JDBC jar to your classpath, available here: https://github.com/neo4j-contrib/neo4j-jdbc/releases Then just use Neo4j as any JDBC driver:
Connection conn = DriverManager.getConnection("jdbc:neo4j://localhost:7474/");
ResultSet rs = conn.executeQuery("start n=node({id}) return id(n) as id", map("id", id));
while(rs.next()) {
System.out.println(rs.getLong("id"));
}
Refer to the JDBC documentation for more advanced usage.
To answer your question on why the data is not durably stored, it may be one of many reasons. I would attempt to incrementally scale back the complexity of the code to try and locate the culprit. For instance, until you've found your problem, do these one at a time:
Instead of looping through the result, print it using System.out.println(result.dumpToString());
Instead of the regex query, try just MATCH (n) RETURN n, to return all data in the database
Make sure the data you are seeing in the browser is not "old" data inserted earlier on, but really is an insert from your latest run of the Java program. You can verify this by deleting the data via the browser before running the Java program using MATCH (n) OPTIONAL MATCH (n)-[r]->() DELETE n,r;
Make sure you are actually working against the same database directories. You can verify this by leaving the server running. If you can still start your java program, unless your Java program is using the Neo4j REST Bindings, you are not using the same directory. Two Neo4j databases cannot run against the same database directory simultaneously.

MongoDB SELF JOIN query having 1 collection

I'd like to do something like
SELECT e1.sender
FROM email as e1, email as e2
WHERE e1.sender = e2.receiver;
but in MongoDB. I found many forums about JOIN, which can be implemented via MapReduce in MongoDB, but I don't understand how to do it in this example with self-join.
I was thinking about something like this:
var map1 = function(){
var output = {
sender:db.collectionSender.email,
receiver: db.collectionReceiver.findOne({email:db.collectionSender.email}).email
}
emit(this.email, output);
};
var reduce1 = function(key, values){
var outs = {sender:null, receiver:null
values.forEach(function(v) {
if(outs.sender == null){
outs.sender = v.sender
}
if(outs.receivers == null){
outs.receiver = v.receiver
}
});
return outs; }};
db.email.mapReduce(map2,reduce2,{out:'rec_send_email'})
to create 2 new collections - collectionReceiver containing only receiver email and collectionSender containing only sender email
OR
var map2 = function(){
var output = {sender:this.sender,
receiver: db.email.findOne({receiver:this.sender})}
emit(this.sender, output);
};
var reduce2 = function(key, values){
var outs = {sender:null, receiver:null
values.forEach(function(v){
if(outs.sender == null){
outs.sender = v.sender
}
if(outs.receiver == null){
outs.receiver = v.receiver
}
});
return outs; };};
db.email.mapReduce(map2,reduce2,{out:'rec_send_email'})
but none of them is working and I don't understand this MapReduce-thing well. Could somebody explain it to me please? I was inspired by this article http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ .
Additionally, I need to write it in Java. Is there any way how to solve it?
If you need to implement a "self-join" when using MongoDB then you may have structured your schema incorrectly (or sub-optimally).
In MongoDB (and noSQL in general) the schema structure should reflect the queries you will need to run against them.
It looks like you are assuming a collection of emails where each document has one sender and one receiver and now you want to find all senders who also happen to be receivers of email? The only way to do this would be via two simple queries, and not via map/reduce (which would be far more complex, unnecessary and the way you've written them wouldn't work as you can't query from within map function).
You are writing in Java - why not make two queries - the first to get all unique senders and the second to find all unique receivers who are also in the list of senders?
In the shell it would be:
var senderList = db.email.distinct("sender");
var receiverList = db.email.distinct("receiver", {"receiver":{$in:senderList}})

Categories

Resources