MongoDB's reduce-phase is not working as expected - java

I worked through a Java tutorial on mapReduce programming in MongoDB and ended up with the following code:
package mapReduceExample;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class MapReduceExampleMain {

    /**
     * @param args
     */
    public static void main(String[] args) {
        Mongo mongo;
        try {
            mongo = new Mongo("localhost", 27017);
            DB db = mongo.getDB("library");
            DBCollection books = db.getCollection("books");

            BasicDBObject book = new BasicDBObject();
            book.put("name", "Understanding JAVA");
            book.put("pages", 100);
            books.insert(book);

            book = new BasicDBObject();
            book.put("name", "Understanding JSON");
            book.put("pages", 200);
            books.insert(book);

            book = new BasicDBObject();
            book.put("name", "Understanding XML");
            book.put("pages", 300);
            books.insert(book);

            book = new BasicDBObject();
            book.put("name", "Understanding Web Services");
            book.put("pages", 400);
            books.insert(book);

            book = new BasicDBObject();
            book.put("name", "Understanding Axis2");
            book.put("pages", 150);
            books.insert(book);

            String map = "function()"
                    + "{ "
                    + "var category; "
                    + "if ( this.pages > 100 ) category = 'Big Books'; "
                    + "else category = 'Small Books'; "
                    + "emit(category, {name: this.name});"
                    + "}";
            String reduce = "function(key, values)"
                    + "{"
                    + "return {books: values.length};"
                    + "}";

            MapReduceCommand cmd = new MapReduceCommand(books, map, reduce,
                    null, MapReduceCommand.OutputType.INLINE, null);
            MapReduceOutput out = books.mapReduce(cmd);
            for (DBObject o : out.results()) {
                System.out.println(o.toString());
            }

            // clean up
            db.dropDatabase();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
This is a pretty simple reduce phase, but it does not do what I want :(
The output is:
{ "_id" : "Big Books" , "value" : { "books" : 4.0}}
{ "_id" : "Small Books" , "value" : { "name" : "Understanding JAVA"}}
I would expect this:
{ "_id" : "Big Books" , "value" : { "books" : 4.0}}
{ "_id" : "Small Books" , "value" : { "books" : 1.0}}
Why does the reduce phase not return values.length in the case of a small book?
Greetings, Andre

Because if there is only one result, the reduce is never run. Change it to be a finalize function or something.

A Basic Understanding of how mapReduce Works
Let us introduce the concepts of mapReduce:
mapper - This is the stage that emits the data to be fed into the reduce stage. It requires a key and a value to be sent. You can emit several times in a mapper if you want, but the requirements stay the same.
reducer - A reducer is called only when there is more than one value for a given key, to process the list of values that have been emitted for that key.
That said, since the mapper emitted only one value for that key, your reducer was not called.
You can clean this up in a finalize function, but the emit from the mapper passing straight through is the standard, by-design behavior.
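To make this behavior concrete, here is a standalone sketch (plain Java, no MongoDB required; the category names and the pages > 100 threshold are taken from the question) that mimics the contract: reduce runs only when a key has more than one emitted value, so emitting a count of 1 per book and summing in reduce keeps the output shape consistent for both categories:

```java
import java.util.*;

public class ReduceContractDemo {

    // Simulates the server side: groups emitted (key, value) pairs, then
    // calls reduce ONLY when a key has more than one value -- a single
    // value passes through untouched, exactly as mapReduce does.
    static Map<String, Integer> mapReduce(Map<String, Integer> books) {
        Map<String, List<Integer>> emitted = new HashMap<>();
        for (Map.Entry<String, Integer> book : books.entrySet()) {
            String category = book.getValue() > 100 ? "Big Books" : "Small Books";
            emitted.computeIfAbsent(category, k -> new ArrayList<>()).add(1); // emit(category, 1)
        }
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : emitted.entrySet()) {
            List<Integer> values = e.getValue();
            if (values.size() == 1) {
                out.put(e.getKey(), values.get(0)); // reduce skipped: value already a count
            } else {
                int sum = 0;                        // reduce: sum the counts
                for (int v : values) sum += v;
                out.put(e.getKey(), sum);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> books = new LinkedHashMap<>();
        books.put("Understanding JAVA", 100);
        books.put("Understanding JSON", 200);
        books.put("Understanding XML", 300);
        books.put("Understanding Web Services", 400);
        books.put("Understanding Axis2", 150);
        System.out.println(mapReduce(books)); // both categories now carry a count
    }
}
```

With the question's original map (emitting {name: ...}), the single "Small Books" value would pass through unchanged, which is exactly the output the asker saw.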

Related

Porting code from lucene to elasticsearch

I have the following simple code that I want to port from Lucene 6.5.x to Elasticsearch 5.3.x.
However, the scores are different and I want to get the same score results as in Lucene.
As an example, the idf:
Lucene's docFreq is 3 (3 docs contain the term "d") and docCount is 4 (documents with this field). Elasticsearch has docFreq 1 and docCount 2 (or 1 and 1). I am not sure how these values relate to each other in Elasticsearch...
The other difference in scoring is the avgFieldLength:
Lucene is right with 14 / 4 = 3.5. Elasticsearch's value is different for each score result - but this should be the same for all documents...
Can you please tell me which settings/mappings I missed in Elasticsearch to make it behave like Lucene?
IndexingExample.java:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.search.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.document.Field;
import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
public class IndexingExample {

    private static final String INDEX_DIR = "/tmp/lucene6idx";

    private IndexWriter createWriter() throws IOException {
        FSDirectory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        return new IndexWriter(dir, config);
    }

    private List<Document> createDocs() {
        List<Document> docs = new ArrayList<>();
        FieldType summaryType = new FieldType();
        summaryType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        summaryType.setStored(true);
        summaryType.setTokenized(true);

        Document doc1 = new Document();
        doc1.add(new Field("title", "b c d d d", summaryType));
        docs.add(doc1);

        Document doc2 = new Document();
        doc2.add(new Field("title", "b c d d", summaryType));
        docs.add(doc2);

        Document doc3 = new Document();
        doc3.add(new Field("title", "b c d", summaryType));
        docs.add(doc3);

        Document doc4 = new Document();
        doc4.add(new Field("title", "b c", summaryType));
        docs.add(doc4);

        return docs;
    }

    private IndexSearcher createSearcher() throws IOException {
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
        IndexReader reader = DirectoryReader.open(dir);
        return new IndexSearcher(reader);
    }

    public static void main(String[] args) throws IOException, ParseException {
        // indexing
        IndexingExample app = new IndexingExample();
        IndexWriter writer = app.createWriter();
        writer.deleteAll();
        List<Document> docs = app.createDocs();
        writer.addDocuments(docs);
        writer.commit();
        writer.close();

        // search
        IndexSearcher searcher = app.createSearcher();
        Query q1 = new TermQuery(new Term("title", "d"));
        TopDocs hits = searcher.search(q1, 20);
        System.out.println(hits.totalHits + " docs found for the query \"" + q1.toString() + "\"");
        for (ScoreDoc sd : hits.scoreDocs) {
            Explanation expl = searcher.explain(q1, sd.doc);
            System.out.println(expl);
        }
    }
}
Elasticsearch:
DELETE twitter
PUT twitter/tweet/1
{
  "title" : "b c d d d"
}
PUT twitter/tweet/2
{
  "title" : "b c d d"
}
PUT twitter/tweet/3
{
  "title" : "b c d"
}
PUT twitter/tweet/4
{
  "title" : "b c"
}
POST /twitter/tweet/_search
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}
Problem solved with the help of jimczy:
Don't forget that ES creates an index with 5 shards by default and
that docFreq and docCount are computed per shard. You can create an
index with 1 shard or use the dfs mode to compute distributed stats:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch
This search query (dfs_query_then_fetch) worked as expected:
POST /twitter/tweet/_search?search_type=dfs_query_then_fetch
{
  "explain": true,
  "query": {
    "term" : {
      "title" : "d"
    }
  }
}
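As an alternative to dfs_query_then_fetch, and following the quoted advice, the index can be created with a single shard so that docFreq and docCount are computed over all documents (a sketch using the ES 5.x index-settings syntax):

```
DELETE twitter
PUT twitter
{
  "settings" : {
    "index" : {
      "number_of_shards" : 1
    }
  }
}
```

With one shard, the per-shard term statistics are the global statistics, so the idf inputs match what Lucene computes over the whole index.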

How to optimize query for Mongodb

I have 300,000 documents in this specific collection.
Each document is considered as one taxi trip.
Each document contains a TaxiStation number and a License number.
My goal is to figure out the number of trips per TaxiLicense per TaxiStation.
For example:
TaxiStation A License X had 5 trips.
TaxiStation A License Y had 9 trips. And so on.
How can I optimize my query? It is takes an upwards time of 30 minutes to complete!
List taxistationOfCollection, taxiLicenseOfTaxistation;

// Here I get all the distinct TaxiStation numbers in the collection
taxistationOfCollection = coll.distinct("TaxiStation");

BasicDBObject query, tripquery;
int tripcount;

// Now I have to loop through each Taxi Station
for (int i = 0; i < taxistationOfCollection.size(); i++) {
    query = new BasicDBObject("TaxiStation", taxistationOfCollection.get(i));
    // Here, I make a list of each distinct Taxi License in the current Taxi station
    taxiLicenseOfTaxistation = coll.distinct("TaxiLicense", query);
    // Now I make a loop to process each Taxi License within the current Taxi station
    for (int k = 0; k < taxiLicenseOfTaxistation.size(); k++) {
        tripcount = 0;
        if (taxiLicenseOfTaxistation.get(k) != null) {
            // I'm looking for each Taxi Station with this Taxi License
            tripquery = new BasicDBObject("TaxiStation", taxistationOfCollection.get(i))
                    .append("TaxiLicense", taxiLicenseOfTaxistation.get(k));
            DBCursor cursor = coll.find(tripquery);
            try {
                while (cursor.hasNext()) {
                    // Increasing my counter every time I find a match
                    tripcount++;
                    cursor.next();
                }
            } finally {
                // Finally printing the results
                System.out.println("Station: " + taxistationOfCollection.get(i)
                        + " License: " + taxiLicenseOfTaxistation.get(k)
                        + " Trips: " + tripcount);
            }
        }
    }
}
Sample Document :
{
  "_id" : ObjectId("53df46ed9b2ed78fb7ca4f23"),
  "Version" : "2",
  "Display" : [],
  "Generated" : "2014-08-04,16:40:05",
  "GetOff" : "2014-08-04,16:40:05",
  "GetOffCellInfo" : "46001,43027,11237298",
  "Undisplay" : [],
  "TaxiStation" : "0000",
  "GetOn" : "2014-08-04,16:40:03",
  "GetOnCellInfo" : "46001,43027,11237298",
  "TaxiLicense" : "000000",
  "TUID" : "26921876-3bd5-432e-a014-df0fb26c0e6c",
  "IMSI" : "460018571356892",
  "MCU" : "CM8001MA121225V1",
  "System_ID" : "000",
  "MeterGetOffTime" : "",
  "MeterGetOnTime" : "",
  "Setup" : [],
  "MeterSID" : "",
  "MeterWaitTime" : "",
  "OS" : "4.2",
  "PackageVersion" : "201407300888",
  "PublishVersion" : "201312060943",
  "SWVersion" : "rel_touchbox_20101010",
  "MeterMile" : 0,
  "MeterCharged" : 0,
  "GetOnLongitude" : 0,
  "GetOnLatitude" : 0,
  "GetOffLongitude" : 0,
  "TripLength" : 2,
  "GetOffLatitude" : 0,
  "Clicks" : 0,
  "updateTime" : "2014-08-04 16:40:10"
}
Aggregation is probably what you are looking for. With an aggregation operation your whole computation runs on the database and can be expressed in a few lines. Performance should also be a lot better, since the database handles everything that needs to be done and can take full advantage of indexes.
From what you posted, this boils down to a simple $group operation. In the shell this would look like:
db.taxistationOfCollection.aggregate([
    { $group: {
        _id: { station: "$TaxiStation", licence: "$TaxiLicense" },
        count: { $sum: 1 }
    }}
])
This will give you documents of the form
{_id : {station: stationid, licence: licence_number}, count: number_of_documents}
For Java it would look like this (using coll, the collection variable from your code):
DBObject taxigroup = new BasicDBObject("$group",
        new BasicDBObject("_id",
                new BasicDBObject("station", "$TaxiStation")
                        .append("licence", "$TaxiLicense"))
                .append("count", new BasicDBObject("$sum", 1)));
AggregationOutput aggout = coll.aggregate(Arrays.asList(taxigroup));
Please note that the code snippets are not tested.
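As a sanity check of the output shape, the same grouping can be sketched in plain Java (illustration only, with made-up station/licence pairs; the real work should stay in the database as shown above):

```java
import java.util.*;
import java.util.stream.*;

public class GroupCountDemo {

    // In-memory equivalent of the $group stage: count trips per
    // (TaxiStation, TaxiLicense) pair, mirroring
    // {_id: {station, licence}, count: {$sum: 1}}.
    static Map<String, Long> countTrips(List<String[]> trips) {
        return trips.stream().collect(
                Collectors.groupingBy(t -> t[0] + "/" + t[1], Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String[]> trips = Arrays.asList(
                new String[]{"0000", "000000"},
                new String[]{"0000", "000000"},
                new String[]{"0000", "111111"});
        // one count per station/licence pair, as in the aggregation output
        System.out.println(countTrips(trips));
    }
}
```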

I got an error using MapReduce in MongoDB

First of all, I am using Windows XP 32-bit, MongoDB as the NoSQL DB and Eclipse as the editor. I got an assignment from my school about MapReduce, so I decided to find out how many people are of working age and how many are not, using mapReduce. I use this code to input data, saved as Insert.java:
package mongox;

import com.mongodb.BasicDBObject;
import com.mongodb.Mongo;
import com.mongodb.DB;
import com.mongodb.DBCollection;

public class Insert {
    public static void main(String[] args) {
        try {
            Mongo mongox = new Mongo();
            DB db = mongox.getDB("DBPublic");
            DBCollection koleksi = db.getCollection("lancestorvalley");
            BasicDBObject object = new BasicDBObject();
            object.put("NIK", "7586930211");
            object.put("Name", "Richard Bou");
            object.put("Sex", "M");
            object.put("Age", "31");
            object.put("Blood", "A");
            object.put("Status", "Married");
            object.put("Education", "Bachelor degree");
            object.put("Employment", "Labor");
            koleksi.insert(object);
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }
}
I use this code for MapReduce and save as Mapreduce.java :
package mongox;

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MapReduceCommand;
import com.mongodb.MapReduceOutput;
import com.mongodb.Mongo;

public class Mapreduce {
    public static void main(String[] args) {
        try {
            Mongo mongox = new Mongo("localhost", 27017);
            DB db = mongox.getDB("DBPublic");
            DBCollection koleksi = db.getCollection("lancestorvalley");
            String map = "function() { " +
                    "var category; " +
                    "if ( this.Age >= 15 && this.Age <= 59 ) " +
                    "category = 'Working-Age Population'; " +
                    "else " +
                    "category = 'Non-Working-Age Population'; " +
                    "emit(category, {Nama: this.Nama});}";
            String reduce = "function(key, values) { " +
                    "var sum = 0; " +
                    "values.forEach(function(doc) { " +
                    "sum += 1; " +
                    "}); " +
                    "return {data: sum};}";
            MapReduceCommand cmd = new MapReduceCommand(koleksi, map, reduce,
                    null, MapReduceCommand.OutputType.INLINE, null);
            MapReduceOutput out = koleksi.mapReduce(cmd);
            for (DBObject o : out.results()) {
                System.out.println(o.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
I have already inserted 5000 records, and when I run Mapreduce.java the output is:
{ "_id" : "Non-Working-age population" , "value" : { "data" : 41.0}}
{ "_id" : "Working-age Population" , "value" : { "data" : 60.0}}
Is there something wrong with my code in Mapreduce.java? Why is the output only that, while there are about 5000 records?
Hopefully someone can help me. Thanks in advance, guys.
The MongoDB docs explicitly state the following, which might be the cause of the unexpected behavior:
Platform Support
Starting in version 2.2, MongoDB does not support Windows XP. Please use a more recent version of Windows to use more recent releases of MongoDB.
Moreover :
MongoDB for Windows 32-bit runs on any 32-bit version of Windows newer than Windows XP. 32-bit versions of MongoDB are only intended for older systems and for use in testing and development systems. 32-bit versions of MongoDB only support databases smaller than 2GB.
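Separately from the platform question, note that the reduce function above is not re-reduce safe: it returns {data: sum} while the map emits {Nama: ...}, and MongoDB may feed reduce's own partial outputs back into reduce when a key has many values, so each partial result of many documents gets counted as 1. Here is a standalone sketch of that effect (the chunk size of 100 is made up for illustration; the real batching is an internal detail):

```java
import java.util.*;

public class ReReduceDemo {

    // Broken: like the question's reduce, it just counts how many values
    // arrive, so a partial result covering 100 documents counts as 1.
    static int brokenReduce(List<Integer> values) {
        return values.size();
    }

    // Fixed: the map should emit a count (e.g. {count: 1}) and the reduce
    // should sum the counts, so partial results re-enter reduce safely.
    static int fixedReduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Simulates the server reducing one key's emitted values in chunks,
    // then re-reducing the partial results.
    static int simulate(boolean fixed, int emitted, int chunk) {
        List<Integer> partials = new ArrayList<>();
        for (int done = 0; done < emitted; done += chunk) {
            List<Integer> values = new ArrayList<>(
                    Collections.nCopies(Math.min(chunk, emitted - done), 1)); // each emit carries count 1
            partials.add(fixed ? fixedReduce(values) : brokenReduce(values));
        }
        return fixed ? fixedReduce(partials) : brokenReduce(partials);
    }

    public static void main(String[] args) {
        System.out.println(simulate(false, 5000, 100)); // broken reduce: 50, not 5000
        System.out.println(simulate(true, 5000, 100));  // fixed reduce: 5000
    }
}
```

This matches the guidance in the first answer on this page: the value a reduce returns must have the same shape as the values the map emits.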

MongoDB Java driver: Undefined values are not shown

Open the mongo shell and create a document with an undefined value:
> mongo
MongoDB shell version: 2.4.0
connecting to: test
> use mydb
switched to db mydb
> db.mycol.insert( {a_number:1, a_string:"hi world", a_null:null, an_undefined:undefined} );
> db.mycol.findOne();
{
  "_id" : ObjectId("51c2f28a7aa5079cf24e3999"),
  "a_number" : 1,
  "a_string" : "hi world",
  "a_null" : null,
  "an_undefined" : null
}
As we can see, javascript translates the "undefined" value (stored in the db) to a "null" value, when showing it to the user. But, in the db, the value is still "undefined", as we are going to see with java.
Let's create a "bug_undefined_java_mongo.java" file, with the following content:
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.MongoClient;

public class bug_undefined_java_mongo {
    String serv_n = "myserver"; // server name
    String db_n = "mydb";       // database name
    String col_n = "mycol";     // collection name

    public static void main(String[] args) {
        new bug_undefined_java_mongo().start();
    }

    public void start() {
        pr("Connecting to server ...");
        MongoClient cli = null;
        try {
            cli = new MongoClient(serv_n);
        } catch (Exception e) {
            pr("Can't connect to server: " + e);
            System.exit(1);
        }
        if (cli == null) {
            pr("Can't connect to server");
            System.exit(1);
        }

        pr("Selecting db ...");
        DB db_res = cli.getDB(db_n);

        pr("Selecting collection ...");
        DBCollection col = db_res.getCollection(col_n);

        pr("Searching documents ...");
        DBCursor cursor = null;
        try {
            cursor = col.find();
        } catch (Exception e) {
            pr("Can't search for documents: " + e);
            System.exit(1);
        }

        pr("Printing documents ...");
        try {
            while (cursor.hasNext()) {
                Object doc_obj = cursor.next();
                System.out.println("doc: " + doc_obj);
            }
        } catch (Exception e) {
            pr("Can't browse documents: " + e);
            return;
        } finally {
            pr("Closing cursor ...");
            cursor.close();
        }
    }

    public void pr(String cad) {
        System.out.println(cad);
    }
}
After compiling and running it, we get this:
Connecting to server ...
Selecting db ...
Selecting collection ...
Searching documents ...
Printing documents ...
doc: { "_id" : { "$oid" : "51c2f0f85353d3425fcb5a14"} , "a_number" : 1.0 , "a_string" : "hi world" , "a_null" : null }
Closing cursor ...
We see that the "a_null:null" pair is shown, but... the "an_undefined:undefined" pair has disappeared! (both the key and the value).
Why? Is it a bug?
Thank you
Currently undefined is not supported by the Java driver, as there is no equivalent mapping in Java.
Other drivers such as pymongo and the js shell handle this differently by casting undefined to None when representing the data; however, it is a separate datatype and is deprecated in the BSON spec.
If you need it in the java driver then you will have to code your own decoder factory and then set it like so:
collection.setDBDecoderFactory(MyDecoder.FACTORY);
A minimal example that has defined handling for undefined and factory is available on github in the horn of mongo repo.
I see, creating a factory could be a solution.
Anyway, many developers would probably find it useful to be able to enable a mapping in the driver that automatically converts "undefined" values to "null". For example, by calling a mapUndefToNull() method:
cli = new MongoClient( myserver );
cli.mapUndefToNull(true);
In my case, I'm running a MapReduce (it is JavaScript code) on my collections, and I have to explicitly convert the undefined values (generated when accessing non-existent keys) to null, in order to stop the Java driver from removing them:
try { value = this[ key ] } catch(e) { value = null }
if (typeof value == "undefined") value = null; // stop the Java driver from removing it
So, as a suggestion, I'd like the mapUndefToNull() method to be added to the Java driver. If possible.
Thank you

Searching for tags in Mongodb java

I have 3 different collections, with different content in my script:
image, audio and video.
In each element I put in the database, I add a tag.
When I try to search for the tags (of the files I added to each collection), I can only find the tags for the image collection:
-------------------------------CODE---------------------------------------------------
protected void search(String term) {
    tagCounter = 0;
    DBCollection image = db.getCollection("p");
    DBCollection audio = db.getCollection("a");
    DBCollection video = db.getCollection("video");

    String search = searchField.getText();
    search.trim().toLowerCase();

    BasicDBObject tagQuery = new BasicDBObject();
    tagQuery.put("tags", search);
    DBCursor cursor = collection.find(tagQuery);

    tagQuery.put("tags", search);
    cursor = image.find(tagQuery);
    while (cursor.hasNext()) {
        results.addElement(cursor.next().toString());
        tagCounter++;
        searchField.setText(null);
    }

    cursor = audio.find(tagQuery);
    while (cursor.hasNext()) {
        results.addElement(cursor.next());
        tagCounter++;
        searchField.setText(null);
    }

    cursor = video.find(tagQuery);
    while (cursor.hasNext()) {
        results.addElement(cursor.next().toString());
        tagCounter++;
        searchField.setText(null);
    }

    JOptionPane counter = new JOptionPane();
    counter.showMessageDialog(resultList, "Search gave " + tagCounter + " files");
}
Can anyone help a newbie out? :)
The code works perfectly for me, except for the fact that you have a lot of references to things that are not declared/defined, and you are missing .toString() in the audio collection.
In a nutshell, the data is fetched the same way from all the collections. What you need to check in your code is what the searchField.setText(null); line does - since you are getting results fine for the first collection but not the next two, you are likely clearing something that's needed by the code.
The best thing to do is to use lots of "debugging" statements throughout, not just at the end. Here is my simplified version of your code (I put one matching document in each collection):
int tagCounter = 0;
DBCollection image = db.getCollection("p");
DBCollection audio = db.getCollection("a");
DBCollection video = db.getCollection("video");

String search = "tag1";
search = search.trim().toLowerCase(); // the result must be assigned; Strings are immutable

BasicDBObject tagQuery = new BasicDBObject();
tagQuery.put("tags", search);

DBCursor cursor = null;
cursor = image.find(tagQuery);
while (cursor.hasNext()) {
    System.out.println(cursor.next().toString());
    tagCounter++;
}
System.out.println(tagCounter + " matches found in image");

cursor = audio.find(tagQuery);
tagCounter = 0;
while (cursor.hasNext()) {
    System.out.println(cursor.next().toString());
    tagCounter++;
}
System.out.println(tagCounter + " matches found in audio");

cursor = video.find(tagQuery);
tagCounter = 0;
while (cursor.hasNext()) {
    System.out.println(cursor.next().toString());
    tagCounter++;
}
System.out.println(tagCounter + " matches found in video");
And my output is:
{ "_id" : { "$oid" : "5186a59151058e0786e90eee"} , "tags" : [ "tag1" , "tag2"]}
1 matches found in image
{ "_id" : { "$oid" : "5186a59851058e0786e90eef"} , "tags" : [ "tag1" , "tag2"]}
1 matches found in audio
{ "_id" : { "$oid" : "5186a5a851058e0786e90ef0"} , "tags" : [ "tag1" , "tag2"]}
1 matches found in video
