MongoDB distinct too big 16mb cap - java

I have a MongoDB collection with two fields, user and url, and 39274590 documents. The key of this collection is {user, url}.
Using Java, I try to list distinct urls:
MongoDBManager db = new MongoDBManager( "Website", "UserLog" );
return db.getDistinct("url");
But I receive an exception:
Exception in thread "main" com.mongodb.CommandResult$CommandFailure: command failed [distinct]:
{ "serverUsed" : "localhost/127.0.0.1:27017" , "errmsg" : "exception: distinct too big, 16mb cap" , "code" : 10044 , "ok" : 0.0}
How can I solve this problem? Is there any plan B that can avoid this problem?

In version 2.6 you can use the aggregate commands to produce a separate collection:
http://docs.mongodb.org/manual/reference/operator/aggregation/out/
This will get around mongodb's limit of 16mb for most queries. You can read more about using the aggregation framework on large datasets in mongodb 2.6 here:
http://vladmihalcea.com/mongodb-2-6-is-out/
To do a 'distinct' query with the aggregation framework, group by the field.
db.userlog.aggregate([{$group: {_id: '$url'} }]);
Note: I don't know how this works for the Java driver, good luck.

Take a look at this answer
1) The easiest way to do this is via the aggregation framework. This takes two "$group" stages: the first groups by the distinct values, the second counts all of the distinct values.
2) If you want to do this with Map/Reduce you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key. In the second we do a count() on the new collection.
Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
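The two phases described above can be pictured with a small in-memory analogue. This is only a sketch of the logic, not driver code: a plain List of url strings stands in for the collection, and the class and method names (DistinctCountSketch, distinctValues, countDistinct) are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DistinctCountSketch {

    // Phase 1: keep one entry per distinct value, like {$group: {_id: "$url"}}.
    static Set<String> distinctValues(List<String> urls) {
        return new TreeSet<>(urls);
    }

    // Phase 2: count the entries produced by phase 1,
    // like a second $group stage (or a count() on the output collection).
    static int countDistinct(List<String> urls) {
        return distinctValues(urls).size();
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for the url field of each document.
        List<String> urls = Arrays.asList("/a", "/b", "/a", "/c", "/b");
        System.out.println(distinctValues(urls)); // [/a, /b, /c]
        System.out.println(countDistinct(urls));  // 3
    }
}
```

On the server the first phase is what produces the potentially huge intermediate result, which is why it is written to a collection (or streamed through a cursor) rather than returned inline.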

If you are using MongoDB 3.0 and above, you can use the DistinctIterable class with batchSize:
MongoCollection<Document> coll = mongodb.getCollection("mycollection");
DistinctIterable<String> ids = coll.distinct("id", String.class).batchSize(100);
for (String id: ids) {
System.out.println("" + id);
}
http://api.mongodb.com/java/current/com/mongodb/client/DistinctIterable.html

Version 3.x in Groovy:
import com.mongodb.client.AggregateIterable
import com.mongodb.client.MongoCollection
import com.mongodb.client.MongoCursor
import com.mongodb.client.MongoDatabase
import static com.mongodb.client.model.Accumulators.sum
import static com.mongodb.client.model.Aggregates.group
import static java.util.Arrays.asList
import org.bson.Document
//other code
AggregateIterable<Document> iterable = collection.aggregate(
    asList(
        group("\$" + "url", sum("count", 1))
    )
).allowDiskUse(true)
MongoCursor cursor = iterable.iterator()
while (cursor.hasNext()) {
    Document doc = cursor.next()
    println(doc.toJson())
}

Related

How to sum a mongodb inner field and push it during grouping using MongoTemplate

I can use $sum within the $push operation in a MongoDB console, but I can't figure out how to do the same using MongoTemplate.
$group : {
_id: "$some_id",
my_field: { $push : {$sum: "$my_field" }}
}
The code I am using for this is something like:
Aggregation aggregation =
    Aggregation.newAggregation(
        match(matchingCriteria),
        group("some_id")
            .count()
            .as("count")
            .push("my_field")
            .as("my_field"),
        project("some_id", "count", "my_field"));
AggregationResults<MyModel> result =
    mongoTemplate.aggregate(aggregation, "my_collection", MyModel.class);
The thing is, I want the sum of my_field, but it comes back as an array of my_field values (as I am using push directly). I can get what I want with $sum inside $push in the console, but I can't express that with MongoTemplate. My app is in Spring Boot. I have also looked into the docs for these methods but couldn't find much.
I also tried using .sum() directly on the field (without push), but that does not work for me because my_field is an inner object, and after the grouping it is not a number but an array of numbers. That is why I need the push and sum combination.
Any help regarding this is appreciated. Thanks in advance.
I was able to get this to work using the below code:
Aggregation aggregation =
    Aggregation.newAggregation(
        match(allTestMatchingCriteria),
        project("some_id")
            .and(AccumulatorOperators.Sum.sumOf("my_field"))
            .as("my_field_sum"),
        group("some_id")
            .count()
            .as("count")
            .push("my_field_sum")
            .as("my_field_sum"),
        project("some_id", "count", "my_field_sum"));
AggregationResults<MyModel> result =
    mongoTemplate.aggregate(aggregation, "my_collection", MyModel.class);
I used AccumulatorOperators.Sum in the projection stage itself and sum the inner fields and get the desired output. Then I passed this to the grouping stage where I did the count aggregation as I needed that data as well and then had to project all the data generated to be collected as output.

Too many parameters error on the following $in query

I'm using jongo API - org.jongo.MongoCollection is the class.
I have a list of object ids, converted to an ObjectId[], and I am trying to query as follows:
collection.find("{_id:{$in:#}}", ids).as(Employee.class);
The query throws the exception "java.lang.IllegalArgumentException: Too many parameters passed to query: {"_id":{"$in":#}}".
The query doesn't work as described in the question "In Jongo, how to find multiple documents from Mongodb by a list of IDs".
Any suggestion on how to resolve?
Thanks.
Try it with a List, as shown in the docs:
List<Integer> ages = Lists.newArrayList(22, 63);
friends.find("{age: {$in:#}}", ages); //→ will produce {age: {$in:[22,63]}}
For example the following snippet I crafted quick & dirty right now worked for me (I use older verbose syntax as I am currently on such a system ...)
List<ObjectId> ids = new ArrayList<ObjectId>();
ids.add(new ObjectId("57bc7ec7b8283b457ae4ef01"));
ids.add(new ObjectId("57bc7ec7b8283b457ae4ef02"));
ids.add(new ObjectId("57bc7ec7b8283b457ae4ef03"));
int count = friends.find("{ _id: { $in: # } }", ids).as(Friend.class).count();
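A likely reason the array version fails while the List version works, sketched here under the assumption that Jongo's find takes its parameters as trailing varargs (Object...): Java hands a reference-type array in that position to the method as the varargs array itself, so each element becomes its own parameter and a template with a single # receives too many. The countParams method below is a hypothetical stand-in that only reports how many parameters arrive; it is not part of Jongo.

```java
import java.util.Arrays;
import java.util.List;

public class VarargsSpreadSketch {

    // Hypothetical stand-in for a method shaped like find(String query, Object... parameters).
    // It only reports how many parameters the callee actually received.
    static int countParams(String query, Object... parameters) {
        return parameters.length;
    }

    public static void main(String[] args) {
        String[] idArray = {"id1", "id2", "id3"};
        List<String> idList = Arrays.asList(idArray);

        // A reference-type array is used as the varargs array: three parameters.
        System.out.println(countParams("{_id:{$in:#}}", idArray));          // 3
        // A List is a single object: one parameter for the single #.
        System.out.println(countParams("{_id:{$in:#}}", idList));           // 1
        // Casting the array to Object also keeps it as one parameter.
        System.out.println(countParams("{_id:{$in:#}}", (Object) idArray)); // 1
    }
}
```

Whether a single array parameter is then bound correctly to # depends on the library, so passing a List as in the answer above is the safer choice.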

How to get array of document using Mongodb java?

How can I get all the documents under an array in MongoDB with Java? My database is as below. I want to retrieve all the data under the array 198_168_1_134.
Below is some of what I tried:
eventlist.find(new BasicDBObject("$match","192_168_10_17"))
eventlist.find(new BasicDBObject("$elemMatch","192_168_10_17"))
eventlist.find(null, new BasicDBObject("$192_168_10_17", 1))
You have two options:
using .find() with a projection that picks only the field you want fetched.
using the aggregation framework with a $project stage.
By using .find(), you can do:
db.collection.find({}, { "192_168_10_17" : 1 })
By using the aggregation framework, you can do:
db.collection.aggregate([ { $project : { "192_168_10_17" : 1 } } ])
which will fetch only the 192_168_10_17 data (plus _id).
Of course, in order to get this working in Java, you have to translate these queries to a corresponding chain of BasicDBObject instances.
With the MongoDB Java driver you can do this with the following query:
eventlist.find(new BasicDBObject(), new BasicDBObject("198_168_1_134", 1))

How to compare 2 fields in Spring Data MongoDB using query object

What seems almost natural in simple SQL is impossible in mongodb.
Given a simple document:
{
"total_units" : 100,
"purchased_units" : 60
}
I would like to query the collection, using spring data Criteria class, where "total_units > purchased_units".
To my understanding it should be as trivial as any other condition.
Found nothing to support this on Spring api.
You can use the following pattern:
Criteria criteria = new Criteria() {
    @Override
    public DBObject getCriteriaObject() {
        DBObject obj = new BasicDBObject();
        obj.put("$where", "this.total_units > this.purchased_units");
        return obj;
    }
};
Query query = Query.query(criteria);
I don't think the Spring Data API supports this yet, so you may need to wrap the $where query in a native Java DBObject. Note that query performance will suffer, since $where evaluates JavaScript code on every document, so combine it with indexed queries if you can.
Native Mongodb query:
db.collection.find({ "$where": "this.total_units > this.purchased_units" });
Native Java query:
DBObject obj = new BasicDBObject();
obj.put( "$where", "this.total_units > this.purchased_units");
Some considerations you have to look at when using $where:
Do not use global variables.
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in). In general, you should use $where only when you can’t express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a table scan.
Using normal non-$where query statements provides the following performance advantages:
MongoDB will evaluate non-$where components of query before $where statements. If the non-$where statements match no documents, MongoDB will not perform any query evaluation using $where.
The non-$where query statements may use an index.
As far as I know you can't do
query.addCriteria(Criteria.where("total_units").gt("purchased_units"));
but I would go with your suggestion to create an additional computed field, say computed_units, that holds the difference between total_units and purchased_units, which you can then query as:
Query query = new Query();
query.addCriteria(Criteria.where("computed_units").gt(0));
mongoOperation.find(query, CustomClass.class);
Thanks @Andrew Onischenko for the historic good answer.
On a more recent version of spring-data-mongodb (e.g. 2.1.9.RELEASE), I had to write the same pattern like below:
import org.bson.Document;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
// (...)
Criteria criteria = new Criteria() {
    @Override
    public Document getCriteriaObject() {
        Document doc = new Document();
        doc.put("$where", "this.total_units > this.purchased_units");
        return doc;
    }
};
Query query = Query.query(criteria);
One way is this:
Criteria c = Criteria.where("total_units").gt("$purchased_unit");
AggregationOperation matchOperation = Aggregation.match(c);
Aggregation aggregation = Aggregation.newAggregation(matchOperation);
mongoTemplate.aggregate(aggregation, "collectionNameInStringOnly", ReturnTypeEntity.class);
Remember to pass the collection name as a string, and make sure the field names in the criteria match the spelling of the fields in the database collection.

Get all documents with GridFSOperations

I've decided to move one of our projects from PostgreSQL to MongoDB and this project deals with images. I am able to save images and retrieve them by their _id now but I couldn't find a function with GridFSOperations where I could safely get all documents. I am doing this so that I could take photo meta-data I saved with the image and index them with Lucene (as I needed a full text search on some relevant metadata, also future possible scenarios where we might need to rebuild the Lucene index)
In the old code, I simply had a function with an offset and limit for the SQL query as I found out (the hard way) that our dev system can only do bulk Lucene adds in groups of 5k. Is there an equivalent way of doing this with GridFS?
Edit:
function inherited from the old interface:
public List<Photo> getPublicPhotosForReindexing(long offset, long limit) {
    List<Photo> result = new ArrayList<>();
    List<GridFSDBFile> files = gridFsOperations.find(new Query().limit((int) limit).skip((int) offset));
    for (GridFSDBFile file : files) {
        result.add(convertToPhoto(file));
    }
    return result;
}
a simple converter taking parts of the metadata and putting it in the POJO I made:
private Photo convertToPhoto(GridFSDBFile fsFile) {
    Photo resultPhoto = new Photo(fsFile.getId().toString());
    try {
        resultPhoto
            .setOriginalFilename(fsFile.getFilename())
            // .setPhotoData(IOUtils.toByteArray(fsFile.getInputStream()))
            .setDateAdded(fsFile.getUploadDate());
    } catch (Exception e) {
        logger.error("Should not hit this one", e);
    }
    return resultPhoto;
}
When you are using GridFS, the information is stored in your MongoDB database in two collections. The first is fs.files, which holds the main reference to the file; the second is fs.chunks, which actually holds the "chunks" of data. See the examples:
Collection: fs.files
{
"_id" : ObjectId("53229d20f3dde871df8b89a7"),
"filename" : "receptor.jpg",
"chunkSize" : 262144,
"uploadDate" : ISODate("2014-03-14T06:09:36.462Z"),
"md5" : "f1e71af6d0ba9c517280f33b4cbab3f9",
"length" : 138905
}
Collection: fs.chunks
{
"_id" : ObjectId("53229d20824b12efe88cc1f2"),
"files_id" : ObjectId("53229d20f3dde871df8b89a7"),
"n" : 0,
"data" : // all of the binary data
}
So really these are just normal MongoDB documents and normal collections.
As you can see, there are various ways you can "query" these collections with the standard API:
The ObjectId is monotonic and therefore ever increasing, so newer entries will have a higher ObjectId value than older ones. Most importantly, you can select everything higher than the last Id that was indexed.
The uploadDate also holds a general date timestamp that you can use for date range based queries.
So you see, that GridFS is really just "Driver level magic" to work with ordinary MongoDB documents, and treat the binary data as a single document.
As they are just normal collections with normal documents, unless you are retrieving or otherwise updating the content, then just use the normal methods to select and find.
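As a rough sketch of the "last Id that was indexed" idea, the batching can be driven by the id itself instead of an offset. The code below is an in-memory analogue only: a sorted set of strings stands in for the _id values of fs.files, and IdBatchSketch, nextBatch and the sample ids are hypothetical. Against a real collection the equivalent would presumably be a query for _id greater than the last indexed id, sorted by _id, with a limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

public class IdBatchSketch {

    // Returns up to `limit` ids strictly greater than lastIndexedId, in ascending order.
    // Pass null for lastIndexedId to start from the beginning.
    static List<String> nextBatch(NavigableSet<String> sortedIds, String lastIndexedId, int limit) {
        NavigableSet<String> remaining =
                lastIndexedId == null ? sortedIds : sortedIds.tailSet(lastIndexedId, false);
        List<String> batch = new ArrayList<>();
        for (String id : remaining) {
            if (batch.size() == limit) {
                break;
            }
            batch.add(id);
        }
        return batch;
    }

    public static void main(String[] args) {
        // Hypothetical ids; real ObjectId hex strings of equal length sort the same way.
        NavigableSet<String> ids = new TreeSet<>(Arrays.asList("a1", "a2", "a3", "a4", "a5"));
        String last = null;
        List<String> batch;
        // Keep fetching batches of 2 until nothing newer than the last indexed id remains.
        while (!(batch = nextBatch(ids, last, 2)).isEmpty()) {
            System.out.println(batch);
            last = batch.get(batch.size() - 1);
        }
    }
}
```

Remembering the last id between runs also means a later reindex can pick up only the files added since, which an offset cannot guarantee if documents are deleted in between.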
