Map Reduce with MongoDB and Morphia - Java

I am running map-reduce in MongoDB with Morphia. This is my Java code:
String map = "function() { emit(this.id, this.cal.charge); }";
String reduce = "function(k, v) { var i, sum = 0; for (i in v) { sum += v[i]; } return sum; }";
MapreduceResults<Re> mrRes = ds.mapReduce(MapreduceType.MERGE,
        ds.createQuery(MyTable.class).field("id").equal(5),
        map, reduce, null, null, Re.class);
This works fine and puts the results into the 'Re' collection, but how can I get the result as objects or a list without inserting into a new collection?
Thanks

I couldn't post this as a comment because it exceeds the length limit.
If it is not going to be too much of a fuss, you can do it by using the Java driver directly, without going through the Morphia interface. Just get the Mongo object from the Morphia datastore and use the Java driver's map-reduce command; it is something like this:
DBObject queryObject = new BasicDBObject("id", 5);
DBCollection collection = ds.getCollection(MyTable.class);
MapReduceCommand mrc = new MapReduceCommand(
        collection,                          // collection to run map-reduce on
        map,                                 // map function
        reduce,                              // reduce function
        null,                                // output collection (none for inline output)
        MapReduceCommand.OutputType.INLINE,  // output result type
        queryObject);                        // query to restrict the input documents
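For completeness, a minimal sketch (not from the original answer) of running that command and consuming the inline results with the legacy driver API; how you map each DBObject into your own Re class is an assumption, shown here with a hypothetical constructor:
// Execute the command; with OutputType.INLINE nothing is written to a collection.
MapReduceOutput out = collection.mapReduce(mrc);
List<Re> results = new ArrayList<Re>();
for (DBObject obj : out.results()) {
    Object key = obj.get("_id");      // the key emitted by the map function
    Object charge = obj.get("value"); // the reduced value
    results.add(new Re(key, charge)); // hypothetical constructor; adapt to your Re class
}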
By the way, Morphia has a newer version on GitHub (https://github.com/jmkgreen/morphia); you may want to check that out too. I saw that the newer version also doesn't support inline output for map-reduce.

Related

How to do pagination with DynamoDBMapper?

I'm developing an application in Quarkus that integrates with DynamoDB. I have a query method that returns a list, and I'd like this list to be paginated; the pagination has to be done manually by passing the parameters.
I chose to use DynamoDBMapper because it gives more possibilities for working with lists of objects and the level of complexity is lower.
Does anyone have any idea how to do this pagination manually in the function?
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
        .withLimit(pageSize)
        .withExclusiveStartKey(paginationStartKey); // Map<String, AttributeValue>; null for the first page
ScanResultPage<YourModel> result = mapper.scanPage(YourModel.class, scanExpression);
List<YourModel> items = result.getResults();
Map<String, AttributeValue> nextPaginationStartKey = result.getLastEvaluatedKey();
You can pass the pageSize and the exclusive start key as parameters to your query method. The last evaluated key (null when there are no more pages) can be returned along with the results and used as the start key for the next page.
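A minimal sketch of how such a method could look, reusing the snippet above (the PageResult wrapper and the getPage method name are illustrative, not part of the DynamoDB SDK):
public class PageResult<T> {
    public final List<T> items;
    public final Map<String, AttributeValue> nextStartKey; // null when there are no more pages
    public PageResult(List<T> items, Map<String, AttributeValue> nextStartKey) {
        this.items = items;
        this.nextStartKey = nextStartKey;
    }
}

public PageResult<YourModel> getPage(int pageSize, Map<String, AttributeValue> startKey) {
    DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
            .withLimit(pageSize)
            .withExclusiveStartKey(startKey); // null on the first call
    ScanResultPage<YourModel> page = mapper.scanPage(YourModel.class, scanExpression);
    return new PageResult<>(page.getResults(), page.getLastEvaluatedKey());
}
The caller keeps passing the returned nextStartKey back in until it comes back null.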
DynamoDBMapper paginates by iterating over the results, lazily loading the dataset:
By default, the scan method returns a "lazy-loaded" collection. It initially returns only one page of results, and then makes a service call for the next page if needed. To obtain all the matching items, iterate over the result collection.
For example:
List<Customer> result = mapper.scan(Customer.class, scanExpression);
for (Customer cust : result) {
    System.out.println(cust.getId());
}
To scan manually, page by page, you can use scanPage:
final DynamoDBScanExpression scanPageExpression = new DynamoDBScanExpression()
        .withLimit(limit);
do {
    ScanResultPage<MyClass> scanPage = mapper.scanPage(MyClass.class, scanPageExpression);
    scanPage.getResults().forEach(System.out::println);
    System.out.println("LastEvaluatedKey=" + scanPage.getLastEvaluatedKey());
    scanPageExpression.setExclusiveStartKey(scanPage.getLastEvaluatedKey());
} while (scanPageExpression.getExclusiveStartKey() != null);

Query dsl - Looping through a list to create OR predicates

How do I dynamically create "OR" predicates if I have a List<List<String>>?
I am using Querydsl and Spring Data.
QOrder order = QOrder.order;
JPQLQuery<Order> query = from(order);
query.where(order.status.eq("ready"));
List<List<String>> filterTypes;
This is what I am trying to do:
for (List<String> types : filterTypes) {
    query.where(order.type.in(types));
}
So the result should be something like
select * from order o where o.status='ready' and (o.type in(t1,t2) or o.type in(t3,t4))
To directly answer your question: assuming you're using a relatively modern version of QueryDSL, you should be able to use a BooleanBuilder:
QOrder order = QOrder.order;
SQLQuery<Order> query = from(order);
query.where(order.status.eq("ready"));

// Example data
List<List<String>> filterTypes = ImmutableList.of(
        ImmutableList.of("t1", "t2"),
        ImmutableList.of("t3", "t4"));

BooleanBuilder builder = new BooleanBuilder();
for (List<String> types : filterTypes) {
    builder.or(order.type.in(types));
}
query.where(builder);
Backing up, assuming your actual application data model is similar to the example you provided, this:
o.type in (t1,t2) or o.type in (t3,t4)
Is equivalent to:
o.type in (t1,t2,t3,t4)
You could translate your List<List<String>> into List<String> and do your type query update once:
QOrder order = QOrder.order;
SQLQuery<Order> query = from(order);
query.where(order.status.eq("ready"));

// Example data
List<List<String>> filterTypes = ImmutableList.of(
        ImmutableList.of("t1", "t2"),
        ImmutableList.of("t3", "t4"));

List<String> flatFilterTypes = filterTypes.stream()
        .flatMap(List::stream)
        .collect(Collectors.toList());
query.where(order.type.in(flatFilterTypes));
I suspect that your database's query optimizer would produce the same plan for either query (you'd have to check the query execution plan to be sure), but it would probably be clearer what's going on if you simplified the query on the Java side rather than relying on the database query optimizer.

Consume the results of the Neo4j driver in Java

Using the Neo4j driver for Java, I want to send a search query to the database such as:
MATCH(a:`Label`{Property:"NODE_PROPERTY"})
RETURN *
First I create a session and then I use the driver's run method to run the query:
Result run = session.run(query);
The run variable contains a list of Records. My question is: how can I consume the records so that I can convert them to Java objects? I tried to get the values of the results, but since they're not iterable, it's not possible to get them one by one.
Result implements Iterator<Record>, so there are several ways of consuming it, e.g.:
While loop (Java 6 style):
Result result = session.run(query);
List<MyPojo> myList = new ArrayList<>();
while (result.hasNext()) {
    Record r = result.next();
    myList.add(mapToMyPojo(r));
}
Stream (Java 8+ style):
Result result = session.run(query);
List<MyPojo> myList = result.stream()
        .map(record -> mapToMyPojo(record))
        .collect(Collectors.toList());
Using Result.list(Function<Record,T> mapFunction):
Result result = session.run(query);
List<MyPojo> myList = result.list(r -> mapToMyPojo(r));
Mapping to a Java object is pretty straightforward:
public MyPojo mapToMyPojo(Record record) {
    MyPojo pojo = new MyPojo();
    pojo.setProperty(record.get("Property").asString());
    // ...
    return pojo;
}
Although, instead of mapping results manually, you might want to use neo4j-ogm.
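As an illustration only, a minimal sketch of the same mapping with neo4j-ogm annotations (the label, property, and package names below are assumptions based on the Cypher example, not from the original question):
@NodeEntity(label = "Label")        // assumed label, matching the MATCH clause above
public class MyPojo {
    @Id @GeneratedValue
    private Long id;

    @Property(name = "Property")    // assumed property name
    private String property;

    // getters and setters ...
}

// With a SessionFactory pointing at your domain package (assumed name):
SessionFactory sessionFactory = new SessionFactory(configuration, "com.example.domain");
Session session = sessionFactory.openSession();
Collection<MyPojo> pojos = session.loadAll(MyPojo.class);
Note that Session here is org.neo4j.ogm.session.Session, not the driver's session.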

Check for duplicate records while BulkWriteOperation into mongo using hadoop reducer

I am using Hadoop map-reduce for processing an XML file. I am directly storing the JSON data into MongoDB. How can I ensure that only non-duplicate records are stored in the database when executing the BulkWriteOperation?
The duplicate-record criteria are based on product image and product name. I do not want to use a layer of Morphia where we could assign indexes to the class members.
Here is my reducer class:
public class XMLReducer extends Reducer<Text, MapWritable, Text, NullWritable> {
    private static final Logger LOGGER = Logger.getLogger(XMLReducer.class);

    protected void reduce(Text key, Iterable<MapWritable> values, Context ctx) throws IOException, InterruptedException {
        LOGGER.info("reduce()------Start for key>" + key);
        Map<String, String> insertProductInfo = new HashMap<String, String>();
        try {
            MongoClient mongoClient = new MongoClient("localhost", 27017);
            DB db = mongoClient.getDB("test");
            BulkWriteOperation operation = db.getCollection("product").initializeOrderedBulkOperation();
            for (MapWritable entry : values) {
                for (Entry<Writable, Writable> extractProductInfo : entry.entrySet()) {
                    insertProductInfo.put(extractProductInfo.getKey().toString(), extractProductInfo.getValue().toString());
                }
                if (!insertProductInfo.isEmpty()) {
                    BasicDBObject basicDBObject = new BasicDBObject(insertProductInfo);
                    operation.insert(basicDBObject);
                }
            }
            // How can I check for duplicates before executing the bulk operation?
            operation.execute();
            LOGGER.info("reduce------end for key" + key);
        } catch (Exception e) {
            LOGGER.error("General Exception in XMLReducer", e);
        }
    }
}
EDIT: After the suggested answer I have added:
BasicDBObject query = new BasicDBObject("product_image", basicDBObject.get("product_image"))
        .append("product_name", basicDBObject.get("product_name"));
operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", basicDBObject));
operation.insert(basicDBObject);
I am getting an error like: com.mongodb.MongoInternalException: no mapping found for index 0
Any help will be useful. Thanks.
I suppose it all depends on what you really want to do with the "duplicates" here as to how you handle it.
For one, you can always use .initializeUnorderedBulkOperation(), which won't "error" on a duplicate key from your index (which you need in order to stop duplicates) but will report any such errors in the BulkWriteResult object returned from .execute():
BulkWriteResult result = operation.execute();
On the other hand, you can just use "upserts" instead and use operators such as $setOnInsert to only make changes where no duplicate existed:
BasicDBObject basicdbobject = new BasicDBObject(insertProductInfo);
BasicDBObject query = new BasicDBObject("key", basicdbobject.get("key"));
operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", basicdbobject));
So you basically look up the value of the field that holds the "key" with a query to determine a duplicate, and only actually change any data where that "key" was not found, i.e. where a new document is "inserted".
In either case the default behaviour here is to "insert" the first unique "key" value and then ignore all other occurrences. If you want to do other things, like "overwrite" or "increment" values where the same key is found, then the .update() "upsert" approach is the one you want, but you will use other update operators for those actions.
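Applied to the reducer in the question, a minimal sketch could look like this; it assumes a unique compound index on product_image and product_name exists, and it uses only the upsert form for each document (not the upsert plus a separate insert):
// One-time setup: a unique compound index so duplicate (product_image, product_name)
// pairs are rejected by the server.
db.getCollection("product").createIndex(
        new BasicDBObject("product_image", 1).append("product_name", 1),
        new BasicDBObject("unique", true));

BulkWriteOperation operation = db.getCollection("product").initializeUnorderedBulkOperation();
for (MapWritable entry : values) {
    // ... build insertProductInfo as before ...
    if (!insertProductInfo.isEmpty()) {
        BasicDBObject doc = new BasicDBObject(insertProductInfo);
        BasicDBObject query = new BasicDBObject("product_image", doc.get("product_image"))
                .append("product_name", doc.get("product_name"));
        // $setOnInsert only writes the fields when no matching document exists.
        operation.find(query).upsert().updateOne(new BasicDBObject("$setOnInsert", doc));
    }
}
BulkWriteResult result = operation.execute();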

MongoDB - copy collection in Java without looping all items

Is there a way to copy all the items of a collection to a new collection without looping over all the items?
I found a way that loops with a DBCursor:
...
DB db = mongoTemplate.getDb();
DBCursor cursor = db.getCollection("xxx").find();
// loop over all items in the collection
while (cursor.hasNext()) {
    BasicDBObject b = (BasicDBObject) cursor.next();
    // copy to new collection
    service.createNewCollection(b);
}
...
Can you suggest how to do the copy in Java without looping over all the items?
(Not in the mongo shell; with a Java implementation.)
Thanks.
In MongoDB 2.6, the $out aggregation operator was added which writes the results of the aggregation to a collection. This provides a simple way to do a server-side copy of all the items in a collection to another collection in the same database using the Java driver (I used Java driver version 2.12.0):
// set up the pipeline
List<DBObject> ops = new ArrayList<DBObject>();
ops.add(new BasicDBObject("$out", "target")); // writes to collection "target"

// run it
MongoClient client = new MongoClient("host");
DBCollection source = client.getDB("db").getCollection("source");
source.aggregate(ops);
The one-liner version:
source.aggregate(Arrays.asList((DBObject)new BasicDBObject("$out", "target")));
According to the docs, for large datasets (>100MB) you may want to use the allowDiskUse option (Aggregation Memory Restrictions), although I didn't run into that limit when I ran it on a >2GB collection, so it may not apply to this particular pipeline, at least in 2.6.0.
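If you do hit the memory limit, here is a sketch of passing allowDiskUse through the 2.x driver's AggregationOptions (the option is standard driver API; the pipeline is the same ops list as above):
AggregationOptions options = AggregationOptions.builder()
        .allowDiskUse(true)
        .build();
// This overload returns a Cursor; the $out stage still writes into the "target" collection.
Cursor cursor = source.aggregate(ops, options);
cursor.close();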
I followed the advice of inserting an array of objects: Better way to move MongoDB Collection to another Collection
This reduced my time from 45 minutes to 2 minutes. Here's the Java code.
final int OBJECT_BUFFER_SIZE = 2000;
int rowNumber = 0;
List<DBObject> objects;
final int totalRows = cursor.size();
logger.debug("Mongo query result size: " + totalRows);
// Loop design based on this:
// https://stackoverflow.com/questions/18525348/better-way-to-move-mongodb-collection-to-another-collection/20889762#20889762
// Use multiple threads to improve
do {
    logger.debug(String.format("Mongo buffer starts row %d - %d copy into %s", rowNumber,
            (rowNumber + OBJECT_BUFFER_SIZE) - 1, dB2.getStringValue()));
    cursor = db.getCollection(collectionName.getStringValue()).find(qo)
            .sort(new BasicDBObject("$natural", 1)).skip(rowNumber).limit(OBJECT_BUFFER_SIZE);
    objects = cursor.toArray();
    try {
        if (objects.size() > 0) {
            db2.getCollection(collectionName.getStringValue()).insert(objects);
        }
    } catch (final BSONException e) {
        logger.warn(String.format(
                "Mongodb copy %s %s: mongodb error. A row between %d - %d will be skipped.",
                dB1.getStringValue(), collectionName.getStringValue(), rowNumber,
                rowNumber + OBJECT_BUFFER_SIZE));
        logger.error(e);
    }
    rowNumber = rowNumber + objects.size();
} while (rowNumber < totalRows);
The buffer size appears to be important. A size of 10,000 worked fine; however, for a variety of other reasons I selected a smaller size.
You could use Google Guava to do this. To get a Set from an Iterator, you can use Sets#newHashSet(Iterator).
My idea is to send the cloneCollection admin command from the Java Driver. Below is a partial example.
DB db = mongo.getDB("admin");
DBObject cmd = new BasicDBObject();
cmd.put("cloneCollection", "users.profiles"); // the collection to clone
// add the code here to build the rest of the required fields as a JSON string
CommandResult result = db.command(cmd);
I remember leveraging the JSON.parse(...) utility API of the driver to let the driver build the structure behind the scenes. Try that, as it is much simpler.
NOTE: I haven't tried this, but I'm confident this will work.
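For reference, a sketch of how the remaining fields might be filled in: cloneCollection copies a collection from a remote mongod, so a "from" host is required and an optional "query" can filter the documents (the host and namespace below are placeholders; check the cloneCollection documentation for your server version):
DBObject cmd = new BasicDBObject("cloneCollection", "users.profiles") // namespace to copy
        .append("from", "otherhost.example.net:27017")                // remote server to copy from (placeholder)
        .append("query", new BasicDBObject());                        // optional filter; empty copies everything
CommandResult result = db.command(cmd);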
I think using the aggregation approach described by kellogg.lee is the best method if the target collection is in the same database.
In order to copy to a collection that is in another database, running on a different mongod instance, the following methods can be used:
First Method:
List<Document> documentList = sourceCollection.find().into(new ArrayList<>());
targetCollection.insertMany(documentList);
However, this method might cause an OutOfMemoryError if the source collection is huge.
Second Method:
sourceCollection.find().batchSize(1000).forEach((Block<? super Document>) document -> targetCollection.insertOne(document));
This method is safer than the first one since it does not keep a local list of all the documents, and the chunk size can be chosen according to memory requirements. However, it might be slower than the first one.
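A possible middle ground between the two is to batch documents into insertMany calls (a sketch; the batch size of 1000 is arbitrary):
List<Document> buffer = new ArrayList<>();
for (Document document : sourceCollection.find().batchSize(1000)) {
    buffer.add(document);
    if (buffer.size() == 1000) {
        targetCollection.insertMany(buffer); // write a full batch
        buffer.clear();
    }
}
if (!buffer.isEmpty()) {
    targetCollection.insertMany(buffer); // flush the remainder
}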
