Update collection in MongoDb via Apache Spark using Mongo-Hadoop connector

I would like to update a specific collection in MongoDb via Spark in Java.
I am using the MongoDB Connector for Hadoop to retrieve and save information from Apache Spark to MongoDb in Java.
After following Sampo Niskanen's excellent post regarding retrieving and saving collections to MongoDb via Spark, I got stuck with updating collections.
MongoOutputFormat.java includes a constructor taking String[] updateKeys, which I am guessing refers to a list of keys to match against existing documents when performing an update. However, when using Spark's saveAsNewAPIHadoopFile() method with the parameter MongoOutputFormat.class, I don't see how to invoke that update constructor.
save.saveAsNewAPIHadoopFile("file:///bogus", Object.class, Object.class, MongoOutputFormat.class, config);
Prior to this, MongoUpdateWritable.java was being used to perform collection updates. From examples I've seen on Hadoop, this is normally set on mongo.job.output.value, maybe like this in Spark:
save.saveAsNewAPIHadoopFile("file:///bogus", Object.class, MongoUpdateWritable.class, MongoOutputFormat.class, config);
However, I'm still wondering how to specify the update keys in MongoUpdateWritable.java.
Admittedly, as a hacky way, I've set the "_id" of the object as my document's KeyValue so that when a save is performed, the collection will overwrite the documents having the same KeyValue as _id.
JavaPairRDD<BSONObject, ?> analyticsResult; // JavaPairRDD of (mongoObject, result)
JavaPairRDD<Object, BSONObject> save = analyticsResult.mapToPair(s -> {
    BSONObject o = (BSONObject) s._1;
    // Build the _id from every field as key:value_
    StringBuilder id = new StringBuilder();
    for (String key : o.keySet()) {
        // String.valueOf avoids a ClassCastException when a value is not a String
        id.append(key).append(":").append(String.valueOf(o.get(key))).append("_");
    }
    o.put("_id", id.toString());
    o.put("result", s._2);
    return new Tuple2<>(null, o);
});
save.saveAsNewAPIHadoopFile("file:///bogus", Object.class, Object.class, MongoOutputFormat.class, config);
I would like to perform the mongodb collection update via Spark using MongoOutputFormat or MongoUpdateWritable or Configuration, ideally using the saveAsNewAPIHadoopFile() method. Is it possible? If not, is there any other way that does not involve specifically setting the _id to the key values I want to update on?

I tried several combinations of config.set("mongo.job.output.value","....") and several combinations of
.saveAsNewAPIHadoopFile(
"file:///bogus",
classOf[Any],
classOf[Any],
classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
mongo_config
)
and none of them worked.
I got it to work by using the MongoUpdateWritable class as the output of my map method:
items.map(row => {
val mongo_id = new ObjectId(row("id").toString)
val query = new BasicBSONObject()
query.append("_id", mongo_id)
val update = new BasicBSONObject()
update.append("$set", new BasicBSONObject().append("field_name", row("new_value")))
val muw = new MongoUpdateWritable(query,update,false,true)
(null, muw)
})
.saveAsNewAPIHadoopFile(
"file:///bogus",
classOf[Any],
classOf[Any],
classOf[com.mongodb.hadoop.MongoOutputFormat[Any, Any]],
mongo_config
)
The raw query executed in mongo is then something like this:
2014-11-09T13:32:11.609-0800 [conn438] update db.users query: { _id: ObjectId('5436edd3e4b051de6a505af9') } update: { $set: { value: 10 } } nMatched:1 nModified:0 keyUpdates:0 numYields:0 locks(micros) w:24 3ms

Related

Does spring-data-mongodb support Atlas Search? Need an example

I am trying to find out how to use MongoDB Atlas Search indexes from a Java application that uses spring-data-mongodb to query the data. Can anyone share an example?
What I found is the code below, but it is meant for MongoDB text search. It works, but I am not sure whether it uses the index defined in Atlas Search.
TextQuery textQuery = TextQuery.queryText(new TextCriteria().matchingAny(text)).sortByScore();
textQuery.fields().include("cast").include("title").include("id");
List<Movies> movies = mongoOperations
.find(textQuery, Movies.class);
I want sample Java code using spring-data-mongodb for the query below:
[
{
$search: {
index: 'cast-fullplot',
text: {
query: 'sandeep',
path: {
'wildcard': '*'
}
}
}
}
]
It would be helpful if anyone could explain how MongoDB text search differs from MongoDB Atlas Search, and the correct way to use Atlas Search with spring-data-mongodb in Java.
How do I code the following with spring-data-mongodb:
Arrays.asList(new Document("$search",
new Document("index", "cast-fullplot")
.append("text",
new Document("query", "sandeep")
.append("path",
new Document("wildcard", "*")))),
new Document())

Yes, spring-data-mongo supports the aggregation pipeline, which you'll use to execute your query.
You need to define a list of documents containing the stages of your query, in the correct order. As it stands, Atlas Search must be the first stage in the pipeline. You can translate your query into an aggregation pipeline using the MongoDB Atlas interface, which has an option to export the pipeline array in the language of your choice. Then you just need to execute the query and map the list of responses to your entity class.
You can see an example below:
public class SearchRepositoryImpl implements SearchRepositoryCustom {
    private final MongoClient mongoClient;

    public SearchRepositoryImpl(MongoClient mongoClient) {
        this.mongoClient = mongoClient;
    }

    @Override
    public List<SearchEntity> searchByFilter(String text) {
        // You can add codec configuration to your database object. This might be needed to map
        // the MongoDB data to your entity class.
        MongoDatabase database = mongoClient.getDatabase("aggregation");
        MongoCollection<Document> collection = database.getCollection("restaurants");
        // $search is the first (and only) stage; the query string comes from the method argument.
        List<Document> pipeline = List.of(new Document("$search", new Document("index", "default2")
                .append("text", new Document("query", text).append("path", new Document("wildcard", "*")))));
        List<SearchEntity> searchEntityList = new ArrayList<>();
        collection.aggregate(pipeline, SearchEntity.class).forEach(searchEntityList::add);
        return searchEntityList;
    }
}

How to update existing MongoDB Collection Validation?

I have created a MongoDB collection using the following code. I want to update this collection to add a new column named "username", and I also want to change the data type of roleId from String to Long. Please advise how to do this with the Java Mongo API.
fun createUsers(mongoDB: MongoDatabase) {
var collOptions: ValidationOptions = ValidationOptions()
collOptions.validator(
Filters.and(
Filters.exists("userId"),
Filters.type("userId",BsonType.STRING),
Filters.exists("roleId"),
Filters.type("roleId", BsonType.INT32)
)
)
collOptions.validationLevel(ValidationLevel.STRICT)
collOptions.validationAction(ValidationAction.ERROR)
mongoDB.createCollection("user", CreateCollectionOptions().validationOptions(collOptions))
}

Java driver: how to get the objectId of an updated object with Mongodb's updateFirst method

I'm trying to get the objectId of an object that I have updated - this is my java code using the java driver:
Query query = new Query();
query.addCriteria(Criteria.where("color").is("pink"));
Update update = new Update();
update.set("name", name);
WriteResult writeResult = mongoTemplate.updateFirst(query, update, Colors.class);
Log.e("object id", writeResult.getUpsertedId().toString());
The log message returns null. I'm using a mongo server 3.0 on mongolab as I'm on the free tier so it shouldn't return null. My mongo shell is also:
MongoDB shell version: 3.0.7
Is there an easy way to return the object ID for the doc that I have just updated? What is the point of the method getUpsertedId() if I cannot return the upsertedId?
To do what I want, I currently have to issue two queries which is highly cumbersome:
//1st query - updating the object first
Query query = new Query();
query.addCriteria(Criteria.where("color").is("pink"));
Update update = new Update();
update.set("name", name);
WriteResult writeResult = mongoTemplate.updateFirst(query, update, Colors.class);
//2nd query - find the object so that I can get its objectid
Query queryColor = new Query();
queryColor.addCriteria(Criteria.where("color").is("pink"));
queryColor.addCriteria(Criteria.where("name").is(name));
Color color = mongoTemplate.findOne(queryColor, Color.class);
Log.e("ColorId", color.getId());
As per David's answer, I even tried his suggestion to rather use upsert on the template, so I changed the code to the below and it still does not work:
Query query = new Query();
query.addCriteria(Criteria.where("color").is("pink"));
Update update = new Update();
update.set("name", name);
WriteResult writeResult = mongoTemplate.upsert(query, update, Colors.class);
Log.e("object id", writeResult.getUpsertedId().toString());

Simon, I think it's possible to achieve this in one query. What you need is a different method called findAndModify().
The Java driver for MongoDB has a method called findOneAndUpdate(filter, update, options).
This method returns the document that was updated. Depending on the options you specify, this is either the document as it was before the update or as it is after the update. If no documents match the query filter, null is returned. Passing options is not required; without them, the method returns the document as it was before the update was applied.
A quick look at the Spring Data MongoDB docs for FindAndModifyOptions (http://docs.spring.io/spring-data/mongodb/docs/current/api/org/springframework/data/mongodb/core/FindAndModifyOptions.html) tells me that you can use the method call:
public <T> T findAndModify(Query query,
                           Update update,
                           FindAndModifyOptions options,
                           Class<T> entityClass)
You can also configure FindAndModifyOptions to perform an upsert if no item matches the query. If a match is found, the object is simply modified.

Upsert only applies if both of the following are true:
1. The update options had upsert on.
2. A new document was actually created.
Your query neither has upsert enabled, nor creates a new document. Therefore it makes perfect sense that the getUpsertedId() returns null here.
Unfortunately it is not possible to get what you want in a single call with the current API; you need to split it into two calls. This is further indicated by the Mongo shell API for WriteResults:
"The _id of the document inserted by an upsert. Returned only if an upsert results in an insert."
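To make the rule concrete, here is a minimal in-memory sketch of the three cases. It does not use the MongoDB driver or Spring; the class UpsertedIdSketch, its updateFirst method, and the integer ids are hypothetical stand-ins invented for illustration. It only models when an upserted id would be reported.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpsertedIdSketch {
    // Hypothetical in-memory "collection": _id -> {color, name}. Not a driver class.
    private final Map<Integer, String[]> docs = new LinkedHashMap<>();
    private int nextId = 1;

    /**
     * Sets the name on the first document with the given color.
     * Returns the new _id only when an upsert actually inserted a document,
     * and null in every other case.
     */
    public Integer updateFirst(String color, String name, boolean upsert) {
        for (String[] doc : docs.values()) {
            if (doc[0].equals(color)) {
                doc[1] = name;   // matched: modify in place
                return null;     // no insert happened, so no upserted id
            }
        }
        if (!upsert) {
            return null;         // nothing matched and upsert is off
        }
        int id = nextId++;
        docs.put(id, new String[] {color, name});
        return id;               // the upsert resulted in an insert
    }

    public static void main(String[] args) {
        UpsertedIdSketch colors = new UpsertedIdSketch();
        System.out.println(colors.updateFirst("pink", "rose", false)); // prints null
        System.out.println(colors.updateFirst("pink", "rose", true));  // prints 1
        System.out.println(colors.updateFirst("pink", "blush", true)); // prints null
    }
}
```

Only the second call reports an id, because it is the only one where upsert is enabled and no existing document matched.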

Here is an example of doing this with findOneAndUpdate(filter, update, options) in Scala:
val findOneAndUpdateOptions = new FindOneAndUpdateOptions
findOneAndUpdateOptions.returnDocument(ReturnDocument.AFTER)
val filter = Document.parse("{\"idContrato\":\"12345\"}")
val update = Document.parse("{ $set: {\"descripcion\": \"New Description\" } }")
val response = collection.findOneAndUpdate(filter,update,findOneAndUpdateOptions)
println(response)

Spring Mongo delete all files from GridFs

Using Spring Mongo is there a way of deleting multiple files from Mongo(stored with GridFS) all at once using only query or something similar? For example if I have in collection .files a field Language I want to be able to delete all entries from .files and also from .chunks in a single query. Not sure if this is possible in Mongo (outside Spring).
Tried to use GridFsTemplate but the delete method calls GridFS.remove method (code from Spring-Mongodb)
public void delete(Query query) {
getGridFs().remove(getMappedQuery(query));
}
And this method gets all files in memory and then deletes them one by one:
public void remove( DBObject query ){
for ( GridFSDBFile f : find( query ) ){
f.remove();
}
}
void remove(){
_fs._filesCollection.remove( new BasicDBObject( "_id" , _id ) );
_fs._chunkCollection.remove( new BasicDBObject( "files_id" , _id ) );
}
UPDATE: from comments:
Right, you can't perform this kind of query in MongoDB. Also, there is no direct way to delete a batch of files in GridFS using either the mongofiles command-line tool or the Mongo Java driver.

To remove an entire collection at once, you can perform one of these operations:
If your collection is a MongoDB collection, you can delete it with
MongoDatabase db = client.getDatabase("db");
db.getCollection("MyCollectionName").drop();
If you are not certain that a collection named "MyCollectionName" exists, you should check that first.
If your collection is a GridFS collection, you can delete it with
MongoDatabase db = client.getDatabase("db");
db.getCollection("MyCollectionName.files").drop();
db.getCollection("MyCollectionName.chunks").drop();
This works because a GridFS collection consists of a .files and a .chunks collection behind the scenes, and those are accessible (and removable) in the 'classic' way.

DBObject query;
// write appropriate query
List<GridFSDBFile> fileList = gfs.find(query);
for (GridFSDBFile f : fileList) {
    gfs.remove(f.getFilename());
    // if you have not set fileName and
    // your _id is of ObjectId type, then you can use
    // gfs.remove((ObjectId) f.getId());
}

MongoDB SELF JOIN query having 1 collection

I'd like to do something like
SELECT e1.sender
FROM email as e1, email as e2
WHERE e1.sender = e2.receiver;
but in MongoDB. I found many forums about JOIN, which can be implemented via MapReduce in MongoDB, but I don't understand how to do it in this example with self-join.
I was thinking about something like this:
var map1 = function(){
var output = {
sender:db.collectionSender.email,
receiver: db.collectionReceiver.findOne({email:db.collectionSender.email}).email
}
emit(this.email, output);
};
var reduce1 = function(key, values){
    var outs = {sender: null, receiver: null};
    values.forEach(function(v) {
        if (outs.sender == null) {
            outs.sender = v.sender;
        }
        if (outs.receiver == null) {
            outs.receiver = v.receiver;
        }
    });
    return outs;
};
db.email.mapReduce(map1, reduce1, {out: 'rec_send_email'})
to create 2 new collections - collectionReceiver containing only receiver email and collectionSender containing only sender email
OR
var map2 = function(){
var output = {sender:this.sender,
receiver: db.email.findOne({receiver:this.sender})}
emit(this.sender, output);
};
var reduce2 = function(key, values){
    var outs = {sender: null, receiver: null};
    values.forEach(function(v) {
        if (outs.sender == null) {
            outs.sender = v.sender;
        }
        if (outs.receiver == null) {
            outs.receiver = v.receiver;
        }
    });
    return outs;
};
db.email.mapReduce(map2,reduce2,{out:'rec_send_email'})
but none of them is working and I don't understand this MapReduce-thing well. Could somebody explain it to me please? I was inspired by this article http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ .
Additionally, I need to write it in Java. Is there any way how to solve it?

If you need to implement a "self-join" when using MongoDB then you may have structured your schema incorrectly (or sub-optimally).
In MongoDB (and NoSQL in general) the schema structure should reflect the queries you will need to run against it.
It looks like you are assuming a collection of emails where each document has one sender and one receiver, and now you want to find all senders who also happen to be receivers of email. The straightforward way to do this is via two simple queries, not via map/reduce, which would be far more complex and is unnecessary. The map/reduce jobs as you've written them also wouldn't work, because you can't query the database from within the map function.
You are writing in Java - why not make two queries - the first to get all unique senders and the second to find all unique receivers who are also in the list of senders?
In the shell it would be:
var senderList = db.email.distinct("sender");
var receiverList = db.email.distinct("receiver", {"receiver":{$in:senderList}})
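Since the question asks for Java, here is a minimal in-memory sketch of the same two steps. It deliberately avoids the MongoDB driver so that it runs on its own; the class SelfJoinSketch, the nested Email class, and the sample addresses are hypothetical and exist only for illustration. With the Java driver, each step would presumably become a distinct(...) call on the collection, the second one with an $in filter built from the first result.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SelfJoinSketch {

    /** Hypothetical stand-in for one document in the email collection. */
    public static final class Email {
        final String sender;
        final String receiver;

        public Email(String sender, String receiver) {
            this.sender = sender;
            this.receiver = receiver;
        }
    }

    /** Step 1: all unique senders, like db.email.distinct("sender"). */
    public static Set<String> distinctSenders(List<Email> emails) {
        Set<String> senders = new LinkedHashSet<>();
        for (Email e : emails) {
            senders.add(e.sender);
        }
        return senders;
    }

    /** Step 2: unique receivers that also appear in the sender set (the $in filter). */
    public static Set<String> receiversWhoAlsoSend(List<Email> emails) {
        Set<String> senders = distinctSenders(emails);
        Set<String> result = new LinkedHashSet<>();
        for (Email e : emails) {
            if (senders.contains(e.receiver)) {
                result.add(e.receiver);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Email> emails = List.of(
                new Email("ann", "bob"),
                new Email("bob", "cat"),
                new Email("ann", "dan"));
        System.out.println(receiversWhoAlsoSend(emails)); // prints [bob]
    }
}
```

The result is the same set the SQL self-join describes: addresses that occur both as a sender and as a receiver.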
