MongoDB - copy collection in Java without looping all items

Is there a way to copy all items of a collection to a new collection without looping over all the items?
I found a way that loops with a DBCursor:
...
DB db = mongoTemplate.getDb();
DBCursor cursor = db.getCollection("xxx").find();
// loop over all items in the collection
while (cursor.hasNext()) {
    BasicDBObject b = (BasicDBObject) cursor.next();
    // copy to the new collection
    service.createNewCollection(b);
}
...
Can you suggest how to do the copy in Java without looping over all items?
(Not in the mongo shell; a Java implementation.)
Thanks.

In MongoDB 2.6, the $out aggregation operator was added, which writes the results of the aggregation to a collection. This provides a simple way to do a server-side copy of all the items in one collection to another collection in the same database using the Java driver (I used Java driver version 2.12.0):
// set up the pipeline
List<DBObject> ops = new ArrayList<DBObject>();
ops.add(new BasicDBObject("$out", "target")); // writes to collection "target"
// run it
MongoClient client = new MongoClient("host");
DBCollection source = client.getDB("db").getCollection("source");
source.aggregate(ops);
The one-liner version:
source.aggregate(Arrays.asList((DBObject)new BasicDBObject("$out", "target")));
According to the docs, for large datasets (>100MB) you may want to use the allowDiskUse option (Aggregation Memory Restrictions), although I didn't run into that limit when I ran it on a >2GB collection, so it may not apply to this particular pipeline, at least in 2.6.0.
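If you do hit that memory limit, the 2.12 driver also accepts an AggregationOptions argument. A minimal sketch, assuming the same source and target collections as above:
// Sketch only: the same $out pipeline, with allowDiskUse enabled for pipelines
// that exceed the 100MB aggregation memory limit.
List<DBObject> pipeline = Arrays.asList((DBObject) new BasicDBObject("$out", "target"));
AggregationOptions options = AggregationOptions.builder()
        .allowDiskUse(true)
        .build();
source.aggregate(pipeline, options); // runs server-side; the returned cursor is empty for $out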

I followed the advice of inserting an array of objects: Better way to move MongoDB Collection to another Collection
This reduced my time from 45 minutes to 2 minutes. Here's the Java code.
final int OBJECT_BUFFER_SIZE = 2000;
int rowNumber = 0;
List<DBObject> objects;
final int totalRows = cursor.size();
logger.debug("Mongo query result size: " + totalRows);
// Loop design based on this:
// https://stackoverflow.com/questions/18525348/better-way-to-move-mongodb-collection-to-another-collection/20889762#20889762
// Use multiple threads to improve
do {
    logger.debug(String.format("Mongo buffer starts row %d - %d copy into %s", rowNumber,
            (rowNumber + OBJECT_BUFFER_SIZE) - 1, dB2.getStringValue()));
    cursor = db.getCollection(collectionName.getStringValue()).find(qo)
            .sort(new BasicDBObject("$natural", 1)).skip(rowNumber).limit(OBJECT_BUFFER_SIZE);
    objects = cursor.toArray();
    try {
        if (objects.size() > 0) {
            db2.getCollection(collectionName.getStringValue()).insert(objects);
        }
    } catch (final BSONException e) {
        logger.warn(String.format(
                "Mongodb copy %s %s: mongodb error. A row between %d - %d will be skipped.",
                dB1.getStringValue(), collectionName.getStringValue(), rowNumber,
                rowNumber + OBJECT_BUFFER_SIZE));
        logger.error(e);
    }
    rowNumber = rowNumber + objects.size();
} while (rowNumber < totalRows);
The buffer size appears to be important. A size of 10,000 worked fine; however, for a variety of other reasons I selected a smaller size.

You could use Google Guava to do this. To build a Set from an iterator, you can use Sets#newHashSet(Iterator).
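For example, a sketch using the legacy driver's DBCursor (which implements Iterator<DBObject>) and the db from the question; note that this materializes the documents in memory rather than copying them server-side:
DBCursor cursor = db.getCollection("xxx").find();
// cast disambiguates Guava's newHashSet(Iterable) vs newHashSet(Iterator) overloads
Set<DBObject> documents = Sets.newHashSet((Iterator<DBObject>) cursor);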

My idea is to send the cloneCollection admin command from the Java Driver. Below is a partial example.
DB db = mongo.getDB("admin");
DBObject cmd = new BasicDBObject();
cmd.put("cloneCollection", "users.profiles"); // the collection to clone
// add code here to build the rest of the required fields as a JSON string
CommandResult result = db.command(cmd);
I remember leveraging the driver's JSON.parse(...) utility API to let the driver build the structure behind the scenes. Try that, as it is much simpler.
NOTE: I haven't tried this, but I'm confident it will work.
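For illustration, a sketch of that approach; the field names follow the cloneCollection admin command, and 'sourceHost:27017' is only a placeholder:
DB admin = mongo.getDB("admin");
// let the driver build the command document from a JSON string
DBObject cmd = (DBObject) JSON.parse(
        "{ cloneCollection: 'users.profiles', from: 'sourceHost:27017', query: {} }");
CommandResult result = admin.command(cmd); // check result.ok()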

I think using the $out aggregation operator described by kellogg.lee is the best method if the target collection is in the same database.
To copy to a collection in another database, running on a different mongod instance, the following methods can be used:
First Method:
List<Document> documentList = sourceCollection.find().into(new ArrayList<Document>());
targetCollection.insertMany(documentList);
However, this method might cause an OutOfMemoryError if the source collection is huge.
Second Method:
sourceCollection.find().batchSize(1000).forEach((Block<? super Document>) document -> targetCollection.insertOne(document));
This method is safer than the first one since it does not keep a local list of all the documents, and the batch size can be chosen according to memory requirements. However, it might be slower than the first method.
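A possible middle ground, sketched here rather than taken from the answers above: buffer documents into chunks and write each chunk with insertMany, which bounds memory while avoiding one round trip per document (assumes sourceCollection and targetCollection are MongoCollection<Document> as in the snippets above):
List<Document> buffer = new ArrayList<Document>(1000);
for (Document document : sourceCollection.find().batchSize(1000)) {
    buffer.add(document);
    if (buffer.size() == 1000) {
        targetCollection.insertMany(buffer); // one round trip per 1000 documents
        buffer.clear();
    }
}
if (!buffer.isEmpty()) {
    targetCollection.insertMany(buffer); // flush the remainder
}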

Related

How to do pagination with DynamoDBMapper?

I'm developing an application in Quarkus that integrates with DynamoDB. I have a query method that returns a list, and I'd like this list to be paginated, but it would have to be done manually by passing the parameters.
I chose to use DynamoDBMapper because it gives more possibilities to work with lists of objects and the level of complexity is lower.
Does anyone have any idea how to do this pagination manually in the function?
DynamoDBScanExpression scanExpression = new DynamoDBScanExpression()
        .withLimit(pageSize)
        .withExclusiveStartKey(paginationToken); // Map<String, AttributeValue> from the previous page, or null for the first page
ScanResultPage<YourModel> result = mapper.scanPage(YourModel.class, scanExpression);
Map<String, AttributeValue> nextPaginationToken = result.getLastEvaluatedKey();
You can pass the pageSize and paginationToken as parameters to your query method. The nextPaginationToken can be returned along with the results, to be used for the next page.
DynamoDBMapper paginates by iterating over the results, lazily loading the dataset:
By default, the scan method returns a "lazy-loaded" collection. It initially returns only one page of results, and then makes a service call for the next page if needed. To obtain all the matching items, iterate over the result collection.
Ref
For example:
List<Customer> result = mapper.scan(Customer.class, scanExpression);
for (Customer cust : result) {
    System.out.println(cust.getId());
}
To scan manually, page by page, you can use scanPage:
final DynamoDBScanExpression scanPageExpression = new DynamoDBScanExpression()
        .withLimit(limit);
do {
    ScanResultPage<MyClass> scanPage = mapper.scanPage(MyClass.class, scanPageExpression);
    scanPage.getResults().forEach(System.out::println);
    System.out.println("LastEvaluatedKey=" + scanPage.getLastEvaluatedKey());
    scanPageExpression.setExclusiveStartKey(scanPage.getLastEvaluatedKey());
} while (scanPageExpression.getExclusiveStartKey() != null);
Ref
Ref

How to process Iterables.partition(...) results in parallel for use with BatchGetItem API?

I am trying to call BatchGetItem to retrieve items from DynamoDB. As input we can get a list of up to 1000 keys (or as little as 1 key). These keys coincide with the hashKey for our DynamoDB table.
Since the BatchGetItem API only takes in up to 100 items per call, I am trying to split up the request into batches of only 100 items each, make the calls in parallel, and then merge the results into a single Set again.
For those unfamiliar with DynamoDB who could still give advice on an extremely stripped-down version (the 1st example), I'd appreciate it! Otherwise, please see the second, more accurate example below.
1st Example - extremely stripped down
public Set<SomeResultType> retrieveSomething(Set<String> someSet) {
    ImmutableSet.Builder<SomeResultType> resultBuilder = ImmutableSet.builder();
    // FIXME - how to parallelize?
    for (List<String> batch : Iterables.partition(someSet, 100)) {
        result = callSomeLongRunningAPI(batch); // result's type is elided in this stripped-down example
        resultBuilder.addAll(result.getItems());
    }
    return resultBuilder.build();
}
2nd Example - closer to my actual problem -
Below is a stripped-down, dummy version of what I'm currently doing (as such, please forgive formatting / style issues). It currently works and gets all the items, but I can't figure out how to get the batches (see FIXME) to execute in parallel and end up in a single set. Since performance is pretty important in the system I'm trying to build, any tips for making this code more efficient would be appreciated!
public Set<SomeResultType> retrieveSomething(Set<String> someIds) {
    if (someIds.isEmpty()) {
        // handle this here
    }
    Collection<Map<String, AttributeValue>> keyAttributes = someIds.stream()
            .map(id -> ImmutableMap.<String, AttributeValue>builder()
                    .put(tableName, new AttributeValue().withS(id)).build())
            .collect(ImmutableList.toImmutableList());
    ImmutableSet.Builder<SomeResultType> resultBuilder = ImmutableSet.builder();
    Map<String, KeysAndAttributes> itemsToProcess;
    BatchGetItemResult result;
    // FIXME - make parallel?
    for (List<Map<String, AttributeValue>> batch : Iterables.partition(keyAttributes, 100)) {
        KeysAndAttributes keysAndAttributes = new KeysAndAttributes()
                .withKeys(batch)
                .withAttributesToGet(... /* some attribute names */);
        itemsToProcess = ImmutableMap.of(tableName, keysAndAttributes);
        result = this.dynamoDB.batchGetItem(itemsToProcess);
        resultBuilder.addAll(extractItemsFromResults(tableName, result));
    }
    return resultBuilder.build();
}
Help with either the super stripped down case or the 2nd example would be greatly appreciated! Thanks!
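Not from the original post, but one way to parallelize the stripped-down version is to submit each 100-key batch to an executor via CompletableFuture and merge the partial results. callSomeLongRunningAPI and SomeResultType are the poster's placeholders (getItems() is assumed to return a collection), and the pool size is illustrative; imports from java.util.concurrent and Guava are omitted, matching the style above.
public Set<SomeResultType> retrieveSomethingInParallel(Set<String> someSet) {
    ExecutorService pool = Executors.newFixedThreadPool(10); // illustrative pool size
    try {
        List<CompletableFuture<Collection<SomeResultType>>> futures = new ArrayList<>();
        for (List<String> batch : Iterables.partition(someSet, 100)) {
            // each batch becomes one call running on the pool
            futures.add(CompletableFuture.supplyAsync(
                    () -> callSomeLongRunningAPI(batch).getItems(), pool));
        }
        ImmutableSet.Builder<SomeResultType> resultBuilder = ImmutableSet.builder();
        for (CompletableFuture<Collection<SomeResultType>> future : futures) {
            resultBuilder.addAll(future.join()); // join blocks until that batch completes
        }
        return resultBuilder.build();
    } finally {
        pool.shutdown();
    }
}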

What is the better approach for solving Restrictions.in with large lists?

It has been established that when you use Hibernate's Restrictions.in(String property, List list), you have to limit the size of the list.
This is because the database server might not be able to handle long queries, short of adjusting the database server's configuration.
Here are the solutions I found:
SOLUTION 1: Split the list into smaller ones and then add the smaller lists separately into several Restrictions.in
public List<Something> findSomething(List<String> subCdList) {
    Criteria criteria = getSession().createCriteria(getEntityClass());
    // if size of list is greater than 1000, split it into smaller lists. See List<List<String>> cdList
    if (subCdList.size() > 1000) {
        List<List<String>> cdList = new ArrayList<List<String>>();
        List<String> tempList = new ArrayList<String>();
        Integer counter = 0;
        for (Integer i = 0; i < subCdList.size(); i++) {
            tempList.add(subCdList.get(i));
            counter++;
            if (counter == 1000) {
                counter = 0;
                cdList.add(tempList);
                tempList = new ArrayList<String>();
            }
        }
        if (tempList.size() > 0) {
            cdList.add(tempList);
        }
        Criterion criterion = null;
        // Iterate over the list of lists, adding a restriction for each smaller list
        for (List<String> cds : cdList) {
            if (criterion == null) {
                criterion = Restrictions.in("subCd", cds);
            } else {
                criterion = Restrictions.or(criterion, Restrictions.in("subCd", cds));
            }
        }
        criteria.add(criterion);
    } else {
        criteria.add(Restrictions.in("subCd", subCdList));
    }
    return criteria.list();
}
This is an okay solution since you will only have one select statement. However, I think it's a bad idea to have for loops in the DAO layer because we do not want the connection to be open for a long time.
SOLUTION 2: Use DetachedCriteria. Instead of passing the list, query it on the WHERE clause.
public List<Something> findSomething() {
    Criteria criteria = getSession().createCriteria(getEntityClass());
    DetachedCriteria detached = DetachedCriteria.forClass(DifferentClass.class);
    detached.setProjection(Projections.property("cd"));
    criteria.add(Property.forName("subCd").in(detached));
    return criteria.list();
}
The problem with this solution is the technical usage of DetachedCriteria. You usually use it when you want to query another class that is not connected to (or has no relationship with) your current class. In the example, Something.class has a property subCd that is a foreign key from DifferentClass. Also, this produces a subquery in the WHERE clause.
When you look at the code:
1. SOLUTION 2 is simpler and more concise.
2. But SOLUTION 1 offers a query with only one select.
Please help me decide which one is more efficient.
Thanks.
For Solution 1: instead of using for loops, you can try the approach below.
Use a utility method that builds the Criterion IN clause whenever the number of parameter values passed exceeds the limit:
class HibernateBuildCriteria {

    private static final int PARAMETER_LIMIT = 800;

    public static Criterion buildInCriterion(String propertyName, List<?> values) {
        Criterion criterion = null;
        int listSize = values.size();
        for (int i = 0; i < listSize; i += PARAMETER_LIMIT) {
            List<?> subList;
            if (listSize > i + PARAMETER_LIMIT) {
                subList = values.subList(i, (i + PARAMETER_LIMIT));
            } else {
                subList = values.subList(i, listSize);
            }
            if (criterion != null) {
                criterion = Restrictions.or(criterion, Restrictions.in(propertyName, subList));
            } else {
                criterion = Restrictions.in(propertyName, subList);
            }
        }
        return criterion;
    }
}
Using the method:
criteria.add(HibernateBuildCriteria.buildInCriterion(propertyName, list));
Hope this helps.
Solution 1 has one major drawback: you may end up with a lot of different prepared statements that would need to be parsed and for which an execution plan would need to be calculated and cached. This process may be much more expensive than the actual execution of a query whose statement has already been cached by the database. Please see this question for more details.
The way I solve this is to use the algorithm Hibernate uses for batch fetching of lazily loaded associated entities. Basically, I use ArrayHelper.getBatchSizes to get the sublists of ids and then execute a separate query for each sublist.
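As a rough sketch of that idea, simplified to fixed-size id batches rather than Hibernate's internal ArrayHelper.getBatchSizes (the names and batch size here are illustrative, not the answerer's exact code):
private static final int BATCH_SIZE = 500; // illustrative; keep well below the database's IN-list limit

@SuppressWarnings("unchecked")
public List<Something> findBySubCds(Session session, List<String> subCds) {
    List<Something> results = new ArrayList<Something>();
    for (int i = 0; i < subCds.size(); i += BATCH_SIZE) {
        List<String> batch = subCds.subList(i, Math.min(i + BATCH_SIZE, subCds.size()));
        // one separate, smaller query per id batch
        results.addAll(session.createCriteria(Something.class)
                .add(Restrictions.in("subCd", batch))
                .list());
    }
    return results;
}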
Solution 2 is appropriate only if you can project the ids in a subquery. But if you can't, then you can't use it. For example, suppose the user of your app edited 20 entities on a screen and is now saving the changes. You have to read the entities by their ids to merge the changes, and you cannot express that in a subquery.
However, an alternative approach to Solution 2 could be to use temporary tables. Hibernate, for example, sometimes does this for bulk operations. You can store your ids in a temporary table and then use them in the subquery. I personally consider this an unnecessary complication compared to Solution 1 (for this use case, of course; Hibernate's reasoning is good for their use case), but it is a valid alternative.

Returning certain amount of documents from elasticsearch query in java

I am trying to limit the number of documents returned by my query. I want, let's say, only 10 documents back, and my query normally returns 22. How would I go about setting a limit on the returned output? I am aware I can limit the list size by creating a list and adding to it, but I want to do it at the query level.
My query (thanks in advance):
QueryBuilder raceGenderQuery = QueryBuilders.boolQuery()
        .must(termQuery("lep_etg_desc", "indian"))
        .must(termQuery("lep_gen_desc", "male"));
Set<String> suburbanLocationSet = new HashSet<String>();
suburbanLocationSet.add("queensburgh");
suburbanLocationSet.add("umhlanga");
suburbanLocationSet.add("tongaat");
suburbanLocationSet.add("phoenix");
suburbanLocationSet.add("shallcross");
suburbanLocationSet.add("balito");
//Build the necessary location query.
QueryBuilder locationQuery = QueryBuilders.boolQuery().must(termsQuery("lep_suburb_home", suburbanLocationSet));
//Combine all queries so the results are filtered to exact matches.
FilteredQueryBuilder finalSearchQuery = QueryBuilders.filteredQuery(
        QueryBuilders.boolQuery().must(raceGenderQuery).must(locationQuery),
        FilterBuilders.boolFilter()
                .must(FilterBuilders.rangeFilter("lep_age").gte(25).lte(45))
                .must(FilterBuilders.rangeFilter("lep_max_income").gte(25000).lte(45000)));
//Run Query through elasticsearch iterating through documents in the traceps index for query matches.
List<Leads> finalLeadsList = new ArrayList<Leads>();
for (Leads leads : this.leadsRepository.search(finalSearchQuery)) {
    finalLeadsList.add(leads);
}
I think this is what you want:
SearchResponse response = client.prepareSearch()
        .setSearchType(SearchType.QUERY_THEN_FETCH)
        .setSize(10)
        .setQuery(finalSearchQuery)
        .execute()
        .actionGet();
You have to use QUERY_THEN_FETCH for it to return exactly size results because otherwise it gets size results from each shard.

Map Reduce with mongoDB and Morphia

I am running map-reduce in MongoDB with Morphia; this is my Java code:
String map = "function() { emit(this.id, this.cal.charge); }";
String reduce = "function(k, v) { var i, sum = 0; for (i in v) { sum += v[i]; } return sum; }";
MapreduceResults<Results> mrRes = ds.mapReduce(MapreduceType.MERGE,
        ds.createQuery(MyTable.class).field("id").equal(5), map, reduce, null, null, Re.class);
This works fine and puts the results into the 'Re' collection, but how can I get the results as objects or a list without inserting into a new collection?
Thanks
I couldn't post this as a comment because it exceeds the length limit.
If it is not going to be too much of a fuss, you can do this by using the Java driver directly, without going through the Morphia interface. Just get the Mongo object from the Morphia datastore and use the Java driver's map-reduce command; it is something like this:
DBObject queryObject = new BasicDBObject("id", 5);
DBCollection collection = ds.getCollection(MyTable.class);
MapReduceCommand mrc = new MapReduceCommand(collection, // collection to run map-reduce on
        map,                                 // map function
        reduce,                              // reduce function
        null,                                // output collection (none for inline results)
        MapReduceCommand.OutputType.INLINE,  // output result type
        queryObject);                        // query to use in the map-reduce
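To actually run the command and read the inline results, something like this should work (a sketch; the Java driver's MapReduceOutput.results() iterates the returned documents):
MapReduceOutput out = collection.mapReduce(mrc);
for (DBObject obj : out.results()) {
    // each document carries the emitted key in "_id" and the reduced value in "value"
    System.out.println(obj);
}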
By the way, Morphia has a newer version on GitHub (https://github.com/jmkgreen/morphia); you might want to check that out too. I saw that the newer version also doesn't support the inline output option for map-reduce.
