MongoDB (Java): efficient update of multiple documents to different(!) values

I have a MongoDB database and the program I'm writing is meant to change the values of a single field for all documents in a collection. Now if I want them all to change to a single value, like the string value "mask", then I know that updateMany does the trick and it's quite efficient.
However, what I want is an efficient solution for updating to different new values, in fact I want to pick the new value for the field in question for each document from a list, e.g. an ArrayList. But then something like this
collection.updateMany(new BasicDBObject(),
        new BasicDBObject("$set", new BasicDBObject(fieldName,
                listOfMasks.get(random.nextInt(size)))));
wouldn't work, since updateMany doesn't recompute the value the field should be set to for each document; the argument
listOfMasks.get(random.nextInt(size))
is evaluated once, and that single value is then used for all the documents. So I don't think there's a solution to this problem that can actually employ updateMany, since it's simply not versatile enough.
But I was wondering if anyone has any ideas for at least making it faster than simply iterating through all the documents and each time do updateOne where it updates to a new value from the ArrayList (in a random order but that's just a detail), like below?
// Loop until the MongoCursor is empty (until the search is complete)
try {
    while (cursor.hasNext()) {
        // Pick a random mask
        String mask = listOfMasks.get(random.nextInt(size));
        // Update this document
        collection.updateOne(cursor.next(), Updates.set("test_field", mask));
    }
} finally {
    cursor.close();
}

MongoDB provides the bulk write API to batch updates. This would be appropriate for your example of setting the value of a field to a random value (determined on the client) for each document.
Alternatively, if there is a pattern to the changes needed, you could potentially use a find-and-modify operation with the available update operators.
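For example, a rough sketch of the bulk write approach with the MongoDB Java driver, reusing collection, listOfMasks, random and size from the question (the batch size of 1000 is an arbitrary choice, not something the driver requires):

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOneModel;
import com.mongodb.client.model.Updates;
import com.mongodb.client.model.WriteModel;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class BulkMaskUpdate {

    // collection, listOfMasks, random and size are the same objects as in the question
    static void applyRandomMasks(MongoCollection<Document> collection,
                                 List<String> listOfMasks, Random random, int size) {
        List<WriteModel<Document>> ops = new ArrayList<>();
        // Only the _id is needed to address each document
        for (Document doc : collection.find().projection(new Document("_id", 1))) {
            String mask = listOfMasks.get(random.nextInt(size));
            ops.add(new UpdateOneModel<>(
                    Filters.eq("_id", doc.get("_id")),
                    Updates.set("test_field", mask)));
            if (ops.size() == 1000) {        // arbitrary batch size to bound memory use
                collection.bulkWrite(ops);   // one round trip per batch instead of per document
                ops.clear();
            }
        }
        if (!ops.isEmpty()) {
            collection.bulkWrite(ops);
        }
    }
}

Each bulkWrite call sends a whole batch of UpdateOneModel operations in a single round trip, which is usually far faster than issuing one updateOne per document.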

Related

Iterate over large collection in mongo [duplicate]

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make this perform better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;

// Covered skip over the _id index only (the projection keeps it index-only), to find the page's starting _id
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;

// Then fetch the actual page starting from that _id
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index, it's extremely fast (essentially, the data blobs are never examined). Then with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined the two answers.
The problem is that when you use skip and limit without a sort, the documents are paginated in the collection's natural order (the same sequence in which they were written), so the engine first needs to build a temporary index. It is better to use the ready-made _id index: sort by _id, and it is then very quick even with large tables, like this:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => array('_id' => 1),
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case, really) with sorted, range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index this avoids the cursor traversing the entire collection to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
I'm just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I am currently dealing with, but next time I'll build a better one if I encounter a similar situation.
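A minimal sketch of the counting-integer approach with the MongoDB Java driver; the seq field name is hypothetical, and it assumes the counter starts at 0 and has no gaps (exactly what deletes would break):

import static com.mongodb.client.model.Filters.gte;
import static com.mongodb.client.model.Sorts.ascending;

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class SeqPagination {

    // Jump straight to an arbitrary page via the indexed counter instead of skip()
    static FindIterable<Document> page(MongoCollection<Document> collection,
                                       long pageNumber, int pageSize) {
        long firstSeqOnPage = pageNumber * pageSize; // seq assumed to start at 0 with no gaps
        return collection.find(gte("seq", firstSeqOnPage))
                .sort(ascending("seq"))
                .limit(pageSize);
    }
}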
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated in the official MongoDB docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a genius little solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query on the last id of the previous page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
            Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
            Aggregation.sort(Sort.Direction.ASC, "_id"),
            new CustomAggregationOperation(queryOffersByProduct),
            Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
            mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

Parsing and looking up a string with a variable number of fields in Java

I have to read a file and store the values and then later do a lookup.
For e.g., the file will look as follows:
Gryffindor = 5
Gryffindor.Name.Harry = 10
Gryffindor.Name.Harry.Cloak.Black = 15
and so on...
I need to store these (I was thinking of a map). Later, I need to process every character and look up this map to assign them points. Suppose I encounter Harry: I know that he's from Gryffindor and that he's wearing a blue cloak. I will have to look up this map (or whatever object I use) as
Gryffindor.Name.Harry.Cloak.Blue
which should return me nothing. I then need to fall back to just the name and look up
Gryffindor.Name.Harry
that should return me a 10.
Similarly, if I look up Ron (suppose he's wearing black),
Gryffindor.Name.Ron.Cloak.Black
should return nothing, fall back to
Gryffindor.Name.Ron
again nothing, fall back to
Gryffindor
which should return 5.
What would be an elegant way to store and read this data? I was thinking of using a map for storing the key-value pairs and then a switch case to read them back. How would you do it?
Java has a built-in Properties class that implements Map and can read and write the data format you describe (see that class's load() and store() methods).
There's nothing in there to implement your "fall back to a higher-level key" feature, so you'll need to write a method that looks in the Properties instance for data under the desired key, and keeps trying successively shorter versions of the same key if it finds nothing.
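A minimal sketch of that fallback lookup, assuming the file uses the key = value lines shown in the question (the class and method names here are just placeholders):

import java.io.FileReader;
import java.io.IOException;
import java.util.Properties;

public class PointsLookup {

    private final Properties points = new Properties();

    // Hypothetical file name; the format matches lines like "Gryffindor.Name.Harry = 10"
    public void load(String fileName) throws IOException {
        try (FileReader reader = new FileReader(fileName)) {
            points.load(reader);
        }
    }

    // Look up the full key, then keep dropping the last ".segment" until something matches
    public Integer lookup(String key) {
        String current = key;
        while (true) {
            String value = points.getProperty(current);
            if (value != null) {
                return Integer.valueOf(value.trim());
            }
            int lastDot = current.lastIndexOf('.');
            if (lastDot < 0) {
                return null; // nothing matched at any level
            }
            current = current.substring(0, lastDot);
        }
    }
}

With the sample data, lookup("Gryffindor.Name.Harry.Cloak.Blue") drops one segment at a time until it hits Gryffindor.Name.Harry and returns 10; a key with no match at any level returns null.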

Using JTOpen to read from a data area on an AS400, does the data area object get locked?

Given a DecimalDataArea from JTOpen, when reading and writing to the data area, does the object on the AS400 get locked, preventing simultaneous writes to it from other applications that are on the AS400?
This is the sample code from the javadoc on how to read/write, etc.
// Prepare to work with the system named "My400".
AS400 system = new AS400("My400");
// Create a DecimalDataArea object.
QSYSObjectPathName path = new QSYSObjectPathName("MYLIB", "MYDATA", "DTAARA");
DecimalDataArea dataArea = new DecimalDataArea(system, path.getPath());
// Create the decimal data area on the system using default values.
dataArea.create();
// Clear the data area.
dataArea.clear();
// Write to the data area.
dataArea.write(new BigDecimal("1.2"));
// Read from the data area.
BigDecimal data = dataArea.read();
// Delete the data area from the system.
dataArea.delete();
http://javadoc.midrange.com/jtopen/com/ibm/as400/access/DecimalDataArea.html
No ... the data area operations are atomic, so no locking occurs unless you do it yourself.
Internally, the implementation actually uses CHGDTAARA to update the data area.
Wouldn't be a bad enhancement though.
If you create the data area with an SQL CREATE SEQUENCE statement, then you can use NEXT VALUE via JDBC. You can use a NEXT VALUE expression in SQL statements such as SELECT, INSERT, UPDATE, etc. It will read the value, increment it, update the SEQUENCE, and return the new value to you, and it can be done under commitment control. The PREVIOUS VALUE expression will return the last value generated by a NEXT VALUE expression for that SEQUENCE during your current session.
Generally a numeric data area is used to manage generating a series of numbers. If that is the case here, then you'll be better off using a SEQUENCE.
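A rough sketch of reading the next value over JDBC with the JTOpen driver, assuming a sequence was created with something like CREATE SEQUENCE MYLIB.MYSEQ; the system name, credentials, and object names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class NextValueExample {
    public static void main(String[] args) throws Exception {
        // JTOpen's JDBC driver (jt400.jar) registers itself automatically with JDBC 4+
        try (Connection conn = DriverManager.getConnection("jdbc:as400://My400", "USER", "PASSWORD");
             Statement stmt = conn.createStatement()) {
            // Atomically fetch and advance the sequence on the server
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT NEXT VALUE FOR MYLIB.MYSEQ FROM SYSIBM.SYSDUMMY1")) {
                if (rs.next()) {
                    long nextNumber = rs.getLong(1);
                    System.out.println("Next value: " + nextNumber);
                }
            }
        }
    }
}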

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents, since all terms must then appear together in at least one field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming, this particular document would not be included in the results, since neither the title nor the body field contains all the terms in the query, although combined they do. I would want this document returned for the above query, but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per-field terms in their queries (author:bill), which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter, which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice, which is time and space consuming.
Any ideas?
Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for(int i = 0; i < parsers.length; i++)
{
parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking for. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to #seeta and #femtoRgon for the help!
Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.
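For what it's worth, building that combination by hand with the Lucene 3.x API might look roughly like this (the terms are assumed to already be lowercased the same way the analyzer would produce them):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CrossFieldAndQuery {

    // Builds (title:java AND body:programming) OR (title:programming AND body:java)
    static Query build() {
        BooleanQuery javaInTitle = new BooleanQuery();
        javaInTitle.add(new TermQuery(new Term("title", "java")), BooleanClause.Occur.MUST);
        javaInTitle.add(new TermQuery(new Term("body", "programming")), BooleanClause.Occur.MUST);

        BooleanQuery javaInBody = new BooleanQuery();
        javaInBody.add(new TermQuery(new Term("title", "programming")), BooleanClause.Occur.MUST);
        javaInBody.add(new TermQuery(new Term("body", "java")), BooleanClause.Occur.MUST);

        BooleanQuery combined = new BooleanQuery();
        combined.add(javaInTitle, BooleanClause.Occur.SHOULD);
        combined.add(javaInBody, BooleanClause.Occur.SHOULD);
        return combined;
    }
}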
If you want to be able to search multiple fields with the same set of terms, then the query from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
may not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
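As a rough sketch of that dual indexing with the Lucene 3.x Field API (the field names are illustrative, and the term-vector options on the specific fields are what FastVectorHighlighter typically needs):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class CatchallIndexing {

    // Index the specific fields (stored, with term vectors for highlighting)
    // and additionally feed the same text into an unstored "all" field for general queries
    static Document build(String title, String body) {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
        doc.add(new Field("all", title + " " + body, Field.Store.NO, Field.Index.ANALYZED));
        return doc;
    }
}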
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

SingleColumnValueFilter has no impact on result

Hi,
this question is pretty similar to SingleColumnValueFilter not returning proper number of rows .
I use four SingleColumnValueFilters with operator EQUAL and add them to a FilterList with operator MUST_PASS_ONE. The number of results is the same as without setting the FilterList. The value to compare is a byte[] that should be correct, as I just store the values from previous results (it is an IP address that I convert to an InetAddress, new InetAddress(value as byte[]), when retrieving the data, and for the query described I just call InetAddress.getAddress, which returns a byte[]).
Do you have any ideas what might be the problem? Am I using the Filter wrong?
EDIT:
I also used the original values retrieved by the query as value for SingleColumnValueFilter, and there was no difference in the results, thus the byte[] contents can't be the problem.
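For reference, the kind of setup being described would look roughly like this (the column family, qualifier, and variable names are hypothetical):

import java.util.List;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class IpScanBuilder {

    // candidateIps would be the byte[] values from InetAddress.getAddress()
    static Scan buildScan(List<byte[]> candidateIps) {
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ONE);
        for (byte[] ipBytes : candidateIps) {
            SingleColumnValueFilter f = new SingleColumnValueFilter(
                    Bytes.toBytes("cf"), Bytes.toBytes("ip"),
                    CompareFilter.CompareOp.EQUAL, ipBytes);
            // Worth checking: by default, rows that lack this column entirely still pass the filter
            f.setFilterIfMissing(true);
            filters.addFilter(f);
        }
        Scan scan = new Scan();
        scan.setFilter(filters);
        return scan;
    }
}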
I think I can give the answer myself; sorry for not debugging and checking all the HBase code before.
I just checked the implementation of the compare algorithm (which is lexicographic), and thus I realized that the length is not taken into account; I had thought shorter values would be padded with zeros, but unfortunately they are not.
The only reasonable option would be to create a custom comparator (e.g. see How do you use a custom comparator with SingleColumnValueFilter on HBase?).
