Related
I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only few miliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How to make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date:true});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date:true});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need 40000th page? Also see this article;
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0].id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). if you run explain("executionStats"), it still has a large number for totalDocsExamined but because of the projection on an index, it's extremely fast (essentially, the data blobs are never examined). Then with the value for the start of the page in hand, you can fetch the next page very quickly.
i connected two answer.
the problem is when you using skip and limit, without sort, it just pagination by order of table in the same sequence as you write data to table so engine needs make first temporary index. is better using ready _id index :) You need use sort by _id. Than is very quickly with large tables like.
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case really) with sort range based buckets and base the pages not on a fixed number of documents, but a range of time (or whatever your sort is). So you have top-level pages that are each range of time and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index this avoids the cursor traversing the entire inventory to reach the final page.
My collection has around 1.3M documents (not that big), properly indexed, but still takes a big performance hit by the issue.
After reading other answers, the solution forward is clear; the paginated collection must be sorted by a counting integer similar to the auto-incremental value of SQL instead of the time-based value.
The problem is with skip; there is no other way around it; if you use skip, you are bound to hit with the issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with time-based value because you can't calculate where to jump based on time, so skipping is the only option in the latter case.
On the other hand,
by assigning a counting number for each document, the write performance would take a hit; because all documents must be inserted sequentially. This is fine with my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with delete, but still possible because MongoDB support $inc with a minus value for batch updating. Luckily I don't have to deal with the deletion in the app I am maintaining.
Just write this down as a note to my future self. It is probably too much hassle to fix this issue with the current application I am dealing with, but next time, I'll build a better one if I were to encounter a similar situation.
If you have mongos default id that is ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated from the official mongo docs:
The skip() method requires the server to scan from the beginning of
the input results set before beginning to return results. As the
offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents,
typically yielding better performance as the offset grows compared to
using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
let endValue = null;
db.students.find( { _id: { $lt: startValue } } )
.sort( { _id: -1 } )
.limit( nPerPage )
.forEach( student => {
print( student.name );
endValue = student._id;
} );
return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to limit.
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a lil genious solution which works like charm
For faster pagination don't use the skip() function. Use limit() and find() where you query over the last id of the precedent page.
Here is an example where I'm querying over tons of documents using spring boot:
Long totalElements = mongockTemplate.count(new Query(),"product");
int page =0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; //this is the last id of the precedent page
for(int i=0; i<(totalElements/pageSize); i++) {
page +=1;
Aggregation aggregation = Aggregation.newAggregation(
Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
Aggregation.sort(Sort.Direction.ASC,"_id"),
new CustomAggregationOperation(queryOffersByProduct),
Aggregation.limit((long)pageSize)
);
List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation,"product",ProductGroupedOfferDTO.class).getMappedResults();
lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size()-1).getId();
}
We have a system that processes flat-file and (with a couple of validations only) inserts into database.
This code:
//There can be 8 million lines-of-codes
for(String line: lines){
if (!Class.isBranchNoValid(validBranchNoArr, obj.branchNo)){
continue;
}
list.add(line);
}
definition of isBranchNoValid:
//the array length ranges from 2 to 5 only
public static boolean isBranchNoValid(String[] validBranchNoArr, String branchNo) {
for (int i = 0; i < validBranchNoArr.length; i++) {
if (validBranchNoArr[i].equals(branchNo)) {
return true;
}
}
return false;
}
The validation is at line-level (we have to filter or skip the line that doesn't have a branchNo in the array). Earlier, this wasn't (filter) the case.
Now, high-performance degradation is troubling us.
I understand (may be, I am wrong) that this repeated function call is causing a lot of stack creation resulting in a very high GC invocation.
I can't figure out a way (is it even possible) to perform this filter without this high cost of performance degradation (a little difference is fine).
This is not a stack problem for sure, because your function is not recursive nothing is kept in the stack between calls; after each call the variables are erased since they are not needed anymore.
You can put the valid numbers in a set and use that one for some optimization but in your case I am not sure it will bring any benefits at all since you have at most 5 elements.
So there are several possible bottlenecks in your scenario.
reading the lines of the file
Parse the line to construct the object to insert into the database
check the applicability of the object (ie branch no filter)
insert into the db
Generally, you'd say IO is the slowest, so 1. and 2. You're saying nothing except 2. changed, right? That is weird.
Anyway, if you want to optimize that, I wouldn't be passing the array around 8 million times, and I wouldn't iterate it every time either. Since your valid branches are known, create a HashSet from it - it has O(1) access.
Set<String> validBranches = Arrays.stream(branches)
.collect(Collectors.toCollection(HashSet::new));
Then, iterate the lines
for (String line : lines) {
YourObject obj = parse(line);
if (validBranches.contains(obj.branchNo)) {
writeToDb(obj);
}
}
or, in the stream version
Files.lines(yourPath)
.map(this::parse)
.filter(o -> validBranches.contains(o.branchNo))
.forEach(this::writeToDb);
I'd also check if it isn't more efficient to first collect a batch of objects, then write to db. Also, it's possible that handling the lines in parallel gains some speed, in case the parsing is time intensive.
I am working on some inherited code and I am not use to the entity frame work. I'm trying to figure out why a previous programmer coded things the way they did, sometimes mixing and matching different ways of querying data.
Deal d = _em.find(Deal.class, dealid);
List<DealOptions> dos = d.getDealOptions();
for(DealOptions o : dos) {
if(o.price == "100") {
//found the 1 item i wanted
}
}
And then sometimes i see this:
Query q = _em.createQuery("select count(o.id) from DealOptions o where o.price = 100 and o.deal.dealid = :dealid");
//set parameters, get results then check result and do whatver
I understand what both pieces of code do, and I understand that given a large dataset, the second way is more efficient. However, given only a few records, is there any reason not to do a query vs just letting the entity do the join and looping over your recordset?
Some reasons never to use the first approach regardless of the number of records:
It is more verbose
The intention is less clear, since there is more clutter
The performance is worse, probably starting to degrade with the very first entities
The performance of the first approach will degrade much more with each added entity than with the second approach
It is unexpected - most experienced developers would not do it - so it needs more cognitive effort for other developers to understand. They would assume you were doing it for a compelling reason and would look for that reason without finding one.
Java: I have a problem using System.currentTimeMillis() function
i am using System.currentTimeMillis() to generate unique values in foor loop problem is loop executes too fast and System.currentTimeMillis() gives me duplicate values.
How can i generate for sure unique values.
for(int a=0;a<=10;a++){
System.out.println(System.currentTimeMillis())
}
I also tried following but it is also not generaet to generate unique number
System.currentTimeMillis()+Math.random()
why don't you use System.nanoTime() instead?
Why don't you use a UUID library to generate unique identifiers (already there in the JDK http://download.oracle.com/javase/6/docs/api/java/util/UUID.html).
Or for a more simple approach: append a static counter
I think your approach is wrong, if this is a requirement.
Theoretically, no matter how fine-grained your timer, a machine might execute it in less time than the timer's granularity. It's not correct in a technical sense to depend on this being true.
Or looking at it another way - why do you need these values to be unique (what are you using them for)? If you really want them to be a measure of the time it was executed, then you ought to be happy that two iterations that happened within the same millisecond got the same value.
Have you considered using a static, monotonous counter to assign IDs to each iteration that are unique within each execution (AtomicLong is great for this)? Something like the following is very easy and has no concurrency issues:
public class YourClass {
private static final AtomicLong COUNTER = new AtomicLong();
private static nextId() { return COUNTER.getAndIncrement(); }
// Rest of the class, which calls nextId() when it needs an identifier
}
If you need the timing info and uniqueness, then that's two separate requirements, so why not have a composite key made up of the time and an arbitrary unique ID?
The answer is obvious - get a slower computer! Well, that or use System.nanoTime as described right here on SO - System.currentTimeMillis vs System.nanoTime. But seriously, you shouldn't be using time as unique number generator unless you absolutely have to.
The problem with using the system time of course being that:
The time returned by your system
calls is rounded up to a higher
degree of precision than the actual
CPU clock time. If your ID generation
code runs faster than this degree of
precision then you will have
collision.
If your code is distributed and each
unit of work is generating ID's then
you run into the possibility of ID
collision as the separate CPU's or
CPU core's allocate ID's using their
independent clocks.
In libraries like Java that are
actually returning the system time
based off a user settable property
you run into a higher chance of
multiple ID collision anytime the
date is reset to some period in the
past, for whatever reason.
A very good alternative to generating unique identifiers is to utilize the not-so-ironically named Universally Unique Identifier. There is a multiple implementations in various languages, for Java 5 and higher you can use the UUID class.
Edit: To add some useful information about UUID.
Similar to #Andrej's solution, but combining a timer and a counter so your numbers shouldn't repeat if you restart your application.
public enum IdGenerator {
;
private static final AtomicLong COUNTER = new AtomicLong(System.currentTimeMillis()*1000);
public static long nextId() { return COUNTER.getAndIncrement(); }
}
If you want to still use your method, you could do:
for(int a=0;a<=10;a++){
Thread.sleep(1);
System.out.println(System.currentTimeMillis())
}
Explicitly making your CPU slower.
try Math.random()*System.currentTimeMillis()
here is a sample outcome
4.1140390961236145E11,
4.405289623285403E11,
6.743938910583776E11,
2.0358542930175632E11,
1.2561886548511025E12,
8.629388909268735E11,
1.158038719369676E12,
2.5899667030405692E11,
7.815373208372445E11,
1.0887553507952611E12,
3.947241572203385E11,
1.6723200316764807E11,
1.3071550541162832E12,
2.079941126415029E11,
1.304485187296599E12,
3.5889095083604164E10,
1.3230275106525027E11,
6.484641777434403E11,
5.109822261418748E11,
1.2291750972884333E12,
8.972865957307518E11,
4.022754883048088E11,
7.997154244301389E11,
1.139245696210086E12,
2.633248409945871E11,
8.699957189419155E11,
9.487098785390422E11,
1.1645067228773708E12,
1.5274939161218903E11,
4.8470112347655725E11,
8.749120668472205E11,
2.435762445513599E11,
5.62884487469596E11,
1.1412787212758718E12,
1.0724213377031631E12,
3.1388106597100226E11,
1.1405727247661633E12,
1.2464739913912961E12,
3.2771161059896655E11,
1.2102869787179648E12,
1.168806596179512E12,
5.871383012375131E11,
1.2765757372075571E12,
5.868323434343102E11,
9.887351363037219E11,
5.392282944314777E11,
1.1926033895638833E12,
6.867917070018711E11,
1.1682059242674294E12,
2.4442056772643954E11,
1.1250254537683052E12,
8.875186600355891E10,
3.46331811747409E11,
1.127077925657995E12,
7.056541627184794E11,
1.308631075052609E12,
7.7875319089675E11,
5.52717019956371E11,
7.727797813063546E11,
6.177219592063667E11,
2.9448141585070874E11,
9.617992263836586E11,
6.762500987418107E11,
1.1954995292124463E12,
1.0741763597148225E12,
1.9915919731861673E11,
9.507720563185525E11,
1.1009594810160002E12,
4.1381256571745465E11,
2.2526550777831213E11,
2.5919816802026202E11,
3.8453225321522577E11,
3.796715779825083E11,
6.512277843921505E10,
1.0483456960599313E12,
1.0725956186588704E11,
5.701504883615902E11,
9.085583903150035E11,
1.2764816439306753E12,
1.033783414053437E12,
1.188379914238302E12,
6.42733442524156E11,
3.911345432964901E11,
7.936334657654698E11,
1.4473479058272617E11,
1.2030471387183499E12,
5.900668555531211E11,
8.078992189613184E11,
1.2004364275316113E12,
1.250275098717202E12,
2.856556784847933E11,
1.9118298791320355E11,
5.4291847597892596E11,
3.9527733898520874E11,
6.384539941791654E11,
1.2812873515441786E11,
6.325269269733575E9,
5.403119000792323E11,
8.023708335126083E11,
3.761680594623883E10,
1.2641772837928888E11,
Check out UUID as well...
My suggestion
long id = System.currentTimeMillis();
for (int i = 0; i < 10; i++) {
//do your work
id++;
}
I am writing code in Java where I branch off based on whether a string starts with certain characters while looping through a dataset and my dataset is expected to be large.
I was wondering whether startsWith is faster than indexOf. I did experiment with 2000 records but not found any difference.
startsWith only needs to check for the presence at the very start of the string - it's doing less work, so it should be faster.
My guess is that your 2000 records finished in a few milliseconds (if that). Whenever you want to benchmark one approach against another, try to do it for enough time that differences in timing will be significant. I find that 10-30 seconds is long enough to show significant improvements, but short enough to make it bearable to run the tests multiple times. (If this were a serious investigation I'd probably try for longer times. Most of my benchmarking is for fun.)
Also make sure you've got varied data - indexOf and startsWith should have roughly the same running time in the case where indexOf returns 0. So if all your records match the pattern, you're not really testing correctly. (I don't know whether that was the case in your tests of course - it's just something to watch out for.)
In general, the golden rule of micro-optimization applies here:
"Measure, don't guess".
As with all optimizations of this type, the difference between the two calls almost certainly won't matter unless you are checking millions of strings that are each tens of thousands of characters long.
Run a profiler over your code, and only optimize this call when you can measure that it's slowing you down. Till then, go with the more readable options (startsWith, in this case). Once you know that this block is slowing you down, try both and use whichever is faster. Rinse. Repeat ;-)
Academically, my guess is that startsWith will likely be implemented using indexOf. Check the source code and see if you're interested. (Turns out that startsWith does not call indexOf)
Even without looking into the sources, it should be clear that startsWith() is faster at least for large strings and short pattern.
The running time of a.startsWith(b) is bound be the length of b. After at most the first b characters are checked, the search finished.
The running time of a.indexOf(b) is larger (depending on the actual algorithm). Every algorithm has at least a running time depending on the length of a. Roughly, you can say, that you have to look at each character once to check if the pattern starts at that position.
However, as always, it depends on the actual use case if you really see a difference in practice. Measuring the difference in real life is never bad.
Probably, if it doesn't match it can stop looking whereas indexOf needs to look for occurrences later in the string.
startsWith is clearer than indexOf == 0.
Have you identified the test as a performance bottleneck for which you need to sacrifice readability?
public class Test
{
public static void main(String args[]) {
long value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".indexOf("a");
}
long value2 = System.currentTimeMillis();
System.out.println(value2-value1);
value1 = System.currentTimeMillis();
for(long i=0;i<100000000;i++)
{
"abcd".startsWith("a");
}
value2 = System.currentTimeMillis();
System.out.println(value2-value1);
}
}
Tested it with this piece of code and perf for startsWith seems to be better, for obvious reason that it doesn't have to traverse through string. But in best case scenario both should perform close while in a worst case scenario startsWith will always perform better than indexOf
You mentioned the dataset is expected to be large. So i will bet a lot of performanve will go into access this dataset and handle it in memory. That means use one or the other will not change the perfomance significant. But if this is important to you you may write your own startWith method that could be significant faster than standard library methods or at least you know exactly what is done.
Unfortunate, statsWith is not working as supposed to! it uses "indexOf" behind the sene (lazy developpers :D) so indexOf is 10x faster than implemented statsWith