I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
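In case you are paging from Java rather than the shell, here is a minimal sketch of the same range-based pattern using the MongoDB Java sync driver (an assumption; the shell queries above are the canonical form). It presumes an index on created_date and a Date field of that name.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Date;

public class CreatedDatePagination {
    // Pass null for the first page; afterwards pass the created_date of the last document seen.
    public static FindIterable<Document> nextPage(MongoCollection<Document> collection,
                                                  Date maxCreatedDateFromLastResult,
                                                  int pageSize) {
        FindIterable<Document> page = (maxCreatedDateFromLastResult == null)
                ? collection.find()                                                    // first page
                : collection.find(Filters.gt("created_date", maxCreatedDateFromLastResult));
        return page.sort(Sorts.ascending("created_date")).limit(pageSize);             // no skip() needed
    }
}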
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts (a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined the two answers above.
The problem is that when you use skip and limit without a sort, the documents are paginated in natural order, i.e. the sequence in which they were written to the collection, so the engine has to build a temporary ordering first. It is better to use the ready-made _id index: sort by _id and the query is very fast even with large collections, e.g.
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => array('_id' => 1),
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (really as an edge case) with range-based buckets on your sort key, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and you have sub-pages within that range of time if you still need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire collection to reach the final page.
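A rough sketch of that bucket idea with the MongoDB Java sync driver (an assumption, as is the created_date field name): the top-level page is a time range, and skip/limit is only applied within that hopefully small bucket.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Date;

public class BucketPagination {
    // Returns one sub-page of the bucket [bucketStart, bucketEnd); the skip stays small
    // because it never exceeds the number of documents inside a single bucket.
    public static FindIterable<Document> subPage(MongoCollection<Document> collection,
                                                 Date bucketStart, Date bucketEnd,
                                                 int subPage, int pageSize) {
        return collection.find(Filters.and(
                                Filters.gte("created_date", bucketStart),
                                Filters.lt("created_date", bucketEnd)))
                         .sort(Sorts.ascending("created_date"))
                         .skip(subPage * pageSize)    // small skip within the bucket only
                         .limit(pageSize);
    }
}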
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
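A minimal sketch of the counting-integer jump, assuming the MongoDB Java sync driver and a field named seq holding the sequential number (both assumptions): page N starts at seq = N * pageSize, so an indexed range query replaces skip entirely.

import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

public class SeqPagination {
    // Requires an index on "seq", e.g. collection.createIndex(Indexes.ascending("seq")).
    public static FindIterable<Document> page(MongoCollection<Document> collection,
                                              long pageNumber, int pageSize) {
        long firstSeq = pageNumber * pageSize;          // the jump is computed, not skipped to
        return collection.find(Filters.gte("seq", firstSeq))
                         .sort(Sorts.ascending("seq"))
                         .limit(pageSize);
    }
}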
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward to an arbitrary page number, not just one page at a time.)
Plus, it is also hard if you are dealing with deletes, but it is still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
I'm just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the application I am currently working on, but next time I'll build a better one if I run into a similar situation.
If you have Mongo's default _id, which is an ObjectId, use that instead. This is probably the most viable option for most projects anyway.
As stated in the official Mongo docs:
The skip() method requires the server to scan from the beginning of
the input results set before beginning to return results. As the
offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents,
typically yielding better performance as the offset grows compared to
using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
  let endValue = null;
  db.students.find( { _id: { $lt: startValue } } )
             .sort( { _id: -1 } )
             .limit( nPerPage )
             .forEach( student => {
               print( student.name );
               endValue = student._id;
             } );
  return endValue;
}
Ascending order example here.
If you know the ID of the element after which you want to start:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a genius little solution which works like a charm.
For faster pagination don't use the skip() function. Use limit() and find(), where you query on the last id of the previous page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}
I have a query with a result set of half a million records; for each record I create an object and add it to an ArrayList.
How can I optimize this operation to avoid memory issues? I'm getting an out-of-heap-space error.
This is a fragment of the code:
while (rs.next()) {
lista.add(sd.loadSabanaDatos_ResumenLlamadaIntervalo(rs));
}
public SabanaDatos loadSabanaDatos_ResumenLlamadaIntervalo(ResultSet rs)
{
    SabanaDatos sabanaDatos = new SabanaDatos();
    try {
        sabanaDatos.setId(rs.getInt("id"));
        sabanaDatos.setHora(rs.getString("hora"));
        sabanaDatos.setDuracion(rs.getInt("duracion"));
        sabanaDatos.setNavegautenticado(rs.getInt("navegautenticado"));
        sabanaDatos.setIndicadorasesor(rs.getInt("indicadorasesor"));
        sabanaDatos.setLlamadaexitosa(rs.getInt("llamadaexitosa"));
        sabanaDatos.setLlamadanoexitosa(rs.getInt("llamadanoexitosa"));
        sabanaDatos.setTipocliente(rs.getString("tipocliente"));
    } catch (SQLException e) {
        logger.info("dip.sabana.SabanaDatos SQLException : " + e);
        e.printStackTrace();
    }
    return sabanaDatos;
}
NOTE: The reason for using a list is that this is a critical system, and I can only make one call to the DB every 2 hours. I don't have permission to make more calls to the DB in shorter intervals, but I need to show data every 10 minutes. Example: the first query returns 10 rows, and I show 1 row each minute after the SQL query.
I don't have permission to create a local database, write files, or anything else... just access to memory.
First of all, it is not good practice to read half a million objects into memory at once.
You can think of breaking the number of records to be read into smaller chunks.
As a solution you can consider the following options:
1 - Use CachedRowSetImpl - it behaves like a ResultSet, but it is disconnected, so you avoid keeping the ResultSet (and with it the database connection) open; see the sketch after these options. If you use an ArrayList you are again copying the data and using extra memory.
For more info on cachedRowSet you can go to
https://docs.oracle.com/javase/tutorial/jdbc/basics/cachedrowset.html
2 - You can think of using an in-memory database, such as HSQLDB or H2. They are very lightweight and fast, provide a JDBC interface, and you can run SQL queries against them as well.
For HSQLDB implementation you can check
https://www.tutorialspoint.com/hsqldb/
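Regarding option 1, here is a minimal sketch (assuming a plain JDBC Connection and a query string): the CachedRowSet copies the rows and is disconnected, so the ResultSet and its statement can be closed right away.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.rowset.CachedRowSet;
import javax.sql.rowset.RowSetProvider;

public class CachedRowSetExample {
    public static CachedRowSet loadDisconnected(Connection connection, String sql) throws SQLException {
        CachedRowSet cached = RowSetProvider.newFactory().createCachedRowSet();
        try (Statement statement = connection.createStatement();
             ResultSet rs = statement.executeQuery(sql)) {
            cached.populate(rs);   // copies all rows into the disconnected rowset
        }
        return cached;             // iterate later with cached.next(), cached.getInt("id"), etc.
    }
}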
It might help to intern the Strings, so that two occurrences of the same string share a single object.
import java.util.HashMap;
import java.util.Map;

public class StringCache {
    private Map<String, String> identityMap = new HashMap<>();

    public String cached(String s) {
        if (s == null) {
            return null;
        }
        String t = identityMap.get(s);
        if (t == null) {
            t = s;
            identityMap.put(t, t);
        }
        return t;
    }
}

StringCache horaMap = new StringCache();
StringCache tipoclienteMap = new StringCache();

sabanaDatos.setHora(horaMap.cached(rs.getString("hora")));
sabanaDatos.setTipocliente(tipoclienteMap.cached(rs.getString("tipocliente")));
Increasing the memory has already been mentioned.
A speed-up is possible by using column numbers instead of names; if needed, they can be looked up from the column names once before the loop (rs.getMetaData()).
Option1:
If you need all the items in the list at the same time you need to increase the heap space of the JVM, adding the argument -Xmx2G for example when you launch the app (java -Xmx2G -jar yourApp.jar).
Option2:
Divide the SQL into more than one call
Some of your options:
Use a local database, such as SQLite. That's a very lightweight database management system which is easy to install – you don't need any special privileges to do so – its data is held in a single file in a directory of your choice (such as the directory that holds your Java application) and can be used as an alternative to a large Java data structure such as a List.
If you really must use an ArrayList, make sure you take up as little space as possible. Try the following:
a. If you know the approximate number of rows, then construct your ArrayList with an appropriate initialCapacity to avoid reallocations. Estimate the maximum number of rows your database will grow to, and add another few hundred to your initialCapacity just in case.
b. Make sure your SabanaDatos objects are as small as they can be. For example, make sure the id field is an int and not an Integer. If the hora field is just a time of day, it can be held more efficiently in a short than in a String. Similarly for other fields, e.g. duracion - perhaps it can even fit into a byte, if its range allows it to? If you have several flag/Boolean fields, they can be packed into a single byte or short as bits (see the sketch after this list). If you have String fields that have a lot of repetitions, you can intern them as per Joop's suggestion.
c. If you still get out-of-memory errors, increase your heap space using the JVM flags -Xms and -Xmx.
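To illustrate point (b), here is a hedged sketch of a more compact record class; the exact field semantics are assumptions, the point is primitives instead of wrappers, a short for the time of day, and the flag-like columns packed into one byte.

public class SabanaDatosCompact {
    private int id;
    private short horaMinutes;      // time of day as minutes since midnight instead of a String
    private int duracion;
    private byte flags;             // bit 0: navegautenticado, 1: indicadorasesor,
                                    // 2: llamadaexitosa, 3: llamadanoexitosa
    private String tipocliente;     // intern/cache this if it has few distinct values

    public void setFlag(int bit, boolean value) {
        if (value) {
            flags |= (1 << bit);    // set the bit
        } else {
            flags &= ~(1 << bit);   // clear the bit
        }
    }

    public boolean getFlag(int bit) {
        return (flags & (1 << bit)) != 0;
    }

    // setters/getters for the remaining fields omitted for brevity
}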
I am trying to parse a large file (6.5 million rows) but am getting the mentioned out-of-memory error. I am using this same method to read other files of around 50K rows, and it works fairly quickly. Here it runs extremely slowly, then fails with the error. I originally had 2 GB dedicated to IntelliJ, which I changed to 4 GB (-Xmx4000m), then 6 GB (-Xmx6000m), and still end up with the same error. My computer only has 8 GB RAM so I can't go any higher. Any suggestions?
Thanks!
public static List<UmlsEntry> umlsEntries(Resource resource) throws
IOException {
return CharStreams.readLines(new InputStreamReader(resource.getInputStream())).stream().distinct()
.map(UmlsParser::toUmlsEntry).collect(Collectors.toList());
}
private static UmlsEntry toUmlsEntry(String line) {
String[] umlsEntry = line.split("|");
return new UmlsEntry(umlsEntry[UNIQUE_IDENTIFIER_FOR_CONCEPT_COLUMN_INDEX],
umlsEntry[LANGUAGE_OF_TERM_COLUMN_INDEX], umlsEntry[TERM_STATUS_COLUMN_INDEX],
umlsEntry[UNIQUE_IDENTIFIER_FOR_TERM_COLUMN_INDEX], umlsEntry[STRING_TYPE_COLUMN_INDEX],
umlsEntry[UNIQUE_IDENTIFIER_FOR_STRING_COLUMN_INDEX],
umlsEntry[IS_PREFERRED_STRING_WITHIN_THIS_CONCEPT_COLUMN_INDEX],
umlsEntry[UNIQUE_IDENTIFIER_FOR_ATOM_COLUMN_INDEX], umlsEntry[SOURCE_ASSERTED_ATOM_INDENTIFIER_COLUMN_INDEX],
umlsEntry[SOURCE_ASSERTED_CONCEPT_IDENTIFIER_COLUMN_INDEX],
umlsEntry[SOURCE_ASSERTED_DESCRIPTOR_IDENTIFIER_COLUMN_INDEX],
umlsEntry[ABBREVIATED_SOURCE_NAME_COLUMN_IDENTIFIER_COLUMN_INDEX],
umlsEntry[ABBREVIATION_FOR_TERM_TYPE_IN_SOURCE_VOCABULARY_COLUMN_INDEX],
umlsEntry[MOST_USEFUL_SOURCE_ASSERTED_IDENTIFIER_COLUMN_INDEX], umlsEntry[STRING_COLUMN_INDEX],
umlsEntry[SOURCE_RESTRICTION_LEVEL_COLUMN_INDEX], umlsEntry[SUPPRESSIBLE_FLAG_COLUMN_INDEX],
umlsEntry[CONTENT_VIEW_FLAG_COLUMN_INDEX]);
}
You need to process the lines a few at a time to avoid using up all available memory, since the file doesn't fit in memory. CharStreams.readLines confusingly isn't streaming: it reads all lines at once and returns a list. This won't work. Try Files.lines instead. I suspect that you will get into trouble with distinct as well. It needs to keep track of the hashes of all lines, and if this balloons too far you might have to change that tactic as well. Oh, and collect won't work either if you don't have enough memory to hold the result; then you might want to write to a new file or a database instead.
Here is an example of how you can stream lines from a file, compute distinct entries and print the md5 of each line:
Files.lines(FileSystems.getDefault().getPath("/my/file"))
.distinct()
.map(DigestUtils::md5Hex)
.forEach(System.out::println);
If you run into trouble detecting distinct rows, sort the file in-place first and then filter out identical adjacent rows only.
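For that last tactic, here is a minimal sketch (the paths are placeholders): once the file has been sorted externally, identical lines are adjacent, so duplicates can be dropped with one line of look-behind and O(1) memory.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DedupSorted {
    public static void main(String[] args) throws IOException {
        Path in = Paths.get("/my/file.sorted");   // assumed input path (already sorted)
        Path out = Paths.get("/my/file.dedup");   // assumed output path
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String prev = null;
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.equals(prev)) {          // keep only the first of each run of equal lines
                    writer.write(line);
                    writer.newLine();
                }
                prev = line;
            }
        }
    }
}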
I am looking for a Java implementation of a sorting algorithm. The file could be HUGE, say 20000*600 = 12,000,000 lines of records. Each line is comma-delimited with 37 fields, and we use 5 fields as keys. Is it possible to sort it quickly, say within 30 minutes?
If you have an approach other than Java, it is welcome as long as it can be easily integrated into a Java system. For example, a Unix utility.
Thanks.
Edit: The lines that need to be sorted are spread across 600 files, with 20,000 lines each, about 4 MB per file. Finally I would like them to end up as one big sorted file.
I am trying to time a Unix sort and will update afterwards.
Edit:
I appended all the files into one big file and tried the Unix sort function; it is pretty good. The time to sort a 2 GB file is 12-13 minutes. The append step took 4 minutes for the 600 files.
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r big.txt -o sorted.txt
How does the data get in the CSV format? Does it come from a relational database? You can make it such that whatever process creates the file writes its entries in the right order so you don't have to solve this problem down the line.
If you are doing a simple lexicographic ordering you can try the Unix sort, but I am not sure how that will perform on a file of that size.
Calling the Unix sort program should be efficient. It does multiple passes to ensure it is not a memory hog. You can fork a process with Java's Runtime, but the output of the process is redirected, so you have to do some juggling to get the redirection to work right:
public static void sortInUnix(File fileIn, File sortedFile)
throws IOException, InterruptedException {
String[] cmd = {
"cmd", "/c",
// above should be changed to "sh", "-c" if on Unix system
"sort " + fileIn.getAbsolutePath() + " > "
+ sortedFile.getAbsolutePath() };
Process sortProcess = Runtime.getRuntime().exec(cmd);
// capture error messages (if any)
BufferedReader reader = new BufferedReader(new InputStreamReader(
sortProcess.getErrorStream()));
String outputS = reader.readLine();
while (outputS != null) {
System.err.println(outputS);
outputS = reader.readLine();
}
sortProcess.waitFor();
}
Use the java library big-sorter which is published to Maven Central and has an optional dependency on commons-csv for CSV processing. It handles files of any size by splitting to intermediate files, sorting and merging the intermediate files repeatedly until there is only one left. Note also that the max group size for a merge is configurable (useful for when there are a large number of input files).
Here's an example:
Given the CSV file below, we will sort on the second column (the "number" column):
name,number,cost
WIPER BLADE,35,12.55
ALLEN KEY 5MM,27,3.80
Serializer<CSVRecord> serializer = Serializer.csv(
CSVFormat.DEFAULT
.withFirstRecordAsHeader()
.withRecordSeparator("\n"),
StandardCharsets.UTF_8);
Comparator<CSVRecord> comparator = (x, y) -> {
int a = Integer.parseInt(x.get("number"));
int b = Integer.parseInt(y.get("number"));
return Integer.compare(a, b);
};
Sorter
.serializer(serializer)
.comparator(comparator)
.input(inputFile)
.output(outputFile)
.sort();
The result is:
name,number,cost
ALLEN KEY 5MM,27,3.80
WIPER BLADE,35,12.55
I created a CSV file with 12 million rows and 37 columns and filled the grid with random integers between 0 and 100,000. I then sorted the 2.7GB file on the 11th column using big-sorter and it took 8 mins to do single-threaded on an i7 with SSD and max heap set at 512m (-Xmx512m).
See the project README for more details.
Java Lists can be sorted; you can try starting there.
Python on a big server.
import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv','rb') as source:
    rdr = csv.DictReader( source )
    data = [ row for row in rdr ]
    data.sort( key=sort_key )
    fields = rdr.fieldnames

with open('some_file_sorted.csv', 'wb') as target:
    wtr = csv.DictWriter( target, fields )
    wtr.writeheader()
    wtr.writerows( data )
This should be reasonably quick. And it's very flexible.
On a small machine, break this into three passes: decorate, sort, undecorate
Decorate:
import csv
def sort_key( aRow ):
    return aRow['this'], aRow['that'], aRow['the other']

with open('some_file.csv','rb') as source:
    rdr = csv.DictReader( source )
    with open('temp.txt','w') as target:
        for row in rdr:
            target.write( "|".join( map(str, sort_key(row)) ) + "|" + ",".join( row[f] for f in rdr.fieldnames ) + "\n" )
Part 2 is the operating system sort using "|" as the field separator
Undecorate:
with open('sorted_temp.txt','r') as source:
    with open('sorted.csv','w') as target:
        for row in source:
            keys, _, data = row.rpartition('|')
            target.write( data )
You don't mention the platform, so it is hard to come to terms with the time specified. 12x10^6 records isn't that many, but sorting is a pretty intensive task. Let's say 37 fields at, say, 100 bytes/field: that would be roughly 45 GB. That's a bit much for most machines, but if the records average 10 bytes/field your server should be able to fit the entire file in RAM, which would be ideal.
My suggestion: break the file into chunks that are 1/2 the available RAM, sort each chunk, then merge-sort the resulting sorted chunks. This lets you do all of your sorting in memory rather than hitting swap, which is what I suspect is causing the slow-down.
Say (1G chunks, in a directory you can play around in):
split --line-bytes=1000000000 original_file chunk
for each in chunk*
do
sort $each > $each.sorted
done
sort -m chunk*.sorted > original_file.sorted
As you have mentioned, your data set is huge, so sorting it all in one go will be time consuming, depending on your machine (if you try quicksort).
But since you would like it done within 30 minutes, I would suggest that you have a look at MapReduce using Apache Hadoop as your application server.
Please keep in mind it's not an easy approach, but in the longer run you can easily scale up depending on your data size.
I am also pointing you to an excellent link on Hadoop setup: work your way through the single-node setup and move on to a Hadoop cluster.
I would be glad to help you if you get stuck anywhere.
You really do need to make sure you have the right tools for the job. (Today I am hoping to get a 3.8 GHz PC with 24 GB of memory for home use. It's been a while since I bought myself a new toy. ;)
However, if you want to sort these lines and you don't have enough hardware, you don't need to break up the data because it's in 600 files already.
Sort each file individually, then do a 600-way merge sort (you only need to keep 600 lines in memory at once); see the sketch below. It's not as simple as doing them all at once, but you could probably do it on a mobile phone. ;)
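Here is a rough sketch of that 600-way merge in Java, assuming each input file has already been sorted with the same line ordering (the comparator below is a placeholder for your 5-key comparison). Only the current line of each file is held in memory.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class KWayMerge {
    // Placeholder ordering: replace with a comparator that extracts your 5 key fields.
    private static final Comparator<String> LINE_ORDER = Comparator.naturalOrder();

    public static void merge(List<Path> sortedInputs, Path output) throws IOException {
        // The heap holds one (next line, reader) pair per input file, smallest line on top.
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>((a, b) -> LINE_ORDER.compare(a.getKey(), b.getKey()));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path p : sortedInputs) {
                BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(first, r));
                }
            }
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> smallest = heap.poll();
                out.write(smallest.getKey());
                out.newLine();
                String next = smallest.getValue().readLine();   // refill from the same file
                if (next != null) {
                    heap.add(new AbstractMap.SimpleEntry<>(next, smallest.getValue()));
                }
            }
        } finally {
            for (BufferedReader r : readers) {
                r.close();
            }
        }
    }
}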
Since you have 600 smaller files, it could be faster to sort all of them concurrently. This will eat up 100% of the CPU. That's the point, correct?
waitlist=
for f in ${SOURCE}/*
do
sort -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o ${f}.srt ${f} &
waitlist="$waitlist $!"
done
wait $waitlist
LIST=`echo $SOURCE/*.srt`
sort --merge -t ',' -k 1,1 -k 4,7 -k 23,23 -k 2,2r -o sorted.txt ${LIST}
This will sort 600 small files all at the same time and then merge the sorted files. It may be faster than trying to sort a single large file.
Use Hadoop MapReduce to do the sorting. I recommend Spring Data Hadoop. It's Java.
Well, since you're talking about HUGE datasets, you'll need an external sorting algorithm anyhow. There are some for Java and pretty much any other language out there; since the result will have to be stored on disk anyway, which language you're using is pretty uninteresting.
I have a large (3Gb) binary file of doubles which I access (more or less) randomly during an iterative algorithm I have written for clustering data. Each iteration does about half a million reads from the file and about 100k writes of new values.
I create the FileChannel like this...
f = new File(_filename);
_ioFile = new RandomAccessFile(f, "rw");
_ioFile.setLength(_extent * BLOCK_SIZE);
_ioChannel = _ioFile.getChannel();
I then use a private ByteBuffer the size of a double to read from it
private ByteBuffer _double_bb = ByteBuffer.allocate(8);
and my reading code looks like this
public double GetValue(long lRow, long lCol)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long position = idx * BLOCK_SIZE;
double d = 0;
try
{
_double_bb.position(0);
_ioChannel.read(_double_bb, position);
d = _double_bb.getDouble(0);
}
...snip...
return d;
}
and I write to it like this...
public void SetValue(long lRow, long lCol, double d)
{
long idx = TriangularMatrix.CalcIndex(lRow, lCol);
long offset = idx * BLOCK_SIZE;
try
{
_double_bb.putDouble(0, d);
_double_bb.position(0);
_ioChannel.write(_double_bb, offset);
}
...snip...
}
The time taken for an iteration of my code increases roughly linearly with the number of reads. I have added a number of optimisations to the surrounding code to minimise the number of reads, but I am now down to the core set of reads that I feel are necessary without fundamentally altering how the algorithm works, which I want to avoid at the moment.
So my question is whether there is anything in the read/write code or JVM configuration I can do to speed up the reads? I realise I can change hardware, but before I do that I want to make sure that I have squeezed every last drop of software juice out of the problem.
Thanks in advance
As long as your file is stored on a regular hard disk, you will get the biggest possible speedup by organizing your data in a way that gives your accesses locality, i.e. causes as many get/set calls in a row as possible to access the same small area of the file.
This is more important than anything else you can do because accessing random spots on a HD is by far the slowest thing a modern PC does - it takes about 10,000 times longer than anything else.
So if it's possible to work on only a part of the dataset (small enough to fit comfortably into the in-memory HD cache) at a time and then combine the results, do that.
Alternatively, avoid the issue by storing your file on an SSD or (better) in RAM. Even storing it on a simple thumb drive could be a big improvement.
Instead of reading into a ByteBuffer, I would use file mapping; see FileChannel.map().
Also, you don't really explain how your GetValue(row, col) and SetValue(row, col) access the storage. Are row and col more or less random? The idea I have in mind is the following: sometimes, for image processing, when you have to access pixels like row + 1, row - 1, col - 1, col + 1 to average values, one trick is to organize the data in 8 x 8 or 16 x 16 blocks. Doing so helps keep the different pixels of interest in a contiguous memory area (and hopefully in the cache).
You might transpose this idea to your algorithm (if it applies): map a portion of your file once, so that the different calls to GetValue(row, col) and SetValue(row, col) work on the portion that has just been mapped.
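A minimal sketch of the FileChannel.map() suggestion, under the assumption that the 3 GB file is mapped in fixed-size windows (a single MappedByteBuffer is limited to about 2 GB). Names such as BLOCK_SIZE mirror the question; the windowing scheme itself is an assumption.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedDoubleStore implements AutoCloseable {
    private static final int BLOCK_SIZE = 8;                 // one double per cell, as in the question
    private static final long WINDOW_BYTES = 1L << 30;       // 1 GB per mapping window (a multiple of 8)
    private final RandomAccessFile file;
    private final FileChannel channel;
    private final MappedByteBuffer[] windows;

    public MappedDoubleStore(String filename, long extent) throws IOException {
        file = new RandomAccessFile(filename, "rw");
        file.setLength(extent * BLOCK_SIZE);
        channel = file.getChannel();
        int n = (int) ((file.length() + WINDOW_BYTES - 1) / WINDOW_BYTES);
        windows = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long start = i * WINDOW_BYTES;
            long size = Math.min(WINDOW_BYTES, file.length() - start);
            windows[i] = channel.map(FileChannel.MapMode.READ_WRITE, start, size);
        }
    }

    public double get(long idx) {
        long pos = idx * BLOCK_SIZE;     // never straddles a window because both are multiples of 8
        return windows[(int) (pos / WINDOW_BYTES)].getDouble((int) (pos % WINDOW_BYTES));
    }

    public void set(long idx, double d) {
        long pos = idx * BLOCK_SIZE;
        windows[(int) (pos / WINDOW_BYTES)].putDouble((int) (pos % WINDOW_BYTES), d);
    }

    @Override
    public void close() throws IOException {
        channel.close();
        file.close();
    }
}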
Presumably if we can reduce the number of reads then things will go more quickly.
3 GB isn't huge for a 64-bit JVM, hence quite a lot of the file would fit in memory.
Suppose that you treat the file as "pages" which you cache. When you read a value, read the page around it and keep it in memory. Then when you do more reads, check the cache first.
Or, if you have the capacity, read the whole thing into memory at the start of processing.
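A hedged sketch of the page-cache idea for the read path (page and cache sizes are arbitrary assumptions; writes would additionally need dirty-page handling, which is omitted here):

import java.util.LinkedHashMap;
import java.util.Map;

public class PageCache {
    private static final int DOUBLES_PER_PAGE = 4096;    // 32 KB pages (assumption)
    private static final int MAX_PAGES = 8192;           // roughly 256 MB of cache (assumption)

    // Access-ordered LinkedHashMap gives a simple LRU eviction policy.
    private final Map<Long, double[]> pages =
            new LinkedHashMap<Long, double[]>(MAX_PAGES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Long, double[]> eldest) {
                    return size() > MAX_PAGES;            // evict the least recently used page
                }
            };

    public interface PageLoader {
        double[] load(long pageNumber);                   // reads one page of doubles from the file
    }

    public double get(long idx, PageLoader loader) {
        long pageNumber = idx / DOUBLES_PER_PAGE;
        double[] page = pages.computeIfAbsent(pageNumber, loader::load);
        return page[(int) (idx % DOUBLES_PER_PAGE)];
    }
}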
Byte-by-byte access always produces poor performance (not only in Java). Try to read/write bigger blocks (e.g. rows or columns).
How about switching to a database engine for handling such amounts of data? It would handle all the optimizations for you.
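A small sketch of the bigger-blocks suggestion, reusing the question's FileChannel: fetch a whole run of consecutive doubles with one positioned read instead of one 8-byte read per value. The contiguous row layout is an assumption.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class RowReader {
    private static final int BLOCK_SIZE = 8;   // bytes per double, as in the question

    // Reads `count` consecutive doubles starting at element index `firstIdx`.
    public static double[] readRow(FileChannel channel, long firstIdx, int count) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(count * BLOCK_SIZE);
        long position = firstIdx * BLOCK_SIZE;
        while (buf.hasRemaining()) {
            int n = channel.read(buf, position + buf.position());   // positioned read, fills the buffer
            if (n < 0) {
                break;                                              // reached end of file
            }
        }
        buf.flip();
        double[] row = new double[buf.remaining() / BLOCK_SIZE];
        for (int i = 0; i < row.length; i++) {
            row[i] = buf.getDouble();
        }
        return row;
    }
}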
Maybe this article helps you ...
You might want to consider using a library which is designed for managing large amounts of data and random reads rather than using raw file access routines.
The HDF file format may be a good fit. It has a Java API but is not pure Java. It's licensed under an Apache-style license.