After investigating a bit at work I noticed that the application I'm working on is using the default fetch size (which is 10 for Oracle from what I know). The problem is that in the majority of cases the users fetch large amount of data (ranging from few thousand to even hundreds of thousands) and that the default 10 is really a huge bottleneck.
So the obvious conclusion here would be to make the fetch size larger. At first I was thinking about setting the default to 100 and bumping it to a 1000 for several queries. But then I read on the net that the default is so small to prevent memory issues (i.e. when the JVM heap cannot handle so much data), should I be worried about it?
I haven't seen anywhere further explanation to this. Does it mean that a bigger fetch sizes means more overhead while fetching the result set? Or do they just mean that with the default I can fetch 10 records and then GC them and fetch another 10 and so on (whereas lets say fetching a 10000 all at once would result in an OutOfMemory exception)? In such case I wouldn't really care as I need all the records in the memory anyway. In the former case (where bigger result set means bigger memory overhead) I guess I should load test it first.
By setting the fetch size too, big you are risking OutOfMemoryError.
The fact that you need all these records anyway is probably not justifiable. More chances you need the entities reflected by the returned ResultSets... Setting the fetch size to 10000 means you're heaping 10000 records represented by JDBC classes. Of course, you don't pass these around through your application. You first transform them into your favorite business-logic-entities and then hand them to your business-logic-executor. This way, The records form the first fetch bulk are available for GC as soon as JDBC fetches the next fetch bulk.
Typically, this transformation is done a little bunch at a time exactly because of the memory threat aforementioned.
One thing you're absolutely right, though: you should test for performance with well-defined requirements before tweaking.
So the obvious conclusion here would be to make the fetch size larger.
Perhaps an equally obvious conclusion should be: "Let's see if we can cut down on the number of objects that users bring back." When Google returns results, it does so in batches of 25 or 50 sorted by greatest likelihood to be considered useful by you. If your users are bringing back thousands of objects, perhaps you need to think about how to cut down on that. Can the database do more of the work? Are there other operations that could be written to eliminate some of those objects? Could the objects themselves be smarter?
Related
Let's say, we have a highly configurable report system, which allows users to select columns, filters, and sorting.
All this configuration comes to BE, where it's being transformed to SQL, executed against DB and then the user sees his report and can continue to work with it. But on each operation, like sorting, we still build a query.
The transformation itself takes few milliseconds, but the query execution against DB can take 3-5 seconds (up to 20 if there are a lot of parallel executions).
So, I'm thinking about adding some sort of cache.
Currently, I see 3 ways:
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
The cache invalidation will be a few times a day.
What do you see is the best way to make it faster? What cons and pros in proposed solutions from yours perspective? What would you do if you are free in selecting Database and technology (Java stack)?
OK, let's make sure I got it right.
there are more than 10k different reports
So it doesn't make sense to pre-calculate and pre-cache them, they have to be generated on-demand.
there is not a lot of data in rows, just short strings, dates and integers. It’s not costly to fetch it in memory and even save there for a while
So caching a small amount of data can avoit a big costly query, that's good.
Add one table to cache all results without filtering, and then on user request sort/filter it on Java side.
Problem is, most likely every report query will have different columns, with different names, so that doesn't fit a single table well unless you use a format like JSON, storing each cached result row as a JSON dictionary... And in this case indexing it would be a problem, even if you create indexes on fields inside JSON values, if you have a zillion different column names from your many reports you'll need a zillion indexes too...
Smells like a can of worms.
Add one table per result, still without the filters. In this case, I will have the possibility to sort/filter on much less amount of data, but there are more than 10k different reports, and I don't think it would be good to create 10k small tables.
Pros: each cache table can have the proper columns, data types and indexes. It is easy to invalidate the cache, just truncate it. You can set all the cache tables to UNLOGGED to make them faster. And you can do all the extra sorting/filtering on the cached result using the same SQL queries you were using before, so this might be the simpler option to code. It is also nice for pagination if you only want to fetch part of the result. And that will be the fastest option as far as copying the results of reporting queries into cache since the cache is already in postgres, there is no need to transfer data. You can also store the cache on another drive/SSD.
Cons: I've heard the main issue with tons of tables is if your filesystem slows down on directories with large numbers of files. That shouldn't be an issue on modern filesystems though, and I don't think postgres itself is going to be bothered at all by 10k tables.
It might make queries on information_schema slow, and stuff like "\dt" in psql problematic, so the cache tables would be better hidden away in a "cache" schema so they don't interfere. This will also make it easier to exclude them from backups.
It will also use some RAM on postgres server to cache the cache tables, that depends on the number of online users.
I'd say it would be worth a little bit of benchmarking. Create a schema, add 10k tables, see if something breaks.
Like the first option, but LRU cache on Java side. We can fit in memory 2-3k report results. It will be usually faster than in the first option since we don't have a lot of parallel users, just users with lots of reports.
That's a bit of reinventing the wheel, and you got to reimplement the sort/filter in java... plus the cache algos... meeeh.
There are other options though:
Put the cache in another database, on another machine. This may be a postgres instance, or another database (which may require rewriting some queries). Could be interesting only if the cache eats too much RAM on your database.
Put the cache in the web browser, and use javascript to filter/sort. That could be faster depending on speed of internet connection, and it would reduce server load, but you'll have to write lots of javascript code.
IMO you're cautious about the large number of tables, it is good to be cautious, but if it works well, it really is the simplest solution...
DynamoDb has provisioned data throughput limits and based on the answer in this post it appears that if you exceed your throughput limits DynamoDb can start throwing exceptions, rather than just slowing your query down. If you run a query (e.g. as described here), how do you ensure the query doesn't exceed the throughputs?
A query will be automatically limited to 1 MB in which case you have to do some tricky re-querying using the LastEvaluatedKey as described in their docs (with no code example). However dependent on the query speed or similar even this 1 MB could presumably blow your provisioned limits?
You can also set the maximum number of items returned in the QuerySpec object so you then process the query in individual pages - which I'm guessing do separate queries under the hood. However if you don't know the size of your items in-advance, you can't limit it by number of KB as needed to stay within the throughput. Chicken and egg situation.
Am I missing something here? Is there any easier way to do this?
Your problem can be solved by following solutions or by combining the solutions
Solution 1: As you have stated in your question if you know your Item size you can Limit the number of records, pass the lastEvaluatedKey to get next set of records.
Again even this might be enough to throttle your limits, you need to have a sleep mechanism for a while so that you wont hit the table continuously.
Solution 2: By far the best method I have implement is that don't hit the DynamoDB table rather store your information in some form of cache(Memcahce, Elasticache, etc)
This will ensure that you wont require very high throughput rate.
Note: This will depend on your type of application as well if your data is updated very frequently than this might not be very helpful.
I am facing some optimization problem in java. I have to process a table which has 5 attributes. The table contains about 5 millions records. To simplify the problem, let say I have to read every record one by one. Then I have to process each record. From each record I have to generate a mathematical lattice structure which has 500 nodes. In other words each record generate 500 more new records which can be referred as parents of the original record. So in total there are 500 X 5 Millions records including original plus parent records. Now the job is to find the number of distinct records out of all 500 X 5 Millions records with their frequencies. Currently I have solved this problem as follow. I convert every record to a string with value of each attribute separated by "-". And I count them in a java HashMap. Since these records involve intermediate processing. A record is converted to a string and then back to a record during intermediate steps. The code is tested and it is working fine and produce accurate results for small number of records but it can not process 500 X 5 Millions records.
For large number of records it produce the following error
java.lang.OutOfMemoryError: GC overhead limit exceeded
I understand that the number of distinct records are not more than 50 thousands for sure. Which means that the data should not cause memory or heap overflow. Can any one suggest any option. I will be very thankful.
Most likely, you have some data-structure somewhere which is keeping references to the processed records, also known as a "memory leak". It sounds like you intend to process each record in turn and then throw away all the intermediate data, but in fact the intermediate data is being kept around. the garbage collector can't throw away this data if you have some collection or something still pointing to it.
Note also that there is the very important java runtime parameter "-Xmx". Without any further detail than what you've provided, I would have thought that 50,000 records would fit easily into the default values, but maybe not. Try doubling -Xmx (hopefully your computer has enough RAM). If this solves the problem then great. If it just gets you twice as far before it fails, then you know it's an algorithm problem.
Using a sqlite database can used to safe (1.3tb?) data. With query´s you can find fast info back. Also the data get saved when youre program ends.
You probably need to adopt a different approach to calculating the frequencies of occurrence. Brute force is great when you only have a few million :)
For instance, after your calculation of the 'lattice structure' you could combine that with the original data and take either the MD5 or SHA1. This should be unique except when the data is not "distinct". Which then should reduce your total data down back below 5 million.
We have some part of our application that need to load a large set of data (>2000 entities) and perform computation on this set. The size of each entity is approximately 5 KB.
On our initial, naïve, implementation, the bottleneck seems to be the time required to load all the entities (~40 seconds for 2000 entities), while the time required to perform the computation itself is very small (<1 second).
We had tried several strategies to speed up the entities retrieval:
Splitting the retrieval request into several parallel instances and then merging the result: ~20 seconds for 2000 entities.
Storing the entities at an in-memory cache placed on a resident backend: ~5 seconds for 2000 entities.
The computation needs to be dynamically computed, so doing a precomputation at write time and storing the result does not work in our case.
We are hoping to be able to retrieve ~2000 entities in just under one second. Is this within the capability of GAE/J? Any other strategies that we might be able to implement for this kind of retrieval?
UPDATE: Supplying additional information about our use case and parallelization result:
We have more than 200.000 entities of the same kind in the datastore and the operation is retrieval-only.
We experimented with 10 parallel worker instances, and a typical result that we obtained could be seen in this pastebin. It seems that the serialization and deserialization required when transferring the entities back to the master instance hampers the performance.
UPDATE 2: Giving an example of what we are trying to do:
Let's say that we have a StockDerivative entity that need to be analyzed to know whether it's a good investment or not.
The analysis performed requires complex computations based on many factors both external (e.g. user's preference, market condition) and internal (i.e. from the entity's properties), and would output a single "investment score" value.
The user could request the derivatives to be sorted based on its investment score and ask to be presented with N-number of highest-scored derivatives.
200.000 by 5kb is 1GB. You could keep all this in memory on the largest backend instance or have multiple instances. This would be the fastest solution - nothing beats memory.
Do you need the whole 5kb of each entity for computation?
Do you need all 200k entities when querying before computation? Do queries touch all entities?
Also, check out BigQuery. It might suit your needs.
Use Memcache. I cannot guarantee that it will be sufficient, but if it isn't you probably have to move to another platform.
This is very interesting, but yes, its possible & Iv seen some mind boggling results.
I would have done the same; map-reduce concept
It would be great if you would provide us more metrics on how many parallel instances do you use & what are the results of each instance?
Also, our process includes retrieval alone or retrieval & storing ?
How many elements do you have in your data store? 4000? 10000? Reason is because you could cache it up from the previous request.
regards
In the end, it does not appear that we could retrieve >2000 entities from a single instance in under one second, so we are forced to use in-memory caching placed on our backend instance, as described in the original question. If someone comes up with a better answer, or if we found a better strategy/implementation for this problem, I would change or update the accepted answer.
Our solution involves periodically reading entities in a background task and storing the result in a json blob. That way we can quickly return more than 100k rows. All filtering and sorting is done in javascript using SlickGrid's DataView model.
As someone has already commented, MapReduce is the way to go on GAE. Unfortunately the Java library for MapReduce is broken for me so we're using non optimal task to do all the reading but we're planning to get MapReduce going in the near future (and/or the Pipeline API).
Mind that, last time I checked, the Blobstore wasn't returning gzipped entities > 1MB so at the moment we're loading the content from a compressed entity and expanding it into memory, that way the final payload gets gzipped. I don't like that, it introduces latency, I hope they fix issues with GZIP soon!
I've read the following:
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/solr/SolrCaching
http://www.lucidimagination.com/content/scaling-lucene-and-solr
And I have questions about a few things:
If I use the JVM option -XX:+UseCompressedStrings what kind of memory savings can I achieve? To keep a simple example, if I have 1 indexed field (string) and 1 stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
When exactly is the Solr filter cache doing? If I'm just doing a simple query with AND and a few ORs, and sorting by score, do I even need it?
If I want to cache all documents in the document cache, how would I compute the space required? Using the example from above, if I have 20M documents, use compressed strings, and the average length of the stored field is 25 characters, is the space required basically (25 bytes + small_admin_overhead) * 20M?
if all documents are in the document cache, how important is the query cache?
If I want to autowarm every document into the doc cache, will autowarm query of *:* do it?
The scaling-lucene-and-solr article says FuzzyQuery is slow. If I'm using the spellcheck feature of solr then I'm basically using fuzzy query right (because spellcheck does the same edit distance calculation)? So presumably spellcheck and fuzzy query are both equally "slow"?
The section describing the lucene field cache for strings is a bit confusing. Am I reading it correctly that the space required is basically the size of the indexed string field + an integer arry equal to the number of unique terms in that field?
Finally, under maximizing throughput, there is a statement about leaving enough space for the OS disk cache. It says, "All in all, for a large scale index, it's best to be sure you have at least a few gigabytes of RAM beyond what you are giving to the JVM.". So if I have a 12GB memory machine (as an example), I should give at least 2-3GB to the OS? Can I estimate the disk cache space needed by the OS by looking at the on disk index size?
Only way to be sure is to try it out. However, I would expect very little savings in the Index, as the index would only contain the actual string once each time, the rest is data for locations of that string within documents. They aren't a large part of the index.
Filter cache only caches filter queries. It may not be useful for your precise use case, but many do find them useful. For example, narrowing results by country, language, product type, etc. Solr can avoid recalculating the query results for things like this if you use them frequently.
Realistically, you just have to try it and measure it with a profiler. Without in depth knowledge of EXACTLY the data structure used, anything else is pure SWAG. Your calculation is just as good as anyone else's without profiling.
Document cache only saves time in constituting the results AFTER the query has been calculated. If you spend most of your time calculating queries, the document cache will do you little good. Query cache is only useful for re-used queries. If none of your queries are repeated, then Query cache is useless
yes, assuming your Document cache is large enough to hold them all.
6-8 Not positive.
From my own experience with Solr performance tuning, you should leave Solr to deal with queries, not document storage. The majority of your questions focus on how documents take up space. Solr is a search engine, not a document storage repository. If you want Solr to be FAST and take up minimal memory, then the only thing it should hold onto is index information for searching purposes. The documents themselves should be stored, retrieved, and rendered elsewhere. Preferably in system that is optimized specifically for that job. The only field you should store in your Solr document is an ID for retrieval from the document storage system.
Caches
In general, caching looks like a good idea to improve performance, but this also has a lot of issues:
cached objects are likely to go into the old generation of the garbage collector, which is more costly to collect,
managing insertions and evictions adds some overhead.
Moreover, caching is unlikely to improve your search latency much unless there are patterns in your queries. On the contrary, if 20% of your traffic is due to a few queries, then the query results cache may be interesting. Configuring caches requires you to know your queries and your documents very well. If you don't, you should probably disable caching.
Even if you disable all caches, performance could still be pretty good thanks to the OS I/O cache. Practically, this means that if you read the same portion of a file again and again, it is likely that it will be read from disk only the first time, and then from the I/O cache. And disabling all caches allows you to give less memory to the JVM, so that there will be more memory for the I/O cache. If your system has 12GB of memory and if you give 2GB to the JVM, this means that the I/O cache might be able to cache up to 10G of your index (depending on other applications running which require memory too).
I recommand you read this to get more information on application-level cache vs. I/O cache:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
http://antirez.com/post/what-is-wrong-with-2006-programming.html
Field cache
The size of the field cache for a string is (one array of integers of length maxDoc) + (one array for all unique string instances). So if you have an index with one string field which has N instances of size S on average, and if your index has M documents, then the size of the field cache for this field will be approximately M * 4 + N * S.
The field cache is mainly used for facets and sorting. Even very short strings (less than 10 chars) are more than 40 bytes, this means that you should expect Solr to require a lot of memory if you sort or facet on a String field which has a high number of unique values.
Fuzzy Query
FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.
It depends on the Spellchecker implementation you choose but I think that the Solr 3.x spell checker uses N-Grams to find candidates (this is why it needs a dedicated index) and then only computes distances on this set on candidates, so the performance is still reasonably good.