Elasticsearch Field Data Cache Distributed? - java

Since upgrading to Elasticsearch 1.0.1 I've become aware of the field data cache and its circuit breaker.
https://www.elastic.co/guide/en/elasticsearch/reference/1.3/index-modules-fielddata.html
I use facets (and now aggregations) quite heavily, and I was wondering whether the field data cache is distributed and, if so, how it is distributed.
I.e. if I use 2GB of field data cache on one node and then add 3 more nodes, will that 2GB be distributed over the 4 nodes, or will I see a 2GB cache on each node?
Thanks in advance,
J

You can think of field data as a data structure that's loaded into memory per shard, so any data node can end up holding field data. It is correct that its memory footprint gets distributed when you scale out by adding more data nodes, although that depends on how many indices/shards you have and which of them you are using for faceting/sorting/scripting.
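If you want to see how it is actually spread across your cluster, each node reports its own field data usage. Here is a minimal sketch that reads the _cat/fielddata endpoint from plain Java; the host/port are assumptions for a local node, and the endpoint is available in recent 1.x releases:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FielddataCheck {
    public static void main(String[] args) throws Exception {
        // Ask the cat API for per-node field data usage; prints one row per node/field.
        URL url = new URL("http://localhost:9200/_cat/fielddata?v");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // columns: node, host, field, size
            }
        } finally {
            conn.disconnect();
        }
    }
}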

Related

Cassandra - Setting a huge field to null not giving back the disk space

In our keyspace we have only a few tables, one of which contains most of the data. In that table there is a single column (say column X) that holds 99.99% of the data. When data is no longer relevant, we set a TTL of a few days and also set column X to null (from a Java process). Ideally this should immediately free up significant disk space, as column X held about 90% of the total keyspace data, but we are not seeing any reduction in disk space usage.
Also, after the TTL expires the data is deleted as expected, but again we are not seeing any space freed up.
What are we missing?
In Cassandra, no data is modified in place - all files are immutable. When you perform a delete or insert a null (it's the same thing), a special marker (a tombstone) is added, in addition to the previous data already on disk. So by "deleting" data you're actually adding more data :-)
The actual deletion of the data happens when the SSTable files are compacted by background compaction. When a given file gets compacted depends on the compaction strategy in use and its configuration options. There can be situations where old data sits in big files that are not compacted for a while. Depending on your version of Cassandra/DSE, you can force a compaction of all data by running nodetool compact -s on every node, but this requires enough free disk space (roughly the size of the table). Another option is to use nodetool garbagecollect -g CELL on the individual SSTables, but that also requires free disk space.
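To make the first point concrete, both statements below write a tombstone for column X instead of reclaiming space. This is a minimal sketch using the DataStax Java driver; the contact point, keyspace, table and key are made up:
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class NullWritesATombstone {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {
            // Both of these add a tombstone for column x; the old value stays on disk
            // until the SSTables containing it are compacted.
            session.execute("UPDATE my_table SET x = null WHERE id = ?", 42);
            session.execute("DELETE x FROM my_table WHERE id = ?", 42);
        }
    }
}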
P.S. I recommend taking at least the DS201 course on DataStax Academy.

Java caching design for 100M+ keys?

I need to cache over 100 million string keys (~100 chars each) for a standalone Java application.
Required cache properties:
Persistent.
Fetching keys from the cache in the tens-of-milliseconds range.
Allows invalidation and expiry.
Independent caching server, to allow multi-threaded access.
Preferably not an enterprise database, as the 100M keys can scale to 500M, which would use a lot of memory and system resources with sluggish throughput.
For a distributed cache you can try Hazelcast.
It can be scaled as needed and has backups and synchronization out of the box. It is also a JSR-107 provider and has many other helpful tools. However, if you want persistence, you will need to handle it yourself or buy their enterprise version.
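A minimal Hazelcast sketch (the map name and key/value types are placeholders); every JVM that starts an instance like this joins the same cluster, so entries get partitioned and backed up across the members:
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

import java.util.Map;

public class HazelcastCacheSketch {
    public static void main(String[] args) {
        // Start (or join) a cluster member in this JVM.
        HazelcastInstance instance = Hazelcast.newHazelcastInstance(new Config());

        // A distributed map: entries are partitioned across all cluster members.
        Map<String, String> cache = instance.getMap("keys");
        cache.put("some-100-char-key", "value");
        System.out.println(cache.get("some-100-char-key"));

        instance.shutdown();
    }
}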
Finally, this is how I solved this big-data problem with the existing cache solutions available (Hazelcast, Guava cache, Ehcache, etc.):
I broke the cache into two levels.
I grouped ~100K keys into one Java collection and associated them with a common property; in my case the keys carried a timestamp, so that timestamp slot became the key for a second-level cache block of ~100K entries.
The time-slot key is stored in a persistent Java cache, with the compressed Java collection as its value.
The reason I still get good throughput with two-level caching, despite the compression and decompression overhead, is that my key lookups are range-bound: once a cache block is found, most of the subsequent lookups are served by the in-memory Java collection from the previous lookup.
To conclude: identify a common attribute in your keys to group them by, and break the cache into multiple levels (see the sketch below); otherwise you will need hefty hardware and an enterprise cache to support a problem of this size.
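For illustration, the idea looks roughly like this; the names, the one-hour slot size, and the use of plain Java serialization plus GZIP are a sketch, not the exact production code:
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TwoLevelCacheSketch {
    // Level 1: time-slot key -> gzipped, serialized map of that slot's ~100K entries.
    // In the real setup this would be a persistent cache (Ehcache, MapDB, ...), not a HashMap.
    private final Map<Long, byte[]> level1 = new HashMap<>();

    // Level 2: the most recently used slot, kept uncompressed in memory.
    private long currentSlot = -1;
    private Map<String, String> currentBlock = new HashMap<>();

    public String get(long timestamp, String key) throws IOException, ClassNotFoundException {
        loadSlot(timestamp / 3_600_000L); // e.g. one slot per hour
        return currentBlock.get(key);     // range-bound lookups mostly hit this in-memory map
    }

    public void put(long timestamp, String key, String value) throws IOException, ClassNotFoundException {
        loadSlot(timestamp / 3_600_000L);
        currentBlock.put(key, value);
        level1.put(currentSlot, compress(currentBlock)); // real code would batch these writes
    }

    private void loadSlot(long slot) throws IOException, ClassNotFoundException {
        if (slot == currentSlot) {
            return;                       // level-2 hit: nothing to inflate
        }
        byte[] blob = level1.get(slot);
        currentBlock = blob == null ? new HashMap<String, String>() : decompress(blob);
        currentSlot = slot;
    }

    private static byte[] compress(Map<String, String> block) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(new HashMap<>(block));
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    private static Map<String, String> decompress(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(blob)))) {
            return (Map<String, String>) in.readObject();
        }
    }
}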
Try Guava Cache. It meets all of your requirements.
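A minimal sketch of a Guava setup covering the size, expiry and invalidation points (the numbers are placeholders); note that Guava keeps everything on-heap, so persistence still has to be handled separately, e.g. via the links below:
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

import java.util.concurrent.TimeUnit;

public class GuavaCacheSketch {
    public static void main(String[] args) {
        Cache<String, String> cache = CacheBuilder.newBuilder()
                .maximumSize(1_000_000)                  // evict once this many entries are held
                .expireAfterAccess(30, TimeUnit.MINUTES) // expiry
                .build();

        cache.put("some-key", "some-value");
        System.out.println(cache.getIfPresent("some-key"));
        cache.invalidate("some-key");                    // explicit invalidation
    }
}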
Links:
Guava Cache Explained
guava-cache
Persistence: Guava cache
Edit: another one, which I have not used yet: eh-cache

Entity prepopulation for MongoDB to avoid padding with Spring

In an application I use the concept of buckets to store objects. All buckets are empty at creation time. Some may fill up to their maximum capacity of 20 objects in 2 hours, some in 6 months. Each object's size is pretty much fixed, i.e. I don't expect sizes to differ by more than 10%, so the sizes of full buckets won't differ much either. The implementation looks similar to this:
@Document
public class MyBucket {
    // maximum capacity of 20
    private List<MyObject> objects;
}
One approach to keep the padding factor low would be to prepopulate my bucket with dummy data. Two options come to my mind:
Create the bucket with dummy data, save it, then reset its content and save it again
Create the bucket with dummy data and flag it as "pristine". On the first write the flag is set to false and the data gets reset.
The disadvantages are obvious: option 1 requires two DB writes, and option 2 requires extra (non-business) code in my entities.
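For illustration, option 2 could look roughly like this; the pristine flag and the reset-on-first-write logic are only a sketch, not actual application code:
import java.util.ArrayList;
import java.util.List;

import org.springframework.data.mongodb.core.mapping.Document;

@Document
public class MyBucket {
    // maximum capacity of 20, prepopulated with dummy entries at creation time
    private List<MyObject> objects = new ArrayList<>();

    // hypothetical flag: true while the bucket still holds only dummy data
    private boolean pristine = true;

    public void add(MyObject object) {
        if (pristine) {          // first real write: drop the dummy padding
            objects.clear();
            pristine = false;
        }
        objects.add(object);
    }
}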
Probably I won't get off cheaply with any solution. Nevertheless, any real-life experience with that issue, any best practices or hints?
Setup: Spring Data MongoDB 1.9.2, MongoDB 3.2
As far as I understand, your main concern is the performance overhead of growing document sizes, which leads to document relocation and index updates. That is the case for the mmapv1 storage engine; however, since MongoDB 3.0 the WiredTiger storage engine is available, which does not have such issues (check the similar question).

Java - Cache HashMap with dynamic object size

I have a HashMap of Node values, where Node is a class containing some info and, most importantly, a neighbors HashMap.
My algorithm does random gets from the outer HashMap and puts into the inner HashMaps, e.g.
Node node = map.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
I have lots of entries in both the outer map and inner maps. The full dataset cannot be handled by memory alone, so I need to use some disk caching. I would like to use memory as much as possible (until almost full, or until it reaches a threshold I specify) and then evict entries to disk.
I have used several caches (mostly MapDB and Ehcache) trying to solve my problem, but with no luck. I am setting a maximum memory size, but the cache just ignores it.
I am almost certain that the problem lies in the fact that my objects are of dynamic size.
Anyone got any ideas on how I could handle this problem?
Thanks in advance.
Note: I work on Ehcache
Ehcache, like most other caching products, cannot know that your object has grown unless you let it know by updating the mapping from the outside:
Node node = cache.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
cache.put(someInt, node);
In that context, Ehcache will properly track the object growth and trigger memory eviction.
Note that Ehcache no longer uses an overflow model between the heap and disk but instead always stores the mapping on disk while keeping a hotset on heap.
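If you are on Ehcache 3, a heap-plus-disk setup along those lines can be configured roughly like this; the sizes and storage path are placeholders, and the simplified Node stand-in below has to be Serializable for the disk tier:
import java.io.File;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

import org.ehcache.Cache;
import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.MemoryUnit;

// Simplified stand-in for the asker's Node class.
class Node implements Serializable {
    private final Map<Integer, Integer> neighbors = new HashMap<>();
    Map<Integer, Integer> getNeighbors() { return neighbors; }
}

public class NodeCacheSketch {
    public static void main(String[] args) {
        PersistentCacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(CacheManagerBuilder.persistence(new File("node-cache-data")))
                .withCache("nodes",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                Integer.class, Node.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(512, MemoryUnit.MB)        // hot set kept on heap
                                        .disk(8, MemoryUnit.GB, true)))  // full data set on disk
                .build(true);

        Cache<Integer, Node> cache = cacheManager.getCache("nodes", Integer.class, Node.class);

        cache.put(42, new Node());
        Node node = cache.get(42);
        node.getNeighbors().put(42, 7);
        cache.put(42, node); // re-put so the cache sees the growth, as described above

        cacheManager.close();
    }
}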

SOLR performance tuning

I've read the following:
http://wiki.apache.org/solr/SolrPerformanceFactors
http://wiki.apache.org/solr/SolrCaching
http://www.lucidimagination.com/content/scaling-lucene-and-solr
And I have questions about a few things:
1. If I use the JVM option -XX:+UseCompressedStrings, what kind of memory savings can I achieve? To keep the example simple, if I have 1 indexed field (string) and 1 stored field (string) with omitNorms=true and omitTf=true, what kind of savings in the index and document cache can I expect? I'm guessing about 50%, but maybe that's too optimistic.
2. What exactly does the Solr filter cache do? If I'm just doing a simple query with AND and a few ORs, and sorting by score, do I even need it?
3. If I want to cache all documents in the document cache, how would I compute the space required? Using the example from above, if I have 20M documents, use compressed strings, and the average length of the stored field is 25 characters, is the space required basically (25 bytes + small_admin_overhead) * 20M?
4. If all documents are in the document cache, how important is the query cache?
5. If I want to autowarm every document into the doc cache, will an autowarm query of *:* do it?
6. The scaling-lucene-and-solr article says FuzzyQuery is slow. If I'm using the spellcheck feature of Solr, then I'm basically using a fuzzy query, right (because spellcheck does the same edit-distance calculation)? So presumably spellcheck and fuzzy query are both equally "slow"?
7. The section describing the Lucene field cache for strings is a bit confusing. Am I reading it correctly that the space required is basically the size of the indexed string field plus an integer array equal to the number of unique terms in that field?
8. Finally, under maximizing throughput, there is a statement about leaving enough space for the OS disk cache. It says, "All in all, for a large scale index, it's best to be sure you have at least a few gigabytes of RAM beyond what you are giving to the JVM." So if I have a 12GB machine (as an example), should I give at least 2-3GB to the OS? Can I estimate the disk cache space needed by the OS by looking at the on-disk index size?
1. The only way to be sure is to try it out. However, I would expect very little savings in the index, as the index only contains each actual string once; the rest is data for the locations of that string within documents. The strings themselves aren't a large part of the index.
2. The filter cache only caches filter queries. It may not be useful for your precise use case, but many do find them useful, for example for narrowing results by country, language, product type, etc. (see the SolrJ sketch after this list). Solr can avoid recalculating the query results for things like this if you use them frequently.
3. Realistically, you just have to try it and measure it with a profiler. Without in-depth knowledge of EXACTLY the data structure used, anything else is pure SWAG. Your calculation is just as good as anyone else's without profiling.
4. The document cache only saves time in constituting the results AFTER the query has been calculated. If you spend most of your time calculating queries, the document cache will do you little good. The query cache is only useful for re-used queries; if none of your queries are repeated, it is useless.
5. Yes, assuming your document cache is large enough to hold them all.
6-8. Not positive.
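As a concrete illustration of point 2, filters are the fq parts of a request; with a reasonably recent SolrJ that looks roughly like this (the URL and field names are made up):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FilterQueryExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/mycore");

        SolrQuery query = new SolrQuery("title:foo AND (body:bar OR body:baz)");
        // Each fq is cached independently in the filter cache, so re-using the same
        // country/type filters across many queries is what makes the cache pay off.
        query.addFilterQuery("country:US");
        query.addFilterQuery("type:book");

        QueryResponse response = solr.query(query);
        System.out.println(response.getResults().getNumFound());
    }
}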
From my own experience with Solr performance tuning, you should leave Solr to deal with queries, not document storage. The majority of your questions focus on how documents take up space. Solr is a search engine, not a document storage repository. If you want Solr to be FAST and take up minimal memory, then the only thing it should hold onto is index information for searching purposes. The documents themselves should be stored, retrieved, and rendered elsewhere, preferably in a system that is optimized specifically for that job. The only field you should store in your Solr document is an ID for retrieval from the document storage system.
Caches
In general, caching looks like a good idea to improve performance, but this also has a lot of issues:
cached objects are likely to go into the old generation of the garbage collector, which is more costly to collect,
managing insertions and evictions adds some overhead.
Moreover, caching is unlikely to improve your search latency much unless there are patterns in your queries. On the other hand, if 20% of your traffic is due to a few queries, then the query results cache may be interesting. Configuring caches requires you to know your queries and your documents very well. If you don't, you should probably disable caching.
Even if you disable all caches, performance could still be pretty good thanks to the OS I/O cache. Practically, this means that if you read the same portion of a file again and again, it is likely that it will be read from disk only the first time, and then from the I/O cache. And disabling all caches allows you to give less memory to the JVM, so that there will be more memory for the I/O cache. If your system has 12GB of memory and if you give 2GB to the JVM, this means that the I/O cache might be able to cache up to 10G of your index (depending on other applications running which require memory too).
I recommend you read this to get more information on application-level caches vs. the I/O cache:
https://www.varnish-cache.org/trac/wiki/ArchitectNotes
http://antirez.com/post/what-is-wrong-with-2006-programming.html
Field cache
The size of the field cache for a string is (one array of integers of length maxDoc) + (one array for all unique string instances). So if you have an index with one string field which has N instances of size S on average, and if your index has M documents, then the size of the field cache for this field will be approximately M * 4 + N * S.
The field cache is mainly used for facets and sorting. Since even very short strings (less than 10 chars) take more than 40 bytes each, you should expect Solr to require a lot of memory if you sort or facet on a string field that has a high number of unique values.
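As a rough worked example of the M * 4 + N * S estimate above (the number of unique values and the per-string overhead are assumptions, not figures from the question):
public class FieldCacheEstimate {
    public static void main(String[] args) {
        long maxDoc = 20_000_000L;         // M: documents in the index
        long uniqueValues = 1_000_000L;    // N: unique values of the field (assumed)
        long avgStringBytes = 2 * 25 + 40; // S: ~25 chars at 2 bytes each plus ~40 bytes object overhead

        long ordinals = maxDoc * 4;                   // one int per document
        long strings = uniqueValues * avgStringBytes; // one String instance per unique value
        System.out.printf("field cache estimate: ~%d MB%n", (ordinals + strings) / (1024 * 1024));
    }
}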
Fuzzy Query
FuzzyQuery is slow in Lucene 3.x, but much faster in Lucene 4.x.
It depends on the Spellchecker implementation you choose, but I think that the Solr 3.x spell checker uses n-grams to find candidates (this is why it needs a dedicated index) and then only computes distances on this set of candidates, so the performance is still reasonably good.
