I am using a HashMap as a cache to store IDs and names, because this data is used frequently and lives through the lifetime of the application.
For every user of the application, around 5,000 or more (depending on the workspace) IDs and names get stored in the HashMap. At some point a java.lang.OutOfMemoryError gets thrown, since I am saving a lot of (id, name) pairs in the HashMap.
I don't want to clear my cached values, but I know that to be efficient the cache has to be bounded and cleared using an LRU approach or some other eviction policy.
Note: I don't want to use Redis, Memcached, or any in-memory key-value store.
Use case: Slack returns an ID in place of the user name in every message.
For example: Hello #john doe comes back as Hello #dxap123.
I don't want an API hit for every message just to resolve the user name.
Can somebody suggest a more efficient approach, or correct me if I am doing something wrong in my current one?
Like others have said, 5,000 entries shouldn't give you an out-of-memory error, but if you don't keep a limit on the size of the map you will eventually run out of memory. You should cache only the most recently used or most frequently used values to keep the size of the map bounded.
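For example, a plain LinkedHashMap in access order already gives you a simple LRU bound (the 10,000 limit below is just illustrative):

import java.util.LinkedHashMap;
import java.util.Map;

// Bounded LRU map: once MAX_ENTRIES is exceeded, the least recently
// accessed (id, name) pair is dropped automatically.
public class LruNameCache extends LinkedHashMap<String, String> {
    private static final int MAX_ENTRIES = 10_000; // illustrative limit

    public LruNameCache() {
        super(16, 0.75f, true); // accessOrder = true -> LRU ordering on get/put
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > MAX_ENTRIES;
    }
}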
The Google Guava library has cache implementations which I think would fit your use case:
https://github.com/google/guava/wiki/CachesExplained
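A rough sketch of what that could look like for your id-to-name lookup (the size limit, the expiry, and the fetchNameFromSlackApi helper are placeholders, not taken from your code):

import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import java.util.concurrent.TimeUnit;

public class UserNameCache {
    // Bounded cache: at most 10,000 entries, least-recently-used entries evicted first
    private final Cache<String, String> idToName = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .expireAfterAccess(1, TimeUnit.HOURS)
            .build();

    public String resolve(String userId) {
        String name = idToName.getIfPresent(userId);
        if (name == null) {
            name = fetchNameFromSlackApi(userId); // hypothetical API lookup
            idToName.put(userId, name);
        }
        return name;
    }

    private String fetchNameFromSlackApi(String userId) {
        // placeholder for the real Slack API call
        return "john doe";
    }
}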
For 5,000 key-value pairs it should not throw an OutOfMemoryError; if it does, you are not managing the HashMap properly. If you have many more items and want an alternative to a plain HashMap, you can use Ehcache, a widely adopted Java cache with tiered storage options, instead of going with a separate in-memory cache technology.
The memory areas supported by Ehcache include:
On-heap store: uses the Java heap to store cache entries and shares that memory with the application. Entries here are also scanned by the garbage collector. This memory is very fast, but also very limited.
Off-heap store: uses RAM outside the heap to store cache entries. This memory is not subject to garbage collection. It is still quite fast, but slower than the on-heap store, because entries have to be moved onto the heap before they can be used.
Disk store: uses the hard disk to store cache entries. Much slower than RAM; a dedicated SSD that is only used for caching is recommended.
You can find the documentation here: http://www.ehcache.org/documentation/
If you are using Spring Boot, you can follow this article to implement it: https://springframework.guru/using-ehcache-3-in-spring-boot/
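For reference, a minimal programmatic Ehcache 3 setup using all three tiers might look roughly like this (the cache name, the sizes, and the "cache-data" directory are only illustrative values):

import org.ehcache.Cache;
import org.ehcache.CacheManager;
import org.ehcache.config.builders.CacheConfigurationBuilder;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.ResourcePoolsBuilder;
import org.ehcache.config.units.EntryUnit;
import org.ehcache.config.units.MemoryUnit;

public class TieredCacheExample {
    public static void main(String[] args) {
        CacheManager cacheManager = CacheManagerBuilder.newCacheManagerBuilder()
                .with(CacheManagerBuilder.persistence("cache-data")) // root directory of the disk store
                .withCache("idToName",
                        CacheConfigurationBuilder.newCacheConfigurationBuilder(
                                String.class, String.class,
                                ResourcePoolsBuilder.newResourcePoolsBuilder()
                                        .heap(5_000, EntryUnit.ENTRIES)   // on-heap hotset, scanned by GC
                                        .offheap(64, MemoryUnit.MB)       // off-heap RAM, outside GC
                                        .disk(512, MemoryUnit.MB, true))) // persistent disk tier
                .build(true);

        Cache<String, String> idToName = cacheManager.getCache("idToName", String.class, String.class);
        idToName.put("dxap123", "john doe");
        System.out.println(idToName.get("dxap123"));

        cacheManager.close();
    }
}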
If the "names" are not unique, then try with calling on it String.intern() before inserting the "name" to the map, this would then reduce the memory usage.
I need to cache over 100 million string keys (~100 characters each) for a standalone Java application.
Required cache properties:
Persistent.
Key lookups served within tens of milliseconds.
Allows invalidation and expiry.
Runs as an independent caching server that allows multi-threaded access.
Preferably I don't want to use an enterprise database, as these 100M keys can scale to 500M, which would consume a lot of memory and system resources with sluggish throughput.
For a distributed cache you can try Hazelcast.
It can be scaled as needed and has backups and synchronization out of the box. It is also a JSR-107 provider and comes with many other helpful tools. However, if you want persistence, you will need to handle it yourself or buy their enterprise version.
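A minimal embedded sketch, assuming Hazelcast 4+ is on the classpath (the map name and the TTL are just examples):

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.map.IMap;
import java.util.concurrent.TimeUnit;

public class HazelcastKeyCache {
    public static void main(String[] args) {
        // Starts (or joins) an embedded Hazelcast cluster member
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // Distributed map: entries are partitioned and backed up across members
        IMap<String, String> keys = hz.getMap("keyCache");

        // Put with a per-entry time-to-live, covering the expiry requirement
        keys.put("someKey", "someValue", 1, TimeUnit.HOURS);
        System.out.println(keys.get("someKey"));

        hz.shutdown();
    }
}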
Finally, to solve this big-data problem with the existing cache solutions available (Hazelcast, Guava Cache, Ehcache, etc.), I broke the cache into two levels:
I grouped ~100K keys into one Java collection and associated them with a common property; in my case the keys carried a timestamp, so each timestamp slot became the key for a second-level cache block of 100K entries.
Each time-slot key is stored in a persistent Java cache with the compressed collection as its value.
The reason I managed to get good throughput with two-level caching, despite the compression and decompression overhead, is that my key searches were range-bound, so once a cache block matched, most of the subsequent searches were served by the in-memory collection loaded for the previous search.
To conclude: identify a common attribute in the keys, group them by it, and break them into a multi-level cache, roughly as in the sketch below; otherwise you will need hefty hardware and an enterprise cache to handle this kind of big-data problem.
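A rough sketch of the idea (a plain HashMap stands in here for the persistent second-level cache, and the one-minute slot size is only illustrative):

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class TwoLevelCache {
    // Second level: time slot -> compressed, serialized map of ~100K entries.
    // In the real setup this would be a persistent cache (Ehcache disk store, etc.).
    private final Map<Long, byte[]> slotStore = new HashMap<>();

    // First level: the most recently decompressed slot, kept in memory because
    // range-bound lookups tend to hit the same slot repeatedly.
    private Long currentSlot;
    private Map<String, String> currentSlotData = new HashMap<>();

    private static final long SLOT_MILLIS = 60_000;

    public String get(String key, long timestamp) throws IOException, ClassNotFoundException {
        long slot = timestamp / SLOT_MILLIS;
        if (currentSlot == null || currentSlot != slot) {
            byte[] blob = slotStore.get(slot);
            currentSlotData = (blob == null) ? new HashMap<String, String>() : decompress(blob);
            currentSlot = slot;
        }
        return currentSlotData.get(key);
    }

    public void putSlot(long timestamp, Map<String, String> entries) throws IOException {
        slotStore.put(timestamp / SLOT_MILLIS, compress(entries));
    }

    private static byte[] compress(Map<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(new GZIPOutputStream(bytes))) {
            out.writeObject(new HashMap<>(map));
        }
        return bytes.toByteArray();
    }

    @SuppressWarnings("unchecked")
    private static Map<String, String> decompress(byte[] blob) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new GZIPInputStream(new ByteArrayInputStream(blob)))) {
            return (Map<String, String>) in.readObject();
        }
    }
}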
Try Guava Cache. It meets all of your requirements.
Links:
Guava Cache Explained
guava-cache
Persistence: Guava cache
Edit: another one, which I have not used yet: eh-cache
I have a HashMap whose values are Node objects, where Node is a class containing some info and, most importantly, a neighbors HashMap.
My algorithm does random gets from the outer HashMap and puts into the inner HashMaps, e.g.:
Node node = map.get(someInt);
node.getNeighbors().put(someInt, someOtherInt);
I have lots of entries in both the outer map and the inner maps. The full dataset cannot be held in memory alone, so I need some form of disk caching. I would like to use memory as much as possible (until it is almost full, or until it reaches a threshold I specify) and then evict entries to disk.
I have tried several caches (mostly MapDB and Ehcache) to solve my problem, but with no luck. I set a maximum memory size, but the cache just ignores it.
I am almost certain that the problem lies in the fact that my objects have a dynamic size.
Anyone got any ideas on how I could handle this problem?
Thanks in advance.
Note: I work on Ehcache
Ehcache, like most other caching products, cannot know that your object has grown in size unless you let it know by updating the mapping from the outside:
Node node = cache.get(someInt);
node.getNeighbors().put(someInt, someOtherInt); // mutates the cached value in place
cache.put(someInt, node); // put it back so the cache sees the updated, larger object
In that context, Ehcache will properly track the object growth and trigger memory eviction.
Note that Ehcache no longer uses an overflow model between the heap and disk but instead always stores the mapping on disk while keeping a hotset on heap.
I have a question regarding caching techniques in Java web applications.
Suppose I implement Ehcache: where will the cached data be stored?
Will the cached data fall under the area covered by GC? I mean, will the GC delete the Java objects that I cached earlier?
After reading some caching framework sites, I understand that at the core level they use a Hashtable or HashMap, where the value is our data and the key depends on the logic.
Suppose in ehcache
maxBytesLocalHeap="50m"
maxBytesLocalDisk="50G"
1. What I understand here is that the 50 MB (maxBytesLocalHeap) will be stored in heap memory (data in this area is observed by the GC),
2. and the 50 GB (maxBytesLocalDisk) will be stored on the local disk (assuming it is kept as a flat file in the server's temp folder), where the GC does not care about the entries or objects, as they are outside the heap.
Is my understanding correct?
Thanks
Vijay
The GC will delete your objects only if no other objects hold references to them. The GC doesn't know which of your data is cached; it just looks for unreachable objects.
Yes, a HashMap is commonly used to store cached data and retrieve it when needed.
I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or an H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as and when you know it's an issue.
10,000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection so that when you substitute it, the required refactoring stays localised (and perhaps make it configurable, so that you can easily perform before/after performance tests with your different solutions: H2, Derby, Oracle, etc.).
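For example, something along these lines (SessionStore, UserSession, and the method names are made up for illustration):

import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Start with an in-memory implementation and swap in an H2/Derby/Oracle-backed
// one later without touching the calling code.
interface SessionStore {
    void save(String sessionId, UserSession session);
    Optional<UserSession> find(String sessionId);
    void remove(String sessionId);
}

class InMemorySessionStore implements SessionStore {
    private final Map<String, UserSession> sessions = new ConcurrentHashMap<>();

    @Override public void save(String sessionId, UserSession session) { sessions.put(sessionId, session); }
    @Override public Optional<UserSession> find(String sessionId) { return Optional.ofNullable(sessions.get(sessionId)); }
    @Override public void remove(String sessionId) { sessions.remove(sessionId); }
}

// Minimal placeholder for whatever the application actually stores per user.
class UserSession {
    final String userId;
    UserSession(String userId) { this.userId = userId; }
}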
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity in a case where you can start with a few lines of code, so don't use a database; simply store them in a light in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory after the user has been gone for a long time. Many solutions are available; the easiest is simply to have a background thread that removes the sessions that are too old from time to time. Note that it's usually easier to clean a HashMap than a database.
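A minimal sketch of that background-cleanup idea (the 30-minute idle limit and the one-minute sweep interval are just examples):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ExpiringSessionMap {
    private static final long MAX_IDLE_MILLIS = 30 * 60 * 1000;

    private static class Entry {
        final Object session;
        volatile long lastAccess = System.currentTimeMillis();
        Entry(Object session) { this.session = session; }
    }

    private final Map<String, Entry> sessions = new ConcurrentHashMap<>();
    private final ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();

    public ExpiringSessionMap() {
        // Periodically drop sessions that have not been touched for MAX_IDLE_MILLIS
        cleaner.scheduleAtFixedRate(() -> {
            long cutoff = System.currentTimeMillis() - MAX_IDLE_MILLIS;
            sessions.values().removeIf(e -> e.lastAccess < cutoff);
        }, 1, 1, TimeUnit.MINUTES);
    }

    public void put(String id, Object session) { sessions.put(id, new Entry(session)); }

    public Object get(String id) {
        Entry e = sessions.get(id);
        if (e == null) return null;
        e.lastAccess = System.currentTimeMillis();
        return e.session;
    }
}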
Both H2 (in-memory mode) and a HashMap will keep the data in memory, so from a space point of view they are almost the same.
If the lookups are simple key-to-value gets, then looking up in the HashMap will be quicker.
If you have to do comparisons like KEY < 100, use H2.
In fact, 10K users' worth of info is not that high a number.
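For illustration, a range query against an embedded in-memory H2 database could look like this (the table and column names are made up, and the H2 driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class H2RangeLookup {
    public static void main(String[] args) throws Exception {
        // Embedded, in-memory H2 database; nothing is written to disk
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:sessions")) {
            try (PreparedStatement create = conn.prepareStatement(
                    "CREATE TABLE messages (id INT PRIMARY KEY, body VARCHAR(255))")) {
                create.execute();
            }
            try (PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO messages VALUES (?, ?)")) {
                for (int i = 1; i <= 200; i++) {
                    insert.setInt(1, i);
                    insert.setString(2, "message " + i);
                    insert.addBatch();
                }
                insert.executeBatch();
            }
            // The kind of range query a HashMap cannot answer directly
            try (PreparedStatement query = conn.prepareStatement(
                         "SELECT COUNT(*) FROM messages WHERE id < 100");
                 ResultSet rs = query.executeQuery()) {
                rs.next();
                System.out.println("rows with id < 100: " + rs.getInt(1)); // 99
            }
        }
    }
}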
If you don't need to save the user messages, use the collections. But if the messages should be saved, be sure to use a database, because after a restart you would otherwise lose all the data.
The problem with using a HashMap for storing these objects is that you will run into issues when your site becomes too big for one server and needs to be clustered in order to scale with demand. Then you face the problem of how to synchronise the HashMap instances across the different servers.
A possible alternative would be to use a key-value store like Redis, since you won't need the structure of a database, or to use the distributed cache abilities of something like Ehcache.
We are running into unusually high memory usage, and I have observed that in many places in our code we pull hundreds of records from the DB, pack them into custom data objects, add them to an ArrayList, and store that in the session. I would like to know the recommended upper limit for storing data in the session, as a good-practice/bad-practice kind of thing.
I am using JRockit 1.5 and 1.6 GB of RAM. I profiled with JProbe and found that some parts of the app have a very heavy memory footprint. Most of this data is put into the session to be used later.
That depends entirely on how many sessions are typically present (which in turn depends on how many users you have, how long they stay on the site, and the session timeout) and how much RAM your server has.
But first of all: have you actually used a memory profiler to tell you that your "high memory usage" is caused by session data, or are you just guessing?
If the only problem you have is "high memory usage" on a production machine (i.e. it can handle the production load but is not performing as well as you'd like), the easiest solution is to get more RAM for the server - much quicker and cheaper than redesigning the app.
But caching entire result sets in the session is bad for a different reason as well: what if the data changes in the DB and the user expects to see that change? If you're going to cache, use one of the existing systems that do this at the DB request level - they'll allow you to cache results between users and they have facilities for cache invalidation.
If you're storing data in the session to improve performance, consider using true caching instead, since a cache is application-wide whereas the session is per-user, which results in unnecessary duplication of otherwise identical objects.
If, however, you're storing them for the user to edit these objects (which I doubt, since hundreds of objects is way too many), try minimizing the amount of data stored or look into optimistic concurrency control.
I'd say this heavily depends on the number of active sessions you expect. If you're writing an intranet application with fewer than 20 users, it's certainly no problem to put a few MB in the session. However, if you're expecting 5,000 live sessions, for instance, each MB of data stored per session accounts for 5 GB of RAM.
However, I'd generally recommend not storing any data from the DB in the session. Just fetch it from the DB for every request. If performance is an issue, use an application-wide cache (e.g. Hibernate's second-level cache).
What kind of data is it? Is it really needed per session or could it be cached at application level? Do you really need all the columns or only a subset? How often is it being accessed? What pages does it need to be available on? And so on.
It may make much more sense to retrieve the records from the DB only when you really need them. Storing hundreds of records in the session is never a good strategy.
I'd say try to store the minimum amount of data that is enough to recreate the necessary environment in a subsequent request. If you're storing things in memory to avoid a database round-trip, then a true caching solution such as Memcached might be helpful.
If you're storing these sessions in memory instead of a database, then the round-trip is saved, and requests will be served faster as long as the memory load is low and there is no paging. Once the number of clients goes up and paging begins, most clients will see a huge degradation in response times; these two variables are inversely related.
It's better to measure the latency to your database server, which is usually low enough in most cases for the database to be considered a viable means of storage instead of keeping everything in memory.
Try to split the data you are currently storing in the session into user-specific and static data, then implement caching for all the static parts. This will give you a lot of reuse application-wide and still allow you to cache the specific data a user is working on.
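For example, the static part could live in one shared, application-scoped structure like this (the countries example and loadCountriesFromDb are made-up placeholders):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Static reference data lives in one application-wide cache,
// while each session keeps only the user-specific part.
public class ReferenceDataCache {
    private final Map<String, List<String>> staticData = new ConcurrentHashMap<>();

    public List<String> countries() {
        // computeIfAbsent loads the list once and shares it across all sessions
        return staticData.computeIfAbsent("countries", key -> loadCountriesFromDb());
    }

    private List<String> loadCountriesFromDb() {
        // placeholder for the real DB query
        return List.of("DE", "FR", "US");
    }
}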
You could also create a per-user mini SQLite database and connect to it, store the data the user is accessing in it, and retrieve records from it while the user is requesting them; after the user disconnects, just delete that SQLite database.