I have a program which, at some point persist a blob in a blob store.
For a given blob, a hash is used as an identifier derived always from the same algorithm, so when two blobs are identical, their BlobID are identical.
Persisting a blob is slow.
To make things worse, the program is used to put the same blob several times in a short time-frame (few seconds).
My initial idea was to use a concurrent set to keep track of what have been put in the blob store.
Sadly it will create a memory leak.
I'm looking for a kind of concurrent LRU (Least Recently Used) set in Java.
Is there such data structure? or can I build one with existing libraries? (such as Guava)
Related
I have a web application in which I'm maintaining many static Maps to store my relevant information. Since the application is deployed on a server. Each and every hit to the server side java uses these maps to match the key and get appropriate result and send back to the client side. My code contains a rank and retrieval feature so I have to read the entire keySet of each of these Maps.
My question is:
1. Is working with static variables better than storing this data in a local embedded DB like Apache Derby and then using it?
2. The use of this data is very frequent. So if I use database will that be faster approach? Since I read the full keyset the where clause may not come handy in many operations.
3. How does the server's memory gets impacted on holding data in static variables?
My no. of maps are fixed but the size of the Maps keeps increasing? Please suggest the better solution.
If you want the data to be saved regularly an embedded database like H2 makes sense. You then also have snapshots of the data, and development, structural changes are a bit more safe.
A real database also has an incredible power behind it: concurrency, caching and so on. An embedded (when file based) database less so.
The problem with maps is that the data extraction can become several indirections. It is more versatile to have SQL queries with joins on the tables.
So SQL is more abstract (does not prescribe the actual query implementation), and easier to test. SQL for instance releases the developer of programming reports.
So go for a database IMHO, when you are really doing hard work.
What you might want to consider is to store the data searched in map when it's searched.
For instance, if a user searches for something specific, that something is stored in the map so that the next user who searches for that gets the data directly from the map rather than the database.
There are some downsides though, as you need to make sure that if the data is changed on the database, the hashmap/cache should be cleared or updated with the new data, as to prevent feeding outdated data to the user.
As for the impact on the server's memory, it depends on the size of the data you're storing. It's hard to give you a precise answer, but you can however test that on your own:
long memoryBefore = Runtime.getRuntime().freeMemory();
// populate your map
long memoryAfter = Runtime.getRuntime().freeMemory();
System.out.println(memoryBefore - memoryAfter);
That should give you the amount of bytes used (more or less, depending on the operations you run between memoryBefore and memoryAfter, as you may have instantiated other classes/variables unrelated to the hashmap)
If we are using MongoDB (NoSQL) or MySQL (Relational DB) to retrieve objects and we want to search for a specific element (where clause) but we already have an in-memory list (LinkedList, ArrayList or whatever) containing some of the Objects, either for caching or for any other reason.
Is there an equation / library that can "advice" when is cheaper to use the in-memory structure for retrieval instead of querying the Database? (taking into consideration,for example, the size of the in-memory list?)
It's always cheaper to query the in-memory cache.
The only exception would be if you had constructed the cache so that it was very inefficient to search it (e.g., linear search). But as long as it's a hash-based structure, it'll be doing at worst what the database needs to do, but without the network overhead. Looking something up in a hash table is essentially free.
The bigger question is whether your cache uses so much memory it starves the rest of the application. You'll want a weak hash map or similar, to avoid this. If this is a cache produced by some sort of ORM, it'll be weakly referenced already.
I currently have a postgres database with a simple schema of a fixed length key (20 bytes) and a fixed length value (40 bytes). Its a massive table with billions of rows, but unfortunately we have lots of duplicated data. We'd like to separate this table into its own data store.
Ideally, I'm looking for ways to store this data on a large hard drive where it can be queried on occasion. Performance is not critical for reads, disk access is fast enough - no need to store anything in memory. And there is rarely new data added after the initial load.
If there is no product available I would be willing to roll my own with suggestions. I originally thought of using the key as a folder path based on the byte /0/32/231/32/value but obviously that results in too many files/folders on a single disk. Is there an optimization that can be used since both keys and values are fixed length?
Any suggestions?
Try some pure Java database engines like MapDb or LevelDB.
I am developing a web application in which I need to store session, user messages etc. I am thinking of using HashMap or H2 database.
Please let me know which is better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection such that when you substitute it, the refactoring required is localised (and perhaps make it configurable, such that you can easily perform before/after performance tests with your different solutions -H2, Derby, Oracle, etc. etc.)
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity in a case when you can start with a few lines of code. So don't use a database, simply store them in a ligth memory structure (HashMap for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory when the user left from a long time. Many solutions are available (the easiest is simply to have a background thread removing from time to time the too old sessions). Note that it's usually easier to clean a hashmap than a database.
Both H2 and Hash Map are gonna keep the data in memory (So from space point of view they are almost the same).
If look ups are simple like KEY VALUE then looking up in the Hash Map will be quicker.
If you have to do comparisons like KEY < 100 etc use H2.
In fact 10K user info is not that high a number.
If you don't need to save user messages - use the collections. But if the message is should be saved, be sure to use a database. Because after restart you lost all data.
The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis as you won't need the structure of a database or even use the distributed cache abilities of something like EHCache
I'm interesting in using DB4O to store the training data for a learning algorithm. This will consist of (potentially) hundreds of millions of objects. Each object is on average 2k in size based on my benchmarking.
The training algorithm needs to iterate over the entire set of objects repeatedly (perhaps 10 times). It doesn't care what order the objects are in.
My question is this: When I retrieve a very large set of objects from DB4O, are they all loaded into memory, or are they pulled off disk as needed?
Clearly, pulling hundreds of millions of 2k objects into memory won't be practical on the type of servers I'm working with (they hvae about 19GB of RAM).
Is Db4o a wise choice here?
db4o activation mechanism allows you to control which object are loaded into memory. For complex object graphs you probably should us transparent activation, where db4o loads an object into memory as soon as it is used.
However db4o doesn't explicit remove object from memory. It just keeps a weak reference to all loaded objects. If a object is reachable, it will stay there (just like any other object). Optionally you can explicitly deactivate an object.
I just want to add a few notes to the scalability of db4o. db4o was built for embedding in application and devices. It was never built for large datasets. Therefore it has its limitations.
It is internally single-threaded. Most db4o operation block all other db4o operations.
It can only deal with relatively small databases. By default a database can only be 2GB. You can increase it up to 127 GB. However I think db4o operates well in the 2-16 GB range. Afterwards the database is probably to large for it. Anyway, hundreds of millions of 2K objects is way to large database. (100Mio 2K obj => 200GB)
Therefore you probably should look at larger object databases, like VOD. Or maybe a graph database like Neo4J is also a good choise for your problem?