I'm looking for an efficient way to store many key->value pairs
on disk for persistence, preferably with some caching.
The features needed are to either add to the value (concatenate)
for a given key or to let the model be key -> list of values,
both options are fine. The value-part is typically a binary document.
I will not have much use for clustering, redundancy, etc. in this scenario.
Language-wise we're using Java, and we are experienced in classic databases (Oracle, MySQL and more).
I see a couple of obvious scenarios and would like advice on what
is fastest in terms of stores (and retrievals) per second:
1) Store the data in classic db-tables by standard inserts.
2) Do it yourself using a file system tree to spread to many files,
one or several per key.
3) Use some well known tuple-storage. Some obvious candidates are:
3a) Berkeley DB Java Edition
3b) Modern NoSQL solutions like Cassandra and similar
Personally I like Berkeley DB JE for my task.
To summarize my questions:
Does Berkeley DB seem like a sensible choice given the above?
What kind of speed can I expect for some operations, like
updates (insert, addition of new value for a key) and
retrievals given key?
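For comparison, option 2 (the do-it-yourself file system approach) can be sketched with the standard library alone. This is a minimal sketch, not a tuned implementation: the class name AppendStore, the 256-bucket directory layout, and the length-prefixed record format are all assumptions made for illustration, and it assumes keys are short enough to be used as hex-encoded file names.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical key -> list-of-values store: one append-only file per key. */
class AppendStore {
    private final Path root;

    AppendStore(Path root) {
        this.root = root;
    }

    // Hex-encode the key for a file-system-safe name and spread the files
    // over up to 256 subdirectories so no single directory grows too large.
    private Path fileFor(String key) {
        StringBuilder name = new StringBuilder();
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            name.append(String.format("%02x", b & 0xff));
        }
        String bucket = String.format("%02x", key.hashCode() & 0xff);
        return root.resolve(bucket).resolve(name + ".dat");
    }

    // Append one value to the key's file as a length-prefixed record.
    void append(String key, byte[] value) throws IOException {
        Path file = fileFor(key);
        Files.createDirectories(file.getParent());
        try (DataOutputStream out = new DataOutputStream(Files.newOutputStream(
                file, StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
            out.writeInt(value.length);
            out.write(value);
        }
    }

    // Read back every value stored for the key, in insertion order.
    List<byte[]> get(String key) throws IOException {
        List<byte[]> values = new ArrayList<>();
        Path file = fileFor(key);
        if (!Files.exists(file)) {
            return values;
        }
        try (DataInputStream in = new DataInputStream(Files.newInputStream(file))) {
            while (true) {
                int length;
                try {
                    length = in.readInt();
                } catch (EOFException endOfFile) {
                    break; // no more records
                }
                byte[] value = new byte[length];
                in.readFully(value);
                values.add(value);
            }
        }
        return values;
    }
}
```

Opening and closing a file on every append keeps the sketch simple but is unlikely to be the fastest variant; it has no caching and no crash safety, which is much of what an embedded store would add.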
You could also try Chronicle Map or JetBrains Xodus, both of which are embeddable Java key-value stores that are much faster than Berkeley DB JE (if you are really looking for speed). Chronicle Map provides an easy-to-use java.util.Map interface.
BerkeleyDB sounds sensible. Cassandra would also be sensible but perhaps is overkill if you don't need redundancy, clustering etc.
That said, a single Cassandra node can handle 20k writes per second (provided that you use multiple clients to exploit the high concurrency within Cassandra) on relatively modest hardware.
FWIW, I'm using Ehcache with completely satisfactory performance; I've never tried Berkeley DB.
Berkeley DB JE should work just fine for the use case that you describe. Performance will vary, largely depending on how many I/Os are required per operation (and the corollary -- how big the available cache is) and on the durability constraints that you define for your write transactions (i.e., does a transaction commit have to write all the way to the disk or not?).
Generally speaking, we typically see 50-100K reads per second and 5-12K writes per second on commodity hardware with BDB JE. Obviously, YMMV.
Performance tuning and throughput questions about BDB JE are best asked on the Berkeley DB JE forum, where there is an active community of BDB JE application developers on hand to help out. There are several useful performance tuning recommendations in the BDB JE FAQ which may also come in handy.
Best of luck with your implementation. Please let us know if we can help.
Regards,
Dave -- Product Manager for Berkeley DB
I would like to ask the experts what they recommend for fetching 3000-5000 records from an Oracle 11g database in a Java application (using JDBC). Our standard is to always invoke a stored procedure.
I did some research and found that a ref cursor makes multiple round trips to the database, based on the JDBC fetch size property. (Can somebody shed more light on the end-to-end flow of how data is stored in memory in Oracle and in the JVM when processing ref cursors?)
I was thinking collections are more efficient because the data is sent in one shot from the Oracle DB to the caller (Java) using bulk collect. With this approach we can avoid multiple network calls from Java to the Oracle servers. Is this a correct assumption?
Appreciate your help!
This is a much bigger topic than anyone is willing to commit to in a posting. Here's a link that discusses how Oracle manages read consistency. That entire page is probably a good read to get some idea of what's going on in the server. There's also an article here that discusses what happens using collections. How would you return the collection to a JDBC client (not something I've ever tried)?
Essentially, there's a lot involved in performance, from how your database is configured to how your network is tuned to disk performance, to client performance, etc.
The short answer is you need to try things. Retrieving 3-5k records isn't a lot, and it also depends on how big each record is that you're bringing back across the network. If they are 20-byte records and your network (MTU?) size is 4k blocks, you can fit about 200 records in a block. At some point, you run into the law of diminishing returns.
I use stored procedures as a matter of habit, but you don't need to. It would depend on the complexity of the query (number of tables and the type of joins) and the ability for someone like a DBA to be able to go in and see what the query is doing.
Worrying about network trips is a little less critical, because there's only so much data you can stuff in a packet. There are going to be a number of network trips no matter what you use; it really depends on your use case how critical it is to get that down to a bare minimum.
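The arithmetic behind these estimates can be written down as a rough sketch. The class and method names here are hypothetical, and the model is deliberately simplified: it assumes one round trip per batch of rows and ignores protocol overhead. The fetch size of 10 in the first call is the commonly cited Oracle JDBC default; the batch size can be changed with the standard JDBC call Statement.setFetchSize.

```java
/** Hypothetical back-of-the-envelope helpers; not a measurement of any real driver. */
class FetchEstimate {
    // Approximate number of fetch round trips: one per batch of fetchSize rows.
    static int roundTrips(int rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize; // ceiling division
    }

    // How many fixed-size records fit into one block of the given size.
    static int recordsPerBlock(int blockBytes, int recordBytes) {
        return blockBytes / recordBytes;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(5000, 10));      // 500 round trips
        System.out.println(roundTrips(5000, 500));     // 10 round trips
        System.out.println(recordsPerBlock(4096, 20)); // 204 records per 4k block
    }
}
```

Under this model, raising the fetch size from 10 to 500 cuts the round trips for 5000 rows from 500 to 10, which is why the fetch size is usually the first thing to try before restructuring the procedure around collections.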
I am working on a project that involves parsing through a LARGE amount of data rapidly. Currently this data is on disk and broken down into a directory hierarchy:
(Folder: DataSource) -> (Files: Day1, Day2, Day3...Day1000...)
(Folder: DataSource2) -> (Files: Day1, Day2, Day3...Day1000...)
...
(Folder: DataSource1000) -> ...
...
Each Day file consists of entries that need to be accessed very quickly.
My initial plan was to use traditional file I/O in Java to access these files, but upon further reading, I began to fear that this might be too slow.
In short, what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
The issue could be solved both ways, but it depends on a few factors.
Go for FileIO:
if the volume is less than millions of rows
if you don't do complicated queries, like Jon Skeet said
if your reference for fetching a row is the folder name "DataSource" used as the key
Go for a DB:
if you see your program reading through millions of records
if you need complicated selections, even of multiple rows, using a single SELECT
if you know how to create a basic table structure for a DB
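For the FileIO route, a minimal sketch of selective loading could look like the following. It assumes each Day file is a text file with one entry per line, which the question does not state; the class name DayFileReader and its load method are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical reader for a root/<DataSource>/<Day> directory layout. */
class DayFileReader {
    private final Path root;

    DayFileReader(Path root) {
        this.root = root;
    }

    // The folder and file names act as the lookup key, so no index is needed
    // to find the right file; only the entries inside it are filtered.
    List<String> load(String dataSource, String day, Predicate<String> wanted)
            throws IOException {
        Path file = root.resolve(dataSource).resolve(day);
        // Files.lines reads lazily, so entries that do not match are not retained.
        try (Stream<String> lines = Files.lines(file)) {
            return lines.filter(wanted).collect(Collectors.toList());
        }
    }
}
```

This still scans the whole Day file for every request. If the same files are filtered repeatedly with different conditions, that repeated scan is the point at which a database index starts to pay off.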
Depending on the architecture you are using, you can implement caching in different ways. JBoss has the built-in JBoss Cache, and there is also third-party open-source software that provides caching, such as Redis or EhCache, depending on your needs. Basically, a cache stores objects in memory; some are passivated/activated on demand, and when memory is exhausted they are written to a physical file, from which the caching mechanism can easily activate and unmarshal them again. This reduces the number of database connections held by your program. There are other caches, but here are some that I've worked with:
JBoss: http://www.jboss.org/jbosscache/
Redis: http://redis.io/
EhCache: http://ehcache.org/
what is the fastest way I can selectively load entries from my filesystem from varying DataSources and Days?
"Selectively" means filtering, so my answer is a localhost database. Generally speaking, if you filter, sort, paginate or extract distinct records from a large number of records, it's hard to beat a localhost SQL server. You get a query optimizer (nobody does that in Java), a cache (which requires effort in Java, especially the invalidation), database indexes (I have not seen that being done in Java either), etc. It's possible to implement these things manually, but then you are writing a database in Java.
On top of this you gain access to higher-level SQL functions like window aggregates, so in most cases there is no need to post-process data in Java.
I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or an H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection so that when you substitute it, the refactoring required is localised (and perhaps make it configurable, so that you can easily run before/after performance tests with your different solutions: H2, Derby, Oracle, etc.).
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity in a case where you can start with a few lines of code. So don't use a database; simply store them in a lightweight in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory long after the user has left. Many solutions are available (the easiest is simply to have a background thread that periodically removes sessions that are too old). Note that it's usually easier to clean a HashMap than a database.
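The clean-up idea can be sketched as follows. This is one possible shape, not a definitive implementation: the class name SessionStore and its methods are hypothetical, and the current time is passed in explicitly so that the expiry logic can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory session store with idle-time expiry. */
class SessionStore<V> {
    private static final class Entry<V> {
        final V value;
        volatile long lastAccessMillis;

        Entry(V value, long nowMillis) {
            this.value = value;
            this.lastAccessMillis = nowMillis;
        }
    }

    // ConcurrentHashMap so request threads and the clean-up thread can share it.
    private final Map<String, Entry<V>> sessions = new ConcurrentHashMap<>();
    private final long maxIdleMillis;

    SessionStore(long maxIdleMillis) {
        this.maxIdleMillis = maxIdleMillis;
    }

    void put(String sessionId, V value, long nowMillis) {
        sessions.put(sessionId, new Entry<>(value, nowMillis));
    }

    // Returns the session value, or null if unknown; refreshes the idle timer.
    V get(String sessionId, long nowMillis) {
        Entry<V> entry = sessions.get(sessionId);
        if (entry == null) {
            return null;
        }
        entry.lastAccessMillis = nowMillis;
        return entry.value;
    }

    // Drops every session that has been idle for longer than maxIdleMillis.
    void purgeExpired(long nowMillis) {
        sessions.values().removeIf(e -> nowMillis - e.lastAccessMillis > maxIdleMillis);
    }

    int size() {
        return sessions.size();
    }
}
```

The background thread mentioned above could be a ScheduledExecutorService that calls purgeExpired(System.currentTimeMillis()) at a fixed rate, for example once a minute.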
Both H2 (in its in-memory mode) and a HashMap are gonna keep the data in memory, so from a space point of view they are almost the same.
If lookups are simple, like KEY -> VALUE, then looking up in the HashMap will be quicker.
If you have to do comparisons like KEY < 100 etc., use H2.
In fact, info for 10K users is not that high a number.
If you don't need to save user messages, use the collections. But if the messages should be saved, be sure to use a database, because otherwise you lose all the data after a restart.
The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis, as you won't need the structure of a database or even the distributed cache abilities of something like EHCache.
I need to store about 100 thousand objects representing users. Those users have a username, age, gender, city and country.
The users should be searchable by a range of ages and by any of the other attributes, but also by a combination of attributes (e.g. women between 30 and 35 from Brussels). The results should be found quickly, as this is one of the server's services for many connected clients. Users may only be deleted or added, not updated.
I've thought of a fast database with indexed attributes (like H2, which seems to be pretty fast, and I've seen it has an in-memory mode).
I was wondering if any other option was possible before going for the DB.
Thank you for any ideas !
How much memory does your server have? How much memory would these objects take up? Is it feasible to keep them all in memory, or not? Do you really need the speedup of keeping in memory, vs shoving in a database? It does make it more complex to keep in memory, and it does increase hardware requirements... are you sure you need it?
Because all of what you describe could be run on a very simple server and put in a very simple database and give you the results you want in the order of 100ms per request. Do you need faster than 100ms response time? Why?
I would use a RDBMS - there are plenty of good ORMs available, such as Hibernate, which allow you to transparently stuff the POJOs into a db. Once you've got the data access abstracted, you then have the freedom to decide how best to persist the data.
For this size of project, I would use the H2 database. It has both embedded and client/server modes, and can operate from disk or entirely in memory.
Most definitely a relational database. With that size you'll want a client-server system, not something embedded like SQLite. Pick one system depending on further requirements. Indexing is a basic feature that most systems support. Personally I'd try something that's popular and free, such as MySQL or PostgreSQL, so you can more easily google your way out of problems. If you make your SQL queries generic enough (no vendor-specific constructs), you can switch systems without much pain. I agree with bwawok: try whether a standard setup is good enough and think about optimizations later.
Have you considered using a cache system like EHCache or Memcached?
Also, if you have enough memory, you can use a sorted collection such as TreeMap as an index map, or a HashMap to look up users by name (a separate Map per field). It will take more memory but can be effective. Based on your users' query patterns, you can also identify the most frequently used query with the best selectivity and create a comparator based on that query only. In that case the resulting subset of elements will not be large and can be filtered quickly without any additional optimization.
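A minimal sketch of the TreeMap idea, assuming age is the most selective attribute (which depends on the real data): the age range is narrowed through the sorted index first, and the remaining attributes are then filtered in plain Java. The class names UserIndex and User and the method find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Hypothetical in-memory index: age -> users of that age. */
class UserIndex {
    static final class User {
        final String username;
        final int age;
        final String gender;
        final String city;
        final String country;

        User(String username, int age, String gender, String city, String country) {
            this.username = username;
            this.age = age;
            this.gender = gender;
            this.city = city;
            this.country = country;
        }
    }

    // TreeMap keeps the ages sorted, so a range of ages is a cheap subMap view.
    private final TreeMap<Integer, List<User>> byAge = new TreeMap<>();

    void add(User user) {
        byAge.computeIfAbsent(user.age, age -> new ArrayList<>()).add(user);
    }

    void remove(User user) {
        List<User> sameAge = byAge.get(user.age);
        if (sameAge != null) {
            sameAge.remove(user);
            if (sameAge.isEmpty()) {
                byAge.remove(user.age);
            }
        }
    }

    // A null gender or city means "any"; both age bounds are inclusive.
    List<User> find(int minAge, int maxAge, String gender, String city) {
        List<User> result = new ArrayList<>();
        for (List<User> sameAge : byAge.subMap(minAge, true, maxAge, true).values()) {
            for (User user : sameAge) {
                boolean genderOk = gender == null || gender.equals(user.gender);
                boolean cityOk = city == null || city.equals(user.city);
                if (genderOk && cityOk) {
                    result.add(user);
                }
            }
        }
        return result;
    }
}
```

The sketch is not thread-safe; with many connected clients it would need a lock around add, remove and find, or a concurrent structure such as ConcurrentSkipListMap.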
I need a disk backed Map structure to use in a Java app. It must have the following criteria:
Capable of storing millions of records (even billions)
Fast lookup - the majority of operations on the Map will simply be to see if a key already exists. This, and 1 above, are the most important criteria. There should be an effective in-memory caching mechanism for frequently used keys.
Persistent, but does not need to be transactional, can live with some failure. i.e. happy to synch with disk periodically, and does not need to be transactional.
Capable of storing simple primitive types - but I don't need to store serialised objects.
It does not need to be distributed, i.e. will run all on one machine.
Simple to set up & free to use.
No relational queries required
Record keys will be strings or longs. As described above, reads will be much more frequent than writes, and the majority of reads will simply be to check if a key exists (i.e. will not need to read the key's associated data). Each record will be updated once only, and records are not deleted.
I currently use BDB JE but am seeking other options.
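Whichever disk-backed store ends up underneath, the in-memory cache for frequently used keys can be sketched with the standard library. This is a minimal LRU cache built on LinkedHashMap in access order; the class name LruCache is hypothetical, and wiring it in front of a real store is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache: evicts the least recently used entry when full. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    // Called by LinkedHashMap after each insertion; returning true drops the eldest entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}
```

Two caveats: containsKey does not count as an access in LinkedHashMap, so existence checks that should keep a key "hot" need to go through get; and the class is not thread-safe, so it would have to be wrapped with Collections.synchronizedMap for concurrent use.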
Update
Have since improved query performance on my existing BDB setup by reducing the dependency on secondary keys. Some queries required a join on two secondary keys and by combining them into a composite key I removed a level of indirection in the lookup which speeds things up nicely.
JDBM3 does exactly what you are looking for. It is a library of disk backed maps with really simple API and high performance.
UPDATE
This project has now evolved into MapDB http://www.mapdb.org
You may want to look into OrientDB.
You can try Chronicle Map from http://openhft.net/products/chronicle-map/
Chronicle Map is a high-performance, off-heap, in-memory, persisted key-value data store. It works like a standard Java Map.
I'd likely use a local database. Like say Bdb JE or HSQLDB. May I ask what is wrong with this approach? You must have some reason to be looking for alternatives.
In response to comments:
As the problem is performance, and I guess you are already using JDBC to handle this, it might be worth trying HSQLDB and reading the chapter on Memory and Disk Use.
As of today I would use either MapDB (file based/backed, sync or async) or Hazelcast. With the latter you will have to implement your own persistence, e.g. backed by an RDBMS, by implementing a Java interface. OpenHFT Chronicle might be another option. I am not sure how persistence works there since I have never used it, but they claim to have it. OpenHFT is completely off-heap and allows partial updates of objects (of primitives) without (de-)serialization, which might be a performance benefit.
NOTE: If you need your map to be disk based because of memory issues, the easiest option is MapDB. Hazelcast could be used as a cache (distributed or not), which allows you to evict elements from the heap based on time or size. OpenHFT is off-heap and could be considered if you only need persistence across JVM restarts.
I've found Tokyo Cabinet to be a simple persistent Hash/Map, and fast to set-up and use.
This abbreviated example, taken from the docs, shows how simple it is to save and retrieve data from a persistent map:
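import tokyocabinet.HDB;

// create the object
HDB hdb = new HDB();
// open the database for writing, creating the file if needed
hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT);
// add an item
hdb.put("foo", "hop");
// retrieve the item
String value = hdb.get("foo");
// close the database
hdb.close();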
SQLite does this. I wrote a wrapper for using it from Java: http://zentus.com/sqlitejdbc
As I mentioned in a comment, I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast.
The only pain is the JDBC interface. Compared to a simple HashMap, it is clunky. I often end up writing a JDBC-wrapper for the specific project, which can add up to a lot of boilerplate code.
JBoss (tree) Cache is a great option. You can use it standalone, without JBoss. Very robust, performant, and flexible.
I think Hibernate Shards may easily fulfill all your requirements.