I need a disk backed Map structure to use in a Java app. It must meet the following criteria:
Capable of storing millions of records (even billions)
Fast lookup - the majority of operations on the Map will simply be to check whether a key already exists. This and the point above are the most important criteria. There should be an effective in-memory caching mechanism for frequently used keys.
Persistent, but does not need to be transactional; it can live with some failure, i.e. I am happy to sync with disk periodically.
Capable of storing simple primitive types - but I don't need to store serialised objects.
It does not need to be distributed, i.e. will run all on one machine.
Simple to set up & free to use.
No relational queries required
Record keys will be strings or longs. As described above, reads will be much more frequent than writes, and the majority of reads will simply check whether a key exists (i.e. they will not need to read the key's associated data). Each record will be updated once only and records are not deleted.
I currently use Bdb JE but am seeking other options.
Update
Have since improved query performance on my existing BDB setup by reducing the dependency on secondary keys. Some queries required a join on two secondary keys and by combining them into a composite key I removed a level of indirection in the lookup which speeds things up nicely.
JDBM3 does exactly what you are looking for. It is a library of disk-backed maps with a really simple API and high performance.
UPDATE
This project has now evolved into MapDB http://www.mapdb.org
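For illustration, here is a minimal sketch of a disk-backed map using the MapDB 3.x API (the file name, map name and serializer choices below are assumptions made for the example, not taken from the question):

import org.mapdb.DB;
import org.mapdb.DBMaker;
import org.mapdb.Serializer;
import java.util.concurrent.ConcurrentMap;

public class MapDbSketch {
    public static void main(String[] args) {
        // Open (or create) a file-backed database; closing it flushes data to disk.
        DB db = DBMaker.fileDB("records.db")
                .fileMmapEnableIfSupported()   // use memory-mapped files where available
                .make();

        // A disk-backed map with long keys and string values.
        ConcurrentMap<Long, String> map = db
                .hashMap("records", Serializer.LONG, Serializer.STRING)
                .createOrOpen();

        map.putIfAbsent(42L, "payload");
        boolean exists = map.containsKey(42L);  // the common "does this key exist?" check
        System.out.println("exists = " + exists);

        db.close();
    }
}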
You may want to look into OrientDB.
You can try Chronicle Map from OpenHFT: http://openhft.net/products/chronicle-map/
Chronicle Map is a high-performance, off-heap, key-value, in-memory, persisted data store. It works like a standard Java Map.
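A minimal sketch of a persisted Chronicle Map, assuming the recent builder API (the file name and sizing hints are invented for the example):

import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;
import java.io.IOException;

public class ChronicleSketch {
    public static void main(String[] args) throws IOException {
        // entries() and averageValue() are sizing hints Chronicle Map needs up front.
        try (ChronicleMap<Long, String> map = ChronicleMap
                .of(Long.class, String.class)
                .name("records")
                .entries(10_000_000)
                .averageValue("a typical value")
                .createPersistedTo(new File("records.dat"))) {

            map.put(42L, "payload");
            boolean exists = map.containsKey(42L);  // key-existence check
            System.out.println("exists = " + exists);
        }  // closing the map persists it; the file can be reopened later
    }
}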
I'd likely use a local database, like, say, BDB JE or HSQLDB. May I ask what is wrong with this approach? You must have some reason to be looking for alternatives.
In response to comments:
As the problem is performance, and I guess you are already using JDBC to handle this, it might be worth trying HSQLDB and reading the chapter on Memory and Disk Use.
As of today I would either use MapDB (file based/backed, sync or async) or Hazelcast. With the latter you will have to implement your own persistence, i.e. back it with an RDBMS by implementing a Java interface. OpenHFT Chronicle might be another option. I am not sure how persistence works there since I have never used it, but they claim to have it. OpenHFT is completely off heap and allows partial updates of objects (of primitives) without (de-)serialization, which might be a performance benefit.
NOTE: If you need your map disk based because of memory issues, the easiest option is MapDB. Hazelcast could be used as a cache (distributed or not) which allows you to evict elements from the heap after some time or size limit. OpenHFT is off heap and could be considered if you only need persistence across JVM restarts.
I've found Tokyo Cabinet to be a simple persistent Hash/Map, and fast to set up and use.
This abbreviated example, taken from the docs, shows how simple it is to save and retrieve data from a persistent map:
// import the Tokyo Cabinet Java binding
import tokyocabinet.HDB;

// create the object
HDB hdb = new HDB();
// open the database, creating it if it does not exist
hdb.open("casket.tch", HDB.OWRITER | HDB.OCREAT);
// add an item
hdb.put("foo", "hop");
// retrieve it again
String value = hdb.get("foo");
// close the database
hdb.close();
SQLite does this. I wrote a wrapper for using it from Java: http://zentus.com/sqlitejdbc
As I mentioned in a comment, I have successfully used SQLite with gigabytes of data and tables of hundreds of millions of rows. If you think out the indexing properly, it's very fast.
The only pain is the JDBC interface. Compared to a simple HashMap, it is clunky. I often end up writing a JDBC-wrapper for the specific project, which can add up to a lot of boilerplate code.
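For what it's worth, a key-existence check over SQLite JDBC can stay quite small; this is only a sketch with made-up table and column names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SqliteLookup {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db")) {
            try (PreparedStatement create = conn.prepareStatement(
                    "CREATE TABLE IF NOT EXISTS records (k TEXT PRIMARY KEY, v TEXT)")) {
                create.executeUpdate();  // the PRIMARY KEY gives an index, keeping lookups fast
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT 1 FROM records WHERE k = ? LIMIT 1")) {
                ps.setString(1, "foo");
                try (ResultSet rs = ps.executeQuery()) {
                    System.out.println("exists = " + rs.next());
                }
            }
        }
    }
}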
JBoss (tree) Cache is a great option. You can use it standalone from JBoss. Very robust, performant, and flexible.
I think Hibernate Shards may easily fulfill all your requirements.
We have a table of vocabulary items that we use to search text documents. The java program that uses this table currently reads it from a database, stores it in memory and then searches documents for individual items in the table. The table is brought into memory for performance reasons. This has worked for many years but the table has grown quite large over time and now we are starting to see Java Heap Space errors.
There is a brute force approach to solving this problem, which is to upgrade to a larger server, install more memory, and then allocate more memory to the Java heap. But I'm wondering if there are better solutions. I don't think an embedded database will work for our purposes because the tables are constantly being updated and the application is hosted at multiple sites, which suggests a maintenance nightmare. But I'm uncertain about what other techniques are out there that might help in this situation.
Some more details: there are currently over a million vocabulary items (think of these items as short text strings, not individual words). The documents are read from a directory by our application, and then each document is analyzed to determine if any of the vocabulary is present in the document. If it is, we note which items are present and store them in a database. The vocabulary itself is stored and maintained in an MS SQL relational database that we have been growing for years. Since all vocabulary items must be analyzed for each document, repeatedly reading from the database is inefficient. And the number of documents that need to be analyzed each day can at some of our installations be quite large (on the order of 100K documents a day). The documents are typically 2 to 3 pages long, although we occasionally see documents as large as 100 pages.
In the hopes of making your application more performant, you're taking all the data out of a database that is designed with efficient data operations in mind and putting it into your application's memory. This works fine for small data sets, but as those data sets grow, you are eventually going to run out of resources in the application to handle the entire dataset.
The solution is to use a database, or at least a data tier, that's appropriate for your use case. Let your data tier do the heavy lifting instead of replicating the data set into your application. Databases are incredible, and their ability to crunch through huge amounts of data is often underrated. You don't always get blazing fast performance for free (you might have to think hard about indexes and models), but few are the use cases where java code is going to be able to pull an entire data set down and process it more efficiently than a database can.
You don't say much about which database technology you're using, but most relational databases offer a lot of useful tools for full text searching. I've seen well designed relational databases perform text searches very effectively. But if you're constrained by your database technology, or your table really is so big that a relational database text search isn't feasible, you should put your data into a searchable cache such as Elasticsearch. If you model and index your data effectively, you can build a very performant text search platform that will scale reliably. Tom's suggestion of Lucene is another good one. There are a lot of cloud technologies that can help with this kind of thing too: S3 + Athena comes to mind, if you're into AWS.
I'd look at http://lucene.apache.org - it should be a good fit for what you've described.
I was having the same issue with a table of more than a million rows of data, and a client wanted to export all of it. My solution was very simple: I followed this question. But there was a little issue: with more than 100k records I ran into heap space errors. So I just read the data in chunks, querying WITH (NOLOCK) (I know this can return some inconsistent data, but I needed it because without this hint the queries were blocking the DB). I hope this approach helps you.
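As an illustration only (the table and column names are invented, and the OFFSET/FETCH paging plus the NOLOCK hint assume SQL Server 2012 or later), the chunked reads could look roughly like this:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedExport {
    private static final int CHUNK_SIZE = 50_000;  // arbitrary chunk size for the example

    static void export(Connection conn) throws SQLException {
        long offset = 0;
        while (true) {
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, payload FROM big_table WITH (NOLOCK) "
                    + "ORDER BY id OFFSET ? ROWS FETCH NEXT ? ROWS ONLY")) {
                ps.setLong(1, offset);
                ps.setInt(2, CHUNK_SIZE);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows++;
                        // write the row out (file, HTTP response, ...) instead of keeping it in memory
                        handleRow(rs.getLong("id"), rs.getString("payload"));
                    }
                }
            }
            if (rows < CHUNK_SIZE) {
                break;  // last chunk reached
            }
            offset += rows;
        }
    }

    private static void handleRow(long id, String payload) {
        // hypothetical export step
    }
}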
When you had a small table, you probably implemented an approach of looping over the words in the table and for each one looking it up in the document to be processed.
Now the table has grown to the point where you have trouble loading it all in memory. I expect that the processing of each document has also slowed down due to having more words to look up in each document.
If you flip the processing around, you have more opportunities to optimize this process. In particular, to process a document you first identify the set of words in the document (e.g., by adding each word to a Set). Then you loop over each document word and look it up in the table. The simplest implementation simply does a database query for each word.
To optimize this, without loading the whole table in memory, you will want to implement an in-memory cache of some kind. Your database server will actually automatically implement this for you when you query the database for each word; the efficacy of this approach will depend on the hardware and configuration of your database server as well as the other queries that are competing with your word look-ups.
You can also implement an in-memory cache of the most-used portion of the table. You will limit the size of the cache based on how much memory you can afford to give it. All words that you look up that are not in the cache need to be checked by querying the database. Your cache might use a least-recently-used eviction strategy so that you keep the most common words in the cache.
While you could cache only the words that exist in the table, you might achieve better performance if you cache the result of the lookup itself. This will result in your cache holding the most common words that show up in the documents (each one with a boolean value that indicates whether the word is or is not in the table).
There are several really good open source in-memory caching implementations available in Java, which will minimize the amount of code you need to write to implement a caching solution.
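As a sketch of that last point, a small least-recently-used cache of lookup results can be built directly on java.util.LinkedHashMap; the capacity and the database lookup callback here are placeholders:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Caches the result of "is this word in the vocabulary table?" lookups,
// evicting the least recently used entries once the cache is full.
public class VocabularyCache {
    private final Map<String, Boolean> cache;
    private final Predicate<String> databaseLookup;  // e.g. a JDBC query per word

    public VocabularyCache(int maxEntries, Predicate<String> databaseLookup) {
        this.databaseLookup = databaseLookup;
        // accessOrder = true means iteration order is "least recently accessed first"
        this.cache = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public boolean contains(String word) {
        // Caches both hits (true) and misses (false), as discussed above.
        return cache.computeIfAbsent(word, databaseLookup::test);
    }
}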
Let me first brief you about the scenario. The database is Sybase. There are some 2-3k stored procedures. Stored procedures might return huge data (around a million records). There will be a service (servlet / Spring controller) which will call the required procedure and flush the data back to the client in XML format.
I need to apply filtering (on multiple columns and multiple conditions) and sorting (based on some dynamic criteria); this I have done.
The issue is that, as the data is huge, doing all the filtering / sorting in memory is not good. I have thought of the options below.
Option 1:
Once I get the ResultSet object, read some X no. of records, filter it, store it in some file, repeat this process till all the data is read. Then just read the file and flush the data to client.
I need to figure out how do I sort the data in file and how to store objects in file so that the filtering/sorting is fast.
Option 2:
Look for some Java API which takes the data, filters it and sorts it based on the given criteria, and returns it back as a stream.
Option 3:
Use an in-memory database like HSQLDB or H2. But I think this will add overhead instead of helping: I will need to insert the data first and then query it, and this will in turn also use the file system.
Note: I don't want to modify the stored procedures, so doing the filtering/sorting in the database is not an option, or might be the last resort if nothing else works.
Also, if it helps: every record that I read from the ResultSet is stored in a Map, with the column names as keys, and these Maps are stored in a List, on which I apply the filtering and sorting.
Which option do you think will be best in terms of memory footprint, scalability and performance? Or is there any other option which would be good for this scenario?
Thanks
I would recommend your Option 3 but it doesn't need to be an in-memory database; you could use a proper database instead. Any other option would be just a more specific solution to the general problem of sorting huge amounts of data. That is, after all, exactly what a database is for and it does it very well.
If you really believe your Option 3 is not a good solution then you could implement a sort/merge solution. Gather your Maps as you already do but whenever you reach a limit of records (say 10,000 perhaps) sort them, write them to disk and clear them down from memory.
Once your data is complete you can now open all files you wrote and perform a merge on them.
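A bare-bones sketch of that sort/merge approach, working on plain text lines for simplicity (in reality you would serialise your Maps and compare on the relevant columns):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {

    // Sort one chunk in memory and spill it to a temporary file.
    static Path spillChunk(List<String> chunk) throws IOException {
        Collections.sort(chunk);
        Path file = Files.createTempFile("chunk", ".txt");
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (String line : chunk) {
                out.write(line);
                out.newLine();
            }
        }
        return file;
    }

    // Merge the sorted chunk files into one sorted output using a priority queue.
    static void merge(List<Path> chunks, Path output) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        // Each queue entry holds the current line of one chunk plus that chunk's index.
        PriorityQueue<String[]> queue = new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (int i = 0; i < chunks.size(); i++) {
                BufferedReader reader = Files.newBufferedReader(chunks.get(i));
                readers.add(reader);
                String line = reader.readLine();
                if (line != null) {
                    queue.add(new String[] { line, Integer.toString(i) });
                }
            }
            while (!queue.isEmpty()) {
                String[] smallest = queue.poll();
                out.write(smallest[0]);
                out.newLine();
                int idx = Integer.parseInt(smallest[1]);
                String next = readers.get(idx).readLine();
                if (next != null) {
                    queue.add(new String[] { next, Integer.toString(idx) });
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}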
Is Hadoop applicable to your problem?
You should filter the data in the database itself. You could write an aggregation procedure which executes the other procedures and combines or filters their data. However, the best option is to modify the 2-3 thousand stored procedures so they return only the needed data.
I am developing a web application in which I need to store sessions, user messages, etc. I am thinking of using a HashMap or the H2 database.
Please let me know which is the better approach in terms of performance and memory utilization. The web site has to support 10,000 users.
Thanks.
As usual with these questions, I would worry about performance as/when you know it's an issue.
10000 users is not a lot of data to hold in memory. I would likely start off with a standard Java collection, and look at performance when you predict it's going to cause you grief.
Abstract out the access to this Java collection so that when you substitute it, the refactoring required is localised (and perhaps make it configurable, so that you can easily perform before/after performance tests with your different solutions: H2, Derby, Oracle, etc.).
If your session objects aren't too big (which should be the case), there is no need to persist them in a database.
Using a database for this would add a lot of complexity in a case where you can start with a few lines of code. So don't use a database; simply store them in a light in-memory structure (a HashMap, for example).
You may need to implement a way to clean your HashMap if you don't want to keep sessions in memory after the user has been gone for a long time. Many solutions are available (the easiest is simply to have a background thread removing stale sessions from time to time), as sketched below. Note that it's usually easier to clean a HashMap than a database.
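For example, a simple in-memory session store with periodic eviction of stale entries might look roughly like this (the Session class, timeout and cleanup interval are placeholders):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SessionStore {
    // Hypothetical session value; hold whatever your application needs.
    public static class Session {
        volatile long lastAccess = System.currentTimeMillis();
        // ... user id, messages, etc.
    }

    private final Map<String, Session> sessions = new ConcurrentHashMap<>();

    public SessionStore(long timeoutMillis) {
        ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
        // Periodically drop sessions that have not been touched within the timeout.
        cleaner.scheduleAtFixedRate(
                () -> sessions.values().removeIf(
                        s -> System.currentTimeMillis() - s.lastAccess > timeoutMillis),
                1, 1, TimeUnit.MINUTES);
    }

    public Session get(String id) {
        Session s = sessions.computeIfAbsent(id, key -> new Session());
        s.lastAccess = System.currentTimeMillis();
        return s;
    }
}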
Both H2 and a HashMap are going to keep the data in memory (so from a space point of view they are almost the same).
If lookups are simple key-value gets, then the HashMap will be quicker.
If you have to do comparisons like KEY < 100, use H2.
In fact, info for 10K users is not that high a number.
If you don't need to save user messages, use the collections. But if the messages should be saved, be sure to use a database, because after a restart you would lose all the data.
The problem with using a HashMap for storing objects is that you would run into issues when your site becomes too big for one server and would need to be clustered in order to scale with demand. Then you would face problems with how to synchronise the HashMap instances on different servers.
A possible alternative would be to use a key-value store like Redis, as you won't need the structure of a database, or to use the distributed cache abilities of something like EHCache.
I'm looking for an efficient way to store many key->value pairs on disk for persistence, preferably with some caching.
The features needed are to either add to the value (concatenate) for a given key, or to let the model be key -> list of values; both options are fine. The value part is typically a binary document.
I will not have too much use of clustering, redundancy etc in this scenario.
Language-wise we're using java and we are experienced in classic databases (Oracle, MySQL and more).
I see a couple of obvious scenarios and would like advice on what is fastest in terms of stores (and retrievals) per second:
1) Store the data in classic DB tables by standard inserts.
2) Do it yourself, using a file system tree to spread the data across many files, one or several per key.
3) Use some well known tuple storage. Some obvious candidates are:
3a) Berkeley DB Java Edition
3b) Modern NoSQL solutions like Cassandra and similar
Personally I like Berkeley DB JE for my task.
To summarize my questions:
Does Berkeley DB seem like a sensible choice given the above?
What kind of speed can I expect for some operations, like updates (inserts, addition of a new value for a key) and retrievals given a key?
You could also give Chronicle Map or JetBrains Xodus a try; both are embeddable Java key-value stores that are much faster than Berkeley DB JE (if you are really looking for speed). Chronicle Map provides an easy-to-use java.util.Map interface.
BerkeleyDB sounds sensible. Cassandra would also be sensible but perhaps is overkill if you don't need redundancy, clustering etc.
That said, a single Cassandra node can handle 20k writes per second (provided that you use multiple clients to exploit the high concurrency within Cassandra) on relatively modest hardware.
FWIW, I'm using Ehcache with completely satisfactory performance; I've never tried Berkeley DB.
Berkeley DB JE should work just fine for the use case that you describe. Performance will vary, largely depending on how many I/Os are required per operation (and the corollary -- how big the available cache is) and on the durability constraints that you define for your write transactions (i.e. does a transaction commit have to write all the way to the disk or not).
Generally speaking, we typically see 50-100K reads per second and 5-12K writes per second on commodity hardware with BDB JE. Obviously, YMMV.
Performance tuning and throughput questions about BDB JE are best asked on the Berkeley DB JE forum, where there is an active community of BDB JE application developers on hand to help out. There are several useful performance tuning recommendations in the BDB JE FAQ which may also come in handy.
Best of luck with your implementation. Please let us know if we can help.
Regards,
Dave -- Product Manager for Berkeley DB
I have a question that maybe some will see as stupid. Is Hibernate fast? I use it in a system which will issue a really large number of queries to the database, and performance has become a concern.
And another question, somewhat off the current context: what would be better - many simple queries (against a single table) or somewhat fewer queries with several JOINs?
Thanks in advance,
Artem
From our experience Hibernate can be very fast. However, there can be many pitfalls that are specific to ORM tools including:
Incorrect lazy loading, where too much or too little data is selected
N+1 select problems, where iterating over collections is slow
Prefer Set over List for collections, so ordering information does not need to be stored in the table
Batch actions where it's best to fall back to direct SQL
The Improving Performance page in the Hibernate docs is a good place to start to learn about these issues and other methods to improve performance.
First of all, there are many things you can do to speed up Hibernate. Check out this High-Performance Hibernate Tips article for a comprehensive list of things you can do to speed up your data access layer.
With "many" queries, you are running into the typical N+1 query problem. You load an entity with Hibernate, which has related objects. With LAZY associations, you'll get a separate query for every record. Each query goes through the network to your database server, and it returns the data. This all takes time (opening the connection, network latency, lookup, data throughput, etc.).
For a single query (with joins) the lookup and data throughput are larger than with multiple queries, but you'll incur the connection opening and network latency only once. So with 100 or more queries you have a small lookup and data throughput each time, but you will pay for it 100 times (including opening the connection and network latency).
A single query that takes 20ms vs. 100 queries that take 1ms each? You do the math ;)
And it can grow to thousands of records. The single query will take only slightly longer, but thousands of queries instead of hundreds means ten times more round trips. So with multiple queries you'll have reduced the performance greatly.
When using HQL queries to retrieve the data, you can add FETCH to a JOIN in order to load the associated data with the same query (using a JOIN FETCH).
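For instance, assuming an open Hibernate Session named session and hypothetical Order/items entities, the fetch-join version looks like:

// Loads each Order together with its items collection in one SQL query,
// instead of one extra query per order (the N+1 pattern).
// "distinct" avoids duplicate Order instances caused by the collection join.
List<Order> orders = session.createQuery(
        "select distinct o from Order o join fetch o.items where o.customer.id = :customerId",
        Order.class)
    .setParameter("customerId", customerId)
    .getResultList();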
For more info related to this topic, check out this Hibernate Performance Tuning Tutorial.
Hibernate can be fast. Designing a good schema, tuning your queries, and getting good performance is kind of an art form. Remember, under the covers it's all SQL anyway, so anything you can do with SQL you can do with Hibernate.
Typically in an advanced application the Hibernate mapping/schema is only the initial step of writing your persistence layer. That step gives you a lot of functionality. But the next step is to write custom queries using HQL that allow you to fetch only the data you need.
Yes, it can be fast.
In the past I have seen several cases where people thought "it's this stupid ORM that kills all the performance of our nice application".
In ALL cases, after profiling, we found other reasons for the problem (bad hashCode implementations for collections, regexes from hell, DB design made by a mad hatter, etc.).
Actually it can do the job in most common cases. If you are migrating huge and complex data, it will be a poor competitor to plain, well-optimized SQL (but I hope that's not your case - I personally hate data migration with a passion :)
This is not the first question about it, but I couldn't find an appropriate answer in my previous ones (perhaps it was for another "forum"). So, I'll answer once again :-)
I like to answer this in a somewhat provocative way (nothing personal!): do you think you'll be able to come up with a solution which is better than Hibernate? That involves not only the basic problems, like mapping database columns to Java properties and "eager or lazy loading" (which is an actual concern from your question), but also caching (1L and 2L), connection pooling, bytecode enhancement, batching, ...
That said, Hibernate is a complex piece of software, which requires some learning to properly use and fine tune it. So, I'd say that it's better to spend your time in learning Hibernate than writing your own ORM.
Hibernate can be reasonably fast, if you know how to use it that way. However, the polepos.org performance tests show that Hibernate can slow down applications by orders of magnitude.
If you want an ORM which is lighter and faster, I can recommend fjorm.
... which would have really large count of queries to database ...
If you are still in the design/development phase, do not optimize preemptively.
Hibernate is a very well made piece of software, so don't be too wary of performance issues up front. I would suggest that when your project is more mature you look into performance issues and analyse where direct JDBC usage is necessary.
It's usually fast enough, and can be faster than a custom JDBC-based solution. But as all tools, you have to use it correctly.
Fast doesn't mean anything. You have to define maximum response time, minimum throughput, etc., then measure if the solution meets the requirements, and tune the app to meet them if it doesn't.
Regarding joins vs. multiple queries, it all depends. Usually, joins are obviously faster, since they require only one inter-process/network roundtrip.