I need ideas to implement a (really) high performance in-memory Database/Storage Mechanism in Java. In the range of storing 20,000+ java objects, updated every 5 or so seconds.
Some options I am open to:
Pure JDBC/database combination
JDO
JPA/ORM/database combination
An Object Database
Other Storage Mechanisms
What is my best option? What are your experiences?
EDIT: I also need to be able to query these objects.
You could try something like Prevayler (basically an in-memory cache that handles serialization and backup for you so data persists and is transactionally safe). There are other similar projects.
I've used it for a large project, it's safe and extremely fast.
If it's the same set of 20,000 objects, or at least not 20,000 new objects every 5 seconds but lots of changes, you might be better off caching the changes and periodically writing them in batch mode (JDBC batch updates are much faster than individual row updates). It depends on whether you need each write to be transactionally wrapped, and whether you need a record of the change log or just the aggregate changes.
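A minimal sketch of the "cache the changes, write them in batches" idea, using only the JDK. The class name WriteBehindBuffer and the sink callback are made up for illustration. In a real application the sink would typically loop over the batch calling PreparedStatement.addBatch() and finish with executeBatch(), and flush() would be called from a timer (for example every 5 seconds).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Coalesces updates per key and hands them to a sink as one batch per flush. */
class WriteBehindBuffer<K, V> {
    private final Map<K, V> pending = new LinkedHashMap<>();
    private final Consumer<Map<K, V>> batchSink; // hypothetical: e.g. a JDBC batch writer

    WriteBehindBuffer(Consumer<Map<K, V>> batchSink) {
        this.batchSink = batchSink;
    }

    /** Records a change; a later change to the same key replaces the earlier one. */
    synchronized void put(K key, V value) {
        pending.put(key, value);
    }

    /** Passes all pending changes to the sink in one call and returns the batch size. */
    synchronized int flush() {
        if (pending.isEmpty()) {
            return 0;
        }
        Map<K, V> batch = new LinkedHashMap<>(pending);
        batchSink.accept(batch); // if this throws, the changes stay pending
        pending.clear();
        return batch.size();
    }
}
```

Note that this only keeps the latest value per key, so it fits the "aggregate changes" case. If you need the full change log, you would append every change to a list instead of overwriting.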
Edit: as other posts have mentioned Prevayler I thought I'd leave a note on what it does:
Basically you create a searchable/serializable object (typically a Map of some sort) which is wrapped in a Prevayler instance, which is serialized to disk. Rather than making changes directly to your map, you make changes by sending your Prevayler instance a serializable record of your change (just an object that contains the change instruction). Prevayler's version of a transaction is to write your serialized changes to disk so that in the event of failure it can load the last complete backup and then replay the changes against that. It's safe, although you do have to have enough memory to load all of your data, and it's a fairly old API, so no generic interfaces, unfortunately. But it is definitely stable and works as advertised.
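To make the mechanism concrete, here is a rough sketch of the pattern in plain Java. This is not Prevayler's actual API: Command, PutCommand and PrevalentSystem are names invented for the example, and an in-memory list of byte arrays stands in for the journal file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Map;

/** A change instruction; serializable so it can be journaled before it is applied. */
interface Command<S> extends Serializable {
    void executeOn(S system);
}

/** Hypothetical example command for a Map-based system. */
class PutCommand implements Command<Map<String, String>> {
    private static final long serialVersionUID = 1L;
    private final String key;
    private final String value;

    PutCommand(String key, String value) {
        this.key = key;
        this.value = value;
    }

    @Override
    public void executeOn(Map<String, String> system) {
        system.put(key, value);
    }
}

/** Journals each command first, then applies it; recovery replays the journal. */
class PrevalentSystem<S> {
    private final S system;
    private final List<byte[]> journal; // assumption: stands in for an append-only file

    PrevalentSystem(S system, List<byte[]> journal) {
        this.system = system;
        this.journal = journal;
    }

    synchronized void execute(Command<S> command) {
        journal.add(serialize(command)); // record the change before touching memory
        command.executeOn(system);
    }

    S system() {
        return system;
    }

    /** Rebuilds state by replaying every journaled command against the last snapshot. */
    static <T> PrevalentSystem<T> recover(T lastSnapshot, List<byte[]> journal) {
        for (byte[] entry : journal) {
            @SuppressWarnings("unchecked")
            Command<T> command = (Command<T>) deserialize(entry);
            command.executeOn(lastSnapshot);
        }
        return new PrevalentSystem<>(lastSnapshot, journal);
    }

    private static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Reads go straight to the in-memory object (system()), which is where the speed comes from. Only writes pay the cost of serializing a command.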
I highly recommend H2. This is a kind of "second generation" version of HSQLDB done by one of the original authors. H2 allows us to unit-test our DAO layer without requiring an actual PostgreSQL database, which is awesome.
There is an active net group and mailing list, and the author Thomas Mueller is very responsive to queries (hah, little pun there.)
I don't know if it is the fastest option, but I've been very satisfied with H2 whenever I've used it. It's written by the same person who originally wrote Hypersonic (which later became HSQLDB).
Another option that is allegedly very fast is Prevayler.
It is a bit of an old question, but these days there are a whole lot of databases that can handle 20,000 writes/s. Which database to choose depends on your data structure and the type of queries you'd like to be making. It also depends on overall volume.
We had a similar problem with a large volume of time series data, about 300,000 rec/s, and we ended up writing a new database with a simple enough API and decent performance. It can do about 2,000,000 object writes/s, and we did without an ORM.
It later evolved into QuestDB.
Try the following; it performs really well with Hibernate and other ORM frameworks:
http://hsqldb.org/
Chronicle Map is an embeddable pure Java persistent database, providing a simple java.util.Map interface. It withstands about 1 million queries/updates per second from a single thread, consistent read/write performance and scales almost linearly to the number of cores in the machine.
Here is some recent performance research with actual numbers:
Comparison of Jetbrains Xodus, Oracle Berkeley DB JE BTree, MapDB TreeMap, Chronicle Map and H2 MVStore Map
LmdbJava Benchmarks
I would give OrientDB a try.
Terracotta might also be an answer for you. It allows multiple VMs to share objects so you can distribute load etc...
You can also check out db4o
If you want to store all of your data in memory, you might want to look at Prevayler.
I've never used it myself, but it seems like a much better solution than using a relational database for those cases in which all of your data can be stored in memory.
Berkeley DB for Java is a fast in memory database, extremely useful for simple object graphs.
HSQLDB is quite fast, but it is not ACID transaction-safe. The fastest Java database I know is db4o: benchmarks.
Edit: Please notice that Prevayler is not a database, see http://www.prevayler.org/wiki.jsp?topic=PrevaylerIsNotADatabase. If you're out of RAM, you're out of luck.
H2 is truly fantastic: in-memory, normal server and transactional modes, you have it all. However, it doesn't compare in performance to the object databases. I see db4o mentioned, but I have in fact had much better performance with NeoDatis, and everything is nicely set up in Maven repositories. It is not very robust, though: like a Ferrari, it is fast, but it is not a truck like Oracle.
You can try CSQL (available in open source and enterprise versions). It claims a 30X performance improvement over disk-based database systems and provides a JDBC interface. It can be configured to work as a standalone main-memory database or as a transparent cache for MySQL, Postgres and Oracle databases.
Related
I am going to make a desktop application with a MySQL database. My database tables change frequently (almost 60% of the tables), so I think caching may be a bad idea. Can anyone suggest:
How can I make a fast desktop application with a remote database?
My language is Java.
The biggest problem with most projects that have performance as their primary concern is that people tend to make some exotic choices that end up complicating the project without any real benefits. Unless you have actual hands-on experience with the environment you will be working in, start simple.
Set some realistic goals about how often you have to refresh your data before you start. If your data changes very frequently, e.g. every second, does it make sense to try and show the changes in real time? A query every second will make everyone involved miserable.
Use a thread to take care of the queries. You don't need more than one, since any more will only make the race conditions in the database worse.
Design your database layer to be insulated from the rest of the application. Also time your DB-related operations from the beginning in order to track the impact of your optimizations.
Start with Hibernate / ORMLite. Although I cannot talk about ORMLite, I have used (optimized) Hibernate in heavy load environments without any problems. If you have complicated objects you should give it a try, it sure beats using plain JDBC and implementing the cache mechanism yourself.
Find out when you need lazy loading and when it's slowing you down (due to the select n+1 problem).
If you have performance issues optimize. You don't have to map every single relationship. Use custom SQL in separate methods to get the objects you need when you need them. You can write a query that only returns table ids and afterwards ask Hibernate to load the corresponding objects.
Optimize your SQL. Avoid joins, use subselects, WHERE id IN (...), etc.
Implement (database) paging if it makes sense.
If all else fails, start using plain SQL. You'll have already written the most complex queries and you'll know where your bigger bottlenecks are.
You could use a local SQLite to save the less volatile data and talk to the database mainly to get lists of ids and the stuff that you're missing. For example if you have users and orders, you can assume that you will have many more new orders per minute/second than users per hour.
To sum up, set clear performance goals before you start, always use a separate thread for data retrieval, avoid reinventing the wheel and keep it as simple as possible.
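As a rough illustration of two of the points above (one dedicated thread for data access, and timing every DB-related operation from the start), something along these lines would do. DataAccessThread and the "db-access" thread name are invented for the sketch; the Supplier would wrap your actual Hibernate or JDBC call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Runs all data-access tasks on one dedicated background thread and logs their duration. */
class DataAccessThread implements AutoCloseable {
    private final ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, "db-access");
        thread.setDaemon(true); // do not keep the desktop app alive on exit
        return thread;
    });

    /** Queues a query; tasks run one at a time, in submission order. */
    <T> CompletableFuture<T> submit(String label, Supplier<T> query) {
        return CompletableFuture.supplyAsync(() -> {
            long start = System.nanoTime();
            try {
                return query.get();
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.printf("%s took %d us%n", label, micros);
            }
        }, executor);
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}
```

The UI thread never blocks on the database this way: it submits a task and reacts when the returned future completes (for example with thenAccept).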
Here are some generic approaches to the problem.
0) HW: make sure you don't have bottlenecks in your hardware that you can cheaply remove (adding HW is faster and cheaper than dev hours in most cases).
1) Caching:
Perhaps you can cache (locally or in a distributed cache like memcache) the 40% of data that tends to be immutable. You could invalidate the cache when data gets modified. You should choose the right entities and granularity level for building the keys.
2) Replication:
If the first is too much overhead, you could create slaves of your MySQL server and read from there. Again, you have to know when you can afford to have some stale data.
3) NoSQL:
Moving in that direction, but increasing the dev effort, you could move to some distributed store (take a look at the CAP theorem before making a choice)
Hope it helps
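For point 1, a minimal local read-through cache with explicit invalidation could look like the sketch below. InvalidatingCache is a made-up name, and the loader stands for whatever fetches the entity from MySQL. A distributed cache would need the invalidation to reach every node, which this does not attempt.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Loads a value on first access, then serves it from memory until it is invalidated. */
class InvalidatingCache<K, V> {
    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // hypothetical: e.g. a SELECT by primary key

    InvalidatingCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, calling the loader only on a miss. */
    V get(K key) {
        return entries.computeIfAbsent(key, loader);
    }

    /** Call this whenever the underlying row is modified. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

The key type is where the "right entities and granularity" decision shows up: one entry per row is easy to invalidate, while one entry per query result is faster to read but harder to keep fresh.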
It depends on your database structure and application. You can use an object-relational mapping library like ORMLite and refresh the objects loaded from the database in the background with threads. With ORMLite you can also use LazyForeignCollection to load only the data your application requires.
Minimize unnecessary database calls.
If the fields in your database are changing, you can move from a relational database to a NoSQL database like MongoDB.
You can use multithreading on the server side for data processing, and cluster your application servers. When you use multithreading, use it effectively: be aware that the synchronized keyword will degrade performance to some extent.
Follow coding best practices: avoid unnecessary instance variables and prefer local variables, which also helps with thread safety.
You can also use MyBatis as the ORM for large queries.
You can cache at the DAO layer, the service layer and even on the client side, but be sure to keep the caches synchronized with the database. You can use different caching solutions.
You can use database indexes for fast retrieval.
Do not use the same service for large data queries; break it down into different services, which will help you process them in a multithreaded way.
If the application is not a hard real-time system, you can also use a messaging solution to process data asynchronously.
I needed to implement a utility server that tracks a few custom variables that will be sent from any other server. To track the variables, a key-value collection, either JDK-defined or custom, needs to be used.
Here are a few considerations:
Keeping all the variables in memory of the server all the time is memory intensive.
This server needs to be a very lightweight server and I do not want heavy database operations.
Is there a pre-defined streaming collection which can serialize the data after a threshold memory and retrieve it on need basis?
I hope I am clear in defining the problem statement.
Please suggest a better approach if there is one.
This thing looks very promising, but it is still in the development stage...
JDBM3
Edit: the current version of the file-backed collections is MapDB.
Database
What you've described sounds exactly like you should use a database (i.e. indexed key/value store, too big for memory but want performance benefits of in-memory caching where possible).
I'd recommend a lightweight embedded database such as H2 - it's small, fast and should suit your purposes very well.
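To show what "in-memory where possible, on disk otherwise" means in code, here is a toy sketch using only the JDK. It is not how H2 or MapDB work internally. SpillingMap is an invented class that keeps the most recently used entries in a LinkedHashMap and serializes evicted ones to one file per key in a directory you give it. It assumes String keys, and it has no indexing or crash safety, which is exactly what an embedded database adds.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Keeps the most recently used entries in memory and spills the rest to files. */
class SpillingMap<V extends Serializable> {
    private final Path dir;
    private final LinkedHashMap<String, V> hot;

    SpillingMap(final int maxInMemory, Path dir) {
        this.dir = dir;
        // accessOrder = true: the eldest entry is the least recently used one
        this.hot = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                if (size() <= maxInMemory) {
                    return false;
                }
                spill(eldest.getKey(), eldest.getValue());
                return true; // drop it from memory once it is on disk
            }
        };
    }

    synchronized void put(String key, V value) {
        try {
            Files.deleteIfExists(fileFor(key)); // drop any stale spilled copy
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        hot.put(key, value);
    }

    @SuppressWarnings("unchecked")
    synchronized V get(String key) {
        V value = hot.get(key);
        if (value != null) {
            return value;
        }
        Path file = fileFor(key);
        if (!Files.exists(file)) {
            return null;
        }
        try {
            try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
                value = (V) in.readObject();
            }
            Files.delete(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        hot.put(key, value); // promote back into memory; may spill another entry
        return value;
    }

    synchronized int inMemorySize() {
        return hot.size();
    }

    private void spill(String key, V value) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(fileFor(key)))) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Hex-encodes the key so that any string maps to a safe, unique file name. */
    private Path fileFor(String key) {
        StringBuilder name = new StringBuilder();
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            name.append(String.format("%02x", b));
        }
        return dir.resolve(name + ".bin");
    }
}
```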
Have you thought of using an off-the-shelf NoSQL key-value store? Redis, for example?
If you want it Java-only, you have the option of using a library like Ehcache, which would have the functionality you need.
I was going through a Hibernate tutorial, where they say that Hibernate is not suitable for data-centric applications. I am very much impressed by the 'object oriented structure' it gives to the program, but my application is very much data-centric (it fetches and updates a huge number of records, but I don't use any stored procedures). Can't I use Hibernate? Are there any wrappers written over Hibernate which I can use for my application? Any help is appreciated.
I am not sure about the specific meaning of the phrase 'data centric'. Aren't all database applications data centric? However, if you do process tons of data, Hibernate may not be the best choice. Hibernate is best at representing object models mapped to the database, and it may have a role in any application, but for ETL (extract/transform/load) tasks you may need to write very efficient SQL by hand.
In principle you can, but it tends to be slow. Hibernate more or less creates an object for every row retrieved from the database. If you do this with large volumes of data, performance takes a serious hit. Also, updating many rows with a single update statement has only very basic support.
A wrapper won't help, at least with the object creation issue.
There are many advantages to using Hibernate. When a developer gets the object model right, there is a lot of appeal in interacting with the database via objects. In practice, though, I have found that Hibernate is great initially but becomes very frustrating when you come up against issues like performance and fault finding.
When it comes to deciding on the DA (Data Access) layer, I ask myself this question:
Am I writing an application which has a requirement to run on different databases?
If the answer is yes, then I will consider an ORM like Hibernate.
If it's no, then I will normally just use JDBC, usually via Spring.
I feel that interacting with the database via JDBC is a lot more transparent, and it makes it easier to find faults and tune performance.
I am working with an object that serves as a database in my application. However, I need to have redundant copies of this database. So, on init, I create multiple copies (say 5) of the same object. (I am using Java for this, so any hint of pre-existing libraries would be helpful as well.)
The object is a server that listens on a port for requests for the information it is holding. This information may be updated by other entities via the same or a different port at any time.
My question is as follows:
Would a lock strategy work in this case? That is, every time an update is made in any instance, that instance contacts all other instances and passes the update.
During this time, all the requests (read or update) from other entities are queued.
Would this approach work? I have my doubts because, even if this works, I think the system is creating its own bottleneck. What do you guys say? Is there a better way of doing this distributed synchronization?
What you're describing is a distributed cache. The big player in that space is currently Coherence though I believe JBoss Cache is catching up.
As for rolling your own, having seen the complexity in what superficially sounds like quite a simple problem, I wouldn't recommend it in a commercial setting, though it'd be a fun home project.
Are you talking about a distributed cache? Have you looked at ehcache?
"Would this approach work? I have my doubts because, even if this works, I think the system is creating its own bottleneck."
It would be creating its own bottleneck. You'd be better off using an in-memory database like HSQLDB or an embedded database like SQLite.
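To see where the bottleneck comes from, here is the scheme from the question reduced to a single process. This is only a stand-in: LockStepReplicas is an invented class, the "instances" are plain maps instead of servers, and the network calls are a loop. The point is that every update holds one exclusive lock while it is pushed to all copies, so all readers and other writers wait for the slowest replica.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** N copies of the same data, kept identical by applying every update under one lock. */
class LockStepReplicas {
    private final List<Map<String, String>> replicas = new ArrayList<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    LockStepReplicas(int copies) {
        for (int i = 0; i < copies; i++) {
            replicas.add(new HashMap<>());
        }
    }

    /** Pushes the update to every replica; all other requests queue until it is done. */
    void update(String key, String value) {
        lock.writeLock().lock();
        try {
            for (Map<String, String> replica : replicas) {
                replica.put(key, value); // over a network, each of these is a round trip
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reads from one replica; reads run concurrently unless an update is in flight. */
    String read(int replicaIndex, String key) {
        lock.readLock().lock();
        try {
            return replicas.get(replicaIndex).get(key);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With real servers you also have to decide what happens when one replica is unreachable in the middle of update(), which this sketch ignores entirely.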
There is a lot more to distributed synchronization than it's possible to mention in a single answer. You have to worry about two-phase commits, network partitions, etc. I would advise you to look into an existing distributed DB solution combined with an n-tier Java EE architecture that includes load-balancing.
In the old days we used to access the database through stored procedures. They were seen as 'the better' way of managing the data. We keep the data in the database, and any language/platform can access it through JDBC/ODBC/etc.
However, in recent years run-time reflection/meta-data based storage retrieval mechanisms such as Hibernate/DataNucleus have become popular. Initially we were worried that they'd be slow because of the extra steps involved (reflection is expensive) and how they retrieve unnecessary data (the whole object) when all we need is one field.
I'm starting to plan for a large data warehousing project that uses J2EE, but I'm a bit unsure whether to go for Stored Procedures or JDO/JPA and the like. Recently, I've been working with Hibernate, and to be quite honest, I don't miss writing CRUD stored procedures!
It essentially boils down to:
Stored procedures
+ Can be optimised on the server (although only the queries)
- There's likely to be more than a thousand stored procedures: add, delete, update, getById, etc, for each table.
JDO
+ I won't spend the next few months writing parameters.add("#firstNames", customer.getFirstName()); ...
- Will be slower than SPs (but most support paging)
What would you plump for in my situation? In this case I think it's much of a muchness.
Thanks,
John
"JDO - Will be slower than SPs (but most support paging)"
This assumption is often false. There's no reason for SP's to be particularly fast. I've done some measurements and they're no faster than code outside the database.
A data warehouse is characterized by insert-only loads and long-running SELECT...GROUP BY... queries.
You're not writing OLTP transactional processing. You're not using 3NF as a way to prevent update anomalies on update/delete transactions.
Since you're doing bulk inserts, a SP will definitely be slower than a bulk load utility. Bulk loaders are often multi-threaded and will consume all available CPU resources. The SP is part of the DB and can only share limited DB resources.
Since you're mostly doing SELECT GROUP BY, a SP won't help much here, either. The SELECT statement doesn't benefit from being wrapped in a procedure.
You don't need them. They don't help.
You can easily benchmark a bulk-load and a query to demonstrate that SP's aren't helping.
Rod Johnson, in his "J2EE Design and Development", wrote a very clear analysis of ORM versus stored procedures. He said that:
Stored procedures should only be used in a J2EE system to perform operations that will always use the database heavily, whether they're implemented in the database or in Java code that exchanges a lot of data with the database.
As you're planning to implement a data warehouse, I think that the stored procedures approach is the right choice.
I would suggest using the metadata to generate the scripts you use for loading into the data warehouse. This allows you to get performance benefits from using specialised load tools and perhaps from stored procedures (if you're using a sufficiently ancient database). Also, you will probably end up hand coding at least some SQL. Having your generic scripts done as stored procs will allow you to schedule all of them in the same way and not have to worry about changing how they are invoked when you rewrite some generated code to make it run better.
As for getting the data out, if what you're building in J2EE is a reporting tool, then you may be better off using JDO. While I'm not terribly familiar with the reporting side of things, one benefit I can see is that it will be easier to allow your end users to make custom reports that you did not anticipate in advance (although you've still got to have some limits on what they can do so that they don't take down the database in the process).