Stored procedures vs JDO for data warehousing project

Stored procedures vs JDO for data warehousing project - java

In the old days we used to access the database through stored procedures. They were seen as `the better' way of managing the data. We keep the data in the database, and any language/platform can access it through JDBC/ODBC/etc.
However, in recent years run-time reflection/meta-data based storage retrieval mechanisms such as Hibernate/DataNucleus have become popular. Initially we were worried that they'd be slow because of the extra steps involved (reflection is expensive) and how they retrieve unnecessary data (the whole object) when all we need is one field.
I'm starting to plan for a large data warehousing project that uses J2EE, but I'm a bit unsure whether to go for Stored Procedures or JDO/JPA and the like. Recently, I've been working with Hibernate, and to be quite honest, I don't miss writing CRUD stored procedures!
It essentially boils down to:
Stored procedures
+ Can be optimised on the server (although only the queries)
- There's likely to be more than a thousand stored procedures: add, delete, update, getById, etc, for each table.
JDO
+ I won't spend the next few months writing parameters.add("#firstNames", customer.getFirstName()); ...
- Will be slower than SPs (but most support paging)
What would you plump for in my situation. In this case I think it's a much of a muchness.
Thanks,
John

"JDO - Will be slower than SPs (but most support paging)"
This assumption is often false. There's no reason for SP's to be particularly fast. I've done some measurements and they're no faster than code outside the database.
A data warehouse is characterized by insert-only loads and long-running SELECT...GROUP BY... queries.
You're not writing OLTP transactional processing. You're not using 3NF as a way to prevent update anomalies on update/delete transactions.
Since you're doing bulk inserts, a SP will definitely be slower than a bulk load utility. Bulk loaders are often multi-threaded and will consume all available CPU resources. The SP is part of the DB and can only share limited DB resources.
Since you're mostly doing SELECT GROUP BY, a SP won't help much here, either. The SELECT statement doesn't benefit from being wrapped in a procedure.
You don't need them. They don't help.
You can easily benchmark a bulk-load and a query to demonstrate that SP's aren't helping.

Rod Johnson in his "J2EE Design adn Development" wrote a very clear analysis about ORM/StoredProcedures. He said that
Stored procedures should only be used in a J2EE system to perform operations that will always use the database heavily, whether they're implemented in the database or in Java code that exchanges a lot of data with the database.
As you're planning to implement a datawarehouse, I think that the stored procedures approach is the right choice.

I would suggest using the metadata to generate the scripts you use for loading into the data warehouse. This allows you to get performance benefits from using specialised load tools and perhaps from stored procedures (if you're using a sufficiently ancient database). Also, you will probably end up hand coding at least some SQL. Having your generic scripts done as stored procs will allow you to schedule all of them in the same way and not have to worry about changing how they are invoked when you rewrite some generated code to make it run better.
As for getting the data out, if what you're building in J2EE is a reporting tool, then you may be better off using JDO. While I'm not terribly familiar with the reporting side of things, one benefit I can see is that it will be easier to allow your end users to make custom reports that you did not anticipate in advance (although you've still got to have some limits on what they can do so that they don't take down the database in the process).

Related

How Can I Make Fast Desktop Application With Remote Database?

I am going to make a desktop application with mysql database. My database tables are frequenlty changing -- almost 60% of the tables. So I think caching may be a bad idea. Can anyone suggest me:
How can I make a fast desktop application with a remote database ?
My language is Java.

The biggest problem with most projects that have performance as their primary concern is that people tend make some exotic choices that end up complicating the project without any real benefits. Unless you have previous actual hands-on experience with the environment you will be working start simple.
Set some realistic goals about how often you have to refresh your data before you start. If your data changes very frequently, eg. every second, does it make sense to try and show the changes in real time? A query every second will make everyone involved miserable.
Use a thread to take care of the queries. You don't need more than one, since any more will only make the race conditions in the database worst.
Design your database layer to be insulated from the rest of the application. Also time your DB-related operations from the beginning in order to track the impact of your optimizations.
Start with Hibernate / ORMLite. Although I cannot talk about ORMLite, I have used (optimized) Hibernate in heavy load environments without any problems. If you have complicated objects you should give it a try, it sure beats using plain JDBC and implementing the cache mechanism yourself.
Find out when you need lazy loading and when it's slowing you down (due to the select n+1 problem).
If you have performance issues optimize. You don't have to map every single relationship. Use custom SQL in separate methods to get the objects you need when you need them. You can write a query that only returns table ids and afterwards ask Hibernate to load the corresponding objects.
Optimize your SQL. Avoid joins, use subselects, where id in etc.
Implement (database) paging if it makes sense.
If all else fails, start using plain SQL. You' ll have already written the most complex queries and you'll know where your bigger bottlenecks are.
You could use a local SQLite to save the less volatile data and talk to the database mainly to get lists of ids and the stuff that you're missing. For example if you have users and orders, you can assume that you will have many more new orders per minute/second than users per hour.
To sum up, set clear performance goals before you start, always use a separate thread for data retrieval, avoid reinventing the wheel and keep it as simple as possible.

Here goes some generic approaches to the problem.
0) HW: make sure you are not having bottlenecks in you hardware, that you can cheaply increase. (adding HW is faster and cheaper that dev hours in most cases)
1) Caching:
Perhaps you can cache (locally or in a distributed cache like memcache) the 40% of data that tends to be immutable. You could invalidate the cache when data gets modified. You should choose the right entities and granularity level for building the keys.
2) Replication:
If the first is to much overhead, you could create slaves of your mysql and read from there. Again, you have to know when you can afford to have some stale data.
3) NoSQL:
Moving in that direction, but increasing the dev effort, you could move to some distributed store (take a look at the CAP theorem before making a choice)
Hope it helps

Depends on your database structure and application. You can use an object relational mapping library like ormlite and refresh objects loaded from database at the background with threads. With ormlite you may also use LazyForeignCollection to load only required data in your application.

Minimize unnecessary database call.
If your fields on database is changing, you can shift from relational to NoSQL database like MongoDB.
You can perform multithreading on the server side for data processing and clustering of application servers. While using multithreading use it effectively, be aware of the sychronized keyword, it will degrade the performance to some extend.
Perform best practice of coding, don't use more instance variable, try to use local variable, it will make you thread safe also.
You can use Mybatis for ORM also for large queries.
You can perform caching on DAO layer, service layer and even in client side but be sure to sychronize with the database, you can use different caching soutions.
You can do database indexing for first retrival.
Do not use same service for large data querying break it down into different services which will help u to process in multithreading way.
If the application is not very hard real time system you can use messaging solution also, like asychronously processing of data.

JPA Native Queries versus 'pure' JPA persistence

I have a scenario where in I need to keep a log of all incoming files (flat, xml) to an application. This log table is hardly used, except for fault investigation or regulatory purposes and things like that, and data will be purged regularly.
We are using JPA 2.0 for persistence. We tried the initial prototype with pure JPA persistence using entityManager.persist(); and flush immediately. But the performance was not up to the expectation. So I suggested NativeNamedQueries for this operation and the performance improvement was huge (300 milliseconds vs 47 milliseconds) on tests.
But the lead engineer is bit adamant on using NativeNamedQueries, saying that its coupled to the database and less maintainable and things like that.
Questions :
What is your take on this, in case if you had to take a decision. How often does database or schema changes happen once the application goes to production ?
Is there any other way to improve performance? Performance is very very critical for this application.
Its only 4 years since I started programming, but never seen a DB schema change or DB provider change happening for an existing application.
Note : We are using EclipseLink 2.3 and Oracle. Also its a fresh application that we are developing. Just in case these points makes question more clear

How often does database or schema changes happen once the application goes to production ?
This is immaterial to your problem at hand. The quantity of changes to database schemas does not matter. What matters is the maintainability of your database model, how well it has been designed. Most business apps will see a lot of changes being done if sufficient performance testing hasn't been done, which is sadly true for most apps.
If you are a writing a typical line-of-business application, I would expect some form of round-trip engineering between the object model and the database model to occur in development. Your DBAs ought to own and know the database model quite well, so that they can aid or perform the fine-tuning the queries issued by your ORM framework. This is keeping in mind that you may not rely on the queries issued by the ORM framework alone. All changes should preferably be done and tested in the development and integration-testing (and possibly UAT, if you have one) environments before it is rolled out to production, and as common sense would suggest, all changes would be under version control.
On the topic of coupling the queries to a database, then that is a decision your business has to take. If you are in the business of supporting multiple databases, then you ought to testing against all. Also, you should be capable of providing different distributions for supporting different databases; this is made easier if you place your native queries in database specific orm.xml files like orm-oracle.xml, orm-mysql.xml etc. and rename the files to orm.xml before you prepare a distribution. Using Maven or Ant would make the proposed change easy to implement.
Is there any other way to improve performance? Performance is very very critical for this application.
That would depend on how well you have designed your object and data models, how well you've understood your ORM framework and how willing you are in "corrupting" your object model.
The first bit of performance tuning any application is to always measure twice and cut once. You cannot simply iterate through a list of possible solutions and try each one of them without knowing how they work and in what circumstances they are useful; okay, you could do that if your business is willing to invest time in that, but it is often not the case.
To begin, you'll need to understand why native queries are providing or appear* to provide a better performance. Maybe this has got a lot to do with the fact that you are merely inserting data, and it would be better for an ORM framework to simply issue the INSERT statement rather than construct one from HQL or the abstract query notation used under the hood; only a profiler will reveal the difference.
If the above is true, then you could reconsider whether your audit tables must be managed by the ORM framework. If your application is responsible for only writing to these tables and not reading from them (and it is quite possible that another app is responsible for reading the entries), then I would suspect that not managing these tables in ORM would provide better performance, especially if you use plain JDBC to issue the INSERT statement. The reason is quite simple - if your ORM framework is managing the entity, then it is also responsible for managing the persistence context (which now includes the class and the associated table); not having ORM manage the entity would possibly result in the scenario where the persistence context need not be updated at all for audit entries.
There is a healthy possibility of other performance tuning measures that you can undertake, but like I stated earlier, it would require you to understand a profiler report and estimate which possible choices would be better in your application.
* I'm afraid that unless you publish benchmarks and how you conducted them I will be skeptical of claims.

It's quite rare that you actually DO switch the database provider, especially once you've paid several 100k's of license for an excellent and high-performant database like Oracle. Besides, the SQL syntax variants of the INSERT statement are not so distinct that you wouldn't be able to switch the database, even when using native SQL, exceptionally.
I don't see why patching a single query that needs extra tuning is bad. Ask your lead developer why he's so strict. But before you do, use a profiler, such as JProfiler, or Yourkit to identify the exact spot that's causing the performance issues. With JPA, any of these may cause issues: caching, eager loading of dependent data (which you wouldn't need, probably), inefficient SQL generation, a bad query execution plan in your Oracle database, etc... Maybe you don't need a native query after all.
If performance is so critical, then maybe JPA is not good enough for the job. Have you (and your lead developer) considered other frameworks such as jOOQ, QueryDSL, MyBatis or anything similar? I have understood from your comments that your main use-cases are OLAP-querying, and not OLTP, hence you might even like to use advanced Oracle features, such as analytic functions and data-warehousing functionality, for which jOOQ has native support, for instance...

1) I have seen only 2 applications that moved from oracle to MySQL (to save on license costs) in 10 years, so it's not something that happens very often, BUT if you want to write integration tests using another database (eg hsqldb) you'll be in trouble.
About how often schema changes after an app goes to production, my answer is: A LOT!! If the app will be updated regularly, expect LOTs of changes, as usually the team understand the business better. I even worked on the project in which the schema was considerably different after one year of the app going live.
At the same time, this looks like you deferred optimizing the until the last posible time (a good thing to do) and now you need optimize the sql using some native queries (which also happens quite regularly)... What I'm trying to say is that your idea doesn't sound bad at all for me.
2) In the past I've used a mix of Hibernate and iBatis (or mybatis nowadays) for similar situations (in case you want to check iBatis). And one question, why are you doing a flush() after each persist()? You shoulnd't really need to do that.
Also, I'm quite surprised that the inserts take so much longer if they're done in EclipseLink. The calls to persist() should take almost the same amount of time as native query (I assuming they'll take longer if there is any lifecycle callbacks). I assume you've seen the sql generated by eclipseLink, is it that different?
I know my answer is not specific at all, but I hope it helps.

Can I use hibernate for data centric applications?

I wag going through a hibernate tutorial, where they say that hibernate is not suitable for data centric application. I am very much impressed by the 'object oriented structure' it gives to the program, but my application is very much data centric(it fetches and updates huge number of records. But I dont use any stored procedures). Cant I use hibernate?Are there any wrappers written over hibernate, which I can use for my application?Any help is appreciated.

I am not sure about specific meaning of phrase data centric. Aren't all database applications data centric? However, if you do process tons of data, Hibernate may not be the best choice. Hibernate is best to represent object models mapped to the database and it may have role in any application, but to do ETL (extract/transform/load) tasks you may need to write very efficient SQL by hand.

In principal you can, but it tends to be slow. Hibernate more or less creates an object for every row retrieved from the database. If you do this with large volumes of data, performance takes a serious hit. Also updates on many rows using a single update have only very basic support.
A wrapper won't help, at least with the object creation issue.

There are many advantages of using Hibernate, when one gets their object model correct as a developer there is a lot of appeal in interacting with the database via objects but in practice I have found initially Hibernate is great but becomes very frustrating when you come against issues like performance and fault finding.
When it comes to decision on the DA (Data Access) layer I ask myself this question.
Am I writing an application which has a requirement to run an different databases?
If the answer is yes then I will consider an (ORM) like Hibernate.
If its no then I will normally just use JDBC normally via Spring.
I feel that interacting with the database via JDBC is a lot more transparant and easier to find faults and performance tune.

Native Stored Proc v/s Hibernate

I am a newbie in Hibernate.
I am working on a cloud service data access layer.
Currently we are using Hibernate for OR mapping and as data access layer using Hibernate annotations. But lately i have been asked to implement Hibernate/Data Access layer in such a way that my stored procedures be in HQL and we can change our DB at a short notice and port our entire code.
The closest i can think in this regard is by using Named queries , where stored procedures are at DB side and my hibernate is resolving the stored procedure calls using named queries.
The reason for all that is the notion that since stored procedures are precompiled therefore they give good performance and security optimization for a large cloud service implementation.
currently i am using java , hibernate and Mysql.
Can anybody examine my assumptions and validate or give/suggest some better alternatives.
Performance and security are top priority.

I think the approach you outlined is great.
That is exactly what I would do if I were in your position. (I'm on Hibernate backed by MySql also, and have considered doing this if needed for performance reasons.)

Since parsing and optimizing statements is fast with the most DBMSes, I prefer not to use stored procedures if my application is the 'owner' of the catalogue(s).
With stored procedures, migration and maintainance can become more difficult which outweights the little tiny performance profits.
Cases, where I see the benefits of stored procedures:
I'm not the owner of the database. Access to data is provided by database developers / maintainers (like you find in Datawarehouses often). So stored procedures are an interface to the data.
Statements are complex and the runtime is unpredictable or should have no affects on my application (like triggering long running transactions or batches).
Hope, that'll help you with your decision.

Highest Performance Database in Java

I need ideas to implement a (really) high performance in-memory Database/Storage Mechanism in Java. In the range of storing 20,000+ java objects, updated every 5 or so seconds.
Some options I am open to:
Pure JDBC/database combination
JDO
JPA/ORM/database combination
An Object Database
Other Storage Mechanisms
What is my best option? What are your experiences?
EDIT: I also need like to be able to Query these objects

You could try something like Prevayler (basically an in-memory cache that handles serialization and backup for you so data persists and is transactionally safe). There are other similar projects.
I've used it for a large project, it's safe and extremely fast.
If it's the same set of 20,000 objects, or at least not 20,000 new objects every 5 seconds but lots of changes, you might be better off cacheing the changes and periodically writing the changes in batch mode (jdbc batch updates are much faster than individual row updates). Depends on whether you need each write to be transactionally wrapped, and whether you'll need a record of the change logs or just aggregate changes.
Edit: as other posts have mentioned Prevayler I thought I'd leave a note on what it does:
Basically you create a searchable/serializable object (typically a Map of some sort) which is wrapped in a Prevayler instance, which is serialized to disk. Rather than making changes directly to your map, you make changes by sending your Prevayler instance a serializable record of your change (just an object that contains the change instruction). Prevayler's version of a transaction is to write your serialization changes to disk so that in the event of failure it can load the last complete backup and then replay the changes against that. It's safe, although you do have to have enough memory to load all of your data, and it's a fairly old API, so no generic interfaces, unfortunately. But definitely stable and works as advertised.

I highly recommend H2. This is a kind of "second generation" version of HSQLDB done by one of the original authors. H2 allows us to unit-test our DAO layer without requiring an actual PostgreSQL database, which is awesome.
There is an active net group and mailing list, and the author Thomas Mueller is very responsive to queries (hah, little pun there.)

I don't know if it is the fastest option, but I've been very satisfied with H2 whenever I've used it. It's written by the same person who originally wrote Hypersonic (which later became HSQLDB).
Another option that is allegedly very fast is Prevayler.

It is a bit of an old question, but these days there is a whole lot of databases that have a level of performance of 20,000/s. Which database to chose depends on data structure and type of queries you'd like to be making. It also depends on overall volume.
We had similar problem with large volume of time series data, about 300,000 rec/s and we ended up writing a new database, with simple enough API and decent performance. It can do about 2,000,000 object writes/s and we did away without ORM.
It later evolved into QuestDB.

Try the following, it performs really well with Hibernate and other ORM frameworks
http://hsqldb.org/

Chronicle Map is an embeddable pure Java persistent database, providing a simple java.util.Map interface. It withstands about 1 million queries/updates per second from a single thread, consistent read/write performance and scales almost linearly to the number of cores in the machine.
Here are some recent performance research with actual numbers:
Comparison of Jetbrains Xodus, Oracle Berkeley DB JE BTree, MapDB TreeMap, Chronicle Map and H2 MVStore Map
LmdbJava Benchmarks

I would give a try to OrientDB.

Terracotta might also be an answer for you. It allows multiple VMs to share objects so you can distribute load etc...

You can also check out db4o

If you want to store all of your data in memory, you might want to look at Prevayler.
I've never used it myself, but it seems like a much better solution than using a relational database for those cases in which all of your data can be stored in memory.

Berkeley DB for Java is a fast in memory database, extremely useful for simple object graphs.

hsqldb is quite fast, but it is not ACID transaction-safe. The fastest java-database I know is db4o: benchmarks.
Edit: Please notice that Prevayler is not a database, see http://www.prevayler.org/wiki.jsp?topic=PrevaylerIsNotADatabase. If you're out of RAM, you're out of luck.

H2 is truly fantastic, indeed, in memory, normal server and transactional, you have it all. However It doesn't compare in performance to the object databases, I see Db4o mentioned, I have had much better performance with Neodatis in fact, and everything nicely set up in Maven repositories. Although not very robust, like a Ferrari, fast but not a truck like Oracle.

You can try CSQL (available under open source and enterprise version) It provides 30X performance improvement over disk based database systems and provides JDBC interface. It can be configured to work as stand alone main memory database or as a transparent cache to MySQL, Postgres, Oracle databases.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.