I'm doing a school software project with my class mates in Java.
We store the info on a remote db.
When we start the application we pull all the information from the database and transform it into objects to use in our application (using java sql statemens).
In the application we edit some of these objects and then when we exit the application
we save or update information in the database using Hibernate.
As you see we dont use Hibernate for pulling in information, we use it just for saving and updating.
We have 2, but very similar problems.
The loading of object(when we start the app) and the saving of objects(with Hibernate) in the db(when closing the app) is taking too much time.
And our project its not a huge enterprise application, its a quite small app, we just manage some students, teachers, homeworks and tests. So our db is also very very small.
How could we increase performance ?
later edit: if we use a local database it runs very quick, it just runs slow on remote databases
Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device, and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, then you could be running into memory issues as well - first, by not caching the entire database in memory, you will reduce memory and will spread out the network load from the beginning/end to the times when it needs to happen.
These 2 sentences are red flags for me :
When we start the application we pull
all the information from the database
and transform it into objects to use
in our application (using java sql
statemens). In the application we edit
some of these objects and then when we
exit the application we save or update
information in the database using
Hibernate.
Is there a requirements reason that you are loading all the information from the database into memory at startup, or why you're waiting until shutdown to save changes back in the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for both all of your CRUD (create, read, update, delete) operations. And, I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
Something else to look at. How are you using Hibernate to save the entities to the db? Are you doing a getHibernateTemplate().get on each one and then doing an entity.save or entity.update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.
For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes? If all of your updates are saved only in memory until the application is closed, everything would be lost if the app closes unexpectedly.
The difference in loading everything from a remote DB server versus loading everything from a local DB server is the network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed? 10/100/1000? Figure between 10 and 20% of your pipe size is going to be overhead due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. In that case, you still don't load everything as objects in the application, you just work from a local data store which gets sync'ed with the remote server every so often.
The project its pretty much complete. we cant do large refactoring on it now.
I tried to use a second level cache for Hibernate when saving. EhCacheProvider.
in hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
i have done a config for the cache, ehcache.xml:
i have put the cache.jar in the project build path
and i have set the hibernate property for every class and set in the mapping.
But this cache doesn't seem to have an effect. I dont know if it works(if it is used).
Try minimising number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time of remote database server - it might be the problem.
As your application is just slow when running on a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible, see here in case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For hibernate you may use its batch functionality and adjust hibernate.batch_size.
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method time or sql queries) to find the bottleneck. I bet you'll find thousands of queries, each taking 10ms RTT) which could be merged into one.
Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate what the bottlenecks are.
Why dont you have two separate threads?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by activity is almost always a function call, method call, or property call, performed on a specific line of code for a specific purpose. If may invoke I/O or it may invoke computation, or both. If its purpose is not essential, then it can be optimized.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.
Thanks for your answers. Their were more than helpful.
We completely solved this problem like so:
Refactored the LOAD code. Now it uses Hibernate with Lazy Fetching.
Refactored the SAVE code. Now it saves, just the data that was modified and right after the time it was modified. This way we dont have a HUGE save an the end.
Im amazed of how good it all went. The amount of new code we had to write was very very small.
Related
We have a situation where we have to perform a lengthy query to the database based on human input. As the input changes, the query has to be done over and over again, and the input may change once per second.
The problem is, we know that this will cause a spike in server activity for several seconds, and since it is not critical to have an answer immediately or on every input change, it means we can afford executing or not executing the query.
The criteria we would like to use is the current state of the database server, and only allow the query to be done if it is in a low or medium load state, skipping the query when the database server is under stress.
We use Oracle database for this, and so far we have not found any way, from Java, to do this except by actually loading into the server a known query and benchmark it, but that is essentially adding some load to the server. So my question: is there any other way, specifically in Oracle database, where we can discover from the Java side of the application the load of the database?
Depending on how you define "low or medium load state", I'd guess that hitting v$osstat would give you the information you're after Of course, hitting v$osstat constantly will also add to the load on the server. You may want to write a job that copies the v$osstat data to a table you control periodically (and can thus index appropriately) so that your application can hit that table rather than hitting the dynamic performance view constantly. Depending on the goal (i.e. are you trying to ensure that other users have enough resources or are you trying to ensure that your app remains responsive), you may want to use Resource Manager to control resource utilization among users, you may want to run the query asynchronously from the application, and/or you may want to use some sort of cache at the middle tier to avoid hitting the database every time.
I have a message driven bean running on Glassfish. It receives hundreds of messages per second. After receiving a message, it needs to read the values in the JDBC database and process.
However, the values in database will only be updated one time a day or less. So what the MDB will read is consistent most of the time. So, is it there a good way to cache the content into the memory in order to increase the performance?
Update: is it posible to configure a in-memory database JDBC Connection Pool in Glassfish for the MDB?
You may get inspired by the identity map pattern and implement your own cache mechanism (with your own expiration policy), if you think that using a third party solution (like memcached) would be overkill.
The most obvious answer is to define the table as a MEMORY table type. If the underlying hardware is not prone to crashing and the OS is stable and there's a UPS attached, you might want to think about this. It also depends on the consequences of losing a number of transactions since the last backup whenever this fails. But performance-wise, this is lightning fast. More information can be found here for MySQL. (YMMV)
I've implemented multiple tables that way, and it worked great for me.
You can use a simple SoftReference HashMap based cache. Here is the complete implementation example SoftReference Cache In addition you can clear the complete Map periodically to bring in fresh data.
Or in case you can use 3rd party library, you might use ReferenceMap provided as part of Apache Collections.
It just occurred to me why not to have most of the objects in a cache(memory) when an application start.
if it's not that large web application. Or to have a settings for how much I want to put in the cache/memory.
I just guess it could require to have something like below 1 GB RAM or a lot less.
Everything in order to speed up the application even more by not querying database.
Is it good idea?
Caching is definitely a good idea and is widely used, but it has to be implemented correctly. There are plenty of pitfalls if done incorrectly. Try looking into one of the big proven systems, like memcached.
Caching is definitely a good idea.
Databases are also not a catch-all solution, though you have to be careful about consistency between runs of your program. What if you change the data but your program crashes before you update it to the database?
There are also lightweight memory resident databases that can let you keep your current queries for now, but run much stuff from memory. Using an ORM tool instead of SQL is particularly effective for this since the switch is almost transparent.
Quickly becomes Not so good idea, when some other node starts updating database.
In that case your cache will be holding stale data.
You can maintain a cache of Frequently Used objects in the memory, just don't forget to add methods to refresh the cache when the underlying database state changes.
Eg: If you have a user's table and you need user names in many many pages, then load the entire table in cache at time of Application Startup, just make sure to update the cache when you are adding new users online or modifying / deleting entries from user table
You don't persist objects to database. What you persist is object's state. So that you can have exactly the same state even after your app stops/closes/restarts. If you want to keep states of your objects persisted, you have no choice, but to use db (or anything else, that allows you to write data to file system).
The details are beyond the scope of an answer here, but we have had good experience of using ehCache ( http://ehcache.org/ )
The combination of support for distributed caches, and overflow to disk has allowed us to keep large numbers of computationally heavy, but fairly unchanging pages in the cache for a site being served from multiple tomcats.
Distribution addresses the question of staleness (if you invalidate your items correctly) and the disk overflow allows us to basically cache everything which was just not feasible with an in-memory cache.
Of course the implementation is not trivial for a real world application, but it improved our performance significantly once the caches were bubbling.
I'm using Java 1.6, JTDS 1.2.2 (also just tried 1.2.4 to no avail) and SQL Server 2005 to create a CallableStatement to run a stored procedure (with no parameters). I am seeing the Java wrapper running the same stored procedure 30% slower than using SQL Server Management Studio. I've run the MS SQL profiler and there is little difference in I/O between the two processes, so I don't think it's related to query plan caching.
The stored proc takes no arguments and returns no data. It uses a server-side cursor to calculate the values that are needed to populate a table.
I can't see how the calling a stored proc from Java should add a 30% overhead, surely it's just a pipe to the database that SQL is sent down and then the database executes it....Could the database be giving the Java app a different query plan??
I've posted to both the MSDN forums, and the sourceforge JTDS forums (topic: "stored proc slower in JTDS than direct in DB") I was wondering if anyone has any suggestions as to why this might be happening?
Thanks in advance,
-James
(N.B. Fear not, I will collate any answers I get in other forums together here once I find the solution)
Java code snippet:
sLogger.info("Preparing call...");
stmt = mCon.prepareCall("SP_WB200_POPULATE_TABLE_limited_rows");
sLogger.info("Call prepared. Executing procedure...");
stmt.executeQuery();
sLogger.info("Procedure complete.");
I have run sql profiler, and found the following:
Java app :
CPU: 466,514 Reads: 142,478,387 Writes: 284,078 Duration: 983,796
SSMS :
CPU: 466,973 Reads: 142,440,401 Writes: 280,244 Duration: 769,851
(Both with DBCC DROPCLEANBUFFERS run prior to profiling, and both produce the correct number of rows)
So my conclusion is that they both execute the same reads and writes, it's just that the way they are doing it is different, what do you guys think?
It turns out that the query plans are significantly different for the different clients (the Java client is updating an index during an insert that isn't in the faster SQL client, also, the way it is executing joins is different (nested loops Vs. gather streams, nested loops Vs index scans, argh!)). Quite why this is, I don't know yet (I'll re-post when I do get to the bottom of it)
Epilogue
I couldn't get this to work properly. I tried homogenising the connection properties (arithabort, ansi_nulls etc) between the Java and Mgmt studio clients. It ended up the two different clients had very similar query/execution plans (but still with different actual plan_ids). I posted a summary of what I found to the MSDN SQL Server forums as I found differing performance not just between a JDBC client and management studio, but also between Microsoft's own command line client, SQLCMD, I also checked some more radical things like network traffic too, or wrapping the stored proc inside another stored proc, just for grins.
I have a feeling the problem lies somewhere in the way the cursor was being executed, and it was somehow giving rise to the Java process being suspended, but why a different client should give rise to this different locking/waiting behaviour when nothing else is running and the same execution plan is in operation is a little beyond my skills (I'm no DBA!).
As a result, I have decided that 4 days is enough of anyone's time to waste on something like this, so I will grudgingly code around it (if I'm honest, the stored procedure needed re-coding to be more incremental instead of re-calculating all data each week anyway), and chalk this one down to experience. I'll leave the question open, big thanks to everyone who put their hat in the ring, it was all useful, and if anyone comes up with anything further, I'd love to hear some more options...and if anyone finds this post as a result of seeing this behaviour in their own environments, then hopefully there's some pointers here that you can try yourself, and hope fully see further than we did.
I'm ready for my weekend now!
-James
You can attach the Profiler and monitor for the events SQL:BatchCompleted and SP:Completed, with a filter on duration > 1000. Run the procedure from your Java client and from SSMS. Compare the Reads and the Writes of the two events (Java vs. SSMS). Are they significantly different? This would indicate considerably different execution paths or plans, with significant difference in I/O.
Also try to capture the Showplan XML event of the two and compare the plans (save the event as a .sqlplan file, open it in SSMS to easy analysis). Do they have similar plans? Are there wild differences in Estimate vs. Actual (rows, rewinds, rebinds)? Do they have same degree of parallelism? The plans can aso be retrieved from sys.dm_exec_requests view.
Are there any warning events raised, like Missing Column Statistics, Sort Warnings, Hash Warning, Execution Warnings, Blocked Process?
the point is that you have at your disposal a whole arsenal of investigation tools. Once you find the root cause of the difference, you can trace it down to what is different between your Java environment settings and the SSMS environment (ADO.Net SqlClient). Things like default transaction isolation level, ANSI settings etc etc.
Checking: Is your problem that two applications (SSMS, Java) are making the exact same identical call to SQL Server, and SQL Server is acting differently for each? If so, I hit things like this every year or two, and they hurt my brain for days.
Once, I ultimately isolated each process call and logging everything for the entire process in Profiler. I eventually noticed that the Login event (under TextData) showed a host of information, like so:
-- network protocol: TCP/IP
set quoted_identifier on
set arithabort off
set numeric_roundabort off
set ansi_warnings on
set ansi_padding on
set ansi_nulls on
set concat_null_yields_null on
set cursor_close_on_commit off
set implicit_transactions off
set language us_english
set dateformat mdy
set datefirst 7
set transaction isolation level read committed
The "Existing Connection" event will show this information as well--but, sometimes immediately subsequent calls (batches, RPCs, I disremember just now) are sent [ISQL or OSQL did this, I think] to immediately reset some of these -- Arithabort and Quoted_Identifier seem to be favorites, and other SET options also get modified depending on the settings or requirements of whatever connectivity protocols your application's database interface is using.
Another one: some settings are kept as attributes of a procedure at "create" time, and others are factored in at compile time. On the one hand, your connection's SET values may be being overwritten by the configuration saved at the time the procedure was created; on the other hand, your two connections may differ so much that two execution plans are generated for one procedure. (All of this information is, after sufficient research, available in the sys. tables and DMVs.)
In short, it seems to me that SQL obscurities are messing you up. To this day, I loathe all these goombah settings. Things below my notice keep messing around with them [I mean, really, what fool would set implicit_transaction for a connection pool on? But once they did...] and it's hard to build structures when the ground (rules) keep changing out from underneath you. After all, remember what the guy said about building castles in a swamp...
I recall having a similar issue a while ago, because JTDS was silently converting a string parameter to Unicode or something similar. As a result of that conversion, SQL Server was unable to use the index which is was using when we ran the stored proc from SSMS.
HIH
Does the Java case include transmission of the results to the Java server (network overhead) plus some Java processing? A 12 minute query might produce quite a large amount of data.
If you are looking at the profiler and there is no difference between the executions then the difference must be with the client systems.
4 mins does seem like to long just to prepare a statement to send so the 12 min wait must cause some other effect -- no idea what it is.
I am not sure if this post is still relevant. We faced a similar problem in our application.
One key difference between running a stored procedure in SQL Management studio and one running from JDBC is that of transaction context. If you are using an ORM in Java, by default the stored procedure runs in a transaction context. When you run a stored procedure directly in SQL management studio the transaction is off. There is a substantial performance difference.
Sorry, I've not found a correct answer to this, so I don't want to allocate any of these as correct, so I am going to mark this answer as correct, and wish anyone luck who comes across anything similar!
Did you know that Microsoft ship JDBC drivers for their databases?
These may be more performant.
Obviously.. you may have resolved the problem by now.
I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with 1 - Many Relationship.
2) The objects are updated every 5 seconds, with the most recent updates persistent in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds). (IE: Give me all of the objects with this timestamp or give me all of the objects within these location boundaries).
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverablity.
I am not exactly happy with the performance. Especially the JMS part of this.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else need to give the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?
I know it's not what you asked, but, you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB and he claims the performance is much better. I'm using it for some time now and I'm very happy with it. It should be a very quick transition (add a Jar, change the connection string, create the database) so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.
At first, Lucene isn't your friend here. (read only)
Terracotta is to scale around at the Logical layer! Your problem seems not to be related to the processing logic. It's more around the Storage/Communication point.
Identify your bottleneck! Benchmark the Storage/Logic/JMS processing time and overhead!
Kill JMS issues with a good JMS framework (eg. ActiveMQ) and a good/tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you like to stay at Hibernate and HSQL, check out the Hibernate 2nd level cache and connection pooling (c3po, container driven...)!
Several Terracotta users have built systems like this in the past, so I can you tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.
I am currently working on writing the client for a very (very) fast Key/Value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC and right now I'm battling with good old Java network IO to break my current 30,000 get/sets per/sec barrier. Hope to be done within the week. Drop me a line through my account and I'll get back when its show time.
With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into any one of the distributed DB solutions that's been on the news lately. (CouchDB, Cassandra)
Maybe you should take a look to: Prevayler.
Your objects are always in mem.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.
You don't say what vendor you are using for JMS, but I wouldn't surprise me if you have some bottle neck there. I couldn't get more than 100 messages a second from ActiveMq, and whatever I tried in terms of configuration of acknowledgment, queue size, etc we were unable to soak the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.
Guaranteed messaging is going to be much slower than volatile messaging. Given every object is updated every few second, you might consider batching your updates (into say 500 changes or by time say 1-10 ms' worth), sending over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. Tuning your use case you may find smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower, then you could use compression over JMS)
You can achieve much higher performance with a custom solution (which also might be simpler) e.g. Hazelcast & JGroups are free (you can add a node(s) which does the database synchronization so your main app doesn't slow down). There are commercial products which handle in the order of half a million durable messages/sec.
Terracotta + jofti = queryable persistent clustered data structures
Search google for terracotta querymap or visit tusharkhairnar.blogspot.com for querymap blog
You may want to integrate timasync as well to update your database. Database is is your system of record use terracotta as caching and database offloading mechanism you can even batch async updates to make it faster so that I'd db contains fairly recent data
Tushar
tusharkhairnar.blogspot.com