Database query vs java processing

Is it better to make two database calls, or one database call followed by some Java processing?
The single database call would fetch only the relevant data, which then has to be separated into two different lists; that takes a few lines of Java.

Database round trips are always an expensive operation. If you can manage with one DB fetch and do some Java processing, that should be the better and faster choice for you.
But you may have to analyze, for your scenario, which one turns out to be the more efficient choice. I assume a single DB fetch plus Java processing should be better.
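As a rough sketch of the "few lines of Java" for the single-fetch option: the Order type and the shipped flag below are made up for illustration and stand in for whatever your one query actually returns. The split itself is one pass over the rows.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SplitResults {
    // Hypothetical row type; stands in for whatever the single query returns.
    static final class Order {
        final int id;
        final boolean shipped;

        Order(int id, boolean shipped) {
            this.id = id;
            this.shipped = shipped;
        }

        boolean isShipped() {
            return shipped;
        }
    }

    // Splits the rows of one fetch into two lists in a single pass.
    // The result always contains both keys, true and false.
    static Map<Boolean, List<Order>> splitByShipped(List<Order> rows) {
        return rows.stream().collect(Collectors.partitioningBy(Order::isShipped));
    }

    public static void main(String[] args) {
        List<Order> rows = List.of(new Order(1, true), new Order(2, false), new Order(3, true));
        Map<Boolean, List<Order>> parts = splitByShipped(rows);
        System.out.println("shipped: " + parts.get(true).size());  // shipped: 2
        System.out.println("pending: " + parts.get(false).size()); // pending: 1
    }
}
```

Whether this beats two separate queries still depends on how much extra data the combined query drags over the wire.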

Testing is key. Some questions you may want to ask yourself:
How big is each database call?
How much bigger/smaller would the calls be if I combined them?
Should I push the processing to the client?
Timing
How time-critical is the processing?
Do you need to swamp the DB, or is it okay to piggyback on the client?
Is the difference negligible?

Java processing is much faster than an SQL fetch. I had the same problem, so I recommend a single fetch followed by some processing. The two options may differ only slightly in time on your machine, but some computers take a long time to fetch data from the DB, so I suggest you get the data once and do the rest in Java.

Generally, Java processing is better unless what you are doing is just a simple DB query.
I would recommend trying them both, measuring time and load, and seeing what fits your application best.
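For the "measure some time" part, a very crude harness could look like the sketch below. The two Runnables in main are placeholders I made up; you would put your own "two queries" and "one query plus Java processing" code there. It ignores JIT warm-up and GC noise, so treat the numbers as a rough indication only.

```java
class Timing {
    // Runs the task `runs` times and returns the average elapsed time in milliseconds.
    static double averageMillis(Runnable task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        // Placeholders: replace the bodies with the two strategies being compared.
        Runnable twoQueries = () -> { /* run query A, then query B */ };
        Runnable oneQueryPlusJava = () -> { /* run combined query, split in Java */ };

        System.out.println("two queries:    " + averageMillis(twoQueries, 100) + " ms");
        System.out.println("one query+Java: " + averageMillis(oneQueryPlusJava, 100) + " ms");
    }
}
```

Run it against the real (remote) database and with realistic data volumes, otherwise the comparison says little.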

It all depends on how intensive your processing is and how your database is set up. For instance, for complex operations, an Oracle instance running on a native file system will most likely be more performant than doing the processing yourself in Java. Note that most built-in operations on well-known databases are highly optimized and usually very performant.

What is faster? SQL database read/write or file read/write?

I am implementing a load testing tool in Java. I want to store load test results temporarily so I can match them against the expected results at the end. I cannot match responses while the load test is running, as that would affect the load test speed.
So I want to store the results with minimal write time. What is the best way to store the data? Write to a local database, or write to the local HDD as files?
Note: Results cannot be kept in RAM, as they may be several gigabytes in size.

A file is more efficient than a DB. A DB is software built on top of the file system, so if you implement your file-based solution well, it will be faster and will not carry the overhead of the DB.
Also, with a DB, using an ORM (JPA) makes the code slower than pure JDBC. More convenience in data handling (file -> JDBC -> JPA) costs more time.
I suggest you use plain file I/O for this purpose, with a faster technology such as NIO (New I/O) in Java.
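To make the file option concrete, here is a minimal sketch of an append-only result log using java.nio.file. The class name ResultLog and the tab-separated one-record-per-line layout are my own assumptions, and it assumes request ids and responses contain no tabs or newlines.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class ResultLog implements AutoCloseable {
    private final BufferedWriter out;

    ResultLog(Path file) throws IOException {
        // Buffered, append-only writer; creates the file if it does not exist yet.
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Appends one record per line; no matching against expected results happens here.
    void record(String requestId, String response) throws IOException {
        out.write(requestId);
        out.write('\t');
        out.write(response);
        out.newLine();
    }

    @Override
    public void close() throws IOException {
        out.close(); // flushes whatever is still buffered
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("loadtest", ".log");
        try (ResultLog log = new ResultLog(file)) {
            log.record("req-1", "200 OK");
            log.record("req-2", "500 ERROR");
        }
        System.out.println(Files.readAllLines(file, StandardCharsets.UTF_8).size() + " records written");
    }
}
```

The writer is not thread-safe as written; with several load threads you would want one log per thread or some synchronization around record.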

With load testing, the usual approach is to have multiple agents putting load on the system. You then need some benchmark on the system and on the agents. Usually the load is split so that each agent can keep its own statistics. When the test is finished, you can aggregate the agents' statistics "offline" and cross-compare them with the benchmark on the system side.
To answer your question about I/O write speed: it depends on many different factors, so you would need to benchmark both scenarios. However, considering that a database needs to support transactions and indexing as well as storing the data, my blind guess is that in your use case a file with raw data would be faster.

BigQuery performance and Running concurrent jobs

We are working with Google BigQuery (using Java) for one of our cloud solutions and are facing a lot of issues in development. Our observations and issues are as follows:
We are using query jobs for data retrieval (example: the jobs().insert()/jobs().query() method first, and then tabledata().list() for the data). Job execution takes 2-3 seconds (we only have megabytes of data right now). We looked into the sample code on code.google.com and github.com and tried to implement it. However, we are not able to get execution faster than 2-3 seconds. What is the fastest way to retrieve data from BigQuery tables? Is there a way to improve job execution speed? If yes, can you provide links to sample code?
In our screens, we need to fetch data from different tables (different queries) and display it. So we inserted multiple query jobs, and the total execution time adds up (example: with two jobs, i.e. two queries, it takes 6-7 seconds). The Google documentation mentions that we can run concurrent jobs. Is there any sample code available for this?
Waiting for your valuable responses.

Querying cached results can be much faster. If you can run the query independently beforehand, the following run of the same query will be faster.
Check that the bottleneck is not related to the network, pagination, page rendering, etc. You can do this by executing only the second step.
Parallel jobs might be queued on the BQ end, depending on its current load.
My recommendation would be to separate the query from the presentation. Run the BQ queries, retrieve the "small size" data into a fast-access data store (flat file, cache, Cloud SQL, etc.) and present it from there.
As Pentium10 says, BQ is excellent for huge data sets (and returns results faster and cheaper than any other comparable solution). If you are looking for a back end for a fast reporting/visualization tool, I am afraid that BQ might not be your solution.
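On the "concurrent jobs" part of the question: I am not reproducing the BigQuery client calls here, but the general pattern of firing several independent queries at once and waiting for all of them can be sketched with plain java.util.concurrent. Each Supplier below is a hypothetical stand-in for "submit one query job and fetch its rows"; the sleeps in main only simulate a 2-second job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

class ParallelQueries {
    // Starts every query on its own thread and returns the results in submission order.
    static <T> List<T> runAll(List<Supplier<T>> queries) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, queries.size()));
        try {
            List<CompletableFuture<T>> futures = new ArrayList<>();
            for (Supplier<T> query : queries) {
                futures.add(CompletableFuture.supplyAsync(query, pool));
            }
            List<T> results = new ArrayList<>();
            for (CompletableFuture<T> future : futures) {
                results.add(future.join()); // blocks until this query has finished
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Simulated query: sleeps, then returns a label instead of real rows.
    static String fakeQuery(String name, long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return name + " done";
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        List<String> results = runAll(List.<Supplier<String>>of(
                () -> fakeQuery("query A", 2000),
                () -> fakeQuery("query B", 2000)));
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results + " in about " + elapsedMs + " ms");
    }
}
```

With this shape the total wait should be close to the slowest query rather than the sum, unless the jobs get queued on the BQ side as mentioned above.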

1) BigQuery is a highly scalable database before being a "super fast" database. It's designed to process HUGE amounts of data by distributing the processing among several different machines, using a technique named Dremel. Because it's designed to use several machines and parallel processing, you should expect super-scalability with good performance.
2) BigQuery is an asset when you want to analyze billions of rows.
For example: analyzing all the Wikipedia revisions in 5-10 seconds isn't bad, is it? But a much smaller table would take about the same time, even if it has only 10k rows.
3) Under this size, you'll be better off using more traditional data storage solutions such as Cloud SQL or the App Engine Datastore. If you want to keep SQL capability, Cloud SQL is the best guess.
Sybase IQ is often installed in a single database and it doesn't use Dremel. That said, it's going to be faster than BigQuery in many scenarios... as designed.
4) Certainly the performance differs from that of a dedicated environment. You get your dedicated environment for $20K a month.

Cache the database content in memory to increase performance

I have a message-driven bean running on Glassfish. It receives hundreds of messages per second. After receiving a message, it needs to read values from the JDBC database and process them.
However, the values in the database are only updated once a day or less, so what the MDB reads is the same most of the time. Is there a good way to cache the content in memory in order to increase performance?
Update: is it possible to configure an in-memory database JDBC connection pool in Glassfish for the MDB?

You may get inspired by the identity map pattern and implement your own cache mechanism (with your own expiration policy), if you think that using a third party solution (like memcached) would be overkill.
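A home-grown cache with its own expiration policy does not need much code. The following is only a sketch: the class name ExpiringCache, the fixed time-to-live, and the loader function (which would wrap your JDBC read) are all my assumptions, not something from the question.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ExpiringCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long loadedAtNanos;

        Entry(V value, long loadedAtNanos) {
            this.value = value;
            this.loadedAtNanos = loadedAtNanos;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final Function<K, V> loader; // e.g. a method that does the JDBC lookup

    ExpiringCache(long ttlMillis, Function<K, V> loader) {
        this.ttlNanos = ttlMillis * 1_000_000L;
        this.loader = loader;
    }

    // Returns the cached value, reloading it if it is missing or older than the TTL.
    V get(K key) {
        long now = System.nanoTime();
        Entry<V> entry = map.compute(key, (k, old) ->
                (old == null || now - old.loadedAtNanos >= ttlNanos)
                        ? new Entry<>(loader.apply(k), now)
                        : old);
        return entry.value;
    }
}
```

For data that changes once a day, a TTL of a few minutes already removes almost all database reads; the loader runs inside compute, so keep it reasonably quick.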

The most obvious answer is to define the table as a MEMORY table type. If the underlying hardware is not prone to crashing and the OS is stable and there's a UPS attached, you might want to think about this. It also depends on the consequences of losing a number of transactions since the last backup whenever this fails. But performance-wise, this is lightning fast. More information can be found here for MySQL. (YMMV)
I've implemented multiple tables that way, and it worked great for me.

You can use a simple SoftReference HashMap-based cache. Here is the complete implementation example: SoftReference Cache. In addition, you can clear the complete Map periodically to bring in fresh data.
Or, in case you can use a 3rd-party library, you might use the ReferenceMap provided as part of Apache Commons Collections.
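Since the linked implementation is not reproduced here, this is my own minimal sketch of what a SoftReference-based cache can look like; it is not the code from that link. The loader function is an assumption and would wrap the database read. Values can be reclaimed by the GC under memory pressure, in which case they are simply loaded again.

```java
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class SoftCache<K, V> {
    private final ConcurrentHashMap<K, SoftReference<V>> map = new ConcurrentHashMap<>();
    private final Function<K, V> loader; // assumed to return non-null values

    SoftCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Returns the cached value, or reloads it if it was never cached or was reclaimed.
    V get(K key) {
        SoftReference<V> ref = map.get(key);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(key);
            map.put(key, new SoftReference<>(value));
        }
        return value;
    }

    // Call periodically (for example once a day) to force fresh data to be loaded.
    void clear() {
        map.clear();
    }
}
```

Two threads may occasionally load the same key at the same time; for read-mostly data that is usually harmless.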

How to improve my software project's speed?

I'm doing a school software project with my classmates in Java.
We store the info in a remote DB.
When we start the application, we pull all the information from the database and transform it into objects to use in our application (using Java SQL statements).
In the application we edit some of these objects, and then when we exit the application
we save or update the information in the database using Hibernate.
As you can see, we don't use Hibernate for pulling in information; we use it just for saving and updating.
We have two very similar problems.
Loading the objects (when we start the app) and saving the objects (with Hibernate) to the DB (when closing the app) both take too much time.
And our project is not a huge enterprise application; it's quite a small app. We just manage some students, teachers, homework and tests, so our DB is also very, very small.
How could we increase performance?
Later edit: if we use a local database it runs very quickly; it only runs slowly on remote databases.

Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device, and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, then you could be running into memory issues as well. By not caching the entire database in memory, you will reduce memory use and spread the network load out from the beginning/end to the times when it actually needs to happen.

These 2 sentences are red flags for me:
"When we start the application we pull all the information from the database and transform it into objects to use in our application (using java sql statements). In the application we edit some of these objects and then when we exit the application we save or update information in the database using Hibernate."
Is there a requirements reason that you are loading all the information from the database into memory at startup, or why you're waiting until shutdown to save changes back in the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for all of your CRUD (create, read, update, delete) operations. And I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
Something else to look at. How are you using Hibernate to save the entities to the db? Are you doing a getHibernateTemplate().get on each one and then doing an entity.save or entity.update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.

For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes. If all of your updates are held only in memory until the application is closed, everything would be lost if the app closes unexpectedly.

The difference in loading everything from a remote DB server versus loading everything from a local DB server is the network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed? 10/100/1000? Figure between 10 and 20% of your pipe size is going to be overhead due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. In that case, you still don't load everything as objects in the application, you just work from a local data store which gets sync'ed with the remote server every so often.

The project is pretty much complete. We can't do large refactoring on it now.
I tried to use a second-level cache for Hibernate when saving: EhCacheProvider.
In hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
I have done a config for the cache, ehcache.xml:
I have put the cache.jar in the project build path,
and I have set the Hibernate property for every class and set it in the mapping.
But this cache doesn't seem to have an effect. I don't know if it works (if it is used at all).

Try minimising number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time of remote database server - it might be the problem.

As your application is only slow when running against a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible; see here in case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For Hibernate, you may use its batch functionality and adjust hibernate.jdbc.batch_size.
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method time or SQL queries) to find the bottleneck. I bet you'll find thousands of queries (each taking 10 ms RTT) which could be merged into one.
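To illustrate the "merge many queries into one" idea for a hand-written DAL: instead of one SELECT per id, build a single statement with an IN list and bind all ids at once. The helper below is only a sketch; the table and column names in main are invented, and they must come from trusted constants, never from user input, because they are concatenated into the SQL.

```java
import java.util.Collections;

class InClause {
    // Builds e.g. "SELECT * FROM homework WHERE student_id IN (?, ?, ?)" for idCount = 3.
    static String selectByIds(String table, String column, int idCount) {
        if (idCount < 1) {
            throw new IllegalArgumentException("need at least one id");
        }
        String placeholders = String.join(", ", Collections.nCopies(idCount, "?"));
        return "SELECT * FROM " + table + " WHERE " + column + " IN (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // One round trip for three students instead of three separate queries.
        System.out.println(selectByIds("homework", "student_id", 3));
        // prints: SELECT * FROM homework WHERE student_id IN (?, ?, ?)
    }
}
```

You would then pass the string to Connection.prepareStatement and bind each id with setInt or setLong, so a thousand 10 ms round trips shrink to a handful.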

Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate what the bottlenecks are.

Why don't you have two separate threads?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
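A bare-bones sketch of that two-thread setup, using a BlockingQueue as the hand-off. The Strings stand in for your loaded objects and toUpperCase stands in for the real processing; both are placeholders of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class LoadAndProcess {
    // Unique sentinel object that tells the processor there is nothing more to load.
    private static final String END = new String("END");

    static List<String> run(List<String> source) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = new ArrayList<>();

        // Thread 1: loads objects one by one (here: just walks the list).
        Thread loader = new Thread(() -> {
            for (String item : source) {
                queue.add(item);
            }
            queue.add(END);
        });

        // Thread 2: processes each object as soon as it has been loaded.
        Thread processor = new Thread(() -> {
            try {
                for (String item = queue.take(); item != END; item = queue.take()) {
                    processed.add(item.toUpperCase());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        loader.start();
        processor.start();
        loader.join();
        processor.join();
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(List.of("student", "teacher", "homework")));
        // prints: [STUDENT, TEACHER, HOMEWORK]
    }
}
```

In a real UI you would not join at startup; the point is only that processing can begin before loading has finished.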

It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by activity is almost always a function call, method call, or property call, performed on a specific line of code for a specific purpose. It may invoke I/O or it may invoke computation, or both. If its purpose is not essential, then it can be optimized.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.

Thanks for your answers. They were more than helpful.
We completely solved this problem like so:
Refactored the LOAD code. Now it uses Hibernate with lazy fetching.
Refactored the SAVE code. Now it saves just the data that was modified, right after it was modified. This way we don't have a HUGE save at the end.
I'm amazed at how well it all went. The amount of new code we had to write was very, very small.

Terracotta + Compass = Hibernate + HSQLDB + JMS?

I am currently in need of a high performance java storage mechanism.
This means:
1) I have 10,000+ objects with 1 - Many Relationship.
2) The objects are updated every 5 seconds, with the most recent updates persistent in the case of system failure.
3) The objects need to be queryable in a reasonable time (1-5 seconds). (IE: Give me all of the objects with this timestamp or give me all of the objects within these location boundaries).
4) The objects need to be available across various Glassfish installs.
Currently:
I have been using JMS to distribute the objects, Hibernate as an ORM, and HSQLDB to provide the needed recoverability.
I am not exactly happy with the performance, especially the JMS part of it.
After doing some Stack Overflow research, I am wondering if this would be a better solution. Keep in mind that I have no experience with what Terracotta gives me.
I would use Terracotta to distribute objects around the system, and something else need to give the ability to "query" for attributes of those objects.
Does this sound reasonable? Would it meet these performance constraints? What other solutions should I consider?

I know it's not what you asked, but you may want to start by switching from HSQLDB to H2. H2 is a relatively new, pure Java DB. It is written by the same guy who wrote HSQLDB, and he claims the performance is much better. I've been using it for some time now and I'm very happy with it. It should be a very quick transition (add a JAR, change the connection string, create the database), so it's worth a shot.
In general, I believe in trying to get the most of what I have before rewriting the application in a different architecture. Try profiling it to identify the bottleneck first.

First, Lucene isn't your friend here (read only).
Terracotta is for scaling out at the logical layer! Your problem does not seem to be related to the processing logic. It's more about storage/communication.
Identify your bottleneck! Benchmark the storage/logic/JMS processing time and overhead!
Kill the JMS issues with a good JMS framework (e.g. ActiveMQ) and a good, tuned configuration.
Maybe a distributed key=>value store is your friend. Try Project Voldemort!
If you'd like to stay with Hibernate and HSQL, check out the Hibernate 2nd-level cache and connection pooling (c3p0, container-driven...)!

Several Terracotta users have built systems like this in the past, so I can tell you by proof of existence that it can be done. :)
Compass does have support for clustering with Terracotta so that might help you. I suspect you might get further faster by just being careful with how you create your clustered data structures.
Regarding your requirements and Terracotta:
1) 10k objects is quite small from a Terracotta perspective
2) 5 sec update rate doesn't seem like an issue. Might depend how many nodes there are and whether there is any natural partitioning you can take advantage of. All updates will be persistent.
3) 1-5 second query time seems quite easy. Building your own well-organized data structures for lookup is the tricky part. Obviously you want to avoid scanning all the data.
4) Terracotta currently supports Glassfish v1 and v2.
If you post on the Terracotta forums, you could probably get more Terracotta eyeballs on the problem.
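As an idea of what a "well-organized data structure for lookup" can mean for the timestamp queries in requirement 3, here is a small sketch of a sorted index that answers range queries without scanning all the data. It is plain Java and says nothing about how Terracotta would cluster it; the class name TimestampIndex is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class TimestampIndex<T> {
    // Sorted by timestamp; several objects may share one timestamp.
    private final TreeMap<Long, List<T>> byTime = new TreeMap<>();

    void put(long timestamp, T obj) {
        byTime.computeIfAbsent(timestamp, t -> new ArrayList<>()).add(obj);
    }

    // Returns all objects with from <= timestamp <= to (requires from <= to).
    List<T> between(long from, long to) {
        List<T> hits = new ArrayList<>();
        for (List<T> bucket : byTime.subMap(from, true, to, true).values()) {
            hits.addAll(bucket);
        }
        return hits;
    }

    public static void main(String[] args) {
        TimestampIndex<String> index = new TimestampIndex<>();
        index.put(10L, "a");
        index.put(20L, "b");
        index.put(30L, "c");
        System.out.println(index.between(15L, 30L)); // prints: [b, c]
    }
}
```

The sketch does not handle an object whose timestamp changes (you would have to remove the old entry on each 5-second update) and it is not thread-safe; location-boundary queries would need a different, spatial structure.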

I am currently working on writing the client for a very (very) fast key/value distributed hash DB that provides set + list semantics. The DB is C99 and requires GCC, and right now I'm battling with good old Java network I/O to break my current barrier of 30,000 gets/sets per second. Hope to be done within the week. Drop me a line through my account and I'll get back to you when it's show time.

With such a high update rate, Lucene is almost definitely not what you're looking for, since there is no way to update a document once it's indexed. You'd have to keep all the object versions in the index and select the one with the latest time stamp, which will kill your performance.
I'm no DB expert, but I think you should look into any one of the distributed DB solutions that's been on the news lately. (CouchDB, Cassandra)

Maybe you should take a look at Prevayler.
Your objects are always in mem.
The "changes" to your objects are persisted.
From time to time you are able to take a snapshot: every object is persisted.

You don't say what vendor you are using for JMS, but it wouldn't surprise me if you have a bottleneck there. I couldn't get more than 100 messages a second from ActiveMQ, and whatever we tried in terms of configuring acknowledgment, queue size, etc., we were unable to load the CPU beyond a few percent.
The solution was to batch many queries into one JMS message. We had a simple class that either sent a batch of messages when it got to 200 queries or reached a timeout (we used 20ms), which gave us a dramatic increase in message throughput.
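This is not our original class, just a sketch of the same idea: collect items and hand them to a sender either when the batch is full or when a timer fires. The sender callback is where the one-JMS-message-per-batch send would go; the class and parameter names are made up. Note the timer here flushes periodically rather than a fixed time after the first item of a batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class Batcher<T> implements AutoCloseable {
    private final int maxSize;
    private final Consumer<List<T>> sender; // e.g. wraps the batch into one JMS message
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private List<T> pending = new ArrayList<>();

    Batcher(int maxSize, long maxDelayMillis, Consumer<List<T>> sender) {
        this.maxSize = maxSize;
        this.sender = sender;
        // Periodic flush bounds how long an item can sit in a partial batch.
        timer.scheduleAtFixedRate(this::flush, maxDelayMillis, maxDelayMillis, TimeUnit.MILLISECONDS);
    }

    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= maxSize) {
            flush();
        }
    }

    // Sends whatever is pending as one batch; does nothing if there is nothing to send.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sender.accept(batch);
    }

    @Override
    public void close() {
        timer.shutdown();
        flush(); // do not lose the last partial batch
    }
}
```

Usage with the numbers from above would be something like new Batcher<>(200, 20, batch -> send(batch)), where send is your own method. If the sender throws from the timer thread, the periodic flush stops, so a real version should catch and log inside the callback.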

Guaranteed messaging is going to be much slower than volatile messaging. Given that every object is updated every few seconds, you might consider batching your updates (into, say, 500 changes, or by time, say 1-10 ms' worth), sending them over volatile messaging, and batching your transactions. In this case you are more likely to be limited by bandwidth. Tuning for your use case, you may find that smaller batch sizes also work efficiently. If bandwidth is critical (say you have a 10 MB connection or slower), then you could use compression over JMS.
You can achieve much higher performance with a custom solution (which also might be simpler) e.g. Hazelcast & JGroups are free (you can add a node(s) which does the database synchronization so your main app doesn't slow down). There are commercial products which handle in the order of half a million durable messages/sec.

Terracotta + jofti = queryable persistent clustered data structures
Search google for terracotta querymap or visit tusharkhairnar.blogspot.com for querymap blog
You may want to integrate timasync as well to update your database. The database is your system of record; use Terracotta as a caching and database-offloading mechanism. You can even batch the async updates to make it faster, so that the DB contains fairly recent data.
Tushar
tusharkhairnar.blogspot.com
