I am hitting a REST API to get data from a service. I transform this data and store it in a database. I have to do this at a fixed interval, every 15 minutes, and make sure the database has the latest information.
I am doing this in a Java program. I am wondering if it would be better, after I have queried all data, to do
1. SELECT statements, compare against the transformed data, and do UPDATEs (DELETE all records associated with what changed and INSERT the new ones)
OR
2. DELETE ALL and INSERT ALL every time.
Option 1 has the potential for far fewer write transactions. It guarantees a SELECT on all records, because we are comparing, but potentially not many UPDATEs, since I don't expect the data to change much. Its downside is that every record has to be compared to detect a change.
I am planning on doing this using Spring Boot, a JPA layer, and possibly PostgreSQL.
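To make option 1 concrete, here is a minimal sketch of the comparison step in plain Java. It assumes each record can be reduced to a key and a comparable value; the class and method names (SyncDiff, Plan, diff) are made up for illustration and are not part of Spring or JPA. Only the keys in the resulting plan would need an INSERT, UPDATE, or DELETE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: decides which keys need which kind of statement.
class SyncDiff {
    /** Keys grouped by the statement they would need. */
    static final class Plan {
        final List<String> inserts = new ArrayList<>();
        final List<String> updates = new ArrayList<>();
        final List<String> deletes = new ArrayList<>();
    }

    /**
     * Compares what is stored (key -> value) with what was just fetched
     * and transformed (key -> value).
     */
    static Plan diff(Map<String, String> stored, Map<String, String> fresh) {
        Plan plan = new Plan();
        for (Map.Entry<String, String> e : fresh.entrySet()) {
            if (!stored.containsKey(e.getKey())) {
                plan.inserts.add(e.getKey());      // new record
            } else if (!Objects.equals(stored.get(e.getKey()), e.getValue())) {
                plan.updates.add(e.getKey());      // changed record
            }                                      // unchanged: nothing to do
        }
        for (String key : stored.keySet()) {
            if (!fresh.containsKey(key)) {
                plan.deletes.add(key);             // record no longer in the service
            }
        }
        return plan;
    }
}
```

If the data rarely changes, the plan will usually be almost empty, which is the case option 1 is betting on. Option 2 skips the comparison entirely and rewrites everything.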
The short answer is "It depends. Test and see for your use case."
The longer answer: this feels like premature optimization, and the general response to premature optimization is "don't." Especially in DB realms like this, what is best in one situation can be awful in another. There are a number of factors, including (but not limited to) schema, indexes, speed of the backing HDD, concurrency, amount of data, network speed, latency, and so on. The usual process is:
1. First, get it working
2. Identify what's wrong → get a metric
3. Measure against that metric
4. Make any obvious or necessary changes
5. Repeat 1 through 4 as appropriate
The first question I would ask of you is "What does better mean?" Once you define that, the path forward will likely become clearer.
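For the "get a metric, then measure" part, a rough timing helper is often enough to start with. This is only a sketch: StrategyTimer and averageMillis are hypothetical names, and a wall-clock average like this ignores JIT warm-up and database caching, so treat the numbers as a first indication rather than a benchmark.

```java
// Hypothetical helper for a first, rough comparison of two strategies.
class StrategyTimer {
    /** Runs the task `runs` times and returns the average elapsed time in milliseconds. */
    static double averageMillis(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0 / runs;
    }
}
```

You would call it once with the "compare and update" job and once with the "delete all and insert all" job, against a realistic copy of the data, and compare the two averages.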
I have a table from which I extract 8 columns; these columns will be properties of a POJO, say MyPojo.
I want to remove duplicates.
I came up with three strategies.
1-Let Oracle take care of this with the DISTINCT keyword
select distinct c1,c2...c8 from TABLE where...
2-Do this in java with cqengine (https://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Logical_Elimination_Strategy):
DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));
3-Do this in Java with a Set
by simply storing the rows in a Set<MyPojo>
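For reference, strategy 3 only works if MyPojo implements equals() and hashCode() over all extracted columns. A minimal sketch, assuming just two of the eight columns to keep it short (the field names c1 and c2 and the wrapper class Dedup are placeholders):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;

class Dedup {
    /** Cut down to two columns for brevity; a real MyPojo would carry all eight. */
    static final class MyPojo {
        final String c1;
        final int c2;

        MyPojo(String c1, int c2) {
            this.c1 = c1;
            this.c2 = c2;
        }

        // Without equals/hashCode a Set would treat every row as unique.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof MyPojo)) return false;
            MyPojo other = (MyPojo) o;
            return c2 == other.c2 && Objects.equals(c1, other.c1);
        }

        @Override
        public int hashCode() {
            return Objects.hash(c1, c2);
        }
    }

    /** Removes duplicate rows while keeping the first-seen order. */
    static List<MyPojo> distinct(List<MyPojo> rows) {
        return new ArrayList<>(new LinkedHashSet<>(rows));
    }
}
```

Note that in this variant every duplicate row still travels from the database to the application before it is dropped.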
From a performance point of view which one is better?
Let the database do the work. In this case you don't send unnecessary data over the network, which will probably have the biggest positive impact on performance.
Also it is the most compact solution in terms of code size.
The best way to decide these things is to model it.
What are the access patterns in your application?
If this would be a one-off request: have the database do the filtering.
If you expect to get many such identical requests: have the database do the filtering, and consider caching results in the application.
If you expect to get a variety of queries on the same dataset, consider caching the unfiltered dataset into the application tier, and querying it with CQEngine.
There is no rule of thumb such as "always have the database do the work". If your application operates at any kind of scale, you will not want every request to hit the database. You need to scale out your application tier.
On the other hand, you should not over-engineer. The answer depends on the traffic volume and data access patterns that you expect.
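As an illustration of "caching results in the application", here is a minimal time-based cache sketch. TimedCache is a hypothetical class, not part of CQEngine or any framework; the loader stands for whatever runs the database query, and the clock is injected only so the behaviour can be checked without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical single-value cache that reloads after a fixed time-to-live.
class TimedCache<T> {
    private final Supplier<T> loader;   // e.g. the code that queries the database
    private final long ttlMillis;       // how long a loaded value stays valid
    private final LongSupplier clock;   // e.g. System::currentTimeMillis
    private T value;
    private long loadedAt;
    private boolean loaded;

    TimedCache(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it first if it is missing or expired. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

In production you would more likely reach for an existing caching library, but the trade-off is the same: fewer database hits in exchange for data that can be up to one TTL old.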
I am currently using raw JDBC to query records in a MySQL database; each record in the resulting ResultSet is ultimately extracted, placed in a domain-specific model, and stored in a List instance.
My query is: in circumstances where there is a requirement to further filter that data (incidentally based on columns that exist in the SAME Table) which of the following approaches would generally be considered best practice:
1. The issuance of further WHERE clause calls into the database. This will effectively offload the filtering process to the database but obviously results in an additional query or queries where multiple filters are applied consecutively.
2. Explicitly filtering the aforementioned preprocessed List at the application level, thus negating the need to make additional calls into the database each time the records are filtered.
3. Some hybrid combination of the above two approaches, perhaps where all filtering operations are initially undertaken by the database server but THEN preprocessed to an application-specific model and implicitly cached to a collection for some finite amount of time. Further filter queries, received within this interval, would then be serviced from the data stored in the cache.
It is important to note that the Database Server in this scenario is actually located on an external machine, therefore the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am well aware of the age-old mantra that "the database server should be used to do what it's good at." However, in this scenario it just seems less than adequate to make numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.
I have used the hybrid approach on many applications with good results.
Database filtering works well, especially for columns that are indexed. This reduces network overhead since fewer rows are sent to the application.
Database filtering can be really slow for some columns, depending on the number of rows in the result and the lack of indexes. The network overhead can be negligible compared to the database query time, so application filtering may be faster in this situation.
I also find application filtering in Java easier to write and understand than complex SQL.
I usually experiment manually to get the fewest rows in a reasonable time with plain SQL. Then write Java to refine to the desired rows.
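The "refine in Java" step can be as small as a stream filter over the rows the SQL query already returned. A sketch, with RowRefiner and refine as made-up names:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper for the application-side half of the hybrid approach.
class RowRefiner {
    /** Keeps only the already-loaded rows that match the given filter. */
    static <T> List<T> refine(List<T> rows, Predicate<? super T> filter) {
        return rows.stream()
                   .filter(filter)
                   .collect(Collectors.toList());
    }
}
```

Predicates compose with and(), or(), and negate(), so several consecutive filters can be applied to the same in-memory list without going back to the database.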
First of all, I appreciate this question, as I faced a similar situation a few days back. Of the options you have already discussed, I prefer the second one, i.e. handling the filtering at the application level rather than at the DB level.
I've been using ORM frameworks for a while, but I am rather new to Hibernate.
Suppose you have a query (whether it is a Query or a Criteria does not matter) that retrieves a large result set and that you want to paginate through it. Would you rather use the setMaxResults() and setFirstResult() combo, or a ScrollableResults?
Which is the best approach regarding the performances (execution time AND memory consumption)?
If you are implementing a Web application that serves separate pages of results in separate request-response cycles, then there is no way you can use ScrollableResults to any advantage. Use setFirstResult/setMaxResults. However, this can be a real performance killer, depending on the exact query and the total size of the result, especially if the poor DB must sort the whole result set every time just to work out which are the 100th to 110th records.
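For what it's worth, the offset arithmetic behind per-request paging is simple. The sketch below uses a hypothetical PageWindow helper and a zero-based page index; with Hibernate or JPA, the two values would be passed to setFirstResult(...) and setMaxResults(...) on the query.

```java
// Hypothetical value holder for the two paging parameters.
class PageWindow {
    final int firstResult; // zero-based index of the first row to return
    final int maxResults;  // maximum number of rows to return

    private PageWindow(int firstResult, int maxResults) {
        this.firstResult = firstResult;
        this.maxResults = maxResults;
    }

    /** Computes the window for a zero-based page index and a page size. */
    static PageWindow forPage(int pageIndex, int pageSize) {
        if (pageIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("pageIndex must be >= 0 and pageSize > 0");
        }
        // multiplyExact throws instead of silently overflowing on huge page numbers
        return new PageWindow(Math.multiplyExact(pageIndex, pageSize), pageSize);
    }
}
```

With a page size of 10, page index 10 starts at row offset 100. The database still has to order the rows and skip everything before that offset to find the window, which is where the cost mentioned above comes from.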
We had the same question the other day, and settled for setMaxResults(..) and setFirstResult(..). There are two problems with ScrollableResults:
1. ScrollableResults may execute one query for each call to next() if your JDBC driver or database does not handle it properly. This was the case for us (MySQL).
2. It is Hibernate-specific rather than part of the JPA standard.
I have a question that may seem stupid to some. Is Hibernate fast? I use it in a system that will issue a really large number of queries to the database, and performance has become a concern.
And another question, beside the point of the current context: which would be better, many simple queries (each on a single table) or somewhat fewer queries with several JOINs?
Thanks in advance
Artem
From our experience Hibernate can be very fast. However, there can be many pitfalls that are specific to ORM tools including:
Incorrect lazy loading, where too much or too little data is selected
N+1 select problems, where iterating over collections is slow
Collections mapped as List should be avoided in favour of Set, so that ordering information does not need to be stored in the table
Batch actions where it's best to fall back to direct SQL
The Improving Performance page in the Hibernate docs is a good place to start to learn about these issues and other methods to improve performance.
First of all, there are many things you can do to speed up Hibernate. Check out this High-Performance Hibernate Tips article for a comprehensive list of things you can do to speed up your data access layer.
With "many" queries, you are running into the typical N+1 query problem. You load an entity with Hibernate, and it has related objects. With LAZY associations, you'll get one additional query for every record. Each query goes through the network to your database server, which returns the data. This all takes time (opening the connection, network latency, lookup, data throughput, etc.).
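To see the N+1 pattern without any Hibernate setup, here is a small simulation. FakeDb is a stand-in that only counts round trips; it is not a real Hibernate or JDBC API, and the order/item naming is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RoundTripDemo {
    /** Stand-in for a remote database; every method call counts as one round trip. */
    static final class FakeDb {
        int roundTrips = 0;
        private final Map<Integer, List<String>> itemsByOrder = new HashMap<>();

        FakeDb(int orders) {
            for (int id = 1; id <= orders; id++) {
                itemsByOrder.put(id, List.of("item-" + id));
            }
        }

        List<Integer> findOrderIds() {
            roundTrips++;
            return new ArrayList<>(itemsByOrder.keySet());
        }

        List<String> findItems(int orderId) {
            roundTrips++;
            return itemsByOrder.get(orderId);
        }

        Map<Integer, List<String>> findOrdersWithItems() {
            roundTrips++;
            return new HashMap<>(itemsByOrder);
        }
    }

    /** Lazy style: one query for the orders, then one more per order for its items. */
    static int lazyStyle(FakeDb db) {
        int items = 0;
        for (int id : db.findOrderIds()) {
            items += db.findItems(id).size();
        }
        return items;
    }

    /** Join style: orders and items come back together in a single query. */
    static int joinStyle(FakeDb db) {
        int items = 0;
        for (List<String> orderItems : db.findOrdersWithItems().values()) {
            items += orderItems.size();
        }
        return items;
    }
}
```

With 100 orders, lazyStyle makes 101 round trips (1 + N), while joinStyle makes 1 and returns the same data.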
For a single query (with joins), the lookup and data throughput are larger than for each of the multiple queries, but you pay for opening the connection and network latency only once. With 100 or more queries, each one has a small lookup and data throughput, but you pay that cost 100 times (including opening the connection and network latency).
A single query that takes 20 ms vs. 100 queries that take 1 ms each? You do the math ;)
And it can grow to thousands of records. The single query will take a small performance hit, but thousands of queries instead of hundreds is 10 times more. So with multiple queries, performance drops greatly.
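Doing the math explicitly, under a deliberately simple cost model: every query costs its execution time plus a fixed per-query overhead for the network round trip. QueryCost is a made-up helper, and any overhead value you plug in is an assumption to be replaced by your own measurement.

```java
// Hypothetical back-of-the-envelope cost model, not a measurement tool.
class QueryCost {
    /**
     * Total time in milliseconds for `queries` sequential queries, each taking
     * `executionMillis` on the server plus `roundTripMillis` of network overhead.
     */
    static double totalMillis(int queries, double executionMillis, double roundTripMillis) {
        return queries * (executionMillis + roundTripMillis);
    }
}
```

Even with zero network overhead, 100 queries of 1 ms add up to 100 ms against 20 ms for the single query, and every millisecond of per-query overhead widens that gap a hundredfold.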
When using HQL queries to retrieve the data, you can add FETCH to a JOIN in order to load the related data in the same query (using JOINs).
For more info related to this topic, check out this Hibernate Performance Tuning Tutorial.
Hibernate can be fast. Designing a good schema, tuning your queries, and getting good performance is something of an art form. Remember, under the covers it's all SQL anyway, so anything you can do with SQL you can do with Hibernate.
Typically, in an advanced application, the Hibernate mapping/schema is only the initial step of writing your persistence layer. That step gives you a lot of functionality. The next step is to write custom queries in HQL that fetch only the data you need.
Yes, it can be fast.
In the past I have seen several cases where people thought "aaaa, it's this stupid ORM that kills all the performance of our nice application".
In ALL of those cases, profiling revealed other reasons for the problem (bad hashCode implementations for collections, regexps from hell, DB design by a mad hatter, etc.).
Actually, it can do the job in most common cases. If you migrate huge and complex data, it will be a poor competitor to plain, well-optimized SQL (but I hope that's not your case; I personally hate data migration with a passion :)
This is not the first question about it, but I couldn't find an appropriate answer in my previous ones (perhaps it was for another "forum"). So, I'll answer once again :-)
I like to answer this in a somewhat provocative way (nothing personal!): do you think you'll be able to come up with a solution that is better than Hibernate? That involves not only the basic problems, like mapping database columns to Java properties and "eager or lazy loading" (which is an actual concern in your question), but also caching (first- and second-level), connection pooling, bytecode enhancement, batching, ...
That said, Hibernate is a complex piece of software, which requires some learning to properly use and fine tune it. So, I'd say that it's better to spend your time in learning Hibernate than writing your own ORM.
Hibernate can be reasonably fast if you know how to use it that way. However, the polepos.org performance tests show that Hibernate can slow applications down by orders of magnitude.
If you want an ORM that is lighter and faster, I can recommend fjorm.
... which would have really large count of queries to database ...
If you are still in the design/development phase, do not optimize prematurely.
Hibernate is a very well-made piece of software that is aware of performance issues. I would advise you to look into performance issues once your project is more mature, and to analyse direct JDBC usage where necessary.
It's usually fast enough, and can be faster than a custom JDBC-based solution. But as with all tools, you have to use it correctly.
Fast doesn't mean anything. You have to define maximum response time, minimum throughput, etc., then measure if the solution meets the requirements, and tune the app to meet them if it doesn't.
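As a sketch of what "define a requirement, then measure" can look like in code: collect response times and check a percentile against the agreed maximum. LatencyCheck is a hypothetical helper, and both the nearest-rank percentile and the choice of percentile are assumptions; use whatever definition your requirement actually states.

```java
import java.util.Arrays;

// Hypothetical helper: checks measured response times against a requirement.
class LatencyCheck {
    /** Nearest-rank percentile: the smallest sample such that at least p percent of samples are <= it. */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    /** True if the p-th percentile of the samples is within the allowed maximum. */
    static boolean meets(long[] samplesMillis, double p, long maxMillis) {
        return percentile(samplesMillis, p) <= maxMillis;
    }
}
```

If the check fails, you have a concrete number to tune against instead of a vague feeling that things are "slow".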
Regarding joins vs. multiple queries, it all depends. Usually, joins are faster, since they require only one inter-process/network round trip.