First of all, is there possibly a ready-made solution for this?
Hibernate 3.6, JDBC batch_size 500, using a hilo generator for the 200,000 entities.
In my example case, I have a request that takes 56 seconds, and I am creating 200,000 entities in the session. The session.flush() call takes 32 of those 56 seconds, with only one CPU core at 100%.
Is there a way to get the list of entities that need to be updated and create the SQL statements, say in four threads?
You cannot simply flush() in different threads, because what flush() basically does is send all pending SQL INSERT statements to the database using the underlying connection. JDBC connections aren't thread-safe, which means you would have to use 4 different connections and thus 4 different transactions. If all inserts need to take place in one transaction, there is nothing you can do here.
If you can live with 4 separate transactions, just create a thread pool and store records in smaller batches. The pool will distribute the INSERT operations across several threads.
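If you go that route, here is a rough sketch of what it could look like, assuming a plain SessionFactory is available and that four independent transactions really are acceptable; the class and method names are made up for illustration:

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.Transaction;

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelInserter {

    private final SessionFactory sessionFactory;

    public ParallelInserter(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void insertInParallel(List<List<Object>> batches) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (List<Object> batch : batches) {
            pool.submit(() -> saveBatch(batch));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void saveBatch(List<Object> batch) {
        // Each thread gets its own Session, i.e. its own JDBC connection and its own transaction.
        Session session = sessionFactory.openSession();
        Transaction tx = session.beginTransaction();
        try {
            for (Object entity : batch) {
                session.save(entity);
            }
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}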
Also, are you sure this will really help? I would guess flush() is not CPU-bound but I/O- or network-bound. However, your experience with 100% CPU usage suggests otherwise, so I might be wrong. Also try optimizing the INSERTs themselves: using a StatelessSession, raw JDBC/native queries, batch inserts, etc. Splitting the work into separate threads is much harder.
I want to use more than one app instance for my Java application. The application works with a DB: it writes and reads data. I'd like to use Hibernate with L1 caching (Session level) only. My question is: should I sync the cache for each instance, or is there no need to worry about synchronization of the caches?
It all depends on what your application is about.
Say you are running an e-commerce shop, and in the admin panel there is a service for managing products. Two separate users open the same product page and update it. There is nothing wrong with that (unless you have some specific business case).
Another scenario: you are tracking the inventory of products, say by maintaining a count of each product. When you add products, this count is increased, and when you sell products, it is decreased. This operation is very sensitive and requires some sort of locking. Without locking, the following scenario can happen:
Timestamp | App Instance A | App Instance B
T1 | Reads and finds 10 products | Reads and finds 10 products
T2 | Removes two products and writes 8 | Does nothing
T3 | Does nothing | Adds two products and writes 12
Thus the database now tracks the wrong count.
To tackle these scenarios, there are mainly two kinds of locking mechanisms:
Optimistic locking
Pessimistic locking
To learn more about these kinds of locking, read here.
A simple way to implement optimistic locking in Hibernate is to use a version column in the database and in the application entity.
Here is a good article about entity versioning.
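As a rough illustration, here is what both approaches can look like with JPA annotations; the Product entity and its fields are invented for the example:

import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.Id;
import javax.persistence.LockModeType;
import javax.persistence.Version;

@Entity
class Product {

    @Id
    private Long id;

    private int stockCount;

    // Optimistic locking: Hibernate adds "where version = ?" to the UPDATE and
    // raises an optimistic-lock failure if another instance changed the row first.
    @Version
    private long version;
}

class InventoryService {

    // Pessimistic locking: the row is locked (e.g. SELECT ... FOR UPDATE) until the
    // transaction ends, so the other app instance blocks instead of overwriting.
    Product loadForUpdate(EntityManager em, Long id) {
        return em.find(Product.class, id, LockModeType.PESSIMISTIC_WRITE);
    }
}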
You can use caches like Hazelcast, Infinispan or EHCache that allow caching to span multiple instances. These caches follow different strategies, basically distributed (via a distributed hash table, DHT) or replicating. With a distributed cache, only a subset of the data is on each instance, which leads to non-uniform access times; however, it's possible to cache a vast amount of data. With a replicating cache, all data is replicated across the instances, so you get fast access times, but modifications take longer because all instances need to be notified.
To prevent dirty reads, Hibernate stops caching an object before the write transaction gets committed and only starts caching it again afterwards. In the case of a replicating cache this adds at least two network requests, so write throughput can decrease quite dramatically.
There are many details that should be understood and maybe tested before going into operation, especially what happens when an instance is added, dies, or is unreachable for some amount of time. When I looked at the Hibernate code a few years back, there was a hard-coded timeout for a cache lockout of 30 seconds, meaning: if an instance that was modifying data disappears, other instances modifying the same data will "hang" for at most 30 seconds. But if the node did not die and it was just a connection problem, and it reappears after the timeout, you will get data inconsistencies. Within the caches you also have timers and recovery strategies for failures and connection problems that you need to understand and configure correctly for your operating environment.
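If you do decide to enable a shared second-level cache, a minimal sketch of the Hibernate side might look like the following (Hibernate 4-style bootstrap; the property names are standard Hibernate settings, but the region factory class name depends on your Hibernate and cache-provider versions, so treat it as a placeholder):

import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
@Cacheable   // opt this entity into the second-level cache; depending on your setup you
             // may also need javax.persistence.sharedCache.mode or Hibernate's @Cache annotation
class ProductView {
    @Id
    private Long id;
    private String name;
}

class SecondLevelCacheBootstrap {
    static SessionFactory build() {
        Configuration cfg = new Configuration()
            .setProperty("hibernate.cache.use_second_level_cache", "true")
            // placeholder class name; use the region factory shipped with your cache provider
            .setProperty("hibernate.cache.region.factory_class",
                         "org.hibernate.cache.ehcache.EhCacheRegionFactory")
            .addAnnotatedClass(ProductView.class);
        return cfg.buildSessionFactory();
    }
}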
My current project includes an archiving function where data from an in-memory database is transferred to a relational database.
I stream over the results from the in-memory database, create Hibernate entities, and persist the data to the database in batches of 5000. These entities have a couple of relations, so for each entity I write to several tables.
As a reference you can assume that 1 million insert queries are executed in the entire archiving process.
This process was really slow in the beginning so I looked online and implemented some common suggestions for writing in batches with Hibernate:
I set hibernate.jdbc.batch_size to a good size and hibernate.order_inserts to true.
To prevent memory issues, every now and then I flush and clear the hibernate session.
Here is a small example of the batching:
RedisServiceImpl.Cursor<Contract> ctrCursor = contractAccessService.getCursor("*", taskId);

Iterators.partition(ctrCursor, BATCH_SIZE).forEachRemaining(chunk -> {
    portfolioChunkSaver.saveContractChunk(chunk, taskId);
    em.flush();   // push the pending inserts of this chunk to the database
    em.clear();   // detach the chunk so the persistence context stays small
});

ctrCursor.close();
This process works, but it is incredibly slow: inserting the 1 million records into Oracle took about 2 hours to finish, which works out to roughly 140 inserts per second.
Currently this entire archiving function is wrapped in 1 transaction, which doesn't feel right at all. The big benefit is that you can be sure if the archive was successfully completed or not without having to provide some additional checking system for that. (Everything is either in the DB or it isn't)
As a speedup experiment I modified the code to create a database transaction per chunk of entities (5000) instead of wrapping everything in 1 big transaction.
That change had a huge impact: it is now about 10-15x as fast as before.
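A sketch of what such a per-chunk variant can look like (assuming a resource-local EntityManager with explicit EntityTransaction handling; in the real code the per-chunk transaction boundary might be declarative instead, e.g. a transactional method per chunk):

RedisServiceImpl.Cursor<Contract> ctrCursor = contractAccessService.getCursor("*", taskId);

Iterators.partition(ctrCursor, BATCH_SIZE).forEachRemaining(chunk -> {
    em.getTransaction().begin();    // one transaction per 5000-entity chunk
    portfolioChunkSaver.saveContractChunk(chunk, taskId);
    em.flush();
    em.clear();
    em.getTransaction().commit();   // commit frees locks and undo for this chunk
});

ctrCursor.close();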
When profiling, I saw the following behavior before and after the change:
Before:
Java - very low CPU
Oracle - very high CPU, low disk write activity
After:
Java - high CPU
Oracle - Low CPU, very high disk write activity
The second behavior makes a lot of sense: Java is sending over as many queries as possible, and the database server is constrained by the disk write speed on my local system.
Here comes my question: why is the impact so huge? What is Oracle doing differently when I send over everything in a bigger transaction?
As a side note: I never had this issue with MySQL, so Oracle (or the Oracle JDBC driver) must be doing something differently.
I can imagine that guaranteeing ACID compliance causes the overhead, but I wouldn't expect this huge speed difference.
You should make sure you have enough UNDO space (also known as UNDO segments) as a large transaction will consume a lot of it.
When a ROLLBACK statement is issued, undo records are used to undo changes that were made to the database by the uncommitted transaction.
For data integrity, it's always preferable to commit only when you're done, and a properly tuned Oracle database can support large transactions without any performance issues.
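If you want to see how much undo a run actually consumes, here is a rough sketch of sampling V$UNDOSTAT over JDBC; it assumes the application user has SELECT privilege on that view:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class UndoUsageProbe {

    // UNDOBLKS = undo blocks consumed in each (10-minute) statistics interval.
    static void printRecentUndoUsage(Connection connection) throws Exception {
        String sql = "SELECT begin_time, end_time, undoblks, txncount "
                   + "FROM v$undostat ORDER BY begin_time DESC";
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s - %s: %d undo blocks, %d transactions%n",
                        rs.getTimestamp("begin_time"),
                        rs.getTimestamp("end_time"),
                        rs.getLong("undoblks"),
                        rs.getLong("txncount"));
            }
        }
    }
}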
As per my understanding, the execution of Java programs is pretty fast; the things that slow down an application are mainly network and I/O operations.
For example, say I have a for loop running 10,000 times that opens a file, processes some data, and saves the data back into the file. If the application is slow, it is not because the loop executes 10,000 times but because of the file opening and closing within the loop.
I have an MVC application where before I view a page I go through a Controller which in turn calls Services, which finally calls some DAO methods.
The problem is that so many queries are fired before the page loads that the page load time is 2 minutes, which is pathetic.
Since the service calls various DAO methods and each DAO method uses a different connection object, I thought of doing this: "Create only one DAO method that the Service would call and this DAO method would fire all queries on one Connection object."
So this would save the time of connecting and disconnecting to the database.
But the connection objects in my application come from a connection pool, and most connection pools don't actually close connections; they just return them to the pool. So my solution above would not have any effect, as there is no real opening and closing of connections anyway.
How can I enhance the performance of my application?
First, you should accurately determine where the time is spent, using a tool like a profiler.
Once the root cause is known, you can see whether the operations can be optimized, i.e. unnecessary steps removed. If not, you can see whether the results of the operations can be cached and reused.
Without an accurate understanding of the processing that is taking the time, it will be difficult to make any reasonable optimization.
If you reuse connection objects from the pool, this means that the connection/disconnection does not create any performance problem.
I agree with Ashwinee K Jha that a Profiler would give you some clear information of what you could optimize.
Meanwhile some other ideas/suggestions:
Could you maintain some cache of answers? I guess that not all of the 10,000 queries are distinct! (A small sketch follows after these suggestions.)
Try tuning the number of Connection objects in the Pool. There should be an optimal number.
Is your query execution already multi-threaded? I guess it is, so try tuning the number of threads. Generally, the number of cores is a good number of threads, BUT in the case of I/O a much larger number is optimal (the big cost is the I/O, not the CPU).
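For the first suggestion, here is a very small sketch of a cache of answers, assuming results can be keyed by their query parameters and may be slightly stale; the DAO call in the usage comment is hypothetical:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class QueryResultCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();

    // loader is the expensive DAO call; it runs only on a cache miss
    V get(K key, Function<K, V> loader) {
        return cache.computeIfAbsent(key, loader);
    }

    void invalidate(K key) {   // call after writes that affect this key
        cache.remove(key);
    }
}

// Hypothetical usage: cache the product list per category id
// List<Product> products = cache.get(categoryId, dao::findProductsByCategory);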
Some suggestions:
Scale your database. Perhaps the database itself is just slow.
Use 'second level caching' or application session caching to potentially speed things up and reduce the need to query the database.
Change your queries, application or schemas to reduce the number of calls made.
You can use Apache DBCP, which provides a connection pool. Database I/O is costly, but opening and closing DB connections also takes a good chunk of time.
You can also increase the maxIdle time (The maximum number of connections that can remain idle in the pool)
You can also look into an in-memory data grid, e.g. Hazelcast.
I am using multiple threads to insert records into different tables. In addition, I am using batch processing for the insertion of records to improve efficiency.
Note: the number of records to be inserted is in the millions.
My question is should I use connection pooling in this multi-threaded environment?
My Concern:
Each thread is going to run for quite some time to perform its database operations. So, if the size of my connection pool is 2 and the number of threads is 4, then at a given moment only 2 threads are going to run. Consequently, the other 2 threads will sit idle for a long time waiting to get a connection, as the DB operations for millions of records are time-consuming. Such connection pooling would therefore defeat the purpose of using multiple threads.
Using a connection pool in a batch job is a matter of convenience. It will help you limit the number of open connections and their abandoned time, close connections if you forget to close them, verify that a connection is open, etc.
Check out the Plain Ol' Java example here
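Regarding the sizing concern above, one common approach is to give the worker pool exactly as many threads as the connection pool has connections, so no thread sits waiting for a connection. Here is a hedged sketch; the DataSource setup and the insertBatch method are placeholders:

import javax.sql.DataSource;
import java.sql.Connection;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ParallelBatchInserter {

    private final DataSource dataSource;
    private final int poolSize;

    ParallelBatchInserter(DataSource dataSource, int poolSize) {
        this.dataSource = dataSource;
        this.poolSize = poolSize;
    }

    void run(List<List<Record>> batches) throws InterruptedException {
        // threads == connections, so every running thread can hold a connection
        ExecutorService workers = Executors.newFixedThreadPool(poolSize);
        for (List<Record> batch : batches) {
            workers.submit(() -> {
                try (Connection con = dataSource.getConnection()) {
                    insertBatch(con, batch);   // placeholder for the real batch insert
                } catch (Exception e) {
                    e.printStackTrace();       // real code would log and retry or fail the batch
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.DAYS);
    }

    private void insertBatch(Connection con, List<Record> batch) { /* ... */ }

    static class Record { }
}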
I am developing a JMS application (a standalone multithreaded Java application) which can receive 100 messages at a time; they need to be processed, and database procedures need to be called to insert/update data. The procedures are very heavy, as validations are also performed in them. Each procedure takes about 30 to 50 seconds to execute, and they are capable of running concurrently.
My concern is to execute the 100 procedures for all 100 messages and also send a reply from the JMS application within the 90-second time limit.
No application server is to be used (a requirement), and the database is Teradata (an RDBMS).
I am using a connection pool and a thread pool in the Java code and testing with 90 connections.
My questions are:
(1) What should be the limit on number of connections with database at a time?
(2) How many threads at a time are recommended?
Thanks,
Jyoti
90 connections seems like a lot. My recommendation is to benchmark this. Your criteria are unique to your setup, and you need to make sure you get the maximum throughput.
I would make the number of concurrent connections configurable and run the code with 10 ... 100 connections, going up 10 at a time. This should not take long. When throughput starts dropping, you know you have exceeded the benefits of running concurrently.
Do it several times to make sure your results are predictable.
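A sketch of such a benchmark harness follows; it approximates "N concurrent connections" by N worker threads (each assumed to hold one pooled connection while it works), and processMessage is a placeholder for the real JMS processing and stored-procedure call:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ConcurrencyBenchmark {

    void run() throws InterruptedException {
        for (int connections = 10; connections <= 100; connections += 10) {
            long start = System.nanoTime();
            ExecutorService pool = Executors.newFixedThreadPool(connections);
            for (int i = 0; i < 100; i++) {
                final int messageId = i;
                pool.submit(() -> processMessage(messageId));  // placeholder per-message work
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("connections=%d -> %d ms for 100 messages%n",
                              connections, elapsedMs);
        }
    }

    private void processMessage(int messageId) { /* call the stored procedure here */ }
}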
Another concern is your statement of 'procedure is taking about 30 to 50 seconds to run'. How much of this time is processing via Java and how much time is waiting for the database to process an SQL statement? Should both times really be added to determine the max number of connections you need?
Generally speaking, you should get a connection, use it, and close it as quickly as possible after your Java logic has run. If you can, avoid getting a connection, doing a bunch of Java-side processing, calling the database, doing more Java processing, and only then closing the connection; there is probably no need to hold the connection open that long. One consideration to keep in mind with this approach is which processing (including database access) needs to stay within a single transaction.
If, for example, only 1 second of the 50-second run is actual database access, then you probably don't need such a high maximum number of connections.
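As an illustration of holding the connection only for the database call itself, here is a small sketch; the procedure name and the Java-side preparation step are placeholders:

import javax.sql.DataSource;
import java.sql.CallableStatement;
import java.sql.Connection;

class MessageProcessor {

    private final DataSource dataSource;

    MessageProcessor(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    void process(String message) throws Exception {
        // 1. Do the Java-side work first, without holding a connection.
        String payload = prepareAndValidate(message);

        // 2. Borrow a connection only for the short database call, then return it to the pool.
        try (Connection con = dataSource.getConnection();
             CallableStatement call = con.prepareCall("{call archive_message(?)}")) {
            call.setString(1, payload);
            call.execute();
        }
        // 3. Any post-processing again happens without a connection.
    }

    private String prepareAndValidate(String message) {
        return message.trim();   // stand-in for the real Java-side processing
    }
}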