Multithreaded (or async) calculation on Spring Framework - java

I am learning Spring Framework, and it is pretty awesome.
I want to use Java multithreading, but I don't know how to do it with the Spring Framework.
Here is the code for the service:
//StudentService.java
public List<Grade> loadGradesForAllStudents(Date date) {
    try {
        List<Grade> grades = new ArrayList<Grade>();
        List<Student> students = loadCurrentStudents(); // LOAD FROM THE DB
        for (Student student : students) { // I WANT TO USE MULTITHREADING FOR THIS PART
            // LOAD FROM DB (MANY JOINS)
            History studentHistory = loadStudentHistory(student.getStudentId(), date);
            // CALCULATION PART
            Grade calculatedGrade = calcStudentGrade(studentHistory, date);
            grades.add(calculatedGrade);
        }
        return grades;
    } catch (Exception e) {
        ...
        return null;
    }
}
And without multithreading, it is pretty slow.
I guess the for loop causes the slowness, but I don't know how to approach this problem. If you could give me a useful link or example code, I'd appreciate it.
I figured out that the method loadStudentHistory is pretty slow (around 300ms) compared to calcStudentGrade (around 30ms).

Using multithreading for this is a bad idea in an application with concurrent users, because instead of each request using one thread and one connection, each request now uses multiple threads and multiple connections. It doesn't scale as the number of users grows.
When I look at your example I see two possible issues:
1) You have too many round trips between the application and the database, and each of those trips takes time.
2) It's not clear whether each query uses a separate transaction (you don't say where the transactions are demarcated in the example code). If each query creates its own transaction, that could be wasteful, because each transaction has overhead associated with it.
Using multithreading will not do much to help with #1 (and if it does help, it will put more load on the database), and it will either have no effect on #2 or make it worse (depending on the current transaction scopes; if the queries were in the same transaction before, with multiple threads they will have to be in different transactions). And, as already explained, it won't scale up.
My recommendations:
1) Make the service transactional, if it is not already, so that everything it does happens within one transaction. Remove the exception-catching/null-returning code (which interferes with how Spring uses exceptions to roll back transactions) and introduce an exception handler so that anything thrown from controllers is caught and logged. That will minimize the overhead from creating transactions and make your exception handling cleaner.
2) Create one query that brings back the data for all your students. That way the query is sent to the database once, and the result set is read back in chunks (according to the fetch size on the result set). You can customize the query to fetch only what you need so you don't have an excessive number of joins. Run explain plan on the query and make sure it uses indexes. You will have a faster query and far fewer round trips, which will make a big speed improvement. A rough sketch combining both recommendations follows.
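A minimal sketch of what that could look like, assuming a Spring @Transactional service and a hypothetical repository/DAO method loadHistoriesForAllStudents that fetches every student's history in a single query (the method name is illustrative, not from the original code):
@Service
public class StudentService {

    @Transactional(readOnly = true) // one transaction for the whole calculation
    public List<Grade> loadGradesForAllStudents(Date date) {
        List<Grade> grades = new ArrayList<>();
        // hypothetical single query that returns all histories in one round trip
        List<History> histories = loadHistoriesForAllStudents(date);
        for (History history : histories) {
            grades.add(calcStudentGrade(history, date)); // pure in-memory work, no DB call
        }
        return grades; // no try/catch: let exceptions propagate so Spring can roll back
    }
}
With this shape, the database is hit once per request instead of once per student, and the transaction/exception handling is left to Spring.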

The simple solution is a parallel stream, which lets you iterate in parallel, for example:
students.stream().parallel().forEach(student -> doSomething(student));
This will give you a noticeable performance boost, but it won't remove the database-query overhead. If your DB management system takes about 300ms to return results, you're either using an ORM on big databases or your queries are highly inefficient; I recommend re-analyzing your current solution.
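Note that the original loop adds to a shared ArrayList, which is not safe from a parallel forEach. A safer variant, assuming loadStudentHistory and calcStudentGrade can be called from multiple threads and the connection pool has spare connections, is to let the stream collect the results (this is a sketch against the code from the question):
List<Grade> grades = students.parallelStream()
        .map(student -> calcStudentGrade(loadStudentHistory(student.getStudentId(), date), date))
        .collect(Collectors.toList());
Keep in mind that each worker still makes its own database round trip, so this only helps as long as the database can handle the extra concurrent queries.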

Related

Optimizing response time [1 findAll() vs multiple findByXYZ() inside a loop]

I have a service that does some accounting calculation (generating annual reports). It is somewhat complex, since we use formulas that are parsed and interpreted, so the Java code itself is complex, but I have managed to optimize it many times (we use SonarQube and code metrics). The problem is that I have a DB call inside a for loop, which, now that I think about it, is a problem (I've always been told that r/w operations take longer, so reduce them as much as possible), but when you see it, it looks harmless (I just get what I need). Recently we noticed a performance issue. Maybe it's because the DB is now larger (although I'm pretty certain I did my tests with large datasets), or maybe it's because we're now on lockdown and using a VPN, which may have affected the response time.
Anyway, what I did was this: instead of having multiple findByXYZ() calls inside loops (which, after inspection, turned out to be about 60 DB calls once the loops finished), I used two findAll() calls and then, inside the loops, I just use a stream.filter(...). With this solution I managed to remove about 60 unnecessary DB calls and saw a gain in response time of 1-2 seconds, sometimes a few hundred ms. My question is: is this a good approach? Or are there variables I'm not taking into consideration that could be causing the issue, like having the server and the DB on the same network vs. having them on two different networks and the lag that can cause?
Before
//1st loop
for (..) {
    ...
    Optional<X> neededXInThisLoop = xDao.findByXYZ(x, y, z);
    ...
}
//2nd loop
for (..) {
    ...
    List<Y> neededYsInThisLoop = yDao.findByX2Y2Z2(x2, y2, z2);
    ...
}
After
List<X> allXs = xDao.findAll();
List<Y> allYs = yDao.findAll();
//1st loop
for (..) {
    ...
    Optional<X> neededXInThisLoop = allXs.stream().filter(...).findFirst();
    ...
}
//2nd loop
for (..) {
    ...
    List<Y> neededYsInThisLoop = allYs.stream().filter(...).collect(Collectors.toList());
    ...
}
Your hunch is very much right. The "after" version is much more efficient than the "before", and you should try to minimize DB calls as much as possible (try to do as much as you can in SQL, and then use streams to further transform the result).
DB calls inside for loops (or other repetitive structures) are a big code smell and can cause serious performance problems.
Ideally you should not do xDao.findAll(), but directly use xDao.findAllByXYZ(), which delivers you the already-filtered list, which you then just map to Java POJOs.
SQL (or whatever other data manipulation language you might use) does a ton of optimizations. Use it for its intended purpose.
You can read more about the different ways Spring supports JPA repositories in the official Spring documentation. You can, for example, simply name your method in the JpaRepository findAllBy____ (your condition here), or use a @Query annotation to specify a fully fledged SQL or JPQL query, and Spring takes care of the rest.
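A short sketch of both options, assuming a hypothetical entity X with fields y and z (entity and field names are illustrative, not from the question):
import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

public interface XRepository extends JpaRepository<X, Long> {

    // derived query: Spring builds "where y = ? and z = ?" from the method name
    List<X> findAllByYAndZ(String y, String z);

    // explicit JPQL, useful once the condition gets more complex
    @Query("select x from X x where x.y = :y and x.z = :z")
    List<X> findFiltered(@Param("y") String y, @Param("z") String z);
}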
Let me try to explain.
Suppose right now your DB has 1,000 records in that specific table, and with those 60 DB calls you filter that down to 100 records.
Now imagine the table contains 1M records: with findAll() you receive all 1M records and then try to apply the filtering logic with Java 8 streams.
That filtering itself will be very slow with 1M records.
So my only suggestion is this: as of now you have a limited number of records in the table, so you see a performance improvement with findAll(). Once that number increases, your performance will surely decrease.
You can also read about findByX and findAllByX at the link below:
https://spring.io/blog/2017/06/20/a-preview-on-spring-data-kay#improved-naming-for-crud-repository-methods

Is an SQL IN query better for performance, or the Java method containsAll?

I have a scenario where the user selects a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies other X conditions. So should I use a complex Oracle SQL DB query, a composite IN (id, column), to validate it, or
should I fetch the data for this user satisfying the conditions into application memory and use List.containsAll: first get all the data (with all the other conditions) for this particular user, populate it in a dbList, and then validate dbList.containsAll(inputList)?
Which one will be better performance-wise: a DB composite IN with the bulk input, or fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment, and it takes around 70-90 seconds, which is far too slow. It would be better in prod, but I still feel the DB has to sort through a huge amount of data, even though it is indexed by user ID.
In the DB I am using count(*) with IN, like below.
SQL query:
select count(*) from user_table where user_id='X123' and X conditions and user_input IN (
('id','12344556'),
('id','789954334'),
('id','343432443'),
('id','455543545'),
-- ... 50k entries
);
Also there are other AND conditions as well for validating the user_input are valid entries.
Sample Java code:
List<String> userInputList = request.getInputList();
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
Boolean validData = userDBList.containsAll(userInputList);
getConditionedQuery = "select user_backend_id from user_table where user_id='X123' AND X complex conditions";
The SQL query with the composite IN condition takes around 70-90 seconds in the lower environments; however, the Java code with containsAll looks much faster.
Incidentally, I don't want to use a temp table and execute a procedure, because again, bulk-loading the input into the DB is a hassle. I am using the ATG framework and the module is RESTful, so performance is the most important thing here.
I personally believe that you should apply all filters on the database side only, for many reasons. First, exchanging that much data over the network consumes unnecessary bandwidth. Second, bringing all that data into the JVM and processing it consumes more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give him the query, and ask him to run an analysis. The analysis will tell you whether you need to add any indexes to optimise the query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod. Although the prod machines are much faster, the amount of data in prod is much higher than in SIT, so it will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Plus, the JVM's heap memory is much smaller than database memory.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged. For example, if your application is in the cloud and the database is on premise, imagine the amount of data you will move back and forth just to filter out 10 rows from a million.
I recommend that you write a good query, optimise it, and process as many conditions as possible on the database side. Hope it helps!
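One way to keep the validation on the database side without a single giant IN list (Oracle rejects IN lists with more than 1,000 literals) is to bind the input in chunks and sum the match counts. A rough sketch using plain JDBC, simplified to a single-column IN; the table and column names come from the question, while the chunk size, method name, and the assumption that user_input is unique per user are mine:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

public final class InputValidator {

    private static final int CHUNK = 1000; // assumed chunk size, at Oracle's IN-list limit

    /** Returns true if every input id exists for this user (other conditions omitted for brevity). */
    static boolean allBelongToUser(Connection con, String userId, List<String> inputIds) throws SQLException {
        long matched = 0;
        for (int from = 0; from < inputIds.size(); from += CHUNK) {
            List<String> chunk = inputIds.subList(from, Math.min(from + CHUNK, inputIds.size()));
            String placeholders = String.join(",", Collections.nCopies(chunk.size(), "?"));
            String sql = "select count(*) from user_table where user_id = ? and user_input in (" + placeholders + ")";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setString(1, userId);
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setString(i + 2, chunk.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    matched += rs.getLong(1);
                }
            }
        }
        // assumes user_input is unique per user; duplicates would distort the count
        return matched == inputIds.size();
    }
}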
In general it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, it is generally well optimised and can work over large amounts of data faster than you would.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.

Should I batch insert/update in Hibernate?

I am having some doubts about a function that updates multiple entities, but does so one by one. Of course there could be a latency problem if we are working with a remote DB, but aside from that, I am worried that we could get an OutOfMemoryError because of the number of entities we are updating in one single transaction. My code goes something like the one below.
EntityHeader entityHeader = entityHeaderService.findById(id);
for (EntityDetail entityDetail : entityHeader.getDetails()) {
    for (Entity entity : entityDetail.getEntities()) {
        entity.setState(true);
        entityService.update(entity);
    }
}
This is an example, and we also have a similar case in another method, but with inserts instead. These methods can update or insert up to 2k or 3k entities in one transaction. So my question is: should we start using batch operations, or is the number of entities not big enough to worry about it? Also, would it perform better if done as a batch operation?
When optimizing things, always ask yourself if it is worth the time, e.g.:
Are these methods some batch job that runs nightly, or something that gets called quite often?
Is the performance gain high enough, or is it negligible?
Anyway, ~3k entities in one transaction doesn't sound bad, but there are benefits to JDBC batching even with those numbers (and it is quite easy to achieve); see the sketch below.
It's kinda hard to tell when you should worry about an OutOfMemoryError, as it depends on how much memory you give the JVM and how big the entities you are updating are. Just to give you some numbers: I personally had memory trouble when I had to insert somewhere between 10 and 100 thousand rows in the same transaction with 4 GB of memory; I had to flush and clear the Hibernate cache every once in a while.
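A minimal sketch of that flush-and-clear pattern, assuming plain JPA/Hibernate with an injected EntityManager and the hibernate.jdbc.batch_size property set so inserts/updates are sent in JDBC batches; the batch size, class name, and merge-based re-attachment are assumptions, not from the question:
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.transaction.annotation.Transactional;

public class EntityBatchService {

    private static final int BATCH_SIZE = 50; // should match hibernate.jdbc.batch_size

    @PersistenceContext
    private EntityManager em;

    @Transactional
    public void updateAll(Iterable<Entity> entities) {
        int i = 0;
        for (Entity entity : entities) {
            entity.setState(true);
            em.merge(entity); // re-attach in case the entity was detached by an earlier clear()
            if (++i % BATCH_SIZE == 0) {
                em.flush(); // push the pending batch of statements to the DB
                em.clear(); // detach flushed entities so the persistence context stays small
            }
        }
    }
}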

Hibernate really slow. How to make it faster?

In my app I have Case, and for each Case there can be 0 to 2 Claims. If a Case has 0 claims it runs pretty fast, with 1 claim it slows down, and with 2 it is awfully slow. Any idea how to make this faster? I didn't know whether my Case and Claim were going back and forth causing infinite recursion, so I added @JsonManagedReference and @JsonBackReference, but that doesn't seem to help much with speed. Any ideas? Here is my Case.java:
@Entity
public class Case {
    @OneToMany(mappedBy = "_case", fetch = FetchType.EAGER)
    @Fetch(FetchMode.JOIN)
    @JsonManagedReference(value = "case-claim")
    public Set<Claim> claims;
}
In Claim.java:
@Entity
public class Claim implements Cloneable {
    @ManyToOne(optional = true)
    @JoinColumn(name = "CASE_ID")
    @JsonBackReference(value = "case-claim")
    private Case _case;
}
output of 0 claims:
https://gist.github.com/elmatt/2cafbe7ecb1fa0b7f6a8
output of 2 claims:
https://gist.github.com/elmatt/b000bc28909453effc95
Your problem has nothing to do with the relationship between Case and Claim.
FYI: 300ms is not "pretty fast." Your problem is that you expect hibernate to magically and quickly deliver a complex object hierarchy to you, with no particular effort on your part. I view ORM as "The Big Lie" - it is super easy to use and works great on toy problems, but tends to fail miserably when you try to scale to interesting applications (like yours).
Don't abandon hibernate, but realize that you are going to need to work harder than you thought you would in order to make it work for you.
I happen to work in a similar data domain (post-adjudication healthcare claim analysis and processing). You should be able to select this kind of data in well under 10ms per claim (with all associated dimensions) using MySQL on modest hardware from a table with >1 billion claims and the DB hosted on a separate server from the app.
How do you get from where you are to where you should be?
1. Minimize the number of round-trips to the database by minimizing the number of separate queries that are executed.
2. Hand-craft your important queries to grab just the rows and joins that you actually need (see the sketch after this list).
3. Use explain plan on every query to make sure that it hits the tables in the right order and every step is appropriately supported by an index.
4. Consider partitioning your big tables and include the partition criteria in your queries to enable partition-pruning to focus the query on the proper data.
5. Be very hesitant to let hibernate manage your relationships between your entities. I generally do not let hibernate deal with any relationships.
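As an illustration of point 2, here is a hand-crafted JPQL projection that fetches only the columns a calculation needs instead of whole entity graphs; the DTO, its package, the fields, and the use of an EntityManager variable em are hypothetical:
// Hypothetical DTO holding just the fields the report needs
public class ClaimSummary {
    private final Long claimId;
    private final java.math.BigDecimal amount;

    public ClaimSummary(Long claimId, java.math.BigDecimal amount) {
        this.claimId = claimId;
        this.amount = amount;
    }
}

// JPQL constructor-expression query: one round trip, no extra joins, no entity graph loaded
List<ClaimSummary> summaries = em.createQuery(
        "select new com.example.ClaimSummary(c.id, c.amount) " +
        "from Claim c where c._case.id = :caseId", ClaimSummary.class)
    .setParameter("caseId", caseId)
    .getResultList();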
A few years ago I worked on a product, an iPhone app, where the user walks through workflows (e.g., a nurse taking a patient's vitals) and each screen made a round trip to the app server to execute the workflow step and get the data for the next screen. Think about how little data you can work with on an iPhone screen. Yet the DB portion of the round trip generally took 2-5 seconds to execute. Everyone there took it for granted, because "that is how long it has always taken." I dug into the code and found that each step was pulling in a significant portion of the database (which was then not even used by the business logic).
The only time they tweaked the default hibernate behavior was when they got an exception due to too many joins (yes, MySQL has a limit of something like 67 tables in one query).
The approach of creating your Java data model and simply ORM'ing it into the database generally works just fine on configuration data and the like, but tends to perform terribly for complex data models involving your transactional data. This is what is biting you now.
Your problem is totally fixable, and can be attacked incrementally - you don't have to tear apart the whole application to start making things better.
Can you enable Hibernate logging and provide the output? It should show the SQL queries being executed against your DB. Information about which DB you are using would also be useful. When you have those, I would recommend profiling the queries to ensure your DB is set up appropriately. It sounds like a non-indexed query.
The size of the datasets would also help in targeting possible issues: the number of rows and so on.
I would also recommend timing the actual Hibernate call (could be as crude as a log statement immediately before/after) versus the overall processing, to identify whether it really is Hibernate or some other processing. Without further information and context that is not clear here.
Now that you've posted your queries, we can see what is happening. It looks like the structure of your entities is more complex than the code snippet originally posted. There are references to Person, Activities, HealthPlan and others in there.
As others have commented, your query is triggering a very large select of a lot of data due to the nature of your model.
I recommend creating named queries for Claim, and then loading the claims using the ID of the Case.
You should also review your Hibernate model and switch to FetchType.LAZY; otherwise Hibernate will create large queries such as the one you have posted. The catch is that if you try to access a related entity outside of the transaction you will get a LazyInitializationException. You will need to consider each use case and ensure you load the data you need. Two common mistakes with Hibernate are using FetchType.EAGER everywhere, or initiating the transaction too early to avoid this. There is no single correct design approach, but I normally do the following:
JSP -> Controller -> [TX BOUNDARY] Service -> DAO
Your service method(s) should encapsulate the business logic needed to load the data you require before passing it back to the controller. A sketch of the lazy mapping and named query is below.
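A minimal sketch of that combination, reusing the Case/Claim mapping from the question; the named-query name, the assumption that Case has an id property, and the service-side snippet using an EntityManager em are mine:
@Entity
@NamedQuery(name = "Claim.findByCaseId",
            query = "select c from Claim c where c._case.id = :caseId")
public class Claim implements Cloneable {
    @ManyToOne(optional = true, fetch = FetchType.LAZY) // load the Case only when needed
    @JoinColumn(name = "CASE_ID")
    @JsonBackReference(value = "case-claim")
    private Case _case;
}

// In Case, switch the collection to lazy as well:
// @OneToMany(mappedBy = "_case", fetch = FetchType.LAZY)
// private Set<Claim> claims;

// Inside a @Transactional service method:
List<Claim> claims = em.createNamedQuery("Claim.findByCaseId", Claim.class)
        .setParameter("caseId", caseId)
        .getResultList();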
Again, per the other answer, I think you're expecting too much of Hibernate. It is a powerful tool but you need to understand how it works to get the best from it.

Pattern for batch query crawler operations

I am trying to create an abstraction for a batch query crawler operation. The idea is that a query is executed, a result set is obtained and for each row an operation is performed that either commits or rollbacks. The requirement is that all rows are processed independent of whether there are failures and that the result set is not loaded into memory beforehand.
The problem boils down to the fact that it is not possible to maintain an open result set after a rollback. This is as per the spec: cursor holdability is maintainable on commit (using ResultSet.HOLD_CURSORS_OVER_COMMIT), but not on rollback.
A naive implementation with JTA/JDBC semantics, providing two extension points, one for specifying the query and one for specifying the actual operation for each row, would be something like this:
UserTransaction tx = getUserTransaction();
tx.begin();
ResultSet rs = executeQuery(); //extension point
tx.commit();
while (rs.next()) {
    tx.begin();
    try {
        performOperationOnCurrentRow(rs); //extension point
        tx.commit();
        logSuccess();
    } catch (Exception e) {
        tx.rollback();
        logFailure(e);
    }
}
This does not seem such a far-fetched scenario; however, I've found very little relevant information on the web. The question is: has this been addressed elegantly by any of the popular frameworks? I do not necessarily need an out-of-the-box solution, I just wonder if there is a known, generally accepted approach to handling this scenario.
One solution would be to keep track of the row that failed and re-open a cursor after that point, which generally requires enforcing some rules on the extensions (e.g. an ordered result set, a query that uses the last failed row id in the where clause, etc.).
Another would be to use two different threads for the query and the row operation.
What would you do?
Since this has not been answered for a couple of years now, I will go ahead and answer it myself.
The solution we worked out revolves around the idea that the process (called BatchProcess) executes a Query (not limited to SQL, mind you) and adds its results to a concurrent Queue. The BatchProcess spawns a number of QueueProcessor objects (run on new Threads) that consume entries from the Queue and execute an Operation that uses the entry as input. Each Operation execution is performed as a single unit of work. The underlying transaction manager is a JTA implementation.
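A stripped-down sketch of that producer/consumer shape, using a plain BlockingQueue and an ExecutorService rather than the actual Bo2 classes; the class names, the RowOperation interface, and the poison-pill shutdown are illustrative:
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchProcessSketch<T> {

    private static final Object POISON = new Object(); // signals "no more work"

    public interface RowOperation<T> {
        void perform(T row) throws Exception;
    }

    public void run(Iterable<T> queryResults, int workers, RowOperation<T> operation) throws InterruptedException {
        BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        Object item = queue.take();
                        if (item == POISON) {
                            return; // stop this worker
                        }
                        try {
                            operation.perform((T) item); // one unit of work (transaction) per row
                        } catch (Exception e) {
                            // log and continue: one failed row must not stop the batch
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (T row : queryResults) {
            queue.put(row); // producer: query results go onto the queue
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one poison pill per worker
        }
        pool.shutdown();
    }
}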
A bit dated, but at some point the implementation was this https://bo2.googlecode.com/svn/trunk/Bo2ImplOpen/main/gr/interamerican/bo2/impl/open/runtime/concurrent/BatchProcess.java
There is even a GUI for BatchProcess monitoring and management somewhere in that repo ;)
Disclaimer: This guy had A LOT to do with this design and implementation https://stackoverflow.com/users/2362356/nakosspy
