I am executing the following set of statements in my Java application. It connects to an Oracle database.
stat=connection.createStatement();
stat1=connection.createStatement();
ResultSet rs = stat.executeQuery(BIGQUERY);
while(rs.next()) {
obj1.setAttr1(rs.getString(1));
obj1.setAttr2(rs.getString(2));
obj1.setAttr3(rs.getString(3));
obj1.setAttr4(rs.getString(4));
ResultSet rs1 = stat1.executeQuery(SMALLQ1);
while(rs1.next()) {
obj1.setAttr5(rs1.getString(1));
}
ResultSet rs2 = stat1.executeQuery(SMALLQ2);
while(rs2.next()) {
obj1.setAttr6(rs2.getString(1));
}
.
.
.
linkedBlockingQueue.add(obj1);
}
// all statements and connections are closed
The BIGQUERY returns around 4.5 million records, and for each record I have to execute the 14 smaller queries. Each small query has 3 inner joins.
My multithreaded application can currently process 90,000 records in one hour, but I may have to run the code daily, so I want to process all the records in 20 hours. I am using about 200 threads which execute the above code and store the records in the linked blocking queue.
Does blindly increasing the thread count help performance, or is there some other way I can improve the performance of the result set processing?
PS: I am unable to post the queries here, but I am assured that all queries are optimized.
To improve JDBC performance for your scenario you can apply several modifications.
As you will see, all of these modifications can significantly speed up your task.
1. Using batch operations.
You can read your big query and store the results in some kind of buffer.
Only when the buffer is full should you run the subqueries for all the data collected in the buffer.
This significantly reduces the number of SQL statements to execute.
static final int BATCH_SIZE = 1000;
List<MyData> buffer = new ArrayList<>(BATCH_SIZE);
while (rs.next()) {
MyData record = new MyData( rs.getString(1), ..., rs.getString(4) );
buffer.add( record );
if (buffer.size() == BATCH_SIZE) {
processBatch( buffer );
buffer.clear();
}
}
if (!buffer.isEmpty()) {
processBatch( buffer ); // don't forget the last, partially filled batch
}
void processBatch( List<MyData> buffer ) {
String sql = "select ... where X and id in (" + getIDs(buffer) + ")";
ResultSet rs1 = stat1.executeQuery(sql); // one query for all IDs in the buffer
while (rs1.next()) { ... }
...
}
2. Using efficient maps to store content from many selects.
If your records are not too big, you can store them all at once, even for a 4 million row table.
I have used this approach many times for overnight processes (with no normal users).
Such an approach may need a lot of heap memory (i.e. 100 MB - 1 GB), but it is much faster than approach 1).
To do that you need an efficient map implementation, e.g. gnu.trove.map.TIntObjectMap (etc.),
which is much better than the Java standard library maps.
final TIntObjectMap<MyData> map = new TIntObjectHashMap<MyData>(10000, 0.8f);
// query 1
while (rs.next()) {
MyData record = new MyData( rs.getInt(1), rs.getString(2), ..., rs.getString(4) );
map.put(record.getId(), record);
}
// query 2
while (rs.next()) {
int id = rs.getInt(1); // my data id
String x = rs.getString(...);
int y = rs.getInt(...);
MyData record = map.get(id);
record.add( new MyDetail(x,y) );
}
// query 3
// same pattern as query 2
After this you have the map filled with all of the collected data, probably with a lot of memory allocated.
This is why you can only use this method if you have such resources.
Another topic is how to write the MyData and MyDetail classes to be as small as possible.
You can use some tricks (see the sketch below):
storing 3 integers (with a limited range) in 1 long variable (using a utility for bit shifting)
storing Date objects as an integer (yymmdd)
calling str.intern() for each string fetched from the DB
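A minimal sketch of the first two tricks (the 20-bit field widths and the yymmdd encoding are assumptions; size them to your real value ranges):

final class Packed {
    // pack three non-negative integers, each assumed to fit in 20 bits, into one long
    static long pack(int a, int b, int c) {
        return ((long) a << 40) | ((long) b << 20) | (long) c;
    }
    static int first(long p)  { return (int) (p >>> 40) & 0xFFFFF; }
    static int second(long p) { return (int) (p >>> 20) & 0xFFFFF; }
    static int third(long p)  { return (int) p & 0xFFFFF; }

    // store a date as a compact int, e.g. 2015-06-30 -> 150630 (yymmdd)
    static int toYymmdd(java.time.LocalDate d) {
        return (d.getYear() % 100) * 10000 + d.getMonthValue() * 100 + d.getDayOfMonth();
    }
}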
3. Transactions
If you have to do some updates or inserts, then 4 million records is too much to handle in one transaction.
This is too much for most database configurations.
Use approach 1) and commit the transaction for each batch.
On each newly inserted record you can have something like a RUN_ID, and if everything went well you can mark that RUN_ID as successful.
If your queries only read, there is no problem. However, you can mark the transaction as read-only to help your database.
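A minimal sketch of committing once per batch over plain JDBC (the RUN_ID value, the table, columns and getters are assumptions for illustration only):

connection.setAutoCommit(false);
long runId = System.currentTimeMillis(); // hypothetical RUN_ID marking this load

PreparedStatement insert = connection.prepareStatement(
        "insert into target_table (run_id, attr1) values (?, ?)"); // assumed table/columns
int inBatch = 0;
for (MyData record : buffer) {
    insert.setLong(1, runId);
    insert.setString(2, record.getAttr1());
    insert.addBatch();
    if (++inBatch == 1000) {
        insert.executeBatch();
        connection.commit(); // one commit per batch, not per row and not one huge transaction
        inBatch = 0;
    }
}
insert.executeBatch();
connection.commit();
// for read-only work you can instead hint the driver:
// connection.setReadOnly(true);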
4. JDBC fetch size.
When you load a lot of records from the database, it is very, very important to set a proper fetch size on your JDBC statement.
This reduces the number of physical round trips to the database and speeds up your process.
Example:
// jdbc
statement.setFetchSize(500);
// spring
JdbcTemplate jdbc = new JdbcTemplate(datasource);
jdbc.setFetchSize(500);
Here you can find some benchmarks and patterns for using fetch size:
http://makejavafaster.blogspot.com/2015/06/jdbc-fetch-size-performance.html
5. PreparedStatement
Use PreparedStatement rather than Statement.
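For the 14 small queries this means preparing each statement once, outside the loop, and only binding the changing id. A minimal sketch, assuming the first column of BIGQUERY is the id the small queries filter on (the query text and column names are placeholders):

PreparedStatement smallQ1 = connection.prepareStatement(
        "select attr5 from detail_table1 where parent_id = ?"); // assumed query shape

while (rs.next()) {
    ...
    smallQ1.setString(1, rs.getString(1)); // bind the current record's id
    try (ResultSet rs1 = smallQ1.executeQuery()) {
        while (rs1.next()) {
            obj1.setAttr5(rs1.getString(1));
        }
    }
}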
6. Number of SQL statements.
Always try to minimize the number of SQL statements you send to the database.
Try this
resultSet.setFetchSize(100);
while (resultSet.next()) {
...
}
The parameter is the number of rows that should be retrieved from the
database in each roundtrip
I am using Spring and Hibernate in my project, and a few days ago I found that the Dev environment had crashed due to a Java out of heap space exception. After some preliminary analysis using heap analysis tools and VisualVM, I found that the problem was with one select SQL query. I rewrote the SQL in a different way, which solved the memory issue, but now I am not sure why the previous SQL caused the memory issue.
Note: The method is inside a DAO and is called in a while loop with a batch size of 800 until all the data is pulled. Table size is around 20 million rows.
For each call, a new hibernate session is created and destroyed.
Previous SQL:
@Override
public List<Book> getbookByJournalId(UnitOfWork uow,
List<Journal> batch) {
StringBuilder sb = new StringBuilder();
sb.append("select i from Book i where ( ");
if (batch == null || batch.size() <= 0)
sb.append("1=0 )");
else {
for (int i = 0; i < batch.size(); i++) {
if (i > 0)
sb.append(" OR ");
sb.append("( i.journalId='" + batch.get(i).journalId() + "')");
}
sb.append(")");
sb.append(
" and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :book ");
}
Query query = uow.getSession().createQuery(sb.toString());
query.setParameter("isNotDummy", Definitions.BooleanIdentifiers_Char.No);
query.setParameter("Book", "%" + Definitions.NOBook);
query.setParameter("BookStatus", Definitions.BookStatusID.CLOSED.getValue());
List<Book> bookList = (List<Book>) query.getResultList();
return bookList;
}
Rewritten SQL:
@Override
public List<Book> getbookByJournalId(UnitOfWork uow,
List<Journal> batch) {
List<String> bookIds = new ArrayList<>();
for(Journal J : batch){
bookIds.add(J.getJournalId());
}
StringBuilder sb = new StringBuilder();
sb.append("select i from Book i where i.journalId in (:bookIds) and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :Book");
Query query = uow.getSession().createQuery(sb.toString());
query.setParameter("isNotDummy", Definitions.BooleanIdentifiers_Char.No);
query.setParameter("Book", "%" + Definitions.NOBook);
query.setParameter("BookStatus", Definitions.BookStatusID.CLOSED.getValue());
query.setParameter("specimenNums",specimenNums);
query.setParameter("bookIds", bookIds);
List<Book> bookList = (List<Book>) query.getResultList();
return bookList;
}
When you create dynamic SQL statements, you miss out on the ability of the database to cache the statement, indexes and even entire tables to optimise your data retrieval. That said, dynamic SQL can still be a practical solution.
But you need to be a good citizen on both the application and database servers by being very efficient with your memory usage. For a solution that needs to scale to 20 million rows, I recommend a more disk-based approach, using as little RAM as possible (i.e. avoiding large arrays).
Problems I can see from the first statement are the following:
Up to 800 OR conditions may be added to the first statement for each batch. That makes for a very long SQL statement (not good). This, I believe [please correct me if I'm wrong], would need to be held in the JVM heap and then passed to the database.
Java may not release this statement from the heap straight away, and garbage collection might be too slow to keep up with your code, increasing RAM usage. You shouldn't rely on it to clean up after you while your code is running.
If you run this code in parallel, many Hibernate sessions may mean many sessions on the database too. I believe you should use only one session for this, unless there is a specific reason not to. Creating and destroying sessions that you don't need just creates unnecessary traffic on the servers and the network.
If you are running this code serially, then why drop the session when you can reuse it for the next batch? You may have a valid reason, but the question must be asked.
In the second statement, creating the bookIds list again uses up RAM in the JVM heap, and the where i.journalId in (:bookIds) part of the SQL will still be lengthy. Not as bad as before, but I think still too long.
You would be much better off doing the following:
Create a table on the database, with batchNumber, bookId and perhaps some meta-data, such as flags or timestamps. Join the Book table to your new table using a static statement, and pass in the batchNumber as a new parameter.
create table Batch
(
id integer primary key,
batchNumber integer not null,
bookId integer not null,
processed_datetime timestamp
);
create unique index Batch_Idx on Batch (batchNumber, bookId);
-- Put this statement into a loop, or use INSERT/SELECT if the data is available in the database
insert into Batch (batchNumber, bookId) values (:batchNumber, :bookId);
-- Updated SQL statement. This is now static. Note that batchNumber needs to be provided as a parameter.
select i
from Book i
inner join Batch b on b.bookId = i.journalId
where b.batchNumber = :batchNumber
and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :Book;
I have a requirement to insert/update more than 15,000 rows in 3 tables, so that's 45k inserts in total.
I used StatelessSession in Hibernate after reading online that it is best for batch processing, since it doesn't have a context cache.
session = sessionFactory.openStatelessSession();
for(Employee e: emplList) {
session.insert(e);
}
transaction.commit();
But this code takes more than an hour to complete.
Is there a way to save all the entity objects in one go?
Save the entire collection rather than doing it one by one?
Edit: Is there any other framework that can offer a quick insert?
Cheers!!
You should read this article by Vlad Mihalcea:
How to batch INSERT and UPDATE statements with Hibernate
You need to make sure that you've set the hibernate property:
hibernate.jdbc.batch_size
so that Hibernate can batch these inserts; otherwise they'll be done one at a time.
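For example, a minimal sketch of setting it programmatically when the SessionFactory is built (the value 50 is only an illustration; hibernate.cfg.xml or persistence.xml work just as well):

Configuration configuration = new Configuration().configure();
configuration.setProperty("hibernate.jdbc.batch_size", "50");
configuration.setProperty("hibernate.order_inserts", "true"); // optional: groups inserts per table so batches stay full
SessionFactory sessionFactory = configuration.buildSessionFactory();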
There is no way to insert all entities in one go. Even if you could do something like session.save(emplList), internally Hibernate would still save them one by one.
According to the Hibernate User Guide, StatelessSession does not use the batching feature:
The insert(), update(), and delete() operations defined by the StatelessSession interface operate directly on database rows. They cause the corresponding SQL operations to be executed immediately. They have different semantics from the save(), saveOrUpdate(), and delete() operations defined by the Session interface.
Instead, use a normal Session and clear the cache from time to time. Actually, I suggest you measure your code first and then make changes like using hibernate.jdbc.batch_size, so you can see how much each tweak improves your load.
Try to change it like this:
session = sessionFactory.openSession();
int count = 0;
int step = 0;
int stepSize = 1_000;
long start = System.currentTimeMillis();
for (Employee e : emplList) {
session.save(e);
count++;
if (++step == stepSize) {
long elapsed = Math.max(System.currentTimeMillis() - start, 1);
long linesPerSecond = stepSize * 1_000L / elapsed;
StringBuilder msg = new StringBuilder();
msg.append("Step time: ");
msg.append(elapsed);
msg.append(" ms Lines: ");
msg.append(count);
msg.append("/");
msg.append(emplList.size());
msg.append(" Lines/Seconds: ");
msg.append(linesPerSecond);
System.out.println(msg.toString());
start = System.currentTimeMillis();
step = 0;
session.flush(); // push the batched inserts to the database
session.clear(); // then detach them so the session cache stays small
}
}
transaction.commit();
About hibernate.jdbc.batch_size: you can try different values, including some very large ones, depending on the underlying database and the network configuration. For example, I use a value of 10,000 on a 1 Gbps network between the app server and the database server, which gives me 20,000 records per second.
Change stepSize to the same value as hibernate.jdbc.batch_size.
I have a DB fetch call with Spring JdbcTemplate, and the number of rows to be fetched is around 1 million. Iterating over the result set takes too much time. After debugging the behavior, I found that it processes some rows like a batch, then waits for some time, and then takes another batch of rows and processes them. Row processing does not seem to be continuous, so the overall time runs into minutes. I have used the default configuration for the data source. Please help.
[Edit]
Here is some sample code
this.prestoJdbcTempate.query(query, new RowMapper<SomeObject>() {
@Override
public SomeObject mapRow(final ResultSet rs, final int rowNum) throws SQLException {
System.out.println(rowNum);
SomeObject obj = new SomeObject();
obj.setProp1(rs.getString(1));
obj.setProp2(rs.getString(2));
....
obj.setProp8(rs.getString(8));
return obj;
}
});
As most of the comments tell you, one million records is useless and unrealistic to show in any UI; if this is a real business requirement, you need to educate your customer.
Network traffic between the application and database server is a key performance factor in scenarios like this. There is one parameter that can really help you here: the fetch size, although only up to a point.
Example :
Connection connection = //get your connection
Statement statement = connection.createStatement();
statement.setFetchSize(1000); // configure the fetch size
Most JDBC database drivers use a low fetch size by default, and tuning this can help you in this situation. But beware of the following:
Make sure your JDBC driver supports fetch size.
Make sure your JVM heap setting (-Xmx) is large enough to handle the objects created as a result.
Finally, select only the columns you need to reduce network overhead.
In Spring, JdbcTemplate lets you set the fetchSize.
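A minimal sketch combining both points, given an existing DataSource: raise the fetch size and stream rows with a RowCallbackHandler instead of building a one-million-element list (the table and column names are made up for illustration):

JdbcTemplate jdbc = new JdbcTemplate(dataSource);
jdbc.setFetchSize(1000); // ask the driver for bigger chunks per round trip

// process each row as it arrives instead of collecting everything into a List
RowCallbackHandler handler = rs -> {
    String col1 = rs.getString(1);
    String col2 = rs.getString(2);
    // write to a file, a queue, etc.
};
jdbc.query("select col1, col2 from big_table", handler);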
I have this SQL query which queries the database every 5 seconds to determine who is currently actively using the software. Active users have pinged the server in the last 10 seconds. (The table gets updated correctly on user activity, and I have a thread evicting entries on session timeouts; that all works correctly.)
What I'm looking for is a more efficient/quicker way to do this, since it gets called frequently, about every 5 seconds. In addition, there may be up to 500 users in the database. The language is Java, but the question really applies to any language.
List<String> r = new ArrayList<String>();
Calendar c = Calendar.getInstance();
long threshold = c.get(Calendar.SECOND) + c.get(Calendar.MINUTE)*60 + c.get(Calendar.HOUR_OF_DAY)*60*60 - 10;
String tmpSql = "SELECT user_name, EXTRACT(HOUR FROM last_access_ts) as hour, EXTRACT(MINUTE FROM last_access_ts) as minute, EXTRACT(SECOND FROM last_access_ts) as second FROM user_sessions";
DBResult rs = DB.select(tmpSql);
for (int i=0; i<rs.size(); i++)
{
Map<String, Object> result = rs.get(i);
long hour = (Long)result.get("hour");
long minute = (Long)result.get("minute");
long second = (Long)result.get("second");
if (hour*60*60 + minute*60 + second > threshold)
r.add(result.get("user_name").toString());
}
return r;
If you want this to run faster, then create an index on user_sessions(last_access_ts, user_name), and do the date logic in the query:
select user_name
from user_sessions
where last_access_ts >= now() - 10/(24*60*60);
This does have a downside. You are, presumably, updating the last_access_ts field quite often. An index on the field will also have to be updated. On the positive side, this is a covering index, so the index itself can satisfy the query without resorting to the original data pages.
I would move the logic from Java to the DB. This means translating the if into a where clause and selecting only the names of the valid results.
SELECT user_name FROM user_sessions WHERE last_access_ts > ?
In your example, c represents the current client time, so it is quite possible that the result will be empty.
So your question is really about date/time operations on your database.
Just let the database do the comparison for you by using this query:
SELECT
user_name
FROM user_sessions
where TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10
Complete example:
List<String> r = new ArrayList<String>();
// this will return only the users that were active within the last 10 seconds
String tmpSql = "SELECT user_name FROM user_sessions"
        + " WHERE TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10";
DBResult rs = DB.select(tmpSql);
for (int i=0; i<rs.size(); i++)
{
Map<String, Object> result = rs.get(i);
r.add(result.get("user_name").toString());
}
return r;
The solution is to move the logic from your code into the SQL query, so that the select returns only the active users via a where clause.
It is faster to use the SQL built-in functions to fetch fewer records and iterate less in your code.
Add this to your sql query to get the active users only:
WHERE TIMESTAMPDIFF(SECOND, last_access_ts, current_timestamp) <= 10
This will get you all the records whose last access was 10 seconds ago or sooner, i.e. the active users.
Try the MySQL TimeDiff function in your select. This way you can select only the results that are active without having to do any other calculations.
Link: MySQL: how to get the difference between two timestamps in seconds
If I understand you correctly, you have only 500 entries in your user_sessions table. In that case I wouldn't even care about indexes; throw them away. The DB engine probably won't use them anyway for such a low record count, and the performance gain from not updating the indexes on every record update could well be higher than the query overhead.
If you care about DB stress, then lengthen the query/update intervals to 1 minute or more, if your application allows it. Gordon Linoff's answer should give you the best query performance, though.
As a side note (because it has bitten me before): If you don't use the same synchronized time for all user callbacks, then your "active users logic" is flawed by design.
I'm having major issues with my query processing time :(
I think it is because the query is getting recompiled every time, but I don't see any way around it.
The following is the query/snippet of code:
private void readPerformance(String startTime, String endTime,
String performanceTable, String interfaceInput) throws SQLException, IOException {
String interfaceId, iDescp, iStatus = null;
String dtime, ingress, egress, newLine, append, routerId= null;
StringTokenizer st = null;
stmtD = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmtD.setFetchSize(Integer.MIN_VALUE);
BufferedReader interfaceRead = new BufferedReader(new FileReader(interfaceInput));
BufferedWriter pWrite = new BufferedWriter(new FileWriter("performanceInput.txt"));
while((newLine = interfaceRead.readLine())!= null){
st = new StringTokenizer(newLine,",");
while(st.hasMoreTokens()){
append = st.nextToken()+CSV+st.nextToken()+st.nextToken()+CSV+st.nextToken();
System.out.println(append +" ");
iStatus = st.nextToken().trim();
interfaceId = st.nextToken().trim();
append = append + CSV+iStatus+CSV+interfaceId;
System.out.println(append +" ");
pquery = " Select d.dtime,d.ifInOctets, d.ifOutOctets from "+performanceTable+"_1_60" +" AS d Where d.id = " +interfaceId
+ " AND dtime BETWEEN " +startTime+ " AND "+ endTime;
rsD = stmtD.executeQuery(pquery);
/* interface query*/
while(rsD.next()){
dtime = rsD.getString(1);
ingress= rsD.getString(2);
egress = rsD.getString(3);
pWrite.write(append + CSV + dtime+CSV+ingress+CSV+egress+NL);
}//end while
}//end while
}// end while
pWrite.close();
interfaceRead.close();
rsD.close() ;
stmtD.close();
}
My interfaceId value keeps changing, so I have put the query inside the loop, resulting in the query being recompiled multiple times.
Is there any better way? Can I use a stored procedure from Java? If so, how? I do not have much knowledge of them.
The current processing time is almost 60 minutes, and the text file being generated is over 300 MB.
Please help!!!
Thank you.
You can use a PreparedStatement with parameters, which may avoid recompiling the query. Since performanceTable is constant, it can be put into the prepared query; the remaining variables, used in the WHERE condition, are set as parameters.
Outside the loop, create a prepared statement, rather than a regular statement:
PreparedStatement stmtD = conn.prepareStatement(
"Select d.dtime,d.ifInOctets, d.ifOutOctets from "+performanceTable+"_1_60 AS d"+
" Where d.id = ? AND dtime BETWEEN ? AND ?");
Then later, in your loop, set the parameters:
stmtD.setString(1, interfaceId); // these variables are Strings in the original code
stmtD.setString(2, startTime);
stmtD.setString(3, endTime);
ResultSet rsD = stmtD.executeQuery(); // note no SQL passed in here
It may be a good idea to also check the query plan with MySQL's EXPLAIN to see whether that is part of the bottleneck. Also, there is quite a bit of diagnostic string concatenation going on in the method; once the query is working, removing that may also improve performance.
Finally, note that even if the query is fast, network latency may slow things down. JDBC provides batch execution of multiple statements to help reduce the overall latency per statement; see addBatch/executeBatch on Statement and PreparedStatement.
More information is required, but I can offer some general questions/suggestions. It may have nothing to do with compilation of the query plan (that would be unusual):
Are the id and dtime columns indexed?
How many times does a query get executed in the 60mins?
How much time does each query take?
If the time per query is large, then the problem is the query execution itself, not the compilation. Check the indexes as described above.
If there are very many queries, then it might be the sheer volume of queries that is causing the problem. Using PreparedStatement (see mdma's answer) may help. Or you can batch the interfaceIDs you want by using an "in" clause and running one query for every 100 interfaceIDs rather than one for each.
EDIT: As a matter of good practice you should ALWAYS use PreparedStatement, as it will correctly handle data types such as dates so you don't have to worry about formatting them into correct SQL syntax. It also prevents SQL injection.
From the looks of things you are kicking off multiple select queries (possibly hundreds, based on your file size).
Instead of doing that, build a comma-delimited list of all the interfaceId values from your input file and then make one SQL call using the "IN" keyword. You know that performanceTable, startTime and endTime aren't changing, so the query would look something like this:
SELECT d.dtime,d.ifInOctets, d.ifOutOctets
FROM MyTable_1_60 as d
WHERE dtime BETWEEN '08/14/2010' AND '08/15/2010'
AND d.id IN ( 10, 18, 25, 13, 75 )
Then you are free to open your output file and dump the result set in one pass (see the sketch below).
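A minimal sketch of that idea, assuming for illustration that the interface ID is the first comma-separated field of each input line (in the original code it is a later token, so adjust the parsing):

// collect all interface IDs from the input file first
List<String> ids = new ArrayList<>();
try (BufferedReader in = new BufferedReader(new FileReader(interfaceInput))) {
    String line;
    while ((line = in.readLine()) != null) {
        StringTokenizer st = new StringTokenizer(line, ",");
        ids.add(st.nextToken().trim()); // assumption: ID is the first field
    }
}

// one query for all IDs instead of one query per ID
String sql = "SELECT d.id, d.dtime, d.ifInOctets, d.ifOutOctets FROM " + performanceTable + "_1_60 AS d"
        + " WHERE d.dtime BETWEEN " + startTime + " AND " + endTime
        + " AND d.id IN (" + String.join(",", ids) + ")";
try (ResultSet rs = stmtD.executeQuery(sql);
     BufferedWriter pWrite = new BufferedWriter(new FileWriter("performanceInput.txt"))) {
    while (rs.next()) {
        pWrite.write(rs.getString(1) + CSV + rs.getString(2) + CSV + rs.getString(3) + CSV + rs.getString(4) + NL);
    }
}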