My problem is this: I am trying to process about 1.5 million rows of data in Spring via JdbcTemplate, coming from MySQL. With such a large number of rows, I am using the RowCallbackHandler class as suggested here.
The code is actually working, but it's SLOW... The thing is that no matter what I set the fetch size to, I seem to get approximately 350 records per fetch, with a 2 to 3 second delay between fetches (from observing my logs). I tried commenting out the store command and confirmed that the behavior stayed the same, so the problem is not with the writes.
There are 6 columns, only 1 of which is a varchar, and that one is only 25 characters long, so I can't see throughput being the issue.
Ideally I'd like to get more like 30000-50000 rows at a time. Is there a way to do that?
Here is my code:
protected void runCallback(String query, Map params, int fetchSize, RowCallbackHandler rch)
        throws DatabaseException {
    int oldFetchSize = getJdbcTemplate().getFetchSize();
    if (fetchSize > 0) {
        getJdbcTemplate().setFetchSize(fetchSize);
    }
    try {
        getJdbcTemplate().query(getSql(query), rch);
    }
    catch (DataAccessException ex) {
        logger.error(ExceptionUtils.getStackTrace(ex));
        throw new DatabaseException(ex.getMessage());
    }
    finally {
        // Restore the previous fetch size even if the query fails.
        getJdbcTemplate().setFetchSize(oldFetchSize);
    }
}
and the handler:
public class SaveUserFolderStatesCallback implements RowCallbackHandler {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // Save each row sequentially.
        // Do NOT call ResultSet.next() !!!!
        Calendar asOf = Calendar.getInstance();
        log.info("AS OF DATE: " + asOf.getTime());
        Long x = rs.getLong("x");
        Long xx = rs.getLong("xx");
        String xxx = rs.getString("xxx");
        BigDecimal budgetAmountBD = rs.getBigDecimal("xxxx");
        Double xxxx = (budgetAmountBD == null) ? 0.0 : budgetAmountBD.doubleValue();
        BigDecimal actualAmountBD = rs.getBigDecimal("xxxxx");
        Double xxxxx = (actualAmountBD == null) ? 0.0 : actualAmountBD.doubleValue();
        dbstore(x, xx, xxx, xxxx, xxxxx, asOf);
    }
}
And what is your query? Try to create indexes for the fields you are searching/sorting on. That will help.
A second strategy: an in-memory cache implementation, or using Hibernate plus its 2nd-level cache.
Both of these techniques can significantly speed up your query execution.
A few questions:
How long does it take if you query the DB directly? Another issue could be ASYNC_NETWORK_IO delay between the application and DB hosts.
Did you check it without using Spring?
The answer actually is to do setFetchSize(Integer.MIN_VALUE). While this totally violates the stated contract of Statement.setFetchSize, the MySQL Java connector uses this value to stream the resultset, and it results in a tremendous performance improvement.
Another part of the fix is that I also needed to create my own subclass of (Spring) JdbcTemplate that would accommodate the negative fetch size... Actually, I took the code example from here, where I first found the idea of setting fetchSize(Integer.MIN_VALUE):
http://javasplitter.blogspot.com/2009/10/pimp-ma-jdbc-resultset.html
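In case that link goes away, here is a minimal sketch of the idea; it assumes a Spring version whose JdbcTemplate only forwards positive fetch sizes, which is why the override is needed (my reconstruction, not the blog's exact code):
import java.sql.SQLException;
import java.sql.Statement;
import org.springframework.jdbc.core.JdbcTemplate;

public class StreamingJdbcTemplate extends JdbcTemplate {
    @Override
    protected void applyStatementSettings(Statement stmt) throws SQLException {
        // Let Spring apply max rows, timeout, and any positive fetch size first.
        super.applyStatementSettings(stmt);
        // Then force the MySQL streaming hint through, even though it is negative.
        if (getFetchSize() == Integer.MIN_VALUE) {
            stmt.setFetchSize(Integer.MIN_VALUE);
        }
    }
}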
Thank you both for your help!
I am writing a Java program that needs to continue checking if the auto increment value of a given database table has changed. Currently, the program does this by querying the database in an infinite loop on a separate thread.
public class StackOverflow implements Runnable {
    @Override
    public void run()
    {
        while(true)
        {
            // Assume that 'currentMessageID' has already been declared as type integer
            // and that 'getLatestMessageID()' queries the database.
            if(currentMessageID < queryHandler.getLatestMessageID())
            {
                int latestMessageID = queryHandler.getLatestMessageID();
                for(int x = currentMessageID + 1; x <= latestMessageID; x++)
                {
                    // Do something when the auto increment value is greater than the last
                    // known auto increment value.
                }
            }
        }
    }
}
While this works just fine, it puts a significant strain on the database server since
SELECT `auto_increment` FROM INFORMATION_SCHEMA.TABLES WHERE table_name = 'SOrocks'
is being called over and over. Is there any way that I could watch for the auto increment value to change without hammering the database server with the same query over and over again?
Unluckily, if you can only work from the application side, a thread that polls the DB is the only solution that comes to my mind. BUT, if you can also change the DB side, you can always create a trigger on the database that calls your Java method (example with Oracle).
I'm using MySQL with JDBC.
I have a large example table which contains 6.3 million rows that I am trying to perform efficient select queries on. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256356 ms, or a little above four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
PreparedStatement pst = null;
Statement stmt = null;
ResultSet rs = null;
String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";

try {
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection(url, user, password);
    String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3";
    stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>=" + startTime + " AND timestamp<=" + endTime);
    stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    rs = stmt.executeQuery(query);
    System.out.println("Start");
    while (rs.next()) {
        int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
        int x = (int) (maxLong * 100000) - tempLong;
        int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
        int y = (int) (maxLat * 100000) - tempLat;
        // Guard against out-of-range indices before writing into the matrix.
        if (y >= 0 && y < matrix.length && x >= 0 && x < matrix[0].length) {
            matrix[y][x] += 1;
        }
    }
    System.out.println("End");
    JSONObject obj = convertToCRS(matrix);
    return obj;
} catch (ClassNotFoundException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} catch (SQLException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} finally {
    try {
        if (rs != null) {
            rs.close();
        }
        if (pst != null) {
            pst.close();
        }
        if (con != null) {
            con.close();
        }
    } catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.WARNING, ex.getMessage(), ex);
        return null;
    }
}
Removing every line in the while(rs.next()) loop gives me the same horrible run-time.
My question is: what can I do to optimize this type of query? I am curious about .setFetchSize() and what the optimal value should be here. The documentation shows that Integer.MIN_VALUE results in fetching row-by-row; is this correct?
Any help is appreciated.
EDIT
After creating a new index on timestamp, DayOfWeek and HourOfDay, my query runs 1 minute faster, and EXPLAIN gives me this:
Some ideas up front:
Did you in fact check the SQL execution time (from .executeQuery() until the first row), or is that execution plus iteration over 6.3 million rows?
You prepare a PreparedStatement but don't use it?!
Use the PreparedStatement; pass timestamp, dayOfWeek and hourOfDay as parameters (a sketch follows this list).
Create one index that can satisfy your where condition. Order the keys in a way that you can eliminate the most items with the highest ranking field.
The index might look like:
CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);
Perform your SQL inside MySQL - what time do you get there?
Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network roundtrips.
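A hedged sketch of points 2-4 above, binding the values as parameters instead of concatenating them (the column names come from the question; treating the timestamps as numeric longs is an assumption):
String sql = "SELECT latitude, longitude FROM 3dag "
        + "WHERE timestamp BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";
try (PreparedStatement ps = con.prepareStatement(sql,
        java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY)) {
    ps.setLong(1, startTime);   // assumption: timestamps are stored as numbers
    ps.setLong(2, endTime);
    ps.setInt(3, 4);            // HourOfDay
    ps.setInt(4, 3);            // DayOfWeek
    try (ResultSet result = ps.executeQuery()) {
        while (result.next()) {
            // process latitude/longitude as in the question's loop
        }
    }
}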
According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude);
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: this quantizes (rounds off) your lat/long values and then counts the number of items duplicated by the rounding. The more coarsely you round them off (that is, the smaller the second number in the ROUND(val,N) function calls), the more duplicate values you will encounter, and the fewer distinct rows your query will generate. Fewer rows save time.
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64-bit integers for values between 0 and 31 are wasteful. With hundreds of rows it doesn't matter; with millions it does.
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_HOUR
AND DayOfWeek = SOME_CONSTANT_DAY
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude);
I am extrapolating from my tracking app. This is what I do for efficiency:
Firstly, a possible solution depends on whether or not you can predict/control the time intervals. Store snapshots every X minutes or once a day, for example. Let us say you want to display all events from YESTERDAY: you can save a snapshot that has already filtered your file. This would speed things up enormously, but it is not a viable solution for custom time intervals or real live coverage.
My application is LIVE, but usually works pretty well at T+5 minutes (a 5-minute maximum lag/delay). Only when the user actually chooses live position viewing will the application open a full query on the live DB. So it depends on how your app works.
Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example, and converting from UNIXTIME will also add unnecessary lag time. Since you are developing what appears to be a geotracking application, your timestamp would be in Unix time - an integer. Some devices work with milliseconds; I would recommend not using them: 1449878400 instead of 1449878400000 (12/12/2015 0 GMT).
I save all my geopoint datetimes in Unix time seconds and use MySQL timestamps only for timestamping the moment the point was received by the server (which is irrelevant to the query you propose).
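For illustration (a trivial sketch, not code from my app), truncating to seconds on the Java side is a one-liner:
long unixMillis  = System.currentTimeMillis();  // e.g. 1449878400000
long unixSeconds = unixMillis / 1000L;          // e.g. 1449878400 - store this one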
You might shave some time off by accessing an indexed view instead of running a full query. Whether that time is significant in a large query is subject to testing.
Finally, you could shave off an itsy bitsy more by not using BETWEEN and using something similar to what it will be translated into (pseudocode below):
WHERE (timecode > start_Time AND timecode < end_time)
Note that I changed >= and <= to > and < because chances are your timestamp will almost never be on the precise second, and even if it is, you will rarely be affected by whether one geopoint/time event is or is not displayed.
I am trying to prototype performance results for OrientDB mass deletion of vertices. I need to prototype deleting anywhere from more than 10,000 up to a million vertices.
Firstly, I set the lightweight-edges property to false while creating my vertices and edges, following Issue with creating edge in OrientDb with Blueprints / Tinkerpop.
When I try deleting (please see the code below):
private static OrientGraph graph = new OrientGraph(
        "remote:localhost/WorkDBMassDelete2", "admin", "admin");

private static void removeCompleatedWork() {
    try {
        long startTime = System.currentTimeMillis();
        List params = new ArrayList();
        String deleteQuery = "delete vertex Work where out('status') contains (work-state = 'Not Started')";
        int no = graph.getRawGraph().command(new OCommandSQL(deleteQuery))
                .execute(params);
        // graph.commit();
        long endTime = System.currentTimeMillis();
        System.out.println("No of activities removed : " + no
                + " and time taken is : " + (endTime - startTime));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        graph.shutdown();
    }
}
The results are good if I delete in the hundreds: 500 activities take ~500 ms. But when I try to delete 2500/5000 activities, the numbers are high: 2500 deletions take ~6000 ms.
A) I also tried creating an index. What is the best practice: to create an index on the attribute work-state, or to create an index on the edge status? I tried both while creating the vertex and edge, but neither improves the performance much.
((OrientBaseGraph) graph).createKeyIndex("Status", Edge.class);
//or on the vertex
((OrientBaseGraph) graph).createKeyIndex("work-state", Vertex.class);
What is the best practice for deleting mass/group data using a query like the one mentioned above? Any help is greatly appreciated.
UPDATE:
I downloaded orientdb-community-1.7-20140416.230539-144-distribution.tar.gz from https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-community/1.7-SNAPSHOT/.
When I try deleting using the subquery from the Studio / program I get the following error: com.orientechnologies.orient.core.sql.OCommandSQLParsingException: Error on parsing command at position #0: Class 'FROM was not found. I modified my query like this:
delete vertex from (select in('status') from State where work-state = 'Complete')
Also, while I ran it through the program I updated my Maven dependencies to the 1.7-SNAPSHOT libraries. My old query was still producing the same numbers, and the subquery deletion was giving errors even in Studio. Please let me know if I am missing anything. Thanks!!
First, please try the same exact code with 1.7-SNAPSHOT. It should be faster.
Then, in 1.7-SNAPSHOT we just added the ability to delete vertices from a sub-query. The reasoning: why browse all the Work vertices when you could delete all the vertices incoming to the State vertex "Not Started"?
So if you've 1.7-SNAPSHOT change this query from:
delete vertex Work where out('status') contains (work-state = 'Not Started')
to (assuming the status vertex is called "State"):
delete vertex from (select in('status') from State where work-state = 'Not Started')
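For completeness, a rough sketch of running that subquery delete through the same Java API the question already uses (same OCommandSQL call, just the new statement):
String deleteQuery =
        "delete vertex from (select in('status') from State where work-state = 'Not Started')";
// Same pattern as the question's code: execute the SQL command against the raw graph.
int removed = graph.getRawGraph().command(new OCommandSQL(deleteQuery)).execute();
System.out.println("Vertices removed: " + removed);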
My query returns 31,000 results with 12 columns for each row, and each row contains roughly 8,000 characters (8 KB per row). Here is how I process it:
public List<MyTableObj> getRecords(Connection con) {
    List<MyTableObj> list = new ArrayList<MyTableObj>();
    String sql = "my query...";
    ResultSet rs = null;
    Statement st = null;
    try {
        con.setAutoCommit(false);
        st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        st.setFetchSize(50);
        rs = st.executeQuery(sql);
        System.out.println("Before MemoryFreeSize = " + (double) Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
        while ( rs.next() ) {
            MyTableObj item = new MyTableObj();
            item.setColumn1( rs.getString("column1") );
            ... ...
            item.setColumn12( rs.getString("column12") );
            list.add( item );
        } // end loop

        // try to release some memory, but it's not working at all
        if ( st != null ) st.close();
        if ( rs != null ) rs.close();
        st = null; rs = null;
    }
    catch ( Exception e ) { /* do something */ }
    System.out.println("After MemoryFreeSize = " + (double) Runtime.getRuntime().freeMemory() / 1024 / 1024 + " MB");
    return list;
} // end getRecords
If each row takes 8 KB of memory, 31k rows should take about 242 MB. After finishing the loop over the query results, my remaining free memory is only 142 MB, which is not enough to finish the rest of my processing.
I searched for many solutions: I tried setting my heap to 512 MB (-Xmx512m -Xms512m), and I also set the fetch size with setFetchSize(50).
I suspect the ResultSet occupies too much memory; the results may be stored in a client-side cache. However, after I cleaned up some objects (st.close() and rs.close()), and even after manually calling the garbage collector with System.gc(), the free memory after the loop never increases (why?).
Let's just assume I cannot change the database design and I need all the query results. How can I free more memory after processing?
P.S.: I also tried not using ResultSet.getString() and replaced it with a hardcoded String; after the loop, I had 450 MB of free memory.
I found that, if I do:
// + counter to make the value different for each row, for testing purpose
item.setColumn1( "Constant String from Column1" + counter );
... ...
item.setColumn12( "Constant String from Column12" + counter );
counter++;
It used only around 60MB memory.
But if I do:
item.setColumn1( rs.getString("column1") );
... ...
item.setColumn12( rs.getString("column12") );
It used up to 380MB memory.
I have already done rs.close(); and rs = null; (rs is the ResultSet instance), but this does not seem to help. Why is there such a large difference in memory usage between these two approaches? In both approaches I only pass in a String.
You should narrow down your queries: try to be more specific and, if necessary, add a LIMIT to your queries. Your Java application can't handle results that are too large.
If you need all the data you're getting in memory at the same time (so you can't process it in chunks), then you'll need to have enough memory for it. Try it with 1G of memory.
Forget calling System.gc(); that's not going to help (it will be attempted before an OutOfMemoryError is thrown anyway).
I also noticed you're not closing the connection. You should probably do that as well (if you don't have a connection pool yet, set one up).
And of course you can use a profiler to see where the memory is actually going to. What you're doing now is pure guesswork.
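On the connection-pool point: one common choice (my suggestion, not something mentioned in the question) is HikariCP; a minimal sketch with placeholder URL and credentials:
import java.sql.Connection;
import java.sql.SQLException;
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolExample {
    public static void main(String[] args) throws SQLException {
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:mysql://localhost:3306/mydb"); // placeholder URL
        config.setUsername("user");                            // placeholder
        config.setPassword("secret");                          // placeholder
        config.setMaximumPoolSize(10);
        try (HikariDataSource dataSource = new HikariDataSource(config);
             Connection con = dataSource.getConnection()) {
            // call getRecords(con) here; closing the connection returns it to the pool
        }
    }
}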
I don't think many people will encounter this issue, but I still want to post my solution for reference.
Before, in my code, the query was:
String sql = "SELECT column1, column2 ... FROM mytable";
and the setter for MyTableObj is:
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr.trim();
}
After update:
What I changed is simply to do the trim in the query instead of in Java code:
String sql = "SELECT TRIM(column1), TRIM(column2) ... FROM mytable";
public void setColumn1(String columnStr) {
    this._columnStr = columnStr == null ? "" : columnStr;
}
Using this updated code, it takes only roughly 100 MB of memory, which is a lot less than the previous memory usage (380 MB).
I still cannot give a valid reason why the Java trim consumes more memory than the SQL TRIM; if anyone knows the reason, please help me explain it. I would appreciate it a lot.
After many tests, I found that it is the data itself. Each row takes 8 KB, and 31,000 rows take about 240 MB of memory. TRIM in the query only helps when it makes the data short.
Since the data is large and memory is limited, I can only limit my query results for now.
I asked a question here. Simply speaking, my algorithm needs a four-dimensional array, and its size could reach 32 GB, so I plan to store it in MongoDB. I have implemented it my own way, but as I have never used MongoDB before, my implementation is too slow. How should I store this four-dimensional array in MongoDB?
Some stats:
It would take hours (more than ten, I guess, as I didn't wait) to update the whole array, as my array size is about 12*7000*100*500. My server is Windows Server 2008 R2 Standard with 16.0 GB of RAM and an Intel(R) Xeon(R) CPU at 2.67 GHz. My MongoDB version is 2.4.5.
Let me explain my implementation a bit.
My array has four dimensions; call them z, d, wt and wv respectively.
First, I construct a string for each array element. Take the element p_z_d_wt_wv[1][2][3][4] for instance: as z is 1, d is 2, wt is 3 and wv is 4, I get the string "1_2_3_4", which stands for p_z_d_wt_wv[1][2][3][4]. Then I store the value of p_z_d_wt_wv[1][2][3][4] in the database.
So my data looks like this:
{ "_id" : { "$oid" : "51e0c6f15a66ea5c32a99773"} , "key" : "1_2_3_4" , "value" : 113.1232}
{ "_id" : { "$oid" : "51e0c6f15a66ea5c32a99774"} , "key" : "1_2_3_5" , "value" : 11.1243}
Any advice would be appreciated!
Thanks in advance!
Below is my code
public class MongoTest {
    private Mongo mongo = null;
    private DB mmplsa;
    private DBCollection p_z_d_wt_wv;
    private DBCollection p_z_d_wt_wv_test;

    public void init()
    {
        try {
            mongo = new Mongo();
        } catch (UnknownHostException e) {
            e.printStackTrace();
        } catch (MongoException e) {
            e.printStackTrace();
        }
        mmplsa = mongo.getDB("mmplsa");
        p_z_d_wt_wv = mmplsa.getCollection("p_z_d_wt_wv");
    }

    public void createIndex()
    {
        BasicDBObject query = new BasicDBObject("key", 1);
        p_z_d_wt_wv.ensureIndex(query, null, true);
    }

    public void add(String key, double value)
    {
        DBObject element = new BasicDBObject();
        element.put("key", key);
        element.put("value", value);
        p_z_d_wt_wv.insert(element);
    }

    public Double query(String key)
    {
        BasicDBObject specific_key = new BasicDBObject("value", 1).append("_id", false);
        DBObject obj = p_z_d_wt_wv.findOne(new BasicDBObject("key", key), specific_key);
        return (Double) obj.get("value");
    }

    public void update(boolean ifTrainset, String key, double new_value)
    {
        BasicDBObject query = new BasicDBObject().append("key", key);
        BasicDBObject updated_element = new BasicDBObject();
        updated_element.append("$set", new BasicDBObject().append("value", new_value));
        p_z_d_wt_wv.update(query, updated_element);
    }
}
A few suggestions:
Since your database size has exceeded (it is actually 2X) the size of your RAM, perhaps you should look at sharding. Mongo works well when you can fit your database in memory.
Storing the field key as a String not only consumes more memory; string comparisons are also slower. You could easily store this field as a NumberLong (MongoDB's long data type), since you already know the maximum size of your array is 12*7000*100*500.
I assume the max size of any dimension cannot grow over 10,000, and consequently the total number of elements in your collection is less than 10000 ** 4.
So if you want the element at p_z_d_wt_wv[1][2][3][4],
you calculate the index as
(10000 ** 0 * 4) + (10000 ** 1 * 3) + (10000 ** 2 * 2) + (10000 ** 3 * 1)
You go right to left, increase the power of your base, multiply it by whatever value happens to be in that position, and finally take the sum.
Index this field and we should expect better performance.
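A sketch of that encoding in Java (it assumes, as stated above, that no index ever reaches 10,000, so the packed value fits in a long):
// Pack the four indices into one long key using base 10000.
static long encodeKey(int z, int d, int wt, int wv) {
    final long BASE = 10000L;
    // Right to left: wv is the lowest "digit", z the highest.
    return wv + BASE * wt + BASE * BASE * d + BASE * BASE * BASE * z;
}
// encodeKey(1, 2, 3, 4) == 4 + 10000*3 + 10000*10000*2 + 10000*10000*10000*1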
Since you have just a massive array, I suggest you use a memory-mapped file. This will use about 32 GB of disk space and be much more efficient. Even so, randomly accessing a data set larger than main memory is always going to be slow unless you have a fast SSD (buying more memory would be cheaper).
I would be very surprised if MongoDB performs fast enough for you. If it takes ten hours to update, it is likely to take ten hours to scan once as well. If you have an SSD, a memory-mapped file could take about three minutes. If the data were all in memory, e.g. you had 48 GB (you would need 32+ GB free, not total), this would drop to seconds.
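To make the memory-mapped-file suggestion concrete, here is a minimal, assumed sketch (not the answerer's code) that backs the 12 x 7000 x 100 x 500 array of doubles with a file mapped in 1 GB chunks (a single MappedByteBuffer cannot exceed 2 GB):
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedArray implements AutoCloseable {
    private static final int Z = 12, D = 7000, WT = 100, WV = 500;
    private static final long CHUNK = 1L << 30;             // 1 GB per mapping
    private final RandomAccessFile file;
    private final MappedByteBuffer[] chunks;

    public MappedArray(String path) throws IOException {
        long bytes = (long) Z * D * WT * WV * Double.BYTES; // ~33.6 GB on disk
        file = new RandomAccessFile(path, "rw");
        file.setLength(bytes);
        FileChannel ch = file.getChannel();
        chunks = new MappedByteBuffer[(int) ((bytes + CHUNK - 1) / CHUNK)];
        for (int i = 0; i < chunks.length; i++) {
            long pos = i * CHUNK;
            chunks[i] = ch.map(FileChannel.MapMode.READ_WRITE, pos, Math.min(CHUNK, bytes - pos));
        }
    }

    // Flat byte offset of element [z][d][wt][wv]; doubles never straddle a chunk boundary
    // because both the offsets and the chunk size are multiples of 8.
    private long offset(int z, int d, int wt, int wv) {
        return ((((long) z * D + d) * WT + wt) * WV + wv) * Double.BYTES;
    }

    public double get(int z, int d, int wt, int wv) {
        long off = offset(z, d, wt, wv);
        return chunks[(int) (off / CHUNK)].getDouble((int) (off % CHUNK));
    }

    public void set(int z, int d, int wt, int wv, double value) {
        long off = offset(z, d, wt, wv);
        chunks[(int) (off / CHUNK)].putDouble((int) (off % CHUNK), value);
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}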
You cannot beat the limitations of your hardware. ;)