I'm using MySQL with JDBC.
I have a large example table containing 6.3 million rows on which I am trying to perform efficient SELECT queries. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256356 ms, or a little over four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
PreparedStatement pst = null;
Statement stmt = null;
ResultSet rs = null;
String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";
try {
Class.forName("com.mysql.jdbc.Driver");
con = DriverManager.getConnection(url, user, password);
String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3";
stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>=" + startTime + " AND timestamp<=" + endTime);
stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
rs = stmt.executeQuery(query);
System.out.println("Start");
while (rs.next()) {
int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
int x = (int) (maxLong * 100000) - tempLong;
int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
int y = (int) (maxLat * 100000) - tempLat;
if (y >= 0 && y < matrix.length && x >= 0 && x < matrix[0].length) {
matrix[y][x] += 1;
}
}
System.out.println("End");
JSONObject obj = convertToCRS(matrix);
return obj;
}catch (ClassNotFoundException ex){
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.SEVERE, ex.getMessage(), ex);
return null;
}
catch (SQLException ex) {
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.SEVERE, ex.getMessage(), ex);
return null;
} finally {
try {
if (rs != null) {
rs.close();
}
if (pst != null) {
pst.close();
}
if (con != null) {
con.close();
}
} catch (SQLException ex) {
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.WARNING, ex.getMessage(), ex);
return null;
}
}
Removing every line in the while(rs.next()) loop gives me the same horrible run-time.
My question is: what can I do to optimize this type of query? I am curious about .setFetchSize() and what the optimal value should be here. The documentation indicates that Integer.MIN_VALUE results in fetching row-by-row; is this correct?
Any help is appreciated.
EDIT
After creating a new index on timestamp, DayOfWeek and HourOfDay, my query runs one minute faster and EXPLAIN gives me this:
Some ideas up front:
Did you in fact check the SQL execution time (from .executeQuery() until the first row), or is that execution plus iteration over 6.3 million rows?
You prepare a PreparedStatement but don't use it?!
Use a PreparedStatement and pass timestamp, dayOfWeek and hourOfDay as parameters (see the sketch after this list)
Create one index that can satisfy your WHERE condition. Order the keys so that the leading field eliminates the most rows.
The index might look like:
CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);
Run your SQL directly inside MySQL - what time do you get there?
Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network round trips.
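A minimal sketch of the parameterized query (this assumes startTime and endTime are numeric epoch values, as the rest of the question suggests; use setString instead if they are strings, and reuse the existing con connection):
String sql = "SELECT latitude, longitude FROM 3dag "
           + "WHERE timestamp BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";
try (PreparedStatement ps = con.prepareStatement(sql)) {
    ps.setLong(1, startTime);
    ps.setLong(2, endTime);
    ps.setInt(3, 4);   // HourOfDay
    ps.setInt(4, 3);   // DayOfWeek
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            double longitude = Double.parseDouble(rs.getString(2));
            double latitude = Double.parseDouble(rs.getString(1));
            // ... accumulate into the matrix as in the original loop
        }
    }
}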
According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude)
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: This quantizes (rounds off) your lat/long values and then counts the rows that become duplicates after rounding. The more coarsely you round them (that is, the smaller the second argument in the ROUND(val,N) calls), the more duplicate values you will encounter, and the fewer distinct rows your query will generate. Fewer rows save time.
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64-bit integers for values between 0 and 31 are wasteful. With hundreds of rows it doesn't matter; with millions it does.
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_DAY
AND DayOfWeek = SOME_CONSTANT_HOUR
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude)
I am extrapolating from my tracking app. This is what I do for efficiency:
Firstly, a possible solution depends on whether or not you can predict/control the time intervals. Store snapshots every X minutes or once a day, for example. Say you want to display all events from YESTERDAY: you can save a snapshot that has already filtered your data. This speeds things up enormously, but is not a viable solution for custom time intervals or real live coverage.
My application is LIVE, but usually works pretty well at T+5 minutes (a 5-minute maximum lag/delay). Only when the user actually chooses live position viewing will the application run a full query on the live DB. So it depends on how your app works.
Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example, and converting from UNIXTIME also adds unnecessary lag. Since you are developing what appears to be a geotracking application, your timestamp should be in unixtime - an integer. Some devices work with milliseconds; I would recommend not storing them: 1449878400 instead of 1449878400000 (12/12/2015 0:00 GMT).
I save all my geopoint datetimes in unixtime seconds and use MySQL timestamps only for timestamping the moment the point was received by the server (which is irrelevant to the query you propose).
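For example, a minimal sketch of storing the point time as unixtime seconds (the table, column and variable names here are made up for illustration, not taken from your schema):
long unixSeconds = System.currentTimeMillis() / 1000L;   // seconds, not milliseconds
try (PreparedStatement ins = con.prepareStatement(
        "INSERT INTO geopoints (device_id, geo_time, latitude, longitude) VALUES (?, ?, ?, ?)")) {
    ins.setInt(1, deviceId);
    ins.setLong(2, unixSeconds);   // stored in an integer column, not VARCHAR
    ins.setDouble(3, latitude);
    ins.setDouble(4, longitude);
    ins.executeUpdate();
}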
You might shave some time off accessing an indexed view instead of running a full a query. Whether that time is significant in a large query is subject to testing.
Finally, you could shave off an itsy bitsy more by not using BETWEEN and using something SIMILAR to what it will be translated into (pseudocode below):
WHERE (timecode > start_Time AND timecode < end_time)
Note that I changed >= and <= to > and <, because chances are your timestamp will almost never be on the precise second, and even if it is, you will rarely care whether a single geopoint/time event is displayed or not.
Related
I am having some trouble figuring out a query that will update values in a column in one of my tables. Below is my function:
public void increasePrice(String [] str) {
PreparedStatement ps = null;
try {
ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
ps.setDouble(1,Double.parseDouble(str[1]));
ps.setDouble(2, Double.parseDouble(str[0]));
ps.executeUpdate();
ps.close();
System.out.println("1 rows updated.");
} catch (SQLException ex) {
Logger.getLogger(Jdbc.class.getName()).log(Level.SEVERE, null, ex);
}
}
To illustrate, the array passed in contains a value for distance and a value for price, and I want to update the prices in the 'Journey' table based on their distance. For example, if a record in the table has a distance (type double) that is less than the given distance (the value of str[0]), I want to increase the price (also a double) of that record by the value of str[1], and do this for all records in the table.
The above code doesn't give any errors; however, the records in the database never get updated. I could really use some help with this, as I've searched around for a while now trying to find a solution and have not yet succeeded.
I do not know what database you are using but my guess is that this line:
ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
should be written like this:
ps = connection.prepareStatement("Update Journey Set price=price+? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
And not related to your question but the line
System.out.println("1 rows updated.");
may make you waste hours of debugging in the future, because 0 or more rows may actually be updated.
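A minimal sketch of the corrected call that also reports the real number of affected rows (RETURN_GENERATED_KEYS is not needed for a plain UPDATE, so it is dropped here):
ps = connection.prepareStatement("Update Journey Set price = price + ? where distance < ?");
ps.setDouble(1, Double.parseDouble(str[1]));   // amount to add to price
ps.setDouble(2, Double.parseDouble(str[0]));   // distance threshold
int updated = ps.executeUpdate();              // number of rows actually changed
System.out.println(updated + " rows updated.");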
I'm executing the following Postgres query.
SELECT * FROM description WHERE levenshtein(desci, 'Description text?') <= 6 LIMIT 10;
I'm using the following code to execute the above query.
public static boolean authQuestion(String question) throws SQLException{
boolean isDescAvailable = false;
Connection connection = null;
try {
connection = DbRes.getConnection();
String query = "SELECT * FROM description WHERE levenshtein(desci, ? ) <= 6";
PreparedStatement checkStmt = connection.prepareStatement(query);
checkStmt.setString(1, question);
ResultSet rs = checkStmt.executeQuery();
while (rs.next()) {
isDescAvailable = true;
}
} catch (URISyntaxException e1) {
e1.printStackTrace();
} catch (SQLException sqle) {
sqle.printStackTrace();
} catch (Exception e) {
if (connection != null)
connection.close();
} finally {
if (connection != null)
connection.close();
}
return isDescAvailable;
}
I want to find the edit distance between the input text and the values existing in the database, and fetch all rows that are at least 60 percent similar. The above query doesn't work as expected. How do I get the rows that have 60 percent similarity?
Use this:
SELECT *
FROM description
WHERE 100 * (length(desci) - levenshtein(desci, ?))
/ length(desci) > 60
The Levenshtein distance is the count of how many letters must change (move, delete or insert) for one string to become the other. Put simply, it's the number of letters that are different.
The number of letters that are the same is then length - levenshtein.
To express this as a fraction, divide by the length, ie (length - levenshtein) / length.
To express a fraction as a percentage, multiply by 100.
I perform the multiplication by 100 first to avoid integer division truncation problems.
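Wired into your method, it might look like this (a sketch only; it reuses your existing connection and question variables together with the percentage formula above):
String query = "SELECT * FROM description "
             + "WHERE 100 * (length(desci) - levenshtein(desci, ?)) / length(desci) > 60";
PreparedStatement checkStmt = connection.prepareStatement(query);
checkStmt.setString(1, question);
try (ResultSet rs = checkStmt.executeQuery()) {
    isDescAvailable = rs.next();   // true if at least one row is more than 60% similar
}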
The most general version of the levenshtein function is:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
Both source and target can be any non-null string, with a maximum of
255 characters. The cost parameters specify how much to charge for a
character insertion, deletion, or substitution, respectively. You can
omit the cost parameters, as in the second version of the function; in
that case they all default to 1.
So, with the default cost parameters, the result you get is the total number of characters you need to change (by insertion, deletion, or substitution) in the source to get the target.
If you need to calculate the percentage difference, you should divide the levenshtein function result by the length of your source text (or target length - according to your definition of the percentage difference).
I'm trying to write a java function that can work with large result sets.
The table has 1.2 billion rows, which is 189 GB of data.
Currently I query all the data and extract the information, which I store in their respective objects (using a million-row sample DB).
TreeMap<Long, Vessel> vessels = new TreeMap<Long, Vessel>(); //map of all vessels, keyed by mmsi
try{
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, position_geom, ST_X(position_geom) AS Long, "
+ "ST_Y(position_geom) AS Lat FROM reports2 WHERE position_geom IS NOT NULL ORDER by report_timestamp ASC");
while(rs.next()){
long mmsi = rs.getLong("mmsi");
java.util.Date time = rs.getTimestamp("report_timestamp");
double longitude = rs.getDouble("Long");
double latitude = rs.getDouble("Lat");
Coordinate coordinate = new Coordinate(longitude, latitude, time);
Vessel vessel = new Vessel(mmsi);
if(!vessels.containsKey(mmsi)) { //if vessel is not present in vessels
vessel.addCoor(coordinate);
vessels.put(mmsi, vessel);
}
else{ //if vessel is already in vessels
vessels.get(mmsi).addCoor(coordinate);
}
}
}catch(Exception e){
JOptionPane.showMessageDialog(null, e);
}
With 189 GB of data, my computer's memory won't be able to hold the information.
I've never touched a table with a billion+ rows, and some of my methods require all of the table's attributes.
Can I make the ResultSet fetch 1,000,000 rows at a time, delete the objects after I run my functions on them, then collect another 1,000,000, and so on?
Is it possible to hold a 1.2 billion row result set in approximately 43,000,000 Vessel objects (will it take too much space/time)?
Should I try to limit my query by selecting on a specific key or attribute and run my functions on that specified data?
Is there another option?
If memory is an issue with the ResultSet, you can set the fetch size, though you'll need to release objects during the fetch to ensure you don't run out of memory. With Postgres you need to turn off auto-commit or the fetch size will have no effect.
connection.setAutoCommit(false);
Statement stmt = connection.createStatement();
stmt.setFetchSize(fetchsize);
You can read more about buffering the Result set at https://jdbc.postgresql.org/documentation/94/query.html#query-with-cursor
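A sketch of what that can look like with your query (handleRow below is a placeholder for your own per-row processing; the point is to hand each row off immediately instead of accumulating 1.2 billion rows in memory):
connection.setAutoCommit(false);               // required for cursor-based fetching in Postgres
try (Statement stmt = connection.createStatement()) {
    stmt.setFetchSize(50_000);                 // rows buffered per round trip; tune as needed
    try (ResultSet rs = stmt.executeQuery(
            "SELECT mmsi, report_timestamp, ST_X(position_geom) AS Long, ST_Y(position_geom) AS Lat "
          + "FROM reports2 WHERE position_geom IS NOT NULL ORDER BY report_timestamp ASC")) {
        while (rs.next()) {
            handleRow(rs.getLong("mmsi"),
                      rs.getTimestamp("report_timestamp"),
                      rs.getDouble("Long"),
                      rs.getDouble("Lat"));
        }
    }
}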
From your code it seems that you are building a Java object that collects all the coordinates with the same mmsi field. You did not provide information about how this object (an mmsi and its list of coordinates) is used. Given that, you can query the data sorted by mmsi and then by timestamp (your ORDER BY clause is only on timestamp now); when you find a different mmsi value in the result set you have collected all the data about the previous mmsi, so you can use it without reading any further data (see the sketch below).
I don't think you really need to get all the data in memory; you can rewrite the query in order to get only a fixed number of Vessel objects (a sliding window); you must page the data (i.e. retrieve a block of 10 vessels starting from the vessel at position x).
In order to provide a more detailed response, you would have to explain what you are doing with the Vessels.
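A minimal sketch of the "process one mmsi at a time" idea (assumptions: the query is ordered by mmsi first, Vessel exposes a getMmsi() accessor, and processVessel is a placeholder for whatever you do with a completed vessel):
try (Statement stmt = conn.createStatement();
     ResultSet rs = stmt.executeQuery(
         "SELECT mmsi, report_timestamp, ST_X(position_geom) AS Long, ST_Y(position_geom) AS Lat "
       + "FROM reports2 WHERE position_geom IS NOT NULL ORDER BY mmsi, report_timestamp")) {
    Vessel current = null;
    while (rs.next()) {
        long mmsi = rs.getLong("mmsi");
        if (current == null || current.getMmsi() != mmsi) {
            if (current != null) {
                processVessel(current);   // the previous vessel is now complete
            }
            current = new Vessel(mmsi);
        }
        current.addCoor(new Coordinate(rs.getDouble("Long"), rs.getDouble("Lat"),
                                       rs.getTimestamp("report_timestamp")));
    }
    if (current != null) {
        processVessel(current);           // don't forget the last vessel
    }
}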
I am working with an Oracle database and have the following code implemented in Java (with the SQL library imported), where I have a group of students and their averages, and I flag those students whose average is more than one standard deviation away from the mean (by inserting a new column with a "1" in it). Then I count the number of students who meet the criteria and add them to a new table:
try{
Statement stOne, stTwo, stThree, stFour;
String SelectAverage = "SELECT MEAN FROM STUDENTS";
ResultSet rsOne = stOne.executeQuery(SelectAverage);
String TotalAverage = "SELECT Avg(MEAN) AS averages FROM STUDENTS";
ResultSet rsTwo = stTwo.executeQuery(TotalAverage);
String student_stan_dev = "SELECT STDDEV(MEAN) AS standardDeviation FROM STUDENTS";
ResultSet rsThree = stThree.executeQuery(student_stan_dev);
int onesdMean = 1;
//Loop Duration_Sec column
while(rsOne.next()){
//Convert values into float values
float allAvgs = rsOne.getFloat("MEAN");
float totalAvg = rsTwo.getFloat("averages");
float StDev = rsThree.getFloat("standardDeviation");
float theSD = allAvgs - (onesdMean * StDev);
}
String flaggedStudents = "ALTER TABLE STUDENTS ADD FlaggedStudents INT";
stFour.executeUpdate(flaggedStudents);
if(allAvgs >= theSD){
String FlagHint = "INSERT INTO STUDENTS.FlaggedStudents VALUES('1')";
st.executeUpdate(FlagHint);
}
String countInstances = "SELECT STUDENTS.NAME, STUDENTS.FlaggedStudents, " +
"COUNT(*) OVER (PARTITION BY STUDENTS.NAME) AS cnt FROM STUDENTS";
st.executeQuery(countInstances);
st.executeUpdate("CREATE TABLE IF NOT EXISTS StudentCount" +
"(NAME INT , cnt INT)");
String insertVals = String.format("INSERT INTO StudentCount (NAME, cnt) VALUES ('%s','%s')",
studentName, studentCount); // studentName/studentCount are placeholders for the values read from the count query
st.execute(insertVals);
My question is: I want to implement a k-means algorithm instead, to cluster the students who meet this criterion and separate those who are far from meeting it. I have seen source code for the k-means algorithm, but how would I go about doing that with a database implemented in Java/SQL? Would I just add this information to a cluster array? Any help would be appreciated.
If you have only one attribute, choose a different algorithm than k-means.
Clustering algorithms are really only good for multidimensional data.
For one-dimensional data, use kernel density estimation to find local minima to split the data there. This produces much more meaningful splits. And at the same time, 1-dimensional data can be sorted (and sorting is something your SQL database does very well), which makes the problem substantially easier than in multiple dimensions.
Seriously. 1-dimensional data is the prime domain of classic statistics. They have excellent tools for this kind of data, so use them!
Multi-dimensional data, where it gets tricky to accelerate your computations, is where data-mining really shines. Once the problem gets too hard to handle with proper statistics in reasonable time, THEN the heuristic approaches of data mining are attractive. But before that, classic statistics is much more clever and advanced.
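To make the one-dimensional idea concrete, here is a rough sketch of a Gaussian kernel density estimate with splits at local minima (the bandwidth and grid size are arbitrary illustrative choices, and the input array would be the MEAN values you already read from the STUDENTS table):
import java.util.ArrayList;
import java.util.List;

public class KdeSplit {

    // Gaussian KDE evaluated at point x for the given sample and bandwidth.
    static double density(double x, double[] sample, double bandwidth) {
        double sum = 0;
        for (double v : sample) {
            double u = (x - v) / bandwidth;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (sample.length * bandwidth * Math.sqrt(2 * Math.PI));
    }

    // Returns split points: grid locations where the estimated density has a local minimum.
    static List<Double> splitPoints(double[] sample, double bandwidth, int gridSize) {
        double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
        for (double v : sample) { min = Math.min(min, v); max = Math.max(max, v); }

        double step = (max - min) / (gridSize - 1);
        double[] dens = new double[gridSize];
        for (int i = 0; i < gridSize; i++) {
            dens[i] = density(min + i * step, sample, bandwidth);
        }

        List<Double> splits = new ArrayList<>();
        for (int i = 1; i < gridSize - 1; i++) {
            if (dens[i] < dens[i - 1] && dens[i] < dens[i + 1]) {  // local minimum
                splits.add(min + i * step);
            }
        }
        return splits;
    }
}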
My problem is this: I am trying to process about 1.5 million rows of data in Spring via JdbcTemplate, coming from MySQL. With such a large number of rows, I am using the RowCallbackHandler class as suggested here.
The code is actually working, but it's SLOW... The thing is that no matter what I set the fetch size to, I seem to get approximately 350 records per fetch, with a 2 to 3 second delay between fetches (from observing my logs). I tried commenting out the store command and confirmed that the behavior stayed the same, so the problem is not with the writes.
There are 6 columns, only 1 of which is a varchar, and that one is only 25 characters long, so I can't see throughput being the issue.
Ideally I'd like to get more like 30000-50000 rows at a time. Is there a way to do that?
Here is my code:
protected void runCallback(String query, Map params, int fetchSize, RowCallbackHandler rch)
throws DatabaseException {
int oldFetchSize = getJdbcTemplate().getFetchSize();
if (fetchSize > 0) {
getJdbcTemplate().setFetchSize(fetchSize);
}
try {
getJdbcTemplate().query(getSql(query), rch);
}
catch (DataAccessException ex) {
logger.error(ExceptionUtils.getStackTrace(ex));
throw new DatabaseException( ex.getMessage() );
}
getJdbcTemplate().setFetchSize(oldFetchSize);
}
and the handler:
public class SaveUserFolderStatesCallback implements RowCallbackHandler {
@Override
public void processRow(ResultSet rs) throws SQLException {
//Save each row sequentially.
//Do NOT call ResultSet.next() !!!!
Calendar asOf = Calendar.getInstance();
log.info("AS OF DATE: " + asOf.getTime());
Long x = (Long) rs.getLong("x");
Long xx = (Long) rs.getLong("xx");
String xxx = (String) rs.getString("xxx");
BigDecimal xxxxBD = rs.getBigDecimal("xxxx");
Double xxxx = (xxxxBD == null) ? 0.0 : xxxxBD.doubleValue();
BigDecimal xxxxxBD = rs.getBigDecimal("xxxxx");
Double xxxxx = (xxxxxBD == null) ? 0.0 : xxxxxBD.doubleValue();
dbstore(x, xx, xxx, xxxx, xxxxx, asOf);
}
}
And what is your query? Try to create indexes for the fields you are searching/sorting on. That will help.
Second strategy: an in-memory cache implementation, or Hibernate plus a 2nd-level cache.
Both of these techniques can significantly speed up your query execution.
A few questions:
How long does it take if you query the DB directly? Another issue could be ASYNC_NETWORK_IO delay between the application and DB hosts.
Did you check it without using Spring?
The answer actually is to do setFetchSize(Integer.MIN_VALUE). While this totally violates the stated contract of Statement.setFetchSize, the MySQL Java connector uses this value to stream the result set, and it results in a tremendous performance improvement.
Another part of the fix is that I also needed to create my own subclass of (Spring) JdbcTemplate that would accommodate the negative fetch size. Actually, I took the code example from here, where I first found the idea of setting fetchSize(Integer.MIN_VALUE):
http://javasplitter.blogspot.com/2009/10/pimp-ma-jdbc-resultset.html
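A sketch of what such a subclass might look like (this is an assumption about the approach rather than the exact code from that link; the class name is made up, and it overrides Spring's applyStatementSettings so the negative fetch size actually reaches the MySQL driver):
import java.sql.SQLException;
import java.sql.Statement;
import org.springframework.jdbc.core.JdbcTemplate;

public class StreamingJdbcTemplate extends JdbcTemplate {

    @Override
    protected void applyStatementSettings(Statement stmt) throws SQLException {
        super.applyStatementSettings(stmt);   // applies maxRows, timeout, and positive fetch sizes
        if (getFetchSize() == Integer.MIN_VALUE) {
            stmt.setFetchSize(Integer.MIN_VALUE);   // MySQL Connector/J streams row by row
        }
    }
}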
Thank you both for your help!