I'm trying to write a Java function that can work with large result sets.
The table has 1.2 billion rows, which is 189 GB of data.
Currently I query all the data and extract the information, which I store in the respective objects (using a million-row sample DB):
TreeMap<Long, Vessel> vessels = new TreeMap<Long, Vessel>(); // map of all vessels, keyed by MMSI
try {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, position_geom, ST_X(position_geom) AS Long, "
            + "ST_Y(position_geom) AS Lat FROM reports2 WHERE position_geom IS NOT NULL ORDER BY report_timestamp ASC");
    while (rs.next()) {
        long mmsi = rs.getLong("mmsi");
        java.util.Date time = rs.getTimestamp("report_timestamp"); // getTimestamp, not getTime: getTime discards the date part
        double longitude = rs.getDouble("Long");
        double latitude = rs.getDouble("Lat");
        Coordinate coordinate = new Coordinate(longitude, latitude, time);
        if (!vessels.containsKey(mmsi)) { // vessel is not yet present in vessels
            Vessel vessel = new Vessel(mmsi); // only create the object when it is actually needed
            vessel.addCoor(coordinate);
            vessels.put(mmsi, vessel);
        } else { // vessel is already in vessels
            vessels.get(mmsi).addCoor(coordinate);
        }
    }
} catch (Exception e) {
    JOptionPane.showMessageDialog(null, e);
}
With 189 GB of data, my computer's memory won't be able to hold all of the information.
I've never touched a table with a billion+ rows, and some of my methods involve having all of the table's attributes.
Can I make the ResultSet fetch 1,000,000 rows at a time, delete the objects after I run functions on them, then fetch another 1,000,000, and so on?
Is it possible to hold a 1.2 billion row result set in approx. 43,000,000 Vessel objects (will it take too much space/time)?
Should I limit my query by selecting a specific key or attribute and run functions on that specified data?
Is there another option?
If memory is an issue with the ResultSet you can set the fetch size, though you'll need to clear objects during the fetch to ensure you don't run out of memory. With PostgreSQL you need to turn off auto-commit or the fetch size will not take effect.
connection.setAutoCommit(false);
Statement stmt = connection.createStatement();
stmt.setFetchSize(fetchsize);
You can read more about buffering the ResultSet at https://jdbc.postgresql.org/documentation/94/query.html#query-with-cursor
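A minimal sketch of that pattern against your reports2 table, processing and discarding rows in million-row slices; processBatch is a hypothetical stand-in for whatever functions you run on the data:

connection.setAutoCommit(false); // required by the PostgreSQL driver for cursor-based fetching
Statement stmt = connection.createStatement();
stmt.setFetchSize(50000); // rows buffered per round trip; tune to your memory budget
ResultSet rs = stmt.executeQuery(
        "SELECT mmsi, report_timestamp, ST_X(position_geom) AS lon, ST_Y(position_geom) AS lat "
        + "FROM reports2 WHERE position_geom IS NOT NULL ORDER BY report_timestamp");
List<Coordinate> batch = new ArrayList<>(); // java.util.List / java.util.ArrayList
while (rs.next()) {
    batch.add(new Coordinate(rs.getDouble("lon"), rs.getDouble("lat"),
            rs.getTimestamp("report_timestamp")));
    if (batch.size() == 1000000) {
        processBatch(batch); // hypothetical: run your functions on this slice
        batch.clear();       // drop the references so the GC can reclaim the memory
    }
}
if (!batch.isEmpty()) {
    processBatch(batch); // leftover partial slice
}
rs.close();
stmt.close();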
From your code it seems that you are building a Java object that collects all the coordinates with the same mmsi field. You did not provide information about how this object (an mmsi and its list of coordinates) is used. Given that, you can query the data sorted by mmsi and then by timestamp (your ORDER BY clause is only by timestamp now); when you find a different value of mmsi in the ResultSet, you have collected all the data about that specific mmsi, so you can use it without reading any further data (a sketch of this pattern follows below).
I don't think you really need to get all the data in memory; you can rewrite the query in order to get only a fixed number of Vessel objects at a time (a sliding window); you must page the data (i.e. retrieve a block of 10 vessels starting from vessel at position x).
In order to provide a more detailed response you would have to explain what you are doing with the Vessels.
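A minimal sketch of the group-change pattern mentioned above, reusing the Vessel and Coordinate classes from the question; getMmsi and processVessel are hypothetical names:

ResultSet rs = stmt.executeQuery(
        "SELECT mmsi, report_timestamp, ST_X(position_geom) AS lon, ST_Y(position_geom) AS lat "
        + "FROM reports2 WHERE position_geom IS NOT NULL ORDER BY mmsi, report_timestamp");
Vessel current = null;
while (rs.next()) {
    long mmsi = rs.getLong("mmsi");
    if (current == null || current.getMmsi() != mmsi) { // group change: the previous vessel is complete
        if (current != null) {
            processVessel(current); // run your functions, then let the object be garbage-collected
        }
        current = new Vessel(mmsi);
    }
    current.addCoor(new Coordinate(rs.getDouble("lon"), rs.getDouble("lat"),
            rs.getTimestamp("report_timestamp")));
}
if (current != null) {
    processVessel(current); // the last group
}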
Hi, I have two tables, PURCHASE and DELIVERY, whereby I want to show the remaining quantity in another table, STOCK.
Below is the output I got in my STOCK table and the code I wrote, but the calculation is wrong.
Can anyone help me with this?
private void stock() {
    try { // the original was missing this try block, leaving the catch below unmatched
        dbConection db = new dbConection();
        Connection con = db.getConnection();
        String sql = "Select delivery.pro_Name, delivery.pro_Code, "
                + "(sum(purchase.pur_qty) - sum(delivery.Qty)) AS bal from delivery, purchase "
                + "where purchase.productCode = delivery.pro_Code GROUP BY delivery.pro_Code";
        PreparedStatement pst = con.prepareStatement(sql);
        ResultSet rs = pst.executeQuery();
        DefaultTableModel tm = (DefaultTableModel) stockTable.getModel();
        tm.setRowCount(0); // clear previously displayed rows
        while (rs.next()) {
            Object o[] = {rs.getString("pro_Code"), rs.getString("pro_Name"), rs.getString("bal")};
            tm.addRow(o);
        }
    } catch (Exception ex) {
        JOptionPane.showMessageDialog(null, ex);
    }
}
First of all, this is only a SQL problem; Java is used only to view the results.
The code below is written for MS SQL Server, as you did not provide the type of SQL database.
A final stock is calculated starting from an initial stock.
So you must have something like:
initial_stock + purchase_qty - delivery_qty = final_stock (group by ProductId)
Now, if you store your articles/products in your database as FIFO, that means the price is also implied in this formula, so the group by also includes Fifo_Price. This could also be a more complex SQL select, based on related tables.
Supposing you don't need FIFO, but only a remaining-quantity stock, the select syntax will be something like this:
-- Supposing initial stock is stored in Product Table
Select pr.ProductID, pr.InitialStock, Sum(IsNull(pc.Quantity,0)) AS QuantityIn,
Sum(IsNull(dv.Quantity,0)) AS QuantityOut,
pr.InitialStock + sum(IsNull(pc.Quantity,0) - IsNull(dv.Quantity,0)) AS FinalStock
From Products pr
Left Join Purchase pc on pr.ProductID = pc.ProductID
Left Join Delivery dv on pr.ProductID = dv.ProductID
Group By pr.ProductID, pr.InitialStock
-- if you want to see only products that are implied in purchase and delivery tables, you must Inner Join all tables
Edit: I have deleted the rest of the answer because it was not relevant to the question.
Edit 2: Please look at this picture. I have made a simple test showing the results the way I did it and the way you did it:
Considering that the initial stock is not null, I did not force a replacement with 0, but you can do that too, so st.Qty becomes IsNull(st.Qty, 0) AS InitialStock.
I am having some trouble figuring out a query that will update values in a column in one of my tables. Below is my function:
public void increasePrice(String[] str) {
    PreparedStatement ps = null;
    try {
        ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",
                PreparedStatement.RETURN_GENERATED_KEYS);
        ps.setDouble(1, Double.parseDouble(str[1]));
        ps.setDouble(2, Double.parseDouble(str[0]));
        ps.executeUpdate();
        ps.close();
        System.out.println("1 rows updated.");
    } catch (SQLException ex) {
        Logger.getLogger(Jdbc.class.getName()).log(Level.SEVERE, null, ex);
    }
}
To illustrate, the array passed in contains a value for distance and a value for price, and I want to update the prices in the Journey table based on their distance. For example, if a record in the table has a distance (type double) less than the given distance (the value of str[0]), I want to increase the price (also a double) of that record by the value of str[1], and do this for all records in the table.
The above code doesn't give any errors; however, the records in the database never get updated. I could really use some help with this, as I've searched around for a while now trying to find a solution and have not yet succeeded.
I do not know what database you are using but my guess is that this line:
ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
should be written like this:
ps = connection.prepareStatement("Update Journey Set price=price+? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
And, not related to your question, the line
System.out.println("1 rows updated.");
may make you waste hours of debugging in the future, because 0 or more rows may actually be updated.
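executeUpdate returns the number of affected rows, so a small sketch that reports the real count could look like this:

int rowsUpdated = ps.executeUpdate(); // the actual number of rows the UPDATE touched
System.out.println(rowsUpdated + " rows updated.");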
try {
    Statement s = conn.createStatement();
    ResultSet result2 = s.executeQuery("Select Distinct * From Poem p, Recording r "
            + "Where r.PoemTitle = p.PoemTitle AND r.poemTitle = 'poem1'");
    System.out.print("Result (Select with Join): ");
    while (result2.next()) {
        System.out.println(result2.getString(1) + " " + result2.getString(2) + result2.getString(3));
    }
} catch (Exception e) {
    System.out.print(e.getMessage());
}
I am trying to output the poem title and the date it was recorded. When this runs, it outputs the poem title but then gives the date the poem was created instead of the date it was recorded. Is this because of the relationship?
Most likely it's because of the * in the SELECT list.
Specify the columns that you want returned, in the order you want them returned.
We're just guessing at the name of the column that contains "date recorded" and which table it's in:
SELECT p.PoemTitle
, r.dateRecorded
, r.readBy
FROM Poem p
JOIN Recording r
ON r.PoemTitle = p.PoemTitle
WHERE r.poemTitle = 'poem1'
GROUP
BY p.PoemTitle
, r.dateRecorded
, r.readBy
ORDER
BY p.PoemTitle
, r.dateRecorded DESC
, r.readBy
Notes:
Ditch the old-school comma syntax for the join operation and use the JOIN keyword instead, and relocate the join predicates from the WHERE clause to an ON clause.
Avoid using * in the SELECT list. Explicitly list the columns/expressions to be returned. When we read the code, and that SQL statement, we don't know how many columns are being returned, what order the columns are in, or what the datatypes are. (We'd have to go look at the table definitions.)
Explicitly listing the columns/expressions being returned only takes a little bit of work. If code were only ever written, that would be fine, and we would save the time writing. But code is READ ten times more than it is written. (And with the *, the SQL statement is virtually indecipherable in terms of which column is being referenced by getString(1).)
Listing the columns can also make things more efficient on the database side: preparing a resultset with a few columns costs less than preparing one with dozens of columns, and we transfer a smaller resultset from the database to the client. With a subset of columns, it's also more likely we can use a covering index for the query.
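On the Java side, retrieving by column label rather than by ordinal keeps the loop readable. A sketch against the query above (remember, dateRecorded and readBy are guessed column names):

while (result2.next()) {
    System.out.println(result2.getString("PoemTitle") + " "
            + result2.getString("dateRecorded") + " "
            + result2.getString("readBy"));
}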
I'm using MySQL with JDBC.
I have a large example table containing 6.3 million rows that I am trying to perform efficient SELECT queries on. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256356 ms, or a little above four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
PreparedStatement pst = null;
Statement stmt = null;
ResultSet rs = null;
String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";
try {
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection(url, user, password);
    String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "
            + startTime + " AND " + endTime + " AND HourOfDay=4 AND DayOfWeek=3";
    stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>="
            + startTime + " AND timestamp<=" + endTime);
    stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
            java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    rs = stmt.executeQuery(query);
    System.out.println("Start");
    while (rs.next()) {
        int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
        int x = (int) (maxLong * 100000) - tempLong;
        int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
        int y = (int) (maxLat * 100000) - tempLat;
        if (!(y > matrix.length) || !(y < 0) || !(x > matrix[0].length) || !(x < 0)) {
            matrix[y][x] += 1;
        }
    }
    System.out.println("End");
    JSONObject obj = convertToCRS(matrix);
    return obj;
} catch (ClassNotFoundException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} catch (SQLException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} finally {
    try {
        if (rs != null) {
            rs.close();
        }
        if (pst != null) {
            pst.close();
        }
        if (con != null) {
            con.close();
        }
    } catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.WARNING, ex.getMessage(), ex);
        return null;
    }
}
Removing every line in the while(rs.next()) loop gives me the same horrible run time.
My question is: what can I do to optimize this type of query? I am curious about .setFetchSize() and what the optimal value should be here. The documentation shows that Integer.MIN_VALUE results in fetching row-by-row; is this correct?
Any help is appreciated.
EDIT
After creating a new index on timestamp, DayOfWeek and HourOfDay, my query runs one minute faster, and EXPLAIN gives me this:
Some ideas up front:
Did you in fact check the SQL execution time (from .executeQuery() until the first row), or is that execution plus iteration over 6.3 million rows?
You prepare a PreparedStatement but don't use it?!
Use a PreparedStatement and pass timestamp, dayOfWeek, hourOfDay as parameters (see the sketch after this list).
Create one index that can satisfy your WHERE condition. Order the keys so that the highest-ranking field eliminates the most items.
The index might look like:
CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);
Run your SQL directly inside MySQL: what time do you get there?
Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network round trips.
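A minimal sketch of the parameterized version, with table and column names taken from the question and the timestamps assumed to be numeric:

String sql = "SELECT latitude, longitude FROM 3dag "
        + "WHERE timestamp BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";
PreparedStatement ps = con.prepareStatement(sql);
ps.setLong(1, startTime); // assuming numeric timestamps, as in the question
ps.setLong(2, endTime);
ps.setInt(3, 4);          // HourOfDay
ps.setInt(4, 3);          // DayOfWeek
ResultSet rs = ps.executeQuery();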
According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude);
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: This quantizes (rounds off) your lat/long values and then counts the items made duplicate by the rounding. The more coarsely you round them off (that is, the smaller the second number in the ROUND(val,N) function calls happens to be), the more duplicate values you will encounter, and the fewer distinct rows will be generated by your query. Fewer rows save time.
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS, use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64-bit integers for values between 0 and 31 are wasteful. With hundreds of rows it doesn't matter; with millions it does.
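A sketch of those type changes in MySQL syntax; the column names come from the question, and the target types are the suggestions above, so verify them against your actual data before running this:

ALTER TABLE 3dag
    MODIFY Latitude FLOAT,
    MODIFY Longitude FLOAT,
    MODIFY HourOfDay TINYINT,
    MODIFY DayOfWeek TINYINT;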
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_HOUR
AND DayOfWeek = SOME_CONSTANT_DAY
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude);
I am extrapolating from my tracking app. This is what I do for efficiency:
Firstly, a possible solution depends on whether or not you can predict/control the time intervals. Store snapshots every X minutes, or once a day, for example. Let us say you want to display all of YESTERDAY's events: you can save a snapshot that has already filtered your file. This speeds things up enormously, but it is not a viable solution for custom time intervals and real live coverage.
My application is LIVE, but usually works pretty well at T+5 minutes (a 5-minute maximum lag/delay). Only when the user actually chooses live position viewing does the application open a full query on the live DB. So it depends on how your app works.
Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example. Converting from UNIXTIME will also give you unnecessary lag time. Since you are developing what appears to be a geotracking application, your timestamp would be in unixtime, an integer. Some devices work with milliseconds; I would recommend not using them: 1449878400 instead of 1449878400000 (12/12/2015 0 GMT).
I save all my geopoint datetimes in unixtime seconds and use MySQL timestamps only for timestamping the moment the point was received by the server (which is irrelevant to the query you propose).
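For instance, a small sketch of writing such a timestamp from Java, assuming a numeric (INT/BIGINT) column and an already prepared insert statement pst:

long unixSeconds = System.currentTimeMillis() / 1000L; // whole seconds, e.g. 1449878400
pst.setLong(1, unixSeconds);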
You might shave some time off by accessing an indexed view instead of running a full query. Whether that time is significant in a large query is subject to testing.
Finally, you could shave off an itsy bitsy more by not using BETWEEN and using something similar to what it will be translated into (pseudocode below):
WHERE (timecode > start_time AND timecode < end_time)
Note that I changed >= and <= to > and <, because chances are your timestamp will almost never be on the precise second, and even if it is, you will rarely be affected by whether one geopoint/time event is displayed or not.
I am working with an Oracle database and have the following code implemented in Java (with the SQL library imported), where I have a group of students and their averages, and I flag those students whose average is more than one standard deviation away from the mean (by inserting a new column with a "1" in it). Then I count the number of students who meet the criteria and add them to a new table:
try {
    // Statements must be created from the connection (they were declared but never initialized)
    Statement stTwo = conn.createStatement();
    Statement stThree = conn.createStatement();
    Statement stFour = conn.createStatement();

    String totalAverage = "SELECT Avg(MEAN) AS averages FROM STUDENTS";
    ResultSet rsTwo = stTwo.executeQuery(totalAverage);
    rsTwo.next(); // single-row aggregate: advance to the row before reading it
    float totalAvg = rsTwo.getFloat("averages");

    String studentStanDev = "SELECT STDDEV(MEAN) AS standardDeviation FROM STUDENTS"; // Oracle's STDDEV, not STDEV
    ResultSet rsThree = stThree.executeQuery(studentStanDev);
    rsThree.next();
    float stDev = rsThree.getFloat("standardDeviation");

    int onesdMean = 1;
    float threshold = totalAvg + (onesdMean * stDev); // one standard deviation above the mean

    // DDL and updates go through executeUpdate (they return no ResultSet);
    // the original per-row loop is replaced by one set-based UPDATE
    stFour.executeUpdate("ALTER TABLE STUDENTS ADD FlaggedStudents INT");
    stFour.executeUpdate("UPDATE STUDENTS SET FlaggedStudents = 1 WHERE MEAN >= " + threshold);

    // Count the flagged students per name and copy the counts into a new table
    stFour.executeUpdate("CREATE TABLE StudentCount (NAME VARCHAR2(100), cnt INT)"); // Oracle has no CREATE TABLE IF NOT EXISTS
    stFour.executeUpdate("INSERT INTO StudentCount (NAME, cnt) "
            + "SELECT NAME, COUNT(*) OVER (PARTITION BY NAME) FROM STUDENTS "
            + "WHERE FlaggedStudents = 1");
} catch (SQLException e) {
    e.printStackTrace();
}
My question is: I want to implement a k-means algorithm instead, to cluster students who meet this criterion and separate those who are far from meeting it. I have seen source code for the k-means algorithm, but how would I go about doing that with a database in Java/SQL? Would I just add this information to a cluster array? Any help would be appreciated.
If you have only one attribute, choose a different algorithm than k-means.
Clustering algorithms are really only good for multidimensional data.
For one-dimensional data, use kernel density estimation to find local minima to split the data there. This produces much more meaningful splits. And at the same time, 1-dimensional data can be sorted (and sorting is something your SQL database does very well), which makes the problem substantially easier than in multiple dimensions.
Seriously. 1-dimensional data is the prime domain of classic statistics. They have excellent tools for this kind of data, so use them!
Multi-dimensional data, where it gets tricky to accelerate your computations, is where data-mining really shines. Once the problem gets too hard to handle with proper statistics in reasonable time, THEN the heuristic approaches of data mining are attractive. But before that, classic statistics is much more clever and advanced.
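For a concrete starting point, here is a minimal 1-dimensional KDE sketch in Java. The sample averages are hypothetical; in practice you would first load the MEAN column from the STUDENTS table into the array. It evaluates a Gaussian kernel density estimate on a grid and reports local minima as split points:

import java.util.Arrays;

public class Kde1D {

    // Gaussian kernel density estimate at point x with bandwidth h
    static double density(double[] data, double x, double h) {
        double sum = 0;
        for (double v : data) {
            double u = (x - v) / h;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (data.length * h * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        double[] means = {55, 58, 60, 61, 62, 85, 88, 90, 91}; // hypothetical student averages
        Arrays.sort(means); // 1-D data can be sorted cheaply, as noted above
        double h = 3.0; // bandwidth; tune to your data (e.g. with Silverman's rule of thumb)
        double lo = means[0], hi = means[means.length - 1];
        int steps = 200;
        double stepSize = (hi - lo) / steps;
        double prev = density(means, lo, h);
        double cur = density(means, lo + stepSize, h);
        for (int i = 2; i <= steps; i++) {
            double next = density(means, lo + i * stepSize, h);
            if (cur < prev && cur <= next) { // local minimum: a natural split point between clusters
                System.out.printf("split at %.2f%n", lo + (i - 1) * stepSize);
            }
            prev = cur;
            cur = next;
        }
    }
}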