Implementing K-Means Algorithm JDBC - java

I am working with an Oracle database and have the following Java code (using JDBC). I have a group of students and their averages, and I flag every student whose average is more than one standard deviation above the mean (by adding a new column and inserting a "1" into it). Then I count the number of students who meet the criteria and add them to a new table:
try {
    // conn is an existing java.sql.Connection to the Oracle database
    Statement stOne = conn.createStatement();
    Statement stTwo = conn.createStatement();
    Statement stThree = conn.createStatement();
    Statement stFour = conn.createStatement();

    // Overall average of the MEAN column
    ResultSet rsTwo = stTwo.executeQuery("SELECT AVG(MEAN) AS averages FROM STUDENTS");
    rsTwo.next();
    float totalAvg = rsTwo.getFloat("averages");

    // Standard deviation (Oracle's function is STDDEV, not STDEV)
    ResultSet rsThree = stThree.executeQuery("SELECT STDDEV(MEAN) AS standardDeviation FROM STUDENTS");
    rsThree.next();
    float stDev = rsThree.getFloat("standardDeviation");

    int onesdMean = 1;
    float threshold = totalAvg + (onesdMean * stDev); // one standard deviation above the mean

    // Add the flag column once, before flagging rows (executeUpdate returns an int, not a ResultSet)
    stFour.executeUpdate("ALTER TABLE STUDENTS ADD FlaggedStudents INT");

    // Flag every student whose average is above the threshold
    ResultSet rsOne = stOne.executeQuery("SELECT NAME, MEAN FROM STUDENTS");
    while (rsOne.next()) {
        float allAvgs = rsOne.getFloat("MEAN");
        if (allAvgs >= threshold) {
            stFour.executeUpdate("UPDATE STUDENTS SET FlaggedStudents = 1 WHERE NAME = '"
                    + rsOne.getString("NAME") + "'");
        }
    }

    // Count the flagged students and store them in a new table
    // (Oracle has no CREATE TABLE IF NOT EXISTS; NAME is assumed to be a string column)
    stFour.executeUpdate("CREATE TABLE StudentCount (NAME VARCHAR2(50), cnt INT)");
    stFour.executeUpdate("INSERT INTO StudentCount (NAME, cnt) "
            + "SELECT NAME, COUNT(*) OVER () FROM STUDENTS WHERE FlaggedStudents = 1");
} catch (SQLException e) {
    e.printStackTrace();
}
My question is: I want to implement a k-means algorithm instead, to cluster students who meet this criterion and separate those who are far from meeting it. I have seen source code for the k-means algorithm, but how would I go about doing that with a database in Java/SQL? Would I just add this information to a cluster array? Any help would be appreciated.

If you have only one attribute, choose a different algorithm than k-means.
Clustering algorithms are really only good for multidimensional data.
For one-dimensional data, use kernel density estimation to find local minima to split the data there. This produces much more meaningful splits. And at the same time, 1-dimensional data can be sorted (and sorting is something your SQL database does very well), which makes the problem substantially easier than in multiple dimensions.
Seriously. 1-dimensional data is the prime domain of classic statistics. They have excellent tools for this kind of data, so use them!
Multi-dimensional data, where it gets tricky to accelerate your computations, is where data-mining really shines. Once the problem gets too hard to handle with proper statistics in reasonable time, THEN the heuristic approaches of data mining are attractive. But before that, classic statistics is much more clever and advanced.
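To make that concrete, here is a minimal sketch of the KDE idea in Java: estimate a Gaussian kernel density over the one-dimensional averages (which you could load from STUDENTS.MEAN) and split the data at the local minima of the density. The sample values and the bandwidth are illustrative assumptions, not tuned recommendations.
import java.util.Arrays;

public class KdeSplit {

    // Gaussian kernel density estimate at point x
    static double density(double x, double[] data, double h) {
        double sum = 0;
        for (double d : data) {
            double u = (x - d) / h;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (data.length * h * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // Hypothetical student averages; in practice, load them from STUDENTS.MEAN
        double[] means = { 55, 58, 60, 61, 62, 80, 82, 85, 88, 90 };
        Arrays.sort(means);
        double h = 3.0;  // bandwidth: an assumption; Silverman's rule is a common default
        int grid = 200;
        double lo = means[0], hi = means[means.length - 1];
        double step = (hi - lo) / grid;
        double prev = density(lo, means, h);
        double curr = density(lo + step, means, h);
        for (int i = 2; i <= grid; i++) {
            double next = density(lo + i * step, means, h);
            if (curr < prev && curr < next) {
                // local minimum of the density: a natural split point
                System.out.printf("split at %.2f%n", lo + (i - 1) * step);
            }
            prev = curr;
            curr = next;
        }
    }
}
Rows on either side of a split point form your groups, and you can write the group label back with one UPDATE per interval, just like the flag column above.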

Related

How to subtract quantity between two tables based on item types in Java Netbeans?

Hi, I have two tables, PURCHASE and DELIVERY, and I want to show the remaining quantity in another table, STOCK.
Below is the output I got from my STOCK table and the code I wrote, but the calculation is wrong.
Can anyone help me with this?
private void stock() {
    try {
        dbConection db = new dbConection();
        Connection con = db.getConnection();
        String sql = "SELECT delivery.pro_Name, delivery.pro_Code, "
                + "(SUM(purchase.pur_qty) - SUM(delivery.Qty)) AS bal "
                + "FROM delivery, purchase "
                + "WHERE purchase.productCode = delivery.pro_Code "
                + "GROUP BY delivery.pro_Code";
        PreparedStatement pst = con.prepareStatement(sql);
        ResultSet rs = pst.executeQuery();
        DefaultTableModel tm = (DefaultTableModel) stockTable.getModel();
        tm.setRowCount(0);
        while (rs.next()) {
            Object[] o = { rs.getString("pro_Code"), rs.getString("pro_Name"), rs.getString("bal") };
            tm.addRow(o);
        }
    } catch (Exception ex) {
        JOptionPane.showMessageDialog(null, ex);
    }
}
First of all, this is purely a SQL problem; Java is only used to view the results.
The SQL below is written for MS SQL Server, as you did not specify which database you are using.
A final stock is calculated starting from an initial stock.
So you must have something like:
initial_stock + purchase_qty - delivery_qty = final_stock (group by ProductId)
Now, if you store your articles/products as FIFO, the price is also implied in this formula, so the GROUP BY also involves Fifo_Price. This could also be a more complex SELECT based on related tables.
Supposing you don't need FIFO, but only the remaining quantity in stock, the syntax will be something like this:
-- Supposing the initial stock is stored in the Products table
SELECT pr.ProductID, pr.InitialStock,
       SUM(ISNULL(pc.Quantity, 0)) AS QuantityIn,
       SUM(ISNULL(dv.Quantity, 0)) AS QuantityOut,
       pr.InitialStock + SUM(ISNULL(pc.Quantity, 0) - ISNULL(dv.Quantity, 0)) AS FinalStock
FROM Products pr
LEFT JOIN Purchase pc ON pr.ProductID = pc.ProductID
LEFT JOIN Delivery dv ON pr.ProductID = dv.ProductID
GROUP BY pr.ProductID, pr.InitialStock
-- If you want to see only products that appear in both the purchase and delivery tables, use INNER JOINs instead
Edit: I have deleted the rest of the answer because it was not relevant to the question.
Edit 2: Please look at this picture. I made a simple test showing the results as I did it and as you did it:
Considering that the initial stock is never null, I did not force a replacement with 0, but you can add that too: st.Qty would become IsNull(st.Qty, 0) AS InitialStock.
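Back in Java, the only change needed is to swap in the corrected query. A minimal sketch, assuming the Products/Purchase/Delivery schema from the SELECT above and the asker's dbConection class and stockTable JTable:
private void stock() {
    try {
        Connection con = new dbConection().getConnection();
        // Corrected query from above: initial stock + purchases - deliveries, per product
        String sql = "SELECT pr.ProductID, pr.InitialStock + SUM(ISNULL(pc.Quantity, 0) "
                + "- ISNULL(dv.Quantity, 0)) AS FinalStock "
                + "FROM Products pr "
                + "LEFT JOIN Purchase pc ON pr.ProductID = pc.ProductID "
                + "LEFT JOIN Delivery dv ON pr.ProductID = dv.ProductID "
                + "GROUP BY pr.ProductID, pr.InitialStock";
        PreparedStatement pst = con.prepareStatement(sql);
        ResultSet rs = pst.executeQuery();
        DefaultTableModel tm = (DefaultTableModel) stockTable.getModel();
        tm.setRowCount(0);
        while (rs.next()) {
            tm.addRow(new Object[] { rs.getString("ProductID"), rs.getInt("FinalStock") });
        }
    } catch (Exception ex) {
        JOptionPane.showMessageDialog(null, ex);
    }
}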

Optimizing MySQL query on large table

I'm using MySQL with JDBC.
I have a large example table which contains 6.3 million rows that I am trying to perform efficient select queries on. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256,356 ms, a little over four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
PreparedStatement pst = null;
Statement stmt = null;
ResultSet rs = null;
String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";
try {
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection(url, user, password);
    String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "
            + startTime + " AND " + endTime + " AND HourOfDay=4 AND DayOfWeek=3";
    stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>="
            + startTime + " AND timestamp<=" + endTime);
    stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
            java.sql.ResultSet.CONCUR_READ_ONLY);
    stmt.setFetchSize(Integer.MIN_VALUE);
    rs = stmt.executeQuery(query);
    System.out.println("Start");
    while (rs.next()) {
        int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
        int x = (int) (maxLong * 100000) - tempLong;
        int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
        int y = (int) (maxLat * 100000) - tempLat;
        if (!(y > matrix.length) || !(y < 0) || !(x > matrix[0].length) || !(x < 0)) {
            matrix[y][x] += 1;
        }
    }
    System.out.println("End");
    JSONObject obj = convertToCRS(matrix);
    return obj;
} catch (ClassNotFoundException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} catch (SQLException ex) {
    Logger lgr = Logger.getLogger(Database.class.getName());
    lgr.log(Level.SEVERE, ex.getMessage(), ex);
    return null;
} finally {
    try {
        if (rs != null) {
            rs.close();
        }
        if (pst != null) {
            pst.close();
        }
        if (con != null) {
            con.close();
        }
    } catch (SQLException ex) {
        Logger lgr = Logger.getLogger(Database.class.getName());
        lgr.log(Level.WARNING, ex.getMessage(), ex);
        return null;
    }
}
Removing every line in the while(rs.next()) loop gives me the same horrible run-time.
My question is: what can I do to optimize this type of query? I am also curious about .setFetchSize() and what the optimal value should be here. The documentation says Integer.MIN_VALUE results in fetching row by row; is this correct?
Any help is appreciated.
EDIT
After creating a new index on timestamp, DayOfWeek and HourOfDay my query runs 1 minute faster and explain gives me this:
Some ideas up front:
Did you in fact check the SQL execution time (from .executeQuery() until the first row), or is that execution plus iteration over 6.3 million rows?
You prepare a PreparedStatement but don't use it?!
Use a PreparedStatement and pass timestamp, dayOfWeek, hourOfDay as parameters (see the sketch after this list).
Create one index that can satisfy your WHERE condition. Order the keys so that the highest-ranking field eliminates the most items.
The index might look like:
CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);
Perform your SQL inside MySQL - what time do you get there?
Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network roundtrips.
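A minimal sketch of the PreparedStatement point above, assuming the timestamp column is numeric; the constants 4 and 3 are the values from the question:
String query = "SELECT latitude, longitude FROM 3dag "
        + "WHERE timestamp BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";
try (PreparedStatement ps = con.prepareStatement(query)) {
    ps.setLong(1, startTime);  // assumes timestamps are stored as numbers
    ps.setLong(2, endTime);
    ps.setInt(3, 4);           // HourOfDay
    ps.setInt(4, 3);           // DayOfWeek
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            double longitude = rs.getDouble("longitude");
            double latitude = rs.getDouble("latitude");
            // accumulate into the matrix as before
        }
    }
}
Besides avoiding repeated hard parses, this keeps the values out of the SQL text, so the server can reuse the execution plan across calls.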
According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude);
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: This quantizes (rounds off) your lat/long values and then counts the items made duplicate by the rounding. The more coarsely you round (that is, the smaller the second argument to ROUND(val, N)), the more duplicate values you will encounter, and the fewer distinct rows your query will generate. Fewer rows save time.
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64-bit integers for values between 0 and 31 are wasteful. With hundreds of rows it doesn't matter; with millions it does.
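For illustration, the conversions might look like the statement below (column names are assumed from the question; MySQL rewrites the whole 6.3-million-row table for such changes, so test on a copy first):
ALTER TABLE 3dag
    MODIFY latitude FLOAT,
    MODIFY longitude FLOAT,
    MODIFY HourOfDay TINYINT UNSIGNED,
    MODIFY DayOfWeek TINYINT UNSIGNED;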
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_HOUR
AND DayOfWeek = SOME_CONSTANT_DAY
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude);
I am extrapolating from my tracking app. This is what I do for efficiency:
Firstly, a possible solution depends on whether or not you can predict/control the time intervals: store snapshots every X minutes or once a day, for example. Let's say you want to display all of YESTERDAY's events: you can save a snapshot that has already filtered your file. This speeds things up enormously, but it is not a viable solution for custom time intervals or live coverage.
My application is LIVE but usually works fine at T+5 minutes (a 5-minute maximum lag/delay). Only when the user actually chooses live position viewing does the application run a full query on the live db. So it depends on how your app works.
Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example, and converting from UNIXTIME also adds unnecessary lag. Since you are developing what appears to be a geotracking application, your timestamp should be in unixtime, an integer. Some devices work with milliseconds; I would recommend not using them: 1449878400 instead of 1449878400000 (12/12/2015 00:00 GMT).
I save all my geopoint datetimes in unixtime seconds and use MySQL timestamps only to record the moment the point was received by the server (which is irrelevant to the query you propose).
You might shave some time off by reading from an indexed view instead of running a full query. Whether that time is significant in a large query is subject to testing.
Finally, you could shave off an itsy bitsy more by not using BETWEEN and instead writing something similar to what it translates into (pseudocode below):
WHERE (timecode > start_time AND timecode < end_time)
Note that I changed >= and <= to > and <, because your timestamp will almost never fall on the precise second, and even if it does, you will rarely care whether one geopoint/time event is displayed or not.

Working with a large Resultset

I'm trying to write a java function that can work with large result sets.
The table has 1.2 billion rows, which is 189 GB of data.
Currently I query all the data and extract the information, which I store in their respective objects (using a million-row sample db):
TreeMap<Long, Vessel> vessels = new TreeMap<Long, Vessel>(); // map of all vessels, keyed by mmsi
try {
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, position_geom, ST_X(position_geom) AS Long, "
            + "ST_Y(position_geom) AS Lat FROM reports2 WHERE position_geom IS NOT NULL ORDER BY report_timestamp ASC");
    while (rs.next()) {
        long mmsi = rs.getLong("mmsi");
        java.util.Date time = rs.getTimestamp("report_timestamp"); // getTimestamp keeps date and time; getTime would drop the date
        double longitude = rs.getDouble("Long");
        double latitude = rs.getDouble("Lat");
        Coordinate coordinate = new Coordinate(longitude, latitude, time);
        if (!vessels.containsKey(mmsi)) { // vessel not seen yet
            Vessel vessel = new Vessel(mmsi);
            vessel.addCoor(coordinate);
            vessels.put(mmsi, vessel);
        } else { // vessel already present
            vessels.get(mmsi).addCoor(coordinate);
        }
    }
} catch (Exception e) {
    JOptionPane.showMessageDialog(null, e);
}
With 189 GB of data, my computer's memory won't be able to hold all the information.
I've never touched a table with a billion+ rows, and some of my methods need all of the table's attributes.
Can I make the ResultSet collect 1,000,000 rows at a time, run my functions on them, delete the objects, and then collect the next 1,000,000, and so on?
Is it possible to hold a 1.2-billion-row resultset in roughly 43,000,000 Vessel objects (or will that take too much space/time)?
Should I try to limit my query by selecting a specific key or attribute and running my functions on the specified data?
Is there another option?
If memory is an issue with the ResultSet, you can set the fetch size, though you'll need to release objects as you go to make sure you don't run out of memory. With Postgres you need to turn off auto-commit, or the fetch size will have no effect.
connection.setAutoCommit(false);
Statement stmt = connection.createStatement();
stmt.setFetchSize(fetchsize);
You can read more about buffering the Result set at https://jdbc.postgresql.org/documentation/94/query.html#query-with-cursor
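A sketch of that pattern, with process(...) standing in for whatever per-row work you need (a hypothetical method, not part of any library); the driver then streams a batch of rows per round trip instead of materializing the whole result:
connection.setAutoCommit(false); // required by the Postgres driver for cursor-based fetching
try (Statement stmt = connection.createStatement()) {
    stmt.setFetchSize(10_000); // rows per round trip; tune to your memory budget
    ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, "
            + "ST_X(position_geom) AS Long, ST_Y(position_geom) AS Lat "
            + "FROM reports2 WHERE position_geom IS NOT NULL "
            + "ORDER BY report_timestamp");
    while (rs.next()) {
        // do the per-row work, then let the row go; keep nothing you don't need
        process(rs.getLong("mmsi"), rs.getTimestamp("report_timestamp"),
                rs.getDouble("Long"), rs.getDouble("Lat"));
    }
    rs.close();
}
connection.commit();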
From your code it seems that you are building a Java object that collects all the coordinates with the same mmsi field. You did not provide information about how this object (the mmsi and its list of coordinates) is used. Given that, you can query the data sorted by mmsi and then timestamp (your ORDER BY clause is only by timestamp now); when you find a different value of mmsi in the resultset, you have collected all the data for the previous mmsi, so you can process it without reading any further (see the sketch below).
I don't think you really need to hold all the data in memory; you can rewrite the query to retrieve only a fixed number of Vessel objects at a time (a sliding window); you must page the data (i.e. retrieve a block of 10 vessels starting from the vessel at position x).
In order to provide a more detailed response you have to explain what you are doing with Vessels.
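For illustration, a hedged sketch of the group-change idea from the first paragraph: one pass over a result set sorted by mmsi, handing each completed vessel to a handleVessel(...) callback (a hypothetical method of yours, as is Vessel.getMmsi()) and then letting the object be collected:
ResultSet rs = stmt.executeQuery("SELECT mmsi, report_timestamp, "
        + "ST_X(position_geom) AS Long, ST_Y(position_geom) AS Lat "
        + "FROM reports2 WHERE position_geom IS NOT NULL "
        + "ORDER BY mmsi, report_timestamp"); // mmsi first, then time
Vessel current = null;
while (rs.next()) {
    long mmsi = rs.getLong("mmsi");
    if (current == null || current.getMmsi() != mmsi) { // group change: previous vessel is complete
        if (current != null) {
            handleVessel(current); // run your functions, then drop the reference
        }
        current = new Vessel(mmsi);
    }
    current.addCoor(new Coordinate(rs.getDouble("Long"), rs.getDouble("Lat"),
            rs.getTimestamp("report_timestamp")));
}
if (current != null) {
    handleVessel(current); // the last group
}
Combined with a cursor-style fetch size as in the previous answer, memory use stays bounded by the largest single vessel rather than by the whole table.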

How to resolve ORA-01795 in Java code

I am getting an ORA-01795 error in my Java code when an IN clause has more than 1000 entries.
I am thinking of breaking it into batches of 1000 entries, using multiple IN clauses separated by OR, like below:
select * from table_name
where
column_name in (V1,V2,V3,...V1000)
or
column_name in (V1001,V1002,V1003,...V2000)
I have string ids like -18435,16690,1719,1082,1026,100759..., generated dynamically based on user selection. How do I write the Java logic to batch them as records 1-1000, 1001-2000, and so on? Can anyone help me here?
There are three potential ways around this limit:
1) As you have already mentioned: split up the statement in batches of 1000
2) Create a derived table using the values and then join them:
with id_list (id) as (
select 'V1' from dual union all
select 'V2' from dual union all
select 'V3' from dual
)
select *
from the_table
where column_name in (select id from id_list);
Alternatively, you could join those values directly; this might even be faster:
with id_list (id) as (
select 'V1' from dual union all
select 'V2' from dual union all
select 'V3' from dual
)
select t.*
from the_table t
join id_list l on t.column_name = l.id;
This still generates a really, really huge statement, but it doesn't have the 1000-id limit. I'm not sure how fast Oracle will parse it, though.
3) Insert the values into a (global) temporary table and then use an IN clause (or a JOIN). This is probably going to be the fastest solution.
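A sketch of option 3, with hypothetical names (tmp_ids, the_table, column_name). In Oracle the global temporary table is created once as a permanent object; only its rows are private to the session:
-- one-time DDL, not per query; rows disappear at COMMIT
CREATE GLOBAL TEMPORARY TABLE tmp_ids (id NUMBER) ON COMMIT DELETE ROWS;
// per query: batch-insert the ids, then join
try (PreparedStatement ins = conn.prepareStatement("INSERT INTO tmp_ids (id) VALUES (?)")) {
    for (long id : ids) {
        ins.setLong(1, id);
        ins.addBatch();
    }
    ins.executeBatch();
}
try (Statement st = conn.createStatement();
     ResultSet rs = st.executeQuery(
             "SELECT t.* FROM the_table t JOIN tmp_ids i ON t.column_name = i.id")) {
    // process the rows
}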
With so many values I'd avoid both IN and OR, as well as the hard-parse penalty of embedding values in the query, if at all possible. You can pass a SQL collection of values and use the table() collection expression as a table to join your real table to.
This uses a hard-coded array of integers as an example, but you can populate that array from your user input instead. I'm using the built-in collection type definitions, like sys.odcinumberlist, which is a varray of numbers limited to 32k values, but you can define your own table type if you prefer or might need to handle more than that.
int[] ids = { -18435,16690,1719,1082,1026,100759 };
ArrayDescriptor aDesc = ArrayDescriptor.createDescriptor("SYS.ODCINUMBERLIST", conn );
oracle.sql.ARRAY ora_ids = new oracle.sql.ARRAY(aDesc, conn, ids);
sql = "select t.* "
+ "from table(?) a "
+ "left join table_name t "
+ "on t.column_name = a.column_value "
+ "order by id";
pStmt = (OraclePreparedStatement) conn.prepareStatement(sql);
pStmt.setArray(1, ora_ids);
rSet = (OracleResultSet) pStmt.executeQuery();
...
Your array can have as many values as you like (well, as many as the collection type you use and your JVM's memory can handle) and isn't subject to the IN list's 1000-member limit.
Essentially table(?) ends up looking like a table containing all your values, and this is going to be easier and faster than populating a real or temporary table with all the values and joining to that.
Of course, don't really use t.*; list the columns you need. I'm assuming you used * to simplify the question...
(Here is a more complete example, but for a slightly different scenario.)
I very recently hit this wall myself:
Oracle has an architectural limit of at most 1000 terms inside an IN()
There are two workarounds:
Refactor the query to become a join
Leave the query as it is, but call it multiple times in a loop, each call using fewer than 1000 terms
Option 1 depends on the situation. If your list of values comes from a query, you can refactor it into a join.
Option 2 is also easy, but less performant:
List<String> terms; // populated elsewhere
for (int i = 0; i < terms.size(); i += 1000) {
    List<String> next1000 = terms.subList(i, Math.min(i + 1000, terms.size()));
    // build and execute the query using next1000 instead of terms
}
In such situations, when I have ids in a List in Java, I use a utility class like this to split the list to partitions and generate the statement from those partitions:
public class ListUtils {
    public static <T> List<List<T>> partition(List<T> orig, int size) {
        if (orig == null) {
            throw new NullPointerException("The list to partition must not be null");
        }
        if (size < 1) {
            throw new IllegalArgumentException("The target partition size must be 1 or greater");
        }
        int origSize = orig.size();
        List<List<T>> result = new ArrayList<>(origSize / size + 1);
        for (int i = 0; i < origSize; i += size) {
            result.add(orig.subList(i, Math.min(i + size, origSize)));
        }
        return result;
    }
}
Let's say your ids are in a list called ids, you could get sublists of size at most 1000 with:
ListUtils.partition(ids, 1000)
Then you could iterate over the results to construct the final query string.
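For example (a sketch; table_name and column_name are placeholders, and bind variables are preferable to concatenation for untrusted input), each partition becomes one IN list joined with OR:
StringBuilder where = new StringBuilder();
for (List<String> chunk : ListUtils.partition(ids, 1000)) {
    if (where.length() > 0) {
        where.append(" OR ");
    }
    where.append("column_name IN ('")
         .append(String.join("','", chunk))
         .append("')");
}
String sql = "SELECT * FROM table_name WHERE " + where;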

java- generate id (combination of string and integer)

I am making an application in NetBeans (Java). This application has a unique id that is a combination of a string and an integer, like abc/111 or xyz/253, and the integer part should increase by one with every new entry in the database, i.e. abc/112 and xyz/254.
The problem is that the integer part increases correctly until it reaches 10, but after that it stops increasing and stays the same for all further entries in the database.
I used the following code -
try {
    String sql = "SELECT RegNumber FROM Death ORDER BY RegNumber DESC";
    pst = conn.prepareStatement(sql);
    rs = pst.executeQuery();
    if (rs.next()) {
        String add1 = rs.getString("RegNumber");
        String[] parts = add1.split("/");
        String part1 = parts[0]; // string part, e.g. "abc"
        String part2 = parts[1]; // numeric part, e.g. "111"
        int a = Integer.parseInt(part2);
        int b = a + 1;
        jTextField20.setText(part1 + "/" + b);
        JOptionPane.showMessageDialog(null, "done");
    }
} catch (Exception e) {
    JOptionPane.showMessageDialog(null, e);
}
"Integer part increase till 10" means that if I start the first value of id in database like abc/1 then new id generates automatically for the next entry with the increasing value 1 that is abc/2 and for next entry it is abc/3 and so on in sequential order like this: abc/4, ..., abc/10
But when it has reached abc/10 the new generated id remains same i.e. abc/10 for every new entry in database. (I am using MS Access 2007 and the id is of text type). The first id in the database is created by the application itself.
If anyone has another alternative to generate id, please tell me.
The problem is that
String sql = "SELECT RegNumber FROM Death ORDER BY RegNumber DESC ";
will sort in descending alphabetical order, and alphabetically speaking
"abc/9" > "abc/10"
and that's why your program always fetches 9 over and over again...
I think you will have to split that column up for storage and store the numeric part as an actual number type in the database. That's probably not as hard as it sounds, since you can always sort on two fields:
String sql = "SELECT RegNumber FROM Death ORDER BY RegString DESC, RegNumber DESC ";
You could also consider using a SERIAL (auto-increment) data type for the numeric part in certain cases (i.e. if RegNumber is not reset when, e.g., the string part changes) to simplify your insertion logic further.
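If changing the schema is not an option, a workaround sketch is to do the numeric comparison in Java instead: read every RegNumber, parse the part after the slash, and take the maximum. This assumes a single fixed prefix and a table small enough to scan:
String prefix = "abc"; // assumed fixed string part
int max = 0;
ResultSet rs = conn.prepareStatement("SELECT RegNumber FROM Death").executeQuery();
while (rs.next()) {
    String[] parts = rs.getString("RegNumber").split("/");
    if (parts[0].equals(prefix)) {
        max = Math.max(max, Integer.parseInt(parts[1])); // numeric, not alphabetic, comparison
    }
}
jTextField20.setText(prefix + "/" + (max + 1));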
Your SELECT query sorts the entries in descending order, but they are of Varchar type:
"SELECT RegNumber FROM Death ORDER BY RegNumber DESC"
This means that after sorting it gets the values as
abc/9, abc/8, abc/7, abc/6, abc/5, abc/4, abc/3, abc/2, abc/10, abc/1.
So the first id is always 9, which means the next value is always 10.
