Calculating Levenshtein distance between two strings - java

I'm executing the following Postgres query:
SELECT * FROM description WHERE levenshtein(desci, 'Description text?') <= 6 LIMIT 10;
I'm using the following code to execute the above query:
public static boolean authQuestion(String question) throws SQLException {
    boolean isDescAvailable = false;
    Connection connection = null;
    try {
        connection = DbRes.getConnection();
        String query = "SELECT * FROM description WHERE levenshtein(desci, ?) <= 6";
        PreparedStatement checkStmt = connection.prepareStatement(query);
        checkStmt.setString(1, question);
        ResultSet rs = checkStmt.executeQuery();
        if (rs.next()) {
            isDescAvailable = true;
        }
    } catch (URISyntaxException e1) {
        e1.printStackTrace();
    } catch (SQLException sqle) {
        sqle.printStackTrace();
    } finally {
        if (connection != null)
            connection.close();
    }
    return isDescAvailable;
}
I want to find the edit distance between the input text and the values stored in the database, and fetch all rows that are at least 60 percent similar. The above query doesn't work as expected. How do I get the rows that are at least 60 percent similar to the input?

Use this:
SELECT *
FROM description
WHERE 100 * (length(desci) - levenshtein(desci, ?))
/ length(desci) > 60
The Levenshtein distance is the count of how many single-character edits (substitutions, deletions or insertions) are needed for one string to become the other. Put simply, it's the number of letters that are different.
The number of letters that are the same is then length - levenshtein.
To express this as a fraction, divide by the length, ie (length - levenshtein) / length.
To express a fraction as a percentage, multiply by 100.
I perform the multiplication by 100 first to avoid integer division truncation problems.
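If you want to sanity-check the arithmetic outside the database, here is a minimal Java sketch (class and method names are mine, not from the original code) that mirrors the default-cost levenshtein() and applies the same percentage formula:

```java
// Hypothetical helper mirroring Postgres levenshtein() so the percentage
// math can be checked without a database round trip.
public class Similarity {

    // Classic dynamic-programming edit distance
    // (insertion, deletion, substitution, each with cost 1).
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Multiply by 100 before dividing, exactly as in the SQL above,
    // so integer division does not truncate the fraction to 0.
    static int similarityPercent(String stored, String input) {
        return 100 * (stored.length() - levenshtein(stored, input)) / stored.length();
    }
}
```

For example, similarityPercent("kitten", "sitting") is 50: the distance is 3, so 3 of the 6 stored characters survive, and 100 * 3 / 6 = 50.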

The most general version of the levenshtein function is:
levenshtein(text source, text target, int ins_cost, int del_cost, int sub_cost) returns int
Both source and target can be any non-null string, with a maximum of
255 characters. The cost parameters specify how much to charge for a
character insertion, deletion, or substitution, respectively. You can
omit the cost parameters, as in the second version of the function; in
that case they all default to 1.
So, with the default cost parameters, the result you get is the total number of characters you need to change (by insertion, deletion, or substitution) in the source to get the target.
If you need to calculate the percentage difference, you should divide the levenshtein function result by the length of your source text (or target length - according to your definition of the percentage difference).

Related

Java/MySQL - How to retrieve data from specific row

I really can't find a solution for this problem:
Here I have two ResultSets, one which always shows me the number of items stored in my database and one that retrieves all the data from it.
I would like to generate a random number and then generate a random item based on the row number/id in my database. Since I'm fairly new I'm not sure if this is an efficient approach. It doesn't look very clean to retrieve all the data and then iterate over it every time. Especially if I had like 1000 items and the randomly generated number is 999.
PreparedStatement randomSelection = con.prepareStatement("SELECT * FROM items ORDER BY RAND() LIMIT 1"); {
String name = ((ResultSet) randomSelection).getString(2);
System.out.println(name);
}
I tried reading the column itemname with the last line. However, I just can't find a good solution for this problem. I would highly appreciate any help since I'm fairly new to databases.
Thank you
EDIT: This is what I tried now, and somehow there is no output:
ResultSet numberOfItemsInDataBase = stmt.executeQuery("SELECT count(*) FROM items;");
// this will return a number between 0 and the number of rows - 1
int id = new Random().nextInt(numberOfItemsInDataBase.getInt(1));
ResultSet itemsInDataBase = stmt.executeQuery("select * from items order by id limit 1 offset " + id);
if (itemsInDataBase.next()) {
String item = itemsInDataBase.getString(2);
System.out.println(item);
}
If you just need a random row of the table then you can do it with plain SQL with the function RAND():
ResultSet itemsInDataBase = stmt.executeQuery("select * from items order by rand() limit 1");
if (itemsInDataBase.next()) {
item = new Item(itemsInDataBase.getString(2));
}
If you want to use the generated random number, then use it in the OFFSET clause of the sql statement:
ResultSet numberOfItemsInDataBase = stmt.executeQuery("SELECT count(*) FROM items;");
// the above query will return exactly 1 row
numberOfItemsInDataBase.next();
// this will return a number between 0 and the number of rows - 1
int id = new Random().nextInt(numberOfItemsInDataBase.getInt(1));
ResultSet itemsInDataBase = stmt.executeQuery("select * from items order by id limit 1 offset " + id);
if (itemsInDataBase.next()) {
item = new Item(itemsInDataBase.getString(2));
}
Use ORDER BY RAND() and limit the result to 1. This circumvents you having to query for the count and then ultimately iterate through the ResultSet until you find the random entry.
try (PreparedStatement ps = connection
        .prepareStatement("SELECT * FROM items ORDER BY RAND() LIMIT 1");
     ResultSet randomSelection = ps.executeQuery()) {
    if (randomSelection.next()) {
        String name = randomSelection.getString(2);
    }
}
You can use the LIMIT clause to get the item.
The LIMIT clause can be used to constrain the number of rows returned by the SELECT statement. LIMIT takes one or two numeric arguments, which must both be nonnegative integer constants (except when using prepared statements).
With two arguments, the first argument specifies the offset of the first row to return, and the second specifies the maximum number of rows to return. The offset of the initial row is 0 (not 1). So in your case the offset can be the randomly generated id minus one, and the maximum number of rows is 1:
select * from items LIMIT {id-1},1; # Retrieve row (id-1)
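Putting the count-then-OFFSET approach together, here is a rough sketch (fetchRandomName and randomOffset are my names; the table and column names follow the question) that binds the offset as a parameter instead of concatenating it into the SQL:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Random;

public class RandomItem {

    // Pure helper: pick a random 0-based row offset given the row count.
    static int randomOffset(int rowCount, Random rng) {
        return rng.nextInt(rowCount); // 0 .. rowCount-1, valid for OFFSET
    }

    // Table and column names are taken from the question; the method name is mine.
    static String fetchRandomName(Connection con) throws SQLException {
        int count;
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM items")) {
            rs.next(); // advance the cursor before reading the count
            count = rs.getInt(1);
        }
        try (PreparedStatement ps = con.prepareStatement(
                "SELECT itemname FROM items ORDER BY id LIMIT 1 OFFSET ?")) {
            ps.setInt(1, randomOffset(count, new Random()));
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        }
    }
}
```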

Java how to increase the value of doubles in a database table

I am having some trouble figuring out a query that will update values in a column in one of my tables. Below is my function:
public void increasePrice(String [] str) {
PreparedStatement ps = null;
try {
ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
ps.setDouble(1,Double.parseDouble(str[1]));
ps.setDouble(2, Double.parseDouble(str[0]));
ps.executeUpdate();
ps.close();
System.out.println("1 rows updated.");
} catch (SQLException ex) {
Logger.getLogger(Jdbc.class.getName()).log(Level.SEVERE, null, ex);
}
}
To illustrate, the array passed in contains a value for distance and price and I am wanting to update the prices in the 'Journey' table based on their distance. For example, if a record in the table has a distance (type double) that is less than a given distance (the value of str[0]), I want to increase the price (also a double) of that record by the value 'str[1]' and do this for all records in the table.
The above code doesn't give any errors however, the records in the database never get updated. I could really use some help with this as I've searched around for a while now to try and find a solution and have not yet succeeded.
I do not know what database you are using but my guess is that this line:
ps = connection.prepareStatement("Update Journey Set price+=? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
should be written like this:
ps = connection.prepareStatement("Update Journey Set price=price+? where distance <?",PreparedStatement.RETURN_GENERATED_KEYS);
And, not related to your question, but the line
System.out.println("1 rows updated.");
may cost you hours of debugging in the future, because the number of rows actually updated can be 0 or more than 1.
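To make that print-out truthful, you can report whatever executeUpdate() actually returns. A sketch with the corrected SQL (the report helper and the return value are illustrative additions, not part of the original code):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PriceUpdate {

    // Pure helper so the message always reflects the real count.
    static String report(int rows) {
        return rows + " rows updated.";
    }

    // str[0] = distance threshold, str[1] = price increase, as in the question.
    public static int increasePrice(Connection connection, String[] str) throws SQLException {
        String sql = "UPDATE Journey SET price = price + ? WHERE distance < ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setDouble(1, Double.parseDouble(str[1]));
            ps.setDouble(2, Double.parseDouble(str[0]));
            int rows = ps.executeUpdate(); // number of rows actually changed
            System.out.println(report(rows));
            return rows;
        }
    }
}
```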

Optimizing MySQL query on large table

I'm using MySQL with JDBC.
I have a large example table which contains 6.3 million rows that I am trying to perform efficient select queries on. See below:
I have created three additional indexes on the table, see below:
Performing a SELECT query like SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3 has an extremely high run time of 256356 ms, a little over four minutes. My EXPLAIN on the same query gives me this:
My code for retrieving the data is below:
Connection con = null;
PreparedStatement pst = null;
Statement stmt = null;
ResultSet rs = null;
String url = "jdbc:mysql://xxx.xxx.xxx.xx:3306/testdb";
String user = "bigd";
String password = "XXXXX";
try {
Class.forName("com.mysql.jdbc.Driver");
con = DriverManager.getConnection(url, user, password);
String query = "SELECT latitude, longitude FROM 3dag WHERE timestamp BETWEEN "+startTime+" AND "+endTime+" AND HourOfDay=4 AND DayOfWeek=3";
stmt = con.prepareStatement("SELECT latitude, longitude FROM 3dag WHERE timestamp>=" + startTime + " AND timestamp<=" + endTime);
stmt = con.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
rs = stmt.executeQuery(query);
System.out.println("Start");
while (rs.next()) {
int tempLong = (int) ((Double.parseDouble(rs.getString(2))) * 100000);
int x = (int) (maxLong * 100000) - tempLong;
int tempLat = (int) ((Double.parseDouble(rs.getString(1))) * 100000);
int y = (int) (maxLat * 100000) - tempLat;
if (!(y > matrix.length) || !(y < 0) || !(x > matrix[0].length) || !(x < 0)) {
matrix[y][x] += 1;
}
}
System.out.println("End");
JSONObject obj = convertToCRS(matrix);
return obj;
}catch (ClassNotFoundException ex){
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.SEVERE, ex.getMessage(), ex);
return null;
}
catch (SQLException ex) {
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.SEVERE, ex.getMessage(), ex);
return null;
} finally {
try {
if (rs != null) {
rs.close();
}
if (pst != null) {
pst.close();
}
if (con != null) {
con.close();
}
} catch (SQLException ex) {
Logger lgr = Logger.getLogger(Database.class.getName());
lgr.log(Level.WARNING, ex.getMessage(), ex);
return null;
}
}
Removing every line in the while(rs.next()) loop gives me the same horrible run-time.
My question is: what can I do to optimize this type of query? I am curious about .setFetchSize() and what the optimal value should be here. The documentation shows that Integer.MIN_VALUE results in fetching row-by-row; is this correct?
Any help is appreciated.
EDIT
After creating a new index on timestamp, DayOfWeek and HourOfDay my query runs 1 minute faster and explain gives me this:
Some ideas up front:
Did you in fact check the SQL Execution time (from .executeQuery() till first row?) or is that execution + iteration over 6.3 million rows?
You prepare a PreparedStatement but don't use it?!
Use a PreparedStatement and pass timestamp, dayOfWeek, hourOfDay as parameters
Create one index that can satisfy your WHERE condition. Order the keys so that the field that eliminates the most rows comes first.
The index might look like:
CREATE INDEX stackoverflow on 3dag(hourOfDay, dayOfWeek, Timestamp);
Perform your SQL inside MySQL - what time do you get there?
Try without stmt.setFetchSize(Integer.MIN_VALUE); this might create many unneeded network roundtrips.
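The PreparedStatement suggestion above might look like the following sketch (the SQL mirrors the question's query; the fetch method and its signature are my own invention):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class TrackQuery {

    // Same query as in the question, with the values replaced by placeholders.
    static final String SQL =
            "SELECT latitude, longitude FROM 3dag "
          + "WHERE timestamp BETWEEN ? AND ? AND HourOfDay = ? AND DayOfWeek = ?";

    static List<double[]> fetch(Connection con, long startTime, long endTime,
                                int hourOfDay, int dayOfWeek) throws SQLException {
        List<double[]> points = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(SQL)) {
            ps.setLong(1, startTime);
            ps.setLong(2, endTime);
            ps.setInt(3, hourOfDay);
            ps.setInt(4, dayOfWeek);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    points.add(new double[] { rs.getDouble("latitude"),
                                              rs.getDouble("longitude") });
                }
            }
        }
        return points;
    }
}
```

Besides letting the server cache the query plan, this avoids rebuilding the SQL string per call and keeps the values out of the statement text.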
According to your question, the cardinality of (that is, the number of distinct values in) your Timestamp column is about 1/30th of the cardinality of your Uid column. That is, you have lots and lots of identical timestamps. That doesn't bode well for the efficiency of your query.
That being said, you might try to use the following compound covering index to speed things up.
CREATE INDEX 3dag_q ON 3dag (`Timestamp`, HourOfDay, DayOfWeek, Latitude, Longitude)
Why will this help? Because your whole query can be satisfied from the index with a so-called tight index scan. The MySQL query engine will random-access the index to the entry with the smallest Timestamp value matching your query. It will then read the index in order and pull out the latitude and longitude from the rows that match.
You could try doing some of the summarizing on the MySQL server.
SELECT COUNT(*) number_of_duplicates,
ROUND(Latitude,4) Latitude, ROUND(Longitude,4) Longitude
FROM 3dag
WHERE timestamp BETWEEN "+startTime+"
AND "+endTime+"
AND HourOfDay=4
AND DayOfWeek=3
GROUP BY ROUND(Latitude,4), ROUND(Longitude,4)
This may return a smaller result set. Edit: this quantizes (rounds off) your lat/long values and then counts the items made identical by the rounding. The more coarsely you round them off (that is, the smaller the second number in the ROUND(val,N) calls), the more duplicate values you will encounter and the fewer distinct rows your query will generate. Fewer rows save time.
Finally, if these lat/long values are GPS derived and recorded in degrees, it makes no sense to try to deal with more than about four or five decimal places. Commercial GPS precision is limited to that.
More suggestions
Make your latitude and longitude columns into FLOAT values in your table if they have GPS precision. If they have more precision than GPS use DOUBLE. Storing and transferring numbers in varchar(30) columns is quite inefficient.
Similarly, make your HourOfDay and DayOfWeek columns into SMALLINT or even TINYINT data types in your table. 64 bit integers for values between 0 and 31 is wasteful. With hundreds of rows, it doesn't matter. With millions it does.
Finally, if your queries always look like this
SELECT Latitude, Longitude
FROM 3dag
WHERE timestamp BETWEEN SOME_VALUE
AND ANOTHER_VALUE
AND HourOfDay = SOME_CONSTANT_DAY
AND DayOfWeek = SOME_CONSTANT_HOUR
this compound covering index should be ideal to accelerate your query.
CREATE INDEX 3dag_hdtll ON 3dag (HourOfDay, DayOfWeek, `timestamp`, Latitude, Longitude)
I am extrapolating from my tracking app. This is what I do for efficiency:
Firstly, a possible solution depends on whether or not you can predict/control the time intervals. Store snapshots every X minutes or once a day, for example. Let us say you want to display all events YESTERDAY. You can save a snapshot that has already filtered your file. This would speed things up enormously, but is not a viable solution for custom time intervals and real live coverage.
My application is LIVE, but usually works pretty well in T+5 minutes (5 minute maximum lag/delay). Only when the user actually chooses live position viewing will the application open a full query on the live db. Thus, depends on how your app works.
Second factor: how you store your timestamp is very important. Avoid VARCHAR, for example. Converting from UNIXTIME at query time will also add unnecessary lag. Since you are developing what appears to be a geotracking application, your timestamp would be in Unix time, an integer. Some devices work with milliseconds; I would recommend not using them: 1449878400 instead of 1449878400000 (12/12/2015 0 GMT).
I save all my geopoint datetimes in unixtime seconds and use mysql timestamps only for timestamping the moment the point was received by server (which is irrelevant to this query you propose).
You might shave some time off by accessing an indexed view instead of running the full query. Whether that time is significant on a large query is subject to testing.
Finally, you could shave an itsy bitsy more by not using BETWEEN and using something similar to what it translates into (pseudocode below):
WHERE (timecode > start_time AND timecode < end_time)
Note that I changed >= and <= to > and <, because chances are your timestamp will almost never fall on the precise second, and even if it does, you will rarely be affected by whether one geopoint/time event is displayed or not.

split the request for every 100 characters

I am calling a stored procedure where one field of the request has more than 100 characters, and the call fails because the field size is 100. The maximum size for that field is 100, so when we pass 250 characters we have to split the call into one call per 100 characters. How do I split the request for every 100 characters of that field?
FYI: nothing is being updated in the DB; we are only reading values from it.
You've given very little to go on, but here's my best guess at a solution:
String longStr; // your long string
Statement statement = connection.createStatement();
for (String max100 : longStr.split("(?<=\\G.{100})")) {
    statement.execute("call someProc('" + max100 + "')");
}
This code is very simplistic and is for illustrative use only. In reality you'd use a prepared statement with placeholders.
That said, the splitting code, which is the core of this question, should be helpful.
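To see what that split actually produces, here is a self-contained sketch of just the chunking, using the \G-anchored lookbehind idiom (the Chunker class is mine):

```java
public class Chunker {

    // Split a string into chunks of at most 100 characters. \G anchors each
    // match at the end of the previous one, so the lookbehind fires exactly
    // every 100 characters. Note that . does not match newlines unless the
    // (?s) flag is added.
    static String[] chunks(String s) {
        return s.split("(?<=\\G.{100})");
    }
}
```

A 250-character input yields three chunks of 100, 100 and 50 characters; an input of exactly 100 characters yields a single chunk, since Java's split drops the trailing empty string.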
Try this:
// longStr is your long string
String substring = null;
PreparedStatement prepped = connection.prepareStatement("call someProc(?)", ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_UPDATABLE);
int maxLoop = longStr.length() / 100;
for (int i = 0; i < maxLoop; i++) {
substring = longStr.substring(100 * i, 100 * (i + 1));
prepped.setString(1, substring);
ResultSet result = prepped.executeQuery();
}
if (longStr.length() % 100 != 0) {
substring = longStr.substring(maxLoop * 100);
prepped.setString(1, substring);
ResultSet result = prepped.executeQuery();
}

Inserting each value from int array into database - Java

What I have is a multi-select JList box from which the user selects several features. I grab their IDs and store them in an int[] array.
What I am trying to do is insert them into my database as below. But this causes a
java.sql.SQLException: ORA-01722: invalid number
exception. The line in question is the point at which the statement is executed. I've checked that the array isn't null and contains the correct values. I am unsure what would be causing this error.
for (int i = 0; i < features.length; i++) {
try {
String strQuery = "INSERT INTO home_feature(home_id, feature_id) VALUES (?, ?)";
PreparedStatement stmt = conn.prepareStatement(strQuery);//prepare the SQL Query
stmt.setString(1, homeID);//insert homeid
stmt.setInt(2, features[i]);//insert featureid.
stmt.executeQuery();//execute query
dataAdded = true;//data successfully inserted
} catch (Exception e) {
e.printStackTrace();
dataAdded = false;//there was a problem, data not inserted
}//end try
}
Am I inserting the list of values correctly? Or should I be approaching this from a different angle?
Looks like you are passing an invalid number in the query. Check the values of homeID and features[i].
ORA-01722 Cause:
The attempted conversion of a character string to a number failed
because the character string was not a valid numeric literal. Only
numeric fields or character fields containing numeric data may be used
in arithmetic functions or expressions. Only numeric fields may be
added to or subtracted from dates.
Action:
Check the character strings in the function or expression. Check that
they contain only numbers, a sign, a decimal point, and the character
"E" or "e" and retry the operation.
One flaw I see is:
stmt.executeQuery(); // execute query
which should be
stmt.executeUpdate(); // execute update
When executing DML (Data Manipulation Language) statements such as INSERT, you should use
PreparedStatement#executeUpdate()
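Putting both fixes together, and assuming home_id is in fact a numeric column, here is a hypothetical sketch that binds it with setInt and batches the inserts (the class and method names are mine):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class FeatureInsert {

    static final String SQL = "INSERT INTO home_feature(home_id, feature_id) VALUES (?, ?)";

    // home_id is assumed numeric in Oracle, so bind it with setInt; binding a
    // String into a NUMBER column is a classic cause of ORA-01722.
    static void insertFeatures(Connection conn, int homeId, int[] features) throws SQLException {
        try (PreparedStatement stmt = conn.prepareStatement(SQL)) {
            for (int featureId : features) {
                stmt.setInt(1, homeId);
                stmt.setInt(2, featureId);
                stmt.addBatch(); // queue the row; one statement for the whole array
            }
            stmt.executeBatch(); // executeUpdate/executeBatch, never executeQuery, for INSERT
        }
    }
}
```

Batching also means the statement is prepared once for the whole array instead of once per element, as in the original loop.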
