MySQL loop through every row (big table) - java

I have a table with ID and name. I want to go through every row of this table.
The ID is a primary key with auto_increment.
I don't think I can use a single query to get all rows, because the table is huge.
I am doing something with every result, and I want to be able to stop this task and continue with it later.
I thought I could do something like this:
for (int i = 0; i < 90238529; i++) {
    System.out.println("Current ID: " + i);
    query = "SELECT name FROM table_name WHERE id = " + i;
    ...
}
But that does not work because auto_increment has skipped some numbers.
As mentioned, I need a way to stop this task so that I can later continue where I left off. Like with the example code above, I know the ID of the current entry, and if I want to start again I just set int i = X.

Use a single query to fetch the records:
query = "SELECT name FROM table_name WHERE id > ? ORDER BY id";
Then iterate over the ResultSet and read as many records as you wish (you don't have to read all the rows returned by the ResultSet).
The next time you run the query, pass the last ID you got in the previous execution.
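A minimal sketch of that pattern, assuming a java.sql.Connection and a hypothetical batch size (the LIMIT clause is one way to cap how much you read per execution):
import java.sql.*;

/** Reads rows in id order starting after lastId; returns the new lastId. */
static long processBatch(Connection conn, long lastId, int batchSize) throws SQLException {
    String sql = "SELECT id, name FROM table_name WHERE id > ? ORDER BY id LIMIT ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, lastId);
        ps.setInt(2, batchSize);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                lastId = rs.getLong("id");
                String name = rs.getString("name");
                // ... do something with the row ...
            }
        }
    }
    return lastId; // persist this so the job can be stopped and resumed
}
Because the query keys on id rather than an offset, gaps left by auto_increment don't matter, and resuming only requires the last id you persisted.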

You mention this is a big table. It's important to note then that the MySQL Connector/J API Implementation Notes say
ResultSet
By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate, and due to the design of the MySQL network protocol is easier to implement. If you are working with ResultSets that have a large number of rows or large values, and cannot allocate heap space in your JVM for the memory required, you can tell the driver to stream the results back one row at a time.
To enable this functionality, create a Statement instance in the following manner:
stmt = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
java.sql.ResultSet.CONCUR_READ_ONLY);
stmt.setFetchSize(Integer.MIN_VALUE);
So, I think you need to do that, and I would use a try-with-resources statement. Next, I suggest you let the database help you iterate over the rows:
String query = "SELECT id, name FROM table_name ORDER BY id";
try (PreparedStatement ps = conn.prepareStatement(query,
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    ps.setFetchSize(Integer.MIN_VALUE); // per the notes above, tells Connector/J to stream rows
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            int id = rs.getInt("id");
            String name = rs.getString("name");
            System.out.printf("id=%d, name=%s%n", id, name);
        }
    }
} catch (SQLException e) {
    e.printStackTrace();
}

I can't use a single query to get all rows because the table is huge and I am doing something with every result. Also, I want to be able to stop this task and continue with it later.
Neither of these reasons eliminates using a single query. They only impact performance (keeping one connection alive for a long time vs. constantly opening and closing connections, which can be mitigated with a connection pool).
As mentioned, I need an option to stop this task so that I could start again where I left off. Like with the example code above, I know the ID of the current entry, and if I want to start again I just set int i = X
If you think about it, this wouldn't work either, as you said yourself:
But that does not work because the auto_increment skipped some numbers.
More importantly, rows could have been inserted or deleted since the last time you queried the DB.
First of all, this sounds like a classic XY problem (you are describing a problem with your solution, rather than the actual problem). Secondly, you seem to be using an RDBMS for something (a queue) that it was never really designed for.
If you really want to do this rather than use a better-suited database, there are a number of approaches. Your first problem is that you want to resume from a certain point/state, but that state is not stored in the database, so this will not work in a scenario with multiple DB connections. The first way to fix this is to introduce a "processed" field in your table (which you can clear with an UPDATE statement if you want to resume from an arbitrary point). Depending on which problem you're actually trying to solve, this can be a simple true/false field, a unique identifier of the currently processing thread, or a relational table.
Then you can go back to using SQL to get the data you want, as in the sketch below.
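For example, with a simple true/false flag, a sketch might look like this (the processed column name and the batch size are assumptions):
import java.sql.*;

// Claim a batch of unprocessed rows, work on them, then mark them done.
// Assumes a hypothetical BOOLEAN column `processed` defaulting to FALSE.
static void processUnprocessed(Connection conn, int batchSize) throws SQLException {
    String select = "SELECT id, name FROM table_name WHERE processed = FALSE ORDER BY id LIMIT ?";
    String update = "UPDATE table_name SET processed = TRUE WHERE id = ?";
    try (PreparedStatement ps = conn.prepareStatement(select);
         PreparedStatement done = conn.prepareStatement(update)) {
        ps.setInt(1, batchSize);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // ... do something with rs.getString("name") ...
                done.setLong(1, rs.getLong("id"));
                done.executeUpdate(); // resuming later just re-runs this method
            }
        }
    }
}
Since the resume state now lives in the database, any connection (or thread) can pick up where the last one stopped.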

Related

Postgres: How to get SELECT to detect new insertions within a specific table

I'm trying to execute an SQL query (SELECT operation) using the following Java code:
ResultSet resultSet = statement.executeQuery("SELECT * FROM tasks");
while (resultSet.next()) {
    while (true) {
        // loop infinitely until a worker executes the task
    }
}
But that is inefficient when a new task gets added, as the SELECT won't detect the new change.
So, what is the Postgres SQL syntax that fetches all entries while detecting new insertions within a specific table?
What you want sounds like change data capture (CDC), where you only want what was changed since you last queried the table.
To do that, you need a way to:
1. mark the rows that were changed
2. mark the rows that already existed
Typical ways to do that are to:
1. keep a copy of the table so you can compare it against the table being updated/inserted
2. use audit columns within the table, such as date_inserted, last_modified, etc., and pull the rows with dates after the last time you looked at the table (sketched below)
3. implement the table being updated as a slowly changing dimension
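Option 2 might look roughly like this in JDBC (the last_modified and payload columns are assumptions about your tasks table):
import java.sql.*;

// Poll for rows changed since the last check, using a hypothetical
// last_modified audit column. Returns the new high-water mark.
static Timestamp pollChanges(Connection conn, Timestamp since) throws SQLException {
    String sql = "SELECT id, payload, last_modified FROM tasks "
               + "WHERE last_modified > ? ORDER BY last_modified";
    Timestamp newest = since;
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setTimestamp(1, since);
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                // ... hand rs.getString("payload") to a worker ...
                newest = rs.getTimestamp("last_modified");
            }
        }
    }
    return newest; // pass this back in on the next poll
}
Polling this way avoids the busy-wait loop in the question: each call only sees rows inserted or modified since the previous call.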

Batching "UPDATE vs. INSERT" Queries Against Oracle Database

Let's assume that I have an Oracle database with a table called RUN_LOG I am using to record when jobs have been executed.
The table has a primary key JOB_NAME which uniquely identifies the job that has been executed, and a column called LAST_RUN_TIMESTAMP which reflects when the job was last executed.
When a job starts, I would like to update the existing row for that job (if it exists), or otherwise insert a new row into the table.
Given Oracle does not support a REPLACE INTO-style query, it is necessary to try an UPDATE, and if zero rows are affected follow this up with an INSERT.
This is typically achieved with jdbc using something like the following:
PreparedStatement updateStatement = connection.prepareStatement("UPDATE ...");
PreparedStatement insertStatement = connection.prepareStatement("INSERT ...");

updateStatement.setString(1, "JobName");
updateStatement.setTimestamp(2, timestamp);

// If there are no rows to update, it must be a new job...
if (updateStatement.executeUpdate() == 0) {
    // Follow up with an INSERT
    insertStatement.setString(1, "JobName");
    insertStatement.setTimestamp(2, timestamp);
    insertStatement.executeUpdate();
}
This is a fairly well-trodden path, and I am very comfortable with this approach.
However, let's assume my use case requires me to insert a very large number of these records. Performing individual SQL queries against the database would be far too "chatty". Instead, I would like to start batching these INSERT/UPDATE queries.
Given that the execution of the UPDATE queries will be deferred until the batch is committed, I cannot observe how many rows were affected until later.
What is the best mechanism for achieving this REPLACE INTO-like result?
I'd rather avoid using a stored procedure, as I'd prefer to keep my persistence logic in this one place (class), rather than distributing it between the Java code and the database.
What about the SQL MERGE statement? You can insert a large number of records into a temporary table, then merge the temp table with RUN_LOG. For example:
merge into RUN_LOG tgt
using (
    select job_name, last_run_timestamp
    from my_new_temp_table
) src
on (src.job_name = tgt.job_name)
when matched then update set
    tgt.last_run_timestamp = src.last_run_timestamp
when not matched then insert (job_name, last_run_timestamp)
    values (src.job_name, src.last_run_timestamp);
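Tying it back to JDBC, the batch load plus the MERGE might look like this sketch (the temp table name follows the answer above; error handling is omitted for brevity):
import java.sql.*;
import java.util.Map;

// Batch-load the staging table, then run the MERGE once.
// Assumes my_new_temp_table(job_name, last_run_timestamp) already exists.
static void upsertRuns(Connection conn, Map<String, Timestamp> runs) throws SQLException {
    conn.setAutoCommit(false);
    try (PreparedStatement ins = conn.prepareStatement(
            "INSERT INTO my_new_temp_table (job_name, last_run_timestamp) VALUES (?, ?)")) {
        for (Map.Entry<String, Timestamp> e : runs.entrySet()) {
            ins.setString(1, e.getKey());
            ins.setTimestamp(2, e.getValue());
            ins.addBatch();
        }
        ins.executeBatch(); // one round trip instead of one per row
    }
    try (Statement merge = conn.createStatement()) {
        merge.executeUpdate(
            "MERGE INTO run_log tgt USING (SELECT job_name, last_run_timestamp "
          + "FROM my_new_temp_table) src ON (src.job_name = tgt.job_name) "
          + "WHEN MATCHED THEN UPDATE SET tgt.last_run_timestamp = src.last_run_timestamp "
          + "WHEN NOT MATCHED THEN INSERT (job_name, last_run_timestamp) "
          + "VALUES (src.job_name, src.last_run_timestamp)");
    }
    conn.commit();
}
This keeps all the persistence logic in one class, as the question requested, while still giving the "UPDATE or INSERT" semantics in a single set-based statement.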

How to speed up fetching a particular value from a database in Java?

I want to fetch a particular value from a database in Java. I used the following query in a prepared statement:
Select Pname from table where pid=458;
The table contains around 50,000 rows and the fetch takes too long; please help me get the data faster.
I created an index and also bound the variable, but that reduced the execution time by only a few seconds. I need something more efficient. Is there any way to retrieve the data faster?
Index your database table on pid; it will make the search faster.
Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
SQL Server
CREATE TABLE MyCustomers (CustID int, CompanyName nvarchar(50));
CREATE UNIQUE INDEX idxCustId ON MyCustomers (CustId);
References
https://msdn.microsoft.com/en-us/library/ms188783.aspx
https://technet.microsoft.com/en-us/library/ms345331(v=sql.110).aspx
Create an index on field pid in your table.
Use bind variables in queries.
Use a PreparedStatement instead of a Statement in Java; that will use bind variables.
pstatement = conn.prepareStatement("Select Pname from table where pid = ?");
This ensures that the SQL is precompiled and hence runs faster.
However, you are likely to gain more performance improvement from the index than from bind variables.
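A fuller sketch of that lookup (assuming a java.sql.Connection; the table and column names come from the question):
import java.sql.*;

// Look up one name by key using a bind variable, so the statement is
// parsed once and the index on pid can be used for each lookup.
static String findName(Connection conn, int pid) throws SQLException {
    try (PreparedStatement ps = conn.prepareStatement(
            "SELECT Pname FROM table WHERE pid = ?")) {
        ps.setInt(1, pid);
        try (ResultSet rs = ps.executeQuery()) {
            return rs.next() ? rs.getString("Pname") : null;
        }
    }
}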

What is the fastest way to retrieve sequential data from database?

I have a lot of rows in a database that must be processed, but I can't load all the data into memory due to memory limitations.
At the moment, I am using LIMIT and OFFSET to retrieve the data in specified intervals.
I want to know if this is the fastest way, or whether there is another method to get all the data from a table in the database. No filter will be applied; all rows will be processed.
SELECT * FROM table ORDER BY column
There's no reason to suck the entire table into RAM. Simply open a cursor and start reading. You can play games with fetch sizes and whatnot, but the DB will happily keep its place while you process your rows.
Addenda:
Ok, if you're using Java then I have a good idea what your problem is.
First, just by using Java, you're using a cursor; that's basically what a ResultSet is in Java. Some ResultSets are more flexible than others, but 99% of them are simple, forward-only ResultSets that you call next() on to get each row.
Now as to your problem.
The problem is specifically with the Postgres JDBC driver. I don't know why it does this, perhaps it's the spec, perhaps it's something else, but regardless, Postgres has the curious characteristic that if your Connection has autoCommit set to true, it sucks in the entire result set on either the execute method or the first next method. Where exactly isn't really important; what matters is that if you have a gazillion rows, you get a nice OOM exception. Not helpful.
This can easily be exactly what you're seeing, and I appreciate how frustrating and confusing it can be.
Most Connections default to autoCommit = true. Instead, simply set autoCommit to false.
Connection con = ...get Connection...
con.setAutoCommit(false);
PreparedStatement ps = con.prepareStatement("SELECT * FROM table ORDER BY column");
ResultSet rs = ps.executeQuery();
while (rs.next()) {
    String col1 = rs.getString(1);
    // ...and away you go here...
}
rs.close();
ps.close();
con.close();
Note the distinct lack of exception handling, left as an exercise for the reader.
If you want more control over how many rows are fetched at a time into memory, you can use:
ps.setFetchSize(numberOfRowsToFetch);
Playing around with that might improve your performance.
Make sure you have an appropriate index on the column you use in the ORDER BY if you care about sequencing at all.
Since it's clear you're using Java based on your comments:
If you are using JDBC you will want to use:
http://download.oracle.com/javase/1.5.0/docs/api/java/sql/ResultSet.html
If you are using Hibernate it gets trickier:
http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html
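For the Hibernate route, the batch-processing chapter linked above is built around ScrollableResults; here is a rough sketch (the entity name MyRow and its HQL query are hypothetical):
import org.hibernate.*;

// Stream rows with a scrollable cursor instead of loading the whole list.
// MyRow is a hypothetical mapped entity; adjust the HQL to your model.
static void processAll(SessionFactory sessionFactory) {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    ScrollableResults rows = session.createQuery("from MyRow order by id")
            .setCacheMode(CacheMode.IGNORE)
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (rows.next()) {
        Object row = rows.get(0);
        // ... process row ...
        if (++count % 100 == 0) {
            session.clear(); // keep the first-level cache from growing unbounded
        }
    }
    tx.commit();
    session.close();
}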

Updating a database while using a preparedStatement select

I'm selecting a subset of data from an MS SQL database, using a PreparedStatement.
While iterating through the resultset, I also want to update the rows. At the moment I use something like this:
prepStatement = con.prepareStatement(
    selectQuery,
    ResultSet.TYPE_FORWARD_ONLY,
    ResultSet.CONCUR_UPDATABLE);
rs = prepStatement.executeQuery();
while (rs.next()) {
    rs.updateInt("number", 20);
    rs.updateRow();
}
The database is updated with the correct values, but I get the following exception:
Optimistic concurrency check failed. The row was modified outside of this cursor.
I've Googled it, but haven't been able to find any help on the issue.
How do I prevent this exception? Or since the program does do what I want it to do, can I just ignore it?
The record was modified between the moment it was retrieved from the database (through your cursor) and the moment you attempted to save it back. If the number column can be safely updated independently of the rest of the record, or independently of some other process having already set number to some other value, you could be tempted to issue a direct update instead:
con.createStatement().executeUpdate("UPDATE table SET number = 20 WHERE id = " + rs.getInt("id"));
However, the race condition persists, and your change may in turn be overwritten by another process.
The best strategy is to ignore the exception (the record was not updated), possibly pushing the failed record onto an in-memory queue, then do a second pass over the failed records, re-evaluating the conditions in the query and updating as appropriate (add number <> 20 as one of the conditions in the query if it is not already there). Repeat until no more records fail; eventually all records will be updated.
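For example, a simplified second pass might look like this (the id column name is an assumption; failedIds is whatever you queued up during the first pass):
import java.sql.*;
import java.util.List;

// Second pass over rows whose cursor update failed: re-apply the change by
// key, with "number <> 20" as an extra condition so rows that some other
// process already fixed are skipped. The id column name is an assumption.
static void retryFailed(Connection conn, List<Integer> failedIds) throws SQLException {
    String sql = "UPDATE table SET number = 20 WHERE id = ? AND number <> 20";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (int id : failedIds) {
            ps.setInt(1, id);
            ps.executeUpdate(); // 0 rows affected means the row no longer needs the update
        }
    }
}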
Assuming you know exactly which rows you will update, I would do
SET your AUTOCOMMIT to OFF
SET ISOLATION Level to SERIALIZABLE
SELECT col1, col2 FROM table WHERE somecondition FOR UPDATE
UPDATE the rows
COMMIT
This is achieved via pessimistic locking (and assuming row locking is supported in your DB, it should work); a rough sketch follows.
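A minimal JDBC sketch of those steps, using the placeholder table/column names from this answer; note that SELECT ... FOR UPDATE in this form is not supported by SQL Server (it uses lock hints such as WITH (UPDLOCK) instead), so treat this as the generic recipe:
import java.sql.*;

// Pessimistic locking: lock the matching rows up front, update them, commit.
// Table and column names are placeholders from the answer above.
static void updateLocked(Connection conn) throws SQLException {
    conn.setAutoCommit(false);
    conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
    try (PreparedStatement select = conn.prepareStatement(
             "SELECT id FROM table WHERE somecondition FOR UPDATE");
         PreparedStatement update = conn.prepareStatement(
             "UPDATE table SET number = ? WHERE id = ?");
         ResultSet rs = select.executeQuery()) {
        while (rs.next()) {
            update.setInt(1, 20);
            update.setInt(2, rs.getInt("id"));
            update.executeUpdate(); // rows stay locked until commit
        }
        conn.commit();
    } catch (SQLException e) {
        conn.rollback();
        throw e;
    }
}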
