is Select Before Update a good approach or vice versa?

is Select Before Update a good approach or vice versa? - java

I am developing an application using normal JDBC connection. The application is developed with Java-Java EE SpringsMVC 3.0 and SQL Server 08 as database. I am required to update a table based on a non primary key column.
Now, before updating the table we had to decide an approach for updating the table, as table may contain huge amount of data. The update Query will be executed in a batch and we are required to design application in a manner wherein it doesn't hog the system resources.
Now, We had to decide between either of the approaches,
1. SELECT DATA BEFORE YOU UPDATE or
2. UPDATE DATA AND THEN SELECT MISSING DATA.
Select data before update is only benificial if chances of failure are maximum, i.e. if a batch 100 Query update is executed, and out of which if only 20 rows are updated successfully, then this approach should be taken
Update data and then check missing data is benificial only when failure records are far less. By this ap[proach one database select call can be avoided, i.e after a batch update, the count of records updated can be taken and the select query should be executed if and only if theres is a count in mismatch w.r.t no of query.
We are totally unaware about the system on Production environment, but we want to counter for all possibilities and want a faster system. I need your inputs as which is a better approach.

Since there is 50:50 chance of successful updates or faster selects, its hard to tell from the current scenario mentioned. You probably would want a fuzzy logic approach, getting constant feedback of how many updates were successful over the period of time, and then decide on the basis of that data to either do an update before select or do a select before update.

Related

Check if query result has changed, in order to not re-query for everything?

I am developing a java application that loads certain things from a database, such as client records and product info. When the user navigates to say the 'products' tab, I query for products in the database and update a table with that information.
I am wondering if there is a way to see if the query results have changed since the last check, in order to avoid querying and loading all info from the database, and instead just load updates. Is there a way to do this, or perhaps just load changes only from a query into my table list? My goal is to make the program run faster when switching between tabs.

I am wondering if there is a way to see if the query results have changed since the last check
Stated differently, you want a way to automatically answer the question “is this the same result?” without retrieving the entire result.
The general approach to this problem would be to come up with some fast-to-query proxy for the entire state of the result set, and query that instead.
Once you have determined a stable fast computation for the entire result set, you can compute that any time the relevant data changes; and only poll that stored proxy to see whether the data has changed.
For example, you could say that “the SHA-256 hash of fields lorem, ipsum, and dolor” is your proxy. You can now:
Implement that computation inside the database as a function, maybe products_hash.
Create a latest_products_hash table, that stores created timestamp and products_hash that was computed at that time.
In your application, retrieve the most recent record from latest_products_hash and keep it for reference.
In the database, have a scheduled job, or a trigger on some event you decide makes sense, that will compute and store the products_hash in latest_products_hash automatically without any action from the application.
To determine whether there have been updates yet, the application will query the latest_products_hash table again and compare its most recent record with the one the application stored for reference.
Only if the latest_products_hash most-recent value is different, then query the products table and get the full result set.
That way, the application is polling a much faster query (the most-recent record in latest_products_hash) frequently, and avoiding the full products query until it knows the result set will be new.

How change many records in an ORM database quickly?

Situation: I need to change many records in database (10 000 records, in example), using ORMLite DAO. All records change only in one table, in one column and changing records, which have specified id.
Question: how update many records in database at once, using ORMLite DAO?
Now I update records, using this code:
imagesDao.update(imageOrmRecord);
But updating records in cycle very slow (100 records\sec).
I think that real update records, using SQL-code, but this is undesirable...

SQL is a set-oriented language. The whole point of an ORM is to abstract this away into objects.
So when you want to update a bunch of objects, you have to go through these objects.
(You have run into the object-relational impedance mismatch; also read The Vietnam of Computer Science.)
ORMLite gives you a backdoor to execute raw SQL:
someDao.executeRaw("UPDATE ...");
But if your only problem is performance, this is likely to be caused by the auto-commit mode, which adds transaction overhead to each single statement. Using callBatchTasks() would fix this.

Question: how update many records in database at once, using ORMLite DAO?
It depends a bit on what updates you are making. You can certainly use the UpdateBuilder which will make wholesale updates to objects.
UpdateBuilder<Account, String> updateBuilder = accountDao.updateBuilder();
// update the password to be "none"
updateBuilder.updateColumnValue("password", "none");
// only update the rows where password is null
updateBuilder.where().isNull(Account.PASSWORD_FIELD_NAME);
updateBuilder.update();
Or something like:
// update hasDog boolean to true if dogC > 0
updateBuilder.updateColumnExpression("hasDog", "dogC > 0");
You should be able to accomplish a large percentage of the updates that you would do using raw SQL this way.
But if you need to make per-entity updates then you will need to do dao.update(...) for each one. What I'd do then is to do it in a transaction to make the updates go faster. See this answer.

Accessing database multiple times

I am working on solution of below mentioned but could not find any best practice/tool for this.
For a batch of requests(say 5000 unique ids and records) received in webservice, it has to fetch rows for those unique ids in database and keep them in buffer(or cache) and compare those with records received in webservice. If there is a change for a particular data(say column) that will be updated in table for that unique id. And in turn, the child tables of that table also get affected. For ex, if someone changes his laptop model number and country, model number will be updated in a table and country value in another table. Likewise it goes on accessing multiple tables in short time. The maximum records coming in a webservice call might reach 70K in one call in an hour.
I don't have any other option than implementing it in java. Is there any good practice of implementing this, or can it be achieved using any open source java tools. Please suggest. Thanks.

Hibernate is likely to be the first thing you should try. I tend to avoid because it is overkill for most of my applications but it is a standard tool for accessing database which anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use but Hibernate is the most often used.

JDBC is the API to use to access relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware on the limit in the number of values in the in clause (1000 max in Oracle)
use batched statements to make your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.

This sounds not that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
if updated
execute corresponding update SQL statement
commit // every so many records
70K records per hour should be not the slightest problem for a decent RDBMS.

Database Loading batch job Java

Have a batch job written in Java which truncates and then loads certain table in Oracle database every few minutes. There are reports generated on web pages based on the data in the table. Am wondering of a good way of not affecting the report querying part when the data loading process is happeneing so that the users won't end up with some and/or no data.

If you process all your SQL statements inside a single transaction there will be always a valid state seen from outside. Beware that TRUNCATE doe not work in transactions, so you have to use DELETE. While this guarantees to always have reasonable data in your table it needs a bigger rollback segment and will be considerably slower.

you could have 2 tables and a meta table which tracks which table is the main table being used for querying. your batch job will be truncating and loading one of the table and you can switch the main tables once the loading is completed. so the query app will get recent data now and u can load now in the other table

What I would do is set a flag in a DB table to indicate that that the update is in progress and have the reports look for that flag and display an appropriate message and wait for the update to finish. Once the update is complete clear the flag.

speed up operation on mysql

I'm currently writing java project against mysql in a cluster with ten nodes. The program simply pull some information from the database and do some calculation, then push some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How to do multi-threading on different node?

I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).

I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.

Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by a node. So you can write the Java program in such a way like it will first fetch the last row processed by other nodes and then it add an entry in the same table informing other nodes what range of rows it is going to process (you can decide this number). In our case, lets assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it it empty. Then Node 1 inserts a row ('Node1', 1000) informing that it is processing till primary key of A is <=1000 ( Assuming primary key of table A is numeric and it is in ascending order). Node 2 comes and finds 1000 primary keys are processed by some other node. Hence it inserts a row ('Node2', 2000) informing others that it is processing rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one can work on it at a time.

Since you only have one mysql server, make sure you're using the innodb engine to reduce table locking on updates.
Also I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase chances of query cache hits, as well as reduce the over all workload on the backend, offloading some of the querying matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job. As it will allow you to offload batch processing from mysql back to the cluster transparently.
You could set up sharding with a mysql on each machine but the set up time, maintenance and the changes to database access layer might be a lot of work compared to a gearman solution. You might also want to look at the experimental spider engine that could allow you to use multiple mysqls in unison.

Unless your calculation is very complex, most of the time will be spent retrieving data from MySql and sending the results back to MySQl.
As you have a single database no amount of parallelism or clustering on the application side will make much difference.
So your best options would be to do the update in pure SQL if that is at all possible, or, use a stored procedure so that all processing can take place within the MySql server and no data movement is required.
If this is not fast enough then you will need to split your database among several instances of MySql and come up with some schema to partition the data based on some application key.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.