I have a batch job written in Java which truncates and then loads a certain table in an Oracle database every few minutes. Reports are generated on web pages based on the data in that table. I am wondering what a good way is to keep the report queries unaffected while the data load is happening, so that users won't end up with partial data or no data.
If you process all your SQL statements inside a single transaction, a valid state will always be seen from outside. Beware that TRUNCATE does not work in transactions (in Oracle it is DDL and commits implicitly), so you have to use DELETE. While this guarantees that there is always reasonable data in your table, it needs a bigger rollback segment and will be considerably slower.
You could have two tables and a meta table which tracks which of the two is the main table currently used for querying. Your batch job truncates and loads the other table, and you switch the main table once the load is complete. The query app then gets the recent data, and the next load goes into the table that is no longer in use.
What I would do is set a flag in a DB table to indicate that the update is in progress, and have the reports look for that flag, display an appropriate message, and wait for the update to finish. Once the update is complete, clear the flag.
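A minimal sketch of the two-table switch, assuming hypothetical table names report_a and report_b and a hypothetical one-row meta table active_table(name). None of these names come from the question, and the Loader callback stands in for whatever the batch job does to fill a table.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class TableSwitcher {
    // Hypothetical table names, chosen only for illustration.
    static final String TABLE_A = "report_a";
    static final String TABLE_B = "report_b";

    /** Hypothetical callback that fills the given table with fresh data. */
    interface Loader {
        void load(Connection conn, String table) throws SQLException;
    }

    /** Returns the table that is NOT currently serving report queries. */
    static String standbyOf(String active) {
        if (TABLE_A.equals(active)) return TABLE_B;
        if (TABLE_B.equals(active)) return TABLE_A;
        throw new IllegalArgumentException("Unknown table: " + active);
    }

    /** Reads the active table name from the assumed one-row meta table. */
    static String readActive(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name FROM active_table")) {
            if (!rs.next()) throw new SQLException("active_table is empty");
            return rs.getString(1);
        }
    }

    /** Truncates and loads the standby table, then flips the pointer to it. */
    static void reload(Connection conn, Loader loader) throws SQLException {
        String standby = standbyOf(readActive(conn));
        try (Statement st = conn.createStatement()) {
            // The name comes from the constants above, never from user input.
            st.execute("TRUNCATE TABLE " + standby);
        }
        loader.load(conn, standby);
        try (PreparedStatement ps =
                 conn.prepareStatement("UPDATE active_table SET name = ?")) {
            ps.setString(1, standby);
            ps.executeUpdate();
        }
        if (!conn.getAutoCommit()) conn.commit();
    }
}
```

The report code would call readActive (or an equivalent lookup) before each query and select from whichever table it returns, so readers never see the table that is being truncated.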
Related
One function of my Java application is to read and parse an XML file very frequently (about every 5 minutes) and populate a database table. I have created a cron job to do that. Most of the columns' values remain the same, but for certain columns the value may be updated frequently. I was wondering what the most efficient way of doing that is:
1) Delete the table every time and re-create it or
2) Update the table data and specifically the column where a change in the source file has appeared.
The number of rows parsed and persisted every time is about 40000-50000.
I would assume that around 2000-3000 rows need to update on every cron job run.
I am using JPA to persist data to a mysql server and I have gone for the first option so far.
Obviously for both options the job would execute as a single transaction.
Any ideas which one is better and possibly any optimization suggestions?
I would suggest scheduling your jobs using something more sophisticated than cron. For instance, Quartz.
I am parsing an XML file consisting of ~600K lines. Parsing and Inserting the data from the XML to the database is not a problem as I am using SAX to parse and using LOAD DATA INFILE (from a .txt file) to INSERT into the database. The txt file is populated in Java using JDBC. All of this takes a good 5 seconds to populate in the database.
My bottleneck is now executing multiple SELECT queries. Basically, each time I hit a certain XML tag, I call a SELECT query to fetch data from another DB table. Adding these SELECT queries brings my populating time to 2 minutes.
For example:
I am parsing through an XML file consisting of books, articles, theses, etc.
Each book/article has child elements such as isbn, title, author, editor, publisher.
At each author/editor/publisher, I need to query a table in a database.
Let's say I encountered the author tag with value Tolkien.
I need to query a table that already exists in the database called author_table.
The query is [select author_id from author_table where name = 'Tolkien']
This is where the bottleneck is happening.
Now my question is: Is there a way to speed this up?
BTW, the reason why I think 2 minutes is long is because this is a homework assignment and I am not yet finished with populating the database. I would estimate that the whole DB population would take 5 minutes. Thus the reason why I am seeking advice for performance optimization.
There are a few things you can consider:
Use connection pooling so you don't open and close a new connection every time you execute a query. Doing so is expensive.
Cache whatever data you are obtaining via the SELECT query. Is it possible to prefetch all the data beforehand so you don't have to query it on the spot?
If your SELECT is slow, ensure the query is optimized and you have an appropriate index in place to avoid scanning the whole table.
Ensure you use buffered IO in Java.
Can you subdivide the work into multiple threads? If so, create multiple worker threads to run multiple instances of your job in parallel.
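The prefetch suggestion could be sketched like this: load author_table once into a map and answer every lookup from memory. This is only a sketch, assuming author_id is numeric and that the whole table fits in memory. The class name AuthorCache is made up for the example.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

class AuthorCache {
    private final Map<String, Long> idsByName;

    AuthorCache(Map<String, Long> idsByName) {
        this.idsByName = new HashMap<>(idsByName);
    }

    /** Loads the whole author_table with one query instead of one query per XML tag. */
    static AuthorCache load(Connection conn) throws SQLException {
        Map<String, Long> ids = new HashMap<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, author_id FROM author_table")) {
            while (rs.next()) {
                ids.put(rs.getString(1), rs.getLong(2));
            }
        }
        return new AuthorCache(ids);
    }

    /** Returns the cached id, or null if the name is not in author_table. */
    Long idFor(String name) {
        return idsByName.get(name);
    }
}
```

In the SAX handler, the per-tag SELECT would then be replaced by a call such as cache.idFor("Tolkien"). If the table is too large to hold in memory, an index on the name column is the fallback.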
I am developing an application using a plain JDBC connection. The application is developed with Java/Java EE, Spring MVC 3.0, and SQL Server 2008 as the database. I am required to update a table based on a non-primary-key column.
Now, before updating the table we had to decide an approach for updating the table, as table may contain huge amount of data. The update Query will be executed in a batch and we are required to design application in a manner wherein it doesn't hog the system resources.
Now, We had to decide between either of the approaches,
1. SELECT DATA BEFORE YOU UPDATE or
2. UPDATE DATA AND THEN SELECT MISSING DATA.
Selecting data before the update is only beneficial if failures are very likely, e.g. if a batch of 100 update queries is executed and only 20 rows are updated successfully.
Updating first and then checking for missing data is beneficial only when failed records are far fewer. With this approach one database SELECT call can be avoided: after a batch update, the count of updated records can be checked, and the SELECT query needs to be executed if and only if that count does not match the number of queries in the batch.
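The count check in the second approach can be sketched with the int[] that Statement.executeBatch() returns. The helper below is hypothetical. It treats an update count of 0 (no row matched) or Statement.EXECUTE_FAILED as a miss, and assumes entries reported as Statement.SUCCESS_NO_INFO succeeded, since the driver gives no row count for them.

```java
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class BatchResult {
    /**
     * Returns the positions in the batch whose UPDATE changed no row or failed.
     * counts is the array returned by Statement.executeBatch().
     */
    static List<Integer> missed(int[] counts) {
        List<Integer> missed = new ArrayList<>();
        for (int i = 0; i < counts.length; i++) {
            int c = counts[i];
            if (c == Statement.SUCCESS_NO_INFO) {
                continue; // succeeded, but the driver reported no row count
            }
            if (c == Statement.EXECUTE_FAILED || c == 0) {
                missed.add(i);
            }
        }
        return missed;
    }
}
```

Only if missed(...) is non-empty does the follow-up SELECT need to run, and then only for the listed positions. Some drivers throw BatchUpdateException on a failed statement instead of returning normally. In that case the same array is available from its getUpdateCounts() method.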
We are totally unaware about the system on Production environment, but we want to counter for all possibilities and want a faster system. I need your inputs as which is a better approach.
Since there is a 50:50 chance of successful updates or faster selects, it's hard to tell from the scenario as described. You would probably want an adaptive approach: collect constant feedback on how many updates were successful over a period of time, and then decide on the basis of that data whether to update before selecting or select before updating.
I have an application that handles more than 10000000 records.
The MainTable has more than 10000000 rows.
I am trying to insert data into a SubTable from the MainTable as
INSERT INTO SubTable(Value1,Value2)
SELECT Value1,Value2 FROM MainTable
GROUP BY Value1_ID;
After performing certain processing in SubTable, I update the new values back into the MainTable as
UPDATE MainTable inf, SubTable sub
SET inf.Value1=sub.Value1, inf.Value2=sub.Value2
WHERE inf.Value1_ID=sub.Value1_ID;
While this query is running, the entire server gets very slow and all other transactions are blocked. I am using a JDBC DriverManager connection here. How can I avoid this? How do I solve this problem?
If it's something that you have to do only once in a while, then instead of updating the whole table in a single update, you can set up a small script that updates one batch of rows every few seconds or minutes. The other processes will have their queries executed freely between two updates.
For example, by updating a batch of 100,000 rows every minute, if your tables have the right indexes, that would take 1~2 hours, but with far less impact on performance.
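A rough sketch of such a batched update in Java, assuming Value1_ID is a numeric, indexed column so the table can be walked in id ranges. The class name, the pause length, and the MySQL-style JOIN form of the UPDATE are choices made for this example.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class ChunkedUpdate {
    /** Splits [minId, maxId] into inclusive ranges of at most batchSize ids. */
    static List<long[]> ranges(long minId, long maxId, long batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<long[]> out = new ArrayList<>();
        for (long lo = minId; lo <= maxId; lo += batchSize) {
            long hi = Math.min(lo + batchSize - 1, maxId);
            out.add(new long[] {lo, hi});
        }
        return out;
    }

    /** Updates one id range at a time, committing and pausing between ranges. */
    static void run(Connection conn, long minId, long maxId,
                    long batchSize, long pauseMillis)
            throws SQLException, InterruptedException {
        String sql = "UPDATE MainTable m JOIN SubTable s ON m.Value1_ID = s.Value1_ID "
                   + "SET m.Value1 = s.Value1, m.Value2 = s.Value2 "
                   + "WHERE m.Value1_ID BETWEEN ? AND ?";
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (long[] r : ranges(minId, maxId, batchSize)) {
                ps.setLong(1, r[0]);
                ps.setLong(2, r[1]);
                ps.executeUpdate();
                conn.commit();             // release locks before pausing
                Thread.sleep(pauseMillis); // let other transactions run
            }
        }
    }
}
```

Committing after every range keeps each transaction short, which is what lets other queries through between two batches.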
The other solution would be do the update when the activity on the server is at its lowest (maybe during the week-ends?), that way you won't impact the other processes as much.
I am stuck at a point where I need to detect database changes from Java code. The requirement is that any record updated, added, or deleted in any table of the DB should be recognized by the Java program. How could this be implemented? With JMS, or a Java thread?
Update: Thanks, guys, for your support. I am actually using Oracle as the DB and WebLogic 10.3 Workshop. I want to get the updates from a table on which I have only read permission, so what do you all suggest? I can't update the DB. The only thing I can do is read the DB, and if there is any change in the table I have to get a notification that certain rows have been added, deleted, or updated.
Unless the database can send a message to Java, you'll have to have a thread that polls.
A better, more efficient model would be one that fires events on changes. A database that has Java running inside (e.g., Oracle) could do it.
We do it by polling the DB using an EJB timer task. In essence, we have a status field which we update when we have processed that row.
So the EJB timer thread calls a procedure that grabs rows which are flagged "un-treated".
Dirty, but also very simple and robust. Especially, after a crash or something, it can still pick up from where it crashed without too much complexity.
The disadvantage is the wasted load on the DB, and also response time will be limited (probably requires seconds).
We have accomplished this in our firm by adding triggers to database tables that call an executable to issue a Tib Rendezvous message, which is received by all interested Java applications.
However, the ideal way to do this IMHO is to be in complete control of all database writes at the application level, and to notify any interested parties at this point (via multi-cast, Tib, etc). In reality this isn't always possible where you have a number of disparate systems.
You're indeed dependent on whether the database in question supports it. You'll also need to take the overhead into account. A lot of inserts/updates also means a lot of notifications, and your Java code has to handle them consistently, or else they will pile up.
If the data model allows it, just add an extra column which holds a timestamp that gets updated on every insert/update. Most major DBs support auto-updating such a column on every insert/update. I don't know which DB server you're using, so I'll give only a MySQL-targeted example:
CREATE TABLE mytable (
id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
somevalue VARCHAR(255) NOT NULL,
lastupdate TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX (lastupdate)
)
This way you don't need to worry about inserting/updating the lastupdate yourself. You can just do an INSERT INTO mytable (somevalue) VALUES (?) or UPDATE mytable SET somevalue = ? WHERE id = ? and the DB will do the magic.
After ensuring that the DB server's time and Java application's time are the same, you can just fire a background thread (using either Timer with TimerTask, or ScheduledExecutorService with Runnable or Callable) which does roughly this:
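// lastTimeChecked is assumed to be a java.sql.Timestamp field.
// setTimestamp keeps the time of day; setDate (java.sql.Date) would drop it.
Timestamp now = new Timestamp(System.currentTimeMillis());
// Half-open range, so a row stamped exactly at "now" is not reported twice.
statement = connection.prepareStatement("SELECT id FROM mytable WHERE lastupdate > ? AND lastupdate <= ?");
statement.setTimestamp(1, this.lastTimeChecked);
statement.setTimestamp(2, now);
resultSet = statement.executeQuery();
while (resultSet.next()) {
    // Handle accordingly.
}
this.lastTimeChecked = now;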
Update: as per the question update, it turns out that you have no control over the DB. Well, then you don't have many good/efficient options. Either just refresh the entire list in Java memory with the entire data from the DB without checking/comparing for changes (probably the fastest way), or dynamically generate a SQL query based on the current data which excludes the current data from the results.
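If the application still needs to know which rows changed, one read-only option is to keep the previous in-memory snapshot and compare it with the freshly loaded one. The sketch below is an illustration under assumptions, not part of the answer above. It assumes every row has a numeric primary key and that the row's contents have been reduced to a single String. The class name SnapshotDiff is made up.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

class SnapshotDiff {
    final Set<Long> added = new HashSet<>();
    final Set<Long> deleted = new HashSet<>();
    final Set<Long> updated = new HashSet<>();

    /** Compares two snapshots keyed by primary key; values are the row contents. */
    static SnapshotDiff between(Map<Long, String> before, Map<Long, String> after) {
        SnapshotDiff d = new SnapshotDiff();
        for (Map.Entry<Long, String> e : after.entrySet()) {
            if (!before.containsKey(e.getKey())) {
                d.added.add(e.getKey());       // key is new in this snapshot
            } else if (!Objects.equals(before.get(e.getKey()), e.getValue())) {
                d.updated.add(e.getKey());     // same key, different contents
            }
        }
        for (Long id : before.keySet()) {
            if (!after.containsKey(id)) {
                d.deleted.add(id);             // key disappeared
            }
        }
        return d;
    }
}
```

Each polling run would load the table into a fresh map, call between(previous, fresh), publish the three sets, and then keep the fresh map as the new previous snapshot. The cost is one full table read per poll, so this only suits tables of moderate size.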
I assume that you're talking about a situation where anything can update a table. If for some reason you're instead talking about a situation where only the Java application will be updating the table that's different. If you're using Java only you can put this code in your DAO or EJB doing the update (it's much cleaner than using a trigger in this case).
An alternative way to do this is to funnel all database calls through a web service API, or perhaps a JMS API, which does the actual database calls. Processes could register there to get a notification of a database update.
We have a similar requirement. In our case we have a legacy system, and we do not want to adversely impact performance on its existing transaction table.
Here's my proposal:
A new work table with pk to transaction and insert timestamp
A new audit table that has same columns as transaction table + audit columns
Trigger on transaction table to dump all insert/update/deletes to an audit table
Java process to poll the work table, join to the audit table, publish the event in question and delete from the work table.
Question is: What do you use for polling? Is quartz overkill? How can you scale back the polling frequency based on the current DB load?