I'm currently sourcing some static data from a third party. It's a simple one-to-many, like this
garage:
id
name
desc
location
garage_price:
id
garage_id
price_type
price
Sometimes, the data is incorrect, and I will need to correct it. At the same time, I'd like to preserve the original sourced data somewhere and potentially run some queries to show the changes.
My question is whether anyone is doing something like this with SQL, Java and Hibernate, and what approach you've taken, or would take.
I could add a boolean column, "original_data", to both tables, and before an update happens, run a trigger to copy the row from garage or garage_price into an "original_garage" or "original_price" table as long as original_data is true. Then set original_data to false, and all further updates will just happen on the garage/garage_price tables.
Anything wrong with that approach, and how do people typically work with multiple tables with the same data in Hibernate/JPA? Previously, I'd create a class that holds all the data, and subclass it twice, once per each table, while setting
@Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
on the parent.
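For reference, here is roughly what that mapping looks like (the names are illustrative; getters, setters and the price mapping are omitted):
import javax.persistence.*;

@Entity
@Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
public abstract class GarageData {
    @Id
    @GeneratedValue(strategy = GenerationType.TABLE) // IDENTITY is not usable with TABLE_PER_CLASS; a table/sequence generator keeps ids unique across both tables
    protected Long id;
    protected String name;
    protected String description; // the source's "desc" column; DESC is a reserved word in SQL
    protected String location;
}

@Entity
@Table(name = "garage")
public class Garage extends GarageData { }

@Entity
@Table(name = "original_garage")
public class OriginalGarage extends GarageData { }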
As so often there are various options:
Use Hibernate Envers. It will keep a complete history of changes, so if you make multiple changes, each will result in a row in the auditing tables. These tables are separate from your main data tables, which might be a pro or a con depending on your requirements. (A minimal setup sketch follows this list.)
Use the approach that you described: write the original dataset, and copy it before modifying it. You'll need two additional attributes: a flag marking the original, and a technical id to have a unique primary key.
The same as the second option, but doing the copy in a database trigger. This is probably faster, works no matter how the data gets inserted, and copying rows in the database is actually really easy, while it feels rather cumbersome in Java. Of course, writing triggers is considered a PITA in itself by many Java developers. If your application doesn't usually use triggers and stored procedures, it is also really easy to forget about the trigger and be rather confused about where these additional rows come from.
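If you go the Envers route, the setup is mostly declarative. A minimal sketch, assuming a plain Garage entity and hibernate-envers on the classpath:
import javax.persistence.*;
import org.hibernate.envers.AuditReaderFactory;
import org.hibernate.envers.Audited;

@Entity
@Audited // every change is versioned into a Garage_AUD table plus a shared revision-info table
public class Garage {
    @Id
    @GeneratedValue
    private Long id;
    private String name;
    private String description;
    private String location;
    // getters and setters omitted
}

// Later, to read the originally sourced state back (revision numbers start at 1):
Garage original = AuditReaderFactory.get(entityManager)
        .find(Garage.class, garageId, 1);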
Related
I'm not sure if something special exists for this use case, but it felt like a case where someone was likely to have made some sort of useful structure/technique/design pattern.
My Situation
I have a set of SQL commands executed from middle tier (Java) to insert/update/delete data to any of a set of very large tables via joins from a related staging table.
I have more SQL commands which update various derived tables based on the staging table/actual table contents. Different tables will interact with different derived tables via different queries (as usual). These commands may have to be interleaved with the first set depending on the use case - so, I can't necessarily execute set 1 then set 2 all at once.
My Question
So, I need to build a chain of commands that get executed sequentially, and I need to trigger a rollback if any of them fail. I'd like to do this in the most clear, documented way possible.
Does anyone know a standard way of coding this? I'm sure anyone migrating from stored procedure code to middle tier code has done this before and I don't want to reinvent the wheel if there are good options out there.
Additional Information
One of my main concerns is making everything clear. To elaborate, I'll have a set of queries specifically designed to:
Truncate staging table A' and populate it with primary keys targeting deletion records
Delete from actual table A based on join with A'
Truncate staging table A' and populate it with full data for upserts
Update/Insert records from A' to A based on joins
The same logic will apply to tables B, C, D, etc. Unfortunately, it can be the case where just A and C need an extra step, like syncing deletes to a certain derived table, to be done after the deletions but before the upserts.
I'd obviously like to group all the logic for updating a table, and I'd like to group all the logic for updating a derived table as well, but at execution time they have to be intelligently interleaved and this sounds messy to me.
Don't write such a thing yourself. This is what JTA was born for.
You can use either JPA or Spring to do it.
Annotate the unit of work as transactional and let the database and JDBC handle it.
If you must do it yourself, follow the aspect-oriented approach and make it a decorator-style "before & after" implementation.
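For illustration, a minimal sketch of this with Spring's declarative transactions; the table names (staging_a, table_a, feed) are placeholders:
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class TableSyncService {

    private final JdbcTemplate jdbc;

    public TableSyncService(JdbcTemplate jdbc) {
        this.jdbc = jdbc;
    }

    // Every statement in this method commits or rolls back as one unit;
    // any RuntimeException thrown inside triggers the rollback.
    @Transactional
    public void syncTableA() {
        // DELETE rather than TRUNCATE: TRUNCATE is DDL and commits
        // implicitly on Oracle/MySQL, which would break the rollback.
        jdbc.update("DELETE FROM staging_a");
        jdbc.update("INSERT INTO staging_a (id) SELECT id FROM feed WHERE op = 'D'");
        jdbc.update("DELETE FROM table_a WHERE id IN (SELECT id FROM staging_a)");
        // ... derived-table syncs and the upsert pass can be interleaved
        // here; they all join the same transaction.
    }
}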
This is a very simple question that applies to programming web interfaces with Java. Say I am not using an ORM (even if I am actually using one), and let's say I've got this Car (id, name, color, type, blah, blah) entity in my app, and I have a CAR table to represent this entity in the database. So, say I need to update only a subset of fields on a bunch of cars; I understand that the typical flow would be:
A DAO class (CarDAO) - getCarsForUpdate()
Iterate over all Car objects, update just the color to say green or something.
Another DAO call to updateCars(Cars cars).
Now, isn't this a little beating around the bush for what would be a simple select and update query? In the first step above, I would be retrieving the entire object's data from the database: "select id,name,color,type,blah,blah.. from CAR where ..." instead of "select id,color from CAR where ...". So why should I retrieve those extra fields when, after the DAO call, I would never use anything other than "color"? The same applies to step 3. OR, say I query just for the id and color (select id,color) and create a Car object with only id and color populated - that is perfectly OK, isn't it? The Car object is anemic anyway?
Doesn't all this (object oriented-ness) seem a little fake?
For one, I would prefer that if the RDBMS can handle your queries, you let it. The reason is that you don't want your JVM to do all the work, especially when running an enterprise application (where you have many concurrent connections needing the same resource).
If you particularly want to update an object (e.g. set the car colour to green) in the database, I would suggest SQL like
UPDATE CAR SET COLOR = 'GREEN';
(Notice I haven't used the WHERE clause.) This updates the entire CAR table, and I didn't need to pull every Car object, call setColor("GREEN"), and do an update.
In hindsight, what I'm trying to say is: apply engineering knowledge. Your DAO should simply do fast selects, updates, etc., and let all the SQL "work" be handled by the RDBMS.
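For a targeted change you would of course add a WHERE clause, and with JPA you can still do it set-based, without loading any Car objects. A sketch, assuming the Car entity and field names:
import javax.persistence.EntityManager;

// Bulk update: one statement, no entities hydrated.
int updated = entityManager.createQuery(
        "UPDATE Car c SET c.color = :color WHERE c.type = :type")
        .setParameter("color", "GREEN")
        .setParameter("type", "SEDAN")
        .executeUpdate();
// Note: bulk JPQL updates bypass the persistence context, so Car
// instances already loaded in the session will not see the new color.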
From my experience, what I can say is :
As long as you're not doing join operations, i.e. just querying columns from the same table, the number of columns you fetch will change almost nothing to performance. What really affects performance is how many rows you get, and the where clause. Fetching 2 or 20 columns changes so little you won't see any difference.
The same goes for updates.
I think that in certain situations, it is useful to request a subset of the fields of an object. This can be a performance win if you have a large number of columns or if there are some large BLOB columns that would impact performance if they were hydrated. Although the database usually reads in an entire row of information whenever there is a match, it is typical to store BLOB and other large fields in different locations with non-trivial IO requirements.
It might also make sense if you are iterating across a large table and doing some sort of processing. Although the savings might be insignificant on a single row, it might be measurable across a large table.
Also, if you are only using fields that are in indexes, I believe that the row itself will never be read and it will use the fields from the index itself. Not sure in your example if color would be indexed however.
All this said, if you are only persisting objects that are relatively simple, without BLOBs or other large database fields, then this could turn into premature optimization, since the query processing, row IO, JDBC overhead, and object creation are most likely going to take a lot more time compared to hydrating a subset of the fields in the row. Converting database rows into the final Java class is typically a small portion of the cost of each query.
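If you do want only a couple of columns through JPA, a constructor-expression projection keeps it clean. A sketch; the CarView DTO and its package are assumptions:
import javax.persistence.EntityManager;
import java.util.List;

// DTO holding just the columns we asked for; nothing else is hydrated.
public class CarView {
    public final long id;
    public final String color;

    public CarView(long id, String color) {
        this.id = id;
        this.color = color;
    }
}

List<CarView> views = entityManager.createQuery(
        "SELECT NEW com.example.CarView(c.id, c.color) FROM Car c WHERE c.type = :type",
        CarView.class)
        .setParameter("type", "SEDAN")
        .getResultList();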
I have to go through a database and modify it according to some logic. The problem looks something like this: I have a history table in my database that I have to modify.
Before modifying anything I have to look at whether an object (which has several rows in the history table) had a certain state, say 4 or 9. If it had state 4 or 9 then I have to check the rows between the currently found row and the next state 4 or 9 row. If such a row (between those states) has a specific value in a specific column then I do something in the next row. I hope this is simple enough to give you an idea. I have to do this check for all the objects. Keep in mind that any object can be modified anywhere in its life cycle (of course until it reaches a final state).
I am using SQL Server 2005 and Hibernate. AFAIK I cannot do such a complicated check in Transact-SQL! So what would you recommend I do? So far I have been thinking of doing it as a JUnit test. This would have the advantage of having Hibernate to help me do the modifications, and I would have Java for lists and other data structures I might need that don't exist in SQL. And if I do it as a JUnit test, I am not losing my mapping files!
I am curious what approaches you would use.
I think you should be able to use cursors to manage the complicated checks in SQL Server. You didn't mention how frequently you need to do this, but if this is a one-time thing, you can either do it in Java or SQL Server, depending on your comfort level.
If this check needs to be applied on every CRUD operation, perhaps a database trigger is the way to go. If the logic may change frequently over time, I would rather write the checks in Hibernate, assuming no one will hit the database directly.
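To make the Hibernate variant concrete, here is a rough sketch of the scan the question describes; HistoryRow, its accessors, SPECIAL_VALUE and applyCorrection are all hypothetical names, and an open Session is assumed:
import java.util.List;
import org.hibernate.Session;

// Walk one object's history in order; between a state-4/9 row and the
// next state-4/9 row, look for the special value and act on the row after it.
List<HistoryRow> rows = session.createQuery(
        "FROM HistoryRow h WHERE h.objectId = :id ORDER BY h.createdAt",
        HistoryRow.class)
        .setParameter("id", objectId)
        .list();

for (int i = 0; i < rows.size(); i++) {
    int state = rows.get(i).getState();
    if (state != 4 && state != 9) continue;
    for (int j = i + 1; j < rows.size(); j++) {
        HistoryRow inner = rows.get(j);
        if (inner.getState() == 4 || inner.getState() == 9) break; // window ends
        if (SPECIAL_VALUE.equals(inner.getSpecialColumn()) && j + 1 < rows.size()) {
            applyCorrection(rows.get(j + 1)); // "do something in the next row"
        }
    }
}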
I have an existing application that I am working with, and the customer has defined the table structure they would like for an audit log. It has the following columns:
storeNo
timeChanged
user
tableChanged
fieldChanged
BeforeValue
AfterValue
Usually I just have simple audit columns on each table that provide userChanged and timeChanged values. The application that will be writing to these tables is a Java application, and the calls are made via JDBC against an Oracle database. The question I have is: what is the best way to get the before/after values? I hate to compare objects to see what changes were made in order to populate this table; that is not going to be efficient. If several columns change in one update, then this new table will have several entries. Or is there a way to do this in Oracle? What have others done in the past to track not only changes but the changed values?
This is traditionally what Oracle triggers are for. Each insert or update fires a trigger, which has access to the "before and after" data, which you can do with as you please, such as logging the old values to an audit table. It's transparent to the application.
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:59412348055
If you use Oracle 10g or later, you can use the built-in auditing features. You paid good money for the license; you might as well use it.
Read more at http://www.oracle.com/technology/pub/articles/10gdba/week10_10gdba.html
"the customer has defined the table structure they would like for an audit log"
Dread words.
Here is how you would implement such a thing:
create or replace trigger emp_bur before update on emp for each row
begin
    if :new.ename != :old.ename then
        insert_audit_record('EMP', 'ENAME', :old.ename, :new.ename);
    end if;
    if :new.sal != :old.sal then
        insert_audit_record('EMP', 'SAL', :old.sal, :new.sal);
    end if;
    if :new.deptno != :old.deptno then
        insert_audit_record('EMP', 'DEPTNO', :old.deptno, :new.deptno);
    end if;
end;
/
As you can see, it involves a lot of repetition, but that is easy enough to handle, with a code generator built over the data dictionary. But there are more serious problems with this approach.
It has a sizeable overhead: a single update which touches ten fields will generate ten insert statements.
The BeforeValue and AfterValue columns become problematic when we have to handle different datatypes - even dates and timestamps become interesting, let alone CLOBs.
It is hard to reconstruct the state of a record at a point in time. We need to start with the earliest version of the record and apply the subsequent changes incrementally.
It is not immediately obvious how this approach would handle INSERT and DELETE statements.
Now, none of those objections are a problem if the customer's underlying requirement is to monitor changes to a handful of sensitive columns: EMPLOYEES.SALARY, CREDIT_CARDS.LIMIT, etc. But if the requirement is to monitor changes to every table, a "whole record" approach is better: just insert a single audit record for each row affected by the DML.
I'll ditto on triggers.
If you have to do it at the application level, I don't see how it would be possible without going through these steps:
start a transaction
SELECT FOR UPDATE of the record to be changed
for each field to be changed, pick up the old value from the record and the new value from the program logic
for each field to be changed, write an audit record
update the record
end the transaction
If there's a lot of this, I think I would create an update-record function to do the compares, either at a generic level or as a separate function for each table; a rough sketch follows.
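If it comes to that, here is a sketch of those steps with plain JDBC; the audit_log table uses the customer's column names, and everything else (table name handling, the id column) is hypothetical:
import java.sql.*;
import java.util.Map;
import java.util.Objects;

void updateWithAudit(Connection con, String table, long id,
                     Map<String, Object> newValues) throws SQLException {
    con.setAutoCommit(false); // start the transaction
    try (PreparedStatement lock = con.prepareStatement(
            "SELECT * FROM " + table + " WHERE id = ? FOR UPDATE")) {
        lock.setLong(1, id);
        try (ResultSet rs = lock.executeQuery()) {
            if (!rs.next()) throw new SQLException("Row not found: " + id);
            for (Map.Entry<String, Object> change : newValues.entrySet()) {
                Object before = rs.getObject(change.getKey());
                if (Objects.equals(before, change.getValue())) continue;
                try (PreparedStatement audit = con.prepareStatement(
                        "INSERT INTO audit_log (tableChanged, fieldChanged, BeforeValue, AfterValue) "
                        + "VALUES (?, ?, ?, ?)")) {
                    audit.setString(1, table);
                    audit.setString(2, change.getKey());
                    audit.setString(3, String.valueOf(before));
                    audit.setString(4, String.valueOf(change.getValue()));
                    audit.executeUpdate(); // one audit row per changed field
                }
            }
        }
    }
    // ... then run the UPDATE of the row itself and commit;
    // roll back on any exception so audit rows and data stay consistent.
}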
Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:
class Store implements Validatable {
    int storeNo;
    String storeName;
    // ... etc. ...
}
So, your client tells you that you have to import the store schedule from an Excel sheet into the application, and you have to run a series of validations on them; for instance, 'StoreIsInSameCountry', 'StoreIsValid', etc. So, you would design a Rule interface for checking all business conditions. Something like this:
interface Rule<T extends Validatable> {
    public Error check(T value) throws Exception;
}
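A concrete rule might then look like this (StoreDao and the application's own Error type are assumptions):
class StoreExistsRule implements Rule<Store> {
    private final StoreDao dao;

    StoreExistsRule(StoreDao dao) {
        this.dao = dao;
    }

    public Error check(Store store) throws Exception {
        // One query per store: this is exactly what multiplies into the
        // thousands of hits described below.
        return dao.findById(store.storeNo) != null
                ? null
                : new Error("Store does not exist: " + store.storeNo);
    }
}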
Now, here comes the question. I am uploading 2000 stores from this Excel sheet, so I would end up running each rule defined for a store that many times. If I were to have 4 rules, that's 8000 queries to the database, i.e., 16000 hits to the connection pool. For a simple check where I would just have to verify whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would get my 'Store' object. When I don't get anything back from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than doing the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
What would you do if you were in my shoes, and what is the best practice?
That way I would get my 'Store' object from the database. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = tt.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
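From the Java side, filling the temporary table is a single batched round trip. A sketch with plain JDBC; temptable and its id column are assumed to exist:
import java.sql.*;
import java.util.*;

static Set<Integer> findExistingStores(Connection con, List<Integer> storeIds)
        throws SQLException {
    // 1. Fill the temp table in one batch instead of 2000 single inserts.
    try (PreparedStatement fill = con.prepareStatement(
            "INSERT INTO temptable (id) VALUES (?)")) {
        for (int storeId : storeIds) {
            fill.setInt(1, storeId);
            fill.addBatch(); // queue instead of executing one by one
        }
        fill.executeBatch(); // single round trip for all ids
    }
    // 2. One join tells us which of the ids actually exist.
    Set<Integer> existing = new HashSet<>();
    try (Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT s.STORE_ID FROM STORE s JOIN temptable tt ON s.STORE_ID = tt.id")) {
        while (rs.next()) {
            existing.add(rs.getInt(1));
        }
    }
    return existing; // any id missing from this set fails the "store exists" rule
}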
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
IN is in no way a "not suggested methodology".
In fact, there was a time when some databases did not support IN queries efficiently, which is why folk wisdom still advises against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
Note, that in a properly designed database, validation rules are also set-based.
I.e., you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.
SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
I think it's more of a business question, with parameters like how often the client runs the import, how long it would take you to implement either solution, and how expensive your time is per hour.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.
...a Rule can be run for a single store only.
Managing business rules along with performance is a tricky task, so there is a library ("Persistence Layer") that does exactly that. You define rules, then execute a bulk of commands; the library then fetches from the DB whatever the rules require in a single query (using temp tables rather than IN) and passes it to the rules.
There is an example of a validator in here.