I have a requirement in which I need to capture data changes (not auditing) and life cycle states on inventory.
Technology:
Java, Oracle, Hibernate + JPA
For the data changes, we have been given a list of data elements that are to be monitored. If the element changes we are to notify a given 3rd party vendor. What I want to do is make this a generic service that we can provide to any of our current and future 3rd party vendors.
We don't care who made the change or what the new value is, just that it changed.
The thought is that the data layer of our application would use an annotation on each of the data elements. If a data element changed, it would place a message on a queue. A message bean would then read the queue and make an entry in a table. (A rough sketch of this appears after the table descriptions below.)
The table would look something like the following:
Table Name: ATL_CHANGE_TRACKER
Key columns
INVENTORY_ID Inventory Id of the vehicle
SALEEVENT_ITEM_ID SaleEvent item of the vehicle
FIELD_CHANGED_ID Id of the field that got changed or action. Link to subscription
UPDATE_DTM Indicates the date/time when the change occurred.
For a given inventory, we could have up to 200 entries in this table (monitoring 200 fields across many tables).
Then a daemon for the given 3rd party would read from this table based on the fields it has subscribed to (which could be all of them). It would then read whatever tables it needs in order to create the message to be sent to the 3rd party. This decouples the provider of the data from the consumer of the data.
The following table identifies the list of fields/actions that are available:
Table Name: ATL_FIELD_ACTION
Key columns
ID
NAME Name of the field/action - Example Color,Make
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
Subscription table: if 3rd-party company xyz is interested in 60 fields, those 60 fields will be mapped in this table.
ATL_FIELD_ACTION_SUBSCRIPTION
Key columns
ATL_FIELD_ACTION_ID ID of the ATL_FIELD_ACTION table
CONSUMER 3rd Party Name
FUNCTION Name of the 3rd Party Transmission that it is used for
STATUS
REC_CRE_TIME_STAMP
REC_CRE_USER_ID
LAST_UPDATE_USER_ID
LAST_UPDATE_TIME_STAMP
The second part is that there will be actions on the life cycle of the inventory which will also need to be recorded. In this case, when the state of the inventory changes, a message will be placed on the same queue and an entry will be made in the same table.
Again, the daemon will have subscribed to these states and will collect the ones it is interested in.
The goal here is to not have the business tier/data tier care who wants the data - just that it needs to provide it so those interested can get it.
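To make the annotation idea concrete, here is a rough sketch of what the marker annotation and the queue message could look like (all names, such as TrackedField and FieldChangeMessage, are illustrative, not an existing API):

import java.io.Serializable;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.time.Instant;

// Marks a monitored data element; fieldActionId links the element to ATL_FIELD_ACTION.ID.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface TrackedField {
    long fieldActionId();
}

// The message the data layer drops on the queue when a tracked element changes.
// It only says what changed and when - not who changed it or what the new value is.
class FieldChangeMessage implements Serializable {
    final long inventoryId;        // -> ATL_CHANGE_TRACKER.INVENTORY_ID
    final long saleEventItemId;    // -> ATL_CHANGE_TRACKER.SALEEVENT_ITEM_ID
    final long fieldActionId;      // -> ATL_CHANGE_TRACKER.FIELD_CHANGED_ID
    final Instant updateDtm;       // -> ATL_CHANGE_TRACKER.UPDATE_DTM

    FieldChangeMessage(long inventoryId, long saleEventItemId,
                       long fieldActionId, Instant updateDtm) {
        this.inventoryId = inventoryId;
        this.saleEventItemId = saleEventItemId;
        this.fieldActionId = fieldActionId;
        this.updateDtm = updateDtm;
    }
}

// Usage on an entity field:
// @TrackedField(fieldActionId = 42L)   // 42 = the "Color" row in ATL_FIELD_ACTION
// private String color;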
I wonder if anyone has done something like this - any gotchas, or off-the-shelf/open-source solutions for doing it?
For a high-level discussion on the topic, I would suggest reading this article by Martin Fowler.
It sounds like you have write-once, read-many data that might come in large volumes, and the data is different for different clients. If you ask me, this may be a good place to make use of either a NoSQL database or to hack your Oracle database to act as a NoSQL database. See here for a discussion of how someone did this with MySQL.
Otherwise, you may look at creating an "immutable" database table and having Hibernate write new records every time it does an update, as described here.
A couple of things.
First, you get to do all of this work yourself. The JPA/Hibernate lifecycle listeners give you an event for when an update occurs, but you aren't passed the "old" object and the "new" object, so you're going to have to keep track of which fields changed using some other method.
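One common workaround is to snapshot the relevant state yourself on load and diff it just before the update (Hibernate's native PostUpdateEvent does expose the old state via getOldState(), but the standard JPA callbacks do not). A deliberately naive sketch, registered on the entities with @EntityListeners(ChangeTrackingListener.class):

import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import javax.persistence.PostLoad;
import javax.persistence.PreUpdate;

// Naive sketch: remember field values at load time, diff them just before the update.
public class ChangeTrackingListener {

    // Keyed on entity identity; a real implementation needs a safer, leak-free scope
    // (e.g. a field on the entity itself or a persistence-context-scoped map).
    private static final Map<Object, Map<String, Object>> snapshots = new HashMap<>();

    @PostLoad
    public void snapshot(Object entity) {
        snapshots.put(entity, readFields(entity));
    }

    @PreUpdate
    public void detectChanges(Object entity) {
        Map<String, Object> before = snapshots.getOrDefault(entity, new HashMap<>());
        readFields(entity).forEach((name, after) -> {
            if (!Objects.equals(before.get(name), after)) {
                // the field changed: enqueue a change message / write the tracking row here
                System.out.println("changed: " + name);
            }
        });
    }

    private Map<String, Object> readFields(Object entity) {
        Map<String, Object> values = new HashMap<>();
        for (Field f : entity.getClass().getDeclaredFields()) {
            f.setAccessible(true);
            try {
                values.put(f.getName(), f.get(entity));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return values;
    }
}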
Second, again with lifecycle listeners, be careful inside of them, as the transaction state is a bit murky. At least on Glassfish/EclipseLink, I've had "strange" problems using either JPA or JMS from a lifecycle listener - just weird behavior. We went to a non-transactional queue to capture all of the information that we track from the lifecycle events.
If having the change data committed in its own transaction is acceptable, then there is value in pushing the data onto a faster, internal queue (which can feed a listener that posts it to an MDB). This gets the auditing "out of band" with your transaction, giving you better transaction throughput. But if you need the change information committed with the same transaction, this won't work: for example, you could put something on the queue and then the transaction may be rolled back (for whatever reason), leaving the change on the queue showing it happened when it in fact failed. That's a potential issue with this approach.
But if you're posting a lot of audit information, the performance side of this trade-off can be a real concern.
If the auditing information has a short life span (with respect to the rest of the data), then you should probably make an effort to cull the audit tables; they can get pretty large.
Also, if practical, don't disregard the use of DB triggers for this. They can be quite efficient and effective at this process.
I have a requirement where, whenever there's an entry in the table, I want to trigger an event. I have used EntityListeners (a Spring Data JPA concept) for this, which works perfectly fine; the issue is that inserts can also happen through stored procedures or manual entry. I searched online and found the Spring JPA inbound and outbound channel adapter concept, but I don't think it helps with what I want to achieve. Can anybody clarify whether this concept helps me (I don't know much about it), or suggest other ways I can achieve this?
There are no "great" mechanisms for raising events "from the data layer" in SQL Server.
There are three "OK" ones:
Triggers (only arguably OK)
Triggers seem like an obvious solution, but then you have to ask yourself... what will the trigger actually do? If it just writes data into another table, you still haven't gotten yourself outside the database. There are various arcane tricks you could try to use for this, like CLR procedures, or a few extended procedures.
But if you go down that route, you have to start thinking about another consideration: Triggers happen in the same transaction as the DML operation that caused them to fire. If they take time to execute, you'll be slowing down your OLTP workloads. If they do anything that is potentially unreliable they could fail, causing your transaction to roll back.
Triggers plus service broker
Service broker provides a mechanism - perhaps the only even-half-sensible mechanism - to get your data out of SQL and into some kind of listener in a "push" based manner. You still have a trigger, but the trigger writes data to a service broker queue. A listener can use a special waitfor receive statement to listen for data as it appears in the queue. The nice thing about this is that once the trigger has pushed data into a broker queue, its job is done. The "receipt" of that data is decoupled from the transaction that caused it to be enqueued in the first place. This sort of service broker mechanism is what is used by things like the SqlDependency built into dot net.
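As a rough illustration, the listener side can be driven from Java over plain JDBC; the queue name below is made up, and a production listener would wrap the RECEIVE in a transaction and end conversations properly:

import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Blocks for up to 5 seconds waiting for a Service Broker message, then hands it off.
public class BrokerQueueListener {
    public static void main(String[] args) throws Exception {
        String receive =
            "WAITFOR (RECEIVE TOP (1) message_type_name, message_body " +
            " FROM dbo.ChangeNotifyQueue), TIMEOUT 5000";   // queue name is illustrative
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=AppDb;user=app;password=secret");
             Statement st = con.createStatement()) {
            while (true) {
                try (ResultSet rs = st.executeQuery(receive)) {
                    while (rs.next()) {
                        byte[] body = rs.getBytes("message_body");
                        // forward the payload to whoever subscribed to it
                        System.out.println(new String(body, StandardCharsets.UTF_8));
                    }
                }
            }
        }
    }
}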
The two main issues with service broker are complexity and performance. Service broker has a steep learning curve, and it's easy to get things wrong. Performance becomes complex if you need to scale, because while it's "easy" to build xml or json payloads, large set based data changes can mean those payloads are massive.
In any case, if you want to explore this route, you're going to want to read (all of) the excellent articles on the subject by Remus Rusanu.
Bear in mind that this is an asynchronous "near real time" mechanism, not a synchronous "real time" mechanism like triggers.
Polling a built-in change detection mechanism: CDC or Change Tracking
SQL Server comes with two flavours of technology that can natively "watch" changes that happen in tables and record them: Change Tracking and Change Data Capture (CDC).
Neither of these push data out of the database, they're both "pull" based. What they do is store additional data in the database when changes happen. CDC can provide a complete log of every change, whereas change tracking "points to" rows that have changed via the primary key values. Though both of these involve "polling history", there are significant differences between them, so read the fine print.
Note that CDC is "doubly asynchronous" - the data is read from the transaction log, so recording the data is not part of the original transaction. And then you have to poll the CDC data, it's not pushed out to you. Furthermore, the functions generated by Microsoft when you enable CDC can be unbelievably slow as soon as you ask for something useful, like net changes with mask (which can tell you which columns really changed their value), and your ability to enable CDC comes with a lot of caveats and limitations (again, read the docs for all of this).
As to which of these is "best", well, that's a matter of opinion and circumstance. I have used CDC extensively, service broker rarely, and triggers almost never, as a way of getting events out of SQL. I have never actually used change tracking in a production environment, but if I had the choice again I would probably have chosen change tracking rather than change data capture, at least until or unless there were requirements that mandated the use of CDC because of its additional functionality beyond what change tracking can provide.
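For what it's worth, polling Change Tracking from Java is fairly compact. A hedged sketch (table, key column and connection details are illustrative, and change tracking must already be enabled on the database and table):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Polls Change Tracking for rows of dbo.MY_TABLE changed since the last sync version.
public class ChangeTrackingPoller {
    public static void main(String[] args) throws Exception {
        long lastSyncVersion = 0L; // persist this between polls in a real implementation

        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=AppDb;user=app;password=secret")) {

            // Capture the current version first, so nothing slips between the two queries.
            long currentVersion;
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT CHANGE_TRACKING_CURRENT_VERSION()")) {
                rs.next();
                currentVersion = rs.getLong(1);
            }

            String sql = "SELECT ct.ID, ct.SYS_CHANGE_OPERATION "
                       + "FROM CHANGETABLE(CHANGES dbo.MY_TABLE, ?) AS ct";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, lastSyncVersion);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // I = insert, U = update, D = delete; only key columns come back,
                        // so join to the base table if you need the current values.
                        System.out.println(rs.getString("SYS_CHANGE_OPERATION")
                                + " " + rs.getLong("ID"));
                    }
                }
            }

            lastSyncVersion = currentVersion; // remember where we got to for the next poll
        }
    }
}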
One last note: If you need to "guarantee" that the events that get raised have in fact been collected by a listener and successfully forwarded to subscribers, well, you have some work ahead of you! Guaranteed messaging is hard.
I'm currently sourcing some static data from a third party. It's a simple one-to-many, like this:
garage:
id
name
desc
location
garage_price:
id
garage_id
price_type
price
Sometimes, the data is incorrect, and I will need to correct it. At the same time, I'd like to preserve the original sourced data somewhere and potentially run some queries to show the changes.
My question is whether someone is doing something like this with SQL, Java and Hibernate, and what's the approach you've taken, or would take.
I could add a boolean column, "original_data", to both tables, and before an update happens, run a trigger to copy the row from garage or garage_price into an "original_garage" or "original_price" table as long as original_data is true. Then set original_data to false, and all further updates will just happen on the garage/garage_price tables.
Anything wrong with that approach, and how do people typically work with multiple tables with the same data in Hibernate/JPA? Previously, I'd create a class that holds all the data and subclass it twice, once per table, while setting
@Inheritance(strategy=InheritanceType.TABLE_PER_CLASS)
on the parent.
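For reference, the mapping I mean looks roughly like this (field and table names follow the schema above; note that identity-style key generation generally cannot be used with TABLE_PER_CLASS, so a table or sequence generator is needed):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Inheritance;
import javax.persistence.InheritanceType;
import javax.persistence.Table;

@Entity
@Inheritance(strategy = InheritanceType.TABLE_PER_CLASS)
public abstract class GarageData {
    @Id
    @GeneratedValue(strategy = GenerationType.TABLE) // IDENTITY doesn't work with TABLE_PER_CLASS
    private Long id;

    private String name;

    @Column(name = "\"desc\"") // desc is reserved in SQL, so the column usually needs quoting
    private String desc;

    private String location;
    // getters/setters omitted
}

@Entity
@Table(name = "garage")
class Garage extends GarageData { }

@Entity
@Table(name = "original_garage")
class OriginalGarage extends GarageData { }

Queries against GarageData then span both tables (Hibernate issues a UNION), while queries against Garage or OriginalGarage only hit their own table.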
As so often, there are various options:
Use Hibernate Envers. It will keep a complete history of changes, so if you make multiple changes, each will result in a row in the auditing tables. These tables are separate from your main data tables, which might be a pro or a con, depending on your requirements. (A minimal setup sketch follows this list of options.)
Use the approach that you described: write the original dataset, copy it before modifying it. You'll need two additional attributes:
a flag marking the original row, and a technical id so you have a unique primary key.
Same as the second option, but do the copy in a database trigger. This is probably faster, works no matter how the data gets inserted, and copying rows inside the database is really easy, while it feels rather cumbersome in Java. Of course, writing triggers is considered a PITA in itself by many Java developers, and if your application doesn't otherwise use triggers and stored procedures, it is also really easy to forget about the trigger and be rather confused about where these additional rows come from.
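To illustrate the first option: with the hibernate-envers dependency on the classpath, auditing an entity is mostly a matter of annotating it (entity shown is the garage table from the question):

import javax.persistence.Entity;
import javax.persistence.Id;
import org.hibernate.envers.Audited;

// With hibernate-envers present, @Audited is enough to get a garage_AUD table
// (plus a shared REVINFO table) populated on every insert/update/delete.
@Entity
@Audited
public class Garage {
    @Id
    private Long id;
    private String name;
    private String location;
    // getters/setters omitted
}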
I'm currently developing an application in Java that connects to a MySQL database using JDBC and displays records in a JTable. The application is going to be run by more than one user at a time, and I'm trying to implement a way to see if the table has been modified - e.g. user one modifies a column such as stock level, and then user two accesses the same record and tries to change it based on the level it had before user one's change.
At the moment I'm storing the checksum of the table that's being displayed as a variable and when a user tries to modify a record it will do a check whether the stored checksum is the same as the one generated before the edit.
As I'm new to this, I'm not sure whether this is a correct way to do it; I have no experience in this matter.
Calculating the checksum of an entire table seems like a very heavy-handed solution and definitely something that wouldn't scale in the long term. There are multiple ways of handling this, but the core theme is to do as little work as possible so that you can scale as the number of users increases. Imagine implementing the checksum-based solution on a table with a million rows continuously updated by hundreds of users!
One of the solutions (which requires minimal re-work) would be to "check" the stock name against which the value is updated. In the background, you'll fire across a query to the table to see if the data for "that particular stock" has been updated after the table was populated. If yes, you can warn the user or mark the updated cell as dirty to indicate that that value has changed. The problem here is that the query won't be fired off till the user tries to save the updated value. Or you could poll the database to avoid that but again hardly an efficient solution.
As a more robust solution, I would recommend using a database which implements native "push notifications" to all the connected clients. Redis is a NoSQL database which comes to mind for this.
Another tried and tested technique would be to forgo direct database connections and use a middleware layer like a messaging queue (e.g. RabbitMQ). Message queues enable the design of systems that communicate using messages. So, for example, every update to a stock value in the JTable would be sent as a message to an "update database" queue. Once the update is done, a message would be sent to an "update notification" queue to which all clients are connected. This will enable all of them to know that the value of a given stock has been updated and act accordingly. The advantage of this solution is that you get to keep your existing stack (Java, MySQL) and can implement notifications without polling the DB and killing it.
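A sketch of that notification leg with the RabbitMQ Java client, using a fanout exchange so every connected client hears about a change (the exchange name and wiring are illustrative):

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

public class StockChangeBus {
    private static final String EXCHANGE = "stock.updates"; // illustrative name

    // Called after a successful save: broadcast which stock changed.
    public static void publish(Channel channel, String stockName) throws Exception {
        channel.exchangeDeclare(EXCHANGE, "fanout", true);
        channel.basicPublish(EXCHANGE, "", null, stockName.getBytes(StandardCharsets.UTF_8));
    }

    // Each client binds its own private queue to the fanout exchange and refreshes
    // (or marks dirty) the affected row whenever a notification arrives.
    public static void listen(Channel channel, Consumer<String> onChange) throws Exception {
        channel.exchangeDeclare(EXCHANGE, "fanout", true);
        String queue = channel.queueDeclare().getQueue();
        channel.queueBind(queue, EXCHANGE, "");
        DeliverCallback callback = (tag, delivery) ->
                onChange.accept(new String(delivery.getBody(), StandardCharsets.UTF_8));
        channel.basicConsume(queue, true, callback, tag -> { });
    }
}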
Checksum is a way to see if data has changed.
Anyway, I would suggest you store a "last_update_date" column that is always updated on every update of the record.
So you just have to store this date (datetime precision) and do the check with that.
You can also add a version number column: a simple counter incremented by 1 at each update.
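A sketch of that version-counter check over plain JDBC (table and column names are made up): the update only succeeds if nobody else has bumped the version since the user read the row.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StockItemDao {
    // Returns false if someone else updated the row since versionRead was loaded.
    public boolean updateStockLevel(Connection con, long itemId,
                                    int newLevel, long versionRead) throws SQLException {
        String sql = "UPDATE stock_item "
                   + "SET stock_level = ?, version = version + 1, last_update_date = NOW() "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, newLevel);
            ps.setLong(2, itemId);
            ps.setLong(3, versionRead);
            return ps.executeUpdate() == 1; // 0 rows -> warn the user and reload the row
        }
    }
}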
Note:
You can add a trigger on update for updating last_update_date; it should be 100% reliable. You may not need a trigger if you control all updates.
When used in network communication:
A checksum is a count of the number of bits in a transmission unit that is included with the unit so that the receiver can check to see whether the same number of bits arrived. If the counts match, it's assumed that the complete transmission was received.
So, translated to checking whether two objects are different, your approach is correct.
I am busy practicing designing a simple todo list webapp whereby a user can authenticate into the app and save todo list items. The user is also only able to view/edit the todo list items that they added.
This seems to be a general feature (authenticated user only views their own data) in most web applications (or applications in general).
To me what is important is having knowledge of the different options for accomplishing this. What I would like to achieve is a solution that can handle lots of users' data effectively. At the moment I am doing this using a Relational Database, but noSQL answers would be useful to me as well.
The following ideas came to mind:
Add a user_id column each time this "feature" is needed.
Add an association table (in the example above a user_todo_list_item table) that associates the data.
Design in such a way that you have a table per user per "feature" ... so you would have a todolist_userABC table. It's an option, but I do not like it much since a thousand users means a thousand tables?!
Add row level security to the specific "feature". I am not familiar on how this works but it seems to be a valid option. I am also not sure whether this is database vendor specific.
Of my choices I went with the user_id column on the todolist_item table. Although it can do the job, I feel that a user_id column might be problematic when reading data if the data within the table gets large enough. One could add an index I guess but I am not sure of the index's effectiveness.
What I don't like about it is that I need to have a user_id for every table where I desire this type of feature which doesn't seem correct to me? It also seems that when I implement the database layer I would have to add this to my queries for every feature (unless I use some AOP)?
I had a look around (How does Trello store data in MongoDB? (Collection per board?)), but it does not speak about the techniques regarding user_id columns or things like that. I also tried reading about this in some security frameworks (Spring Security to be specific) but it seems that it only goes into privileges/permissions on a table level and not a row level?
So the question is whether my choice was appropriate and if there are better techniques to do this?
Your choice is the natural thing to do.
The table-per-user is a non-starter (anything that modifies the database structure in response to user action is usually suspect).
Row-level security isn't really an option for webapps - it requires each user session to have a separate, persistent connection to the database, which is rarely practical. And yes, it is vendor-specific.
How you index your tables depends entirely on your usage patterns and the types of queries you want to run. Is "show all TODOs for a user" a query you want to support (seems like it would be)? Then an index on the user id is obviously needed.
Why does having a user_id column seem wrong to you? If you want to restrict access by user, you need to be able to identify which user the record belongs to. Doesn't actually mean that every table needs it - for example, if one record composes another (say, your TODOs have 'steps', each step belongs to a single TODO), only the root of the object graph needs the user id.
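For completeness, a minimal JPA sketch of the user_id approach (names are illustrative): the owning user sits on the root entity and is indexed so "show all TODOs for a user" stays cheap.

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Index;
import javax.persistence.Table;

@Entity
@Table(name = "todolist_item",
       indexes = @Index(name = "idx_todolist_item_user", columnList = "user_id"))
public class TodoListItem {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @Column(name = "user_id", nullable = false)
    private Long userId;          // or a @ManyToOne to the User entity

    private String description;
    private boolean done;
    // getters/setters omitted
}

Every query then filters on the owner, e.g. select t from TodoListItem t where t.userId = :userId.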
I have a GWT/Java/Hibernate/MySQL application (but I think any web pattern could be valid) that does CRUD on several objects. Each object is stored in a table in the database. I want to implement an action logger. For example, for Object A I want to know who created and modified it, and for User B, what actions he performed.
My idea is to have a History table that stores: UserId, ObjectId, ActionName. The UserId and ObjectId are foreign keys. Am I on the right track?
I also think this is the right direction.
However, bear in mind that in an application with lots of traffic, this logging can become overhead.
I would suggest the following in this case -
A. Don't use Hibernate for this "action logging" - Hibernate performs better for "mostly read" DB workloads.
B. Consider a DB that is better suited to "mostly write" scenarios for the action logging table.
You can try to look for a NoSQL solution for this.
C. If you use such a NoSQL DB but still want to keep the logged actions in the relational DB, have an offline process (that runs once a day, for example) that queries your "action logging DB" and inserts the data into the relational DB.
D. If it's OK for your system to lose some action logging, consider using the producer/consumer pattern (for example, a queue between the producer and consumer threads) - the threads that need to log actions will not log them synchronously but asynchronously. See the sketch after this list.
E. In addition, don't forget that such a logging table has the potential to become flooded over time, causing queries on it to take a long time. For these issues, consider the following:
E.1. Every day, remove really old logs - let's say, older than a month - or move them to some "backup" table.
E.2. Index the fields that you mostly use in action logging queries (for example, an action_type field).
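A minimal sketch of option D, using a bounded in-memory queue and a single consumer thread (names are illustrative, and entries still in the queue when the JVM dies are lost):

import java.time.Instant;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ActionLogger {

    static final class Entry {
        final long userId;
        final long objectId;
        final String actionName;
        final Instant at;
        Entry(long userId, long objectId, String actionName, Instant at) {
            this.userId = userId; this.objectId = objectId;
            this.actionName = actionName; this.at = at;
        }
    }

    // Bounded so a slow writer can't exhaust memory; offer() silently drops entries when full.
    private final BlockingQueue<Entry> queue = new LinkedBlockingQueue<>(10_000);

    // Producer side: called from the business tier, returns immediately.
    public void log(long userId, long objectId, String actionName) {
        queue.offer(new Entry(userId, objectId, actionName, Instant.now()));
    }

    // Consumer side: run on a single dedicated thread at startup.
    public void drainLoop() throws InterruptedException {
        while (true) {
            Entry e = queue.take();
            insertIntoHistory(e); // plain JDBC or a NoSQL client, per options A-C above
        }
    }

    private void insertIntoHistory(Entry e) {
        // INSERT INTO history (user_id, object_id, action_name, created_at) VALUES (?, ?, ?, ?)
    }
}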
If only changes to specific fields, e.g., something like status in a users table, should be tracked, I would use a user_status_histories table being referenced from the users table via foreign key. The user_status_histories table would contain fields such as current_status, date and something like admin_who_modified_the_status.
Whenever a status change is made, a new record would be inserted into the user_status_histories table. This would allow easy querying of all status changes.
Of course, querying a user would then require a (LEFT or INNER) JOIN with the user_status_histories table in order to get the last record (= the current status).
Depending on your needs, you might think of a current_status field in the users table (besides the status serving as foreign key) for fast access, which would be maintained parallel to the user_status_histories table.
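One way to map the history table in JPA, with the history rows pointing back at the user (this assumes an existing User entity; names follow the description above):

import java.time.Instant;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.Table;

// One row per status change; the latest row per user is the current status.
@Entity
@Table(name = "user_status_histories")
public class UserStatusHistory {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "user_id", nullable = false)
    private User user;                       // assumes an existing User entity

    private String currentStatus;
    private Instant date;
    private Long adminWhoModifiedTheStatus;
    // getters/setters omitted
}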
Yes, you are. Another very similar kind of framework is one that supports undo and redo. These frameworks track user actions and have the additional ability to restore state to the way it was before the user action.