database design advice wanted: double data entry - java

I'm designing a database for capturing clinical trial data. The data are entered twice, by two people working independently, and the results must be matched. What are the best database tools to use to achieve this? Has anyone had similar experiences?
Your help is highly appreciated.
Thanks.

Are you designing a database, or the app to enter data into the database?
If you are simply looking at the database, I would capture the following information:
1) user A item X entered data
2) user A userID
3) user A item X entered date/time
4) user B item X entered data
5) user B userID
6) user B item X entered date/time
I'd then conclude that there was something called a "Datapoint" that contained the fields
-- entering userID
-- entry date
-- entry data (double value)
I'd also assign it a unique ID for the entry
--entryID (autoinc)
I would then state that there is something called a "data trial" that has two of these things called "data entries"
If I believed that the number of entries per data trial might grow to 3 verifications instead of 2, I might change my design, but initially I would give my "Data Trial" the following definition:
-- data trial name
-- data trial creation date
-- user creating data trial (userID)
-- data entry 1 (dataPointID)
-- data entry 2 (dataPointID)
-- entries verified (boolean)
and give each of these a unique ID also
-- data trial ID (autoinc)

(I can't add comments yet...) Adding to Zak's answer: if there is any doubt over how many people will enter these values (say it jumps from two to three, like Zak says), I'd break data entry 1 and 2 (both dataPointIDs) out into another table with two columns:
-- data trial ID
-- data entry ID
This way you could theoretically have as many different users as needed inserting the data, and the data trial table would then contain only metadata about the trial, not "business logic", which having exactly 2 data entries per trial essentially is.
A similar setup could be used if different trials contain different amounts of data values to be entered.
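To make this concrete, here is a minimal Java sketch of the entities described above (Zak's "Data Trial"/"Datapoint" plus the junction-table idea); the class and field names simply illustrate the listed columns and are not a definitive model:

import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class DataTrial {

    // One independent entry of a single value (the "Datapoint" above).
    public static class DataPoint {
        long entryId;               // autoinc surrogate key
        long enteringUserId;
        LocalDateTime entryDate;
        double entryData;
    }

    long dataTrialId;               // autoinc surrogate key
    String name;
    LocalDateTime creationDate;
    long createdByUserId;
    boolean entriesVerified;

    // Junction-table idea: any number of independent entries per trial,
    // instead of fixed dataEntry1/dataEntry2 columns.
    List<DataPoint> entries = new ArrayList<>();
}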

If you are looking for a good database tool, you should consider using an entity-relationship designer to model your database, such as Case Studio or Embarcadero ER/Studio.

Databases are not designed to solve this issue. Double entry is an application concern, and storing both entries violates normalization. I would implement a verification field to indicate whether the data has been verified and whether verification failed, and I would likely include an audit table containing each set of entries entered.
The application would need a lookup function to determine whether this is the first entry or a subsequent entry. There are a number of design issues related to this:
-- What to do when verification can't find a first entry.
-- How to correct data that doesn't match on verification.
-- How to handle unverified data that should have been verified.
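As a rough illustration of that lookup-then-verify flow, here is a hedged JDBC sketch; the data_entry table and its columns (item_id, entered_value, user_id, verified) are assumptions for the example, not the poster's schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DoubleEntryService {

    public void recordEntry(Connection conn, long itemId, double value, long userId)
            throws SQLException {
        // Is there already an unverified first entry for this item?
        String lookup = "SELECT entered_value FROM data_entry WHERE item_id = ? AND verified = FALSE";
        try (PreparedStatement ps = conn.prepareStatement(lookup)) {
            ps.setLong(1, itemId);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    insertFirstEntry(conn, itemId, value, userId);   // first entry: just store it
                } else if (rs.getDouble(1) == value) {
                    markVerified(conn, itemId);                      // second entry matches: verified
                } else {
                    flagMismatch(conn, itemId, value, userId);       // mismatch: route to correction
                }
            }
        }
    }

    // Persistence helpers left as stubs; they would write to data_entry and the audit table.
    private void insertFirstEntry(Connection conn, long itemId, double value, long userId) throws SQLException {}
    private void markVerified(Connection conn, long itemId) throws SQLException {}
    private void flagMismatch(Connection conn, long itemId, double value, long userId) throws SQLException {}
}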

Related

querying in oracle table too slow

The application I worked on has a user search API which does a user search based on name, address, date of birth, and pincode. Of these, only pincode is mandatory, and any 2 of the other fields (date of birth, address, name) can be sent. In one request, clients can search for multiple users, so against each search key we have to respond with either the found USER object or a NOT FOUND response.
This is how our algorithm works for each search key:
1) Fetch all the user profiles from the table with the given pincode.
2) Iterate over the users from step 1 and check whether any two of date of birth, name, and address match; such a user is a valid candidate. We do this using a predicate, as shown below:
stream(profiles.spliterator(), false).filter(new UserSearchPredicate(key))
The predicate checks whether any 2 of the values match and, if so, returns true (a sketch of such a predicate follows this list).
3) Check whether we got exactly one user from step 2; if so, DONE. Otherwise, throw an exception.
4) Return the user.
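For illustration, a minimal sketch of what such a two-of-three predicate might look like; the User and SearchKey records here are placeholders standing in for the real application classes:

import java.util.Objects;
import java.util.function.Predicate;

public class UserSearchPredicate implements Predicate<UserSearchPredicate.User> {

    // Placeholder types standing in for the real profile and search-key classes.
    public record User(String name, String address, String dateOfBirth) {}
    public record SearchKey(String name, String address, String dateOfBirth) {}

    private final SearchKey key;

    public UserSearchPredicate(SearchKey key) {
        this.key = key;
    }

    @Override
    public boolean test(User profile) {
        int matches = 0;
        if (Objects.equals(profile.name(), key.name())) matches++;
        if (Objects.equals(profile.address(), key.address())) matches++;
        if (Objects.equals(profile.dateOfBirth(), key.dateOfBirth())) matches++;
        return matches >= 2; // any two of name, address and date of birth must match
    }
}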
The above algorithm has a problem: it searches the users table by pincode.
There are 20 million records in the table, which makes the query slow. Against one pincode we may get 16,000 or more records, and iterating over them is another bottleneck. This also has to be done for all search keys, and there can be 50 or more search keys in one request, so it took almost a minute to respond.
What we have done:
The pincode column is properly indexed in the database (we are using Oracle). We also tried parallel processing, but it has a problem with DB connections in the connection pool, so we concluded it is not a solution.
Adding one more column to the query is not logically possible, because if I add name, the query becomes WHERE pincode = '' AND name = '', but the matching record might instead match on pincode and date of birth. Apart from pincode, nothing is mandatory.
Question: is there a better way to handle this problem? I am particularly looking for a better algorithm. Please help.

Using "array-contains" Query for Cloud Firestore Social Media Structure

I have a data structure that consists of a collection called "Polls". "Polls" has several documents with randomly generated IDs. Within those documents there is an additional subcollection called "answers". Users vote on these polls, with the votes all written to the "answers" subcollection. I use the .runTransaction() method on the "answers" node, since this subcollection (for any given poll) is constantly being updated and written to by users.
I have been reading about social media structure for Firestore. However, I recently came across a new feature for Firestore, the "array_contains" query option.
While the post referenced above discusses a "following" feed for social media structure, I had a different idea in mind. I envision users writing (voting) to my main poll node; therefore creating another "following" node and also having users write to that node to update poll vote counts (using a cloud function) seems horribly inefficient, since I would have to constantly copy from the main node, where votes are being counted.
Would the "array_contains" query be another practical option for social media structure scalability? My thought is:
If user A follows user B, write to a direct array child in my "Users" node called "followers."
Before any poll is created by user B, user B's device reads the "followers" array from Firestore to get a list of all following users and populates them client-side in an array object.
Then, when user B writes a new poll, add that "followers" array to the poll, so each new poll from user B will have an array attached to it that contains the IDs of all the users following.
What are the limitations on the "array_contains" query? Is it practical to have an array stored in Firebase that contains thousands of users / followers?
Would the "array_contains" query be another practical option for social media structure scalability?
Yes, of course. This is the reason the Firebase creators added this feature.
Seeing your structure, I think you can give it a try. But to respond to your questions:
What are the limitations on the "array_contains" query?
There are no limitations regarding what type of data you store.
Is it practical to have an array stored in Firebase that contains thousands of users / followers?
It's not a question of practicality; it's about other types of limitation. The problem is that documents have limits, so there are limits on how much data you can put into a single document. According to the official documentation regarding usage and limits:
Maximum size for a document: 1 MiB (1,048,576 bytes)
As you can see, you are limited to 1 MiB of data in a single document. When we are talking about storing text, that is quite a lot. So in your case, if you store only IDs, I think there will be no problem. But IMHO, as your array gets bigger, be careful about this limit.
If you are storing large amounts of data in arrays and those arrays need to be updated by lots of users, there is another limitation to take care of: you are limited to 1 write per second on every document. So if you have a situation in which a lot of users are all trying to write/update data in the same document at once, you might start to see some of these writes fail. So be careful about this limitation too.
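For reference, a minimal sketch of such a query using the Firestore Java server SDK; the "Polls" collection and the "followers" and "title" field names are taken from the question above and may differ in your app:

import com.google.cloud.firestore.Firestore;
import com.google.cloud.firestore.QueryDocumentSnapshot;
import com.google.cloud.firestore.QuerySnapshot;

public class FollowingFeed {

    private final Firestore db;

    public FollowingFeed(Firestore db) {
        this.db = db;
    }

    // Fetch every poll whose "followers" array contains the given user id.
    public void printFeed(String userId) throws Exception {
        QuerySnapshot snapshot = db.collection("Polls")
                .whereArrayContains("followers", userId)
                .get()    // returns an ApiFuture<QuerySnapshot>
                .get();   // block for the result
        for (QueryDocumentSnapshot poll : snapshot.getDocuments()) {
            System.out.println(poll.getId() + " -> " + poll.getString("title"));
        }
    }
}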
I built a real-time polls system; here is my implementation:
I made a polls collection where each document has a unique identifier, a title and an array of answers.
Also, each document has a subcollection called answers, where each answer has a title and keeps its total as distributed counters in its own shards subcollection.
Example :
polls/
  [pollID]
    - title: 'Some poll'
    - answers: ['yolo' ...]
    answers/
      [answerID]
        - title: 'yolo'
        - num_shards: 2
        shards/
          [1]
            - count: 2
          [2]
            - count: 16
I made another collection called votes, where each document's ID is a composite key of userId_pollId so I can keep track of whether the user has already voted on a poll.
Each document holds the pollId, the userId, the answerId...
When a document is created, I trigger a Cloud Function that grabs the pollId and the answerId and increments a random shard counter in that answerId's shards subcollection, using a transaction (sketched below).
Finally, on the client side, I sum the count values of each answer's shards to calculate the totals.
For the following stuff, you can do the same thing using a middle-man collection called "following", where each document's ID is a composite key of userAid_userBid, so you can easily track which user is following another user without breaking Firestore's limits.
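Here is a hedged sketch of that shard increment, written with the Firestore Java server SDK rather than a Node.js Cloud Function; the collection and field names (polls, answers, shards, count) follow the structure above, the number of shards is assumed to be known by the caller, and the shard documents are assumed to already exist:

import com.google.cloud.firestore.DocumentReference;
import com.google.cloud.firestore.DocumentSnapshot;
import com.google.cloud.firestore.Firestore;

import java.util.concurrent.ThreadLocalRandom;

public class ShardedVoteCounter {

    private final Firestore db;

    public ShardedVoteCounter(Firestore db) {
        this.db = db;
    }

    // Increment one randomly chosen shard of an answer's counter inside a transaction.
    public void incrementVote(String pollId, String answerId, int numShards) throws Exception {
        String shardId = String.valueOf(ThreadLocalRandom.current().nextInt(numShards));
        DocumentReference shardRef = db.collection("polls").document(pollId)
                .collection("answers").document(answerId)
                .collection("shards").document(shardId);

        db.runTransaction(transaction -> {
            DocumentSnapshot shard = transaction.get(shardRef).get();
            Long count = shard.getLong("count");
            long current = (count == null) ? 0L : count;
            transaction.update(shardRef, "count", current + 1);
            return null;
        }).get(); // wait for the transaction to commit
    }
}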

Is checksum a good way to see if table has been modified in MySQL?

I'm currently developing an application in Java that connects to a MySQL database using JDBC and displays records in a JTable. The application is going to be run by more than one user at a time, and I'm trying to implement a way to see if the table has been modified: e.g. user one modifies a column such as stock level, and then user two accesses the same record and tries to change it based on the level from before user one's change.
At the moment I'm storing the checksum of the table that's being displayed as a variable and when a user tries to modify a record it will do a check whether the stored checksum is the same as the one generated before the edit.
As I'm new to this, I'm not sure whether this is a correct way to do it; I have no experience in this matter.
Calculating the checksum of an entire table seems like a very heavy-handed solution and definitely something that wouldn't scale in the long term. There are multiple ways of handling this, but the core theme is to do as little work as possible so that you can scale as the number of users increases. Imagine implementing the checksum-based solution on a table with a million rows continuously updated by hundreds of users!
One of the solutions (which requires minimal rework) would be to "check" the stock name against which the value is updated. In the background, you'd fire off a query to the table to see if the data for that particular stock has been updated after the table was populated. If yes, you can warn the user or mark the updated cell as dirty to indicate that the value has changed. The problem here is that the query won't be fired off till the user tries to save the updated value. You could poll the database to avoid that, but again, that's hardly an efficient solution.
As a more robust solution, I would recommend using a database which implements native "push notifications" to all the connected clients. Redis is a NoSQL database which comes to mind for this.
Another tried and tested technique would be to forgo the direct database connection and use a middleware layer like a messaging queue (e.g. RabbitMQ). Message queues enable the design of systems which communicate using messages. So, for example, every update to the stock value in the JTable would be sent as a message to an "update database" queue. Once the update is done, a message would be sent to an "update notification" queue to which all clients are connected. This enables all of them to know that the value of a given stock has been updated and to act accordingly. The advantage of this solution is that you get to keep your existing stack (Java, MySQL) and can implement notifications without polling the DB and killing it.
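As a rough sketch of that idea, assuming RabbitMQ and its Java client; the queue name and message format here are made up for illustration:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.nio.charset.StandardCharsets;

public class StockUpdatePublisher {

    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker location

        try (Connection connection = factory.newConnection();
             Channel channel = connection.createChannel()) {
            // Durable queue that the database-writer service consumes from.
            channel.queueDeclare("update-database", true, false, false, null);

            String message = "stockId=42;newLevel=17"; // made-up message format
            channel.basicPublish("", "update-database", null,
                    message.getBytes(StandardCharsets.UTF_8));
        }
    }
}

A second, analogous publisher/consumer pair on an "update notification" queue would then tell the other clients to refresh the affected row.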
A checksum is a way to see if data has changed.
Anyway, I would suggest you store a last_update_date column; this column should be updated at every update of the record.
So you just have to store this date (datetime precision) and do the check with that.
You can also add a version number column: a simple counter incremented by 1 at each update.
Note:
You can add an ON UPDATE trigger to maintain last_update_date; that should be 100% reliable. You may not need a trigger if you control all updates.
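A minimal JDBC sketch of the version-number variant, assuming a product table with id, stock_level and version columns (your table and column names will differ):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class StockDao {

    // Returns true if the update succeeded, false if someone else changed the row first.
    public boolean updateStock(Connection conn, long id, int newStockLevel, int expectedVersion)
            throws SQLException {
        String sql = "UPDATE product SET stock_level = ?, version = version + 1 "
                   + "WHERE id = ? AND version = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, newStockLevel);
            ps.setLong(2, id);
            ps.setInt(3, expectedVersion);
            return ps.executeUpdate() == 1; // 0 rows updated means the record was modified meanwhile
        }
    }
}

If this returns false, reload the record, show the user the newer values, and let them retry.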
When used in network communication:
A checksum is a count of the number of bits in a transmission unit
that is included with the unit so that the receiver can check to see
whether the same number of bits arrived. If the counts match, it's
assumed that the complete transmission was received.
Translated to checking whether 2 objects differ, your approach is correct.

Designing a count based access control

I would like to get some advice on designing count-based access control. For example, I want to restrict the number of users that a customer can create in my system based on their account. So by default a customer can create 2 users, but if they upgrade their account they get to create 5 users, and so on.
There are a few more features that I need to restrict on a similar basis.
The application follows a generic model so every feature exposed has a backing table and we have a class which handles the CRUD operation on that table. Also the application runs on multiple nodes and has a distributed cache.
The approach that I am taking to implement this is as follows
- I have a new table which captures the functionality to control and the allowed limit (stored per customer).
- I intercept the create method for all tables and check if the table in question needs to have access control applied. If so I fetch the count of created entities and compare against the limit to decide if I should allow the creation or not.
- I am using the database to handle synchronization in case of concurrent requests. So after the create method is called, I update the table using the following where clause:
where ( count_column + 1 ) = #countInMemory#
i.e. the update will succeed only if the value stored in the DB + 1 = the value in memory. This ensures that even if two threads attempt a create at the same time, only one of them will be able to successfully update; the thread that successfully updates wins and the other one is rolled back (a sketch of this is shown after this list). This way I do not need to synchronize any code in the application.
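Here is a hedged JDBC sketch of that conditional update; the access_control table and column names are placeholders for whatever the real limit table looks like:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class CountGuard {

    // countInMemory is the expected new count (the count fetched earlier + 1).
    // Returns true if this request won the race; false means roll back the create.
    public boolean tryIncrement(Connection conn, long customerId, int countInMemory)
            throws SQLException {
        String sql = "UPDATE access_control SET count_column = ? "
                   + "WHERE customer_id = ? AND ( count_column + 1 ) = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, countInMemory);
            ps.setLong(2, customerId);
            ps.setInt(3, countInMemory);
            return ps.executeUpdate() == 1; // 0 rows: another thread updated the count first
        }
    }
}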
I would like to know if there is any other / better way of doing this. My application runs on Oracle and MySQL DB.
Thanks for the help.
When you roll back, do you retry (after fetching the new user count) or do you fail? I recommend the former, assuming that the newly fetched user count would permit another user.
I've dealt with a similar system recently, and here are a few things to consider:
- Do you want CustomerA to be able to transfer their users to CustomerB? (This assumes that customers are not independent; for example, in our system CustomerA might be an IT manager and CustomerB might be an accounting manager working for the same company, and when one of CustomerA's employees moves to accounting, he wants this to be reflected in CustomerB's account.)
- What happens to a customer's users when the customer is deleted? (In our case another customer/manager would need to adopt them, or else they would be deleted.)
- How are you storing the customer's user limit: in a separate table (e.g. a customer has type "Level2", and the customer-type table says that "Level2" customers can create 5 users), in the customer's row (which is more error-prone, but would also allow a per-customer override on their max user count), or a combination (a customer has a type column that says they can have 5 users, and an override column that says they can have an additional 3 users)?
But that's beside the point. Your DB synchronization is fine.

How can I analyse ~13GB of data?

I have ~300 text files that contain data on trackers, torrents and peers. Each file is organised like this:
tracker.txt
time torrent
time peer
time peer
...
time torrent
...
I have several files per tracker and much of the information is repeated (same information, different time).
I'd like to be able to analyse what I have and report statistics on things like
How many torrents are at each tracker
How many trackers are torrents listed on
How many peers do torrents have
How many torrents do peers have
The sheer quantity of data is making this hard for me to do. Here's what I've tried.
MySQL
I put everything into a database; one table per entity type and tables to hold the relationships (e.g. this torrent is on this tracker).
Adding the information to the database was slow (and I didn't have 13GB of it when I tried this), but analysing the relationships afterwards was a no-go. Every mildly complex query took over 24 hours to complete (if it completed at all).
An example query would be:
SELECT COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
I tried bumping up the memory allocations in my my.cnf file but it didn't seem to help. I used the my-innodb-heavy-4G.cnf settings file.
EDIT: Adding table details
Here's what I was using:
Peer           Torrent                    Tracker
-----------    -----------------------    ------------------
id (bigint)    id (bigint)                id (bigint)
ip* (int)      infohash* (varchar(40))    url (varchar(255))
port (int)

TorrentAtPeer        TorrentAtTracker
-----------------    ------------------
id (bigint)          id (bigint)
torrent* (bigint)    torrent* (bigint)
peer* (bigint)       tracker* (bigint)
time (int)           time (int)
*indexed field. Navicat reports them as being of normal type and Btree method.
id - Always the primary key
There are no foreign keys. I was confident in my ability to only use IDs that corresponded to existing entities; adding a foreign key check seemed like a needless delay. Is this naive?
Matlab
This seemed like an application that was designed for some heavy lifting, but I wasn't able to allocate enough memory to hold all of the data in one go.
I didn't have numerical data, so I was using cell arrays; I moved from these to tries in an effort to reduce the footprint. I couldn't get it to work.
Java
My most successful attempt so far. I found an implementation of Patricia Tries provided by the people at Limewire. Using this I was able to read in the data and count how many unique entities I had:
13 trackers
1.7mil torrents
32mil peers
I'm still finding it too hard to work out the frequencies of the number of torrents at peers. I'm attempting to do so by building tries like this:
Trie<String, Trie<String, Object>> peers = new Trie<String, Trie<String, Object>>(...);
String infohash = null;

for (String line : file) {
    if (containsTorrent(line)) {
        infohash = getInfohash(line);
    } else if (containsPeer(line)) {
        // Look up (or create) this peer's trie of torrents, then record the current torrent in it.
        Trie<String, Object> torrents = peers.get(getPeer(line));
        if (torrents == null) {
            torrents = new Trie<String, Object>(...);
            peers.put(getPeer(line), torrents);
        }
        torrents.put(infohash, null);
    }
}
From what I've been able to do so far, if I can get this peers trie built then I can easily find out how many torrents are at each peer. I ran it all yesterday, and when I came back I noticed that the log file wasn't being written to. I ^Z'd the application, and time reported the following:
real 565m41.479s
user 0m0.001s
sys 0m0.019s
This doesn't look right to me: should user and sys be so low? I should mention that I've also increased the JVM's heap size to 7GB (max and start); without that I rather quickly get an out-of-memory error.
I don't mind waiting for several hours/days but it looks like the thing grinds to a halt after about 10 hours.
I guess my question is: how can I go about analysing this data? Are the things I've tried the right things? Are there things I'm missing? The Java solution seems to be the best so far; is there anything I can do to get it to work?
You state that your MySQL queries took too long. Have you ensured that proper indices are in place to support the kind of requests you submitted? In your example, that would be an index for Peer.ip (or even a composite index on (Peer.ip, Peer.id)) and an index for TorrentAtPeer.peer.
As I understand your Java results, you have a lot of data but not that many different strings. So you could perhaps save some time by assigning a unique number to each tracker, torrent and peer, using one table for each, with an indexed column holding the string and a numeric primary key as the id. That way, all tables relating these entities would only have to deal with those numbers, which could save a lot of space and make your operations a lot faster.
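A minimal sketch of that idea on the Java side, assigning compact integer ids to distinct strings before loading them into the database (the class and method names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class IdDictionary {

    private final Map<String, Integer> ids = new HashMap<>();

    // Returns the existing id for a value (tracker URL, infohash, "ip:port"),
    // or assigns the next free id if the value has not been seen before.
    public int idFor(String value) {
        Integer id = ids.get(value);
        if (id == null) {
            id = ids.size();
            ids.put(value, id);
        }
        return id;
    }
}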
I would give MySQL another try but with a different schema:
do not use id-columns here
use natural primary keys here:
Peer: ip, port
Torrent: infohash
Tracker: url
TorrentPeer: peer_ip, torrent_infohash, peer_port, time
TorrentTracker: tracker_url, torrent_infohash, time
use the InnoDB engine for all tables
This has several advantages:
InnoDB uses a clustered index for the primary key. This means that all data can be retrieved directly from the index, without an additional lookup, when you only request data from primary key columns. So InnoDB tables are somewhat like index-organized tables.
Smaller size, since you do not have to store the surrogate keys -> speed, because there is less IO for the same results.
You may be able to do some queries now without using (expensive) joins, because you use natural primary and foreign keys. For example, the linking table TorrentAtPeer directly contains the peer IP as a foreign key to the peer table. If you need to query the torrents used by peers in a subnetwork, you can now do this without a join, because all the relevant data is in the linking table.
If you want the torrent count per peer and you want the peer's IP in the results too, then we again have an advantage from using natural primary/foreign keys here.
With your schema you have to join to retrieve the ip:
SELECT Peer.ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer, Peer
WHERE TorrentAtPeer.peer = Peer.id
GROUP BY Peer.ip;
With natural primary/foreign keys:
SELECT peer_ip, COUNT(DISTINCT torrent)
FROM TorrentAtPeer
GROUP BY peer_ip;
EDIT
Well, the originally posted schema was not the real one; the Peer table also has a port field. I would suggest using the primary key (ip, port) here and still dropping the id column. This also means that the linking table needs multicolumn foreign keys. I have adjusted the answer accordingly...
If you could use C++, you should take a look at Boost flyweight.
Using flyweight, you can write your code as if you had strings, but each instance of a string (your tracker name, etc.) uses only the size of a pointer.
Regardless of the language, you should convert the IP address to an int (take a look at this question) to save some more memory.
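For example, a small Java sketch of packing a dotted IPv4 address into a single int:

public class IpUtil {

    // Packs "a.b.c.d" into one 32-bit int (e.g. "10.0.0.1" -> 0x0A000001).
    public static int ipToInt(String ip) {
        String[] parts = ip.split("\\.");
        int result = 0;
        for (String part : parts) {
            result = (result << 8) | (Integer.parseInt(part) & 0xFF);
        }
        return result;
    }
}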
You most likely have a problem that can be solved by NOSQL and distributed technologies.
i) I would write a distributed system using Hadoop/HBase.
ii) Rent several tens / hundreds of AWS machines, but only for a few seconds (it'll still cost you less than $0.50).
iii) Profit!!!
