For a customer we were developing a big application that was open to all users, meaning that all users could see each other's data.
Now suddenly the customer is saying that they want only users belonging to the same organization to be able to view each other's data.
So we came up with this data model:
So now the question is: How is it best to separate the data?
This is the only alternative I see:
SQL JOIN on ALL relevant tables (all tables that hold data should now always join on Organization)
-- All queries should now add an extra join to Organization, and if the join doesn't exist, we need to create a new foreign key.
But I feel an extra join (we have around 20 tables that would need one) is quite costly.
I hope there are some other best practices or solutions we can consider.
PS: This is a Web application developed using Java/JSF/Seam (but I don't know if that is relevant)
UPDATE
I want to clarify something. My concern is not security but performance. We have added the foreign key to Organization to all relevant tables that have shared data, and we are using the logged-in user's organization to filter the data.
All I want to know is whether this is a good architectural solution (inner join) or whether we should do something else (e.g. load all shared data and filter in memory instead of using an SQL join).
You really have to understand the difference between the persistence layer and the application layer.
It doesn't matter how you define your database tables, as anyone with database access will have access to all the users' data. What does matter is how you define the behavior in your application.
Changing the database design should only be done for performance reasons, not for security - which should be handled in the application.
I would reckon that the best pattern is to expose the user details only through the web application, so at that point it's a case of restricting the data exposed to each user. This will allow you to build the required security into the application.
Alternatively if you are allowing direct database access then you will need to create a login/user (depends on database used) for each organization or user and then restrict the access of these login/user entities to parameterized stored procedures rather than the base tables. This will push security back onto the database, which is riskier but still do-able.
As to meta changes to support the organization column, parameterizing the stored procedures will be fairly trivial:
select #organizationId = organizationId from User where User.id = #currentUserId
select * from User where organizationId = #organizationId
(depending on the SQL flavour you will need to quote some identifiers, e.g. `User`, [User] etc.)
I see no reason that Organization has to be 'joined' at all.
If your 'data' tables all have OrganizationID columns, then you can lookup the 'organizationID' from the user and then add this as a condition to the join.
EX:
select #OrganizationId = organizationId from User where User.id = #currentUserId
select * from datatable a .... where .... AND a.organizationID = #organizationID
See; no join.
With respect to performance, there are different types of joins, and SQLServer allows you to hint at the type of join. So in some cases, a merge join is the best, whereas in something like this scenario, a loop join would be the best. Not sure if these choices are available in MySQL.
With respect to all of your tables needing a join, or condition (see above), there is a logical answer, and an implementation answer. The implementation answer depends on your indexing. If you can limit the dataset the most by adding that condition, then you will benefit. But if the join with the other table that has already been filtered does a better job at reducing rows, then the condition will be worthless (or worst case, it will use the wrong index). All of this assumes you have indexes on your join and condition columns.
Logically, only data that isn't fully dependent on a table that is filtered by organizationID needs that extra condition. If you have a car table, and carparts table, then you only have to filter the car table. Unless for some reason you don't need to join with the car table for some joins, in which case you will need that organizationID on the parts table too.
Related
I'm looking to have a GUI where when I click an Invoice it displays the information from both Customer and Product also, such as name, brand etc all in one row.
Do I have to put Name, brand, etc into Invoice too and inner join everything?
Invoice Table Customer Table Product Table
EDIT:
No, there is no need to modify the tables you're referring to. Each of them has a unique primary key column that is referenced from the invoice table. The INNER JOIN can be formulated based on those keys.
Maybe also worth mentioning: Don't confuse the INNER JOIN with the SELF JOIN which also exists.
The difference is that the INNER JOIN joins two different tables based on specific columns (e.g. id), whereas the SELF JOIN joins a single table with itself.
Yes, what you need is an INNER JOIN combining the information from your invoice table with that from the customer table and the product table, all based on your given invoice id (column: idInvoice).
To obtain the needed information you don't need to add - and therefore repeat - it in the invoice table. Due to the join they'll be available for selection in one single query.
Your query should look like:
SELECT *
FROM invoice inv, customer cust, product prod
WHERE
inv.idCustomer = cust.idCostumer
AND
inv.idProduct = prod.idProduct
AND
inv.idInvoice = ${theIdOfTheInvoiceTheUserClickedOn}
Note: If you don't need all the information (columns) from the three tables (what the "*" stands for) you can replace the "*" with an enumeration explicitly stating only the columns you want to show. E.g. inv.id, cust.FirstName, cust.LastName.
The details depend on the database technology and dialect you're using. The example above would be suitable for an Oracle database and should also suit most other databases, since the query uses only basic SQL features.
I'm assuming you're not using an ORM framework like Hibernate and that you'll need to write the query yourself (since you didn't provide any more detail on your technology stack). If you are using an ORM framework, the approach would look different, but the final generated query should look similar.
In the query above, the first two conditions in the WHERE clause implicitly form the INNER JOIN, whereas the third one specifies which exact entry you're looking for.
Although you only asked whether an INNER JOIN is needed, I've provided the query here since your question implied you're not sure how to write one.
You might take it as a working example to compare your solution with. You should try to understand how it works and how it can be written, and also read up on the SQL basics so that you can write it on your own as well.
Tip: PreparedStatements are the way to go for executing such queries against a database from Java in a safe way.
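As a hedged illustration of that tip, here is a minimal JDBC sketch. It assumes the table and column names from the query above (including the spelling idCostumer, which may or may not match your schema) and rewrites the implicit join with explicit JOIN ... ON syntax. The class and method names are made up for illustration, and obtaining the Connection is left to your application.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

class InvoiceQuerySketch {
    // Same three-table join as the query above. The invoice id is a bind
    // parameter (?) instead of being concatenated into the SQL text.
    // Column names are copied from the query above and are assumptions here.
    static final String INVOICE_SQL =
          "SELECT * "
        + "FROM invoice inv "
        + "JOIN customer cust ON inv.idCustomer = cust.idCostumer "
        + "JOIN product prod ON inv.idProduct = prod.idProduct "
        + "WHERE inv.idInvoice = ?";

    // Prepares the lookup and binds the id of the invoice the user clicked on.
    // The caller is responsible for executing and closing the statement.
    static PreparedStatement prepareInvoiceLookup(Connection conn, long invoiceId)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(INVOICE_SQL);
        ps.setLong(1, invoiceId);
        return ps;
    }
}
```

Calling executeQuery() on the returned statement would yield one row per matching invoice, with the customer and product columns alongside the invoice columns.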
In my opinion, depending on your application, you can use a flat table that includes everything you need, so that no joins are required. This solution is applicable when you have small datasets (e.g. in banking, the relationships between a Terminal table and the ATMTerminal, POSTerminal and CashlessTerminal tables).
For situations where one side of the relationship is static data (like the above example) and the other side is transactional data (like banking transactions), you should use a foreign key from the transaction table to the static data table.
I recommend the second solution for your problem: handle the relationships with foreign keys and fetch the columns you need with a join.
I want to fetch data from two different entities in JPA. I am using Google DataStore with App Engine to store my data in the cloud. What I want is to fetch data from two different entities by means of a join query. As I am new to App Engine and DataStore, I don't know how to do that. I referred to this link, and it says that DataStore doesn't support joins properly. Is that true? Please guide me to solve this problem. Thank you.
There are ample places where it is stated clearly that GAE/Datastore does not do "join queries". Such as https://developers.google.com/appengine/docs/java/datastore/jdo/overview-dn2
If instead you are using google-cloud-sql (why did you tag this question as SQL?) then I suggest you update your question to state that.
How to join records when your data store does not: write a join in the client application code. Warning - depending on the data, doing this might cost a lot of overhead. This is a straw man answer designed to justify the real answer in the final paragraph.
Conceptually, your application could implement a nested loop join as follows. Choose the entity whose expected record count is lowest for the outer loop. Create a query to iterate over those records. Within the iterator loop for each record, copy the fields used for joining into variables, then create an inner nested query that takes these variables as parameters. Iterate over the records produced by the inner query, and for each inner record, produce a record of output using data from both the inner and the outer current entities.
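The nested loop described above can be sketched in plain Java. This is only an illustration under assumptions: the Customer and Order entities, their fields, and the ordersFor helper are hypothetical, and in-memory lists stand in for the two Datastore queries. In a real application each call to the inner query would be a separate round trip to the data store, which is where the overhead comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class NestedLoopJoinSketch {
    // Hypothetical outer entity (the one with the lower expected record count).
    static final class Customer {
        final long id;
        final String name;
        Customer(long id, String name) { this.id = id; this.name = name; }
    }

    // Hypothetical inner entity, related to Customer through customerId.
    static final class Order {
        final long customerId;
        final String item;
        Order(long customerId, String item) { this.customerId = customerId; this.item = item; }
    }

    // Stand-in for the parameterized inner query: all orders for one customer id.
    static List<Order> ordersFor(List<Order> all, long customerId) {
        List<Order> matches = new ArrayList<>();
        for (Order o : all) {
            if (o.customerId == customerId) matches.add(o);
        }
        return matches;
    }

    // Outer loop over customers; for each one, run the inner query with the
    // join key and emit one output record per (customer, order) pair.
    static List<String> join(List<Customer> outer, Function<Long, List<Order>> innerQuery) {
        List<String> output = new ArrayList<>();
        for (Customer c : outer) {
            long joinKey = c.id;
            for (Order o : innerQuery.apply(joinKey)) {
                output.add(c.name + " | " + o.item);
            }
        }
        return output;
    }
}
```

A call such as `join(customers, id -> ordersFor(orders, id))` issues one inner query per outer record, so the cost grows with the size of the outer result.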
Because an external nested loop join is such a bad idea, you should really consider redesigning your current schema to produce the results you are after without requiring a join at all. Start by just imagining the output that you want coming directly out of entities of just one Kind. That usually means letting go of relational normal forms. After you have designed appropriate NoSQL structures that can deliver the required outputs, you should then design appropriate NoSQL algorithms to write the data that way.
I'm not sure if something special exists for this use case - but it felt like a case where someone was likely to have made some sort of useful structure/technique/design-pattern.
My Situation
I have a set of SQL commands executed from middle tier (Java) to insert/update/delete data to any of a set of very large tables via joins from a related staging table.
I have more SQL commands which update various derived tables based on the staging table/actual table contents. Different tables will interact with different derived tables via different queries (as usual). These commands may have to be interleaved with the first set depending on the use case - so, I can't necessarily execute set 1 then set 2 all at once.
My Question
So, I need to build a chain of commands that get executed sequentially, and I need to trigger a rollback if any of them fail. I'd like to do this in the most clear, documented way possible.
Does anyone know a standard way of coding this? I'm sure anyone migrating from stored procedure code to middle tier code has done this before and I don't want to reinvent the wheel if there are good options out there.
Additional Information
One of my main concerns is making everything clear. To elaborate, I'll have a set of queries specifically designed to:
Truncate staging table A' and populate it with primary keys targeting deletion records
Delete from actual table A based on join with A'
Truncate staging table A' and populate it with full data for upserts
Update/Insert records from A' to A based on joins
The same logic will apply to tables B, C, D, etc. Unfortunately, it can be the case where just A and C need an extra step, like syncing deletes to a certain derived table, to be done after the deletions but before the upserts.
I'd obviously like to group all the logic for updating a table, and I'd like to group all the logic for updating a derived table as well, but at execution time they have to be intelligently interleaved and this sounds messy to me.
Don't write such a thing yourself. This is what JTA was born for.
You can use either JPA or Spring to do it.
Annotate the unit of work as transactional and let the database and JDBC handle it.
If you must do it yourself, follow the aspect-oriented approach and make it a decorative "before & after" implementation.
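If you do end up coordinating it by hand over plain JDBC, one possible shape is sketched below. It is a minimal sketch, not a replacement for JTA or Spring: the Step interface and the runAll method are hypothetical names, and only standard java.sql.Connection calls (getAutoCommit, setAutoCommit, commit, rollback) are used. One caveat to check for your database: on some systems (Oracle, for example) TRUNCATE is DDL and commits implicitly, so a truncate step would not be undone by the rollback.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;

class TransactionalChainSketch {
    // Hypothetical unit of work: one SQL command (or a small group of them)
    // executed on the shared connection.
    interface Step {
        void run(Connection conn) throws SQLException;
    }

    // Runs the steps in order inside one transaction. If any step fails,
    // everything done so far on this connection is rolled back and the
    // original exception is rethrown.
    static void runAll(Connection conn, List<Step> steps) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);
        try {
            for (Step step : steps) {
                step.run(conn);
            }
            conn.commit();
        } catch (SQLException | RuntimeException e) {
            conn.rollback();
            throw e;
        } finally {
            // Leave the connection the way we found it.
            conn.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Because the chain is just an ordered list, the logic for each table can live in its own class that returns its steps, and the caller decides how to interleave them before passing the combined list to runAll.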
I am busy practicing designing a simple todo list webapp in which a user can authenticate into the app and save todo list items. The user is also only able to view/edit the todo list items that they added.
This seems to be a general feature (authenticated user only views their own data) in most web applications (or applications in general).
To me what is important is having knowledge of the different options for accomplishing this. What I would like to achieve is a solution that can handle lots of users' data effectively. At the moment I am doing this using a Relational Database, but noSQL answers would be useful to me as well.
The following ideas came to mind:
Add a user_id column each time this "feature" is needed.
Add an association table (in the example above a user_todo_list_item table) that associates the data.
Design in such a way that you have a table per user per "feature" ... so you would have a todolist_userABC table. It's an option but I do not like it much since a thousand users means a thousand tables?!
Add row level security to the specific "feature". I am not familiar with how this works but it seems to be a valid option. I am also not sure whether this is database vendor specific.
Of my choices I went with the user_id column on the todolist_item table. Although it can do the job, I feel that a user_id column might be problematic when reading data if the data within the table gets large enough. One could add an index I guess but I am not sure of the index's effectiveness.
What I don't like about it is that I need to have a user_id for every table where I desire this type of feature which doesn't seem correct to me? It also seems that when I implement the database layer I would have to add this to my queries for every feature (unless I use some AOP)?
I had a look around (How does Trello store data in MongoDB? (Collection per board?)), but it does not speak about the techniques regarding user_id columns or things like that. I also tried reading about this in some security frameworks (Spring Security to be specific) but it seems that it only goes into privileges/permissions on a table level and not a row level?
So the question is whether my choice was appropriate and if there are better techniques to do this?
Your choice is the natural thing to do.
The table-per-user is a non-starter (anything that modifies the database structure in response to user action is usually suspect).
Row-level security isn't really an option for webapps - it requires each user session to have a separate, persistent connection to the database, which is rarely practical. And yes, it is vendor-specific.
How you index your tables depends entirely on your usage patterns and the types of queries you want to run. Is 'show all TODOs for a user' a query you want to support (seems like it would be)? Then an index on the user id is obviously needed.
Why does having a user_id column seem wrong to you? If you want to restrict access by user, you need to be able to identify which user the record belongs to. Doesn't actually mean that every table needs it - for example, if one record composes another (say, your TODOs have 'steps', each step belongs to a single TODO), only the root of the object graph needs the user id.
How do I build an Oracle SQL query dynamically from a Java application? The user will be presented with a bunch of columns that are present in different tables in the database. The user can select any set of columns, and the application should build the complete select query using only the tables that contain the selected columns.
For example, let's consider that there are 3 tables in the database. The user selects col11, col22. In this case, the application should build the query using Tabl1 and Tabl2 only.
How do I achieve this?
Tabl1
- col11
- col12
- col13
Tabl2
- fkTbl1
- col21
- col22
- col23
Tabl3
- col31
- col32
- col33
- fkTbl1
Ad hoc reporting is an old favourite. It frequently appears as a one-liner at the end of the Reports Requirements section: "Users must be able to define and run their own reports". The only snag is that ad hoc reporting is an application in its own right.
You say
"The user will be presented with a bunch of columns that are present in different tables in the database."
You can avoid some of the complexities I discuss below if the "bunch of columns" (and the spread of tables) is preselected and tightly controlled. Alas, it is in the nature of ad hoc reporting that users will want pretty much all columns from all tables.
Let's start with your example. The user has selected col11 and col22, so you need to generate this query:
SELECT tabl1.col11
, tabl2.col22
FROM tabl1 JOIN tabl2
ON (TABL1.ID = TABL2.FKTABL1)
/
That's not too difficult. You just need to navigate the data dictionary views USER_CONSTRAINTS and USER_CONS_COLUMNS to establish the columns in the join condition - providing you have defined foreign keys (please have foreign keys!).
Things become more complicated if we add a fourth table:
Tabl4
- col41
- col42
- col43
- fkTbl2
Now when the user chooses col11 and col42 you need to navigate the data dictionary to establish that Tabl2 acts as an intermediary table to join Tabl4 and Tabl1 (presuming you are not using composite primary keys, as most people don't). But suppose the user selects col31 and col41. Is that a legitimate combination? Let's say it is. Now you have to join Tabl4 to Tabl2 to Tabl1 to Tabl3. Hmmm...
And what if the user selects columns from two completely unrelated tables - Tabl1 and Tabl23? Do you blindly generate a CROSS JOIN or do you hurl an exception? The choice is yours.
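To make the join-path problem concrete, here is a rough Java sketch. It rests on several assumptions: the foreign keys are supplied as a hard-coded list (in practice you would read them from USER_CONSTRAINTS and USER_CONS_COLUMNS), every parent table has a single-column primary key named ID, selected columns arrive as validated "table.column" strings, and unrelated tables raise an exception rather than producing a CROSS JOIN. The class and method names are made up for illustration.

```java
import java.util.*;

class AdHocQueryBuilderSketch {
    // One foreign key: child.childColumn references parent.ID (ID is an assumption).
    static final class ForeignKey {
        final String child, childColumn, parent;
        ForeignKey(String child, String childColumn, String parent) {
            this.child = child; this.childColumn = childColumn; this.parent = parent;
        }
        String condition() { return child + "." + childColumn + " = " + parent + ".ID"; }
    }

    // Builds SELECT ... FROM ... JOIN ... for columns given as "table.column".
    // Column and table names must already be validated against known metadata.
    static String buildSelect(List<String> columns, List<ForeignKey> fks) {
        if (columns.isEmpty()) throw new IllegalArgumentException("No columns selected");
        Set<String> needed = new LinkedHashSet<>();
        for (String col : columns) needed.add(col.substring(0, col.indexOf('.')));
        Iterator<String> it = needed.iterator();
        String root = it.next();
        Set<String> joined = new LinkedHashSet<>(List.of(root));
        StringBuilder sql = new StringBuilder(
            "SELECT " + String.join(", ", columns) + " FROM " + root);
        while (it.hasNext()) {
            String target = it.next();
            if (joined.contains(target)) continue; // already pulled in as an intermediary
            List<ForeignKey> path = findPath(joined, target, fks);
            if (path == null) {
                throw new IllegalArgumentException("No foreign-key path to " + target);
            }
            for (ForeignKey fk : path) {
                String newTable = joined.contains(fk.child) ? fk.parent : fk.child;
                sql.append(" JOIN ").append(newTable).append(" ON ").append(fk.condition());
                joined.add(newTable);
            }
        }
        return sql.toString();
    }

    // Breadth-first search over foreign keys (followed in either direction) from
    // the already-joined tables to the target. Returns the keys in join order,
    // or null if the target is unreachable.
    static List<ForeignKey> findPath(Set<String> from, String target, List<ForeignKey> fks) {
        Map<String, ForeignKey> cameVia = new HashMap<>();
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>(from);
        Set<String> seen = new HashSet<>(from);
        while (!queue.isEmpty()) {
            String table = queue.poll();
            if (table.equals(target)) {
                LinkedList<ForeignKey> path = new LinkedList<>();
                for (String t = target; cameVia.containsKey(t); t = cameFrom.get(t)) {
                    path.addFirst(cameVia.get(t));
                }
                return path;
            }
            for (ForeignKey fk : fks) {
                String next = fk.child.equals(table) ? fk.parent
                            : fk.parent.equals(table) ? fk.child : null;
                if (next != null && seen.add(next)) {
                    cameVia.put(next, fk);
                    cameFrom.put(next, table);
                    queue.add(next);
                }
            }
        }
        return null;
    }
}
```

With the four example tables, selecting tabl1.col11 and tabl4.col42 pulls in tabl2 as the intermediary, and selecting tabl3.col31 and tabl4.col41 joins through both tabl1 and tabl2. Whether the shortest foreign-key path is always the join the user actually meant is a separate question that the sketch does not answer.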
Going back to that first query, it will return all the rows in both tables. Almost certainly your users will want the option to restrict the result set. So you need to offer them the ability to add filters to the WHERE clause. Gotchas here include:
ensuring that supplied values are of an appropriate data-type (no strings for a number, no numbers for a date)
providing look-ups to reference data values
handling multiple values (IN list rather than equals)
ensuring date ranges are sensible (opening bound before closing bound)
handling free text searches (are you going to allow it? do you need to use TEXT indexes or will you run the risk of users executing LIKE '%whatever%' against some CLOB column?)
The last point highlights one risk inherent in ad hoc reporting: if the users can assemble a query from any tables with any filters they can assemble a query which can drain all the resources from your system. So it is a good idea to apply profiles to prevent that happening. Also, as I have already mentioned, it is possible for the users to build nonsensical queries. Bear in mind that you don't need very many tables in your schema to generate too many permutations to test.
Finally there is the tricky proposition of security policies. If users are restricted to seeing subsets of data on the basis of their department or their job role, then you will need to replicate those rules. In such cases the automatic application of policies through Row Level Security is a real boon.
All of which might lead you to conclude that the best solution would be to persuade your users to acquire an off-the-shelf product instead. Although that approach isn't without its own problems.
The way that I've done this kind of thing in the past is to simply construct the SQL query on the fly using a StringBuilder and then execute it using a non-prepared JDBC statement. This is rather inefficient, since the Oracle DB has to repeat all of the query analysis and optimization work for each query.
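One way to reduce that repeated work, offered here as a hedged alternative rather than what was done above, is to keep building the SQL text dynamically but emit ? placeholders for the filter values and bind them through a PreparedStatement. Queries with the same shape then share the same SQL text regardless of the values. The FilterClauseSketch class and its methods are hypothetical; only prepareStatement and setObject are real JDBC calls.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FilterClauseSketch {
    final StringBuilder where = new StringBuilder();
    final List<Object> binds = new ArrayList<>();

    // Adds "column = ?". The column name must come from trusted metadata,
    // never from raw user input; only the value is bound.
    FilterClauseSketch equalTo(String column, Object value) {
        append(column + " = ?");
        binds.add(value);
        return this;
    }

    // Adds "column IN (?, ?, ...)" with one placeholder per value.
    FilterClauseSketch in(String column, List<?> values) {
        if (values.isEmpty()) throw new IllegalArgumentException("IN list must not be empty");
        append(column + " IN (" + String.join(", ", Collections.nCopies(values.size(), "?")) + ")");
        binds.addAll(values);
        return this;
    }

    private void append(String predicate) {
        where.append(where.length() == 0 ? " WHERE " : " AND ").append(predicate);
    }

    // Prepares baseSelect plus the accumulated WHERE clause and binds the values
    // in order. The caller executes and closes the returned statement.
    PreparedStatement prepare(Connection conn, String baseSelect) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(baseSelect + where);
        try {
            for (int i = 0; i < binds.size(); i++) {
                ps.setObject(i + 1, binds.get(i));
            }
        } catch (SQLException e) {
            ps.close();
            throw e;
        }
        return ps;
    }
}
```

For example, `new FilterClauseSketch().equalTo("tabl1.col11", 5).in("tabl2.col22", List.of("a", "b"))` accumulates a WHERE clause with three placeholders and three bind values. Binding the values also keeps user-supplied filter values out of the SQL text.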