I want to fetch data from two different entities in JPA. I am using Google Datastore with App Engine to store my data in cloud storage. What I want is to fetch data from two different entities using a join query. As I am new to App Engine and the Datastore, I don't know how to do that. I referred to this link and it says that the Datastore doesn't support joins properly. Is that true? Please guide me to solve this problem. Thank you.
There are ample places where it is stated clearly that GAE/Datastore does not do "join queries", such as https://developers.google.com/appengine/docs/java/datastore/jdo/overview-dn2
If instead you are using google-cloud-sql (why did you tag this question as SQL?) then I suggest you update your question to state that.
How to join records when your data store does not: write a join in the client application code. Warning - depending on the data, doing this might cost a lot of overhead. This is a straw man answer designed to justify the real answer in the final paragraph.
Conceptually, your application could implement a nested loop join as follows. Choose the entity whose expected record count is lowest for the outer loop. Create a query to iterate over those records. Within the iterator loop for each record, copy the fields used for joining into variables, then create an inner nested query that takes these variables as parameters. Iterate over the records produced by the inner query, and for each inner record, produce a record of output using data from both the inner and the outer current entities.
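A minimal sketch of that idea with the low-level App Engine Datastore API might look like the following; the Kind names Customer and Order and the property customerId are hypothetical placeholders, not taken from your model.

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.Entity;
import com.google.appengine.api.datastore.Query;

public class ClientSideJoin {
    // Emulates a join by nesting one Datastore query inside the iteration of another.
    public static void joinCustomersWithOrders() {
        DatastoreService ds = DatastoreServiceFactory.getDatastoreService();
        // Outer loop: the Kind expected to contain fewer entities (here assumed to be Customer).
        for (Entity customer : ds.prepare(new Query("Customer")).asIterable()) {
            Object joinKey = customer.getProperty("customerId");
            // Inner query: filter the other Kind by the join value copied from the outer record.
            Query orders = new Query("Order").setFilter(
                    new Query.FilterPredicate("customerId", Query.FilterOperator.EQUAL, joinKey));
            for (Entity order : ds.prepare(orders).asIterable()) {
                // Combine properties of 'customer' and 'order' into one output row here.
            }
        }
    }
}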
Because an external nested loop join is such a bad idea, you should really consider redesigning your current schema to produce the results you are after without requiring a join at all. Start by just imagining the output that you want coming directly out of entities of just one Kind. That usually means letting go of relational normal forms. After you have designed appropriate NoSQL structures that can deliver the required outputs, you should then design appropriate NoSQL algorithms to write the data that way.
Related
I'm looking to have a GUI where, when I click an Invoice, it also displays the information from both Customer and Product, such as name, brand, etc., all in one row.
Do I have to put Name, brand, etc. into Invoice too and inner join everything?
[Schemas: Invoice table, Customer table, Product table]
EDIT:
No, there is no need to modify the tables you're referring to. They all contain a unique primary key column which is referenced from the invoice table. Based on these keys the INNER JOIN can be formulated.
Maybe also worth mentioning: Don't confuse the INNER JOIN with the SELF JOIN which also exists.
The difference is that the INNER JOIN is still joining two different tables based on specific columns (e.g. id) whereby the SELF JOIN is joining a single table with itself.
Yes, what you'll need is an INNER JOIN combining the information from your invoice table with the customer table as well as the product table - all based on your given invoice id (column: idInvoice).
To obtain the needed information you don't need to add - and therefore repeat - it in the invoice table. Thanks to the join it will be available for selection in one single query.
Your query should look like:
SELECT *
FROM invoice inv, customer cust, product prod
WHERE
inv.idCustomer = cust.idCustomer
AND
inv.idProduct = prod.idProduct
AND
inv.idInvoice = ${theIdOfTheInvoiceTheUserClickedOn}
Note: If you don't need all the information (columns) from the three tables (that's what the "*" stands for), you can replace the "*" with an enumeration explicitly stating only the columns you want to show, e.g. inv.idInvoice, cust.FirstName, cust.LastName.
The exact syntax depends on the database technology/dialect you're using. The example above would be suitable for an Oracle database and should also suit most other databases, since only basic SQL features are used in the query.
I'm assuming you're not using any ORM framework like Hibernate and that you'll need to write the query yourself (since you didn't provide any more detail on your technology stack). In case you're using an ORM framework the approach would need to look different, but the final generated query should look similar.
In the query above the first two conditions in the WHERE clause form the INNER JOIN implicitly, while the third one specifies which exact entry you're looking for.
Although you've asked only whether an INNER JOIN is needed, I've provided the query here since your question implied you're not sure how to write one.
You might take it as a working example to compare your solution with. You should try to understand how it works and how it can be written, and also research the SQL basics further so that you can write it on your own as well.
Tip: PreparedStatements are the way to go to execute such queries against a database from Java in a safe way.
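For instance, a minimal JDBC sketch of the query shown above (assuming an open java.sql.Connection named conn; the selected column names such as prod.Name are only illustrative):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Loads the joined invoice row in a safe, parameterized way (same join, written with explicit JOIN syntax).
void printInvoiceRow(Connection conn, long clickedInvoiceId) throws SQLException {
    String sql = "SELECT inv.idInvoice, cust.FirstName, cust.LastName, prod.Name "
               + "FROM invoice inv "
               + "JOIN customer cust ON inv.idCustomer = cust.idCustomer "
               + "JOIN product prod ON inv.idProduct = prod.idProduct "
               + "WHERE inv.idInvoice = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        ps.setLong(1, clickedInvoiceId); // never concatenate the clicked id into the SQL string
        try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
                System.out.println(rs.getString("FirstName") + " " + rs.getString("LastName"));
            }
        }
    }
}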
In my opinion, based on your application, you can use a flat table that includes what you need and doesn't require joining tables. This solution is applicable when you have small data sets (e.g. in banking, the relationships between the Terminal table and the ATMTerminal, POSTerminal and CashlessTerminal tables).
And for situations where you have a relationship in which one side is static data (like the above example) and the other side is transactional data (like banking transactions), you should use a foreign key from the transaction table to the static data table.
I recommend the second solution for your problem: handle the relationships using foreign keys and get the columns you need using the join strategy.
Let's say I have a table with 2 columns:
city
name (of a person).
I also have a Java "city" object which contains:
city name
a list of all the people in that city
So now I have two options to get the data:
First use DISTINCT to get a list of all the cities. Then, for each city, query the database again, using WHERE to get only records where the person lives in that city. Then I can store this in a City object.
Get a list of all the data, using ORDER BY to order by the city name. Then loop through all the records and start storing them in City objects. When I detect that the city name changes then I can create a new City object and store the records in that.
Which of these methods is faster / better practice? Or is there some better way of getting this information than these two methods? I am using Oracle database.
A database query is a relatively expensive operation - you need to communicate with another server over the network, it then may need to access its disk, compute a result, return it to you, etc. You'd want to minimize these as much as possible. Having a single query and going over its results is by far a better idea than having multiple queries, unless you have some killer reason not to do so - which doesn't seem to be the case here, at least not from the information you shared.
Short answer is #2. You want to make as few queries to the database as possible. With #2, if I got it right, you will make a join of city/people and then create the objects.
Better way: use JPA/Hibernate, e.g. check http://www.baeldung.com/hibernate-one-to-many
Answer #2 is optimal, in all cases.
You'll need to code the logic in Java to differentiate when you change from one city to the next one.
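A minimal sketch of that logic, assuming plain JDBC, a table named person with the two columns from the question, and a simple City class with getName() and getPeople():

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Builds City objects from one ordered query, starting a new City whenever the city name changes.
List<City> loadCities(Connection conn) throws SQLException {
    List<City> cities = new ArrayList<>();
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT city, name FROM person ORDER BY city")) {
        City current = null;
        while (rs.next()) {
            String cityName = rs.getString("city");
            if (current == null || !current.getName().equals(cityName)) {
                current = new City(cityName); // city changed: start a new City object
                cities.add(current);
            }
            current.getPeople().add(rs.getString("name"));
        }
    }
    return cities;
}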
Alternatively, if you were using MyBatis the solution becomes very simple by using "collections". These perform a single database call and retrieve the whole Java tree you specify, including all sublists in multiple levels. Very performant and also easy to code.
I am taking a 'Keyword' and a table name from the user.
Now, I want to find all the columns of the table whose data type is varchar (String).
Then I will create a query which compares the keyword with those columns, and matching rows will be returned as the result set.
I tried the desc table_name query, but it didn't work.
Can we write describe table query in JPQL?
If not then is there any other way to solve above situation?
Please help and thank you in advance.
No workaround is necessary, because it's not a drawback of the technology. It is not JPQL that needs to be changed, it's your choice of technology. In JPQL you cannot even select data from a table. You select from classes, and these can be mapped to multiple tables at once, resulting in SQL joins for simplest queries. Describing such a join would be meaningless. And even if you could describe a table, you do not use names of columns in JPQL, but properties of objects. Describing tables in JPQL makes no sense.
JPQL is meant for querying objects, not tables. Also, it is meant for static work (where classes are mapped to relations once and for good) and not for dynamic things like mapping tables to objects on-the-fly or live inspection of database (that is what ror's AR is for). Dynamic discovery of properties is not a part of that.
Depending on what you really want to achieve (we only know what you are trying to do, that's different) you have two basic choices:
if you are trying to write a piece of software in a dynamic way, so that it adjusts itself to changes in schema - drop JPQL (or any other ORM). Java classes are meant to be static, you can't really map them to dynamic tables (or grow new attributes). Use rowsets, they work fine and they will let you use SQL;
if you are building a clever library that can be shared by many projects and so has to work with many different static mappings, use reflection API to find properties of objects that you query for. Names of columns in the table will not help you anyway, since in JPQL queries you have to use names defined in mappings.
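If you go the reflection route, a minimal sketch of discovering the String-typed (typically varchar-mapped) properties of a mapped entity class could look like this; how those properties map to actual columns still depends on your mappings:

import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Returns the names of all String fields of an entity class; with a plain JPA mapping
// these usually correspond to the varchar columns of the underlying table.
static List<String> stringProperties(Class<?> entityClass) {
    List<String> names = new ArrayList<>();
    for (Field field : entityClass.getDeclaredFields()) {
        if (field.getType() == String.class) {
            names.add(field.getName());
        }
    }
    return names;
}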
Map the database dictionary tables and read the required data from them. For Oracle database you will need to select from these three tables: user_tab_comments, user_tab_cols, user_col_comments; to achieve the full functionality of the describe statement.
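For example, a rough sketch of reading the varchar columns of one table through a JPA native query against Oracle's USER_TAB_COLS view (other databases have different dictionary views):

import java.util.List;
import javax.persistence.EntityManager;

// Lists the VARCHAR2 columns of a table by querying Oracle's USER_TAB_COLS dictionary view.
@SuppressWarnings("unchecked")
static List<String> varcharColumns(EntityManager em, String tableName) {
    return em.createNativeQuery(
            "SELECT column_name FROM user_tab_cols "
          + "WHERE table_name = ?1 AND data_type = 'VARCHAR2'")
        .setParameter(1, tableName.toUpperCase())
        .getResultList();
}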
There are some talks over the community about dynamic definition of the persistent unit in the future releases of JPA: http://www.oracle.com/goto/newsletters/javadev/0111/blogs_sun_devoxx.html?msgid=3-3156674507
As far as I know, we cannot use a describe query in JPQL.
For a customer we were developing a big application that was open to all users, if you will, meaning all users could see each other's data.
Now suddenly the customer is saying that they want only users belonging to the same organization to be able to view each other's data.
So we came up with this data model:
So now the question is: How is it best to separate the data?
This is the only alternative I see:
SQL JOIN on ALL relevant tables (all tables that hold shared data should now always join on Organization)
-- All queries should now add an extra join to Organization, and if the join doesn't exist, we need to create a new foreign key.
But I feel an extra join (we have around 20 tables that need the extra join) is quite costly.
I hope there are some other best practices or solutions we can consider.
PS: This is a Web application developed using Java/JSF/Seam (but I don't know if that is relevant)
UPDATE
I want to clarify something. My concern is not security but performance. We have added the foreign key to Organization to all relevant tables that hold shared data, and we are using the logged-in user's organization to filter the data.
All I want to know is whether this is a good architectural solution (inner join) or whether we should do something else (i.e. load all shared data and filter in memory instead of joining in SQL).
You really have to understand the difference between the persistence layer and the application layer.
It doesn't matter how you define your database tables, as anyone with database access will have access to all the users' data. What does matter is how you define the behavior in your application.
Changing the database design should only be done for performance reasons, not for security - which should be handled in the application.
I would reckon that the best pattern would be to only expose the user details through the web application, so at that point it's a case of restricting the data exposed to each user. This will allow you to build the required security into the application.
Alternatively if you are allowing direct database access then you will need to create a login/user (depends on database used) for each organization or user and then restrict the access of these login/user entities to parameterized stored procedures rather than the base tables. This will push security back onto the database, which is riskier but still do-able.
As to meta changes to support the organization column, parameterizing the stored procedures will be fairly trivial:
select @organizationId = organizationId from User where User.id = @currentUserId
select * from User where organizationId = @organizationId
(depending on the SQL flavour you will need to enclose some entities, e.g. `User`, [User], etc.)
I see no reason that Organization has to be 'joined' at all.
If your 'data' tables all have OrganizationID columns, then you can look up the organizationID for the user and add it as a condition to the query.
EX:
select @OrganizationId = organizationId from User where User.id = @currentUserId
select * from datatable a .... where .... AND a.organizationID = @OrganizationId
See; no join.
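In a JPA application the same two-step idea is a small sketch like this (the entity and property names, e.g. Document and organizationId, are placeholders for your own model):

import java.util.List;
import javax.persistence.EntityManager;

// First resolve the logged-in user's organization, then filter the data query by it - no extra join.
List<Document> documentsForCurrentUser(EntityManager em, long currentUserId) {
    Long organizationId = em.createQuery(
            "select u.organizationId from User u where u.id = :id", Long.class)
        .setParameter("id", currentUserId)
        .getSingleResult();

    return em.createQuery(
            "select d from Document d where d.organizationId = :orgId", Document.class)
        .setParameter("orgId", organizationId)
        .getResultList();
}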
With respect to performance, there are different types of joins, and SQLServer allows you to hint at the type of join. So in some cases, a merge join is the best, whereas in something like this scenario, a loop join would be the best. Not sure if these choices are available in MySQL.
With respect to all of your tables needing a join, or condition (see above), there is a logical answer, and an implementation answer. The implementation answer depends on your indexing. If you can limit the dataset the most by adding that condition, then you will benefit. But if the join with the other table that has already been filtered does a better job at reducing rows, then the condition will be worthless (or worst case, it will use the wrong index). Assuming you have indexes on your join and condition columns.
Logically, only data that isn't fully dependent on a table that is filtered by organizationID needs that extra condition. If you have a car table, and carparts table, then you only have to filter the car table. Unless for some reason you don't need to join with the car table for some joins, in which case you will need that organizationID on the parts table too.
Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:
class Store implements Validatable {
    int storeNo;
    String storeName;
    // ... etc ...
}
So, your client tells you that you have to import the store schedule from an Excel sheet into the application and run a series of validations on them. For instance, 'StoreIsInSameCountry'; 'StoreIsValid'... etc. So, you would design a Rule interface for checking all business conditions. Something like this:
interface Rule<T extends Validatable> {
    public Error check(T value) throws Exception;
}
Now, here comes the question. I am uploading 2000 stores from this Excel sheet, so I would end up running each rule defined for a store that many times. If I were to have 4 rules, that's 8000 queries to the database, i.e., 16000 hits to the connection pool. For a simple check where I would just have to verify whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would get my 'Store' object. When I don't get anything back from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than doing the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
What would you do if you were in my shoes, and what is the best practice?
That way I would get my 'Store' object from the database. When I don't get anything back, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = tt.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
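As a rough JDBC sketch of that temp table approach (temp table syntax varies per database; the "#" prefix below is SQL Server style, and the column names are the ones used above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

// Loads many stores at once: fill a temporary id table in a batch, then join against it.
void loadStores(Connection conn, List<Integer> storeIds) throws SQLException {
    try (Statement st = conn.createStatement()) {
        st.execute("CREATE TABLE #store_ids (id INT PRIMARY KEY)"); // database-specific syntax
    }
    try (PreparedStatement ins = conn.prepareStatement("INSERT INTO #store_ids (id) VALUES (?)")) {
        for (Integer id : storeIds) {
            ins.setInt(1, id);
            ins.addBatch();
        }
        ins.executeBatch();
    }
    String sql = "SELECT s.STORE_ID, s.STORE_ATTRIB1, s.STORE_ATTRIB2 "
               + "FROM #store_ids tt JOIN STORE s ON s.STORE_ID = tt.id";
    try (Statement st = conn.createStatement(); ResultSet rs = st.executeQuery(sql)) {
        while (rs.next()) {
            // any id from #store_ids with no matching row here refers to a store that doesn't exist
        }
    }
}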
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
IN is in no way a "not suggested methology".
In fact, there was a time when some databases did not support IN queries efficiently; that's why folk wisdom still advises against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
Note, that in a properly designed database, validation rules are also set-based.
I. e. you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.
SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
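A minimal sketch of that idea: load the ids once, then every later existence check is an in-memory lookup.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashSet;
import java.util.Set;

// One query up front; afterwards "does this store exist?" is a cheap Set.contains() call.
Set<Integer> loadExistingStoreIds(Connection conn) throws SQLException {
    Set<Integer> ids = new HashSet<>();
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery("SELECT store_id FROM store")) {
        while (rs.next()) {
            ids.add(rs.getInt("store_id"));
        }
    }
    return ids;
}

// Usage: Set<Integer> existing = loadExistingStoreIds(conn);
//        boolean exists = existing.contains(importedStoreId);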
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
I think it's more of a business question, with parameters such as how often the client runs the import, how long it would take you to implement either of the solutions, and how expensive your time per hour is.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.
...a Rule can be run for a single store only.
Managing business rules along with performance is a tricky task, so there is a library ("Persistence Layer") that does exactly that. You define rules, then execute a bulk of commands, and the library fetches from the DB whatever the rules require in a single query (by using temp tables rather than IN) and then passes it to the rules.
There is an example of a validator here.