Is it a bad practice to expose DB internal IDs in URLs?

For example, suppose I have a users table with some IDs (primary key) for each row. Would exposing the URL myapp.com/accountInfo.html?userId=5, where 5 is an actual primary key, be considered a "bad thing" and why?
Also assume that we properly defend against SQL injections.
I am mostly interested in answers related to the Java web technology stack (hence the java tag), but general answers will also be very helpful.
Thanks.

That depends on how you parse the URL. Allowing blind SQL injection is bad, so you have to validate the ID that comes from user input.
Stack Exchange also puts the row ID into the URL, as you can see in your address bar. The trick is to parse that part and get rid of any possible SQL. The simplest way is to check that the ID is a number.
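The "check that the ID is a number" step could look roughly like the sketch below. The class name IdParam and the 18-digit length cap are assumptions made for illustration. A bound parameter in a PreparedStatement remains the actual defense against injection; this check only rejects malformed input early.

```java
import java.util.OptionalLong;

// Hypothetical helper: accepts only plain positive decimal ids such as "5".
final class IdParam {
    private IdParam() {}

    /** Returns the id if raw consists solely of digits and is positive, otherwise empty. */
    static OptionalLong parse(String raw) {
        // 18 digits always fit in a long, so Long.parseLong below cannot overflow.
        if (raw == null || raw.isEmpty() || raw.length() > 18) {
            return OptionalLong.empty();
        }
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c < '0' || c > '9') {
                return OptionalLong.empty();
            }
        }
        long id = Long.parseLong(raw);
        return id > 0 ? OptionalLong.of(id) : OptionalLong.empty();
    }
}
```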

It isn't a bad thing to pass in the URL, as it doesn't mean much to the end user. It's only bad if you rely on that value in the running of your application. For example, you don't want the user to notice userId=5 and change it to userId=10 to display another person's account.
It would be much safer to store this information in a session on the server. For example, when the user logs in, their userID value is stored in the session on the server, and you use this value whenever you query the database. If you do it this way, there usually wouldn't be any need to pass through the userID in the URL, however it wouldn't hurt because it isn't used by your DB-querying code.
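A minimal sketch of that idea, assuming the logged-in user's ID has already been read from the server-side session. AccountService and its in-memory map are hypothetical stand-ins for a real service and database.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical service: the map stands in for the users table.
final class AccountService {
    private final Map<Long, String> accountsByUserId = new HashMap<>();

    void addAccount(long userId, String info) {
        accountsByUserId.put(userId, info);
    }

    /**
     * sessionUserId comes from the server-side session, requestedUserId from the URL.
     * The lookup is refused unless both refer to the same user.
     */
    Optional<String> accountInfo(long sessionUserId, long requestedUserId) {
        if (sessionUserId != requestedUserId) {
            return Optional.empty(); // someone edited userId=5 into userId=10
        }
        return Optional.ofNullable(accountsByUserId.get(sessionUserId));
    }
}
```

With this check in place, changing the ID in the URL yields an empty result instead of another user's data.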

Using the database ID in URLs is good, because this ID should never change during an object's (DB row's) life. Thus the URL is durable, which is the most important aspect of a URL. See also Cool URIs don't change.

Yes, it is a bad thing. You are exposing an implementation detail. How bad? That depends. It forces you to do otherwise unneeded checks of the user input. If other applications start depending on it, you are no longer free to change the database schema.

PKs are meant for the system.
To the user, it may carry a different meaning.
For example, consider the following links, which use the primary key to display an item under products productA, productB, and productC:
(A)http://blahblahsite.com/browse/productA/111 (pkey)
(B)http://blahblahsite.com/browse/productB/112 (pkey)
(C)http://blahblahsite.com/browse/productC/113 (pkey)
A user on link B may think there are 112 items under productB, which is misleading.
It will also cause problems when merging tables, since the PK is auto-incremented.

Related

Should you expose a primary key in REST API URLs?

I'm very new to Spring. I'm trying to create a REST API using Spring Boot, and I'm stuck on whether to expose my user's primary key, which also happens to be their email. Something like api/user/example@gmail.com. A big part of me says it's okay, since it is the identifier for that specific record when viewing, deleting, and updating. Is there a security risk in this? What is the best practice for such an implementation? Right now I'm combining @PathVariable and @RequestBody. I didn't like the idea of putting my primary key in the request body, thinking that it might pose a risk... or does it?
@RequestMapping(value = "/updateUser/{customerEmail}", method = RequestMethod.POST)
public ApiResult updateCustomer(@RequestBody UserDetailsDto userDetailsDto, @PathVariable String customerEmail) {
    // service call...
}

First of all, user e-mail is often considered to be PII (Personally Identifiable Information). As such it would be unwise to put it into a URL, because you should not put any sensitive information into the URL. Header - ok, body - too. But not into the URL. The reason is, that all the proxies/load balancers/other infrastructure you have or might have in the future will always be allowed to log URLs for debug reasons. And you don't want your sensitive data to leak across the components like this. No company policy would ever allow that.

Spring is a good choice of framework. As long as the identifier is unique it should usually be fine. The problem with using an email is that you expose your users' data more easily, which could be problematic for them. I would suggest using a unique string as the identifier instead, in the form:
http://api.example.com/user-management/users/{id}, for example http://api.example.com/user-management/users/22
In this case user 22 has the email example@gmail.com, so you do not expose sensitive data when doing an update. This link gives guidance on naming best practice: https://restfulapi.net/resource-naming/.
Another tip from that link is to avoid using URIs for CRUD (Create, Read, Update, Delete) functionality: "URIs should be used to uniquely identify resources and not any action upon them".

Any sensitive information should not be exposed (in this case the email, but in other cases that could also be the auto-incremented primary key ID of your table).
One way around that, which I know and use, is to have two fields. For example, I have a table USER {ID, USERID, NAME, ...}
Above, ID is an auto-incremented Long field representing the PK.
USERID, on the other hand, is a field generated from random characters or a GUID, which I pass back and forth in REST calls.
So, I might have a record in the USER table such as:
USER {1, "a23asf60asdaare998700asdfasr70po097", "Mike", ...}
If I were to pass ID=1 back and forth, a malicious user could easily deduce what it is and how to query the next user. For that reason, I pass USERID, which represents a public and safe version of ID that can be passed around, and no one can know what the USERID of the next user would be.
So your response model, DTO model, etc. should have these fields, and the response model should return USERID instead of ID. You can use JPA to find the user by USERID (so the method must be called findByUserId in this case).
The same would apply to your case, where you use the email instead of ID, if you don't want to expose user emails, which makes sense to me.
Hope that helps.
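The two-field approach described in this answer might be sketched as follows. UserDirectory, its in-memory map, and the use of java.util.UUID for the public value are assumptions for illustration; the answer's real setup uses a database table and a JPA finder. An opaque ID makes guessing the next user impractical, but it does not replace an authorization check on each request.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Hypothetical in-memory stand-in for the USER table {ID, USERID, NAME}.
final class UserDirectory {
    static final class User {
        final long id;          // internal auto-incremented PK, never sent to clients
        final String publicId;  // random value used in URLs and DTOs
        final String name;

        User(long id, String publicId, String name) {
            this.id = id;
            this.publicId = publicId;
            this.name = name;
        }
    }

    private final Map<String, User> byPublicId = new HashMap<>();
    private long nextId = 1;

    User register(String name) {
        // randomUUID() produces a random (type 4) UUID.
        User user = new User(nextId++, UUID.randomUUID().toString(), name);
        byPublicId.put(user.publicId, user);
        return user;
    }

    /** Plays the role of the findByUserId lookup mentioned above. */
    Optional<User> findByPublicId(String publicId) {
        return Optional.ofNullable(byPublicId.get(publicId));
    }
}
```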

I think it's more a matter of taste and personal beliefs rather than objective aspects.
Since HTTPS is more or less mandatory today, it's a lot harder to obtain the e-mail address by just sniffing with a tool like Wireshark.
So what's the possible risk? Since users have to be authorized to call this endpoint, they know at least their own e-mail address and most likely used it to authenticate. So a user can't modify or acquire the data of another user, if properly implemented.
A problem which may be of concern is that it might be possible to check for a registered e-mail during the registration process. Depending on what kind of application you're developing, this might be an issue. To give a brief example of such a case: Imagine a catholic priest registered on a porn site or the e-mail address of your husband/wife registered on a dating platform.
So my advice: force HTTPS and you are pretty much fine using them as a primary key. However, if you have the possibility to abstract this, I'd do so. A numerical key or username may be a better choice and also easier to handle, but it makes no difference. Imagine you have an endpoint to acquire the user's data, including the e-mail address. It just doesn't matter whether you acquire this data by a numerical key or by the e-mail address. In the end, you end up with the e-mail address in the response body. And if this body is accessible to someone, they can also access the username and password, thus rendering any security measure you've taken useless.

Exploring user specific data in webapps

I am practicing by designing a simple todo list webapp in which a user can authenticate and save todo list items. The user is also only able to view/edit the todo list items that they added.
This seems to be a general feature (authenticated user only views their own data) in most web applications (or applications in general).
To me what is important is having knowledge of the different options for accomplishing this. What I would like to achieve is a solution that can handle lots of users' data effectively. At the moment I am doing this using a Relational Database, but noSQL answers would be useful to me as well.
The following ideas came to mind:
Add a user_id column each time this "feature" is needed.
Add an association table (in the example above a user_todo_list_item table) that associates the data.
Design in such a way that you have a table per user per "feature", so you would have a todolist_userABC table. It's an option, but I do not like it much, since a thousand users means a thousand tables?!
Add row level security to the specific "feature". I am not familiar on how this works but it seems to be a valid option. I am also not sure whether this is database vendor specific.
Of my choices I went with the user_id column on the todolist_item table. Although it can do the job, I feel that a user_id column might be problematic when reading data if the data within the table gets large enough. One could add an index I guess but I am not sure of the index's effectiveness.
What I don't like about it is that I need to have a user_id for every table where I desire this type of feature which doesn't seem correct to me? It also seems that when I implement the database layer I would have to add this to my queries for every feature (unless I use some AOP)?
I had a look around (How does Trello store data in MongoDB? (Collection per board?)), but it does not speak about the techniques regarding user_id columns or things like that. I also tried reading about this in some security frameworks (Spring Security to be specific) but it seems that it only goes into privileges/permissions on a table level and not a row level?
So the question is whether my choice was appropriate and if there are better techniques to do this?

Your choice is the natural thing to do.
The table-per-user is a non-starter (anything that modifies the database structure in response to user action is usually suspect).
Row-level security isn't really an option for webapps - it requires each user session to have a separate, persistent connection to the database, which is rarely practical. And yes, it is vendor-specific.
How you index your tables depends entirely on your usage patterns and the types of queries you want to run. Is 'show all TODOs for a user' a query you want to support (it seems like it would be)? Then an index on the user id is obviously needed.
Why does having a user_id column seem wrong to you? If you want to restrict access by user, you need to be able to identify which user the record belongs to. Doesn't actually mean that every table needs it - for example, if one record composes another (say, your TODOs have 'steps', each step belongs to a single TODO), only the root of the object graph needs the user id.
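The point about the root of the object graph can be sketched like this. The Todo and Step classes and the canAccess helper are hypothetical names; only the todo carries a user id, and a step is authorized through its parent.

```java
// Hypothetical model: only the root (Todo) stores the owner.
final class TodoOwnership {
    static final class Todo {
        final long id;
        final long userId; // owner, stored once at the root
        Todo(long id, long userId) {
            this.id = id;
            this.userId = userId;
        }
    }

    static final class Step {
        final long id;
        final Todo todo; // a step belongs to exactly one todo and has no user_id of its own
        Step(long id, Todo todo) {
            this.id = id;
            this.todo = todo;
        }
    }

    private TodoOwnership() {}

    /** A step is accessible exactly when its parent todo belongs to the current user. */
    static boolean canAccess(long currentUserId, Step step) {
        return step.todo.userId == currentUserId;
    }
}
```

In SQL terms this corresponds to joining the steps table to the todo table and filtering on the todo's user_id.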

Way to know table is modified

There are two different processes developed in Java running independently.
If either process modifies the table, can I get a notification that the table was modified? My objective is to keep an object always in sync with a table in the database: if any modification happens to the table, I want to update the object.
If the table is modified, can I get a notification about it? Do databases provide any facility like this?

We use SQL Server and have certain triggers that fire when a table is modified and call an external binary. The binary we call sends a Tib rendezvous message to notify other applications that the table has been updated.
However, I'm not a huge fan of this solution - Much better to control writing to your table through one "custodian" process and have other applications delegate to that. To enforce this you could change permissions on your table so that only your custodian process can write to the database.
The other advantage of this approach is being able to provide a caching layer within your custodian process to cater for common access patterns. Granted that a DBMS performs caching anyway, but by offering it at the application layer you will have more control / visibility over it.

No, the database doesn't provide these services. You have to query it periodically to check for modifications, or use some JMS solution to send notifications from one app to another.

You could add a timestamp column (last_modified) to the tables and check it periodically for updates, or use sequence numbers that are incremented on updates (similar in concept to optimistic locking).
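A rough sketch of the polling idea, under the assumption that the current sequence number (or timestamp) can be read cheaply. VersionedCache is a hypothetical name; in a real application the two suppliers would run queries against the table, whereas here they are plain lambdas.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical cache that reloads its value only when the table's version changes.
final class VersionedCache<T> {
    private final LongSupplier currentVersion; // e.g. reads MAX(last_modified) or a sequence number
    private final Supplier<T> loader;          // e.g. reads the whole table into an object
    private boolean loaded = false;
    private long loadedVersion;
    private T value;
    private int reloads = 0;

    VersionedCache(LongSupplier currentVersion, Supplier<T> loader) {
        this.currentVersion = currentVersion;
        this.loader = loader;
    }

    /** Call periodically, or on each access, to pick up modifications. */
    synchronized T get() {
        long version = currentVersion.getAsLong();
        if (!loaded || version != loadedVersion) {
            value = loader.get();
            loadedVersion = version;
            loaded = true;
            reloads++;
        }
        return value;
    }

    synchronized int reloadCount() {
        return reloads;
    }
}
```

If the table changes between reading the version and loading the data, the next call sees a newer version and simply reloads once more.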
You could use jboss cache which provides update mechanisms.

One way to do this: wrap your database statement in a method that returns 'true' when it completes successfully. Keep that flag in scope in your code so that you can check it whenever you want to know whether the table has been modified. Why not try it this way?

If you're willing to take the hack approach, and your database stores tables as files (e.g. MySQL), you could always have something check the modification time of the files on disk and see whether it has changed.
Of course, this won't work for databases like Oracle, where tables are assigned to tablespaces and the tablespaces are what have storage on disk.
(yes, I know this is a bad approach, that's why I said it's a hack -- but we don't know all of the requirements, and if he needs something quick, without re-writing the whole application, this would technically work for some databases)

Google app engine: Poor Performance with JDO + Datastore

I have a simple data model that includes
USERS: store basic information (key, name, phone # etc)
RELATIONS: describe, e.g. a friendship between two users (supplying a relationship_type + two user keys)
COMMENTS: posted by users (key, comment text, user_id)
I'm getting very poor performance, for instance, if I try to print the first names of all of a user's friends. Say the user has 500 friends: I can fetch the list of friend user_ids very easily in a single query. But then, to pull out first names, I have to do 500 back-and-forth trips to the Datastore, each of which seems to take on the order of 30 ms. If this were SQL, I'd just do a JOIN and get the answer out fast.
I understand there are rudimentary facilities for performing two-way joins across un-owned relations in a relaxed implementation of JDO (as described at http://gae-java-persistence.blogspot.com) but they sound experimental and non-standard (e.g. my code won't work in any other JDO implementation).
Worse yet, what if I want to pull out all the comments posted by a user's friends. Then I need to get from User --> Relation --> Comments, i.e. a three-way join, which isn't even supported experimentally. The overhead of 500 back-and-forths to get a friend list + another 500 trips to see if there are any comments from a user's friends is already enough to push runtime >30 seconds.
How do people deal with these problems in real-world datastore-backed JDO applications? (Or do they?)
Has anyone managed to extract satisfactory performance from JDO/Datastore in this kind of (very common) situation?
-Bosh

First of all, for objects that are frequently accessed (like users), I rely on the memcache. This should speedup your application quite a bit.
If you have to go to the datastore, the right way to do this should be through getObjectsById(). Unfortunately, it looks like GAE doesn't optimize this call. However, a contains() query on keys is optimized to fetch all the objects in one trip to the datastore, so that's what you should use:
List myFriendKeys = fetchFriendKeys();
Query query = pm.newQuery(User.class, ":p.contains(key)");
query.execute(myFriendKeys);
You could also rely on the low-level API get() that accepts multiple keys, or do as I do and use Objectify.
A totally different approach would be to use an equality filter on a list property. This will match if any item in the list matches. So if you have a friendOf list property in your user entity, you can issue a single Query friendOf == theUser. You might want to check this: http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine

You have to minimize DB reads. That must be a huge focus for any GAE project - anything else will cost you. To do that, pre-calculate as much as you can, especially oft-read information. To solve the issue of reading 500 friends' names, consider that you'll likely be changing the friend list far less than reading it, so on each change, store all names in a structure you can read with one get.
If you absolutely cannot then you have to tweak each case by hand, e.g. use the low-level API to do a batch get.
Also, rather optimize for speed and not data size. Use extra structures as indexes, save objects in multiple ways so you can read it as quickly as possible. Data is cheap, CPU time is not.
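The precalculation advice could be sketched as below, with plain maps standing in for Datastore entities. FriendNameIndex and its methods are hypothetical: the friend's first name is copied at write time so that reading a user's friend names is a single lookup.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical denormalized structure: one entry per user holding all friend names.
final class FriendNameIndex {
    private final Map<String, String> firstNames = new HashMap<>();
    private final Map<String, List<String>> friendNamesByUser = new HashMap<>();

    void addUser(String userId, String firstName) {
        firstNames.put(userId, firstName);
    }

    /** Write path: does the extra work once, when the friend list changes. */
    void addFriend(String userId, String friendId) {
        String name = firstNames.get(friendId);
        if (name == null) {
            throw new IllegalArgumentException("unknown user: " + friendId);
        }
        friendNamesByUser.computeIfAbsent(userId, k -> new ArrayList<>()).add(name);
    }

    /** Read path: one lookup, regardless of how many friends the user has. */
    List<String> friendNames(String userId) {
        return Collections.unmodifiableList(
                friendNamesByUser.getOrDefault(userId, Collections.emptyList()));
    }
}
```

The trade-off is that a renamed user must be updated in every copy, which this sketch does not handle.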

Unfortunately Phillipe's suggestion
Query query = pm.newQuery(User.class, ":p.contains(key)");
is only optimized to make a single query when searching by primary key. Passing in a list of ten non-primary-key values, for instance, gives the following trace
alt text http://img293.imageshack.us/img293/7227/slowquery.png
I'd like to be able to bulk-fetch comments, for example, from all a user's friends. If I do store a List on each user, this list can't be longer than 1000 elements long (if it's an indexed property of the user) as described at: http://code.google.com/appengine/docs/java/datastore/overview.html .
Seems increasingly like I'm using the wrong toolset here.
-B

Facebook has 28 terabytes of memory cache... However, making 500 trips to memcached isn't very cheap either. It can't be used to store a gazillion small items. "Denormalization" is the key. Such applications do not need to support ad-hoc queries. Compute and store the results directly for the few supported queries.
In your case, you probably have just one type of query: return the data of this, that and the other that should be displayed on a user page. You can precompute this big ball of mess, so that later one query based on userId can fetch it all.
When userA makes a comment to userB, you retrieve userB's big ball of mess, insert userA's comment in it, and save it.
Of course, there are a lot of problems with this approach. For giant internet companies, they probably don't have a choice, generic query engines just don't cut it. But for others? Wouldn't you be happier if you can just use the good old RDBMS?

If it is a frequently used query, you can consider preparing indexes for the same.
http://code.google.com/appengine/articles/index_building.html

The indexed property limit is now raised to 5000.
However you can go even higher than that by using the method described in http://www.scribd.com/doc/16952419/Building-scalable-complex-apps-on-App-Engine
Basically just have a bunch of child entities for the User called UserFriends, thus splitting the big list and raising the limit to n*5000, where n is the number of UserFriends entities.
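Splitting one long friend list into several child entities amounts to chunking. The helper below is a generic sketch; FriendListSharder and split are hypothetical names, and each returned chunk would become the list property of one UserFriends entity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that cuts a list into chunks of at most maxPerChunk items.
final class FriendListSharder {
    private FriendListSharder() {}

    static <T> List<List<T>> split(List<T> items, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxPerChunk) {
            int end = Math.min(start + maxPerChunk, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```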

When to 'IN' and when not to?

Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:
class Store implements Validatable {
    int storeNo;
    String storeName;
    // ... etc....
}
So, your client tells you that you have to import store schedule from a excel sheet into the application and you would have to run a series of validations on 'em. For instance, 'StoreIsInSameCountry';'StoreIsValid'... etc. So, you would design a Rule interface for checking all business conditions. Something like this:
interface Rule<T extends Validatable> {
    public Error check(T value) throws Exception;
}
Now, here comes the question. I am uploading 2000 stores from this Excel sheet, so I would end up running each rule defined for a store that many times. With 4 rules, that is 8000 queries to the database, i.e. 16000 hits to the connection pool. For a simple check where I just have to see whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would get my 'Store' object. When I don't get anything from the database, that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than doing the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
What would you do if you were in my shoes, and what is the best practice?

That way I would get my 'Store' object from the database. When I don't get anything from the database, that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = tt.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, because it gives better performance in this scenario? Or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
IN is in no way a "not suggested methodology".
In fact, there was a time when some databases did not support IN queries efficiently, which is why folk wisdom still advises against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
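For the plain IN variant from the question, the statement is usually built with one placeholder per ID and the values bound through a PreparedStatement. The sketch below only builds the SQL text; InClause and storeLookupSql are hypothetical names, and selecting just STORE_ID is an assumption made to keep the query short.

```java
// Hypothetical helper that builds "... IN (?, ?, ?)" for a given number of ids.
final class InClause {
    private InClause() {}

    static String storeLookupSql(int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("need at least one id");
        }
        StringBuilder sql = new StringBuilder("SELECT STORE_ID FROM STORE WHERE STORE_ID IN (");
        for (int i = 0; i < idCount; i++) {
            if (i > 0) {
                sql.append(", ");
            }
            sql.append('?'); // value is bound later with PreparedStatement.setInt(i + 1, id)
        }
        return sql.append(')').toString();
    }
}
```

Some databases cap the number of list elements or bind parameters, so a 2000-store import may need to be split into several batches, or use the temporary table approach described above.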
Note, that in a properly designed database, validation rules are also set-based.
I. e. you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.

SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
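One way to keep the one-store-at-a-time rule design while paying for only a single query is to hand the rule a preloaded set of IDs. In this sketch StoreExistsRule is a hypothetical class, and the set passed to its constructor is assumed to come from a query like the ones above.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical rule: checks one store at a time against ids fetched once up front.
final class StoreExistsRule {
    private final Set<Integer> existingStoreNos;

    StoreExistsRule(Set<Integer> existingStoreNos) {
        this.existingStoreNos = new HashSet<>(existingStoreNos); // defensive copy
    }

    /** Returns an error message if the store is unknown, otherwise empty. */
    Optional<String> check(int storeNo) {
        return existingStoreNos.contains(storeNo)
                ? Optional.empty()
                : Optional.of("Store " + storeNo + " does not exist");
    }
}
```

The rule is still called once per uploaded row, but each call is an in-memory set lookup instead of a database round trip.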

I think it's more of a business question, with the parameters being how often the client runs the import, how long it would take you to implement either solution, and how expensive your time is per hour.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.

...a Rule can be run for a single store only.
Managing business rules along with performance is a tricky task, so there is a library ("Persistence Layer") that does exactly that. You define rules, then execute a batch of commands; the library then fetches from the DB whatever the rules require in a single query (by using temp tables rather than 'IN') and passes it to the rules.
There is an example of a validator in here.
