Limiting SQL Injection when query is almost entirely configurable


I have a requirement to perform a scheduled dump of a SQL query from a web application. Initially it was an entire table (only the table name was configurable), but then the addition of a configurable WHERE clause was raised, along with a subset of columns.
The configurable options now required are:
columns
table name
where clause
At this point, it might as well just be the entire query, right?!
I know that SQLi can be mitigated somewhat by java.sql.PreparedStatement, but as far as I can tell, that relies on knowing the columns and datatypes at compile time.
The configurable items will not be exposed to end users. They will sit in a properties file within WEB-INF/classes, so the users I am defending against here are sysadmins who are not as good as they think they are.
Am I being overcautious here?
If nothing else, can java.sql.PreparedStatement prevent multiple queries from being executed if, say, the WHERE clause was Robert'); DROP TABLE students;--?

A prepared statement will not handle this for you. With a prepared statement you can only safely add parameters to your query, not table names, column names or entire where clauses.
The latter in particular makes it virtually impossible to prevent injection if there are no constraints whatsoever. Column and table name parameters could be checked against a list of valid values, either statically defined or derived dynamically from your database structure. You could do some basic regex checking on the where parameter, but that will only really help against obvious SQL injection.
With the flexibility you intend to offer in the form of SELECT FROM WHERE you could have queries like this:
SELECT mycolumn FROM mytable WHERE id = 1 AND 'username' in (SELECT username FROM users)
You could look at something like JOOQ to offer safe dynamic query building while still being able to constrain the things your users are allowed to query for.
Constraining your users in one way or another is key here. Not doing that means you have to worry not just about SQL injection, but also about things like performance issues. You could provide them with a visual (drag-and-drop) query builder, for instance.
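As a rough illustration of checking table and column names against a list of valid values, here is a minimal sketch. The class name AllowlistedSelect and the tables and columns in the allowlist are made up for the example; in practice the list could come from your own configuration or from database metadata.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: the class name and allowlist contents are illustrative only.
final class AllowlistedSelect {
    // Tables and columns that the properties file is allowed to reference.
    private static final Map<String, Set<String>> ALLOWED = Map.of(
            "students", Set.of("id", "name", "year"),
            "courses", Set.of("id", "title"));

    // Builds "SELECT c1, c2 FROM t" only from names present in the allowlist.
    static String build(String table, List<String> columns) {
        Set<String> allowedColumns = table == null ? null : ALLOWED.get(table);
        if (allowedColumns == null) {
            throw new IllegalArgumentException("Table not allowed: " + table);
        }
        if (columns.isEmpty()) {
            throw new IllegalArgumentException("At least one column is required");
        }
        for (String column : columns) {
            if (!allowedColumns.contains(column)) {
                throw new IllegalArgumentException("Column not allowed: " + column);
            }
        }
        return "SELECT " + String.join(", ", columns) + " FROM " + table;
    }
}
```

Because only exact matches pass, something like "students; DROP TABLE students" is rejected before it ever reaches the database. This does nothing for a free-form WHERE clause, which is the hard part.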

"It all depends".
If you have an application where users can type in the where clause as free text, then yes, they can construct SQL Injection attacks. They can also grind your server to a halt by selecting huge cartesian joins.
You could create a visual query builder - use the schema metadata to show a list of tables, and once the table is selected the columns, and for each column the valid comparisons. You can then construct the query as a parameterized query, and limit the human input to the comparison values, which you can in turn use as parameters.
It's a lot of work, though, and in most production systems of any scale, letting users run this kind of query is usually not particularly useful...
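To make the query-builder idea a bit more concrete, here is a minimal sketch of the back end of such a builder. It assumes the UI only ever sends a column, an operator and a value; the class name WhereBuilder and the column and operator lists are hypothetical. Column and operator are checked against fixed sets, and the value is never concatenated into the SQL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: column and operator lists are illustrative only.
final class WhereBuilder {
    private static final Set<String> COLUMNS = Set.of("id", "name", "year");
    private static final Set<String> OPERATORS = Set.of("=", "<>", "<", "<=", ">", ">=", "LIKE");

    private final List<String> conditions = new ArrayList<>();
    private final List<Object> parameters = new ArrayList<>();

    // Adds "column operator ?" and remembers the value for later binding.
    WhereBuilder and(String column, String operator, Object value) {
        if (!COLUMNS.contains(column)) {
            throw new IllegalArgumentException("Column not allowed: " + column);
        }
        if (!OPERATORS.contains(operator)) {
            throw new IllegalArgumentException("Operator not allowed: " + operator);
        }
        conditions.add(column + " " + operator + " ?");
        parameters.add(value);
        return this;
    }

    // Returns the WHERE clause with placeholders, or an empty string if no condition was added.
    String sql() {
        return conditions.isEmpty() ? "" : " WHERE " + String.join(" AND ", conditions);
    }

    // Values in placeholder order, to be bound with PreparedStatement.setObject(i + 1, value).
    List<Object> parameters() {
        return new ArrayList<>(parameters);
    }
}
```

The human input ends up only in parameters(), which is the one place a prepared statement can actually protect.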

It's insecure to allow users to execute arbitrary queries. This is the kind of thing you'd see at Equifax. You don't want to allow it.
Prepared statements don't help make SQL expressions safe. Using parameters in prepared statements helps make values safe. You can use a parameter only in the place where you would normally put a constant value, like a number, a quoted string, or a quoted date.
The easiest solution would be to NOT allow arbitrary queries or expressions on demand.
Instead, allow users to submit their custom query for review.
The query is reviewed by a human being, who may authorize the stored query to be run by the user (or other users). If you think you can develop some kind of automatic validator, be my guest, but IMHO that's bound to be a lot more work than just having a qualified database administrator review it.
Subsequently, the user is allowed to run the stored query on demand, but only by its id.
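A minimal sketch of that workflow, assuming an in-memory store (a real one would live in a database table). The class QueryRegistry and its method names are hypothetical. The point is that the application only ever executes SQL it looked up by id, and only after a reviewer approved it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a submit / review / run-by-id registry.
final class QueryRegistry {
    private final Map<Integer, String> pending = new ConcurrentHashMap<>();
    private final Map<Integer, String> approved = new ConcurrentHashMap<>();
    private final AtomicInteger nextId = new AtomicInteger(1);

    // A user submits a query for review and gets back its id.
    int submit(String sql) {
        int id = nextId.getAndIncrement();
        pending.put(id, sql);
        return id;
    }

    // Called by the reviewer (e.g. a DBA) once the query has been checked by hand.
    void approve(int id) {
        String sql = pending.remove(id);
        if (sql == null) {
            throw new IllegalArgumentException("No pending query with id " + id);
        }
        approved.put(id, sql);
    }

    // The only way to obtain executable SQL: by id, and only if approved.
    String approvedSql(int id) {
        String sql = approved.get(id);
        if (sql == null) {
            throw new IllegalStateException("Query " + id + " is not approved");
        }
        return sql;
    }
}
```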
Here's another alternative idea: users who want to run custom queries can apply to get a replica of the database, to host on their own computer. They will get a dump of the subset of data they are authorized to view. Then if they run queries that trash the data, or melt their computer, that's their business.

Related

Change summary after executing SQL query

I am trying to log a “change summary” from each INSERT/UPDATE MySQL/SQL Server query that executes in a Java program. For example, let’s say I have the following query:
Connection con = ...
PreparedStatement ps = con.prepareStatement("INSERT INTO cars (color, brand) VALUES (?, ?)");
ps.setString(1, "red");
ps.setString(2, "toyota");
ps.executeUpdate();
I want to build a “change set” from this query so I know that one row was inserted into the cars table with the values color=red and brand=toyota.
Ideally, I would like MySQL/SQL Server to tell me this information as that would be the most accurate. I want to avoid using a Java SQL parser because I may have queries with “IF EXISTS BEGIN ELSE END”, in which case I would want to know what was the final query that was inserted/updated.
I only want to track INSERT/UPDATE queries. Is this possible?

What ORM do you use? If you don't use one, now could be the time to start. You give the impression that you have all these prepared statements scattered throughout the code, which is something that needs improving anyway.
Using something like Hibernate means you can just activate its logging and keep the query/parameter data. It might also make you focus your data layer a bit more (if it's a bit haphazardly structured right now).
If you're not willing to switch to using an ORM, consider creating your own class, perhaps called LoggingPreparedStatement, that behaves identically to a normal PreparedStatement (a subclass or wrapper of PreparedStatement that uses all the same method names, so it's a drop-in replacement) and logs whatever you want. Use find/replace across the code base to switch to using it.
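One way to get such a drop-in wrapper without hand-writing every PreparedStatement method is a dynamic proxy. This is only a sketch under some assumptions: the class name LoggingStatements and the log format are made up, the log is a plain list of strings, and only setXxx(int index, value) calls are treated as parameters.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.sql.PreparedStatement;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: wraps a PreparedStatement and records what was executed.
final class LoggingStatements {
    // Returns a proxy that forwards every call to the delegate, remembers
    // parameters set by index, and appends "sql {index=value, ...}" to the
    // log each time an execute* method is called.
    static PreparedStatement wrap(PreparedStatement delegate, String sql, List<String> log) {
        Map<Integer, Object> parameters = new TreeMap<>();
        return (PreparedStatement) Proxy.newProxyInstance(
                LoggingStatements.class.getClassLoader(),
                new Class<?>[] {PreparedStatement.class},
                (proxy, method, args) -> {
                    String name = method.getName();
                    if (name.startsWith("set") && args != null && args.length >= 2
                            && args[0] instanceof Integer) {
                        // setNull(index, sqlType) carries a type code, not a value.
                        parameters.put((Integer) args[0], name.equals("setNull") ? null : args[1]);
                    } else if (name.startsWith("execute")) {
                        log.add(sql + " " + parameters);
                    }
                    try {
                        return method.invoke(delegate, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the original SQLException
                    }
                });
    }
}
```

Usage would look something like LoggingStatements.wrap(con.prepareStatement(sql), sql, log). Note that this records what the client sent, not what the database actually changed, so it doesn't cover the "IF EXISTS" case from the question.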
As an alternative to doing it on the client side, you can get the database to do it. SQL Server has change tracking; I don't know what there is for MySQL, but it'll be something proprietary. For something consistent, most databases have triggers with some mechanism for identifying old and new data, and you can stash this in one or more history tables to see what was changed and when. Triggers that keep history have a regularity to their code that means they can be programmatically generated from a list of the table columns and datatypes, so you can query the db for the column names (most databases have some virtual tables that tell you about the real tables), generate your triggers in code, and (re)apply them whenever the schema changes.
The advantage of using triggers is that they really easily identify the data that was changed. The disadvantage is that this is all they can see, so if you want your trigger to know more (stuff like who ran the query and what the query was) you have to add that info to the table or the session so the trigger can access it. If you're not willing to add useless columns to a table (and indeed, why should you) you can rename all your tables and provide a set of views that select from the new names and carry the old names. These new views can expose extra columns that your client side can update, and the views themselves can have INSTEAD OF triggers that update the real tables. That doesn't help for deletions though, because deleting data doesn't need any data from the client, so the whole thing is a mess. If you were going that wholesale on your DB you'd just switch to using stored procedures for your data modifications and embark on a massive job to change your client side calls.
An alternative that is also well leveraged for SQL Server is the CONTEXT_INFO variable, a 128-byte block of binary data that lives for the life of your connection/session, or its newer upgrade SESSION_CONTEXT, a 256 KB set of key-value pairs.
If you're building something at the client side that logs the user, query and parameter data, and you're also building a trigger that logs the data change, you could use these variables, programmatically set at the start of each data modification statement, to give your trigger something more involved than "what is the current time" to identify which triggered dataset relates to which query logged. For example, generate a GUID in the client and pass it to the db in some globally readable way so that the database trigger can see it and log it in the history table, tying the client-side log of the statement and parameters to the server-side set of logged row changes.

Filtering executed SQL queries on application server

First of all, I know this is bad practice but regardless I'm still looking for an answer.
In our web application we have a textarea where the user can write SQL to bring in custom data sets and view them in a chart. The way this works is essentially taking the written string and executing it as a query. What I'm looking for is everything I need to implement in our application server back end security wise as to disallow the execution of queries that produce results other than SELECT type queries.
The user won't be able to execute any type of SELECT query he wants, since the app server backend expects the returned result set to have 2 columns named X_FIELD and Y_FIELD, so we're not so much worried about the user being able to view data as about him executing SQL that will break the database.
What we thought of doing is parsing the string for keywords such as DROP, ALTER, CREATE etc. Are there specific things that we have to look out for? Is there a tool/library that automates this? We're using java for our back end code.

Filtering queries can be done at the application level but it requires much more database-specific expertise than creating separate security systems for each database.
As an example, I created an open source program that can do this for Oracle. It won't solve your problem but the code can at least help explain why this is a bad idea.
First, it's important to understand that Oracle SQL syntax is much more complicated than most programming languages, such as Java.
Oracle has 2175 keywords and almost none of them are reserved. Forget about parsing SQL - none of the existing 3rd party parsers are accurate enough to do this securely.
Luckily a full parser is not needed for this task. Oracle syntax is structured in such a way that any statement can be classified with only 8 tokens, excluding whitespace and comments.
But building a tokenizer and a statement classifier is still difficult. That solution will handle unusual kinds of selects, such as (select * from dual) or with asdf as (select 1 a from dual) select a from asdf;. But even a SELECT statement can cause changes to the database, either through PL/SQL hidden in a function or type, or by locking rows through a for update.
And don't forget to remove the (sometimes optional) terminator. They work fine in most IDEs, but they are not allowed in dynamic SQL. Don't just remove the last characters, or the last token, because some SELECT statements allow semicolons in the middle.
That's a lot of work for just one database! If you want to use this method to implement security policies you need almost 100% accuracy. Very few people are fanatical enough about any database to build this. There's no chance you can do this for multiple databases.

Can I pass table name as argument to a java prepared statement?

I am writing a DAO layer in Java for my Tomcat server application.
I wish to use prepared statements to wrap my queries (1. parsing queries once, 2. defending against SQL injection).
My db design contains a MyISAM table per data source system, and most of the queries through DBO are selects using different table names as arguments.
Some of these tables may be created on the fly.
I have already gone through many posts that explain that I may not use a table name as an argument for a prepared statement.
I have found solutions that suggest using some type of function (e.g. mysql_real_escape_string) that processes this argument and appends the result as a string to the query.
Is there any built-in Java library function that does this in an optimized way, or can you suggest something else to do in the DAO layer (I would prefer not to add any routines to the DB itself)?

Are you able to apply restrictions to the table names? That may well be easier than quoting. For example, if you could say that all table names had to match a regex of [0-9A-Za-z_]+ then I don't think you'd need any quoting. If you need spaces, you could probably get away with always using `table name` - but again, without worrying about "full" quoting.
Restricting what's available is often a lot simpler than handling all the possibilities :)
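A minimal sketch of that restriction, using the [0-9A-Za-z_]+ pattern from above. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rejects any table name containing characters
// outside letters, digits and underscore.
final class TableNames {
    private static final Pattern SAFE = Pattern.compile("[0-9A-Za-z_]+");

    // Returns the name unchanged if it is safe to concatenate into SQL,
    // otherwise throws.
    static String requireSafe(String name) {
        if (name == null || !SAFE.matcher(name).matches()) {
            throw new IllegalArgumentException("Illegal table name: " + name);
        }
        return name;
    }
}
```

The DAO could then build something like "SELECT * FROM " + TableNames.requireSafe(tableName) + " WHERE id = ?" and still use a prepared statement for the values.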

If you want to be extra safe, then you can prepare a query and call it with the supplied table name to check that it really exists:
PreparedStatement ps = conn.prepareStatement(
    "SELECT 1 FROM information_schema.tables WHERE table_schema = DATABASE() AND table_name = ?");
ps.setString(1, nameToCheck);
if (!ps.executeQuery().next())
    throw new RuntimeException("Illegal table name: " + nameToCheck);
(The query might need some correction because I don't have MySQL at hand at the moment.)

build oracle sql query dynamically from java application

How do I build an Oracle PL/SQL query dynamically from a Java application? The user will be presented with a bunch of columns that are present in different tables in the database. The user can select any set of columns and the application should build the complete select query using only the tables that contain the selected columns.
For example, let's consider that there are 3 tables in the database. The user selects col11, col22. In this case, the application should build the query using Tabl1 and Tabl2 only.
How do I achieve this?
Tabl1
- col11
- col12
- col13
Tabl2
- fkTbl1
- col21
- col22
- col23
Tabl3
- col31
- col32
- col33
- fkTbl1

Ad hoc reporting is an old favourite. It frequently appears as a one-liner at the end of the Reports Requirements section: "Users must be able to define and run their own reports". The only snag is that ad hoc reporting is an application in its own right.
You say
"The user will be presented with a bunch of columns that are present in different tables in the database."
You can avoid some of the complexities I discuss below if the "bunch of columns" (and the spread of tables) is preselected and tightly controlled. Alas, it is in the nature of ad hoc reporting that users will want pretty much all columns from all tables.
Let's start with your example. The user has selected col11 and col22, so you need to generate this query:
SELECT tabl1.col11
, tabl2.col22
FROM tabl1 JOIN tabl2
ON (TABL1.ID = TABL2.FKTABL1)
/
That's not too difficult. You just need to navigate the data dictionary views USER_CONSTRAINTS and USER_CONS_COLUMNS to establish the columns in the join condition - providing you have defined foreign keys (please have foreign keys!).
Things become more complicated if we add a fourth table:
Tabl4
- col41
- col42
- col43
- fkTbl2
Now when the user chooses col11 and col42 you need to navigate the data dictionary to establish that Tabl2 acts as an intermediary table to join Tabl4 and Tabl1 (presuming you are not using composite primary keys, as most people don't). But suppose the user selects col31 and col41. Is that a legitimate combination? Let's say it is. Now you have to join Tabl4 to Tabl2 to Tabl1 to Tabl3. Hmmm...
And what if the user selects columns from two completely unrelated tables - Tabl1 and Tabl23? Do you blindly generate a CROSS JOIN or do you hurl an exception? The choice is yours.
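To give a feel for the navigation involved, here is a minimal sketch that treats the foreign keys as an undirected graph and finds the chain of tables between two selected tables with a breadth-first search. It assumes the foreign keys have already been read from the data dictionary; the class JoinPathFinder is hypothetical, and it ignores composite keys and multiple foreign keys between the same pair of tables.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: finds which tables must be joined to connect two tables.
final class JoinPathFinder {
    private final Map<String, Set<String>> neighbours = new HashMap<>();

    // Registers a foreign key from childTable to parentTable (usable in both directions).
    void addForeignKey(String childTable, String parentTable) {
        neighbours.computeIfAbsent(childTable, k -> new TreeSet<>()).add(parentTable);
        neighbours.computeIfAbsent(parentTable, k -> new TreeSet<>()).add(childTable);
    }

    // Breadth-first search; returns the tables from 'from' to 'to' in join order,
    // or an empty list if the two tables are unrelated.
    List<String> path(String from, String to) {
        Map<String, String> previous = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        previous.put(from, null);
        queue.add(from);
        while (!queue.isEmpty()) {
            String table = queue.remove();
            if (table.equals(to)) {
                LinkedList<String> result = new LinkedList<>();
                for (String t = to; t != null; t = previous.get(t)) {
                    result.addFirst(t);
                }
                return result;
            }
            for (String next : neighbours.getOrDefault(table, Set.of())) {
                if (!previous.containsKey(next)) {
                    previous.put(next, table);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }
}
```

With the four example tables, asking for the path from Tabl4 to Tabl3 gives Tabl4, Tabl2, Tabl1, Tabl3, and an empty result is the point where you decide between a CROSS JOIN and an exception.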
Going back to that first query, it will return all the rows in both tables. Almost certainly your users will want the option to restrict the result set. So you need to offer them the ability to add filters to the WHERE clause. Gotchas here include:
ensuring that supplied values are of an appropriate data-type (no strings for a number, no numbers for a date)
providing look-ups to reference data values
handling multiple values (IN list rather than equals)
ensuring date ranges are sensible (opening bound before closing bound)
handling free text searches (are you going to allow it? do you need to use TEXT indexes or will you run the risk of users executing LIKE '%whatever%' against some CLOB column?)
The last point highlights one risk inherent in ad hoc reporting: if the users can assemble a query from any tables with any filters they can assemble a query which can drain all the resources from your system. So it is a good idea to apply profiles to prevent that happening. Also, as I have already mentioned, it is possible for the users to build nonsensical queries. Bear in mind that you don't need very many tables in your schema to generate too many permutations to test.
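A few of the filter gotchas listed above are cheap to check on the Java side before any SQL is generated. This is only a sketch with hypothetical names (FilterChecks and its methods); it assumes dates arrive as ISO yyyy-MM-dd strings and that the column name passed to inList has already been validated elsewhere.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Collections;

// Hypothetical sketch of input checks for user-supplied filter values.
final class FilterChecks {
    // Rejects strings where a number is expected.
    static long requireNumber(String raw) {
        try {
            return Long.parseLong(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("Not a number: " + raw);
        }
    }

    // Parses both bounds and checks the opening bound is not after the closing bound.
    static LocalDate[] requireDateRange(String from, String to) {
        try {
            LocalDate start = LocalDate.parse(from);
            LocalDate end = LocalDate.parse(to);
            if (start.isAfter(end)) {
                throw new IllegalArgumentException("Range is reversed: " + from + " > " + to);
            }
            return new LocalDate[] {start, end};
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException("Not a date: " + e.getParsedString());
        }
    }

    // Builds "column IN (?, ?, ...)" with one placeholder per value;
    // the column name is assumed to be validated by the caller.
    static String inList(String column, int valueCount) {
        if (valueCount < 1) {
            throw new IllegalArgumentException("IN list needs at least one value");
        }
        return column + " IN (" + String.join(", ", Collections.nCopies(valueCount, "?")) + ")";
    }
}
```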
Finally there is the tricky proposition of security policies. If users are restricted to seeing subsets of data on the basis of their department or their job role, then you will need to replicate those rules. In such cases the automatic application of policies through Row Level Security is a real boon.
All of which might lead you to conclude that the best solution would be to persuade your users to acquire an off-the-shelf product instead. Although that approach isn't without its own problems.

The way that I've done this kind of thing in the past is to simply construct the SQL query on the fly using a StringBuilder and then execute it using a non-prepared JDBC statement. This is rather inefficient since the Oracle DB has to repeat all of the query analysis and optimization work for each query.

When to 'IN' and when not to?

Let's presume that you are writing an application for a retail store chain. So, you would design your object model such that you would define 'Store' as the core business object and lots of supporting objects. Let's say 'Store' looks like follows:
class Store implements Validatable{
int storeNo;
int storeName;
... etc....
}
So, your client tells you that you have to import store schedule from a excel sheet into the application and you would have to run a series of validations on 'em. For instance, 'StoreIsInSameCountry';'StoreIsValid'... etc. So, you would design a Rule interface for checking all business conditions. Something like this:
interface Rule<T extends Validatable> {
public Error check(T value) throws Exception;
}
Now, here comes the question. I am uploading 2000 stores from this excel sheet. So, I would end up running each rule defined for a store that many times. If I were to have 4 rules = 8000 queries to the database, i.e, 16000 hits to the connection pool. For a simple check where I would just have to check whether the store exists or not, the query would be:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID = ?
That way I would obtain my 'Store' object. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
Alternatively, I could just do:
SELECT STORE_ATTRIB1, STORE_ATTRIB2... from STORE where STORE_ID in (1,2,3..... )
This query would actually return much faster than doing the one above it 2000 times.
However, it doesn't go well with the design that a Rule can be run for a single store only.
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?
What would you do if you were in my shoes, and what is the best practice?

That way I would obtain my 'Store' object from the database. When I don't get anything from the database, then that store doesn't exist. So, for such a simple check, I would have to hit the database 2000 times for 2000 stores.
This is what you should not do.
Create a temporary table, fill the table with your values and JOIN this table, like this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM temptable tt
JOIN STORE s
ON s.STORE_ID = tt.id
or this:
SELECT STORE_ATTRIB1, STORE_ATTRIB2...
FROM STORE s
WHERE s.STORE_ID IN
(
SELECT id
FROM temptable tt
)
I know using IN is not a suggested methodology. So, what do you think I should be doing? Should I go ahead and use IN here, coz it gives better performance in this scenario? Or should I change my design?
IN filters duplicates out.
If you want each eligible row to be selected for each duplicate value in the list, use JOIN.
IN is in no way a "not suggested methodology".
In fact, there was a time when some databases did not support IN queries efficiently, which is why folk wisdom still advises against using it.
But if your store_id is indexed properly (and it most probably is, if it's a PRIMARY KEY which it looks like), then all modern versions of major databases (that is Oracle, SQL Server, MySQL and PostgreSQL) will use an efficient plan to perform this query.
See this article in my blog for performance details in SQL Server:
IN vs. JOIN vs. EXISTS
Note, that in a properly designed database, validation rules are also set-based.
I. e. you implement your validation rules as queries against the temptable.
However, to support legacy rules, you can select values from temptable row-by-agonizing-row, apply the rules, and delete values which did not pass validation.

SELECT store_id FROM store WHERE store_active = 1
or even
SELECT store_id FROM store
will tell you all the active stores in a single query. You can now conduct the other tests on stores you know to exist, and you've saved yourself 1,999 hits to the database.
If you've got relatively uncontested database access, and no time constraint on how long the whole thing is going to take then you've no real need to worry about hitting the connection pool over and over again. That's what it's designed for, after all!
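As a sketch of the single-query approach: load the ids returned by that one SELECT into a set, then do the existence check in memory for every uploaded row. The class and method names are hypothetical, and the set is assumed to have been filled from the result set already.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: existence check done in memory after one query.
final class StoreExistenceCheck {
    // existingIds: ids loaded once from "SELECT store_id FROM store".
    // uploadedIds: ids read from the spreadsheet.
    // Returns the uploaded ids that do not exist, in upload order.
    static List<Integer> missing(Set<Integer> existingIds, List<Integer> uploadedIds) {
        List<Integer> missing = new ArrayList<>();
        for (Integer id : uploadedIds) {
            if (!existingIds.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }
}
```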

I think it's more of a business question, with the parameters being how often the client runs the import, how long it would take you to implement either solution, and how expensive your time is per hour.
If it's something that runs once in a while, a bit of bad performance is acceptable in my opinion, especially if you can get the job done quick using clean code.

...a Rule can be run for a single store only.
Managing business rules along with performance is a tricky task, so there is a library ("Persistence Layer") that does exactly that. You define rules, then execute a bulk of commands, then the library fetches from the DB whatever the rules require in a single query (by using temp tables rather than 'IN') and passes it to the rules.
There is an example of a validator in here.
