Is there a reliable way to sanitize database input in Java without using prepared statements?
All the answers I have found suggest using PreparedStatement, but I am trying to avoid the extra round trip to the database server.
-- Additional Info --
My queries will be very simple and very few will share the same format so there's little performance advantage of any query plan caching.
The database server will be located on a separate physical location, though still in the same LAN, so there will be extra network bottleneck with the extra round trip required when using prepared statements.
What I'm hoping to find is something like this, which exists in C, Python and PHP:
http://dev.mysql.com/doc/refman/5.0/en/mysql-real-escape-string.html
You can use escapeSql in the org.apache.commons.lang.StringEscapeUtils:
String username = "'; or 1=1";
String saneUsername = StringEscapeUtils.escapeSql(username);
// turns into "''; or 1=1"
String sql = "select username from users where username = '" + saneUsername + "'";
// select username from users where username = '''; or 1=1'
But this is really bad practice. You should always use prepared statements.
Saves memory on the SQL server.
Allows you to reuse SQL statements without reparsing them, which improves performance.
More secure: with manual escaping you might forget to sanitize some input.
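A further reason to distrust hand escaping is that quote doubling alone is not always enough. The sketch below is only an illustration: `doubleQuotes` is a hypothetical stand-in for an escaper that does nothing but double single quotes, and it assumes a database such as MySQL in its default mode, where a backslash also escapes the quote that follows it.

```java
public class QuoteDoublingDemo {
    // Hypothetical stand-in for a quote-doubling escaper:
    // every single quote becomes two single quotes, nothing else changes.
    static String doubleQuotes(String input) {
        return input.replace("'", "''");
    }

    // Builds the query by concatenation, as in the escapeSql example.
    static String buildQuery(String username) {
        return "select username from users where username = '"
                + doubleQuotes(username) + "'";
    }

    public static void main(String[] args) {
        // Input: a backslash, a single quote, then SQL text.
        String hostile = "\\' or 1=1 -- ";
        System.out.println(buildQuery(hostile));
    }
}
```

The generated text contains `'\'' or 1=1 -- '`. Under the backslash assumption, `\'` is read as a literal quote, the next quote closes the string, and `or 1=1` is executed as SQL. A bound parameter never goes through this kind of text substitution in the first place.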
This sounds like a classic case of premature optimization. Yes, PreparedStatements do a little unnecessary work, the same way ArrayLists allocate unnecessary memory. But far and away the benefit of these utilities outweighs their extra costs, and the JIT (or the JDBC driver, in PreparedStatement's case) can improve over time to make them even less expensive.
On the other hand if you re-invent the wheel manually, you're liable to make mistakes, and forever on the hook to ensure this actually remains performant into the future. Suppose you invest a lot of effort into building a secure alternative to PreparedStatement, only to discover next year that a driver update actually renders the PreparedStatement call more efficient than your implementation - it's not a far-fetched idea.
Performance concerns are important, but avoid getting hung up on squeezing every last drop out of your application. Simply doing something that works is likely to be faster, safer, and easier to upgrade in the future than investing large amounts of time into one small aspect of your project. Put another way, why work around the wonderful benefits PreparedStatement offers until you've conclusively decided it's preventing you from improving further?
Related
I know (or think I know) that using things like prepared statements can help future executions of the same query execute faster. However, I was wondering, if you're using prepared statements but the actual values are the same every time, will it then also additionally optimize using the value?
To give a little more context, I want to test performance for a service request that uses an underlying database. The easy route would be to send in the same data each time. The more arduous route would be to ensure the data values were different each time. However, in either case, the same SQL query would be generated -- just the values would be different. So, will these scenarios end up testing the same thing or something different because of potential DB optimization?
I've tried to research this topic but I feel like a lot of what I'm reading is over my head. Any good links for someone that knows little about DB optimization would also be welcomed in addition to the central question.
It depends on exactly what you are doing and measuring. I would expect, though, that you'd need to use different values in order to get realistic results.
Caching
If you send the same values every time, you can probably guarantee that the particular row(s) that you're interested in are always going to be cached (in the buffer cache, in the file system cache, in the SAN cache, etc.) which is probably not terribly realistic if the set of possible inputs is large. On the other hand, if there are a small number of potential inputs and you're reasonably confident that the rows of interest will always be cached (for example, if you know that some other activity that takes place just before your service is called will cause the data you're interested in to be cached in memory before your service is called) then perhaps this is a realistic assumption.
Optimization
Ignoring caching, we can look at how the optimizer would treat the two cases. If you are generating SQL queries with embedded literals (a bad practice that is particularly harmful in Oracle but one that is very common), then you are generating different SQL statements. As far as Oracle is concerned
SELECT *
FROM emp
WHERE deptno = 10
is a completely different statement from
SELECT *
FROM emp
WHERE deptno = 20
There are some settings (i.e. cursor_sharing) you can tweak to ask Oracle to treat these two as identical queries (by having Oracle force them into using bind variables) but that is not without its own downsides and is generally only recommended when you're trying to apply a band-aid to a poorly written application while you work on refactoring the application to use bind variables properly.
Assuming that you are generating queries using bind variables in your application, preparing the statement, and then binding different values before executing the query multiple times, i.e.
SELECT *
FROM emp
WHERE deptno = :1
then you get into the realm of histograms, bind variable peeking, and adaptive cursor sharing. This can get pretty involved and depends heavily on the version of Oracle you're using, the edition you're using, and how you've configured the optimizer to work. I'll try to give a simplified high-level overview here-- if you want to delve too much deeper into one of these, we'll probably want a separate question.
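As a small illustration of the difference between the two styles, the sketch below counts how many distinct statement texts an application would send for the same sequence of department numbers. It is a toy model, not a measurement; the class and method names are made up, and JDBC writes the bind placeholder as `?` rather than `:1`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class StatementTextDemo {
    // One fixed text, whatever value is bound later.
    static final String WITH_BIND = "SELECT * FROM emp WHERE deptno = ?";

    // Embedding the literal produces a new statement text per value.
    static String withLiteral(int deptno) {
        return "SELECT * FROM emp WHERE deptno = " + deptno;
    }

    // Counts the distinct statement texts generated for the given values.
    static int distinctTexts(int[] deptnos, boolean useBind) {
        Set<String> texts = new LinkedHashSet<>();
        for (int deptno : deptnos) {
            texts.add(useBind ? WITH_BIND : withLiteral(deptno));
        }
        return texts.size();
    }

    public static void main(String[] args) {
        int[] deptnos = {10, 20, 10, 30};
        System.out.println("literals: " + distinctTexts(deptnos, false));
        System.out.println("bind:     " + distinctTexts(deptnos, true));
    }
}
```

With literals, every new value is a new statement as far as the database is concerned; with a bind variable there is a single statement text, and the value-dependent behaviour described in the following sections comes into play.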
Histograms
By default, the optimizer assumes that data is equally spaced and equally likely. So, for example, if the deptno column has 50 distinct values, the optimizer assumes by default that each value is equally likely. That's probably a pretty reasonable assumption for most columns but it's obviously not reasonable for all columns. If I have a table with all active duty military members, for example, and one of the columns is birth_year, there will be far more rows with a birth_year of 1994 (20 years ago) than 1934 (80 years ago). In these cases, you gather histograms on the column in question in order to tell the optimizer that the data isn't evenly distributed and to let the optimizer gather information about which values are more common and how common they are.
The optimizer doesn't care about the values you are passing for your bind variable values unless there is a histogram on one of the columns in your predicate (I'll ignore for the moment the possibility that you are passing a value that is out of range).
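To make the default assumption concrete, here is a deliberately simplified sketch of the arithmetic: without a histogram, the estimated number of rows for an equality predicate is roughly the table's row count divided by the number of distinct values. The 100,000-row table is a made-up figure, and a real optimizer also accounts for nulls and other statistics.

```java
public class UniformEstimate {
    // Simplified estimate for "column = value" when every distinct
    // value is assumed to be equally likely.
    static long estimatedRows(long totalRows, long distinctValues) {
        return Math.round((double) totalRows / distinctValues);
    }

    public static void main(String[] args) {
        // Hypothetical table: 100,000 rows, 50 distinct deptno values.
        System.out.println(estimatedRows(100_000, 50));
    }
}
```

The estimate is the same for every value of the bind variable, which is why the actual value passed in does not matter to the plan unless a histogram says the distribution is skewed.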
Bind variable peeking
If you do have a histogram on one or more columns, then Oracle (9.1 and later if memory serves) will "peek" at the first value that is passed in for a bind variable and use that value with the histogram to determine the best plan for all subsequent executions. This works reasonably well the vast majority of the time but it occasionally leads to hair-pullingly painful problems (and much swearing) when Oracle peeks at a "bad" value and generates a plan that is efficient for that one execution but terrible for all future executions. This is summed up by Tom Kyte's story about the database that has to be restarted if it's rainy on a Monday morning. If you have a histogram on the column and different values that you might pass in would likely benefit from different query plans, you'd likely want to take bind variable peeking into consideration to determine if passing in values in a different order created any performance issues.
Adaptive cursor sharing
In recent versions (if memory serves 11.1 and later) and depending on your configuration, Oracle can use adaptive cursor sharing to maintain multiple query plans for a single statement and to use the most appropriate version for the particular bind variable value that is passed in. This is a much more sophisticated version of bind variable peeking that peeks for each set of values you pass in and figures out whether it is close enough to some other set of values to use the previously generated plan or whether it needs to compute a new plan for the new set of values. Figuring out what constitutes "close enough" and how this interacts with various features for ensuring plan stability is a rather involved topic in its own right.
you could use db caching
http://www.oracle.com/technetwork/articles/sql/11g-caching-pooling-088320.html
If the app still makes a network round trip and calculates results, that will still eat considerable time.
I need some help with JDBC performance optimization. One of our POJOs uses JDBC to connect to an Oracle database and retrieve records. The records are email addresses, based on which emails are sent to the users. The problem is performance. This process runs every weekend and the number of records is large, around 100k.
The performance is very slow and it worries us a lot. Only 1000 records seem to be fetched from the database every 1 hour, which means that it will take 100 hours for this process to complete (which is very bad). Please help me on this.
The database server and the java process are in two different remote servers. We have used rs_email.setFetchSize(1000); hoping that it would make any difference but no change at all.
The same query executed on the server takes 0.35 seconds to complete. Any quick suggestion would be of great help to us.
Thanks,
Aamer.
First look at your queries. Analyze them. See if the SQL could be made more efficient (i.e., ask the database for what you want, not for what you don't want; it makes a big difference). Also check whether there are indexes on the fields in your where and join clauses. Indexes make a big difference, but it can't be just any indexes. They have to be good indexes (i.e., the fields that make up the index are selective enough for the database to retrieve rows efficiently). Work with your DBA on this. Look for either high run time against the DB or queries with high CPU usage (even if the queries run sub-second). These are the things that can kill your database.
Also from a code perspective, check to see if you are opening and closing your connections or if you are re-using them. Can make a big difference too.
It would help to post your code, queries, table layouts, and any indexes you have.
Use log4jdbc to get the real sql for fetching single record. Then check speed and plan for that sql. You may need a proper index or even db defragmentation.
Not sure about the Oracle driver, but I do know that the MySQL driver supports two different results retrieval methods: "stream" and "wait until you've got it all".
The streaming method lets you start processing the results the moment the first row is returned from the query, whereas the other method retrieves the entire result set before you can start work on it. With huge result sets, the second method often leads to out-of-memory errors, or to slow performance because Java hits the "memory roof" and the garbage collector can't throw away "used" records the way it can in streaming mode.
The streaming mode doesn't let you navigate/scroll the result set the way the "normal"/"wait until you've got it all" mode does.
Anyway, not sure if this is of any help but it might be worth checking out.
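For reference, this is roughly how the streaming mode is requested with MySQL Connector/J, as far as I know from its documentation: a forward-only, read-only statement whose fetch size is set to `Integer.MIN_VALUE`. The class and method names below are made up for the sketch, and other drivers (Oracle included) treat the fetch size differently.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MysqlStreaming {
    // Returns a statement configured for row-by-row streaming.
    // Assumption: the connection comes from MySQL Connector/J, which
    // interprets Integer.MIN_VALUE as "stream rows one at a time".
    static Statement openStreamingStatement(Connection connection) throws SQLException {
        Statement statement = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        statement.setFetchSize(Integer.MIN_VALUE);
        return statement;
    }
}
```

The caller then runs `executeQuery` on the returned statement and iterates with `next()`. With this mode the result set generally has to be read to the end or closed before the same connection can run another query.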
My answer to your question, in summary is:
1. Check network
2. Check SQL
3. Check Java code.
It sounds very slow. First thing to check would be to see if you have a slow network. You can do this pretty quickly by just pinging the database server. Or run the database server on the same machine as your JVM. If it is not the network, get an explain plan for your SQL and ensure you are not doing table scans when you don't need to be. If it is not the network or the SQL, then it's time to check your Java code. Are you doing anything like blocking when you shouldn't be?
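To put the network step into numbers, here is a rough back-of-the-envelope sketch. It assumes one network round trip per fetch batch and a hypothetical 2 ms round-trip time taken from a ping; the fetch size of 10 is, as far as I know, the Oracle JDBC driver's default row prefetch.

```java
public class RoundTripEstimate {
    // Number of fetch round trips needed to pull all rows.
    static long roundTrips(long rows, int fetchSize) {
        return (rows + fetchSize - 1) / fetchSize;
    }

    // Time spent purely on round trips, in seconds.
    static double networkSeconds(long rows, int fetchSize, double roundTripMillis) {
        return roundTrips(rows, fetchSize) * roundTripMillis / 1000.0;
    }

    public static void main(String[] args) {
        long rows = 100_000;          // figure from the question
        double pingMillis = 2.0;      // hypothetical measured ping
        System.out.println(networkSeconds(rows, 10, pingMillis) + " s at fetch size 10");
        System.out.println(networkSeconds(rows, 1000, pingMillis) + " s at fetch size 1000");
    }
}
```

Under these assumptions even the small fetch size costs about 20 seconds of round trips, nowhere near 100 hours. If the measured ping is in that range, the time is more likely going into the per-row work in the Java code (or a much slower link than assumed) than into fetching.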
I have a database log appender that inserts a variable number of log lines into the database every once in a while.
I'd like to create an SQL statement in a way that prevents SQL injection, but not using server-side prepared statements (because I have a variable number of rows in every select, caching them won't help but might hurt performance here).
I also like the convenience of prepared statements, and prefer them to string concatenation. Is there something like a 'client side prepared statement'?
It sounds like you haven't benchmarked the simplest solution - prepared statements. You say that they "might hurt performance" but until you've tested it, you really won't know.
I would definitely test prepared statements first. Even if they do hamper performance slightly, until you've tested them you won't know whether you can still achieve the performance you require.
Why spend time trying to find alternative solutions when you haven't tried the most obvious one?
If you find that prepared statement execution plan caching is costly, you may well find there are DB-specific ways of tuning or disabling it.
Not sure if I understand your question correctly. Is there something in PreparedStatement that isn't fitting your needs?
I think that whether or not the statement is cached on the server side is an implementation detail of the database driver and the specific database you're using; if your query/statement changes over time then this should have no impact - the cached/compiled statements simply won't be used.
First, Jon's answer that you should go with the most obvious solution until performance is measured to be a problem is certainly the right approach in general.
I don't think your performance concerns are misplaced. I have certainly seen precompiled complex statements fail dramatically on the performance scale (on MS-SQL 2000). The reason is the statement was so complex that it had several potential execution paths depending on the parameters, but the compilation locked one in for one set of parameters, and the next set of parameters were too slow, whereas a recompile would force a recalculation of the execution plan more appropriate for the different set of parameters.
But that concern is very far fetched until you see it in practice.
The underlying problem here is that parameter escaping is database specific, so unless the JDBC driver for your database is giving you something non-standard to do this (highly unlikely), you are going to have to have a different library or different escaping mechanism that is very specific to this one database.
From the wording of your question, it doesn't sound like your performance concerns have yet come to the point of meriting finding (or developing) such a solution.
It should also be noted that although JDBC drivers may not all behave this way, technically according to the spec the precompilation is supposed to be cached in the PreparedStatement object, and if you throw that away and get a new PreparedStatement every time, it should not actually be caching anything, so the whole issue may be moot and would need to be investigated for your specific JDBC driver.
From the spec:
A SQL statement with or without IN parameters can be pre-compiled and stored in a PreparedStatement object. This object can then be used to efficiently execute this statement multiple times.
What's wrong with using a regular prepared statement, e.g. as in the following sketch:
Connection connection = ...;
PreparedStatement insertStatement = ...;
...
connection.setAutoCommit(false); // start a transaction
for (Item item : items)
{
    // or the typed setter (setString, setInt, ...) that matches the column
    insertStatement.setObject(1, item);
    insertStatement.executeUpdate();
}
connection.commit();
A smart database implementation will batch up several inserts into one communications exchange w/ the database server.
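If you would rather ask for batching explicitly than rely on the implementation, JDBC has `addBatch` and `executeBatch`. The sketch below assumes a hypothetical `app_log` table with a single `message` column; whether the batch really travels in one exchange still depends on the driver and its settings.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class LogBatchInsert {
    // Hypothetical table and column names.
    static final String INSERT_SQL = "INSERT INTO app_log (message) VALUES (?)";

    // Inserts any number of lines with one statement text and one batch.
    // Returns the number of statements the driver reports as executed.
    static int insertAll(Connection connection, List<String> lines) throws SQLException {
        connection.setAutoCommit(false);
        try (PreparedStatement insert = connection.prepareStatement(INSERT_SQL)) {
            for (String line : lines) {
                insert.setString(1, line);
                insert.addBatch();
            }
            int[] counts = insert.executeBatch();
            connection.commit();
            return counts.length;
        } catch (SQLException e) {
            // Undo the partial batch before reporting the failure.
            connection.rollback();
            throw e;
        }
    }
}
```

The statement text stays the same however many rows are queued, so the variable number of log lines does not produce a variable number of statements.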
I can't think of a reason why you shouldn't use prepared statements. If you're running this on a J2EE server using connection pooling the server keeps your connections open, and the server caches your access/execution plans. It's not the data it caches!
If you're closing your connection every time, then you're probably not gaining any performance. But you still get the SQL injection prevention
Most java performance tuning books will tell you the same:
Java performance tuning
Prepared Statements don't care about client or server side.
Use them and drop any SQL string concatenation. There is not a single reason to not use Prepared Statements.
I have an old MySQL 4.1 database with a table that has a few million rows and an old Java application that connects to this database and returns several thousand rows from this table on a frequent basis via a simple SQL query (i.e. SELECT * FROM people WHERE first_name = 'Bob'). I think the Java application uses client-side prepared statements, but I was looking at switching this to the server. In the example above, the value for first_name will vary depending on what the user enters.
I would like to speed up performance on the select query and was wondering if I should switch to Prepared Statements or Stored Procedures. Is there a general rule of thumb of what is quicker/less resource intensive (or if a combination of both is better)
You do have an index of first_name, right? That will speed up your query a lot more than choosing between prepared statements and stored procedures.
If you have just one query to worry about, you should be able to implement the two alternatives (on your test platform of course!) and see which one gives you the best performance.
(My guess is that there won't be much difference though ...)
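For trying both variants on a test platform: as far as I know, MySQL Connector/J chooses between client-side and server-side prepared statements through the `useServerPrepStmts` connection property, so the application code can stay the same. The helper below is only a sketch with made-up names; check the documentation for your driver version before relying on the property.

```java
import java.util.Properties;

public class PreparedStatementMode {
    // Builds connection properties for one test run.
    // serverSide = false keeps preparation in the driver (client side);
    // serverSide = true asks the driver to prepare on the MySQL server.
    static Properties connectionProperties(String user, String password, boolean serverSide) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("useServerPrepStmts", Boolean.toString(serverSide));
        return props;
    }
}
```

The result would be passed to `DriverManager.getConnection(url, props)`, once with each setting, while timing the same SELECT with the same sequence of first_name values.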
Looks like the best way is just to make the change and test it out in a test environment.
Thanks for the help.
So I keep hearing that PreparedStatements are good for performance.
We have a Java application in which we use the regular 'Statement' more than we use the 'PreparedStatement'. While trying to move towards using more PreparedStatements, I am trying to get a more thorough understanding of how PreparedStatements work - on the client side and the server side.
So if we have some typical CRUD operations and update an object repeatedly in the application, does it help to use a PS? I understand that we will have to close the PS every time otherwise it will result in a cursor leak.
So how does it help with performance? Does the driver cache the precompiled statement and give me a copy the next time I do connection.prepareStatement? Or does the DB server help?
I understand the argument about the security benefits of PreparedStatements and I appreciate the answers below which emphasize it. However I really want to keep this discussion focused on the performance benefits of PreparedStatements.
Update: When I say update data, I really mean more in terms of that method randomly being called several times. I understand the advantage in the answer offered below which asks to re-use the statement inside a loop.
// some code blah blah
update();
// some more code blah blah
update();
....
public void update() throws SQLException {
    // try-with-resources closes ps even if execute() throws
    try (PreparedStatement ps = connection.prepareStatement("some sql")) {
        ps.setString(1, "foobar1");
        ps.setString(2, "foobar2");
        ps.execute();
    }
}
There is no way to actually reuse the 'ps' java object and I understand that the actual connection.prepareStatement call is quite expensive.
Which is what brings me back to the original question. Is this "some sql" PreparedStatement still being cached and reused under the covers that I dont know about?
I should also mention that we support several databases.
Thanks in advance.
The notion that prepared statements are primarily about performance is something of a misconception, although it's quite a common one.
Another poster mentioned that he noted a speed improvement of about 20% in Oracle and SQL Server. I've noted a similar figure with MySQL. It turns out that parsing the query just isn't such a significant part of the work involved. On a very busy database system, it's also not clear that query parsing will affect overall throughput: overall, it'll probably just be using up CPU time that would otherwise be idle while data was coming back from the disk.
So as a reason for using prepared statements, the protection against SQL injection attacks far outweighs the performance improvement. And if you're not worried about SQL injection attacks, you probably should be...
Prepared statements can improve performance when re-using the same statement that you prepared:
PreparedStatement ps = connection.prepareStatement("SOME SQL");
for (Data data : dataList) {
    ps.setInt(1, data.getId());
    ps.setString(2, data.getValue());
    ps.executeUpdate();
}
ps.close();
This is much faster than creating the statement in the loop.
Some platforms also cache prepared statements so that even if you close them they can be reconstructed more quickly.
However even if the performance were identical you should still use prepared statements to prevent SQL Injection. At my company this is an interview question; get it wrong and we might not hire you.
Prepared statements are indeed cached after their first use, which is what they provide in performance over standard statements. If your statement doesn't change then it's advised to use this method. They are generally stored within a statement cache for later use.
More info can be found here:
http://www.theserverside.com/tt/articles/article.tss?l=Prepared-Statments
and you might want to look at Spring JDBCTemplate as an alternative to using JDBC directly.
http://static.springframework.org/spring/docs/2.0.x/reference/jdbc.html
Parsing the SQL isn't the only thing that's going on. There's validating that the tables and columns do indeed exist, creating a query plan, etc. You pay that once with a PreparedStatement.
Binding to guard against SQL injection is a very good thing, indeed. Not sufficient, IMO. You still should validate input prior to getting to the persistence layer.
So how does it help with performance? Does the driver cache the
precompiled statement and give me a copy the next time I do
connection.prepareStatement? Or does the DB server help?
I will answer in terms of performance. Others here have already stipulated that PreparedStatements are resilient to SQL injection (blessed advantage).
The application (through the JDBC driver) creates the PreparedStatement and passes it to the RDBMS with placeholders (the ?). The RDBMS precompiles the received statement, applying query optimization if needed, and (in some products) caches it. When the PreparedStatement is executed, the precompiled statement is used, with each placeholder replaced by its bound value. In contrast to a Statement, which is compiled every time it is executed, the PreparedStatement is compiled and optimized only once. This scenario does not hold for ALL JDBC vendors, but in essence that's how PreparedStatements are used and operated on.
Anecdotally: I did some experiments with prepared vs. dynamic statements using ODBC in Java 1.4 some years ago, with both Oracle and SQL Server back-ends. I found that prepared statements could be as much as 20% faster for certain queries, but there were vendor-specific differences regarding which queries were improved to what extent. (This should not be surprising, really.)
The bottom line is that if you will be re-using the same query repeatedly, prepared statements may help improve performance; but if your performance is bad enough that you need to do something about it immediately, don't count on the use of prepared statements to give you a radical boost. (20% is usually nothing to write home about.)
Your mileage may vary, of course.
Which is what brings me back to the original question. Is this "some sql" PreparedStatement still being cached and reused under the covers that I dont know about?
Yes at least with Oracle. Per Oracle® Database JDBC Developer's Guide Implicit Statement Caching (emphasis added),
When you enable implicit Statement caching, JDBC automatically caches the prepared or callable statement when you call the close method of this statement object. The prepared and callable statements are cached and retrieved using standard connection object and statement object methods.
Plain statements are not implicitly cached, because implicit Statement caching uses a SQL string as a key and plain statements are created without a SQL string. Therefore, implicit Statement caching applies only to the OraclePreparedStatement and OracleCallableStatement objects, which are created with a SQL string. You cannot use implicit Statement caching with OracleStatement. When you create an OraclePreparedStatement or OracleCallableStatement, the JDBC driver automatically searches the cache for a matching statement.
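The idea in the quoted passage, a cache keyed by the SQL string, can be sketched in a few lines for drivers that do not offer it. This is not Oracle's implementation, just a minimal illustration with a made-up class name; it is not thread-safe and assumes one cache per connection.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;

public class StatementCache implements AutoCloseable {
    private final Connection connection;
    // The SQL text is the cache key, as in the passage quoted above.
    private final Map<String, PreparedStatement> bySql = new HashMap<>();

    public StatementCache(Connection connection) {
        this.connection = connection;
    }

    // Prepares the statement on first use and reuses it afterwards.
    public PreparedStatement get(String sql) throws SQLException {
        PreparedStatement ps = bySql.get(sql);
        if (ps == null) {
            ps = connection.prepareStatement(sql);
            bySql.put(sql, ps);
        }
        return ps;
    }

    // Closes every cached statement; call this before closing the connection.
    @Override
    public void close() throws SQLException {
        for (PreparedStatement ps : bySql.values()) {
            ps.close();
        }
        bySql.clear();
    }
}
```

In the `update()` method from the question, `cache.get("some sql")` would replace `connection.prepareStatement("some sql")`, and the statement would no longer be closed after each call.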
1. PreparedStatement allows you to write dynamic and parametric query
By using PreparedStatement in Java you can write parameterized SQL queries and send different parameters with the same SQL text, which is a lot better than creating a different query each time.
2. PreparedStatement is faster than Statement in Java
One of the major benefits of using PreparedStatement is better performance. A PreparedStatement gets precompiled in the database, and its access plan is also cached there, which allows the database to execute a parametric query written as a prepared statement much faster than a normal query because it has less work to do. You should always try to use PreparedStatement in production JDBC code to reduce load on the database. To get the performance benefit, use only the parameterized form of the SQL query and not string concatenation.
3. PreparedStatement prevents SQL Injection attacks in Java
Read more: http://javarevisited.blogspot.com/2012/03/why-use-preparedstatement-in-java-jdbc.html#ixzz3LejuMnVL
Short answer:
PreparedStatement helps performance because typically DB clients perform the same query repetitively, and this makes it possible to do some pre-processing for the initial query to speed up the following repetitive queries.
Long answer:
According to Wikipedia, the typical workflow of using a prepared statement is as follows:
Prepare: The statement template is created by the application and sent to the database management system (DBMS). Certain values are left unspecified, called parameters, placeholders or bind variables (labelled "?" below): INSERT INTO PRODUCT (name, price) VALUES (?, ?)
(Pre-compilation): The DBMS parses, compiles, and performs query optimization on the statement template, and stores the result without executing it.
Execute: At a later time, the application supplies (or binds) values for the parameters, and the DBMS executes the statement (possibly returning a result). The application may execute the statement as many times as it wants with different values. In this example, it might supply 'Bread' for the first parameter and '1.00' for the second parameter.
Prepare:
In JDBC, the "Prepare" step is done by calling java.sql.Connection.prepareStatement(String sql) API. According to its Javadoc:
This method is optimized for handling parametric SQL statements that benefit from precompilation. If the driver supports precompilation, the method prepareStatement will send the statement to the database for precompilation. Some drivers may not support precompilation. In this case, the statement may not be sent to the database until the PreparedStatement object is executed. This has no direct effect on users; however, it does affect which methods throw certain SQLException objects.
Since calling this API may send the SQL statement to the database, it is typically an expensive call. Depending on the JDBC driver's implementation, for better performance you may have to avoid calling this API multiple times on the client side for the same SQL statement template.
Precompilation:
The sent statement template will be pre-compiled on the database and cached in the DB server. The database will probably use the connection and SQL statement template as the key, and the pre-compiled query and the computed query plan as the value in the cache. Parsing the query may require validating the tables and columns to be queried, so it can be an expensive operation, and computing the query plan is expensive too.
Execute:
For subsequent queries from the same connection with the same SQL statement template, the database server looks up the pre-compiled query and query plan directly from the cache without recomputing them.
Conclusion:
From a performance perspective, using a prepared statement is a two-phase process:
Phase 1, prepare and precompilation: this phase is expected to be done once and adds some overhead.
Phase 2, repeated executions of the same query: since phase 1 has already pre-processed the query, if the number of repetitions is large enough, this can save a lot of pre-processing effort for the same query.
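The "large enough" condition can be made concrete with a toy cost model. All figures below are hypothetical milliseconds chosen for illustration, not measurements: a plain statement pays a compile cost on every execution, while a prepared statement pays a one-time prepare cost (compile plus the extra round trip) and then only the execution cost.

```java
public class PrepareBreakEven {
    // Plain statement: compile and execute on every call.
    static double plainTotal(int executions, double compileMs, double executeMs) {
        return executions * (compileMs + executeMs);
    }

    // Prepared statement: pay the prepare cost once, then only execute.
    static double preparedTotal(int executions, double prepareMs, double executeMs) {
        return prepareMs + executions * executeMs;
    }

    public static void main(String[] args) {
        double compileMs = 0.5;   // hypothetical per-statement parse/plan cost
        double prepareMs = 1.5;   // hypothetical: compile plus one extra round trip
        double executeMs = 2.0;   // hypothetical execution cost
        for (int n : new int[] {1, 3, 10}) {
            System.out.println(n + " executions: plain=" + plainTotal(n, compileMs, executeMs)
                    + " prepared=" + preparedTotal(n, prepareMs, executeMs));
        }
    }
}
```

In this model the prepared statement breaks even once the number of executions reaches prepareMs / compileMs (three with these figures), loses below that, and wins above it. Real drivers and databases add caching on both sides, so measured numbers can differ from this simple picture.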
And if you want to know more details, there are some articles explaining the benefits of PreparedStatement:
http://javarevisited.blogspot.com/2012/03/why-use-preparedstatement-in-java-jdbc.html
http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html
Prepared statements have some advantages in terms of performance with respect to normal statements, depending on how you use them. As someone stated before, if you need to execute the same query multiple times with different parameters, you can reuse the prepared statement and pass only the new parameter set. The performance improvement depends on the specific driver and database you are using.
For instance, in terms of database performance, Oracle caches the execution plan of some queries after each computation (this is not true for all versions and all configurations of Oracle). You can find improvements even if you close a statement and open a new one, because this is done at the RDBMS level. This kind of caching is activated only if the two subsequent queries are (char-by-char) the same. This does not hold for normal statements, because the parameters are part of the query and produce different SQL strings.
Some other RDBMS can be more "intelligent", but I don't expect they will use complex pattern matching algorithms for caching the execution plans because it would lower performance. You may argue that the computation of the execution plan is only a small part of the query execution. For the general case, I agree, but.. it depends. Keep in mind that, usually, computing an execution plan can be an expensive task, because the rdbms needs to consult off-memory data like statistics (not only Oracle).
However, the argument about caching ranges from execution plans to other parts of the extraction process. Giving the RDBMS the same query multiple times (without going in depth for a particular implementation) helps it identify already computed structures at the JDBC (driver) or RDBMS level. If you don't find any particular performance advantage now, you can't rule out that an improvement will be implemented in future or alternative versions of the driver or RDBMS.
Performance improvements for updates can be obtained by using prepared statements in batch-mode but this is another story.
OK, finally there is a paper that tests this, and its conclusion is that it doesn't improve performance, and in some cases it's slower:
https://ieeexplore.ieee.org/document/9854303
PDF: https://www.bib.irb.hr/1205158/download/1205158.Performance_analysis_of_SQL_Prepared_Statements_in_CRUD_operations_final.pdf