New to SQL - Organization and Optimization of Queries - Java

For a thick-client project I'm working on, I have to remotely connect to a database (IBM i-series) and perform a number of SQL-related tasks:
1. Download/update a set of local/offline 'control' data - this data may have changed between runs unnoticed.
2. On command, download data from multiple (15-20) tables and store each table's data separately within a single Java object. The names of the tables are known, but the schema name changes between runs and can even change mid-run (as far as I know, PreparedStatements do not allow one to dynamically insert the schema).
I had considered using joins/unions/etc. to perform all of these queries as one, but the project requires me to keep the table data separated in memory (instead of one big joined lump).
3. Perform between 2 and 100+ repetitions of (2).
The last factor is that this needs to be run on high-latency (potentially dial-up) network connections using Java 1.5 on the oldest computers possible.
Currently I run 15-20 dynamically constructed PreparedStatements, but I know this to be rather inefficient (I measured, so as to avoid premature optimization à la Knuth).
What would be the most efficient and error-tolerant method of performing these tasks?
My thoughts:
Regarding (1), I really have no idea other than checking the entire table against the new table, at which point I feel I might as well just download the new (potentially and likely unchanged) table and replace the old one, but this takes more time.
For (2): Ideally I'd be able to construct something similar to an array of SELECT statements, send them all at once, and have the database return one ResultSet per internal query. From what I understand, however, neither Statement nor PreparedStatement support returning multiple ResultSet objects.
Lastly, the best way I can think of doing (3) is to batch a number of (2) operations.

There is nothing special about having moving requirements, but the single most important thing when talking to most databases is to have a connection pool in your Java application and to use it properly.
This also applies here. The IBM i DB2/400 database is quite fast, and the database driver available in the jt400 project (type 4, no native code) is quite good, so you can pull over quite a bit of data in a short while simply by generating SQL on the fly.
Note that if you only have a single schema, you can specify it on the connection and then use unqualified table names in your SQL statements. Read the JDBC properties in the InfoCenter very carefully - it is a bit tricky to get right. If you need multiple schemas, "naming=system" allows for library lists - i.e. a list of schemas in which to look for the tables, which can be very useful when done correctly. The IBM i folks can help you here.
That said, if the connection is the limiting factor, you might have a very strong case for running the "create object from tables" Java code directly on the IBM i. You should prepare now for being able to measure the traffic to the database - either with network monitoring tooling, with p6spy, or simply by going through a proxy (perhaps even a throttling one).

Ideally, you would have the database group provide you with a set of stored procedures to optimize the access to the database.
Since you don't have access, you may want to ask them if they have timestamp data in the database at the row level to see when records were modified; that way you can select only the data that's changed since some point in time.
What @ThorbjørnRavnAndersen is suggesting is moving the database code onto the IBM host and connecting to it via RMI or JMS from the client. The server code would then be an RMI or JMS server that accesses the database on your behalf and returns Java objects instead of bringing SQL result sets across the wire.
I would pass along your requirements to the database team and see if they can't do something for you. I'm sure they don't want all these remote clients bringing all the data down each time, so it would benefit them as much as it would benefit you.

Related

Database Data Filtering Best Practice

I am currently using raw JDBC to query records in a MySQL database; each record in the resulting ResultSet is ultimately extracted, placed in a domain-specific model, and stored in a List instance.
My query is: in circumstances where there is a requirement to further filter that data (incidentally, based on columns that exist in the SAME table), which of the following approaches would generally be considered best practice?
1. The issuance of further WHERE-clause calls to the database. This effectively offloads the filtering process to the database, but obviously results in an additional query or queries where multiple filters are applied consecutively.
2. Explicitly filtering the aforementioned preprocessed List at the application level, thus negating the need to make additional calls to the database each time the records are filtered.
3. Some hybrid combination of the above two approaches, perhaps where all filtering operations are initially undertaken by the database server but the results are THEN preprocessed into an application-specific model and implicitly cached in a collection for some finite amount of time. Further filter queries received within this interval would then be serviced from the data stored in the cache.
It is important to note that the database server in this scenario is actually located on an external machine, so the overhead and latency of sending query traffic over the local network also has to be factored into the approach we ultimately elect to take.
I am patently aware of the age-old mantra that stipulates: "The database server should be used to do what it's good at." However, in this scenario it just seems like a less-than-adequate solution to be making numerous calls into the database to filter data that I ALREADY HAVE at the application level.
Your thoughts and insights would be greatly appreciated.
I have used the hybrid approach on many applications with good results.
Database filtering works well, especially for columns that are indexed. This reduces network overhead since fewer rows are sent to the application.
Database filtering can be really slow for some columns depending upon the quantity of rows in the results and the lack of indexes. The network overhead can be negligible compared to database query time so application filtering may be faster for this situation.
I also find application filtering in Java easier to write and understand than complex SQL.
I usually experiment manually with plain SQL to get down to the fewest rows in a reasonable time, then write Java to refine that to the desired rows.
I appreciate this question, as I faced a similar situation a few days back. Since you have already discussed all the available options, I would prefer the second option: handling the filtering at the application level rather than at the DB level.

Understanding good practice for Java RMI

I have an RMI application.
Basically, every request from the client created a new connection (on the server side) to the database, ran an SQL query, and turned the data into a serializable class that was sent back to the client.
The user base of the app grew, and the requests took a very long time to complete. The solution previous programmers came up with was to create a fixed-size connection pool from the server to the DB, where every client request used the least recently used connection to run its SQL query.
My question is: what is the correct way to solve such a problem?
I would say, pooling the DB connections is already an important step, since to establish a connection is expensive. Instead of implementing my own pool I would however use an existing and proven pooled data-source implementation such as DBCP or C3P0. They have many useful features such as varying size, automatic connection check, etc...
If it is the query itself that takes too long, the optimization would be more complex than just that. Various approaches are possible and depend on the details of the situation, for example:
Is there only one SQL query, always the same, as your question seems to imply?
Is the database read-only?
If not, are the modifications made within the same application or externally?
etc...
Possible approaches (that I can think of right now) to reduce the request time:
Caching of the result in the Java app (but this is a vast subject...)
Optimization of the SQL request
Optimization of the DB schema, with indexes or deeper refactoring of the table structure
Reduce the amount of data sent back to the client to just the bare minimum (In case the network is the bottleneck)
I hope this helps. We would really need more details on the use case to give you a better answer.

Is there any use for views, triggers and stored procedures in a Java GUI project?

I am making a Java GUI and web application which will use the same MySQL database.
It's a DTH management system where all the information will be stored and retrieved dynamically depending on input.
I believe that views are static by nature and thus would be useless, as all my queries will have a different WHERE condition (userid).
Do I need to use triggers? I mean, I could code the Java to execute multiple statements instead of using an inbuilt trigger (e.g., inserting a customer's name and the family members' names, where both will have a duplicate copy for the head of the family). Is there a performance hit? Am I wrong in some way?
And, same thing, what is the use of stored procedures? Can't I use methods in Java to do everything?
So, I am asking: is it possible to shift all the calculation-intensive stuff to Java and the web script instead of SQL? If yes, does this mean I only have to create the backend structure of the database (i.e. all the different tables and FKs/PKs) and do the rest without using any SQL stuff in MySQL Workbench?
Thank you for helping.
There is (as always) one correct answer: It depends.
If you only want to show and query some data, you probably won't need triggers or stored procedures.
Views are a different thing: they are pretty helpful if you want a static view of a join table or something like that. If you don't need this, just don't use them.
Keys are really important. They make your data robust against wrong input.
What you should use is PreparedStatement instead of Statement. If you only use PreparedStatements, you are (nearly?) safe as far as SQL injection is concerned.
We use views because they are simply faster than a plain select query; for just showing data (not editing/updating) they are faster and preferable.
Triggers are fired on the database side, so they are faster because they execute two or more queries in a single execution.
The same goes for stored procedures: we can execute more than one query over a single database connection. If we execute the queries separately, every execution pays the database-connection overhead again (finding the database server, authenticating, finding the database, etc.).

Way to know table is modified

There are two different processes developed in Java running independently.
If either of the processes modifies the table, can I get any notification that the table has been modified? My objective is to keep an object always in sync with a table in the database; if any modification happens to the table, I want to update the object accordingly.
Do databases provide any facility like this?
We use SQL Server and have certain triggers that fire when a table is modified and call an external binary. The binary we call sends a Tib rendezvous message to notify other applications that the table has been updated.
However, I'm not a huge fan of this solution - Much better to control writing to your table through one "custodian" process and have other applications delegate to that. To enforce this you could change permissions on your table so that only your custodian process can write to the database.
The other advantage of this approach is being able to provide a caching layer within your custodian process to cater for common access patterns. Granted that a DBMS performs caching anyway, but by offering it at the application layer you will have more control / visibility over it.
No, databases don't provide these services. You have to query periodically to check for modifications, or use some JMS solution to send notifications from one app to another.
You could add a timestamp column (last_modified) to the tables and check it periodically for updates, or use sequence numbers (which are incremented on updates, similar in concept to optimistic locking).
You could use JBoss Cache, which provides update mechanisms.
One way you can do this: enclose your database statement in a method that returns true when it completes successfully, and keep that flag in scope in your code so you can check whenever you want whether the table has been modified. Why not try something like this?
If you're willing to take the hack approach, and your database stores tables as files (e.g., MySQL), you could always have something check the modification time of the files on disk to see if they have changed.
Of course, for databases like Oracle, where tables are assigned to tablespaces and it is the tablespaces that have the storage on disk, it won't work.
(yes, I know this is a bad approach, that's why I said it's a hack -- but we don't know all of the requirements, and if he needs something quick, without re-writing the whole application, this would technically work for some databases)

Tips on Speeding up JDBC writes?

I am writing a program that does a lot of writes to a Postgres database. In a typical scenario I would be writing say 100,000 rows to a table that's well normalized (three foreign integer keys, the combination of which is the primary key and the index of the table). I am using PreparedStatements and executeBatch(), yet I can only manage to push in say 100k rows in about 70 seconds on my laptop, when the embedded database we're replacing (which has the same foreign key constraints and indices) does it in 10.
I am new to JDBC and I don't expect it to beat a custom embedded DB, but I was hoping it would be only 2-3x slower, not 7x. Anything obvious that I may be missing? Does the order of the writes matter (i.e., say, if it's not the order of the index)? Things to look at to squeeze out a bit more speed?
This is an issue that I have had to deal with often on my current project. For our application, insert speed is a critical bottleneck. However, we have found that for the vast majority of database users, select speed is the chief bottleneck, so you will find more resources dealing with that issue.
So here are a few solutions that we have come up with:
First, all solutions involve using the Postgres COPY command. Using COPY to import data into Postgres is by far the quickest method available. However, the JDBC driver by default does not currently support COPY across the network socket. So, if you want to use it, you will need to do one of two workarounds:
A JDBC driver patched to support COPY, such as this one.
If the data you are inserting and the database are on the same physical machine, you can write the data out to a file on the filesystem and then use the COPY command to import the data in bulk.
Other options for increasing speed are using JNI to hit the Postgres API so you can talk over the Unix socket, removing indexes, and the pg_bulkload project. However, in the end, if you don't implement COPY you will always find the performance disappointing.
Check if your connection is set to autoCommit. If autoCommit is true, then if you have 100 items in the batch when you call executeBatch, it will issue 100 individual commits. That can be a lot slower than calling executeBatch() followed by a single explicit commit().
I would avoid the temptation to drop indexes or foreign keys during the insert. It puts the table in an unusable state while your load is running, since nobody can query the table while the indexes are gone. Plus, it seems harmless enough, but what do you do when you try to re-enable the constraint and it fails because something you didn't expect to happen has happened? An RDBMS has integrity constraints for a reason, and disabling them even "for a little while" is dangerous.
You can obviously try to change the size of your batch to find the best size for your configuration, but I doubt you will gain a factor of 3.
You could also try to tune your database structure. You might get better performance using a single field as a primary key than using a composite PK. Depending on the level of integrity you need, you might save quite some time by deactivating integrity checks on your DB.
You might also change the database you are using. MySQL is supposed to be pretty good for high-speed simple inserts... and I know there is a fork of MySQL around that tries to cut functionality to get very high performance under highly concurrent access.
Good luck!
Try disabling indexes and re-enabling them after the insert. Also, wrap the whole process in a transaction.
