Efficient way of storing query results? - Java

I have a requirement to run 'n' select queries at fixed time intervals and store the resulting data. These results need to be pulled later on a client's demand.
My question is:
1) Is it okay to store the results as CSV files? Or could you suggest another format?
2) Or should they be stored as a CLOB column in a DB?
Please suggest any compression techniques for storing these query results; also, is it possible to store only revisions of previous result sets instead of storing the whole result set each time?
note:
The minimum time interval is hourly.
The number of queries (n) will vary (currently 10 to 200 queries).
The result set size of each query also varies (say 10 to 1,000,000 rows, but mostly around 10k).
The result set data fetched between time intervals doesn't differ much. (Row values are not updated frequently.)
I am new to computer science and programming and not very familiar with storage or DB design.

It sounds like you should be building a data warehouse.

Performance-wise, I suppose it would be better to have a table whose purpose is to store the query results.

I think you need to store the data in a database; an SQL database can serve you best.
Regarding storing the data at fixed time intervals, you only need to record the changes in the data set instead of storing the whole data again and again. I don't know what your requirements are or how much infrastructure you can afford. If you have such huge queries, I recommend working with a distributed system and using a NoSQL database for better performance.
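As an illustration of the "store only the changes" idea, here is a minimal sketch; it assumes each result row has a stable key column, and the class, method and data layout are only illustrative, not something from the question:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: keep the previous snapshot (keyed by a stable row key) and
    // persist only the rows that were added or changed since the last run.
    public class DeltaStore {

        // key column value -> serialized row (e.g. CSV of the remaining columns)
        private Map<String, String> previousSnapshot = new HashMap<>();

        /** Returns only the rows that are new or changed compared to the last snapshot. */
        public Map<String, String> diff(Map<String, String> currentSnapshot) {
            Map<String, String> changes = new HashMap<>();
            for (Map.Entry<String, String> e : currentSnapshot.entrySet()) {
                String previous = previousSnapshot.get(e.getKey());
                if (previous == null || !previous.equals(e.getValue())) {
                    changes.put(e.getKey(), e.getValue()); // new or updated row
                }
            }
            // Deleted rows would need a separate "tombstone" list; omitted here.
            previousSnapshot = currentSnapshot;
            return changes;
        }
    }

Deleted rows would have to be recorded separately, and taking a full snapshot from time to time keeps the chain of revisions from growing without bound.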

Related

SELECT DISTINCT vs Java Set/List

I need to get the distinct values from a column (which is not indexed) and the table contains billions of rows.
So when I use DISTINCT in the select query, the query times out, as the timeout is set to 3 minutes.
Would it be a good approach to fetch all the data from the table and then use a Set to get the unique values?
Please suggest the best approach here.
Thanks in advance! :)
It is not a good idea to fetch all the rows (especially if there are billions of them): it is not memory efficient, and you'll likely get an OutOfMemoryError.
The best option, of course, is to restructure the data (for example, by indexing that column) so the distinct values can be retrieved efficiently.
Also, for big data sets, common practice is to use a paging (pagination) mechanism that lets you fetch the data in small chunks, so you bypass the timeout issue.
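As a hedged illustration of that pagination idea, here is a sketch that pages over an indexed numeric id column and accumulates the distinct values in a Set; the table/column names and the LIMIT syntax (MySQL/PostgreSQL style) are assumptions, not details from the question:

    import java.sql.*;
    import java.util.HashSet;
    import java.util.Set;

    // Sketch: fetch the big table in chunks (keyset pagination on an indexed
    // id column) and accumulate the distinct values of one column in a Set.
    public class DistinctByPaging {
        public static Set<String> distinctValues(Connection conn) throws SQLException {
            Set<String> distinct = new HashSet<>();
            long lastId = 0;
            String sql = "SELECT id, some_column FROM big_table "
                       + "WHERE id > ? ORDER BY id LIMIT 10000";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                while (true) {
                    ps.setLong(1, lastId);
                    boolean any = false;
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            any = true;
                            lastId = rs.getLong("id");
                            distinct.add(rs.getString("some_column"));
                        }
                    }
                    if (!any) break; // no more rows
                }
            }
            return distinct;
        }
    }

This still scans the whole table, but each chunk stays small enough to avoid the timeout, and memory use is bounded by the number of distinct values rather than the number of rows.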

What is the best approach to create aggregation tables?

I have data being collected every 1 second and stored in HSQLDB.
I need aggregated data (per 15 sec, 1 min, etc.) for each metric in the collected data.
What is the best approach to calculate the aggregation values? When should they be stored in the DB?
Should I calculate the values online and store them in the DB every 15 sec? Or maybe query the DB for the latest results and calculate the aggregation on them? Should I use the small aggregation (15 sec) to calculate the larger one (1 min)?
Are there free Java tools for this?
From previous experience, I would suggest using a real-time database, probably a non-relational one with a built-in ability to deal with time series. That way, you should be able to avoid storing calculated aggregated data. Using a relational database, you will quickly end up with millions of rows that will be difficult to manage and slow to access. Your other option is to denormalize your data and store each hour of data in a single row, in a BLOB column (in binary format).
You can use HSQLDB in MVCC mode for concurrent reads and writes.
Provided the table for the raw data has an indexed timestamp column, aggregate calculation over a range is very fast using a SELECT statement. Because SELECT statements with aggregate calculations can run concurrently, you can use separate threads to perform the operations every 1 second and every 15 seconds.
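A rough sketch of that setup, assuming a raw_data table with an indexed ts timestamp column and a numeric value column (names are illustrative, not from the question):

    import java.sql.*;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Sketch: every 15 seconds, aggregate the last 15 seconds of raw data with
    // a plain SELECT; the indexed timestamp column keeps the range scan fast.
    public class AggregationJob {
        public static void start(final Connection conn) {
            ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
            scheduler.scheduleAtFixedRate(() -> {
                String sql = "SELECT AVG(value), MIN(value), MAX(value) "
                           + "FROM raw_data WHERE ts >= ? AND ts < ?";
                Timestamp end = new Timestamp(System.currentTimeMillis());
                Timestamp start = new Timestamp(end.getTime() - 15_000);
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    ps.setTimestamp(1, start);
                    ps.setTimestamp(2, end);
                    try (ResultSet rs = ps.executeQuery()) {
                        if (rs.next()) {
                            // store or forward rs.getDouble(1..3) as the 15-second aggregate
                        }
                    }
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            }, 15, 15, TimeUnit.SECONDS);
        }
    }

Writing the returned aggregate into its own table (or reusing the 15-second rows to build the 1-minute ones) is left out of the sketch.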

Most performant way of querying database with JDBC?

I need to get data from several tables, so I used a query with N left outer joins. It seems to me that this may be a waste of performance, since I get the cartesian product of lots of data. Which is the preferable way to do this in order to achieve better performance? I'm thinking of doing N+1 small queries. Am I on the right track?
I know this has little to do with JDBC specifics. I want to retrieve data from a single table and make left outer joins to N other tables. The result set gets very big because I get a cartesian product. For example:
table1data1, table2data1, table3data1
table1data1, table2data2, table3data1
table1data1, table2data1, table3data2
table1data1, table2data2, table3data2
I know that if I make several queries to the database (as in my example, where I get 1 record from table1, 2 records from table2 and 2 records from table3), I'll make a lot of round trips to the database. But I've tested this way and it looks a lot faster.
This really isn't JDBC specific. Generally speaking, depending on the amount of data being returned, you'll get better performance retrieving everything in a single result set. N+1 queries tend to make for a lot of round trips to the database. Does the result set contain fields you don't need? Can you trim the columns being returned? That would be a first step, if possible.
I think your current approach of getting a lot of data in one trip to the database is the right one. However, if you find yourself executing the same query many times with different parameters, it is more performant to write it as a stored procedure using bind variables. But I would definitely shy away from breaking your JOIN into several smaller queries.
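For illustration, a hedged sketch of the single-trip approach with trimmed columns (the table and column names are made up, not taken from the question):

    import java.sql.*;

    // Sketch: one round trip with a single (trimmed) join instead of N+1 queries.
    public class SingleTripQuery {
        public static void fetch(Connection conn, long id) throws SQLException {
            String sql = "SELECT t1.name, t2.detail, t3.note "
                       + "FROM table1 t1 "
                       + "LEFT JOIN table2 t2 ON t2.t1_id = t1.id "
                       + "LEFT JOIN table3 t3 ON t3.t1_id = t1.id "
                       + "WHERE t1.id = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, id);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // each row is one combination from the cartesian product;
                        // only the columns that are actually needed are selected
                        String name = rs.getString("name");
                        // ... process the row
                    }
                }
            }
        }
    }

Whether this beats N+1 small queries depends on how much of the cartesian product is redundant; selecting only the needed columns keeps the duplicated data per row small.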

Fastest way for inserting very large number of records into a Table in SQL

The problem is, we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code, it's not a move from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum speed (INSERT per second) possible?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, send 1000 rows at a time rather than one row at a time, so you hugely reduce the number of round trips/server calls.
See "Performing Batch Operations" on MSDN for the JDBC driver. This is the easiest method without re-engineering to use genuine bulk methods.
Each insert must be parsed, compiled and executed. A batch means a lot less parsing/compiling, because 1000 inserts (for example) will be compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs
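A minimal sketch of batched inserts via the standard JDBC batch API, with an illustrative table and a batch size of 1000 as suggested above:

    import java.sql.*;
    import java.util.List;

    // Sketch: batch the inserts to cut down round trips; commit once per batch.
    public class BatchInsert {
        public static void insertAll(Connection conn, List<String[]> rows) throws SQLException {
            String sql = "INSERT INTO my_table (col_a, col_b) VALUES (?, ?)";
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                    if (++count % 1000 == 0) { // send 1000 rows at a time
                        ps.executeBatch();
                        conn.commit();
                    }
                }
                ps.executeBatch();             // flush the remaining rows
                conn.commit();
            }
        }
    }

Committing once per batch instead of once per row removes most of the per-insert overhead; the exact throughput still depends on the data, the table and the server.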
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding indexes: some indexes (most notably one on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure that you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach, in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server DB and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform, or something that will occur on a regular basis? If it's one-time, I would suggest not even coding this process but performing an export/import with a combination of DB utilities.
I would recommend using an ETL engine for this. You can use Pentaho; it's free. ETL engines are optimized for bulk loading of data and for any transformation/validation that is required.

Writing resultset to file with sorted output

I want to write the "random" (unordered) output from my result set (about 1.5 million rows) to a file in a sorted manner. I know I can use ORDER BY in my query, but that command is "expensive".
Can you tell me whether there is any algorithm for writing result set rows to a file so that the content ends up sorted, and whether I can gain performance with this?
I'm using Java 1.6, and the query has multiple joins.
Define an index on the sort criteria in your table; then you can use the ORDER BY clause without problems and write the file as it comes from the result set.
If your query has multiple joins, create the proper indexes for the joins and for the sort criteria. You can sort the data in your program, but you'd be wasting time. That time would be far better spent learning how to properly tune/use your database than reinventing sorting algorithms already present in the database engine.
Grab your database's profiler and check the query's execution plan.
In my experience sorting at the database side is usually as fast or faster...certainly if the column you sort on is indexed
If you're reading from a database, getting sorted output shouldn't be so 'expensive' if you have appropriate indexes.
But, sometimes with complex queries it's very hard for the SQL optimiser to apply indexes. In that case, the DB simply accumulates the results in a temporary table and sorts it for you, transparently.
It's very unlikely that you could match the level of optimisation put into your DB engine; but if your problem arises because you're doing some post-processing of the data that negates any sorting done by the DB, then you have no alternative other than sorting it yourself.
Again, the easiest would be to use the DB: simply write to a temporary table with an appropriate index and dump from there.
If you're certain that the data will always fit in RAM, you can sort it in memory. It's the only case in which you might be able to beat the DB engine, just because you know you won't need HD access.
But that's a lot of "ifs". Better to stay with your DB.
If you need the data sorted, someone has to do it: either you or the database. It's certainly easier, effort-wise, to add the ORDER BY to the query. But there's no reason you can't sort it in memory on your side. The easiest way is to put the data into a sorted collection (TreeSet, TreeMap) using a Comparator that sorts on the column you need, then write out the sorted data.
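A small sketch of that in-memory approach (the query, sort column and output format are illustrative; it assumes the sort keys are unique, since a TreeMap keeps one value per key, that the whole result fits in RAM, and it is written with Java 7+ syntax for brevity even though the question mentions Java 1.6):

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.*;
    import java.util.Map;
    import java.util.TreeMap;

    // Sketch: sort on the client side with a TreeMap keyed by the sort column,
    // then write the values out to the file in key order.
    public class SortedResultWriter {
        public static void write(Connection conn, String outFile) throws SQLException, IOException {
            // key = sort column, value = the rest of the line to write
            TreeMap<String, String> sorted = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery("SELECT sort_col, other_col FROM my_view")) {
                while (rs.next()) {
                    sorted.put(rs.getString("sort_col"), rs.getString("other_col"));
                }
            }
            try (BufferedWriter out = Files.newBufferedWriter(Paths.get(outFile))) {
                for (Map.Entry<String, String> e : sorted.entrySet()) {
                    out.write(e.getKey() + "," + e.getValue());
                    out.newLine();
                }
            }
        }
    }

If duplicate sort keys are possible, a TreeMap<String, List<String>> (or a plain list sorted with Collections.sort and a Comparator) would be needed instead.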
