I access a database via JDBC that returns around 200k records, which I have to consolidate data-table style for further processing. If it leads to good performance, I could send a few SELECT COUNT(...) statements upfront.
What is a good algorithm in Java to compute such a pivot table?
Clarification
I am specifically looking for a Java solution, not a SQL approach (and the back end isn't Oracle).
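For what it's worth, one common way to do this in plain Java is a single pass over the result set that accumulates into a nested map. This is only a minimal sketch: the class name PivotTable, the String keys and the choice of "sum" as the aggregate are all assumptions for illustration, not something from the question.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Minimal in-memory pivot: sums a numeric value per (rowKey, columnKey) pair. */
final class PivotTable {
    // rowKey -> (columnKey -> aggregated value); TreeMap keeps keys sorted for stable output
    private final Map<String, Map<String, Double>> cells = new TreeMap<>();
    private final Set<String> columns = new TreeSet<>();

    /** Feed one record; intended to be called once per row while iterating the ResultSet. */
    void add(String rowKey, String columnKey, double value) {
        columns.add(columnKey);
        cells.computeIfAbsent(rowKey, k -> new TreeMap<>())
             .merge(columnKey, value, Double::sum);
    }

    /** Returns the aggregated value, or 0.0 for a cell that never received data. */
    double get(String rowKey, String columnKey) {
        Map<String, Double> row = cells.get(rowKey);
        return row == null ? 0.0 : row.getOrDefault(columnKey, 0.0);
    }

    Set<String> rowKeys() { return cells.keySet(); }

    Set<String> columnKeys() { return columns; }
}
```

With around 200k records this stays comfortably in memory, and each record costs two map lookups, so no upfront COUNT queries should be needed to size anything.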
Related
The following query generated by Hibernate takes 13+ seconds and locks the table:
SELECT COUNT(auditentit0_.audit_id) AS col_0_0_ FROM Audit auditentit0_ WHERE 1=1;
The growing Microsoft SQL Server database table contains 90+ million rows.
For Microsoft SQL Server, I have found an accurate metadata-based way of getting the same information very quickly.
However, I would rather not write custom code for Microsoft SQL Server and Oracle (the next database) if Hibernate has a way of getting this information.
Here is an example metadata query for Microsoft SQL Server that is accurate and almost instant:
SELECT SUM (row_count) FROM sys.dm_db_partition_stats WHERE object_id=OBJECT_ID('huge_audit_table') AND (index_id=0 or index_id=1);
Is there a way to have Hibernate issue a similar query for a table row count?
One posted answer has indicated that a view could be of use. I'm investigating this post to see if it can solve the issue:
https://vladmihalcea.com/map-jpa-entity-to-view-or-sql-query-with-hibernate/
In Hibernate you should use projections, as in the link you provided, to guarantee that it works on multiple DBMSs:
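protected Long countByCriteria(DetachedCriteria criteria) {
    // Bind the detached criteria to the current session
    Criteria crit = criteria.getExecutableCriteria(getSession());
    // Let Hibernate generate the row count query for the active dialect
    crit.setProjection(Projections.rowCount());
    return (Long) crit.uniqueResult();
}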
Which storage engine are you using in MySQL? I never had a blocking problem with a row count in MySQL or Oracle. Maybe the following link will help you: Any way to select without causing locking in MySQL?
Also, after some quick reading, I see that SQL Server does indeed block on count.
Maybe you could use a stored procedure or some other mechanism to pass the problem to the DBMS.
Edit:
Projections in Hibernate are used to select the columns to fetch, to choose the columns to group by, and to apply built-in aggregate functions (sum, count, avg, max, min, countDistinct).
They help you keep your application database-agnostic. Remember that Hibernate supports around 30 databases.
In your case you have a specific problem with SQL Server, as the count blocks the table, prioritizing accuracy. Using the system views is really quick because you get an estimate, but it isn't standard.
You could encapsulate the problem in a DBMS-dependent view or stored procedure. Or maybe you could try a NOLOCK hint or READ UNCOMMITTED in Hibernate (for a count on an audit table it should be acceptable).
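To illustrate the READ UNCOMMITTED idea, here is a rough sketch in plain JDBC rather than Hibernate. The class and method names (DirtyCount, count) are made up for this example, and how you obtain the Connection from your Hibernate session is left out. Also note that changing the isolation level in the middle of an open transaction is driver-dependent, so treat this as a starting point only.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class DirtyCount {
    /**
     * Counts rows under READ UNCOMMITTED, then restores the previous isolation level.
     * tableName is concatenated into the SQL text, so it must come from trusted code,
     * never from user input.
     */
    static long count(Connection conn, String tableName) throws SQLException {
        int previous = conn.getTransactionIsolation();
        conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT COUNT(*) FROM " + tableName)) {
            if (!rs.next()) {
                throw new SQLException("COUNT query returned no row");
            }
            return rs.getLong(1);
        } finally {
            // Always put the connection back the way we found it (it may be pooled)
            conn.setTransactionIsolation(previous);
        }
    }
}
```

The result can include rows from transactions that later roll back, which is the trade-off mentioned above: for a paging total on an audit table that is usually fine.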
To solve this particular problem we stepped back and changed how the UI functions. Through a collaborative effort between UIX and UI developers we agreed that unfiltered queries will NOT ask for total counts. The initial screen load will show only a page full of data. No "page 1 of 60,000" controls will exist. Only when the user enters specific criteria will the total count come into play. Those queries should be very fast. Now... it is still possible for the user to set up a query that will be just as bad as the original problem. It should be the exception rather than the norm.
So there really is not a solid answer for the OP. If you are faced with this type of problem and you have control of the UI and API, then it is time to rethink the solution. Think of how Google handles paging from a UI perspective. The days of showing a "page 1 of (XX)" are gone IMHO.
TLDR: What is the best way to store spatial data in an SQL database to be used in R-Trees?
Long question:
I am writing a feature that incorporates spatial data. The goal is to store POIs and be able to retrieve the data quickly, perform clustering, etc.
My understanding is that R*-Trees are a good solution for this kind of task. I am planning on using: https://github.com/davidmoten/rtree.
SQLite seems to offer R-Trees, but I can use only SQL. What would be the most efficient way to store this data?
Get a database that has R trees.
For example SQLite, PostgreSQL, Oracle, ...
But beware that the query performance of these databases will usually be pretty bad compared to an in-memory index such as ELKI's, in particular if you want nearest-neighbor search with haversine distance, which is what I mostly need.
Often, their R-tree index is an ugly hack. It seems they usually create a table to store the pages of the tree, so querying means repeatedly selecting rows from that table.
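In case the haversine distance mentioned above is unfamiliar: it is the great-circle distance between two latitude/longitude points on a sphere. A minimal sketch follows; the class name and the mean Earth radius constant are my own choices here, and a spherical Earth is itself an approximation.

```java
final class Haversine {
    // Mean Earth radius in kilometres; assumes a perfectly spherical Earth
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in km between two points given in decimal degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double sinLat = Math.sin(dLat / 2);
        double sinLon = Math.sin(dLon / 2);
        double a = sinLat * sinLat
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * sinLon * sinLon;
        // Clamp to 1.0 so rounding errors near antipodal points cannot produce NaN
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

An R-tree works on bounding rectangles, so a typical pattern is to use the tree to narrow down candidates and then rank those candidates with a function like this.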
I am currently facing an issue where I receive data from a web service call and need to load it into an Oracle table.
Scenario:
- I have a very large table with 500 columns. All columns are mandatory, and splitting the table is not an option.
- The dataset is 50m records, which I am trying to export from the source system, and it is continuously growing.
- I receive 50 columns of data per web service request (at the source system), so I need to submit 10 requests of 50 columns each to get a full record.
- I can also receive at most 100,000 (1 lakh) records in one request for a specific set of columns.
Now, to import the same data into the Oracle DB at the destination system, I have the following two choices:
1. First export the data into temporary tables of 50 columns each, then join all of them to create the final table with all 500 columns.
2. Fire 10 parallel requests of 50 columns each, stitch the data together in my Java program, and then send an insert query with all 500 columns.
I would like to know which technique works better: an Oracle-based table join, or stitching on the Java side using the primary key column?
As the data set is very large, I am looking purely at the performance aspect. Are there any more optimized ways to solve the same problem?
From a performance point of view, the Oracle-based solution would clearly win. From an implementation point of view (aiming for a clear and simple solution), Oracle tables win again. Here is why:
Architecture point of view: Combining the data in your app will make your app stateful. From a simple stateless (receive, save, forget) application you would turn it into a complex state-aware one (save, look for joint records, find nothing, store, wait, look again, etc.). This is much harder to develop, maintain or debug.
Performance point of view: Saving data into multiple tables and later combining them into one (either by views, stored procedures or simple selects) is something Oracle is designed for. An immense amount of development time was spent on optimizing these basic features. Whatever you come up with to implement the same features (even though you are aware of some specifics) would likely perform worse.
So overall I would strongly suggest Option #1: leave it to Oracle to do the hard part. Depending on how you want to use this data after the import (almost real-time / once in a while / after extra filtering applied), you can choose how to construct the final records by using one of these:
stored procedures
Oracle jobs
views.
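To make the statefulness argument a bit more concrete, this is roughly the bookkeeping Option #2 would need in the Java program. It is only a sketch under my own assumptions: the class name RecordStitcher, the use of Object[] for column values and a numeric primary key are all invented for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Holds partially received rows until every column chunk for a primary key has arrived. */
final class RecordStitcher {
    private static final class Partial {
        final Object[] row;
        final BitSet seen = new BitSet();
        Partial(int width) { row = new Object[width]; }
    }

    private final int chunks;           // e.g. 10 requests per full record
    private final int columnsPerChunk;  // e.g. 50 columns per request
    private final Map<Long, Partial> pending = new HashMap<>();

    RecordStitcher(int chunks, int columnsPerChunk) {
        this.chunks = chunks;
        this.columnsPerChunk = columnsPerChunk;
    }

    /** Returns the full row once the last missing chunk arrives, otherwise empty. */
    Optional<Object[]> accept(long primaryKey, int chunkIndex, Object[] values) {
        if (chunkIndex < 0 || chunkIndex >= chunks || values.length != columnsPerChunk) {
            throw new IllegalArgumentException("unexpected chunk shape");
        }
        Partial p = pending.computeIfAbsent(primaryKey, k -> new Partial(chunks * columnsPerChunk));
        System.arraycopy(values, 0, p.row, chunkIndex * columnsPerChunk, columnsPerChunk);
        p.seen.set(chunkIndex);
        if (p.seen.cardinality() < chunks) {
            return Optional.empty();    // still waiting for other chunks
        }
        pending.remove(primaryKey);     // complete: release the memory
        return Optional.of(p.row);
    }

    /** Number of rows currently held in memory waiting for chunks. */
    int pendingCount() { return pending.size(); }
}
```

Everything in the pending map lives in the heap until its last chunk arrives, and the sketch has no answer for chunks that never arrive or for a restart mid-load. With Option #1 that state sits in the temporary tables instead.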
The problem is that we have a huge number of records (more than a million) to be inserted into a single table from a Java application. The records are created by the Java code rather than moved from another table, so INSERT/SELECT won't help.
Currently, my bottleneck is the INSERT statements. I'm using PreparedStatement to speed up the process, but I can't get more than 50 records per second on a normal server. The table is not complicated at all, and there are no indexes defined on it.
The process takes too long, and the time it takes will cause problems.
What can I do to get the maximum possible speed (INSERTs per second)?
Database: MS SQL 2008. Application: Java-based, using Microsoft JDBC driver.
Batch the inserts. That is, send 1000 rows at a time rather than one row at a time, so you hugely reduce round trips/server calls.
See Performing Batch Operations on MSDN for the JDBC driver. This is the easiest method without reengineering to use genuine bulk methods.
Each insert must be parsed, compiled and executed. A batch means a lot less parsing/compiling because 1000 (for example) inserts will be compiled in one go.
There are better ways, but this works if you are limited to generated INSERTs
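A minimal sketch of what batching looks like with the standard JDBC calls (addBatch / executeBatch). The table name, column names and two-column row layout are hypothetical placeholders, and the batch size is something to tune for your own setup.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInserter {
    /** Inserts rows in batches inside one transaction; returns the number of batches sent. */
    static int insertAll(Connection conn, List<Object[]> rows, int batchSize) throws SQLException {
        // Hypothetical table and columns, for illustration only
        String sql = "INSERT INTO my_table (col_a, col_b) VALUES (?, ?)";
        boolean oldAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false);          // commit once at the end, not once per row
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.addBatch();              // queue the row on the client side
                if (++pending == batchSize) {
                    ps.executeBatch();      // one round trip for the whole batch
                    batches++;
                    pending = 0;
                }
            }
            if (pending > 0) {              // flush the final partial batch
                ps.executeBatch();
                batches++;
            }
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(oldAutoCommit);
        }
        return batches;
    }
}
```

Turning off auto-commit matters about as much as the batching itself: with auto-commit on, every single INSERT is its own transaction. For very large loads you may prefer to commit every N batches instead of once at the end.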
Use BULK INSERT - it is designed for exactly what you are asking and significantly increases the speed of inserts.
Also (just in case you really do have no indexes), you may want to consider adding an index: some indexes (most notably one on the primary key) may improve the performance of inserts.
The actual rate at which you should be able to insert records will depend on the exact data, the table structure and also on the hardware / configuration of the SQL server itself, so I can't really give you any numbers.
Have you looked into bulk operations?
Have you considered using batch updates?
Is there any integrity constraint or trigger on the table?
If so, dropping it before the inserts will help, but you have to be sure that you can afford the consequences.
Look into SQL Server's bcp utility.
This would mean a big change in your approach, in that you'd be generating a delimited file and using an external utility to import the data. But this is the fastest method for inserting a large number of records into a SQL Server database and will speed up your load time by many orders of magnitude.
Also, is this a one-time operation you have to perform or something that will occur on a regular basis? If it's one time I would suggest not even coding this process but performing an export/import with a combination of db utilities.
I would recommend using an ETL engine for it. You can use Pentaho. It's free. The ETL engines are optimized for doing bulk loading on data and also any forms of transformation/validation that are required.
My requirement is to read some set of columns from a table.
The source table has many columns, around 20-30 numeric ones, and I would like to read only a subset of them and keep appending their values to the destination table. My DB is Oracle and the programming language is Java with JDBC.
The source table is very dynamic: there are frequent inserts and deletes on it. At the destination table, however, I would like to keep the data for at least 30 days.
My setup is as follows:
Database is Oracle.
Number of rows in the source table = 20 million rows with 30 columns
Number of rows in the destination table = 300 million rows with 2-3 columns
The columns are all numeric.
I would rather not open a vanilla JDBC connection and transfer the data that way, which might be pretty slow given the size of the tables.
Instead, I am trying to dump the selected columns of the source table using some SQL like this:
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
And later use SQL*Loader to load the data into the destination database.
The source table stores time series data, and the data gets purged/deleted from the source table within 2 days. It's part of an OLTP environment. The destination table has a longer retention period (30 days of data can be stored here) and is part of an OLAP environment. So a view on the source table that selects only a set of its columns does not work in this environment.
Any suggestions or review comments on this approach are welcome.
EDIT
My tables are partitioned. The easiest way to copy data is to exchange a partition between tables:
ALTER TABLE <table_name>
EXCHANGE PARTITION <partition_name>
WITH TABLE <new_table_name>
<including | excluding> INDEXES
<with | without> VALIDATION
EXCEPTIONS INTO <schema.table_name>;
But since my source and destination tables have different columns, I think exchanging partitions will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP need today's data before people come in to the office tomorrow morning, or is it closer to real time?
Saying the inserts are "frequent" doesn't mean anything. Some of us are used to thousands of txns/sec; to others 1/sec is a lot.
And you say there's a lot of data. Same idea. I've read people's posts where they have HUGE tables with a couple million records. I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
Advanced Queuing
Streams
Change Data Capture
Database Change Notification
Start googling these terms and come back with questions on those. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague, and frankly a little odd. The fact that there are hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility is that since it seems all you want is a slice of the data in your original table, rather than duplicating it, a cardinal sin of database design, create a view which only includes the columns and ranges you want. Then just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix that with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
Faster Performance—Lowers query times from minutes to seconds
Increases Availability—24 by 7 access to critical information
Improves Manageability—Manage smaller 'chunks' of data
Enables Information Lifecycle Management—Cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier as described here