I am currently facing an issue where I receive data from a web service call, and the same data needs to be loaded into an Oracle table.
Scenario:
- I have a very large table with 500 columns; all columns are mandatory, and there is no option to split the table.
- The dataset is 50 million records, which I am trying to export from the source system, and it is continuously growing.
- Each web service request (to the source system) returns data for only 50 columns, so I need to submit 10 requests of 50 columns each to get a full record.
- Also, one request can return at most 100,000 (1 lakh) records for a given set of columns.
Now, to import this data into the Oracle DB at the destination system, I have the following two choices:
1. First load the data into temporary tables of 50 columns each, and then join all of them to create the final table with all 500 columns.
2. Fire 10 parallel requests of 50 columns each, stitch the data together in my Java program, and then send an insert with all 500 columns.
I would like to know which technique works out better: an Oracle-based table join, or stitching on the Java side using the primary key column?
As the dataset is very large, I am looking purely at the performance aspect. Are there any more optimized ways to solve this problem?
From a performance point of view, the Oracle-based solution would clearly win. From an implementation point of view (aiming for a clear and simple solution), Oracle tables win again. Here is why:
Architecture point of view: Combining the data in your app will make your app stateful. A simple stateless (receive, save, forget) application would turn into a complex state-aware one (save, look for matching records, find nothing, store, wait, look again, and so on). This is much harder to develop, maintain, and debug.
Performance point of view: Saving data into multiple tables and later combining them into one (either by views or stored procedures or simple selects) is something Oracle is designed for. An immense amount of development time was spent on optimizing these basic features. Whatever you would come up with to implement the same features (even if you are aware of some specifics) would likely perform worse.
So overall I would strongly suggest Option #1: leave it to Oracle to do the hard part. Depending on how you want to use this data after the import (almost real-time / once in a while / after extra filtering is applied), you can choose how to construct the final records using one of the following (a rough sketch of the join step follows the list):
- stored procedures
- Oracle jobs
- views
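For illustration, here is a rough JDBC sketch of Option #1. The staging tables STAGE_1 .. STAGE_10 (primary key PK plus 50 columns each), FINAL_TABLE, and the connection details are placeholders of mine, and the staging tables are assumed to be already batch-loaded from the 10 webservice responses:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class StitchInOracle {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass");
             Statement stmt = con.createStatement()) {

            // Let Oracle do the stitching: one set-based join instead of
            // row-by-row assembly in the application.
            stmt.executeUpdate(
                "INSERT /*+ APPEND */ INTO FINAL_TABLE (PK, C1, C2 /* ... C500 */) " +
                "SELECT s1.PK, s1.C1, s1.C2 /* ... remaining columns ... */ " +
                "FROM STAGE_1 s1 " +
                "JOIN STAGE_2 s2 ON s2.PK = s1.PK " +
                // ... join STAGE_3 through STAGE_9 the same way ...
                "JOIN STAGE_10 s10 ON s10.PK = s1.PK");
            con.commit();
        }
    }
}

The inner joins assume every staging table already holds the same set of primary keys for the batch; if a request can fail part-way, an outer join plus a reconciliation step would be safer.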
Related
I want to count the rows in a table three times, based on three different filters/conditions. I want to know which of the following two approaches is better for performance and cost-efficiency. We are using AWS as our server, Java Spring for the server-side API, and MySQL for the database.
1. Use MySQL's COUNT to query the database three times, once per filtering criterion, to get the three counts.
2. Fetch all the rows of the table from the database with a single query, then run a Java stream over them three times, once per filtering criterion, to get the three counts.
It'll be better to go with option (1). In extreme cases, if SELECT COUNT(*) FROM table is slow to execute, you should consider some tweaks on the SQL side. I'm not sure what you're using, but I found this example for SQL Server.
Assuming you go with option (2) and you have hundreds of thousands of rows, I suspect that your application will run out of memory (especially under high load) before you have time to worry about a slow response from SELECT COUNT(*). Not to mention that you'll be transferring lots of unnecessary rows, slowing down the transfer between the database and the application.
A basic argument against doing counts in the app is that hauling lots of data from the server to the client is time-consuming. (There are rare situations where it is worth the overhead.) Note that your client and AWS may be quite some distance apart, thereby exacerbating the cost of shoveling lots of data. I am skeptical of what you call "server-side API". But even if you can run Java on the server, there is still some cost of shoveling between MySQL and Java.
Sometimes this pattern lets you get 3 counts with one pass over the data:
SELECT
    SUM(status='ready')    AS ready_count,
    SUM(status='complete') AS completed_count,
    SUM(status='unk')      AS unknown_count,
    ...
The trick here is that a Boolean expression has a value of 0 (for false) or 1 (for true). Hence the SUM() works like a 'conditional count'.
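For example, here is a minimal JDBC sketch of the single-pass approach; the table name orders, the status values, and the connection URL are made-up placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StatusCounts {
    public static void main(String[] args) throws Exception {
        String sql =
            "SELECT " +
            "  SUM(status = 'ready')    AS ready_count, " +
            "  SUM(status = 'complete') AS completed_count, " +
            "  SUM(status = 'unk')      AS unknown_count " +
            "FROM orders";

        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/shop", "user", "pass");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            if (rs.next()) {
                // Only one row with three numbers crosses the network,
                // instead of the whole table.
                System.out.printf("ready=%d complete=%d unk=%d%n",
                        rs.getLong("ready_count"),
                        rs.getLong("completed_count"),
                        rs.getLong("unknown_count"));
            }
        }
    }
}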
I'm currently working on an analytics tool in which a Java program parses huge event logs (approx. 1 GB each) into a MySQL database every night; each event has about 40 attributes. The event logs are parsed "raw" into the database.
The user of the application needs to see different graphs and charts based on complicated calculations over the log data. So that the user doesn't have to wait several minutes for a chart request to be fulfilled, we need to store the preprocessed data somehow, ready to display to the user (the user is able to filter by dates, units, etc., but the largest part of the calculations can be done beforehand). My question concerns how to maintain such preprocessed data - currently, all calculations are expressed in SQL, as we assume that is the most efficient way (is this a correct assumption?). We need to be able to easily add new calculations for new charts, customer-specific wishes, etc.
Some kind of materialized view comes to mind, but MySQL doesn't seem to support this feature. Similarly, we could execute the SQL calculations each night after the event logs have been imported, but then each calculation/preprocessed data table needs to know which events it has already processed and which it hasn't. The table will contain up to a year's worth of data (i.e. events), so simply truncating the table and redoing all calculations does not seem to be the solution. Using triggers doesn't seem right either, as some calculations need to consider, for example, the time difference between two specific kinds of events.
I'm having a hard time weighing the pros and cons of possible solutions.
"Materialized Views" are not directly supported by MySQL. "Summary Tables" is another name for them in this context. Yes, that is the technique to use. You must create and maintain the summary table(s) yourself. They would be updated either as you insert data into the 'Fact' table, or periodically through a cron job, or simply after uploading the nightly dump.
The details of such are far more than can be laid out in this forum, and the specific techniques that would work best for you involve many questions. I have covered much of it in three blogs: DW, Summary Tables, and High speed ingestion. If you have further, more specific, questions, please open a new Question and I will dig into more details as needed.
I have done this in several projects; usually the performance is 10x better than reading the Fact table, and in one extreme case it was 1000x. I always end up with UI-friendly "reports" coming from the Summary Table(s).
In some situations, you are actually better off building the Summary Tables and not saving the Fact rows in a table. Optionally, you could simply keep the source file in case of a need to reprocess it. Not building the Fact table will get the summary info to the end-user even faster.
If you are gathering data for a year, and then purging the 'old' data, see my blog on partitioning. I often use that on the Fact table, but rarely feel the need on a Summary Table, since the Summary table is much smaller (that is, not filling up disk).
One use case had a 1 GB dump every hour. A Perl script moved the data to a Fact table and augmented 7 Summary Tables in less than 10 minutes. The system was also replicated, which added some extra challenges. So I can safely say that 1 GB a day is not a problem.
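As a rough illustration of the periodic-refresh idea (not taken from the blogs mentioned above), here is a JDBC sketch. The tables event_fact, event_daily_summary, and summary_watermark and their columns are hypothetical, and a unique key on (day, unit) is assumed on the summary table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefreshSummary {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/analytics", "user", "pass");
             Statement stmt = con.createStatement()) {

            // Aggregate only the rows added since the stored watermark;
            // run this right after the nightly import.
            stmt.executeUpdate(
                "INSERT INTO event_daily_summary (day, unit, event_count) " +
                "SELECT DATE(event_time), unit, COUNT(*) " +
                "FROM event_fact " +
                "WHERE event_time > (SELECT last_processed FROM summary_watermark) " +
                "GROUP BY DATE(event_time), unit " +
                "ON DUPLICATE KEY UPDATE event_count = event_count + VALUES(event_count)");

            // Advance the watermark so the next run picks up only new events.
            stmt.executeUpdate(
                "UPDATE summary_watermark " +
                "SET last_processed = (SELECT MAX(event_time) FROM event_fact)");
        }
    }
}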
I have a table from which I extract 8 columns; these columns will become the properties of a POJO, say MyPojo.
I want to remove duplicates.
I came up with three strategies.
1. Let Oracle take care of this with the DISTINCT keyword:
select distinct c1,c2...c8 from TABLE where...
2. Do this in Java with cqengine (https://code.google.com/p/cqengine/wiki/DeduplicationStrategies#Logical_Elimination_Strategy):
DeduplicationOption deduplication = deduplicate(DeduplicationStrategy.LOGICAL_ELIMINATION);
ResultSet<Car> results = cars.retrieve(query, queryOptions(deduplication));
3. Do this in Java with a Set:
simply storing the rows in a Set<MyPojo>, as sketched below.
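Roughly like this (assuming MyPojo has a constructor taking the 8 values and overrides equals()/hashCode() on them; the java.sql imports are omitted):

Set<MyPojo> loadDistinct(Connection con) throws SQLException {
    Set<MyPojo> unique = new HashSet<>();
    try (Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "select c1,c2,c3,c4,c5,c6,c7,c8 from TABLE where ...")) {
        while (rs.next()) {
            // duplicates collapse here, in the application, after every
            // row has already been sent over the network
            unique.add(new MyPojo(rs.getString(1), rs.getString(2),
                    rs.getString(3), rs.getString(4), rs.getString(5),
                    rs.getString(6), rs.getString(7), rs.getString(8)));
        }
    }
    return unique;
}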
From a performance point of view, which one is better?
Let the database do the work. In this case you don't send unnecessary data over the network, which will probably have the biggest positive impact on performance.
Also it is the most compact solution in terms of code size.
The best way to decide these things is to model it.
What are the access patterns in your application?
If this would be a one-off request: have the database do the filtering.
If you expect to get many such identical requests: have the database do the filtering, and consider caching results in the application.
If you expect to get a variety of queries on the same dataset, consider caching the unfiltered dataset into the application tier, and querying it with CQEngine.
There is no rule of thumb such as "always have the database do the work". If your application operates at any kind of scale, you will not want every request to hit the database. You need to scale out your application tier.
On the other hand, you should not over-engineer. The answer depends on the traffic volume and data access patterns that you expect.
I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, and then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How can I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce-style things on a MySQL database. It might be what you are looking for: see here. There is a recording on the MySQL webpage here (you have to register for mysql.com, though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. Write the Java program in such a way that it first fetches the last row processed by the other nodes and then adds an entry to the same table, informing the other nodes which range of rows it is going to process (you can decide this number). In our case, let's assume each node can process 1000 rows at a time. Node 1 fetches table B and finds it empty, so Node 1 inserts a row ('Node1', 1000), announcing that it is processing up to primary key 1000 of A (assuming the primary key of table A is numeric and in ascending order). Node 2 comes along and finds that 1000 primary keys are already being processed by another node, so it inserts a row ('Node2', 2000), informing the others that it is processing the rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time; a rough sketch of this claiming step follows.
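For illustration, a rough JDBC sketch of the claiming step; table_b, its columns node_name/last_key, the lock name, and the batch size are placeholders of mine, and MySQL's GET_LOCK()/RELEASE_LOCK() are used to serialize access to table B across nodes:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RangeClaimer {
    static final int BATCH = 1000;

    // Returns {firstKey, lastKey} of the range of table A this node has claimed.
    static long[] claimNextRange(Connection con, String nodeName) throws SQLException {
        try (Statement stmt = con.createStatement()) {
            // Serialize access to table B: only one node may extend it at a time.
            // (Real code should check the return value of GET_LOCK.)
            stmt.execute("SELECT GET_LOCK('table_b_claim', 10)");
            try {
                long lastClaimed = 0;
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT COALESCE(MAX(last_key), 0) FROM table_b")) {
                    if (rs.next()) {
                        lastClaimed = rs.getLong(1);
                    }
                }
                long upTo = lastClaimed + BATCH;
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO table_b (node_name, last_key) VALUES (?, ?)")) {
                    ps.setString(1, nodeName);
                    ps.setLong(2, upTo);
                    ps.executeUpdate();
                }
                return new long[] { lastClaimed + 1, upTo };
            } finally {
                stmt.execute("SELECT RELEASE_LOCK('table_b_claim')");
            }
        }
    }
}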
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, reduce the overall workload on the backend, and offload some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, therefore decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a MySQL instance on each machine, but the setup time, maintenance, and changes to the database access layer might be a lot of work compared to a Gearman solution. You might also want to look at the experimental SPIDER storage engine, which could allow you to use multiple MySQL servers in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
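For example, a minimal sketch of the stored-procedure route via JDBC; the routine name recalculate_metrics and its parameter are placeholders, not an existing procedure:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class RunInDatabase {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://dbhost:3306/mydb", "user", "pass");
             CallableStatement cs = con.prepareCall("{call recalculate_metrics(?)}")) {
            cs.setInt(1, 2024);   // e.g. some application/partition key
            cs.execute();         // the millions of rows never leave the server
        }
    }
}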
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with a scheme to partition the data based on some application key.
My requirement is to read a set of columns from a table.
The source table has many columns - around 20-30 numeric columns - and I would like to read only a subset of them from the source table and keep appending their values to the destination table. My DB is Oracle and the programming language is Java with JDBC.
The source table is very dynamic; frequent inserts and deletes happen on it. At the destination table, however, I would like to keep the data for at least 30 days.
My setup is described below:
- Database: Oracle
- Number of rows in the source table: 20 million, with 30 columns
- Number of rows in the destination table: 300 million, with 2-3 columns
- The columns are all numeric.
I am not planning simply to open a vanilla JDBC connection and transfer the data, which might be pretty slow given the size of the tables.
I am trying to take a dump of the selected columns of the source table using some SQL like:
SQL> spool on
SQL> select c1,c5,c6 from SRC_Table;
SQL> spool off
and later use SQL*Loader to load the data into the destination database.
The source table stores time-series data, and the data gets purged/deleted from it within 2 days; it is part of an OLTP environment. The destination table has a larger retention period - 30 days of data can be stored there - and it is part of an OLAP environment. So a view on the source table that selects only a subset of its columns does not work in this environment.
Any suggestions or review comments on this approach are welcome.
EDIT
My tables are partitioned. The easiest way to copy data would be to exchange a partition between tables:
ALTER TABLE <table_name>
  EXCHANGE PARTITION <partition_name>
  WITH TABLE <new_table_name>
  <including | excluding> INDEXES
  <with | without> VALIDATION
  EXCEPTIONS INTO <schema.table_name>;
but since my source and destination tables have different columns, I think exchange partition will not work.
Shamik, okay, you're loading an OLAP database with OLTP data.
What's the acceptable latency? Does your OLAP side need today's data before people come into the office tomorrow morning, or does it need to be closer to real time?
Saying the inserts are "frequent" doesn't mean anything. Some of us are used to thousands of transactions per second; to others, one per second is a lot.
And you say there's a lot of data. Same idea. I've read posts where people have HUGE tables with a couple of million records; I have tables with hundreds of billions of records. So again, a real number is very helpful.
Do not go with the trigger suggested by Schwern. If you believe your insert volume is large, it means you've probably already had issues in that area. A trigger will just make it worse.
Oracle provides lots of different choices for getting data from OLTP to OLAP. Instead of reinventing the wheel, use something already written. Oracle Streams was BORN to do this exact job. You can roll your own streams using Oracle AQ. You can capture inserted rows without a trigger by using either Database Change Notification or Change Data Capture.
This is an extremely common problem, which is why I've listed 4 technologies designed to solve it.
- Advanced Queuing
- Streams
- Change Data Capture
- Database Change Notification
Start googling these terms and come back with questions on them. You'll be better off than building your own from the ground up or using triggers.
The problem seems a little vague and, frankly, a little odd. The fact that there are hundreds of columns in a single table, and that you're duplicating data within the database, suggests a hosed database design.
Rather than do it manually, it sounds like a job for a trigger. Create an insert trigger on the source table to copy columns to the destination table just after they're inserted.
Another possibility: since it seems all you want is a slice of the data in your original table, then rather than duplicating it (a cardinal sin of database design), create a view which only includes the columns and ranges you want, and just access that view like any other table.
I'm willing to guess that the root of the problem is that accessing just the information you want in your source table is too slow. This suggests you might be able to fix it with better indexing. Also, your source table is probably just too damn wide.
Since I'm not an Oracle person, I leave the syntax of this as an exercise for the reader, but the concept should be sound.
On a tangential note, you might want to look at Oracle's partitioning here and here.
Partitioning enables tables and indexes to be split into smaller, more manageable components and is a key requirement for any large database with high performance and high availability requirements. Oracle Database 11g offers the widest choice of partitioning methods including interval, reference, list, and range in addition to composite partitions of two methods such as order date (range) and region (list) or region (list) and customer type (list).
- Faster Performance - lowers query times from minutes to seconds
- Increases Availability - 24x7 access to critical information
- Improves Manageability - manage smaller 'chunks' of data
- Enables Information Lifecycle Management - cost-efficient use of storage
Partitioning the table into daily partitions would make archiving easier, as described here.