Java ETL from Oracle using CursorExpression - java

I am attempting to export a large amount of data from multiple separate tables from Oracle 11 into a NoSQL database via a Java app utilising JDBI.
The data is being read from the following tables: store, store2, staff and product.
The final desired data structure is a multi-tiered structure like so;
Country
    Store1
        StoreFloorSize
        StoreAddress1
        StorePostcode
        StoreStaff
            StaffMember1
                StaffForename
            StaffMember2
                StaffForename
        StoreProducts
            Product1
                ProductName
            Product2
                ProductName
    Store2
        ...
There will be many countries and each country can have many stores and each store can have many staff members/products.
So far I've attempted to perform this export by querying the data from Oracle in the structure utilising cursor expressions (refcursors) and then
mapping the results to Java objects before saving to the new NoSQL database.
A very simplified version of the query used to extract data from Oracle is below:
select countryName,
       cursor(select storeFloorSize, storeAddress1, storePostcode,
                     cursor(select staffForename from staff where staff.storeId = store.id),
                     cursor(select productName from product where product.storeId = store.id)
              from (select * from store union all select * from store2) store
              where store.countryId = country.id)
from country
This approach works; however, due to the volume of data it takes a long time and has a few constraints.
The entire process takes two to three days to complete, yet according to the Oracle stats the time spent executing on Oracle is only approximately 6 hours.
So far in trying to track down where this additional time is taken I've done/checked the following;
First the NoSQL database has been removed from the equation entirely and the performance remains the same.
The Oracle server and machine on which the Java application is running on are both fine in terms of CPU and Memory resources (very little usage on both machines for both resources)
I've broken the task up across multiple threads, each working on a separate partition of the table (country in the above example); each thread performs select from Oracle -> map to Java objects -> save to NoSQL. This parallel processing, when done across a large number of threads, reduced the execution time on Oracle but had no real effect on the overall time. (These are separate threads in Java and each has its own connection to Oracle via a connection pool.)
I've tried modifying the fetchSize property, but it makes very little difference. (This adds another complication: each result row contains three cursors, so when parallelised across a large number of threads the OPEN_CURSORS setting on Oracle needs to increase drastically very quickly.)
I can't seem to identify any particular bottlenecks, and resource utilisation is still very low.
As mentioned in the first line, I'm using the JDBI wrapper around JDBC to perform the query and map the results to Java objects; however, if this were the bottleneck I believe I'd see high usage on the machine running the Java application.
Is there anything I may have overlooked with regard to the above, or might I be better off moving back to plain SQL queries and performing the transformation in Java?
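For the "plain SQL plus transformation in Java" alternative, a minimal sketch of the Java side might look like the following. It assumes a hypothetical flat query that returns one row per (country, store, staff member), ordered or not, and rebuilds the nesting in memory. The class and field names (FlatRowGrouper, StaffRow) are made up for illustration and the JDBC/JDBI fetching is left out. Products would need a second flat query of the same shape, since joining staff and products in one query would multiply the rows.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FlatRowGrouper {

    /** One row of a hypothetical flat query: one line per (country, store, staff member). */
    static final class StaffRow {
        final String countryName;
        final String storeAddress;
        final String staffForename;

        StaffRow(String countryName, String storeAddress, String staffForename) {
            this.countryName = countryName;
            this.storeAddress = storeAddress;
            this.staffForename = staffForename;
        }
    }

    /** Groups flat rows into country -> store -> staff forenames, keeping encounter order. */
    static Map<String, Map<String, List<String>>> group(List<StaffRow> rows) {
        Map<String, Map<String, List<String>>> countries = new LinkedHashMap<>();
        for (StaffRow row : rows) {
            countries
                .computeIfAbsent(row.countryName, c -> new LinkedHashMap<>())
                .computeIfAbsent(row.storeAddress, s -> new ArrayList<>())
                .add(row.staffForename);
        }
        return countries;
    }

    public static void main(String[] args) {
        List<StaffRow> rows = List.of(
            new StaffRow("UK", "1 High Street", "Alice"),
            new StaffRow("UK", "1 High Street", "Bob"),
            new StaffRow("UK", "2 Mill Lane", "Carol"),
            new StaffRow("FR", "3 Rue Neuve", "Denis"));
        System.out.println(group(rows));
    }
}
```

With this shape each query uses a single ordinary result set, so the number of open cursors per connection no longer grows with the number of rows being read.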

Related

Is SQL IN Query better for performance or Java method ContainsAll

I have a scenario where a user selects a bulk input of up to 100K entries, and I need to validate that this data belongs to the user and satisfies X other conditions. Should I use a complex Oracle SQL query with a composite IN (id, column) to validate it, or
should I fetch the data for this user that satisfies the conditions into application memory and use List.containsAll, by first loading all the data (with all the other conditions) for this particular user into a dbList and then validating dbList.containsAll(inputList)?
Which one will be better performance-wise: a composite IN in the DB with the bulk input sent over, or fetching the data and validating it with containsAll?
I tried running the SQL query in the SIT environment, and it takes around 70-90 seconds, which is far too slow. It would be better in prod, but I still feel the query has to sift through a huge amount of data in the DB even though it is indexed by user ID.
In the DB I am using count(*) with IN, like below:
SQL query:
select count(*) from user_table where user_id='X123' and X conditions and user_input IN(
('id','12344556'),
('id','789954334'),
('id','343432443'),
('id','455543545')
------- 50k entries
);
Also there are other AND conditions as well for validating the user_input are valid entries.
Sample Java code:
getConditionedQuery = "select user_backedn_id from user_table where user_id='X123' AND X complex conditions";
List<String> userInputList = request.getInputList();
List<String> userDBList = sqlStatement.execute(getConditionedQuery);
Boolean validDate = userDBList.containsAll(userInputList);
The SQL query with the composite IN condition takes around 70-90 seconds in lower environments, whereas the Java code using containsAll looks much faster.
Incidentally, I don't want to use temp table and execute the procedure because again bulk input entry in DB is a hassle. I am using ATG framework and the module is RESTful so performance is most important here.
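One detail worth noting about the Java variant: calling containsAll on a List does a linear scan for every element checked, so with two lists of around 100K entries the worst case is on the order of 10^10 comparisons. A minimal sketch of the same check against a HashSet is below; the class and method names (InputValidator, allOwned) are hypothetical, and the lists stand in for whatever the query and the request actually return.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InputValidator {

    /** Returns true when every submitted ID is among the IDs the database returned for the user. */
    static boolean allOwned(List<String> idsFromDb, List<String> submittedIds) {
        // Copy the database IDs into a hash set once, so each lookup is O(1) on average
        // instead of a linear scan of the list.
        Set<String> owned = new HashSet<>(idsFromDb);
        return owned.containsAll(submittedIds);
    }

    public static void main(String[] args) {
        List<String> idsFromDb = List.of("12344556", "789954334", "343432443");
        System.out.println(allOwned(idsFromDb, List.of("12344556", "343432443"))); // true
        System.out.println(allOwned(idsFromDb, List.of("12344556", "000")));       // false
    }
}
```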
I personally believe that you should apply all filters on the database side, for several reasons. First, exchanging that much data over the network consumes unnecessary bandwidth. Second, bringing all that data into the JVM and processing it there consumes more memory. Third, databases can be tuned and optimised for complex queries. Talk to your DBA, give them the query and ask them to run an analysis. The analysis will tell you whether you need to add any indexes to optimise your query.
Also, contrary to your belief, my experience says that if a query takes 70-90 seconds in SIT, it will take MORE time in prod. Although PROD machines are much faster, the amount of data in PROD is much higher than in SIT, so the query will take longer. But that does not mean you should haul the data over the network and process it in the JVM. Plus, the JVM's heap is usually much smaller than the memory available to the database.
Also, as we move to cloud-enabled, containerised application architectures, network bandwidth is charged. E.g. if your application is in the cloud and the database is on premises, imagine the amount of data you will move back and forth to finally filter out 10 rows from a million rows.
I recommend that you write a good query, optimise it and process as many conditions as possible on the database side. Hope it helps!
In general it's a good idea to push as much of the processing as possible to the database. Even though it might look like a bottleneck, it is generally well optimised and can work over large amounts of data faster than you would.
For read queries like the one you're describing, you can even offload the work to read replicas, so it doesn't overwhelm the master.

Working with files using Java

I have a question about the best way to handle huge files in Java.
Should we use a NoSQL database like Cassandra, or try to use our existing Oracle database (to dump the content of the file)?
My file can contain at most 1 or 2 fields. Mostly, all I need to do with the file content is search for an ID and return a boolean.
The file can contain tens of millions of records or as few as thousands.
The file can also be refreshed daily. Whenever it is refreshed I need to clear all previous values.
Any suggestions would be helpful!!
Regards,
Vicky
As per your requirements,
Oracle
Is good for indexing and fits your requirements if each day's data is in the tens of millions of records.
The index will be stored in memory and searches will be fast for such small records. If the table is also small, you can request that the table be kept in memory, which will be even faster if any other column is also required.
You can drop the table every day and import the file again as a new table. This should work.
Cassandra
Is also good for indexing. All your searches will also be fast (similar to Oracle for such small data).
Cassandra is a NoSQL database designed to provide scalability, high write throughput and availability for high-volume data and queries.
Cassandra generally runs in clustered environments to provide the above properties.
I would suggest checking your requirements. If you just want to keep the data in a DB and query it once in a while, or maybe at 100 requests per second, then using Cassandra is like driving a nail into a wall with a sledgehammer when a small hammer or mallet is enough.

Data Stitching Join/Merge - Oracle Vs Java based technique

Currently I am facing a distinct issue: I receive data from a webservice call, and it needs to be loaded into an Oracle table.
Scenario:
- I have a very large table with 500 columns; all columns are mandatory, and there is no option to split the table.
- The dataset, which I am trying to export from the source system, is 50m records and is continuously increasing.
- I receive 50 columns of data at a time by firing a request to the webservice (at the source system), hence I need to submit 10 requests of 50 columns each to get a full record.
- Also, I can only receive 100000 (1 lac) records in one request for a specific set of columns.
Now, to import the same data into the Oracle DB at the destination system I have the following two choices:
1. First export the data into temporary tables of 50 columns each, and then run a join across all of them to create the final table with all 500 columns
2. Fire 10 parallel requests of 50 columns each, stitch the data together in my Java program and then send an insert query with all 500 columns
I would like to know which technique works out better: the Oracle-based table join, or stitching on the Java side using the primary key column?
As the dataset is very large, I am looking purely at the performance aspect. Are there any more optimized ways to solve the same problem?
From performance point of view the Oracle based solution would clearly win. From implementation point of view (aiming for a clear and simple solution) Oracle tables win again. Here is why:
Architecture point of view: Combining the data in your app will make your app stateful. From a simple stateless (receive-save-forget) application you would turn it into a complex state-aware (save-look for joint records-did not find anything-store-wait-look again-etc). This is much harder to develop, maintain or debug.
Performance point of view: Saving data into multiple tables and later combining them into one (either by views or stored procedures or simple selects) is something Oracle is designed for. An immense amount of development time has been spent on optimizing these basic features. Whatever you come up with to implement the same features (even though you are aware of some specifics) would likely perform worse.
So overall I would strongly suggest Option #1, leave it for Oracle to do the hard part. Depending on how you want to use this data after the import (almost real-time / once in a while / after extra filtering applied) you can choose how you construct the final records by using one of these:
stored procedures
Oracle jobs
views.
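To make the statefulness of Option #2 concrete, here is a minimal sketch of what Java-side stitching by primary key involves. The class and method names (RecordStitcher, accept, pendingCount) are hypothetical, and the sketch assumes each column group arrives exactly once per primary key. Every partial record has to be held in memory until its remaining column groups arrive, which is the extra state the answer warns about.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class RecordStitcher {
    private final int partsPerRecord;
    // Partial records waiting for their remaining column groups: the state the app must hold.
    private final Map<Long, Map<String, Object>> pending = new HashMap<>();
    private final Map<Long, Integer> partsSeen = new HashMap<>();

    RecordStitcher(int partsPerRecord) {
        this.partsPerRecord = partsPerRecord;
    }

    /** Adds one column group for a primary key; returns the full record once all groups have arrived. */
    synchronized Optional<Map<String, Object>> accept(long primaryKey, Map<String, Object> columns) {
        pending.computeIfAbsent(primaryKey, k -> new HashMap<>()).putAll(columns);
        int seen = partsSeen.merge(primaryKey, 1, Integer::sum);
        if (seen < partsPerRecord) {
            return Optional.empty(); // still incomplete, keep waiting
        }
        partsSeen.remove(primaryKey);
        return Optional.of(pending.remove(primaryKey));
    }

    /** Number of incomplete records currently held in memory. */
    synchronized int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        RecordStitcher stitcher = new RecordStitcher(2); // 2 column groups here; 10 in the question
        System.out.println(stitcher.accept(1L, Map.of("col1", "a")).isPresent()); // false
        System.out.println(stitcher.accept(1L, Map.of("col51", "b")).isPresent()); // true
    }
}
```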

Fast way to replicate a huge database table

We are currently trying to solve a performance problem: searching for data and presenting it in a paginated way takes about 2-3 minutes.
Upon further investigation (and after several rounds of SQL tuning), it seems that searching is slow simply because of the sheer amount of data.
A possible solution that I'm currently investigating is to replicate the data in a searchable cache. Now this cache can be in the database (i.e. materialized view) or it could be outside the db (nosql approach). However, since I would like the cache to be horizontally scalable, I am leaning towards caching it outside the database.
I've created a proof of concept, and indeed, searching in my cache is faster than in the db. However, the initial full replication takes a long time to complete. Although the full replication will just happen once, and then succeeding replication will just be incremental against those that changed since the last replication, it would still be great if I can speed up the initial full replication.
However, during full replication, aside from the slowness of the query's execution, I also have to battle against network latency. In fact, I can deal with the slow query execution time. But the network latency is really really slowing the replication down.
Which leads me to my question: how can I speed up my replication? Should I spawn several threads, each one doing a query? Should I use a scrollable?
Replicating the data in a cache seems like replicating the functionality of the database.
From reading other comments, I see that you are not doing this to avoid network roundtrips, but because of costly joins. In many DBMS you can create temporary tables - like this:
CREATE TEMPORARY TABLE abTable AS SELECT * FROM a , b ;
If a and b are large (relatively permanent) tables, then you will have a one-time cost of 2-3 minutes to create the temporary table. However, if you use abTable for many queries, then the subsequent per query cost will be much smaller than
SELECT name, city, ... FROM a , b ;
Other database systems have a view concept which lets you do something like this
CREATE VIEW abView AS SELECT * FROM a , b ;
Changes in the underlying a and b table will be reflected in the abView.
If you really are concerned about network round trips, then you may be able to replicate parts of the database on the local computer.
A good database management system should be able to handle your data needs. So why reinvent the wheel?
1. SELECT * FROM YOUR_TABLE
2. Map results into an object or data structure
3. Assign a unique key for each object or data structure
4. Load the key and object or data structure into a WeakHashMap to act as your cache.
I don't see why you need sorting, because your cache should access values by unique key in O(1) time. What is sorting buying you?
Be sure to think about thread safety.
I'm assuming that this is a read-only cache, and you're doing this to avoid the constant network latency. I'm also assuming that you'll do this once on start up.
How much data per record? 12M records at 1KB per record means you'll need 12GB of RAM just to hold your cache.
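A minimal sketch of the load-once, look-up-by-key cache described above is shown below, together with the memory arithmetic. Note the deliberate swap: it uses ConcurrentHashMap rather than the WeakHashMap mentioned in the steps, because WeakHashMap is not thread-safe and drops entries once their keys are no longer strongly referenced. The class and method names (ReadOnlyCache, estimateBytes) are hypothetical, and the records would come from the SELECT rather than a literal list.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class ReadOnlyCache<K, V> {
    // ConcurrentHashMap allows safe lookups from many threads without external locking.
    private final Map<K, V> entries = new ConcurrentHashMap<>();

    /** Loads every record once, indexing it by the key the extractor derives from it. */
    ReadOnlyCache(Iterable<V> records, Function<V, K> keyOf) {
        for (V record : records) {
            entries.put(keyOf.apply(record), record);
        }
    }

    /** Average O(1) lookup by unique key; returns null when the key is absent. */
    V get(K key) {
        return entries.get(key);
    }

    int size() {
        return entries.size();
    }

    /** Rough payload estimate in bytes, ignoring JVM object and map overhead. */
    static long estimateBytes(long recordCount, long bytesPerRecord) {
        return recordCount * bytesPerRecord;
    }

    public static void main(String[] args) {
        ReadOnlyCache<Integer, String> cache =
            new ReadOnlyCache<>(List.of("a", "bb", "ccc"), String::length);
        System.out.println(cache.get(2)); // bb
        // 12 million records at roughly 1 KB each:
        System.out.println(estimateBytes(12_000_000L, 1_000L)); // 12000000000 bytes, about 12 GB
    }
}
```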

speed up operation on mysql

I'm currently writing a Java project against MySQL in a cluster with ten nodes. The program simply pulls some information from the database, does some calculation, then pushes some data back to the database. However, there are millions of rows in the table. Is there any way to split up the job and utilize the cluster architecture? How do I do multi-threading across different nodes?
I watched an interesting presentation on using Gearman to do Map/Reduce style things on a mysql database. It might be what you are looking for: see here. There is a recording on the mysql webpage here (have to register for mysql.com though).
I'd think about doing that calculation in a stored procedure on the database server and pass on bringing millions of rows to the middle tier. You'll save yourself a lot of bytes on the wire. Depending on the nature of the calculation, your schema, indexing, etc. you might find that the database server is well equipped to do that calculation without having to resort to multi-threading.
I could be wrong, but it's worth a prototype to see.
Assume the table (A) you want to process has 10 million rows. Create a table B in the database to store the set of rows processed by each node. You can then write the Java program so that it first fetches the last row claimed by other nodes and then adds an entry to the same table informing other nodes what range of rows it is going to process (you can decide the size of this range). In our case, let's assume each node can process 1000 rows at a time. Node 1 reads table B and finds it empty. Node 1 then inserts a row ('Node1', 1000), signalling that it is processing rows whose primary key in A is <= 1000 (assuming the primary key of table A is numeric and ascending). Node 2 comes along and finds that the first 1000 primary keys are being processed by another node. Hence it inserts a row ('Node2', 2000), informing others that it is processing rows between 1001 and 2000. Please note that access to table B should be synchronized, i.e. only one node can work on it at a time.
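The claiming logic above can be sketched in Java as follows. This is only an in-memory stand-in: the list plays the role of table B, and the synchronized keyword only serialises access within one JVM. Across real nodes the same read-then-insert would have to run inside a database transaction that locks table B (for example with SELECT ... FOR UPDATE on InnoDB). The names RangeAllocator, Range and claim are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeAllocator {

    /** A claimed range of primary keys, inclusive at both ends. */
    static final class Range {
        final String node;
        final long from;
        final long to;

        Range(String node, long from, long to) {
            this.node = node;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return node + ": " + from + ".." + to;
        }
    }

    private final long batchSize;
    private final long maxKey;
    private final List<Range> claims = new ArrayList<>(); // plays the role of table B

    RangeAllocator(long batchSize, long maxKey) {
        this.batchSize = batchSize;
        this.maxKey = maxKey;
    }

    /** Claims the next unprocessed range, or returns null when every key has been claimed. */
    synchronized Range claim(String node) {
        long lastClaimed = claims.isEmpty() ? 0 : claims.get(claims.size() - 1).to;
        if (lastClaimed >= maxKey) {
            return null;
        }
        Range range = new Range(node, lastClaimed + 1, Math.min(lastClaimed + batchSize, maxKey));
        claims.add(range); // the equivalent of inserting ('Node1', 1000) into table B
        return range;
    }

    public static void main(String[] args) {
        RangeAllocator allocator = new RangeAllocator(1000, 2500);
        System.out.println(allocator.claim("Node1")); // Node1: 1..1000
        System.out.println(allocator.claim("Node2")); // Node2: 1001..2000
        System.out.println(allocator.claim("Node1")); // Node1: 2001..2500
        System.out.println(allocator.claim("Node2")); // null
    }
}
```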
Since you only have one MySQL server, make sure you're using the InnoDB engine to reduce table locking on updates.
Also, I'd try to keep your queries as simple as possible, even if you have to run more of them. This can increase the chances of query cache hits, as well as reduce the overall workload on the backend by offloading some of the query matching and work to the frontends (where you have more resources). It will also reduce the time a row lock is held, thereby decreasing contention.
The proposed Gearman solution is probably the right tool for this job, as it will allow you to offload batch processing from MySQL back to the cluster transparently.
You could set up sharding with a mysql on each machine but the set up time, maintenance and the changes to database access layer might be a lot of work compared to a gearman solution. You might also want to look at the experimental spider engine that could allow you to use multiple mysqls in unison.
Unless your calculation is very complex, most of the time will be spent retrieving data from MySQL and sending the results back to MySQL.
As you have a single database, no amount of parallelism or clustering on the application side will make much difference.
So your best option would be to do the update in pure SQL if that is at all possible, or to use a stored procedure so that all processing takes place within the MySQL server and no data movement is required.
If this is not fast enough, then you will need to split your database among several instances of MySQL and come up with some scheme to partition the data based on an application key.
