Bulk load data into DB - java

We have a Linux box into which a third-party tool drops files of about 0.5 MB each, and we have roughly 32,000 such files. We need to process those files and insert the data into an Oracle 10g DB. Someone in our organization has already created a Java program; it runs as a daemon thread and uses static fields to map the data in each file, save the data into the DB, and clear the static fields for the next line.
This processes the files serially and seems very slow. I'm planning to make it multithreaded by getting rid of the static fields, or to run multiple Java processes (the same jar, each started with java -jar run.jar) for parallel execution. But I'm concerned about data locking and similar issues.
The question is: what is the best way to bulk load this data into the DB using Java? Or is there another way entirely?
Update:
The data we work on is in the following format; we process lines like the ones below to make entries into the DB.
x.y.1.a.2.c.3.b = 12 // ID 1 of table A, one-to-many to table C ID 3 in the sequence, and its property b = 12
x.y.1.a.2.c.3.f = 143 // ID 1 of table A, one-to-many to table C ID 3 in the sequence, and its property f = 143
x.y.2.a.1.c.1.d = 12
Update:
We have about 15 tables that take this data. The data arrives in blocks; each block holds related data, and the related data is processed together. So you are looking at roughly the following figures when inserting one block:
Table 1 | Table 2 | Table 3
---------------------------
5 rows  | 8 rows  | 12 rows
etc.,

Take a look at Oracle's SQL*Loader tool. It is designed for bulk loading data into Oracle databases. You write a control file describing the data, and it supports basic transforms, skipping rows, type conversions, etc. I've used it before for a similar process and it worked great, and the only things I had to maintain were the driver script and the control files. I realize you asked for a Java solution, but this might also meet your needs.
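If you wanted to drive SQL*Loader from the existing Java daemon, here is a minimal sketch, assuming sqlldr is on the PATH and that a control file (load.ctl, a placeholder name) already describes the flat-file layout; the credentials and file names are placeholders too:

import java.io.File;

public class SqlLoaderLauncherSketch {
    // Launch sqlldr for one data file; a non-zero exit code signals load errors.
    static int load(String dataFile) throws Exception {
        ProcessBuilder pb = new ProcessBuilder(
                "sqlldr",
                "userid=scott/tiger@ORCL",   // placeholder credentials / TNS alias
                "control=load.ctl",          // control file describing the flat-file layout
                "data=" + dataFile,          // the 0.5 MB file dropped by the third-party tool
                "log=" + dataFile + ".log");
        pb.redirectErrorStream(true);
        pb.redirectOutput(new File(dataFile + ".out"));
        Process p = pb.start();
        return p.waitFor();
    }
}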

Ideally, this sounds like a job for SQL*Loader rather than Java.
If you do decide to do this job in Java, consider using JDBC batching with addBatch/executeBatch; a minimal sketch is shown below.
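A minimal sketch of that batching pattern, assuming the lines of a block have already been parsed into value arrays; the table name (table_a), its columns, and the connection details are placeholders, not the poster's actual schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchInsertSketch {
    // Insert one block of pre-parsed rows with a single batch round trip.
    static void insertBlock(List<String[]> rows, String jdbcUrl, String user, String password) throws Exception {
        String sql = "INSERT INTO table_a (id, prop_name, prop_value) VALUES (?, ?, ?)";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);          // commit once per block, not per row
            for (String[] row : rows) {
                ps.setLong(1, Long.parseLong(row[0]));
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
            }
            ps.executeBatch();                  // send the whole block in one round trip
            conn.commit();
        }
    }
}

With the Oracle JDBC driver this avoids one network round trip per row, which is usually the main cost of row-by-row inserts.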

Related

How to clone a postgreSQL database with partial data [closed]

I'm facing the task of cloning a PostgreSQL database, keeping all constraints, indexes, etc., but including only the records related to a specific column value.
In other words, it's splitting a big database into multiple smaller databases.
For example, my original database has numerous schemas, each schema has numerous tables, and each table has records about multiple persons. I want to clone it to a new database, but copy only the records related to a specific person, identified by person id (i.e. clone all records in all tables that have person_id = xxx).
Is there a tool for this task, or any suggestions? (I'm familiar with Java and Python.)
The best way I have found to do this is to first export the complete schema using the pg_dump tool with the -s flag (to dump schema only, and not data), and then export data separately.
To load your schemas starting from a fresh, empty database, use pg_restore. It will read the output from pg_dump and use it to build a database.
When exporting the data, you'll need to classify each table (you can write a helper script to make this easier, or use Excel, etc.):
1. Tables that you want a subset of data from, based on some condition (i.e. a certain value in person_id)
2. Tables you want to copy in their entirety (like dimension tables, such as calendar and company_locations)
3. Tables that you don't need any data from
For (1), you will need to write an appropriate SELECT query that returns the subset of data that you want to copy. Put those queries in a script and have each one write the result to a separate file, named <schema>.<table>. Lastly, use the psql utility to load the data to the test database. psql has a special \copy command that makes this easy. It can be used from a terminal like this:
psql -c "\copy schema.table FROM '~/dump_data/schema.table' WITH DELIMITER ',' CSV;"
Use pg_dump again to take care of all of those falling under (2), using the -t <table name> flag to dump only the named tables, and -a to dump only data (no schema). This could also be added to the script for (1) by just adding an unqualified SELECT * for each table in (2) and loading the data the same way.
Tables falling under (3) were already handled by the initial export.

MySQL data export tool

So, say I have a SQL (MySQL) database containing 4 tables {A, B, C, D},
and I want to create a testing database which contains a subset of data from the first database (both in time and type).
So, for example:
"I want to create a new (identical in structure) database containing all of the data for user "bob" for the last two weeks."
The naive approach is to dump two weeks of data from the first database, use Vagrant/Chef to spin up a new empty database, and import the dumped data.
However, this does not work, as the tables have foreign keys to each other.
So, if I have two weeks of data from "A", it might rely on year-old data from "D".
My current solution is to use the data layer of my Java application to load the data into memory and then insert it into the new database. However, this is not sustainable/scalable.
So, in a roundabout way, my question is: does anyone know of any tools or tricks to migrate a "complete" set of data from one database to another, given a time period on one table and including all the related data from the other tables as well?
Any suggestions would be fantastic :)
Try your "naive approach", but SET FOREIGN_KEY_CHECKS=0; first, then run your backup queries, then SET FOREIGN_KEY_CHECKS=1;
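If you drive the import from Java rather than the mysql client, a minimal JDBC sketch of that same pattern, assuming the dump has already been split into one INSERT statement per line; the connection details and file name are placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class FkCheckImportSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/testdb", "user", "password");
             Statement st = conn.createStatement()) {
            st.execute("SET FOREIGN_KEY_CHECKS=0");     // allow out-of-order inserts
            for (String sql : Files.readAllLines(Paths.get("two_weeks_dump.sql"))) {
                if (!sql.trim().isEmpty()) {
                    st.execute(sql);                    // assumes one statement per line
                }
            }
            st.execute("SET FOREIGN_KEY_CHECKS=1");     // re-enable constraint checks
        }
    }
}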
There is a way to recreate a similar database with part of the data; it's not so simple, but it can be used:
Create the new database (schema only), for example using a backup or a schema comparer tool.
The next step is to copy the table data. You could do it with the help of a data comparer tool (see the schema comparer link). Select the tables you need and check the record data to synchronize.

Data Stitching Join/Merge - Oracle vs Java-based technique

Currently I am facing a distinct issue where I receive data from a web service call, and the same needs to be loaded into an Oracle table.
Scenario:
- I have a very large table with 500 columns; all columns are mandatory, and there is no option to split the table.
- The dataset is 50 million records, which I am trying to export from the source system, and it is continuously growing.
- Each request to the web service (at the source system) returns data for 50 columns, hence I need to submit 10 requests of 50 columns each to get a full record.
- Also, I can only receive 100,000 (1 lakh) records in one request for a specific set of columns.
Now, to import this data into the Oracle DB at the destination system, I have the following two choices:
1. First load the data into temporary tables of 50 columns each, and then join all of them to create the final table with all 500 columns.
2. Fire 10 parallel requests of 50 columns each, stitch the data together in my Java program, and then send an insert with all 500 columns.
I would like to know which technique works out better: an Oracle-based table join, or stitching on the Java side using the primary key column?
As the data set is very large, I am looking purely at the performance aspect. Are there any more optimized ways to solve the same problem?
From a performance point of view, the Oracle-based solution would clearly win. From an implementation point of view (aiming for a clear and simple solution), Oracle tables win again. Here is why:
Architecture point of view: Combining the data in your app makes your app stateful. From a simple stateless (receive, save, forget) application you would turn it into a complex state-aware one (save, look for matching records, not found, store, wait, look again, etc.). This is much harder to develop, maintain, or debug.
Performance point of view: Saving data into multiple tables and later combining them into one (whether by views, stored procedures, or simple selects) is exactly what Oracle is designed for. An immense amount of development time has been spent optimizing these basic features. Whatever you come up with to implement the same features (even if you are aware of some specifics) is likely to perform worse.
So overall I would strongly suggest option #1: leave the hard part to Oracle. Depending on how you want to use this data after the import (near real-time / once in a while / after extra filtering is applied), you can choose how you construct the final records by using one of these:
stored procedures
Oracle jobs
views.

Speeding Up Query Calls to MySQL using Java

I am parsing an XML file consisting of ~600K lines. Parsing and inserting the data from the XML into the database is not a problem, as I am using SAX to parse and LOAD DATA INFILE (from a .txt file) to insert into the database. The .txt file is populated in Java using JDBC. All of this takes a good 5 seconds to load into the database.
My bottleneck is now executing multiple SELECT queries. Basically, each time I hit a certain XML tag, I call a SELECT query to grab data from another DB table. Adding these SELECT queries brings my load time up to 2 minutes.
For example:
I am parsing through an XML file consisting of books, articles, theses, etc.
Each book/article has child elements such as isbn, title, author, editor, publisher.
At each author/editor/publisher, I need to query a table in a database.
Let's say I encountered the author tag with value Tolkien.
I need to query a table that already exists in the database, called author_table.
The query is [select author_id from author_table where name = 'Tolkien']
This is where the bottleneck is happening.
Now my question is: Is there a way to speed this up?
BTW, the reason I think 2 minutes is long is that this is a homework assignment and I am not yet finished populating the database. I estimate the whole DB population would take 5 minutes, which is why I am seeking advice on performance optimization.
There are a few things you can consider:
Use connection pooling so you don't create/close a new connection every time you execute a query; doing so is expensive.
Cache whatever data you are obtaining via SELECT queries. Is it possible to prefetch all the data beforehand so you don't have to query it on the spot? (See the sketch after this list.)
If your SELECT is slow, ensure the query is optimized and you have an appropriate index in place to avoid scanning the whole table.
Ensure you use buffered I/O in Java.
Can you subdivide the work into multiple threads? If so, create multiple worker threads to run multiple instances of your job in parallel.
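A minimal sketch of the prefetch idea for the author lookup, assuming author_table (from the question) fits in memory; the table and column names follow the question, while the connection details are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class AuthorCacheSketch {
    // Load (name -> author_id) once, then resolve authors in memory while parsing.
    static Map<String, Integer> loadAuthorCache(String jdbcUrl, String user, String password) throws Exception {
        Map<String, Integer> cache = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, author_id FROM author_table")) {
            while (rs.next()) {
                cache.put(rs.getString("name"), rs.getInt("author_id"));
            }
        }
        return cache;
    }
}

Inside the SAX handler, the per-tag SELECT then becomes a plain map lookup, e.g. authorCache.get("Tolkien"), which removes one round trip per tag.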

Best Design for the scenario

I have a requirement where I have to select around 60+ million records from a database. Once I have all the records in a ResultSet, I have to format some columns as per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns, and finally writing to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hours to complete. To improve it, I created 7 threads for the 7 days, with each thread writing its own file.
Finally I merge all 7 files into a single file. This version takes 2 hours, but my program runs out of memory (OutOfMemoryError) after about an hour.
Please suggest the best design for this scenario. Should I use some caching mechanism? If so, which one and how?
Note: the client doesn't want to change anything in the database, such as creating indexes or stored procedures; they don't want to touch the database at all.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through the process and write them straight to the file. If you're able to break the query up further, you might even be able to start processing the results while you're still retrieving them.
Depending on your DB backend, there may be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET, and hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query is executing, which is very important. Essentially, my logic would be:
IDataReader reader=GetTheDataReader(dayOfWeek);
while (reader.Read())
{
file.Write(formatRow(reader));
}
Since this executes while the rows are being returned, you're not going to block on network access, which I'm guessing is a huge bottleneck for you. The key here is that we are not storing any of this in memory for long; as we cycle, the reader discards the results and the file writes each row to disk.
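On the Java side, a minimal sketch of the same firehose pattern using a plain JDBC forward-only, read-only ResultSet; the query, fetch size, output path, and formatRow body are placeholders (note that some drivers need their own hint for row streaming, e.g. MySQL uses a fetch size of Integer.MIN_VALUE):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class StreamToFileSketch {
    static void exportDay(String jdbcUrl, String user, String password,
                          String query, String outFile) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(
                     query, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = new BufferedWriter(new FileWriter(outFile))) {
            ps.setFetchSize(1000);               // stream in chunks instead of loading everything
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    out.write(formatRow(rs));    // format one row at a time, never hold them all
                    out.newLine();
                }
            }
        }
    }

    // Placeholder for the client-specific date/number formatting.
    static String formatRow(ResultSet rs) throws Exception {
        return rs.getString(1);
    }
}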
I think what Josh is suggesting is this:
You have loops where you currently go through all the result records of your query (just using pseudocode here):
while (rec = getNextRec() )
{
put in hash ...
}
for each rec in (hash)
{
format and save back in hash ...
}
for each rec in (hash)
{
write to a file ...
}
Instead, do it like this:
while (rec = getNextRec() )
{
format fields ...
write to the file ...
}
Then you never have more than one record in memory at a time, and you can process an unlimited number of records.
Obviously, reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).
So... you're going to have to compromise and do a bit of both.
Josh has it right: open a cursor on your DB that simply reads the next record, one after the other, in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here, as it imposes the least load on the database. The DB won't let you update the records or go backwards in the recordset, which you don't want anyway, so it won't need to hold memory for the records.
Now that you have this cursor, the DB gives you one record at a time: read it and write it to a file (or several files). This should finish quite quickly. Your task is then to merge the files into one in the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.
