I am parsing an XML file consisting of ~600K lines. Parsing and inserting the data from the XML into the database is not a problem, as I am using SAX to parse and LOAD DATA INFILE (from a .txt file) to insert into the database. The .txt file is populated in Java using JDBC. All of this takes a good 5 seconds to populate the database.
My bottleneck is now executing multiple SELECT queries. Basically, each time I hit a certain XML tag, I call a SELECT query to grab data from another DB table. Adding these SELECT queries brings my population time to 2 minutes.
For example:
I am parsing through an XML file consisting of books, articles, theses, etc.
Each book/article has child elements such as isbn, title, author, editor, publisher.
At each author/editor/publisher, I need to query a table in a database.
Let's say I encountered the author tag with value Tolkien.
I need to query a table that already exists in the database called author_table.
The query is [select author_id from author_table where name = 'Tolkien']
This is where the bottle neck is happening.
Now my question is: Is there a way to speed this up?
BTW, the reason I think 2 minutes is long is that this is a homework assignment and I am not yet finished populating the database. I would estimate that the whole DB population would take 5 minutes, which is why I am seeking advice on performance optimization.
There are a few things you can consider:
Use connection pooling so you don't create and close a new connection every time you execute a query; doing so is expensive.
Cache whatever data you are obtaining via the SELECT queries. Is it possible to prefetch all the data beforehand so you don't have to query it on the spot? (See the sketch after this list.)
If your SELECT is slow, make sure the query is optimized and you have an appropriate index in place to avoid scanning the whole table.
Ensure you use buffered I/O in Java.
Can you subdivide the work into multiple threads? If so, create multiple worker threads to run instances of your job in parallel.
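For the caching/prefetch point, here is a minimal sketch (only the author_table and author_id names come from the question; everything else is illustrative): load the whole lookup table into a HashMap once before parsing, then resolve authors from memory instead of issuing a SELECT per tag.

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

public class AuthorCache {

    private final Map<String, Integer> authorIdsByName = new HashMap<>();

    // Load the whole lookup table once, before parsing starts.
    public AuthorCache(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT name, author_id FROM author_table")) {
            while (rs.next()) {
                authorIdsByName.put(rs.getString("name"), rs.getInt("author_id"));
            }
        }
    }

    // Called from the SAX handler for each author tag; no database round trip.
    public Integer lookup(String authorName) {
        return authorIdsByName.get(authorName);
    }
}

If the lookup table is too large to hold fully in memory, a bounded cache (for example a LinkedHashMap with eviction) is a reasonable middle ground.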
I have a very large table in the database. The table has a column called
"unique_code_string" and almost 100,000,000 records.
Every 2 minutes, I receive 100,000 code strings. They come in an array and are unique to each other. I need to insert them into the large table, but only if they are all "good".
The meaning of "good" is this:
None of the 100,000 codes in the array ever occurs in the large database table.
If one or more codes already occur in the large database table, the whole array is not used at all;
that is, no codes from the array are inserted into the large table.
Currently, I do it this way:
First, I loop over the array and check each code to see if the same code already exists in the large database table.
Second, if all the codes are "new", I do the real insert.
But this way is very slow, and I must finish everything within 2 minutes.
I am thinking of other ways:
Join the 100,000 codes into a SQL "in clause". Each code is 32 characters long, so I don't think any database will accept an "in clause" that is 32 * 100,000 characters long.
Use a database transaction: force the inserts anyway and roll the transaction back if an error happens. This causes some performance issues.
Use a database temporary table. I am not good at writing SQL queries, so please give me an example if this idea can work.
Now, can any experts give me some advice or solutions?
I am not a native English speaker; I hope you can see the issue I am facing.
Thank you very much.
Load the 100,000 rows into a table!
Create a unique index on the original table:
create unique index unq_bigtable_uniquecodestring on bigtable (unique_code_string);
Now, you have the tools you need. I think I would go for a transaction, something like this:
insert into bigtable ( . . . )
select . . .
from smalltable;
If any row fails (due to the unique index), then the transaction will fail and nothing is inserted. You can also be explicit:
insert into bigtable ( . . . )
select . . .
from smalltable
where not exists (select 1
from smalltable st join
bigtable bt
on st.unique_code_string = bt.unique_code_string
);
For this version, you should also have an index/unique constraint on smalltable(unique_code_string).
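Putting the pieces together in JDBC, a rough sketch (the smalltable, bigtable, and unique_code_string names come from the answer above; everything else is an assumption): batch-insert the 100,000 codes into the staging table, then run the all-or-nothing insert inside one transaction so the unique index rejects the whole batch if any code already exists.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class BulkCodeLoader {

    public void loadCodes(Connection con, List<String> codes) throws SQLException {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement()) {
            // Start from an empty staging table.
            st.executeUpdate("DELETE FROM smalltable");

            // Batch-insert the 100,000 codes into the staging table.
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO smalltable (unique_code_string) VALUES (?)")) {
                for (String code : codes) {
                    ps.setString(1, code);
                    ps.addBatch();
                }
                ps.executeBatch();
            }

            // All-or-nothing: the unique index on bigtable makes this single
            // statement fail if any code already exists, so the transaction
            // rolls back and nothing is inserted in that case.
            st.executeUpdate("INSERT INTO bigtable (unique_code_string) "
                           + "SELECT unique_code_string FROM smalltable");
            con.commit();
        } catch (SQLException e) {
            con.rollback();   // duplicate found (or other failure): bigtable stays unchanged
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}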
It's hard to find an optimal solution with so little information. Often this depends on the network latency between the application and the database server, and on the hardware resources.
You could load the 100,000,000 unique_code_string values from the database and use a HashSet or TreeSet to de-duplicate in memory before inserting into the database. If your database server is resource-constrained or there is considerable network latency, this might be faster.
Depending on how you receive the 100,000-record delta, you could load it into the database, e.g. a CSV file can be read via an external table. If you can get the data efficiently into a temporary table and the database server is not overloaded, you can do this very efficiently with SQL or a stored procedure.
You should spend some time understanding how real-time the update has to be, e.g. how many SQL queries are reading the 100,000,000-row table and whether you can allow some of those queries to be cancelled or blocked while you update the rows. Often it's a good idea to create a shadow table:
Create a new table as a copy of the existing 100,000,000-row table.
Disable the indexes on the new table.
Load the delta rows into the new table.
Rebuild the indexes on the new table.
Drop the existing table.
Rename the new table to the existing 100,000,000-row table's name.
The approach here is database-specific. It will depend on how your database defines its indexes; e.g. if you have a partitioned table it might not be necessary.
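As a very rough illustration of the shadow-table idea in JDBC, assuming an Oracle-style database and made-up table/index names (big_table, big_table_new, delta_staging); a CREATE TABLE ... AS SELECT copies no indexes, so they are simply built after the load, which matches the "disable, load, rebuild" steps above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ShadowTableRebuild {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = con.createStatement()) {

            // 1. Copy the existing table (data included) into a shadow table; no indexes come along.
            st.execute("CREATE TABLE big_table_new AS SELECT * FROM big_table");

            // 2. Load the delta rows into the shadow table (details omitted;
            //    could be an external table, SQL*Loader, or batched INSERTs).
            st.execute("INSERT INTO big_table_new SELECT * FROM delta_staging");

            // 3. Build the indexes only now, so the bulk load is not slowed
            //    down by index maintenance.
            st.execute("CREATE UNIQUE INDEX unq_bigtable_new_code "
                     + "ON big_table_new (unique_code_string)");

            // 4. Swap: drop the old table and rename the shadow table into place.
            st.execute("DROP TABLE big_table");
            st.execute("ALTER TABLE big_table_new RENAME TO big_table");
        }
    }
}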
I have a use case in which I do the following:
Insert some rows into a BigQuery table (t1) which is date partitioned.
Run some queries on t1 to aggregate the data and store them in another table.
In the above use case, I faced an issue today where the queries I ran showed some discrepancy in the aggregated data. When I executed the same queries some time later from the BigQuery Web UI, the aggregations were fine. My suspicion is that some of the inserted rows were not yet available to the query.
I read this documentation on BigQuery data availability. I have the following doubts about it:
The link says that "Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table". Is there an upper limit on the number of seconds to wait before it is available for real time analysis?
From the same link: "Data can take up to 90 minutes to become available for copy and export operations". Do the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
Also from the same link: "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column". Does this mean that I should not use _PARTITIONTIME in my queries while data is still present in the streamingBuffer?
Can somebody please clarify these?
You can use _PARTITIONTIME is null to detect which rows are in the buffer. You can then UNION this buffer with whatever date you wish (such as today). You could also wire in some logic that reads the buffer and, where the time is null, sets a time for the rest of the query logic.
This buffer is by design a bit delayed, but if you need immediate access to the data, you need to use the IS NULL trick to be able to query it.
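As an illustration, here is a hedged sketch using the BigQuery Java client (the project, dataset, and table names are made up): one query that UNIONs today's partition with whatever is still in the streaming buffer via _PARTITIONTIME IS NULL.

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class BufferAwareQuery {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Today's partition plus the rows still sitting in the streaming buffer
        // (which have a NULL _PARTITIONTIME).
        String sql =
                "SELECT * FROM `my_project.my_dataset.t1` "
              + "WHERE _PARTITIONTIME = TIMESTAMP(CURRENT_DATE()) "
              + "UNION ALL "
              + "SELECT * FROM `my_project.my_dataset.t1` "
              + "WHERE _PARTITIONTIME IS NULL";

        QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
                .setUseLegacySql(false) // backtick-quoted names require standard SQL
                .build();
        TableResult result = bigquery.query(config);
        result.iterateAll().forEach(row -> System.out.println(row));
    }
}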
For the questions:
Do the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
The results of a query are immediately available for any operation (like copy and export), even if that query was run on streamed data still in the buffer.
I am trying to improve a data transfer program that I wrote. I am looking for suggestions on how to make it quicker.
My program extracts data from a database (usually Oracle 11g) by filling a ResultSet and writing the result to a file. The program periodically looks at the tables and checks whether a special column has changed. For example, this could be such a query:
select columnA, columnB from scheme.table where changeColumn = '1'
Now comes the critical part. After extracting the data, I need to update this changeColumn to '0'. Since I have just used the ResultSet to export the data to a file, I have to rewind it, so the code looks like this:
extractedData.beforeFirst();
while (extractedData.next()) {
extractedData.updateString("changeColumn", "0");
extractedData.updateRow();
}
Now if this ResultSet is bigger (let's say more than 100,000 entries), this loop can take hours. Does anyone have any suggestions on how to improve its performance?
I have heard of setting the fetch size to a bigger value, but usually the ResultSet contains fewer than a dozen entries. Is there a way to set the fetch size dynamically?
Use a JDBC batch update. From all the rows that need updating, take the primary key of each row, add it to a batch update (SQL query), and execute the batch.
A good example from Mkyong shows how to do a JDBC batch update with a JDBC PreparedStatement.
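A minimal sketch of that approach, reusing the scheme.table and changeColumn names from the question and assuming a hypothetical primary key column id collected during the export:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class ChangeFlagResetter {

    // Reset changeColumn for all exported rows, sending the updates in batches.
    public void resetFlags(Connection con, List<Long> exportedIds) throws SQLException {
        String sql = "UPDATE scheme.table SET changeColumn = '0' WHERE id = ?";
        con.setAutoCommit(false);
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int count = 0;
            for (Long id : exportedIds) {
                ps.setLong(1, id);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();   // send every 1,000 updates in one round trip
                }
            }
            ps.executeBatch();           // flush the remaining updates
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}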
I am working on a solution for the problem described below but could not find any best practice or tool for it.
For a batch of requests (say 5,000 unique IDs and records) received by a web service, it has to fetch the rows for those unique IDs from the database, keep them in a buffer (or cache), and compare them with the records received in the web service call. If a particular piece of data (say a column) has changed, it is updated in the table for that unique ID, and in turn the child tables of that table are also affected. For example, if someone changes his laptop model number and country, the model number is updated in one table and the country value in another. It goes on like this, accessing multiple tables in a short time. The number of records coming in one web service call might reach 70K in an hour.
I don't have any option other than implementing it in Java. Is there a good practice for implementing this, or can it be achieved using any open-source Java tools? Please suggest. Thanks.
Hibernate is likely the first thing you should try. I tend to avoid it because it is overkill for most of my applications, but it is a standard tool for accessing databases that anyone who knows Java should at least have an understanding of. There are dozens of other solutions you could use, but Hibernate is the most often used.
JDBC is the API to use to access a relational database. Useful performance and security tips:
use prepared statements
use where ... in () queries to load many rows at once, but beware of the limit on the number of values in the in clause (1,000 max in Oracle); see the sketch below
use batched statements for your updates, rather than executing each update separately (see http://download.oracle.com/javase/1.3/docs/guide/jdbc/spec2/jdbc2.1.frame6.html)
See http://download.oracle.com/javase/tutorial/jdbc/ for a tutorial on JDBC.
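Here is a hedged sketch of the where ... in () tip, splitting the IDs into chunks of at most 1,000 (the Oracle limit) and binding them through a PreparedStatement; the table and column names (my_table, id, some_column) are made up for illustration.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChunkedInLoader {

    private static final int MAX_IN_VALUES = 1000; // Oracle's IN-list limit

    public Map<Long, String> loadByIds(Connection con, List<Long> ids) throws SQLException {
        Map<Long, String> rows = new HashMap<>();
        for (int from = 0; from < ids.size(); from += MAX_IN_VALUES) {
            List<Long> chunk = ids.subList(from, Math.min(from + MAX_IN_VALUES, ids.size()));

            // Build "?, ?, ?" with one placeholder per id in this chunk.
            StringBuilder placeholders = new StringBuilder();
            for (int i = 0; i < chunk.size(); i++) {
                placeholders.append(i == 0 ? "?" : ", ?");
            }
            String sql = "SELECT id, some_column FROM my_table WHERE id IN (" + placeholders + ")";

            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 0; i < chunk.size(); i++) {
                    ps.setLong(i + 1, chunk.get(i));
                }
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        rows.put(rs.getLong("id"), rs.getString("some_column"));
                    }
                }
            }
        }
        return rows;
    }
}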
This does not sound that complicated. Of course, you must know (or learn):
SQL
JDBC
Then you can go through the web service data record by record and for each record do the following:
fetch corresponding database record
for each field in record
if updated
execute corresponding update SQL statement
commit // every so many records
70K records per hour should not be the slightest problem for a decent RDBMS.
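A rough JDBC sketch of that outline (the table and column names such as customer, model_number, and country are made up, and the child-table updates mentioned in the question are left out for brevity), committing every so many records as suggested:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import java.util.Objects;

public class RecordSynchronizer {

    // Minimal stand-in for the data received from the web service.
    public static class WebServiceRecord {
        public long id;
        public String modelNumber;
        public String country;
    }

    public void sync(Connection con, List<WebServiceRecord> records) throws SQLException {
        con.setAutoCommit(false);
        try (PreparedStatement sel = con.prepareStatement(
                     "SELECT model_number, country FROM customer WHERE id = ?");
             PreparedStatement upd = con.prepareStatement(
                     "UPDATE customer SET model_number = ?, country = ? WHERE id = ?")) {
            int processed = 0;
            for (WebServiceRecord rec : records) {
                // fetch the corresponding database record
                sel.setLong(1, rec.id);
                try (ResultSet rs = sel.executeQuery()) {
                    if (!rs.next()) {
                        continue; // no matching row; skip (or insert, depending on requirements)
                    }
                    // compare the fields and update only if something changed
                    boolean changed = !Objects.equals(rs.getString("model_number"), rec.modelNumber)
                            || !Objects.equals(rs.getString("country"), rec.country);
                    if (changed) {
                        upd.setString(1, rec.modelNumber);
                        upd.setString(2, rec.country);
                        upd.setLong(3, rec.id);
                        upd.executeUpdate();
                    }
                }
                if (++processed % 500 == 0) {
                    con.commit(); // commit every so many records
                }
            }
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}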
I have a batch job written in Java which truncates and then loads a certain table in an Oracle database every few minutes. Reports are generated on web pages based on the data in that table. I am wondering about a good way of not affecting the report queries while the data loading process is happening, so that users won't end up with partial or no data.
If you process all your SQL statements inside a single transaction, there will always be a valid state as seen from the outside. Beware that TRUNCATE does not work inside a transaction, so you have to use DELETE. While this guarantees that the table always contains reasonable data, it needs a bigger rollback segment and will be considerably slower.
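A minimal sketch of the single-transaction reload in JDBC, with a made-up table name report_data: because the DELETE and the reload are committed together, readers see either the old data or the new data, never a half-empty table.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

public class ReportTableReloader {

    public void reload(Connection con, List<String[]> rows) throws SQLException {
        con.setAutoCommit(false);
        try (Statement st = con.createStatement();
             PreparedStatement ps = con.prepareStatement(
                     "INSERT INTO report_data (col_a, col_b) VALUES (?, ?)")) {
            // DELETE instead of TRUNCATE so the whole reload stays inside one transaction.
            st.executeUpdate("DELETE FROM report_data");

            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
            }
            ps.executeBatch();

            // Readers keep seeing the old rows until this commit.
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        } finally {
            con.setAutoCommit(true);
        }
    }
}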
You could have two tables plus a meta table which tracks which of the two is currently the main table used for querying. Your batch job truncates and loads the other table, and you switch the main table once the loading is completed. That way the query app always sees recent data, and you can load into the other table in the meantime.
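One possible shape for that meta-table approach, with hypothetical names (active_table_meta, report_data_a, report_data_b): the report code asks which copy is active, and the batch job flips the pointer only after its load has finished.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class TableSwitcher {

    // Reports call this to find out which copy to query right now.
    public String activeTable(Connection con) throws SQLException {
        try (Statement st = con.createStatement();
             ResultSet rs = st.executeQuery("SELECT active_table FROM active_table_meta")) {
            rs.next();
            return rs.getString("active_table"); // "report_data_a" or "report_data_b"
        }
    }

    // The batch job truncates and reloads the inactive copy, then flips the pointer.
    public void switchTo(Connection con, String newActiveTable) throws SQLException {
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE active_table_meta SET active_table = ?")) {
            ps.setString(1, newActiveTable);
            ps.executeUpdate();
        }
    }
}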
What I would do is set a flag in a DB table to indicate that an update is in progress, and have the reports look for that flag, display an appropriate message, and wait for the update to finish. Once the update is complete, clear the flag.