I'm developing a program that retrieves a cursor from an Oracle stored procedure and writes the result to a text file. The cursor is expected to return a total of 2M records with over 400 columns, and the result is stored in a ResultSet.
Before the program writes to the text file, it checks the columns for masking; if a column is tagged to be masked, it masks the field using salt and key encryption.
The program is working as expected, however the runtime is longer than expected. The last run took 10 hours but only wrote 20k records out of the 2M.
Is there a way to use multithreading so the program handles multiple records at a time instead of one by one?
I tried this setup before, but it's not threading as expected:
ExecutorService threadPool = Executors.newFixedThreadPool(10);
while (rs.next()) {
    // rs is a ResultSet
    threadPool.execute(new Runnable() {
        @Override
        public void run() {
            // the code that sets up the record and checks whether a field
            // needs to be masked is placed here
        }
    });
}
threadPool.shutdown();
Any suggestions on how to do this?
Try to check where the bottleneck of your application is. I suspect it is not something you can solve with multithreading.
To identify the bottleneck, work step by step:
Try executing the query manually, directly against the database, to see how the query itself performs. Use this to check whether there are indexes to add or any other optimization to make at the query level.
Try extracting records from the database without doing anything with them, to check the speed of the connection between the database and the Java application. Use this to identify whether the problem is the connectivity between the server hosting the database and the server hosting the Java application. Check whether you can reduce the size of a single record (for example, don't extract all the fields with a select * if you don't need them; select only the fields you need).
Check the speed of writing the file. Use buffered streams to increase speed. Don't hold the whole file in memory; stream it to disk periodically.
If you perform a long operation for each record retrieved from the database, spend your time on that code. Only in that case can multithreading help, for example with a producer/consumer setup like the sketch below.
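Here is a minimal sketch of that last point, assuming the masking really is the slow part. The needsMasking(...) and maskValue(...) methods are placeholders for your own tagging rules and salt/key encryption, and the row data is copied out of the ResultSet before the task is submitted, because a ResultSet must only be read from one thread:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class MaskedExport {

    private static final String POISON_PILL = "\u0000EOF"; // sentinel that stops the writer

    // Placeholders for the real masking rules and salt/key encryption.
    static boolean needsMasking(int columnIndex) { return false; }
    static String maskValue(Object value) { return String.valueOf(value); }

    public static void export(ResultSet rs, String fileName)
            throws SQLException, InterruptedException, IOException {
        // Bounded work queue plus CallerRunsPolicy so pending rows can't pile up in memory.
        ExecutorService workers = new ThreadPoolExecutor(10, 10, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(10_000),
                new ThreadPoolExecutor.CallerRunsPolicy());
        BlockingQueue<String> outputQueue = new LinkedBlockingQueue<>(10_000);

        // A single writer thread keeps the file access sequential and buffered.
        Thread writer = new Thread(() -> {
            try (BufferedWriter out = new BufferedWriter(new FileWriter(fileName))) {
                String line;
                while (!(line = outputQueue.take()).equals(POISON_PILL)) {
                    out.write(line);
                    out.newLine();
                }
            } catch (IOException | InterruptedException e) {
                throw new RuntimeException(e);
            }
        });
        writer.start();

        int columnCount = rs.getMetaData().getColumnCount();
        while (rs.next()) {
            // Copy the row before handing it to another thread.
            Object[] row = new Object[columnCount];
            for (int i = 0; i < columnCount; i++) {
                row[i] = rs.getObject(i + 1);
            }
            workers.execute(() -> {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < row.length; i++) {
                    String field = needsMasking(i) ? maskValue(row[i]) : String.valueOf(row[i]);
                    sb.append(field);
                    if (i < row.length - 1) sb.append('|');
                }
                try {
                    outputQueue.put(sb.toString()); // blocks if the writer falls behind
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
        outputQueue.put(POISON_PILL);
        writer.join();
    }
}

Note that with multiple workers the output order is no longer the cursor order; if the file must preserve it, keep a single worker or tag each row with a sequence number. It is also worth checking the statement's fetch size (Statement.setFetchSize) before reaching for threads, because a small fetch size forces a network round trip every few rows and can dominate the runtime on its own.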
I have a use case in which I do the following:
Insert some rows into a BigQuery table (t1) which is date partitioned.
Run some queries on t1 to aggregate the data and store them in another table.
In the above use case, I faced an issue today where the queries I ran showed some discrepancy in the aggregated data. When I executed the same queries some time later from the BigQuery web UI, the aggregations were fine. My suspicion is that some of the inserted rows were not yet available to the query.
I read this documentation on BigQuery data availability. I have the following doubts about it:
The link says that "Streamed data is available for real-time analysis within a few seconds of the first streaming insertion into a table". Is there an upper limit on the number of seconds to wait before it is available for real time analysis?
From the same link: "Data can take up to 90 minutes to become available for copy and export operations". Do the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
Also from the same link- "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column". Does this mean that I should not use _PARTITIONTIME in the queries till data is present in the streamingBuffer?
Can somebody please clarify these?
You can use _PARTITIONTIME IS NULL to detect which rows are still in the streaming buffer. You can then UNION that buffer with whatever date you wish (like today), or wire in some logic that reads the buffer and, where the pseudo column is NULL, substitutes a time for the rest of the query logic.
This buffer is by design a bit delayed, but if you need immediate access to the data you need the IS NULL trick to be able to query it.
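For example, here is a minimal sketch of that trick using the Java client library; the `my_dataset.events` partitioned table and the default project setup are assumptions for illustration:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class StreamingBufferQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Rows still in the streaming buffer have _PARTITIONTIME IS NULL,
        // so UNION them with today's committed partition to see everything.
        String sql =
            "SELECT * FROM `my_dataset.events` " +
            "WHERE _PARTITIONTIME = TIMESTAMP(CURRENT_DATE()) " +
            "UNION ALL " +
            "SELECT * FROM `my_dataset.events` " +
            "WHERE _PARTITIONTIME IS NULL";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        result.iterateAll().forEach(row -> System.out.println(row));
    }
}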
For the questions:
Does the following operations come under this restriction?
Copy the result of a query to another table
Exporting the result of a query to a csv file in cloud storage
The results of a query are immediately available for any operation (like copy and export), even if that query was run on streamed data still in the buffer.
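As an illustration, a hedged sketch with the Java client library, writing a query result straight into a destination table and then exporting that table to Cloud Storage; the dataset, table, and bucket names are made up:

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class QueryCopyExport {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Write the query result directly into a destination table.
        TableId destination = TableId.of("my_dataset", "daily_aggregates");
        QueryJobConfiguration queryConfig = QueryJobConfiguration
                .newBuilder("SELECT user_id, COUNT(*) AS events FROM `my_dataset.events` GROUP BY user_id")
                .setDestinationTable(destination)
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();
        bigquery.query(queryConfig);   // runs the query and materializes the result

        // Export the materialized result to Cloud Storage as CSV.
        Job extract = bigquery.getTable(destination)
                .extract("CSV", "gs://my-bucket/daily_aggregates.csv");
        extract.waitFor();
    }
}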
I made a sound-searching program.
First, I saved the file paths of the sound files in a MySQL DB, then recorded a sound and searched for a matching file by comparing sound fingerprints. But it takes a long time, because I have a lot of rows (sound files) in the DB. So, here are several questions.
I want to connect to the MySQL database to get information using Java
To increase the speed of the program, I want to use multithreading.
How can I do that?
(For example, I want to make the first thread query the first 10 rows,
and the second the next 10 rows. Approximately, the table has more than
500 rows.)
How can I compare the result of the threads? Can each thread return a value?
On your second point about thread job results, you can use the Callable interface and an ExecutorService (both are built into the standard library).
Try this example:
https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6
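As a rough sketch of how the pieces fit together (the sounds table, its columns, the connection details, and the matchScore(...) fingerprint comparison are all placeholders for your own setup), each Callable queries one block of rows and returns its best match, and the main thread compares the returned values:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSoundSearch {

    // Simple holder for a candidate match.
    static class Match {
        final String filePath;
        final double score;
        Match(String filePath, double score) { this.filePath = filePath; this.score = score; }
    }

    // Placeholder for the real fingerprint comparison against the recorded sound.
    static double matchScore(String filePath) { return 0.0; }

    public static void main(String[] args) throws Exception {
        int blockSize = 10;
        int totalRows = 500;
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Match>> futures = new ArrayList<>();

        for (int offset = 0; offset < totalRows; offset += blockSize) {
            final int start = offset;
            // submit(Callable) lets each task return a value.
            futures.add(pool.submit(() -> {
                Match best = new Match(null, -1);
                try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/sounds_db", "user", "password");
                     PreparedStatement ps = con.prepareStatement(
                         "SELECT file_path FROM sounds ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, blockSize);
                    ps.setInt(2, start);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            String path = rs.getString("file_path");
                            double score = matchScore(path);
                            if (score > best.score) best = new Match(path, score);
                        }
                    }
                }
                return best;   // this is the value Future.get() will hand back
            }));
        }

        Match overallBest = new Match(null, -1);
        for (Future<Match> f : futures) {   // get() blocks until that thread finishes
            Match m = f.get();
            if (m.score > overallBest.score) overallBest = m;
        }
        pool.shutdown();
        System.out.println("Best match: " + overallBest.filePath);
    }
}

Future.get() is how each thread "returns a value": it blocks until that task finishes and hands back whatever the Callable returned.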
I have a batch job written in Java which truncates and then reloads a certain table in an Oracle database every few minutes. There are reports generated on web pages based on the data in that table. I am wondering about a good way of not affecting the report-querying side while the data loading process is happening, so that users won't end up with partial and/or no data.
If you process all your SQL statements inside a single transaction, there will always be a valid state seen from outside. Beware that TRUNCATE does not work in transactions, so you have to use DELETE. While this guarantees you always have reasonable data in your table, it needs a bigger rollback segment and will be considerably slower.
You could have 2 tables and a meta table which tracks which table is the main one being used for querying. Your batch job truncates and loads one of the tables, and you switch the main table once the loading is completed. The query app then gets recent data, and you can load into the other table next time (see the sketch below).
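A minimal sketch of that swap, assuming hypothetical tables REPORT_A, REPORT_B and SOURCE_DATA, a one-row meta table ACTIVE_TABLE(name) that the report queries consult, plain JDBC with the usual java.sql imports, and placeholder connection details:

String url = "jdbc:oracle:thin:@//dbhost:1521/service"; // placeholder
try (Connection con = DriverManager.getConnection(url, "user", "password")) {
    con.setAutoCommit(false);

    // Find out which table the reports are currently reading.
    String active;
    try (Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT name FROM ACTIVE_TABLE")) {
        rs.next();
        active = rs.getString(1);
    }
    String inactive = "REPORT_A".equals(active) ? "REPORT_B" : "REPORT_A";

    // Truncate and reload the table nobody is querying. TRUNCATE is DDL and
    // commits implicitly in Oracle, which is harmless here because readers
    // never look at the inactive table.
    try (Statement st = con.createStatement()) {
        st.executeUpdate("TRUNCATE TABLE " + inactive);
        st.executeUpdate("INSERT INTO " + inactive + " SELECT * FROM SOURCE_DATA");
    }

    // Flip the pointer; on commit the reports switch to the freshly loaded table.
    try (PreparedStatement ps = con.prepareStatement("UPDATE ACTIVE_TABLE SET name = ?")) {
        ps.setString(1, inactive);
        ps.executeUpdate();
    }
    con.commit();
}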
What I would do is set a flag in a DB table to indicate that the update is in progress, and have the reports look for that flag, display an appropriate message, and wait for the update to finish. Once the update is complete, clear the flag.
I am using SQL Server 2008 and Java 6 / Spring jdbc.
We have a table with records count ~60mn.
We need to load this entire table into memory, but firing select * on this table takes hours to complete.
So I am splitting the query as below
String query = " select * from TABLE where " ;
for(int i =0;i<10;i++){
StringBuilder builder = new StringBuilder(query).append(" (sk_table_id % 10) =").append(i);
service.submit(new ParallelCacheBuilder(builder.toString(),namedParameters,jdbcTemplate));
}
Basically, I am splitting the query by adding a where condition on the primary key column.
The above code snippet splits the query into 10 queries running in parallel; this uses Java's ExecutorCompletionService.
I am not a SQL expert, but I guess the above queries will each need to load the same data into memory before applying the modulo operator on the primary key column.
Is this a good/bad/best/worst way? Is there any other way? Please post.
Thanks in advance!
If you do need all 60M records in memory, select * from ... is the fastest approach. Yes, it's a full scan; there's no way around it. It's disk-bound, so multithreading won't help you. Not having enough memory available (swapping) will kill performance instantly. Data structures that take significant time to expand will hamper performance, too.
Open the Task Manager and see how much CPU is spent; probably little; if not, profile your code or just comment out everything but the reading loop. Or maybe it's a bottleneck in the network between the SQL server and your machine.
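For example, a rough way to time just the raw read, along the lines of commenting out everything but the reading loop (assuming an open java.sql.Connection named connection and the usual java.sql imports):

long start = System.currentTimeMillis();
long rows = 0;
try (Statement st = connection.createStatement(
         ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
     ResultSet rs = st.executeQuery("SELECT * FROM TABLE")) {
    int cols = rs.getMetaData().getColumnCount();
    while (rs.next()) {
        for (int i = 1; i <= cols; i++) {
            rs.getObject(i);   // touch the value, discard it
        }
        rows++;
    }
}
System.out.println("Read " + rows + " rows in " + (System.currentTimeMillis() - start) + " ms");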
Maybe SQL Server can offload data faster to an external dump file of known format using some internal pathways (e.g. Oracle can). I'd explore the possibility of dumping a table into a file and then parsing that file with C#; it could be faster e.g. because it won't interfere with other queries that the SQL server is serving at the same time.
I have a requirement where I have to select around 60 million plus records from a database. Once I have all records in a ResultSet, I have to format some columns as per the client requirement (date format and number format) and then write all records to a file (secondary memory).
Currently I am selecting records on a day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns, and finally writing to a file (a separate file for each of the 7 days).
Finally I am merging all 7 files in a single file.
But this whole process takes 6 hrs to complete. To improve this, I created 7 threads for the 7 days, with all threads writing separate files.
Finally I am merging all 7 files into a single file. This process takes 2 hours, but my program goes OutOfMemory after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one and how?
Note: The client doesn't want to change anything at the database, like creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try to stream the records through a process and straight to the file. If you're able to break the query up even further, you might be able to start processing the results while you're still retrieving them.
Depending on your DB backend, there might be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET and hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query is executing. This is very important. Essentially, my logic would be:
IDataReader reader = GetTheDataReader(dayOfWeek);
while (reader.Read())
{
    file.Write(formatRow(reader));
}
Since this executes while we are returning rows, you're not going to block on the network access, which I am guessing is a huge bottleneck for you. The key here is that we are not storing any of this in memory for long: as we cycle, the reader discards the results and the file writes each row to disk.
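On the Java side, a rough equivalent of that loop (assuming plain JDBC; buildQueryForDay(...) and formatRow(...) stand in for your own query and date/number formatting, and url/user/password are placeholders) would be a forward-only, read-only ResultSet with a fetch size hint:

try (Connection con = DriverManager.getConnection(url, user, password);
     Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                        ResultSet.CONCUR_READ_ONLY);
     BufferedWriter out = new BufferedWriter(new FileWriter("day_" + dayOfWeek + ".txt"))) {
    st.setFetchSize(1000);   // stream rows in batches instead of buffering everything
    try (ResultSet rs = st.executeQuery(buildQueryForDay(dayOfWeek))) {
        while (rs.next()) {
            out.write(formatRow(rs));   // format the columns as they stream past
            out.newLine();
        }
    }
}

How strictly the fetch size is honoured depends on the JDBC driver, but the shape of the loop is what matters: each row is formatted and written as it streams past, so nothing accumulates in memory.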
I think what Josh is suggesting is this:
You have loops, where you currently go through all the result records of your query (just using pseudo code here):
while (rec = getNextRec() )
{
put in hash ...
}
for each rec in (hash)
{
format and save back in hash ...
}
for each rec in (hash)
{
write to a file ...
}
instead, do it like this:
while (rec = getNextRec() )
{
format fields ...
write to the file ...
}
then you never have more than 1 record in memory at a time ... and you can process an unlimited number of records.
Obviously, reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here as it imposes the least load on the database. The DB isn't going to let you update the records, or go backwards in the recordset, which you don't want anyway, so it won't need to handle memory for the records.
Now that you have this cursor, you're being given 1 record at a time by the DB. Read it and write it to a file (or several files); this should finish quite quickly. Your task then is to merge the files into one with the correct order, which is relatively easy.
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.