I made a sound-searching program.
First, I saved the file paths of the sound files in a MySQL DB; then I record a sound and search for a matching file by comparing sound fingerprints. But it takes a long time, because I have a lot of rows (sound files) in the DB. So, here are several questions.
I want to connect to the MySQL database from Java to get the information.
To increase the speed of the program, I want to use multithreading.
How can I do that?
(For example, I want the first thread to query the first 10 rows,
and the second thread the next 10 rows. The table has more than
500 rows.)
How can I compare the results of the threads? Can each thread return a value?
For your second point, about getting results back from the threads, you can use the Callable interface together with an ExecutorService (both are built into the standard library).
Try this example:
https://blogs.oracle.com/CoreJavaTechTips/entry/get_netbeans_6
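Here is a minimal sketch of that idea; the table name sounds, the file_path and fingerprint columns, the connection details, and the fingerprintsMatch comparison are all placeholders for your actual schema and matching logic, not what your program really uses. Each Callable queries one 10-row chunk via LIMIT/OFFSET and returns its matches through a Future.

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;

public class ChunkedSoundSearch {

    // One task per 10-row chunk: query the chunk, compare fingerprints,
    // and return the matching file paths. Table and column names are placeholders.
    static Callable<List<String>> chunkTask(int offset, byte[] recordedFingerprint) {
        return () -> {
            List<String> matches = new ArrayList<>();
            try (Connection con = DriverManager.getConnection(
                         "jdbc:mysql://localhost/sounddb", "user", "pass");
                 PreparedStatement ps = con.prepareStatement(
                         "SELECT file_path, fingerprint FROM sounds LIMIT 10 OFFSET ?")) {
                ps.setInt(1, offset);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        if (fingerprintsMatch(rs.getBytes("fingerprint"), recordedFingerprint)) {
                            matches.add(rs.getString("file_path"));
                        }
                    }
                }
            }
            return matches;
        };
    }

    // Placeholder for the real fingerprint comparison.
    static boolean fingerprintsMatch(byte[] stored, byte[] recorded) {
        return Arrays.equals(stored, recorded);
    }

    public static void main(String[] args) throws Exception {
        byte[] recorded = new byte[0];            // fingerprint of the recorded sound
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int offset = 0; offset < 500; offset += 10) {
            futures.add(pool.submit(chunkTask(offset, recorded)));
        }
        for (Future<List<String>> f : futures) {
            System.out.println(f.get());          // get() blocks and returns that task's result
        }
        pool.shutdown();
    }
}

The key point for your second question is Future.get(): each Callable returns a value, and the main thread collects and compares the results after submitting all the chunks.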
I'm developing a program that retrieves a cursor from a stored procedure in Oracle and writes the result to a text file. The cursor is expected to return about 2 million records with over 400 columns, and the result is stored in a ResultSet.
Before the program writes to the text file, it checks each column for masking; if the column is tagged to be masked, it masks the field using salt-and-key encryption.
The program is working as expected, but the runtime is longer than expected. The last run took 10 hours and wrote only 20k of the 2M records.
Is there a way to use multithreading so the program handles multiple records at a time instead of one by one?
I tried this setup before but it's not threading as expected:
ExecutorService threadPool = Executors.newFixedThreadPool(10);
while (rs.next()) {   // rs is a ResultSet
    threadPool.execute(new Runnable() {
        @Override
        public void run() {
            // the code that sets up the record and checks whether a field
            // needs to be masked goes here
        }
    });
}
threadPool.shutdown();
Any suggestions on how to do this?
Try to check where the bottleneck of your application is. I suspect it is not something you can solve with multithreading.
To identify the bottleneck, work step by step:
Execute the query manually against the database to see how the query itself performs. Use this to check whether there are indexes to add or any other optimization to make at the query level.
Extract records from the database without doing anything with them, to check the connection speed between the database and the Java application. Use this to identify whether the problem is the connectivity between the server hosting the database and the server hosting the Java application. Also check whether you can reduce the size of a single record (for example, don't extract all the fields with a SELECT * if you don't need them; select only the fields you use).
Check the speed of writing the file. Use buffered streams to increase speed. Don't hold the whole file in memory; stream it to disk periodically.
If you perform a long operation on each record retrieved from the database, spend your time on that code. Only in that case is multithreading likely to help; a rough sketch of offloading that per-record work is below.
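If the per-record work (masking, formatting) does turn out to be the slow part, here is one rough way to parallelize it, offered as a sketch rather than a drop-in fix for the code above: the reading thread copies each row's values out of the ResultSet (a ResultSet is not thread-safe), and only that copy is handed to the pool. maskAndWrite is a placeholder name for the masking and writing logic.

import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecordProcessor {

    public static void process(ResultSet rs) throws SQLException, InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(10);
        ResultSetMetaData meta = rs.getMetaData();
        int columnCount = meta.getColumnCount();

        while (rs.next()) {
            // Copy the row on the reading thread: a ResultSet is not thread-safe,
            // so worker threads must never touch rs directly.
            String[] row = new String[columnCount];
            for (int i = 1; i <= columnCount; i++) {
                row[i - 1] = rs.getString(i);
            }
            // Note: the pool's default queue is unbounded, so rows can still pile up
            // in memory if the workers are slower than the reader.
            pool.execute(() -> maskAndWrite(row));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Placeholder for the masking + formatting + file-writing logic;
    // the writer must be thread-safe (or each worker writes its own file).
    private static void maskAndWrite(String[] row) {
    }
}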
I want to count the number of rows in a table three times, based on three filters/conditions. I want to know which of the following two approaches is better for performance and cost-efficiency. We are using AWS for our servers, Java Spring for the server-side API, and MySQL for the database.
Use MySQL's COUNT to query the database three times, once per filtering criterion, to get the three counts.
Fetch all the rows of the table from the database with a single query, then use Java streams three times, once per filtering criterion, to get the three counts.
It'll be better to go with option (1). In extreme cases, if SELECT COUNT(*) FROM table is slow to execute, you should consider some tweaks on the SQL side. Not sure what you're using, but I found an example of this for SQL Server.
Assuming you go with option (2) and you have hundreds of thousands of rows, I suspect that your application will run out of memory (especially under high load) before you have time to worry about slow response times from running SELECT COUNT(*). Not to mention that you'll be transferring lots of unnecessary rows, slowing down the transfer between the database and the application.
A basic argument against doing counts in the app is that hauling lots of data from the server to the client is time-consuming. (There are rare situations where it is worth the overhead.) Note that your client and AWS may be quite some distance apart, thereby exacerbating the cost of shoveling lots of data. I am skeptical of what you call "server-side API". But even if you can run Java on the server, there is still some cost of shoveling between MySQL and Java.
Sometimes this pattern lets you get 3 counts with one pass over the data:
SELECT
    SUM(status='ready')    AS ready_count,
    SUM(status='complete') AS completed_count,
    SUM(status='unk')      AS unknown_count,
    ...
The trick here is that a Boolean expression has a value of 0 (for false) or 1 (for true). Hence the SUM() works like a 'conditional count'.
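From Java, the three counts then come back in a single row with one round trip. A minimal JDBC sketch follows; the table name orders, the status values, and the connection details are placeholders, not your actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StatusCounts {
    public static void main(String[] args) throws SQLException {
        // Table and column names are placeholders; one pass over the data, three counts.
        String sql = "SELECT "
                   + "  SUM(status='ready')    AS ready_count, "
                   + "  SUM(status='complete') AS completed_count, "
                   + "  SUM(status='unk')      AS unknown_count "
                   + "FROM orders";

        try (Connection con = DriverManager.getConnection(
                     "jdbc:mysql://localhost/mydb", "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            if (rs.next()) {
                System.out.println("ready:    " + rs.getLong("ready_count"));
                System.out.println("complete: " + rs.getLong("completed_count"));
                System.out.println("unknown:  " + rs.getLong("unknown_count"));
            }
        }
    }
}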
I am developing a Spring MVC application.
I have a requirement to process more than 100k records of data, and I can't make it database-dependent, so I have to implement all the logic in Java.
For now I am creating a number of threads and assigning, say, 1000 records to each thread to process.
I am using org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.
Questions:
What number of threads should I use?
Should I divide the records equally among the threads, or should I give a predefined number of records to each thread and increase the number of threads?
Is ThreadPoolTaskExecutor OK, or should I use something else?
Should I keep track of the record IDs assigned to each thread in Java or in the database? (Note: if I use the database, I have to make an extra database call for each record and update it after processing that record.)
Can anyone please suggest best practices for this scenario?
Any kind of suggestion will be great.
Note: Execution time is main concern.
Update:
Processing involves a huge number of database calls.
You can think of it as a search done in Java: take one record, compare it (in Java) with other records from the DB, then take another record and do the same.
To process a huge amount of data, you can use the Spring Batch framework.
Check this Doc.
Wiki page.
A plain ExecutorService should also be fine for you; there is no need to use Spring for this. But the thread count will be tricky. I can only say it depends; why not experiment to find the optimal number? A rough sketch of the chunked approach is below.
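For what it's worth, here is a minimal sketch of that chunked approach with a plain ExecutorService. The chunk size, the thread count, and the processChunk body are all placeholders to tune and fill in with your comparison logic; since your processing is DB-bound, the right thread count is probably governed more by your connection pool than by CPU cores.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedProcessing {

    public static void processAll(List<Long> recordIds) throws Exception {
        int chunkSize = 1000;                                      // starting point, tune it
        int threads = Runtime.getRuntime().availableProcessors();  // for DB-bound work you may want more,
                                                                   // bounded by your connection pool size
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        List<Future<Integer>> results = new ArrayList<>();
        for (int from = 0; from < recordIds.size(); from += chunkSize) {
            List<Long> chunk = recordIds.subList(from, Math.min(from + chunkSize, recordIds.size()));
            results.add(pool.submit(processChunk(chunk)));
        }

        int processed = 0;
        for (Future<Integer> f : results) {
            processed += f.get();       // propagates any failure from a chunk
        }
        pool.shutdown();
        System.out.println("Processed " + processed + " records");
    }

    // Placeholder: load each record, compare it against the others in the DB, etc.
    private static Callable<Integer> processChunk(List<Long> chunk) {
        return () -> {
            for (Long id : chunk) {
                // per-record work goes here
            }
            return chunk.size();
        };
    }
}

Keeping the assignment of IDs to chunks in memory like this avoids the extra bookkeeping calls to the database mentioned in the question, at the cost of having to redo a chunk if the run fails.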
We have a Linux box onto which a third-party tool drops files of about 0.5 MB each, and we have about 32,000 such files. We need to process those files and insert the data into an Oracle 10g DB. Someone in our organization has already created a Java program for this; it runs as a daemon thread and uses static fields to map the data in the file, save the data into the DB, and clear the static fields for the next line.
This is serial processing of the files and it seems very slow. I'm planning to replace it with a multithreaded version, or to run multiple Java processes (the same jar, but each started with java -jar run.jar) for parallel execution. But I'm concerned about issues such as data locking.
The question is: what is the best way to bulk-load this data into the DB using Java, or any other way?
Update:
The data we work on is in the following format; we process lines like these to make entries in the DB:
x.y.1.a.2.c.3.b = 12  // ID 1 of table A, one-to-many to table C with sequence ID 3, and its property b = 12
x.y.1.a.2.c.3.f = 143 // ID 1 of table A, one-to-many to table C with sequence ID 3, and its property f = 143
x.y.2.a.1.c.1.d = 12
Update:
We have about 15 tables that take this data. The data comes in blocks; each block holds related data, and a block's related data is processed together. So you are looking at roughly the following figures when inserting one block:
Table 1 | Table 2 | Table 3
---------------------------
5 rows  | 8 rows  | 12 rows
etc.,
Take a look at Oracle's SQL*Loader tool. It is used to bulk-load data into Oracle databases. You write a control file that describes the data and can apply some basic transforms, skip rows, convert types, etc. I've used it before for a similar process and it worked great, and the only things I had to maintain were the driver script and the control files. I realize you asked for a Java solution, but this might also meet your needs.
Ideally, this sounds like a job for SQL*Loader rather than Java.
If you do decide to do this job in Java, consider using executeBatch. An example is here; a rough sketch follows below.
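A minimal addBatch/executeBatch sketch; the table name, column names, connection details, and the batch size of 1000 are placeholders and starting points, not your actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BulkInsert {

    public static void insert(List<String[]> rows) throws SQLException {
        // Table, columns, and connection details are placeholders.
        String sql = "INSERT INTO table_a (id, prop_name, prop_value) VALUES (?, ?, ?)";
        try (Connection con = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "pass");
             PreparedStatement ps = con.prepareStatement(sql)) {
            con.setAutoCommit(false);          // commit per batch, not per row
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.setString(3, row[2]);
                ps.addBatch();
                if (++count % 1000 == 0) {     // flush every 1000 rows
                    ps.executeBatch();
                    con.commit();
                }
            }
            ps.executeBatch();                 // flush the remainder
            con.commit();
        }
    }
}

Batching the inserts and committing per batch removes most of the per-row round trips and commit overhead, which is usually the dominant cost when loading many small rows.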
I have a requirement to select around 60 million records from the database. Once I have all the records in a ResultSet, I have to format some columns per the client's requirements (date format and number format) and then write all the records to a file (secondary storage).
Currently I am selecting records on a per-day basis (7 selects for 7 days) from the DB and putting them in a HashMap, then reading from the HashMap, formatting some columns, and finally writing to a file (a separate file for each of the 7 days).
Finally I merge all 7 files into a single file.
But this whole process takes 6 hours to complete. To improve it, I created 7 threads for the 7 days, and each thread writes its own file.
Finally I merge all 7 files into a single file. This process takes 2 hours, but my program hits OutOfMemory after an hour or so.
Please suggest the best design for this scenario. Should I use some caching mechanism? If yes, which one and how?
Note: the client doesn't want to change anything in the database, such as creating indexes or stored procedures; they don't want to touch the database.
Thanks in advance.
Do you need to have all the records in memory to format them? You could try streaming the records through the process and straight to the file. If you're able to break the query up further, you might even be able to start processing the results while you're still retrieving them.
Depending on your DB back end, there may be tools to help with this, such as SSIS for SQL Server 2005+.
Edit
I'm a .NET developer, so let me suggest what I would do in .NET; hopefully you can convert it into comparable technologies on the Java side.
ADO.NET has a DataReader, which is a forward-only, read-only ("firehose") cursor over a result set. It returns data as the query is executing. This is very important. Essentially, my logic would be:
IDataReader reader = GetTheDataReader(dayOfWeek);
while (reader.Read())
{
    file.Write(formatRow(reader));
}
Since this executes while we are returning rows, you're not going to block on the network access, which I'm guessing is a huge bottleneck for you. The key here is that we don't keep any of this in memory for long: as we cycle, the reader discards the results and the file writes each row to disk. A rough Java equivalent is sketched below.
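On the Java side, the rough equivalent is a forward-only, read-only Statement with a streaming fetch size, writing each row as it arrives. This is only a sketch under assumptions about your setup: the connection details and formatRow are placeholders, and the exact fetch-size hint needed for true streaming is driver-specific.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class StreamToFile {

    public static void export(String jdbcUrl, String user, String pass,
                              String query, String outFile) throws SQLException, IOException {
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, pass);
             Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(outFile))) {

            // Ask the driver to stream rather than buffer the whole result.
            // The exact hint is driver-specific: MySQL Connector/J, for example,
            // needs Integer.MIN_VALUE here, while Oracle honours a plain row count.
            st.setFetchSize(1000);

            try (ResultSet rs = st.executeQuery(query)) {
                while (rs.next()) {
                    out.write(formatRow(rs));   // apply date/number formatting per column
                    out.newLine();              // nothing is kept in memory after this
                }
            }
        }
    }

    // Placeholder for the client's date and number formatting rules.
    private static String formatRow(ResultSet rs) throws SQLException {
        return rs.getString(1);
    }
}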
I think what Josh is suggesting is this:
You have loops, where you currently go through all the result records of your query (just using pseudo code here):
while (rec = getNextRec())
{
    put in hash ...
}
for each rec in (hash)
{
    format and save back in hash ...
}
for each rec in (hash)
{
    write to a file ...
}
instead, do it like this:
while (rec = getNextRec())
{
    format fields ...
    write to the file ...
}
Then you never have more than one record in memory at a time ... and you can process an unlimited number of records.
Obviously, reading 60 million records at once uses up all your memory, so you can't do that (i.e. your 7-thread model). Reading 60 million records one at a time uses up all your time, so you can't do that either (i.e. your initial read-to-file model).
So.... you're going to have to compromise and do a bit of both.
Josh has it right - open a cursor to your DB that simply reads the next record, one after the other in the simplest, most feature-light way. A "firehose" cursor (otherwise known as a read-only, forward-only cursor) is what you want here as it imposes the least load on the database. The DB isn't going to let you update the records, or go backwards in the recordset, which you don't want anyway, so it won't need to handle memory for the records.
Now that you have this cursor, the DB gives you one record at a time: read it and write it to a file (or several files). This should finish quite quickly. Your task then is to merge the files into one in the correct order, which is relatively easy (a sketch is below).
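The merge itself can be as simple as appending the per-day files in order. A small sketch using java.nio; the file names are placeholders for whatever your threads actually produce.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MergeFiles {
    public static void main(String[] args) throws IOException {
        Path merged = Paths.get("all_days.txt");               // placeholder output name
        try (OutputStream out = Files.newOutputStream(merged,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int day = 1; day <= 7; day++) {
                Path part = Paths.get("day_" + day + ".txt");  // placeholder per-day file names
                Files.copy(part, out);                         // append this day's file
            }
        }
    }
}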
Given the quantity of records you have to process, I think this is the optimum solution for you.
But... seeing as you're doing quite well so far anyway, why not just reduce the number of threads until you are within your memory limits? Batch processing is run overnight in many companies; this just seems to be another one of those processes.
It depends on the database you are using, but if it were SQL Server, I would recommend using something like SSIS to do this rather than writing a program.