I have a DynamoDB table with 1000+ rows. I need to write a Spring Boot app that reads the table rows one by one and makes a REST call to another service that accepts one JSON payload at a time. Looping through the table one row at a time does not seem like an optimal solution. Can this be achieved with multi-threading, and if so, how? Or is there a better option? Can someone help?
You can read N records at a time, say 50. You can use more than one thread to read from the database, since you are only reading and not writing. Once you have read the records, submit N tasks to an ExecutorService, where each task holds the data of one record and calls the external service.
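A minimal sketch of that fan-out, assuming a fixed pool of 10 threads and Spring's RestTemplate; the service URL and the idea that each record is already serialized to a JSON string are illustrative assumptions, not details from the question:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.MediaType;
import org.springframework.web.client.RestTemplate;

public class RecordDispatcher {

    private final RestTemplate restTemplate = new RestTemplate();
    // Pool size is a starting point; tune it against what the downstream service can absorb.
    private final ExecutorService executor = Executors.newFixedThreadPool(10);

    /** Posts each JSON document to the downstream service on a pool thread. */
    public void dispatch(List<String> jsonRecords, String serviceUrl) {
        HttpHeaders headers = new HttpHeaders();
        headers.setContentType(MediaType.APPLICATION_JSON);
        for (String json : jsonRecords) {
            executor.submit(() -> restTemplate.postForEntity(
                    serviceUrl, new HttpEntity<>(json, headers), String.class));
        }
    }

    /** Stops accepting new work and waits for in-flight calls to finish. */
    public void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(5, TimeUnit.MINUTES);
    }
}
```

The DynamoDB read side can stay paged (the scan API returns results in pages anyway), so you never need to hold all 1000+ rows in memory at once.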
You could use @Async. It's one of the easiest ways of running work on multiple threads in Spring Boot.
First of all, why do you have to read the DB table one row at a time? You can issue a single query and get all the rows you need (in your repository layer, if it's a Spring Boot app), and then use multi-threading as described by others (@Async, ExecutorService, etc.) in your service classes.
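A sketch of the @Async variant both answers point at; the JSON payload and service URL are placeholders:

```java
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.Async;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Configuration
@EnableAsync  // without this, @Async methods run synchronously
class AsyncConfig {
}

@Service
class RecordForwarder {

    private final RestTemplate restTemplate = new RestTemplate();

    // Runs on Spring's async task executor; the caller returns immediately.
    // Note: @Async only takes effect when the method is invoked through the
    // Spring proxy, i.e. from another bean, not from inside this class.
    @Async
    public void forward(String json, String serviceUrl) {
        restTemplate.postForObject(serviceUrl, json, String.class);
    }
}
```

The service layer would then fetch all rows once (e.g. via a hypothetical recordRepository.findAll()) and call forward(...) in a plain loop; Spring schedules each call onto its executor.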
I have a use case like this:
Read bulk records from multiple tables (more than 10,000 records).
Apply business logic to validate the records.
Update the validated records into a different table, in the same database, from the one the records were retrieved from.
I would like to implement my use case with Spring Batch and a scheduler, to run at a certain point in time.
I have read about Spring Batch and understand that there are ItemReader, ItemProcessor, and ItemWriter components that execute the job in chunks.
I would also like to implement it with multi-threading by defining a task executor (org.springframework.core.task.SimpleAsyncTaskExecutor). I have decided to go with the approach below:
Read records from the DB with a query in the ItemReader, calling a DAO implemented in another module that uses the Spring Hibernate transaction manager.
Process the records one at a time in the ItemProcessor.
Update the records in the target table in the ItemWriter, with a commit interval of some number.
I am new to Spring Batch, so I would like to understand whether this is a good solution or whether there is a better way to implement it. I also have a few questions about how DB connections and transactions will be maintained.
Will there be one connection and transaction for the whole batch job? Or will multiple connections and transactions be opened at certain points of execution? How is this handled?
How do I effectively process this use case with multi-threading, say 10 or 20 threads at a time?
Can someone please provide a brief explanation, or any samples, to help me understand this concept better?
Thanks in advance.
Your approach sounds good to me.
I will try to answer your first question.
Will there be one connection and transaction for the whole batch job? Or will multiple connections and transactions be opened at certain points of execution? How is this handled?
You can have multiple data sources and multiple transaction managers, but managing them will be difficult, as you will have to take care of many things that Spring Batch can otherwise manage on its own. Most Spring Batch operations, like restart and stop, need metadata that Spring Batch stores in the database; if you try to play with that, those operations might not work very well.
I would suggest keeping both the Spring Batch tables and your business-specific tables in the same data source. That way you need only one data source and one transaction manager, and you do not have to worry about the transaction issues you might otherwise face.
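On the connection question: in a chunk-oriented step, each chunk commit runs in its own transaction, with a connection borrowed from the pool per transaction; there is no single connection held for the whole job. Below is a sketch of the described job, written against the pre-Spring-Batch-5 builder factories; SourceRecord and the reader/processor/writer beans are placeholders for your own:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;

@Configuration
@EnableBatchProcessing
public class ValidationJobConfig {

    @Bean
    public Step validateStep(StepBuilderFactory steps,
                             ItemReader<SourceRecord> reader,
                             ItemProcessor<SourceRecord, SourceRecord> processor,
                             ItemWriter<SourceRecord> writer) {
        return steps.get("validateStep")
                // commit interval: one transaction (and one pooled connection) per 100 items
                .<SourceRecord, SourceRecord>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                // multi-threaded step: whole chunks run concurrently
                .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))
                .throttleLimit(10)  // cap at 10 concurrent chunk threads
                .build();
    }

    @Bean
    public Job validationJob(JobBuilderFactory jobs, Step validateStep) {
        return jobs.get("validationJob").start(validateStep).build();
    }
}
```

Two caveats worth knowing: with a multi-threaded step the reader must be thread-safe (e.g. a JdbcPagingItemReader rather than a stateful cursor reader), and restartability is limited because Spring Batch cannot track exactly which items each thread has seen.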
I am developing a Spring MVC application. I have a requirement to process more than 100k records of data, and I can't make it database-dependent, so I have to implement all the logic in Java.
For now I am creating a number of threads and assigning, say, 1000 records to each thread to process.
I am using org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor.
Questions:
What is the suggested number of threads I should use?
Should I divide the records equally among the threads, or should I give a predefined number of records to each thread and increase the number of threads?
Is ThreadPoolTaskExecutor OK, or should I use something else?
Should I track which record IDs are assigned to each thread in Java or in the database? (Note: with the database I have to make an extra database call for each record and update it after processing that record.)
Can anyone please suggest best practices for this scenario? Any kind of suggestion would be great.
Note: execution time is the main concern.
Update:
The processing includes a huge number of database calls. You can think of it as searching done in Java: take one record, compare it (in Java) with other records from the DB, then take another record and do the same.
To process a huge amount of data, you can use the Spring Batch framework; see its reference documentation and the Wikipedia page.
An ExecutorService should be fine for you; there is no need to use Spring. But the thread count is the tricky part. I can only say it depends, so why not experiment to find the optimal number?
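A sketch of a starting point for that experiment, assuming the work is I/O-bound (many DB calls per record, as the update describes); the batch size, pool sizing, and processOne are placeholders to tune:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RecordBatchRunner {

    public void processAll(List<Long> recordIds, int batchSize) throws InterruptedException {
        // I/O-bound work tolerates more threads than cores; treat this as a
        // starting point and measure rather than guess.
        int threads = Runtime.getRuntime().availableProcessors() * 2;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Fixed-size batches rather than an equal split: the pool balances the
        // load, so a thread that finishes a "fast" batch just takes the next one.
        for (int start = 0; start < recordIds.size(); start += batchSize) {
            List<Long> batch =
                    recordIds.subList(start, Math.min(start + batchSize, recordIds.size()));
            pool.submit(() -> batch.forEach(this::processOne));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void processOne(Long id) {
        // placeholder for the per-record comparison logic from the question
    }
}
```

This also answers the assignment question in passing: keeping the batch assignment in memory (inside the task itself) is cheaper than persisting per-record ownership to the database, which would double the number of DB calls.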
I’m struggling with how to design a Spring Batch job. The overall goal is to retrieve ~20 million records and save them to a SQL database.
I’m doing it in two parts. First, I retrieve the 20 million IDs of the records I want and save them to a file (or DB); this is a relatively fast operation. Second, I loop through my file of IDs in batches of 2,000 and retrieve the related records from an external service, repeating, 2,000 IDs at a time, until I’ve retrieved all of the records. Each batch of 2,000 records I retrieve, I save to a database.
Some may ask why I’m doing this in two steps. I eventually plan to make the second step run in parallel, so that I can retrieve batches of 2,000 records concurrently and hopefully greatly speed up the download. Having the IDs allows me to partition the job into batches. For now, let’s not worry about parallelism and just focus on how to design a simpler sequential job.
Imagine I’ve already solved the first problem of saving all of the IDs locally. They are in a file, one ID per line. How do I design the steps for the second part?
Here’s what I’m thinking…
Read 2,000 IDs using a flat-file reader. I’ll need an aggregator, since I only want to make one query to my external service per batch of 2,000 IDs. This is where I’m struggling. Do I nest a series of readers? Or can I do ‘reading’ in the processor or writer?
Essentially, my problem is that I want to read lines from a file, aggregate those lines, and then immediately do another ‘read’ to retrieve the corresponding records. I almost want to chain readers together.
Finally, once I’ve retrieved the records from the external service, I’ll have a List of records, which means that when they arrive at the writer, I’ll have a list of lists. I want a flat list of objects so that I can use the JdbcBatchItemWriter out of the box.
Thoughts? Hopefully that makes sense.
Andrew
This is a matter of design and is subjective, but based on the Spring Batch examples I've found (from SpringSource) and my personal experience, doing additional reading in the processor step is a good solution to this problem. You can also chain multiple processors/readers together in the 'processor' step. So, while the names don't exactly match, I find myself doing more and more 'reading' in my processors.
http://docs.spring.io/spring-batch/trunk/reference/html/patterns.html#drivingQueryBasedItemReaders
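A sketch of that driving-query pattern, where the reader emits lightweight IDs and the processor does the second read; ExternalService and Item are hypothetical stand-ins, and the pre-Spring-Batch-5 interface is assumed:

```java
import org.springframework.batch.item.ItemProcessor;

// Driving-query pattern: the reader supplies IDs, and the processor performs
// the "second read" against the external source.
public class RecordLookupProcessor implements ItemProcessor<Long, Item> {

    private final ExternalService externalService;  // hypothetical client

    public RecordLookupProcessor(ExternalService externalService) {
        this.externalService = externalService;
    }

    @Override
    public Item process(Long id) throws Exception {
        return externalService.fetchById(id);  // one lookup per item, not per chunk
    }
}
```

Note the trade-off for this particular question: a processor works item by item, so this makes one service call per ID rather than one per 2,000 IDs, which is what the next answer addresses.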
Given that you want to call your external service just once per chunk of 2,000 records, you'll actually want to make this service call in an ItemWriter. That is the standard recommended way to do chunk-level processing.
You can create a custom ItemWriter<Long> implementation. It receives the list of 2,000 IDs as input and calls the external service. The result from the external service should allow you to create a List<Item>, and your writer can then simply forward this List<Item> to a JdbcBatchItemWriter<Item> delegate.
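A sketch of that writer, using the pre-Spring-Batch-5 ItemWriter signature; ExternalService, its fetchByIds method, and Item are hypothetical stand-ins:

```java
import java.util.List;
import org.springframework.batch.item.ItemWriter;

public class ExternalServiceWriter implements ItemWriter<Long> {

    private final ExternalService externalService;  // assumed client: one call per chunk
    private final ItemWriter<Item> jdbcDelegate;     // e.g. a configured JdbcBatchItemWriter<Item>

    public ExternalServiceWriter(ExternalService externalService, ItemWriter<Item> jdbcDelegate) {
        this.externalService = externalService;
        this.jdbcDelegate = jdbcDelegate;
    }

    @Override
    public void write(List<? extends Long> ids) throws Exception {
        // One service call for the whole 2,000-ID chunk...
        List<Item> items = externalService.fetchByIds(ids);
        // ...then hand the resulting flat list straight to the JDBC writer,
        // avoiding the list-of-lists problem from the question.
        jdbcDelegate.write(items);
    }
}
```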
I need to create files in my own format (say 5,000 records each) from a huge data table (which may contain 5 million records), and I want this creation to be multi-threaded.
So how can I form queries that efficiently fetch slices of records, like 1..5000, 5001..10000, and so on?
I could form something like select * from table where rownum < 5000 and not exists (already fetched records), but that is not efficient.
Please suggest the best way of forming the queries, or any alternative approach to creating the files.
If you're on Oracle 11g, you can use the DBMS_PARALLEL_EXECUTE package to run your procedure in multiple threads; see the Oracle documentation for details.
If you're on an earlier version, you can implement DIY parallelism using a technique from Tom Kyte; the Hungry DBA provides a good explanation on his blog.
It sounds like you need a set of queries using the MySQL LIMIT clause to implement paging (e.g., one query gets the first 1,000 rows, another gets the second 1,000, and so on).
You could form these queries and submit them as Callables to an ExecutorService with a set number of threads; the executor will manage the threads. I suspect it may be more efficient to both query and write your records within each Callable, but that is an assumption that would need testing.
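A sketch of that shape, with each Callable owning one page and one output file; the table name, page size, and writeFile are placeholders, and the DataSource is assumed to be backed by a connection pool:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.sql.DataSource;

public class FileExportJob {

    private static final int PAGE_SIZE = 5000;

    private final DataSource dataSource;  // assumed to be a connection pool

    public FileExportJob(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void export(long totalRows) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Callable<Void>> tasks = new ArrayList<>();
        for (long offset = 0; offset < totalRows; offset += PAGE_SIZE) {
            final long pageOffset = offset;
            tasks.add(() -> {
                // Each task borrows its own pooled connection, reads one page,
                // and writes one file.
                try (Connection conn = dataSource.getConnection();
                     PreparedStatement ps = conn.prepareStatement(
                             "SELECT * FROM big_table ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, PAGE_SIZE);
                    ps.setLong(2, pageOffset);
                    try (ResultSet rs = ps.executeQuery()) {
                        writeFile(rs, pageOffset / PAGE_SIZE);  // hypothetical file writer
                    }
                }
                return null;
            });
        }
        pool.invokeAll(tasks);  // blocks until every page has been exported
        pool.shutdown();
    }

    private void writeFile(ResultSet rs, long fileIndex) {
        // format-specific writing goes here
    }
}
```

One design caveat: LIMIT ... OFFSET gets slower as the offset grows, because the skipped rows are still scanned; for 5 million rows, keyset pagination (WHERE id > ? ORDER BY id LIMIT ?) keeps every page equally cheap, at the cost of pages no longer being independent.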
I'm currently working on a Java project in which I need to populate a (to me) big MySQL database. I have to do web scraping using Jsoup and store the results in the database as well. I estimate I will have roughly 1,500,000 to 2,000,000 records to insert. In my first attempt, I just used a loop to insert these records, and it took one week to insert about a third of them, which I think is too slow. Is it possible to make this process multi-threaded, so that I can split my records into three sets, say 500,000 records per set, and then insert them into one database (one table, specifically)?
Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch inserts; they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: as @Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will run at the speed of the slowest operation. You could have multiple threads fetching web pages and adding the data to a queue, and then a single thread draining the queue into the database using batch inserts.
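A sketch combining both suggestions: scraper threads feed a bounded queue, and one writer thread drains it with JDBC batch inserts. The table, column layout, and the String[] row shape are illustrative assumptions:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ScrapeAndStore {

    private static final int BATCH_SIZE = 1000;

    // Bounded queue: scrapers block when the writer falls behind (backpressure).
    private final BlockingQueue<String[]> queue = new LinkedBlockingQueue<>(10_000);

    // Called by the scraper threads: parse a page with Jsoup, enqueue the fields.
    public void produce(String[] row) throws InterruptedException {
        queue.put(row);
    }

    // Single writer thread: drain the queue and flush with executeBatch().
    // Shutdown handling (e.g. a sentinel row on the queue) is omitted for brevity.
    public void consume(String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO scraped_data (title, body) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            List<String[]> buffer = new ArrayList<>(BATCH_SIZE);
            while (true) {
                buffer.add(queue.take());  // blocks until at least one row is available
                queue.drainTo(buffer, BATCH_SIZE - buffer.size());
                for (String[] row : buffer) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();  // one round trip instead of one per row
                conn.commit();
                buffer.clear();
            }
        }
    }
}
```

With MySQL Connector/J, the rewriteBatchedStatements=true connection property is also worth testing, since it rewrites a batch into multi-row INSERT statements.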
Just make sure two (or more) threads don't use the same connection at the same time; using a connection pool solves that. C3P0 and Apache DBCP come to mind.
You can insert these records from different threads, provided they use different primary key values.
You should also look at Spring Batch, which I believe will be useful in your case.
You can chunk your record set into batches and do this, but perhaps you should think about other factors as well.
Are you doing a network round trip for each INSERT? If yes, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If yes, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.
I think a multi-threaded approach would be useful for your issue, but you have to use a connection pool such as C3P0 or the Tomcat 7 connection pool for better performance.
Another solution is to use a batch-operation provider such as Spring Batch; other utilities for batch operations exist as well.
Another solution is to use a PL/SQL procedure with a structured input parameter.