I have a use case like this:
Read bulk records from multiple tables (more than 10,000 records).
Apply business logic to validate the records.
Update the validated records to a different table (in the same database) from the one they were read from.
I would like to implement this use case with Spring Batch and a scheduler so that it runs at a certain point in time.
I have read about Spring Batch and understand that a chunk-oriented step uses an ItemReader, ItemProcessor, and ItemWriter to execute the job.
I would also like to implement it with multi-threading by defining a taskExecutor (org.springframework.core.task.SimpleAsyncTaskExecutor). I have decided to go with the approach below:
Read the records from the DB in the ItemReader, with a query that calls a DAO implemented in another module using the Spring Hibernate transaction manager.
Process the records one at a time in the ItemProcessor.
Update the records in the target table in the ItemWriter, with a commit interval of some size.
I am new to Spring Batch processing, so I would like to understand whether this is a good solution or whether there is a better way to implement it. I also have a few questions regarding how DB connections and transactions will be maintained.
Will there be one connection and one transaction for the whole batch job? Or will multiple connections and transactions be opened at certain points during execution? How will this be handled?
How can the above use case be processed effectively with multi-threading, using 10 or 20 threads at a time?
Can someone please provide a brief explanation, or some samples, to help me understand this concept better?
Thanks in advance.
Your approach sounds good to me.
I will try to answer your first question.
Will there be one connection and one transaction for the whole batch job? Or will multiple connections and transactions be opened at certain points during execution? How will this be handled?
You can have multiple data sources and multiple transaction managers, but managing them will be difficult, as you will have to take care of many things that Spring Batch can do on its own.
Most Spring Batch operations, like restart and stop, need metadata that Spring Batch stores in the DB. If you try to play with that, those operations might not work very well.
I would suggest keeping both the Spring Batch tables and your business-specific tables inside the same data source.
That way you need only one data source and one transaction manager, and you do not have to worry about the transaction issues you might otherwise face.
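To make this concrete, here is a minimal sketch of a chunk-oriented step backed by a single DataSource (shared by the Spring Batch metadata tables and the business tables) and a SimpleAsyncTaskExecutor. SourceRecord, TargetRecord, the SQL and the bean names are illustrative assumptions, not taken from the question.

import javax.sql.DataSource;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.SimpleAsyncTaskExecutor;
import org.springframework.jdbc.core.BeanPropertyRowMapper;

@Configuration
@EnableBatchProcessing
public class RecordJobConfig {

    @Autowired private JobBuilderFactory jobs;
    @Autowired private StepBuilderFactory steps;
    @Autowired private DataSource dataSource; // one data source for both metadata and business tables

    @Bean
    public SynchronizedItemStreamReader<SourceRecord> reader() {
        JdbcCursorItemReader<SourceRecord> delegate = new JdbcCursorItemReader<>();
        delegate.setName("sourceRecordReader");
        delegate.setDataSource(dataSource);
        delegate.setSql("SELECT id, payload FROM source_table");
        delegate.setRowMapper(new BeanPropertyRowMapper<>(SourceRecord.class));
        // cursor readers are not thread-safe, so wrap the reader for the multi-threaded step
        SynchronizedItemStreamReader<SourceRecord> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(delegate);
        return reader;
    }

    @Bean
    public ItemProcessor<SourceRecord, TargetRecord> processor() {
        // business validation; returning null filters the item out of the chunk
        return record -> record.getPayload() != null
                ? new TargetRecord(record.getId(), record.getPayload())
                : null;
    }

    @Bean
    public JdbcBatchItemWriter<TargetRecord> writer() {
        JdbcBatchItemWriter<TargetRecord> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource);
        writer.setSql("INSERT INTO target_table (id, payload) VALUES (:id, :payload)");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
        return writer;
    }

    @Bean
    public Step validateAndCopyStep() {
        return steps.get("validateAndCopyStep")
                .<SourceRecord, TargetRecord>chunk(1000)              // commit interval: one transaction per chunk
                .reader(reader())
                .processor(processor())
                .writer(writer())
                .taskExecutor(new SimpleAsyncTaskExecutor("batch-"))  // multi-threaded step
                .throttleLimit(10)                                    // roughly 10 chunks processed concurrently
                .build();
    }

    @Bean
    public Job validateAndCopyJob() {
        return jobs.get("validateAndCopyJob").start(validateAndCopyStep()).build();
    }
}

Regarding the connection/transaction question: each chunk runs in its own transaction obtained from the single transaction manager, so connections are borrowed from the pool per chunk rather than held open for the whole job. With the task executor and throttle limit, several chunks run concurrently, each in its own transaction. Note that a multi-threaded step gives up reliable restartability, because the reader's position cannot be saved safely across threads.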
Related
I have a scenario where data is uploaded from an Excel sheet to a MySQL DB. I am using Spring Data JPA. The service recursively saves the entities after stuffing them with data taken from the Excel sheet. This produces "unable to acquire JDBC connections" after a certain load.
I tried @Transactional, to no advantage. Now I am thinking of using the EntityManager manually in code and controlling the transaction boundary, so that all the recursive save calls of the entities happen within one transaction and thereby one connection object. I just wanted to check whether that would be a good idea, or whether there is another, more performant approach I should take. Needless to say, I have to do it through entities.
My answer is based entirely on the assumption that the way the requirement is implemented is faulty, since no code is shared in the question.
With your approach, yes, you will run out of connections: populating the entities will surely be much faster than persisting them in the database, and since you are doing it recursively, your application will run out of connections at some point if the amount of data is high. The numbers are certainly a factor here.
The approach I would prefer is to prepare your entities (assuming all the data is for a common entity class) and store them in a collection; once it is ready, you can persist all of them in one transaction using the saveAll() method.
If the data is not for a common entity, you can create multiple lists of different entities and initiate the DB operations after processing the Excel sheet.
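A minimal sketch of that approach, assuming a Spring Data JPA repository; the entity, repository and row-mapping names are illustrative, not from the question.

import java.util.ArrayList;
import java.util.List;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ExcelImportService {

    private final ImportedRecordRepository repository; // hypothetical JpaRepository<ImportedRecord, Long>

    public ExcelImportService(ImportedRecordRepository repository) {
        this.repository = repository;
    }

    @Transactional // one transaction, and therefore one connection, for the whole import
    public void importRows(List<ExcelRow> rows) {
        List<ImportedRecord> entities = new ArrayList<>();
        for (ExcelRow row : rows) {
            entities.add(toEntity(row)); // build all the entities in memory first
        }
        repository.saveAll(entities);    // persist everything in a single transaction
    }

    private ImportedRecord toEntity(ExcelRow row) {
        ImportedRecord record = new ImportedRecord();
        record.setValue(row.getValue()); // hypothetical mapping from the Excel row
        return record;
    }
}

If the volume is very large, it also helps to enable JDBC batching on the JPA provider (for Hibernate, the hibernate.jdbc.batch_size property) so that saveAll() is sent in batches rather than one INSERT per row.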
I have a DynamoDB table with 1000+ rows. I need to write a Spring Boot app that reads the table rows one by one and makes a REST call to another service that accepts one JSON at a time. Looping through the table row by row does not seem to be an optimal solution. Can this be achieved with multi-threading, and if so, how? Or is there a better option for this? Can someone help?
You can read N records at a time, for example 50. You can use more than one thread to read records from the database, since you are only reading and not writing. Once you have read the records, you can create a pool of threads with an ExecutorService to call the external service; each thread takes the data of one record and calls the external service with it.
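A minimal sketch of that idea; the row type and the REST-call helper are hypothetical placeholders, not part of the original question.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class TableToRestForwarder {

    private final ExecutorService pool = Executors.newFixedThreadPool(10); // e.g. 10 worker threads

    // Forward one batch of records (e.g. 50 read from the table) to the external service.
    public void forwardBatch(List<TableRow> batch) throws Exception {
        List<Future<?>> futures = new ArrayList<>();
        for (TableRow row : batch) {
            futures.add(pool.submit(() -> callExternalService(row))); // one record per task
        }
        for (Future<?> f : futures) {
            f.get(); // wait for the whole batch and surface any failure
        }
    }

    private void callExternalService(TableRow row) {
        // hypothetical REST call posting the row as JSON (omitted)
    }

    public void shutdown() {
        pool.shutdown();
    }
}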
You could use @Async. It's one of the easiest ways of executing multiple threads in Spring Boot.
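For reference, a minimal sketch of the @Async route, assuming @EnableAsync is declared on a configuration class; the row type and REST-call helper are hypothetical.

import java.util.concurrent.CompletableFuture;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

@Service
public class RowForwardingService {

    @Async // each call runs on a task-executor thread instead of the caller's thread
    public CompletableFuture<Void> forward(TableRow row) {
        callExternalService(row);                       // hypothetical REST call for one row
        return CompletableFuture.completedFuture(null); // lets the caller wait on completion if needed
    }

    private void callExternalService(TableRow row) {
        // e.g. a RestTemplate/WebClient POST of the row as JSON (omitted)
    }
}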
First of all, why do you have to read the DB table one row at a time? You can have just one SQL statement that gets all the rows you need (in your repository layer, if it's a Spring Boot app) and then use multi-threading as described by others (@Async, ExecutorService, etc.) in your service classes.
I am developing a dictionary application and using many external sources to collect the data.
This data is collected from those sources only the first time; after that I persist it to my DB and fetch it from there.
The problem I am facing is that some words, like set, cut, put etc., have hundreds of meanings and many examples as well. It takes around 10 seconds to persist all this data to MySQL. I am using MyBatis to persist the data. Because of this, the response time is getting screwed up. Without the database persist, I get a response in 400-500 ms if I show the data directly after fetching it from the sources.
I am trying to find a way to persist the data in the background. I am using the MVC pattern, so the DAO layer is separate.
Is it a good idea to use threading in the DAO layer as a solution? Or should I use a messaging tool like Kafka to send a message to persist the given word in the background? What else can I do?
Note: I prefer MySQL as the DB right now, and will probably use Redis for caching later on.
My overall answer to the question, plus further comments:
Do not bulk insert with the MyBatis foreach element. Instead, execute the statement in a Java iteration over the list of objects to store, using ExecutorType.REUSE or ExecutorType.BATCH (read the documentation); see the sketch below.
For transactions, configure the environment in the main mybatis-config.xml:
transactionManager type JDBC to manage the transaction in code: session = sessionFactory.openSession(); session.commit(); session.rollback();
transactionManager type MANAGED to let the container manage it.
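A minimal sketch of the batch insert with a JDBC-managed transaction; the Meaning entity and MeaningMapper with its insert statement are illustrative names, not from the question.

import java.util.List;
import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class MeaningDao {

    private final SqlSessionFactory sessionFactory;

    public MeaningDao(SqlSessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void insertMeanings(List<Meaning> meanings) {
        // BATCH executor + autoCommit=false: statements are queued and sent in batches
        try (SqlSession session = sessionFactory.openSession(ExecutorType.BATCH, false)) {
            MeaningMapper mapper = session.getMapper(MeaningMapper.class);
            int count = 0;
            for (Meaning meaning : meanings) {
                mapper.insert(meaning);        // queued, not yet sent to MySQL
                if (++count % 500 == 0) {
                    session.flushStatements(); // push the queued batch to the database
                }
            }
            session.commit();                  // flushes the remainder and commits (transactionManager type JDBC)
        }                                      // closing without commit() discards the uncommitted batch
    }
}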
Furthermore, you can let the web app send the response, while a new thread takes its time to store the data.
Let me explain my question in more detail, since from the title it may not be very clear, but I didn't find a way to summarize the problem in a few words. Basically I have a web application whose DB has 5 tables. Three of these are managed using JPA with the EclipseLink implementation. The other two tables are managed directly with SQL using the java.sql package. When I say "managed" I just mean that queries, insertions, deletions and updates are performed in two different ways.
Now the problem is that I have to monitor the response time of each call to the DB. To do this I have a library that uses aspects, so at runtime I can monitor the execution time of any code snippet. The question is: if I want to monitor the response time of a DB request (suppose the DB is remote, so the response time will also include network latency, but that is fine), which instructions, in the two distinct cases described above, should have their execution time measured?
Here is an example to make this clearer.
Take the case of using JPA to execute a DB update. I have the following code:
EntityManagerFactory emf = Persistence.createEntityManagerFactory(persistenceUnit);
EntityManager em=emf.createEntityManager();
EntityToPersist e=new EntityToPersist();
em.persist(e);
Is it correct to assume that only the em.persist(e) instruction connects and makes a request to the DB?
The same question applies when using java.sql:
Connection c=dataSource.getConnection();
Statement statement = c.createStatement();
statement.executeUpdate(stm);
statement.close();
c.close();
In this case, is it correct to assume that only statement.executeUpdate(stm) connects and makes a request to the DB?
In case it is useful to know, the remote DBMS is MySQL.
I tried to search the web, but it is quite a specific problem, and I'm not sure what to look for to find a solution without reading the full JPA or java.sql specification.
If you have any questions, or if something is not clear from my description, please don't hesitate to ask.
Thank you a lot in advance.
In JPA (and therefore also in EclipseLink) you have to differentiate between SELECT queries (which do not need any transaction) and queries that change data (DELETE, INSERT, UPDATE: all of these need a transaction). When you select data, it is enough to measure the time of Query.getResultList() (and similar calls). For the other operations (EntityManager.persist(), merge() or remove()) there is a flushing mechanism, which basically forces the queue of queries (or a single query) out of the cache and onto the database. The question is when the EntityManager is flushed: usually on transaction commit, or when you call EntityManager.flush() yourself. That raises yet another question: when does the transaction commit? The answer is: it depends on your connection setup (whether autocommit is true or not), but a very sound setup is autocommit=false, with the transactions begun and committed in your code.
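To show where the measurement would sit in the JPA case, here is a minimal sketch with explicit resource-local transaction boundaries, reusing the names from the question; the timing code is only indicative.

EntityManagerFactory emf = Persistence.createEntityManagerFactory(persistenceUnit);
EntityManager em = emf.createEntityManager();
EntityToPersist e = new EntityToPersist();

em.getTransaction().begin();
em.persist(e);                         // usually only registers the entity in the persistence context

long start = System.nanoTime();
em.getTransaction().commit();          // the flush happens here: the INSERT actually hits the database
long elapsedMs = (System.nanoTime() - start) / 1_000_000;

em.close();

If you call em.flush() yourself before the commit, the SQL is sent at that moment instead, so that would be the call to time.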
When working with statement.executeUpdate(stm) it is enough to measure only such calls.
PS: usually you do not connect directly to any database, as that is done by a pool (even if you work with a DataSource), which simply gives you an already established connection, but that again depends on your setup.
PS2: for EclipseLink probably the most correct way would be to take a look in the source code in order to find when the internal flush is made and to measure that part.
I'm currently working on a Java project in which I need to prepare a big (to me) MySQL database. I have to do web scraping using Jsoup and store the results in my database as well. As I estimate it, I will have roughly 1,500,000 to 2,000,000 records to insert. In my first trial, I just used a loop to insert these records, and it took me one week to insert about a third of the required records, which I think is too slow. Is it possible to make this process multi-threaded, so that I can split my records into 3 sets, say 500,000 records per set, and then insert them into one database (one table, specifically)?
Multi-threading isn't going to help you here. You'll just move the contention bottleneck from your app server to the database.
Instead, try using batch inserts; they generally make this sort of thing orders of magnitude faster. See "3.4 Making Batch Updates" in the JDBC tutorial.
Edit: As @Jon commented, you need to decouple the fetching of the web pages from their insertion into the database, otherwise the whole process will go at the speed of the slowest operation. You could have multiple threads fetching web pages and adding the data to a queue, and then a single thread draining the queue into the database using batch inserts.
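A minimal sketch of the batch insert; the table, columns and scraped-record type are hypothetical, not from the question.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;

public class ScrapedPageWriter {

    private final DataSource dataSource;

    public ScrapedPageWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Drain a batch of scraped records into one table using addBatch()/executeBatch().
    public void insertAll(List<ScrapedPage> pages) throws SQLException {
        String sql = "INSERT INTO scraped_page (url, title) VALUES (?, ?)";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            conn.setAutoCommit(false);
            int count = 0;
            for (ScrapedPage page : pages) {
                ps.setString(1, page.getUrl());
                ps.setString(2, page.getTitle());
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();   // send 1000 rows in one round trip
                }
            }
            ps.executeBatch();           // flush the remainder
            conn.commit();
        }
    }
}

With MySQL Connector/J, adding rewriteBatchedStatements=true to the JDBC URL turns these batches into multi-row INSERTs, which usually makes the biggest difference.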
Just make sure two (or more) threads don't use the same connection at the same time; using a connection pool resolves that. C3P0 and Apache DBCP come to mind...
You can insert these records in different threads provided they use different primary key values.
You should also look at Spring Batch which I believe will be useful in your case.
You can chunk your record set into batches and do this, but perhaps you should think about other factors as well.
Are you doing a network round trip for each INSERT? If yes, latency could be the real enemy. Try batching those requests to cut down on network traffic.
Do you have transactions turned on? If yes, the size of the rollback log could be the problem.
I'd recommend profiling the app server and the database server to see where the time is being spent. You can waste a lot of time guessing about the root cause.
I think a multi-threaded approach is useful for your issue, but you have to use a connection pool such as C3P0 or the Tomcat 7 connection pool for better performance.
Another solution is to use a batch-operation provider such as Spring Batch; other utilities for batch operations exist as well.
Another solution is to use a PL/SQL procedure with an input structure parameter.