What is the most efficient way to persist thousands of entities? - java

I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?
public void importCsv(MultipartFile csvFile, Class<T> targetClass) {
    final var headerMapping = getHeaderMapping(targetClass);
    File tempFile = null;
    try {
        final var randomUuid = UUID.randomUUID().toString();
        tempFile = File.createTempFile("data-" + randomUuid, ".csv");
        csvFile.transferTo(tempFile);
        final var csvFileName = csvFile.getOriginalFilename();
        final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));
        Stopwatch stopWatch = Stopwatch.createStarted();
        log.info("Starting to import {}", csvFileName);
        final var csvRecords = CSVFormat.DEFAULT
                .withDelimiter(';')
                .withHeader(headerMapping.keySet().toArray(String[]::new))
                .withSkipHeaderRecord(true)
                .parse(csvReader);
        final var models = StreamSupport.stream(csvRecords.spliterator(), true)
                .map(record -> parseRecord(record, headerMapping, targetClass))
                .collect(Collectors.toUnmodifiableList());
        // How to save such a large list?
        log.info("Finished import of {} in {}", csvFileName, stopWatch);
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        if (tempFile != null) {
            tempFile.delete();
        }
    }
}
models contains a lot of records. The parsing into records is done using a parallel stream, so it's quite fast. I'm afraid to call SimpleJpaRepository.saveAll, because I'm not sure what it will do under the hood.
The question is: What is the most efficient way to persist such a large list of entities?
P.S.: Any other improvements are greatly appreciated.

You have to use batch inserts.
Create an interface for a custom repository SomeRepositoryCustom
public interface SomeRepositoryCustom {
    void batchSave(List<Record> records);
}
Create an implementation of SomeRepositoryCustom
@Repository
class SomesRepositoryCustomImpl implements SomeRepositoryCustom {

    private final JdbcTemplate template;

    @Autowired
    public SomesRepositoryCustomImpl(JdbcTemplate template) {
        this.template = template;
    }

    @Override
    public void batchSave(List<Record> records) {
        final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
        template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
            for (Record record : records) {
                ps.setString(1, record.getA());
                ps.setString(2, record.getB());
                ps.addBatch();
            }
            ps.executeBatch();
            return null;
        });
    }
}
Extend your JpaRepository with SomeRepositoryCustom
@Repository
public interface SomeRepository extends JpaRepository<Record, Long>, SomeRepositoryCustom { // entity and id types assumed
}
Then save with:
someRepository.batchSave(records);
Notes
Keep in mind that even if you use batch inserts, the database driver may not actually apply them. For example, with MySQL it is necessary to add the parameter rewriteBatchedStatements=true to the database URL.
So it is better to enable the driver's SQL logging (not Hibernate's) to verify what is really being sent. Debugging the driver code can also be useful.
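For illustration, a sketch of where that flag goes when building the DataSource by hand (the host, schema, and credentials are placeholders; the PostgreSQL driver has a similar reWriteBatchedInserts=true option):
// Hypothetical MySQL DataSource showing the rewriteBatchedStatements flag on the URL.
DriverManagerDataSource dataSource = new DriverManagerDataSource();
dataSource.setUrl("jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true");
dataSource.setUsername("user");
dataSource.setPassword("secret");
JdbcTemplate template = new JdbcTemplate(dataSource);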
You will also need to decide whether to split records into smaller packets inside the loop
for (Record record : records) {
}
Some drivers can do this for you, so you may not need it, but it is better to verify this in the debugger too.
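A minimal sketch of flushing the batch every N rows, reusing batchSave from above (the batch size of 1 000 is an arbitrary assumption; tune it for your driver and table):
@Override
public void batchSave(List<Record> records) {
    final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
    final int batchSize = 1_000; // assumed; adjust as needed
    template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
        int count = 0;
        for (Record record : records) {
            ps.setString(1, record.getA());
            ps.setString(2, record.getB());
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch(); // send the current packet to the server
            }
        }
        ps.executeBatch(); // send the remainder
        return null;
    });
}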
P. S. Don't use var everywhere.

Related

Slow service Hibernate

My project requires a service that repeatedly migrates data between two tables in two different databases. I have implemented this via Hibernate: a service fetches data from the primary database table and then migrates it to the second database table. However, the primary table has over 200,000 rows and the iteration takes quite a long time to complete.
What should I use to speed up the process?
Here is the Service code:
@Service
public class TrmInCardClientService {

    @Autowired
    TrmInCardClientDBRepository trmInCardClientDBRepository;

    @Autowired
    TrmInCardClientUKMRepository trmInCardClientUKMRepository;

    private Logger log = Logger.getLogger(TrmInCardClientService.class.getName());

    public TrmInCardClientService() {
    }

    @Scheduled(fixedRate = 300000)
    public void updatelist() {
        this.log.info("TrmInCardClient data transfer start");
        try {
            Iterable<TrmInCardClientUKM> trmInCardClientUKMS = this.trmInCardClientUKMRepository.findByDeleted(0);
            List<TrmInCardClientUKM> trmInCardClientUKMList = new ArrayList<>();
            List<TrmInCardClientDB> trmInCardClientDBList = new ArrayList<>();
            for (TrmInCardClientUKM cardClientUKM : trmInCardClientUKMS) {
                trmInCardClientUKMList.add(cardClientUKM);
                trmInCardClientDBList.add(new TrmInCardClientDB(cardClientUKM.getCard(), cardClientUKM.getClient(),
                        cardClientUKM.getDeleted(), cardClientUKM.getGlobal_id(), cardClientUKM.getVersion()));
            }
            this.trmInCardClientDBRepository.saveAll(trmInCardClientDBList);
            this.log.info("TrmInCardClient data transfer end");
        } catch (Exception e) {
            this.log.warning("Error encountered during TrmInCardClient data migration");
        }
    }
}
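For reference, a minimal sketch of manual batching through the EntityManager, as opposed to my current saveAll call (the batch size of 50 is an assumption and would need to match hibernate.jdbc.batch_size in the configuration):
@PersistenceContext
private EntityManager entityManager;

@Transactional
public void batchInsert(List<TrmInCardClientDB> entities) {
    final int batchSize = 50; // assumed; should match hibernate.jdbc.batch_size
    for (int i = 0; i < entities.size(); i++) {
        entityManager.persist(entities.get(i));
        if (i > 0 && i % batchSize == 0) {
            entityManager.flush(); // push the current batch of INSERTs to the database
            entityManager.clear(); // detach entities to keep the persistence context small
        }
    }
    entityManager.flush();
    entityManager.clear();
}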

The use of threads and the Java Future interface in AWS Lambda

I want to create an AWS Lambda function in Java that writes to a Firestore database. The short story: the code does what it should when I run it on my own computer in NetBeans (most of the time, at least; the occasional failure may be due to my internet connection), but nothing at all happens when I deploy it as a Lambda function and invoke it. I suspect this has less to do with Firestore itself and more with how AWS Lambda handles asynchronous operations.
Now to the details!
As a simple example, the method that writes to the Firestore object db reads
public static void writeFirestore(Firestore db) {
    try {
        DateTime now = DateTime.now();
        String time = now.toString();
        Map<String, String> data = new HashMap<>();
        data.put("time", time);
        String collTitle = "Notebook";
        String docTitle = "Document: " + time;
        db.collection(collTitle).document(docTitle).set(data);
        System.out.println("wrote to Firestore");
    } catch (Exception e) {
        System.out.println("Could not write to db: " + e.toString());
    }
}
Now, as it takes some time to connect to Firestore and initialize db, I want to make sure that db is not passed as an argument into writeFirestore() before it has been properly retrieved. So I define a version of db in the form of a Future object, using an ExecutorService, and then retrieve db with the get() method. For this I define the class TaskRunner:
public class TaskRunner {

    ExecutorService executor;

    public TaskRunner() {
        executor = Executors.newSingleThreadExecutor();
    }

    public static interface Callback<T> {
        public void onCallback(T result);
    }

    public <T> void executeAsync(Callable<T> callable, Callback<T> callback) throws Exception {
        try {
            Future<T> future = executor.submit(callable);
            T result = future.get();
            if (result != null) {
                System.out.println("result is not null; applying callback...");
                callback.onCallback(result);
            } else {
                System.out.println("result is null");
            }
        } catch (Exception e) {
            System.out.println("Problem running executeAsync: " + e.toString());
        }
    }
}
Writing the example document to my fixed database db now goes as follows:
I define the class FirestoreCreator that implements Callable with the purpose of retrieving the Firestore object db:
public static class FirestoreCreator implements Callable<Firestore> {
    @Override
    public Firestore call() throws Exception {
        String projectId = "myProjectId";
        GoogleCredentials credentials =
                GoogleCredentials.fromStream(new FileInputStream("myCredentialsFile.json"));
        FirestoreOptions firestoreOptions = FirestoreOptions.getDefaultInstance()
                .toBuilder()
                .setProjectId(projectId)
                .setCredentials(credentials)
                .build();
        Firestore db = firestoreOptions.getService();
        return db;
    }
}
I implement the TaskRunner.Callback interface using writeFirestore().
I create a TaskRunner object, taskRunner, and call its executeAsync() method with the above two objects as parameters.
These three steps are collected in the final method testUpdateFirestoreInterface() that does the job:
public static void testUpdateFirestoreInterface() {
    FirestoreCreator fsCreator = new FirestoreCreator();
    TaskRunner.Callback<Firestore> updateCallback = new TaskRunner.Callback<Firestore>() {
        @Override
        public void onCallback(Firestore result) {
            writeFirestore(result);
        }
    };
    TaskRunner taskRunner = new TaskRunner();
    try {
        taskRunner.executeAsync(fsCreator, updateCallback);
    } catch (Exception ex) {
        System.out.println("Failed to run executeAsync");
    }
}
As I already mentioned in the introduction, the code works (most times) when I run it on my computer, but not at all in AWS Lambda. No exception is thrown, and yet no document has been written in Firestore.
The discussion about threads in AWS Lambda (https://dzone.com/articles/multi-threaded-programming-with-aws-lambda) made me suspect that the cause is a thread started via the ExecutorService not being handled properly once the handler returns.
Does anyone know what goes wrong and what a solution could look like?
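For comparison, a fully blocking variant of the write that waits on the Firestore ApiFuture before the handler returns (a sketch only; ApiFuture and WriteResult come from the Firestore client, and the assumption here is that nothing may still be running on a background thread when the Lambda handler returns):
public static void writeFirestoreBlocking(Firestore db) {
    try {
        String time = DateTime.now().toString();
        Map<String, String> data = new HashMap<>();
        data.put("time", time);
        // set() returns an ApiFuture<WriteResult>; get() blocks until the write is committed,
        // so the Lambda runtime cannot freeze the container with the write still in flight.
        ApiFuture<WriteResult> future = db.collection("Notebook")
                .document("Document: " + time)
                .set(data);
        WriteResult result = future.get();
        System.out.println("wrote to Firestore at " + result.getUpdateTime());
    } catch (Exception e) {
        System.out.println("Could not write to db: " + e);
    }
}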

Implementing Spring + Apache Flink project with Postgres

I have a Spring Boot Gradle project that uses Apache Flink to process datastream signals. When a new signal comes through the datastream, I would like to look up (i.e. findById()) its details by ID in a Postgres database table, which already exists, in order to get additional information about the signal and enrich the data. I would like to avoid using Spring dependencies for the lookup (i.e. autowiring a repository) and stick with a Flink implementation.
Where can I specify the Postgres connection configuration, such as port, database, URL, username, password, etc.? (For simplicity, assume the Postgres database is local on my machine.) Is it as simple as adding the configuration to the application.properties file? If so, how can I write the query method to look up a record in the Postgres table by a non-primary-key value?
Some online sources suggest the skeleton code below, but I am not sure how/if it fits my use case. (I have an EventEntity model created which contains all the params/columns from the table I'm looking up.)
like so
public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {
    // Declare DB connection & query statements

    public void open(Configuration parameters) throws Exception {
        // Initialize DB connection
        // Prepare query statements
    }

    @Override
    public void flatMap(String value, Collector<EventEntity> out) throws Exception {
    }
}
Your sample code is correct. You can put all of your PostgreSQL initialization and preparation code in the open() method, and then use the pre-configured fields in your flatMap() function.
Here is one sample for Redis operations.
I have used RichAsyncFunction here, and I suggest you do the same, as it is considered best practice. Read more here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/asyncio.html
You can pass configuration parameters through the constructor and use them during initialization:
public static class AsyncRedisOperations extends RichAsyncFunction<Object, Object> {

    private static final Logger logger = LoggerFactory.getLogger(AsyncRedisOperations.class);

    private transient JedisPool jedisPool;
    private final Configuration redisConf;

    public AsyncRedisOperations(Configuration redisConf) {
        this.redisConf = redisConf;
    }

    @Override
    public void open(Configuration parameters) {
        JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
        jedisPoolConfig.setMaxTotal(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxIdle(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxWaitMillis(this.redisConf.getInteger("maxWait", 0));
        JedisPool jedisPool = new JedisPool(jedisPoolConfig,
                this.redisConf.getString("host", "192.168.10.10"),
                this.redisConf.getInteger("port", 6379), 5000);
        try {
            this.jedisPool = jedisPool;
            logger.info("Redis connected: " + jedisPool.getResource().isConnected());
        } catch (Exception e) {
            logger.error("Exception while connecting Redis", e);
        }
    }

    @Override
    public void asyncInvoke(Object in, ResultFuture<Object> out) {
        try (Jedis jedis = this.jedisPool.getResource()) {
            String value = jedis.get(in.toString());
            logger.info("Redis value: " + value);
            // Complete the async result so the stream can continue.
            out.complete(Collections.singleton(value));
        }
    }
}
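For the PostgreSQL lookup itself, a minimal sketch along the same lines using plain JDBC inside the DatabaseMapper skeleton (the URL, credentials, table name, column names, and the EventEntity constructor are placeholders, not taken from your project):
public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Placeholder connection settings; in practice pass them in through the
        // constructor or a ParameterTool instead of hard-coding them here.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "secret");
        // Lookup on a non-primary-key column (signal_id is a placeholder name).
        lookup = connection.prepareStatement(
                "SELECT id, signal_id, description FROM event_entity WHERE signal_id = ?");
    }

    @Override
    public void flatMap(String value, Collector<EventEntity> out) throws Exception {
        lookup.setString(1, value);
        try (ResultSet rs = lookup.executeQuery()) {
            while (rs.next()) {
                // EventEntity constructor assumed; adapt to your model.
                out.collect(new EventEntity(
                        rs.getLong("id"), rs.getString("signal_id"), rs.getString("description")));
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
    }
}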

Efficient way to write asynchronously into cassandra using datastax java driver?

I am using datastax java driver 3.1.0 to connect to cassandra cluster and my cassandra cluster version is 2.0.10. I am writing asynchronously with QUORUM consistency.
public void save(final String process, final int clientid, final long deviceid) {
    String sql = "insert into storage (process, clientid, deviceid) values (?, ?, ?)";
    try {
        BoundStatement bs = CacheStatement.getInstance().getStatement(sql);
        bs.setConsistencyLevel(ConsistencyLevel.QUORUM);
        bs.setString(0, process);
        bs.setInt(1, clientid);
        bs.setLong(2, deviceid);
        ResultSetFuture future = session.executeAsync(bs);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                logger.logInfo("successfully written");
            }

            @Override
            public void onFailure(Throwable t) {
                logger.logError("error= ", t);
            }
        }, Executors.newFixedThreadPool(10));
    } catch (Exception ex) {
        logger.logError("error= ", ex);
    }
}
And below is my CacheStatement class:
public class CacheStatement {

    private static final Map<String, PreparedStatement> cache = new ConcurrentHashMap<>();

    private static class Holder {
        private static final CacheStatement INSTANCE = new CacheStatement();
    }

    public static CacheStatement getInstance() {
        return Holder.INSTANCE;
    }

    private CacheStatement() {}

    public BoundStatement getStatement(String cql) {
        Session session = CassUtils.getInstance().getSession();
        PreparedStatement ps = cache.get(cql);
        // No statement cached: prepare one and cache it now.
        if (ps == null) {
            synchronized (this) {
                ps = cache.get(cql);
                if (ps == null) {
                    ps = session.prepare(cql);
                    cache.put(cql, ps);
                }
            }
        }
        return ps.bind();
    }
}
My save method above will be called from multiple threads, and I think BoundStatement is not thread safe. (The CacheStatement class, by the way, is thread safe as shown above.)
Since BoundStatement is not thread safe, will there be any problem in my code if I write asynchronously from multiple threads?
Secondly, I am using Executors.newFixedThreadPool(10) as the addCallback parameter. Is this OK, or will it cause problems? Should I use MoreExecutors.directExecutor instead? What is the difference between the two, and what is the best approach here?
Below is my connection setting to connect to cassandra using datastax java driver:
Builder builder = Cluster.builder();
cluster =
builder
.addContactPoints(servers.toArray(new String[servers.size()]))
.withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
.withPoolingOptions(poolingOptions)
.withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
.withLoadBalancingPolicy(
DCAwareRoundRobinPolicy
.builder()
.withLocalDc(
!TestUtils.isProd() ? "DC2" : TestUtils.getCurrentLocation()
.get().name().toLowerCase()).withUsedHostsPerRemoteDc(3).build())
.withCredentials(username, password).build();
I think what you're doing is fine. You could optimize a bit further by preparing all the statements at application startup, so everything is already cached; that way you take no performance hit for preparing a statement while saving, and you don't lock anything in your workflow.
BoundStatement is not thread safe, but PreparedStatement is, and you return a new BoundStatement every time you call getStatement. Indeed, the .bind() method of PreparedStatement is effectively a shortcut for new BoundStatement(ps).bind(). Since you never access the same BoundStatement from multiple threads, your code is fine.
As for the thread pool: you are currently creating a new pool on every addCallback call, which is a waste of resources. I don't use this callback style myself (I prefer managing the plain ResultSetFuture), but I have seen examples in the DataStax documentation that use MoreExecutors.sameThreadExecutor() rather than MoreExecutors.directExecutor().
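For instance, a minimal sketch of the callback wiring with one shared executor instead of a new pool per call (directExecutor() runs the callback on the driver's I/O thread, which is fine as long as the callback stays as cheap as a log statement):
// Created once, e.g. as a field, rather than inside save().
private static final Executor CALLBACK_EXECUTOR = MoreExecutors.directExecutor();

ResultSetFuture future = session.executeAsync(bs);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    @Override
    public void onSuccess(ResultSet result) {
        logger.logInfo("successfully written");
    }

    @Override
    public void onFailure(Throwable t) {
        logger.logError("error= ", t);
    }
}, CALLBACK_EXECUTOR);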

How to read all records from a table (> 10 million records) and serve each record as a chunked response?

I am trying to fetch records from the database using Hibernate's scrollable results and, following this GitHub project, send each record as a chunked response.
Controller:
@Transactional(readOnly = true)
public Result fetchAll() {
    try {
        final Iterator<String> sourceIterator = Summary.fetchAll();
        response().setHeader("Content-disposition", "attachment; filename=Summary.csv");
        Source<String, ?> s = Source.from(() -> sourceIterator);
        return ok().chunked(s.via(Flow.of(String.class).map(i -> ByteString.fromString(i + "\n"))))
                .as(Http.MimeTypes.TEXT);
    } catch (Exception e) {
        return badRequest(e.getMessage());
    }
}
Service:
public static Iterator<String> fetchAll() {
    StatelessSession session = ((Session) JPA.em().getDelegate()).getSessionFactory().openStatelessSession();
    org.hibernate.Query query = session.createQuery("select l.id from Summary l")
            .setFetchSize(Integer.MIN_VALUE)
            .setCacheable(false)
            .setReadOnly(true);
    ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
    return new models.ScrollableResultIterator<>(results, String.class);
}
Iterator:
public class ScrollableResultIterator<T> implements Iterator<T> {

    private final ScrollableResults results;
    private final Class<T> type;

    public ScrollableResultIterator(ScrollableResults results, Class<T> type) {
        this.results = results;
        this.type = type;
    }

    @Override
    public boolean hasNext() {
        return results.next();
    }

    @Override
    public T next() {
        return type.cast(results.get(0));
    }
}
For testing purposes I have 1,007 records in my table; whenever I call this endpoint, it always returns only 503 records.
I set the Akka log level to DEBUG and tried again; it logs the following line 1,007 times:
2016-07-25 19:55:38 +0530 [DEBUG] from org.hibernate.loader.Loader in application-akka.actor.default-dispatcher-73 - Result row:
From the log I can confirm that it fetches everything, but I couldn't figure out where the remaining records are lost.
I ran the same query in my workbench, exported the result to a local file, and compared it with the file generated by the endpoint (LHS: records from the endpoint, RHS: the file exported from the workbench). The first row matches, the second and third don't, and after that only alternate records match until the end.
Please correct me if I am doing anything wrong, and tell me whether this is the right approach for generating CSV from large database tables.
For the sake of testing, I removed the CSV conversion logic from the snippet above.
// Controller code
// Prepare a chunked text stream
ExportAsChuncked eac = new ExportAsChuncked();
response().setHeader("Content-disposition", "attachment; filename=results.csv");
Chunks<String> chunks = new StringChunks() {
    // Called when the stream is ready
    public void onReady(Chunks.Out<String> out) {
        try {
            eac.exportData(scrollableIterator, out);
        } catch (XOException e) {
            Logger.error(ERROR_WHILE_DOWNLOADING_RESPONSE, e);
        }
        out.close();
    }
};
// Serves this stream with 200 OK
return ok(chunks);

// Export as chunk logic
class ExportAsChuncked {
    void exportData(Iterator<String> data, Chunks.Out<String> out) {
        while (data.hasNext()) {
            out.write(data.next());
        }
    }
}
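For completeness, a sketch of an idempotent hasNext() variant of the iterator above (an assumption on my part that the stream may call hasNext() more than once per element; ScrollableResults.next() moves the cursor on every call):
public class ScrollableResultIterator<T> implements Iterator<T> {

    private final ScrollableResults results;
    private final Class<T> type;
    private boolean advanced; // true while the cursor sits on a row that has not been returned yet
    private boolean hasMore;

    public ScrollableResultIterator(ScrollableResults results, Class<T> type) {
        this.results = results;
        this.type = type;
    }

    @Override
    public boolean hasNext() {
        // Advance the cursor at most once per element, however often hasNext() is called.
        if (!advanced) {
            hasMore = results.next();
            advanced = true;
        }
        return hasMore;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        advanced = false; // the next hasNext() call will advance the cursor again
        return type.cast(results.get(0));
    }
}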
