Efficient way to write asynchronously into Cassandra using the DataStax Java driver?

I am using DataStax Java driver 3.1.0 to connect to a Cassandra cluster, and my Cassandra cluster version is 2.0.10. I am writing asynchronously with QUORUM consistency.
public void save(final String process, final int clientid, final long deviceid) {
    String sql = "insert into storage (process, clientid, deviceid) values (?, ?, ?)";
    try {
        BoundStatement bs = CacheStatement.getInstance().getStatement(sql);
        bs.setConsistencyLevel(ConsistencyLevel.QUORUM);
        bs.setString(0, process);
        bs.setInt(1, clientid);
        bs.setLong(2, deviceid);
        ResultSetFuture future = session.executeAsync(bs);
        Futures.addCallback(future, new FutureCallback<ResultSet>() {
            @Override
            public void onSuccess(ResultSet result) {
                logger.logInfo("successfully written");
            }
            @Override
            public void onFailure(Throwable t) {
                logger.logError("error= ", t);
            }
        }, Executors.newFixedThreadPool(10));
    } catch (Exception ex) {
        logger.logError("error= ", ex);
    }
}
And below is my CacheStatement class:
public class CacheStatement {
    private static final Map<String, PreparedStatement> cache =
        new ConcurrentHashMap<>();

    private static class Holder {
        private static final CacheStatement INSTANCE = new CacheStatement();
    }

    public static CacheStatement getInstance() {
        return Holder.INSTANCE;
    }

    private CacheStatement() {}

    public BoundStatement getStatement(String cql) {
        Session session = CassUtils.getInstance().getSession();
        PreparedStatement ps = cache.get(cql);
        // no statement cached, create one and cache it now.
        if (ps == null) {
            synchronized (this) {
                ps = cache.get(cql);
                if (ps == null) {
                    ps = session.prepare(cql);
                    cache.put(cql, ps);
                }
            }
        }
        return ps.bind();
    }
}
My save method above will be called from multiple threads, and I believe BoundStatement is not thread safe. (The CacheStatement class itself is thread safe, as shown above.)
Since BoundStatement is not thread safe, will there be any problem with the code above when I write asynchronously from multiple threads?
Secondly, I am passing Executors.newFixedThreadPool(10) as the executor parameter to addCallback. Is this OK, or should I use MoreExecutors.directExecutor() instead? What is the difference between the two, and what is the best approach here?
Below is my connection setting to connect to cassandra using datastax java driver:
Builder builder = Cluster.builder();
cluster = builder
    .addContactPoints(servers.toArray(new String[servers.size()]))
    .withRetryPolicy(new LoggingRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE))
    .withPoolingOptions(poolingOptions)
    .withReconnectionPolicy(new ConstantReconnectionPolicy(100L))
    .withLoadBalancingPolicy(
        DCAwareRoundRobinPolicy.builder()
            .withLocalDc(!TestUtils.isProd()
                ? "DC2"
                : TestUtils.getCurrentLocation().get().name().toLowerCase())
            .withUsedHostsPerRemoteDc(3)
            .build())
    .withCredentials(username, password)
    .build();

I think what you're doing is fine. You could optimize a bit further by preparing all the statements at application startup, so everything is already cached: you avoid the performance hit of preparing statements during "save", and you never take the lock in your write path.
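A minimal sketch of that startup warm-up, assuming a hypothetical StatementWarmUp class invoked once during application initialization (the class name and CQL constant are illustrative, not part of your code):

public final class StatementWarmUp {
    // Illustrative constant; list every CQL string your application uses.
    static final String WRITE_STORAGE =
        "insert into storage (process, clientid, deviceid) values (?, ?, ?)";

    public static void warmUp() {
        CacheStatement cache = CacheStatement.getInstance();
        // Each call populates the cache as a side effect;
        // the BoundStatement returned here is simply discarded.
        cache.getStatement(WRITE_STORAGE);
        // ... repeat for every other statement the application uses.
    }
}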
BoundStatement is not thread safe, but PreparedStatement is, and you are returning a new BoundStatement every time you call getStatement. Indeed, the bind() method of PreparedStatement is essentially a shortcut for new BoundStatement(ps).bind(). Since you never access the same BoundStatement from multiple threads, your code is fine.
For the thread pool, however, you are creating a new pool on every addCallback call, which is a waste of resources. I don't use this callback style myself and prefer managing the plain ResultSetFuture, but I have seen examples in the DataStax documentation that use MoreExecutors.sameThreadExecutor(), which has since been deprecated in favor of MoreExecutors.directExecutor().
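A sketch of the executor fix, if you keep the callback style: create one executor up front and reuse it for every save() call (the callbackExecutor field name is illustrative), or pass MoreExecutors.directExecutor() when the callbacks are as cheap as the logging above:

// One executor shared by every save() call, instead of a new pool per callback.
// (Field name is illustrative; size the pool for your workload.)
private static final ExecutorService callbackExecutor = Executors.newFixedThreadPool(10);

ResultSetFuture future = session.executeAsync(bs);
Futures.addCallback(future, new FutureCallback<ResultSet>() {
    @Override
    public void onSuccess(ResultSet result) {
        logger.logInfo("successfully written");
    }
    @Override
    public void onFailure(Throwable t) {
        logger.logError("error= ", t);
    }
}, callbackExecutor); // or MoreExecutors.directExecutor() for trivial callbacks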

Related

What is the most efficient way to persist thousands of entities?

I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?
public void importCsv(MultipartFile csvFile, Class<T> targetClass) {
    final var headerMapping = getHeaderMapping(targetClass);
    File tempFile = null;
    try {
        final var randomUuid = UUID.randomUUID().toString();
        tempFile = File.createTempFile("data-" + randomUuid, ".csv");
        csvFile.transferTo(tempFile);
        final var csvFileName = csvFile.getOriginalFilename();
        final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));
        Stopwatch stopWatch = Stopwatch.createStarted();
        log.info("Starting to import {}", csvFileName);
        final var csvRecords = CSVFormat.DEFAULT
                .withDelimiter(';')
                .withHeader(headerMapping.keySet().toArray(String[]::new))
                .withSkipHeaderRecord(true)
                .parse(csvReader);
        final var models = StreamSupport.stream(csvRecords.spliterator(), true)
                .map(record -> parseRecord(record, headerMapping, targetClass))
                .collect(Collectors.toUnmodifiableList());
        // How to save such a large list?
        log.info("Finished import of {} in {}", csvFileName, stopWatch);
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        if (tempFile != null) {
            tempFile.delete();
        }
    }
}
models contains a lot of records. The parsing into records is done with a parallel stream, so it's quite fast. I'm reluctant to call SimpleJpaRepository.saveAll, because I'm not sure what it will do under the hood.
The question is: What is the most efficient way to persist such a large list of entities?
P.S.: Any other improvements are greatly appreciated.
You have to use batch inserts.
Create an interface for a custom repository SomeRepositoryCustom
public interface SomeRepositoryCustom {
    void batchSave(List<Record> records);
}
Create an implementation of SomeRepositoryCustom
@Repository
class SomesRepositoryCustomImpl implements SomeRepositoryCustom {

    private JdbcTemplate template;

    @Autowired
    public SomesRepositoryCustomImpl(JdbcTemplate template) {
        this.template = template;
    }

    @Override
    public void batchSave(List<Record> records) {
        final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
        template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
            for (Record record : records) {
                ps.setString(1, record.getA());
                ps.setString(2, record.getB());
                ps.addBatch();
            }
            ps.executeBatch();
            return null;
        });
    }
}
Extend your JpaRepository with SomeRepositoryCustom
@Repository
public interface SomeRepository extends JpaRepository<Record, Long>, SomeRepositoryCustom {
    // ID type Long assumed here; use your entity's actual ID type.
}
Then save with:
someRepository.batchSave(records);
Notes
Keep in mind that even if you use batch inserts, the database driver may not actually apply them. For example, for MySQL it is necessary to add the parameter rewriteBatchedStatements=true to the database URL.
So it is better to enable the driver's SQL logging (not Hibernate's) to verify everything. It can also be useful to debug the driver code.
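For example, a MySQL JDBC URL with that flag enabled might look like this (host and schema names are illustrative):

jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true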
You will also need to decide whether to split records into fixed-size packets inside the loop over records (a sketch follows). A driver can do this for you, in which case you will not need it, but it is better to verify this behaviour too.
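A minimal sketch of manual chunking inside batchSave, with an illustrative packet size of 1000; executeBatch() flushes one packet at a time:

@Override
public void batchSave(List<Record> records) {
    final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
    final int batchSize = 1000; // illustrative; tune for your driver and database
    template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
        int count = 0;
        for (Record record : records) {
            ps.setString(1, record.getA());
            ps.setString(2, record.getB());
            ps.addBatch();
            if (++count % batchSize == 0) {
                ps.executeBatch(); // flush a full packet
            }
        }
        ps.executeBatch(); // flush the remainder
        return null;
    });
}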
P.S.: Don't use var everywhere.

The use of threads and the Java Future interface in AWS Lambda

I want to create an AWS Lambda function in Java that writes to a database in Firestore. The short story is that, while the code does what it should when I execute it on my own computer using NetBeans (in truth, it works most of the time, but not always, perhaps due to problems with my internet connection), nothing at all happens when I deploy it as a Lambda function and invoke it. I suspect that this has less to do with Firestore itself than with how AWS Lambda handles asynchronous operations.
Now to the details!
As a simple example, the method that writes to the Firestore object db reads
public static void writeFirestore(Firestore db) {
    try {
        DateTime now = DateTime.now();
        String time = now.toString();
        Map<String, String> data = new HashMap<>();
        data.put("time", time);
        String collTitle = "Notebook";
        String docTitle = "Document: " + time;
        db.collection(collTitle).document(docTitle).set(data);
        System.out.println("wrote to Firestore");
    } catch (Exception e) {
        System.out.println("Could not write to db: " + e.toString());
    }
}
Now, as it takes some time to connect to Firestore and initialize db, I want to make sure that db is not passed as an argument into writeFirestore() before it has been properly retrieved. So I wrap the retrieval of db in a Future object, using an ExecutorService, and then obtain db with its get() method. For this, I define the class TaskRunner:
public class TaskRunner {
    ExecutorService executor;

    public TaskRunner() {
        executor = Executors.newSingleThreadExecutor();
    }

    public static interface Callback<T> {
        public void onCallback(T result);
    }

    public <T> void executeAsync(Callable<T> callable, Callback<T> callback) throws Exception {
        try {
            Future<T> future = executor.submit(callable);
            T result = future.get();
            if (result != null) {
                System.out.println("result is not null; applying callback...");
                callback.onCallback(result);
            } else {
                System.out.println("result is null");
            }
        } catch (Exception e) {
            System.out.println("Problem running executeAsync: " + e.toString());
        }
    }
}
Writing the example document to my fixed database db now goes as follows:
I define the class FirestoreCreator that implements Callable with the purpose of retrieving the Firestore object db:
public static class FirestoreCreator implements Callable<Firestore> {
    @Override
    public Firestore call() throws Exception {
        String projectId = "myProjectId";
        GoogleCredentials credentials =
                GoogleCredentials.fromStream(new FileInputStream("myCredentialsFile.json"));
        FirestoreOptions firestoreOptions = FirestoreOptions.getDefaultInstance()
                .toBuilder()
                .setProjectId(projectId)
                .setCredentials(credentials)
                .build();
        Firestore db = firestoreOptions.getService();
        return db;
    }
}
I implement the TaskRunner.Callback interface using writeFirestore().
I create a TaskRunner object, taskRunner, and call its executeAsync() method with the above two objects as parameters.
These three steps are collected in the method testUpdateFirestoreInterface(), which does the job:
public static void testUpdateFirestoreInterface() {
    FirestoreCreator fsCreator = new FirestoreCreator();
    TaskRunner.Callback<Firestore> updateCallback = new TaskRunner.Callback<Firestore>() {
        @Override
        public void onCallback(Firestore result) {
            writeFirestore(result);
        }
    };
    TaskRunner taskRunner = new TaskRunner();
    try {
        taskRunner.executeAsync(fsCreator, updateCallback);
    } catch (Exception ex) {
        System.out.println("Failed to run executeAsync");
    }
}
As I already mentioned in the introduction, the code works (most of the time) when I run it on my computer, but not at all in AWS Lambda. No exception is thrown, and yet no document is written to Firestore.
The discussion about threads in AWS Lambda (https://dzone.com/articles/multi-threaded-programming-with-aws-lambda) made me suspect that the reason is that a thread spawned via the ExecutorService is not being handled properly by Lambda.
Does anyone know what goes wrong and what a solution could look like?
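One detail that may be relevant here: set() returns an ApiFuture<WriteResult> that writeFirestore() never waits on, so the handler can return (and Lambda can freeze the execution environment) before the write completes. A minimal sketch of blocking on the write, assuming the Firestore Java client's ApiFuture API:

// Sketch: wait for the write to be acknowledged before the handler returns.
// ApiFuture and WriteResult come from the Google API / Firestore client libraries.
ApiFuture<WriteResult> future = db.collection(collTitle).document(docTitle).set(data);
WriteResult result = future.get(); // blocks until the write completes
System.out.println("wrote to Firestore at " + result.getUpdateTime());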

Cache PreparedStatement in a thread-safe way?

I am caching prepared statements so that I don't have to prepare them again while working with the DataStax Java driver (Cassandra). Below is my code, and it works:
private static final ConcurrentHashMap<String, PreparedStatement> cache = new ConcurrentHashMap<>();

public ResultSetFuture send(final String cql, final Object... values) {
    return executeWithSession(new SessionCallable<ResultSetFuture>() {
        @Override
        public ResultSetFuture executeWithSession(Session session) {
            BoundStatement bs = getStatement(cql, values);
            bs.setConsistencyLevel(consistencyLevel);
            return session.executeAsync(bs);
        }
    });
}
private BoundStatement getStatement(final String cql, final Object... values) {
    Session session = getSession();
    PreparedStatement ps = cache.get(cql);
    // no statement cached, create one and cache it now.
    // below line is causing thread safety issue..
    if (ps == null) {
        ps = session.prepare(cql);
        PreparedStatement old = cache.putIfAbsent(cql, ps);
        if (old != null)
            ps = old;
    }
    return ps.bind(values);
}
But the problem is that send will be called by multiple threads, so I suspect my getStatement method is not thread safe because of the if (ps == null) check. How can I make it thread safe?
I wanted to avoid the synchronized keyword, so I'd like to know whether there is a better way. I am working with Java 7 as of now.
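For reference, on Java 8+ ConcurrentHashMap.computeIfAbsent collapses the check-then-prepare into one atomic step with no explicit null check or synchronized block; a minimal sketch (not available on Java 7):

// Java 8+ sketch: computeIfAbsent guarantees prepare() runs at most once
// per CQL string, even with concurrent callers.
private BoundStatement getStatement(final String cql, final Object... values) {
    Session session = getSession();
    PreparedStatement ps = cache.computeIfAbsent(cql, session::prepare);
    return ps.bind(values);
}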

Store a database connection as a separate class

Is it possible to store a database connection as a separate class, and then call the database objects from the main code? i.e.:
public class Main {
    public static void main(String[] args) {
        Database to = null;
        Database from = null;
        try {
            Class.forName("com.jdbc.driver");
            to = new Database(1, "SERVER1", "DATABASE");
            from = new Database(2, "SERVER2", "DATABASE");
            String queryStr = String.format("SELECT * FROM TABLE WHERE Id = %d", to.id);
            to.results = to.sql.executeQuery(queryStr);
            while (to.results.next()) {
                String insertStr = String.format("INSERT INTO Table (A,B) VALUES ('%s','%s')",
                        to.results.getString(1), to.results.getString(2));
                from.sql.executeUpdate(insertStr);
            }
            to.connection.close();
            from.connection.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            if (to != null && to.connection != null)
                try {
                    to.connection.close();
                } catch (SQLException x) {
                }
            if (from != null && from.connection != null)
                try {
                    from.connection.close();
                } catch (SQLException x) {
                }
        }
    }

    public static class Database {
        public int id;
        public String server;
        public String database;
        public Connection connection;
        public ResultSet results;
        public Statement sql;

        public Database(int _id, String _server, String _database) throws SQLException {
            id = _id;
            server = _server;
            database = _database;
            String connectStr = String.format("jdbc:driver://SERVER=%s;port=6322;DATABASE=%s", server, database);
            connection = DriverManager.getConnection(connectStr);
            sql = connection.createStatement();
        }
    }
}
I keep getting a "Connection object is closed" error when I call to.results = to.sql.executeQuery("SELECT * FROM TABLE"); as if the connection closes as soon as the Database object has finished initializing.
The reason I ask is that I have multiple databases, all roughly the same, that I am dumping into a master database. I thought it would be nice to set up a loop that goes through each from database and inserts into each to database using the same class. Is this not possible? Database will also contain more methods than shown. I am pretty new to Java, so hopefully this makes sense...
Also, my code is probably riddled with syntax errors as is, so try not to focus on that.
"Connection object is closed" doesn't mean that the connection is closed, but that an object belonging to the connection is closed (it could be a Statement or a ResultSet).
It's difficult to tell from your example, since it has been trimmed/re-arranged, but it looks like you may be trying to use a ResultSet after having re-used its corresponding Statement. See the documentation:
By default, only one ResultSet object per Statement object can be open at the same time. Therefore, if the reading of one ResultSet object is interleaved with the reading of another, each must have been generated by different Statement objects. All execution methods in the Statement interface implicitly close a statement's current ResultSet object if an open one exists.
In your example, it may also be because autoCommit is set to true by default. You can override this on the java.sql.Connection object. Better yet, use a transaction framework if you're updating multiple tables.
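A minimal sketch of taking manual control of commits with the standard java.sql API:

// Disable auto-commit so a group of statements commits (or rolls back) together.
connection.setAutoCommit(false);
try {
    // ... executeQuery / executeUpdate calls ...
    connection.commit();
} catch (SQLException e) {
    connection.rollback(); // undo the partial work
    throw e;
}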

What problems can arise from using a thread-unsafe java.sql.Connection object in Java?

I am dealing with legacy code where the connection object in a singleton DAO class is a member variable and is prone to race conditions.
I know this is a potential design issue, but I am interested in the different types of problems that could arise when sharing a JDBC connection object like this in Java.
Following is the EventLoggerDAO class code:
package com.code.ref.dao;

import java.sql.Connection;
import java.sql.PreparedStatement;

import com.code.ref.utils.common.DBUtil;
import com.code.ref.utils.common.PCMLLogger;

public class EventLoggerDAO {

    private static EventLoggerDAO staticobj_EventLoggerDAO;
    private Connection obj_ClsConnection;
    private PreparedStatement obj_ClsPreparedStmt;

    private EventLoggerDAO() {
        try {
            obj_ClsConnection = DBUtil.getConnection();
        } catch (Exception e) {
            PCMLLogger.logMessage(EventLoggerDAO.class, "EventLoggerDAO()",
                    "Some problem in creating db connection:" + e);
        }
    }

    public static synchronized EventLoggerDAO getInstance() {
        if (staticobj_EventLoggerDAO == null) {
            synchronized (EventLoggerDAO.class) {
                if (staticobj_EventLoggerDAO == null)
                    staticobj_EventLoggerDAO = new EventLoggerDAO();
            }
        }
        return staticobj_EventLoggerDAO;
    }

    public void addEvent(String sName, String sType, String sAction, String sModifiedBy) throws Exception {
        StringBuffer sbQuery = new StringBuffer();
        sbQuery.append("INSERT INTO TM_EVENT_LOG (NAME, TYPE, ACTION, MODIFIED_BY) ")
               .append("VALUES (?, ?, ?, ?) ");
        if (obj_ClsConnection == null)
            obj_ClsConnection = DBUtil.getConnection();
        obj_ClsPreparedStmt = obj_ClsConnection.prepareStatement(sbQuery.toString());
        obj_ClsPreparedStmt.setString(1, sName);
        obj_ClsPreparedStmt.setString(2, sType);
        obj_ClsPreparedStmt.setString(3, sAction);
        obj_ClsPreparedStmt.setString(4, sModifiedBy);
        obj_ClsPreparedStmt.executeUpdate();
        if (obj_ClsPreparedStmt != null) {
            obj_ClsPreparedStmt.close();
            obj_ClsPreparedStmt = null;
        }
    }
}
Problem observed:
Sometimes the table TM_EVENT_LOG stops receiving inserts, and there is not even an exception in the server logs.
I suspect that under race conditions the connection objects held by different threads might end up in an inconsistent state and might not be committing the data. The connection comes from a WebSphere datasource that maintains a connection pool.
Any thoughts or ideas on why this might be happening?
Everything can happen here. Note that obj_ClsPreparedStmt is a member variable but is used as if it were a local variable - that is a much more serious problem than the shared Connection.
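A minimal sketch of the fix that remark points at: make the statement a local variable and close it with try-with-resources (Java 7+). The shared Connection issue would still need separate treatment:

public void addEvent(String sName, String sType, String sAction, String sModifiedBy) throws Exception {
    String sql = "INSERT INTO TM_EVENT_LOG (NAME, TYPE, ACTION, MODIFIED_BY) VALUES (?, ?, ?, ?)";
    // Local PreparedStatement: each thread gets its own, closed automatically.
    try (PreparedStatement ps = obj_ClsConnection.prepareStatement(sql)) {
        ps.setString(1, sName);
        ps.setString(2, sType);
        ps.setString(3, sAction);
        ps.setString(4, sModifiedBy);
        ps.executeUpdate();
    }
}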
