Write to Postgres with Apache Beam (GCP) - Java

We are using Apache Beam on Google Cloud Platform and have implemented a Dataflow streaming job that writes to our Postgres database. However, we noticed that once we started using two JdbcIO.write() statements next to each other, our streaming job started throwing errors like these:
Operation ongoing in step JdbcIO.WriteVoid/ParDo(Write) for at least 35m00s without outputting or completing in state process
at jdk.internal.misc.Unsafe.park (Native Method)
at java.util.concurrent.locks.LockSupport.park (LockSupport.java:194)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await (AbstractQueuedSynchronizer.java:2081)
at org.apache.commons.pool2.impl.LinkedBlockingDeque.takeFirst (LinkedBlockingDeque.java:581)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:439)
at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject (GenericObjectPool.java:356)
at org.apache.commons.dbcp2.PoolingDataSource.getConnection (PoolingDataSource.java:134)
at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn.executeBatch (JdbcIO.java:1438)
at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn.processElement (JdbcIO.java:1387)
at org.apache.beam.sdk.io.jdbc.JdbcIO$WriteVoid$WriteFn$DoFnInvoker.invokeProcessElement (Unknown Source)
This only starts approximately 30 minutes after deployment. Until those 30-ish minutes have passed, the job processes 10,000 elements just fine. On average the throughput ranges from 50 to 120 elements/second.
The queries are not that heavy either, just a simple delete and insert statement.
We suspect that connections get stuck and are never released for the other elements, but we don't know how to fix it.
Here's the code:
public void writeToPostgres(PCollection<TimestampedValue<KV<String, Duration>>> collection) {
    collection
        .apply(Filter.by(Postgres::filter1))
        .apply(JdbcIO.<TimestampedValue<KV<String, Duration>>>write()
            .withDataSourceProviderFn(JdbcIO.PoolableDataSourceProvider.of(getDataSourceConfiguration()))
            .withStatement("DELETE FROM table1 where field1 = ?::UUID and field2=?")
            .withPreparedStatementSetter((element, statement) -> {
                statement.setString(1, element.getValue().getKey());
                Instant timestamp = element.getTimestamp();
                statement.setTimestamp(2, new Timestamp(timestamp.getMillis()));
            })
            .withBatchSize(1)
            .withRetryStrategy(DEADLOCK_DETECTED_RETRY_STRATEGY));

    collection
        .apply(Filter.by(Postgres::filter2))
        .apply(JdbcIO.<TimestampedValue<KV<String, Duration>>>write()
            .withDataSourceProviderFn(JdbcIO.PoolableDataSourceProvider.of(getDataSourceConfiguration()))
            .withStatement("INSERT INTO table1 (field1, field2) \n" +
                "VALUES (?::UUID, ?) \n" +
                "ON CONFLICT ON CONSTRAINT someconstraint\n" +
                "DO UPDATE SET field2 = excluded.field2")
            .withPreparedStatementSetter((element, statement) -> {
                Instant eventTime = element.getTimestamp();
                Timestamp now = Timestamp.from(now());
                statement.setString(1, element.getValue().getKey());
                statement.setTimestamp(2, new Timestamp(eventTime.getMillis()));
            })
            .withBatchSize(1)
            .withRetryStrategy(DEADLOCK_DETECTED_RETRY_STRATEGY));
}
...
private DataSourceConfiguration getDataSourceConfiguration() {
    return DataSourceConfiguration.create(ValueProvider.StaticValueProvider.of("org.postgresql.Driver"), jdbcUrlProvider)
        .withUsername(usernameProvider)
        .withPassword(passwordProvider);
}
How can I fix this?

We were able to find a fix, though I consider it more of a workaround because we didn't find a solution within JdbcIO's DataSourceProvider. We basically copied JdbcIO's PoolableDataSourceProvider and used HikariDataSource instead, since it seems to improve performance anyway.
First, we add the HikariCP dependency to our pom file:
<dependency>
    <groupId>com.zaxxer</groupId>
    <artifactId>HikariCP</artifactId>
    <version>5.0.0</version>
</dependency>
Here's what the HikariDataSourceProvider looks like:
public static class HikariDataSourceProvider implements SerializableFunction<Void, DataSource> {

    private static final ConcurrentHashMap<HikariDataSourceConfig, DataSource> instances = new ConcurrentHashMap<>();
    private final HikariDataSourceConfig config;

    private HikariDataSourceProvider(HikariDataSourceConfig config) {
        this.config = config;
    }

    public static SerializableFunction<Void, DataSource> of(HikariDataSourceConfig hikariDataSourceConfig) {
        return new HikariDataSourceProvider(hikariDataSourceConfig);
    }

    @Override
    public DataSource apply(Void input) {
        return instances.computeIfAbsent(
            config,
            ignored -> {
                HikariDataSource hikariDataSource = new HikariDataSource();
                hikariDataSource.setJdbcUrl(config.getJdbcUrlProvider().get());
                hikariDataSource.setUsername(config.getUsernameProvider().get());
                hikariDataSource.setPassword(config.getPasswordProvider().get());
                hikariDataSource.setAutoCommit(false);
                return hikariDataSource;
            });
    }
}
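If needed, the pool can also be tuned in the same apply() method. This is only an illustrative sketch with made-up values, not something taken from our production setup:
    hikariDataSource.setMaximumPoolSize(10);       // illustrative value: bound connections per worker
    hikariDataSource.setConnectionTimeout(30_000); // illustrative value: fail fast instead of parking indefinitely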
...
@Data
@Builder
public static class HikariDataSourceConfig implements Serializable {
    private final ValueProvider<String> jdbcUrlProvider;
    private final ValueProvider<String> usernameProvider;
    private final ValueProvider<String> passwordProvider;
}
@Data and @Builder are Lombok annotations.
The PTransform would look something like this:
JdbcIO.<TimestampedValue<KV<String, Duration>>>write()
    .withDataSourceProviderFn(HikariDataSourceProvider.of(getDataSourceConfig()))
    .withStatement("...
We also removed the .withBatchSize(1) line so it wouldn't bottleneck the process. We tried removing that line first, without switching to HikariDataSource, but that alone did not solve the issue.
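Putting it together, the updated transform would look roughly like this (a sketch reusing the statement and setter from the question, with the default batch size):
collection
    .apply(Filter.by(Postgres::filter2))
    .apply(JdbcIO.<TimestampedValue<KV<String, Duration>>>write()
        .withDataSourceProviderFn(HikariDataSourceProvider.of(getDataSourceConfig()))
        .withStatement("INSERT INTO table1 (field1, field2) VALUES (?::UUID, ?) " +
            "ON CONFLICT ON CONSTRAINT someconstraint DO UPDATE SET field2 = excluded.field2")
        .withPreparedStatementSetter((element, statement) -> {
            statement.setString(1, element.getValue().getKey());
            statement.setTimestamp(2, new Timestamp(element.getTimestamp().getMillis()));
        })
        .withRetryStrategy(DEADLOCK_DETECTED_RETRY_STRATEGY));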
The streaming job can now handle the statements and is stable. The error no longer occurs.

Related

What is the most efficient way to persist thousands of entities?

I have fairly large CSV files which I need to parse and then persist into PostgreSQL. For example, one file contains 2_070_000 records which I was able to parse and persist in ~8 minutes (single thread). Is it possible to persist them using multiple threads?
public void importCsv(MultipartFile csvFile, Class<T> targetClass) {
    final var headerMapping = getHeaderMapping(targetClass);
    File tempFile = null;
    try {
        final var randomUuid = UUID.randomUUID().toString();
        tempFile = File.createTempFile("data-" + randomUuid, "csv");
        csvFile.transferTo(tempFile);
        final var csvFileName = csvFile.getOriginalFilename();
        final var csvReader = new BufferedReader(new FileReader(tempFile, StandardCharsets.UTF_8));
        Stopwatch stopWatch = Stopwatch.createStarted();
        log.info("Starting to import {}", csvFileName);
        final var csvRecords = CSVFormat.DEFAULT
            .withDelimiter(';')
            .withHeader(headerMapping.keySet().toArray(String[]::new))
            .withSkipHeaderRecord(true)
            .parse(csvReader);
        final var models = StreamSupport.stream(csvRecords.spliterator(), true)
            .map(record -> parseRecord(record, headerMapping, targetClass))
            .collect(Collectors.toUnmodifiableList());
        // How to save such a large list?
        log.info("Finished import of {} in {}", csvFileName, stopWatch);
    } catch (IOException ex) {
        ex.printStackTrace();
    } finally {
        tempFile.delete();
    }
}
models contains a lot of records. The parsing into records is done using a parallel stream, so it's quite fast. I'm afraid to call SimpleJpaRepository.saveAll, because I'm not sure what it will do under the hood.
The question is: What is the most efficient way to persist such a large list of entities?
P.S.: Any other improvements are greatly appreciated.
You have to use batch inserts.
Create an interface for a custom repository SomeRepositoryCustom
public interface SomeRepositoryCustom {
    void batchSave(List<Record> records);
}
Create an implementation of SomeRepositoryCustom
@Repository
class SomesRepositoryCustomImpl implements SomeRepositoryCustom {

    private JdbcTemplate template;

    @Autowired
    public SomesRepositoryCustomImpl(JdbcTemplate template) {
        this.template = template;
    }

    @Override
    public void batchSave(List<Record> records) {
        final String sql = "INSERT INTO RECORDS(column_a, column_b) VALUES (?, ?)";
        template.execute(sql, (PreparedStatementCallback<Void>) ps -> {
            for (Record record : records) {
                ps.setString(1, record.getA());
                ps.setString(2, record.getB());
                ps.addBatch();
            }
            ps.executeBatch();
            return null;
        });
    }
}
Extend your JpaRepository with SomeRepositoryCustom
@Repository
public interface SomeRepository extends JpaRepository, SomeRepositoryCustom {
}
To save:
someRepository.batchSave(records);
Notes
Keep in mind that even if you use batch inserts, the database driver may not actually apply them. For MySQL, for example, it is necessary to add the parameter rewriteBatchedStatements=true to the database URL; for PostgreSQL the equivalent driver parameter is reWriteBatchedInserts=true.
So it is better to enable driver-level SQL logging (not Hibernate's) to verify what is actually sent. Debugging the driver code can also be useful.
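Since the question targets PostgreSQL, a sketch of the relevant JDBC URL (host, database and credentials are placeholders):
// Hypothetical connection URL; reWriteBatchedInserts is the relevant part.
String url = "jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true";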
You will also need to decide whether to split the records into chunks inside the loop:
for (Record record : records) {
}
Some drivers can do this for you, in which case you won't need it, but it is better to verify this as well.
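If you do chunk manually, a minimal sketch could look like this (BATCH_SIZE is an arbitrary value, and doBatchInsert is a hypothetical helper standing for the JdbcTemplate batch logic shown above):
private static final int BATCH_SIZE = 1_000; // arbitrary chunk size

public void batchSave(List<Record> records) {
    for (int from = 0; from < records.size(); from += BATCH_SIZE) {
        int to = Math.min(from + BATCH_SIZE, records.size());
        // doBatchInsert wraps the JdbcTemplate batch insert shown above
        doBatchInsert(records.subList(from, to));
    }
}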
P. S. Don't use var everywhere.

Implementing Spring + Apache Flink project with Postgres

I have a Spring Boot Gradle project using Apache Flink to process a datastream of signals. When a new signal comes through the datastream, I would like to look up (i.e. findById()) its details using an ID in a Postgres database table, which already exists, in order to get additional information about the signal and enrich the data. I would like to avoid using Spring dependencies to perform the lookup (i.e. autowiring a repository) and want to stick with a Flink implementation for the lookup.
Where can I specify the Postgres connection configuration such as port, database, URL, username, password, etc.? (For simplicity, you can assume the Postgres DB is local on my machine.) Is it as simple as adding the configuration to the application.properties file? If so, how can I write the query method to look up a record in the Postgres table when searching by a non-primary-key value?
Some online sources suggest using this skeleton code, but I am not sure how/if it fits my use case. (I have an EventEntity model created which contains all the params/columns from the table I'm looking up.)
like so
public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {
    // Declare DB connection & query statements

    public void open(Configuration parameters) throws Exception {
        // Initialize DB connection
        // Prepare query statements
    }

    @Override
    public void flatMap(String value, Collector<EventEntity> out) throws Exception {
    }
}
Your sample code is correct. You can put all your custom initialization and preparation code for PostgreSQL in the open() method, and then use the pre-configured fields in your flatMap() function.
Here is one sample for Redis operations.
I have used RichAsyncFunction here and I suggest you do the same, as it is considered best practice. Read more here: https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/asyncio.html
You can pass configuration parameters in your constructor and use them in your initialization process:
public static class AsyncRedisOperations extends RichAsyncFunction<Object, Object> {

    private static final Logger logger = LoggerFactory.getLogger(AsyncRedisOperations.class);

    private transient JedisPool jedisPool;
    private Configuration redisConf;

    public AsyncRedisOperations(Configuration redisConf) {
        this.redisConf = redisConf;
    }

    @Override
    public void open(Configuration parameters) {
        JedisPoolConfig jedisPoolConfig = new JedisPoolConfig();
        jedisPoolConfig.setMaxTotal(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxIdle(this.redisConf.getInteger("pool", 8));
        jedisPoolConfig.setMaxWaitMillis(this.redisConf.getInteger("maxWait", 0));
        JedisPool jedisPool = new JedisPool(jedisPoolConfig,
                this.redisConf.getString("host", "192.168.10.10"),
                this.redisConf.getInteger("port", 6379), 5000);
        try {
            this.jedisPool = jedisPool;
            logger.info("Redis connected: " + jedisPool.getResource().isConnected());
        } catch (Exception e) {
            logger.error("Exception while connecting to Redis", e);
        }
    }

    @Override
    public void asyncInvoke(Object in, ResultFuture<Object> out) {
        try (Jedis jedis = this.jedisPool.getResource()) {
            // Look up the incoming element and complete the async result.
            String value = jedis.get(in.toString());
            logger.info("Redis value: " + value);
            out.complete(Collections.singleton(value));
        }
    }
}
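For the Postgres lookup in the question, the same pattern could look roughly like this. This is only a sketch: the JDBC URL, credentials, table and column names, and the EventEntity constructor are assumptions, not taken from your project.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class DatabaseMapper extends RichFlatMapFunction<String, EventEntity> {

    private transient Connection connection;
    private transient PreparedStatement lookup;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Connection settings could also be passed in via the constructor or global job parameters.
        connection = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password"); // placeholders
        lookup = connection.prepareStatement("SELECT id, name FROM events WHERE signal_id = ?"); // placeholder query
    }

    @Override
    public void flatMap(String signalId, Collector<EventEntity> out) throws Exception {
        lookup.setString(1, signalId);
        try (ResultSet rs = lookup.executeQuery()) {
            while (rs.next()) {
                // Assumes a hypothetical EventEntity(long, String) constructor.
                out.collect(new EventEntity(rs.getLong("id"), rs.getString("name")));
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (lookup != null) lookup.close();
        if (connection != null) connection.close();
    }
}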

How to implement FlinkKafkaProducer serializer for Kafka 2.2

I've been working on updating a Flink processor (Flink version 1.9) that reads from Kafka and then writes to Kafka. We wrote this processor to run against a Kafka 0.10.2 cluster, and now we have deployed a new Kafka cluster running version 2.2. Therefore I set out to update the processor to use the latest FlinkKafkaConsumer and FlinkKafkaProducer (as suggested by the Flink docs). However, I've run into some problems with the Kafka producer. I'm unable to get it to serialize data using the deprecated constructors (not surprising), and I've been unable to find any implementations or examples online of how to implement a serializer (all the examples use older Kafka connectors).
The current implementation (for Kafka 0.10.2) is as follows
FlinkKafkaProducer010<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer010<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        (FlinkKafkaPartitioner) null
);
When trying to implement the following FlinkKafkaProducer
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        null
);
I get the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:525)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:483)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:357)
at com.ebs.flink.sessionprocessor.SessionProcessor.main(SessionProcessor.java:122)
and I haven't been able to figure out why.
The constructor for FlinkKafkaProducer is also deprecated and when I try implementing the non-deprecated constructor I can't figure out how to serialize the data.
The following is how it would look:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String s, @Nullable Long aLong) {
                return null;
            }
        },
        producerProps,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE
);
But I don't understand how to implement the KafkaSerializationSchema and I find no examples of this online or in the Flink docs.
Does anyone have experience implementing this, or any tips on why the FlinkKafkaProducer throws a NullPointerException in that step?
If you are just sending String to Kafka:
public class ProducerStringSerializationSchema implements KafkaSerializationSchema<String> {

    private String topic;

    public ProducerStringSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
        return new ProducerRecord<byte[], byte[]>(topic, element.getBytes(StandardCharsets.UTF_8));
    }
}
For sending a Java Object:
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonProcessingException;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ObjSerializationSchema implements KafkaSerializationSchema<MyPojo> {

    private String topic;
    private ObjectMapper mapper;

    public ObjSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(MyPojo obj, Long timestamp) {
        byte[] b = null;
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        try {
            b = mapper.writeValueAsBytes(obj);
        } catch (JsonProcessingException e) {
            // TODO: handle the serialization error
        }
        return new ProducerRecord<byte[], byte[]>(topic, b);
    }
}
In your code
.addSink(new FlinkKafkaProducer<>(producerTopic, new ObjSerializationSchema(producerTopic),
        params.getProperties(), FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
To deal with the timeout in the case of FlinkKafkaProducer.Semantic.EXACTLY_ONCE, you should read https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-011-and-newer, particularly this part:
Semantic.EXACTLY_ONCE mode relies on the ability to commit transactions that were started before taking a checkpoint, after recovering from the said checkpoint. If the time between Flink application crash and completed restart is larger than Kafka’s transaction timeout there will be data loss (Kafka will automatically abort transactions that exceeded timeout time). Having this in mind, please configure your transaction timeout appropriately to your expected down times.
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than it’s value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
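In practice, that means setting transaction.timeout.ms on the producer properties you pass to FlinkKafkaProducer. A sketch (the broker address is a placeholder, and the 15-minute value is only an example that must stay at or below the broker's transaction.max.timeout.ms):
Properties producerProps = new Properties();
producerProps.setProperty("bootstrap.servers", "broker:9092");  // placeholder address
producerProps.setProperty("transaction.timeout.ms", "900000");  // 15 minutes, <= broker transaction.max.timeout.ms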

spring batch metadata in cassandra database

The question is quite simple: can I create the metadata schema for Spring Batch in a Cassandra database? If yes, how can I do this?
I've read that Spring Batch requires an RDBMS for that and that NoSQL databases are not supported. Is that still a limitation in Spring Batch, and how can I work around it?
Because Cassandra does not support simple sequences for keys, Spring Batch does not support using it for the job repository.
It is possible to extend Spring Batch to support Cassandra by customising ItemReader and ItemWriter.
Reader:
@Override
public Company read() throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
    final List<Company> companies = cassandraOperations.selectAll(aClass);
    log.debug("Read operations is performing, the object size is {}", companies.size());
    if (index < companies.size()) {
        final Company company = companies.get(index);
        index++;
        return company;
    }
    return null;
}
Writer:
@Override
public void write(final List<? extends Company> items) throws Exception {
    logger.debug("Write operations is performing, the size is {}" + items.size());
    if (!items.isEmpty()) {
        logger.info("Deleting in a batch performing...");
        cassandraTemplate.deleteAll(aClass);
        logger.info("Inserting in a batch performing...");
        cassandraTemplate.insert(items);
    }
    logger.debug("Items is null...");
}
Beans:
@Bean
public ItemReader<Company> reader(final DataSource dataSource) {
    final CassandraBatchItemReader<Company> reader = new CassandraBatchItemReader<Company>(Company.class);
    return reader;
}

@Bean
public ItemWriter<Company> writer(final DataSource dataSource) {
    final CassandraBatchItemWriter<Company> writer = new CassandraBatchItemWriter<Company>(Company.class);
    return writer;
}
Full source code can be found on GitHub: Spring-Batch-with-Cassandra

ArangoDB java driver on executing AQL sometimes return NULL and other times the correct result

I am unable to wrap my head around this peculiar issue.
I am using arangodb 3.0.10 and arangodb-java-driver 3.0.4.
I am executing a very simple AQL fetch query (see code below). All my unit tests pass every time and the problem never arises when debugging. The problem does not occur all the time (around half the time). Stranger still, the most frequent manifestation is a NullPointerException at
return cursor.getUniqueResult();
but once I also got a ConcurrentModificationException.
Questions:
1. Do I have to manage the database connection myself, e.g. close the driver connection after each use?
2. Am I doing something completely wrong with the ArangoDB query?
Any hint in the right direction is appreciated.
Error 1:
java.lang.NullPointerException
at org.xworx.sincapp.dao.UserDAO.get(UserDAO.java:41)
Error 2:
java.util.ConcurrentModificationException
at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
at java.util.HashMap$EntryIterator.next(HashMap.java:1471)
at java.util.HashMap$EntryIterator.next(HashMap.java:1469)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:206)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:145)
at com.google.gson.internal.bind.TypeAdapterRuntimeTypeWrapper.write(TypeAdapterRuntimeTypeWrapper.java:68)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:208)
at com.google.gson.internal.bind.MapTypeAdapterFactory$Adapter.write(MapTypeAdapterFactory.java:145)
at com.google.gson.Gson.toJson(Gson.java:593)
at com.google.gson.Gson.toJson(Gson.java:572)
at com.google.gson.Gson.toJson(Gson.java:527)
at com.google.gson.Gson.toJson(Gson.java:507)
at com.arangodb.entity.EntityFactory.toJsonString(EntityFactory.java:201)
at com.arangodb.entity.EntityFactory.toJsonString(EntityFactory.java:165)
at com.arangodb.impl.InternalCursorDriverImpl.getCursor(InternalCursorDriverImpl.java:94)
at com.arangodb.impl.InternalCursorDriverImpl.executeCursorEntityQuery(InternalCursorDriverImpl.java:79)
at com.arangodb.impl.InternalCursorDriverImpl.executeAqlQuery(InternalCursorDriverImpl.java:148)
at com.arangodb.ArangoDriver.executeAqlQuery(ArangoDriver.java:2158)
at org.xworx.sincapp.dao.UserDAO.get(UserDAO.java:41)
ArangoDBConnector
public abstract class ArangoDBConnector {

    protected static ArangoDriver driver;
    protected static ArangoConfigure configure;

    public ArangoDBConnector() {
        final ArangoConfigure configure = new ArangoConfigure();
        configure.loadProperties(ARANGODB_PROPERTIES);
        configure.init();
        final ArangoDriver driver = new ArangoDriver(configure);
        ArangoDBConnector.configure = configure;
        ArangoDBConnector.driver = driver;
    }
}
UserDAO
@Named
public class UserDAO extends ArangoDBConnector {

    private Map<String, Object> bindVar = new HashMap<>();

    public UserDAO() {}

    public User get(@NotNull String objectId) {
        bindVar.clear();
        bindVar.put("uuid", objectId);
        String fetchUserByObjectId = "FOR user IN User FILTER user.uuid == @uuid RETURN user";
        CursorResult<User> cursor = null;
        try {
            cursor = driver.executeAqlQuery(fetchUserByObjectId, bindVar, driver.getDefaultAqlQueryOptions(), User.class);
        } catch (ArangoException e) {
            new ArangoDaoException(e.getErrorMessage());
        }
        return cursor.getUniqueResult();
    }
}
As AntJavaDev said, you access bindVar from more than one thread at the same time. When one thread modifies bindVar while another thread is building the AQL call by reading bindVar, you get the ConcurrentModificationException.
The NullPointerException results from an AQL call with no result, e.g. when you clear bindVar and, directly after that, another thread executes the AQL with no content in bindVar.
To your questions:
1. No, you do not have to close the driver connection after each call.
2. Besides the shared bindVar, everything looks correct.
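A minimal sketch of the fix, building the bind variables locally per call instead of sharing a mutable field across threads (rest of the class as in the question; this assumes your ArangoDaoException is an unchecked exception):
public User get(@NotNull String objectId) {
    final Map<String, Object> bindVars = new HashMap<>();
    bindVars.put("uuid", objectId);
    final String fetchUserByObjectId = "FOR user IN User FILTER user.uuid == @uuid RETURN user";
    try {
        final CursorResult<User> cursor =
                driver.executeAqlQuery(fetchUserByObjectId, bindVars, driver.getDefaultAqlQueryOptions(), User.class);
        return cursor.getUniqueResult();
    } catch (ArangoException e) {
        // Assumes ArangoDaoException extends RuntimeException, as suggested by the question's code.
        throw new ArangoDaoException(e.getErrorMessage());
    }
}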
