How to efficiently save value to many duplicated tables? - java

I have an entity named Message with fields: id (PK), String messageXML and Timestamp date, and a simple DAO to store the object in an Oracle Database (11g) via MyBatis.
The code looks something like this:
Service:
void process(Request request) throws ProcessException {
    Message message = wrapper.getMessage(request);
    Long messageId;
    try {
        messageId = (Long) dao.save(message);
    } catch (DaoException e) {
        throw new ProcessException(e);
    }
}
DAO:
private String mapperName = "messageMapper";

Serializable save(Message message) throws DaoException {
    try {
        getSqlSession().insert(mapperName + ".insert", message);
        return message.getPrimaryKey();
    } catch (Exception e) {
        throw new DaoException(e);
    }
}
Simple code. Unfortunately, the load on this process(req) method is about 500 req/sec, and sometimes I get a lock on the DB while saving a message.
To resolve that problem I thought about multiplying the Message table: for instance, I would have five tables Message1, Message2 ... Message5, and when saving a Message entity I would draw a table (like a round-robin algorithm), for instance:
private Random generator;

public MessageDao() {
    this.generator = new Random();
}

Serializable save(Message message) throws DaoException {
    try {
        getSqlSession().insert(getMapperName() + ".insert", message);
        return message.getPrimaryKey();
    } catch (Exception e) {
        throw new DaoException(e);
    }
}

private String getMapperName() {
    return this.mapperName.concat(String.valueOf(generator.nextInt(5))); // could be more efficient, of course
}
What do you think about this solution? Would it be efficient? How can I make it better? Where could the bottleneck be?

Reading between the lines, I guess you have a number of instances of this code serving multiple concurrent requests, hence the contention. Or you have one server that fires 500 requests per second and you experience waits. Not sure which of these you mean. In the former case, you might want to look at extent allocation - if the table/index next extent sizes are small you will see regular latency when Oracle grabs the next extent. Size it too small and you will get this latency very regularly; size it big and, when it does eventually run out, the wait will be longer. You could do something like calculate the storage needed per week and have a weekly procedure to "grow" the table/indexes accordingly, to avoid this during operational hours. I would be tempted to examine the stats and see what the waits actually are.
If, however, the cause is concurrency (maybe in addition to extent management), then you're probably getting hot-block contention on the index used to enforce the PK constraint. Typical strategies to mitigate this include a REVERSE index (no code change required), or, more controversially, partitioning with a weaker unique constraint by adding a simple column to further segregate the concurrent sessions. E.g. add a column serverId to the table and partition by this column plus the existing PK column. Assign each application server a unique serverId (config/startup file), amend the insert to include the serverId, and have one partition per server. Controversial because the constraint is weaker (down to how partitions work), and this will be anathema to purists, but it is something I've used on projects with Oracle Consulting to maximise performance on Exadata. So, it's out there. Of course, partitions can be thought of as distinct tables grouped into a super table, so your idea of writing to separate tables is not a million miles from what is being suggested here. The advantage of partitions is that they are a more natural mechanism for grouping this data, and adding a new partition requires less work than adding a new table when you expand.
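To make the serverId idea concrete, here is a minimal sketch of how the DAO from the question could stamp each message with a per-server id before the insert. The system property name (app.server.id) and the Message.setServerId setter are assumptions for illustration; the real wiring depends on your configuration and MyBatis mapping.

public class MessageDao {

    private final String mapperName = "messageMapper";
    // Unique per application server, e.g. read from a config/startup file (property name assumed)
    private final int serverId = Integer.getInteger("app.server.id", 1);

    Serializable save(Message message) throws DaoException {
        try {
            message.setServerId(serverId); // assumed setter; this column feeds the partitioning key
            getSqlSession().insert(mapperName + ".insert", message);
            return message.getPrimaryKey();
        } catch (Exception e) {
            throw new DaoException(e);
        }
    }
}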

Related

Duplicate entry exception while trying to create new job

We're using Quartz for scheduling jobs, with MariaDB underneath, in a setup of a few nodes. We're also using it as a queue system. The most important reason for that is that we already have Quartz in our service, and we do not have any queue just yet.
We're getting a lot of requests, very often at the same time, with the same id of a business entity, which is used to generate the job name for the sake of uniqueness.
try {
quartzServce.scheduleJob(job, trigger);
log.info("job: {} has been scheduled", id);
} catch (ObjectAlreadyExistsException ex) {
log.warn(DUPLICATE_ENTRY_MESSAGE, job.getKey(), trigger.getKey());
} catch (Exception ex) {
throw new RuntimeException(ex);
}
The quartzService.scheduleJob is simply:
public void scheduleJob(JobDetail jobDetail, Trigger trigger) throws SchedulerException {
schedulerFactory.getScheduler().scheduleJob(jobDetail, trigger);
}
As you can see, we're catching ObjectAlreadyExistsException and silencing it at WARN level, as we do not treat it as an error, but from time to time we still get an SQLIntegrityConstraintViolationException wrapped in a JobPersistenceException with the message:
Couldn't store job: (conn=435654) Duplicate entry '{our key is over here}' for key 'PRIMARY' [See nested exception: java.sql.SQLIntegrityConstraintViolationException: (conn=435654) Duplicate entry '{our key is over here}' for key 'PRIMARY']
What I assume is that, between the existence check and the actual insert, another node manages to insert a row for the same id.
As I'm not really a fan of checking the message in an exception for something like "Duplicate entry" and silencing the exception at WARN level on that condition, I'm looking for another solution - maybe Quartz configuration?
You can try checking whether the job exists before calling quartzService.scheduleJob, using scheduler.checkExists(...).
Not sure if that's what you were looking for.
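For illustration, a minimal sketch of that pre-check, reusing the job and trigger from the question (this narrows the race window but does not remove it, so the existing exception handling should stay in place):

Scheduler scheduler = schedulerFactory.getScheduler();
if (!scheduler.checkExists(job.getKey())) {
    scheduler.scheduleJob(job, trigger);
    log.info("job: {} has been scheduled", id);
} else {
    log.warn(DUPLICATE_ENTRY_MESSAGE, job.getKey(), trigger.getKey());
}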
But it sounds like you are going to need a lock of some kind on this business entity id, so that only one node at a time tries to schedule it.
You could save the distinct entity ids in a table before scheduling jobs for them, and let your nodes pick them up with a lock so that a particular entity id is not visible to other nodes. A rough sketch of that idea follows.
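Sketched with plain JDBC against MariaDB; the entity_queue table, its columns and the dataSource are made-up names for the example, not part of Quartz:

// Only the node that locks the unclaimed row proceeds to schedule the job.
try (Connection con = dataSource.getConnection()) {
    con.setAutoCommit(false);
    try (PreparedStatement select = con.prepareStatement(
            "SELECT entity_id FROM entity_queue WHERE entity_id = ? AND claimed = 0 FOR UPDATE")) {
        select.setString(1, id);
        try (ResultSet rs = select.executeQuery()) {
            if (rs.next()) {
                quartzService.scheduleJob(job, trigger);
                try (PreparedStatement update = con.prepareStatement(
                        "UPDATE entity_queue SET claimed = 1 WHERE entity_id = ?")) {
                    update.setString(1, id);
                    update.executeUpdate();
                }
            }
        }
    }
    con.commit();
}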

What is the correct way to commit after processing each record retrieved from Kafka?

I'm having a bit of trouble understanding how to manually commit properly for each record I consume.
First, let's look at an example from https://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
while (true) {
ConsumerRecords<String, String> records = consumer.poll(100);
for (ConsumerRecord<String, String> record : records) {
buffer.add(record);
}
if (buffer.size() >= minBatchSize) {
insertIntoDb(buffer);
consumer.commitSync();
buffer.clear();
}
}
This example commits only after all the records that were received in the poll were processed. I think this isn't a great approach, because if we receive three records, and my service dies while processing the second one, it will end up consuming the first record again, which is incorrect.
So there's a second example that covers committing records on a per-partition basis:
try {
while(running) {
ConsumerRecords<String, String> records = consumer.poll(Long.MAX_VALUE);
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, String>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, String> record : partitionRecords) {
System.out.println(record.offset() + ": " + record.value());
}
long lastOffset = partitionRecords.get(partitionRecords.size() - 1).offset();
consumer.commitSync(Collections.singletonMap(partition, new OffsetAndMetadata(lastOffset + 1)));
}
}
} finally {
consumer.close();
}
However, I think this suffers from the same problem, it only commits after processing all the records that have come from a particular partition.
The solution I have managed to come up with is this:
val consumer: Consumer<String, MyEvent> = createConsumer(bootstrap)
consumer.subscribe(listOf("some-topic"))
while (true) {
val records: ConsumerRecords<String, MyEvent> = consumer.poll(Duration.ofSeconds(1))
if (!records.isEmpty) {
mainLogger.info("Received ${records.count()} events from CRS kafka topic, with partitions ${records.partitions()}")
records.forEach {
mainLogger.debug("Record at offset ${it.offset()}, ${it.value()}")
processEvent(it.value()) // Complex event processing occurs in this function
consumer.commitSync(mapOf(TopicPartition(it.topic(), it.partition()) to OffsetAndMetadata(it.offset() + 1)))
}
}
}
Now this seems to work while I am testing. So far, during my testing though, there appears to be only one partition being used (I have checked this by logging records.partitions()).
Is this approach going to cause any issues? The Consumer API does not seem to provide a way to commit an offset without specifying a partition, and this seems a bit odd to me. Am I missing something here?
There's no right or wrong way to commit. It really depends on your use case and application.
Committing every offset gives more granular control but it has an implication in terms of performance. On the other side of the spectrum, you could commit asynchronously every X seconds (like auto commit does) and have very little overhead but a lot less control.
In the first example, events are processed and committed in batch. It's interesting in terms of performance, but in case of error, the full batch could be reprocessed.
In the second example, it's also batching, but only per partition. This should lead to smaller batches, so less performance but less reprocessing in case things go wrong.
In your last example, you choose to commit every single message. While this gives the most control, it significantly affects performance. In addition, like the other cases, it's not fully error proof.
If the application crashes after the event is processed but before it's committed, upon restarting the last event is likely to be reprocessed (ie at least once semantics). But at least, only one event should be affected.
If you want exactly once semantics, you need to use the Transactional Producer.
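For reference, a rough Java sketch of the read-process-write pattern with the transactional producer. The output topic, group id, transactional.id and the process() call are placeholders, and the guarantee only covers results written back to Kafka (downstream consumers must use isolation.level=read_committed, and the consumer here must have enable.auto.commit=false):

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-processor-1"); // placeholder, unique per instance
KafkaProducer<String, String> producer =
        new KafkaProducer<>(props, new StringSerializer(), new StringSerializer());
producer.initTransactions();

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    if (records.isEmpty()) {
        continue;
    }
    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            producer.send(new ProducerRecord<>("output-topic", record.key(), process(record.value())));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset() + 1));
        }
        producer.sendOffsetsToTransaction(offsets, "my-consumer-group"); // offsets commit atomically with the writes
        producer.commitTransaction();
    } catch (KafkaException e) {
        producer.abortTransaction(); // the whole batch will be re-polled and reprocessed
    }
}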

DynamoDB wait for table to become active

I am working on a project where we are using DynamoDB as the database.
I used TableUtils from com.amazonaws.services.dynamodbv2.util.TableUtils to create the table if it does not exist.
CreateTableRequest tableRequest = dynamoDBMapper.generateCreateTableRequest(cls);
tableRequest.setProvisionedThroughput(new ProvisionedThroughput(5L, 5L));
boolean created = TableUtils.createTableIfNotExists(amazonDynamoDB, tableRequest);
Now, after creating the table, I have to push the data once it is active.
I saw there is a method to do this
try {
TableUtils.waitUntilActive(amazonDynamoDB, cls.getSimpleName());
} catch (Exception e) {
// TODO: handle exception
}
But this is taking 10 minutes.
Is there a method in TableUtils which returns as soon as the table becomes active?
You may try something like the following.
Table table = dynamoDB.createTable(request);
System.out.println("Waiting for " + tableName + " to be created...this may take a while...");
table.waitForActive();
For more information check out this link.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/AppendixSampleDataCodeJava.html
I implemented a solution for this in Go.
Here is the summary:
You have to use the DescribeTable API (or the corresponding API in your SDK).
The input to this API is a DescribeTableInput, where you specify the table name.
You will need to poll in a loop until the table becomes active.
The output of DescribeTable provides the status of the table (result.Table.TableStatus).
If the status is "ACTIVE" then you can insert the data; otherwise continue with the loop.
In my case, the tables become active in less than one minute.
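The same polling idea translated to Java with the v1 AWS SDK might look roughly like this; the poll interval is arbitrary, and real code would add a timeout:

// Minimal sketch: poll DescribeTable until the table reports ACTIVE
void waitForActive(AmazonDynamoDB amazonDynamoDB, String tableName) throws InterruptedException {
    while (true) {
        String status = amazonDynamoDB.describeTable(tableName).getTable().getTableStatus();
        if ("ACTIVE".equals(status)) {
            return; // safe to start writing
        }
        Thread.sleep(1000L); // arbitrary poll interval
    }
}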

Proper way to insert record with unique attribute

I am using spring, hibernate and postgreSQL.
Let's say I have a table looking like this:
CREATE TABLE test
(
    id integer NOT NULL,
    name character(10),
    CONSTRAINT test_unique UNIQUE (id)
)
So whenever I insert a record, the attribute id must be unique.
I would like to know which is the better way to insert a new record (in my Spring Java app):
1) Check if a record with the given id exists and, if it doesn't, insert it, something like this:
if (testDao.find(id) == null) {
    Test test = new Test(id, name);
    testDao.create(test);
}
2) Call the create method straight away and see whether it throws a DataAccessException...
Test test = new Test(id, name);
try {
    testDao.create(test);
} catch (DataAccessException e) {
    System.out.println("Error inserting record");
}
I consider the 1st way more appropriate, but it means more processing for the DB. What is your opinion?
Thank you in advance for any advice.
Option (1) is subject to a race condition, where a concurrent session could create the record between checking for it and inserting it. This window is longer than you might expect, because the record might already have been inserted by another transaction that has not yet committed.
Option (2) is better, but will result in a lot of noise in the PostgreSQL error logs.
The best way is to use PostgreSQL 9.5's INSERT ... ON CONFLICT ... support to do a reliable, race-condition-free insert-if-not-exists operation.
On older versions you can use a loop in plpgsql.
Both those options require use of native queries, of course.
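As an illustration of the native-query route, an insert-if-not-exists could look roughly like this with Spring's JdbcTemplate (the jdbcTemplate bean is assumed; with Hibernate the same SQL can go through createNativeQuery):

// Race-condition-free insert-if-not-exists, PostgreSQL 9.5+
int rows = jdbcTemplate.update(
        "INSERT INTO test (id, name) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        id, name);
if (rows == 0) {
    // a row with this id already existed; no exception, no log noise
}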
Depends on the source of your ID. If you generate it yourself you can ensure uniqueness (e.g. with java.util.UUID) and rely on catching an exception: http://docs.oracle.com/javase/1.5.0/docs/api/java/util/UUID.html
Another way would be to let Postgres generate the ID using the SERIAL data type
http://www.postgresql.org/docs/8.1/interactive/datatype.html#DATATYPE-SERIAL
If you have to take the ID from an untrusted source, do the prior check.
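If you let PostgreSQL generate the ID, the Hibernate side is typically just a generated identifier; a minimal sketch of the mapping (column names taken from the table above, the rest assumed):

@Entity
@Table(name = "test")
public class Test {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY) // backed by a SERIAL/identity column in PostgreSQL
    private Integer id;

    @Column(name = "name", length = 10)
    private String name;

    // constructors, getters and setters omitted
}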

Pull from Cassandra database whenever any new rows or any new update is there?

I am working on a system in which I need to store Avro schemas in a Cassandra database. So in Cassandra we will be storing something like this:
SchemaId AvroSchema
1 some schema
2 another schema
Now suppose I insert another row into the above table, so that it looks like this:
SchemaId AvroSchema
1 some schema
2 another schema
3 another new schema
As soon as I insert a new row into the above table, I need to tell my Java program to go and pull the new schema id and the corresponding schema.
What is the right way to solve these kind of problem?
I know one way is to poll every few minutes - say every 5 minutes we go and pull the data from the above table - but this doesn't feel like the right way to solve the problem, as I would be doing a pull every 5 minutes whether or not there are any new schemas.
But is there any other solution apart from this?
Can we use Apache Zookeeper? Or Zookeeper is not fit for this problem?
Or any other solution?
I am running Apache Cassandra 1.2.9
Some solutions:
With database triggers: Cassandra 2.0 has some trigger support but it looks like it is not final and might change a little in 2.1 according to this article: http://www.datastax.com/dev/blog/whats-new-in-cassandra-2-0-prototype-triggers-support. Triggers are a common solution.
You brought up polling, but that is not always a bad option, especially if you have something that marks a row as not having been pulled yet, so you can pull only the new rows out of Cassandra. Pulling once every 5 minutes is nothing, load-wise, for Cassandra or any database if the query is not heavy. This option might not be good if new rows are inserted very infrequently. A small polling sketch follows.
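As an illustration, a polling loop with the DataStax Java driver that keeps a local set of schema ids already seen; the keyspace, table and column names are assumptions based on the example above, and the driver needs the native protocol enabled on the cluster:

Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
Session session = cluster.connect("my_keyspace"); // assumed keyspace
Set<Long> knownIds = new HashSet<>();

while (true) {
    // A full scan is acceptable here because the schema table stays small
    for (Row row : session.execute("SELECT schemaid, avroschema FROM schemas")) {
        long schemaId = row.getLong("schemaid");
        if (knownIds.add(schemaId)) {
            String avroSchema = row.getString("avroschema");
            // hand the newly discovered schema to the rest of the application here
        }
    }
    try {
        Thread.sleep(TimeUnit.MINUTES.toMillis(5)); // the 5-minute interval from the question
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
    }
}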
Zookeeper would not be a perfect solution, see this quote:
Because watches are one time triggers and there is latency between getting the event and sending a new request to get a watch you cannot reliably see every change that happens to a node in ZooKeeper. Be prepared to handle the case where the znode changes multiple times between getting the event and setting the watch again. (You may not care, but at least realize it may happen.)
Quote sourced from: http://zookeeper.apache.org/doc/r3.4.2/zookeeperProgrammers.html#sc_WatchRememberThese
Cassandra 3.0
You can use a trigger like the following, and it will give you everything in the insert as a JSON object.
public class HelloWorld implements ITrigger {

    private static final Logger logger = LoggerFactory.getLogger(HelloWorld.class);

    public Collection<Mutation> augment(Partition partition) {
        String tableName = partition.metadata().cfName;
        logger.info("Table: " + tableName);
        JSONObject obj = new JSONObject();
        obj.put("message_id", partition.metadata().getKeyValidator().getString(partition.partitionKey().getKey()));
        try {
            UnfilteredRowIterator it = partition.unfilteredIterator();
            while (it.hasNext()) {
                Unfiltered un = it.next();
                Clustering clt = (Clustering) un.clustering();
                Iterator<Cell> cells = partition.getRow(clt).cells().iterator();
                Iterator<ColumnDefinition> columns = partition.getRow(clt).columns().iterator();
                while (columns.hasNext()) {
                    ColumnDefinition columnDef = columns.next();
                    Cell cell = cells.next();
                    String data = new String(cell.value().array()); // if the cell type is text
                    obj.put(columnDef.toString(), data);
                }
            }
        } catch (Exception e) {
            logger.error("Failed to read mutation", e); // don't swallow errors silently
        }
        logger.debug(obj.toString());
        return Collections.emptyList();
    }
}
