Is there a way to retry a Bolt in Storm? - java

We have an app that does database saves. If the save fails, is there a way to retry just the bolt that failed? We don't want to fail all the way back to the spout.

You could add an output "scorpion tail" stream to the bolt. The stream would be read by whichever bolt would begin the retry process. This would create a loop in the topology. The idea is that when a failure occurs, you can write a packet of information to this stream and have the tuple delivered to the upstream bolt that would begin the retry. The packet contains whatever state is needed for the retry.
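A rough sketch of that wiring, using Storm 2.x-style signatures and made-up names (MySpout, UpstreamBolt, DbSaveBolt, the "retry" stream id and the "id" field are illustrative, not from the question):
public class DbSaveBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            saveToDatabase(input);                 // your DB write
            collector.ack(input);
        } catch (Exception e) {
            // On failure, emit a "packet" with whatever state the retry needs
            // onto the extra stream instead of failing back to the spout.
            collector.emit("retry", input, new Values(input.getStringByField("id")));
            collector.ack(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declareStream("retry", new Fields("id"));   // the "scorpion tail" stream
    }

    private void saveToDatabase(Tuple input) { /* ... */ }
}

// Topology wiring: the upstream bolt also subscribes to the "retry" stream,
// which creates the loop described above.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout());
builder.setBolt("upstream", new UpstreamBolt())
       .shuffleGrouping("spout")
       .shuffleGrouping("db-save", "retry");
builder.setBolt("db-save", new DbSaveBolt()).shuffleGrouping("upstream");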

There is no built-in support for this in Storm. However, you can code your own solution:
Do not ack (or fail) the failing tuple; buffer it in an internal data structure (e.g. a member variable, maybe a List) and return from execute().
Keep processing further tuples in execute() until you want to retry (maybe based on a timer, i.e. the current timestamp, or on a retry counter).
On retry, before processing the new input tuple, take the failed tuple from your buffer and try to insert it into the DB. If it fails again, put it back into the buffer. If the insert is successful, ack the buffered tuple and resume processing the current input tuple.
You only need to consider Storm's MESSAGE_TIMEOUT: retrying cannot take longer than this value, because if the tuple is not acked within the timeout, Storm fails the tuple at the source automatically.
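A minimal sketch of this buffering approach; the retryIntervalMs value and the insertIntoDb() helper are assumptions for illustration, not from the original answer:
public class BufferingDbBolt extends BaseRichBolt {
    private OutputCollector collector;
    private final List<Tuple> failedBuffer = new LinkedList<>();  // unacked tuples awaiting retry
    private final long retryIntervalMs = 10_000L;                 // must stay well below the message timeout
    private long lastRetry = 0L;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Retry buffered tuples first, if the retry interval has elapsed.
        if (System.currentTimeMillis() - lastRetry > retryIntervalMs) {
            lastRetry = System.currentTimeMillis();
            Iterator<Tuple> it = failedBuffer.iterator();
            while (it.hasNext()) {
                Tuple buffered = it.next();
                if (insertIntoDb(buffered)) {   // success: ack the buffered tuple now
                    collector.ack(buffered);
                    it.remove();
                }                               // failure: keep it buffered
            }
        }
        // Process the current tuple; do NOT ack or fail it if the insert fails.
        if (insertIntoDb(input)) {
            collector.ack(input);
        } else {
            failedBuffer.add(input);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) { }

    private boolean insertIntoDb(Tuple t) { /* your DB write */ return true; }
}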

Related

Flink: block processing of events with the same id at the same time

I have a Flink application that processes a stream of data and writes some results to a database. The stream is keyed by id. A database operation can take quite a long time (e.g. 3 min), and only one operation can run per id key at a time, to protect against locks. At the moment this sink operation cannot be processed in parallel and the parallelism has to be set to 1.
process
    .keyBy(new ProductKeySelector())
    .addSink(new ProductSink())
    .setParallelism(1)
I want to lock the stream on the id that is currently being processed and take another event (out of order), and wait until processing of that same id has finished before processing the next event with that id. It would work like a blocking queue.
Update:
example:
kafkaKeyedStream
    .map(new MapToProductType())
    .keyBy(new ProductKeySelector())
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
    .reduce(new ProductAggregateFunction())
    .addSink(new ProductSink());
From the Kafka source I received data:
[screenshot of sample records]
As you can see, the data are grouped by the window function (the first value in each record is the key) and the results are processed by the sink function. For this example, let's say that processing takes 20 s per batch of data. With 1 thread that's not a problem, because the next batch waits to be processed; but if I set parallelism = 2, the first batch is still being processed by one thread when, after 10 s, another thread starts processing the next batch with the same key as the first. And this creates a lock on the database.
I would like that, in a situation where one thread is already processing data for a specific key, a second thread does not take data for the same key, and instead takes a different key or does nothing if there is nothing else.
If your DB operation could take 3 minutes, you don't want to use a regular JDBC sink. Instead, look at Flink's Async IO support. You'd want to keyBy(id), and then inside of your custom RichAsyncFunction operator you can keep track of whether you've got an active DB request for a given id.
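A rough sketch of that approach (not the original poster's code): events is the incoming DataStream<ProductEvent>, and ProductEvent, getId() and writeToDatabase() are assumed names; one way to "keep track" is to chain the requests per id so that two DB calls for the same id never overlap:
DataStream<ProductEvent> written = AsyncDataStream.unorderedWait(
        events.keyBy(ProductEvent::getId),   // events for the same id go to the same subtask
        new RichAsyncFunction<ProductEvent, ProductEvent>() {
            // Tail of the in-flight work chain per id: a new request for an id is chained
            // after the previous one, so requests for the same id run sequentially.
            private transient ConcurrentMap<String, CompletableFuture<Void>> chains;
            private transient ExecutorService dbExecutor;

            @Override
            public void open(Configuration parameters) {
                chains = new ConcurrentHashMap<>();
                dbExecutor = Executors.newFixedThreadPool(4);
            }

            @Override
            public void asyncInvoke(ProductEvent event, ResultFuture<ProductEvent> out) {
                chains.compute(event.getId(), (id, tail) -> {
                    CompletableFuture<Void> prev =
                            (tail == null) ? CompletableFuture.completedFuture(null) : tail;
                    return prev
                            .thenRunAsync(() -> writeToDatabase(event), dbExecutor)  // slow DB call
                            .whenComplete((ignored, err) ->
                                    out.complete(Collections.singleton(event)));
                });
            }
        },
        10, TimeUnit.MINUTES,   // async timeout must cover queueing plus the ~3 min DB call
        100);                   // max in-flight events per subtask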

Do newer versions of Kafka producers still have "producer.type"?

Older versions' doc says it's one of the essential properties.
Newer versions' doc doesn't mention it at all.
Do newer versions of Kafka producers still have producer.type?
Or are new producers always async, so that I should call future.get() to make it sync?
New producers are always async, and you should call future.get() to make it sync. It's not worth having two API methods when something as simple as adding future.get() gives you basically the same functionality.
From the documentation for send() here
https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/clients/producer/KafkaProducer.html
Since the send call is asynchronous it returns a Future for the
RecordMetadata that will be assigned to this record. Invoking get() on
this future will block until the associated request completes and then
return the metadata for the record or throw any exception that
occurred while sending the record.
If you want to simulate a simple blocking call you can call the get()
method immediately:
byte[] key = "key".getBytes();
byte[] value = "value".getBytes();
ProducerRecord<byte[],byte[]> record = new ProducerRecord<byte[],byte[]>("my-topic", key, value);
producer.send(record).get();
Why do you want to make send() synchronous?
This is a Kafka feature to batch messages for better throughput.
Asynchronous send
Batching is one of the big drivers of efficiency, and to enable batching the Kafka producer will attempt to accumulate data in memory and to send out larger batches in a single request. The batching can be configured to accumulate no more than a fixed number of messages and to wait no longer than some fixed latency bound (say 64k or 10 ms). This allows the accumulation of more bytes to send, and few larger I/O operations on the servers. This buffering is configurable and gives a mechanism to trade off a small amount of additional latency for better throughput.
There is no way to do a synchronous send, because the API only supports the async method, but there are some configs you can specify as a workaround.
You could set batch.size to 0. In this case, message batching is disabled.
However, I think you should just leave batch.size at its default and set linger.ms to 0 (which is also the default). In this case, if many messages come in at the same time, they will be batched into one send immediately.
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out.
And if you want to make sure the message is sent and persisted successfully, you could set acks to -1 or 1 and retries to 3 (for example).
For more info about the producer configs, you can refer to https://kafka.apache.org/documentation/#producerconfigs
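For illustration, a minimal producer configuration along those lines (the broker address and values are just examples, not recommendations from the original answer):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("linger.ms", "0");    // the default: send a batch as soon as possible
props.put("acks", "all");       // equivalent to -1: wait for all in-sync replicas
props.put("retries", "3");      // retry transient send failures

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// "Sync" send: block until the broker acknowledges the record.
RecordMetadata metadata = producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();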

Apache NiFi: FlowFileHandlingException when transfer FlowFile in custom processor

I'm doing data ingestion from a remote API service for different time ranges in my custom NiFi processor.
I have a global time-range counter which is updated with each iteration (I'm using the Timer driven scheduling strategy).
When the counter is greater than the maximum value, I want to just transfer the FlowFile from the request (session.get()) with the SUCCESS relationship, i.e. without performing any additional logic:
session.transfer(requestFlowFile, SUCCESS);
I understand that I can't stop or pause the processor when the collection of time ranges is exhausted, so I'm trying to use the above approach as a solution.
All iterations go fine until the counter becomes greater than the maximum and the processor tries to transfer the FlowFile from the request (session.get()).
Then I get this exception:
failed to process session due to org.apache.nifi.processor.exception.FlowFileHandlingException: StandardFlowFileRecord[uuid=459e615b-0ff5-424f-aac7-f95d364cdc13,claim=,offset=0,name=99628180019265,size=0] is not known in this session
What's wrong here? Or is there maybe another approach?
That error means the flow file being passed to session.transfer() came from a different session. You can only call transfer() on the same session from which you called get().
If it's a custom processor, just don't call session.get() and skip this execution without transferring anything.
Or, if you need the incoming file to make a decision, you can get it, do some checks, and roll back the current session with penalization, rollback(true); the file(s) you got will then stay in the incoming queue for the processor's Penalty Duration without triggering the processor to run.
Or you can call session.get(FlowFileFilter) to pull from the incoming queue only the files that match your logic.
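A rough sketch of the first two options inside a custom processor's onTrigger(); the counter, maxTimeRanges and shouldProcess() names are hypothetical stand-ins for the poster's own state and checks:
@Override
public void onTrigger(ProcessContext context, ProcessSession session) throws ProcessException {
    // Option 1: time ranges exhausted -> don't pull anything from the incoming queue.
    if (counter.get() > maxTimeRanges) {
        context.yield();   // avoid busy-looping while there is nothing to do
        return;
    }

    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }

    // Option 2: the incoming file is needed for a check but should not be processed now:
    // keep it in the incoming queue (penalized) instead of transferring a flow file
    // that belongs to another session.
    if (!shouldProcess(flowFile)) {
        session.rollback(true);
        return;
    }

    // ... normal ingestion logic for the current time range ...
    session.transfer(flowFile, SUCCESS);
}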

Effective strategy to avoid duplicate messages in apache kafka consumer

I have been studying Apache Kafka for a month now. However, I am now stuck at a point. My use case is: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while processing these messages, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. So after consumption finished, the file was showing more than 10k messages, meaning some messages were duplicated.
In the consumer process I have disabled auto commit. Consumers commit offsets manually, batch-wise. So, for example, if 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even in order to attempt to prevent duplicates you would need to use the simple consumer. How this approach works is for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
Usually your time will be better spent and your application will be much more reliable if you simply design it to be idempotent.
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe about deduplicating on the consumer side. And we use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer side and is guaranteed to be unique. We use a 12-character random string (regexp: '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is ok for production environment
}
The above code did detect duplicate messages several times when Kafka (version 0.8.x) had issues. With our input/output balance audit log, no message was lost or duplicated.
There's a relatively new 'Transactional API' in Kafka now that can allow you to achieve exactly-once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the rest of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
Whatever is done on the producer side, we still believe that the best way to deliver exactly once from Kafka is to handle it on the consumer side (a rough sketch of this pipeline follows the list):
Produce the message with a UUID as the Kafka message key into topic T1.
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key.
Read it back from HBase with the same row key and write it to another topic T2.
Have your end consumers actually consume from topic T2.
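For illustration only, a rough sketch of that pipeline; the "dedup" table name, the "d:payload" column, and the pre-configured consumer (subscribed to T1) and producer are all assumptions, not from the original answer:
try (Connection hbase = ConnectionFactory.createConnection(HBaseConfiguration.create());
     Table table = hbase.getTable(TableName.valueOf("dedup"))) {

    while (true) {
        for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofMillis(500))) {
            byte[] rowKey = Bytes.toBytes(rec.key());   // the producer-assigned UUID

            // Step 2: write to HBase; duplicate deliveries overwrite the same row.
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(rec.value()));
            table.put(put);

            // Step 3: read the canonical row back and forward it to T2.
            Result row = table.get(new Get(rowKey));
            String payload = Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload")));
            producer.send(new ProducerRecord<>("T2", rec.key(), payload));
        }
        consumer.commitSync();
    }
}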

Why does Storm replay tuples from the spout instead of retrying on the crashed component?

I am using Storm to process online problems, but I can't understand why Storm replays the tuple from the spout. Retrying at the component that crashed may be more effective than replaying from the root, right?
Can anyone help me? Thanks.
A typical spout implementation will replay only the FAILED tuples. As explained here, a tuple emitted from the spout can trigger thousands of other tuples, and Storm creates a tree of tuples based on that. Now a tuple is called "fully processed" when every message in the tree has been processed. While emitting, the spout adds a message id which is used to identify the tuple in a later phase. This is called anchoring and can be done in the following way:
_collector.emit(new Values("field1", "field2", 3) , msgId);
Now, the link posted above says:
A tuple is considered failed when its tree of messages fails to be fully processed within a specified timeout. This timeout can be configured on a topology-specific basis using the Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS configuration and defaults to 30 seconds.
If the tuple times out, Storm will call the fail method on the spout; likewise, in case of success, the ack method will be called.
So at this point Storm will let you know which tuples it has failed to process, but if you look into the source code you will see that the implementation of the fail method is empty in the BaseRichSpout class, so you need to override BaseRichSpout's fail method in order to have replay capability in your application.
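A minimal sketch of such a spout (Storm 2.x-style signatures); the pending map, the toEmit queue and the field names are illustrative, not part of the original answer:
public class ReplayingSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<Object, Values> pending = new ConcurrentHashMap<>();   // emitted, not yet acked
    private final Queue<Values> toEmit = new ConcurrentLinkedQueue<>();      // fed by your real source

    @Override
    public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        Values values = toEmit.poll();
        if (values == null) {
            return;
        }
        Object msgId = UUID.randomUUID().toString();
        pending.put(msgId, values);          // remember the tuple until it is acked
        collector.emit(values, msgId);       // anchored emit, as in the snippet above
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);               // fully processed, forget it
    }

    @Override
    public void fail(Object msgId) {
        Values values = pending.get(msgId);
        if (values != null) {
            collector.emit(values, msgId);   // replay only the failed tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("field1", "field2", "field3"));
    }
}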
Such replays of failed tuples should represent only a tiny proportion of the overall tuple traffic, so the efficiency of this simple replay-from-start policy is usually not a concern.
Supporting a "replay-from-error-step" would bring lots of complexity, since the location of an error is sometimes hard to determine and there would be a need to support "replay-elsewhere" in case the cluster node where the error happened is currently (or permanently) down. It would also slow down the execution of the whole traffic, which would probably not be compensated by the efficiency gained on error handling (which, again, is assumed to be triggered rarely).
If you think this replay-from-start strategy would negatively impact your topology, try to break it down into several smaller topologies separated by some persistent queuing system like Kafka.
