I am trying to achieve exactly-once consumption with a Kafka consumer.
My requirements are:
Read data from Topic
Process the data [which involves calling another API]
Write the response back to Kafka
I wanted to know whether exactly-once is possible in this scenario.
I know that this use case is covered by the Kafka Streams API, but I wanted to know whether it can be done with the Producer/Consumer API. Also, suppose the consumer fails for some reason after processing the data (the processing should be done only once): what would be the best way to handle such cases? Can there be any continuation/checkpoint for such cases?
I understand that the Kafka Streams API is consume-process-produce transactional. Here too, if the consumer crashes after calling the API, the flow would start again from the very beginning, right?
Yes; Spring for Apache Kafka supports exactly once semantics in the same way as Kafka Streams.
See
https://docs.spring.io/spring-kafka/docs/current/reference/html/#exactly-once
and
https://docs.spring.io/spring-kafka/docs/current/reference/html/#transactions
Bear in mind that "exactly once" means that the entire successful
consume -> process -> produce
is performed once. But, if the produce step fails (rolling back the transaction), then the consume -> process part is "at least once".
Therefore, you need to make the process part idempotent.
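A minimal sketch of what an idempotent process step could look like, assuming every record can be given a stable ID (for example topic-partition-offset) and that results are kept in a store that survives restarts. `IdempotentProcessor`, the in-memory map, and the `callApi` function are hypothetical stand-ins for illustration; none of them is part of Spring for Apache Kafka.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical helper: remembers the result for each record ID so that a
// redelivered record does not trigger the external call a second time.
class IdempotentProcessor {
    // Stand-in for a durable store such as a database table (assumption).
    private final Map<String, String> results = new ConcurrentHashMap<>();
    // Stand-in for the external API call made during processing (assumption).
    private final Function<String, String> callApi;

    IdempotentProcessor(Function<String, String> callApi) {
        this.callApi = callApi;
    }

    // recordId could be "topic-partition-offset"; payload is the record value.
    // The API is invoked only the first time a given recordId is seen.
    String process(String recordId, String payload) {
        return results.computeIfAbsent(recordId, id -> callApi.apply(payload));
    }

    public static void main(String[] args) {
        AtomicInteger apiCalls = new AtomicInteger();
        IdempotentProcessor processor = new IdempotentProcessor(value -> {
            apiCalls.incrementAndGet();
            return value.toUpperCase();
        });
        processor.process("topic-0-42", "hello");
        processor.process("topic-0-42", "hello"); // simulated redelivery
        System.out.println("API calls made: " + apiCalls.get());
    }
}
```

In a real system the store would have to be durable, and a crash between the API call and the store write could still repeat the call, so it helps if the API itself accepts an idempotency key.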
Hi I am working with akka streams along with akka-stream-kafka. I am setting up a Stream with the below setup:
Source (Kafka) --> | Akka Actor Flow | --> Sink (MongoDB)
The Actor Flow is basically made up of Actors that process the data; the hierarchy is below:
                    System
                      |
                 Master Actor
                 /          \
     URLTypeHandler          SerializedTypeHandler
      /          \                    |
Type1Handler  Type2Handler     SomeOtherHandler
Kafka has the messages; I wrote the consumer, run it with the atMostOnceSource configuration, and use:
Consumer.Control control =
    Consumer.atMostOnceSource(consumerSettings, Subscriptions.topics(TOPIC))
        .mapAsyncUnordered(10, record -> processAccessLog(rootHandler, record.value()))
        .to(Sink.foreach(it -> System.out.println("FinalReturnedString--> " + it)))
        .run(materializer);
I've used a print as the sink initially, just to get the flow running.
The processAccessLog method is defined as:
private static CompletionStage<String> processAccessLog(ActorRef handler, byte[] value) {
    handler.tell(value, ActorRef.noSender());
    return CompletableFuture.completedFuture("");
}
Now, by definition, ask must be used when an actor expects a response, which makes sense in this case since I want to return values to be written to the sink.
But everyone (including the docs) says to avoid ask and to use tell and forward instead; there is an amazing blog post on this, Don't Ask, Tell.
In the blog post the author says that, with nested actors, you should use tell for the first message, then use forward so the message reaches its destination, and after processing send the message directly back to the root actor.
Now here is the problem,
How do I send the message from D back to A, such that I can still use the sink?
Is it good practice to have open-ended streams, e.g. streams where the Sink doesn't matter because the actors have already done the job? (I don't think it is recommended to do so; it seems flawed.)
ask is Still the Right Pattern
From the linked blog article, one "drawback" of ask is:
blocking an actor itself, which cannot pick any new messages until the
response arrives and processing finishes.
However, in akka-stream this is the exact feature we are looking for, a.k.a. "back-pressure". If the Flow or Sink are taking a long time to process data then we want the Source to slow down.
As a side note, I think the claim in the blog post that the additional listener Actor results in an implementation that is "dozens times heavier" is an exaggeration. Obviously an intermediate Actor adds some latency overhead but not 12x more.
Elimination of Back-Pressure
Any implementation of what you are looking for would effectively eliminate back-pressure. An intermediate Flow that only used tell would continuously propagate demand back to the Source regardless of whether or not your processing logic, within the handler Actors, was completing its calculations at the same speed that the Source is generating data.
Consider an extreme example: what if your Source could produce 1 million messages per second but the Actor receiving those messages via tell could only process 1 message per second. What would happen to that Actor's mailbox?
By using the ask pattern in an intermediate Flow you are purposefully linking the speed of the handlers and the speed with which your Source produces data.
If you are willing to remove back-pressure signaling, from the Sink to the Source, then you might as well not use akka-stream in the first place. You can have either back-pressure or non-blocking messaging, but not both.
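The mailbox question above can be made concrete with a toy simulation. This is plain Java, not akka code: `MailboxDemo` and `peakMailbox` are made-up names, time is modelled as discrete ticks, and a cap on in-flight messages is used as a rough stand-in for what ask inside `mapAsync`/`mapAsyncUnordered` achieves (the cap of 10 mirrors the parallelism used in the question). An unbounded cap models fire-and-forget tell.

```java
// Toy model (assumption, not akka): compares an uncapped mailbox with one
// where the source may only have maxInFlight unprocessed messages outstanding.
class MailboxDemo {
    // Each tick the source offers produceRate messages and the handler
    // finishes up to processRate of them. Returns the largest mailbox size seen.
    static int peakMailbox(int ticks, int produceRate, int processRate, int maxInFlight) {
        int mailbox = 0;
        int peak = 0;
        for (int t = 0; t < ticks; t++) {
            // Back-pressure: the source is only allowed to fill the remaining capacity.
            int accepted = Math.min(produceRate, maxInFlight - mailbox);
            mailbox += accepted;
            peak = Math.max(peak, mailbox);
            // The handler drains what it can during this tick.
            mailbox -= Math.min(processRate, mailbox);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Fire-and-forget: nothing limits what the source pushes into the mailbox.
        System.out.println("tell-like peak: " + peakMailbox(10, 1000, 1, Integer.MAX_VALUE));
        // Capped in-flight messages: the source is slowed to the handler's pace.
        System.out.println("ask-like peak:  " + peakMailbox(10, 1000, 1, 10));
    }
}
```

With these toy numbers the uncapped case peaks at 9991 queued messages after ten ticks and keeps growing with more ticks, while the capped case never exceeds 10.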
Ramon J Romero y Vigil is right but I will try to extend the response.
1) I think that the "Don't ask, tell" dogma is mostly for Actor systems architecture. Here you need to return a Future so the stream can resolve the processed result, you have two options:
Use ask
Create an actor per event and pass it a Promise, so that a Future is completed when this actor receives the data (you can use the getSender method so D can send the response to A). There is no way to send a Promise or Future in a message (they are not serialisable), so the creation of these short-lived actors cannot be avoided.
In the end you are doing mostly the same thing...
2) It's perfectly fine to use an empty Sink to finalise the stream (indeed akka provides the Sink.ignore() method to do so).
It seems like you are missing the reason for using streams: they are a nice abstraction that provides composability, concurrency and back-pressure. Actors, on the other hand, cannot be composed, and handling back-pressure with them is hard. If you don't need these features and your actors can get the work done easily, you shouldn't use akka-streams in the first place.
I am trying to write a custom receiver for Structured Streaming that will consume messages from RabbitMQ.
Spark recently released DataSource V2 API, which seems very promising. Since it abstracts away many details, I want to use this API for the sake of both simplicity and performance. However, since it's quite new, there are not many sources available. I need some clarification from experienced Spark guys, since they will grasp the key points easier. Here we go:
My starting point is the blog post series, with the first part here. It shows how to implement a data source, without streaming capability. To make a streaming source, I slightly changed them, since I need to implement MicroBatchReadSupport instead of (or in addition to) DataSourceV2.
To be efficient, it's wise to have multiple spark executors consuming RabbitMQ concurrently, i.e. from the same queue. If I'm not confused, every partition of the input -in Spark's terminology- corresponds to a consumer from the queue -in RabbitMQ terminology. Thus, we need to have multiple partitions for the input stream, right?
Similar to part 4 of the series, I implemented MicroBatchReader as follows:
@Override
public List<DataReaderFactory<Row>> createDataReaderFactories() {
    int partition = options.getInt(RMQ.PARTITICN, 5);
    List<DataReaderFactory<Row>> factories = new LinkedList<>();
    for (int i = 0; i < partition; i++) {
        factories.add(new RMQDataReaderFactory(options));
    }
    return factories;
}
I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
I want my receiver to be reliable, i.e. after every message is processed (or at least written to the checkpoint directory for further processing), I need to ack it back to RabbitMQ. The problem starts here: these factories are created at the driver, and the actual reading takes place at the executors through DataReaders. However, the commit method is part of MicroBatchReader, not DataReader. Since I have many DataReaders per MicroBatchReader, how should I ack these messages back to RabbitMQ? Or should I ack when the next method is called on DataReader? Is it safe? If so, what is the purpose of the commit function then?
OBFUSCATION: The link provided in the answer about the renaming of some classes/functions (in addition to the explanations there) made everything worse than ever. Quoting from there:
Renames:
DataReaderFactory to InputPartition
DataReader to InputPartitionReader
...
InputPartition's purpose is to manage the lifecycle of the
associated reader, which is now called InputPartitionReader, with an
explicit create operation to mirror the close operation. This was no
longer clear from the API because DataReaderFactory appeared to be more
generic than it is and it isn't clear why a set of them is produced for
a read.
EDIT: However, the docs clearly say that "the reader factory will be serialized and sent to executors, then the data reader will be created on executors and do the actual reading."
To make the consumer reliable, I have to ACK a particular message only after it is committed on the Spark side. Note that messages have to be ACKed on the same connection they were delivered through, but the commit function is called at the driver node. How can I commit at the worker/executor node?
> I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
The [socket][1] source implementation has one thread pushing messages into the internal ListBuffer. In other words, there is one consumer (the thread) filling up the internal ListBuffer, which is **then** divided up into partitions by `planInputPartitions` (`createDataReaderFactories` got [renamed][2] to `planInputPartitions`).
Also, according to the Javadoc of [MicroBatchReadSupport][3]
> The execution engine will create a micro-batch reader at the start of a streaming query, alternate calls to setOffsetRange and createDataReaderFactories for each batch to process, and then call stop() when the execution is complete. Note that a single query may have multiple executions due to restart or failure recovery.
In other words, the `createDataReaderFactories` should be called **multiple** times, which to my understanding suggests that each `DataReader` is responsible for a static input partition, which implies that the DataReader shouldn't be a consumer.
----------
> However, the commit method is a part of MicroBatchReader, not DataReader ... If so, what is the purpose of commit function then?
Perhaps part of the rationale for the commit function is to prevent the internal buffer of the MicroBatchReader from getting too big. By committing an Offset, you can effectively remove elements less than the Offset from the buffer, as you are making a commitment to not process them anymore. You can see this happening in the socket source code with `batches.trimStart(offsetDiff)`.
I'm unsure about implementing a reliable receiver, so I hope a more experienced Spark guy comes around and grabs your question as I'm interested too!
Hope this helps!
EDIT
I had only studied the socket and wiki-edit sources. These sources are not production ready, which is not what the question was looking for. Instead, the Kafka source is the better starting point; unlike the aforementioned sources, it has multiple consumers, as the author wanted.
However, if you're looking for unreliable sources, the socket and wiki-edit sources above provide a less complicated solution.
I'm mostly using Kafka for traditional messaging but I'd also like the ability to consume small topics in a batch fashion, i.e. connect to a topic, consume all the messages and immediately disconnect (not block waiting for new messages). All my topics have a single partition (though they are replicated across a cluster) and I'd like to use the high-level consumer if possible. It's not clear from the docs how I could accomplish such a thing in Scala (or Java). Any advice gratefully received.
The consumer.timeout.ms setting makes the consumer throw a timeout exception if no message is consumed within the specified time; as far as I know, this is the only option you have with the high-level consumer. You could set it to something like 1 second and disconnect once the exception is thrown, if that is an acceptable solution.
If not, you'd have to use the simple consumer and check message offsets.
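The "read until nothing arrives for a while, then disconnect" idea can be sketched without any Kafka classes. The snippet below is only an analogy: a `BlockingQueue` stands in for the consumer's message stream, and a `null` from `poll` plays the role of the timeout exception thrown by the high-level consumer. `BatchDrain` and `drainUntilIdle` are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Analogy only: the queue is a stand-in for a topic's message stream.
class BatchDrain {
    // Collects messages until none arrives within idleMillis, then returns.
    static List<String> drainUntilIdle(BlockingQueue<String> source, long idleMillis) {
        List<String> batch = new ArrayList<>();
        try {
            String message;
            // poll returns null when the wait times out, which ends the batch.
            while ((message = source.poll(idleMillis, TimeUnit.MILLISECONDS)) != null) {
                batch.add(message);
            }
        } catch (InterruptedException e) {
            // Preserve the interrupt and return what was read so far.
            Thread.currentThread().interrupt();
        }
        return batch;
    }

    public static void main(String[] args) {
        BlockingQueue<String> topic = new LinkedBlockingQueue<>(Arrays.asList("m1", "m2", "m3"));
        List<String> batch = drainUntilIdle(topic, 100);
        System.out.println("Read " + batch.size() + " messages, now disconnecting");
    }
}
```

The idle timeout is a trade-off: too short and a slow broker response ends the batch early, too long and every batch run waits that much extra before disconnecting.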
I have been studying Apache Kafka for a month now, and I am stuck at one point. My use case is that I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while these messages were being processed, I killed one of the consumer processes and restarted it. The consumers were writing processed messages to a file. After consumption finished, the file contained more than 10k messages, so some messages were duplicated.
In the consumer process I have disabled auto commit. Consumers commit offsets manually, in batches: for example, once 100 messages have been written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even to attempt to prevent duplicates, you would need to use the simple consumer. The approach works like this: for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some use cases, though, where exactly-once processing is more attainable. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
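The counting example can be sketched as follows, assuming a single partition so that offsets are strictly increasing. `OffsetTrackingCounter` is a hypothetical class, and the in-memory `AtomicReference` only stands in for whatever durable store would hold the count and the offset together (for instance one database row updated in one transaction).

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the count and the last applied offset are stored as a
// single value, so one can never be updated without the other.
class OffsetTrackingCounter {
    static final class State {
        final long count;
        final long lastOffset;

        State(long count, long lastOffset) {
            this.count = count;
            this.lastOffset = lastOffset;
        }
    }

    // Stand-in for a durable store (assumption); -1 means nothing applied yet.
    private final AtomicReference<State> state = new AtomicReference<>(new State(0, -1));

    // Counts the message unless its offset was already applied.
    // Returns true if the message was counted, false if it was a replay.
    boolean onMessage(long offset) {
        while (true) {
            State current = state.get();
            if (offset <= current.lastOffset) {
                return false; // already reflected in the count, e.g. after a restart
            }
            State next = new State(current.count + 1, offset);
            if (state.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    State snapshot() {
        return state.get();
    }

    public static void main(String[] args) {
        OffsetTrackingCounter counter = new OffsetTrackingCounter();
        long[] deliveries = {0, 1, 2, 1, 2, 3}; // offsets 1 and 2 are redelivered
        for (long offset : deliveries) {
            counter.onMessage(offset);
        }
        State s = counter.snapshot();
        System.out.println("count=" + s.count + ", lastOffset=" + s.lastOffset);
    }
}
```

On restart, a consumer built this way would read `lastOffset` from the store and resume from the next offset; any message redelivered in the meantime is simply skipped.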
Usually your time will be better spent and your application will be much more reliable if you simply design it to be idempotent.
This is what Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe's suggestion to deduplicate on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer and is guaranteed to be unique. We use a random string of length 12 (regexp '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // eg. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // eg. JedisPool.getResource();
String key = "SPOUT:" + msg.uniqId; // prefix name at will
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val);
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // and other logic
} else {
    jedis.expire(key, 7200); // 2 hours is ok for production environment;
}
The above code detected duplicate messages several times when Kafka (version 0.8.x) ran into problems. According to our input/output balance audit log, no messages were lost or duplicated.
There's a relatively new 'Transactional API' now in Kafka that can allow you to achieve exactly once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
Whatever is done on the producer side, we believe the best way to get exactly-once delivery from Kafka is still to handle it on the consumer side:
Produce each message into topic T1 with a UUID as the Kafka message key
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key
Read it back from HBase with the same row key and write it to another topic T2
Have your end consumers actually consume from topic T2
I'm trying to understand the best way to coalesce or chunk incoming messages in RabbitMQ (using Spring AMQP or the Java client directly).
In other words, I would like to take, say, 100 incoming messages, combine them into 1, and resend it to another queue in a reliable (correctly ACKed) way. I believe this is called the aggregator pattern in EIP.
I know Spring Integration provides an aggregator solution, but the implementation looks like it is not fail-safe (that is, it looks like it has to ack and consume messages to build the coalesced message, so if you shut it down while it is doing this you will lose messages?).
I can't comment directly on the Spring Integration library, so I'll speak generally in terms of RabbitMQ.
If you're not 100% convinced by the Spring Integration implementation of the Aggregator and are going to try to implement it yourself then I would recommend avoiding using tx which uses transactions under the hood in RabbitMQ.
Transactions in RabbitMQ are slow and you will definitely suffer performance problems if you're building a high traffic/throughput system.
Rather I would suggest you take a look at Publisher Confirms which is an extension to AMQP implemented in RabbitMQ. Here is an introduction to it when it was new http://www.rabbitmq.com/blog/2011/02/10/introducing-publisher-confirms/.
You will need to tweak the prefetch setting to get the performance right, take a look at http://www.rabbitmq.com/blog/2012/05/11/some-queuing-theory-throughput-latency-and-bandwidth/ for some details.
All the above gives you some background to help solve your problem. The implementation is rather straightforward.
When creating your consumer you will need to ensure you set it so that ACK is required.
1. Dequeue n messages, as you dequeue you will need to make note of the DeliveryTag for each message (this is used to ACK the message)
2. Aggregate the messages into a new message
3. Publish the new message
4. ACK each dequeued message
One thing to note is that if your consumer dies after 3 and before 4 has completed then those messages that weren't ACK'd will be reprocessed when it comes back to life.
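The dequeue, aggregate, publish, ACK sequence can be sketched as below. This is a broker-agnostic outline, not RabbitMQ client code: `Delivery` and `Broker` are hypothetical types introduced for illustration, with `publishAndConfirm` standing for "publish and wait for the publisher confirm" and `ack` standing for acknowledging one delivery tag.

```java
import java.util.ArrayList;
import java.util.List;

class Aggregator {
    // Hypothetical view of a consumed message: its delivery tag plus its body.
    static final class Delivery {
        final long deliveryTag;
        final String body;

        Delivery(long deliveryTag, String body) {
            this.deliveryTag = deliveryTag;
            this.body = body;
        }
    }

    // Hypothetical broker abstraction (assumption, not a real client API).
    interface Broker {
        boolean publishAndConfirm(String body); // true once the broker confirmed the message
        void ack(long deliveryTag);
    }

    // Combines the batch into one message, publishes it, and ACKs the inputs
    // only after the publish was confirmed. Returns false if nothing was ACKed.
    static boolean aggregateAndForward(List<Delivery> batch, Broker broker) {
        List<Long> tags = new ArrayList<>();
        List<String> bodies = new ArrayList<>();
        for (Delivery delivery : batch) {
            tags.add(delivery.deliveryTag); // remembered for the ACK step
            bodies.add(delivery.body);
        }
        String combined = String.join("\n", bodies);

        if (!broker.publishAndConfirm(combined)) {
            // Leave the inputs un-ACKed so the broker can redeliver them.
            return false;
        }
        for (long tag : tags) {
            broker.ack(tag);
        }
        return true;
    }
}
```

Because the ACKs happen only after the confirm, a crash in between leads to redelivery and a possible duplicate aggregate rather than to lost messages, so the downstream consumer should tolerate duplicates.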
If you set the <amqp-inbound-channel-adapter/> tx-size attribute to 100, the container will ack every 100 messages so this should prevent message loss.
However, you might want to make the send of the aggregated message (on the 100th receive) transactional so you can confirm the broker has the message before the ack for the inbound messages.