As mentioned in the answer,
A message queue is a one-way pipe: one process writes to the queue, and another reads the data in the order
SysV message queue is one example
So, my understanding is,
one message queue is used by two processes, where one process (the producer) inserts an item into the queue and another process (the consumer) consumes the item from the queue
1) Is a RabbitMQ or Kafka message queue a 1:1 messaging system, used by only two processes, where one process writes and the other process reads?
2) After the consumer consumes the item, does the item get deleted? If not, why do we need a queue data structure at all? Why not just shared memory?
Kafka is not strictly a 1:1 messaging system. Multiple producers can write into a topic and multiple consumers can read from it. Moreover, in Kafka, multiple consumers can be assigned the same or different consumer groups. Every message is consumed by only one consumer from each consumer group (load balancing), and every consumer group receives a copy of every message (provided, of course, that it is subscribed to the corresponding topics and no messages are lost). A good description of this process can be found in this article: Scalability of Kafka Messaging using Consumer Groups.
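For illustration, here is a minimal sketch (not from the answer; the topic name, group IDs and broker address are made up) of how group.id drives this behaviour with the plain Java consumer:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupDemo {
    public static void main(String[] args) {
        String groupId = args.length > 0 ? args[0] : "group-A";
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing a group.id split the topic's partitions between them
        // (load balancing); every distinct group.id gets its own copy of each message.
        props.put("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("group=%s partition=%d offset=%d value=%s%n",
                            groupId, record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}

Run two instances with "group-A" and one with "group-B": the two "group-A" instances share the partitions between them, while "group-B" independently receives every message.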
In Kafka all messages are persisted on disk and kept until compaction reaps them, the retention.ms period passes, or the maximum log size is exceeded. That's a very high-level view and there are a lot of nuances here. For example: messages are stored in segments, and every segment contains multiple messages. When the retention period passes for a message, it is not removed from the segment at that moment; instead, Kafka waits until all messages in that segment have expired and deletes the whole segment at once. Also, retention can kick in before the log exceeds its maximum size, or vice versa: the log can exceed the size limit before the retention period passes. And so on. Just read the docs and pay attention to the sections about the "log cleaner" and "retention".
After a Kafka consumer reads a message, the message is neither compacted nor expired. So it is not removed from the log and stays there. That also means every message can be re-read by a consumer if needed (until it is deleted completely). This is useful if some of your consumers went offline for some reason and were not able to process the messages as they came in. It also enables interesting features such as transaction replays. Persistence is one of Kafka's defining features.
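Because consumed messages stay in the log, a consumer can rewind and read them again. A rough sketch (the topic, partition and broker address are made up; this only works while the data is still within the retention window):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "replay-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(partition));
            // Rewind to the start of the partition (or consumer.seek(partition, offset)
            // for a specific position) and re-read messages that were consumed before.
            consumer.seekToBeginning(Collections.singletonList(partition));
            ConsumerRecords<String, String> replayed = consumer.poll(Duration.ofSeconds(1));
            System.out.println("replayed " + replayed.count() + " records");
        }
    }
}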
Shared memory? Well, strictly speaking, shared memory only works between processes running on the same host, and there is absolutely no way to have "shared memory" when your app runs on multiple hosts. There are, however, in-memory brokers: Redis, for example, can be used as a message broker, and it keeps everything in memory. But if such a broker restarts for some reason, you lose everything. Speaking of Redis: it has two persistence configurations specifically to handle restarts.
I am not sure about RabbitMQ, but by default it probably deletes messages after the consumer has acknowledged them, so it is closer to the 1:1 mental model. However, RabbitMQ employs disk persistence as well.
Related
I am building a high-volume system that will be processing up to a hundred million messages every day. I have a microservice that reads from a Kafka topic and does some basic processing on the messages before forwarding them to the next microservice.
Kafka Topic -> Normalizer Microservice -> Ordering Microservice
Below is what the processing would look like:
The Normalizer would concurrently pick up messages from the Kafka topic.
The Normalizer would read the messages from the topic and post them to an in-memory seda queue, from where each message would subsequently be picked up, normalized and validated.
This normalization, validation and processing is expected to take around 1 second per message. Within this one second, the message will be stored in the database and become persistent in the system.
My concern is that during this processing, if a message has been already read from the topic and posted to the seda queue and has either
not yet been picked up from the seda queue or,
has been picked up from the seda queue, is currently being processed, and has not yet been persisted to the database,
and the Normalizer JVM crashes or is force-killed (kill -9), how do I ensure that I do NOT lose the message?
It is critical that I do NOT drop/lose any messages and even in case of a crash/failure, I should be able to retain the message such that I can trigger re-processing of that message if required.
One naïve approach that comes to mind is to push the message to a cache (which will be a very fast operation).
Read from topic -> Push to cache -> Push to seda queue
Needless to say, the problem still exists; this just makes it less probable that I will lose the message. It is also certainly not the smartest solution out there.
Please share your thoughts on how I can design this system such that I can preserve messages on my side once the messages have been read off of the Kafka topic even in the event of the Normalizer JVM crashing.
I have a problem where I need to prioritize some events to be processed earlier, and other events, let's say, after the high-priority ones. The events come from one source, and I need to prioritize the streams depending on their event type's priority so that they are forwarded to either the high-priority or the lower-priority sink. I'm using Kafka and Akka Kafka streams. The main problem is that I get a lot of traffic at a given point in time. What would be the preferred approach here?
The first thing to tackle is the offset commit. Because processing will not be in order, committing offsets after processing cannot guarantee at-least-once (nor can it guarantee at-most-once), because the following sequence is possible (and its probability cannot be reduced to zero):
Commit offset for high-priority message which has been processed before multiple low-priority messages have been processed
Stream fails (or instance running the stream is stopped, or whatever)
Stream restarts from last committed offset
The low-priority messages are never read from Kafka again, so never get processed
This then suggests that either the offset commit has to happen before the reordering, or we need a notion of processed-but-not-yet-committable that holds the commit back until the low-priority messages have been processed. For the latter option, tracking the greatest uncommitted offset (the simplest strategy that could possibly work) breaks down if anything can create gaps in the offset sequence, so it effectively requires infinite retention and no compaction. I'd therefore actually suggest committing the offsets before processing, but only once the processing logic has guaranteed that it will eventually process the message.
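A minimal sketch of that commit-before-processing idea with the plain Java consumer (this assumes enable.auto.commit=false; durablyEnqueue is a hypothetical stand-in for whatever guarantees the message will eventually be processed, such as the persistent actor described below):

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class CommitBeforeProcessing {

    // Hypothetical hand-off: durably record the message (persistent actor, local
    // database, ...) so that its eventual processing is guaranteed.
    static void durablyEnqueue(ConsumerRecord<String, String> record) {
        // persist record.topic()/partition()/offset()/value() somewhere durable
    }

    public static void ingest(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(Collections.singletonList("events"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                durablyEnqueue(record);
            }
            // Commit only once the hand-off is durable; the actual (possibly
            // reordered) processing happens later and asynchronously.
            consumer.commitSync();
        }
    }
}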
A combination of actors and Akka Persistence allows this approach to be taken. The rough outline is to have a persistent actor (this is a good fit for event sourcing) which basically maintains lists of high-priority and low-priority messages to process:
The stream sends an "ask" with the message from Kafka to the actor, which on receipt classifies the message as high- or low-priority, assuming the message hasn't already been processed.
The message (and perhaps its classification) is persisted as an event; the actor then acknowledges receipt of the message and commits to processing it by scheduling a message to itself to fully process a "to-process" message.
The acknowledgement completes the ask, allowing the offset to be committed to Kafka.
On receipt of the message (a command, really) to process a message, the actor chooses which Kafka message to process (by priority, age, etc.), persists the fact that it has processed that message (thus moving it from "to-process" to "processed"), and potentially also persists an event updating state relevant to how it interprets Kafka messages.
After this persistence, the actor sends itself another command to process a "to-process" message.
Fault tolerance is then achieved by having a background process periodically ping this actor with the "process a to-process message" command.
As with the stream, this is a single-logical-thread-per-partition process. It's possible that you are multiplexing many partitions' worth of state onto each physical Kafka partition, in which case you can have multiple of these actors and send multiple asks from the ingest stream. If you do this, the periodic ping is probably best accomplished by a stream fed by an Akka Persistence Query that retrieves the identifiers of all the persistent actors.
Note that the reordering in this problem makes it fundamentally a race and thus non-deterministic: in this design sketch, the race arises because messages M1 (from actor B) and M2 (from actor C) sent to actor A may be received in either order (if actor B sent a message M3 to actor A after it sent M1, then M3 would arrive after M1 but could arrive before or after M2). In a different design, the race could instead depend on processing speed relative to the latency with which Kafka makes a message available for consumption.
I have a problem: in my queue there are messages from different source systems.
For example: in the first message the source system name is 'X', and in the second one the source system name is 'Y'.
Currently I have a JMS listener with the concurrency level set to 1, so all messages are processed one by one, as expected. Now I want to process messages concurrently, such that only one message is processed at a time for any given source system, while messages from different source systems are processed in parallel.
Source systems are created dynamically, which is why I can't have a separate queue and consumer per source system.
It would be great if someone could push me in the right direction.
It sounds like your problem is about maintaining ordered delivery for messages originating from a given source, while still being able to process messages from different sources in parallel.
You can do this using message groups.
The broker allows messaging applications to classify a set of related
messages as belonging to a group. This allows a message producer to
indicate to the consumer that a group of messages should be considered
a single logical operation with respect to the application.
To make this work, have the producer systems set the JMSXGroupID header to the producer system name:
Message message = session.createTextMessage("<message />");
message.setStringProperty("JMSXGroupID", "SourceSystem1Name");
Then the broker will enforce consumption ordering among messages belonging to that group.
Addendum
There might be N source systems, as they are created
dynamically, and there is a producer which puts all the messages from
these N source systems onto the queue
So the message producer can set the JMSXGroupID header to the name of the source system.
The problem I'm facing is: if a message for one source system is
being processed, messages from other source systems have to wait until
the processing of that message completes
So once the group header is set as described, the broker will ensure that it releases messages for a given source system to the consumer only sequentially. It does this by requiring the consumer to acknowledge that the previous message has been processed before it releases the next message in the group.
So by setting the concurrency to an appropriate value, the consumer can process messages from different source systems in parallel, but is forced to process messages from any given source system sequentially, which is the behaviour you require.
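For example, if the listener happens to be a Spring @JmsListener (an assumption; the queue name is made up, and any listener container that lets you raise concurrency works the same way), it could look like:

import javax.jms.JMSException;
import javax.jms.Message;
import org.springframework.jms.annotation.JmsListener;
import org.springframework.stereotype.Component;

@Component
public class SourceSystemListener {

    // Up to 5 concurrent consumers: the broker can dispatch different JMSXGroupID
    // groups (i.e. different source systems) to different consumers in parallel,
    // while messages within one group are still delivered one at a time, in order.
    @JmsListener(destination = "MY_QUEUE", concurrency = "1-5")
    public void onMessage(Message message) throws JMSException {
        // process the message for this source system
    }
}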
Because the selector is declared on the consumer side, you cannot (to my knowledge) change it dynamically based on another variable.
The solution I would suggest is to use selectors and use two consumer connections like the following:
The consumer endpoint for parallel processing (5 concurrent consumers in this case) will look like:
activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=5&selector=MySourceHeader<>'X'
The consumer endpoint for sequential processing will look like:
activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=1&selector=MySourceHeader='X'
It would also be good to define a fetch size of one for the queue (though I am not sure it is mandatory).
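Wired up as Camel routes, the two endpoints above might look roughly like this (a sketch; the bean names are made up, and depending on your Camel version the selector value may need to be URL-encoded):

import org.apache.camel.builder.RouteBuilder;

public class SourceSystemRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Messages from every source system except 'X': up to 5 parallel consumers.
        from("activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=5&selector=MySourceHeader<>'X'")
            .to("bean:parallelProcessor");

        // Messages from source system 'X': a single consumer, strictly sequential.
        from("activemq:queue:MY_QUEUE_UNIQUE?concurrentConsumers=1&selector=MySourceHeader='X'")
            .to("bean:sequentialProcessor");
    }
}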
I have been studying Apache Kafka for a month now. However, I am now stuck at a point. My use case is: I have two or more consumer processes running on different machines. I ran a few tests in which I published 10,000 messages to the Kafka server. Then, while these messages were being processed, I killed one of the consumer processes and restarted it. The consumers were writing the processed messages to a file. So after consumption finished, the file showed more than 10k messages; some messages were duplicated.
In the consumer processes I have disabled auto-commit. The consumers manually commit offsets batch-wise, so, for example, once 100 messages are written to the file, the consumer commits the offsets. When a single consumer process is running and it crashes and recovers, duplication is avoided in this manner. But when more than one consumer is running and one of them crashes and recovers, it writes duplicate messages to the file.
Is there any effective strategy to avoid these duplicate messages?
The short answer is, no.
What you're looking for is exactly-once processing. While it may often seem feasible, it should never be relied upon because there are always caveats.
Even to attempt to prevent duplicates you would need to use the simple consumer. How this approach works: for each consumer, when a message is consumed from some partition, write the partition and offset of the consumed message to disk. When the consumer restarts after a failure, read the last consumed offset for each partition from disk.
But even with this pattern the consumer can't guarantee it won't reprocess a message after a failure. What if the consumer consumes a message and then fails before the offset is flushed to disk? If you write to disk before you process the message, what if you write the offset and then fail before actually processing the message? This same problem would exist even if you were to commit offsets to ZooKeeper after every message.
There are some cases, though, where exactly-once processing is more attainable, but only for certain use cases. This simply requires that your offset be stored in the same location as your application's output. For instance, if you write a consumer that counts messages, by storing the last counted offset with each count you can guarantee that the offset is stored at the same time as the consumer's state. Of course, in order to guarantee exactly-once processing this would require that you consume exactly one message and update the state exactly once for each message, and that's completely impractical for most Kafka consumer applications. By its nature Kafka consumes messages in batches for performance reasons.
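A hedged sketch of that idea using JDBC (the table and column names are made up): the consumer's output and the consumed offset are written in the same database transaction, and on restart the consumer would read the offsets table back and seek() to last_offset + 1 for each partition instead of relying on Kafka's committed offsets:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class TransactionalSink {

    // Write the processed result and the source offset in ONE database transaction,
    // so either both are persisted or neither is.
    public void store(Connection db, ConsumerRecord<String, String> record, long count)
            throws SQLException {
        db.setAutoCommit(false);
        try (PreparedStatement upsertCount = db.prepareStatement(
                     "UPDATE message_counts SET total = ? WHERE id = 1");
             PreparedStatement upsertOffset = db.prepareStatement(
                     "UPDATE consumer_offsets SET last_offset = ? "
                   + "WHERE topic = ? AND partition_no = ?")) {
            upsertCount.setLong(1, count);
            upsertCount.executeUpdate();
            upsertOffset.setLong(1, record.offset());
            upsertOffset.setString(2, record.topic());
            upsertOffset.setInt(3, record.partition());
            upsertOffset.executeUpdate();
            db.commit(); // output and offset become visible atomically
        } catch (SQLException e) {
            db.rollback();
            throw e;
        }
    }
}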
Usually your time will be better spent, and your application will be much more reliable, if you simply design it to be idempotent.
This is what the Kafka FAQ has to say on the subject of exactly-once:
How do I get exactly-once messaging from Kafka?
Exactly once semantics has two parts: avoiding duplication during data production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data production:
Use a single-writer per partition and every time you get a network error check the last message in that partition to see if your last write succeeded
Include a primary key (UUID or something) in the message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be duplicate-free. However, reading without duplicates depends on some co-operation from the consumer too. If the consumer is periodically checkpointing its position then if it fails and restarts it will restart from the checkpointed position. Thus if the data output and the checkpoint are not written atomically it will be possible to get duplicates here as well. This problem is particular to your storage system. For example, if you are using a database you could commit these together in a transaction. The HDFS loader Camus that LinkedIn wrote does something like this for Hadoop loads. The other alternative that doesn't require a transaction is to store the offset with the data loaded and deduplicate using the topic/partition/offset combination.
I think there are two improvements that would make this a lot easier:
Producer idempotence could be done automatically and much more cheaply by optionally integrating support for this on the server.
The existing high-level consumer doesn't expose a lot of the more fine grained control of offsets (e.g. to reset your position). We will be working on that soon
I agree with RaGe: deduplicate on the consumer side. We use Redis to deduplicate Kafka messages.
Assume the Message class has a member called 'uniqId', which is filled in by the producer side and is guaranteed to be unique. We use a 12-character random string (regexp: '^[A-Za-z0-9]{12}$').
The consumer side uses Redis's SETNX to deduplicate and EXPIRE to purge expired keys automatically. Sample code:
Message msg = ... // e.g. ConsumerIterator.next().message().fromJson();
Jedis jedis = ... // e.g. JedisPool.getResource();

String key = "SPOUT:" + msg.uniqId; // prefix the key however you like
String val = Long.toString(System.currentTimeMillis());
long rsps = jedis.setnx(key, val); // 1 if the key was newly set, 0 if it already existed
if (rsps <= 0) {
    log.warn("kafka dup: {}", msg.toJson()); // duplicate detected; add other handling here
} else {
    jedis.expire(key, 7200); // 2 hours is OK for a production environment
}
The code above did detect duplicate messages several times when Kafka (version 0.8.x) ran into trouble. According to our input/output balance audit log, no messages were lost or duplicated.
There is now a relatively new 'Transactional API' in Kafka that can allow you to achieve exactly-once processing when processing a stream. With the transactional API, idempotency can be built in, as long as the remainder of your system is designed for idempotency. See https://www.baeldung.com/kafka-exactly-once
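A rough sketch of a consume-transform-produce loop built on that API (the topic names, group id and transactional.id are placeholders, and consumer.groupMetadata() needs a reasonably recent Kafka client):

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class ExactlyOnceRelay {
    public static void main(String[] args) {
        Properties cp = new Properties();
        cp.put("bootstrap.servers", "localhost:9092");
        cp.put("group.id", "relay");
        cp.put("enable.auto.commit", "false");
        cp.put("isolation.level", "read_committed");
        cp.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cp.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pp = new Properties();
        pp.put("bootstrap.servers", "localhost:9092");
        pp.put("transactional.id", "relay-tx-1"); // enables idempotence + transactions
        pp.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pp.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cp);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pp)) {
            consumer.subscribe(Collections.singletonList("input-topic"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> record : records) {
                        producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
                        offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                    }
                    // Consumed offsets are committed atomically with the produced records.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction();
                }
            }
        }
    }
}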
Whatever is done on the producer side, we still believe the best way to deliver exactly once from Kafka is to handle it on the consumer side:
Produce the message with a UUID as the Kafka message key into topic T1
On the consumer side, read the message from T1 and write it to HBase with the UUID as the row key
Read it back from HBase with the same row key and write it to another topic T2
Have your end consumers actually consume from topic T2
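A rough HBase sketch of steps 2 and 3 (the table name 't1_dedup' and column family 'd' are made up; producing to T2 is left out). Writing the same UUID row key twice simply overwrites the same row, so a duplicate delivery of the same message collapses into a single row:

import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDedup {

    // Step 2: write the message under its UUID row key (idempotent overwrite).
    // Step 3: read the row back; whatever is read is the single copy forwarded to T2.
    public static String writeAndReadBack(Connection hbase, String uuid, String payload)
            throws IOException {
        try (Table table = hbase.getTable(TableName.valueOf("t1_dedup"))) {
            Put put = new Put(Bytes.toBytes(uuid));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes(payload));
            table.put(put);

            Result row = table.get(new Get(Bytes.toBytes(uuid)));
            return Bytes.toString(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("payload")));
        }
    }
}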
I have a JMS Queue that is populated at a very high rate ( > 100,000/sec ).
It can happen that there are multiple messages pertaining to the same entity every second as well (several updates to an entity, with each update arriving as a different message).
On the other end, I have one consumer that processes this message and sends it to other applications.
Now, the whole setup is slowing down because the consumer is not able to cope with the rate of incoming messages.
Since there is an SLA on the rate at which the consumer processes messages, I have been toying with the idea of having multiple consumers acting in parallel to speed up the process.
So, what I'm thinking of doing is:
Multiple consumers acting independently on the queue.
Each consumer is free to grab any message.
After grabbing a message, make sure it's the latest version of the entity. For this part, I can check with the application that processes this entity.
If it's not the latest, bump the version up and try again.
I have been looking through the integration patterns and the JMS docs so far without success.
I would welcome ideas to tackle this problem in a more elegant way along with any known APIs, patterns in Java world.
ActiveMQ solves this problem with a concept called "Message Groups". While it's not part of the JMS standard, several JMS-related products work similarly. The basic idea is that you assign each message to a "group", which indicates messages that are related and have to be processed in order. Then you set things up so that each group is delivered to only one consumer. Thus you get load balancing between groups but guaranteed in-order delivery within a group.
Most EIP frameworks and ESBs have customizable resequencers. If the number of entities is not too large you can have a queue per entity and resequence at the beginning.
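For illustration, a batch resequencer in Apache Camel might look roughly like this (the queue names and the 'entityVersion' header are assumptions):

import org.apache.camel.builder.RouteBuilder;

public class EntityResequenceRoute extends RouteBuilder {
    @Override
    public void configure() {
        // Buffer messages for up to one second and release them ordered by the
        // entity's version header, so downstream consumers see updates in order.
        from("jms:queue:ENTITY_UPDATES")
            .resequence(header("entityVersion")).batch().timeout(1000)
            .to("jms:queue:ENTITY_UPDATES_ORDERED");
    }
}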
For those interested in a way to solve this:
Use the Recipient List EIP
Since the question is about JMS, we can take a look at an example from the Apache Camel website.
This approach is different from other patterns like Content-Based Router (CBR) and Selective Consumer, because the consumer is not aware of which messages it should process.
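A bare-bones Camel version of the pattern (the 'recipientQueue' header name and the example target queue are assumptions; the producer decides the header's value):

import org.apache.camel.builder.RouteBuilder;

public class OrderEventRoutes extends RouteBuilder {
    @Override
    public void configure() {
        // Each event carries the endpoint of the consumer instance chosen for its
        // Order, e.g. "jms:queue:oms-consumer-3"; the route just honours it.
        from("jms:queue:Order_queue")
            .recipientList(header("recipientQueue"));
    }
}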
Let me put this into a real-world example:
We have an Order Management System (OMS) which sends Orders off to be processed by the ERP. Each Order then goes through 6 steps, and each of those steps publishes an event to the Order_queue announcing the Order's new status. Nothing special here.
The OMS consumes the events from that queue, but it MUST process the events of each Order in the very same sequence in which they were published. The rate of messages published per minute is much greater than the consumer's throughput, hence the delay increases over time.
The solution requirements:
Consume in parallel, with as many consumers as needed to keep the queue size at a reasonable level.
Guarantee that the events for each Order are processed in the same order in which they were published.
The implementation:
On the OMS side
The OMS process responsible for sending Orders to the ERP determines the consumer that will process all events for a given Order, and sends the Recipient name along with the Order.
How does this process know which Recipient to pick? Well, you can use different approaches, but we used a very simple one: round robin.
On ERP
As the ERP keeps the Recipient's name for each Order, it simply sets up the message to be delivered to the desired Recipient.
On OMS Consumer
We've deployed 4 instances, each one using a different Recipient name and processing messages concurrently.
One could say that we created another bottleneck: the database. But it is not true, since there is no concurrency on the order line.
One drawback is that the OMS process which sends the Orders to the ERP must keep track of how many Recipients are working.