Aggregate messages without List


I'm using Spring Integration and I need to pack messages into groups of 10k. I don't want to collect them in a List, since 10k could later become much bigger, and persistent storage is not an option either. I just want several threads to send messages to a single thread, where I can count them and write them to disk in files of 10k lines each. When the counter reaches 10k, I create a new file, reset the counter to zero, and so on. This would work fine with a direct channel, but how do I tell several threads (I'm using
<int:dispatcher task-executor="executor" />
) to send messages to a single thread? Thanks

You can do this with a QueueChannel. Any number of threads can send messages to it concurrently. On the consuming side, configure a PollingConsumer with a fixed-delay poller, which is single-threaded, as you requested. The poller and everything downstream of it that is connected through DirectChannels run in that single thread, so your counting and rollover logic can live there.
There is not much to show, because the configuration is straightforward: different services send messages to the same QueueChannel, and the fixed-delay poller gives you single-threaded reading.
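For illustration only, here is a minimal plain-Java sketch of the same idea, without Spring Integration: a BlockingQueue stands in for the QueueChannel, and one thread calling drain() stands in for the single-threaded poller. The class name RollingFileWriter, the part-N.txt file names, and the idle timeout are assumptions made for this example, not anything from the framework.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Plain-Java analogue of "QueueChannel + single-threaded poller" (hypothetical helper). */
final class RollingFileWriter {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final Path dir;
    private final int linesPerFile;

    RollingFileWriter(Path dir, int linesPerFile) {
        this.dir = dir;
        this.linesPerFile = linesPerFile;
    }

    /** Safe to call from any number of producer threads. */
    void send(String line) {
        queue.add(line);
    }

    /**
     * Must be called from ONE thread. Consumes lines until the queue has been
     * empty for idleMillis, starting a new file every linesPerFile lines.
     * Returns the number of files written.
     */
    int drain(long idleMillis) {
        int files = 0;
        int count = 0;
        BufferedWriter out = null;
        try {
            String line;
            while ((line = queue.poll(idleMillis, TimeUnit.MILLISECONDS)) != null) {
                if (out == null) { // first line of a new file
                    out = Files.newBufferedWriter(dir.resolve("part-" + files + ".txt"));
                    files++;
                }
                out.write(line);
                out.newLine();
                if (++count == linesPerFile) { // file is full: roll over
                    out.close();
                    out = null;
                    count = 0;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop draining, keep the flag
        } finally {
            if (out != null) { // flush the last, partially filled file
                try {
                    out.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        return files;
    }
}
```

Only the counter and the currently open writer are held in memory, so the group size can grow without the consumer ever holding a whole group in a List.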

Related

ActiveMQ : how to fork-join? Ie. how to emit one message when all subtasks are done

imagine you have some task structure of
Task1
Task2: 1 million separate independent Subtask[i] that can run concurrently
Task3: must run once after ALL Task2 subtasks have completed
And all of Task1, Subtask[i] and Task3 are represented by MQ messages.
How can this be solved with ActiveMQ? Especially the triggering of a Task3 message once all subtasks are complete.
I know, it's not a queueing problem, it's a fork-join problem. Let's say the environment dictates that you must use ActiveMQ for it.
Using ActiveMQ features, dynamic queues and consumers, stuff like that, is allowed. Using external counters, like a database row representing Task2's progress, is not allowed.

Hidden in this fork-join problem is a state management and observability challenge. Since the database is ruled out, you have to rely on something in-memory or on-queue.
Create a unique id for the task run, something short but with enough space to avoid collisions, like an airline locator code, e.g. 34FDSX
Send all messages for the task to a queue://TASK.34FDSX.DATA
Send a control message to queue://TASK.34FDSX.CONTROL that contains the task id and expected total # of messages (including each messageId would be helpful too)
When consumers from queue://TASK.34FDSX.DATA complete their work, they should send a 'done' message to queue://TASK.34FDSX.DONE queue with their messageId or some identifier.
The consumers for the .CONTROL queue and the .DONE queue should be the same process, which can then track the expected total against the number of completed messages. Once everything is completed, it can fire the event to trigger Task #3.
This approach keeps everything 'online', and you can also time out the .CONTROL and .DONE reader if too much time passes before the task completes.
Queue deletion can be done using ActiveMQ destination GC, or as a clean-up step in the .CONTROL/.DONE reader on the occasions when everything completes successfully.
Advantages:
No infinite blocking consumers
No infinite open transactions
State of the TASK is online and observable via the presence of queues and queue metrics-- queue size, enqueue count, dequeue count
The entire solution can be multi-threaded and the only requirement is that for a given task the .CONTROL/.DONE listener is the same consumer, but multiple tasks can have individual .CONTROL/.DONE listeners to scale.
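The bookkeeping inside the .CONTROL/.DONE reader could look roughly like the sketch below. It is plain Java with no JMS calls: the class name TaskCompletionTracker and its method names are hypothetical, and the two callbacks are assumed to be invoked by whatever listener code receives the control and done messages.

```java
import java.util.HashSet;
import java.util.Set;

/** Tracks one task run: the expected total from .CONTROL and the ids seen on .DONE. */
final class TaskCompletionTracker {
    private final Runnable onComplete;          // e.g. sends the Task3 message
    private final Set<String> done = new HashSet<>();
    private int expected = -1;                  // unknown until the control message arrives
    private boolean fired = false;

    TaskCompletionTracker(Runnable onComplete) {
        this.onComplete = onComplete;
    }

    /** Called when the control message with the expected total is received. */
    synchronized void onControl(int expectedTotal) {
        expected = expectedTotal;
        fireIfComplete();
    }

    /** Called for every 'done' message; a redelivered id is counted only once. */
    synchronized void onDone(String messageId) {
        done.add(messageId);
        fireIfComplete();
    }

    private void fireIfComplete() {
        if (!fired && expected >= 0 && done.size() >= expected) {
            fired = true;                       // trigger Task3 exactly once
            onComplete.run();
        }
    }
}
```

Keeping the ids in a Set rather than a bare counter is a design choice for this sketch: it makes duplicate 'done' messages harmless, at the cost of holding one id per subtask in memory. The tracker also tolerates 'done' messages that arrive before the control message.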

The question here is a bit vague so my answer will have to be a bit vague as well.
Each of the million independent subtasks for "Task 2" can be represented by a single message. All these messages can be in the same queue. You can spin up as many consumers as you want and process all these messages (i.e. perform all the subtasks). Just ensure that these consumers either use client-acknowledge mode or a transacted session so that the message is not removed from the queue until they are done processing the message. Once there are no more messages in the queue then you know "Task 2" is done.
To detect when the queue is empty you can have a "special" consumer on the queue that periodically opens a transacted session and tries to consume a message from the queue. If the consumer receives a message then you can rollback the transacted session to put the message back on the queue and you know that the queue is not empty (i.e. "Task 2" is not done). If the consumer doesn't receive a message then you know the queue is empty and you can send another message indicating this. You could launch this special consumer as part of "Task 2" after all the messages for the subtasks have been sent to avoid detecting an empty queue prematurely.
To be clear, this is a simple solution. You could certainly add more complexity depending on your requirements, but your question just outlined the basic problem so it's unclear what other requirements you have (if any).

Is RabbitMQ or Kafka message queue a 1:1 messaging system?

As mentioned in the answer,
A message queue is a one-way pipe: one process writes to the queue, and another reads the data in the order
SysV message queue is one example
So, my understanding is,
one message queue is used by two processes, where one process (the producer) inserts an item into the queue and another process (the consumer) consumes the item from the queue
1) Is a RabbitMQ or Kafka message queue a 1:1 messaging system, i.e. used by only two processes, where one process writes and the other process reads?
2) After the consumer consumes the item, does the item get deleted? If not, why do we need a queue data structure? Why not just shared memory?

Kafka is not strictly a 1:1 messaging system. Multiple producers can write to a topic and multiple consumers can read from it. Moreover, in Kafka, consumers can be assigned to the same or to different consumer groups. Every message is consumed by only one consumer within each consumer group (load balancing), and every consumer group receives a copy of every message (provided, of course, that it is subscribed to the corresponding topic and no messages are lost). A good description of this process can be found in this article: Scalability of Kafka Messaging using Consumer Groups.
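To make the delivery rule concrete, here is a toy in-memory model in plain Java. It does not use the Kafka client and it is not how Kafka assigns partitions; the class ConsumerGroupModel and the "partition modulo group size" rule are simplifying assumptions that only illustrate "every group sees every message, one consumer per group handles it".

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of consumer-group delivery (not the real Kafka assignment logic). */
final class ConsumerGroupModel {
    /** group name -> consumers in that group */
    private final Map<String, List<String>> groups = new LinkedHashMap<>();
    /** "group/consumer" -> messages delivered to that consumer */
    private final Map<String, List<String>> delivered = new LinkedHashMap<>();

    void subscribe(String group, String consumer) {
        groups.computeIfAbsent(group, g -> new ArrayList<>()).add(consumer);
        delivered.put(group + "/" + consumer, new ArrayList<>());
    }

    /** Delivers the message once to EVERY group, to exactly ONE consumer in each. */
    void publish(int partition, String message) {
        for (Map.Entry<String, List<String>> group : groups.entrySet()) {
            List<String> members = group.getValue();
            // Assumption for the sketch: a partition is owned by one member per group.
            String owner = members.get(partition % members.size());
            delivered.get(group.getKey() + "/" + owner).add(message);
        }
    }

    List<String> received(String group, String consumer) {
        return delivered.get(group + "/" + consumer);
    }
}
```

With two consumers in one group and a single consumer in another, the first group splits the messages between its members while the second group's consumer receives all of them.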
In Kafka all messages are persisted on disk and kept until compaction removes them, retention.ms elapses, or the log size limit is exceeded. That's a very high-level view and there are a lot of nuances. For example, messages are stored in segments, and every segment contains multiple messages. When the retention period passes for a message, it is not removed from the segment at that moment; instead Kafka waits until all messages in that segment have expired and deletes the whole segment at once. Also, time-based retention can kick in before the log exceeds its maximum size, or vice versa: the log can exceed the size limit before the retention period passes. And so on. Just read the docs and pay attention to the topics on "log cleaner" and "retention".
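The segment-at-a-time behaviour can be sketched with a small model. This is not Kafka code: SegmentRetentionModel is a hypothetical class that reduces a closed segment to the timestamps of its messages and ignores size-based retention, compaction, and the active segment.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of time-based retention applied per segment, not per message. */
final class SegmentRetentionModel {
    /** A closed log segment, reduced to the timestamps (ms) of its messages. */
    static final class Segment {
        final List<Long> timestamps = new ArrayList<>();

        long newest() {
            long max = Long.MIN_VALUE;
            for (long t : timestamps) {
                max = Math.max(max, t);
            }
            return max;
        }
    }

    /**
     * Removes each non-empty segment whose NEWEST message is older than the
     * retention window, i.e. only once every message in it has expired.
     * Returns the number of segments removed.
     */
    static int expire(List<Segment> closedSegments, long nowMs, long retentionMs) {
        int before = closedSegments.size();
        closedSegments.removeIf(s ->
                !s.timestamps.isEmpty() && nowMs - s.newest() > retentionMs);
        return before - closedSegments.size();
    }
}
```

A segment that still holds one unexpired message survives as a whole, including its older, already expired messages.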
After a Kafka consumer reads a message, the message is neither compacted nor expired, so it is not removed from the log and stays there. It also means that every message can be re-read by a consumer if needed (until it is deleted completely). That is useful if some of your consumers went offline for some reason and were not able to process the messages as they came in. It also allows interesting features like transaction replays and so on. Persistence is one of Kafka's features.
Shared memory? Well, strictly speaking, shared memory only works between processes running on the same host. So there is absolutely no way to have "shared memory" when your app runs on multiple hosts. However, there are in-memory brokers. Redis, for example, can be used as a message broker, and it's all in-memory. If such a broker restarts for some reason, you lose everything unless persistence is enabled. Speaking of Redis: it has two persistence configurations specifically to handle restarts.
I am not sure about RabbitMQ, but by default it probably deletes messages once the consumer has acknowledged them. So it's closer to the 1:1 mental model. However, RabbitMQ supports disk persistence as well.

Concurrent consumers of Seda in Apache camel

I have a route as shown below. The route polls a directory at regular intervals and reads a large .csv file. It then splits the file into chunks of 1000 lines and sends them to the seda queue (firstQueue). I have 15 concurrent consumers on this seda queue.
route.split().tokenize("\n", 1000).streaming().to("seda:firstQueue?concurrentConsumers=15").process(myProcessor).to("seda:secondQueue?concurrentConsumers=15").process(anotherMyProcessor);
1) What does 15 concurrent consumers mean? Does it mean that 15 threads read data from the seda queue and pass it to one instance of myProcessor? Or are 15 separate instances of myProcessor created, each one acting on the same copy of the data? Note that myProcessor is a singleton; what will happen if I change it to prototype?
2) Is it possible that two or more threads pick up the same data and pass it to myProcessor? Or is it guaranteed that no two threads will get the same data?
Appreciate a quick response. Thanks!

My Camel is a bit rusty but I'm pretty sure that
There are 15 threads running. Each will read a message from the queue and call myProcessor. There is only one instance of myProcessor, so you need to make sure that it is thread-safe. I've never tried it, but I don't believe changing the scope to prototype will make any difference.
Two threads should not pick up the same message from the queue. In normal running, each message should get processed just once. However, there are error conditions that may result in the same message being processed twice, the most obvious one being that you restart the app partway through processing the file.
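As a rough illustration of both points, here is a plain-Java sketch without Camel: several threads take chunks from one in-memory queue and hand them to a single shared processor instance. SharedProcessorDemo and CountingProcessor are hypothetical names, and the BlockingQueue only stands in for the seda queue.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class SharedProcessorDemo {
    /** One shared instance, like a singleton-scoped processor: its state must be thread-safe. */
    static final class CountingProcessor {
        private final AtomicInteger linesSeen = new AtomicInteger();

        void process(String chunk) {
            // AtomicInteger keeps the count correct under concurrent calls;
            // a plain int field with += could lose updates here.
            linesSeen.addAndGet(chunk.split("\n").length);
        }

        int linesSeen() {
            return linesSeen.get();
        }
    }

    /** Processes all chunks with the given number of consumer threads; returns total lines seen. */
    static int run(List<String> chunks, int consumers) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(chunks);
        CountingProcessor processor = new CountingProcessor();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.execute(() -> {
                String chunk;
                // poll() hands each queued element to exactly one calling thread.
                while ((chunk = queue.poll()) != null) {
                    processor.process(chunk);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processor.linesSeen();
    }
}
```

Calling SharedProcessorDemo.run(chunks, 15) mirrors the 15 concurrent consumers from the question: every chunk is processed once, and the only thing the threads share is the processor's state.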

JMS Queue : Multiple Threads to read

I have a Java program that puts messages into a queue. On the other side of the queue I have 10-15 consumers, any ONE of which should read a given message and process it. Whenever one of the 10-15 consumers becomes free, it picks up the next message from the queue.
Basically, a consumer can pick up a message from the queue whenever it is free, and only one consumer must pick up each message (without any synchronization blocks or the like).
Also, on the sender's end, can I pause sending messages into the queue if the queue becomes full (or reaches a certain threshold)?
I am really new to the JMS API. Apologies if this is a newbie question.
Thanks!!

I have to send messages into a queue and I have 20 threads running as
consumers, which can pick up the data from the queue (once they are
free). So when each thread gets free it goes to the queue, checks if
the data is there, picks it up, and so on. Is this doable?
Yes, it's doable. That's the standard way of using JMS queues. An alternative would be topics, but with topics every listener receives every message rather than just one of them, so queues are what you want. Usually, though, the consumers are not hand-written threads (I'm not even sure what that means) but message-driven beans. You might consider using them. MDBs run in their own threads anyway.

How to design a system that queues requests & processes them in batches?

I have at my disposal a REST service that accepts a JSON array of image urls, and will return scaled thumbnails.
Problem
I want to batch up image URLs sent by concurrent clients before calling the REST service.
Obviously if I receive 1 image, I should wait a moment in case other images trickle in.
I've settled on a batch of 5 images. But the question is, how do I design it to take care of these scenarios:
If I receive x images, such that x < 5, how do I time out from waiting if no new images arrive in the next few minutes?
If I use a queue to buffer incoming image URLs, I will probably need to lock it to prevent clients from writing concurrently while I'm busy reading my batches of 5. What data structure is good for this? BlockingQueue?

The data structure is not what's missing. What's missing is an entity - a Timer task, I'd say, which you stop and restart every time you send a batch of images to your service. You do this whether you send them because you had 5 (incidentally, I assume that 5 is just your starting number and it'll be configurable, along with your timeout), or whether because the timeout task fired.
So there are two entities running: a main thread, which receives requests, queues them, checks the queue depth, and, if it's 5 or more, sends the oldest 5 to the service (and restarts the timer task); and the timer task, which picks up incomplete batches and sends them on.
Side note: that main thread seems to have several responsibilities, so some decomposition might be in order.
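A compact variant of this design is sketched below. Note that it swaps the separate Timer task for a deadline on BlockingQueue.poll, so a single consumer thread handles both the "batch is full" and the "timeout fired" case. UrlBatcher and its method names are hypothetical, and the batch size and wait time are the configurable values mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Collects URLs into batches of at most batchSize, waiting at most maxWaitMillis per batch. */
final class UrlBatcher {
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    private final int batchSize;
    private final long maxWaitMillis;

    UrlBatcher(int batchSize, long maxWaitMillis) {
        this.batchSize = batchSize;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Called concurrently by clients; the queue does its own locking. */
    void submit(String url) {
        queue.add(url);
    }

    /**
     * Blocks until at least one URL is available, then keeps collecting until
     * the batch is full or maxWaitMillis has passed since the first URL was taken.
     */
    List<String> nextBatch() throws InterruptedException {
        List<String> batch = new ArrayList<>(batchSize);
        batch.add(queue.take());
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        while (batch.size() < batchSize) {
            long remaining = Math.max(0L, deadline - System.nanoTime());
            String url = queue.poll(remaining, TimeUnit.NANOSECONDS);
            if (url == null) {
                break; // timed out: send the incomplete batch
            }
            batch.add(url);
        }
        return batch;
    }
}
```

A consumer thread would call nextBatch() in a loop and post each returned list to the REST service. With 7 queued URLs and a batch size of 5, the first call returns 5 immediately and the second returns the remaining 2 once the wait expires.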

Well what you could do is have the clients send a special string to the queue, indicating that it is done sending image URLs. So if your last element in the queue is that string, you know that there are no URLs left.
If you have multiple clients and you know the number of clients, you can count the number of these indicators in the queue to check whether all of the clients are finished.
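A minimal sketch of that idea, assuming the marker is a reserved string that can never be a real URL (the value "__DONE__" and the class name SentinelDrainer are made up for this example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class SentinelDrainer {
    /** Hypothetical end-of-input marker; each client sends it once when finished. */
    static final String DONE = "__DONE__";

    /** Reads from the queue until clientCount markers have been seen; returns the URLs in arrival order. */
    static List<String> drainUntilAllDone(BlockingQueue<String> queue, int clientCount) {
        List<String> urls = new ArrayList<>();
        int finished = 0;
        try {
            while (finished < clientCount) {
                String item = queue.take();
                if (DONE.equals(item)) {
                    finished++;        // one more client has finished
                } else {
                    urls.add(item);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return urls;
    }
}
```

This only works when the number of clients is known up front and every client reliably sends its marker; a client that dies without sending it would leave the reader blocked.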

1- As an example, if your Java web app is running on Google App Engine, you could write each client request to the datastore and have a cron job (i.e. a scheduled task in GAE speak) read the datastore, build a batch, and send it.
2- For the concurrency/locking aspect, you could again rely on the GAE datastore to provide atomicity.
Of course feel free to disregard my proposal if GAE isn't an option.
