A trigger in an Oracle 10g database generates upsert and delete messages for a subset of rows in a regular table. These messages consist of two fields:
A unique row id.
A non-unique id.
When consuming these messages I want to impose an order on the dequeue process that respects the following constraints:
Messages must be dequeued in insertion order.
Messages belonging to the same id must be dequeued in such a fashion that no other dequeuing process can dequeue a potential successor message (or messages) with that id. Since the messages are generated by a trigger, I cannot use message groups for this purpose.
I am using the Oracle Java interface for AQ. Any pointers on how that could be achieved?
The default dequeue order I believe is first in first out, therefore they will be dequeued in the same order they were enqueued.
For your second point, are you saying that you want to serialize dequeue on the non-unique id? I.e., you basically have many queues within your queue, and you only want one job to consume messages from each queue at any one time?
For example, you have messages:
1 | a
2 | a
3 | b
4 | a
Here you have two types of record (a and b) and you want one job to consume all the a's and another to consume all the b's. If that is the case, consider creating multiple queues.
Failing multiple queues, have a look at the dequeue_options_t type that you pass to the dequeue procedure, most notably its deq_condition field. This allows you to select only specific messages, so you could start a job for all the a's and another for all the b's, etc.
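If you are going through the JMS interface to AQ rather than calling DBMS_AQ.DEQUEUE directly, a JMS message selector can play a similar role to deq_condition. A minimal sketch, assuming the trigger exposes the non-unique id as a message property named GROUP_ID (the property name and payload handling are assumptions, not part of your setup):

import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.QueueConnection;
import javax.jms.QueueConnectionFactory;
import javax.jms.QueueSession;
import javax.jms.Session;

public class GroupConsumer {
    // Dequeues one message belonging to the given non-unique id, if any is available.
    public static void consumeOne(QueueConnectionFactory factory, Queue queue, String groupId)
            throws JMSException {
        QueueConnection connection = factory.createQueueConnection();
        try {
            QueueSession session = connection.createQueueSession(true, Session.SESSION_TRANSACTED);
            // The selector restricts this consumer to one logical "sub-queue",
            // similar to a deq_condition on the enqueued message.
            MessageConsumer consumer = session.createConsumer(queue, "GROUP_ID = '" + groupId + "'");
            connection.start();
            Message message = consumer.receive(1000);   // wait up to one second
            if (message != null) {
                // process the message, then commit so it is removed from the queue
                session.commit();
            }
        } finally {
            connection.close();
        }
    }
}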
Say for example, I have 4 partitions.
Say a message msg1 with key 101 is put into partition 1 (out of 4) and is not yet consumed. Meanwhile, a new partition is added, making a total of 5 partitions.
Then the next message msg2 with key 101 goes to, say, partition 4, because hash(101) % no_of_partitions = 4.
Now, in the Streams API, whenever a message is consumed by its key, partition 4 will be accessed for that key, because that is the partition produced by hash(101) % no_of_partitions, and therefore it gets msg2 of key 101 from partition 4.
Now, what about the msg1 of key 101 in partition 1? Is it consumed at all?
You won't lose data; however, depending on your application, adding partitions might not be supported and may break your application.
You can add partitions only if your application is stateless. If your application is stateful, it will most likely break and die with an exception.
Also note that Kafka Streams assumes that input data is partitioned by key. Thus, if the partitioning is changed, even if the application does not break, it will most likely compute an incorrect result, because adding a partition violates the partitioning assumption.
One way to approach this issue is to reset your application (cf. the Streams application reset tool). However, this implies that you lose your current application state. Note that resetting will not address the incorrect-partitioning problem, and your application might still compute incorrect results. To guard against the partitioning problem, you could insert a dummy map() operation that only forwards the data right after you read from a topic, because this will result in data repartitioning if required and thus fix the key-based partitioning.
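For illustration, a minimal sketch of that dummy map() idea (the topic name and the downstream aggregation are placeholders; the snippet belongs inside your topology-building code):

import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("input-topic");

input
    .map(KeyValue::pair)   // no-op map: flags the stream as needing repartitioning
    .groupByKey()          // a repartition topic is created here, restoring
    .count();              // key-based partitioning before the aggregation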
The msg1 of key 101 in partition 1 will be consumed.
In Kafka Streams, you do not "consume a message by its key". Every message in every partition will be consumed. Any filtering on the key would have to happen in the code of the Kafka Streams app.
It will be consumed, but order is not guaranteed. Be sure that your application logic is idempotent. One possible solution is to go through an intermediate topic with more partitions. KStream#through will help you produce to and consume from that topic with a single instruction; it does exactly that and returns a KStream. In pseudo code:
.stream(...)
// potential key transformation
.through("inner_topic_with_more_partitions")
.toTable(accountMaterializer)
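Spelled out a bit more, a sketch might look like this (topic names, value types, and the state-store name are assumptions; in newer Kafka Streams versions repartition() replaces the deprecated through()):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> accounts = builder.stream("accounts-input");

KTable<String, String> accountTable = accounts
    // a potential key transformation (e.g. selectKey) would go here
    .through("inner_topic_with_more_partitions")   // produce to and consume from the intermediate topic
    .toTable(Materialized.as("account-store"));    // materialize the stream as a table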
The input stream consists of data in the form of a JSON array of objects.
Each object has a field/key named State by which we need to separate the input stream; see the example below:
Object1 -> "State":"Active"
Object2 -> "State":"Idle"
Object3 -> "State":"Blocked"
Object4 -> "State":"Active"
We have to start a processing thread as soon as we receive a particular state and keep consuming the data; if a new state is the same as the previous state, the previous thread should handle it, otherwise a new thread should be started for the new state. Also, each thread should run for a finite time, and all the threads should run in parallel.
Please suggest how I can do this in Apache Flink. Pseudo code and links would be helpful.
This can be done with Flink's DataStream API. Each JSON object can be treated as a tuple, which can be processed with any of the Flink operators.
                /----- * ----- | Active
----- (KeyBy) -------- * ----- | Idle
                \----- * ----- | Blocked
Now, you can split the single data stream into multiple streams using the keyBy operator. This operator groups all the tuples with a particular key (State in your case) into a keyed stream, which is processed in parallel. Internally, this is implemented with hash partitioning.
Any new keys (states) are handled dynamically, as new keyed streams are created for them.
Explore the documentation for the implementation details.
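As a rough illustration of the keyBy approach (StateEvent, MyJsonSource, and MyPerStateProcessor are hypothetical placeholders; parsing the JSON into the POJO is omitted):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// StateEvent is a hypothetical POJO parsed from each JSON object.
DataStream<StateEvent> events = env.addSource(new MyJsonSource());

events
    .keyBy(StateEvent::getState)           // hash-partition by the State field
    .process(new MyPerStateProcessor());   // processed in parallel, one logical task per key

env.execute("state-splitting-job");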
From your description, I believe you'd first need an operator with a parallelism of 1 that "chunks" events by state and adds a "chunk id" to each output record. Whenever you get an event with a new state, you increment the chunk id.
Then key by the chunk id, which will parallelize downstream processing. Add a custom function that is keyed by the chunk id and has a window duration of 10 minutes. This is where the bulk of your data processing will occur.
And as @narush noted above, you should read through the documentation he linked to, so you understand how windows work in Flink.
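A rough sketch of that idea (StateEvent, ChunkedEvent, and MyChunkWindowFunction are hypothetical, and processing-time windows are assumed):

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// events is the DataStream<StateEvent> obtained from your source.
// Parallelism-1 operator: a new chunk id starts whenever the state changes.
DataStream<ChunkedEvent> chunked = events
    .map(new RichMapFunction<StateEvent, ChunkedEvent>() {
        private long chunkId = 0;
        private String lastState = null;

        @Override
        public ChunkedEvent map(StateEvent e) {
            if (!e.getState().equals(lastState)) {
                chunkId++;                       // new state -> new chunk
                lastState = e.getState();
            }
            return new ChunkedEvent(chunkId, e);
        }
    })
    .setParallelism(1);                          // must see all events in insertion order

chunked
    .keyBy(ChunkedEvent::getChunkId)             // parallelizes downstream work per chunk
    .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))
    .process(new MyChunkWindowFunction());       // bulk of the processing happens here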
My requirements are as follows:
Maintain a pool of records in a table (MySQL DB).
A job acts as a producer and fills up this pool if the number of entries goes below a certain threshold. The job runs every 15 mins.
There can be multiple consumers with each consumer picking up just one record each. Two consumers coming in at the same time should get two different records.
The producer should not block the consumer. So while the producer job is running consumers should be able to pick up any available rows.
The producer/consumer code is part of the application, which in turn is a JBoss application.
In order to ensure that each consumer picks a distinct record (in case of concurrency), we do the following:
We use an integer column as an index.
A consumer will first update the record with the lowest index value with its own name.
It will then select that claimed record and proceed with it.
This approach ensures that two consumers do not end up with the same record.
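A minimal JDBC sketch of that claim-then-select step (the record_pool table and its columns are assumptions; MySQL's UPDATE ... ORDER BY ... LIMIT syntax is used):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Claim the lowest-index unclaimed record for this consumer, then read it back.
// Hypothetical table: record_pool(idx INT, consumer VARCHAR(64), payload VARCHAR(255)).
void claimOne(Connection conn, String consumerName) throws Exception {
    try (PreparedStatement claim = conn.prepareStatement(
            "UPDATE record_pool SET consumer = ? "
            + "WHERE consumer IS NULL ORDER BY idx LIMIT 1")) {
        claim.setString(1, consumerName);
        if (claim.executeUpdate() == 0) {
            return;   // nothing available right now
        }
    }
    try (PreparedStatement pick = conn.prepareStatement(
            "SELECT idx, payload FROM record_pool WHERE consumer = ? ORDER BY idx LIMIT 1")) {
        pick.setString(1, consumerName);
        try (ResultSet rs = pick.executeQuery()) {
            if (rs.next()) {
                // process the claimed record here
            }
        }
    }
}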
One problem we are seeing is that when the producer is filling up the pool, consumers get blocked. Since the producer can take some time to complete, all consumers in that period are blocked as the update by the consumer waits for the insert by the producer to complete.
Is there any way to resolve this scenario? Any other approach to design this is also welcome.
Is it a hard requirement that you use a relational database as a queue? This seems like a bad approach to the problem, especially since it has already been addressed by message queues. You could use MySQL to persist the state of your queue, but it won't make a good queue itself.
Take a look at ActiveMQ or JBoss Messaging (given that you are using JBoss)
We receive measurement values from different measurement stations and put them into a JMS queue for processing.
We use JMS message grouping, where the group id is the id of the station, to ensure that the values from one station are processed serially.
Just see the stock sample in the docs: http://docs.jboss.org/hornetq/2.2.2.Final/user-manual/en/html/message-grouping.html
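For reference, a minimal sketch of how that grouping is typically set on the producer side (the station id and payload handling are assumptions):

import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

// Every message carrying the same JMSXGroupID is delivered to the same consumer,
// which serializes processing per station (and causes the pinning described below).
void sendMeasurement(Session session, MessageProducer producer,
                     String stationId, String payload) throws Exception {
    TextMessage message = session.createTextMessage(payload);
    message.setStringProperty("JMSXGroupID", stationId);   // station id as group id
    producer.send(message);
}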
But in HornetQ (and in JMS in general), each group is pinned to one specific worker. This means that if group A and group B are pinned to worker X and there are 10 messages for group A and 10 messages for group B in the queue, the group B messages have to wait until the group A messages are handled (although there are enough free workers that could handle the group B messages right now).
Is there a way to tell JMS not to pin each group to one specific worker, but only to ensure serial processing within each group?
JMS queues do prevent parallel processing. Have you confirmed that the different consumers process the same message? Does each message have a unique ID and are you logging the ID of each message processed?
From the example you gave, it sounds like you are misusing JMS queues. Could you update your question with a clearer, more in-depth example of your workflow?
Right now, with the way your question is worded, it sounds like you need to redesign your workflow model.
I am using a JBoss 5.1.2 MDB to consume entity messages placed on a queue. The message producer can produce multiple messages for the same entity. These message entities are identified by their entity number (msg_entity_no). I must insert a record into an existing entity table only if the entity does not exist; otherwise I must update the existing record.
The existing entity table is not unique on this entity number, as it has another internal key, i.e. it contains duplicate msg_entity_no values.
The problem I am experiencing is that when multiple messages are produced, multiple instances of the MDB query the entity table for existence at the same time. At that point the entity does not exist for either instance, so both messages result in inserts, as opposed to one insert for the non-existent entity and then updates to the record for subsequent messages.
I want to get away from using the annotation @ActivationConfigProperty(propertyName = "maxSession", propertyValue = "1") and deploying to the deploy-hasingleton folder, which only allows one instance of the MDB, as this is not scalable.
The condition you are seeing is due to duplicate or identical data contained within messages that are placed in the queue in quick succession. There are a couple of solutions for this.
1) Dequeue on one JBoss instance with only one MDB. This means you will have one MDB running on one JBoss server in the cluster, so messages will essentially be processed in sequence.
2) Create a locking mechanism whereby you write the message contents to a staging table with a primary key. You then filter out duplicates or create an ordering of the data to be processed based on the contents. This will slow down execution, but you will have better auditing of your data. You will in essence have two asynchronous jobs: one to populate the data from the queue and another to process the data, which could run, say, a minute later.
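A rough sketch of option 2's staging step (the message_staging table, its columns, and using msg_entity_no as the primary key are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLIntegrityConstraintViolationException;

// The primary key on msg_entity_no lets the database arbitrate: only the first
// MDB instance inserts; concurrent instances hit the constraint and update instead.
void stageMessage(Connection conn, long msgEntityNo, String payload) throws Exception {
    try (PreparedStatement insert = conn.prepareStatement(
            "INSERT INTO message_staging (msg_entity_no, payload) VALUES (?, ?)")) {
        insert.setLong(1, msgEntityNo);
        insert.setString(2, payload);
        insert.executeUpdate();   // succeeds only for the first message of this entity
    } catch (SQLIntegrityConstraintViolationException e) {
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE message_staging SET payload = ? WHERE msg_entity_no = ?")) {
            update.setString(1, payload);
            update.setLong(2, msgEntityNo);
            update.executeUpdate();
        }
    }
}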
3) Some queue implementations, such as Oracle AQ, have a dequeue condition that can be set: messages in the queue are evaluated against the condition, and only messages that satisfy it are returned. TIBCO has locking strategies that protect the thread of execution when multiple threads run in an agent.
References
http://tech.chococloud.com/archives/2152
Without understanding the business process, I would suggest you try to prevent the duplicate messages at the source.