From the examples I have seen, I put together the code snippet below and it works fine. But the problem is that I don't always have the requirement of processing an input stream and producing it to a sink.
What if I have an application where, based on some events, I only have to publish to a Kafka topic so that downstream applications can make certain decisions? That means I don't really have an input stream; I just know that when something happens in my application, I need to publish a message to a particular Kafka topic. That is, I only need a sink.
I went through the examples but didn't find anything matching this requirement. Is there a way to configure only a KafkaSink that exposes a method to be called for publishing messages to a topic?
Many thanks in advance!!
String inputTopic = "flink_input";
String outputTopic = "flink_output";
String consumerGroup = "baeldung";
String address = "localhost:9092";

StreamExecutionEnvironment environment = StreamExecutionEnvironment
    .getExecutionEnvironment();

FlinkKafkaConsumer011<String> flinkKafkaConsumer = createStringConsumerForTopic(
    inputTopic, address, consumerGroup);
DataStream<String> stringInputStream = environment
    .addSource(flinkKafkaConsumer);

FlinkKafkaProducer011<String> flinkKafkaProducer = createStringProducer(
    outputTopic, address);

stringInputStream
    .map(new WordsCapitalizer())
    .addSink(flinkKafkaProducer);
You must have a source. You might want to implement a custom source, or you could use something like a NumberSequenceSource followed by an operator like a process function that emits whatever you know you want to write to the sink, followed by the sink.
That process function could, for example, transform the incoming events into whatever you want to write to Kafka, or it could ignore its inputs and use a timer to generate the events to be sent to Kafka.
Or you might find that async i/o is a better building block than a process function, depending on your requirements.
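For example, a rough sketch of that shape, assuming Flink 1.14+ with the newer KafkaSink connector on the classpath; the broker address, topic name, and emitted strings are placeholders:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

KafkaSink<String> kafkaSink = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("flink_output")
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .build();

env.fromSource(
        new NumberSequenceSource(0, Long.MAX_VALUE),
        WatermarkStrategy.noWatermarks(),
        "sequence")
    .process(new ProcessFunction<Long, String>() {
        @Override
        public void processElement(Long tick, Context ctx, Collector<String> out) {
            // ignore the input and emit whatever your application decides to publish
            out.collect("event-" + tick);
        }
    })
    .sinkTo(kafkaSink);

env.execute("sink-only-style-pipeline");

The process function here ignores its input and simply maps each number to a string; it could just as well consult application state or external events to decide what to publish.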
Hello, I have an issue I'm trying to solve. Basically, I have a Kafka Streams topology that reads JSON messages from a Kafka topic; each message gets deserialized into a POJO. The topology then checks that message for a certain boolean flag. If the flag is true, it does some transformation and writes the message back to the topic. If the flag is false, however, I want it to write nothing, but I'm not sure how to go about that. With MP Reactive Messaging I could just use an RxJava 2 Flowable stream and return something like Flowable.empty(), but that approach doesn't seem to work here.
JsonbSerde<FinancialMessage> financialMessageSerde = new JsonbSerde<>(FinancialMessage.class);
StreamsBuilder builder = new StreamsBuilder();

builder.stream(
        TOPIC_NAME,
        Consumed.with(Serdes.Integer(), financialMessageSerde)
    )
    .mapValues(message -> checkCondition(message))
    .to(
        TOPIC_NAME,
        Produced.with(Serdes.Integer(), financialMessageSerde)
    );
Below is the logic of the function being called.
public FinancialMessage checkCondition(FinancialMessage rawMessage) {
    FinancialMessage receivedMessage = rawMessage;
    if (receivedMessage.compliance_services) {
        receivedMessage.compliance_services = false;
        return receivedMessage;
    } else {
        return null;
    }
}
If the boolean is false it just returns a JSON body with "null".
I've tried changing the return type of checkCondition to a wrapped type, like
public Flowable<FinancialMessage> checkCondition(FinancialMessage rawMessage)
and then returning something like Flowable.just(receivedMessage) or Flowable.empty() from the if/else, but I can't seem to serialize the Flowable object. This might be a silly question, but is there a better way to go about this?
Note that Kafka messages are immutable and are not deleted after being read. If you read from and write to the same topic within a single application, a message would be processed infinitely often (or, more precisely, ever newer copies of it would be), unless you have a condition that breaks the cycle.
Also, if for example 5 services read from the same topic, all 5 services get a copy of every event. And if one service writes back, the other 4 services and the writing service itself will read the message again. Thus, you get quite some data amplification.
If you want different services to react to the original input message consecutively, you could instead put one topic between each pair of consecutive services to really build a pipeline.
Last, you say that if the boolean flag is true you want to transform the message and emit it (I assume for the next service to consume), and that for false you want to do nothing. I further assume that for any message only a single flag is true and that a successful transformation also switches the flag (to enable processing by the next service). For this case, it's best if you can ensure that each original input message has the same initial boolean flag set, so you can build your pipeline: only the corresponding service reads messages with its boolean flag set (you don't even need to check the flag, because the upstream write guarantees it is set; a check would only be a sanity check).
If you don't know which boolean flag is set initially and all services read from the same input topic, just filtering out the message is correct. If all services read all messages, 4 services will filter a given message while one service processes it and emits a new message with a different flag. For this architecture a single topic might work: once a message has been processed by all services and all boolean flags are false, and it is written back to the input topic, every service will drop that last copy correctly. However, using a single topic implies a lot of redundant reading and writing.
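For the "just filter it out" variant, here is a minimal sketch based on the topology from the question; OUTPUT_TOPIC is an assumption on my side (writing back to the same input topic would recreate the read/write cycle described above):

StreamsBuilder builder = new StreamsBuilder();

builder.stream(TOPIC_NAME, Consumed.with(Serdes.Integer(), financialMessageSerde))
    // keep only the messages this service is responsible for; everything else is dropped
    .filter((key, message) -> message.compliance_services)
    .mapValues(message -> {
        message.compliance_services = false;   // flip the flag after processing
        return message;
    })
    .to(OUTPUT_TOPIC, Produced.with(Serdes.Integer(), financialMessageSerde));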
Maybe the best architecture is to keep your original input topic and add one input topic per service, plus an additional "dispatcher" service that reads from the original input topic and branches the KStream into the per-service input topics according to the boolean flags. This way, each service reads only messages with the right flag set to true. Furthermore, after its transformation each service also uses branching to write the message to the input topic of the correct next service. Last, you would want an output topic that each service can write into once a message is fully processed.
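A sketch of that dispatcher, assuming Kafka Streams 2.8+ (the split()/Branched API); the topic names and the second flag (fraud_services) are made up for illustration:

StreamsBuilder builder = new StreamsBuilder();

builder.stream("original-input", Consumed.with(Serdes.Integer(), financialMessageSerde))
    .split()
    .branch((key, msg) -> msg.compliance_services,
            Branched.withConsumer(ks -> ks.to("compliance-input")))
    .branch((key, msg) -> msg.fraud_services,            // hypothetical second flag
            Branched.withConsumer(ks -> ks.to("fraud-input")))
    .defaultBranch(Branched.withConsumer(ks -> ks.to("unrouted")));

Each service then filters and transforms against its own input topic, along the lines of the sketch above.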
I am a Reactor newbie. I am trying to develop the following application logic:
Read messages from a Kafka topic source.
Transform the messages.
Write a subset of the transformed messages to a new Kafka topic target.
Explicitly acknowledge the reading operation for all the messages originally read from topic source.
The only solution I found is to rewrite the above business logic as follows:
Read messages from a Kafka topic source.
Transform the messages.
Immediately acknowledge the messages that will not be written to topic target.
Filter out the above messages.
Write the rest of the transformed messages to the new Kafka topic target.
Explicitly acknowledge the reading operation for these messages.
The code implementing the second logic is the following:
receiver.receive()
    .flatMap(this::processMessage)
    .map(this::acknowledgeMessagesNotToWriteInKafka)
    .filter(this::isMessageToWriteInKafka)
    .as(this::sendToKafka)
    .doOnNext(r -> r.correlationMetadata().acknowledge());
Clearly, the receiver is of type KafkaReceiver, and the sendToKafka method uses a KafkaSender. One of the things I don't like is that I am using a map to acknowledge some messages.
Is there any better solution to implement the original logic?
This is not exactly your four business logic steps, but I think it's a little bit closer to what you want.
You could acknowledge the "discarded" messages that won't be written in .doOnDiscard after .filter...
receiver.receive()
    .flatMap(this::processMessage)
    .filter(this::isMessageToWriteInKafka)
    .doOnDiscard(ReceiverRecord.class, record -> record.receiverOffset().acknowledge())
    .as(this::sendToKafka)
    .doOnNext(r -> r.correlationMetadata().acknowledge());
Note: you'll need to use the proper object type that was discarded. I don't know what type of object the Publisher returned from processMessage emits, but I assume you can get the ReceiverRecord or ReceiverOffset from it in order to acknowledge it.
Alternatively, you could combine filter/doOnDiscard into a single .handle operator...
receiver.receive()
    .flatMap(this::processMessage)
    .handle((m, sink) -> {
        if (isMessageToWriteInKafka(m)) {
            sink.next(m);
        } else {
            m.getReceiverRecord().getReceiverOffset().acknowledge();
        }
    })
    .as(this::sendToKafka)
    .doOnNext(r -> r.correlationMetadata().acknowledge());
I'm currently working with Akka Streams (in Java) for a personal project and I'm having a hard time understanding how to send elements into a Source.
The idea is to use a WebSocket to push content into the user's web browser. I've managed to use Akka Streams to create a request-response system, following the Akka HTTP documentation, but this is not what I want to do.
Looking into the Akka Streams documentation, I saw that there is Source.queue and Source.actorRef. But I don't understand how to put an element into the Source. Source.queue and Source.actorRef return a Source, which doesn't have the method offer (for Source.queue) or tell (for Source.actorRef).
My question is: how do I get the ActorRef for the Source created by Source.actorRef or the SourceQueueWithComplete for a Source created with Source.queue, to be able to send elements to my Source?
I searched the various Akka documentation but found no method to do that. And the majority of the code I found on the Internet is written in Scala, which doesn't seem to have the same problem.
The actor and queue from Source.actorRef and Source.queue, respectively, are the materialized values of those sources, meaning that they can be obtained only if the stream is running. For example:
final ActorRef actor =
    Source.actorRef(Integer.MAX_VALUE, OverflowStrategy.fail())
        .to(Sink.foreach(m -> System.out.println(m)))
        .run(materializer);

actor.tell("do something", ActorRef.noSender());
It's no different in Scala:
implicit val materializer = ActorMaterializer()
val actor =
  Source.actorRef(Int.MaxValue, OverflowStrategy.fail)
    .to(Sink.foreach(println))
    .run()
actor ! "do something"
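Source.queue works the same way: the SourceQueueWithComplete is the materialized value you get back from running the stream. A minimal Java sketch (the buffer size and overflow strategy are arbitrary choices):

final SourceQueueWithComplete<String> queue =
    Source.<String>queue(100, OverflowStrategy.backpressure())
        .to(Sink.foreach(m -> System.out.println(m)))
        .run(materializer);

// offer returns a CompletionStage<QueueOfferResult> you can inspect if needed
queue.offer("do something");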
I'm using Kafka and we have a use case to build a fault tolerant system where not even a single message should be missed. So here's the problem:
If publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we robustly handle those messages and replay them once things are back up again? Again, as I said, we cannot afford even a single message failure.
Another use case is that we also need to know at any given point in time how many messages failed to publish to Kafka for any reason, i.e. something like counter functionality, and those messages then need to be re-published.
One of the solutions is to push those messages to some database (like Cassandra, where writes are very fast, but we also need counter functionality and I guess Cassandra's counter functionality is not that great, so we don't want to use it) which can handle that kind of load and also provide us with a counter facility that is very accurate.
This question is more from an architecture perspective, and then which technology to use to make that happen.
PS: We handle somewhere around 3000 TPS, so when the system starts failing, those failed messages can grow very fast in a very short time. We're using Java-based frameworks.
Thanks for your help!
The reason Kafka was built in a distributed, fault-tolerant way is to handle problems exactly like yours: multiple failures of core components should avoid service interruptions. To avoid a down ZooKeeper, deploy at least 3 instances of ZooKeeper (if this is in AWS, deploy them across availability zones). To avoid broker failures, deploy multiple brokers, and make sure you specify multiple brokers in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message in a durable manner, set acks=all on the producer. This acknowledges a client write once all in-sync replicas have acknowledged reception of the message (at the expense of throughput). You can also set queuing limits so that if writes to the broker start backing up, you can catch an exception, handle it, and possibly retry.
Using Cassandra (another well-thought-out distributed, fault-tolerant system) to "stage" your writes doesn't seem like it adds any reliability to your architecture, but it does increase the complexity. Besides, Cassandra wasn't written to be a message queue for a message queue; I would avoid this.
Properly configured, Kafka should be available to handle all your message writes and provide suitable guarantees.
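A hedged sketch of what that producer configuration and the exception handling might look like (broker addresses, topic, and values are illustrative):

Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092"); // list several brokers
props.put("acks", "all");                 // wait for all in-sync replicas
props.put("retries", Integer.MAX_VALUE);  // let the client retry transient failures
props.put("max.block.ms", 10000);         // surface a timeout instead of blocking forever when the buffer is full
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Producer<String, String> producer = new KafkaProducer<>(props);
try {
    producer.send(new ProducerRecord<>("events", "payload")).get(); // block so failures surface here
} catch (Exception e) {
    // broker unreachable or local buffer exhausted: count the failure and park/retry the message
}

Blocking on get() per message costs throughput; a callback, as shown in the answer below, is the non-blocking alternative.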
I am super late to the party. But I see something missing in the above answers :)
The strategy of choosing some distributed system like Cassandra is a decent idea. Once Kafka is up and healthy again, you can retry all the messages that were written into it.
I would like to answer the part about "knowing how many messages failed to publish at a given time."
From the tags, I see that you are using apache-kafka and kafka-consumer-api. You can write a custom callback for your producer, and this callback can tell you whether the message failed or was published successfully. On failure, log the metadata for the message.
Now, you can use log analysis tools to analyze your failures. One such decent tool is Splunk.
Below is a small code snippet that can better explain the callback I was talking about:
public class ProduceToKafka {

    private ProducerRecord<String, String> message = null;

    // TracerBulletProducer class has producer properties
    private KafkaProducer<String, String> myProducer = TracerBulletProducer
        .createProducer();

    public void publishMessage(String string) {
        ProducerRecord<String, String> message = new ProducerRecord<>(
            "topicName", string);
        myProducer.send(message, new MyCallback(message.key(), message.value()));
    }

    // "log" is assumed to be an SLF4J-style logger (e.g. provided by Lombok's @Slf4j)
    class MyCallback implements Callback {
        private final String key;
        private final String value;

        public MyCallback(String key, String value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception == null) {
                log.info("--------> All good !!");
            } else {
                log.info("--------> not so good !!");
                log.info(metadata.toString());
                log.info("" + metadata.serializedValueSize());
                log.info(exception.getMessage());
            }
        }
    }
}
If you analyze the number of "--------> not so good !!" logs per time unit, you can get the required insights.
God speed !
Chris already explained how to keep the system fault tolerant.
Kafka by default supports at-least-once message delivery semantics: when it tries to send a message and something goes wrong, it will try to resend it.
When you create your Kafka producer properties, you can configure this by setting the retries option to a value greater than 0.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:4242");
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<>(props);
For more info check this.
In my application I need to send messages to a given topic name. The topic was already created by someone else, and a config file gives me only the topic name. My job is to push messages to that topic. Is there any way to get the topic ARN from the topic name in Java?
As stated in this answer, using createTopic(topicName) is the more direct approach. If the topic has already been created, it will simply return the topic's ARN.
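In Java with the AWS SDK v1 that is roughly (the topic name is a placeholder):

// createTopic is idempotent: if the topic already exists, SNS just returns its ARN
AmazonSNS snsClient = AmazonSNSClientBuilder.defaultClient();
String topicArn = snsClient.createTopic("my-topic-name").getTopicArn();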
I've done this one of two ways. The ARN is always the same pattern. So, you could just subscribe to "arn:aws:sns:<region>:<account>:<name>" where:
region is from Regions.getCurrentRegion(). Be careful with this, as it is a somewhat expensive call and you'll need to handle not being on an EC2/Elastic Beanstalk instance.
account is from AmazonIdentityManagementClient.getUser().getUser().getArn(). You'll have to parse the account number from that. Same warning about not being in an EC2 environment.
name is what you have.
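A rough sketch of assembling the ARN from those pieces (SDK v1; topicName is the name you already have, and the account parsing assumes the usual arn:aws:iam::<account>:user/<name> format):

// Requires credentials that allow iam:GetUser; getCurrentRegion() returns null off EC2/Beanstalk
AmazonIdentityManagement iamClient = AmazonIdentityManagementClientBuilder.defaultClient();
String region = Regions.getCurrentRegion().getName();
String userArn = iamClient.getUser().getUser().getArn();   // arn:aws:iam::<account>:user/<name>
String account = userArn.split(":")[4];
String topicArn = "arn:aws:sns:" + region + ":" + account + ":" + topicName;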
A simpler way is to loop through the topics and look for the name you want in the ARN. You can use the AmazonSNSClient listTopics method to do this. Remember that that method only returns the first 100 topics, so you'll want to properly loop over the entire topic list.
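A rough sketch of that loop (SDK v1; matching on the ARN suffix is an assumption about how you identify your topic):

AmazonSNS snsClient = AmazonSNSClientBuilder.defaultClient();
String topicName = "my-topic-name";   // from your config file
String topicArn = null;

ListTopicsResult page = snsClient.listTopics();
while (topicArn == null) {
    for (Topic topic : page.getTopics()) {
        if (topic.getTopicArn().endsWith(":" + topicName)) {
            topicArn = topic.getTopicArn();
            break;
        }
    }
    if (topicArn != null || page.getNextToken() == null) {
        break;                         // found it, or scanned every page without a match
    }
    page = snsClient.listTopics(page.getNextToken());
}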
This might be what you need (Python/boto3 in a Lambda handler): supply the topic name and it is looked up among the available topics.
import json
import boto3

def lambda_handler(event, context):
    try:
        topic = event['topic']
        subject = event['subject']
        body = event['body']
        subscriber = event['subscriber']
        sns_client = boto3.client('sns')
        sns_topic_arn = [tp['TopicArn'] for tp in sns_client.list_topics()['Topics'] if topic in tp['TopicArn']]
        sns_client.publish(TopicArn=sns_topic_arn[0], Message=body,
                           Subject=subject)
    except Exception as e:
        print(e)
What you can do is create a table that maps each topic name to its topic ARN.
The topic ARN can be retrieved from the console, or from the API response when you create the topic.
This way there is no need to loop over topics or try to match a pattern.