Multiple Kafka Consumers in Java

So I am running Kafka with two topics, and both will be serving files:
Topic 1 - I read the string value from it and process it.
Topic 2 - I need the binary values for processing.
So I am guessing I would need two consumers?
The reasoning behind this is that in the properties for the data from the first topic I need a StringSerializer/Deserializer, and for the second one I need a ByteArraySerializer/Deserializer.
So is it feasible to have two consumer implementations?
I know that I could get the string value and convert it to binary or vice versa, but somehow having two consumers seems to be the cleaner way of doing it.
Thanks in advance.
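
For illustration only, a minimal sketch of what two separate consumers could look like, one per topic, each with its own value deserializer (the topic names, group IDs, and broker address are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoConsumersSketch {

    public static void main(String[] args) {
        // Consumer 1: string values from topic-1
        Properties stringProps = new Properties();
        stringProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        stringProps.put(ConsumerConfig.GROUP_ID_CONFIG, "string-consumer-group");
        stringProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        stringProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        KafkaConsumer<String, String> stringConsumer = new KafkaConsumer<>(stringProps);
        stringConsumer.subscribe(Collections.singletonList("topic-1"));

        // Consumer 2: raw bytes from topic-2
        Properties byteProps = new Properties();
        byteProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        byteProps.put(ConsumerConfig.GROUP_ID_CONFIG, "binary-consumer-group");
        byteProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        byteProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
        KafkaConsumer<String, byte[]> byteConsumer = new KafkaConsumer<>(byteProps);
        byteConsumer.subscribe(Collections.singletonList("topic-2"));

        // Each consumer would normally poll in its own loop and thread; shown once here for brevity.
        for (ConsumerRecord<String, String> record : stringConsumer.poll(Duration.ofSeconds(1))) {
            System.out.println("string value: " + record.value());
        }
        for (ConsumerRecord<String, byte[]> record : byteConsumer.poll(Duration.ofSeconds(1))) {
            System.out.println("binary value length: " + record.value().length);
        }
    }
}

Note that a KafkaConsumer is not thread-safe, so in practice each consumer would run on its own thread rather than sharing one.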

Related

Spring Integration aggregating messages that were split twice

I have a use case where my messages are being split twice and I want to aggregate all these messages. How can this best be achieved: should I aggregate the messages twice by introducing different sequence headers, or is there a way to aggregate the messages in a single aggregating step by overriding how messages are grouped?
That's called "nested splitting", and there is a built-in algorithm that pushes sequence detail headers onto a stack for each new splitting context. This allows an ascending aggregation at the end: the first aggregator aggregates for the closest nested split, pops the sequence detail headers, and lets the next aggregator deal with its own sequence context.
So, in two words: it is better to have as many aggregators as you have splitters if you want to send a single message at the start and receive a single message at the end.
Of course, you can have a custom splitting algorithm with applySequence = false (as many as you need) and only a single aggregator at the end, but then with custom correlation logic.
We have some explanation in the docs: https://docs.spring.io/spring-integration/docs/current/reference/html/message-routing.html#aggregatingmessagehandler
Starting with version 5.3, after processing a message group, an AbstractCorrelatingMessageHandler performs a MessageBuilder.popSequenceDetails() message headers modification for the proper splitter-aggregator scenario with several nested levels.
We don't have a sample on the matter, but here is a configuration for test case: https://github.com/spring-projects/spring-integration/blob/main/spring-integration-core/src/test/java/org/springframework/integration/aggregator/scenarios/NestedAggregationTests-context.xml
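
As a minimal sketch of the "one aggregator per splitter" shape, assuming Spring Integration 5.3+ and the Java DSL (the flow name, channel name, and the transform step are illustrative, and the input payload is assumed to be a nested collection such as List<List<String>>):

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.dsl.IntegrationFlow;

@Configuration
public class NestedSplitAggregateConfig {

    @Bean
    public IntegrationFlow nestedSplitAggregateFlow() {
        return f -> f
                .split()                                          // outer split (pushes sequence details)
                .split()                                          // nested split (pushes another level)
                .<String, String>transform(String::toUpperCase)   // placeholder per-item processing
                .aggregate()                                      // aggregates the nested split; since 5.3 pops sequence details
                .aggregate()                                      // aggregates the outer split back into one message
                .channel("nestedResult");                         // aggregated result lands here
    }
}

A message sent to the flow's input channel (nestedSplitAggregateFlow.input for a lambda flow) would come back out of nestedResult as a single aggregated message.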

Get the sequence number of the message being processed in Spark Streaming

I am using Spark Structured Streaming for processing the messages and I am using Java 8. I am reading the messages from Kafka, writing them to a file, and saving the file in HDFS.
I have a requirement to write a sequence number along with each message to the file.
For example, if I get the first message from Kafka, the output file content will be "message, 1"; for the second message it is "message, 2", and so on: a kind of count.
If the message count reaches some threshold, let's say "message, 999999", then I need to reset the sequence from 1 again with the next message I receive.
If the Spark streaming job is restarted, it should continue the sequence where it left off. So I need to save this number in HDFS, similar to a checkpointLocation.
What is the best approach I can use to implement this sequence? Can I use an Accumulator to do that? Or is there any other better approach during distributed processing? Or is it not possible in distributed processing?
It won't be that hard. You can read each message using a map function and keep adding the count to the messages. The count can be maintained within your code logic.
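
The answer above glosses over the distributed and restart aspects. As a rough sketch (my own assumption, not the answer's code), Structured Streaming's flatMapGroupsWithState can keep the counter in checkpointed state; routing every record to a single constant key makes the sequence global, at the cost of funnelling all records through one task:

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.function.FlatMapGroupsWithStateFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupStateTimeout;
import org.apache.spark.sql.streaming.OutputMode;

public class SequenceNumberSketch {

    // messages: a streaming Dataset<String> read from Kafka (value column cast to String)
    static Dataset<String> withSequence(Dataset<String> messages) {
        return messages
                // single constant key => one global counter (one task handles all records)
                .groupByKey((MapFunction<String, Integer>) m -> 0, Encoders.INT())
                .flatMapGroupsWithState(
                        (FlatMapGroupsWithStateFunction<Integer, String, Long, String>) (key, values, state) -> {
                            long seq = state.exists() ? state.get() : 0L;
                            List<String> out = new ArrayList<>();
                            while (values.hasNext()) {
                                // reset to 1 after the threshold, as the question requires
                                seq = (seq >= 999999L) ? 1L : seq + 1L;
                                out.add(values.next() + "," + seq);
                            }
                            state.update(seq);
                            return out.iterator();
                        },
                        OutputMode.Append(),
                        Encoders.LONG(),
                        Encoders.STRING(),
                        GroupStateTimeout.NoTimeout());
    }
}

The state is recovered from the query's checkpointLocation on restart, which covers the "continue where it left off" requirement.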

How to create a Kafka topic name at runtime based on some value field

I have a requirement where I have to create the topic name based on different values coming in for a field in the <Value object>, so that all the records <K,V> with similar field values go to Topic_<Field>. How can I do it using KStream?
In Kafka 1.1.0, you can use branch() to split a stream into substreams and then write the different substreams into different topics by adding a different sink operator (i.e., to()) to each substream.
Kafka 2.0 (to be released in June) adds a new "dynamic routing" feature that simplifies this scenario. Compare: https://cwiki.apache.org/confluence/display/KAFKA/KIP-303%3A+Add+Dynamic+Routing+in+Streams+Sink
Note that Kafka Streams requires sink topics to be created manually; Kafka Streams does not create any sink topic for you. As mentioned by @Hemant, you could turn on auto topic creation. However, it's not recommended in general (one reason is that you might want different configs for different topics, but via auto creation all would be created with the same default config).
Also note that a rogue application could DDoS your Kafka cluster if auto topic creation is enabled, by sending "bad data" into the application and thus creating hundreds or thousands of topics (by specifying a different topic name for each message). Thus, enabling auto topic creation is risky, and it's recommended to create topics manually instead.
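
A minimal sketch of both options, assuming the routing field can be read from the value (MyValue, getField(), the topic names, and default Serdes for these types are placeholder assumptions):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class RoutingSketch {

    static void buildTopology(StreamsBuilder builder) {
        KStream<String, MyValue> stream = builder.stream("input-topic");

        // Option 1 (Kafka 1.1): branch() into substreams, one to() per substream.
        // The target topics (Topic_A, Topic_B) must already exist.
        KStream<String, MyValue>[] branches = stream.branch(
                (key, value) -> "A".equals(value.getField()),
                (key, value) -> "B".equals(value.getField()));
        branches[0].to("Topic_A");
        branches[1].to("Topic_B");

        // Option 2 (Kafka 2.0+, instead of option 1): a TopicNameExtractor picks the
        // topic per record ("dynamic routing"); the topics still have to be created manually.
        stream.to((key, value, recordContext) -> "Topic_" + value.getField());
    }

    // Placeholder value type for the sketch.
    public static class MyValue {
        private String field;
        public String getField() { return field; }
    }
}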

Dynamically connecting a Kafka input stream to multiple output streams

Is there functionality built into Kafka Streams that allows for dynamically connecting a single input stream to multiple output streams? KStream.branch allows branching based on true/false predicates, but this isn't quite what I want. I'd like each incoming log to determine the topic it will be streamed to at runtime, e.g., a log {"date": "2017-01-01"} will be streamed to the topic topic-2017-01-01 and a log {"date": "2017-01-02"} will be streamed to the topic topic-2017-01-02.
I could call forEach on the stream, then write to a Kafka producer, but that doesn't seem very elegant. Is there a better way to do this within the Streams framework?
If you want to create topics dynamically based on your data, you do not get any support within Kafka's Streams API at the moment (v0.10.2 and earlier). You will need to create a KafkaProducer and implement the dynamic "routing" yourself (for example using KStream#foreach() or KStream#process()). Note that you need to do synchronous writes to avoid data loss (which unfortunately are not very performant). There are plans to extend the Streams API with dynamic topic routing, but there is no concrete timeline for this feature right now.
There is one more consideration you should take into account. If you do not know your destination topic(s) ahead of time and just rely on the so-called "topic auto creation" feature, you should make sure that those topics are being created with the desired configuration settings (e.g., number of partitions or replication factor).
As an alternative to "topic auto creation", you can also use the Admin Client (available since v0.10.1) to create topics with the correct configuration. See https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations
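
A rough sketch of the foreach-plus-producer approach described above; the date extraction, topic naming, and producer configuration are illustrative assumptions:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.kstream.KStream;

public class DynamicRoutingSketch {

    static void routeByDate(KStream<String, String> stream) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // In practice the producer would be created once and closed on shutdown.
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        stream.foreach((key, value) -> {
            // Hypothetical helper: pull the "date" field out of the JSON value to build the topic name.
            String topic = "topic-" + extractDate(value);
            try {
                // Synchronous send (blocking on the future) to avoid data loss, at a throughput cost.
                producer.send(new ProducerRecord<>(topic, key, value)).get();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
    }

    // Placeholder for real JSON parsing.
    private static String extractDate(String jsonValue) {
        return jsonValue.replaceAll(".*\"date\"\\s*:\\s*\"([^\"]+)\".*", "$1");
    }
}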

Java- Getting Json by parts

Generally, is there a way to get a big JSON string in a single request, by parts?
For example, if I have a JSON string consisting of three big objects, each about 1 MB in size, can I somehow, in a single request, get the first 1 MB and parse it while the remaining objects are still being downloaded, instead of waiting for the full 3 MB string to download?
If you know how big the parts are, it would be possible to split your request into three using HTTP/1.1 range requests. Assuming your ranges are defined correctly, you should be able to get the JSON objects directly from the server (if the server supports range requests).
Note that this hinges on a) the server's capability to handle range requests, b) the idempotency of your REST operation (it could very well run the call three times; a cache or reverse proxy may help with this), and c) your ability to know the ranges before you call.
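
A minimal sketch of one such range request, assuming Java 11+ and a server that honours Range headers (the URL and byte range are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RangeRequestSketch {

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Ask only for the first 1 MB of the resource.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/big.json"))
                .header("Range", "bytes=0-1048575")
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // 206 Partial Content means the server honoured the range;
        // 200 means it ignored the header and sent the whole body.
        if (response.statusCode() == 206) {
            String firstPart = response.body();
            // parse firstPart while issuing further range requests for the remaining bytes
            System.out.println("got " + firstPart.length() + " bytes of partial content");
        }
    }
}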
