Apache Kafka - KafkaStream on topic/partition - java

I am writing a Kafka consumer for a high-volume, high-velocity distributed application. I have only one topic, but the rate of incoming messages is very high. Having multiple partitions served by more consumers would be appropriate for this use case, and the best way to consume is to have multiple stream readers. As per the documentation and the available samples, the number of KafkaStreams the ConsumerConnector gives out is based on the number of topics. I am wondering how to get more than one KafkaStream reader [based on the partition], so that I can spawn one thread per stream, or whether reading from the same KafkaStream in multiple threads would give concurrent reads from multiple partitions?
Any insights are much appreciated.

I would like to share what I found on the mailing list:
The number you pass in the topic map controls how many streams a topic is divided into. In your case, if you pass in 1, all 10 partitions' data will be fed into 1 stream. If you pass in 2, each of the 2 streams will get data from 5 partitions. If you pass in 11, 10 of them will each get data from 1 partition and 1 stream will get nothing.
Typically, you need to iterate each stream in its own thread, because each stream can block forever if there is no new event.
Sample snippet:
// request one stream per partition for the topic
Map<String, Integer> topicCount = new HashMap<>();
topicCount.put(msgTopic, partitionCount);
Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams = connector.createMessageStreams(topicCount);
List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(msgTopic);
for (final KafkaStream<byte[], byte[]> stream : streams) {
    ReadTask task = new ReadTask(stream, msgTopic);
    task.addObserver(this.msgObserver);
    tasks.add(task);
    executor.submit(task); // one thread per stream
}
Reference: http://mail-archives.apache.org/mod_mbox/incubator-kafka-users/201201.mbox/%3CCA+sHyy_Z903dOmnjp7_yYR_aE2sRW-x7XpAnqkmWaP66GOqf6w#mail.gmail.com%3E

The recommended way to do this is to use a thread pool, so Java handles the organisation for you, and to consume each stream that the createMessageStreamsByFilter method gives you in its own Runnable. For example:
int NUMBER_OF_PARTITIONS = 6;

Properties consumerConfig = new Properties();
consumerConfig.put("zk.connect", "zookeeper.mydomain.com:2181");
consumerConfig.put("backoff.increment.ms", "100");
consumerConfig.put("autooffset.reset", "largest");
consumerConfig.put("groupid", "java-consumer-example");
ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(consumerConfig));

TopicFilter sourceTopicFilter = new Whitelist("mytopic|myothertopic");
List<KafkaStream<Message>> streams = consumer.createMessageStreamsByFilter(sourceTopicFilter, NUMBER_OF_PARTITIONS);

ExecutorService executor = Executors.newFixedThreadPool(streams.size());
for (final KafkaStream<Message> stream : streams) {
    executor.submit(new Runnable() {
        public void run() {
            for (MessageAndMetadata<Message> msgAndMetadata : stream) {
                ByteBuffer buffer = msgAndMetadata.message().payload();
                byte[] bytes = new byte[buffer.remaining()];
                buffer.get(bytes);
                // Do something with the bytes you just got off Kafka.
            }
        }
    });
}
In this example I asked for 6 streams because I know each topic has 3 partitions and I listed two topics in my whitelist. Once we have handles on the incoming streams we can iterate over their content, which consists of MessageAndMetadata objects; the metadata is really just the topic name and offset. As you discovered, you can do it in a single thread if you ask for 1 stream instead of the 6 in my example, but if you require parallel processing the clean way is to launch an executor with one thread per returned stream.

/**
 * @param source source KStream whose records are sunk to the output topic
 */
private static void pipe(KStream<String, String> source) {
    source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {
        @Override
        public Integer partition(String key, String value, int numPartitions) {
            return 0;
        }
    }, "output-topic");
}
The code above writes every record to partition 0 of the topic "output-topic", since the partitioner always returns 0.
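If the goal is instead to spread records across all partitions of "output-topic", the partitioner can derive the target partition from the key. A minimal sketch, assuming the same (older) KStream.to overload used above and non-null String keys:
private static void pipeByKey(KStream<String, String> source) {
    source.to(Serdes.String(), Serdes.String(), new StreamPartitioner<String, String>() {
        @Override
        public Integer partition(String key, String value, int numPartitions) {
            // non-negative hash of the key, spread over the available partitions
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }, "output-topic");
}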

Related

Apache Beam Split to Multiple Pipeline Output

I am subscribing to one topic that contains different event types, which come in with different attributes.
After I read an element, I need to route it to a different place based on its attributes. This is what the sample code looks like:
Options options =PipelineOptionsFactory.fromArgs(args).withValidation().as(Options.class);
Pipeline pipeline = Pipeline.create(options);
pipeline
.apply(
"ReadType1",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type1");
}
}))
.apply(
"WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getRunOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType2",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type2");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getBatchOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.apply("ReadType3",
EventIO.<T>readJsons().of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply(Filter.by(new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type3");
}
})).apply( "WindowMetrics",
Window.into(FixedWindows.of(Duration.standardSeconds(options.getWindowDuration()))))
.apply("AsJsons", AsJsons.of(T.class))
.apply(
"Write File(s)",
TextIO.write()
.withWindowedWrites()
.withNumShards(options.getNumShards())
.to(
new WindowedFilenamePolicy(
options.getCustomIntervalOutputDirectory(),
options.getUseCurrentDateForOutputDirectory(),
options.getOutputFilenamePrefix(),
options.getOutputShardTemplate(),
options.getOutputFilenameSuffix()))
.withTempDirectory(
NestedValueProvider.of(
options.getTempDirectory(),
(SerializableFunction<String, ResourceId>)
input -> FileBasedSink.convertToFileResourceIfPossible(input))));
pipeline.run();
Basically I read an event, filter it on its attribute, and write the file. The job failed in Dataflow with: "Workflow failed. Causes: The pubsub configuration contains errors: Subscription 'sub-name' is consumed by multiple stages, this will result in undefined behavior."
So what is the appropriate way to split the pipeline within the same job?
I tried creating Pipeline1, Pipeline2 and Pipeline3, but that ends up requiring multiple job names to run multiple pipelines, and I am not sure that is the right way to do it.
The multiple EventIO transforms reading from the same subscription are the cause of the error. You need to consolidate them into a single read for this to work. This can be done by consuming the subscription into a single PCollection and then applying separate filtering branches to that collection. Here is a partial example:
// single PCollection of the events consumed from the subscription
PCollection<T> events = pipeline
.apply("Read Events",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options));
// PCollection of type1 events
PCollection<T> typeOneEvents = events.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type1");
}}));
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")
// PCollection of type2 events
PCollection<T> typeTwoEvents = events.apply(
Filter.by(
new SerializableFunction<T, Boolean>() {
@Override
public Boolean apply(T input) {
return input.attributes.get("type").equals("type2");
}}));
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
Another possibility is to use other transforms provided by Apache Beam, which might simplify your solution a little. One such transform is Partition, which splits a single PCollection into a fixed number of PCollections based on a partitioning function. A partial example using Partition:
// single PCollection of the events consumed from the subscription
PCollectionList<T> eventsByType = pipeline
.apply("Read Events",
EventIO.<T>readJsons()
.of(T.class)
.withPubsubTimestampAttributeName(null)
.withOptions(options))
.apply("Partition By Type",
Partition.of(2, new PartitionFn<T>() {
    @Override
    public int partitionFor(T event, int numPartitions) {
        return event.attributes.get("type").equals("type1") ? 0 : 1;
    }}));
PCollection<T> typeOneEvents = eventsByType.get(0);
// TODO typeOneEvents.apply("WindowMetrics / AsJsons / Write File(s)")
PCollection<T> typeTwoEvents = eventsByType.get(1);
// TODO typeTwoEvents.apply("WindowMetrics / AsJsons / Write File(s)")
The answer should be using Partition in Beam.
https://beam.apache.org/documentation/transforms/java/elementwise/partition/

Multithreading in a hashmap entry loop - Executor service (Java)

I have a map (HashMap<String, Map<String, String>> mapTest) over which I loop and do several operations.
HashMap<String, Map<String, String>> mapTest = new HashMap<>();
ArrayList<Object> testFinals = new ArrayList<>();
for (Map.Entry<String, Map<String, String>> entry : mapTest.entrySet()) {
// in here I do a lot of things: nested for loops, ifs, etc.
// the end result is that each iteration of the loop produces this:
List<Object> resultExp = methodXYZ(String, String, String);
testFinals.addAll(resultExp);
}
At this point I have to wait before proceeding, since I need testFinals fully populated before advancing.
Now, what I need to do is this:
mapTest can have around 400 entries to iterate over.
I want to schedule 4 threads and assign roughly 100 entries of that for loop to thread 1, the next 100 entries of mapTest to thread 2, and so on.
I already tried a few solutions, like this one:
ExecutorService taskExecutor = Executors.newFixedThreadPool(4);
while(...) {
taskExecutor.execute(new MyTask());
}
taskExecutor.shutdown();
try {
taskExecutor.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
} catch (InterruptedException e) {
...
}
but I can't correctly adapt this (or a similar working solution) to the map iteration I have now.
HashMap is not a thread-safe data structure.
When using concurrency, consider that threads must obtain, hold and relinquish locks on a variable.
This is done at the field level, not per entry.
In short: the HashMap is locked for access by a specific thread, not some random entry.
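That said, one way to split the loop above into chunks and wait for all of them is sketched below; it assumes the map is only read (not modified) while the tasks run, and methodXYZ stands in for the per-entry work from the question:
import java.util.*;
import java.util.concurrent.*;

private static List<Object> processInParallel(HashMap<String, Map<String, String>> mapTest)
        throws InterruptedException, ExecutionException {
    ExecutorService taskExecutor = Executors.newFixedThreadPool(4);
    List<Map.Entry<String, Map<String, String>>> entries = new ArrayList<>(mapTest.entrySet());
    int chunkSize = Math.max(1, (entries.size() + 3) / 4); // roughly 100 entries per task for 400 rows

    List<Callable<List<Object>>> tasks = new ArrayList<>();
    for (int start = 0; start < entries.size(); start += chunkSize) {
        final List<Map.Entry<String, Map<String, String>>> chunk =
                entries.subList(start, Math.min(start + chunkSize, entries.size()));
        tasks.add(() -> {
            List<Object> partial = new ArrayList<>();
            for (Map.Entry<String, Map<String, String>> entry : chunk) {
                // do the per-entry work from the original loop here, e.g.
                // partial.addAll(methodXYZ(...));
            }
            return partial;
        });
    }

    // invokeAll blocks until every task has finished, so the merged result is complete afterwards
    List<Object> testFinals = new ArrayList<>();
    for (Future<List<Object>> future : taskExecutor.invokeAll(tasks)) {
        testFinals.addAll(future.get());
    }
    taskExecutor.shutdown();
    return testFinals;
}
Each chunk returns its partial list, and the merge happens only after invokeAll returns, which gives the "wait until testFinals is fully filled" behaviour from the question.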

How to loop and limit the number of items fetched each time in an Observable

I have the following Observable which receives Kafka consumer records and inserts them into a database.
It currently works in that I receive the data as expected in the consumer, extract it, perform some mapping and put it into a list.
Data in this list is then inserted into the DB.
The way it is written right now, it will attempt to insert everything at the same time, and the Kafka records can hold between 100k and 1 million entries.
I am looking for a way to break this up so that I only take 1000 items from the consumer records, insert them into the DB, and repeat for the next 1000 items until the records are exhausted. Is this possible?
I attempted variations of take and takeUntil with repeat, but they do not work: after I subscribe, the call just ends and does not even enter the observable.
Could I get some advice on how to write this so that I fetch 1000 records at a time, insert them into the DB, and keep doing this until all Kafka records are done? Thanks.
Please note I am using RxJava 1 and need to stick to this version.
private final static AtomicInteger INSERT_COUNT = new AtomicInteger(1000);
private final static AtomicInteger RECORD_COUNT = new AtomicInteger();
private final static AtomicInteger REMAINDER = new AtomicInteger();
private final static AtomicInteger REPEAT_COUNT = new AtomicInteger();
public Observable<KafkaConsumerRecord<String, CustomObj>> dbInsert(KafkaConsumerRecords<String, CustomObj> records) {
return Observable.just(records.getDelegate().records())
// attempting to loop based on the following counts. Not preferred, but unsure of a better way.
// the figures captured here are correct.
// this doesn't currently matter anyway, because I am not able to get it to work using takeUntil/repeat.
.doOnSubscribe(() -> {
RECORD_COUNT.set(records.getDelegate().records().count());
REMAINDER.set(RECORD_COUNT.get() % INSERT_COUNT.get() == 0 ? 0 : 1);
REPEAT_COUNT.set((RECORD_COUNT.get() / INSERT_COUNT.get()) + REMAINDER.get());
})
.map(consumerRecords -> consumerRecords.records("Topic name"))
.map(it -> {
List<CustomRequest> requests = new ArrayList<>();
it.forEach(r -> {
ConsumerRecord<String, SomeObj> record = (ConsumerRecord<String, SomeObj>) r;
CustomRequest request = new CustomRequest (
new String(record.headers().headers("id").iterator().next().value(), StandardCharsets.UTF_8),
Long.parseLong(new String(record.headers().headers("code").iterator().next().value(), StandardCharsets.UTF_8)),
record.value()
);
requests.add(request);
});
return requests;
})
// nothing happens if I uncomment these.
// .takeUntil(customRequests -> customRequests.size() == INSERT_COUNT.get())
// .repeat(REPEAT_COUNT.get())
.doOnNext(customRequests -> {
// planning to do some db inserts here in a transaction of 1000 inserts at a time.
})
.doOnCompleted(() -> System.out.println("Completed"));
}
The following should work with RxJava 1.3.8
rx.Observable.from(List.of(1, 2, 3, 4, 5, 6))
.buffer(2)
.doOnNext(r -> System.out.println(r))
.subscribe();
The output was:
[1, 2]
[3, 4]
[5, 6]
I used the following version to test the above code:
<dependency>
<groupId>io.reactivex</groupId>
<artifactId>rxjava</artifactId>
<version>1.3.8</version>
</dependency>
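Applied to the dbInsert pipeline in the question, a rough sketch could look like this; toCustomRequest and insertBatch are hypothetical helpers standing in for the header/value mapping and DB code already shown above:
rx.Observable.from(records.getDelegate().records().records("Topic name"))
        .map(record -> toCustomRequest(record)) // same header/value mapping as in the question
        .buffer(1000)                           // emits List<CustomRequest> batches of at most 1000 items
        .doOnNext(batch -> insertBatch(batch))  // one DB transaction per batch
        .doOnCompleted(() -> System.out.println("Completed"))
        .subscribe();
buffer collects items until the batch size is reached (or the source completes), so the last batch may hold fewer than 1000 requests.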

Kafka streams, branched output to multiple topics

In my DSL-based transformation I have a stream --> branch, where I want the branched output redirected to multiple topics.
The current branch.to() method accepts only a String.
Is there any simple option with stream.branch where I can route the result to multiple topics? With a consumer I can subscribe to multiple topics by providing an array of strings as topics.
My problem requires me to take multiple actions when a particular predicate is satisfied.
I tried stream.branch[index].to(string), but this is not sufficient for my requirement. I am looking for something like stream.branch[index].to(string array of topics).
Is there a branch.to method that takes multiple topics, or is there an alternate way to achieve the same with streams?
Adding sample code; I have removed the actual variable names.
My predicates:
Predicate<String, MyDomainObject> Predicate1 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // ... some condition on domObj sets result ...
        return result;
    }
};
Predicate<String, MyDomainObject> Predicate2 = new Predicate<String, MyDomainObject>() {
    @Override
    public boolean test(String key, MyDomainObject domObj) {
        boolean result = false;
        // ... some condition on domObj sets result ...
        return result;
    }
};
KStream<String, MyDomainObject>[] branches = myStream.branch(Predicate1, Predicate2);

// here I need your suggestions.
// this is my current implementation
branches[0].to(singleTopic, Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));

// I want to send the notification to multiple topics, something like below
branches[0].to(topicList, Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
If you know to which topics you want to send the data, you can do the following:
branches[0].to("first-topic");
branches[0].to("second-topic");
// etc.
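If the target topics are held in a list (like topicList in the question), the same idea can be written as a loop; a minimal sketch reusing the Produced serdes from the question:
for (String topic : topicList) {
    branches[0].to(topic, Produced.with(Serdes.String(), Serdes.serdeFrom(inSer, deSer)));
}
Each to() call adds another sink, so every record that matches Predicate1 is written to all topics in the list.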

Kafka Consumer : controlled reading from topic

I have the Kafka consumer code below, where 3 threads read from a Kafka topic that has 3 partitions.
Is there any way to have a new message read from the Kafka topic only after the messages currently being processed by the threads are done?
For example, let's say there are 100 messages in the topic: is there any way to read and process only 3 messages at a time, and only read the next 3 once those 3 have been processed, and so on?
public void run(int a_numThreads) {
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(topic, new Integer(a_numThreads));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);
// now launch all the threads
//
executor = Executors.newFixedThreadPool(3);
// now create an object to consume the messages
//
int threadNumber = 0;
for (final KafkaStream stream : streams) {
executor.submit(new ConsumerTest(stream, threadNumber));
threadNumber++;
}
}
If the iterator inside ConsumerTest processes messages synchronously, then only 3 messages will be consumed at a time. enable.auto.commit is true by default; make sure you do not set it to false, otherwise you need to add logic for committing offsets.
For example:
ConsumerIterator<byte[], byte[]> streamIterator = stream.iterator();
while (streamIterator.hasNext()) {
    String kafkaMsg = new String(streamIterator.next().message());
    // process kafkaMsg here; the next message is only pulled once this iteration finishes
}
Well, the consumers do not know about each other by default, so they cannot "sync" their work. What you could do is either wrap your three messages into one (thus guaranteeing they will all be handled in order) or maybe introduce more ("sub") topics.
Another possibility (if you really need to guarantee that your three messages are consumed by individual consumers) might be to have all your consumers synchronize their work, or notify a controller which tracks it.
But to be honest it feels like you are "doing it wrong": messages in a queue are meant to be stateless, and only their order within a topic determines the order in which they should be processed. When the messages are processed should not matter.
