I have the following code
//Kafka Config setup
Properties props = ...; //setup
List<String> topicList = Arrays.asList({"A", "B", "C"});
StreamBuilder builder = new StreamBuilder();
KStream<String, String> source = builder.stream(topicList);
source
.map((k,v) -> { //busy code for mapping data})
.transformValues(new MyGenericTransformer());
.to((k,v,r) -> {//busy code for topic routing});
new KafkaStream(builder.build(), properties).start();
My Problem : When I add more than one topic to subscribe to (ie A,B,C in above) the Kstream code stops receiving records.
References : https://kafka.apache.org/10/javadoc/org/apache/kafka/streams/StreamsBuilder.html
Relevant Documentation
public <K,V> KStream<K,V> stream(java.util.Collection<java.lang.String> topics)
"If multiple topics are specified there is no ordering guarantee for records from different topics."
What I'm trying to achieve : Have one Kstream (ie 'source' from above) consume/process from multiple topics.
Do the topics share the same key?
Note that the specified input topics must be partitioned by key. If
this is not the case it is the user's responsibility to repartition
the data before any key based operation (like aggregation or join) is
applied to the returned KStream.
this maybe your blocker.
Another possible issue maybe the consumer group used.
Related
With the addition of Headers to the records (ProducerRecord & ConsumerRecord) in Kafka 0.11, is it possible to get these headers when processing a topic with Kafka Streams? When calling methods like map on a KStream it provides arguments of the key and the value of the record but no way I can see to access the headers. It would be nice if we could just map over the ConsumerRecords.
ex.
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> stream = kStreamBuilder.stream("some-topic");
stream
.map((key, value) -> ... ) // can I get access to headers in methods like map, filter, aggregate, etc?
...
something like this would work:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, String> stream = kStreamBuilder.stream("some-topic");
stream
.map((record) -> {
record.headers();
record.key();
record.value();
})
...
Records headers are accessible since versions 2.0.0 (cf. KIP-244 for details).
You can access record metadata via the Processor API (ie, via transform(), transformValues(), or process()), by the given "context" object (cf. https://docs.confluent.io/current/streams/developer-guide/processor-api.html#accessing-processor-context).
Update
As of 2.7.0 release, the Processor API was improved (cf. KIP-478), adding a new type-safe api.Processor class with process(Record) instead of process(K, V) method. For this case, headers (and record metadata) are accessible via the Record class).
This new feature is not yet available in "PAPI method of the DSL though (eg. KStream#process(), KStream#transform() and siblings).
+++++
Prior to 2.0, the context only exposes topic, partition, offset, and timestamp---but not headers that are in fact dropped by Streams on read in those older versions.
Metadata is not available at DSL level though. However, there is also work in progress to extend the DSL via KIP-159.
I am using name : kafka_2.12 version : 2.3.0. Based on the traffic/load I want to change the maximum partition number for a topic. Is it possible to make this kind of change once Kafka is up and can it be done by code?
Yes you could increase partition by code. Use AdminClient.createPartitions method.
AdminClients.createPartitions method API document
public abstract CreatePartitionsResult createPartitions(java.util.Map<java.lang.String,NewPartitions> newPartitions,CreatePartitionsOptions options)
Increase the number of partitions of the topics given as the keys of newPartitions according to the corresponding values. If partitions are increased for a topic that has a key, the partition logic or ordering of the messages will be affected.
This operation is not transactional so it may succeed for some topics while fail for others.
It may take several seconds after this method returns success for all the brokers to become aware that the partitions have been created. During this time, describeTopics(Collection) may not return information about the new partitions.
How to use:
public static void createPartitions(String topicName, int numPartitions) {
Properties props = new Properties();
props.put("bootstrap.servers","localhost:9092");
AdminClient adminClient = AdminClient.create(props);
Map<String, NewPartitions> newPartitionSet = new HashMap<>();
newPartitionSet.put(topicName, NewPartitions.increaseTo(numPartitions));
adminClient.createPartitions(newPartitionSet);
adminClient.close();
}
I have multi-schema Kafka Streams application that enriches a record via a join to a KTable, and then passes the enriched record along.
The input topic naming format is currently well defined but I'm changing this to a wildcard. I want to determine the input topic of each record, derive the output topic via regex replacement, and send it on.
E.g. While listening to event.raw.* a record comes in on event.raw.foo and I wish to pass it out on event.foo.
I realise I can get the input topics via the Processor API:
public class EnrichmentProcessor extends AbstractProcessor<String, GenericRecord> {
#Override
public void process(String key, GenericRecord value) {
//Do Join...
//Determine output topic and forward
String outputTopic = context().topic().replaceFirst(".raw.", ".");
context().forward(key, value, To.child(outputTopic));
context().commit();
}
}
But this doesn't help me when I'm trying to define my Topology because I have no way of knowing up front what my output topic is going to be.
InternalTopologyBuilder topologyBuilder = new InternalTopologyBuilder();
topologyBuilder.addSource("SOURCE", stringDeserializer, genericRecordDeserializer, "event.raw.*")
.addProcessor("ENRICHER", EnrichmentProcessor::new, "SOURCE")
.addSink("OUTPUT", outputTopic, stringSerializer, genericRecordSerializer, "ENRICHER"); // How can I register all possible output topics here?
Has anyone solved a situation like this before?
I know that if I had a list of possible output-topic names up front I could have multiple sinks defined on the topology but I'm not going to.
Is there a way I can define the topology to have dynamically allocated output topic names when I dont't have a hard coded list of possible output topic names up front?
This should be possible: You can use Topology#addSink(..., new TopicNameExtractor(){...}, ...) to dynamically set an output topic name. TopicNameExtractor has access to the RecordContext that allows you to get the input topic name via context.topic(). Hence, you should be able to compute the output topic name, base on the input topic name.
I am pretty new to Kafka and Kafka Streams so please bear with me. I would like to know if I am on the right track here.
I am writing to a Kafka topic at the moment and try to access the data through a rest service. The raw data kind of needs to be transformed before it will be accessed.
What I have so far is a producer that writes the raw data into a topic.
1.) Now I want streams App (should be a jar running in a container) that just transforms the data in my desired shape. Following the materialized view paradigm here.
Over simplified version of 1.)
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> source =
builder.stream("my-raw-data-topic");
KafkaStreams streams = new KafkaStreams(builder,props);
KTable<String, Long> t = source.groupByKey().count("My-Table");
streams.start();
2.) And another streams App (should be a jar running in a container) that justs holds the KTable as some sort of Repository which can be accessed via a wrapping rest service.
Here I am kind of stuck with the proper way to work with the api.
What is the bare minimun to access and query a KTable? Do I need to assign the transformation topology to the builder again?
KStreamBuilder builder = new KStreamBuilder();
KTable table = builder.table("My-Table"); //Casting?
KafkaStreams streams = new KafkaStreams(builder, props);
RestService service = new RestService(table);
// Use the Table as Repository which is wrapped by a Rest-Service and gets updated reactivly
Right now this is pseudo code
Am I on the right path here? Does is make sense to separate 1.) and 2.)? Is this the indented way to work with streams to materialize views? For me, it would have the benefit to scale up the writes and the reads via container independently where I see more traffic.
How is the repopulating of the KTable handled on a crash of either 1.) or 2.). Is this done via replication to the streaming api or is this something I would need to address via code. Like resetting the cursor and reply the events?
Couple of comments:
In your code snippet (1) you modify your topology after you handed the builder into the KafkaStreams constructor:
KafkaStreams streams = new KafkaStreams(builder,props);
// don't modify builder anymore!
You should not do this but first specify you topology and afterwards create the KafkaStreams instance.
About splitting you application into two. This can make sense to scale both parts independently. But it's hard to say in general. However, if you do spit both, the first one needs to write the transformed date into an output topic and the second one should read this output topic as a table (builder.table("output-topic-of-transformation") to serve the REST requests.
For accessing the store of the KTable, you need to get a query handle via the provided store name:
ReadOnlyKeyValueStore keyValueStore =
streams.store("My-Table", QueryableStoreTypes.keyValueStore());
See the docs for further details:
http://docs.confluent.io/current/streams/developer-guide.html#interactive-queries
I'm trying to find a way to re-order messages within a topic partition and send ordered messages to a new topic.
I have Kafka publisher that sends String messages of the following format:
{system_timestamp}-{event_name}?{parameters}
for example:
1494002667893-client.message?chatName=1c&messageBody=hello
1494002656558-chat.started?chatName=1c&chatPatricipants=3
Also, we add some message key for each message, to send them to the corresponding partition.
What I want to do is reorder events based on {system-timestamp} part of the message and within a 1-minute window, cause our publishers doesn't guarantee that messages will be sent in accordance with {system-timestamp} value.
For example, we can deliver to the topic, a message with a bigger {system-timestamp} value first.
I've investigated Kafka Stream API and found some examples regarding messages windowing and aggregation:
Properties streamsConfiguration = new Properties();
streamsConfiguration.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-sorter");
streamsConfiguration.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
streamsConfiguration.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, "localhost:2181");
streamsConfiguration.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
streamsConfiguration.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
KStreamBuilder builder = new KStreamBuilder();
KStream<String, String> stream = builder.stream("events");
KGroupedStream<String>, String> groupedStream = stream.groupByKey();//grouped events within partion.
/* commented since I think that I don't need any aggregation, but I guess without aggregation I can't use time windowing.
KTable<Windowed<String>, String> windowedEvents = stream.groupByKey().aggregate(
() -> "", // initial value
(aggKey, value, aggregate) -> aggregate + "", // aggregating value
TimeWindows.of(1000), // intervals in milliseconds
Serdes.String(), // serde for aggregated value
"test-store"
);*/
But what should I do next with this grouped stream? I don't see any 'sort() (e1,e2) -> e1.compareTo(e2)' methods available, also windows could be applied to methods like aggregation(), reduce() ,count() , but I think that I don't need any messages data manipulations.
How can I re-order message in the 1-minute window and send them to another topic?
Here's an outline:
Create a Processor implementation that:
in process() method, for each message:
reads the timestamp from the message value
inserts into a KeyValueStore using (timestamp, message-key) pair as the key and the message-value as the value. NB this also provides de-duplication. You'll need to provide a custom Serde to serialize the key so that the timestamp comes first, byte-wise, so that ranged queries are ordered by timestamp first.
in the punctuate() method:
reads the store using a ranged fetch from 0 to timestamp - 60'000 (=1 minute)
sends the fetched messages in order using context.forward() and deletes them from the store
The problem with this approach is that punctuate() is not triggered if no new msgs arrive to advance the "stream time". If this is a risk in your case, you can create an external scheduler that sends periodic "tick" messages to each(!) partition of your topic, that your processor should just ignore, but they'll cause punctuate to trigger in the absence of "real" msgs.
KIP-138 will address this limitation by adding explicit support for system-time punctuation:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-138%3A+Change+punctuate+semantics
Here is how I ordered streams in my project.
Created topology with source, processor, sink.
In Processor
process(key, value) -> Added each record to List(instance variable).
Init() -> schedule(WINDOW_BUFFER_TIME, WALL_CLOCK_TIME) -> punctuate (timestamp) sort list of items of window buffer time in List (instance variable) and iterate and forward. Clear List (instance variable).
This logic is working fine for me.