Kafka streams not using serde after repartitioning - java

My Kafka Streams application is consuming from a Kafka topic that uses the following key-value layout:
String.class -> HistoryEvent.class
Printing the topic with the console consumer confirms this:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic flow-event-stream-file-service-test-instance --property print.key=true --property key.separator=" -- " --from-beginning
flow1 -- SUCCESS #C:\Daten\file-service\in\crypto.p12
"flow1" is the String key and the part after -- is the serialized value.
My flow is set up like this:
KStream<String, HistoryEvent> eventStream = builder.stream(applicationTopicName,
        Consumed.with(Serdes.String(), historyEventSerde));

eventStream.selectKey((key, value) -> new HistoryEventKey(key, value.getIdentifier()))
        .groupByKey()
        .reduce((e1, e2) -> e2,
                Materialized.<HistoryEventKey, HistoryEvent, KeyValueStore<Bytes, byte[]>>as(streamByKeyStoreName)
                        .withKeySerde(new HistoryEventKeySerde()));
So as far as I know, I am telling it to consume the topic using the String and HistoryEvent serdes, as this is what is in the topic. I then 'rekey' it to use a combined key, which should be stored locally using the provided serde for HistoryEventKey.class. As far as I understand, this will cause an additional topic to be created with the new key (it can be seen with the topic list in the Kafka container). This is fine.
Now the problem is that the application is unable to start up, even from a clean environment with just that one record in the topic:
org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=0_0, processor=KSTREAM-SOURCE-0000000000, topic=flow-event-stream-file-service-test-instance, partition=0, offset=0
Caused by: org.apache.kafka.streams.errors.StreamsException: A serializer (key: org.apache.kafka.common.serialization.StringSerializer / value: HistoryEventSerializer) is not compatible to the actual key or value type (key type: HistoryEventKey / value type: HistoryEvent). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters.
It is hard to tell from the message where exactly the issue is. It mentions my base topic, but that cannot be it, as the key there is not of type HistoryEventKey. Since I have provided a serde for HistoryEventKey in the reduce, it also cannot be the local store.
The only thing that makes sense to me is that it is related to the selectKey operation, which causes a repartition and a new topic. However, I cannot figure out how to provide the serde to that operation. I do not want to set it as a default, because it is not the default key serde.

After some more debugging of the execution I was able to figure out that the new topic is created in the groupByKey step. You can provide a Grouped instance, which offers the possibility to specify the serdes used for key and value:
eventStream.selectKey((key, value) -> new HistoryEventKey(key, value.getIdentifier()))
        .groupByKey(Grouped.<HistoryEventKey, HistoryEvent>as(null)
                .withKeySerde(new HistoryEventKeySerde())
                .withValueSerde(new HistoryEventSerde()))
        .reduce((e1, e2) -> e2,
                Materialized.<HistoryEventKey, HistoryEvent, KeyValueStore<Bytes, byte[]>>as(streamByKeyStoreName)
                        .withKeySerde(new HistoryEventKeySerde()));
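If you do not need to name the repartition topic, Grouped.with achieves the same in a single call (a sketch using the same serde classes as above):
.groupByKey(Grouped.with(new HistoryEventKeySerde(), new HistoryEventSerde()))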

I've encountered a very similar error message, yet I had no groupBys, but joins instead. I'm posting here for the next person who googles around.
org.apache.kafka.streams.errors.StreamsException: ClassCastException while producing data to topic my-processor-KSTREAM-MAP-0000000023-repartition. A serializer (key: org.apache.kafka.common.serialization.StringSerializer / value: org.apache.kafka.common.serialization.StringSerializer) is not compatible to the actual key or value type (key type: java.lang.String / value type: com.mycorp.mySession). Change the default Serdes in StreamConfig or provide correct Serdes via method parameters (for example if using the DSL, `#to(String topic, Produced<K, V> produced)` with `Produced.keySerde(WindowedSerdes.timeWindowedSerdeFrom(String.class))`).
Clearly, same as in the original question, I did not want to change the default serdes.
In my case the solution was to pass a Joined instance to the join, which allows passing in the serdes. Note that the error message points to a ...-MAP-...-repartition topic, which is a bit of a red herring, because the fix goes somewhere else.
How I fixed it (a Joined example):
// ...omitted ...
KStream<String, MySession> mySessions = myStream
        .map((k, v) -> {
            MySession s = new MySession(v);
            return new KeyValue<>(s.makeKey(), s);
        });
// ^ the mapping causes the repartition; you cannot, however, specify a serde in there.
// But in the join right below, we can pass a Joined instance and fix it.
return mySessions
        .leftJoin(
                myTable,
                (session, info) -> {
                    session.infos = info;
                    return session;
                },
                // explicit type arguments are needed for the chained setters;
                // MyInfo stands for myTable's value type (name not from the original code)
                Joined.<String, MySession, MyInfo>as("my_enriched_session")
                        .withKeySerde(Serdes.String())
                        .withValueSerde(new MySessionSerde())
        );
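Alternatively, the serdes can be bundled in one call via Joined.with (a sketch under the same assumptions; passing null for the table's value serde falls back to the configured default):
Joined.with(Serdes.String(), new MySessionSerde(), null) // key serde, this stream's value serde, table's value serde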

Related

Finding min/max values in a KafkaStream (KStream) object

I have a Kafka Streams application and Avro schemas for each of the topics, and also for the key. The key schema is the same for all topics.
Now there is a KStream object with the known key object as the key and a value object (derived from the Avro schema) that extends org.apache.avro.specific.SpecificRecordBase, but it could be any of my Avro schemas for the topic content.
KStream<CustomKey, ? extends SpecificRecordBase> myStream = ...
What I want to achieve is to run min and max functions on this stream. The problem is that I don't know what the ? type is, and as there are 30+ topics (and the number will increase in the future), I don't want to do a switch-case. So I have the following:
public KStream<CustomKey, ? extends SpecificRecordBase> max(
        final KStream<CustomKey, ? extends SpecificRecordBase> myStream,
        final String attributeName) {
    SpecificRecordBase maxValue = ...;
    myStream.foreach((key, value) -> {
        value.get(attributeName); // I want to find the max value for this attribute,
                                  // but at this point we don't know its type, and we
                                  // can't assign maxValue = value because this is a lambda.
    });
    // find and return the max value
}
My question is, how can I calculate the max value for the myStream on the attributeName attribute?
it could be any of my avro schemas for the topic content
Then you need ? extends ClassWithMinMaxFields (some shared type that exposes those fields). Otherwise, you will be unable to extract them from a generic SpecificRecordBase object.
Also, your method returns a stream. You cannot return the min/max. If that is your objective, you need a plain consumer to scan the whole topic, beginning to (eventual) end.
To do this (correctly) with the Streams API, you would either
need to build a KTable for every value, grouped by key, then do a table scan for the min/max as you need them, or
create a new topic using the aggregate DSL function, initialized with {"min": +Inf, "max": -Inf}; on each new record you compare old vs. new, and if you have a new min and/or max you set them and return the new aggregate (sketched below). Then you still need an external consumer to fetch the most recent min/max events.
If you had a consistent Avro type, you could use ksqlDB functions.
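A minimal sketch of the aggregate option, assuming the attribute is numeric; MinMax (a small holder class with double fields min and max), customKeySerde and minMaxSerde are illustrative names, not part of the original post:
// cast once so the lambda sees SpecificRecordBase rather than a wildcard capture
@SuppressWarnings("unchecked")
KStream<CustomKey, SpecificRecordBase> records = (KStream<CustomKey, SpecificRecordBase>) myStream;

KTable<CustomKey, MinMax> minMaxTable = records
        .groupByKey()
        .aggregate(
                () -> new MinMax(Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY),
                (key, value, agg) -> {
                    // works for any SpecificRecordBase as long as the attribute is numeric
                    double v = ((Number) value.get(attributeName)).doubleValue();
                    return new MinMax(Math.min(agg.min, v), Math.max(agg.max, v));
                },
                Materialized.with(customKeySerde, minMaxSerde));

// minMaxTable.toStream().to("min-max-topic") would publish the running min/max per key,
// which an external consumer can then read.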

Why doesn't Mono.fromRunnable return Mono<Void>?

Runnable returns void in Java. Why does Mono.fromRunnable return Mono<T> instead of Mono<Void>?
The API documentation of Mono#fromRunnable states about the type parameter:
The generic type of the upstream, which is preserved by this operator
This allows using it as part of an operation chain without altering the resulting type.
Example:
This code:
Mono<String> myMono = Mono.empty();
myMono = myMono.switchIfEmpty(Mono.fromRunnable(()
-> System.err.println("WARNING, empty signal !")));
String value = myMono.block(Duration.ofSeconds(1));
System.out.println("Exported value is "+value);
Produces:
WARNING, empty signal !
Exported value is null
The code above compiles fine and provides a Mono of String without having to add additional casts.
The posted example is not very good, but I suppose this signature allows using fromRunnable to launch side effects in some cases without disturbing the value type of the overall operation chain.
It's kind of like Mono.empty() with additional computation embedded in it.
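And if you do want a Mono<Void>, the generic signature still allows it; the target type simply drives the inference (a trivial example):
Mono<Void> done = Mono.fromRunnable(() -> System.err.println("side effect only"));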

Kafka streams - grouping by value property?

I have a stream that comes in with:
Key: { "Symbol": "xxx" }
Value: { "Date": "2019-01-01", ... }
So I want to group by Symbol and then by Value.Date in 5-day blocks, i.e. 01-01 -> 01-05.
KStream<Key, Value> stream = kStreamBuilder.stream(...);
stream.groupBy((key, value) -> key.getSymbol())
So I've got the stream, and as a first step I group by Key.Symbol. I'm not really sure where to go from here; any pointers would be appreciated.
You could use a custom timestamp extractor that returns the timestamp from the value, i.e., implement the TimestampExtractor interface and specify your class via the default.timestamp.extractor configuration parameter (cf. https://docs.confluent.io/current/streams/developer-guide/config-streams.html#default-timestamp-extractor).
This allows you to use tumbling time windows based on the extracted timestamp via:
groupBy(...).windowedBy(TimeWindows.of(Duration.ofDays(5))).aggregate(...)
See the docs for more details: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#tumbling-time-windows
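A minimal sketch of such an extractor, assuming the value class is called Value and exposes the date as epoch milliseconds via a hypothetical getDateMillis() accessor:
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

public class ValueDateTimestampExtractor implements TimestampExtractor {
    @Override
    public long extract(final ConsumerRecord<Object, Object> record, final long partitionTime) {
        final Object value = record.value();
        if (value instanceof Value) {
            return ((Value) value).getDateMillis(); // hypothetical accessor for the Date field
        }
        return partitionTime; // fall back if the value carries no usable timestamp
    }
}

// registered via:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, ValueDateTimestampExtractor.class);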

key definition for 'fetch.message.max.bytes' in Kafka

I am not sure how to define the key for the message size of my KafkaSpouts.
My example:
Map<String, Object> props = new HashMap<>();
props.put("fetch.message.max.bytes", "2097152"); // 2MB
props.put(KafkaSpoutConfig.Consumer.GROUP_ID, group);
I searched for the constant key definition of "fetch.message.max.bytes" without success.
I expected this key to be in KafkaSpoutConfig.Consumer or at least in KafkaSpoutConfig.
Does anyone know the correct location?
Storm's KafkaSpout does not offer all available keys as predefined members. However, if you know the name of the key, you can safely use a String (as shown in your example) or use a Kafka class that defines the key.
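For example (a sketch; whether the constant applies depends on your consumer version — with the newer Kafka consumer API the closest setting is max.partition.fetch.bytes, defined as a constant in org.apache.kafka.clients.consumer.ConsumerConfig):
Map<String, Object> props = new HashMap<>();
// plain String key, exactly as in the question
props.put("fetch.message.max.bytes", "2097152"); // 2MB
// or, on the new consumer API, the analogous per-partition limit has a predefined constant:
props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 2097152);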

Apache Flink : Extract TypeInformation of Tuple

I am using FlinkKafkaConsumer09, for which I have a ByteArrayDeserializationSchema implementing KeyedDeserializationSchema<Tuple2<byte[], byte[]>>. Now, in getProducedType, how do I extract the TypeInformation?
I read in the documentation that the TypeExtractor.getForClass method does not support ParameterizedTypes. Which method of TypeExtractor should I use to achieve this?
I think we have to use the createTypeInfo method; can you please tell me how to use it to return the TypeInformation?
If the returned type of your deserialization schema is a byte[] then you can use PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO as the return value of getProducedType.
If you want a TypeInformation of Tuple2<byte[], byte[]> :
return TupleTypeInfo.getBasicAndBasicValueTupleTypeInfo(byte[].class, byte[].class)
could be possible ?
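A minimal sketch of what getProducedType could return for Tuple2<byte[], byte[]>, composed from the primitive-array type infos (the TypeHint alternative assumes a newer Flink version):
@Override
public TypeInformation<Tuple2<byte[], byte[]>> getProducedType() {
    // compose the tuple type info from the two byte[] component types
    return new TupleTypeInfo<>(
            PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO,
            PrimitiveArrayTypeInfo.BYTE_PRIMITIVE_ARRAY_TYPE_INFO);
    // alternatively: return TypeInformation.of(new TypeHint<Tuple2<byte[], byte[]>>() {});
}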
Type Hints in the Java API
To help cases where Flink cannot reconstruct the erased generic type information, the Java API offers so called type hints from version 0.9 on. The type hints tell the system the type of the data set produced by a function. The following gives an example:
DataSet<SomeType> result = dataSet
.map(new MyGenericNonInferrableFunction<Long, SomeType>())
.returns(SomeType.class);
The returns statement specifies the produced type, in this case via a class. The hints support type definition through:
Classes, for non-parameterized types (no generics)
Strings in the form of returns("Tuple2<Integer, my.SomeType>"), which are parsed and converted to a TypeInformation
A TypeInformation directly
You can try this one:
case class CaseClassForYourExpectedMessage(key: String, value: String)

/* getProducedType method */
override def getProducedType: TypeInformation[CaseClassForYourExpectedMessage] =
  TypeExtractor.getForClass(classOf[CaseClassForYourExpectedMessage])
