Why isn't Kafka consumer producing results? - java

As a Kafka learning exercise, I have written a Java program TsdbMetricToKafkaTopic to copy data from openTSDB to a Kafka topic, and another Java program DumpKafkaTopic to print out the results; below is the key method of DumpKafkaTopic.
I have confirmed, by using the Kafka utility kafka-console-consumer.sh, that the data I expect are indeed getting written to the intended topic. However, the behavior of DumpKafkaTopic is strange: When I run the producer and then DumpKafkaTopic, it prints results as I'd expect. However, if I re-run it immediately, it prints nothing.
I thought that because I set auto.offset.reset to earliest, my program would be idempotent, that is, every time I run it, it should produce the same results (until I write something else to the topic). Why isn't this happening?
public void dump( String kafka_topic ) {
    // Serializers/deserializers (serde) for key and value types
    final Serde<Long> long_serde = Serdes.Long();
    final Serde< TsdbObject > tsdb_object_serde =
        Serdes.serdeFrom( new TsdbObject.TsdbObjectSerializer(),
                          new TsdbObject.TsdbObjectDeserializer() );

    StreamsBuilder streams_builder = new StreamsBuilder();
    KStream< Long, TsdbObject > kstream =
        streams_builder.stream( kafka_topic, Consumed.with( long_serde, tsdb_object_serde ) );

    // Add final operator, to print results to stdout:
    Printed< Long, TsdbObject > printed = Printed.toSysOut();
    kstream.print( printed );

    Map<String, Object> kstreams_props = new HashMap<>();
    kstreams_props.put(StreamsConfig.APPLICATION_ID_CONFIG, "DumpKafkaTopic");
    kstreams_props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // make sure to consume the complete topic via "auto.offset.reset = earliest"
    kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    StreamsConfig kstreams_config = new StreamsConfig(kstreams_props);

    KafkaStreams kstreams = new KafkaStreams( streams_builder.build(), kstreams_config );
    System.out.println( "Starting DumpKafkaTopic stream " );
    kstreams.start();

    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams
    // (from https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/)
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            System.out.println( "Stopping DumpKafkaTopic stream " );
            kstreams.close();
        }
    }));
}

Related

kafka streams: publish/send messages even when few record transformation throw exceptions?

A typical Kafka Streams application flow is as below (not including all steps like props/serdes etc.) -
final StreamsBuilder builder = new StreamsBuilder();
final KStream<String, String> textLines = builder.stream(inputTopic);
final KStream<String, String> textTransformation_1 = textLines.processValues(value -> value+"firstTranstormation");
final KStream<String, String> textTransformation_2 = textTransformation_1.processValues(value -> value+"secondTranstormation");
//my concern is at this stage -
final KStream<String, String> textTransformation_3 = textTransformation_2.processValues(this::processValueAndDoRelatedStuff);
....
....
textTransformation_x.to(outputTopic, Produced.with(Serdes.String(), Serdes.Long()));
final KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfiguration);
streams.start();
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
Now, if the processValueAndDoRelatedStuff(String input) method throws an error, I don't want the program to crash; I want Kafka to simply NOT send that one record's transformation output to outputTopic (i.e. ignore the transformation of that one record) and continue processing the rest of the incoming messages normally.
Is the above possible?
In general, there is a way to skip sending a transformation's output to outputTopic based on a predicate: I can catch the exception inside processValueAndDoRelatedStuff(String input), return some sentinel value, and add a filter in the next stage that drops it.
final KStream<String, String> textTransformation_4 = textTransformation_3.filter((k,v) -> !v.equals("badrecord"));
But I am more interested in the case where the exception is not handled and is thrown from the mapper functions. Is it possible for Kafka to ignore the one record causing an exception and still proceed with the rest of the processing?
The default behavior is to stop the topology on any uncaught exception.
If you want to catch exceptions yourself, simply don't pass a method reference; wrap the call in a try-catch inside the lambda:
final KStream<String, String> textTransformation_3 = textTransformation_2.processValues(value -> {
    try {
        return processValueAndDoRelatedStuff(value);
    } catch (Exception e) {
        // log, if you want
        return null;
    }
}).filter((k, v) -> Objects.nonNull(v)); // remove events that caused exceptions
Otherwise, you can set exception handlers, as well - https://developer.confluent.io/learn-kafka/kafka-streams/error-handling/
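For example, a rough sketch of the handler route (my illustration, not part of the original answer; it assumes Kafka Streams 2.8+ and reuses the streamsConfiguration and streams objects from the question):
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse;

// Skip records that fail to deserialize instead of stopping the topology.
streamsConfiguration.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
        LogAndContinueExceptionHandler.class);

// Decide how the application reacts to exceptions thrown from processing code.
// Note this does not skip the failing record; the try-catch above is what drops it.
streams.setUncaughtExceptionHandler(exception -> StreamThreadExceptionResponse.REPLACE_THREAD);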

Flink S3 StreamingFileSink not writing files to S3

I am doing a POC for writing data to S3 using Flink. The program does not give an error; however, I do not see any files being written to S3 either.
Below is the code
public class StreamingJob {

    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final String outputPath = "s3a://testbucket-s3-flink/data/";
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Enable checkpointing
        env.enableCheckpointing();

        // S3 Sink
        final StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                .build();

        // Source is a local kafka
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kafka:9094");
        properties.setProperty("group.id", "test");
        DataStream<String> input = env.addSource(new FlinkKafkaConsumer<String>("queueing.transactions", new SimpleStringSchema(), properties));

        input.flatMap(new Tokenizer())    // Tokenizer for generating words
             .keyBy(0)                    // Logically partition the stream for each word
             .timeWindow(Time.minutes(1)) // Tumbling window definition
             .sum(1)                      // Sum the number of words per partition
             .map(value -> value.f0 + " count: " + value.f1.toString() + "\n")
             .addSink(sink);

        // execute program
        env.execute("Flink Streaming Java API Skeleton");
    }

    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
Note that I have set the s3.access-key and s3.secret-key values in the configuration and tested this by changing them to incorrect values (I got an error with incorrect values).
Any pointers on what may be going wrong?
Could it be that you are running into this issue?
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.
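If checkpointing turns out to be the culprit, one small tweak worth trying (my suggestion, not part of the quoted issue) is to give it an explicit interval, since StreamingFileSink only commits in-progress part files when a checkpoint completes:
import org.apache.flink.streaming.api.CheckpointingMode;

// Part files stay "in-progress" until a checkpoint completes, so enable
// checkpointing with an explicit interval rather than the no-argument overload.
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // every 60 seconds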

Apache Flink Dynamic Pipeline

I'm working on creating a framework to allow customers to create their own plugins for my software built on Apache Flink. I've outlined in a snippet below what I'm trying to get working (just as a proof of concept); however, I'm getting an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. error when trying to upload it.
I want to be able to branch the input stream into x number of different pipelines, then have those combine into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {

    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);

        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here
There were a few errors I found.
1) A StreamExecutionEnvironment can apparently only have one input (I could be wrong), so adding the .fromElements input was not good.
2) I forgot that all DataStreams are immutable, so the .union operation returns a new DataStream as its output rather than modifying the stream it is called on.
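In other words, the return value of union has to be captured; a minimal sketch of the same idea without the Optional used below (assuming outputs is non-empty):
// union() does not modify the stream it is called on, so reassign the result.
DataStream<ObjectNode> alerts = outputs.get(0);
for (int i = 1; i < outputs.size(); i++) {
    alerts = alerts.union(outputs.get(i));
}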
The final result ended up being much simpler
public class ContentBase {

    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");

        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));

        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }

        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);

        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
The code you posted does not compile because of the last part (i.e., converting to string). You mixed up the Java Stream API map with Flink's. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
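A slightly fuller sketch of the same fix (my variant, guarding the empty case with ifPresent instead of calling get()):
// Unwrap the Optional before using Flink's map(), and skip the sink when there
// were no streams to union.
outputs.stream()
       .reduce(DataStream::union)
       .ifPresent(merged -> merged
               .map((MapFunction<ObjectNode, String>) ObjectNode::toString)
               .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output",
                       new SimpleStringSchema())));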
Good luck.

Kafka Stream to sort messages based on timestamp key in json message

I am publishing JSON messages to Kafka, e.g.:
"UserID":111,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":2,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:40.200Z,"Comments":0,"Like":6
"UserID":222,"UpdateTime":06-13-2018 12:13:43.200Z,"Comments":1,"Like":10
"UserID":111,"UpdateTime":06-13-2018 12:13:44.600Z,"Comments":3,"Like":12
I would like to sort messages based on UpdateTime in a 10-second time window using Kafka Streams and push the sorted messages back to another Kafka topic.
I have created a stream that reads data from the input topic, and then I create a TimeWindowedKStream after groupByKey(), where UserID is the key of the message (although it's not necessary to groupByKey and then sort, I could not get windowedBy directly). But I am not able to go further and sort the messages in the 10-second window based on UpdateTime. My source code is:
public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sorting");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker");
    props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 0);
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> source = builder.stream("UnsortedMessages");
    TimeWindowedKStream<String, String> countss = source.groupByKey().windowedBy(TimeWindows.of(10000L)
            .until(10000L));

    /*
      SORTING CODE
    */

    outputMessage.toStream().to("SortedMessages", Produced.with(Serdes.String(), Serdes.Long()));

    final KafkaStreams streams = new KafkaStreams(builder.build(), props);
    final CountDownLatch latch = new CountDownLatch(1);

    // attach shutdown handler to catch control-c
    Runtime.getRuntime().addShutdownHook(new Thread("streams-sorting-shutdown-hook") {
        @Override
        public void run() {
            streams.close();
            latch.countDown();
        }
    });

    try {
        streams.start();
        latch.await();
    } catch (Throwable e) {
        System.exit(1);
    }
    System.exit(0);
}
Many thanks in advance.
If you want to sort messages ignoring the key, it only makes sense to do this per partition, and only if the input topic has the same number of partitions as the output topic. For this case, you should extract the partition number and use it as the message key (cf. https://docs.confluent.io/current/streams/faq.html#accessing-record-metadata-such-as-topic-partition-and-offset-information).
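A rough sketch of that re-keying step (entirely illustrative, using the transform()/Transformer API on the source stream from the question; the linked FAQ describes reading the partition from the ProcessorContext):
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;

// Re-key every record by the partition it was read from, so downstream
// processing can sort per input partition.
KStream<Integer, String> byPartition = source.transform(() ->
        new Transformer<String, String, KeyValue<Integer, String>>() {
            private ProcessorContext context;

            @Override
            public void init(ProcessorContext context) {
                this.context = context;
            }

            @Override
            public KeyValue<Integer, String> transform(String key, String value) {
                return KeyValue.pair(context.partition(), value);
            }

            @Override
            public void close() { }
        });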
For sorting, it's more tricky. Note that Kafka Streams follows a "continuous output" model and, with the DSL, emits an update for each input record. Thus, it might be better to use the Processor API. You would use a Processor with an attached store and put records into the store. As an in-memory structure, you keep a sorted list of records. As time advances, you can emit "finished" windows and delete the corresponding records from the store.
I don't think you can build this using the DSL.
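A rough sketch of that Processor-API idea (entirely my illustration, not the answer's code: it buffers records in a TreeMap keyed by the extracted UpdateTime and flushes everything older than 10 seconds on a wall-clock punctuation; extractUpdateTime is a hypothetical helper, and a real implementation would use the attached state store for fault tolerance and handle duplicate timestamps):
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;

public class SortingProcessor implements Processor<String, String> {

    private ProcessorContext context;
    // In-memory buffer ordered by UpdateTime (epoch millis); a state store
    // would be needed for fault tolerance.
    private final TreeMap<Long, String> buffer = new TreeMap<>();

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        // Every 10 seconds of wall-clock time, forward buffered records that are
        // older than the window, in timestamp order, and drop them from the buffer.
        context.schedule(Duration.ofSeconds(10), PunctuationType.WALL_CLOCK_TIME, now -> {
            long cutoff = now - 10_000L;
            while (!buffer.isEmpty() && buffer.firstKey() <= cutoff) {
                Map.Entry<Long, String> oldest = buffer.pollFirstEntry();
                context.forward(String.valueOf(oldest.getKey()), oldest.getValue());
            }
        });
    }

    @Override
    public void process(String key, String value) {
        buffer.put(extractUpdateTime(value), value);
    }

    @Override
    public void close() { }

    // Hypothetical helper: parse the "UpdateTime" field of the JSON value into epoch millis.
    private long extractUpdateTime(String json) {
        return 0L; // parsing omitted in this sketch
    }
}
The processor would then be wired between a source on the input topic and a sink on the output topic with Topology#addSource, addProcessor, and addSink.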

Spark checkpointing error when joining static dataset with DStream

I am trying to write a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop directory using textFileStream() at an interval of 1 minute.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After aggregation, I join the aggregated DStream<Key, Value1> with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory, it runs fine. After running 2-3 batches I close it using Ctrl+C and run it again.
On the second run it immediately throws a Spark exception referencing SPARK-5063:
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
Following is the block of code of the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
    JavaRDD<String> distFile = sc.textFile(MasterFile);
    JavaDStream<String> file = ssc.textFileStream(inputDir);

    // Read Master file
    JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
    final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);

    // Continuous Streaming file
    JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);

    // calculate the sum of required field and generate group sum RDD
    JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
    JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);

    // GROUP BY Operation
    JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);

    // Join Master RDD with the DStream // This is the block causing error (without it code is working fine)
    JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
        new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
            private static final long serialVersionUID = 1L;

            public JavaPairRDD<String, Tuple2<String, String>> call(
                    JavaPairRDD<String, String> rdd, Time v2) throws Exception {
                return masterRDD.value().join(rdd);
            }
        }
    );
    joinedStream.print(10);
}
public static void main(String[] args) {
    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
        public JavaStreamingContext create() {
            // Create the context with a 60 second batch size
            SparkConf sparkConf = new SparkConf();
            final JavaSparkContext sc = new JavaSparkContext(sparkConf);
            JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
            app.compute(sc, ssc1);
            ssc1.checkpoint(checkPointDir);
            return ssc1;
        }
    };
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);

    // start the streaming server
    ssc.start();
    logger.info("Streaming server started...");

    // wait for the computations to finish
    ssc.awaitTermination();
    logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it is taken from the Spark Streaming page
of the Apache Spark website (sub-heading "stream-dataset join" under "Join Operations"). Please help me get it working, even if
there is a different way of doing it. I need to enable checkpointing in my streaming application.
Environment Details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1
