I'm trying to figure out why all my Kafka messages are getting replayed every time I restart my Storm topology.
My understanding of how it should work was that once the last bolt has ack'ed the tuple, the spout should commit the message offset to Kafka, and hence I should not see it replayed after a restart.
My code is a simple KafkaSpout and a bolt which just prints every message and then acks it.
private static KafkaSpout buildKafkaSpout(String topicName) {
    ZkHosts zkHosts = new ZkHosts("localhost:2181");
    SpoutConfig spoutConfig = new SpoutConfig(zkHosts,
            topicName,
            "/" + topicName,
            "mykafkaspout"); /* was: UUID.randomUUID().toString() */
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    return new KafkaSpout(spoutConfig);
}
public static class PrintBolt extends BaseRichBolt {
    OutputCollector _collector;
    public static Logger LOG = LoggerFactory.getLogger(PrintBolt.class);

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        LOG.error("PrintBolt.0: {}", tuple.getString(0));
        _collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("nothing"));
    }
}
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka", buildKafkaSpout("mytopic"), 1);
    builder.setBolt("print1", new PrintBolt(), 1).shuffleGrouping("kafka");
}
I have not provided any config settings other than those in the code.
Am I missing a config setting, or what am I doing wrong?
UPDATE:
To clarify, everything works fine until I restart the pipeline. The behavior below is what I get from other (non-Storm) consumers, and what I expected from the KafkaSpout.
My expectations:
However, the actual behavior I'm getting with the default settings is the following: the messages are processed fine until I stop the pipeline, and when I restart it I get a replay of all the messages, including those (A and B) which I believed I had already ack'ed.
What actually happens:
As per the configuration options mentioned by Matthias, I can change the startOffsetTime to Latest, but then the spout literally starts from the latest offset and drops the messages (message "C") that were produced while the pipeline was restarting.
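For reference, these are the two offset settings the discussion is about on the old storm-kafka SpoutConfig; the snippet is only an illustration of the trade-off, not a recommendation:

// Start from the latest offset: skips anything produced while the topology is down
// (the message "C" scenario described above).
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.LatestTime();
// Or start from the beginning of the topic:
// spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();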
I have a consumer written in NodeJS (using npm kafka-node) which is able to ack messages to Kafka, and when I restart the NodeJS consumer it does exactly what I expected (it catches up on message "C", which was produced while the consumer was down, and continues from there) -- so how do I get the same behavior with the KafkaSpout?
The problem was in the submit code -- the template code for submitting the topology creates an instance of LocalCluster if the storm jar is run without a topology name, and the local cluster does not persist the offset state, hence the replay.
So
$ storm jar myjar.jar storm.myorg.MyTopology topologyname
will launch it on my single-node development cluster, whereas
$ storm jar myjar.jar storm.myorg.MyTopology
will launch it on an instance of LocalCluster
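A minimal sketch of the kind of submit template I mean, assuming the usual pattern of falling back to LocalCluster when no topology name is given (the names are illustrative):

public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka", buildKafkaSpout("mytopic"), 1);
    builder.setBolt("print1", new PrintBolt(), 1).shuffleGrouping("kafka");

    Config conf = new Config();
    if (args != null && args.length > 0) {
        // A topology name was passed on the command line: submit to the real cluster,
        // where the spout's committed offsets in Zookeeper survive restarts.
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
        // No topology name: run in an in-process LocalCluster,
        // which does not keep the offset state between runs.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-test", conf, builder.createTopology());
    }
}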
Related
I've been working on updating a Flink processor (Flink version 1.9) that reads from Kafka and then writes to Kafka. We wrote this processor to run against a Kafka 0.10.2 cluster, and now we have deployed a new Kafka cluster running version 2.2, so I set out to update the processor to use the latest FlinkKafkaConsumer and FlinkKafkaProducer (as suggested by the Flink docs). However, I've run into some problems with the Kafka producer: I'm unable to get it to serialize data using the deprecated constructors (not surprising), and I've been unable to find any implementations or examples online of how to implement a serializer (all the examples use older Kafka connectors).
The current implementation (for Kafka 0.10.2) is as follows:
FlinkKafkaProducer010<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer010<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        (FlinkKafkaPartitioner) null
);
When I try the same with the new FlinkKafkaProducer:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new SimpleStringSchema(),
        producerProps,
        null
);
I get the following error:
Exception in thread "main" java.lang.NullPointerException
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:525)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:483)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.<init>(FlinkKafkaProducer.java:357)
at com.ebs.flink.sessionprocessor.SessionProcessor.main(SessionProcessor.java:122)
and I haven't been able to figure out why.
That FlinkKafkaProducer constructor is also deprecated, and when I try the non-deprecated constructor I can't figure out how to serialize the data.
The following is how it would look:
FlinkKafkaProducer<String> eventBatchFlinkKafkaProducer = new FlinkKafkaProducer<String>(
        "playerSessions",
        new KafkaSerializationSchema<String>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(String s, @Nullable Long aLong) {
                return null;
            }
        },
        producerProps,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE
);
But I don't understand how to implement the KafkaSerializationSchema, and I can't find any examples of this online or in the Flink docs.
Does anyone have any experience implementing this, or any tips on why the FlinkKafkaProducer throws a NullPointerException in this step?
If you are just sending String to Kafka:
public class ProducerStringSerializationSchema implements KafkaSerializationSchema<String> {

    private String topic;

    public ProducerStringSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(String element, Long timestamp) {
        return new ProducerRecord<byte[], byte[]>(topic, element.getBytes(StandardCharsets.UTF_8));
    }
}
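A hedged usage sketch wiring the schema above into the new constructor (the topic name and semantic are illustrative; producerProps is assumed to be the same Properties object as in the question):

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "playerSessions",                                        // default topic
        new ProducerStringSerializationSchema("playerSessions"), // schema from above
        producerProps,                                           // Kafka producer properties
        FlinkKafkaProducer.Semantic.AT_LEAST_ONCE);              // or EXACTLY_ONCE, see the note below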
For sending a Java Object:
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonProcessingException;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ObjSerializationSchema implements KafkaSerializationSchema<MyPojo> {

    private String topic;
    private ObjectMapper mapper;

    public ObjSerializationSchema(String topic) {
        super();
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(MyPojo obj, Long timestamp) {
        byte[] b = null;
        if (mapper == null) {
            // Lazily create the ObjectMapper the first time serialize() is called.
            mapper = new ObjectMapper();
        }
        try {
            b = mapper.writeValueAsBytes(obj);
        } catch (JsonProcessingException e) {
            // TODO: handle the serialization failure (e.g. log and/or rethrow).
        }
        return new ProducerRecord<byte[], byte[]>(topic, b);
    }
}
In your code
.addSink(new FlinkKafkaProducer<>(producerTopic, new ObjSerializationSchema(producerTopic),
params.getProperties(), FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
To deal with the timeout in the case of FlinkKafkaProducer.Semantic.EXACTLY_ONCE you should read https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-011-and-newer, in particular this part:
Semantic.EXACTLY_ONCE mode relies on the ability to commit transactions that were started before taking a checkpoint, after recovering from the said checkpoint. If the time between Flink application crash and completed restart is larger than Kafka’s transaction timeout there will be data loss (Kafka will automatically abort transactions that exceeded timeout time). Having this in mind, please configure your transaction timeout appropriately to your expected down times.
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than it’s value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
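In practice that usually means lowering the producer-side transaction timeout (or raising transaction.max.timeout.ms on the brokers). A minimal sketch of the producer side, with an illustrative 15-minute value:

// Keep the producer's transaction timeout at or below the broker's
// transaction.max.timeout.ms (15 minutes by default).
producerProps.setProperty("transaction.timeout.ms", String.valueOf(15 * 60 * 1000));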
I started a local Flink cluster (./bin/start-cluster.sh) and submitted a job. I have the following code to define a custom metric:
.map(new RichMapFunction<String, String>() {
    private transient Counter counter;

    @Override
    public void open(Configuration config) {
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("myCounter");
    }

    @Override
    public String map(String value) throws Exception {
        this.counter.inc();
        return value;
    }
})
But when I run the job and send some data, I cannot see any metrics in the Flink web UI, just "No metrics".
I have configured the JMX reporter in flink-conf.yaml. How can I get the metrics to show up on the dashboard?
I had the same problem. My problem was in the cluster configuration: I was using the hostname to name the taskmanager, and when I changed it (back to the default name) the task metrics started to work.
I use docker-swarm to deploy the Flink cluster.
This is my question:
Flink 1.7.0 Dashboard not show Task Statistics
I was asking about the task statistics there, but the task metrics were broken too.
I'm new to Apache Storm and trying to get my feet wet.
Right now I simply want to log or print incoming Kafka messages which are received as byte arrays of ProtoBuf objects.
I need to do this within a Java Spring application.
I'm using Kafka 0.11.0.2
I'm using Storm 1.1.2 and have storm-core, storm-kafka, and storm-starters in my pom.
Main service class example
// annotations for spring
public class MyService {

    public static void main(String[] args) {
        SpringApplication.run(MyService.class, args);
    }

    @PostConstruct
    public void postConstruct() throws Exception {
        SpoutConfig spoutConfig = new SpoutConfig(new ZkHosts("localhost:9092"), "topic", "/topic", "storm-spout");
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("storm-spout", kafkaSpout);
        builder.setBolt("printer", new PrinterBolt())
                .shuffleGrouping("storm-spout");

        Config config = new Config();
        config.setDebug(true);
        config.setMaxTaskParallelism(3);

        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("kafka", config, builder.createTopology());
        Thread.sleep(30000);
        cluster.shutdown();
    }

    private class PrinterBolt extends BaseBasicBolt {

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("\n\n INPUT: " + input.toString() + "\n\n");
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {}
    }
}
I build a docker image from this with a Dockerfile that I know works with my environment for other Spring apps, and when I run it in a container it throws an exception and hangs.
The exception is java.io.NotSerializableException,
and I see Caused by: java.lang.IllegalStateException: Bolt 'printer' contains a non-serializable field of type my.package.MyService$$EnhancerBySpringCGLIB$$696afb49, which was instantiated prior to topology creation. my.package.MyService$$EnhancerBySpringCGLIB$$696afb49 should be instantiated within the prepare method of 'printer' at the earliest.
I figure maybe it's because Storm is trying and failing to serialize the incoming byte array, but I'm not sure how to remedy that, and I haven't seen many people trying to do this.
I was using this as a reference. https://github.com/thehydroimpulse/storm-kafka-starter/blob/master/src/jvm/storm/starter/KafkaTopology.java
Either declare PrinterBolt in a new file, or make the class static. The problem you're running into is that PrinterBolt is a non-static inner class of MyService, which means it contains a reference to the outer MyService class. Since MyService isn't serializable, PrinterBolt isn't either. Storm requires bolts to be serializable.
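A minimal sketch of that fix, keeping the rest of MyService as above and only changing the bolt declaration:

// A static nested class has no implicit reference to the enclosing MyService,
// so Storm can serialize the bolt when packaging the topology.
private static class PrinterBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        System.out.println("\n\n INPUT: " + input.toString() + "\n\n");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}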
Also, unrelated to the error you're seeing, you might want to consider using storm-kafka-client over storm-kafka, since the latter is deprecated.
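For reference, a rough sketch of what the spout setup looks like with storm-kafka-client (the bootstrap servers and topic name are placeholders, and the builder API has shifted a bit between connector releases, so treat this as an outline rather than drop-in code):

import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

// storm-kafka-client talks to the Kafka brokers directly (bootstrap servers),
// not to Zookeeper like the old storm-kafka spout.
KafkaSpoutConfig<String, String> kafkaSpoutConfig =
        KafkaSpoutConfig.builder("localhost:9092", "topic").build();
builder.setSpout("storm-spout", new KafkaSpout<>(kafkaSpoutConfig));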
I have two Java applications (App1, App2) to test how to access a KTable from a different app in a single-instance environment in Docker.
The first app (App1) writes to a KTable with the following code.
public static void main(String[] args)
{
    final Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "gateway-service");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "172.18.0.11:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, ServiceTransactionSerde.class);

    KStreamBuilder builder = new KStreamBuilder();
    KStream<String, ServiceTransaction> source = builder.stream("gateway_request_processed");
    KStream<String, Long> countByApi = source.groupBy((key, value) -> value.getApiId().toString()).count("Counts").toStream();
    countByApi.to(Serdes.String(), Serdes.Long(), "countByApi");
    countByApi.print();

    final KafkaStreams streams = new KafkaStreams(builder, props);
    streams.start();
    System.out.println(streams.state());
    System.out.println(streams.allMetadata());
    System.out.println(streams.allMetadataForStore("countByApi"));

    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            System.out.println(streams.allMetadata());
            streams.close();
        }
    }));
}
When I run my producer I get the following output from the code in App1:
RUNNING
[]
[]
[KTABLE-TOSTREAM-0000000006]: c00af5ee-3c2d-4d12-9c4b-3b55c1284dd6, 19
This shows me state = RUNNING, and the metadata is empty, also for the store. But the request gets processed and stored in the KTable successfully (String, Long).
When I run kafka-topics.sh --list --zookeeper zookeeper:2181
I get the following topics.
bash-4.3# kafka-topics.sh --list --zookeeper zookeeper:2181
__consumer_offsets
countByApi
gateway-Counts-changelog
gateway-Counts-repartition
gateway-service-Counts-changelog
gateway-service-Counts-repartition
gateway_request_processed
This shows me that the KTable is persisted via new topics.
I then have a second command-line app (App2) with the following code, which tries to access this KTable as a state store (ReadOnlyKeyValueStore) and query it.
public static void main(String[] args)
{
    final Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "gateway-service-table-client");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "172.18.0.11:9092");

    KStreamBuilder builder = new KStreamBuilder();
    KafkaStreams streams = new KafkaStreams(builder, props);
    streams.cleanUp();
    streams.start();

    System.out.println("Hello World!");
    System.out.println(streams.state());

    ReadOnlyKeyValueStore<String, Long> keyValueStore =
            streams.store("countByApi", QueryableStoreTypes.keyValueStore());

    final KeyValueIterator<String, Long> range = keyValueStore.all();
    while (range.hasNext()) {
        KeyValue<String, Long> next = range.next();
        System.out.println(String.format("key: %s | value: %s", next.key, next.value));
    }

    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            System.out.println(streams.allMetadata());
            streams.close();
        }
    }));
}
When I run the second app I get the following error:
RUNNING
Exception in thread "main" org.apache.kafka.streams.errors.InvalidStateStoreException: the state store, countByApi, may have migrated to another instance.
at org.apache.kafka.streams.state.internals.QueryableStoreProvider.getStore(QueryableStoreProvider.java:60)
at org.apache.kafka.streams.KafkaStreams.store(KafkaStreams.java:728)
at com.comp.streamtable.App.main(App.java:37)
Unfortunately I have only one instance, and I verified that the state equals RUNNING.
Note: I had to choose a different application.id for each app, since using the same one threw another exception. I just wanted to point this out, since it might be of interest.
What am I missing here to access my KTable from another app?
You are using two different application.id values for the two applications, so the applications are completely decoupled.
Interactive Queries are designed for different instances of the same application and do not work across applications.
This blog post might help: https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/
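A minimal sketch of the intended pattern, i.e. querying the store from inside the application that builds it; this reuses the streams instance from the first app, and "Counts" is the store name passed to count("Counts") above ("countByApi" is only the output topic):

// Run in App1 after streams.start(), once the instance has reached RUNNING.
ReadOnlyKeyValueStore<String, Long> store =
        streams.store("Counts", QueryableStoreTypes.keyValueStore());

try (KeyValueIterator<String, Long> it = store.all()) {
    while (it.hasNext()) {
        KeyValue<String, Long> kv = it.next();
        System.out.println(kv.key + " -> " + kv.value);
    }
}

If a second application needs this data, the usual approach is to expose such queries over RPC (for example a small REST endpoint) from the application that owns the store, as described in the linked blog post.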
Edit: I added an .ack() to the bolt (which required me to use a rich bolt instead of the basic bolt) and am having the same issue -- nothing tells me that tuples are being processed by the bolt.
If it matters, I'm running this on a CentOS image on an EC2 instance. Any help would be appreciated.
I'm trying to set up a very basic HelloWorld Storm example to read messages from a Kafka cluster and print/log the messages I get.
Currently I have 20 messages in the Kafka cluster. When I run the topology (which appears to start just fine), I am able to see my Kafka Spout as well as the Echo Bolt. In the Storm UI, the Kafka Spout Acked column has 20 as a value - which I would assume is the number of messages that it was able to read/access (?)
The Echo Bolt line, however, only notes that I have 1 executor and 1 task. All other columns are 0.
Looking at the Storm worker log that is generated, I see this line: Read partition information from: /HelloWorld Spout/partition_0 --> {"topic":"helloworld","partition":0,"topology":{"id":"<UUID>","name":"Kafka-Storm test"},"broker":{"port":6667,"host":"ip-10-0-0-35.ec2.internal"},"offset":20}
The next few lines are as follows:
s.k.PartitionManager [INFO] Last commit offset from zookeeper: 0
s.k.PartitionManager [INFO] Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
s.k.PartitionManager [INFO] Starting Kafka ip-10-0-0-35.ec2.internal:0 from offset 0
s.k.ZkCoordinator [INFO] Task [1/1] Finished refreshing
s.k.ZkCoordinator [INFO] Task [1/1] Refreshing partition manager connections
s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=ip-10-0-0-35.ec2.internal:6667}}
The rest of the worker log shows no log/print-out of the messages processed by the bolt. I'm at a loss as to why the bolt doesn't seem to be getting any of the messages from the Kafka cluster. Any help would be great. Thanks.
Building the KafkaSpout
private static KafkaSpout setupSpout() {
    BrokerHosts hosts = new ZkHosts("localhost:2181");
    SpoutConfig spoutConfig = new SpoutConfig(hosts, "helloworld", "", "HelloWorld Spout");
    spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
    spoutConfig.forceFromStart = true;
    spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
    return new KafkaSpout(spoutConfig);
}
Building the topology and submitting it
public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("Kafka Spout", setupSpout());
    builder.setBolt("Echo Bolt", new SystemOutEchoBolt());
    try {
        System.setProperty("storm.jar", "/tmp/storm.jar");
        StormSubmitter.submitTopology("Kafka-Storm test", new Config(), builder.createTopology());
    } // catch exceptions here
}
Bolt
public class SystemOutEchoBolt extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private static final Logger logger = LoggerFactory.getLogger(SystemOutEchoBolt.class);

    private OutputCollector m_collector;

    @SuppressWarnings("rawtypes")
    @Override
    public void prepare(Map _map, TopologyContext _context, OutputCollector _collector) {
        m_collector = _collector;
    }

    @Override
    public void execute(Tuple _tuple) {
        System.out.println("Printing tuple with toString(): " + _tuple.toString());
        System.out.println("Printing tuple with getString(): " + _tuple.getString(0));
        logger.info("Logging tuple with logger: " + _tuple.getString(0));
        m_collector.ack(_tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer _declarer) {}
}
The answer was simple. I was never telling the bolt which stream to subscribe to. Adding .shuffleGrouping("Kafka Spout"); fixed the issue.
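In the topology-building code above, that amounts to something like this (component names taken from the question):

builder.setBolt("Echo Bolt", new SystemOutEchoBolt())
        .shuffleGrouping("Kafka Spout");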
You need to call an ack or fail on the tuple in your bolts, otherwise the spout doesn't know that the tuple has been fully processed. This will cause the count issues you're seeing.
public class SystemOutEchoBolt extends BaseRichBolt {

    private static final long serialVersionUID = 1L;
    private static final Logger logger = LoggerFactory.getLogger(SystemOutEchoBolt.class);

    // Explicit ack/fail requires the rich-bolt API; BasicOutputCollector has no ack method.
    private OutputCollector _collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        _collector = collector;
    }

    @Override
    public void execute(Tuple _tuple) {
        System.out.println("Printing tuple with toString(): " + _tuple.toString());
        System.out.println("Printing tuple with getString(): " + _tuple.getString(0));
        logger.info("Logging tuple with logger: " + _tuple.getString(0));
        _collector.ack(_tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer arg0) {}
}