Edit: I added an .ack() to the Bolt (which required me to use a Rich Bolt instead of the basic bolt) and am having the same issue - nothing indicates that tuples are being processed by the bolt.
If it matters, I'm running this on a CentOS image on an EC2 instance. Any help would be appreciated.
I'm trying to set up a very basic HelloWorld Storm example to read messages from a Kafka cluster and print/log the messages I get.
Currently I have 20 messages in the Kafka cluster. When I run the topology (which appears to start just fine), I am able to see my Kafka Spout as well as the Echo Bolt. In the Storm UI, the Kafka Spout Acked column has 20 as a value - which I would assume is the number of messages that it was able to read/access (?)
The Echo Bolt line, however, only notes that I have 1 executor and 1 task. All other columns are 0.
Looking at the Storm worker log that is generated, I see this line: Read partition information from: /HelloWorld Spout/partition_0 --> {"topic":"helloworld","partition":0,"topology":{"id":"<UUID>","name":"Kafka-Storm test"},"broker":{"port":6667,"host":"ip-10-0-0-35.ec2.internal"},"offset":20}
The next few lines are as follows:
s.k.PartitionManager [INFO] Last commit offset from zookeeper: 0
s.k.PartitionManager [INFO] Commit offset 0 is more than 9223372036854775807 behind, resetting to startOffsetTime=-2
s.k.PartitionManager [INFO] Starting Kafka ip-10-0-0-35.ec2.internal:0 from offset 0
s.k.ZkCoordinator [INFO] Task [1/1] Finished refreshing
s.k.ZkCoordinator [INFO] Task [1/1] Refreshing partition manager connections
s.k.DynamicBrokersReader [INFO] Read partition info from zookeeper: GlobalPartitionInformation{partitionMap={0=ip-10-0-0-35.ec2.internal:6667}}
The rest of the worker log shows no log/print out of the messages processed by the Bolt. I'm at a loss as to why the Bolt doesn't seem to be getting any of the messages from the Kafka cluster. Any help would be great. Thanks.
Building the KafkaSpout
private static KafkaSpout setupSpout() {
BrokerHosts hosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "helloworld", "", "HelloWorld Spout");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.forceFromStart = true;
spoutConfig.startOffsetTime = kafka.api.OffsetRequest.EarliestTime();
return new KafkaSpout(spoutConfig);
}
Building the topology and submitting it
public static void main(String[] args) {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("Kafka Spout", setupSpout());
builder.setBolt("Echo Bolt", new SystemOutEchoBolt());
try {
System.setProperty("storm.jar", "/tmp/storm.jar");
StormSubmitter.submitTopology("Kafka-Storm test", new Config(), builder.createTopology());
} //catchExceptionsHere
}
Bolt
public class SystemOutEchoBolt extends BaseRichBolt {
private static final long serialVersionUID = 1L;
private static final Logger logger = LoggerFactory.getLogger(SystemOutEchoBolt.class);
private OutputCollector m_collector;
@SuppressWarnings("rawtypes")
@Override
public void prepare(Map _map, TopologyContext _context, OutputCollector _collector) {
m_collector = _collector;
}
@Override
public void execute(Tuple _tuple) {
System.out.println("Printing tuple with toString(): " + _tuple.toString());
System.out.println("Printing tuple with getString(): " + _tuple.getString(0));
logger.info("Logging tuple with logger: " + _tuple.getString(0));
m_collector.ack(_tuple);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer _declarer) {}
}
The answer was simple. I was never telling the bolt which stream to subscribe to. Adding .shuffleGrouping("Kafka Spout"); fixed the issue.
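For reference, here is a minimal sketch of the corrected wiring in the topology builder, based on the main() above:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("Kafka Spout", setupSpout());
// Subscribe the bolt to the spout's stream; without a grouping the bolt never receives any tuples.
builder.setBolt("Echo Bolt", new SystemOutEchoBolt()).shuffleGrouping("Kafka Spout");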
You need to call an ack or fail on the tuple in your bolts, otherwise the spout doesn't know that the tuple has been fully processed; this is what causes the count issues you're seeing. (If you extend BaseBasicBolt instead, the framework acks the tuple for you after execute() returns.)
public class SystemOutEchoBolt extends BaseBasicBolt {
private static final long serialVersionUID = 1L;
private static final Logger logger = LoggerFactory.getLogger(SystemOutEchoBolt.class);
@Override
public void execute(Tuple _tuple, BasicOutputCollector _collector) {
System.out.println("Printing tuple with toString(): " + _tuple.toString());
System.out.println("Printing tuple with getString(): " + _tuple.getString(0));
logger.info("Logging tuple with logger: " + _tuple.getString(0));
// no explicit ack needed here: BaseBasicBolt acks the tuple automatically when execute() returns
}
@Override
public void declareOutputFields(OutputFieldsDeclarer arg0) {}
}
Related
I am trying to calculate the rate of incoming events per minute from a Kafka topic based on event time. I am using TumblingEventTimeWindows of 1 minute for this. The code snippet is given below.
I have observed that if I am not receiving any event for a particular window, e.g. from 2.34 to 2.35, then the previous window of 2.33 to 2.34 does not get closed. I understand the risk of losing data for the window of 2.33 to 2.34 (may happen due to system failure, bigger Kafka lag, etc.), but I cannot wait indefinitely. I need to close this window after waiting for a certain period of time, and subsequent windows can continue after the system recovers. How can I achieve this?
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3,
org.apache.flink.api.common.time.Time.of(10, TimeUnit.SECONDS)
));
executionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
executionEnvironment.setParallelism(1);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "AllEventCountConsumerGroup");
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("event_input_topic", new SimpleStringSchema(), properties);
DataStreamSource<String> kafkaDataStream = executionEnvironment.addSource(kafkaConsumer);
kafkaDataStream
.flatMap(new EventFlatter())
.filter(Objects::nonNull)
.assignTimestampsAndWatermarks(WatermarkStrategy
.<Entity>forMonotonousTimestamps()
.withIdleness(Duration.ofSeconds(60))
.withTimestampAssigner((SerializableTimestampAssigner<Entity>) (element, recordTimestamp) -> element.getTimestamp()))
.assignTimestampsAndWatermarks(new EntityWatermarkStrategy())
.keyBy((KeySelector<Entity, String>) Entity::getTenant)
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.allowedLateness(Time.seconds(10))
.aggregate(new EventCountAggregator())
.addSink(eventRateProducer);
private static class EntityWatermarkStrategy implements WatermarkStrategy<Entity> {
@Override
public WatermarkGenerator<Entity> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new EntityWatermarkGenerator();
}
}
private static class EntityWatermarkGenerator implements WatermarkGenerator<Entity> {
private long maxTimestamp;
public EntityWatermarkGenerator() {
this.maxTimestamp = Long.MIN_VALUE + 1;
}
@Override
public void onEvent(Entity event, long eventTimestamp, WatermarkOutput output) {
maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(maxTimestamp + 2));
}
}
Also, I tried adding some custom triggers, but it didn't help. I am using Apache Flink 1.11.
Can somebody suggest what I am doing wrong?
When I push more data with a newer timestamp (say t+1) to the topic, the data from the earlier timeframe (t) gets emitted; but then the same issue occurs for the t+1 data as it did for t.
One reason why withIdleness() isn't helping in your case is that you are calling assignTimestampsAndWatermarks on the datastream after it has been emitted by the kafka source, rather than calling it on the FlinkKafkaConsumer itself. If you were to do the latter, then the FlinkKafkaConsumer would be able to assign timestamps and watermarks on a per-partition basis, and would consider idleness at the granularity of each individual kafka partition. See Watermark Strategies and the Kafka Connector for more info.
To make this work, however, you'll need to use a deserializer other than a SimpleStringSchema (such as a KafkaDeserializationSchema) that is able to create individual stream records, with timestamps. See https://stackoverflow.com/a/62072265/2000823 for an example of how to implement a KafkaDeserializationSchema.
Keep in mind, however, that withIdleness() will not advance the watermark if all partitions are idle. What it will do is to prevent idle partitions from holding back the watermark, which may advance if there are events from other partitions.
See the idle partitions documentation for an approach to solving your problem.
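As a minimal sketch of that approach (EntityDeserializationSchema is a hypothetical deserializer name here; see the linked answer for how to implement one):
FlinkKafkaConsumer<Entity> kafkaConsumer = new FlinkKafkaConsumer<>(
    "event_input_topic", new EntityDeserializationSchema(), properties);
// Assign timestamps/watermarks on the source itself, so idleness is tracked per Kafka partition.
kafkaConsumer.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Entity>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        .withTimestampAssigner((SerializableTimestampAssigner<Entity>) (element, recordTimestamp) -> element.getTimestamp())
        .withIdleness(Duration.ofSeconds(60)));
DataStreamSource<Entity> entityStream = executionEnvironment.addSource(kafkaConsumer);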
Using the Flink 1.11+ WatermarkStrategy API should help you avoid pumping dummy data. What you need is to periodically generate a watermark at the end of each minute. This is the reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/event_timestamps_watermarks.html
Create a flinkKafkaConsumer with CustomKafkaSerializer:
FlinkKafkaConsumer otherConsumer = new FlinkKafkaConsumer(
topics, new CustomKafkaSerializer(apacheFlinkEnvironmentLoader), props);
How do you create the CustomKafkaSerializer? See:
Two questions about Flink deserializing
Now use a watermark strategy with this FlinkKafkaConsumer:
FlinkKafkaConsumer<Tuple3<String,String,String>> flinkKafkaConsumer = apacheKafkaConfig.getOtherConsumer();
flinkKafkaConsumer.assignTimestampsAndWatermarks(new ApacheFlinkWaterMarkStrategy(envConfig.getOutOfOrderDurationSeconds()).
withIdleness(Duration.ofSeconds(envConfig.getIdlePartitionTimeout())));
This is what the watermark strategy looks like:
public class ApacheFlinkWaterMarkStrategy implements WatermarkStrategy<Tuple3<String, String, String>> {
private long outOfOrderDuration;
public ApacheFlinkWaterMarkStrategy(long outOfOrderDuration)
{
super();
this.outOfOrderDuration = outOfOrderDuration;
}
@Override
public TimestampAssigner<Tuple3<String, String, String>> createTimestampAssigner(TimestampAssignerSupplier.Context context) {
return new ApacheFlinkTimeForEvent();
}
@Override
public WatermarkGenerator<Tuple3<String, String, String>> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context) {
return new ApacheFlinkWaterMarkGenerator(this.outOfOrderDuration);
}
}
This is how we get the event time from the payload:
public class ApacheFlinkTimeForEvent implements SerializableTimestampAssigner<Tuple3<String,String,String>> {
public static final Logger logger = LoggerFactory.getLogger(ApacheFlinkTimeForEvent.class);
private static final FhirContext fhirContext = FhirContext.forR4();
@Override
public long extractTimestamp(Tuple3<String,String,String> o, long l) {
//get timestamp from payload
}
}
This is how we generate watermarks periodically so that, irrespective of whether data arrives or not, the watermark gets updated every minute in each partition.
public class ApacheFlinkWaterMarkGenerator implements WatermarkGenerator<Tuple3<String,String,String>> {
public static final Logger logger = LoggerFactory.getLogger(ApacheFlinkWaterMarkGenerator.class);
private long outOfOrderGenerator;
private long maxEventTimeStamp;
public ApacheFlinkWaterMarkGenerator(long outOfOrderGenerator)
{
super();
this.outOfOrderGenerator = outOfOrderGenerator;
}
@Override
public void onEvent(Tuple3<String, String, String> stringStringStringTuple3, long l, WatermarkOutput watermarkOutput) {
maxEventTimeStamp = Math.max(maxEventTimeStamp,l);
Watermark eventWatermark = new Watermark(maxEventTimeStamp);
watermarkOutput.emitWatermark(eventWatermark);
logger.info("Current Watermark emitted from event is {}",eventWatermark.getFormattedTimestamp());
}
@Override
public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
long currentUtcTime = Instant.now().toEpochMilli();
Watermark periodicWaterMark = new Watermark(currentUtcTime-outOfOrderGenerator);
watermarkOutput.emitWatermark(periodicWaterMark);
logger.info("Current Watermark emitted periodically is {}",periodicWaterMark.getFormattedTimestamp());
}
}
Also, periodic emitting of the watermark has to be enabled at the start of the application.
streamExecutionEnvironment.getConfig().setAutoWatermarkInterval(autoWatermarkIntervalMillis); // interval in milliseconds
This is how we attach the custom timestamps and watermarks to the FlinkKafkaConsumer:
flinkKafkaConsumer.assignTimestampsAndWatermarks(
    new ApacheFlinkWaterMarkStrategy(outOfOrderSeconds)
        .withIdleness(Duration.ofSeconds(idlePartitionSeconds)));
I'm doing this for the first time. I am going to read a stream of data using a WebSocket.
Here is my code snippet
RsvpApplication
@SpringBootApplication
public class RsvpApplication {
private static final String MEETUP_RSVPS_ENDPOINT = "ws://stream.myapi.com/2/rsvps";
public static void main(String[] args) {
SpringApplication.run(RsvpApplication.class, args);
}
@Bean
public ApplicationRunner initializeConnection(
RsvpsWebSocketHandler rsvpsWebSocketHandler) {
return args -> {
System.out.println("initializeConnection");
WebSocketClient rsvpsSocketClient = new StandardWebSocketClient();
rsvpsSocketClient.doHandshake(
rsvpsWebSocketHandler, MEETUP_RSVPS_ENDPOINT);
};
}
}
RsvpsWebSocketHandler
@Component
class RsvpsWebSocketHandler extends AbstractWebSocketHandler {
private static final Logger logger =
Logger.getLogger(RsvpsWebSocketHandler.class.getName());
private final RsvpsKafkaProducer rsvpsKafkaProducer;
public RsvpsWebSocketHandler(RsvpsKafkaProducer rsvpsKafkaProducer) {
this.rsvpsKafkaProducer = rsvpsKafkaProducer;
}
@Override
public void handleMessage(WebSocketSession session,
WebSocketMessage<?> message) {
logger.log(Level.INFO, "New RSVP:\n {0}", message.getPayload());
System.out.println("handleMessage");
rsvpsKafkaProducer.sendRsvpMessage(message);
}
}
RsvpsKafkaProducer
@Component
@EnableBinding(Source.class)
public class RsvpsKafkaProducer {
private static final int SENDING_MESSAGE_TIMEOUT_MS = 10000;
private final Source source;
public RsvpsKafkaProducer(Source source) {
this.source = source;
}
public void sendRsvpMessage(WebSocketMessage<?> message) {
System.out.println("sendRsvpMessage");
source.output()
.send(MessageBuilder.withPayload(message.getPayload())
.build(),
SENDING_MESSAGE_TIMEOUT_MS);
}
}
As far as I know and have read about WebSockets, they need a one-time connection and the stream of data then flows continuously until either party (client or server) stops.
I'm building this for the first time, so I'm trying to cover the major scenarios that can come up while dealing with 10,000+ messages per minute. There are two Kafka brokers in total, with enough space.
What can be done if the connection gets lost, so that once reconnected the consumer resumes reading from the WebSocket where it left off at the last failure and keeps pushing messages into the Kafka broker?
What can be done to put the WebSocket on hold, so that it stops pushing messages into the broker once the broker reaches a threshold of unprocessed messages?
When the broker reaches its threshold, what can be done to run a separate process that checks the available space in the broker and signals when to resume pushing messages into Kafka?
Please also share any other issues that need to be considered while setting this up.
I want to do the following: when a message fails and falls into my dead letter queue, I want to wait 5 minutes and then republish the same message to my queue.
Today, using Spring Cloud Stream and RabbitMQ, I wrote the following code, based on this documentation:
@Component
public class HandlerDlq {
private static final Logger LOGGER = LoggerFactory.getLogger(HandlerDlq.class);
private static final String X_RETRIES_HEADER = "x-retries";
private static final String X_DELAY_HEADER = "x-delay";
private static final int NUMBER_OF_RETRIES = 3;
private static final int DELAY_MS = 300000;
private RabbitTemplate rabbitTemplate;
@Autowired
public HandlerDlq(RabbitTemplate rabbitTemplate) {
this.rabbitTemplate = rabbitTemplate;
}
@RabbitListener(queues = MessageInputProcessor.DLQ)
public void rePublish(Message failedMessage) {
Map<String, Object> headers = failedMessage.getMessageProperties().getHeaders();
Integer retriesHeader = (Integer) headers.get(X_RETRIES_HEADER);
if (retriesHeader == null) {
retriesHeader = 0;
}
if (retriesHeader > NUMBER_OF_RETRIES) {
LOGGER.warn("Message {} added to failed messages queue", failedMessage);
this.rabbitTemplate.send(MessageInputProcessor.FAILED, failedMessage);
throw new ImmediateAcknowledgeAmqpException("Message failed after " + NUMBER_OF_RETRIES + " attempts");
}
retriesHeader++;
headers.put(X_RETRIES_HEADER, retriesHeader);
headers.put(X_DELAY_HEADER, DELAY_MS * retriesHeader);
LOGGER.warn("Retrying message, {} attempts", retriesHeader);
this.rabbitTemplate.send(MessageInputProcessor.DELAY_EXCHANGE, MessageInputProcessor.INPUT_DESTINATION, failedMessage);
}
@Bean
public DirectExchange delayExchange() {
DirectExchange exchange = new DirectExchange(MessageInputProcessor.DELAY_EXCHANGE);
exchange.setDelayed(true);
return exchange;
}
@Bean
public Binding bindOriginalToDelay() {
return BindingBuilder.bind(new Queue(MessageInputProcessor.INPUT_DESTINATION)).to(delayExchange()).with(MessageInputProcessor.INPUT_DESTINATION);
}
@Bean
public Queue parkingLot() {
return new Queue(MessageInputProcessor.FAILED);
}
}
My MessageInputProcessor interface:
public interface MessageInputProcessor {
String INPUT = "myInput";
String INPUT_DESTINATION = "myInput.group";
String DLQ = INPUT_DESTINATION + ".dlq"; //from application.properties file
String FAILED = INPUT + "-failed";
String DELAY_EXCHANGE = INPUT_DESTINATION + "-DlqReRouter";
@Input
SubscribableChannel storageManagerInput();
@Input(MessageInputProcessor.FAILED)
SubscribableChannel storageManagerFailed();
}
And my properties file:
#dlx/dlq setup - retry dead letter 5 minutes later (300000ms later)
spring.cloud.stream.rabbit.bindings.myInput.consumer.auto-bind-dlq=true
spring.cloud.stream.rabbit.bindings.myInput.consumer.republish-to-dlq=true
spring.cloud.stream.rabbit.bindings.myInput.consumer.dlq-ttl=3000
spring.cloud.stream.rabbit.bindings.myInput.consumer.delayedExchange=true
#input
spring.cloud.stream.bindings.myInput.destination=myInput
spring.cloud.stream.bindings.myInput.group=group
With this code I can read from the dead letter queue and capture the header, but I can't put the message back onto my queue (the line LOGGER.warn("Retrying message, {} attempts", retriesHeader); only runs once, even if I set a very long delay).
My guess is that the method bindOriginalToDelay is binding the exchange to a new queue, and not mine. However, I didn't find a way to get my queue to bind there instead of creating a new one. But I'm not even sure this is the error.
I've also tried to send to MessageInputProcessor.INPUT instead of MessageInputProcessor.INPUT_DESTINATION, but it didn't work as expected.
Also, unfortunately, I can't update Spring framework due to dependencies on the project...
Could you help me put the failed message back on my queue after some time? I really don't want to put a Thread.sleep there...
With that configuration, the myInput.group queue is bound to the delayed (topic) exchange myInput with routing key #.
It will also be bound to your explicit delayed exchange, with key myInput.group.
The myInput.group.dlq queue is bound to the DLX with key myInput.group.
You should probably remove spring.cloud.stream.rabbit.bindings.myInput.consumer.delayedExchange=true because you don't need the main exchange to be delayed.
Everything looks correct to me; you should see the same (single) queue bound to two exchanges.
You should set a longer TTL and examine the message in the DLQ to see if something stands out.
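For example, reusing the property names from the question (a debugging sketch only; the 300000 ms value matches the 5-minute intent in the original comment):
spring.cloud.stream.rabbit.bindings.myInput.consumer.auto-bind-dlq=true
spring.cloud.stream.rabbit.bindings.myInput.consumer.republish-to-dlq=true
# a longer TTL so the dead-lettered message can be examined before it expires
spring.cloud.stream.rabbit.bindings.myInput.consumer.dlq-ttl=300000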
EDIT
I just copied your code with a 5 second delay and it worked fine for me (after turning off the delay on the main exchange).
Retrying message, 4 attempts
and
added to failed messages queue
Perhaps you thought it was not working because you have a delay on the main exchange too?
The producer API (Kafka client 1.0.1) cannot produce messages to my remote Kafka broker (v1.1.0) hosted in Google Cloud.
I have configured the server.properties file, by adding the following:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://<Local-IP>:9092
But the result is the same.
My producer API looks like this:
public class KafkaProducerExample {
private final static String TOPIC = "my-topic";
private final static String BOOTSTRAP_SERVERS ="Public-IP:9092";
private static Producer<Long, String> createProducer() {
Properties props = new Properties();
props.put("bootstrap.servers",BOOTSTRAP_SERVERS);
props.put(ProducerConfig.CLIENT_ID_CONFIG, "KafkaExampleProducer");
props.put("key.serializer",LongSerializer.class.getName());
props.put("value.serializer",StringSerializer.class.getName());
return new KafkaProducer<>(props);
}
static void runProducer(final int sendMessageCount) throws Exception {
final Producer<Long, String> producer = createProducer();
long time = System.currentTimeMillis();
try {
for (long index = time; index < time + sendMessageCount; index++) {
final ProducerRecord<Long, String> record =
new ProducerRecord<>(TOPIC, index,
"Hello Suvro " + index);
RecordMetadata metadata = producer.send(record).get();
long elapsedTime = System.currentTimeMillis() - time;
System.out.printf("sent record(key=%s value=%s) " +
"meta(partition=%d, offset=%d) time=%d\n",
record.key(), record.value(), metadata.partition(),
metadata.offset(), elapsedTime);
}
} finally {
producer.flush();
producer.close();
}
}
public static void main(String args[]) {
try {
runProducer(5);
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
The error looks like:
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for my-topic-0: 30039 ms has passed since batch creation plus linger time
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:94)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:64)
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:29)
at in.gov.enam.etrade.notification.KafkaProducerExample.runProducer(KafkaProducerExample.java:39)
at in.gov.enam.etrade.notification.KafkaProducerExample.main(KafkaProducerExample.java:54)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for my-topic-0: 30039 ms has passed since batch creation plus linger time
However, this entire thing worked when ZooKeeper, the Kafka server and the producer API were all run on the same machine using localhost.
I have no clue what I have missed or where I am going wrong.
I had a similar issue pointing to a Kafka cluster deployed in Docker. It is caused by an access issue between your local machine and the remote Kafka cluster. If you package your producer app and deploy/run it on the remote Google Cloud instance, the producer app should work fine.
If you want to access from LAN, change the following 2 files:
In config/server.properties:
advertised.listeners=PLAINTEXT://server.ip.in.lan:9092
In config/producer.properties:
bootstrap.servers=server.ip.in.lan:9092
I'm trying to figure out why all my Kafka messages are getting replayed every time I restart my Storm topology.
My understanding of how it should work was that once the last Bolt has ack'ed the tuple, the spout should commit the message offset to Kafka, and hence I should not see it replayed after a restart.
My code is a simple Kafka spout and a Bolt which just prints every message and then acks it.
private static KafkaSpout buildKafkaSpout(String topicName) {
ZkHosts zkHosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(zkHosts,
topicName,
"/" + topicName,
"mykafkaspout"); /*was:UUID.randomUUID().toString()*/
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
return new KafkaSpout(spoutConfig);
}
public static class PrintBolt extends BaseRichBolt {
OutputCollector _collector;
public static Logger LOG = LoggerFactory.getLogger(PrintBolt.class);
@Override
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
_collector = collector;
}
@Override
public void execute(Tuple tuple) {
LOG.error("PrintBolt.0: {}",tuple.getString(0));
_collector.ack(tuple);
}
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("nothing"));
}
}
public static void main(String[] args) throws Exception {
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka", buildKafkaSpout("mytopic"), 1);
builder.setBolt("print1", new PrintBolt(),1).shuffleGrouping("kafka");
}
I have not provided any config settings other than those in the code.
Am I missing a config-setting or what am I doing wrong?
UPDATE:
To clarify, everything works fine until I restart the pipeline. The behavior below is what I get with other (non-Storm) consumers, and what I expected from the KafkaSpout.
My expectations:
However, the actual behavior I'm getting with the default settings is the following: the messages are processed fine until I stop the pipeline, and when I restart it I get a replay of all the messages, including those (A and B) which I believed I had already ack'ed.
What actually happens:
As per the configuration options mentioned by Matthias, I can change the startOffsetTime to Latest; however, that is literally the latest offset, so the pipeline drops the messages (message "C") that were produced while the pipeline was restarting.
I have a consumer written in Node.js (using npm kafka-node) which is able to ack messages to Kafka, and when I restart the Node.js consumer it does exactly what I expected (it catches up on message "C", which was produced while the consumer was down, and continues from there) -- so how do I get the same behavior with the KafkaSpout?
The problem was in the submit code -- the template code for submitting the topology creates an instance of LocalCluster if the storm jar is run without a topology name, and the local cluster does not persist state, hence the replay.
So
$ storm jar myjar.jar storm.myorg.MyTopology topologyname
will launch it on my single node development cluster, whereas
$ storm jar myjar.jar storm.myorg.MyTopology
will launch it on an instance of LocalCluster
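For reference, a minimal sketch of the kind of template submit code that behaves this way (the argument handling here is illustrative, not the exact code from my project):
public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka", buildKafkaSpout("mytopic"), 1);
    builder.setBolt("print1", new PrintBolt(), 1).shuffleGrouping("kafka");
    Config conf = new Config();
    if (args != null && args.length > 0) {
        // A topology name was given: submit to the real cluster, where the spout's
        // consumer offsets live in ZooKeeper and survive restarts.
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    } else {
        // No topology name: run in an in-process LocalCluster, whose state is
        // lost on shutdown -- which is what caused the replay described above.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("local-test", conf, builder.createTopology());
    }
}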