Storm/Kafka - Unable to get offset lags for kafka - java

I am running a Storm topology that reads tweets from Kafka, on AWS Ubuntu Server 14.04 LTS instances with 4 nodes: Nimbus, a Supervisor, a Kafka-ZooKeeper node, and a ZooKeeper node (for the Storm cluster). My Storm UI is up and running and I am able to submit topologies. I have two brokers, but I'm only using the one with broker.id=0; it holds tweets under a topic. My Kafka server is running fine too.
I created the kafka-topic in this way:
bin/kafka-topics.sh --create --zookeeper localhost:2181/kafka --replication-factor 1 --partitions 1 --topic twitter1
The thing I'm confused about is:
SpoutConfig kafkaConfig = new SpoutConfig(kafkaHosts, topicName+"-0", "/kafka", topicName+"-0");
I think my errors stem from this point. The complete code is:
import org.apache.storm.tuple.Fields;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.ZkHosts;
import java.util.Arrays;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.kafka.StringScheme;
public class TwitterTopology{
public static void main(String[] args) {
String topicName = "twitter1";
String topologyName = args[0];
String kafkaIp = "xxx.31.xxx.207"; //hiding the IPs here. This is the IP for my kafka-zk node. Is this ok?
String nimbusHost = "xxx.31.xxx.70";
String kafkaHost = kafkaIp + ":9092";
BrokerHosts kafkaHosts = new ZkHosts(kafkaHost);
SpoutConfig kafkaConfig = new SpoutConfig(kafkaHosts, topicName, "/kafka", topicName);
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(kafkaConfig);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("twitter-spout", kafkaSpout, 8);
builder.setBolt("WordSplitterBolt", new JsonWordSplitterBolt(5)).shuffleGrouping("twitter-spout");
builder.setBolt("IgnoreWordsBolt", new IgnoreWordsBolt()).shuffleGrouping("WordSplitterBolt");
builder.setBolt("WordCounterBolt", new WordCounterBolt(5, 5 * 60, 50)).shuffleGrouping("IgnoreWordsBolt");
Config config = new Config();
config.setDebug(false);
config.setMaxTaskParallelism(5);
config.put(Config.NIMBUS_HOST, nimbusHost);
config.put(Config.NIMBUS_THRIFT_PORT, 6627);
config.put(Config.STORM_ZOOKEEPER_PORT, 2181);
config.put(Config.STORM_ZOOKEEPER_SERVERS, Arrays.asList(kafkaIp));
try {
config.setNumWorkers(20);
config.setMaxSpoutPending(5000);
StormSubmitter.submitTopology(topologyName, config, builder.createTopology());
} catch (Exception e) {
throw new IllegalStateException("Couldn't initialize the topology", e);
}
}
}
I am getting this exception in the Storm UI:
Unable to get offset lags for kafka. Reason: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /brokers/topics/twitter1/partitions at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1590) at org.apache.curator.shaded.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242) at org.apache.curator.shaded.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231) at org.apache.curator.shaded.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) at org.apache.curator.shaded.RetryLoop.callWithRetry(RetryLoop.java:100) at org.apache.curator.shaded.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228) at org.apache.curator.shaded.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219) at org.apache.curator.shaded.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41) at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.getLeadersAndTopicPartitions(KafkaOffsetLagUtil.java:319) at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.getOffsetLags(KafkaOffsetLagUtil.java:256) at org.apache.storm.kafka.monitor.KafkaOffsetLagUtil.main(KafkaOffsetLagUtil.java:124)
The error "Unable to get offset lags for kafka" stays the same, while the rest of the exception changes according to the zkRoot path I pass (the 3rd argument of SpoutConfig). I don't know exactly how to fill in these arguments so that the Kafka spout pulls the tweets from my topic.
I used the tutorial here to write the code for submitting the topology: http://stdatalabs.blogspot.ca/2016/10/real-time-stream-processing-using.html
I have made numerous changes to the Maven dependencies. My pom.xml has all the dependencies for storm-core, kafka, etc., with the latest versions available in the Maven repository.

ZkHosts() should be given the ZooKeeper connection string, not the Kafka broker address. In your code you pass kafkaIp + ":9092", which is the broker port; even if ZooKeeper and Kafka run on the same server, use the ZooKeeper port (2181) instead.
Refer to https://storm.apache.org/releases/1.2.3/storm-kafka.html
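For example, a minimal sketch of the corrected spout configuration, assuming ZooKeeper listens on port 2181 on the kafka-zk node and the brokers are registered under the /kafka chroot (as the kafka-topics.sh command in the question suggests); the zkRoot and spout id values are illustrative:
// ZkHosts must point at ZooKeeper (port 2181), not at the Kafka broker (port 9092).
// The topic was created with --zookeeper localhost:2181/kafka, so the brokers are
// registered under the /kafka chroot, i.e. at /kafka/brokers in ZooKeeper.
BrokerHosts kafkaHosts = new ZkHosts(kafkaIp + ":2181", "/kafka/brokers");
// SpoutConfig(hosts, topic, zkRoot, id): zkRoot is where the spout stores its consumer
// offsets in ZooKeeper, and id just needs to be unique for this spout.
SpoutConfig kafkaConfig = new SpoutConfig(kafkaHosts, "twitter1", "/kafka-spout", "twitter1-spout");
kafkaConfig.scheme = new SchemeAsMultiScheme(new StringScheme());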

Related

Unable to read a (text) file in FileProcessingMode.PROCESS_CONTINUOUSLY mode

I have a requirement to read files continuously from a specific path.
This means the Flink job should continuously poll the specified location and read any file that arrives there at certain intervals.
Example: my location on a Windows machine is C:/inputfiles, which receives file_1.txt at 2:00 PM, file_2.txt at 2:30 PM, and file_3.txt at 3:00 PM.
I experimented with the code below.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.io.FilePathFilter;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.util.Collector;
import java.util.Arrays;
import java.util.List;
public class ContinuousFileProcessingTest {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10);
String localFsURI = "D:\\FLink\\2021_01_01\\";
TextInputFormat format = new TextInputFormat(new org.apache.flink.core.fs.Path(localFsURI));
format.setFilesFilter(FilePathFilter.createDefaultFilter());
DataStream<String> inputStream =
env.readFile(format, localFsURI, FileProcessingMode.PROCESS_CONTINUOUSLY, 100);
SingleOutputStreamOperator<String> soso = inputStream.map(String::toUpperCase);
soso.print();
soso.writeAsText("D:\\FLink\\completed", FileSystem.WriteMode.OVERWRITE);
env.execute("read and write");
}
}
To test this on a Flink cluster, I brought a cluster up with Flink 1.9.2 and was able to achieve my goal of reading files continuously at certain intervals.
Note: Flink 1.9.2 can bring up a cluster on a Windows machine.
But now I have to upgrade Flink from 1.9.2 to 1.12, and we used Docker to bring the cluster up on 1.12 (unlike 1.9.2).
I changed the file location from the Windows path to the corresponding Docker location, but the same program is not working there.
Moreover:
Accessing the files is not the problem. That is, if I put a file in place before starting the job, the job reads it correctly, but if I add a new file at runtime it does not pick up the newly added file.
I need help finding a solution.
Thanks in advance.
Try changing the directory scan interval in the sample code (the last argument of readFile, 100 ms above) to Duration.ofSeconds(50).toMillis(), and check out StreamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC).
For RuntimeExecutionMode, see https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/api/common/RuntimeExecutionMode.html
Working code below:
public class ContinuousFileProcessingTest {
private static final Logger log = LoggerFactory.getLogger(ContinuousFileProcessingTest.class);
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
String localFsURI = "file:///usr/test";
// create the monitoring source along with the necessary readers.
TextInputFormat format = new TextInputFormat(new org.apache.flink.core.fs.Path(localFsURI));
log.info("format : " + format.toString());
format.setFilesFilter(FilePathFilter.createDefaultFilter());
log.info("setFilesFilter : " + FilePathFilter.createDefaultFilter().toString());
log.info("getFilesFilter : " + format.getFilePath().toString());
DataStream<String> inputStream =
env.readFile(format, localFsURI, FileProcessingMode.PROCESS_CONTINUOUSLY, Duration.ofSeconds(50).toMillis());
SingleOutputStreamOperator<String> soso = inputStream.map(String::toUpperCase);
soso.writeAsText("file:///usr/test/completed.txt", FileSystem.WriteMode.OVERWRITE);
env.execute("read and write");
}
}
This code works on Docker Desktop with Flink 1.12 and the container file path file:///usr/test. Note: keep parallelism at a minimum of 2 so that files can be processed in parallel.
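For instance, a one-line way to do that in the code above is to set the parallelism on the environment (this assumes the cluster has at least 2 task slots; the directory monitor itself always runs with parallelism 1, while the file readers run with the configured parallelism):
env.setParallelism(2); // readers can then process arriving files in parallel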

Error while submitting word count topology in Apache storm

This is the basic word count topology I tried to run, but I am receiving the error 'INFO org.apache.storm.zookeeper.server.SessionTrackerImpl - SessionTrackerImpl exited loop!'. Can anyone help me with this?
When I removed cluster.shutdown(), tweets kept coming continuously until I pressed Ctrl+C, but the word count is still not showing.
import java.util.Arrays;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
public class TwitterHashtagStorm {
public static void main(String[] args) throws Exception {
String consumerKey = "************";
String consumerSecret = "***************";
String accessToken = "**********";
String accessTokenSecret = "***********";
String[] keyWords = {"apple"};
Config config = new Config();
config.setDebug(true);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("twitter-spout", new TwitterSampleSpout(consumerKey,
consumerSecret, accessToken, accessTokenSecret, keyWords));
builder.setBolt("twitter-hashtag-reader-bolt", new HashtagReaderBolt())
.shuffleGrouping("twitter-spout");
builder.setBolt("twitter-hashtag-counter-bolt",
new HashtagCounterBolt()).fieldsGrouping(
"twitter-hashtag-reader-bolt", new Fields("hashtag"));
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("TwitterHashtagStorm", config,
builder.createTopology());
Thread.sleep(10000);
cluster.shutdown();
}
}
10 seconds (10000 ms) is probably not enough time for the Twitter connection to establish and for tweets to come into your topology. You should set the sleep call to something longer (several minutes at a minimum).
As for showing the word count, does your HashtagCounterBolt print the counts to stdout? If so, the output may be lost in the log messages from Storm. Try setting config.setDebug(false) (to cut down the log messages and give you a chance of seeing the counts), or rewrite HashtagCounterBolt to emit the messages to another location (a message broker, a local socket receiver, etc.) separate from the console you are running Storm from.
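A minimal sketch of those two changes to the main method above (the 5-minute sleep is just an illustrative value):
config.setDebug(false); // keep Storm's own log output quiet so your bolt's output is visible
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("TwitterHashtagStorm", config, builder.createTopology());
// give the Twitter connection time to establish and tweets time to flow through the topology
Thread.sleep(5 * 60 * 1000);
cluster.shutdown();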

Kafka: No message seen on console consumer after message sent by Java Producer

I'm new to Kafka. I created a Java producer on my local machine and set up a Kafka broker on another machine on the network, say M2 (I can ping, SSH into, and connect to this machine). On the producer side, in the Eclipse console, I get "Message sent", but when I check the console consumer on machine M2 I cannot see those messages.
My java producer code is:
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.HashMap;
import java.util.Map;
public class KafkaMessageProducer {
/**
* @param args
*/
public static void main(String[] args) {
KafkaMessageProducer reportObj = new KafkaMessageProducer();
reportObj.send();
}
public void send(){
Map<String, Object> config = new HashMap<String, Object>();
config.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "135.113.133.60:9092");
config.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
config.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaProducer<String, String> producer = new KafkaProducer<String, String>(config);
int maxMessages = 5;
int count = 0;
while(count < maxMessages){
producer.send(new ProducerRecord<String, String>("test", "msg", "message --- #"+count++));
System.out.println("Message send.."+count);
}
producer.close();
}
}
Can you please let me know where I'm going wrong? I can send messages locally on machine M2 from the console producer.
Note: Even when I change the IP address to the full hostname of the Kafka Broker it still has the same issue.
Update: I also think that the producer is able to connect to the Kafka broker and send the messages, but the Kafka broker does not pass these messages to the consumer. If I change the IP address or the port to ZooKeeper's (which is running on the same node as the Kafka broker) and watch ZooKeeper's log, it sees the producer connect and then rejects the session.
Update 2: I created a producer jar and ran it on machine M2, and it worked. So it seems there is something wrong with the way the producer tries to connect to the Kafka broker from my local machine; I'm not sure yet what the problem is.
I finally found the answer and I'm posting it here in case anyone else has the same issue. Set the Kafka broker property advertised.host.name (advertised.listeners in newer broker versions) when you are trying to connect remotely. This worked for me.
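For reference, a sketch of the relevant lines in the broker's server.properties; the hostname is a placeholder, and on older 0.8.x/0.9.x brokers the properties are advertised.host.name/advertised.port, while newer brokers use advertised.listeners:
listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://<hostname-or-ip-reachable-from-the-producer>:9092
# on older brokers:
# advertised.host.name=<hostname-or-ip-reachable-from-the-producer>
# advertised.port=9092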
Just as an idea for debugging - try producer.send(/* record */).get();
That is, wait for the result from the Future returned from the send() method. Could be that there's an exception on producer side and it's simply ignored in the background.
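A sketch of that change inside the send() loop of the producer above (it also needs import org.apache.kafka.clients.producer.RecordMetadata, and the checked exceptions have to be handled):
try {
    // blocking on the Future surfaces broker-side or connection errors instead of hiding them
    RecordMetadata metadata = producer.send(
            new ProducerRecord<String, String>("test", "msg", "message --- #" + count++)).get();
    System.out.println("Sent to partition " + metadata.partition() + " at offset " + metadata.offset());
} catch (Exception e) {
    e.printStackTrace(); // e.g. a TimeoutException if the broker is unreachable
}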
You can try using code like the following to read the metadata for the Kafka topic and see whether the broker knows about it; that can help with debugging.
SimpleConsumer consumer = new SimpleConsumer(broker.host(), broker.port(), 100000,
64 * 1024, "your_group_id");
List<String> topics = new ArrayList<>();
topics.add(topic);
TopicMetadataRequest req = new TopicMetadataRequest(topics);
TopicMetadataResponse resp = consumer.send(req);
if (resp.topicsMetadata().size() != 1) {
throw new RuntimeException("Expected one metadata for topic "
+ topic + " found " + resp.topicsMetadata().size());
}
TopicMetadata topicMetaData = resp.topicsMetadata().get(0);

How to get All Topics in apache kafka?

@RequestMapping(value = "/getTopics", method = RequestMethod.GET)
@ResponseBody
public Response getAllTopics() {
ZkClient zkClient = new ZkClient(ZookeeperProps.zookeeperURL, ZookeeperProps.connectionTimeoutMs,
ZookeeperProps.sessionTimeoutMs, ZKStringSerializer$.MODULE$);
Seq<String> topics = ZkUtils.getAllTopics(zkClient);
scala.collection.Iterator<String> topicIterator = topics.iterator();
String allTopics = "";
while(topicIterator.hasNext()) {
allTopics+=topicIterator.next();
allTopics+="\n";
}
Response response = new Response();
response.setResponseMessage(allTopics);
return response;
}
I am a novice with Apache Kafka.
These days I am trying to understand Kafka together with ZooKeeper.
I want to fetch the topics registered in ZooKeeper, so I am trying the following:
a) First I create the ZooKeeper client as shown below:
ZkClient(ZookeeperProps.zookeeperURL, ZookeeperProps.connectionTimeoutMs, ZookeeperProps.sessionTimeoutMs, ZKStringSerializer$.MODULE$);
Seq<String> topics = ZkUtils.getAllTopics(zkClient);
but topics is an empty set when I execute the Java code. I don't see what the problem is here.
My ZooKeeper props are as follows: String zkConnect = "127.0.0.1:2181";
And ZooKeeper is running perfectly fine.
Please help, guys.
It's pretty simple. (My example is written in Java, but it would be almost the same in Scala.)
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
public class KafkaTopicListFetcher {
public static void main(String[] args) throws Exception {
ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
List<String> topics = zk.getChildren("/brokers/topics", false);
for (String topic : topics) {
System.out.println(topic);
}
}
}
The result when I have three topics (test, test2, and test3):
test
test2
test3
A diagram of the ZooKeeper tree structure that Kafka uses (one I drew for my own blog post) is helpful for understanding this; the topic names live under the /brokers/topics node read above.
You can use the Kafka AdminClient. The code snippet below may help you:
Properties properties = new Properties();
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient adminClient = AdminClient.create(properties);
ListTopicsOptions listTopicsOptions = new ListTopicsOptions();
listTopicsOptions.listInternal(true);
System.out.println(adminClient.listTopics(listTopicsOptions).names().get());
I would prefer to use kafka-topics.sh, which is a shell script that ships with Kafka, to list topics, for example as shown below.
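For instance (assuming a local broker/ZooKeeper on the default ports; newer Kafka versions use --bootstrap-server instead of --zookeeper):
bin/kafka-topics.sh --list --zookeeper localhost:2181
bin/kafka-topics.sh --list --bootstrap-server localhost:9092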
The Kafka client library also has an AdminClient API, which supports managing and inspecting topics, brokers, configurations, and ACLs.
You can find code samples for
Creating new topic
Delete topic
Describe topic: gives Leader, Partitions, ISR and Replicas
List topics
Fetch controller broker/node details
All brokers/nodes details from the cluster
https://medium.com/nerd-for-tech/how-client-application-interact-with-kafka-cluster-made-easy-with-java-apis-58f29229d992

Kafka consumer in java not consuming messages

I am trying to write a Kafka consumer in Java to get messages that are produced and posted to a topic. My consumer is as follows.
consumer.java
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;
public class KafkaConsumer extends Thread {
final static String clientId = "SimpleConsumerDemoClient";
final static String TOPIC = " AATest";
ConsumerConnector consumerConnector;
public static void main(String[] argv) throws UnsupportedEncodingException {
KafkaConsumer KafkaConsumer = new KafkaConsumer();
KafkaConsumer.start();
}
public KafkaConsumer(){
Properties properties = new Properties();
properties.put("zookeeper.connect","10.200.208.59:2181");
properties.put("group.id","test-group");
ConsumerConfig consumerConfig = new ConsumerConfig(properties);
consumerConnector = Consumer.createJavaConsumerConnector(consumerConfig);
}
@Override
public void run() {
Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
topicCountMap.put(TOPIC, new Integer(1));
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumerConnector.createMessageStreams(topicCountMap);
KafkaStream<byte[], byte[]> stream = consumerMap.get(TOPIC).get(0);
System.out.println(stream);
ConsumerIterator<byte[], byte[]> it = stream.iterator();
while(it.hasNext())
System.out.println("from it");
System.out.println(new String(it.next().message()));
}
private static void printMessages(ByteBufferMessageSet messageSet) throws UnsupportedEncodingException {
for(MessageAndOffset messageAndOffset: messageSet) {
ByteBuffer payload = messageAndOffset.message().payload();
byte[] bytes = new byte[payload.limit()];
payload.get(bytes);
System.out.println(new String(bytes, "UTF-8"));
}
}
}
When I run the above code I get nothing in the console, whereas the Java producer program running behind the scenes is continuously posting data to the 'AATest' topic. Also, in the ZooKeeper console I see the following lines when I run the above consumer.java:
[2015-04-30 15:57:31,284] INFO Accepted socket connection from /10.200.208.59:51780 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2015-04-30 15:57:31,284] INFO Client attempting to establish new session at /10.200.208.59:51780 (org.apache.zookeeper.server.ZooKeeperServer)
[2015-04-30 15:57:31,315] INFO Established session 0x14d09cebce30007 with negotiated timeout 6000 for client /10.200.208.59:51780 (org.apache.zookeeper.server.ZooKeeperServer)
Also, when I run a separate console consumer pointing to the AATest topic, I get all the data produced by the producer to that topic.
Both the consumer and the broker are on the same machine, whereas the producer is on a different machine. This actually resembles this question, but going through it didn't help me. Please help me.
Different answer, but in my case it happened to be the initial offset (auto.offset.reset) of the consumer. Setting auto.offset.reset=earliest fixed the problem in my scenario. It's because I was publishing the event first and then starting the consumer.
By default, a consumer only consumes events published after it started, because auto.offset.reset=latest by default.
eg. consumer.properties
bootstrap.servers=localhost:9092
enable.auto.commit=true
auto.commit.interval.ms=1000
session.timeout.ms=30000
auto.offset.reset=earliest
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
Test
class KafkaEventConsumerSpecs extends FunSuite {
case class TestEvent(eventOffset: Long, hashValue: Long, created: Date, testField: String) extends BaseEvent
test("given an event in the event-store, consumes an event") {
EmbeddedKafka.start()
//PRODUCE
val event = TestEvent(0l, 0l, new Date(), "data")
val config = new Properties() {
{
load(this.getClass.getResourceAsStream("/producer.properties"))
}
}
val producer = new KafkaProducer[String, String](config)
val persistedEvent = producer.send(new ProducerRecord(event.getClass.getSimpleName, event.toString))
assert(persistedEvent.get().offset() == 0)
assert(persistedEvent.get().checksum() != 0)
//CONSUME
val consumerConfig = new Properties() {
{
load(this.getClass.getResourceAsStream("/consumer.properties"))
put("group.id", "consumers_testEventsGroup")
put("client.id", "testEventConsumer")
}
}
assert(consumerConfig.getProperty("group.id") == "consumers_testEventsGroup")
val kafkaConsumer = new KafkaConsumer[String, String](consumerConfig)
assert(kafkaConsumer.listTopics().asScala.map(_._1).toList == List("TestEvent"))
kafkaConsumer.subscribe(Collections.singletonList("TestEvent"))
val events = kafkaConsumer.poll(1000)
assert(events.count() == 1)
EmbeddedKafka.stop()
}
}
But if the consumer is started first and the event is published afterwards, the consumer should be able to consume the event without auto.offset.reset needing to be set to earliest.
References for Kafka 0.10:
https://kafka.apache.org/documentation/#consumerconfigs
In our case, we solved our problem with the following steps:
The first thing we found is that the KafkaProducer has a retries config whose default value means 'no retry'. Also, the send method of KafkaProducer is asynchronous unless you call get() on the Future it returns, so without retries there is no guarantee that produced messages are delivered to the corresponding broker. You have to increase it a bit, or use the idempotent or transactional mode of KafkaProducer; a sketch of such a configuration follows.
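For example, a minimal sketch of a producer configured with retries and idempotence that sends synchronously (the broker address and topic name are placeholders, and the checked exceptions from get() still need handling):
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.RETRIES_CONFIG, 3);               // default was 0 ("no retry") in older clients
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true); // implies acks=all and safe retry behaviour
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// blocking get() makes delivery failures visible instead of silently dropping the message
producer.send(new ProducerRecord<>("my-topic", "key", "value")).get();
producer.close();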
The second case is about the Kafka and ZooKeeper versions. We chose Kafka 1.0.0 and ZooKeeper 3.4.4. Kafka 1.0.0 in particular had an issue with its connectivity to ZooKeeper: if Kafka loses its connection to ZooKeeper with an unexpected exception, it loses the leadership of the partitions that have not synced yet. There is a bug report about this issue:
https://issues.apache.org/jira/browse/KAFKA-2729
After we found log entries in the Kafka log indicating the same issue as the ticket above, we upgraded our Kafka broker version to 1.1.0.
It is also important to note that a small number of partitions (like 100 or fewer) increases the throughput of the producer, so if there are not enough consumers, the available consumers can get stuck and messages are delayed (we measured delays in minutes, approximately 10-15 minutes). So you need to balance and configure the partition count and the thread counts of your application correctly according to your available resources.
There may also be a case where Kafka takes a long time to rebalance consumer groups when a new consumer is added to the same group id.
Check the Kafka logs to see whether the group is rebalanced after starting your consumer.
