I have a stream topology as follows:
Topology topology = new Topology();
//WS connection processor
topology.addSource(WS_CONNECTION_SOURCE, new StringDeserializer(), new WebSocketConnectionEventDeserializer(),
        utilService.getTopicByType(TopicType.CONNECTION_EVENTS_TOPIC))
        .addProcessor(SESSION_PROCESSOR, WSUserSessionProcessor::new, WS_CONNECTION_SOURCE)
        .addStateStore(sessionStoreBuilder, SESSION_PROCESSOR)
        .addSink(WS_STATUS_SINK, utilService.getTopicByType(TopicType.ONLINE_STATUS_TOPIC),
                stringSerializer, stringSerializer, SESSION_PROCESSOR)
//WS session routing
.addSource(WS_ROUTING_BY_SESSION_SOURCE, new StringDeserializer(), new StringDeserializer(),
utilService.getTopicByType(TopicType.NOTIFICATION_TOPIC))
.addProcessor(WS_ROUTING_BY_SESSION_PROCESSOR, WSSessionRoutingProcessor::new,
WS_ROUTING_BY_SESSION_SOURCE)
.addStateStore(userConnectedNodesStoreBuilder, WS_ROUTING_BY_SESSION_PROCESSOR, SESSION_PROCESSOR);
topology.addGlobalStore(serviceDiscoveryGlobalStoreBuilder, "WS_SERVICE_DISCOVERY", new StringDeserializer(),
new ServiceDiscoveryEventDeserializer(), "ws-service-discovery", "WS_SERVICE_DISCOVERY_PROCESSOR",
ServiceDiscoveryProcessor::new);
While analyzing logs, I see a huge delay before messages arrive in the WS_ROUTING_BY_SESSION_PROCESSOR process() method; the delay is on the order of 10+ seconds.
This is the configuration used in streams:
Properties properties = new Properties();
properties.setProperty(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServers);
properties.setProperty(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");
properties.setProperty(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "60000");
properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
properties.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
properties.put(ConsumerConfig.DEFAULT_API_TIMEOUT_MS_CONFIG, "60000");
The NOTIFICATION_TOPIC has 10 partitions and the number of stream threads is 10.
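For reference, the thread count mentioned above would typically be set alongside the other properties; a small sketch (the application id value here is a placeholder, not taken from the original code):
// Required by Kafka Streams; placeholder value.
properties.setProperty(StreamsConfig.APPLICATION_ID_CONFIG, "ws-routing-app");
// One stream thread per NOTIFICATION_TOPIC partition (10 partitions above).
properties.setProperty(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "10");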
I am writing a Spark 2.4 transformation for Spark benchmarking that reads JSON streams from a Kafka topic and needs to dump them to MongoDB. I can do it using the Java MongoClient, but the data can be huge, for example 1 million records coming through multiple threads from Kafka. Spark processes it very fast, but the Mongo write is very slow.
SparkConf sparkConf = new SparkConf().setMaster("local[*]")
        .setAppName("JavaDirectKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled","true");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "loacalhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id", "2");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("poc-topic");
final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(streamingContext,
LocationStrategies.PreferConsistent(),
org.apache.spark.streaming.kafka010.ConsumerStrategies.<String, String> Subscribe(topics, kafkaParams));
@SuppressWarnings("serial")
JavaPairDStream<String, String> jPairDStream = stream
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
return new Tuple2<>(record.key(), record.value());
}
});
jPairDStream.foreachRDD(jPairRDD -> {
jPairRDD.foreach(rdd -> {
System.out.println("value=" + rdd._2());
if (rdd._2() != null) {
System.out.println("inserting=" + rdd._2());
Document doc = Document.parse(rdd._2());
// List<Document> list = new ArrayList<>();
// list.add(doc);
db.getCollection("collection").insertOne(doc);
System.out.println("Inserted Data Done");
}
else {
System.out.println("Got no data in this window");
}
});
});
streamingContext.start();
streamingContext.awaitTermination();
Where
MongoClient mongo = new MongoClient("localhost", 27017);
MongoDatabase db = mongo.getDatabase("mongodb");
I want to speed up the Mongo operation. How do I achieve multithreading for the Mongo writes (should I use MongoClientOptions for min connections per host)?
Also, is the approach of using the Mongo driver correct, or should this be done with the MongoSpark connector or the Spark writeStream() APIs? If so, how do I write each RDD record as a separate document in Mongo? Any example in Java?
I don't know about "efficiently" because there are a lot of factors at play here.
For example, Kafka partitions and total Spark executors are just two values that need to be tuned to accommodate the throughput.
I do see you are using the ForEachWriter, which is a reasonable way to do it, but maybe not the best, considering you are constantly calling insertOne. Compare that with using Spark Structured Streaming to begin with: read from Kafka, shape your data into a Struct object, then use the SparkSQL Mongo Connector to dump directly into Mongo collections (which, I would guess, uses Mongo transactions and inserts multiple records at a time).
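To make the batching idea concrete, here is a minimal sketch (not the asker's code) that replaces the per-record insertOne with one insertMany per partition, assuming the same jPairDStream, a local MongoDB, and the database/collection names from the question:
// Assumed imports: com.mongodb.MongoClient, com.mongodb.client.MongoCollection,
// org.bson.Document, scala.Tuple2, java.util.ArrayList, java.util.List
jPairDStream.foreachRDD(jPairRDD -> {
    jPairRDD.foreachPartition(partition -> {
        // Create the client on the executor so it is not serialized from the driver.
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            MongoCollection<Document> collection = client.getDatabase("mongodb").getCollection("collection");
            List<Document> batch = new ArrayList<>();
            while (partition.hasNext()) {
                Tuple2<String, String> record = partition.next();
                if (record._2() != null) {
                    batch.add(Document.parse(record._2()));
                }
            }
            if (!batch.isEmpty()) {
                collection.insertMany(batch); // one round trip per partition instead of per record
            }
        } finally {
            client.close();
        }
    });
});
A connection pool or the SparkSQL Mongo connector mentioned above would still be the better fit for sustained load, but even this change removes one network round trip per record.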
Also worth mentioning, Landoop offers a MongoDB Kafka Connect Sink, which requires one config file, and no Spark code to be written.
I'm trying to reset the consumer offset whenever the consumer is called, so that when I call the consumer many times it can still read the records sent by the producer. I'm setting props.put("auto.offset.reset","earliest"); and calling consumer.seekToBeginning(consumer.assignment());, but when I call the consumer the second time it receives no records. How can I fix this?
public ConsumerRecords<String, byte[]> consumer(){
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
//props.put("group.id", String.valueOf(System.currentTimeMillis()));
props.put("auto.offset.reset","earliest");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));
ConsumerRecords<String, byte[]> records = consumer.poll(100);
consumer.seekToBeginning(consumer.assignment());
/* List<byte[]> videoContents = new ArrayList<byte[]>();
for (ConsumerRecord<String, byte[]> record : records) {
System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
videoContents.add(record.value());
}*/
return records;
}
public String producer(@RequestParam("message") String message) {
Map<String, Object> props = new HashMap<>();
// list of host:port pairs used for establishing the initial connections to the Kafka cluster
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
Producer<String, byte[]> producer = new KafkaProducer<>(props);
Path path = Paths.get("C:/Programming Files/video-2012-07-05-02-29-27.mp4");
ProducerRecord<String, byte[]> record = null;
try {
record = new ProducerRecord<>("topiccc", "keyyyyy"
, Files.readAllBytes(path));
} catch (IOException e) {
e.printStackTrace();
}
producer.send(record);
producer.close();
//kafkaSender.send(record);
return "Message sent to the Kafka Topic java_in_use_topic Successfully";
}
From the Kafka Java Code, the documentation on AUTO_OFFSET_RESET_CONFIG says the following:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw an exception to the consumer if no previous offset is found for the consumer's group
anything else: throw an exception to the consumer.
This can be found here in GitHub:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
We can see from their comment that the setting is only used when the offset is not on the server. In the question, the offset is retrieved from the server and that's why the offset is not reset to the beginning but rather stays at the last offset, making it appear that there are no more records.
You would need to explicitly reset the offset on the server side to fix this as requested in the question.
Here is another answer that describes how that could be done.
https://stackoverflow.com/a/54492802/231860
This is a snippet of code that allowed me to reset the offset. NOTE: You can't call seekToBeginning if you call the subscribe method. I could only get it to work if I assign the partitions myself using the assign method. Pity.
// Create the consumer:
final Consumer<String, DataRecord> consumer = new KafkaConsumer<>(props);
// Get the partitions that exist for this topic:
List<PartitionInfo> partitions = consumer.partitionsFor(topic);
// Get the topic partition info for these partitions:
List<TopicPartition> topicPartitions = partitions.stream().map(info -> new TopicPartition(info.topic(), info.partition())).collect(Collectors.toList());
// Assign all the partitions to the topic so that we can seek to the beginning:
// NOTE: We can't use subscribe if we use assign, but we can't seek to the beginning if we use subscribe.
consumer.assign(topicPartitions);
// Make sure we seek to the beginning of the partitions:
consumer.seekToBeginning(topicPartitions);
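A poll after the seek then starts from offset 0 on every assigned partition; a short usage sketch (the DataRecord value type is taken from the snippet above, and poll(Duration) assumes a 2.0+ client, older clients use poll(long)):
// Records now come back from the beginning of every assigned partition.
ConsumerRecords<String, DataRecord> records = consumer.poll(Duration.ofMillis(500));
records.forEach(r -> System.out.println("offset=" + r.offset() + " key=" + r.key()));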
Yes, it seems extremely complicated to achieve a seemingly rudimentary use case. This might indicate that the whole Kafka world just seems to want to read streams once.
I usually create a new consumer with a different group.id to read the records again.
So do it like this:
props.put("group.id", String.valueOf(Instant.now().getEpochSecond()));
There is a workaround for this (not a production solution, though) which is to change the group.id configuration value each time you consume. Setting auto.offset.reset to earliest is not enough in many cases.
When you want one message to be consumed by consumers multiple times, the ideal way is to create the consumers in different consumer groups, so the same message can be consumed by each of them.
But if you want the same consumer to consume the same message multiple times, then you can play with commits and offsets.
You can set the auto commit interval very high, or disable auto commit and commit according to your own logic.
You can refer to this for more details: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
This javadoc provides detail on how to manually manage offsets.
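For the second option, a minimal sketch of manual commits (illustrative only; it reuses the props and topiccc topic from the question, process() is a hypothetical handler, and poll(Duration) assumes a 2.0+ client):
// Disable auto commit so the group offset only moves when we commit explicitly.
props.put("enable.auto.commit", "false");
KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));

ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(500));
for (ConsumerRecord<String, byte[]> record : records) {
    process(record); // hypothetical handler
}
// Commit only after the batch is handled; skipping this call leaves the group offset
// where it was, so the same records are read again on the next run.
consumer.commitSync();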
I need to update the TTL for a topic so records stay in it for 10 days. I have to do this for one particular topic only, leaving all other topics' TTLs at their current configuration, and I have to do it in Java because I am pushing to the topic through Java. I am setting the following properties for pushing to the topic:
Properties props = new Properties();
props.put("bootstrap.servers", KAFKA_SERVERS);
props.put("acks", ACKS);
props.put("retries", RETRIES);
props.put("linger.ms", new Integer(LINGER_MS));
props.put("buffer.memory", new Integer(BUFFER_MEMORY));
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
You can do that using the AdminClient. The following snippet of code gets the current configuration (just for testing) and then updates the "retention.ms" config on the topic named "test".
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient adminClient = AdminClient.create(props);
ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, "test");
// get the current topic configuration
DescribeConfigsResult describeConfigsResult =
adminClient.describeConfigs(Collections.singleton(resource));
Map<ConfigResource, Config> config = describeConfigsResult.all().get();
System.out.println(config);
// create a new entry for updating the retention.ms value on the same topic
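// (for the 10-day retention asked about above, the value would be "864000000"; "50000" here is just a test value)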
ConfigEntry retentionEntry = new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "50000");
Map<ConfigResource, Config> updateConfig = new HashMap<ConfigResource, Config>();
updateConfig.put(resource, new Config(Collections.singleton(retentionEntry)));
AlterConfigsResult alterConfigsResult = adminClient.alterConfigs(updateConfig);
alterConfigsResult.all().get(); // wait for the update to be applied
describeConfigsResult = adminClient.describeConfigs(Collections.singleton(resource));
config = describeConfigsResult.all().get();
System.out.println(config);
adminClient.close();
I'm trying to create a simple KafkaProducer and KafkaConsumer so I can send data to a topic on a broker and then verify that the data was received. Below are the two methods I use to define my consumer and producer, and how I'm sending the message. The send method takes at least 20 seconds to complete, and as far as I can tell the consumer.poll method never actually finishes, though the longest I've left it is 10 minutes.
Does anyone have a suggestion as to what I'm doing wrong? Is there some property for the producer/consumer that I'm not setting up correctly? Those properties are copied directly from the docs, so I don't understand why they won't work.
KafkaProducer docs
KafkaConsumer docs
"verify we can send to producer" in {
val consumer = createKafkaConsumer("address:9002")
val producer = createKafkaProducer("address:9002")
val message = "I am a message"
val record = new ProducerRecord[String, String]("myTopic", message)
producer.send(record)
TimeUnit.SECONDS.sleep(5)
val records = consumer.poll(5000)
println("records: "+records)
consumer.close()
}
def createKafkaProducer(kafka: String): KafkaProducer[String,String] = {
val props = new Properties()
props.put("bootstrap.servers", kafka)
props.put("acks", "all")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
new KafkaProducer[String,String](props)
}
def createKafkaConsumer(kafka: String): KafkaConsumer[String, String] = {
val props = new Properties()
props.put("bootstrap.servers", kafka)
props.put("group.id", "test")
props.put("enable.auto.commit", "true")
props.put("auto.commit.interval.ms", "1000")
props.put("session.timeout.ms", "30000")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("myTopic"))
consumer
}
Edit: I've updated my code so that I now get the response from the send method, and it seems that it times out with org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.
Turns out I had a DNS issue that meant I wasn't actually connecting to the broker. Fixing this allowed the messages to go through; there was nothing wrong with the config.
I am trying to write Storm-based code that reads messages from one topic and writes them back to another topic. The input topic has data in Protobuf format and the output should be in JSON format. I am not able to achieve it.
This is the code which builds the topology:
Config conf = new Config();
//set producer properties.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9093");
props.put("request.required.acks", "1");
props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
conf.put("kafka.broker.config", props);
conf.put(KafkaBolt.TOPIC, "out-storm");
KafkaBolt bolt = new KafkaBolt()
.withProducerProperties(props)
.withTopicSelector(new DefaultTopicSelector("out-storm")).withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<String, String>());
BrokerHosts hosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "incoming-server", "/" + "incoming-server",
UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout);
builder.setBolt("lookup-bolt", new ReportBolt(),4).shuffleGrouping("kafka-spout");
builder.setBolt("kafka-producer-spout", bolt).shuffleGrouping("lookup-bolt");
LocalCluster cluster = new LocalCluster();
Config config = new Config();
config.setDebug(true);
config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
config.put("kafka.broker.config", props);
config.put(KafkaBolt.TOPIC, "out-storm");
cluster.submitTopology("KafkaStormSample", config, builder.createTopology());
Thread.sleep(1000000);
In the ReportBolt I have done this:
System.out.println("HELLO " + input);
JSONObject jo= new JSONObject();
for (String f:input.getFields()){
jo.put(f, input.getValueByField(f));
}
collector.ack(input);
List<Object> list = new ArrayList<Object>();
list.add(jo);
collector.emit(list);
When starting it, I am getting this error:
5207 [main] WARN o.a.s.d.nimbus - Topology submission exception. (topology name='KafkaStormSample') #error {
:cause nil
:via
[{:type org.apache.storm.generated.InvalidTopologyException
:message nil
:at [org.apache.storm.daemon.common$validate_structure_BANG_ invoke common.clj 181]}]
:trace
[[org.apache.storm.daemon.common$validate_structure_BANG_ invoke common.clj 181]
[org.apache.storm.daemon.common$system_topology_BANG_ invoke common.clj 360]
[org.apache.storm.daemon.nimbus$fn__7064$exec_fn__2461__auto__$reify__7093 submitTopologyWithOpts nimbus.clj 1512]
[org.apache.storm.daemon.nimbus$fn__7064$exec_fn__2461__auto__$reify__7093 submitTopology nimbus.clj 1544]
[sun.reflect.NativeMethodAccessorImpl invoke0 NativeMethodAccessorImpl.java -2]
[sun.reflect.NativeMethodAccessorImpl invoke NativeMethodAccessorImpl.java 62]
[sun.reflect.DelegatingMethodAccessorImpl invoke DelegatingMethodAccessorImpl.java 43]
[java.lang.reflect.Method invoke Method.java 497]
[clojure.lang.Reflector invokeMatchingMethod Reflector.java 93]
[clojure.lang.Reflector invokeInstanceMethod Reflector.java 28]
[org.apache.storm.testing$submit_local_topology invoke testing.clj 301]
[org.apache.storm.LocalCluster$_submitTopology invoke LocalCluster.clj 49]
[org.apache.storm.LocalCluster submitTopology nil -1]
[com.mediaiq.StartStorm main StartStorm.java 81]]}
I think the problem is that you are referencing the wrong port in your bootstrap.servers config. Try changing it to 9092.
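That is, in the producer properties from the question (assuming the broker is listening on the default port):
props.put("bootstrap.servers", "localhost:9092");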