List<NewTopic> newKafkaTopicsList = new ArrayList<>();
NewTopic newTopic = new NewTopic("topicName", getPartitionCount(),
        getReplicationFactor());
newKafkaTopicsList.add(newTopic);
Below is the AdminClient API call to create a topic; it accepts a
List<NewTopic>.
NewTopic is provided by the Kafka AdminClient and has the constructor
NewTopic(java.lang.String name, int numPartitions, short replicationFactor)
and a configs method
configs(java.util.Map<java.lang.String,java.lang.String> configs)
Can someone explain how to pass a Map to the configs method?
CreateTopicsResult createTopicsResult = adminClient.createTopics(newKafkaTopicsList);
For example:
Map<String, String> configMap = new HashMap<>();
configMap.put("cleanup.policy", "compact");
Then call newTopic.configs(configMap); before passing the topic to createTopics. See the topic configs documentation for more options.
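Putting it together, a minimal end-to-end sketch (the broker address, partition count, and replication factor here are placeholder values, not taken from the question):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.CreateTopicsResult;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicWithConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient adminClient = AdminClient.create(props)) {
            // topic-level configs go into the map passed to NewTopic.configs(...)
            Map<String, String> configMap = new HashMap<>();
            configMap.put("cleanup.policy", "compact");
            NewTopic newTopic = new NewTopic("topicName", 3, (short) 1);
            newTopic.configs(configMap);
            CreateTopicsResult createTopicsResult =
                    adminClient.createTopics(Collections.singletonList(newTopic));
            createTopicsResult.all().get(); // block until the broker has created the topic
        }
    }
}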
I am writing a Spark 2.4 transformation for Spark benchmarking that reads JSON streams from a Kafka topic and needs to dump them to MongoDB. I can do it using the Java MongoClient, but the data can be huge, e.g. 1 million records coming through multiple threads from Kafka. Spark processes it very fast, but the Mongo write is very slow.
SparkConf sparkConf = new SparkConf().setMaster("local[*]").
setAppName("JavaDirectKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled","true");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(2));
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "loacalhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id", "2");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);
Collection<String> topics = Arrays.asList("poc-topic");
final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(streamingContext,
LocationStrategies.PreferConsistent(),
org.apache.spark.streaming.kafka010.ConsumerStrategies.<String, String> Subscribe(topics, kafkaParams));
@SuppressWarnings("serial")
JavaPairDStream<String, String> jPairDStream = stream
.mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
return new Tuple2<>(record.key(), record.value());
}
});
jPairDStream.foreachRDD(jPairRDD -> {
jPairRDD.foreach(rdd -> {
System.out.println("value=" + rdd._2());
if (rdd._2() != null) {
System.out.println("inserting=" + rdd._2());
Document doc = Document.parse(rdd._2());
// List<Document> list = new ArrayList<>();
// list.add(doc);
db.getCollection("collection").insertOne(doc);
System.out.println("Inserted Data Done");
}
else {
System.out.println("Got no data in this window");
}
});
});
streamingContext.start();
streamingContext.awaitTermination();
Where
MongoClient mongo = new MongoClient("localhost", 27017);
MongoDatabase db = mongo.getDatabase("mongodb");
I expect to speed up the Mongo operation. How can I achieve multithreading for the Mongo write? (Should I use MongoClientOptions for min connections per host?)
Also, is the approach of using the Mongo driver correct, or should this be done with the MongoSpark connector or Spark's writeStream() APIs? If so, how do I write each record as a separate document in Mongo? Any example in Java?
I don't know about "efficiently" because there are a lot of factors at play here.
For example, Kafka partitions and total Spark executors are just two values that need to be tuned to accommodate throughput.
I do see you are using a foreach writer, which is a reasonable way to do it, but maybe not the best given that you are constantly calling insertOne, compared to using Spark Structured Streaming from the start: read from Kafka, shape your data into a struct, then use the Spark SQL Mongo connector to dump directly to Mongo collections (which, I would guess, uses Mongo transactions and inserts multiple records at a time).
Also worth mentioning: Landoop offers a MongoDB Kafka Connect sink, which requires one config file and no Spark code to be written.
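To illustrate the batching idea against the question's own DStream code (a rough sketch, not the Mongo Spark connector; it reuses the question's connection and collection names and additionally needs java.util.ArrayList, java.util.List and com.mongodb.client.MongoCollection imports):
jPairDStream.foreachRDD(jPairRDD -> {
    jPairRDD.foreachPartition(records -> {
        // one client and one batch per partition, created on the executor
        try (MongoClient client = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> collection =
                    client.getDatabase("mongodb").getCollection("collection");
            List<Document> batch = new ArrayList<>();
            while (records.hasNext()) {
                Tuple2<String, String> record = records.next();
                if (record._2() != null) {
                    batch.add(Document.parse(record._2()));
                }
            }
            if (!batch.isEmpty()) {
                collection.insertMany(batch); // one round trip per partition instead of one per record
            }
        }
    });
});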
I am trying to receive a very big message with Spark from Kafka.
But it seems that Spark has a limit on the size of the message that can be read.
I have changed the Kafka config to be able to consume and send big messages, but this is not enough (I think this is related to Spark, not Kafka), because when using the Kafka console consumer script I have no problem displaying the content of the message.
Maybe this is related to spark.streaming.kafka.consumer.cache.maxCapacity, but I don't know how to set it in a Spark Java based program.
Thank you.
Update
I am using this to connect to Kafka; normally args[0] is the ZooKeeper address and args[1] is the group ID.
if (args.length < 4) {
System.err.println("Usage: Stream Car data <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("stream cars data");
final JavaSparkContext jSC = new JavaSparkContext(sparkConf);
// Create the context with a batch size of 2 seconds
JavaStreamingContext jssc = new JavaStreamingContext(jSC,new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
JavaPairReceiverInputDStream<String, String> messages =
KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
JavaDStream<String> data = messages.map(Tuple2::_2);
and this is the error that I get
18/04/13 17:20:33 WARN scheduler.ReceiverTracker: Error reported by receiver for stream 0: Error handling message; exiting - kafka.common.MessageSizeTooLargeException: Found a message larger than the maximum fetch size of this consumer on topic Hello-Kafka partition 0 at fetch offset 3008. Increase the fetch size, or decrease the maximum message size the broker will allow.
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:90)
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at org.apache.spark.streaming.kafka.KafkaReceiver$MessageHandler.run(KafkaInputDStream.scala:133)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Depending on the version of Kafka you are using, you need to set the following consumer config in the consumer.properties file available (or to be created) in the Kafka config directory.
For version 0.8.x or below:
fetch.message.max.bytes
For Kafka version 0.9.0 or above, set
fetch.max.bytes
to an appropriate value based on your application.
E.g. fetch.max.bytes=10485760
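If you configure the consumer from Java code rather than a properties file, the same kind of setting can be passed in the consumer Properties. A rough sketch with the new consumer API (the broker address, group id and sizes are placeholder values; depending on your client version the relevant key may be fetch.max.bytes and/or max.partition.fetch.bytes):
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LargeMessageConsumerConfig {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "my-group");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // raise both the total fetch limit and the per-partition limit for large records
        consumerProps.put("fetch.max.bytes", "10485760");
        consumerProps.put("max.partition.fetch.bytes", "10485760");
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
        consumer.close();
    }
}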
So I have found a solution to my problem. In fact, as I said in the comments, the files in the config directory are just examples and they aren't taken into consideration when starting the server. So all the consumer configuration, including fetch.message.max.bytes, needs to be done in the consumer code.
And this is how I did it:
if (args.length < 4) {
System.err.println("Usage: Stream Car data <zkQuorum> <group> <topics> <numThreads>");
System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("stream cars data");
final JavaSparkContext jSC = new JavaSparkContext(sparkConf);
// Create the context with a batch size of 2 seconds
JavaStreamingContext jssc = new JavaStreamingContext(jSC,new Duration(2000));
int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic: topics) {
topicMap.put(topic, numThreads);
}
Set<String> topicsSet = new HashSet<>(Arrays.asList(topics));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
kafkaParams.put("group.id", args[1]);
kafkaParams.put("zookeeper.connect", args[0]);
kafkaParams.put("fetch.message.max.bytes", "1100000000");
JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicMap,
        StorageLevel.MEMORY_ONLY());
JavaDStream<String> data = messages.map(Tuple2::_2);
I need to update the TTL for a topic so records stay in it for 10 days. I have to do this for one particular topic only, leaving all other topics' TTLs at their current configuration, and I have to do it in Java because I am pushing to the topic through Java. I am setting the following properties for pushing to the topic:
Properties props = new Properties();
props.put("bootstrap.servers", KAFKA_SERVERS);
props.put("acks", ACKS);
props.put("retries", RETRIES);
props.put("linger.ms", new Integer(LINGER_MS));
props.put("buffer.memory", new Integer(BUFFER_MEMORY));
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
You can do that using the AdminClient. Following is a snippet of code that gets the current configuration (just for testing) and then updates the "retention.ms" config on the topic named "test".
Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient adminClient = AdminClient.create(props);
ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, "test");
// get the current topic configuration
DescribeConfigsResult describeConfigsResult =
adminClient.describeConfigs(Collections.singleton(resource));
Map<ConfigResource, Config> config = describeConfigsResult.all().get();
System.out.println(config);
// create a new entry for updating the retention.ms value on the same topic
ConfigEntry retentionEntry = new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "50000"); // 50 seconds here; 10 days would be "864000000"
Map<ConfigResource, Config> updateConfig = new HashMap<ConfigResource, Config>();
updateConfig.put(resource, new Config(Collections.singleton(retentionEntry)));
AlterConfigsResult alterConfigsResult = adminClient.alterConfigs(updateConfig);
alterConfigsResult.all().get(); // wait for the update to complete
describeConfigsResult = adminClient.describeConfigs(Collections.singleton(resource));
config = describeConfigsResult.all().get();
System.out.println(config);
adminClient.close();
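On newer clients (roughly Kafka 2.3+), alterConfigs is deprecated in favor of incrementalAlterConfigs, which only touches the entries you pass. A sketch of the same retention update with it (the topic name "test" and the broker address are the same placeholders as above; 864000000 ms corresponds to the 10 days from the question):
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import org.apache.kafka.common.config.TopicConfig;

public class UpdateRetentionIncrementally {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient adminClient = AdminClient.create(props)) {
            ConfigResource resource = new ConfigResource(ConfigResource.Type.TOPIC, "test");
            ConfigEntry retention = new ConfigEntry(TopicConfig.RETENTION_MS_CONFIG, "864000000");
            Map<ConfigResource, Collection<AlterConfigOp>> updates = Collections.singletonMap(
                    resource, Collections.singleton(new AlterConfigOp(retention, AlterConfigOp.OpType.SET)));
            adminClient.incrementalAlterConfigs(updates).all().get(); // wait for the update to apply
        }
    }
}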
I am implementing a simple Kafka consumer in Java. Here is the code:
public class TestConsumer {
public static void main(String []a) throws Exception{
Properties props = new Properties();
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("partition.assignment.strategy", "round-robin");
props.put("group.id", "test");
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,"localhost:9092");
KafkaConsumer<String, String> consumer = new KafkaConsumer<String, String>(props);
try{
consumer.subscribe("ay_sparktopic");
Map<String, ConsumerRecords<String, String>> msg = consumer.poll(100);
System.out.println(msg);
}catch(Exception e){
System.out.println("Exception");
}
}
}
The above consumer gives the following warning messages:
16/03/30 18:01:07 WARN ConsumerConfig: The configuration group.id = test was supplied but isn't a known config.
16/03/30 18:01:07 WARN ConsumerConfig: The configuration partition.assignment.strategy = round-robin was supplied but isn't a known config.
Any documentation I check online gives either range or roundrobin as the possible assignment strategies, and group.id is a custom name to my knowledge. I am not sure what the right config values would be here.
It looks like you're trying to use the new consumer API, which is only available in Kafka 0.9+. To use the older API you have to import classes from the kafka.javaapi.consumer.* package instead of the new org.apache.kafka.clients.consumer package.
consumer.subscribe and consumer.poll relate to the new API, so if you really want to use the old API, you need to change your code accordingly. If you instead want to use the new consumer API, you need to run Kafka 0.9 or later.
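For completeness, a minimal sketch of the new consumer API against Kafka 0.9+ (it reuses the topic and group names from the question; the broker address is a placeholder). Note that in the new API, partition.assignment.strategy takes an assignor class name such as org.apache.kafka.clients.consumer.RoundRobinAssignor rather than "round-robin":
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewApiTestConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("ay_sparktopic")); // takes a collection, not a String
            ConsumerRecords<String, String> records = consumer.poll(1000);  // returns ConsumerRecords, not a Map
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.key() + " = " + record.value());
            }
        }
    }
}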
Using the dependency below resolves the issue,
libraryDependencies += "org.apache.kafka" % "kafka_2.11" % "0.9.0.0"
even when you have a previous version running, e.g. Kafka 0.8.2.1.
@RequestMapping(value = "/getTopics", method = RequestMethod.GET)
@ResponseBody
public Response getAllTopics() {
ZkClient zkClient = new ZkClient(ZookeeperProps.zookeeperURL, ZookeeperProps.connectionTimeoutMs,
ZookeeperProps.sessionTimeoutMs, ZKStringSerializer$.MODULE$);
Seq<String> topics = ZkUtils.getAllTopics(zkClient);
scala.collection.Iterator<String> topicIterator = topics.iterator();
String allTopics = "";
while(topicIterator.hasNext()) {
allTopics+=topicIterator.next();
allTopics+="\n";
}
Response response = new Response();
response.setResponseMessage(allTopics);
return response;
}
I am a novice in Apache Kafka.
These days I am trying to understand Kafka with ZooKeeper.
I want to fetch the topics associated with ZooKeeper, so I am trying the following things:
a) First I made the ZooKeeper client as shown below:
ZkClient zkClient = new ZkClient(ZookeeperProps.zookeeperURL, ZookeeperProps.connectionTimeoutMs, ZookeeperProps.sessionTimeoutMs, ZKStringSerializer$.MODULE$);
Seq<String> topics = ZkUtils.getAllTopics(zkClient);
But topics is an empty set when executing the Java code. I am not getting what the problem is here.
My ZooKeeper props are as follows: String zkConnect = "127.0.0.1:2181";
And ZooKeeper is running perfectly fine.
Please help, guys.
It's pretty simple. (My example is written in Java, but it would be almost the same in Scala.)
import java.util.List;
import org.apache.zookeeper.ZooKeeper;
public class KafkaTopicListFetcher {
public static void main(String[] args) throws Exception {
ZooKeeper zk = new ZooKeeper("localhost:2181", 10000, null);
List<String> topics = zk.getChildren("/brokers/topics", false);
for (String topic : topics) {
System.out.println(topic);
}
}
}
The result when I have three topics: test, test2, and test3
test
test2
test3
The original answer also included a diagram, drawn for the author's blog post, showing the structure of the ZooKeeper tree that Kafka uses.
You can use the Kafka AdminClient. The code snippet below may help you:
Properties properties = new Properties();
properties.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
AdminClient adminClient = AdminClient.create(properties);
ListTopicsOptions listTopicsOptions = new ListTopicsOptions();
listTopicsOptions.listInternal(true);
System.out.println(adminClient.listTopics(listTopicsOptions).names().get());
I would prefer to use kafka-topics.sh, a built-in shell script of Kafka, to get topics.
The Kafka client library has the AdminClient API, which supports managing and inspecting topics, brokers, configurations, and ACLs; a small describe-topics example is sketched below, after the link.
You can find code samples for
Creating new topic
Delete topic
Describe topic: gives Leader, Partitions, ISR and Replicas
List topics
Fetch controller broker/node details
All brokers/nodes details from the cluster
https://medium.com/nerd-for-tech/how-client-application-interact-with-kafka-cluster-made-easy-with-java-apis-58f29229d992
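As an illustration of the "Describe topic" item from the list above, a minimal sketch with the AdminClient (the broker address and topic name are placeholders):
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class DescribeTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient adminClient = AdminClient.create(props)) {
            Map<String, TopicDescription> descriptions =
                    adminClient.describeTopics(Collections.singletonList("poc-topic")).all().get();
            // print leader, replicas and ISR for each partition
            descriptions.forEach((name, description) ->
                    description.partitions().forEach(p ->
                            System.out.println(name + " partition " + p.partition()
                                    + " leader=" + p.leader()
                                    + " replicas=" + p.replicas()
                                    + " isr=" + p.isr())));
        }
    }
}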