Spark: get very big message from kafka - java

I am trying to receive very large messages with Spark from Kafka.
But it seems that Spark has a limit on the size of the message that can be read.
I have changed the Kafka config to be able to consume and send large messages, but this is not enough (I think this is related to Spark, not to Kafka), because when using the kafka.consumer script I don't have any problem displaying the content of the message.
Maybe this is related to spark.streaming.kafka.consumer.cache.maxCapacity, but I don't know how to set it in a Java-based Spark program.
Thank you.
Update
I am using this to connect to Kafka; normally args[0] is the ZooKeeper address and args[1] is the group ID.
if (args.length < 4) {
    System.err.println("Usage: Stream Car data <zkQuorum> <group> <topics> <numThreads>");
    System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("stream cars data");
final JavaSparkContext jSC = new JavaSparkContext(sparkConf);
// Create the context with a 2-second batch size
JavaStreamingContext jssc = new JavaStreamingContext(jSC, new Duration(2000));

int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic : topics) {
    topicMap.put(topic, numThreads);
}

JavaPairReceiverInputDStream<String, String> messages =
        KafkaUtils.createStream(jssc, args[0], args[1], topicMap);
JavaDStream<String> data = messages.map(Tuple2::_2);
And this is the error that I get:
18/04/13 17:20:33 WARN scheduler.ReceiverTracker: Error reported by receiver for stream 0: Error handling message; exiting - kafka.common.MessageSizeTooLargeException: Found a message larger than the maximum fetch size of this consumer on topic Hello-Kafka partition 0 at fetch offset 3008. Increase the fetch size, or decrease the maximum message size the broker will allow.
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:90)
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33)
at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66)
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58)
at org.apache.spark.streaming.kafka.KafkaReceiver$MessageHandler.run(KafkaInputDStream.scala:133)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Depending on the version of Kafka you are using, you need to set the following consumer config in the consumer.properties file available (or to be created) in the Kafka config files.
For Kafka version 0.8.x or below:
fetch.message.max.bytes
For Kafka version 0.9.0 or above, set:
fetch.max.bytes
to an appropriate value based on your application.
E.g. fetch.max.bytes=10485760
Refer to this and this.
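If you configure the consumer in code rather than through consumer.properties, the same keys can be passed as consumer properties. A minimal sketch, assuming a 10 MB limit (pick the key that matches your Kafka version; the values are only an illustration):
Properties consumerProps = new Properties();
// Kafka 0.8.x and below (old consumer API):
consumerProps.put("fetch.message.max.bytes", "10485760");
// Kafka 0.9.0 and above (new consumer API):
// consumerProps.put("fetch.max.bytes", "10485760");
// consumerProps.put("max.partition.fetch.bytes", "10485760"); // the per-partition fetch limit may also need raising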

So I have found a solution to my problem. In fact, as I said in the comment, the files in the config folder are just examples and they aren't taken into consideration when starting the server. So all of the consumer configuration, including fetch.message.max.bytes, needs to be done in the consumer code.
And this is how I did it:
if (args.length < 4) {
    System.err.println("Usage: Stream Car data <zkQuorum> <group> <topics> <numThreads>");
    System.exit(1);
}
SparkConf sparkConf = new SparkConf().setAppName("stream cars data");
final JavaSparkContext jSC = new JavaSparkContext(sparkConf);
// Create the context with a 2-second batch size
JavaStreamingContext jssc = new JavaStreamingContext(jSC, new Duration(2000));

int numThreads = Integer.parseInt(args[3]);
Map<String, Integer> topicMap = new HashMap<>();
String[] topics = args[2].split(",");
for (String topic : topics) {
    topicMap.put(topic, numThreads);
}

Set<String> topicsSet = new HashSet<>(Arrays.asList(topics));
Map<String, String> kafkaParams = new HashMap<>();
kafkaParams.put("metadata.broker.list", args[0]);
kafkaParams.put("group.id", args[1]);
kafkaParams.put("zookeeper.connect", args[0]);
kafkaParams.put("fetch.message.max.bytes", "1100000000");

JavaPairReceiverInputDStream<String, String> messages = KafkaUtils.createStream(
        jssc,
        String.class,
        String.class,
        StringDecoder.class,
        StringDecoder.class,
        kafkaParams,
        topicMap,
        StorageLevel.MEMORY_ONLY());
JavaDStream<String> data = messages.map(Tuple2::_2);

Related

How to do Kafka-Spark-MongoDb integration efficiently

I am writing a Spark 2.4 transformation for Spark benchmarking which will get JSON streams from a Kafka topic and needs to dump them to MongoDB. I can do it using the Java MongoClient, but the data can be huge, such as 1 million records coming through multiple threads from Kafka. Spark processes it very fast, but the Mongo write is very slow.
SparkConf sparkConf = new SparkConf().setMaster("local[*]")
        .setAppName("JavaDirectKafkaStreaming");
sparkConf.set("spark.streaming.backpressure.enabled", "true");
JavaStreamingContext streamingContext = new JavaStreamingContext(sparkConf, Durations.seconds(2));

Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("group.id", "2");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("enable.auto.commit", false);

Collection<String> topics = Arrays.asList("poc-topic");

final JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(streamingContext,
        LocationStrategies.PreferConsistent(),
        org.apache.spark.streaming.kafka010.ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));

@SuppressWarnings("serial")
JavaPairDStream<String, String> jPairDStream = stream
        .mapToPair(new PairFunction<ConsumerRecord<String, String>, String, String>() {
            public Tuple2<String, String> call(ConsumerRecord<String, String> record) throws Exception {
                return new Tuple2<>(record.key(), record.value());
            }
        });

jPairDStream.foreachRDD(jPairRDD -> {
    jPairRDD.foreach(rdd -> {
        System.out.println("value=" + rdd._2());
        if (rdd._2() != null) {
            System.out.println("inserting=" + rdd._2());
            Document doc = Document.parse(rdd._2());
            // List<Document> list = new ArrayList<>();
            // list.add(doc);
            db.getCollection("collection").insertOne(doc);
            System.out.println("Inserted Data Done");
        } else {
            System.out.println("Got no data in this window");
        }
    });
});

streamingContext.start();
streamingContext.awaitTermination();
Where
MongoClient mongo = new MongoClient("localhost", 27017);
MongoDatabase db = mongo.getDatabase("mongodb");
I expect to speed up the Mongo operation. How do I achieve multithreading for the Mongo writes? (Should I use MongoClientOptions with minConnectionsPerHost?)
Also, is the approach of using the Mongo driver correct, or should it be done with the MongoDB Spark connector or the Spark writeStream() APIs? If so, how do I write each RDD as a separate record in Mongo; is there any example in Java?
I don't know about "efficiently" because there are a lot of factors at play here.
For example, Kafka partitions and total Spark executors are just two values that need to be tuned to accommodate the throughput.
I do see you are using the ForEachWriter, which is a good way to do it, but maybe not the best considering you're constantly calling insertOne, compared to using Spark Structured Streaming to begin with, reading from Kafka, manipulating your data into a Struct object, then using the SparkSQL Mongo connector to dump directly into Mongo collections (which I would guess uses Mongo transactions and inserts multiple records at a time).
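If you stay with the DStream approach, one way to cut the per-record overhead is to write per partition and use insertMany instead of insertOne. This is only a sketch reusing the names from your question (database "mongodb", collection "collection") and an assumed batch size of 1000; each executor then writes its own partitions, which parallelizes the Mongo writes without you managing threads yourself:
jPairDStream.foreachRDD(jPairRDD -> {
    jPairRDD.foreachPartition(records -> {
        // one client per partition, not per record
        try (MongoClient client = new MongoClient("localhost", 27017)) {
            MongoCollection<Document> collection = client.getDatabase("mongodb").getCollection("collection");
            List<Document> batch = new ArrayList<>();
            while (records.hasNext()) {
                Tuple2<String, String> record = records.next();
                if (record._2() != null) {
                    batch.add(Document.parse(record._2()));
                }
                if (batch.size() >= 1000) { // assumed batch size
                    collection.insertMany(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                collection.insertMany(batch);
            }
        }
    });
});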
Also worth mentioning, Landoop offers a MongoDB Kafka Connect sink, which requires one config file and no Spark code to be written.

How to add per topic configuration in Kafka Admin API?

List<NewTopic> newKafkaTopicsList = new ArrayList<>();
NewTopic newTopic = new NewTopic("topicName", getPartitionCount(), getReplicationFactor());
newKafkaTopicsList.add(newTopic);
Below is the adminClient API call to create topics, which accepts a
List<NewTopic>
provided by the Kafka adminClient. NewTopic has the constructor
NewTopic(java.lang.String name, int numPartitions, short replicationFactor)
and a configs method
configs(java.util.Map<java.lang.String,java.lang.String> configs)
Can someone explain how to pass a Map to the configs method?
CreateTopicsResult createTopicsResult = adminClient.createTopics(newKafkaTopicsList);
For example
Map<String, String> configMap = new HashMap<>();
configMap.put("cleanup.policy", "compact");
See Topic configs for more options.
Then call .configs(configMap) on the NewTopic before passing it to createTopics.
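Putting it together, a rough sketch (the topic name "topicName", 3 partitions, and replication factor 1 are placeholders):
Map<String, String> configMap = new HashMap<>();
configMap.put("cleanup.policy", "compact");

NewTopic newTopic = new NewTopic("topicName", 3, (short) 1).configs(configMap);

CreateTopicsResult createTopicsResult = adminClient.createTopics(Collections.singletonList(newTopic));
createTopicsResult.all().get(); // blocks until the topic is created; .get() throws checked exceptions, so handle or declare them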

How to reset Kafka consumer offset when calling consumer many times

I'm trying to reset the consumer offset whenever I call the consumer, so that when I call the consumer many times it can still read the records sent by the producer. I'm setting props.put("auto.offset.reset","earliest"); and calling consumer.seekToBeginning(consumer.assignment());, but when I call the consumer the second time it receives no records. How can I fix this?
public ConsumerRecords<String, byte[]> consumer() {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("group.id", "test");
    //props.put("group.id", String.valueOf(System.currentTimeMillis()));
    props.put("auto.offset.reset", "earliest");
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "1000");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
    KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("topiccc"));
    ConsumerRecords<String, byte[]> records = consumer.poll(100);
    consumer.seekToBeginning(consumer.assignment());
    /* List<byte[]> videoContents = new ArrayList<byte[]>();
    for (ConsumerRecord<String, byte[]> record : records) {
        System.out.printf("offset = %d, key = %s, value = %s\n", record.offset(), record.key(), record.value());
        videoContents.add(record.value());
    } */
    return records;
}
public String producer(@RequestParam("message") String message) {
    Map<String, Object> props = new HashMap<>();
    // list of host:port pairs used for establishing the initial connections to the Kafka cluster
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
    Producer<String, byte[]> producer = new KafkaProducer<>(props);
    Path path = Paths.get("C:/Programming Files/video-2012-07-05-02-29-27.mp4");
    ProducerRecord<String, byte[]> record = null;
    try {
        record = new ProducerRecord<>("topiccc", "keyyyyy", Files.readAllBytes(path));
    } catch (IOException e) {
        e.printStackTrace();
    }
    producer.send(record);
    producer.close();
    //kafkaSender.send(record);
    return "Message sent to the Kafka Topic java_in_use_topic Successfully";
}
From the Kafka Java Code, the documentation on AUTO_OFFSET_RESET_CONFIG says the following:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
This can be found here in GitHub:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/consumer/ConsumerConfig.java
We can see from their comment that the setting is only used when the offset is not on the server. In the question, the offset is retrieved from the server and that's why the offset is not reset to the beginning but rather stays at the last offset, making it appear that there are no more records.
You would need to explicitly reset the offset on the server side to fix this as requested in the question.
Here is another answer that describes how that could be done.
https://stackoverflow.com/a/54492802/231860
This is a snippet of code that allowed me to reset the offset. NOTE: You can't call seekToBeginning if you call the subscribe method. I could only get it to work if I assign the partitions myself using the assign method. Pity.
// Create the consumer:
final Consumer<String, DataRecord> consumer = new KafkaConsumer<>(props);
// Get the partitions that exist for this topic:
List<PartitionInfo> partitions = consumer.partitionsFor(topic);
// Get the topic partition info for these partitions:
List<TopicPartition> topicPartitions = partitions.stream().map(info -> new TopicPartition(info.topic(), info.partition())).collect(Collectors.toList());
// Assign all the partitions to the topic so that we can seek to the beginning:
// NOTE: We can't use subscribe if we use assign, but we can't seek to the beginning if we use subscribe.
consumer.assign(topicPartitions);
// Make sure we seek to the beginning of the partitions:
consumer.seekToBeginning(topicPartitions);
Yes, it seems extremely complicated to achieve a seemingly rudimentary use case. This might indicate that the whole Kafka world just wants streams to be read once.
I usually create a new consumer with a different group.id to read the records again.
So do it like this:
props.put("group.id", String.valueOf(Instant.now().getEpochSecond()));
There is a workaround for this (not a production solution, though) which is to change the group.id configuration value each time you consume. Setting auto.offset.reset to earliest is not enough in many cases.
When you want one message to be consumed by consumers multiple times, the ideal way is to create the consumers with different consumer groups, so the same message can be consumed by each of them.
But if you want the same consumer to consume the same message multiple times, then you can play with commits and offsets.
You can set auto.commit.interval.ms very high, or disable auto-commit and commit as per your own logic.
You can refer to this for more details: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
This Javadoc provides details on how to manually manage offsets.
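As a rough sketch of that manual approach (the topic and broker address are taken from the question), you disable auto-commit and decide yourself when, or whether, to commit:
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "false"); // take over commit control
props.put("auto.offset.reset", "earliest");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("topiccc"));

ConsumerRecords<String, byte[]> records = consumer.poll(100);
// process records ...

// Commit only if you do NOT want these records to be read again by this group;
// skip the commit (or seek back) if you want to re-read them later.
consumer.commitSync();
If nothing is ever committed for the group, a restarted consumer with auto.offset.reset=earliest falls back to the earliest offset and reads everything again.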

Example to read from KafkaSpout and write same back to Kafka Bolt

I am trying to write Storm-based code which reads messages from one topic and writes them back to another topic. The input topic has data in ProtoBuf format and the output will have JSON format. I am not able to achieve it.
This is the code which builds the topology:
Config conf = new Config();
// set producer properties.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9093");
props.put("request.required.acks", "1");
props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
conf.put("kafka.broker.config", props);
conf.put(KafkaBolt.TOPIC, "out-storm");

KafkaBolt bolt = new KafkaBolt()
        .withProducerProperties(props)
        .withTopicSelector(new DefaultTopicSelector("out-storm"))
        .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<String, String>());

BrokerHosts hosts = new ZkHosts("localhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "incoming-server", "/" + "incoming-server",
        UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("kafka-spout", kafkaSpout);
builder.setBolt("lookup-bolt", new ReportBolt(), 4).shuffleGrouping("kafka-spout");
builder.setBolt("kafka-producer-spout", bolt).shuffleGrouping("lookup-bolt");

LocalCluster cluster = new LocalCluster();
Config config = new Config();
config.setDebug(true);
config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 1);
config.put("kafka.broker.config", props);
config.put(KafkaBolt.TOPIC, "out-storm");
cluster.submitTopology("KafkaStormSample", config, builder.createTopology());
Thread.sleep(1000000);
In the report bolt I have done this:
System.out.println("HELLO " + input);
JSONObject jo = new JSONObject();
for (String f : input.getFields()) {
    jo.put(f, input.getValueByField(f));
}
collector.ack(input);
List<Object> list = new ArrayList<Object>();
list.add(jo);
collector.emit(list);
When I start it, I get this error:
5207 [main] WARN o.a.s.d.nimbus - Topology submission exception. (topology name='KafkaStormSample') #error {
:cause nil
:via
[{:type org.apache.storm.generated.InvalidTopologyException
:message nil
:at [org.apache.storm.daemon.common$validate_structure_BANG_ invoke common.clj 181]}]
:trace
[[org.apache.storm.daemon.common$validate_structure_BANG_ invoke common.clj 181]
[org.apache.storm.daemon.common$system_topology_BANG_ invoke common.clj 360]
[org.apache.storm.daemon.nimbus$fn__7064$exec_fn__2461__auto__$reify__7093 submitTopologyWithOpts nimbus.clj 1512]
[org.apache.storm.daemon.nimbus$fn__7064$exec_fn__2461__auto__$reify__7093 submitTopology nimbus.clj 1544]
[sun.reflect.NativeMethodAccessorImpl invoke0 NativeMethodAccessorImpl.java -2]
[sun.reflect.NativeMethodAccessorImpl invoke NativeMethodAccessorImpl.java 62]
[sun.reflect.DelegatingMethodAccessorImpl invoke DelegatingMethodAccessorImpl.java 43]
[java.lang.reflect.Method invoke Method.java 497]
[clojure.lang.Reflector invokeMatchingMethod Reflector.java 93]
[clojure.lang.Reflector invokeInstanceMethod Reflector.java 28]
[org.apache.storm.testing$submit_local_topology invoke testing.clj 301]
[org.apache.storm.LocalCluster$_submitTopology invoke LocalCluster.clj 49]
[org.apache.storm.LocalCluster submitTopology nil -1]
[com.mediaiq.StartStorm main StartStorm.java 81]]}
I think the problem is that you are referencing the wrong port in your bootstrap.servers config. Try changing it to 9092.

Spark checkpointing error when joining static dataset with DStream

I am trying to use a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop
directory using textFileStream() at an interval of 1 minute.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After the aggregation, I am joining the aggregated DStream<Key, Value1>
with an RDD<Key, Value2> created from a static dataset read by textFile() from the Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory, it runs fine. After running 2-3 batches I close it using ctrl+c and run it again.
On the second run it immediately throws a Spark exception: "SPARK-5063"
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
Following is the block of code of the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
    JavaRDD<String> distFile = sc.textFile(MasterFile);
    JavaDStream<String> file = ssc.textFileStream(inputDir);

    // Read Master file
    JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
    final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);

    // Continuous Streaming file
    JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);

    // calculate the sum of required field and generate group sum RDD
    JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
    JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);

    // GROUP BY Operation
    JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);

    // Join Master RDD with the DStream //This is the block causing error (without it code is working fine)
    JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
            new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
                private static final long serialVersionUID = 1L;

                public JavaPairRDD<String, Tuple2<String, String>> call(
                        JavaPairRDD<String, String> rdd, Time v2) throws Exception {
                    return masterRDD.value().join(rdd);
                }
            }
    );
    joinedStream.print(10);
}
public static void main(String[] args) {
    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
        public JavaStreamingContext create() {
            // Create the context with a 60 second batch size
            SparkConf sparkConf = new SparkConf();
            final JavaSparkContext sc = new JavaSparkContext(sparkConf);
            JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
            app.compute(sc, ssc1);
            ssc1.checkpoint(checkPointDir);
            return ssc1;
        }
    };
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);

    // start the streaming server
    ssc.start();
    logger.info("Streaming server started...");

    // wait for the computations to finish
    ssc.awaitTermination();
    logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it was taken from the Spark Streaming
page of the Apache Spark website (sub-heading "stream-dataset join" under "Join Operations"). Please help me get it working, even if
there is a different way of doing it. I need to enable checkpointing in my streaming application.
Environment details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1
