KafkaUtils.createDirectStream does not take the right parameters - Spark Streaming + Kafka - Java

I have an application which sends serialized Twitter data to a Kafka topic. All good so far.
The consumer application should read the data and deserialize it. Now, when I call KafkaUtils.createDirectStream, I think I am passing the right parameters (as you will see in the error below), so I can't understand why it is not working.
The method createDirectStream(JavaStreamingContext, Class<K>,
Class<V>, Class<KD>, Class<VD>, Map<String,String>, Set<String>) in
the type KafkaUtils is not applicable for the arguments
(JavaStreamingContext, Class<String>, Class<Status>,
Class<StringDeserializer>, Class<StatusDeserializer>,
Map<String,String>, Set<String>)
Checking the Spark Javadoc, my params still seem right to me.
My code is:
Set<String> topics = new HashSet<>();
topics.add("twitter-test");
JavaStreamingContext jssc = new JavaStreamingContext(jsc, new Duration(duration));
Map<String, String> props = new HashMap<>();
//some properties...
JavaPairInputDStream<String, Status> messages = KafkaUtils.createDirectStream(
        jssc, String.class, Status.class,
        org.apache.kafka.common.serialization.StringDeserializer.class,
        stream_data.StatusDeserializer.class, props, topics);
Status serializer code:
public class StatusSerializer implements Serializer<Status> {

    @Override
    public byte[] serialize(String s, Status o) {
        try {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(baos);
            oos.writeObject(o);
            oos.close();
            byte[] b = baos.toByteArray();
            return b;
        } catch (IOException e) {
            return new byte[0];
        }
    }

    @Override
    public void close() {
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
    }
}

Looks like the issue is with "stream_data.StatusDeserializer.class". Can you please share the code of this custom deserializer class? Also, can you please look at this Kafka consumer for Spark written in Scala for Kafka API 0.10: custom AVRO deserializer.
Include the below in the KafkaParams arguments:
key.deserializer -> classOf[StringDeserializer]
value.deserializer -> classOf[StatusDeserializer]
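For reference, a StatusDeserializer that mirrors the StatusSerializer above could look like the sketch below. This is an assumption based on plain Java serialization and a twitter4j Status payload; the actual stream_data.StatusDeserializer was not posted.

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;
import twitter4j.Status;

public class StatusDeserializer implements Deserializer<Status> {

    @Override
    public Status deserialize(String topic, byte[] data) {
        if (data == null || data.length == 0) {
            return null;
        }
        // Reverse of StatusSerializer: read the Java-serialized Status object back.
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Status) ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            return null;
        }
    }

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
    }

    @Override
    public void close() {
    }
}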

Related

Apache Flink Kafka Stream Processing based on conditions

I am building a wrapper library using Apache Flink in which I am listening to (consuming from) multiple topics, and I have a set of applications that want to process the messages from those topics.
Example:
I have 10 applications app1, app2, app3 ... app10 (each of them is a Java library that is part of the same on-prem project, i.e., all 10 jars are part of the same .war file),
out of which only 5 are supposed to consume the messages coming to the consumer group. I am able to do the filtering for those 5 apps with the help of a filter function.
The challenge is in the strStream.process(executionServiceInterface) call, where app1 provides an implementation class for ExecutionServiceInterface as ExecutionServiceApp1Impl and similarly app2 provides ExecutionServiceApp2Impl.
When there are multiple implementations available, Spring wants us to provide the @Qualifier annotation, or one of the implementations (ExecutionServiceApp1Impl, ExecutionServiceApp2Impl) has to be marked as @Primary.
But I don't really want to do this, as I am building a generic wrapper library that should support any number of such applications (app1, app2, etc.), and all of them should be able to provide their own implementation logic (ExecutionServiceApp1Impl, ExecutionServiceApp2Impl).
Can someone help me here? How do I solve this?
Below is the code for reference.
@Autowired
private ExecutionServiceInterface executionServiceInterface;

public void init() {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    FlinkKafkaConsumer011<String> consumer = createStringConsumer(topicList, kafkaAddress, kafkaGroup);
    if (consumer != null) {
        DataStream<String> strStream = environment.addSource(consumer);
        strStream.filter(filterFunctionInterface).process(executionServiceInterface);
    }
}
public FlinkKafkaConsumer011<String> createStringConsumer(List<String> listOfTopics, String kafkaAddress, String kafkaGroup) throws Exception {
    FlinkKafkaConsumer011<String> myConsumer = null;
    try {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", kafkaAddress);
        props.setProperty("group.id", kafkaGroup);
        myConsumer = new FlinkKafkaConsumer011<>(listOfTopics, new SimpleStringSchema(), props);
    } catch (Exception e) {
        throw e;
    }
    return myConsumer;
}
Many thanks in advance!!
I solved this problem by using reflection; below is the code that solved the issue.
Note: this requires me to know the list of fully qualified class names and method names along with their parameters.
@Component
public class SampleJobExecutor extends ProcessFunction<String, String> {

    @Autowired
    MyAppProperties myAppProperties;

    @Override
    public void processElement(String inputMessage, ProcessFunction<String, String>.Context context,
            Collector<String> collector) throws Exception {
        String className = null;
        String methodName = null;
        try {
            Map<String, List<String>> map = myAppProperties.getMapOfImplementors();
            JSONObject json = new JSONObject(inputMessage);
            if (json != null && json.has("appName")) {
                className = map.get(json.getString("appName")).get(0);
                methodName = map.get(json.getString("appName")).get(1);
            }
            Class<?> forName = Class.forName(className);
            Object job = forName.newInstance();
            Method method = forName.getDeclaredMethod(methodName, String.class);
            method.invoke(job, inputMessage);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
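For context, the reflection approach assumes each message carries an appName and that the application properties map that name to a fully qualified class name and a method name. A hypothetical example of that mapping and message shape (names are illustrative, not from the original post):

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapOfImplementorsExample {

    // Shape assumed for MyAppProperties#getMapOfImplementors():
    // appName -> [fully qualified class name, method name]
    static Map<String, List<String>> exampleMapping() {
        Map<String, List<String>> map = new HashMap<>();
        map.put("app1", Arrays.asList("com.example.app1.ExecutionServiceApp1Impl", "execute"));
        map.put("app2", Arrays.asList("com.example.app2.ExecutionServiceApp2Impl", "execute"));
        return map;
    }

    // A matching input message would then look like:
    // {"appName": "app1", "payload": "..."}
}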

How to skip an Avro serialization exception in KafkaStreams API?

I have a Kafka application written with the Kafka Streams Java API. It reads data from a MySQL binlog and does some things that are irrelevant to my question. The problem is that one particular row produces an error during deserialization from Avro. I can dig into the Avro schema file and find the problem, but what I really need is a forgiving exception handler that, upon encountering such an error, does not bring the whole application to a halt.
This is the main part of my stream app:
StreamsBuilder streamsBuilder = watchForCourierUpdate(builder);

KafkaStreams kafkaStreams = new KafkaStreams(streamsBuilder.build(), properties);
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(kafkaStreams::close));
}

private static StreamsBuilder watchForCourierUpdate(StreamsBuilder builder) {
    CourierUpdateListener courierUpdateListener = new CourierUpdateListener(builder);
    courierUpdateListener.start();
    return builder;
}

private static Properties configProperties() {
    Properties streamProperties = new Properties();
    streamProperties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, Configs.getConfig("schemaRegistryUrl"));
    streamProperties.put(StreamsConfig.APPLICATION_ID_CONFIG, "courier_app");
    streamProperties.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, Configs.getConfig("bootstrapServerUrl"));
    streamProperties.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 1000);
    streamProperties.put(StreamsConfig.STATE_DIR_CONFIG, "/tmp/state_dir");
    streamProperties.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "3");
    streamProperties.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
    streamProperties.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, GenericAvroSerde.class);
    streamProperties.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");
    streamProperties.put(StreamsConfig.DEFAULT_PRODUCTION_EXCEPTION_HANDLER_CLASS_CONFIG,
            CourierSerializationException.class);
    return streamProperties;
}
This is my CourierSerializationException class:
public class CourierSerializationException implements ProductionExceptionHandler {

    @Override
    public ProductionExceptionHandlerResponse handle(ProducerRecord<byte[], byte[]> producerRecord, Exception e) {
        Logger.logError("Failed to de/serialize entity from " + producerRecord.topic() + " topic.\n" + e);
        return ProductionExceptionHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(Map<String, ?> map) {
    }
}
Still, whenever an Avro deserialization exception happens, the stream shuts down and the application does not continue. Am I missing something?
Have you tried to do this with the default.deserialization.exception.handler provided by Kafka? You can use LogAndContinueExceptionHandler, which will log and continue.
I may be wrong, but I think creating a custom handler by implementing ProductionExceptionHandler only works for network-related errors on the Kafka side.
Add this to the properties and see what happens:
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, LogAndContinueExceptionHandler.class);
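If logging and continuing is not enough, you can also plug in your own handler by implementing the DeserializationExceptionHandler interface, in the same spirit as the CourierSerializationException class above. This is a minimal sketch; the class name and logging are illustrative:

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

public class CourierDeserializationExceptionHandler implements DeserializationExceptionHandler {

    @Override
    public DeserializationHandlerResponse handle(ProcessorContext context,
                                                 ConsumerRecord<byte[], byte[]> record,
                                                 Exception exception) {
        // Log the poison record and keep the stream running.
        System.err.println("Failed to deserialize record from topic " + record.topic()
                + " (partition " + record.partition() + ", offset " + record.offset() + "): " + exception);
        return DeserializationHandlerResponse.CONTINUE;
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }
}

It would be registered via the same property: streamProperties.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG, CourierDeserializationExceptionHandler.class);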

How to get the processing kafka topic name dynamically in Flink Kafka Consumer?

Currently, I have a Flink cluster that wants to consume Kafka topics by a pattern; this way, we don't need to maintain a hard-coded Kafka topic list.
import java.util.regex.Pattern;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
...
private static final Pattern topicPattern = Pattern.compile("DC_TEST_([A-Z0-9_]+)");
...
FlinkKafkaConsumer010<KafkaMessage> kafkaConsumer = new FlinkKafkaConsumer010<>(
topicPattern, deserializerClazz.newInstance(), kafkaConsumerProps);
DataStream<KafkaMessage> input = env.addSource(kafkaConsumer);
I just want to know: using the above approach, how can I get the real Kafka topic name during processing?
Thanks.
--Update--
The reason I need to know the topic information is that we need the topic name as a parameter in the upcoming Flink sink part.
You can implement your own custom KafkaDeserializationSchema, like this:
public class CustomKafkaDeserializationSchema implements KafkaDeserializationSchema<Tuple2<String, String>> {

    @Override
    public boolean isEndOfStream(Tuple2<String, String> nextElement) {
        return false;
    }

    @Override
    public Tuple2<String, String> deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        return new Tuple2<>(record.topic(), new String(record.value(), "UTF-8"));
    }

    @Override
    public TypeInformation<Tuple2<String, String>> getProducedType() {
        return new TupleTypeInfo<>(BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
    }
}
With the custom KafkaDeserializationSchema, you can create a DataStream whose elements contain topic info. In my demo the element type is Tuple2<String, String>, so you can access the topic name via Tuple2#f0.
FlinkKafkaConsumer010<Tuple2<String, String>> kafkaConsumer = new FlinkKafkaConsumer010<>(
        topicPattern, new CustomKafkaDeserializationSchema(), kafkaConsumerProps);
DataStream<Tuple2<String, String>> input = env.addSource(kafkaConsumer);

input.process(new ProcessFunction<Tuple2<String, String>, String>() {
    @Override
    public void processElement(Tuple2<String, String> value, Context ctx, Collector<String> out) throws Exception {
        String topicName = value.f0;
        // your processing logic here.
        out.collect(value.f1);
    }
});
There are two ways to do that.
Option 1:
You can use the kafka-clients library to access the Kafka metadata and get the topic list. Add the Maven dependency or equivalent.
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.3.0</version>
</dependency>
You can fetch the topics from the Kafka cluster and filter them using the regex, as given below:
private static final Pattern topicPattern = Pattern.compile("DC_TEST_([A-Z0-9_]+)");

Properties properties = new Properties();
properties.put("bootstrap.servers", "localhost:9092");
properties.put("client.id", "java-admin-client");
try (AdminClient client = AdminClient.create(properties)) {
    ListTopicsOptions options = new ListTopicsOptions();
    options.listInternal(false);
    Collection<TopicListing> listings = client.listTopics(options).listings().get();
    List<String> allTopicsList = listings.stream().map(TopicListing::name)
            .collect(Collectors.toList());
    List<String> matchedTopics = allTopicsList.stream()
            .filter(topicPattern.asPredicate())
            .collect(Collectors.toList());
} catch (Exception e) {
    e.printStackTrace();
}
Once you have the matchedTopics list, you can pass it to FlinkKafkaConsumer.
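A small sketch of that step (assuming the env, SimpleStringSchema, and kafkaConsumerProps variables from the surrounding snippets):

// Pass the dynamically discovered topics to the Flink Kafka consumer.
FlinkKafkaConsumer010<String> consumer = new FlinkKafkaConsumer010<>(
        matchedTopics, new SimpleStringSchema(), kafkaConsumerProps);
DataStream<String> stream = env.addSource(consumer);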
Option 2:
FlinkKafkaConsumer011 in Flink release 1.8 supports topic and partition discovery dynamically based on a pattern. Below is an example:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
private static final Pattern topicPattern = Pattern.compile("DC_TEST_([A-Z0-9_]+)");
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");
FlinkKafkaConsumer011<String> myConsumer = new FlinkKafkaConsumer011<>(
        topicPattern,
        new SimpleStringSchema(),
        properties);
Link : https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-consumers-topic-and-partition-discovery
In your case, option 2 suits best.
Since you want to access the topic metadata as part of KafkaMessage, you need to implement the KafkaDeserializationSchema interface, as given below:
public class CustomKafkaDeserializationSchema implements KafkaDeserializationSchema<KafkaMessage> {

    /**
     * Deserializes the byte message.
     *
     * @param record the consumer record, which gives access to the key, value, topic, partition and offset.
     *
     * @return The deserialized message as an object (null if the message cannot be deserialized).
     */
    @Override
    public KafkaMessage deserialize(ConsumerRecord<byte[], byte[]> record) throws IOException {
        // You can access record.key(), record.value(), record.topic(), record.partition(), record.offset() to get topic information.
        KafkaMessage kafkaMessage = new KafkaMessage();
        kafkaMessage.setTopic(record.topic());
        // Build your Kafka message here and assign the values as above.
        return kafkaMessage;
    }

    @Override
    public boolean isEndOfStream(KafkaMessage nextElement) {
        return false;
    }

    @Override
    public TypeInformation<KafkaMessage> getProducedType() {
        // Required by KafkaDeserializationSchema so Flink knows the produced type.
        return TypeInformation.of(KafkaMessage.class);
    }
}
And then call:
FlinkKafkaConsumer010<KafkaMessage> kafkaConsumer = new FlinkKafkaConsumer010<>(
        topicPattern, new CustomKafkaDeserializationSchema(), kafkaConsumerProps);
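For completeness, the KafkaMessage type used above is assumed to be a simple POJO that carries the topic name alongside the payload, so the sink can use it as a parameter. A minimal sketch (field names are illustrative):

public class KafkaMessage {

    // Topic the record came from, set in CustomKafkaDeserializationSchema#deserialize.
    private String topic;
    private String payload;

    public String getTopic() {
        return topic;
    }

    public void setTopic(String topic) {
        this.topic = topic;
    }

    public String getPayload() {
        return payload;
    }

    public void setPayload(String payload) {
        this.payload = payload;
    }
}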

Consumer unit test with poll() never receives anything

Consider the following code:
@Test(singleThreaded = true)
public class KafkaConsumerTest
{
    private KafkaTemplate<String, byte[]> template;
    private DefaultKafkaConsumerFactory<String, byte[]> consumerFactory;
    private static final KafkaEmbedded EMBEDDED_KAFKA;

    static {
        EMBEDDED_KAFKA = new KafkaEmbedded(1, true, "topic");
        try { EMBEDDED_KAFKA.before(); } catch (final Exception e) { e.printStackTrace(); }
    }

    @BeforeMethod
    public void setUp() throws Exception {
        final Map<String, Object> senderProps = KafkaTestUtils.senderProps(EMBEDDED_KAFKA.getBrokersAsString());
        senderProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        senderProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        final ProducerFactory<String, byte[]> pf = new DefaultKafkaProducerFactory<>(senderProps);
        this.template = new KafkaTemplate<>(pf);
        this.template.setDefaultTopic("topic");

        final Map<String, Object> consumerProps = KafkaTestUtils.consumerProps("sender", "false", EMBEDDED_KAFKA);
        this.consumerFactory = new DefaultKafkaConsumerFactory<>(consumerProps);
        this.consumerFactory.setValueDeserializer(new ByteArrayDeserializer());
        this.consumerFactory.setKeyDeserializer(new StringDeserializer());
    }

    @Test
    public void testSendToKafka() throws InterruptedException, ExecutionException, TimeoutException {
        final String message = "42";
        final Message<byte[]> msg = MessageBuilder.withPayload(message.getBytes(StandardCharsets.UTF_8)).setHeader(KafkaHeaders.TOPIC, "topic").build();
        this.template.send(msg).get(10, TimeUnit.SECONDS);

        final Consumer<String, byte[]> consumer = this.consumerFactory.createConsumer();
        consumer.subscribe(Collections.singleton("topic"));
        final ConsumerRecords<String, byte[]> records = consumer.poll(10000);
        Assert.assertTrue(records.count() > 0);
        Assert.assertEquals(new String(records.iterator().next().value(), StandardCharsets.UTF_8), message);
        consumer.commitSync();
    }
}
I am trying to send a message via a KafkaTemplate and read it again using Consumer.poll(). The test framework I am using is TestNG.
Sending works; I have verified that using the "usual" code I found on the net (register a message listener on a KafkaMessageListenerContainer).
However, I never receive anything in the consumer. I have tried the same sequence (create Consumer, poll()) against a "real" Kafka installation, and it works.
Hence it looks like there is something wrong with the way I set up my ConsumerFactory? Any help would be greatly appreciated!
You need to use
EMBEDDED_KAFKA.consumeFromAnEmbeddedTopic(consumer, "topic");
before publishing records via the KafkaTemplate.
Then, at the end of the test, you need something like this for verification:
ConsumerRecord<String, String> record = KafkaTestUtils.getSingleRecord(consumer, "topic");
You can also do it the way you do now; the only thing you are missing is setting ConsumerConfig.AUTO_OFFSET_RESET_CONFIG to earliest, because the default is latest. With latest, a consumer that subscribes to the topic after the records were published won't see them.
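For example, the auto.offset.reset suggestion amounts to one extra line when building the consumer properties in setUp() (a small sketch against the consumerProps map from the test above):

// Read from the beginning of the topic, so records published before the
// consumer subscribed are still visible to poll().
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
this.consumerFactory = new DefaultKafkaConsumerFactory<>(consumerProps);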

Apache Kafka Avro Deserialization: Unable to deserialize or decode Specific type message.

I am trying to use Avro serialization with Apache Kafka to serialize/deserialize messages. I have created one producer, which serializes a specific-type message and sends it to the queue. When the message is sent successfully to the queue, our consumer picks it up and tries to process it, but while doing so we face an exception when casting the bytes to the specific object. The exception is as below:
[error] (run-main-0) java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to com.harmeetsingh13.java.avroserializer.Customer
java.lang.ClassCastException: org.apache.avro.generic.GenericData$Record cannot be cast to com.harmeetsingh13.java.avroserializer.Customer
at com.harmeetsingh13.java.consumers.avrodesrializer.AvroSpecificDeserializer.lambda$infiniteConsumer$0(AvroSpecificDeserializer.java:51)
at java.lang.Iterable.forEach(Iterable.java:75)
at com.harmeetsingh13.java.consumers.avrodesrializer.AvroSpecificDeserializer.infiniteConsumer(AvroSpecificDeserializer.java:46)
at com.harmeetsingh13.java.consumers.avrodesrializer.AvroSpecificDeserializer.main(AvroSpecificDeserializer.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
According to the exception, we are reading the data the wrong way. Below is our code:
Kafka Producer Code:
static {
    kafkaProps.put("bootstrap.servers", "localhost:9092");
    kafkaProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    kafkaProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
    kafkaProps.put("schema.registry.url", "http://localhost:8081");
    kafkaProducer = new KafkaProducer<>(kafkaProps);
}

public static void main(String[] args) throws InterruptedException, IOException {
    Customer customer1 = new Customer(1002, "Jimmy");
    Parser parser = new Parser();
    Schema schema = parser.parse(AvroSpecificProducer.class
            .getClassLoader().getResourceAsStream("avro/customer.avsc"));

    SpecificDatumWriter<Customer> writer = new SpecificDatumWriter<>(schema);
    try (ByteArrayOutputStream os = new ByteArrayOutputStream()) {
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(os, null);
        writer.write(customer1, encoder);
        encoder.flush();

        byte[] avroBytes = os.toByteArray();
        ProducerRecord<String, byte[]> record1 = new ProducerRecord<>("CustomerSpecificCountry",
                "Customer One 11 ", avroBytes);
        asyncSend(record1);
    }
    Thread.sleep(10000);
}
Kafka Consumer Code:
static {
    kafkaProps.put("bootstrap.servers", "localhost:9092");
    kafkaProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
    kafkaProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
    kafkaProps.put(ConsumerConfig.GROUP_ID_CONFIG, "CustomerCountryGroup1");
    kafkaProps.put("schema.registry.url", "http://localhost:8081");
}

public static void infiniteConsumer() throws IOException {
    try (KafkaConsumer<String, byte[]> kafkaConsumer = new KafkaConsumer<>(kafkaProps)) {
        kafkaConsumer.subscribe(Arrays.asList("CustomerSpecificCountry"));

        while (true) {
            ConsumerRecords<String, byte[]> records = kafkaConsumer.poll(100);
            System.out.println("<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<" + records.count());

            Schema.Parser parser = new Schema.Parser();
            Schema schema = parser.parse(AvroSpecificDeserializer.class
                    .getClassLoader().getResourceAsStream("avro/customer.avsc"));

            records.forEach(record -> {
                DatumReader<Customer> customerDatumReader = new SpecificDatumReader<>(schema);
                BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(record.value(), null);
                try {
                    System.out.println(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>");
                    Customer customer = customerDatumReader.read(null, binaryDecoder);
                    System.out.println(customer);
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
        }
    }
}
Using the console consumer, we are successfully able to receive the messages. So what is the way to decode the message into our POJO class?
The solution to this problem is to use
DatumReader<GenericRecord> customerDatumReader = new SpecificDatumReader<>(schema);
instead of
DatumReader<Customer> customerDatumReader = new SpecificDatumReader<>(schema);
The exact reason for this is still not found. It may be because Kafka doesn't know about the structure of the message; we explicitly define the schema for the message, and GenericRecord is useful for converting any message into a readable JSON format according to that schema. After creating the JSON, we can easily convert it into our POJO class.
But we still need to find a solution for converting directly into our POJO class.
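As one possible way to get the Customer POJO directly (a sketch, not from the original answers, assuming Customer is an Avro-generated SpecificRecord class), you can give SpecificDatumReader both the writer schema from customer.avsc and the reader schema of the generated class:

// Read the Avro bytes straight into the generated Customer class by supplying
// the writer schema (customer.avsc) and the reader schema of the generated class.
DatumReader<Customer> customerDatumReader =
        new SpecificDatumReader<>(schema, Customer.getClassSchema());
BinaryDecoder binaryDecoder = DecoderFactory.get().binaryDecoder(record.value(), null);
Customer customer = customerDatumReader.read(null, binaryDecoder);
System.out.println(customer);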
You don't need to do the Avro serialization explicitly before passing the values to the ProducerRecord. The serializer will do it for you. Your code would look like:
Customer customer1 = new Customer(1002, "Jimmy");
ProducerRecord<String, Customer> record1 = new ProducerRecord<>("CustomerSpecificCountry", customer1);
asyncSend(record1);
See an example from Confluent for a simple producer using Avro.
