Kafka Connect starts, but nothing happens - Java

I am writing a Kafka source connector based on a working producer that I use for audio files. The connector starts, but nothing happens: no errors, no data. I am not sure whether this is a coding problem or a configuration problem.
The connector should read an entire directory and send each file as a byte array.
Config class:
package hothman.example;
import org.apache.kafka.common.config.AbstractConfig;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Type;
import org.apache.kafka.common.config.ConfigDef.Importance;
import java.util.Map;
public class AudioSourceConnectorConfig extends AbstractConfig {
public static final String FILENAME_CONFIG="fileName";
private static final String FILENAME_DOC ="Enter the path of the audio files";
public static final String TOPIC_CONFIG = "topic";
private static final String TOPIC_DOC = "Enter the topic to write to..";
public AudioSourceConnectorConfig(ConfigDef config, Map<String, String> parsedConfig) {
super(config, parsedConfig);
}
public AudioSourceConnectorConfig(Map<String, String> parsedConfig) {
this(conf(), parsedConfig);
}
public static ConfigDef conf() {
return new ConfigDef()
.define(FILENAME_CONFIG, Type.STRING, Importance.HIGH, FILENAME_DOC)
.define(TOPIC_CONFIG, Type.STRING, Importance.HIGH, TOPIC_DOC);
}
public String getFilenameConfig(){
return this.getString("fileName");
}
public String getTopicConfig(){
return this.getString("topic");
}
}
SourceConnector class
package hothman.example;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceConnector;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class AudioSourceConnector extends SourceConnector {
/*
Your connector should never use System.out for logging. All of your classes should use slf4j
for logging
*/
private static Logger log = LoggerFactory.getLogger(AudioSourceConnector.class);
private AudioSourceConnectorConfig config;
private String filename;
private String topic;
@Override
public String version() {
return VersionUtil.getVersion();
}
@Override
public void start(Map<String, String> props) {
filename = config.getFilenameConfig();
topic = config.getTopicConfig();
if (topic == null || topic.isEmpty())
throw new ConnectException("AudiSourceConnector configuration must include 'topic' setting");
if (topic.contains(","))
throw new ConnectException("AudioSourceConnector should only have a single topic when used as a source.");
}
@Override
public Class<? extends Task> taskClass() {
//TODO: Return your task implementation.
return AudioSourceTask.class;
}
@Override
public List<Map<String, String>> taskConfigs(int maxTasks) {
ArrayList<Map<String, String>> configsList = new ArrayList<>();
// Only one input stream makes sense.
Map<String, String> configs = new HashMap<>();
if (filename != null)
configs.put(config.getFilenameConfig(), filename);
configs.put(config.getTopicConfig(), topic);
configsList.add(configs);
return configsList;
}
@Override
public void stop() {
}
@Override
public ConfigDef config() {
return AudioSourceConnectorConfig.conf();
}
}
SourceTask class
package hothman.example;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.ConnectException;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import static com.sun.nio.file.ExtendedWatchEventModifier.FILE_TREE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
public class AudioSourceTask extends SourceTask {
/*
Your connector should never use System.out for logging. All of your classes should use slf4j
for logging
*/
static final Logger log = LoggerFactory.getLogger(AudioSourceTask.class);
private AudioSourceConnectorConfig config;
public static final String POSITION_FIELD = "position";
private static final Schema VALUE_SCHEMA = Schema.BYTES_SCHEMA;
private String filename;
private String topic = null;
private int offset = 0;
private FileSystem fs = FileSystems.getDefault();
private WatchService ws = fs.newWatchService();
private Path dir;
private File directoryPath;
private ArrayList<File> listOfFiles;
private byte[] temp = null;
public AudioSourceTask() throws IOException {
}
@Override
public String version() {
return VersionUtil.getVersion();
}
@Override
public void start(Map<String, String> props) {
filename = config.getFilenameConfig();
topic = config.getTopicConfig();
if (topic == null)
throw new ConnectException("AudioSourceTask config missing topic setting");
dir = Paths.get(filename);
try {
dir.register(ws, new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE}, FILE_TREE);
} catch (IOException e) {
e.printStackTrace();
}
directoryPath = new File(String.valueOf(dir));
}
@Override
public List<SourceRecord> poll() throws InterruptedException {
//TODO: Create SourceRecord objects that will be sent to the Kafka cluster.
listOfFiles = new ArrayList<File>(Arrays.asList(directoryPath.listFiles()));
Map<String, Object> offset = context.offsetStorageReader().
offset(Collections.singletonMap(config.getFilenameConfig(), filename));
ArrayList<SourceRecord> records = new ArrayList<>(1);
try {
for (File file : listOfFiles) {
// send existing files first
temp = Files.readAllBytes(Paths.get(file.toString()));
records.add(new SourceRecord(null,
null, topic, Schema.BYTES_SCHEMA, temp));
}
return records;
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
@Override
public void stop() {
//TODO: Do whatever is required to stop your task.
}
}
VersionUtil class
package hothman.example;
/**
* Created by jeremy on 5/3/16.
*/
class VersionUtil {
public static String getVersion() {
try {
return VersionUtil.class.getPackage().getImplementationVersion();
} catch(Exception ex){
return "0.0.0.0";
}
}
}
Connector.properties
name=AudioSourceConnector
tasks.max=1
connector.class=hothman.example.AudioSourceConnector
fileName = G:\\Files
topic= my-topic
Connect-standalone.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# These are defaults. This file just demonstrates how to override some settings.
bootstrap.servers=localhost:9092
# The converters specify the format of data in Kafka and how to translate it into Connect data. Every Connect user will
# need to configure these based on the format they want their data in when loaded from or stored into Kafka
#key.converter=org.apache.kafka.connect.json.JsonConverter
#value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
# Converter-specific settings can be passed in by prefixing the Converter's setting with the converter we want to apply
# it to
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=G:/Kafka/kafka_2.12-2.8.0/tmp/connect.offsets
# Flush much faster than normal, which is useful for testing/debugging
offset.flush.interval.ms=10000
# Set to a list of filesystem paths separated by commas (,) to enable class loading isolation for plugins
# (connectors, converters, transformations). The list should consist of top level directories that include
# any combination of:
# a) directories immediately containing jars with plugins and their dependencies
# b) uber-jars with plugins and their dependencies
# c) directories immediately containing the package directory structure of classes of plugins and their dependencies
# Note: symlinks will be followed to discover dependencies or plugins.
# Examples:
# plugin.path=/usr/local/share/java,/usr/local/share/kafka/plugins,/opt/connectors,
plugin.path=G:/Kafka/kafka_2.12-2.8.0/plugins
ERROR:
[2021-05-05 01:24:27,926] INFO WorkerSourceTask{id=AudioSourceConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:487)
[2021-05-05 01:24:27,928] ERROR WorkerSourceTask{id=AudioSourceConnector-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:184)
java.lang.OutOfMemoryError: Java heap space
at java.nio.file.Files.read(Files.java:3099)
at java.nio.file.Files.readAllBytes(Files.java:3158)
at hothman.example.AudioSourceTask.poll(AudioSourceTask.java:93)
at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:273)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:240)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:182)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:231)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2021-05-05 01:24:27,929] INFO [Producer clientId=connector-producer-AudioSourceConnector-0] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:1204)
[2021-05-05 01:24:27,933] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics:659)
[2021-05-05 01:24:27,934] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics:663)
[2021-05-05 01:24:27,934] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics:669)
[2021-05-05 01:24:27,935] INFO App info kafka.producer for connector-producer-AudioSourceConnector-0 unregistered (org.apache.kafka.common.utils.AppInfoParser:83)
[2021-05-05 01:24:36,479] INFO WorkerSourceTask{id=AudioSourceConnector-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:487)

Using the logger, per @OneCricketeer's recommendation, I was able to pinpoint the problem:
config.getFilenameConfig();
returns null, so for the time being I had to hard-code the path in the connector.
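For anyone hitting the same null: the likely root cause in the code above is that config is never instantiated from the props passed to start(), in either the connector or the task. A minimal sketch of that fix (my assumption, based on the Map constructor already defined in AudioSourceConnectorConfig):
@Override
public void start(Map<String, String> props) {
    // Assumption: building the config object from the incoming properties is what
    // makes getFilenameConfig()/getTopicConfig() return real values instead of null
    config = new AudioSourceConnectorConfig(props);
    filename = config.getFilenameConfig();
    topic = config.getTopicConfig();
    // ... rest of start() unchanged ...
}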
The connector then worked, but threw a java.lang.OutOfMemoryError: Java heap space error. To fix this I had to edit the connect-standalone.properties file and increase producer.max.request.size and producer.buffer.memory, making sure their values are larger than any of the files I am going to send.
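For reference, a rough sketch of those worker-level overrides in connect-standalone.properties (the sizes below are placeholders, not recommendations; they just need to exceed the largest file being sent):
# applied to the producer that source tasks use to write to Kafka
producer.max.request.size=20971520
producer.buffer.memory=20971520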
I also edited the AudioSourceTask class: I got rid of the for loop in the poll method and moved the initialization of listOfFiles from poll to start. They now look like this:
public void start(Map<String, String> props) {
filename = "G:\\AudioFiles";//config.getFilenameConfig();//
topic = "voice-wav1";//config.getTopicConfig();//
if (topic == null)
throw new ConnectException("AudioSourceTask config missing topic setting");
dir = Paths.get(filename);
try {
dir.register(ws, new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE}, FILE_TREE);
} catch (IOException e) {
e.printStackTrace();
}
directoryPath = new File(String.valueOf(dir));
listOfFiles = new ArrayList<File>(Arrays.asList(directoryPath.listFiles()));
}
@Override
public List<SourceRecord> poll() throws InterruptedException {
//TODO: Create SourceRecord objects that will be sent to the Kafka cluster.
Map<String, Object> offset = context.offsetStorageReader().
offset(Collections.singletonMap("G:\\AudioFiles", filename));
ArrayList<SourceRecord> records = new ArrayList<>(1);
try{
// send existing files first
if(listOfFiles.size()!=0) {
File file = listOfFiles.get(listOfFiles.size() - 1);
listOfFiles.remove(listOfFiles.size() - 1);
temp = Files.readAllBytes(Paths.get(file.toString()));
records.add(new SourceRecord(null,
null, topic, Schema.BYTES_SCHEMA, temp));
LOGGER.info("Reading file {}", file);
return records;
}
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
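One further note on the updated poll(): the records are still created with null source partition and offset maps, so the offsetStorageReader() lookup will never find anything to resume from. If resuming across restarts matters, this is a hedged sketch of how those maps could be filled in (the key names are made up for illustration):
// Hypothetical keys; any stable names work as long as they are used consistently
Map<String, String> sourcePartition = Collections.singletonMap("directory", filename);
Map<String, String> sourceOffset = Collections.singletonMap("lastFile", file.getName());
records.add(new SourceRecord(sourcePartition, sourceOffset,
        topic, Schema.BYTES_SCHEMA, temp));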

Related

How to debug and test MapReduce on a local Windows machine?

I have found debugging and testing a MapReduce project challenging.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;
import org.json.simple.JSONArray;
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;
import writables.Friend;
import writables.FriendArray;
import writables.FriendPair;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.StringTokenizer;
public class FacebookFriendsMapper extends Mapper<LongWritable, Text, FriendPair, FriendArray> {
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
Logger log = Logger.getLogger(FacebookFriendsMapper.class);
StringTokenizer st = new StringTokenizer(value.toString(), "\t");
String person = st.nextToken();
String friends = st.nextToken();
Friend f1 = populateFriend(person);
List<Friend> friendList = populateFriendList(friends);
Friend[] friendArray = Arrays.copyOf(friendList.toArray(), friendList.toArray().length, Friend[].class);
FriendArray farray = new FriendArray(Friend.class, friendArray);
for(Friend f2 : friendList) {
FriendPair fpair = new FriendPair(f1, f2);
context.write(fpair, farray);
log.info(fpair+"......"+ farray);
}
}
private Friend populateFriend(String friendJson) {
JSONParser parser = new JSONParser();
Friend friend = null;
try {
Object obj = (Object)parser.parse(friendJson);
JSONObject jsonObject = (JSONObject) obj;
Long lid = (long)jsonObject.get("id");
IntWritable id = new IntWritable(lid.intValue());
Text name = new Text((String)jsonObject.get("name"));
Text hometown = new Text((String)jsonObject.get("hometown"));
friend = new Friend(id, name, hometown);
} catch (ParseException e) {
e.printStackTrace();
}
return friend;
}
private List<Friend> populateFriendList(String friendsJson) {
List<Friend> friendList = new ArrayList<Friend>();
try {
JSONParser parser = new JSONParser();
Object obj = (Object)parser.parse(friendsJson.toString());
JSONArray jsonarray = (JSONArray) obj;
for(Object jobj : jsonarray) {
JSONObject entry = (JSONObject)jobj;
Long lid = (long)entry.get("id");
IntWritable id = new IntWritable(lid.intValue());
Text name = new Text((String)entry.get("name"));
Text hometown = new Text((String)entry.get("hometown"));
Friend friend = new Friend(id, name, hometown);
friendList.add(friend);
}
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return friendList;
}
}
For debugging and testing, I usually take the script above, put it inside a public static void main(String[] args) in another testing class, and run it in debug mode in IntelliJ IDEA, reading sample data from the local filesystem. Hence, I am pretty sure that the mapper's logic is correct.
As for the reducer, I am not sure exactly how the mapper passes its output to the reducer. I checked sample reducer scripts during my research and came up with this initial version of my reducer:
public class FacebookFriendsReducer extends
Reducer<FriendPair, FriendArray, FriendPair, FriendArray> {
@Override
public void reduce(FriendPair key, Iterable<FriendArray> values, Context context)
throws IOException, InterruptedException {
}
}
This is where I cannot proceed further, as I cannot simulate how the mapper passes its output to FacebookFriendsReducer and its reduce method. My current approach for debugging is to write the reducer logic in a public static void main(String[] args) and run it in debug mode before moving it into the reducer class.
Can someone help me pass the correct output of the mapper into the reducer so that I can work further on the logic?
If you have a better alternative for debugging and testing MapReduce on a local Windows machine, before packaging it into a jar and shipping it to the Hadoop cluster, please let me know.
Edit for @OneCricketeer's answer:
You can check the Driver (main class) as below:
public class FacebookFriendsDriver extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
String inputPath = "E:\\sampleInputPath\\inputFile";
String outputPath = "E:\\sampleOutputPath\\outputFile";
// if (args.length != 2) {
// System.err.println("Usage: fberature <input path> <output path>");
// System.exit(-1);
// }
//Job Setup
Job fb = Job.getInstance(getConf(), "facebook-friends");
fb.setJarByClass(FacebookFriendsDriver.class);
//File Input and Output format
FileInputFormat.addInputPath(fb, new Path(inputPath));
FileOutputFormat.setOutputPath(fb, new Path(outputPath));
fb.setInputFormatClass(TextInputFormat.class);
fb.setOutputFormatClass(SequenceFileOutputFormat.class);
//Mapper-Reducer-Combiner specifications
fb.setMapperClass(FacebookFriendsMapper.class);
fb.setReducerClass(FacebookFriendsReducer.class);
fb.setMapOutputKeyClass(FriendPair.class);
fb.setMapOutputValueClass(FriendArray.class);
//Output key and value
fb.setOutputKeyClass(FriendPair.class);
fb.setOutputValueClass(FriendArray.class);
//Submit job
return fb.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new FacebookFriendsDriver(), args);
System.exit(exitCode);
}
}
I created the sample driver class above based on another MapReduce job existing in our system. But I cannot make it work on my local Windows machine; it fails with the error below:
Connected to the target VM, address: '127.0.0.1:59143', transport: 'socket'
23/01/10 10:52:22 ERROR util.Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:324)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:339)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:332)
at org.apache.hadoop.util.GenericOptionsParser.preProcessForWindows(GenericOptionsParser.java:431)
at org.apache.hadoop.util.GenericOptionsParser.parseGeneralOptions(GenericOptionsParser.java:477)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:171)
at org.apache.hadoop.util.GenericOptionsParser.<init>(GenericOptionsParser.java:154)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:64)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at FacebookFriendsDriver.main(FacebookFriendsDriver.java:60)
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/C:/Users/Holyken/.m2/repository/org/apache/hadoop/hadoop-auth/2.3.0-cdh5.1.0/hadoop-auth-2.3.0-cdh5.1.0.jar) to method sun.security.krb5.Config.getInstance()
WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
23/01/10 10:52:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/10 10:52:23 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
23/01/10 10:52:23 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
Exception in thread "main" java.lang.NullPointerException
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1090)
at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:451)
at org.apache.hadoop.util.Shell.run(Shell.java:424)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:656)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:745)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:728)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:633)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:421)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:281)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:126)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:982)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:976)
at java.base/java.security.AccessController.doPrivileged(Native Method)
at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:976)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:582)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:612)
at FacebookFriendsDriver.run(FacebookFriendsDriver.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at FacebookFriendsDriver.main(FacebookFriendsDriver.java:60)
Disconnected from the target VM, address: '127.0.0.1:59143', transport: 'socket'
Can you elaborate on how I can run a MapReduce job on my local filesystem?
You can set breakpoints in the code from an IDE. You don't even need a real Hadoop cluster; the code will run the same against the local filesystem.
Otherwise, you can write unit tests as well. For instance, your JSON parsing function looks like it can return null values on exception, and then you continue adding null values into your mapper output... You also don't need to convert a list to an array just to create a JSON array.
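For example, a minimal JUnit sketch of that kind of test (assuming populateFriend is made package-visible so a test in the same package can call it; the test class name is mine):
import static org.junit.Assert.assertNull;
import org.junit.Test;

public class FacebookFriendsMapperTest {
    @Test
    public void populateFriendReturnsNullOnMalformedJson() {
        FacebookFriendsMapper mapper = new FacebookFriendsMapper();
        // With the current catch block, a parse failure silently yields null
        assertNull(mapper.populateFriend("not valid json"));
    }
}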
Your MapReduce main method, i.e. the job driver application, is what you'd start in a debugger.
can not simulate how the mapper pass its output to the FacebookFriendsReducer
The parameters are given like a GROUP BY key operation. Your value is an iterable of arrays, so you need to loop over them.
It is not clear what your reducer needs to output, so the output types might not be correct.
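To make the "iterable of arrays" point concrete, here is a bare skeleton of how the grouped values arrive in the reducer; what to actually emit is up to the job, so the write below is only a placeholder:
@Override
public void reduce(FriendPair key, Iterable<FriendArray> values, Context context)
        throws IOException, InterruptedException {
    // One call per key; values holds every FriendArray the mappers emitted for this key
    for (FriendArray friends : values) {
        // placeholder: merge/compare the arrays here, then emit the result
        context.write(key, friends);
    }
}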

Not able to process Kafka JSON message with the flink-siddhi library

I am trying to create a simple application that consumes a Kafka message, does some CQL transformation, and publishes back to Kafka. Below is the code:
JAVA: 1.8
Flink: 1.13
Scala: 2.11
flink-siddhi: 2.11-0.2.2-SNAPSHOT
I am using library: https://github.com/haoch/flink-siddhi
input json to Kafka:
{
"awsS3":{
"ResourceType":"aws.S3",
"Details":{
"Name":"crossplane-test",
"CreationDate":"2020-08-17T11:28:05+00:00"
},
"AccessBlock":{
"PublicAccessBlockConfiguration":{
"BlockPublicAcls":true,
"IgnorePublicAcls":true,
"BlockPublicPolicy":true,
"RestrictPublicBuckets":true
}
},
"Location":{
"LocationConstraint":"us-west-2"
}
}
}
main class:
public class S3SidhiApp {
public static void main(String[] args) {
internalStreamSiddhiApp.start();
//kafkaStreamApp.start();
}
}
App class:
package flinksidhi.app;
import com.google.gson.JsonObject;
import flinksidhi.event.s3.source.S3EventSource;
import io.siddhi.core.SiddhiManager;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.siddhi.SiddhiCEP;
import org.json.JSONObject;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import static flinksidhi.app.connector.Consumers.createInputMessageConsumer;
import static flinksidhi.app.connector.Producer.*;
public class internalStreamSiddhiApp {
private static final String inputTopic = "EVENT_STREAM_INPUT";
private static final String outputTopic = "EVENT_STREAM_OUTPUT";
private static final String consumerGroup = "EVENT_STREAM1";
private static final String kafkaAddress = "localhost:9092";
private static final String zkAddress = "localhost:2181";
private static final String S3_CQL1 = "from inputStream select * insert into temp";
private static final String S3_CQL = "from inputStream select json:toObject(awsS3) as obj insert into temp;" +
"from temp select json:getString(obj,'$.awsS3.ResourceType') as affected_resource_type," +
"json:getString(obj,'$.awsS3.Details.Name') as affected_resource_name," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration') as encryption," +
"json:getString(obj,'$.awsS3.Encryption.ServerSideEncryptionConfiguration.Rules[0].ApplyServerSideEncryptionByDefault.SSEAlgorithm') as algorithm insert into temp2; " +
"from temp2 select affected_resource_name,affected_resource_type, " +
"ifThenElse(encryption == ' ','Fail','Pass') as state," +
"ifThenElse(encryption != ' ' and algorithm == 'aws:kms','None','Critical') as severity insert into outputStream";
public static void start(){
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//DataStream<String> inputS = env.addSource(new S3EventSource());
//Flink kafka stream consumer
FlinkKafkaConsumer<String> flinkKafkaConsumer =
createInputMessageConsumer(inputTopic, kafkaAddress,zkAddress, consumerGroup);
//Add Data stream source -- flink consumer
DataStream<String> inputS = env.addSource(flinkKafkaConsumer);
SiddhiCEP cep = SiddhiCEP.getSiddhiEnvironment(env);
cep.registerExtension("json:toObject", io.siddhi.extension.execution.json.function.ToJSONObjectFunctionExtension.class);
cep.registerExtension( "json:getString", io.siddhi.extension.execution.json.function.GetStringJSONFunctionExtension.class);
cep.registerStream("inputStream", inputS, "awsS3");
inputS.print();
System.out.println(cep.getDataStreamSchemas());
//json needs extension jars to present during runtime.
DataStream<Map<String,Object>> output = cep
.from("inputStream")
.cql(S3_CQL1)
.returnAsMap("temp");
//Flink kafka stream Producer
FlinkKafkaProducer<Map<String, Object>> flinkKafkaProducer =
createMapProducer(env,outputTopic, kafkaAddress);
//Add Data stream sink -- flink producer
output.addSink(flinkKafkaProducer);
output.print();
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Consumer class:
package flinksidhi.app.connector;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.json.JSONObject;
import java.util.Properties;
public class Consumers {
public static FlinkKafkaConsumer<String> createInputMessageConsumer(String topic, String kafkaAddress, String zookeeprAddr, String kafkaGroup ) {
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", kafkaAddress);
properties.setProperty("zookeeper.connect", zookeeprAddr);
properties.setProperty("group.id",kafkaGroup);
FlinkKafkaConsumer<String> consumer = new FlinkKafkaConsumer<String>(
topic,new SimpleStringSchema(),properties);
return consumer;
}
}
Producer class:
package flinksidhi.app.connector;
import flinksidhi.app.util.ConvertJavaMapToJson;
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;
import org.json.JSONObject;
import java.util.Map;
public class Producer {
public static FlinkKafkaProducer<Tuple2> createStringProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Tuple2>(kafkaAddress, topic, new AverageSerializer());
}
public static FlinkKafkaProducer<Map<String,Object>> createMapProducer(StreamExecutionEnvironment env, String topic, String kafkaAddress) {
return new FlinkKafkaProducer<Map<String,Object>>(kafkaAddress, topic, new SerializationSchema<Map<String, Object>>() {
@Override
public void open(InitializationContext context) throws Exception {
}
@Override
public byte[] serialize(Map<String, Object> stringObjectMap) {
String json = ConvertJavaMapToJson.convert(stringObjectMap);
return json.getBytes();
}
});
}
}
I have tried many things, but the code where the CQL is invoked is never called and doesn't even give any error, so I am not sure where it is going wrong.
If I do the same thing with an internal stream source and the same input JSON returned as a string, it works.
Initial guess: if you are using event time, are you sure you have defined watermarks correctly? As stated in the docs:
(...) an incoming element is initially put in a buffer where elements are sorted in ascending order based on their timestamp, and when a watermark arrives, all the elements in this buffer with timestamps smaller than that of the watermark are processed (...)
If this doesn't help, I would suggest decomposing/simplifying the job to a bare minimum, for example just a source operator and a naive sink printing/logging the elements. If that works, start adding back operators one by one. You could also start by simplifying your CEP pattern as much as possible.
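As an illustration of that suggestion, a stripped-down version of the job above (reusing the names from the question; nothing here is new API, just the Kafka source wired straight to a print sink):
public static void start() {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    FlinkKafkaConsumer<String> flinkKafkaConsumer =
            createInputMessageConsumer(inputTopic, kafkaAddress, zkAddress, consumerGroup);
    // If records show up here, re-add the Siddhi CEP step next
    env.addSource(flinkKafkaConsumer).print();
    try {
        env.execute("debug-source-only");
    } catch (Exception e) {
        e.printStackTrace();
    }
}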
First of all, thanks a lot @Piotr Nowojski. Just because of your small pointer about event time, which no matter how many times I pondered over it had not come to mind, I found it. So yes, while debugging the two cases:
With the internal data source, where processing succeeded, I identified while debugging the flow that it was processing a watermark after processing the data, but I did not catch that it was somehow managing the event time of the data implicitly.
With Kafka as the data source, while debugging I could very clearly see that it was not processing any watermark in the flow, but it did not occur to me that this was happening because event time and watermarks were not handled properly.
Just adding a single line of code to the application fixed it, which I understood from the Flink Javadoc snippet below:
* @deprecated In Flink 1.12 the default stream time characteristic has been changed to {@link
* TimeCharacteristic#EventTime}, thus you don't need to call this method for enabling
* event-time support anymore. Explicitly using processing-time windows and timers works in
* event-time mode. If you need to disable watermarks, please use {@link
* ExecutionConfig#setAutoWatermarkInterval(long)}. If you are using {@link
* TimeCharacteristic#IngestionTime}, please manually set an appropriate {@link
* WatermarkStrategy}. If you are using generic "time window" operations (for example {@link
* org.apache.flink.streaming.api.datastream.KeyedStream#timeWindow(org.apache.flink.streaming.api.windowing.time.Time)}
* that change behaviour based on the time characteristic, please use equivalent operations
* that explicitly specify processing time or event time.
*/
I learned that by default Flink considers event time, and for that, watermarks need to be handled properly, which I had not done. So I added the line below to set the time characteristic of the Flink execution environment:
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
and kaboom... it started working. While this is deprecated and needs some other configuration eventually, thanks a lot; it was a great pointer, it helped me a lot, and I solved the issue.
Thanks again @Piotr Nowojski
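For context, a sketch of where that single line sits in the job above (assuming processing time is acceptable for this pipeline; otherwise a WatermarkStrategy would need to be configured on the Kafka source instead):
// import org.apache.flink.streaming.api.TimeCharacteristic;
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Flink 1.12+ defaults to event time; without watermarks nothing downstream fires,
// so fall back to processing time (deprecated, but it unblocks this job)
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);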

Flink SerializationSchema: Could not serialize row error

I am having some trouble using Flink's SerializationSchema.
Here is my main code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DeserializationSchema<Row> sourceDeserializer = new JsonRowDeserializationSchema.Builder( /*Extract TypeInformation<Row> from an avsc schema file*/ ).build();
DataStream<Row> myDataStream = env.addSource( new MyCustomSource(sourceDeserializer) ) ;
final SinkFunction<Row> sink = new MyCustomSink(new JsonRowSerializationSchema.Builder(myDataStream.getType()).build());
myDataStream.addSink(sink).name("MyCustomSink");
env.execute("MyJob");
Here is my custom SinkFunction:
import org.apache.flink.api.common.serialization.SerializationSchema;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.types.Row;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@SuppressWarnings("serial")
public class MyCustomSink implements SinkFunction<Row> {
private static final Logger LOGGER = LoggerFactory.getLogger(MyCustomSink.class);
private final boolean print;
private final SerializationSchema<Row> serializationSchema;
public MyCustomSink(final SerializationSchema<Row> serializationSchema) {
this.serializationSchema = serializationSchema;
}
@Override
public void invoke(final Row value, final Context context) throws Exception {
try {
LOGGER.info("MyCustomSink- invoke : [{}]", new String(serializationSchema.serialize(value)));
}catch (Exception e){
LOGGER.error("MyCustomSink- Error while sending data : " + e);
}
}
}
And here is my custom SourceFunction (not sure it is relevant to the problem I have):
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.ResultTypeQueryable;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.shaded.guava18.com.google.common.io.ByteStreams;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
public class MyCustomSource<T> extends RichSourceFunction<T> implements ResultTypeQueryable<T> {
/** logger */
private static final Logger LOGGER = LoggerFactory.getLogger(MyCustomSource.class);
/** the JSON deserializer */
private final DeserializationSchema<T> deserializationSchema;
public MyCustomSource(final DeserializationSchema<T> deserializer) {
this.deserializationSchema = deserializer;
}
@Override
public void open(final Configuration parameters) {
...
}
@Override
public void run(final SourceContext<T> ctx) throws Exception {
LOGGER.info("run");
InputStream data = ...; // Retrieve the input json data
final T row = deserializationSchema
.deserialize(ByteStreams.toByteArray(data));
ctx.collect(row);
}
@Override
public void cancel() {
...
}
@Override
public TypeInformation<T> getProducedType() {
return deserializationSchema.getProducedType();
}
}
Now I run my code and send some data sequentially to my pipeline:
==>
{
"id": "sensor1",
"data":{
"rotation": 250
}
}
Here, the data is correctly printed by my sink : MyCustomSink- invoke : [{"id":"sensor1","data":{"rotation":250}}]
==>
{
"id": "sensor1"
}
Here, the data is correctly printed by my sink : MyCustomSink- invoke : [{"id":"sensor1","data":null}]
==>
{
"id": "sensor1",
"data":{
"rotation": 250
}
}
Here, there is an error on serialization. The error log printed is :
MyCustomSink- Error while sending data : java.lang.RuntimeException: Could not serialize row 'sensor1,250'. Make sure that the schema matches the input.
I do not understand at all why I get this behavior. Does someone have an idea?
Notes:
Using Flink 1.9.2
-- EDIT --
I added the CustomSource part
-- EDIT 2 --
After more investigation, it looks like this behavior is caused by the private transient ObjectNode node field of JsonRowSerializationSchema. If I understand correctly, this is used for optimization, but it seems to be the cause of my problem.
Is this the normal behavior, and if so, what would be the correct use of this class in my case? (Otherwise, is there any way to bypass this problem?)
This is a JsonRowSerializationSchema bug that has been fixed in the most recent Flink versions; I believe this PR addresses the issue above.

Good ZooKeeper hello world program with a Java client

I was trying to use ZooKeeper in our project. I could run the server and even test it using zkCli.sh; all good.
But I couldn't find a good tutorial on connecting to this server using Java! All I need from the Java API is a method
public String getServiceURL ( String serviceName )
I tried https://cwiki.apache.org/confluence/display/ZOOKEEPER/Index --> not good for me.
http://zookeeper.apache.org/doc/trunk/javaExample.html: sort of OK, but I couldn't understand the concepts clearly. I feel it is not explained well.
Finally, this is the simplest and most basic program I came up with which will help you with ZooKeeper "Getting Started":
package core.framework.zookeeper;
import java.util.Date;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;
public class ZkConnect {
private ZooKeeper zk;
private CountDownLatch connSignal = new CountDownLatch(1); // count of 1 so connect() blocks until SyncConnected
//host should be 127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002
public ZooKeeper connect(String host) throws Exception {
zk = new ZooKeeper(host, 3000, new Watcher() {
public void process(WatchedEvent event) {
if (event.getState() == KeeperState.SyncConnected) {
connSignal.countDown();
}
}
});
connSignal.await();
return zk;
}
public void close() throws InterruptedException {
zk.close();
}
public void createNode(String path, byte[] data) throws Exception
{
zk.create(path, data, Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
}
public void updateNode(String path, byte[] data) throws Exception
{
zk.setData(path, data, zk.exists(path, true).getVersion());
}
public void deleteNode(String path) throws Exception
{
zk.delete(path, zk.exists(path, true).getVersion());
}
public static void main (String args[]) throws Exception
{
ZkConnect connector = new ZkConnect();
ZooKeeper zk = connector.connect("54.169.132.0,52.74.51.0");
String newNode = "/deepakDate"+new Date();
connector.createNode(newNode, new Date().toString().getBytes());
List<String> zNodes = zk.getChildren("/", true);
for (String zNode: zNodes)
{
System.out.println("ChildrenNode " + zNode);
}
byte[] data = zk.getData(newNode, true, zk.exists(newNode, true));
System.out.println("GetData before setting");
for ( byte dataPoint : data)
{
System.out.print ((char)dataPoint);
}
System.out.println("GetData after setting");
connector.updateNode(newNode, "Modified data".getBytes());
data = zk.getData(newNode, true, zk.exists(newNode, true));
for ( byte dataPoint : data)
{
System.out.print ((char)dataPoint);
}
connector.deleteNode(newNode);
}
}
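For the getServiceURL(String serviceName) method the question asks for, a small sketch on top of this class (the /services/&lt;name&gt; path layout is my assumption; use whatever znode structure your services actually register under):
// Assumption: each service stores its URL as the data of /services/<serviceName>
public String getServiceURL(String serviceName) throws Exception {
    byte[] data = zk.getData("/services/" + serviceName, false, null);
    return data == null ? null : new String(data);
}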
This post has almost all operations required to interact with Zookeeper.
https://www.tutorialspoint.com/zookeeper/zookeeper_api.htm
Create ZNode with data
Delete ZNode
Get list of ZNodes(Children)
Check whether a ZNode exists or not
Edit the content of a ZNode...
This blog post, Zookeeper Java API examples, includes some good examples if you are looking for Java examples to start with. ZooKeeper also provides a client API library (C and Java) that is very easy to use.
ZooKeeper is one of the best open-source servers and services for reliably coordinating distributed processes. ZooKeeper is a CP system (refer to the CAP theorem) that provides consistency and partition tolerance. Replication of ZooKeeper state across all the nodes makes it an eventually consistent distributed service.
This is about as simple as you can get. I am building a tool which will use ZK to lock files that are being processed (hence the class name):
package mypackage;
import java.io.IOException;
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.Watcher;
public class ZooKeeperFileLock {
public static void main(String[] args) throws IOException, KeeperException, InterruptedException {
String zkConnString = "<zknode1>:2181,<zknode2>:2181,<zknode3>:2181";
ZooKeeperWatcher zkWatcher = new ZooKeeperWatcher();
ZooKeeper client = new ZooKeeper(zkConnString, 10000, zkWatcher);
List<String> zkNodes = client.getChildren("/", true);
for(String node : zkNodes) {
System.out.println(node);
}
}
public static class ZooKeeperWatcher implements Watcher {
@Override
public void process(WatchedEvent event) {
}
}
}
If you are on AWS, we can now create an internal ELB that supports redirection based on URI, which can really solve this problem with high availability already baked in.

Why does this Spring JSON endpoint break in JBoss/Tomcat?

Why does this Spring JSON endpoint break in JBoss/Tomcat? I tried to add it to an existing APPLICATION, and it worked until I started refactoring the code; now the errors do not point to anything that is logical to me.
Here is my code: a controller and a helper class to keep things clean.
Controller.java
import java.io.IOException;
import java.util.Properties;
import javax.annotation.Nonnull;
import org.apache.commons.configuration.ConfigurationException;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.ResponseBody;
@Controller
public class PropertiesDisplayController {
@Nonnull
private final PropertiesDisplayHelper propertiesHelper;
/**
* @param propertiesHelper
*/
@Nonnull
@Autowired
public PropertiesDisplayController(@Nonnull final PropertiesDisplayHelper propertiesHelper) {
super();
this.propertiesHelper = propertiesHelper;
}
@Nonnull
@RequestMapping("/localproperties")
public @ResponseBody Properties localProperties() throws ConfigurationException, IOException {
return propertiesHelper.getLocalProperties();
}
@Nonnull
@RequestMapping("/properties")
public @ResponseBody Properties applicationProperties() throws IOException,
ConfigurationException {
return propertiesHelper.getApplicationProperties();
}
}
This would be Helper.java:
import java.io.FileInputStream;
import java.io.IOException;
import java.util.Iterator;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.annotation.Nonnull;
import org.apache.commons.configuration.Configuration;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringEscapeUtils;
import org.apache.commons.lang.StringUtils;
public class PropertiesDisplayHelper {
/** Static location of the properties file for SSO */
private static String LOCAL_PROPERTIES_LOCATION =
"local.properties";
/** Static strings for masking the passwords and blank values */
private static String NOVALUE = "**NO VALUE**";
/** Static strings for masking the passwords and blank values */
private static String MASKED = "**MASKED**";
@Nonnull
public Properties getApplicationProperties() throws ConfigurationException {
final Properties properties = new Properties();
final Configuration configuration = AppConfiguration.Factory.getConfiguration();
// create a map of properties
final Iterator<?> propertyKeys = configuration.getKeys();
final Map<String, String> sortedProperties = new TreeMap<String, String>();
// loops the configurations and builds the properties
while (propertyKeys.hasNext()) {
final String key = propertyKeys.next().toString();
final String value = configuration.getProperty(key).toString();
sortedProperties.put(key, value);
}
properties.putAll(sortedProperties);
// output of the result
formatsPropertiesData(properties);
return properties;
}
@Nonnull
public Properties getLocalProperties() throws ConfigurationException, IOException {
FileInputStream fis = null;
final Properties properties = new Properties();
// imports the local.properties file from the location
// designated during the update to openAM12
try {
fis = new FileInputStream(LOCAL_PROPERTIES_LOCATION);
properties.load(fis);
} finally {
// closes file input stream
IOUtils.closeQuietly(fis);
}
formatsPropertiesData(properties);
return properties;
}
void formatsPropertiesData(@Nonnull final Properties properties) {
for (final String key : properties.stringPropertyNames()) {
String value = properties.getProperty(key);
if (StringUtils.isEmpty(value)) {
value = NOVALUE;
} else if (key.endsWith("ssword")) {
value = MASKED;
} else {
value = StringEscapeUtils.escapeHtml(value);
}
// places data to k,v paired properties object
properties.put(key, value);
}
}
}
They set up a JSON display of the properties, from the application and from a file, for logging. Yet now this unintrusive code seems to break my entire application build.
Here is the error from JBoss:
20:33:41,559 ERROR [org.jboss.as.server] (DeploymentScanner-threads - 1) JBAS015870: Deploy of deployment "openam.war" was rolled back with the following failure message:
{"JBAS014671: Failed services" => {"jboss.deployment.unit.\"APPLICATION.war\".STRUCTURE" => "org.jboss.msc.service.StartException in service jboss.deployment.unit.\"APPLICATION.war\".STRUCTURE: JBAS018733: Failed to process phase STRUCTURE of deployment \"APPLICATION.war\"
Caused by: org.jboss.as.server.deployment.DeploymentUnitProcessingException: JBAS018740: Failed to mount deployment content
Caused by: java.io.FileNotFoundException:APPLICATION.war (Access is denied)"}}
and the errors from Tomcat:
http://pastebin.com/PXdcpqvc
I am at a loss here and think there is something I just do not see.
A simple solution was at hand: the missing piece was the @Component annotation on the helper class.
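In other words (assuming component scanning covers this package), so that Spring can create the bean that the controller's constructor injection expects:
import org.springframework.stereotype.Component;

@Component
public class PropertiesDisplayHelper {
    // ... body unchanged from the Helper.java shown above ...
}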
