I am doing a POC for writing data to S3 using Flink. The program does not give an error, but I do not see any files being written to S3 either.
Below is the code
public class StreamingJob {
    public static void main(String[] args) throws Exception {
        // set up the streaming execution environment
        final String outputPath = "s3a://testbucket-s3-flink/data/";
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        //Enable checkpointing
        env.enableCheckpointing();
        //S3 Sink
        final StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
                .build();
        //Source is a local kafka
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kafka:9094");
        properties.setProperty("group.id", "test");
        DataStream<String> input = env.addSource(new FlinkKafkaConsumer<String>("queueing.transactions", new SimpleStringSchema(), properties));
        input.flatMap(new Tokenizer())        // Tokenizer for generating words
             .keyBy(0)                        // Logically partition the stream for each word
             .timeWindow(Time.minutes(1))     // Tumbling window definition
             .sum(1)                          // Sum the number of words per partition
             .map(value -> value.f0 + " count: " + value.f1.toString() + "\n")
             .addSink(sink);
        // execute program
        env.execute("Flink Streaming Java API Skeleton");
    }

    public static final class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
            String[] tokens = value.toLowerCase().split("\\W+");
            for (String token : tokens) {
                if (token.length() > 0) {
                    out.collect(new Tuple2<>(token, 1));
                }
            }
        }
    }
}
Note that I have set the s3.access-key and s3.secret-key values in the configuration, and I verified they are picked up by changing them to incorrect values (I got an error with the incorrect values).
Any pointers on what may be going wrong?
Could it be that you are running into this issue?
Given that Flink sinks and UDFs in general do not differentiate between normal job termination (e.g. finite input stream) and termination due to failure, upon normal termination of a job, the last in-progress files will not be transitioned to the “finished” state.
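If that is the case, here is a minimal sketch of the setup that usually makes part files show up, assuming a Flink version where DefaultRollingPolicy.builder() is available; the checkpoint interval, rollover settings, and stand-in source are placeholders, not your exact job. StreamingFileSink only moves part files from the in-progress state to finished when a checkpoint completes, so give checkpointing an explicit interval.
import java.util.concurrent.TimeUnit;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class S3SinkSketch {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Part files are committed (renamed to "finished") when a checkpoint completes,
        // so use an explicit interval rather than the no-argument variant.
        env.enableCheckpointing(60_000L);

        final StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://testbucket-s3-flink/data/"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .withRollingPolicy(DefaultRollingPolicy.builder()
                        .withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
                        .withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
                        .withMaxPartSize(128 * 1024 * 1024)
                        .build())
                .build();

        env.fromElements("hello", "world") // stand-in source for the sketch
                .addSink(sink);

        env.execute("S3 sink sketch");
    }
}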
I'm working on creating a framework to allow customers to create their own plugins for my software built on Apache Flink. I've outlined in a snippet below what I'm trying to get working (just as a proof of concept), however I'm getting an org.apache.flink.client.program.ProgramInvocationException: The main method caused an error. when trying to upload it.
I want to be able to branch the input stream into x number of different pipelines and then combine those back into a single output. What I have below is just the simplified version I'm starting with.
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");
        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }
        // It seemed as though I needed a seed DataStream to union all others on
        ObjectMapper mapper = new ObjectMapper();
        ObjectNode seedObject = (ObjectNode) mapper.readTree("{\"start\":\"true\"");
        DataStream<ObjectNode> alerts = see.fromElements(seedObject);
        // Union the output of each "rule" above with the seed object to then output
        for (DataStream<ObjectNode> output : outputs) {
            alerts.union(output);
        }
        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
I can't figure out how to get the actual error out of the Flink web interface to add that information here.
There were a few errors I found.
1) A Stream Execution Environment can only have one input (apparently? I could be wrong) so adding the .fromElements input was not good
2) I forgot that DataStreams are immutable, so the .union operation returns a new DataStream rather than modifying the one it's called on.
The final result ended up being much simpler
public class ContentBase {
    public static void main(String[] args) throws Exception {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "kf-service:9092");
        properties.setProperty("group.id", "varnost-content");
        // Setup up execution environment and get stream from Kafka
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<ObjectNode> logs = see.addSource(new FlinkKafkaConsumer011<>("log-input",
                new JSONKeyValueDeserializationSchema(false), properties).setStartFromLatest())
                .map((MapFunction<ObjectNode, ObjectNode>) jsonNodes -> (ObjectNode) jsonNodes.get("value"));
        // Create a new List of Streams, one for each "rule" that is being executed
        // For now, I have a simple custom wrapper on flink's `.filter` function in `MyClass.filter`
        List<String> codes = Arrays.asList("404", "200", "500");
        List<DataStream<ObjectNode>> outputs = new ArrayList<>();
        for (String code : codes) {
            outputs.add(MyClass.filter(logs, "response", code));
        }
        Optional<DataStream<ObjectNode>> alerts = outputs.stream().reduce(DataStream::union);
        // Convert to string and sink to Kafka
        alerts.map((MapFunction<ObjectNode, String>) ObjectNode::toString)
              .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output", new SimpleStringSchema()));
        see.execute();
    }
}
The code you posted cannot be compiled because of the last part (i.e., converting to string). You mixed up the map of the java.util.Optional returned by reduce with Flink's DataStream map. Changing it to
alerts.get().map(ObjectNode::toString);
fixes it.
Good luck.
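For reference, a hedged sketch of how that corrected last part could look once the Optional returned by reduce is unwrapped; it reuses the outputs list and Kafka settings from the answer above rather than being a standalone program.
// Unwrap the java.util.Optional produced by reduce(DataStream::union),
// then use Flink's DataStream.map before sinking to Kafka.
outputs.stream()
        .reduce(DataStream::union)
        .ifPresent(unioned -> unioned
                .map((MapFunction<ObjectNode, String>) ObjectNode::toString)
                .addSink(new FlinkKafkaProducer011<>("kf-service:9092", "log-output",
                        new SimpleStringSchema())));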
As a Kafka learning exercise, I have written a Java program TsdbMetricToKafkaTopic to copy data from openTSDB to a Kafka topic, and another Java program DumpKafkaTopic to print out the results; below is the key method of DumpKafkaTopic.
I have confirmed, by using the Kafka utility kafka-console-consumer.sh, that the data I expect are indeed getting written to the intended topic. However, the behavior of DumpKafkaTopic is strange: when I run the producer and then DumpKafkaTopic, it prints results as I'd expect, but if I re-run it immediately, it prints nothing.
I thought that because I set auto.offset.reset to earliest, my program would be idempotent, that is, every time I run it, it should produce the same results (until I write something else to the topic). Why isn't this happening?
public void dump( String kafka_topic ) {
    // Serializers/deserializers (serde) for key and value types
    final Serde<Long> long_serde = Serdes.Long();
    final Serde< TsdbObject > tsdb_object_serde =
            Serdes.serdeFrom( new TsdbObject.TsdbObjectSerializer(),
                    new TsdbObject.TsdbObjectDeserializer() );
    StreamsBuilder streams_builder = new StreamsBuilder();
    KStream< Long, TsdbObject > kstream =
            streams_builder.stream( kafka_topic, Consumed.with( long_serde, tsdb_object_serde ) );
    // Add final operator, to print results to stdout:
    Printed< Long, TsdbObject > printed = Printed.toSysOut();
    kstream.print( printed );
    Map<String, Object> kstreams_props = new HashMap<>();
    kstreams_props.put(StreamsConfig.APPLICATION_ID_CONFIG, "DumpKafkaTopic");
    kstreams_props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    // make sure to consume the complete topic via "auto.offset.reset = earliest"
    kstreams_props.put( ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    StreamsConfig kstreams_config = new StreamsConfig(kstreams_props);
    KafkaStreams kstreams = new KafkaStreams( streams_builder.build(), kstreams_config );
    System.out.println( "Starting DumpKafkaTopic stream " );
    kstreams.start();
    // Add shutdown hook to respond to SIGTERM and gracefully close Kafka Streams (from https://www.confluent.io/blog/data-reprocessing-with-kafka-streams-resetting-a-streams-application/)
    Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
        @Override
        public void run() {
            System.out.println( "Stopping DumpKafkaTopic stream " );
            kstreams.close();
        }
    }));
}
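One hedged note on the reasoning above (an assumption about what is happening, not something confirmed in the post): auto.offset.reset only takes effect when the consumer group, which Kafka Streams derives from the application.id, has no committed offsets. Once the first run commits its offsets, a re-run resumes from them and finds nothing new to print. A minimal sketch of one way to make every run read the topic from the beginning again, assuming a throwaway application.id is acceptable:
Map<String, Object> kstreams_props = new HashMap<>();
// A fresh application.id means a fresh consumer group with no committed offsets,
// so "earliest" applies on every run. (Alternatively, reset the existing application.)
kstreams_props.put(StreamsConfig.APPLICATION_ID_CONFIG,
        "DumpKafkaTopic-" + System.currentTimeMillis());
kstreams_props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
kstreams_props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");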
I am trying to write a Spark Streaming application in Java. My Spark application reads a continuous feed from a Hadoop
directory using textFileStream() at an interval of 1 minute.
I need to perform a Spark aggregation (group by) operation on the incoming DStream. After the aggregation, I join the aggregated DStream<Key, Value1>
with an RDD<Key, Value2> created from a static dataset read by textFile() from a Hadoop directory.
The problem comes when I enable checkpointing. With an empty checkpoint directory, it runs fine. After running 2-3 batches I close it using Ctrl+C and run it again.
On the second run it immediately throws the following Spark exception ("SPARK-5063"):
Exception in thread "main" org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063
The following is the relevant block of code from the Spark application:
private void compute(JavaSparkContext sc, JavaStreamingContext ssc) {
    JavaRDD<String> distFile = sc.textFile(MasterFile);
    JavaDStream<String> file = ssc.textFileStream(inputDir);
    // Read Master file
    JavaRDD<MasterParseLog> masterLogLines = distFile.flatMap(EXTRACT_MASTER_LOGLINES);
    final JavaPairRDD<String, String> masterRDD = masterLogLines.mapToPair(MASTER_KEY_VALUE_MAPPER);
    // Continuous Streaming file
    JavaDStream<ParseLog> logLines = file.flatMap(EXTRACT_CKT_LOGLINES);
    // calculate the sum of required field and generate group sum RDD
    JavaPairDStream<String, Summary> sumRDD = logLines.mapToPair(CKT_GRP_MAPPER);
    JavaPairDStream<String, Summary> grpSumRDD = sumRDD.reduceByKey(CKT_GRP_SUM);
    //GROUP BY Operation
    JavaPairDStream<String, Summary> grpAvgRDD = grpSumRDD.mapToPair(CKT_GRP_AVG);
    // Join Master RDD with the DStream //This is the block causing error (without it code is working fine)
    JavaPairDStream<String, Tuple2<String, String>> joinedStream = grpAvgRDD.transformToPair(
            new Function2<JavaPairRDD<String, String>, Time, JavaPairRDD<String, Tuple2<String, String>>>() {
                private static final long serialVersionUID = 1L;

                public JavaPairRDD<String, Tuple2<String, String>> call(
                        JavaPairRDD<String, String> rdd, Time v2) throws Exception {
                    return masterRDD.value().join(rdd);
                }
            }
    );
    joinedStream.print(10);
}

public static void main(String[] args) {
    JavaStreamingContextFactory contextFactory = new JavaStreamingContextFactory() {
        public JavaStreamingContext create() {
            // Create the context with a 60 second batch size
            SparkConf sparkConf = new SparkConf();
            final JavaSparkContext sc = new JavaSparkContext(sparkConf);
            JavaStreamingContext ssc1 = new JavaStreamingContext(sc, Durations.seconds(duration));
            app.compute(sc, ssc1);
            ssc1.checkpoint(checkPointDir);
            return ssc1;
        }
    };
    JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(checkPointDir, contextFactory);
    // start the streaming server
    ssc.start();
    logger.info("Streaming server started...");
    // wait for the computations to finish
    ssc.awaitTermination();
    logger.info("Streaming server stopped...");
}
I know that the block of code which joins the static dataset with the DStream is causing the error, but it is taken from the Spark Streaming
page of the Apache Spark website (sub-heading "stream-dataset join" under "Join Operations"). Please help me get it working, even if
there is a different way of doing it. I need to enable checkpointing in my streaming application.
Environment Details:
CentOS 6.5: 2-node cluster
Java: 1.8
Spark: 1.4.1
Hadoop: 2.7.1
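A hedged sketch of one commonly suggested workaround (an assumption, not a verified fix for this exact job): avoid capturing the driver-built masterRDD inside the checkpointed DStream operation, and instead rebuild and cache the static pair RDD lazily from inside transformToPair, via the context attached to the batch RDD, so it can be reconstructed after recovery from the checkpoint. MasterHolder and the key parsing below are hypothetical; grpAvgRDD, Summary, and MasterFile are the names from the code above.
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import scala.Tuple2;

// Hypothetical lazy holder: the static RDD is (re)built on first use in the driver JVM,
// including after a restart from the checkpoint, instead of being captured in a closure.
class MasterHolder {
    private static JavaPairRDD<String, String> master;

    static synchronized JavaPairRDD<String, String> getOrLoad(JavaSparkContext sc, String path) {
        if (master == null) {
            master = sc.textFile(path)
                    .mapToPair(line -> new Tuple2<>(line.split(",")[0], line)) // hypothetical key parsing
                    .cache();
        }
        return master;
    }
}

// In compute(): join each batch against the lazily loaded static RDD.
JavaPairDStream<String, Tuple2<String, Summary>> joinedStream =
        grpAvgRDD.transformToPair(rdd -> {
            JavaSparkContext sc = JavaSparkContext.fromSparkContext(rdd.context());
            return MasterHolder.getOrLoad(sc, MasterFile).join(rdd);
        });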
I want to read data from an FTP server. I am providing the path of the file, which resides on the FTP server, in the format ftp://Username:Password@host/path.
When I use a MapReduce program to read the data from the file, it works fine. I want to read the data from the same file through the Cascading framework. I am using the Hfs tap of the Cascading framework to read the data. It throws the following exception:
java.io.IOException: Stream closed
at org.apache.hadoop.fs.ftp.FTPInputStream.close(FTPInputStream.java:98)
at java.io.FilterInputStream.close(Unknown Source)
at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
at org.apache.hadoop.mapred.LineRecordReader.close(LineRecordReader.java:168)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.close(MapTask.java:254)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:440)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Below is the Cascading code from which I am reading the file:
public class FTPWithHadoopDemo {
    public static void main(String args[]) {
        Tap source = new Hfs(new TextLine(new Fields("line")), "ftp://user:pwd@xx.xx.xx.xx//input1");
        Tap sink = new Hfs(new TextLine(new Fields("line1")), "OP\\op", SinkMode.REPLACE);
        Pipe pipe = new Pipe("First");
        pipe = new Each(pipe, new RegexSplitGenerator("\\s+"));
        pipe = new GroupBy(pipe);
        Pipe tailpipe = new Every(pipe, new Count());
        FlowDef flowDef = FlowDef.flowDef().addSource(pipe, source).addTailSink(tailpipe, sink);
        new HadoopFlowConnector().connect(flowDef).complete();
    }
}
I tried to look in the Hadoop source code for the same exception. I found that in the MapTask class there is a method runOldMapper which deals with the stream, and in the same method there is a finally block where the stream gets closed (in.close()). When I remove that line from the finally block, it works fine. Below is the code:
private <INKEY, INVALUE, OUTKEY, OUTVALUE> void runOldMapper(final JobConf job, final TaskSplitIndex splitIndex,
        final TaskUmbilicalProtocol umbilical, TaskReporter reporter)
        throws IOException, InterruptedException, ClassNotFoundException {
    InputSplit inputSplit = getSplitDetails(new Path(splitIndex.getSplitLocation()), splitIndex.getStartOffset());
    updateJobWithSplit(job, inputSplit);
    reporter.setInputSplit(inputSplit);
    RecordReader<INKEY, INVALUE> in = isSkipping()
            ? new SkippingRecordReader<INKEY, INVALUE>(inputSplit, umbilical, reporter)
            : new TrackedRecordReader<INKEY, INVALUE>(inputSplit, job, reporter);
    job.setBoolean("mapred.skip.on", isSkipping());
    int numReduceTasks = conf.getNumReduceTasks();
    LOG.info("numReduceTasks: " + numReduceTasks);
    MapOutputCollector collector = null;
    if (numReduceTasks > 0) {
        collector = new MapOutputBuffer(umbilical, job, reporter);
    } else {
        collector = new DirectMapOutputCollector(umbilical, job, reporter);
    }
    MapRunnable<INKEY, INVALUE, OUTKEY, OUTVALUE> runner = ReflectionUtils.newInstance(job.getMapRunnerClass(),
            job);
    try {
        runner.run(in, new OldOutputCollector(collector, conf), reporter);
        collector.flush();
    } finally {
        // close
        in.close(); // close input
        collector.close();
    }
}
Please assist me in solving this problem.
Thanks,
Arshadali
After some effort I found out that Hadoop uses the org.apache.hadoop.fs.ftp.FTPFileSystem class for FTP.
This class doesn't support seek, i.e., seeking to a given offset from the start of the file. Data is read in one block and then the file system seeks to the next block to read. The default block size is 4 KB for FTPFileSystem. Since seek is not supported, it can only read data less than or equal to 4 KB.
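This is not from the original answer, but one hedged workaround sketch under that constraint: copy the file off FTP onto HDFS in a single sequential pass (so no seek is needed) and point the Cascading Hfs tap at the HDFS copy instead of the ftp:// URI. The credentials, host, and paths below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CopyFtpToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path ftpPath = new Path("ftp://user:pwd@xx.xx.xx.xx/input1"); // placeholder URI
        Path hdfsPath = new Path("hdfs:///tmp/input1");

        FileSystem ftpFs = ftpPath.getFileSystem(conf); // resolves to FTPFileSystem
        FileSystem hdfs = hdfsPath.getFileSystem(conf);

        // Single sequential copy; the FTP stream is only read forward, never seeked.
        FileUtil.copy(ftpFs, ftpPath, hdfs, hdfsPath, false, conf);

        // The Cascading source tap can then read the HDFS copy, e.g.:
        // Tap source = new Hfs(new TextLine(new Fields("line")), hdfsPath.toString());
    }
}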
I'm trying to run KMeans on AWS, and I ran into the following exception when trying to read updated cluster centroids from the DistributedCache:
java.io.IOException: The distributed cache object s3://mybucket/centroids_6/part-r-00009 changed during the job from 4/8/13 2:20 PM to 4/8/13 2:20 PM
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.downloadCacheObject(TrackerDistributedCacheManager.java:401)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.localizePublicCacheObject(TrackerDistributedCacheManager.java:475)
at org.apache.hadoop.filecache.TrackerDistributedCacheManager.getLocalCache(TrackerDistributedCacheManager.java:191)
at org.apache.hadoop.filecache.TaskDistributedCacheManager.setupCache(TaskDistributedCacheManager.java:182)
at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1246)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.TaskTracker.initializeJob(TaskTracker.java:1237)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:1152)
at org.apache.hadoop.mapred.TaskTracker$5.run(TaskTracker.java:2541)
at java.lang.Thread.run(Thread.java:662)
What sets this question apart from this one is the fact that this error appears intermittently. I've run the same code successfully on a smaller dataset. Furthermore, when I change the number of centroids from 12 (seen above in the code) to 8, it fails on iteration 5 instead of 6 (which you can see in the centroids_6 name above).
Here's the relevant DistributedCache code in the main driver that runs the KMeans loop:
int iteration = 1;
long changes = 0;
do {
    // First, write the previous iteration's centroids to the dist cache.
    Configuration iterConf = new Configuration();
    Path prevIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration - 1));
    FileSystem fs = prevIter.getFileSystem(iterConf);
    Path pathPattern = new Path(prevIter, "part-*");
    FileStatus [] list = fs.globStatus(pathPattern);
    for (FileStatus status : list) {
        DistributedCache.addCacheFile(status.getPath().toUri(), iterConf);
    }
    // Now, set up the job.
    Job iterJob = new Job(iterConf);
    iterJob.setJobName("KMeans " + iteration);
    iterJob.setJarByClass(KMeansDriver.class);
    Path nextIter = new Path(centroidsPath.getParent(),
            String.format("centroids_%s", iteration));
    KMeansDriver.delete(iterConf, nextIter);
    // Set input/output formats.
    iterJob.setInputFormatClass(SequenceFileInputFormat.class);
    iterJob.setOutputFormatClass(SequenceFileOutputFormat.class);
    // Set Mapper, Reducer, Combiner
    iterJob.setMapperClass(KMeansMapper.class);
    iterJob.setCombinerClass(KMeansCombiner.class);
    iterJob.setReducerClass(KMeansReducer.class);
    // Set MR formats.
    iterJob.setMapOutputKeyClass(IntWritable.class);
    iterJob.setMapOutputValueClass(VectorWritable.class);
    iterJob.setOutputKeyClass(IntWritable.class);
    iterJob.setOutputValueClass(VectorWritable.class);
    // Set input/output paths.
    FileInputFormat.addInputPath(iterJob, data);
    FileOutputFormat.setOutputPath(iterJob, nextIter);
    iterJob.setNumReduceTasks(nReducers);
    if (!iterJob.waitForCompletion(true)) {
        System.err.println("ERROR: Iteration " + iteration + " failed!");
        System.exit(1);
    }
    iteration++;
    changes = iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).getValue();
    iterJob.getCounters().findCounter(KMeansDriver.Counter.CONVERGED).setValue(0);
} while (changes > 0);
How else would the files be modified? The only possibility I can think of is that, at the completion of one iteration, the loop begins again before the centroids from the previous job have finished writing. But as you can see in the code, I invoke the job with waitForCompletion(true), so there shouldn't be any residual parts of the job running when the loop starts over. Any ideas?
This isn't really an answer, but I did realize it was silly to use the DistributedCache in the way I was, as opposed to reading the results from the previous iteration directly from HDFS. I instead wrote this method in the main driver:
public static HashMap<Integer, VectorWritable> readCentroids(Configuration conf, Path path)
        throws IOException {
    HashMap<Integer, VectorWritable> centroids = new HashMap<Integer, VectorWritable>();
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    FileStatus [] list = fs.globStatus(new Path(path, "part-*"));
    for (FileStatus status : list) {
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
        IntWritable key = null;
        VectorWritable value = null;
        try {
            key = (IntWritable)reader.getKeyClass().newInstance();
            value = (VectorWritable)reader.getValueClass().newInstance();
        } catch (InstantiationException e) {
            e.printStackTrace();
        } catch (IllegalAccessException e) {
            e.printStackTrace();
        }
        while (reader.next(key, value)) {
            centroids.put(new Integer(key.get()),
                    new VectorWritable(value.get(), value.getClusterId(), value.getNumInstances()));
        }
        reader.close();
    }
    return centroids;
}
This is invoked in the setup() method of the Mapper and Reducer during each iteration, to read the centroids of the previous iteration.
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    Path centroidsPath = new Path(conf.get(KMeansDriver.CENTROIDS));
    centroids = KMeansDriver.readCentroids(conf, centroidsPath);
}
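For completeness, a hedged sketch (assumed, since it is not shown above) of the driver-side step this setup() relies on: the previous iteration's output path presumably has to be placed in the job configuration under KMeansDriver.CENTROIDS before the job is submitted.
// Inside the iteration loop of the driver, before creating the Job:
Configuration iterConf = new Configuration();
Path prevIter = new Path(centroidsPath.getParent(),
        String.format("centroids_%s", iteration - 1));
iterConf.set(KMeansDriver.CENTROIDS, prevIter.toString()); // read back in setup()
Job iterJob = new Job(iterConf);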
This allowed me to remove the block of code in the loop in my original question which writes the centroids to the DistributedCache. I tested it, and it now works on both large and small datasets.
I still don't know why I was getting the error I posted about (how would something in the read-only DistributedCache be changed? especially when I was changing HDFS paths on every iteration?), but this seems to both work and be a much less hack-y way of reading the centroids.