EDIT: This does not seem to be possible, see https://bugs.openjdk.java.net/browse/JDK-8039910.
I have a helper class that provides a Stream<Path>. This code just wraps Files.walk and sorts the output:
public Stream<Path> getPaths(Path path) {
return Files.walk(path, FOLLOW_LINKS).sorted();
}
Because symlinks are followed, in the case of loops in the filesystem (e.g. a symlink x -> .), the code underlying Files.walk throws an UncheckedIOException wrapping an instance of FileSystemLoopException.
In my code I would like to catch such exceptions and, for example, just log a helpful message. The resulting stream could/should just stop providing entries as soon as this happens.
I tried adding .map(this::catchException) and .peek(this::catchException) to my code, but the exception is not caught at that stage.
Path catchException(Path path) {
try {
logger.info("path.toString() {}", path.toString());
return path;
} catch (UncheckedIOException exception) {
logger.error("YEAH");
return null;
}
}
How, if at all, can I catch the UncheckedIOException inside the code that hands out the Stream<Path>, so that consumers of the stream never encounter this exception?
As an example, the following code should never encounter the exception:
List<Path> paths = getPaths(path).collect(toList());
Right now, the exception is triggered by code invoking collect (and I could catch the exception there):
java.io.UncheckedIOException: java.nio.file.FileSystemLoopException: /tmp/junit5844257414812733938/selfloop
at java.nio.file.FileTreeIterator.fetchNextIfNeeded(FileTreeIterator.java:88)
at java.nio.file.FileTreeIterator.hasNext(FileTreeIterator.java:104)
at java.util.Iterator.forEachRemaining(Iterator.java:115)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at ...
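Catching it there is possible, but it pushes the problem onto every consumer; roughly (a sketch of the workaround I would rather avoid):
try {
    List<Path> paths = getPaths(path).collect(toList());
    // work with paths ...
} catch (UncheckedIOException exception) {
    // exception.getCause() is the FileSystemLoopException
    logger.error("Filesystem loop while walking {}", path, exception.getCause());
}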
EDIT: I provided a simple JUnit test class. In this question I ask you to fix the test by just modifying the code in provideStream.
package somewhere;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import static java.nio.file.FileVisitOption.FOLLOW_LINKS;
import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.nullValue;
import static org.hamcrest.core.IsNot.not;
import static org.junit.Assert.fail;
public class StreamTest {
@Rule
public TemporaryFolder temporaryFolder = new TemporaryFolder();
@Test
public void test() throws Exception {
Path rootPath = Paths.get(temporaryFolder.getRoot().getPath());
createSelfloop();
Stream<Path> stream = provideStream(rootPath);
assertThat(stream.collect(Collectors.toList()), is(not(nullValue())));
}
private Stream<Path> provideStream(Path rootPath) throws IOException {
return Files.walk(rootPath, FOLLOW_LINKS).sorted();
}
private void createSelfloop() throws IOException {
String root = temporaryFolder.getRoot().getPath();
try {
Path symlink = Paths.get(root, "selfloop");
Path target = Paths.get(root);
Files.createSymbolicLink(symlink, target);
} catch (UnsupportedOperationException x) {
// Some file systems do not support symbolic links
fail();
}
}
}
You can make your own walking stream factory:
public class FileTree {
public static Stream<Path> walk(Path p) {
Stream<Path> s=Stream.of(p);
if(Files.isDirectory(p)) try {
DirectoryStream<Path> ds = Files.newDirectoryStream(p);
s=Stream.concat(s, StreamSupport.stream(ds.spliterator(), false)
.flatMap(FileTree::walk)
.onClose(()->{ try { ds.close(); } catch(IOException ex) {} }));
} catch(IOException ex) {}
return s;
}
// in case you don’t want to ignore exceptions silently
public static Stream<Path> walk(Path p, BiConsumer<Path,IOException> handler) {
Stream<Path> s=Stream.of(p);
if(Files.isDirectory(p)) try {
DirectoryStream<Path> ds = Files.newDirectoryStream(p);
s=Stream.concat(s, StreamSupport.stream(ds.spliterator(), false)
.flatMap(sub -> walk(sub, handler))
.onClose(()->{ try { ds.close(); }
catch(IOException ex) { handler.accept(p, ex); } }));
} catch(IOException ex) { handler.accept(p, ex); }
return s;
}
// and with depth limit
public static Stream<Path> walk(
Path p, int maxDepth, BiConsumer<Path,IOException> handler) {
Stream<Path> s=Stream.of(p);
if(maxDepth>0 && Files.isDirectory(p)) try {
DirectoryStream<Path> ds = Files.newDirectoryStream(p);
s=Stream.concat(s, StreamSupport.stream(ds.spliterator(), false)
.flatMap(sub -> walk(sub, maxDepth-1, handler))
.onClose(()->{ try { ds.close(); }
catch(IOException ex) { handler.accept(p, ex); } }));
} catch(IOException ex) { handler.accept(p, ex); }
return s;
}
}
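With that factory in place, the provideStream method from the test above only needs to delegate to it. A minimal sketch using the handler overload (any IOException hit while listing a directory is passed to the handler instead of surfacing later as an UncheckedIOException from collect; the throws IOException clause is no longer needed):
private Stream<Path> provideStream(Path rootPath) {
    // report listing failures instead of letting them escape from the terminal operation
    return FileTree.walk(rootPath, (path, ex) ->
            System.err.println("skipping " + path + ": " + ex))
        .sorted();
}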
Here is a code snippet from my Java main method:
try (MultiFileReader multiReader = new MultiFileReader(inputs)) {
PriorityQueue<WordEntry> words = new PriorityQueue<>();
for (BufferedReader reader : multiReader.getReaders()) {
String word = reader.readLine();
if (word != null) {
words.add(new WordEntry(word, reader));
}
}
}
Here is how I get my BufferedReader readers from another Java file:
public List<BufferedReader> getReaders() {
return Collections.unmodifiableList(readers);
}
But for some reason, when I compile my code, here is what I get:
The error occurs exactly at the line where I wrote String word = reader.readLine();, and what's odd is that reader.readLine() is not null; in fact, multiReader.getReaders() returns a list of 100 objects (they are files read from a directory). I would like some help solving this issue.
I posted where the issue is; now let me provide a broader view of my code. To run it, compile it under the src/ directory with javac *.java and run java MergeShards shards/ sorted.txt, provided that shards/ is present under src/ and contains .txt files, as in my scenario.
This is MergeShards.java where I have my main function:
import java.io.BufferedReader;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Objects;
import java.util.PriorityQueue;
import java.util.stream.Collectors;
public final class MergeShards {
public static void main(String[] args) throws Exception {
if (args.length != 2) {
System.out.println("Usage: MergeShards [input folder] [output file]");
return;
}
List<Path> inputs = Files.walk(Path.of(args[0]), 1).skip(1).collect(Collectors.toList());
Path outputPath = Path.of(args[1]);
try (MultiFileReader multiReader = new MultiFileReader(inputs)) {
PriorityQueue<WordEntry> words = new PriorityQueue<>();
for (BufferedReader reader : multiReader.getReaders()) {
String word = reader.readLine();
if (word != null) {
words.add(new WordEntry(word, reader));
}
}
try (Writer writer = Files.newBufferedWriter(outputPath)) {
while (!words.isEmpty()) {
WordEntry entry = words.poll();
writer.write(entry.word);
writer.write(System.lineSeparator());
String word = entry.reader.readLine();
if (word != null) {
words.add(new WordEntry(word, entry.reader));
}
}
}
}
}
private static final class WordEntry implements Comparable<WordEntry> {
private final String word;
private final BufferedReader reader;
private WordEntry(String word, BufferedReader reader) {
this.word = Objects.requireNonNull(word);
this.reader = Objects.requireNonNull(reader);
}
@Override
public int compareTo(WordEntry other) {
return word.compareTo(other.word);
}
}
}
This is my MultiFileReader.java file:
import java.io.BufferedReader;
import java.io.Closeable;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public final class MultiFileReader implements Closeable {
private final List<BufferedReader> readers;
public MultiFileReader(List<Path> paths) {
readers = new ArrayList<>(paths.size());
try {
for (Path path : paths) {
readers.add(Files.newBufferedReader(path));
}
} catch (IOException e) {
e.printStackTrace();
} finally {
close();
}
}
public List<BufferedReader> getReaders() {
return Collections.unmodifiableList(readers);
}
@Override
public void close() {
for (BufferedReader reader : readers) {
try {
reader.close();
} catch (Exception ignored) {
}
}
}
}
The finally block in your constructor closes all of your readers. Remove that.
public MultiFileReader(List<Path> paths) {
readers = new ArrayList<>(paths.size());
try {
for (Path path : paths) {
readers.add(Files.newBufferedReader(path));
}
} catch (IOException e) {
e.printStackTrace();
} /* Not this. finally {
close();
} */
}
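If you also want the constructor itself not to leak readers when opening one of the later files fails, one option (a sketch, not required for the fix above) is to close whatever was opened so far and rethrow:
public MultiFileReader(List<Path> paths) throws IOException {
    readers = new ArrayList<>(paths.size());
    try {
        for (Path path : paths) {
            readers.add(Files.newBufferedReader(path));
        }
    } catch (IOException e) {
        close(); // release the readers opened so far, then surface the failure
        throw e;
    }
}
Since main already uses try-with-resources around the MultiFileReader and declares throws Exception, the readers are still closed exactly once.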
I need to create a utility that downloads files from a Box folder, but I am unable to get it working:
package com.box.sdk.example;
import com.box.sdk.BoxConfig;
import com.box.sdk.BoxDeveloperEditionAPIConnection;
import com.box.sdk.BoxFile;
import com.box.sdk.BoxFolder;
import com.box.sdk.BoxItem;
import java.io.BufferedReader;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
public class PlayGround {
public static void main(String[] args) {
Path configPath = Paths.get("config.json");
Path currentDir = Paths.get("").toAbsolutePath();
try (BufferedReader reader = Files.newBufferedReader(configPath, Charset.forName("UTF-8"))) {
BoxConfig boxConfig = BoxConfig.readFrom(reader);
BoxDeveloperEditionAPIConnection client = BoxDeveloperEditionAPIConnection.getAppEnterpriseConnection(boxConfig);
String folderId = "125601757844";
BoxFolder folder = new BoxFolder(client, folderId);
String folderName = folder.getInfo().getName();
Path localFolderPath = currentDir.resolve(Paths.get(folderName));
if (!Files.exists(localFolderPath)) {
localFolderPath = Files.createDirectory(localFolderPath);
} else {
localFolderPath = resetLocalFolder(localFolderPath);
}
for (BoxItem.Info itemInfo : folder) {
if (itemInfo instanceof BoxFile.Info) {
BoxFile.Info fileInfo = (BoxFile.Info) itemInfo;
BoxFile file = new BoxFile(client, fileInfo.getID());
String localFilePath = localFolderPath.resolve(Paths.get(fileInfo.getName())).toAbsolutePath()
.toString();
FileOutputStream stream = new FileOutputStream(localFilePath);
file.download(stream);
stream.close();
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
static Path resetLocalFolder(Path localFolderPath) throws IOException {
Files.list(localFolderPath).forEach(file -> {
System.out.println(file.getFileName());
try {
Files.delete(file.toAbsolutePath());
} catch (IOException e) {
}
});
Files.delete(localFolderPath);
localFolderPath = Files.createDirectory(localFolderPath);
return localFolderPath;
}
}
When I run this code, I get the following exception:
Exception in thread "main" com.box.sdk.BoxAPIResponseException: The API returned an error code [404 | lsgp4zgkfg6qipxg.0d094ed7daa5f78921603840e0fa470e1] not_found - Not Found
at com.box.sdk.BoxAPIResponse.<init>(BoxAPIResponse.java:92)
at com.box.sdk.BoxJSONResponse.<init>(BoxJSONResponse.java:32)
at com.box.sdk.BoxAPIRequest.trySend(BoxAPIRequest.java:680)
at com.box.sdk.BoxAPIRequest.send(BoxAPIRequest.java:382)
at com.box.sdk.BoxAPIRequest.send(BoxAPIRequest.java:349)
at com.box.sdk.BoxFolder.getInfo(BoxFolder.java:289)
at com.box.sdk.example.PlayGround.main(PlayGround.java:26)
Note: I am able to run the above code by using a developer token, which lasts for one hour, but I can't build my production application on such volatile code:
BoxAPIConnection client = new BoxAPIConnection("3i8b5sPnxUotd5etDuUkzGjXXzBphty9");
Please take a look at the code I have so far and if possible explain what I'm doing wrong. I'm trying to learn.
I made a little program to search for a type of file in a directory and all its sub-directories and copy them into another folder.
Code
import java.util.ArrayList;
import java.util.List;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
public class FandFandLoop {
public static void main(String[] args) {
final File folder = new File("C:/Users/ina/src");
List<String> result = new ArrayList<>();
search(".*\\.txt", folder, result);
File to = new File("C:/Users/ina/dest");
for (String s : result) {
System.out.println(s);
File from = new File(s);
try {
copyDir(from.toPath(), to.toPath());
System.out.println("done");
}
catch (IOException ex) {
ex.printStackTrace();
}
}
}
public static void copyDir(Path src, Path dest) throws IOException {
Files.walk(src)
.forEach(source -> {
try {
Files.copy(source, dest.resolve(src.relativize(source)),
StandardCopyOption.REPLACE_EXISTING);
} catch (IOException e) {
e.printStackTrace();
}
});
}
public static void search(final String pattern, final File folder, List<String> result) {
for (final File f : folder.listFiles()) {
if (f.isDirectory()) {
search(pattern, f, result);
}
if (f.isFile()) {
if (f.getName().matches(pattern)) {
result.add(f.getAbsolutePath());
}
}
}
}
}
It works, but what it actually does is take my .txt files and write them into a single file named dest, without an extension.
And at some point, it deletes the dest folder.
The deletion happens because of StandardCopyOption.REPLACE_EXISTING, if I understand it correctly, but what I wanted was that if several files have the same name, only one copy of them is kept.
There is no need to call Files.walk on the matched source files: each entry in result is already a single regular file, so Files.walk(src) streams just that file, src.relativize(source) is an empty path, and Files.copy writes the file's contents onto the dest path itself (with REPLACE_EXISTING even replacing the empty dest directory). That is why dest ends up as an extensionless file and the folder disappears.
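The minimal fix (a sketch against the code above; copyFile is a new helper name) is to copy each matched file directly into the destination directory under its own name, calling copyFile(from.toPath(), to.toPath()) instead of copyDir(...):
public static void copyFile(Path src, Path destDir) throws IOException {
    // copy one regular file into the destination directory, keeping its file name;
    // with REPLACE_EXISTING, a later file with the same name overwrites the earlier copy,
    // so only one copy per name is kept
    Files.copy(src, destDir.resolve(src.getFileName().toString()),
            StandardCopyOption.REPLACE_EXISTING);
}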
You can improve this code further by switching completely to java.nio.file.Path and not mixing string paths with File objects. Additionally, instead of calling File.listFiles() recursively, you can use Files.walk or, even better, Files.find.
So you could instead use the following:
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.CopyOption;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Objects;
import java.util.function.BiPredicate;
import java.util.stream.Stream;
public class CopyFiles {
public static void copyFiles(Path src, Path dest, PathMatcher matcher, CopyOption... copyOptions) throws IOException {
// Argument validation
if (!Files.isDirectory(src)) {
throw new IllegalArgumentException("Source '" + src + "' is not a directory");
}
if (!Files.isDirectory(dest)) {
throw new IllegalArgumentException("Destination '" + dest + "' is not a directory");
}
Objects.requireNonNull(matcher);
Objects.requireNonNull(copyOptions);
BiPredicate<Path, BasicFileAttributes> filter = (path, attributes) -> attributes.isRegularFile() && matcher.matches(path);
// Use try-with-resources to close stream as soon as it is not longer needed
try (Stream<Path> files = Files.find(src, Integer.MAX_VALUE, filter)) {
files.forEach(file -> {
Path destFile = dest.resolve(src.relativize(file));
try {
copyFile(file, destFile, copyOptions);
}
// Stream methods do not allow checked exceptions, have to wrap it
catch (IOException ioException) {
throw new UncheckedIOException(ioException);
}
});
}
// Wrap UncheckedIOException; cannot unwrap it to get actual IOException
// because then information about the location where the exception was wrapped
// will get lost, see Files.find doc
catch (UncheckedIOException uncheckedIoException) {
throw new IOException(uncheckedIoException);
}
}
private static void copyFile(Path srcFile, Path destFile, CopyOption... copyOptions) throws IOException {
Path destParent = destFile.getParent();
// Parent might be null if dest is empty path
if (destParent != null) {
// Create parent directories before copying file
Files.createDirectories(destParent);
}
Files.copy(srcFile, destFile, copyOptions);
}
public static void main(String[] args) throws IOException {
Path srcDir = Paths.get("path/to/src");
Path destDir = Paths.get("path/to/dest");
// Could also use FileSystem.getPathMatcher
PathMatcher matcher = file -> file.getFileName().toString().endsWith(".txt");
copyFiles(srcDir, destDir, matcher);
}
}
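As the comment in main hints, the matcher can also come from the file system's glob support instead of a hand-written lambda; for example (this variant needs an extra import of java.nio.file.FileSystems):
// matches any path whose string form ends in ".txt", at any depth
PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:**.txt");
copyFiles(srcDir, destDir, matcher);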
Unable to use StreamingFileSink and store incoming events in compressed fashion.
I am trying to use StreamingFileSink to write an unbounded event stream to S3. In the process, I would like to compress the data to make better use of the available storage.
I wrote a compressed string writer by borrowing some code from Flink's SequenceFileWriterFactory. It fails with the exception described below.
If I use BucketingSink instead, it works great.
Using BucketingSink, I approached the compressed string write as below. Again, I borrowed this code from another pull request.
import org.apache.flink.streaming.connectors.fs.StreamWriterBase;
import org.apache.flink.streaming.connectors.fs.Writer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import java.io.IOException;
public class CompressionStringWriter<T> extends StreamWriterBase<T> implements Writer<T> {
private static final long serialVersionUID = 3231207311080446279L;
private String codecName;
private String separator;
public String getCodecName() {
return codecName;
}
public String getSeparator() {
return separator;
}
private transient CompressionOutputStream compressedOutputStream;
public CompressionStringWriter(String codecName, String separator) {
this.codecName = codecName;
this.separator = separator;
}
public CompressionStringWriter(String codecName) {
this(codecName, System.lineSeparator());
}
protected CompressionStringWriter(CompressionStringWriter<T> other) {
super(other);
this.codecName = other.codecName;
this.separator = other.separator;
}
@Override
public void open(FileSystem fs, Path path) throws IOException {
super.open(fs, path);
Configuration conf = fs.getConf();
CompressionCodecFactory codecFactory = new CompressionCodecFactory(conf);
CompressionCodec codec = codecFactory.getCodecByName(codecName);
if (codec == null) {
throw new RuntimeException("Codec " + codecName + " not found");
}
Compressor compressor = CodecPool.getCompressor(codec, conf);
compressedOutputStream = codec.createOutputStream(getStream(), compressor);
}
@Override
public void close() throws IOException {
if (compressedOutputStream != null) {
compressedOutputStream.close();
compressedOutputStream = null;
} else {
super.close();
}
}
@Override
public void write(Object element) throws IOException {
getStream();
compressedOutputStream.write(element.toString().getBytes());
compressedOutputStream.write(this.separator.getBytes());
}
@Override
public CompressionStringWriter<T> duplicate() {
return new CompressionStringWriter<>(this);
}
}
BucketingSink<DeviceEvent> bucketingSink = new BucketingSink<>("s3://"+ this.bucketName + "/" + this.objectPrefix);
bucketingSink
.setBucketer(new OrgIdBasedBucketAssigner())
.setWriter(new CompressionStringWriter<DeviceEvent>("Gzip", "\n"))
.setPartPrefix("file-")
.setPartSuffix(".gz")
.setBatchSize(1_500_000);
The one with BucketingSink works.
My attempt using StreamingFileSink involves the following code.
import org.apache.flink.api.common.serialization.BulkWriter;
import java.io.IOException;
public class CompressedStringBulkWriter<T> implements BulkWriter<T> {
private final CompressedStringWriter compressedStringWriter;
public CompressedStringBulkWriter(final CompressedStringWriter compressedStringWriter) {
this.compressedStringWriter = compressedStringWriter;
}
@Override
public void addElement(T element) throws IOException {
this.compressedStringWriter.write(element);
}
@Override
public void flush() throws IOException {
this.compressedStringWriter.flush();
}
@Override
public void finish() throws IOException {
this.compressedStringWriter.close();
}
}
import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.hadoop.conf.Configuration;
import java.io.IOException;
public class CompressedStringBulkWriterFactory<T> implements BulkWriter.Factory<T> {
private SerializableHadoopConfiguration serializableHadoopConfiguration;
public CompressedStringBulkWriterFactory(final Configuration hadoopConfiguration) {
this.serializableHadoopConfiguration = new SerializableHadoopConfiguration(hadoopConfiguration);
}
@Override
public BulkWriter<T> create(FSDataOutputStream out) throws IOException {
return new CompressedStringBulkWriter(new CompressedStringWriter(out, serializableHadoopConfiguration.get(), "Gzip", "\n"));
}
}
import org.apache.flink.core.fs.FSDataOutputStream;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
import org.apache.flink.util.Preconditions;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CodecPool;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.Compressor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.io.IOException;
import java.io.Serializable;
public class CompressedStringWriter<T> implements Serializable {
private static final Logger LOG = LoggerFactory.getLogger(CompressedStringWriter.class);
private static final long serialVersionUID = 2115292142239557448L;
private String separator;
private transient CompressionOutputStream compressedOutputStream;
public CompressedStringWriter(FSDataOutputStream out, Configuration hadoopConfiguration, String codecName, String separator) {
this.separator = separator;
try {
Preconditions.checkNotNull(hadoopConfiguration, "Unable to determine hadoop configuration using path");
CompressionCodecFactory codecFactory = new CompressionCodecFactory(hadoopConfiguration);
CompressionCodec codec = codecFactory.getCodecByName(codecName);
Preconditions.checkNotNull(codec, "Codec " + codecName + " not found");
LOG.info("The codec name that was loaded from hadoop {}", codec);
Compressor compressor = CodecPool.getCompressor(codec, hadoopConfiguration);
this.compressedOutputStream = codec.createOutputStream(out, compressor);
LOG.info("Setup a compressor for codec {} and compressor {}", codec, compressor);
} catch (IOException ex) {
throw new RuntimeException("Unable to compose a hadoop compressor for the path", ex);
}
}
public void flush() throws IOException {
if (compressedOutputStream != null) {
compressedOutputStream.flush();
}
}
public void close() throws IOException {
if (compressedOutputStream != null) {
compressedOutputStream.close();
compressedOutputStream = null;
}
}
public void write(T element) throws IOException {
compressedOutputStream.write(element.toString().getBytes());
compressedOutputStream.write(this.separator.getBytes());
}
}
import org.apache.hadoop.conf.Configuration;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
public class SerializableHadoopConfiguration implements Serializable {
private static final long serialVersionUID = -1960900291123078166L;
private transient Configuration hadoopConfig;
SerializableHadoopConfiguration(Configuration hadoopConfig) {
this.hadoopConfig = hadoopConfig;
}
Configuration get() {
return this.hadoopConfig;
}
// --------------------
private void writeObject(ObjectOutputStream out) throws IOException {
this.hadoopConfig.write(out);
}
private void readObject(ObjectInputStream in) throws IOException {
final Configuration config = new Configuration();
config.readFields(in);
if (this.hadoopConfig == null) {
this.hadoopConfig = config;
}
}
}
My actual Flink job:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Properties kinesisConsumerConfig = new Properties();
...
...
DataStream<DeviceEvent> kinesis =
env.addSource(new FlinkKinesisConsumer<>(this.streamName, new DeviceEventSchema(), kinesisConsumerConfig)).name("source")
.setParallelism(16)
.setMaxParallelism(24);
final StreamingFileSink<DeviceEvent> bulkCompressStreamingFileSink = StreamingFileSink.<DeviceEvent>forBulkFormat(
path,
new CompressedStringBulkWriterFactory<>(
BucketingSink.createHadoopFileSystem(
new Path("s3a://"+ this.bucketName + "/" + this.objectPrefix),
null).getConf()))
.withBucketAssigner(new OrgIdBucketAssigner())
.build();
deviceEventDataStream.addSink(bulkCompressStreamingFileSink).name("bulkCompressStreamingFileSink").setParallelism(16);
env.execute();
I expect data to be saved in S3 as multiple files. Unfortunately, no files are being created.
In the logs, I see the exception below:
2019-05-15 22:17:20,855 INFO org.apache.flink.runtime.taskmanager.Task - Sink: bulkCompressStreamingFileSink (11/16) (c73684c10bb799a6e0217b6795571e22) switched from RUNNING to FAILED.
java.lang.Exception: Could not perform checkpoint 1 for operator Sink: bulkCompressStreamingFileSink (11/16).
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:595)
at org.apache.flink.streaming.runtime.io.BarrierBuffer.notifyCheckpoint(BarrierBuffer.java:396)
at org.apache.flink.streaming.runtime.io.BarrierBuffer.processBarrier(BarrierBuffer.java:292)
at org.apache.flink.streaming.runtime.io.BarrierBuffer.getNextNonBlocked(BarrierBuffer.java:200)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:209)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not complete snapshot 1 for operator Sink: bulkCompressStreamingFileSink (11/16).
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:422)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.checkpointStreamOperator(StreamTask.java:1113)
at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1055)
at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:729)
at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:641)
at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:586)
... 8 more
Caused by: java.io.IOException: Stream closed.
at org.apache.flink.fs.s3.common.utils.RefCountedFile.requireOpened(RefCountedFile.java:117)
at org.apache.flink.fs.s3.common.utils.RefCountedFile.write(RefCountedFile.java:74)
at org.apache.flink.fs.s3.common.utils.RefCountedBufferingFileStream.flush(RefCountedBufferingFileStream.java:105)
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.closeAndUploadPart(S3RecoverableFsDataOutputStream.java:199)
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.closeForCommit(S3RecoverableFsDataOutputStream.java:166)
at org.apache.flink.streaming.api.functions.sink.filesystem.PartFileWriter.closeForCommit(PartFileWriter.java:71)
at org.apache.flink.streaming.api.functions.sink.filesystem.BulkPartWriter.closeForCommit(BulkPartWriter.java:63)
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.closePartFile(Bucket.java:239)
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.prepareBucketForCheckpointing(Bucket.java:280)
at org.apache.flink.streaming.api.functions.sink.filesystem.Bucket.onReceptionOfCheckpoint(Bucket.java:253)
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotActiveBuckets(Buckets.java:244)
at org.apache.flink.streaming.api.functions.sink.filesystem.Buckets.snapshotState(Buckets.java:235)
at org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink.snapshotState(StreamingFileSink.java:347)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.trySnapshotFunctionState(StreamingFunctionUtils.java:118)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.snapshotFunctionState(StreamingFunctionUtils.java:99)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.snapshotState(AbstractUdfStreamOperator.java:90)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:395)
So I am wondering what I am missing.
I am using the latest AWS EMR release (5.23).
In CompressedStringBulkWriter#finish() you are calling close() on the CompressionOutputStream (via CompressedStringWriter#close()), which also closes the underlying stream, i.e. Flink's FSDataOutputStream. That stream has to remain open so that Flink's internals can perform checkpointing properly and guarantee a recoverable stream. That is why you are getting:
Caused by: java.io.IOException: Stream closed.
at org.apache.flink.fs.s3.common.utils.RefCountedFile.requireOpened(RefCountedFile.java:117)
at org.apache.flink.fs.s3.common.utils.RefCountedFile.write(RefCountedFile.java:74)
at org.apache.flink.fs.s3.common.utils.RefCountedBufferingFileStream.flush(RefCountedBufferingFileStream.java:105)
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.closeAndUploadPart(S3RecoverableFsDataOutputStream.java:199)
at org.apache.flink.fs.s3.common.writer.S3RecoverableFsDataOutputStream.closeForCommit(S3RecoverableFsDataOutputStream.java:166)
So instead of compressedOutputStream.close(), use compressedOutputStream.finish(), which just flushes everything in the buffer to the output stream without closing it. By the way, there is a built-in HadoopCompressionBulkWriter available in the latest version of Flink; you can also use that.
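Concretely, the change stays inside the writer classes shown above; a sketch, assuming everything else is unchanged:
// CompressedStringWriter: finish the compression stream but leave Flink's
// recoverable FSDataOutputStream open so the sink can commit it at checkpoint time
public void finish() throws IOException {
    if (compressedOutputStream != null) {
        compressedOutputStream.finish();
        compressedOutputStream.flush();
    }
}
// CompressedStringBulkWriter
@Override
public void finish() throws IOException {
    this.compressedStringWriter.finish();
}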
I have a recursive watch service that I'm using to monitor directories while the application is running. For an unknown reason, the WatchService appears to stop working after about a day. At that point I can add a new file to a monitored directory, but I get no log statements and my observers are not notified.
I thought Spring might be destroying the bean, so I added a log statement to the @PreDestroy method of the class, but that log statement doesn't show up after the WatchService stops working, so it seems the bean still exists; it's just not functioning as expected. The class is as follows:
import com.sun.nio.file.SensitivityWatchEventModifier;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import java.io.File;
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;
import javax.annotation.PostConstruct;
import javax.annotation.PreDestroy;
import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;
import static java.nio.file.StandardWatchEventKinds.ENTRY_DELETE;
@Service
public class DirectoryMonitor {
private static final Logger logger = LoggerFactory.getLogger(DirectoryMonitor.class);
private WatchService watcher;
private ExecutorService executor;
private List<DirectoryMonitorObserver> observerList = new ArrayList<>();
private final Map<WatchKey, Path> keys = new HashMap<>();
public void addObserver(DirectoryMonitorObserver observer){
observerList.add(observer);
}
private void notifyObservers(){
observerList.forEach(DirectoryMonitorObserver::directoryModified);
}
@PostConstruct
public void init() throws IOException {
watcher = FileSystems.getDefault().newWatchService();
executor = Executors.newSingleThreadExecutor();
}
@PreDestroy
public void cleanup() {
try {
logger.info("Stopping directory monitor");
watcher.close();
} catch (IOException e) {
logger.error("Error closing watcher service", e);
}
executor.shutdown();
}
@SuppressWarnings("unchecked")
public void startRecursiveWatcher(String pathToMonitor) {
logger.info("Starting Recursive Watcher");
Consumer<Path> register = p -> {
if (!p.toFile().exists() || !p.toFile().isDirectory())
throw new RuntimeException("folder " + p + " does not exist or is not a directory");
try {
Files.walkFileTree(p, new SimpleFileVisitor<Path>() {
@Override
public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
logger.info("registering " + dir + " in watcher service");
WatchKey watchKey = dir.register(watcher, new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE}, SensitivityWatchEventModifier.HIGH);
keys.put(watchKey, dir);
return FileVisitResult.CONTINUE;
}
});
} catch (IOException e) {
throw new RuntimeException("Error registering path " + p);
}
};
register.accept(Paths.get(pathToMonitor));
executor.submit(() -> {
while (true) {
final WatchKey key;
try {
key = watcher.take();
} catch (InterruptedException ex) {
logger.error(ex.toString());
continue;
}
final Path dir = keys.get(key);
key.pollEvents().stream()
.map(e -> ((WatchEvent<Path>) e).context())
.forEach(p -> {
final Path absPath = dir.resolve(p);
if (absPath.toFile().isDirectory()) {
register.accept(absPath);
} else {
final File f = absPath.toFile();
logger.info("Detected new file " + f.getAbsolutePath());
}
});
notifyObservers();
key.reset();
}
});
}
}
This is where I'm creating the monitor bean:
@Component
public class MovieInfoFacade {
@Value("${media.path}")
private String mediaPath;
private MovieInfoControl movieInfoControl;
private DirectoryMonitor directoryMonitor;
private FileListProvider fileListProvider;
@Autowired
public MovieInfoFacade(MovieInfoControl movieInfoControl, DirectoryMonitor directoryMonitor, FileListProvider fileListProvider){
this.movieInfoControl = movieInfoControl;
this.directoryMonitor = directoryMonitor;
this.fileListProvider = fileListProvider;
}
@PostConstruct
public void startDirectoryMonitor(){
if(!mediaPath.equalsIgnoreCase("none")) {
directoryMonitor.addObserver(fileListProvider);
directoryMonitor.startRecursiveWatcher(mediaPath);
}
}
public int loadMovieListLength(String directoryPath){
return fileListProvider.listFiles(directoryPath).length;
}
public List<MovieInfo> loadMovieList(MovieSearchCriteria searchCriteria) {
List<File> files = Arrays.asList(fileListProvider.listFiles(searchCriteria.getPath()));
return files.parallelStream()
.sorted()
.skip(searchCriteria.getPage() * searchCriteria.getItemsPerPage())
.limit(searchCriteria.getItemsPerPage())
.map(file -> movieInfoControl.loadMovieInfoFromCache(file.getAbsolutePath()))
.collect(Collectors.toList());
}
public MovieInfo loadSingleMovie(String filePath) {
return movieInfoControl.loadMovieInfoFromCache(filePath);
}
}
It appears that the error was in my exception handling. After removing the throw statements (and replacing them with logs) I have not had any issues.
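For reference, the register consumer now looks roughly like this (a sketch of the current state; the rest of the class is unchanged):
Consumer<Path> register = p -> {
    if (!p.toFile().exists() || !p.toFile().isDirectory()) {
        // previously a RuntimeException was thrown here; an uncaught exception inside the
        // submitted task ends the watch loop silently, so log and skip instead
        logger.warn("folder {} does not exist or is not a directory", p);
        return;
    }
    try {
        Files.walkFileTree(p, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                logger.info("registering " + dir + " in watcher service");
                WatchKey watchKey = dir.register(watcher,
                        new WatchEvent.Kind[]{ENTRY_CREATE, ENTRY_DELETE},
                        SensitivityWatchEventModifier.HIGH);
                keys.put(watchKey, dir);
                return FileVisitResult.CONTINUE;
            }
        });
    } catch (IOException e) {
        // same idea: log the failure instead of rethrowing it
        logger.error("Error registering path " + p, e);
    }
};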