I am a newbie to Flink and facing some challenges solving the use case below.
Use Case description:
Every day I will receive a CSV file with a timestamp in its name in some folder, say input. The file name format is file_name_dd-mm-yy-hh-mm-ss.csv.
My Flink pipeline will read this CSV file row by row and write the rows to my Kafka topic.
Immediately after the data has been read, the file needs to be moved to another folder, a historic folder.
Why I need this: suppose the Ververica server stops, either abruptly or manually. If all the processed files are still lying in the same location, then after the restart Flink will re-read all the files it had already processed. To prevent this scenario, already-read files need to be moved to another location immediately.
I googled a lot but did not find anything, so can you guide me on how to achieve this?
Let me know if anything else is required.
Out of the box, Flink provides the facility to monitor a directory for new files and read them, via StreamExecutionEnvironment.getExecutionEnvironment().readFile(...) (see similar Stack Overflow threads for examples: How to read newly added file in a directory in Flink, Monitoring directory for new files with Flink for data streams, etc.).
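For reference, a minimal sketch of that built-in monitoring wired to a Kafka sink could look like the following; the path, broker address and topic name are made up, and readFile with PROCESS_CONTINUOUSLY simply re-scans the directory at the given interval:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class CsvToKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // monitor the input folder and emit every line of every new file
        String inputDir = "/data/input"; // assumed path
        TextInputFormat format = new TextInputFormat(new Path(inputDir));

        DataStream<String> lines = env.readFile(
                format,
                inputDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                10_000L); // re-scan the directory every 10 seconds

        // forward each CSV row to Kafka (broker and topic are placeholders)
        lines.addSink(new FlinkKafkaProducer<>("localhost:9092", "csv-rows", new SimpleStringSchema()));

        env.execute("csv-to-kafka");
    }
}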
Looking into the source code of readFile, it calls the createFileInput() method, which simply instantiates ContinuousFileMonitoringFunction and ContinuousFileReaderOperatorFactory and configures the source:
addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is actually the place where most of the logic happens.
So, if I were to implement your requirement, I would extend the functionality of ContinuousFileMonitoringFunction with my own logic for moving processed files into the history folder and construct the source from this function.
Given that the run method performs the read and forwarding inside the checkpointLock:
synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}
I would say it is safe, on checkpoint completion, to move to the historic folder those files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits while collecting splits.
That said, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and in notifyCheckpointComplete move the already processed files to the historic folder:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
    ...

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }

        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            // handle the new files
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    if (shouldIgnore(filePath, modificationTime)) {
                        // shouldIgnore == true means the split has already been forwarded,
                        // so the file is eligible for archiving
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
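The // do move logic placeholder could be filled in with something along these lines. This is only a sketch: historicPath is an assumed extra constructor argument pointing at the archive directory, and it uses Flink's FileSystem.rename to move the files:

private void moveToHistoric(FileSystem fileSystem, Map<Path, FileStatus> eligibleFiles) throws IOException {
    // historicPath is an assumed field holding the archive directory
    for (Path filePath : eligibleFiles.keySet()) {
        Path target = new Path(historicPath, filePath.getName());
        if (!fileSystem.rename(filePath, target)) {
            LOG.warn("Could not move {} to {}", filePath, target);
        }
    }
}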
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, monitoringMode, getParallelism(), interval);

ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);

final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;

env.addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.
You can ask about this on the Flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.
Related
Let's say you have a task to read all the files that are saved in some folder and process every single one. For simplicity's sake, let's say all the files are HTML files and you want to extract their HTML content.
In Java 8 there is the Files.walk API that allows us to do something like that. Here is an example:
try (Stream<Path> paths = Files.walk(Paths.get("/home/you/Desktop"))) {
    paths
        .filter(Files::isRegularFile)
        .forEach(System.out::println);
}
This sounds really good if you have to process a small number of folders and files, but if you have millions of files distributed across several network drives, then this process will take ages and obviously needs to be parallelised. Any ideas how to do the parallelism in this case?
I don't think there is a simple general algorithm to solve your problem.
In fact, the general idea when dealing with a large amount of data distributed over many nodes is to let each node collect its own data and then process those partial results on a single node.
Doing all the scanning from a single system is going to be hard.
To do any real optimization, you cannot treat all the folders the same way.
What you could do is create a Collection of Paths that can be scanned in parallel.
So instead of walking along a single root node, you could start several walks along several folders (possibly one for each network drive).
For this to work you need to know which path is a network path and which is a local one.
If, for example, you have a folder where each child folder is a mounted network drive, you could easily collect all those folders and then run your walk in parallel for each.
I would do something similar to the following code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class ParallelWalks {

    ExecutorService executor = Executors.newCachedThreadPool();
    ExecutorService singleThreadExecutor = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        new ParallelWalks().exec();
    }

    public ExecutorService executorSelector(Path path) {
        if (isNetworkDrive(path)) {
            return executor;
        } else {
            return singleThreadExecutor;
        }
    }

    private boolean isNetworkDrive(Path path) {
        // Here goes the logic to choose which path should go on a different
        // thread.
        return path.toString().contains("srv");
    }

    private void exec() {
        Path path = Paths.get("/home/you/Desktop");
        try (Stream<Path> files = Files.list(path)) {
            files.forEach(this::taskRunner);
        } catch (IOException e) {
            // Do something with the exception
        }
    }

    private void taskRunner(final Path path) {
        executorSelector(path).submit(() -> doWalk(path));
    }

    private void doWalk(Path path) {
        try (Stream<Path> paths = Files.walk(path)) {
            paths.filter(Files::isRegularFile).forEach(System.out::println);
        } catch (IOException e) {
            // Do something with the exception
        }
    }
}
This way all your local dirs will be processed sequentially, and each network drive will be processed on its own thread.
It would work only if all (or most of) your network drives share the same mount point parent.
Otherwise you should implement your own walk, along the lines of the sketch below.
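A possible shape for such a custom walk, added to the ParallelWalks class above so it can reuse executor, isNetworkDrive and doWalk, is a Files.walkFileTree visitor that hands network-mounted subtrees to the pool and skips descending into them inline. A rough sketch:

private void customWalk(Path root) throws IOException {
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
            if (!dir.equals(root) && isNetworkDrive(dir)) {
                executor.submit(() -> doWalk(dir)); // walk this subtree on a pool thread
                return FileVisitResult.SKIP_SUBTREE; // and don't descend into it here
            }
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
            System.out.println(file); // process local files inline
            return FileVisitResult.CONTINUE;
        }
    });
}

(This needs the extra imports java.nio.file.FileVisitResult, java.nio.file.SimpleFileVisitor and java.nio.file.attribute.BasicFileAttributes.)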
How do I delete a file after serving it over HTTP?
Files.TemporaryFile file = null;
try {
    file = new Files.TemporaryFile(f);
    return ok().sendFile(file.file());
} catch (IllegalArgumentException e) {
    return badRequest(Json.newObject().put("message", e.getMessage()));
} finally {
    file.clean();
}
With this code, the file gets deleted before it is served; I receive an empty file on the client.
Play Framework 2.8 should support an onClose argument in the sendFile method in Java as well (so far it seems to be supported only in the Scala version).
In older versions (I have tried only 2.7.x) you may apply the same approach as in the fix for 2.8, so:
public Result doSomething() {
    final File fileToReturn = ....;
    final Source<ByteString, CompletionStage<IOResult>> source = FileIO.fromFile(fileToReturn);
    return Results.ok().streamed(
            wrap(source, () -> fileToReturn.delete()),
            Optional.of(fileToReturn.length()),
            Optional.of("content type, e.g. application/zip"));
}

private Source<ByteString, CompletionStage<IOResult>> wrap(final Source<ByteString, CompletionStage<IOResult>> source, final Runnable handler) {
    return source.mapMaterializedValue(
            action -> action.whenCompleteAsync((ioResult, exception) -> handler.run())
    );
}
From reading the JavaFileUpload documentation for 2.6.x, it sounds like you don't need that finally block to clean up the file afterwards. Since you are using a TemporaryFile, garbage collection should take care of deleting the resource:
...the idea behind TemporaryFile is that it’s only in scope at completion and should be moved out of the temporary file system as soon as possible. Any temporary files that are not moved are deleted [by the garbage collector].
The same section goes on to describe that there is the potential that the file will not get garbage collected, causing denial-of-service issues. If you find that the files are not getting removed, then you can use the TemporaryFilesReaper:
However, under certain conditions, garbage collection does not occur in a timely fashion. As such, there’s also a play.api.libs.Files.TemporaryFileReaper that can be enabled to delete temporary files on a scheduled basis using the Akka scheduler, distinct from the garbage collection method.
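If you go that route, the reaper is switched on in application.conf with settings roughly like the following (the values are only examples; check the Play documentation for your version):

play.temporaryFile {
  reaper {
    enabled = true
    initialDelay = "5 minutes"
    interval = "30 seconds"
    olderThan = "30 minutes"
  }
}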
I'm not suggesting converting the whole project, but you could use Scala for just this controller; then you can use the onClose parameter of the sendFile method. The only caveat is that this parameter does not work in all versions: it looks like in 2.5 there was an issue where it was not triggered (https://github.com/playframework/playframework/issues/6351).
Another way: you can use Akka Streams, as described here: https://www.playframework.com/documentation/2.6.x/JavaStream#Chunked-responses.
I have a requirement to implement a watch service on a folder. Using Java 7's WatchService is a straightforward approach; I have done it successfully, and I am able to capture events whenever a file is created/updated/deleted in the folder I am watching. The problem is that this does not apply to the contents of sub folders, and that is clearly stated in the documentation.

My requirement is to watch the contents of the sub folders as well. This is not possible with the above approach unless I manually loop over all the sub folders and listen to each and every one of them, which I think could lead to a memory leak if not programmed well. Hence I went with what Spring suggests in the newer release, explained here. This is the clearest approach I have seen for a WatchService, but the problem is that it only listens to ENTRY_CREATE events, i.e. only the events where a file is created (and this can be at any level). It does not work when I change or delete a file. How should I proceed in this case?
public static void watchFolderTree(String pathStr) throws Exception {
    long waitTime = 10000;
    WatchServiceDirectoryScanner scanner = new WatchServiceDirectoryScanner(pathStr);
    scanner.start();
    List<File> changedFiles = null;
    while (true) {
        changedFiles = scanner.listFiles(new File(pathStr));
        if (changedFiles.size() > 0) {
            System.out.println("There is a file ");
        }
        Thread.sleep(waitTime);
    }
}
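For comparison, the manual "loop over all the sub folders" approach mentioned above is essentially what the JDK WatchDir sample does: register every directory in the tree (and re-register any directory that later shows up via an ENTRY_CREATE event). A bare-bones sketch:

static void registerTree(WatchService watchService, Path root) throws IOException {
    // Registers the whole directory tree for create/modify/delete events.
    // Newly created sub directories still have to be registered when their
    // ENTRY_CREATE event arrives.
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
            dir.register(watchService,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY,
                    StandardWatchEventKinds.ENTRY_DELETE);
            return FileVisitResult.CONTINUE;
        }
    });
}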
References:
Monitor subfolders with a Java watch service
JAVA 7 watch service
Background:
I have a requirement that messages displayed to the user must vary both by language and by company division. Thus, I can't use out of the box resource bundles, so I'm essentially writing my own version of resource bundles using PropertiesConfiguration files.
In addition, I have a requirement that messages must be modifiable dynamically in production w/o doing restarts.
I'm loading up three different iterations of property files:
- basename_division.properties
- basename_2CharLanguageCode.properties
- basename.properties
These files exist in the classpath. This code is going into a tag library to be used by multiple portlets in a Portal.
I construct the possible .properties files, and then try to load each of them via the following:
PropertiesConfiguration configurationProperties;
try {
    configurationProperties = new PropertiesConfiguration(propertyFileName);
    configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
} catch (ConfigurationException e) {
    /* This is ok -- it just means that the specific configuration file doesn't
       exist right now, which will often be true. */
    return null;
}
If it did successfully locate a file, it saves the created PropertiesConfiguration into a HashMap for reuse, and then tries to find the key. (Unlike regular resource bundles, if it doesn't find the key, it then tries the more general file to see if the key exists there -- so that only overriding exceptions need to be put into language/division-specific property files. A sketch of that lookup is below.)
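For illustration, the lookup with fallback could look roughly like this; loadOrGetCached is a hypothetical helper that returns the cached (or newly loaded) PropertiesConfiguration, or null if the file could not be loaded:

private String getMessage(String key, String division, String language) {
    // walk from the most specific candidate file to the most general one
    String[] candidates = {
            "basename_" + division + ".properties",
            "basename_" + language + ".properties",
            "basename.properties"
    };
    for (String fileName : candidates) {
        PropertiesConfiguration config = loadOrGetCached(fileName); // null if absent
        if (config != null && config.containsKey(key)) {
            return config.getString(key);
        }
    }
    return null;
}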
The Problem:
If a file did not exist the first time it was checked, the expected exception is thrown. However, if a file is later dropped into the classpath and this code is re-run, the exception is still thrown. Restarting the portal obviously clears the problem, but that's not useful to me -- I need to be able to let them drop new messages in place for language/companyDivision overrides without a restart. And I'm not interested in creating blank files for all possible divisions, since there are quite a few divisions.
I'm assuming this is a classLoader issue, in that it determines that the file did not exist in the classpath the first time, and caches that result when trying to reload the same file. I'm not interested in doing anything too fancy w/ the classLoader. (I'd be the only one who would be able to understand/maintain that code.) The specific environment is WebSphere Portal.
Any ways around this or am I stuck?
I am not sure whether Apache's FileChangedReloadingStrategy also reports ENTRY_CREATE events on a file system directory.
If you're using Java 7, I propose the following: simply implement a new ReloadingStrategy using the Java 7 WatchService. That way, every time a file is changed in one of your target directories or a new property file is placed there, you poll for the event and are able to add the properties to your application.
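A rough sketch of what such a strategy could look like with commons-configuration 1.x follows; the four ReloadingStrategy methods are the real interface, everything else is an assumption, and error handling plus WatchKey cancellation are omitted:

public class WatchServiceReloadingStrategy implements ReloadingStrategy {

    private FileConfiguration configuration;
    private WatchService watchService;
    private volatile boolean changeDetected;

    @Override
    public void setConfiguration(FileConfiguration configuration) {
        this.configuration = configuration;
    }

    @Override
    public void init() {
        try {
            // watch the directory that contains the property file
            Path dir = configuration.getFile().toPath().getParent();
            watchService = FileSystems.getDefault().newWatchService();
            dir.register(watchService,
                    StandardWatchEventKinds.ENTRY_CREATE,
                    StandardWatchEventKinds.ENTRY_MODIFY);
        } catch (IOException e) {
            throw new IllegalStateException("Could not set up WatchService", e);
        }
    }

    @Override
    public boolean reloadingRequired() {
        WatchKey key = watchService.poll(); // non-blocking check for pending events
        if (key != null) {
            changeDetected = !key.pollEvents().isEmpty();
            key.reset();
        }
        return changeDetected;
    }

    @Override
    public void reloadingPerformed() {
        changeDetected = false;
    }
}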
If you are not on Java 7, maybe using a library such as JNotify would be a better solution to get the event of a new entry in a directory. But again, you need to implement the ReloadingStrategy.
UPDATE for Java 6:
PropertiesConfiguration configurationProperties;
try {
    configurationProperties = new PropertiesConfiguration(propertyFileName);
    configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
} catch (ConfigurationException e) {
    JNotify.addWatch(propertyFileDirectory, JNotify.FILE_CREATED, false, new FileCreatedListener());
}
where
class FileCreatedListener implements JNotifyListener {
    // other methods

    public void fileCreated(int watchId, String rootPath, String fileName) {
        try {
            configurationProperties = new PropertiesConfiguration(rootPath + "/" + fileName);
            configurationProperties.setReloadingStrategy(new FileChangedReloadingStrategy());
            // or any other business with configurationProperties
        } catch (ConfigurationException e) {
            // handle/log the failure to load the newly created file
        }
    }
}
Multiple clients send requests to write a file and expect a response, either success or failure. I would like to describe concisely the work done on the server side.
The request is handled by a servlet class, which invokes another class to proceed further.
A FileWriter class is invoked, and this class performs the following file-writing process:
a) create a directory under the context and write a *.txt file inside the directory
b) copy some files from the context's existing directory to the newly created directory
c) compress (*.zip) this directory
class FileWriter {

    public synchronized void writeFile(String contextPath) {
        // creates a directory & new file under context
        copyFiles(path_to_directory);
    }

    private void copyFiles(String path_to_directory) {
        // copies files to /contextPath/directory/... from existingDirectory
        compressDir(Directory_path); // to compress the file
    }

    private void compressDir(String Directory_path) {
        // compress the newly created directory
    }
}
As you can see in the class above, one method is synchronized and two methods are private. Only the synchronized method is invoked from the servlet class; the other methods are invoked from inside that method.
So is this a good/standard way of handling multiple client requests, or should I invoke each method directly from the servlet class? Please correct me and suggest a better way to implement the class.
Edit: req1 comes in and creates a directory & file, e.g.
context/directory_1/file_1.txt
In the meantime req2 comes in, sees that directory_1 already exists, and therefore creates directory_2, e.g. context/directory_2/file_1.txt.
The second step is to copy the files from the context to the newly created directory. Note that directory_1 has nothing to do with directory_2; every newly created directory copies the files from a common directory, e.g. context/common_directory/... to context/directory_1, context/directory_2.
The third step is to compress the directory: e.g. directory_1.zip, directory_2.zip.
Two pieces of advice:
Do not give the class the same name as an already existing class in the JDK.
Do not chain method calls this way; create single-purpose methods and then put them together in one method that clearly shows your intention.
class FileProcessor /* FileUtil, whatever, but not FileWriter */ {

    public synchronized void writeFile(String contextPath) {
        // create a directory & new file under context
        copyFiles(contextPath);
        compressDir(contextPath); // to compress the file
    }

    // copies files to /contextPath/directory/... from existingDirectory
    private void copyFiles(String path_to_directory) { }

    // compress the newly created directory
    private void compressDir(String Directory_path) { }
}
Looking at the above code, if you call writeFile from the servlet, your servlet ends up as a single-threaded application.
If the two requests work on two separate directories and separate files and you can guarantee that there is no overlap, you should call both methods directly and ditch synchronized. It looks like this is your situation, so you can use the approach below:
Servlet Code
{
    ....
    String uniqDir = createUniqDir();
    copyFiles(uniqDir);
    compressDir(uniqDir);
}
Now the whole idea is to create a unique dir name. There are many approaches to creating a unique dir name; I'll use one based on a timestamp.
String createUniqDir() {
    // Use SimpleDateFormat or just millis from Date.
    // We are just trying to be as unique as possible.
    String timeStampStr;
    Date now = new Date();
    timeStampStr = "" + now.getTime(); // If using EPOCH millis

    // This solution if you want to use SimpleDateFormat:
    // SimpleDateFormat sdf = new SimpleDateFormat("yyyyMMdd_HHmmssSSS");
    // timeStampStr = sdf.format(now);

    int counter = 1;
    String dirToCreateStr = "some_prefix-" + timeStampStr;
    File dirToCreate = new File(dirToCreateStr);
    while (!dirToCreate.mkdir()) {
        dirToCreateStr = "some_prefix-" + timeStampStr + "-" + counter;
        dirToCreate = new File(dirToCreateStr);
        counter++;
    }
    return dirToCreateStr;
}
Since we are using mkdir, which is atomic and only returns true if it was able to create a new directory, this gives us a unique dir. The solution is also reasonably cheap, since collisions within the same millisecond are rare and we don't need any synchronization overhead.
You can use a counter too for creating a unique name. But if your counter always starts from the beginning (i.e. you are not maintaining its state, and in a thread-safe fashion at that), then you will have performance/accuracy issues.
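If you do go the counter route, keeping the counter as shared, thread-safe state and seeding it so it does not restart from zero avoids those issues. A minimal sketch (needs java.util.concurrent.atomic.AtomicLong):

// Shared counter seeded with the current time so restarts don't reuse old values;
// incrementAndGet() is thread safe, so concurrent requests get distinct names.
private static final AtomicLong DIR_COUNTER = new AtomicLong(System.currentTimeMillis());

static String nextUniqDirName() {
    return "some_prefix-" + DIR_COUNTER.incrementAndGet();
}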