I've implemented a solution that uses Quartz to read a folder at a fixed interval; for each file, it performs some operations and deletes the file when it finishes. It runs smoothly as long as the directory doesn't contain thousands of files.
getFiles(config.getString("input")) match {
  case Some(files) =>
    files.foreach { file =>
      try {
        // check if file is in use
        if (file.renameTo(file)) {
          process(file, config)
        }
      } catch {
        case e: Exception =>
      } finally {
        ...
      }
    }
  case None =>
    ...
}

def getFiles(path: String): Option[Array[File]] = {
  new File(path).listFiles() match {
    case files if files != null =>
      Some(files.filter(file => file.lastModified < Calendar.getInstance.getTimeInMillis - 5000))
    case _ =>
      None
  }
}

def process(file: File, clientConfig: Config) {
  ...
  file.delete
}
Now my scenario is different: I'm working with many thousands of files, and my throughput is very slow, about 50 files/sec (each file is 40 KB).
I was wondering what the best approach is for processing this many files. Should I change getFiles() to return only N elements and apply a FileLock to each one? With FileLock, I could retrieve only the files that are not in use. Or should I use something from Java NIO?
Thanks in advance.
I think you can wrap your try/catch block in a Future, so the files are processed in parallel. Apparently an ExecutionContext backed by a cached thread pool is the best fit for IO-bound operations. This would also mean you do not need to worry about locks, because the loop that spawns the futures is itself synchronous and creates exactly one future per file.
You could also read the input files as a stream. Your code would then no longer hold a reference to every file in memory upfront, only to the working set (one file), but I don't think this is the cause of your bottleneck.
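The two suggestions above can be sketched together. The sketch below is in Java rather than Scala (CompletableFuture plays the role of a Scala Future), and the names ParallelFolderProcessor, processAll and readAndDelete are hypothetical, not taken from the question. It streams the directory instead of listing it into an array, and submits one task per file to an executor supplied by the caller:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

final class ParallelFolderProcessor {

    /**
     * Streams the entries of dir, hands each regular file to handler on the
     * given executor, and returns how many files were handled without error.
     */
    static int processAll(Path dir, ExecutorService pool, Consumer<Path> handler) throws IOException {
        List<CompletableFuture<Boolean>> tasks = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path file : entries) {
                if (!Files.isRegularFile(file)) {
                    continue;
                }
                tasks.add(CompletableFuture
                        .supplyAsync(() -> { handler.accept(file); return true; }, pool)
                        // A failing file is counted as not processed instead of aborting the run.
                        .exceptionally(error -> false));
            }
        }
        int succeeded = 0;
        for (CompletableFuture<Boolean> task : tasks) {
            if (task.join()) {
                succeeded++;
            }
        }
        return succeeded;
    }

    /** Hypothetical handler: reads the file, then deletes it, like process() in the question. */
    static void readAndDelete(Path file) {
        try {
            Files.readAllBytes(file);
            Files.delete(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would pass something like Executors.newCachedThreadPool() as the pool and shut it down afterwards. Whether this actually raises the throughput depends on where the time goes; if the disk is the limit, more threads may not help.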
I am a newbie to Flink and am facing some challenges with the use case below.
Use case description:
Every day I will receive a CSV file with a timestamp in its name in some folder, say input. The file name format would be file_name_dd-mm-yy-hh-mm-ss.csv.
My Flink pipeline will read this CSV file row by row and write the rows to my Kafka topic.
Immediately after the file has been read completely, it needs to be moved to another folder, the historic folder.
The reason I need this: suppose the Ververica server stops, either abruptly or manually. If all the processed files are still in the same location, then after the restart Flink will re-read all the files it had already processed. To prevent this, files that have already been read need to be moved to another location immediately.
I googled a lot but did not find anything, so can you guide me on how to achieve this?
Let me know if anything else is required.
Out of the box, Flink provides the facility to monitor a directory for new files and read them via StreamExecutionEnvironment.getExecutionEnvironment.readFile (see similar Stack Overflow threads for examples: How to read newly added file in a directory in Flink, Monitoring directory for new files with Flink for data streams, etc.).
Looking into the source code of the readFile function, it calls the createFileInput() method, which simply instantiates ContinuousFileMonitoringFunction and ContinuousFileReaderOperatorFactory and configures the source:
addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is actually the place where most of the logic happens.
So, if I were to implement your requirement, I would extend ContinuousFileMonitoringFunction with my own logic for moving processed files into the history folder and construct the source from that function.
Given that the run method performs the read and forwarding inside the checkpointLock -
synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}
I would say that, on checkpoint completion, it is safe to move to the historic folder those files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits while the splits are collected.
That said, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and move the already processed files to the historic folder in notifyCheckpointComplete:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
    ...

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }
        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            // handle the new files
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    if (shouldIgnore(filePath, modificationTime)) {
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, monitoringMode, getParallelism(), interval);
ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);
final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;
env.addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle this.
You can ask about this on the flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.
I have 1000 big files to be processed in the order described below:
First, the files need to be copied to a different directory in parallel. I am planning to use an ExecutorService with 10 threads for this.
As soon as any file has been copied to the other location (#1), I will submit that file for further processing to an ExecutorService with 10 threads.
Finally, another action needs to be performed on these files in parallel. In other words, #2 gets its input from #1, and #3 gets its input from #2.
I could use CompletionService here, so that I can pass the results from #1 to #2 and from #2 to #3 in the order in which they complete. CompletableFuture is documented as letting us chain asynchronous tasks together, which sounds like something I can use in this case.
I am not sure whether I should implement my solution with CompletableFuture (since it is relatively new and ought to be better) or whether CompletionService is sufficient. And why should I choose one over the other in this case?
It would probably be best if you tried both approaches and then choose the one you are more comfortable with. Though it sounds like CompletableFutures are better suited for this task because they make chaining processing steps / stages really easy. For example in your case the code could look like this:
ExecutorService copyingExecutor = ...
// Not clear from the requirements, but let's assume you have
// a separate executor for this
ExecutorService processingExecutor = ...

public CompletableFuture<MyResult> process(Path file) {
    return CompletableFuture
        .supplyAsync(
            () -> {
                // Retrieve destination path where file should be copied to
                Path destination = ...
                try {
                    Files.copy(file, destination);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return destination;
            },
            copyingExecutor
        )
        .thenApplyAsync(
            copiedFile -> {
                // Process the copied file
                ...
            },
            processingExecutor
        )
        // This separate stage does not make much sense, so unless you have
        // yet another executor for this or this stage is applied at a different
        // location in your code, it should probably be merged with the
        // previous stage
        .thenApply(
            previousResult -> {
                // Process the previous result
                ...
            }
        );
}
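To run such a pipeline over all 1000 files, one option is to start one chain per file and wait for all of them with CompletableFuture.allOf. The sketch below is self-contained and therefore replaces the real copy and processing steps with string-tagging stand-ins; PipelineDemo, processAll and the stage bodies are assumptions made for illustration, and only the two pools of 10 threads come from the question:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

final class PipelineDemo {

    /** Stand-in for the three-stage pipeline; each stage only tags the input string. */
    static CompletableFuture<String> process(String file, ExecutorService copyPool, ExecutorService workPool) {
        return CompletableFuture
                .supplyAsync(() -> file + ":copied", copyPool)
                .thenApplyAsync(copied -> copied + ":processed", workPool)
                .thenApplyAsync(processed -> processed + ":finished", workPool);
    }

    /** Starts one pipeline per file and blocks until every pipeline has finished. */
    static List<String> processAll(List<String> files) {
        ExecutorService copyPool = Executors.newFixedThreadPool(10);
        ExecutorService workPool = Executors.newFixedThreadPool(10);
        try {
            List<CompletableFuture<String>> pipelines = files.stream()
                    .map(file -> process(file, copyPool, workPool))
                    .collect(Collectors.toList());
            // allOf completes once every pipeline has completed.
            CompletableFuture.allOf(pipelines.toArray(new CompletableFuture[0])).join();
            return pipelines.stream().map(CompletableFuture::join).collect(Collectors.toList());
        } finally {
            copyPool.shutdown();
            workPool.shutdown();
        }
    }
}
```

Each file moves on to its next stage as soon as its own previous stage is done, without waiting for the other files, which is the behaviour the question wanted from CompletionService.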
Let's say you have a task to read all the files saved in some folder and process every single one. For simplicity's sake, let's say that all files are HTML files and you want to extract the HTML content from them.
In Java 8 there is the Files.walk API, which allows us to do something like that. Here is an example:
try (Stream<Path> paths = Files.walk(Paths.get("/home/you/Desktop"))) {
    paths
        .filter(Files::isRegularFile)
        .forEach(System.out::println);
}
This sounds really good if you have to process a small number of folders and files, but if you have millions of files distributed across several network drives, this process will take ages and obviously needs to be parallelised. Any ideas how to do parallelism in this case?
I don't think there is a simple general algorithm to solve your problem.
In fact, the general idea when dealing with a large amount of data distributed over many nodes is to let each node collect its own data and then process those partial results on a single node.
Doing all the scanning from a single system is going to be hard.
To do some real optimization you cannot treat all the folders in the same way.
What you could do is create a Collection of Paths that can be scanned in parallel.
So instead of walking from a single root node, you could start several walks over several folders (possibly one for each network drive).
For this to work you need to know which path is a network path and which is a local one.
If, for example, you have a folder where each child folder is a mounted network drive, you could easily collect all those folders and then run your walk in parallel for each.
I would do something similar to the following code:
public class ParallelWalks {

    ExecutorService executor = Executors.newCachedThreadPool();
    ExecutorService singleThreadExecutor = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        new ParallelWalks().exec();
    }

    public ExecutorService executorSelector(Path path) {
        if (isNetworkDrive(path)) {
            return executor;
        } else {
            return singleThreadExecutor;
        }
    }

    private boolean isNetworkDrive(Path path) {
        // Here goes the logic to choose which path should go on a different
        // thread.
        return path.toString().contains("srv");
    }

    private void exec() {
        Path path = Paths.get("/home/you/Desktop");
        try (Stream<Path> files = Files.list(path)) {
            files.forEach(this::taskRunner);
        } catch (IOException e) {
            // Do something with the exception
        }
        // No more tasks will be submitted; walks already submitted still run
        // to completion, and the JVM can exit once they are done.
        executor.shutdown();
        singleThreadExecutor.shutdown();
    }

    private void taskRunner(final Path path) {
        executorSelector(path)
            .submit(() -> doWalk(path));
    }

    private void doWalk(Path path) {
        try (Stream<Path> paths = Files.walk(path)) {
            paths.filter(Files::isRegularFile).forEach(System.out::println);
        } catch (IOException e) {
            // Do something with the exception
        }
    }
}
This way all your local directories will be processed sequentially, and each network drive will be processed on its own thread.
It would work only if all (or most of) your network drives share the same mount point parent.
Otherwise you should implement your own walk.
I am writing a directory monitoring utility in java(1.6) using polling at certain intervals using lastModified long value as the indication of change. I found that when my polling interval is small (seconds) and the copied file is big then the change event is fired before the actual completion of file copying.
I would like to know whether there is a way I can find the status of file like in transit, complete etc.
Environments: Java 1.6; expected to work on windows and linux.
There are two approaches I've used in the past which are platform agnostic.
1/ This was for FTP transfers where I controlled what was put, so it may not be directly relevant.
Basically, whatever is putting a file file.txt will, when it's finished, also put a small (probably zero-byte) dummy file called file.txt.marker (for example).
That way, the monitoring tool just looks for the marker file to appear and, when it does, it knows the real file is complete. It can then process the real file and delete the marker.
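A minimal sketch of the marker-file approach, assuming the .marker suffix from the example above and a sender that writes the marker only after the data file is complete. The class and method names (MarkerFiles, completedFiles, release) are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class MarkerFiles {

    static final String MARKER_SUFFIX = ".marker";

    /** Returns the data files in dir whose companion marker file is present. */
    static List<Path> completedFiles(Path dir) throws IOException {
        List<Path> ready = new ArrayList<>();
        try (DirectoryStream<Path> markers = Files.newDirectoryStream(dir, "*" + MARKER_SUFFIX)) {
            for (Path marker : markers) {
                String markerName = marker.getFileName().toString();
                String dataName = markerName.substring(0, markerName.length() - MARKER_SUFFIX.length());
                Path data = dir.resolve(dataName);
                if (Files.isRegularFile(data)) {
                    ready.add(data);
                }
            }
        }
        return ready;
    }

    /** Call after processing: removes the marker so the file is not picked up again. */
    static void release(Path data) throws IOException {
        Files.deleteIfExists(data.resolveSibling(data.getFileName() + MARKER_SUFFIX));
    }
}
```

The monitor only ever lists marker files, so a data file that is still being written is invisible to it until the sender drops the marker.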
2/ An unchanged duration.
Have your monitor program wait until the file is unchanged for N seconds (where N is reasonably guaranteed to be large enough that the file will be finished).
For example, if the file size hasn't changed in 60 seconds, there's a good chance it's finished.
There's a balancing act between not thinking the file is finished just because there's no activity on it, and the wait once it is finished before you can start processing it. This is less of a problem for local copying than FTP.
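The unchanged-duration check could be sketched as a small tracker that the polling loop calls once per file per poll. StableFileTracker and its methods are hypothetical names; the caller is assumed to pass the current file size and clock value (for example file.length() and System.currentTimeMillis()), which also keeps the logic easy to test:

```java
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class StableFileTracker {

    /** Last observed size of a file and the time at which that size was first seen. */
    private static final class Observation {
        final long size;
        final long firstSeenMillis;

        Observation(long size, long firstSeenMillis) {
            this.size = size;
            this.firstSeenMillis = firstSeenMillis;
        }
    }

    private final Map<Path, Observation> observations = new HashMap<>();
    private final long quietMillis;

    StableFileTracker(long quietMillis) {
        this.quietMillis = quietMillis;
    }

    /** Returns true once the file has kept the same size for at least quietMillis. */
    boolean isStable(Path file, long size, long nowMillis) {
        Observation last = observations.get(file);
        if (last == null || last.size != size) {
            // First sighting, or the size changed: restart the quiet period.
            observations.put(file, new Observation(size, nowMillis));
            return false;
        }
        return nowMillis - last.firstSeenMillis >= quietMillis;
    }

    /** Drops the bookkeeping for a file that has been processed. */
    void forget(Path file) {
        observations.remove(file);
    }
}
```

With new StableFileTracker(60_000), a file is reported as stable on the first poll at which its size has been unchanged for 60 seconds, matching the example above.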
This solution worked for me:
File ff = new File(fileStr);

if (ff.exists()) {

    for (int timeout = 100; timeout > 0; timeout--) {
        RandomAccessFile ran = null;

        try {
            ran = new RandomAccessFile(ff, "rw");
            break; // no errors, done waiting
        } catch (Exception ex) {
            System.out.println("timeout: " + timeout + ": " + ex.getMessage());
        } finally {
            if (ran != null) try {
                ran.close();
            } catch (IOException ex) {
                //do nothing
            }

            ran = null;
        }

        try {
            Thread.sleep(100); // wait a bit then try again
        } catch (InterruptedException ex) {
            //do nothing
        }
    }

    System.out.println("File lockable: " + fileStr +
        (ff.exists() ? " exists" : " deleted during process"));
} else {
    System.out.println("File does not exist: " + fileStr);
}
This solution relies on the fact that you can't open the file for writing if another process has it open. It will stay in the loop until the timeout value is reached or the file can be opened. The timeout values will need to be adjusted depending on the application's actual needs. I also tried this method with channels and tryLock(), but it didn't seem to be necessary.
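For reference, the channel and tryLock() variant mentioned above might look like the sketch below; LockProbe and canLock are hypothetical names. Note that file locks are advisory on some platforms, so this only detects writers that take a lock themselves or that the operating system locks out implicitly:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class LockProbe {

    /** Returns true if an exclusive lock on the whole file can be acquired right now. */
    static boolean canLock(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) {
            // tryLock() returns null when another process holds an overlapping lock.
            return lock != null;
        } catch (IOException | OverlappingFileLockException e) {
            // Missing file, no write permission, or a lock already held by this JVM.
            return false;
        }
    }
}
```

The lock is released again as soon as the method returns, so it serves only as a probe and does not reserve the file for the caller.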
Do you mean that you're waiting for the lastModified time to settle? At best that will be a bit hit-and-miss.
How about trying to open the file with write access (appending rather than truncating the file, of course)? That won't succeed if another process is still trying to write to it. It's a bit ugly, particularly as it's likely to be a case of using exceptions for flow control (ick) but I think it'll work.
If I understood the question correctly, you're looking for a way to distinguish whether the copying of a file is complete or still in progress?
How about comparing the size of the source and destination file (i.e. file.length())? If they're equal, then copying is complete. Otherwise, it's still in progress.
I'm not sure it's efficient since it would still require polling. But it "might" work.
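A sketch of that size comparison, assuming the monitor can read both the source and the destination (which is often not the case for uploads). CopyProgress and its methods are illustrative names. Some copy tools may reserve the full size of the destination up front, in which case equal sizes would not prove that the copy has finished:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class CopyProgress {

    /** Fraction of the source that has arrived at the destination, from 0.0 to 1.0. */
    static double fractionCopied(Path source, Path destination) throws IOException {
        if (!Files.exists(destination)) {
            return 0.0;
        }
        long total = Files.size(source);
        if (total == 0) {
            return 1.0;
        }
        return Math.min(1.0, (double) Files.size(destination) / total);
    }

    /** True when the destination exists and has the same length as the source. */
    static boolean isCopyComplete(Path source, Path destination) throws IOException {
        return Files.exists(destination) && Files.size(source) == Files.size(destination);
    }
}
```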
You could look into online file upload with progressbar techniques - they use OutputStreamListener and custom writer to notify the listener about bytes written.
http://www.missiondata.com/blog/java/28/file-upload-progress-with-ajax-and-java-and-prototype/
File Upload with Java (with progress bar)
We used to monitor changes in file size to determine whether a file was complete or not.
We used the Spring Integration file endpoint to poll a directory every 200 ms.
Once a file is detected (regardless of whether it is complete or not), a custom file filter is applied. It implements the interface method "accept(File file)", which returns a flag indicating whether we can process the file.
If the filter returns false, this File instance is ignored and is picked up again during the next poll, where it goes through the same filtering process.
The filter does the following:
First, we get the current file size, wait for 200 ms (can be less) and check the size again. If the size differs, we retry up to 5 times. Only when the file size stops growing is the file marked as COMPLETED (i.e. the filter returns true).
The sample code is as follows:
public class InCompleteFileFilter<F> extends AbstractFileListFilter<F> {

    protected Object monitor = new Object();

    @Override
    protected boolean accept(F file) {
        synchronized (monitor) {
            File currentFile = (File) file;
            if (!currentFile.getName().contains("Conv1")) { return false; }
            long currentSize = currentFile.length();
            int retryCount = 0;
            boolean growing = true;
            // Re-check the size every 200 ms, at most 5 times, until it stops growing.
            while (growing && retryCount++ < 5) {
                try {
                    Thread.sleep(200);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
                long newSize = currentFile.length();
                growing = newSize > currentSize;
                currentSize = newSize; // compare against the latest size on the next pass
            }
            // Still growing after all retries: reject now, the next poll tries again.
            return !growing;
        }
    }
}