Read all files and folders on my network drive - java

Let's say you have a task to read all the files saved in some folder and process every single one of them. For simplicity's sake, let's say all of them are HTML files and you want to extract the HTML content from each.
In Java 8 there is the Files.walk API that allows us to do something like that. Here is an example:
try (Stream<Path> paths = Files.walk(Paths.get("/home/you/Desktop"))) {
    paths
        .filter(Files::isRegularFile)
        .forEach(System.out::println);
}
This sounds really good if you only have to process a small number of folders and files, but if you have millions of files distributed across several network drives, then this process will take ages and obviously needs to be parallelised. Any ideas how to do the parallelism in this case?

I don't think there is a simple general algorithm to solve your problem.
In fact the general idea when dealing with large amounts of data distributed across many nodes is to let each node collect its own data and then process those partial results on a single node.
Doing all the scanning from a single system is going to be hard.
To do some real optimization you cannot treat all the folders the same way.
What you could do is create a Collection of Paths that can be scanned in parallel.
So instead of walking a single root you could start several walks over several folders (possibly one for each network drive).
For this to work you need to know which paths are network paths and which are local ones.
If, for example, you have a folder where each child folder is a mounted network drive, you could easily collect all those folders and then run your walk in parallel on each of them.
I would do something similar to the following code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Stream;

public class ParallelWalks {
    ExecutorService executor = Executors.newCachedThreadPool();
    ExecutorService singleThreadExecutor = Executors.newSingleThreadExecutor();

    public static void main(String[] args) {
        new ParallelWalks().exec();
    }

    public ExecutorService executorSelector(Path path) {
        if (isNetworkDrive(path)) {
            return executor;
        } else {
            return singleThreadExecutor;
        }
    }

    private boolean isNetworkDrive(Path path) {
        // Here goes the logic to decide which paths should go on a different thread.
        return path.toString().contains("srv");
    }

    private void exec() {
        Path path = Paths.get("/home/you/Desktop");
        try (Stream<Path> files = Files.list(path)) {
            files.forEach(this::taskRunner);
        } catch (IOException e) {
            // Do something with the exception
        }
    }

    private void taskRunner(final Path path) {
        executorSelector(path).submit(() -> doWalk(path));
    }

    private void doWalk(Path path) {
        try (Stream<Path> paths = Files.walk(path)) {
            paths.filter(Files::isRegularFile).forEach(System.out::println);
        } catch (IOException e) {
            // Do something with the exception
        }
    }
}
This way all your local directories will be processed sequentially, and each network drive will be processed on its own thread.
It would work only if all (or most of) your network drives share the same mount-point parent.
Otherwise you should implement your own walk.
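If the roots do not share a parent, a minimal sketch of such a custom walk, assuming you can enumerate the root paths yourself (the paths below are placeholders, and the same imports as the class above plus java.util.Arrays and java.util.List are assumed), could look like this:

List<Path> roots = Arrays.asList(
        Paths.get("/home/you/Desktop"),   // local folder, placeholder
        Paths.get("/mnt/srv1"),           // network mount, placeholder
        Paths.get("/mnt/srv2"));          // network mount, placeholder

ExecutorService pool = Executors.newFixedThreadPool(roots.size());
for (Path root : roots) {
    pool.submit(() -> {
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile).forEach(System.out::println);
        } catch (IOException e) {
            // Do something with the exception
        }
    });
}
pool.shutdown();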

Related

Move already process file from one folder to another folder in flink

I am a newbie to Flink and facing some challenges solving the below use case.
Use case description:
Every day I will receive a CSV file with a timestamp in some folder, say input. The file name format would be file_name_dd-mm-yy-hh-mm-ss.csv.
Now my Flink pipeline will read this CSV file row by row and write it to my Kafka topic.
Immediately after the data has finished being read, this file needs to be moved to another folder, the historic folder.
Why I need this: suppose that your Ververica server stops, either abruptly or manually. If you have all the processed files lying in the same location, then after the Ververica restart Flink will re-read all the files that it had processed earlier. To prevent this scenario, already-read files need to be moved to another location immediately.
I googled a lot but did not find anything, so can you guide me on how to achieve this?
Let me know if anything else is required.
Out of the box, Flink provides the facility to monitor a directory for new files and read them, via StreamExecutionEnvironment.getExecutionEnvironment().readFile (see similar Stack Overflow threads for examples: How to read newly added file in a directory in Flink / Monitoring directory for new files with Flink for data streams, etc.).
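For reference, a typical use of that API (the input path here is just a placeholder) looks roughly like this:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
TextInputFormat format = new TextInputFormat(new Path("file:///path/to/input"));

DataStream<String> lines = env.readFile(
        format,
        "file:///path/to/input",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        10_000L); // re-scan the directory every 10 seconds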
Looking into the source code of the readFile function, it calls the createFileInput() method, which simply instantiates ContinuousFileMonitoringFunction and ContinuousFileReaderOperatorFactory and configures the source:
addSource(monitoringFunction, sourceName, null, boundedness)
.transform("Split Reader: " + sourceName, typeInfo, factory);
ContinuousFileMonitoringFunction is actually the place where most of the logic happens.
So, if I were to implement your requirement, I would extend the functionality of ContinuousFileMonitoringFunction with my own logic for moving the processed files into the history folder, and construct the source from this function.
Given that the run method performs the read and forwarding inside the checkpointLock:
synchronized (checkpointLock) {
    monitorDirAndForwardSplits(fileSystem, context);
}
I would say it is safe, on checkpoint completion, to move to the historic folder those files whose modification time is older than globalModificationTime, which is updated in monitorDirAndForwardSplits when the splits are collected.
That said, I would extend the ContinuousFileMonitoringFunction class, implement the CheckpointListener interface, and in notifyCheckpointComplete move the already processed files to the historic folder:
public class ArchivingContinuousFileMonitoringFunction<OUT> extends ContinuousFileMonitoringFunction<OUT> implements CheckpointListener {
    ...

    @Override
    public void notifyCheckpointComplete(long checkpointId) throws Exception {
        Map<Path, FileStatus> eligibleFiles = listEligibleForArchiveFiles(fs, new Path(path));
        // do move logic
    }

    /**
     * Returns the paths of the files already processed.
     *
     * @param fileSystem The filesystem where the monitored directory resides.
     */
    private Map<Path, FileStatus> listEligibleForArchiveFiles(FileSystem fileSystem, Path path) {
        final FileStatus[] statuses;
        try {
            statuses = fileSystem.listStatus(path);
        } catch (IOException e) {
            // we may run into an IOException if files are moved while listing their status
            // delay the check for eligible files in this case
            return Collections.emptyMap();
        }

        if (statuses == null) {
            LOG.warn("Path does not exist: {}", path);
            return Collections.emptyMap();
        } else {
            Map<Path, FileStatus> files = new HashMap<>();
            // collect the already processed files
            for (FileStatus status : statuses) {
                if (!status.isDir()) {
                    Path filePath = status.getPath();
                    long modificationTime = status.getModificationTime();
                    if (shouldIgnore(filePath, modificationTime)) {
                        files.put(filePath, status);
                    }
                } else if (format.getNestedFileEnumeration() && format.acceptFile(status)) {
                    files.putAll(listEligibleForArchiveFiles(fileSystem, status.getPath()));
                }
            }
            return files;
        }
    }
}
and then define the data stream manually with the custom function:
ContinuousFileMonitoringFunction<OUT> monitoringFunction =
        new ArchivingContinuousFileMonitoringFunction<>(
                inputFormat, monitoringMode, getParallelism(), interval);

ContinuousFileReaderOperatorFactory<OUT, TimestampedFileInputSplit> factory =
        new ContinuousFileReaderOperatorFactory<>(inputFormat);

final Boundedness boundedness = Boundedness.CONTINUOUS_UNBOUNDED;

env.addSource(monitoringFunction, sourceName, null, boundedness)
        .transform("Split Reader: " + sourceName, typeInfo, factory);
Flink itself does not provide a solution for doing this. You might need to build something yourself, or find a workflow tool that can be configured to handle it.
You can ask about this on the Flink user mailing list. I know others have written scripts to do this; perhaps someone can share a solution.

Process Many Files and Delete

I've implemented a solution that uses Quartz to read a folder at a regular interval and, for each file, do some operations and delete the file when it finishes. It runs smoothly when I don't have thousands of files in the directory.
getFiles(config.getString("input")) match {
  case Some(files) =>
    files.foreach { file =>
      try {
        // check if file is in use
        if (file.renameTo(file)) {
          process(file, config)
        }
      } catch {
        case e: Exception =>
      } finally {
        ...
      }
    }
  case None =>
    ...
}

def getFiles(path: String): Option[Array[File]] = {
  new File(path).listFiles() match {
    case files if files != null =>
      Some(files.filter(file => file.lastModified < Calendar.getInstance.getTimeInMillis - 5000))
    case _ =>
      None
  }
}

def process(file: File, clientConfig: Config) {
  ...
  file.delete
}
Now my scenario is different - I'm working with thousands and thousands of files - and my throughput is very slow: 50/sec (each file is 40 KB).
I was wondering what the best approach is to process many files. Should I change getFiles() to return N elements at a time and apply a FileLock to each element? If I used FileLock, I could retrieve only the elements that are not in use. Or should I use something from Java NIO?
Thanks in advance.
I think you can wrap your try/catch block in a Future, so you can process the files in parallel. Apparently an ExecutionContext backed by a cached thread pool is best for IO-bound operations. This would also mean you do not need to worry about locks, as you spawn the futures for each file sequentially from a single thread.
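As a rough illustration of the same idea (sketched in Java rather than Scala, with processFile standing in for the existing process(file, config) logic), submitting each file to a cached thread pool looks something like this:

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ParallelProcessor {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    void processAll(File[] files) {
        for (File file : files) {
            pool.submit(() -> {
                try {
                    processFile(file); // does the work and deletes the file when done
                } catch (Exception e) {
                    // log and move on; one bad file should not stop the others
                }
            });
        }
    }

    private void processFile(File file) {
        // placeholder for the real processing + file.delete()
    }
}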
You could also read the input files as a stream, which would mean your code no longer stores a reference to all the files in memory up front, but only to the working set (one file at a time); however, I don't think this is the cause of your bottleneck.

Using WatchServiceDirectoryScanner in Spring

I have a requirement to implement a watch service on a folder. This is the straightforward approach of using Java 7's WatchService. I have successfully done it: I am able to capture events whenever a file is created/updated/deleted in the folder I am watching. The problem is that it does not apply to the contents of sub folders, and this is clearly stated in the documentation. My requirement is to watch the contents of the sub folders as well. This is not possible with the above approach unless I manually loop over all the sub folders and listen to each one, which I think could lead to a memory leak if not programmed well. Hence I am going with what Spring suggested in the newer release, explained here. This is the cleanest approach I have seen for WatchService. The problem is that it listens only to ENTRY_CREATE events, i.e. only the events where we have created a file, and this can be at any level. It does not work when I change or delete a file. How should we go ahead in this case?
public static void watchFolderTree(String pathStr) throws Exception
{
    long waitTime = 10000;
    WatchServiceDirectoryScanner scanner = new WatchServiceDirectoryScanner(pathStr);
    scanner.start();
    List<File> changedFiles = null;
    while (true)
    {
        changedFiles = scanner.listFiles(new File(pathStr));
        if (changedFiles.size() > 0)
        {
            System.out.println("There is a file ");
        }
        Thread.sleep(waitTime);
    }
}
References:
Monitor subfolders with a Java watch service
JAVA 7 watch service
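For reference, with the plain JDK WatchService the usual workaround is to register every sub folder yourself (Oracle's WatchDir example does exactly this). A minimal sketch covering create/modify/delete events, with the caveat that newly created directories would also need to be registered, could look like this:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import static java.nio.file.StandardWatchEventKinds.*;

public class RecursiveWatcher {

    public static void watch(Path root) throws IOException, InterruptedException {
        WatchService watchService = root.getFileSystem().newWatchService();

        // Register the root and every existing sub directory for all three event kinds.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
                dir.register(watchService, ENTRY_CREATE, ENTRY_MODIFY, ENTRY_DELETE);
                return FileVisitResult.CONTINUE;
            }
        });

        while (true) {
            WatchKey key = watchService.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
                // If event.kind() is ENTRY_CREATE and the new entry is a directory,
                // register it here as well so its children are watched too.
            }
            key.reset();
        }
    }
}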

Sharing a resource among Threads, different behavior in different java versions

This is the first time I've encountered something like the below.
Multiple threads (inner classes implementing Runnable) share a data structure (an instance variable of the outer class).
Working: took the classes from the Eclipse project's bin folder and ran them on a Unix machine.
NOT WORKING: compiled the src directly on the Unix machine and used those class files. The code compiles and then runs with no errors/warnings, but one thread is not able to access the shared resource properly.
PROBLEM: One thread adds elements to the common data structure above. The second thread does the following...
while (true) {
    if (myArrayList.size() > 0) {
        // do stuff
    }
}
The log shows that the size is updated in Thread 1.
For some mystic reason, the workflow is not entering the if() ...
The same exact code runs perfectly if I directly paste the class files from Eclipse's bin folder.
I apologize if I missed anything obvious.
Code:
ArrayList<CSRequest> newCSRequests = new ArrayList<CSRequest>();

// Thread 1
private class ListeningSocketThread implements Runnable {
    ServerSocket listeningSocket;

    public void run() {
        try {
            LogUtil.log("Initiating...");
            init(); // creates socket
            processIncomingMessages();
            listeningSocket.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void processIncomingMessages() throws IOException {
        while (true) {
            try {
                processMessage(listeningSocket.accept());
            } catch (ClassNotFoundException e) {
                e.printStackTrace();
            }
        }
    }

    private void processMessage(Socket s) throws IOException, ClassNotFoundException {
        // read message
        ObjectInputStream ois = new ObjectInputStream(s.getInputStream());
        Object message = ois.readObject();
        LogUtil.log("adding...: before size: " + newCSRequests.size());
        synchronized (newCSRequests) {
            newCSRequests.add((CSRequest) message);
        }
        LogUtil.log("adding...: after size: " + newCSRequests.size()); // YES, THE SIZE IS UPDATED TO > 0
        // closing....
    }
    ........
}
// Thread 2
private class CSRequestResponder implements Runnable {
    public void run() {
        LogUtil.log("Initiating..."); // REACHES..
        while (true) {
            // LogUtil.log("inside while..."); // IF NOT COMMENTED, FLOODS THE CONSOLE WITH THIS MSG...
            if (newCSRequests.size() > 0) { // DOES NOT PASS
                LogUtil.log("inside if size > 0..."); // NEVER REACHES....
                try {
                    handleNewCSRequests();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
    ....
}
UPDATE
The solution was to add synchronized(myArrayList) before checking the size in Thread 2.
To access a shared structure in a multi-threaded environment, you should use implicit or explicit locking to ensure safe publication and access among threads.
Using the code above, it should look like this:
while (true) {
    synchronized (myArrayList) {
        if (myArrayList.size() > 0) {
            // do stuff
        }
    }
    // sleep(...) // outside the lock!
}
Note: This pattern looks much like a producer-consumer and is better implemented using a queue. LinkedBlockingQueue is a good option for that and provides built-in concurrency control capabilities. It's a good structure for safe publishing of data among threads.
Using a concurrent data structure lets you get rid of the synchronized block:
BlockingQueue<Data> queue = new LinkedBlockingQueue<>(...);
...
while (true) {
    Data data = queue.take(); // this will wait until there's data in the queue
    doStuff(data);
}
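On the producer side (the socket thread in the question's code), the synchronized add is then replaced by a put; put and take both throw InterruptedException, so the surrounding methods need to handle or declare it. A rough sketch with the question's CSRequest type (handleNewCSRequest below is a hypothetical per-request variant of handleNewCSRequests):

BlockingQueue<CSRequest> newCSRequests = new LinkedBlockingQueue<>();

// Producer side, inside processMessage(...):
newCSRequests.put((CSRequest) message); // no explicit synchronized block needed

// Consumer side, inside CSRequestResponder.run():
while (true) {
    CSRequest request = newCSRequests.take(); // blocks until a request arrives
    handleNewCSRequest(request);
}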
Every time you modify a given shared variable inside a parallel region (a region with multiple threads running in parallel) you must ensure mutual exclusion. You can guarantee mutual exclusion in Java by using synchronized or locks; normally you use locks when you want finer-grained synchronization.
If the program only performs reads on a given shared variable, there is no need to synchronize/lock accesses to it.
Since you are new to this subject, I recommend this tutorial.
If I got this right, there are at least two threads that work with the same shared data structure, the array you mentioned. One thread adds values to the array and the second thread "does stuff" if the size of the array is > 0.
There is a chance that the thread scheduler ran the second thread (the one that checks whether the collection size is > 0) before the first thread got a chance to run and add a value.
Running the classes from bin or recompiling them has nothing to do with it. If you were to run the application again from the bin directory, you might see the issue again. How many times did you run the app?
It might not reproduce consistently, but at some point you might see the issue again.
You could access the data structure in a serial fashion, allowing only one thread at a time to access the array. Still, that does not guarantee that the first thread will run and only then the second one will check whether the size is > 0.
Depending on what you need to accomplish, there might be better/other ways to achieve it, not necessarily using an array to coordinate the threads.
Check the return value of
newCSRequests.add((CSRequest) message);
I am guessing it's possible that it didn't get added for some reason. If it were a HashSet or similar, it could have been because the hash code of multiple objects returned the same value. What is the equals implementation of the message object?
You could also use
List<CSRequest> list = Collections.synchronizedList(new ArrayList<>(...));
to ensure the ArrayList is always synchronised correctly.
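One caveat worth noting (from the Collections.synchronizedList javadoc): iteration and other compound operations over the wrapped list still need to hold the list's own lock, for example:

synchronized (list) {
    for (CSRequest request : list) {
        // must hold the lock while iterating, otherwise the iterator may fail
    }
}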
HTH

Writing java program to recursively test file naming standards

I want to write a Java application that validates files and directories according to certain naming standards. The program would let you pick a directory and would recursively analyze it, giving a list of files/directories that do not match the given rules.
Eventually I want the user to be able to input rules, but for now they would be hard coded. Oh, and this would need to be cross-platform.
I have a working knowledge of basic Java constructs but have no experience with libraries, and I have not had much luck finding demos/code samples for this type of thing.
I would love suggestions for what trees to start barking up, pseudo-code -- whatever you feel would be helpful.
EDIT: I'm not trying to remove anything here, just get a recursive listing of any names that break certain rules (e.g. no spaces or special characters, no directories that start with uppercase) in the chosen directory.
I would recommend using Commons IO; I think DirectoryWalker will help you.
Here is a sample that checks for and removes ".svn" directories:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.commons.io.DirectoryWalker;

public class FileCleaner extends DirectoryWalker<File> {

    public FileCleaner() {
        super();
    }

    public List<File> clean(File startDirectory) throws IOException {
        List<File> results = new ArrayList<>();
        walk(startDirectory, results);
        return results;
    }

    @Override
    protected boolean handleDirectory(File directory, int depth, Collection<File> results) {
        // delete svn directories and then skip
        if (".svn".equals(directory.getName())) {
            directory.delete();
            return false;
        } else {
            return true;
        }
    }

    @Override
    protected void handleFile(File file, int depth, Collection<File> results) {
        // delete file and add to list of deleted
        file.delete();
        results.add(file);
    }
}
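For the actual use case in the question (collecting names that break the rules rather than deleting anything), a hedged sketch along the same lines, with one hard-coded rule expressed as an example regex and assuming Commons IO 2.x where DirectoryWalker is generic, could look like this:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.commons.io.DirectoryWalker;

// Sketch: collects files/directories whose names contain spaces, uppercase letters,
// or special characters. The regex is just an example rule, not from the question.
public class NamingRuleWalker extends DirectoryWalker<File> {

    private static final Pattern VALID_NAME = Pattern.compile("[a-z0-9._-]+");

    public List<File> findViolations(File startDirectory) throws IOException {
        List<File> results = new ArrayList<>();
        walk(startDirectory, results);
        return results;
    }

    @Override
    protected boolean handleDirectory(File directory, int depth, Collection<File> results) {
        if (!VALID_NAME.matcher(directory.getName()).matches()) {
            results.add(directory);
        }
        return true; // keep descending either way
    }

    @Override
    protected void handleFile(File file, int depth, Collection<File> results) {
        if (!VALID_NAME.matcher(file.getName()).matches()) {
            results.add(file);
        }
    }
}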
