Performance optimization for searching data in a file system - Java

I have a network-attached storage holding around 5 million txt files related to around 3 million transactions, about 3.5 TB of data in total. I have to search that location to find whether the file related to each transaction is available or not, and produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge I am facing is that, since I have to search the location recursively, each search takes around 2 minutes on average because of the huge size. I am using the Java I/O API to search recursively, as shown below. Is there any way I can improve the performance?
File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        for (File f : arr) {
            File found = searchFile(f, fileName);
            if (found != null)
                return found;
        }
    } else {
        if (location.getName().equals(fileName)) {
            return location;
        }
    }
    return null;
}

You should take a different approach. Rather than walking the entire directory tree every time you search for a file, create an index: a mapping from filename to file location.
Essentially:
void buildIndex(Map<String, File> index, File baseDir) {
    if (baseDir.isDirectory()) {
        File[] arr = baseDir.listFiles();
        for (File f : arr) {
            buildIndex(index, f);
        }
    } else {
        index.put(baseDir.getName(), baseDir);
    }
}
Now that you've got the index, searching for the files becomes trivial.
With the files in a Map, you can even use set operations to find the intersection of filenames and transaction IDs:
Map<String, File> index = new HashMap<String, File>();
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);
Optionally, if the index itself is too big to keep in memory, you may want to create the index in an SQLite database.
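For illustration, here is a minimal sketch of such a disk-backed index using JDBC with SQLite. This assumes the sqlite-jdbc driver is on the classpath; the class, table, and column names are made up for the example:

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqliteFileIndex {

    private final Connection conn;
    private final PreparedStatement insert;
    private final PreparedStatement query;

    public SqliteFileIndex(String dbPath) throws Exception {
        Class.forName("org.sqlite.JDBC"); // may be needed on Java 6 (pre-JDBC-4 driver loading)
        conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
        conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT)");
        insert = conn.prepareStatement("INSERT OR REPLACE INTO files VALUES (?, ?)");
        query = conn.prepareStatement("SELECT path FROM files WHERE name = ?");
    }

    // Same walk as buildIndex above, but rows go to disk instead of a HashMap.
    public void index(File dir) throws Exception {
        File[] children = dir.listFiles();
        if (children == null) return;
        for (File f : children) {
            if (f.isDirectory()) {
                index(f);
            } else {
                insert.setString(1, f.getName());
                insert.setString(2, f.getAbsolutePath());
                insert.executeUpdate();
            }
        }
    }

    // Returns the stored path, or null if the file was never indexed.
    public String lookup(String fileName) throws Exception {
        query.setString(1, fileName);
        ResultSet rs = query.executeQuery();
        return rs.next() ? rs.getString(1) : null;
    }
}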

Searching a directory tree on network-attached storage is a nightmare; it takes a lot of time when the directory is too big or too deep. Since you are on Java 6, you can follow an old-fashioned approach: list all the files into a CSV file, like below.
e.g.
find . -type f -name '*.txt' >> test.csv (if Unix)
dir /b/s *.txt > test.csv (if Windows)
Now load this CSV file into a Map indexed by filename. Loading the file will take some time, as it will be huge, but once loaded, lookups in the map (by file name) will be much quicker and will reduce your search time drastically.
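For illustration, a minimal Java 6-compatible sketch of that loading step (class and file names are made up for the example):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class FileListIndex {

    // Reads one absolute path per line (the output of find/dir above)
    // and indexes it by bare file name.
    public static Map<String, String> load(String listFile) throws IOException {
        Map<String, String> index = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader(listFile));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String name = new File(line).getName();
                index.put(name, line);
            }
        } finally {
            reader.close();
        }
        return index;
    }
}

A lookup is then just index.containsKey(fileName): an O(1) hash probe instead of a multi-minute directory walk.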

You can use the NIO FileVisitor API. Note that Files.walkFileTree and FileVisitor were introduced in Java 7, not Java 6, so this only applies if you can upgrade.
Path findTransactionFile(Path root) throws IOException {
    final Path[] transactionFile = new Path[1];
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
            if (/* todo: dir predicate */ false) {
                return FileVisitResult.SKIP_SUBTREE; // optimization: prune whole subtrees
            }
            return FileVisitResult.CONTINUE;
        }
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (/* todo: file predicate */ true) {
                transactionFile[0] = file;
                return FileVisitResult.TERMINATE; // found, stop walking
            }
            return FileVisitResult.CONTINUE;
        }
    });
    return transactionFile[0];
}

I don't know the definitive answer, but from an algorithmic perspective your program has the worst possible complexity: each lookup for a single transaction iterates over all 5 million files, and you have 3 million transactions, so the total work is O(n * m).
My suggestion is to iterate over the 5 million files once and build an index keyed by file name, then iterate over the transactions and look each one up in the index instead of doing a full scan.
Alternatively, there may be free third-party tools that can index a large file system and expose that index to an external application (in this case, your Java app). If you cannot find such a tool, you had better build it yourself; then you can build the index in an optimal way that suits your requirements.
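Putting the two passes together, here is a hedged sketch of the CSV report generation the original question asks for. It uses the Map built by buildIndex above, and it assumes (purely for illustration) that the file for a transaction is named "<id>.txt":

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.List;
import java.util.Map;

public class ReportGenerator {

    // Pass 1 (done elsewhere): build the filename -> file index once.
    // Pass 2: one O(1) lookup per transaction, writing each id to the right report.
    public static void writeReports(Map<String, File> index,
                                    List<String> transactionIds) throws IOException {
        PrintWriter available = new PrintWriter(new FileWriter("available.csv"));
        PrintWriter missing = new PrintWriter(new FileWriter("not_available.csv"));
        try {
            for (String id : transactionIds) {
                // Assumption: the file for a transaction is named "<id>.txt".
                if (index.containsKey(id + ".txt")) {
                    available.println(id);
                } else {
                    missing.println(id);
                }
            }
        } finally {
            available.close();
            missing.close();
        }
    }
}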

Related

Retrieve folders and subfolders to read a file in java using tail recursion

I am using normal recursion in a method to iterate over folders and subfolders and get the files in Java.
Can someone help me change it to a tail-recursive method? I couldn't understand what tail recursion is; an explanation would be useful for me.
public void findFiles(String filePath) throws IOException {
    List<File> files = Files.list(Paths.get(filePath))
            .map(path -> path.toFile())
            .collect(Collectors.toList());
    for (File file : files) {
        if (file.isDirectory()) {
            if (file.list().length == 0) {
                boolean isDeleted = file.delete();
            } else {
                findFiles(file.getAbsolutePath());
            }
        } else {
            // process files
        }
    }
}
This is the normal recursion I have; can someone help me write a tail-recursive version of it?
I tried one way, but I am not sure whether it is tail recursion or how it works.
public static void findFiles(String filePath) throws IOException {
    List<File> files = Files.list(Paths.get(filePath))
            .map(path -> path.toFile())
            .collect(Collectors.toList());
    for (File file : files) {
        if (file.isDirectory() && file.list().length == 0) {
            boolean isDeleted = file.delete();
        } else if (!file.isDirectory()) {
            System.out.println("Processing files!!!" + file.getAbsolutePath());
        }
        if (file.isDirectory()) {
            findFiles(file.getAbsolutePath());
        }
    }
}
Thanks in Advance.
Tail recursion is a special kind of recursion which does not do anything after the recursive call but return.
Some programming languages take advantage of this by optimising the call stack, so that if you have a very deep recursion you don't end up with stack overflows (apart from the memory and invocation efficiency gains themselves).
The trick that is often used is that you add an extra accumulator parameter, which takes any outstanding data to be processed. Since this might make the recursive function less usable, it is usually done separately, so that to the user of your function it appears simple.
So in your example it would be like this: the normal findFiles() just prepares for the recursive call, while the private findFilesRecursive() does the tail-recursive work.
public void findFiles(String filePath) throws IOException {
    // we use a Deque<> for Last-In-First-Out ordering (to keep subfolders with their parent)
    Deque<Path> paths = new ArrayDeque<Path>();
    paths.add(Paths.get(filePath));
    findFilesRecursive(paths);
}

private void findFilesRecursive(Deque<Path> pending) throws IOException {
    if (pending.isEmpty()) {
        // base case, we are done
        return;
    }
    Path path = pending.removeFirst();
    if (Files.isRegularFile(path)) {
        // todo: process the file
    } else {
        // it is a directory, queue its contents for processing
        List<Path> inside = Files.list(path).collect(Collectors.toList());
        if (inside.isEmpty()) {
            Files.delete(path);
        } else {
            // we use LIFO so that subfolders get processed first
            inside.forEach(pending::addFirst);
        }
    }
    // tail recursion: we do nothing after the call but return
    findFilesRecursive(pending);
}
Note that Java doesn't (yet) take advantage of tail recursion; other programming languages like Scala and Kotlin do.
Side note: Path is generally more powerful than the old File; you don't need to convert a Path to a File in your case.
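Since the JVM will not eliminate the tail call, the same method can be rewritten mechanically as a loop; this sketch is equivalent to the tail-recursive version above:

private void findFilesIterative(Deque<Path> pending) throws IOException {
    while (!pending.isEmpty()) {           // the tail call becomes the loop condition
        Path path = pending.removeFirst();
        if (Files.isRegularFile(path)) {
            // todo: process the file
        } else {
            List<Path> inside = Files.list(path).collect(Collectors.toList());
            if (inside.isEmpty()) {
                Files.delete(path);
            } else {
                inside.forEach(pending::addFirst);
            }
        }
    }
}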

Check for new files in a loop - java

I have a program that needs to read files. I need to check every 10 seconds whether there are new files.
To do that, I've made this:
ArrayList<File> oldFiles = new ArrayList<File>();
ArrayList<File> files = new ArrayList<File>();
while (isFinished != true) {
    files = listFilesForFolder(folder);
    if (oldFiles.size() != files.size()) {
        System.out.println("Here is when a new file(s) is(are) in the folder");
    }
    Thread.sleep(10000);
}
Basically, listFilesForFolder takes a folder destination and checks the files in there.
My problem: every loop iteration runs my reading function on every file. I want to run my reading function ONLY on new files.
How can I do something like:
new files - old files = the files I want to read
Rather than your approach, why not store the DateTime of the last time that you checked,
then compare that time to the File.lastModified value.
The problem with your approach is that the array sizes will be different even if a file is deleted, and will be the same if one file is deleted and another is added.
Rather than comparing old and new files, why not write a method that just returns the last-modified files:
public static ArrayList<File> listLastModifiedFiles(File folder,
        long sleepDuration) throws Exception {
    ArrayList<File> newFileList = new ArrayList<File>();
    for (File fileEntry : folder.listFiles())
        if ((System.currentTimeMillis() - fileEntry.lastModified()) <= sleepDuration)
            newFileList.add(fileEntry);
    return newFileList;
}
// Sample usage:
long sleepDuration = 10000;
ArrayList<File> newFileList;
int counter = 10;
while (counter-- > 0) {
    newFileList = listLastModifiedFiles(folder, sleepDuration);
    for (File file : newFileList)
        System.out.println(file.getName());
    Thread.sleep(sleepDuration);
}
You can use sets: instead of returning an ArrayList, return a Set.
newFiles.removeAll(oldFiles);
would then give you all the files that are not in the old set. I'm not saying that working with the modification date, as Scary Wombat has pointed out, is a worse idea; I'm just offering another solution.
Additionally, you have to modify your oldFiles to hold all the files you've already encountered. The following example, I think, does what you're trying to achieve.
private static Set<File> findFilesIn(File directory) {
    // Or whatever logic you have for finding files
    return new HashSet<File>(Arrays.asList(directory.listFiles()));
}

public static void main(String[] args) throws Throwable {
    Set<File> allFiles = new HashSet<File>(); // Renamed from oldFiles
    Set<File> newFiles = new HashSet<File>();
    File dir = new File("/tmp/stackoverflow/");
    while (true) {
        allFiles.addAll(newFiles); // Add files from last round to collection of all files
        newFiles = findFilesIn(dir);
        newFiles.removeAll(allFiles); // Remove all the ones we already know.
        System.out.println(String.format("Found %d new files: %s", newFiles.size(), newFiles));
        System.out.println("Sleeping...");
        Thread.sleep(5000);
    }
}
Sets are a more appropriate data store for your case, since you don't need any ordering in your collection of files and can benefit from faster lookup times (when using a HashSet).
Assuming that you only need to detect new files, not modified ones, and no file will be removed while your code is running:
ArrayList implements removeAll(Collection c), which does exactly what you want:
Removes from this list all of its elements that are contained in the specified collection.
You might want to consider using the Java WatchService API, which uses low-level operating system facilities to notify you of changes to the file system. It's more efficient and faster than listing the files in a directory.
There is a tutorial at Watching a Directory for Changes and the API is documented here: Interface WatchService
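For illustration, a minimal WatchService sketch (Java 7+) that watches a single directory for newly created files; the directory path is an assumption, and production code would also handle OVERFLOW events and invalidated keys:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class NewFileWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/tmp/stackoverflow/");
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take(); // blocks until something happens
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    // context() is the new file's path, relative to the watched dir
                    System.out.println("New file: " + dir.resolve((Path) event.context()));
                }
            }
            key.reset(); // re-arm the key for further events
        }
    }
}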

Recursion: Checking for files in Directories and reading them

Before you speculate something like "this guy is asking for homework help", I'll go ahead and clear any doubts you may have and say yes, this is related to homework. However, I hope that does not take away from the learning that this question provides to me and/or anyone who reads it in the future.
Background: we're currently working on recursion, and our assignment asks us to write a program that uses command-line arguments to check a directory and its files' contents for a string (which is also a command-line argument). We must use recursion for this.
I want to make it clear that I UNDERSTAND WHAT THE ASSIGNMENT IS ASKING.
I am simply asking: how would this work recursively? Because I just don't get it.
We did a problem where we had to find the size of a directory and it made sense, but I don't get how to check whether something is a directory or a file and, based on that, either read its contents or go deeper into the directory until we find a file.
Here's what I've currently done. I'm not too sure how wrong it is, as I'm basing it entirely on the 'check the size of a directory' assignment we previously did.
The folder that I'm checking is structured something like this:
Directory ---> files inside the main directory ---> two directories ---> files within both of those directories
public class SearchingForStrings {
    public static void main(String[] args) {
        String path = "."; // default location of this project
        File sf = new File(path);
        String mysteriesDirectory = args[0];
        String keyString = args[1];
        countLinesWithString(sf, mysteriesDirectory, keyString);
    }

    public static int countLinesWithString(File startPath, String mysteriesDirectory, String keyString) {
        if (!startPath.exists()) {
            throw new IllegalArgumentException("File " + startPath + " does not exist!");
        } else if (startPath.isFile()) {
            // this is where we would begin reading the contents of the files;
            // the parsing below is just to stop a compile error from flagging on this part
            // (going to ask my professor if it's okay with him)
            return Integer.parseInt(startPath.getAbsolutePath());
        } else if (startPath.isDirectory()) {
            // This is where our recursion would take place: essentially
            // we will be going 'deeper' into the directory until we find a file
            // File[] subFiles = startPath.listFiles();
            return countLinesWithString(startPath, mysteriesDirectory, keyString);
        } else {
            throw new IllegalStateException("Unknown file type: " + startPath);
        }
    }
}
In short: could someone explain how recursion would work if you wanted to go deeper into a directory (or directories)?
I'll give this a try. It's something that is easier to explain than to understand.
The recursive method, on which you have made a decent start, might be documented as follows:
"For a given directory: for each file in the directory, count all the lines which contain a given string; for each directory in the directory, recurse".
The recursion is possible - and useful - because your original target is a container, and one of the types of things it can contain is another container.
So think of the counting method like this:
int countLines(dir, string) // the string could be an instance variable, also, and not passed in
{
    var countedLines = 0;
    for each item in dir:
        if item is file, countedLines += matchedLinesInFile(item, string);
        else if item is dir, countedLines += countLines(item, string);
        else throw up; // or throw an exception -- your choice
    return countedLines;
}
then call countLines from an exterior method with the original dir to use, plus the string.
One of the things that trips people up about recursion is that, after you get it written, it doesn't seem possible that it can do all that it does. But think through the above for different scenarios. If the dir passed in has files and no dirs, it will accumulate countedLines for each file in the dir, and return the result. That's what you want.
If the dir does contain other dirs, then for each one of those, you're going to call the routine and start on that contained dir. The call will accumulate countedLines for each file in that dir, and call itself for each dir recursively down the tree, until it reaches a dir that has no dirs in it. And it still counts lines in those, it just doesn't have any further down to recurse.
At the lowest level, it is going to accumulate those lines and return them. Then the second-lowest level will get that total to add to its total, and start the return trips back up the recursion tree.
Does that explain it any better?
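For concreteness, here is a hedged Java rendering of that pseudocode; matchedLinesInFile is the assumed helper that counts matching lines in a single file:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

// Counts lines containing keyString in every file under dir, recursively.
static int countLines(File dir, String keyString) throws IOException {
    int countedLines = 0;
    File[] items = dir.listFiles();
    if (items == null) {
        return 0; // not readable, or not actually a directory
    }
    for (File item : items) {
        if (item.isFile()) {
            countedLines += matchedLinesInFile(item, keyString);
        } else if (item.isDirectory()) {
            countedLines += countLines(item, keyString); // recurse into subdirectory
        }
    }
    return countedLines;
}

// Assumed helper: counts the lines of one file that contain the key string.
static int matchedLinesInFile(File file, String keyString) throws IOException {
    int matches = 0;
    BufferedReader reader = new BufferedReader(new FileReader(file));
    try {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.contains(keyString)) {
                matches++;
            }
        }
    } finally {
        reader.close();
    }
    return matches;
}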
Just to help you get started with recursion, check this: it will recursively descend from a base directory, printing all the folders and files.
Modify it to your requirements. Try it and let us know.
import java.io.File;

public class Test {
    public static void getResource(final String resourcePath) {
        File file = new File(resourcePath);
        if (file.isFile()) {
            System.out.println("File Name : " + file.getName());
            return;
        } else {
            File[] listFiles = file.listFiles();
            if (listFiles != null) {
                for (File resourceInDirectory : listFiles) {
                    if (!resourceInDirectory.isFile()) {
                        System.out.println("Folder "
                                + resourceInDirectory.getAbsolutePath());
                        getResource(resourceInDirectory.getAbsolutePath());
                    } else {
                        getResource(resourceInDirectory.getAbsolutePath());
                    }
                }
            }
        }
    }

    public static void main(String[] args) {
        final String folderPath = "C:/Test";
        getResource(folderPath);
    }
}

Java Data structure files StackOverflowError

My program collects all paths to files on the computer (OS: Ubuntu) into one Map. The key in the Map is a file size, and the value is a list of canonical paths of files whose size equals that key.
Map<Long, ArrayList<String>> map = new HashMap<>(100000);
The total number of files on the computer is 281091.
The method that collects the files is recursive:
private void scanner(String path) throws Exception {
    File[] dirs = new File(path).listFiles(new FileFilter() {
        @Override
        public boolean accept(File file) {
            if (file.isFile() && file.canRead()) {
                long size = file.length();
                try {
                    String canonPath = file.getCanonicalPath();
                    if (map.containsKey(size))
                        map.get(size).add(canonPath);
                    else
                        map.put(size, new ArrayList<>(Arrays.asList(canonPath)));
                } catch (IOException e) {
                    // getCanonicalPath() can throw; accept() cannot, so wrap it
                    throw new RuntimeException(e);
                }
                return false;
            }
            return file.isDirectory() && file.canRead();
        }
    });
    if (dirs == null) return; // unreadable directory
    for (File dir : dirs) {
        scanner(dir.getCanonicalPath());
    }
}
When I start scanning from the root folder "/", I get an exception:
Exception in thread "main" java.lang.StackOverflowError
at java.io.UnixFileSystem.canonicalize0(Native Method)
at java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)
at java.io.File.getCanonicalPath(File.java:589)
at taskB.FileScanner.setCanonPath(FileScanner.java:49)
at taskB.FileScanner.access$000(FileScanner.java:12)
at taskB.FileScanner$1.accept(FileScanner.java:93)
at java.io.File.listFiles(File.java:1217)
at taskB.FileScanner.scanner(FileScanner.java:85)
at taskB.FileScanner.scanner(FileScanner.java:109)
at taskB.FileScanner.scanner(FileScanner.java:109)
...
But as a test I filled the directory "~/Documents" with more than 400 thousand files and began scanning from there: everything works fine.
Why, when the program starts from the root directory "/" (where there are fewer than 300 thousand files), do I get the exception? What should I do to prevent it?
A StackOverflowError means that you called so many nested functions that your program ran out of stack space for the function-call information (which is retained until each call returns). In your case I suspect it is due to processing the "." (current directory) and ".." (parent directory) entries when they are returned in the directory list, thus recursing into the same directory more than once.
The most likely explanation is that you have a symbolic link somewhere in the filesystem that creates a cycle (an infinite loop). For example, the following would be a cycle:
/home/userid/test/data -> /home/userid
While scanning files you need to ignore symbolic links to directories.
@Jim Garrison was right: it was due to symbolic links. I found the solution to the problem here.
I use the isSymbolicLink(Path) method:
return file.isDirectory() && file.canRead() && !Files.isSymbolicLink(file.toPath());

Java non-recursive filesystem walking

I need to create an app which walks the filesystem non-recursively and prints out the files found at a certain depth.
What I have:
public void putFileToQueue() throws IOException, InterruptedException {
    File root = new File(rootPath).getAbsoluteFile();
    checkFile(root, depth);
    Queue<DepthControl> queue = new ArrayDeque<DepthControl>();
    DepthControl e = new DepthControl(0, root);
    do {
        root = e.getFileName();
        if (root.isDirectory()) {
            File[] files = root.listFiles();
            if (files != null)
                for (File file : files) {
                    if (e.getDepth() + 1 <= depth && file.isDirectory()) {
                        queue.offer(new DepthControl(e.getDepth() + 1, file));
                    }
                    if (file.getName().contains(mask)) {
                        if (e.getDepth() == depth) {
                            System.out.println(Thread.currentThread().getName()
                                    + " putting in queue: "
                                    + file.getAbsolutePath());
                        }
                    }
                }
        }
        e = queue.poll();
    } while (e != null);
}
And the helper class:
public class DepthControl {
    private int depth;
    private File file;

    public DepthControl(int depth, File file) {
        this.depth = depth;
        this.file = file;
    }

    public File getFileName() {
        return file;
    }

    public int getDepth() {
        return depth;
    }
}
I received an answer that this program uses additional memory because of the breadth-first search (hope that is the right translation). Its memory use is O(k^n), where k is the average number of subdirectories and n is the depth; the same task could be done in O(k*n). Please help me fix my algorithm.
I think this should do the job, and it is a bit simpler. It just keeps track of the files at the next level, expands them, then repeats the process. The algorithm itself keeps track of depth, so there is no need for the extra class.
// start in the current working directory
File root = new File(System.getProperty("user.dir"));
List<File> expand = new LinkedList<File>();
expand.add(root);
for (int depth = 0; depth < 10; depth++) {
    File[] expandCopy = expand.toArray(new File[expand.size()]);
    expand.clear();
    for (File file : expandCopy) {
        System.out.println(depth + " " + file);
        if (file.isDirectory()) {
            expand.addAll(Arrays.asList(file.listFiles()));
        }
    }
}
In Java 8, you can use streams, Files.walk, and a maxDepth of 1:
try (Stream<Path> walk = Files.walk(Paths.get(filePath), 1)) {
    List<String> result = walk.filter(Files::isRegularFile)
            .map(Path::toString).collect(Collectors.toList());
    result.forEach(System.out::println);
} catch (IOException e) {
    e.printStackTrace();
}
To avoid recursion when walking a tree there are basically two options (a sketch of the second follows below):
Use a "work list" (similar to the above) to track the work to be done. As each item is examined, new work items that are "discovered" as a result are added to the work list (which can be FIFO, LIFO, or random order; it doesn't matter conceptually, though it will often affect "locality of reference" and hence performance).
Use a stack/"push-down list" to essentially simulate the recursive scheme.
For #2 you have to write an algorithm that is something of a state machine, returning to the stack after every step to determine what to do next. The stack entries, for a tree walk, basically contain the current tree node and the index into its child list of the next child to examine.
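Here is a minimal sketch of option #2, slightly simplified: instead of tracking a child index per stack entry, it pushes the children themselves onto an explicit stack, which simulates the recursion without a full state machine:

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class IterativeWalk {
    public static void main(String[] args) {
        Deque<File> stack = new ArrayDeque<File>();
        stack.push(new File("."));
        while (!stack.isEmpty()) {
            File current = stack.pop();      // "return" to the stack to decide the next step
            System.out.println(current);
            File[] children = current.listFiles();
            if (children != null) {          // null for plain files and unreadable dirs
                for (File child : children) {
                    stack.push(child);       // children will be visited depth-first
                }
            }
        }
    }
}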
If you're using Java 7, there is a very elegant method of walking file trees. You'll need to confirm whether it meets your needs recursion-wise, though.
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import static java.nio.file.FileVisitResult.*;

public class myFinder extends SimpleFileVisitor<Path> {
    public FileVisitResult visitFile(Path file, BasicFileAttributes attr) { }
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) { }
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) { }
    public FileVisitResult visitFileFailed(Path file, IOException exc) { }
    <snip>
}
Essentially it does a depth-first walk of the tree and calls certain methods when it enters/exits directories and when it "visits" a file.
I believe this to be specific to Java 7, though.
http://docs.oracle.com/javase/tutorial/essential/io/walk.html
Assuming you want to limit the amount of space used, and:
you can assume the list of files/directories is static over the course of your traversal, AND
you can assume the list of files/directories in a given directory is always returned in the same order, AND
you have access to the parent of the current directory,
then you can traverse the directory tree using only information about the last node visited. Specifically, something along the lines of:
1. Keep track of the last Entry (directory or file) visited
2. Keep track of the current directory
3. Get a list of files in the current directory
4. Find the index of the last Entry visited in the list of files
5. If lastVisited is the last Entry in the current directory,
5.1.1 If current directory == start directory, we're done
5.1.2 Otherwise, lastVisited = the current directory and current directory = the parent directory
5.2. Otherwise, visit the element after lastVisited and set lastVisited to that element
6. Repeat from step 3
If I can, I'll try to write up some code tomorrow to show what I mean... but I just don't have the time right now.
NOTE: This isn't a GOOD way to traverse the directory structure; it's just a possible way, outside the normal box, and probably for good reason.
You'll have to forgive me for not giving sample code in Java; I don't have the time to work on that atm. Doing it in Tcl is faster for me, and it shouldn't be too hard to understand. So, that being said:
proc getFiles {dir} {
    set result {}
    foreach entry [glob -tails -directory $dir * .*] {
        if { $entry != "." && $entry != ".." } {
            lappend result [file join $dir $entry]
        }
    }
    return [lsort $result]
}

proc listdir {startDir} {
    if {! ([file exists $startDir] && [file isdirectory $startDir])} {
        error "File '$startDir' either doesn't exist or isn't a directory"
    }
    set result {}
    set startDir [file normalize $startDir]
    set currDir $startDir
    set currFile {}
    set fileList [getFiles $currDir]
    for {set i 0} {$i < 1000} {incr i} { # use for to avoid infinite loop
        set index [expr {1 + ({} == $currFile ? -1 : [lsearch $fileList $currFile])}]
        if {$index < ([llength $fileList])} {
            set currFile [lindex $fileList $index]
            lappend result $currFile
            if { [file isdirectory $currFile] } {
                set currDir $currFile
                set fileList [getFiles $currDir]
                set currFile {}
            }
        } else {
            # at last entry in the dir, move up one dir
            if {$currDir == $startDir} {
                # at the starting directory, we're done
                return $result
            }
            set currFile $currDir
            set currDir [file dirname $currDir]
            set fileList [getFiles $currDir]
        }
    }
}

puts "Files:\n\t[join [listdir [lindex $argv 0]] \n\t]"
And, running it:
VirtualBox:~/Programming/temp$ ./dirlist.tcl /usr/share/gnome-media/icons/hicolor
Files:
/usr/share/gnome-media/icons/hicolor/16x16
/usr/share/gnome-media/icons/hicolor/16x16/status
/usr/share/gnome-media/icons/hicolor/16x16/status/audio-input-microphone-high.png
/usr/share/gnome-media/icons/hicolor/16x16/status/audio-input-microphone-low.png
/usr/share/gnome-media/icons/hicolor/16x16/status/audio-input-microphone-medium.png
/usr/share/gnome-media/icons/hicolor/16x16/status/audio-input-microphone-muted.png
/usr/share/gnome-media/icons/hicolor/22x22
[snip]
/usr/share/gnome-media/icons/hicolor/48x48/devices/audio-subwoofer-testing.svg
/usr/share/gnome-media/icons/hicolor/48x48/devices/audio-subwoofer.svg
/usr/share/gnome-media/icons/hicolor/scalable
/usr/share/gnome-media/icons/hicolor/scalable/status
/usr/share/gnome-media/icons/hicolor/scalable/status/audio-input-microphone-high.svg
/usr/share/gnome-media/icons/hicolor/scalable/status/audio-input-microphone-low.svg
/usr/share/gnome-media/icons/hicolor/scalable/status/audio-input-microphone-medium.svg
/usr/share/gnome-media/icons/hicolor/scalable/status/audio-input-microphone-muted.svg
And, of course, there's always the multi-threaded option to avoid recursion (a sketch follows below):
1. Create a queue of files.
2. If the next item is a file, add it to the queue.
3. If it's a folder, start a new thread to list the files in it, feeding the same queue.
4. Get the next item.
5. Repeat from step 2 as necessary.
Obviously this may not list the files in a predictable order.
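A hedged sketch of that idea, using an ExecutorService for the per-folder tasks and a concurrent queue for the files (the names and the crude completion check are illustrative, not a definitive implementation):

import java.io.File;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelLister {
    private final ConcurrentLinkedQueue<File> files = new ConcurrentLinkedQueue<File>();
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final AtomicInteger pendingDirs = new AtomicInteger();

    private void scan(final File dir) {
        pendingDirs.incrementAndGet();          // count before submitting, so 0 means done
        pool.execute(new Runnable() {
            public void run() {
                File[] children = dir.listFiles();
                if (children != null) {
                    for (File child : children) {
                        if (child.isDirectory()) {
                            scan(child);        // each folder becomes its own task
                        } else {
                            files.add(child);   // files feed the shared queue
                        }
                    }
                }
                pendingDirs.decrementAndGet();
            }
        });
    }

    public void listAll(File root) throws InterruptedException {
        scan(root);
        while (pendingDirs.get() > 0) {
            Thread.sleep(10);                   // crude wait for all tasks to finish
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        for (File f : files) {
            System.out.println(f);
        }
    }
}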
