Improve performance of File.IsDirectory() [duplicate] - java

Suppose a very simple program that lists out all the subdirectories of a given directory. Sounds simple enough? Except that the only way to list all subdirectories in Java is to use FilenameFilter combined with File.list().
This works for the trivial case, but when the folder has, say, 150,000 files and 2 subfolders, it's silly waiting there for 45 seconds iterating through all the files and testing file.isDirectory(). Is there a better way to list subdirectories?
PS. Sorry, please save the lectures on having too many files in the same directory. Our live environment has this as part of the requirement.
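For reference, a minimal sketch of the approach the question describes, using File.listFiles with a FileFilter (equivalent to FilenameFilter plus File.list()); the directory path is just a placeholder:
import java.io.File;
import java.io.FileFilter;

public class ListSubdirs {
    public static void main(String[] args) {
        // Naive approach: list every entry and test each one with isDirectory(),
        // i.e. one attribute lookup per entry - 150,000 of them in the scenario above.
        File dir = new File("/some/huge/directory"); // hypothetical path
        File[] subdirs = dir.listFiles(new FileFilter() {
            @Override
            public boolean accept(File f) {
                return f.isDirectory();
            }
        });
        if (subdirs != null) {
            for (File d : subdirs) {
                System.out.println(d.getName());
            }
        }
    }
}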

As has already been mentioned, this is basically a hardware problem. Disk access is always slow, and most file systems aren't really designed to handle directories with that many files.
If you for some reason have to store all the files in the same directory, I think you'll have to maintain your own cache. This could be done using a local database such as SQLite, HeidiSQL or HSQL. If you want extreme performance, use a Java TreeSet and cache it in memory. At the very least this means you'll have to read the directory less often, and it could possibly be done in the background. You could reduce the need to refresh the list even further by using your system's native file update notification API (inotify on Linux) to subscribe to changes to the directory.
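A minimal sketch of that caching idea, using Java 7's WatchService as the change-notification mechanism; class name and structure are illustrative, and thread safety and error handling are omitted. The point is the split between one expensive scan up front and cheap event-driven updates afterwards:
import java.nio.file.*;
import java.util.Set;
import java.util.TreeSet;

public class CachedDirectoryListing {
    // Not thread-safe as written; a real version would synchronize access.
    private final Set<String> entries = new TreeSet<>();
    private final Path dir;

    public CachedDirectoryListing(Path dir) throws Exception {
        this.dir = dir;
        // The expensive full scan happens once, up front.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path p : stream) {
                entries.add(p.getFileName().toString());
            }
        }
    }

    // Call from a background thread: blocks on change events and updates the cache.
    public void watch() throws Exception {
        WatchService watcher = dir.getFileSystem().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                              StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                String name = event.context().toString();
                if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                    entries.add(name);
                } else if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                    entries.remove(name);
                }
            }
            key.reset();
        }
    }
}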
This doesn't seem to be possible for you, but I once solved a similar problem by "hashing" the files into subdirectories. In my case, the challenge was to store a couple of million images with numeric ids. I constructed the directory structure as follows:
images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg
This has worked well for us, and it's the solution that I would recommend. You could do something similar with alphanumeric filenames by simply taking the first two letters of the filename, and then the next two. I've done that once as well, and it did the job.
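As a small illustration of the arithmetic behind that path pattern (a sketch, not the original code):
// Builds "images/<id - id % 1000000>/<id - id % 1000>/<id>.jpg"
static String imagePath(long id) {
    long topBucket = id - (id % 1_000_000);
    long subBucket = id - (id % 1_000);
    return "images/" + topBucket + "/" + subBucket + "/" + id + ".jpg";
}
// imagePath(15023) returns "images/0/15000/15023.jpg"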

Do you know the finite list of possible subdirectory names? If so, use a loop over all possible names and check for each directory's existence.
Otherwise, you cannot get ONLY directory names in most underlying OSs (e.g. in Unix, the directory listing is merely reading the contents of the "directory" file, so there's no way to find "just directories" quickly without listing all the files).
However, with NIO.2 in Java 7 (see http://java.sun.com/developer/technicalArticles/javase/nio/#3 ), there's a way to have a streaming directory listing so you don't get a full array of file elements cluttering your memory/network.
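A minimal sketch of that streaming approach with NIO.2's DirectoryStream; entries are filtered as they are produced rather than materialized into one big array first (the path is a placeholder):
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class StreamSubdirs {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("/some/huge/directory"); // hypothetical path
        // The filter runs per entry as the listing is streamed.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(dir, entry -> Files.isDirectory(entry))) {
            for (Path subdir : stream) {
                System.out.println(subdir.getFileName());
            }
        }
    }
}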

There's actually a reason why you got the lectures: it's the correct answer to your problem. Here's the background, so that perhaps you can make some changes in your live environment.
First: directories are stored on the filesystem; think of them as files, because that's exactly what they are. When you iterate through the directory, you have to read those blocks from the disk. Each directory entry will require enough space to hold the filename, and permissions, and information on where that file is found on-disk.
Second: directories aren't stored with any internal ordering (at least, not in the filesystems where I've worked with directory files). If you have 150,000 entries and 2 sub-directories, those 2 sub-directory references could be anywhere within the 150,000. You have to iterate to find them, there's no way around that.
So, let's say that you can't avoid the big directory. Your only real option is to try to keep the blocks comprising the directory file in the in-memory cache, so that you're not hitting the disk every time you access them. You can achieve this by regularly iterating over the directory in a background thread -- but this is going to cause undue load on your disks, and interfere with other processes. Alternatively, you can scan once and keep track of the results.
The alternative is to create a tiered directory structure. If you look at commercial websites, you'll see URLs like /1/150/15023.html -- this is meant to keep the number of files per directory small. Think of it as a BTree index in a database.
Of course, you can hide that structure: you can create a filesystem abstraction layer that takes filenames and automatically generates the directory tree where those filenames can be found.
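A sketch of what such an abstraction layer could look like, here deriving two directory levels from the first characters of the name; the scheme, padding trick and class name are all made up for illustration:
import java.io.File;

public class TieredFileStore {
    private final File root;

    public TieredFileStore(File root) {
        this.root = root;
    }

    // Callers only ever pass a flat filename; the tiered location is derived from it.
    public File resolve(String filename) {
        String padded = (filename + "____").toLowerCase(); // padding guards against very short names
        String level1 = padded.substring(0, 2);
        String level2 = padded.substring(2, 4);
        return new File(new File(new File(root, level1), level2), filename);
    }
}
// resolve("15023.html") -> <root>/15/02/15023.html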

The key problem could be the File.isDirectory() function being called in a loop.
File.isDirectory() can be extremely slow. I saw NFS take 10 seconds to process a 200-file directory.
If you can by all means avoid File.isDirectory() calls (e.g. test for an extension; no extension == directory), you could improve performance drastically.
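A tiny sketch of that heuristic, paying for isDirectory() only on the (presumably rare) names without an extension; this obviously assumes your naming convention guarantees every regular file has one:
// Assumes every regular file in the directory has a "." extension.
static boolean looksLikeDirectory(File f) {
    return f.getName().lastIndexOf('.') < 0 && f.isDirectory();
}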
Otherwise I would suggest using JNA/JNI or writing a native script that does this for you.
The jCIFS library lets you manipulate Windows network shares more efficiently. I am not aware of a library that would do this for other network file systems.

You could hack it if the 150k files all (or a significant number of them) had a similar naming convention like:
*.jpg
*Out.txt
and only actually create file objects for the ones you are unsure about being a folder.

I don't know if the overhead of shelling out to cmd.exe would eat it up, but one possibility would be something like this:
...
// "/c" runs the command and then exits, so readLine() eventually sees end-of-stream
// (with "/k" the shell would stay open and the loop below would never terminate).
Runtime r = Runtime.getRuntime();
Process p = r.exec("cmd.exe /c dir /s/b/ad C:\\folder");
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
for (;;) {
    String d = br.readLine();
    if (d == null)
        break;
    System.out.println(d);
}
...
/s means search subdirectories
/ad means only return directories
/b means return the full pathname from the root

I came across a similar question when debugging performance in a Java application that enumerates plenty of files. It was using the old approach
for (File f : new File("C:\\").listFiles()) {
    if (f.isDirectory()) {
        continue;
    }
}
And it appears that each f.isDirectory() call goes into the native file system, which, at least on NTFS, is very slow. Java 7 NIO has an additional API, but not all of its methods perform well there. I'll just provide the JMH benchmark results here
Benchmark                  Mode  Cnt  Score    Error  Units
MyBenchmark.dir_listFiles  avgt    5  0.437  ± 0.064   s/op
MyBenchmark.path_find      avgt    5  0.046  ± 0.001   s/op
MyBenchmark.path_walkTree  avgt    5  1.702  ± 0.047   s/op
Numbers come from running the benchmarks below with:
java -jar target/benchmarks.jar -bm avgt -f 1 -wi 5 -i 5 -t 1
static final String testDir = "C:/Sdk/Ide/NetBeans/src/dev/src/";
static final int nCycles = 50;

public static class Counter {
    int countOfFiles;
    int countOfFolders;
}

@Benchmark
public List<File> dir_listFiles() {
    List<File> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        File dir = new File(testDir);
        files.clear();
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) {
                continue;
            }
            files.add(f);
        }
    }
    return files;
}

@Benchmark
public List<Path> path_walkTree() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path path, BasicFileAttributes arg1) throws IOException {
                files.add(path);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes arg1)
                    throws IOException {
                return path == dir ? FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE;
            }
        });
    }
    return files;
}

@Benchmark
public List<Path> path_find() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        files.addAll(Files.find(dir, 1, (path, attrs)
                -> true /*!attrs.isDirectory()*/).collect(Collectors.toList()));
    }
    return files;
}

If your OS is 'stable', give JNA a try:
opendir/readdir on UNIX
FindFirstFile and related API on Windows
Java7 with NIO2
These are all "streaming" APIs. They don't force you to allocate a 150k list/array before you start searching. IMHO this is a great advantage in your scenario.

Here's an off-the-wall solution, devoid of any testing at all. It also depends on having a filesystem that supports symbolic links. This isn't a Java solution. I suspect your problem is filesystem/OS-related, not Java-related.
Is it possible to create a parallel directory structure, with subdirectories based on the initial letters of the filenames, and then symbolically link to the real files? An illustration:
/symlinks/a/b/cde
would link to
/realfiles/abcde
(where /realfiles is where your 150,000 files reside)
You'd have to create and maintain this directory structure, and I don't have enough info to determine if that's practical. But the above would create a fast(er) index into your non-hierarchical (and slow) directory.
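A rough sketch of building that parallel symlink tree with NIO; the paths are placeholders and error handling is omitted:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BuildSymlinkIndex {
    public static void main(String[] args) throws IOException {
        Path realFiles = Paths.get("/realfiles");   // the flat 150,000-file directory
        Path symlinkRoot = Paths.get("/symlinks");  // the tiered index being built

        try (DirectoryStream<Path> stream = Files.newDirectoryStream(realFiles)) {
            for (Path file : stream) {
                String name = file.getFileName().toString();
                if (name.length() < 3) {
                    continue; // skip names too short for this toy scheme
                }
                // /symlinks/<first letter>/<second letter>/<rest of the name>
                Path linkDir = symlinkRoot.resolve(name.substring(0, 1))
                                          .resolve(name.substring(1, 2));
                Files.createDirectories(linkDir);
                Files.createSymbolicLink(linkDir.resolve(name.substring(2)), file);
            }
        }
    }
}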

There is also a recursive parallel scanning approach at http://blogs.oracle.com/adventures/entry/fast_directory_scanning. Essentially, siblings are processed in parallel. There are also encouraging performance tests there.

Maybe you could write a directory-searching program in C#/C/C++ and use JNI to get it into Java. I don't know whether this would improve performance or not.

Well, either JNI, or, if you say your deployment is constant, just run "dir" on Windows or "ls" on *nixes, with appropriate flags to list only directories (Runtime.exec())

In that case you might try some JNA solution - a platform-dependent directory traverser (FindFirst, FindNext on Windows) with the possibility of some iteration pattern. Also, Java 7 will have much better file system support; it's worth checking out the specs (I don't remember any specifics).
Edit: An idea: one option is to hide the slowness of the directory listing from the user's eyes. In a client-side app, you could use some animation while the listing is working to distract the user. It really depends on what else your application does besides the listing.

As of 2020, the DirectoryStream does seem to be faster than using File.listFiles() and checking each file with isDirectory().
I learned the answer from here:
https://www.baeldung.com/java-list-directory-files
I'm using Java 1.8 on Windows 10.

Related

How can I improve the run-time complexity of my method?

I wrote a function in Java that edits file names, replacing each space character with a dash character.
Currently I iterate over all the files in a specific directory, iterate over each file name, create a new file name, and replace the file in the directory.
I guess that the current complexity is O(N*M) {N = number of files in the directory, M = number of characters in each file name}.
Can anyone help me improve the run-time complexity?
Thanks
public static void editSpace(String source, String target) {
    // Source directory where all the files are
    File dir = new File(source);
    File[] directoryListing = dir.listFiles();
    // Iterate over each file in the directory
    for (File file : directoryListing) {
        String childName = file.getName();
        String childNameNew = "";
        // Iterate over the file name and change every space char to a dash char
        for (int i = 0; i < childName.length(); i++) {
            if (childName.charAt(i) == ' ') {
                childNameNew += "-";
            } else {
                childNameNew += childName.charAt(i);
            }
        }
        // Build the new path of the child under the target directory
        String childDir = target + "\\" + childNameNew;
        // Renaming the file and moving it to the new location
        if (!(childNameNew.equals(""))
                && (file.renameTo(new File(childDir)))) {
            // renameTo already moved the file; this delete on the old path is a no-op
            file.delete();
            // Print message
            System.out.println(childName + " File moved successfully to "
                    + childDir);
        }
        // Moving failed
        else {
            // Print message
            System.out.println(childName + " Failed to move the file to "
                    + childDir);
        }
    }
}
I guess that the current complexity is O(N*M) {N = number of files in directory, M = number of chars in each file}. Can anyone help me improve the run-time-complexity?
Nobody can. You figured it out yourself: when your task is to modify N file names that each have about M characters to read (or modify), then you end up with N×M. There is no conceptual way to modify N file names based on their current names without looking at each file name and at each character in it.
But what is possible: look carefully at your code, and see if you can improve the actual implementation.
You should start by relying much more on library methods. For example, there is String.replace(), which allows you to turn all spaces into dashes with a single call. That shouldn't affect performance, but it simplifies your own code (having less code is mostly a good thing!). You could go one step further and look at streams to use even less code, see here.
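For illustration, the whole inner character loop above collapses to a single call (a sketch):
String childNameNew = childName.replace(' ', '-');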
But the real answer here: you are probably doing premature optimisation. In the end, you are talking about something where the JVM needs to go through the OS in order to make changes out there in the file system. There are zillions of aspects that influence the overall, end-to-end performance of such a use case. It might be helpful to have more than one thread, so that you can "process" file names from different directories in parallel, for example.
On the other hand: creating a thread is a costly operation. And typically, it only helps speed up CPU-intensive activities. Worse, multiple threads accessing the file system like that in parallel ... might actually slow things down, overall.
Meaning: depending on your overall setup, you might be able to speed up renaming files. Or not.
In the end, you are spending a lot of time and energy here. And the real question: is it really worth it?! Does it really matter to you whether your code will need 500 ms, or 1 sec, or 2 seconds? Depending on context it might, but maybe: it doesn't. That is the first thing to clarify. And when you figure that you really need the highest performance solution here, then you will have to invest real time into measuring what is going on, and doing experiments to find out which setting affects performance the most.
In other words: if you really care about performance here, you have a lot of low-level details to look at. If you don't care about performance that much, I would throw away the Java code and write 3 lines of Python code, or Kotlin, or whatever you normally use for scripting, and go with that. Not because that code will be faster, but because it will be easier to read, write, and maintain. Because that is what matters when performance isn't your primary priority.

Merge 2 large csv files using inner join

I need advice from someone who knows Java and its memory issues very well. I have large CSV files (something like 500 MB each) and I need to merge these files into one using only 64 MB of Xmx. I've tried to do it in different ways, but nothing works - I always get a memory exception. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it takes a lot of memory, so it can't fit in 64 MB.
public class ImprovedInnerJoin {
    public static void main(String[] args) throws IOException {
        RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel = resultFile.getChannel().position(0);
        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);
        while (secondChannel.position() != secondChannel.size()) {
            Map<String, List<String>> table2Part = new HashMap();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1)
                    break;
                secondBuffer.rewind();
                String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
                if (!table2Part.containsKey(table2Tuple[0]))
                    table2Part.put(table2Tuple[0], new ArrayList());
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }
            Set<String> taple2keys = table2Part.keySet();
            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
                for (String table2key : taple2keys) {
                    if (table1Tuple[0].equals(table2key)) {
                        for (String value : table2Part.get(table2key)) {
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value; // 0,14 or result buffer will be overflown
                            resultBuffer.put(result.getBytes());
                            resultBuffer.rewind();
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }
        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}
A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Setup n file writers
For each of your files that you want to join and for each line:
take the hashcode of the key you want to join on
compute the hashcode modulo n; that will give you k
append your CSV line to the k-th file writer
Flush/Close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
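A minimal sketch of the partitioning pass, under the assumption that the join key is the first comma-separated column; the file names, bucket count and class name are made up:
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class HashPartitioner {
    // Splits one CSV file into n bucket files, keyed on the first column.
    static void partition(File input, int n, String prefix) throws IOException {
        List<PrintWriter> writers = new ArrayList<>();
        for (int k = 0; k < n; k++) {
            writers.add(new PrintWriter(new BufferedWriter(
                    new FileWriter(prefix + "_bucket" + k + ".csv"))));
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.split(",", 2)[0];
                int k = Math.floorMod(key.hashCode(), n); // same key -> same bucket
                writers.get(k).println(line);
            }
        } finally {
            for (PrintWriter w : writers) {
                w.close();
            }
        }
    }
}
After partitioning both inputs with the same n and the same hash function, bucket k of file A only needs to be joined against bucket k of file B, each of which should now fit in memory.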
Limitations
Why did I mention "hopefully smaller" files? Well, it depends on the distribution of the keys and their hashcodes. Think of the worst case: all of your records have the exact same key - you get only one bucket file and you didn't gain anything from the partitioning.
Similarly, with skewed distributions, a few of your bucket files will sometimes be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fall back to an external merge sort on the big partition files.
Sometimes all three are used in different combinations, which is called dynamic partitioning.
If central memory is a constraint for your application but you can access a persistent file, I would, as suggested by blahfunk, create a temporary SQLite file in your tmp folder, read every file in chunks and merge them with a simple join. You could create a temporary SQLite DB with the help of libraries such as Hibernate; just take a look at what I found in this StackOverflow question: How to create database in Hibernate at runtime?
If you cannot perform such a task, your remaining option is to consume more CPU and load just the first row of the first file, search for a row with the same index in the second file, buffer the result and flush it as late as possible to the output file, repeating this for every row of the first file.
Maybe you can stream the first file, turn each line into a hashcode and save all those hashcodes in memory. Then stream the second file and make a hashcode for each line as it comes in. If the hashcode is in the first file, i.e. in memory, then don't write the line, else write the line. After that, append the first file in its entirety to the result file.
This would effectively create an index to compare your updates against.

How to autodetect the file location in Java?

I just wondered if there are any ways to let your program find a file you want to use by just giving the name without writing the whole search path like this.
Scanner betalningsservice = new Scanner(new File("/afs/nada.kth.se/home/i/u1vxrjgi/betalningsservice.txt"));
String line1 = betalningsservice.nextLine();
You can see that it's a pretty long path, and I would like the program to be able to detect the file "betalningsservice.txt" wherever the file is located on the computer (in case the file has been moved somewhere else). Any tips guys? :)
Thanks in advance
Since there's some debate about what exactly is wanted in this question, I'll post another answer.
If you're using Java 8, finding a file is made somewhat simpler by using the Files.find function. It has the advantage of being able to limit how deep the search goes, keeping search speed under control. Here's an example that sticks the Paths of all matching files into a List. If you find more than one matching file you can have the user choose the right one:
final String SEARCH_FILE = "betalningsservice.txt"; // the file you're looking for
final String SEARCH_ROOT = "/afs/nada.kth.se/home/i/"; // where to start the search (top folder)
final int SEARCH_DEPTH = 4; // how many nested subfolders to delve into
final List<Path> files = new LinkedList<>();
Files.find(Paths.get(SEARCH_ROOT), SEARCH_DEPTH, (p, a) -> p.endsWith(SEARCH_FILE))
.forEach(e -> files.add(e));
It's debatable whether one big-ass statement that does all of the logic of the search is more readable or less readable, but that's Java 8 for you.
If you want to get advanced, you can also append FileVisitOptions to the find function's parameter list (for example, to follow symbolic links).
It's interesting to note that in Java 8, Path has generally replaced File as the way to represent files and folders, hence the List of Paths. Once you've selected the correct Path (we'll say it's in a variable called path), you can use it similarly to how you would use a File:
Scanner betalningsservice = new Scanner(path);
The rest is as before.
File can also create files using a relative path. Just don't start the file name with a slash. For example, if you run the program from the folder "/afs/nada.kth.se/home/i/u1vxrjgi/", you can just use:
new File("betalningsservice.txt")
...and that will get you the file you want.
There is not a practical/reliable way to just find the file anywhere on the computer.
You can, however, utilize relative file paths if you know your working directory. So if your working directory were /afs/nada.kth.se/home/i/u1vxrjgi, you could refer to the file just by new File("betalningsservice.txt"). Similarly, if your working directory were /afs/nada.kth.se/home/i, you could refer to the file as new File("u1vxrjgi/betalningsservice.txt").
Another option would be to read from the classpath. This can be accomplished by having the classloader look up a resource.
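A short sketch of that classpath approach; it assumes the file has been placed on the classpath (e.g. next to the compiled classes or in a resources folder), and the class name is made up:
import java.io.InputStream;
import java.util.Scanner;

public class ClasspathLookup {
    public static void main(String[] args) {
        // Resolves the file relative to the classpath, wherever the program is run from.
        InputStream in = ClasspathLookup.class.getClassLoader()
                .getResourceAsStream("betalningsservice.txt");
        if (in == null) {
            throw new IllegalStateException("betalningsservice.txt not found on the classpath");
        }
        Scanner betalningsservice = new Scanner(in);
        String line1 = betalningsservice.nextLine();
        System.out.println(line1);
    }
}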

How to determine if a file will be logically moved or physically moved

The facts:
When a file is moved, there are two possibilities:
The source and destination are on the same partition and only the file system index is updated
The source and destination are on two different file systems and the file needs to be copied byte by byte (aka copy on move)
The question:
How can I determine whether a file will be moved logically or physically?
I'm transferring large files (700+ MB) and would adopt a different behavior for each situation.
Edit:
I've already coded a moving-file dialog with a worker thread that performs the blocking IO calls to copy the file a meg at a time. It provides information to the user, like a rough estimate of the remaining time and the transfer rate.
The problem is: how do I know whether the file can be moved logically before trying to move it physically?
On Linux or other *nixes, call stat() on the source and destination directories and compare their st_dev values. If they are the same, a logical move can be performed; otherwise a physical copy+delete must be performed.
On Windows, you can call GetFileInformationByHandle() on handles to the two directories and compare their dwVolumeSerialNumber values. Note that this requires Windows 2000 or later.
I see you're using Java -- there must be some portal through which you can access this OS-level info (perhaps JNI?)
OK, I'm onto something :)
Using JNA I am able to call the Win32 API (and the *nix API too) from Java.
I tried calling GetFileInformationByHandle and did get a result, BUT the dwVolumeSerialNumber attribute always equals 0 (tried with my C: and D: drives).
Then I saw this function on MSDN: MoveFileEx. When the flags parameter is set to 0, the copy-on-move feature is disabled. AND IT WORKS !!!!
So I will simply call
if (!Kernel32.INSTANCE.MoveFileEx(source.getAbsolutePath(), destination.getAbsolutePath(), 0)) {
System.out.println("logical move failed");
}
Here is the code to put in the Kernel32.java interface (this file can be found in the src.zip package in the download section of the JNA site):
boolean MoveFileEx(String lpExistingFileName, String lpNewFileName, int dwFlags);
int MOVEFILE_REPLACE_EXISTING = 0x01;
int MOVEFILE_COPY_ALLOWED = 0x02;
int MOVEFILE_CREATE_HARDLINK = 0x04;
int MOVEFILE_WRITE_THROUGH = 0x08;
int MOVEFILE_DELAY_UNTIL_REBOOT = 0x10;
int MOVEFILE_FAIL_IF_NOT_TRACKABLE = 0x20;
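Not part of the original answers, but if Java 7+ is available, a pure-Java variant of the same "attempt the cheap move and fail instead of copying" trick is Files.move with ATOMIC_MOVE (a sketch):
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class MoveHelper {
    // Returns true if the file could be moved without a byte-by-byte copy.
    static boolean tryLogicalMove(Path source, Path destination) throws IOException {
        try {
            // ATOMIC_MOVE refuses to silently degrade into copy + delete across file systems.
            Files.move(source, destination, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (AtomicMoveNotSupportedException e) {
            return false; // the caller can fall back to a monitored physical copy
        }
    }
}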
