What is a more efficient way to retrieve the hard link count of a file, one that scales to high numbers of files?
I'm writing an application that scans all files on a volume to draw a graph. It is similar to a freeware program inconveniently called Scanner, which does take hard links into account and scans really fast, faster than I can achieve in Java without even checking hard links.
I already tried checking the hard link count in the following (slow) ways:
(Examples are very simplified for readability)
Via the command-line program stat (on Windows):
Process process = Runtime.getRuntime().exec(new String[]{"stat", "--printf=%h", "\"" + filePath + "\""});
BufferedReader in = new BufferedReader(new InputStreamReader(process.getInputStream()));
String inpMsg = in.readLine(); // stat prints only the hard link count
linkCount = Integer.parseInt(inpMsg);
and using a JNI call to GetFileInformationByHandle:
String lpFileName = filePath;
int dwDesiredAccess = 0; // 0 = query metadata only; this declaration was missing from the simplified snippet
int dwShareMode = Kernel32.FILE_SHARE_READ | Kernel32.FILE_SHARE_WRITE;
Pointer lpSecurityAttributes = null;
int dwCreationDisposition = Kernel32.OPEN_EXISTING;
int dwFlagsAndAttributes = 0;
int hTemplateFile = 0;
hFile = Kernel32.INSTANCE.CreateFile(lpFileName, dwDesiredAccess, dwShareMode, lpSecurityAttributes, dwCreationDisposition, dwFlagsAndAttributes, hTemplateFile);
Memory lpFileInformation = new Memory(56);
Kernel32.INSTANCE.GetFileInformationByHandle(hFile, lpFileInformation);
linkCount = lpFileInformation.getInt(40); // nNumberOfLinks sits at byte offset 40 in BY_HANDLE_FILE_INFORMATION
Kernel32.INSTANCE.CloseHandle(hFile); // don't leak the handle
To give an idea of why I want a faster method, here is a list of how fast different processes can iterate over all files on my C: drive (170000 files):
Alt+Enter on C:\: 19000 files per second (9 seconds)
Scanner (mentioned above): 7800 files per second (22 seconds)
Java (no Hard links): 1750 files per second (98 seconds)
Java (with JNI): 40 files per second (1:10 hours (projected))
Java (with STAT): 8 files per second (5:50 hours (projected))
The fact that Java is slower than Scanner might have to do with my using File.listFiles() instead of the new FileVisitor, but I won't accept a speed of 40 files/second, which is 43 times slower than without hard links.
(I ran these tests after having scanned several times already; the first scan always takes several times longer.)
Related
I want to create an application that shows the user how many times he has opened or used the software. For this I created the code below, but it does not show the correct output: the first time I run the application it shows 1, and the second time I run it it also shows 1.
public Founder() {
initComponents();
int c=0;
c++;
jLabel1.setText(""+c);
return;
}
I’m unsure whether I’m helping you or giving you a load of new problems and unanswered questions. The following will store the count of times the class Founder has been constructed in a file called useCount.txt in the program’s working directory (probably the root binary directory, where your .class files are stored). Next time you run the program, it will read the count from the file, add 1 and write the new value back to the file.
static final Path counterFile = FileSystems.getDefault().getPath("useCount.txt");
public Founder() throws IOException {
initComponents();
// read use count from file
int useCount;
if (Files.exists(counterFile)) {
List<String> line = Files.readAllLines(counterFile);
if (line.size() == 1) { // one line in file as expected
useCount = Integer.parseInt(line.get(0));
} else { // not the right file, ignore lines from it
useCount = 0;
}
} else { // program has never run before
useCount = 0;
}
useCount++;
jLabel1.setText(String.valueOf(useCount));
// write new use count back to file
Files.write(counterFile, Arrays.asList(String.valueOf(useCount)));
}
It’s neither the most elegant nor the most robust solution, but it may get you started. If you run the program on another computer, it will not find the file and will start counting from scratch.
When you run your code the first time, the data related to it is stored in your system's RAM. When you close your application, all of that data is deleted from RAM (for simplicity let's just assume it is deleted, although in reality it is a little different).
When you open your application the second time, new data is stored in RAM. This new data contains the starting state of your code, so the value of c is set to 0 (c=0).
If you want to remember the data, you have to store it in permanent storage (your system's hard drive, for example). But I think you are a beginner, and these concepts are fairly advanced; you should do some basic programming practice before trying such things.
Here you need to store it in permanent storage.
Refer to the Properties class to store data permanently: https://docs.oracle.com/javase/7/docs/api/java/util/Properties.html
You can also use data files, e.g. *.txt or *.csv.
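For example, a minimal sketch using java.util.Properties (the file name useCount.properties and the property key are illustrative, not from the question):
import java.io.*;
import java.util.Properties;

public class UseCounter {
    public static void main(String[] args) throws IOException {
        File file = new File("useCount.properties"); // illustrative file name
        Properties props = new Properties();
        if (file.exists()) {
            try (FileInputStream in = new FileInputStream(file)) {
                props.load(in); // read the previously stored count
            }
        }
        int useCount = Integer.parseInt(props.getProperty("useCount", "0")) + 1;
        props.setProperty("useCount", String.valueOf(useCount));
        try (FileOutputStream out = new FileOutputStream(file)) {
            props.store(out, "application use count"); // write the new count back
        }
        System.out.println("This program has been run " + useCount + " times");
    }
}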
Serialization also provides a way to achieve persistent storage.
You can create a class that implements Serializable with a field for each piece of data you want to store. Then you can write the entire class out to a file, and you can read it back in later. Learn about serialization here: https://www.tutorialspoint.com/java/java_serialization.htm
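A rough sketch of the serialization approach (the class and file names are illustrative):
import java.io.*;

// Hypothetical holder for the data you want to persist.
class UsageData implements Serializable {
    private static final long serialVersionUID = 1L;
    int useCount;
}

public class SerializationCounter {
    public static void main(String[] args) throws IOException, ClassNotFoundException {
        File file = new File("useCount.ser"); // illustrative file name
        UsageData data = new UsageData();
        if (file.exists()) {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                data = (UsageData) in.readObject(); // read the stored object back in
            }
        }
        data.useCount++;
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(data); // write the whole object out
        }
        System.out.println("use count: " + data.useCount);
    }
}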
I have some intermediate data that I need to store in HDFS and locally as well. I'm using Spark 1.6. In HDFS as the intermediate form I'm getting data in /output/testDummy/part-00000 and /output/testDummy/part-00001. I want to save these partitions locally using Java/Scala, so that I could save them as /users/home/indexes/index.nt (by merging both locally) or as /users/home/indexes/index-0000.nt and /home/indexes/index-0001.nt separately.
Here is my code:
Note: testDummy is the same as test; the output has two partitions. I want to store them separately, or combined locally in an index.nt file. I would prefer to store them separately on two data nodes. I'm using a cluster and submit the Spark job on YARN. I also added some comments about how many times code runs and what data I'm getting. How can I do this? Any help is appreciated.
val testDummy = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).saveAsTextFile(outputFilePathForHDFS+"/testDummy")
println("testDummy done") //1 time print
def savesData(iterator: Iterator[(String)]): Iterator[(String)] = {
println("Inside savesData") // now 4 times when coalesce(Constants.INITIAL_PARTITIONS)=2
println("iter size"+iterator.size) // 2 735 2 735 values
val filenamesWithExtension = outputPath + "/index.nt"
println("filenamesWithExtension "+filenamesWithExtension.length) //4 times
var list = List[(String)]()
val fileWritter = new FileWriter(filenamesWithExtension,true)
val bufferWritter = new BufferedWriter(fileWritter)
while (iterator.hasNext){ //iterator.hasNext is false
println("inside iterator") //0 times
val dat = iterator.next()
println("datadata "+iterator.next())
bufferWritter.write(dat + "\n")
bufferWritter.flush()
println("index files written")
val dataElements = dat.split(" ")
println("dataElements") //0
list = list.::(dataElements(0))
list = list.::(dataElements(1))
list = list.::(dataElements(2))
}
bufferWritter.close() //closing
println("savesData method end") //4 times when coal=2
list.iterator
}
println("before saving data into local") //1
val test = outputFlatMapTuples.coalesce(Constants.INITIAL_PARTITIONS).mapPartitions(savesData)
println("testRDD partitions "+test.getNumPartitions) //2
println("testRDD size "+test.collect().length) //0
println("after saving data into local") //1
PS: I followed this and this, but they are not exactly what I'm looking for. I did it somehow, but I'm not getting anything in index.nt.
A couple of things:
Never call Iterator.size if you plan to use the data later. Iterators are TraversableOnce: the only way to compute an Iterator's size is to traverse all of its elements, and after that there is no more data to be read.
Don't use transformations like mapPartitions for side effects; it is bad practice and doesn't guarantee that a given piece of code will be executed only once. If you want to perform some kind of IO, use actions like foreach / foreachPartition.
A local path inside an action or transformation is a local path on a particular worker. If you want to write directly on the client machine, you should first fetch the data with collect or toLocalIterator. It could be better, though, to write to distributed storage and fetch the data later.
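For the simplest case, a minimal sketch of the collect-on-the-driver approach (written here with the Java API; the RDD variable, local path, and charset are assumptions, not taken from the question):
import org.apache.spark.api.java.JavaRDD;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LocalIndexWriter {
    // Fetch all lines to the driver and write one merged local file.
    // Only safe if the data fits in driver memory.
    static void writeIndexLocally(JavaRDD<String> lines, String localPath) throws IOException {
        List<String> collected = lines.collect(); // brings the data to the client machine
        Files.write(Paths.get(localPath), collected, StandardCharsets.UTF_8);
    }
}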
Java 7 provides means to watch directories.
https://docs.oracle.com/javase/tutorial/essential/io/notification.html
The idea is to create a watch service and register it with the directory of interest (specifying the events you care about, such as file creation and deletion). You will then be notified of those events and can take whatever action you want; see the sketch below.
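A minimal WatchService sketch (the watched directory path is just an example):
import java.nio.file.*;

public class DirWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("/users/home/indexes"); // example directory to watch
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher,
                StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY,
                StandardWatchEventKinds.ENTRY_DELETE);
        while (true) {
            WatchKey key = watcher.take(); // blocks until an event arrives
            for (WatchEvent<?> event : key.pollEvents()) {
                System.out.println(event.kind() + ": " + event.context());
                // react to the event here (e.g. copy the new part-* file somewhere)
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}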
You will have to depend heavily on the Java HDFS API wherever applicable.
Run the program in the background, since it waits for events forever. (You can write logic to quit after you have done whatever you want.)
On the other hand, shell scripting will also help.
Be aware of the coherency model of the HDFS file system while reading files.
Hope this gives you some ideas.
I'm trying to write a program that adds every single file and folder name on my C: drive to an ArrayList. The code works fine, but because of the massive amount of recursion, it gets painfully slow. Here is the code:
public static void updateFileDataBase()
{
ArrayList<String> currentFiles = new ArrayList<String>();
addEverythingUnder("C:/",currentFiles,new String[]{"SteamApps","AppData"});
for(String name : currentFiles)
System.out.println(name);
}
private static void addEverythingUnder(String path, ArrayList<String> list, String[] exceptions)
{
System.gc();
System.out.println("searching " + path);
File search = new File(path);
try
{
for(int i = 0; i < search.list().length; i++)
{
boolean include = true;
for(String exception : exceptions)
if(search.list()[i].contains(exception))
include = false;
if(include)
{
list.add(search.list()[i]);
if(new File(path + "/" + search.list()[i]).isDirectory())
{
addEverythingUnder(path + "/" + search.list()[i],list,exceptions);
}
}
}
}
catch(Exception error)
{
System.out.println("ACCESS DENIED");
}
}
I was wondering if there was anything at all that I could do to speed up the process. Thanks in advance :)
Program slowing down due to recursion
No it isn't. Recursion doesn't make things slow. Poor algorithms and bad coding make things slow.
For example, you are calling File.list() four times for every file you process, as well as once per directory. You can save a factor of O(N) by doing it once per directory:
for(File file : search.listFiles())
{
String name = file.getName();
boolean include = true;
for(String exception : exceptions)
if(name.contains(exception))
include = false;
if(include)
{
list.add(name);
if(file.isDirectory())
{
addEverythingUnder(file,list,exceptions);
}
}
}
There is (as of Java 7) a built-in way to do this, Files.walkFileTree, which is much more efficient and removes the need to reinvent the wheel. It calls into a FileVisitor for every entry it finds. There are a couple of examples on the FileVisitor page to get you started.
Is there a particular reason for reinventing the wheel?
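For instance, a minimal sketch of the Files.walkFileTree approach, reusing the exclusion list from the question:
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class TreeLister {
    public static void main(String[] args) throws IOException {
        final List<String> names = new ArrayList<String>();
        final String[] exceptions = {"SteamApps", "AppData"};
        Files.walkFileTree(Paths.get("C:/"), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                String name = dir.getFileName() == null ? dir.toString() : dir.getFileName().toString();
                for (String exception : exceptions) {
                    if (name.contains(exception)) {
                        return FileVisitResult.SKIP_SUBTREE; // prune excluded directories entirely
                    }
                }
                names.add(name);
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                names.add(file.getFileName().toString());
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE; // skip unreadable entries instead of aborting
            }
        });
        for (String name : names) {
            System.out.println(name);
        }
    }
}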
If you don't mind, please use
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFiles(java.io.File, java.lang.String[], boolean)
because of the massive amount of recursion, it gets painfully slow
While your code is very inefficient, as EJP suggests, I suspect the problem is even more basic. When you access a large number of files, it takes time to read them from disk the first time (reading the same data again, and again, is much quicker because it is cached). Opening files is also pretty slow on an HDD.
A typical HDD has a seek time of 8 ms; if finding and opening a file takes two operations, then you are looking at 16 ms per file. Say you have 10,000 files: this will take at least 160 seconds, no matter how efficient you make the code. BTW, if you use a decent SSD, this will take about 1 second.
In short, you are likely to be hitting a hardware limit which has nothing to do with how you wrote your software. Shorter: Don't have large numbers of files if you want performance.
I am doing some performance tests of an HTML stripper (written in Java); that is to say, I pass a string (actually HTML content) to a method of the HTML stripper, and the latter returns plain text (without HTML tags and meta information).
Here is an example of the concrete implementation
public void performanceTest() throws IOException {
long totalTime = 0;
File file = new File("/directory/to/ten/different/htmlFiles");
for (int i = 0; i < 200; ++i) {
for (File fileEntry : file.listFiles()) {
HtmlStripper stripper = new HtmlStripper();
URL url = fileEntry.toURI().toURL();
InputStream inputStream = url.openStream();
String html = IOUtils.toString(inputStream, "UTF-8");
inputStream.close(); // close the stream; reading it is not part of the measured work
long start = System.currentTimeMillis();
String text = stripper.getText(html);
long end = System.currentTimeMillis();
totalTime = totalTime + (end - start);
//The duration for the stripping of each file is computed here
// (200 times for each time). That duration value decreases and then becomes constant
//IMHO if the duration for the same file should always remain the same.
//Or is a cache technique used by the JVM?
System.out.println("time needed for stripping current file: "+ (end -start));
}
}
System.out.println("Average time for one document: "
+ (totalTime / 2000));
}
But the duration for stripping each file is computed 200 times, and the value keeps decreasing before becoming constant. IMHO the duration for one and the same file X should always remain the same!? Or is some caching technique used by the JVM?
Any help would be appreciated.
Thanks in advance
Horace
N.B:
- I am doing the tests locally (NO remote, NO HTTP) on my machine.
- I am using Java 6 on Ubuntu 10.04
This is totally normal. The JIT compiles methods to native code and optimizes them more heavily as they're more and more heavily used. (The "constant" your benchmark eventually converges to is the peak of the JIT's optimization capabilities.)
You cannot get good benchmarks in Java without running the method many times before you start timing at all.
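For example, a rough warm-up sketch (assuming the same HtmlStripper class and an html string that has already been loaded, as in the question):
HtmlStripper stripper = new HtmlStripper();
// Warm-up phase: run the method many times so the JIT can compile and optimize it.
for (int i = 0; i < 10000; i++) {
    stripper.getText(html);
}
// Measurement phase: timings now reflect steady-state, JIT-compiled performance.
long start = System.nanoTime();
String text = stripper.getText(html);
long elapsedMicros = (System.nanoTime() - start) / 1000;
System.out.println("steady-state strip time: " + elapsedMicros + " µs");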
IMHO if the duration for one and the same file X should always remain the same
Not in the presence of an optimizing just-in-time compiler. Among other things, it keeps track of how many times a particular method/branch is used, and selectively compiles Java bytecode into native code.
I do have a problem with millis set and read on Android 2.3.4 on a Nexus One. This is the code:
File fileFolder = new File(Environment.getExternalStorageDirectory(), appName + "/"
+ URLDecoder.decode(folder.getUrl()));
if (fileFolder != null && !fileFolder.exists()) {
fileFolder.setLastModified(1310198774);
fileFolder.mkdirs();
fileFolder.setLastModified(1310198774);
}
if (fileFolder != null && fileFolder.exists()) {
long l = fileFolder.lastModified();
}
In this small test I write 1310198774 but the result that is returned from lastModified() is 1310199771000.
Even if I cut the trailing "000" there's a difference of several minutes.
I need to sync files between a web service and the Android device. The last-modification millis are part of the data sent by this service. I set the millis on the created/copied files and folders to check whether the file/folder needs to be overwritten.
Everything is working BUT the millis that are returned from the filesystem are different from the values that were set.
I'm pretty sure there's something wrong with my code - but I can't find it.
Many thanks in advance.
HJW
On Jelly Bean+ it's different (mostly on Nexus devices so far, and others that use the new FUSE layer for /mnt/shell/emulated sdcard emulation):
It's a VFS permission problem, the syscall utimensat() fails with EPERM due to inappropriate permissions (e.g. ownership).
in platform/system/core/sdcard/sdcard.c:
/* all files owned by root.sdcard */
attr->uid = 0;
attr->gid = AID_SDCARD_RW;
From utimensat()'s syscall man page:
2. the caller's effective user ID must match the owner of the file; or
3. the caller must have appropriate privileges.
To make any change other than setting both timestamps to the current time
(i.e., times is not NULL, and both tv_nsec fields are not UTIME_NOW and both
tv_nsec fields are not UTIME_OMIT), either condition 2 or 3 above must apply.
Old FAT offers an override of the iattr->valid flag via a mount option to allow anyone to change timestamps; FUSE + Android's sdcard-FUSE don't do this at the moment (so the inode_change_ok() call fails) and the attempt gets rejected with -EPERM. Here's FAT's ./fs/fat/file.c:
/* Check for setting the inode time. */
ia_valid = attr->ia_valid;
if (ia_valid & TIMES_SET_FLAGS) {
if (fat_allow_set_time(sbi, inode))
attr->ia_valid &= ~TIMES_SET_FLAGS;
}
error = inode_change_ok(inode, attr);
I also added this info to this open bug.
So maybe I'm missing something but I see some problems with your code above. Your specific problem may be due (as #JB mentioned) to Android issues but for posterity, I thought I'd provide an answer.
First off, File.setLastModified() takes the time in milliseconds. Here are the javadocs. You seem to be trying to set it in seconds. So your code should be something like:
fileFolder.setLastModified(1310198774000L);
As mentioned in the javadocs, many filesystems only support seconds granularity for last-modification time. So if you need to see the same modification time in a file then you should do something like the following:
private void changeModificationFile(File file, long time) {
// round the value down to the nearest second
file.setLastModified((time / 1000) * 1000);
}
If this all doesn't work try this (ugly) workaround quoted from https://code.google.com/p/android/issues/detail?id=18624:
//As a workaround, this ugly hack will set the last modified date to now:
RandomAccessFile raf = new RandomAccessFile(file, "rw");
long length = raf.length();
raf.setLength(length + 1);
raf.setLength(length);
raf.close();
Works on some devices but not on others. Do not design a solution that relies on it working. See https://code.google.com/p/android/issues/detail?id=18624#c29
Here is a simple test to see if it works.
public void testSetLastModified() throws IOException {
long time = 1316137362000L;
File file = new File("/mnt/sdcard/foo");
file.createNewFile();
file.setLastModified(time);
assertEquals(time, file.lastModified());
}
If you only want to change the date/time of a directory to the current date/time (i.e., "now"), then you can create some sort of temporary file inside that directory, write something into it, and then immediately delete it. This has the effect of changing the lastModified() date/time of the directory to the present date/time. This won't work, though, if you want to change the directory date/time to some other arbitrary value, and it can't be applied to a file, obviously.
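A rough sketch of that trick (the method and names are illustrative):
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

public class DirTouch {
    // Bumps the directory's lastModified() to "now" by creating and deleting a temp file inside it.
    static void touchDirectory(File dir) throws IOException {
        File tmp = File.createTempFile("touch", ".tmp", dir);
        FileOutputStream out = new FileOutputStream(tmp);
        try {
            out.write(0); // write one byte so the change is actually flushed
        } finally {
            out.close();
        }
        if (!tmp.delete()) {
            tmp.deleteOnExit(); // fall back if the immediate delete fails
        }
    }
}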