How does FileInputStream find the file? - java

I'm designing a program that needs to read a file from a folder that contains roughly 10^8 files.
How does FileInputStream find the desired file from its provided filename? Does it work similarly to a HashMap with O(1) lookup time, or does it linearly traverse the files in the given folder until it finds a match?
I imagine this might have more to do with how Windows file storage works than with FileInputStream, but I'm honestly not sure.

The file name is passed to the OS, and the OS reads the directory looking for a matching name. It might optimise the lookup, but Java doesn't get involved.
You might consider breaking the files into multiple directories, and ideally using fewer files. Opening and closing lots of small files spends most of its time opening and closing file descriptors; the time spent finding and reading the data is usually much smaller.
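If you do split into multiple directories, one common approach is to shard by a hash of the file name, so the path for a given name can be computed directly instead of depending on one huge directory. A rough sketch, where the class name and directory layout are purely illustrative:

import java.io.File;

class ShardedStore {
    private final File root;
    private final int shards;

    ShardedStore(File root, int shards) {
        this.root = root;
        this.shards = shards;
    }

    // Derive a stable subdirectory from the file name so a lookup never
    // depends on how many files live in any single directory.
    File resolve(String fileName) {
        int bucket = Math.floorMod(fileName.hashCode(), shards);
        File dir = new File(root, String.format("%03d", bucket));
        dir.mkdirs();                      // create the shard lazily
        return new File(dir, fileName);
    }
}

With 1000 shards, 10^8 files still works out to about 100,000 per directory, so for that many files you would probably want two levels of sharding or fewer files overall.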

It asks the operating system to find the file. How does the operating system do it? It depends on the OS, and on the file system. In at least some cases, the answer is, "Yes, it works like a HashMap." On the other hand, I know of at least some OS/filesystem combinations that get seriously bogged down once you have more than a few thousand files in the same directory.

Related

Java monitor folder for files

I need to monitor a certain folder for new files, which I need to process.
I have the following requirements:
The filenames of the files are sequence numbers. I need to process each file in order (lowest number first; there's no guarantee that every sequence number exists, e.g. 1, 2, 5, 8, 9).
If files already exist in the folder during startup, I need to process them directly
I need a guarantee that I only process each file once
I need to avoid reading incomplete files (which are still being copied)
The service should of course be reliable...
What is the most common way to accomplish this?
I'm using Java SE7 and Spring 4.
I already had a look at the WatchService of Java 7, but it seems to have problems with processing already-existing files during startup and with avoiding incomplete files.
Assembling comments into an answer.
The easiest way to process the files in the correct order is to load the entire directory listing into an array / list and then sort it using an appropriate comparator, e.g. load the files with File.list() or File.listFiles().
This is not the most efficient methodology, but for fewer than 10,000 files it should be adequate unless you need faster startup performance (I can imagine a small lag before processing begins while all of the files are listed).
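For example, something along these lines (Java 7 style; the assumption that each file name is just the sequence number is mine):

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

class SequenceScanner {
    // Lists the directory once and sorts by numeric sequence number,
    // assuming each file name is just the number (e.g. "1", "5", "9").
    static File[] listInSequenceOrder(File dir) {
        File[] files = dir.listFiles();
        if (files == null) {
            return new File[0];            // not a directory or not readable
        }
        Arrays.sort(files, new Comparator<File>() {
            @Override
            public int compare(File a, File b) {
                return Long.compare(Long.parseLong(a.getName()),
                                    Long.parseLong(b.getName()));
            }
        });
        return files;                      // process lowest sequence number first
    }
}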
To avoid reading incomplete files you should acquire an exclusive FileLock on the file (via a FileChannel, which you can get from a FileOutputStream or FileInputStream; note that you may not be able to get an exclusive lock from a FileInputStream). Assuming the OS supports file locking (modern OSes do) and the application writing the file is well behaved and holds a lock while writing (hopefully it is), then as soon as you are able to acquire the lock you know the file is complete.
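A minimal sketch of that lock check, assuming the writer really does hold a lock while writing (the helper name is mine):

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

class LockCheck {
    // Returns true if we can grab an exclusive lock, i.e. the writer is done.
    // Exclusive locks need a writable channel, hence "rw" rather than a
    // FileInputStream.
    static boolean isComplete(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return false;              // writer still holds the lock, retry later
            }
            lock.release();
            return true;
        }
    }
}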
If for some reason you cannot rely on file locking, then either the writing program needs to write to a temporary file first (perhaps with a different extension) and then atomically move / rename it (atomic on most OSes if source and target are on the same file system / partition), or you need to monitor the file for a period of time to see whether further bytes are still being written (not the most robust methodology).
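The write-then-rename variant could look like this on the writing side (the ".part" suffix is just a convention picked for this sketch):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class AtomicPublish {
    // Write everything to a temporary name, then publish it in one atomic
    // step so the watching process never sees a half-written file.
    static void publish(byte[] data, Path target) throws IOException {
        Path tmp = Paths.get(target.toString() + ".part");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}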

To identify temporary files in the system using java

I want to know whether there is any file-related function which determines whether a file is a temporary or log file.
While running JNotify this becomes tedious, since temporary files are created and modified frequently, increasing the load and making JNotify unstable.
So if there is anything, please let me know.
Is there an operating system file attribute that tells you that the file is "temporary"? (You don't specify the operating system.) Without such an attribute, determining that a file is "temporary" by simply examining it (vs using some algorithm based on name/path) will be impossible.
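If a name/path heuristic is acceptable, a simple filter along these lines may be enough to ignore events for files you don't care about (the extension list is only a guess; adjust it to what your system actually produces):

import java.io.File;
import java.io.FileFilter;
import java.util.Locale;

class NotTemporaryFilter implements FileFilter {
    // Name-based heuristic only: treat common temp/log extensions and
    // prefixes as "temporary" and reject them.
    @Override
    public boolean accept(File f) {
        String name = f.getName().toLowerCase(Locale.ROOT);
        boolean looksTemporary = name.endsWith(".tmp") || name.endsWith(".log")
                || name.endsWith(".swp") || name.startsWith("~");
        return !looksTemporary;            // accept only files that don't look temporary
    }
}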

The number of blocks allocated to a sparse file

Is there any way to access the number of blocks allocated to a file with the standard Java File API? Or even with some unsupported and undocumented API underneath? Anything to avoid native code plugins.
I'm talking about the st_blocks field of struct stat that the fstat/stat syscalls work on in Unix.
What I want to do is create a sparse copy of a file that now has lots of redundant data, i.e. make a new copy of it containing only the active data, sparsely written. Then swap the two files with an atomic rename/link operation and remove the old file. But I need a way to find out beforehand how many blocks are allocated to the file, since it might already have been sparsely copied.
This will be used to free up disk space in a database application that is 100% Java. The benefit of relying on sparse file support in the filesystem is that I would not have to change the index that points to the location of the data, which would increase the complexity of the task at hand.
I think I can do reasonably well by relying on the file timestamp to see if files have already been cleaned up. But this intrigued me. I cannot even find anything in the Java 7 NIO.2 API for file attribute access at this level.
The only way I can think of is to use ls -s filename to get the actual size of the file on disk. http://www.lrdev.com/lr/unix/sparsefile.html
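If you are willing to shell out from Java, something like this could wrap that ls -s call (fragile and Unix-only, and the block unit depends on the platform, typically 1K blocks with GNU ls):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

class AllocatedBlocks {
    // Runs "ls -s <path>" and parses the leading block count from output
    // that looks like "8 data.bin".
    static long allocatedBlocks(String path) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("ls", "-s", path).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line = r.readLine();
            p.waitFor();
            if (line == null) {
                throw new IOException("no output from ls -s");
            }
            return Long.parseLong(line.trim().split("\\s+")[0]);
        }
    }
}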

How to decompress faster in Java?

Our system has a problem with too many files, which are used in a webapp that needs to be available all the time. That means the files cannot be deleted, and there are so many of them that the system (which is Windows) is slow. We would like to zip up the files, and when a file is requested, unzip that particular file.
I've tried the java ZipFile class, and the performance is not good enough, because there will be many people using the webapp and requesting files. From my observation, the unzipping action takes between 0.5 and 2 seconds, and when there are too many users the system cannot keep up with them.
For example, I've used JMeter to simulate a situation where 30 users use the system, with a random delay between 0.3 and 0.6 seconds. Although I doubt there will be that many requests, I cannot know in advance how many people will use the webapp. I would like to ask you guys: is there any other method to solve this problem?
Thanks in advance!!
P.S. If any 3rd party library is needed, it must be free!
P.S. Because the number of files is just too large, it hangs the machine. We would like to do this: zip up 2000 files into one zip file so the number of files decreases and hopefully the system won't hang anymore, and when needed, we unzip particular files.
Okay, here are some thoughts. It appears to me that your core problem is the slowness of your system and that you're trying to fix it by compressing the files and decompressing them on demand. Then you've found that the decompression is too slow and you need a faster way to do that.
Now I'm not entirely certain why you think this compression will speed things up instead of making things slower.
I would go back to the original problem and work more on solving that. Why is the number of files making your system slow? If you can figure that out, you can fix it in a way that doesn't involve things going even slower.
If it's an issue with too many files in a directory, think about splitting into multiple directories. But I have no idea whether NTFS even has that problem (FAT did). For example, if you have a directory with files for every minute of the last ten years (five million files), you can split them into day directories (three and a half thousand directories with fifteen hundred files in each).
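If the file names carry a timestamp, mapping them into per-day directories is only a few lines (the date format and layout here are just an illustration):

import java.io.File;
import java.text.SimpleDateFormat;
import java.util.Date;

class DayDirectories {
    // Puts each file under a yyyy-MM-dd directory, so a directory never
    // holds more than one day's worth (about 1440 minute files).
    static File targetFor(File root, Date timestamp, String fileName) {
        String day = new SimpleDateFormat("yyyy-MM-dd").format(timestamp);
        File dayDir = new File(root, day);
        dayDir.mkdirs();
        return new File(dayDir, fileName);
    }
}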
Compression won't reduce the number of files, just the space taken by them.
If it's an issue with the number of files on the system (rather than in a directory), there are plenty of ways to split files between systems as well. For example, hive off 10% of the entire file set to each of ten different machines and forward incoming requests for a specific file to the relevant machine.
But, I have to say, I've seen Windows machines handle absolute bucket-loads of files so I'd be very surprised if the problem lay there. I think you're probably just going to have to track down what's actually causing your "hangs".
Compressing/uncompressing the files will not make Windows faster.
If zip doesn't provide a performance gain (despite having a native implementation in Java), you can try to improve things at the filesystem level. Folders with too many (>10,000) files don't work well under some Windows filesystems, so try to divide the files into several folders, tune the NTFS filesystem (cluster size, reserved filesystem space), disable the antivirus, disable indexing, buy an SLC SSD...

What is the most efficient way to list all of the files in a directory (including sub-directories)?

I am writing a servlet which will examine a directory on the server (external to the web container) and recursively search for certain files (by certain files, I mean files of a certain extension and a certain naming convention). Once these files are found, the servlet responds with a long list of all of the found files (including the full path to each). My problem is that there are so many files and directories that my servlet runs extremely slowly. I was wondering if there is a best practice or existing servlet for this type of problem? Would it be more efficient to simply compile the entire list of files and do the filtering via js/jquery on the client side?
Disk access is slow and as the number of files and directories increases, you'll rapidly reach a point where your servlet will be useless when using the conventional recursive search through the directory tree. You'll especially reach this limit quickly if you have a large number of concurrent users performing the same search at the same time.
It's much better, instead, to use an external batch job to generate the list of files, which can then be read into the servlet through a database call or even by just parsing a file containing all the file names separated by a newline character. Using "find" on Linux is a simple way to do this, e.g.
find <path_to_directory> -name '*.bin' > list_files.txt
This would list every file name that ends with .bin in a particular directory and output it into a file named list_files.txt. Your servlet could then read in that file and create the list of files from there.
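Reading that list back in the servlet is then trivial; a sketch, where the path to list_files.txt is just wherever your batch job happens to write it:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

class FileListReader {
    // Reads the newline-separated list produced by the external "find" job.
    static List<String> loadFileList() throws IOException {
        return Files.readAllLines(Paths.get("/var/data/list_files.txt"),
                                  StandardCharsets.UTF_8);
    }
}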
If you really have loads of files, you might think about spawning an external process to do the searching. If you're running on a unix-like server (like linux), you might get speed gains by having the "find" command do the searching for you, and parse its output.
You can google for many examples of how to use "find".
I see two possible reasons why this process might be going slowly:
1) Disk I/O is taking too long. This'll be a real constraint that you can't do much about. Usually the operating system is pretty good at keeping structures in memory that allow it to find files in your folders much quicker. If it is too slow regardless, you might have to build an index yourself in memory. This all depends on how you're doing it.
In any case, if this is the issue (you can try measuring), then there's no way doing the filtering client side will help, as that shouldn't really take very long, no matter where you do it. Instead you're going to make the client slower by sending it more data to sort through.
2) There's something wrong with your directory traversal. You say it's "recursive". If you mean it's actually recursive, i.e. a method that calls itself whenever it encounters a new directory, then that might well be slowing you down (the overhead really adds up). There's some material about tree traversal on Wikipedia, but basically just use a queue or stack to keep track of where you are in the traversal, instead of using your method call state to do so.
Note that a file system isn't actually a tree, but I'm assuming that it is in this case. It gets a bit hairier otherwise.
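For what it's worth, an iterative traversal with an explicit stack looks something like this (the extension matching and names are just for illustration, and it does not guard against symlink cycles, which is the "not actually a tree" caveat above):

import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class IterativeSearch {
    // Depth-first traversal using an explicit stack instead of recursion.
    static List<File> findFiles(File root, String extension) {
        List<File> matches = new ArrayList<File>();
        Deque<File> stack = new ArrayDeque<File>();
        stack.push(root);
        while (!stack.isEmpty()) {
            File current = stack.pop();
            File[] children = current.listFiles();
            if (children == null) {
                continue;                  // not a directory or not readable
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    stack.push(child);     // NOTE: symlink loops are not detected
                } else if (child.getName().endsWith(extension)) {
                    matches.add(child);
                }
            }
        }
        return matches;
    }
}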
I don't agree with the other posters that you can't implement it in-process. It should work pretty well up to a certain point, no need for batch jobs just yet.
I think your servlet is slow because of hard drive speed. If the list of files is fairly static, you should load it into memory.
