I'm in a dilemma between the 'old' way and the 'new' faster 1.7 way of scanning directories.
I need to scan all directories on a drive and build a matching tree structure. That's no problem in 1.6 (except that it's 10 times slower), but with FileVisitor I'm running into some big hurdles.
How do I know beforehand how many items (files+subdirectories) a directory contains?
Old way: File[] files = path.listFiles(); and files.length is the answer.
New way: in the callback method public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes bfa) {}, where is the count?
Using a scalable array (ArrayList) for each subdirectory will definitely hurt both performance and the already large memory footprint, hence I need to use regular fixed-length arrays. An alternative I've been pondering is a reusable master array: once I know the length, copy its contents into a right-sized destination array. That conflicts with the recursive nature of the walk, though, and with the fact that directories and files are visited interleaved rather than grouped, so I'd need a master array for every recursion depth (with no known bound), unless I can make it walk directories first and then files, which my research says can't be done.
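The closest workaround I can think of is listing each directory a second time inside the callback, which of course doubles the directory listings. A rough sketch of what I mean (the class name is just for illustration):

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class CountingVisitor extends SimpleFileVisitor<Path> {
    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes bfa)
            throws IOException {
        int count = 0;
        // Re-list the directory just to learn its size up front.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path ignored : entries) {
                count++;                       // files + subdirectories
            }
        }
        // 'count' could now size a fixed-length array for this directory.
        return FileVisitResult.CONTINUE;
    }
}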
I would really question this assumption:
Using a scalable array (ArrayList) for each subdirectory will definitely hurt both performance and the already large memory footprint
What basis do you have for this? Note that your performance will likely be limited (or at least affected) by the speed of access to your filesystem.
I think (as for most questions of this nature) that you should try a simple, extensible solution and identify any issues for real, rather than making assumptions in advance.
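As a starting point, here is a minimal sketch of such a simple, extensible solution, building the tree with ArrayList children via Files.walkFileTree (class and field names are just for illustration), which you can then profile before deciding it is too slow:

import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class Node {
    final Path path;
    final List<Node> children = new ArrayList<>();
    Node(Path path) { this.path = path; }
}

class TreeBuilder extends SimpleFileVisitor<Path> {
    private final Deque<Node> stack = new ArrayDeque<>();
    Node root;

    @Override
    public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
        Node node = new Node(dir);
        if (root == null) {
            root = node;                       // first directory visited becomes the root
        } else {
            stack.peek().children.add(node);   // attach to the enclosing directory
        }
        stack.push(node);
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        stack.peek().children.add(new Node(file));
        return FileVisitResult.CONTINUE;
    }

    @Override
    public FileVisitResult postVisitDirectory(Path dir, IOException exc) {
        stack.pop();                           // done with this directory
        return FileVisitResult.CONTINUE;
    }
}

// usage: Files.walkFileTree(Paths.get("C:/"), new TreeBuilder());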
Related
I am using Java 7 but could upgrade to 8 if deemed worthwhile.
I am creating a program to compare files in 2 folders.
I am using Files.newDirectoryStream to extract the file data, although I still have to work out how to do that recursively, but I'm sure that shouldn't be too difficult.
So to the point, I will do the compare and then add missing items so they are in sync.
This means I will need to store:
1) the name &
2) the path
Therefore I will need something to store these in. I know I can do it with an Array[][], but is this the best way to do it, or are lists more efficient?
I imagine the largest folder to hold 200 files.
Thanks in advance.
You can use ArrayList<File>, which can be seen as a "wrapper around a plain array" (for the sake of simplicity and understanding). As each element in the list is of type File, you already have access to the path and name of the file and do not need to store them separately.
Of course the ArrayList has a bit more overhead than a simple array, but if you expect the largest folder to hold 200 files, it's not a big deal and nothing to worry about. Unless you run your program on a calculator ;)
If you can, you should first fetch the number of files in your current working directory and use that number as the initial size of the list:
int numberOfFiles = fetchFileCount(directory);
ArrayList<File> currentFiles = new ArrayList<>(numberOfFiles);
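And a rough sketch of the recursive walk with Files.newDirectoryStream mentioned in the question, collecting everything into one list (Path is used rather than File, since that is what the stream yields; error handling omitted):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

static List<Path> collectFiles(Path dir) throws IOException {
    List<Path> result = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
        for (Path entry : stream) {
            if (Files.isDirectory(entry)) {
                result.addAll(collectFiles(entry));  // recurse into subdirectories
            } else {
                result.add(entry);                   // the Path carries both name and path
            }
        }
    }
    return result;
}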
I have a huge file with a unique word on each line. The file is around 1.6 GB (and I have to sort other files of around 15 GB after this). Until now, for smaller files, I used Arrays.sort(), but for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this short of writing a complete quicksort or merge sort program myself?
I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for this?
If I have to write the sorting program myself, which one should I use: quicksort or merge sort? I'm worried about the worst case.
Depending on the structure of the data to store, you can do many different things.
In case of well-structured data, where you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that allows sorting. MongoDB comes to mind as a good fit for this, given that the size doesn't exceed a few hundred GBs. Other NoSQL datastores might also fit the bill nicely, although Mongo's simplicity of use and installation and its support for JSON data make it a really great candidate.
If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea, because it effectively means insertion sort, i.e. a non-optimal running time, which can be bad given the size of your file).
By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time, by reading just the head of each and writing the smaller of the two records back to a third file. Do that for the 'first generation' of files and then continue with the second one, until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
Note that having intermediate files can be avoided altogether by using a multiway merge algorithm. This is typically based on a heap/priority queue, so the implementation might get slightly more complex but it reduces the number of I/O operations required.
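A hedged sketch of the chunk phase described above (the chunk size is an assumption you would tune to your heap; helper names are mine):

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Split the big file into sorted chunk files on disk.
static List<Path> sortIntoChunks(Path input, int linesPerChunk) throws IOException {
    List<Path> chunks = new ArrayList<>();
    try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
        List<String> buffer = new ArrayList<>(linesPerChunk);
        String line;
        while ((line = reader.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == linesPerChunk) {
                chunks.add(writeSortedChunk(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer));    // left-over lines
        }
    }
    return chunks;
}

static Path writeSortedChunk(List<String> lines) throws IOException {
    Collections.sort(lines);                             // in-memory sort of one chunk
    Path chunk = Files.createTempFile("chunk-", ".txt"); // intermediate file on disk
    Files.write(chunk, lines, StandardCharsets.UTF_8);
    return chunk;
}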
Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.
As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
So you're left with streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content through your program, which should perform the sorting by writing either to a random access file (trickier) or to a database.
I'd take a different approach.
Given a file, say with a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.
When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.
From there, I would sort those n elements and write them to a temporary file with a unique name.
Now, since all the files are sorted, the smallest elements will be at the start of each. You can then iterate over the files, repeatedly finding the smallest current element among them and appending it to the new final output, until all elements have been processed.
This approach reduces the amount of RAM needed, relying on drive space instead, and allows you to sort a file of any size.
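A rough sketch of that final pass, assuming the sorted temporary files from the previous step already exist; a priority queue is used here to find the smallest current head, though a plain linear scan over the heads works just as well (names are illustrative):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.PriorityQueue;

// One open chunk file plus its current smallest unconsumed line.
class ChunkCursor implements Comparable<ChunkCursor> {
    final BufferedReader reader;
    String head;

    ChunkCursor(Path chunk) throws IOException {
        this.reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
        this.head = reader.readLine();
    }

    public int compareTo(ChunkCursor other) {
        return head.compareTo(other.head);
    }
}

class Merger {
    // k-way merge of already-sorted chunk files into one output file.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> queue = new PriorityQueue<>();
        for (Path chunk : chunks) {
            ChunkCursor cursor = new ChunkCursor(chunk);
            if (cursor.head != null) {
                queue.add(cursor);
            } else {
                cursor.reader.close();             // empty chunk
            }
        }
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            while (!queue.isEmpty()) {
                ChunkCursor smallest = queue.poll();   // chunk holding the globally smallest line
                writer.write(smallest.head);
                writer.newLine();
                smallest.head = smallest.reader.readLine();
                if (smallest.head != null) {
                    queue.add(smallest);               // re-insert with its next line
                } else {
                    smallest.reader.close();
                }
            }
        }
    }
}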
Build an array of the record positions inside the file (a kind of index); maybe it would fit into memory instead. You need an 8-byte Java long per record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file by following the index pointers to get the records in the required order.
This will also work if the records are not all the same size.
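A rough sketch of that idea (helper names are mine; boxed Longs are used here only so that Arrays.sort can take a comparator, a primitive long[] with a hand-rolled sort would be leaner):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OffsetIndexSort {

    // One sequential pass: remember where every line starts (8 bytes per record).
    static Long[] collectOffsets(RandomAccessFile file) throws IOException {
        List<Long> offsets = new ArrayList<>();
        file.seek(0);
        long pos = 0;
        while (file.readLine() != null) {
            offsets.add(pos);                   // start offset of the line just read
            pos = file.getFilePointer();        // start offset of the next line
        }
        return offsets.toArray(new Long[offsets.size()]);
    }

    // Load a record only when a comparison needs it.
    static String readRecord(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        return file.readLine();
    }

    // Sort the index by the records it points at, without retaining the records.
    static void sortByRecord(Long[] offsets, final RandomAccessFile file) {
        Arrays.sort(offsets, new Comparator<Long>() {
            public int compare(Long a, Long b) {
                try {
                    return readRecord(file, a).compareTo(readRecord(file, b));
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }
        });
    }
}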
In Java, I know that if you are going to build a B-Tree index on hard disk, you probably should use serialisation, where the B-Tree structure has to be written from RAM to HD. My question is: if later I'd like to query the value of a key from the index, is it possible to deserialise just part of the B-Tree back into RAM? Ideally, only retrieving the value of a specific key. Fetching the whole index into RAM is a bad design, at least where the B-Tree is larger than the RAM size.
If this is possible, it'd be great if someone could provide some code. How do DBMSs do this, in either Java or C?
Thanks in advance.
you probably should use serialisation, where the B-Tree structure has to be written from RAM to HD
Absolutely not. Serialization is the last technique to use when implementing a disk-based B-tree. You have to be able to read individual nodes into memory, add/remove keys, change pointers, etc, and put them back. You also want the file to be readable by other languages. You should define a language-independent representation of a B-tree node. It's not difficult. You don't need anything beyond what RandomAccessFile provides.
You generally split the B-tree into several "pages," each holding some of the key-value pairs, etc. Then you only need to load one page into memory at a time.
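A minimal sketch of the page idea, assuming a fixed 4 KB page size and leaving the in-page key/pointer layout up to you:

import java.io.IOException;
import java.io.RandomAccessFile;

class PagedBTreeFile {
    static final int PAGE_SIZE = 4096;   // assumed fixed node size on disk
    private final RandomAccessFile file;

    PagedBTreeFile(RandomAccessFile file) {
        this.file = file;
    }

    // Read a single node (page) by its page number.
    byte[] readPage(long pageNumber) throws IOException {
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNumber * PAGE_SIZE);   // jump straight to the node on disk
        file.readFully(page);
        return page;                         // decode keys/pointers from this buffer
    }

    // Put a modified node back in place.
    void writePage(long pageNumber, byte[] page) throws IOException {
        file.seek(pageNumber * PAGE_SIZE);
        file.write(page);
    }
}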
For inspiration on how RDBMSs do it, it's probably a good idea to check the source code of the embedded Java databases: Derby, HyperSQL, H2, ...
And if those databases solve your problem, I'd rather forget about implementing indices and use one of them right away. Because they're embedded, there is no need to set up a server: the RDBMS code is part of the application's classpath, and the memory footprint is modest.
If that is a possibility for you, of course...
If the tree can easily fit into memory, I'd strongly advise keeping it there. The difference in performance will be huge. Not to mention the difficulty of keeping changes in sync on disk, reorganizing, and so on.
When at some point you need to store it, look at Externalizable instead of regular serialization. Serialization is notoriously slow and bulky, while Externalizable lets you control every byte written to disk. Not to mention the difference in performance when reading the index back into memory.
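For example, a hedged sketch of a node written via Externalizable (field names and layout are assumptions, not a prescribed format):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

class TreeNode implements Externalizable {
    String[] keys;
    long[] valueOffsets;                  // offset of each value on disk

    public TreeNode() { }                 // Externalizable requires a public no-arg constructor

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(keys.length);
        for (int i = 0; i < keys.length; i++) {
            out.writeUTF(keys[i]);        // exactly the bytes we choose, nothing more
            out.writeLong(valueOffsets[i]);
        }
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        int n = in.readInt();
        keys = new String[n];
        valueOffsets = new long[n];
        for (int i = 0; i < n; i++) {
            keys[i] = in.readUTF();
            valueOffsets[i] = in.readLong();
        }
    }
}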
If the tree is too big to fit into memory, you'll have to use RandomAccessFile with some kind of in-memory caching, so that frequently accessed items still come out of memory. But then you'll need to take updates to the index into account: you'll have to flush them to disk at some point.
So, personally, I'd rather not do this from scratch, but rather use code that's already out there. :-)
Dear StackOverflowers,
I am in the process of writing an application that sorts a huge number of integers from a binary file. I need to do it as quickly as possible, and the main performance issue is disk access time; since I make a multitude of reads, it slows down the algorithm quite significantly.
The standard way of doing this would be to fill ~50% of the available memory with a buffered object of some sort (BufferedInputStream, etc.), then transfer the integers from the buffered object into an array of integers (which takes up the rest of the free space) and sort the array. Save the sorted block back to disk, repeat the procedure until the whole file is split into sorted blocks, and then merge the blocks together.
The strategy for sorting the blocks makes useful use of only 50% of the available memory, since the data is essentially duplicated (50% for the cache and 50% for the array, while they store the same data).
I am hoping to optimise this phase of the algorithm (sorting the blocks) by writing my own buffered class that caches data straight into an int array, so that the array could take up all of the free space, not just 50% of it; this would reduce the number of disk accesses in this phase by a factor of two. The thing is, I am not sure where to start.
EDIT:
Essentially, I would like to find a way to fill up an array of integers by executing only one read on the file. Another constraint is that the array has to use most of the free memory.
If any of the statements I made are wrong, or at least seem to be, please correct me.
Any help appreciated,
Regards
When you say limited, how limited... <1 MB, <10 MB, <64 MB?
It makes a difference, since you won't actually get much benefit, if any, from having large BufferedInputStreams; in most cases the default buffer size of 8192 (JDK 1.6) is enough, and increasing it doesn't usually make that much difference.
Using a smaller BufferedInputStream should leave you with nearly all of the heap to create and sort each chunk before writing it to disk.
You might want to look into the Java NIO libraries, specifically FileChannel and IntBuffer.
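For example, a rough sketch of filling an int[] with one bulk read through a FileChannel and an IntBuffer view (the block size, file name and byte order are assumptions to tune):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.IntBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

class BlockSorter {
    // Read one block of raw ints into a plain int[] ready for sorting.
    static int[] readIntBlock(FileChannel channel, int intsPerBlock) throws IOException {
        ByteBuffer bytes = ByteBuffer.allocate(intsPerBlock * 4);
        while (bytes.hasRemaining() && channel.read(bytes) != -1) {
            // keep reading until the buffer is full or end of file
        }
        bytes.flip();
        IntBuffer ints = bytes.asIntBuffer();     // view the raw bytes as ints
        int[] block = new int[ints.remaining()];
        ints.get(block);                          // copy into a plain int[] for sorting
        return block;
    }

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("numbers.bin");    // hypothetical input file
        try (FileChannel channel = FileChannel.open(input, StandardOpenOption.READ)) {
            int[] block;
            while ((block = readIntBlock(channel, 1 << 20)).length > 0) {
                Arrays.sort(block);
                // write the sorted block to a temporary file here
            }
        }
    }
}

Note this sketch still holds the raw ByteBuffer and the int[] at the same time; a memory-mapped file (FileChannel.map) viewed through asIntBuffer() would avoid that second heap copy entirely.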
You don't give many hints, but two things come to mind. First, if you have many integers but not that many distinct values, bucket sort could be the solution.
Secondly, one word (OK, term) screams in my head when I hear that: external tape sorting. In early computer days (i.e. the stone age) data lived on tapes, and it was very hard to sort data spread over multiple tapes. It is very similar to your situation. Indeed, merge sort was the most commonly used sort in those days, and as far as I remember, Knuth's TAOCP has a nice chapter about it. There might be some good hints there about the sizes of caches, buffers and the like.
What algorithms or Java libraries are available to do N-way, recursive diff/merge of directories?
I need to be able to generate a list of folder trees that have many identical files, and have subdirectories with many similar files. I want to be able to use 2-way merge operations to quickly remove as much redundancy as possible.
Goals:
Find pairs of directories that have many similar files between them.
Generate short list of directory pairs that can be synchronized with 2-way merge to eliminate duplicates
Should operate recursively (there may be nested duplicates of higher-level directories)
Run time and storage should be O(n log n) in the number of directories and files
Should be able to use an embedded DB or page to disk for processing more files than fit in memory (100,000+).
Optional: generate an ancestry and change-set between folders
Optional: sort the merge operations by how many duplicates they can eliminate
I know how to use hashes to find duplicate files in roughly O(n) space, but I'm at a loss for how to go from this to finding partially overlapping sets between folders and their children.
EDIT: some clarification
The tricky part is the difference between "exact same" contents (otherwise hashing file hashes would work) and "similar" (which will not). Basically, I want to feed this algorithm a set of directories and have it return a set of 2-way merge operations I can perform in order to reduce duplicates as much as possible with as few conflicts as possible. It's effectively constructing an ancestry tree showing which folders are derived from each other.
The end goal is to let me incorporate a bunch of different folders into one common tree. For example, I may have a folder holding programming projects, and then copy some of its contents to another computer to work on them. Then I might back up an intermediate version to a flash drive. Except I may have 8 or 10 different versions, with slightly different organizational structures or folder names. I need to be able to merge them one step at a time, so I can choose how to incorporate changes at each step of the way.
This is actually more or less what I intend to do with my utility (bring together a bunch of scattered backups from different points in time). I figure if I can do it right I may as well release it as a small open source util. I think the same tricks might be useful for comparing XML trees though.
It seems desirable just to work on the filenames and sizes (and timestamps if you find that they are reliable), to avoid reading in all those files and hashing or diffing them.
Here's what comes to mind.
Load all the data from the filesystem. It'll be big, but it'll fit in memory.
Make a list of candidate directory-pairs with similarity scores. For each directory-name that appears in both trees, score 1 point for all pairs of directories that share that name. For each filename that appears in both trees (but not so often that it's meaningless), score 1 point for all pairs of directories that contain a file with that name. Score bonus points if the two files are identical. Score bonus points if the filename doesn't appear anywhere else. Each time you give points, also give some points to all ancestor-pairs, so that if a/x/y/foo.txt is similar to b/z/y/foo.txt, then the pairs (a/x/y, b/z/y) and (a/x, b/z) and (a, b) all get points.
Optionally, discard all pairs with scores too low to bother with, and critically examine the other pairs. Up to now we've only considered ways that directories are similar. Look again, and penalize directory-pairs that show signs of not having common ancestry. (A general way to do this would be to calculate the maximum score the two directories could possibly have, if they both had all the files and they were all identical; and reject the pair if only a small fraction of that possible score was actually achieved. But it might be better to do something cheap and heuristic, or to skip this step entirely.)
Choose the best-scoring candidate directory-pair. Output it. Eliminate those directories and all their subdirectories from contention. Repeat.
Choosing the right data structures is left as an exercise.
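For what it's worth, here's a rough sketch of one possible structure for the score bookkeeping described above, where every point awarded to a directory pair also propagates to its ancestor pairs (the string key and class name are simplifications of my own):

import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

class PairScores {
    private final Map<String, Integer> scores = new HashMap<>();

    // Award points to (dirA, dirB) and to every ancestor pair above them.
    void award(Path dirA, Path dirB, int points) {
        Path a = dirA;
        Path b = dirB;
        while (a != null && b != null) {
            String key = a + "|" + b;                  // simplified pair key
            Integer current = scores.get(key);
            scores.put(key, (current == null ? 0 : current) + points);
            a = a.getParent();                         // credit the ancestor pairs too
            b = b.getParent();
        }
    }

    Map<String, Integer> asMap() {
        return scores;
    }
}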
This algorithm makes no attempt to find similar files with different filenames. You can do that across large sets of files using something like the rsync algorithm, but I'm not sure you need it.
This algorithm makes no serious attempt to determine whether two files are actually similar. It just scores 1 point for the same filename and bonus points for the same size and timestamp. You certainly could diff them to assign a more precise score. I doubt it's worth it.