I have a set of geographically remote nodes with heterogeneous operating systems that need to transfer files and updates around using a Java program I am writing. At present I need to send the entire file again whenever it changes. Is there a way to determine which sections of a file are different and send only those? (Note that these files are not necessarily text; they could be any format.) The only approach I can think of is to split the file into blocks, hash the blocks, and send the hashes back to the requester, which then requests only the blocks it needs. But for small blocks and large files this is a large overhead, so is there any way to send a single message describing my file that can be analysed to produce the list of blocks that need to be transmitted?
Most digest functions are designed so that a small change to the data results in a large change across the whole hash output; I basically need the reverse of this, and it has to work on all operating systems.
If I understand your question correctly, you need to keep files in sync on two systems. There is a tool called rsync that can synchronize two files (or whole directories) by only sending the changes made to the file.
You may also be interested in the Rsync algorithm.
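To make the block/hash idea from the question concrete, here is a minimal sketch in Java of hashing a file in fixed-size blocks: both sides hash their copy the same way, and the requester asks only for the block indexes whose hashes differ. The block size, the SHA-256 choice, and the class name are illustrative. Note that real rsync adds a cheap rolling checksum on top of the strong hash so it can cope with data being inserted or deleted (which shifts every later block); a fixed-offset scheme like this one only detects in-place changes.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;

    public class BlockHasher {

        private static final int BLOCK_SIZE = 64 * 1024; // 64 KiB; tune for your overhead/precision trade-off

        // Returns one hex-encoded SHA-256 digest per fixed-size block of the file.
        public static List<String> hashBlocks(Path file) throws IOException, NoSuchAlgorithmException {
            List<String> hashes = new ArrayList<>();
            byte[] block = new byte[BLOCK_SIZE];
            try (InputStream in = Files.newInputStream(file)) {
                int read;
                // readNBytes keeps reading until the block is full or EOF is reached
                while ((read = in.readNBytes(block, 0, BLOCK_SIZE)) > 0) {
                    MessageDigest md = MessageDigest.getInstance("SHA-256");
                    md.update(block, 0, read);
                    hashes.add(toHex(md.digest()));
                }
            }
            return hashes;
        }

        private static String toHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        }
    }

The "message describing my file" is then just this list of digests (a few dozen bytes per block); the receiver compares it against the digests of its own copy and requests only the mismatching block numbers.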
Related
I'm writing an encryption program that will encrypt files.
I want the encrypted content to replace the original content so it can't be recovered by recovery programs (that is, it should overwrite the same on-disk locations as the original content).
Assume that the encrypted content has equal size as the original content.
I guess that File.renameTo() will not do the trick, since its behaviour is platform-dependent and therefore somewhat unpredictable.
Forgive me for not posting my full code (duh!), but I use a BufferedInputStream/BufferedOutputStream to read/write the data.
In some cases (on some operating systems, on some filesystems, with some mount options) RandomAccessFile will let you do what you want to do. Also think about how to keep sensitive data out of the Java heap... for example, avoid using String as part of the objects constructed from the unencrypted file, and later written to the encrypted file. However, in other cases what you're proposing is simply impossible. As stated in the manpage for GNU shred,
CAUTION: Note that shred relies on a very important assumption: that the file system overwrites data in place. This is the traditional way to do things, but many modern file system designs do not satisfy this assumption. The following are examples of file systems on which shred is not effective, or is not guaranteed to be effective in all file system modes:
log-structured or journaled file systems, such as those supplied with AIX and Solaris (and JFS, ReiserFS, XFS, Ext3, etc.)
file systems that write redundant data and carry on even if some writes fail, such as RAID-based file systems
file systems that make snapshots, such as Network Appliance’s NFS server
file systems that cache in temporary locations, such as NFS version 3 clients
compressed file systems
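For the cases where the filesystem does overwrite in place, here is a minimal sketch of the RandomAccessFile approach mentioned above. It assumes, as the question states, that the ciphertext is exactly the same length as the plaintext (which in practice points at a stream cipher or a block cipher in CTR mode); ChunkTransformer is a hypothetical stand-in for the actual cipher, not a real library type.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class InPlaceOverwrite {

        // Hypothetical stand-in for the real cipher; must return exactly 'length' bytes.
        public interface ChunkTransformer {
            byte[] apply(byte[] input, int length);
        }

        // Reads the file in chunks, transforms each chunk, and writes the result
        // back to the same offset it was read from.
        public static void overwriteInPlace(String path, ChunkTransformer cipher) throws IOException {
            byte[] buffer = new byte[8192];
            try (RandomAccessFile raf = new RandomAccessFile(path, "rws")) { // "rws" writes content and metadata synchronously
                long position = 0;
                long length = raf.length();
                while (position < length) {
                    raf.seek(position);
                    int read = raf.read(buffer, 0, (int) Math.min(buffer.length, length - position));
                    if (read <= 0) {
                        break;
                    }
                    byte[] encrypted = cipher.apply(buffer, read); // same length as the input chunk
                    raf.seek(position);
                    raf.write(encrypted, 0, read);
                    position += read;
                }
            }
        }
    }

Even then, the caveat quoted above still applies: on journaled or copy-on-write filesystems the old bytes may survive elsewhere on disk regardless of what your program does.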
I want to distribute the processing of a large file (almost 1 GB) over many machines. One option would be to store the whole file on all the machines and pass indexes from the master machine, but I am not able to understand how I should do this in Java. I want to do this mainly for optimization; more specifically, I want each machine to process a different part of the file and return the result. The problem is that reading the file cannot start from a specific line, so on each machine the file would have to be read from the beginning, which wastes time because the same reading is done several times. Is there any solution for this?
I am writing a servlet which will examine a directory on the server (external to the web container) and recursively search for certain files (by certain files, I mean files with a certain extension as well as a certain naming convention). Once these files are found, the servlet responds with a long list of all of the found files (including the full paths to the files). My problem is that there are so many files and directories that my servlet runs extremely slowly. I was wondering if there is a best practice or existing servlet for this type of problem. Would it be more efficient to simply compile the entire list of files and do the filtering via js/jquery on the client side?
Disk access is slow, and as the number of files and directories increases you'll rapidly reach a point where your servlet is useless if it uses the conventional recursive search through the directory tree. You'll hit that limit especially quickly if a large number of concurrent users perform the same search at the same time.
It is much better, instead, to use an external batch job to generate the list of files, which can then be read into the servlet through a database call or even by just parsing a file containing all the file names separated by newlines. Using "find" on Linux is a simple way to do this, e.g.
find <path_to_directory> -name '*.bin' > list_files.txt
This would list every file name that ends with .bin in a particular directory and output it into a file named list_files.txt. Your servlet could then read in that file and create the list of files from there.
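As a minimal sketch of that servlet-side step (the list_files.txt name and the .bin filter are just the placeholders from the find example above):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    public class FileListReader {

        // Reads the pre-generated list and applies the extension / naming filter in memory.
        public static List<String> loadMatchingFiles(String listPath) throws IOException {
            return Files.readAllLines(Paths.get(listPath)).stream()
                    .filter(name -> name.endsWith(".bin"))
                    .collect(Collectors.toList());
        }
    }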
If you really have loads of files, you might think about spawning an external process to do the searching. If you're running on a unix-like server (like Linux), you might get speed gains by having the "find" command do the searching for you and parsing its output.
You can google for many examples of how to use "find".
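A rough sketch of that approach, assuming a unix-like host with find on the PATH (the directory and pattern arguments are placeholders):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class ExternalFind {

        // Spawns "find" and collects its output line by line.
        public static List<String> find(String directory, String pattern)
                throws IOException, InterruptedException {
            Process process = new ProcessBuilder("find", directory, "-name", pattern)
                    .redirectErrorStream(true)
                    .start();
            List<String> results = new ArrayList<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    results.add(line);
                }
            }
            process.waitFor();
            return results;
        }
    }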
I see two possible reasons why this process might be going slowly:
1) Disk I/O is taking too long. This'll be a real constraint that you can't do much about. Usually the operating system is pretty good at keeping structures in memory that allow it to find files in your folders much quicker. If it is too slow regardless, you might have to build an index yourself in memory. This all depends on how you're doing it.
In any case, if this is the issue (you can try measuring), then there's no way doing the filtering client side will help, as that shouldn't really take very long, no matter where you do it. Instead you're going to make the client slower by sending it more data to sort through.
2) There's something wrong with your directory traversal. You say it's "recursive". If you mean it's actually recursive, i.e. a method that calls itself whenever it encounters a new directory, then that might well be slowing you down (the overhead really adds up). There's some stuff about tree traversal on wikipedia, but basically just use a queue or stack to keep track of where you are in the traversal, instead of using your method state to do so.
Note that a file system isn't actually a tree, but I'm assuming that it is in this case. It gets a bit hairier otherwise.
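As a rough sketch of the non-recursive traversal described in point 2, using an explicit stack in place of method recursion (note it does nothing about symbolic-link cycles, which is part of why a file system isn't really a tree):

    import java.io.File;
    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    public class IterativeSearch {

        // Depth-first traversal driven by an explicit stack instead of recursive calls.
        public static List<File> findByExtension(File root, String extension) {
            List<File> matches = new ArrayList<>();
            Deque<File> stack = new ArrayDeque<>();
            stack.push(root);
            while (!stack.isEmpty()) {
                File current = stack.pop();
                File[] children = current.listFiles();
                if (children == null) {
                    continue; // not a directory, or not readable
                }
                for (File child : children) {
                    if (child.isDirectory()) {
                        stack.push(child);
                    } else if (child.getName().endsWith(extension)) {
                        matches.add(child);
                    }
                }
            }
            return matches;
        }
    }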
I don't agree with the other posters that you can't implement it in-process. It should work pretty well up to a certain point, no need for batch jobs just yet.
I think your servlet is slow because of hard-drive speed. If the list of files is more or less permanent, you should load it into memory.
I'm currently developing a system (on Android) where the user will end up with large arrays. The JVM memory is at risk of running out, so in order to prevent this I was thinking of creating a temporary database and storing the data in there. However, one concern that comes to mind is the SD card's limited read/write performance; another is the overhead of such an operation. Can anyone clear up these concerns, and also suggest a good alternative for handling large arrays? (In the end these arrays will be uploaded to a website by writing a CSV file and uploading it.)
Thanks,
Faisal
A couple of thoughts:
You could store them using a DBMS like Derby, which is bundled with many versions of Java
You could store them in a compressed output stream that writes to a byte array; this works especially well if the data is easily compressible, e.g. regularly repeating numbers, text, etc. (see the sketch after this list)
You could upload portions of the arrays at a time, i.e. as you generate them, begin uploading pieces of the data up to the servers in chunks
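The sketch referred to in the second point above: buffering data through a gzip-compressed in-memory stream. The int-array payload and class name are made up for the example, and the memory saving depends entirely on how compressible the data actually is.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.zip.GZIPOutputStream;

    public class CompressedBuffer {

        // Writes an int array into a gzip-compressed in-memory buffer.
        public static byte[] compress(int[] values) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
                for (int value : values) {
                    out.writeInt(value);
                }
            }
            return bytes.toByteArray();
        }
    }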
My program receives large CSV files and transforms them to XML files. In order to get better performance I would like to split these files into smaller segments of (for example) 500 lines each. What are the available Java libraries for splitting text files?
I don't understand what you'd be gaining by splitting up the CSV file into smaller ones? With Java, you can read and process the file as you go, you don't have to read it all at once...
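A minimal sketch of that streaming approach (the comma split and handleRecord are placeholders; real CSV with quoted fields needs a proper parser):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class StreamingCsv {

        // Processes the CSV one line at a time; memory use stays flat regardless of file size.
        public static void process(String csvPath) throws IOException {
            try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(","); // naive split, fine for simple CSV
                    handleRecord(fields);              // placeholder for the CSV-to-XML conversion
                }
            }
        }

        private static void handleRecord(String[] fields) {
            // transform one record here
        }
    }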
What do you intend to do with the data?
If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing will be the way to go. For record-by-record processing, an existing "pipeline" toolkit may also be applicable.
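For the XML output side, a hedged sketch of writing records one at a time with StAX's XMLStreamWriter, so the whole document is never held in memory; the element names here are invented for illustration:

    import java.io.FileOutputStream;
    import java.io.IOException;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.XMLStreamWriter;

    public class StaxRecordWriter {

        // Emits one <record> element per CSV line as it is processed.
        public static void writeRecords(String xmlPath, Iterable<String[]> records)
                throws IOException, XMLStreamException {
            try (FileOutputStream out = new FileOutputStream(xmlPath)) {
                XMLStreamWriter writer =
                        XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
                writer.writeStartDocument("UTF-8", "1.0");
                writer.writeStartElement("records");
                for (String[] fields : records) {
                    writer.writeStartElement("record");
                    for (String field : fields) {
                        writer.writeStartElement("field");
                        writer.writeCharacters(field);
                        writer.writeEndElement();
                    }
                    writer.writeEndElement();
                }
                writer.writeEndElement();
                writer.writeEndDocument();
                writer.close();
            }
        }
    }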
You can pre-process your file with a splitter function like this one or this Splitter.java.
How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. The grid computing approach is really only for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool which you dispatch "jobs" into after the file is split could work.
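A minimal sketch of that simple thread-pool idea, assuming the file has already been split into chunk files; processChunk is a placeholder for the real per-chunk work:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class LocalChunkPool {

        // Dispatches one job per chunk file to a fixed-size thread pool and gathers the results.
        public static List<String> processAll(List<String> chunkPaths) throws Exception {
            ExecutorService pool =
                    Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
            try {
                List<Future<String>> futures = new ArrayList<>();
                for (String path : chunkPaths) {
                    Callable<String> job = () -> processChunk(path);
                    futures.add(pool.submit(job));
                }
                List<String> results = new ArrayList<>();
                for (Future<String> future : futures) {
                    results.add(future.get()); // blocks until that chunk is done
                }
                return results;
            } finally {
                pool.shutdown();
            }
        }

        private static String processChunk(String path) {
            return "processed " + path; // placeholder for the actual per-chunk work
        }
    }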