Check if archives are identical - java

I'm using a shell script to automatically create a zipped backup of various directories every hour. If I haven't been working on any of them for quite some time, this creates a lot of duplicate archives. MD5 hashes of the archives don't match, because they have different filenames, creation dates, etc.
Other than making sure there won't be duplicates in the first place, another option is checking whether file sizes match, but matching sizes don't necessarily mean the archives are duplicates.
Filenames look like this:
Qt_2012-03-15_23_00.tgz
Qt_2012-03-16_00_00.tgz
So maybe it would be an option to check whether consecutive files have identical file sizes.
Pseudo code:
int previousSize = 0;
String previousPath = null;
String fileName = null;
String workDir = "/path/to/workDir";
String processedDir = "/path/to/processedDir";

// Loop over all files
for file in workDir
{
    if (file.size() == previousSize) // Match
    {
        if (previousPath != null) // skip first iteration
        {
            rm previousPath; // Delete previous file
        }
    }
    else // No match
    {
        /* If there's no match, we can move the previous file
           to another directory so it doesn't get checked again */
        if (previousPath != null) // skip first iteration
        {
            mv previousPath processedDir/fileName;
        }
    }
    previousSize = file.size();
    previousPath = file.path();
    fileName = file.name();
}
Example:
Qt_2012-03-15_23_00.tgz 10KB
Qt_2012-03-16_00_00.tgz 10KB
Qt_2012-03-16_01_00.tgz 10KB
Qt_2012-03-16_02_00.tgz 15KB
Qt_2012-03-16_03_00.tgz 10KB
Qt_2012-03-16_04_00.tgz 10KB
If I'm correct, this would delete only the first two and the second-to-last one; the third and fourth would be moved to the processedDir.
So I guess I have 2 questions:
Would my pseudo code work the way I intend it to? (I find these things rather confusing.)
Is there a better/simpler/faster way? Because even though the chance of accidentally deleting non-identicals like that is very small, it's still a chance.

I can think of a couple of alternatives:
Deploy a version control system such as Git or Subversion, and write a script that periodically checks in any changes. This will save a lot of space, because only files that have actually changed get saved, and because changes to text files are stored as diffs.
Use an incremental backup tool. This article lists a number of alternatives.
Normal practice is to put the version control system / backups on a different machine, but you don't have to do that.

It's not clear whether this needs to run as a batch job. If it's manual, you can run Beyond Compare or any decent comparison tool to diff the two archives.
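If it does need to run unattended, a content-based comparison avoids the size heuristic entirely. The gzip header of a .tgz contains a timestamp that changes on every run even when the packed contents are identical, so hashing the decompressed tar stream can work where hashing the .tgz file itself doesn't. A minimal sketch, assuming the differing MD5s come only from that header (class and method names are illustrative):

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.util.zip.GZIPInputStream;

public class TgzDigest {
    // Hash the *decompressed* tar stream so the gzip header's mtime,
    // which differs between backup runs, is ignored.
    static String tarDigest(File tgz) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = new GZIPInputStream(
                new BufferedInputStream(new FileInputStream(tgz)))) {
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
        }
        return new BigInteger(1, md.digest()).toString(16);
    }
}

Two hourly archives of an unchanged directory should then compare equal by digest, with no risk of a size collision.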

Related

How to copy multiple files atomically from src to dest in java?

In one requirement, I need to copy multiple files from one location to another network location.
Let's assume the following files are present in the /src location:
a.pdf, b.pdf, a.doc, b.doc, a.txt and b.txt
I need to copy a.pdf, a.doc and a.txt atomically into the /dest location, all at once.
Currently I am using the java.nio.file.Files package, with code as follows:
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Files.copy(srcFile1, destFile1);
Files.copy(srcFile2, destFile2);
Files.copy(srcFile3, destFile3);
but with this approach the files are copied one after another.
As an alternative, in order to make the whole process atomic, I am thinking of zipping all the files, moving the zip to /dest, and unzipping it at the destination.
Is this approach correct for making the whole copy process atomic? Has anyone dealt with a similar problem and resolved it?
You can copy the files to a new temporary directory and then rename the directory.
Before renaming your temporary directory, you need to delete the destination directory.
If the destination directory already contains other files that you don't want to overwrite, you can instead move each file from the temporary directory to the destination directory.
This is not completely atomic, however.
With removing /dest:
String tmpPath = "/tmp/in/same/partition/as/destination";
File tmp = new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path tmpFile1 = Paths.get(tmpPath + "/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path tmpFile2 = Paths.get(tmpPath + "/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path tmpFile3 = Paths.get(tmpPath + "/a.txt");
Files.copy(srcFile1, tmpFile1);
Files.copy(srcFile2, tmpFile2);
Files.copy(srcFile3, tmpFile3);
delete(new File("/dest"));       // remove the old destination
tmp.renameTo(new File("/dest")); // one rename publishes all three files
void delete(File f) throws IOException {
    if (f.isDirectory()) {
        for (File c : f.listFiles())
            delete(c);
    }
    if (!f.delete())
        throw new FileNotFoundException("Failed to delete file: " + f);
}
With just overwriting the files:
String tmpPath = "/tmp/in/same/partition/as/destination";
File tmp = new File(tmpPath);
tmp.mkdirs();
Path srcFile1 = Paths.get("/src/a.pdf");
Path destFile1 = Paths.get("/dest/a.pdf");
Path tmp1 = Paths.get(tmpPath + "/a.pdf");
Path srcFile2 = Paths.get("/src/a.doc");
Path destFile2 = Paths.get("/dest/a.doc");
Path tmp2 = Paths.get(tmpPath + "/a.doc");
Path srcFile3 = Paths.get("/src/a.txt");
Path destFile3 = Paths.get("/dest/a.txt");
Path tmp3 = Paths.get(tmpPath + "/a.txt");
Files.copy(srcFile1, tmp1);
Files.copy(srcFile2, tmp2);
Files.copy(srcFile3, tmp3);
// Start of non-atomic section (it can be repeated if necessary)
Files.deleteIfExists(destFile1);
Files.deleteIfExists(destFile2);
Files.deleteIfExists(destFile3);
Files.move(tmp1, destFile1);
Files.move(tmp2, destFile2);
Files.move(tmp3, destFile3);
// End of non-atomic section
Even if the second method contains a non-atomic section, the copy process itself uses a temporary directory so that the files are not overwritten.
If the process aborts while moving the files, the move can easily be completed afterwards.
See https://stackoverflow.com/a/4645271/10871900 as reference for moving files and https://stackoverflow.com/a/779529/10871900 for recursively deleting directories.
First, there are several ways to copy a file or a directory. Baeldung gives a very nice insight into the different possibilities. You can also use FileCopyUtils from Spring. Unfortunately, none of these methods are atomic.
I found an older post and adapted it a little. You can try using the low-level transaction management support: you make a transaction out of the method and define what should be done in a rollback. There is also a nice article from Baeldung about this.
@Autowired
private PlatformTransactionManager transactionManager;

@Transactional(rollbackOn = IOException.class)
public void copy(List<File> files) throws IOException {
    TransactionDefinition transactionDefinition = new DefaultTransactionDefinition();
    TransactionStatus transactionStatus = transactionManager.getTransaction(transactionDefinition);
    TransactionSynchronizationManager.registerSynchronization(new TransactionSynchronization() {
        @Override
        public void afterCompletion(int status) {
            if (status == STATUS_ROLLED_BACK) {
                // try to delete created files
            }
        }
    });
    try {
        // copy files
        transactionManager.commit(transactionStatus);
    } catch (RuntimeException e) {
        transactionManager.rollback(transactionStatus); // roll back on failure, not unconditionally
        throw e;
    }
}
Or you can use a simple try-catch block: if an exception is thrown, you delete the files created so far.
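For instance, a minimal sketch of that try-catch variant (method and variable names are illustrative): copy the files one by one, and on the first failure delete whatever was already copied before rethrowing:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public final class BestEffortAtomicCopy {
    // Copy all sources into destDir; if any copy fails, remove the
    // files copied so far, so the destination never stays half-done.
    public static void copyAllOrNothing(List<Path> sources, Path destDir) throws IOException {
        List<Path> copied = new ArrayList<>();
        try {
            for (Path src : sources) {
                Path dest = destDir.resolve(src.getFileName());
                Files.copy(src, dest);
                copied.add(dest);
            }
        } catch (IOException e) {
            for (Path p : copied) {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException ignored) {
                    // best-effort cleanup
                }
            }
            throw e;
        }
    }
}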
Your question doesn't state what the atomicity is needed for. Even unzipping is never atomic: the VM might crash with an OutOfMemoryError right in the middle of inflating the blocks of the second file, leaving one file complete, a second partial, and a third entirely missing.
The only thing I can think of is a two-phase commit, like all the suggestions with a temporary destination that suddenly becomes the real target. This way you can be sure that the second operation either never occurs or creates the final state.
Another approach would be to write a sort of cheap checksum file into the target afterwards. This would make it easy for an external process to listen for the creation of such files and verify their content against the files found.
The latter would be much the same as offering the container/ZIP/archive right away instead of piling files in a directory. Most archives have or support integrity checks.
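A sketch of that checksum idea (the manifest name and format are made up for illustration): write one line per file with its size and CRC-32, and move the manifest into place last, so its presence signals that the copy finished:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.CRC32;

public final class ManifestWriter {
    // Write "name size crc32" per file, then move the manifest into
    // place last: a reader that sees MANIFEST knows the copy finished.
    public static void writeManifest(Path destDir) throws IOException {
        List<Path> files;
        try (Stream<Path> s = Files.list(destDir)) {
            files = s.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        StringBuilder sb = new StringBuilder();
        for (Path p : files) {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(p));
            sb.append(p.getFileName()).append(' ')
              .append(Files.size(p)).append(' ')
              .append(Long.toHexString(crc.getValue())).append('\n');
        }
        Path tmp = destDir.resolve(".manifest.tmp");
        Files.write(tmp, sb.toString().getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, destDir.resolve("MANIFEST"), StandardCopyOption.ATOMIC_MOVE);
    }
}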
(Operating systems and file systems also differ in behaviour if directories or folders disappear while being written. Some accept it and write all data to a recoverable buffer. Others still accept writes but don't change anything. Others fail immediately upon first write since the target block on the device is unknown.)
FOR ATOMIC WRITE:
There is no atomicity concept for standard filesystems, so you need to perform only a single action that is atomic.
Therefore, for writing more files in an atomic way, you need to create a folder with, let's say, the timestamp in its name, and copy files into this folder.
Then, you can either rename it to the final destination or create a symbolic link.
You can use anything similar to this, like file-based volumes on Linux, etc.
Remember that deleting the existing symbolic link and creating a new one will never be atomic, so you would need to handle the situation in your code and switch to the renamed/linked folder once it's available instead of removing/creating a link. However, under normal circumstances, removing and creating a new link is a really fast operation.
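A sketch of both variants (paths are illustrative; the staging directory must be on the same filesystem as the destination for the rename to be atomic):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class AtomicPublish {
    // Publish a fully prepared staging directory with a single rename.
    // ATOMIC_MOVE fails rather than falling back to copy+delete, so a
    // successful return means the directory appeared in one step.
    public static void publishByRename(Path staging, Path finalDir) throws IOException {
        Files.move(staging, finalDir, StandardCopyOption.ATOMIC_MOVE);
    }

    // Alternative: repoint a "current" symlink at the new directory.
    // As noted above, delete-then-create is NOT atomic, so readers must
    // tolerate a briefly missing link.
    public static void repointLink(Path link, Path target) throws IOException {
        Files.deleteIfExists(link);
        Files.createSymbolicLink(link, target);
    }
}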
FOR ATOMIC READ:
Well, the problem is not in the code, but at the operating system/filesystem level.
Some time ago, I got into a very similar situation. There was a database engine running and changing several files "at once". I needed to copy the current state, but the second file was already changed before the first one was copied.
There are two different options:
Use a filesystem with support for snapshots. At some moment, you create a snapshot and then copy files from it.
You can lock the filesystem (on Linux) using fsfreeze --freeze, and unlock it later with fsfreeze --unfreeze. When the filesystem is frozen, you can read the files as usual, but no process can change them.
None of these options worked for me, as I couldn't change the filesystem type and locking the filesystem wasn't possible (it was the root filesystem).
So I created an empty file, mounted it as a loop device, and formatted it. From that moment on, I could fsfreeze just my virtual volume without touching the root filesystem.
My script first called fsfreeze --freeze /my/volume, then performed the copy action, and then called fsfreeze --unfreeze /my/volume. For the duration of the copy, the files couldn't be changed, so the copied files were all from exactly the same moment in time - for my purpose, it was like an atomic operation.
Btw, be sure not to fsfreeze your root filesystem :-). I did, and a restart was the only solution.
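For illustration, the shape of that script driven from Java (commands and paths as described above; this requires root and should only ever target the loop-mounted volume, never the root filesystem):

import java.io.IOException;

public final class FrozenCopy {
    static void run(String... cmd) throws IOException, InterruptedException {
        int rc = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        if (rc != 0) {
            throw new IOException(String.join(" ", cmd) + " exited with " + rc);
        }
    }

    public static void main(String[] args) throws Exception {
        run("fsfreeze", "--freeze", "/my/volume");
        try {
            // copy action: everything under /my/volume is frozen in time here
            run("cp", "-a", "/my/volume/data", "/backup/data");
        } finally {
            // always unfreeze, even if the copy fails
            run("fsfreeze", "--unfreeze", "/my/volume");
        }
    }
}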
DATABASE-LIKE APPROACH:
Even databases cannot rely on atomic operations, so they first write the change to a WAL (write-ahead log) and flush it to storage. Only once it's flushed do they apply the change to the data file.
If there is any problem or crash, the database engine first loads the data file, checks whether there are unapplied transactions in the WAL, and applies them if so.
This is also called journaling, and it's used by some filesystems (ext3, ext4).
I hope this solution is useful. As per my understanding, you need to copy the files from one directory to another directory, so my solution is as follows:
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CopyFilesDirectoryProgram {

    public static void main(String[] args) throws IOException {
        String sourceDirectoryName = "//mention your source path";
        String targetDirectoryName = "//mention your destination path";
        File sdir = new File(sourceDirectoryName);
        File tdir = new File(targetDirectoryName);
        // Call the method for execution
        copy(sdir, tdir);
    }

    // Copy a single file, or recurse if the source is a directory
    private static void copy(File sdir, File tdir) throws IOException {
        if (sdir.isDirectory()) {
            copyFilesFromDirectory(sdir, tdir);
        } else {
            Files.copy(sdir.toPath(), tdir.toPath());
        }
    }

    private static void copyFilesFromDirectory(File source, File target) throws IOException {
        if (!target.exists()) {
            target.mkdir();
        }
        // Copy the directory's contents whether or not the target already existed
        for (String item : source.list()) {
            copy(new File(source, item), new File(target, item));
        }
    }
}

Implementing Microsoft.CognitiveServices.Speech recognition for several files

I have gotten the coding example from here to work.
I can run a .wav file through and get the transcript; however, in the example the program never ends until I hit a key:
System.out.println("Press any key to stop");
new Scanner(System.in).nextLine();
That seems to artificially pause everything while the service is being queried.
If I remove that line, the program runs through too fast and concludes without waiting for the service to respond.
Question: How do I resume/continue the program with the full transcription without needing to hit a key?
I would like to run this for multiple .wav files transcribing each one after the other. But so far it runs the first one then waits.
I have been scouring the documentation and I have tried multiple things including using recognizer.close(); which I would expect to end the SpeechRecognizer but which seems to do nothing.
Or using result = recognizer.recognizeOnceAsync().get(); which does not transcribe the full file.
Does anyone know of an example of this running multiple files or how to implement that?
Thanks.
You can create a function that will read and return the list of files in your directory:
private static String[] getFiles(String directory)
{
    // File.list(FilenameFilter) keeps only entries that are regular files
    String[] files = new File(directory).list((dir, name) -> new File(dir, name).isFile());
    return files;
}
Then loop through them, processing and transcribing each one:
String[] files = getFiles(args[0]);
for (String file : files)
{
    // Your transcription code goes here.
    System.out.printf("File %1$s processed%n", file); // report which file has been processed
}
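To avoid the "press any key" pause inside that loop, the pattern used in the SDK samples is to block on a semaphore that is released by the sessionStopped event, so each file is transcribed to completion before the next one starts. A sketch along those lines (it assumes the continuous-recognition event API of the com.microsoft.cognitiveservices.speech package; adapt names to your setup):

import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;
import java.util.concurrent.Semaphore;

public class TranscribeOneFile {
    static void transcribe(SpeechConfig config, String wavFile) throws Exception {
        AudioConfig audio = AudioConfig.fromWavFileInput(wavFile);
        SpeechRecognizer recognizer = new SpeechRecognizer(config, audio);
        Semaphore done = new Semaphore(0);

        // Print each recognized phrase as it arrives
        recognizer.recognized.addEventListener((s, e) ->
                System.out.println(e.getResult().getText()));
        // Released when the whole file has been consumed
        recognizer.sessionStopped.addEventListener((s, e) -> done.release());

        recognizer.startContinuousRecognitionAsync().get();
        done.acquire(); // blocks until sessionStopped; no key press needed
        recognizer.stopContinuousRecognitionAsync().get();

        recognizer.close();
        audio.close();
    }
}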
You could also try using the Batch Transcription feature!
Batch transcription is ideal if you have large amounts of audio in storage.

How to reduce time of listing files in a directory in which files are being added continously

I have to filter files in a directory using a FileFilter (based on modification date). Files are continuously being added to this directory.
I am using a ThreadPoolExecutor to process each file returned by the FileFilter's accept() method.
But the time taken to list all the files in this directory is large, which slows down the completion time of my code. This is due to files being continuously added to the directory.
Is there any other way to list the files much faster? Please note that I need files which have been modified before a certain modification time, and this is checked through the FileFilter.
final ThreadPoolExecutor executor = poolFactory.getExecutor();
FileFilter fileFilter = new FileFilter() {
    @Override
    public boolean accept(File file) {
        if (file.getName().toUpperCase().contains(fileNameFilter) &&
                null != startDate && file.lastModified() >= startDate.getTime() &&
                null != endDate && file.lastModified() <= endDate.getTime()) {
            executor.execute(new FileFinder(file, textFinder));
            return true;
        }
        return false;
    }
};
file.listFiles(fileFilter);
There is no faster way to read a directory. And it is not a Java issue. Simply put, the operating system only "indexes" a directory by name. Any other form of lookup / query needs to be implemented by iterating all entries, one at a time, and retrieving and testing the file attributes.
The only way you are going to do better than that is if you do a first scan of the directory (on application startup), and then use the file watcher service to look for any changes. The first scan takes just as long as currently, but using the file watcher avoids repeatedly re-scanning.
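A minimal sketch of that approach with java.nio.file.WatchService (the directory path and the hand-off to your executor are illustrative):

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import static java.nio.file.StandardWatchEventKinds.*;

public class DirWatcher {
    // Scan the directory once at startup, then react to events instead
    // of re-listing everything on every pass.
    public static void watch(Path dir) throws Exception {
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                if (event.kind() == OVERFLOW) {
                    continue; // events were lost; a re-scan may be needed
                }
                Path file = dir.resolve((Path) event.context());
                // hand the new/updated file to the existing executor here
                System.out.println(event.kind() + ": " + file);
            }
            if (!key.reset()) {
                break; // directory no longer accessible
            }
        }
    }
}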
If that doesn't work for you, then you are going to need to manage your files differently. For example:
Maybe you could use File.renameTo to move files to another directory once you have processed them.
Maybe you could append the incoming information to the end of an existing file rather than creating new files.
Maybe you could put the information (straight) into a database and do away with the need for an intermediate file.
And if you can't do any of the above, then sorry but there is no way to make it go faster.

Reading a log file which gets rolled over

I am trying to use a simple program to read from a log file. The code used is as follows:
RandomAccessFile in = new RandomAccessFile("/home/hduser/Documents/Sample.txt", "r");
String line;
while (true) {
    if ((line = in.readLine()) != null) {
        System.out.println(line);
    } else {
        Thread.sleep(2000);
    }
}
The code works well for new lines being added to the log file, but it does not handle the rollover process; i.e. when the content of the log file is cleared, I expect the Java console to continue reading text from the first line newly written to the log. Could that be possible? What changes need to be made to the existing code to achieve that?
At my work I had to deal with the processing of logs that can be rolled over without missing any data. What I do is store a tiny memo file that contains:
A hash of the first 1024 bytes (or less) of the log (I used SHA-1 or something because it's easy)
The number of bytes used to generate the hash
The current file position
I close the log file after processing all lines, or some maximum number of lines, and update the memo file. I sleep for a tiny bit and then open the log file again. This allows me to check whether a rollover has occurred. A rollover is detected when:
The current file is smaller than the last file position
The hash is not the same
In my case, I can use the hash to find the correct log file, and work backwards to get up to date. Once I know I've picked up where I left off in the correct file, I can continue reading and memoizing my position. I don't know if this is relevant to what you want to do, but maybe that gives you ideas.
If you don't have any persistence requirements, you probably don't need to store any memo files. If your 'rollover' just clears the log and doesn't move it away, you probably don't need to remember any file hashes.
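A sketch of that memo bookkeeping (method names are illustrative; the memo would persist the hash, the head length, and the last position):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class LogMemo {
    // Hash the first headLen bytes (or fewer) of the log without
    // disturbing the reader's current position.
    static byte[] headHash(RandomAccessFile log, int headLen)
            throws IOException, NoSuchAlgorithmException {
        byte[] head = new byte[(int) Math.min(headLen, log.length())];
        long pos = log.getFilePointer();
        log.seek(0);
        log.readFully(head);
        log.seek(pos); // restore the reader's position
        return MessageDigest.getInstance("SHA-1").digest(head);
    }
}

On reopening, a rollover is assumed when the file is shorter than the stored position or the stored hash no longer matches headHash of the current file.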
I am sorry, my bad. I don't want it to go blank; I just want the next new line written to the log to be read.
Since what you need is to read from the beginning when the file is cleared, you will need to monitor the length of the file and reset the cursor when the file length decreases. You can reset the cursor using the seek(..) method.
See code below -
RandomAccessFile in = new RandomAccessFile("/home/hduser/Documents/Sample.txt", "r");
String line;
long length = 0; // used to track the file length
while (true) {
    if (in.length() < length) { // reset position if the file length has decreased
        in.seek(0);
    }
    if ((line = in.readLine()) != null) {
        System.out.println(line);
        length = in.length();
    } else {
        Thread.sleep(2000);
    }
}
it does not replicate the rollover process. i.e. when the content of the log file is cleared I expect the java console to continue reading text from the first line newly written to the log. Could that be possible?
Struggling with this as well. +1 to @paddy for the hash idea.
Another solution (depending on your operating system) is to use the inode of the file, although this may only work under Unix:
Long inode = (Long)Files.getAttribute(logFile.toPath(), "unix:ino");
This returns the inode of the underlying filesystem entry associated with the log file. If the inode changes, then the file is a brand-new file. This assumes that when the log is rolled over, it is moved aside and the same file is not written over.
To make this work, you would record the inode of the file you are reading, then check whether the inode has changed if you haven't gotten any new data for some period of time.
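Put together, the check might look like this (the unix:ino attribute is Unix-only; names are illustrative):

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class InodeWatch {
    // The log was rolled over if the path now points at a different
    // inode than the one recorded when the file was opened.
    static boolean rolledOver(File logFile, long openedInode) throws IOException {
        long current = (Long) Files.getAttribute(logFile.toPath(), "unix:ino");
        return current != openedInode;
    }
}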

Objects in Java ArrayList don't get updated

SOLVED:
This is what was wrong:
current.addFolder(folder); (in the final else clause of the if statement)
It added a new folder, but did not guarantee that the folder passed in is the folder actually added; it may simply do nothing if the folder already exists. To overcome this, I changed addFolder to return the actual folder (for example, the pre-existing one), and I assigned folder to that return value. That did the trick, so now I've got:
folder = current.addFolder(folder);
current = folder;
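For reference, the returning variant of addFolder described above would look roughly like this:

public ArchiveFolder addFolder(ArchiveFolder folder) {
    int loc = folders.indexOf(folder);
    if (loc == -1) {
        folders.add(folder);
        return folder; // the passed folder is the one actually stored
    }
    ArchiveFolder real = folders.get(loc);
    if (real.getTime() == null) {
        real.setTime(folder.getTime());
        real.setDate(folder.getDate());
    }
    return real; // the pre-existing folder, so callers traverse the real tree
}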
Thanks a lot people, your help was much appreciated :)
This is going to be a very long post; hopefully you can understand what I'm talking about. I appreciate any help. Thanks.
Basically, I've created a personal, non-commercial project (which I don't plan to release) that can read ZIP and RAR files. It can only read the contents of the archive: the folders inside, the files inside the folders, and their properties (such as last modified date, last modified time, CRC checksum, uncompressed size, compressed size and file name). It can't extract files either, so it's really a ZIP/RAR viewer, if you will.
Anyway that's slightly irrelevant to my problem but I thought I'd give you some background info.
Now for my problem:
I can successfully list all the folders and files inside a ZIP archive, so now I want to take that raw input and link it together in some useful way. I made two classes: ArchiveFile (represents a file inside a ZIP) and ArchiveFolder (represents a folder inside a ZIP). They both have some useful methods such as getLastModifiedDate, getName, getPath and so on. But the difference is that ArchiveFolder can hold an ArrayList of ArchiveFiles and additional ArchiveFolders (think of these as the files and folders inside a folder).
Now I want to populate my raw input into one root ArchiveFolder, which will have all the files in the root dir of the ZIP in its ArchiveFile ArrayList and any folders in the root dir of the ZIP in its ArchiveFolder ArrayList (and this process continues like a chain reaction: more files/folders in each of those ArchiveFolders, and so on).
So I came up with the following code:
while (archive.hasMore()) {
    String path = "";
    ArchiveFolder current = root;
    String[] contents = archive.getName().split("/");
    for (int x = 0; x < contents.length; ++x) {
        if (x == (contents.length - 1) && !archive.getName().endsWith("/")) { // If on last item and item is a file
            path += contents[x]; // Update final path
            ArchiveFile file = new ArchiveFile(path, contents[x], archive.getUncompressedSize(),
                    archive.getCompressedSize(), archive.getModifiedTime(), archive.getModifiedDate(), archive.getCRC());
            current.addFile(file); // Create and add the file to the current ArchiveFolder
        }
        else if (x == (contents.length - 1)) { // Else if we are on the last item and it is a folder
            path += contents[x] + "/"; // Update final path
            ArchiveFolder folder = new ArchiveFolder(path, contents[x], archive.getModifiedTime(), archive.getModifiedDate());
            current.addFolder(folder); // Create and add this folder to the current ArchiveFolder
        }
        else { // Else if we are still traversing through the path
            path += contents[x] + "/"; // Update path
            ArchiveFolder folder = new ArchiveFolder(path, contents[x]);
            current.addFolder(folder); // Create and add folder (we do not know the modified date/time, only the name)
            current = folder; // Update current ArchiveFolder to the newly created one for the next iteration
        }
    }
    archive.getNext();
}
Assume that root is the root ArchiveFolder (initially empty).
And that archive.getName() returns the name of the current file OR folder in the following fashion: file.txt or folder1/file2.txt or folder4/folder2/ (an empty folder). So basically the relative path from the root of the ZIP archive.
Please read through the comments in the above code to familiarize yourself with it. Also assume that the addFolder method of an ArchiveFolder only adds the folder if it doesn't already exist (so there are no duplicate folders), and that it updates the time and date of an existing folder if they are blank (i.e. it was an intermediate folder we only knew the name of, but now we know its details). The code for addFolder is (pretty self-explanatory):
public void addFolder(ArchiveFolder folder) {
    int loc = folders.indexOf(folder); // folders is the ArrayList containing ArchiveFolders
    if (loc == -1) {
        folders.add(folder);
    }
    else {
        ArchiveFolder real = folders.get(loc);
        if (real.getTime() == null) {
            real.setTime(folder.getTime());
            real.setDate(folder.getDate());
        }
    }
}
So I can't see anything wrong with the code. It works, and after finishing, the root ArchiveFolder contains all the files in the root of the ZIP, as I want it to, and it contains all the directories in the root folder, as I want it to. So you'd think it works as expected, but no: the ArchiveFolders in the root folder don't contain the data inside those 'child' folders; each is just a blank folder with no additional files and folders (while it really does contain more files/folders when viewed in WinZip).
After debugging in Eclipse, the for loop does iterate through all the files (even those not included above), so this led me to believe that the problem is with this line of the code:
current = folder;
What it does is, it updates the current folder (used as an intermediate by the loop) to the newly added folder.
I thought Java passed objects by reference, and thus all new operations and additions on future ArchiveFiles and ArchiveFolders would automatically be reflected, with parent ArchiveFolders updated accordingly. But that does not appear to be the case?
I know this is a very long post and I really hope someone can help me out with this.
Thanks in advance.
Since you use Eclipse, set a breakpoint and step through the method; it may take time, but it helps with finding bugs (check the object IDs, for example, to see whether a reference has changed).
Java does not actually pass references in the way you'd understand them in C++, for example. It passes by value, but all variables of non-primitive types are actually pointers to objects. So whenever you pass a variable to a method, you are giving it a copy of the pointer, meaning both variables point to the same object: change the object through one and the other will "see" the change. But assigning a different value to the pointer on the caller's or the callee's side will not change the other side's pointer.
Hope I'm clear?
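A tiny demonstration of the distinction:

import java.util.ArrayList;
import java.util.List;

public class PassByValueDemo {
    static void mutate(List<String> list) {
        list.add("added");        // caller sees this: same object
    }
    static void reassign(List<String> list) {
        list = new ArrayList<>(); // caller unaffected: only the local
        list.add("lost");         // copy of the pointer was changed
    }
    public static void main(String[] args) {
        List<String> l = new ArrayList<>();
        mutate(l);
        reassign(l);
        System.out.println(l);    // prints [added]
    }
}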
I suspect you haven't overridden equals() and hashCode() correctly on your ArchiveFolder class, and thus
folders.indexOf(folder)
in addFolder() is always returning -1.
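If an ArchiveFolder's identity is its path, a sketch of the overrides (assuming a 'path' field) might be:

@Override
public boolean equals(Object o) {
    if (this == o) return true;
    if (!(o instanceof ArchiveFolder)) return false;
    return path.equals(((ArchiveFolder) o).path); // 'path' field assumed
}

@Override
public int hashCode() {
    return path.hashCode();
}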
