For some reason I can't use MessageDigest.getInstance("MD5"), so I must write the algorithm code manually. My project scans for duplicate documents (*.doc, *.txt, *.pdf) on an Android device. My question is: what must I write before the algorithm code so that the app scans for duplicate documents in the ROOT directory of the Android device? Without selecting a directory: when I press the scan button, the process begins and the ListView shows the results. Can anyone help me? My project deadline is coming up. Thank you so much.
public class MD5 {
//What must I write here so that I can scan for duplicate documents in the Android root directory with an MD5 hash?
//MD5 MANUAL ALGORITHM CODE
}
WHOLE PROCESS:
Your goal is to detect (and perhaps store information about) duplicate files.
1. First, you have to iterate through directories and files; see this:
list all files from directories and subdirectories in Java
2. For each file, load it as a byte array; see this:
Reading a binary input stream into a single byte array in Java
3. Then compute your MD5 (your project).
4. And store this information.
You can use a Set to detect duplicates (a Set has unique elements).
Set<String> files_hash; // each String is a string representation of an MD5 hash
if (files_hash.contains(my_md5)) // then you know you already have it
or a
Map<String, String> file_and_hash; // maps file => hash
// you have to iterate to know whether you already have a hash, or keep a Set as well
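A minimal sketch tying steps 1 to 4 together (MD5.hashHex is a hypothetical entry point into your manual implementation, assumed to return the hash as a hex string; one possible implementation appears further down; Files.readAllBytes needs API level 26 on Android, otherwise read the file through a FileInputStream):
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.HashSet;
import java.util.Set;

public class DuplicateScanner {
    private final Set<String> fileHashes = new HashSet<>();

    public void scan(File dir) throws IOException {
        File[] entries = dir.listFiles();
        if (entries == null) return; // not a directory, or not readable
        for (File entry : entries) {
            if (entry.isDirectory()) {
                scan(entry); // step 1: recurse into subdirectories
            } else if (entry.getName().matches(".*\\.(doc|txt|pdf)")) {
                byte[] content = Files.readAllBytes(entry.toPath()); // step 2: whole file as bytes
                String md5 = MD5.hashHex(content);                   // step 3: your manual MD5
                if (!fileHashes.add(md5)) {                          // step 4: add() returns false on a repeat
                    System.out.println("Duplicate: " + entry.getAbsolutePath());
                }
            }
        }
    }
}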
ANSWER for MD5:
Read the algorithm:
https://en.wikipedia.org/wiki/MD5
RFC: https://www.ietf.org/rfc/rfc1321.txt
And, from some googling, this step-by-step presentation:
http://infohost.nmt.edu/~sfs/Students/HarleyKozushko/Presentations/MD5.pdf
Or port an existing C (or Java) implementation.
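If you do write it yourself, here is a compact sketch that follows the RFC 1321 pseudocode. Treat it as a starting point, not a reference implementation: verify it against the RFC test vectors (for example, hashHex of an empty array must be d41d8cd98f00b204e9800998ecf8427e) before trusting it.
public class MD5 {
    // Per-round left-rotate amounts (4 values per round, cycled).
    private static final int[] SHIFTS = {7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21};
    // K[i] = floor(abs(sin(i + 1)) * 2^32), as defined in the RFC.
    private static final int[] K = new int[64];
    static {
        for (int i = 0; i < 64; i++) K[i] = (int) (long) (Math.abs(Math.sin(i + 1)) * 4294967296.0);
    }

    public static byte[] hash(byte[] message) {
        // Padding: append 0x80, then zeros, then the original bit length as 64-bit little-endian.
        int numBlocks = ((message.length + 8) >>> 6) + 1;
        byte[] padded = new byte[numBlocks * 64];
        System.arraycopy(message, 0, padded, 0, message.length);
        padded[message.length] = (byte) 0x80;
        long bitLen = (long) message.length << 3;
        for (int i = 0; i < 8; i++) padded[padded.length - 8 + i] = (byte) (bitLen >>> (8 * i));

        int a0 = 0x67452301, b0 = 0xEFCDAB89, c0 = 0x98BADCFE, d0 = 0x10325476;
        int[] m = new int[16];
        for (int block = 0; block < numBlocks; block++) {
            for (int i = 0; i < 16; i++) { // decode the block into 16 little-endian words
                int j = block * 64 + i * 4;
                m[i] = (padded[j] & 0xFF) | ((padded[j + 1] & 0xFF) << 8)
                        | ((padded[j + 2] & 0xFF) << 16) | ((padded[j + 3] & 0xFF) << 24);
            }
            int a = a0, b = b0, c = c0, d = d0;
            for (int i = 0; i < 64; i++) { // four rounds of 16 steps
                int f, g;
                if (i < 16)      { f = (b & c) | (~b & d); g = i; }
                else if (i < 32) { f = (d & b) | (~d & c); g = (5 * i + 1) & 15; }
                else if (i < 48) { f = b ^ c ^ d;          g = (3 * i + 5) & 15; }
                else             { f = c ^ (b | ~d);       g = (7 * i) & 15; }
                int tmp = d;
                d = c;
                c = b;
                b = b + Integer.rotateLeft(a + f + K[i] + m[g], SHIFTS[(i >>> 4) * 4 + (i & 3)]);
                a = tmp;
            }
            a0 += a; b0 += b; c0 += c; d0 += d;
        }
        byte[] digest = new byte[16]; // output a0..d0 little-endian
        int[] state = {a0, b0, c0, d0};
        for (int i = 0; i < 16; i++) digest[i] = (byte) (state[i / 4] >>> (8 * (i % 4)));
        return digest;
    }

    public static String hashHex(byte[] message) {
        StringBuilder sb = new StringBuilder(32);
        for (byte b : hash(message)) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }
}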
OVERALL STRATEGY
To save time and make the process faster, you must also think about how your function will be used:
if you use it once, for one unique file, it is better to reduce the work by first filtering the other files by size.
if you use it regularly (and want it to be fast), scan new files regularly in the background to keep a hash base up to date. Detecting new files is straightforward.
if you want to find all duplicated files, it is better to scan everything, and to use the Set strategy as well.
Hope this helps
You'll want to recursively scan for files, then, for each file found, calculate its MD5 or whatever and store that hash value, either in a Set<...> if you only want to know if a file is a dupe, or in a Map<..., File> if you want to be able to tell which file the current file is a duplicate of.
For each file's hash, you look into the collection of already known hashes to check if that particular hash value is in it; if it is, you (most likely) have a duplicate file; if it is not, you add the new hash value to the collection and proceed with the next file.
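A minimal sketch of the Map variant (allFiles and hexMd5Of are illustrative names for your file list and your hash helper):
Map<String, File> hashToFile = new HashMap<>();
for (File f : allFiles) {
    String md5 = hexMd5Of(f);                       // whatever hash you settled on
    File original = hashToFile.putIfAbsent(md5, f); // null means the hash was new
    if (original != null) {
        System.out.println(f + " is a duplicate of " + original);
    }
}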
Related
My task is to sort a file which is too large to fit in memory. The file contains text lines.
What I did:
read from the original file in parts (of the allowed size)
sorted each part
saved each sorted part to a temp file
As I understand it, the next thing I should do is:
1. read the first line of each temp file
2. sort those lines between each other (using a local variable to temporarily store them, though I am not sure it will stay below the restricted size)
3. write the first line (the result of the sort) to the final file
4. remove the line I just wrote from its temporary file
5. repeat steps 1-4 until all lines are sorted and "transferred" from the temp files to the final file
I am most unsure about step 4: is there a class that can look for a value and then erase the line containing it (at that point I won't even know which file the line came from)? I suspect this is not the proper way to reach my goal at all, but I need to remove lines which are already sorted, and I can't operate on the files' data in memory.
Do you need to do this in Java (assuming so from the tag)? Memory-wise it isn't going to be an efficient approach. The simplest option, in my opinion, would be to use sort and just sort the file directly at the OS level.
This article will give you a guide on how to use sort: https://www.geeksforgeeks.org/sort-command-linuxunix-examples/
Sort is available on Windows as well as unix/linux and can handle huge files.
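If it has to stay in Java, the merge phase described in the question can be done without ever deleting lines from the temp files: keep one BufferedReader per temp file plus a priority queue holding each file's current line, and "removing" a line just means advancing that file's reader. A minimal sketch, assuming the chunks were sorted in natural String order:
import java.io.*;
import java.util.*;

public class MergeSortedChunks {
    // K-way merge of already-sorted temp files into one output file.
    public static void merge(List<File> chunks, File out) throws IOException {
        // Each queue entry is {current line, index of the reader it came from}.
        PriorityQueue<Object[]> pq =
                new PriorityQueue<>((x, y) -> ((String) x[0]).compareTo((String) y[0]));
        List<BufferedReader> readers = new ArrayList<>();
        for (File chunk : chunks) {
            BufferedReader r = new BufferedReader(new FileReader(chunk));
            readers.add(r);
            String line = r.readLine();
            if (line != null) pq.add(new Object[]{line, readers.size() - 1});
        }
        try (PrintWriter w = new PrintWriter(new FileWriter(out))) {
            while (!pq.isEmpty()) {
                Object[] smallest = pq.poll();
                w.println((String) smallest[0]);
                int idx = (Integer) smallest[1];
                String next = readers.get(idx).readLine(); // "removing" = advancing the reader
                if (next != null) pq.add(new Object[]{next, idx});
            }
        }
        for (BufferedReader r : readers) r.close();
    }
}
Only one line per temp file is in memory at any time, so the size restriction is respected.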
I need, in a Java application, to ensure that files created by my application are not modified by the user; if a file is modified, that should be caught during validation.
My approach: I store the last-modified time of each file in a HashMap and validate a modified file based on that.
Problem: this is fine for a single session, but if I want to persist that information I have to create another file containing the last-modified times, and that file can also be modified by the user. For now I am not using any database.
So please suggest an alternative: how can I validate a file? And is my approach the most optimized one?
Use a decent hashing algorithm to take a hash of the file contents. To test if the user modified the file, conduct the same hash procedure again (at the time of the test) and compare it to the original hash. If the hashes are different, then the user clearly modified the file. I would suggest you use SHA-1 for your hashing algorithm but someone else might have a better hashing algorithm to use.
You can see this SO answer for information about how to compute a SHA-1 hash from a byte array. You can see this SO answer for information about reading a File into a byte array.
For storing the hash information of each file, I recommend using a database but you don't have to do so. You could use a normal file and within that file, create a format for storing hash related information. Your format could be the user's file name=the hash. For example:
myfile.txt=489892945720524750
otheruserfile.txt=390495940542905490
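A minimal sketch of the hashing side (the storage format above is then just one name=sha1Hex(file) entry per line):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileHasher {
    // Hex SHA-1 of a file's contents; recompute and compare to detect modification.
    public static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) sb.append(String.format("%02x", b & 0xFF));
        return sb.toString();
    }
}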
I would like to update specific parts of a text file using Java. I would like to be able to scan through the file and select specific lines to be updated, a bit like in a database; for instance, given the file:
ID Value
1 100
2 500
4 20
I would like to insert 3 and update 4, e.g.
ID Value
1 100
2 500
3 80
4 1000
Is there a way to achieve this (seemingly) easy task? I know you can append to a file, but I am more interested in random access.
I know you can append to a file, but I am more interested in random access
You're trying to insert and delete bytes in the middle of a file. You can't do that. File systems simply don't (in general) support that. You can overwrite specific bytes, but you can't insert or delete them.
You could update specific records with random access if your records were fixed-length (in bytes) but it looks like that's not the case.
You could either load the whole file into memory, or read from the original file, writing to a new file with either the old data or the new data as appropriate on a per line basis.
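A minimal sketch of the read-old/write-new approach for the example above (file names are illustrative; it assumes whitespace-separated "ID Value" lines with IDs in ascending order):
import java.io.*;

public class RewriteExample {
    public static void main(String[] args) throws IOException {
        // Copy old.txt to new.txt, inserting "3 80" in order and updating ID 4.
        try (BufferedReader in = new BufferedReader(new FileReader("old.txt"));
             PrintWriter out = new PrintWriter(new FileWriter("new.txt"))) {
            String line;
            boolean inserted = false;
            while ((line = in.readLine()) != null) {
                String[] parts = line.trim().split("\\s+");
                if (parts[0].matches("\\d+")) { // skips the header line
                    int id = Integer.parseInt(parts[0]);
                    if (!inserted && id > 3) { // insert the new record in order
                        out.println("3 80");
                        inserted = true;
                    }
                    if (id == 4) line = "4 1000"; // update the existing record
                }
                out.println(line);
            }
        }
    }
}
Afterwards, replace old.txt with new.txt (for example with File#renameTo).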
You can do so using RandomAccessFile in Java, which lets you set the current read and write position using the available methods. You can explore more here
Load the file into memory, change your value, and then re-write the file
If there's a way to insert into a file without loading it, I haven't heard of it. You have to move the other data out of the way first.
Unless you're dealing with huge files frequently, performance isn't too much of a concern.
As said in the previous answers, it's not possible to do that simply using streams. You could try to use properties: key-value pairs that can be saved to and modified in a text file.
For example, you can add a new property to the file with
setProperty(String key, String value)
This method adds a new property or, if one already exists, modifies the value of the property with the chosen key.
Obviously, new properties are added at the end of the file, but the lack of ordering is not a problem for performance because access goes through the getProperty method, which is backed by the underlying Hashtable.
See this tutorial for some examples:
http://docs.oracle.com/javase/tutorial/essential/environment/properties.html
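A minimal sketch (the file name is illustrative):
import java.io.*;
import java.util.Properties;

public class PropertiesExample {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        File store = new File("values.properties");
        if (store.exists()) {
            try (FileReader in = new FileReader(store)) { props.load(in); }
        }
        props.setProperty("3", "80");   // adds a new entry
        props.setProperty("4", "1000"); // overwrites an existing one
        try (FileWriter out = new FileWriter(store)) {
            props.store(out, "id => value");
        }
        System.out.println(props.getProperty("4")); // prints 1000
    }
}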
I would like to write a Hadoop application which takes as input a file and a folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running Hadoop MapReduce application which takes as input a path to a folder, does the processing, and writes out the result into a different folder.
I am kind of stuck on how to use a file to get the keys that need to be selected and extracted out of the other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a Bloom filter (of suitable size to yield a low false-positive rate) and then process the files, checking each key for membership in the Bloom filter (Hadoop comes with a BloomFilter class; check the Javadocs).
You'll also need to run a second MR job to do a final validation (most probably a reduce-side join) to eliminate the false positives output by the first job.
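A sketch of the Bloom filter side, using Hadoop's org.apache.hadoop.util.bloom.BloomFilter (the sizing numbers and keysFromTheKeyFile are illustrative; tune the vector size and hash count to your key volume and acceptable false-positive rate):
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

// Build the filter once, streaming the big key file line by line.
BloomFilter filter = new BloomFilter(100000000, 7, Hash.MURMUR_HASH); // bits, hash functions, hash type
for (String key : keysFromTheKeyFile) {
    filter.add(new Key(key.getBytes(StandardCharsets.UTF_8)));
}
// Later, for each record processed in the mapper:
if (filter.membershipTest(new Key(recordKey.getBytes(StandardCharsets.UTF_8)))) {
    // Probably a wanted key: emit it. The second job removes false positives.
}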
I would read the single file first, before you run your job, and store all the needed keys in the job configuration. You can then write a job that reads the files from the folder. In your mapper/reducer setup(context) method, read the keys out of the configuration and store them globally, so that you can read them during map or reduce.
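A sketch of that approach (only viable while the serialized key list stays small enough for the job configuration; the property name and wiring are illustrative):
// Driver side, before submitting the job:
job.getConfiguration().set("keys.csv", String.join(",", keys));

// Mapper side:
private Set<String> wanted;

@Override
protected void setup(Context context) {
    String[] keys = context.getConfiguration().get("keys.csv").split(",");
    wanted = new HashSet<>(Arrays.asList(keys));
}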
I have a relatively strange question.
I have a file that is 6 gigabytes long. What I need to do is scan the entire file, line by line, and determine all rows that match an ID number of any other row in the file. Essentially, it's like analyzing a web log file where there are many session IDs that are organized by the time of each click rather than by user ID.
I tried the simple (dumb) thing, which was to create two file readers: one that scans the file line by line getting the userIDs, and a second one to 1. verify that the userID has not been processed already and 2. if it hasn't, read every line in the file that begins with that userID and store (some value X related to the rows).
Any advice or tips on how I can make this process work more efficiently?
Import file into SQL database
Use SQL
Performance!
Seriously, that's it. Databases are optimized exactly for this kind of thing. Alternatively, if you have a machine with enough RAM, just put all the data into a HashMap for easy lookup.
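If the machine really does have the RAM, the HashMap version is only a few lines (the file name and ID parsing are illustrative):
import java.io.*;
import java.util.*;

// Group log lines by user ID; any list with more than one entry shares an ID.
Map<String, List<String>> byId = new HashMap<>();
try (BufferedReader in = new BufferedReader(new FileReader("huge.log"))) {
    String line;
    while ((line = in.readLine()) != null) {
        String id = line.split("\\s+", 2)[0]; // assumes the ID is the first field
        byId.computeIfAbsent(id, k -> new ArrayList<>()).add(line);
    }
}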
Easiest: create a data model and import the file into a database, taking advantage of JDBC and SQL. If necessary (when the file format is pretty specific), you can write some Java which imports it line by line with the help of BufferedReader#readLine() and PreparedStatement#addBatch().
Hardest: write your Java code so that it doesn't unnecessarily keep large amounts of data in memory. You're then basically reinventing what the average database already does.
import java.io.*;

// For each row R in the file, append R to a file named after the number N extracted from R.
// Opening in append mode creates the file called N if it does not exist yet.
try (BufferedReader in = new BufferedReader(new FileReader("input.log"))) { // file name is illustrative
    String r;
    while ((r = in.readLine()) != null) {
        String n = extractId(r); // placeholder for however you parse N out of R
        try (PrintWriter out = new PrintWriter(new FileWriter(n + ".txt", true))) {
            out.println(r);
        }
    }
}
How much data are you storing about each line, compared with the size of the line? Do you have enough memory to maintain the state for each distinct ID (e.g. number of log lines seen, number of exceptions or whatever)? That's what I'd do if possible.
Otherwise, you'll either need to break the log file into separate chunks (e.g. split it based on the first character of the ID) and then parse each file separately, or perhaps have some way of pretending you have enough memory to maintain the state for each distinct ID: have an in-memory cache which dumps values to disk (or reads them back) only when it has to.
You don't mention whether or not this is a regular, ongoing thing or an occasional check.
Have you considered pre-processing the data? Not practical for dynamic data, but if you can sort it based on the field you're interested in, it makes solving the problem much easier. Extracting only the fields you care about may reduce the data volume to a more manageable size as well.
A lot of the other advice here is good, but it assumes that you'll be able to load what you need into memory without running out of it. If you can do that, it will be better than the 'worst case' solution I'm mentioning.
If you have large files, you may need to sort them first. In the past I've dealt with multiple large files where I needed to match them up based on a key (sometimes matches were in all files, sometimes only in a couple, etc.). If this is the case, the first thing you need to do is sort your files. Hopefully you're on a box where you can easily do this (for example, there are many good Unix scripts for this). After you've sorted each file, read each file until you get matching IDs, then process.
I'd suggest:
1. Open both files and read the first record
2. See if you have matching IDs and process accordingly
3. Read the file(s) for the key just processed and do step 2 again until EOF.
For example if you had a key of 1,2,5,8 in FILE1 and 2,3,5,9 in FILE2 you'd:
1. Open and read both files (FILE1 has ID 1, FILE2 has ID 2).
2. Process 1.
3. Read FILE1 (FILE1 has ID 2)
4. Process 2.
5. Read FILE1 (ID 5) and FILE2 (ID 3)
6. Process 3.
7. Read FILE 2 (ID 5)
8. Process 5.
9. Read FILE1 (ID 8) and FILE2 (ID 9).
10. Process 8.
11. Read FILE1 (EOF....no more FILE1 processing).
12. Process 9.
13. Read FILE2 (EOF....no more FILE2 processing).
Make sense?
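A minimal sketch of that walkthrough (idOf and process are placeholders for your key parsing and per-ID logic):
import java.io.*;

public class SortedMergeJoin {
    // Merge-join two files already sorted by numeric ID, processing each ID once.
    public static void run(File file1, File file2) throws IOException {
        try (BufferedReader f1 = new BufferedReader(new FileReader(file1));
             BufferedReader f2 = new BufferedReader(new FileReader(file2))) {
            String a = f1.readLine(), b = f2.readLine();
            while (a != null || b != null) {
                int cmp;
                if (a == null) cmp = 1;       // FILE1 at EOF: drain FILE2
                else if (b == null) cmp = -1; // FILE2 at EOF: drain FILE1
                else cmp = Integer.compare(idOf(a), idOf(b));
                if (cmp < 0)      { process(a, null); a = f1.readLine(); }
                else if (cmp > 0) { process(null, b); b = f2.readLine(); }
                else              { process(a, b);    a = f1.readLine(); b = f2.readLine(); }
            }
        }
    }

    static int idOf(String line) { return Integer.parseInt(line.split("\\s+")[0]); } // placeholder
    static void process(String fromFile1, String fromFile2) { /* your per-ID logic */ }
}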