I am trying to implement a data deduplication program in the cloud using Java.
I'm not sure how to proceed with the implementation.
First, I wanted to do a simple comparison based on file size, date, and file name. However, this is ineffective, since two files might have the same content but different names.
I have decided on a simple algorithm which is
file upload -> file chunking -> Rabin-Karp hashing -> determine whether the file (or chunk) needs to be uploaded.
Will this be fine or are there any improvements?
Where can I find more information on this? I have tried looking around the Internet but can't find much; most of what I did find jumps straight into specific implementations without any explanation or details on file chunking or Rabin-Karp hashing.
I would also like to know which Java libraries I should look into for this program.
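For reference, my current understanding of the chunking step is: slide a fixed-size window over the byte stream with a Rabin-Karp-style rolling hash and cut a chunk wherever the hash matches a mask, so chunk boundaries depend on content rather than position. Below is a rough sketch of that idea; the window size, modulus, and boundary mask are arbitrary choices for illustration, not values from any library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Content-defined chunking with a Rabin-Karp-style polynomial rolling hash. */
public class RollingChunker {
    private static final int WINDOW = 48;              // bytes in the rolling window
    private static final long BASE = 257;              // polynomial base
    private static final long MOD = 1_000_000_007L;    // prime modulus
    private static final long MASK = (1 << 13) - 1;    // ~8 KiB average chunk size
    private static final int MIN_CHUNK = 2 * 1024;
    private static final int MAX_CHUNK = 64 * 1024;

    public static List<byte[]> chunk(InputStream in) throws IOException {
        long pow = 1;                                   // BASE^(WINDOW-1) mod MOD
        for (int i = 0; i < WINDOW - 1; i++) pow = (pow * BASE) % MOD;

        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();
        byte[] window = new byte[WINDOW];
        int pos = 0, filled = 0;
        long hash = 0;

        int b;
        while ((b = in.read()) != -1) {
            int v = b & 0xFF;
            current.write(v);
            if (filled < WINDOW) {                      // still filling the first window
                window[pos] = (byte) v;
                hash = (hash * BASE + v) % MOD;
                filled++;
            } else {                                    // slide: drop the oldest byte, add the newest
                int oldest = window[pos] & 0xFF;
                window[pos] = (byte) v;
                hash = (hash - (oldest * pow) % MOD + MOD) % MOD;
                hash = (hash * BASE + v) % MOD;
            }
            pos = (pos + 1) % WINDOW;

            boolean boundary = filled == WINDOW && (hash & MASK) == 0;
            if ((boundary && current.size() >= MIN_CHUNK) || current.size() >= MAX_CHUNK) {
                chunks.add(current.toByteArray());      // this chunk would be hashed and looked up
                current.reset();
            }
        }
        if (current.size() > 0) chunks.add(current.toByteArray());
        return chunks;
    }
}
```

Each resulting chunk would then be hashed (e.g. with SHA-256) and looked up against what is already stored before deciding whether it needs to be uploaded.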
It would be easier if you state your problem constraints. Assuming the following:
The smallest indivisible unit of data is a file
Files are reasonably small to fit in memory for computing hashes
Your files are in some cloud bucket or similar store where you can list them all; that also rules out identical filenames.
You can probably narrow down your problem.
Iterate through all the files using some fast hashing algorithm like a basic CRC checksum and build a map from checksum to files. (This can easily be parallelized.)
Keep only the files whose checksums collide; everything else can be left out, and for all practical purposes that should be a pretty large chunk of the data.
Run through this remaining subset of files with a cryptographic hash (or, worst case, compare the entire file contents) and identify the real matches (see the sketch below).
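A minimal sketch of this two-pass approach, assuming the files sit under a local or mounted directory (the directory path and the CRC32/SHA-256 choices are illustrative, not prescriptive):

```java
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Base64;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class DuplicateFinder {
    public static void main(String[] args) throws Exception {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(Paths.get("/data/bucket"))) {
            for (Path p : ds) if (Files.isRegularFile(p)) files.add(p);
        }

        // Pass 1: group files by a fast checksum (cheap, easily parallelized).
        Map<Long, List<Path>> byCrc = new HashMap<>();
        for (Path p : files) {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(p));          // assumes files fit in memory
            byCrc.computeIfAbsent(crc.getValue(), k -> new ArrayList<>()).add(p);
        }

        // Pass 2: within each colliding group, confirm with a cryptographic hash.
        for (List<Path> group : byCrc.values()) {
            if (group.size() < 2) continue;             // unique checksum => unique file
            Map<String, List<Path>> bySha = new HashMap<>();
            for (Path p : group) {
                byte[] digest = MessageDigest.getInstance("SHA-256").digest(Files.readAllBytes(p));
                bySha.computeIfAbsent(Base64.getEncoder().encodeToString(digest),
                                      k -> new ArrayList<>()).add(p);
            }
            for (List<Path> dupes : bySha.values()) {
                if (dupes.size() > 1) System.out.println("Duplicates: " + dupes);
            }
        }
    }
}
```

Only the groups that collide in pass 1 pay for the cryptographic hash in pass 2, which keeps the expensive work proportional to the number of candidate duplicates.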
This can be refined depending on the underlying data.
In any case, this is how I would approach the problem; given its structure, it can easily be partitioned and solved in parallel. Feel free to elaborate more so that we can reach a good solution.
I have always used Python for clustering, but recently I came across a situation in which I need the implementations of both CluStream and DenStream (stream clustering algorithms), which are available in R and Java (there are some implementations in Python from the community, but I already tried them and they do not work).
The thing is that I have to compare many clustering algorithms written in Python, and as a preliminary stage I was using the well-known scikit-learn data sets (to show how the algorithms handle non-globular clusters; of course I will then use time-series data).
Now, I want to know the proper way to run those R/Java algorithms and compute a metric coded in Python (DBCV) on the R/Java clustering results.
So, summing up, I need to compare many algorithms (coded in Python and R/Java) using the same data sets (which I figured could be persisted as CSV files) and compute the same validity metric (in Python).
Any help would be appreciated. Thanks in advance!
EDIT: the solution I came across is the following:
Generate the toy data sets with sklearn and persist them into csv files
Run the different clustering algorithms on those data sets and also persist the clustering results as CSV files; it does not matter which programming language is used (a sketch of the Java side follows after this list)
Develop another app which:
takes the clustering solutions stored in the CSV files
computes the metric and shows the results
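For the Java/MOA side of step 2, the interchange can be as simple as one CSV row per point with its assigned cluster, which the Python metric app can read back with pandas or the csv module. A minimal sketch; the column layout and method name here are my own assumption, not part of MOA or any other library:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResultsWriter {
    /** Writes "x,y,cluster" rows so a Python app can compute DBCV on the result. */
    public static void writeResults(double[][] points, int[] labels, Path out) throws IOException {
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("x,y,cluster");                  // header the metric app expects
            for (int i = 0; i < points.length; i++) {
                w.printf("%f,%f,%d%n", points[i][0], points[i][1], labels[i]);
            }
        }
    }
}
```

The sklearn toy data sets are 2-D, which is why only two feature columns appear here; the same idea extends to more columns for time-series data.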
PLEASE let me know if you find a better solution!
Notes:
This R package is the one I want to try: streamMOA
I do not know anything about R, but I have worked with Java before (which implementation I choose depends on what integrates better with Python)
MOA is Java software. There is no good reason to use it via R unless you are already in the R ecosystem (which you aren't).
You can write the data to CSV and load it in whatever tool you like
These data sets are not streams. They lack all the difficulties and challenges of streams - a simple subsample will be enough to identify the clustering structure. Conclusions drawn from this data are useless. Use real data streams, not synthetic data with no sequential order to it.
I'm taking a data structures course and am developing a project in Java. The project is pretty much complete except for one aspect: the implementation of a cache. My professor has been very vague about how to implement this, as he is (and should be) with everything. The only hint he has given is that our operating system has its own file system, which in and of itself is a map, and we can use it as a way to create a cache. I will paste the assignment details below. Any help would be greatly appreciated.
Almost forgot: my OS is Windows 10.
Requirements
This assignment asks you to create a web page categorization program.
The program reads 20 (or more) web pages. The URLs for some of these web pages can be maintained in a control file that is read when the program starts. The others should be links from these pages. (Wikipedia is the recommended source.) For each page, the program maintains frequencies of words along with any other related information that you choose.
The user can enter any other URL, and the program reports which other known page is most closely related, using a similarity metric of your choosing.
The implementation restrictions are:
Create a cache based on a custom hash table class you implement to keep track of pages that have not been modified since accessed; keep them in local files.
Use library collections or your own data structures for all other data stores. Read through the Collections tutorial.
Establish a similarity metric. This must be in part based on word-frequencies, but may include other attributes. If you follow the recommended approach of hash-based TF-IDF, create a hash table storing these.
A GUI allows a user to indicate one entity, and displays one or more similar ones.
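One way to read the professor's hint: keep a hash table in memory that maps each URL to a local file plus the Last-Modified value seen when the page was fetched, and on the next access let a conditional GET decide whether the local copy is still valid. Below is a rough sketch of that idea; it uses HashMap as a stand-in for the custom hash table class the assignment asks you to write, and the file-naming scheme is just an assumption.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

public class PageCache {
    private static class Entry {
        Path file;          // local copy of the page body
        long lastModified;  // Last-Modified header value when it was fetched
    }

    private final Map<String, Entry> index = new HashMap<>();   // replace with your own hash table
    private final Path cacheDir;

    public PageCache(Path cacheDir) throws IOException {
        this.cacheDir = Files.createDirectories(cacheDir);
    }

    /** Returns the page body, served from the local file if the server says it is unchanged. */
    public String get(String url) throws IOException {
        Entry entry = index.get(url);
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        if (entry != null) {
            conn.setIfModifiedSince(entry.lastModified);                    // conditional GET
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED) {
                return new String(Files.readAllBytes(entry.file), "UTF-8"); // cache hit
            }
        }
        // Cache miss or the page changed: download it and store a fresh local copy.
        String body;
        try (InputStream in = conn.getInputStream();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) out.write(buf, 0, n);
            body = out.toString("UTF-8");
        }
        Entry fresh = new Entry();
        fresh.file = cacheDir.resolve(Integer.toHexString(url.hashCode()) + ".html");
        fresh.lastModified = conn.getLastModified();
        Files.write(fresh.file, body.getBytes("UTF-8"));
        index.put(url, fresh);
        return body;
    }
}
```

A real cache would also bound its size and handle servers that don't send Last-Modified, but this is enough to show where the file system comes in: the map tracks what you have, the OS stores the bytes.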
I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the outputted documents.
I want to move to Java or a similar object-oriented solution, so I can implement unit testing and generally get better control of the code. (I'm not talking about overhauling the whole system, but injecting Java where I can.)
In terms of data size, we're talking about around 1 TB of data being both ingested and created. In terms of scaling, this might increase by a factor of around 10, but it isn't likely to grow on a massive scale the way a worldwide web project might.
The question is - what tools would be most appropriate for this kind of project?
Where would I find this information - what search terms should be used?
Is doing processing on an SQL database (creating and dropping tables, adding columns, as needed) an appropriate, or awful, solution?
I've had a quick look at Hadoop - but due to the small scale of this project, would Hadoop be an unnecessary complication?
Are there any Java packages that provide functionality similar to SAS or SQL in terms of merging, joining, sorting, and grouping datasets, as well as modifying data?
It's hard for me to prescribe exactly what you need given your problem statement.
It sounds like a good database API may be all you need (i.e. plain JDBC against a good open-source database backend).
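For example, a minimal JDBC sketch that pushes a SAS-style merge/summarise step down to the database. The connection URL, table, and column names below are made up for illustration:

```java
import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ScoreExtract {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/warehouse";   // any JDBC-capable backend
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                 // the join + group-by replaces a SAS MERGE / PROC SUMMARY step
                 "SELECT c.customer_id, COUNT(*) AS docs, AVG(d.score) AS avg_score " +
                 "FROM documents d JOIN customers c ON c.customer_id = d.customer_id " +
                 "WHERE d.created >= ? " +
                 "GROUP BY c.customer_id")) {
            stmt.setDate(1, Date.valueOf("2024-01-01"));
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    // each row is now plain Java data you can wrap in objects and unit test
                    System.out.printf("%s %d %.3f%n",
                            rs.getString("customer_id"), rs.getLong("docs"), rs.getDouble("avg_score"));
                }
            }
        }
    }
}
```

Keeping the heavy joins and aggregations in SQL and the Java layer thin also makes the Java logic straightforward to unit test.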
However, I think you should take some time to check out Lucene. It's a fantastic tool and may meet your scoring needs very well. Taking a search engine indexing approach to your problem may be fruitful.
I think the questions you need to ask yourself are:
What is the nature of your data set, and how often will it be updated?
What is the workload you will have on this 1 TB (or more) of data in the future? Will there be mainly offline read and analysis operations, or will there also be a lot of random write operations?
Here is an article discussing whether or not to choose Hadoop, which I think is worth reading.
Hadoop is a better choice if you only have daily or weekly updates of your data set and the major operations on the data are read-only, along with further data analysis. For the merging, joining, sorting, and grouping of datasets you mentioned, Cascading is a Java library running on top of Hadoop which supports these operations well.
I originally tried posting a similar post to the elasticsearch mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/BZLFJSEpl78) but didn't get any helpful responses, so I thought I'd give Stack Overflow a try. This is my first post on SO, so apologies if it doesn't quite fit into the mould it is meant to.
I'm currently working with a university helping them to implement a test suite to further refine some research they have been conducting. Their research is based around dynamic schema searching. After spending some time evaluating the various open source search solutions I settled on elasticsearch as the base platform and I am wondering what the best way to proceed would be. I have spent about a week looking into the elasticsearch documentation and the code itself and also reading the documentation of Lucene but I am struggling to see a clear way forward.
The goal of the project is to provide the researchers with a piece of software they can use to plug in revisions of the searching algorithm to test and refine. They would like to be able to write the pluggable algorithm in languages other than Java that are supported by the JVM, like Groovy, Python, or Clojure, but that isn't a hard requirement. Part of that will be to provide them with a front end to run queries and see output, and an admin interface to add documents to an index. I am comfortable with all of that thanks to the very powerful and complete REST API. What I am not so sure about is how to proceed with implementing the pluggable search algorithm.
The researcher's algorithm requires 4 inputs to function:
The query term(s).
A Word (term) x Document matrix across an index.
A Document x Word (term) matrix across an index.
A Word (term) frequency list across an index, i.e. how many times each word appears across the entire index.
For their purposes, a document doesn't correspond to an actual real-world document (they actually call them text events). Rather, for now, it corresponds to one sentence (having that configurable might also be useful). I figure the best way to handle this is to break documents down into their sentences (using Apache Tika or something similar), putting each sentence in as its own document in the index. I am confident I can do this in the admin UI I provide, using the mapper-attachments plugin as a starting point. The downside is that breaking up the document before giving it to elasticsearch isn't a very configurable way of doing it. If they want to change the resolution of their algorithm, they would need to re-add all documents to the index again. If the index stored the full documents as-is and the search algorithm could choose what resolution to work at per query, that would be perfect. I'm not sure whether that is possible, though.
The next problem is how to get the other three inputs they require and pass them into the pluggable search algorithm. I'm really struggling with where to start on this one. From looking at Lucene, it seems I need to provide my own search/query implementation, but I'm not sure whether this is right. There also don't seem to be any search plugins listed on the elasticsearch site, so I'm not even sure it is possible. The important thing here is that the algorithm needs to operate at the index level, with the query terms available, to generate its schema before using that schema to score each document in the index. From what I can tell, this means the scripting interface provided by elasticsearch won't be of any use: the description in the elasticsearch guide makes it sound like a script operates at the document level and not the index level. Other concerns/considerations are the ability to program this algorithm in a range of languages (just like the scripting interface) and the ability to augment what is returned by the REST API for a search to include the schema the algorithm generated (which I assume means I will need to define my own REST endpoint(s)).
Can anybody give me some advice on where to get started here? It seems like I am going to have to write my own search plugin that can accept scripts as its core algorithm. The plugin would be responsible for organising the 4 inputs I outlined earlier before passing control to the script, and for getting the output from the script and returning it via its own REST API. Does this seem logical? If so, how do I get started, and which parts of the code do I need to look at?
You should store 1 sentence per document if that's how their algorithm works. You can always reindex if they change their model.
Lucene is pretty good at finding matches, so I suspect your co-workers' algorithm will be dealing with scoring. ElasticSearch supports custom scoring scripts, and you can pass params to a given scoring script. You can use Groovy for scripting in ES.
http://www.elasticsearch.org/guide/reference/modules/scripting.html
If you need larger data structures in your search algorithm, it does not make sense to pass them as params; you might find it useful to pull them from another data source inside the scoring script.
For example Redis: http://java.dzone.com/articles/connecting-redis-elasticsearch .
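As for the index-level inputs (the term x document matrices and the per-term frequency list), those statistics can be read straight from the underlying Lucene index. A hedged sketch follows; the method names match recent Lucene releases and may differ in the Lucene version bundled with your ElasticSearch (older releases use MultiFields and DocsEnum instead), and the index path and field name are placeholders:

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class IndexStats {
    public static void main(String[] args) throws Exception {
        try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
            Terms terms = MultiTerms.getTerms(reader, "body");        // the indexed text field
            TermsEnum termsEnum = terms.iterator();
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                // input 4: collection-wide frequency of each term
                System.out.printf("%s total=%d docs=%d%n",
                        term.utf8ToString(), termsEnum.totalTermFreq(), termsEnum.docFreq());

                // inputs 2/3: walk the postings to fill one row/column of the term x document matrix
                PostingsEnum postings = termsEnum.postings(null, PostingsEnum.FREQS);
                int doc;
                while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    int freqInDoc = postings.freq();
                    // accumulate (term, doc, freqInDoc) into whatever sparse structure the algorithm wants
                }
            }
        }
    }
}
```

A plugin could rebuild these structures on each index refresh and push them somewhere the scoring script can reach per query, e.g. Redis as suggested above.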
When versioning or optimizing file backups one idea is to use only the delta or data that has been modified.
This sounds like a simple idea at first, but actually determining where unmodified data ends and new data starts comes across as a difficult task.
Is there an existing framework that already does something like this or an efficient file comparison algorithm?
XDelta is not Java but is worth looking at anyway. There is a Java version of it, but I don't know how stable it is.
Instead of rolling your own, you might consider leveraging an open source version control system (e.g., Subversion). You get a lot more than just a delta versioning algorithm that way.
It sounds like you are describing a difference based storage scheme. Most source code control systems use such systems to minimize their storage requirements. The *nix "diff" command is capable of generating the data you would need to implement it on your own.
Here's a Java library that can compute diffs between two plain text files:
http://code.google.com/p/google-diff-match-patch/
I don't know any library for binary diffs though. Try googling for 'java binary diff' ;-)
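For the plain-text case, usage of that library is roughly as follows (a sketch: the class keeps its original C-style name, and the two text versions here are just placeholders):

```java
import java.util.LinkedList;

import name.fraser.neil.plaintext.diff_match_patch;

public class TextDelta {
    public static void main(String[] args) {
        String oldVersion = "The quick brown fox jumps over the lazy dog.";
        String newVersion = "The quick brown fox leaps over the sleepy dog.";

        diff_match_patch dmp = new diff_match_patch();

        // 1. Inspect the differences themselves.
        LinkedList<diff_match_patch.Diff> diffs = dmp.diff_main(oldVersion, newVersion);
        dmp.diff_cleanupSemantic(diffs);             // merge noisy single-character edits
        for (diff_match_patch.Diff d : diffs) {
            System.out.println(d.operation + " \"" + d.text + "\"");
        }

        // 2. Build a compact patch (the delta you would store) and re-apply it later.
        LinkedList<diff_match_patch.Patch> patches = dmp.patch_make(oldVersion, newVersion);
        String delta = dmp.patch_toText(patches);    // persist this instead of the whole file
        System.out.println("stored delta:\n" + delta);

        Object[] applied = dmp.patch_apply(patches, oldVersion);
        System.out.println(applied[0].equals(newVersion));   // true: new version reconstructed
    }
}
```

For binary files this library won't help, which is where tools like XDelta or Bsdiff (mentioned below) come in.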
In my opinion, the Bsdiff tool is the best choice for binary files. It uses suffix sorting (Larsson and Sadakane's qsufsort) and takes advantage of how executable files change. Bsdiff was written in C by Colin Percival. Diff files created by Bsdiff are generally smaller than those created by Xdelta.
It is also worth noting that Bsdiff uses the bzip2 compression algorithm. Binary patches created by Bsdiff can sometimes be compressed further using other compression algorithms (like the WinRAR archiver's).
Here is the site where you can find Bsdiff documentation and download Bsdiff for free: http://www.daemonology.net/bsdiff/