I have been thinking about an approach to this problem but have not found a solution that convinces me. I am writing a crawler, and I have a download task for every URL in a list of URLs. The HTML documents are parsed differently depending on the site URL and the information I want to extract. So my problem is how to link each task with its appropriate parser.
The ideas are:
Creating a huge 'if' that checks the download type and associates a parser with it.
(Avoided, because the 'if' grows with every new site added to the crawler.)
Using polymorphism: creating a different download task for each site and each type of information I want to get, and then using a post-action to link its parser.
(Again, the complexity increases with every new parser.)
So I am looking for some kind of software pattern or idea that says:
Hey, I am a download task with this information.
Really? Then you need this parser to extract it. Here is the parser you need.
Additional information:
The architecture is very simple: a list of URLs that are the seeds for the crawler, a producer that downloads the pages, another list holding the downloaded HTML documents, and a consumer that should apply the right parser to each page.
Depending on the page downloaded, sometimes we need to use parser A, sometimes parser B, etc.
EDIT
An example:
We have three websites: site1.com, site2.com and site3.com
There are three URL types that we want to parse: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C
Every URL is parsed differently, and usually the same information is found across site1.com/A - site2.com/A - site3.com/A ; ... ; site1.com/C - site2.com/C - site3.com/C
It looks like a genetic algorithm approach fits your description of the problem. What you need to find first are the basic (atomic) solutions.
Here's a short description from Wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
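The loop described above can be sketched in Java on a toy problem. This is only an illustration under stated assumptions: the fitness function (counting 1-bits), the tournament selection, the 1% mutation rate and the class name `OneMaxGa` are all made up for the example, and keeping the best individual unchanged in each generation (elitism) is an addition that the quoted description does not mention.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Toy genetic algorithm: maximize the number of 1-bits in a fixed-length bit string. */
class OneMaxGa {
    // Fitness function (assumed for this example): the number of 1-bits.
    static int fitness(boolean[] genome) {
        int ones = 0;
        for (boolean bit : genome) if (bit) ones++;
        return ones;
    }

    // Tournament selection: the fitter of two randomly drawn individuals.
    static boolean[] select(boolean[][] population, Random rng) {
        boolean[] a = population[rng.nextInt(population.length)];
        boolean[] b = population[rng.nextInt(population.length)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    // Single-point crossover followed by per-bit mutation.
    static boolean[] breed(boolean[] mother, boolean[] father, double mutationRate, Random rng) {
        int cut = rng.nextInt(mother.length);
        boolean[] child = new boolean[mother.length];
        for (int i = 0; i < child.length; i++) {
            child[i] = i < cut ? mother[i] : father[i];
            if (rng.nextDouble() < mutationRate) child[i] = !child[i];
        }
        return child;
    }

    static boolean[] best(boolean[][] population) {
        return Arrays.stream(population).max(Comparator.comparingInt(OneMaxGa::fitness)).get();
    }

    /** Evolves a random population and returns the fittest individual found. */
    static boolean[] evolve(int populationSize, int genomeLength, int maxGenerations, long seed) {
        Random rng = new Random(seed);
        boolean[][] population = new boolean[populationSize][genomeLength];
        for (boolean[] genome : population)
            for (int i = 0; i < genomeLength; i++) genome[i] = rng.nextBoolean();

        for (int gen = 0; gen < maxGenerations; gen++) {
            boolean[] elite = best(population);
            if (fitness(elite) == genomeLength) break; // satisfactory fitness reached
            boolean[][] next = new boolean[populationSize][];
            next[0] = elite.clone(); // elitism (an addition): the best individual survives
            for (int i = 1; i < populationSize; i++)
                next[i] = breed(select(population, rng), select(population, rng), 0.01, rng);
            population = next;
        }
        return best(population);
    }
}
```

Calling `OneMaxGa.evolve(20, 32, 50, 42L)` runs at most 50 generations on 20 individuals of 32 bits each. How a crawler's parser selection would be encoded as a genome is not something the answer specifies, so the sketch does not attempt it.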
I would externalize the parsing pattern/structure in some form (like XML) and use it dynamically.
For example, say I have to download site1.com and site2.com, and the two have different layouts. I would create two XML files, each holding one layout pattern,
and one master XML file that records which URL should use which XML file.
At startup, load this master XML and use it as a dictionary. When you have to download a page, download it, look up its XML in the dictionary, and pass that XML and the stream to a single generic parser, which reads the stream according to the flow and information in the XML.
In this way, we can capture common patterns in XML and reuse them to read similar sites. Use regular expressions in the XML patterns to cover most sites with a single XML.
If the layout is completely different, just create one new XML and update the master XML; that's it.
The secret to the success of this design is how you create such generic XMLs, and that depends purely on what you need and what you do after parsing.
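The dictionary lookup at the heart of this design can be sketched in Java as follows. It is a minimal sketch, not the full design: the class name `ParserRegistry` and its methods are hypothetical, the rules are registered in code rather than loaded from a master XML, and a parser is reduced to a function from the HTML text to a result string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Pattern;

/** Maps URL patterns to parsers, so the consumer needs no growing 'if'. */
class ParserRegistry {
    // One rule = one URL regular expression plus the parser to use for it.
    private static final class Rule {
        final Pattern urlPattern;
        final Function<String, String> parser;

        Rule(Pattern urlPattern, Function<String, String> parser) {
            this.urlPattern = urlPattern;
            this.parser = parser;
        }
    }

    // Rules are checked in registration order, so register specific ones first.
    private final List<Rule> rules = new ArrayList<>();

    void register(String urlRegex, Function<String, String> parser) {
        rules.add(new Rule(Pattern.compile(urlRegex), parser));
    }

    /** Returns the first parser whose pattern matches the whole URL, if any. */
    Optional<Function<String, String>> parserFor(String url) {
        for (Rule rule : rules) {
            if (rule.urlPattern.matcher(url).matches()) {
                return Optional.of(rule.parser);
            }
        }
        return Optional.empty();
    }
}
```

With a rule such as `registry.register("(https?://)?site\\d+\\.com/A", parserA)`, one entry covers site1.com/A, site2.com/A and site3.com/A, and adding a new site means adding a rule rather than changing the consumer.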
This seems to be a connectivity problem. I'd suggest considering the quick find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
and here's a simple Java sample:
https://gist.github.com/gtkesh/3604922
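For readers who cannot follow the links, quick find can be sketched in a few lines of Java. This is a generic sketch over integer elements, not the code from the linked gist; how crawler tasks and parsers would be mapped to those integers is left open, as the answer does not say.

```java
/** Quick-find: connected() and find() are O(1), union() is O(n). */
class QuickFind {
    private final int[] id; // id[i] is the component label of element i

    QuickFind(int n) {
        id = new int[n];
        for (int i = 0; i < n; i++) id[i] = i; // every element starts in its own component
    }

    int find(int p) {
        return id[p];
    }

    boolean connected(int p, int q) {
        return id[p] == id[q];
    }

    // Relabels every element of p's component with q's label.
    void union(int p, int q) {
        int from = id[p];
        int to = id[q];
        if (from == to) return;
        for (int i = 0; i < id.length; i++) {
            if (id[i] == from) id[i] = to;
        }
    }
}
```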
I successfully followed the deeplearning4j.org tutorial on Word2Vec, so I am able to load an already trained model or train a new one based on some raw text (more specifically, I am using the GoogleNews-vectors-negative300 and Emoji2Vec pre-trained models).
However, I would like to combine the two models above for the following reason: given a sentence (for example, a comment from Instagram or Twitter that contains emoji), I want to identify the emoji in the sentence and then map it to the word it is related to. To do that, I was planning to iterate over all the words in the sentence and calculate the closeness (how near the emoji and the word are located in the vector space).
I found code showing how to uptrain an existing model. However, it is mentioned that new words are not added in this case and only the weights of the existing words are updated based on the new text corpus.
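One common way to measure this closeness is cosine similarity. Here is a minimal sketch in plain Java, assuming the emoji vector and the word vectors have already been looked up as `double[]` arrays of equal length and live in the same vector space; the class and method names are hypothetical and no DL4J calls are shown.

```java
/** Closeness of vectors measured by cosine similarity. */
class Closeness {
    /** Cosine similarity of two equal-length vectors; 1.0 means same direction. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) return 0.0; // treat zero vectors as unrelated
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Index of the word vector closest to the emoji vector, or -1 if there are none. */
    static int closest(double[] emoji, double[][] words) {
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < words.length; i++) {
            double score = cosine(emoji, words[i]);
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

The comparison is only meaningful if both vectors come from the same space, which is exactly the difficulty with two independently trained models.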
I would appreciate any help or ideas on the problem I have. Thanks in advance!
Combining two models trained from different corpuses is not a simple, supported operation in the word2vec libraries with which I'm most familiar.
In particular, even if the same word appears in both corpuses, and even in similar contexts, the randomization that's used by this algorithm during initialization and training, and extra randomization injected by multithreaded training, mean that word may appear in wildly different places. It's only the relative distances/orientation with respect to other words that should be roughly similar – not the specific coordinates/rotations.
So to merge two models requires translating one's coordinates to the other. That in itself will typically involve learning-a-projection from one space to the other, then moving unique words from a source space to the surviving space. I don't know if DL4J has a built-in routine for this; the Python gensim library has a TranslationMatrix example class in recent versions which can do this, as motivated by the use of word-vectors for language-to-language translations.
For a data-loss-prevention-like tool, I have a requirement to look up different types of data such as driver's license numbers, social security numbers, names, etc. While most of this is pattern-based and hence could be found using pattern matching with regular expressions, names happen to be a very broad category. Virtually any set of characters could form a name. However, to make it a meaningful lookup, I think I should only look names up against a defined dictionary of names. Here is what I am thinking.
Provide a dictionary of names as a configuration item. This looks more sensible because the names might vary by geographic region for each use case. I am looking for best practices for doing this in Java. Basically, these are the questions:
What is a good data structure to store the names? A Set comes to mind as the first option; are there better options, like in-memory databases?
How should I go about searching for these names in large data sets? The data sets are really large, and I can only read them row by row.
Any other option?
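To make the Set option concrete, here is a minimal Java sketch of a row-by-row scan. It assumes single-word names and case-insensitive matching; the class name `NameScanner` is hypothetical, and multi-word names would need a different tokenization.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Looks up the tokens of each row in a configured dictionary of names. */
class NameScanner {
    private final Set<String> names = new HashSet<>();

    NameScanner(Collection<String> dictionary) {
        for (String name : dictionary) {
            names.add(name.toLowerCase(Locale.ROOT)); // normalize once at load time
        }
    }

    /** Returns the tokens of one row that are dictionary names, in order of appearance. */
    List<String> namesIn(String row) {
        List<String> hits = new ArrayList<>();
        // Split on runs of non-letters, so digits and punctuation separate tokens.
        for (String token : row.split("[^\\p{L}]+")) {
            if (names.contains(token.toLowerCase(Locale.ROOT))) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

Each row costs one hash lookup per token, independent of the dictionary size, which suits data that can only be read one row at a time.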
Take a look at concurrent-trees and CQEngine projects.
You can do it with full text indexing or online search.
I would prefer full text indexing, e.g. with Lucene. You will have to define how the indexer finds tokens in the text (by defining the token patterns and the dont-care-patterns).
Known patterns (e.g. license numbers) should be annotated at indexing time with their type. Querying the index for an annotated type (e.g. license number) will return you all contained license numbers.
Flexible patterns (like names) should be indexed as tokens. You can then iterate over the collection of legal names and query the index with each one.
This approach is not the most flexible, but it is very robust to changes to the set of data files (simply add the new file to the index) or to the set of names (simply query the index for the new name).
With this approach, how you store the set of names is not really relevant to performance.
The other approach would be to search for multiple strings (names). Note that there are special search algorithms for multiple strings and that most algorithms have a preferred range of params (pattern size, alphabet size, number of patterns to search). You can get some impressions at StringBench.
This approach allows more flexible string patterns.
However, it is not robust to modifications of the set of names (the complete search then has to be repeated).
Multi-string search algorithms usually accept a set of strings to search for, but store this set in an algorithm-specific way (most use a trie).
edit:
Efficient search for multiple patterns/strings can be done with DFA-based automata.
The first time I wanted to search efficiently in text I chose dk.brics.automaton. Its automaton is very efficient, yet it is optimized for matching, not for searching (search is done in a naive way).
I then shifted to my own implementation, rexlex. It is DFA-based, but slightly slower than brics. Its search algorithm is not as naive as the one in brics, but adds some overhead.
You can find a link to a benchmark comparing both. The benchmark illustrates the problem of DFA-based regexes: the time to compile such a DFA can get very expensive if the regex is large.
I currently favor the stringandchars implementation of multi-string/pattern search. It is focused on search performance, yet I do not know how it compares to the solutions above. The most common case of searching multiple regex patterns in text will be much more performant than in the above solutions.
This is a rather newbie question, so please take it with a grain of salt.
I'm new to the field of data mining and trying to get my head around the topic. Right now I'm trying to polish my existing model so that it classifies instances better. The problem is that my model has around 480 attributes. I know for sure that not all of them are relevant, but it's hard for me to point out which ones are actually important.
The question is: given valid training and test sets, can one use some sort of data mining algorithm that throws away attributes that seem to have no impact on the quality of classification?
I'm using Weka.
You should test using some of the Classifier algorithms that Weka has.
The basic idea is to use the Cross-validation option, so you can see which algorithm gives you the best Correctly Classified Instances value.
I can give you an example from one of my training sets, using the Cross-validation option with 10 folds.
As you can see, using the J48 classifier I get:
Correctly Classified Instances 4310 83.2207 %
Incorrectly Classified Instances 869 16.7793 %
and if I use, for example, the NaiveBayes algorithm I get:
Correctly Classified Instances 1996 38.5403 %
Incorrectly Classified Instances 3183 61.4597 %
and so on, the values differ depending on the algorithm.
So, test as many algorithms as possible and see which one gives you the best Correctly Classified Instances / Time consumed.
Comment converted to answer as OP suggested:
If you use Weka 3.6.6, select the Explorer module, then go to the "Select attributes" tab and choose an "Attribute evaluator" and a "Search method". You can also choose between using the full data set or cross-validation sets. For more details see e.g. http://forums.pentaho.com/showthread.php?68687-Selecting-Attributes-with-Weka or http://weka.wikispaces.com/Performing+attribute+selection
Read up on the topic of clustering algorithms (only on your training set though!)
Look into the InfoGainAttributeEval class.
The buildEvaluator() and the evaluateAttribute(int index) functions should help.
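To show what an information-gain score measures, here is a small plain-Java sketch. It is not Weka's code and makes no Weka calls: it assumes a nominal attribute and class labels given as parallel string arrays, and the class name `InfoGain` is made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Information gain of a nominal attribute with respect to the class labels. */
class InfoGain {
    // Shannon entropy (in bits) of a label distribution given as counts.
    static double entropy(Map<String, Integer> counts, int total) {
        double h = 0.0;
        for (int count : counts.values()) {
            double p = (double) count / total;
            if (p > 0) h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    /** H(class) - H(class | attribute); values near 0 suggest an irrelevant attribute. */
    static double infoGain(String[] attribute, String[] labels) {
        int n = labels.length;
        Map<String, Integer> classCounts = new HashMap<>();
        Map<String, Integer> valueTotals = new HashMap<>();
        Map<String, Map<String, Integer>> classCountsByValue = new HashMap<>();
        for (int i = 0; i < n; i++) {
            classCounts.merge(labels[i], 1, Integer::sum);
            valueTotals.merge(attribute[i], 1, Integer::sum);
            classCountsByValue
                    .computeIfAbsent(attribute[i], k -> new HashMap<>())
                    .merge(labels[i], 1, Integer::sum);
        }
        // Weighted average of the class entropy within each attribute value.
        double conditional = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : classCountsByValue.entrySet()) {
            int size = valueTotals.get(e.getKey());
            conditional += (double) size / n * entropy(e.getValue(), size);
        }
        return entropy(classCounts, n) - conditional;
    }
}
```

Ranking all 480 attributes by such a score and dropping those near zero is one way to discard attributes that carry little information about the class on their own.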
We have many street names, which come in a file. I'd probably cache them when booting the server in production. The search should work like autocomplete: e.g. you type 'lang' and you get maybe 8 hits: langstr, langestr, etc.
What you are looking for is some sort of compressed trie representation. You might want to look into succinct tries or DAWGs as a starting point, as they give excellent efficiency and very good space usage.
Hope this helps!
Autocomplete is usually implemented using one of the following:
Trees. By indexing the searchable text in a tree structure (prefix tree, suffix tree, dawg, etc..) one can execute very fast searches at the expense of memory storage. The tree traversal can be adapted for approximate matching.
Pattern Partitioning. By partitioning the text into tokens (ngrams) one can execute searches for pattern occurrences using a simple hashing scheme.
Filtering. Find a set of potential matches and then apply a sequential algorithm to check each candidate.
Take a look at completely, a Java autocomplete library that implements some of the latter concepts.
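As an illustration of the tree-based option, here is a minimal prefix-tree sketch in Java. It is a plain, uncompressed trie rather than a succinct trie or DAWG, it does no approximate matching, and the class name `PrefixTrie` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain prefix tree returning a bounded number of completions. */
class PrefixTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so completions come out alphabetically.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    /** Up to 'limit' stored words that start with 'prefix', in alphabetical order. */
    List<String> complete(String prefix, int limit) {
        List<String> hits = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return hits; // no stored word has this prefix
        }
        collect(node, new StringBuilder(prefix), limit, hits);
        return hits;
    }

    // Depth-first walk below the prefix node, stopping once 'limit' hits are found.
    private void collect(Node node, StringBuilder path, int limit, List<String> hits) {
        if (hits.size() >= limit) return;
        if (node.isWord) hits.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, limit, hits);
            path.setLength(path.length() - 1);
        }
    }
}
```

The trie would be filled once at server start from the street-name file, and each keystroke then costs a walk of the prefix plus at most `limit` collected words.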
I am using the ISO 19794-2 fingerprint data format, and all of my data is in that format. I have more than a hundred thousand fingerprints, and I want to search them efficiently to identify a match. Is it possible to construct a binary-tree-like structure to perform an efficient (fastest) search for a match? Or can you suggest a better way to find the match? Please also suggest an open source Java API for fingerprint matching. Thanks.
Do you have a background in fingerprint matching? It is not a simple problem and you'll need a bit of theory to tackle such a problem. Have a look at this introduction to fingerprint matching by Bologna University's BioLab (a leading research lab in this field).
Let's now answer your question, namely how to make the search more efficient.
Fingerprints can be classified into 5 main classes, according to the type of macro-singularity that they exhibit.
There are three types of macro-singularities:
whorl (a sort of circle)
loop (a U inversion)
delta (a sort of three-way crossing)
According to the position of those macro-singularities, you can classify a fingerprint into one of these classes:
arch
tented arch
right loop
left loop
whorl
Once you have narrowed the search to the correct class, you can perform your matches. From your question it looks like you have to do an identification task, so I'm afraid that you'll have to do all the comparisons, or else add some layers of pre-processing (like the classification I wrote about) to further narrow the search field.
You can find lots of information about fingerprint matching in the book Handbook of Fingerprint Recognition, by Maltoni, Maio, Jain and Prabhakar - leading researchers in this field.
In order to read ISO 19794-2 format, you could use some utilities developed by NIST called BiomDI, Software Tools supporting Standard Biometric Data Interchange Formats. You could try to interface it with open source matching algorithms like the one found in this biometrics SDK. It would however need a lot of work, including the conversion from one format to another and the fine-tuning of algorithms.
My opinion (as a Ph.D. student working in biometrics) is that in this field you can easily write code that does 60% of what you need in no time, but the remaining 40% will be:
hard to write (20%); and
really hard to write without money and time (20%).
Hope that helps!
Edit: added info about NIST BiomDI
Edit 2: since people sometimes email me asking for a copy of the standard, I unfortunately don't have one to share. All I have is a link to the ISO page that sells the standard.
The ISO format specifies useful mechanisms for matching and decision parameters. Decide which mechanism you wish to employ to identify the match, and the relevant decision parameters. Once you have determined these, examine them to see which can be put into an order and take a fairly wide range of individual values, as you want to avoid multiple collisions on the data. When you have identified a small number of data items (preferably one) that have this property, calculate the property for each fingerprint, preferably as it is added to the database, though a bulk load can be done initially. The search for a match is then done on the calculated characteristic, and can be done with a binary tree, a red-black tree, or a variety of other search structures. I cannot recommend a particular search strategy without knowing what form and degree of differentiation of values you have in your database. Such a search strategy should, however, be capable of delivering a (small) range of possible matches, which can then be tested individually against your match mechanism and parameters before deciding on a specific match.
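The range lookup on a precomputed characteristic might look like this in Java. It is a sketch under assumptions: `feature` stands for whatever orderable integer property you settle on, `tolerance` is a made-up parameter, and the class name `CandidateIndex` is hypothetical. `java.util.TreeMap` is a red-black tree, so it fits the structure described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Narrows a large fingerprint set to a few candidates via an ordered feature. */
class CandidateIndex {
    // Sorted by the precomputed feature; several fingerprints may share a value.
    private final TreeMap<Integer, List<String>> byFeature = new TreeMap<>();

    void add(String fingerprintId, int feature) {
        byFeature.computeIfAbsent(feature, k -> new ArrayList<>()).add(fingerprintId);
    }

    /** Ids whose feature lies within 'tolerance' of the probe's feature. */
    List<String> candidates(int probeFeature, int tolerance) {
        List<String> result = new ArrayList<>();
        for (List<String> ids : byFeature
                .subMap(probeFeature - tolerance, true, probeFeature + tolerance, true)
                .values()) {
            result.addAll(ids);
        }
        return result;
    }
}
```

Only the returned candidates would then be passed to the actual (and much more expensive) matcher.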