I successfully followed deeplearning4j.org tutorial on Word2Vec, so I am able to load already trained model or train a new one based on some raw text (more specifically, I am using GoogleNews-vectors-negative300 and Emoji2Vec pre-trained model).
However, I would like to combine these two above models for the following reason: Having a sentence (for example, a comment from Instagram or Twitter, which consists of emoji), I want to identify the emoji in the sentence and then map it to the word it is related to. In order to do that, I was planning to iterate over all the words in the sentence and calculate the closeness (how near the emoji and the word are located in the vector space).
I found the code how to uptrain the already existing model. However, it is mentioned that new words are not added in this case and only weights for the existing words will be updated based on a new text corpus.
I would appreciate any help or ideas on the problem I have. Thanks in advance!
Combining two models trained from different corpuses is not a simple, supported operation in the word2vec libraries with which I'm most familiar.
In particular, even if the same word appears in both corpuses, and even in similar contexts, the randomization that's used by this algorithm during initialization and training, and extra randomization injected by multithreaded training, mean that word may appear in wildly different places. It's only the relative distances/orientation with respect to other words that should be roughly similar – not the specific coordinates/rotations.
So to merge two models requires translating one's coordinates to the other. That in itself will typically involve learning-a-projection from one space to the other, then moving unique words from a source space to the surviving space. I don't know if DL4J has a built-in routine for this; the Python gensim library has a TranslationMatrix example class in recent versions which can do this, as motivated by the use of word-vectors for language-to-language translations.
Related
is there any pretrained vector for particular topic only? for example "java", so i want vectors related java in file. mean if i give input inheritance then cosine similarity show me polymorphism and other related stuff only!
i am using corpus as GoogleNews-vectors-negative300.bin and Glove vectors. still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of java source code you can use code2vec which provides pre-trained word-embeddings models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so its vector for 'Java' may be dominated by stories of the Java island of Indosnesia as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm which is also available in the gensim Python library as the Phrases class. But, the exact parameters Google used have not, as far as I know, been revealed.And, good results from this algorithm can require a lot of data and tuning, and still result in some combinations that a person would consider nonsense, and missing others that a person would consider natural.)
I am trying to parse a TABLE in PDF file and display it as CSV. I have attached sample data from PDF below(only few columns) and sample output for the same. Each column width is fixed, let's say Company Name(18 char),Amount(8 char), Type(5 char) etc. I tried using Itext and PDFBox jars to get each page data and parsed line by line, but sounds like it is not a great solution as the line breaks and page breaks in PDF are not proper. Please me let me know if there is any other appropriate solution. We want to use any open source software for this.
This is a very complex problem. There are multiple master dissertations about this even.
An easy analogy: I have 5000 puzzle-pieces, all of them are perfectly square and could fit anywhere. Some of them have pieces of lines on them, some of them have snippets of text.
However, that does not mean it can't be done. It'll just take work.
General approach:
use iText (specifically IEventListener) to get information on all rendering events for every page
select those rendering events that make sense for your application. PathRenderInfo and TextRenderInfo.
Events in a pdf do not need to appear in order according to the spec. Solve this problem by implementing a comparator over IEventData. This comparator should sort according to reading order. This implies you might have to implement some basic language detection, since not every language reads left-to-right.
Once sorted, you can now start clustering items together according to any of the various heuristics you find in literature. For instance, two characters can be grouped into a snippet of text if they follow each other in the sorted list of events (meaning they appear next to each other in reading order), if the y-position does not differ too much (subscript and superscript might screw with this), and if the x-position does not differ too much (kerning).
Continue clustering characters until you have formed words
Assuming you have formed words, use similar algorithm to form words into lines. Use PathRenderInfo to withhold merging words if they intersect with a line.
Assuming you have managed to create lines, now look for tables. One possible approach is apply a horizontal and vertical projection. And look for those sub-areas in the page that (when projected) show a grid-like structure.
This high-level approach should make it painfully obvious why this is not a widely available thing. It's very hard to implement. It requires domain-knowledge of both PDF, fonts, and machine-learning.
If you are ok with commercial solutions, try out pdf2Data. It's an iText add-on that features this exact functionality.
http://itextpdf.com/itext7/pdf2Data
I have been thinking in an approach for this problem but I have not found any solution which convince me. I am programming a crawler and I have a downloading task for every url from a urls list. In addition, the different html documents are parsed in different mode depending of the site url and the information that I want to take. So my problem is how to link every task with its appropriate parse.
The ideas are:
Creating an huge 'if' where check the download type and to associate a parse.
(Avoided, because the 'if' is growing with every new different site added to crawler)
Using polymorphism, to create a download task different for every different site and related to type of information which I want to get, and then use a post-action where link its parse.
(Increase the complexity again with every new parser)
So I am looking for some kind of software pattern or idea for say:
Hey I am a download task with this information
Really? Then you need this parse for extract it. Here is the parse you need.
Additional information:
The architecture is very simple. A list with urls which are seeds for the crawler. A producer which download the pages. Other list with html documents downloaded. And a consumer who will should apply the right parse for the page.
Depending of the page download sometimes we need use a parse A, or a parse B, etc..
EDIT
An example:
We have three site webs: site1.com, site2.com and site3.com
There are three urls type which we want parsing: site1.com/A, site1.com/B, site1.com/C, site2.com/A, site2.com/B, site2.com/C, ... site3.com/C
Every url it parsed different and usually the same information is between site1.com/A - site2.com/A - site3.com/A ; ... ; site1.com/C - site2.com/C - site3.com/C
Looks like a Genetic Algorithm aproached solution fits for your description of the problem, what you need to find first is the basics (atomic) solutions.
Here's a tiny description from wikipedia:
In a genetic algorithm, a population of candidate solutions (called individuals, creatures, or phenotypes) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of properties (its chromosomes or genotype) which can be mutated and altered; traditionally, solutions are represented in binary as strings of 0s and 1s, but other encodings are also possible.[2]
The evolution usually starts from a population of randomly generated individuals, and is an iterative process, with the population in each iteration called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is usually the value of the objective function in the optimization problem being solved. The more fit individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation. The new generation of candidate solutions is then used in the next iteration of the algorithm. Commonly, the algorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population.
A typical genetic algorithm requires:
a genetic representation of the solution domain,
a fitness function to evaluate the solution domain.
A standard representation of each candidate solution is as an array of bits.[2] Arrays of other types and structures can be used in essentially the same way. The main property that makes these genetic representations convenient is that their parts are easily aligned due to their fixed size, which facilitates simple crossover operations. Variable length representations may also be used, but crossover implementation is more complex in this case. Tree-like representations are explored in genetic programming and graph-form representations are explored in evolutionary programming; a mix of both linear chromosomes and trees is explored in gene expression programming.
Once the genetic representation and the fitness function are defined, a GA proceeds to initialize a population of solutions and then to improve it through repetitive application of the mutation, crossover, inversion and selection operators.
I would externalize the parsing pattern / structure in some form ( like XML ) and use them dynamically.
For example, I have to download site1.com an site2.com . Both are having two different layout . I will create two xml which holds the layout pattern .
And one master xml which can hold which url should use which xml .
While startup load this master xml and use it as dictionary. When you have to download , download the page and find the xml from dictionary and pass the dictionary and stream to the parser ( single generic parser) which can read the stream based on Xml flow and xml information.
In this way, we can create common patterns in xml and use it to read similar sites. Use Regular expressions in xml patterns to cover most of sites in single xml.
If the layout is completely different , just create one xml and modify master xml that's it.
The secret / success of this design is how you create such generic xmls and it is purely depends on what you need and what you are doing after parsing.
This seems to be a connectivity problem. I'd suggest considering the quick find algorithm.
See here for more details.
http://jaysonlagare.blogspot.com.au/2011/01/union-find-algorithms.html
and here's a simple java sample,
https://gist.github.com/gtkesh/3604922
I want to implement object detection in license plate (the city name) . I have an image:
and I want to detect if the image contains the word "بابل":
I have tried using a template matching method using OpenCV and also using MATLAB but the result is poor when tested with other images.
I have also read this page, but I was not able to get a good understanding of what to do from that.
Can anyone help me or give me a step by step way to solve that?
I have a project to recognize the license plate and we can recognize and detect the numbers but I need to detect and recognize the words (it is the same words with more cars )
Your question is very broad, but I will do my best to explain optical character recognition (OCR) in a programmatic context and give you a general project workflow followed by successful OCR algorithms.
The problem you face is easier than most, because instead of having to recognize/differentiate between different characters, you only have to recognize a single image (assuming this is the only city you want to recognize). You are, however, subject to many of the limitations of any image recognition algorithm (quality, lighting, image variation).
Things you need to do:
1) Image isolation
You'll have to isolate your image from a noisy background:
I think that the best isolation technique would be to first isolate the license plate, and then isolate the specific characters you're looking for. Important things to keep in mind during this step:
Does the license plate always appear in the same place on the car?
Are cars always in the same position when the image is taken?
Is the word you are looking for always in the same spot on the license plate?
The difficulty/implementation of the task depends greatly on the answers to these three questions.
2) Image capture/preprocessing
This is a very important step for your particular implementation. Although possible, it is highly unlikely that your image will look like this:
as your camera would have to be directly in front of the license plate. More likely, your image may look like one of these:
depending on the perspective where the image is taken from. Ideally, all of your images will be taken from the same vantage point and you'll simply be able to apply a single transform so that they all look similar (or not apply one at all). If you have photos taken from different vantage points, you need to manipulate them or else you will be comparing two different images. Also, especially if you are taking images from only one vantage point and decide not to do a transform, make sure that the text your algorithm is looking for is transformed to be from the same point-of-view. If you don't, you'll have an not-so-great success rate that's difficult to debug/figure out.
3) Image optimization
You'll probably want to (a) convert your images to black-and-white and (b) reduce the noise of your images. These two processes are called binarization and despeckling, respectively. There are many implementations of these algorithms available in many different languages, most accessible by a Google search. You can batch process your images using any language /free tool if you want, or find an implementation that works with whatever language you decide to work in.
4) Pattern recognition
If you only want to search for the name of this one city (only one word ever), you'll most likely want to implement a matrix matching strategy. Many people also refer to matrix matching as pattern recognition so you may have heard it in this context before. Here is an excellent paper detailing an algorithmic implementation that should help you immensely should you choose to use matrix matching. The other algorithm available is feature extraction, which attempts to identify words based on patterns within letters (i.e. loops, curves, lines). You might use this if the font style of the word on the license plate ever changes, but if the same font will always be used, I think matrix matching will have the best results.
5) Algorithm training
Depending on the approach you take (if you use a learning algorithm), you may need to train your algorithm with data that is tagged. What this means is that you have a series of images that you've identified as True (contains city name) or False (does not). Here's a psuedocode example of how this works:
train = [(img1, True), (img2, True), (img3, False), (img4, False)]
img_recognizer = algorithm(train)
Then, you apply your trained algorithm to identify untagged images.
test_untagged = [img5, img6, img7]
for image in test_untagged:
img_recognizer(image)
Your training sets should be much larger than four data points; in general, the bigger the better. Just make sure, as I said before, that all the images are of an identical transformation.
Here is a very, very high-level code flow that may be helpful in implementing your algorithm:
img_in = capture_image()
cropped_img = isolate(img_in)
scaled_img = normalize_scale(cropped_img)
img_desp = despeckle(scaled_img)
img_final = binarize(img_desp)
#train
match() = train_match(training_set)
boolCity = match(img_final)
The processes above have been implemented many times and are thoroughly documented in many languages. Below are some implementations in the languages tagged in your question.
Pure Java
cvBlob in OpenCV (check out this tutorial and this blog post too)
tesseract-ocr in C++
Matlab OCR
Good luck!
If you ask "I want to detect if the image contains the word "بابل" - this is classic problem which is solved using http://code.opencv.org/projects/opencv/wiki/FaceDetection like classifier.
But I assume you still want more. Years ago I tried to solve simiar problems and I provide example image to show how good/bad it was:
To detected licence plate I used very basic rectangle detection which is included in every OpenCV samples folder. And then used perspective transform to fix layout and size. It was important to implement multiple checks to see if rectangle looks good enough to be licence plate. For example if rectangle is 500px tall and 2px wide, then probably this is not what I want and was rejected.
Use https://code.google.com/p/cvblob/ to extract arabic text and other components on detected plate. I just had similar need yesterday on other project. I had to extract Japanese kanji symbols from page:
CvBlob does a lot of work for you.
Next step use technique explained http://blog.damiles.com/2008/11/basic-ocr-in-opencv/ to match city name. Just teach algorithm with example images of different city names and soon it will tell 99% of them just out of box. I have used similar approaches on different projects and quite sure they work
I am new to Lucene and my project is to provide specialized search for a set
of booklets. I am using Lucene Java 3.1.
The basic idea is to help people know where to look for information in the (rather
large and dry) booklets by consulting the index to find out what booklet and page numbers match their query. Each Document in my index represents a particular page in one of the booklets.
So far I have been able to successfully scrape the raw text from the booklets,
insert it into an index, and query it just fine using StandardAnalyzer on both
ends.
So here's my general question:
Many queries on the index will involve searching for place names mentioned in the
booklets. Some place names use notational variants. For instance, in the body text
it will be called "Ship Creek" on one page, but in a map diagram elsewhere it might be listed as "Ship Cr." or even "Ship Ck.". What I need to know is how to approach treating the two consecutive words as a single term and add the notational variants as synonyms.
My goal is of course to search with any of the variants and catch all occurrences. If I search for (Ship AND (Cr Ck Creek)) this does not give me what I want because other words may appear between [ship] and [cr]/[ck]/[creek] leading to false positives.
So, in a nutshell I probably still need the basic stuff provided by StandardAnalyzer, but with specific term grouping to emit place names as complete terms and possibly insert synonyms to cover the variants.
For instance, the text "...allowed from the mouth of Ship Creek upstream to ..." would
result in tokens [allowed],[mouth],[ship creek],[upstream]. Perhaps via a TokenFilter along
the way, the [ship creek] term would expand into [ship creek][ship ck][ship cr].
As a bonus it would be nice to treat the trickier text "..except in Ship, Bird, and
Campbell creeks where the limit is..." as [except],[ship creek],[bird creek],
[campbell creek],[where],[limit].
This seems like a pretty basic use case, but it's not clear to me how I might be able to use existing components from Lucene contrib or SOLR to accomplish this. Should the detection and merging be done in some kind of TokenFilter? Do I need a custom Analyzer implementation?
Some of the term grouping can probably be done heuristically [],[creek] is [ creek]
but I also have an exhaustive list of places mentioned in the text if that helps.
Thanks for any help you can provide.
You can use Solr's Synonym Filter. Just set up "creek" to have synonyms "ck", "cr" etc.
I'm not aware of any existing functionality to solve your "bonus" problem.