How to train chunker in OpenNLP to Predict sequence of Words - java

In my project I need to predict word sequences in a sentence. I used OpenNLP sentence detection and tokenization with their trained models, but I need to classify a sequence of words in a sentence as one token for my related group, and their chunker is not predicting the patterns.
For example, if my group is food items, then the chunker should predict
chicken pizza as one token.
Can anybody explain how to train their model for our domain?

OpenNLP is open source; a quick poke through the source code shows me that they're using a Naive Bayes classifier [source here]. Somewhere in there will be the code they used to train it, which will tell you both how to train it and what type of corpus you need.
Re-training it will not be an afternoon project, though; these things tend to be time-sinks. So, depending on what you're doing, it may be a better use of your time to use their classifier as-is, even if it is not exactly what you're looking for. I'm not sure exactly what you're trying to do, but it may be possible to use a hack, like co-occurrence scores between your sequences of words (i.e. how often "chicken" and "pizza" appear together), as an approximation of what you're hoping to do with a re-trained classifier.
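To make the co-occurrence idea concrete, here is a toy sketch. The scoring function is invented for illustration (co-occurrence count divided by the count of the rarer word) and is not anything OpenNLP provides:

```java
import java.util.*;

// Toy co-occurrence scorer: measures how often two words appear in the
// same sentence, relative to their individual frequencies. A rough
// stand-in for retraining a chunker, not a replacement for it.
public class CoOccurrence {

    // Returns together-count / min(count(a), count(b)); a score of 1.0
    // means the rarer word never appears without the other one.
    public static double score(List<String> sentences, String a, String b) {
        int countA = 0, countB = 0, together = 0;
        for (String sentence : sentences) {
            Set<String> tokens = new HashSet<>(
                    Arrays.asList(sentence.toLowerCase().split("\\s+")));
            boolean hasA = tokens.contains(a), hasB = tokens.contains(b);
            if (hasA) countA++;
            if (hasB) countB++;
            if (hasA && hasB) together++;
        }
        if (countA == 0 || countB == 0) return 0.0;
        return (double) together / Math.min(countA, countB);
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "I ordered a chicken pizza yesterday",
                "the chicken pizza here is great",
                "she likes pizza with extra cheese",
                "grilled chicken is on the menu");
        System.out.println(score(corpus, "chicken", "pizza"));
    }
}
```

Pairs scoring above some hand-picked threshold could then be merged into a single token ("chicken_pizza") before further processing.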

Related

Pre trained vectors, nlp, word2vec, word embedding for particular topic?

Is there any pretrained vector set for a particular topic only? For example "java": I want vectors related to Java in a file. I mean, if I give the input "inheritance", then cosine similarity should show me "polymorphism" and other related terms only!
I am using GoogleNews-vectors-negative300.bin and GloVe vectors as the corpus, but I am still not getting related words.
Not sure if I understand your question/problem statement, but if you want to work with a corpus of java source code you can use code2vec which provides pre-trained word-embeddings models. Check it out: https://code2vec.org/
Yes, you can occasionally find other groups' pre-trained vectors for download, which may have better coverage of whatever problem domains they've been trained on: both more specialized words, and word-vectors matching the word sense in that domain.
For example, the GoogleNews word-vectors were trained on news articles circa 2012, so their vector for 'Java' may be dominated by stories about the Indonesian island of Java as much as the programming language. And many other vector-sets are trained on Wikipedia text, which will be dominated by usages in that particular reference-style of writing. But there could be other sets that better emphasize the word-senses you need.
However, the best approach is often to train your own word-vectors, from a training corpus that closely matches the topics/documents you are concerned about. Then, the word-vectors are well-tuned to your domain-of-concern. As long as you have "enough" varied examples of a word used in context, the resulting vector will likely be better than generic vectors from someone else's corpus. ("Enough" has no firm definition, but is usually at least 5, and ideally dozens to hundreds, of representative, diverse uses.)
Let's consider your example goal – showing some similarity between the ideas of 'polymorphism' and 'input inheritance'. For that, you'd need a training corpus that discusses those concepts, ideally many times, from many authors, in many problem-contexts. (Textbooks, online articles, and Stack Overflow pages might be possible sources.)
You'd further need a tokenization strategy that manages to create a single word-token for the two-word concept 'input_inheritance' - which is a separate challenge, and might be tackled via (1) a hand-crafted glossary of multi-word-phrases that should be combined; (2) statistical analysis of word-pairs that seem to occur so often together, they should be combined; (3) more sophisticated grammar-aware phrase- and entity-detection preprocessing.
(The multiword phrases in the GoogleNews set were created via a statistical algorithm, which is also available in the gensim Python library as the Phrases class. But the exact parameters Google used have not, as far as I know, been revealed. And good results from this algorithm can require a lot of data and tuning, and may still produce some combinations a person would consider nonsense while missing others a person would consider natural.)
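To illustrate approach (2), statistically combining frequent pairs, here is a toy sketch. The scoring ratio (a PMI-style "observed over expected" measure) and the minCount/threshold values are invented for the example and are simpler than what gensim's Phrases actually uses:

```java
import java.util.*;

// Toy phrase detector: joins adjacent word pairs whose co-occurrence is
// high relative to chance. The scoring ratio and thresholds are
// arbitrary choices for illustration.
public class PhraseDetector {

    public static List<String> combine(List<String[]> sentences,
                                       int minCount, double threshold) {
        Map<String, Integer> uni = new HashMap<>();  // unigram counts
        Map<String, Integer> bi = new HashMap<>();   // adjacent-pair counts
        long total = 0;
        for (String[] s : sentences) {
            for (int i = 0; i < s.length; i++) {
                uni.merge(s[i], 1, Integer::sum);
                total++;
                if (i + 1 < s.length) {
                    bi.merge(s[i] + " " + s[i + 1], 1, Integer::sum);
                }
            }
        }
        List<String> phrases = new ArrayList<>();
        for (Map.Entry<String, Integer> e : bi.entrySet()) {
            if (e.getValue() < minCount) continue; // ignore rare pairs
            String[] parts = e.getKey().split(" ");
            // Observed pair frequency relative to what chance predicts.
            double score = (double) e.getValue() * total
                    / ((double) uni.get(parts[0]) * uni.get(parts[1]));
            if (score > threshold) phrases.add(parts[0] + "_" + parts[1]);
        }
        return phrases;
    }

    public static void main(String[] args) {
        List<String[]> corpus = Arrays.asList(
                new String[]{"input", "inheritance", "is", "a", "concept"},
                new String[]{"we", "discuss", "input", "inheritance", "here"},
                new String[]{"inheritance", "and", "polymorphism"},
                new String[]{"the", "input", "value"});
        System.out.println(combine(corpus, 2, 2.0));
    }
}
```

On a real corpus you would run this over millions of tokens and then replace each detected pair with its joined token before training the word-vectors.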

Speech recognition / how to create TTS?

I have an idea to build a program that can interact with the user's voice in Arabic. A year ago I started with Sphinx-4, but I need to make an Arabic acoustic model, grammar, and dictionary, and I can't find the road forward. Can you give me a detailed description of how to create those things, and which IDE or programs are needed?
Please help me.
Ok, let me start at the very beginning, because I think you are not aware of the dimensions of your project, and you are mixing up two things (ASR and TTS). First, I would like to explain the different things you were talking about:
Acoustic Model: Every speech recognition system requires an acoustic model. Language, and in particular words, are made up of phonemes, which describe how something sounds. To give you an example, the letter a is not pronounced the same way in the two words below:
to bark <=> to take
Now your ASR system needs to detect these phonemes. To do this, it performs a spectral analysis of many short frames of the audio signal and computes features, like MFCCs. What happens with these features? They are fed into a classifier (I could write a whole chapter about the classifier here, but that would be too much information), which has to learn how to actually perform the classification. In simple words, it maps a set of features to a phoneme.
Dictionary: In your dictionary, you define every word that can be recognized by your ASR system. It tells the ASR the phoneme composition of a word. A short example for this is:
hello H EH L OW
world W ER L D
With this small dictionary, your system would be able to recognize the words hello and world.
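To make that file format concrete, here is a minimal parser for the two-column "word followed by phonemes" layout shown above. This is a sketch only; real Sphinx dictionaries have extra conventions, such as markers for alternate pronunciations:

```java
import java.util.*;

// Minimal parser for a pronouncing dictionary in the whitespace-separated
// "word PH ON EM ES" format shown above.
public class PronunciationDictionary {

    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> dict = new LinkedHashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 2) continue; // skip blank or malformed lines
            // First field is the word; the rest are its phonemes.
            dict.put(fields[0].toLowerCase(),
                     Arrays.asList(fields).subList(1, fields.length));
        }
        return dict;
    }

    public static void main(String[] args) {
        Map<String, List<String>> d = parse(Arrays.asList(
                "hello H EH L OW",
                "world W ER L D"));
        System.out.println(d.get("hello")); // [H, EH, L, OW]
    }
}
```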
Language Model (or Grammar): The language model holds information about the assembly of words for a given language. What does this mean? Think of the virtual keyboard of your smartphone. When you type in the words 'Will you marry', your keyboard might guess the next word to be 'me'. That is no magic. The model was learned from huge amounts of text files. Your LM does the same. It adds the knowledge about meaningful word compositions (what everybody calls a sentence) into the ASR system to further improve detection.
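The keyboard analogy can be made concrete with a toy bigram model, a sketch only; real ASR language models use higher-order n-grams with smoothing, typically stored in formats like ARPA:

```java
import java.util.*;

// Toy bigram language model: learns word-pair counts from text and
// predicts the most frequent follower of a given word, like the
// smartphone-keyboard example above.
public class BigramModel {
    private final Map<String, Map<String, Integer>> followers = new HashMap<>();

    public void train(String text) {
        String[] words = text.toLowerCase().split("\\s+");
        for (int i = 0; i + 1 < words.length; i++) {
            followers.computeIfAbsent(words[i], k -> new HashMap<>())
                     .merge(words[i + 1], 1, Integer::sum);
        }
    }

    // Returns the most frequently observed next word, or null if unseen.
    public String predictNext(String word) {
        Map<String, Integer> counts = followers.get(word.toLowerCase());
        if (counts == null) return null;
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        BigramModel lm = new BigramModel();
        lm.train("will you marry me");
        lm.train("will you marry me someday");
        lm.train("will you come along");
        System.out.println(lm.predictNext("marry")); // me
    }
}
```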
Now back to your problem: You need transcribed audio data for the following reasons:
You want to train your acoustic model if you have none.
You want to create a large enough dictionary.
You want to generate a language model from the text.
Long story short: you are wrong if you think you can solve all of these tasks on your own. Producing a reliable transcription alone is already a large amount of work. You should seriously reconsider your idea.

Improve CoreNLP POS tagger and NER tagger?

The CoreNLP part-of-speech tagger and named entity recognition tagger are pretty good out of the box, but I'd like to improve the accuracy further so that the overall program runs better. To explain more about accuracy: there are situations in which the POS/NER is wrongly tagged. For instance:
"Oversaw car manufacturing" gets tagged as NNP-NN-NN
Rather than VB* or something similar, since it's a verb-like phrase (I'm not a linguist, so take this with a grain of salt).
So what's the best way to accomplish accuracy improvement?
Are there better models out there for POS/NER that can be incorporated into CoreNLP?
Should I switch to other NLP tools?
Or create training models with exception rules?
First of all, "Oversaw car manufacturing" is not even a sentence and on its own does not make much sense :-) These models are most often trained on whole sentences. If you enter "He oversaw car manufacturing" here [1], which uses CoreNLP, you get a saner result.
Let's assume though that you still have inaccurate results. Unless you're using some small example model there's no "better" model per se. It always depends on the domain, and even the "default" models are trained on certain domains, e.g. newspapers.
Most likely you will have to train a model yourself, not with exception rules, but for a specific domain of text, e.g. texts talking about cars or about manufacturing, or with a certain style of writing, etc.
[1] http://nlp.stanford.edu:8080/corenlp/process

How to validate a chapter heading for text using fuzzy logic in Java

I need a solution for identifying incorrect chapter headings in a book.
We are developing an ingestion system for books that does all sorts of validation, like spell-checking and offensive-language-filtering. Now we'd like to flag chapter headings that seem inaccurate given the chapter body. For example, if the heading was "The Function of the Spleen", I would not expect the chapter to be about the liver.
I am familiar with fuzzy string matching algorithms, but this seems more like an NLP or classification problem. If I could match (or closely match) the phrase "function of the spleen", then that's great: high confidence. Otherwise, a high occurrence of both "function" and "spleen" in the text also yields confidence. And of course, the closer together they are, the better.
This needs to be done in-memory, on the fly, and in Java.
My current naive approach is to simply tokenize all the words, remove noise words (like prepositions), stem what's left, and then count the number of matches. At a minimum I'd expect each word in the heading to appear at least once in the text.
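Sketched in code, that naive approach looks something like this. The stop-word list and the suffix-stripping "stemmer" are deliberately minimal placeholders; a real implementation would use a proper stemmer such as Porter's:

```java
import java.util.*;

// Naive heading check: tokenize, drop stop words, crudely stem, then
// measure what fraction of the heading's content words occur in the
// chapter body. Stop words and suffix stripping are minimal placeholders.
public class HeadingChecker {

    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "of", "a", "an", "and", "or", "in", "on", "to", "is"));

    // Extremely crude stemmer: strips a few common suffixes.
    static String stem(String word) {
        for (String suffix : new String[]{"ing", "ions", "ion", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    static Set<String> contentStems(String text) {
        Set<String> stems = new HashSet<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                stems.add(stem(token));
            }
        }
        return stems;
    }

    // 1.0 means every heading word (after stemming) appears in the body.
    public static double coverage(String heading, String body) {
        Set<String> headingStems = contentStems(heading);
        if (headingStems.isEmpty()) return 0.0;
        Set<String> bodyStems = contentStems(body);
        int hits = 0;
        for (String s : headingStems) {
            if (bodyStems.contains(s)) hits++;
        }
        return (double) hits / headingStems.size();
    }
}
```

A coverage score near 0 would flag the heading for review; proximity and ordering are not captured here, which is exactly the gap I'm asking about.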
Is there a different approach, ideally one that would take into account things like proximity and ordering?
I think this is a classification problem; as such, take a look at WEKA.

Sentence Classification (Categorization)

I have been reading about text classification and found several Java tools which are available for classification, but I am still wondering: is text classification the same as sentence classification?
Is there any tool which focuses on sentence classification?
There's no formal difference between 'text classification' and 'sentence classification'; after all, a sentence is a type of text. But generally, when people talk about text classification, IMHO they mean larger units of text, such as an essay, review, or speech. Classifying a politician's speech as Democrat or Republican is a lot easier than classifying a tweet. When you have a lot of text per instance, you don't need to squeeze each training instance for all the information it can give you, and you can get pretty good performance out of a bag-of-words naive-Bayes model.
Basically, you might not get the required performance numbers if you throw off-the-shelf Weka classifiers at a corpus of sentences. You might have to augment the data in the sentence with POS tags, parse trees, word ordering, n-grams, etc. Also get any related metadata, such as creation time, creation location, and attributes of the sentence's author. Obviously, all of this depends on what exactly you are trying to classify; the features that work for you need to be intuitively meaningful to the problem at hand.
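For reference, the bag-of-words naive-Bayes baseline mentioned above fits in a few dozen lines. This is a toy sketch with Laplace smoothing and log-space scores, not a substitute for Weka's implementations:

```java
import java.util.*;

// Minimal bag-of-words multinomial naive Bayes. Laplace smoothing avoids
// zero probabilities; summing log-probabilities avoids underflow.
public class NaiveBayes {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Integer> totalWords = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, String text) {
        totalDocs++;
        docCounts.merge(label, 1, Integer::sum);
        for (String w : text.toLowerCase().split("\\s+")) {
            wordCounts.computeIfAbsent(label, k -> new HashMap<>())
                      .merge(w, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior + sum of smoothed log likelihoods
            double score = Math.log((double) docCounts.get(label) / totalDocs);
            for (String w : text.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(w, 0);
                score += Math.log((count + 1.0)
                        / (totalWords.get(label) + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

On single sentences this baseline is exactly where the extra features (POS tags, n-grams, metadata) would be concatenated into the input representation to recover performance.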
