dl4j - what's the label mechanism in paragraph2vec?

dl4j - what's the label mechanism in paragraph2vec? - java

I just read the paper Distributed Representations of Sentences and Documents. In the sentiment analysis experiment section, it says, "After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating." So it uses logistic regression algorithm as a classifier to determine what the label is.
Then I moved on to dl4j, I read the example "ParagraphVectorsClassifierExample" the code shows as below:
void makeParagraphVectors() throws Exception {
ClassPathResource resource = new ClassPathResource("paravec/labeled");
// build a iterator for our dataset
iterator = new FileLabelAwareIterator.Builder()
.addSourceFolder(resource.getFile())
.build();
tokenizerFactory = new DefaultTokenizerFactory();
tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());
// ParagraphVectors training configuration
paragraphVectors = new ParagraphVectors.Builder()
.learningRate(0.025)
.minLearningRate(0.001)
.batchSize(1000)
.epochs(20)
.iterate(iterator)
.trainWordVectors(true)
.tokenizerFactory(tokenizerFactory)
.build();
// Start model training
paragraphVectors.fit();
}
void checkUnlabeledData() throws IOException {
/*
At this point we assume that we have model built and we can check
which categories our unlabeled document falls into.
So we'll start loading our unlabeled documents and checking them
*/
ClassPathResource unClassifiedResource = new ClassPathResource("paravec/unlabeled");
FileLabelAwareIterator unClassifiedIterator = new FileLabelAwareIterator.Builder()
.addSourceFolder(unClassifiedResource.getFile())
.build();
/*
Now we'll iterate over unlabeled data, and check which label it could be assigned to
Please note: for many domains it's normal to have 1 document fall into few labels at once,
with different "weight" for each.
*/
MeansBuilder meansBuilder = new MeansBuilder(
(InMemoryLookupTable<VocabWord>)paragraphVectors.getLookupTable(),
tokenizerFactory);
LabelSeeker seeker = new LabelSeeker(iterator.getLabelsSource().getLabels(),
(InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable());
while (unClassifiedIterator.hasNextDocument()) {
LabelledDocument document = unClassifiedIterator.nextDocument();
INDArray documentAsCentroid = meansBuilder.documentAsVector(document);
List<Pair<String, Double>> scores = seeker.getScores(documentAsCentroid);
/*
please note, document.getLabel() is used just to show which document we're looking at now,
as a substitute for printing out the whole document name.
So, labels on these two documents are used like titles,
just to visualize our classification done properly
*/
log.info("Document '" + document.getLabels() + "' falls into the following categories: ");
for (Pair<String, Double> score: scores) {
log.info(" " + score.getFirst() + ": " + score.getSecond());
}
}
}
It demonstrates how does doc2vec associate arbitrary documents with labels, but it hides the implementations behind the scenes. My question is: is it does so also by logistic regression? if not, what is it? And how can I do it by logistic regression?

I'm not familiar with DL4J's approach, but at the core 'Paragraph Vector'/'Doc2Vec' level, documents typically have an identifier assigned by the user – most typically, a single unique ID. Sometimes, though, these (provided) IDs have been called "labels", and further, sometimes it can be useful to re-use known-labels as if they were per-document doc-tokens, which can lead to confusion. In the Python gensim library, we call those user-provided tokens "tags" to distinguish from "labels" that might be from a totally different, and downstream, vocabulary.
So in a followup paper like "Document Embedding with Paragraph Vectors", each document has a unique ID - its title or identifer within Wikpedia or Arxiv. But then the resulting doc-vectors are evaluated by how well they place documents with the same category-labels closer to each other than third documents. So there's both a learned doc-tag space, and a downstream evaluation based on other labels (that weren't in any way provided to the unsupervised Paragraph Vector algorithm).
Similarly, you might give all training documents unique IDs, but then later train a separate classifier (of any algorithm) to use the doc=vectors as inputs, and learn to predict other labels. That's my understanding of the IMDB experiment in the original 'Paragraph Vectors' paper: every review has a unique ID during training, and thus got its own doc-vector. But then a downstream classifier was trained to predict positive/negative review sentiment based on those doc-vectors. So, the assessment/prediction of labels ("positive"/"negative") was a separate downstream step.
As mentioned, it's sometimes the case that re-using known category-labels as doc-ids – either as the only doc-ID, or as an extra ID in addition to a unique-per-document ID – can be useful. In a way, it creates synthetic combined documents for training, made up of all documents with the same label. This may tend to influence the final space/coordinates to be more discriminative with regard to the known labels, and thus make the resulting doc-vectors more helpful to downstream classifiers. But then you've replaced classic 'Paragraph Vector', with one ID per doc, with a similar semi-supervised approach where known labels influence training.

Related

how to split a file using python in key and values (or any other format)

Lets say I have a doc file which contains the following
Therapeutic Focus and Assessment : Describe the (1) types of interventions (such as pharmacologic, surgical, preventive, lifestyle,
self-care) and (2) administration and intensity of the intervention
(including dosage, strength, duration, frequency).
Follow-up and Outcomes : Please describe the clinical course of this case including all follow-up visits as well as (1) intervention
modification, interruption, or discontinuation, and the reasons; (2)
adherence to the intervention and how this was assessed;
Discussion : Please describe the strengths and limitations of this case report including case management, and the scientific and
medical literature related to this case report.
In this file I want to separate each heading and its content. Which means I will have 3 headings and 3 contains. I am thinking to make headings as a key and content as its value. How I can use regex to filter this information.
Little change in file structure : (additional question)
Therapeutic Focus and Assessment : Describe the (1) types of interventions (such as pharmacologic, surgical, preventive, lifestyle,
self-care) and (2) administration and intensity of the intervention
(including dosage, strength, duration, frequency).
Discussion :
Please describe the strengths and limitations of this case report including case > management, and the scientific. Health : medical literature related to this > case report.
If I will have this king of file where content in first paragraph is in line and in second paragraph it has a line gap. One more additional section is also included in the same paragraph. In that case how I will split?

Here's a way to do it with splitting based on characters rather then regex.
String document = "Header: blah blah \n Header: blah blah"
String[] sections = document.split("\n");
String[] headers = new String[sections.length];
String[] bodies = new String[sections.length];;
for(int i = 0; i < sections.length; i++){
headers[i] = sections[i].split(":")[0];
bodies[i] = sections[i].substring(headers[i].length() + 2);
}
The same split method works on regex patterns if you have something more complex to divide by then a carriage return and a ":", but by the looks of it this may work for you.

How to retrieve all variants of a lexeme in Java?

I am searching for a way to retrieve all variants of the lexeme of a specific word.
Example: running -> (run, runs, ran, running…)
I tried out Stanford NLP according to this post. However, the lemma-annotator only retrieves the lemma (running -> run), not the complete set of variants. Is there a way to do this with Stanford NLP or another Java Lib/Framework?
Clarification: I do not search for a stemmer. Also, I would like to avoid programming a new algorithm from scratch to crawl WordNet or similar dictionaries.

The short answer is that a standard NLP library or toolkit is unlikely to solve this problem. Like Stanford NLP, most libraries will only provide a mapping from word --> lemma. Note that this is a many-to-one function, i.e., the inverse function is not well-defined in a word space. It is, however, a well defined function from the space of words to the space of sets of words (i.e., it's a one-to-many mapping in word-space).
Without some form of explicit mapping being maintained, it is impossible to generate all the variants from a given lemma. This is a theoretical impossibility because lemmatization is a lossy, one-way function.
You can, however, generate a mapping of lemma --> set-of-words without much coding (and definitely without coding a new algorithm):
// Java
Map<String, Set<String>> inverseLemmaMap = new HashMap<>();
// Guava
Multimap<String, String> inverseLemmaMap = HashMultimap.create();
Then, as you annotate your corpus using Stanford NLP, you can obtain the lemma and its corresponding token, and populate the above map (or multimap). This way, after a single pass over your dataset, you will have the required inverse lemmatization.
Note that this will be restricted to the corpus/dataset you are using, and not all words in the English language will be included.
Another note is that people often think that an inflection is uniquely determined by the part of speech. This is incorrect:
String s = "My running was beginning to hurt me. I was running all day."
The first instance of running is tagged NN, while the second instance is the present continuous tense of the verb, tagged VBG. This is what I meant by "lossy, one-way function" earlier in my answer.

Using multiple classifiers in a Java program

I am using Stanford named entity recognition system to identify named entities in my queries.
I discover that one of the classifier (english.all.3class.distsim.crf.ser.gz) identify Person named entity more than the other (english.muc.7class.distsim.crf.ser.gz). While the second classifier identify Organization named entity more than the first classifier.
The question is how do i modify my code to combine both the performance of 3class and 7class classifiers. I mean how to combine line 2 and 3. Below is my program
public void main () {
//String serializedClassifier = "classifiers/english.all.3class.distsim.crf.ser.gz";
String serializedClassifier = "classifiers/english.muc.7class.distsim.crf.ser.gz";
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions (serializedClassifier);
//String s5 = "Access Team Microsoft";
String s5 = " Victor Vianu";
String ans4 = classifier.classifyToString(s5);
System.out.println(ans4);
}

You are actually rephrasing a common optimization problem for classification tasks. It is possible to combine inputs from different named entity recognizers, but it is very unlikely that you will be able to do this in code. In that case, you would in practice be writing rule sets to express the preference of one annotation over the other. Such rule sets can become very complex and are very difficult to maintain, even if you are using a specialized framework.
Normally, an additional classifier is trained using both inputs along with annotation performed by humans. This is called supervised machine learning. If you are interested in the topic, have a look at the Gate framework, which provides a relatively gentle (GUI and lots of documentation available) introduction into text engineering and machine learning. You may be interested in the section about setting up machine learning: https://gate.ac.uk/sale/tao/splitch19.html#x24-46100019.2

One idea is to get the scores or confidence of the ner tagger (3class, 7class etc.) and select the results of one classifier or another based on this score, for each token that has been identified as an entity of a certain type.
You can do that by creating an e.g.,
List> results3ClassClassifier = 3Class.classify(textWithInstances);
(you get the results of the 7Class classifier as well,)
then you can have access to the scores , for instance:
for (List<CoreLabel> sentence : results3ClassClassifier) {
Triple<Counter<Integer>, Counter<Integer>, TwoDimensionalCounter<Integer, String>> scoresPerClass = 3Class.printProbsDocument(sentence);
//take score corresponding to a tagged token
//put that in a set of scores to get later max confidence
}
THis will give you something like:
Victor null O=0.011843980924431554 PERSON=0.9836181561256115 DATE=1.8909193023530869E-6 LOCATION=9.292607205855801E-5 ORGANIZATION=0.004434710463296233 PERCENT=1.168471927873708E-6 MONEY=3.925377823223501E-6 TIME=3.241645549570373E-6
Vianu null O=0.0019172569594321069 PERSON=0.9933531725355365 DATE=8.384789134033624E-6 LOCATION=2.134536699512499E-4 ORGANIZATION=0.004496303036216309 PERCENT=1.6256270957396572E-6 MONEY=7.507677140678878E-6 TIME=2.2957054950451435E-6
for Victor Vianu (great researcher!) this is straightforward, but for less known entities, the fact that you can get the confidence is useful in practice.

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.
Here are a couple of lines:
1|“#chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“#Zach_Paull: When did green skittles change from lime to green apple? #notafan” #Skittles
1|#dtfcdvEric: #MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|#STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx #SuryaRay
Here is the total data set: http://pastebin.com/eJuEb4eB
I need to build a model that classifies "Apple" (Inc). from the rest.
I'm not looking for a general overview of machine learning, rather I'm looking for actual model in code (Python preferred).

What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.
Essentially, it looks at the content and context of the word, (looking back and forward a few words), to estimate the probability that the word is a named entity.
Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "Vowel-consonant-vowel")
A very good library (GPL) is Stanford's NER
Here's the demo: http://nlp.stanford.edu:8080/ner/
Some sample text to try:
I was eating an apple over at Apple headquarters and I thought about
Apple Martin, the daughter of the Coldplay guy
(the 3class and 4class classifiers get it right)

I would do it as follows:
Split the sentence into words, normalise them, build a dictionary
With each word, store how many times they occurred in tweets about the company, and how many times they appeared in tweets about the fruit - these tweets must be confirmed by a human
When a new tweet comes in, find every word in the tweet in the dictionary, calculate a weighted score - words that are used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score.

I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat-competitive with existing solutions and (crucially) is customisable.
There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!
I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall (precision and recall). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but with low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code:
https://github.com/ianozsvald/social_media_brand_disambiguator
I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe that this straightforward approach will work.
I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).
I use the basic tokenizer which is very basic but surprisingly useful. It ignores # # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).
The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.
This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.
I have a series of blog posts describing this project including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).
If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.
Here's a possible algorithmic approach using scikit-learn:
Use a Binary CountVectorizer (I don't think term-counts in short messages add much information as most words occur only once)
Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
Things to consider:
You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
Improve the tokeniser (very easy with scikit-learn) to keep # # in tokens, and maybe add a capitalised-brand detector (as user #user2425429 notes)
Consider a non-linear classifier (like #oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
A tweet-specific part of speech tagger (in my humble opinion not Standford's as #Neil suggests - it performs poorly on poor Twitter grammar in my experience)
Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)
Re. overfitting. In my dataset with 2000 items I have a 10 minute snapshot from Twitter of 'apple' tweets. About 2/3 of the tweets are for Apple Inc, 1/3 for other-apple-uses. I pull out a balanced subset (about 584 rows I think) of each class and do five-fold cross validation for training.
Since I only have a 10 minute time-window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshop, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.

You can do the following:
Make a dict of words containing their count of occurrence in fruit and company related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.
Using enough previous data, we can find out the probability of a word occurring in tweet about apple inc.
Multiply individual probabilities of words to get the probability of the whole tweet.
A simplified example:
p_f = Probability of fruit tweets.
p_w_f = Probability of a word occurring in a fruit tweet.
p_t_f = Combined probability of all words in tweet occurring a fruit tweet
= p_w1_f * p_w2_f * ...
p_f_t = Probability of fruit given a particular tweet.
p_c, p_w_c, p_t_c, p_c_t are respective values for company.
A laplacian smoother of value 1 is added to eliminate the problem of zero frequency of new words which are not there in our database.
old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets =total_company_words = total_fruit_words = 0
for tweet in old_tweets:
company = old_tweets[tweet]
for word in tweet.lower().split(" "):
if not word in known_words:
known_words[word] = {"company":0, "fruit":0 }
if company == "1":
known_words[word]["company"] += 1
total_company_words += 1
else:
known_words[word]["fruit"] += 1
total_fruit_words += 1
if company == "1":
total_company_tweets += 1
else:
total_fruit_tweets += 1
total_tweets = len(old_tweets)
def predict_tweet(new_tweet,K=1):
p_f = (total_fruit_tweets+K)/(total_tweets+K*2)
p_c = (total_company_tweets+K)/(total_tweets+K*2)
new_words = new_tweet.lower().split(" ")
p_t_f = p_t_c = 1
for word in new_words:
try:
wordFound = known_words[word]
except KeyError:
wordFound = {'fruit':0,'company':0}
p_w_f = (wordFound['fruit']+K)/(total_fruit_words+K*(len(known_words)))
p_w_c = (wordFound['company']+K)/(total_company_words+K*(len(known_words)))
p_t_f *= p_w_f
p_t_c *= p_w_c
#Applying bayes rule
p_f_t = p_f * p_t_f/(p_t_f*p_f + p_t_c*p_c)
p_c_t = p_c * p_t_c/(p_t_f*p_f + p_t_c*p_c)
if p_c_t > p_f_t:
return "Company"
return "Fruit"

If you don't have an issue using an outside library, I'd recommend scikit-learn since it can probably do this better & faster than anything you could code by yourself. I'd just do something like this:
Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:
def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
labels = [1 for x in xrange(len(apple_inc_tweets))] + [0 for x in xrange(len(apple_fruit_tweets))]
return (corpus, labels)
The important thing is you end up with two lists that look like this:
([['apple inc tweet i love ios and iphones'], ['apple iphones are great'], ['apple fruit tweet i love pie'], ['apple pie is great']], [1, 1, 0, 0])
The [1, 1, 0, 0] represent the positive and negative labels.
Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:
def train(corpus, labels)
pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
('tfidf', TfidfTransformer(norm='l2')),
('clf', LinearSVC()),])
pipe.fit_transform(corpus, labels)
return pipe
Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.
Finally, you predict the category of tweets:
def predict(pipe, tweet):
prediction = pipe.predict([tweet])
return prediction
Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.
Put all those into a class or whatever, and you're done. At least, with this very basic example.
I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.
EDIT: tried to explain the steps in more detail.

Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive bayes classifier with my chosen features.
If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
#coding: utf-8
import nltk
import random
import re
def get_split_sets():
structured_dataset = get_dataset()
train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
test_set = [x for x in structured_dataset if x not in train_set]
train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
return (train_set, test_set)
def check_accurracy(times=5):
s = 0
for _ in xrange(times):
train_set, test_set = get_split_sets()
c = nltk.classify.DecisionTreeClassifier.train(train_set)
# Uncomment to use a naive bayes classifier instead
#c = nltk.classify.NaiveBayesClassifier.train(train_set)
s += nltk.classify.accuracy(c, test_set)
return s / times
def remove_urls(tweet):
tweet = re.sub(r'http:\/\/[^ ]+', "", tweet)
tweet = re.sub(r'pic.twitter.com/[^ ]+', "", tweet)
return tweet
def tweet_features(tweet):
words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
features = dict()
for bigram in nltk.bigrams(words):
features["hasBigram(%s)" % ",".join(bigram)] = True
for trigram in nltk.trigrams(words):
features["hasTrigram(%s)" % ",".join(trigram)] = True
return features
def get_dataset():
dataset = """copy dataset in here
"""
structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
return structured_dataset
if __name__ == '__main__':
print check_accurracy()

Thank you for the comments thus far. Here is a working solution I prepared with PHP. I'd still be interested in hearing from others a more algorithmic approach to this same solution.
<?php
// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();
// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);
// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
// init
$blnActualClass = substr($v,0,1);
$strTweet = trim(substr($v,2));
// Score Tweet
$intScore = score($strTweet);
// Build Confusion Matrix and Log False Positives & Negatives for Review
if ($intScore > 0) {
if ($blnActualClass == 1) {
// True Positive
$tp++;
} else {
// False Positive
$fp++;
$arrFP[] = $strTweet;
}
} else {
if ($blnActualClass == 1) {
// False Negative
$fn++;
$arrFN[] = $strTweet;
} else {
// True Negative
$tn++;
}
}
}
// Confusion Matrix and Logging
echo "
Predicted
1 0
Actual 1 $tp $fp
Actual 0 $fn $tn
";
if (count($arrFP) > 0) {
echo "\n\nFalse Positives\n";
foreach ($arrFP as $strTweet) {
echo "$strTweet\n";
}
}
if (count($arrFN) > 0) {
echo "\n\nFalse Negatives\n";
foreach ($arrFN as $strTweet) {
echo "$strTweet\n";
}
}
function LoadDictionaryArray() {
$strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;
$arrDictionary = explode("\n", $strDictionary);
foreach ($arrDictionary as $k => $v) {
$arr = explode('|', $v);
$arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
}
return $arrDictionary;
}
function score($str) {
$str = strtolower($str);
$intScore = 0;
foreach (LoadDictionaryArray() as $arrDictionaryItem) {
if (strpos($str,$arrDictionaryItem['term']) !== false) {
$intScore += $arrDictionaryItem['value'];
}
}
return $intScore;
}
?>
The above outputs:
Predicted
1 0
Actual 1 31 1
Actual 0 1 17
False Positives
1|Royals apple #ASGame #mlb # News Corp Building http://instagram.com/p/bBzzgMrrIV/
False Negatives
-1|RT #MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.

In all the examples that you gave, Apple(inc) was either referred to as Apple or apple inc, so a possible way could be to search for:
a capital "A" in Apple
an "inc" after apple
words/phrases like "OS", "operating system", "Mac", "iPhone", ...
or a combination of them

To simplify answers based on Conditional Random Fields a bit...context is huge here. You will want to pick out in those tweets that clearly show Apple the company vs apple the fruit. Let me outline a list of features here that might be useful for you to start with. For more information look up noun phrase chunking, and something called BIO labels. See (http://www.cis.upenn.edu/~pereira/papers/crf.pdf)
Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well.
In Natural Language Processing, you are going to want to keep this as general as possible.
Other features to get from surrounding words include the following:
Whether the first character is a capital
Whether the last character in the word is a period
The part of speech of the word (Look up part of speech tagging)
The text itself of the word
I don't advise this, but to give more examples of features specifically for Apple:
WordIs(Apple)
NextWordIs(Inc.)
You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.
Keep in mind that natural language processing is a pipeline based system. Typically, you break things in to sentences, move to tokenization, then do part of speech tagging or even dependency parsing.
This is all to get you a list of features you can use in your model to identify what you're looking for.

There's a really good library for processing natural language text in Python called nltk. You should take a look at it.
One strategy you could try is to look at n-grams (groups of words) with the word "apple" in them. Some words are more likely to be used next to "apple" when talking about the fruit, others when talking about the company, and you can use those to classify tweets.

Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The maximum you'll have to do is to write a loop to pick the best combination of flags. I used it to do supervised speech act classification in emails and the results were up to 95-97% accurate (during 5 fold cross validation!).
And it comes from the makers of LIBSVM and LIBLINEAR whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.

Make an AI filter to distinguish Apple Inc (the company) from apple (the fruit). Since these are tweets, define your training set with a vector of 140 fields, each field being the character written in the tweet at position X (0 to 139). If the tweet is shorter, just give a value for being blank.
Then build a training set big enough to get a good accuracy (subjective to your taste). Assign a result value to each tweet, a Apple Inc tweet get 1 (true) and an apple tweet (fruit) gets 0. It would be a case of supervised learning in a logistic regression.
That is machine learning, is generally easier to code and performs better. It has to learn from the set you give it, and it's not hardcoded.
I don't know Python, so I can not write the code for it, but if you were to take more time for machine learning's logic and theory you might want to look the class I'm following.
Try the Coursera course Machine Learning by Andrew Ng. You will learn machine learning on MATLAB or Octave, but once you get the basics you will be able to write machine learning in about any language if you do understand the simple math (simple in logistic regression).
That is, getting the code from someone won't make you able to understand what is going in the machine learning code. You might want to invest a couple of hours on the subject to see what is really going on.

I would recommend avoiding answers suggesting entity recognition. Because this task is a text-classification first and entity recognition second (you can do it without the entity recognition at all).
I think the fastest path to results will be spacy + prodigy.
Spacy has well thought through model for English language, so you don't have to build your own. While prodigy allows quickly create training datasets and fine tune spacy model for your needs.
If you have enough samples, you can have a decent model in 1 day.

Lucene: Searching multiple fields with default operator = AND

To allow users to search across multiple fields with Lucene 3.5 I currently create and add a QueryParser for each field to be searched to a DisjunctionMaxQuery. This works great when using OR as the default operator but I now want to change the default operator to AND to get more accurate (and fewer) results.
Problem is, queryParser.setDefaultOperator(QueryParser.AND_OPERATOR) misses many documents since all terms must be in atleast 1 field.
For example, consider the following data for a document: title field = "Programming Languages", body field = "Java, C++, PHP". If a user were to search for Java Programming this particular document would not be included in the results since the title nor the body field contains all terms in the query although combined they do. I would want this document returned for the above query but not for the query HTML Programming.
I've considered a catchall field but I have a few problems with it. First, users frequently include per field terms in their queries (author:bill) which is not possible with a catchall field. Also, I highlight certain fields with FastVectorHighlighter which requires them to be indexed and stored. So by adding a catchall field I would have to index most of the same data twice which is time and space consuming.
Any ideas?

Guess I should have done a little more research. Turns out MultiFieldQueryParser provides the exact functionality I was looking for. For whatever reason I was creating a QueryParser for each field I wanted to search like this:
String[] fields = {"title", "body", "subject", "author"};
QueryParser[] parsers = new QueryParser[fields.length];
for(int i = 0; i < parsers.length; i++)
{
parsers[i] = new QueryParser(Version.LUCENE_35, fields[i], analyzer);
parsers[i].setDefaultOperator(QueryParser.AND_OPERATOR);
}
This would result in a query like this:
(+title:java +title:programming) | (+body:java +body:programming)
...which is not what I was looking. Now I create a single MultiFieldQueryParser like this:
MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_35, new String[]{"title", "body", "subject"}, analyzer);
parser.setDefaultOperator(QueryParser.AND_OPERATOR);
This gives me the query I was looking for:
+(title:java body:java) +(title:programming body:programming)
Thanks to #seeta and #femtoRgon for the help!

Perhaps what you need is a combination of Boolean queries that capture the different combinations of fields and terms. In your given example, the query could be -
(title:Java AND body:programming) OR (title:programming AND body:Java).
I don't know if there's an existing Query class that generates this automatically for you, but I think that's what should be the ultimate query that's run on the index.

You want to be able to search multiple fields with the same set of terms, then the question from your comment:
((title:java title:programming) | (body:java body:programming))~0.2
May not be the best implementation.
You're effectively getting either the score from the title, or the score from the body for the combined set of terms. The case where you hit java in the title and programming in the body would be given approx. equal weight to a hit on java in the body and no hit on programming.
I think a better structured query would be:
(title:java body:java)~0.2 (title:programming body:programming)~0.2
This makes more sense to me, since you want the dismax queries to limit score growing on multiple queries of the same term (in different fields), but you do want scoring to grow for hits on different terms, I believe.
If that sort of query structure gets you better score results, limiting results to a certain minimum score (a percentage of the max score returned, rather than a simple hard-coded value) may be adequate to prevent too-weak results from being seen.
I also still wouldn't count out indexing an all field. It's an implementation I've used before, while indexing BOTH the specific field and the catchall field, thus allowing both general querying and specific single-field queries. Index storage tends to be pretty lean for unstored terms, and it will generally help performance, if you find yourself having to create big, complicated queries to make up for not having it.
If you really want to be sure that it takes minimal storage, you can even turn off TermVectors for that field:
new Field(name, value, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.NO);
Although I don't know how much of a difference that would really make.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.