I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to find the most similar one? Currently I am using Levenshtein distance, but the range of results is simply too wide, and on big databases I really doubt its feasibility. The engine is implemented in Java, and the database is MySQL.
Side note: A Chamber of Commerce number in the Netherlands is defined to be an 8-digit number for every company. An earlier version of this system appended another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization; nowadays entirely new COC numbers are issued for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLfl0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLfl0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and each word group is concatenated into one string. (A word group is, as the name says, a group of words, usually separated by spaces.)
The condition that the post-processing uses for it to be a COC number is the following: The length should be 8 or more, half of the content should be numbers and it should be alphanumerical.
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to tens of thousands of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)
Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: MySQL versions before 8.0 have no native way to modify a string with a regular expression (REGEXP_REPLACE only arrived in 8.0).
In my experience (which comes from ISBNs and other book-identifying data), you're going to need some sort of scoring algorithm to get this right.
This is procedural, so you probably need to do it in Java (or some other procedural programming language). For each candidate string, work through the following steps:
1. Is the candidate string found in the table exactly? If yes, score 1.0.
2. Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
3. Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9.
4. Is the candidate string the correct length after trimming all alphabetic characters from either the beginning or the end, and does it match? If so, score 0.8.
5. Do both steps 3 and 4, and if you get a match score 0.7.
6. Trim alpha characters from both the beginning and end, and if you get a match score 0.6.
7. Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
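The steps above could be sketched roughly like this. I'm using Python here only for brevity; the same logic ports directly to Java. The function names (`fix_ocr`, `variants`, `best_match`) are made up for this illustration, and I'm assuming the known numbers have been loaded into an in-memory set, so the "correct length" check is implied by set membership. Trimming happens before the character substitution, which is one of two possible readings of steps 5 and 7.

```python
import re


def fix_ocr(s):
    # Step 3 substitutions: lower-case L -> 1, upper-case O -> 0.
    return s.replace("l", "1").replace("O", "0")


def variants(candidate):
    """Yield (score, string) pairs in descending score order."""
    lead = re.sub(r"^[A-Za-z]+", "", candidate)    # trim leading letters
    trail = re.sub(r"[A-Za-z]+$", "", candidate)   # trim trailing letters
    both = re.sub(r"[A-Za-z]+$", "", lead)         # trim both ends
    yield 1.0, candidate                           # step 1
    if candidate.lower().startswith("kvk"):
        yield 1.0, candidate[3:]                   # step 2
    yield 0.9, fix_ocr(candidate)                  # step 3
    yield 0.8, lead                                # step 4
    yield 0.8, trail
    yield 0.7, fix_ocr(lead)                       # step 5
    yield 0.7, fix_ocr(trail)
    yield 0.6, both                                # step 6
    yield 0.55, fix_ocr(both)                      # step 7


def best_match(candidate, known_numbers):
    """Return (score, number) for the first variant found, or None."""
    for score, variant in variants(candidate):
        if variant in known_numbers:
            return score, variant
    return None


known = {"13041611", "09053023"}
print(best_match("KvK13041611", known))
print(best_match("12juni2013", known))
```

Run this for every candidate from the invoice and keep the candidate with the highest score. With tens of thousands of records the set lookup stays cheap, and the same lookups can be done as indexed equality queries against MySQL instead.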
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
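A minimal sketch of that equal-length idea, again in Python for brevity: slide an 8-character window over the candidate and compute the edit distance between each window and each known number. The function names and the `width` parameter are my own choices for the illustration, not part of any library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def closest_by_window(candidate, known_numbers, width=8):
    """Return (distance, number, window) for the closest window, or None."""
    best = None
    for start in range(len(candidate) - width + 1):
        window = candidate[start:start + width]
        for number in known_numbers:
            if len(number) != width:
                continue  # only compare substrings of equal length
            d = levenshtein(window, number)
            if best is None or d < best[0]:
                best = (d, number, window)
    return best


print(closest_by_window("KvK13O41611", ["13041611", "09053023"]))
```

This compares every window against every record, so it is only reasonable for the handful of candidates left over after the exact-match steps, not as the primary lookup.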
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.
Background:
I’m trying to make a “document-term” matrix in Java on Hadoop using MapReduce. A document-term matrix is like a huge table where each row represents a document and each column represents a possible word/term.
Problem Statement:
Assuming that I already have a term index list (so that I know which term is associated with which column number), what is the best way to look up the index for each term in each document so that I can build the matrix row-by-row (i.e., document-by-document)?
So far I can think of two approaches:
Approach #1:
Store the term index list on the Hadoop distributed file system. Each time a mapper reads a new document for indexing, spawn a new MapReduce job -- one job for each unique word in the document, where each job queries the distributed terms list for its term. This approach sounds like overkill, since I am guessing there is some overhead associated with starting up a new job, and since this approach might call for tens of millions of jobs. Also, I’m not sure if it’s possible to call a MapReduce job within another MapReduce job.
Approach #2:
Append the term index list to each document so that each mapper ends up with a local copy of the term index list. This approach is pretty wasteful with storage (there will be as many copies of the term index list as there are documents). Also, I’m not sure how to merge the term index list with each document -- would I merge them in a mapper or in a reducer?
Question Update 1
Input File Format:
The input file will be a CSV (comma separated value) file containing all of the documents (product reviews). There is no column header in the file, but the values for each review appear in the following order: product_id, review_id, review, stars. Below is a fake example:
"Product A","1","Product A is very, very expensive.","2"
"Product G","2","Awesome product!!","5"
Term Index File Format:
Each line in the term index file consists of the following: an index number, a tab, and then a word. Each possible word is listed only once in the index file, such that the term index file is analogous to what could be a list of primary keys (the words) for an SQL table. For each word in a particular document, my tentative plan is to iterate through each line of the term index file until I find the word. The column number for that word is then defined as the column/term index associated with that word. Below is an example of the term index file, which was constructed using the two example product reviews mentioned earlier.
1 awesome
2 product
3 a
4 is
5 very
6 expensive
Output File Format:
I would like the output to be in the “Matrix Market” (MM) format, a widely used format for storing sparse matrices (matrices with many zeros). This is the ideal format because most reviews will contain only a small proportion of all possible words, so for a particular document it is only necessary to specify the non-zero columns.
The first row in the MM format has three tab separated values: the total number of documents, the total number of word columns, and the total number of lines in the MM file excluding the header. After the header, each additional row contains the matrix coordinates associated with a particular entry, and the value of the entry, in this order: reviewID, wordColumnID, entry (how many times this word appears in the review). For more details on the Matrix Market format, see this link: http://math.nist.gov/MatrixMarket/formats.html.
Each review’s ID will equal its row index in the document-term matrix. This way I can preserve the review’s ID in the Matrix Market format so that I can still associate each review with its star rating. My ultimate goal -- which is beyond the scope of this question -- is to build a natural language processing algorithm to predict the number of stars in a new review based on its text.
Using the example above, the final output file would look like this (I can't get Stack Overflow to show tabs instead of spaces):
2 6 7
1 2 1
1 3 1
1 4 1
1 5 2
1 6 1
2 1 1
2 2 1
Well, you can use something analogous to an inverted index.
I'm suggesting this because I'm assuming both files are big, so comparing them one-to-one would be a real performance bottleneck.
Here's an approach that can be used:
You can feed both the input CSV file(s) (say, datafile1, datafile2) and the term index file (say, term_index_file) as input to your job.
Then, in each mapper, you check the source file name, something like this:
Pseudo code for mapper -
map(key, row, context) {
    // name of the file this input split came from
    String filename = ((FileSplit) context.getInputSplit()).getPath().getName();
    if (filename.startsWith("datafile")) {
        // split the review_id and words from row
        ....
        // emit once per word occurrence: dummy index -1, real review_id
        context.write(new Text(word), new Text("-1|" + review_id));
    } else if (filename.startsWith("term_index_file")) {
        // split index and word
        ....
        // emit once per term: real index, dummy review_id 0
        context.write(new Text(word), new Text(index + "|0"));
    }
}
e.g. output from different mappers
Key      Value  Source
product  -1|1   datafile
very     5|0    term_index_file
very     -1|1   datafile
product  -1|2   datafile
very     -1|1   datafile
product  2|0    term_index_file
...
...
Explanation (the example):
As the example shows, the key is the word and the value has two parts separated by the delimiter "|".
If the source is a datafile, you emit key=product and value=-1|1, where -1 is a dummy index and 1 is the review_id.
If the source is the term_index_file, you emit key=product and value=2|0, where 2 is the index of the word 'product' and 0 is a dummy review_id, which we use for sorting (explained later).
Because each word appears only once in the term_index_file, no index entry will be emitted by two different mappers when the file is provided as a normal input to the job.
So 'product', 'very', or any other indexed word in the term_index_file will only be seen by one mapper. Note that this holds only for the term_index_file, not for the datafiles.
Next step:
The Hadoop MapReduce framework, as you might well know, groups by key.
So you will have something like this going to different reducers:
reduce-1: key=product, value=<-1|1, -1|2, 2|0>
reduce-2: key=very, value=<5|0, -1|1, -1|1>
But we have a problem here: the values arrive in no particular order. We want them sorted by the part after '|' (the review_id), i.e. in reduce-1 -> <2|0, -1|1, -1|2> and in reduce-2 -> <5|0, -1|1, -1|1>.
To achieve that you can use a secondary sort implemented with a sort comparator. Search for "Hadoop secondary sort" for details; explaining it here would get lengthy.
In reduce-1, since the values are sorted as above, the first iteration gives us the record with review_id 0 and with it index_id=2, which is then used for the subsequent iterations. In the next two iterations we get review_ids 1 and 2, and we use a counter to keep track of repeated review_ids. A repeated review_id means that the word appeared more than once in the same review. We reset the counter only when we see a different review_id, and at that point we emit the details of the previous review_id for this index_id, something like this:
previous_review_id + "\t" + index_id + "\t" + count
When the loop ends, we'll be left with a single previous_review_id, which we finally emit in the same fashion.
Pseudo code for reducer -
reduce(Text key, Iterable<Text> values, Context context) {
    String index_id = null;
    String previousReview_id = null;
    int count = 0;
    for (Text value : values) {
        String[] split = value.toString().split("\\|");
        ....
        if (!split[0].equals("-1")) {
            // record from term_index_file (sorted first): remember the column index
            index_id = split[0];
        } else if (split[1].equals(previousReview_id)) {
            // same review_id as the previous record: the word occurred again
            count++;
        } else {
            // review_id changed: emit the previous review_id (if any), then reset
            if (previousReview_id != null) {
                context.write(new Text(previousReview_id), new Text(index_id + "\t" + count));
            }
            previousReview_id = split[1]; // resetting with new review_id
            count = 1;                    // resetting count for new review_id
        }
    }
    // the last previousReview_id is still pending,
    // so write it now after the loop completes
    if (previousReview_id != null) {
        context.write(new Text(previousReview_id), new Text(index_id + "\t" + count));
    }
}
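To check the reduce logic without a cluster, it can be simulated locally. The sketch below is plain Python rather than Hadoop code; `reduce_word` is a name made up for the illustration, and the `sorted` call stands in for the secondary sort that Hadoop would perform on the review_id part of the value.

```python
def reduce_word(values):
    """Simulate the reducer for one word.

    values: 'index|review_id' strings, e.g. ['-1|1', '-1|2', '2|0'].
    Returns 'review_id<TAB>index_id<TAB>count' lines.
    """
    # Stand-in for the secondary sort: the term index record
    # (review_id 0) comes first, then the reviews in order.
    parsed = sorted((v.split("|") for v in values), key=lambda p: int(p[1]))
    index_id = None
    previous = None
    count = 0
    out = []
    for index, review_id in parsed:
        if index != "-1":
            index_id = index          # record from the term index file
        elif review_id == previous:
            count += 1                # word repeated within the same review
        else:
            if previous is not None:
                out.append("%s\t%s\t%d" % (previous, index_id, count))
            previous = review_id
            count = 1
    if previous is not None:
        out.append("%s\t%s\t%d" % (previous, index_id, count))
    return out


print(reduce_word(["5|0", "-1|1", "-1|1"]))   # the word 'very'
print(reduce_word(["-1|1", "-1|2", "2|0"]))   # the word 'product'
```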
This job runs with multiple reducers in order to leverage what Hadoop is best known for: performance. As a result, the final output will be scattered, something like the following, deviating from your desired output.
1 4 1
2 1 1
1 5 2
1 2 1
1 3 1
1 6 1
2 2 1
But if you want everything sorted by review_id (as in your desired output), you can write one more job that does that for you, using a single reducer and the output of the previous job as input. The same job can also calculate the header line (2 6 7) and put it at the front of the output.
This is just an approach (or an idea) that I think might help you. You will definitely want to modify it, use a better algorithm, and adapt it in whatever way benefits you.
You can also use composite keys for better clarity than a delimiter such as "|".
I am open to any clarification. Please ask if you think it might be useful to you.
Thank you!
You can load the term index list in Hadoop distributed cache so that it is available to mappers and reducers. For instance, in Hadoop streaming, you can run your job as follows:
$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar \
-input myInputDirs \
-output myOutputDir \
-mapper myMapper.py \
-reducer myReducer.py \
-file myMapper.py \
-file myReducer.py \
-file myTermIndexList.txt
Now in myMapper.py you can load the file myTermIndexList.txt and use it to your purpose. If you give a more detailed description of your input and desired output I can give you more details.
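As a rough sketch of what the core of myMapper.py could look like, assuming the term index file has one "index, tab, word" entry per line: load the index into a dictionary once, then turn each review into "review_id, column, count" triples. The function names and the simple lower-case tokenizer are assumptions for the illustration, not a fixed recipe.

```python
import re
from collections import Counter


def load_term_index(lines):
    """Parse 'index<TAB>word' lines into a word -> column index dict."""
    index = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        number, word = line.split("\t", 1)
        index[word] = int(number)
    return index


def map_review(review_id, text, term_index):
    """Return 'review_id<TAB>column<TAB>count' lines for one review."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    # Keep only known words, ordered by column index.
    known = sorted((term_index[w], n) for w, n in counts.items()
                   if w in term_index)
    return ["%s\t%d\t%d" % (review_id, column, n) for column, n in known]


term_index = load_term_index(["1\tawesome\n", "2\tproduct\n", "3\ta\n",
                              "4\tis\n", "5\tvery\n", "6\texpensive\n"])
for line in map_review("1", "Product A is very, very expensive.", term_index):
    print(line)
```

In the real mapper you would call `load_term_index(open("myTermIndexList.txt"))` once at start-up (the `-file` option ships the file into the task's working directory) and feed `sys.stdin` through `csv.reader` to get the review_id and review text for each row. This works as long as the term index fits in each mapper's memory.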
Approach #1 is not good, but it is a very common idea if you don't have much Hadoop experience. Starting jobs is very expensive. What you want is 2-3 jobs that feed each other to get the desired result. A common solution to similar problems is to have the mapper of job 1 tokenize the input and output pairs, group them in the reducer while executing some kind of calculation, and then feed that into job 2. In the mapper of job 2 you invert the data in some way, and in the reducer you do some other calculation.
I would highly recommend learning more about Hadoop through a training course. Interestingly Cloudera's dev course has a very similar problem to the one you are trying to address. Alternatively or perhaps in addition to a course I would look at "Data-Intensive Text Processing with MapReduce" specifically the sections on "COMPUTING RELATIVE FREQUENCIES" and "Inverted Indexing for Text Retrieval"
http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.
Here are a couple of lines:
1|“#chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“#Zach_Paull: When did green skittles change from lime to green apple? #notafan” #Skittles
1|#dtfcdvEric: #MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|#STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx #SuryaRay
Here is the total data set: http://pastebin.com/eJuEb4eB
I need to build a model that classifies "Apple" (Inc). from the rest.
I'm not looking for a general overview of machine learning, rather I'm looking for actual model in code (Python preferred).
What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.
Essentially, it looks at the content and context of the word, (looking back and forward a few words), to estimate the probability that the word is a named entity.
Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "Vowel-consonant-vowel")
A very good library (GPL) is Stanford's NER
Here's the demo: http://nlp.stanford.edu:8080/ner/
Some sample text to try:
I was eating an apple over at Apple headquarters and I thought about
Apple Martin, the daughter of the Coldplay guy
(the 3class and 4class classifiers get it right)
I would do it as follows:
Split the sentence into words, normalise them, build a dictionary
With each word, store how many times they occurred in tweets about the company, and how many times they appeared in tweets about the fruit - these tweets must be confirmed by a human
When a new tweet comes in, find every word in the tweet in the dictionary, calculate a weighted score - words that are used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score.
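A minimal sketch of that dictionary-and-score idea in Python. The weighting formula and the `smoothing` constant are my own assumptions, chosen so that rare words and words seen in both contexts contribute little; they are not the only reasonable choice.

```python
from collections import Counter


def train(labelled_tweets):
    """labelled_tweets: iterable of (text, is_company) pairs labelled by a human."""
    company, fruit = Counter(), Counter()
    for text, is_company in labelled_tweets:
        target = company if is_company else fruit
        # Count each word once per tweet.
        target.update(set(text.lower().split()))
    return company, fruit


def score(tweet, company, fruit, smoothing=2.0):
    """Positive score leans towards the company, negative towards the fruit."""
    total = 0.0
    for word in set(tweet.lower().split()):
        c, f = company[word], fruit[word]
        # Rare words and words seen in both classes end up near zero.
        total += (c - f) / (c + f + smoothing)
    return total


company, fruit = train([
    ("apple iphone launch", 1),
    ("new iphone from apple", 1),
    ("apple pie recipe", 0),
    ("baked apple pie", 0),
])
print(score("apple iphone", company, fruit))
print(score("apple pie", company, fruit))
```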
I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat-competitive with existing solutions and (crucially) is customisable.
There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!
I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall. Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but with low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code:
https://github.com/ianozsvald/social_media_brand_disambiguator
I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe that this straightforward approach will work.
I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).
I use the basic tokenizer, which is very basic but surprisingly useful. It ignores @ and # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).
The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.
This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #defiance for the TV show (where people also use #defiance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.
I have a series of blog posts describing this project including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).
If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.
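To make the thresholding concrete, here is a small library-free sketch. It assumes you already have, for each tweet, the classifier's probability that it is about the company (in scikit-learn that would typically come from `predict_proba`); the function name and the default threshold of 0.9 are illustrative only.

```python
def confident_labels(probabilities, threshold=0.9):
    """Keep only confident classifications.

    probabilities: P(company) for each tweet.
    Returns 1 (company), 0 (other) or None (not confident enough).
    """
    labels = []
    for p in probabilities:
        if p >= threshold:
            labels.append(1)
        elif p <= 1 - threshold:
            labels.append(0)
        else:
            labels.append(None)  # abstain: this is where recall is traded away
    return labels


print(confident_labels([0.95, 0.6, 0.05, 0.4]))
```

Raising the threshold gives fewer but more reliable labels; lowering it does the opposite.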
Here's a possible algorithmic approach using scikit-learn:
Use a Binary CountVectorizer (I don't think term-counts in short messages add much information as most words occur only once)
Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
Things to consider:
You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
Improve the tokeniser (very easy with scikit-learn) to keep @ and # in tokens, and maybe add a capitalised-brand detector (as user @user2425429 notes)
Consider a non-linear classifier (like @oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
Consider a tweet-specific part-of-speech tagger (in my humble opinion not Stanford's, as @Neil suggests; in my experience it performs poorly on poor Twitter grammar)
Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)
Re. overfitting. In my dataset with 2000 items I have a 10 minute snapshot from Twitter of 'apple' tweets. About 2/3 of the tweets are for Apple Inc, 1/3 for other-apple-uses. I pull out a balanced subset (about 584 rows I think) of each class and do five-fold cross validation for training.
Since I only have a 10 minute time-window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshot, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.
You can do the following:
Make a dict of words containing their count of occurrence in fruit and company related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.
Using enough previous data, we can find out the probability of a word occurring in tweet about apple inc.
Multiply individual probabilities of words to get the probability of the whole tweet.
A simplified example:
p_f = Probability of fruit tweets.
p_w_f = Probability of a word occurring in a fruit tweet.
p_t_f = Combined probability of all words in tweet occurring in a fruit tweet
= p_w1_f * p_w2_f * ...
p_f_t = Probability of fruit given a particular tweet.
p_c, p_w_c, p_t_c, p_c_t are respective values for company.
Laplace smoothing with K=1 is applied to eliminate the zero-frequency problem for new words that are not in our database.
from __future__ import division  # so that "/" is true division on Python 2 too

old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets = total_company_words = total_fruit_words = 0

# Count how often each word occurs in company ("1") and fruit ("0") tweets
for tweet in old_tweets:
    company = old_tweets[tweet]
    for word in tweet.lower().split(" "):
        if word not in known_words:
            known_words[word] = {"company": 0, "fruit": 0}
        if company == "1":
            known_words[word]["company"] += 1
            total_company_words += 1
        else:
            known_words[word]["fruit"] += 1
            total_fruit_words += 1
    if company == "1":
        total_company_tweets += 1
    else:
        total_fruit_tweets += 1
total_tweets = len(old_tweets)

def predict_tweet(new_tweet, K=1):
    # Smoothed prior probability of each class
    p_f = (total_fruit_tweets + K) / (total_tweets + K * 2)
    p_c = (total_company_tweets + K) / (total_tweets + K * 2)
    new_words = new_tweet.lower().split(" ")
    p_t_f = p_t_c = 1
    for word in new_words:
        try:
            wordFound = known_words[word]
        except KeyError:
            # unseen word: zero counts, handled by the smoothing term K
            wordFound = {'fruit': 0, 'company': 0}
        p_w_f = (wordFound['fruit'] + K) / (total_fruit_words + K * len(known_words))
        p_w_c = (wordFound['company'] + K) / (total_company_words + K * len(known_words))
        p_t_f *= p_w_f
        p_t_c *= p_w_c
    # Applying Bayes' rule
    p_f_t = p_f * p_t_f / (p_t_f * p_f + p_t_c * p_c)
    p_c_t = p_c * p_t_c / (p_t_f * p_f + p_t_c * p_c)
    if p_c_t > p_f_t:
        return "Company"
    return "Fruit"
If you don't have an issue using an outside library, I'd recommend scikit-learn since it can probably do this better & faster than anything you could code by yourself. I'd just do something like this:
Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:
def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
    corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
    labels = [1 for _ in range(len(apple_inc_tweets))] + [0 for _ in range(len(apple_fruit_tweets))]
    return (corpus, labels)
The important thing is you end up with two lists that look like this:
(['apple inc tweet i love ios and iphones', 'apple iphones are great', 'apple fruit tweet i love pie', 'apple pie is great'], [1, 1, 0, 0])
The [1, 1, 0, 0] represent the positive and negative labels.
Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:
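from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def train(corpus, labels):
    pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
                     ('tfidf', TfidfTransformer(norm='l2')),
                     ('clf', LinearSVC())])
    # fit (not fit_transform): the final step is a classifier
    pipe.fit(corpus, labels)
    return pipe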
Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.
Finally, you predict the category of tweets:
def predict(pipe, tweet):
    prediction = pipe.predict([tweet])
    return prediction
Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.
Put all those into a class or whatever, and you're done. At least, with this very basic example.
I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.
EDIT: tried to explain the steps in more detail.
Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive bayes classifier with my chosen features.
If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
# coding: utf-8
import nltk
import random
import re

def get_split_sets():
    structured_dataset = get_dataset()
    # random 70/30 split into training and test sets
    train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
    test_set = [x for x in structured_dataset if x not in train_set]
    train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
    test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
    return (train_set, test_set)

def check_accurracy(times=5):
    # average the accuracy over several random splits
    s = 0
    for _ in range(times):
        train_set, test_set = get_split_sets()
        c = nltk.classify.DecisionTreeClassifier.train(train_set)
        # Uncomment to use a naive bayes classifier instead
        #c = nltk.classify.NaiveBayesClassifier.train(train_set)
        s += nltk.classify.accuracy(c, test_set)
    return s / times

def remove_urls(tweet):
    tweet = re.sub(r'https?://[^ ]+', "", tweet)
    tweet = re.sub(r'pic\.twitter\.com/[^ ]+', "", tweet)
    return tweet

def tweet_features(tweet):
    # features: presence of each word bigram and trigram in the tweet
    words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
    features = dict()
    for bigram in nltk.bigrams(words):
        features["hasBigram(%s)" % ",".join(bigram)] = True
    for trigram in nltk.trigrams(words):
        features["hasTrigram(%s)" % ",".join(trigram)] = True
    return features

def get_dataset():
    dataset = """copy dataset in here
"""
    # each line looks like "1|tweet text" or "0|tweet text"
    structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
    return structured_dataset

if __name__ == '__main__':
    print(check_accurracy())
Thank you for the comments thus far. Here is a working solution I prepared with PHP. I'd still be interested in hearing from others a more algorithmic approach to this same solution.
<?php
// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();
// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);
// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
// init
$blnActualClass = substr($v,0,1);
$strTweet = trim(substr($v,2));
// Score Tweet
$intScore = score($strTweet);
// Build Confusion Matrix and Log False Positives & Negatives for Review
if ($intScore > 0) {
if ($blnActualClass == 1) {
// True Positive
$tp++;
} else {
// False Positive
$fp++;
$arrFP[] = $strTweet;
}
} else {
if ($blnActualClass == 1) {
// False Negative
$fn++;
$arrFN[] = $strTweet;
} else {
// True Negative
$tn++;
}
}
}
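// Confusion Matrix and Logging
// Rows are the actual class, columns the predicted class:
// actual 1 -> TP, FN; actual 0 -> FP, TN
echo "
Predicted
1 0
Actual 1 $tp $fn
Actual 0 $fp $tn
";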
if (count($arrFP) > 0) {
echo "\n\nFalse Positives\n";
foreach ($arrFP as $strTweet) {
echo "$strTweet\n";
}
}
if (count($arrFN) > 0) {
echo "\n\nFalse Negatives\n";
foreach ($arrFN as $strTweet) {
echo "$strTweet\n";
}
}
function LoadDictionaryArray() {
$strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;
$arrDictionary = explode("\n", $strDictionary);
foreach ($arrDictionary as $k => $v) {
$arr = explode('|', $v);
$arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
}
return $arrDictionary;
}
function score($str) {
$str = strtolower($str);
$intScore = 0;
foreach (LoadDictionaryArray() as $arrDictionaryItem) {
if (strpos($str,$arrDictionaryItem['term']) !== false) {
$intScore += $arrDictionaryItem['value'];
}
}
return $intScore;
}
?>
The above outputs:
Predicted
1 0
Actual 1 31 1
Actual 0 1 17
False Positives
1|Royals apple #ASGame #mlb # News Corp Building http://instagram.com/p/bBzzgMrrIV/
False Negatives
-1|RT #MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.
In all the examples that you gave, Apple (Inc.) was referred to either as "Apple" or as "apple inc", so a possible approach is to search for:
a capital "A" in Apple
an "inc" after apple
words/phrases like "OS", "operating system", "Mac", "iPhone", ...
or a combination of them
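A rough Python sketch of those checks combined with a simple "any rule matches" policy. The keyword list is just the examples from above, and the function name is made up; note that the capital-"A" rule will also fire on a sentence that merely starts with "Apple", so treat this as a baseline rather than a finished classifier.

```python
import re

# Example terms taken from the list above; extend as needed.
COMPANY_TERMS = ("os", "operating system", "mac", "iphone")


def looks_like_company(tweet):
    # "inc" directly after "apple", in any letter case
    if re.search(r"\bapple\s+inc\b", tweet, re.IGNORECASE):
        return True
    # company-related words or phrases anywhere in the tweet
    lowered = tweet.lower()
    for term in COMPANY_TERMS:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            return True
    # capital "A" in Apple
    return re.search(r"\bApple\b", tweet) is not None


print(looks_like_company("apple inc is searching for people"))
print(looks_like_company("have you tried apple pie shine?"))
```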
To simplify the answers based on Conditional Random Fields a bit: context is huge here. You will want to pick out the cues in those tweets that clearly distinguish Apple the company from apple the fruit. Let me outline a list of features here that might be useful for you to start with. For more information look up noun phrase chunking, and something called BIO labels. See (http://www.cis.upenn.edu/~pereira/papers/crf.pdf)
Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well.
In Natural Language Processing, you are going to want to keep this as general as possible.
Other features to get from surrounding words include the following:
Whether the first character is a capital
Whether the last character in the word is a period
The part of speech of the word (Look up part of speech tagging)
The text itself of the word
I don't advise this, but to give more examples of features specifically for Apple:
WordIs(Apple)
NextWordIs(Inc.)
You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.
Keep in mind that natural language processing is a pipeline-based system. Typically, you break the text into sentences, move on to tokenization, then do part-of-speech tagging or even dependency parsing.
This is all to get you a list of features you can use in your model to identify what you're looking for.
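As a rough illustration of the per-word features described above, here is a minimal Java sketch. The class TokenFeatures, the feature names and the "<START>"/"<END>" markers are all made up for this example, tokens are simply split on whitespace, and the part-of-speech feature is left out because it needs a tagger.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TokenFeatures {
    // Builds the features for tokens[i] using a window of one word on each side.
    static Map<String, String> featuresAt(String[] tokens, int i) {
        String word = tokens[i];
        Map<String, String> features = new LinkedHashMap<>();
        features.put("word", word); // the text itself
        features.put("prevWord", i > 0 ? tokens[i - 1] : "<START>");
        features.put("nextWord", i < tokens.length - 1 ? tokens[i + 1] : "<END>");
        features.put("startsWithCapital",
                String.valueOf(!word.isEmpty() && Character.isUpperCase(word.charAt(0))));
        features.put("endsWithPeriod", String.valueOf(word.endsWith(".")));
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = "Apple Inc. released a new Mac".split("\\s+");
        System.out.println(featuresAt(tokens, 0));
    }
}
```

A sequence model such as a CRF would receive one such feature map per token, together with a label for each token in the training data.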
There's a really good library for processing natural language text in Python called nltk. You should take a look at it.
One strategy you could try is to look at n-grams (groups of words) with the word "apple" in them. Some words are more likely to be used next to "apple" when talking about the fruit, others when talking about the company, and you can use those to classify tweets.
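A minimal sketch of that idea in Java, restricted to bigrams (2-grams) for brevity. The class AppleNgrams and the method bigramsWith are names invented for this example, and the tokenizer (lowercase, split on anything that is not a letter or digit) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AppleNgrams {
    // Returns every pair of adjacent words in which one of the words is `target`.
    static List<String> bigramsWith(String text, String target) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t); // split() can yield an empty first token
        }
        List<String> bigrams = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String a = tokens.get(i);
            String b = tokens.get(i + 1);
            if (a.equals(target) || b.equals(target)) bigrams.add(a + " " + b);
        }
        return bigrams;
    }

    public static void main(String[] args) {
        System.out.println(bigramsWith("I baked an apple pie", "apple"));
        System.out.println(bigramsWith("Apple stock rose today", "apple"));
    }
}
```

Counting how often each of these bigrams appears in company tweets versus fruit tweets would give the per-class frequencies that a classifier can use.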
Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The maximum you'll have to do is to write a loop to pick the best combination of flags. I used it to do supervised speech act classification in emails and the results were up to 95-97% accurate (during 5 fold cross validation!).
And it comes from the makers of LIBSVM and LIBLINEAR whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.
Make an AI filter to distinguish Apple Inc (the company) from apple (the fruit). Since these are tweets, define your training set with a vector of 140 fields, each field being the character written in the tweet at position X (0 to 139). If the tweet is shorter, just use a designated value for blank.
Then build a training set big enough to get good accuracy (how good is up to you). Assign a result value to each tweet: an Apple Inc tweet gets 1 (true) and an apple (fruit) tweet gets 0. This is a case of supervised learning with logistic regression.
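The encoding step described above could look like this in Java. It is only a sketch of the 140-field vector, not of the logistic regression itself; the class name TweetVector and the choice of 0 as the blank value are assumptions.

```java
final class TweetVector {
    static final int LENGTH = 140; // one field per character position (0 to 139)
    static final int BLANK = 0;    // assumed padding value for positions past the end

    // Maps a tweet to a fixed-length vector of character codes.
    static int[] encode(String tweet) {
        int[] vector = new int[LENGTH];
        for (int i = 0; i < LENGTH; i++) {
            vector[i] = i < tweet.length() ? tweet.charAt(i) : BLANK;
        }
        return vector;
    }

    public static void main(String[] args) {
        int[] v = encode("Apple");
        System.out.println(v.length + " fields, first = " + v[0] + ", sixth = " + v[5]);
    }
}
```

Each training example is then one such vector paired with its label, 1 for Apple Inc and 0 for the fruit.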
That is machine learning: it is generally easier to code and performs better. It has to learn from the set you give it, and it's not hardcoded.
I don't know Python, so I cannot write the code for it, but if you want to spend more time on the logic and theory of machine learning, you might want to look at the class I'm following.
Try the Coursera course Machine Learning by Andrew Ng. You will learn machine learning in MATLAB or Octave, but once you get the basics you will be able to write machine learning code in just about any language, provided you understand the simple math (simple in the case of logistic regression).
That is, getting the code from someone won't make you able to understand what is going on in the machine learning code. You might want to invest a couple of hours in the subject to see what is really going on.
I would recommend avoiding the answers that suggest entity recognition, because this task is text classification first and entity recognition second (you can do it without entity recognition at all).
I think the fastest path to results will be spaCy + Prodigy.
spaCy has a well-thought-out model for English, so you don't have to build your own, while Prodigy lets you quickly create training datasets and fine-tune the spaCy model for your needs.
If you have enough samples, you can have a decent model in 1 day.
I'm trying to do document classification using the Weka Java API.
Here is my directory structure of the data files.
+- text_example
|
+- class1
| |
| 3 html files
|
+- class2
| |
| 1 html file
|
+- class3
|
3 html files
I created the 'arff' file with 'TextDirectoryLoader'. Then I applied the StringToWordVector filter to it, with filter.setOutputWordCounts(true).
Below is a sample of the output once the filter is applied. I need to get a few things clarified.
@attribute </form> numeric
@attribute </h1> numeric
.
.
@attribute earth numeric
@attribute easy numeric
This huge list should be the tokenization of the content of the initial HTML files, right?
Then I have:
@data
{1 2,3 2,4 1,11 1,12 7,..............}
{10 4,34 1,37 5,.......}
{2 1,5 6,6 16,...}
{0 class2,34 11,40 15,.....,4900 3,...
{0 class3,1 2,37 3,40 5....
{0 class3,1 2,31 20,32 17......
{0 class3,32 5,42 1,43 10.........
Why is there no class attribute for the first 3 items? (They should have class1.)
What does the leading 0 mean, as in {0 class2,..}, {0 class3..}?
It says, for instance, that in the 3rd HTML file in the class3 folder, the word identified by the integer 32 appears 5 times. How do I get the word (token) referred to by 32?
How do I reduce the dimensionality of the feature vector? Don't we need to make all the feature vectors the same size? (For example, consider only the 100 most frequent terms from the training set and later, when it comes to testing, consider the occurrences of only those 100 terms in the test documents. What happens if we come across a totally new word in the testing phase? Will the classifier just ignore it?)
Am I missing something here? I'm new to Weka.
I would also really appreciate it if someone could explain how the classifier uses the vector created with the StringToWordVector filter. (For example, creating the vocabulary from the training data and dimensionality reduction: do those happen inside the Weka code?)
The huge list of @attribute lines contains all the tokens derived from your input.
Your @data section is in the sparse format; that is, for each attribute, the value is only stated if it is different from zero. For the first three lines, the class attribute is class1, you just can't see it (if it were unknown, you would see a 0 ? at the beginning of the first three lines). Why is that so? Weka internally represents nominal attributes (that includes classes) as doubles and starts counting at zero. So your three classes are internally: class1=0.0, class2=1.0, class3=2.0. As zero values are not stated in the sparse format, you can't see the class in the first three lines. (Also see the section "Sparse ARFF files" on http://www.cs.waikato.ac.nz/ml/weka/arff.html)
To get the word/token represented by index n, you can either count or, if you have the Instances object, invoke attribute(n).name() on it. For that, n starts counting at 0.
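To make the sparse format explained above concrete, here is a small stand-alone Java sketch that decodes one sparse data line by hand. It does not use the Weka API; the class SparseArffLine and its methods are invented for illustration, and it assumes the class attribute sits at index 0 (as in the question) and that no value contains a comma or quotes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SparseArffLine {
    // Parses "{index value,index value,...}" into an index -> value map.
    static Map<Integer, String> parse(String line) {
        Map<Integer, String> values = new LinkedHashMap<>();
        String body = line.trim();
        body = body.substring(1, body.length() - 1).trim(); // drop the braces
        if (body.isEmpty()) return values;
        for (String pair : body.split(",")) {
            String[] parts = pair.trim().split("\\s+", 2);
            values.put(Integer.parseInt(parts[0]), parts[1]);
        }
        return values;
    }

    // An index missing from a sparse line means "zero", i.e. the first nominal label.
    static String classOf(Map<Integer, String> values, int classIndex, String[] labels) {
        return values.getOrDefault(classIndex, labels[0]);
    }

    public static void main(String[] args) {
        String[] labels = {"class1", "class2", "class3"};
        System.out.println(classOf(parse("{1 2,3 2,4 1}"), 0, labels));
        System.out.println(classOf(parse("{0 class3,32 5,42 1,43 10}"), 0, labels));
    }
}
```

The first line has no entry for index 0, so its class falls back to the first label, class1, which is exactly why the class seems to be missing in the question's output.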
To reduce the dimensionality of the feature vector, there are a lot of options. If you only want to keep the 100 most frequent terms, call stringToWordVector.setWordsToKeep(100). Note that this will try to keep 100 words for every class. If you do not want to keep 100 words per class, call stringToWordVector.setDoNotOperateOnPerClassBasis(true). You will get slightly more than 100 if several words have the same frequency, so the 100 is just a kind of target value.
As for new words occurring in the test phase, I think that cannot happen because you have to hand the StringToWordVector all instances before classifying. I am not 100% sure about that, though, as I am using a two-class setup and I let StringToWordVector transform all my instances before telling the classifier anything about them.
In general, I recommend experimenting with the Weka KnowledgeFlow tool to learn how to use the different classes. If you know how to do things there, you can transfer that knowledge to your Java code quite easily.
Hope I was able to help you, although the answer is a bit late.
I am processing a file which I need to split based on the separator.
The following code shows the separators defined for the files I am processing
private static final String component = Character.toString((char) 31);
private static final String data = Character.toString((char) 29);
private static final String segment = Character.toString((char) 28);
Can someone please explain the significance of these specific separators?
Looking at the ASCII codes, these separators are file, group and unit separators. I don't really understand what this means.
Found this here. Cool website!
28 – FS – File separator
The file separator FS is an interesting control code, as it gives us insight in the way that computer technology was organized in the sixties. We are now used to random access media like RAM and magnetic disks, but when the ASCII standard was defined, most data was serial. I am not only talking about serial communications, but also about serial storage like punch cards, paper tape and magnetic tapes. In such a situation it is clearly efficient to have a single control code to signal the separation of two files. The FS was defined for this purpose.
29 – GS – Group separator
Data storage was one of the main reasons for some control codes to get in the ASCII definition. Databases are most of the time setup with tables, containing records. All records in one table have the same type, but records of different tables can be different. The group separator GS is defined to separate tables in a serial data storage system. Note that the word table wasn't used at that moment and the ASCII people called it a group.
30 – RS – Record separator
Within a group (or table) the records are separated with RS or record separator.
31 – US – Unit separator
The smallest data items to be stored in a database are called units in the ASCII definition. We would call them field now. The unit separator separates these fields in a serial data storage environment. Most current database implementations require that fields of most types have a fixed length. Enough space in the record is allocated to store the largest possible member of each field, even if this is not necessary in most cases. This costs a large amount of space in many situations. The US control code allows all fields to have a variable length. If data storage space is limited, as in the sixties, this is a good way to preserve valuable space. On the other hand is serial storage far less efficient than the table driven RAM and disk implementations of modern times. I can't imagine a situation where modern SQL databases are run with the data stored on paper tape or magnetic reels...
The ASCII separator control characters range from 28 to 31 (0x1C to 0x1F):
31 Unit Separator
30 Record Separator
29 Group Separator
28 File Separator
Sample invocation:
char record_separator = 0x1E; // 30 = RS; 0x1F (31) would be the unit separator
String s = "hello" + record_separator + "world";
These characters are control characters. They're not meant to be written or read by humans, but by computers. You should treat them in your program like any other character.
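Tying this back to the three constants in the question, a file using them could be split roughly as follows. This is a sketch under the assumption that segments (28) contain data elements (29), which in turn contain components (31); the class SeparatorSplit and the nesting order are guesses, since the question does not describe the file layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SeparatorSplit {
    private static final String COMPONENT = Character.toString((char) 31); // US
    private static final String DATA = Character.toString((char) 29);      // GS
    private static final String SEGMENT = Character.toString((char) 28);   // FS

    // Splits a message into segments -> data elements -> components.
    static List<List<List<String>>> parse(String message) {
        List<List<List<String>>> segments = new ArrayList<>();
        // Pattern.quote treats the separator literally; -1 keeps trailing empty fields.
        for (String segment : message.split(Pattern.quote(SEGMENT), -1)) {
            List<List<String>> elements = new ArrayList<>();
            for (String element : segment.split(Pattern.quote(DATA), -1)) {
                elements.add(Arrays.asList(element.split(Pattern.quote(COMPONENT), -1)));
            }
            segments.add(elements);
        }
        return segments;
    }

    public static void main(String[] args) {
        String message = "a" + COMPONENT + "b" + DATA + "c" + SEGMENT + "d";
        System.out.println(parse(message)); // [[[a, b], [c]], [[d]]]
    }
}
```

Because these characters practically never occur in ordinary text, the values themselves usually need no escaping, which is the main appeal of using them as delimiters.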