Open Source implementation of a Spaced Repetition Algorithm in Java [closed] - java

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Improve this question
I work on a project where Spaced Repetition is essential, however I am not a specialist on the subject and I am afraid to reinvent the square wheel. My research pointed me two different systems, namely the Leitner system and the SM family of algorithms.
I haven't decided yet which system would best fit into my project. If I was to take a SM orientation, I guess I would try to implement something similar to what Anki uses.
My best option would be to use an existing Java library. It could be quite simple, all I need is to compute the time for the next repetition.
Has anyone heard of such an initiative ?

I haven't looked at Anki's implementation but have you seen this one? quiz-me an SRS in Java.
Basically it goes like this
public static void calcuateInterval(Card card) {
if (card.getEFactor() < 3) {
card.setCount(1);
}
int count = card.getCount();
int interval = 1;
if (count == 2) {
interval = 6;
} else if (count > 2) {
interval = Math.round(card.getInterval() * card.getEFactor());
}
card.setInterval(interval);
}
If you really want Anki's algorithm, look through the source of Anki in Android available in Github. It is GPL though so you might need to buy a license.

I did reinvent the square wheel in my own flashcard app. The algorithm is quite simple: The weight of an item is the product of an age component, a progress component, and an effort component.
Age component
The formula is A(x) = Cn^x, where
x is the time in days since the item was last tested,
C is the value you want when x is zero, and
n is a constant based on how fast you want the value to increase as x increases.
For example, if you want the value to double every five days, n = e^(ln(2/C)/5).
Progress component
The formula is P(x) = Cn^-x, where
x is a number that corresponds to how successful you've been with the item,
C is the value you want when x is zero, and
n is a constant based on how fast you want the value to decay as x increases.
For example, if you want the value to halve every five consecutive successes, n = e^(ln(1/2)/-5).
Effort component
This takes on one of two values:
10 if you found your last recall of the item to be "hard", or
1 otherwise.
The progress is adjusted thus:
New entries start with progress 0.
If you find an answer easy, the item's progress is increased by 1.
If you find an answer hard, the item's progress goes to min(int(previous / 2), previous - 1).
If you get an answer wrong, the item's progress goes to min(-1, previous - 1).
Yes, values can go negative. :)
The app selects the next item to test by making a random selection from all the items, with the probability of selection varying directly with an item's weight.
The specific numbers in the algorithm are tweakable. I've been using my current values for about a year, leading to great success in accumulating and retaining vocabulary for Spanish, German, and Latin.
(Sorry for potato quality of the math expressions. LaTeX isn't allowed here.)

Anki uses the SM2 algorithm. However, SM2 as described by that article has a number of serious flaws. Fortunately, they are simple to fix.
Explaining how to do so would be too lengthy of a subject for this post, so I've written a blog post about it here. There is no need to use an open-source library to do this, as the actual implementation is incredibly simple.

Here is a spaced repetition algorithm that is well documented and easy to understand.
Features
Introduces sub-decks for efficiently learning large decks (Super
useful!)
Intuitive variable names and algorithm parameters. Fully
open-source with human-readable examples.
Easily configurable
parameters to accommodate for different users' memorization
abilities.
Computationally cheap to compute next card. No need to
run a computation on every card in the deck.
https://github.com/Jakobovski/SaneMemo.
Disclaimer: I am the author of SaneMemo.
import random
import datetime
# The number of times needed for the user to get the card correct(EASY) consecutively before removing the card from
# the current sub_deck.
CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_KNOWN = 2
CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_WILL_FORGET = 3
# The number of cards in the sub-deck
SUBDECK_SIZE = 15
REMINDER_RATE = 1.6
class Deck(object):
def __init__(self):
self.cards = []
# Used to make sure we don't display the same card twice
self.last_card = None
def add_card(self, card):
self.cards.append(card)
def get_next_card(self):
self.cards = sorted(self.cards) # Sorted by next_practice_time
sub_deck = self.cards[0:min(SUBDECK_SIZE, len(self.cards))]
card = random.choice(sub_deck)
# In case card == last card lets select again. We don't want to show the same card two times in a row.
while card == self.last_card:
card = random.choice(sub_deck)
self.last_card = card
return card
class Card(object):
def __init__(self, front, back):
self.front = front
self.back = back
self.next_practice_time = datetime.utc.now()
self.consecutive_correct_answer = 0
self.last_time_easy = datetime.utc.now()
def update(self, performance_str):
""" Updates the card after the user has seen it and answered how difficult it was. The user can provide one of
three options: [I_KNOW, KNOW_BUT_WILL_FORGET, DONT_KNOW].
"""
if performance_str == "KNOW_IT":
self.consecutive_correct_answer += 1
if self.consecutive_correct_answer >= CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_KNOWN:
days_since_last_easy = (datetime.utc.now() - self.last_time_easy).days
days_to_next_review = (days_since_last_easy + 2) * REMINDER_RATE
self.next_practice_time = datetime.utc.now() + datetime.time(days=days_to_next_review)
self.last_time_easy = datetime.utc.now()
else:
self.next_practice_time = datetime.utc.now()
elif performance_str == "KNOW_BUT_WILL_FORGET":
self.consecutive_correct_answer += 1
if self.consecutive_correct_answer >= CONSECUTIVE_CORRECT_TO_REMOVE_FROM_SUBDECK_WHEN_WILL_FORGET:
self.next_practice_time = datetime.utc.now() + datetime.time(days=1)
else:
self.next_practice_time = datetime.utc.now()
elif performance_str == "DONT_KNOW":
self.consecutive_correct_answer = 0
self.next_practice_time = datetime.utc.now()
def __cmp__(self, other):
"""Comparator or sorting cards by next_practice_time"""
if hasattr(other, 'next_practice_time'):
return self.number.__cmp__(other.next_practice_time)

Related

Is there a better method for randomizing functions besides Random?

I have 2 strings in an array. I want there to be a 10% chance of one and 90% chance to select the other. Right now I am using:
Random random = new Random();
int x = random.nextInt(100 - 1) + 1;
if (x < 10) {
string = stringArray(0);
} else {
string = stringArray(1);
}
Is this the best way of accomplishing this or is there a better method?
I know it's typically a bad idea to submit a stack overflow response without submitting code, but I really challenge this question of " the best way." People ask this all the time and, while there are established design patterns in software worth knowing, this question almost always can be answered by "it depends."
For example, your pattern looks fine (I might add some comments). You might get a minuscule performance increase by using 1 - 10 instead of 1 - 100, but the things you need to ask yourself are as follows :
If I get hit by a bus, is the person who is going to be working on the application going to know what I was trying to do?
If it isn't intuitive, I should write a comment. Then I should ask myself, "Can I change this code so that a comment isn't necessary?"
Is there an existing library that solves this problem? If so, is it FOSS approved (if applicable) / can I use it?
What is the size of this codebase eventually going to be? Am I making a full program with microservices, a DAO, DTO, Controller, View, and different layers for validation?
Is there an existing convention to solve my problem (either at my company or in general), or is it unique enough that I can take my own spin on it?
Does this follow the DRY principle?
I'm in (apparently) a very small camp on stack overflow that doesn't always believe in universal "bests" for solving code problems. Just remember, programming is only as hard as the problem you're trying to solve.
EDIT
Since people asked, I'd do it like this:
/*
* #author DaveCat
* #version 1.0
* #since 2019-03-9
* Convenience method that calculates 90% odds of A and 10% odds of B.
*
*/
public static String[] calculatesNinetyPercent()
{
Random random = new Random();
int x = random.nextInt(10 - 1 ) + 1
//Option A
if(x <= 9) {
return stringArray(0);
}
else
{
//Option B
return stringArray(1);
}
}
As an aside, one of the common mistakes junior devs make in enterprise level development is excessive comments.This has a javadoc, which is probably overkill, but I'm assuming this is a convenience method you're using in a greater program.
Edit (again)
You guys keep confusing me. This is how you randomly generate between 2 given numbers in Java
One alternative is to use a random float value between 0..1 and comparing it to the probability of the event. If the random value is less than the probability, then the event occurs.
In this specific example, set x to a random float and compare it to 0.1
I like this method because it can be used for probabilities other than percent integers.

Blacklist strings in an array on android

I have an app that I have made for friends to randomize races for a board game, such that a player gets a randomly selected species every time. I want to include a feature in an update that will allow for a blacklist so that certain races cannot be chosen. What would be the best way to go about this? I'll take any and all advice. Much thanks in advance.
Of note: The array consists of strings of names, and I'd like for this to be persistent for as long as they have the app installed or until they change it.
Edit: Apologies for my lack of clarity. I know generally how I need to do it, but what about saving the settings from the settings menu so that upon closing and reopening, blacklisted races are persistent? Even when the app is closed, upon reopening I'd like for the settings to stay the same. That way next time they play the game (weeks later), assuming their tastes haven't changed, they can go to clicking without blacklisting again.
Without any code, it's hard to say for sure how to go about this, because I don't know how you're implementing the rest of the app.
One way I can think of doing this is if you associate an integer with each race, e.g. by using constants:
public static final GIRAFFE = 1;
public static final GOOSE = 2;
Then you could generate random integers to randomly pick each race. If you created a method for the random integer generation, you could then pass certain numbers that would be excluded. You could keep generating a random integer while the integer generated was one of the excluded numbers.
E.g. (Since it's Android I'll code in Java)
// assume min = smallest integer assigned to an animal,
// and max = largest integer assigned to an animal
public static int randomNumber(int ... exclude)
{
int random = min + (int)(Math.random() * ((max - min) + 1));
for (int i = 0; i < exclude.length; i++)
{
if (exclude[i] == random)
random = Min + (int)(Math.random() * ((Max - Min) + 1));
}
return random;
}
I'm a little new to StackOverflow, so let me know if this helps.
What you need to do is persist data. You can use two solutions:
1) The SharedPreferences framework for saving key-value pairs without any effort. See here to see how to save a list Store a List or Set in SharedPreferences
2) use SQL database. (custom solution)
In case 1 you would persist the blacklisted strings and in case 2 you would be best to create a table with all strings and have a boolean as to whether its blacklisted or not.
I recommend the shared preferences one since it should take you 5 min to do whereas, unless you are familiar with databases, the database solution will take you a while to work out.

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.
Here are a couple of lines:
1|“#chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“#Zach_Paull: When did green skittles change from lime to green apple? #notafan” #Skittles
1|#dtfcdvEric: #MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|#STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx #SuryaRay
Here is the total data set: http://pastebin.com/eJuEb4eB
I need to build a model that classifies "Apple" (Inc). from the rest.
I'm not looking for a general overview of machine learning, rather I'm looking for actual model in code (Python preferred).
What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.
Essentially, it looks at the content and context of the word, (looking back and forward a few words), to estimate the probability that the word is a named entity.
Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "Vowel-consonant-vowel")
A very good library (GPL) is Stanford's NER
Here's the demo: http://nlp.stanford.edu:8080/ner/
Some sample text to try:
I was eating an apple over at Apple headquarters and I thought about
Apple Martin, the daughter of the Coldplay guy
(the 3class and 4class classifiers get it right)
I would do it as follows:
Split the sentence into words, normalise them, build a dictionary
With each word, store how many times they occurred in tweets about the company, and how many times they appeared in tweets about the fruit - these tweets must be confirmed by a human
When a new tweet comes in, find every word in the tweet in the dictionary, calculate a weighted score - words that are used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score.
I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat-competitive with existing solutions and (crucially) is customisable.
There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!
I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall (precision and recall). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but with low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code:
https://github.com/ianozsvald/social_media_brand_disambiguator
I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe that this straightforward approach will work.
I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).
I use the basic tokenizer which is very basic but surprisingly useful. It ignores # # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).
The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.
This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.
I have a series of blog posts describing this project including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).
If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.
Here's a possible algorithmic approach using scikit-learn:
Use a Binary CountVectorizer (I don't think term-counts in short messages add much information as most words occur only once)
Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
Things to consider:
You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
Improve the tokeniser (very easy with scikit-learn) to keep # # in tokens, and maybe add a capitalised-brand detector (as user #user2425429 notes)
Consider a non-linear classifier (like #oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
A tweet-specific part of speech tagger (in my humble opinion not Standford's as #Neil suggests - it performs poorly on poor Twitter grammar in my experience)
Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)
Re. overfitting. In my dataset with 2000 items I have a 10 minute snapshot from Twitter of 'apple' tweets. About 2/3 of the tweets are for Apple Inc, 1/3 for other-apple-uses. I pull out a balanced subset (about 584 rows I think) of each class and do five-fold cross validation for training.
Since I only have a 10 minute time-window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshop, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.
You can do the following:
Make a dict of words containing their count of occurrence in fruit and company related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.
Using enough previous data, we can find out the probability of a word occurring in tweet about apple inc.
Multiply individual probabilities of words to get the probability of the whole tweet.
A simplified example:
p_f = Probability of fruit tweets.
p_w_f = Probability of a word occurring in a fruit tweet.
p_t_f = Combined probability of all words in tweet occurring a fruit tweet
= p_w1_f * p_w2_f * ...
p_f_t = Probability of fruit given a particular tweet.
p_c, p_w_c, p_t_c, p_c_t are respective values for company.
A laplacian smoother of value 1 is added to eliminate the problem of zero frequency of new words which are not there in our database.
old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets =total_company_words = total_fruit_words = 0
for tweet in old_tweets:
company = old_tweets[tweet]
for word in tweet.lower().split(" "):
if not word in known_words:
known_words[word] = {"company":0, "fruit":0 }
if company == "1":
known_words[word]["company"] += 1
total_company_words += 1
else:
known_words[word]["fruit"] += 1
total_fruit_words += 1
if company == "1":
total_company_tweets += 1
else:
total_fruit_tweets += 1
total_tweets = len(old_tweets)
def predict_tweet(new_tweet,K=1):
p_f = (total_fruit_tweets+K)/(total_tweets+K*2)
p_c = (total_company_tweets+K)/(total_tweets+K*2)
new_words = new_tweet.lower().split(" ")
p_t_f = p_t_c = 1
for word in new_words:
try:
wordFound = known_words[word]
except KeyError:
wordFound = {'fruit':0,'company':0}
p_w_f = (wordFound['fruit']+K)/(total_fruit_words+K*(len(known_words)))
p_w_c = (wordFound['company']+K)/(total_company_words+K*(len(known_words)))
p_t_f *= p_w_f
p_t_c *= p_w_c
#Applying bayes rule
p_f_t = p_f * p_t_f/(p_t_f*p_f + p_t_c*p_c)
p_c_t = p_c * p_t_c/(p_t_f*p_f + p_t_c*p_c)
if p_c_t > p_f_t:
return "Company"
return "Fruit"
If you don't have an issue using an outside library, I'd recommend scikit-learn since it can probably do this better & faster than anything you could code by yourself. I'd just do something like this:
Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:
def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
labels = [1 for x in xrange(len(apple_inc_tweets))] + [0 for x in xrange(len(apple_fruit_tweets))]
return (corpus, labels)
The important thing is you end up with two lists that look like this:
([['apple inc tweet i love ios and iphones'], ['apple iphones are great'], ['apple fruit tweet i love pie'], ['apple pie is great']], [1, 1, 0, 0])
The [1, 1, 0, 0] represent the positive and negative labels.
Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:
def train(corpus, labels)
pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
('tfidf', TfidfTransformer(norm='l2')),
('clf', LinearSVC()),])
pipe.fit_transform(corpus, labels)
return pipe
Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.
Finally, you predict the category of tweets:
def predict(pipe, tweet):
prediction = pipe.predict([tweet])
return prediction
Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.
Put all those into a class or whatever, and you're done. At least, with this very basic example.
I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.
EDIT: tried to explain the steps in more detail.
Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive bayes classifier with my chosen features.
If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
#coding: utf-8
import nltk
import random
import re
def get_split_sets():
structured_dataset = get_dataset()
train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
test_set = [x for x in structured_dataset if x not in train_set]
train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
return (train_set, test_set)
def check_accurracy(times=5):
s = 0
for _ in xrange(times):
train_set, test_set = get_split_sets()
c = nltk.classify.DecisionTreeClassifier.train(train_set)
# Uncomment to use a naive bayes classifier instead
#c = nltk.classify.NaiveBayesClassifier.train(train_set)
s += nltk.classify.accuracy(c, test_set)
return s / times
def remove_urls(tweet):
tweet = re.sub(r'http:\/\/[^ ]+', "", tweet)
tweet = re.sub(r'pic.twitter.com/[^ ]+', "", tweet)
return tweet
def tweet_features(tweet):
words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
features = dict()
for bigram in nltk.bigrams(words):
features["hasBigram(%s)" % ",".join(bigram)] = True
for trigram in nltk.trigrams(words):
features["hasTrigram(%s)" % ",".join(trigram)] = True
return features
def get_dataset():
dataset = """copy dataset in here
"""
structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
return structured_dataset
if __name__ == '__main__':
print check_accurracy()
Thank you for the comments thus far. Here is a working solution I prepared with PHP. I'd still be interested in hearing from others a more algorithmic approach to this same solution.
<?php
// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();
// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);
// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
// init
$blnActualClass = substr($v,0,1);
$strTweet = trim(substr($v,2));
// Score Tweet
$intScore = score($strTweet);
// Build Confusion Matrix and Log False Positives & Negatives for Review
if ($intScore > 0) {
if ($blnActualClass == 1) {
// True Positive
$tp++;
} else {
// False Positive
$fp++;
$arrFP[] = $strTweet;
}
} else {
if ($blnActualClass == 1) {
// False Negative
$fn++;
$arrFN[] = $strTweet;
} else {
// True Negative
$tn++;
}
}
}
// Confusion Matrix and Logging
echo "
Predicted
1 0
Actual 1 $tp $fp
Actual 0 $fn $tn
";
if (count($arrFP) > 0) {
echo "\n\nFalse Positives\n";
foreach ($arrFP as $strTweet) {
echo "$strTweet\n";
}
}
if (count($arrFN) > 0) {
echo "\n\nFalse Negatives\n";
foreach ($arrFN as $strTweet) {
echo "$strTweet\n";
}
}
function LoadDictionaryArray() {
$strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;
$arrDictionary = explode("\n", $strDictionary);
foreach ($arrDictionary as $k => $v) {
$arr = explode('|', $v);
$arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
}
return $arrDictionary;
}
function score($str) {
$str = strtolower($str);
$intScore = 0;
foreach (LoadDictionaryArray() as $arrDictionaryItem) {
if (strpos($str,$arrDictionaryItem['term']) !== false) {
$intScore += $arrDictionaryItem['value'];
}
}
return $intScore;
}
?>
The above outputs:
Predicted
1 0
Actual 1 31 1
Actual 0 1 17
False Positives
1|Royals apple #ASGame #mlb # News Corp Building http://instagram.com/p/bBzzgMrrIV/
False Negatives
-1|RT #MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.
In all the examples that you gave, Apple(inc) was either referred to as Apple or apple inc, so a possible way could be to search for:
a capital "A" in Apple
an "inc" after apple
words/phrases like "OS", "operating system", "Mac", "iPhone", ...
or a combination of them
To simplify answers based on Conditional Random Fields a bit...context is huge here. You will want to pick out in those tweets that clearly show Apple the company vs apple the fruit. Let me outline a list of features here that might be useful for you to start with. For more information look up noun phrase chunking, and something called BIO labels. See (http://www.cis.upenn.edu/~pereira/papers/crf.pdf)
Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well.
In Natural Language Processing, you are going to want to keep this as general as possible.
Other features to get from surrounding words include the following:
Whether the first character is a capital
Whether the last character in the word is a period
The part of speech of the word (Look up part of speech tagging)
The text itself of the word
I don't advise this, but to give more examples of features specifically for Apple:
WordIs(Apple)
NextWordIs(Inc.)
You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.
Keep in mind that natural language processing is a pipeline based system. Typically, you break things in to sentences, move to tokenization, then do part of speech tagging or even dependency parsing.
This is all to get you a list of features you can use in your model to identify what you're looking for.
There's a really good library for processing natural language text in Python called nltk. You should take a look at it.
One strategy you could try is to look at n-grams (groups of words) with the word "apple" in them. Some words are more likely to be used next to "apple" when talking about the fruit, others when talking about the company, and you can use those to classify tweets.
Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The maximum you'll have to do is to write a loop to pick the best combination of flags. I used it to do supervised speech act classification in emails and the results were up to 95-97% accurate (during 5 fold cross validation!).
And it comes from the makers of LIBSVM and LIBLINEAR whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.
Make an AI filter to distinguish Apple Inc (the company) from apple (the fruit). Since these are tweets, define your training set with a vector of 140 fields, each field being the character written in the tweet at position X (0 to 139). If the tweet is shorter, just give a value for being blank.
Then build a training set big enough to get a good accuracy (subjective to your taste). Assign a result value to each tweet, a Apple Inc tweet get 1 (true) and an apple tweet (fruit) gets 0. It would be a case of supervised learning in a logistic regression.
That is machine learning, is generally easier to code and performs better. It has to learn from the set you give it, and it's not hardcoded.
I don't know Python, so I can not write the code for it, but if you were to take more time for machine learning's logic and theory you might want to look the class I'm following.
Try the Coursera course Machine Learning by Andrew Ng. You will learn machine learning on MATLAB or Octave, but once you get the basics you will be able to write machine learning in about any language if you do understand the simple math (simple in logistic regression).
That is, getting the code from someone won't make you able to understand what is going in the machine learning code. You might want to invest a couple of hours on the subject to see what is really going on.
I would recommend avoiding answers suggesting entity recognition. Because this task is a text-classification first and entity recognition second (you can do it without the entity recognition at all).
I think the fastest path to results will be spacy + prodigy.
Spacy has well thought through model for English language, so you don't have to build your own. While prodigy allows quickly create training datasets and fine tune spacy model for your needs.
If you have enough samples, you can have a decent model in 1 day.

Calculating the optimal move in checkers

I created a checkress game and I would like the computer to calculate the most optimal move.
Here is what I've done so far:
public BoardS calcNextMove(BoardS bs)
{
ArrayList<BoardS>options = calcPossibleOptions(bs);
int max = -1;
int temp;
int bestMove = 0;
for(int k=0;k<options.size();k++)
{
temp = calculateNextMove2(options.get(k));
if(max<temp)
{
max = temp;
bestMove = k;
}
}
return options.get(bestMove);
}
public int calculateNextMove2(BoardS bs)
{
int res = soWhoWon(bs);
if(res == 2) //pc won(which is good so we return 1)
return 1;
if(res == 1)
return 0;
ArrayList<BoardS>options = calcPossibleOptions(bs);
int sum = 0;
for(int k=0;k<options.size();k++)
{
sum += calculateNextMove2(options.get(k));
}
return sum;
}
I keep getting
Exception in thread "AWT-EventQueue-0" java.lang.StackOverflowError
calcPossibleOptions works good, it's a function that returns an array of all the possible options.
BoardS is a clas that represent a game board.
I guess I have to make it more efficient, how?
The recursion on "calculateNextMove2()" will run too deep and this is why you get a StackOverFlow. If you expect the game to run to the end (unless relatively close to an actual win) this may take a very long time indeed. Like with chess e.g. (which I have more experience with) an engine will go maybe 20 moves deep before you will likely go with what it's found thus far. If you ran it from the start of a chess game... it can run for 100s of years on current technology (and still not really be able to say what the winning first move is). Maybe just try 10 or 20 moves deep? (which would still beat most humans - and would still probably classify as "best move"). As with chess you will have the difficulty of assessing a position as good or bad for this to work (often calculated as a combination of material advantage and positional advantage). That is the tricky part. Note: Project Chinook has done what you're looking to achieve - it has "defeated" the game of Checkers (no such thing exists yet for chess EDIT: Magnus Carslen has "deafeted" the game for all intents and purposes :D).
see: Alpha-beta pruning
(it may help somewhat)
Also here is an (oldish) paper I saw on the subject:
Graham Owen 1997 Search Algorithms and Game Playing
(it may be useful)
Also note that returning the first move that results in a "win" is 'naive' - because it may be a totally improbably line of moves where the opponent will gain an advantage if they don't play in a very particular win - e.g. Playing for Fools Mate or Scholars Mate in chess... (quick but very punishable)
You're going to need to find an alternative to MinMax, even with alpha-beta pruning most likely, as the board is too large making too many possible moves, and counter moves. This leads to the stack overflow.
I was fairly surprised to see that making the full decision tree for Tic-Tac-Toe didn't overflow myself, but I'm afraid I don't know enough about AI planning, or another algorithm for doing these problems, to help you beyond that.

How to reduce an algorithm into smaller parts so I can scale it?

I have updated this question(found last question not clear, if you want to refer to it check out the reversion history). The current answers so far do not work because I failed to explain my question clearly(sorry, second attempt).
Goal:
Trying to take a set of numbers(pos or neg, thus needs bounds to limit growth of specific variable) and find their linear combinations that can be used to get to a specific sum. For example, to get to a sum of 10 using [2,4,5] we get:
5*2 + 0*4 + 0*5 = 10
3*2 + 1*4 + 0*5 = 10
1*2 + 2*4 + 0*5 = 10
0*2 + 0*4 + 2*5 = 10
How can I create an algo that is scalable for large number of variables and target_sums? I can write the code on my own if an algo is given, but if there's a library avail, I'm fine with any library but prefer to use java.
One idea would be to break out of the loop once you set T[z][i] to true, since you are only basically modifying T[z][i] here, and if it does become true, it won't ever be modified again.
for i = 1 to k
for z = 0 to sum:
for j = z-x_i to 0:
if(T[j][i-1]):
T[z][i]=true;
break;
EDIT2: Additionally, if I am getting it right, T[z][i] depends on the array T[z-x_i..0][i-1]. T[z+1][i] depends on T[z+1-x_i..0][i-1]. So once you know if T[z][i] is true, you only need to check one additional element (T[z+1-x_i][i-1]) to know if T[z+1][i-1] will be true.
Let's say you represent the fact whether T[z][i] was updated by a variable changed. Then, you can simply say that T[z][i] = changed && T[z-1][i]. So you should be done in two loops instead of three. This should make it much faster.
Now, to scale it - Now that T[z,i] depends only on T[z-1,i] and T[z-1-x_i,i-1], so to populate T[z,i], you do not need to wait until the whole (i-1)th column is populated. You can start working on T[z,i] as soon as the required values are populated. I can't implement it without knowing the details, but you can try this approach.
I take it this is something like unbounded knapsack? You can dispense with the loop over c entirely.
for i = 1 to k
for z = 0 to sum
T[z][i] = z >= x_i cand (T[z - x_i][i - 1] or T[z - x_i][i])
Based on the original example data you gave (linear combination of terms) and your answer to my question in the comments section (there are bounds), would a brute force approach not work?
c0x0 + c1x1 + c2x2 +...+ cnxn = SUM
I'm guessing I'm missing something important but here it is anyway:
Brute Force Divide and Conquer:
main controller generates coefficients for say, half of the terms (or however many may make sense)
it then sends each partial set of fixed coefficients to a work queue
a worker picks up a partial set of fixed coefficients and proceeds to brute force its own way through the remaining combinations
it doesn't use much memory at all as it works sequentially on each valid set of coefficients
could be optimized to ignore equivalent combinations and probably many other ways
Pseudocode for Multiprocessing
class Controller
work_queue = Queue
solution_queue = Queue
solution_sets = []
create x number of workers with access to work_queue and solution_queue
#say for 2000 terms:
for partial_set in coefficient_generator(start_term=0, end_term=999):
if worker_available(): #generate just in time
push partial set onto work_queue
while solution_queue:
add any solutions to solution_sets
#there is an efficient way to do this type of polling but I forget
class Worker
while true: #actually stops when a stop work token is received
get partial_set from the work queue
for remaining_set in coefficient_generator(start_term=1000, end_term=1999):
combine the two sets (partial_set.extend(remaining_set))
if is_solution(full_set):
push full_set onto the solution queue

Categories

Resources