Encog: weights keep increasing

Encog: weights keep increasing - java

I am trying to train a neural network with Encog library.
Dataset (~7000 examples) before splitting (into training (60%), cross-validation (20%) and testing (20%)) is linearly normalised so that it looks like this:
Min=-1.000000; Max=1.000000; Average=-0.077008
Target (ideal) dataset (also linearly normalised) looks like this:
Min=0.201540; Max=0.791528; Average=0.477080
I initialie network like this:
mNetwork = new BasicNetwork();
mNetwork.addLayer(new BasicLayer(null, false, trainingDataSet.getInputSize()));
mNetwork.addLayer(new BasicLayer(new ActivationSigmoid(), true, numberOfNeurons));
mNetwork.addLayer(new BasicLayer(new ActivationSigmoid(), false, trainingDataSet.getIdealSize()));
mNetwork.getStructure().finalizeStructure();
mNetwork.reset();
I use ResilientPropagation trainer (tried Backpropagationas well):
ResilientPropagation training = new ResilientPropagation(mNetwork, mTrainingDataSet);
for (int i = 0; i < mNumberOfIterations; ++i) {
training.iteration();
result.trainingErrors[i] = mNetwork.calculateError(mTrainingDataSet);
result.validationErrors[i] = mNetwork.calculateError(mValidationDataSet);
System.out.println(String.format("Iteration #%d: error=%.8f", i, training.getError()));
}
training.finishTraining();
During the process of training error reported by the trainer is generally decreasing. After finishing training I dump weights:
0.04274211002929323,-0.5481902707068103,0.28978635361541294,-0.203635994176051,22965.18656660482,22964.992410871928,22966.23882308963,22966.355722230965,22965.036733143017,22964.894030965166,22966.002332259202,22965.177650526788,22966.009842504238,22965.971560546248,22966.257180159628,22966.234150681423,-21348.311232865744,-21640.843082085466,-21057.13217475862,-21347.52051343582,-21347.988714647887,-21641.161098510198,-21057.27275747668,-21348.784123049118,-21347.719149090022,-21639.773689115867,-21057.095487328377,-21348.269878600076,22800.304816865206,23090.894751729396,22799.39388588725,22799.72408290791,22800.249806096508,22799.19823789763,22799.85510732227,22799.99965531053,22799.574773588192,22799.57945236908,22799.12542315293,22799.523065957797
They usually are either very large or very small. With sigmoid it ends up with predictions converging to some number, for instance weights (got after 500 iterations) above give me:
Min=0.532179; Max=0.532179; Average=0.532179
There seems to be something wrong with the network or training configurations. If my network suffered from low variance, at least it would generate a results that are withing the target range. If it suffered from high variance, it would match the target. And now, it just misses the target entirely.
How come error is decreasing and getting pretty low even if predictions are way off? Anyones sees obvious mistakes in my example above? I'm not very experienced with neural networks yet.

The problem, in my opinion is that you are normalizing between -1 and 1 and use an activation function sigmoid that works with numbers between 0 and 1.
I would suggest you to normalize between 0.1 and 0.9 or use a tanh activation function an retry.
I also would use k-fold cross validation see here http://www.heatonresearch.com/node/2000
Vincenzo

Related

How to find the optimal K for clustering? [duplicate]

I've been studying about k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?

You can maximize the Bayesian Information Criterion (BIC):
BIC(C | X) = L(X | C) - (p / 2) * log n
where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset.
See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.
Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, p. 59-72, 1999.
Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.

Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).
As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.

Yes, you can find the best number of clusters using Elbow method, but I found it troublesome to find the value of clusters from elbow graph using script. You can observe the elbow graph and find the elbow point yourself, but it was lot of work finding it from script.
So another option is to use Silhouette Method to find it. The result from Silhouette completely comply with result from Elbow method in R.
Here`s what I did.
#Dataset for Clustering
n = 150
g = 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata<-d
#Plot 3X2 plots
attach(mtcars)
par(mfrow=c(3,2))
#Plot the original dataset
plot(mydata$x,mydata$y,main="Original Dataset")
#Scree plot to deterine the number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
#Silhouette analysis for determining the number of clusters
library(fpc)
asw <- numeric(20)
for (k in 2:20)
asw[[k]] <- pam(mydata, k) $ silinfo $ avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(d, k.best))
# K-Means Cluster Analysis
fit <- kmeans(mydata,k.best)
mydata
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid=fit$cluster)
plot(mydata$x,mydata$y, col = fit$cluster, main="K-means Clustering results")
Hope it helps!!

May be someone beginner like me looking for code example. information for silhouette_score
is available here.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
range_n_clusters = [2, 3, 4] # clusters range you want to select
dataToFit = [[12,23],[112,46],[45,23]] # sample data
best_clusters = 0 # best cluster number which you will get
previous_silh_avg = 0.0
for n_clusters in range_n_clusters:
clusterer = KMeans(n_clusters=n_clusters)
cluster_labels = clusterer.fit_predict(dataToFit)
silhouette_avg = silhouette_score(dataToFit, cluster_labels)
if silhouette_avg > previous_silh_avg:
previous_silh_avg = silhouette_avg
best_clusters = n_clusters
# Final Kmeans for best_clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)

Look at this paper, "Learning the k in k-means" by Greg Hamerly, Charles Elkan. It uses a Gaussian test to determine the right number of clusters. Also, the authors claim that this method is better than BIC which is mentioned in the accepted answer.

There is something called Rule of Thumb. It says that the number of clusters can be calculated by
k = (n/2)^0.5
where n is the total number of elements from your sample.
You can check the veracity of this information on the following paper:
http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf
There is also another method called G-means, where your distribution follows a Gaussian Distribution or Normal Distribution.
It consists of increasing k until all your k groups follow a Gaussian Distribution.
It requires a lot of statistics but can be done.
Here is the source:
http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf
I hope this helps!

If you don't know the numbers of the clusters k to provide as parameter to k-means so there are four ways to find it automaticaly:
G-means algortithm: it discovers the number of clusters automatically using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), having four centers in total. If G-means does not accept these four centers then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimation of the number of clusters you will get after grouping your instances. Notice that an inconvenient choice for the "k" parameter might give you wrong results. The parallel version of g-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: a new algorithm that efficiently, searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it permits to execute k-means by scanning the whole data once and it finds automaticaly the optimal number of k. Spark implements it.
MeanShift algorithm: it is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. Sources: source1, source2, source3

First build a minimum spanning tree of your data.
Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K,
and take the knee of the curve.
This works only for Single-linkage_clustering,
but for that it's fast and easy. Plus, MSTs make good visuals.
See for example the MST plot under
stats.stackexchange visualization software for clustering.

I'm surprised nobody has mentioned this excellent article:
http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
After following several other suggestions I finally came across this article while reading this blog:
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
After that I implemented it in Scala, an implementation which for my use cases provide really good results. Here's code:
import breeze.linalg.DenseVector
import Kmeans.{Features, _}
import nak.cluster.{Kmeans => NakKmeans}
import scala.collection.immutable.IndexedSeq
import scala.collection.mutable.ListBuffer
/*
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
*/
class Kmeans(features: Features) {
def fkAlphaDispersionCentroids(k: Int, dispersionOfKMinus1: Double = 0d, alphaOfKMinus1: Double = 1d): (Double, Double, Double, Features) = {
if (1 == k || 0d == dispersionOfKMinus1) (1d, 1d, 1d, Vector.empty)
else {
val featureDimensions = features.headOption.map(_.size).getOrElse(1)
val (dispersion, centroids: Features) = new NakKmeans[DenseVector[Double]](features).run(k)
val alpha =
if (2 == k) 1d - 3d / (4d * featureDimensions)
else alphaOfKMinus1 + (1d - alphaOfKMinus1) / 6d
val fk = dispersion / (alpha * dispersionOfKMinus1)
(fk, alpha, dispersion, centroids)
}
}
def fks(maxK: Int = maxK): List[(Double, Double, Double, Features)] = {
val fadcs = ListBuffer[(Double, Double, Double, Features)](fkAlphaDispersionCentroids(1))
var k = 2
while (k <= maxK) {
val (fk, alpha, dispersion, features) = fadcs(k - 2)
fadcs += fkAlphaDispersionCentroids(k, dispersion, alpha)
k += 1
}
fadcs.toList
}
def detK: (Double, Features) = {
val vals = fks().minBy(_._1)
(vals._3, vals._4)
}
}
object Kmeans {
val maxK = 10
type Features = IndexedSeq[DenseVector[Double]]
}

If you use MATLAB, any version since 2013b that is, you can make use of the function evalclusters to find out what should the optimal k be for a given dataset.
This function lets you choose from among 3 clustering algorithms - kmeans, linkage and gmdistribution.
It also lets you choose from among 4 clustering evaluation criteria - CalinskiHarabasz, DaviesBouldin, gap and silhouette.

I used the solution I found here : http://efavdb.com/mean-shift/ and it worked very well for me :
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets.samples_generator import make_blobs
import matplotlib.pyplot as plt
from itertools import cycle
from PIL import Image
#%% Generate sample data
centers = [[1, 1], [-.75, -1], [1, -1], [-3, 2]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)
#%% Compute clustering with MeanShift
# The bandwidth can be automatically estimated
bandwidth = estimate_bandwidth(X, quantile=.1,
n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = labels.max()+1
#%% Plot result
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
my_members = labels == k
cluster_center = cluster_centers[k]
plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
plt.plot(cluster_center[0], cluster_center[1],
'o', markerfacecolor=col,
markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

My idea is to use Silhouette Coefficient to find the optimal cluster number(K). Details explanation is here.

Assuming you have a matrix of data called DATA, you can perform partitioning around medoids with estimation of number of clusters (by silhouette analysis) like this:
library(fpc)
maxk <- 20 # arbitrary here, you can set this to whatever you like
estimatedK <- pamk(dist(DATA), krange=1:maxk)$nc

One possible answer is to use Meta Heuristic Algorithm like Genetic Algorithm to find k.
That's simple. you can use random K(in some range) and evaluate the fit function of Genetic Algorithm with some measurment like Silhouette
And Find best K base on fit function.
https://en.wikipedia.org/wiki/Silhouette_(clustering)

km=[]
for i in range(num_data.shape[1]):
kmeans = KMeans(n_clusters=ncluster[i])#we take number of cluster bandwidth theory
ndata=num_data[[i]].dropna()
ndata['labels']=kmeans.fit_predict(ndata.values)
cluster=ndata
co=cluster.groupby(['labels'])[cluster.columns[0]].count()#count for frequency
me=cluster.groupby(['labels'])[cluster.columns[0]].median()#median
ma=cluster.groupby(['labels'])[cluster.columns[0]].max()#Maximum
mi=cluster.groupby(['labels'])[cluster.columns[0]].min()#Minimum
stat=pd.concat([mi,ma,me,co],axis=1)#Add all column
stat['variable']=stat.columns[1]#Column name change
stat.columns=['Minimum','Maximum','Median','count','variable']
l=[]
for j in range(ncluster[i]):
n=[mi.loc[j],ma.loc[j]]
l.append(n)
stat['Class']=l
stat=stat.sort(['Minimum'])
stat=stat[['variable','Class','Minimum','Maximum','Median','count']]
if missing_num.iloc[i]>0:
stat.loc[ncluster[i]]=0
if stat.iloc[ncluster[i],5]==0:
stat.iloc[ncluster[i],5]=missing_num.iloc[i]
stat.iloc[ncluster[i],0]=stat.iloc[0,0]
stat['Percentage']=(stat[[5]])*100/count_row#Freq PERCENTAGE
stat['Cumulative Percentage']=stat['Percentage'].cumsum()
km.append(stat)
cluster=pd.concat(km,axis=0)## see documentation for more info
cluster=cluster.round({'Minimum': 2, 'Maximum': 2,'Median':2,'Percentage':2,'Cumulative Percentage':2})

Another approach is using Self Organizing Maps (SOP) to find optimal number of clusters. The SOM (Self-Organizing Map) is an unsupervised neural
network methodology, which needs only the input is used to
clustering for problem solving. This approach used in a paper about customer segmentation.
The reference of the paper is
Abdellah Amine et al., Customer Segmentation Model in E-commerce Using
Clustering Techniques and LRFM Model: The Case
of Online Stores in Morocco, World Academy of Science, Engineering and Technology
International Journal of Computer and Information Engineering
Vol:9, No:8, 2015, 1999 - 2010

Hi I'll make it simple and straight to explain, I like to determine clusters using 'NbClust' library.
Now, how to use the 'NbClust' function to determine the right number of clusters: You can check the actual project in Github with actual data and clusters - Extention to this 'kmeans' algorithm also performed using the right number of 'centers'.
Github Project Link: https://github.com/RutvijBhutaiya/Thailand-Customer-Engagement-Facebook

You can choose the number of clusters by visually inspecting your data points, but you will soon realize that there is a lot of ambiguity in this process for all except the simplest data sets. This is not always bad, because you are doing unsupervised learning and there's some inherent subjectivity in the labeling process. Here, having previous experience with that particular problem or something similar will help you choose the right value.
If you want some hint about the number of clusters that you should use, you can apply the Elbow method:
First of all, compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid. Mathematically:
SSE=∑Ki=1∑x∈cidist(x,ci)2
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE decreases abruptly. This produces an "elbow effect" in the graph, as you can see in the following picture:
In this case, k=6 is the value that the Elbow method has selected. Take into account that the Elbow method is an heuristic and, as such, it may or may not work well in your particular case. Sometimes, there are more than one elbow, or no elbow at all. In those situations you usually end up calculating the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.

I worked on a Python package kneed (Kneedle algorithm). It finds cluster numbers dynamically as the point where the curve starts to flatten. Given a set of x and y values, kneed will return the knee point of the function. The knee joint is the point of maximum curvature. Here is the sample code.
y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
5467.352265375117, 5395.4493783888756, 5345.3459908298091,
5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
5164.9617535255456]
x = range(1, len(y)+1)
from kneed import KneeLocator
kn = KneeLocator(x, y, curve='convex', direction='decreasing')
print(kn.knee)

Leave here a pretty cool gif from Codecademy course:
The K-Means algorithm:
Place k random centroids for the initial clusters.
Assign data samples to the nearest centroid.
Update centroids based on the above-assigned data samples.
Btw, its not a explanation of full algorithm, its just helpful vizualization

how to evaluate my cluster algorithm

I start clustering using simple k-mean clustering in weka
after the clustering this result show
Number of iterations: 9
Within cluster sum of squared errors: 570.1974952009115
my questions:
the number of sum of squared errors is huge does this mean my number of cluster is wrong ? and how to define the optimistic number of clusters ?
how to split the data into training and test set to evaluate the performance ? and how to know the right percentage ?
how to measure the SSB

1.1 In k-means it's you who decides how many clusters to pick. You probably know this already.
1.2 In k-means there is no optimal number of clusters as in "global maximum of a function graph". You decide with respect to your business problem. See also "elbow method" for a semi-empirical procedure that seldom works in practice.
1.3 You might have outliers in your data which make the sum of squares large for any clustering operation. The outliers are always far away from your cluster centers, no matter how many clusters you pick .
2.1 There is no "optimal" percentage split.
2.2 You could use visualization to check if there is any overlap in the clusters. It's also more understandable for your audience to see the "decision boundaries".
3.1 What is SSB?

I'm attaching code for the Elbow method in case anyone wants to do a quick test.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
X = pd.read_csv("data.csv")
X = data.select_dtypes(np.number) #If all your data is numerical, you dont have to do this
from sklearn.cluster import KMeans
wcss = []
for i in range(1,50):
model = KMeans(n_clusters = i, init = 'k-means++',
max_iter=300,n_init=10,random_state=0)
model.fit(X)
wcss.append(model.inertia_)
plt.figure(figsize=(10,7))
plt.plot(range(1,50), wcss)
plt.title("Elbow Method")
plt.xlabel("No. of clusters")
plt.ylabel("WCSS")
If you have time and patience, you can make an outer loop to loop around random state and record it.

How can I build a model to distinguish tweets about Apple (Inc.) from tweets about apple (fruit)?

See below for 50 tweets about "apple." I have hand labeled the positive matches about Apple Inc. They are marked as 1 below.
Here are a couple of lines:
1|“#chrisgilmer: Apple targets big business with new iOS 7 features http://bit.ly/15F9JeF ”. Finally.. A corp iTunes account!
0|“#Zach_Paull: When did green skittles change from lime to green apple? #notafan” #Skittles
1|#dtfcdvEric: #MaroneyFan11 apple inc is searching for people to help and tryout all their upcoming tablet within our own net page No.
0|#STFUTimothy have you tried apple pie shine?
1|#SuryaRay #India Microsoft to bring Xbox and PC games to Apple, Android phones: Report: Microsoft Corp... http://dlvr.it/3YvbQx #SuryaRay
Here is the total data set: http://pastebin.com/eJuEb4eB
I need to build a model that classifies "Apple" (Inc). from the rest.
I'm not looking for a general overview of machine learning, rather I'm looking for actual model in code (Python preferred).

What you are looking for is called Named Entity Recognition. It is a statistical technique that (most commonly) uses Conditional Random Fields to find named entities, based on having been trained to learn things about named entities.
Essentially, it looks at the content and context of the word, (looking back and forward a few words), to estimate the probability that the word is a named entity.
Good software can look at other features of words, such as their length or shape (like "Vcv" if it starts with "Vowel-consonant-vowel")
A very good library (GPL) is Stanford's NER
Here's the demo: http://nlp.stanford.edu:8080/ner/
Some sample text to try:
I was eating an apple over at Apple headquarters and I thought about
Apple Martin, the daughter of the Coldplay guy
(the 3class and 4class classifiers get it right)

I would do it as follows:
Split the sentence into words, normalise them, build a dictionary
With each word, store how many times they occurred in tweets about the company, and how many times they appeared in tweets about the fruit - these tweets must be confirmed by a human
When a new tweet comes in, find every word in the tweet in the dictionary, calculate a weighted score - words that are used frequently in relation to the company would get a high company score, and vice versa; words used rarely, or used with both the company and the fruit, would not have much of a score.

I have a semi-working system that solves this problem, open sourced using scikit-learn, with a series of blog posts describing what I'm doing. The problem I'm tackling is word-sense disambiguation (choosing one of multiple word sense options), which is not the same as Named Entity Recognition. My basic approach is somewhat-competitive with existing solutions and (crucially) is customisable.
There are some existing commercial NER tools (OpenCalais, DBPedia Spotlight, and AlchemyAPI) that might give you a good enough commercial result - do try these first!
I used some of these for a client project (I consult using NLP/ML in London), but I wasn't happy with their recall (precision and recall). Basically they can be precise (when they say "This is Apple Inc" they're typically correct), but with low recall (they rarely say "This is Apple Inc" even though to a human the tweet is obviously about Apple Inc). I figured it'd be an intellectually interesting exercise to build an open source version tailored to tweets. Here's the current code:
https://github.com/ianozsvald/social_media_brand_disambiguator
I'll note - I'm not trying to solve the generalised word-sense disambiguation problem with this approach, just brand disambiguation (companies, people, etc.) when you already have their name. That's why I believe that this straightforward approach will work.
I started this six weeks ago, and it is written in Python 2.7 using scikit-learn. It uses a very basic approach. I vectorize using a binary count vectorizer (I only count whether a word appears, not how many times) with 1-3 n-grams. I don't scale with TF-IDF (TF-IDF is good when you have a variable document length; for me the tweets are only one or two sentences, and my testing results didn't show improvement with TF-IDF).
I use the basic tokenizer which is very basic but surprisingly useful. It ignores # # (so you lose some context) and of course doesn't expand a URL. I then train using logistic regression, and it seems that this problem is somewhat linearly separable (lots of terms for one class don't exist for the other). Currently I'm avoiding any stemming/cleaning (I'm trying The Simplest Possible Thing That Might Work).
The code has a full README, and you should be able to ingest your tweets relatively easily and then follow my suggestions for testing.
This works for Apple as people don't eat or drink Apple computers, nor do we type or play with fruit, so the words are easily split to one category or the other. This condition may not hold when considering something like #definance for the TV show (where people also use #definance in relation to the Arab Spring, cricket matches, exam revision and a music band). Cleverer approaches may well be required here.
I have a series of blog posts describing this project including a one-hour presentation I gave at the BrightonPython usergroup (which turned into a shorter presentation for 140 people at DataScienceLondon).
If you use something like LogisticRegression (where you get a probability for each classification) you can pick only the confident classifications, and that way you can force high precision by trading against recall (so you get correct results, but fewer of them). You'll have to tune this to your system.
Here's a possible algorithmic approach using scikit-learn:
Use a Binary CountVectorizer (I don't think term-counts in short messages add much information as most words occur only once)
Start with a Decision Tree classifier. It'll have explainable performance (see Overfitting with a Decision Tree for an example).
Move to logistic regression
Investigate the errors generated by the classifiers (read the DecisionTree's exported output or look at the coefficients in LogisticRegression, work the mis-classified tweets back through the Vectorizer to see what the underlying Bag of Words representation looks like - there will be fewer tokens there than you started with in the raw tweet - are there enough for a classification?)
Look at my example code in https://github.com/ianozsvald/social_media_brand_disambiguator/blob/master/learn1.py for a worked version of this approach
Things to consider:
You need a larger dataset. I'm using 2000 labelled tweets (it took me five hours), and as a minimum you want a balanced set with >100 per class (see the overfitting note below)
Improve the tokeniser (very easy with scikit-learn) to keep # # in tokens, and maybe add a capitalised-brand detector (as user #user2425429 notes)
Consider a non-linear classifier (like #oiez's suggestion above) when things get harder. Personally I found LinearSVC to do worse than logistic regression (but that may be due to the high-dimensional feature space that I've yet to reduce).
A tweet-specific part of speech tagger (in my humble opinion not Standford's as #Neil suggests - it performs poorly on poor Twitter grammar in my experience)
Once you have lots of tokens you'll probably want to do some dimensionality reduction (I've not tried this yet - see my blog post on LogisticRegression l1 l2 penalisation)
Re. overfitting. In my dataset with 2000 items I have a 10 minute snapshot from Twitter of 'apple' tweets. About 2/3 of the tweets are for Apple Inc, 1/3 for other-apple-uses. I pull out a balanced subset (about 584 rows I think) of each class and do five-fold cross validation for training.
Since I only have a 10 minute time-window I have many tweets about the same topic, and this is probably why my classifier does so well relative to existing tools - it will have overfit to the training features without generalising well (whereas the existing commercial tools perform worse on this snapshop, but more reliably across a wider set of data). I'll be expanding my time window to test this as a subsequent piece of work.

You can do the following:
Make a dict of words containing their count of occurrence in fruit and company related tweets. This can be achieved by feeding it some sample tweets whose inclination we know.
Using enough previous data, we can find out the probability of a word occurring in tweet about apple inc.
Multiply individual probabilities of words to get the probability of the whole tweet.
A simplified example:
p_f = Probability of fruit tweets.
p_w_f = Probability of a word occurring in a fruit tweet.
p_t_f = Combined probability of all words in tweet occurring a fruit tweet
= p_w1_f * p_w2_f * ...
p_f_t = Probability of fruit given a particular tweet.
p_c, p_w_c, p_t_c, p_c_t are respective values for company.
A laplacian smoother of value 1 is added to eliminate the problem of zero frequency of new words which are not there in our database.
old_tweets = {'apple pie sweet potatoe cake baby https://vine.co/v/hzBaWVA3IE3': '0', ...}
known_words = {}
total_company_tweets = total_fruit_tweets =total_company_words = total_fruit_words = 0
for tweet in old_tweets:
company = old_tweets[tweet]
for word in tweet.lower().split(" "):
if not word in known_words:
known_words[word] = {"company":0, "fruit":0 }
if company == "1":
known_words[word]["company"] += 1
total_company_words += 1
else:
known_words[word]["fruit"] += 1
total_fruit_words += 1
if company == "1":
total_company_tweets += 1
else:
total_fruit_tweets += 1
total_tweets = len(old_tweets)
def predict_tweet(new_tweet,K=1):
p_f = (total_fruit_tweets+K)/(total_tweets+K*2)
p_c = (total_company_tweets+K)/(total_tweets+K*2)
new_words = new_tweet.lower().split(" ")
p_t_f = p_t_c = 1
for word in new_words:
try:
wordFound = known_words[word]
except KeyError:
wordFound = {'fruit':0,'company':0}
p_w_f = (wordFound['fruit']+K)/(total_fruit_words+K*(len(known_words)))
p_w_c = (wordFound['company']+K)/(total_company_words+K*(len(known_words)))
p_t_f *= p_w_f
p_t_c *= p_w_c
#Applying bayes rule
p_f_t = p_f * p_t_f/(p_t_f*p_f + p_t_c*p_c)
p_c_t = p_c * p_t_c/(p_t_f*p_f + p_t_c*p_c)
if p_c_t > p_f_t:
return "Company"
return "Fruit"

If you don't have an issue using an outside library, I'd recommend scikit-learn since it can probably do this better & faster than anything you could code by yourself. I'd just do something like this:
Build your corpus. I did the list comprehensions for clarity, but depending on how your data is stored you might need to do different things:
def corpus_builder(apple_inc_tweets, apple_fruit_tweets):
corpus = [tweet for tweet in apple_inc_tweets] + [tweet for tweet in apple_fruit_tweets]
labels = [1 for x in xrange(len(apple_inc_tweets))] + [0 for x in xrange(len(apple_fruit_tweets))]
return (corpus, labels)
The important thing is you end up with two lists that look like this:
([['apple inc tweet i love ios and iphones'], ['apple iphones are great'], ['apple fruit tweet i love pie'], ['apple pie is great']], [1, 1, 0, 0])
The [1, 1, 0, 0] represent the positive and negative labels.
Then, you create a Pipeline! Pipeline is a scikit-learn class that makes it easy to chain text processing steps together so you only have to call one object when training/predicting:
def train(corpus, labels)
pipe = Pipeline([('vect', CountVectorizer(ngram_range=(1, 3), stop_words='english')),
('tfidf', TfidfTransformer(norm='l2')),
('clf', LinearSVC()),])
pipe.fit_transform(corpus, labels)
return pipe
Inside the Pipeline there are three processing steps. The CountVectorizer tokenizes the words, splits them, counts them, and transforms the data into a sparse matrix. The TfidfTransformer is optional, and you might want to remove it depending on the accuracy rating (doing cross validation tests and a grid search for the best parameters is a bit involved, so I won't get into it here). The LinearSVC is a standard text classification algorithm.
Finally, you predict the category of tweets:
def predict(pipe, tweet):
prediction = pipe.predict([tweet])
return prediction
Again, the tweet needs to be in a list, so I assumed it was entering the function as a string.
Put all those into a class or whatever, and you're done. At least, with this very basic example.
I didn't test this code so it might not work if you just copy-paste, but if you want to use scikit-learn it should give you an idea of where to start.
EDIT: tried to explain the steps in more detail.

Using a decision tree seems to work quite well for this problem. At least it produces a higher accuracy than a naive bayes classifier with my chosen features.
If you want to play around with some possibilities, you can use the following code, which requires nltk to be installed. The nltk book is also freely available online, so you might want to read a bit about how all of this actually works: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
#coding: utf-8
import nltk
import random
import re
def get_split_sets():
structured_dataset = get_dataset()
train_set = set(random.sample(structured_dataset, int(len(structured_dataset) * 0.7)))
test_set = [x for x in structured_dataset if x not in train_set]
train_set = [(tweet_features(x[1]), x[0]) for x in train_set]
test_set = [(tweet_features(x[1]), x[0]) for x in test_set]
return (train_set, test_set)
def check_accurracy(times=5):
s = 0
for _ in xrange(times):
train_set, test_set = get_split_sets()
c = nltk.classify.DecisionTreeClassifier.train(train_set)
# Uncomment to use a naive bayes classifier instead
#c = nltk.classify.NaiveBayesClassifier.train(train_set)
s += nltk.classify.accuracy(c, test_set)
return s / times
def remove_urls(tweet):
tweet = re.sub(r'http:\/\/[^ ]+', "", tweet)
tweet = re.sub(r'pic.twitter.com/[^ ]+', "", tweet)
return tweet
def tweet_features(tweet):
words = [x for x in nltk.tokenize.wordpunct_tokenize(remove_urls(tweet.lower())) if x.isalpha()]
features = dict()
for bigram in nltk.bigrams(words):
features["hasBigram(%s)" % ",".join(bigram)] = True
for trigram in nltk.trigrams(words):
features["hasTrigram(%s)" % ",".join(trigram)] = True
return features
def get_dataset():
dataset = """copy dataset in here
"""
structured_dataset = [('fruit' if x[0] == '0' else 'company', x[2:]) for x in dataset.splitlines()]
return structured_dataset
if __name__ == '__main__':
print check_accurracy()

Thank you for the comments thus far. Here is a working solution I prepared with PHP. I'd still be interested in hearing from others a more algorithmic approach to this same solution.
<?php
// Confusion Matrix Init
$tp = 0;
$fp = 0;
$fn = 0;
$tn = 0;
$arrFP = array();
$arrFN = array();
// Load All Tweets to string
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://pastebin.com/raw.php?i=m6pP8ctM');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$strCorpus = curl_exec($ch);
curl_close($ch);
// Load Tweets as Array
$arrCorpus = explode("\n", $strCorpus);
foreach ($arrCorpus as $k => $v) {
// init
$blnActualClass = substr($v,0,1);
$strTweet = trim(substr($v,2));
// Score Tweet
$intScore = score($strTweet);
// Build Confusion Matrix and Log False Positives & Negatives for Review
if ($intScore > 0) {
if ($blnActualClass == 1) {
// True Positive
$tp++;
} else {
// False Positive
$fp++;
$arrFP[] = $strTweet;
}
} else {
if ($blnActualClass == 1) {
// False Negative
$fn++;
$arrFN[] = $strTweet;
} else {
// True Negative
$tn++;
}
}
}
// Confusion Matrix and Logging
echo "
Predicted
1 0
Actual 1 $tp $fp
Actual 0 $fn $tn
";
if (count($arrFP) > 0) {
echo "\n\nFalse Positives\n";
foreach ($arrFP as $strTweet) {
echo "$strTweet\n";
}
}
if (count($arrFN) > 0) {
echo "\n\nFalse Negatives\n";
foreach ($arrFN as $strTweet) {
echo "$strTweet\n";
}
}
function LoadDictionaryArray() {
$strDictionary = <<<EOD
10|iTunes
10|ios 7
10|ios7
10|iPhone
10|apple inc
10|apple corp
10|apple.com
10|MacBook
10|desk top
10|desktop
1|config
1|facebook
1|snapchat
1|intel
1|investor
1|news
1|labs
1|gadget
1|apple store
1|microsoft
1|android
1|bonds
1|Corp.tax
1|macs
-1|pie
-1|clientes
-1|green apple
-1|banana
-10|apple pie
EOD;
$arrDictionary = explode("\n", $strDictionary);
foreach ($arrDictionary as $k => $v) {
$arr = explode('|', $v);
$arrDictionary[$k] = array('value' => $arr[0], 'term' => strtolower(trim($arr[1])));
}
return $arrDictionary;
}
function score($str) {
$str = strtolower($str);
$intScore = 0;
foreach (LoadDictionaryArray() as $arrDictionaryItem) {
if (strpos($str,$arrDictionaryItem['term']) !== false) {
$intScore += $arrDictionaryItem['value'];
}
}
return $intScore;
}
?>
The above outputs:
Predicted
1 0
Actual 1 31 1
Actual 0 1 17
False Positives
1|Royals apple #ASGame #mlb # News Corp Building http://instagram.com/p/bBzzgMrrIV/
False Negatives
-1|RT #MaxFreixenet: Apple no tiene clientes. Tiene FANS// error.... PAGAS por productos y apps, ergo: ERES CLIENTE.

In all the examples that you gave, Apple(inc) was either referred to as Apple or apple inc, so a possible way could be to search for:
a capital "A" in Apple
an "inc" after apple
words/phrases like "OS", "operating system", "Mac", "iPhone", ...
or a combination of them

To simplify answers based on Conditional Random Fields a bit...context is huge here. You will want to pick out in those tweets that clearly show Apple the company vs apple the fruit. Let me outline a list of features here that might be useful for you to start with. For more information look up noun phrase chunking, and something called BIO labels. See (http://www.cis.upenn.edu/~pereira/papers/crf.pdf)
Surrounding words: Build a feature vector for the previous word and the next word, or if you want more features perhaps the previous 2 and next 2 words. You don't want too many words in the model or it won't match the data very well.
In Natural Language Processing, you are going to want to keep this as general as possible.
Other features to get from surrounding words include the following:
Whether the first character is a capital
Whether the last character in the word is a period
The part of speech of the word (Look up part of speech tagging)
The text itself of the word
I don't advise this, but to give more examples of features specifically for Apple:
WordIs(Apple)
NextWordIs(Inc.)
You get the point. Think of Named Entity Recognition as describing a sequence, and then using some math to tell a computer how to calculate that.
Keep in mind that natural language processing is a pipeline based system. Typically, you break things in to sentences, move to tokenization, then do part of speech tagging or even dependency parsing.
This is all to get you a list of features you can use in your model to identify what you're looking for.

There's a really good library for processing natural language text in Python called nltk. You should take a look at it.
One strategy you could try is to look at n-grams (groups of words) with the word "apple" in them. Some words are more likely to be used next to "apple" when talking about the fruit, others when talking about the company, and you can use those to classify tweets.

Use LibShortText. This Python utility has already been tuned to work for short text categorization tasks, and it works well. The maximum you'll have to do is to write a loop to pick the best combination of flags. I used it to do supervised speech act classification in emails and the results were up to 95-97% accurate (during 5 fold cross validation!).
And it comes from the makers of LIBSVM and LIBLINEAR whose support vector machine (SVM) implementation is used in sklearn and cran, so you can be reasonably assured that their implementation is not buggy.

Make an AI filter to distinguish Apple Inc (the company) from apple (the fruit). Since these are tweets, define your training set with a vector of 140 fields, each field being the character written in the tweet at position X (0 to 139). If the tweet is shorter, just give a value for being blank.
Then build a training set big enough to get a good accuracy (subjective to your taste). Assign a result value to each tweet, a Apple Inc tweet get 1 (true) and an apple tweet (fruit) gets 0. It would be a case of supervised learning in a logistic regression.
That is machine learning, is generally easier to code and performs better. It has to learn from the set you give it, and it's not hardcoded.
I don't know Python, so I can not write the code for it, but if you were to take more time for machine learning's logic and theory you might want to look the class I'm following.
Try the Coursera course Machine Learning by Andrew Ng. You will learn machine learning on MATLAB or Octave, but once you get the basics you will be able to write machine learning in about any language if you do understand the simple math (simple in logistic regression).
That is, getting the code from someone won't make you able to understand what is going in the machine learning code. You might want to invest a couple of hours on the subject to see what is really going on.

I would recommend avoiding answers suggesting entity recognition. Because this task is a text-classification first and entity recognition second (you can do it without the entity recognition at all).
I think the fastest path to results will be spacy + prodigy.
Spacy has well thought through model for English language, so you don't have to build your own. While prodigy allows quickly create training datasets and fine tune spacy model for your needs.
If you have enough samples, you can have a decent model in 1 day.

Neural Network Back-Propagation Algorithm Gets Stuck on XOR Training PAttern

Overview
So I'm trying to get a grasp on the mechanics of neural networks. I still don't totally grasp the math behind it, but I think I understand how to implement it. I currently have a neural net that can learn AND, OR, and NOR training patterns. However, I can't seem to get it to implement the XOR pattern. My feed forward neural network consists of 2 inputs, 3 hidden, and 1 output. The weights and biases are randomly set between -0.5 and 0.5, and outputs are generated with the sigmoidal activation function
Algorithm
So far, I'm guessing I made a mistake in my training algorithm which is described below:
For each neuron in the output layer, provide an error value that is the desiredOutput - actualOutput --go to step 3
For each neuron in a hidden or input layer (working backwards) provide an error value that is the sum of all forward connection weights * the errorGradient of the neuron at the other end of the connection --go to step 3
For each neuron, using the error value provided, generate an error gradient that equals output * (1-output) * error. --go to step 4
For each neuron, adjust the bias to equal current bias + LEARNING_RATE * errorGradient. Then adjust each backward connection's weight to equal current weight + LEARNING_RATE * output of neuron at other end of connection * this neuron's errorGradient
I'm training my neural net online, so this runs after each training sample.
Code
This is the main code that runs the neural network:
private void simulate(double maximumError) {
int errorRepeatCount = 0;
double prevError = 0;
double error; // summed squares of errors
int trialCount = 0;
do {
error = 0;
// loop through each training set
for(int index = 0; index < Parameters.INPUT_TRAINING_SET.length; index++) {
double[] currentInput = Parameters.INPUT_TRAINING_SET[index];
double[] expectedOutput = Parameters.OUTPUT_TRAINING_SET[index];
double[] output = getOutput(currentInput);
train(expectedOutput);
// Subtracts the expected and actual outputs, gets the average of those outputs, and then squares it.
error += Math.pow(getAverage(subtractArray(output, expectedOutput)), 2);
}
} while(error > maximumError);
Now the train() function:
public void train(double[] expected) {
layers.outputLayer().calculateErrors(expected);
for(int i = Parameters.NUM_HIDDEN_LAYERS; i >= 0; i--) {
layers.allLayers[i].calculateErrors();
}
}
Output layer calculateErrors() function:
public void calculateErrors(double[] expectedOutput) {
for(int i = 0; i < numNeurons; i++) {
Neuron neuron = neurons[i];
double error = expectedOutput[i] - neuron.getOutput();
neuron.train(error);
}
}
Normal (Hidden & Input) layer calculateErrors() function:
public void calculateErrors() {
for(int i = 0; i < neurons.length; i++) {
Neuron neuron = neurons[i];
double error = 0;
for(Connection connection : neuron.forwardConnections) {
error += connection.output.errorGradient * connection.weight;
}
neuron.train(error);
}
}
Full Neuron class:
package neuralNet.layers.neurons;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import neuralNet.Parameters;
import neuralNet.layers.NeuronLayer;
public class Neuron {
private double output, bias;
public List<Connection> forwardConnections = new ArrayList<Connection>(); // Forward = layer closer to input -> layer closer to output
public List<Connection> backwardConnections = new ArrayList<Connection>(); // Backward = layer closer to output -> layer closer to input
public double errorGradient;
public Neuron() {
Random random = new Random();
bias = random.nextDouble() - 0.5;
}
public void addConnections(NeuronLayer prevLayer) {
// This is true for input layers. They create their connections differently. (See InputLayer class)
if(prevLayer == null) return;
for(Neuron neuron : prevLayer.neurons) {
Connection.createConnection(neuron, this);
}
}
public void calcOutput() {
output = bias;
for(Connection connection : backwardConnections) {
connection.input.calcOutput();
output += connection.input.getOutput() * connection.weight;
}
output = sigmoid(output);
}
private double sigmoid(double output) {
return 1 / (1 + Math.exp(-1*output));
}
public double getOutput() {
return output;
}
public void train(double error) {
this.errorGradient = output * (1-output) * error;
bias += Parameters.LEARNING_RATE * errorGradient;
for(Connection connection : backwardConnections) {
// for clarification: connection.input refers to a neuron that outputs to this neuron
connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
}
}
}
Results
When I'm training for AND, OR, or NOR the network can usually converge within about 1000 epochs, however when I train with XOR, the outputs become fixed and it never converges. So, what am I doing wrong? Any ideas?
Edit
Following the advice of others, I started over and implemented my neural network without classes...and it works. I'm still not sure where my problem lies in the above code, but it's in there somewhere.

This is surprising because you are using a big enough network (barely) to learn XOR. Your algorithm looks right, so I dont really know what is going on. It might help to know how you generate your training data: are you just reating the samples (1,0,1),(1,1,0),(0,1,1),(0,0,0) or something like that over and over? Perhaps the problem is that stochastic gradient descent is causing you to jump around the stabilizing minima. You could try some things to fix this: perhaps randomly sample from your training examples instead of repeating them (if that is what you are doing). Or, alternatively, you could modify your learning algorithm:
currently you have something equivalent to:
weight(epoch) = weight(epoch - 1) + deltaWeight(epoch)
deltaWeight(epoch) = mu * errorGradient(epoch)
where mu is the learning rate
One option is to very slowly decrease the value of mu.
An alternative would be to change your definition of deltaWeight to include a "momentum"
deltaWeight(epoch) = mu * errorGradient(epoch) + alpha * deltaWeight(epoch -1)
where alpha is the momentum parameter (between 0 and 1).
Visually, you can think of gradient descent as trying to find the minimum point of a curved surface by placing an object on that surface, and then step by step moving that object small amounts in which ever directing is sloping down based on where it is currently located. The problem is that you dont really do gradient descent: instead you do stochastic gradient descent where you move in direction by sampling from a set of training vectors and moving in what ever direction the sample makes look like is down. On average over the entire training data, stochastic gradient descent should work, but it is isn't guaranteed to because you can get into a situation where you jump back and forth never making progress. Slowly decreasing the learning rate means you take smaller and smaller steps each time so can not get stuck in an infinite cycle.
On the other hand, momentum makes the algorithm into something akin to rolling a rubber ball. As the ball roles it tends to go in the downwards direction, but it also tends to keep going in the direction it was going before, and if it is ever on a stretch where the down slope is in the same direction for a while it will speed up. The ball will therefore jump over some local minima, and it will be more resilient against stepping back and forth over the target because doing so means working against the force of momentum.
Having some code and thinking about this some more, it is pretty clear that your problem is in training the early layers. The functions you have successfully learned are all linearly separable, so it would make sense that only a single layer is being properly learned. I agree with LiKao about implementation strategies in general, although your approach should work. My suggestion for how to debug this is figure out what the progression of the weights on the connections between the input layer and the output layer looks like.
You should post the rest implementation of Neuron.

I faced the same problem short time ago. Finally I found the solution, how to write a code solving XOR wit the MLP algorithm.
The XOR problem seems to be an easy to learn problem but it isn't for the MLP because it's not linearly separable. So even if your MLP is OK (I mean there is no bug in your code) you have to find the good parameters to be able to learn the XOR problem.
Two hidden and one output neuron is fine. The 2 main thing you have to set:
although you have only 4 training samples you have to run the training for a couple of thousands epoch.
if you use sigmoid hidden layers but linear output the network will converge faster
Here is the detailed description and sample code: http://freeconnection.blogspot.hu/2012/09/solving-xor-with-mlp.html

Small hint - if the output of your NN seem to drift toward 0.5 then everything's OK!
The algorithm using just the learning rate and bias is just too simple to quickly learn XOR. You can either increase the number of epochs or change the algorithm.
My recommendation is to use momentum:
1000 epochs
learningRate = 0.3
momentum = 0.8
weights drawn from [0,1]
bias drawn form [-0.5, 0.5]
And some crucial pseudo code (assuming back and forward propagation works) :
for every edge:
previous_edge_weight_change = -1 * learningRate * edge_source_neuron_value * edge_target_neuron_delta + previous_edge_weight * momentum
edge_weight += previous_edge_weight_change
for every neuron:
previous_neuron_bias_change = -1 * learningRate * neuron_delta + previous_neuron_bias_change * momentum
bias += previous_neuron_bias_change

I would suggest you to generate a grid (say from [-5,-5] to [5,5] with a step like 0.5), learn your MLP on the XOR and apply it to the grid. Plotted in color you could see some kind of a frontier.
If you do that at each iteration, you'll see the evolution of the frontier and can control the learning.

It's been a while since I last implemented an Neural Network myself, but I think your mistake is in the lines:
bias += Parameters.LEARNING_RATE * errorGradient;
and
connection.weight += Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
The first of these lines should not be there at all. Bias is best modeled as the input of a neuron which is fixed at 1. This will serve to make your code a lot simpler and cleaner, because you will not have to treat the bias in any special way.
The other point is, that I think the sign is wrong in both of these expressions. Think about it like this:
Your gradient points into the direction of steepest ascend, so if you go into that direction, your error will get larger.
What you are doing here is adding something to the weights, in case the error is already positive, i.e. you are making it more positive. If it is negative you are substracting someting, i.e. you make it more negative.
Unless I am missing something about your definition of error or the gradient calculation you should change these lines to:
bias -= Parameters.LEARNING_RATE * errorGradient;
and
connection.weight -= Parameters.LEARNING_RATE * connection.input.getOutput() * errorGradient;
I had a similar mistake in one of my early implementations and it lead to exactly the same behaviour, i.e. it resulted in a network that learned in simple cases, but not anymore once the training data became more complex.

LiKao's comment to simplify my implementation and get rid of the object-oriented aspects solved my problem. The flaw in the algorithm as it is described above is unknown, however I now have a working neural network that is a lot smaller.
Feel free to continue to provide insight on the problem with my previous implementation, as others may have the same problem in the future.

I'm a bit rusty on neural networks, but I think there was a problem to implement the XOR with one perceptron: basically a neuron is able to separate two groups of solutions through a straight line, but one straight line is not sufficient for the XOR problem...
Here should be the answer!

I couldn't see anything wrong with the code, but I was having a similar problem with my network not converging for XOR, so figured I'd post my working configuration.
3 input neurons (one of them being a fixed bias of 1.0)
3 hidden neurons
1 output neuron
Weights randomly chosen between -0.5 and 0.5.
Sigmoid activation function.
Learning rate = 0.2
Momentum = 0.4
Epochs = 50,000
Converged 10/10 times.
One of the mistakes I was making was not connecting the bias input to the output neuron, and this would mean for the same configuration it only converged 2 out of 10 times with the other eight times failing because 1 and 1 would output 0.5.
Another mistake was not doing enough epochs. If I only did 1000 then the outputs tend to be around 0.5 for every test case. With epochs >= 8000 so 2000 times for each test case, it started to look like it might be working (but only if using momentum).
When doing 50000 epochs it did not matter whether momentum was used or not.
Another thing I tried was to not apply the sigmoid function to the output neurons output (which I think was what an earlier post had suggested), but this wrecked the network because the output*(1-output) part of the error equation could now be negative meaning weights were updated in a way that caused the error to increase.

Estimating a probability given other probabilities from a prior

I have a bunch of data coming in (calls to an automated callcenter) about whether or not a person buys a particular product, 1 for buy, 0 for not buy.
I want to use this data to create an estimated probability that a person will buy a particular product, but the problem is that I may need to do it with relatively little historical data about how many people bought/didn't buy that product.
A friend recommended that with Bayesian probability you can "help" your probability estimate by coming up with a "prior probability distribution", essentially this is information about what you expect to see, prior to taking into account the actual data.
So what I'd like to do is create a method that has something like this signature (Java):
double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount);
priorProbabilities is an array of probabilities I've seen for previous products, which this method would use to create a prior distribution for this probability. buyCount and noBuyCount are the actual data specific to this product, from which I want to estimate the probability of the user buying, given the data and the prior. This is returned from the method as a double.
I don't need a mathematically perfect solution, just something that will do better than a uniform or flat prior (ie. probability = buyCount / (buyCount+noBuyCount)). Since I'm far more familiar with source code than mathematical notation, I'd appreciate it if people could use code in their explanation.

Here's the Bayesian computation and one example/test:
def estimateProbability(priorProbs, buyCount, noBuyCount):
# first, estimate the prob that the actual buy/nobuy counts would be observed
# given each of the priors (times a constant that's the same in each case and
# not worth the effort of computing;-)`
condProbs = [p**buyCount * (1.0-p)**noBuyCount for p in priorProbs]
# the normalization factor for the above-mentioned neglected constant
# can most easily be computed just once
normalize = 1.0 / sum(condProbs)
# so here's the probability for each of the prior (starting from a uniform
# metaprior)
priorMeta = [normalize * cp for cp in condProbs]
# so the result is the sum of prior probs weighed by prior metaprobs
return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))
def example(numProspects=4):
# the a priori prob of buying was either 0.3 or 0.7, how does it change
# depending on how 4 prospects bought or didn't?
for bought in range(0, numProspects+1):
result = estimateProbability([0.3, 0.7], bought, numProspects-bought)
print 'b=%d, p=%.2f' % (bought, result)
example()
output is:
b=0, p=0.31
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.69
which agrees with my by-hand computation for this simple case. Note that the probability of buying, by definition, will always be between the lowest and the highest among the set of priori probabilities; if that's not what you want you might want to introduce a little fudge by introducing two "pseudo-products", one that nobody will ever buy (p=0.0), one that anybody will always buy (p=1.0) -- this gives more weight to actual observations, scarce as they may be, and less to statistics about past products. If we do that here, we get:
b=0, p=0.06
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.94
Intermediate levels of fudging (to account for the unlikely but not impossible chance that this new product may be worse than any one ever previously sold, or better than any of them) can easily be envisioned (give lower weight to the artificial 0.0 and 1.0 probabilities, by adding a vector priorWeights to estimateProbability's arguments).
This kind of thing is a substantial part of what I do all day, now that I work developing applications in Business Intelligence, but I just can't get enough of it...!-)

A really simple way of doing this without any difficult math is to increase buyCount and noBuyCount artificially by adding virtual customers that either bought or didn't buy the product. You can tune how much you believe in each particular prior probability in terms of how many virtual customers you think it is worth.
In pseudocode:
def estimateProbability(priorProbs, buyCount, noBuyCount, faithInPrior=None):
if faithInPrior is None: faithInPrior = [10 for x in buyCount]
adjustedBuyCount = [b + p*f for b,p,f in
zip(buyCount, priorProbs, faithInPrior]
adjustedNoBuyCount = [n + (1-p)*f for n,p,f in
zip(noBuyCount, priorProbs, faithInPrior]
return [b/(b+n) for b,n in zip(adjustedBuyCount, adjustedNoBuyCount]

Sounds like what you're trying to do is Association Rule Learning. I don't have time right now to provide you with any code, but I will point you in the direction of WEKA which is a fantastic open source data mining toolkit for Java. You should find plenty of interesting things there that will help you solve your problem.

As I see it, the best you could do is use the uniform distribution, unless you have some clue regarding the distribution. Or are you talking about making a relationship between this products and products previously bought by the same person in the Amazon Fashion "people who buy this product also buy..." ??

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.