I've been studying k-means clustering, and one thing that's not clear is how you choose the value of k. Is it just a matter of trial and error, or is there more to it?
You can maximize the Bayesian Information Criterion (BIC):
BIC(C | X) = L(X | C) - (p / 2) * log n
where L(X | C) is the log-likelihood of the dataset X according to model C, p is the number of parameters in the model C, and n is the number of points in the dataset.
See "X-means: extending K-means with efficient estimation of the number of clusters" by Dan Pelleg and Andrew Moore in ICML 2000.
Another approach is to start with a large value for k and keep removing centroids (reducing k) until it no longer reduces the description length. See "MDL principle for robust vector quantisation" by Horst Bischof, Ales Leonardis, and Alexander Selb in Pattern Analysis and Applications vol. 2, p. 59-72, 1999.
Finally, you can start with one cluster, then keep splitting clusters until the points assigned to each cluster have a Gaussian distribution. In "Learning the k in k-means" (NIPS 2003), Greg Hamerly and Charles Elkan show some evidence that this works better than BIC, and that BIC does not penalize the model's complexity strongly enough.
Basically, you want to find a balance between two variables: the number of clusters (k) and the average variance of the clusters. You want to minimize the former while also minimizing the latter. Of course, as the number of clusters increases, the average variance decreases (up to the trivial case of k=n and variance=0).
As always in data analysis, there is no one true approach that works better than all others in all cases. In the end, you have to use your own best judgement. For that, it helps to plot the number of clusters against the average variance (which assumes that you have already run the algorithm for several values of k). Then you can use the number of clusters at the knee of the curve.
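If you would rather locate the knee with a script than by eye, one simple heuristic (an assumption here, not something the answer above prescribes) is to pick the point that lies farthest from the straight line joining the first and last points of the curve. A minimal sketch, using made-up variance values and assuming the curve is strictly lower at the end than at the start:

```python
def knee_by_chord_distance(ks, variances):
    """Return the k whose point lies farthest from the chord joining the curve's endpoints."""
    x0, y0 = ks[0], variances[0]
    x1, y1 = ks[-1], variances[-1]
    best_k, best_d = ks[0], -1.0
    for k, v in zip(ks, variances):
        # Normalise both axes to [0, 1] so neither scale dominates the distance.
        nx = (k - x0) / (x1 - x0)
        ny = (v - y1) / (y0 - y1)
        # After normalisation the endpoints are (0, 1) and (1, 0); the chord is x + y = 1.
        d = abs(nx + ny - 1) / 2 ** 0.5
        if d > best_d:
            best_k, best_d = k, d
    return best_k


ks = [1, 2, 3, 4, 5, 6, 7, 8]
variances = [100.0, 60.0, 30.0, 12.0, 10.0, 9.0, 8.5, 8.0]  # made-up values for illustration
print(knee_by_chord_distance(ks, variances))
```

Like the visual approach, this is only a heuristic; on a curve with several bends or no clear bend it will still return some k, so it is worth checking the plot as well.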
Yes, you can find the best number of clusters using the Elbow method, but I found it troublesome to extract the elbow point from the graph with a script. You can inspect the elbow graph and find the elbow point yourself, but finding it programmatically was a lot of work.
So another option is to use the Silhouette method. In my experiments in R, the Silhouette result agreed with the result from the Elbow method.
Here's what I did.
#Dataset for Clustering
n = 150
g = 6
set.seed(g)
d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))),
y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
mydata<-d
#Plot 3X2 plots
par(mfrow=c(3,2))
#Plot the original dataset
plot(mydata$x,mydata$y,main="Original Dataset")
#Scree plot to determine the number of clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) {
wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
groups <- cutree(fit, k=5) # cut tree into 5 clusters
# draw dendrogram with red borders around the 5 clusters
rect.hclust(fit, k=5, border="red")
#Silhouette analysis for determining the number of clusters
library(cluster) # provides pam()
library(fpc)
asw <- numeric(20)
for (k in 2:20)
  asw[[k]] <- pam(mydata, k)$silinfo$avg.width
k.best <- which.max(asw)
cat("silhouette-optimal number of clusters:", k.best, "\n")
plot(pam(d, k.best))
# K-Means Cluster Analysis
fit <- kmeans(mydata,k.best)
mydata
# get cluster means
aggregate(mydata,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(mydata, clusterid=fit$cluster)
plot(mydata$x,mydata$y, col = fit$cluster, main="K-means Clustering results")
Hope it helps!!
Maybe a beginner like me is looking for a code example. Information on silhouette_score
is available here.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

range_n_clusters = [2, 3, 4]  # clusters range you want to select
# sample data; silhouette_score needs at least one more sample than clusters
dataToFit = [[12, 23], [112, 46], [45, 23], [14, 25], [110, 44], [47, 21]]
best_clusters = 0  # best cluster number which you will get
previous_silh_avg = -1.0  # silhouette scores lie in [-1, 1]

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters)
    cluster_labels = clusterer.fit_predict(dataToFit)
    silhouette_avg = silhouette_score(dataToFit, cluster_labels)
    if silhouette_avg > previous_silh_avg:
        previous_silh_avg = silhouette_avg
        best_clusters = n_clusters

# Final Kmeans for best_clusters
kmeans = KMeans(n_clusters=best_clusters, random_state=0).fit(dataToFit)
Look at this paper, "Learning the k in k-means" by Greg Hamerly, Charles Elkan. It uses a Gaussian test to determine the right number of clusters. Also, the authors claim that this method is better than BIC which is mentioned in the accepted answer.
There is a rule of thumb which says that the number of clusters can be estimated as
k = (n/2)^0.5
where n is the total number of elements from your sample.
You can check the veracity of this information in the following paper:
http://www.ijarcsms.com/docs/paper/volume1/issue6/V1I6-0015.pdf
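As a quick illustration of the rule of thumb, here is a minimal sketch. The function name and the choice to round to the nearest integer are my own assumptions; the formula itself only gives a real number.

```python
import math


def rule_of_thumb_k(n):
    """Rule-of-thumb cluster count k = sqrt(n / 2), rounded to the nearest integer (at least 1)."""
    return max(1, round(math.sqrt(n / 2)))


print(rule_of_thumb_k(150))  # sqrt(75) is about 8.66
```

Note that the estimate depends only on the sample size, not on the structure of the data, so it is best treated as a starting point for the other methods.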
There is also another method called G-means, which assumes that the data in each cluster follows a Gaussian (normal) distribution.
It consists of increasing k until every one of the k groups follows a Gaussian distribution.
It requires a lot of statistics but can be done.
Here is the source:
http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf
I hope this helps!
If you don't know the number of clusters k to provide as a parameter to k-means, there are four ways to find it automatically:
G-means algorithm: it discovers the number of clusters automatically using a statistical test to decide whether to split a k-means center into two. This algorithm takes a hierarchical approach to detect the number of clusters, based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution (continuous function which approximates the exact binomial distribution of events), and if not it splits the cluster. It starts with a small number of centers, say one cluster only (k=1), then the algorithm splits it into two centers (k=2) and splits each of these two centers again (k=4), having four centers in total. If G-means does not accept these four centers then the answer is the previous step: two centers in this case (k=2). This is the number of clusters your dataset will be divided into. G-means is very useful when you do not have an estimate of the number of clusters you will get after grouping your instances. Note that a poor choice of the "k" parameter might give you wrong results. The parallel version of g-means is called p-means. G-means sources:
source 1
source 2
source 3
x-means: an algorithm that efficiently searches the space of cluster locations and number of clusters to optimize the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) measure. This version of k-means finds the number k and also accelerates k-means.
Online k-means or Streaming k-means: it runs k-means in a single pass over the whole data and finds the optimal number k automatically. Spark implements it.
MeanShift algorithm: it is a nonparametric clustering technique which does not require prior knowledge of the number of clusters, and does not constrain the shape of the clusters. Mean shift clustering aims to discover “blobs” in a smooth density of samples. It is a centroid-based algorithm, which works by updating candidates for centroids to be the mean of the points within a given region. These candidates are then filtered in a post-processing stage to eliminate near-duplicates to form the final set of centroids. Sources: source1, source2, source3
First build a minimum spanning tree of your data.
Removing the K-1 most expensive edges splits the tree into K clusters,
so you can build the MST once, look at cluster spacings / metrics for various K,
and take the knee of the curve.
This works only for single-linkage clustering,
but for that it's fast and easy. Plus, MSTs make good visuals.
See for example the MST plot under
stats.stackexchange visualization software for clustering.
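The MST idea can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: Euclidean distance, a simple O(n^2) Prim's algorithm, and made-up sample points.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [(math.inf, -1)] * n  # cheapest known edge linking each node to the tree
    best[0] = (0.0, -1)
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i][0])
        in_tree[u] = True
        if best[u][1] >= 0:
            edges.append((best[u][0], best[u][1], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v][0]:
                    best[v] = (d, u)
    return edges


def mst_clusters(points, k):
    """Drop the k-1 most expensive MST edges and label the connected components."""
    edges = sorted(mst_edges(points))
    kept = edges[: len(edges) - (k - 1)]
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _, i, j in kept:
        parent[find(i)] = find(j)
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]  # made-up sample
labels = mst_clusters(points, 3)
print(labels)
```

Because the tree is built once, `mst_clusters` can be called for several values of k cheaply; the weights of the removed edges are the cluster spacings you would plot to look for the knee.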
I'm surprised nobody has mentioned this excellent article:
http://www.ee.columbia.edu/~dpwe/papers/PhamDN05-kmeans.pdf
After following several other suggestions I finally came across this article while reading this blog:
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
After that I implemented it in Scala; for my use cases the implementation provides really good results. Here's the code:
import breeze.linalg.DenseVector
import Kmeans.{Features, _}
import nak.cluster.{Kmeans => NakKmeans}
import scala.collection.immutable.IndexedSeq
import scala.collection.mutable.ListBuffer
/*
https://datasciencelab.wordpress.com/2014/01/21/selection-of-k-in-k-means-clustering-reloaded/
*/
class Kmeans(features: Features) {
  def fkAlphaDispersionCentroids(k: Int, dispersionOfKMinus1: Double = 0d, alphaOfKMinus1: Double = 1d): (Double, Double, Double, Features) = {
    if (1 == k || 0d == dispersionOfKMinus1) (1d, 1d, 1d, Vector.empty)
    else {
      val featureDimensions = features.headOption.map(_.size).getOrElse(1)
      val (dispersion, centroids: Features) = new NakKmeans[DenseVector[Double]](features).run(k)
      val alpha =
        if (2 == k) 1d - 3d / (4d * featureDimensions)
        else alphaOfKMinus1 + (1d - alphaOfKMinus1) / 6d
      val fk = dispersion / (alpha * dispersionOfKMinus1)
      (fk, alpha, dispersion, centroids)
    }
  }

  def fks(maxK: Int = maxK): List[(Double, Double, Double, Features)] = {
    val fadcs = ListBuffer[(Double, Double, Double, Features)](fkAlphaDispersionCentroids(1))
    var k = 2
    while (k <= maxK) {
      val (fk, alpha, dispersion, features) = fadcs(k - 2)
      fadcs += fkAlphaDispersionCentroids(k, dispersion, alpha)
      k += 1
    }
    fadcs.toList
  }

  def detK: (Double, Features) = {
    val vals = fks().minBy(_._1)
    (vals._3, vals._4)
  }
}

object Kmeans {
  val maxK = 10
  type Features = IndexedSeq[DenseVector[Double]]
}
If you use MATLAB (any version since 2013b, that is), you can use the function evalclusters to find out what the optimal k should be for a given dataset.
This function lets you choose from among 3 clustering algorithms - kmeans, linkage and gmdistribution.
It also lets you choose from among 4 clustering evaluation criteria - CalinskiHarabasz, DaviesBouldin, gap and silhouette.
I used the solution I found here : http://efavdb.com/mean-shift/ and it worked very well for me :
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs  # the samples_generator module was removed in newer scikit-learn
import matplotlib.pyplot as plt
from itertools import cycle
#%% Generate sample data
centers = [[1, 1], [-.75, -1], [1, -1], [-3, 2]]
X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)
#%% Compute clustering with MeanShift
# The bandwidth can be automatically estimated
bandwidth = estimate_bandwidth(X, quantile=.1,
n_samples=500)
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(X)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
n_clusters_ = labels.max()+1
#%% Plot result
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    my_members = labels == k
    cluster_center = cluster_centers[k]
    plt.plot(X[my_members, 0], X[my_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1],
             'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
My idea is to use the Silhouette Coefficient to find the optimal number of clusters (k). A detailed explanation is here.
Assuming you have a matrix of data called DATA, you can perform partitioning around medoids with estimation of number of clusters (by silhouette analysis) like this:
library(fpc)
maxk <- 20 # arbitrary here, you can set this to whatever you like
estimatedK <- pamk(dist(DATA), krange=1:maxk)$nc
One possible answer is to use a metaheuristic algorithm, such as a genetic algorithm, to find k.
That's simple: you can use a random k (in some range) and evaluate the fitness function of the genetic algorithm with some measurement like Silhouette,
and find the best k based on the fitness function.
https://en.wikipedia.org/wiki/Silhouette_(clustering)
import pandas as pd
from sklearn.cluster import KMeans

# num_data, ncluster, missing_num and count_row are assumed to be defined elsewhere
km=[]
for i in range(num_data.shape[1]):
    kmeans = KMeans(n_clusters=ncluster[i])#we take number of cluster bandwidth theory
    ndata=num_data[[i]].dropna()
    ndata['labels']=kmeans.fit_predict(ndata.values)
    cluster=ndata
    co=cluster.groupby(['labels'])[cluster.columns[0]].count()#count for frequency
    me=cluster.groupby(['labels'])[cluster.columns[0]].median()#median
    ma=cluster.groupby(['labels'])[cluster.columns[0]].max()#Maximum
    mi=cluster.groupby(['labels'])[cluster.columns[0]].min()#Minimum
    stat=pd.concat([mi,ma,me,co],axis=1)#Add all column
    stat['variable']=stat.columns[1]#Column name change
    stat.columns=['Minimum','Maximum','Median','count','variable']
    l=[]
    for j in range(ncluster[i]):
        n=[mi.loc[j],ma.loc[j]]
        l.append(n)
    stat['Class']=l
    stat=stat.sort_values(['Minimum'])#DataFrame.sort was removed from pandas
    stat=stat[['variable','Class','Minimum','Maximum','Median','count']]
    if missing_num.iloc[i]>0:
        stat.loc[ncluster[i]]=0
        if stat.iloc[ncluster[i],5]==0:
            stat.iloc[ncluster[i],5]=missing_num.iloc[i]
            stat.iloc[ncluster[i],0]=stat.iloc[0,0]
    stat['Percentage']=stat['count']*100/count_row#Freq PERCENTAGE
    stat['Cumulative Percentage']=stat['Percentage'].cumsum()
    km.append(stat)
cluster=pd.concat(km,axis=0)## see documentation for more info
cluster=cluster.round({'Minimum': 2, 'Maximum': 2,'Median':2,'Percentage':2,'Cumulative Percentage':2})
Another approach is to use Self-Organizing Maps (SOM) to find the optimal number of clusters. The SOM is an unsupervised neural
network methodology that needs only the input data and is used for
clustering in problem solving. This approach was used in a paper about customer segmentation.
The reference for the paper is
Abdellah Amine et al., Customer Segmentation Model in E-commerce Using
Clustering Techniques and LRFM Model: The Case
of Online Stores in Morocco, World Academy of Science, Engineering and Technology
International Journal of Computer and Information Engineering
Vol:9, No:8, 2015, 1999 - 2010
Hi, I'll keep it simple and straight to the point: I like to determine the number of clusters using the 'NbClust' library.
To see how to use the 'NbClust' function to determine the right number of clusters, you can check the actual project on GitHub, with actual data and clusters. An extension to the 'kmeans' algorithm is also performed there using the right number of 'centers'.
Github Project Link: https://github.com/RutvijBhutaiya/Thailand-Customer-Engagement-Facebook
You can choose the number of clusters by visually inspecting your data points, but you will soon realize that there is a lot of ambiguity in this process for all except the simplest data sets. This is not always bad, because you are doing unsupervised learning and there's some inherent subjectivity in the labeling process. Here, having previous experience with that particular problem or something similar will help you choose the right value.
If you want some hint about the number of clusters that you should use, you can apply the Elbow method:
First of all, compute the sum of squared error (SSE) for some values of k (for example 2, 4, 6, 8, etc.). The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid. Mathematically:
$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in c_i} \mathrm{dist}(x, c_i)^2$$
If you plot k against the SSE, you will see that the error decreases as k gets larger; this is because when the number of clusters increases, they should be smaller, so distortion is also smaller. The idea of the elbow method is to choose the k at which the SSE decreases abruptly. This produces an "elbow effect" in the graph, as you can see in the following picture:
In this case, k=6 is the value that the Elbow method has selected. Keep in mind that the Elbow method is a heuristic and, as such, it may or may not work well in your particular case. Sometimes there is more than one elbow, or no elbow at all. In those situations you usually end up calculating the best k by evaluating how well k-means performs in the context of the particular clustering problem you are trying to solve.
I worked on a Python package, kneed (the Kneedle algorithm). It finds the number of clusters dynamically as the point where the curve starts to flatten. Given a set of x and y values, kneed will return the knee point of the function. The knee point is the point of maximum curvature. Here is the sample code.
y = [7342.1301373073857, 6881.7109460930769, 6531.1657905495022,
6356.2255554679778, 6209.8382535595829, 6094.9052166741121,
5980.0191582610196, 5880.1869867848218, 5779.8957906367368,
5691.1879324562778, 5617.5153566271356, 5532.2613232619951,
5467.352265375117, 5395.4493783888756, 5345.3459908298091,
5290.6769823693812, 5243.5271656371888, 5207.2501206569532,
5164.9617535255456]
x = range(1, len(y)+1)
from kneed import KneeLocator
kn = KneeLocator(x, y, curve='convex', direction='decreasing')
print(kn.knee)
Leaving here a pretty cool gif from a Codecademy course:
The K-Means algorithm:
Place k random centroids for the initial clusters.
Assign data samples to the nearest centroid.
Update centroids based on the above-assigned data samples.
By the way, this is not an explanation of the full algorithm; it's just a helpful visualization.
I've programmed (Java) my own feed-forward network learning by back propagation. My network is trained to learn the XOR problem. I have an input matrix 4x2 and target 4x1.
Inputs:
{{0,0},
{0,1},
{1,0},
{1,1}}
Outputs:
{0.95048}
{-0.06721}
{-0.06826}
{0.95122}
I have this trained network and now I want to test it on new inputs like:
{.1,.9} //should result in 1
However, I'm not sure how to implement a float predict(double[] input) method. From what I can see, my problem is that my training data has a different size than my input data.
Please suggest.
EDIT:
The way I have this worded, it sounds like I want a regression value. However, I'd like the output to be a probability vector (classification) which I can then analyze.
In your situation your neural network should have an input of dimension 2 and an output of dimension 1. So during training you will provide each example input {x0, x1} and output {y0} for it to learn. Then finally when predicting you can provide a vector {.9, .1} and get the desired output.
Basically, when you train a neural network, you get a bunch of parameters that can be used to predict the result. You get the result by adding the products of your features and weights in each layer and then applying your activation function to it. For example, let's say your network has 3 layers (other than features), each hidden layer has three neurons, and your output layer has one neuron. W1 denotes your weights for the first layer, therefore it has a shape of [3,2]. By the same argument, W2, the weights for your second layer, has a shape of [3,3]. Finally, W3, the weights for your output layer, has a shape of [1,3]. Now if we use a function called g(z) as your activation function, you can calculate the result for an example like this:
Z1 = W1.X
A1 = g(Z1)
Z2 = W2.A1
A2 = g(Z2)
Z3 = W3.A2
A3 = g(Z3)
and A3 is your result, predicting the XOR of two numbers. Please note that I have not considered bias terms for this example.
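The question is in Java, but the forward pass is short enough to sketch in Python. The weights below are placeholders with the shapes described above ([3,2], [3,3], [1,3]), and tanh is an arbitrary stand-in for g; a trained network would supply its own values. The point is that `predict` takes a single input vector of length 2, regardless of how many rows the training matrix had.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def g(z):
    """Element-wise activation; tanh is an arbitrary choice for this sketch."""
    return [math.tanh(v) for v in z]


def predict(weights, x):
    """Feed x through every layer in turn: A = g(W . A_prev). Bias terms are omitted."""
    a = x
    for W in weights:
        a = g(matvec(W, a))
    return a


# Placeholder (untrained) weights, for illustration only.
W1 = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
W2 = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5], [-0.3, 0.6, 0.1]]
W3 = [[0.9, -0.8, 0.5]]
print(predict([W1, W2, W3], [0.1, 0.9]))
```

For a probability-like output, as asked in the edit to the question, one common choice is a sigmoid (or softmax for several classes) on the last layer instead of tanh.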
Any clever ideas on how to generate random coordinates (latitude / longitude) of places on Earth? Latitude / Longitude. Precision to 5 points and avoid bodies of water.
double minLat = -90.00;
double maxLat = 90.00;
// Math.random() is in [0, 1), so scale by the range only; adding 1 could overshoot maxLat
double latitude = minLat + Math.random() * (maxLat - minLat);
double minLon = -180.00; // cover both hemispheres
double maxLon = 180.00;
double longitude = minLon + Math.random() * (maxLon - minLon);
DecimalFormat df = new DecimalFormat("#.#####");
log.info("latitude:longitude --> " + df.format(latitude) + "," + df.format(longitude));
Maybe I'm living in a dream world and the water topic is unavoidable ... but hopefully there's a nicer, cleaner and more efficient way to do this?
EDIT
Some fantastic answers/ideas -- however, at scale, let's say I need to generate 25,000 coordinates. Going to an external service provider may not be the best option due to latency, cost and a few other factors.
To deal with the body of water problem is going to be largely a data issue, e.g. do you just want to miss the oceans or do you need to also miss small streams. Either you need to use a service with the quality of data that you need, or, you need to obtain the data yourself and run it locally. From your edit, it sounds like you want to go the local data route, so I'll focus on a way to do that.
One method is to obtain a shapefile for either land areas or water areas. You can then generate a random point and determine if it intersects a land area (or alternatively, does not intersect a water area).
To get started, you might get some low resolution data here and then get higher resolution data here for when you want to get better answers on coast lines or with lakes/rivers/etc. You mentioned that you want precision in your points to 5 decimal places, which is a little over 1m. Do be aware that if you get data to match that precision, you will have one giant data set. And, if you want really good data, be prepared to pay for it.
Once you have your shape data, you need some tools to help you determine the intersection of your random points. Geotools is a great place to start and probably will work for your needs. You will also end up looking at opengis code (docs under geotools site - not sure if they consumed them or what) and JTS for the geometry handling. Using this you can quickly open the shapefile and start doing some intersection queries.
File f = new File ( "world.shp" );
ShapefileDataStore dataStore = new ShapefileDataStore ( f.toURI ().toURL () );
FeatureSource<SimpleFeatureType, SimpleFeature> featureSource =
dataStore.getFeatureSource ();
String geomAttrName = featureSource.getSchema ()
.getGeometryDescriptor ().getLocalName ();
ResourceInfo resourceInfo = featureSource.getInfo ();
CoordinateReferenceSystem crs = resourceInfo.getCRS ();
Hints hints = GeoTools.getDefaultHints ();
hints.put ( Hints.JTS_SRID, 4326 );
hints.put ( Hints.CRS, crs );
FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2 ( hints );
GeometryFactory gf = JTSFactoryFinder.getGeometryFactory ( hints );
Coordinate land = new Coordinate ( -122.0087, 47.54650 );
Point pointLand = gf.createPoint ( land );
Coordinate water = new Coordinate ( 0, 0 );
Point pointWater = gf.createPoint ( water );
Intersects filter = ff.intersects ( ff.property ( geomAttrName ),
ff.literal ( pointLand ) );
FeatureCollection<SimpleFeatureType, SimpleFeature> features = featureSource
.getFeatures ( filter );
filter = ff.intersects ( ff.property ( geomAttrName ),
ff.literal ( pointWater ) );
features = featureSource.getFeatures ( filter );
Quick explanations:
This assumes the shapefile you got is polygon data. Intersection on lines or points isn't going to give you what you want.
First section opens the shapefile - nothing interesting
you have to fetch the geometry property name for the given file
coordinate system stuff - you specified lat/long in your post but GIS can be quite a bit more complicated. In general, the data I pointed you at is geographic, wgs84, and, that is what I setup here. However, if this is not the case for you then you need to be sure you are dealing with your data in the correct coordinate system. If that all sounds like gibberish, google around for a tutorial on GIS/coordinate systems/datum/ellipsoid.
generating the coordinate geometries and the filters are pretty self-explanatory. The resulting set of features will either be empty, meaning the coordinate is in the water if your data is land cover, or not empty, meaning the opposite.
Note: if you do this with a really random set of points, you are going to hit water pretty often and it could take you a while to get to 25k points. You may want to try to scope your point generation better than truly random (like remove big chunks of the Atlantic/Pacific/Indian oceans).
Also, you may find that your intersection queries are too slow. If so, you may want to look into creating a quadtree index (qix) with a tool like GDAL. I don't recall which index types are supported by geotools, though.
This was asked a long time ago, and I now have a similar need. There are two possibilities I am looking into:
1. Define the surface ranges for the random generator.
Here it's important to identify the level of precision you are going for. The easiest way would be to have a very relaxed and approximate approach. In this case you can divide the world map into "boxes":
Each box has its own range of lat/lon. Then you first randomise to get a random box, then you randomise to get a random lat and a random lon within the boundaries of that box.
Precision is of course not the best at all here... Though it depends :) If you do your homework well and define a lot of boxes covering the most complex surface shapes, you might be quite OK with the precision.
2. Check each point against an API.
Use some API that returns a continent name, address, country, or district for the given coordinates, i.e. something that water doesn't have. The Google Maps APIs can help here. I didn't research this one deeper, but I think it's possible, though you will have to run the check on each generated pair of coordinates and regenerate if it's wrong. So you can get a bit stuck if the random generator keeps throwing you in the ocean.
Also - some water does belong to countries, districts...so yeah, not very precise.
For my needs - I am going with "boxes" because I also want to control exact areas from which the random coordinates are taken and don't mind if it lands on a lake or river, just not open ocean:)
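The "boxes" approach can be sketched as follows. The question uses Java; Python is used here only to keep the sketch short. The box coordinates are hypothetical placeholders, not verified land-only areas, and weighting the box choice by its size in square degrees is my own assumption so that large boxes are not under-sampled (it ignores the fact that a degree of longitude shrinks toward the poles).

```python
import random

# Hypothetical boxes: (min_lat, max_lat, min_lon, max_lon). Placeholder values only.
BOXES = [
    (35.0, 45.0, -110.0, -90.0),
    (50.0, 60.0, 40.0, 100.0),
    (20.0, 28.0, 5.0, 25.0),
]


def random_point_in_boxes(boxes, rng=random):
    """Pick a box with probability proportional to its size, then a point inside it."""
    weights = [(b[1] - b[0]) * (b[3] - b[2]) for b in boxes]
    min_lat, max_lat, min_lon, max_lon = rng.choices(boxes, weights=weights, k=1)[0]
    lat = rng.uniform(min_lat, max_lat)
    lon = rng.uniform(min_lon, max_lon)
    return round(lat, 5), round(lon, 5)  # 5 decimal places, as the question asks


print(random_point_in_boxes(BOXES))
```

Everything runs locally, so generating 25,000 points is cheap; the accuracy depends entirely on how carefully the boxes are drawn.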
Download a truckload of KML files containing land-only locations.
Extract all coordinates from them; this might help here.
Pick them at random.
Definitely you should have a map as a resource. You can take it here: http://www.naturalearthdata.com/
Then I would prepare a 1-bit black and white bitmap resource with 1s marking land and 0s marking water.
The size of the bitmap depends on your required precision. If you need 5 degrees then your bitmap will be 360/5 x 180/5 = 72x36 pixels = 2592 bits.
Then I would load this bitmap in Java, generate a random integer within the range above, read the bit, and regenerate if it is zero.
P.S. Also you can dig here http://geotools.org/ for some ready made solutions.
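A minimal sketch of the bitmap idea, in Python rather than Java for brevity. The tiny grid below is made up and is not real geography; a real mask would be rasterised from data such as Natural Earth. The cell size and the grid orientation (row 0 is the northernmost band, column 0 starts at longitude -180) are assumptions of this sketch.

```python
import random

CELL_DEG = 45  # hypothetical coarse resolution: 8 columns x 4 rows
# Made-up land mask, NOT real geography: 1 = land, 0 = water.
LAND = [
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
]


def random_land_point(land, cell_deg, rng=random):
    """Rejection sampling: draw random cells until one is marked as land."""
    rows, cols = len(land), len(land[0])
    while True:
        r = rng.randrange(rows)
        c = rng.randrange(cols)
        if land[r][c]:
            # Place the point at a random position inside the accepted cell.
            lat = 90 - (r + rng.random()) * cell_deg
            lon = -180 + (c + rng.random()) * cell_deg
            return lat, lon


print(random_land_point(LAND, CELL_DEG))
```

Picking cells uniformly over-samples high latitudes, where cells cover less ground; if that matters, draw the latitude with an area-preserving scheme first and only use the bitmap for the land test.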
To get a nice even distribution over latitudes and longitudes you should do something like this to get the right angles:
double longitude = Math.random() * Math.PI * 2;
double latitude = Math.acos(Math.random() * 2 - 1);
As for avoiding bodies of water, do you have the data for where water is already? Well, just resample until you get a hit! If you don't have this data already then it seems some other people have some better suggestions than I would for that...
Hope this helps, cheers.
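Note that the two lines in this answer produce radians, and the acos expression yields a colatitude in [0, pi] (measured from the pole) rather than a conventional latitude. A small Python sketch of the same idea, converted to degrees; the function name is my own:

```python
import math
import random


def random_lat_lon(rng=random):
    """Uniform point on the sphere as (latitude, longitude) in degrees."""
    lon = rng.random() * 360.0 - 180.0
    # acos(2u - 1) is a colatitude in [0, pi]; subtracting it from 90 degrees gives latitude.
    lat = 90.0 - math.degrees(math.acos(rng.random() * 2.0 - 1.0))
    return lat, lon


print(random_lat_lon())
```

Drawing latitude uniformly in degrees instead would crowd points toward the poles; with the acos form, about half of the points fall between 30°S and 30°N, which matches that band's share of the sphere's surface.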
There is another way to approach this using the Google Earth Api. I know it is javascript, but I thought it was a novel way to solve the problem.
Anyhow, I have put together a full working solution here - notice it works for rivers too: http://www.msa.mmu.ac.uk/~fraser/ge/coord/
The basic idea I have used is to implement the hitTest method of the GEView object in the Google Earth Api.
Take a look at the following example of the hitTest from Google.
http://earth-api-samples.googlecode.com/svn/trunk/examples/hittest.html
The hitTest method is supplied a random point on the screen (in pixel coordinates), for which it returns a GEHitTestResult object that contains information about the geographic location corresponding to the point. If one uses the GEPlugin.HIT_TEST_TERRAIN mode with the method, one can limit results to land (terrain) only, as long as we screen the results to points with an altitude > 1m.
This is the function I use that implements the hitTest:
var hitTestTerrain = function()
{
var x = getRandomInt(0, 200); // same pixel size as the map3d div height
var y = getRandomInt(0, 200); // ditto for width
var result = ge.getView().hitTest(x, ge.UNITS_PIXELS, y, ge.UNITS_PIXELS, ge.HIT_TEST_TERRAIN);
var success = result && (result.getAltitude() > 1);
return { success: success, result: result };
};
Obviously you also want to have random results from anywhere on the globe (not just random points visible from a single viewpoint). To do this I move the earth view after each successful hitTestTerrain call. This is achieved using a small helper function.
var flyTo = function(lat, lng, rng)
{
lookAt.setLatitude(lat);
lookAt.setLongitude(lng);
lookAt.setRange(rng);
ge.getView().setAbstractView(lookAt);
};
Finally here is a stripped down version of the main code block that calls these two methods.
var getRandomLandCoordinates = function()
{
  var test = hitTestTerrain();
  if (test.success)
  {
    coords[coords.length] = { lat: test.result.getLatitude(), lng: test.result.getLongitude() };
  }
  if (coords.length <= number)
  {
    getRandomLandCoordinates();
  }
  else
  {
    displayResults();
  }
};
So, the earth moves randomly to a position
The other functions in there are just helpers to generate the random x,y and random lat,lng numbers, to output the results and also to toggle the controls etc.
I have tested the code quite a bit and the results are not 100% perfect; tweaking the altitude threshold to something higher, like 50m, solves this, but obviously it diminishes the area of possible selected coordinates.
Obviously you could adapt the idea to suit your needs, maybe running the code multiple times to populate a database or something.
As a plan B, maybe you can pick a random country and then pick a random coordinate inside of this country. To be fair when picking a country, you can use its area as weight.
There is a library here and you can use its .random() method to get a random coordinate. Then you can use GeoNames WebServices to determine whether it is on land or not. They have a list of webservices and you'll just have to use the right one. GeoNames is free and reliable.
Go there http://wiki.openstreetmap.org/
Try to use API: http://wiki.openstreetmap.org/wiki/Databases_and_data_access_APIs
I guess you could use a world map, define a few points on it to delimit most of water bodies as you say and use a polygon.contains method to validate the coordinates.
A faster algorithm would be to use this map, take some random point and check the color beneath; if it's blue, then it's water... When you have the coordinates, you convert them to lat/long.
You might also do the blue/green thing, and then store all the green points for later lookup. This has the benefit of being refinable stepwise. As you figure out a better way to make your list of points, you can just point your random grabber at a more and more accurate group of points.
Maybe a service provider has an answer to your question already: e.g. https://www.google.com/enterprise/marketplace/viewListing?productListingId=3030+17310026046429031496&pli=1
Elevation api? http://code.google.com/apis/maps/documentation/elevation/ above sea level or below? (no dutch points for you!)
Generating is easy; the problem is that they should not be on water. I would import OpenStreetMap data, for example from here http://ftp.ecki-netz.de/osm/, into a database (very easy data structure). I would suggest PostgreSQL; it comes with some geo functions http://www.postgresql.org/docs/8.2/static/functions-geometry.html . For that you have to save the points in a "polygon" column; then you can check with the "&&" operator whether a point is in a water polygon. For the attributes of an OpenStreetMap way entry you should have a look at http://wiki.openstreetmap.org/wiki/Category:En:Keys
Supplementary to what bsimic said about digging into GeoNames' Webservices, here is a shortcut:
they have a dedicated WebService for requesting an ocean name.
(I am aware of the OP's constraint of not using public web services due to the number of requests. Nevertheless, I stumbled upon this with the same basic question and consider it helpful.)
Go to http://www.geonames.org/export/web-services.html#astergdem and have a look at "Ocean / reverse geocoding". It is available as XML and JSON. Create a free user account to prevent daily limits on the demo account.
Request example on ocean area (Baltic Sea, JSON-URL):
http://api.geonames.org/oceanJSON?lat=54.049889&lng=10.851388&username=demo
results in
{
  "ocean": {
    "distance": "0",
    "name": "Baltic Sea"
  }
}
while some coordinates on land result in
{
  "status": {
    "message": "we are afraid we could not find an ocean for latitude and longitude :53.0,9.0",
    "value": 15
  }
}
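A small sketch of how the two response shapes shown above could be told apart in Python. It only parses a response body that has already been fetched; the function name is made up, and it assumes the service keeps returning either an "ocean" object or a "status" object as in these examples.

```python
import json


def ocean_name(response_text):
    """Return the ocean name from a GeoNames oceanJSON body, or None for land."""
    data = json.loads(response_text)
    ocean = data.get("ocean")
    if ocean is None:
        # No "ocean" key: the service answered with a "status" message instead
        return None
    return ocean.get("name")


sea_body = '{"ocean": {"distance": "0", "name": "Baltic Sea"}}'
land_body = '{"status": {"message": "could not find an ocean", "value": 15}}'

print(ocean_name(sea_body))   # Baltic Sea
print(ocean_name(land_body))  # None
```

A point would then count as land whenever ocean_name returns None, although a "status" reply can also signal other problems (such as an exhausted request quota), so checking its "value" field is advisable.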
Do the random points have to be uniformly distributed all over the world? If you could settle for a seemingly uniform distribution, you can do this:
Open your favorite map service, draw a rectangle inside the United States, Russia, China, Western Europe and definitely the northern part of Africa - making sure there are no big lakes or Caspian seas inside the rectangles. Take the corner coordinates of each rectangle, and then select coordinates at random inside those rectangles.
You are guaranteed that none of these points will be on any sea or lake. You might hit an occasional river, but I'm not sure how many geoservices are going to be accurate enough for that anyway.
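The rectangle approach might look like the following sketch. The two boxes are illustrative placeholders only (roughly the central United States and the central Sahara); they are not taken from the answer and should be checked against a map before use. Picking a box uniformly also ignores the boxes' areas, which is acceptable only if a "seemingly uniform" spread is good enough.

```python
import random

# Hypothetical (lat_min, lat_max, lon_min, lon_max) boxes; verify them on a map
LAND_BOXES = [
    (36.0, 40.0, -103.0, -97.0),
    (20.0, 25.0, 5.0, 15.0),
]


def random_land_point(boxes=LAND_BOXES):
    """Pick one box at random, then a uniform random point inside it."""
    lat_min, lat_max, lon_min, lon_max = random.choice(boxes)
    lat = random.uniform(lat_min, lat_max)
    lon = random.uniform(lon_min, lon_max)
    # Five decimal places is roughly metre-level precision
    return round(lat, 5), round(lon, 5)


print(random_land_point())
```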
This is an extremely interesting question, from both a theoretical and a practical perspective. The most suitable solution will largely depend on your exact requirements. Do you need to account for every body of water, or just the major seas and oceans? How critical are accuracy and correctness? Will identifying sea as land, or vice versa, be a catastrophic failure?
I think machine learning techniques would be an excellent solution to this problem, provided that you don't mind the (hopefully small) probability that a point of water is incorrectly classified as land. If that's not an issue, then this approach should have a number of advantages against other techniques.
Using a bitmap is a nice solution, simple and elegant. It can be produced to a specified accuracy, and the classification is guaranteed to be correct (or at least as correct as you made the bitmap). But its practicality depends on how accurate you need the solution to be. You mention that you want coordinates accurate to 5 decimal places (which is equivalent to mapping the whole surface of the planet to about the nearest metre). Using 1 bit per element, the bitmap would weigh in at ~73.6 terabytes!
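The bitmap figure can be reproduced with a few lines of arithmetic. This sketch assumes a plain latitude/longitude grid with one cell per 0.00001 degree and one bit per cell; the function name is made up.

```python
def bitmap_size_bytes(decimal_places):
    """Size of a 1-bit-per-cell lat/lon grid at the given decimal precision."""
    cells_per_degree = 10 ** decimal_places
    lon_cells = 360 * cells_per_degree
    lat_cells = 180 * cells_per_degree
    return lon_cells * lat_cells / 8  # 8 bits per byte


size = bitmap_size_bytes(5)
print(size)          # 8.1e13 bytes
print(size / 2**40)  # about 73.7 when expressed in units of 2**40 bytes
```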
We don't need to store all of this data though; we only need to know where the coastlines are. Just by knowing where a point is in relation to the coast, we can determine whether it is on land or sea. As a rough estimate, the CIA world factbook reports that there are 22498km of coastline on Earth. If we were to store coordinates for every metre of coastline, using a 32-bit word for each latitude and longitude, this would take less than 1.35GB to store. That's still a lot if this is for a trivial application, but a few orders of magnitude less than using a bitmap. If such a high degree of accuracy isn't necessary, though, these numbers drop considerably. Reducing the mapping to only the nearest kilometre would make the bitmap just ~75GB, and the coordinates for the world's coastline could fit on a floppy disk.
What I propose is to use a clustering algorithm to decide whether a point is on land or not. We would first need a suitably large number of coordinates that we already know to be on either land or sea. Existing GIS databases would be suitable for this. Then we can analyse the points to determine clusters of land and sea. The decision boundary between the clusters should fall on the coastlines, and all points not determining the decision boundary can be removed. This process can be iterated to give a progressively more accurate boundary.
Only the points determining the decision boundary/the coastline need to be stored, and by using a simple distance metric we can quickly and easily decide if a set of coordinates are on land or sea. A large amount of resources would be required to train the system, but once complete the classifier would require very little space or time.
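To make the final lookup step concrete, here is a minimal nearest-neighbour sketch. It is not the training procedure described above, only the "simple distance metric" part: each stored boundary point carries a label, and a query takes the label of the closest stored point. The haversine distance, the function names and the three labelled points are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def classify(lat, lon, boundary_points):
    """Return the label of the stored point nearest to (lat, lon)."""
    nearest = min(boundary_points, key=lambda p: haversine_km(lat, lon, p[0], p[1]))
    return nearest[2]


# Made-up labelled points near a coastline: (lat, lon, label)
boundary_points = [
    (54.0, 10.7, "land"),
    (54.1, 11.0, "sea"),
    (53.9, 10.9, "land"),
]

print(classify(54.0, 10.6, boundary_points))
print(classify(54.15, 11.1, boundary_points))
```

A linear scan like this is only practical for small point sets; a spatial index would be needed for a full coastline.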
Assuming Atlantis isn't in the database, you could randomly select cities. This also provides a more realistic distribution of points if you intend to mimic human activity:
https://simplemaps.com/data/world-cities
There are only 7,300 cities in the free version.
I have written a kernel density estimator in Java that takes input in the form of ESRI shapefiles and outputs a GeoTIFF image of the estimated surface. To test this module I need an example shapefile, and for whatever reason I have been told to retrieve one from the sample data included in R. Problem is that none of the sample data is a shapefile...
So I'm trying to use the shapefiles package's function convert.to.shapefile(4) to convert the bei dataset included in the spatstat package in R to a shapefile. Unfortunately this is proving to be harder than I thought. Does anyone have any experience in doing this? If you'd be so kind as to lend me a hand here I'd greatly appreciate it.
Thanks,
Ryan
References:
spatstat,
shapefiles
There are converter functions for Spatial objects in the spatstat and maptools packages that can be used for this. A shapefile consists of at least points (or lines or polygons) and attributes for each object.
library(spatstat)
library(sp)
library(maptools)
data(bei)
Coerce bei to a Spatial object, here just points without attributes since there are no "marks" on the ppp object.
spPoints <- as(bei, "SpatialPoints")
A shapefile requires at least one column of attribute data, so create a dummy.
dummyData <- data.frame(dummy = rep(0, npoints(bei)))
Using the SpatialPoints object and the dummy data, generate a SpatialPointsDataFrame.
spDF <- SpatialPointsDataFrame(spPoints, dummyData)
At this point you should definitely consider what coordinate system bei uses and whether you can represent it with a WKT CRS (well-known text coordinate reference system). You can assign that to the Spatial object as another argument to SpatialPointsDataFrame, or after creation with proj4string(spDF) <- CRS("+proj=etc...") (but this is an entire problem all on its own that we could write pages on).
Load the rgdal package (this is the most general option as it supports many formats and uses the GDAL library, but it may not be available because of system dependencies).
library(rgdal)
(Use writePointsShape in the maptools package if rgdal is not available; writePolyShape is the polygon counterpart.)
The syntax is the object, then the "data source name" (here the current directory, this can be a full path to a .shp, or a folder), then the layer (for shapefiles the file name without the extension), and then the name of the output driver.
writeOGR(obj = spDF, dsn = ".", layer = "bei", driver = "ESRI Shapefile")
Note that the write would fail if "bei.shp" already existed, so it would have to be deleted first with unlink("bei.shp").
List any files that start with "bei":
list.files(pattern = "^bei")
[1] "bei.dbf" "bei.shp" "bei.shx"
Note that there is no general "as.Spatial" converter for ppp objects, since decisions must be made as to whether this is a point pattern with marks and so on. It might be interesting to try writing one that reports on whether dummy data was required and so on.
See the following vignettes for further information and details on the differences between these data representations:
library(sp); vignette("sp")
library(spatstat); vignette("spatstat")
A general solution is:
convert the "ppp" or "owin" classed objects to appropriate classed objects from the sp package
use the writeOGR() function from package rgdal to write the Shapefile out
For example, if we consider the hamster data set from spatstat:
require(spatstat)
require(maptools)
require(sp)
require(rgdal)
data(hamster)
First convert this object to a SpatialPointsDataFrame object:
ham.sp <- as.SpatialPointsDataFrame.ppp(hamster)
This gives us a sp object to work from:
> str(ham.sp, max = 2)
Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
  ..@ data       :'data.frame': 303 obs. of 1 variable:
  ..@ coords.nrs : num(0)
  ..@ coords     : num [1:303, 1:2] 6 10.8 25.8 26.8 32.5 ...
  .. ..- attr(*, "dimnames")=List of 2
  ..@ bbox       : num [1:2, 1:2] 0 0 250 250
  .. ..- attr(*, "dimnames")=List of 2
  ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slots
This object has a single variable in the @data slot:
> head(ham.sp@data)
marks
1 dividing
2 dividing
3 dividing
4 dividing
5 dividing
6 dividing
So say we now want to write out this variable as an ESRI Shapefile, we use writeOGR()
writeOGR(ham.sp, "hamster", "marks", driver = "ESRI Shapefile")
This will create several marks.xxx files in a directory named hamster, created in the current working directory. That set of files is the Shapefile.
One of the reasons why I didn't do the above with the bei data set is that it doesn't contain any data, and thus we can't coerce it to a SpatialPointsDataFrame object. There are data we could use, in bei.extra (loaded at the same time as bei), but these extra data are on a regular grid. So we'd have to
convert bei.extra to a SpatialGridDataFrame object (say bei.spg)
convert bei to a SpatialPoints object (say bei.sp)
overlay() the bei.sp points on to the bei.spg grid, yielding values from the grid for each of the points in bei
that should give us a SpatialPointsDataFrame that can be written out using writeOGR() as above
As you see, that is a bit more involved just to give you a Shapefile. Will the hamster data example I show suffice? If not, I can hunt out my Bivand et al tomorrow and run through the steps for bei.
I have a bunch of data coming in (calls to an automated callcenter) about whether or not a person buys a particular product, 1 for buy, 0 for not buy.
I want to use this data to create an estimated probability that a person will buy a particular product, but the problem is that I may need to do it with relatively little historical data about how many people bought/didn't buy that product.
A friend recommended that with Bayesian probability you can "help" your probability estimate by coming up with a "prior probability distribution". Essentially, this is information about what you expect to see, prior to taking into account the actual data.
So what I'd like to do is create a method that has something like this signature (Java):
double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount);
priorProbabilities is an array of probabilities I've seen for previous products, which this method would use to create a prior distribution for this probability. buyCount and noBuyCount are the actual data specific to this product, from which I want to estimate the probability of the user buying, given the data and the prior. This is returned from the method as a double.
I don't need a mathematically perfect solution, just something that will do better than a uniform or flat prior (ie. probability = buyCount / (buyCount+noBuyCount)). Since I'm far more familiar with source code than mathematical notation, I'd appreciate it if people could use code in their explanation.
Here's the Bayesian computation and one example/test:
def estimateProbability(priorProbs, buyCount, noBuyCount):
    # first, estimate the prob that the actual buy/nobuy counts would be observed
    # given each of the priors (times a constant that's the same in each case and
    # not worth the effort of computing;-)
    condProbs = [p**buyCount * (1.0-p)**noBuyCount for p in priorProbs]
    # the normalization factor for the above-mentioned neglected constant
    # can most easily be computed just once
    normalize = 1.0 / sum(condProbs)
    # so here's the probability for each of the prior (starting from a uniform
    # metaprior)
    priorMeta = [normalize * cp for cp in condProbs]
    # so the result is the sum of prior probs weighed by prior metaprobs
    return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))

def example(numProspects=4):
    # the a priori prob of buying was either 0.3 or 0.7, how does it change
    # depending on how 4 prospects bought or didn't?
    for bought in range(0, numProspects+1):
        result = estimateProbability([0.3, 0.7], bought, numProspects-bought)
        print('b=%d, p=%.2f' % (bought, result))

example()
output is:
b=0, p=0.31
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.69
which agrees with my by-hand computation for this simple case. Note that the probability of buying, by definition, will always lie between the lowest and the highest of the prior probabilities; if that's not what you want, you can introduce a little fudge by adding two "pseudo-products", one that nobody will ever buy (p=0.0) and one that everybody will always buy (p=1.0) -- this gives more weight to actual observations, scarce as they may be, and less to statistics about past products. If we do that here, we get:
b=0, p=0.06
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.94
Intermediate levels of fudging (to account for the unlikely but not impossible chance that this new product may be worse than any one ever previously sold, or better than any of them) can easily be envisioned (give lower weight to the artificial 0.0 and 1.0 probabilities, by adding a vector priorWeights to estimateProbability's arguments).
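One way the priorWeights idea could be sketched: each candidate probability gets a weight that replaces the uniform metaprior, so the artificial 0.0 and 1.0 entries can count for less. The function name and the weight values below are assumptions for illustration, not part of the original code.

```python
def estimate_probability_weighted(prior_probs, prior_weights, buy_count, no_buy_count):
    """Posterior mean of p over a weighted, discrete set of candidate values."""
    # Likelihood of the observed counts under each candidate probability
    likelihoods = [p**buy_count * (1.0 - p)**no_buy_count for p in prior_probs]
    # Unnormalized posterior: prior weight times likelihood
    posterior = [w * lk for w, lk in zip(prior_weights, likelihoods)]
    total = sum(posterior)
    return sum(post * p for post, p in zip(posterior, prior_probs)) / total


# Pseudo-products 0.0 and 1.0 get a tenth of the weight of the real priors
probs = [0.0, 0.3, 0.7, 1.0]
weights = [0.1, 1.0, 1.0, 0.1]
for bought in range(5):
    result = estimate_probability_weighted(probs, weights, bought, 4 - bought)
    print('b=%d, p=%.2f' % (bought, result))
```

With equal weights this reduces to the unweighted computation; with the reduced weights, the b=0 and b=4 estimates land between the two sets of figures shown above.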
This kind of thing is a substantial part of what I do all day, now that I work developing applications in Business Intelligence, but I just can't get enough of it...!-)
A really simple way of doing this without any difficult math is to increase buyCount and noBuyCount artificially by adding virtual customers that either bought or didn't buy the product. You can tune how much you believe in each particular prior probability in terms of how many virtual customers you think it is worth.
In pseudocode:
def estimateProbability(priorProbs, buyCount, noBuyCount, faithInPrior=None):
    # priorProbs, buyCount and noBuyCount are parallel lists, one entry per product;
    # faithInPrior is how many virtual customers each prior is worth (default 10)
    if faithInPrior is None:
        faithInPrior = [10 for x in buyCount]
    adjustedBuyCount = [b + p*f for b, p, f in
                        zip(buyCount, priorProbs, faithInPrior)]
    adjustedNoBuyCount = [n + (1-p)*f for n, p, f in
                          zip(noBuyCount, priorProbs, faithInPrior)]
    return [b/(b+n) for b, n in zip(adjustedBuyCount, adjustedNoBuyCount)]
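For the single-product signature in the question, the same virtual-customer idea might be sketched as follows. Using the plain average of the previous products' probabilities as the prior, and 10 virtual customers as its strength, are assumptions made for the example; both would need tuning.

```python
def estimate_with_virtual_customers(prior_probabilities, buy_count, no_buy_count,
                                    prior_strength=10.0):
    """Blend observed counts with virtual customers drawn from the prior."""
    # Assumed prior: the mean buy rate seen for previous products
    prior_mean = sum(prior_probabilities) / len(prior_probabilities)
    # Split the virtual customers into buyers and non-buyers
    virtual_buys = prior_mean * prior_strength
    virtual_no_buys = (1.0 - prior_mean) * prior_strength
    return (buy_count + virtual_buys) / (
        buy_count + virtual_buys + no_buy_count + virtual_no_buys)


# With no data the estimate is the prior mean; data gradually pulls it away
print(estimate_with_virtual_customers([0.3, 0.7], 0, 0))
print(estimate_with_virtual_customers([0.2], 8, 2))
```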
Sounds like what you're trying to do is Association Rule Learning. I don't have time right now to provide you with any code, but I will point you in the direction of WEKA which is a fantastic open source data mining toolkit for Java. You should find plenty of interesting things there that will help you solve your problem.
As I see it, the best you could do is use the uniform distribution, unless you have some clue regarding the distribution. Or are you talking about establishing a relationship between this product and products previously bought by the same person, in the Amazon fashion of "people who buy this product also buy..."?