Distinguishing voiced/ unvoiced speech using zero-crossing rate

Distinguishing voiced/ unvoiced speech using zero-crossing rate - java

The zero-crossing rate is the rate of sign-changes along a signal, i.e., the rate at which the signal changes from positive to negative or back.
The zero-crossing rate Zn can be used to:
1-Distinguish voiced/unvoiced speech
2-Seperate unvoiced speech from static background noise.
It is a simple (yet effective) way to distinguish between
voiced and unvoiced speech regions:
• Voiced region: lower zero-crossing rate
• Unvoiced region: higher zero-crossing rate
and here is the code i am using:
public double evaluate(){
int numZC=0;
int size=signals.length;
for (int i=0; i<size-1; i++){
if((signals[i]>=0 && signals[i+1]<0) || (signals[i]<0 && signals[i+1]>=0)){
numZC++;
}
}
return numZC/lengthInSecond;
}
MY questions are:
1- My goal of using zero crossing is to eliminate the unvoiced part of the signal,,, and this code gives back the ZERO-CROSSING RATE. SO how will i do that?!
2- How will i know how much is a "low" zero-crossing rate and how much is a "high" zero-crossing rate???

The fundamental problem is that while you've found a way to calculate the zero crossing rate of a block of samples, you can't use that to distinguish sounds within that block because it only gives you one number that describes your entire block.
One potential solution is to divide your big block into small blocks, and then work on those. If you do that, you will soon find that your small blocks, which you made arbitrarily, don't fit into neat categories of voiced and unvoiced, and simply removing one block or setting a block's volume to zero will leave you with "choppy" sounds or even harsh clicking sounds, and won't divide the parts of speech as cleanly as you like.
This may be a worthwhile point to start with, because it's closer to your existing code, but it won't work out in the long run, unless you are just looking to do something rough (in which case, this might be good enough!).
To resolve this, you may want to consider calculating an "instantaneous zero crossing rate"1 that updates the Zr for each sample.
My goal of using zero crossing is to eliminate the unvoiced part of the signal,,, and this code gives back the ZERO-CROSSING RATE. SO how will i do that?! It's not clear what you want. What do you mean by "eliminate"? Do you want silence or do you want to skip those sections? For silence, simply replace the unwanted sections with zero. To skip, simply remove those samples. Of course, you will still end up with clicks and pops, but I assume you know how to get rid of that. If not, maybe you can read up on linear interpolation. Keep in mind that you will almost certainly have to apply some heuristics like "don't remove any sections that are smaller than n samples".
How will i know how much is a "low" zero-crossing rate and how much is a "high" zero-crossing rate??? I would guess a good threshold will be roughly around 400Hz, but speech is not my specialty. Moreover it will vary a bit by speaker and possibly by language and other factors. I suggest you make some samples and see for yourself.
1 this name is a bit misleading and you could say "there's no such thing as an instantaneous zero crossing rate". I'm not here to argue that; rather I want to use that phrase because it expresses what I mean and I hope you understand it. Suffice it to say you should do your best to update Zr as often as you can. eg. something like this:
int lastSign = 0;
int lastCrossing = 0;
float nextZeroCrossing( float newSample ) {
int thisSign = newSample > 0 ? 1 : -1 ;
if( thisSign != lastSign ) {
lastSign = thisSign;
//zero crossing has happened. Update our estimate of Zr using lastCrossing and return that
} else {
++lastCrossing;
//zero crossing has not happened. Return existing Zr
}
}
You may want to "smooth" the output of nextZeroCrossing(), as it will tend to jump around a lot. A simple exponential or moving average filter will work great.

Related

Is there a better method for randomizing functions besides Random?

I have 2 strings in an array. I want there to be a 10% chance of one and 90% chance to select the other. Right now I am using:
Random random = new Random();
int x = random.nextInt(100 - 1) + 1;
if (x < 10) {
string = stringArray(0);
} else {
string = stringArray(1);
}
Is this the best way of accomplishing this or is there a better method?

I know it's typically a bad idea to submit a stack overflow response without submitting code, but I really challenge this question of " the best way." People ask this all the time and, while there are established design patterns in software worth knowing, this question almost always can be answered by "it depends."
For example, your pattern looks fine (I might add some comments). You might get a minuscule performance increase by using 1 - 10 instead of 1 - 100, but the things you need to ask yourself are as follows :
If I get hit by a bus, is the person who is going to be working on the application going to know what I was trying to do?
If it isn't intuitive, I should write a comment. Then I should ask myself, "Can I change this code so that a comment isn't necessary?"
Is there an existing library that solves this problem? If so, is it FOSS approved (if applicable) / can I use it?
What is the size of this codebase eventually going to be? Am I making a full program with microservices, a DAO, DTO, Controller, View, and different layers for validation?
Is there an existing convention to solve my problem (either at my company or in general), or is it unique enough that I can take my own spin on it?
Does this follow the DRY principle?
I'm in (apparently) a very small camp on stack overflow that doesn't always believe in universal "bests" for solving code problems. Just remember, programming is only as hard as the problem you're trying to solve.
EDIT
Since people asked, I'd do it like this:
/*
* #author DaveCat
* #version 1.0
* #since 2019-03-9
* Convenience method that calculates 90% odds of A and 10% odds of B.
*
*/
public static String[] calculatesNinetyPercent()
{
Random random = new Random();
int x = random.nextInt(10 - 1 ) + 1
//Option A
if(x <= 9) {
return stringArray(0);
}
else
{
//Option B
return stringArray(1);
}
}
As an aside, one of the common mistakes junior devs make in enterprise level development is excessive comments.This has a javadoc, which is probably overkill, but I'm assuming this is a convenience method you're using in a greater program.
Edit (again)
You guys keep confusing me. This is how you randomly generate between 2 given numbers in Java

One alternative is to use a random float value between 0..1 and comparing it to the probability of the event. If the random value is less than the probability, then the event occurs.
In this specific example, set x to a random float and compare it to 0.1
I like this method because it can be used for probabilities other than percent integers.

n-body Simulation expected performance barnes hut

I made a 2D n-body simulation using brute force at first, but then following http://arborjs.org/docs/barnes-hut this I've implemented a Barnes-Hut approximation algorithm. However this didn't give me the effect I was looking for.
Ex:
Barnes-Hut -> 2000 Bodies; frametime avg. 32 ms and 5000; 164 ms
Brute force -> 2000 Bodies; frametime avg. 31 ms and 5000; 195 ms
These values are with rendering turned off.
Am I correct to assume that I haven't correctly implemented the algorithm and am thus not getting a substantial increase in performance?
Theta is currently set to s/d < 0.5. Changing this value to e.g. 1 does increase performance, but it's quite obvious why this isn't preferred.
Single threaded
My code along general lines:
while(!close)
{
long newTime = System.currentTimeMillis();
long frameTime = newTime-currentTime;
System.out.println(frameTime);
currentTime = newTime;
update the bodies
}
Within the function that updates the bodies:
first insert all bodies into the quadtree with all its subnodes
for all bodies
{
compute the physics using Barnes-Hut which yields a net force per planet (doPhysics(body))
calculate instantaneous acceleration from net force
update the instantaneous velocity
}
The barneshut function:
doPhysics(body)
{
if(node is external (contains 1 body) and that body is not itself)
{
calculate the force between those two bodies
}else if(node is internal and s/d < 0.5)
{
create a pseudobody at the COM with the nodes total mass
calculate the force between the body and pseudobody
}else (if is it internal but s/d >= 0.5)
{
(this is where recursion comes in)
doPhysics on same body but on the NorthEast subnode
doPhysics on same body but on the NorthWest subnode
doPhysics on same body but on the SouthEast subnode
doPhysics on same body but on the SouthWest subnode
}
}
Actually calculating the force:
calculateforce(body, otherbody)
{
if(body is not at exactly the same position (avoid division by 0))
{
calculate force using newtons law of gravitation in vector form
add the force to the bodies' net force in this frame
}
}

Your code is still incomplete (read on SSCCEs ), and in-depth debugging of incomplete code is not the purpose of the site. However, this is how I would approach the next steps of figuring what, if anything, is wrong:
time only the function that you are worried about (let us call it barnes_hutt_update()); and not the whole update loop. Compare that to the equivalent, non-B-H code, and not to the whole update loop without B-H. This would result in a much more meaningful comparison.
you seem to have hard-coded s/d 0.5 into your algorithm. Leaving it as an argument, you should be able to notice speedups when it is set higher; and very similar performance to a naive, non-B-H implementation if set to 0. Speedup in B-H comes from evaluating less nodes (because far-away nodes are lumped together); do you know how many nodes you are managing to skip? No skipped nodes, no speedup. On the other hand, skipping nodes introduces small errors in the calculation - have you quantified those?
have a look at other implementations of B-H online. D3's force layout uses it internally, and is quite readable. There are multiple existing quadtree implementations. If you have built your own, they may be sub-optimal (or even buggy). Unless you are trying to learn-by-doing, it is always better to use a tested library instead of rolling your own.
slowdown may be due to the use of quadtrees, rather than from force addition itself. It would be useful to know how long building and updating the quadtree is taking, as compared to the B-H force aproximation itself -- because quadtrees are, in this case, pure overhead. B-H needs quadtrees, but the naive, non-B-H implementation does not. For small amounts of nodes, naive will be faster (but will get slower very fast as you add more and more). How does the performance scale as you add more and more bodies?
are you creating and discarding large amounts of objects? You can make your algorithm avoid the associated overhead (yes, lots of news + garbage collection can result in significant slowdowns) by using a memory pool.

checking a value for reset value before resetting it - performance impact?

I have a variable that gets read and updated thousands of times a second. It needs to be reset regularly. But "half" the time, the value is already the reset value. Is it a good idea to check the value first (to see if it needs resetting) before resetting (a write operaion), or I should just reset it regardless? The main goal is to optimize the code for performance.
To illustrate:
Random r = new Random();
int val = Integer.MAX_VALUE;
for (int i=0; i<100000000; i++) {
if (i % 2 == 0)
val = Integer.MAX_VALUE;
else
val = r.nextInt();
if (val != Integer.MAX_VALUE) //skip check?
val = Integer.MAX_VALUE;
}
I tried to use the above program to test the 2 scenarios (by un/commenting the 2nd "if" line), but any difference is masked by the natural variance of the run duration time.
Thanks.

Don't check it.
It's more execution steps = more cycles = more time.
As an aside, you are breaking one of the basic software golden rules: "Don't optimise early". Unless you have hard evidence that this piece if code is a performance problem, you shouldn't be looking at it. (Note that doesn't mean you code without performance in mind, you still follow normal best practice, but you don't add any special code whose only purpose is "performance related")

The check has no actual performance impact. We'd be talking about a single clock cycle or something, which is usually not relevant in a Java program (as hard-core number crunching usually isn't done in Java).
Instead, base the decision on readability. Think of the maintainer who's going to change this piece of code five years on.
In the case of your example, using my rationale, I would skip the check.

Most likely the JIT will optimise the code away because it doesn't do anything.
Rather than worrying about performance, it is usually better to worry about what it
simpler to understand
cleaner to implement
In both cases, you might remove the code as it doesn't do anything useful and it could make the code faster as well.
Even if it did make the code a little slower it would be very small compared to the cost of calling r.nextInt() which is not cheap.

Random geographic coordinates (on land, avoid ocean)

Any clever ideas on how to generate random coordinates (latitude / longitude) of places on Earth? Latitude / Longitude. Precision to 5 points and avoid bodies of water.
double minLat = -90.00;
double maxLat = 90.00;
double latitude = minLat + (double)(Math.random() * ((maxLat - minLat) + 1));
double minLon = 0.00;
double maxLon = 180.00;
double longitude = minLon + (double)(Math.random() * ((maxLon - minLon) + 1));
DecimalFormat df = new DecimalFormat("#.#####");
log.info("latitude:longitude --> " + df.format(latitude) + "," + df.format(longitude));
Maybe i'm living in a dream world and the water topic is unavoidable ... but hopefully there's a nicer, cleaner and more efficient way to do this?
EDIT
Some fantastic answers/ideas -- however, at scale, let's say I need to generate 25,000 coordinates. Going to an external service provider may not be the best option due to latency, cost and a few other factors.

To deal with the body of water problem is going to be largely a data issue, e.g. do you just want to miss the oceans or do you need to also miss small streams. Either you need to use a service with the quality of data that you need, or, you need to obtain the data yourself and run it locally. From your edit, it sounds like you want to go the local data route, so I'll focus on a way to do that.
One method is to obtain a shapefile for either land areas or water areas. You can then generate a random point and determine if it intersects a land area (or alternatively, does not intersect a water area).
To get started, you might get some low resolution data here and then get higher resolution data here for when you want to get better answers on coast lines or with lakes/rivers/etc. You mentioned that you want precision in your points to 5 decimal places, which is a little over 1m. Do be aware that if you get data to match that precision, you will have one giant data set. And, if you want really good data, be prepared to pay for it.
Once you have your shape data, you need some tools to help you determine the intersection of your random points. Geotools is a great place to start and probably will work for your needs. You will also end up looking at opengis code (docs under geotools site - not sure if they consumed them or what) and JTS for the geometry handling. Using this you can quickly open the shapefile and start doing some intersection queries.
File f = new File ( "world.shp" );
ShapefileDataStore dataStore = new ShapefileDataStore ( f.toURI ().toURL () );
FeatureSource<SimpleFeatureType, SimpleFeature> featureSource =
dataStore.getFeatureSource ();
String geomAttrName = featureSource.getSchema ()
.getGeometryDescriptor ().getLocalName ();
ResourceInfo resourceInfo = featureSource.getInfo ();
CoordinateReferenceSystem crs = resourceInfo.getCRS ();
Hints hints = GeoTools.getDefaultHints ();
hints.put ( Hints.JTS_SRID, 4326 );
hints.put ( Hints.CRS, crs );
FilterFactory2 ff = CommonFactoryFinder.getFilterFactory2 ( hints );
GeometryFactory gf = JTSFactoryFinder.getGeometryFactory ( hints );
Coordinate land = new Coordinate ( -122.0087, 47.54650 );
Point pointLand = gf.createPoint ( land );
Coordinate water = new Coordinate ( 0, 0 );
Point pointWater = gf.createPoint ( water );
Intersects filter = ff.intersects ( ff.property ( geomAttrName ),
ff.literal ( pointLand ) );
FeatureCollection<SimpleFeatureType, SimpleFeature> features = featureSource
.getFeatures ( filter );
filter = ff.intersects ( ff.property ( geomAttrName ),
ff.literal ( pointWater ) );
features = featureSource.getFeatures ( filter );
Quick explanations:
This assumes the shapefile you got is polygon data. Intersection on lines or points isn't going to give you what you want.
First section opens the shapefile - nothing interesting
you have to fetch the geometry property name for the given file
coordinate system stuff - you specified lat/long in your post but GIS can be quite a bit more complicated. In general, the data I pointed you at is geographic, wgs84, and, that is what I setup here. However, if this is not the case for you then you need to be sure you are dealing with your data in the correct coordinate system. If that all sounds like gibberish, google around for a tutorial on GIS/coordinate systems/datum/ellipsoid.
generating the coordinate geometries and the filters are pretty self-explanatory. The resulting set of features will either be empty, meaning the coordinate is in the water if your data is land cover, or not empty, meaning the opposite.
Note: if you do this with a really random set of points, you are going to hit water pretty often and it could take you a while to get to 25k points. You may want to try to scope your point generation better than truly random (like remove big chunks of the Atlantic/Pacific/Indian oceans).
Also, you may find that your intersection queries are too slow. If so, you may want to look into creating a quadtree index (qix) with a tool like GDAL. I don't recall which index types are supported by geotools, though.

This has being asked a long time ago and I now have the similar need. There are two possibilities I am looking into:
1. Define the surface ranges for the random generator.
Here it's important to identify the level of precision you are going for. The easiest way would be to have a very relaxed and approximate approach. In this case you can divide the world map into "boxes":
Each box has it's own range of lat lon. Then you first randomise to get a random box, then you randomise to get a random lat and random long within the boundaries of that box.
Precisions is of course not the best at all here... Though it depends:) If you do your homework well and define a lot of boxes covering most complex surface shapes - you might be quite ok with the precision.
2. List item
Some API to return continent name from coordinates OR address OR country OR district = something that WATER doesn't have. Google Maps API's can help here. I didn't research this one deeper, but I think it's possible, though you will have to run the check on each generated pair of coordinates and rerun IF it's wrong. So you can get a bit stuck if random generator keeps throwing you in the ocean.
Also - some water does belong to countries, districts...so yeah, not very precise.
For my needs - I am going with "boxes" because I also want to control exact areas from which the random coordinates are taken and don't mind if it lands on a lake or river, just not open ocean:)

Download a truckload of KML files containing land-only locations.
Extract all coordinates from them this might help here.
Pick them at random.

Definitely you should have a map as a resource. You can take it here: http://www.naturalearthdata.com/
Then I would prepare 1bit black and white bitmap resource with 1s marking land and 0x marking water.
The size of bitmap depends on your required precision. If you need 5 degrees then your bitmap will be 360/5 x 180/5 = 72x36 pixels = 2592 bits.
Then I would load this bitmap in Java, generate random integer withing range above, read bit, and regenerate if it was zero.
P.S. Also you can dig here http://geotools.org/ for some ready made solutions.

To get a nice even distribution over latitudes and longitudes you should do something like this to get the right angles:
double longitude = Math.random() * Math.PI * 2;
double latitude = Math.acos(Math.random() * 2 - 1);
As for avoiding bodies of water, do you have the data for where water is already? Well, just resample until you get a hit! If you don't have this data already then it seems some other people have some better suggestions than I would for that...
Hope this helps, cheers.

There is another way to approach this using the Google Earth Api. I know it is javascript, but I thought it was a novel way to solve the problem.
Anyhow, I have put together a full working solution here - notice it works for rivers too: http://www.msa.mmu.ac.uk/~fraser/ge/coord/
The basic idea I have used is implement the hiTest method of the GEView object in the Google Earth Api.
Take a look at the following example of the hitest from Google.
http://earth-api-samples.googlecode.com/svn/trunk/examples/hittest.html
The hitTest method is supplied a random point on the screen in (pixel coordinates) for which it returns a GEHitTestResult object that contains information about the geographic location corresponding to the point. If one uses the GEPlugin.HIT_TEST_TERRAIN mode with the method one can limit results only to land (terrain) as long as we screen the results to points with an altitude > 1m
This is the function I use that implements the hitTest:
var hitTestTerrain = function()
{
var x = getRandomInt(0, 200); // same pixel size as the map3d div height
var y = getRandomInt(0, 200); // ditto for width
var result = ge.getView().hitTest(x, ge.UNITS_PIXELS, y, ge.UNITS_PIXELS, ge.HIT_TEST_TERRAIN);
var success = result && (result.getAltitude() > 1);
return { success: success, result: result };
};
Obviously you also want to have random results from anywhere on the globe (not just random points visible from a single viewpoint). To do this I move the earth view after each successful hitTestTerrain call. This is achieved using a small helper function.
var flyTo = function(lat, lng, rng)
{
lookAt.setLatitude(lat);
lookAt.setLongitude(lng);
lookAt.setRange(rng);
ge.getView().setAbstractView(lookAt);
};
Finally here is a stripped down version of the main code block that calls these two methods.
var getRandomLandCoordinates = function()
{
var test = hitTestTerrain();
if (test.success)
{
coords[coords.length] = { lat: test.result.getLatitude(), lng: test.result.getLongitude() };
}
if (coords.length <= number)
{
getRandomLandCoordinates();
}
else
{
displayResults();
}
};
So, the earth moves randomly to a postition
The other functions in there are just helpers to generate the random x,y and random lat,lng numbers, to output the results and also to toggle the controls etc.
I have tested the code quite a bit and the results are not 100% perfect, tweaking the altitude to something higher, like 50m solves this but obviously it is diminishing the area of possible selected coordinates.
Obviously you could adapt the idea to suit you needs. Maybe running the code multiple times to populate a database or something.

As a plan B, maybe you can pick a random country and then pick a random coordinate inside of this country. To be fair when picking a country, you can use its area as weight.

There is a library here and you can use its .random() method to get a random coordinate. Then you can use GeoNames WebServices to determine whether it is on land or not. They have a list of webservices and you'll just have to use the right one. GeoNames is free and reliable.

Go there http://wiki.openstreetmap.org/
Try to use API: http://wiki.openstreetmap.org/wiki/Databases_and_data_access_APIs

I guess you could use a world map, define a few points on it to delimit most of water bodies as you say and use a polygon.contains method to validate the coordinates.
A faster algorithm would be to use this map, take some random point and check the color beneath, if it's blue, then water... when you have the coordinates, you convert them to lat/long.

You might also do the blue green thing , and then store all the green points for later look up. This has the benifit of being "step wise" refinable. As you figure out a better way to make your list of points you can just point your random graber at a more and more acurate group of points.
Maybe a service provider has an answer to your question already: e.g. https://www.google.com/enterprise/marketplace/viewListing?productListingId=3030+17310026046429031496&pli=1
Elevation api? http://code.google.com/apis/maps/documentation/elevation/ above sea level or below? (no dutch points for you!)

Generating is easy, the Problem is that they should not be on water. I would import the "Open Streetmap" for example here http://ftp.ecki-netz.de/osm/ and import it to an Database (verry easy data Structure). I would suggest PostgreSQL, it comes with some geo functions http://www.postgresql.org/docs/8.2/static/functions-geometry.html . For that you have to save the points in a "polygon"-column, then you can check with the "&&" operator if it is in an Water polygon. For the attributes of an OpenStreetmap Way-Entry you should have a look at http://wiki.openstreetmap.org/wiki/Category:En:Keys

Supplementary to what bsimic said about digging into GeoNames' Webservices, here is a shortcut:
they have a dedicated WebService for requesting an ocean name.
(I am aware the of OP's constraint to not using public web services due to the amount of requests. Nevertheless I stumbled upon this with the same basic question and consider this helpful.)
Go to http://www.geonames.org/export/web-services.html#astergdem and have a look at "Ocean / reverse geocoding". It is available as XML and JSON. Create a free user account to prevent daily limits on the demo account.
Request example on ocean area (Baltic Sea, JSON-URL):
http://api.geonames.org/oceanJSON?lat=54.049889&lng=10.851388&username=demo
results in
{
"ocean": {
"distance": "0",
"name": "Baltic Sea"
}
}
while some coordinates on land result in
{
"status": {
"message": "we are afraid we could not find an ocean for latitude and longitude :53.0,9.0",
"value": 15
}
}

Do the random points have to be uniformly distributed all over the world? If you could settle for a seemingly uniform distribution, you can do this:
Open your favorite map service, draw a rectangle inside the United States, Russia, China, Western Europe and definitely the northern part of Africa - making sure there are no big lakes or Caspian seas inside the rectangles. Take the corner coordinates of each rectangle, and then select coordinates at random inside those rectangles.
You are guaranteed non of these points will be on any sea or lake. You might find an occasional river, but I'm not sure how many geoservices are going to be accurate enough for that anyway.

This is an extremely interesting question, from both a theoretical and practical perspective. The most suitable solution will largely depend on your exact requirements. Do you need to account for every body of water, or just the major seas and oceans? How critical are accuracy and correctness; Will identifying sea as land or vice-versa be a catastrophic failure?
I think machine learning techniques would be an excellent solution to this problem, provided that you don't mind the (hopefully small) probability that a point of water is incorrectly classified as land. If that's not an issue, then this approach should have a number of advantages against other techniques.
Using a bitmap is a nice solution, simple and elegant. It can be produced to a specified accuracy and the classification is guaranteed to be correct (Or a least as correct as you made the bitmap). But its practicality is dependent on how accurate you need the solution to be. You mention that you want the coordinate accuracy to 5 decimal places (which would be equivalent to mapping the whole surface of the planet to about the nearest metre). Using 1 bit per element, the bitmap would weigh in at ~73.6 terabytes!
We don't need to store all of this data though; We only need to know where the coastlines are. Just by knowing where a point is in relation to the coast, we can determine whether it is on land or sea. As a rough estimate, the CIA world factbook reports that there are 22498km of coastline on Earth. If we were to store coordiates for every metre of coastline, using a 32 bit word for each latitude and longitude, this would take less than 1.35GB to store. It's still a lot if this is for a trivial application, but a few orders of magnitude less than using a bitmap. If having such a high degree of accuracy isn't neccessary though, these numbers would drop considerably. Reducing the mapping to only the nearest kilometre would make the bitmap just ~75GB and the coordinates for the world's coastline could fit on a floppy disk.
What I propose is to use a clustering algorithm to decide whether a point is on land or not. We would first need a suitably large number of coordinates that we already know to be on either land or sea. Existing GIS databases would be suitable for this. Then we can analyse the points to determine clusters of land and sea. The decision boundary between the clusters should fall on the coastlines, and all points not determining the decision boundary can be removed. This process can be iterated to give a progressively more accurate boundary.
Only the points determining the decision boundary/the coastline need to be stored, and by using a simple distance metric we can quickly and easily decide if a set of coordinates are on land or sea. A large amount of resources would be required to train the system, but once complete the classifier would require very little space or time.

Assuming Atlantis isn't in the database, you could randomly select cities. This also provides a more realistic distribution of points if you intend to mimic human activity:
https://simplemaps.com/data/world-cities
There's only 7,300 cities in the free version.

Estimating a probability given other probabilities from a prior

I have a bunch of data coming in (calls to an automated callcenter) about whether or not a person buys a particular product, 1 for buy, 0 for not buy.
I want to use this data to create an estimated probability that a person will buy a particular product, but the problem is that I may need to do it with relatively little historical data about how many people bought/didn't buy that product.
A friend recommended that with Bayesian probability you can "help" your probability estimate by coming up with a "prior probability distribution", essentially this is information about what you expect to see, prior to taking into account the actual data.
So what I'd like to do is create a method that has something like this signature (Java):
double estimateProbability(double[] priorProbabilities, int buyCount, int noBuyCount);
priorProbabilities is an array of probabilities I've seen for previous products, which this method would use to create a prior distribution for this probability. buyCount and noBuyCount are the actual data specific to this product, from which I want to estimate the probability of the user buying, given the data and the prior. This is returned from the method as a double.
I don't need a mathematically perfect solution, just something that will do better than a uniform or flat prior (ie. probability = buyCount / (buyCount+noBuyCount)). Since I'm far more familiar with source code than mathematical notation, I'd appreciate it if people could use code in their explanation.

Here's the Bayesian computation and one example/test:
def estimateProbability(priorProbs, buyCount, noBuyCount):
# first, estimate the prob that the actual buy/nobuy counts would be observed
# given each of the priors (times a constant that's the same in each case and
# not worth the effort of computing;-)`
condProbs = [p**buyCount * (1.0-p)**noBuyCount for p in priorProbs]
# the normalization factor for the above-mentioned neglected constant
# can most easily be computed just once
normalize = 1.0 / sum(condProbs)
# so here's the probability for each of the prior (starting from a uniform
# metaprior)
priorMeta = [normalize * cp for cp in condProbs]
# so the result is the sum of prior probs weighed by prior metaprobs
return sum(pm * pp for pm, pp in zip(priorMeta, priorProbs))
def example(numProspects=4):
# the a priori prob of buying was either 0.3 or 0.7, how does it change
# depending on how 4 prospects bought or didn't?
for bought in range(0, numProspects+1):
result = estimateProbability([0.3, 0.7], bought, numProspects-bought)
print 'b=%d, p=%.2f' % (bought, result)
example()
output is:
b=0, p=0.31
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.69
which agrees with my by-hand computation for this simple case. Note that the probability of buying, by definition, will always be between the lowest and the highest among the set of priori probabilities; if that's not what you want you might want to introduce a little fudge by introducing two "pseudo-products", one that nobody will ever buy (p=0.0), one that anybody will always buy (p=1.0) -- this gives more weight to actual observations, scarce as they may be, and less to statistics about past products. If we do that here, we get:
b=0, p=0.06
b=1, p=0.36
b=2, p=0.50
b=3, p=0.64
b=4, p=0.94
Intermediate levels of fudging (to account for the unlikely but not impossible chance that this new product may be worse than any one ever previously sold, or better than any of them) can easily be envisioned (give lower weight to the artificial 0.0 and 1.0 probabilities, by adding a vector priorWeights to estimateProbability's arguments).
This kind of thing is a substantial part of what I do all day, now that I work developing applications in Business Intelligence, but I just can't get enough of it...!-)

A really simple way of doing this without any difficult math is to increase buyCount and noBuyCount artificially by adding virtual customers that either bought or didn't buy the product. You can tune how much you believe in each particular prior probability in terms of how many virtual customers you think it is worth.
In pseudocode:
def estimateProbability(priorProbs, buyCount, noBuyCount, faithInPrior=None):
if faithInPrior is None: faithInPrior = [10 for x in buyCount]
adjustedBuyCount = [b + p*f for b,p,f in
zip(buyCount, priorProbs, faithInPrior]
adjustedNoBuyCount = [n + (1-p)*f for n,p,f in
zip(noBuyCount, priorProbs, faithInPrior]
return [b/(b+n) for b,n in zip(adjustedBuyCount, adjustedNoBuyCount]

Sounds like what you're trying to do is Association Rule Learning. I don't have time right now to provide you with any code, but I will point you in the direction of WEKA which is a fantastic open source data mining toolkit for Java. You should find plenty of interesting things there that will help you solve your problem.

As I see it, the best you could do is use the uniform distribution, unless you have some clue regarding the distribution. Or are you talking about making a relationship between this products and products previously bought by the same person in the Amazon Fashion "people who buy this product also buy..." ??

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.