I have a number of ranges, each with a weight. Every point on the total range is scored by the sum of the weights of all the ranges the point falls into. I'd like to be able to cheaply find the score of a given point, and to find the point with the maximum score. Ideally, it would also be able to find the maximum over a set of equidistantly spaced points.
Unfortunately, I'm heavily limited by performance, and am struggling to find a good algorithm for this.
The only two decent solutions I could find are:
- Bruteforce it by sampling a bunch of points. For each: check every range whether it fits, find the total value, then check if it's better than the best so far. Decent point samples can be found by taking the boundaries of the ranges.
- Create a set of buckets. Iterate through all the ranges, adding a value to all the buckets that fit within the range. Then iterate through all the buckets to find the best one.
Neither is fast enough for my liking (both have been tested), and the latter isn't continuous, so it has accuracy problems.
I'd be okay with getting a slightly inaccurate response as long as the performance is way better.
What adds a bit of extra complexity to my particular case is that I'm actually dealing with angles, so the environment is modular. The ranges can't be ordered, and I need to ensure that a range going from 340 degrees to 20 degrees contains both a point at 350 and at 10 degrees.
The angle ranges I'm dealing with can't exceed 180 degrees, and only very rarely are above 90.
The number of ranges generally isn't very high (1-30), but I need to do this calculation a lot.
The language is Java if it matters.
Make a list (array) of angle intervals. If an interval's finish value is less than its start value (20 < 340), add 360 to the finish: (340, 20) becomes (340, 380).
Make a list of pairs (angle, +weight for a start point or -weight for a finish point).
Concatenate the list with a copy of itself (with 360 added to each angle in the copy) to handle intersections across the wrap-around. (It is possible to copy only part of the list.)
Sort the pairs by angle (use the sign as a secondary key in case of a tie: - before +).
Set CurrWeight = 0.
Walk through the list, adding the +/-weight field to CurrWeight. Check for the max value.
(This approach works for linear lists. I tried to modify it for circular ones, but I may have missed some caveats.)
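The sweep above can be sketched in Java roughly as follows. This is only an illustration under some assumptions that are not in the original answer: ranges are passed as {start, end, weight} triples in degrees, each range is treated as half-open [start, end), and the names CircularSweep, Best and maxWeight are made up for the example. Instead of duplicating the list, it starts the running weight with every range that wraps through 0 degrees, which has the same effect on a single turn of the circle. All events at the same angle are applied before the maximum is checked, which matches the "- before +" tie rule for positive weights.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class CircularSweep {
    /** Best total weight and one angle (in degrees) at which it occurs. */
    static final class Best {
        final double weight;
        final double angle;
        Best(double weight, double angle) { this.weight = weight; this.angle = angle; }
    }

    /** Maps any angle in degrees into [0, 360). */
    private static double normalize(double degrees) {
        return ((degrees % 360.0) + 360.0) % 360.0;
    }

    /** Each range is {start, end, weight}; a range with start > end wraps through 0. */
    static Best maxWeight(double[][] ranges) {
        List<double[]> events = new ArrayList<>();
        events.add(new double[] {0.0, 0.0}); // sentinel so the segment starting at 0 is checked
        double current = 0.0;
        for (double[] range : ranges) {
            double start = normalize(range[0]);
            double end = normalize(range[1]);
            double weight = range[2];
            if (start > end) {
                current += weight; // the range is already open when the sweep starts at 0
            }
            events.add(new double[] {start, weight});
            events.add(new double[] {end, -weight});
        }
        events.sort(Comparator.comparingDouble(e -> e[0]));

        double best = Double.NEGATIVE_INFINITY;
        double bestAngle = 0.0;
        int i = 0;
        while (i < events.size()) {
            double angle = events.get(i)[0];
            // Apply every event at this angle before looking at the running total.
            while (i < events.size() && events.get(i)[0] == angle) {
                current += events.get(i)[1];
                i++;
            }
            if (current > best) {
                best = current;
                bestAngle = angle;
            }
        }
        return new Best(best, bestAngle);
    }
}
```

For the ranges {340, 20, 1}, {350, 30, 2} and {10, 40, 1.5}, maxWeight reports a weight of 4.5 at 10 degrees, where all three ranges overlap. The cost is dominated by the sort, so it is O(n log n) per query for n ranges.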
Here, instead of the term 'edges', I should have used the term 'boundaries', because it refers to interval boundaries.
import java.util.ArrayList;
import java.util.Iterator;
import java.util.SortedSet;
import java.util.TreeSet;
public class Main {
ArrayList<Interval> intervals;
public static void main(String args[]) {
Main main = new Main();
main.intervals = new ArrayList<Interval>();
Interval i1 = new Interval(10, 30, 1);
Interval i2= new Interval(20, 40, 1);
Interval i3= new Interval(50, 60, 1);
Interval i4= new Interval(0, 70, 1);
main.intervals.add(i1);
main.intervals.add(i2);
main.intervals.add(i3);
main.intervals.add(i4);
Interval winningInterval = main.processIntervals(main.intervals);
System.out.println("winning interval="+winningInterval);
}
public Interval processIntervals(ArrayList<Interval> intervals)
{
SortedSet<Integer> intervalEdges = new TreeSet<Integer>();
for(int i = 0;i<intervals.size();i++)
{
Interval currentInterval = intervals.get(i);
intervalEdges.add(currentInterval.a);
intervalEdges.add(currentInterval.b);
}
System.out.println(intervalEdges);
//edges stores the same data as intervalEdges, but for convenience, it is a list
ArrayList<Integer> edges = new ArrayList<Integer>(intervalEdges);
ArrayList<Interval> intersectionIntervals = new ArrayList<Interval>();
for(int i=0; i<edges.size()-1;i++)
{
Interval newInterval = new Interval(edges.get(i), edges.get(i+1), 0);
int score = 0; //the sum of the values of the overlapping intervals
for(int j=0; j<intervals.size();j++)
{
if(newInterval.isIncludedInInterval(intervals.get(j)))
score = score+ intervals.get(j).val;
}
newInterval.val = score;
intersectionIntervals.add(newInterval);
}
System.out.println(intersectionIntervals);
int maxValue=0; //the maximum value of an interval
Interval x = new Interval(-1,-1,0);//that interval with the maximum value
for(int i=0; i<intersectionIntervals.size();i++)
{
if(intersectionIntervals.get(i).val > maxValue)
{
maxValue=intersectionIntervals.get(i).val;
x=intersectionIntervals.get(i);
}
}
return x;
}
}
class Interval
{
public int a, b, val;
public Interval(int a, int b, int val) {
super();
this.a = a;
this.b = b;
this.val = val;
}
@Override
public String toString() {
return "Interval [a=" + a + ", b=" + b + ", val=" + val + "]";
}
boolean isIncludedInInterval(Interval y)
{
//returns true if current interval is included in interval y
return this.a>=y.a && this.b<= y.b;
}
}
gives the output
[0, 10, 20, 30, 40, 50, 60, 70]
[Interval [a=0, b=10, val=1], Interval [a=10, b=20, val=2], Interval [a=20, b=30, val=3], Interval [a=30, b=40, val=2], Interval [a=40, b=50, val=1], Interval [a=50, b=60, val=2], Interval [a=60, b=70, val=1]]
winning interval=Interval [a=20, b=30, val=3]
This solves the case when the intervals are straight line intervals, and not angular intervals. I will come back with modifications to take into account the fact that x=x+360.
I have a List<BigDecimal> collection which contains (for the sake of simplicity) BigDecimal prices. I would like to process the collection and get:
All of the highest prices.
All of the lowest prices.
My initial thought is to use look-behind to decide whether the numbers are in an up or down trend. When the trend changes, determine which of the previous numbers are the "highest" or "lowest" prices and add them to the respective List<BigDecimal> lowestPrices and List<BigDecimal> highestPrices collections. For example, the first 3 dots are in an up-trend, but the 4th changes the trend to a down-trend. So I can now determine the min/max of the numbers before the change (0,1,2) and get the prices.
I am not sure whether this is a naive approach, so I was wondering what the best approach to solving this in Java would be.
Maybe a library that can already do this? (probably better not to re-invent the wheel)
You are looking for local maxima (/minima).
Just look at whether the current point is greater (/less) than the point preceding and following it:
For a local maximum:
list.get(i) > list.get(i - 1) && list.get(i) > list.get(i + 1)
For a local minimum:
list.get(i) < list.get(i - 1) && list.get(i) < list.get(i + 1)
Pseudocode:
for (int i = 1; i < list.size()-1; ++i) {
if (local maximum) {
// Add to list of local maxima
} else if (local minimum) {
// Add to list of local minima
}
}
and handle the two endpoints as you desire.
(You can also do this in ways that are more efficient for non-random access lists, e.g. LinkedList, using (List)Iterators; but the principle is the same).
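As a rough illustration of the loop above applied to BigDecimal prices, here is a minimal sketch. The class name LocalExtrema and the sign parameter are assumptions made for the example; the two endpoints are simply skipped, and plateaus (equal neighbours) are not reported because the comparisons are strict.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class LocalExtrema {
    /**
     * Returns strict local maxima (sign = 1) or strict local minima (sign = -1)
     * among the interior points of the list.
     */
    static List<BigDecimal> find(List<BigDecimal> prices, int sign) {
        List<BigDecimal> result = new ArrayList<>();
        for (int i = 1; i < prices.size() - 1; i++) {
            BigDecimal current = prices.get(i);
            // compareTo is used instead of equals so that scale differences are ignored.
            int versusPrevious = Integer.signum(current.compareTo(prices.get(i - 1)));
            int versusNext = Integer.signum(current.compareTo(prices.get(i + 1)));
            if (versusPrevious == sign && versusNext == sign) {
                result.add(current);
            }
        }
        return result;
    }
}
```

For the prices 10.99, 15.99, 19.99, 12.99, 24.99, 21.99, calling find(prices, 1) gives [19.99, 24.99] and find(prices, -1) gives [12.99].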
I decided to try implementing this, although I'm sure my implementation could be improved. The idea is just as you say, to keep track of the trend and record a local minimum or local maximum whenever the trend changes. There are two additional details to consider: first, initially we are not trending up or down, but the first value is either a minimum or maximum, so we have a third possibility for the trend, in addition to increasing or decreasing: inchoate; second, after the end of the loop we have to add the last item as either a minimum or maximum, depending on the direction the trend was going when we finished. Note that it will never add null if the list of prices is empty, because in that case, the trend would never have changed from inchoate.
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Map;
import java.util.List;
public class Partition {
public static void main(String[] args) {
List<String> values = List.of("10.99", "15.99", "19.99", "12.99", "24.99",
"21.99", "17.99", "11.99", "22.99", "29.99", "35.99", "27.99", "20.99");
List<BigDecimal> prices = values.stream().map(BigDecimal::new).toList();
Map<Extrema, List<BigDecimal>> part = new Partition().partitionExtrema(prices);
System.out.format("Minima: %s%n", part.get(Extrema.MINIMA));
System.out.format("Maxima: %s%n", part.get(Extrema.MAXIMA));
}
public Map<Extrema, List<BigDecimal>> partitionExtrema(List<BigDecimal> prices) {
Trend trend = Trend.INCHOATE; // initially we don't know if we're going up or down
List<BigDecimal> maxima = new ArrayList<>();
List<BigDecimal> minima = new ArrayList<>();
BigDecimal previous = null;
for (BigDecimal current : prices) {
int direction = previous == null ? 0 : current.compareTo(previous);
if (direction > 0) {
if (trend != Trend.INCREASING) {
minima.add(previous); // trend turns upward (or this is the first move): previous is a minimum
}
trend = Trend.INCREASING;
}
if (direction < 0) {
if (trend != Trend.DECREASING) {
maxima.add(previous); // trend turns downward (or this is the first move): previous is a maximum
}
trend = Trend.DECREASING;
}
previous = current;
}
// the last price closes whichever trend was in progress
if (trend == Trend.INCREASING) {
maxima.add(previous);
} else if (trend == Trend.DECREASING) {
minima.add(previous);
}
return Map.of(Extrema.MINIMA, minima, Extrema.MAXIMA, maxima);
}
}
public enum Trend {
INCREASING,
DECREASING,
INCHOATE
}
public enum Extrema {
MAXIMA,
MINIMA
}
I have developed an algorithm to solve the 2 sum problem using a hash table although its performance is dreadful for huge inputs.
My goal is to find all distinct numbers x, y where -10000 <= x+y <= 10000. By the way, is the performance of my code O(n*m), where n is the size of the input and m is the number of keys in the map?
Here is my code:
import com.google.common.base.Stopwatch;
import java.util.Scanner;
import java.util.HashMap;
import java.util.ArrayList;
import static com.google.common.collect.Lists.newArrayList;
public class TwoSum {
private HashMap<Long, Long> map;
private ArrayList<Long> Ts;
private long result = 0L;
public TwoSum() {
Ts = newArrayList();
for(long i = -10000; i < 10001; i++){
Ts.add(i);
}
Scanner scan = new Scanner(System.in);
map = new HashMap<>();
while (scan.hasNextLong()) {
long a = scan.nextLong();
if (!map.containsKey(a)) {
map.put(a, a);
}
}
}
private long count(){
//long c = 0L;
for (Long T : Ts) {
long t = T;
for (Long x : map.values()) {
long y = t - x;
if (map.containsValue(y) && y != x) {
result++;
}
//System.out.println(c++);
}
}
return result / 2;
}
public static void main(String [] args) {
TwoSum s = new TwoSum();
Stopwatch stopwatch = Stopwatch.createStarted();
System.out.println(s.count());
stopwatch.stop();
System.out.println("time:" + stopwatch);
}
}
sample input:
-7590801
-3823598
-5316263
-2616332
-7575597
-621530
-7469475
1084712
-7780489
-5425286
3971489
-57444
1371995
-5401074
2383653
1752912
7455615
3060706
613097
-1073084
7759843
7267574
-7483155
-2935176
-5128057
-7881398
-637647
-2607636
-3214997
-8253218
2980789
168608
3759759
-5639246
555129
-4489068
44019
2275782
-3506307
-8031288
-213609
-4524262
-1502015
-1040324
3258235
32686
1047621
-3376656
7601567
-7051390
6633993
-6245148
4994051
-4259178
856589
6047000
1785511
4449514
-1177519
4972172
8274315
7725694
-4923179
5076288
-876369
-7663790
1613721
4472116
-4587501
3194726
6195357
-3364248
-113737
6260410
1974241
3174620
3510171
7289166
4532581
-6650736
-3782721
7007010
6007081
-7661180
-1372125
-5967818
516909
-7625800
-2700089
-7676790
-2991247
2283308
1614251
-4619234
2741749
567264
4190927
5307122
-5810503
-6665772
output: 6
The gist of your algorithm can be rewritten in pseudocode as:
for all integers t from -10k to 10k,
for all map keys x,
if t - x in map, and t is not 2*x,
count ++
return count / 2
You can easily improve this a bit:
for all integers t from -10k to 10k,
for the lower half of keys x in ascending order such that t is not 2*x
if t - x in map,
count ++
This makes it go twice as fast (you no longer double-count). However, you need to sort your inputs to ensure map keys in ascending order. You can add them into a TreeSet and then move it into a LinkedHashSet. Using Sets is better than Maps if you do not care about the values, and all the information is in the keys.
Running time is still O(inputs * range), since you have two nested loops, one with range iterations and the other with half your input. This is a fundamental shortcoming of the algorithm, and no amount of optimization will fix it.
The question is an assignment from Algorithms: Design and Analysis, an online course offered by Stanford University and taught by Prof. Tim Roughgarden. I happen to be taking the same course.
The usual solution of looking up t - i in a hash table is O(n) for a single t, but repeating it for all 20001 values of t over an input of 1000000 numbers results in roughly 20 billion lookups!
A better solution is to create a sorted set xs from the input file, and ∀i ∈ xs, find all numbers from xs in the range [-10000 - i, 10000 - i]. Since a sorted set, by definition, doesn't have duplicates, we don't need to worry about any number in the range being equal to i. There's one gotcha though, which is really unclear in the problem statement. It is not sufficient to find unique pairs (x, y) ∀ x, y ∈ xs; their sums must be unique too. Obviously, 2 unique pairs may produce equal sums (e.g. 2 + 4 = 1 + 5 = 6). Thus, we need to keep track of the sums too.
Lastly, we can stop once we go past 5000, since no two numbers above 5000 can add up to 10000 or less.
Here's a Scala solution:
def twoSumCount(xs: SortedSet[Long]): Int = {
xs
.foldLeft(collection.mutable.Set.empty[Long]) { (sums, i) =>
if (i < TenThou / 2) {
xs
// using from makes it slower
.range(-TenThou - i, TenThou - i + 1)
.map(_ + i)
// using diff makes it slower
.withFilter(y => !sums.contains(y))
// adding individual elements is faster than using
// diff/filter/filterNot and adding all using ++=
.foreach(sums.add)
}
sums
}
.size
}
Benchmark:
cores: 8
hostname: ***
name: OpenJDK 64-Bit Server VM
osArch: x86_64
osName: Mac OS X
vendor: Azul Systems, Inc.
version: 11.0.1+13-LTS
Parameters(file -> 2sum): 116.069441 ms
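Since the question is in Java, the same range-lookup idea can be sketched with a TreeSet. This is an illustrative sketch, not a port of the Scala code: the class and method names are invented, the limit is passed in as a parameter, and y == x is skipped explicitly because the question asks for distinct x and y. It does not include the early stop at half the limit.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class TwoSumRange {
    /** Counts distinct sums x + y with x != y, both in the input, and |x + y| <= limit. */
    static int countDistinctSums(long[] input, long limit) {
        TreeSet<Long> xs = new TreeSet<>();
        for (long value : input) {
            xs.add(value); // sorting and de-duplicating in one step
        }
        Set<Long> sums = new HashSet<>();
        for (long x : xs) {
            // Only partners y in [-limit - x, limit - x] can bring the sum into range.
            for (long y : xs.subSet(-limit - x, true, limit - x, true)) {
                if (y != x) {
                    sums.add(x + y);
                }
            }
        }
        return sums.size();
    }
}
```

For the input {-3, 1, 2, 5, 100} with a limit of 4, the sums in range are -2, -1, 2 and 3, so the method returns 4. Each subSet call is a logarithmic lookup plus the cost of walking the matches, so the work depends on how many partners fall inside the window rather than on the width of the target range.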
I've built a model of the solar system in Java. To determine the position of a planet it performs a lot of computations, which give a very exact value. However, I am often satisfied with an approximate position if that makes it faster. Because I'm using it in a simulation, speed is important: the position of the planet will be requested millions of times.
Currently I try to cache the position of a planet throughout its orbit and then use those coordinates over and over. If a position in between two values is requested I perform a linear interpolation. This is how I store values:
for(int t=0; t<tp; t++) {
listCoordinates[t]=super.coordinates(ti+t);
}
interpolator = new PlanetOrbit(listCoordinates,tp);
PlanetOrbit has the interpolation code:
package cometsim;
import org.apache.commons.math3.util.FastMath;
public class PlanetOrbit {
final double[][] coordinates;
double tp;
public PlanetOrbit(double[][] coordinates, double tp) {
this.coordinates = coordinates;
this.tp = tp;
}
public double[] coordinates(double julian) {
double T = julian % FastMath.floor(tp);
if(coordinates.length == 1 || coordinates.length == 0) return coordinates[0];
if(FastMath.round(T) == T) return coordinates[(int) T];
int floor = (int) FastMath.floor(T);
if(floor>=coordinates.length) floor=coordinates.length-5;
double[] f = coordinates[floor];
double[] c = coordinates[floor+1];
double[] retval = f;
retval[0] += (T-FastMath.floor(T))*(c[0]-f[0]);
retval[1] += (T-FastMath.floor(T))*(c[1]-f[1]);
retval[2] += (T-FastMath.floor(T))*(c[2]-f[2]);
return retval;
}
}
You can think of FastMath as Math but faster. However, this code is not much of a speed improvement over calculating the exact value every time. Do you have any ideas for how to make it faster?
There are a few issues I can see; the main ones are as follows.
PlanetOrbit#coordinates actually changes the values stored in the coordinates array. As this method is only supposed to interpolate, I expect that your orbit will be corrupted slightly every time you run through it (because it is a linear interpolation, the orbit will degrade towards its centre).
You do the same thing several times; most clearly, T-FastMath.floor(T) occurs 3 separate times in the code.
Not a question of efficiency or accuracy, but the variable and method names are very opaque. Use real words for variable names.
My proposed method would be as follows
public double[] getInterpolatedCoordinates(double julian){ //julian calendar? This variable name needs to be something else, like day, or time, or whatever it actually means
int startIndex=(int)julian;
int endIndex=(startIndex+1>=coordinates.length?0:startIndex+1); //wrap around to the first sample
double nonIntegerPortion=julian-startIndex;
double[] start = coordinates[startIndex];
double[] end = coordinates[endIndex];
double[] returnPosition= new double[3];
for(int i=0;i< start.length;i++){
returnPosition[i]=start[i]*(1-nonIntegerPortion)+end[i]*nonIntegerPortion;
}
return returnPosition;
}
This avoids corrupting the coordinates array and avoids repeating the same floor several times (1-nonIntegerPortion is still done several times and could be removed if needs be but I expect profiling will show it isn't significant). However, it does create a new double[] each time which may be inefficient if you only need the array temporarily. This can be corrected using a store object (an object you used previously but no longer need, usually from the previous loop)
public double[] getInterpolatedCoordinates(double julian, double[] store){
int startIndex=(int)julian;
int endIndex=(startIndex+1>=coordinates.length?0:startIndex+1); //wrap around to the first sample
double nonIntegerPortion=julian-startIndex;
double[] start = coordinates[startIndex];
double[] end = coordinates[endIndex];
double[] returnPosition= store;
for(int i=0;i< start.length;i++){
returnPosition[i]=start[i]*(1-nonIntegerPortion)+end[i]*nonIntegerPortion;
}
return returnPosition; //store is returned
}
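The methods above assume the requested time already lies inside the cached period. A minimal self-contained sketch that also wraps the time itself might look like this. It assumes one sample per integer time step and a period equal to the number of samples; the class name OrbitCache and the method name interpolate are invented for the example.

```java
class OrbitCache {
    // samples[t] holds {x, y, z} at integer time t within one orbital period (assumed layout)
    private final double[][] samples;

    OrbitCache(double[][] samples) {
        this.samples = samples;
    }

    /** Writes the linearly interpolated position for the given time into store and returns it. */
    double[] interpolate(double time, double[] store) {
        int n = samples.length;
        double t = time % n;      // reduce to one period
        if (t < 0) {
            t += n;               // Java's % keeps the sign of the dividend
        }
        if (t >= n) {
            t = 0;                // guards against rounding up to exactly n
        }
        int start = (int) t;      // t is non-negative, so the cast acts as floor
        int end = (start + 1) % n; // the last sample interpolates towards the first
        double fraction = t - start;
        double[] a = samples[start];
        double[] b = samples[end];
        for (int i = 0; i < a.length; i++) {
            store[i] = a[i] + fraction * (b[i] - a[i]); // never writes into the cache
        }
        return store;
    }
}
```

With the samples {0, 0, 0}, {10, 20, 30} and {20, 40, 60}, a time of 0.5 gives {5, 10, 15}, and a time of 2.5 interpolates between the last and first sample to give {10, 20, 30}. The cached samples are left untouched because results are only written to store.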
This was an interview question:
Given an amount, say $167.37 find all the possible ways of generating the change for this amount using the denominations available in the currency?
Anyone who could think of a space and time efficient algorithm and supporting code, please share.
Here is the code that I wrote (it works). I am trying to find its running time; any help is appreciated.
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;
public class change_generation {
/**
* @param args
*/
public static void generatechange(float amount,LinkedList<Float> denominations,HashMap<Float,Integer> useddenominations)
{
if(amount<0)
return;
if(amount==0)
{
Iterator<Float> it = useddenominations.keySet().iterator();
while(it.hasNext())
{
Float val = it.next();
System.out.println(val +" :: "+useddenominations.get(val));
}
System.out.println("**************************************");
return;
}
for(Float denom : denominations)
{
if(amount-denom < 0)
continue;
if(useddenominations.get(denom)== null)
useddenominations.put(denom, 0);
useddenominations.put(denom, useddenominations.get(denom)+1);
generatechange(amount-denom, denominations, useddenominations);
useddenominations.put(denom, useddenominations.get(denom)-1);
}
}
public static void main(String[] args) {
// TODO Auto-generated method stub
float amount = 2.0f;
float nikle=0.5f;
float dollar=1.0f;
float ddollar=2.0f;
LinkedList<Float> denominations = new LinkedList<Float>();
denominations.add(ddollar);
denominations.add(dollar);
denominations.add(nikle);
HashMap<Float,Integer> useddenominations = new HashMap<Float,Integer>();
generatechange(amount, denominations, useddenominations);
}
}
EDIT
This is a specific example of the combination / subset problem, answered here.
Finding all possible combinations of numbers to reach a given sum
--- I am retaining my answer below (as it was useful to someone); however, admittedly, it is not a direct answer to this question ---
ORIGINAL ANSWER
The most common solution is dynamic programming:
First, you find the simplest way to make change for 1, then you use that solution to make change for 2, 3, 4, 5, 6, etc. At each iteration, you "check" whether you can go "backwards" and decrease the number of coins in your answer. For example, up to "4" you must add pennies. But once you get to "5", you can remove all the pennies, and your solution requires only one coin: the nickel. Then, until 9, you must again add pennies, and so on.
However, the dynamic programming approach is not guaranteed to be fast.
Alternatively, you can use a greedy method, where you continually pick the largest coin possible. This is extremely fast, but doesn't always give you an optimal solution. However, if your coins are 1, 5, 10 and 25, greedy works perfectly, and is much faster than the dynamic programming method.
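To make the difference between the two approaches concrete, here is a small sketch that computes the fewest coins both ways. The class and method names are made up for the example, amounts are in whole units (for instance cents), and greedy expects the coin values sorted from largest to smallest.

```java
import java.util.Arrays;

class MinCoins {
    /** Fewest coins needed for amount via dynamic programming, or -1 if impossible. */
    static int dynamic(int amount, int[] coins) {
        int[] fewest = new int[amount + 1];
        Arrays.fill(fewest, Integer.MAX_VALUE); // MAX_VALUE marks "not reachable yet"
        fewest[0] = 0;
        for (int a = 1; a <= amount; a++) {
            for (int coin : coins) {
                if (coin <= a && fewest[a - coin] != Integer.MAX_VALUE) {
                    fewest[a] = Math.min(fewest[a], fewest[a - coin] + 1);
                }
            }
        }
        return fewest[amount] == Integer.MAX_VALUE ? -1 : fewest[amount];
    }

    /** Coins used by always taking the largest coin that fits, or -1 if it gets stuck. */
    static int greedy(int amount, int[] coinsDescending) {
        int count = 0;
        for (int coin : coinsDescending) {
            count += amount / coin;
            amount %= coin;
        }
        return amount == 0 ? count : -1;
    }
}
```

With the coins 25, 10, 5, 1 and an amount of 30, both methods return 2. With the coins 4, 3, 1 and an amount of 6, dynamic returns 2 (3 + 3) while greedy returns 3 (4 + 1 + 1), which shows the case where greedy is not optimal.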
Memoization (kind of) is your friend here. A simple implementation in C:
unsigned int findRes(int n)
{
//Setup array, etc.
//Only one way to make zero... no coins.
results[0] = 1;
for(i=0; i<number_of_coins; i++)
{
for(j=coins[i]; j<=n; j++)
{
results[j] += results[j - coins[i]];
}
}
return results[n];
}
So, what we're really doing here is saying:
1) The only way to make a total of 0 is to use no coins (this is our base case)
2) If we are trying to calculate value m, then let's check each coin k. As long as k <= m, we can use that coin k in a solution
3) Well, if we can use k in a solution, then couldn't we just take the solution for (m-k) and add it to our current total?
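For readers following along in Java, the three steps above can be sketched as below. The class and method names are assumptions for the example, and amounts are whole units such as cents. Note that this counts the ways rather than listing them.

```java
class ChangeWays {
    /** Number of distinct coin combinations that add up to amount. */
    static long countWays(int amount, int[] coins) {
        long[] ways = new long[amount + 1];
        ways[0] = 1; // base case: one way to make 0, using no coins
        for (int coin : coins) {
            // Processing one coin at a time counts combinations, not orderings.
            for (int m = coin; m <= amount; m++) {
                ways[m] += ways[m - coin];
            }
        }
        return ways[amount];
    }
}
```

For example, countWays(5, new int[] {1, 2, 5}) returns 4: 5, 2+2+1, 2+1+1+1 and 1+1+1+1+1.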
I'd try to model this in real life.
If you were at the till and you knew you had to find $167.37 you would probably initially consider $200 as the "simplest" tender, being just two notes. Then, if I had it, I may consider $170, i.e. $100, $50 and $20 (three notes). See where I am going?
More formally, try to over-tender with the minimum number of notes/coins. This would be much easier to enumerate than the full set of possibilities.
Don't use floats; even the tiniest inaccuracies will destroy your algorithm.
Go from the biggest to the smallest coin/banknote. For every possible count of that coin, call the function recursively. When there are no more coins left, pay the rest in ones and print the solution. This is how it looks in pseudo-C:
#define N 14
int coinValue[N]={20000,10000,5000,2000,1000,500,200,100,50,20,10,5,2,1};
int coinCount[N];
void f(int toSpend, int i)
{
if(coinValue[i]>1)
{
for(coinCount[i]=0;coinCount[i]*coinValue[i]<=toSpend;coinCount[i]++)
{
f(toSpend-coinCount[i]*coinValue[i],i+1);
}
}
else
{
coinCount[i]=toSpend;
print(coinCount);
}
}
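A Java rendering of the same recursion, working in integer cents, might look like the sketch below. The names ChangeEnumerator, toCents and allWays are invented for the example, and it assumes at least one denomination, listed from largest to smallest. Unlike the pseudo-C, it collects the solutions in a list instead of printing them, and it accepts a smallest coin other than 1 by discarding remainders that do not divide evenly.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

class ChangeEnumerator {
    /** Converts a decimal amount such as "167.37" to whole cents without using floats. */
    static int toCents(String amount) {
        return new BigDecimal(amount).movePointRight(2).intValueExact();
    }

    /** Returns every combination as an array of counts, one count per denomination. */
    static List<int[]> allWays(int amount, int[] valuesDescending) {
        List<int[]> out = new ArrayList<>();
        recurse(amount, 0, valuesDescending, new int[valuesDescending.length], out);
        return out;
    }

    private static void recurse(int remaining, int index, int[] values, int[] counts, List<int[]> out) {
        if (index == values.length - 1) {
            // Smallest denomination: it has to cover whatever is left exactly.
            if (remaining % values[index] == 0) {
                counts[index] = remaining / values[index];
                out.add(counts.clone());
            }
            return;
        }
        // Try every possible count of the current denomination.
        for (int n = 0; n * values[index] <= remaining; n++) {
            counts[index] = n;
            recurse(remaining - n * values[index], index + 1, values, counts, out);
        }
    }
}
```

For 200 cents with the denominations 200, 100 and 50, allWays returns 4 combinations. The number of combinations grows very quickly with the amount, so listing all of them for something like $167.37 with a full set of denominations produces an enormous output.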
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;
public class change_generation {
static int jj=1;
public static void generatechange(float amount,LinkedList<Float> denominations,
HashMap<Float,Integer> useddenominations) {
if(amount<0)
return;
if(amount==0) {
Iterator<Float> it = useddenominations.keySet().iterator();
while(it.hasNext()) {
Float val = it.next();
System.out.println(val +" :: "+useddenominations.get(val));
}
System.out.println("**************************************");
return;
}
for(Float denom : denominations) {
if(amount-denom < 0)
continue;
if(useddenominations.get(denom)== null)
useddenominations.put(denom, 0);
useddenominations.put(denom, useddenominations.get(denom)+1);
generatechange(amount-denom, denominations, useddenominations);
useddenominations.put(denom, useddenominations.get(denom)-1);
}
}
public static void main(String[] args) {
float amount = 2.0f;
float nikle=0.25f;
float dollar=1.0f;
float ddollar=2.0f;
LinkedList<Float> denominations = new LinkedList<Float>();
denominations.add(ddollar);
denominations.add(dollar);
denominations.add(nikle);
HashMap<Float,Integer> useddenominations = new HashMap<Float,Integer>();
generatechange(amount, denominations, useddenominations);
}
}
Imagine you have a set of five elements (A-E) with some numeric values of a measured property (several observations for each element, for example "heart rate"):
A = {100, 110, 120, 130}
B = {110, 100, 110, 120, 90}
C = { 90, 110, 120, 100}
D = {120, 100, 120, 110, 110, 120}
E = {110, 120, 120, 110, 120}
First, I have to detect if there are significant differences on the average levels. So I run a one way ANOVA using the Statistical package provided by Apache Commons Math. No problems so far, I obtain a boolean that tells me whether differences are found or not.
Second, if differences are found, I need to know the element (or elements) that is different from the rest. I plan to use unpaired t-tests, comparing each pair of elements (A with B, A with C .... D with E), to know if an element is different than the other. So, at this point I have the information of the list of elements that present significant differences with others, for example:
C is different than B
C is different than D
But I need a generic algorithm to efficiently determine, with that information, what element is different than the others (C in the example, but could be more than one).
Leaving statistical issues aside, the question could be (in general terms): "Given the information about equality/inequality of each one of the pairs of elements in a collection, how could you determine the element/s that is/are different from the others?"
Seems to be a problem where graph theory could be applied. I am using Java language for the implementation, if that is useful.
Edit: Elements are people and measured values are times needed to complete a task. I need to detect who is taking too much or too little time to complete the task in some kind of fraud detection system.
Just in case anyone is interested in the final code, using Apache Commons Math to make statistical operations, and Trove to work with collections of primitive types.
It looks for the element(s) with the highest degree (the idea is based on comments made by @Pace and @Aniko, thanks).
I think the final algorithm is O(n^2); suggestions are welcome. It should work for any problem involving one qualitative vs. one quantitative variable, assuming normality of the observations.
import gnu.trove.iterator.TIntIntIterator;
import gnu.trove.map.TIntIntMap;
import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.procedure.TIntIntProcedure;
import gnu.trove.set.TIntSet;
import gnu.trove.set.hash.TIntHashSet;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.math.MathException;
import org.apache.commons.math.stat.inference.OneWayAnova;
import org.apache.commons.math.stat.inference.OneWayAnovaImpl;
import org.apache.commons.math.stat.inference.TestUtils;
public class TestMath {
private static final double SIGNIFICANCE_LEVEL = 0.001; // 99.9%
public static void main(String[] args) throws MathException {
double[][] observations = {
{150.0, 200.0, 180.0, 230.0, 220.0, 250.0, 230.0, 300.0, 190.0 },
{200.0, 240.0, 220.0, 250.0, 210.0, 190.0, 240.0, 250.0, 190.0 },
{100.0, 130.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 },
{200.0, 230.0, 150.0, 230.0, 240.0, 200.0, 210.0, 220.0, 210.0 },
{200.0, 230.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 }
};
final List<double[]> classes = new ArrayList<double[]>();
for (int i=0; i<observations.length; i++) {
classes.add(observations[i]);
}
OneWayAnova anova = new OneWayAnovaImpl();
// double fStatistic = anova.anovaFValue(classes); // F-value
// double pValue = anova.anovaPValue(classes); // P-value
boolean rejectNullHypothesis = anova.anovaTest(classes, SIGNIFICANCE_LEVEL);
System.out.println("reject null hypothesis " + (100 - SIGNIFICANCE_LEVEL * 100) + "% = " + rejectNullHypothesis);
// differences are found, so make t-tests
if (rejectNullHypothesis) {
TIntSet aux = new TIntHashSet();
TIntIntMap fraud = new TIntIntHashMap();
// i vs j unpaired t-tests - O(n^2)
for (int i=0; i<observations.length; i++) {
for (int j=i+1; j<observations.length; j++) {
boolean different = TestUtils.tTest(observations[i], observations[j], SIGNIFICANCE_LEVEL);
if (different) {
if (!aux.add(i)) {
if (fraud.increment(i) == false) {
fraud.put(i, 1);
}
}
if (!aux.add(j)) {
if (fraud.increment(j) == false) {
fraud.put(j, 1);
}
}
}
}
}
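// TIntIntHashMap is unordered, so scan the values for the highest degree
int highest = 0;
for (int degree : fraud.values()) {
highest = Math.max(highest, degree);
}
final int max = highest;
// Keep only those with the highest degree
fraud.retainEntries(new TIntIntProcedure() {
@Override
public boolean execute(int a, int b) {
return b == max;
}
});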
// If more than half of the elements are different
// then they are not really different (?)
if (fraud.size() > observations.length / 2) {
fraud.clear();
}
// output
TIntIntIterator it = fraud.iterator();
while (it.hasNext()) {
it.advance();
System.out.println("Element " + it.key() + " has significant differences");
}
}
}
}
Your edit gives good details; thanks.
Based on that I would presume a fairly well-behaved distribution of times (normal, or possibly gamma; depends on how close to zero your times get) for typical responses. Rejecting a sample from this distribution could be as simple as computing a standard deviation and seeing which samples lie more than n stdevs from the mean, or as complex as taking subsets which exclude outliers until your data settles down into a nice heap (e.g. the mean stops moving around 'much').
Now, you have an added wrinkle if you assume that a person who monkeys with one trial will monkey with another. So you're really trying to discriminate between a person who just happens to be fast (or slow) vs. one who is 'cheating'. You could do something like compute the stdev rank of each score (I forget the proper name for this: if a value is two stdevs above the mean, the score is '2'), and use that as your statistic.
Then, given this new statistic, there are some hypotheses you'll need to test. E.g., my suspicion is that the stdev of this statistic will be higher for cheaters than for someone who is just uniformly faster than other people--but you'd need data to verify that.
Good luck with it!
You would have to run the pairwise t-tests (or whatever pairwise test you want to implement) and then increment the counts in a hash where the key is the person and the count is the number of times that person was different.
I guess you could also have an ArrayList that contains person objects. Each person object could store its ID and the count of times it was different. Implement Comparable and then you could sort the ArrayList by count.
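A minimal sketch of that counting step, kept independent of any statistics library, could look like this. It assumes the pairwise test results are already available as a boolean matrix, where different[i][j] is true when elements i and j were found to differ; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DifferenceCounter {
    /** Returns the indices that were flagged as different most often (empty if none were). */
    static List<Integer> mostDifferent(boolean[][] different) {
        int n = different.length;
        Map<Integer, Integer> counts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (different[i][j]) {
                    counts.merge(i, 1, Integer::sum); // both ends of the pair get a count
                    counts.merge(j, 1, Integer::sum);
                }
            }
        }
        int max = 0;
        for (int count : counts.values()) {
            max = Math.max(max, count);
        }
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            if (max > 0 && counts.getOrDefault(i, 0) == max) {
                result.add(i);
            }
        }
        return result;
    }
}
```

Using the example from the question (A to E as indices 0 to 4, with C different from B and from D), the counts are B = 1, C = 2 and D = 1, so the method returns [2], which is C.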
If the items in the lists are sorted in numerical order, you can walk the two lists simultaneously, and any differences can easily be recognized as insertions or deletions. For example:
List A List B
1 1 // Match, increment both pointers
3 3 // Match, increment both pointers
5 4 // '4' missing in list A. Increment B pointer only.
List A List B
1 1 // Match, increment both pointers
3 3 // Match, increment both pointers
4 5 // '4' missing in list B (or added to A). Incr. A pointer only.
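The two-pointer walk shown in the tables can be sketched in Java as follows. The class name, method name and the wording of the reported differences are assumptions for the example, and both arrays are expected to be sorted in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

class SortedDiff {
    /** Reports the values that appear in only one of two sorted arrays. */
    static List<String> diff(int[] a, int[] b) {
        List<String> differences = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                i++; // match: advance both pointers
                j++;
            } else if (a[i] < b[j]) {
                differences.add("only in A: " + a[i]);
                i++; // advance the A pointer only
            } else {
                differences.add("only in B: " + b[j]);
                j++; // advance the B pointer only
            }
        }
        // Whatever is left in either array has no partner in the other.
        while (i < a.length) {
            differences.add("only in A: " + a[i++]);
        }
        while (j < b.length) {
            differences.add("only in B: " + b[j++]);
        }
        return differences;
    }
}
```

For A = {1, 3, 5} and B = {1, 3, 4}, the result is ["only in B: 4", "only in A: 5"]. Each element is visited once, so the walk is linear in the combined length of the two lists.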