How do I go about calculating the weighted mean of a Map<Double, Integer>, where the Integer value is the weight for the Double key to be averaged?
e.g. the Map has the following entries:
(0.7, 100) // value is 0.7 and weight is 100
(0.5, 200)
(0.3, 300)
(0.0, 400)
I am looking to apply the following formula using Java 8 streams, but I am unsure how to calculate the numerator and denominator together while preserving both at the same time. How would I use reduction here?
You can create your own collector for this task:
static <T> Collector<T,?,Double> averagingWeighted(ToDoubleFunction<T> valueFunction, ToIntFunction<T> weightFunction) {
class Box {
double num = 0;
long denom = 0;
}
return Collector.of(
Box::new,
(b, e) -> {
b.num += valueFunction.applyAsDouble(e) * weightFunction.applyAsInt(e);
b.denom += weightFunction.applyAsInt(e);
},
(b1, b2) -> { b1.num += b2.num; b1.denom += b2.denom; return b1; },
b -> b.num / b.denom
);
}
This custom collector takes two functions as parameters: one returns the value to use for a given stream element (as a ToDoubleFunction), and the other returns the weight (as a ToIntFunction). It uses a helper local class that stores the numerator and denominator during the collecting process. Each time an entry is accepted, the numerator is increased by the element's value multiplied by its weight, and the denominator is increased by the weight. The finisher then returns their quotient as a Double.
A sample usage would be:
Map<Double,Integer> map = new HashMap<>();
map.put(0.7, 100);
map.put(0.5, 200);
double weightedAverage =
map.entrySet().stream().collect(averagingWeighted(Map.Entry::getKey, Map.Entry::getValue));
You can use this procedure to calculate the weighted average of a map. Note that the key of the map entry should contain the value and the value of the map entry should contain the weight.
/**
* Calculates the weighted average of a map.
*
* @throws ArithmeticException if a division by zero happens
* @param map a map of values and weights
* @return the weighted average of the map
*/
static Double calculateWeightedAverage(Map<Double, Integer> map) throws ArithmeticException {
double num = 0;
double denom = 0;
for (Map.Entry<Double, Integer> entry : map.entrySet()) {
num += entry.getKey() * entry.getValue();
denom += entry.getValue();
}
return num / denom;
}
You can look at its unit test to see a use case.
/**
* Tests our method to calculate the weighted average.
*/
@Test
public void testAveragingWeighted() {
Map<Double, Integer> map = new HashMap<>();
map.put(0.7, 100);
map.put(0.5, 200);
Double weightedAverage = calculateWeightedAverage(map);
Assert.assertEquals(0.5666666666666667, weightedAverage, 1e-9);
}
You need these imports for the unit tests:
import org.junit.Assert;
import org.junit.Test;
You need these imports for the code:
import java.util.HashMap;
import java.util.Map;
I hope it helps.
// Note: as written, this method treats each entry's key as the weight and its value as the data point.
public static double weightedAvg(Collection<Map.Entry<? extends Number, ? extends Number>> data) {
var sumWeights = data.stream()
.map(Map.Entry::getKey)
.mapToDouble(Number::doubleValue)
.sum();
var sumData = data.stream()
.mapToDouble(e -> e.getKey().doubleValue() * e.getValue().doubleValue())
.sum();
return sumData / sumWeights;
}
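As a sanity check, here is a self-contained sketch of how this method could be used with the data from the question. Note that, as written, the method treats each entry's key as the weight and its value as the data point; the class name below is illustrative, not from the answer.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class WeightedAvgDemo {
    // Same logic as the answer above (with the generic parameter closed);
    // here each entry's KEY is the weight and its VALUE is the data point.
    static double weightedAvg(Collection<Map.Entry<? extends Number, ? extends Number>> data) {
        double sumWeights = data.stream()
                .map(Map.Entry::getKey)
                .mapToDouble(Number::doubleValue)
                .sum();
        double sumData = data.stream()
                .mapToDouble(e -> e.getKey().doubleValue() * e.getValue().doubleValue())
                .sum();
        return sumData / sumWeights;
    }

    public static void main(String[] args) {
        // The question's data: weights 100/200/300/400 for values 0.7/0.5/0.3/0.0
        List<Map.Entry<? extends Number, ? extends Number>> data =
                List.<Map.Entry<? extends Number, ? extends Number>>of(
                        Map.entry(100, 0.7), Map.entry(200, 0.5),
                        Map.entry(300, 0.3), Map.entry(400, 0.0));
        System.out.println(weightedAvg(data)); // prints 0.26
    }
}
```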
static float weightedMean(List<Double> value, List<Integer> weighted, int n) {
int sum = 0;
double numWeight = 0;
for (int i = 0; i < n; i++) {
numWeight = numWeight + value.get(i).doubleValue() * weighted.get(i).intValue();
sum = sum + weighted.get(i).intValue();
}
return (float) (numWeight) / sum;
}
I'm working on a project that prompts the user to create and fill an array with integers, then displays the mean, mode, median, and standard deviation of that array. It starts by asking the user what the size of the array will be, to which the number entered will declare and initialize the array. The program will then iterate several times asking the user to declare an integer value, and each value will be stored into the array until the array is filled. The program will then print the contents of the array, as well as the mean, mode, median, and standard deviation.
I have a code that seems to meet all these requirements. However, one thing I am struggling on is the mode. While it does print out the most repeated number in the array, it doesn't take into account multiple modes with the same number of repetitions, nor does it take into account what will happen if there is no mode.
Right now, if two numbers are entered twice each, the mode displayed is the first number to be repeated more than once. For example, if I have an array size of 10 integers, and the integers I enter are 1, 2, 2, 3, 3, 4, 5, 6, 7, 8, it will print out "2.0" for the mode instead of printing both "2.0" and "3.0." If there is no mode, it simply enters the number first entered, rather than saying "None."
What would be the best course of action to go about accomplishing this?
Here is my code:
import java.util.*;
public class ArrayStatistics {
public static void main(String[] args) {
double total = 0;
Scanner input = new Scanner(System.in);
System.out.print("Enter the size of your array >> ");
int size = input.nextInt();
double[] myArray = new double[size];
System.out.print("Enter the integer values >> ");
for (int i=0; i<size; i++) {
myArray[i] = input.nextInt();
}
System.out.println("\nIntegers:");
for (int i=0; i<size; i++) {
System.out.println(myArray[i]);
}
double mean = calculateMean(myArray);
System.out.println("\nMean: " + mean);
double mode = calculateMode(myArray);
System.out.println("Mode: " + mode);
double median = calculateMedian(myArray);
System.out.println("Median: " + median);
double SD = calculateSD(myArray);
System.out.format("Standard Deviation: %.6f", SD);
}
public static double calculateMean(double myArray[]) {
int sum = 0;
for(int i = 0; i<myArray.length; i++) {
sum = (int) (sum + myArray[i]);
}
double mean = ((double) sum) / (double)myArray.length;
return mean;
}
public static double calculateMode(double myArray[]) {
int modeCount = 0;
int mode = 0;
int currCount = 0;
for(double candidateMode : myArray) {
currCount = 0;
for(double element : myArray) {
if(candidateMode == element) {
currCount++;
}
}
if(currCount > modeCount) {
modeCount = currCount;
mode = (int) candidateMode;
}
}
return mode;
}
public static double calculateMedian(double myArray[]) {
Arrays.sort(myArray);
int val = myArray.length/2;
double median = ((myArray[val]+myArray[val-1])/2.0);
return median;
}
public static double calculateSD(double myArray[]) {
double sum = 0.0;
double standardDeviation = 0.0;
int length = myArray.length;
for(double num : myArray) {
sum += num;
}
double mean = sum/length;
for(double num : myArray) {
standardDeviation += Math.pow(num - mean, 2);
}
return Math.sqrt(standardDeviation/length);
}
}
First the code, then the explanations.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.stream.Collectors;
public class ArrayStatistics {
public static void main(String[] args) {
int total = 0;
Scanner input = new Scanner(System.in);
System.out.print("Enter the size of your array >> ");
int size = input.nextInt();
int[] myArray = new int[size];
Map<Integer, Integer> frequencies = new HashMap<>();
System.out.print("Enter the integer values >> ");
for (int i = 0; i < size; i++) {
myArray[i] = input.nextInt();
if (frequencies.containsKey(myArray[i])) {
int frequency = frequencies.get(myArray[i]);
frequencies.put(myArray[i], frequency + 1);
}
else {
frequencies.put(myArray[i], 1);
}
total += myArray[i];
}
System.out.println("\nIntegers:");
for (int i = 0; i < size; i++) {
System.out.println(myArray[i]);
}
double mean = calculateMean(size, total);
System.out.println("\nMean: " + mean);
List<Integer> mode = calculateMode(frequencies);
System.out.println("Mode: " + mode);
double median = calculateMedian(myArray);
System.out.println("Median: " + median);
double stdDev = calculateSD(mean, total, size, myArray);
System.out.format("Standard Deviation: %.6f", stdDev);
}
public static double calculateMean(int count, int total) {
double mean = ((double) total) / count;
return mean;
}
public static List<Integer> calculateMode(Map<Integer, Integer> frequencies) {
Map<Integer, Integer> sorted = frequencies.entrySet()
.stream()
.sorted((e1, e2) -> e2.getValue() - e1.getValue())
.collect(Collectors.toMap(e -> e.getKey(),
e -> e.getValue(),
(i1, i2) -> i1,
LinkedHashMap::new));
Iterator<Integer> iterator = sorted.keySet().iterator();
Integer first = iterator.next();
Integer val = sorted.get(first);
List<Integer> modes = new ArrayList<>();
if (val > 1) {
modes.add(first);
while (iterator.hasNext()) {
Integer next = iterator.next();
Integer nextVal = sorted.get(next);
if (nextVal.equals(val)) {
modes.add(next);
}
else {
break;
}
}
}
return modes;
}
public static double calculateMedian(int myArray[]) {
Arrays.sort(myArray);
int val = myArray.length / 2;
double median = ((myArray[val] + myArray[val - 1]) / 2.0);
return median;
}
public static double calculateSD(double mean, int sum, int length, int[] myArray) {
double standardDeviation = 0.0;
for (double num : myArray) {
standardDeviation += Math.pow(num - mean, 2);
}
return Math.sqrt(standardDeviation / length);
}
}
In order to determine the mode(s), you need to keep track of the occurrences of the integers entered into your array. I use a Map to do this. I also calculate the total while entering the integers and use this total in the methods that require it, for example calculateMean. It seems like extra work to recalculate the total each time you need it.
You are dealing with integers, so why declare myArray as array of double? Hence I changed it to array of int.
Your question was how to determine the mode(s). Consequently I refactored the method calculateMode. In order to determine the mode(s), you need to interrogate the frequencies, hence the method parameter. Since you claim that there can be zero, one, or more than one mode, the method returns a List. First I sort the Map entries according to the value, i.e. the number of occurrences of a particular integer in myArray. I sort the entries in descending order. Then I collect all the sorted entries into a LinkedHashMap, since that is a map that stores its entries in the order in which they were added. Hence the first entry in the LinkedHashMap will be the integer with the most occurrences. If the number of occurrences of the first map entry is 1 (one), that means there are no modes (according to this definition that I found):
If no number in the list is repeated, then there is no mode for the list.
In the case of no modes, method calculateMode returns an empty List.
If the number of occurrences of the first entry is more than one, I add the integer to the List. Then I iterate through the remaining map entries and add the integer to the List if its occurrences equals that of the first map entry. As soon as the number of occurrences in an entry does not equal that of the first entry, I exit the while loop. Now List contains all the integers in myArray with the highest number of occurrences.
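For comparison, the same zero-one-or-many mode logic can be sketched with streams instead of a sorted LinkedHashMap (a sketch only; the class and method names are mine, not from the answer above): group the values by identity, find the highest frequency, and keep every value that reaches it.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ModeDemo {
    // Returns every value with the highest frequency, or an empty list
    // when nothing repeats (i.e. there is no mode).
    static List<Integer> modes(List<Integer> values) {
        Map<Integer, Long> freq = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        long max = freq.values().stream().mapToLong(Long::longValue).max().orElse(0);
        if (max <= 1) {
            return List.of(); // no repeats -> no mode
        }
        return freq.entrySet().stream()
                .filter(e -> e.getValue() == max)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(modes(List.of(1, 2, 2, 3, 3, 4, 5, 6, 7, 8))); // prints [2, 3]
        System.out.println(modes(List.of(1, 2, 3)));                      // prints []
    }
}
```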
Here is a sample run (using example data from your question):
Enter the size of your array >> 10
Enter the integer values >> 1 2 2 3 3 4 5 6 7 8
Integers:
1
2
2
3
3
4
5
6
7
8
Mean: 4.1
Mode: [2, 3]
Median: 3.5
Standard Deviation: 2.211334
I have a list of fitness values (percentages), which are ordered in descending order:
List<Double> fitnesses = new ArrayList<Double>();
I would like to choose one of these Doubles, with an extreme likelihood of it being the first one, then decreasing likelihood for each item, until the final item in the list has a chance close to 0% of being chosen.
How do I go about achieving this?
Thanks for any advice.
If you want to select "one of these Doubles, with an extreme likelihood of it being the first one, then decreasing likelihood for each item, until the final one is close to 0% chance of it being the final item in the list", then it seems like you want an exponential probability function (p = x²).
However, you will only know whether you have chosen the right function once you have coded a solution and tried it, and if it does not suit your needs then you will need to choose some other probability function, like a sinusoidal (p = sin( x * PI/2 )) or an inverse ratio (p = 1/x).
So, the important thing is to code an algorithm for selecting an item based on a probability function, so that you can then try any probability function you like.
So, here is one way to do it.
Note the following:
I am seeding the random number generator with 10 in order to always produce the same results. Remove the seeding to get different results at each run.
I am using a list of Integer for your "percentages" in order to avoid confusion. Feel free to replace with a list of Double once you have understood how things work.
I am providing a few sample probability functions. Try them to see what distributions they yield.
Have fun!
import java.util.*;
public final class Scratch3
{
private Scratch3()
{
}
interface ProbabilityFunction<T>
{
double getProbability( double x );
}
private static double exponential2( double x )
{
assert x >= 0.0 && x <= 1.0;
return StrictMath.pow( x, 2 );
}
private static double exponential3( double x )
{
assert x >= 0.0 && x <= 1.0;
return StrictMath.pow( x, 3 );
}
private static double inverse( double x )
{
assert x >= 0.0 && x <= 1.0;
return 1/x;
}
private static double identity( double x )
{
assert x >= 0.0 && x <= 1.0;
return x;
}
@SuppressWarnings( { "UnsecureRandomNumberGeneration", "ConstantNamingConvention" } )
private static final Random randomNumberGenerator = new Random( 10 );
private static <T> T select( List<T> values, ProbabilityFunction<T> probabilityFunction )
{
double x = randomNumberGenerator.nextDouble();
double p = probabilityFunction.getProbability( x );
int i = Math.min( (int)( p * values.size() ), values.size() - 1 ); // clamp: functions like inverse() can return p >= 1.0
return values.get( i );
}
public static void main( String[] args )
{
List<Integer> values = Arrays.asList( 10, 11, 12, 13, 14, 15 );
Map<Integer,Integer> counts = new HashMap<>();
for( int i = 0; i < 10000; i++ )
{
int value = select( values, Scratch3::exponential3 );
counts.merge( value, 1, ( a, b ) -> a + b );
}
for( int value : values )
System.out.println( value + ": " + counts.get( value ) );
}
}
Here's another way of doing it that gives you the ability to approximate an arbitrary weight distribution.
The array passed to WeightedIndexPicker indicates the number of "buckets" (>0) that should be allocated to each index. In your case these would be descending, but they don't have to be. When you need an index, pick a random number between 0 and the total number of buckets and return the index associated with that bucket.
I've used an int weight array as it's easier to visualize and it avoids rounding errors associated with floating point.
import java.util.Random;
public class WeightedIndexPicker
{
private int total;
private int[] counts;
private Random rand;
public WeightedIndexPicker(int[] weights)
{
rand = new Random();
counts = weights.clone();
for(int i=1; i<counts.length; i++)
{
counts[i] += counts[i-1];
}
total = counts[counts.length-1];
}
public int nextIndex()
{
int idx = 0;
int pick = rand.nextInt(total);
while(pick >= counts[idx]) idx++;
return idx;
}
public static void main(String[] args)
{
int[] dist = {1000, 100, 10, 1};
WeightedIndexPicker wip = new WeightedIndexPicker(dist);
int idx = wip.nextIndex();
System.out.println(idx);
}
}
I don't think you need all this code to answer your question since your question seems to be much more about math than code. For example, using the apache commons maths library getting a distribution is easy:
ExponentialDistribution dist = new ExponentialDistribution(1);
// getting a sample (aka index into the list) is easy
dist.sample();
// lots of extra code to display the distribution.
int NUM_BUCKETS = 100;
int NUM_SAMPLES = 1000000;
DoubleStream.of(dist.sample(NUM_SAMPLES))
.map(s->((long)s*NUM_BUCKETS)/NUM_BUCKETS)
.boxed()
.collect(groupingBy(identity(), TreeMap::new, counting()))
.forEach((k,v)->System.out.println(k.longValue() + " -> " + v));
However, as you said, there are so many possible distributions in the math library. If you are writing code for a specific purpose then the end user will probably want you to explain why you chose a specific distribution and why you set the parameters for that distribution the way you did. That's a math question and should be asked in the mathematics forum.
I need to determine the minimum value after removing the first value.
For instance, if these are the numbers 0.5 70 80 90 10,
I need to remove 0.5, then determine the minimum value in the remaining numbers. calcWeightedAvg is my focus ...
The final output should be “The weighted average of the numbers is 40, when using the data 0.5 70 80 90 10, where 0.5 is the weight, and the average is computed after dropping the lowest of the rest of the values.”
EDIT: Everything seems to be working, EXCEPT the final output: "The weighted average of the numbers is 40.0, when using the data 70.0, 80.0, 90.0, 10.0, where 70.0 (should be 0.5) is the weight, and the average is computed after dropping the lowest of the rest of the values."
So the math is right, the output is not.
EDIT: While using a class static double weight=0.5; to establish the weight, if the user were to change the values in the input file, that would not work. How can I change the class?
/*
*
*/
package calcweightedavg;
import java.util.Scanner;
import java.util.ArrayList;
import java.io.File;
import java.io.PrintWriter;
import java.io.FileNotFoundException;
import java.io.IOException;
public class CalcWeightedAvg {
/**
* @param args the command line arguments
*/
public static void main(String[] args) throws IOException {
//System.out.println(System.getProperty("user.dir"));
ArrayList<Double> inputValues = getData(); // User entered integers.
double weightedAvg = calcWeightedAvg(inputValues); // User entered weight.
printResults(inputValues, weightedAvg); //Weighted average of integers.
}
public static ArrayList<Double> getData() throws FileNotFoundException {
// Get input file name.
Scanner console = new Scanner(System.in);
System.out.print("Input File: ");
String inputFileName = console.next();
File inputFile = new File(inputFileName);
//
Scanner in = new Scanner(inputFile);
String inputString = in.nextLine();
//
String[] strArray = inputString.split("\\s+"); //LEFT OFF HERE
// Create arraylist with integers.
ArrayList<Double> doubleArrayList = new ArrayList<>();
for (String strElement : strArray) {
doubleArrayList.add(Double.parseDouble(strElement));
}
in.close();
return doubleArrayList;
}
public static double calcWeightedAvg(ArrayList<Double> inputValues){
//Get and remove weight.
Double weight = inputValues.get(0);
inputValues.remove(0);
//Sum and find min.
double min = Double.MAX_VALUE;
double sum = 0;
for (Double d : inputValues) {
if (d < min) min = d;
sum += d;
}
// Calculate weighted average.
return (sum-min)/(inputValues.size()-1) * weight;
}
public static void printResults(ArrayList<Double> inputValues, double weightedAvg) throws IOException {
Scanner console = new Scanner(System.in);
System.out.print("Output File: ");
String outputFileName = console.next();
PrintWriter out = new PrintWriter(outputFileName);
System.out.println("Your output is in the file " + outputFileName);
out.print("The weighted average of the numbers is " + weightedAvg + ", ");
out.print("when using the data ");
for (int i=0; i<inputValues.size(); i++) {
out.print(inputValues.get(i) + ", ");
}
out.print("\n where " + inputValues.get(0) + " is the weight, ");
out.print("and the average is computed after dropping the lowest of the rest of the values.\n");
out.close();
}
}
Doing this task in O(n) complexity isn't hard.
You can use ArrayList's .get(0) to save the weight in a temp variable, then use the .remove(0) function, which removes the first value (in this case 0.5).
Then you should use a for-each loop, for (Double d : list), to sum AND find the minimal value.
Afterwards, subtract the minimum value from the sum and apply the weight (in this case you'll end up with 240*0.5 = 120; 120/3 = 40).
Finally, you can use ArrayList's .size()-1 to determine the divisor.
The problem in your code:
In your implementation you removed the weight item from the list, then multiplied by the first item in the list even though it's no longer the weight:
return (sum-min)/(inputValues.size()-1) * inputValues.get(0);
Your calculation then was: ((70+80+90+10)-10)/(4-1) * (70) = 5600
if(inputValues.size() <= 1){
inputValues.remove(0);
}
This size safeguard will not remove the weight from the list. Perhaps you meant to use >= 1.
Even if that was your intention, it will not result in a correct computation of your algorithm in the edge cases where size == 0, 1, or 2. I would recommend that you re-think this.
The full steps that need to be taken, in abstract code:
ArrayList<Double> list = new ArrayList();
// get and remove weight
Double weight = list.get(0);
list.remove(0);
// sum and find min
double min=Double.MAX_VALUE;
double sum=0;
for (Double d : list) {
if (d<min) min = d;
sum+=d;
}
// subtract min value from sum
sum-=min;
// apply weight
sum*=weight;
// calc weighted avg
double avg = sum / (list.size() - 1);
// voilà!
Do take notice that you can now safely add the weight back into the array list after its use via ArrayList's .add(int index, T value) function. Also, the code is very abstract, and safeguards regarding size should be implemented.
Regarding your Edit:
It appears you're outputting the wrong variable.
out.print("\n where " + inputValues.get(0) + " is the weight, ");
The weight variable was already removed from the list at this stage, so the first item in the list is indeed 70. Either add the weight variable back into the list after you've computed the result, or save it in a class variable and output it directly.
Following are implementations of both solutions; you should only use one of them, not both.
1) add weight back into list solution:
change this function to add weight back to list:
public static double calcWeightedAvg(ArrayList<Double> inputValues){
//Get and remove weight.
Double weight = inputValues.get(0);
inputValues.remove(0);
//Sum and find min.
double min = Double.MAX_VALUE;
double sum = 0;
for (Double d : inputValues) {
if (d < min) min = d;
sum += d;
}
// Calculate weighted average.
double returnVal = (sum-min)/(inputValues.size()-1) * weight;
// add weight back to list
inputValues.add(0,weight);
return returnVal;
}
2) class variable solution:
change for class:
public class CalcWeightedAvg {
static double weight=0;
//...
}
change for function:
public static double calcWeightedAvg(ArrayList<Double> inputValues){
//Get and remove weight.
weight = inputValues.get(0); // changed to class variable
//...
}
change for output:
out.print("\n where " + weight + " is the weight, ");
Since you're using an ArrayList, this should be a piece of cake.
To remove a value from an ArrayList, just find the index of the value and call
myList.remove(index);
If 0.5 is the first element in the list, remove it with
inputValues.remove(0);
If you want to find the minimum value in an ArrayList of doubles, just use this algorithm to find both the minimum value and its index:
double minVal = Double.MAX_VALUE;
int minIndex = -1;
for(int i = 0; i < myList.size(); i++) {
if(myList.get(i) < minVal) {
minVal = myList.get(i);
minIndex = i;
}
}
Hope this helps!
If you want to remove the first element from ArrayList and calculate the minimum in the remaining you should do:
if(inputValues.size() <= 1) //no point in calculation of one element
return;
inputValues.remove(0);
double min = inputValues.get(0);
for (int i = 1; i < inputValues.size(); i++) {
if (inputValues.get(i) < min)
min = inputValues.get(i);
}
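On Java 8+, the same remove-then-find-minimum step can also be written with a stream (a minimal sketch, using the numbers from the question):

```java
import java.util.ArrayList;
import java.util.List;

public class MinDemo {
    public static void main(String[] args) {
        List<Double> inputValues = new ArrayList<>(List.of(0.5, 70.0, 80.0, 90.0, 10.0));
        inputValues.remove(0); // drop the weight (0.5) first
        double min = inputValues.stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElseThrow(IllegalStateException::new); // empty list -> no minimum
        System.out.println(min); // prints 10.0
    }
}
```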
I am a little unclear about your goal here. If you are required to make frequent calls to check the minimum value, a min heap would be a very good choice.
A min heap has the property that it offers constant-time access to the minimum value. This implementation uses an ArrayList. So, you can add to the ArrayList using the add() method, and minValue() gives constant-time access to the minimum value of the list, since the heap ensures that the minimum value is always at index 0. The list is modified accordingly when the least value is removed or a new value is added (called heapify).
I am not adding any code here since the link should make that part clear. If you would like some clarification, I would be more than glad to be of help.
Edit.
import java.util.ArrayList;
import java.util.NoSuchElementException;

public class HelloWorld {
private static ArrayList<Double> values;
private static Double sum = 0.0D;
/**
* Identifies the minimum value stored in the heap
* #return the minimum value
*/
public static Double minValue() {
if (values.size() == 0) {
throw new NoSuchElementException();
}
return values.get(0);
}
/**
* Adds a new value to the heap.
* #param newValue the value to be added
*/
public static void add(Double newValue) {
values.add(newValue);
int pos = values.size()-1;
while (pos > 0) {
if (newValue.compareTo(values.get((pos-1)/2)) < 0) {
values.set(pos, values.get((pos-1)/2));
pos = (pos-1)/2;
}
else {
break;
}
}
values.set(pos, newValue);
// update global sum
sum += newValue;
}
/**
* Removes the minimum value from the heap.
*/
public static void remove() {
Double newValue = values.remove(values.size()-1);
int pos = 0;
if (values.size() > 0) {
while (2*pos+1 < values.size()) {
int minChild = 2*pos+1;
if (2*pos+2 < values.size() &&
values.get(2*pos+2).compareTo(values.get(2*pos+1)) < 0) {
minChild = 2*pos+2;
}
if (newValue.compareTo(values.get(minChild)) > 0) {
values.set(pos, values.get(minChild));
pos = minChild;
}
else {
break;
}
}
values.set(pos, newValue);
}
// update global sum
sum -= newValue;
}
/**
* NEEDS EDIT: computes a weighted result from the list, treating the minimum
* value as the weight and leaving it out of the sum.
*/
public static double calcWeightedAvg() {
double minValue = minValue();
// the running total of the sum took this into account
// so, we have to remove this from the sum to get the effective sum
double effectiveSum = (sum - minValue);
return effectiveSum * minValue;
}
public static void main(String []args) {
values = new ArrayList<Double>();
// add values to the arraylist -> order is intentionally ruined
double[] arr = new double[]{10,70,90,80,0.5};
for(double val: arr)
add(val);
System.out.println("Present minimum in the list: " + minValue()); // 0.5
System.out.println("CalcWeightedAvg: " + calcWeightedAvg()); // 125.0
}
}
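If you don't want to maintain the heap by hand, java.util.PriorityQueue is the JDK's built-in binary min-heap and gives the same constant-time peek at the minimum. A minimal sketch of the same sum-minus-minimum computation (the class name is mine, not from the answer above):

```java
import java.util.PriorityQueue;

public class MinHeapDemo {
    public static void main(String[] args) {
        PriorityQueue<Double> heap = new PriorityQueue<>();
        double sum = 0;
        for (double v : new double[]{10, 70, 90, 80, 0.5}) {
            heap.add(v);  // O(log n) insert keeps the smallest element at the head
            sum += v;
        }
        double min = heap.peek();        // constant-time access to the minimum
        double effectiveSum = sum - min; // 250.5 - 0.5 = 250.0
        System.out.println(effectiveSum * min); // prints 125.0
    }
}
```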
While working on a toy project I was faced with the problem of generating a set of N 2d points where every point was between distance A and B from every other point in the set (and also within certain absolute bounds).
I prefer working with java streams and lambdas for practice, because of their elegance and the possibility for easy parallelization, so I'm not asking how to solve this problem in an imperative manner!
The solution that first came to mind was:
seed the set (or list) with a random vector
until the set reaches size N:
create a random vector with length between A and B and add it to a random "parent" vector
if it's outside the bounds or closer than A to any vector in the set, discard it, otherwise add it to the set
repeat
This would be trivial for me with imperative programming (loops), but I was stumped when doing this the functional way because the newly generated elements in the stream depend on previously generated elements in the same stream.
Here's what I came up with - notice the icky loop at the beginning.
while (pointList.size() < size) {
// find a suitable position, not too close and not too far from another one
Vec point =
// generate a stream of random vectors
Stream.generate(vecGen::generate)
// elongate the vector and add it to the position of one randomly existing vector
.map(v -> listSelector.getRandom(pointList).add(v.mul(random.nextDouble() * (maxDistance - minDistance) + minDistance)))
// remove those that are outside the borders
.filter(v -> v.length < diameter)
// remove those that are too close to another one
.filter(v -> pointList.stream().allMatch(p -> Vec.distance(p, v) > minDistance))
// take the first one
.findAny().get();
pointList.add(point);
}
I know that this loop might never terminate, depending on the parameters - the real code has additional checks.
One working functional solution that comes to mind is to generate completely random sets of N vectors until one of the sets satisfy the condition, but the performance would be abysmal. Also, this would circumvent the problem I'm facing: is it possible to work with the already generated elements in a stream while adding new elements to the stream (Pretty sure that would violate some fundamental principle, so I guess the answer is NO)?
Is there a way to do this in a functional - and not too wasteful - way?
A simple solution is shown below. The Pair class can be found in the Apache commons lang3.
public List<Pair<Double, Double>> generate(int N, double A, double B) {
Random ySrc = new Random();
return new Random()
.doubles(N, A, B)
.boxed()
.map(x -> Pair.of(x, (ySrc.nextDouble() * (B - A)) + A))
.collect(Collectors.toList());
}
My original solution (above) missed the point that A and B represent the minimum and maximum distance between any two points. So I would instead propose a different solution (way more complicated) that relies on generating points on a unit circle. I scale (multiply) the unit vector representing the point by a random distance with a minimum of -1/2 B and a maximum of 1/2 B. This approach uniformly distributes points in an area bounded by a circle of radius 1/2 B, which addresses the maximum-distance constraint. Given a sufficient difference between A and B, where A < B, and N not too large, the minimum-distance constraint will probably also be satisfied. Satisfying the maximum-distance constraint can be accomplished with purely functional code (i.e., no side effects).
To ensure that the minimum constraint is satisfied requires some imperative code (i.e., side effects). For this purpose, I use a predicate with side effects. The predicate accumulates points that meet the minimum constraint criteria and returns true when N points have been accumulated.
Note the running time is unknown because points are randomly generated. With N = 100, A = 1.0, and B = 30.0, the test code runs quickly. I tried values of 10 and 20 for B and didn't wait for it to end. If you want a tighter cluster of points you will probably need to speed up this code or start looking at linear solvers.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;
import org.apache.commons.lang3.tuple.Pair;

public class RandomPoints {
/**
* The stop rule is a predicate implementation with side effects. Not sure
* about the wisdom of this approach. The class does not support concurrent
* modification.
*
* @author jgmorris
*
*/
private class StopRule implements Predicate<Pair<Double, Double>> {
private final int N;
private final List<Pair<Double, Double>> points;
public StopRule(int N, List<Pair<Double, Double>> points) {
this.N = N;
this.points = points;
}
@Override
public boolean test(Pair<Double, Double> t) {
// Brute force test. A hash based test would work a lot better.
for (int i = 0; i < points.size(); ++i) {
if (distance(t, points.get(i)) < dL) {
// List size unchanged, continue
return false;
}
}
points.add(t);
return points.size() >= N;
}
}
private final double dL;
private final double dH;
private final double maxRadius;
private final Random r;
public RandomPoints(double dL, double dH) {
this.dL = dL;
this.dH = dH;
this.maxRadius = dH / 2;
r = new Random();
}
public List<Pair<Double, Double>> generate(int N) {
List<Pair<Double, Double>> points = new ArrayList<>();
StopRule pred = new StopRule(N, points);
new Random()
// Generate a uniform distribution of doubles between 0.0 and
// 1.0
.doubles()
// Transform primitive double into a Double
.boxed()
// Transform to a number between 0.0 and 2ϖ
.map(u -> u * 2 * Math.PI)
// Generate a random point
.map(theta -> randomPoint(theta))
// Add point to points if it meets minimum distance criteria.
// Stop when enough points are gathered.
.anyMatch(p -> pred.test(p));
return points;
}
private final Pair<Double, Double> randomPoint(double theta) {
double x = Math.cos(theta);
double y = Math.sin(theta);
double radius = randRadius();
return Pair.of(radius * x, radius * y);
}
private double randRadius() {
return maxRadius * (r.nextDouble() - 0.5);
}
public static void main(String[] args) {
RandomPoints rp = new RandomPoints(1.0, 30.0);
List<Pair<Double, Double>> points = rp.generate(100);
for (int i = 0; i < points.size(); ++i) {
for (int j = 1; j < points.size() - 1; ++j) {
if (i == j) {
continue;
}
double distance = distance(points.get(i), points.get(j));
if (distance < 1.0 || distance > 30.0) {
System.out.println("oops");
}
}
}
}
private static double distance(Pair<Double, Double> p1, Pair<Double, Double> p2) {
return Math.sqrt(Math.pow(p1.getLeft() - p2.getLeft(), 2.0) + Math.pow(p1.getRight() - p2.getRight(), 2.0));
}
}
I'm trying to think of some code that will allow me to search through my ArrayList and detect any values outside the common range of "good values."
Example:
100
105
102
13
104
22
101
How would I be able to write the code to detect that (in this case) 13 and 22 don't fall within the "good values" of around 100?
There are several criteria for detecting outliers. The simplest ones, like Chauvenet's criterion, use the mean and standard deviation calculated from the sample to determine a "normal" range for values. Any value outside of this range is deemed an outlier.
Other criteria are Grubbs' test and Dixon's Q test, which may give better results than Chauvenet's, for example if the sample comes from a skewed distribution.
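To illustrate the mean/standard-deviation idea concretely, here is a minimal sketch (not full Chauvenet's criterion, which would derive the threshold from the sample size; here the cutoff k is simply chosen by hand):

```java
import java.util.List;
import java.util.stream.Collectors;

public class RangeCheck {

    // Flags values more than k standard deviations from the mean.
    // Uses the population standard deviation (divide by n); k is an
    // arbitrary choice, unlike Chauvenet's size-derived threshold.
    static List<Double> outliers(List<Double> data, double k) {
        double mean = data.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        double sd = Math.sqrt(data.stream()
                .mapToDouble(v -> (v - mean) * (v - mean))
                .average().orElse(0));
        return data.stream()
                .filter(v -> Math.abs(v - mean) > k * sd)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(outliers(List.of(100.0, 105.0, 102.0, 13.0, 104.0, 22.0, 101.0), 1.5)); // [13.0]
    }
}
```

Note that a single pass at k = 1.5 only flags 13 here, because 22 hides behind the standard deviation that 13 itself inflates; re-running the check on the filtered list (as the recursive answer further down does) then catches 22 as well.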
package test;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {

    public static void main(String[] args) {
        List<Double> data = new ArrayList<Double>();
        data.add(20.0);
        data.add(65.0);
        data.add(72.0);
        data.add(75.0);
        data.add(77.0);
        data.add(78.0);
        data.add(80.0);
        data.add(81.0);
        data.add(82.0);
        data.add(83.0);
        Collections.sort(data);
        System.out.println(getOutliers(data));
    }

    public static List<Double> getOutliers(List<Double> input) {
        List<Double> output = new ArrayList<Double>();
        List<Double> data1 = new ArrayList<Double>();
        List<Double> data2 = new ArrayList<Double>();
        if (input.size() % 2 == 0) {
            data1 = input.subList(0, input.size() / 2);
            data2 = input.subList(input.size() / 2, input.size());
        } else {
            data1 = input.subList(0, input.size() / 2);
            data2 = input.subList(input.size() / 2 + 1, input.size());
        }
        double q1 = getMedian(data1);
        double q3 = getMedian(data2);
        double iqr = q3 - q1;
        double lowerFence = q1 - 1.5 * iqr;
        double upperFence = q3 + 1.5 * iqr;
        for (int i = 0; i < input.size(); i++) {
            if (input.get(i) < lowerFence || input.get(i) > upperFence)
                output.add(input.get(i));
        }
        return output;
    }

    private static double getMedian(List<Double> data) {
        if (data.size() % 2 == 0)
            return (data.get(data.size() / 2) + data.get(data.size() / 2 - 1)) / 2;
        else
            return data.get(data.size() / 2);
    }
}
Output:
[20.0]
Explanation:
Sort a list of integers, from low to high
Split a list of integers into 2 parts (by a middle) and put them into 2 new separate ArrayLists (call them "left" and "right")
Find a middle number (median) in both of those new ArrayLists
Q1 is a median from left side, and Q3 is the median from the right side
Applying mathematical formula:
IQR = Q3 - Q1
LowerFence = Q1 - 1.5*IQR
UpperFence = Q3 + 1.5*IQR
More info about this formula: http://www.mathwords.com/o/outlier.htm
Loop through all of the original elements, and if any of them are lower than the lower fence or higher than the upper fence, add them to the "output" ArrayList
This new "output" ArrayList contains the outliers
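Tracing the sample data through the steps above as a quick check (this helper hardcodes the same median-split convention for an even-sized list with odd-sized halves, as in this example):

```java
import java.util.List;

public class IqrTrace {

    // Returns {lowerFence, upperFence} for an even-sized sorted list whose
    // halves are odd-sized, using the same split convention as getOutliers().
    static double[] fences(List<Double> sorted) {
        List<Double> lower = sorted.subList(0, sorted.size() / 2);
        List<Double> upper = sorted.subList(sorted.size() / 2, sorted.size());
        double q1 = lower.get(lower.size() / 2); // middle element of odd-sized half
        double q3 = upper.get(upper.size() / 2);
        double iqr = q3 - q1;
        return new double[] { q1 - 1.5 * iqr, q3 + 1.5 * iqr };
    }

    public static void main(String[] args) {
        List<Double> sorted = List.of(20.0, 65.0, 72.0, 75.0, 77.0, 78.0, 80.0, 81.0, 82.0, 83.0);
        double[] f = fences(sorted);
        // Q1 = 72, Q3 = 81, IQR = 9, so the fences are 58.5 and 94.5;
        // only 20.0 falls outside, matching the [20.0] output above.
        System.out.println(f[0] + " .. " + f[1]); // prints 58.5 .. 94.5
    }
}
```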
An implementation of Grubbs' test can be found at MathUtil.java. It finds a single outlier, which you can remove from your list, repeating until all outliers have been removed.
Depends on commons-math, so if you're using Gradle:
dependencies {
compile 'org.apache.commons:commons-math:2.2'
}
find the mean value for your list
create a Map that maps each number to its distance from the mean
sort the values by their distance from the mean
take the last n numbers (those farthest from the mean) as candidate outliers, sanity-checking that their distances really stand out
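The steps above can be sketched as follows (a rough sketch: n, the number of suspected outliers, is something you must choose up front, which is this approach's main weakness):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class DistanceFromMean {

    // Returns the n values farthest from the mean, following the steps above.
    // Caveat: this always flags exactly n values, even if the data contains
    // no genuine outliers, so the distances still need a sanity check.
    static List<Double> farthestFromMean(List<Double> data, int n) {
        double mean = data.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return data.stream()
                .sorted(Comparator.comparingDouble((Double v) -> Math.abs(v - mean)).reversed())
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(farthestFromMean(List.of(100.0, 105.0, 102.0, 13.0, 104.0, 22.0, 101.0), 2)); // [13.0, 22.0]
    }
}
```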
Use this algorithm. It works with the average and the standard deviation: any value farther than 2 * standardDeviation from the average is treated as an outlier (the factor of 2 is adjustable). Note that the snippet below is C#.
public static List<int> StatisticalOutLierAnalysis(List<int> allNumbers)
{
    if (allNumbers.Count == 0)
        return null;

    List<int> normalNumbers = new List<int>();
    List<int> outLierNumbers = new List<int>();
    double avg = allNumbers.Average();
    double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));
    foreach (int number in allNumbers)
    {
        if ((Math.Abs(number - avg)) > (2 * standardDeviation))
            outLierNumbers.Add(number);
        else
            normalNumbers.Add(number);
    }
    return normalNumbers;
}
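The snippet above is C#; a rough Java equivalent of the same mean ± 2σ rule might look like this:

```java
import java.util.ArrayList;
import java.util.List;

public class StatisticalOutliers {

    // Keeps values within 2 standard deviations of the mean, mirroring the
    // C# snippet above (population standard deviation, i.e. divide by n).
    static List<Integer> withinTwoStdDev(List<Integer> allNumbers) {
        if (allNumbers.isEmpty()) {
            return allNumbers;
        }
        double avg = allNumbers.stream().mapToInt(Integer::intValue).average().orElse(0);
        double sd = Math.sqrt(allNumbers.stream()
                .mapToDouble(v -> Math.pow(v - avg, 2))
                .average().orElse(0));
        List<Integer> normal = new ArrayList<>();
        for (int number : allNumbers) {
            if (Math.abs(number - avg) <= 2 * sd) {
                normal.add(number);
            }
        }
        return normal;
    }

    public static void main(String[] args) {
        // With the question's data, a 2-standard-deviation threshold keeps
        // everything, because the outliers themselves inflate the deviation;
        // a smaller factor (or repeated passes) is needed for small samples.
        System.out.println(withinTwoStdDev(List.of(100, 105, 102, 13, 104, 22, 101)));
    }
}
```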
As Joni already pointed out, you can eliminate outliers with the help of the standard deviation and mean. Here is my code, which you can use for your purposes.
public static void main(String[] args) {
    List<Integer> values = new ArrayList<>();
    values.add(100);
    values.add(105);
    values.add(102);
    values.add(13);
    values.add(104);
    values.add(22);
    values.add(101);
    System.out.println("Before: " + values);
    System.out.println("After: " + eliminateOutliers(values, 1.5f));
}

protected static double getMean(List<Integer> values) {
    int sum = 0;
    for (int value : values) {
        sum += value;
    }
    return (double) sum / values.size(); // cast avoids integer division
}

public static double getVariance(List<Integer> values) {
    double mean = getMean(values);
    double temp = 0; // double, so the division below keeps its fraction
    for (int a : values) {
        temp += (a - mean) * (a - mean);
    }
    return temp / (values.size() - 1);
}

public static double getStdDev(List<Integer> values) {
    return Math.sqrt(getVariance(values));
}

public static List<Integer> eliminateOutliers(List<Integer> values, float scaleOfElimination) {
    double mean = getMean(values);
    double stdDev = getStdDev(values);
    final List<Integer> newList = new ArrayList<>();
    for (int value : values) {
        boolean isLessThanLowerBound = value < mean - stdDev * scaleOfElimination;
        boolean isGreaterThanUpperBound = value > mean + stdDev * scaleOfElimination;
        boolean isOutOfBounds = isLessThanLowerBound || isGreaterThanUpperBound;
        if (!isOutOfBounds) {
            newList.add(value);
        }
    }
    int countOfOutliers = values.size() - newList.size();
    if (countOfOutliers == 0) {
        return values;
    }
    return eliminateOutliers(newList, scaleOfElimination);
}
The eliminateOutliers() method does all the work
It is a recursive method, which filters the list on every recursive call
The scaleOfElimination variable, which you pass to the method, defines at what scale
you want to remove outliers: normally I go with 1.5f-2f; the greater the value,
the fewer outliers will be removed
The output of the code:
Before: [100, 105, 102, 13, 104, 22, 101]
After: [100, 105, 102, 104, 101]
I'm very grateful to Valiyev, whose solution helped me a lot, and I want to share my small SRP-minded refactoring of his work.
Please note that I use List.of() to store Dixon's critical values, so Java 9 or higher is required.
import java.util.List;
import java.util.stream.Collectors;

public class DixonTest {

    protected List<Double> criticalValues =
            List.of(0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437);
    private double scaleOfElimination;
    private double mean;
    private double stdDev;

    private double getMean(final List<Double> input) {
        double sum = input.stream()
                .mapToDouble(value -> value)
                .sum();
        return sum / input.size();
    }

    private double getVariance(List<Double> input) {
        double mean = getMean(input);
        double temp = input.stream()
                .mapToDouble(a -> a)
                .map(a -> (a - mean) * (a - mean))
                .sum();
        return temp / (input.size() - 1);
    }

    private double getStdDev(List<Double> input) {
        return Math.sqrt(getVariance(input));
    }

    protected List<Double> eliminateOutliers(List<Double> input) {
        int N = input.size() - 3;
        scaleOfElimination = criticalValues.get(N).floatValue();
        mean = getMean(input);
        stdDev = getStdDev(input);
        return input.stream()
                .filter(this::isOutOfBounds)
                .collect(Collectors.toList());
    }

    private boolean isOutOfBounds(Double value) {
        return !(isLessThanLowerBound(value)
                || isGreaterThanUpperBound(value));
    }

    private boolean isGreaterThanUpperBound(Double value) {
        return value > mean + stdDev * scaleOfElimination;
    }

    private boolean isLessThanLowerBound(Double value) {
        return value < mean - stdDev * scaleOfElimination;
    }
}
I hope it will help someone else.
Best regards
Thanks to @Emil_Wozniak for posting the complete code. I struggled with it for a while, not realizing that eliminateOutliers() actually returns the outliers, not the list with them eliminated. The isOutOfBounds() method was also confusing because it actually returns TRUE when the value is IN bounds. Below is my update with some (IMHO) improvements:
The eliminateOutliers() method returns the input list with outliers removed
Added getOutliers() method to get just the list of outliers
Removed confusing isOutOfBounds() method in favor of a simple filtering expression
Expanded N list to support up to 30 input values
Protect against out of bounds errors when input list is too big or too small
Made stats methods (mean, stddev, variance) static utility methods
Calculate upper/lower bounds only once instead of on every comparison
Supply input list on ctor and store as an instance variable
Refactor to avoid using the same variable name as instance and local variables
Code:
/**
 * Implements an outlier removal algorithm based on
 * https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/dixon.htm
 * Original Java code by Emil Wozniak at
 * https://stackoverflow.com/questions/18805178/how-to-detect-outliers-in-an-arraylist
 *
 * Reorganized, made more robust, and clarified many of the methods.
 */
import java.util.List;
import java.util.stream.Collectors;

public class DixonTest {

    protected List<Double> criticalValues =
            List.of( // Taken from https://sebastianraschka.com/Articles/2014_dixon_test.html#2-calculate-q
                    // Alfa level of 0.1 (90% confidence)
                    0.941, // N=3
                    0.765, // N=4
                    0.642, // ...
                    0.56,
                    0.507,
                    0.468,
                    0.437,
                    0.412,
                    0.392,
                    0.376,
                    0.361,
                    0.349,
                    0.338,
                    0.329,
                    0.32,
                    0.313,
                    0.306,
                    0.3,
                    0.295,
                    0.29,
                    0.285,
                    0.281,
                    0.277,
                    0.273,
                    0.269,
                    0.266,
                    0.263,
                    0.26 // N=30
            );

    // Stats calculated on original input data (including outliers)
    private double scaleOfElimination;
    private double mean;
    private double stdDev;
    private double UB;
    private double LB;
    private List<Double> input;

    /**
     * Ctor taking a list of values to be analyzed.
     * @param input
     */
    public DixonTest(List<Double> input) {
        this.input = input;
        // Create statistics on the original input data
        calcStats();
    }

    /**
     * Utility method returns the mean of a list of values.
     * @param valueList
     * @return
     */
    public static double getMean(final List<Double> valueList) {
        double sum = valueList.stream()
                .mapToDouble(value -> value)
                .sum();
        return sum / valueList.size();
    }

    /**
     * Utility method returns the variance of a list of values.
     * @param valueList
     * @return
     */
    public static double getVariance(List<Double> valueList) {
        double listMean = getMean(valueList);
        double temp = valueList.stream()
                .mapToDouble(a -> a)
                .map(a -> (a - listMean) * (a - listMean))
                .sum();
        return temp / (valueList.size() - 1);
    }

    /**
     * Utility method returns the std deviation of a list of values.
     * @param valueList
     * @return
     */
    public static double getStdDev(List<Double> valueList) {
        return Math.sqrt(getVariance(valueList));
    }

    /**
     * Calculate statistics and bounds from the input values and store
     * them in class variables.
     */
    private void calcStats() {
        // Clamped to protect against too-small or too-large lists
        int N = Math.min(Math.max(0, input.size() - 3), criticalValues.size() - 1);
        scaleOfElimination = criticalValues.get(N).floatValue();
        mean = getMean(input);
        stdDev = getStdDev(input);
        UB = mean + stdDev * scaleOfElimination;
        LB = mean - stdDev * scaleOfElimination;
    }

    /**
     * Returns the input values with outliers removed.
     * @return
     */
    public List<Double> eliminateOutliers() {
        return input.stream()
                .filter(value -> value >= LB && value <= UB)
                .collect(Collectors.toList());
    }

    /**
     * Returns the outliers found in the input list.
     * @return
     */
    public List<Double> getOutliers() {
        return input.stream()
                .filter(value -> value < LB || value > UB)
                .collect(Collectors.toList());
    }

    /**
     * Test and sample usage
     * @param args
     */
    public static void main(String[] args) {
        List<Double> testValues = List.of(1200.0, 1205.0, 1220.0, 1194.0, 1212.0);
        DixonTest outlierDetector = new DixonTest(testValues);
        List<Double> goodValues = outlierDetector.eliminateOutliers();
        List<Double> badValues = outlierDetector.getOutliers();
        System.out.println(goodValues.size() + " good values:");
        for (double v : goodValues) {
            System.out.println(v);
        }
        System.out.println(badValues.size() + " outliers detected:");
        for (double v : badValues) {
            System.out.println(v);
        }
        // Get stats on remaining (good) values
        System.out.println("\nMean of good values is " + DixonTest.getMean(goodValues));
    }
}
Here is just a very simple implementation that collects the numbers which are not in range:
List<Integer> notInRangeNumbers = new ArrayList<Integer>();
for (Integer number : numbers) {
    // call with a predefined factor value, here example value = 5
    if (!isInRange(number, 5)) {
        notInRangeNumbers.add(number);
    }
}
Additionally, inside the isInRange method you have to define what you mean by 'good values'. Below you will find an example implementation.
private boolean isInRange(Integer number, int aroundFactor) {
    //TODO the implementation of the 'in range condition'
    // here the example implementation
    return number <= 100 + aroundFactor && number >= 100 - aroundFactor;
}
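Putting the two pieces together into something runnable (the "around 100" range and the factor of 5 are just example choices):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeFilter {

    // Example definition of "good values": within aroundFactor of 100
    private static boolean isInRange(int number, int aroundFactor) {
        return number <= 100 + aroundFactor && number >= 100 - aroundFactor;
    }

    // Collects the numbers that fall outside the good range
    static List<Integer> notInRange(List<Integer> numbers, int aroundFactor) {
        List<Integer> notInRangeNumbers = new ArrayList<>();
        for (int number : numbers) {
            if (!isInRange(number, aroundFactor)) {
                notInRangeNumbers.add(number);
            }
        }
        return notInRangeNumbers;
    }

    public static void main(String[] args) {
        System.out.println(notInRange(List.of(100, 105, 102, 13, 104, 22, 101), 5)); // [13, 22]
    }
}
```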