How to find outliers using avg and stddev? - java

I am having conflict filtering a Dataset<'Row> using the MEAN() and STDEV() built in functions in the org.apache.spark.sql.functions library.
This is the set of data I am working with (top 10):
Name Size Volumes
File1 1030 107529
File2 997 106006
File3 1546 112426
File4 2235 117335
File5 2061 115363
File6 1875 114015
File7 1237 110002
File8 1546 112289
File9 1030 107154
File10 1339 110276
What I am currently trying to do is find the outliers in this dataset. For that, I need to find the rows where the SIZE and VOLUMES are outliers using the 95% rule: μ - 2σ ≤ X ≤ μ + 2σ
This is the SQL-like query that I would like to run on this Dataset:
SELECT * FROM DATASET
WHERE size < (SELECT (AVG(size)-2STDEV(size)) FROM DATASET)
OR size > (SELECT (AVG(size)+2STDEV(size)) FROM DATASET)
OR volumes < (SELECT (AVG(volumes)-2STDEV(volumes)) FROM DATASET)
OR volumes > (SELECT (AVG(volumes)+2STDEV(volumes)) FROM DATASET)
I don't know how to implement nested queries and I'm struggling to find a way to solve this.
Also, if you happen to know other way of getting what I want, feel free to share it.
This is what I attempted to do but I get an error:
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));

You asked about Java that I'm currently not using at all, so here comes a Scala version that I hope might somehow help you to find a corresponding Java version.
What about the following solution?
// preparing the dataset
val input = spark.
read.
text("input.txt").
as[String].
filter(line => !line.startsWith("Name")).
map(_.split("\\W+")).
withColumn("name", $"value"(0)).
withColumn("size", $"value"(1) cast "int").
withColumn("volumes", $"value"(2) cast "int").
select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
groupBy().
agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
as[(Double, Double, Double, Double)].
head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
This is only a first part of the 4-part filter operator, and am leaving the rest as a home exercise.

Thanks for your comments guys.
So this solution is for the Java implementation of Spark. If you want the implementation of Scala, look at Jacek Laskowski post.
Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that I can finally use those values in the Column comparison functions.
Hope you understood what I did.

Related

Spark - how to get random unique rows

I need a way to get some x number of random rows from a dataset which are unique. I tried sample method of dataset class but it sometimes pick duplicate rows.
Dataset's sample method:
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/Dataset.html#sample-boolean-double-
Sample Function with withReplacement=>'false' would always pick distinct rows df1.sample(false, 0.1).show()
sample(boolean withReplacement, double fraction)
Consider below example:
where withReplacement => 'true' gave duplicate rows which can be verified by count, but withReplacement => 'false' did not.
import org.apache.spark.sql.functions._
val df1 = ((1 to 10000).toList).zip(((1 to 10000).map(x=>x*2))).toDF("col1", "col2")
// df1.sample(false, 0.1).show()
println("Sample Count for with Replacement : " + df1.sample(true, 0.1).count)
println("Sample Count for with Out Replacement : " + df1.sample(false, 0.1).count)
df1.sample(true, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
df1.sample(false, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
Sample Count for with Replacement : 978
Sample Count for with Out Replacement : 973
+----+-----+-----+
|col1| col2|count|
+----+-----+-----+
|7464|14928| 2|
|6080|12160| 2|
|6695|13390| 2|
|3393| 6786| 2|
|2137| 4274| 2|
+----+-----+-----+
only showing top 5 rows
+----+----+-----+
|col1|col2|count|
+----+----+-----+
+----+----+-----+
you should use sample function with withReplacement of false, for example, you can use:
val sampledData=df.sample(withReplacement=false,0.5)
but this is NOT guaranteed to provide exactly the fraction of the total count of your given Dataset.
for doing that, after you get your sampled data by sample function, take X entity of sampled data.

Searching and updating a Spark Dataset column with values from another Dataset

Java 8 and Spark 2.11:2.3.2 here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala so I will be able to understand any answers provided in it! But Java if at all possible (please)!
I have two datasets with different schema, with the exception of a common "model_number" (string) column: that exists on both.
For each row in my first Dataset (we'll call that d1), I need to scan/search the second Dataset ("d2") to see if there is a row with the same model_number, and if so, update another d2 column.
Here are my Dataset schemas:
d1
===========
model_number : string
desc : string
fizz : string
buzz : date
d2
===========
model_number : string
price : double
source : string
So again, if a d1 row has a model_number of , say, 12345, and a d2 row also has the same model_number, I want to update the d2.price by multiplying it by 10.0.
My best attempt thus far:
// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values
Dataset<Row> d3 = d1.join(d2, d1.col("model_number") == d2.col("model_number"));
// now I just need to update d2.price based on matching
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price") * 10.0);
Can anyone help me cross the finish line here? Thanks in advance!
Some points here, as #VamsiPrabhala mentioned in the comment, the function that you need to use is join on your specific fields. Regarding the "update", you need to take in mind that df, ds and rdd in spark are immutable, so you can not update them. So, the solution here is, after join your df's, you need to perform your calculation, in this case multiplication, in a select or using withColumn and then select. In other words, you can not update the column, but you can create the new df with the "new" column.
Example:
Input data:
+------------+------+------+----+
|model_number| desc| fizz|buzz|
+------------+------+------+----+
| model_a|desc_a|fizz_a|null|
| model_b|desc_b|fizz_b|null|
+------------+------+------+----+
+------------+-----+--------+
|model_number|price| source|
+------------+-----+--------+
| model_a| 10.0|source_a|
| model_b| 20.0|source_b|
+------------+-----+--------+
using join will output:
val joinedDF = d1.join(d2, "model_number")
joinedDF.show()
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null| 10.0|source_a|
| model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+
applying your calculation:
joinedDF.withColumn("price", col("price") * 10).show()
output:
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null| 100.0|source_a|
| model_b|desc_b|fizz_b|null| 200.0|source_b|
+------------+------+------+----+-----+--------+

How to group by in spark

I have below data smaple data but in real life this dataset is huge.
A B 1-1-2018 10
A B 2-1-2018 20
C D 1-1-2018 15
C D 2-1-2018 25
I need to group by above data using date and generate key pair values
1-1-2018->key
-----------------
A B 1-1-2018 10
C D 1-1-2018 15
2-1-2018->key
-----------------
A B 2-1-2018 20
C D 2-1-2018 25
Can anyone please tell me how can we do that in spark in best optimize way (using java if possible )
Not Java but looking at your code above it seems you wants recursively set your dataframes into sub-groups by Key. The best way I know how to do it is by a while loop and its not the easiest on the planet earth.
//You will also need to import all DataFrame and Array data types in Scala, don't know if you need to do it for Java for the below code.
//Inputting your DF, with columns as Value_1, Value_2, Key, Output_Amount
val inputDF = //DF From above
//Need to get an empty DF, I just like doing it this way
val testDF = spark.sql("select 'foo' as bar")
var arrayOfDataFrames = Array[DataFrame] = Array(testDF)
val arrayOfKeys = inputDF.selectExpr("Key").distinct.rdd.map(x=>x.mkString).collect
var keyIterator = 1
//Need to overwrite the foo bar first DF
arrayOfDataFrames = Array(inputDF.where($""===arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1
//loop through find the key and place it into the DataFrames array
while(keyIterator <= arrayOfKeys.length) {
arrayOfDataFrames = arrayOfDataFrames ++ Array(inputDF.where($"Key"===arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1
}
At the end of the command you will have two array of same length DataFrames and Keys that match. Meaning if you select the 3rd element of the Keys it matches the 3rd element of the DataFrames.
Since this isn't Java and doesn't directly answer your question, does this at least help push you in a direction that might help (I built it in Spark Scala).

Using a Java Stream instead of a for-loop skips code execution

Switching a bunch of for-loop code to use a parallel stream is apparently causing a certain part of the code to be ignored.
I'm using MOA and Weka with Java 11 to run a simple recommendation engine example, taking cues from the source code of moa.tasks.EvaluateOnlineRecomender, which uses MOA's internal task setup to test the accuracy of the Biased Regularized Incremental Simultaneous Matrix Factorization (BRISMF) implementation provided by MOA. Instead of using MOA's prepared MovielensDataset class, I switched over to Weka's Instances for prospects of applying Weka's ML tools.
The time it took to process about a million instances (I'm using the Movielens 1M dataset) was about 13-14 minutes. In a bid to see improvements, I wanted to run it on a parallel stream, and became suspicious when the task finished in about 40 seconds. I found that BRISMFPredictor.predictRating was always producing 0 within the parallel stream's body. Here's the code for either case:
Code for initialisation:
import com.github.javacliparser.FileOption;
import com.github.javacliparser.IntOption;
import moa.options.ClassOption;
import moa.recommender.predictor.BRISMFPredictor;
import moa.recommender.predictor.RatingPredictor;
import moa.recommender.rc.data.RecommenderData;
import weka.core.converters.CSVLoader;
...
private static ClassOption datasetOption;
private static ClassOption ratingPredictorOption;
private static IntOption sampleFrequencyOption;
private static FileOption defaultFileOption;
static {
ratingPredictorOption = new ClassOption("ratingPredictor",
's', "Rating Predictor to evaluate on.", RatingPredictor.class,
"moa.recommender.predictor.BRISMFPredictor");
sampleFrequencyOption = new IntOption("sampleFrequency",
'f', "How many instances between samples of the learning performance.", 100, 0, 2147483647);
defaultFileOption = new FileOption("file",
'f', "File to load.",
"C:\\Users\\shiva\\Documents\\Java-ML\\mlapp\\data\\ml-1m\\ratings.dat", "dat", false);
}
... and inside main() (a quirk with Weka's CSVLoader required that I replace the default :: delimiter with +)
var csvLoader = new CSVLoader();
csvLoader.setSource(defaultFileOption.getFile());
csvLoader.setFieldSeparator("+");
var dataset = csvLoader.getDataSet();
System.out.println(dataset.toSummaryString());
var predictor = new BRISMFPredictor();
predictor.prepareForUse();
RecommenderData data = predictor.getData();
data.clear();
data.disableUpdates(false);
Now, alternating between the following snippets:
for (var instance : dataset) {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
}
(Now being a noob in everything concurrent):
dataset.parallelStream().forEach(instance -> {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
});
Now I decide that heck, maybe this operation can't be done in parallel, and I switch it to use stream(). Even then, the segment seems to be completely ignored since the output is again 0.0 each time
dataset.stream().forEach(instance -> {
var user = (int) instance.value(0);
var item = (int) instance.value(1);
var rating = instance.value(2);
double predictedRating = predictor.predictRating(user, item);
System.out.printf("User %d | Movie %d | Actual Rating %d | Predicted Rating %f%n",
user, item, Math.round(rating), predictedRating);
});
I have tried removing the print statement from the run, but without avail.
Obviously, I get the expected output lines consisting of actual and predicted rating within about 13 minutes in the first case, but find that the predicted rating is 0.0 in the second case with suspiciously low execution time. Is there something I'm missing out on?
EDIT: using dataset.forEach() does the same thing. Perhaps a quirk of lambdas?

retrieve histogram from mssql table using java

I want to implement java application that can connect to any sql server and load any table from it. For each table I want to create histogram based on some arbitrary columns.
For example if I have this table
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create histogram with bin size 4 based on min and max value of profit column and count number of records for each bin.
result is:
profit count
---------------
12-16 3
16-20 1
My solution for this problem is retrieving all the data based on required columns and after that construct the bins and group by the records using java stream Collectors.groupingBy.
I'm not sure if my solution is optimized and for this I want some help to find the better algorithm specially when I have big number of records.(for example use some benefits of sql server or other frameworks that can be used.)
Can I use better algorithm for this issue?
edit 1:
assume my sql result is in List data
private String mySimpleHash(Object[] row, int index) {
StringBuilder hash = new StringBuilder();
for (int i = 0; i < row.length; i++)
if (i != index)
hash.append(row[i]).append(":");
return hash.toString();
}
//index is index of column for histogram
List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
Collectors.groupingBy(row -> mySimpleHash(Arrays.copyOfRange(row, index))));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
Object[] newRow = newData.get(rowNum);
double result = entry.getValue().stream()
.mapToDouble(row ->
Double.valueOf(row[index].toString())).count();
newRow[index] = result;
histogramData.add(newRow);
}
As you have considered, performing the aggregation after getting all the data out of SQL server is going to be very expensive if the number of rows in your tables increase. You can simply do the aggregation within SQL. Depending on how you are expressing your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at min value requires a little bit of setup as opposed to binning starting from 0. See sample below. The inner query is mapping values to a bin number, the outer query is aggregating and computing the bin boundaries.
CREATE TABLE Test (
Name varchar(max) NOT NULL,
Profit int NOT NULL
)
INSERT Test(Name, Profit)
VALUES
('name1', 12),
('name2', 14),
('name3', 18),
('name4', 13)
DECLARE #minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE #binSize int = 4
SELECT
(#minValue + #binSize * Bin) AS BinLow,
(#minValue + #binSize * Bin) + #binSize - 1 AS BinHigh,
COUNT(*) AS Count
FROM (
SELECT
((Profit - #minValue) / #binSize) AS Bin
FROM
Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9

Categories

Resources