aggregate function Count usage with groupBy in Spark - java

I'm trying to perform multiple operations in a single line of code in pySpark,
and I'm not sure whether that's possible in my case.
My intention is to avoid having to save intermediate output as a new dataframe.
My current code is rather simple:
encodeUDF = udf(encode_time, StringType())

new_log_df.cache() \
    .withColumn('timePeriod', encodeUDF(col('START_TIME'))) \
    .groupBy('timePeriod') \
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    ) \
    .show(20, False)
And I want to add count() after the groupBy, to get the number of records matching each value of the timePeriod column, printed/shown in the output.
When trying to use groupBy(..).count().agg(..) I get exceptions.
Is there any way to get both the count() and the agg().show() output, without splitting the code into two separate commands, e.g.:
new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or better yet, a merged output in the agg().show() result - an extra column stating the number of records matching each row's value, e.g.:
timePeriod | Mean | Stddev | Num Of Records
X | 10 | 20 | 315

count() can be used inside agg(), since the groupBy expression is the same.
With Python
import pyspark.sql.functions as func

new_log_df.cache() \
    .withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"])) \
    .groupBy("timePeriod") \
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    ) \
    .show(20, False)
pySpark SQL functions doc
With Scala
import org.apache.spark.sql.functions._ // for count()

new_log_df.cache()
  .withColumn("timePeriod", encodeUDF(col("START_TIME")))
  .groupBy("timePeriod")
  .agg(
    mean("DOWNSTREAM_SIZE").alias("Mean"),
    stddev("DOWNSTREAM_SIZE").alias("Stddev"),
    count(lit(1)).alias("Num Of Records")
  )
  .show(20, false)
count(lit(1)) counts every row in the group; here it gives the same result as count("timePeriod"), as long as timePeriod is never null.
With Java
import static org.apache.spark.sql.functions.*;

new_log_df.cache()
    .withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        mean("DOWNSTREAM_SIZE").alias("Mean"),
        stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        count(lit(1)).alias("Num Of Records")
    )
    .show(20, false);
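As a side note, here is a minimal sketch (reusing new_log_df and encodeUDF from the question) of the difference between the two count variants: count(lit(1)) counts every row in each group, while count on a specific column skips nulls, so the two only agree when the counted column contains no nulls.

import static org.apache.spark.sql.functions.*;

// Hypothetical comparison on the same grouped data; DOWNSTREAM_SIZE is assumed
// to possibly contain nulls, in which case the two counts below will differ.
new_log_df.withColumn("timePeriod", encodeUDF(col("START_TIME")))
    .groupBy("timePeriod")
    .agg(
        count(lit(1)).alias("All Rows"),                 // counts every row in the group
        count(col("DOWNSTREAM_SIZE")).alias("Non-null")  // ignores rows where DOWNSTREAM_SIZE is null
    )
    .show(20, false);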

Related

Snowflake Java UDF using BigInteger returning non-deterministic results

I created a Java UDF in Snowflake which takes in a bigint in SQL using a BigInteger in Java, then returns it as a string.
create or replace function pass_through_print(divisor bigint)
returns string
language java
handler='TestClass.pass_through_print'
as $$
import java.math.*;
class TestClass {
    public static String pass_through_print(BigInteger divisor) {
        return divisor.toString();
    }
}
$$;
I then ran the following tests
Test 1 (works as expected)
select pass_through_val, count(*)
from (
select
pass_through_print(9) as pass_through_val
from (
select seq8() as val
from table(generator(rowcount => 100000))
)
)
group by pass_through_val
Results (Works as expected)
PASS_THROUGH_VAL COUNT(*)
9 100,000
Test 2 (non-deterministic)
select pass_through_val, count(*)
from (
select
case
when val % 2 = 0 then null
else pass_through_print(9)
end as pass_through_val
from (
select seq8() as val
from table(generator(rowcount => 100000))
)
)
group by pass_through_val
Expected Result
PASS_THROUGH_VAL COUNT(*)
null 50,000
9 50,000
Actual results from multiple runs
The pass through result changes randomly.
PASS_THROUGH_VAL COUNT(*)
null 50,000
-25 50,000
PASS_THROUGH_VAL COUNT(*)
null 50,000
-1 50,000
PASS_THROUGH_VAL COUNT(*)
null 50,000
95 50,000
Changing BigInteger to int in the Java UDF works as expected for all tests
create or replace function pass_through_int_print(val int)
returns string
language java
handler='TestClass.pass_through_int_print'
as $$
import java.math.*;
class TestClass {
    public static String pass_through_int_print(int val) {
        return String.valueOf(val);
    }
}
$$;
It seems like the case statement in combination with BigInteger conversion is causing things to break. Does anyone know what could be happening?
UPDATE
Even weirder behaviour is that if I call pass_through_print outside of the case statement, it works as expected, and it also makes the one inside the case statement work too.
New Test
select pass_through_val_correct, pass_through_val, count(*)
from (
select
pass_through_print(123) as pass_through_val_correct, -- added this line
case
when val % 2 = 0 then null
else pass_through_print(9)
end as pass_through_val
from (
select seq8() as val
from table(generator(rowcount => 1000000))
)
)
group by pass_through_val_correct, pass_through_val
Result
PASS_THROUGH_VAL_CORRECT PASS_THROUGH_VAL COUNT(*)
123 null 500,000
123 9 500,000
When you change the query (calling pass_through_print outside of the case statement), the execution plan changes and the function is executed as a stand-alone operation (ExtensionFunction). In your original query, the function is treated as a volatile function and executed as part of the projection operation (a sub-step of Aggregation).
Somehow, calling the function as part of the projection step breaks the BigInteger handling (other types or libraries might be affected as well). Please submit a case to Snowflake support; this looks like a bug.
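Until it is fixed, one possible workaround is to avoid BigInteger in the handler signature altogether. This is an untested sketch, assuming a Java long mapping is acceptable for the bigint values you actually pass in; it mirrors the int-based variant from the question, which already behaves correctly:

create or replace function pass_through_long_print(divisor bigint)
returns string
language java
handler='TestClass.pass_through_long_print'
as $$
class TestClass {
    // Same pass-through behaviour, but without the BigInteger conversion
    // that appears to break inside the projection step.
    public static String pass_through_long_print(long divisor) {
        return String.valueOf(divisor);
    }
}
$$;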

Spark - how to get random unique rows

I need a way to get x random rows from a dataset, without duplicates. I tried the sample method of the Dataset class, but it sometimes picks duplicate rows.
Dataset's sample method:
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/Dataset.html#sample-boolean-double-
The sample function with withReplacement => false always picks distinct rows, e.g. df1.sample(false, 0.1).show()
sample(boolean withReplacement, double fraction)
Consider the example below, where withReplacement => true produces duplicate rows (verifiable via the grouped counts), while withReplacement => false does not.
import org.apache.spark.sql.functions._
val df1 = ((1 to 10000).toList).zip(((1 to 10000).map(x=>x*2))).toDF("col1", "col2")
// df1.sample(false, 0.1).show()
println("Sample Count for with Replacement : " + df1.sample(true, 0.1).count)
println("Sample Count for with Out Replacement : " + df1.sample(false, 0.1).count)
df1.sample(true, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
df1.sample(false, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
Sample Count for with Replacement : 978
Sample Count for with Out Replacement : 973
+----+-----+-----+
|col1| col2|count|
+----+-----+-----+
|7464|14928| 2|
|6080|12160| 2|
|6695|13390| 2|
|3393| 6786| 2|
|2137| 4274| 2|
+----+-----+-----+
only showing top 5 rows
+----+----+-----+
|col1|col2|count|
+----+----+-----+
+----+----+-----+
You should use the sample function with withReplacement = false; for example:
val sampledData=df.sample(withReplacement=false,0.5)
but this is NOT guaranteed to return exactly that fraction of the total row count of your Dataset.
To get an exact number of rows, take X rows from the result after calling sample (see the sketch below).
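For example, a sketch in Java, assuming an existing Dataset<Row> df and a desired row count x (the 1.5 oversampling factor is an arbitrary safety margin):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Draw a without-replacement sample slightly larger than needed, then trim to
// exactly x rows. sample() is approximate, so oversample a little and limit().
long total = df.count();
int x = 100;  // desired number of unique rows
double fraction = Math.min(1.0, 1.5 * x / (double) total);

Dataset<Row> sampled = df.sample(false, fraction)  // withReplacement = false -> no duplicate picks
                         .limit(x);                // keep at most x rows (may be fewer if the sample is small)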

Can I compare resultsets like this? I'm facing the below error

I have 2 ResultSets. The 1st ResultSet contains the records from table1 in database1 and the 2nd ResultSet contains the records from table2 in database2. I need a list of the records from resultSet1 that are not present in resultSet2. I wrote the logic below for this, but it is not working and throws the following error.
java.sql.SQLException: Invalid operation for read only resultset: deleteRow
if ( table1ResultSet != null )
{
    while ( table1ResultSet.next() )
    {
        final String table1Record = table1ResultSet.getString( 1 );
        if ( table2ResultSet != null )
        {
            while ( table2ResultSet.next() )
            {
                final String table2Record = table2ResultSet.getString( 1 );
                if ( table1Record.toString().equalsIgnoreCase( table2Record.toString() ) )
                {
                    table1ResultSet.deleteRow();
                    break;
                }
            }
        }
    }
}
return table1ResultSet;
That exception says what the problem is: your result set doesn't support deletes. To get an updatable result set there are some requirements:
When you prepared the statement, did you create it with ResultSet.CONCUR_UPDATABLE? (See the sketch after this list.)
The query can select from only a single table, without any join operations.
The query must select all non-nullable columns and all columns that do not have a default value, cannot use SELECT *, and cannot select derived columns or aggregates such as the SUM or MAX of a set of columns.
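For reference, a minimal JDBC sketch of requesting an updatable result set; the url, user and password variables are placeholders, and whether deleteRow() then works still depends on the driver and on the query shape described above:

import java.sql.*;

// inside a method that throws SQLException
try (Connection conn = DriverManager.getConnection(url, user, password);
     Statement stmt = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE,
                                           ResultSet.CONCUR_UPDATABLE);
     ResultSet rs = stmt.executeQuery("SELECT id FROM table1")) {
    // rs.deleteRow() is now permitted by the API, subject to the driver's rules.
}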
You might want to move the result sets into Java Sets before doing what you are doing, though, because deleteRow will actually delete the row from the underlying database (unless that's the expected result).
There is another problem with your code: even if the delete worked, it would fail from the second iteration over result set 1 onward, because you never reset table2ResultSet, so there are no rows left in it for the second pass.
But on top of all that, why go through all that hassle and fetch rows you don't need, instead of doing it with one single query like:
select * from table1 where id not in (select id from table2)
or
delete from table1 where id not in (select id from table2)
if that's the goal.
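A JDBC sketch of that single-query approach, assuming both tables are reachable from one connection (e.g. two databases on the same server); the connection variable and the qualified table names are placeholders:

// inside a method that throws SQLException
String sql = "SELECT id FROM database1.table1 WHERE id NOT IN (SELECT id FROM database2.table2)";
try (Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery(sql)) {
    while (rs.next()) {
        System.out.println(rs.getString(1));  // values present in table1 but missing from table2
    }
}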
Your logic:
Assumes the records come in some order (which may or may not be true, depending on your SQL)
Consumes the entire result set 2 for each row of result set 1, which is unlikely to be your intent
Deletes things, which is also not what you mentioned in the question
Your question can be implemented easily as such:
Set<String> list1 = new HashSet<>();
while (table1ResultSet.next())
    list1.add(table1ResultSet.getString(1).toLowerCase());
while (table2ResultSet.next())
    list1.remove(table2ResultSet.getString(1).toLowerCase());
System.out.println(list1);
This will print all the values (without duplicates) that are present in the first result set, but not in the second.

retrieve histogram from mssql table using java

I want to implement a Java application that can connect to any SQL Server instance and load any table from it. For each table I want to create a histogram based on arbitrary columns.
For example if I have this table
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create a histogram with bin size 4, based on the min and max values of the profit column, and count the number of records in each bin.
result is:
profit count
---------------
12-16 3
16-20 1
My current solution is to retrieve all the data for the required columns and then construct the bins and group the records using the Java stream collector Collectors.groupingBy.
I'm not sure my solution is efficient, so I would like help finding a better algorithm, especially when there is a large number of records (for example, by taking advantage of SQL Server itself or of other frameworks).
Is there a better algorithm for this?
edit 1:
Assume my SQL result is in List<Object[]> data:
// requires java.util.* and java.util.stream.Collectors
// index is the index of the column used for the histogram

private String mySimpleHash(Object[] row, int index) {
    // Build a grouping key from every column except the histogram column.
    StringBuilder hash = new StringBuilder();
    for (int i = 0; i < row.length; i++)
        if (i != index)
            hash.append(row[i]).append(":");
    return hash.toString();
}

List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
        Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
    // Take one row of the group as a template and replace the histogram
    // column with the number of rows in the group.
    Object[] newRow = entry.getValue().get(0).clone();
    double result = entry.getValue().stream()
            .mapToDouble(row -> Double.valueOf(row[index].toString()))
            .count();
    newRow[index] = result;
    histogramData.add(newRow);
}
As you have considered, performing the aggregation after getting all the data out of SQL server is going to be very expensive if the number of rows in your tables increase. You can simply do the aggregation within SQL. Depending on how you are expressing your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at min value requires a little bit of setup as opposed to binning starting from 0. See sample below. The inner query is mapping values to a bin number, the outer query is aggregating and computing the bin boundaries.
CREATE TABLE Test (
    Name varchar(max) NOT NULL,
    Profit int NOT NULL
)

INSERT Test(Name, Profit)
VALUES
    ('name1', 12),
    ('name2', 14),
    ('name3', 18),
    ('name4', 13)

DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4

SELECT
    (@minValue + @binSize * Bin) AS BinLow,
    (@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
    COUNT(*) AS Count
FROM (
    SELECT
        ((Profit - @minValue) / @binSize) AS Bin
    FROM
        Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9
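Since the question asks for a Java application, here is a rough JDBC sketch of running that kind of binning query from Java against SQL Server. The jdbcUrl, user and password variables are placeholders, and the Test/Profit names mirror the example above:

import java.sql.*;

// inside a method that throws SQLException
int binSize = 4;
try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password)) {
    // First round trip: let SQL Server compute the minimum of the column.
    int minValue;
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT MIN(Profit) FROM Test")) {
        rs.next();
        minValue = rs.getInt(1);
    }
    // Second round trip: aggregate into bins on the server side.
    String sql =
        "SELECT (" + minValue + " + " + binSize + " * Bin) AS BinLow, "
      + "       (" + minValue + " + " + binSize + " * Bin) + " + binSize + " - 1 AS BinHigh, "
      + "       COUNT(*) AS Count "
      + "FROM (SELECT ((Profit - " + minValue + ") / " + binSize + ") AS Bin FROM Test) AS t "
      + "GROUP BY Bin";
    try (Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(sql)) {
        while (rs.next()) {
            System.out.printf("%d-%d: %d%n",
                    rs.getInt("BinLow"), rs.getInt("BinHigh"), rs.getInt("Count"));
        }
    }
}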

How to find outliers using avg and stddev?

I am having trouble filtering a Dataset<Row> using the mean() and stddev() built-in functions from the org.apache.spark.sql.functions library.
This is the set of data I am working with (top 10):
Name Size Volumes
File1 1030 107529
File2 997 106006
File3 1546 112426
File4 2235 117335
File5 2061 115363
File6 1875 114015
File7 1237 110002
File8 1546 112289
File9 1030 107154
File10 1339 110276
What I am currently trying to do is find the outliers in this dataset. For that, I need to find the rows where Size or Volumes is an outlier according to the 95% rule: μ - 2σ ≤ X ≤ μ + 2σ
This is the SQL-like query that I would like to run on this Dataset:
SELECT * FROM DATASET
WHERE size < (SELECT AVG(size) - 2 * STDDEV(size) FROM DATASET)
OR size > (SELECT AVG(size) + 2 * STDDEV(size) FROM DATASET)
OR volumes < (SELECT AVG(volumes) - 2 * STDDEV(volumes) FROM DATASET)
OR volumes > (SELECT AVG(volumes) + 2 * STDDEV(volumes) FROM DATASET)
I don't know how to implement nested queries and I'm struggling to find a way to solve this.
Also, if you happen to know other way of getting what I want, feel free to share it.
This is what I attempted to do but I get an error:
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));
You asked about Java, which I'm currently not using at all, so here is a Scala version that I hope helps you find a corresponding Java version.
What about the following solution?
// preparing the dataset
val input = spark.
  read.
  text("input.txt").
  as[String].
  filter(line => !line.startsWith("Name")).
  map(_.split("\\W+")).
  withColumn("name", $"value"(0)).
  withColumn("size", $"value"(1) cast "int").
  withColumn("volumes", $"value"(2) cast "int").
  select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
  groupBy().
  agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
  as[(Double, Double, Double, Double)].
  head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
This is only the first part of the 4-part filter condition; I'm leaving the rest as a home exercise.
Thanks for your comments, guys.
This solution is for the Java implementation of Spark. If you want the Scala implementation, look at Jacek Laskowski's post.
Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that I can finally use those values in the Column comparison functions.
Hope you understood what I did.
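For completeness, the final filtering step hinted at above might look like this (a sketch reusing dataFrame and the four computed values; the Size and Records column names are the ones from this dataset):

import static org.apache.spark.sql.functions.col;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Flag a row as an outlier when Size or Records falls outside mean +/- 2 * stddev.
Column sizeOutlier = col("Size").lt(sizeMean - 2 * sizeStdev)
        .or(col("Size").gt(sizeMean + 2 * sizeStdev));
Column recordsOutlier = col("Records").lt(recordsMean - 2 * recordsStdev)
        .or(col("Records").gt(recordsMean + 2 * recordsStdev));

Dataset<Row> outliers = dataFrame.where(sizeOutlier.or(recordsOutlier));
outliers.show(false);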
