I need a way to get x random rows from a Dataset such that the rows are unique. I tried the sample method of the Dataset class, but it sometimes picks duplicate rows.
Dataset's sample method:
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/Dataset.html#sample-boolean-double-
The sample function with withReplacement set to false never picks the same input row twice:
df1.sample(false, 0.1).show()
The signature is:
sample(boolean withReplacement, double fraction)
Consider the example below, where withReplacement = true produced duplicate rows (verified by grouping and counting), but withReplacement = false did not.
import org.apache.spark.sql.functions._
val df1 = ((1 to 10000).toList).zip(((1 to 10000).map(x=>x*2))).toDF("col1", "col2")
// df1.sample(false, 0.1).show()
println("Sample Count for with Replacement : " + df1.sample(true, 0.1).count)
println("Sample Count for with Out Replacement : " + df1.sample(false, 0.1).count)
df1.sample(true, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
df1.sample(false, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
Sample Count for with Replacement : 978
Sample Count for with Out Replacement : 973
+----+-----+-----+
|col1| col2|count|
+----+-----+-----+
|7464|14928| 2|
|6080|12160| 2|
|6695|13390| 2|
|3393| 6786| 2|
|2137| 4274| 2|
+----+-----+-----+
only showing top 5 rows
+----+----+-----+
|col1|col2|count|
+----+----+-----+
+----+----+-----+
You should use the sample function with withReplacement set to false, for example:
val sampledData=df.sample(withReplacement=false,0.5)
However, this is NOT guaranteed to return exactly that fraction of the total row count of your Dataset.
To get exactly X rows, sample a fraction comfortably larger than X / count, then take X rows from the sampled data (for example with limit(X)).
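The "sample, then take X" idea can be sketched outside Spark in plain Python. This is only an illustration, not Spark's implementation: it assumes each row is kept independently with probability `fraction` (which is how I understand sampling without replacement to behave), and the name `sample_exactly` is made up for this example.

```python
import random

def sample_exactly(rows, x, fraction, seed=None):
    """Keep each row with probability `fraction`, then return at most x rows."""
    rng = random.Random(seed)
    # Every input row is considered once, so the same row can never be
    # picked twice, but the number of kept rows varies from run to run.
    sampled = [row for row in rows if rng.random() < fraction]
    # Truncate to x rows; fewer than x come back if the sample was too small.
    return sampled[:x]

rows = [(i, i * 2) for i in range(1, 10001)]
picked = sample_exactly(rows, x=500, fraction=0.1, seed=42)
print(len(picked), len(set(picked)))
```

Choosing `fraction` well above `x / len(rows)` makes it very unlikely that fewer than `x` rows survive the sampling step. Note that if the input itself contains duplicate rows, the result can still contain equal rows.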
Java 8 and Spark 2.11:2.3.2 here. Although I would greatly prefer Java API answers, I do speak a wee bit of Scala so I will be able to understand any answers provided in it! But Java if at all possible (please)!
I have two datasets with different schema, with the exception of a common "model_number" (string) column: that exists on both.
For each row in my first Dataset (we'll call that d1), I need to scan/search the second Dataset ("d2") to see if there is a row with the same model_number, and if so, update another d2 column.
Here are my Dataset schemas:
d1
===========
model_number : string
desc : string
fizz : string
buzz : date
d2
===========
model_number : string
price : double
source : string
So again, if a d1 row has a model_number of, say, 12345, and a d2 row has the same model_number, I want to update d2.price by multiplying it by 10.0.
My best attempt thus far:
// I *think* this would give me a 3rd dataset with all d1 and d2 columns, but only
// containing rows from d1 and d2 that have matching 'model_number' values
// (Column comparison in Java needs equalTo(); == compares object references)
Dataset<Row> d3 = d1.join(d2, d1.col("model_number").equalTo(d2.col("model_number")));
// now I just need to update d2.price based on matching
// (Java has no operator overloading, so use multiply() instead of *)
Dataset<Row> d4 = d3.withColumn("adjusted_price", d3.col("price").multiply(10.0));
Can anyone help me cross the finish line here? Thanks in advance!
A few points. As @VamsiPrabhala mentioned in the comments, what you need is a join on the shared column. Regarding the "update": DataFrames, Datasets and RDDs in Spark are immutable, so you cannot update them in place. Instead, after joining your DataFrames, perform the calculation (here, a multiplication) in a select, or with withColumn followed by a select. In other words, you cannot update the column, but you can create a new DataFrame containing the "new" column.
Example:
Input data:
+------------+------+------+----+
|model_number| desc| fizz|buzz|
+------------+------+------+----+
| model_a|desc_a|fizz_a|null|
| model_b|desc_b|fizz_b|null|
+------------+------+------+----+
+------------+-----+--------+
|model_number|price| source|
+------------+-----+--------+
| model_a| 10.0|source_a|
| model_b| 20.0|source_b|
+------------+-----+--------+
using join will output:
val joinedDF = d1.join(d2, "model_number")
joinedDF.show()
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null| 10.0|source_a|
| model_b|desc_b|fizz_b|null| 20.0|source_b|
+------------+------+------+----+-----+--------+
applying your calculation:
joinedDF.withColumn("price", col("price") * 10).show()
output:
+------------+------+------+----+-----+--------+
|model_number| desc| fizz|buzz|price| source|
+------------+------+------+----+-----+--------+
| model_a|desc_a|fizz_a|null|100.0|source_a|
| model_b|desc_b|fizz_b|null|200.0|source_b|
+------------+------+------+----+-----+--------+
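The same "join, then derive a new column instead of updating" pattern can be sketched in plain Python with lists of dicts. This is just an illustration of the idea under the sample data above; `join_and_scale` is a hypothetical helper, not a Spark API.

```python
d1 = [
    {"model_number": "model_a", "desc": "desc_a", "fizz": "fizz_a", "buzz": None},
    {"model_number": "model_b", "desc": "desc_b", "fizz": "fizz_b", "buzz": None},
]
d2 = [
    {"model_number": "model_a", "price": 10.0, "source": "source_a"},
    {"model_number": "model_b", "price": 20.0, "source": "source_b"},
]

def join_and_scale(left, right, key, factor):
    """Inner-join two row lists on `key` and return new rows with a scaled price."""
    # Index the right side by key so each left row is matched in O(1).
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    result = []
    for l_row in left:
        for r_row in index.get(l_row[key], []):
            # Build a brand-new row; the input rows are never modified.
            merged = {**l_row, **r_row}
            merged["price"] = r_row["price"] * factor
            result.append(merged)
    return result

joined = join_and_scale(d1, d2, "model_number", 10)
for row in joined:
    print(row["model_number"], row["price"])
```

The inputs `d1` and `d2` are left untouched, which mirrors how Spark returns a new DataFrame rather than changing the existing one.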
I have a dataset in Spark with two columns: a string column (the first 4 characters are the year, and the remaining characters are a word) and an integer column. An example row: "2004-dog" 45. I don't know how to print the first ten rows for every year. This is as far as I got:
JavaRDD<String> mentions =
tweets.flatMap(s -> Arrays.asList(s.split(":")).iterator());
JavaPairRDD<String, Integer> counts =
mentions.mapToPair(mention -> new Tuple2<>(mention, 1))
.reduceByKey((x, y) -> x + y);
Here is just an example:
Input Data:
+-------+---+
| year|cnt|
+-------+---+
|2015:04| 50|
|2015:04| 40|
|2015:04| 50|
|2017:04| 55|
|2017:04| 20|
|2017:04| 20|
+-------+---+
Assuming you have some criterion for picking the top 10, create a window specification:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val w = Window.partitionBy("year").orderBy(desc("cnt"))
df.withColumn("year", split('year,":")(0))
.withColumn("rank", row_number.over(w)) //use rank/dense_rank as you need
.filter('rank <= 2) //use 10 here to get the top ten
//.drop("rank") //you can drop rank if you want
.show()
Result:
+----+---+----+
|year|cnt|rank|
+----+---+----+
|2017| 55| 1|
|2017| 20| 2|
|2015| 50| 1|
|2015| 50| 2|
+----+---+----+
Hope this helps!
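For readers who want to see what the window computes, here is a rough plain-Python equivalent of "partition by year, order by cnt descending, keep the first n rows". It is a sketch of the logic only, using the sample input above; `top_n_per_year` is a name invented for this example.

```python
from collections import defaultdict

rows = [("2015:04", 50), ("2015:04", 40), ("2015:04", 50),
        ("2017:04", 55), ("2017:04", 20), ("2017:04", 20)]

def top_n_per_year(rows, n):
    """Return (year, cnt, rank) for the n largest counts within each year."""
    by_year = defaultdict(list)
    for key, cnt in rows:
        year = key.split(":")[0]      # same as split('year, ":")(0)
        by_year[year].append(cnt)     # the "partition by year" step
    result = []
    for year in sorted(by_year):
        ranked = sorted(by_year[year], reverse=True)   # order by cnt desc
        # enumerate from 1 behaves like row_number: ties get distinct ranks.
        for rank, cnt in enumerate(ranked[:n], start=1):
            result.append((year, cnt, rank))
    return result

top = top_n_per_year(rows, 2)
for row in top:
    print(row)
```

Unlike the Spark output, this sketch returns the years in ascending order; the rows kept per year are the same.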
I want to implement a Java application that can connect to any SQL server and load any table from it. For each table, I want to create a histogram based on some arbitrary columns.
For example, if I have this table:
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create a histogram with bin size 4, based on the min and max values of the profit column, and count the number of records in each bin.
The result is:
profit count
---------------
12-16 3
16-20 1
My current solution retrieves all the data for the required columns, then constructs the bins and groups the records using the Java stream Collectors.groupingBy.
I'm not sure my solution is optimal, so I'd like help finding a better algorithm, especially for a large number of records (for example, by pushing work to SQL Server or using another framework).
Is there a better algorithm for this?
Edit 1:
Assume my SQL result is in List<Object[]> data.
private String mySimpleHash(Object[] row, int index) {
StringBuilder hash = new StringBuilder();
for (int i = 0; i < row.length; i++)
if (i != index)
hash.append(row[i]).append(":");
return hash.toString();
}
//index is index of column for histogram
List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
Object[] newRow = newData.get(rowNum);
double result = entry.getValue().stream()
.mapToDouble(row ->
Double.valueOf(row[index].toString())).count();
newRow[index] = result;
histogramData.add(newRow);
}
As you suspected, performing the aggregation after pulling all the data out of SQL Server gets very expensive as the number of rows in your tables increases. You can simply do the aggregation within SQL. Depending on how you define your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at the minimum value needs a little setup, as opposed to binning from 0. See the sample below: the inner query maps each value to a bin number, and the outer query aggregates and computes the bin boundaries.
CREATE TABLE Test (
Name varchar(max) NOT NULL,
Profit int NOT NULL
)
INSERT Test(Name, Profit)
VALUES
('name1', 12),
('name2', 14),
('name3', 18),
('name4', 13)
DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4
SELECT
(@minValue + @binSize * Bin) AS BinLow,
(@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
COUNT(*) AS Count
FROM (
SELECT
((Profit - @minValue) / @binSize) AS Bin
FROM
Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9
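The bin arithmetic in the query is easy to check outside the database. Here is a small plain-Python sketch of the same computation, assuming integer values as in the Profit column; `histogram` is a name made up for this illustration.

```python
from collections import Counter

def histogram(values, bin_size):
    """Return (bin_low, bin_high, count) tuples, with bins starting at min(values)."""
    low = min(values)
    # Integer division maps each value to a bin number, like the inner query.
    counts = Counter((v - low) // bin_size for v in values)
    # Compute inclusive bin boundaries per bin, like the outer query.
    return [
        (low + bin_size * b, low + bin_size * b + bin_size - 1, counts[b])
        for b in sorted(counts)
    ]

bins = histogram([12, 14, 18, 13], 4)
for bin_low, bin_high, count in bins:
    print(bin_low, bin_high, count)
```

As in the SQL version, empty bins do not appear in the result, because only bin numbers that actually occur are grouped.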
I am having trouble filtering a Dataset<Row> using the mean() and stddev() built-in functions in the org.apache.spark.sql.functions library.
This is the set of data I am working with (top 10):
Name Size Volumes
File1 1030 107529
File2 997 106006
File3 1546 112426
File4 2235 117335
File5 2061 115363
File6 1875 114015
File7 1237 110002
File8 1546 112289
File9 1030 107154
File10 1339 110276
What I am currently trying to do is find the outliers in this dataset. For that, I need to find the rows where the SIZE and VOLUMES are outliers using the 95% rule: μ - 2σ ≤ X ≤ μ + 2σ
This is the SQL-like query that I would like to run on this Dataset:
SELECT * FROM DATASET
WHERE size < (SELECT (AVG(size)-2*STDEV(size)) FROM DATASET)
OR size > (SELECT (AVG(size)+2*STDEV(size)) FROM DATASET)
OR volumes < (SELECT (AVG(volumes)-2*STDEV(volumes)) FROM DATASET)
OR volumes > (SELECT (AVG(volumes)+2*STDEV(volumes)) FROM DATASET)
I don't know how to implement nested queries and I'm struggling to find a way to solve this.
Also, if you happen to know other way of getting what I want, feel free to share it.
This is what I attempted to do but I get an error:
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));
You asked about Java, which I don't currently use at all, so here is a Scala version that I hope will help you find the corresponding Java version.
What about the following solution?
// preparing the dataset
val input = spark.
read.
text("input.txt").
as[String].
filter(line => !line.startsWith("Name")).
map(_.split("\\W+")).
withColumn("name", $"value"(0)).
withColumn("size", $"value"(1) cast "int").
withColumn("volumes", $"value"(2) cast "int").
select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
groupBy().
agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
as[(Double, Double, Double, Double)].
head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
This is only the first part of the four-part filter; I am leaving the rest as a home exercise.
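To show the shape of the complete four-part filter, here is a plain-Python sketch of the same two-step approach: compute the aggregates first, then filter with the resulting plain numbers. It uses the sample standard deviation and the ten rows from the question; `bounds` and `find_outliers` are names invented for this illustration, not Spark functions.

```python
from statistics import mean, stdev

def bounds(values, k=2.0):
    """Return (mu - k*sigma, mu + k*sigma) using the sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)
    return mu - k * sigma, mu + k * sigma

def find_outliers(rows, k=2.0):
    """Keep rows whose size or volumes fall outside mu +/- k*sigma."""
    # Step 1: aggregate each column once, up front.
    size_lo, size_hi = bounds([r[1] for r in rows], k)
    vol_lo, vol_hi = bounds([r[2] for r in rows], k)
    # Step 2: filter with all four comparisons, joined by OR.
    return [
        r for r in rows
        if r[1] < size_lo or r[1] > size_hi or r[2] < vol_lo or r[2] > vol_hi
    ]

rows = [
    ("File1", 1030, 107529), ("File2", 997, 106006), ("File3", 1546, 112426),
    ("File4", 2235, 117335), ("File5", 2061, 115363), ("File6", 1875, 114015),
    ("File7", 1237, 110002), ("File8", 1546, 112289), ("File9", 1030, 107154),
    ("File10", 1339, 110276),
]
outliers = find_outliers(rows)
print(outliers)
```

The key design point is the same as in the Scala code: the aggregates become ordinary numbers before the filter runs, so no nested query is needed.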
Thanks for your comments guys.
This solution is for the Java implementation of Spark. For a Scala implementation, see Jacek Laskowski's post.
Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that I can finally use those values in the Column comparison functions.
Hope you understood what I did.
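One thing worth checking with this approach: as far as I know, JavaDoubleRDD.stdev() returns the population standard deviation (there is a separate sampleStdev()), whereas functions.stddev in Spark SQL is the sample standard deviation. The two differ slightly, so the thresholds will too. A quick plain-Python illustration with the Size column, using the standard library's `pstdev` and `stdev` as stand-ins:

```python
from statistics import pstdev, stdev

sizes = [1030, 997, 1546, 2235, 2061, 1875, 1237, 1546, 1030, 1339]

population_sd = pstdev(sizes)  # divides by n, like a population stdev
sample_sd = stdev(sizes)       # divides by n - 1, like a sample stddev

print(population_sd, sample_sd)
```

With only ten rows the gap is noticeable; it shrinks as the number of rows grows.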
I'm trying to make multiple operations in one line of code in pySpark,
and not sure if that's possible for my case.
My intention is not having to save the output as a new dataframe.
My current code is rather simple:
encodeUDF = udf(encode_time, StringType())
# wrapped in parentheses so the chained calls can span several lines
(new_log_df.cache().withColumn('timePeriod', encodeUDF(col('START_TIME')))
    .groupBy('timePeriod')
    .agg(
        mean('DOWNSTREAM_SIZE').alias("Mean"),
        stddev('DOWNSTREAM_SIZE').alias("Stddev")
    )
    .show(20, False))
And my intention is to add count() after using groupBy, to get, well, the count of records matching each value of timePeriod column, printed\shown as output.
When trying to use groupBy(..).count().agg(..) I get exceptions.
Is there any way to achieve both count() and agg().show() prints, without splitting code to two lines of commands, e.g. :
new_log_df.withColumn(..).groupBy(..).count()
new_log_df.withColumn(..).groupBy(..).agg(..).show()
Or better yet, for getting a merged output to agg.show() output - An extra column which states the counted number of records matching the row's value. e.g.:
timePeriod | Mean | Stddev | Num Of Records
X | 10 | 20 | 315
count() can be used inside agg(), since the groupBy expression is the same.
With Python
import pyspark.sql.functions as func
# wrapped in parentheses so the chained calls can span several lines
(new_log_df.cache().withColumn("timePeriod", encodeUDF(new_log_df["START_TIME"]))
    .groupBy("timePeriod")
    .agg(
        func.mean("DOWNSTREAM_SIZE").alias("Mean"),
        func.stddev("DOWNSTREAM_SIZE").alias("Stddev"),
        func.count(func.lit(1)).alias("Num Of Records")
    )
    .show(20, False))
pySpark SQL functions doc
With Scala
import org.apache.spark.sql.functions._ //for count()
new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
.groupBy("timePeriod")
.agg(
mean("DOWNSTREAM_SIZE").alias("Mean"),
stddev("DOWNSTREAM_SIZE").alias("Stddev"),
count(lit(1)).alias("Num Of Records")
)
.show(20, false)
count(lit(1)) counts every row in each group. It gives the same result as count("timePeriod") as long as timePeriod is never null, because count on a column skips null values.
With Java
import static org.apache.spark.sql.functions.*;
new_log_df.cache().withColumn("timePeriod", encodeUDF(col("START_TIME")))
.groupBy("timePeriod")
.agg(
mean("DOWNSTREAM_SIZE").alias("Mean"),
stddev("DOWNSTREAM_SIZE").alias("Stddev"),
count(lit(1)).alias("Num Of Records")
)
.show(20, false);
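To make the "several aggregates in one agg()" idea concrete without Spark, here is a plain-Python sketch that groups once and then computes the mean, standard deviation and record count per group. The sample records and the `summarize` helper are hypothetical; returning None for the standard deviation of a single-row group is a choice made for this sketch, not a statement about what Spark returns.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize(records):
    """Group (period, size) pairs once, then compute all aggregates per group."""
    groups = defaultdict(list)
    for period, size in records:
        groups[period].append(size)
    return {
        period: {
            "Mean": mean(sizes),
            # The sample stdev needs at least two values.
            "Stddev": stdev(sizes) if len(sizes) > 1 else None,
            "Num Of Records": len(sizes),
        }
        for period, sizes in groups.items()
    }

# Hypothetical (timePeriod, DOWNSTREAM_SIZE) pairs.
records = [("morning", 10), ("morning", 30), ("evening", 5)]
summary = summarize(records)
for period, stats in summary.items():
    print(period, stats)
```

Because the grouping happens once, the count comes along with the other aggregates in the same result, which is what the merged output in the question asks for.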