I have a dataset in Spark with two columns: a string column (the first 4 characters are the year and the remaining characters are a word) and an Integer column. Example of a dataset row: "2004-dog" 45. I don't know how to print the first ten rows of every year. I got this far:
JavaRDD<String> mentions =
    tweets.flatMap(s -> Arrays.asList(s.split(":")).iterator());
JavaPairRDD<String, Integer> counts =
    mentions.mapToPair(mention -> new Tuple2<>(mention, 1))
            .reduceByKey((x, y) -> x + y);
Here is just an example:
Input Data:
+-------+---+
| year|cnt|
+-------+---+
|2015:04| 50|
|2015:04| 40|
|2015:04| 50|
|2017:04| 55|
|2017:04| 20|
|2017:04| 20|
+-------+---+
Assuming you have some criterion for picking the top 10 (here, ordering by cnt descending), create a window function:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val w = Window.partitionBy("year").orderBy(desc("cnt"))
df.withColumn("year", split('year,":")(0))
.withColumn("rank", row_number.over(w)) //use rank/dense_rank as you need
.filter('rank <= 2) //replace 10 here
//.drop("rank") //you can drop rank if you want
.show()
Result:
+----+---+----+
|year|cnt|rank|
+----+---+----+
|2017| 55| 1|
|2017| 20| 2|
|2015| 50| 1|
|2015| 50| 2|
+----+---+----+
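Since your original attempt is in Java, a roughly equivalent Java sketch might look like the following (a minimal sketch, assuming df is a Dataset<Row> with the same year and cnt columns as above):
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Rank rows within each year by cnt (descending) and keep the first 10 per year.
WindowSpec w = Window.partitionBy("year").orderBy(desc("cnt"));

Dataset<Row> topPerYear = df
        .withColumn("year", split(col("year"), ":").getItem(0)) // keep only the year part
        .withColumn("rank", row_number().over(w))
        .filter(col("rank").leq(10))
        .drop("rank");

topPerYear.show();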
Hope this helps!
I need a way to get some number x of random rows from a Dataset which are unique. I tried the sample method of the Dataset class, but it sometimes picks duplicate rows.
Dataset's sample method:
https://spark.apache.org/docs/2.2.1/api/java/org/apache/spark/sql/Dataset.html#sample-boolean-double-
The sample function with withReplacement => false will always pick distinct rows, e.g. df1.sample(false, 0.1).show().
sample(boolean withReplacement, double fraction)
Consider the example below, where withReplacement => true produces duplicate rows (which the groupBy/count check confirms), but withReplacement => false does not.
import org.apache.spark.sql.functions._
val df1 = ((1 to 10000).toList).zip(((1 to 10000).map(x=>x*2))).toDF("col1", "col2")
// df1.sample(false, 0.1).show()
println("Sample Count for with Replacement : " + df1.sample(true, 0.1).count)
println("Sample Count for with Out Replacement : " + df1.sample(false, 0.1).count)
df1.sample(true, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
df1.sample(false, 0.1).groupBy($"col1", $"col2").count().filter($"count">1).show(5)
Sample count with replacement : 978
Sample count without replacement : 973
+----+-----+-----+
|col1| col2|count|
+----+-----+-----+
|7464|14928| 2|
|6080|12160| 2|
|6695|13390| 2|
|3393| 6786| 2|
|2137| 4274| 2|
+----+-----+-----+
only showing top 5 rows
+----+----+-----+
|col1|col2|count|
+----+----+-----+
+----+----+-----+
You should use the sample function with withReplacement = false. For example:
val sampledData = df.sample(withReplacement = false, 0.5)
But this is NOT guaranteed to return exactly the given fraction of the total row count of your Dataset.
To do that, take X rows of the sampled data after you get it from the sample function.
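For example, a minimal Java sketch of that last step (df is assumed to be your Dataset<Row>; the 0.1 fraction is only illustrative and must be large enough to yield at least x rows):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

int x = 100; // the exact number of unique rows you want

// Take x rows out of a without-replacement sample, as described above.
Dataset<Row> exactSample = df.sample(false, 0.1).limit(x);

// Alternative (a full shuffle, but exact regardless of the fraction):
// Dataset<Row> shuffled = df.orderBy(org.apache.spark.sql.functions.rand()).limit(x);

exactSample.show();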
I've a lot of tuples with this kind of format:
(1,200,a)
(2,300,a)
(1,300,b)
(2,400,a)
(2,500,b)
(3,200,a)
(3,400,b)
(1,500,a)
(2,400,b)
(3,500,a)
(1,200,b)
My task is first to sort the tuples by their first element, and then to compute the average of the second element for each combination of the first and third elements.
So, the result should be this:
(1,350,a),
(1,250,b),
(2,350,a),
(2,450,b),
(3,350,a),
(3,400,b).
What best practice do you recommend in this case?
I've tried mapToPair and then groupByKey on the first element of the tuple, then another mapToPair and groupByKey on the third element followed by reduceByKey, but it doesn't work and I don't know why. This is only a sketch of my solution; I don't think I've used the best approach for solving this type of job.
Just use Dataset API. Here in Scala, but Java will be almost identical:
val rdd = sc.parallelize(Seq(
(1,200,"a"), (2,300,"a"), (1,300,"b"), (2,400,"a"), (2,500,"b"),
(3,200,"a"), (3,400,"b"), (1,500,"a"), (2,400,"b"), (3,500,"a"),
(1,200,"b")
))
val df = rdd.toDF("k1", "v", "k2")
df.groupBy("k1", "k2").mean("v").orderBy("k1", "k2").show
+---+---+------+
| k1| k2|avg(v)|
+---+---+------+
| 1| a| 350.0|
| 1| b| 250.0|
| 2| a| 350.0|
| 2| b| 450.0|
| 3| a| 350.0|
| 3| b| 400.0|
+---+---+------+
With an RDD, map first to create a composite key:
rdd
.map(x => ((x._1, x._3), (x._2, 1.0)))
.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
.mapValues(x => x._1 / x._2)
.take(6).foreach(println)
((2,a),350.0)
((3,b),400.0)
((1,b),250.0)
((1,a),350.0)
((3,a),350.0)
((2,b),450.0)
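Since the question mentions mapToPair and reduceByKey, here is a minimal Java sketch of the same composite-key idea (it assumes a JavaSparkContext named sc; the data is the sample from the question):
import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;
import scala.Tuple3;

JavaRDD<Tuple3<Integer, Integer, String>> tuples = sc.parallelize(Arrays.asList(
        new Tuple3<>(1, 200, "a"), new Tuple3<>(2, 300, "a"), new Tuple3<>(1, 300, "b"),
        new Tuple3<>(2, 400, "a"), new Tuple3<>(2, 500, "b"), new Tuple3<>(3, 200, "a"),
        new Tuple3<>(3, 400, "b"), new Tuple3<>(1, 500, "a"), new Tuple3<>(2, 400, "b"),
        new Tuple3<>(3, 500, "a"), new Tuple3<>(1, 200, "b")));

// Key each tuple by (first, third) and carry (sum, count) through a single reduceByKey pass.
JavaPairRDD<Tuple2<Integer, String>, Tuple2<Integer, Integer>> keyed = tuples.mapToPair(t ->
        new Tuple2<>(new Tuple2<>(t._1(), t._3()),   // composite key: (first element, third element)
                     new Tuple2<>(t._2(), 1)));      // value: (running sum, running count)

JavaPairRDD<Tuple2<Integer, String>, Tuple2<Integer, Integer>> sums =
        keyed.reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()));

JavaPairRDD<Tuple2<Integer, String>, Double> averages =
        sums.mapValues(sumCount -> sumCount._1().doubleValue() / sumCount._2());

averages.collect().forEach(System.out::println);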
I want to implement a Java application that can connect to any SQL server and load any table from it. For each table I want to create a histogram based on some arbitrary columns.
For example if I have this table
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create a histogram with bin size 4, based on the min and max values of the profit column, and count the number of records in each bin.
The result is:
profit count
---------------
12-16 3
16-20 1
My solution for this problem is to retrieve all the data for the required columns and then construct the bins and group the records using the Java stream Collectors.groupingBy.
I'm not sure my solution is optimal, so I would like some help finding a better algorithm, especially when there is a large number of records (for example, by using some capabilities of SQL Server or other frameworks).
Is there a better algorithm for this?
Edit 1:
Assume my SQL result is in List<Object[]> data.
// index is the index of the column used for the histogram
private String mySimpleHash(Object[] row, int index) {
    StringBuilder hash = new StringBuilder();
    for (int i = 0; i < row.length; i++)
        if (i != index)
            hash.append(row[i]).append(":");
    return hash.toString();
}

List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
        Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
    // use one row of the group as the template for the output row
    Object[] newRow = entry.getValue().get(0).clone();
    // count the values of the histogram column within this group
    double result = entry.getValue().stream()
            .mapToDouble(row -> Double.valueOf(row[index].toString()))
            .count();
    newRow[index] = result;
    histogramData.add(newRow);
}
As you suspected, performing the aggregation after pulling all the data out of SQL Server will become very expensive as the number of rows in your tables grows. You can simply do the aggregation within SQL. Depending on how you express your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at the minimum value needs a little setup, as opposed to binning starting from 0. See the sample below: the inner query maps values to a bin number, and the outer query aggregates and computes the bin boundaries.
CREATE TABLE Test (
    Name varchar(max) NOT NULL,
    Profit int NOT NULL
)

INSERT Test(Name, Profit)
VALUES
    ('name1', 12),
    ('name2', 14),
    ('name3', 18),
    ('name4', 13)

DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4

SELECT
    (@minValue + @binSize * Bin) AS BinLow,
    (@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
    COUNT(*) AS Count
FROM (
    SELECT
        ((Profit - @minValue) / @binSize) AS Bin
    FROM
        Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9
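Since the goal is a Java application, here is a minimal JDBC sketch of running the same binning on the server side (conn is assumed to be an already-open java.sql.Connection; the Test table and Profit column are the example names from above, which a generic tool would substitute dynamically):
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

static void printHistogram(Connection conn, int binSize) throws SQLException {
    try (Statement st = conn.createStatement()) {
        // 1) find the minimum of the histogram column
        int minValue;
        try (ResultSet rs = st.executeQuery("SELECT MIN(Profit) FROM Test")) {
            rs.next();
            minValue = rs.getInt(1);
        }

        // 2) let the server do the binning and counting (same query shape as above,
        //    with the variables inlined as literals)
        String sql =
              "SELECT (" + minValue + " + " + binSize + " * Bin) AS BinLow, "
            + "       (" + minValue + " + " + binSize + " * Bin) + " + (binSize - 1) + " AS BinHigh, "
            + "       COUNT(*) AS Count "
            + "FROM (SELECT ((Profit - " + minValue + ") / " + binSize + ") AS Bin FROM Test) AS t "
            + "GROUP BY Bin";

        try (ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%d-%d: %d%n",
                        rs.getInt("BinLow"), rs.getInt("BinHigh"), rs.getInt("Count"));
            }
        }
    }
}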
What is the method to create a DataFrame from an RDD which is saved as an object file? I want to load the RDD, but I don't have a Java object, only a StructType that I want to use as the schema for the DataFrame.
I tried retrieving it as Row:
val myrdd = sc.objectFile[org.apache.spark.sql.Row]("/home/bipin/"+name)
But I get
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to
org.apache.spark.sql.Row
Is there a way to do this?
Edit
From what I understand, I have to read the RDD as an array of objects and convert it to Row. If anyone can give me a method for this, that would be acceptable.
If you have an Array of Object, you only have to use the Row apply method for an array of Any. In code it will be something like this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x))
EDIT
You are right, @user568109, this will create a DataFrame with only one field, which will be an Array. To parse the whole array you have to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row.fromSeq(x.toSeq))
As @user568109 said, there are other ways to do this:
val myrdd = sc.objectFile[Array[Object]]("/home/bipin/"+name).map(x => Row(x:_*))
It doesn't matter which one you use, because both are wrappers for the same code:
/**
* This method can be used to construct a [[Row]] with the given values.
*/
def apply(values: Any*): Row = new GenericRow(values.toArray)
/**
* This method can be used to construct a [[Row]] from a [[Seq]] of values.
*/
def fromSeq(values: Seq[Any]): Row = new GenericRow(values.toArray)
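Whichever way you build the Row objects, the remaining step from the question is to apply the StructType with createDataFrame. A minimal sketch in Java (the Scala call spark.createDataFrame(rowRdd, schema) is the same; spark, jsc, schema and the path are assumptions/placeholders):
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

// spark: an existing SparkSession; jsc: its JavaSparkContext; schema: your StructType.
// The path is a placeholder for wherever the object file was saved.
JavaRDD<Row> rows = jsc.<Object[]>objectFile("/path/to/objectfile")
        .map(arr -> RowFactory.create(arr)); // each Object[] becomes one Row, one field per element

Dataset<Row> ddf = spark.createDataFrame(rows, schema);
ddf.printSchema();
ddf.show();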
Let me add some explanation.
Suppose you have a MySQL table grocery with 3 columns (item, category, price) and the following contents:
+------------+---------+----------+-------+
| grocery_id | item | category | price |
+------------+---------+----------+-------+
| 1 | tomato | veg | 2.40 |
| 2 | raddish | veg | 4.30 |
| 3 | banana | fruit | 1.20 |
| 4 | carrot | veg | 2.50 |
| 5 | apple | fruit | 8.10 |
+------------+---------+----------+-------+
5 rows in set (0.00 sec)
Now you want to read it within Spark; your code will be something like below:
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => r.getString("item")+"|"+r.getString("price"))
Note:
In the above statement I converted the ResultSet into a String: r => r.getString("item")+"|"+r.getString("price")
So my JdbcRDD will be:
groceryRDD: org.apache.spark.rdd.JdbcRDD[String] = JdbcRDD[29] at JdbcRDD at <console>:21
Now you save it:
groceryRDD.saveAsObjectFile("/user/cloudera/jdbcobject")
Answer to your question:
While reading the object file you need to write it as below:
val newJdbObjectFile = sc.objectFile[String]("/user/cloudera/jdbcobject")
Simply substitute the type parameter of the RDD you saved.
In my case, groceryRDD has a type parameter of String, hence I have used the same.
UPDATE:
In your case, as mentioned by jlopezmat, you need to use Array[Object].
Here each row of the RDD will be an Object, but since you converted it using resultSetToObjectArray, each row with its contents will again be saved as an Array.
That is, in my case, if I save the above RDD as below:
val groceryRDD = new JdbcRDD(sc, ()=> DriverManager.getConnection(url,uname,passwd), "select item,price from grocery limit ?,?",1,10,2,r => JdbcRDD.resultSetToObjectArray(r))
and when I read it back and collect the data:
val newJdbcObjectArrayRDD = sc.objectFile[Array[Object]]("...")
val result = newJdbObjectArrayRDD.collect
the result will be of type Array[Array[Object]]:
result: Array[Array[Object]] = Array(Array(raddish, 4.3), Array(banana, 1.2), Array(carrot, 2.5), Array(apple, 8.1))
You can parse the above based on your column definitions.
Please let me know if this answered your question.
I'm retrieving data from a table in a database and adding the whole row to a TreeMap.
I'm using the following code:
ArrayList<String> listaValores = new ArrayList();
TreeMap<Integer, ArrayList> tmValores = new TreeMap<>();
(...)
while (rs.next())
{
listaValores.clear();
for(int i=2; i<=ncolunas; i++)
{
listaValores.add(rs.getString(i));
}
tmValores.put(rs.getInt(1), listaValores);
}
My TreeMap keys are inserted fine, but the values are always the same: the values from the last row returned by the SELECT * FROM Table query.
So if my table is:
id | name | last name |
1 | joe | lewis |
2 | mark | spencer |
3 | mike | carter |
4 | duke | melvin |
My TreeMap contains:
1=> duke, melvin
2=> duke, melvin
3=> duke, melvin
4=> duke, melvin
Instead of (as I wanted)
1=> joe, lewis
2=> mark, spencer
3=> mike, carter
4=> duke, melvin
Can anyone point out where the problem is?
I believe you have to reassign the value of listaValores.
Simply change
listaValores.clear();
to
listaValores = new ArrayList<String>();
Objects in Java are passed around as references, so you are in fact adding the same list for all the keys in your map.
Since you clear it at every step and then add some values to it, it will contain just the last row, after the while loop has finished.
What you really want to do is to create an instance of ArrayList for every row and add that to your map, instead of clearing the values in the old one:
while (rs.next())
{
listaValores = new ArrayList<String>();
for(int i = 2; i <= ncolunas; i++)
{
listaValores.add(rs.getString(i));
}
tmValores.put(rs.getInt(1), listaValores);
}
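An equivalent fix is to keep the clear() call but put an independent copy of the list into the map on each iteration:
while (rs.next())
{
    listaValores.clear();
    for (int i = 2; i <= ncolunas; i++)
    {
        listaValores.add(rs.getString(i));
    }
    // the copy constructor takes a snapshot, so later clear() calls do not affect stored values
    tmValores.put(rs.getInt(1), new ArrayList<>(listaValores));
}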