How to group by in spark

How to group by in spark - java

I have the sample data below, but in real life this dataset is huge.
A B 1-1-2018 10
A B 2-1-2018 20
C D 1-1-2018 15
C D 2-1-2018 25
I need to group the above data by date and generate key-value pairs:
1-1-2018->key
-----------------
A B 1-1-2018 10
C D 1-1-2018 15
2-1-2018->key
-----------------
A B 2-1-2018 20
C D 2-1-2018 25
Can anyone please tell me how to do that in Spark in the most optimized way (using Java if possible)?

Not Java, but looking at your example it seems you want to split your DataFrame into sub-groups by key. The best way I know to do that is with a while loop, and it is not the easiest approach.
//You will also need to import the DataFrame type in Scala (org.apache.spark.sql.DataFrame); I don't know whether Java needs an equivalent for the code below.
//Inputting your DF, with columns as Value_1, Value_2, Key, Output_Amount
val inputDF = //DF From above
//Need to get an empty DF, I just like doing it this way
val testDF = spark.sql("select 'foo' as bar")
var arrayOfDataFrames: Array[DataFrame] = Array(testDF)
val arrayOfKeys = inputDF.selectExpr("Key").distinct.rdd.map(x=>x.mkString).collect
var keyIterator = 1
//Need to overwrite the foo bar first DF
arrayOfDataFrames = Array(inputDF.where($"Key"===arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1
//loop through find the key and place it into the DataFrames array
while(keyIterator <= arrayOfKeys.length) {
arrayOfDataFrames = arrayOfDataFrames ++ Array(inputDF.where($"Key"===arrayOfKeys(keyIterator - 1)))
keyIterator = keyIterator + 1
}
At the end you will have two arrays of the same length, one of DataFrames and one of keys, whose elements match by index: the 3rd element of the keys array corresponds to the 3rd element of the DataFrames array.
This isn't Java and doesn't directly answer your question, but I hope it at least pushes you in a helpful direction (I built it in Spark Scala).
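For readers who want to see the target shape in Java, here is a minimal sketch of the date -> rows grouping using plain collections only. It does not use Spark and is not a scalable solution for a huge dataset; the class name GroupByDateSketch, the String[] row layout, and DATE_INDEX are assumptions made for illustration. In Spark itself, a key-based grouping operation on the RDD or Dataset would play the role of groupByDate here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class GroupByDateSketch {
    // Position of the date field in each row (an assumption about the input layout).
    static final int DATE_INDEX = 2;

    // Groups rows by their date field, keeping keys in first-seen order.
    static Map<String, List<String[]>> groupByDate(List<String[]> rows) {
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[DATE_INDEX], k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The four sample rows from the question.
        List<String[]> rows = Arrays.asList(
                new String[] {"A", "B", "1-1-2018", "10"},
                new String[] {"A", "B", "2-1-2018", "20"},
                new String[] {"C", "D", "1-1-2018", "15"},
                new String[] {"C", "D", "2-1-2018", "25"});
        groupByDate(rows).forEach((date, group) -> {
            System.out.println(date + "->key");
            for (String[] row : group) {
                System.out.println(String.join(" ", row));
            }
        });
    }
}
```

Each map entry corresponds to one key block in the question: the date is the key and the list holds every row carrying that date.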

Related

From java.util.ArrayList to a dictionary in java

I'm using jBPM to run a SQL query against the DB. The SQL output is returned in a variable of type java.util.ArrayList. The table that I'm querying looks like this in MariaDB:
variable value
math 1
physics 4
biology 10
...
sport 5
chemistry 9
The query that I'm running is SELECT * from school_data. It returns a flat list of the form [math,1,physics,4,biology,10.....] with only 20 elements.
Is there a way to transform the output into a dictionary and then extract the values easily? In Python it would be like this:
cur = connection.cursor()
cur.execute("SELECT * from school_data")
result = cur.fetchall()
query_result = dict((x, y) for x, y in result)
math=query_result['math']
physics=query_result['physics']
biology=query_result['biology']

Java does not have lists or dictionaries / maps as built-in data types, so it does not offer syntax or built-in operators for working with them. One can certainly perform transformations such as you describe, but it's a matter of opinion whether it can be done "easily". One way would be something like this:
Map<String, String> query_result = new HashMap<>();
for (int i = 0; i < result_array.length; i += 2) {
query_result.put(result_array[i], result_array[i + 1]);
}
String biology = query_result.get("biology");
// ...
That makes some assumptions about the data types involved, which you might need to adjust for your actual data.
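Since the question mentions a java.util.ArrayList rather than an array, here is a minimal sketch of the same idea for a list. It assumes the list is flat and alternates key, value, key, value, and it converts every element to a String with String.valueOf; the class name FlatListToMap and the method toMap are hypothetical names for this example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FlatListToMap {
    // Pairs up consecutive elements: index 0 is a key, index 1 its value, and so on.
    static Map<String, String> toMap(List<?> flat) {
        if (flat.size() % 2 != 0) {
            throw new IllegalArgumentException("expected an even number of elements");
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (int i = 0; i < flat.size(); i += 2) {
            result.put(String.valueOf(flat.get(i)), String.valueOf(flat.get(i + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the query output described in the question.
        List<Object> flat = Arrays.<Object>asList("math", 1, "physics", 4, "biology", 10);
        Map<String, String> queryResult = toMap(flat);
        System.out.println("math = " + queryResult.get("math"));
        System.out.println("biology = " + queryResult.get("biology"));
    }
}
```

If the values are needed as numbers, Integer.parseInt can be applied to the looked-up string, or the map can be declared with a numeric value type if the list is known to hold numbers.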

How to find outliers using avg and stddev?

I am having trouble filtering a Dataset<Row> using the mean() and stddev() built-in functions in org.apache.spark.sql.functions.
This is the set of data I am working with (top 10):
Name Size Volumes
File1 1030 107529
File2 997 106006
File3 1546 112426
File4 2235 117335
File5 2061 115363
File6 1875 114015
File7 1237 110002
File8 1546 112289
File9 1030 107154
File10 1339 110276
What I am currently trying to do is find the outliers in this dataset. For that, I need to find the rows where Size or Volumes falls outside the range given by the 95% rule: μ - 2σ ≤ X ≤ μ + 2σ
This is the SQL-like query that I would like to run on this Dataset:
SELECT * FROM DATASET
WHERE size < (SELECT (AVG(size) - 2 * STDEV(size)) FROM DATASET)
OR size > (SELECT (AVG(size) + 2 * STDEV(size)) FROM DATASET)
OR volumes < (SELECT (AVG(volumes) - 2 * STDEV(volumes)) FROM DATASET)
OR volumes > (SELECT (AVG(volumes) + 2 * STDEV(volumes)) FROM DATASET)
I don't know how to implement nested queries and I'm struggling to find a way to solve this.
Also, if you happen to know another way of getting what I want, feel free to share it.
This is what I attempted to do but I get an error:
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));

You asked about Java, which I'm not currently using, so here is a Scala version that I hope helps you find the corresponding Java version.
What about the following solution?
// preparing the dataset
val input = spark.
read.
text("input.txt").
as[String].
filter(line => !line.startsWith("Name")).
map(_.split("\\W+")).
withColumn("name", $"value"(0)).
withColumn("size", $"value"(1) cast "int").
withColumn("volumes", $"value"(2) cast "int").
select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
groupBy().
agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
as[(Double, Double, Double, Double)].
head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
This is only the first part of the four-part filter; I am leaving the rest as a home exercise.

Thanks for your comments, guys.
This solution is for the Java API of Spark. If you want the Scala implementation, look at Jacek Laskowski's post.
Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that I can finally use those values in the Column comparison functions.
Hope you understood what I did.
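To make the threshold logic concrete, here is a minimal plain-Java sketch of the μ ± 2σ rule with no Spark involved. It assumes the sample standard deviation (dividing by n - 1); as far as I know, Spark SQL's stddev is also the sample version while JavaDoubleRDD.stdev() is the population version, so the two answers above can give slightly different thresholds. The class name TwoSigmaOutliers and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class TwoSigmaOutliers {
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); needs at least two values.
    static double sampleStddev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    // Returns the values lying outside [mean - 2*stddev, mean + 2*stddev].
    static List<Double> outliers(double[] xs) {
        double m = mean(xs);
        double s = sampleStddev(xs);
        double lower = m - 2 * s;
        double upper = m + 2 * s;
        List<Double> result = new ArrayList<>();
        for (double x : xs) {
            if (x < lower || x > upper) {
                result.add(x);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // The Size column from the question's ten sample rows.
        double[] sizes = {1030, 997, 1546, 2235, 2061, 1875, 1237, 1546, 1030, 1339};
        System.out.println("mean = " + mean(sizes) + ", stddev = " + sampleStddev(sizes));
        System.out.println("outliers = " + outliers(sizes));
    }
}
```

In the Spark versions the same two numbers per column are computed once and then used as constants in the filter, which is what avoids the nested-query problem from the question.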

How filter Scan of HBase by part of row key?

I have an HBase table with row keys that consist of a text ID and a timestamp, like this:
...
string_id1.1470913344067
string_id1.1470913345067
string_id2.1470913344067
string_id2.1470913345067
...
How can I filter an HBase Scan (in Scala or Java) to get results for a given string ID with a timestamp greater than some value?
Thanks

The fuzzy row approach is efficient for this kind of requirement when the data is huge.
As explained in this article:
FuzzyRowFilter takes as parameters row key and a mask info.
In the example above, in case we want to find the last logged-in users and the row key format is userId_actionId_timestamp (where userId has a fixed length of, say, 4 chars), the fuzzy row key we are looking for is ????_login_. This translates into the following params for FuzzyRowFilter:
FuzzyRowFilter rowFilter = new FuzzyRowFilter(
Arrays.asList(
new Pair<byte[], byte[]>(
Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_"),
new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0})));
I would suggest going through "HBase: The Definitive Guide", chapter "Client API: Advanced Features".

Let's say you somehow ended up with your lines in a traversable structure such as a List or an RDD. Now you want only the strings with id = "string_id2" and timestamp > 1470913345000.
So what is the problem here? Just filter the structure on these two criteria.
val filtered = listOrRddOfLines
.map(l => {
val idStr :: timestampStr :: Nil = l.split('.').toList
(idStr, timestampStr.toLong)
})
.filter({
case (idStr, timestamp) => idStr.equals("string_id2") && (timestamp > "1470913345000".toLong)
})

I resolved my problem by using two filters:
- PrefixFilter (I pass the first part of the row key to this filter; in my case the string ID, for example "string_id1.")
- RowFilter (I pass two parameters: first CompareOp.GREATER_OR_EQUAL, second the full row key with the required timestamp, for example "string_id1.1470913345000")
As a result I get all cells whose row key has the required string_id as its first part and a timestamp greater than or equal to the one I put in the filter as its second part. It is exactly what I want.
Code snippet:
val s = new Scan()
s.addFamily(family.getBytes)
val filterList = new FilterList()
filterList.addFilter(new PrefixFilter(Bytes.toBytes(prefixOfRowKey)))
filterList.addFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(valueForBinaryFilter.getBytes())))
s.setFilter(filterList)
val scanner = table.getScanner(s)
Thanks to everyone who helped to find a solution.
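The reason the two filters combine this way can be sketched without HBase: a row passes if it starts with the prefix and compares greater than or equal to the bound under unsigned byte-wise lexicographic order, which is how a binary row comparison orders keys. The following plain-Java simulation is only an illustration of that logic; RowKeyFilterSketch and its methods are hypothetical names, not HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class RowKeyFilterSketch {
    // Unsigned lexicographic comparison of two byte arrays.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static boolean hasPrefix(byte[] key, byte[] prefix) {
        if (key.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (key[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Keeps keys that start with the prefix AND are >= lowerBound byte-wise.
    static List<String> select(List<String> rowKeys, String prefix, String lowerBound) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        byte[] bound = lowerBound.getBytes(StandardCharsets.UTF_8);
        List<String> kept = new ArrayList<>();
        for (String rowKey : rowKeys) {
            byte[] key = rowKey.getBytes(StandardCharsets.UTF_8);
            if (hasPrefix(key, p) && compareUnsigned(key, bound) >= 0) {
                kept.add(rowKey);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = List.of(
                "string_id1.1470913344067", "string_id1.1470913345067",
                "string_id2.1470913344067", "string_id2.1470913345067");
        System.out.println(select(keys, "string_id1.", "string_id1.1470913345000"));
    }
}
```

One caveat follows from the byte-wise ordering: the comparison only agrees with numeric order while all timestamps have the same number of digits. A key such as "string_id1.999" would pass the bound above, because '9' sorts after '1'.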

I'm getting different results every time I run my code

I'm using ELKI to cluster my data. I used KMeansLloyd<NumberVector> with k=3, and every time I run my Java code I get totally different clustering results. Is this normal, or is there something I should do to make my output nearly stable? Here is my code, which I got from the ELKI tutorials:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(a);
// Create a database (which may contain multiple relations!)
Database db = new StaticArrayDatabase(dbc, null);
// Load the data into the database (do NOT forget to initialize...)
db.initialize();
// Relation containing the number vectors:
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
// We know that the ids must be a continuous range:
DBIDRange ids = (DBIDRange) rel.getDBIDs();
// K-means should be used with squared Euclidean (least squares):
//SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
CosineDistanceFunction dist= CosineDistanceFunction.STATIC;
// Default initialization, using global random:
// To fix the random seed, use: new RandomFactory(seed);
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
// Textbook k-means clustering:
KMeansLloyd<NumberVector> km = new KMeansLloyd<>(dist, //
3 /* k - number of partitions */, //
0 /* maximum number of iterations: no limit */, init);
// K-means will automatically choose a numerical relation from the data set:
// But we could make it explicit (if there were more than one numeric
// relation!): km.run(db, rel);
Clustering<KMeansModel> c = km.run(db);
// Output all clusters:
int i = 0;
for(Cluster<KMeansModel> clu : c.getAllClusters()) {
// K-means will name all clusters "Cluster" in lack of noise support:
System.out.println("#" + i + ": " + clu.getNameAutomatic());
System.out.println("Size: " + clu.size());
System.out.println("Center: " + clu.getModel().getPrototype().toString());
// Iterate over objects:
System.out.print("Objects: ");
for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
// To get the vector use:
NumberVector v = rel.get(it);
// Offset within our DBID range: "line number"
final int offset = ids.getOffset(it);
System.out.print(v+" " + offset);
// Do NOT rely on using "internalGetIndex()" directly!
}
System.out.println();
++i;
}

I would say, since you are using RandomlyGeneratedInitialMeans:
Initialize k-means by generating random vectors (within the data sets value range).
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
Yes, it is normal.

K-Means is supposed to be initialized randomly. It is desirable to get different results when running it multiple times.
If you don't want this, use a fixed random seed.
From the code you copied and pasted:
// To fix the random seed, use: new RandomFactory(seed);
That is exactly what you should do...
long seed = 0;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(
new RandomFactory(seed));

This was too long for a comment. As @Idos stated, you are initializing your data randomly; that's why you're getting random results. Now the question is, how do you ensure the results are robust? Try this:
Run the algorithm N times. Each time, record the cluster membership for each observation. When you are finished, classify an observation into the cluster which contained it most often. For example, suppose you have 3 observations, 3 classes, and run the algorithm 3 times:
obs R1 R2 R3
1 A A B
2 B B B
3 C B B
Then you should classify obs1 as A since it was most often classified as A. Classify obs2 as B since it was always classified as B. And classify obs3 as B since it was most often classified as B by the algorithm. The results should become increasingly stable the more times you run the algorithm.
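The voting step can be sketched in a few lines of Java. One assumption matters here: cluster labels from independent k-means runs are arbitrary, so "A" in one run is not automatically the same cluster as "A" in another, and the labels must be aligned across runs before voting. The sketch below assumes that alignment has already been done; MajorityVoteSketch and consensus are hypothetical names, and ties go to the label seen first.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MajorityVoteSketch {
    // labels[i][r] is the (already aligned) cluster label of observation i in run r.
    // Returns, per observation, the label it received most often.
    static String[] consensus(String[][] labels) {
        String[] result = new String[labels.length];
        for (int i = 0; i < labels.length; i++) {
            Map<String, Integer> counts = new LinkedHashMap<>();
            for (String label : labels[i]) {
                counts.merge(label, 1, Integer::sum);
            }
            String best = null;
            int bestCount = 0;
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                if (entry.getValue() > bestCount) {
                    best = entry.getKey();
                    bestCount = entry.getValue();
                }
            }
            result[i] = best;
        }
        return result;
    }

    public static void main(String[] args) {
        // The three observations and three runs from the table above.
        String[][] labels = {
                {"A", "A", "B"},
                {"B", "B", "B"},
                {"C", "B", "B"}};
        String[] voted = consensus(labels);
        for (int i = 0; i < voted.length; i++) {
            System.out.println("obs" + (i + 1) + " -> " + voted[i]);
        }
    }
}
```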

JVisualVM HeapDump OQL rendering array inside an Object

I am trying to write a query such as this:
select {r: referrers(f), count:count(referrers(f))}
from com.a.b.myClass f
However, the output doesn't show the actual objects:
{
count = 3.0,
r = [object Object]
}
Removing the Javascript Object notation once again shows referrers normally, but they are no longer compartmentalized. Is there a way to format it inside the Object notation?

So I see that you asked this question a year ago, so I don't know if you still need the answer, but since I was searching around for something similar, I can answer this. The problem is that referrers(f) returns an enumeration and so it doesn't really translate well when you try to put it into your hashmap. I was doing a similar type of analysis where I was trying to find unique char arrays (count the unique combinations of char arrays up to the first 50 characters). What I came up with was this:
var counts = {};
filter(
map(
unique(
map(
filter(heap.objects('char[]'), "it.length > 50"), // filter out strings less than 50 chars in length
function(charArray) { // chop the string at 50 chars and then count the unique combos
var subs = charArray.toString().substr(0,50);
if (! counts[subs]) {
counts[subs] = 1;
} else {
counts[subs] = counts[subs] + 1;
}
return subs;
}
) // map
) // unique
, function(subs) { // map the strings into an array that has the string and the counts of that string
return { string: subs, count: counts[subs] };
}) // map
, "it.count > 5000"); // filter out strings that have counts < 5000
This essentially shows how to take an enumeration (heap.objects('char[]') in this case) and filter it and map it so that you can compute statistics on it. Hope this helps someone.
