how to avoid collect when using Statistic.stat - java

When I calculate the variance of my data I have to collect first. Is there any other way?
My data format:
1 2 3
1 4 5
4 5 6
4 7 8
7 8 9
10 11 12
10 13 14
10 1 2
1 100 100
10 11 2
10 11 2
1 2 5
4 7 6
code:
val conf = new SparkConf().setAppName("hh")
conf.setMaster("local[3]")
val sc = new SparkContext(conf)
val data = sc.textFile("/home/hadoop4/Desktop/i.txt")
.map(_.split("\t")).map(f => f.map(f => f.toDouble))
.map(f => ("k"+f(0),f(1)))
//data:RDD[(String,Double)]
val dataArr = data.map(f=>(f._1,ArrayBuffer(f._2)))
//dataArr RDD[(String,ArrayBuffer[Double])]
dataArr.collect().foreach(println(_))
//output
(k1.0,ArrayBuffer(2.0))
(k1.0,ArrayBuffer(4.0))
(k4.0,ArrayBuffer(5.0))
(k4.0,ArrayBuffer(7.0))
(k7.0,ArrayBuffer(8.0))
(k10.0,ArrayBuffer(11.0))
(k10.0,ArrayBuffer(13.0))
(k10.0,ArrayBuffer(1.0))
(k1.0,ArrayBuffer(100.0))
(k10.0,ArrayBuffer(11.0))
(k10.0,ArrayBuffer(11.0))
(k1.0,ArrayBuffer(2.0))
(k4.0,ArrayBuffer(7.0))
val dataArrRed = dataArr.reduceByKey((x,y)=>x++=y)
//dataArrRed :RDD[(String,ArrayBuffer[Double])]
dataArrRed.collect().foreach(println(_))
//output
(k1.0,ArrayBuffer(2.0, 4.0, 100.0, 2.0))
(k7.0,ArrayBuffer(8.0))
(k10.0,ArrayBuffer(11.0, 13.0, 1.0, 11.0, 11.0))
(k4.0,ArrayBuffer(5.0, 7.0, 7.0))
val dataARM = dataArrRed.collect().map(
f=>(f._1,sc.makeRDD(f._2,2)))
val dataARMM = dataARM.map(
f=>(f._1,(f._2.variance(),f._2.max(),f._2.min())))
.foreach(println(_))
sc.stop()
//output
(k1.0,(1777.0,100.0,2.0))
(k7.0,(0.0,8.0,8.0))
(k10.0,(18.24,13.0,1.0))
(k4.0,(0.8888888888888888,7.0,5.0))
//update: now I calculate the second column and the third column at the same time, put them into an Array(f(1), f(2)), turn that into an RDD and call aggregateByKey on it. The 'zero value' is Array(new StatCounter(), new StatCounter()), but it has a problem.
val dataArray2 = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), Array(f(1),f(2))))
val data2 = sc.parallelize(dataArray2)
val dataStat2 = data2.aggregateByKey(Array(new StatCounter(), new StatCounter()))(
  { (s, v) => (s(0).merge(v(0)), s(1).merge(v(1))) },
  { (s, t) => (s(0).merge(v(0)), s(1).merge(v(1))) })
It's wrong. Can I use Array(new StatCounter(), new StatCounter()) as the zero value? Thanks.

Worked example. It turns out to be a one-liner, plus another line to map the result into the OP's format.
Slightly different way of getting the data (more convenient for testing, but same result):
val dataString = """1 2 3
1 4 5
4 5 6
4 7 8
7 8 9
10 11 12
10 13 14
10 1 2
1 100 100
10 11 2
10 11 2
1 2 5
4 7 6
""".trim
val dataArray = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), f(1)))
val data = sc.parallelize(dataArray)
Build the stats by key
val dataStats = data.aggregateByKey(new StatCounter())
({(s,v)=>s.merge(v)}, {(s,t)=>s.merge(t)})
Or, slightly shorter but perhaps over-tricky (it works because StatCounter.merge is overloaded for both a Double value and another StatCounter):
val dataStats = data.aggregateByKey(new StatCounter())(_ merge _, _ merge _)
Re-format to the OP's format and print
val result = dataStats.map(f=>(f._1,(f._2.variance,f._2.max,f._2.min)))
.foreach(println(_))
Output, same apart from some rounding errors.
(k1.0,(1776.9999999999998,100.0,2.0))
(k7.0,(0.0,8.0,8.0))
(k10.0,(18.240000000000002,13.0,1.0))
(k4.0,(0.888888888888889,7.0,5.0))
EDIT: Version with two columns
val dataArray = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), Array(f(1), f(2))))
val data = sc.parallelize(dataArray)
val dataStats = data.aggregateByKey(Array(new StatCounter(), new StatCounter()))(
  {(s, v) => Array(s(0) merge v(0), s(1) merge v(1))},
  {(s, t) => Array(s(0) merge t(0), s(1) merge t(1))})
val result = dataStats.map(f => (f._1,
    (f._2(0).variance, f._2(0).max, f._2(0).min),
    (f._2(1).variance, f._2(1).max, f._2(1).min)))
  .foreach(println(_))
Output
(k1.0,(1776.9999999999998,100.0,2.0),(1716.6875,100.0,3.0))
(k7.0,(0.0,8.0,8.0),(0.0,9.0,9.0))
(k10.0,(18.240000000000002,13.0,1.0),(29.439999999999998,14.0,2.0))
(k4.0,(0.888888888888889,7.0,5.0),(0.888888888888889,8.0,6.0))
EDIT2: "n"-column version
val n = 2
val dataStats = data.aggregateByKey(List.fill(n)(new StatCounter()))(
{(s, v)=> (s zip v).map{case (si, vi) => si merge vi}},
{(s, t)=> (s zip t).map{case (si, ti) => si merge ti}})
val result = dataStats.map(f => (f._1, f._2.map(x => (x.variance, x.max, x.min))))
.foreach(println(_))
Output is the same as above, but if you have more columns you can change n. Note that if the Array in any row has fewer than n elements, the zip silently truncates and you get fewer stats for that key; a small guard is sketched below.
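If you want short rows to fail loudly instead, one possible guard (assuming data and n as defined above):
val checked = data.map { case (key, cols) =>
  // fail fast rather than letting zip silently shorten the per-key StatCounter list
  require(cols.length >= n, s"row for key $key has ${cols.length} columns, expected at least $n")
  (key, cols)
}
// then call aggregateByKey on `checked` instead of `data`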

I would simply use a stats object (class StatCounter). Then, I would:
parse the file and split each line
create the tuples and partition the RDD
use aggregateByKey to end up with an RDD of stat objects (a minimal sketch follows)
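A minimal sketch of those steps, reusing the question's file path and key format (assuming the file is whitespace/tab separated, as in the question):
import org.apache.spark.util.StatCounter

val stats = sc.textFile("/home/hadoop4/Desktop/i.txt")        // parse the file
  .map(_.split("\\s+").map(_.toDouble))                       // split each line
  .map(cols => ("k" + cols(0), cols(1)))                      // create the tuples
  .aggregateByKey(new StatCounter(), 3)(_ merge _, _ merge _) // 3 partitions is arbitrary
// stats: RDD[(String, StatCounter)] -- nothing is collected to the driver
stats.mapValues(s => (s.variance, s.max, s.min)).foreach(println)
This is essentially the same aggregateByKey + StatCounter approach as in the worked example above; collect is only needed if you want the results back on the driver.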

Related

How can I print the content of rows in a Dataset using Java and Spark SQL?

I would like to write a simple Spark SQL program that reads a file called u.data, which contains the movie ratings, creates a Dataset of Rows, and then prints the first rows of the Dataset.
My starting point is to read the file into a JavaRDD and map the RDD onto a ratingsObject (the object has two fields, movieID and rating). I just want to print the first rows of this Dataset.
I'm using Java and Spark SQL.
public static void main(String[] args){
App obj = new App();
SparkSession spark = SparkSession.builder().appName("Java Spark SQL basic example").getOrCreate();
Map<Integer,String> movieNames = obj.loadMovieNames();
JavaRDD<String> lines = spark.read().textFile("hdfs:///ml-100k/u.data").javaRDD();
JavaRDD<MovieRatings> movies = lines.map(line -> {
String[] parts = line.split(" ");
MovieRatings ratingsObject = new MovieRatings();
ratingsObject.setMovieID(Integer.parseInt(parts[1].trim()));
ratingsObject.setRating(Integer.parseInt(parts[2].trim()));
return ratingsObject;
});
Dataset<Row> movieDataset = spark.createDataFrame(movies, MovieRatings.class);
Encoder<Integer> intEncoder = Encoders.INT();
Dataset<Integer> HUE = movieDataset.map(
new MapFunction<Row, Integer>(){
private static final long serialVersionUID = -5982149277350252630L;
@Override
public Integer call(Row row) throws Exception{
return row.getInt(0);
}
}, intEncoder
);
HUE.show();
//stop the session
spark.stop();
}
I've tried a lot of possible solutions that I found, but all of them give the same error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, localhost, executor 1): java.lang.ArrayIndexOutOfBoundsException: 1
at com.ericsson.SparkMovieRatings.App.lambda$main$1e634467$1(App.java:63)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1040)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
And here is a sample of the u.data file:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
The first column represents the UserID, the second the MovieID, the third the rating, and the last one is the timestamp.
As mentioned before, your data is not space separated.
I'll show you two possible solutions: the first one based on the RDD API and the second one based on Spark SQL, which is, in general, the better solution in terms of performance.
RDD (you should use built-in types to reduce the overhead):
import java.util.ArrayList;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkDriver {
    public static void main(String[] args) {
        // Create a configuration object and set the name of the application
        SparkConf conf = new SparkConf().setAppName("application_name");
        // Create a Spark context object
        JavaSparkContext context = new JavaSparkContext(conf);
        // Create the final RDD (suppose you have a text file)
        JavaPairRDD<Integer, Integer> movieRatingRDD =
                context.textFile("u.data.txt")
                       .mapToPair(line -> {
                           String[] tokens = line.split("\\s+");
                           // u.data columns: userID, movieID, rating, timestamp
                           int movieID = Integer.parseInt(tokens[1]);
                           int rating = Integer.parseInt(tokens[2]);
                           return new Tuple2<>(movieID, rating);
                       });
        // Keep in mind that the take operation takes the first n elements
        // and the order is the order of the file.
        ArrayList<Tuple2<Integer, Integer>> list = new ArrayList<>(movieRatingRDD.take(10));
        System.out.println("MovieID\tRating");
        for (Tuple2<Integer, Integer> tuple : list) {
            System.out.println(tuple._1 + "\t" + tuple._2);
        }
        context.close();
    }
}
SQL
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkDriver {
    public static void main(String[] args) {
        // Create spark session
        SparkSession session = SparkSession.builder().appName("[Spark app sql version]").getOrCreate();
        Dataset<MovieRatings> ratings = session.read()
                .format("csv")              // assuming the built-in csv source was intended
                .option("header", false)
                .option("inferSchema", true)
                .option("delimiter", "\t")  // u.data is tab separated; the csv delimiter must be a single character
                .load("u.data.txt")
                .map((MapFunction<Row, MovieRatings>) row -> {
                    // u.data columns: userID, movieID, rating, timestamp
                    int movieID = row.getInt(1);
                    int rating = row.getInt(2);
                    return new MovieRatings(movieID, rating);
                }, Encoders.bean(MovieRatings.class));
        // Print the first rows
        ratings.show(10);
        // Stop session
        session.stop();
    }
}

Scala: Parse a YAML file in Scala (type casting)

I have a YAML file whose contents I want to read in Scala, so I parse it to JSON using io.circe.yaml:
var js = yaml.parser.parse(ymlText)
var json=js.valueOr(null)
var jsonstring=json.toString
val json2 = parse(jsonstring)
The ymlText is like this:
ALL:
  Category1:
    Subcategory11 : 1.5
    Subcategory12 : 0
    Subcategory13 : 0
    Subcategory14 : 0.5
  Category2:
    Subcategory21 : 1.5
    Subcategory22 : 0.3
    Subcategory23 : 0
    Subcategory24 : 0
What I want is to filter out the subcategories that have zero values. I've used this code:
val elements = (json2 \\"ALL" ).children.map(x=>(x.values))
var subCategories=elements.map{case(a,b)=>(b)}
var cats=elements.map{case(a,b)=>(b.asInstanceOf[Map[String,Double]])}
cats.map(x=>x.filter{case(a,b)=>b>0.0})
But the last line gives me this error:
scala.math.BigInt cannot be cast to java.lang.Double
I'm not sure why you do toString + parse, or which parse is used, but you probably don't need it. Also, you didn't describe your expected result, so here are a few guesses at what you might need:
import java.io._
import io.circe._
import io.circe.yaml._
import io.circe.parser._
def test(): Unit = {
// test data instead of a file
val ymlText =
"""
|ALL:
| Category1:
| Subcategory11 : 1.5
| Subcategory12 : 0
| Subcategory13 : 0
| Subcategory14 : 0.5
| Category2:
| Subcategory21 : 1.5
| Subcategory22 : 0.3
| Subcategory23 : 0
| Subcategory24 : 0
""".stripMargin
var js = yaml.parser.parse(new StringReader(ymlText))
var json: Json = js.right.get
val categories = (json \\ "ALL").flatMap(j => j.asObject.get.values.toList)
val subs = categories.flatMap(j => j.asObject.get.toList)
val elements: List[(String, Double)] = subs.map { case (k, v) => (k, v.asNumber.get.toDouble) }
.filter {
case (k, v) => v > 0.0
}
println(s"elements: $elements")
val allCategories = (json \\ "ALL").flatMap(j => j.asObject.get.toList).toMap
val filteredTree: Map[String, Map[String, Double]] = allCategories
.mapValues(catJson => catJson.asObject.get.toList.map { case (subName, subJson) => (subName, subJson.asNumber.get.toDouble) }
.filter { case (subName, subValue) => subValue > 0.0 }
.toMap)
println(s"filteredTree : $filteredTree")
}
And the output for that is:
elements: List((Subcategory11,1.5), (Subcategory14,0.5), (Subcategory21,1.5), (Subcategory22,0.3))
filteredTree : Map(Category1 -> Map(Subcategory11 -> 1.5, Subcategory14 -> 0.5), Category2 -> Map(Subcategory21 -> 1.5, Subcategory22 -> 0.3))
Hope one of those versions is what you needed.

Is it possible to create a list in java using data from multiple text files

I have multiple text files that contain information about the popularity of different programming languages in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down by week (called iot.txt), but this file does not include the country.
Example data from 2004.txt:
Region java c++ c# python JavaScript
Argentina 13 14 10 0 17
Australia 22 20 22 64 26
Austria 23 21 19 31 21
Belgium 20 14 17 34 25
Bolivia 25 0 0 0 0
etc
example from iot.txt:
Week java c++ c# python JavaScript
2004-01-04 - 2004-01-10 88 23 12 8 34
2004-01-11 - 2004-01-17 88 25 12 8 36
2004-01-18 - 2004-01-24 91 24 12 8 36
2004-01-25 - 2004-01-31 88 26 11 7 36
2004-02-01 - 2004-02-07 93 26 12 7 37
My problem is that I am trying to write code that will output the number of countries that have exhibited zero interest in Python.
This is the code I currently use to read the text files, but I'm not sure of the best way to count the number of regions that have zero interest in Python across all the years 2004-2015. At first I thought the best way would be to create a list from all the text files (not including iot.txt) and then search it for any entries that have zero interest in Python, but I have no idea how to do that.
Can anyone suggest a way to do this?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
public class Starter{
public static void main(String[] args) throws Exception {
BufferedReader fh =
new BufferedReader(new FileReader("iot.txt"));
//First line contains the language names
String s = fh.readLine();
List<String> langs =
new ArrayList<>(Arrays.asList(s.split("\t")));
langs.remove(0); //Throw away the first word - "week"
Map<String,HashMap<String,Integer>> iot = new TreeMap<>();
while ((s=fh.readLine())!=null)
{
String [] wrds = s.split("\t");
HashMap<String,Integer> interest = new HashMap<>();
for(int i=0;i<langs.size();i++)
interest.put(langs.get(i), Integer.parseInt(wrds[i+1]));
iot.put(wrds[0], interest);
}
fh.close();
HashMap<Integer,HashMap<String,HashMap<String,Integer>>>
regionsByYear = new HashMap<>();
for (int i=2004;i<2016;i++)
{
BufferedReader fh1 =
new BufferedReader(new FileReader(i+".txt"));
String s1 = fh1.readLine(); //Throw away the first line
HashMap<String,HashMap<String,Integer>> year = new HashMap<>();
while ((s1=fh1.readLine())!=null)
{
String [] wrds = s1.split("\t");
HashMap<String,Integer>langMap = new HashMap<>();
for(int j=1;j<wrds.length;j++){
langMap.put(langs.get(j-1), Integer.parseInt(wrds[j]));
}
year.put(wrds[0],langMap);
}
regionsByYear.put(i,year);
fh1.close();
}
}
}
Create a Map<String, Integer> using a HashMap, and each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find a usage of python, increment that country's value.
At the end, loop through the entrySet of the map and, for each entry whose getValue() is zero, output its getKey(). A sketch of this idea is shown below.
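A minimal sketch of that counting idea, written in Scala (the Java version is a direct translation with HashMap and entrySet()); it assumes a regionsByYear map shaped like the one built in the question, i.e. year -> (country -> (language -> interest)):
import scala.collection.mutable

def countriesWithNoPythonInterest(
    regionsByYear: Map[Int, Map[String, Map[String, Int]]]): Set[String] = {
  val pythonTotal = mutable.Map[String, Int]()   // country -> summed python interest
  for {
    (_, countries) <- regionsByYear
    (country, langs) <- countries
  } {
    // "python" is the column name used in the data files
    val interest = langs.getOrElse("python", 0)
    pythonTotal(country) = pythonTotal.getOrElse(country, 0) + interest
  }
  // countries whose python interest stayed at zero across all years
  pythonTotal.collect { case (country, 0) => country }.toSet
}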

Kafka Error fetching offset data. Reason: 1

I'm running a Java program that uses the Kafka library and checks the committed offset of a consumer group in ZooKeeper every second.
The program runs well for about two hours and then starts throwing a RuntimeException:
java.lang.RuntimeException: Error fetching offset data. Reason: 1
at com.pinterest.secor.common.KafkaClient.getMessage(KafkaClient.java:127)
at com.pinterest.secor.common.KafkaClient.getCommittedMessage(KafkaClient.java:186)
...
What is reason 1? I couldn't find any documentation or page explaining the root cause of this situation.
Check this out, this is a code fragment of kafka.common.ErrorMapping:
val UnknownCode : Short = -1
val NoError : Short = 0
val OffsetOutOfRangeCode : Short = 1
val InvalidMessageCode : Short = 2
val UnknownTopicOrPartitionCode : Short = 3
val InvalidFetchSizeCode : Short = 4
val LeaderNotAvailableCode : Short = 5
val NotLeaderForPartitionCode : Short = 6
val RequestTimedOutCode: Short = 7
val BrokerNotAvailableCode: Short = 8
val ReplicaNotAvailableCode: Short = 9
val MessageSizeTooLargeCode: Short = 10
val StaleControllerEpochCode: Short = 11
val OffsetMetadataTooLargeCode: Short = 12
val StaleLeaderEpochCode: Short = 13
val OffsetsLoadInProgressCode: Short = 14
val ConsumerCoordinatorNotAvailableCode: Short = 15
val NotCoordinatorForConsumerCode: Short = 16
val InvalidTopicCode : Short = 17
val MessageSetSizeTooLargeCode: Short = 18
val NotEnoughReplicasCode : Short = 19
val NotEnoughReplicasAfterAppendCode: Short = 20
As you can see, 1 means OffsetOutOfRangeCode.
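For reference, a small sketch of turning the numeric code into something readable, assuming the pre-0.10 Scala client that ships kafka.common.ErrorMapping (as in the listing above):
import kafka.common.ErrorMapping

// Translate the short error code carried in the response.
def describe(code: Short): String = code match {
  case ErrorMapping.NoError                     => "no error"
  case ErrorMapping.OffsetOutOfRangeCode        => "offset out of range"
  case ErrorMapping.UnknownTopicOrPartitionCode => "unknown topic or partition"
  case other                                    => s"see kafka.common.ErrorMapping for code $other"
}
In practice, OffsetOutOfRange here usually means the offset being requested no longer exists on the broker, for example because log retention has already deleted that segment.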

How can I extract components from fastICA

I have problems getting ICA components from fastICA in R. When I try to extract 6 components with the fastICA function it gives only one component, but there should be 6 components. Up to 5 it worked perfectly, but beyond 5 it gives a varying number of components. Can anyone tell me what the reason for that is?
Function and Parameters:
ICA6 <- fastICA(X, 6, alg.typ = "parallel", fun = "logcosh", alpha = 1,
method = "R", row.norm = FALSE, maxit = 200, tol = 0.0001, verbose = TRUE)
Here are the last few lines of my output for ICA6 and ICA5:
hpc-admin@aiken:~/gayan$ tail ICA6
[1992,] -1.755614e-01
[1993,] -1.931838e-01
[1994,] -1.403488e-01
[1995,] 4.952370e-01
[1996,] 3.798545e-02
[1997,] -8.870945e-02
[1998,] -1.847535e-01
[1999,] 2.084906e-01
[2000,] 2.235841e-01
hpc-admin@aiken:~/gayan$ tail ICA5
[1992,] -4.449966e-02 2.348224e-02 -0.1296879740 4.220189e-02 -0.1751827781
[1993,] -7.690094e-02 1.725353e-02 -0.1153838819 1.694351e-01 -0.1308105118
[1994,] -4.777415e-02 2.299214e-02 -0.1259907838 -6.011591e-03 -0.1605316621
[1995,] 4.354237e-02 2.295694e-02 -0.2499377363 -2.227481e-01 0.4414782035
[1996,] -3.848286e-02 2.121986e-02 -0.1361600020 -8.448882e-02 -0.0005046113
[1997,] -3.030994e-03 2.285310e-02 -0.1407370888 -1.215308e-02 -0.1062227838
[1998,] -3.988264e-03 2.335983e-02 -0.1497881709 1.787074e-02 -0.1982725941
[1999,] -7.483824e-02 1.096696e-02 -0.0672348301 -1.665848e-01 0.1489732404
[2000,] -7.123032e-01 1.123832e-02 0.5153474842 -1.785166e-01 0.1632015019
