I am trying to build a NN with LSTM and Dense layers.
My net is:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(123)
.weightInit(WeightInit.XAVIER)
.updater(new Adam(0.1))
.list()
.layer(0, new LSTM.Builder().activation(Activation.TANH).nIn(numInputs).nOut(120).build())
.layer(1, new DenseLayer.Builder().activation(Activation.RELU).nIn(120).nOut(1000).build())
.layer(2, new DenseLayer.Builder().activation(Activation.RELU).nIn(1000).nOut(20).build())
.layer(new OutputLayer.Builder(LossFunction.NEGATIVELOGLIKELIHOOD).activation(Activation.SOFTMAX).nIn(20).nOut(numOutputs).build())
.inputPreProcessor(1, new RnnToFeedForwardPreProcessor())
.build();
I read my data like this:
SequenceRecordReader reader = new CSVSequenceRecordReader(0, ",");
reader.initialize(new NumberedFileInputSplit("TRAIN_%d.csv", 1, 17476));
DataSetIterator trainIter = new SequenceRecordReaderDataSetIterator(reader, miniBatchSize, 6, 7, false);
allData = trainIter.next();
//Load the test/evaluation data:
SequenceRecordReader testReader = new CSVSequenceRecordReader(0, ",");
testReader.initialize(new NumberedFileInputSplit("TEST_%d.csv", 1, 8498));
DataSetIterator testIter = new SequenceRecordReaderDataSetIterator(testReader, miniBatchSize, 6, 7, false);
allData = testIter.next();
So when it goes into the net, the data has shape [batch, features, timesteps] = [32, 7, 60].
I can verify this from the error produced when I deliberately set a wrong nIn:
Received input with size(1) = 7 (input array shape = [32, 7, 60]); input.size(1) must match layer nIn size (nIn = 9)
So the data goes into the net fine. After the first LSTM layer it should be reshaped into 2-D and then passed through the Dense layer.
But then I get the following problem:
Labels and preOutput must have equal shapes: got shapes [32, 6, 60] vs [1920, 6]
It did not reshape before going into the Dense layer, and one feature is missing (the shape is now [32, 6, 60] instead of [32, 7, 60]). Why?
If possible, you'll want to use setInputType, which will set up the pre-processors for you.
Here's an example configuration going from an LSTM to a Dense layer:
MultiLayerConfiguration conf1 = new NeuralNetConfiguration.Builder()
.trainingWorkspaceMode(wsm)
.inferenceWorkspaceMode(wsm)
.seed(12345)
.updater(new Adam(0.1))
.list()
.layer(new LSTM.Builder().nIn(3).nOut(3).dataFormat(rnnDataFormat).build())
.layer(new DenseLayer.Builder().nIn(3).nOut(3).activation(Activation.TANH).build())
.layer(new RnnOutputLayer.Builder().nIn(3).nOut(3).activation(Activation.SOFTMAX).dataFormat(rnnDataFormat)
.lossFunction(LossFunctions.LossFunction.MCXENT).build())
.setInputType(InputType.recurrent(3, rnnDataFormat))
.build();
The RNN format is:
import org.deeplearning4j.nn.conf.RNNFormat;
This is an enum that specifies what your data format should be (channels last or first).
From the javadoc:
/**
* NCW = "channels first" - arrays of shape [minibatch, channels, width]<br>
* NWC = "channels last" - arrays of shape [minibatch, width, channels]<br>
* "width" corresponds to sequence length and "channels" corresponds to sequence item size.
*/
Source here: https://github.com/eclipse/deeplearning4j/blob/1930d9990810db6214829c716c2ae7eb7f59cd13/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/RNNFormat.java#L21
More here in our tests: https://github.com/eclipse/deeplearning4j/blob/1930d9990810db6214829c716c2ae7eb7f59cd13/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/nn/layers/recurrent/TestTimeDistributed.java#L58
Note: Solutions in R, Python, Java, or, if necessary, C++ or C# are desired.
I am trying to draw contours based on transportation time. To be clearer, I want to cluster the points that have similar travel time (say, in 10-minute intervals) to a specific point (destination) and map them as contours or a heatmap.
Right now, the only idea I have is to use the R package gmapsdistance to find the travel time for different origins, then cluster them and draw them on a map. But, as you can tell, this is in no way a robust solution.
This thread on the GIS community and this one for Python illustrate a similar problem, but from an origin to the destinations reachable within a specific time. I want to do the reverse: find the origins from which I can travel to the destination within a certain time.
Right now, the code below shows my rudimentary idea (using R):
library(gmapsdistance)
set.api.key("YOUR.API.KEY")
mdestination <- "40.7+-73"
morigin1 <- "40.6+-74.2"
morigin2 <- "40+-74"
gmapsdistance(origin = morigin1,
destination = mdestination,
mode = "transit")
gmapsdistance(origin = morigin2,
destination = mdestination,
mode = "transit")
This map may also help in understanding the question:
Using this answer I can get the points I can travel to from a point of origin, but I need to reverse that and find the points that have a travel time less than or equal to a certain value to my destination:
library(httr)
library(googleway)
library(jsonlite)
appId <- "TravelTime_APP_ID"
apiKey <- "TravelTime_API_KEY"
mapKey <- "GOOGLE_MAPS_API_KEY"
location <- c(40, -73)
CommuteTime <- (5 / 6) * 60 * 60
url <- "http://api.traveltimeapp.com/v4/time-map"
requestBody <- paste0('{
"departure_searches" : [
{"id" : "test",
"coords": {"lat":', location[1], ', "lng":', location[2],' },
"transportation" : {"type" : "driving"} ,
"travel_time" : ', CommuteTime, ',
"departure_time" : "2017-05-03T07:20:00z"
}
]
}')
res <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBody,
encode = "json")
res <- jsonlite::fromJSON(as.character(res))
pl <- lapply(res$results$shapes[[1]]$shell, function(x){
googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
df <- data.frame(polyline = unlist(pl))
df_marker <- data.frame(lat = location[1], lon = location[2])
google_map(key = mapKey) %>%
add_markers(data = df_marker) %>%
add_polylines(data = df, polyline = "polyline")
Moreover, the documentation of the TravelTime Map Platform talks about Multi Origins with Arrival time, which is exactly what I want to do. But I need to do that for both public transport and driving (for places with less than an hour of commute time), and since public transport is tricky (it depends on which station you are close to), I think a heatmap may be a better option than contours.
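For reference, I suspect the arrival-based request would look roughly like this; this is an untested sketch, and the "arrival_searches" block with its field names is only my guess from the TravelTime documentation, mirroring the departure-based request above (the transportation type is arbitrary here):
# Hypothetical sketch: swap "departure_searches" for "arrival_searches" so the
# returned shape describes origins that can reach the destination within CommuteTime.
requestBodyArrival <- paste0('{
  "arrival_searches" : [
    {"id" : "test_arrival",
     "coords": {"lat":', location[1], ', "lng":', location[2], ' },
     "transportation" : {"type" : "public_transport"},
     "travel_time" : ', CommuteTime, ',
     "arrival_time" : "2017-05-03T08:00:00z"
    }
  ]
}')
The response could then be POSTed and decoded exactly as in the departure-based snippet above.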
This answer is based on obtaining an origin-destination matrix between a grid of (roughly) equally distant points. This is a compute-intensive operation, not only because it requires a good number of API calls to mapping services, but also because the servers must calculate a matrix for each call. The number of required origin-destination pairs grows quadratically with the number of points in the grid.
To tackle this problem, I suggest you consider running a mapping server on your local machine or on a local server. Project OSRM offers a relatively simple, free, and open-source solution that lets you run an OpenStreetMap routing server in a Linux Docker container (https://github.com/Project-OSRM/osrm-backend). Having your own local mapping server will allow you to make as many API calls as you desire. R's osrm package lets you interact with OpenStreetMap's APIs, including those served by a local server.
library(raster) # Optional
library(sp)
library(ggmap)
library(tidyverse)
library(osrm)
devtools::install_github("cmartin/ggConvexHull") # Needed to quickly draw the contours
library(ggConvexHull)
I create a grid of 96 roughly equally distant points around the Brussels (Belgium) conurbation.
This grid does not take the earth's curvature into consideration, which is negligible at the scale of city distances.
For convenience, I use the raster package to download a shapefile of Belgium and extract the polygon for the Brussels region.
BE <- raster::getData("GADM", country = "BEL", level = 1)
Bruxelles <- BE[BE$NAME_1 == "Bruxelles", ]
df_grid <- makegrid(Bruxelles, cellsize = 0.02) %>%
SpatialPoints() %>%
## I convert the SpatialPoints object into a simple data.frame
as.data.frame() %>%
## create a unique id for each point in the data.frame
rownames_to_column() %>%
## rename variables of the data.frame with more explanatory names.
rename(id = rowname, lat = x2, lon = x1)
## I point osrm.server to the OpenStreet docker running in my Linux machine. ...
### ... Do not run this if you are getting your data from OpenStreet public servers.
options(osrm.server = "http://127.0.0.1:5000/")
## I obtain a list with distances (Origin Destination Matrix in ...
### ... minutes, origins and destinations)
Distance_Tables <- osrmTable(loc = df_grid)
OD_Matrix <- Distance_Tables$durations %>% ## subset the previous list
## convert the Origin Destination Matrix into a tibble
as_data_frame() %>%
rownames_to_column() %>%
## make sure we have an id column for the OD tibble
rename(origin_id = rowname) %>%
## transform the tibble into long/tidy format
gather(key = destination_id, value = distance_time, -origin_id) %>%
left_join(df_grid, by = c("origin_id" = "id")) %>%
## set origin coordinates
rename(origin_lon = lon, origin_lat = lat) %>%
left_join(df_grid, by = c("destination_id" = "id")) %>%
## set destination coordinates
rename(destination_lat = lat, destination_lon = lon)
## Obtain a nice looking road map of Brussels
Brux_map <- get_map(location = "bruxelles, belgique",
zoom = 11,
source = "google",
maptype = "roadmap")
ggmap(Brux_map) +
geom_point(aes(x = origin_lon, y = origin_lat),
data = OD_Matrix %>%
## Here I selected point_id 42 as the desired target, ...
## ... just because it is not far from the City Center.
filter(destination_id == 42),
size = 0.5) +
## Draw a diamond around point_id 42
geom_point(aes(x = origin_lon, y = origin_lat),
data = OD_Matrix %>%
filter(destination_id == 42, origin_id == 42),
shape = 5, size = 3) +
## Contour marking a travel time of up to 8 minutes
geom_convexhull(alpha = 0.2,
fill = "blue",
colour = "blue",
data = OD_Matrix %>%
filter(destination_id == 42,
distance_time <= 8),
aes(x = origin_lon, y = origin_lat)) +
## Contour marking a travel time of up to 15 minutes
geom_convexhull(alpha = 0.2,
fill = "red",
colour = "red",
data = OD_Matrix %>%
filter(destination_id == 42,
distance_time <= 15),
aes(x = origin_lon, y = origin_lat))
Results
The blue contour represents travel times to the city centre of up to 8 minutes.
The red contour represents travel times of up to 15 minutes.
I came up with an approach that keeps the number of API calls manageable.
The idea is to find the places you can reach from each location within a certain time (look at this thread). Traffic can be simulated by changing the departure time from morning to evening. You end up with an overlapping area which you can reach from both places.
Then you can use Nicolás's answer, map some points within that overlapping area, and draw the heatmap for the destinations you have. This way you have less area (fewer points) to cover, and therefore you make far fewer API calls (remember to use an appropriate departure time).
Below, I try to demonstrate what I mean and get you to the point where you can build the grid mentioned in the other answer to make your estimation more robust.
This shows how to map the intersected area.
library(httr)
library(googleway)
library(jsonlite)
appId <- "Travel.Time.ID"
apiKey <- "Travel.Time.API"
mapKey <- "Google.Map.ID"
locationK <- c(40, -73) #K
locationM <- c(40, -74) #M
CommuteTimeK <- (3 / 4) * 60 * 60
CommuteTimeM <- (0.55) * 60 * 60
url <- "http://api.traveltimeapp.com/v4/time-map"
requestBodyK <- paste0('{
"departure_searches" : [
{"id" : "test",
"coords": {"lat":', locationK[1], ', "lng":', locationK[2],' },
"transportation" : {"type" : "public_transport"} ,
"travel_time" : ', CommuteTimeK, ',
"departure_time" : "2018-06-27T13:00:00z"
}
]
}')
requestBodyM <- paste0('{
"departure_searches" : [
{"id" : "test",
"coords": {"lat":', locationM[1], ', "lng":', locationM[2],' },
"transportation" : {"type" : "driving"} ,
"travel_time" : ', CommuteTimeM, ',
"departure_time" : "2018-06-27T13:00:00z"
}
]
}')
resKi <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBodyK,
encode = "json")
resMi <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBodyM,
encode = "json")
resK <- jsonlite::fromJSON(as.character(resKi))
resM <- jsonlite::fromJSON(as.character(resMi))
plK <- lapply(resK$results$shapes[[1]]$shell, function(x){
googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
plM <- lapply(resM$results$shapes[[1]]$shell, function(x){
googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
dfK <- data.frame(polyline = unlist(plK))
dfM <- data.frame(polyline = unlist(plM))
df_markerK <- data.frame(lat = locationK[1], lon = locationK[2], colour = "#green")
df_markerM <- data.frame(lat = locationM[1], lon = locationM[2], colour = "#lavender")
iconK <- "red"
df_markerK$icon <- iconK
iconM <- "blue"
df_markerM$icon <- iconM
google_map(key = mapKey) %>%
add_markers(data = df_markerK,
lat = "lat", lon = "lon",colour = "icon",
mouse_over = "K_K") %>%
add_markers(data = df_markerM,
lat = "lat", lon = "lon", colour = "icon",
mouse_over = "M_M") %>%
add_polygons(data = dfM, polyline = "polyline", stroke_colour = '#461B7E',
fill_colour = '#461B7E', fill_opacity = 0.6) %>%
add_polygons(data = dfK, polyline = "polyline",
stroke_colour = '#F70D1A',
fill_colour = '#FF2400', fill_opacity = 0.4)
You can extract the intersected area like this:
# install.packages(c("rgdal", "sp", "raster","rgeos","maptools"))
library(rgdal)
library(sp)
library(raster)
library(rgeos)
library(maptools)
Kdata <- resK$results$shapes[[1]]$shell
Mdata <- resM$results$shapes[[1]]$shell
xyfunc <- function(mydf) {
xy <- mydf[,c(2,1)]
return(xy)
}
spdf <- function(xy, mydf){
sp::SpatialPointsDataFrame(
coords = xy, data = mydf,
proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))}
for (i in (1:length(Kdata))) {Kdata[[i]] <- xyfunc(Kdata[[i]])}
for (i in (1:length(Mdata))) {Mdata[[i]] <- xyfunc(Mdata[[i]])}
Kshp <- list(); for (i in (1:length(Kdata))) {Kshp[i] <- spdf(Kdata[[i]],Kdata[[i]])}
Mshp <- list(); for (i in (1:length(Mdata))) {Mshp[i] <- spdf(Mdata[[i]],Mdata[[i]])}
Kbind <- do.call(bind, Kshp)
Mbind <- do.call(bind, Mshp)
#plot(Kbind);plot(Mbind)
x <- intersect(Kbind,Mbind)
#plot(x)
xdf <- data.frame(x)
xdf$icon <- "https://i.stack.imgur.com/z7NnE.png"
google_map(key = mapKey,
location = c(mean(xdf$lat), mean(xdf$lng)), zoom = 8) %>%
add_markers(data = xdf, lat = "lat", lon = "lng", marker_icon = "icon")
This is just an illustration of the intersected area.
Now you can get the coordinates from the xdf data frame and construct your grid around those points to finally come up with a heatmap. To respect the other user who came up with that idea/answer, I am not including it here and am just referencing it:
Nicolás Velásquez - Obtaining an Origin-Destination Matrix between a Grid of (Roughly) Equally Distant Points
I have a question regarding h2o.stackedEnsemble in R. When I try to create an ensemble from GLM models (or a mix of other models and a GLM), I get the following error:
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
at water.MRTask.getResult(MRTask.java:478)
at water.MRTask.getResult(MRTask.java:486)
at water.MRTask.doAll(MRTask.java:390)
at water.MRTask.doAll(MRTask.java:396)
at hex.StackedEnsembleModel.predictScoreImpl(StackedEnsembleModel.java:123)
at hex.StackedEnsembleModel.doScoreMetricsOneFrame(StackedEnsembleModel.java:194)
at hex.StackedEnsembleModel.doScoreOrCopyMetrics(StackedEnsembleModel.java:206)
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeMetaLearner(StackedEnsemble.java:302)
at hex.ensemble.StackedEnsemble$StackedEnsembleDriver.computeImpl(StackedEnsemble.java:231)
at hex.ModelBuilder$Driver.compute2(ModelBuilder.java:206)
at water.H2O$H2OCountedCompleter.compute(H2O.java:1263)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
Caused by: java.lang.NullPointerException
Error: DistributedException from localhost/127.0.0.1:54321: 'null', caused by java.lang.NullPointerException
The error does not occur when I stack any other models; it only appears with the GLM. Of course, I use the same folds for cross-validation.
Some sample code for training the model and the ensemble:
glm_grid <- h2o.grid(algorithm = "glm",
family = 'binomial',
grid_id = "glm_grid",
x = predictors,
y = response,
seed = 1,
fold_column = "fold_assignment",
training_frame = train_h2o,
keep_cross_validation_predictions = TRUE,
hyper_params = list(alpha = seq(0, 1, 0.05)),
lambda_search = TRUE,
search_criteria = search_criteria,
balance_classes = TRUE,
early_stopping = TRUE)
glm <- h2o.getGrid("glm_grid",
sort_by="auc",
decreasing=TRUE)
ensemble <- h2o.stackedEnsemble(x = predictors,
y = response,
training_frame = train_h2o,
model_id = "ens_1",
base_models = glm@model_ids[1:5])
This is a bug, and you can track the progress of the fix here (this should be fixed in the next release, but it might be fixed sooner and available on the nightly releases).
I was going to suggest training the GLMs in a loop or apply function (instead of using h2o.grid()) as a temporary work-around, but unfortunately, the same error happens.
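For reference, the loop-based attempt looks roughly like this; it is only a sketch reusing the variable names from the question (predictors, response, train_h2o, and the fold_assignment column), and it currently hits the same NullPointerException until the bug is fixed:
## Sketch: train one GLM per alpha value with lapply() instead of h2o.grid(),
## then pass the resulting model IDs to the stacked ensemble.
glm_models <- lapply(seq(0, 1, 0.05), function(a) {
  h2o.glm(x = predictors,
          y = response,
          training_frame = train_h2o,
          family = "binomial",
          alpha = a,
          lambda_search = TRUE,
          fold_column = "fold_assignment",
          keep_cross_validation_predictions = TRUE,
          seed = 1)
})
ensemble <- h2o.stackedEnsemble(x = predictors,
                                y = response,
                                training_frame = train_h2o,
                                model_id = "ens_loop",
                                base_models = lapply(glm_models, function(m) m@model_id))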
I currently have an .ipynb file (IPython notebook) that contains linear regression code / a model (I'm not entirely sure whether it is a model) that I created earlier.
How do I implement this model in an Android application such that if I input a value of 'x' in a text box, it outputs the predicted value of 'y' in a TextView? The function is y = mx + b.
I've tried looking at different tutorials, but they were mostly not step-by-step guides, which made them really hard to understand; I'm a beginner at coding.
Here's my code for the model:
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
rng = np.random
from numpy import genfromtxt
from sklearn.datasets import load_boston
# Parameters
learning_rate = 0.0001
training_epochs = 1000
display_step = 50
n_samples = 222
X = tf.placeholder("float") # create symbolic variables
Y = tf.placeholder("float")
filename_queue = tf.train.string_input_producer(["C://Users//Shiina//battdata.csv"],shuffle=False)
reader = tf.TextLineReader() # skip_header_lines=1 if csv has headers
key, value = reader.read(filename_queue)
# Default values, in case of empty columns. Also specifies the type of the
# decoded result.
record_defaults = [[1.], [1.]]
height, soc= tf.decode_csv(
value, record_defaults=record_defaults)
features = tf.stack([height])
# Set model weights
W = tf.Variable(rng.randn(), name="weight")
b = tf.Variable(rng.randn(), name="bias")
# Construct a linear model
pred_soc = tf.add(tf.multiply(height, W), b) # XW + b <- y = mx + b where W is gradient, b is intercept
# Mean squared error
cost = tf.reduce_sum(tf.pow(pred_soc-soc, 2))/(2*n_samples)
# Gradient descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
# Initializing the variables
init = tf.global_variables_initializer()
with tf.Session() as sess:
    # Start populating the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    sess.run(init)

    # Fit all training data
    for epoch in range(training_epochs):
        _, cost_value = sess.run([optimizer, cost])

        # Display logs per epoch step
        if (epoch + 1) % display_step == 0:
            c = sess.run(cost)
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(c),
                  "W=", sess.run(W), "b=", sess.run(b))

    print("Optimization Finished!")
    training_cost = sess.run(cost)
    print("Training cost=", training_cost, "W=", sess.run(W), "b=", sess.run(b), '\n')

    # Plot data after completing training
    train_X = []
    train_Y = []
    for i in range(n_samples):  # Your input data size to loop through once
        X, Y = sess.run([height, pred_soc])  # Call pred to get the prediction with the updated weights
        train_X.append(X)
        train_Y.append(Y)

    # Graphic display
    plt.plot(train_X, train_Y, 'ro', label='Original data')
    plt.ylabel("SoC")
    plt.xlabel("Height")
    plt.axis([0, 1, 0, 100])
    plt.plot(train_X, train_Y, linewidth=2.0)
    plt.legend()
    plt.show()

    coord.request_stop()
    coord.join(threads)
When I calculate the variance of my data, I have to collect it first. Are there any other methods?
My data format:
1 2 3
1 4 5
4 5 6
4 7 8
7 8 9
10 11 12
10 13 14
10 1 2
1 100 100
10 11 2
10 11 2
1 2 5
4 7 6
code:
val conf = new SparkConf().setAppName("hh")
conf.setMaster("local[3]")
val sc = new SparkContext(conf)
val data = sc.textFile("/home/hadoop4/Desktop/i.txt")
.map(_.split("\t")).map(f => f.map(f => f.toDouble))
.map(f => ("k"+f(0),f(1)))
//data:RDD[(String,Double)]
val dataArr = data.map(f=>(f._1,ArrayBuffer(f._2)))
//dataArr RDD[(String,ArrayBuffer[Double])]
dataArr.collect().foreach(println(_))
//output
(k1.0,ArrayBuffer(2.0))
(k1.0,ArrayBuffer(4.0))
(k4.0,ArrayBuffer(5.0))
(k4.0,ArrayBuffer(7.0))
(k7.0,ArrayBuffer(8.0))
(k10.0,ArrayBuffer(11.0))
(k10.0,ArrayBuffer(13.0))
(k10.0,ArrayBuffer(1.0))
(k1.0,ArrayBuffer(100.0))
(k10.0,ArrayBuffer(11.0))
(k10.0,ArrayBuffer(11.0))
(k1.0,ArrayBuffer(2.0))
(k4.0,ArrayBuffer(7.0))
val dataArrRed = dataArr.reduceByKey((x,y)=>x++=y)
//dataArrRed :RDD[(String,ArrayBuffer[Double])]
dataArrRed.collect().foreach(println(_))
//output
(k1.0,ArrayBuffer(2.0, 4.0, 100.0, 2.0))
(k7.0,ArrayBuffer(8.0))
(k10.0,ArrayBuffer(11.0, 13.0, 1.0, 11.0, 11.0))
(k4.0,ArrayBuffer(5.0, 7.0, 7.0))
val dataARM = dataArrRed.collect().map(
f=>(f._1,sc.makeRDD(f._2,2)))
val dataARMM = dataARM.map(
f=>(f._1,(f._2.variance(),f._2.max(),f._2.min())))
.foreach(println(_))
sc.stop()
//output
(k1.0,(1777.0,100.0,2.0))
(k7.0,(0.0,8.0,8.0))
(k10.0,(18.24,13.0,1.0))
(k4.0,(0.8888888888888888,7.0,5.0))
// Update: now I calculate the second and third columns at the same time, put them into an Array(f(1), f(2)), turn that into an RDD, and aggregateByKey over it. The 'zero value' is Array(new StatCounter(), new StatCounter()), but it has some problem.
val dataArray2 = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), Array(f(1),f(2))))
val data2 = sc.parallelize(dataArray2)
val dataStat2 = data2.aggregateByKey(Array(new StatCounter(),new
StatCounter()))
({
(s,v)=>(
s(0).merge(v(0)),s(1).merge(v(1))
)
},{
(s,t)=>(
s(0).merge(v(0)),s(1).merge(v(1))
)})
It's wrong. Can I use Array(new StatCounter(), new StatCounter())? Thanks.
Worked example. It turns out to be a one-liner, plus another line to map it into the OP's format.
Slightly different way of getting the data (more convenient for testing, but same result):
val dataString = """1 2 3
1 4 5
4 5 6
4 7 8
7 8 9
10 11 12
10 13 14
10 1 2
1 100 100
10 11 2
10 11 2
1 2 5
4 7 6
""".trim
val dataArray = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), f(1)))
val data = sc.parallelize(dataArray)
Build the stats by key
val dataStats = data.aggregateByKey(new StatCounter())
({(s,v)=>s.merge(v)}, {(s,t)=>s.merge(t)})
Or, slightly shorter but perhaps over-tricky:
val dataStats = data.aggregateByKey(new StatCounter())(_ merge _, _ merge _)
Re-format to the OP's format and print
val result = dataStats.map(f=>(f._1,(f._2.variance,f._2.max,f._2.min)))
.foreach(println(_))
Output, same apart from some rounding errors.
(k1.0,(1776.9999999999998,100.0,2.0))
(k7.0,(0.0,8.0,8.0))
(k10.0,(18.240000000000002,13.0,1.0))
(k4.0,(0.888888888888889,7.0,5.0))
EDIT: Version with two columns
val dataArray = dataString.split("\\n")
.map(_.split("\\s+")).map(_.map(_.toDouble))
.map(f => ("k" + f(0), Array(f(1), f(2))))
val data = sc.parallelize(dataArray)
val dataStats = data.aggregateByKey(Array(new StatCounter(), new StatCounter()))({(s, v)=> Array(s(0) merge v(0), s(1) merge v(1))}, {(s, t)=> Array(s(0) merge t(0), s(1) merge t(1))})
val result = dataStats.map(f => (f._1, (f._2(0).variance, f._2(0).max, f._2(0).min), (f._2(1).variance, f._2(1).max, f._2(1).min)))
.foreach(println(_))
Output
(k1.0,(1776.9999999999998,100.0,2.0),(1716.6875,100.0,3.0))
(k7.0,(0.0,8.0,8.0),(0.0,9.0,9.0))
(k10.0,(18.240000000000002,13.0,1.0),(29.439999999999998,14.0,2.0))
(k4.0,(0.888888888888889,7.0,5.0),(0.888888888888889,8.0,6.0))
EDIT2: "n"-column version
val n = 2
val dataStats = data.aggregateByKey(List.fill(n)(new StatCounter()))(
{(s, v)=> (s zip v).map{case (si, vi) => si merge vi}},
{(s, t)=> (s zip t).map{case (si, ti) => si merge ti}})
val result = dataStats.map(f => (f._1, f._2.map(x => (x.variance, x.max, x.min))))
.foreach(println(_))
The output is the same as above, but if you have more columns you can change n. It will break if the Array in any row has fewer than n elements.
I would simply use a stats object (class StatCounter). Then I would:
parse the file, split each line
create the tuples and partition the RDD
use aggregate by key and collect as an RDD of stat objects