Environment: Java 1.8, Cloudera Quickstart VM.
I have loaded data from a CSV file into Hadoop HDFS. Each row represents a bus route.
id vendor start_datetime end_datetime trip_duration_in_sec
17534 A 1/1/2013 12:00 1/1/2013 12:14 840
68346 A 1/1/2013 12:13 1/1/2013 12:18 300
09967 B 1/1/2013 12:34 1/1/2013 12:39 300
09967 B 1/1/2013 12:44 1/1/2013 12:51 420
09967 A 1/1/2013 12:54 1/1/2013 12:56 120
For every day, I want to find the hour in which each vendor (A and B) has the most bus routes, using Java and Spark.
A result could be:
1/1/2013 (Day 1) - Vendor A has 3 bus routes in the 12:00-13:00 hour. (Vendor A had its most bus routes during that hour.)
1/1/2013 (Day 1) - Vendor B has 2 bus routes in the 12:00-13:00 hour. (Vendor B had its most bus routes during that hour.)
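My Java code is:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;
Dataset<Row> ds; // loaded from the CSV file in HDFS (loading code omitted)
ds.groupBy(window(col("start_datetime"), "1 hour")).count().show();
But I can't find which hour has the most routes per day.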
I'm not very familiar with Java, so I'll explain it in Scala.
The key to finding the hour with the most routes per day per vendor is to count by (vendor, day, hour), then aggregate by (vendor, day) to find the hour with the maximum count in each group. The day and the hour of each record can be parsed from start_datetime.
// Assumes a SparkSession named spark with spark.implicits._ in scope (as in spark-shell).
val df = spark.createDataset(Seq(
  ("17534","A","1/1/2013 12:00","1/1/2013 12:14",840),
  ("68346","A","1/1/2013 12:13","1/1/2013 12:18",300),
  ("09967","B","1/1/2013 12:34","1/1/2013 12:39",300),
  ("09967","B","1/1/2013 12:44","1/1/2013 12:51",420),
  ("09967","A","1/1/2013 12:54","1/1/2013 12:56",120)
))
df.rdd.map(t => {
    // each record is a tuple: (id, vendor, start_datetime, end_datetime, duration)
    val vendor = t._2
    val day = t._3.split(" ")(0)
    val hour = t._3.split(" ")(1).split(":")(0)
    ((vendor, day, hour), 1)
  })
  // count by key
  .aggregateByKey(0)((x: Int, y: Int) => x + y, (x: Int, y: Int) => x + y)
  .map(t => {
    val ((vendor, day, hour), cnt) = t
    ((vendor, day), (hour, cnt))
  })
  // solve the max cnt by key (vendor, day)
  .foldByKey(("", 0))((z: (String, Int), i: (String, Int)) => if (i._2 > z._2) i else z)
  .foreach(t => println(s"${t._1._2} - Vendor ${t._1._1} has ${t._2._2} bus routes from ${t._2._1}:00 hour."))
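Since the question asks for Java, the same two-step logic (count per vendor, day and hour, then keep the busiest hour per vendor and day) can be sketched with plain Java 8 collections. This is only an illustration of the aggregation, not a Spark job: the class name BusiestHour, the Peak holder and the "vendor|day" key format are hypothetical, and the rows are assumed to be already split into fields in the column order of the sample data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BusiestHour {

    /** Busiest hour and its route count for one (vendor, day) group. */
    public static final class Peak {
        public final String hour;
        public final int count;

        Peak(String hour, int count) {
            this.hour = hour;
            this.count = count;
        }
    }

    /**
     * Each row is {id, vendor, start_datetime, end_datetime, duration}, with
     * start_datetime formatted like "1/1/2013 12:00" (assumed from the sample).
     * The returned map is keyed by "vendor|day".
     */
    public static Map<String, Peak> busiestHours(List<String[]> rows) {
        // Step 1: count routes per (vendor, day, hour).
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] row : rows) {
            String vendor = row[1];
            String[] dateAndTime = row[2].split(" ");
            String day = dateAndTime[0];
            String hour = dateAndTime[1].split(":")[0];
            counts.merge(vendor + "|" + day + "|" + hour, 1, Integer::sum);
        }
        // Step 2: per (vendor, day), keep the hour with the largest count.
        // On a tie, the hour that sorts first as a string is kept.
        Map<String, Peak> peaks = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String key = e.getKey();
            int lastSep = key.lastIndexOf('|');
            String vendorDay = key.substring(0, lastSep);
            String hour = key.substring(lastSep + 1);
            Peak best = peaks.get(vendorDay);
            if (best == null || e.getValue() > best.count) {
                peaks.put(vendorDay, new Peak(hour, e.getValue()));
            }
        }
        return peaks;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"17534", "A", "1/1/2013 12:00", "1/1/2013 12:14", "840"});
        rows.add(new String[] {"68346", "A", "1/1/2013 12:13", "1/1/2013 12:18", "300"});
        rows.add(new String[] {"09967", "B", "1/1/2013 12:34", "1/1/2013 12:39", "300"});
        rows.add(new String[] {"09967", "B", "1/1/2013 12:44", "1/1/2013 12:51", "420"});
        rows.add(new String[] {"09967", "A", "1/1/2013 12:54", "1/1/2013 12:56", "120"});
        for (Map.Entry<String, Peak> e : busiestHours(rows).entrySet()) {
            String[] vendorDay = e.getKey().split("\\|");
            System.out.println(vendorDay[1] + " - Vendor " + vendorDay[0] + " has "
                    + e.getValue().count + " bus routes from " + e.getValue().hour + ":00 hour.");
        }
    }
}
```

With the five sample rows this reports 3 routes for vendor A and 2 for vendor B, both in hour 12. In a real Spark job the two steps would be distributed rather than run in a single loop.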
I am trying to draw contours based on transportation time. To be more clear, I want to cluster the points which have similar travel time (let's say 10 minute interval) to a specific point (destination) and map them as contours or a heatmap.
Right now, the only idea that I have is using R package gmapsdistance to find the travel time for different origins and then cluster them and draw them on a map. But, as you can tell, this is in no way a robust solution.
This thread on the GIS community and this one for Python illustrate a similar problem, but for the destinations reachable from an origin within a specific time. I want the reverse: to find the origins from which I can reach the destination within a certain time.
Right now, the code below shows my rudimentary idea (using R):
mdestination <- "40.7+-73"
morigin1 <- "40.6+-74.2"
morigin2 <- "40+-74"
gmapsdistance(origin = morigin1,
destination = mdestination,
mode = "transit")
gmapsdistance(origin = morigin2,
destination = mdestination,
mode = "transit")
This map also may help to understand the question:
Using this answer I can get the points that I can reach from a point of origin, but I need to reverse it and find the points whose travel time to my destination is less than or equal to a certain time:
appId <- "TravelTime_APP_ID"
apiKey <- "TravelTime_API_KEY"
location <- c(40, -73)
CommuteTime <- (5 / 6) * 60 * 60
url <- "http://api.traveltimeapp.com/v4/time-map"
requestBody <- paste0('{
  "departure_searches" : [
    {"id" : "test",
     "coords": {"lat":', location[1], ', "lng":', location[2],' },
     "transportation" : {"type" : "driving"} ,
     "travel_time" : ', CommuteTime, ',
     "departure_time" : "2017-05-03T07:20:00z"
    }
  ]
}')
res <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBody,
encode = "json")
res <- jsonlite::fromJSON(as.character(res))
pl <- lapply(res$results$shapes[[1]]$shell, function(x){
  googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
df <- data.frame(polyline = unlist(pl))
df_marker <- data.frame(lat = location[1], lon = location[2])
google_map(key = mapKey) %>%
add_markers(data = df_marker) %>%
add_polylines(data = df, polyline = "polyline")
Moreover, the documentation of the TravelTime platform describes Multi Origins with Arrival time, which is exactly what I want to do. But I need to do that for both public transportation and driving (for places with less than an hour of commute time), and since public transport is tricky (it depends on which station you are close to), a heatmap may be a better option than contours.
This answer is based on obtaining an origin-destination matrix between a grid of (roughly) equally distant points. This is a computationally intensive operation, not only because it requires a good number of API calls to mapping services, but also because the servers must calculate a matrix for each call. The number of origin-destination pairs grows quadratically with the number of points in the grid.
To tackle this problem, I would suggest that you consider running a mapping server on your local machine or on a local server. Project OSRM offers a relatively simple, free, and open-source solution that lets you run an OpenStreetMap-based routing server in a Docker container on Linux (https://github.com/Project-OSRM/osrm-backend). Having your own local mapping server allows you to make as many API calls as you wish. R's osrm package lets you interact with OSRM's API, including an instance running on a local server.
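library(raster) # Optional
library(sp) # makegrid() and SpatialPoints()
library(tidyverse) # pipes, rename(), gather(), rownames_to_column()
library(osrm) # osrmTable()
library(ggmap) # get_map() and ggmap()
devtools::install_github("cmartin/ggConvexHull") # Needed to quickly draw the contours
library(ggConvexHull) # geom_convexhull()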
I create a grid of 96 roughly equally distant points around the Brussels (Belgium) conurbation.
This grid does not take the Earth's curvature into consideration, which is negligible at the scale of city distances.
For convenience, I employ the raster package to download a ShapeFile of Belgium and extract the nodes for Brussels city.
BE <- raster::getData("GADM", country = "BEL", level = 1)
Bruxelles <- BE[BE$NAME_1 == "Bruxelles", ]
df_grid <- makegrid(Bruxelles, cellsize = 0.02) %>%
SpatialPoints() %>%
## I convert the SpatialPoints object into a simple data.frame
as.data.frame() %>%
## create a unique id for each point in the data.frame
rownames_to_column() %>%
## rename variables of the data.frame with more explanatory names.
rename(id = rowname, lat = x2, lon = x1)
## I point osrm.server to the OpenStreet docker running in my Linux machine. ...
### ... Do not run this if you are getting your data from OpenStreet public servers.
options(osrm.server = "")
## I obtain a list with distances (Origin Destination Matrix in ...
### ... minutes, origins and destinations)
Distance_Tables <- osrmTable(loc = df_grid)
OD_Matrix <- Distance_Tables$durations %>% ## subset the previous list
## convert the Origin Destination Matrix into a tibble
as_data_frame() %>%
rownames_to_column() %>%
## make sure we have an id column for the OD tibble
rename(origin_id = rowname) %>%
## transform the tibble into long/tidy format
gather(key = destination_id, value = distance_time, -origin_id) %>%
left_join(df_grid, by = c("origin_id" = "id")) %>%
## set origin coordinates
rename(origin_lon = lon, origin_lat = lat) %>%
left_join(df_grid, by = c("destination_id" = "id")) %>%
## set destination coordinates
rename(destination_lat = lat, destination_lon = lon)
## Obtain a nice looking road map of Brussels
Brux_map <- get_map(location = "bruxelles, belgique",
zoom = 11,
source = "google",
maptype = "roadmap")
ggmap(Brux_map) +
geom_point(aes(x = origin_lon, y = origin_lat),
data = OD_Matrix %>%
## Here I selected point_id 42 as the desired target, ...
## ... just because it is not far from the City Center.
filter(destination_id == 42),
size = 0.5) +
## Draw a diamond around point_id 42
geom_point(aes(x = origin_lon, y = origin_lat),
data = OD_Matrix %>%
filter(destination_id == 42, origin_id == 42),
shape = 5, size = 3) +
## Contour marking a travel time of up to 8 minutes
geom_convexhull(alpha = 0.2,
fill = "blue",
colour = "blue",
data = OD_Matrix %>%
filter(destination_id == 42,
distance_time <= 8),
aes(x = origin_lon, y = origin_lat)) +
## Contour marking a travel time of up to 15 minutes
geom_convexhull(alpha = 0.2,
fill = "red",
colour = "red",
data = OD_Matrix %>%
filter(destination_id == 42,
distance_time <= 15),
aes(x = origin_lon, y = origin_lat))
The blue contour represents travel times to the city center of up to 8 minutes.
The red contour represents travel times of up to 15 minutes.
I came up with an approach that is more practical than making numerous API calls.
The idea is to find the places you can reach within a certain time (see this thread). Traffic can be simulated by changing the departure time from morning to evening. You end up with an overlapping area that can be reached from both places.
Then you can use Nicolás's answer: map some points within that overlapping area and draw the heat map for the destinations you have. This way you have a smaller area (fewer points) to cover, so you make far fewer API calls (remember to use an appropriate time for that matter).
Below, I try to demonstrate what I mean and get you to the point where you can build the grid mentioned in the other answer to make your estimate more robust.
This shows how to map the intersected area.
appId <- "Travel.Time.ID"
apiKey <- "Travel.Time.API"
mapKey <- "Google.Map.ID"
locationK <- c(40, -73) #K
locationM <- c(40, -74) #M
CommuteTimeK <- (3 / 4) * 60 * 60
CommuteTimeM <- (0.55) * 60 * 60
url <- "http://api.traveltimeapp.com/v4/time-map"
requestBodyK <- paste0('{
  "departure_searches" : [
    {"id" : "test",
     "coords": {"lat":', locationK[1], ', "lng":', locationK[2],' },
     "transportation" : {"type" : "public_transport"} ,
     "travel_time" : ', CommuteTimeK, ',
     "departure_time" : "2018-06-27T13:00:00z"
    }
  ]
}')
requestBodyM <- paste0('{
  "departure_searches" : [
    {"id" : "test",
     "coords": {"lat":', locationM[1], ', "lng":', locationM[2],' },
     "transportation" : {"type" : "driving"} ,
     "travel_time" : ', CommuteTimeM, ',
     "departure_time" : "2018-06-27T13:00:00z"
    }
  ]
}')
resKi <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBodyK,
encode = "json")
resMi <- httr::POST(url = url,
httr::add_headers('Content-Type' = 'application/json'),
httr::add_headers('Accept' = 'application/json'),
httr::add_headers('X-Application-Id' = appId),
httr::add_headers('X-Api-Key' = apiKey),
body = requestBodyM,
encode = "json")
resK <- jsonlite::fromJSON(as.character(resKi))
resM <- jsonlite::fromJSON(as.character(resMi))
plK <- lapply(resK$results$shapes[[1]]$shell, function(x){
  googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
plM <- lapply(resM$results$shapes[[1]]$shell, function(x){
  googleway::encode_pl(lat = x[['lat']], lon = x[['lng']])
})
dfK <- data.frame(polyline = unlist(plK))
dfM <- data.frame(polyline = unlist(plM))
df_markerK <- data.frame(lat = locationK[1], lon = locationK[2], colour = "#green")
df_markerM <- data.frame(lat = locationM[1], lon = locationM[2], colour = "#lavender")
iconK <- "red"
df_markerK$icon <- iconK
iconM <- "blue"
df_markerM$icon <- iconM
google_map(key = mapKey) %>%
add_markers(data = df_markerK,
lat = "lat", lon = "lon",colour = "icon",
mouse_over = "K_K") %>%
add_markers(data = df_markerM,
lat = "lat", lon = "lon", colour = "icon",
mouse_over = "M_M") %>%
add_polygons(data = dfM, polyline = "polyline", stroke_colour = '#461B7E',
fill_colour = '#461B7E', fill_opacity = 0.6) %>%
add_polygons(data = dfK, polyline = "polyline",
stroke_colour = '#F70D1A',
fill_colour = '#FF2400', fill_opacity = 0.4)
You can extract the intersected area like this:
# install.packages(c("rgdal", "sp", "raster","rgeos","maptools"))
library(sp) # SpatialPointsDataFrame() and CRS()
library(raster) # bind() and intersect()
Kdata <- resK$results$shapes[[1]]$shell
Mdata <- resM$results$shapes[[1]]$shell
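## Reorder the columns so that longitude comes before latitude
xyfunc <- function(mydf) {
  xy <- mydf[,c(2,1)]
  return(xy)}
## Build a spatial points object from the coordinates
spdf <- function(xy, mydf){
  sp::SpatialPointsDataFrame(
    coords = xy, data = mydf,
    proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))}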
for (i in (1:length(Kdata))) {Kdata[[i]] <- xyfunc(Kdata[[i]])}
for (i in (1:length(Mdata))) {Mdata[[i]] <- xyfunc(Mdata[[i]])}
Kshp <- list(); for (i in (1:length(Kdata))) {Kshp[i] <- spdf(Kdata[[i]],Kdata[[i]])}
Mshp <- list(); for (i in (1:length(Mdata))) {Mshp[i] <- spdf(Mdata[[i]],Mdata[[i]])}
Kbind <- do.call(bind, Kshp)
Mbind <- do.call(bind, Mshp)
x <- intersect(Kbind,Mbind)
xdf <- data.frame(x)
xdf$icon <- "https://i.stack.imgur.com/z7NnE.png"
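## latmax, latmin, lngmax and lngmin are not defined above; here they are
## assumed to be the bounds of the intersected points in xdf.
latmax <- max(xdf$lat); latmin <- min(xdf$lat)
lngmax <- max(xdf$lng); lngmin <- min(xdf$lng)
google_map(key = mapKey,
           location = c(mean(c(latmax, latmin)), mean(c(lngmax, lngmin))), zoom = 8) %>%
  add_markers(data = xdf, lat = "lat", lon = "lng", marker_icon = "icon")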
This is just an illustration of the intersected area.
Now you can get the coordinates from the xdf data frame and construct your grid around those points to finally come up with a heat map. Out of respect for the other user who came up with that idea, I am not including it in my answer and am just referencing it:
Nicolás Velásquez - Obtaining an Origin-Destination Matrix between a Grid of (Roughly) Equally Distant Points
I have a cassandra table defined like below:
create table if not exists test(
id int,
readDate timestamp,
totalreadings text,
readings text,
PRIMARY KEY(id, readDate)
);
The readings column contains a map of all the data snapshots collected at regular intervals (30 minutes), along with the aggregated data for the full day.
The data looks like this:
id=8, readDate=Tue Dec 20 2016, totalreadings=220.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 21 2016, totalreadings=221.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 22 2016, totalreadings=219.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 23 2016, totalreadings=224.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
The Java POJO class looks like this:
public class Test {
    private int id;
    private Date readDate;
    private String totalreadings;
    private Map<Integer, Double> readings;
}
I am trying to find the average of each snapshot's readings over the last 4 days. So logically, I have a list of 4 Test objects for the last 4 days, and each of them has a map containing the readings across the intervals.
Is there a simple way to aggregate the same snapshot entries across the 4 days? For example, I want to aggregate specific data snapshots (1, 2, 3, 4, 5, 6, etc.) only, not the daily total.
After changing your table structure a little, the problem can be solved entirely in Cassandra. The main change is that I have put your readings into a map.
create table test(
id int,
readDate timestamp,
totalreadings float,
readings map<int,float>,
PRIMARY KEY(id, readDate)
);
Now I entered a bit of your data using CQL:
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-20', 220.0, {0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-21', 221.0,{0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
To extract single values from the map, I created a user-defined function (UDF). This UDF picks the right value out of the map containing the readings. See the Cassandra docs on UDFs for more details. Note that UDFs are disabled in Cassandra by default, so you need to modify cassandra.yaml to include enable_user_defined_functions: true
create function map_item(readings map<int,float>, idx int) called on null input returns float language java as ' return readings.get(idx);';
After creating the function you can calculate your average as
select avg(map_item(readings, 7)) from test where readDate > '2016-12-20' allow filtering;
which gives me:
system.avg(betterconnect.map_item(readings, 7))
You may want to supply the date for your where clause and the index (7 in my example) as parameters from your application.
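If the rows for the last 4 days are already loaded into the application, the per-snapshot averages can also be computed client-side from the Map<Integer, Double> readings of the POJO in the question. The following is a minimal sketch under that assumption; the class name SnapshotAverages and the method averagePerSnapshot are hypothetical, and it covers all snapshot indexes at once instead of one index per query.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SnapshotAverages {

    /**
     * Averages every snapshot index across the supplied daily reading maps.
     * A snapshot missing on some day is averaged over the days that have it.
     */
    public static Map<Integer, Double> averagePerSnapshot(List<Map<Integer, Double>> days) {
        Map<Integer, Double> sums = new TreeMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        for (Map<Integer, Double> readings : days) {
            for (Map.Entry<Integer, Double> e : readings.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
                counts.merge(e.getKey(), 1, Integer::sum);
            }
        }
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, Double> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
        }
        return averages;
    }

    public static void main(String[] args) {
        // Two days with two snapshots each, just to show the call shape.
        Map<Integer, Double> day1 = new HashMap<>();
        day1.put(0, 9.0);
        day1.put(1, 0.0);
        Map<Integer, Double> day2 = new HashMap<>();
        day2.put(0, 7.0);
        day2.put(1, 2.0);
        // prints {0=8.0, 1=1.0}
        System.out.println(averagePerSnapshot(Arrays.asList(day1, day2)));
    }
}
```

The trade-off is that this transfers every reading to the client, whereas the UDF approach above lets Cassandra return only the aggregated value.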
There are lots of good examples out there on how to read Microsoft Excel files into R with the XLConnect package, but I can't find any examples of how to read in an Excel file directly from a URL. The reproducible example below returns a "FileNotFoundException (Java)". But, I know the file exists because I can pull it up directly by pasting the URL into a browser.
fname <- "https://www.misoenergy.org/Library/Repository/Market%20Reports/20140610_sr_nd_is.xls"
sheet <- c("Sheet1")
data <- readWorksheetFromFile(fname, sheet, header=TRUE, startRow=11, startCol=2, endCol=13)
Although the URL is prefixed with "https:", it is a public file that does not require a username or password.
I have tried to download the file first using download.file(fname, destfile="test.xls") and got a message saying it was downloaded, but when I try to open it in Excel to check whether the download was successful, I get an Excel popup box that says "..found unreadable content in 'test.xls'".
Below are the specifics of my system:
Computer: 64-bit Dell running
Operating System: Windows 7 Professional
R version: R-3.1.0
Any assistance would be greatly appreciated.
You can use RCurl to download the file:
library(RCurl)
library(XLConnect)
appURL <- "https://www.misoenergy.org/Library/Repository/Market%20Reports/20140610_sr_nd_is.xls"
f = CFILE("exfile.xls", mode="wb")
## the file handle is stored in the ref slot of the CFILE object
curlPerform(url = appURL, writedata = f@ref, ssl.verifypeer = FALSE)
close(f) # flush and close the file before reading it
out <- readWorksheetFromFile(file = "exfile.xls", sheet = "Sheet1", header = TRUE
                             , startRow = 11, startCol = 2, endCol = 15, endRow = 35)
> head(out)
1 Hour 1 272 NA 768 1671 NA 148 200 -52 198 280 NA 700 4185
2 Hour 2 272 NA 769 1743 NA 598 200 -29 190 267 NA 706 4716
3 Hour 3 272 NA 769 1752 NA 598 200 -28 194 267 NA 710 4734
4 Hour 4 272 NA 769 1740 NA 598 200 -26 189 266 NA 714 4722
5 Hour 5 272 NA 769 1753 NA 554 200 -27 189 270 NA 713 4693
6 Hour 6 602 NA 769 1682 NA 218 200 -32 223 286 NA 714 4662
Two things:
Try using a different package--I know the gdata package's read.xls function has support for URLs
Try loading in a publicly-available xls file to make sure it's not an issue with the particular website.
For instance, you can try:
site <- "http://www.econ.yale.edu/~shiller/data/chapt26.xls"
data <- read.xls(site, header=FALSE, skip=8)
XLConnect does not support importing directly from URLs. You have to use e.g. download.file first to download the file to your local machine:
tmp = tempfile(fileext = ".xls")
download.file(url = "http://www.econ.yale.edu/~shiller/data/chapt26.xls", destfile = tmp)
readWorksheetFromFile(file = tmp, sheet = "Data", header = FALSE, startRow = 9, endRow = 151)
or with your originally proposed URL:
tmp = tempfile(fileext = ".xls")
download.file(url = "https://www.misoenergy.org/Library/Repository/Market%20Reports/20140610_sr_nd_is.xls", destfile = tmp, method = "curl")
readWorksheetFromFile(file = tmp, sheet = "Sheet1", header = TRUE, startRow = 11, startCol = 2, endCol = 13)
This will open a Firefox instance within R and ask you to download the file, which you could then open in the next line of code. I don't know of any R utilities that will open an excel spreadsheet from HTTPS.
You could then set a delay while you're saving the file and then read the sheet from your downloads folder:
sheet <- c("Sheet1")
data <- readWorksheetFromFile(path, sheet, header=TRUE, startRow=11, startCol=2, endCol=13)