Environment: Java 1.8, Cloudera Quickstart VM.
I have data in Hadoop HDFS from a CSV file. Each row represents a bus route:
id vendor start_datetime end_datetime trip_duration_in_sec
17534 A 1/1/2013 12:00 1/1/2013 12:14 840
68346 A 1/1/2013 12:13 1/1/2013 12:18 300
09967 B 1/1/2013 12:34 1/1/2013 12:39 300
09967 B 1/1/2013 12:44 1/1/2013 12:51 420
09967 A 1/1/2013 12:54 1/1/2013 12:56 120
.........
.........
So, for every day, I want to find the hour in which each vendor (A and B) has the most bus routes, using Java and Spark.
A result could be:
1/1/2013 (Day 1) - Vendor A has 3 bus routes in the 12:00-13:00 hour. (In that window, 12:00-13:00, vendor A had the most bus routes.)
1/1/2013 (Day 1) - Vendor B has 2 bus routes in the 12:00-13:00 hour. (In that window, 12:00-13:00, vendor B had the most bus routes.)
....
My Java code is:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

Dataset<Row> ds;
ds.groupBy(functions.window(functions.col("start_datetime"), "1 hour")).count().show();
But I can't find the hour with the max routes per day.
I'm not so familiar with Java, so I'll explain it in Scala.
The key to finding the hour with the most routes per day per vendor is to count by (vendor, day, hour), then aggregate by (vendor, day) to pick the hour with the maximum count in each group. The day and the hour of each record can be parsed from start_datetime.
val df = spark.createDataset(Seq(
  ("17534","A","1/1/2013 12:00","1/1/2013 12:14",840),
  ("68346","A","1/1/2013 12:13","1/1/2013 12:18",300),
  ("09967","B","1/1/2013 12:34","1/1/2013 12:39",300),
  ("09967","B","1/1/2013 12:44","1/1/2013 12:51",420),
  ("09967","A","1/1/2013 12:54","1/1/2013 12:56",120)
)).toDF("id","vendor","start_datetime","end_datetime","trip_duration_in_sec")

df.rdd.map(t => {
    val vendor = t(1)
    val day = t(2).toString.split(" ")(0)
    val hour = t(2).toString.split(" ")(1).split(":")(0)
    ((vendor, day, hour), 1)
  })
  // count by key
  .aggregateByKey(0)((x: Int, y: Int) => x + y, (x: Int, y: Int) => x + y)
  .map(t => {
    val ((vendor, day, hour), cnt) = t
    ((vendor, day), (hour, cnt))
  })
  // solve the max cnt by key (vendor, day)
  .foldByKey(("", 0))((z: (String, Int), i: (String, Int)) => if (i._2 > z._2) i else z)
  .foreach(t => println(s"${t._1._2} - Vendor ${t._1._1} has ${t._2._2} bus routes from ${t._2._1}:00 hour."))
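If a Java version helps, here is a rough Dataset-based sketch of the same idea (count per (vendor, day, hour), then keep the hour with the highest count per (vendor, day)). This is only a sketch: it assumes Spark 2.2+ for to_timestamp, and the intermediate column names (start_ts, day, hour, rn) are mine, not from the question.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

// Parse day and hour out of start_datetime (format taken from the sample data).
Dataset<Row> parsed = ds
        .withColumn("start_ts", to_timestamp(col("start_datetime"), "d/M/yyyy HH:mm"))
        .withColumn("day", to_date(col("start_ts")))
        .withColumn("hour", hour(col("start_ts")));

// Count routes per (vendor, day, hour).
Dataset<Row> counts = parsed.groupBy("vendor", "day", "hour").count();

// Keep only the hour with the highest count for each (vendor, day).
WindowSpec byVendorDay = Window.partitionBy("vendor", "day").orderBy(col("count").desc());
counts.withColumn("rn", row_number().over(byVendorDay))
      .filter(col("rn").equalTo(1))
      .drop("rn")
      .show();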
I have a Cassandra table defined like below:
create table if not exists test(
    id int,
    readDate timestamp,
    totalreadings text,
    readings text,
    PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
The readings column contains a map of all snapshots of data collected at regular intervals (every 30 minutes), along with aggregated data for the full day.
The data looks like below:
id=8, readDate=Tue Dec 20 2016, totalreadings=220.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}
id=8, readDate=Tue Dec 21 2016, totalreadings=221.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}
id=8, readDate=Tue Dec 22 2016, totalreadings=219.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}
id=8, readDate=Tue Dec 23 2016, totalreadings=224.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}
The Java POJO class looks like below:
import java.util.Date;
import java.util.Map;

public class Test {
    private int id;
    private Date readDate;
    private String totalreadings;
    private Map<Integer, Double> readings;

    // getters and setters
}
I am trying to find the aggregated average of all readings per snapshot over the last 4 days. So logically, I have 4 Test objects for the last 4 days, and each of them has a map containing readings across the intervals.
Is there a simple way to aggregate similar snapshot entries across the 4 days? For example, I want to aggregate only specific data snapshots (1, 2, 3, 4, 5, 6, etc.), not the total aggregate.
After changing your table structure a little bit, the problem can be solved completely in Cassandra. Mainly, I have put your readings into a map.
create table test(
id int,
readDate timestamp,
totalreadings float,
readings map<int,float>,
PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
Now I entered a bit of your data using CQL:
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-20', 220.0, {0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-21', 221.0,{0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
To extract single values out of the map, I created a user-defined function (UDF). This UDF picks the right value out of your map containing the readings. See the Cassandra docs on UDFs for more details. Note that UDFs are disabled in Cassandra by default, so you need to modify cassandra.yaml to include enable_user_defined_functions: true
create function map_item(readings map<int,float>, idx int) called on null input returns float language java as ' return readings.get(idx);';
After creating the function you can calculate your average as
select avg(map_item(readings, 7)) from test where readDate > '2016-12-20' allow filtering;
which gives me:
system.avg(betterconnect.map_item(readings, 7))
-------------------------------------------------
3
You may want to supply the date for your where clause and the index (7 in my example) as parameters from your application.
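For instance, a minimal sketch of doing that from Java, assuming the DataStax Java driver 3.x; the contact point, class name and example values are placeholders, and the keyspace name betterconnect is taken from the output above:

import java.text.SimpleDateFormat;
import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class SnapshotAverage {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("betterconnect")) {

            // The snapshot index is spliced into the statement text; the date is bound as a parameter.
            int snapshotIndex = 7;
            PreparedStatement ps = session.prepare(
                    "select avg(map_item(readings, " + snapshotIndex + ")) "
                    + "from test where readDate > ? allow filtering");

            Date from = new SimpleDateFormat("yyyy-MM-dd").parse("2016-12-20");
            Row row = session.execute(ps.bind(from)).one();
            System.out.println(row.getFloat(0));
        }
    }
}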
Having been burned by MySQL timezone and Daylight Saving Time "hour from hell" issues in the past, I decided my next application would store everything in the UTC timezone and only interact with the database using UTC times (not even the closely related GMT).
I soon ran into some mysterious bugs. After pulling my hair out for a while, I came up with this test code:
try(Connection conn = dao.getDataSource().getConnection();
    Statement stmt = conn.createStatement()) {

    Instant now = Instant.now();

    stmt.execute("set time_zone = '+00:00'");
    stmt.execute("create temporary table some_times("
            + " dt datetime,"
            + " ts timestamp,"
            + " dt_string datetime,"
            + " ts_string timestamp,"
            + " dt_epoch datetime,"
            + " ts_epoch timestamp,"
            + " dt_auto datetime default current_timestamp(),"
            + " ts_auto timestamp default current_timestamp(),"
            + " dtc char(19) generated always as (cast(dt as character)),"
            + " tsc char(19) generated always as (cast(ts as character)),"
            + " dt_autoc char(19) generated always as (cast(dt_auto as character)),"
            + " ts_autoc char(19) generated always as (cast(ts_auto as character))"
            + ")");

    PreparedStatement ps = conn.prepareStatement("insert into some_times "
            + "(dt, ts, dt_string, ts_string, dt_epoch, ts_epoch) values (?,?,?,?,from_unixtime(?),from_unixtime(?))");

    DateTimeFormatter dbFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of("UTC"));

    ps.setTimestamp(1, new Timestamp(now.toEpochMilli()));
    ps.setTimestamp(2, new Timestamp(now.toEpochMilli()));
    ps.setString(3, dbFormat.format(now));
    ps.setString(4, dbFormat.format(now));
    ps.setLong(5, now.getEpochSecond());
    ps.setLong(6, now.getEpochSecond());
    ps.executeUpdate();

    ResultSet rs = stmt.executeQuery("select * from some_times");
    ResultSetMetaData md = rs.getMetaData();

    while(rs.next()) {
        for(int c=1; c <= md.getColumnCount(); ++c) {
            Instant inst1 = Instant.ofEpochMilli(rs.getTimestamp(c).getTime());
            Instant inst2 = Instant.from(dbFormat.parse(rs.getString(c).replaceAll("\\.0$", "")));
            System.out.println(inst1.getEpochSecond() - now.getEpochSecond());
            System.out.println(inst2.getEpochSecond() - now.getEpochSecond());
        }
    }
}
Note how the session timezone is set to UTC, and everything in the Java code is very timezone-aware and forced to UTC. The only thing in this entire environment which is not UTC is the JVM's default timezone.
I expected the output to be a bunch of 0s, but instead I get this:
0
-28800
0
-28800
28800
0
28800
0
28800
0
28800
0
28800
0
28800
0
0
-28800
0
-28800
28800
0
28800
0
Each line of output is just subtracting the time stored from the time retrieved. The result in each row should be 0.
It seems the JDBC driver is performing inappropriate timezone conversions. For an application which interacts fully in UTC although it runs on a VM that's not in UTC, is there any way to completely disable the TZ conversions?
i.e. Can this test be made to output all-zero rows?
UPDATE
Using useLegacyDatetimeCode=false (cacheDefaultTimezone=false makes no difference) changes the output, but it is still not a fix:
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
0
0
0
0
0
0
0
UPDATE2
Checking the console (after changing the test to create a permanent table), I see all the values are STORED correctly:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 27148
Server version: 5.7.12-log MySQL Community Server (GPL)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> set time_zone = '-00:00';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM some_times \G
*************************** 1. row ***************************
dt: 2016-11-18 15:39:51
ts: 2016-11-18 15:39:51
dt_string: 2016-11-18 15:39:51
ts_string: 2016-11-18 15:39:51
dt_epoch: 2016-11-18 15:39:51
ts_epoch: 2016-11-18 15:39:51
dt_auto: 2016-11-18 15:39:51
ts_auto: 2016-11-18 15:39:51
dtc: 2016-11-18 15:39:51
tsc: 2016-11-18 15:39:51
dt_autoc: 2016-11-18 15:39:51
ts_autoc: 2016-11-18 15:39:51
1 row in set (0.00 sec)
mysql>
The solution is to set the JDBC connection parameter noDatetimeStringSync=true together with useLegacyDatetimeCode=false. As a bonus, I also found that sessionVariables=time_zone='-00:00' removes the need to set time_zone explicitly on every new connection.
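For example, the resulting connection URL could look like this (host and schema name are placeholders):

// Hypothetical connection URL combining the parameters discussed above.
String url = "jdbc:mysql://localhost:3306/testdb"
        + "?useLegacyDatetimeCode=false"
        + "&noDatetimeStringSync=true"
        + "&sessionVariables=time_zone='-00:00'";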
There is some "intelligent" timezone conversion code that gets activated deep inside the ResultSet.getString() method when it detects that the column is a TIMESTAMP column.
Alas, this intelligent code has a bug: TimeUtil.fastTimestampCreate(TimeZone tz, int year, int month, int day, int hour, int minute, int seconds, int secondsPart) returns a Timestamp wrongly tagged to the JVM's default timezone, even when the tz parameter is set to something else:
final static Timestamp fastTimestampCreate(TimeZone tz, int year, int month, int day, int hour, int minute, int seconds, int secondsPart) {
    Calendar cal = (tz == null) ? new GregorianCalendar() : new GregorianCalendar(tz);
    cal.clear();

    // why-oh-why is this different than java.util.date, in the year part, but it still keeps the silly '0' for the start month????
    cal.set(year, month - 1, day, hour, minute, seconds);

    long tsAsMillis = cal.getTimeInMillis();
    Timestamp ts = new Timestamp(tsAsMillis);
    ts.setNanos(secondsPart);

    return ts;
}
The returned ts would be perfectly valid, except that further up the call chain it is converted back to a string using the bare toString() method, which renders the ts as a String representing what a clock would display in the JVM's default timezone, instead of a String representation of the time in UTC. In ResultSetImpl.getStringInternal(int columnIndex, boolean checkDateTypes):
case Types.TIMESTAMP:
    Timestamp ts = getTimestampFromString(columnIndex, null, stringVal, this.getDefaultTimeZone(), false);

    if (ts == null) {
        this.wasNullFlag = true;
        return null;
    }

    this.wasNullFlag = false;

    return ts.toString();
Setting noDatetimeStringSync=true disables the entire parse/unparse mess and just returns the string value as received from the database.
Test output:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
The useLegacyDatetimeCode=false is still important because it changes the behaviour of getDefaultTimeZone() to use the database server's TZ.
While chasing this down, I also found that the documentation for useJDBCCompliantTimezoneShift is incorrect, although it makes no difference here: the documentation says [This is part of the legacy date-time code, thus the property has an effect only when "useLegacyDatetimeCode=true."], but that's wrong; see ResultSetImpl.getNativeTimestampViaParseConversion(int, Calendar, TimeZone, boolean).
I need to read a .bib file and insert its tags into objects of bib entries.
The file is big (almost 4000 lines), so my first question is what to use (BufferedReader or FileReader).
The general format is:
@ARTICLE{orleans01DJ,
author = {Doug Orleans and Karl Lieberherr},
title = {{{DJ}: {Dynamic} Adaptive Programming in {Java}}},
journal = {Metalevel Architectures and Separation of Crosscutting Concerns 3rd
Int'l Conf. (Reflection 2001), {LNCS} 2192},
year = {2001},
pages = {73--80},
month = sep,
editor = {A. Yonezawa and S. Matsuoka},
owner = {Administrator},
publisher = {Springer-Verlag},
timestamp = {2009.03.09}
}
@ARTICLE{Ossher:1995:SOCR,
author = {Harold Ossher and Matthew Kaplan and William Harrison and Alexander
Katz},
title = {{Subject-Oriented Composition Rules}},
journal = {ACM SIG{\-}PLAN Notices},
year = {1995},
volume = {30},
pages = {235--250},
number = {10},
month = oct,
acknowledgement = {Nelson H. F. Beebe, University of Utah, Department of Mathematics,
110 LCB, 155 S 1400 E RM 233, Salt Lake City, UT 84112-0090, USA,
Tel: +1 801 581 5254, FAX: +1 801 581 4148, e-mail: \path|beebe@math.utah.edu|,
\path|beebe@acm.org|, \path|beebe@computer.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Fri Apr 30 12:33:10 MDT 1999},
coden = {SINODQ},
issn = {0362-1340},
keywords = {ACM; object-oriented programming systems; OOPSLA; programming languages;
SIGPLAN},
owner = {Administrator},
timestamp = {2009.02.26}
}
As you can see, there are some entries whose fields span more than one line; some lines end with }, others with }, or }},
and some fields have {..},{..} in the middle.
So I am a little bit confused about how to start reading this file, and how to extract these entries and manipulate them.
Any help will be highly appreciated.
We are currently discussing different options at JabRef.
These are the current options:
JBibTeX
ANTLRv3 Grammar
JabRef's BibtexParser.java
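For illustration, here is a minimal sketch using the first option, JBibTeX (assuming the org.jbibtex library is on the classpath; the file name is a placeholder). The parser consumes a whole Reader, so multi-line field values and nested braces are handled for you, and wrapping the file in a BufferedReader is perfectly fine for a file of ~4000 lines:

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

import org.jbibtex.BibTeXDatabase;
import org.jbibtex.BibTeXEntry;
import org.jbibtex.BibTeXParser;
import org.jbibtex.Key;
import org.jbibtex.Value;

public class BibReader {
    public static void main(String[] args) throws Exception {
        try (Reader reader = Files.newBufferedReader(Paths.get("library.bib"))) {
            BibTeXParser parser = new BibTeXParser();
            BibTeXDatabase database = parser.parse(reader);

            // Each entry is keyed by its citation key (e.g. orleans01DJ).
            for (Map.Entry<Key, BibTeXEntry> e : database.getEntries().entrySet()) {
                BibTeXEntry entry = e.getValue();
                Value title = entry.getField(BibTeXEntry.KEY_TITLE);
                System.out.println(e.getKey() + " -> "
                        + (title != null ? title.toUserString() : "(no title)"));
            }
        }
    }
}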