Query to get last year's data missing Dec data - Java

I have a query to get the previous year's data, like:
vsqlstr := 'select name, enrollement_dt,case_name, dept, city, state, zip from enrollement where ';
vsqlstr :=vsqlstr ||' enrollement_dt <= to_date(''12''||(EXTRACT(YEAR FROM SYSDATE)-1), ''MMYYYY'') ';
The DB has the following records:
Name Enrollement_Dt Case_name dept city state zip
ABS 2019-AUG-16 TGH ENG NY NY 46378
BCD 2019-DEC-31 THG SCI VA VA 76534
EFG 2019-DEC-31 TGY HIS WA WA 96534
HIJ 2019-DEC-31 YUI MATH PA PA 56534
KLM 2020-JAN-12 RET MATH TN TN 56453
The above query returns the 1st record with enrollment date '2019-AUG-16' but not the other records. I want to get all records from 2019.

Your issue is that
to_date('12'||(EXTRACT(YEAR FROM SYSDATE)-1), 'MMYYYY')
returns (as of 2020) 2019-12-01, when you need 2019-12-31. You can work around that by specifying the day as well:
to_date('3112'||(EXTRACT(YEAR FROM SYSDATE)-1), 'DDMMYYYY')
Demo on dbfiddle
Another alternative would be to use TRUNC to get the first day of this year, and then subtract 1:
enrollement_dt <= TRUNC(SYSDATE, 'YEAR') - 1
or even simpler
enrollement_dt < TRUNC(SYSDATE, 'YEAR')
Demo on dbfiddle
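Since the query string is assembled in code (the question is tagged Java, although the vsqlstr snippet above looks like PL/SQL), a minimal sketch of the corrected filter might look like this; the column and table names are copied from the question and the rest of the query is assumed unchanged:
// Sketch only: the strict < against the first day of the current year picks up
// every row from the previous year, including all of Dec 31 regardless of its
// time component.
String vsqlstr = "select name, enrollement_dt, case_name, dept, city, state, zip"
        + " from enrollement where"
        + " enrollement_dt < TRUNC(SYSDATE, 'YEAR')";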

Related

Comparing rows with multiples keys that might or might not be present in 2 Excel sheets

I need to process 2 Excel files and compare them in order to get the records that are not present in both tables. Each record is identified by 4 components: the last 4 digits of an account, the card number, the authorization, and the amount of the transaction. The problem I'm facing is that while one of the files contains all of these components, the other one does not, leaving holes in the record, so the records are not guaranteed to be unique. I'll include an example:
File 1 with holes where I don't have data:
ACCOUNT  TC    AUTO   AMOUNT
456      0     98846  2302
0        4797  0      79268599
782      8415  38711  78636161
0        0     61658  199234
506      0     0      1029249
0        0     50724  33554244
782      0     12308  28755261
956      0     0      9392043
396      3238  0      2482527
614      4442  0      30846962
855      0     71143  61793724
File 2 with all the information:
ACCOUNT  TC    AUTO   AMOUNT
861      4797  19313  79268599
782      8415  38711  78636161
410      1022  61658  199234
506      6368  93040  1029249
785      2478  50724  33554244
782      4741  12308  28755261
956      2623  45025  9392043
396      3238  14196  2482527
614      4442  61568  30846962
855      5917  71143  61793724
As you can see, in this case I'd need to return this record:
ACCOUNT  TC    AUTO   AMOUNT
456      0     98846  2302
Is there an efficient way I can do this either in SQL or using Java data structures? It's important to note that values might be missing from either one of the files. I was thinking about uploading them to a temporary database, but that seems rather inefficient, and I wanted to use a Map that supports the query operation I want with possibly repeated values. Thank you for your help!
I tried the query below, but it isn't returning any information, and I'm not sure about creating a temporary database table. Is there a map implementation that supports values with several keys, some of them optional?
select e1.id
, e2.id
from excel1 e1
left join excel2 e2 on
(
e1.Cuenta = e2.Cuenta
OR
e1.TC = e2.TC
OR e1.auto = e2.auto
)
AND e1.monto = e2.monto
where e1.id not in
(
SELECT e1.id
FROM excel1 e1
WHERE id NOT IN
(
SELECT e1.id
FROM excel1 e1
LEFT JOIN excel2 e2 ON
(
e1.Cuenta = e2.Cuenta
OR
e1.TC = e2.TC
OR e1.auto = e2.auto
)
AND e1.monto = e2.monto
GROUP BY e1.monto
HAVING COUNT(*) = 1
)
)
AND e2.id =
(
SELECT e2.id
FROM excel1 e1
WHERE id NOT IN
(
SELECT e1.id
FROM excel1 e1
LEFT JOIN excel2 e2 ON
(
e1.Cuenta = e2.Cuenta
OR
e1.TC = e2.TC
OR
e1.auto = e2.auto
)
AND e1.monto = e2.monto
GROUP BY e1.monto
HAVING COUNT(*) = 1
)
)
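A rough sketch of the Java-data-structure route mentioned in the question could look like the following. This is an illustration under assumptions, not code from the original post: a record is an int[4] of {account, tc, auto, amount}, 0 stands for a missing value, the amount always has to match, and every other field is only compared when it is present.
import java.util.ArrayList;
import java.util.List;

public class CompareFiles {

    // A record is {account, tc, auto, amount}; 0 marks a missing value.
    // A file-1 record matches a file-2 record when the amounts are equal and every
    // non-zero field in file 1 equals the corresponding field in file 2.
    static boolean matches(int[] incomplete, int[] complete) {
        if (incomplete[3] != complete[3]) {
            return false;
        }
        for (int i = 0; i < 3; i++) {
            if (incomplete[i] != 0 && incomplete[i] != complete[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the file-1 records that have no compatible record in file 2.
    static List<int[]> unmatched(List<int[]> file1, List<int[]> file2) {
        List<int[]> result = new ArrayList<>();
        for (int[] r1 : file1) {
            boolean found = false;
            for (int[] r2 : file2) {
                if (matches(r1, r2)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                result.add(r1);
            }
        }
        return result;
    }
}
With the sample data above this reports only the 456 / 0 / 98846 / 2302 record. Because values can be missing from either file, you would run it in both directions (file 1 against file 2, then file 2 against file 1). The nested loop is quadratic; for larger files you could first index one file by amount in a Map<Integer, List<int[]>> and only compare records that share an amount, which is roughly the multi-key map idea from the question.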

Find max trips per day from DataFrame in Java - Spark

Environment: Java 1.8, VM Cloudera Quickstart.
I have data in Hadoop HDFS from a CSV file. Each row represents a bus route.
id vendor start_datetime end_datetime trip_duration_in_sec
17534 A 1/1/2013 12:00 1/1/2013 12:14 840
68346 A 1/1/2013 12:13 1/1/2013 12:18 300
09967 B 1/1/2013 12:34 1/1/2013 12:39 300
09967 B 1/1/2013 12:44 1/1/2013 12:51 420
09967 A 1/1/2013 12:54 1/1/2013 12:56 120
.........
.........
So, for every day, I want to find the hour in which each vendor (A and B) has the most bus routes, using Java and Spark.
A result could be:
1/1/2013 (Day 1) - Vendor A has 3 bus routes in the 12:00-13:00 hour. (In that hour, 12:00-13:00, vendor A had the most bus routes.)
1/1/2013 (Day 1) - Vendor B has 2 bus routes in the 12:00-13:00 hour. (In that hour, 12:00-13:00, vendor B had the most bus routes.)
....
My Java code is:
import org.apache.spark.sql.functions;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
Dataset<Row> ds;
ds.groupBy(functions.window(functions.col("start_datetime"), "1 hour")).count().show();
But I can't find which hour has the most routes per day.
I'm not so familiar with Java, so I explained it in Scala.
The key to finding the hour with the most routes per day per vendor is to count by (vendor, day, hour), then aggregate by (vendor, day) to find the hour with the maximum count in each group. The day and the hour of each record can be parsed from start_datetime.
val df = spark.createDataset(Seq(
("17534","A","1/1/2013 12:00","1/1/2013 12:14",840),
("68346","A","1/1/2013 12:13","1/1/2013 12:18",300),
("09967","B","1/1/2013 12:34","1/1/2013 12:39",300),
("09967","B","1/1/2013 12:44","1/1/2013 12:51",420),
("09967","A","1/1/2013 12:54","1/1/2013 12:56",120)
)).toDF("id","vendor","start_datetime","end_datetime","trip_duration_in_sec")
df.rdd.map(t => {
val vendor = t(1)
val day = t(2).toString.split(" ")(0)
val hour = t(2).toString.split(" ")(1).split(":")(0)
((vendor, day, hour), 1)
})
// count by key
.aggregateByKey(0)((x: Int, y: Int) =>x+y, (x: Int, y: Int) =>x+y)
.map(t => {
val ((vendor, day, hour), cnt) = t;
((vendor, day), (hour, cnt))
})
// solve the max cnt by key (vendor, day)
.foldByKey(("", 0))((z: (String, Int), i: (String, Int)) => if (i._2 > z._2) i else z)
.foreach(t => println(s"${t._1._2} - Vendor ${t._1._1} has ${t._2._2} bus routes from ${t._2._1}:00 hour."))
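For completeness, a rough Java sketch of the same idea using the DataFrame API (window functions instead of the RDD operations above). The column names follow the question, ds is the Dataset from the question, and the "d/M/yyyy HH:mm" pattern is an assumption about how the "1/1/2013 12:00" strings are formatted:
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.hour;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.to_date;
import static org.apache.spark.sql.functions.unix_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

Dataset<Row> ds; // the bus-route data, as in the question

// derive day and hour from start_datetime
Dataset<Row> withParts = ds
        .withColumn("ts", unix_timestamp(col("start_datetime"), "d/M/yyyy HH:mm").cast("timestamp"))
        .withColumn("day", to_date(col("ts")))
        .withColumn("hour", hour(col("ts")));

// count routes per (vendor, day, hour)
Dataset<Row> counts = withParts.groupBy("vendor", "day", "hour").count();

// for each (vendor, day), keep the hour(s) whose count is the maximum
WindowSpec perVendorDay = Window.partitionBy("vendor", "day");
Dataset<Row> best = counts
        .withColumn("maxCount", max(col("count")).over(perVendorDay))
        .filter(col("count").equalTo(col("maxCount")))
        .drop("maxCount");

best.show();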

Time series data aggregation per similar snapshots

I have a Cassandra table defined like below:
create table if not exists test(
id int,
readDate timestamp,
totalreadings text,
readings text,
PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
The readings column contains a map of all the data snapshots collected at regular intervals (30 minutes), along with aggregated data for the full day.
The data would like below :
id=8, readDate=Tue Dec 20 2016, totalreadings=220.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 21 2016, totalreadings=221.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 22 2016, totalreadings=219.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 23 2016, totalreadings=224.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
The Java POJO class looks like below:
public class Test{
private int id;
private Date readDate;
private String totalreadings;
private Map<Integer, Double> readings;
//setters
//getters
}
I am trying to find the aggregated average of all readings per snapshot over the last 4 days. So logically, I have 4 lists of Test objects for the last 4 days, and each of them has a map containing the readings across the intervals.
Is there a simple way to aggregate the matching snapshot entries across the 4 days? For example, I want to aggregate specific data snapshots (1, 2, 3, 4, 5, 6, etc.) only, not the total aggregate.
After changing your table structure a little bit, the problem can be solved completely in Cassandra. Mainly, I have put your readings into a map.
create table test(
id int,
readDate timestamp,
totalreadings float,
readings map<int,float>,
PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
Now I entered a bit of your data using CQL:
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-20', 220.0, {0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-21', 221.0,{0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
To extract single values out of the map I created a user-defined function (UDF). This UDF picks the right value out of your map containing the readings. See the Cassandra docs on UDFs for more. Note that UDFs are disabled in Cassandra by default, so you need to modify cassandra.yaml to include enable_user_defined_functions: true
create function map_item(readings map<int,float>, idx int) called on null input returns float language java as ' return readings.get(idx);';
After creating the function you can calculate your average as
select avg(map_item(readings, 7)) from test where readDate > '2016-12-20' allow filtering;
which gives me:
system.avg(betterconnect.map_item(readings, 7))
-------------------------------------------------
3
You may want to supply the date for your where-clause and the index (7 in my example) as parameters from your application.
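If you would rather keep the aggregation on the Java side and work directly with the Test objects from the question (for example the four objects you already load for the last 4 days), a minimal sketch could look like this. It assumes a getReadings() getter, as implied by the //getters comment in the POJO, and simply averages the value of each snapshot index across the days:
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Average each snapshot index (0..47) across a list of Test objects,
// e.g. the last 4 days. Indexes missing from a day's map are simply skipped.
static Map<Integer, Double> averagePerSnapshot(List<Test> lastDays) {
    Map<Integer, Double> sums = new HashMap<>();
    Map<Integer, Integer> counts = new HashMap<>();
    for (Test day : lastDays) {
        for (Map.Entry<Integer, Double> e : day.getReadings().entrySet()) {
            sums.merge(e.getKey(), e.getValue(), Double::sum);
            counts.merge(e.getKey(), 1, Integer::sum);
        }
    }
    Map<Integer, Double> averages = new HashMap<>();
    for (Map.Entry<Integer, Double> e : sums.entrySet()) {
        averages.put(e.getKey(), e.getValue() / counts.get(e.getKey()));
    }
    return averages;
}
Calling averagePerSnapshot on the list of the last 4 days and looking up key 7 would give the same kind of number as the avg(map_item(readings, 7)) query above, just computed in the application.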

What are JDBC-mysql Driver settings for sane handling of DATETIME and TIMESTAMP in UTC?

Having been burned by mysql timezone and Daylight Savings "hour from hell" issues in the past, I decided my next application would store everything in UTC timezone and only interact with the database using UTC times (not even the closely-related GMT).
I soon ran into some mysterious bugs. After pulling my hair out for a while, I came up with this test code:
try(Connection conn = dao.getDataSource().getConnection();
Statement stmt = conn.createStatement()) {
Instant now = Instant.now();
stmt.execute("set time_zone = '+00:00'");
stmt.execute("create temporary table some_times("
+ " dt datetime,"
+ " ts timestamp,"
+ " dt_string datetime,"
+ " ts_string timestamp,"
+ " dt_epoch datetime,"
+ " ts_epoch timestamp,"
+ " dt_auto datetime default current_timestamp(),"
+ " ts_auto timestamp default current_timestamp(),"
+ " dtc char(19) generated always as (cast(dt as character)),"
+ " tsc char(19) generated always as (cast(ts as character)),"
+ " dt_autoc char(19) generated always as (cast(dt_auto as character)),"
+ " ts_autoc char(19) generated always as (cast(ts_auto as character))"
+ ")");
PreparedStatement ps = conn.prepareStatement("insert into some_times "
+ "(dt, ts, dt_string, ts_string, dt_epoch, ts_epoch) values (?,?,?,?,from_unixtime(?),from_unixtime(?))");
DateTimeFormatter dbFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of("UTC"));
ps.setTimestamp(1, new Timestamp(now.toEpochMilli()));
ps.setTimestamp(2, new Timestamp(now.toEpochMilli()));
ps.setString(3, dbFormat.format(now));
ps.setString(4, dbFormat.format(now));
ps.setLong(5, now.getEpochSecond());
ps.setLong(6, now.getEpochSecond());
ps.executeUpdate();
ResultSet rs = stmt.executeQuery("select * from some_times");
ResultSetMetaData md = rs.getMetaData();
while(rs.next()) {
for(int c=1; c <= md.getColumnCount(); ++c) {
Instant inst1 = Instant.ofEpochMilli(rs.getTimestamp(c).getTime());
Instant inst2 = Instant.from(dbFormat.parse(rs.getString(c).replaceAll("\\.0$", "")));
System.out.println(inst1.getEpochSecond() - now.getEpochSecond());
System.out.println(inst2.getEpochSecond() - now.getEpochSecond());
}
}
}
Note how the session timezone is set to UTC, and everything in the Java code is very timezone-aware and forced to UTC. The only thing in this entire environment which is not UTC is the JVM's default timezone.
I expected the output to be a bunch of 0s but instead I get this
0
-28800
0
-28800
28800
0
28800
0
28800
0
28800
0
28800
0
28800
0
0
-28800
0
-28800
28800
0
28800
0
Each line of output is just subtracting the time stored from the time retrieved. The result in each row should be 0.
It seems the JDBC driver is performing inappropriate timezone conversions. For an application which interacts fully in UTC although it runs on a VM that's not in UTC, is there any way to completely disable the TZ conversions?
i.e. Can this test be made to output all-zero rows?
UPDATE
Using useLegacyDatetimeCode=false (cacheDefaultTimezone=false makes no difference) changes the output, but it's still not a fix:
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
-28800
0
0
0
0
0
0
0
0
UPDATE2
Checking the console (after changing the test to create a permanent table), I see all the values are STORED correctly:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 27148
Server version: 5.7.12-log MySQL Community Server (GPL)
Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> set time_zone = '-00:00';
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM some_times \G
*************************** 1. row ***************************
dt: 2016-11-18 15:39:51
ts: 2016-11-18 15:39:51
dt_string: 2016-11-18 15:39:51
ts_string: 2016-11-18 15:39:51
dt_epoch: 2016-11-18 15:39:51
ts_epoch: 2016-11-18 15:39:51
dt_auto: 2016-11-18 15:39:51
ts_auto: 2016-11-18 15:39:51
dtc: 2016-11-18 15:39:51
tsc: 2016-11-18 15:39:51
dt_autoc: 2016-11-18 15:39:51
ts_autoc: 2016-11-18 15:39:51
1 row in set (0.00 sec)
mysql>
The solution is to set JDBC connection parameter noDatetimeStringSync=true with useLegacyDatetimeCode=false. As a bonus I also found sessionVariables=time_zone='-00:00' alleviates the need to set time_zone explicitly on every new connection.
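For reference, a minimal sketch of a connection URL that combines those settings (host, port, database and credentials are placeholders, and the usual try-with-resources and exception handling are omitted):
// Placeholders only: substitute your own host, database and credentials.
String url = "jdbc:mysql://dbhost:3306/mydb"
        + "?useLegacyDatetimeCode=false"
        + "&noDatetimeStringSync=true"
        + "&sessionVariables=time_zone='-00:00'";
Connection conn = DriverManager.getConnection(url, "user", "password");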
There is some "intelligent" timezone conversion code that gets activated deep inside the ResultSet.getString() method when it detects that the column is a TIMESTAMP column.
Alas, this intelligent code has a bug: TimeUtil.fastTimestampCreate(TimeZone tz, int year, int month, int day, int hour, int minute, int seconds, int secondsPart) returns a Timestamp wrongly tagged to the JVM's default timezone, even when the tz parameter is set to something else:
final static Timestamp fastTimestampCreate(TimeZone tz, int year, int month, int day, int hour, int minute, int seconds, int secondsPart) {
Calendar cal = (tz == null) ? new GregorianCalendar() : new GregorianCalendar(tz);
cal.clear();
// why-oh-why is this different than java.util.date, in the year part, but it still keeps the silly '0' for the start month????
cal.set(year, month - 1, day, hour, minute, seconds);
long tsAsMillis = cal.getTimeInMillis();
Timestamp ts = new Timestamp(tsAsMillis);
ts.setNanos(secondsPart);
return ts;
}
The returned ts would be perfectly valid, except that further up the call chain it is converted back to a string using the bare toString() method, which renders the ts as a String representing what a clock would display in the JVM-default timezone, instead of a String representation of the time in UTC. In ResultSetImpl.getStringInternal(int columnIndex, boolean checkDateTypes):
case Types.TIMESTAMP:
Timestamp ts = getTimestampFromString(columnIndex, null, stringVal, this.getDefaultTimeZone(), false);
if (ts == null) {
this.wasNullFlag = true;
return null;
}
this.wasNullFlag = false;
return ts.toString();
Setting noDatetimeStringSync=true disables the entire parse/unparse mess and just returns the string value as-is received from the database.
Test output:
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
The useLegacyDatetimeCode=false is still important because it changes the behaviour of getDefaultTimeZone() to use the database server's TZ.
While chasing this down I also found the documentation for useJDBCCompliantTimezoneShift is incorrect, although it makes no difference: documentation says [This is part of the legacy date-time code, thus the property has an effect only when "useLegacyDatetimeCode=true."], but that's wrong, see ResultSetImpl.getNativeTimestampViaParseConversion(int, Calendar, TimeZone, boolean).

How to read the contents of (.bib) file format using Java

I need to read a .bib file and insert its tags into objects representing bib entries.
The file is big (almost 4000 lines), so my first question is what to use (BufferedReader or FileReader).
the general format is
@ARTICLE{orleans01DJ,
author = {Doug Orleans and Karl Lieberherr},
title = {{{DJ}: {Dynamic} Adaptive Programming in {Java}}},
journal = {Metalevel Architectures and Separation of Crosscutting Concerns 3rd
Int'l Conf. (Reflection 2001), {LNCS} 2192},
year = {2001},
pages = {73--80},
month = sep,
editor = {A. Yonezawa and S. Matsuoka},
owner = {Administrator},
publisher = {Springer-Verlag},
timestamp = {2009.03.09}
}
@ARTICLE{Ossher:1995:SOCR,
author = {Harold Ossher and Matthew Kaplan and William Harrison and Alexander
Katz},
title = {{Subject-Oriented Composition Rules}},
journal = {ACM SIG{\-}PLAN Notices},
year = {1995},
volume = {30},
pages = {235--250},
number = {10},
month = oct,
acknowledgement = {Nelson H. F. Beebe, University of Utah, Department of Mathematics,
110 LCB, 155 S 1400 E RM 233, Salt Lake City, UT 84112-0090, USA,
Tel: +1 801 581 5254, FAX: +1 801 581 4148, e-mail: \path|beebe@math.utah.edu|,
\path|beebe@acm.org|, \path|beebe@computer.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Fri Apr 30 12:33:10 MDT 1999},
coden = {SINODQ},
issn = {0362-1340},
keywords = {ACM; object-oriented programming systems; OOPSLA; programming languages;
SIGPLAN},
owner = {Administrator},
timestamp = {2009.02.26}
}
As you can see, some entries have fields that span more than one line; some lines end with }, some with },, and some with }},. Also, some fields have {..},{..} in the middle.
So I am a little bit confused about how to start reading this file and how to get these entries and manipulate them.
Any help will be highly appreciated.
We are currently discussing different options at JabRef.
These are the current options:
JBibTeX
ANTLRv3 Grammar
JabRef's BibtexParser.java
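Whichever parser you pick, the file itself is normally read with a BufferedReader wrapped around a FileReader (the BufferedReader adds the buffering; 4000 lines is small either way). If you want to start by hand, a very rough sketch that splits the file into raw entry strings by counting braces, before doing any real field parsing, could look like this; it assumes the entry type and its opening brace are on the same line and that braces inside the fields are balanced and unescaped:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class BibSplitter {

    // Collects each @ENTRY{...} block of the file as one raw string by tracking
    // the brace depth; real field parsing (or a library such as JBibTeX) comes after.
    public static List<String> readEntries(String path) throws IOException {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;
        int depth = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (current == null) {
                    if (!line.trim().startsWith("@")) {
                        continue;                  // text between entries
                    }
                    current = new StringBuilder(); // start of a new entry
                }
                current.append(line).append('\n');
                for (char c : line.toCharArray()) {
                    if (c == '{') depth++;
                    else if (c == '}') depth--;
                }
                if (depth == 0) {                  // braces balanced: entry is complete
                    entries.add(current.toString());
                    current = null;
                }
            }
        }
        return entries;
    }
}
Each returned string is one complete entry, so the trailing }, },, or }}, variations from the question no longer matter at that point; from there you can split out the entry type, the key and the field = {value} pairs yourself, or feed the text to one of the parsers listed above.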
