I need to detect some stuff within a String that contains, among other things, dates. Now, parsing dates using regex is a known question on SO.
However, the dates in this text are localized. And the app needs to be able to adapt to differently localized dates. Luckily, I can figure out the correct date format for the current locale using DateFormat.getDateInstance(SHORT, locale). I can get a date pattern from that. But how do I turn it into a regex, dynamically?
The regex would not need to do in-depth validation of the format (leap years, correct amount of days for a month etc.), I can already be sure that the data is provided in a valid format. The date just needs to be identified (as in, the regex should be able to detect the start and end index of where a date is).
The answers in the linked question all assume a handful of common date formats. Assuming that here is likely to produce an edge case that breaks the app in a very non-obvious way, which is why I'd prefer a dynamically generated regex over a one-size-fits-all solution.
I can't use DateFormat.parse(...), since I have to actually detect the date first, and can't directly extract it.
Since you're doing getDateInstance(SHORT, locale), with emphasis on Date and SHORT, the patterns are fairly limited, so the following code will do:
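import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String dateFormatToRegex(Locale locale) {
    StringBuilder regex = new StringBuilder();
    // Note: getDateInstance is not guaranteed to return a SimpleDateFormat, so this cast can fail in theory
    String fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.SHORT, locale)).toPattern();
    // Split the pattern into runs of non-letters (literal text) and runs of one repeated letter (a field)
    for (Matcher m = Pattern.compile("[^a-zA-Z]+|([a-zA-Z])\\1*").matcher(fmt); m.find(); ) {
        String part = m.group();
        if (m.start(1) == -1) { // Not letter(s): Literal text
            // Apostrophes of quoted literals (e.g. 'г' in the bg pattern) are kept verbatim here
            regex.append(Pattern.quote(part));
        } else {
            switch (part.charAt(0)) {
                case 'G': // Era designator
                    regex.append("\\p{L}+");
                    break;
                case 'y': // Year
                    regex.append("\\d{1,4}");
                    break;
                case 'M': // Month in year (numeric forms only)
                    if (part.length() > 2)
                        throw new UnsupportedOperationException("Date format part: " + part);
                    regex.append("(?:1[0-2]|0?[1-9])");
                    break;
                case 'd': // Day in month
                    regex.append("(?:3[01]|[12][0-9]|0?[1-9])");
                    break;
                default:
                    throw new UnsupportedOperationException("Date format part: " + part);
            }
        }
    }
    return regex.toString();
}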
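Since the question asks for the start and end index of each date, here is a minimal sketch of how such a regex could be applied. To stay independent of the locale data of any particular JDK, it hard-codes the regex that the method above would build for the pattern M/d/yy (an assumption for illustration); the class name DateSpanFinder and the method findDateSpans are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateSpanFinder {
    // Assumed: what dateFormatToRegex would produce for the pattern "M/d/yy"
    static final String M_D_YY_REGEX =
            "(?:1[0-2]|0?[1-9])\\Q/\\E(?:3[01]|[12][0-9]|0?[1-9])\\Q/\\E\\d{1,4}";

    // Returns {start, end} pairs (end exclusive) for every match of dateRegex in text
    static List<int[]> findDateSpans(String dateRegex, String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = Pattern.compile(dateRegex).matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "09/03/18Some data06/29/18Some other data";
        for (int[] span : findDateSpans(M_D_YY_REGEX, text)) {
            System.out.println(span[0] + ".." + span[1] + " " + text.substring(span[0], span[1]));
        }
        // prints:
        // 0..8 09/03/18
        // 17..25 06/29/18
    }
}
```

Keep in mind that data ending in a digit directly in front of a date is inherently ambiguous for a pure regex approach.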
To see which regexes you get for the various locales:
Locale[] locales = Locale.getAvailableLocales();
Arrays.sort(locales, Comparator.comparing(Locale::toLanguageTag));
Map<String, List<String>> fmtLocales = new TreeMap<>();
for (Locale locale : locales) {
    String fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.SHORT, locale)).toPattern();
    fmtLocales.computeIfAbsent(fmt, k -> new ArrayList<>()).add(locale.toLanguageTag());
}
fmtLocales.forEach((k, v) -> System.out.println(dateFormatToRegex(Locale.forLanguageTag(v.get(0))) + " " + v));
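Output (the month and day groups in this listing appear in the strict two-digit form, (?:0[1-9]|1[0-2]) and (?:0[1-9]|[12][0-9]|3[01]); the method as shown above emits the more lenient (?:1[0-2]|0?[1-9]) and (?:3[01]|[12][0-9]|0?[1-9]), which also accept single-digit values such as 9/3/18)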
\p{L}+\d{1,4}\Q.\E(?:0[1-9]|1[0-2])\Q.\E(?:0[1-9]|[12][0-9]|3[01]) [ja-JP-u-ca-japanese-x-lvariant-JP]
(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01])\Q/\E\d{1,4} [brx, brx-IN, chr, chr-US, ee, ee-GH, ee-TG, en, en-AS, en-BI, en-GU, en-MH, en-MP, en-PR, en-UM, en-US, en-US-POSIX, en-VI, fil, fil-PH, ks, ks-IN, ug, ug-CN, zu, zu-ZA]
(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01])\Q/\E\d{1,4} [es-PA, es-PR]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [or, or-IN]
(?:0[1-9]|[12][0-9]|3[01])\Q. \E(?:0[1-9]|1[0-2])\Q. \E\d{1,4} [ksh, ksh-DE]
(?:0[1-9]|[12][0-9]|3[01])\Q. \E(?:0[1-9]|1[0-2])\Q. \E\d{1,4} [sl, sl-SI]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [fi, fi-FI, he, he-IL, is, is-IS]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [be, be-BY, dsb, dsb-DE, hsb, hsb-DE, sk, sk-SK, sq, sq-AL, sq-MK, sq-XK]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [bs-Cyrl, bs-Cyrl-BA, sr, sr-CS, sr-Cyrl, sr-Cyrl-BA, sr-Cyrl-ME, sr-Cyrl-RS, sr-Cyrl-XK, sr-Latn, sr-Latn-BA, sr-Latn-ME, sr-Latn-RS, sr-Latn-XK, sr-ME, sr-RS]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [tr, tr-CY, tr-TR]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q 'г'.\E [bg, bg-BG]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [agq, agq-CM, bas, bas-CM, bm, bm-ML, dje, dje-NE, dua, dua-CM, dyo, dyo-SN, en-HK, en-ZW, ewo, ewo-CM, ff, ff-CM, ff-GN, ff-MR, ff-SN, kab, kab-DZ, kea, kea-CV, khq, khq-ML, ksf, ksf-CM, ln, ln-AO, ln-CD, ln-CF, ln-CG, lo, lo-LA, lu, lu-CD, mfe, mfe-MU, mg, mg-MG, mua, mua-CM, nmg, nmg-CM, rn, rn-BI, seh, seh-MZ, ses, ses-ML, sg, sg-CF, shi, shi-Latn, shi-Latn-MA, shi-MA, shi-Tfng, shi-Tfng-MA, sw-CD, twq, twq-NE, yav, yav-CM, zgh, zgh-MA, zh-HK, zh-Hant-HK, zh-Hant-MO]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [ast, ast-ES, bn, bn-BD, bn-IN, ca, ca-AD, ca-ES, ca-ES-VALENCIA, ca-FR, ca-IT, el, el-CY, el-GR, en-AU, en-SG, es, es-419, es-AR, es-BO, es-BR, es-CR, es-CU, es-DO, es-EA, es-EC, es-ES, es-GQ, es-HN, es-IC, es-NI, es-PH, es-PY, es-SV, es-US, es-UY, es-VE, gu, gu-IN, ha, ha-GH, ha-NE, ha-NG, haw, haw-US, hi, hi-IN, km, km-KH, kn, kn-IN, ml, ml-IN, mr, mr-IN, pa, pa-Guru, pa-Guru-IN, pa-IN, pa-PK, ta, ta-IN, ta-LK, ta-MY, ta-SG, th, th-TH, to, to-TO, ur, ur-IN, ur-PK, zh-Hans-HK, zh-Hans-MO]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [th-TH-u-nu-thai-x-lvariant-TH]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [nus, nus-SS]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [en-NZ, es-CO, es-GT, es-PE, fr-BE, ms, ms-BN, ms-MY, ms-SG, nl-BE]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [sv-FI]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [es-CL, fy, fy-NL, my, my-MM, nl, nl-AW, nl-BQ, nl-CW, nl-NL, nl-SR, nl-SX, rm, rm-CH, te, te-IN]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [mk, mk-MK]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [nb, nb-NO, nb-SJ, nn, nn-NO, nn-NO, no, no-NO, pl, pl-PL, ro, ro-MD, ro-RO, tk, tk-TM]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [hr, hr-BA, hr-HR]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [az, az-AZ, az-Cyrl, az-Cyrl-AZ, az-Latn, az-Latn-AZ, cs, cs-CZ, de, de-AT, de-BE, de-CH, de-DE, de-LI, de-LU, et, et-EE, fo, fo-DK, fo-FO, fr-CH, gsw, gsw-CH, gsw-FR, gsw-LI, hy, hy-AM, it-CH, ka, ka-GE, kk, kk-KZ, ky, ky-KG, lb, lb-LU, lv, lv-LV, os, os-GE, os-RU, ru, ru-BY, ru-KG, ru-KZ, ru-MD, ru-RU, ru-UA, uk, uk-UA]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [bs, bs-BA, bs-Latn, bs-Latn-BA]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q \E\d{1,4} [kkj, kkj-CM]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [am, am-ET, asa, asa-TZ, bem, bem-ZM, bez, bez-TZ, cgg, cgg-UG, da, da-DK, da-GL, dav, dav-KE, ebu, ebu-KE, en-001, en-150, en-AG, en-AI, en-AT, en-BB, en-BM, en-BS, en-CC, en-CH, en-CK, en-CM, en-CX, en-CY, en-DE, en-DG, en-DK, en-DM, en-ER, en-FI, en-FJ, en-FK, en-FM, en-GB, en-GD, en-GG, en-GH, en-GI, en-GM, en-GY, en-IE, en-IL, en-IM, en-IO, en-JE, en-JM, en-KE, en-KI, en-KN, en-KY, en-LC, en-LR, en-LS, en-MG, en-MO, en-MS, en-MT, en-MU, en-MW, en-MY, en-NA, en-NF, en-NG, en-NL, en-NR, en-NU, en-PG, en-PH, en-PK, en-PN, en-PW, en-RW, en-SB, en-SC, en-SD, en-SH, en-SI, en-SL, en-SS, en-SX, en-SZ, en-TC, en-TK, en-TO, en-TT, en-TV, en-TZ, en-UG, en-VC, en-VG, en-VU, en-WS, en-ZM, fr, fr-BF, fr-BI, fr-BJ, fr-BL, fr-CD, fr-CF, fr-CG, fr-CI, fr-CM, fr-DJ, fr-DZ, fr-FR, fr-GA, fr-GF, fr-GN, fr-GP, fr-GQ, fr-HT, fr-KM, fr-LU, fr-MA, fr-MC, fr-MF, fr-MG, fr-ML, fr-MQ, fr-MR, fr-MU, fr-NC, fr-NE, fr-PF, fr-PM, fr-RE, fr-RW, fr-SC, fr-SN, fr-SY, fr-TD, fr-TG, fr-TN, fr-VU, fr-WF, fr-YT, ga, ga-IE, gd, gd-GB, guz, guz-KE, ig, ig-NG, jmc, jmc-TZ, kam, kam-KE, kde, kde-TZ, ki, ki-KE, kln, kln-KE, ksb, ksb-TZ, lag, lag-TZ, lg, lg-UG, luo, luo-KE, luy, luy-KE, mas, mas-KE, mas-TZ, mer, mer-KE, mgh, mgh-MZ, mt, mt-MT, naq, naq-NA, nd, nd-ZW, nyn, nyn-UG, pa-Arab, pa-Arab-PK, qu, qu-BO, qu-EC, qu-PE, rof, rof-TZ, rwk, rwk-TZ, saq, saq-KE, sbp, sbp-TZ, sn, sn-ZW, sw, sw-KE, sw-TZ, sw-UG, teo, teo-KE, teo-UG, tzm, tzm-MA, vai, vai-LR, vai-Latn, vai-Latn-LR, vai-Vaii, vai-Vaii-LR, vi, vi-VN, vun, vun-TZ, xog, xog-UG, yo, yo-BJ, yo-NG]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [cy, cy-GB, en-BE, en-BW, en-BZ, en-IN, es-MX, fur, fur-IT, gl, gl-ES, id, id-ID, it, it-IT, it-SM, nnh, nnh-CM, om, om-ET, om-KE, pt, pt-AO, pt-BR, pt-CH, pt-CV, pt-GQ, pt-GW, pt-LU, pt-MO, pt-MZ, pt-PT, pt-ST, pt-TL, so, so-DJ, so-ET, so-KE, so-SO, ti, ti-ER, ti-ET, uz, uz-AF, uz-Cyrl, uz-Cyrl-UZ, uz-Latn, uz-Latn-UZ, uz-UZ, yi, yi-001, zh-Hans-SG, zh-SG]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [ar, ar-001, ar-AE, ar-BH, ar-DJ, ar-DZ, ar-EG, ar-EH, ar-ER, ar-IL, ar-IQ, ar-JO, ar-KM, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SD, ar-SO, ar-SS, ar-SY, ar-TD, ar-TN, ar-YE]
\d{1,4}\Q-\E(?:0[1-9]|1[0-2])\Q-\E(?:0[1-9]|[12][0-9]|3[01]) [af, af-NA, af-ZA, as, as-IN, bo, bo-CN, bo-IN, br, br-FR, ce, ce-RU, ckb, ckb-IQ, ckb-IR, cu, cu-RU, dz, dz-BT, en-CA, en-SE, gv, gv-IM, ii, ii-CN, jgo, jgo-CM, kl, kl-GL, kok, kok-IN, kw, kw-GB, lkt, lkt-US, lrc, lrc-IQ, lrc-IR, lt, lt-LT, mgo, mgo-CM, mn, mn-MN, mzn, mzn-IR, ne, ne-IN, ne-NP, prg, prg-001, se, se-FI, se-NO, se-SE, si, si-LK, smn, smn-FI, sv, sv-AX, sv-SE, und, uz-Arab, uz-Arab-AF, vo, vo-001, wae, wae-CH]
\d{1,4}\Q. \E(?:0[1-9]|1[0-2])\Q. \E(?:0[1-9]|[12][0-9]|3[01])\Q.\E [hu, hu-HU]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [fa, fa-AF, fa-IR, ps, ps-AF, yue, yue-HK, zh, zh-CN, zh-Hans, zh-Hans-CN, zh-Hant, zh-Hant-TW, zh-TW]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [en-ZA, eu, eu-ES, ja, ja-JP]
\d{1,4}\Q-\E(?:0[1-9]|1[0-2])\Q-\E(?:0[1-9]|[12][0-9]|3[01]) [eo, eo-001, fr-CA, sr-BA]
\d{1,4}\Q. \E(?:0[1-9]|1[0-2])\Q. \E(?:0[1-9]|[12][0-9]|3[01])\Q.\E [ko, ko-KP, ko-KR]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [sah, sah-RU]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [ak, ak-GH, rw, rw-RW]
What you're asking is really complicated, but it's not impossible; it is just likely to take many hundreds of lines of code before you're done. I'm really not sure that this is the route you want to go (honestly, if you already know what format the date is in, you should probably just parse() it), but let's say for the sake of argument that you really do want to turn a date pattern like yyyy-MM-dd HH:mm:ss into a regular expression that can match dates in that format.
There are several steps in the solution: You'll need to lexically analyze the pattern; transform the tokens into correct regex pieces in the current locale; and then mash them all together to make a regex you can use. (Thankfully, you don't need to perform complex parsing on the date-pattern string; lexical analysis is good enough for this.)
Lexical analysis, or tokenization, is the act of breaking the input string into its component tokens, so that instead of an array of characters it becomes a sequence of enumerated values or objects. For the previous example, you'd end up with an array or list like this: [yyyy, Hyphen, MM, Hyphen, dd, Space, HH, Colon, mm, Colon, ss]. This kind of tokenization is often done with a big state machine, and you may be able to find some open-source code somewhere (part of the Android source code, maybe?) that already does it. If not, you'll have to read each letter, count how many times that letter repeats, and choose an appropriate enum value to add to the growing list of tokens.
Once you have the tokenized sequence of elements, the next step is to transform each into a chunk of a regular expression that is valid for the current localization. This is probably a giant switch statement inside a loop over the tokens, and thus would turn a yyyy enum value into the string piece "[0-9]{4}", or the MMM enum value into a big chunk of regex string that matches all of the abbreviated month names in the current locale ("jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"). This obviously involves pulling all the data for the given locale, so that you can make regex chunks out of its words.
Finally, you can concatenate all of the regex bits together, wrapping each bit in parentheses to ensure precedence is correct, and then finally Pattern.compile() the whole string. Don't forget to make it use a case-insensitive test.
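The three steps described above can be sketched roughly as follows. This is a deliberately small illustration, not a full implementation: it only knows a handful of pattern letters, it ignores quoted literals in the pattern, and it takes the month names from java.text.DateFormatSymbols. The class name PatternRegexSketch and its method names are invented for this example.

```java
import java.text.DateFormatSymbols;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PatternRegexSketch {
    // Step 1: split the date pattern into runs of one repeated letter or runs of literal text
    static List<String> tokenize(String datePattern) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("([a-zA-Z])\\1*|[^a-zA-Z]+").matcher(datePattern);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Step 2: translate one token into a regex piece for the given locale
    static String toRegex(String token, Locale locale) {
        switch (token) {
            case "yyyy":
                return "[0-9]{4}";
            case "MM": case "dd": case "HH": case "mm": case "ss":
                return "[0-9]{2}";
            case "MMM": {
                List<String> names = new ArrayList<>();
                // getShortMonths() may contain an empty 13th entry; skip it
                for (String name : new DateFormatSymbols(locale).getShortMonths()) {
                    if (!name.isEmpty()) {
                        names.add(Pattern.quote(name));
                    }
                }
                return String.join("|", names);
            }
            default:
                if (token.matches("[a-zA-Z]+")) {
                    throw new UnsupportedOperationException("Pattern letters not handled: " + token);
                }
                return Pattern.quote(token); // literal text such as "-" or " "
        }
    }

    // Step 3: wrap each piece in a group, concatenate, and compile case-insensitively
    static Pattern compile(String datePattern, Locale locale) {
        StringBuilder regex = new StringBuilder();
        for (String token : tokenize(datePattern)) {
            regex.append("(?:").append(toRegex(token, locale)).append(')');
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    public static void main(String[] args) {
        Matcher m = compile("dd MMM yyyy", Locale.US).matcher("due 03 mar 2018 at noon");
        if (m.find()) {
            System.out.println(m.start() + ".." + m.end() + " " + m.group());
        }
    }
}
```

A real version would need many more cases (era, day names, 12-hour clocks, quoted text), which is where the hundreds of lines come from.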
If you don't know what locale you're in, you'll have to do this many times to produce one regex per possible locale, and then test the input against each of them in turn.
This is a project-and-a-half, but it is something that could be built, if you really really really need it to work exactly like you described.
But again, if I were you, I'd stick with something that already exists. If you already know what locale you're in (or even if you don't), the parse() method not only does the lexical analysis and input validation for you, and is already written, but also produces a usable date object.
I still think that parsing from each position in the string and seeing if it succeeds is simpler and easier than first generating a regular expression.
Locale loc = Locale.forLanguageTag("en-AS");
DateTimeFormatter dateFormatter
        = DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(loc);
String mixed = "09/03/18Some data06/29/18Some other data04/27/18A third piece of data";
// Check that the string starts with a date
ParsePosition pos = new ParsePosition(0);
LocalDate.from(dateFormatter.parse(mixed, pos));
int dataStartIndex = pos.getIndex();
System.out.println("Date: " + mixed.substring(0, dataStartIndex));
int candidateDateStartIndex = dataStartIndex;
while (candidateDateStartIndex < mixed.length()) {
    try {
        pos.setIndex(candidateDateStartIndex);
        LocalDate.from(dateFormatter.parse(mixed, pos));
        // Date found
        System.out.println("Data: "
                + mixed.substring(dataStartIndex, candidateDateStartIndex));
        dataStartIndex = pos.getIndex();
        System.out.println("Date: "
                + mixed.substring(candidateDateStartIndex, dataStartIndex));
        candidateDateStartIndex = dataStartIndex;
    } catch (DateTimeException dte) {
        // No date here; try next
        candidateDateStartIndex++;
        pos.setErrorIndex(-1); // Clear error
    }
}
System.out.println("Data: " + mixed.substring(dataStartIndex, mixed.length()));
The output from this snippet was:
Date: 09/03/18
Data: Some data
Date: 06/29/18
Data: Some other data
Date: 04/27/18
Data: A third piece of data
If you’re happy with the accepted answer, please don’t let me take that away from you. Only please allow me to demonstrate the alternative to anyone reading along.
Exactly because I am presenting this for a broader audience, I am using java.time, the modern Java date and time API. If your data was originally written with a DateFormat, you may want to substitute that class into the above code. I trust you to do that in that case.
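For readers who do need the DateFormat variant, a rough sketch of the same scan could look like the following. Unlike DateTimeFormatter, DateFormat.parse(String, ParsePosition) signals failure by returning null rather than throwing, so no try/catch is needed. The class name DateFormatScan and the method findDates are made up for this example, and the main method uses an explicit SimpleDateFormat pattern (an assumption) instead of a localized instance so that the result does not depend on the JDK's locale data.

```java
import java.text.DateFormat;
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DateFormatScan {
    // Returns {start, end} pairs (end exclusive) of every substring that format can parse as a date
    static List<int[]> findDates(DateFormat format, String text) {
        List<int[]> spans = new ArrayList<>();
        ParsePosition pos = new ParsePosition(0);
        int candidate = 0;
        while (candidate < text.length()) {
            pos.setIndex(candidate);
            pos.setErrorIndex(-1); // clear any error left by the previous attempt
            if (format.parse(text, pos) != null) {
                // Date found: parse() has advanced the index to just after the date
                spans.add(new int[] { candidate, pos.getIndex() });
                candidate = pos.getIndex();
            } else {
                candidate++; // no date here; try the next position
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumed pattern; DateFormat.getDateInstance(DateFormat.SHORT, locale) could be used instead
        DateFormat format = new SimpleDateFormat("MM/dd/yy", Locale.US);
        format.setLenient(false); // reject values such as month 13
        String mixed = "09/03/18Some data06/29/18Some other data";
        for (int[] span : findDates(format, mixed)) {
            System.out.println("Date: " + mixed.substring(span[0], span[1]));
        }
    }
}
```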
I have a Cassandra table defined as follows:
create table if not exists test(
    id int,
    readDate timestamp,
    totalreadings text,
    readings text,
    PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
The readings column contains a map of all the snapshots of data collected at regular intervals (30 minutes); totalreadings holds the aggregate for the full day.
The data looks like this:
id=8, readDate=Tue Dec 20 2016, totalreadings=220.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 21 2016, totalreadings=221.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 22 2016, totalreadings=219.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 23 2016, totalreadings=224.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
The Java POJO class looks like this:
public class Test {
    private int id;
    private Date readDate;
    private String totalreadings;
    private Map<Integer, Double> readings;
    //setters
    //getters
}
I am trying to find the average of each snapshot's readings over the last 4 days. So logically, I have a list of 4 Test objects, one per day, and each of them has a map containing the readings across the intervals.
Is there a simple way to aggregate the matching snapshot entries across the 4 days? For example, I want to aggregate specific snapshots (1, 2, 3, 4, 5, 6, etc.) individually, not the overall total.
After changing your table structure a little, the problem can be solved entirely in Cassandra. Mainly, I have put your readings into a map.
create table test(
    id int,
    readDate timestamp,
    totalreadings float,
    readings map<int,float>,
    PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
Now I entered a bit of your data using CQL:
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-20', 220.0, {0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-21', 221.0,{0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
To extract single values from the map, I created a user-defined function (UDF). This UDF picks the right value out of the map containing the readings. See the Cassandra docs on UDFs for more details. Note that UDFs are disabled in Cassandra by default, so you need to modify cassandra.yaml to include enable_user_defined_functions: true
create function map_item(readings map<int,float>, idx int) called on null input returns float language java as ' return readings.get(idx);';
After creating the function you can calculate your average as
select avg(map_item(readings, 7)) from test where readDate > '2016-12-20' allow filtering;
which gives me:
system.avg(betterconnect.map_item(readings, 7))
-------------------------------------------------
3
You may want to supply the date for your where-clause and the index (7 in my example) as parameters from your application.
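If the averaging has to happen in the application instead, for example because UDFs cannot be enabled, a minimal plain-Java sketch of the per-snapshot average could look like this. It assumes the readings of each day are already available as a Map<Integer, Double>, as in the POJO from the question; the class name SnapshotAverages and the method averagePerSnapshot are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class SnapshotAverages {
    // For each snapshot index, averages the values found under that index across all given days
    static Map<Integer, Double> averagePerSnapshot(List<Map<Integer, Double>> dailyReadings) {
        return dailyReadings.stream()
                .flatMap(day -> day.entrySet().stream())
                .collect(Collectors.groupingBy(
                        Map.Entry::getKey,
                        TreeMap::new, // keeps the snapshot indexes in ascending order
                        Collectors.averagingDouble(Map.Entry::getValue)));
    }

    public static void main(String[] args) {
        // Two shortened sample days, using only snapshot indexes 0 and 7
        List<Map<Integer, Double>> days = List.of(
                Map.of(0, 9.0, 7, 3.0),
                Map.of(0, 1.0, 7, 5.0));
        System.out.println(averagePerSnapshot(days)); // prints {0=5.0, 7=4.0}
    }
}
```

A snapshot index that is missing on some day is simply averaged over the days on which it is present.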