Dynamically creating a regex from a DateFormat

Dynamically creating a regex from a DateFormat - java

I need to detect some stuff within a String that contains, among other things, dates. Now, parsing dates using regex is a known question on SO.
However, the dates in this text are localized. And the app needs to be able to adapt to differently localized dates. Luckily, I can figure out the correct date format for the current locale using DateFormat.getDateInstance(SHORT, locale). I can get a date pattern from that. But how do I turn it into a regex, dynamically?
The regex would not need to do in-depth validation of the format (leap years, correct amount of days for a month etc.), I can already be sure that the data is provided in a valid format. The date just needs to be identified (as in, the regex should be able to detect the start and end index of where a date is).
The answers in the linked question all assume the handful of common date formats. But assuming that in this case is a likely cause of getting an edge case that breaks the app in a very non-obvious way. Which is why I'd prefer a dynamically generated regex over a one-fits-all(?) solution.
I can't use DateFormat.parse(...), since I have to actually detect the date first, and can't directly extract it.

Since you're doing getDateInstance(SHORT, locale), with emphasis on Date and SHORT, the patterns are fairly limited, so the following code will do:
public static String dateFormatToRegex(Locale locale) {
StringBuilder regex = new StringBuilder();
String fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.SHORT, locale)).toPattern();
for (Matcher m = Pattern.compile("[^a-zA-Z]+|([a-zA-Z])\\1*").matcher(fmt); m.find(); ) {
String part = m.group();
if (m.start(1) == -1) { // Not letter(s): Literal text
regex.append(Pattern.quote(part));
} else {
switch (part.charAt(0)) {
case 'G': // Era designator
regex.append("\\p{L}+");
break;
case 'y': // Year
regex.append("\\d{1,4}");
break;
case 'M': // Month in year
if (part.length() > 2)
throw new UnsupportedOperationException("Date format part: " + part);
regex.append("(?:1[0-2]|0?[1-9])");
break;
case 'd': // Day in month
regex.append("(?:3[01]|[12][0-9]|0?[1-9])");
break;
default:
throw new UnsupportedOperationException("Date format part: " + part);
}
}
}
return regex.toString();
}
To see what regex's you'll get for various locales:
Locale[] locales = Locale.getAvailableLocales();
Arrays.sort(locales, Comparator.comparing(Locale::toLanguageTag));
Map<String, List<String>> fmtLocales = new TreeMap<>();
for (Locale locale : locales) {
String fmt = ((SimpleDateFormat) DateFormat.getDateInstance(DateFormat.SHORT, locale)).toPattern();
fmtLocales.computeIfAbsent(fmt, k -> new ArrayList<>()).add(locale.toLanguageTag());
}
fmtLocales.forEach((k, v) -> System.out.println(dateFormatToRegex(Locale.forLanguageTag(v.get(0))) + " " + v));
Output
\p{L}+\d{1,4}\Q.\E(?:0[1-9]|1[0-2])\Q.\E(?:0[1-9]|[12][0-9]|3[01]) [ja-JP-u-ca-japanese-x-lvariant-JP]
(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01])\Q/\E\d{1,4} [brx, brx-IN, chr, chr-US, ee, ee-GH, ee-TG, en, en-AS, en-BI, en-GU, en-MH, en-MP, en-PR, en-UM, en-US, en-US-POSIX, en-VI, fil, fil-PH, ks, ks-IN, ug, ug-CN, zu, zu-ZA]
(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01])\Q/\E\d{1,4} [es-PA, es-PR]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [or, or-IN]
(?:0[1-9]|[12][0-9]|3[01])\Q. \E(?:0[1-9]|1[0-2])\Q. \E\d{1,4} [ksh, ksh-DE]
(?:0[1-9]|[12][0-9]|3[01])\Q. \E(?:0[1-9]|1[0-2])\Q. \E\d{1,4} [sl, sl-SI]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [fi, fi-FI, he, he-IL, is, is-IS]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [be, be-BY, dsb, dsb-DE, hsb, hsb-DE, sk, sk-SK, sq, sq-AL, sq-MK, sq-XK]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [bs-Cyrl, bs-Cyrl-BA, sr, sr-CS, sr-Cyrl, sr-Cyrl-BA, sr-Cyrl-ME, sr-Cyrl-RS, sr-Cyrl-XK, sr-Latn, sr-Latn-BA, sr-Latn-ME, sr-Latn-RS, sr-Latn-XK, sr-ME, sr-RS]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [tr, tr-CY, tr-TR]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q 'г'.\E [bg, bg-BG]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [agq, agq-CM, bas, bas-CM, bm, bm-ML, dje, dje-NE, dua, dua-CM, dyo, dyo-SN, en-HK, en-ZW, ewo, ewo-CM, ff, ff-CM, ff-GN, ff-MR, ff-SN, kab, kab-DZ, kea, kea-CV, khq, khq-ML, ksf, ksf-CM, ln, ln-AO, ln-CD, ln-CF, ln-CG, lo, lo-LA, lu, lu-CD, mfe, mfe-MU, mg, mg-MG, mua, mua-CM, nmg, nmg-CM, rn, rn-BI, seh, seh-MZ, ses, ses-ML, sg, sg-CF, shi, shi-Latn, shi-Latn-MA, shi-MA, shi-Tfng, shi-Tfng-MA, sw-CD, twq, twq-NE, yav, yav-CM, zgh, zgh-MA, zh-HK, zh-Hant-HK, zh-Hant-MO]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [ast, ast-ES, bn, bn-BD, bn-IN, ca, ca-AD, ca-ES, ca-ES-VALENCIA, ca-FR, ca-IT, el, el-CY, el-GR, en-AU, en-SG, es, es-419, es-AR, es-BO, es-BR, es-CR, es-CU, es-DO, es-EA, es-EC, es-ES, es-GQ, es-HN, es-IC, es-NI, es-PH, es-PY, es-SV, es-US, es-UY, es-VE, gu, gu-IN, ha, ha-GH, ha-NE, ha-NG, haw, haw-US, hi, hi-IN, km, km-KH, kn, kn-IN, ml, ml-IN, mr, mr-IN, pa, pa-Guru, pa-Guru-IN, pa-IN, pa-PK, ta, ta-IN, ta-LK, ta-MY, ta-SG, th, th-TH, to, to-TO, ur, ur-IN, ur-PK, zh-Hans-HK, zh-Hans-MO]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [th-TH-u-nu-thai-x-lvariant-TH]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [nus, nus-SS]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [en-NZ, es-CO, es-GT, es-PE, fr-BE, ms, ms-BN, ms-MY, ms-SG, nl-BE]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [sv-FI]
(?:0[1-9]|[12][0-9]|3[01])\Q-\E(?:0[1-9]|1[0-2])\Q-\E\d{1,4} [es-CL, fy, fy-NL, my, my-MM, nl, nl-AW, nl-BQ, nl-CW, nl-NL, nl-SR, nl-SX, rm, rm-CH, te, te-IN]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [mk, mk-MK]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [nb, nb-NO, nb-SJ, nn, nn-NO, nn-NO, no, no-NO, pl, pl-PL, ro, ro-MD, ro-RO, tk, tk-TM]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [hr, hr-BA, hr-HR]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4} [az, az-AZ, az-Cyrl, az-Cyrl-AZ, az-Latn, az-Latn-AZ, cs, cs-CZ, de, de-AT, de-BE, de-CH, de-DE, de-LI, de-LU, et, et-EE, fo, fo-DK, fo-FO, fr-CH, gsw, gsw-CH, gsw-FR, gsw-LI, hy, hy-AM, it-CH, ka, ka-GE, kk, kk-KZ, ky, ky-KG, lb, lb-LU, lv, lv-LV, os, os-GE, os-RU, ru, ru-BY, ru-KG, ru-KZ, ru-MD, ru-RU, ru-UA, uk, uk-UA]
(?:0[1-9]|[12][0-9]|3[01])\Q.\E(?:0[1-9]|1[0-2])\Q.\E\d{1,4}\Q.\E [bs, bs-BA, bs-Latn, bs-Latn-BA]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q \E\d{1,4} [kkj, kkj-CM]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [am, am-ET, asa, asa-TZ, bem, bem-ZM, bez, bez-TZ, cgg, cgg-UG, da, da-DK, da-GL, dav, dav-KE, ebu, ebu-KE, en-001, en-150, en-AG, en-AI, en-AT, en-BB, en-BM, en-BS, en-CC, en-CH, en-CK, en-CM, en-CX, en-CY, en-DE, en-DG, en-DK, en-DM, en-ER, en-FI, en-FJ, en-FK, en-FM, en-GB, en-GD, en-GG, en-GH, en-GI, en-GM, en-GY, en-IE, en-IL, en-IM, en-IO, en-JE, en-JM, en-KE, en-KI, en-KN, en-KY, en-LC, en-LR, en-LS, en-MG, en-MO, en-MS, en-MT, en-MU, en-MW, en-MY, en-NA, en-NF, en-NG, en-NL, en-NR, en-NU, en-PG, en-PH, en-PK, en-PN, en-PW, en-RW, en-SB, en-SC, en-SD, en-SH, en-SI, en-SL, en-SS, en-SX, en-SZ, en-TC, en-TK, en-TO, en-TT, en-TV, en-TZ, en-UG, en-VC, en-VG, en-VU, en-WS, en-ZM, fr, fr-BF, fr-BI, fr-BJ, fr-BL, fr-CD, fr-CF, fr-CG, fr-CI, fr-CM, fr-DJ, fr-DZ, fr-FR, fr-GA, fr-GF, fr-GN, fr-GP, fr-GQ, fr-HT, fr-KM, fr-LU, fr-MA, fr-MC, fr-MF, fr-MG, fr-ML, fr-MQ, fr-MR, fr-MU, fr-NC, fr-NE, fr-PF, fr-PM, fr-RE, fr-RW, fr-SC, fr-SN, fr-SY, fr-TD, fr-TG, fr-TN, fr-VU, fr-WF, fr-YT, ga, ga-IE, gd, gd-GB, guz, guz-KE, ig, ig-NG, jmc, jmc-TZ, kam, kam-KE, kde, kde-TZ, ki, ki-KE, kln, kln-KE, ksb, ksb-TZ, lag, lag-TZ, lg, lg-UG, luo, luo-KE, luy, luy-KE, mas, mas-KE, mas-TZ, mer, mer-KE, mgh, mgh-MZ, mt, mt-MT, naq, naq-NA, nd, nd-ZW, nyn, nyn-UG, pa-Arab, pa-Arab-PK, qu, qu-BO, qu-EC, qu-PE, rof, rof-TZ, rwk, rwk-TZ, saq, saq-KE, sbp, sbp-TZ, sn, sn-ZW, sw, sw-KE, sw-TZ, sw-UG, teo, teo-KE, teo-UG, tzm, tzm-MA, vai, vai-LR, vai-Latn, vai-Latn-LR, vai-Vaii, vai-Vaii-LR, vi, vi-VN, vun, vun-TZ, xog, xog-UG, yo, yo-BJ, yo-NG]
(?:0[1-9]|[12][0-9]|3[01])\Q/\E(?:0[1-9]|1[0-2])\Q/\E\d{1,4} [cy, cy-GB, en-BE, en-BW, en-BZ, en-IN, es-MX, fur, fur-IT, gl, gl-ES, id, id-ID, it, it-IT, it-SM, nnh, nnh-CM, om, om-ET, om-KE, pt, pt-AO, pt-BR, pt-CH, pt-CV, pt-GQ, pt-GW, pt-LU, pt-MO, pt-MZ, pt-PT, pt-ST, pt-TL, so, so-DJ, so-ET, so-KE, so-SO, ti, ti-ER, ti-ET, uz, uz-AF, uz-Cyrl, uz-Cyrl-UZ, uz-Latn, uz-Latn-UZ, uz-UZ, yi, yi-001, zh-Hans-SG, zh-SG]
(?:0[1-9]|[12][0-9]|3[01])\Q‏/\E(?:0[1-9]|1[0-2])\Q‏/\E\d{1,4} [ar, ar-001, ar-AE, ar-BH, ar-DJ, ar-DZ, ar-EG, ar-EH, ar-ER, ar-IL, ar-IQ, ar-JO, ar-KM, ar-KW, ar-LB, ar-LY, ar-MA, ar-MR, ar-OM, ar-PS, ar-QA, ar-SA, ar-SD, ar-SO, ar-SS, ar-SY, ar-TD, ar-TN, ar-YE]
\d{1,4}\Q-\E(?:0[1-9]|1[0-2])\Q-\E(?:0[1-9]|[12][0-9]|3[01]) [af, af-NA, af-ZA, as, as-IN, bo, bo-CN, bo-IN, br, br-FR, ce, ce-RU, ckb, ckb-IQ, ckb-IR, cu, cu-RU, dz, dz-BT, en-CA, en-SE, gv, gv-IM, ii, ii-CN, jgo, jgo-CM, kl, kl-GL, kok, kok-IN, kw, kw-GB, lkt, lkt-US, lrc, lrc-IQ, lrc-IR, lt, lt-LT, mgo, mgo-CM, mn, mn-MN, mzn, mzn-IR, ne, ne-IN, ne-NP, prg, prg-001, se, se-FI, se-NO, se-SE, si, si-LK, smn, smn-FI, sv, sv-AX, sv-SE, und, uz-Arab, uz-Arab-AF, vo, vo-001, wae, wae-CH]
\d{1,4}\Q. \E(?:0[1-9]|1[0-2])\Q. \E(?:0[1-9]|[12][0-9]|3[01])\Q.\E [hu, hu-HU]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [fa, fa-AF, fa-IR, ps, ps-AF, yue, yue-HK, zh, zh-CN, zh-Hans, zh-Hans-CN, zh-Hant, zh-Hant-TW, zh-TW]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [en-ZA, eu, eu-ES, ja, ja-JP]
\d{1,4}\Q-\E(?:0[1-9]|1[0-2])\Q-\E(?:0[1-9]|[12][0-9]|3[01]) [eo, eo-001, fr-CA, sr-BA]
\d{1,4}\Q. \E(?:0[1-9]|1[0-2])\Q. \E(?:0[1-9]|[12][0-9]|3[01])\Q.\E [ko, ko-KP, ko-KR]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [sah, sah-RU]
\d{1,4}\Q/\E(?:0[1-9]|1[0-2])\Q/\E(?:0[1-9]|[12][0-9]|3[01]) [ak, ak-GH, rw, rw-RW]

What you're asking is really complicated, but it's not impossible — just likely many hundreds of lines of code before you're done. I'm really not sure that this is the route you want to go — honestly, if you already know what format the date is in, you should probably just parse() it — but let's say for the sake of argument that you really do want to turn a date pattern like YYYY-mm-dd HH:mm:ss into a regular expression that can match dates in that format.
There are several steps in the solution: You'll need to lexically analyze the pattern; transform the tokens into correct regex pieces in the current locale; and then mash them all together to make a regex you can use. (Thankfully, you don't need to perform complex parsing on the date-pattern string; lexical analysis is good enough for this.)
Lexical analysis or tokenization is the act of breaking the input string into its component tokens, so that instead of an array of characters, it becomes a sequence of enumerated values or objects: So for the previous example, you'd end up with an array or list like this: [YYYY, Hyphen, mm, Hyphen, dd, Space, HH, Colon, mm, Colon, ss]. This kind of tokenization is often done with a big state machine, and you may be able to find some open-source code somewhere (part of the Android source code, maybe?) that already does it. If not, you'll have to read each letter, count up how many of that letter there is, and choose an appropriate enum value to add to the growing list of tokens.
Once you have the tokenized sequence of elements, the next step is to transform each into a chunk of a regular expression that is valid for the current localization. This is probably a giant switch statement inside a loop over the tokens, and thus would turn a YYYY enum value into the string piece "[0-9]{4}", or the mmm enum value into a big chunk of regex string that matches all of the month names in the current locale ("jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec"). This obviously involves you pulling all the data for the given locale, so that you can make regex chunks out of its words.
Finally, you can concatenate all of the regex bits together, wrapping each bit in parentheses to ensure precedence is correct, and then finally Pattern.compile() the whole string. Don't forget to make it use a case-insensitive test.
If you don't know what locale you're in, you'll have to do this many times to produce many regexes for each possible locale, and then test the input against each one of them in turn.
This is a project-and-a-half, but it is something that could be built, if you really really really need it to work exactly like you described.
But again, if I were you, I'd stick with something that already exists: If you already know what locale you're in (or even if you don't), the parse() method already not only does the lexical analysis and input-validation for you — and is not only already written! — but it also produces a usable date object, too!

I still think that parsing from each position in the string and seeing if it succeeds is simpler and easier than first generating a regular expression.
Locale loc = Locale.forLanguageTag("en-AS");
DateTimeFormatter dateFormatter
= DateTimeFormatter.ofLocalizedDate(FormatStyle.SHORT).withLocale(loc);
String mixed = "09/03/18Some data06/29/18Some other data04/27/18A third piece of data";
// Check that the string starts with a date
ParsePosition pos = new ParsePosition(0);
LocalDate.from(dateFormatter.parse(mixed, pos));
int dataStartIndex = pos.getIndex();
System.out.println("Date: " + mixed.substring(0, dataStartIndex));
int candidateDateStartIndex = dataStartIndex;
while (candidateDateStartIndex < mixed.length()) {
try {
pos.setIndex(candidateDateStartIndex);
LocalDate.from(dateFormatter.parse(mixed, pos));
// Date found
System.out.println("Data: "
+ mixed.substring(dataStartIndex, candidateDateStartIndex));
dataStartIndex = pos.getIndex();
System.out.println("Date: "
+ mixed.substring(candidateDateStartIndex, dataStartIndex));
candidateDateStartIndex = dataStartIndex;
} catch (DateTimeException dte) {
// No date here; try next
candidateDateStartIndex++;
pos.setErrorIndex(-1); // Clear error
}
}
System.out.println("Data: " + mixed.substring(dataStartIndex, mixed.length()));
The output from this snippet was:
Date: 09/03/18
Data: Some data
Date: 06/29/18
Data: Some other data
Date: 04/27/18
Data: A third piece of data
If you’re happy with the accepted answer, please don’t let me take that away from you. Only please allow me to demonstrate the alternative to anyone reading along.
Exactly because I am presenting this for a broader audience, I am using java.time, the modern Java date and time API. If your data was originally written with a DateFormat, you may want to substitute that class into the above code. I trust you to do that in that case.

Related

Convert language code three characters (ISO 639-2) to two-character code (ISO 639-1)

I'm developing an android app using Text-to-Speech (TTS) engine. TTS component return the list of available languages as list of Locale objects.
But both methods Locale::getLanguage and Locale::getISO3Language of each Locale object return the same 3-character code (ISO 639-2). Usually getLanguage() return the language code in 2-character format (ISO 639-1) but for a particular device the code is three characters. Same for country code. However I need to have the language and country code in two character format (ISO 639-1).
Someone know a way to make a conversion? Please note, I need a corresponding Locale object with both language and country codes in two letter format.

tl;dr
As a workaround, make your own Map< Locale , String> mapping each known Locale to its 2-letter language code per ISO 639-1.
new LocaleLookup().lookupTwoLetterLanguageCode( Locale.CANADA_FRENCH )
fr
Or maybe just parse the text of Locale::toString.
Locale
.CANADA_FRENCH
.toString() // fr_CA
.split( "_" ) // Array: { "fr" , "CA" }
[ 0 ] // Grab first element in array, "fr".
fr
For two-letter country code, use the second part of that split string. Use index of 1 instead of 0.
Locale
.CANADA_FRENCH
.toString() // fr_CA
.split( "_" ) // Array: { "fr" , "CA" }
[ 1 ] // Grab first element in array, "CA".
CA
Bug?
It seems to be a bug that Locale::getLanguage would return a 3-letter code. The Javadoc uses a 2-letter code in its code example. But unfortunately the Javadoc fails to specify explicitly 2 or 3 letters. I suggest you file a request with the OpenJDK project to clarify this Javadoc.
Workaround
As a workaround, perhaps you could call Locale.getISOLanguages to get an array of all known languages in 2-letter codes. Then loop those. For each, use the code seen in the Javadoc, passing 2-letter code to constrict a Locale object for comparison:
if (locale.getLanguage().equals(new Locale("he").getLanguage()))
From this build your own Map between locale and 2-letter code.
Example class
Here is my first stab at such a workaround map.
In the constructor, we get a list of all known locales, and all known 2-letter ISO 639-1 language codes.
Next we do a nested loop. For each locale, we loop all the 2-letter language codes until we find a match. Notice that we do not do a string match. The Javadoc warns us that the ISO 639 standard is not stable; the codes are changing. Quoting:
Note: ISO 639 is not a stable standard— some languages' codes have changed. Locale's constructor recognizes both the new and the old codes for the languages whose codes have changed, but this function always returns the old code. If you want to check for a specific language whose code has changed, don't do
if (locale.getLanguage().equals("he")) // BAD!
Instead, do
if (locale.getLanguage().equals(new Locale("he").getLanguage())) // GOOD.
So our inner loop looks at each known 2-letter language code, and gets a Locale object for that language. Then our if statement compares the output of getLanguage for (a) our outer loop’s Locale, and (b) our inner loop’s generated language-only Locale (generated by our 2-letter code). In your case, you claim some device is outputting 3-letter code value for our call to getLanguage. But whether 2 or 3 letters, does not matter. We are just looking for a match.
Once instantiated, we can ask our LocaleLookup instance for the two-letter code matching a particular Locale by calling the lookupTwoLetterLanguageCode method.
LocaleLookup localeLookup = new LocaleLookup();
Locale locale = Locale.CANADA_FRENCH;
String code = localeLookup.lookupTwoLetterLanguageCode( locale );
System.out.println( "Locale: " + locale.toString() + " " + locale.getDisplayName( Locale.getDefault() ) + " | ISO 639-1 code: " + code );
Locale: fr_CA French (Canada) | ISO 639-1 code: fr
I'm just guessing at all this. I have not thought it through, nor have I tested any of this. So buyer-beware, this solution is worth every penny you paid for it. Good luck.
Here is the entire class, with a public static void main to use as demonstration.
package work.basil.example;
import java.util.*;
public class LocaleLookup
{
private Map < Locale, String > mapLocaleToTwoLetterLangCode;
public LocaleLookup ( )
{
this.mapLocaleToTwoLetterLangCode = new HashMap <>( Locale.getAvailableLocales().length );
this.makeMaps();
System.out.println( "mapLocaleToTwoLetterLangCode = " + mapLocaleToTwoLetterLangCode );
}
private void makeMaps ( )
{
// Get all locales.
Set < Locale > locales = Set.of( Locale.getAvailableLocales() );
// Get all languages, per 2-letter code.
Set < String > twoLetterLanguageCodes = Set.of( Locale.getISOLanguages() ); // Returns: An array of ISO 639 two-letter language codes.
for ( Locale locale : locales )
{
for ( String twoLetterLanguageCode : twoLetterLanguageCodes )
{
if ( locale.getLanguage().equals( new Locale( twoLetterLanguageCode ).getLanguage() ) )
{
this.mapLocaleToTwoLetterLangCode.put( locale , twoLetterLanguageCode );
break;
}
}
}
// System.out.println( "locales = " + locales );
// System.out.println( "twoLetterLanguageCodes = " + twoLetterLanguageCodes );
}
public String lookupTwoLetterLanguageCode ( final Locale locale )
{
String code = this.mapLocaleToTwoLetterLangCode.get( locale );
Objects.requireNonNull( code );
return code;
}
public static void main ( String[] args )
{
LocaleLookup localeLookup = new LocaleLookup();
Locale locale = Locale.CANADA_FRENCH;
String code = localeLookup.lookupTwoLetterLanguageCode( locale );
System.out.println( "Locale: " + locale.toString() + " " + locale.getDisplayName( Locale.getDefault() ) + " | ISO 639-1 code: " + code );
}
}
And here is the map I produce in a pre-release version of Java 15. Note this may be incorrect, as I have seen some goofiness with locales in the pre-release version.
mapLocaleToTwoLetterLangCode = {nn=nn, ar_JO=ar, bg=bg, zu=zu, am_ET=am, fr_DZ=fr, ti_ET=ti, bo_CN=bo, qu_EC=qu, ta_SG=ta, lv=lv, en_NU=en, en_MS=en, zh_SG_#Hans=zh, ff_LR_#Adlm=ff, en_GG=en, en_JM=en, vo=vo, sd__#Arab=sd, sv_SE=sv, sr_ME=sr, dz_BT=dz, es_BO=es, en_ZM=en, fr_ML=fr, br=br, ha_NG=ha, fa_AF=fa, ar_SA=ar, sk=sk, os_GE=os, ml=ml, en_MT=en, en_LR=en, ar_TD=ar, en_GH=en, en_IL=en, sv=sv, cs=cs, el=el, af=af, ff_MR_#Latn=ff, sw_UG=sw, tk_TM=tk, sr_ME_#Cyrl=sr, ar_EG=ar, sd__#Deva=sd, ji_001=yi, yo_NG=yo, se_NO=se, ku=ku, sw_CD=sw, vo_001=vo, en_PW=en, pl_PL=pl, ff_MR_#Adlm=ff, it_VA=it, sr_CS=sr, ne_IN=ne, es_PH=es, es_ES=es, es_CO=es, bg_BG=bg, ji=yi, ar_EH=ar, bs_BA_#Latn=bs, en_VC=en, nb_SJ=nb, es_US=es, en_US_POSIX=en, en_150=en, ar_SD=ar, en_KN=en, ha_NE=ha, pt_MO=pt, ro_RO=ro, zh__#Hans=zh, lb_LU=lb, sr_ME_#Latn=sr, es_GT=es, so_KE=so, ff_LR_#Latn=ff, ff_GH_#Latn=ff, fr_PM=fr, ar_KM=ar, no_NO_NY=no, fr_MG=fr, es_CL=es, mn=mn, tr_TR=tr, eu=eu, fa_IR=fa, en_MO=en, wo=wo, en_BZ=en, sq_AL=sq, ar_MR=ar, es_DO=es, ru=ru, az=az, su__#Latn=su, fa=fa, kl_GL=kl, en_NR=en, nd=nd, kk=kk, en_MP=en, az__#Cyrl=az, en_GD=en, tk=tk, hy=hy, en_BW=en, en_AU=en, en_CY=en, ta_MY=ta, ti_ER=ti, en_RW=en, sv_FI=sv, nd_ZW=nd, lb=lb, ne=ne, su=su, zh_SG=zh, en_IE=en, ln_CD=ln, en_KI=en, om_ET=om, no=no, ja_JP=ja, my=my, ka=ka, ar_IL=ar, ff_GH_#Adlm=ff, or_IN=or, fr_MF=fr, ms_ID=ms, kl=kl, en_SZ=en, zh=zh, es_PE=es, ta=ta, az__#Latn=az, en_GB=en, zh_HK_#Hant=zh, ar_SY=ar, bo=bo, kk_KZ=kk, tt_RU=tt, es_PA=es, om_KE=om, ar_PS=ar, fr_VU=fr, en_AS=en, zh_TW=zh, sd_IN=sd, fr_MC=fr, kw=kw, fr_NE=fr, pt_MZ=pt, ur_IN=ur, ln=ln, en_JE=en, ln_CF=ln, en_CX=en, pt=pt, en_AT=en, gl=gl, sr__#Cyrl=sr, es_GQ=es, kn_IN=kn, ff__#Adlm=ff, ar_YE=ar, en_SX=en, to=to, ga=ga, qu=qu, ru_KZ=ru, en_TZ=en, et=et, en_PR=en, jv=jv, ko_KP=ko, in=in, sn=sn, ps=ps, nl_SR=nl, en_BS=en, km=km, fr_NC=fr, be=be, gv=gv, es=es, gd_GB=gd, nl_BQ=nl, ff_GN_#Adlm=ff, fr_CM=fr, uz_UZ_#Cyrl=uz, pa_IN_#Guru=pa, en_KE=en, ja=ja, fr_SN=fr, or=or, fr_MA=fr, pt_LU=pt, ff_GM_#Adlm=ff, fr_BL=fr, en_NL=en, ln_CG=ln, te=te, sl=sl, ha=ha, mr_IN=mr, ko_KR=ko, el_CY=el, ku_TR=ku, es_MX=es, es_HN=es, hu_HU=hu, ff_SN=ff, sq_MK=sq, sr_BA_#Cyrl=sr, fi=fi, bs__#Cyrl=bs, uz=uz, et_EE=et, sr__#Latn=sr, en_SS=en, bo_IN=bo, sw=sw, fy_NL=fy, ar_OM=ar, tr_CY=tr, rm=rm, fr_BI=fr, en_MG=en, uz_UZ_#Latn=uz, bn=bn, de_IT=de, kn=kn, fr_TN=fr, sr_RS=sr, bn_BD=bn, de_CH=de, fr_PF=fr, gu=gu, pt_GQ=pt, en_ZA=en, en_TV=en, lo=lo, fr_FR=fr, en_PN=en, fr_BJ=fr, en_MH=en, zh__#Hant=zh, zh_HK_#Hans=zh, cu_RU=cu, nl_NL=nl, en_GY=en, ps_AF=ps, bs__#Latn=bs, ky=ky, os=os, bs_BA_#Cyrl=bs, nl_CW=nl, ar_DZ=ar, sk_SK=sk, pt_CH=pt, fr_GQ=fr, xh=xh, ki_KE=ki, am=am, fr_CI=fr, en_NG=en, ia_001=ia, en_PK=en, zh_CN=zh, en_LC=en, rw=rw, ff_BF_#Adlm=ff, wo_SN=wo, gv_IM=gv, iw=iw, en_TT=en, mk_MK=mk, sl_SI=sl, fr_HT=fr, te_IN=te, nl_SX=nl, ce=ce, fr_CG=fr, xh_ZA=xh, fr_BE=fr, ff_NE_#Adlm=ff, es_VE=es, mt_MT=mt, mr=mr, mg=mg, ko=ko, en_BM=en, nb_NO=nb, ak=ak, dz=dz, vi_VN=vi, en_VU=en, ia=ia, en_US=en, ff_SL_#Latn=ff, to_TO=to, ff_SN_#Adlm=ff, fr_BF=fr, pa__#Guru=pa, it_SM=it, su_ID=su, fr_YT=fr, gu_IN=gu, ii_CN=ii, ff_CM_#Latn=ff, pa_PK_#Arab=pa, fr_RE=fr, fi_FI=fi, ca_FR=ca, sr_BA_#Latn=sr, bn_IN=bn, fr_GP=fr, pa=pa, tg=tg, fr_DJ=fr, rn=rn, uk_UA=uk, ks__#Arab=ks, hu=hu, fr_CH=fr, en_NF=en, ff_GW_#Adlm=ff, ha_GH=ha, sr_XK_#Cyrl=sr, bm=bm, ar_SS=ar, en_GU=en, nl_AW=nl, de_BE=de, en_AI=en, en_CM=en, cs_CZ=cs, ca_ES=ca, tr=tr, ff_GW_#Latn=ff, rm_CH=rm, ru_MD=ru, ms_MY=ms, ta_LK=ta, en_TO=en, ff_SN_#Latn=ff, ff_SL_#Adlm=ff, cy=cy, en_PG=en, fr_CF=fr, pt_TL=pt, sq=sq, tg_TJ=tg, fr=fr, en_ER=en, qu_PE=qu, sr_BA=sr, es_PY=es, de=de, es_EC=es, ff_CM_#Adlm=ff, lg_UG=lg, ff_NE_#Latn=ff, zu_ZA=zu, fr_TG=fr, su_ID_#Latn=su, sr_XK_#Latn=sr, en_PH=en, ig_NG=ig, fr_GN=fr, zh_MO_#Hans=zh, lg=lg, ru_RU=ru, se_FI=se, ff=ff, en_DM=en, en_CK=en, sd=sd, ar_MA=ar, ga_IE=ga, en_BI=en, en_AG=en, fr_TD=fr, fr_LU=fr, en_WS=en, fr_CD=fr, so=so, rn_BI=rn, en_NA=en, mi_NZ=mi, ar_ER=ar, ms=ms, sn_ZW=sn, iw_IL=iw, ug=ug, es_EA=es, ga_GB=ga, th_TH_TH_#u-nu-thai=th, hi=hi, fr_SC=fr, ca_IT=ca, ff_NG_#Latn=ff, en_SL=en, no_NO=no, ca_AD=ca, ff_NG_#Adlm=ff, zh_MO_#Hant=zh, en_SH=en, qu_BO=qu, vi=vi, sd_PK_#Arab=sd, fr_CA=fr, de_LU=de, sq_XK=sq, en_KY=en, mi=mi, mt=mt, it_CH=it, de_DE=de, si_LK=si, en_AE=en, en_DK=en, so_DJ=so, eo=eo, lt_LT=lt, it_IT=it, en_ZW=en, ar_SO=ar, ro=ro, en_UM=en, ps_PK=ps, eo_001=eo, ee=ee, fr_MU=fr, nn_NO=nn, se_SE=se, pl=pl, en_TK=en, en_SI=en, ur=ur, uz__#Arab=uz, pt_GW=pt, se=se, lo_LA=lo, af_ZA=af, ar_LB=ar, ms_SG=ms, ee_TG=ee, ln_AO=ln, be_BY=be, ff_GN=ff, in_ID=in, es_BZ=es, ar_AE=ar, hr_HR=hr, as=as, it=it, pt_CV=pt, ks_IN=ks, uk=uk, my_MM=my, mn_MN=mn, ur_PK=ur, en_FM=en, da_DK=da, es_PR=es, en_BE=en, ii=ii, fr_WF=fr, tt=tt, ru_BY=ru, fo_DK=fo, ee_GH=ee, en_SG=en, ar_BH=ar, ff_GM_#Latn=ff, om=om, en_CH=en, hi_IN=hi, fo_FO=fo, yo_BJ=yo, fr_KM=fr, fr_MQ=fr, ff_GN_#Latn=ff, en_SD=en, es_AR=es, ff__#Latn=ff, en_MY=en, ja_JP_JP_#u-ca-japanese=ja, es_SV=es, pt_BR=pt, ml_IN=ml, en_FK=en, uz__#Cyrl=uz, is_IS=is, hy_AM=hy, en_GM=en, en_DG=en, fo=fo, ne_NP=ne, pt_ST=pt, hr=hr, ak_GH=ak, lt=lt, uz_AF_#Arab=uz, ta_IN=ta, fr_GF=fr, en_SE=en, zh_CN_#Hans=zh, es_419=es, is=is, pt_AO=pt, si=si, en_001=en, jv_ID=jv, en=en, es_IC=es, fr_MR=fr, ca=ca, ru_KG=ru, ar_TN=ar, ks=ks, zh_TW_#Hant=zh, ff_BF_#Latn=ff, bm_ML=bm, kw_GB=kw, ug_CN=ug, as_IN=as, es_BR=es, zh_HK=zh, sw_KE=sw, en_SB=en, th_TH=th, rw_RW=rw, ar_IQ=ar, en_MW=en, mk=mk, en_IO=en, pa__#Arab=pa, en_DE=en, ar_QA=ar, en_CC=en, ro_MD=ro, en_FI=en, bs=bs, pt_PT=pt, fy=fy, az_AZ_#Cyrl=az, th=th, es_CU=es, ar=ar, en_SC=en, en_VI=en, eu_ES=eu, en_UG=en, en_NZ=en, es_UY=es, sg_CF=sg, ru_UA=ru, sg=sg, uz__#Latn=uz, el_GR=el, da_GL=da, en_FJ=en, de_LI=de, en_BB=en, km_KH=km, hr_BA=hr, de_AT=de, nl=nl, lu_CD=lu, ca_ES_VALENCIA=ca, ar_001=ar, so_SO=so, lv_LV=lv, sd_IN_#Deva=sd, es_CR=es, ar_KW=ar, fr_GA=fr, ar_LY=ar, sr=sr, sr_RS_#Cyrl=sr, en_MU=en, da=da, gl_ES=gl, az_AZ_#Latn=az, en_IM=en, en_LS=en, ig=ig, en_HK=en, en_GI=en, ce_RU=ce, gd=gd, en_CA=en, ka_GE=ka, fr_SY=fr, sw_TZ=sw, so_ET=so, fr_RW=fr, nl_BE=nl, ar_DJ=ar, mg_MG=mg, en_VG=en, cy_GB=cy, cu=cu, sr_RS_#Latn=sr, os_RU=os, en_TC=en, sv_AX=sv, ky_KG=ky, af_NA=af, lu=lu, en_IN=en, yo=yo, ki=ki, es_NI=es, nb=nb, sd_PK=sd, ti=ti, ms_BN=ms, br_FR=br}
Substring of Locale.toString?
Now, after having done all that work, I notice that the toString representation of the locale name starts with the two-letter language code! 🤦
If this always the case for all Locale objects, we can simply parse that string.
String twoLetterLanguageCode = Locale.CANADA_FRENCH.toString().split( "_" )[ 0 ];
twoLetterCode = fr
For country code, do the same, but pull the second part. Use an index value of 1 versus 0.
String twoLetterCountryCode = Locale.CANADA_FRENCH.toString().split( "_" )[ 1 ];
For this quick check on my pre-release Java 15, it does seem to be the case that every Locale object’s toString text starts with the 2-letter language code. But I do not know if you can count on that always being the case in the past and in the future.
System.out.println( Locale.getAvailableLocales().length );
ArrayList < Locale > problemLocales = new ArrayList <>( Locale.getAvailableLocales().length );
for ( Locale locale : Locale.getAvailableLocales() )
{
String parsed = locale.toString().split( "_" )[ 0 ];
if ( ! parsed.equalsIgnoreCase( locale.getLanguage() ) )
{
problemLocales.add( locale );
}
}
System.out.println( "problemLocales = " + problemLocales );
problemLocales = []
Or, vice-versa:
System.out.println( "Locale.getAvailableLocales().length: " + Locale.getAvailableLocales().length );
ArrayList < Locale > matchingLocales = new ArrayList <>( Locale.getAvailableLocales().length );
for ( Locale locale : Locale.getAvailableLocales() )
{
String parsed = locale.toString().split( "_" )[ 0 ];
if ( parsed.equalsIgnoreCase( locale.getLanguage() ) )
{
matchingLocales.add( locale );
}
}
System.out.println( "matchingLocales.size: " + matchingLocales.size() );
System.out.println( "matchingLocales = " + matchingLocales );
Locale.getAvailableLocales().length: 810
matchingLocales.size: 810

Time series data aggregation per similar snapshots

I have a cassandra table defined like below:
create table if not exists test(
id int,
readDate timestamp,
totalreadings text,
readings text,
PRIMARY KEY(meter_id, date)
) WITH CLUSTERING ORDER BY(date desc);
The reading contains the map of all snapshots of data collected at regular intervals (30 minutes) along with aggregated data for full day.
The data would like below :
id=8, readDate=Tue Dec 20 2016, totalreadings=220.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 21 2016, totalreadings=221.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 22 2016, totalreadings=219.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
id=8, readDate=Tue Dec 23 2016, totalreadings=224.0, readings={0=9.0, 1=0.0, 2=9.0, 3=5.0, 4=2.0, 5=7.0, 6=1.0, 7=3.0, 8=9.0, 9=2.0, 10=5.0, 11=1.0, 12=1.0, 13=2.0, 14=4.0, 15=4.0, 16=7.0, 17=7.0, 18=5.0, 19=4.0, 20=9.0, 21=6.0, 22=8.0, 23=4.0, 24=6.0, 25=3.0, 26=5.0, 27=7.0, 28=2.0, 29=0.0, 30=8.0, 31=9.0, 32=1.0, 33=8.0, 34=9.0, 35=2.0, 36=4.0, 37=5.0, 38=4.0, 39=7.0, 40=3.0, 41=2.0, 42=1.0, 43=2.0, 44=4.0, 45=5.0, 46=3.0, 47=1.0}]]
The java pojo classes look like below:
public class Test{
private int id;
private Date readDate;
private String totalreadings;
private Map<Integer, Double> readings;
//setters
//getters
}
I am trying to find last 4 days aggregated average of all reading per snapshot. So logically, i have 4 list for last 4 days Test object and each of them has a map containing reading across the intervals.
Is there a simple way to find aggregate of a similar snapshot entries across 4 days . For example , i want to aggregate specific data snapshots (1,2,3,4,5,6,etc) only not the total aggregate.

After changing you table-structure a little bit the problem can be solved completely in Cassandra. - Mainly I have put your readings into a map.
create table test(
id int,
readDate timestamp,
totalreadings float,
readings map<int,float>,
PRIMARY KEY(id, readDate)
) WITH CLUSTERING ORDER BY(readDate desc);
Now I entered a bit of your data using CQL:
insert into test (id,readDate,totalReadings, readings ) values (8 '2016-12-20', 220.0, {0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
insert into test (id,readDate,totalReadings, readings ) values (8, '2016-12-21', 221.0,{0:9.0, 1:0.0, 2:9.0, 3:5.0, 4:2.0, 5:7.0, 6:1.0, 7:3.0, 8:9.0, 9:2.0, 10:5.0, 11:1.0, 12:1.0, 13:2.0, 14:4.0, 15:4.0, 16:7.0, 17:7.0, 18:5.0, 19:4.0, 20:9.0, 21:6.0, 22:8.0, 23:4.0, 24:6.0, 25:3.0, 26:5.0, 27:7.0, 28:2.0, 29:0.0, 30:8.0, 31:9.0, 32:1.0, 33:8.0, 34:9.0, 35:2.0, 36:4.0, 37:5.0, 38:4.0, 39:7.0, 40:3.0, 41:2.0, 42:1.0, 43:2.0, 44:4.0, 45:5.0, 46:3.0, 47:1.0});
To extract single values out of the map I created a User defined function (UDF). This UDF picks the right value aut of your map containing the readings. See Cassandra docs on UDF for more on UDFs. Note that UDFs are disabled in cassandra by default so you need to modify cassandra.yaml to include enable_user_defined_functions: true
create function map_item(readings map<int,float>, idx int) called on null input returns float language java as ' return readings.get(idx);';
After creating the function you can calculate your average as
select avg(map_item(readings, 7)) from test where readDate > '2016-12-20' allow filtering;
which gives me:
system.avg(betterconnect.map_item(readings, 7))
-------------------------------------------------
3
You may want to supply the date fort your where-clause and the index (7 in my example) as parameters from your application.

finding features from a large data set by stanford corenlp

I am new Stanford NLP. I can not find any good and complete documentation or tutorial. My work is to do sentiment analysis. I have a very large dataset of product reviews. I already distinguished them by positive and negative according to "starts" given by the users. Now I need to find the most occurred positive and negative adjectives as the features of my algorithm. I understand how to do tokenzation, lemmatization and POS taging from here. I got files like this.
The review was
Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic
and the output was.
Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]
I already have processed 10000 positive and 10000 negative file like this. Now How can I easily find the most occurred positive and negative features(adjectives)? Do i need to read all the output(processed) file and make a list frequency count of the adjectives like this or is there any easy way by stanford corenlp?

Here is an example of processing an annotated review and storing the adjectives in a Counter.
In the example the movie review "The movie was great! It was a great film." has a sentiment of "positive".
I would suggest altering my code to load in each file and build an Annotation with the file's text and recording the sentiment for that file.
Then you can go through each file and build up a Counter with positive and negative counts for each adjective.
The final Counter has the adjective "great" with a count of 2.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;
public class AdjectiveSentimentExample {
public static void main(String[] args) throws Exception {
Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();
Annotation review = new Annotation("The movie was great! It was a great film.");
String sentiment = "positive";
Properties props = new Properties();
props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
pipeline.annotate(review);
for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
if (sentiment.equals("positive")) {
adjectivePositiveCounts.incrementCount(cl.word());
} else if (sentiment.equals("negative")) {
adjectiveNegativeCounts.incrementCount(cl.word());
}
}
}
}
System.out.println("---");
System.out.println("positive adjective counts");
System.out.println(adjectivePositiveCounts);
}
}

How to read the contents of (.bib) file format using Java

I need to read .bib file and insert it tags into an objects of bib-entries
the file is big (almost 4000 lines) , so my first question is what to use (bufferrReader or FileReader)
the general format is
#ARTICLE{orleans01DJ,
author = {Doug Orleans and Karl Lieberherr},
title = {{{DJ}: {Dynamic} Adaptive Programming in {Java}}},
journal = {Metalevel Architectures and Separation of Crosscutting Concerns 3rd
Int'l Conf. (Reflection 2001), {LNCS} 2192},
year = {2001},
pages = {73--80},
month = sep,
editor = {A. Yonezawa and S. Matsuoka},
owner = {Administrator},
publisher = {Springer-Verlag},
timestamp = {2009.03.09}
}
#ARTICLE{Ossher:1995:SOCR,
author = {Harold Ossher and Matthew Kaplan and William Harrison and Alexander
Katz},
title = {{Subject-Oriented Composition Rules}},
journal = {ACM SIG{\-}PLAN Notices},
year = {1995},
volume = {30},
pages = {235--250},
number = {10},
month = oct,
acknowledgement = {Nelson H. F. Beebe, University of Utah, Department of Mathematics,
110 LCB, 155 S 1400 E RM 233, Salt Lake City, UT 84112-0090, USA,
Tel: +1 801 581 5254, FAX: +1 801 581 4148, e-mail: \path|beebe#math.utah.edu|,
\path|beebe#acm.org|, \path|beebe#computer.org| (Internet), URL:
\path|http://www.math.utah.edu/~beebe/|},
bibdate = {Fri Apr 30 12:33:10 MDT 1999},
coden = {SINODQ},
issn = {0362-1340},
keywords = {ACM; object-oriented programming systems; OOPSLA; programming languages;
SIGPLAN},
owner = {Administrator},
timestamp = {2009.02.26}
}
As you can see , there are some entries that have more than line, entries that end with }
entries that end with }, or }},
Also , some entries have {..},{..}.. in the middle
so , i am a little bit confused on how to start reading this file and how to get these entries and manipulate them.
Any help will be highly appreciated.

We currently discuss different options at JabRef.
These are the current options:
JBibTeX
ANTLRv3 Grammar
JabRef's BibtexParser.java

TimeZone issue with ibatis ORM and postgres

Please suggest, I am retrieving data using ibatis ORM from postgres database.
When I fire the below query using ibatis the timestamp "PVS.SCHED_DTE" get reduced to "2013-06-07 23:30:00.0" where as in db timestamp is "2013-06-08 05:00:00.0".
Though when the same query is executed using postgresAdminII, its gives proper PVS.SCHED_DTE.
SELECT PVS.PTNT_VISIT_SCHED_NUM, PVS.PTNT_ID,P.FIRST_NAM AS PTNT_FIRST_NAM,P.PRIM_EMAIL_TXT,P.SCNDRY_EMAIL_TXT, PVS.PTNT_NOTIFIED_FLG,PVS.SCHED_DTE,PVS.VISIT_TYPE_NUM,PVS.DURTN_TYPE_NUM,PVS.ROUTE_TYPE_NUM, C.FIRST_NAM AS CARE_FIRST_NAM,C.MIDDLE_INIT_NAM AS CARE_MIDDLE_NAM,C.LAST_NAM AS CARE_LAST_NAM, C.PHONE_NUM_1,ORGNIZATN.EMAIL_HOST_SERVER_TXT,ORGNIZATN.PORT_NUM, ORGNIZATN.MAIL_SENDER_USER_ID,ORGNIZATN.MAIL_SENDER_PSWD_DESC,ORGNIZATN.SOCKET_FCTRY_PORT_DESC, ORGNIZATN.TIMEZONE_ID FROM IHCAS.PTNT_VISIT_SCHED PVS , IHCAS.PTNT P, IHCAS.ROUTE_LEG RL, IHCAS.ROUTE R, IHCAS.CAREGIVER C,IHCAS.ORG ORGNIZATN WHERE ORGNIZATN.ORG_NUM = ? AND PVS.SCHED_STAT_CDE = '2002' AND PVS.PTNT_NOTIFIED_FLG = FALSE AND **PVS.SCHED_DTE >= (CURRENT_DATE + interval '1 day') AT TIME ZONE ? at TIME ZONE 'UTC'** AND **PVS.SCHED_DTE < (CURRENT_DATE + interval '2 day' - interval '1 sec') AT TIME ZONE ? at TIME ZONE 'UTC'** AND PVS.PTNT_ID = P.PTNT_ID AND PVS.PTNT_VISIT_SCHED_NUM = RL.PTNT_VISIT_SCHED_NUM AND RL.ROUTE_NUM = R.ROUTE_NUM AND C.CAREGIVER_NUM=R.ASGN_TO_CAREGIVER_NUM AND PVS.ORG_NUM= ORGNIZATN.ORG_NUM ORDER BY PVS.SCHED_DTE ASC,PVS.ROUTE_TYPE_NUM ASC,PVS.ORG_NUM ASC
Parameters: **[1, EST, EST]**
Header: [PTNT_VISIT_SCHED_NUM, PTNT_NOTIFIED_FLG, **SCHED_DTE**, VISIT_TYPE_NUM, DURTN_TYPE_NUM, ROUTE_TYPE_NUM, PTNT_ID, PTNT_FIRST_NAM, PRIM_EMAIL_TXT, SCNDRY_EMAIL_TXT, CARE_FIRST_NAM, CARE_MIDDLE_NAM, CARE_LAST_NAM, PHONE_NUM_1, EMAIL_HOST_SERVER_TXT, PORT_NUM, MAIL_SENDER_USER_ID, MAIL_SENDER_PSWD_DESC, SOCKET_FCTRY_PORT_DESC, TIMEZONE_ID]
Result: [1129, false, **2013-06-07 23:30:00.0**, 1002, 1551, 1702, PID146, Ricky, Receiver.test#******.com, , Vikas, S, Jain, 956-**-5**8, 172.17.*.***, 25, Sender.test#******.com, ********123, 465, EST]
For me.. it seems to be some timezone issue but what and how to resolve.
N.B : Local Time zone : IST
Application Time zone : EST
In database, datatype of column is timestamp without timezone.
Any suggestion.. ??

It may be best to set the database timezone to UTC and implement an UTC-preserving MyBatis type handler as shown in the accepted answer to this:
What's the right way to handle UTC date-times using Java, iBatis, and Oracle?.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Dynamically creating a regex from a DateFormat - java

Related

Convert language code three characters (ISO 639-2) to two-character code (ISO 639-1)

Time series data aggregation per similar snapshots

finding features from a large data set by stanford corenlp

How to read the contents of (.bib) file format using Java

TimeZone issue with ibatis ORM and postgres

Categories

Resources