Regex'ers:
How can I construct a Java regex to match strings lexicographically <= a given date string?
For example, suppose the input is in YYYY-MM-DD format:
2014-01-20 MLK day
2007-04-14 'twas a very good day
2014-05-19 is today
1998-11-30 someone's birthday
I'd like the filter to return all lines before, say, Groundhog Day of this year, 2014-02-02;
so in the above list the regex would return all lines except today. (I don't want to convert the
dates to Epoch time; I'd like to just pass a Regex to a class that runs a map/reduce job so that
my input record reader can use the Regex as it constructs bundles to deliver to the mappers.)
TIA,
It's nearly impossible to do <=-type logic with regular expressions. You technically could, but you'd have to map out every possible scenario, and if you then wanted to change the date you are comparing against, the whole expression would change. Instead, I'd just match all the dates/values and then use a date parser to check whether each one is less than or equal to the target date. Here's an expression to get you started:
(\d{4}-\d{2}-\d{2})\s+(.*)
Then the date will be in capture group one. If it is <= Groundhog's day, then you have the value in capture group two.
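A minimal sketch of that match-then-compare approach, assuming the dates are zero-padded ISO dates (yyyy-MM-dd) and using java.time.LocalDate as the date parser. The class and method names (DateLineFilter, onOrBefore) are made up for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateLineFilter {
    // Same expression as above: group 1 is the date, group 2 the rest of the line.
    private static final Pattern LINE = Pattern.compile("(\\d{4}-\\d{2}-\\d{2})\\s+(.*)");

    /** Returns the text (group 2) of every line dated on or before the cutoff. */
    static List<String> onOrBefore(List<String> lines, LocalDate cutoff) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // line does not start with something date-shaped
            }
            try {
                LocalDate date = LocalDate.parse(m.group(1)); // ISO yyyy-MM-dd
                if (!date.isAfter(cutoff)) {
                    kept.add(m.group(2));
                }
            } catch (DateTimeParseException e) {
                // date-shaped but not a real date (e.g. 2014-13-45): skip it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2014-01-20 MLK day",
                "2007-04-14 'twas a very good day",
                "2014-05-19 is today",
                "1998-11-30 someone's birthday");
        System.out.println(onOrBefore(lines, LocalDate.of(2014, 2, 2)));
    }
}
```

Because the dates are fixed-width and zero-padded, comparing the captured strings with String.compareTo would give the same ordering without parsing; parsing additionally rejects strings that only look like dates.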
To show how complicated it is to do <= logic with a regular expression, I whipped together a quick expression to match numbers > 0 and <= 27.
^([1-9]|1[0-9]|2[0-7])$
As you can see, we pretty much need to map out each scenario. You can imagine how much more of a headache this would be with a date, and you couldn't just swap in "2014-02-02": you'd need to redo the majority of the expression.
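To make the "map out each scenario" point concrete for dates, here is a rough sketch that generates such an expression mechanically. It assumes a fixed yyyy-MM-dd layout at the start of each line and only compares characters, so it will happily accept non-dates such as 2014-00-99. The class name UpToDateRegex is hypothetical, and the whole expression has to be regenerated whenever the cutoff changes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class UpToDateRegex {
    /** Builds a regex for yyyy-MM-dd prefixes that sort at or before the bound. */
    static String upTo(String bound) {
        if (!bound.matches("\\d{4}-\\d{2}-\\d{2}")) {
            throw new IllegalArgumentException("expected yyyy-MM-dd: " + bound);
        }
        List<String> alternatives = new ArrayList<>();
        alternatives.add(bound); // case 1: exactly the bound
        for (int i = 0; i < bound.length(); i++) {
            char c = bound.charAt(i);
            if (c == '-' || c == '0') {
                continue; // the layout fixes '-', and no digit sorts below '0'
            }
            // Case 2..n: same prefix, a strictly smaller digit here, anything after.
            StringBuilder alt = new StringBuilder(bound.substring(0, i));
            alt.append("[0-").append((char) (c - 1)).append(']');
            for (int j = i + 1; j < bound.length(); j++) {
                alt.append(bound.charAt(j) == '-' ? "-" : "\\d");
            }
            alternatives.add(alt.toString());
        }
        return "^(?:" + String.join("|", alternatives) + ")\\b";
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile(upTo("2014-02-02"));
        String[] lines = {
            "2014-01-20 MLK day",
            "2007-04-14 'twas a very good day",
            "2014-05-19 is today",
            "1998-11-30 someone's birthday"
        };
        for (String line : lines) {
            System.out.println(p.matcher(line).find() + "  " + line);
        }
    }
}
```

Even for this one cutoff the generated expression has one alternative per non-zero digit plus one for the bound itself, which is why matching first and comparing afterwards is usually easier to maintain.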
Related
I want to parse some dates in Java, but the format is not defined and could be any of many (any ISO-8601 format, which is already a lot, Unix timestamps in any unit, and more).
Here are some samples:
1970-01-01T00:00:00.00Z
1234567890
1234567890000
1234567890000000
2021-09-20T17:27:00.000Z+02:00
Perfect parsing might be impossible because of ambiguous cases, but a solution that parses most common dates with some logic might be achievable (for example, timestamps are interpreted as seconds / milli / micro / nano, whichever gives a date close to the 2000 era; dates like '08/07/2021' could have a default for the month/day distinction).
I didn't find any easy way to do it in Java, while in Python it is more or less possible (not working on all my samples, but at least some of them) using the infer_datetime_format option of the pandas function to_datetime (https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html).
Is there an easy approach in Java?
Well, first of all, I agree with rzwitserloot here that date parsing in free format is extremely difficult and full of ambiguities. So you are skating on thin ice and will eventually run into trouble if you just assume that a user input will be correctly parsed the way you think it will.
Nevertheless, we could make it work if I assume any of the following:
You simply don't care if it will be parsed incorrectly; or
You are doing this for fun or for learning purposes; or
You have a banner, saying:
If the parsing goes wrong, it's your fault. Don't blame us.
Anyway, the DateTimeFormatterBuilder is able to build a DateTimeFormatter which could be able to parse a lot of different patterns. Since a formatter supports optional parsing, it could be instructed to try to parse a certain value, or skip that part if no valid value could be found.
For instance, this builder is able to parse a fairly wide range of ISO-like dates, with many optional parts:
DateTimeFormatterBuilder builder = new DateTimeFormatterBuilder()
    .appendPattern("uuuu-M-d")
    .optionalStart()
        .optionalStart().appendLiteral(' ').optionalEnd()
        .optionalStart().appendLiteral('T').optionalEnd()
        .appendValue(ChronoField.HOUR_OF_DAY)
        .optionalStart()
            .appendLiteral(':')
            .appendValue(ChronoField.MINUTE_OF_HOUR)
            .optionalStart()
                .appendLiteral(':')
                .appendValue(ChronoField.SECOND_OF_MINUTE)
                .optionalStart()
                    .appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
                .optionalEnd()
            .optionalEnd()
        .optionalEnd()
        .appendPattern("[XXXXX][XXXX][XXX][XX][X]")
    .optionalEnd();
DateTimeFormatter formatter = builder.toFormatter(Locale.ROOT);
All of the strings below can be successfully parsed by this formatter.
Stream.of(
    "2021-09-28",
    "2021-07-04T14",
    "2021-07-04T14:06",
    "2001-09-11 00:00:15",
    "1970-01-01T00:00:15.446-08:00",
    "2021-07-04T14:06:15.2017323Z",
    "2021-09-20T17:27:00.000+02:00"
).forEach(testcase -> System.out.println(formatter.parse(testcase)));
As you can see, with optionalStart() and optionalEnd() you can define optional portions of the format.
There are many more patterns you probably want to parse. You could add those patterns to the abovementioned builder. Alternatively, the appendOptional(DateTimeFormatter) method could be used to include multiple builders.
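As a rough illustration of the appendOptional(DateTimeFormatter) route, the sketch below chains two alternative date layouts into one formatter. The second pattern (dd/MM/uuuu, day first) is an assumption picked for the example, and the class name OptionalFormats is hypothetical:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

final class OptionalFormats {
    // Each appendOptional block is tried in turn and skipped if it does not match.
    static final DateTimeFormatter DATE = new DateTimeFormatterBuilder()
            .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)          // e.g. 2021-09-28
            .appendOptional(DateTimeFormatter.ofPattern("dd/MM/uuuu")) // e.g. 28/09/2021 (assumed day-first)
            .toFormatter(Locale.ROOT);

    public static void main(String[] args) {
        System.out.println(LocalDate.parse("2021-09-28", DATE));
        System.out.println(LocalDate.parse("28/09/2021", DATE));
    }
}
```

Note that every block is optional, so the alternatives should be distinct enough that only one of them can consume a given input.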
Perfect parsing might be impossible because of ambiguous cases, but a solution that parses most common dates with some logic might be achievable
Sure, and such wide-ranging guesswork should most definitely not be part of a standard java.* API. I think you're also wildly underestimating the ambiguity. 1234567890? It's just flat out incorrect to say that this can reasonably be parsed.
You are running into many, many problems here:
Java in general prefers throwing an error over guessing. This is inherent in the language: Java has few optional syntax constructs; semicolons aren't optional, () for method invocations is not optional, and Java intentionally does not have 'truthy/falsy', i.e. if (foo) is only valid if foo is an expression of the boolean type, unlike e.g. Python, where you can stick anything in there and there's a big list of what counts as falsy, with the rest being considered truthy. When in Rome, do as the Romans do: if this tenet annoys you, well, either learn to love it, begrudgingly accept it, or program in another language. This idea is endemic in the entire ecosystem. For what it is worth, given that debugging tends to take far longer than typing the optional constructs, Java is objectively correct, or at least making rational decisions, in being like this.
Either you can't bring in the notion that 'hey, this number is larger than 12, therefore it cannot possibly be the month', or you have to accept that whether a certain date format parses properly depends on whether the day-of-month value is above or below 12. I would strongly advocate that you avoid a library that fails this rule like the plague. What possible point is there, in the end? "My app will parse your date correctly, but only for about 3/5ths of all dates?" So, given that you can't/should not take that into account, 1234567890, is that seconds-since-1970? milliseconds-since-1970? Is that the 12th of the 34th month of the year 5678, the 90th hour, and assumed zeroes for minutes, seconds, and millis? If a library guesses, that library is wrong, because you should not guess unless you're 95%+ sure.
The obvious and perennial "do not guess" example is, of course, 101112. Is that November 10th, 2012 (European style)? Is that October 11th, 2012 (American style), or is that November 12th, 2010 (ISO style)? These are all reasonable guesses and therefore guessing is just wrong here. Do. Not. Guess. Unless you're really sure. Given that this is a somewhat common way to enter dates: guessing at all costs is objectively silly (see above), and guessing only when it's pretty clear and erroring out otherwise is mostly useless, given that ambiguity is so easy to introduce.
The concept of guessing may be defensible, but only with a lot more information. For example, if you give me the input '101112100000', there's no way it's correct to guess here. But if you also tell me that a human entered this input, and that human is clearly clued into, say, the German locale, then I can see the need to be able to turn that into '10th of November 2012, 10 o'clock in the morning': interpreting it as seconds or millis since some epoch is precluded by the human factor, and the day-month-year order is settled by the locale.
You asked:
Is there an easy approach in Java?
This entire question is incorrect. The "in Java" part needs to be stripped from this question, and then the answer is a simple: No. There is no simple way to parse strings into date/times without a lot more information than just the input string. If another library says it can do that, it is lying, or at least operating under a list of cultural and source assumptions as long as my leg, and you should not be using that library.
I don't know of any standard library with this functionality, but you can always use the DateTimeFormatter class and guess the format by looping over a list of predefined formats, or by using the ones provided by this class.
This is a typical approximation of what you want to achieve.
Here you can see an old implementation: https://balusc.omnifaces.org/2007/09/dateutil.html
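A minimal sketch of that loop, assuming date-only inputs. The candidate list and its order are assumptions chosen for the example (the order decides how ambiguous inputs such as 08/07/2021 are read), and the class name FormatGuesser is hypothetical:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class FormatGuesser {
    // Tried in order; the first formatter that accepts the whole input wins.
    private static final List<DateTimeFormatter> CANDIDATES = Arrays.asList(
            DateTimeFormatter.ISO_LOCAL_DATE,                        // 2021-09-28
            DateTimeFormatter.BASIC_ISO_DATE,                        // 20210928
            DateTimeFormatter.ofPattern("dd/MM/uuuu", Locale.ROOT),  // 28/09/2021 (day first by assumption)
            DateTimeFormatter.ofPattern("dd.MM.uuuu", Locale.ROOT)); // 28.09.2021

    /** Returns the first successful parse, or empty if no candidate matches. */
    static Optional<LocalDate> guess(String text) {
        for (DateTimeFormatter candidate : CANDIDATES) {
            try {
                return Optional.of(LocalDate.parse(text, candidate));
            } catch (DateTimeParseException e) {
                // not this format: fall through to the next candidate
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(guess("2021-09-28"));
        System.out.println(guess("08/07/2021"));
        System.out.println(guess("no date here"));
    }
}
```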
FTA (https://github.com/tsegall/fta) is designed to solve exactly this problem (among others). It currently parses thousands of formats and does not do it via a predefined set, so it typically runs extremely quickly. In this example we explicitly set the DateResolutionMode; however, it will default to something intelligent based on the Locale. Here is an example:
import java.util.Locale;

import com.cobber.fta.dates.DateTimeParser;
import com.cobber.fta.dates.DateTimeParser.DateResolutionMode;

public abstract class Simple {
    public static void main(final String[] args) {
        final String[] samples = { "1970-01-01T00:00:00.00Z", "2021-09-20T17:27:00.000Z+02:00", "08/07/2021" };
        final DateTimeParser dtp = new DateTimeParser().withDateResolutionMode(DateResolutionMode.MonthFirst).withLocale(Locale.ENGLISH);
        for (final String sample : samples)
            System.err.printf("Format is: '%s'%n", dtp.determineFormatString(sample));
    }
}
Which will give the following output:
Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSX'
Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSSX'
Format is: 'MM/dd/yyyy'
I had this problem of trying to identify whether a paragraph contains date information. So here are the issues:
We don't know where the date string might appear. A paragraph would be something like "We would like set the appointment at Nov. 15th. Then we would .....". So we cannot directly use DateTime.parse()
The format of the date is arbitrary, it can be more formal forms like "Nov. 15th" or "08/21/1988" or "5th in this month".
It would be unlikely to cover all the cases given that the date information can have various forms, I just want to cover as many cases as possible. The lightweight solution I can come up with would be regular expressions I guess.... And again that would be a huge expression. Does anyone know if there are better solutions or available regular expressions for this?
(P.S. I would prefer more lightweight approaches; methods like machine learning might be more general but are not applicable to my task here.)
I'd probably approach it with a regular expression (or multiple) as well.
I'd make the regular expression match regions that look date-like by matching everything around "th", "nd", "st", month/day names and abbreviations, dot/line/slash/colon-separated numbers, or such things. Experiment with that and see how well it finds dates with a ton of test cases.
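A rough starting point for such an expression, written in Java. It is a heuristic sketch for English text only: the three alternatives (month name plus day, separated numbers, bare ordinal) are assumptions about what "date-like" means, it will produce false positives, and the class name DateLikeFinder is hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateLikeFinder {
    private static final Pattern DATE_LIKE = Pattern.compile(
            // month name or abbreviation, optional dot, then a day such as "15th"
            "\\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\\.?\\s+\\d{1,2}(?:st|nd|rd|th)?\\b"
            // numbers separated by dots, dashes or slashes, e.g. 08/21/1988
            + "|\\b\\d{1,4}[./-]\\d{1,2}[./-]\\d{1,4}\\b"
            // a bare ordinal such as "5th"
            + "|\\b\\d{1,2}(?:st|nd|rd|th)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns every region of the text that looks date-like, in order of appearance. */
    static List<String> findDateLike(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DATE_LIKE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findDateLike(
                "We would like set the appointment at Nov. 15th. Then we would ....."));
    }
}
```

This only locates candidate regions; turning a hit such as "5th" into an actual date still needs the separate parsing step discussed next.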
Parsing the possible dates is another story. I guess you'd need something as powerful as PHP's strtotime.
Another approach is to just clearly define a big collection of possible formats. Then, when one is detected, you can easily parse it. Feels too brute-force for me, though.
As a starting point, there are seven pages of date regexes over at http://regexlib.com. If you don't know which one you're looking for, I would create an array and apply them one at a time. You'll still have a problem with dates like 11/12/2015 vs. 12/11/2015 so some kind of process for clarification is still necessary (e.g., automatically mail back and ask "Do you mean December 11 or November 12?").
I was having some trouble trying to convert times from 24-hour format to 12-hour format. Here are some examples of my times in string format:
0:00, 9:00, 12:00, 15:00
I wonder how I should substr the first two characters in JavaScript, because some of the hours have one digit and some have two. The output should be in 12-hour format, like:
12:00AM, 9:00AM, 12:00PM, 3:00PM
Any guides? Thanks in advance.
In comments you clarified that each string you process will have only a single time in it (i.e., you are not processing a single string with four comma-separated times in it). So essentially you have input as follows:
var input = "9:00";
The easiest way to extract the hour and minute is using the String .split() method. This splits up the string at a specified character - in your case you'd use ":" - and returns an array with the pieces:
var parts = input.split(":"),
    hour = parts[0],
    minute = parts[1];
The obvious answer would be to use regular expressions (but remember JWZ's rule: if you have a problem and decide it can be solved with REs, then you now have two problems).
However, save yourself a whole helluva lot of trouble and get moment.js
As far as I know, in Java I can get weekdays in normal (Friday) or short (Fri) form. But is there any way to obtain only the first letter?
I thought I could get the first letter using "substring", but that won't be correct for all languages. For example, the Spanish weekdays are Lunes, Martes, Miércoles, Jueves, Viernes, Sábado and Domingo, and the one-letter abbreviation for "Miércoles" is X instead of M, to differentiate it from "Martes".
In Android you can use SimpleDateFormat with "EEEEE". In the next example you can see it.
SimpleDateFormat formatLetterDay = new SimpleDateFormat("EEEEE",Locale.getDefault());
String letter = formatLetterDay.format(new Date());
EDIT: this is actually not entirely true. The result on Android could have more than a single letter (and may also be non-unique, if that matters), but this is what we have. Here's proof that you won't always get these characteristics on Android, going over all locales. It's written in Kotlin, but should work for Java too, of course:
val charCountStats = SparseIntArray()
Locale.getAvailableLocales().forEach { locale ->
    val sb = StringBuilder("$locale : ")
    val formatLetterDay = SimpleDateFormat("EEEEE", locale)
    for (day in 1..7) {
        val cal = Calendar.getInstance()
        cal.set(Calendar.DAY_OF_WEEK, day)
        val letter: String = formatLetterDay.format(cal.time)
        charCountStats.put(letter.length, charCountStats.get(letter.length, 0) + 1)
        sb.append(letter)
        if (day != 7)
            sb.append(',')
    }
    Log.d("AppLog", "$sb")
}
Log.d("AppLog", "stats:")
charCountStats.forEach { key, value ->
    Log.d("AppLog", "formatted days with $key characters:$value")
}
And the result is that in most cases it's indeed a single letter, but in many it's more, and it can even reach 8 characters (though it might be displayed as fewer letters, even one):
formatted days with 1 characters:4889
formatted days with 2 characters:471
formatted days with 3 characters:99
formatted days with 4 characters:58
formatted days with 5 characters:3
formatted days with 8 characters:3
An example of a locale whose result is displayed as 3 letters (and not just stored as 3 characters) is "wo" (the Wolof language); this is the result for each of its days of the week using the above formatting:
Dib,Alt,Tal,Àla,Alx,Àjj,Ase
As mentioned above there is no standard Java support for this. Using the formatting string "EEEEE" however is not guaranteed to work on all Android devices. The following code is guaranteed to work on any device:
public String firstLetterOfDayOfTheWeek(Date date) {
    Locale locale = Locale.getDefault();
    DateFormat weekdayNameFormat = new SimpleDateFormat("EEE", locale);
    String weekday = weekdayNameFormat.format(date);
    return String.valueOf(weekday.charAt(0));
}
There is no standard Java API support for doing that [1].
Part of the reason is that many (maybe even most) languages don't have conventional unique one-letter weekday abbreviations. English doesn't, for example (M T W T F S S).
A (hypothetical) formatting option that doesn't work [2] in many / most locales would be an impediment to internationalization rather than a help.
It has been pointed out that:
SimpleDateFormat formatLetterDay =
new SimpleDateFormat("EEEEE", Locale.getDefault());
String letter = formatLetterDay.format(new Date());
gives one letter abbreviations for later versions of Android (18 and above), though the javadocs do not mention this. It appears that this "5 letter" format has been borrowed from DateTimeFormatter whose javadoc says:
The count of pattern letters determines the format.
Text: The text style is determined based on the number of pattern letters used. Less than 4 pattern letters will use the short form. Exactly 4 pattern letters will use the full form. Exactly 5 pattern letters will use the narrow form. ...
If you are targeting Android API 26 or later, you should consider using the java.time.* classes rather than the legacy classes.
But either way, this isn't guaranteed to give you unique day letters.
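For the java.time route, a small sketch of what the narrow form looks like, using DayOfWeek.getDisplayName with TextStyle.NARROW. The class name NarrowWeekdays is hypothetical, and the exact letters depend on the locale data shipped with the runtime:

```java
import java.time.DayOfWeek;
import java.time.format.TextStyle;
import java.util.Locale;

final class NarrowWeekdays {
    /** Returns the locale's narrow form of the weekday, typically a single letter. */
    static String narrow(DayOfWeek day, Locale locale) {
        return day.getDisplayName(TextStyle.NARROW, locale);
    }

    public static void main(String[] args) {
        // In English, Tuesday and Thursday (and Saturday and Sunday) share a letter,
        // which is the non-uniqueness mentioned above.
        for (DayOfWeek day : DayOfWeek.values()) {
            System.out.println(day + " -> " + narrow(day, Locale.ENGLISH));
        }
    }
}
```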
1 - By "that" I mean mapping to unique 1-letter abbreviations.
2 - I mean it doesn't work in the human sense. You could invent a convention, but typical people wouldn't understand what the abbreviations meant; e.g. they wouldn't know that "X" meant "Miércoles", or in English that (say) "R" meant "Thursday" (see https://stackoverflow.com/a/21049169/139985).
I realize the OP was asking for standards across languages, and this does not address it. But there is/was a standard for using single character Day of Week abbreviation.
Back in mainframe days, using a 1-character abbreviation for Day of Week was common, either to store day of week in 1 character field (to save precious space), or have a report heading for single-character column. The "standard" was to use MTWRFSU, where R was for Thursday, and U for Sunday.
I could not find any definitive references to this (which is why I quoted "standard"), but here are a couple of examples:
http://eventguide.com/topics/one_digit_day_abbreviations.html
http://www.registrar.ucla.edu/soc/definitions.htm#Anchor-Days-3800
I think there's no direct java function to get the first letter and no standard way to do it.
You can refer to this question to obtain the first letter of the day string using the substring() method:
Given a string in Java, just take the first X letters
DateFormatSymbols.getWeekdays with a width of NARROW will give you the first letter of each week day. It works for every language. However, it requires API 24.
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.N) {
    String[] weekDays = DateFormatSymbols.getInstance(Locale.getDefault())
            .getWeekdays(DateFormatSymbols.STANDALONE, DateFormatSymbols.NARROW);
}
Following situation:
I need to detect if a String contains a DateTime/Timestamp. The problem is that those DateTimes come in various formats and granularity such as:
2011-09-12
12-09-2011
12.09.2011
2011-09-01-14:15
... and many many more variations
I don't need to understand the semantics (e.g. distinguish between day and month); I just need to detect, let's say, 80% of the most common DateTime variations.
My first thought was using RegExp, which I'm far from familiar with, and I would also need to familiarize myself with all the variations in which DateTimes can come.
So my questions:
Does anybody know a canned RegExp to achieve this?
Is there maybe some Java library that could do this task?
Thanks!!
There is another question of same context, hope that link will help you: Dynamic regex for date time formats
You're going to struggle to find a generic match. For the day-month-year section you could possibly use a pattern like (\d{1,2}.){2}\d{4}, which would match dates in the format dd*mm*yyyy, where * is any single separator character.
DateFormat would be a better choice, I think. As John B suggested above, create a list of valid formats and try to match against each one.
Use Java's DateFormat.
You can set up as many formats as you want and iterate through them looking for a match. You will have to catch exceptions for the formats that don't parse and so this solution is not efficient but will work.
Edit per comment:
If you don't want to have exceptions due to performance, then you would need to set up a list of regular expressions (one for each format you will support). Find the regex (if any) that matches your input and convert it to a date based on the matching format. What I would suggest is to pair a DateFormat with each regex and let the appropriate DateFormat do the work of parsing once you have identified it. This would reduce the chance of errors in using the groups from the regex to produce the date. Personally, I don't know if this would actually be more efficient than try/catch, so I would opt for the more straightforward mechanism (using DateFormat directly).
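A sketch of the regex-paired-with-DateFormat idea, using the sample formats from the question. The pattern table is an assumption (in particular, reading 12-09-2011 as day first), and the class name RegexThenParse is hypothetical. The regex only picks the format; SimpleDateFormat still does the parsing:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

final class RegexThenParse {
    // One cheap shape check per supported format, kept in insertion order.
    private static final Map<Pattern, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put(Pattern.compile("\\d{4}-\\d{2}-\\d{2}"), "yyyy-MM-dd");
        FORMATS.put(Pattern.compile("\\d{2}-\\d{2}-\\d{4}"), "dd-MM-yyyy");
        FORMATS.put(Pattern.compile("\\d{2}\\.\\d{2}\\.\\d{4}"), "dd.MM.yyyy");
        FORMATS.put(Pattern.compile("\\d{4}-\\d{2}-\\d{2}-\\d{2}:\\d{2}"), "yyyy-MM-dd-HH:mm");
    }

    /** Parses the text with the format whose regex matches it, if any. */
    static Optional<Date> parse(String text) {
        for (Map.Entry<Pattern, String> entry : FORMATS.entrySet()) {
            if (!entry.getKey().matcher(text).matches()) {
                continue; // wrong shape: no exception raised for this format
            }
            // SimpleDateFormat is not thread-safe, so create one per call.
            SimpleDateFormat format = new SimpleDateFormat(entry.getValue(), Locale.ROOT);
            format.setLenient(false); // reject values such as month 13
            try {
                return Optional.of(format.parse(text));
            } catch (ParseException e) {
                return Optional.empty(); // right shape, but not a real date
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "2011-09-12", "12.09.2011", "2011-09-01-14:15", "hello" }) {
            System.out.println(s + " -> " + parse(s).isPresent());
        }
    }
}
```

Whether this is actually faster than plain try/catch over every format would need measuring; the sketch only shows how the pairing keeps the date logic out of the regex groups.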