I have an Android app that outputs a csv with several hundred lines per second. Each line has a timestamp that is basically generated like this:
String formatTimeStamp(Calendar cal)
{
    SimpleDateFormat timeFormatISO = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
    return timeFormatISO.format(cal.getTime());
}
Some timestamps (maybe one in a thousand) have unexpected zeroes in them and look something like this:
2016-04-12T09:0011:30
2016-04-12T0009:0011:30
2016-04-0012T09:11:30
The occurrences seem completely arbitrary; at least I don't see any pattern behind them. Sometimes there are several thousand lines between two wrong lines, sometimes there is just one.
The date format is defined in only one place in the code.
Edit:
Here are the faulty timestamps of my last run, just so you can take a look at them. The only thing I have noticed about them is that there are always two leading zeroes in front of a date element, though never in front of the year.
Edit 2:
Issue resolved! Turns out SimpleDateFormat is not thread safe and does weird things to strings if used incorrectly. I didn't know multithreading would be an issue, so I didn't point it out in the initial question. Sorry about the confusion.
"Turns out SimpleDateFormat is not thread safe and does weird things to strings if used incorrectly."
The code you showed us in your Question is thread-safe unless the value of cal is being handled in a non-safe way.
The SimpleDateFormat instance is thread-confined; i.e. no other threads can see it. Therefore, its thread-safety or otherwise is not relevant.
My guess is that in your actual code you had multiple threads attempting to share a SimpleDateFormat instance. The javadoc says:
"It is recommended to create separate format instances for each thread. If multiple threads access a format concurrently, it must be synchronized externally."
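One way to follow that javadoc recommendation is to keep one formatter per thread. The following is only a minimal sketch: the class name TimestampFormatting and the use of ThreadLocal.withInitial (Java 8+, or a recent Android API level) are assumptions, not something taken from the question's actual code. A fixed Locale.ROOT is also an addition, so that digits stay ASCII regardless of the device locale.

```java
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class TimestampFormatting {
    // One SimpleDateFormat per thread, so no instance is ever shared between threads.
    private static final ThreadLocal<SimpleDateFormat> ISO = ThreadLocal.withInitial(
            () -> new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT));

    static String formatTimeStamp(Calendar cal) {
        return ISO.get().format(cal.getTime());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        Callable<String> task = () -> formatTimeStamp(Calendar.getInstance());
        List<Future<String>> results = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            results.add(pool.submit(task));
        }
        int malformed = 0;
        for (Future<String> result : results) {
            // A well-formed timestamp in this pattern is always 19 characters long.
            if (result.get().length() != 19) {
                malformed++;
            }
        }
        pool.shutdown();
        System.out.println("malformed timestamps: " + malformed);
    }
}
```

Creating a new SimpleDateFormat inside the method, as in the question's snippet, is equally safe; the per-thread variant merely avoids constructing a formatter on every call.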
You need to specify the time zone explicitly.
For example:
timeFormatISO.setTimeZone(TimeZone.getTimeZone("GMT"));
I want to parse some dates in Java, but the format is not defined and could be a lot of them (any ISO-8601 format which is already a lot, Unix timestamp in any unit, and more)
Here are some samples :
1970-01-01T00:00:00.00Z
1234567890
1234567890000
1234567890000000
2021-09-20T17:27:00.000Z+02:00
Perfect parsing might be impossible because of ambiguous cases, but a solution that parses most of the common dates with some logic might be achievable (for example, timestamps are interpreted as seconds / milli / micro / nano so as to give a date close to the 2000 era, and dates like '08/07/2021' could have a default for distinguishing month and day).
I didn't find any easy way to do it in Java, while in Python it is kind of possible (not working on all my samples, but at least on some of them) using the infer_datetime_format option of the pandas function to_datetime (https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html).
Is there an easy approach in Java?
Well, first of all, I agree with rzwitserloot here that date parsing in free format is extremely difficult and full of ambiguities. So you are skating on thin ice and will eventually run into trouble if you just assume that a user input will be correctly parsed the way you think it will.
Nevertheless, we could make it work if I assume either of the following:
You simply don't care if it will be parsed incorrectly; or
You are doing this for fun or for learning purposes; or
You have a banner, saying:
If the parsing goes wrong, it's your fault. Don't blame us.
Anyway, the DateTimeFormatterBuilder is able to build a DateTimeFormatter which could be able to parse a lot of different patterns. Since a formatter supports optional parsing, it could be instructed to try to parse a certain value, or skip that part if no valid value could be found.
For instance, this builder is able to parse a fairly wide range of ISO-like dates, with many optional parts:
DateTimeFormatterBuilder builder = new DateTimeFormatterBuilder()
.appendPattern("uuuu-M-d")
.optionalStart()
.optionalStart().appendLiteral(' ').optionalEnd()
.optionalStart().appendLiteral('T').optionalEnd()
.appendValue(ChronoField.HOUR_OF_DAY)
.optionalStart()
.appendLiteral(':')
.appendValue(ChronoField.MINUTE_OF_HOUR)
.optionalStart()
.appendLiteral(':')
.appendValue(ChronoField.SECOND_OF_MINUTE)
.optionalStart()
.appendFraction(ChronoField.NANO_OF_SECOND, 1, 9, true)
.optionalEnd()
.optionalEnd()
.optionalEnd()
.appendPattern("[XXXXX][XXXX][XXX][XX][X]")
.optionalEnd();
DateTimeFormatter formatter = builder.toFormatter(Locale.ROOT);
All of the strings below can be successfully parsed by this formatter.
Stream.of(
"2021-09-28",
"2021-07-04T14",
"2021-07-04T14:06",
"2001-09-11 00:00:15",
"1970-01-01T00:00:15.446-08:00",
"2021-07-04T14:06:15.2017323Z",
"2021-09-20T17:27:00.000+02:00"
).forEach(testcase -> System.out.println(formatter.parse(testcase)));
As you can see, with optionalStart() and optionalEnd(), you can define optional portions of the format.
There are many more patterns you probably want to parse. You could add those patterns to the builder above. Alternatively, the appendOptional(DateTimeFormatter) method could be used to combine multiple formatters.
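As a rough illustration of the appendOptional route, the sketch below combines two alternative date layouts into one formatter. The class name OptionalFormats and the choice of the two layouts are assumptions made for the example only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

class OptionalFormats {
    // Each alternative is wrapped in its own optional section and tried in order.
    static final DateTimeFormatter COMBINED = new DateTimeFormatterBuilder()
            .appendOptional(DateTimeFormatter.ISO_LOCAL_DATE)            // e.g. 2021-09-28
            .appendOptional(DateTimeFormatter.ofPattern("dd/MM/uuuu"))   // e.g. 28/09/2021
            .toFormatter(Locale.ROOT);

    public static void main(String[] args) {
        System.out.println(LocalDate.parse("2021-09-28", COMBINED));
        System.out.println(LocalDate.parse("28/09/2021", COMBINED));
    }
}
```

Both calls yield the same LocalDate. Note that the day-first reading of the slashed layout is hard-coded here, which is exactly the kind of assumption the other answers warn about.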
The perfect parsing might be impossible because of ambiguous cases but, a solution to parse most of the common dates with some logical might be achievable
Sure, and such wide-ranging guesswork should most definitely not be part of a standard java.* API. I think you're also wildly underestimating the ambiguity. 1234567890? It's just flat out incorrect to say that this can reasonably be parsed.
You are running into many, many problems here:
Java in general prefers throwing an error instead of guessing. This is inherent in the language: Java has few optional syntax constructs; semicolons aren't optional, () for method invocations are not optional, and Java intentionally does not have 'truthy/falsy', i.e. if (foo) is only valid if foo is an expression of the boolean type, unlike e.g. Python, where you can stick anything in there and there's a big list of what counts as falsy, with the rest being considered truthy. When in Rome, do as the Romans do: if this tenet annoys you, well, either learn to love it, begrudgingly accept it, or program in another language. This idea is endemic in the entire ecosystem. For what it is worth, given that debugging tends to take far longer than typing the optional constructs, Java is objectively correct or at least making rational decisions for being like this.
Either you can't bring in the notion that 'hey, this number is larger than 12, therefore it cannot possibly be the month', or you have to accept that whether a certain date format parses properly depends on whether the day-of-month value is above or below 12. I would strongly advocate that you avoid a library that fails this rule like the plague. What possible point is there, in the end? "My app will parse your date correctly, but only for about 3/5ths of all dates?" So, given that you can't/should not take that into account, 1234567890, is that seconds-since-1970? milliseconds-since-1970? Is that the 12th of the 34th month of the year 5678, the 90th hour, and assumed zeroes for minutes, seconds, and millis? If a library guesses, that library is wrong, because you should not guess unless you're 95%+ sure.
The obvious and perennial "do not guess" example is, of course, 101112. Is that November 10th, 2012 (European style)? Is that October 11th, 2012 (American style)? Or is that November 12th, 2010 (ISO style)? These are all reasonable guesses and therefore guessing is just wrong here. Do. Not. Guess. Unless you're really sure. Given that this is a somewhat common way to enter dates: guessing at all costs is objectively silly (see above), and guessing only when it's pretty clear and erroring out otherwise is mostly useless, given that ambiguity is so easy to introduce.
The concept of guessing may be defensible but only with a lot more information. For example, if you give me the input '101112100000', there's no way it's correct to guess here. But if you also tell me that a human entered this input, and that human is clearly clued into, say, the German locale, then I can see the need to be able to turn that into '10th of November 2012, 10 o'clock in the morning': interpreting it as seconds or millis since some epoch is precluded by the human factor, and the day-month-year order is given by the locale.
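To make the ambiguity concrete, here is a small sketch that feeds the two inputs discussed above through several equally plausible interpretations. The class name AmbiguousInputs and the helper parse are made up for the illustration; the patterns are just the three readings named in the text.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class AmbiguousInputs {
    // Parses the text with an explicit pattern; nothing is guessed here.
    static LocalDate parse(String text, String pattern) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        String digits = "101112";
        System.out.println(parse(digits, "ddMMuu")); // day-month-year reading: 2012-11-10
        System.out.println(parse(digits, "MMdduu")); // month-day-year reading: 2012-10-11
        System.out.println(parse(digits, "uuMMdd")); // year-month-day reading: 2010-11-12

        long number = 1234567890L;
        System.out.println(Instant.ofEpochSecond(number)); // 2009-02-13T23:31:30Z
        System.out.println(Instant.ofEpochMilli(number));  // 1970-01-15T06:56:07.890Z
    }
}
```

Every one of these calls succeeds, which is the point: the input alone does not say which result is the intended one.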
You asked:
Are there some easy approach in Java?
This entire question is incorrect. The in Java part needs to be stripped from this question, and then the answer is a simple: No. There is no simple way to parse strings into date/times without a lot more information than just the input string. If another library says they can do that, they are lying, or at least, operating under a list of cultural and source assumptions as long as my leg, and you should not be using that library.
I don't know of any standard library with this functionality, but you can always use the DateTimeFormatter class and guess the format by looping over a list of predefined formats, or by using the ones provided by this class.
This is a typical approximation of what you want to achieve.
Here you can see an old implementation: https://balusc.omnifaces.org/2007/09/dateutil.html
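The loop-over-candidates idea could look roughly like the sketch below. The class name CandidateFormats, the method tryParse and the particular list of formats are assumptions for illustration; the order of the list decides how an ambiguous input such as 08/07/2021 is read.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class CandidateFormats {
    // Candidates are tried in this order; the first one that parses wins.
    static final List<DateTimeFormatter> CANDIDATES = Arrays.asList(
            DateTimeFormatter.ISO_LOCAL_DATE,
            DateTimeFormatter.ofPattern("dd/MM/uuuu"),
            DateTimeFormatter.ofPattern("uuuu/MM/dd"));

    static Optional<LocalDate> tryParse(String text) {
        for (DateTimeFormatter candidate : CANDIDATES) {
            try {
                return Optional.of(LocalDate.parse(text, candidate));
            } catch (DateTimeParseException e) {
                // This candidate does not match; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(tryParse("2021-09-28"));
        System.out.println(tryParse("08/07/2021")); // read day-first because of the list order
        System.out.println(tryParse("2021/09/28"));
        System.out.println(tryParse("not a date"));
    }
}
```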
FTA (https://github.com/tsegall/fta) is designed to solve exactly this problem (among others). It currently parses thousands of formats and does not do it via a predefined set, so typically runs extremely quickly. In this example we explicitly set the DateResolutionMode, however, it will default to something intelligent based on the Locale. Here is an example:
import java.util.Locale;
import com.cobber.fta.dates.DateTimeParser;
import com.cobber.fta.dates.DateTimeParser.DateResolutionMode;
public abstract class Simple {
public static void main(final String[] args) {
final String[] samples = { "1970-01-01T00:00:00.00Z", "2021-09-20T17:27:00.000Z+02:00", "08/07/2021" };
final DateTimeParser dtp = new DateTimeParser().withDateResolutionMode(DateResolutionMode.MonthFirst).withLocale(Locale.ENGLISH);
for (final String sample : samples)
System.err.printf("Format is: '%s'%n", dtp.determineFormatString(sample));
}
}
Which will give the following output:
Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSX'
Format is: 'yyyy-MM-dd'T'HH:mm:ss.SSSX'
Format is: 'MM/dd/yyyy'
I am currently making an auction program in Java, I am trying to work out deadlines, however my date keeps coming out as (7/04/2013 11:22), is there a way to use String.format to add a leading zero to this date when it is a one digit day?
String timeOne = Server.getDateTime(itemArray.get(1).time).toString();
It causes me a problem later on when I try to sub string it, and it is less than 17 characters long.
Thanks in advance, James.
@Leonard Brünings' answer is the right way. And here's why your original code is the wrong way ... even if it worked.
The javadoc for Calendar.toString() says this:
"Return a string representation of this calendar. This method is intended to be used only for debugging purposes, and the format of the returned string may vary between implementations."
Basically you are using toString() for a purpose that the javadoc says you shouldn't. Even if you tweaked the output from toString(), the chances are that your code would be fragile. A change in JVM could break it. A change of locale could break it.
Simply use SimpleDateFormat:
import java.text.SimpleDateFormat;
import java.util.Calendar;
Calendar timeOne = Server.getDateTime(itemArray.get(1).time);
SimpleDateFormat sdf = new SimpleDateFormat("MM/dd/yyyy HH:mm");
System.out.println(sdf.format(timeOne.getTime()));
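Since Server and itemArray are not shown in the question, here is a self-contained sketch of the same idea with a hand-built Calendar. The class name PaddedDeadline is hypothetical, and the day-first pattern is an assumption based on the sample value 7/04/2013 in the question; swap "dd" and "MM" if the month comes first. Locale.ROOT is added so the digits do not depend on the default locale.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

class PaddedDeadline {
    static String format(Calendar deadline) {
        // "dd", "MM", "HH" and "mm" always produce two digits, so the result has a fixed width.
        SimpleDateFormat sdf = new SimpleDateFormat("dd/MM/yyyy HH:mm", Locale.ROOT);
        return sdf.format(deadline.getTime());
    }

    public static void main(String[] args) {
        Calendar cal = Calendar.getInstance();
        cal.clear();
        cal.set(2013, Calendar.APRIL, 7, 11, 22); // 7 April 2013, 11:22
        String text = format(cal);
        System.out.println(text);          // 07/04/2013 11:22
        System.out.println(text.length()); // 16
    }
}
```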
From an external service I get objects with Date+Time fields as String's in format 2012-03-07 12:12:23.547 and I need to compare these fields to get a correct order of the objects. I am well aware that I can create Date objects via e.g. SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS") and compare the two Date's to achieve this but my question is if I can rely on a correct sorting order if I compare them as Strings such as String.compareTo(String)? Some light testing gives me the impression that it works but I am wondering if anyone is aware of any scenarios where it would NOT give me the correct result? Also, are there any performance considerations, pros or cons, of comparing String's Vs parsing into Dates to compare?
Assuming the hours are in 24 hour format, then yes, that's a sortable date/time format - and one of its well-known benefits is that you can sort without actually parsing.
One downside: if you get bad data, you won't spot it - you'll just neatly sort it into the "right" place, ignoring the fact that you've been given (say) February 30th.
If you need the value as a date/time later on, then I'd parse it first and then compare. But if you only need this in terms of ordering, the string comparison may be faster than parsing everything. Worth benchmarking, of course... especially as comparing two strings on the same day will require several characters-worth of checking, whereas if you've parsed it once you can then probably just compare long values.
No, the better approach would be to parse the string into a date object and then compare it with the other date object.
I wouldn't worry about performance unless you have some reason to think this code will be a bottleneck (called lots of times within loops) and even then I'd wait till you could do some concrete performance testing.
Comparing them as dates will make your code clearer and will mean you can more easily change the date format in future (to something which doesn't sort as a string).
It works provided
You order the fields from most significant to least significant.
You use number fields (not Jan/Feb) and they are the same width. e.g. 2:15 is after 12:15 but 02:15 is before as expected. Note: For years after 9999 and before 0001 this will not work.
You accept that invalid dates may not be detected.
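A quick way to convince yourself of these conditions is to sort the same values both ways and compare. The sketch below does that; the class name SortableTimestamps and the sample values are made up, and java.time is used for the parsed variant instead of the SimpleDateFormat mentioned in the question.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class SortableTimestamps {
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    // Plain lexicographic ordering of the raw strings.
    static List<String> sortAsStrings(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        Collections.sort(copy);
        return copy;
    }

    // Chronological ordering after parsing each string.
    static List<String> sortAsDates(List<String> input) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(Comparator.comparing((String s) -> LocalDateTime.parse(s, FORMAT)));
        return copy;
    }

    public static void main(String[] args) {
        List<String> stamps = Arrays.asList(
                "2012-03-07 12:12:23.547",
                "2012-03-07 02:15:00.000",
                "2011-12-31 23:59:59.999");
        System.out.println(sortAsStrings(stamps).equals(sortAsDates(stamps))); // true
        // Without zero padding the string order breaks: "2:15" sorts after "12:15".
        System.out.println("2:15".compareTo("12:15") > 0); // true
    }
}
```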
Use the type that it represents - in this case it's a date so use a date. Dates can easily be sorted chronologically by adding them to a collection and doing Collections.sort(dates) on the dates.
See http://docs.oracle.com/javase/tutorial/collections/interfaces/order.html
I think Date is compared, or can be compared, through milliseconds (a long), which is faster. It's probably the safer way too: you don't need to think about cases where string comparison would not fit.
Is there a better way of doing this?
boolean oneCalendarWeek = interval.getStart().plusWeeks(1).equals( interval.getEnd() );
I guess the following won't work because of the way equals is implemented...
boolean oneCalendarWeek = interval.toPeriod().equals( Weeks.ONE );
From the comments:
i really want to know if the api supports something like my second example which i think is clearer than the first
While the example using Weeks.ONE does not work (since Period.equals() first checks if the two Period instances support the same number of fields, and Weeks.ONE only supports one field), this should work instead:
boolean oneCalendarWeek = interval.toPeriod().equals( Period.weeks(1) );
Here is a code sample that tests this for an interval that starts before the start of DST and ends while in DST. However, I'm not 100% sure how this would behave if the start or end time of the Interval fell exactly on the DST boundary.
I know this topic isn't new, but I have to dig it up again.
I already searched the Web numerous times (including some Threads here on stackoverflow) but haven't found a satisfying answer so far.
(Amongst others I checked
Parsing Ambiguous Dates in Java and
http://www.coderanch.com/t/375367/java/java/Handling-Multiple-Date-Formats-Elegantly)
I am currently writing a Dateparser in Java, which takes a date and generates a format-String which can be used by SimpleDateFormat for parsing the date.
The dates are parsed via regex (yes, it's an ugly one xD) from Logfiles (IBM Websphere, Tomcat, Microsoft Exchange, ....). Because we have customers in (at least 2) different Locales, there is no way to simply "throw" the String against the parse-method of SimpleDateFormat and expect it to work properly.
Furthermore, there is the problem with the position of day and month (i.e. formats "dd/MM/yyyy" or "MM/dd/yyyy"), which cannot be solved if I don't have at least two datasets where the day digit has changed.
So my current approach would be storing the date formats for a specific software installed at a specific customer's systems in a database (mysql / xml / ... ) and forcing the user to at least specify the customer name and software name, so there is enough context to narrow down the number of possible formats.
This "subset" then would be used to try to parse the logfiles of the specified software.
(The subset is stored in a HashMap in a HashMap in the form
HashMap> map;
The Integer-Key is the length of the formatstring and the String Key of the second Hashmap specifies a datesignature only containing the separating characters.
(i.e. ".. ::." for a date with format "dd.MM.yyyy 11:11:11.111")
I also take into account the value of the digits, i.e. a digit > 12 has to be a day because there is no 13th month. But this only works reliably for Date-Strings later than the 12th of a month..
Is there any chance to avoid implementing prior knowledge about the environment the logfile came from, thus enabling the parser to reliably parse one date without having to refer to a second date string for comparison?
I'm stuck on that for almost 3 months now -.-
Any suggestions would be very welcome =)
Edit:
Okay guys this thread can be closed. I now came up with a different solution for my specific problem. For those who are interested:
I am writing a Logreader in Java. As we have regular maintenance I have to read many logfiles.
But it's not just the plain text information that's written in the file.
Imagine a server has just crashed, it's Sunday night and the next person to notice is the head of the customer's IT department. Then on the following day I have to do maintenance and check the logfiles. Judging by content, everything seemed okay, nothing unusual. Half an hour after sending the maintenance report I receive a mail from the above-mentioned head of the IT department, ranting that the server had crashed and that it seemed to have gone unnoticed.
The point is, you can't keep track of both the content and the timestamps for logfiles with several thousand lines. So I developed a component which reads a logfile and calculates the time between two consecutive log entries. Each log line got parsed into a java.util.Date, so that the Date could later be used as a timestamp for high resolution regarding the log intervals. The differences I then plotted on a line graph, which makes longer timeouts between two log lines visible as a big spike relative to the rest of the file.
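The gap calculation itself might be sketched as follows. This is not the component described above: the class name LogGaps and both method names are invented for the illustration, and java.time types stand in for java.util.Date.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LogGaps {
    // Durations between consecutive entries, in file order.
    static List<Duration> gaps(List<LocalDateTime> stamps) {
        List<Duration> result = new ArrayList<>();
        for (int i = 1; i < stamps.size(); i++) {
            result.add(Duration.between(stamps.get(i - 1), stamps.get(i)));
        }
        return result;
    }

    // Index of the entry that follows the longest gap, or -1 if there are fewer than two entries.
    static int indexAfterLongestGap(List<LocalDateTime> stamps) {
        List<Duration> all = gaps(stamps);
        int best = -1;
        for (int i = 0; i < all.size(); i++) {
            if (best < 0 || all.get(i).compareTo(all.get(best)) > 0) {
                best = i;
            }
        }
        return best < 0 ? -1 : best + 1;
    }

    public static void main(String[] args) {
        List<LocalDateTime> stamps = Arrays.asList(
                LocalDateTime.of(2013, 4, 7, 23, 0, 0),
                LocalDateTime.of(2013, 4, 7, 23, 0, 5),
                LocalDateTime.of(2013, 4, 8, 1, 30, 5),
                LocalDateTime.of(2013, 4, 8, 1, 30, 6));
        System.out.println(gaps(stamps));                 // [PT5S, PT2H30M, PT1S]
        System.out.println(indexAfterLongestGap(stamps)); // 2
    }
}
```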
My solution now will be to completely throw away the date-half of the String and insert a dummy-Date with a predefined format. The date only has to change if the Hour and minute approach 23:59.
The original date later is presented on the graph with the "fake-data" lying beneath.
I thank all of you for your suggestions and feedback =)
(And I hope my English has been understandable so far ;) )
My suggestion is to store all dates as 'ambiguous' until such time as the ambiguity can be resolved. (This assumes that a particular customer will always supply data in the same format.) As soon as you get a log from a customer for which you can unambiguously identify the date format, you would then be able to retrospectively apply this format to previously received files.
To do this, you would need a table mapping each customer to their date format with some marker (e.g. NULL) to indicate that format is not yet established. You will probably also need to create your own date representation such that you can model these ambiguous dates.
So, as an example, if the possible date formats are:
dd/mm/yyyy
mm/dd/yyyy
yyyy/mm/dd
yyyy/dd/mm
Given dates, you should always be able to identify the year (permitting two digit years would make this problem considerably harder). So you should be able to map dates as follows:
25/01/2011 -> UNAMBIGUOUS_DD_MM_YYYY
12/01/2011 -> AMBIGUOUS_XX_XX_YYYY
2011/03/03 -> AMBIGUOUS_YYYY_XX_XX
03/30/2011 -> UNAMBIGUOUS_MM_DD_YYYY
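A classifier producing these markers could be sketched like this. The four constants shown in the mapping are taken from the list above; UNAMBIGUOUS_YYYY_MM_DD and UNAMBIGUOUS_YYYY_DD_MM are hypothetical additions for the year-first case, and the class and method names are likewise made up. It assumes four-digit years and '/' as the only separator.

```java
class DateShapeClassifier {
    enum DateShape {
        UNAMBIGUOUS_DD_MM_YYYY, UNAMBIGUOUS_MM_DD_YYYY, AMBIGUOUS_XX_XX_YYYY,
        UNAMBIGUOUS_YYYY_MM_DD, UNAMBIGUOUS_YYYY_DD_MM, AMBIGUOUS_YYYY_XX_XX
    }

    static DateShape classify(String date) {
        String[] parts = date.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected three '/'-separated fields: " + date);
        }
        // The four-digit field is the year; the other two are day and month in unknown order.
        boolean yearFirst = parts[0].length() == 4;
        int first = Integer.parseInt(yearFirst ? parts[1] : parts[0]);
        int second = Integer.parseInt(yearFirst ? parts[2] : parts[1]);

        if (first > 12 && second <= 12) {
            // A value above 12 cannot be a month, so the first field is the day.
            return yearFirst ? DateShape.UNAMBIGUOUS_YYYY_DD_MM : DateShape.UNAMBIGUOUS_DD_MM_YYYY;
        }
        if (second > 12 && first <= 12) {
            return yearFirst ? DateShape.UNAMBIGUOUS_YYYY_MM_DD : DateShape.UNAMBIGUOUS_MM_DD_YYYY;
        }
        if (first <= 12 && second <= 12) {
            return yearFirst ? DateShape.AMBIGUOUS_YYYY_XX_XX : DateShape.AMBIGUOUS_XX_XX_YYYY;
        }
        throw new IllegalArgumentException("Neither field can be a month: " + date);
    }

    public static void main(String[] args) {
        for (String sample : new String[] {"25/01/2011", "12/01/2011", "2011/03/03", "03/30/2011"}) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

Once one unambiguous line has been seen for a customer, the stored marker can be used to pick the concrete format for all of that customer's ambiguous lines.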
If possible, you can ask the customers to pass the dateformat string also along with their actual date strings.
i.e. in their log files, they would need to have one more column
..... , '03/11/2011' , 'MM/DD/YYYY' , ...
I think the strategy you are going for (i.e. analysing a bigger set of data) is the best you can get.
From a single line of a logfile you will never know if 3/5/11 is the 3rd of May in 2011 or the 5th of March in 2011. (I guess there might also be locales that would interpret this as the 11th of May in 2003...)
I had these problems myself some time ago, and i also could only try to introduce some sort of context by either looking at numbers>12, or what changes quickest (must be "day"). But you already stated that yourself...