Adding a new date format to Natty DateParser

I'm a total newbie when it comes to Natty and ANTLR. Up to now, Natty has been great and has parsed dates with no problems. Recently we have started to receive a new date and time format which Natty has trouble extracting.
Mon 29 Feb 09:00:00 2016
It cannot extract the year due to it being separated from the rest of the date.
I've been trying to add my own format into DateParser, where it could pick up on this format as it does with any other.
I've made the following changes:
date_time: Added an extra rule called custom_dates which will be the new rule for my format
date_time
: (
(date)=>date (date_time_separator explicit_time)?
| explicit_time (time_date_separator date)?
| custom_dates
) -> ^(DATE_TIME date? explicit_time?)
| relative_time -> ^(DATE_TIME relative_time?)
;
custom_date: My new rule
custom_date
: relaxed_day_of_week WHITE_SPACE relaxed_day_of_month WHITE_SPACE relaxed_month (date_time_separator explicit_time)? relaxed_year
-> ^(EXPLICIT_DATE relaxed_day_of_week relaxed_day_of_month relaxed_month relaxed_year (date_time_separator explicit_time)?)
;
When I try to build Natty with my changes, it just hangs, and never finishes. The output up to that point is:
Decision can match input such as "COMMA WHITE_SPACE INT_00 INT_00" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:444:73:
Decision can match input such as "COMMA WHITE_SPACE INT_00 {INT_13..INT_19, INT_20..INT_23}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:496:45:
Decision can match input such as "WHITE_SPACE IN {COMMA, WHITE_SPACE}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
warning(200): com\joestelmach\natty\generated\DateParser.g:504:77:
Decision can match input such as "WHITE_SPACE IN {COMMA, WHITE_SPACE}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
Am I possibly going the wrong way about this? I've taken a look at the Natty and ANTLR v3 documentation but there isn't much to go on.
Thanks in advance
EDIT:
As requested in the comments, I've added the location of the first warning. However, what I've included above is just a small sample of the dozens of warnings that were already there before I added my own rules.
The first warning appears in the date_time_separator
date_time_separator
: WHITE_SPACE (AT WHITE_SPACE)?
| WHITE_SPACE? COMMA WHITE_SPACE? (AT WHITE_SPACE)?
| T
;
One observation I've made is when I changed my rule to always include the time
custom_date
: relaxed_day_of_week WHITE_SPACE relaxed_day_of_month WHITE_SPACE relaxed_month (date_time_separator explicit_time) relaxed_year
-> ^(EXPLICIT_DATE relaxed_day_of_week relaxed_day_of_month relaxed_month relaxed_year (date_time_separator explicit_time)?)
;
When I compile I receive this error:
error(202): com\joestelmach\natty\generated\DateParser.g:831:3: the decision cannot distinguish between alternative(s) 1,2 for input such as "INT_00 INT_00 INT_00 EOF"
Line 831 is where explicit_time resides. I cannot find anything on StackOverflow or elsewhere explaining this error. I assume it means that there is some ambiguity between the two possible alternatives, but I don't understand why merely adding my rule should cause it.
explicit_time_hours_minutes returns [String hours, String minutes, String ampm]
: hours (COLON | DOT)? minutes ((COLON | DOT)? seconds)? (WHITE_SPACE (meridian_indicator | (MILITARY_HOUR_SUFFIX | HOUR)))?
{$hours=$hours.text; $minutes=$minutes.text; $ampm=$meridian_indicator.text;}
-> hours minutes seconds? meridian_indicator?
| hours (WHITE_SPACE? meridian_indicator)?
{$hours=$hours.text; $ampm=$meridian_indicator.text;}
-> hours ^(MINUTES_OF_HOUR INT["0"]) meridian_indicator?
;
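While the grammar change is being sorted out, one possible stopgap is to try the known fixed format before handing the text to Natty. The sketch below does not touch Natty or ANTLR at all; it swaps in the standard java.time API instead, and the class name FixedFormatFallback is hypothetical. It assumes the input is exactly in the "Mon 29 Feb 09:00:00 2016" layout with English day and month names.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class FixedFormatFallback {

    // Day name, day of month, month name, time, then the trailing year.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("EEE d MMM HH:mm:ss yyyy", Locale.ENGLISH);

    /** Parses the fixed layout; throws DateTimeParseException for anything else. */
    static LocalDateTime parse(String text) {
        return LocalDateTime.parse(text.trim(), FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("Mon 29 Feb 09:00:00 2016"));
    }
}
```

A caller could catch DateTimeParseException and fall back to Natty for every other input, so the existing behaviour stays unchanged for formats Natty already handles.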

Related

Talend : capture value from row1 and replace it in the entire column

I want to take UK from the first row and replace the entire COUNTRY column with it, without changing the values in ZONE. I have tried a regular expression in the expression builder, but it failed.
COUNTRY  ZONE
UK       12
AU       44
FR       21
GER      20
FR       02

First, your job design will look like this.
Second, using a tSampleRow you will get the range of lines (in your case, just the first line).
Third, store the line you want in a global variable like this.
Finally, in the tMap just read your global variable as such.
Here is the output (I have 201 lines, so 201 UK values are printed):
.--------.
|tLogRow_1|
|=------=|
|mystring|
|=------=|
|UK|
|UK |
|UK |
|UK |
|UK |
'--------'
[statistics] disconnected
Job operation ended at 14:00 21/02/2022. [exit code = 0]

Predict function R returns 0.0 [duplicate]

I posted earlier today about an error I was getting with using the predict function. I was able to get that corrected, and thought I was on the right path.
I have a number of observations (actuals) and I have a few data points that I want to extrapolate or predict. I used lm to create a model, then I tried to use predict with the actual value that will serve as the predictor input.
This code is all repeated from my previous post, but here it is:
df <- read.table(text = '
Quarter Coupon Total
1 "Dec 06" 25027.072 132450574
2 "Dec 07" 76386.820 194154767
3 "Dec 08" 79622.147 221571135
4 "Dec 09" 74114.416 205880072
5 "Dec 10" 70993.058 188666980
6 "Jun 06" 12048.162 139137919
7 "Jun 07" 46889.369 165276325
8 "Jun 08" 84732.537 207074374
9 "Jun 09" 83240.084 221945162
10 "Jun 10" 81970.143 236954249
11 "Mar 06" 3451.248 116811392
12 "Mar 07" 34201.197 155190418
13 "Mar 08" 73232.900 212492488
14 "Mar 09" 70644.948 203663201
15 "Mar 10" 72314.945 203427892
16 "Mar 11" 88708.663 214061240
17 "Sep 06" 15027.252 121285335
18 "Sep 07" 60228.793 195428991
19 "Sep 08" 85507.062 257651399
20 "Sep 09" 77763.365 215048147
21 "Sep 10" 62259.691 168862119', header=TRUE)
str(df)
'data.frame': 21 obs. of 3 variables:
$ Quarter : Factor w/ 24 levels "Dec 06","Dec 07",..: 1 2 3 4 5 7 8 9 10 11 ...
$ Coupon: num 25027 76387 79622 74114 70993 ...
$ Total: num 132450574 194154767 221571135 205880072 188666980 ...
Code:
model <- lm(df$Total ~ df$Coupon, data=df)
> model
Call:
lm(formula = df$Total ~ df$Coupon)
Coefficients:
(Intercept) df$Coupon
107286259 1349
Predict code (based on previous help):
(These are the predictor values I want to use to get the predicted value)
Quarter = c("Jun 11", "Sep 11", "Dec 11")
Total = c(79037022, 83100656, 104299800)
Coupon = data.frame(Quarter, Total)
Coupon$estimate <- predict(model, newdate = Coupon$Total)
Now, when I run that, I get this error message:
Error in `$<-.data.frame`(`*tmp*`, "estimate", value = c(60980.3823396919, :
replacement has 21 rows, data has 3
My original data frame that I used to build the model had 21 observations in it. I am now trying to predict 3 values based on the model.
I either don't truly understand this function, or have an error in my code.
Help would be appreciated.
Thanks

First, you want to use
model <- lm(Total ~ Coupon, data=df)
not model <-lm(df$Total ~ df$Coupon, data=df).
Second, by saying lm(Total ~ Coupon), you are fitting a model that uses Total as the response variable, with Coupon as the predictor. That is, your model is of the form Total = a + b*Coupon, with a and b the coefficients to be estimated. Note that the response goes on the left side of the ~, and the predictor(s) on the right.
Because of this, when you ask R to give you predicted values for the model, you have to provide a set of new predictor values, ie new values of Coupon, not Total.
Third, judging by your specification of newdata, it looks like you're actually after a model to fit Coupon as a function of Total, not the other way around. To do this:
model <- lm(Coupon ~ Total, data=df)
new.df <- data.frame(Total=c(79037022, 83100656, 104299800))
predict(model, new.df)

Thanks Hong, that was exactly the problem I was running into. The error you get suggests that the number of rows is wrong, but the problem is actually that the model has been trained using a command that ends up with the wrong names for parameters.
This is a critical detail that is entirely non-obvious for lm and similar functions. Some tutorials use calls like lm(olive$Area~olive$Palmitic), which ends up with variable names such as olive$Area rather than Area, so an entry created with anewdata<-data.frame(Palmitic=2) can't then be used. If you use lm(Area~Palmitic,data=olive), the variable names are right and prediction works.
The real problem is that the error message does not indicate the problem at all:
Warning message: 'anewdata' had 1 rows but variable(s) found to have X rows

In your predict call you wrote newdate instead of newdata; check that first. Then just use Coupon$estimate <- predict(model, Coupon)
It will work.

To avoid the error, the key point about the new dataset is the name of the independent variable: it must match the name used in the model. Alternatively, nest the two calls without creating a separate dataset:
model <- lm(Coupon ~ Total, data=df)
predict(model, data.frame(Total=c(79037022, 83100656, 104299800)))
Pay attention to how the model is specified. The next two commands look similar, but with predict the first works and the second does not.
model <- lm(Coupon ~ Total, data=df) #Ok
model <- lm(df$Coupon ~ df$Total) #Ko

Regex - Get text between two strings

I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:
a number at the beginning with a period right after
123.
and it always ends in:
[PubMed - indexed for MEDLINE]
It would be even better if I can get the title and abstract out of the separated string. I am fine if I have to split the articles first then split the texts.
In the example the title is the third line:
Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.
The abstract is on the 8th line:
Cardiopulmonary bypass (CPB) causes reperfusion injury...
I have tried to use the following code for this text
Regex:
[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE
Text:
1. Br J Biomed Sci. 2015;72(3):93-101.
Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.
Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.
Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.
PMID: 26510263 [PubMed - indexed for MEDLINE]

Two useful solutions have been proposed by Mariano and stribizhev:
Mariano's solution: Use the split method with the typical end
(?m)\[PubMed - indexed for MEDLINE\]$
DEMO : http://ideone.com/Qw5ss2
Java 4+
stribizhev's solution: Fully extract data from the text
(?m)^\s*\d+\..*\R{2} # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*) # Get title
\R{2} # Get to the authors
[^\n]*(?:\n(?!\R)[^\R]*)* # Consume authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) #Grab abstract
DEMO: https://regex101.com/r/sG2yQ2/2
Java 8+
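As a rough illustration of the split-first approach, the Java sketch below cuts the dump at Mariano's end marker and then breaks each record into blocks. It assumes each record is laid out as blank-line-separated blocks (citation, title, authors, abstract, PMID), which is what the \R{2} steps in the second pattern suggest; the class and method names are hypothetical, and records with extra blocks would need different indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class AbstractSplitter {

    // The marker that closes every record (the split pattern quoted above).
    private static final Pattern END_MARK =
            Pattern.compile("(?m)\\[PubMed - indexed for MEDLINE\\]$");

    /** Splits the whole dump into one trimmed string per record. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        for (String piece : END_MARK.split(text)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                records.add(trimmed);
            }
        }
        return records;
    }

    /** Returns the blank-line-separated blocks of a record, with wrapped lines joined. */
    static List<String> blocks(String record) {
        List<String> out = new ArrayList<>();
        for (String block : record.split("\\n\\s*\\n")) {
            out.add(block.trim().replaceAll("\\s*\\n\\s*", " "));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "1. J One. 2015;1(1):1-2.\n\nFirst title\nwrapped.\n\nDoe J.\n\n"
                + "First abstract.\n\nPMID: 1 [PubMed - indexed for MEDLINE]\n";
        for (String record : splitRecords(text)) {
            List<String> parts = blocks(record);
            // Assumed layout: index 1 is the title, index 3 is the abstract.
            System.out.println("Title: " + parts.get(1));
            System.out.println("Abstract: " + parts.get(3));
        }
    }
}
```

Splitting first keeps each regular expression small, at the cost of relying on block positions rather than named groups.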

Try this:
"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"
First group would be title. Second would be abstract.

Parsing a complicated CSV file

I am in a difficult situation where I need to write a parser for a formatted document from Tekla so that it can be processed into the database.
In the .CSV I have this:
,SMS-PW-BM31,,1,,,,287.9
,,SMS-PW-BM31,1,H350*175*7*11,SS400,5805,287.9
,------------,--------------,----,---------------,--------,------------,---------
,SMS-PW-BM32,,1,,,,405.8
,,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2
,,SMSPW-EN12,1,PLT12x175,SS400,500,8.2
,,SMSPW-EN14,1,PLT16x175,SS400,500,11
,------------,--------------,----,---------------,--------,------------,---------
That is the document generated by the Tekla software. The output I expect is something like this:
HEAD-MARK    COMPONENT-TYPE  QUANTITY  PROFILE        GRADE  LENGTH  WEIGHT
SMS-PW-BM31                  1                                       287.9
SMS-PW-BM31  SMS-PW-BM31     1         H350*175*7*11  SS400  5805    287.9
SMS-PW-BM32                  1                                       405.8
SMS-PW-BM32  SMSPW-H707      1         H350*175*7*11  SS400  6697    332.2
SMS-PW-BM32  SMSPW-EN12      1         PLT12X175      SS400  500     8.2
SMS-PW-BM32  SMSPW-EN14      1         PLT16X175      SS400  500     11
How do I start on this in Java? The most complicated part is distributing the head mark to the rows that follow it, where each group is separated by a line of '-' characters.

The CSV format is quite simple: the column delimiter is a comma (,) and the row delimiter is a newline (\n). Some columns may be surrounded by quotes (") to contain column data, but it looks like you won't have to worry about that given your current file.
Look at String.split and you will find your answer after a bit of pondering it.
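A minimal sketch of that String.split idea follows, assuming every data row has exactly eight comma-separated columns with an empty first column, as in the sample from the question. The class name TeklaCsvSketch and the rule "a non-empty second column starts a new head mark" are assumptions read off that sample, not anything Tekla documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TeklaCsvSketch {

    /** Carries the head mark from each assembly row down to its component rows. */
    static List<String[]> parse(List<String> lines) {
        List<String[]> rows = new ArrayList<>();
        String headMark = "";
        for (String line : lines) {
            // A limit of -1 keeps empty columns, including trailing ones.
            String[] cols = line.split(",", -1);
            // Skip short lines and the dashed separator rows.
            if (cols.length < 8 || cols[1].startsWith("---")) {
                continue;
            }
            if (!cols[1].isEmpty()) {
                headMark = cols[1]; // assembly row: remember its head mark
            }
            rows.add(new String[] {
                headMark, cols[2], cols[3], cols[4], cols[5], cols[6], cols[7]
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                ",SMS-PW-BM32,,1,,,,405.8",
                ",,SMSPW-H707,1,H350*175*7*11,SS400,6697,332.2",
                ",------------,--------------,----,---------------,--------,------------,---------");
        for (String[] row : parse(lines)) {
            System.out.println(String.join("\t", row));
        }
    }
}
```

Each returned array holds head mark, component type, quantity, profile, grade, length and weight in that order, so it maps directly onto the columns of the expected output.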

Would regex be a good choice to parse SMTP received lines

I want to parse elements of RFC822 (SMTP) "Received" lines, which are defined formally in the spec, e.g.:
atom = 1*<any CHAR except specials, SPACE and CTLs>
[...]
received = "Received" ":" ; one per relay
["from" domain] ; sending host
["by" domain] ; receiving host
["via" atom] ; physical path
*("with" atom) ; link/mail protocol
["id" msg-id] ; receiver msg id
["for" addr-spec] ; initial form
";" date-time ; time received
[...]
msg-id = "<" addr-spec ">" ; Unique message id
[...]
addr-spec = local-part "@" domain ; global address
etc. for domain, date-time, etc.
Here's a real example:
Received: from ll-194.132.162.89.kv.sovam.net.ua (ll-194.132.162.89.kv.sovam.net.ua [83.170.243.194] (may be forged)) by raq2073.uk2.net (8.10.2/8.10.2) with ESMTP id lASHDDE10765 for <johnsmithsvt@matts.co.uk>; Wed, 28 Nov 2007 17:13:13 GMT
Would regex be a good strategy to capture the parts of a received line?
I realize that many SMTP servers don't format received lines properly (in real life).
Otherwise, does anyone know of a library in Java that does this well?
Edit: Here's a regex, with tests in a fiddle, that I've banged on for a while and which seems to work.
Received:\s+(?:from\s+(.+?))?(?:\(qmail (.+?)\))?(?:\s+by\s+(.+?))?(?:\s+via\s+(.+?))?(?:\s+with\s+(.+?))?(?:\;?\s+id\s+(.+?))?(?:\s+for\s+(.+?))?(?:;\s*(?!.*\;.*)(.+))?$

The choice really depends on exactly what you want to achieve.
For capturing specific parts of a Received line (e.g. 'give me the from part'), regexes are awesome.
If you need a full-fledged parser for this grammar, then regexes alone will not suffice. The addr-spec in particular has so many special cases that a regex cannot hope to handle each one correctly. Regexes are not parsers.
Last time I needed an actual parser, I wrote my own using JavaCC. I would only recommend going down that road if you know a thing or two about grammars and parsing.
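For the "capture one specific part" case, a small targeted pattern per field is often easier to maintain than one pattern for the whole line. The sketch below pulls out only the receiving host and the date-time; it is not a parser for the full grammar, the class and method names are hypothetical, and it assumes the header has already been unfolded onto a single line.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReceivedLineSketch {

    // Host name that follows the first standalone "by" keyword.
    private static final Pattern BY = Pattern.compile("\\sby\\s+(\\S+)");
    // Everything after the last semicolon is treated as the date-time.
    private static final Pattern DATE = Pattern.compile(";\\s*([^;]+)$");

    /** Returns capture group 1 of the first match, if there is one. */
    static Optional<String> firstGroup(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    static Optional<String> receivingHost(String line) {
        return firstGroup(BY, line);
    }

    static Optional<String> dateTime(String line) {
        return firstGroup(DATE, line);
    }

    public static void main(String[] args) {
        String line = "Received: from a.example (a.example [192.0.2.1]) "
                + "by mx.example (8.10.2/8.10.2) with ESMTP id abc123; "
                + "Wed, 28 Nov 2007 17:13:13 GMT";
        System.out.println(receivingHost(line).orElse("(none)"));
        System.out.println(dateTime(line).orElse("(none)"));
    }
}
```

Returning Optional keeps missing clauses explicit, which matters here because real servers frequently omit or reorder them.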
