Getting non-greedy sequence as a string with ANTLR - java

I have a problem getting a sequence as a string. I have a file with lines like:
{TEXT="<div itemprop=\"content\"><div>some text</div>"}
I want to get and use the text that lies exactly between the first and last quotes. First I tried:
parse : line+;
line : '{TEXT="' SEQUENCE '"}' {System.out.println($SEQUENCE.getText())};
SEQUENCE : .+?;
But it failed: SEQUENCE matches only one symbol that way. Then I tried:
parse : line+;
line : '{TEXT="' (a+=SEQUENCE)*? '"}' {System.out.println($a.getText())};
SEQUENCE : .;
This gave me a List of Tokens, so I can't use getText.

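In a lexer rule, a trailing non-greedy .+? matches as few characters as possible, which is why SEQUENCE only ever matched one symbol. If you want to do it this way, move the wildcard into a parser rule so that it consumes tokens until '"}' follows: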
grammar Sequence;
parse : line+;
line : '{TEXT="' a=sequence '"}' {System.out.println(((LineContext)_localctx).a.getText());};
sequence : .+?;
ANY:.;
ANTLR4 also offers other mechanisms for this, such as listeners and visitors.
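
If a grammar is not strictly required for this step, the stated goal (the text between the first and last quotes) can also be sketched in plain Java. This is not ANTLR and not the answerer's method; the class name QuotedPayload and the method between are hypothetical names chosen for illustration.

```java
final class QuotedPayload {
    // Returns the text strictly between the first and the last double quote,
    // or null when the line contains fewer than two quotes.
    static String between(String line) {
        int first = line.indexOf('"');
        int last = line.lastIndexOf('"');
        if (first < 0 || last <= first) {
            return null;
        }
        return line.substring(first + 1, last);
    }

    public static void main(String[] args) {
        // The sample line from the question; the backslashes are literal characters in the file.
        String line = "{TEXT=\"<div itemprop=\\\"content\\\"><div>some text</div>\"}";
        System.out.println(between(line));
    }
}
```

Because it only looks at the outermost quotes, the escaped quotes inside the payload are returned unchanged.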

Related

Using java validate a string is printed in particular format

I have a response string like the following:
21.03.2019_15:06.26 [SELOGER]:: [Seloger value]-[PROGRESS]: marminto=true, france24=true,
Using Java, I have to validate that the above response is printed in the following format:
<date+time> [SELOGER]:: [Seloger value]-[<PROGRESS|STOP|START>]: <value1>=<true|false>, <value2>=<true|false>........
First comes <date+time>, then [SELOGER]:: [Seloger value]-, then [PROGRESS or STOP or START]:, then the values marminto=true, france24=true,.....
How can I do this with a regex? Or is there any Java API available to check that a string is printed in a particular format?
Try this pattern:
\d{2}\.\d{2}\.\d{4}\_\d{2}:\d{2}\.\d{2} \[SELOGER\]:: \[Seloger value\]-\[(?:PROGRESS|STOP|START)\]: *(?:[a-zA-Z0-9]+=(?:true|false), ?)*
Explanation:
\d{2}\.\d{2}\.\d{4}\_\d{2}:\d{2}\.\d{2} - matches the date and time in the specified format
(?:PROGRESS|STOP|START) - alternation, matches any one of PROGRESS, STOP or START
(?:[a-zA-Z0-9]+=(?:true|false), ?)* - matches zero or more value=true or value=false pairs, each followed by a comma and an optional space
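
As a minimal sketch of how the pattern above could be used from Java, the following wraps it in Pattern.matches-style validation. The class name ResponseFormat and the method isValid are hypothetical; the underscore is written unescaped, which is equivalent to \_ in this pattern.

```java
import java.util.regex.Pattern;

final class ResponseFormat {
    // The pattern from the answer, with backslashes doubled for a Java string literal.
    private static final Pattern LINE = Pattern.compile(
        "\\d{2}\\.\\d{2}\\.\\d{4}_\\d{2}:\\d{2}\\.\\d{2} "
        + "\\[SELOGER\\]:: \\[Seloger value\\]-\\[(?:PROGRESS|STOP|START)\\]: *"
        + "(?:[a-zA-Z0-9]+=(?:true|false), ?)*");

    // matches() requires the whole string to fit the format, not just a part of it.
    static boolean isValid(String response) {
        return LINE.matcher(response).matches();
    }

    public static void main(String[] args) {
        String sample = "21.03.2019_15:06.26 [SELOGER]:: [Seloger value]-[PROGRESS]: "
            + "marminto=true, france24=true,";
        System.out.println(isValid(sample));
    }
}
```

Using matches() rather than find() means trailing text that does not fit the value=true|false shape causes the check to fail.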

Regex pattern needed for 123.12/23

I am trying to check strings such as 123.12/23 against the pattern \\d+(.\\d+)*\\/\\d+(.\\d+)*, but it does not work: it also accepts 123.12/23/24.
I need the following scenarios to be covered:
Strings that should pass: 12/23, 12.23/23, 12/23.33
Strings that should fail: 12/13/14, 12.23/2/4
^\d+(?:\.\d+)?\/\d+(?:\.\d+)?$
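You were close. The unescaped . matches any character, including /, so escape the dot (\.), make each fractional part optional with ? instead of *, and anchor the pattern with ^ and $. See the demo: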
https://regex101.com/r/iJ7bT6/1
For Java it would be:
^\\d+(?:\\.\\d+)?\\/\\d+(?:\\.\\d+)?$
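
A short sketch that runs the Java form of the pattern against the pass and fail cases listed in the question. The class name RatioCheck and the method isRatio are hypothetical.

```java
import java.util.regex.Pattern;

final class RatioCheck {
    // number, optional fractional part, a slash, number, optional fractional part
    private static final Pattern RATIO =
        Pattern.compile("^\\d+(?:\\.\\d+)?\\/\\d+(?:\\.\\d+)?$");

    static boolean isRatio(String s) {
        return RATIO.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"12/23", "12.23/23", "12/23.33", "12/13/14", "12.23/2/4"};
        for (String s : samples) {
            System.out.println(s + " -> " + isRatio(s));
        }
    }
}
```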

Regex with double quotes in PIG

I'm writing a pig script to process an access log from a sophos proxy.
Each line is like:
2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="10.20.7.210" dstip="10.24.2.7" user="" ad_domain="" statuscode="302" cached="0" profile="REF_DefaultHTTPProfile (Default Web Filter Profile)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="0" request="0x9ac68d0" url="http://www.google.com" exceptions="av,auth,content,url,ssl,certcheck,certdate,mime,cache,fileextension" error="" authtime="0" dnstime="1" cattime="0" avscantime="0" fullreqtime="239428" device="0" auth="0"
I managed to do it in Java with MapReduce, using the regex \"([^\"]*)\" to get the values between the quotes and then process them. Now I want to do the same with Pig, but I'm not able to apply the regex to each of the lines.
I'm doing:
input = load './http.log' as (line : chararray);
splt = foreach input generate FLATTEN(REGEX_EXTRACT_ALL(line,'(\\"([^\\"]*)\\")'));
dump splt;
And the result of the dump is: ().
Is there something I'm missing in the use of REGEX_EXTRACT_ALL, or do I have to escape some characters of the regex in a different way?
Thanks!
I managed to extract the values with a different approach, because I just wanted some of the fields of the line.
In order to get the values I'm doing:
splt = FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT(line,'.*url="([^"]*)".*',1)) AS url,
FLATTEN(REGEX_EXTRACT(line,'.*fullreqtime="([^"]*)".*',1)) AS duration,
FLATTEN(REGEX_EXTRACT(line,'.*size="([^"]*)".*',1)) AS bytes;
Then I can continue with the rest of the script.
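
For comparison with the Java version mentioned in the question, the same per-field patterns can be sketched in plain Java. This is only an illustration, not Pig code: the class name ProxyLogFields and the method field are hypothetical, and the sample line is an abridged version of the log line above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProxyLogFields {
    // Returns the quoted value of key="..." in the line, or null if the key is absent.
    // The pattern has the same shape as the ones in the Pig script: .*key="([^"]*)".*
    static String field(String line, String key) {
        Pattern p = Pattern.compile(".*" + Pattern.quote(key) + "=\"([^\"]*)\".*");
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Abridged sample line, keeping only a few of the original fields.
        String line = "2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id=\"0001\" "
            + "size=\"0\" url=\"http://www.google.com\" fullreqtime=\"239428\" device=\"0\"";
        System.out.println(field(line, "url"));
        System.out.println(field(line, "fullreqtime"));
        System.out.println(field(line, "size"));
    }
}
```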

Error in Spliting and Concatenating String

I am trying to build a query by performing operations on a String. I have a String named query like this:
2010-10-01' and '2013-10-01' and (extension='5028' or extension='00' or extension='
With the following code I delete the last 16 characters from the query string:
query=query.substring(0,query.length()-16);
The output of this snippet is:
2010-10-01' and '2013-10-01' and (extension='5028' or extension='00
Now I want to append these characters to the string:
query=query.concat("')");
The output of the above snippet is:
2010-10-01' and '2013-10-01' and (extension='5028' or extension='00)'
whereas I need the output to be:
2010-10-01' and '2013-10-01' and (extension='5028' or extension='00')
I hope I don't get downvoted, since I don't have time to check this right now, but it looks to me like you're caught in the single-quote versus double-quote (character versus string) mess. I'm a little surprised it compiled. I'd try:
query=query.concat("\')");
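
A small sketch for checking the two steps from the question in isolation. In Java, \' inside a string literal denotes the same character as ', so "\')" and "')" are the same string. The class name QueryConcat and the method closeQuery are hypothetical.

```java
final class QueryConcat {
    // Drops the trailing 16 characters (' or extension=') and appends the closing ').
    static String closeQuery(String query) {
        String trimmed = query.substring(0, query.length() - 16);
        return trimmed.concat("')");
    }

    public static void main(String[] args) {
        String query = "2010-10-01' and '2013-10-01' and "
            + "(extension='5028' or extension='00' or extension='";
        System.out.println(closeQuery(query));
        // Both literals denote the same two-character string.
        System.out.println("\')".equals("')"));
    }
}
```

If this produces the desired text, the unexpected output in the question may come from code that is not shown there.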

Need a regex expression to get value between two tags

I need a regular expression to extract the values between > and <, as in >xxxxx<. Can anybody help me with this?
<ChangeID type="String">C10286</ChangeID>
<ChangeID type="String">C10296</ChangeID>
Is it possible to get the two values in a comma-separated format like C10286,C10296 with a single regular expression?
Thanks and Regards
Riyas Hussain A
Try this:
(?<=>)[^<]*
Test it with grep -Po:
kent$ echo '<ChangeID type="String">C10286</ChangeID>
<ChangeID type="String">C10296</ChangeID>'|grep -Po '(?<=>)[^<]*'
C10286
C10296
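
A regular expression only finds the values; joining them with commas has to happen in the surrounding code. The following is a minimal Java sketch of that idea using the lookbehind pattern above, with one labeled change: [^<]* is replaced by [^<]+ so that empty matches are skipped, and the input is processed line by line as grep does. The class name TagValues and the method joinValues are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagValues {
    // One or more non-'<' characters that directly follow a '>'.
    private static final Pattern VALUE = Pattern.compile("(?<=>)[^<]+");

    // Collects every tag value, line by line, and joins them with commas.
    static String joinValues(String xml) {
        List<String> values = new ArrayList<>();
        for (String line : xml.split("\\R")) {
            Matcher m = VALUE.matcher(line);
            while (m.find()) {
                values.add(m.group());
            }
        }
        return String.join(",", values);
    }

    public static void main(String[] args) {
        String xml = "<ChangeID type=\"String\">C10286</ChangeID>\n"
            + "<ChangeID type=\"String\">C10296</ChangeID>";
        System.out.println(joinValues(xml));
    }
}
```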
My idea would be to match all words and exclude the ones we don't need (in case you have more than one value inside your tag):
(?!ChangeID\b)(?!type\b)(?!String\b)\b\w+
You can try it out at http://regexpal.com/
