Regex with double quotes in PIG

Regex with double quotes in PIG - java

I'm writing a pig script to process an access log from a sophos proxy.
Each line is like:
2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="10.20.7.210" dstip="10.24.2.7" user="" ad_domain="" statuscode="302" cached="0" profile="REF_DefaultHTTPProfile (Default Web Filter Profile)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="0" request="0x9ac68d0" url="http://www.google.com" exceptions="av,auth,content,url,ssl,certcheck,certdate,mime,cache,fileextension" error="" authtime="0" dnstime="1" cattime="0" avscantime="0" fullreqtime="239428" device="0" auth="0"
So I managed to do it in Java with MapReduce, using the following regex: \"([^\"]*)\" to get the values between the quotes and then process it. Now I want to do the same with pig, but I'm not able to apply the regex to the each of the lines.
I'm doing:
input = load './http.log' as (line : chararray);
splt = foreach input generate FLATTEN(REGEX_EXTRACT_ALL(line,'(\\"([^\\"]*)\\")'));
dump splt;
And the result of the dump is: ().
There is something that I'm missing with the use of REGEX_EXTRACT_ALL or I have to escape some characters of the regex in a different way?
Thanks!

I managed to extract the values with a different approach, because I just wanted some of the fields of the line.
In order to get the values I'm doing:
splt = FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT(line,'.*url="([^"]*)".*',1)) AS url,
FLATTEN(REGEX_EXTRACT(line,'.*fullreqtime="([^"]*)".*',1)) AS duration,
FLATTEN(REGEX_EXTRACT(line,'.*size="([^"]*)".*',1)) AS bytes;
And then I can continue with the rest of the script

Related

Redisearch query with "begin with" instead of "contains"

I am trying to understand on how to perform queries in Redisearch strictly with "begins with" and I keep getting "contains".
For example if I have fields with values like 'football', 'myfootball', 'greenfootball' and would provide a search term like this:
> FT.SEARCH myIdx #myfield:foot*
I want just to get 'football' but I keep getting other fields that contain the word instead of beginning with that word.
Is there a way to avoid this?
I was trying to use VERBATIM and things like #myfield:^foot* but nothing.
I am using JRedisearch as a client but eventually I had to enter the DB and perform these queries manually in order to figure out what's happening. That being said, is this possible to do with this client at the moment?
Thanks
EDIT
A sample of my index setup:
Client client = new Client(INDEX_NAME, url, PORT);
Schema sc = new Schema().addSortableTextField("url", 1.0); // using this field for query
client.dropIndex(true);
client.createIndex(sc, Client.IndexOptions.Default());
return client;
Sample document:
id: // random uuid
urlPath: myfootbal
application: web
market: Europe

After checking the RDB provided I see that when searching foot* you are not getting myfootbal. The replies look like this: /dot-com/plp/football/x/index.html. You are getting those replies because this url is tokenized, and '/' is one of the tokenize chars. If you do not want those urls to be tokenized you need to declare them as TAGS and not as TEXT. This way the entire url will be indexed as is and when search for foot* it will not appear in the results.
For more information about TAGS see the FT.CREATE documentation: https://oss.redislabs.com/redisearch/Commands.html

java regex matcher results != to notepad++ regex find result

I am trying to extract data out of a website access log as part of a java program. Every entry in the log has a url. I have successfully extracted the url out of each record.
Within the url, there is a parameter that I want to capture so that I can use it to query a database. Unfortunately, it doesn't seem that the web developers used any one standard to write the parameter's name.
The parameter is usually called "course_id", but I have also seen "courseId", "course%3DId", "course%253Did", etc. The format for the parameter name and value is usually course_id=_22222_1, where the number I want is between the "_" and "_1". (The value is always the same, even if the parameter name varies.)
So, my idea was to use the regex /^.*course_id[^_]*_(\d*)_1.*$/i to find and extract the number.
In java, my code is
java.util.regex.Pattern courseIDPattern = java.util.regex.Pattern.compile(".*course[^i]*id[^_]*_(\\d*)_1.*", java.util.regex.Pattern.CASE_INSENSITIVE);
java.util.regex.Matcher courseIDMatcher = courseIDPattern.matcher(_url);
_courseID = "";
if(courseIDMatcher.matches())
{
_courseID = retrieveCourseID(courseIDMatcher.group(1));
return;
}
This works for a lot of the records. However, some records do not record the course_id, even though the parameter is in the url. One such example is the record:
/webapps/contentDetail?course_id=_223629_1&content_id=_3641164_1&rich_content_level=RICH&language=en_US&v=1&ver=4.1.2
However, I used notepad++ to do a regex replace on this (in fact, every) url using the regex above, and the url was successfully replaced by the course ID, implying that the regex is not incorrect.
Am I doing something wrong in the java code, or is the java matcher broken?

Getting non-greedy sequence as a string with ANTLR

I have a problem with getting sequence as a string. I have a file with strings like:
{TEXT="<div itemprop=\"content\"><div>some text</div>"}
I want to get and use text that exactly between first and last quotes. First i tried:
parse : line+;
line : '{TEXT="' SEQUENCE '"}' {System.out.println($SEQUENCE.getText())};
SEQUENCE : .+?;
But it failed, SEQUENCE get only one symbol in that way. I tried:
parse : line+;
line : '{TEXT="' (a+=SEQUENCE)*? '"}' {System.out.println($a.getText())};
SEQUENCE : .;
And I got List of Tokens, so i can't use getText.

if you want to do it in this way, you can do it like this:
grammar Sequence;
parse : line+;
line : '{TEXT="' a=sequence '"}' {System.out.println(((LineContext)_localctx).a.getText());};
sequence : .+?;
ANY:.;
But there also other mechanisms in ANTLR4 like listeners and visitors.

How to use java regex to filter xml file

I have this java string with xml info and I am trying to use java regex to filter out all the junk that is between the words to form a word enclosed in brackets, e.g. [DEFENDANT].
I want to go from this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r>
</st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r>
</st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r>
<w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r>
<w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
to this:
<w:p><w:r><w:t>[DEFENDANT CITY], [DEFENDANT STATE] [DEFENDANT ZIP]</w:r><w:r>
I have been testing with regex epression like (\[)<.+>+([A-Z ]+\]) on regexPlanet extensively to no avail.

Do not use Regex to parse XML. Just use the built in Java XML library.

If it's all on a single line, like this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
Then this regex should work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
I have a working example here: RegExr
I could have grouped things a little better, but overall, it gets the job done, so you should be able to see it working.
Also, if it's not on a single line (if it's like it is in your example), then this would work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
You can see that on RegExr here.

REGEX to get flags from console command?

I'm trying to get a regex that can pull out the flags and values in string. Basically, I need to be able to take a string like this:
command -aparam -b"Another \"quoted\" param" -canother one here
And capture the data:
a param
b Another "quoted" param
c another one here
Here is my Java regex so far:
(?<= -\w)(?:(?=")(?:(?:")([^"\\]*(?:\\.[^"\\]*)*)(?="))|.*?( -\w|$))?
But is doesn't quite work yet. Any suggestions?

The suggestion is to use one of available CLI parsers. For example CLI from Jakarta or, better, args4j.

Tokenize the string into command and its parameters using split method,
String input = "command -aparam -b\"Another \"quoted\" param\" -canother one here ";
String[] cmds = input.split("\\s*-(?=\\w)");

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Regex with double quotes in PIG - java

Related

Redisearch query with "begin with" instead of "contains"

java regex matcher results != to notepad++ regex find result

Getting non-greedy sequence as a string with ANTLR

How to use java regex to filter xml file

REGEX to get flags from console command?

Categories

Resources