I'm trying to write a regex that can pull the flags and their values out of a string. Basically, I need to be able to take a string like this:
command -aparam -b"Another \"quoted\" param" -canother one here
And capture the data:
a param
b Another "quoted" param
c another one here
Here is my Java regex so far:
(?<= -\w)(?:(?=")(?:(?:")([^"\\]*(?:\\.[^"\\]*)*)(?="))|.*?( -\w|$))?
But it doesn't quite work yet. Any suggestions?
The suggestion is to use one of the available CLI parsers, for example Commons CLI from Jakarta or, better, args4j.
Tokenize the string into the command and its parameters using the split method:
String input = "command -aparam -b\"Another \\\"quoted\\\" param\" -canother one here ";
String[] cmds = input.split("\\s*-(?=\\w)");
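A minimal sketch of how the split result could be turned into flag/value pairs. The class and method names (FlagSplitter, parseFlags) are made up for illustration, and it assumes single-letter flags and that no quoted value contains whitespace followed by "-x". Unlike the snippet above, it splits on at least one whitespace character so that hyphens inside a word are left alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FlagSplitter {
    // Splits on whitespace followed by "-<word char>", then treats the first
    // character of each piece as the flag and the rest as its value.
    static Map<String, String> parseFlags(String input) {
        Map<String, String> flags = new LinkedHashMap<>();
        String[] parts = input.trim().split("\\s+-(?=\\w)");
        // parts[0] is the command itself, so start at 1.
        for (int i = 1; i < parts.length; i++) {
            String name = parts[i].substring(0, 1);
            String value = parts[i].substring(1).trim();
            // Strip surrounding double quotes and unescape \" inside them.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1).replace("\\\"", "\"");
            }
            flags.put(name, value);
        }
        return flags;
    }

    public static void main(String[] args) {
        String input = "command -aparam -b\"Another \\\"quoted\\\" param\" -canother one here";
        parseFlags(input).forEach((flag, value) -> System.out.println(flag + " " + value));
    }
}
```

For anything beyond simple input like this, a real CLI parser is still the safer choice.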
I'm writing a Pig script to process an access log from a Sophos proxy.
Each line is like:
2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="10.20.7.210" dstip="10.24.2.7" user="" ad_domain="" statuscode="302" cached="0" profile="REF_DefaultHTTPProfile (Default Web Filter Profile)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="0" request="0x9ac68d0" url="http://www.google.com" exceptions="av,auth,content,url,ssl,certcheck,certdate,mime,cache,fileextension" error="" authtime="0" dnstime="1" cattime="0" avscantime="0" fullreqtime="239428" device="0" auth="0"
So I managed to do it in Java with MapReduce, using the following regex: \"([^\"]*)\" to get the values between the quotes and then process them. Now I want to do the same with Pig, but I'm not able to apply the regex to each of the lines.
I'm doing:
input = load './http.log' as (line : chararray);
splt = foreach input generate FLATTEN(REGEX_EXTRACT_ALL(line,'(\\"([^\\"]*)\\")'));
dump splt;
And the result of the dump is: ().
Is there something I'm missing about the use of REGEX_EXTRACT_ALL, or do I have to escape some characters of the regex in a different way?
Thanks!
I managed to extract the values with a different approach, because I just wanted some of the fields of the line.
In order to get the values I'm doing:
splt = FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT(line,'.*url="([^"]*)".*',1)) AS url,
FLATTEN(REGEX_EXTRACT(line,'.*fullreqtime="([^"]*)".*',1)) AS duration,
FLATTEN(REGEX_EXTRACT(line,'.*size="([^"]*)".*',1)) AS bytes;
And then I can continue with the rest of the script.
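One likely reason the REGEX_EXTRACT_ALL attempt returned () is that its pattern has to match the whole line, much like Java's Matcher.matches(), whereas the Java version presumably called find() repeatedly. The sketch below shows that difference in plain Java on a shortened copy of the log line; the class name LogFields and the key-capturing pattern are illustrative assumptions, not part of the original script.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogFields {
    // key="value" pairs; group 1 is the key, group 2 the value.
    private static final Pattern FIELD = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    // Collects every key="value" pair by calling find() repeatedly.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened version of the log line from the question.
        String line = "2015:01:13-00:00:01 AR-BADC-FAST-01 httpproxy[27983]: id=\"0001\" "
                + "size=\"0\" url=\"http://www.google.com\" fullreqtime=\"239428\"";
        // matches() demands that the pattern cover the entire line, so it fails here.
        System.out.println("whole-line match: " + FIELD.matcher(line).matches());
        System.out.println(parse(line));
    }
}
```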
I am trying to extract the pass number from strings of any of the following formats:
PassID_132
PassID_64
Pass_298
Pass_16
For this, I constructed the following regex:
Pass[I]?[D]?_([\d]{2,3})
and tested it in Eclipse's search dialog, where it worked fine.
However, when I use it in code, it doesn't match anything. Here's my code snippet:
String idString = filename.replaceAll("Pass[I]?[D]?_([\\d]{2,3})", "$1");
int result = Integer.parseInt(idString);
I also tried
java.util.regex.Pattern.compile("Pass[I]?[D]?_([\\d]{2,3})")
in the Expressions window while debugging, but that says "", whereas
java.util.regex.Pattern.compile("Pass[I]?[D]?_([0-9]{2,3})")
compiled, but didn't match anything. What could be the problem?
Instead of Pass[I]?[D]?_([\d]{2,3}) try this:
Pass(?:I)?(?:D)?_([\d]{2,3})
There's nothing invalid with your regex, but it sucks. You don't need character classes around single-character terms. Try this:
"Pass(?:ID)?_(\\d{2,3})"
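A possible explanation for the failure, assuming filename contains more than just the pass ID (a prefix or an extension, say): replaceAll only substitutes the matched part, so the rest of the name survives and Integer.parseInt throws. A sketch that pulls the group out with Matcher.find() instead; the class name PassNumber and the sample filename are made up for illustration.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PassNumber {
    private static final Pattern PASS = Pattern.compile("Pass(?:ID)?_(\\d{2,3})");

    // Returns the pass number found anywhere in the name, or empty if none.
    static OptionalInt extract(String filename) {
        Matcher m = PASS.matcher(filename);
        return m.find()
                ? OptionalInt.of(Integer.parseInt(m.group(1)))
                : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Hypothetical filename with text around the pass ID.
        System.out.println(extract("scan_PassID_132.dat"));
        System.out.println(extract("Pass_16"));
    }
}
```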
I have a string formatted as below:
source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..
It's a path; the number in () is the weight of the edge. I tried to split it using Java Pattern with the following:
[a-zA-Z.0-9]+-{1}({1}\\d+){1}
[a-zA-Z_]+.[a-zA-Z_]+.(\\d)+-(\\d+)
[a-zA-Z.0-9]+-{1}({1}\\d+){1}-{1}>{1}
hoping it would split the string into fields like
source1.type1.8371-(12345)
source2.type3.3281-(38270)
..
but none of them work; they always return the whole string as a single field.
It looks like you just want String.split("->") (javadoc). This splits on the symbol -> and returns an array containing the parts between ->.
String str = "source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..";
for(String s : str.split("->")){
System.out.println(s);
}
Output
source1.type1.8371-(12345)
source2.type3.3281-(38270)
source4.type2.903..
It seems to me like you want to split at the ->'s, so you could use something like str.split("->"). If you were more specific about why you need this, maybe we could understand why you were trying to use those complicated regexes.
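If the node and the weight are needed separately, each piece can be matched after the split. Note that literal parentheses have to be escaped as \\( and \\) in a Java pattern; unescaped, they form a group, which is one reason the patterns in the question did not behave as intended. This is only a sketch, and the names PathEdges and describe are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PathEdges {
    // Node name, then "-(", digits, ")". The parentheses are escaped literals.
    private static final Pattern EDGE = Pattern.compile("(.+)-\\((\\d+)\\)");

    // Describes one piece of the path; pieces without a weight are passed through.
    static String describe(String part) {
        Matcher m = EDGE.matcher(part);
        return m.matches()
                ? m.group(1) + " weight=" + m.group(2)
                : part + " (no weight)";
    }

    public static void main(String[] args) {
        String str = "source1.type1.8371-(12345)->source2.type3.3281-(38270)->source4.type2.903..";
        for (String part : str.split("->")) {
            System.out.println(describe(part));
        }
    }
}
```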
I have this Java string with XML info and I am trying to use a Java regex to filter out all the junk between the words, so that they form a word enclosed in brackets, e.g. [DEFENDANT].
I want to go from this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r>
</st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r>
</st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r>
<w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r>
<w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
to this:
<w:p><w:r><w:t>[DEFENDANT CITY], [DEFENDANT STATE] [DEFENDANT ZIP]</w:r><w:r>
I have been testing with regex expressions like (\[)<.+>+([A-Z ]+\]) on regexPlanet extensively, to no avail.
Do not use Regex to parse XML. Just use the built in Java XML library.
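As a rough illustration of the parser route, the sketch below reads a fragment with the JDK's DOM parser and concatenates the text of every w:t element. It assumes the input is well-formed, which the excerpt in the question is not as posted, so the sample is a closed-off first part of it. The namespace URIs are placeholders and the class name RunText is invented.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class RunText {
    // Concatenates the text content of all <w:t> elements in document order.
    static String joinText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList texts = doc.getElementsByTagName("w:t");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < texts.getLength(); i++) {
                sb.append(texts.item(i).getTextContent());
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Placeholder namespace URIs; a real document declares its own.
        String xml = "<w:p xmlns:w=\"urn:example:w\" xmlns:st1=\"urn:example:st1\">"
                + "<w:r><w:t>[</w:t></w:r>"
                + "<st1:PlaceName w:st=\"on\"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName>"
                + "<w:r><w:t> </w:t></w:r>"
                + "<st1:PlaceType w:st=\"on\"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType>"
                + "<w:r><w:t>]</w:t></w:r></w:p>";
        System.out.println(joinText(xml));
    }
}
```

The joined text can then be written back into a single run, which avoids matching tags with a regex at all.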
If it's all on a single line, like this:
<w:p><w:r><w:t>[</w:t></w:r><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>CITY</w:t></w:r></st1:PlaceType><w:r><w:t>], [</w:t></w:r><st1:place w:st="on"><st1:PlaceName w:st="on"><w:r><w:t>DEFENDANT</w:t></w:r></st1:PlaceName><w:r><w:t> </w:t></w:r><st1:PlaceType w:st="on"><w:r><w:t>STATE</w:t></w:r></st1:PlaceType></st1:place><w:r><w:t>] [DEFENDANT ZIP]</w:r><w:r>
Then this regex should work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
I have a working example here: RegExr
I could have grouped things a little better, but overall, it gets the job done, so you should be able to see it working.
Also, if it's not on a single line (if it's like it is in your example), then this would work:
([<\w:\w>]+)(\[[</\w:\w>]+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\s</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w><\w:\w>)(\w+)(</\w:\w></\w:\w>\s+</\w+:\w+><\w:\w><\w:\w>\],\s\[</\w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+><\w:\w><\w:\w>\s</w:\w></\w:\w><\w+:\w+\s\w:\w+="\w+"><\w:\w>\s+<\w:\w>)(\w+)(</\w:\w></\w:\w></\w+:\w+></\w+:\w+><\w:\w><\w:\w>\]\s\[)(\w+\s\w+)(\])(</\w:\w><\w:\w>)
You can see that on RegExr here.
I have this string: "\"Blah \'Blah\' Blah\"". There is another string inside it. How do I convert that into: Blah 'Blah' Blah? (you see, unescaping the string.) This is because I get a SQL Where query:
WHERE blah="Blah \'Blah\' Blah"
When I parse this, I get the string above (still inside quotes and escaped). How would I extract that, un-escaping the string? Or is there some much easier way to do this? Thanks,
Isaac
DO NOT DO THIS.
Follow the proper steps for parametrization of a query on your Database/Platform, and you won't have to escape anything. You also will protect yourself from injection vulnerabilities.
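For reference, parametrization with plain JDBC looks roughly like the sketch below. The table name items and the class ParameterizedQuery are hypothetical; only the blah column comes from the question. The value is bound with setString, so it never needs quoting or escaping.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class ParameterizedQuery {
    // Hypothetical table name; the ? is a placeholder filled in by the driver.
    static final String SQL = "SELECT COUNT(*) FROM items WHERE blah = ?";

    // Counts rows whose blah column equals the given value.
    static int countMatches(Connection conn, String value) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, value); // quotes inside value are handled by the driver
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
}
```

A call such as countMatches(conn, "Blah 'Blah' Blah") would pass the raw value through unchanged, given an open Connection.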
Put the string in a property file, Java supports XML property files and the quote character does not need to be escaped in XML.
Use loadFromXML(InputStream in) method of the Properties class.
You can then use the MessageFormat class to interpolate values into the String if needed.
This should be about right. This assumes that if it starts with a quote, it ends with a quote.
if (val.startsWith("\"") || val.startsWith("\'"))
val = val.substring(1, val.length() - 1);
You may wish to add val = val.trim(); as well.
"\"Blah \'Blah\' Blah\"".replaceAll("\"", "")
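If the value really does have to be unescaped by hand, the two ideas above (strip the outer quotes, then drop the escapes) can be combined. This is a sketch that assumes a backslash simply escapes the next character, so it does not interpret sequences like \n; the class name Unquote is invented.

```java
class Unquote {
    // Removes one pair of matching outer quotes, then turns \x into x.
    static String unquote(String val) {
        String s = val.trim();
        if (s.length() >= 2) {
            char first = s.charAt(0);
            char last = s.charAt(s.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                s = s.substring(1, s.length() - 1);
            }
        }
        // Regex \\(.) : a backslash followed by any character; keep the character.
        return s.replaceAll("\\\\(.)", "$1");
    }

    public static void main(String[] args) {
        // The text as it appears in the WHERE clause, backslashes included.
        String raw = "\"Blah \\'Blah\\' Blah\"";
        System.out.println(unquote(raw));
    }
}
```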