Python Regex to Java

I am trying to convert a Python regex to Java. It finds a match in Python but fails on the same string in Java.
Python regex : "(CommandLineEventConsumer)(\x00\x00)(.*?)(\x00)(.*?)({})(\x00\x00)?([^\x00]*)?".format(event_consumer_name)
Java regex : "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?"
I also tried this : "(CommandLineEventConsumer)(\\x00\\x00)(.*?)(\\x00)(.*?)(" + event_consumer_name + ")(\\x00\\x00)?([^\\x00]*)?"
What am I missing?
Here is the relevant piece of the code:
String sampleStr = "\u0000\u0000�\u0003\b\u0000\u0000\u0000�\u0005\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000���\u0005\u0000\u0000\u0000\u0003\u0000\u0000\u0000 \u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\u0000SystemVersion\u0000\b\u0000\u0000\u0000\f\u0000.\u0000\u0000\u0000\u0000\u0000\u0000\u0000)\u0000\u0000\u0000 \u0000\u0000�\u0003\b\u0000\u0000\u0000'\u0006\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000��/\u0006\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u000B\u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\f\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�\u0016\u0000\u0000\u0000R\u0000O\u0000O\u0000T\u0000\\\u0000M\u0000i\u0000c\u0000r\u0000o\u0000s\u0000o\u0000f\u0000t\u0000\\\u0000H\u0000o\u0000m\u0000e\u0000N\u0000e\u0000t\u0000\u0019\u0000\u0000\u0000H\u0000N\u0000e\u0000t\u0000_\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000P\u0000r\u0000o\u0000p\u0000e\u0000r\u0000t\u0000i\u0000e\u0000s\u0000 
\u0000\u0000\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000�\u0000\u0000\u0000N\u0000S\u0000_\u00005\u00001\u00001\u00006\u00002\u00006\u0000F\u0000A\u0000E\u00004\u0000F\u00005\u00007\u0000D\u0000B\u0000D\u00002\u00000\u0000D\u0000F\u00005\u0000C\u0000D\u00004\u00004\u0000A\u00004\u00001\u0000D\u0000A\u0000E\u0000C\u0000E\u0000D\u00002\u00008\u0000C\u0000F\u00007\u0000B\u00003\u0000F\u0000D\u00008\u0000B\u00001\u00002\u00000\u00001\u00002\u0000C\u00007\u0000F\u00004\u0000B\u00005\u00008\u0000F\u00004\u00004\u0000E\u00006\u00006\u00005\u0000\\\u0000K\u0000I\u0000_\u0000A\u00000\u00001\u00000\u00008\u0000C\u0000E\u00002\u00006\u00001\u0000D\u00006\u0000C\u0000D\u00007\u00000\u0000D\u00003\u00005\u00000\u0000F\u00005\u0000B\u00007\u00002\u0000F\u00002\u0000E\u00009\u00008\u00007\u00004\u0000A\u0000E\u00006\u0000E\u00000\u00000\u00004\u0000D\u00003\u00000\u00002\u00009\u00000\u00001\u00005\u0000B\u00000\u00009\u00001\u00009\u0000B\u00001\u0000B\u0000D\u00003\u00002\u00006\u0000B\u0000B\u00006\u00004\u00009\u0000\\\u0000I\u0000_\u0000E\u0000D\u0000C\u0000E\u0000A\u00001\u00004\u0000E\u0000C\u00006\u00003\u0000A\u00005\u00007\u00004\u00001\u0000F\u0000A\u0000A\u00006\u00003\u00000\u00001\u0000C\u00007\u00007\u0000C\u0000A\u00002\u00006\u00000\u0000A\u0000B\u0000E\u0000C\u00000\u0000E\u00007\u00007\u00000\u00009\u00005\u00001\u00004\u0000F\u00006\u0000A\u00003\u00002\u0000C\u00000\u00003\u00004\u00007\u0000E\u00000\u00002\u00006\u00008\u00001\u00007\u0000C\u00008\u00008\u0000\u0000\u0000WQL:Re4\u00007\u0000C\u00007\u00009\u0000E\u00006\u00002\u0000C\u00002\u00002\u00002\u00007\u0000E\u0000D\u0000D\u00000\u0000F\u0000F\u00002\u00009\u0000B\u0000F\u00004\u00004\u0000D\u00008\u00007\u0000F\u00002\u0000F\u0000A\u0000F\u00009\u0000F\u0000E\u0000D\u0000F\u00006\u00000\u0000A\u00001\u00008\u0000D\u00009\u0000F\u00008\u00002\u00005\u00009\u00007\u00006\u00000\u00002\u0000B\u0000D\u00009\u00005\u0000E\u00002\u00000\u0000B\u0000D\u00003\u0000�3u�&��\u00
01����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\f;\u0000\u0000\u0000\u000F\u0000\u0000\u0000�\u0000\u0000\u0000F\u0000\u0000\u0000/\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001�\u0000\u0000�\u0000__EventFilter\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000root\\cimv2\u0000\u0000BVTFilter\u0000\u0000SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA \"Win32_Processor\" AND TargetInstance.LoadPercentage > 99\u0000\u0000WQL\u0000B\u0000B\u0000F\u0000C\u0000C\u0000B\u00004\u00004\u00004\u0000C\u0000F\u00006\u00006\u0000A\u0000A\u00000\u00009\u0000A\u0000E\u00006\u0000F\u00001\u00005\u00009\u00006\u00007\u0000A\u00006\u00008\u00006\u00005\u00001\u00007\u00005\u0000B\u0000B\u00000\u0000E\u0000D\u00002\u00001\u00006\u0000D\u00001\u00009\u00009\u00007\u00000\u0000A\u00007\u00009\u00008\u00008\u0000B\u00007\u00002\u0000C\u0000D\u0000F\u00000\u0000A\u00003\u0000A\u00004\u0000�3u�&��\u0001Ԏ��+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u000F�����\"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000/\u0000\u0000\u0000O\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u001A\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\\\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001q\u0000\u0000�\u0000CommandLineEventConsumer\u0000\u0000cscript KernCap.vbs\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000BVTConsumer\u0000\u0000C:\\\\tools\\\\kernrate\u00000\u0000A\u00007\u0000A\u0000B\u0000E\u00006\u00003\u0000F\u00003\u00006\u0000E\u00002\u0000B\u00002\u00009\u00002\u00000\u0000F\u0000E\u0000D\u0000A\u0000F\u0000A\u0000E\u00008\u00004\u00009\u00008\u00002\u00003\u0000A\u0000F\u00009\u00004\u00002\u00009\u0000C\u0000C\u00000\u0000E\u0000A\u00003\u00007\u00003\u0000F\u0000F\u0000E\u0000E\u00001\u00005\u00000\u00007\u0000E\u0000D\u0000B\u00002\u00001\u0000F\u0000D\u00009\u00001\u00007\u00000\u0000�3u�&��\u0001����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�";
String event_consumer_name = "BVTConsumer";
String cPattern = "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?";
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE);
Matcher consumer_match = consumer_mo.matcher(sampleStr);
if(consumer_match.find()){
System.out.println(consumer_match.group(6));
}
UPDATE
In Python the groups return:
[screenshot of the Python result]

From what I posted as comments:
The (CommandLineEventConsumer)(\u0000\u0000)(.*?)(\u0000)(.*?) part matches fine.
group(3) gets cscript KernCap.vbs
group(4) gets a null character
but group(5) gets nothing.
I tried it in Python and got exactly the same failure to match once (BVTConsumer) is included. So the difference is probably in the Python code doing the matching, not in the regex itself.
The reason is that your string contains a \n. By default . does not match line terminators, so (.*?) cannot cross the line break and the match stops there. If you compile the pattern with DOTALL:
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
it does match your example.
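A reduced sketch of the effect, using a short stand-in string instead of the full sample above. The class name, the stand-in contents, and the use of Pattern.quote on the consumer name are assumptions added for illustration, not part of the original code:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Shortened stand-in for the binary sample: a line break sits between
    // the command line and the consumer name, as in the original string.
    static final String SAMPLE =
        "CommandLineEventConsumer\u0000\u0000cscript KernCap.vbs\u0000junk\n"
        + "more junk\u0000BVTConsumer\u0000\u0000C:\\tools\\kernrate";

    // Same regex as in the question; Pattern.quote (an addition) makes the
    // consumer name match literally even if it contains regex metacharacters.
    static Pattern build(String consumerName, int flags) {
        String regex = "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)("
            + Pattern.quote(consumerName)
            + ")(\\u0000\\u0000)?([^\\u0000]*)?";
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // Without DOTALL, '.' cannot cross the '\n', so nothing is found.
        Matcher plain = build("BVTConsumer", Pattern.CASE_INSENSITIVE).matcher(SAMPLE);
        System.out.println(plain.find()); // prints false

        // With DOTALL, '.' also matches line terminators.
        Matcher dotAll = build("BVTConsumer",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL).matcher(SAMPLE);
        if (dotAll.find()) {
            System.out.println(dotAll.group(6)); // prints BVTConsumer
        }
    }
}
```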

Related

Regex expression to get the file name

I want to extract only the file name from a complete file name that includes a timestamp. Below is the input.
String filePath = "fileName1_20150108.csv";
expected output should be: "fileName1"
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
And expected output should be: "fileName1_filedesc1"
I wrote the code below in Java to get the file name. It works for the first input (filePath) but not for filePath2.
Pattern pattern = Pattern.compile(".*.(?=_)");
String filePath = "fileName1_20150108.csv";
String filePath2 = "fileName1_filedesc1_20150108_002_20150109013841.csv";
Matcher matcher = pattern.matcher(filePath);
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
Can somebody please help me correct the regex so I can parse both file paths with the same regex?
Thanks

Anchor the start, and make the .* non-greedy:
^.*?(_\D.*?)?(?=[_.])
Update: I made the second group (for fileDesc) optional and required it to start with a non-digit character. This will work as long as your fileDesc strings never start with a digit.
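A minimal sketch of that pattern applied to the two inputs from the question. The class and method names are made up for the example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileBaseName {
    // Lazy prefix, an optional "_<non-digit>..." part, then a lookahead
    // that stops the match just before the next '_' or '.'.
    private static final Pattern BASE = Pattern.compile("^.*?(_\\D.*?)?(?=[_.])");

    // Returns the matched base name, or null when nothing matches.
    static String baseName(String fileName) {
        Matcher m = BASE.matcher(fileName);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // prints fileName1
        System.out.println(baseName("fileName1_20150108.csv"));
        // prints fileName1_filedesc1
        System.out.println(baseName("fileName1_filedesc1_20150108_002_20150109013841.csv"));
    }
}
```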

You can get the characters before the first underscore, the first underscore, and then the characters until the next underscore:
^[^_]*_[^_]*

This should work: "^(.*?)_([0-9_]*)\\.([^.]*)$"
It returns three groups:
the base name (assuming no part of it consists only of digits)
the timestamp info
the extension
You can test here: http://fiddle.re/v0hne6 (RegexPlanet)
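The same check can be run locally. This is only a sketch; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FileNameParts {
    private static final Pattern PARTS = Pattern.compile("^(.*?)_([0-9_]*)\\.([^.]*)$");

    // Returns {base name, timestamp info, extension}, or null when the
    // name does not have that shape.
    static String[] split(String fileName) {
        Matcher m = PARTS.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    public static void main(String[] args) {
        // prints [fileName1, 20150108, csv]
        System.out.println(Arrays.toString(split("fileName1_20150108.csv")));
        // prints [fileName1_filedesc1, 20150108_002_20150109013841, csv]
        System.out.println(Arrays.toString(
            split("fileName1_filedesc1_20150108_002_20150109013841.csv")));
    }
}
```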

Certain strings that should be found by a working Regex are missed, and I need help identifying why

I have a set of strings that I loop over, checking each against the following regexes to try to separate the first small section from the rest of the string. The regex works in almost all cases, but I have no idea why it occasionally fails. I've been using Pattern and Matcher to print out the string if the pattern is found.
Two example working strings:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials; inflorescence …
Two example failed strings:
100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …
26. POA L. (Parodiochloa C.E. Hubb.) - Meadow-grasses Annuals or perennials with or without stolons or rhizomes; sheaths overlapping or some …
Regexes used so far:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusTwo = Pattern.compile("(?<=(^\\d+" + genusNames[l].toUpperCase() + "))");
Pattern endOfGenusThree = Pattern.compile("(?<=(\\d+\\. " + genusNames[l] + "))");
Pattern endOfGenusFour = Pattern.compile("(?<=(\\d+" + genusNames[l] + "))");
Pattern endOfGenusFive = Pattern.compile("(?<=(\\. " + genusNames[l] + "))");
The first of these is the one that's producing reliable results so far.
Example Code
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
Matcher endOfGenusFinder = endOfGenus.matcher(descriptionPartBits[b]);
if (endOfGenusFinder.find()) {
System.out.print(descriptionPartBits[b] + ":- ");
System.out.print(genusNames[l] + "\n");
String[] genusNameBits = descriptionPartBits[b].split("(?<=(^\\d+\\. " + genusNames[l].toUpperCase() + "))");
}
Desired Output. This is what is produced by strings that work. Strings that don't work simply don't appear in the output:
98. SORGHUM Moench - Millets Annuals or rhizomatous perennials:- Sorghum
99. MISCANTHUS Andersson - Silver-grasses Rhizomatous perennials:- Miscanthus

From a regex tutorial:
Lookahead and lookbehind, collectively called "lookaround", are
zero-length assertions just like the start and end of line, and start
and end of word anchors explained earlier in this tutorial.
Lookahead and lookbehind only return true or false.
So I changed your code example:
Pattern endOfGenus = Pattern.compile("(?<=(^\\d+\\. ZEA L))(.+)$");
// Matcher matcher = endOfGenus.matcher("98. SORGHUM Moench - Millets Annuals or rhizomatous perennials; inflorescence …");
Matcher matcher = endOfGenus.matcher("100. ZEA L. - Maize Annuals; male and female inflorescences separate, the …");
while (matcher.find()) {
String group1 = matcher.group(1);
String group2 = matcher.group(2);
System.out.println("group1=" + group1);
System.out.println("group2=" + group2);
}
Group 1 is matched by (^\\d+\\. ZEA L). Group 2 is matched by (.+).
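To make the "zero-length" point from the tutorial quote concrete, here is a small sketch with a fixed-length lookbehind. The class name and the simplified patterns are assumptions for illustration; the input is taken from the failing example:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindDemo {
    static final String INPUT = "100. ZEA L. - Maize Annuals";

    // Returns {start, end} of the first match of regex in INPUT, or null.
    static int[] span(String regex) {
        Matcher m = Pattern.compile(regex).matcher(INPUT);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        // A lookbehind on its own matches a position, not text:
        // start and end are equal, so group() would be the empty string.
        int[] bare = span("(?<=\\. )");
        System.out.println(bare[0] + ".." + bare[1]); // prints 5..5

        // Text is only consumed by what follows the lookbehind.
        int[] withText = span("(?<=\\. )ZEA");
        System.out.println(withText[0] + ".." + withText[1]); // prints 5..8
    }
}
```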

Regex matching more than it should

I'm doing this:
List<String> listOfLinks = new ArrayList<String>();
String regex = startMatch + "(.*)" + endMatch;
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
listOfLinks.add(matcher.group(1));
}
Where regex has a value of:
class="thumb-link" href="(.*)" titl
I am getting this result :
http://www.sportscraft.com.au/longline-vest--9344961510736.html" title="Longline Vest "> <img class="alpha" src="http://demandware.edgesuite.net/sits_pod19/dw/image/v2/AAJZ_PRD/on/demandware.static/Sites-Sportscraft-Site/Sites-sc-master/default/v1427554286311/images/hi-res/1102031_black_a.jpg?sw=180&sh=215&sm=fit" alt="Longline Vest , BLACK, hi-res" title="Longline Vest , BLACK" height="214" /> <img class="beta" src="http://demandware.edgesuite.net/sits_pod19/dw/image/v2/AAJZ_PRD/on/demandware.static/Sites-Sportscraft-Site/Sites-sc-master/default/v1427554286311/images/hi-res/1102031_black_b.jpg?sw=180&sh=215&sm=fit" alt="Longline Vest , BLACK, hi-res
When all I want is:
http://www.sportscraft.com.au/longline-vest--9344961510736.html
This means the first part of the regex, class="thumb-link", is working fine. But the second part, " titl, is not stopping the first time it matches. It keeps going until a later occurrence.
When I test this on http://myregexp.com/ with the same regex I get the correct result. I guess there is some option I need to set to make this "non-greedy" but not sure which, since I can't reproduce the error in a regex tester.

Try using something like:
String regex = "^(.*?[^ ]) .*?"; // remove ^; I have tried this on your input string.
Output:
[http://www.sportscraft.com.au/longline-vest--9344961510736.html"]
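The question itself suspects greediness. Java quantifiers are greedy by default, and appending ? makes them reluctant. Below is a sketch of the difference on a shortened, made-up HTML fragment; the class name, the sample markup, and the example.com URL are placeholders:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedy {
    // Made-up fragment with two occurrences of '" titl', like the real page.
    static final String HTML =
        "<a class=\"thumb-link\" href=\"http://example.com/item.html\" title=\"Item\">"
        + "<img alt=\"Item\" title=\"Item, BLACK\" /></a>";

    // Returns group 1 of the first match, or null when there is none.
    static String firstLink(String middle) {
        String regex = "class=\"thumb-link\" href=\"" + middle + "\" titl";
        Matcher m = Pattern.compile(regex).matcher(HTML);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy: runs to the LAST '" titl' in the input.
        System.out.println(firstLink("(.*)"));
        // Reluctant: stops at the FIRST '" titl'.
        System.out.println(firstLink("(.*?)")); // prints http://example.com/item.html
        // A negated class cannot run past the closing quote at all.
        System.out.println(firstLink("([^\"]*)")); // prints http://example.com/item.html
    }
}
```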

Extracting part of URL using java regular expression

I'm trying to extract part of the URL in the text files.
for example:
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed" class="search_bin"><span>Closed Tickets</span></a>
I would like to extract only
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed
How can I do that using a regular expression? I tried the regex
"/p/*./bugs/*."
but it didn't work.

Try this:
"\/p.*\/bugs[^"]*"
it means: "/p"
then: all chars,
then: "/bugs",
then: all chars except "
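A sketch of that pattern in Java, run on the line from the question. The class and method names are hypothetical; the forward slashes are left unescaped because they need no escaping in a Java regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BugsUrlExtractor {
    // "/p", anything, "/bugs", then every character up to the next quote.
    private static final Pattern URL = Pattern.compile("/p.*/bugs[^\"]*");

    // Returns the first matching URL fragment, or null when none is found.
    static String extract(String line) {
        Matcher m = URL.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String line = "/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted"
            + "+or+status%3Awont-fix+or+status%3Aclosed\" class=\"search_bin\">"
            + "<span>Closed Tickets</span></a>";
        System.out.println(extract(line));
    }
}
```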

You can use :
(\/p\/.*\/bugs\/.*?(?="))
Java Code :
String REGEX = "(\\/p\\/.*\\/bugs\\/.*?(?=\"))";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(line);
while (m.find()) {
String matched = m.group();
System.out.println("Matched : "+ matched);
}
OUTPUT
Matched : /p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed

Here's another way:
(?i)/p/[a-z/]+bugs/[^ "]+
The (?i) in the beginning makes the regex case insensitive so you don't have to worry about that. Then after bugs/ it will continue until it reaches either a space or a ".

How to remove dot (.) character using a regex for email addresses of type "abcd.efgh#xyz.com" in java?

I was trying to write a regex to detect email addresses of the type 'abc#xyz.com' in java. I came up with a simple pattern.
String line = // my line containing email address
Pattern myPattern = Pattern.compile("()(\\w+)( *)#( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
This will however also detect email addresses of the type 'abcd.efgh#xyz.com'.
I went through http://www.regular-expressions.info/ and links on this site like
How to match only strings that do not contain a dot (using regular expressions)
Java RegEx meta character (.) and ordinary dot?
So I changed my pattern to the following to avoid detecting 'efgh#xyz.com'
Pattern myPattern = Pattern.compile("([^\\.])(\\w+)( *)#( *)(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
String mailid = myMatcher.group(2) + "#" + myMatcher.group(5) + ".com";
If the string line contains the address 'abcd.efgh#xyz.com', mailid comes back as 'fgh#xyz.com'. Why does this happen? How do I write the regex to detect only 'abc#xyz.com' and not 'abcd.efgh#xyz.com'?
Also how do I write a single regex to detect email addresses like 'abc#xyz.com' and 'efg at xyz.com' and 'abc (at) xyz (dot) com' from strings. Basically how would I implement OR logic in regex for doing something like check for # OR at OR (at)?
After some comments below I tried the following expression to get the part before the # squared away.
Pattern myPattern = Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))#(\\w+)\\.com");
Matcher myMatcher = myPattern.matcher(line);
What will the groups of myMatcher be? How are the groups numbered when there are nested brackets?
System.out.println(myMatcher.group(1));
System.out.println(myMatcher.group(2));
System.out.println(myMatcher.group(3));
System.out.println(myMatcher.group(4));
System.out.println(myMatcher.group(5));
the output was like
abcd.efgh
abcd.efgh
abcd.
null
xyz
for abcd.efgh#xyz.com
abc
null
null
abc
xyz
for abc#xyz.com
Thanks.
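For reference, capturing groups are numbered by the position of their opening parenthesis, counting from left to right, which accounts for the output above. A sketch that reproduces it (the class and method names are made up, and it uses find() before reading the groups):

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Opening parentheses, left to right:
    //   1: ((...)|(...))        whole part before '#'
    //   2: (([\w]+\.)+[\w]+)    dotted alternative
    //   3: ([\w]+\.)            last repetition of "word."
    //   4: ([\w]+)              plain alternative
    //   5: (\w+)                domain name
    private static final Pattern P =
        Pattern.compile("((([\\w]+\\.)+[\\w]+)|([\\w]+))#(\\w+)\\.com");

    // Returns groups 1..5 of the first match, or null when nothing matches.
    static String[] groups(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] out = new String[5];
        for (int i = 1; i <= 5; i++) {
            out[i - 1] = m.group(i); // null when the group did not take part
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [abcd.efgh, abcd.efgh, abcd., null, xyz]
        System.out.println(Arrays.toString(groups("abcd.efgh#xyz.com")));
        // prints [abc, null, null, abc, xyz]
        System.out.println(Arrays.toString(groups("abc#xyz.com")));
    }
}
```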

You can use the | operator in your regex to match # OR at OR (at): #|at|\(at\) (the parentheses must be escaped to match them literally).
You can keep dotted addresses from matching by anchoring the pattern with ^ and requiring the whole input to match.
Try this:
Pattern myPattern = Pattern.compile("^(\\w+)\\s*(#|at|\\(at\\))\\s*(\\w+)\\.(\\w+)");
Matcher myMatcher = myPattern.matcher(line);
if (myMatcher.matches())
{
String mail = myMatcher.group(1) + "#" + myMatcher.group(3) + "." +myMatcher.group(4);
System.out.println(mail);
}
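A small sketch exercising that pattern on a few inputs. The class and method names are hypothetical, and each input is assumed to be a single address rather than a longer line of text, since matches() requires the whole input to match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ObfuscatedEmail {
    private static final Pattern P =
        Pattern.compile("^(\\w+)\\s*(#|at|\\(at\\))\\s*(\\w+)\\.(\\w+)");

    // Returns the address rewritten with '#', or null when the whole
    // input does not match the pattern.
    static String normalize(String input) {
        Matcher m = P.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "#" + m.group(3) + "." + m.group(4);
    }

    public static void main(String[] args) {
        System.out.println(normalize("abc#xyz.com"));       // prints abc#xyz.com
        System.out.println(normalize("efg at xyz.com"));    // prints efg#xyz.com
        System.out.println(normalize("abc(at)xyz.com"));    // prints abc#xyz.com
        System.out.println(normalize("abcd.efgh#xyz.com")); // prints null
    }
}
```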

Your first pattern needs to combine the two requirements, word characters and no dots, in a single character class; you currently have them as separate groups. It should be:
[^\\.\\W]+
This reads as 'not a dot' and 'not a non-word character'.
So you have:
Pattern myPattern = Pattern.compile("([^\\.\\W]+)( *)#( *)(\\w+)\\.com");
To answer your second question, you can use OR in REGEX with the | character
(#|at)
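One caveat when the address sits inside a longer line and find() is used: a dot is not a word character, so a class like [^.\W] accepts the same characters as \w, and the search can still start at efgh in abcd.efgh#xyz.com. A different technique, a negative lookbehind, is one way to reject that case. This is only a sketch under that assumption; the class and method names are made up:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainLocalPart {
    // (?<![\w.]) fails when the character just before the local part is a
    // word character or a dot, so a match cannot begin in the middle of,
    // or right after the dot of, a dotted local part.
    private static final Pattern P =
        Pattern.compile("(?<![\\w.])(\\w+) *# *(\\w+)\\.com");

    // Returns the first undotted address found in line, or null.
    static String find(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(1) + "#" + m.group(2) + ".com" : null;
    }

    public static void main(String[] args) {
        System.out.println(find("contact abc#xyz.com today"));       // prints abc#xyz.com
        System.out.println(find("contact abcd.efgh#xyz.com today")); // prints null
    }
}
```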
