Regex matching more than it should - java

I'm doing this:
List<String> listOfLinks = new ArrayList<String>();
String regex = startMatch + "(.*)" + endMatch;
Pattern pattern = Pattern.compile(regex);
Matcher matcher = pattern.matcher(html);
while (matcher.find()) {
listOfLinks.add(matcher.group(1));
}
Where regex has a value of:
class="thumb-link" href="(.*)" titl
I am getting this result :
http://www.sportscraft.com.au/longline-vest--9344961510736.html" title="Longline Vest "> <img class="alpha" src="http://demandware.edgesuite.net/sits_pod19/dw/image/v2/AAJZ_PRD/on/demandware.static/Sites-Sportscraft-Site/Sites-sc-master/default/v1427554286311/images/hi-res/1102031_black_a.jpg?sw=180&sh=215&sm=fit" alt="Longline Vest , BLACK, hi-res" title="Longline Vest , BLACK" height="214" /> <img class="beta" src="http://demandware.edgesuite.net/sits_pod19/dw/image/v2/AAJZ_PRD/on/demandware.static/Sites-Sportscraft-Site/Sites-sc-master/default/v1427554286311/images/hi-res/1102031_black_b.jpg?sw=180&sh=215&sm=fit" alt="Longline Vest , BLACK, hi-res
When all I want is:
http://www.sportscraft.com.au/longline-vest--9344961510736.html
What this means is that the first part of the regex, class="thumb-link", is working fine, but the second part, " titl, is not stopping the first time it matches. It keeps going until it finds another occurrence.
When I test this on http://myregexp.com/ with the same regex I get the correct result. I guess there is some option I need to set to make this "non-greedy" but not sure which, since I can't reproduce the error in a regex tester.

Try using something like:
String regex = "^(.*?[^ ]) .*?"; // remove the ^; I have tried this on your input string.
Output:
[http://www.sportscraft.com.au/longline-vest--9344961510736.html"]
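The over-long match in the question is what a greedy (.*) typically produces: it runs to the last occurrence of the end delimiter rather than the first. A minimal sketch of the usual fix, a reluctant (.*?), follows. The class name LinkExtractor, the method extractLinks, and the shortened sample HTML are made up for illustration; only the two delimiters come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Hypothetical helper: collects whatever sits between the two literal delimiters.
    static List<String> extractLinks(String html, String startMatch, String endMatch) {
        // Pattern.quote treats the delimiters as plain text; (.*?) is reluctant,
        // so each match ends at the first endMatch instead of the last one.
        Pattern pattern = Pattern.compile(
                Pattern.quote(startMatch) + "(.*?)" + Pattern.quote(endMatch));
        Matcher matcher = pattern.matcher(html);
        List<String> links = new ArrayList<>();
        while (matcher.find()) {
            links.add(matcher.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample in the shape of the HTML from the question.
        String html = "<a class=\"thumb-link\" href=\"http://example.com/vest.html\" title=\"Vest\">"
                + "<img src=\"a.jpg\" alt=\"Vest\" title=\"Vest, BLACK\" /></a>";
        System.out.println(extractLinks(html, "class=\"thumb-link\" href=\"", "\" titl"));
    }
}
```

Since an href value cannot contain a double quote, a negated class such as ([^"]*) in place of (.*?) should work equally well here and does not depend on backtracking.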

Related

Python Regex to Java

I am trying to convert a python regex to java. It finds a match in python but fails on the same string in java.
Python regex : "(CommandLineEventConsumer)(\x00\x00)(.*?)(\x00)(.*?)({})(\x00\x00)?([^\x00]*)?".format(event_consumer_name)
Java regex : "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?"
I also tried this : "(CommandLineEventConsumer)(\\x00\\x00)(.*?)(\\x00)(.*?)(" + event_consumer_name + ")(\\x00\\x00)?([^\\x00]*)?"
What am I missing, please?
I have attached a piece of the code
String sampleStr = "\u0000\u0000�\u0003\b\u0000\u0000\u0000�\u0005\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000���\u0005\u0000\u0000\u0000\u0003\u0000\u0000\u0000 \u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\u0000SystemVersion\u0000\b\u0000\u0000\u0000\f\u0000.\u0000\u0000\u0000\u0000\u0000\u0000\u0000)\u0000\u0000\u0000 \u0000\u0000�\u0003\b\u0000\u0000\u0000'\u0006\u0000\u0000\u0003\u0000\u0000�\u0000\u000B\u0000\u0000\u0000��/\u0006\u0000\u0000\u0000\u0003\u0000\u0000\u0000\u000B\u0000\u0000\u0000\u0000string\u0000\u0000WMIDataID\u0000\f\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�\u0016\u0000\u0000\u0000R\u0000O\u0000O\u0000T\u0000\\\u0000M\u0000i\u0000c\u0000r\u0000o\u0000s\u0000o\u0000f\u0000t\u0000\\\u0000H\u0000o\u0000m\u0000e\u0000N\u0000e\u0000t\u0000\u0019\u0000\u0000\u0000H\u0000N\u0000e\u0000t\u0000_\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000P\u0000r\u0000o\u0000p\u0000e\u0000r\u0000t\u0000i\u0000e\u0000s\u0000 
\u0000\u0000\u0000C\u0000o\u0000n\u0000n\u0000e\u0000c\u0000t\u0000i\u0000o\u0000n\u0000�\u0000\u0000\u0000N\u0000S\u0000_\u00005\u00001\u00001\u00006\u00002\u00006\u0000F\u0000A\u0000E\u00004\u0000F\u00005\u00007\u0000D\u0000B\u0000D\u00002\u00000\u0000D\u0000F\u00005\u0000C\u0000D\u00004\u00004\u0000A\u00004\u00001\u0000D\u0000A\u0000E\u0000C\u0000E\u0000D\u00002\u00008\u0000C\u0000F\u00007\u0000B\u00003\u0000F\u0000D\u00008\u0000B\u00001\u00002\u00000\u00001\u00002\u0000C\u00007\u0000F\u00004\u0000B\u00005\u00008\u0000F\u00004\u00004\u0000E\u00006\u00006\u00005\u0000\\\u0000K\u0000I\u0000_\u0000A\u00000\u00001\u00000\u00008\u0000C\u0000E\u00002\u00006\u00001\u0000D\u00006\u0000C\u0000D\u00007\u00000\u0000D\u00003\u00005\u00000\u0000F\u00005\u0000B\u00007\u00002\u0000F\u00002\u0000E\u00009\u00008\u00007\u00004\u0000A\u0000E\u00006\u0000E\u00000\u00000\u00004\u0000D\u00003\u00000\u00002\u00009\u00000\u00001\u00005\u0000B\u00000\u00009\u00001\u00009\u0000B\u00001\u0000B\u0000D\u00003\u00002\u00006\u0000B\u0000B\u00006\u00004\u00009\u0000\\\u0000I\u0000_\u0000E\u0000D\u0000C\u0000E\u0000A\u00001\u00004\u0000E\u0000C\u00006\u00003\u0000A\u00005\u00007\u00004\u00001\u0000F\u0000A\u0000A\u00006\u00003\u00000\u00001\u0000C\u00007\u00007\u0000C\u0000A\u00002\u00006\u00000\u0000A\u0000B\u0000E\u0000C\u00000\u0000E\u00007\u00007\u00000\u00009\u00005\u00001\u00004\u0000F\u00006\u0000A\u00003\u00002\u0000C\u00000\u00003\u00004\u00007\u0000E\u00000\u00002\u00006\u00008\u00001\u00007\u0000C\u00008\u00008\u0000\u0000\u0000WQL:Re4\u00007\u0000C\u00007\u00009\u0000E\u00006\u00002\u0000C\u00002\u00002\u00002\u00007\u0000E\u0000D\u0000D\u00000\u0000F\u0000F\u00002\u00009\u0000B\u0000F\u00004\u00004\u0000D\u00008\u00007\u0000F\u00002\u0000F\u0000A\u0000F\u00009\u0000F\u0000E\u0000D\u0000F\u00006\u00000\u0000A\u00001\u00008\u0000D\u00009\u0000F\u00008\u00002\u00005\u00009\u00007\u00006\u00000\u00002\u0000B\u0000D\u00009\u00005\u0000E\u00002\u00000\u0000B\u0000D\u00003\u0000�3u�&��\u00
01����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\f;\u0000\u0000\u0000\u000F\u0000\u0000\u0000�\u0000\u0000\u0000F\u0000\u0000\u0000/\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001�\u0000\u0000�\u0000__EventFilter\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000root\\cimv2\u0000\u0000BVTFilter\u0000\u0000SELECT * FROM __InstanceModificationEvent WITHIN 60 WHERE TargetInstance ISA \"Win32_Processor\" AND TargetInstance.LoadPercentage > 99\u0000\u0000WQL\u0000B\u0000B\u0000F\u0000C\u0000C\u0000B\u00004\u00004\u00004\u0000C\u0000F\u00006\u00006\u0000A\u0000A\u00000\u00009\u0000A\u0000E\u00006\u0000F\u00001\u00005\u00009\u00006\u00007\u0000A\u00006\u00008\u00006\u00005\u00001\u00007\u00005\u0000B\u0000B\u00000\u0000E\u0000D\u00002\u00001\u00006\u0000D\u00001\u00009\u00009\u00007\u00000\u0000A\u00007\u00009\u00008\u00008\u0000B\u00007\u00002\u0000C\u0000D\u0000F\u00000\u0000A\u00003\u0000A\u00004\u0000�3u�&��\u0001Ԏ��+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u000F�����\"\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000/\u0000\u0000\u0000O\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u001A\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\\\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0004\u0000\u0000\u0000\u0001q\u0000\u0000�\u0000CommandLineEventConsumer\u0000\u0000cscript KernCap.vbs\u0000\u001C\u0000\u0000\u0000\u0001\u0005\u0000\u0000\u0000\u0000\u0000\u0005\u0015\u0000\u0000\u0000�tw�}\n" +
"z�p�)��\u0001\u0000\u0000\u0000BVTConsumer\u0000\u0000C:\\\\tools\\\\kernrate\u00000\u0000A\u00007\u0000A\u0000B\u0000E\u00006\u00003\u0000F\u00003\u00006\u0000E\u00002\u0000B\u00002\u00009\u00002\u00000\u0000F\u0000E\u0000D\u0000A\u0000F\u0000A\u0000E\u00008\u00004\u00009\u00008\u00002\u00003\u0000A\u0000F\u00009\u00004\u00002\u00009\u0000C\u0000C\u00000\u0000E\u0000A\u00003\u00007\u00003\u0000F\u0000F\u0000E\u0000E\u00001\u00005\u00000\u00007\u0000E\u0000D\u0000B\u00002\u00001\u0000F\u0000D\u00009\u00001\u00007\u00000\u0000�3u�&��\u0001����+\u0004�\u0001�\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000�";
String event_consumer_name = "BVTConsumer";
String cPattern = "(CommandLineEventConsumer)(\\u0000\\u0000)(.*?)(\\u0000)(.*?)(" + event_consumer_name + ")(\\u0000\\u0000)?([^\\u0000]*)?";
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE);
Matcher consumer_match = consumer_mo.matcher(sampleStr);
if(consumer_match.find()){
System.out.println(consumer_match.group(6));
}
UPDATE
In Python the groups return the following (shown in a screenshot that is not reproduced here).
From what I posted as comments:
The (CommandLineEventConsumer)(\u0000\u0000)(.*?)(\u0000)(.*?) part matches fine.
group(3) gets cscript KernCap.vbs
group(4) gets a null character
but group(5) gets nothing.
I did try in Python and I have the exact same lack of match when I include the (BVTConsumer). So you probably had a difference in the code doing the matching in Python, not the regex itself.
So the reason is that you have a \n in your string. By default, . does not match line terminators, so (.*?) cannot cross the line break and the matching stops there. If you do
Pattern consumer_mo = Pattern.compile(cPattern, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
it does match in your example.
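The effect of Pattern.DOTALL can be seen in isolation with a much smaller input. The following is only a sketch: the class DotAllDemo, the method middleGroup, the shortened pattern, and the sample string are all invented for illustration and are not the data from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    // Returns group 3 of the first match, or null when nothing matches.
    static String middleGroup(String input, int flags) {
        // Simplified pattern in the style of the question's, not the original.
        String regex = "(Consumer)(\\u0000\\u0000)(.*?)(\\u0000)(Name)";
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group(3) : null;
    }

    public static void main(String[] args) {
        // Made-up input: a line feed sits between the NUL-delimited markers.
        String input = "Consumer\u0000\u0000line one\nline two\u0000Name";
        // Without DOTALL the dot cannot cross the line feed, so there is no match.
        System.out.println(middleGroup(input, Pattern.CASE_INSENSITIVE) == null);
        // With DOTALL the dot matches the line feed too, so group 3 is found.
        System.out.println(middleGroup(input, Pattern.CASE_INSENSITIVE | Pattern.DOTALL) != null);
    }
}
```

The same flag can also be switched on inside the pattern itself with the embedded form (?s).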

How to parse a string to get array of #tags out of the string?

I have a string like
"#tag1 #tag2 #tag3 not_tag1 not_tag2 #tag4" (the space between tag2 and tag4 is to indicate there can be many spaces). From this string I want to extract just tag1, tag2 and so on. They are similar to the #tags we see on LinkedIn or any other social media. Is there an easy way to do this using regex or any other function in Java, or should I do it the hard way (i.e. using loops and conditions)?
The tag format is "#" (to indicate the tag is starting) and a space " " (to indicate the end of the tag). In between there can be letters or digits, but the first character must be a letter.
example,
input : "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4"
output : ["tag1", "tag2", "tag3", "tag4"]
split by regex: "#\w+"
EDIT: this is the correct regex, but split is not the right method.
Same regex as javadev suggested, but use a Matcher instead of split:
String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
Matcher matcher = Pattern.compile("#\\w+").matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(0));
}
The output includes the leading # of each tag. Note that #\w+ also matches #12tag, because \w includes digits.
Maybe something like:
public static void main(String[] args ) {
String input = "#tag1 #tag2 #tag3 not_tag1 not_tag2 #12tag #tag4";
Pattern pattern = Pattern.compile("#([A-Za-z][A-Za-z0-9]*) *"); // [A-z] would also match [, \, ], ^, _ and the backtick
Matcher matcher = pattern.matcher(input);
while (matcher.find()) {
System.out.println(matcher.group(1));
}
}
worked for me :)
Output:
tag1
tag2
tag3
tag4
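To get the tags as a list rather than printed, the Matcher loop can be wrapped in a method. The sketch below is one reading of the rules stated in the question. TagExtractor and extractTags are hypothetical names, and the two lookarounds, which require whitespace or the string boundary on both sides of a tag, are an added assumption based on the "a space ends the tag" rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagExtractor {
    // (?<!\S) : '#' must be at the start of the input or preceded by whitespace.
    // group 1 : a letter followed by letters or digits.
    // (?!\S)  : the tag must be followed by whitespace or the end of the input.
    private static final Pattern TAG =
            Pattern.compile("(?<!\\S)#([A-Za-z][A-Za-z0-9]*)(?!\\S)");

    // Hypothetical helper: returns the tag names without the leading '#'.
    static List<String> extractTags(String input) {
        List<String> tags = new ArrayList<>();
        Matcher matcher = TAG.matcher(input);
        while (matcher.find()) {
            tags.add(matcher.group(1));
        }
        return tags;
    }

    public static void main(String[] args) {
        String input = "#tag1 #tag2   #tag3 not_tag1 not_tag2 #12tag #tag4";
        System.out.println(extractTags(input));
    }
}
```

Dropping the lookarounds gives back the simpler pattern from the answer above, which would also accept a tag that is glued to other text.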

How can I get non-matching groups using a Matcher in Java?

I'm trying to write a Java regex to capture some groups of words from a String using a Matcher.
Say I have this string: "Hello, we are #happy# to see you today".
I would like to get two groups of matches, one containing
Hello, we are
to see you today
and the other
happy
So far, I was only able to match the word between the #s using this Pattern:
Pattern p = Pattern.compile("#(.+?)#");
I've read about negative lookahead and lookaround, played a bit with it but without success.
I assume I should do some sort of negation of the regex so far, but I couldn't come up with anything.
Any help would be really appreciated, thank you.
From comment:
I may encounter a string with more than one instance of words wrapped in #, such as "#Hello# kind #stranger#"
From comment:
I need to apply some different style format to both the text inside and outside.
Since you need to apply different styling, the code needs to process each block of text separately, and needs to know whether the text is inside or outside a #..# section.
Note that the following code silently skips the last # if there is an odd number of them.
String input = ...
for (Matcher m = Pattern.compile("([^#]+)|#([^#]+)#").matcher(input); m.find(); ) {
if (m.start(1) != -1) {
String outsideText = m.group(1);
System.out.println("Outside: \"" + outsideText + "\"");
} else {
String insideText = m.group(2);
System.out.println("Inside: \"" + insideText + "\"");
}
}
Output for input = "Hello, we are #happy# to see you today"
Outside: "Hello, we are "
Inside: "happy"
Outside: " to see you today"
Output for input = "#Hello# kind #stranger#"
Inside: "Hello"
Outside: " kind "
Inside: "stranger"
Output for input = "This #text# has unpaired # characters"
Outside: "This "
Inside: "text"
Outside: " has unpaired "
Outside: " characters"
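Building on the loop above, the two kinds of segment can be styled as they are found. This is only a sketch of the idea: the class SegmentStyler, the method style, and the choice of <b> and <i> markers are placeholders invented here, since the question does not say what styling is needed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SegmentStyler {
    // Group 1 is text outside #..#, group 2 is text inside #..#.
    private static final Pattern SEGMENT = Pattern.compile("([^#]+)|#([^#]+)#");

    // Hypothetical styling: inside text is wrapped in <b>, outside text in <i>.
    static String style(String input) {
        Matcher m = SEGMENT.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            if (m.group(1) != null) {
                out.append("<i>").append(m.group(1)).append("</i>");
            } else {
                out.append("<b>").append(m.group(2)).append("</b>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(style("Hello, we are #happy# to see you today"));
    }
}
```

As with the original loop, an unpaired # is dropped from the result without any error.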
The best I could do is to split the string into 3 top-level groups, then merge groups 1 and 4:
(^.*)(\#(.+?)\#)(.*)
EDIT: Taking remarks from the comments :
(^[^\#]*)(?:\#(.+?)\#)([^\#]*)
Thanks to @Lino, we don't capture the useless group containing the # characters anymore, and we capture anything except # (instead of any non-whitespace character) in the 1st and 2nd groups.
Is this solution fine?
Pattern pattern = Pattern.compile("([^#]+)|#([^#]*)#");
Matcher matcher = pattern.matcher("Hello, we are #happy# to see you today");
List<String> notBetween = new ArrayList<>(); // not surrounded by #
List<String> between = new ArrayList<>(); // surrounded by #
while (matcher.find()) {
if (Objects.nonNull(matcher.group(1))) notBetween.add(matcher.group(1));
if (Objects.nonNull(matcher.group(2))) between.add(matcher.group(2));
}
System.out.println("Printing group 1");
for (String string : notBetween) {
System.out.println(string);
}
System.out.println("Printing group 2");
for (String string : between) {
System.out.println(string);
}

Extracting part of URL using java regular expression

I'm trying to extract part of a URL from text files.
For example:
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed" class="search_bin"><span>Closed Tickets</span></a>
I would like to extract only
/p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed
How could I do that using a regular expression? I tried the regex
"/p/*./bugs/*."
but it didn't work.
Try this:
"\/p.*\/bugs[^"]*"
it means: "/p"
then: all chars,
then: "/bugs",
then: all chars except "
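In Java source the forward slashes need no escaping, so the pattern above can be written as "/p.*/bugs[^\"]*". A minimal sketch of applying it to one line follows; the class BugUrlExtractor and the method extract are hypothetical names, and returning an Optional is just one way to signal "no match".

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BugUrlExtractor {
    // "/p", any characters, "/bugs", then everything up to the next double quote.
    private static final Pattern BUG_URL = Pattern.compile("/p.*/bugs[^\"]*");

    // Hypothetical helper: returns the first matching URL part in the line, if any.
    static Optional<String> extract(String line) {
        Matcher m = BUG_URL.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String line = "/p/gnomecatalog/bugs/search/?q=status%3Aclosed\" class=\"search_bin\">"
                + "<span>Closed Tickets</span></a>";
        System.out.println(extract(line).orElse("(no match)"));
    }
}
```

Because .* is greedy, a line containing several /bugs segments would be matched up to the last one; if that matters, a narrower class such as [^\"]* in place of .* may be safer.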
You can use :
(\/p\/.*\/bugs\/.*?(?="))
Java Code :
String REGEX = "(\\/p\\/.*\\/bugs\\/.*?(?=\"))";
Pattern p = Pattern.compile(REGEX);
Matcher m = p.matcher(line);
while (m.find()) {
String matched = m.group();
System.out.println("Matched : "+ matched);
}
OUTPUT
Matched : /p/gnomecatalog/bugs/search/?q=status%3Aclosed-accepted+or+status%3Awont-fix+or+status%3Aclosed
Here's another way:
(?i)/p/[a-z/]+bugs/[^ "]+
The (?i) in the beginning makes the regex case insensitive so you don't have to worry about that. Then after bugs/ it will continue until it reaches either a space or a ".
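A small sketch of the inline (?i) flag at work, using a made-up mixed-case link to show that the letter case of the path no longer matters. InlineFlagDemo and firstMatch are hypothetical names introduced only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagDemo {
    // (?i) makes the whole pattern case insensitive; the match ends at a space or a double quote.
    private static final Pattern BUG_URL = Pattern.compile("(?i)/p/[a-z/]+bugs/[^ \"]+");

    // Hypothetical helper: returns the first match in the line, or null if there is none.
    static String firstMatch(String line) {
        Matcher m = BUG_URL.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Made-up input with upper-case letters in the path.
        String line = "<a href=\"/P/GnomeCatalog/Bugs/search/?q=status%3Aclosed\" class=\"search_bin\">";
        System.out.println(firstMatch(line));
    }
}
```

Note that [a-z/]+ only allows letters and slashes before bugs/, so a project name containing digits or hyphens would not match with this pattern.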

Java Regex - How to replace a pattern or how to

I have a bunch of HTML files. In these files I need to correct the src attribute of the IMG tags.
The IMG tags look typically like this:
<img alt="" src="./Suitbert_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
where the attributes are NOT in any specific order.
I need to remove the dot and the forward slash at the beginning of the src attribute of the IMG tags so they look like this:
<img alt="" src="Suitbert%20%E2%80%93%20Wikipedia_files/233px-Suitbertus.jpg" class="thumbimage" height="243" width="233" />
I have the following class so far:
import java.util.regex.*;
public class Replacer {
// this PATTERN should find all img tags with 0 or more attributes before the src-attribute
private static final String PATTERN = "<img\\.*\\ssrc=\"\\./";
private static final String REPLACEMENT = "<img\\.*\\ssrc=\"";
private static final Pattern COMPILED_PATTERN = Pattern.compile(PATTERN, Pattern.CASE_INSENSITIVE);
public static void findMatches(String html){
Matcher matcher = COMPILED_PATTERN.matcher(html);
// Check all occurrences
System.out.println("------------------------");
System.out.println("Following Matches found:");
while (matcher.find()) {
System.out.print("Start index: " + matcher.start());
System.out.print(" End index: " + matcher.end() + " ");
System.out.println(matcher.group());
}
System.out.println("------------------------");
}
public static String replaceMatches(String html){
//Pattern replace = Pattern.compile("\\s+");
Matcher matcher = COMPILED_PATTERN.matcher(html);
html = matcher.replaceAll(REPLACEMENT);
return html;
}
}
So, my method findMatches(String html) seems to correctly find all IMG tags whose src attribute starts with ./.
However, my method replaceMatches(String html) does not replace the matches correctly.
I am a newbie to regex, but I assume that either the REPLACEMENT regex is incorrect, or the usage of the replaceAll method, or both.
As you can see, the replacement String contains 2 parts which are identical in all IMG tags:
<img and src="./. In between these 2 parts, there should be the 0 or more HTML attributes from the original string.
How do I formulate such a REPLACEMENT string?
Can somebody please enlighten me?
Don't use regex for HTML. Use a parser, obtain the src attribute and replace it.
Try these:
PATTERN = "(<img[^>]*\\ssrc=\")\\./"
REPLACEMENT = "$1"
Basically, you capture everything except the ./ in group #1, then plug it back in using the $1 placeholder, effectively stripping off the ./.
Notice how I changed your .* to [^>]*, too. If there happened to be two IMG tags on the same line, like this:
<img src="good" /><img src="./bad" />
...your regex would match this:
<img src="good" /><img src="./
It would do that even if you used a non-greedy .*?. [^>]* makes sure the match is always contained within the one tag.
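Put together, the suggested pattern and the $1 replacement might be used as in the sketch below. The class SrcPrefixStripper and the method strip are hypothetical names; the pattern and the replacement string are the ones given above.

```java
import java.util.regex.Pattern;

class SrcPrefixStripper {
    // Group 1 captures everything from "<img" up to and including src=" ;
    // the "./" that follows is matched but left outside the group.
    private static final Pattern IMG_SRC =
            Pattern.compile("(<img[^>]*\\ssrc=\")\\./", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: re-emits group 1 and thereby drops the "./" prefix.
    static String strip(String html) {
        return IMG_SRC.matcher(html).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(strip("<img src=\"good\" /><img src=\"./bad\" />"));
    }
}
```

Because [^>]* cannot cross a closing >, the first tag in the sample is left alone and only the second one is rewritten.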
Your replacement is incorrect. replaceAll replaces the matched string with the replacement, which is not interpreted as a regexp. To achieve what you want, you need to use groups. A group is delimited by parentheses in the regexp; each opening parenthesis starts a new group.
You can use $i in the replacement string to reproduce what a group has matched, where i is the number of the group. See the documentation of appendReplacement for the details.
// Here is an example (it looks a bit like your case but not exactly)
String input = "<img name=\"foobar\" src=\"img.png\">";
String regexp = "<img(.+)src=\"[^\"]+\"(.*)>";
Matcher m = Pattern.compile(regexp).matcher(input);
StringBuffer sb = new StringBuffer();
while(m.find()) {
// Found a match!
// Append all chars before the match and then replaces the match by the
// replacement (the replacement refers to group 1 & 2 with $1 & $2
// which match respectively everything between '<img' and 'src' and,
// everything after the src value and the closing >
m.appendReplacement(sb, "<img$1src=\"something else\"$2>");
}
m.appendTail(sb);// No more match, we append the end of input
Hope this helps you
If src attributes only occur in your HTML within img tags, you can just do this:
input.replace("src=\"./", "src=\"")
You could also do this without java by using sed if you're using a *nix OS
