My regex search only prints out the last match - java

I wrote a regular expression to search for web URLs in a text (full code below), but when I run the code the console prints only the last URL in the text. I don't know what's wrong, since I used a while loop. See the code below and kindly help me correct it. Thanks
import java.util.*;
import java.util.regex.*;
public class Main
{
static String query = "This is a URL http://facebook.com"
+ " and this is another, http://twitter.com "
+ "this is the last URL http://instagram.com"
+ " all these URLs should be printed after the code execution";
public static void main(String args[])
{
String pattern = "([\\w \\W]*)((http://)([\\w \\W]+)(.com))";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(query);
while(m.find())
{
System.out.println(m.group(2));
}
}
}
On running the above code, only http://instagram.com gets printed to the console output

I found another regex here:
https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)
It also accepts https (the s is optional), and it works in your case.
I'm getting all 3 URLs printed with this code:
import java.util.regex.*;
public class Main {
static String query = "This is a URL http://facebook.com"
+ " and this is another, http://twitter.com "
+ "this is the last URL http://instagram.com"
+ " all these URLs should be printed after the code execution";
public static void main(String[] args) {
String pattern = "https?:\\/\\/(www\\.)?[-a-zA-Z0-9@:%._\\+~#=]{2,256}\\.[a-z]{2,6}\\b([-a-zA-Z0-9@:%_\\+.~#?&//=]*)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(query);
while (m.find()) {
System.out.println(m.group());
}
}
}

I hope this clears it up for you. You are matching too many characters; your pattern should be as restrictive as possible, because regex quantifiers are greedy by default and will try to match as much as possible.
Here is my take on your code:
import java.util.regex.*;
public class Main {
static String query = "This is a URL http://facebook.com"
+ " and this is another, http://twitter.com "
+ "this is the last URL http://instagram.com"
+ " all these URLs should be printed after the code execution";
public static void main(String args[]) {
String pattern = "(http:[/][/][Ww.]*[a-zA-Z]+\\.com)";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(query);
while(m.find())
{
System.out.println(m.group(1));
}
}
}
The above code will match only your examples; if you wish to match more, you will need to tweak it to your needs.
A great way to live-test patterns is http://www.regexpal.com/. You can tweak your pattern there to match exactly what you want; just remember to replace each \ with a double \\ in Java for escaped characters.

I'm not sure how reliable this pattern is, but it prints out all the URLs when I run your example.
(http://[A-Za-z0-9]+\\.[a-zA-Z]{2,3})
You will have to modify it if you encounter a URL that looks like this:
http://www.instagram.com
As written, it only captures URLs without the 'www'.

Perhaps you're looking for this regex:
http://(\w+(?:\.\w+)+)
For example, from this string:
http://ww1.amazon.com and http://npr.org
it extracts
"ww1.amazon.com"
"npr.org"
To break down how it works:
http:// is literal
( ... ) is the main capture group
\w+ find one or more word characters (letters, digits, or underscore)
(?: ... ) ...followed by a non-capturing group
\.\w+ ...that contains a literal period followed by at least one word character
+ repeated one or more times
Hope this helps.
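To try the breakdown above, here is a minimal sketch that runs the same regex in Java. The class name HostExtractor and the method extractHosts are made up for illustration; only the pattern and the sample string come from this answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class HostExtractor {
    // http:// followed by a dotted host name; group 1 holds the host without the scheme.
    private static final Pattern HOST = Pattern.compile("http://(\\w+(?:\\.\\w+)+)");

    // Collects group 1 of every match, in order of appearance.
    static List<String> extractHosts(String text) {
        List<String> hosts = new ArrayList<>();
        Matcher m = HOST.matcher(text);
        while (m.find()) {
            hosts.add(m.group(1));
        }
        return hosts;
    }

    public static void main(String[] args) {
        // Prints [ww1.amazon.com, npr.org]
        System.out.println(extractHosts("http://ww1.amazon.com and http://npr.org"));
    }
}
```

If you want the full URL including the scheme, m.group() (group 0) should give you that instead of m.group(1).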

Your problem is that your regex quantifiers (i.e. the * and + characters) are greedy, meaning that they match as much as possible. You need to use reluctant quantifiers. See the corrected code pattern below - just two extra characters - a ? character after the * and + to match as little as possible.
String pattern = "([\\w \\W]*?)((http://)([\\w \\W]+?)(.com))";
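To see the difference side by side, here is a small sketch that runs both the original greedy pattern and the reluctant one over the question's text. The class name GreedyVsReluctant and the helper collect are hypothetical; the two patterns and the input are taken from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns from this thread.
class GreedyVsReluctant {
    static final String QUERY = "This is a URL http://facebook.com"
            + " and this is another, http://twitter.com "
            + "this is the last URL http://instagram.com"
            + " all these URLs should be printed after the code execution";

    // Runs the regex over the input and collects the requested group of every match.
    static List<String> collect(String regex, String input, int group) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group(group));
        }
        return found;
    }

    public static void main(String[] args) {
        // Greedy: the leading group swallows everything up to the last http://
        String greedy = "([\\w \\W]*)((http://)([\\w \\W]+)(.com))";
        // Reluctant: each quantifier stops as early as it can
        String reluctant = "([\\w \\W]*?)((http://)([\\w \\W]+?)(.com))";
        System.out.println("greedy:    " + collect(greedy, QUERY, 2));
        System.out.println("reluctant: " + collect(reluctant, QUERY, 2));
    }
}
```

The greedy version should report a single match (the instagram URL), because the first group consumes the other two URLs before the rest of the pattern gets a chance to match them.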

Related

Java Regex capture nested matches

I am having trouble with regex here.
Say I have this input:
608094.21.1.2014.TELE.&BIG00Z.1.1.GBP
My regex looks like this
(\d\d\d\d\.\d?\d\.\d?\d)|(\d?\d\.\d?\d\.\d?\d?\d\d)
I want to extract the date 21.1.2014 out of the string, but all I get is
8094.21.1
I think my problem here is that 21.1.2014 starts within the previous (wrong) match. Is there a simple way to make the matcher look for the next match one character after the beginning of the previous match, rather than after its end?
You could use a regex like this:
\d{1,2}\.\d{1,2}\.\d{4}
Or shorten it and use:
(\d{1,2}\.){2}\d{4}
If the date is always surrounded by dots:
\.(\d\d\d\d\.\d?\d\.\d?\d|\d?\d\.\d?\d\.\d?\d?\d\d)\.
I hope this will help you.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {
public static void main(String[] args) {
String x = "608094.21.1.2014.TELE.&BIG00Z.1.1.GBP";
String pattern = "[0-9]{2}\\.[0-9]{1}\\.[0-9]{4}";
// Create a Pattern object
Pattern r = Pattern.compile(pattern);
// Now create matcher object.
Matcher m = r.matcher(x);
if (m.find( )) {
System.out.println("Found value: " + m.group() );
}else {
System.out.println("NO MATCH");
}
}
}
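On the literal question of restarting the search one character after the start of the previous match: Matcher.find(int) resets the matcher and searches from a given index, so a loop like the sketch below should do it. The class name OverlappingFinder and the method findOverlapping are hypothetical; the regex and input are the ones from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that allows matches to overlap.
class OverlappingFinder {
    static List<String> findOverlapping(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int from = 0;
        // find(int) resets the matcher and starts searching at index 'from'.
        while (from <= input.length() && m.find(from)) {
            found.add(m.group());
            // Restart one character after the START of this match, not after its end.
            from = m.start() + 1;
        }
        return found;
    }

    public static void main(String[] args) {
        String regex = "(\\d\\d\\d\\d\\.\\d?\\d\\.\\d?\\d)|(\\d?\\d\\.\\d?\\d\\.\\d?\\d?\\d\\d)";
        String input = "608094.21.1.2014.TELE.&BIG00Z.1.1.GBP";
        System.out.println(findOverlapping(regex, input));
    }
}
```

The first hit is still 8094.21.1, but 21.1.2014 now shows up among the later ones. Shorter overlapping hits can appear too, so you still need to pick the right one, which is why tightening the regex as suggested above is probably the simpler fix.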

Get a substring from string multiple times

I have a String whose length and characters I don't know in advance.
I want to search the string and get every substring found inside "".
I tried to use Pattern.compile, but it always returns an empty string:
Pattern p = Pattern.compile("\".\"");
Matcher m = p.matcher(mystring);
while(m.find()){
System.out.println(m.group().toString());
}
How can I do it?
Use the .+? to get all characters inside "" with grouping
Pattern p = Pattern.compile("\".+?\"");
The .+ specifies that you want one or more characters inside the quotation marks. The ? makes the quantifier reluctant, which means it matches as few characters as possible, so each quoted substring is found as a separate match instead of one match running from the first quote to the last.
Unit test example:
@Test
public void test() {
String test = "speak \"friend\" and \"enter\"";
Pattern p = Pattern.compile("\".+?\"");
Matcher m = p.matcher(test);
while(m.find()){
System.out.println(m.group().toString().replace("\"", ""));
}
}
Output:
friend
enter
That is because your regex actually searches for exactly one character between " and ". If you want to match more characters, you should rewrite your regex as "\".*?\""
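As an alternative to stripping the quotes afterwards with replace, a capturing group can hand back just the inner text. This is only a sketch: the class name QuotedSubstrings and the method extract are hypothetical, and it assumes the quotes are never escaped inside the string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes quotes inside the text are never escaped.
class QuotedSubstrings {
    // [^"]* cannot run past the closing quote, so no reluctant quantifier is needed.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the text between each pair of double quotes, without the quotes.
    static List<String> extract(String text) {
        List<String> parts = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {
            parts.add(m.group(1));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Prints [friend, enter]
        System.out.println(extract("speak \"friend\" and \"enter\""));
    }
}
```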

Need help in Regex to exclude splitting string within "

I need to split a String using a comma as the separator, but if part of the string is enclosed in ", the splitting has to stop for that portion, from the opening " to the closing one, even if it contains commas in between.
Can anyone please help me to solve this using regex with look around.
Resurrecting this question because it had a simple regex solution that wasn't mentioned. This situation sounds very similar to "regex-match a pattern unless..."
\"[^\"]*\"|(,)
The left side of the alternation matches complete double-quoted strings. We will ignore these matches. The right side matches and captures commas to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
Here is working code:
import java.util.regex.*;
import java.util.List;
class Program {
public static void main (String[] args) {
String subject = "\"Messages,Hello\",World,Hobbies,Java\",Programming\"";
Pattern regex = Pattern.compile("\"[^\"]*\"|(,)");
Matcher m = regex.matcher(subject);
StringBuffer b = new StringBuffer();
while (m.find()) {
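// Commas captured in group 1 are outside quotes, so mark them as split points.
if (m.group(1) != null) m.appendReplacement(b, "SplitHere");
// Quoted strings are copied through unchanged; quoteReplacement keeps any $ or \ literal.
else m.appendReplacement(b, Matcher.quoteReplacement(m.group(0)));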
}
m.appendTail(b);
String replaced = b.toString();
String[] splits = replaced.split("SplitHere");
for (String split : splits)
System.out.println(split);
} // end main
} // end Program
Reference
How to match pattern except in situations s1, s2, s3
Please try this:
(?<!\G\s*"[^"]*),
If you put this regex in your program, it should be:
String regex = "(?<!\\G\\s*\"[^\"]*),";
But 2 things are not clear:
Does the " only start next to a , or can it start in the middle of the content, such as AAA, BB"CC,DD" ? The regex above only deals with a " that starts near a , .
If the content contains " itself, how is it escaped? With "" or \"? The regex above does not deal with any escaped " format.

What's up with this regular expression not matching?

public class PatternTest {
public static void main(String[] args) {
System.out.println("117_117_0009v0_172_5738_5740".matches("^([0-9_]+v._.)"));
}
}
This program prints "false". What?!
I am expecting to match the prefix of the string: "117_117_0009v0_1"
I know this stuff, really I do... but for the life of me, I've been staring at this for 20 minutes and have tried every variation I can think of and I'm obviously missing something simple and obvious here.
Hoping the many eyes of SO can pick it out for me before I lose my mind over this.
Thanks!
The final working version ended up as:
String text = "117_117_0009v0_172_5738_5740";
String regex = "[0-9_]+v._.";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
if (m.lookingAt()) {
System.out.println(m.group());
}
One non-obvious discovery/reminder for me was that before accessing matcher groups, one of matches(), lookingAt() or find() must be called. If not, an IllegalStateException is thrown with the unhelpful message "No match found". Despite this, groupCount() will still return non-zero, but it lies. Do not believe it.
I forgot how ugly this API is. Argh...
String.matches() requires the whole string to match, as if the pattern were wrapped in ^ and $, so something like this should work:
public class PatternTest {
public static void main(String[] args) {
System.out.println("117_117_0009v0_172_5738_5740".matches("^([0-9_]+v._.).*$"));
}
}
returns:
true
Match content:
117_117_0009v0_1
This is the code I used to extract the match:
Pattern p = Pattern.compile("^([0-9_]+v._.).*$");
String str = "117_117_0009v0_172_5738_5740";
Matcher m = p.matcher(str);
if (m.matches())
{
System.out.println(m.group(1));
}
If you want to check whether a string starts with a certain pattern, you should use the Matcher.lookingAt() method:
Pattern pattern = Pattern.compile("([0-9_]+v._.)");
Matcher matcher = pattern.matcher("117_117_0009v0_172_5738_5740");
if (matcher.lookingAt()) {
int groupCount = matcher.groupCount();
for (int i = 0; i <= groupCount; i++) {
System.out.println(i + " : " + matcher.group(i));
}
}
Javadoc for boolean java.util.regex.Matcher.lookingAt():
Attempts to match the input sequence, starting at the beginning of the region, against the pattern. Like the matches method, this method always starts at the beginning of the region; unlike that method, it does not require that the entire region be matched. If the match succeeds then more information can be obtained via the start, end, and group methods.
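To make the difference between the three Matcher entry points concrete, here is a small sketch using the regex and input from the question. The class name MatchModes and its three helper methods are hypothetical wrappers around matches(), lookingAt() and find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrappers around the three ways of applying a pattern.
class MatchModes {
    // matches(): the pattern must cover the entire input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // lookingAt(): the pattern must match at the start, but may stop early.
    static String matchPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    // find(): the pattern may match anywhere in the input.
    static String findFirst(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String regex = "[0-9_]+v._.";
        String text = "117_117_0009v0_172_5738_5740";
        System.out.println(matchesWhole(regex, text));      // false
        System.out.println(matchPrefix(regex, text));       // 117_117_0009v0_1
        System.out.println(matchPrefix(regex, "x" + text)); // null
        System.out.println(findFirst(regex, "x" + text));   // 117_117_0009v0_1
    }
}
```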
I don't know the Java flavor of regular expressions, but this PCRE regular expression should work:
^([\d_]+v\d_\d).+
I don't know why you are using ._. instead of \d_\d

how to read string part in java

I have this string :
<meis xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" uri="localhost/naro-nei" onded="flpSW531213" identi="lemenia" id="75" lastStop="bendi" xsi:noNamespaceSchemaLocation="http://localhost/xsd/postat.xsd xsd/postat.xsd">
How can I get the lastStop property value in Java?
This regex worked when tested on http://www.myregexp.com/
But when I try it in java I don't see the matched text, here is how I tried :
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class SimpleRegexTest {
public static void main(String[] args) {
String sampleText = "<meis xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" uri=\"localhost/naro-nei\" onded=\"flpSW531213\" identi=\"lemenia\" id=\"75\" lastStop=\"bendi\" xsi:noNamespaceSchemaLocation=\"http://localhost/xsd/postat.xsd xsd/postat.xsd\">";
String sampleRegex = "(?<=lastStop=[\"']?)[^\"']*";
Pattern p = Pattern.compile(sampleRegex);
Matcher m = p.matcher(sampleText);
if (m.find()) {
String matchedText = m.group();
System.out.println("matched [" + matchedText + "]");
} else {
System.out.println("didn’t match");
}
}
}
Maybe the problem is that I use escape characters in my test, but the real string doesn't have escapes inside?
UPDATE
Does anyone know why this doesn't work when used in Java, or how to make it work?
(?<=lastStop=[\"']?)[^\"]+
The reason it doesn't work as you expect is because of the * in [^\"']*. The lookbehind is matching at the position before the " in lastStop=", which is permitted because the quote is optional: [\"']?. The next part is supposed to match zero or more non-quote characters, but because the next character is a quote, it matches zero characters.
If you change that * to a +, the second part will fail to match at that position, forcing the regex engine to bump ahead one more position. The lookbehind will match the quote, and [^\"']+ will match what follows. However, you really shouldn't be using a lookbehind for this in the first place. It's much easier to just match the whole sequence in the normal way and extract the part you want to keep via a capturing group:
String sampleRegex = "lastStop=[\"']?([^\"']*)";
Pattern p = Pattern.compile(sampleRegex);
Matcher m = p.matcher(sampleText);
if (m.find()) {
String matchedText = m.group(1);
System.out.println("matched [" + matchedText + "]");
} else {
System.out.println("didn’t match");
}
It will also make it easier to deal with the problem @Kobi mentioned. You're trying to allow for values contained in double-quotes, single-quotes or no quotes, but your regex is too simplistic. For one thing, a quoted value can contain whitespace, but an unquoted one can't. To deal with all three possibilities, you'll need two or three capturing groups, not just one.
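One way the three-group idea could look is sketched below: one alternative per quoting style, each with its own capturing group, and the caller takes whichever group matched. The class name AttributeValue, the method lastStop and the exact character classes are my assumptions, not something from the question. For real XML a proper parser is the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a sketch, not a substitute for an XML parser.
class AttributeValue {
    // Group 1: double-quoted value, group 2: single-quoted value, group 3: unquoted value.
    private static final Pattern LAST_STOP = Pattern.compile(
            "\\blastStop=(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns the lastStop value, or null if the attribute is absent.
    static String lastStop(String tag) {
        Matcher m = LAST_STOP.matcher(tag);
        if (!m.find()) {
            return null;
        }
        // Exactly one of the three groups participates in a successful match.
        for (int g = 1; g <= m.groupCount(); g++) {
            if (m.group(g) != null) {
                return m.group(g);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(lastStop("<meis id=\"75\" lastStop=\"bendi\" x=\"y\">")); // bendi
        System.out.println(lastStop("<meis lastStop='ben di'>"));                    // ben di
        System.out.println(lastStop("<meis lastStop=bendi>"));                       // bendi
    }
}
```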
