I'm trying to extract the second url from Stings like these
submitted by thecrappycoder <br /> [link] [3 comments]
submitted by durdn <br /> [link] [1 comment]
by using regex. I tried this.
String regex = "\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
while(m.find()) {
String urlStr = m.group();
urlStr = urlStr.substring(1, 3);
links.add(urlStr);
}
I also tried in that way
System.out.println(("http://"+text.split("http://")[1]).split("")[0]);
Unfortunately, I couldn't get it. Any help, thank you.
You can take the same approach with a simplified regex pattern:
String text = "submitted by thecrappycoder <br />" +
" [link] " +
"[3 comments]\n" +
" ";
String regex = "href=.(http.*?)\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(text);
m.find(); // ignore the 1st match
m.find(); // find the 2nd match
String urlStr = m.group(); // read the 2nd match
System.out.println("urlStr = " + urlStr); // prints: urlStr = http://blogs.msdn.com/b/bethmassi/archive/2015/02/25/understanding-net-2015.aspx
Related
I need to get the text between the URL which has a date in Java
Input 1:
/test1/raw/2019-06-11/testcustomer/usr/pqr/DATA/mn/export/
Output: testcustomer
Only /raw/ remains, date will change and testcustomer will change
Input 2:
/test3/raw/2018-09-01/newcustomer/usr/pqr/DATA/mn/export/
Output: newcustomer
String url = "/test3/raw/2018-09-01/newcustomer/usr/pqr/DATA/mn/export/";
String customer = getCustomer(url);
public String getCustomer (String _url){
String source = "default";
String regex = basePath + "/raw/\\d{4}-\\d{2}-\\d{2}/usr*";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(_url);
if (m.find()) {
source = m.group(1);
} else {
logger.error("Cant get customer with regex " + regex);
}
return source;
}
It's returning 'default' :(
Your regex /raw/\\d{4}-\\d{2}-\\d{2}/usr* is missing the part for the value you want, you need a regex that find the date, and keep what's next :
/\w*/raw/[0-9-]+/(\w+)/.* or (?<=raw\/\d{4}-\d{2}-\d{2}\/)(\w+) will be good
Pattern p = Pattern.compile("/\\w*/raw/[0-9-]+/(\\w+)/.*");
Matcher m = p.matcher(str);
if (m.find()) {
String value = m.group(1);
System.out.println(value);
}
Or if it's always the 4th part, use split()
String value = str.split("/")[4];
System.out.println(value);
And here a >> code demo
Here, we can likely use raw followed by the date as a left boundary, then we would collect our desired output in a capturing group, we would add an slash and consume the rest of our string, with an expression similar to:
.+raw\/[0-9]{4}-[0-9]{2}-[0-9]{2}\/(.+?)\/.+
Demo
Test
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = ".+raw\\/[0-9]{4}-[0-9]{2}-[0-9]{2}\\/(.+?)\\/.+";
final String string = "/test1/raw/2019-06-11/testcustomer/usr/pqr/DATA/mn/export/\n"
+ "/test3/raw/2018-09-01/newcustomer/usr/pqr/DATA/mn/export/";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
RegEx
If this expression wasn't desired or you wish to modify it, please visit regex101.com.
RegEx Circuit
jex.im visualizes regular expressions:
I am using below java program to find list of js files as a Substring.
String str = "jsLib//connect.facebook.net/en_US/fbevents.js , jsLib//connect.facebook.net/en_US/fbevents2.js;";
String patternStr = "(\\/.*?\\.js)";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(html);
if (matcher.find()) {
System.out.println("Count:" + matcher.groupCount());
jsLib = matcher.group(1);
jsLib = jsLib.substring(jsLib.lastIndexOf('/') + 1, jsLib.length());
System.out.println("jsLib:" + jsLib);
}
Regex : I used String patternStr="(\\/.*?\\.js)";
Expected Result : both fbevents.js and fbevents2.js should be matched and part of result
Actual Result : only fbevents.js is matched
You may get all your results using while loop and a regex like [^/]*\.js:
String str = "jsLib//connect.facebook.net/en_US/fbevents.js , jsLib//connect.facebook.net/en_US/fbevents2.js;";
String patternStr = "[^/]*\\.js";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(str);
while (matcher.find()) {
System.out.println("jsLib:" + matcher.group());
}
Output:
jsLib:fbevents.js
jsLib:fbevents2.js
See the Java demo and the regex demo.
The [^/]*\.js pattern matches any 0+ chars other than / (with [^/]*) and then a .js substring.
I am trying to extract both XML tags and text within tags using regex. I understand using regex is not the best option. I only have very few tags in my inline text file hence did not opt for XML parsers.
String txt="American Airlines made <TRIPS> 100 </TRIPS> flights in <DATE> December </DATE> over <ROUTE> Altantic </ROUTE> ";
String re1="<([^>]+)>"; // Tag 1
String re2="([^<]*)"; // Variable Name 1
String re3="</([^>]+)>"; // Tag 2
// String re3 = re1;
Pattern p = Pattern.compile(re1+re2+re3,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
String tag1=m.group(1);
String var1=m.group(2);
System.out.println(tag1.toString());
System.out.println(var1.toString());
}
The problem is that, it only identifies the first tag and not the second one or subsequent ones.
Current Output
TRIPS
100
Desired Output
TRIPS
100
DATE
December
ROUTE
Altantic
Please Change if to while :
String txt = "American Airlines made <TRIPS> 100 <TRIPS> flights in <DATE> December </DATE> over <ROUTE> Altantic </ROUTE> ";
String re1 = "<([^>]+)>"; // Tag 1
String re2 = "([^<]*)"; // Variable Name 1
// String re3="</([^>]+)>"; // Tag 2
String re3 = re1;
Pattern p = Pattern.compile(re1 + re2 + re3, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
while (m.find()) {
String tag1 = m.group(1);
String var1 = m.group(2);
System.out.println(tag1.toString());
System.out.println(var1.toString());
}
If you came to this post looking for a way to parse XML, don't read this. Use an XML parser instead.
Solution:
Change if (m.find()) to while (m.find()). You can iterate to find all matches.
This is the general case to find all regex matches:
Pattern p = Pattern.compile(regex,flags);
Matcher m = p.matcher(text);
while (m.find())
{
System.out.println("First group: " + m.group(1) +
"\nSecond group: " + m.group(2) );
}
I need to extract the values after :70: in the following text file using RegEx. Value may contain line breaks as well.
My current solution is to extract the string between :70: and : but this always returns only one match, the whole text between the first :70: and last :.
:32B:xxx,
:59:yyy
something
:70:ACK1
ACK2
:21:something
:71A:something
:23E:something
value
:70:ACK2
ACK3
:71A:something
How can I achive this using Java? Ideally I want to iterate through all values, i.e.
ACK1\nACK2,
ACK2\nACK3
Thanks :)
Edit: What I'm doing right now,
Pattern pattern = Pattern.compile("(?<=:70:)(.*)(?=\n)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
System.out.println(matcher.group())
}
Try this.
String data = ""
+ ":32B:xxx,\n"
+ ":59:yyy\n"
+ "something\n"
+ ":70:ACK1\n"
+ "ACK2\n"
+ ":21:something\n"
+ ":71A:something\n"
+ ":23E:something\n"
+ "value\n"
+ ":70:ACK2\n"
+ "ACK3\n"
+ ":71A:something\n";
Pattern pattern = Pattern.compile(":70:(.*?)\\s*:", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find())
System.out.println("found="+ matcher.group(1));
result:
found=ACK1
ACK2
found=ACK2
ACK3
You need a loop to do this.
Pattern p = Pattern.compile(regexPattern);
List<String> list = new ArrayList<String>();
Matcher m = p.matches(input);
while (m.find()) {
list.add(m.group());
}
As seen here Create array of regex matches
I am trying to match pattern like '#(a-zA-Z0-9)+ " but not like 'abc#test'.
So this is what I tried:
Pattern MY_PATTERN
= Pattern.compile("\\s#(\\w)+\\s?");
String data = "abc#gere.com #gogasig #jytaz #tibuage";
Matcher m = MY_PATTERN.matcher(data);
StringBuffer sb = new StringBuffer();
boolean result = m.find();
while(result) {
System.out.println (" group " + m.group());
result = m.find();
}
But I can only see '#jytaz', but not #tibuage.
How can I fix my problem? Thank you.
This pattern should work: \B(#\w+)
The \B scans for non-word boundary in the front. The \w+ already excludes the trailing space. Further I've also shifted the parentheses so that the # and + comes in the correct group. You should preferably use m.group(1) to get it.
Here's the rewrite:
Pattern pattern = Pattern.compile("\\B(#\\w+)");
String data = "abc#gere.com #gogasig #jytaz #tibuage";
Matcher m = pattern.matcher(data);
while (m.find()) {
System.out.println(" group " + m.group(1));
}