Find all <a href>link</a> in a string with java regex - java

I have a String that contains some URLs. How can I find all the href values with a regular expression?
prodotto di prova
I currently have this regex, which finds all Amazon links; now I need to extend it so that it also matches the href:
String regex="(http|www\\.)(amazon|AMAZON)\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/([^\\/]{10})";

This pattern works for me in Java:
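String input = "<a href=\"http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler\">prodotto di prova</a>";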
String pattern = "href=(?<link>['\\\"](?:https?:\\/\\/)?(?:www\\.)?(?:amazon|AMAZON)\\.(?:com|it|uk|fr|de)\\/(?<product>gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/(?<productID>[^\\/]{10})\\/(?<queryString>.*?)\\\")";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
if (m.find( )) {
System.out.println("Amazon link: " + m.group(0) );
System.out.println("product: " + m.group("product") );
System.out.println("productID: " + m.group("productID"));
System.out.println("querystring: " + m.group("queryString"));
} else {
System.out.println("NO MATCH");
}
Output:
Amazon link: href="http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler"
product: Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp
productID: B003LQSHBO
querystring: ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler
Java's rules for backslashes and escapes in strings are absolutely infuriating to me and I never get it right. You may find it helpful to go to http://www.regexplanet.com/advanced/java/index.html and enter a regex, which it will convert into a java string with the proper escapes. (I couldn't get mine working until I did this!)
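Since the question asks for all links and the snippet above stops at the first match, here is a minimal sketch of looping over every href in a string. The class name HrefFinder, the simplified pattern and the sample URLs are assumptions made for illustration; they are not part of the answer above, and the pattern is not a full HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFinder {
    // Simplified, hypothetical pattern: href= followed by a quoted value.
    // Group 1 is the opening quote, group 2 the URL; \1 demands the same closing quote.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*(['\"])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> findHrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // each find() call advances to the next match
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample input with one double-quoted and one single-quoted href.
        String html = "<a href=\"http://www.amazon.it/dp/B003LQSHBO\">one</a> "
                    + "<a href='http://example.com/page'>two</a>";
        for (String link : findHrefs(html)) {
            System.out.println(link);
        }
    }
}
```

Each extracted URL could then be checked against the Amazon-specific pattern, which keeps the two concerns separate.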

Related

Retrieving concrete data from String

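I'm trying to retrieve the data-product-id value from a String that looks like this:
<a href="/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p" class="img seoImage" title="W pustyni i w puszczy - Sienkiewicz Henryk" rel="nofollow" data-product-id="prod14290034"> <img class="lazy" src="/b/mp/img/svg/no_picture.svg" lazy-img="https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg" alt=""> </a>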
The output should be
prod14290034
I tried to achieve this with a regular expression, but I'm a beginner at it.
Is a regular expression a good fit for this? If so, how do I do it?
EDIT
Following Emma's comment, I wrote something like this:
String z = element.toString();
Pattern pattern = Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");
Matcher matcher = pattern.matcher(z);
System.out.println(matcher.find());
if (matcher.find()) {
System.out.println(matcher.group());
}
It prints true, but doesn't print any value. Why?
You could use an HTML/XHTML/XML library to turn your string into a document, or at least an Element, and then easily read the attribute value from there. But if you want to use regex, you can try this snippet:
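@Test
public void productId() {
    // The attribute lives on the <a> element that wraps the image.
    String src =
        "<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" class=\"img seoImage\" title=\"W pustyni i w puszczy - Sienkiewicz Henryk\" rel=\"nofollow\" data-product-id=\"prod14290034\"> <img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>";
    final Pattern pattern = Pattern.compile("(data-product-id=)\"(p[a-zA-Z]+[0-9]+)\"");
    final Matcher matcher = pattern.matcher(src);
    String prodId = null;
    if (matcher.find()) {
        System.out.println(matcher.groupCount()); // number of capturing groups: 2
        prodId = matcher.group(2);                // the value between the quotes
    }
    System.out.println(prodId);
    Assert.assertNotNull(prodId);
    Assert.assertEquals("prod14290034", prodId);  // expected value first, actual second
}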
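As for why the edited code in the question prints true but no value: every call to matcher.find() searches for the next match. The find() inside println consumes the only match, so the find() in the if statement looks for a second one and returns false. A minimal sketch of the fix, assuming a hypothetical helper class FindOnce and a shortened sample tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindOnce {
    // Same pattern as in the question's edit.
    private static final Pattern ID =
            Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");

    // Returns the attribute value, or null when the attribute is absent.
    static String productId(String html) {
        Matcher matcher = ID.matcher(html);
        // Call find() exactly once; a second call would search for a second match.
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, made-up sample tag.
        String html = "<a data-product-id=\"prod14290034\"></a>";
        System.out.println(productId(html));
    }
}
```

Note that group(1) returns only the captured value, whereas group() with no argument returns the whole match including the attribute name.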
You can use jsoup for Java; it is a library for parsing HTML pages. Similar libraries exist for other languages, such as Beautiful Soup for Python.
EDIT: Here is a jsoup snippet. You can select any element by tag and then read the attribute you need with the attr method.
Document doc = Jsoup.parse(
"<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" " +
"class=\"img seoImage\" " +
"title=\"W pustyni i w puszczy - Sienkiewicz Henryk\" " +
"rel=\"nofollow\" " +
"data-product-id=\"prod14290034\"> " +
"<img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>\n"
);
String dataProductId = doc.select("a").first().attr("data-product-id");

Regex for finding mp4 in string

I want to get all .mp4 URLs of this String using Regex.
Also I want to know how to get only the last .mp4 URL using Regex.
Thanks
contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8},
Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd},
Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4},
Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4},
Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]";
Regex:
https?.*?\.mp4
Literal http
Followed by an optional 's': s?
Remove the question mark if they will all use HTTPS.
Followed by as few characters as possible: .*?
Followed by an mp4 extension (literal dot) \.mp4
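One caveat with the lazy pattern above: if all variants sit in a single string without line breaks, the first match starts at the first http (the .m3u8 URL) and runs on to the first .mp4, swallowing the earlier URLs. The sketch below therefore uses a tightened variant, not the pattern above, that forbids whitespace, commas and closing braces inside a URL. The class name Mp4Urls and the choice of delimiter characters are assumptions based on the sample data in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Mp4Urls {
    // Tightened variant: a URL may not contain whitespace, ',' or '}'.
    private static final Pattern MP4_URL =
            Pattern.compile("https?://[^\\s,}]+\\.mp4");

    // Collects every .mp4 URL in order of appearance.
    static List<String> findAll(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = MP4_URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    // Returns the last .mp4 URL, or null when there is none.
    static String findLast(String text) {
        List<String> urls = findAll(text);
        return urls.isEmpty() ? null : urls.get(urls.size() - 1);
    }

    public static void main(String[] args) {
        String text =
            "Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd}, "
          + "Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4}, "
          + "Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]";
        for (String url : findAll(text)) {
            System.out.println(url);
        }
        System.out.println("Last: " + findLast(text));
    }
}
```

Taking the last element of the collected list is usually simpler than writing a separate regex that matches only the final URL.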
Two approaches:
If you're sure the URLs will always begin with https:// and that no other mp4 appears on the line after the URL ends, you can use
pattern = "https://.*mp4";
String[] arr = {
"contentType=application/x-mpegURL, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.m3u8}",
"Variant{bitrate=0, contentType=application/dash+xml, url=https://video.twimg.com/amplify_video/822938952332144642/pl/BjHU8aBCbOgZNzXQ.mpd}",
"Variant{bitrate=320000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4}",
"Variant{bitrate=832000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4}",
"Variant{bitrate=2176000, contentType=video/mp4, url=https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4}]}]"
};
String pattern = "https://.*mp4";
Pattern r = Pattern.compile(pattern);
for (String line : arr) {
Matcher m = r.matcher(line);
if (m.find()) {
System.out.println(m.group(0));
} else {
System.out.println("NO MATCH");
}
}
If not, to support all types of URLs, change your pattern to the one defined here, with a small modification:
String pattern =
"(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b"+"mp4";
Output:
NO MATCH
NO MATCH
https://video.twimg.com/amplify_video/822938952332144642/vid/320x180/YqZ72rzLj3VWVhy4.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/640x360/A2vMgzo2ElpPP6TE.mp4
https://video.twimg.com/amplify_video/822938952332144642/vid/1280x720/j9xbNzRZqEbYs_2s.mp4

Regex in Java not working while same regex is working in shell

I want to replace all :variable (word starting with :) with ${variable}$.
For example,
:aks_num with ${aks_num}$
:brn_num with ${brn_num}$
Following is my code, which does not work:
public static void main(String[] argv) throws Exception
{
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
// replaceAll also not working
//String s = chSeq.replaceAll(":\\([a-z_]*\\)","\\${ $1 \\}$");
Pattern p = Pattern.compile(":\\([a-z_]*\\)");
Matcher m = p.matcher(chSeq);
if (m.find()) {
System.out.println("Found value: " + m.group(0) );
System.out.println("Found value: " + m.group(1) );
System.out.println("Found value: " + m.group(2) );
} else {
System.out.println("NO MATCH");
}
}
While in shell script the following regex works perfectly:
s/:\([a-z_]*\)/${\1}$/g
:\\([a-z_]*\\) (with escaped parentheses) means that you want to match expressions like :(aks_num). There is no such expression in the input string, which explains why there are no matches.
If instead you want to use parentheses to capture the variable names, you should not escape them.
Example:
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
Pattern p = Pattern.compile(":([a-z_]*)");
Matcher m = p.matcher(chSeq);
while (m.find()) {
System.out.println("Found value: " + m.group(0)+". Captured : "+m.group(1));
}
Output:
Found value: :aks_num. Captured : aks_num
Found value: :aks_num. Captured : aks_num
Found value: :brn_num. Captured : brn_num
Found value: :brn_num. Captured : brn_num
CharSequence chSeq = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND ((:brn_num = -1) OR (brn_num = :brn_num))))";
// replaceAll also not working
//String s = chSeq.replaceAll(":\\([a-z_]*\\)","\\${ $1 \\}$");
Pattern p = Pattern.compile(":(\\w+)");
Matcher m = p.matcher(chSeq);
while (m.find()) {
System.out.println("Found value: " + m.group(1) );
}
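replaceAll works fine too, as long as the colon stays outside the capturing group (otherwise it is copied into the result):
// x is the input string
Pattern p = Pattern.compile(":(\\w+)");
Matcher m = p.matcher(x);
x = m.replaceAll("\\${$1}\\$");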
You don't need to escape the parentheses, so
Pattern.compile(":([a-z_]*)");
should work.
I believe you were confused because Java's regex syntax differs from sed's basic syntax. In Java you do not need to escape parentheses to make them grouping operators. Conversely, when you escape parentheses in Java, they match literal ( and ) symbols.
In the replacement pattern, $ must be escaped for the regex engine to replace with literal $ symbols, but you do not need to escape braces there.
So, just use
.replaceAll(":([a-z_]+)", "\\${$1}\\$")
I suggest the + quantifier because I doubt you need to match a : followed by a space, a digit, or any other non-letter.
BTW, you do not need any /g flag in Java since replaceAll will replace all matches with the provided replacement pattern.
NOTE: you can further adjust the pattern to match all letters/digits/underscores with ":(\\w+)". Or just alphanumerics/underscore: ":([\\p{Alnum}_]+)".
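Putting the advice above together, here is a minimal runnable sketch of the replacement. The class name PlaceholderRewriter is made up for illustration; the pattern and replacement string are the ones suggested in the answers.

```java
class PlaceholderRewriter {
    // Rewrites every :name into ${name}$.
    static String rewrite(String sql) {
        // In the replacement, \\$ yields a literal '$' and $1 is the captured name.
        return sql.replaceAll(":(\\w+)", "\\${$1}\\$");
    }

    public static void main(String[] args) {
        String sql = "AND ((:aks_num = -1) OR (aks_num = :aks_num AND "
                   + "((:brn_num = -1) OR (brn_num = :brn_num))))";
        System.out.println(rewrite(sql));
    }
}
```

Because replaceAll is a String method, the CharSequence from the question would first need a toString() call.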

Java multiple regular expression search

I have a string something like this:
If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda}
My pattern should look for the particular words Password or tmpPwd or TEMP_PASSWORD.
How can I create a pattern for this kind of search?
I think you are looking for the values after these words. You need to set capturing groups to extract those values, e.g.
String content = "If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda} ";
Pattern p = Pattern.compile("\\{Password\\s*:\\s*([^,]+)\\s*,\\s*tmpPwd\\s*:\\s*([^,]+)\\s*,\\s*TEMP_PASSWORD:\\s*([^,]+)\\s*\\}");
Matcher m = p.matcher(content);
while (m.find()) {
System.out.println(m.group(1) + ", " + m.group(2) + ", " + m.group(3));
}
This will output 123456, tesgjadgj, kfnda.
To just find out if there are any of the substrings, use contains method:
System.out.println(content.contains("Password") ||
content.contains("tmpPwd") ||
content.contains("TEMP_PASSWORD"));
And if you want a regex-solution for the keywords, here it is:
String str = "If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda} ";
Pattern ptrn = Pattern.compile("Password|tmpPwd|TEMP_PASSWORD");
Matcher m = ptrn.matcher(str);
while (m.find()) {
System.out.println("Match found: " + m.group(0));
}
Finally, I am using this, as it fits my requirement:
private final static String censoredWords =
"(?i)PASSWORD|pwd";
The (?i) makes it case-insensitive
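A short sketch of how such a case-insensitive pattern might be used to flag a message. The class name SensitiveCheck and the method containsSensitive are hypothetical; only the pattern string comes from the post above.

```java
import java.util.regex.Pattern;

class SensitiveCheck {
    // (?i) at the start applies to both alternatives, PASSWORD and pwd.
    private static final Pattern CENSORED = Pattern.compile("(?i)PASSWORD|pwd");

    // True when the message mentions any of the censored words, in any case.
    static boolean containsSensitive(String message) {
        return CENSORED.matcher(message).find();
    }

    public static void main(String[] args) {
        String content = "If message contains sensitive info like: "
                       + "{Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda}";
        System.out.println(containsSensitive(content));
        System.out.println(containsSensitive("nothing to hide"));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it for every message.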

extract values with java regex

I am new to regex and I want to extract values from a String like this:
String test="[ABC]Name:User:Date: Adresse ";
I want to extract Name, User, Date and Adresse.
I can do the trick with substring and split:
String test = "[ABC]Name:User:Date: Adresse ";
String test2= test.substring(5,test.length());
System.out.println(test2);
String[] chaine = test2.split(":");
for(String s :chaine)
{
System.out.println("Valeur " + s);
}
but I want to try it with regex. I wrote
pattern = Pattern.compile("^[(ABC)|:].");
but it doesn't work.
Can you help me, please?
Thanks a lot.
String#split is really the best way to accomplish what you are trying to do. Having said that, with regex, the following will give you the same output:
Pattern p = Pattern.compile("^(?:\\[ABC\\])([^:]+):([^:]+):([^:]+):([^:]+)$");
Matcher m = p.matcher(test);
while (m.find()) {
System.out.println("Valeur " + m.group(1)); // Name
System.out.println("Valeur " + m.group(2)); // User
System.out.println("Valeur " + m.group(3)); // Date
System.out.println("Valeur " + m.group(4)); // Address
}
You have to escape the [ and ]. Here is a working example:
^\[(.*)\](.*):(.*):(.*):(.*)$
Note that your code is probably more easily maintained than regular expressions in cases where the regular expression becomes complex.
Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. - Jamie Zawinski
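For completeness, the escaped-bracket pattern above can be sketched in Java as follows. Each backslash is doubled inside the string literal. The class name FieldExtractor and the decision to trim the captured values are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // \\[ and \\] match literal brackets; unescaped, [ ... ] would be a character class.
    private static final Pattern LINE =
            Pattern.compile("^\\[(.*)\\](.*):(.*):(.*):(.*)$");

    // Returns the four colon-separated fields after the [tag], trimmed,
    // or an empty array when the input does not have that shape.
    static String[] extract(String input) {
        Matcher m = LINE.matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] fields = new String[4];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 2).trim(); // group 1 is the tag, e.g. ABC
        }
        return fields;
    }

    public static void main(String[] args) {
        for (String field : extract("[ABC]Name:User:Date: Adresse ")) {
            System.out.println("Valeur " + field);
        }
    }
}
```

This also shows why the pattern in the question fails: its unescaped square brackets define a character class that matches a single character rather than the literal text [ABC].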
