Retrieving concrete data from String - java

I'm trying to retrieve the data-product-id value from a String that looks like this:
<img class="lazy" src="/b/mp/img/svg/no_picture.svg" lazy-img="https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg" alt="">
The output should be
prod14290034
I tried to achieve this with a regular expression, but I'm a beginner with them.
Is a regular expression a good fit for this? If so, how do I do it?
EDIT
Following Emma's comment, I wrote something like this:
String z = element.toString();
Pattern pattern = Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");
Matcher matcher = pattern.matcher(z);
System.out.println(matcher.find());
if (matcher.find()) {
System.out.println(matcher.group());
}
It returns true, but doesn't print any value. Why?

You could use an HTML/XHTML/XML library to turn the string into a Document, or at least an Element, and then read the attribute value from it. But if you want to use regex, you can try this snippet:
@Test
public void productId() {
    // The <a> element carrying data-product-id wraps the <img> from the question.
    String src =
        "<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" " +
        "data-product-id=\"prod14290034\"> " +
        "<img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>";
    // Group 1 is the attribute name, group 2 is the product id itself.
    final Pattern pattern = Pattern.compile("(data-product-id=)\"(p[a-zA-Z]+[0-9]+)\"");
    final Matcher matcher = pattern.matcher(src);
    String prodId = null;
    if (matcher.find()) {
        System.out.println(matcher.groupCount());
        prodId = matcher.group(2);
    }
    System.out.println(prodId);
    Assert.assertNotNull(prodId);
    Assert.assertEquals(prodId, "prod14290034");
}
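As for why the edited code in the question prints true but no value: every call to matcher.find() moves on to the next match, so the find() inside println consumes the only match and the find() in the if then returns false. A minimal sketch of calling it once, assuming the markup contains the data-product-id attribute (the class and method names here are only illustrative):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductIdFinder {
    // Same pattern as in the question; group 1 is the attribute value.
    private static final Pattern PRODUCT_ID =
            Pattern.compile("data-product-id=\"\\s*([^\\s\"]*?)\\s*\"");

    // Returns the attribute value, or null when the attribute is absent.
    static String findProductId(String html) {
        Matcher matcher = PRODUCT_ID.matcher(html);
        // Call find() exactly once per match; a second call looks for the NEXT match.
        if (matcher.find()) {
            return matcher.group(1);
        }
        return null;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\" data-product-id=\"prod14290034\"><img alt=\"\"></a>";
        System.out.println(findProductId(html));
    }
}
```

Note that group() without an argument returns the whole match including the attribute name, while group(1) returns only the value.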

You can use jsoup for Java; it is a library for parsing HTML pages. Similar libraries exist for other languages, such as Beautiful Soup for Python.
EDIT: Here is a jsoup snippet. You can select any element by its tag and then read the attribute you need with the attr method.
Document doc = Jsoup.parse(
"<a href=\"/w-pustyni-i-w-puszczy-sienkiewicz-henryk,prod14290034,ksiazka-p\" " +
"class=\"img seoImage\" " +
"title=\"W pustyni i w puszczy - Sienkiewicz Henryk\" " +
"rel=\"nofollow\" " +
"data-product-id=\"prod14290034\"> " +
"<img class=\"lazy\" src=\"/b/mp/img/svg/no_picture.svg\" lazy-img=\"https://ecsmedia.pl/c/w-pustyni-i-w-puszczy-p-iext43240721.jpg\" alt=\"\"> </a>\n"
);
String dataProductId = doc.select("a").first().attr("data-product-id");

Related

RegEx to extract text between tags in Java

I need to extract the values after :70: in the following text file using RegEx. A value may contain line breaks as well.
My current solution is to extract the string between :70: and : but this always returns only one match, the whole text between the first :70: and last :.
:32B:xxx,
:59:yyy
something
:70:ACK1
ACK2
:21:something
:71A:something
:23E:something
value
:70:ACK2
ACK3
:71A:something
How can I achieve this using Java? Ideally I want to iterate through all values, i.e.
ACK1\nACK2,
ACK2\nACK3
Thanks :)
Edit: What I'm doing right now:
Pattern pattern = Pattern.compile("(?<=:70:)(.*)(?=\n)", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find()) {
    System.out.println(matcher.group());
}
Try this.
String data = ""
+ ":32B:xxx,\n"
+ ":59:yyy\n"
+ "something\n"
+ ":70:ACK1\n"
+ "ACK2\n"
+ ":21:something\n"
+ ":71A:something\n"
+ ":23E:something\n"
+ "value\n"
+ ":70:ACK2\n"
+ "ACK3\n"
+ ":71A:something\n";
Pattern pattern = Pattern.compile(":70:(.*?)\\s*:", Pattern.DOTALL);
Matcher matcher = pattern.matcher(data);
while (matcher.find())
System.out.println("found="+ matcher.group(1));
result:
found=ACK1
ACK2
found=ACK2
ACK3
You need a loop to do this.
Pattern p = Pattern.compile(regexPattern);
List<String> list = new ArrayList<String>();
Matcher m = p.matcher(input); // matcher(), not matches(): this returns the Matcher to iterate over
while (m.find()) {
    list.add(m.group());
}
As seen here Create array of regex matches
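Putting the two answers together, here is a sketch that collects every :70: value into a list. It assumes each field starts at the beginning of a line with a tag shaped like :NN: or :NNA: (an assumption read off the sample data, not a stated format), and the class and method names are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Field70Extractor {
    // Capture lazily after ":70:" until the next line that starts a new tag,
    // or until the end of the input if :70: is the last field.
    private static final Pattern FIELD_70 = Pattern.compile(
            "^:70:(.*?)(?=\\R:\\d{2}[A-Z]?:|\\s*\\z)",
            Pattern.MULTILINE | Pattern.DOTALL);

    static List<String> extract(String data) {
        List<String> values = new ArrayList<>();
        Matcher matcher = FIELD_70.matcher(data);
        while (matcher.find()) {
            values.add(matcher.group(1)); // the value may span several lines
        }
        return values;
    }

    public static void main(String[] args) {
        String data = ":59:yyy\nsomething\n:70:ACK1\nACK2\n:21:something\n"
                + ":70:ACK2\nACK3\n:71A:something\n";
        for (String value : extract(data)) {
            System.out.println("[" + value + "]");
        }
    }
}
```

Using a lookahead for the next tag, instead of stopping at the first colon, keeps a value intact even when the value itself contains a colon.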

Find all <a href>link</a> in a string with java regex

I have a String which contains some URLs. How can I find all the href values with a regular expression?
prodotto di prova
Right now I have this, which finds all Amazon links; I also need to add the href to this regex:
String regex="(http|www\\.)(amazon|AMAZON)\\.(com|it|uk|fr|de)\\/(?:gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/([^\\/]{10})";
This pattern works for me in Java:
String input = "<a href=\"http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler\">prodotto di prova</a>";
String pattern = "href=(?<link>['\\\"](?:https?:\\/\\/)?(?:www\\.)?(?:amazon|AMAZON)\\.(?:com|it|uk|fr|de)\\/(?<product>gp\\/product|gp\\/product\\/glance|[^\\/]+\\/dp|dp|[^\\/]+\\/product-reviews)\\/(?<productID>[^\\/]{10})\\/(?<queryString>.*?)\\\")";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(input);
if (m.find( )) {
System.out.println("Amazon link: " + m.group(0) );
System.out.println("product: " + m.group("product") );
System.out.println("productID: " + m.group("productID"));
System.out.println("querystring: " + m.group("queryString"));
} else {
System.out.println("NO MATCH");
}
output:
Amazon link:
href="http://www.amazon.it/Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp/B003LQSHBO/ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler"
product: Die-10-Symphonien-Orchesterlieder-Sinfonie-Complete/dp
productID: B003LQSHBO
querystring: ref=sr_1_2?ie=UTF8&qid=1440101590&sr=8-2&keywords=mahler
Java's rules for backslashes and escapes in strings are absolutely infuriating to me and I never get it right. You may find it helpful to go to http://www.regexplanet.com/advanced/java/index.html and enter a regex, which it will convert into a java string with the proper escapes. (I couldn't get mine working until I did this!)
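Since the question asks for all links, the if can become a while loop. Below is a sketch with a deliberately simple pattern that captures any single- or double-quoted href value; it assumes well-formed attributes and is not a general HTML parser. The class and method names are only illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefFinder {
    // Group 1 remembers the opening quote so \1 can require the same closing quote.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> findAll(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {           // one iteration per href in the input
            links.add(m.group(2));   // group 2 is the URL without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.amazon.it/dp/B003LQSHBO\">one</a> "
                + "<a href='http://example.com/'>two</a>";
        System.out.println(findAll(html));
    }
}
```

Each collected URL can then be checked against the Amazon pattern from the question if only those links are wanted.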

Java multiple regular expression search

I have a string something like this:
If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda}
My pattern should look for the particular words Password or tmpPwd or TEMP_PASSWORD.
How can I create a pattern for this kind of search?
I think you are looking for the values after these words. You need to set capturing groups to extract those values, e.g.
String content = "If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda} ";
Pattern p = Pattern.compile("\\{Password\\s*:\\s*([^,]+)\\s*,\\s*tmpPwd\\s*:\\s*([^,]+)\\s*,\\s*TEMP_PASSWORD:\\s*([^,]+)\\s*\\}");
Matcher m = p.matcher(content);
while (m.find()) {
System.out.println(m.group(1) + ", " + m.group(2) + ", " + m.group(3));
}
This will output 123456, tesgjadgj, kfnda.
To just find out whether any of the substrings is present, use the contains method:
System.out.println(content.contains("Password") ||
content.contains("tmpPwd") ||
content.contains("TEMP_PASSWORD"));
And if you want a regex-solution for the keywords, here it is:
String str = "If message contains sensitive info like: {Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda} ";
Pattern ptrn = Pattern.compile("Password|tmpPwd|TEMP_PASSWORD");
Matcher m = ptrn.matcher(str);
while (m.find()) {
System.out.println("Match found: " + m.group(0));
}
In the end I am using this, as my requirement calls for:
private final static String censoredWords =
        "(?i)PASSWORD|pwd";
The (?i) makes it case-insensitive.
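For illustration, here is a sketch of how such a constant might be compiled once and used to test a message. The class name and the containsSensitiveWord method are hypothetical and not part of the original code.

```java
import java.util.regex.Pattern;

class SensitiveWordDetector {
    private final static String censoredWords =
            "(?i)PASSWORD|pwd";

    // Compile once and reuse; the inline (?i) flag covers both alternatives.
    private static final Pattern CENSORED = Pattern.compile(censoredWords);

    // True when the message mentions any of the keywords, in any letter case.
    static boolean containsSensitiveWord(String message) {
        return CENSORED.matcher(message).find();
    }

    public static void main(String[] args) {
        String content = "If message contains sensitive info like: "
                + "{Password:123456, tmpPwd : tesgjadgj, TEMP_PASSWORD: kfnda}";
        System.out.println(containsSensitiveWord(content));
    }
}
```

Because the pattern is not anchored, find() also reports keywords embedded in longer words such as tmpPwd or TEMP_PASSWORD.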

Can't remove all occurrences of an HTML tag in Java using pattern matching

I have a very long HTML string which contains multiple
<dl id="divmap"> .... </dl>.
I want to remove all the content between these tags.
I wrote this code in Java:
String triphtml= htmlString;
System.out.println("triphtml is "+triphtml);
System.out.println("test1 ");
final Pattern pattern = Pattern.compile("(<dl id=\""+selectedArray[i]+"\">)(.+?)(</dl>)",
Pattern.DOTALL);
final Matcher matcher = pattern.matcher(triphtml);
// matcher.find();
System.out.println("pattern of test1 is : "
+ pattern); // Prints
System.out.println("MATCHER of test1 is : "
+ matcher); // Prints
System.out.println("MATCH COUNT of test1 a: "
+ matcher.groupCount()); // Prints
System.out.println("MATCH COUNT of test1 a: "
+ matcher.find()); // Prints
while (matcher.find()) {
// System.out.println("MATCH GP 3: "+matcher.group(3).substring(1,10));
for (int z = 0; z <= matcher.groupCount(); z++) {
String extstr = matcher.group(z);
System.out.println("matcher group of "+z+" test1 is " + extstr);
System.out.println("ext a of test1 is " + extstr);
triphtml = triphtml.replaceAll(extstr, "");
System.out.println("Group found of test1 is :\n" + extstr);
}
}
But this code removes some dl elements while others remain in triphtml.
I don't know why this is happening.
Here triphtml is an HTML string which has multiple dl elements. Please help me remove the content between all
<dl id="divmap">.
Thanks in advance.
I suggest NOT using regex for HTML. Just use a library designed for traversing XML/HTML.
For example, JSoup.
Try using JSoup.
It uses selectors and syntax like jQuery, and it is very easy to use.
You can try this:
String triphtml = htmlString;
Document doc = Jsoup.parse(htmlString);
Elements divmaps = doc.select("#divmap");
Then you can remove (or alter) the elements in the DOM.
divmaps.remove();
triphtml = doc.html();
By using regex you can do as follows:
String orgString = "<dl id=\"divmap\"> .... </dl>";
orgString = orgString.replaceAll("<[^>]*>", "");
//for removing html tag
orgString = orgString.replaceAll(orgString.replaceAll("<[^>]*>", ""),"");
//for removing content inside html tag
But it is better to use an HTML parser.
Edit:
String htmlString = "<dl id=\"divmap\"> Content </dl>";
Pattern p = Pattern.compile("<[^>]*>");
Matcher m = p.matcher(htmlString);
while (m.find()) {
    // replace() treats the tag as literal text; replaceAll() would interpret it as a regex
    htmlString = htmlString.replace(m.group(), "");
}
System.out.println("Ans" + htmlString);
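If regex is still preferred, two things in the question's code probably explain the blocks that survive: matcher.find() is called once for printing before the loop, which skips the first match, and replaceAll treats the matched HTML as a regular expression instead of literal text. Here is a sketch that sidesteps both by letting the matcher do the replacement. It assumes the dl blocks are not nested and that the id attribute is written exactly as shown; the class and method names are only illustrative.

```java
import java.util.regex.Pattern;

class DlRemover {
    // Removes every <dl id="..."> ... </dl> block with the given id, tags included.
    static String removeDlBlocks(String html, String id) {
        Pattern pattern = Pattern.compile(
                "<dl id=\"" + Pattern.quote(id) + "\">.*?</dl>", // quote() keeps the id literal
                Pattern.DOTALL);                                  // let . span line breaks
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "before<dl id=\"divmap\"> first\nblock </dl>middle"
                + "<dl id=\"divmap\"> second </dl>after";
        System.out.println(removeDlBlocks(html, "divmap"));
    }
}
```

The lazy .*? stops at the first closing </dl>, which is why nested dl elements are outside what this sketch handles.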

Get image link and text from string

I have this string
<div><img width="100px" src="http://www.mysite.com/Content/dataImages/news/small/some-pic.png" /><br />This is some text that I need to get.</div>
and I need to get the image link and the text "This is some text that I need to get." from the string above in Java. Can anybody tell me how I can do this?
Use regex to get what you want.
If this is all you have to do, there's no point in bringing in extra packages; just use regex:
The pattern "(?<=src=\")(.*?)(?=\")" can be used to get the link, you can modify that to give you the text.
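As a rough sketch of that idea, the same lookaround style can be pointed at the text as well. The second pattern assumes the text always sits between `<br />` and `</div>` on a single line, as in the sample string; the class and method names are only illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImageAndText {
    // Lookbehind and lookahead keep the delimiters out of the match itself.
    private static final Pattern SRC = Pattern.compile("(?<=src=\")(.*?)(?=\")");
    private static final Pattern TEXT = Pattern.compile("(?<=<br />)(.*?)(?=</div>)");

    // Returns the first match of the pattern, or null when there is none.
    static String firstMatch(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group() : null;
    }

    static String link(String html) {
        return firstMatch(SRC, html);
    }

    static String text(String html) {
        return firstMatch(TEXT, html);
    }

    public static void main(String[] args) {
        String str = "<div><img width=\"100px\" src=\"http://www.mysite.com/Content/dataImages/news/small/some-pic.png\" /><br />This is some text that I need to get.</div>";
        System.out.println(link(str));
        System.out.println(text(str));
    }
}
```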
Try this; just change the pattern if you must.
String str = "<div><img width=\"100px\" src=\"http://www.mysite.com/Content/dataImages/news/small/some-pic.png\" /><br />This is some text that I need to get.</div>";
Pattern p = Pattern.compile("src=\"(.*?)\" /><br />(.*?)</div>");
Matcher m = p.matcher(str);
if (m.find()) {
String link = m.group(1);
String text = m.group(2);
}
My solution was:
String tmp=xpp.nextText();
desc=android.text.Html.fromHtml(tmp).toString();
img=FindUrls.extractUrls(tmp);
for extracting the text from the string I used:
desc=android.text.Html.fromHtml(tmp).toString();
img=FindUrls.extractUrls(tmp);
and for the link inside the string I've used this function:
public static String extractUrls(String input) {
String result = null;
Pattern pattern = Pattern.compile(
"\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www.)" +
"(\\w+:\\w+#)?(([-\\w]+\\.)+(com|org|net|gov" +
"|mil|biz|info|mobi|name|aero|jobs|museum" +
"|travel|[a-z]{2}))(:[\\d]{1,5})?" +
"(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
"((\\?([-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
"(&(?:[-\\w~!$+|.,*:]|%[a-f\\d{2}])+=?" +
"([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
"(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");
Matcher matcher = pattern.matcher(input);
if (matcher.find()) {
result=matcher.group();
}
return result;
}
Hope it helps someone with a similar problem.
