How to get the src url and a href from HTML - java

I have this piece of HTML code. I want to extract the links contained in three kinds of attributes (href, src and url(...)). This is what I've tried so far:
String texto2 = "url(\"primeiro url\")\n" +
"url('2 url')\n" +
"href=\"1 href\"\n" +
"src=\"1 src\"\n" +
"src='2 src'\n" +
"url('3 url')\n" +
"\n" +
".camera_target_content .camera_link {\n" +
" background: url(../images/blank.gif);\n" +
" display: block;\n" +
" height: 100%;\n" +
" text-decoration: none;\n" +
"}";
String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
// expression to capture the links from src and href
Pattern pattern = Pattern.compile(exp);
// preparing the expression
Matcher matcher = pattern.matcher(texto2);
// grabbing the urls and printing them
while(matcher.find()) {
System.out.println(texto2.substring(matcher.start(), matcher.end()));
}
So far, so good: find works, but I need just the clean link, something like this:
img/image.gif
and not:
 href = "img/image.gif"
     src = "img/image.gif"
     url (img/image.gif)
I then tried to read only the captured link through a group; this is what I've tried so far:
String texto2 = "url(\"primeiro url\")\n" +
"url('2 url')\n" +
"href=\"1 href\"\n" +
"src=\"1 src\"\n" +
"src='2 src'\n" +
"url('3 url')\n" +
"\n" +
".camera_target_content .camera_link {\n" +
" background: url(../images/blank.gif);\n" +
" display: block;\n" +
" height: 100%;\n" +
" text-decoration: none;\n" +
"}";
String exp = "(?:href|src)=[\"'](.+)[\"']+|(?:url)\\([\"']*(.*)[\"']*\\)";
// expression to capture the links from src and href
Pattern pattern = Pattern.compile(exp);
// preparing the expression
Matcher matcher = pattern.matcher(texto2);
// grabbing the urls and printing them
while(matcher.find()) {
String s = matcher.group(2);
System.out.println(s);
}
It turns out that this version does not work: it grabs the url(...) values, but the href and src values are not returned. Can someone help me spot the problem?

Use jsoup. Parse the HTML string into a DOM and you can then use CSS selectors to pull out the values as you would with jQuery in JavaScript. Note that this will only work if you're actually working with HTML; the string at the top of your example is not HTML.
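If the input really is a mixed string like the one in the question (CSS url(...) entries next to bare href/src attributes) rather than an HTML document, a regular expression can still return the clean values. The sketch below is one possible approach, not the only one: it captures the href/src value in group 1 and the url(...) value in group 2, then reads whichever group took part in the match. The names LinkExtractor and extractLinks are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {
    // Group 1: quoted value of href=... or src=...
    // Group 2: value inside url(...), where the quotes are optional.
    private static final Pattern LINK = Pattern.compile(
            "(?:href|src)\\s*=\\s*[\"']([^\"']*)[\"']"
            + "|url\\(\\s*[\"']?([^\"')]*?)[\"']?\\s*\\)");

    public static List<String> extractLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            // Only one alternative matches at a time, so exactly one group is non-null.
            links.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "url(\"primeiro url\")\nurl('2 url')\nhref=\"1 href\"\nsrc='2 src'\n"
                + "background: url(../images/blank.gif);";
        extractLinks(text).forEach(System.out::println);
    }
}
```

This also shows why reading only group(2) in the question falls short: that group belongs to the url(...) alternative, so it is null whenever the href/src alternative matched.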

Related

Parsing a specific text value with JSoup

Hey does anyone know how to parse the "Light rain", " 7°C", and "Limited"? These are stored as #text so that's kind of throwing me off. For reference, to parse "Temperature:", it would be Element element5 = doc.select("strong").get(3);
Thanks!

The nodes from your example are called text nodes. In Jsoup, you can read the text of a node by using the text() method. So given your example, we'd select the td element and then use text() to get its text value.
However, this would also output the text from any child elements, so in your case this would produce Weather: Light rain as a single string. Fortunately, Jsoup also has an ownText() method that only extracts the value from the text nodes that are direct children of the element (and not from all descendants). So given your example code, you could write it like this:
Element element5 = doc.select("td").get(3);
String value = element5.ownText();

You can extract the required text in various ways; one of them is td.childNode(1).toString(). A complete solution is shown below:
// Imports needed: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
public static void main(String[] args) {
// Parse HTML String using JSoup library
String HTMLSTring = "<html>\n" +
" <head></head>\n" +
" <body>\n" +
" <table class=\"table\"> \n" +
" <tbody>\n" +
" <tr> \n" +
" <td><strong>Weather: </strong>Light Rain</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Temperature: </strong>70 C</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Visibility: </strong>Limited</td> \n" +
" </tr> \n" +
" <tr> \n" +
" <td><strong>Runs open: </strong>0</td> \n" +
" </tr>\n" +
" </tbody>\n" +
" </table>\n" +
" </body>\n" +
"</html>";
Document html = Jsoup.parse(HTMLSTring);
Elements tds = html.getElementsByTag("td");
for (Element td : tds) {
//String tdStrongText = td.childNode(0).childNodes().get(0).toString();
String tdStrongText = td.select("strong").text();
System.out.print(tdStrongText + " : ");
String tdText = td.childNode(1).toString();
System.out.println(tdText);
}
}

java regex to extract the image src from the data in the script tag

I need a Java regex to extract the image src inside the script tag in the following code. Please help me out.
thanks
<script language="javascript"><!--
document.write('<a href="javascript:popupWindow(\'https://www.kitchenniche.ca/prepara-adjustable-oil-pourer-pi-5597.html?invis=0\')">
<img src="images/imagecache/prepara-adjustable-oil-pourer-1.jpg" border="0" alt="Prepara Adjustable Oil Pourer" title=" Prepara Adjustable Oil Pourer " width="170" height="175" hspace="5" vspace="5">
<br>
</a>');
--></script>

Try this:
String mydata = "<script language='javascript'><!--document.write('<a href='javascript:popupWindow"
+ "(\'https://www.kitchenniche.ca/prepara-adjustable-oil-pourer-pi-5597.html?invis=0\')'><img "
+ "src='images/imagecache/prepara-adjustable-oil-pourer-1.jpg' border='0' alt='Prepara Adjustable Oil Pourer' "
+ "title=' Prepara Adjustable Oil Pourer ' width='170' height='175' hspace='5' vspace='5'><br></a>');</script>";
Pattern pattern = Pattern.compile("src='(.*?)'");
Matcher matcher = pattern.matcher(mydata);
if (matcher.find()) {
System.out.println(matcher.group(1));
}

This regex finds the content of the src attribute only if src directly follows <img. If src is not the first attribute of the img tag, you need a more complex regex.
public static void main(String[] args) {
String s = "<script language=\"javascript\"><!--\r\n"
+ " document.write('<a href=\"javascript:popupWindow(\\'https://www.kitchenniche.ca/prepara-adjustable-oil-pourer-pi-5597.html?invis=0\\')\">\r\n"
+ "<img src=\"images/imagecache/prepara-adjustable-oil-pourer-1.jpg\" border=\"0\" alt=\"Prepara Adjustable Oil Pourer\" title=\" Prepara Adjustable Oil Pourer \" width=\"170\" height=\"175\" hspace=\"5\" vspace=\"5\">\r\n"
+ "<br>\r\n" + "</a>');\r\n" + "--></script>";
Pattern pattern = Pattern.compile("<img src=\"([^\"]+)");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
String group = matcher.group(1);
System.out.println(group);
}
}
([^\"]+) means: match one or more characters other than " and capture them in group 1. In a Java string literal you have to escape " as \".
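For the case mentioned above where src is not the first attribute of the img tag, a slightly more tolerant pattern might look like the sketch below. It skips over other attributes inside the tag and accepts single or double quotes. This is an illustration under simple assumptions (no > inside attribute values, no quoted text containing " src="), not a general HTML parser; ImgSrcFinder and findImgSrc are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcFinder {
    // "<img", then lazily any characters except ">", then whitespace and "src=".
    // Group 1 is the opening quote; the backreference \1 demands the same closing quote.
    // Group 2 is the attribute value.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\ssrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    public static List<String> findImgSrc(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    public static void main(String[] args) {
        String html = "<img border=\"0\" src=\"images/imagecache/prepara-adjustable-oil-pourer-1.jpg\" alt=\"x\">";
        findImgSrc(html).forEach(System.out::println);
    }
}
```

Requiring whitespace before src (instead of a word boundary) keeps attributes such as data-src from being picked up by mistake.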

Capture multiples groups in regex

I want to parse some CSS in Java.
It doesn't have to be perfect and should only capture a specific style class.
Let's assume the CSS looks something like this:
.someunimportantclass .txt-value .input-suffix {
margin-left: 4px;
}
/* Table Columns */
body.thisone table .column-bic {
min-width: 70px;
}
body.thisone table .column-char35,
body.thisone table .column-somethingdifferent,
body.thisone table .column-somethingdifferent2,
body.thisone table .column-closebutstilldifferent {
{
min-width: 245px;
}
body.thisone table .column-code {
min-width: 25px;
text-align: center;
}
My approach with regex works only partially. Right now I have:
body.thisone table \.([a-z]*-[\w]*) \{[\s]*(.*)\: ([\w]*);
This captures all the single-line classes. It doesn't work with multiple classes sharing the same attribute(s) or with classes that have several attributes. I experimented a little bit with the group quantifiers (like + and ?) but couldn't really figure out how to do it.
Another problem I haven't really thought about is how to map those groups into Java Objects. With just one attribute to one class it is as easy as
for (int i = 1; i <= matcher.groupCount(); i += 3) {
classes.add(matcher.group(i));
attributes.put(matcher.group(i + 1), matcher.group(i + 2));
}
with classes as List<String> and attributes as Map<String, String>.
But off the top of my head I cannot come up with a way to do it with several classes and/or attributes.

String foo = ".someunimportantclass .txt-value .input-suffix {\n" +
" margin-left: 4px;\n" +
"}\n" +
"\n" +
"/* Table Columns */\n" +
"\n" +
"body.thisone table .column-bic {\n" +
" min-width: 70px;\n" +
"}\n" +
"\n" +
"body.thisone table .column-char35,\n" +
"body.thisone table .column-somethingdifferent,\n" +
"body.thisone table .column-somethingdifferent2,\n" +
"body.thisone table .column-closebutstilldifferent {\n" +
"{\n" +
" min-width: 245px;\n" +
"}\n" +
"\n" +
"body.thisone table .column-code {\n" +
" min-width: 25px;\n" +
" text-align: center;\n" +
"}";
String key = "body.thisone table";
If I understand the desired data structure correctly, it would look something like this:
HashMap<String, HashMap<String, String>> matchingClasses = new HashMap<>();
The pattern to find a CSS class name with a similar structure would be:
// Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?:,[^{]+)?\\s*");
And then we can capture its contents with a lookahead, so that classes sharing one block (comma-separated selectors) can be re-matched as well. Note that { and } should be escaped in a Java regex, and the closing \\} is needed so that the lazy (.*?) actually captures the block body instead of an empty string:
// Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?=(?:,[^{]+)?\\s*" +
// "\\{\\s*(.*?)\\s*\\})");
Since the CSS class contents are multi-line we have to compile this with DOTALL.
Pattern pattern = Pattern.compile(key + "\\s\\.([a-z]*-[\\w]*)(?=(?:,[^{]+)?\\s*" +
"\\{\\s*(.*?)\\s*\\})", Pattern.DOTALL);
From there we can match with the regex, after compiling another pattern to break down the CSS class contents:
Pattern content = Pattern.compile("([\\w-]+)\\s*:\\s*([^;]+);");
Matcher matcher = pattern.matcher(foo);
while (matcher.find()) {
// matcher.group(1); // This is the class name.
// matcher.group(2); // This is the class contents.
We can get the attribute value pairs like this:
HashMap<String, String> attributes = new HashMap<>();
Matcher contents = content.matcher(matcher.group(2));
while (contents.find())
attributes.put(contents.group(1), contents.group(2));
And then add it into our matchingClasses hashmap.
if (! attributes.isEmpty())
matchingClasses.put(matcher.group(1), attributes);
}
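Put together, the steps above can be sketched as one self-contained class. This is a minimal illustration rather than a real CSS parser: CssClassParser, parse and SAMPLE are hypothetical names, the sample CSS is a shortened variant of the question's input, and the selector prefix is quoted with Pattern.quote so that the dot in body.thisone is matched literally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssClassParser {
    // Shortened sample input, assumed for this sketch.
    static final String SAMPLE =
            ".other .input-suffix {\n  margin-left: 4px;\n}\n"
            + "body.thisone table .column-bic {\n  min-width: 70px;\n}\n"
            + "body.thisone table .column-a,\n"
            + "body.thisone table .column-b {\n  min-width: 245px;\n}\n"
            + "body.thisone table .column-code {\n  min-width: 25px;\n  text-align: center;\n}\n";

    // One "property: value;" declaration inside a block.
    private static final Pattern DECLARATION = Pattern.compile("([\\w-]+)\\s*:\\s*([^;]+);");

    public static Map<String, Map<String, String>> parse(String css, String key) {
        // Group 1: class name. Group 2 (inside the lookahead): block body up to the first "}".
        // The lookahead consumes nothing, so later selectors of a comma list match again.
        Pattern rule = Pattern.compile(
                Pattern.quote(key) + "\\s+\\.([a-z]*-\\w*)"
                + "(?=(?:,[^{]+)?\\s*\\{\\s*(.*?)\\s*\\})",
                Pattern.DOTALL);
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        Matcher m = rule.matcher(css);
        while (m.find()) {
            Map<String, String> attributes = new LinkedHashMap<>();
            Matcher d = DECLARATION.matcher(m.group(2));
            while (d.find()) {
                attributes.put(d.group(1), d.group(2).trim());
            }
            if (!attributes.isEmpty()) {
                result.put(m.group(1), attributes);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        parse(SAMPLE, "body.thisone table")
                .forEach((name, attrs) -> System.out.println(name + " -> " + attrs));
    }
}
```

LinkedHashMap is used here only so that the classes come out in source order; a plain HashMap as in the answer works as well if order does not matter.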

java: Extract a substring using regular expression

I have String data from which I want to extract a substring, but I am stuck on creating the regex pattern for that. The String data I have is the following:
$.ajax({url:"Q" + "uestions?"
+ "" + "action="
+ "maxim" + "um&"
+ "p043366329446409=08315891235072667&"
+ "c" + "ity="
+ k.val() + "&"
+ e + "=888",success:succFun,error:errFun,async:false});
};
I want to extract the p043366329446409=08315891235072667 part from the above string. This data changes every time I make a request to the server, but "p0" will always start the substring and &" will always end it.
Thanks, everyone.

Try this one:
String mydata = "<query string>";
Pattern pattern = Pattern.compile("p0([0-9]+)=([0-9]+)&");
Matcher matcher = pattern.matcher(mydata);
int start=0,end=0;
if(matcher.find())
{
start=matcher.start();
end=matcher.end();
System.out.println(mydata.substring(start,end-1));
}

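Try this:
// (?s) lets . match line breaks, so the pattern also works on the multi-line input;
// in the sample data p0 is not preceded by &, so no leading & is required
String p0 = s.replaceAll("(?s).*?(p0.+?=.+?)&.*", "$1");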
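As a variation on the answers above, the value can be read from a capturing group instead of trimming the match with substring or rewriting the whole string. The sketch below assumes, as the question states, that the token starts with p0 and ends right before an &, and additionally assumes that both sides of the = are digits, as in the sample. TokenExtractor and extractToken are illustrative names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenExtractor {
    // "p0", digits, "=", digits; the trailing "&" must be present but is not captured.
    private static final Pattern TOKEN = Pattern.compile("(p0\\d+=\\d+)&");

    public static Optional<String> extractToken(String data) {
        Matcher m = TOKEN.matcher(data);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String data = "+ \"maxim\" + \"um&\"\n+ \"p043366329446409=08315891235072667&\"\n+ \"c\" + \"ity=\"";
        System.out.println(extractToken(data).orElse("not found"));
    }
}
```

Returning an Optional makes the "no match" case explicit, whereas replaceAll silently returns the unchanged input when the pattern does not match.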

How to get text between two Elements in DOM object?

I'm using JSoup to parse this HTML content:
<div class="submitted">
<strong><a title="View user profile." href="/user/1">user1</a></strong>
on 27/09/2011 - 15:17
<span class="via">www.google.com</span>
</div>
Which looks like this in web browser:
user1 on 27/09/2011 - 15:17 www.google.com
The username and the website can be parsed into variables using this:
String user = content.getElementsByClass("submitted").first().getElementsByTag("strong").first().text();
String website = content.getElementsByClass("submitted").first().getElementsByClass("via").first().text();
But I'm unsure how to get the "on 27/09/2011 - 15:17" part into a variable. If I use
String date = content.getElementsByClass("submitted").first().text();
the result also contains the username and the website.

You can always remove the user and the website elements like this (you can clone your submitted element if you do not want the remove actions to "damage" your document):
public static void main(String[] args) throws Exception {
Document content = Jsoup.parse(
"<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
// create a clone of the element so we do not destroy the original
Element submitted = content.getElementsByClass("submitted").first().clone();
// remove the elements that you do not need
submitted.getElementsByTag("strong").remove();
submitted.getElementsByClass("via").remove();
// print the result (demo)
System.out.println(submitted.text());
}
Outputs:
on 27/09/2011 - 15:17

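You can also parse the full text string that you get (assuming contentString holds the text of the whole submitted element, i.e. "user1 on 27/09/2011 - 15:17 www.google.com"):
String[] parts = contentString.split(" ");
Then you can construct the string you want like this:
String date = parts[1] + " " + parts[2] + " - " + parts[4];
This gives you the string you need.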
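Index-based splitting depends on the exact number of words before the date, so it breaks if, for example, the user name contains a space. A pattern-based variant is sketched below; it assumes the date always has the "on dd/MM/yyyy - HH:mm" shape shown in the question. SubmittedDate and find are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubmittedDate {
    // Literal "on ", a dd/MM/yyyy date, " - ", and an HH:mm time.
    private static final Pattern DATE =
            Pattern.compile("on \\d{2}/\\d{2}/\\d{4} - \\d{2}:\\d{2}");

    public static Optional<String> find(String text) {
        Matcher m = DATE.matcher(text);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(find("user1 on 27/09/2011 - 15:17 www.google.com").orElse("not found"));
    }
}
```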

Select the element before the text you wish to grab, then get its next sibling node (not element), which is a text node:
Document doc = Jsoup.parse("<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
String str = doc.select("strong").first().nextSibling().toString().trim();
System.out.println(str);
You can also ask an element for its child text nodes and index directly (though referencing the nodes by sibling is usually more robust than indexing):
Document doc = Jsoup.parse(
"<div class=\"submitted\">" +
" <strong><a title=\"View user profile.\" href=\"/user/1\">user1</a></strong>" +
" on 27/09/2011 - 15:17 " +
" <span class=\"via\">www.google.com</span>" +
"</div> ");
String str = doc.select("div").first().textNodes().get(1).text().trim();
System.out.println(str);
