Extract links containing a given string from a document with jsoup - java

I use jsoup to extract the links from a website. I want to extract only the links that contain a specific keyword, in this case "download". How can I do this? I have the following code:
Document doc = Jsoup.parse(new URL("http://www.examplesite.com"), 10000); // second argument is the timeout in milliseconds
Element link = doc.select("a").first();

See here for the selector syntax.
You can test for the text within a node with :contains, e.g. Element link = doc.select("a:contains(Download)").first();. If you want you can use :matches for regex.
You get the link address via the attr method, e.g. String linkaddress = link.attr("href");.

You can use attribute selectors:
[attr^=value] matches elements whose attribute starts with the value, [attr$=value] those whose attribute ends with it, and [attr*=value] those whose attribute contains it, e.g. [href*=/path/]
To get the links whose href contains a certain word, use this:
org.jsoup.select.Elements links = doc.select("[href*=download]");
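The two approaches above (matching on link text with :contains and matching on the href with [href*=...]) can be combined in one query. Here is a minimal sketch, assuming jsoup is on the classpath; the class name DownloadLinks, the helper downloadHrefs, and the sample markup are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class DownloadLinks {

    // Collects the absolute href of every <a> whose href contains "download"
    // or whose text contains "Download". The comma joins the two selectors (logical OR).
    static List<String> downloadHrefs(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> hrefs = new ArrayList<>();
        for (Element link : doc.select("a[href*=download], a:contains(Download)")) {
            // "abs:href" resolves a relative href against the base URI
            hrefs.add(link.attr("abs:href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not taken from the question
        String html = "<a href='/files/download/app.zip'>Get the app</a>"
                + "<a href='/about'>About</a>"
                + "<a href='/get?id=7'>Download now</a>";
        System.out.println(downloadHrefs(html, "http://www.examplesite.com/"));
    }
}
```

If you only need the first match, as in the question, call first() on the result of select and check it for null before reading the attribute.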

Related

JSOUP - find elements starting with

I have a following HTML:
<data-my-tag>
<data-another-tag>
<p>...</p>
<data-my-tag>
<span>...</span>
</data-my-tag>
</data-another-tag>
</data-my-tag>
I use jsoup to parse it and I would like to match all elements whose tag name starts with data-.
I only found getElementsByTag, which matches on the entire tag name. The select method only takes CSS selectors, and there seems to be no way to match data-* tags with it (for example via XPath). Is there any way to match these tags with jsoup?
Unfortunately, select does not accept XPath queries. The only way I figured out is the following:
Document doc = Jsoup.parse(content);
Elements elements = doc.select("*");
elements.stream().filter(e -> e.nodeName().startsWith("data-")).forEach(e -> {
// do what you need with the node
});

How can I get the last text in <p> through xpath in selenium webdriver

I want to extract the text of the paragraph circled in the screenshot, but from the last instance of the class, because this text is dynamic. In other words, I want the last text within this class.
I tried this:
String Reply2 = driver.findElement(By.xpath("//div[@class='chat-message-content clearfix']/last[]")).getText();
You can't use last[]; use [last()] instead:
//div[contains(@class,'chat-message-content')][last()]/p
You are almost there; you just need a couple of changes in your XPath.
String xPath = "(//div[@class='chat-message-content clearfix']/p)[last()]";
//div[@class='chat-message-content clearfix']/p selects all the p elements under divs with the class chat-message-content clearfix. Wrapping that expression in ( and ) groups them into one node set, and [last()] then picks the last item of the group.
Get the text by using the below line.
String Reply2= driver.findElement(By.xpath(xPath)).getText();
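To see why the parentheses matter, here is a small sketch that evaluates both forms with the JDK's built-in javax.xml.xpath instead of Selenium, so it runs without a browser. The class name LastParagraphXPath, the helper evaluate, and the sample markup are hypothetical. Without grouping, [last()] is applied per parent element, so the two forms differ when the matching divs do not share a parent.

```java
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class LastParagraphXPath {

    // Evaluates an XPath expression and returns the string value of the
    // first node (in document order) of the resulting node set.
    static String evaluate(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical markup: the matching divs sit under different parents
        String xml = "<body>"
                + "<section><div class='chat-message-content clearfix'><p>first</p></div></section>"
                + "<section><div class='chat-message-content clearfix'><p>second</p><p>latest reply</p></div></section>"
                + "</body>";
        String cls = "[@class='chat-message-content clearfix']";

        // Grouped: last() is applied to the whole node set
        System.out.println(evaluate(xml, "(//div" + cls + "/p)[last()]")); // prints "latest reply"

        // Ungrouped: each div is the last match under its own parent, so every p is selected
        System.out.println(evaluate(xml, "//div" + cls + "[last()]/p")); // prints "first"
    }
}
```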

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its html content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector method. Is there a workaround?
With the jsoup library you can pick an element out of the parsed HTML by its tag name, ID, or class.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document= Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
Happy Coding :-)

Java replace link in a tag

I have
String s = "https://stackoverflow.com<br/>https://google.com"
Now I just want to replace all links in the href attributes by prefixing them with a fixed value (e.g. 'abc.com?'). Here's the result that I want:
String s = "https://stackoverflow.com<br/>https://google.com"
I tried the following, but it doesn't resolve the problem because it replaces all strings beginning http://, not only those within href attributes:
s= s.replaceAll("http://.+?(com|net|org|vn)/{0,1}","abc.com" + "&url=" + "$0");
What can I do to replace only within the attribute, and not in other content?
You could use an HTML parser such as jsoup:
String s = "<a href=\"https://stackoverflow.com\">https://stackoverflow.com</a>"; // assumed markup: an anchor whose href is the URL
Document document = Jsoup.parse(s);
Elements anchors = document.getElementsByTag("a");
anchors.get(0).attr("href", "...new href...");
Alternatively, if this is too heavyweight, a regex should suffice:
<a href="(?<url>[^"]+)">(?<text>[^<]+)<\/a>
Note: if you don't care about the text group, replace ?<text> with ?:
Then replace the url and text groups using an approach similar to this answer.
As RealSkeptic said, look for href instead of the link itself; it saves a lot of effort.
var s = 'https://stackoverflow.com<br/>https://google.com';
s = s.replace(/href="/g,"href=\"abc.com&url=" );
console.log(s);
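Since the question is about Java, the same idea can be sketched with java.util.regex: match only the href="..." attribute and put the prefix in front of its value. The class name HrefPrefixer, the helper prefixHrefs, and the prefix "abc.com?url=" are assumptions for illustration, and the pattern only handles double-quoted href values.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefPrefixer {

    // Matches href="..." and captures the attribute value (double quotes only)
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Returns the HTML with the prefix inserted in front of every href value.
    // Text outside href attributes is left untouched.
    static String prefixHrefs(String html, String prefix) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = "href=\"" + prefix + m.group(1) + "\"";
            // quoteReplacement keeps '$' and '\' in URLs from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical input markup
        String s = "<a href=\"https://stackoverflow.com\">https://stackoverflow.com</a>"
                + "<br/><a href=\"https://google.com\">https://google.com</a>";
        System.out.println(prefixHrefs(s, "abc.com?url="));
    }
}
```

For anything beyond simple, well-formed markup, the parser-based approach above is the safer choice.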

How to find elements by substring of ID using selector-syntax Jsoup?

I have used Jsoup to fetch a page from a URL. I can extract the link of certain id using the following line of code:
Elements links = doc.select("a[href]#title0");
How can I find the elements if I only know part of the ID, for example 'title'? I know that I could find all the a links with an href, then iterate through 'links' and check whether each id contains the substring 'title', but I would like to avoid this approach. Is there a way to filter the links in the selector itself by checking whether the id contains the substring 'title'?
You can use something like:
Elements links = doc.select("a[id^=nav]");
This returns all the links whose id starts with the string "nav".
The following returns all the links whose id contains the string "logo":
Elements links = doc.select("a[id~=logo]");
@alex-ackerman's answer is half correct, but the other half is wrong.
[attr^=valPrefix] elements with an attribute named "attr", and value starting with "valPrefix"
[attr$=valSuffix] elements with an attribute named "attr", and value ending with "valSuffix"
[attr*=valContaining] elements with an attribute named "attr", and value containing "valContaining"
[attr~=regex] elements with an attribute named "attr", and value matching the regular expression
The above may be combined in any order
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
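The four attribute selectors can be tried side by side on the id attribute. This is a minimal sketch, assuming jsoup is on the classpath; the class name IdSubstringSelectors, the helper ids, and the sample markup are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;

public class IdSubstringSelectors {

    // Returns the id of every element matched by the CSS query, in document order
    static List<String> ids(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).eachAttr("id");
    }

    public static void main(String[] args) {
        // Hypothetical markup; the last link has no href, so a[href] excludes it
        String html = "<a id='title0' href='/a'>A</a>"
                + "<a id='subtitle' href='/b'>B</a>"
                + "<a id='nav-title' href='/c'>C</a>"
                + "<a id='logo' href='/d'>D</a>"
                + "<a id='title9'>no href</a>";

        System.out.println(ids(html, "a[href][id^=title]"));        // id starts with "title"
        System.out.println(ids(html, "a[href][id$=title]"));        // id ends with "title"
        System.out.println(ids(html, "a[href][id*=title]"));        // id contains "title"
        System.out.println(ids(html, "a[href][id~=^title\\d+$]"));  // id matches the regex
    }
}
```

For the question as asked (links with an href whose id contains 'title'), the third query, a[href][id*=title], is the one that applies.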
