I have written a Jsoup class to scrape a page and grab the hrefs for every element on the page. What I would like to do from there is extract the XPath for each of those elements from their hrefs.
Is there a way to do this in Jsoup? If not, what is the best way to do this in Java (and are there any resources on this)?
Update
I want to clarify my question.
I want to scan a page for all the href identifiers and grab the links (that part is done). For my script, I need to get the XPath of each element I have identified and scraped from the scanned page.
The problem is that I assumed I could easily translate the href links to XPaths.
The comment from Rishal dev Singh ended up being the right answer.
Check his link here:
http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath
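Since Jsoup has no built-in XPath support, one common workaround is to derive an XPath for an element yourself by walking its parent chain. A minimal sketch of that idea (the indexing logic is my own, not part of Jsoup's API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class XpathBuilder {
    // Builds an index-based XPath (e.g. /html[1]/body[1]/p[2]/a[1]) by walking
    // up the parent chain and counting same-tag preceding siblings.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Element e = el; e.parent() != null; e = e.parent()) {
            int idx = 1;
            for (Element sib : e.parent().children()) {
                if (sib == e) break;
                if (sib.tagName().equals(e.tagName())) idx++;
            }
            path.insert(0, "/" + e.tagName() + "[" + idx + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(
            "<html><body><p>one</p><p><a href='x'>link</a></p></body></html>");
        for (Element a : doc.select("a[href]")) {
            System.out.println(a.attr("href") + " -> " + xpathOf(a));
        }
    }
}
```

Note this produces a purely positional XPath; it will break if the page layout shifts, so it is best used for one-off scraping rather than stable locators.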
I have to find an element with "== $0" after the end tag of "span". Below is the HTML code of the element.
<div _ngcontent-c6="" class="col-12">
<span _ngcontent-c6="">Registration with prefilled user's data</span>
</div>
Although when I copied the HTML code, the "== $0" was removed automatically, so I am also attaching an image.
I have tried to find a solution, but nothing worked. I tried an XPath that normally works, like .//span[text()='Registration with prefilled user's data'], but no success. I just found that we can access this element in the Chrome console with the syntax $0, and it works fine there,
but I don't know how to find it with XPath, CSS, or any other recommended locator strategy in Selenium.
Note: Please don't mention any workaround such as using className or a CSS selector with the class name, like div.col-12 span, as I already know that. My problem is handling elements with == $0.
So the text, == $0, is not exactly what you think it means. This is just a feature of Chrome dev tools, and not an actual element on the page. It's a property used by dev tools that allows you to test scripts via console. This has been discussed here, and it does not affect Selenium's ability to locate elements on the page.
The issue might be the selector that you are using, or possibly a hidden iframe element higher in the DOM that is obscuring the element.
You can try this XPath:
//span[contains(text(), "Registration with prefilled user's data")]
I just swapped out the text()='text' query with contains(text(), 'text') which may account for any hidden whitespace within the span element.
This XPath is correct for the span element, provided there are no special cases on the page. So, if this does not work for you, I'd recommend posting the full HTML or a link to the page you are automating, so that we can all help you better.
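For what it's worth, the apostrophe in the target text is the usual stumbling block here: the original attempt put an apostrophe inside a single-quoted XPath literal, which makes the expression invalid, while wrapping the literal in double quotes sidesteps it. A quick sanity check of the suggested selector using the JDK's own XPath engine (the HTML is simplified to well-formed XML for the demo, with the Angular _ngcontent attributes dropped):

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XpathCheck {
    // Evaluates the suggested XPath against a document and returns the matched text
    static String findSpanText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // Double quotes around the XPath literal avoid clashing with the apostrophe
        String expr = "//span[contains(text(), \"Registration with prefilled user's data\")]";
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<div class=\"col-12\">"
            + "<span>Registration with prefilled user's data</span></div>";
        System.out.println(findSpanText(xml));
    }
}
```

The same expression can be handed to Selenium via By.xpath(...) unchanged.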
Basically, I need a table with all the possible books that exist, and I don't want to build that myself, because I'm a very lazy person xD. So, my question is: can I take a site that I have in mind, cut off the parts of the site that I don't need, and leave only the search part (maybe with some changes to the layout)? Then, run the search, find the book, and store in my database only the data that makes sense for me. Is that possible? I heard that Jsoup could help.
So, I just want some tips. (Thanks for reading.)
the site: http://www.isbn.bn.br/website/consulta/cadastro
Yes, you can do that using Jsoup. The main problem is that the URL you shared relies on JavaScript, so you'll need to use Selenium to force the JS execution, or you can get the book's URL and parse that page directly.
The way to parse a web using Jsoup is:
Document document = Jsoup.connect("YOUR-URL-GOES-HERE")
.userAgent("Mozilla/5.0")
.get();
Then you have the whole HTML in a Document, so you can get any Element contained in it using CSS selectors. For example, if you want to retrieve the title of the page, you can use:
Elements elements = document.select("title");
And the same goes for every HTML tag that you want to retrieve information from. You can check the Jsoup documentation and some of the examples explained there: Jsoup
I hope it helps you!
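To round the snippet off: once you have the Elements, .text() (or .attr(...)) pulls out the actual data. A self-contained sketch using an in-memory string instead of the live URL (swap Jsoup.parse for the Jsoup.connect(...).get() call above to run it against the real page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleDemo {
    // Parses an in-memory HTML string; replace with Jsoup.connect(url).get()
    // to fetch a live page instead.
    static String titleOf(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("title").first().text();
    }

    public static void main(String[] args) {
        System.out.println(
            titleOf("<html><head><title>ISBN search</title></head><body></body></html>"));
    }
}
```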
For fun, I'm writing a basic parser that finds data within an HTML document. I want to find the best structure to represent the branches of the parsed file.
The criteria for "best structure" is this: I want to easily search for a tag's relative location and access its contents, like "the image in the second image tag after the third h3 tag in the body" or "the title tag in the header".
I expect to search the first level of tags for the tag I'm looking for, then move into the branch associated with that tag. That's the structure this question is looking for, but if there is a better way to find relative locations in an HTML document, please explain.
So that's the question. More generally, what kind of Java structures are available through the API that can represent tree data structures?
Don't reinvent the wheel; just use an HTML parser like Jsoup. You will be able to get your tags thanks to CSS selectors, using the method Element#select(cssQuery).
Document doc = Jsoup.parse(file, encoding);
Elements elements = doc.select(cssQuery);
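To show how the asker's relative-location queries map onto CSS selectors (the HTML and the specific selectors here are illustrative, not from the question):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class RelativeSelect {
    public static void main(String[] args) {
        String html = "<html><head><title>Demo</title></head><body>"
            + "<h3>a</h3><h3>b</h3><h3>c</h3>"
            + "<img src='1.png'><img src='2.png'></body></html>";
        Document doc = Jsoup.parse(html);

        // "the title tag in the header"
        System.out.println(doc.select("head > title").text());

        // "the second image tag after the third h3":
        // :nth-of-type(3) picks the third h3, ~ selects later siblings
        Elements imgs = doc.select("h3:nth-of-type(3) ~ img");
        System.out.println(imgs.get(1).attr("src"));
    }
}
```

Under the hood Jsoup already gives you the tree structure the question asks about: each Element holds its children, so you can also walk it manually with parent(), children(), and siblingElements().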
I am getting the href by using
Jsoup.parse(hrefLink, "").select("a[href]").attr("href")
where hrefLink is the href link that was found.
What I want to do is get outgoing links from the current web page if they match my condition. Unfortunately, because of anchor links, I cannot always get outgoing links; instead, I want to be able to get the other hrefs that the anchor link leads to. For instance:
Given page: http://en.wikipedia.org/wiki/Baked_potato
where the citation [10] anchor link has two outgoing links. I want to be able to get them. How can I do that using Jsoup? If that's not possible with Jsoup, what else can I use?
HTML anchors (and links to fragments in general) only indicate a position in a document that a browser will scroll to when the anchor is navigated to (through a link or directly via a URL with a #fragment); they don't "redirect" to anything. The relationship between the links is not encoded in the document, so Jsoup (or any other library) cannot determine this, in general. Your program will need some semantic knowledge of the pages it's processing.
In your Wikipedia example, after finding the li#cite_note-10 element, you can select all child a elements, then use absUrl("href") to get the link target and filter out any links that refer to the same page. (Currently just checking that the href attribute doesn't start with # is enough, but in general a document can link to itself with a full URL too.) But this depends on the semantics of the document, not just its syntax -- a future Wikipedia redesign may move where the citation link points so that the outgoing links are no longer children of the citation link target, and your code will break.
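A sketch of that approach, using an inline stand-in for the Wikipedia markup so it runs offline (the element id and hrefs below are illustrative; the real page structure may differ and, as noted, may change over time):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CitationLinks {
    // Collects absolute link targets under the element, skipping same-page
    // fragment links such as the citation's back-reference arrow.
    static List<String> outgoingLinks(Element cite) {
        List<String> out = new ArrayList<>();
        for (Element a : cite.select("a[href]")) {
            if (!a.attr("href").startsWith("#")) {
                out.add(a.absUrl("href"));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for: Jsoup.connect("http://en.wikipedia.org/wiki/Baked_potato").get()
        String html = "<li id='cite_note-10'><a href='#cite_ref-10'>^</a> "
            + "<a href='/wiki/Potato'>Potato</a> "
            + "<a href='http://example.org/x'>source</a></li>";
        Document doc = Jsoup.parse(html, "http://en.wikipedia.org/wiki/Baked_potato");
        System.out.println(outgoingLinks(doc.selectFirst("li#cite_note-10")));
    }
}
```

The base URI passed to Jsoup.parse is what lets absUrl("href") resolve relative links like /wiki/Potato into absolute URLs.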
Perhaps I'm doing something wrong, but I'm trying to parse this page using Jsoup, and for some reason it doesn't find the div I'm looking for:
doc = Jsoup.connect(params[0]).get();
content = doc.select("div.itemcontent").first().text();
Where am I going wrong here?
Thanks
The problem: you get a different website using Jsoup than using a browser. I set another user agent in Jsoup, but no luck. Possibly the content is changed through JavaScript?!
However, you can change the selector according to the website you actually get.
It's always a good idea to take a look at the document as it's parsed - a simple System.out.println(doc) is enough.
Here are some steps you can try:
Print your Document doc (e.g. using System.out)
Search for the required value(s) in there
Select those tags instead
I just played around a bit, but maybe you can use this snippet:
content = doc.select("description").first().text();
It seems to me that <description>...</description> is what you're looking for.