Getting the Anchor Link by using Jsoup - java

I am getting the href by using
Jsoup.parse(hrefLink, "").select("a[href]").attr("href")
where hrefLink is the href link I found.
What I want to do is get the outgoing links from the current web page if they match my condition. Unfortunately, because of anchor links, I cannot always get outgoing links; instead, I want to be able to get the other hrefs that the anchor link leads to. For instance:
Given page: http://en.wikipedia.org/wiki/Baked_potato
where the citation [10] anchor link leads to two outgoing links. I want to be able to get them. How can I do that using Jsoup? If that's not possible with Jsoup, what else can I use?

HTML anchors (and links to fragments in general) only indicate a position in a document that a browser will scroll to when the anchor is navigated to (through a link or directly via a URL with a #fragment); they don't "redirect" to anything. The relationship between the links is not encoded in the document, so Jsoup (or any other library) cannot determine this, in general. Your program will need some semantic knowledge of the pages it's processing.
In your Wikipedia example, after finding the li#cite_note-10 element, you can select all child a elements, then use absUrl("href") to get the link target and filter out any links that refer to the same page. (Currently just checking that the href attribute doesn't start with # is enough, but in general a document can link to itself with a full URL too.) But this depends on the semantics of the document, not just its syntax -- a future Wikipedia redesign may move where the citation link points so that the outgoing links are no longer children of the citation link target, and your code will break.
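A minimal sketch of that approach (the HTML string here is a trimmed-down stand-in for the Wikipedia citation markup; a real run would fetch the page with Jsoup.connect(pageUrl).get() instead):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CitationLinks {
    // Collect absolute link targets under the element with the given id,
    // skipping links that point back into the page itself.
    static List<String> outgoingLinks(Document doc, String citeId, String pageUrl) {
        List<String> links = new ArrayList<>();
        for (Element link : doc.getElementById(citeId).select("a[href]")) {
            String abs = link.absUrl("href"); // resolves relative hrefs against the base URL
            if (!abs.isEmpty() && !abs.startsWith(pageUrl)) {
                links.add(abs);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String pageUrl = "http://en.wikipedia.org/wiki/Baked_potato";
        String html = "<li id=\"cite_note-10\">"
                + "<a href=\"#cite_ref-10\">^</a> "
                + "<a href=\"http://example.org/source\">Source</a> "
                + "<a href=\"/wiki/Potato\">Potato</a></li>";
        Document doc = Jsoup.parse(html, pageUrl);
        System.out.println(outgoingLinks(doc, "cite_note-10", pageUrl));
    }
}
```

Passing the page URL as the base URI when parsing is what makes absUrl("href") work; the startsWith check then filters fragment links and full-URL self-links alike, per the caveat above.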

Fetching data from another website with JSOUP

Basically, I need a table with all the possible books that exist, and I don't want to build that by hand, because I'm a very lazy person xD. So, my question is: can I use a site that I have in mind, cut off the rest of the site (that I don't need) and leave only the search part (maybe with some changes to the layout)... then make the search, find the book, and store in my database only the data that makes sense for me. Is that possible? I heard that JSOUP could help.
So, I just want some tips. (thx for reading).
the site: http://www.isbn.bn.br/website/consulta/cadastro
Yes, you can do that using Jsoup. The main problem is that the URL you shared relies on JavaScript, so you'll need to use Selenium to force the JS execution, or you can get the book's URL directly and parse that instead.
The way to parse a page using Jsoup is:
Document document = Jsoup.connect("YOUR-URL-GOES-HERE")
.userAgent("Mozilla/5.0")
.get();
Then you get the whole HTML back as a Document, so you can retrieve any Element contained in it using CSS selectors. For example, if you want to retrieve the title of the page, you can use:
Elements elements = document.select("title");
And the same goes for every HTML tag that you want to retrieve information from. You can check the Jsoup documentation and some of the examples explained there: Jsoup
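Putting the two steps together, a runnable sketch (a fixed HTML string stands in here for the network call, so swap in Jsoup.connect("YOUR-URL-GOES-HERE").userAgent("Mozilla/5.0").get() for a live page):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TitleExample {
    public static void main(String[] args) {
        // Parsing a fixed string stands in for Jsoup.connect(...).get()
        String html = "<html><head><title>ISBN search</title></head><body></body></html>";
        Document document = Jsoup.parse(html);

        // Any CSS selector works here; "title" grabs the page title element
        Elements elements = document.select("title");
        System.out.println(elements.text());
    }
}
```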
I hope it helps you!

Is there a way to convert an element link to XPath

I have written a Jsoup class file to scrape a page and grab the hrefs for every element on the page. What I would like to do from there is to extract the Xpath for each of the elements from their hrefs.
Is there a way to do this in Jsoup? If not, what is the best way to do this in Java (and are there any resources on this)?
Update
I want to clarify my question.
I want to scan a page for all the href identifiers and grab the links (that part is done). For my script, I need to get the xpath of all the elements I have identified and scraped from the (scanned) page.
The problem is that I assumed I could easily translate the href links to Xpath.
The comment from @Rishal dev Singh ended up being the right answer.
Check his link here:
http://stackoverflow.com/questions/7085539/does-jsoup-support-xpath
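In short, Jsoup has no built-in XPath support, but a positional XPath for any element you've already found can be hand-rolled by walking up its ancestors. A sketch (the xpathOf helper is my own, not a Jsoup API):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HrefToXpath {
    // Build a simple positional XPath (e.g. /html[1]/body[1]/a[2]) for an element
    // by counting same-tag preceding siblings at each level.
    static String xpathOf(Element el) {
        StringBuilder path = new StringBuilder();
        for (Element e = el; e != null && !e.tagName().equals("#root"); e = e.parent()) {
            int index = 1;
            for (Element sib = e.previousElementSibling(); sib != null; sib = sib.previousElementSibling()) {
                if (sib.tagName().equals(e.tagName())) index++;
            }
            path.insert(0, "/" + e.tagName() + "[" + index + "]");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<body><a href='http://a.example'>one</a><a href='http://b.example'>two</a></body>");
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + xpathOf(link));
        }
    }
}
```

Note this maps elements to XPaths, not hrefs to XPaths: the same href can appear on several elements, so you derive the XPath from each matched Element, not from the link string.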

Developing app to detect webpage change

I'm trying to make a desktop app with java to track changes made to a webpage as a side project and also to monitor when my professors add content to their webpages. I did a bit of research and my current approach is to use the Jsoup library to retrieve the webpage, run it through a hashing algorithm, and then compare the current hash value with a previous hash value.
Is this a recommended approach? I'm open to suggestions and ideas since before I did any research I had no clue how to start nor what jsoup was.
One potential problem with your hashing method: if the page contains any dynamically generated content that changes on each refresh, as many modern websites do, your program will report that the page is constantly changing. Hashing the whole page will only work if the site does not employ any of this dynamic content (ads, hit counter, social media, etc.).
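For reference, the hash-and-compare step the question describes is only a few lines using the standard library (the HTML strings here stand in for content fetched with Jsoup):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class PageHash {
    // SHA-256 of the page content, hex-encoded
    static String sha256(String content) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest).toString(16);
    }

    public static void main(String[] args) throws NoSuchAlgorithmException {
        String previous = sha256("<html><body>Assignment 1</body></html>");
        String current  = sha256("<html><body>Assignment 1, Assignment 2</body></html>");
        System.out.println(previous.equals(current) ? "unchanged" : "page changed");
    }
}
```

As noted above, hashing the raw page only works when the served HTML is fully static; a single changing ad slot or counter flips the digest every fetch.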
What specifically are you looking for that has changed? Perhaps new assignments being posted? You likely do not want to monitor the entire page for changes anyway. Therefore, you should use an HTML parser -- this is where Jsoup comes in.
First, parse the page into a Document object:
Document doc = Jsoup.parse(htmlString);
You can now perform a number of methods on the Document object to traverse the HTML Nodes. (See Jsoup docs on DOM navigation methods)
For instance, say there is a table on the site, and each row of the table represents a different assignment. The following code gets the table by its ID and each of its rows by selecting the table's tr tags:
Element assignTbl = doc.getElementById("assignmentTable");
Elements tblRows = assignTbl.getElementsByTag("tr");
for (Element tblRow : tblRows) {
    System.out.println(tblRow.html());
}
You will need to somehow view the webpage's source code (such as Inspect Element in Google Chrome) to figure out the page's structure and design your code accordingly. This way, not only would the algorithm be more reliable, but you could take it much further, such as extracting the details of the assignment that has changed. (If you would like assistance, please edit your question with the target page's HTML.)
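As a sketch of "taking it much further": keeping the previously seen rows and set-diffing them against the current rows reports exactly which assignments are new, instead of just "something changed". (The table ID and row structure here are assumptions; adapt them to the real page.)

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AssignmentDiff {
    // Extract one string per table row; in the real app the HTML would
    // come from Jsoup.connect(url).get() instead of a literal.
    static Set<String> rows(String html) {
        Document doc = Jsoup.parse(html);
        Set<String> result = new LinkedHashSet<>();
        Element tbl = doc.getElementById("assignmentTable");
        if (tbl != null) {
            for (Element tr : tbl.getElementsByTag("tr")) {
                result.add(tr.text());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String before = "<table id='assignmentTable'><tr><td>HW1</td></tr></table>";
        String after  = "<table id='assignmentTable'><tr><td>HW1</td></tr><tr><td>HW2</td></tr></table>";
        Set<String> added = new HashSet<>(rows(after));
        added.removeAll(rows(before)); // rows present now but not before
        System.out.println("New assignments: " + added);
    }
}
```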

How to find a web element using an href attribute

It's pretty straightforward. I have a String which contains an href value. I want to find the web element on the page that contains that specific href. I am using:
String href = "http://www.bing.com/images?FORM=Z9LH";
driver.findElement(By.cssSelector("a[href*='" + href + "']")).click();
If you inspect www.bing.com you can see that the first link "images" on the top left has the href value that I set here.
When I run, it says that there is no such element on the page. I feel like I am not using the correct "find by" option.
This should work for you:
driver.findElement(By.cssSelector("li#scpt0>a")).click();
As far as I can see my link appears as such:
href="/?scope=images&FORM=Z9LH1"
Therefore you may try to change your href variable's value. (If you're using Chrome to view the source code, be aware that what you see in the F12 console and what you see with Ctrl+U are different. The Ctrl+U view is usually what web parsers see, before any JS executes or any subrequests are made.)
Also, be aware that search engines (Google, Bing...) usually don't like being parsed, and they will do what they can to make your job impossible: using JS, iframes, changing patterns...
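A quick way to see the mismatch described above is to compare the raw attribute with the resolved URL, here using Jsoup on a snippet shaped like the answer's markup (the id and href are illustrative, not Bing's exact current source):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HrefMismatch {
    public static void main(String[] args) {
        // The href in the served HTML is relative, so a selector matching on the
        // full resolved URL ("http://www.bing.com/images?...") finds nothing.
        Document doc = Jsoup.parse(
                "<li id='scpt0'><a href='/?scope=images&FORM=Z9LH1'>Images</a></li>",
                "http://www.bing.com/");
        Element link = doc.selectFirst("li#scpt0 > a");
        System.out.println("raw attr : " + link.attr("href"));   // what a[href*=...] matches against
        System.out.println("resolved : " + link.absUrl("href")); // what the browser address bar shows
    }
}
```

The same distinction applies in Selenium: By.cssSelector matches the attribute as served, so the substring passed to a[href*='...'] must come from the page source, not from the navigated URL.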

retrieve information from a url

I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from
librarything
How can I retrieve all the words below the "TAGS" tab, like
Black Library fantasy Thanquol & Boneripper Thanquol and Bone Ripper Warhammer ?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag we can see how many times each tag has been used, when we press the "number" button. How can I retrieve that number also?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt (if any) and the website's terms of service (if any), or your server might be IP-banned sooner or later.
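Regarding the follow-up about per-tag usage counts: the same selector approach works, but you must first inspect the live page to find which element actually carries the number. A sketch against hypothetical markup (the .count class and structure here are assumptions, not librarything's real source):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagCounts {
    public static void main(String[] args) {
        // Hypothetical markup: inspect the real page to find the actual
        // element/class that carries each tag's usage count.
        String html = "<div class='tags'>"
                + "<span class='tag'><a>fantasy</a> <span class='count'>12</span></span>"
                + "<span class='tag'><a>Warhammer</a> <span class='count'>7</span></span>"
                + "</div>";
        Document document = Jsoup.parse(html);
        for (Element tag : document.select(".tags .tag")) {
            String name = tag.selectFirst("a").text();
            String count = tag.selectFirst(".count").text(); // the per-tag usage number
            System.out.println(name + ": " + count);
        }
    }
}
```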
I've done this before using PHP with a page scrape, then parsing the HTML as a string using Regular Expressions.
Example here
I imagine there's something similar in java and other languages. The concept would be similar:
Load page data.
Parse the data (i.e. with a regex, or via the DOM model using some CSS or XPath selectors).
Do what you want with the data :)
It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.
