Getting the hyperlinks from a website - Java

I am using Jsoup.
I call document = connect.get(); to fetch the HTML page.
I then write that page to a text string.
Users populate these pages.
I know each user's name, and the pages contain the user names.
I can use string.contains("username") to check whether a user is present or not.
Now my issue is:
The users' names appear in
tables
ordered lists
unordered lists
the body
But in all these cases they have the same format, for example:
<li>2012 academic record</li>
Some are in tables and so on.
In the example I know the student name is john.
How can I get all the URLs?
==

You can use a regex for this:
Elements elements = document.select("[href~=(?is)http://university\\.xxx\\.students\\.com/grade9/(.+?)/[0-9]+?]");
More abstractly: document.select("a[href~=regex]")
If you already know the name, you can put it in place of (.+?), e.g.:
Elements elements = document.select("[href~=(?is)http://university\\.xxx\\.students\\.com/grade9/" + name + "/[0-9]+?]");
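When the name is concatenated into the pattern, any regex metacharacter in it (a dot, a plus sign) changes what is matched, so it may be worth quoting the name first. Below is a minimal sketch that checks the pattern on its own with java.util.regex, outside Jsoup. The class name, method names and sample URLs are made up for illustration, and it assumes the attribute-regex selector applies the pattern with find-style (substring) matching.

```java
import java.util.regex.Pattern;

class HrefRegexCheck {
    // Mirrors the pattern from the selector above, with the name quoted
    // so that characters such as '.' in a name are matched literally.
    static Pattern patternFor(String name) {
        return Pattern.compile(
            "(?is)http://university\\.xxx\\.students\\.com/grade9/"
            + Pattern.quote(name) + "/[0-9]+");
    }

    // find() looks for the pattern anywhere in the value (assumed to mirror
    // how the attribute-regex selector is applied).
    static boolean matches(String name, String href) {
        return patternFor(name).matcher(href).find();
    }

    public static void main(String[] args) {
        // Hypothetical URLs in the format implied by the selector.
        System.out.println(matches("john", "http://university.xxx.students.com/grade9/john/2012"));
        System.out.println(matches("john", "http://university.xxx.students.com/grade9/mary/2012"));
    }
}
```

This only exercises the regex. Whether the quoted form can be embedded unchanged in a selector string is not something the sketch checks.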

How about this:
Document doc = Jsoup.connect(url).get();
Elements links = doc.select("a[href]");
for (Element link : links) {
    if (link.attr("abs:href").contains(studentName) || link.text().contains(studentName)) {
        studentLinkList.add(link.attr("abs:href"));
    }
}
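To try this without a network call, the same loop can be run on an inline document. This is only a sketch: the markup, the base URI http://example.com/ and the class and method names are invented, and the base URI is passed to Jsoup.parse so that abs:href can resolve relative links.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class StudentLinks {
    // Collects absolute URLs of links whose address or text mentions the name.
    static List<String> linksFor(String html, String baseUri, String studentName) {
        Document doc = Jsoup.parse(html, baseUri); // baseUri lets abs:href resolve relative links
        List<String> result = new ArrayList<>();
        for (Element link : doc.select("a[href]")) {
            String url = link.attr("abs:href");
            if (url.contains(studentName) || link.text().contains(studentName)) {
                result.add(url);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up markup: one link in a list, one in a table, one unrelated.
        String html = "<ul><li><a href='/grade9/john/2012'>2012 academic record</a></li></ul>"
                + "<table><tr><td><a href='/files/7'>john's report</a></td></tr></table>"
                + "<p><a href='/grade9/mary/2012'>2012 academic record</a></p>";
        for (String url : linksFor(html, "http://example.com/", "john")) {
            System.out.println(url);
        }
    }
}
```

Because the loop starts from every a[href] in the document, it does not matter whether the link sits in a table, a list or plain body text.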

Related

JSoup Scraping based on custom attributes

So I am using JSoup to scrape a website that creates a bunch of divs with dynamic class names (they change every reload), but the same attribute names. E.g:
<div class="[random text here that changes] js_resultTile" data-listing-number="[some number]">
<div class="a12_regularTile js_rollover_container " itemscope itemtype="http://schema.org/Product" data-listing-number="[same number here]">
<a href...
I've tried multiple approaches to selecting those divs and saving them in elements, but I can't seem to get it right. I've tried by attribute:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[data-listing-number]");
I've tried by class:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.getElementsByClass("a12_regularTile");
And:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = doc.select("div[class*=js_resultTile]");
I've tried another attribute method:
Document doc = Jsoup.connect([theUrl]).get();
Elements myEls = new Elements();
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        if (attribute.getKey().equalsIgnoreCase("data-listing-number")) {
            myEls.add(element);
        }
    }
}
None of these work. I can select the doc that gets me all the HTML, but my myEls object is always empty. What can I use to select these elements?
Are you sure these elements are present in the HTML returned by the server? They may be added later by JavaScript. If JavaScript is involved in rendering the page, Jsoup on its own won't see them, because it does not execute scripts. More details are in my answer to a similar question here: JSoup: Difficulty extracting a single element
And one more tip. Instead of using your for-for-if construction you can use this:
for (Element element : doc.getAllElements()) {
    if (element.dataset().containsKey("listing-number")) {
        myEls.add(element);
    }
}
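One way to check the selector itself is to run it against a copy of the markup held in a string. The sketch below uses an invented snippet modelled on the divs in the question (class names and numbers are made up). If a selector finds the elements here but not on the live page, that points to the markup being added by JavaScript after load.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class ListingTiles {
    // Selects every div that carries a data-listing-number attribute,
    // regardless of its (changing) class names.
    static Elements tiles(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div[data-listing-number]");
    }

    public static void main(String[] args) {
        // Hypothetical markup; class names and numbers are invented.
        String html = "<div class='x9f js_resultTile' data-listing-number='101'>"
                + "<div class='a12_regularTile' data-listing-number='101'><a href='/item/101'>Item</a></div>"
                + "</div><div class='other'>no listing</div>";
        for (Element tile : tiles(html)) {
            // dataset() exposes data-* attributes with the "data-" prefix removed
            System.out.println(tile.dataset().get("listing-number"));
        }
    }
}
```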

jsoup - how to obtain links from a text of an article in Wikipedia

I have just started to explore Jsoup and faced the following problem: when I'm trying to extract links from https://en.wikipedia.org/wiki/Knowledge that belong only to the English version of Wikipedia everything works correctly.
Document document = Jsoup.connect("https://en.wikipedia.org/wiki/Knowledge").timeout(6000).get();
Elements linksOnPage = document.select("a[href^=\"/wiki/\"]");
for (Element link : linksOnPage) {
    System.out.println("link : " + link.attr("abs:href"));
}
However I'm also getting the links that do not belong to the text of the current article such as:
link : https://en.wikipedia.org/wiki/Main_Page
link : https://en.wikipedia.org/wiki/Portal:Contents
link : https://en.wikipedia.org/wiki/Portal:Featured_content
link : https://en.wikipedia.org/wiki/Portal:Current_events
link : https://en.wikipedia.org/wiki/Special:Random
link : https://en.wikipedia.org/wiki/Help:Contents
link : https://en.wikipedia.org/wiki/Wikipedia:About
link : https://en.wikipedia.org/wiki/Wikipedia:Community_portal
What is the proper way to get only the links from the text leading to other Wikipedia articles with Jsoup?
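The links that I do not need are located in the div with id="mw-panel".
Therefore a first selector to try would be:
div:not(#mw-panel) a[href^="/wiki/"]
This selects <a> elements that:
have an enclosing <div> other than the one with the mw-panel ID
and whose href attribute starts with "/wiki/".
Note that a panel link nested in any other <div> still matches, so this does not exclude every link in the panel.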
EDIT:
I need only the links from an article, without links from the side panels and without any links such as https://en.wikipedia.org/wiki/Special:BookSources/978-1-4200-5940-3 and https://en.wikipedia.org/wiki/Special:BookSources/1-58450-460-9
Then you may try:
#bodyContent a[href^="/wiki/"]
This will select links that:
are inside the article (the <div> with the ID bodyContent)
and whose href attribute starts with "/wiki/".
div#bodyContent does not have "/wiki/...Special:..." links. (If you want to exclude links containing some other word, append this to the end of the above selector without any space or separator: :not([href*="something"]))
You can also combine selectors to find the pattern that works best, based on my attempts above and on the Jsoup selector documentation.
Example code:
String url = "https://en.wikipedia.org/wiki/Knowledge";
Document document = Jsoup.connect(url).timeout(6000).get();
Elements links = document.select("#bodyContent a[href^=\"/wiki/\"]");
for (Element e : links) {
    System.out.println(e.attr("href"));
}
System.out.println("Links found: " + links.size());
This prints out the following:
/wiki/Knowledge_(disambiguation)
/wiki/Fact
/wiki/Information
...
/wiki/Category:Articles_with_unsourced_statements_from_September_2007
/wiki/Category:Articles_with_unsourced_statements_from_May_2009
/wiki/Category:Wikipedia_articles_with_GND_identifiers
Links found: 826
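The selectors discussed in this answer can be compared on a small inline document. This is a sketch under assumptions: the page skeleton below is invented and much simpler than a real Wikipedia page, and only the two IDs (mw-panel, bodyContent) are taken from the thread.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class ArticleLinks {
    // Hypothetical page skeleton: a side panel and an article body.
    static final String HTML =
        "<div id='mw-panel'><div class='portal'><a href='/wiki/Main_Page'>Main page</a></div></div>"
        + "<div id='bodyContent'>"
        + "<a href='/wiki/Fact'>Fact</a>"
        + "<a href='/wiki/Special:BookSources/1'>ISBN</a>"
        + "<a href='http://example.org/'>external</a>"
        + "</div>";

    static Elements select(String css) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(css);
    }

    public static void main(String[] args) {
        // The panel link is nested in another div, so it still matches here.
        System.out.println(select("div:not(#mw-panel) a[href^=\"/wiki/\"]").size());
        // Restricting to the article container drops the panel link.
        System.out.println(select("#bodyContent a[href^=\"/wiki/\"]").size());
        // Appending :not(...) also drops the Special: link.
        System.out.println(select("#bodyContent a[href^=\"/wiki/\"]:not([href*=\"Special:\"])").size());
    }
}
```

Selecting from a known container, as in the second and third calls, tends to be easier to reason about than excluding one container with :not.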

jSoup get data using td-class tags from webpage

I would like to get data from http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104 using jSoup. I know how to use jSoup - but I am finding it difficult to pinpoint the data that I need.
I would like the Time, Home Team and Away Team from each row of the tbody table. So the output from the first row should be:
08:30 Persipura Jayapura Pelita Bandung Raya
I can see the td class of each of these elements as "status alt", "home" and "guest".
Currently I have tried the below, but it doesn't seem to output anything... what am I doing wrong?
matches = new ArrayList<Match>();
//getHistory
String website = "http://www.futbol24.com/Live/?__igp=1&LiveDate=20141104";
Document doc = Jsoup.connect(website).get();
Element tblHeader = doc.select("tbody").first();
List<Match> data = new ArrayList<>();
// each child of the tbody is one table row
for (Element element1 : tblHeader.children()) {
    Match match = new Match();
    match.setTimeOfMatch(element1.select("td.status.alt").text());
    match.setHomeTeam(element1.select("td.home").text());
    match.setAwayTeam(element1.select("td.guest").text());
    data.add(match);
}
System.out.println(data.toString());
Does anybody know how I can use jSoup to get these elements from each row of the table?
Thanks,
Rob
The content of this site seems to be generated via AJAX. Jsoup can't handle this, since it is not a browser and does not execute JavaScript. To scrape such a page you may need something like Selenium WebDriver. I gave a longer answer to a more general question about this before, so please look here:
Jsoup get dynamically generated HTML

Getting text from a website using JSoup

I’m working with JSoup to parse the html website.
I want to get the article from (for example) Wikipedia.
I would like to get the text from the main page (http://en.wikipedia.org/wiki/Main_Page) from the table “From today’s featured article”.
Here’s the code:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page");
Elements el = doc.select("div.mp-tfa");
System.out.println(el);
The problem is that it doesn’t work properly - it prints out just a blank line.
The “From today’s featured article” table is inserted in div class=“mp-tfa”.
How to get this text in my java program?
Thanks in advance.
Change:
doc.select("div.mp-tfa");
To:
doc.select("div#mp-tfa");
A better way would be to iterate over the retrieved Elements and print the text of each one:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
for (Element e : el) {
    System.out.println(e.text());
}
Would give:
The Boulonnais is a heavy draft horse breed from Fr....
I think it's supposed to be:
Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Main_Page").get();
Elements el = doc.select("div#mp-tfa");
System.out.println(el);
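The difference between the two selectors in the answers above can be seen on a tiny inline document: a dot selects by class, a hash selects by id. A minimal sketch, with made-up markup standing in for the real page and an invented class name:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class IdVersusClass {
    // Returns the combined text of all elements matched by the selector.
    static String textOf(String html, String css) {
        Document doc = Jsoup.parse(html);
        return doc.select(css).text();
    }

    public static void main(String[] args) {
        // Made-up stand-in for the page: the container is identified by id, not class.
        String html = "<div id='mp-tfa'><p>Sample featured text</p></div>";
        System.out.println("[" + textOf(html, "div.mp-tfa") + "]"); // class selector: nothing matches
        System.out.println("[" + textOf(html, "div#mp-tfa") + "]"); // id selector: matches the div
    }
}
```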

Extract links from document jsoup containing some string to other string

I use Jsoup to extract links from a website. I want to extract only the links that contain certain keywords, for example the keyword "download". How can I do it? I have the following code:
Document doc = Jsoup.connect("http://www.examplesite.com").get();
Element link = doc.select("a").first();
See here for the selector syntax.
You can test for the text within a node with :contains, e.g. Element link = doc.select("a:contains(Download)").first();. If you want you can use :matches for regex.
You get the link address via the attr method, e.g. String linkaddress = link.attr("href");.
You can use attribute selectors:
[attr^=value] matches elements whose attribute starts with the value, [attr$=value] those whose attribute ends with it, and [attr*=value] those whose attribute contains it, e.g. [href*=/path/]
To get the links containing a certain word, use this:
org.jsoup.select.Elements links = doc.select("[href*=download]");
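The text-based and attribute-based selectors from the answers above pick different links, so it can help to see them side by side. A small sketch on invented markup (the URLs, link texts and class name are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

class DownloadLinks {
    // Returns the href of every element matched by the selector.
    static List<String> hrefs(String html, String css) {
        List<String> out = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select(css)) {
            out.add(a.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href='/files/setup.exe'>Download</a>"
                + "<a href='/download/manual.pdf'>Manual</a>"
                + "<a href='/about'>About</a>";
        // Matches on the visible link text.
        System.out.println(hrefs(html, "a:contains(Download)"));
        // Matches on the address in the href attribute.
        System.out.println(hrefs(html, "a[href*=download]"));
        // A comma combines both conditions (either one may match).
        System.out.println(hrefs(html, "a:contains(Download), a[href*=download]"));
    }
}
```

Which one fits depends on whether the keyword is expected in the link text or in the address.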
