I am starting off from a website's homepage. I parse the entire page, collect all the links on it, and put them in a queue. Then I remove each link from the queue and repeat the process until I find the text I want. However, if I encounter a link like youtube.com/something, I end up following all the links on YouTube. I want to restrict this.
I want to crawl within the same domain only. How do I do that?
private void crawler() throws IOException {
    while (!q.isEmpty()) {
        String link = q.remove();
        System.out.println("------" + link);
        Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(0).get();
        // Stop once the target text is found
        if (doc.text().contains("publicly intoxicated behavior or persistence")) {
            System.out.println("************ On this page ******************");
            System.out.println(doc.text());
            return;
        }
        // Queue every absolute link found on the page
        Elements links = doc.select("a[href]");
        for (Element link1 : links) {
            String absUrl = link1.attr("abs:href");
            if (absUrl == null || absUrl.length() == 0) {
                continue;
            }
            q.add(absUrl);
        }
    }
}
This article shows how to write a web crawler. The following line forces all crawled links to stay on the mit.edu domain:
if (link.attr("href").contains("mit.edu"))
There may be a bug with that line, since relative URLs won't contain the domain. Adding the abs: prefix should work better:
if (link.attr("abs:href").contains("mit.edu"))
I am currently building an article content extraction application using Jsoup and Java. My problem is that when I scrape an article, Jsoup tends to return a flat list of Elements rather than preserving the order of the article. For example, a normal article with more than one image could have an order like this: (title, sapo, image, paragraph, image, paragraph, paragraph, image, paragraph). So how can I scrape the main content of the website (text and image links) without losing its order?
Below is my idea for doing it, but it doesn't work:
int cur = 0;
Document doc = Jsoup.connect(url).get();
Elements elements = doc.select("div");
for (Element element : elements) {
    if (element.select("div[type=\"Photo\"] img").hasAttr("src")) {
        Elements temp = element.select("div[type=\"Photo\"] img");
        System.out.println(temp.get(cur).attr("src"));
        cur++;
    }
    System.out.println(element.select("p span").text());
    System.out.println("");
}
If you wanted to extract the article data from the sites that you linked to in the comment, you could do something like this:
Document doc = Jsoup.connect(url).get();
// Full article
Elements elements = doc.select("div.sidebar-1");
System.out.println("## Article title:");
System.out.println(elements.select("h1.title-detail").text());
System.out.println("## Article summary:");
System.out.println(elements.select("p.description").text());
// Images and paragraphs, in document order
for (Element e : elements.select("article.fck_detail p,figure")) {
    if (e.is("p")) {
        System.out.println("## Paragraph");
        System.out.println(e.text());
    } else {
        System.out.println("## Image (image URL)");
        System.out.println(e.select("img[src]").attr("src"));
    }
}
The idea is this:
- find the outermost container that holds the full article
- extract the title and the summary
- loop through the image (figure) and paragraph (p) elements of the article; the order is preserved automatically, because select returns elements in document order
public void conectUrl() throws IOException, InterruptedException {
    product = new ArrayList<>();
    String url = "https://www.continente.pt/stores/continente/pt-pt/public/pages/category.aspx?cat=campanhas#/?page=1&sf=Revelance";
    page = Jsoup.connect(url).userAgent("JSoup scraper").get();
    // get the current page number
    Elements paginaAtu = page.getElementsByClass("_actualPage");
    paginaAtual = Integer.parseInt(paginaAtu.attr("value"));
    // get the total number of pages
    Elements nextPage = page.getElementsByClass("_actualTotalPages");
    numPaginas = Integer.parseInt(nextPage.attr("value"));
    for (paginaAtual = 1; paginaAtual < numPaginas; paginaAtual++) {
        getProductInfo("https://www.continente.pt/stores/continente/pt-pt/public/pages/category.aspx?cat=campanhas#/?page=" + paginaAtual + "&sf=Revelance");
    }
}
It always returns the same result for different URLs. I have already searched about Jsoup caching; I am not the first person to ask this question, but nobody explains how to resolve the situation. In theory, Jsoup doesn't cache pages...
I already made the code sleep for 30 seconds before loading the new URL, but it still returns the same result.
Can anybody help me? Thank you in advance.
I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is); the text just needs to be non-empty, i.e. not an image-only link.
Example of links I want:
<a href="...">Link to Some Page</a>
since it contains the text "Link to Some Page".
Links I don't want:
<a href="..."><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this. It does its job, though it's probably not the fanciest solution out there.
Note: the text() method returns clean text, so if there are any HTML code fragments inside the element, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // skip anchors whose visible text is empty (e.g. image-only links)
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
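For completeness, a small variant of my own (an assumption, not from the original answers) that combines both requirements in a single selector: a[href] requires the attribute, and :matches(\S) requires the anchor's text to contain at least one non-whitespace character:

Document document = // I get my document object
// a[href] requires the attribute; :matches(\\S) requires non-blank text
Elements linksOnPage = document.select("a[href]:matches(\\S)");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}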
I have this web page https://rrtp.comed.com/pricing-table-today/ and from it I need to get the information from the Time (Hour Ending) and Day-Ahead Hourly Price columns alone. I tried the following code:
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        if (tds.size() > 2) {
            System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
        }
    }
}
but unfortunately I am unable to get the data I need.
Is there something wrong with the code, or can this page not be crawled? I need some help.
As I said in a comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717, because that is the source from which the data is loaded on the page you pointed to.
The data under that link is not a valid HTML document (and this is why it's not working for you), but you can easily make it "quite" right.
All you have to do is get the response and wrap it in <table>...</table> tags; then it can be parsed as an HTML document:
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}
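Building on that, here is a sketch of how you might pull out just the two columns the question asked for. The cell positions (hour ending in the first td, day-ahead price in the second) are an assumption about the feed's layout, so verify them against the actual response:

Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element row : doc.select("tr")) {
    Elements tds = row.select("td");
    // Assumption: hour ending in the first cell, day-ahead price in the second
    if (tds.size() >= 2) {
        System.out.println(tds.get(0).text() + " : " + tds.get(1).text());
    }
}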
How can I get the icon path from an HTML string using Jsoup?
I have found two different ways that a favicon can be added to a webpage (for example, on Google). The first kind I can get using doc.select("html head meta"), but I can't select the <link> tag.
Get the icon file name from the head element:
Connection con2 = Jsoup.connect(url);
Document doc = con2.get();
// link tags whose href ends in .ico; note that first() returns null if there is no match
Element e = doc.head().select("link[href~=.*\\.ico]").first();
String iconUrl = e.attr("href");
http://jsoup.org/cookbook/extracting-data/attributes-text-html
http://jsoup.org/cookbook/extracting-data/selector-syntax
As Uwe Plonus pointed out in the comments, you can always get the favicon from <website>/favicon.ico, e.g. Google's at https://www.google.com/favicon.ico.
It's pretty late to submit an answer, but the correct way to check is via the "rel" attribute:
public boolean checkFavicon() {
    // the favicon is usually declared as <link rel="shortcut icon" ...>
    Elements e = doc.head().select("link[rel=shortcut icon]");
    return !e.isEmpty();
}
The jQuery equivalent:
$("link[rel='shortcut icon']")