I have this web page https://rrtp.comed.com/pricing-table-today/ and from it I need to get the Time (Hour Ending) and Day-Ahead Hourly Price columns alone. I tried with the following code:
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        if (tds.size() > 2) {
            System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
        }
    }
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code, or can this page not be crawled?
I need some help.
As I said in a comment:
You should hit https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717 because it's the source from which the data is loaded on the page you pointed to.
The data at that link is not a valid HTML document (and this is why it's not working for you), but you can easily make it close enough.
All you have to do is get the response, wrap it in <table>..</table> tags, and then it can be parsed as an HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");

for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}
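For the two columns asked about (Time and Day-Ahead Hourly Price), a minimal sketch built on the snippet above could look like this; the column positions (0 for the hour, 1 for the day-ahead price) are taken from your original code and may need checking against the actual feed layout:

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ComedFeedExample {
    public static void main(String[] args) throws Exception {
        // Fetch the raw feed and wrap it in a <table> so Jsoup can parse the rows.
        Connection.Response response = Jsoup
                .connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717")
                .execute();
        Document doc = Jsoup.parse("<table>" + response.body() + "</table>");

        for (Element row : doc.select("tr")) {
            Elements tds = row.select("td");
            if (tds.size() >= 2) {
                // Assumed layout: hour ending in the first cell, day-ahead price in the second.
                System.out.println(tds.get(0).text() + " : " + tds.get(1).text());
            }
        }
    }
}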
I am trying to extract the price, but there is no output for the price or its weight. I've tried several ways but got no results. Can anyone please help me?
Document doc = Jsoup.connect("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371").get();
Elements rows = doc.getElementsByAttributeValue("class", "div[dp__price dp__price--2 format__money]");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
    index = span.text();
}
System.out.println("index = " + index);
I've also tried another way, but I did not get a result. I am very curious but have not found the right approach.
If you run the line of code below you will discover that there is no price or div[dp__price dp__price--2 format__money] in the DOM. There is only JavaScript.
String d = doc.getElementsByClass("dp__header__info").outerHtml();
System.out.println(d);
Jsoup is not able to fetch the price because the content is loaded dynamically after the page loads. Consider using Selenium, which is more powerful and supports JavaScript-driven websites.
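For reference, a rough Selenium sketch could look like the following; it assumes Selenium 4 with ChromeDriver installed, and the div.dp__price selector is only taken from the class names in your question, so it may have changed on the live site:

import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class PriceWithSelenium {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging");
            // Wait until the JavaScript-rendered price element becomes visible.
            WebElement price = new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.visibilityOfElementLocated(
                            By.cssSelector("div.dp__price")));
            System.out.println("price = " + price.getText());
        } finally {
            driver.quit();
        }
    }
}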
I'm using JSoup to grab content from web pages.
I want to get all the links on a page that have some contained text (it doesn't matter what the text is); it just needs to be non-empty, not just an image, etc.
Example of links I want:
<a href="somepage.html">Link to Some Page</a>
Since it contains the text "Link to Some Page"
Links I don't want:
<a href="someotherpage.html"><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the text() function gets you the clean text, so if there are any HTML fragments inside the element, it won't return them.
Document document = // get the document
Elements linksOnPage = document.select("a");
for (Element pageElem : linksOnPage) {
    if (pageElem.text().trim().equals("")) {
        continue; // skip links with no visible text
    }
    String link = pageElem.attr("abs:href");
    // do smth with it
}
I am using this and it's working fine:
Document document = // I get my document object
// :matches(...) keeps only links whose text contains at least one non-whitespace character
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
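Another option, as a small sketch: Jsoup's Element.hasText() reports whether an element contains any non-blank text, so image-only links can be skipped without a regex (the URL below is just a placeholder):

Document document = Jsoup.connect("https://example.com/").get(); // placeholder page
for (Element page : document.select("a[href]")) {
    if (!page.hasText()) {
        continue; // skip links whose content is only an image or whitespace
    }
    String link = page.attr("abs:href");
    System.out.println(link + " -> " + page.text());
}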
I am starting off from a website's homepage. I parse the entire page, collect all the links on that homepage, and put them in a queue. Then I remove each link from the queue and do the same thing until I find the text that I want. However, if I get a link like youtube.com/something, then I go on to all the links on YouTube. I want to restrict this.
I want to crawl within the same domain only. How do I do that?
private void crawler() throws IOException {
    while (!q.isEmpty()) {
        String link = q.remove();
        System.out.println("------" + link);
        Document doc = Jsoup.connect(link).ignoreContentType(true).timeout(0).get();
        if (doc.text().contains("publicly intoxicated behavior or persistence")) {
            System.out.println("************ On this page ******************");
            System.out.println(doc.text());
            return;
        }
        Elements links = doc.select("a[href]");
        for (Element link1 : links) {
            String absUrl = link1.attr("abs:href");
            if (absUrl == null || absUrl.length() == 0) {
                continue;
            }
            // System.out.println(absUrl);
            q.add(absUrl);
        }
    }
}
This article shows how to write a web crawler. The following line forces all crawled links to be on the mit.edu domain.
if(link.attr("href").contains("mit.edu"))
There might be a problem with that line, since relative URLs won't contain the domain. I suggest that adding abs: might be better:
if(link.attr("abs:href").contains("mit.edu"))
I've just started learning Jsoup and the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try {
    Document doc = Jsoup.connect(url).get();
    Element add = doc.prependText("a href");
    Elements links = add.select("a[href]");
    for (Element link : links) {
        PrintStream sb = System.out.format("%n %s", link.attr("abs:href"));
        System.out.print("<br>");
    }
} catch (Exception e) {
    System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page, but I want to also get the <a href> and <br> tags so I can then create my own web page. I've tried adding a string and prepending text, but I just can't seem to get it right.
Thanks
With link.attr(...) you only get the attribute value,
but you need the whole tag:
Document doc = Jsoup.connect(...).get();
for (Element e : doc.select("a[href]")) // Select all 'a' tags with an 'href' attribute
{
    String wholeTag = e.toString(); // Get the element as a string, i.e. its outer HTML
    /* Now you can use the html - in this example for a simple output */
    System.out.println(wholeTag);
}
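If the goal is to build your own page rather than reuse the original tags, a small sketch like the following writes each link out as its own anchor; the exact output format is an assumption based on the desired output shown in the question:

// Build a simple HTML fragment with one anchor per link, each followed by <br>.
StringBuilder html = new StringBuilder();
for (Element e : doc.select("a[href]")) {
    String url = e.attr("abs:href");
    html.append("<a href=\"").append(url).append("\">")
        .append(url)
        .append("</a><br>\n");
}
System.out.println(html);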
I am using Jsoup to extract data from a table on this website: http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1. I have referred to Using JSoup To Extract HTML Table Contents and other similar questions, but it does not print the data. Could someone please provide me with the code required to achieve this?
public class TestClass
{
    public static void main(String args[]) throws IOException
    {
        Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
        for (Element table : doc.select("table.tablehead")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                if (tds.size() > 6) {
                    System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
                }
            }
        }
    }
}
If you want to get the content of the table (not the head), you need to change the table selector:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))
One important thing is to check what you are getting in doc when you parse the HTML, because there can be a few problems with it, like:
1. The site might be using iframes to display the content.
2. The content might be rendered via JavaScript.
3. A few sites have scripts that do not allow Jsoup parsing, so the doc element will contain random data.