Get specific information from Wikipedia Information Box

I'm trying to get the details of the latest release in the information box on the right side. I'm trying to retrieve "6.2 (Build 9200) / August 1, 2012; 7 years ago" from the box by scraping this page using jsoup.
I have code that pulls all data from the box but I can't figure out how to pull the specific part of the box.
org.jsoup.Connection.Response res = Jsoup.connect("https://en.wikipedia.org/wiki/Windows_Server_2012").execute();
String html = res.body();
Document doc2 = Jsoup.parseBodyFragment(html);
Element body = doc2.body();
Elements tables = body.getElementsByTag("table");
for (Element table : tables) {
    if (table.className().contains("infobox")) {
        System.out.println(table.outerHtml());
        break;
    }
}

You can query for the table row that contains a link that ends with Software_release_life_cycle:
String url = "https://en.wikipedia.org/wiki/Windows_Server_2012";
try {
    Document document = Jsoup.connect(url).get();
    Elements elements = document.select("tr:has([href$=Software_release_life_cycle])");
    for (Element element : elements) {
        System.out.println(element.text());
    }
} catch (IOException e) {
    // exception handling
}
This works because, looking at the full HTML, I found that the row you need (and only the row you need, which is a vital detail) contains a link of this form. In fact, elements will contain only one Element.
Finally you extract only the text. This code will print:
Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]
If you need even more refinement you can always substring it.
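The substring step could look something like the sketch below. It uses only the standard library and assumes that the row text always starts with the "Latest release" label and may end with a footnote marker such as [2]; the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

final class ReleaseText {
    // Assumption: the infobox row text starts with this label.
    private static final String LABEL = "Latest release";
    // Matches one or more trailing footnote markers such as "[2]".
    private static final Pattern FOOTNOTES = Pattern.compile("(\\s*\\[\\d+\\])+\\s*$");

    /** Strips the leading label and any trailing footnote markers from the row text. */
    static String refine(String rowText) {
        String text = rowText.trim();
        if (text.startsWith(LABEL)) {
            text = text.substring(LABEL.length());
        }
        return FOOTNOTES.matcher(text).replaceFirst("").trim();
    }

    public static void main(String[] args) {
        String row = "Latest release 6.2 (Build 9200) / August 1, 2012; 7 years ago (2012-08-01)[2]";
        // Prints the row text without the label and without the "[2]" marker.
        System.out.println(refine(row));
    }
}
```

This keeps the trailing "(2012-08-01)" part; cutting at the last " (" would remove it as well if you only want the human-readable date.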
Hope I helped!
( selector syntax reference )

Related

JSOUP - Extract Data from a Dynamic Web Page

I tried to extract the price. Can anyone please help me? There is no output for the price or the weight. I've tried several ways, but none of them produced results.
Document doc = Jsoup.connect("https://www.jakmall.com/tokocamzone/mi-travel-charger-20a-output-fast-charging#9730928979371").get();
Elements rows = doc.getElementsByAttributeValue("class", "div[dp__price dp__price--2 format__money]");
System.out.println("rows.size() = " + rows.size());
String index = "";
for (Element span : rows) {
index = span.text();
}
System.out.println("index = " + index);
I've tried another way as well, but I did not get the result. I'm very curious, but I haven't found the right way to do it.

If you run the lines of code below, you will discover that there is no price element and no div with the classes dp__price dp__price--2 format__money in the DOM. There is only JavaScript.
String d = doc.getElementsByClass("dp__header__info").outerHtml();
System.out.println(d);
Jsoup cannot fetch the price because the content is loaded dynamically after the page loads. Consider using Selenium, which is more powerful and supports JavaScript-driven websites.

Real time web crawling using Jsoup

I have this web page https://rrtp.comed.com/pricing-table-today/ and from that I need to get the information about Time (Hour Ending) and Day-Ahead Hourly Price column alone. I tried with the following code,
Document doc = Jsoup.connect("https://rrtp.comed.com/pricing-table-today/").get();
for (Element table : doc.select("table.prices three-col")) {
    for (Element row : table.select("tr")) {
        Elements tds = row.select("td");
        if (tds.size() > 2) {
            System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
        }
    }
}
but unfortunately I am unable to get the data I need.
Is there something wrong in the code, or can this page not be crawled?
I need some help.

As I said in the comment:
You should request https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717 because that is the source from which the data on the page you pointed to is loaded.
The data under this link is not a valid HTML document (which is why it's not working for you), but you can easily make it valid enough.
All you have to do is get the response, wrap it in <table>..</table> tags, and then parse it as an HTML document.
Connection.Response response = Jsoup.connect("https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=20150717").execute();
Document doc = Jsoup.parse("<table>" + response.body() + "</table>");
for (Element element : doc.select("tr")) {
    System.out.println(element.html());
}
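The feed URL above hard-codes date=20150717. If you want the feed for another day, a small helper like the following sketch could build the URL. It assumes the date parameter is always in yyyyMMdd form (inferred only from the single example URL above), and the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

final class FeedUrl {
    // Base URL taken from the answer above; the date value is appended to it.
    private static final String BASE =
            "https://rrtp.comed.com/rrtp/ServletFeed?type=pricingtabledual&date=";

    /** Builds the feed URL for a given day. Assumption: the feed expects yyyyMMdd. */
    static String forDate(LocalDate day) {
        // BASIC_ISO_DATE formats a LocalDate as yyyyMMdd, e.g. 20150717.
        return BASE + day.format(DateTimeFormatter.BASIC_ISO_DATE);
    }

    public static void main(String[] args) {
        System.out.println(forDate(LocalDate.of(2015, 7, 17)));
        System.out.println(forDate(LocalDate.now()));
    }
}
```

The resulting string can be passed to Jsoup.connect(...) in place of the literal URL.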

Modifying an html tag's own text in Java using JSoup

So yeah, suppose I have this piece of HTML
<p>And finally, how about some <a href="http://www.yahoo.com/">Links?</a></p>
and I want to access and modify the "And finally, how about some" part only, and get this:
<p>new text <a href="http://www.yahoo.com/">Links?</a></p>
I can't seem to figure out how. Here's what I've tried so far:
Document doc = null;
try {
    doc = Jsoup.connect("http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html").userAgent("Mozilla").get();
} catch (IOException e1) {
    e1.printStackTrace();
}
Elements d = doc.body().children();
Element e = d.get(20); // Assuming the HTML line in question is found at index 20
e.text("new text"); // just outputs <p>new text</p>, which is not good for me
It seems that I can access it by
Element e = d.get(20);
System.out.println("\n"+e.ownText()); //outputs: And finally, how about some
but modifying it doesn't work.
Element e = d.get(20);
String s = e.toString().replace(e.ownText(), "new text");
e.text(s);
System.out.println(e.toString());
The output for the code above is
<p>&lt;p&gt;changed &lt;a href="http://www.yahoo.com/"&gt;Links?&lt;/a&gt;&lt;/p&gt;</p>
It seems to be treating the tags as literal text, escaping < and > as &lt; and &gt;, but I need them as real < and > because I then have to rebuild the web page with the new text.
Any kind of help will be hugely appreciated.

How about something like
Element e = d.get(20);
e.text("new text");
e.append("<a href=\"http://www.yahoo.com/\">Links?</a>"); // append() lets you add HTML
If the link is dynamic and you don't want to hard-code it, you can store it first and append it again later:
Element e = d.get(20);
Element link = e.child(0);
e.text("new text");
e.append(link.toString());
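As an alternative to clearing the element and re-appending the link, you could edit the element's direct text node in place. This is a different technique from the one above, sketched under the assumption that the text you want to change is the first non-blank text node directly inside the element. It relies on jsoup's Element.textNodes() and TextNode.text(String); the helper name replaceOwnText is made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

final class OwnTextEdit {
    /** Replaces the first non-blank direct text node, leaving child elements untouched. */
    static void replaceOwnText(Element element, String newText) {
        for (TextNode node : element.textNodes()) {
            if (!node.isBlank()) {
                node.text(newText);
                return;
            }
        }
        // The element has no direct text yet: add it in front of any children.
        element.prependText(newText);
    }

    public static void main(String[] args) {
        // Inline sample markup so the sketch runs without a network connection.
        Document doc = Jsoup.parse(
                "<p>And finally, how about some <a href=\"http://www.yahoo.com/\">Links?</a></p>");
        Element p = doc.selectFirst("p");
        replaceOwnText(p, "new text ");
        System.out.println(p.outerHtml());
    }
}
```

Note the trailing space in "new text ", which keeps the new text separated from the link that follows it.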

adding text before and after a link jSoup

I've just started learning Jsoup with the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try {
    Document doc = Jsoup.connect(url).get();
    Element add = doc.prependText("a href");
    Elements links = add.select("a[href]");
    for (Element link : links) {
        PrintStream sb = System.out.format("%n %s", link.attr("abs:href"));
        System.out.print("<br>");
    }
} catch (Exception e) {
    System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I get all the links off the page, but I also want to get the opening <a href> and closing </a> tags so I can then create my own webpage. I've tried adding a string and prepending text, but I just can't seem to get it right.
Thanks

With link.attr(...) you get only the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for (Element e : doc.select("a[href]")) { // Select all 'a' tags with an 'href' attribute
    String wholeTag = e.toString(); // Get the element as an HTML string
    /* Now you can use the HTML, in this example for simple output */
    System.out.println(wholeTag);
}
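If you would rather build your own anchor tags from the absolute URLs (closer to the output the question asks for) than reuse each element's original markup, a standard-library-only sketch could look like this. The class and method names are hypothetical, and the escaping covers only the four characters that matter in HTML text and double-quoted attribute values.

```java
import java.util.List;

final class LinkList {
    /** Escapes the characters that are special in HTML text and attribute values. */
    static String escape(String s) {
        // '&' must be replaced first so the other entities are not escaped twice.
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Wraps one URL in an anchor tag followed by a line break. */
    static String anchor(String url) {
        String safe = escape(url);
        return "<a href=\"" + safe + "\">" + safe + "</a><br>";
    }

    public static void main(String[] args) {
        // URLs taken from the example run in the question.
        List<String> urls = List.of(
                "http://www.google.ie/imghp?hl=en&tab=wi",
                "http://maps.google.ie/maps?hl=en&tab=wl");
        for (String url : urls) {
            System.out.println(anchor(url));
        }
    }
}
```

In the loop from the answer, you would pass e.attr("abs:href") to anchor(...) instead of printing e.toString().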

Using Jsoup to extract data

I am using jsoup to extract data from a table on a website: http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1. I have referred to "Using JSoup To Extract HTML Table Contents" and other similar questions, but my code does not print the data. Could someone please provide the code required to achieve this?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestClass
{
    public static void main(String args[]) throws IOException
    {
        Document doc = Jsoup.connect("http://www.moneycontrol.com/stocks/marketstats/gainerloser.php?optex=BSE&opttopic=topgainers&index=-1").get();
        for (Element table : doc.select("table.tablehead")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                if (tds.size() > 6) {
                    System.out.println(tds.get(0).text() + ":" + tds.get(1).text());
                }
            }
        }
    }
}

If you want to get the content of the table (not its head), you need to change the table selector:
for (Element table : doc.select("table.tbldata14"))
instead of
for (Element table : doc.select("table.tablehead"))

One important thing is to check what you actually get in doc when you parse the HTML, because there can be a few problems with it:
1. The site might use iframes to display content.
2. The content might be rendered via JavaScript.
3. A few sites have scripts that block jsoup parsing, so the doc element will contain random data.
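A quick way to run these checks is to count what the parsed document actually contains before writing any extraction logic. The sketch below works on a hypothetical inline HTML string so it runs without a network connection; in practice you would pass the Document returned by Jsoup.connect(...).get(). The class and method names are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

final class PageCheck {
    /** Summarises how often the selector matches and how many iframes and scripts exist. */
    static String describe(Document doc, String selector) {
        return "matches=" + doc.select(selector).size()
                + ", iframes=" + doc.select("iframe").size()
                + ", scripts=" + doc.select("script").size();
    }

    public static void main(String[] args) {
        // Hypothetical markup: the table is missing from the static HTML and is
        // presumably filled in later by the iframe or the script.
        String html = "<html><body><div id=\"app\"></div>"
                + "<iframe src=\"/table.html\"></iframe>"
                + "<script src=\"/app.js\"></script></body></html>";
        Document doc = Jsoup.parse(html);
        System.out.println(describe(doc, "table.tbldata14"));
    }
}
```

Zero matches combined with iframes or scripts in the page is a hint that the data is loaded from somewhere else and that a different URL or a browser-based tool is needed.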
