how to web-scrape from a table?

I am working on a project where I have to web-scrape from the site https://lite.ip2location.com.
The landing page shows a number of divs, one per country. Clicking one of them takes the browser to a page on the same site that contains a table with a thead and a tbody. I need the rows in the tbody, but for some reason I only get the contents of the thead.
This is my code:
public static void main(String[] args) {
    final String url = "https://lite.ip2location.com/ip-address-ranges-by-country";
    try {
        final Document document = Jsoup.connect(url).get();
        for (Element element : document.select("div.card-columns div")) {
            Elements link = element.select("a");
            String redirectUrl = "https://lite.ip2location.com" + link.attr("href");
            final Document redirectDoc = Jsoup.connect(redirectUrl).get();
            Element table = redirectDoc.select("table").get(0);
            for (Element row : table.select("tbody tr")) {
                System.out.println(row.text());
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Jsoup only parses the HTML it downloads from a URL; it does not execute JavaScript or fetch additional resources such as JS or CSS files.
On the page with the IP address ranges for a country, the data is JSON that the browser loads asynchronously and then uses to populate the table.
So with final Document redirectDoc = Jsoup.connect(redirectUrl).get(); you get an HTML page that contains only the empty template for the table, like this:
<div class="row my-5" style="min-height:500px;">
<div class="col table-responsive">
<table id="ip-address" class="table table-striped table-hover">
<thead>
<tr>
<th width="30%" class="no-sort">Begin IP Address</th>
<th width="30%" class="no-sort">End IP Address</th>
<th width="40%" class="text-right no-sort">Total Count</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
</div>
</div>
That fragment is exactly what your code parses, which is why you only see the thead.
Here is one possible way to get the ranges.
The IP address ranges for Zimbabwe, for example, are served from a URL like https://cdn-lite.ip2location.com/datasets/ZW.json . The file name is the country's ISO 3166-1 alpha-2 code (ZW for Zimbabwe).
The codes for the available countries can be extracted from the https://lite.ip2location.com/ip-address-ranges-by-country page, where each country's <p class="card-text"> tag contains a span tag that draws the flag.
The second class of that span ends with the country code (flag-icon-ba):
<div class="card" style="min-height:72px;">
<div class="card-body" style="padding:.85rem;">
<p class="card-text"><span class="flag-icon flag-icon-ba"></span> Bosnia and Herzegovina</p>
</div>
</div>
BA for Bosnia and Herzegovina.
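The step from a flag class to a dataset URL can be sketched with plain string handling. This is a minimal sketch, not a definitive implementation: it assumes the class attribute always looks like "flag-icon flag-icon-xx" and that every dataset follows the URL pattern of the ZW example. The names DatasetUrls, countryCode, and datasetUrl are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

final class DatasetUrls {
    private static final String PREFIX = "flag-icon-";
    // Assumed base URL, taken from the ZW.json example.
    private static final String BASE = "https://cdn-lite.ip2location.com/datasets/";

    // Returns the upper-case country code from a class attribute such as
    // "flag-icon flag-icon-ba", or empty if no "flag-icon-xx" class is present.
    static Optional<String> countryCode(String classAttribute) {
        for (String cls : classAttribute.trim().split("\\s+")) {
            if (cls.startsWith(PREFIX) && cls.length() > PREFIX.length()) {
                return Optional.of(cls.substring(PREFIX.length()).toUpperCase(Locale.ROOT));
            }
        }
        return Optional.empty();
    }

    // Builds the JSON URL for the code found in the class attribute.
    static Optional<String> datasetUrl(String classAttribute) {
        return countryCode(classAttribute).map(code -> BASE + code + ".json");
    }

    public static void main(String[] args) {
        System.out.println(datasetUrl("flag-icon flag-icon-ba").orElse("no code found"));
    }
}
```

With Jsoup, the input would be the class attribute of each flag span, for example the value returned by className() on the selected element.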
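Once you have the URL of the JSON data for a particular country, you can fetch it with Jsoup.
// execute().body() returns the raw response; get() would parse the JSON as HTML.
String data = Jsoup
        .connect("https://cdn-lite.ip2location.com/datasets/BA.json")
        .ignoreContentType(true)  // the response is JSON, not HTML
        .maxBodySize(0)           // 0 removes Jsoup's default body size limit
        .execute()
        .body();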

Related

HREF + TEXT with Jsoup

I have the following HTML page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Atachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I want to select the "TEXT" headings and the links. I want to make clickable links, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
and it's ok, but in this way I can't read "TEXT1" and "TEXT3".
Can someone help me please?

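I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
returns a list with only ONE element, because it selects the div itself, not the tables or table rows. I would instead do this to get all the relevant information:
Elements tables = document.select("div#page_content_list01 > table");
for (Element table : tables) {
    // Each table is preceded by <h2>...</h2><br>, so walk back past the <br> to the heading.
    Element heading = table.previousElementSibling();
    while (heading != null && !heading.tagName().equals("h2")) {
        heading = heading.previousElementSibling();
    }
    Element a = table.select("a").first();
    if (heading == null || a == null) continue; // incomplete block, skip it
    String titleStr = heading.text();
    String linkStr = a.attr("href");
}
Note that the text in the h2 elements is at the same level as the tables, not wrapped in a common container with them, which is why I walk through the previous siblings. In your HTML a <br> sits between each h2 and its table, so the loop skips it to reach the heading. Also note that I wrote this from memory and it is untested, but you should get the idea.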

Parsing table data with jsoup

I am using jsoup in my Android app to parse HTML, but now I need to parse table data and I cannot get it to work. I have tried many approaches without success, so I am trying my luck here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My java/android code:
try {
    String tableHtmlCode = "";
    Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
    Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
    for (Element element : elm1.children()) {
        tableHtmlCode += element.getElementsByIndexEquals(2); // meant to pick the table, because 0 = h2, 1 = h3
    }
    Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
    // Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
    String foodContent = "";
    String foodAllergens = "";
} catch (IOException e) {
    e.printStackTrace();
}
So now I want to extract the text "vegetable and beef" and save it to the string foodContent, and save the numbers "1, 3" (together) from the div with class jedlo_box_alergeny to the string foodAllergens. Can someone help? I would be very grateful for any ideas.

Select the table with class jedalny_listok_tabulka and loop over its td tags.
Each td with class jedlo holds the food name as its own text and a nested div with class jedlo_box_alergeny that holds the allergen values. So take the td's own text for the food and the div's text, minus the "Allergens:" label, for the numbers, something like:
Elements cells = doc.getElementsByClass("jedalny_listok_tabulka")
        .first().getElementsByTag("td");
for (Element cell : cells) {
    if (cell.hasClass("jedlo")) {
        String foodContent = cell.ownText();
        // The allergens are in the nested div, e.g. "Allergens: 1, 3"; drop the label.
        String foodAllergens = cell.select("div.jedlo_box_alergeny").text()
                .replace("Allergens:", "").trim();
        System.out.println(foodContent + " : " + foodAllergens);
    }
}
Output:
vegetable and beef : 1, 3
Potato salad and pork : 6
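The question's HTML stores the allergens as text such as "Allergens: 1, 3". If the numbers are needed individually rather than as one string, the text can be split with the standard library alone. This is a minimal sketch, assuming the text always has the form label, colon, comma-separated integers; the class name AllergenText and its parse method are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class AllergenText {
    // Turns a label such as "Allergens: 1, 3" into [1, 3].
    // Everything up to the first ':' is treated as the label and dropped.
    static List<Integer> parse(String text) {
        String numbers = text.substring(text.indexOf(':') + 1);
        List<Integer> result = new ArrayList<>();
        for (String part : numbers.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                // Throws NumberFormatException if the site ever uses non-numeric codes.
                result.add(Integer.parseInt(trimmed));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("Allergens: 1, 3"));
        System.out.println(parse("Allergens: 6"));
    }
}
```

The input would be the text of the div with class jedlo_box_alergeny, obtained with Jsoup's text() method.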

Html parsing text from TD Tag

I have this HTML data:
<table border='0' cellpadding='3' bgcolor="#CCCCCC" class="hostinfo_title2" width='100%' align="center">
<tr align='center' bgcolor="#ffffff">
<td width='26%' class="hostinfo_title3">Archive Url</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" align="center">http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3</td>
</tr>
</table>
I want to get the mp3 URL (http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3) from the HTML above.
I'm following this link. Is it correct, or is there a better way to parse this text?
Could anyone help?

Check out Jsoup. It's a nice HTML parser for Java.
You should be able to do that with something like this:
String html = "<YOUR HTML HERE>";
Document doc = Jsoup.parse(html);
Elements tds = doc.select("table.hostinfo_title2").select("td");
String mp3Link = "";
for (Element td : tds) {
    if (td.text().contains("mp3")) {
        mp3Link = td.text();
        // do something with mp3Link
    }
}
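Checking contains("mp3") would also match a cell whose text merely mentions mp3. A stricter test can be sketched with java.net.URI from the standard library. This is a minimal sketch, assuming the link is an absolute http or https URL whose path ends in .mp3; the class name Mp3Links and the method isMp3Url are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

final class Mp3Links {
    // True if the cell text is an absolute http(s) URL whose path ends in ".mp3".
    static boolean isMp3Url(String cellText) {
        try {
            URI uri = new URI(cellText.trim());
            String scheme = uri.getScheme();
            String path = uri.getPath();
            return scheme != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && path != null
                    && path.toLowerCase(Locale.ROOT).endsWith(".mp3");
        } catch (URISyntaxException e) {
            return false; // not a URL at all, e.g. a header cell containing spaces
        }
    }

    public static void main(String[] args) {
        System.out.println(isMp3Url("http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3"));
        System.out.println(isMp3Url("Archive Url"));
    }
}
```

In the loop, the condition td.text().contains("mp3") could then be swapped for Mp3Links.isMp3Url(td.text()).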

Replacing text inside tags using Jsoup

<table width="100%" border="0" cellpadding="0" cellspacing="1" class="table_border" id="center_table">
<tbody>
<tr>
<td width="25%" class="heading_table_top">S. No.</td>
<td width="45%" class="heading_table_top">
Booking Status (Coach No , Berth No., Quota)
</td>
<td width="30%" class="heading_table_top">
* Current Status (Coach No , Berth No.)
</td>
</tr>
</tbody>
</table>
I scrape a web page and store the response in a string.
I then parse it into a jsoup Document:
Document doc = Jsoup.parse(result);
Then I select the table using
Element table = doc.select("table[id=center_table]").first();
Now I need to replace the cell text "Booking Status (Coach No , Berth No., Quota)" with "Booking Status" using jsoup. Could anybody help?
I tried
table.children().text().replaceAll(RegEx to select the text?????, "Booking Status");

Elements tds = doc.select("table[id=center_table] td"); // select the tds from your table
for (Element td : tds) { // loop through them
    if (td.text().contains("Booking Status")) { // found the one you want
        td.text("Booking Status"); // replace with your text
    }
}
Then you can use doc.toString() to get the HTML back as text to save to disk, send to a WebView, or whatever else you want to do with it.
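The question also asks for a regular expression. One option is to strip a trailing parenthetical from the cell text instead of hard-coding the replacement. This is a minimal sketch, assuming each heading has at most one "(...)" group at the end with no nested parentheses; the class name HeadingCleaner and its method are hypothetical.

```java
import java.util.regex.Pattern;

final class HeadingCleaner {
    // Optional whitespace followed by one "(...)" group at the very end of the text.
    private static final Pattern TRAILING_PARENS = Pattern.compile("\\s*\\([^()]*\\)\\s*$");

    // Removes the trailing parenthetical, if any; other text is returned trimmed.
    static String stripTrailingParenthetical(String text) {
        return TRAILING_PARENS.matcher(text.trim()).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(stripTrailingParenthetical("Booking Status (Coach No , Berth No., Quota)"));
        System.out.println(stripTrailingParenthetical("S. No."));
    }
}
```

Inside the loop this could be applied to every cell with td.text(HeadingCleaner.stripTrailingParenthetical(td.text())), which would also shorten the "* Current Status" heading.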

Elements tablecells = doc.select("table tbody tr td");
will give you the 3 cells.
Use a loop to get each element with
Element e = tablecells.get(index);
Use e.text() to get the String.
Compare or replace strings with String.equals(), String.contains(), and String.replace().

Parse the inner most html tags using jSoup

Here is my code.
String tags="<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table><body></html>";
Document document = Jsoup.parse(tags);
for (int i = 0; i < document.body().childNodes().size(); i++)
{
    if (!document.body().childNodes().get(i).nodeName().startsWith("#"))
    {
        System.out.println("1st Level Nodes:" + document.body().childNodes().get(i).nodeName());
        while (document.body().childNodes().get(i).childNodes().size() > 1)
        {
            System.out.println("2nd Level: " + document.body().childNodes().get(i).childNodes().get(0).nodeName());
        }
    }
}
How can I parse the HTML so that it returns tag by tag? My loop does not reach the innermost tags.
Here is the same HTML, formatted. I want to parse all the tags down to the innermost one.
<html>
<head></head>
<body>
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>3</td>
<td>4</td>
</tr>
</table>
</td>
</tr>
</table>
<body>
</html>
I want to get all the HTML between the tags as a hierarchy, as shown in the HTML above. So I would like to get all the tags one after another, in parent-then-child order.

If you only need the tags, you can use this:
String tags = "<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table><body></html>";
Document doc = Jsoup.parse(tags);
for (Element e : doc.select("*")) { // you can use 'doc.getAllElements()' here too
    System.out.println(e.tag());
}
Output:
#root
html
head
body
table
tbody
tr
td
tr
td
table
tbody
tr
td
td
