Parsing text from a TD tag in HTML - Java

I have the following HTML:
<table border='0' cellpadding='3' bgcolor="#CCCCCC" class="hostinfo_title2" width='100%' align="center">
<tr align='center' bgcolor="#ffffff">
<td width='26%' class="hostinfo_title3">Archive Url</td>
</tr>
<tr bgcolor="#ffffff">
<td height="25" align="center">http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3</td>
</tr>
</table>
I want to extract the MP3 URL (http://www.toradio.com/prgramdetails/20130413_vali_mm.mp3) from the HTML above.
I'm following this link. Is that approach correct, or is there a better way to parse this text?
Could anyone help?

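Check out jsoup. It's a nice HTML parser for Java.
You should be able to do that with something like this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String html = "<YOUR HTML HERE>";
Document doc = Jsoup.parse(html);
// All cells of the table with class "hostinfo_title2"
Elements tds = doc.select("table.hostinfo_title2").select("td");
String mp3Link = "";
for (Element td : tds) {
    // The URL is the cell's text, not an attribute
    if (td.text().contains("mp3")) {
        mp3Link = td.text();
        // do something with mp3Link
    }
}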

Related

Get data from table(html) except div tag by jsoup

I have this HTML:
<table width="100%" cellpadding="5" cellspacing="2" class="zebra">
<tr>
<td colspan="5">
<div class="paginator">
2
</div>
</td>
</tr>
<tr>
<td>some_value</td>
</tr>
<tr>
<td>some_value</td>
</tr>
<tr>
<td colspan="2">
<div class="paginator">
2
</div>
</td>
</tr>
</table>
I use jsoup. How can I get all links except the links inside the div tags?
I tried something like this, but it doesn't work: the resulting elements still contain all the links.
org.jsoup.select.Elements tableText = doc.select("table.zebra").not("tr td div.paginator");
for (org.jsoup.nodes.Element td : tableText.select("td a")) {
    System.out.println(td.attr("href")); // http://some_link
    ....
}
You can use the code below.
Document html = Jsoup.parse(htmlStr);
for (Element e : html.getElementsByTag("a")) {
    // Skip anchors whose direct parent is a div
    if (!"div".equalsIgnoreCase(e.parentNode().nodeName())) {
        System.out.println(e.attr("href"));
    }
}
Here I check that the parent node of each anchor element is not a div; if it is not, I print the URL.

HREF + TEXT with Jsoup

I have the following HTML page:
</div><div id="page_content_list01" class="grid_12">
<h2><strong class="floatleft">TEXT1</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example.com">TEXT2</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT3</strong></h2><br>
<table>
<tbody>
<tr>
<th class="no_width">
<p class="floatleft">Attachments:</p>
</th>
<td class="link_azure">
<a target="_blank" href="http://www.example2.com">TEXT4</a><br/>
</td>
</tr>
</tbody>
</table><h2><strong class="floatleft">TEXT5</strong></h2><br>
<table>
<tbody>
<tr>
Currently I'm doing:
Elements rows = document.select("div#page_content_list01");
Now I want to select the "TEXT" headings and the links. I want to make clickable links, so I'm using:
for (Element eleme : rows) {
    Elements elements = eleme.select("a");
    for (Element elem : elements) {
        String url = elem.attr("href");
        String title = elem.text();
    }
}
and I'm getting:
url = "http://www.example.com";
title = "TEXT2";
That part works, but this way I can't read "TEXT1" and "TEXT3".
Can someone help me, please?
I think you need to work on the selectors. First, your primary selector
Elements rows = document.select("div#page_content_list01");
returns a list with only ONE element, since it selects the div itself, not the tables or table rows. I would instead do this to get all the relevant info:
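Elements tables = document.select("div#page_content_list01>table");
for (Element table : tables) {
    // Step back over the <br> that sits between each h2 and its table
    Element h2 = table.previousElementSibling();
    while (h2 != null && !"h2".equals(h2.tagName())) {
        h2 = h2.previousElementSibling();
    }
    String titleStr = (h2 != null) ? h2.text() : "";
    Element a = table.select("a").first();
    String linkStr = (a != null) ? a.attr("href") : "";
}
Note that the h2 elements are siblings of the tables rather than wrapped together with them in a per-entry container, which is why I navigate to the previous sibling. In the HTML shown, a br sits between each h2 and its table, so the loop steps back until it reaches the h2. Also note that I wrote this from memory and it is untested, but you should get the idea.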

Getting Input tag id, value in webdriver using java

HTML:
<table id="tblRenewalList" class="adminlist dataTable" width="100%" cellspacing="1" cellpadding="1" border="1" style="margin-left: 0px; width: 100%;" aria-describedby="tblRenewalList_info">
<thead>
</thead>
<tbody role="alert" aria-live="polite" aria-relevant="all">
<tr class="odd">
<td class="alignCenter">
<input id="chkRenewal_868" class="chkPatent" type="checkbox" onclick="RenewalSelection(this)" companyid="33" value="868">
</td>
</tr>
</table>
With the HTML above, I want to scrape the id and value of the input.
This is my Java code; when I run it, it returns empty values:
WebElement inputValues = driver.findElement(By
        .xpath("//*[@id='tblRenewalList']/tbody/tr[1]/td[1]"));
String idValue = inputValues.getAttribute("id");
String ed2 = inputValues.getAttribute("value");
My expected output is:
id = chkRenewal_868
value = 868
The document isn't well-formed; I don't know whether that matters for WebDriver.
In any case, your XPath selects the td, which has no id or value attribute. It must select the input inside it:
//*[@id='tblRenewalList']/tbody/tr[1]/td[1]/input
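To illustrate why the expression has to end at the input element, here is a minimal sketch that evaluates the same XPath with the JDK's built-in javax.xml.xpath API instead of WebDriver, so it runs without a browser. The class name XPathInputDemo, the helper readIdAndValue, and the trimmed, well-formed copy of the table (self-closed input, closed tbody) are assumptions made for illustration. With WebDriver, the same expression would be passed to By.xpath and the attributes read with getAttribute.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class XPathInputDemo {
    // Hypothetical, well-formed stand-in for the table in the question.
    static final String SAMPLE =
        "<table id='tblRenewalList'><tbody><tr class='odd'><td class='alignCenter'>"
      + "<input id='chkRenewal_868' class='chkPatent' type='checkbox' value='868'/>"
      + "</td></tr></tbody></table>";

    // Returns {id, value} of the input in the first cell of the first row.
    static String[] readIdAndValue(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Note the trailing "/input": the td itself has no id or value attribute.
            Element input = (Element) xpath.evaluate(
                "//*[@id='tblRenewalList']/tbody/tr[1]/td[1]/input",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODE);
            return new String[] { input.getAttribute("id"), input.getAttribute("value") };
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String[] result = readIdAndValue(SAMPLE);
        System.out.println("id = " + result[0]);
        System.out.println("value = " + result[1]);
    }
}
```

Running main prints id = chkRenewal_868 and value = 868, which matches the expected output in the question.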

Parsing table data with jsoup

I am using jsoup in my Android app to parse HTML, but now I need to parse table data and cannot get it to work. I have tried many approaches without success, so I am asking here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My Java/Android code:
try {
    String tableHtmlCode = "";
    Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
    Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
    for (Element element : elm1.children()) {
        tableHtmlCode += element.getElementsByIndexEquals(2); // this sets the table content because 0=h2, 1=h3
    }
    Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
    // Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
    String foodContent = "";
    String foodAllergens = "";
}
So now I want to extract the text "vegetable and beef" and save it to the string foodContent, and take the numbers 1, 3 (together) from the div with class jedlo_box_alergeny and save them to the string foodAllergens. Can someone help? I would be very grateful for any ideas.
Start from the table with class jedalny_listok_tabulka and loop over its td tags.
Each food td is the parent of the a (link) elements that hold the allergen values, so loop over those a elements to collect the numbers. This assumes the numbers are wrapped in a elements on the real page; in the excerpt above they appear as plain text inside div.jedlo_box_alergeny. Something like:
Elements myElements = doc.getElementsByClass("jedalny_listok_tabulka")
        .first().getElementsByTag("td");
for (Element element : myElements) {
    if (element.className().contains("jedlo")) {
        // ownText() returns the td's own text without the nested div
        String foodContent = element.ownText();
        String foodAllergen = "";
        for (Element href : element.getElementsByTag("a")) {
            foodAllergen += " " + href.text();
        }
        System.out.println(foodContent + " : " + foodAllergen);
    }
}
Output:
vegetable and beef : 1 3
Potato salad and pork : 6

Parse the innermost HTML tags using jsoup

Here is my code.
String tags = "<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table></body></html>";
Document document = Jsoup.parse(tags);
for (int i = 0; i < document.body().childNodes().size(); i++) {
    if (!document.body().childNodes().get(i).nodeName().startsWith("#")) {
        System.out.println("1st Level Nodes:" + document.body().childNodes().get(i).nodeName());
        while (document.body().childNodes().get(i).childNodes().size() > 1) {
            System.out.println("2nd Level: " + document.body().childNodes().get(i).childNodes().get(0).nodeName());
        }
    }
}
How can I parse the HTML so that it is returned tag by tag? My loop does not reach the innermost tags.
Here is the same HTML, formatted. I want to parse all the tags down to the innermost ones.
<html>
<head></head>
<body>
<table>
<tr>
<td>1</td>
</tr>
<tr>
<td>
<table>
<tr>
<td>3</td>
<td>4</td>
</tr>
</table>
</td>
</tr>
</table>
</body>
</html>
I want to get all the HTML between the tags as a hierarchy, as shown in the HTML above, so I would like to get the tags one after another, following the parent-child order.
If you only need the tags, you can use this:
String tags = "<html><head></head><body><table><tr><td>1</td></tr><tr><td><table><tr><td>3</td><td>4</td></tr></table></td></tr></table></body></html>";
Document doc = Jsoup.parse(tags);
for (Element e : doc.select("*")) { // you can use 'doc.getAllElements()' here too
    System.out.println(e.tag());
}
Output:
#root
html
head
body
table
tbody
tr
td
tr
td
table
tbody
tr
td
td
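If the hierarchy matters and not just the flat list of tags, a recursive walk that indents each tag by its depth is one way to show it. The sketch below uses the JDK's DOM parser rather than jsoup, on a well-formed copy of the markup (the second <body> is written as </body>), so no tbody elements are inserted. TagTreeDemo, walk, and tagTree are hypothetical names. With jsoup, the same recursion could iterate over Element.children().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TagTreeDemo {
    // Well-formed copy of the markup from the question (assumption: </body> closes the body).
    static final String SAMPLE =
        "<html><head></head><body><table><tr><td>1</td></tr><tr><td>"
      + "<table><tr><td>3</td><td>4</td></tr></table>"
      + "</td></tr></table></body></html>";

    // Depth-first walk: append the tag name indented by depth, then recurse into children.
    static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip text nodes such as "1", "3", "4"
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    // Parses the markup and returns one tag per line, indented two spaces per level.
    static String tagTree(String xml) {
        try {
            Node root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            StringBuilder out = new StringBuilder();
            walk(root, 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(tagTree(SAMPLE));
    }
}
```

Each tag is printed before its children, so the nesting of the inner table under the second row's td is visible from the indentation.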
