Parse HTML page in Java - java

I'm parsing this page segment:
<tr valign="middle">
<td class="inner"><span style=""><span class="" title=""></span> 2 <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> VALUE </td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
</tr>
I have this segment in the variable tv: HtmlElement tv = tr.get(i);
I read the VALUE inside the <a> tag this way:
HtmlElement a = tv.getElementsByTagName("a").get(0);
object.name.value(a.getTextContent());
url = a.getAttribute("href");
object.url_detail.value(myBase + url);
How can I read only the VALUE text of the other <td>...</td> cells?

I would suggest using XPath, which is a standard way of querying XML/HTML documents.
Reference: How to read XML using XPath in Java
Also take a look at this question: RegEx match open tags except XHTML self-contained tags
Update
If I understood correctly, you need the "VALUE" from each td, right?
If so, your XPath would be something like this:
//td[@class="small inner"]/text()
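To see that expression run end to end, here is a minimal sketch using only the JDK's built-in XPath support. It assumes the row is well-formed XML, which real-world HTML often is not; the class name TdValueXPath, the method smallInnerValues, and the cell values A, B, C are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of the question's code base.
class TdValueXPath {

    // Returns the trimmed text of every <td class="small inner"> in the fragment.
    // Assumption: the fragment is well-formed XML.
    static List<String> smallInnerValues(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Attributes are addressed with '@' in XPath.
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//td[@class='small inner']/text()", doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getNodeValue().trim());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Simplified version of the row from the question, with placeholder values.
        String row = "<tr valign=\"middle\">"
                + "<td class=\"inner\">NAME</td>"
                + "<td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\">A</td>"
                + "<td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\">B</td>"
                + "<td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\">C</td>"
                + "</tr>";
        System.out.println(smallInnerValues(row)); // prints [A, B, C]
    }
}
```

The predicate matches the class attribute exactly, so the first cell (class="inner") is skipped. If the page is not well-formed, an HTML-aware parser is needed before any XPath can be applied.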

You may try jsoup, a convenient Java library for parsing HTML.
UPDATE: using jsoup, you can solve the problem like this:
String html = "<tr valign=\"middle\">"
+ " <td class=\"inner\">"
+ " <span style=\"\"><span class=\"\" title=\"\"></span> 2 <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>"
+ " VALUE "
+ " </td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ "</tr>";
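// Use the XML parser so the bare <tr>/<td> tags are kept even without an enclosing <table>
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// Assumes the first cell wraps VALUE in an <a href="..."> link, as in the question's code
Elements labelPLine = doc.select("a[href]");
System.out.println("value 1:" + labelPLine.text());
Elements labelPLine2 = doc.select("td[width=1%]");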
Iterator<Element> it = labelPLine2.iterator();
int n = 2;
while (it.hasNext()) {
System.out.println("value " + (n++) + ":" + it.next().text());
}
The result would be:
value 1:VALUE
value 2:VALUE
value 3:VALUE
value 4:VALUE

Related

Parsing an HTML table in Android studio

I am trying to process a large amount of data for a research project. I have one HTML file loaded through Jsoup, but the problem is that the table I need to evaluate does not have an id or class. I have searched Stack Overflow, but I can't find an answer as to how I can reach each <tr> and get the information out of its <td>s.
<table>
<tr>
<td align="center">inf1</td>
<td align="center">date</td>
<td align="center">time</td>
<td align="center">group</td>
<td align="center">name</td>
<td align="center">---</td>
<td align="center">room</td>
<td align="center">---</td>
<td align="center">---</td>
<td> </td>
<td align="center">reason</td>
<td align="center"> </td>
</tr>
</table>
(The empty <td>s and the "---" cells are only there for display purposes and have no value for my project.)
I need to sort the <tr> rows (all structured the same way) by group and inf1, keeping the other data linked to them, in order to use the data in an Android Studio project where it will be displayed differently.
Thank you in advance for your help :)
You can use Jsoup CSS selectors and a custom class that implements Comparable to keep the records. Something like this:
String html = ""
+"<table>"
+" <tr>"
+" <td align=\"center\">inf1</td>"
+" <td align=\"center\">date</td>"
+" <td align=\"center\">time</td>"
+" <td align=\"center\">group1</td>"
+" </tr> "
+"</table>"
+"<table>"
+" <tr>"
+" <td align=\"center\">inf1</td>"
+" <td align=\"center\">date</td>"
+" <td align=\"center\">time</td>"
+" <td align=\"center\">group0</td>"
+" </tr> "
+"</table>"
+"<table>"
+" <tr>"
+" <td align=\"center\">inf2</td>"
+" <td align=\"center\">date</td>"
+" <td align=\"center\">time</td>"
+" <td align=\"center\">group0</td>"
+" </tr> "
+"</table>"
;
Document doc = Jsoup.parse(html);
class TableRecord implements Comparable<TableRecord>{
public String inf = "";
public String grp = "";
@Override
public int compareTo(TableRecord arg0) {
// Comparing arg0 to this (rather than this to arg0) sorts in descending order
int cmpGrp = arg0.grp.compareTo(this.grp);
if (cmpGrp==0){
return arg0.inf.compareTo(this.inf);
}
return cmpGrp;
}
@Override
public String toString(){
return "grp="+grp+":inf="+inf;
}
}
List<TableRecord> tableRecords = new ArrayList<>();
Elements trs = doc.select("table tr");
for (Element tr : trs){
Elements tds = tr.select("td");
TableRecord tableRecord = new TableRecord();
tableRecord.inf = tds.get(0).text();
tableRecord.grp = tds.get(3).text();
tableRecords.add(tableRecord);
}
Collections.sort(tableRecords);
for (TableRecord tableRecord:tableRecords){
System.out.println(tableRecord);
}
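As an alternative to implementing Comparable, the ordering can be expressed with a Comparator. The sketch below is independent of jsoup and sorts ascending by group, then by inf. The Row class and the method sortByGroupThenInf are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical example class; the rows would normally be filled from the parsed <td> cells.
class RowSorter {

    // Minimal holder for the two columns used as sort keys.
    static final class Row {
        final String inf;
        final String grp;

        Row(String inf, String grp) {
            this.inf = inf;
            this.grp = grp;
        }

        @Override
        public String toString() {
            return "grp=" + grp + ":inf=" + inf;
        }
    }

    // Returns a sorted copy: ascending by group, ties broken by inf.
    static List<Row> sortByGroupThenInf(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparing((Row r) -> r.grp).thenComparing(r -> r.inf));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("inf1", "group1"));
        rows.add(new Row("inf1", "group0"));
        rows.add(new Row("inf2", "group0"));
        for (Row r : sortByGroupThenInf(rows)) {
            System.out.println(r);
        }
        // prints:
        // grp=group0:inf=inf1
        // grp=group0:inf=inf2
        // grp=group1:inf=inf1
    }
}
```

Keeping the ordering in a Comparator leaves the record class free of sorting logic and makes it easy to switch to descending order with reversed().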

Parsing table data with jsoup

I am using jsoup in my Android app to parse HTML, but now I need to parse table data and I cannot get it to work. I have tried many approaches without success, so I am asking here in case anyone has experience with this.
Here is part of my html:
<div id="editacia_jedla">
<h2>My header</h2>
<h3>My sub header</h3>
<table border="0" class="jedalny_listok_tabulka" cellpadding="2" cellspacing="1">
<tr>
<td width="100" class="menu_nazov neparna" align="left">Food Menu 1</td>
<td class="jedlo neparna" align="left">vegetable and beef
<div class="jedlo_box_alergeny">Allergens: 1, 3</div>
</td>
</tr>
<tr>
<td width="100" class="menu_nazov parna" align="left">Food Menu 2</td>
<td class="jedlo parna" align="left">Potato salad and pork
<div class="jedlo_box_alergeny">Allergens: 6</div>
</td>
</tr>
</table>
etc
</div>
My java/android code:
try {
String tableHtmlCode="";
Document fullHtmlDocument = Jsoup.connect(urlOfFoodDay).get();
Element elm1 = fullHtmlDocument.select("#editacia_jedla").first();
for( Element element : elm1.children() )
{
tableHtmlCode+=element.getElementsByIndexEquals(2); //this set table content because 0=h2, 1=h3
}
Document parsedTableDocument = Jsoup.parse(tableHtmlCode);
//Element th = parsedTableDocument.select("td[class=jedlo neparna]").first(); THIS IS BAD
String foodContent="";
String foodAllergens="";
}
So now I want to extract the text "vegetable and beef" and save it to the string foodContent, and the numbers "1, 3" (together) from the div with class jedlo_box_alergeny and save them to the string foodAllergens. Can someone help? I will be very grateful for any ideas.
Select the table with class jedalny_listok_tabulka in your document and loop over its td tags.
Each td tag is the parent of the a (link) tags that hold the allergen values. Hence, you would loop over the a elements to get your numbers, something like:
Elements myElements = doc.getElementsByClass("jedalny_listok_tabulka")
.first().getElementsByTag("td");
for (Element element : myElements) {
if (element.className().contains("jedlo")) {
String foodContent = element.ownText();
String foodAllergen = "";
for (Element href : element.getElementsByTag("a")) {
foodAllergen += " " + href.text();
}
System.out.println(foodContent + " : " + foodAllergen);
}
}
Output:
vegetable and beef : 1 3
Potato salad and pork : 6
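If the allergen numbers are plain text inside the div, as in the HTML shown in the question ("Allergens: 1, 3"), the label can be stripped from the div's text with ordinary string handling. This is only a sketch: the class AllergenText and the method numbersOnly are hypothetical, and the input is assumed to be the text of the jedlo_box_alergeny div (for example obtained with jsoup's Element.text()).

```java
// Hypothetical helper; works on the text already extracted from the allergen div.
class AllergenText {

    // Drops everything up to and including the first colon and trims the rest.
    // If there is no colon, the trimmed input is returned unchanged.
    static String numbersOnly(String divText) {
        int colon = divText.indexOf(':');
        String rest = colon >= 0 ? divText.substring(colon + 1) : divText;
        return rest.trim();
    }

    public static void main(String[] args) {
        System.out.println(numbersOnly("Allergens: 1, 3")); // prints 1, 3
        System.out.println(numbersOnly("Allergens: 6"));    // prints 6
    }
}
```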

Select next node after specific condition

I'm trying to select the next node value (number 4) after the span tag in the html below. How can I do that??
<tr valign="top">
<td></td>
<td> 1 </td>
<td> 2 </td>
<td><span> 3 </span></td>
<td> 4 </td>
<td> 5 </td>
<td> 6 </td>
</tr>
final String html = "<tr valign=\"top\">\n"
+ " <td></td>\n"
+ " <td> 1 </td>\n"
+ " <td> 2 </td>\n"
+ " <td><span> 3 </span></td>\n"
+ " <td> 4 </td>\n"
+ " <td> 5 </td>\n"
+ " <td> 6 </td>\n"
+ "</tr>";
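// A bare <tr> is dropped by the HTML parser outside a <table>, so use the XML parser to keep the tags
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
// The span is the only child of its td, so step up to the td before moving to the next sibling
Element nextToSpan = doc.select("span").first().parent().nextElementSibling();
Explained:
doc.select("span") // Select the span-tags of doc
.first() // retrieve the first one
.parent() // go up to the td that contains it
.nextElementSibling(); // Get the td element that follows it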
Documentation: http://jsoup.org/cookbook/extracting-data/selector-syntax
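The same lookup can also be written as a single XPath expression with the following-sibling axis. The sketch below uses only the JDK and assumes the row is well-formed XML; NextCellXPath and cellAfterSpan are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
class NextCellXPath {

    // Returns the trimmed text of the first <td> that follows the <td> containing a <span>.
    static String cellAfterSpan(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // td[span] selects the cell with a span child;
        // following-sibling::td[1] selects the nearest td after it.
        String value = XPathFactory.newInstance().newXPath()
                .evaluate("//td[span]/following-sibling::td[1]", doc);
        return value.trim();
    }

    public static void main(String[] args) throws Exception {
        String row = "<tr valign=\"top\">"
                + "<td></td><td> 1 </td><td> 2 </td>"
                + "<td><span> 3 </span></td>"
                + "<td> 4 </td><td> 5 </td><td> 6 </td>"
                + "</tr>";
        System.out.println(cellAfterSpan(row)); // prints 4
    }
}
```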

Printing multiple tags within CDATA section

I have an old XML file which was generated manually in Java. Its tree structure is like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="tvshows.xsl"?>
<rss version='0.91'>
<channel>
<title>xyz.com</title>
<link>http://www.xyz.com</link>
<description></description>
<item>
<title>Downton Abbey</title>
<link>http://www.xyz.com</link>
<description><![CDATA[
<tr class='chartContent'>
<td class='rank'>1.</td>
<td class='showTitle'>Dexter</td>
<td class='network'>CBS</td>
<td class='sumInvIndex'>210</td>
<td class='earlierWeek'>-13</td>
<td class='mediaInvIndex'>225</td>
<td class='socialNetworkInvIndex'>238</td>
<td class='gammaIndex'>--</td>
</tr>]]>
</description>
</item>
</channel>
</rss>
Right now I am using the JDOM library to generate the exact format, but how should I deal with the CDATA section? It contains a <tr> with almost 10 columns. I am trying to build it like this:
CDATA cdata = new CDATA("<tr class='chartContent'>");
cdata.append("<td class='rank'>" + current.getRank() + "</td>");
cdata.append("\n");
cdata.append("<td class='showTitle'>" + current.getShowTitle() + "</td>");
cdata.append("<td class='network'>" + current.getNetwork() + "</td>");
cdata.append("<td class='sumInvIndex'>" + current.getsumInvIndex() +"</td>");
cdata.append("<td class='earlierWeek'>" + current.getearlierWeek() + "</td>");
cdata.append("<td class='mediaInvIndex'>" + current.getmediaInvIndex() + "</td>");
cdata.append("<td class='socialNetworkInvIndex'>" + current.getsocialNetworkInvIndex() + "</td>");
cdata.append("<td class='gammaIndex'>" + current.getgammaIndex() + "</td>");
cdata.append("</tr>");
Element description = new Element("description");
description.setContent(cdata);
But is there a better way of building the columns as elements, something like
Element rankTD = new Element("td");
rankTD.setText(current.getRank());
and then adding the rankTD element to the CDATA?
This is the output generated after using
Format format = null;
format = Format.getPrettyFormat();
content.add(new Element("td").setText(current.getRank()).setAttribute("class","showTitle"));
-------------------
-------------------
-------------------
String cdataContent = new XMLOutputter(format).outputString(content);
Output:
<description><![CDATA[<tr class="chartContent" />
<td class="showTitle">1.</td>
<td class="network">PBS</td>
<td class="sumInvIndex">210</td>
<td class="earlierWeek">-13</td>
<td class="mediaInvIndex">225</td>
<td class="socialNetworkInvIndex">238</td>
<td class="gammaIndex">--</td>]]></description>
You need to serialize the content of the CDATA section to a String separately and put that String into the CDATA, something like this:
List<Element> content = new ArrayList<Element>();
content.add(new Element("td").setText(current.getRank()));
...
String cdataContent = new XMLOutputter().outputString(content);
description.setContent(new CDATA(cdataContent));
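For comparison, the same idea (build the row markup as a String first, then wrap that String in a CDATA section) can be sketched with the JDK's own DOM and Transformer classes instead of JDOM. CdataRow, rowMarkup, and descriptionXml are hypothetical names, only two of the columns are shown, and the values are inserted without any escaping.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class for illustration.
class CdataRow {

    // Builds the <tr> markup as plain text. Assumption: the values contain no
    // characters that need escaping and never contain "]]>".
    static String rowMarkup(String rank, String showTitle) {
        return "<tr class='chartContent'>"
                + "<td class='rank'>" + rank + "</td>"
                + "<td class='showTitle'>" + showTitle + "</td>"
                + "</tr>";
    }

    // Wraps the prepared markup in a CDATA section inside <description>.
    static String descriptionXml(String rowMarkup) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element description = doc.createElement("description");
        description.appendChild(doc.createCDATASection(rowMarkup));
        doc.appendChild(description);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(descriptionXml(rowMarkup("1.", "Dexter")));
    }
}
```

Because the row is already a String when it enters the CDATA section, the serializer never sees the inner tr and td tags as elements, so it cannot reformat or self-close them.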

jsoup to get a particular element in Table

I have an HTML page that I want to extract information from:
<table class="students">
<tbody>
<tr class="rz" style="color:red;" onclick="location.href='//andy.pvt.com';">
<td>
<a title="Display Andy website,Andy" href="//andy.pvt.com">15</a>
</td>
<td>Andy jr</td>
<td align="right">44.31</td>
<td align="right">23.79</td>
<td align="right">57</td>
<td align="right">1,164,700</td>
<td align="right">0.12</td>
<td align="center">
<td align="left">0.99</td>
<td align="right">
</tr>
I want to get "Andy", "15", and "andy.pvt.com".
I am able to extract this table using doc.select(table).get
but I am not able to extract the information I am looking for.
What should the selector be in tables.select("xxxx");?
Can you please help me with the xxxx part that I am missing?
You state:
I tried: tables = doc.select("table").get(0); then tables.select("a title").
You want something more along the lines of
tables.select("a[href]").attr("href"); // to get your String
and
tables.select("a[href]").text(); // to get your number
e.g.,
Elements tables = doc.select("table");
String hrefAttr = tables.select("a[href]").attr("href");
System.out.println("href attribute: " + hrefAttr);
String number = tables.select("a[href]").text();
System.out.println("number: " + number);
