Extract and Parse HTML Table using Jsoup - java

How could I use Jsoup to extract the specification data from this website separately for each row, e.g. Network -> Network Type, Battery, etc.?
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class mobilereviews {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
        for (Element table : doc.select("table")) {
            for (Element row : table.select("tr")) {
                Elements tds = row.select("td");
                System.out.println(tds.get(0).text());
            }
        }
    }
}

Here is an attempt at a solution to your problem:
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
    for (Element row : table.select("tr:gt(2)")) {
        Elements tds = row.select("td:not([rowspan])");
        System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
    }
}
Parsing the HTML is tricky, and if the HTML changes, your code needs to change as well.
You need to study the HTML markup to come up with your parsing rules first.
There are multiple tables in the HTML, so you first filter on the correct one: table[id=phone_details].
The first two table rows contain only markup for formatting, so skip them: tr:gt(2).
Every other row starts with a global description for the content type, so filter it out: td:not([rowspan]).
For more complex options in the selector syntax, look here: http://jsoup.org/cookbook/extracting-data/selector-syntax

XPath for the columns: //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
XPath for the values: //*[@id="phone_details"]/tbody/tr[3]/td[3]
@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.
Replace the numbers (tr[N] / td[N]) with appropriate values.
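For instance, a rough translation of those XPaths into jsoup select() rules might look like this (a sketch only: jsoup's :eq() indices are 0-based where XPath's are 1-based, and the row/cell numbers are the illustrative ones from above):

Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
// //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
String column = doc.select("#phone_details tr:eq(2) td:eq(1) strong").text();
// //*[@id="phone_details"]/tbody/tr[3]/td[3]
String value = doc.select("#phone_details tr:eq(2) td:eq(2)").text();
System.out.println(column + " -> " + value);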
Alternatively, you can pipe the HTML through a text-only browser and extract the data from the text. Here is the text version of the page. You can delimit the text or read after N chars to extract the data.
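A minimal sketch of that text-based idea, using jsoup's text() in place of a text-only browser (the marker string and character count are illustrative, not taken from the real page):

String text = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get().text();
int start = text.indexOf("Network Type"); // anchor on a known label
if (start >= 0) {
    System.out.println(text.substring(start, Math.min(start + 80, text.length())));
}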

This is how I get the data from an HTML table:

// doc is an already-parsed org.jsoup.nodes.Document;
// rows are accumulated into cadena as comma-separated values
String cadena = "";
org.jsoup.nodes.Element tablaRegistros = doc.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
    for (org.jsoup.nodes.Element column : row.select("td")) {
        cadena += column.text() + ",";
    }
    cadena += "\n";
}

Here is a generic solution for extracting a table from an HTML page via jsoup.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractTableDataUsingJSoup {

    public static void main(String[] args) {
        extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm", "phone_details");
    }

    public static void extractTableUsingJsoup(String url, String tableId) {
        Document doc;
        try {
            // need http protocol
            doc = Jsoup.connect(url).get();
            // Set the id of any table from any website and the code below will print the contents of the table.
            // Store the extracted data in appropriate data structures and use them for further processing.
            Element table = doc.getElementById(tableId);
            Elements tds = table.getElementsByTag("td");
            // You can check for nesting of tds if such a structure exists
            for (Element td : tds) {
                System.out.println("\n" + td.text());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Related

java.lang.IllegalArgumentException: Must supply a valid URL

I'm trying to build a web crawler for my OOP class. The crawler needs to traverse 1000 Wikipedia pages and collect the titles and words off each page. The code I have will traverse a single page and collect the required information, but it also gives me the error "java.lang.IllegalArgumentException: Must supply a valid URL:". Here is my crawler's code; I've been using the jsoup libraries.
import java.util.HashMap;
import java.util.HashSet;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class crawler {
    private static final int MAX_PAGES = 1000;
    private final HashSet<String> titles = new HashSet<>();
    private final HashSet<String> urlVisited = new HashSet<>();
    private final HashMap<String, Integer> map = new HashMap<>();

    public void getLinks(String startURL) {
        if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {
            urlVisited.add(startURL);
            try {
                Document doc = Jsoup.connect(startURL).get();
                Elements linksFromPage = doc.select("a[href]");
                String title = doc.select("title").first().text();
                titles.add(title);
                String text = doc.body().text();
                CountWords(text);
                for (Element link : linksFromPage) {
                    if (titles.size() <= MAX_PAGES) {
                        Thread.sleep(50);
                        getLinks(link.attr("a[href]"));
                    } else {
                        System.out.println("URL couldn't visit");
                        System.out.println(startURL + ", " + urlVisited.size());
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    public void PrintAllTitles() {
        for (String t : titles) {
            System.out.println(t);
        }
    }

    public void PrintAllWordsAndCount() {
        for (String key : map.keySet()) {
            System.out.println(key + " : " + map.get(key));
        }
    }

    private void CountWords(String text) {
        String[] lines = text.split(" ");
        for (String word : lines) {
            if (map.containsKey(word)) {
                int val = map.get(word);
                val += 1;
                map.remove(word);
                map.put(word, val);
            } else {
                map.put(word, 1);
            }
        }
    }
}
The driver function just uses c.getLinks("https://en.wikipedia.org/wiki/Computer") as the starting URL.
The issue is in this line:
getLinks(link.attr("a[href]"));
link.attr(attributeName) is a method for getting an element's attribute by name. But a[href] is a CSS selector. So that method call returns a blank String (as there is no attribute in the element named a[href]), which is not a valid URL, and so you get the validation exception.
Before you call connect, you should log the URL you are about to hit. That way you will see the error.
You should change the line to:
getLinks(link.attr("abs:href"));
That will get the absolute URL pointed to by the href attribute. Most of the hrefs on that page are relative, so it's important to make them absolute before they are made into a URL for connect().
You can see the URLs that the first a[href] selector will return here. You should also think about how to only fetch HTML pages and not images (e.g., maybe filter out by filetype).
There is more detail and examples of this area in the Working with URLs article of jsoup.
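To illustrate the difference, a small sketch (the URLs in the comments are typical of what Wikipedia serves, not verified output):

Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Computer").get();
for (Element link : doc.select("a[href]")) {
    String relative = link.attr("href");     // often relative, e.g. "/wiki/Software"
    String absolute = link.attr("abs:href"); // resolved, e.g. "https://en.wikipedia.org/wiki/Software"
    // a crude filetype filter so only pages, not images, get crawled
    if (!absolute.isEmpty() && !absolute.endsWith(".png") && !absolute.endsWith(".jpg")) {
        System.out.println(absolute);
    }
}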

Use jsoup drawing html table

I'm reading an HTML file using jsoup, and I want to show the HTML table. How can I do that?
I'm a beginner with jsoup - and a not that experienced java developer. :)
public class test {
    public static void main(String[] args) throws IOException {
        // TODO auto-generated method stub
        File input = new File("D://index.html"); // read from an HTML file
        Document doc = Jsoup.parse(input, "UTF-8");
        // test
        Elements trs = doc.select("table").select("tr");
        for (Element e : trs) {
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}
Without knowing jsoup, I guess you should descend into the HTML structure step by step, like this:
...
// test
Elements tables = doc.select("table");
for (Element table : tables) {
    for (Element row : table.select("tr")) {
        for (Element e : row.select("td")) {
            // output your td-contents here
            System.out.println("-------------------");
            System.out.println(e.text());
        }
    }
}
...
The advantage of this approach is that you have more control over drawing separators between the HTML Elements.
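For example, a variation of the loop above that joins cells with a pipe and draws a rule only between rows (a sketch, assuming doc is already parsed as in the question):

for (Element table : doc.select("table")) {
    for (Element row : table.select("tr")) {
        StringBuilder line = new StringBuilder();
        for (Element cell : row.select("td")) {
            line.append(cell.text()).append(" | ");
        }
        System.out.println(line);
        System.out.println("-------------------");
    }
}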

Tokenizing text content of an XML element using Dom Java

I have an XML file that contains tags such as:
<P>(b) <E T="03">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</p>
I need to parse the text content and get the results back as an array of strings: ["(b)", "Filing of financial reports.", "(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,"].
In other words, I need to tokenize the text content of the <P> element according to <E T="03"> and store the results in an array of strings.
There's nothing to "tokenize", as the parsing has already been done for you when the DOM was built. The <P> node contains both text and child nodes. This is what the DOM looks like:
P
|
+---text "(b) "
|
+---E
| |
| +---attribute T=03
| |
| +---text "Filing of financial reports."
|
+---text "Except as provided ..."
To get the results you want you need to navigate through the sub-nodes of <P> and extract all the text nodes.
Here's one way to do it using the jsoup library:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class Test {
    public static void main(String args[]) throws Exception {
        String xml = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</p>";
        Document doc = Jsoup.parse(xml);
        for (Element e : doc.select("p")) {
            for (Node child : e.childNodes()) {
                if (child instanceof TextNode) {
                    System.out.println(((TextNode) child).text());
                } else {
                    System.out.println(((Element) child).text());
                }
            }
        }
    }
}
output:
(b)
Filing of financial reports.
(1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
Use XPath. If you don't want to use specialized Java libraries, you may just use the standard Java API, such as:
import java.io.ByteArrayInputStream;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ExtractingAllTextNodes {

    private static final String XML = "<P>(b) <E T=\"03\">Filing of financial reports.</E> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,</P>";

    public static void main(final String[] args) throws Exception {
        final XPath xPath = XPathFactory.newInstance().newXPath();
        final DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        final DocumentBuilder builder = builderFactory.newDocumentBuilder();
        final String expression = "//text()";
        final Document xmlDocument = builder.parse(new ByteArrayInputStream(XML.getBytes()));
        final NodeList nodeList = (NodeList) xPath.compile(expression).evaluate(xmlDocument, XPathConstants.NODESET);
        for (int i = 0; i < nodeList.getLength(); i++) {
            System.out.println("=> " + nodeList.item(i).getTextContent());
        }
    }
}
Output:
=> (b)
=> Filing of financial reports.
=> (1)(i) Except as provided in paragraphs (b)(3) and (h) of this section,
Depending on your needs, you may alter the XPath expression.
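For example, to keep only the text nodes that are direct children of <P> (skipping the text inside <E>), the expression could be narrowed to something like this sketch:

// all text nodes anywhere in the document, as used above
final String expression = "//text()";
// only the direct text children of the root <P>, i.e.
// "(b) " and " (1)(i) Except as provided ..."
final String narrowed = "/P/text()";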
OK, I finally managed to find a solution to the problem. The code is somewhat complex, but it uses DOM, which is the standard Java API for XML parsing:
public static void parseSection(Element sec) {
    NodeList pTags = ((Element) (((NodeList) sec
            .getElementsByTagName("contents")).item(0)))
            .getElementsByTagName("P");
    int pTagIndex = 0;
    while (pTagIndex < pTags.getLength()) {
        System.out.println(pTagIndex);
        Node pTag = pTags.item(pTagIndex);
        NodeList pTagChildren = pTag.getChildNodes();
        int pTagChildrenIndex = 0;
        while (pTagChildrenIndex < pTagChildren.getLength()) {
            Node pTagChild = pTagChildren.item(pTagChildrenIndex);
            if (pTagChild.getNodeName().equals("#text")) {
                System.out.println("Text: " + pTagChild.getNodeValue());
            } else if (pTagChild.getNodeName().equals("E")) {
                System.out.println("E: " + pTagChild.getTextContent());
            }
            pTagChildrenIndex++;
        }
        pTagIndex++;
    }
}

Personal Project "RSS FEED" XML Parser

I am relatively new to Java, and I have been trying to figure out how to reach the following tags for output for a couple of long, LONG days now. I would really appreciate some insight into the problem; it seems like everything I could find and/or try just does not pan out. (Excuse the cheesy news articles.)
<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (#carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
<p>Check out what else Bryan had to say above.</p>
<p>Source: </p>
]]>
</description>
</item>
I have managed to parse the XML and print out the content of both the title and description element tags; however, the output for the description element also includes all of its child element tags. I would like to use this project in the future to build on my Java portfolio. Please help!
My code so far:
public class NewXmlReader {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
            docXml.getDocumentElement().normalize();
            NewXMLReaderHandlers.handleItemTags(docXml, "item");
        } catch (ParserConfigurationException | SAXException parserConfigurationException) {
            System.out.println("You Are Not XML formated !!");
            parserConfigurationException.printStackTrace();
        } catch (IOException iOException) {
            System.out.println("URL NOT FOUND");
            iOException.getCause();
        }
    }
}
public class NewXMLReaderHandlers {

    private static int ARTICLELENGTH;

    public static String inputHandler() throws IOException {
        InputStreamReader inputStream = new InputStreamReader(System.in);
        BufferedReader bufferRead = new BufferedReader(inputStream);
        System.out.println("Please Enter A Proper URL: ");
        String urlPageString = bufferRead.readLine();
        return urlPageString;
    }

    public static void handleItemTags(Document document, String rssFeedParentTopicTag) {
        NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
        NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
        String rootElement = document.getDocumentElement().getNodeName();
        if (rootElement == "rss") {
            System.out.println("We Have An RSS Feed To Parse");
            for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
                Node itemNode = (Node) listOfArticles.item(i);
                if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element itemElement = (Element) itemNode;
                    tagContent(itemElement, "title");
                    tagContent(itemElement, "description");
                }
            }
        }
    }

    public static void tagContent(Element item, String tagName) {
        NodeList tagNodeList = item.getElementsByTagName(tagName);
        Element tagElement = (Element) tagNodeList.item(0);
        NodeList tagTElist = tagElement.getChildNodes();
        Node tagNode = tagTElist.item(0);
        // System.out.println(" - " + tagName + " : " + tagNode.getNodeValue() + "\n");
        if (tagName == "description") {
            System.out.println(" - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
            System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
        }
    }
}
For my money, the easiest solution would be to use the XPath API.
Essentially, it's a query language for XML. See XPath Tutorial for a primer.
This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class TestRSSFeed {

    public static void main(String[] args) {
        try {
            // Read the feed...
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
            Element root = doc.getDocumentElement();
            // Create an XPath instance
            XPath xPath = XPathFactory.newInstance().newXPath();
            // Find all the nodes named <entry...> anywhere in
            // the document that live under the parent node...
            XPathExpression expression = xPath.compile("//entry");
            NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);
            System.out.println("Found " + nl.getLength() + " items...");
            for (int index = 0; index < nl.getLength(); index++) {
                Node node = nl.item(index);
                // This is a sub-node search.
                // The search is based on the parent node and looks for a single
                // node titled "title" that belongs to the parent node...
                // I did this because I'm only expecting a single node...
                expression = xPath.compile("title");
                Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
                System.out.println(child.getTextContent());
            }
        } catch (IOException | ParserConfigurationException | SAXException exp) {
            exp.printStackTrace();
        } catch (XPathExpressionException ex) {
            ex.printStackTrace();
        }
    }
}
Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
Just in case anyone is still left wondering how I managed to solve the CDATA puzzle:
The logic is as follows:
Once the program extracts the XML and displays the correct node tree as the RSS feed presents it, any data wrapped in CDATA tags can only be accessed by creating a new XML document from the text content of the CDATA node. Once you parse that new document, you should be able to access all the data you need.
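A minimal sketch of that idea, re-parsing the CDATA payload with jsoup (which appears elsewhere on this page); descriptionNode is illustrative and stands for the <description> text node already located by tagContent():

// The text node's value is the raw HTML that was wrapped in CDATA
String innerHtml = descriptionNode.getNodeValue();
// Parse it as a fresh document, then query it like any other
org.jsoup.nodes.Document inner = org.jsoup.Jsoup.parse(innerHtml);
for (org.jsoup.nodes.Element p : inner.select("p")) {
    System.out.println(p.text());
}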

Cannot extract data from an XML

I'm using the getElementsByTag method to extract data from the following XML document (the Yahoo Finance news API, http://finance.yahoo.com/rss/topfinstories).
I'm using the following code. It gets the news items and the titles no problem using the getElementsByTag method, but for some reason it won't pick up the link when searched by tag; it only picks up the closing tag of the link element. Is it a problem with the XML document or a problem with jsoup?
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class GetNewsXML {

    /**
     * @param args
     */
    public static void main(String args[]) {
        Document doc = null;
        String con = "http://finance.yahoo.com/rss/topfinstories";
        try {
            doc = Jsoup.connect(con).get();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
        Elements collection = doc.getElementsByTag("item"); // Gets each news item
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("title"));
        }
        for (Element c : collection) {
            System.out.println(c.getElementsByTag("link"));
        }
    }
}
You get <link />; the URL is put after the link tag as a text node. But this is not a problem:
final String url = "http://finance.yahoo.com/rss/topfinstories";
Document doc = Jsoup.connect(url).get();

for (Element item : doc.select("item")) {
    final String title = item.select("title").first().text();
    final String description = item.select("description").first().text();
    final String link = item.select("link").first().nextSibling().toString();

    System.out.println(title);
    System.out.println(description);
    System.out.println(link);
    System.out.println("");
}
Explanation:
item.select("link") // Select the 'link' element of the item
.first() // Retrieve the first Element found (since there's only one)
.nextSibling() // Get the next Sibling after the one found; its the TextNode with the real URL
.toString() // Get it as a String
With your link this example prints all elements like this:
Tax Day Freebies and Deals
You made it through tax season. Reward yourself by taking advantage of some special deals on April 15.
http://us.rd.yahoo.com/finance/news/rss/story/SIG=14eetvku9/*http%3A//us.rd.yahoo.com/finance/news/topfinstories/SIG=12btdp321/*http%3A//finance.yahoo.com/news/tax-day-freebies-and-deals-133544366.html?l=1
(...)
