Jsoup parse html-string - java

I have this Element:
<td id="color" align="center">
Z 29.02-23.05 someText,
<br>
some.Text2 J. Smith (l.)
</td>
How do I get the text after the tag <br>, to look like some.Text2 J. Smith I tried to find answer in the documentation, but ...
update
If i use
System.out.println(element.select("a").text());
i get just only J. Smith.. Unfortunately, I don't know how to parse tags like <br>

Node.childNodes could save your life:
package com.github.davidepastore.stackoverflow35436825;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
/**
* Stackoverflow 35436825
*
*/
public class App
{
public static void main( String[] args )
{
String html = "<html><body><table><tr><td id=\"color\" align=\"center\">" +
"Z 29.02-23.05 someText," +
"<br>" +
"some.Text2 J. Smith (l.) " +
"</td></tr></table></body></html>";
Document doc = Jsoup.parse( html );
Element td = doc.getElementById( "color" );
String text = getText( td );
System.out.println("Text: " + text);
}
/**
* Get the custom text from the given {#link Element}.
* #param element The {#link Element} from which get the custom text.
* #return Returns the custom text.
*/
private static String getText(Element element) {
String working = "";
List<Node> childNodes = element.childNodes();
boolean brFound = false;
for (int i = 0; i < childNodes.size(); i++) {
Node child = childNodes.get( i );
if (child instanceof TextNode) {
if(brFound){
working += ((TextNode) child).text();
}
}
if (child instanceof Element) {
Element childElement = (Element)child;
if(brFound){
working += childElement.text();
}
if(childElement.tagName().equals( "br" )){
brFound = true;
}
}
}
return working;
}
}
The output will be:
Text: some.Text2 J. Smith (l.) 

As far as I know you can only receive the text between two tags, which is not possible with a single <br/> tag in your document.
The only option I can think of is to use split() in order to receive the second part:
String partAfterBr = element.text().split("<br>")[1];
Document relevantPart = JSoup.parse(partAfterBr);
// do whatever you want with the Document in order to receive the necessary parts

Related

how to retrieve a url from a link on a website using Jsoup?

Alright so I finished my Yelp scanner, and everything is running great. What I want to do now is have the program retrieve the url for each link to each business, go to that page, and scan for whether it contains:
xlink:href="#30x30_bullhorn"></use>
I pretty much have a good idea of how I'm going to go about doing that, however, I can't seem to find a jSoup method that would retrieve a link's url. Is there somewhere in the page's HTML that would have the url? I'm not very proficient with HTML at all, so 90% of what I'm looking at is gibbering. Here's an example link if you want to check out what I'm referring to.
https://www.yelp.com/search?find_loc=nj&start=10 is the main page, that I need to obtain the url for the page https://www.yelp.com/biz/la-cocina-newark. The orange bullhorn is what I am trying to get it to retrieve. Here's my code btw:
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Scanner;
public class YelpScrapper
{
public static void main(String[] args) throws IOException, Exception
{
//Variables
String description;
String location;
int pages;
int parseCount = 0;
Document document;
Scanner keyboard = new Scanner(System.in);
//Perform a Search
System.out.print("Enter a description: ");
description = keyboard.nextLine();
System.out.print("Enter a state: ");
location = keyboard.nextLine();
System.out.print("How many pages should we scan? ");
pages = keyboard.nextInt();
String descString = "find_desc=" + description.replace(' ', '+') + "&";
String locString = "find_loc=" + location.replace(' ', '+') + "&";
int number = 0;
String url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
ArrayList<String> names = new ArrayList<String>();
ArrayList<String> address = new ArrayList<String>();
ArrayList<String> phone = new ArrayList<String>();
//Fetch Data From Yelp
for (int i = 0 ; i <= pages ; i++)
{
document = Jsoup.connect(url).get();
Elements nameElements = document.select(".indexed-biz-name span");
Elements addressElements = document.select(".secondary-attributes address");
Elements phoneElements = document.select(".biz-phone");
for (Element element : nameElements)
{
names.add(element.text());
}
for (Element element : addressElements)
{
address.add(element.text());
}
for (Element element : phoneElements)
{
phone.add(element.text());
}
for (int index = 0 ; index < 10 ; index++)
{
System.out.println("\nLead " + parseCount);
System.out.println("Company Name: " + names.get(parseCount));
System.out.println("Address: " + address.get(parseCount));
System.out.println("Phone Number: " + phone.get(parseCount));
parseCount = parseCount + 1;
}
number = number + 10;
}
}
}
Learn how to use the Inspect element of Chrome Developer tools, as it makes it incredibly easy to locate elements in the DOM (you said you aren't comfortable with HTML, well you certainly will be after this and using Inspect is a great learning tool). Focusing the inspector on the "View Now" button, you'll get to this:
View Now.
You'll have to figure out how to traverse down to this, and childNodes() will be helpful in traversing down. Then you can use getElementsByClass("ybtn ybtn--primary ybtn--small ybtn-cta") to get to that specific class where the link is, and then use the .attr() method of the Element class to get the href: .attr("href");.

Personal Project "RSS FEED" XML Parser

I am relatively new to Java and I have been trying to figure out how to reach the following tags for output for a couple of long, LONG days now. I would really appreciate some insight into the problem. It seems like everything I could find and or try just does not pan out right. (Excuse the cheesy news articles)
<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (#carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
<p>Check out what else Bryan had to say above.</p>
<p>Source: </p>
]]>
</description>
</item>
I have managed to parse the XML and print out the content in both the title and description element tags, however the output for the description element tag also includes all its child element tags. I would like to use this project in future to build on my Java portfolio, please help!
My code so far:
public class NewXmlReader
{
/**
* #param args the command line arguments
*/
public static void main(String[] args) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
docXml.getDocumentElement().normalize();
NewXMLReaderHandlers.handleItemTags(docXml, "item");
} catch (ParserConfigurationException | SAXException parserConfigurationException) {
System.out.println("You Are Not XML formated !!");
parserConfigurationException.printStackTrace();
} catch (IOException iOException) {
System.out.println("URL NOT FOUND");
iOException.getCause();
}
}
}
public class NewXMLReaderHandlers {
private static int ARTICLELENGTH;
public static String inputHandler() throws IOException {
InputStreamReader inputStream = new InputStreamReader(System.in);
BufferedReader bufferRead = new BufferedReader(inputStream);
System.out.println("Please Enter A Proper URL: ");
String urlPageString = bufferRead.readLine();
return urlPageString;
}
public static void handleItemTags( Document document, String rssFeedParentTopicTag){
NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
String rootElement = document.getDocumentElement().getNodeName();
if (rootElement == "rss"){
System.out.println("We Have An RSS Feed To Parse");
for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
Node itemNode = (Node) listOfArticles.item(i);
if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
Element itemElement= (Element) itemNode;
tagContent (itemElement, "title");
tagContent (itemElement, "description");
}
}
}
}
public static void tagContent (Element item, String tagName) {
NodeList tagNodeList = item.getElementsByTagName(tagName);
Element tagElement = (Element)tagNodeList.item(0);
NodeList tagTElist = tagElement.getChildNodes();
Node tagNode = tagTElist.item(0);
// System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n");
if(tagName == "description"){
System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
}
}
}
For my money, the easiest solution would be to use the XPath API.
Essentially, it's a query language for XML. See XPath Tutorial for a primer.
This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class TestRSSFeed {
public static void main(String[] args) {
try {
// Read the feed...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
Element root = doc.getDocumentElement();
// Create a xPath instance
XPath xPath = XPathFactory.newInstance().newXPath();
// Find all the nodes that are named <entry...> any where in
// the document that live under the parent node...
XPathExpression expression = xPath.compile("//entry");
NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);
System.out.println("Found " + nl.getLength() + " items...");
for (int index = 0; index < nl.getLength(); index++) {
Node node = nl.item(index);
// This is a sub node search.
// The search is based on the parent node and looks for a single
// node titled "title" that belongs to the parent node...
// I did this because I'm only expecting a single node...
expression = xPath.compile("title");
Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
System.out.println(child.getTextContent());
}
} catch (IOException | ParserConfigurationException | SAXException exp) {
exp.printStackTrace();
} catch (XPathExpressionException ex) {
ex.printStackTrace();
}
}
}
Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
Just in case anyone is still left wondering about how i managed to solve the CDATA puzzle:
The logic is as follows:
Once you get the program to extract all the xml to display the correct node tree as the rss feed displays, if any xml data is wrapped in CDATA tags, the only way to access that information is by creating new xml based on the text content in the CDATA tag. Once you parse the new document, you should be able to access all the data you need.

Extract and Parse HTML Table using Jsoup

How could I use Jsoup to extract specification data from this website separately for each row e.g. Network->Network Type, Battery etc.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class mobilereviews {
public static void main(String[] args) throws Exception {
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text());
}
}
}
}
Here is an attempt to find the solution to your problem
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
for (Element row : table.select("tr:gt(2)")) {
Elements tds = row.select("td:not([rowspan])");
System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
}
}
Parsing the HTML is tricky and if the HTML changes your code needs to change as well.
You need to study the HTML markup to come up with your parsing rules first.
There are multiple tables in the HTML, so you first filter on the correct one table[id=phone_details]
The first 2 table rows contain only markup for formatting, so skip them tr:gt(2)
Every other row starts with the global description for the content type, filter it out td:not([rowspan])
For more complex options in the selector syntax, look here http://jsoup.org/cookbook/extracting-data/selector-syntax
xpath for the columns - //*[#id="phone_details"]/tbody/tr[3]/td[2]/strong
xpath for the values - //*[#id="phone_details"]/tbody/tr[3]/td[3]
#Joey's code tries to zero in on these. You should be able to write the select() rules based on the Xpath.
Replace the numbers (tr[N] / td[N]) with appropriate values.
Alternatively, you can pipe the HTML thought a text only browser and extract the data from the text. Here is the text version of the page. You can delimit the text or read after N chars to extract the data.
this is how i get the data from a html table.
org.jsoup.nodes.Element tablaRegistros = doc
.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
for (org.jsoup.nodes.Element column : row.select("td")) {
// Elements tds = row.select("td");
// cadena += tds.get(0).text() + "->" +
// tds.get(1).text()
// + " \n";
cadena += column.text() + ",";
}
cadena += "\n";
}
Here is a generic solution to extraction of table from HTML page via JSoup.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ExtractTableDataUsingJSoup {
public static void main(String[] args) {
extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
}
public static void extractTableUsingJsoup(String url, String tableId){
Document doc;
try {
// need http protocol
doc = Jsoup.connect(url).get();
//Set id of any table from any website and the below code will print the contents of the table.
//Set the extracted data in appropriate data structures and use them for further processing
Element table = doc.getElementById(tableId);
Elements tds = table.getElementsByTag("td");
//You can check for nesting of tds if such structure exists
for (Element td : tds) {
System.out.println("\n"+td.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
}

Cannot extract data from an XML

Im using getElementBytag method to extract data from the following an XML document(Yahoo finance news api http://finance.yahoo.com/rss/topfinstories)
Im using the following code . It gets the new items and the title's no problem using the getelementsBytag method but for some reason wont pick up the link when searched by tag. It only picks up the closing tag for the link element. Is it a problem with the XML document or a problem with jsoup?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class GetNewsXML {
/**
* #param args
*/
/**
* #param args
*/
public static void main(String args[]){
Document doc = null;
String con = "http://finance.yahoo.com/rss/topfinstories";
try {
doc = Jsoup.connect(con).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements collection = doc.getElementsByTag("item");// Gets each news item
for (Element c: collection){
System.out.println(c.getElementsByTag("title"));
}
for (Element c: collection){
System.out.println(c.getElementsByTag("link"));
}
}
You get <link /> http://...; the link is put after the link-tag as a textnode.
But this is not a problem:
final String url = "http://finance.yahoo.com/rss/topfinstories";
Document doc = Jsoup.connect(url).get();
for( Element item : doc.select("item") )
{
final String title = item.select("title").first().text();
final String description = item.select("description").first().text();
final String link = item.select("link").first().nextSibling().toString();
System.out.println(title);
System.out.println(description);
System.out.println(link);
System.out.println("");
}
Explanation:
item.select("link") // Select the 'link' element of the item
.first() // Retrieve the first Element found (since there's only one)
.nextSibling() // Get the next Sibling after the one found; its the TextNode with the real URL
.toString() // Get it as a String
With your link this example prints all elements like this:
Tax Day Freebies and Deals
You made it through tax season. Reward yourself by taking advantage of some special deals on April 15.
http://us.rd.yahoo.com/finance/news/rss/story/SIG=14eetvku9/*http%3A//us.rd.yahoo.com/finance/news/topfinstories/SIG=12btdp321/*http%3A//finance.yahoo.com/news/tax-day-freebies-and-deals-133544366.html?l=1
(...)

Parse Company Info

I was wondering if anyone knows how to successfully parse the company name "Alcoa Inc." shown in the URL below. It would be much easier to show a picture but I do not have enough reputation. Any help would be appreciated.
http://www.google.com/finance?q=NYSE%3AAA&ei=LdwVUYC7Fp_YlgPBiAE
This is what I have tried so far using jsoup to parse the div class:
<div class="appbar-snippet-primary">
<span>Alcoa Inc.</span>
</div>
public Elements htmlParser(String url, String element, String elementType, String returnElement){
try {
Document doc = Jsoup.connect(url).get();
Document parse = Jsoup.parse(doc.html());
if (returnElement == null){
return parse.select(elementType + "." + element);
}
else {
return parse.select(elementType + "." + element + " " + returnElement);
}
}
public String htmlparseGoogleStocks(String url){
String pr = "pr";
String appbar_center = "appbar-snippet-primary";
String val = "val";
String span = "span";
String div = "div";
String td = "td";
Elements price_data;
Elements title_data;
Elements more_data;
price_data = htmlParser(url, pr, span, null);
title_data = htmlParser(url, appbar_center, div, span);
//more_data = htmlParser(url, val, td, null);
//String stockprice = price_data.text().toString();
String title = title_data.text().toString();
//System.out.println(more_data.text());
return title;
Myself, I'd analyze the page of interest's source HTML, and then just use JSoup to extract the information. For instance, using a very small JSoup program like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class GoogleFinance {
public static final String PAGE = "https://www.google.com/finance?q=NASDAQ:XONE";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(PAGE).get();
Elements title = doc.select("title");
System.out.println(title.text());
}
}
You get in return:
ExOne Co: NASDAQ:XONE quotes & news - Google Finance
It doesn't get much easier than that.

Categories

Resources