I'm using the getElementsByTag method to extract data from the following XML document (the Yahoo Finance news API: http://finance.yahoo.com/rss/topfinstories).
I'm using the code below. It gets the news items and the titles with no problem using getElementsByTag, but for some reason it won't pick up the link when searched by tag; it only picks up the closing tag of the link element. Is it a problem with the XML document or a problem with jsoup?
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
class GetNewsXML {
/**
* @param args
*/
public static void main(String args[]){
Document doc = null;
String con = "http://finance.yahoo.com/rss/topfinstories";
try {
doc = Jsoup.connect(con).get();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Elements collection = doc.getElementsByTag("item");// Gets each news item
for (Element c: collection){
System.out.println(c.getElementsByTag("title"));
}
for (Element c: collection){
System.out.println(c.getElementsByTag("link"));
}
}
}
You get <link /> followed by http://...; the link text is put after the link tag as a TextNode, because jsoup's HTML parser treats <link> as a self-closing (void) element.
But this is not a problem:
final String url = "http://finance.yahoo.com/rss/topfinstories";
Document doc = Jsoup.connect(url).get();
for( Element item : doc.select("item") )
{
final String title = item.select("title").first().text();
final String description = item.select("description").first().text();
final String link = item.select("link").first().nextSibling().toString();
System.out.println(title);
System.out.println(description);
System.out.println(link);
System.out.println("");
}
Explanation:
item.select("link") // Select the 'link' element of the item
.first() // Retrieve the first Element found (since there's only one)
.nextSibling() // Get the next sibling after the one found; it's the TextNode with the real URL
.toString() // Get it as a String
With your link this example prints all elements like this:
Tax Day Freebies and Deals
You made it through tax season. Reward yourself by taking advantage of some special deals on April 15.
http://us.rd.yahoo.com/finance/news/rss/story/SIG=14eetvku9/*http%3A//us.rd.yahoo.com/finance/news/topfinstories/SIG=12btdp321/*http%3A//finance.yahoo.com/news/tax-day-freebies-and-deals-133544366.html?l=1
(...)
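As an aside, jsoup can also parse the feed in XML mode via Parser.xmlParser(), where <link> is not treated as a void HTML element, so element.text() works directly. A minimal offline sketch (the RSS fragment below is made up for illustration; in practice the string would come from the feed URL):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RssXmlParse {
    public static void main(String[] args) {
        // Small stand-in for the real feed content
        String xml = "<rss><channel><item>"
                + "<title>Example story</title>"
                + "<link>http://example.com/story</link>"
                + "</item></channel></rss>";
        // Parser.xmlParser() keeps <link> as a normal element with text content,
        // unlike the HTML parser, which treats <link> as self-closing
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        for (Element item : doc.select("item")) {
            System.out.println(item.select("title").text());
            System.out.println(item.select("link").text()); // the URL, no nextSibling() needed
        }
    }
}
```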
I'm trying to build a web crawler for my OOP class. The crawler needs to traverse 1000 Wikipedia pages and collect the titles and words from each page. The code I have will traverse a single page and collect the required information, but it also gives me the error "java.lang.IllegalArgumentException: Must supply a valid URL:". Here is my crawler's code; I've been using the jsoup library.
import java.util.HashMap;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class crawler {
private static final int MAX_PAGES = 1000;
private final HashSet<String> titles = new HashSet<>();
private final HashSet<String> urlVisited = new HashSet<>();
private final HashMap<String, Integer> map = new HashMap<>();
public void getLinks(String startURL) {
if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {
urlVisited.add(startURL);
try {
Document doc = Jsoup.connect(startURL).get();
Elements linksFromPage = doc.select("a[href]");
String title = doc.select("title").first().text();
titles.add(title);
String text = doc.body().text();
CountWords(text);
for (Element link : linksFromPage) {
if(titles.size() <= MAX_PAGES) {
Thread.sleep(50);
getLinks(link.attr("a[href]"));
}
else {
System.out.println("URL couldn't visit");
System.out.println(startURL + ", " + urlVisited.size());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void PrintAllTitles() {
for (String t : titles) {
System.out.println(t);
}
}
public void PrintAllWordsAndCount() {
for (String key : map.keySet()) {
System.out.println(key + " : " + map.get(key));
}
}
private void CountWords(String text) {
String[] lines = text.split(" ");
for (String word : lines) {
if (map.containsKey(word)) {
int val = map.get(word);
val += 1;
map.remove(word);
map.put(word, val);
} else {
map.put(word, 1);
}
}
}
}
The driver function just uses c.getLinks("https://en.wikipedia.org/wiki/Computer") as the starting URL.
The issue is in this line:
getLinks(link.attr("a[href]"));
link.attr(attributeName) is a method for getting an element's attribute by name, but a[href] is a CSS selector, not an attribute name. So that call returns a blank String (there is no attribute named a[href] on the element), which is not a valid URL, and so you get the validation exception.
Before you call connect(), log the URL you are about to hit; that way you will see the problem immediately.
You should change the line to:
getLinks(link.attr("abs:href"));
That will get the absolute URL pointed to by the href attribute. Most of the hrefs on that page are relative, so it's important to resolve them to absolute URLs before passing them to connect().
You can see the URLs that the first a[href] selector returns. You should also think about fetching only HTML pages and not images (e.g., filter by file extension).
There is more detail and examples of this area in the "Working with URLs" article of the jsoup documentation.
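To see the difference between the two attr() calls, here is a small offline sketch (the HTML fragment and base URI are made up for illustration; Jsoup.connect(url).get() sets the base URI for you):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsHrefDemo {
    public static void main(String[] args) {
        // Parse a fragment with an explicit base URI
        String html = "<a href='/wiki/Software'>Software</a>";
        Document doc = Jsoup.parse(html, "https://en.wikipedia.org/wiki/Computer");
        Element link = doc.select("a[href]").first();
        System.out.println(link.attr("a[href]"));  // "" - no attribute named "a[href]" exists
        System.out.println(link.attr("href"));     // "/wiki/Software" - relative, not fetchable
        System.out.println(link.attr("abs:href")); // "https://en.wikipedia.org/wiki/Software"
    }
}
```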
My code pulls the links from a page and adds them to the HashSet. I want each found link to replace the original link, and the process to repeat until no new links can be found. The program keeps running, but the link isn't updating, and it gets stuck in an infinite loop doing nothing. How do I get the link to update so the program repeats until no more links can be found?
package downloader;
import java.io.IOException;
import java.net.URL;
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Stage2 {
public static void main(String[] args) throws IOException {
int q = 0;
int w = 0;
HashSet<String> chapters = new HashSet();
String seen = new String("/manga/manabi-ikiru-wa-fuufu-no-tsutome/i1778063/v1/c1");
String source = new String("https://mangapark.net" + seen);
// 0123456789
while( q == w ) {
String source2 = new String(source.substring(21));
String last = new String(source.substring(source.length() - 12));
String last2 = new String(source.substring(source.length() - 1));
chapters.add(seen);
for (String link : findLinks(source)) {
if(link.contains("/manga") && !link.contains(last) && link.contains("/i") && link.contains("/c") && !chapters.contains(link)) {
chapters.add(link);
System.out.println(link);
seen = link;
System.out.print(chapters);
System.out.println(seen);
}
}
}
System.out.print(chapters);
}
private static Set<String> findLinks(String url) throws IOException {
Set<String> links = new HashSet<>();
Document doc = Jsoup.connect(url)
.data("query", "Java")
.userAgent("Mozilla")
.cookie("auth", "token")
.timeout(3000)
.get();
Elements elements = doc.select("a[href]");
for (Element element : elements) {
links.add(element.attr("href"));
}
return links;
}
}
Your program doesn't stop because your while condition never changes:
while( q == w )
is always true. When I ran your code without the while loop, I got two links printed twice(!) and the program stopped.
If you want the links to the other chapters, you have the same problem I had. In the element
Element element = doc.getElementById("sel_book_1");
the links come after the pseudo-element ::before, so they will not be in your jsoup Document.
Here is my question on this topic:
How can I find a HTML tag with the pseudoElement ::before in jsoup
I have this Element:
<td id="color" align="center">
Z 29.02-23.05 someText,
<br>
some.Text2 J. Smith (l.)
</td>
How do I get the text after the <br> tag, so that it reads some.Text2 J. Smith? I tried to find an answer in the documentation, but ...
Update
If I use
System.out.println(element.select("a").text());
I get only J. Smith. Unfortunately, I don't know how to parse tags like <br>.
Node.childNodes could save your life:
package com.github.davidepastore.stackoverflow35436825;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
/**
* Stackoverflow 35436825
*
*/
public class App
{
public static void main( String[] args )
{
String html = "<html><body><table><tr><td id=\"color\" align=\"center\">" +
"Z 29.02-23.05 someText," +
"<br>" +
"some.Text2 J. Smith (l.) " +
"</td></tr></table></body></html>";
Document doc = Jsoup.parse( html );
Element td = doc.getElementById( "color" );
String text = getText( td );
System.out.println("Text: " + text);
}
/**
* Get the custom text from the given {@link Element}.
* @param element The {@link Element} from which to get the custom text.
* @return Returns the custom text.
*/
private static String getText(Element element) {
String working = "";
List<Node> childNodes = element.childNodes();
boolean brFound = false;
for (int i = 0; i < childNodes.size(); i++) {
Node child = childNodes.get( i );
if (child instanceof TextNode) {
if(brFound){
working += ((TextNode) child).text();
}
}
if (child instanceof Element) {
Element childElement = (Element)child;
if(brFound){
working += childElement.text();
}
if(childElement.tagName().equals( "br" )){
brFound = true;
}
}
}
return working;
}
}
The output will be:
Text: some.Text2 J. Smith (l.)
As far as I know you can only retrieve the text between two tags, which is not possible with a single <br> tag in your document.
The only option I can think of is to use split() on the inner HTML in order to retrieve the second part (note: split element.html(), not element.text(), since text() strips the <br> markup entirely):
String partAfterBr = element.html().split("<br>")[1];
Document relevantPart = Jsoup.parse(partAfterBr);
// do whatever you want with the Document in order to retrieve the necessary parts
I am relatively new to Java and I have been trying to figure out how to reach the following tags for output for a couple of long, LONG days now. I would really appreciate some insight into the problem; it seems like everything I could find or try just does not pan out. (Excuse the cheesy news articles.)
<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (#carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
<p>Check out what else Bryan had to say above.</p>
<p>Source: </p>
]]>
</description>
</item>
I have managed to parse the XML and print out the content of both the title and description element tags; however, the output for the description tag also includes all of its child element tags. I would like to use this project in the future to build on my Java portfolio, so please help!
My code so far:
public class NewXmlReader
{
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
docXml.getDocumentElement().normalize();
NewXMLReaderHandlers.handleItemTags(docXml, "item");
} catch (ParserConfigurationException | SAXException parserConfigurationException) {
System.out.println("You Are Not XML formated !!");
parserConfigurationException.printStackTrace();
} catch (IOException iOException) {
System.out.println("URL NOT FOUND");
iOException.getCause();
}
}
}
public class NewXMLReaderHandlers {
private static int ARTICLELENGTH;
public static String inputHandler() throws IOException {
InputStreamReader inputStream = new InputStreamReader(System.in);
BufferedReader bufferRead = new BufferedReader(inputStream);
System.out.println("Please Enter A Proper URL: ");
String urlPageString = bufferRead.readLine();
return urlPageString;
}
public static void handleItemTags( Document document, String rssFeedParentTopicTag){
NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
String rootElement = document.getDocumentElement().getNodeName();
if (rootElement.equals("rss")){
System.out.println("We Have An RSS Feed To Parse");
for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
Node itemNode = (Node) listOfArticles.item(i);
if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
Element itemElement= (Element) itemNode;
tagContent (itemElement, "title");
tagContent (itemElement, "description");
}
}
}
}
public static void tagContent (Element item, String tagName) {
NodeList tagNodeList = item.getElementsByTagName(tagName);
Element tagElement = (Element)tagNodeList.item(0);
NodeList tagTElist = tagElement.getChildNodes();
Node tagNode = tagTElist.item(0);
// System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n");
if (tagName.equals("description")) {
System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
}
}
}
For my money, the easiest solution would be to use the XPath API.
Essentially, it's a query language for XML. See XPath Tutorial for a primer.
This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class TestRSSFeed {
public static void main(String[] args) {
try {
// Read the feed...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
Element root = doc.getDocumentElement();
// Create a xPath instance
XPath xPath = XPathFactory.newInstance().newXPath();
// Find all the nodes that are named <entry...> any where in
// the document that live under the parent node...
XPathExpression expression = xPath.compile("//entry");
NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);
System.out.println("Found " + nl.getLength() + " items...");
for (int index = 0; index < nl.getLength(); index++) {
Node node = nl.item(index);
// This is a sub node search.
// The search is based on the parent node and looks for a single
// node titled "title" that belongs to the parent node...
// I did this because I'm only expecting a single node...
expression = xPath.compile("title");
Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
System.out.println(child.getTextContent());
}
} catch (IOException | ParserConfigurationException | SAXException exp) {
exp.printStackTrace();
} catch (XPathExpressionException ex) {
ex.printStackTrace();
}
}
}
Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
Just in case anyone is still wondering how I managed to solve the CDATA puzzle:
The logic is as follows:
Once the program extracts the XML and displays the correct node tree as the RSS feed does, any data wrapped in CDATA tags can only be accessed by creating a new document from the text content of the CDATA section. Once you parse that new document, you should be able to access all the data you need.
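That re-parse step can be sketched like this (the description string below is a made-up stand-in for the CDATA text pulled out of the DOM node): parse the embedded markup as its own document, then read back only the plain text, without the child tags.

```java
import org.jsoup.Jsoup;

public class CdataReparse {
    public static void main(String[] args) {
        // Stand-in for the text content of a <description> CDATA section
        String cdataContent = "<img src='a.jpg' /><br />."
                + "<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>"
                + "<p>Check out what else Bryan had to say above.</p>";
        // Re-parse the embedded markup as a new document,
        // then take its text to drop the child element tags
        String plainText = Jsoup.parse(cdataContent).text();
        System.out.println(plainText);
    }
}
```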
How could I use Jsoup to extract specification data from this website separately for each row e.g. Network->Network Type, Battery etc.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class mobilereviews {
public static void main(String[] args) throws Exception {
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
System.out.println(tds.get(0).text());
}
}
}
}
Here is an attempt to solve your problem:
Document doc = Jsoup.connect("http://mobilereviews.net/details-for-Motorola%20L7.htm").get();
for (Element table : doc.select("table[id=phone_details]")) {
for (Element row : table.select("tr:gt(2)")) {
Elements tds = row.select("td:not([rowspan])");
System.out.println(tds.get(0).text() + "->" + tds.get(1).text());
}
}
Parsing the HTML is tricky and if the HTML changes your code needs to change as well.
You need to study the HTML markup to come up with your parsing rules first.
There are multiple tables in the HTML, so you first filter on the correct one table[id=phone_details]
The first 2 table rows contain only markup for formatting, so skip them tr:gt(2)
Every other row starts with the global description for the content type, filter it out td:not([rowspan])
For more complex options in the selector syntax, look here http://jsoup.org/cookbook/extracting-data/selector-syntax
XPath for the columns - //*[@id="phone_details"]/tbody/tr[3]/td[2]/strong
XPath for the values - //*[@id="phone_details"]/tbody/tr[3]/td[3]
@Joey's code tries to zero in on these. You should be able to write the select() rules based on the XPath.
Replace the numbers (tr[N] / td[N]) with appropriate values.
Alternatively, you can pipe the HTML thought a text only browser and extract the data from the text. Here is the text version of the page. You can delimit the text or read after N chars to extract the data.
This is how I get the data from an HTML table:
String cadena = ""; // accumulates the table contents as comma-separated lines
org.jsoup.nodes.Element tablaRegistros = doc
.getElementById("tableId");
for (org.jsoup.nodes.Element row : tablaRegistros.select("tr")) {
for (org.jsoup.nodes.Element column : row.select("td")) {
// Elements tds = row.select("td");
// cadena += tds.get(0).text() + "->" +
// tds.get(1).text()
// + " \n";
cadena += column.text() + ",";
}
cadena += "\n";
}
Here is a generic solution for extracting a table from an HTML page via jsoup.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class ExtractTableDataUsingJSoup {
public static void main(String[] args) {
extractTableUsingJsoup("http://mobilereviews.net/details-for-Motorola%20L7.htm","phone_details");
}
public static void extractTableUsingJsoup(String url, String tableId){
Document doc;
try {
// need http protocol
doc = Jsoup.connect(url).get();
//Set id of any table from any website and the below code will print the contents of the table.
//Set the extracted data in appropriate data structures and use them for further processing
Element table = doc.getElementById(tableId);
Elements tds = table.getElementsByTag("td");
//You can check for nesting of tds if such structure exists
for (Element td : tds) {
System.out.println("\n"+td.text());
}
} catch (IOException e) {
e.printStackTrace();
}
}
}