I was wondering if anyone knows how to successfully parse the company name "Alcoa Inc." shown in the URL below. It would be much easier to show a picture but I do not have enough reputation. Any help would be appreciated.
http://www.google.com/finance?q=NYSE%3AAA&ei=LdwVUYC7Fp_YlgPBiAE
This is what I have tried so far using jsoup to parse the div class:
<div class="appbar-snippet-primary">
<span>Alcoa Inc.</span>
</div>
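public Elements htmlParser(String url, String element, String elementType, String returnElement){
try {
// get() already returns a parsed Document, so a second Jsoup.parse is not needed
Document doc = Jsoup.connect(url).get();
if (returnElement == null){
return doc.select(elementType + "." + element);
}
else {
return doc.select(elementType + "." + element + " " + returnElement);
}
} catch (IOException e) {
// The fetch failed: report it and return an empty result instead of null
e.printStackTrace();
return new Elements();
}
}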
public String htmlparseGoogleStocks(String url){
String pr = "pr";
String appbar_center = "appbar-snippet-primary";
String val = "val";
String span = "span";
String div = "div";
String td = "td";
Elements price_data;
Elements title_data;
Elements more_data;
price_data = htmlParser(url, pr, span, null);
title_data = htmlParser(url, appbar_center, div, span);
//more_data = htmlParser(url, val, td, null);
//String stockprice = price_data.text().toString();
String title = title_data.text();
//System.out.println(more_data.text());
return title;
}
Myself, I'd analyze the source HTML of the page of interest and then just use jsoup to extract the information. For instance, using a very small jsoup program like so:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class GoogleFinance {
public static final String PAGE = "https://www.google.com/finance?q=NASDAQ:XONE";
public static void main(String[] args) throws IOException {
Document doc = Jsoup.connect(PAGE).get();
Elements title = doc.select("title");
System.out.println(title.text());
}
}
You get in return:
ExOne Co: NASDAQ:XONE quotes & news - Google Finance
It doesn't get much easier than that.
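If only the company name is wanted, the title text can be trimmed with plain string handling. This is a minimal sketch that assumes the title always has the shape "<company>: <EXCHANGE>:<TICKER> quotes & news - Google Finance", as in the output above; the class and method names are made up for illustration, and the page layout may differ for other tickers.

```java
class TitleNameExtractor {

    // Returns the text before the first ": " separator, or the whole
    // title (trimmed) when the separator is missing.
    // Assumption: the company name itself contains no ": " sequence.
    static String companyName(String title) {
        int sep = title.indexOf(": ");
        return (sep < 0 ? title : title.substring(0, sep)).trim();
    }

    public static void main(String[] args) {
        // Title string copied from the output shown above
        String title = "ExOne Co: NASDAQ:XONE quotes & news - Google Finance";
        System.out.println(companyName(title)); // prints "ExOne Co"
    }
}
```

In the jsoup program above, the input would be title.text().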
I'm trying to build a web crawler for my OOP class. The crawler needs to traverse 1000 Wikipedia pages and collect the titles and words from each page. My current code traverses a single page and collects the required information, but it also throws "java.lang.IllegalArgumentException: Must supply a valid URL:". Here is my crawler's code. I've been using the jsoup library.
import java.util.HashMap;
import java.util.HashSet;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class crawler {
private static final int MAX_PAGES = 1000;
private final HashSet<String> titles = new HashSet<>();
private final HashSet<String> urlVisited = new HashSet<>();
private final HashMap<String, Integer> map = new HashMap<>();
public void getLinks(String startURL) {
if ((titles.size() < MAX_PAGES) && !urlVisited.contains(startURL)) {
urlVisited.add(startURL);
try {
Document doc = Jsoup.connect(startURL).get();
Elements linksFromPage = doc.select("a[href]");
String title = doc.select("title").first().text();
titles.add(title);
String text = doc.body().text();
CountWords(text);
for (Element link : linksFromPage) {
if(titles.size() <= MAX_PAGES) {
Thread.sleep(50);
getLinks(link.attr("a[href]"));
}
else {
System.out.println("URL couldn't visit");
System.out.println(startURL + ", " + urlVisited.size());
}
}
} catch (Exception e) {
e.printStackTrace();
}
}
}
public void PrintAllTitles() {
for (String t : titles) {
System.out.println(t);
}
}
public void PrintAllWordsAndCount() {
for (String key : map.keySet()) {
System.out.println(key + " : " + map.get(key));
}
}
private void CountWords(String text) {
String[] lines = text.split(" ");
for (String word : lines) {
if (map.containsKey(word)) {
int val = map.get(word);
val += 1;
map.remove(word);
map.put(word, val);
} else {
map.put(word, 1);
}
}
}
}
The driver function just calls c.getLinks("https://en.wikipedia.org/wiki/Computer"), with that page as the starting URL.
The issue is in this line:
getLinks(link.attr("a[href]"));
link.attr(attributeName) is a method for getting an element's attribute by name. But a[href] is a CSS selector. So that method call returns a blank String (as there is no attribute in the element named a[href]), which is not a valid URL, and so you get the validation exception.
Before you call connect, you should log the URL you are about to hit. That way you will see the error.
You should change the line to:
getLinks(link.attr("abs:href"));
That will get the absolute URL pointed to by the href attribute. Most of the hrefs on that page are relative, so it's important to make them absolute before they are made into a URL for connect().
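To make "absolute" concrete, here is a rough sketch of resolving an href against the page it was found on, using only the JDK's java.net.URI. It is not how jsoup implements abs:href; it only illustrates the idea. The class and method names are hypothetical, and the sample hrefs are made up.

```java
import java.net.URI;

class AbsoluteLinkDemo {

    // Resolves an href against the URL of the page that contains it.
    // Note: URI.create throws IllegalArgumentException for hrefs with
    // illegal characters (for example, unescaped spaces).
    static String toAbsolute(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "https://en.wikipedia.org/wiki/Computer";
        // A site-relative href gains the scheme and host of the page
        System.out.println(toAbsolute(page, "/wiki/Software"));
        // An href that is already absolute is returned unchanged
        System.out.println(toAbsolute(page, "https://example.org/a"));
    }
}
```

An empty href resolves to something that is still not useful to crawl, so a check for blank strings before calling connect() remains worthwhile.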
You can see the URLs that the first a[href] selector will return here. You should also think about how to only fetch HTML pages and not images (e.g., maybe filter out by filetype).
There is more detail and examples of this area in the Working with URLs article of jsoup.
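One possible way to do the filetype filtering mentioned above is a simple extension check before a URL is queued. This is only a heuristic sketch: the extension list and the names LinkFilter and looksLikeHtmlPage are assumptions, and a stricter crawler would inspect the response's Content-Type instead.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LinkFilter {

    // Hypothetical list of extensions the crawler should skip
    private static final List<String> SKIPPED =
            Arrays.asList(".png", ".jpg", ".jpeg", ".gif", ".svg", ".pdf");

    // Returns true when the URL is http(s) and does not end in a skipped extension.
    static boolean looksLikeHtmlPage(String url) {
        if (url == null || url.isEmpty()) {
            return false; // blank hrefs are never valid crawl targets
        }
        // Ignore the query string and fragment when checking the extension
        String path = url.split("[?#]", 2)[0].toLowerCase(Locale.ROOT);
        for (String ext : SKIPPED) {
            if (path.endsWith(ext)) {
                return false;
            }
        }
        return path.startsWith("http://") || path.startsWith("https://");
    }

    public static void main(String[] args) {
        System.out.println(looksLikeHtmlPage("https://en.wikipedia.org/wiki/Software"));    // true
        System.out.println(looksLikeHtmlPage("https://upload.wikimedia.org/a/b/Logo.PNG")); // false
    }
}
```

In the crawler, the check would wrap the recursive call, for example by only calling getLinks(next) when looksLikeHtmlPage(next) is true.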
I parsed a website with Jsoup and extracted the links. Now I tried to store just a part of that link in an ArrayList. Somehow I cannot store one link at a time.
I tried several String methods, Scanner and BufferedReader without success.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class DatenImportUnternehmen {
public static void main(String[] args) throws IOException {
ArrayList<String> aktien = new ArrayList<String>();
String searchUrl = "https://www.ariva.de/aktiensuche/_result_table.m";
for(int i = 0; i < 1; i++) {
String searchBody = "page=" + Integer.toString(i)
+ "&page_size=25&sort=ariva_name&sort_d=asc"
+ "&ariva_performance_1_year=_&ariva_performance_3_years=&ariva_performance_5_years="
+ "&index=0&founding_year=&land=0&industrial_sector=0&sector=0&currency=0"
+ "&type_of_share=0&year=_all_years&sales=_&profit_loss=&sum_assets=&sum_liabilities="
+ "&number_of_shares=&earnings_per_share=&dividend_per_share=&turnover_per_share="
+ "&book_value_per_share=&cashflow_per_share=&balance_sheet_total_per_share="
+ "&number_of_employees=&turnover_per_employee=_&profit_per_employee="
+ "&kgv=_&kuv=_&kbv=_&dividend_yield=_&return_on_sales=_";
// post request to search URL
Document document =
Jsoup.connect(searchUrl).requestBody(searchBody).post();
// find links in returned HTML
for(Element link:document.select("a[href]")) {
String link1 = link.toString();
String link2 = link1.substring(link1.indexOf('/'));
String link3 = link2.substring(0, link2.indexOf('"'));
aktien.add(link3);
System.out.println(aktien);
}
}
}
}
My output looks like (just a part of it):
[/1-1_drillisch-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie, /21st-
_cent-_fox_b_new-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie, /21st-
_cent-_fox_b_new-aktie, /21st_century_fox-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie, /21st-
_cent-_fox_b_new-aktie, /21st_century_fox-aktie, /2g_energy-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie, /21st-
_cent-_fox_b_new-aktie, /21st_century_fox-aktie, /2g_energy-aktie,
/3i_group-aktie]
[/1-1_drillisch-aktie, /11_88_0_solutions-aktie, /1st_red-aktie, /21st-
_cent-_fox_b_new-aktie, /21st_century_fox-aktie, /2g_energy-aktie,
/3i_group-aktie, /3i_infrastructure-aktie]
What I want to achieve is:
[/1-1_drillisch-aktie]
[/11_88_0_solutions-aktie]
[/1st_red-aktie]
[/21st-_cent-_fox_b_new-aktie]
and so on.
I just don't know what the problem is at this stage.
Your problem is that you are printing the whole list while you are still adding to it inside the loop, so every iteration prints everything collected so far.
To resolve the issue, you can either print the list after the loop, which prints everything in one go, or print link3 (the value you are adding to the ArrayList) inside the loop instead of the list.
Option 1:
for(Element link:document.select("a[href]")) {
String link1 = link.toString();
String link2 = link1.substring(link1.indexOf('/'));
String link3 = link2.substring(0, link2.indexOf('"'));
aktien.add(link3);
}
System.out.println(aktien);
Option 2:
for(Element link:document.select("a[href]")) {
String link1 = link.toString();
String link2 = link1.substring(link1.indexOf('/'));
String link3 = link2.substring(0, link2.indexOf('"'));
aktien.add(link3);
System.out.println(link3);
}
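If the bracketed, one-link-per-line output from the question is the actual goal, each entry can be formatted individually. This is a small sketch with made-up names (PerLinkOutput, formatEach); the sample paths are taken from the output in the question, and the list stands in for aktien.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PerLinkOutput {

    // Wraps each stored link in brackets, one entry per output line.
    static List<String> formatEach(List<String> links) {
        List<String> lines = new ArrayList<>();
        for (String link : links) {
            lines.add("[" + link + "]");
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for the aktien list after the loop has finished
        List<String> aktien = Arrays.asList(
                "/1-1_drillisch-aktie", "/11_88_0_solutions-aktie", "/1st_red-aktie");
        for (String line : formatEach(aktien)) {
            System.out.println(line);
        }
    }
}
```

The list itself still holds every link, one element per entry; only the printing changes.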
Alright so I finished my Yelp scanner, and everything is running great. What I want to do now is have the program retrieve the url for each link to each business, go to that page, and scan for whether it contains:
xlink:href="#30x30_bullhorn"></use>
I have a pretty good idea of how I'm going to go about that; however, I can't seem to find a jsoup method that retrieves a link's URL. Is the URL somewhere in the page's HTML? I'm not very proficient with HTML, so 90% of what I'm looking at is gibberish. Here's an example if you want to check out what I'm referring to.
https://www.yelp.com/search?find_loc=nj&start=10 is the main (search results) page, from which I need to obtain the URL of the business page https://www.yelp.com/biz/la-cocina-newark. The orange bullhorn is what I am trying to get the program to find. Here's my code, by the way:
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
import java.util.Scanner;
public class YelpScrapper
{
public static void main(String[] args) throws IOException, Exception
{
//Variables
String description;
String location;
int pages;
int parseCount = 0;
Document document;
Scanner keyboard = new Scanner(System.in);
//Perform a Search
System.out.print("Enter a description: ");
description = keyboard.nextLine();
System.out.print("Enter a state: ");
location = keyboard.nextLine();
System.out.print("How many pages should we scan? ");
pages = keyboard.nextInt();
String descString = "find_desc=" + description.replace(' ', '+') + "&";
String locString = "find_loc=" + location.replace(' ', '+') + "&";
int number = 0;
String url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
ArrayList<String> names = new ArrayList<String>();
ArrayList<String> address = new ArrayList<String>();
ArrayList<String> phone = new ArrayList<String>();
//Fetch Data From Yelp
for (int i = 0 ; i <= pages ; i++)
{
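// Rebuild the URL on every pass so that "start" follows the updated number
url = "https://www.yelp.com/search?" + descString + locString + "start=" + number;
document = Jsoup.connect(url).get();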
Elements nameElements = document.select(".indexed-biz-name span");
Elements addressElements = document.select(".secondary-attributes address");
Elements phoneElements = document.select(".biz-phone");
for (Element element : nameElements)
{
names.add(element.text());
}
for (Element element : addressElements)
{
address.add(element.text());
}
for (Element element : phoneElements)
{
phone.add(element.text());
}
for (int index = 0 ; index < 10 ; index++)
{
System.out.println("\nLead " + parseCount);
System.out.println("Company Name: " + names.get(parseCount));
System.out.println("Address: " + address.get(parseCount));
System.out.println("Phone Number: " + phone.get(parseCount));
parseCount = parseCount + 1;
}
number = number + 10;
}
}
}
Learn how to use the Inspect tool in Chrome Developer Tools, as it makes it very easy to locate elements in the DOM (you said you aren't comfortable with HTML; you will be after this, and Inspect is a great learning tool). Focusing the inspector on the "View Now" button takes you to the element that holds the link.
You'll have to figure out how to traverse down to that element, and childNodes() will be helpful for that. The element carries the classes ybtn, ybtn--primary, ybtn--small and ybtn-cta. As far as I know, getElementsByClass expects a single class name, so to match all four at once a CSS selector is the safer route: select(".ybtn.ybtn--primary.ybtn--small.ybtn-cta"). Then use the .attr() method of the Element class to get the href: .attr("href").
I have this Element:
<td id="color" align="center">
Z 29.02-23.05 someText,
<br>
some.Text2 J. Smith (l.)
</td>
How do I get the text after the <br> tag, so that the result looks like some.Text2 J. Smith? I tried to find an answer in the documentation, but ...
Update
If I use
System.out.println(element.select("a").text());
I get only J. Smith. Unfortunately, I don't know how to parse tags like <br>.
Node.childNodes could save your life:
package com.github.davidepastore.stackoverflow35436825;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
/**
* Stackoverflow 35436825
*
*/
public class App
{
public static void main( String[] args )
{
String html = "<html><body><table><tr><td id=\"color\" align=\"center\">" +
"Z 29.02-23.05 someText," +
"<br>" +
"some.Text2 J. Smith (l.) " +
"</td></tr></table></body></html>";
Document doc = Jsoup.parse( html );
Element td = doc.getElementById( "color" );
String text = getText( td );
System.out.println("Text: " + text);
}
/**
* Get the custom text from the given {@link Element}.
* @param element The {@link Element} from which to get the custom text.
* @return Returns the custom text.
*/
private static String getText(Element element) {
String working = "";
List<Node> childNodes = element.childNodes();
boolean brFound = false;
for (int i = 0; i < childNodes.size(); i++) {
Node child = childNodes.get( i );
if (child instanceof TextNode) {
if(brFound){
working += ((TextNode) child).text();
}
}
if (child instanceof Element) {
Element childElement = (Element)child;
if(brFound){
working += childElement.text();
}
if(childElement.tagName().equals( "br" )){
brFound = true;
}
}
}
return working;
}
}
The output will be:
Text: some.Text2 J. Smith (l.)
As far as I know, you can only retrieve the text between two tags, which is not possible with a single <br/> tag in your document.
The only option I can think of is to split the element's inner HTML on the <br> tag and take the second part. Note that element.text() strips all tags, so the split has to be done on element.html():
String partAfterBr = element.html().split("<br\\s*/?>")[1];
Document relevantPart = Jsoup.parse(partAfterBr);
// do whatever you want with the Document in order to receive the necessary parts
I am relatively new to Java and I have been trying to figure out how to reach the following tags for output for a couple of long, LONG days now. I would really appreciate some insight into the problem. It seems like everything I could find and/or try just does not pan out. (Excuse the cheesy news articles.)
<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (#carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
<p>Check out what else Bryan had to say above.</p>
<p>Source: </p>
]]>
</description>
</item>
I have managed to parse the XML and print out the content in both the title and description element tags, however the output for the description element tag also includes all its child element tags. I would like to use this project in future to build on my Java portfolio, please help!
My code so far:
public class NewXmlReader
{
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
docXml.getDocumentElement().normalize();
NewXMLReaderHandlers.handleItemTags(docXml, "item");
} catch (ParserConfigurationException | SAXException parserConfigurationException) {
System.out.println("You Are Not XML formated !!");
parserConfigurationException.printStackTrace();
} catch (IOException iOException) {
System.out.println("URL NOT FOUND");
iOException.getCause();
}
}
}
public class NewXMLReaderHandlers {
private static int ARTICLELENGTH;
public static String inputHandler() throws IOException {
InputStreamReader inputStream = new InputStreamReader(System.in);
BufferedReader bufferRead = new BufferedReader(inputStream);
System.out.println("Please Enter A Proper URL: ");
String urlPageString = bufferRead.readLine();
return urlPageString;
}
public static void handleItemTags( Document document, String rssFeedParentTopicTag){
NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
String rootElement = document.getDocumentElement().getNodeName();
if ("rss".equals(rootElement)){
System.out.println("We Have An RSS Feed To Parse");
for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
Node itemNode = (Node) listOfArticles.item(i);
if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
Element itemElement= (Element) itemNode;
tagContent (itemElement, "title");
tagContent (itemElement, "description");
}
}
}
}
public static void tagContent (Element item, String tagName) {
NodeList tagNodeList = item.getElementsByTagName(tagName);
Element tagElement = (Element)tagNodeList.item(0);
NodeList tagTElist = tagElement.getChildNodes();
Node tagNode = tagTElist.item(0);
// System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n");
if("description".equals(tagName)){
System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
}
}
}
For my money, the easiest solution would be to use the XPath API.
Essentially, it's a query language for XML. See XPath Tutorial for a primer.
This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
public class TestRSSFeed {
public static void main(String[] args) {
try {
// Read the feed...
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
Element root = doc.getDocumentElement();
// Create a xPath instance
XPath xPath = XPathFactory.newInstance().newXPath();
// Find all the nodes that are named <entry...> any where in
// the document that live under the parent node...
XPathExpression expression = xPath.compile("//entry");
NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);
System.out.println("Found " + nl.getLength() + " items...");
for (int index = 0; index < nl.getLength(); index++) {
Node node = nl.item(index);
// This is a sub node search.
// The search is based on the parent node and looks for a single
// node titled "title" that belongs to the parent node...
// I did this because I'm only expecting a single node...
expression = xPath.compile("title");
Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
System.out.println(child.getTextContent());
}
} catch (IOException | ParserConfigurationException | SAXException exp) {
exp.printStackTrace();
} catch (XPathExpressionException ex) {
ex.printStackTrace();
}
}
}
Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
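For a feed that uses <item> like the one in the question, the same two-step query could look roughly like this. It is a sketch under assumptions: the feed is passed in as a string rather than fetched from a URL, the sample XML is made up, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ItemTitles {

    // Returns the trimmed <title> text of every <item> in the given RSS string.
    static List<String> titles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            XPath xPath = XPathFactory.newInstance().newXPath();

            // "//item" matches <item> elements anywhere in the document
            NodeList items = (NodeList) xPath.evaluate("//item", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                // Relative query: the <title> child of the current <item>
                Node title = (Node) xPath.evaluate("title", items.item(i), XPathConstants.NODE);
                if (title != null) {
                    // getTextContent() also returns text held in a CDATA section
                    result.add(title.getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the feed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample shaped like the <item> in the question
        String xml = "<rss><channel>"
                + "<item><title><![CDATA[ First story ]]></title></item>"
                + "<item><title>Second story</title></item>"
                + "</channel></rss>";
        System.out.println(titles(xml));
    }
}
```

Swapping "title" for "description" returns the CDATA payload of the description, still with its embedded markup as plain text.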
Just in case anyone is still wondering how I managed to solve the CDATA puzzle, the logic is as follows:
Once the program extracts the XML and walks the same node tree that the RSS feed shows, any data wrapped in a CDATA section comes back as plain text, markup included. The only way I found to access that information is to create a new XML document from the text content of the CDATA section. Once you parse that new document, you should be able to access all the data you need.
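A minimal sketch of that idea, using only the JDK's DOM parser, might look like the following. It assumes the CDATA payload is well-formed XML once wrapped in a single root element; the <root> wrapper, the sample feed, and the class and method names are all made up for illustration. Real feeds often embed HTML that is not well-formed XML (an unclosed <br>, for example), in which case a lenient HTML parser such as jsoup is probably the better tool for the second pass.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CdataReparse {

    // Returns the text of each <p> found inside the first <description>'s CDATA payload.
    static List<String> paragraphs(String rssXml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document feed = builder.parse(new InputSource(new StringReader(rssXml)));

            // First pass: the CDATA payload comes back as plain text, tags included
            String payload = feed.getElementsByTagName("description").item(0).getTextContent();

            // Second pass: wrap the payload in a hypothetical <root> so it has one root element
            Document inner = builder.parse(
                    new InputSource(new StringReader("<root>" + payload + "</root>")));

            NodeList ps = inner.getElementsByTagName("p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < ps.getLength(); i++) {
                result.add(ps.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the feed or its CDATA payload", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample shaped like the <description> in the question
        String xml = "<rss><channel><item><description><![CDATA["
                + "<img src=\"x.png\" /><br />"
                + "<p>First paragraph.</p><p>Second paragraph.</p>"
                + "]]></description></item></channel></rss>";
        System.out.println(paragraphs(xml));
    }
}
```

The second parse is what turns the <p> tags from text into real elements that can be queried.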