jsoup - parsing the first row of a table doesn't work - java

I really tried a lot and also searched a lot of websites...
I tried to parse a price from a website with jsoup, but it didn't work.
What I tried out is this:
try {
String str1 = "https://www.google.de/shopping/product/3996339592576509511?hl=de&q=4250155834791&oq=4250155834791&gs_l=products-cc.3...4306.7625.0.8037.13.6.0.7.0.0.60.314.6.6.0...0.0...1ac.1.LgJKDfZQvls&sa=X&ei=eeqlUY2zFNT54QSyloCoDw&ved=0CFIQgggwAA&prds=scoring:p";
doc = Jsoup.connect(str1).get();
final Elements elements = doc.select("td:lt(1)");
String price = doc.select("span").first().text();
System.out.println(price);
System.out.println("Ende");
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
The goal is to extract the lowest price of the product.
Example page:
https://www.google.de/shopping/product/3996339592576509511?hl=de&q=4250155834791&oq=4250155834791&gs_l=products-cc.3...4306.7625.0.8037.13.6.0.7.0.0.60.314.6.6.0...0.0...1ac.1.LgJKDfZQvls&sa=X&ei=eeqlUY2zFNT54QSyloCoDw&ved=0CFIQgggwAA&prds=scoring:p
I would like to parse the first row of the results, in this case: ebay 24-trade365.
I need the article's price and the link to the vendor.
Can anyone help, please?

You'd do better extracting by class. Also, when you select "span" you are selecting from the original doc, not from the elements you have already extracted. Try something like:
// get all column entries for price
final Elements elements = doc.getElementsByClass("os-price-col");
int lowestPrice = Integer.MAX_VALUE;
// for each entry
for (Element element : elements) {
    // get the price in text form
    String textPrice = element.text();
    // keep only the digits, e.g. "12,99 €" -> "1299" (the price in cents,
    // assuming every price is shown with exactly two decimal places)
    textPrice = textPrice.replaceAll("[^0-9]", "");
    // skip cells that contain no digits at all
    if (textPrice.isEmpty()) continue;
    int price = Integer.parseInt(textPrice);
    // check if it's the lowest
    if (price < lowestPrice) lowestPrice = price;
}
System.out.println(lowestPrice);
Adjust this slightly to get the output in the format you want.
EDIT: Just saw that you wanted the vendor link as well. In that case I would extract one row at a time, i.e.
Elements rows = doc.getElementsByClass("os-row");
Then iterate through the rows and pick out each price as before, but this time with
row.getElementsByClass("os-price-col").first();
If it's the lowest, you can pick out the vendor URL with something like
row.getElementsByClass("os-seller-name").first().select("a").attr("href");
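The price text still has to be turned into a number before prices can be compared. As a rough sketch (not from the original answer), the helper below assumes German-style formatting, where "." groups thousands and "," marks the decimals. The class and method names are made up for illustration and are not part of jsoup.

```java
import java.math.BigDecimal;

final class PriceText {
    // Hypothetical helper: converts scraped price text to a BigDecimal.
    // Assumption: German-style formatting ("." = thousands, "," = decimals).
    static BigDecimal parseGermanPrice(String text) {
        String cleaned = text.replaceAll("[^0-9,.]", "") // drop currency text and spaces
                             .replace(".", "")            // remove thousands separators
                             .replace(",", ".");          // decimal comma -> decimal point
        // Throws NumberFormatException if no digits are left.
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parseGermanPrice("12,99 EUR"));    // prints 12.99
        System.out.println(parseGermanPrice("1.299,00 EUR")); // prints 1299.00
    }
}
```

BigDecimal values can then be compared with compareTo to find the lowest price without rounding surprises.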

If your table is already sorted and all you want is the first row:
Element table = doc.getElementsByClass("os-main-table").first();
Element firstRow = table.select("tr[class=os-row]").first();
Element seller = firstRow.select("td[class=os-seller-name]").first();
String sellerName = seller.text().trim();
String sellerLink = seller.getElementsByTag("a").first().attr("href");
String price = firstRow.select("td[class=os-price-col]").first().getElementsByClass("os-base_price").text();
You can find a tutorial on Jsoup navigation at http://jsoup.org/cookbook/extracting-data/dom-navigation

Related

How to remove one letter in Jsoup?

I am having trouble parsing a string into a double. I am trying to get the price of an item, but the string comes back as a value like $24.54. So the problem (I'm assuming) is the $. Is there a way to eliminate the $, turn the 24.54 into a double, and store it in a variable? Here is the code (PS: I am a C++ programmer and haven't programmed in Java for a while, so feel free to give me tips):
Elements tdsInSecondRow = doc.select("table tr:eq(1) > td:eq(0)"); //Test the changing of these numbers
Elements prices = doc.select("table tr:eq(1) > td:eq(2)");
for (Element td : tdsInSecondRow)
{
String word = td.text(); // save the cell text
double price = Double.parseDouble(prices.text());
System.out.println(word);
System.out.println(price);
}
You could check whether the price text (held in word here) starts with "$" and strip it before parsing:
if(word.startsWith("$")){
word = word.substring(1);
}
double price = Double.parseDouble(word);
Hope it helps.
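A slightly more tolerant variant, as a minimal sketch: strip everything that is not a digit or a decimal point before parsing. It assumes US-style text such as "$24.54" or "$1,024.50". The class and method names are hypothetical, and a minus sign would be dropped as well.

```java
final class DollarPrice {
    // Hypothetical helper: keeps only digits and the decimal point, then parses.
    // Assumption: US-style formatting ("," = thousands, "." = decimals).
    static double parse(String text) {
        String cleaned = text.trim().replaceAll("[^0-9.]", "");
        // Throws NumberFormatException if nothing numeric is left.
        return Double.parseDouble(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(parse("$24.54"));    // prints 24.54
        System.out.println(parse("$1,024.50")); // prints 1024.5
    }
}
```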

Iterating through elements in jsoup and parsing href

I was having trouble getting just the href from the rows of a table. Although I was able to get it working, I am wondering if anyone can explain why my code here works.
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
//The line below is what I don't understand
String link = tds.get(0).getElementsByAttribute("href").first().attr("href");
String position = tds.get(1).text();
}
}
The line that I was using before, that did not work is below:
String link = tds.get(0).attr("href");
Why does this line return an empty string? I'm assuming it has to do with how I am iterating through the elements as I've selected by "tr". However, I'm not familiar with how Elements vs Element are structured.
Thanks for your help!
Elements is simply an ArrayList<Element>.
The reason you have to write that extra code is that <td> doesn't have an href attribute, so tds.get(0).attr("href"); won't work. You're presumably trying to capture the href from an <a> within the cell. The longer, working code is saying:
For the first cell in the row, get the first element with an href attribute (i.e. a link), and get its href attribute.
Try the following example (with example document) to show how to access the child links more clearly:
Element result = Jsoup.parse("<html><body><table><tr><td><a href=\"http://a.com\" /></td><td>Label1</td></tr><tr><td><a href=\"http://b.com\" /></td><td>Label2</td></tr></table></body></html>");
for (Element element : result.select("tr")) {
if (element.select("tr.header.left").isEmpty()) {
Elements tds = element.select("td");
String link = tds.get(0).getElementsByTag("a").attr("href");
String position = tds.get(1).text();
System.out.println(link + ", " + position);
}
}

Extracting Table Data with JSoup on Yahoo Finance

Trying to practice extracting data from tables using JSoup. Can't figure out why I can't pull the "Shares Outstanding" field from
https://finance.yahoo.com/q/ks?s=AAPL+Key+Statistics
Here are two attempts, where 's' is AAPL:
public class YahooStatistics {
String sharesOutstanding = "Shares Outstanding:";
public YahooStatistics(String s) {
String keyStatisticsURL = ("https://finance.yahoo.com/q/ks?s="+s+"+Key+Statistics");
//Attempt 1
try {
Document doc = Jsoup.connect(keyStatisticsURL).get();
for (Element table : doc.select("table.yfnc_datamodoutline1")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (Element td : tds.select(sharesOutstanding)) {
System.out.println(td.ownText());
}
}
}
}
catch (IOException ex) {
ex.printStackTrace();
}
//Attempt 2
try {
Document doc = Jsoup.connect(keyStatisticsURL).get();
for (Element table : doc.select("table.yfnc_datamodoutline1")) {
for (Element row : table.select("tr")) {
Elements tds = row.select("td");
for (int j = 0; j < tds.size() - 1; j++) {
Element td = tds.get(j);
if ((td.ownText()).equals(sharesOutstanding)) {
System.out.println(tds.get(j+1).ownText());
}
}
}
}
}
catch(IOException ex) {
ex.printStackTrace();
}
}
}
The attempts return: BUILD SUCCESSFUL and nothing else.
I've disabled JavaScript on my browser and the table still shows, so I'm assuming this is not written in JavaScript but HTML.
Any suggestions are appreciated.
Notes about your source after the edit:
1. You should compare ownText() rather than text(). text() gives you the combined text of the element and all its sub-elements. In this case the element contains Shares Outstanding<font size="-1"><sup>5</sup></font>:, so its combined text is "Shares Outstanding5:". If you use ownText() it will just be "Shares Outstanding:".
2. Note the colon (:). Update the value in sharesOutstanding accordingly.
3. You are passing it the wrong URL. There should be a + following the AAPL.
4. Your current query (at least the second attempt) is returning the element twice, because there is a nested table, so it finds the TDs twice.
You can either break from your loops once you found a match, go back to your original version (with corrections as above) - see note - or you can try using a more sophisticated query which will only match once:
Elements elems = doc.select("td.yfnc_tablehead1:containsOwn(" + sharesOutstanding + ") + td.yfnc_tabledata1");
if (!elems.isEmpty()) {
    System.out.println(elems.get(0).ownText());
}
This selector gives you all the td elements whose class is yfnc_tabledata1, whose immediate preceding sibling is a td element whose class is yfnc_tablehead1 and whose own text contains the "Shares Outstanding:" string. This should basically select the exact TD you need.
Note: the previous version of this answer was a long rattle about the difference between Elements.select() and Element.select(). It turns out that I was dead wrong and your original version should have worked, if you had corrected the four points above. So to set the record straight: select() on an Elements actually does look inside each element, and the resulting list may contain descendants of any of the elements in the original list that match the selection. Sorry about that.

Parsing links for href value using JSoup works for a single link, but not for an array of links

I have managed to successfully grab the href links using JSoup. I have also managed to grab the relative value and absolute value of a href for a single link. As shown below:
//works perfectly, website: bbc.co.uk
Document document = Jsoup.connect(url).get();
Element link = document.select("a").last();
String relHref = link.attr("href");
String absHref = link.attr("abs:href");
System.out.println(relHref);
System.out.println(absHref);
//output:
relHref: /help/web/links/
absHref: http://www.bbc.co.uk/help/web/links/
I can even use Element link = document.select("a").first(); and this also works. However, when I try and add this in a loop to iterate through all of the grabbed links and print out each link, it doesn't give me the expected results. Here is my code:
//not working
Elements links = document.select("a");
for(int i=0; i<links.size(); i++){
String relHref = links.attr("href");
String absHref = links.attr("abs:href");
System.out.println(relHref);
System.out.println(absHref);
}
//output
http://m.bbc.co.uk
http://m.bbc.co.uk
http://m.bbc.co.uk
....
I know the links array of type Elements has the correct data, and if I try and print the elements in the links array it displays all of the href tags i.e.
for (Element link : links) {
System.out.println(link);
}
//output 116 links:
mobile site
<img src="http://static.bbci.co.uk/frameworks/barlesque/2.72.5/orb/4/img/bbc-blocks-dark.png" width="84" height="24" alt="BBC">
Skip to content
<a id="orb-accessibility-help" href="/accessibility/">Accessibility Help</a>
....
But how do I get the relHref and absHref for an array to work? Instead my code just prints out the first link over and over again. I've been going at this for hours, so I'm probably making a silly mistake somewhere but help is appreciated!
Thanks.
On this line:
String relHref = links.attr("href");
...how is it supposed to know you're talking about the ith link? (It doesn't: Elements#attr returns the value from the first element in the collection that has the attribute.)
You want
String relHref = links.get(i).attr("href");
...which gets the specific link you're interested in via Elements#get, then uses Node#attr on it.
That said, though, I would just use the enhanced for loop:
for (Element link : document.select("a")) {
String relHref = link.attr("href");
String absHref = link.attr("abs:href");
System.out.println(relHref);
System.out.println(absHref);
}
...unless you need i for something.
You need to use the Elements method, get(int index) inside of your for loop to get each Element held by your Elements.
e.g.,
Elements links = document.select("a");
for(int i=0; i < links.size(); i++) {
Element ele = links.get(i);
/// use ele here to extract info from each Element
}
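For reference, the relHref/absHref distinction in the question comes down to URL resolution: abs:href resolves the attribute value against the document's base URI. The same idea can be sketched with only java.net.URI. The class and method names below are made up for illustration, and this is not jsoup's own implementation.

```java
import java.net.URI;

final class AbsoluteLink {
    // Hypothetical helper: resolves an href against a base URI.
    // Relative hrefs are completed from the base; absolute hrefs are returned unchanged.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        // prints http://www.bbc.co.uk/help/web/links/
        System.out.println(resolve("http://www.bbc.co.uk/", "/help/web/links/"));
        // prints http://m.bbc.co.uk
        System.out.println(resolve("http://www.bbc.co.uk/", "http://m.bbc.co.uk"));
    }
}
```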

How many times a text appears in webpage - Selenium Webdriver

Hi, I would like to count how many times a text, e.g. "VIM LIQUID MARATHI", appears on a page using Selenium WebDriver (Java). Please help.
To check whether a text appears on the page, I have used the following in the main class:
assertEquals(true,isTextPresent("VIM LIQUID MARATHI"));
and a function to return a boolean
protected boolean isTextPresent(String text){
try{
boolean b = driver.getPageSource().contains(text);
System.out.println(b);
return b;
}
catch(Exception e){
return false;
}
}
... but do not know how to count the number of occurrences...
The problem with using getPageSource() is that IDs, class names, or other parts of the markup could match your String without actually appearing on the page. I suggest calling getText() on the body element instead, which returns only the page's visible text, not the HTML. If I'm understanding your question correctly, that is closer to what you are looking for.
// get the text of the body element
WebElement body = driver.findElement(By.tagName("body"));
String bodyText = body.getText();
// count occurrences of the string
int count = 0;
// search for the String within the text
while (bodyText.contains("VIM LIQUID MARATHI")){
// when match is found, increment the count
count++;
// continue searching from where you left off
bodyText = bodyText.substring(bodyText.indexOf("VIM LIQUID MARATHI") + "VIM LIQUID MARATHI".length());
}
System.out.println(count);
The variable count contains the number of occurrences.
There are two different ways to do this:
int size = driver.findElements(By.xpath("//*[text()='text to match']")).size();
This will tell the driver to find all of the elements that have the text, and then output the size.
The second way is to search the HTML, like you said.
int size = driver.getPageSource().split("text to match").length-1;
This gets the page source, splits the string wherever it finds the match, then counts the number of splits it made.
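One caveat with the split approach: String.split treats its argument as a regular expression and drops trailing empty strings, so the count can come out too low when the text ends with a match. A minimal sketch of a plain indexOf-based counter that avoids both issues (the class and method names are hypothetical, and the sample text is invented):

```java
final class OccurrenceCounter {
    // Counts non-overlapping occurrences of needle in haystack, without regex.
    static int count(String haystack, String needle) {
        if (needle.isEmpty()) {
            return 0; // avoid an endless loop on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) != -1) {
            count++;
            from += needle.length(); // continue after the match just found
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "VIM LIQUID MARATHI, then VIM LIQUID MARATHI";
        System.out.println(count(text, "VIM LIQUID MARATHI")); // prints 2
        // split-based count: undercounts here because the text ends with a match
        System.out.println(text.split("VIM LIQUID MARATHI").length - 1);
    }
}
```

With WebDriver, the haystack would be the string returned by getText() on the body element or by getPageSource().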
You can try to execute javascript expression using webdriver:
((JavascriptExecutor)driver).executeScript("yourScript();");
If you are using jQuery on your page you can use jQuery's selectors:
((JavascriptExecutor)driver).executeScript("return jQuery([proper selector]).length");
[proper selector] - this should be a selector that matches the text you are searching for.
If the text appears inside links, try
int size = driver.findElements(By.partialLinkText("VIM LIQUID MARATHI")).size();
