Identifying only the press release links on a page - Java

My task is to find the actual press release links on a given page, say http://www.apple.com/pr/ for example.
My tool has to find only the press release links from the above URL, excluding the advertisement links, tab links (or whatever) that are also found on that site.
The program below is what I have developed, and the result it gives is all the links present in the given web page.
How can I modify the program below to find only the press release links from a given URL?
I also want the program to be generic, so that it identifies press release links from any press release URL it is given.
import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class linksfind {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.apple.com/pr/");
            // Jsoup can parse straight from a URL, with a timeout in milliseconds.
            Document document = Jsoup.parse(url, 1000);
            for (Element element : document.getElementsByTag("a")) {
                System.out.println(element.attr("href"));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

I don't think there is any definitive way to achieve this. You can build a set of likely keywords such as 'press', 'release' and 'pr', and match the URLs against those keywords with a regex. How correct this is will depend on how comprehensive your keyword set is.
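A minimal sketch of that idea, reusing the link-extraction loop from the question's code; the keyword pattern here is only illustrative and would need to grow with the sites you target:

import java.net.URL;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class KeywordLinkFilter {
    // Illustrative keyword set; extend it as needed.
    private static final Pattern PRESS_KEYWORDS =
            Pattern.compile("press|release|/pr/|news", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.apple.com/pr/");
        Document document = Jsoup.parse(url, 3000);
        for (Element element : document.getElementsByTag("a")) {
            String href = element.attr("href");
            // Keep only links whose URL mentions one of the keywords.
            if (PRESS_KEYWORDS.matcher(href).find()) {
                System.out.println(href);
            }
        }
    }
}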

Look at the site today and cache to a file whatever links you saw. Look at the site tomorrow; any new links are presumably links to news articles. You'll get incorrect results (once) any time they change the rest of the page around you.
You could, you know, just use the RSS feed provided, which is designed to do exactly what you're asking for.
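If there is a feed, that is the most robust route. A minimal sketch, assuming a hypothetical feed address such as https://www.apple.com/pr/feeds/pr.rss; the real URL should be taken from the page's <link rel="alternate"> tag. Jsoup can parse the XML directly:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class PressReleaseFeed {
    public static void main(String[] args) throws Exception {
        // Hypothetical feed URL; discover the real one from the page markup.
        String feedUrl = "https://www.apple.com/pr/feeds/pr.rss";
        // Parse as XML so the RSS <link> elements keep their text content.
        Document feed = Jsoup.connect(feedUrl).parser(Parser.xmlParser()).get();
        for (Element item : feed.select("item")) {
            System.out.println(item.select("title").text() + " -> " + item.select("link").text());
        }
    }
}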

Look at the HTML source code. Open the page in a normal web browser, right-click and choose View Source. You have to find a path in the HTML document tree that uniquely identifies those links.
They are all housed in a <ul class="stories"> element inside a <div id="releases"> element. The appropriate CSS selector would then be "div#releases ul.stories a".
Here's how it should look:
public static void main(String... args) throws Exception {
    URL url = new URL("http://www.apple.com/pr/");
    Document document = Jsoup.parse(url, 3000);
    for (Element element : document.select("div#releases ul.stories a")) {
        System.out.println(element.attr("href"));
    }
}
This yields as of now, exactly what you want:
/pr/library/2010/07/28safari.html
/pr/library/2010/07/27imac.html
/pr/library/2010/07/27macpro.html
/pr/library/2010/07/27display.html
/pr/library/2010/07/26iphone.html
/pr/library/2010/07/23iphonestatement.html
/pr/library/2010/07/20results.html
/pr/library/2010/07/19ipad.html
/pr/library/2010/07/19alert_results.html
/pr/library/2010/07/02appleletter.html
/pr/library/2010/06/28iphone.html
/pr/library/2010/06/23iphonestatement.html
/pr/library/2010/06/22ipad.html
/pr/library/2010/06/16iphone.html
/pr/library/2010/06/15applestoreapp.html
/pr/library/2010/06/15macmini.html
/pr/library/2010/06/07iphone.html
/pr/library/2010/06/07iads.html
/pr/library/2010/06/07safari.html
To learn more about CSS selectors, read the Jsoup manual and the W3 CSS selector spec.

You need to find some attribute that defines a "press release link". In the case of that site, an href pointing under "/pr/library/" indicates that it's an Apple press release.
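A minimal sketch of that approach, reusing the question's loop and assuming "/pr/library/" as the defining attribute; any other site would need its own prefix or pattern:

import java.net.URL;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PrefixLinkFilter {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.apple.com/pr/");
        Document document = Jsoup.parse(url, 3000);
        for (Element element : document.getElementsByTag("a")) {
            // "abs:href" resolves relative links like /pr/library/... against the base URL.
            String href = element.attr("abs:href");
            if (href.contains("/pr/library/")) {
                System.out.println(href);
            }
        }
    }
}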

Related

Scraping youtube comments with Java Jsoup

I'm learning Jsoup in Java, and I want to scrape the comments, and the names of the people commenting, from a YouTube video.
I chose an arbitrary YouTube video and inspected the elements of interest. I've looked at https://jsoup.org/cookbook/extracting-data/selector-syntax and "Why Exception Raised while Retrieving the Youtube Elements using iterator in Jsoup?", but I don't really understand how to use the syntax.
Currently, the output of my code is two empty lists. I want the output to be one list with the comments, and the other list with the names of the commentators.
Thanks for any help!
import java.io.IOException;
import java.util.ArrayList;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FirstJsoupExample {
    public static void main(String[] args) throws IOException {
        Document page = Jsoup.connect("https://www.youtube.com/watch?v=C33Rw0AA3aU").get();

        // Comments
        Elements comments = page.select("yt-formatted-string[class=style-scope ytd-comment-renderer]");
        ArrayList<String> commentsList = new ArrayList<String>();
        for (Element comment : comments) {
            commentsList.add("Comment: " + comment.text());
        }

        // Commentators
        Elements commentators = page.select("span[class= style-scope ytd-comment-renderer]");
        ArrayList<String> commentatorList = new ArrayList<String>();
        for (Element commentator : commentators) {
            commentatorList.add("Commentator: " + commentator.text());
        }

        System.out.println(commentatorList);
        System.out.println(commentsList);
    }
}
The comments are not in the HTML file. YouTube uses JavaScript to load the comments, but Jsoup can only read the HTML file.
Your web browser's developer tools show you what is currently in the web page, which may be different from what is in the HTML file.
To view the HTML file, you can open the YouTube page in your browser, then right-click and choose 'View Page Source', or go to this URL:
view-source:https://www.youtube.com/watch?v=C33Rw0AA3aU
Then you will be able to confirm that the source does not contain yt-formatted-string or ytd-comment-renderer.
YouTube probably does this for two reasons:
To make the page load faster (the comments are only loaded if you scroll down to view them)
To prevent people from scraping their website :)
My suggestion is to choose a different website to learn Jsoup with.
I confirmed that the selectors below DO work if you:
Open the YouTube page in a web browser
Scroll down so the comments load
Open your web browser developer tools
Note that if you use the class= form, the class to select must be in quotes.
Comments:
document.querySelectorAll("yt-formatted-string[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("yt-formatted-string.style-scope.ytd-comment-renderer");
Commentators:
document.querySelectorAll("span[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("span.style-scope.ytd-comment-renderer");

Scrape youtube href using jsoup

I'm using jsoup in Java and I'm trying to scrape the first href in a particular YouTube video search. However, I can't figure out the correct CSS query to obtain the href. If someone can point me in the right direction, that'd be great. Here is the image of the HTML I'm trying to scrape on YouTube.
The following is one of the selects I've tried, but it doesn't print out anything.
My code:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class WebTest {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("https://www.youtube.com/results?search_query=childish+gambino+this+is+america").get();
            Elements musicVideoLink = doc.select("h3.title-and-badge.style-scope.ytd-video-renderer a[href]");
            String linkh = musicVideoLink.attr("href");
            System.out.println(linkh);
        }
        catch (IOException ex) { }
    }
}
With Jsoup.connect().get(), because there are no extra headers in the request such as User-Agent, YouTube returns quite a basic HTML rendering of the search results. This is quite different from the structure in the linked image above, but it is actually easier to select in:
Elements musicVideoLink = doc.select("h3.yt-lockup-title a");
This looks like the easiest solution here. If you do pass in the User-Agent header, you get back the same as the Network tab in the browser inspector shows, but this doesn't yet match the result in that image. The browser clearly does a bit of AJAX-style processing and rendering on that response next.
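Put together, a minimal sketch of that approach; the selector is the one from this answer, and whether it still matches depends on the HTML YouTube currently serves to header-less clients:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FirstResultLink {
    public static void main(String[] args) throws IOException {
        Document doc = Jsoup.connect(
                "https://www.youtube.com/results?search_query=childish+gambino+this+is+america").get();
        // The basic (header-less) rendering wraps each result title in h3.yt-lockup-title.
        Element firstResult = doc.selectFirst("h3.yt-lockup-title a");
        if (firstResult != null) {
            // abs:href resolves the relative /watch?v=... link against youtube.com.
            System.out.println(firstResult.attr("abs:href"));
        }
    }
}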

I find the Register button using XPath but it shows 2 matching nodes. How do I uniquely identify the Register button?

Snap of the DOM here; two matching nodes are displayed for the XPath:
.//*[@id='header']/div/div[2]/div/a[2]
You can try this XPath: //div[@id='header'][not(@class)]//div[@class='right-side']/div/a[contains(.,'Register')]
There are two almost identical div containers. The only difference is that the relevant container does not have a class attribute, hence the not() part.
Or you can use an XPath with an index: (//div[@class='right-side']/div/a[contains(.,'Register')])[1]
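For reference, a minimal sketch of how either of those XPaths would be used from Selenium in Java; the site URL is taken from another answer further below, and chromedriver is assumed to be available on the machine:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

public class RegisterButtonClick {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("http://upamile.com");
        // Relative XPath that skips the duplicate header container (the one with a class attribute).
        WebElement register = driver.findElement(By.xpath(
                "//div[@id='header'][not(@class)]//div[@class='right-side']/div/a[contains(.,'Register')]"));
        register.click();
    }
}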
No need to find a unique identifier for this; you can use findElements and click by index.
driver.findElements(By.xpath("//*[@id='header']/div/div[2]/div/a[2]")).get(index).click();
OR
You can use a CSS selector and locate the relevant element with :nth-child(index).
In your case:
driver.findElement(By.cssSelector("#header:nth-child(index) a.button.border:nth-child(1)")).click();
There are more ways to locate elements using CSS selectors, and I suggest reading up on them.
When you inspect an element in the browser you can choose to copy its CSS selector or XPath; this option will give you a unique locator.
A quick and dirty solution: (//*[@id='header']/div/div[2]/div/a[2])[1] for the first one, or (//*[@id='header']/div/div[2]/div/a[2])[2] for the second one. But really you should practice writing more relative XPaths and not just take what the plugins give you.
Don't go for the complicated XPaths.
This would work fine: By.xpath("(//a[@href='/Account/Register'])[1]")
I hope the below code helps you.
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class Upmile {
    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get("http://upamile.com");
        // This website shows a dialog box at first;
        // we can skip that dialog by clicking on the body.
        driver.findElement(By.tagName("body")).click();
        Thread.sleep(2000);
        driver.findElement(By.xpath("(//a[@href='/Account/Register'])[1]")).click();
        System.out.println("Test ran successfully");
    }
}

Cannot pick a random search result using Selenium with java

As a novice to Selenium, I am trying to automate a shopping site with Selenium WebDriver in Java. My scenario: when I search with a keyword and get results, I should be able to pick any one of the results randomly. But I am unable to pick a random search result; I either get a "No such element" error, or, when I try to click the same result every time, the search results seem to vary from run to run. Please help on how to proceed further.
Here is the code:
package newPackage;

import java.util.concurrent.TimeUnit;

import org.openqa.selenium.*;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.FluentWait;
import org.openqa.selenium.support.ui.Wait;

public class flipKart {
    public static void main(String[] args) throws InterruptedException {
        System.setProperty("webdriver.chrome.driver", "C:\\chromedriver.exe");
        WebDriver dr = new ChromeDriver();
        dr.get("http://m.barnesandnoble.com/");
        dr.manage().window().maximize();
        dr.findElement(By.xpath(".//*[@id='search_icon']")).click();
        dr.findElement(By.xpath(".//*[@id='sk_mobContentSearchInput']")).sendKeys("Golden Book");
        dr.findElement(By.xpath(".//*[@id='sk_mobContentSearchInput']")).sendKeys(Keys.ENTER);
        dr.findElement(By.xpath(".//*[@id='skMob_productDetails_prd9780735217034']/div/div")).click();
        dr.findElement(By.xpath(".//*[@id='pdpAddtoBagBtn']")).click();
    }
}
You should write a method that waits for the visibility of the element that needs to be clicked.
An explicit wait (WebDriverWait with ExpectedConditions.visibilityOfElementLocated) is the usual way to do this.
Hard to answer with your info, but these tips may help (a sketch combining them follows below):
If you're getting "no such element", verify that the CSS selector or XPath you are using is correct. Firefox's Firebug/Firefinder is an excellent tool for this; it will highlight the element your selector points to.
If your selector is correct, make sure you are using findElements(By...) and not findElement(By...):
the plural version returns a list of web elements, from which you can then pull a random element to click on.
Use an intelligent wait to make sure the elements have loaded on the page. Sometimes Selenium will try to interact with elements before they appear. The Selenium API has plenty of methods to help here, but if you're just debugging, a quick Thread.sleep(5000) when you load the page will work.
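A minimal sketch of those tips combined. It assumes the search-result tiles on the mobile Barnes & Noble site can be matched with a selector like div[id^='skMob_productDetails'], which is only inferred from the id in the question's code; the real structure may differ:

import java.time.Duration;
import java.util.List;
import java.util.Random;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class RandomResultPicker {
    public static void main(String[] args) {
        WebDriver dr = new ChromeDriver();
        try {
            dr.get("http://m.barnesandnoble.com/");
            // ... perform the search exactly as in the question ...

            // Explicit wait (Selenium 4 constructor; older versions take a plain seconds value).
            WebDriverWait wait = new WebDriverWait(dr, Duration.ofSeconds(10));
            List<WebElement> results = wait.until(
                    ExpectedConditions.presenceOfAllElementsLocatedBy(
                            By.cssSelector("div[id^='skMob_productDetails']")));

            // Pick one result at random instead of hard-coding a product id.
            WebElement randomResult = results.get(new Random().nextInt(results.size()));
            randomResult.click();
        } finally {
            dr.quit();
        }
    }
}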

Extract href from https site with accept/reject page to enter

On this site: https://services.cds.ca/applications/taxforms/taxforms.nsf/Pages/-EN-LimitedPartnershipsandIncomeTrusts?Open
If you click through on "Display tax information for year 2015" and click Accept, you will arrive at: https://services.cds.ca/applications/taxforms/taxforms.nsf/PROCESSED-EN-?OpenView&Start=1&Count=3000&RestrictToCategory=All-2015
The end goal is to get all the Excel file hrefs linked on this page.
Using the Jsoup library, I have been able to read in HTML and find hrefs on a number of different websites, but I'm encountering some issues when trying to apply it to this more complicated web page.
If anyone could point me in the right direction for some reference material on what is hanging me up here or provide an example, it would be greatly appreciated.
Example code of what I had been using for other sites that does not seem to work for grabbing the HTML from this webpage is:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class URLReader {
    public static void main(String[] args) throws IOException {
        try {
            Document doc = Jsoup.connect("https://www.google.com/").get();
            Elements links = doc.select("a");
            for (Element e : links) {
                System.out.println(e.attr("abs:href"));
            }
        }
        catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
However, when I substitute the CDS URL mentioned at the top for the Google one, the program hangs at execution and eventually ends with a "Connection reset" error in the catch block.
Also, in the HTML of the CDS website linked above, I see some JavaScript:
if (document.referrer.indexOf("/applications/taxforms/taxforms.nsf/Pages/-EN-agree?Open") <= 0 ) location.href = "/applications/taxforms/taxforms.nsf/Pages/-EN-agree?Open&OpenView&Start=1&Count=3000&RestrictToCategory=All-2015";
This throws you back to the Accept/Reject disclaimer page that precedes this page. I'm wondering whether some form post or data passing is needed to get me past this, and whether that is what is causing my issue?
Thanks!
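One thing worth trying, as a sketch only: the document.referrer check above is client-side JavaScript, so it cannot by itself block Jsoup (which never runs scripts), but the server may still expect the cookies and headers a browser would send, and a "Connection reset" can come from bare, header-less requests being rejected. The sketch fetches the agree page first, carries its cookies forward, and sends a Referer and User-Agent with the second request; whether this actually gets past the disclaimer depends on what the server checks, and the .xls/.xlsx filter is an assumption about the link targets:

import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CdsTaxForms {
    public static void main(String[] args) throws IOException {
        String agreeUrl = "https://services.cds.ca/applications/taxforms/taxforms.nsf/Pages/-EN-agree?Open";
        String listUrl = "https://services.cds.ca/applications/taxforms/taxforms.nsf/"
                + "PROCESSED-EN-?OpenView&Start=1&Count=3000&RestrictToCategory=All-2015";

        // First hit the agree page to pick up any session cookies it sets.
        Connection.Response agree = Jsoup.connect(agreeUrl)
                .userAgent("Mozilla/5.0")
                .execute();
        Map<String, String> cookies = agree.cookies();

        // Then request the listing page with those cookies and a Referer header.
        Document doc = Jsoup.connect(listUrl)
                .userAgent("Mozilla/5.0")
                .referrer(agreeUrl)
                .cookies(cookies)
                .get();

        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href");
            if (href.toLowerCase().endsWith(".xls") || href.toLowerCase().endsWith(".xlsx")) {
                System.out.println(href);
            }
        }
    }
}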
