Scrape youtube href using jsoup - java

I'm using jsoup in Java and I'm trying to scrape the first href from a particular YouTube video search. However, I can't figure out the correct CSS query to obtain the href. If someone can point me in the right direction, that'd be great. Here is the image of the HTML I'm trying to scrape on YouTube.
The following is one of the selectors I've tried, but it doesn't print anything.
My code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import java.io.IOException;

public class WebTest
{
    public static void main(String[] args)
    {
        try {
            Document doc = Jsoup.connect("https://www.youtube.com/results?search_query=childish+gambino+this+is+america").get();
            Elements musicVideoLink = doc.select("h3.title-and-badge.style-scope.ytd-video-renderer a[href]");
            String linkh = musicVideoLink.attr("href");
            System.out.println(linkh);
        }
        catch (IOException ex) {
            // Report the failure instead of silently swallowing it
            ex.printStackTrace();
        }
    }
}

With Jsoup.connect(...).get(), the request carries no extra headers such as User-Agent, so YouTube returns a fairly basic HTML rendering of the search results. Its structure is quite different from the one in the linked image above, but it is actually easier to select from:
Elements musicVideoLink = doc.select("h3.yt-lockup-title a");
This looks like the easiest solution here. If you do pass a User-Agent header, you get back the same response that the Network tab in the browser inspector shows, but this still doesn't match the structure in that image. The browser evidently does some AJAX-style processing and rendering on that response afterwards.

Related

Scraping youtube comments with Java Jsoup

I'm learning Java JSoup, and I want to scrape the comments and the names of the people commenting from a YouTube video.
I chose an arbitrary YouTube video and inspected the elements of interest. I've looked at https://jsoup.org/cookbook/extracting-data/selector-syntax and Why Exception Raised while Retrieving the Youtube Elements using iterator in Jsoup?, but I don't really understand how to use the syntax.
Currently, the output of my code is two empty lists. I want the output to be one list with the comments and another list with the names of the commentators.
Thanks for any help!
import java.io.IOException;
import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FirstJsoupExample {
    public static void main(String[] args) throws IOException {
        Document page = Jsoup.connect("https://www.youtube.com/watch?v=C33Rw0AA3aU").get();

        // Comments
        Elements Comments = page.select("yt-formatted-string[class=style-scope ytd-comment-renderer]");
        ArrayList<String> CommentsList = new ArrayList<String>();
        for (Element comment : Comments) {
            CommentsList.add("Comment: " + comment.text());
        }

        // Commentators
        Elements Comentators = page.select("span[class= style-scope ytd-comment-renderer]");
        ArrayList<String> ComentatorList = new ArrayList<String>();
        for (Element comentator : Comentators) {
            ComentatorList.add("Comentator: " + comentator.text());
        }

        System.out.println(ComentatorList);
        System.out.println(CommentsList);
    }
}
The comments are not in the HTML file. YouTube uses JavaScript to load the comments, but JSoup can only read the HTML file; it does not execute scripts.
Your web browser's developer tools show you what is currently in the rendered page, which may be different from what is in the HTML file.
To view the HTML file, you can open the Youtube page in your browser then right-click and choose 'View Page Source', or go to this URL:
view-source:https://www.youtube.com/watch?v=C33Rw0AA3aU
Then you will be able to confirm that the source does not contain yt-formatted-string or ytd-comment-renderer.
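The same check can be done programmatically. The following is only a minimal sketch: it assumes you already have the raw page source as a String (for example from a saved copy of the page, or from jsoup's doc.outerHtml()), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of jsoup: reports which markers are absent from raw HTML.
class SourceMarkerCheck {

    /** Returns the markers that do NOT occur anywhere in the raw HTML text. */
    static List<String> missingMarkers(String rawHtml, List<String> markers) {
        List<String> missing = new ArrayList<>();
        for (String marker : markers) {
            if (!rawHtml.contains(marker)) {
                missing.add(marker);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in for the real page source; replace with the HTML you actually fetched.
        String rawHtml = "<html><body><div id=\"content\"></div></body></html>";
        List<String> markers = Arrays.asList("yt-formatted-string", "ytd-comment-renderer");
        System.out.println("Not in source: " + missingMarkers(rawHtml, markers));
    }
}
```

If a marker you see in the developer tools is reported as missing from the source, the element is most likely added by JavaScript after the page loads.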
YouTube probably does this for two reasons:
To make the page load faster (the comments are only loaded if you scroll down to view them)
To prevent people from scraping their website :)
My suggestion is to choose a different website to learn JSoup with.
I confirmed that the selectors below DO work if you:
Open the Youtube page in a web browser
Scroll down so the comments load
Open your web browser developer tools
Note that if you use the class= form, the class to select must be in quotes.
Comments:
document.querySelectorAll("yt-formatted-string[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("yt-formatted-string.style-scope.ytd-comment-renderer");
Commentators:
document.querySelectorAll("span[class=\"style-scope ytd-comment-renderer\"]");
//or
document.querySelectorAll("span.style-scope.ytd-comment-renderer");

Extract href from https site with accept/reject page to enter

On this site: https://services.cds.ca/applications/taxforms/taxforms.nsf/Pages/-EN-LimitedPartnershipsandIncomeTrusts?Open
If you click through on: Display tax information for year 2015, Click Accept, you will arrive at: https://services.cds.ca/applications/taxforms/taxforms.nsf/PROCESSED-EN-?OpenView&Start=1&Count=3000&RestrictToCategory=All-2015
The end goal is to get all the Excel file hrefs linked on this page.
Using the JSoup library, I have been able to read in HTML and find hrefs on a number of different websites, but I am encountering some issues when trying to apply it to this more complicated webpage.
If anyone could point me in the right direction for some reference material on what is hanging me up here, or provide an example, it would be greatly appreciated.
Here is example code that I had been using for other sites; it does not seem to work for grabbing the HTML from this webpage:
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class URLReader {
    public static void main(String[] args) throws IOException {
        try {
            Document doc = Jsoup.connect("https://www.google.com/").get();
            Elements links = doc.select("a");
            for (Element e : links) {
                System.out.println(e.attr("abs:href"));
            }
        }
        catch (IOException ex) {
            System.out.println(ex.getMessage());
        }
    }
}
However, when I put the CDS URL mentioned at the top in place of the Google URL, the program hangs during execution and eventually ends up in the catch block with a "Connection reset" error message.
Also, in the HTML of the CDS website linked above, I see some JavaScript:
if (document.referrer.indexOf("/applications/taxforms/taxforms.nsf/Pages/-EN-agree?Open") <= 0 ) location.href = "/applications/taxforms/taxforms.nsf/Pages/-EN-agree?Open&OpenView&Start=1&Count=3000&RestrictToCategory=All-2015";
This throws you back to the Accept/Reject disclaimer page that precedes this page. I am wondering whether some form post or data passing is needed to get me past this, if this is what is causing the issue.
Thanks!

Java Jsoup iterate through drop downs and scrape dynamically added data

Using JSoup and Java, I want to get data that is added dynamically when an option is selected in a dropdown list. An example that better shows what I am trying to articulate is http://www.bulletin.uga.edu/CoursesHome.aspx. Each option in the "by prefix/major" dropdown dynamically creates a second dropdown that lists all the courses, plus an "all courses" option. When you select a course, it dynamically adds all the course info. If you select "all courses", it adds the data for every course in that major.
I can get all the list values. Here is my code so far. I just don't know how to use the values to load all the data and iterate through it all.
package getInfo;

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class getInfo {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.bulletin.uga.edu/CoursesHome.aspx").get();
            org.jsoup.select.Elements links = doc.select("option");
            for (Element e : links)
            {
                //System.out.println(e);
                //System.out.println(e.text());
                System.out.println(e.attr("value"));
            }
        } catch (IOException ex) {
            Logger.getLogger(getInfo.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}
This returns a list of all the dropdown lists' values.
JSoup is not the best bet here. JSoup is primarily an HTML parser. Although it offers some helpful methods for fetching content, it's not a headless browser.
I suggest using Selenium here. It handles dynamically added data easily.

Alert if webpage has been updated

I am creating an app in Java that checks whether a webpage has been updated.
However, some webpages don't have a "Last-Modified" header.
I even tried checking for a change in content length, but this method is not reliable: sometimes the content length changes without any modification to the webpage, giving a false alarm.
I really need some help here, as I am not able to think of a single foolproof method.
Any ideas?
Polling the page in a loop and comparing a selected part of it against the previously seen value can help, as in this code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Main {
    public static void main(String[] args) throws InterruptedException {
        // Last seen text of the watched element; empty until the first successful load
        String updatecheck = "";
        // Constantly trying to load page
        while (true) {
            try {
                System.out.println("Loading page...");
                // Connecting to a website with Jsoup ("URL" is a placeholder)
                Document doc = Jsoup.connect("URL").userAgent("CHROME").get();
                // Selecting a part of this website with Jsoup
                Element selected = doc.select("div.selection").first();
                String pick = (selected == null) ? "" : selected.text();
                // Compare with equals(); != on Strings compares references, not content
                if (!updatecheck.equals(pick)) {
                    updatecheck = pick;
                    System.out.println("Page is changed.");
                }
            } catch (Exception e) {
                e.printStackTrace();
                System.out.println("Exception occurred.... going to retry... \n");
            }
            // Pause between requests (the interval is an arbitrary choice)
            Thread.sleep(60_000);
        }
    }
}
How to get notified after a webpage changes instead of refreshing?
Probably the most reliable option would be to store a hash of the page content.
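A minimal sketch of the hashing idea, using only the standard library. It assumes the page content (or just the part you care about) is already available as a String; the class and method names are hypothetical, and SHA-256 is simply one reasonable choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: fingerprints page content so only the hash needs to be stored.
class PageFingerprint {

    /** Returns the SHA-256 digest of the given text as lowercase hex. */
    static String sha256Hex(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    /** True when the current content hashes differently from the stored hash. */
    static boolean hasChanged(String previousHash, String currentContent) {
        return !sha256Hex(currentContent).equals(previousHash);
    }

    public static void main(String[] args) {
        // Stand-in strings; in practice these would come from two fetches of the page.
        String stored = sha256Hex("<div class=\"selection\">old text</div>");
        System.out.println(hasChanged(stored, "<div class=\"selection\">old text</div>"));
        System.out.println(hasChanged(stored, "<div class=\"selection\">new text</div>"));
    }
}
```

Hashing the whole page will still raise false alarms on pages that embed timestamps, ads or session tokens, so it usually works better to hash only the extracted text of the section you care about.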
If the Content-Length changes, then the webpages you are trying to check are probably dynamically generated rather than static. In that case, even the Last-Modified header won't reflect changes in content in most cases.
I guess the only solution is a page-specific one. For one page you could parse it and look for content changes in certain parts, for another you could check the Last-Modified header, and for others you would have to rely on the content length. In my opinion there is no way to do it uniformly for all pages on the internet. Another option would be to ask the people developing the pages you are checking for some markers that help you determine whether a page has changed, but that of course depends on your specific use case.

To identify links regarding the Press Release pages alone

My task is to find the actual press release links on a given page, for example http://www.apple.com/pr/.
My tool has to find only the press release links from the above URL, excluding advertisement links, tab links and any other links found on that site.
The program below currently returns all the links that are present on the given webpage.
How can I modify it to find only the press release links from a given URL?
Also, I want the program to be generic, so that it identifies press release links on any press release page it is given.
import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.sql.*;
import org.jsoup.nodes.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class linksfind {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://www.apple.com/pr/");
            Document document = Jsoup.parse(url, 1000); // Can also take an URL.
            for (Element element : document.getElementsByTag("a")) {
                System.out.println(element.attr("href"));
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
I don't think there is any definitive way to achieve this. You can build a set of likely keywords such as 'press', 'release' and 'pr', and match the URLs against them with a regex. The correctness of this depends on how comprehensive your set of keywords is.
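As a rough sketch of that keyword approach (the keyword list, class name and method names here are illustrative assumptions, not a tested heuristic):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical filter: keeps URLs that contain press-release-like keywords.
class PressLinkFilter {

    // Assumed keyword set. "pr" must appear as its own path segment or token,
    // so it does not match "/products/". Note that "press" still matches
    // inside words such as "express", which is a limitation of this approach.
    private static final Pattern KEYWORDS = Pattern.compile(
            "(?i)(press|release|newsroom|(?:^|[/_.-])pr(?:[/_.-]|$))");

    static boolean looksLikePressRelease(String url) {
        return KEYWORDS.matcher(url).find();
    }

    static List<String> filter(List<String> urls) {
        List<String> matches = new ArrayList<>();
        for (String url : urls) {
            if (looksLikePressRelease(url)) {
                matches.add(url);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList(
                "/pr/library/2010/07/28safari.html", "/iphone/", "/products/ipad/");
        System.out.println(filter(hrefs));
    }
}
```

The hrefs passed to filter would be the values already printed by the program in the question.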
Look at the site today. Cache to a file whatever links you saw. Look at the site tomorrow; any new links are links to news articles, maybe? You'll get incorrect results - once - any time they change the rest of the page around you.
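That caching idea could be sketched roughly as follows. The one-link-per-line cache file format and all names are assumptions made for illustration; the links themselves would come from whatever scraper you already have.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: remembers links seen on earlier runs and reports the new ones.
class NewLinkDetector {

    /** Returns the links in 'current' that were not seen before, in page order. */
    static Set<String> newLinks(Set<String> previouslySeen, List<String> current) {
        Set<String> fresh = new LinkedHashSet<>(current);
        fresh.removeAll(previouslySeen);
        return fresh;
    }

    /** Loads the cached links (one per line), or an empty set on the first run. */
    static Set<String> loadSeen(Path cacheFile) throws IOException {
        if (!Files.exists(cacheFile)) {
            return new LinkedHashSet<>();
        }
        return new LinkedHashSet<>(Files.readAllLines(cacheFile));
    }

    /** Overwrites the cache with the given links, one per line. */
    static void saveSeen(Path cacheFile, Set<String> links) throws IOException {
        Files.write(cacheFile, links);
    }
}
```

On each run you would call loadSeen, compute newLinks against today's hrefs, then save the union of both sets back to the cache file.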
You could, you know, just use the RSS feed provided, which is designed to do exactly what you're asking for.
Look at the HTML source code. Open the page in a normal web browser, right-click and choose View Source. You have to find a path in the HTML document tree that uniquely identifies those links.
They are all housed in a <ul class="stories"> element inside a <div id="releases"> element. The appropriate CSS selector would then be "div#releases ul.stories a".
Here's what it should look like:
public static void main(String... args) throws Exception {
    URL url = new URL("http://www.apple.com/pr/");
    Document document = Jsoup.parse(url, 3000);
    for (Element element : document.select("div#releases ul.stories a")) {
        System.out.println(element.attr("href"));
    }
}
As of now, this yields exactly what you want:
/pr/library/2010/07/28safari.html
/pr/library/2010/07/27imac.html
/pr/library/2010/07/27macpro.html
/pr/library/2010/07/27display.html
/pr/library/2010/07/26iphone.html
/pr/library/2010/07/23iphonestatement.html
/pr/library/2010/07/20results.html
/pr/library/2010/07/19ipad.html
/pr/library/2010/07/19alert_results.html
/pr/library/2010/07/02appleletter.html
/pr/library/2010/06/28iphone.html
/pr/library/2010/06/23iphonestatement.html
/pr/library/2010/06/22ipad.html
/pr/library/2010/06/16iphone.html
/pr/library/2010/06/15applestoreapp.html
/pr/library/2010/06/15macmini.html
/pr/library/2010/06/07iphone.html
/pr/library/2010/06/07iads.html
/pr/library/2010/06/07safari.html
To learn more about CSS selectors, read the Jsoup manual and the W3 CSS selector spec.
You need to find some attribute which defines a "press release link". In the case of that site, pointing to "/pr/library/" indicates that it's an Apple press release.
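A minimal sketch of filtering on that attribute, assuming the hrefs have already been extracted as strings. The class and method names are hypothetical, and the "/pr/library/" prefix is specific to the Apple page discussed here.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: keeps only links whose path points into /pr/library/.
class ApplePressLinks {

    /** Resolves each href against the base URL and keeps those under /pr/library/. */
    static List<String> pressReleaseUrls(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            // URI.create throws IllegalArgumentException on malformed hrefs;
            // real pages may need extra handling for those.
            URI resolved = base.resolve(href);
            String path = resolved.getPath(); // null for opaque URIs such as mailto:
            if (path != null && path.startsWith("/pr/library/")) {
                result.add(resolved.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> hrefs = Arrays.asList("/pr/library/2010/07/28safari.html", "/iphone/");
        System.out.println(pressReleaseUrls("http://www.apple.com/pr/", hrefs));
    }
}
```

For other sites the distinguishing attribute will differ, so the prefix would have to be chosen per site.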
