I am trying to web scrape https://www.nba.com/standings#/
Here is my code.
What I am trying to use is page.getByXPath("//caption[@class='standings__header']/span"), which should pull back "Eastern Conference" and "Western Conference", but instead it pulls back nothing. I don't understand. Is my XPath wrong?
package Standings;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Standings {

    private static final String baseUrl = "https://www.nba.com/standings#/";

    public static void main(String[] args) {
        WebClient client = new WebClient();
        client.getOptions().setJavaScriptEnabled(false);
        client.getOptions().setCssEnabled(false);
        client.getOptions().setUseInsecureSSL(true);

        String jsonString = "";
        ObjectMapper mapper = new ObjectMapper();

        try {
            HtmlPage page = client.getPage(baseUrl);
            System.out.println(page.asXml());
            List<?> conferences = page.getByXPath("//caption[@class='standings__header']/span");
            System.out.println(conferences); // prints an empty list
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I have used this code to verify your problem:
public static void main(String[] args) throws IOException {
    final String url = "https://www.nba.com/standings#/";

    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);

        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(10000);
        System.out.println(page.asXml());
    }
}
When running this I got a bunch of warnings and errors in the log.
(BTW: the page also produces many errors/warnings when running in real browsers. It seems the maintainer of the page has an interesting view on quality.)
I guess the problematic error is this one:
TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)
There is a known bug in the JavaScript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the bug is thrown from main.js, I guess it stops the processing of the page's JavaScript before the content you are looking for is generated.
So far I have found no time to fix this (it looks like it has to be fixed in Rhino), but it is on the list.
Have a look at https://twitter.com/HtmlUnit to get informed about updates.
The page you are trying to scrape needs JavaScript to display properly. If you disable it, most of the elements won't load.
Changing the line
client.getOptions().setJavaScriptEnabled(false);
to
client.getOptions().setJavaScriptEnabled(true);
should do the trick
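For reference, a minimal sketch of the whole flow with JavaScript enabled plus a wait for the background scripts (whether the spans actually show up still depends on the Rhino bug described in the other answer):

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;

public class StandingsWithJs {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            client.getOptions().setThrowExceptionOnScriptError(false);
            client.getOptions().setCssEnabled(false);

            HtmlPage page = client.getPage("https://www.nba.com/standings#/");
            // Give the page's background scripts time to build the standings tables
            client.waitForBackgroundJavaScript(10000);

            List<?> headers = page.getByXPath("//caption[@class='standings__header']/span");
            for (Object header : headers) {
                System.out.println(((HtmlSpan) header).getTextContent());
            }
        }
    }
}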
Related
I need to extract the CoinMarketCap volume (e.g. Market Cap: $306,020,249,332) from the top of the page with Java; please see the attached picture.
I have used the jsoup library in Java (Eclipse), but it didn't extract the volume; jsoup extracts only the other attributes. The problem probably comes from a JavaScript library.
I have also used HtmlUnit without success:
import java.io.IOException;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Testss {
    public static void main(String[] args) throws IOException {
        String url = "https://coinmarketcap.com/faq/";
        WebClient client = new WebClient();
        HtmlPage page = client.getPage(url);
        List<?> anchors = page.getByXPath("//div[@class='col-sm-6 text-center']//a");
        for (Object obj : anchors) {
            HtmlAnchor a = (HtmlAnchor) obj;
            System.out.println(a.getTextContent().trim());
        }
    }
}
How can I extract the volume from this site with Java?
Thanks!
Check the network tab to find out the exact request that is fetching the data. In your case it is https://files.coinmarketcap.com/generated/stats/global.json
So, fetching the main URL will not give you what you require. You have to fetch the data from that request URL directly and parse it using any JSON library; SimpleJSON is one I can suggest.
This is the JSON data you will get after hitting the URL:
{
    "bitcoin_percentage_of_market_cap": 55.95083004655126,
    "active_cryptocurrencies": 1324,
    "total_volume_usd": 21503093761,
    "active_markets": 7009,
    "total_market_cap_by_available_supply_usd": 301100436864
}
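For example, here is a minimal sketch of that approach using Jackson (any JSON library works just as well; the endpoint and field names are taken from the response shown above):

import java.net.URL;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class GlobalStats {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        // Fetch and parse the stats endpoint directly instead of scraping the page
        JsonNode root = mapper.readTree(new URL("https://files.coinmarketcap.com/generated/stats/global.json"));
        long marketCap = root.get("total_market_cap_by_available_supply_usd").asLong();
        System.out.printf("Market Cap: $%,d%n", marketCap);
    }
}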
First of all, I should clarify that I am using Google Translate; I am Hispanic and don't know much English.
Well, let me tell you what I need to do.
I'm trying to make this code work, but it gives me an error. Note that I am writing it exactly the same as on the official website:
official website: http://htmlunit.sourceforge.net/gettingStarted.html
package serieflv;

import org.junit.Test;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import junit.framework.Assert;

public class webClient {

    @Test
    public void homePage() throws Exception {
        try (final WebClient webClient = new WebClient()) {
            final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
            Assert.assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

            final String pageAsXml = page.asXml();
            Assert.assertTrue(pageAsXml.contains("<body class=\"composite\">"));

            final String pageAsText = page.asText();
            Assert.assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
        }
    }
}
These are the errors that it throws.
You seem to have incorrectly imported a JUnit 3 class here, while your test case is clearly a JUnit 4 test case. Change the following line
import junit.framework.Assert;
to
import org.junit.Assert;
import java.io.IOException;
import java.net.URL;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.ThreadedRefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ReadHtml {
    public static void main(String[] args) throws Exception {
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setActiveXNative(true);
        webClient.getOptions().setAppletEnabled(false);
        webClient.getOptions().setCssEnabled(true);
        webClient.getOptions().setDoNotTrackEnabled(true);
        webClient.getOptions().setGeolocationEnabled(false);
        webClient.getOptions().setPopupBlockerEnabled(false);
        webClient.getOptions().setPrintContentOnFailingStatusCode(true);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);
        webClient.getOptions().setThrowExceptionOnScriptError(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setRefreshHandler(new ThreadedRefreshHandler());
        webClient.getCookieManager().setCookiesEnabled(true);

        WebRequest request = new WebRequest(new URL("some url containing javascript to load html elements"));
        try {
            Page page = webClient.getPage(request);
            //System.out.println(page.getWebResponse().getContentAsString());
            System.out.println(((HtmlPage) page).asXml());
        } catch (FailingHttpStatusCodeException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
I want to print all the HTML content (not only the source code), including HTML produced by JavaScript, iframes, and nested iframes. I tried this code, but the HTML loaded by JavaScript is not printed to the console. (I also tried identifying elements by id and name, but I don't want to print anything specific; I want to print the entire HTML content.) Can someone point out the modification that needs to be made?
Thanks in advance.
I found a solution for my task (not exactly what I wanted):
List<WebWindow> windows = webClient.getWebWindows();
for (WebWindow w : windows) {
    HtmlPage hpage2 = (HtmlPage) w.getEnclosedPage();
    System.out.println("-------------------------------------");
    System.out.println(hpage2.asXml());
}
This way I was able to get all the iframe contents and nested iframe contents, not as one continuous page but separately.
When I know the iframe name, I can extract its contents with
HtmlPage hpage = (HtmlPage) webClient.getWebWindowByName("google_esf").getEnclosedPage();
For now this resolves my problem. Still, it would be better if someone could suggest how to get it as one continuous page.
Try using page.asXml().
HtmlPage itself is a DOM node, so you can iterate through its children recursively. The frames may be accessed (recursively) via the DOM or via page.getFrames().
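For the recursive part, here is a minimal sketch (printWithFrames is just an illustrative name I made up; it assumes you already hold the top-level HtmlPage):

import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.html.FrameWindow;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FramePrinter {
    // Print the page's XML, then recurse into every frame it contains
    static void printWithFrames(HtmlPage page) {
        System.out.println(page.asXml());
        for (FrameWindow frame : page.getFrames()) {
            Page enclosed = frame.getEnclosedPage();
            if (enclosed instanceof HtmlPage) {
                printWithFrames((HtmlPage) enclosed);
            }
        }
    }
}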
If you need to print all the responses from the server, you can use a WebConnectionWrapper as an interceptor. This gives you access to all the responses (including the script ones).
Update (July 9):
Frames are part of the DOM. But if some of the content is being loaded asynchronously (Ajax), HtmlUnit might not have waited for it to load. Try adding an AjaxController to your WebClient; here is an example.
For WebConnectionWrapper, use this example. But again, if there is some asynchronous processing, HtmlUnit may exit before all the processing is done, so an AjaxController might be your best bet.
browser.setWebConnection(new WebConnectionWrapper(browser) {
    public WebResponse getResponse(final WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        // process the response here
        return response;
    }
});
Update (July 10):
NicelyResynchronizingAjaxController works for user-initiated Ajax. For "self-loading" requests, try something like this:
public class AlwaysSynchronizingAjaxController extends NicelyResynchronizingAjaxController {
    @Override
    public boolean processSynchron(HtmlPage page, WebRequest settings, boolean async) {
        return true;
    }
}
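Wiring it in is a single call on the client, mirroring the setAjaxController line already shown in the question above:

webClient.setAjaxController(new AlwaysSynchronizingAjaxController());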
If you are using Fiddler (or Wireshark or any other sniffing/intercepting tool), see if you can find the communication for the dynamically loaded requests.
I used HtmlUnit to scrape images from web pages. I am a beginner with HtmlUnit. I wrote the code below, but I don't know how to get the images.
import java.io.*;
import java.net.URL;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class urlscrap {
    public static void main(String[] args) throws Exception {
        //WebClient webClient = new WebClient(Opera);
        WebClient webClient = new WebClient();
        HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
        System.out.println(currentPage.asText());
        //webClient.closeAllWindows();
    }
}
Does this work for you?
import java.net.URL;
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlImage;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class urlscrap {
    public static void main(String[] args) throws Exception {
        //WebClient webClient = new WebClient(Opera);
        WebClient webClient = new WebClient();
        HtmlPage currentPage = (HtmlPage) webClient.getPage(new URL("http://www.google.com"));
        // Get the list of all images on the page
        final List<?> images = currentPage.getByXPath("//img");
        for (Object imageObject : images) {
            HtmlImage image = (HtmlImage) imageObject;
            System.out.println(image.getSrcAttribute());
        }
        //webClient.closeAllWindows();
    }
}
Looks like you're getting the text of the page, which is indeed the first step. What's your question? Are you having a problem finding all the images referenced within the page? I recommend looking up how to do DOM parsing in Java and using it to extract all the img tags from the page.
If you don't mind switching languages, I would recommend Python's Scrapy. It is the best framework I've used so far to scrape web content, including images (it can even create thumbnails for you automatically). Personally, I would not use Java for such tasks.
I wonder whether YouTube can be searched with HtmlUnit. I started to write code; here it is:
import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class HtmlUnitExampleTestBase {

    private static final String YOUTUBE = "http://www.youtube.com";

    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient();
        webClient.setThrowExceptionOnScriptError(false);

        // This is equivalent to typing youtube.com into the address bar of a browser
        HtmlPage currentPage = webClient.getPage(YOUTUBE);

        // Get the form where the submit button is located
        HtmlForm searchForm = (HtmlForm) currentPage.getElementById("masthead-search");

        // Print the resulting form
        System.out.println(searchForm.asText());

        // newPage is the results page returned after submitting the search form (see the answer below)
        final List<HtmlAnchor> listLinks = (List<HtmlAnchor>) newPage.getByXPath("//a[@class='ux-thumb-wrap result-item-thumb']");
        for (int i = 0; i < listLinks.size(); i++) {
            System.out.println(YOUTUBE + listLinks.get(i).getAttribute("href"));
        }
    }
}
Now I don't know how to type some text into the search field and press the Search button.
I saw tutorials about HtmlUnit, but I'm having a problem because they use a method named getElementByName, and the search button on YouTube doesn't have a name, just an id. Could someone help me?
EDIT: I edited the code above and now I am getting YouTube links from the first page. But before that I need to sort by upload date and then grab the links. Can someone help me with the sorting?
I'm no HtmlUnit expert, but there is a workaround: you can add your own button to the form and use it to submit the form.
Here's a code sample with comments:
import java.io.IOException;
import java.net.MalformedURLException;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitExampleTestBase {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        WebClient webClient = new WebClient();
        webClient.setThrowExceptionOnScriptError(false);

        // This is equivalent to typing youtube.com into the address bar of a browser
        HtmlPage currentPage = webClient.getPage("http://www.youtube.com");

        // Get the form where the submit button is located
        HtmlForm searchForm = (HtmlForm) currentPage.getElementById("masthead-search");

        // Get the input field.
        HtmlTextInput searchInput = (HtmlTextInput) currentPage.getElementById("masthead-search-term");

        // Insert the search term.
        searchInput.setText("Nyan Cat");

        // Workaround: create a 'fake' button and add it to the form.
        HtmlButton submitButton = (HtmlButton) currentPage.createElement("button");
        submitButton.setAttribute("type", "submit");
        searchForm.appendChild(submitButton);

        // Workaround: use the reference to the button to submit the form.
        HtmlPage newPage = submitButton.click();
        System.out.println(newPage.asText());
    }
}
HtmlUnit is OK, but I vastly prefer Watir or Selenium for web automation.
One of HtmlUnit's weaknesses is its lack of selector methods for getting at DOM elements in a jQuery-like way. Check out the css-selector project, which adds on to HtmlUnit to help you do what you need very easily. There's an intro at Gooder Code.
Once you get that working, the selector for the YouTube search form would be ".search-term" and the submit button's selector would be ".search-button".
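If you'd rather stay with stock HtmlUnit, roughly equivalent lookups can be done with XPath against those class names. A sketch, assuming the class names above still match the live page and that page is the loaded YouTube page:

HtmlTextInput searchInput = page.getFirstByXPath("//input[contains(@class, 'search-term')]");
HtmlButton searchButton = page.getFirstByXPath("//button[contains(@class, 'search-button')]");
searchInput.setText("Nyan Cat");
HtmlPage results = searchButton.click();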