import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.ThreadedRefreshHandler;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class ReadHtml{
public static void main(String[] args) throws Exception {
java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setActiveXNative(true);
webClient.getOptions().setAppletEnabled(false);
webClient.getOptions().setCssEnabled(true);
webClient.getOptions().setDoNotTrackEnabled(true);
webClient.getOptions().setGeolocationEnabled(false);
webClient.getOptions().setPopupBlockerEnabled(false);
webClient.getOptions().setPrintContentOnFailingStatusCode(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(true);
webClient.getOptions().setThrowExceptionOnScriptError(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
webClient.setCssErrorHandler(new SilentCssErrorHandler());
webClient.setRefreshHandler(new ThreadedRefreshHandler());
webClient.getCookieManager().setCookiesEnabled(true);
WebRequest request = new WebRequest(new URL("some url containing javascript to load html elements"));
try {
Page page;
page = webClient.getPage(request);
//System.out.println(page.getWebResponse().getContentAsString());
System.out.println(((HtmlPage) page).asXml());
} catch (FailingHttpStatusCodeException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
I want to print all html element(not only source code), including html which are produced by javascript,iframes, nested iframes. I tried with this code but (also tried identifying by id,name but not prefer to print anyting specifically. want to print entire html contents), html load by javascript is not printing to console. Can Someone point out the modification need to be carried out?
Thanks in advance.
I found some solution for my task (Not exactly what i want )
List<WebWindow> windows = webClient.getWebWindows();
for(WebWindow w : windows){
HtmlPage hpage2 = (HtmlPage) w.getEnclosedPage();
System.out.println("-------------------------------------");
System.out.println(hpage2.asXml());
}
By this way i could able to get all the iframe contents and nested iframe contents.Not as continuous page but as seperately.
when i know the iframe name i could extract that contents by
HtmlPage hpage = (HtmlPage)webClient.getWebWindowByName("google_esf").getEnclosedPage();
for now this resolves my problem.Still its better if someone suggest how to get as continuous page.
Try using page.asXML.
HTMLPage itself is a DOM Node, so you can iterate through the children recursively The frames may be accessed (recursively) via DOM or via page.getFrames.
If you need to print all the responses from server, you can use WebConnectionWrapper as interceptor. This will get you access to all the responses (including Script ones)
July 9
Frames are part of the DOM. But, if some of the content is being loaded asynchronously (Ajax), HTMLUnit might not have waited for that to load. Try adding an AjaxController to your WebClient. Here is an example.
For WebConnectoinWrapper, use this example. But again, if there is some asynchronous processing, HTMLUnit may exit before all the processing is done. So, AjaxController might be your best bet.
browser.setWebConnection(new WebConnectionWrapper(browser) {
public WebResponse getResponse(final WebRequest request) throws IOException {
WebResponse response = super.getResponse(request);
//processResponse
return response;
}
});
July 10
NicelyResynchronizingAjaxController works for user initiated ajax. For "self loading" ones try something like this.
public class AlwaysSynchronizingAjaxController extends NicelyResynchronizingAjaxController {
public boolean processSynchron(HtmlPage page, WebRequest settings, boolean async) {
return true;
};
}
If you are using Fiddler (or wireshark or any other sniffing/interceptor tools), see if you find the communication for the dynamically loaded requests.
Related
One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's element which content I need: <div id='tags_list'></div>
I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problems:
If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();
One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's element which content I need: <div id='tags_list'></div>
I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problems:
If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();
I am trying to web scrape https://www.nba.com/standings#/
Here is my code
What I am trying to use is page.getByXPath("//caption[#class='standings__header']/span")
Which should pull back Eastern Conference and Western Conference but instead it pulls back nothing I don't understand if my Xpath is wrong?
package Standings;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSpan;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
public class Standings {
private static final String baseUrl = "https://www.nba.com/standings#/";
public static void main(String[] args) {
WebClient client = new WebClient();
client.getOptions().setJavaScriptEnabled(false);
client.getOptions().setCssEnabled(false);
client.getOptions().setUseInsecureSSL(true);
String jsonString = "";
ObjectMapper mapper = new ObjectMapper();
try {
HtmlPage page = client.getPage(baseUrl);
System.out.println(page.asXml());
page.getByXPath("//caption[#class='standings__header']/span")
} catch (IOException e) {
e.printStackTrace();
}
}
}
Have used this code to verify your problem:
public static void main(String[] args) throws IOException {
final String url = "https://www.nba.com/standings#/";
try (final WebClient webClient = new WebClient()) {
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10000);
System.out.println(page.asXml());
}
}
When running this i got a bunch of warning and errors in the log.
(BTW: the page produces also many error/warnings when running with real browsers. Seems the maintainer of the page has a interesting view on quality)
I guess the problematic error is this one
TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)
There is a known bug in the javascript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the bug is thrown from main.js i guess this will stop the processing of the page javascript before the content you are looking for is generated.
So far i found no time to fix this (looks like this has to be fixed in Rhino) but this one is on the list.
Have a look at https://twitter.com/HtmlUnit to get informed about updates.
The page you are trying to scrape needs Javascript to display properly. If you disable it, most of the elements won't load.
Changing the line
client.getOptions().setJavaScriptEnabled(false);
to
client.getOptions().setJavaScriptEnabled(true);
should do the trick
I need to extract coinmarket cap volume (ex: Market Cap: $306,020,249,332) from top of page with Java, please see picture attached.
I have used jsoup library in Java Eclipse but didn't extract volume. Jsoup extract only other attributes. Probably problem is from a java script library.
Also I have used html unit without success:
import java.io.IOException;
import java.util.List;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
public class Testss {
public static void main(String\[\] args) throws IOException {
String url = "https://coinmarketcap.com/faq/";
WebClient client = new WebClient();
HtmlPage page = client.getPage(url);
List<?> anchors = page.getByXPath("//div\[#class='col-sm-6 text-center'\]//a");
for (Object obj : anchors) {
HtmlAnchor a = (HtmlAnchor) obj;
System.out.println(a.getTextContent().trim());
}
}
}
How can I extract volume from this site with Java?
Thanks!
Check the network tab findout the exact request which is fetching the data, In your case its https://files.coinmarketcap.com/generated/stats/global.json
Also the request URL is the below one
So, Fetching the main URL will not give you what you require, For that you have to fetch the data from the request URL directly and parse it using any JSON library. SimpleJSON I can suggest in one of those.
The JSON data which you will get after hitting the url.
{
"bitcoin_percentage_of_market_cap": 55.95083004655126,
"active_cryptocurrencies": 1324,
"total_volume_usd": 21503093761,
"active_markets": 7009,
"total_market_cap_by_available_supply_usd": 301100436864
}
One block on the page is filled with content by JavaScript and after loading page with Jsoup there is none of that inforamtion. Is there a way to get also JavaScript generated content when parsing page with Jsoup?
Can't paste page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's element which content I need: <div id='tags_list'></div>
I need to get this information in Java. Preferably using Jsoup. Element is field with help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component, there are a number of discussions on SO regarding that kind of component, eg Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: it is groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString());
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening :
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following : parsing HTML code is easy. Executing Javascript code and updating corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problems:
If you can find what are the Ajax calls that Javascript code is making, that is loading content, you might be able to use the URL of these calls with Jsoup. In order to do that, use Developer Tools from your browser. But this is not guaranteed to work:
it might be that the url is dynamic, and depends on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>3.35</version>
</dependency>
Simple Example From file https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A Complex Example: Load login, get Session and CSRF, then post and wait for home page to finish loading (15 seconds)
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
I fact there is a "way"! Maybe it is more "a workaround" than a "way... The code below checks both for meta attribute "REFRESH" and javascript redirects... If either of them exists RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
It is possible by combining JSoup with another framework to interpret the webpage, in my example here I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying user agent, my problem is solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();