I'm coding an app which:
- loads a URL in a WebView;
- extracts the HTML through JavaScript code;
- shows the extracted HTML code in the log.
As I need to load the page without JavaScript enabled (to avoid some behaviors of the page), I tried the code below, where:
- I load the page in the WebView with JavaScript disabled;
- when the page is loaded, I enable JavaScript;
- then the app executes the JavaScript required to extract the HTML code.
Unfortunately, when the code is executed in debug mode on Android 4.0.4, it gives an error:
01-22 22:37:56.575: E/Web Console(7605): Uncaught TypeError: Cannot call method 'processHTML' of undefined at null:1
If I remove the myBrowserSettings.setJavaScriptEnabled(false); call after the loadUrl call, everything works correctly.
What can I do to make the code below work?
package com.stefano.formfiller;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import android.app.Activity;
import android.content.Intent;
import android.os.Bundle;
import android.os.Handler;
import android.util.Log;
import android.view.View;
import android.webkit.CookieManager;
import android.webkit.CookieSyncManager;
import android.webkit.WebChromeClient;
import android.webkit.WebSettings;
import android.webkit.WebView;
import android.webkit.WebViewClient;
import android.webkit.WebSettings.PluginState;
public class MainActivity extends Activity {
WebView myBrowser;
String urlToBrowse = "http://www.mywebsite.com";
String htmlCode = null;
StringBuffer buffer = new StringBuffer();
@Override
protected void onCreate(Bundle savedInstanceState)
{
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
myBrowser = (WebView)findViewById(R.id.webView1);
//Browser settings
WebSettings myBrowserSettings = myBrowser.getSettings();
//Prevent cache to be used
myBrowserSettings.setCacheMode(WebSettings.LOAD_NO_CACHE);
myBrowserSettings.setAppCacheEnabled(false);
//General settings
myBrowserSettings.setJavaScriptEnabled(true);
Log.d("Stefano", "JS enabled");
//FIREFOX user agent
myBrowserSettings.setUserAgentString("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0");
myBrowser.setWebChromeClient(new WebChromeClient());
myBrowser.setWebViewClient(new WebViewClient() {
public void onPageFinished(WebView view, String url)
{
WebSettings myBrowserSettings = myBrowser.getSettings();
myBrowserSettings.setJavaScriptEnabled(true);
Log.d("Stefano", "JS enabled");
Log.d("Stefano", "OnPageFinished running");
} });
//Start the delayed HTML code extraction
delayedStartHtmlExtractor(16000);
Log.d("Stefano", "DelayedStart HTML Extractor launched");
//Prepare Javascript to extract the HTML code from the webview
myBrowser.addJavascriptInterface(new LoadListener(), "HTMLOUT");
myBrowser.loadUrl(urlToBrowse);
Log.d("Stefano", "Main URL requested");
myBrowserSettings.setJavaScriptEnabled(false);
Log.d("Stefano", "JS disabled");
}
//Delayed HTML extraction
public void delayedStartHtmlExtractor(final int delay){
Handler handler = new Handler();
handler.postDelayed(new Runnable()
{
@Override
public void run()
{
myBrowser.loadUrl("javascript:window.HTMLOUT.processHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
Log.d("Stefano", "HTML extraction launched");
}
}, delay);
}
//Insert the HTML code in the log information
class LoadListener{
public void processHTML(String html)
{
Log.d("Stefano", "HTML Extraction in progress...");
Log.e("HTML CODE",html);
}
}
}
Update:
I have a doubt: the code instantiates the Javascript interface while Javascript is enabled (through myBrowser.addJavascriptInterface(new LoadListener(), "HTMLOUT");); then I disable Javascript after the URL call, to re-enable it when the page is fully loaded.
Could it be that, by disabling Javascript with the interface already instantiated, I "cut off" the communication channel between the Javascript and the Java code?
Add myBrowser.loadData(...) after setting the interface, like this:
myBrowser.addJavascriptInterface(new LoadListener(), "HTMLOUT");
myBrowser.loadData("", "text/html", null);
myBrowser.loadUrl(urlToBrowse);
Also, since you disable JS by the end of the onCreate method, there is no need to enable it at the start, as sketched below. :)
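Put together, the start of the load sequence in onCreate could look roughly like this (untested; the names come from the question):
//JavaScript stays disabled here; onPageFinished re-enables it later.
myBrowser.addJavascriptInterface(new LoadListener(), "HTMLOUT");
//Load an empty page first so the bridge object gets injected,
//then load the real URL.
myBrowser.loadData("", "text/html", null);
myBrowser.loadUrl(urlToBrowse);
Log.d("Stefano", "Main URL requested");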
Hope this helps.
First of all, you should attach the proper annotation @JavascriptInterface to the methods that will be called through the Javascript interface; in your case:
//..
@JavascriptInterface
public void processHTML(String html) {
Log.d("Stefano", "HTML Extraction in progress...");
Log.e("HTML CODE",html);
}
//..
"Note that injected objects will not appear in JavaScript until the page is loaded"
I suppose that loading a page with setJavaScriptEnabled(false) will not inject any Javascript object at all, and this is why you are experiencing this problem.
A possible workaround (untested) could be this; a sketch follows the list:
always load the page using setJavaScriptEnabled(true)
load the webpage passing through http://www.google.com/gwt/n (will load the page without JS or Flash)
do your processing
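A rough sketch of that flow in the question's code (the u= query parameter for the transcoder is an assumption you would verify):
//Hypothetical: wrap the target URL with Google's transcoder so the
//page is served with scripts stripped; JavaScript can stay enabled.
try {
    String transcoded = "http://www.google.com/gwt/n?u="
            + java.net.URLEncoder.encode(urlToBrowse, "UTF-8");
    myBrowser.loadUrl(transcoded);
} catch (java.io.UnsupportedEncodingException e) {
    e.printStackTrace();
}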
When you instantiate your LoadListener object, try the following:
this.new LoadListener();
Related
One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that information is there. Is there a way to also get JavaScript-generated content when parsing a page with Jsoup?
Can't paste the page code here, since it is too long: http://pastebin.com/qw4Rfqgw
Here's the element whose content I need: <div id='tags_list'></div>
I need to get this information in Java, preferably using Jsoup. The element is filled with the help of JavaScript:
<div id="tags_list">
разведчик
Sr
стратегический
</div>
Java code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.io.IOException;
public class Test
{
public static void main( String[] args )
{
try
{
Document Doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
Elements Tags = Doc.select( "#tags_list a" );
for ( Element Tag : Tags )
{
System.out.println( Tag.text() );
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}
}
JSoup is an HTML parser, not some kind of embedded browser engine. This means that it's completely unaware of any content that is added to the DOM by Javascript after the initial page load.
To get access to that type of content you will need an embedded browser component; there are a number of discussions on SO regarding that kind of component, e.g. Is there a way to embed a browser in Java?
Solved in my case with com.codeborne.phantomjsdriver
NOTE: this is Groovy code.
pom.xml
<dependency>
<groupId>com.codeborne</groupId>
<artifactId>phantomjsdriver</artifactId>
<version> <here goes last version> </version>
</dependency>
PhantomJsUtils.groovy
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver
class PhantomJsUtils {
private static String filePath = 'data/temp/';
public static Document renderPage(String filePath) {
System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
WebDriver ghostDriver = new PhantomJSDriver();
try {
ghostDriver.get(filePath);
return Jsoup.parse(ghostDriver.getPageSource());
} finally {
ghostDriver.quit();
}
}
public static Document renderPage(Document doc) {
String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
FileUtils.writeToFile(tmpFileName, doc.toString()); // FileUtils here is a project-specific helper, not a standard library class
return renderPage(tmpFileName);
}
}
ClassInProject.groovy
Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
You need to understand what is happening:
When you query a page from a website, whether using Jsoup or your browser, what gets sent back to you is some HTML. Jsoup is able to parse that.
However, most websites include Javascript in that HTML, or linked from that HTML, which will populate the page with content. Your browser is able to execute the Javascript, and thus populate the page. Jsoup is not.
The way to understand this is the following: parsing HTML code is easy. Executing Javascript code and updating the corresponding HTML code is a lot more complex, and is the work of a browser.
Here are some solutions for this kind of problem:
If you can find the Ajax calls that the Javascript code makes to load the content, you might be able to use the URLs of those calls with Jsoup. To find them, use the Developer Tools of your browser. But this is not guaranteed to work:
the URL might be dynamic, and depend on what is on the page at that time
if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In these cases, you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know, and recommend, is PhantomJS. It works with Javascript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.
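A rough sketch of launching PhantomJS as an external process from Java (the binary path and the render.js script are assumptions; render.js would print the rendered HTML to stdout):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// Launch PhantomJS with a script that prints the rendered HTML,
// then read its stdout back and hand it to Jsoup.
ProcessBuilder pb = new ProcessBuilder("/path/to/phantomjs", "render.js", "http://example.com");
pb.redirectErrorStream(true);
Process process = pb.start();
StringBuilder html = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(process.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    html.append(line).append('\n');
}
process.waitFor();
Document doc = Jsoup.parse(html.toString());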
You can use a combination of JSoup and HtmlUnit to get the page contents after JavaScript scripts are done loading.
pom.xml
<dependency>
<groupId>net.sourceforge.htmlunit</groupId>
<artifactId>htmlunit</artifactId>
<version>2.70.0</version>
</dependency>
A simple example, adapted from https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit:
// load page using HTML Unit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());
// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());
// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
for (Element col : row.select("td"))
// print results
System.out.println(col.ownText());
// clean up resources
webClient2.close();
A complex example: load the login page, get the session and CSRF token, then POST the login and wait for the home page to finish loading (15 seconds).
import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
//JSoup load Login Page and get Session Details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();
String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");
HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");
WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(),
new URL("https://url"),
"https://referrer");
// Add other cookies/ Session ...
webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
// Wait time
webClient.waitForBackgroundJavaScript(15000);
webClient.getOptions().setThrowExceptionOnScriptError(false);
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken="+csrf);
HtmlPage page = webClient.getPage(requestSettings);
// Wait
synchronized (page) {
try {
page.wait(15000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());
In fact there is a "way"! Maybe it is more of a "workaround" than a "way"... The code below checks both for the meta attribute "REFRESH" and for Javascript redirects... If either of them exists, the RedirectedUrl variable is set. So you know your target... Then you can retrieve the target page and go on...
String RedirectedUrl=null;
Elements meta = page.select("html head meta");
if (meta.attr("http-equiv").contains("REFRESH")) {
RedirectedUrl = meta.attr("content").split("=")[1];
} else {
if (page.toString().contains("window.location.href")) {
meta = page.select("script");
for (Element script:meta) {
String s = script.data();
if (!s.isEmpty() && s.startsWith("window.location.href")) {
int start = s.indexOf("=");
int end = s.indexOf(";");
if (start>0 && end >start) {
s = s.substring(start+1,end);
s =s.replace("'", "").replace("\"", "");
RedirectedUrl = s.trim();
break;
}
}
}
}
}
... now retrieve the redirected page again...
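For instance (assuming RedirectedUrl is absolute; a relative value would have to be resolved against the original URL first):
if (RedirectedUrl != null) {
    Document redirected = Jsoup.connect(RedirectedUrl).get();
    // ...parse the redirected page the same way as before
}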
It is possible by combining JSoup with another framework that interprets the webpage; in this example I'm using HtmlUnit.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
...
WebClient webClient = new WebClient();
HtmlPage myPage = webClient.getPage(URL);
Document document = Jsoup.parse(myPage.asXml());
Elements otherLinks = document.select("a[href]");
After specifying the user agent, my problem was solved.
https://github.com/jhy/jsoup/issues/287#issuecomment-12769155
Try:
Document Doc = Jsoup.connect(url)
.header("Accept-Encoding", "gzip, deflate")
.userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0")
.maxBodySize(0)
.timeout(600000)
.get();
Situation: I have been attempting to parse a URL, retrieve the information between the body tags, and set it in the Android TextView.
Problem: Something is wrong and/or missing.
Code:
package jsouptutorial.androidbegin.com.jsouptutorial;
import android.support.v7.app.AppCompatActivity;
import android.os.Bundle;
import android.widget.TextView;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.Elements;
import java.io.File;
import java.io.IOException;
public class MainActivity extends AppCompatActivity {
@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);
TextView textOut = (TextView)findViewById(R.id.rootTxtView);
//------------------Something went wrong here-------------------------------
Document doc;
try {
//doc = Jsoup.connect("https://stackoverflow.com/questions/45311629/android-jsoup-parsing-url-for-all-body-text").get();
doc = Jsoup.parse(new File("https://stackoverflow.com/questions/45311629/android-jsoup-parsing-url-for-all-body-text"), "UTF-8");
Elements desc = doc.select("a.body");
textOut.setText((CharSequence) desc); //Setting textView to a String
} catch (IOException e) {
e.printStackTrace();
}
//--------------------------------------------------------------------
}
}
You have a couple of problems here:
First, you are trying to create a File object from a URL; this will throw an IOException. You instead want to use the Jsoup method to retrieve the document from the URL:
Document doc = Jsoup.connect("https://stackoverflow.com/questions/45311629/android-jsoup-parsing-url-for-all-body-text").get();
The next problem is your element selection, doc.select("a.body"). This tries to select all anchor tags <a> with a class of body, and there are none. To get the body, just use doc.body().
Also, as mentioned by cricket_007, you are attempting a network request from the main thread, so it will throw a NetworkOnMainThreadException. The easiest way around this is to run it in an AsyncTask; see this question for details. A minimal sketch follows.
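A sketch (untested; textOut must be final or a field so the anonymous class can capture it, and android.os.AsyncTask must be imported):
new AsyncTask<Void, Void, String>() {
    @Override
    protected String doInBackground(Void... params) {
        try {
            // Network work happens off the main thread here.
            Document doc = Jsoup.connect("https://stackoverflow.com/questions/45311629/android-jsoup-parsing-url-for-all-body-text").get();
            return doc.body().text();
        } catch (IOException e) {
            e.printStackTrace();
            return null;
        }
    }
    @Override
    protected void onPostExecute(String bodyText) {
        // Back on the UI thread: safe to touch views.
        if (bodyText != null) {
            textOut.setText(bodyText);
        }
    }
}.execute();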
I am using an application to authorize a user on Twitter using the Twitter4j library. I want to incorporate this feature: my mobile app opens, it has a login button, on click of which the Twitter login dialog appears and lets you enter the login information. After the login is complete, another screen opens.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.auth.AccessToken;
import twitter4j.auth.RequestToken;
import android.app.Activity;
import android.content.Intent;
import android.content.SharedPreferences;
import android.content.SharedPreferences.Editor;
import android.net.Uri;
import android.os.Bundle;
import android.util.Log;
import android.view.View;
import android.webkit.WebView;
import android.widget.Button;
import android.widget.Toast;
public class AndTweetVJActivity extends Activity {
/** Called when the activity is first created. */
Twitter twitter;
RequestToken requestToken;
public final static String consumerKey = "myKey"; // "your key here";
public final static String consumerSecret = "myKey"; // "your secret key here";
private final String CALLBACKURL = "T4JOAuth://main"; //Callback URL that tells the WebView to load this activity when it finishes with twitter.com. (see manifest)
//Calls the OAuth login method as soon as it's started
@Override
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.main);
OAuthLogin();
}
/* Creates object of Twitter and sets consumerKey and consumerSecret
* - Prepares the URL accordingly and opens the WebView for the user to provide sign-in details
* - When user finishes signing-in, WebView opens your activity back */
void OAuthLogin() {
try {
twitter = new TwitterFactory().getInstance();
twitter.setOAuthConsumer(consumerKey, consumerSecret);
requestToken = twitter.getOAuthRequestToken(CALLBACKURL);
String authUrl = requestToken.getAuthenticationURL();
this.startActivity(new Intent(Intent.ACTION_VIEW, Uri
.parse(authUrl)));
} catch (TwitterException ex) {
Toast.makeText(this, ex.getMessage(), Toast.LENGTH_LONG).show();
Log.e("in Main.OAuthLogin", ex.getMessage());
}
}
/*
* - Called when WebView calls your activity back.(This happens when the user has finished signing in)
* - Extracts the verifier from the URI received
* - Extracts the token and secret from the URL
*/
@Override
protected void onNewIntent(Intent intent) {
super.onNewIntent(intent);
Uri uri = intent.getData();
try {
String verifier = uri.getQueryParameter("oauth_verifier");
AccessToken accessToken = twitter.getOAuthAccessToken(requestToken,verifier);
String token = accessToken.getToken(), secret = accessToken.getTokenSecret();
//displayTimeLine(token, secret); //after everything, display the first tweet
} catch (TwitterException ex) {
Log.e("Main.onNewIntent", "" + ex.getMessage());
}
}
}
however on running this application, it gives me error in logcat :
11-18 10:36:27.727: E/in Main.OAuthLogin(282): 401:Authentication credentials (https://dev.twitter.com/docs/auth) were missing or incorrect. Ensure that you have set valid conumer key/secret, access token/secret, and the system clock in in sync.
11-18 10:36:27.727: E/in Main.OAuthLogin(282): <?xml version="1.0" encoding="UTF-8"?>
11-18 10:36:27.727: E/in Main.OAuthLogin(282): <hash>
11-18 10:36:27.727: E/in Main.OAuthLogin(282): <error>Desktop applications only support the oauth_callback value 'oob'</error>
11-18 10:36:27.727: E/in Main.OAuthLogin(282): <request>/oauth/request_token</request>
11-18 10:36:27.727: E/in Main.OAuthLogin(282): </hash>
I believed I had not set up the callback URL, but I did that as well, per https://dev.twitter.com/pages/welcome-anywhere, in my app.
Make sure that you have not registered your app in the Desktop application category during Twitter application registration.
My bad... my web server was down and I did not notice it until 5 minutes ago. Thanks for all the suggestions. I have already done these things.
I'm experimenting with twitter4j on Android (new to both). I coded up a simple process in Java just to test it out: it downloads a user's timeline and prints it to the screen.
I modified the code for Android, but I get a TwitterException when I try to download the user timeline. I checked the debugger and the exception is null; no information given. I've also added the Internet permission to the Android manifest on previous advice. Here's the code:
package com.test;
import java.util.List;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import android.app.Activity;
import android.os.Bundle;
public class Test2 extends Activity {
/** Called when the activity is first created. */
@Override
public void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.main);
List<Status> statuses = null;
Twitter api = new TwitterFactory().getInstance("USERNAME","PASSWORD");
try{
statuses = api.getUserTimeline();
}
catch(TwitterException e){
System.out.println("ERROR");
System.exit(-1);
}
for(Status s: statuses){
System.out.println(s.getText());
}
}
}
I realise this only prints to the console, just to keep it simple.
Thanks for any and all help.
Make sure you have the INTERNET permission (<uses-permission android:name="android.permission.INTERNET" />) declared in your AndroidManifest.xml file.
Also System.out.println() is not recommended on Android. Please use the android.util.Log class and send your debugging output to LogCat (available via adb logcat, DDMS, or the DDMS perspective in Eclipse).
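For example (the TAG constant is an assumed name):
// In the activity:
private static final String TAG = "Test2";
// Instead of System.out.println(s.getText()):
for (Status s : statuses) {
    Log.d(TAG, s.getText());
}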
Please check your timestamp. Each OAuth-signed HTTP request contains the current timestamp; if the timestamp is wrong, the server rejects the request with an exception.
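A quick, hypothetical way to check is to compare the device clock with a trusted server's Date response header (exception handling omitted; any reliable HTTPS endpoint works as the reference):
import java.net.HttpURLConnection;
import java.net.URL;
HttpURLConnection conn = (HttpURLConnection) new URL("https://api.twitter.com").openConnection();
conn.connect();
long serverTime = conn.getDate();             // Date header, in milliseconds
long deviceTime = System.currentTimeMillis(); // device clock
System.out.println("Clock skew (ms): " + (deviceTime - serverTime));
conn.disconnect();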