I am trying to crawl a web page which requires authentication. I can access that page in the browser when I am logged in. I am using the JSoup library (http://jsoup.org/) to parse HTML pages.
public static void main(String[] args) throws IOException {
    // need http protocol
    Document doc = Jsoup.connect("http://www.secinfo.com/$/SEC/Filing.asp?T=r643.91Dx_2nx").get();
    // get page title
    String title = doc.title();
    System.out.println("title : " + title);
    // get all links
    Elements links = doc.select("a");
    for (Element link : links) {
        // get the value from href attribute
        System.out.println("\nlink : " + link.attr("href"));
    }
    System.out.println();
}
Output :
title : SEC Info - Sign In
This is getting the content of the sign-in page, not the actual URL I am passing. I am registered on secinfo.com, and while running this program I am logged in from my default browser, Firefox.
That will not help, even if you are logged in using your default browser. Your Java program is a separate process and it does not share the browser's session or cookies.
On the other hand, secinfo requires authentication, and JSoup allows you to pass authentication details.
It works for me when I pass the authentication details:
Please check this answer (Jsoup connection with basic access authentication)
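For reference, a minimal sketch of the basic-auth approach from that answer, assuming the target actually uses HTTP Basic authentication (secinfo.com may well use a form-based login instead, in which case the POST approach shown below is the right one; the credentials here are placeholders):

import java.util.Base64;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BasicAuthSketch {
    public static void main(String[] args) throws Exception {
        String login = "myUserName";
        String password = "myPassword";
        // Basic auth is just a Base64-encoded "user:password" in the Authorization header
        String encoded = Base64.getEncoder()
                .encodeToString((login + ":" + password).getBytes("UTF-8"));
        Document doc = Jsoup.connect("http://www.secinfo.com/$/SEC/Filing.asp?T=r643.91Dx_2nx")
                .header("Authorization", "Basic " + encoded)
                .get();
        System.out.println(doc.title());
    }
}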
Jsoup's connect() also supports post() with method chaining, if your target site's login mechanism works with a POST request:
Document doc = Jsoup.connect("url")
.data("aUserName", "myUserName")
.data("aPassword", "myPassword")
.userAgent("Mozilla")
.timeout(3000)
.post();
But what if the page you are trying to get requires a cookie to be sent with each subsequent request? Try using HttpURLConnection with POST and read the cookie from the response headers of the HTTP connection. HttpClient will make this task easier for you: use the library to fetch the web page as a string, then pass that string to Jsoup.parse() to get the document.
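A rough sketch of that idea with Apache HttpClient 4.x (all URLs and form field names below are placeholders, not taken from the question): the default client keeps the session cookies it receives and re-sends them on later requests, and the fetched HTML can then be handed to Jsoup.parse().

import java.util.Arrays;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HttpClientLoginSketch {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // 1. Log in; the default client stores any session cookies it receives
            HttpPost login = new HttpPost("http://example.com/login");   // placeholder URL
            login.setEntity(new UrlEncodedFormEntity(Arrays.asList(
                    new BasicNameValuePair("username", "myUserName"),    // placeholder field names
                    new BasicNameValuePair("password", "myPassword"))));
            EntityUtils.consume(client.execute(login).getEntity());

            // 2. Fetch the protected page with the same client (cookies are sent automatically)
            HttpGet page = new HttpGet("http://example.com/protected.html");
            String html = EntityUtils.toString(client.execute(page).getEntity(), "UTF-8");

            // 3. Hand the HTML string to Jsoup for parsing
            Document doc = Jsoup.parse(html, "http://example.com/");
            System.out.println(doc.title());
        }
    }
}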
You have to sign in with a POST request and preserve the cookies you get back; that is where your session info is stored. I wrote an example here: Jsoup can't Login on Page.
The website in that example is an exception: it sets the session cookie already on the login page. You can skip that step if it works without it.
The exact POST request can differ from website to website. You have to dig it out of the HTML, or install a plugin in your browser and intercept the POST requests.
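A minimal Jsoup-only sketch of that flow, assuming a plain username/password form (the URL and field names are placeholders you would take from the site's actual login form):

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupCookieLoginSketch {
    public static void main(String[] args) throws Exception {
        // 1. Post the login form and capture the cookies the server sets
        Connection.Response loginResponse = Jsoup.connect("http://example.com/login")  // placeholder URL
                .data("username", "myUserName")   // field names come from the login form's HTML
                .data("password", "myPassword")
                .method(Connection.Method.POST)
                .execute();
        Map<String, String> cookies = loginResponse.cookies();

        // 2. Send those cookies with every subsequent request
        Document doc = Jsoup.connect("http://example.com/protected.html")
                .cookies(cookies)
                .get();
        System.out.println(doc.title());
    }
}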
Related
I'm having issues with sending POST data to this site:
https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.assoc_handle=amzn_mturk_worker&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&_encoding=UTF8&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.pape.max_auth_age=43200&marketplaceId=A384XSLT9ODACQ&clientContext=703ea210dfe6fd07defd5ab30ac8d9&openid.return_to=https%3A%2F%2Fwww.mturk.com%2Fmturk%2Fendsignin
I'm using Jsoup. I'm trying to reuse the same "session-id" cookie for the GET request, but I'm still not logged in. This is my code:
Connection.Response res = Jsoup.connect("https://www.amazon.com/ap/signin?openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.assoc_handle=amzn_mturk_worker&openid.ns.pape=http%3A%2F%2Fspecs.openid.net%2Fextensions%2Fpape%2F1.0&_encoding=UTF8&openid.mode=checkid_setup&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.pape.max_auth_age=43200&marketplaceId=A384XSLT9ODACQ&clientContext=703ea210dfe6fd07defd5ab30ac8d9&openid.return_to=https%3A%2F%2Fwww.mturk.com%2Fmturk%2Fendsignin")
        .data("email", "blah@gmail.com", "password", "blah")
        .method(Connection.Method.POST)
        .execute();
Document doc2 = res.parse();
sessionId = res.cookie("session-id");
Document doc = Jsoup.connect("https://www.mturk.com/mturk/searchbar?selectedSearchType=hitgroups&minReward=0.00&sortType=LastUpdatedTime%3A1&pageSize=50").cookie("SESSIONID", sessionId).get();
Where the e-mail and password are real credentials instead of "blah". I don't know if my issue is in how I parse the cookie or in how I send the POST data in the first place.
Edit: So the site uses OpenID. Not sure if I should make a whole new question, but how would I go about it now? I basically need to log in and pull information off the site after login. Here is my POST data:
appActionToken:pj2FxGfwLZT6nheliE7BMxwZrTUKEj3D
appAction:SIGNIN
clientContext:ape:NzAzZWEyMTBkZmU2ZmQwN2RlZmQ1YWIzMGFjOGQ5
openid.pape.max_auth_age:ape:NDMyMDA=
openid.return_to:ape:aHR0cHM6Ly93d3cubXR1cmsuY29tL210dXJrL2VuZHNpZ25pbg==
prevRID:ape:S1kyUFNDUkhLVFZSSjRGMjBYUUo=
openid.identity:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=
openid.assoc_handle:ape:YW16bl9tdHVya193b3JrZXI=
openid.mode:ape:Y2hlY2tpZF9zZXR1cA==
openid.ns.pape:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvZXh0ZW5zaW9ucy9wYXBlLzEuMA==
openid.claimed_id:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjAvaWRlbnRpZmllcl9zZWxlY3Q=
pageId:ape:YW16bl9tdHVya193b3JrZXI=
openid.ns:ape:aHR0cDovL3NwZWNzLm9wZW5pZC5uZXQvYXV0aC8yLjA=
email: -Deleted-
create:0
password: -Deleted-
metadata1:+gLgZV5Fc5cBh44WnOrKTq5ofl6IhGvSbZGHfX7T5PuwmIl0Ep4bclt77iRlLPO1thRNg/9TylDw5H/9UPZnuOcF1OAHaECaWmK9H8pkW0elpz5QgEukM4aP6dPwSliw9Ggy+1/vQCk0MLm2TvkyS8uLslyh2aEw4H7hDmcF6lTgctZVE8B2KENH97L7hp4rcR2NHKMm4tEFdwpmVqv+pmLX5rUBo4p2QNUe3g0dNAifuK3RPXCVSQyQHpUzlBuFZTFK9xspwA2dgcdSZcgQzgzQKik/WEDrn0eP4sAVnO1ZGFUWKFAY55Lzgf6yd6WxCZ15yGTWENf0Km9wnXce+Ev5SMarXPJNQtfqY6tdp5snwFxpB8m/x72AvRgWJACoi5qcyqwO6dxroebIyB9uruApIkUk07AD8bJvhcf92+flN9TY4iXCkIoeSUN5aKp8rJbyhspySgsmQ9guu4964qeQRK0J092/sx1De6VmfGQ3nMrr0+McnC4/wZo2jUhGOr62ow==
The site you're trying to log in to makes use of JavaScript. Since Jsoup doesn't support JavaScript (as of Jsoup 1.8.3, at the time of writing), I advise you to use a better-suited approach: ui4j or Selenium. The choice is yours.
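For example, a rough Selenium sketch. The element locators below are assumptions, not taken from the question; you would have to inspect the actual sign-in page to find the real ones, and a scripted login may still be rejected by the site.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class SeleniumLoginSketch {
    public static void main(String[] args) {
        WebDriver driver = new FirefoxDriver();
        try {
            driver.get("https://www.amazon.com/ap/signin?...");   // the full sign-in URL from the question
            // The ids below are guesses; inspect the page to find the real ones
            driver.findElement(By.id("ap_email")).sendKeys("myEmail@example.com");
            driver.findElement(By.id("ap_password")).sendKeys("myPassword");
            driver.findElement(By.id("signInSubmit")).click();

            // After the JavaScript-driven login, hand the rendered HTML to Jsoup if you like
            String html = driver.getPageSource();
            System.out.println(org.jsoup.Jsoup.parse(html).title());
        } finally {
            driver.quit();
        }
    }
}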
I'm using Java and trying to get the content of a website so that I can analyze the text on the page. However, every time I "GET" a response from the server, it is the login page rather than the page I am looking at.
I am logged into the website in all my browsers, but my application is not able to see the page as if it were me.
I also tried to use an API called "Yandex" (http://api.yandex.com/rca/)
as a work-around, but when I call the page through Yandex (which would fetch its content) I only see information based on the returned login page.
Can anyone point me in a direction to investigate? I would like to be able to get one item from a page of a website that I work for, but it doesn't seem possible.
String m_strSeedUrlPath = "http://myUrl.com/mypage.html"; // not https
URLConnection connection = new URL("http://rca.yandex.com/?key={MyActualKeyNotThisText}&url=" + m_strSeedUrlPath).openConnection();
connection.setRequestProperty("Accept-Charset", "UTF-8");
InputStream response = connection.getInputStream();
StringWriter writer = new StringWriter();
IOUtils.copy(response, writer, "UTF-8");
String strString = writer.toString();
System.out.println(strString);
The URLConnection object will connect to the page, but in a different session. You would have to programmatically log in from your Java code.
Create a URLConnection to the login page, POST the user name and password, read the response content from the connection's InputStream, and finally create a new connection to the page you wish to analyze. You also have to handle cookies in order to view the second page, for example:
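A bare-bones sketch of those steps with plain HttpURLConnection (the URLs and form field names are placeholders; a real login form may also require hidden fields and more careful cookie handling):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import org.apache.commons.io.IOUtils;

public class UrlConnectionLoginSketch {
    public static void main(String[] args) throws Exception {
        // 1. POST the credentials to the login page
        HttpURLConnection login = (HttpURLConnection) new URL("http://example.com/login").openConnection();
        login.setRequestMethod("POST");
        login.setDoOutput(true);
        login.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        String form = "username=" + URLEncoder.encode("myUserName", "UTF-8")
                    + "&password=" + URLEncoder.encode("myPassword", "UTF-8");
        try (OutputStream out = login.getOutputStream()) {
            out.write(form.getBytes("UTF-8"));
        }

        // 2. Grab the session cookie from the response header
        String cookie = login.getHeaderField("Set-Cookie");

        // 3. Request the protected page, sending the cookie back
        HttpURLConnection page = (HttpURLConnection) new URL("http://example.com/protected.html").openConnection();
        if (cookie != null) {
            page.setRequestProperty("Cookie", cookie.split(";", 2)[0]);  // send only "name=value"
        }
        try (InputStream in = page.getInputStream()) {
            System.out.println(IOUtils.toString(in, "UTF-8"));
        }
    }
}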
Hope this helps!
The URL that you are trying to access has its access restricted via a login. Even if you are logged in via your browser, you won't be able to access the page from your Java application, because the browser holds an authenticated session with the target website, and that session is not visible to your Java application.
You would have to research how to log in to the website programmatically and then get the page content.
I have a website.
To see the contents inside, you must be logged in.
I use this code to log in:
doc = Jsoup.connect("http://46.137.207.181/Account/Login.aspx")
        .data("ctl00$MainContent$LoginUser$UserName", "1234")
        .data("ctl00$MainContent$LoginUser$Password", "123456")
        .data("__VIEWSTATE", "/wEPDwULLTEyMDAyNTY1NjJkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYBBSZjdGwwMCRNYWluQ29udGVudCRMb2dpblVzZXIkUmVtZW1iZXJNZUHk9FMvtsvPHqlP3vAV+1oloaxe4Asr7RQX5XFptqGz")
        .data("__EVENTVALIDATION", "/wEWBQLup8mjCgLFyvjkDwLQzbOWAgKVu47QDwKnwKnjBTL6Xsxc9zQnY8p9KVlFJ/8HIHqlOGl9uClF4ktcWYJ5")
        .data("ctl00$MainContent$LoginUser$LoginButton", "2")
        .post();
Then I request a page that requires login:
doc2 = Jsoup.connect("http://46.137.207.181/Groups.aspx").get();
s = doc2.title();
Elements kelime = doc2.select("td");
for (Element link : kelime) {
    linkHref = link.attr("hh");
}
But the result still shows that I am not logged in.
How can I do this?
What is happening in your example is that you are logging in with form data to Login.aspx and creating a session, but the request to Groups.aspx doesn't carry that session data, so the request is not authenticated.
Login.aspx will return a session cookie, and you need to pass that cookie on to the next request.
See the answers to this jsoup login question for good examples.
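Applied to the code above, a hedged sketch could look roughly like this. It assumes the form field names stay the same as in your snippet and that __VIEWSTATE and __EVENTVALIDATION must be read fresh from the live login page rather than hard-coded:

import java.util.HashMap;
import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AspNetLoginSketch {
    public static void main(String[] args) throws Exception {
        // 1. Load the login page first to pick up its cookies and the current hidden fields
        Connection.Response loginPage = Jsoup.connect("http://46.137.207.181/Account/Login.aspx")
                .method(Connection.Method.GET)
                .execute();
        Document loginDoc = loginPage.parse();
        String viewState = loginDoc.select("input[name=__VIEWSTATE]").val();
        String eventValidation = loginDoc.select("input[name=__EVENTVALIDATION]").val();

        // 2. Post the form, carrying over the cookies from step 1
        Connection.Response loginRes = Jsoup.connect("http://46.137.207.181/Account/Login.aspx")
                .cookies(loginPage.cookies())
                .data("ctl00$MainContent$LoginUser$UserName", "1234")
                .data("ctl00$MainContent$LoginUser$Password", "123456")
                .data("__VIEWSTATE", viewState)
                .data("__EVENTVALIDATION", eventValidation)
                .data("ctl00$MainContent$LoginUser$LoginButton", "2")
                .method(Connection.Method.POST)
                .execute();

        // 3. Send the accumulated session cookies with the request for the protected page
        Map<String, String> cookies = new HashMap<>(loginPage.cookies());
        cookies.putAll(loginRes.cookies());
        Document doc2 = Jsoup.connect("http://46.137.207.181/Groups.aspx")
                .cookies(cookies)
                .get();
        System.out.println(doc2.title());
    }
}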
I am using HtmlUnit's WebClient and can log in successfully through it, but when I send a second request to get data, I get this:
com.gargoylesoftware.htmlunit.UnexpectedPage@6d8d73
But when I request the URL in the browser I get the data in JSON format.
Code:
webClient.getPage("http://ajax/stream/refresh-box?r=new&id=2323222&")
Thanks in advance.
This is the code that reads a different content type through the web client:
Page page = webClient.getPage(url);
System.out.println(page);
WebResponse response = page.getWebResponse();
if (response.getContentType().equals("application/json")) {
    pagesource = response.getContentAsString();
    System.out.println(pagesource);
}
This URL is returning an error when I type it in the browser; check it again:
"http://ajax/stream/refresh-box?r=new&id=2323222&"
BTW, why does it end with an & sign when there is no additional parameter?
I removed the & sign and I still get a page-not-found error in the browser.
I'm starting to use the Facebook Graph API, and I want to retrieve an access token with some simple HTTP requests in Java.
Following https://developers.facebook.com/docs/authentication/
I created a new app but I don't have a domain so
I make an HTTP request to
www.facebook.com/dialog/oauth?client_id=YOUR_APP_ID&
redirect_uri=https://www.facebook.com/connect/login_success.html
for a server-side flow, and I am supposed to get redirected to a success page with a code in the URL. Then I would use this code to make another HTTP request to
graph.facebook.com/oauth/access_token?
client_id=YOUR_APP_ID&redirect_uri=YOUR_URL&
client_secret=YOUR_APP_SECRET&code=THE_CODE_FROM_ABOVE
and finally get my access token.
I used both java.net.HttpURLConnection and org.apache.http.HttpResponse,
but in both cases, when executing the first call, the response I get is the HTML of a Facebook login page.
If I use this HTML to create a web page and then simply click the Login button (without entering a username or password), I get the success page with the code!
In the HTML, the submit field of the Login button is empty and I can't retrieve any redirect URLs... I can only read an alternate link in the <meta> tag which generates an auth_token (what is that? It is very different from a normal access_token...).
So what I ask is:
Is it possible to detect the hidden redirect in some way, just
using java.net.HttpURLConnection or
org.apache.http.HttpResponse?
If yes, how does the mechanism work? Is it related to the auth_token?
If no, is it possible with other libraries? (I also tried restfb,
but it seems to require an access token inserted "by hand" as an
argument, and I also looked at facebook-java-api, but it seems old.)
Also, even if I'm logged in to Facebook, executing the first HTTP call via Java still returns the HTML of a Facebook login page.
If I use that HTML to create a web page and simply click the Login button (without entering a username or password), I get the success.htm page with the code parameter in the URL.
If I use the original URL directly in my browser, I obtain the success.htm page directly, without any steps in between.
So I suppose the problem is in the management of cookies: from Java (run in Eclipse) I cannot access my browser's cookies.
I tried redirecting to a Servlet, but I get an error about the domain:
the servlet URL is not a Facebook domain or a "site URL" registered for my app (actually I didn't set a site URL for my app... and that's the core of the problem).
In any case, here
http://developers.facebook.com/docs/authentication/
in the section App types > Desktop apps, they say:
[...] After the user authorizes your app [I allowed everything], we
redirect the user back to the redirect_uri with the access token in
the URI fragment: [...]
Detect this redirect and then read the access token out of the URI
using whatever mechanisms provided by your framework of choice. [...]
So I think that it is still possible to detect this redirect via Java. How?
If you do not have a domain yet, I recommend using localhost as the domain. That way you can test it on your local web server / local app.
Using HttpURLConnection works fine.
This is how we do it.
Redirect to:
"https://graph.facebook.com/oauth/authorize?" +
"client_id=" + clientId + "&" +
"redirect_uri=" + URLEncoder.encode(returnUrl, "utf-8")
// After redirect to the return url do the following:
//Make a http request to
"https://graph.facebook.com/oauth/access_token?client_id=" +
"client_id=" + clientId + "&" +
"redirect_uri=" + URLEncoder.encode(returnUrl, "utf-8") + "&"+
"client_secret=" + clientSecret + "&"+
"code=" + request.getParameter("code");
This will return an access token which you can then use to query Facebook.
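A hedged sketch of making that second call with HttpURLConnection. The redirect URI, app id, secret and code are placeholders; also note that older Graph API versions returned the token as a form-encoded body (access_token=...&expires=...), while newer versions return JSON, so adjust the parsing to whatever your API version actually sends back.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Scanner;

public class FacebookTokenSketch {
    public static void main(String[] args) throws Exception {
        String clientId = "YOUR_APP_ID";
        String clientSecret = "YOUR_APP_SECRET";
        String returnUrl = "http://localhost/callback";   // must match the registered redirect_uri
        String code = "THE_CODE_FROM_ABOVE";

        String tokenUrl = "https://graph.facebook.com/oauth/access_token?"
                + "client_id=" + clientId
                + "&redirect_uri=" + URLEncoder.encode(returnUrl, "UTF-8")
                + "&client_secret=" + clientSecret
                + "&code=" + code;

        HttpURLConnection conn = (HttpURLConnection) new URL(tokenUrl).openConnection();
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            String body = scanner.hasNext() ? scanner.next() : "";
            // Older API versions: "access_token=...&expires=..."; newer ones return JSON
            System.out.println(body);
        }
    }
}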