I've often wanted to create applications that provide a simpler front-end to other websites, ones that require users to log in before the pages I want can be accessed. I was wondering:
(1) Can any website whose login form POSTs to an HTTP page be authenticated by POSTing
postField1name=pf1Value&postField2name=pf2Value
to the site? If that's true, how do you inspect the HTML to construct the POST correctly?
(2) I wanted to know whether you could parse the HTML of, say, a sign-up form and display all its fields in an application UI, including downloading a captcha and showing it to the user so they can type in the value, then send everything back to the website and process the response.
Also, if anyone knows how I might accomplish (2) using the Apache HttpClient in Java, I'd greatly appreciate it!
http://hc.apache.org/httpcomponents-client/httpclient/index.html
(1) An easy way to find out what's actually being POSTed is to look at the actual HTTP requests. You can do that with a tool like LiveHTTPHeaders. Then have your script simulate that.
(2) Yes. You can use cURL, which is excellent for things like this.
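If you'd rather stay in Java than shell out to cURL, a minimal sketch of simulating such a login POST with Apache HttpClient 4.x could look like the following; the URL and the field names are placeholders for whatever LiveHTTPHeaders shows you:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.BasicCookieStore;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    public class LoginPost {
        public static void main(String[] args) throws Exception {
            // The cookie store keeps whatever session cookie the server
            // sets on a successful login, for use on later requests.
            BasicCookieStore cookies = new BasicCookieStore();
            try (CloseableHttpClient client = HttpClients.custom()
                    .setDefaultCookieStore(cookies).build()) {
                // Field names copied from the observed request; placeholders here.
                HttpPost post = new HttpPost("https://example.com/login");
                List<NameValuePair> fields = new ArrayList<>();
                fields.add(new BasicNameValuePair("postField1name", "pf1Value"));
                fields.add(new BasicNameValuePair("postField2name", "pf2Value"));
                post.setEntity(new UrlEncodedFormEntity(fields));

                try (CloseableHttpResponse response = client.execute(post)) {
                    System.out.println(response.getStatusLine());
                    System.out.println(EntityUtils.toString(response.getEntity()));
                }
            }
        }
    }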
(1) Try Firebug. There are actually a lot of options for authentication.
(2) Try JTidy.
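For (2), here's a sketch of pulling a form's fields (and a captcha image URL) out of the page. It uses Jsoup rather than JTidy, but the idea is the same; the URL and the assumption that the captcha is a plain <img> inside the form are both made up:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class FormFields {
        public static void main(String[] args) throws Exception {
            // Fetch the sign-up page and list every field the form expects.
            Document doc = Jsoup.connect("https://example.com/signup").get();
            Element form = doc.selectFirst("form");
            for (Element field : form.select("input, select, textarea")) {
                System.out.printf("%s name=%s value=%s%n",
                        field.tagName(), field.attr("name"), field.attr("value"));
            }
            // A captcha is usually just an <img>; download its src separately
            // and show it in your own UI for the user to solve.
            Element captcha = form.selectFirst("img");
            if (captcha != null) {
                System.out.println("captcha image: " + captcha.absUrl("src"));
            }
        }
    }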
Related
I'm doing some web scraping, using Jsoup to parse the HTML, and my understanding is that Jsoup doesn't work well with dynamic web pages. Is there a way to check whether a page is dynamic, so that I don't bother trying to parse it with Jsoup?
Short answer: not really. You need to check case by case.
Explanation:
Today's websites are full of AJAX calls. Many load important data; others are only marginally interesting when you scrape a site's content. Many very modern sites even do both: they send a fully rendered page to the client, where it then gets transformed into a web app (keyword: isomorphic rendering).
So you need to check the site in question case by case. That's not hard, though: just fire up curl and see whether you get the content you need. If you don't, it's often not hard either to work out the structure and parameters of the AJAX calls, and once you have those, you can frequently fetch even dynamic content with nothing but Jsoup.
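A quick way to run that check from Java itself, assuming Jsoup and a made-up selector for the content you're after:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class StaticCheck {
        public static void main(String[] args) throws Exception {
            // Fetch the raw HTML the way curl would (no JavaScript executed)...
            Document doc = Jsoup.connect("https://example.com/video/42").get();
            // ...and test whether the piece you care about is already in it.
            // "#comments .comment" is a placeholder selector.
            boolean present = !doc.select("#comments .comment").isEmpty();
            System.out.println(present
                    ? "content is in the static HTML; plain Jsoup is enough"
                    : "content is probably loaded by AJAX; look at the XHR calls");
        }
    }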
You cannot be 100% sure whether a website is dynamic or static, because there are ways to hide the clues that show a website is dynamic. But you can check a limited number of HTTP response headers to test for it:
Set-Cookie : the server handing the client an HTTP cookie, which the client returns on later requests in the Cookie header
X-Csrf-Token : Used to prevent cross-site request forgery. Alternative header names are: X-CSRFToken and X-XSRF-TOKEN
X-Powered-By : specifies the technology (e.g. ASP.NET, PHP, JBoss) supporting the web application (version details are often in X-Runtime, X-Version, or X-AspNet-Version)
Those are three headers that, as far as I know, only appear when server-side scripting is involved in generating the response.
Also, chances are that a web page with form-related elements has some server-side mechanism to process the form data.
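A small sketch of checking for those headers with plain java.net; the URL is a placeholder, and remember that their absence proves nothing:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HeaderCheck {
        public static void main(String[] args) throws Exception {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("https://example.com/").openConnection();
            conn.setRequestMethod("HEAD"); // headers only, no body
            // Response headers that hint at server-side processing.
            for (String name : new String[] {
                    "Set-Cookie", "X-Csrf-Token", "X-Powered-By"}) {
                String value = conn.getHeaderField(name);
                if (value != null) {
                    System.out.println(name + ": " + value);
                }
            }
            conn.disconnect();
        }
    }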
Let's say I've created a mobile application named 'Foo' (iOS). This app talks to a Java backend at 'java.com' and works perfectly. Now I'm trying to create the website 'Foo.com' to let users enjoy the 'same' service in a browser. So far, I've found that almost all the API calls the website needs can be made in JavaScript directly against the backend at 'java.com', including a login function.
On the backend, I've implemented the standard 'doPost' method to handle the login, and I create a Cookie to attach to the response.
The problem, I think, is that users get the JavaScript from 'Foo.com', and the JavaScript tries to log in via an AJAX call to 'java.com'; thus the cookie will be 'stamped' by 'www.java.com', not by 'www.foo.com', and the user will never receive it. (At least, I don't receive a cookie now.)
I've been trying to find a way to accept cookies from 'java.com' into the application, but it doesn't look good. Honestly, I'm not even sure this is the actual problem causing me not to receive a cookie, but I've read in several places that cross-domain cookies aren't allowed. So I ask the general question: how should I proceed?
I've been toying with the idea of adding a .php page to the server side of the website 'foo.com' and handling the requests from client to API there, hopefully causing the cookies to be 'stamped' as 'foo.com' instead of 'java.com'. (In that case, I'd also wonder whether the .php page can forward the information in the cookie, or something similar.)
But I really want to keep traffic on the web host to a minimum. An all-script website would be optimal, but I don't really see how cookies could work with that.
Is there anything else I can do to handle this? If I simply want a persistent login function for clients of 'foo.com' handled at 'java.com', what are the options, with or without cookies?
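A rough sketch of the proxy idea described above, written as a Java servlet on 'foo.com' rather than a .php page; the paths, parameter names, and cookie handling are all assumptions:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Deployed on foo.com, so the browser talks to a same-origin endpoint
    // and any cookie set here belongs to foo.com.
    public class LoginProxyServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Forward the credentials to the real backend at java.com.
            HttpURLConnection backend = (HttpURLConnection)
                    new URL("https://java.com/login").openConnection();
            backend.setRequestMethod("POST");
            backend.setDoOutput(true);
            String body = "user=" + URLEncoder.encode(req.getParameter("user"), "UTF-8")
                    + "&pass=" + URLEncoder.encode(req.getParameter("pass"), "UTF-8");
            backend.getOutputStream().write(body.getBytes("UTF-8"));

            String backendCookie = backend.getHeaderField("Set-Cookie");
            if (backend.getResponseCode() == 200 && backendCookie != null) {
                // Re-issue the backend's session token as a foo.com cookie,
                // which the browser will actually accept and keep.
                resp.addCookie(new Cookie("session", extractToken(backendCookie)));
            }
            resp.setStatus(backend.getResponseCode());
        }

        private static String extractToken(String setCookie) {
            // "JSESSIONID=abc123; Path=/" -> "abc123" (naive parsing)
            int eq = setCookie.indexOf('=');
            int semi = setCookie.indexOf(';');
            return semi < 0 ? setCookie.substring(eq + 1)
                            : setCookie.substring(eq + 1, semi);
        }
    }

Later requests from the website would then go through a similar forwarding endpoint that translates the foo.com cookie back into the java.com one.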
Nowadays many websites load some of their content via AJAX (e.g., the comments on some video sites). Normally we can't crawl that data; what we get is just some JS source code. So here is the question: how can we execute the JavaScript after we get the HTML response, so that we end up with the final page we want?
I know that HtmlUnit can execute background JS, yet it comes with quite a few bugs and errors. Are there any other tools that can help me with this?
Some people tell me I can find the AJAX request URL, analyze its parameters, and send the request again myself to get the data. If the approach above doesn't work out, can anyone tell me how to extract the AJAX URL and send the request in the correct format?
By the way, if the language is Java, that would be best.
Yes, Netwoof can crawl AJAX easily. Its API and bot builder let you do it without a line of code.
That's the great thing about HTTP: you don't even need Java. My go-to tool for debugging AJAX is the Chrome extension Postman. I start by looking at the request in the Chrome debugger and identifying the salient bits (URL, form-encoded params, etc.).
Then it can be as simple as opening a tab and launching requests at the server with Postman. As long as it's all in the same browser context, all of your cookies (for authentication, etc.) will be shipped along too.
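Once the network tab has given you the URL and parameters, replaying the call from Java is mostly a matter of copying them over. A sketch with java.net; the endpoint, the header, and the cookie value are all made up:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class ReplayAjax {
        public static void main(String[] args) throws Exception {
            // URL and parameters copied from the browser's network tab.
            URL url = new URL("https://example.com/comments?videoId=42&page=1");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // Some endpoints check these; copy whatever the browser sent.
            conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
            conn.setRequestProperty("Cookie", "session=abc123"); // from login
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line); // usually JSON rather than HTML
                }
            }
        }
    }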
I want to make an application that logs into a web site by filling in a form, performs basic operations such as button clicks, and finally logs out. What packages / external JARs are available for this?
You need to look at java.net.URL, java.net.HttpURLConnection, and java.net.Authenticator.
In general you need to open an HTTP connection, get the form (with a GET), fill in the details, and then perform a POST of the login data.
Bear in mind that most sites provide some security, such as SSL or other mechanisms; you will have to deal with this in your application (the same way a browser knows how to handle it).
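A bare-bones sketch of that flow with just java.net; the URL and field names are placeholders, and in practice you'd read the real field names out of the form's HTML first:

    import java.io.OutputStream;
    import java.net.CookieHandler;
    import java.net.CookieManager;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class FormLogin {
        public static void main(String[] args) throws Exception {
            // Let HttpURLConnection carry session cookies between requests.
            CookieHandler.setDefault(new CookieManager());

            // Step 1: GET the login page (also picks up any session cookie).
            HttpURLConnection get = (HttpURLConnection)
                    new URL("https://example.com/login").openConnection();
            get.getInputStream().close();

            // Step 2: POST the filled-in form fields.
            HttpURLConnection post = (HttpURLConnection)
                    new URL("https://example.com/login").openConnection();
            post.setRequestMethod("POST");
            post.setDoOutput(true);
            post.setRequestProperty("Content-Type",
                    "application/x-www-form-urlencoded");
            String body = "username=" + URLEncoder.encode("alice", "UTF-8")
                    + "&password=" + URLEncoder.encode("secret", "UTF-8");
            try (OutputStream out = post.getOutputStream()) {
                out.write(body.getBytes("UTF-8"));
            }
            System.out.println("login response: " + post.getResponseCode());
        }
    }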
I'm building an app to let users export data from a university system. Currently, they can log in and see the data in HTML, but I would like to let people download it as CSV.
I have an app where users supply their username and password. I would like to log in to the university system on their behalf and scrape the resulting HTML page. How can I do this?
I'm building a GWT app. I could do this either in Java-transliterated-to-JS on the client or in Java on the server.
Update: Selenium might be nice, but it looks like overkill.
You're going to have to do this from the server unless the domains are the same. You'd need to determine what the POST transaction used by the other server for the login step looks like - parameter names etc. Then you'd perform that operation and do whatever you want with what comes back. If you need to see multiple pages, you need to maintain the appropriate session cookie too so that the server knows you're still logged in on the subsequent HTTP requests.
If you have to hit another site to validate the credentials, then I'm not so sure that people should feel comfortable providing those credentials to you. That is, if you don't have the rights to check the credentials directly, why should you be trusted to receive them? I know sometimes people need to integrate with a system they don't own, so this is just a question.
First, this has to be done server-side because of the limitations on client scripting due to the same origin policy.
The typical way of handling the "screen scraping" you mention is to treat the web page as if it were an XML service. First, examine the source code of the page; then, using an HTTP stack, craft a POST to the correct URL and read the response with a standard XML library. It will take some ingenuity to come up with a way of digging into the XML for the piece you need that is as insulated as possible from changes to the page. Keep in mind that your system can break any time the owners of the site change their page.
Sometimes, you can't just send the POST but have to request the blank page initially in order to get hidden form values that need to be returned in the POST. You'll have to experiment to find out what it requires.
Additionally, you probably have to handle cookies as well, since they usually are an integral part of the web site's authentication and session management (though you might get lucky that the session doesn't matter between the initial POST and the first response).
Last, you may be unlucky enough that the site uses JavaScript to do part of the authentication work, which may require additional digging to understand how the credentials are posted.
There are other potential barriers such as the site checking to see that the referrer is their own site, possible use of SSL (HTTPS) and so on.
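Putting those pieces together (request the blank page for hidden fields, carry the cookies across, then send the POST), a sketch with Jsoup; the university URLs and field names are invented:

    import java.util.HashMap;
    import java.util.Map;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ScrapeLogin {
        public static void main(String[] args) throws Exception {
            // Step 1: fetch the blank login page to collect the session
            // cookie and any hidden fields (CSRF tokens and the like).
            Connection.Response loginPage = Jsoup
                    .connect("https://university.example/login")
                    .method(Connection.Method.GET).execute();

            Map<String, String> fields = new HashMap<>();
            for (Element hidden : loginPage.parse().select("input[type=hidden]")) {
                fields.put(hidden.attr("name"), hidden.attr("value"));
            }
            fields.put("username", "student"); // placeholder field names
            fields.put("password", "secret");

            // Step 2: POST the form, sending back the cookies from step 1.
            Connection.Response login = Jsoup
                    .connect("https://university.example/login")
                    .cookies(loginPage.cookies())
                    .data(fields)
                    .method(Connection.Method.POST).execute();

            // Step 3: fetch the data page with the authenticated session.
            Map<String, String> session = new HashMap<>(loginPage.cookies());
            session.putAll(login.cookies());
            Document data = Jsoup
                    .connect("https://university.example/grades")
                    .cookies(session).get();
            System.out.println(data.select("table").text());
        }
    }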
I'm pretty sure that the same-origin restrictions in web browsers will mean that you can't log in to the university's app using JavaScript running in the browser. So the part of your program that fetches data from the university will need to run on your server. Once you have the data, you can process it either on your server or in JavaScript in the browser, but I think it would be easier to do it on the server.
See http://en.wikipedia.org/wiki/Same_origin_policy
I'm not too sure about GWT, but in general you would take the form data submitted by the user and check it against a database of usernames and hashed passwords. If the check succeeds, set a session cookie that says the user is logged in.
In your pages, check whether the session cookie says the user is logged in. If not, redirect to the login page; otherwise, allow them to view the page.
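For what that looks like server-side in Java, a sketch of the login handler; checkCredentials is a stand-in for the database lookup against stored hashes:

    import java.io.IOException;

    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpSession;

    public class LoginServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String user = req.getParameter("username");
            String pass = req.getParameter("password");
            if (checkCredentials(user, pass)) {
                // The container issues the session cookie (JSESSIONID) itself.
                HttpSession session = req.getSession(true);
                session.setAttribute("user", user);
                resp.sendRedirect("/home");
            } else {
                resp.sendRedirect("/login?error=1");
            }
        }

        // Stub: hash `pass` and compare it with the stored hash for `user`.
        private boolean checkCredentials(String user, String pass) {
            return false;
        }
    }

On each protected page you'd then call req.getSession(false) and redirect to the login page when it is null or has no 'user' attribute.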