Crawl contents loaded by ajax - java

Nowadays many websites load some of their content with AJAX (e.g., comments on video sites). Normally we can't crawl that data directly; what we get back is just some JavaScript source code. So here is the question: after we get the HTML response, in what ways can we execute the JavaScript and reach the final page we want?
I know that HtmlUnit can execute background JavaScript, yet it has quite a few bugs and errors. Are there any other tools that can help me with this?
Some people tell me I can capture the AJAX request URL, analyze its parameters, and send the request again to obtain the data. If the approach above can't work, can anyone tell me how to extract the AJAX URL and send the request in the correct format?
By the way, Java would be best if possible.

Yes, Netwoof can crawl Ajax easily. Its API and bot builder let you do it without a line of code.

That's the great thing about HTTP: you don't even need Java. My go-to tool for debugging AJAX is the Chrome extension Postman. I start by looking at the request in the Chrome debugger and identifying the salient bits (URL, form-encoded params, etc.).
Then it can be as simple as opening a tab and launching requests at the server with Postman. As long as it is all in the same browser context, all of your cookies (for authentication, etc.) will be shipped along too.
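Once you have copied the URL, parameters, and cookies out of the debugger, replaying the call from Java is only a few lines. Below is a minimal sketch with java.net.HttpURLConnection; the endpoint, parameter names, and cookie value are placeholders, not a real API:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AjaxReplay {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and parameters: copy the real ones from the
        // Network tab of the Chrome debugger (or from Postman).
        URL url = new URL("https://example.com/comments/load");
        String body = "videoId=12345&page=1";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        // Many endpoints also expect the headers and cookies the browser sent.
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest");
        conn.setRequestProperty("Cookie", "SESSIONID=placeholder-copied-from-browser");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // usually JSON or an HTML fragment
            }
        }
    }
}
```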

Related

Tracking site XHR with Java

I tried to use HtmlUnit to send POST requests to communicate with a server and hit a small problem: the target .php URL changes from time to time
(www123.example.net -> www345.example.net, etc.). The only way to get the new address is to open the site and check its XHR requests, find the one that goes to www???.example.net, and then use that address to send the POSTs.
So the question is: is there a way to track XHR using HtmlUnit or any other Java library?
If you really need help you have to show your problem in more detail, provide some info about the web site you are requesting, show your code, and try to explain what you expect and what goes wrong. Without these details we can only guess.
It looks like you should think of HtmlUnit more as a browser you can control from Java rather than as a way of making simple HTTP requests. Have a look at the simple samples on the HtmlUnit web site (the one at the bottom is for you).
Try something like this (the same steps a user of an ordinary browser would take):
* open the url/page
* fill the various form fields
* find the submit button and click it
* use the resulting page content
Usually HtmlUnit does all of this work in the background for you.
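For example, a minimal sketch of those steps with a recent HtmlUnit 2.x; the URL and field names are placeholders you would take from the real page:

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
import com.gargoylesoftware.htmlunit.html.HtmlTextInput;

public class HtmlUnitFormExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Be tolerant of the site's JavaScript; real pages are rarely clean.
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            // 1. open the url/page
            HtmlPage page = webClient.getPage("https://www.example.com/search");

            // 2. fill the form fields (names are placeholders from the page source)
            HtmlForm form = page.getForms().get(0);
            HtmlTextInput query = form.getInputByName("q");
            query.type("htmlunit");

            // 3. find the submit button and click; HtmlUnit fires the background XHRs
            HtmlSubmitInput submit = form.getInputByName("submit");
            HtmlPage result = submit.click();

            // 4. use the resulting page content
            System.out.println(result.asXml());
        }
    }
}
```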

How to check if a webpage is static or dynamic

I'm doing some web scraping and using Jsoup to parse HTML, and my understanding is that Jsoup doesn't work well with dynamic web pages. Is there a way to check whether a web page is dynamic so that I don't bother attempting to parse it with Jsoup?
Short answer: not really. You need to check case by case.
Explanation:
Today's websites are full of AJAX calls. Many load important data; others are only marginally interesting when you scrape a site's content. Many very modern sites even do both: they send a completely rendered page to the client, where it then gets transformed into a web app (keyword: isomorphic rendering).
So you need to check the site in question case by case. It is not that hard, though. Just fire up curl and see if you get the content you need. If not, it is often also not that hard to understand the structure and parameters of the AJAX calls, and if you do that, you can often fetch even dynamic content just fine with only Jsoup.
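For example, a minimal Jsoup sketch of that check; the URL and CSS selector are placeholders:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticCheck {
    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML the server returns, exactly like curl would.
        Document doc = Jsoup.connect("https://example.com/some-page")
                .userAgent("Mozilla/5.0")
                .get();

        // If the element you care about is already in the response, the content
        // is effectively static for your purposes and plain Jsoup is enough.
        boolean present = !doc.select("div.comments").isEmpty();
        System.out.println(present
                ? "Content is in the initial HTML - Jsoup will work"
                : "Content is probably loaded by an AJAX call - inspect the XHRs");
    }
}
```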
You cannot be 100% sure whether a website is dynamic or static, because there are ways to hide the clues that show a website is dynamic, but you can check a limited number of HTTP response headers to test whether it is dynamic or static:
Set-Cookie : the server asks the client to store a cookie, which the client then returns in the Cookie request header
X-Csrf-Token : used to prevent cross-site request forgery; alternative header names are X-CSRFToken and X-XSRF-TOKEN
X-Powered-By : specifies the technology (e.g. ASP.NET, PHP, JBoss) supporting the web application (version details are often in X-Runtime, X-Version, or X-AspNet-Version)
These are three response headers that (as far as I know) only show up when server-side scripting is involved.
Also, chances are that a webpage with form-related elements has a server-side mechanism to process the form data.
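For example, a minimal sketch that inspects those headers from Java; the URL is a placeholder, and some servers ignore HEAD, in which case a plain GET works the same way:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class HeaderCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://example.com/").openConnection();
        conn.setRequestMethod("HEAD");   // headers only, no body needed

        // Headers that usually indicate a server-side application behind the page.
        String[] hints = {"Set-Cookie", "X-Csrf-Token", "X-Powered-By",
                          "X-Runtime", "X-AspNet-Version"};
        for (String name : hints) {
            String value = conn.getHeaderField(name);
            if (value != null) {
                System.out.println(name + ": " + value);
            }
        }
    }
}
```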

Simulate form post using http client in Android app?

So, I'm currently developing an app for a service which has a JSON-based (unfortunately) read-only API. Retrieving content is no problem at all; however, the only way to post content is via a form on their site whose target is a PHP script. The service is open source, so I know which fields the form expects, but whatever I send, it always results in a BAD REQUEST.
I captured the network traffic in my browser and, as far as I can see, the browser constructs a multipart form request; however, when I copy the request and invoke it again using a REST client, a BAD REQUEST gets returned.
Is there a way to construct an HTTP request in Android that simulates a form post?
If it's read-only, I think you wouldn't be able to make requests with POST (POST is assumed to be for editing or adding things).
If I may offer some advice, I recommend using this project as a library:
https://github.com/matessoftwaresolutions/AndroidHttpRestService
It makes it easy to deal with APIs, handle network problems, etc.
You can find a sample of its use there.
You only have to:
* build your URL
* tell the component to execute in POST mode
* build your JSON
As I said, I don't even know if it will work.
I hope it helps!
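If you would rather stay with the platform classes, here is a minimal sketch of a multipart form post using plain java.net.HttpURLConnection (which also works on Android). The URL and field names are placeholders, and a BAD REQUEST often means a hidden field or token from the original form is still missing:

```java
import java.io.DataOutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class FormPost {
    // Field names and URL are placeholders - take the real ones from the
    // captured browser request (including any hidden fields and tokens).
    public static int postForm() throws Exception {
        String boundary = "----AppBoundary" + System.currentTimeMillis();
        HttpURLConnection conn = (HttpURLConnection)
                new URL("https://example.com/submit.php").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type",
                "multipart/form-data; boundary=" + boundary);

        try (DataOutputStream out = new DataOutputStream(conn.getOutputStream())) {
            writeField(out, boundary, "title", "Hello from the app");
            writeField(out, boundary, "body", "Posted content");
            out.writeBytes("--" + boundary + "--\r\n");   // closing boundary
        }
        return conn.getResponseCode();   // 400 usually means a missing field or token
    }

    private static void writeField(DataOutputStream out, String boundary,
                                   String name, String value) throws Exception {
        out.writeBytes("--" + boundary + "\r\n");
        out.writeBytes("Content-Disposition: form-data; name=\"" + name + "\"\r\n\r\n");
        out.write(value.getBytes(StandardCharsets.UTF_8));
        out.writeBytes("\r\n");
    }
}
```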

How to write a Java program that can post a URL like a browser does and log the results from an HTML div or from the HTTP response?

I am planning to write a Java program where I have the URL of website x, to which I will append the numbers 1 to 100 and fetch the result from the website.
Should I do this with HTTP requests and responses, or would a plain Java program where the URL is just a string do?
If I get the result as it is shown in the browser, how do I get the values from a div and write them to a text file? I guess the other option is to get them via the response.
All you need is a programmatic browser which submits the request and gets you the response.
You could study the HTTP request and response objects in the TCP/IP protocol stack and implement your own, but instead of reinventing the wheel you can use the Apache HttpComponents project, which has all of this implemented:
Apache HttpComponents
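For example, a minimal sketch that combines Apache HttpClient (for the requests) with Jsoup (for pulling the value out of a div) and writes the results to a text file; the base URL and selector are placeholders:

```java
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;

import java.io.PrintWriter;

public class ScrapeLoop {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault();
             PrintWriter out = new PrintWriter("results.txt")) {
            for (int i = 1; i <= 100; i++) {
                // Base URL is a placeholder for website x; append the number 1..100.
                HttpGet get = new HttpGet("https://example.com/page?id=" + i);
                try (CloseableHttpResponse response = client.execute(get)) {
                    String html = EntityUtils.toString(response.getEntity());

                    // Pull the value out of the div instead of parsing by hand.
                    String value = Jsoup.parse(html).select("div.result").text();
                    out.println(i + "\t" + value);
                }
            }
        }
    }
}
```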
I'm not sure you will be able to control the browser using only Java. Even if you know where the browser executable is installed, you will not be able to use its handle to control it (no pointers in Java, different process, different memory area, etc.). Sure, you could write a DLL and then use it with JNI, but the final result would not be multiplatform.
Another possible approach would be to inject some keypresses, but you would be blind to the browser's response (you would have to do some ugly screen capture).
I don't think it is an easy task, so if I were you I would look on the web for some already-made DLL or library to control the browser.
I know that Selenium does some kind of browser control (http://docs.seleniumhq.org/); see the sketch below.
My 5 cents in 5 minutes.
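For completeness, a minimal Selenium WebDriver sketch; it assumes a matching chromedriver binary is installed on the PATH, and the URL and selector are placeholders:

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();   // launches a real Chrome instance
        try {
            driver.get("https://example.com/page?id=1");
            // Read the rendered (post-JavaScript) DOM, not the raw HTML source.
            String value = driver.findElement(By.cssSelector("div.result")).getText();
            System.out.println(value);
        } finally {
            driver.quit();
        }
    }
}
```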

Possible to Authenticate with a website with POST / Download CAPTCHA

I've often wanted to create applications that provide a simpler front end to other websites that require users to log in before the pages I want to use can be accessed. I was wondering:
(1) Can any website with a POST to an HTTP page be authenticated by POSTing
postField1name=pf1Value&postField2name=pf2Value
to the website, and if that's true, how can you inspect the HTML to POST correctly?
(2) I wanted to know if you could parse the HTML of, say, a sign-up form, display all the fields in an application UI, including downloading a CAPTCHA and displaying it to the user, allow them to type the value in, send it back to the website, and process the response.
Also, if anyone knows how I might accomplish (2) using Apache HttpClient in Java, I'd greatly appreciate it!
http://hc.apache.org/httpcomponents-client/httpclient/index.html
(1) An easy way to find out what's actually being POSTed is to look at the actual HTTP requests. You can do that with a tool like LiveHTTPHeaders, then have your script simulate that.
(2) Yes. You can use cURL, which is excellent for things like this.
(1) Try Firebug. There are actually a lot of options for authentication.
(2) Try JTidy.
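For (2) with Apache HttpClient, here is a minimal sketch of the flow under the assumption that the form is a plain urlencoded POST; all URLs and field names are placeholders you would read out of the sign-up page's HTML:

```java
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

import java.io.FileOutputStream;
import java.util.Arrays;
import java.util.List;

public class SignupFlow {
    public static void main(String[] args) throws Exception {
        // One client = one cookie jar, so the session from the GET carries over to the POST.
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            // 1. Fetch the sign-up page (parse it for the field names and the CAPTCHA URL).
            try (CloseableHttpResponse page = client.execute(
                    new HttpGet("https://example.com/signup"))) {
                EntityUtils.consume(page.getEntity());
            }

            // 2. Download the CAPTCHA image so it can be shown to the user.
            try (CloseableHttpResponse captcha = client.execute(
                    new HttpGet("https://example.com/captcha.png"));
                 FileOutputStream out = new FileOutputStream("captcha.png")) {
                captcha.getEntity().writeTo(out);
            }
            String captchaAnswer = "what the user typed";

            // 3. POST the form fields back (names are placeholders from the page's HTML).
            List<NameValuePair> fields = Arrays.asList(
                    new BasicNameValuePair("username", "someUser"),
                    new BasicNameValuePair("password", "somePassword"),
                    new BasicNameValuePair("captcha", captchaAnswer));
            HttpPost post = new HttpPost("https://example.com/signup");
            post.setEntity(new UrlEncodedFormEntity(fields));
            try (CloseableHttpResponse result = client.execute(post)) {
                System.out.println(EntityUtils.toString(result.getEntity()));
            }
        }
    }
}
```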
