I'm doing some web scraping and using Jsoup to parse HTML files, and my understanding is that Jsoup doesn't work well with dynamic web pages. Is there a way to check whether a web page is dynamic, so that I don't bother attempting to parse it using Jsoup?
Short answer: Not really. You need to check case by case.
Explanation:
Today's websites are full of AJAX calls. Many load important data; others are only marginally interesting when you scrape a site's content. Many very modern sites even do both: they send a completely rendered page to the client, where it then gets transformed into a web app (keyword: isomorphic rendering).
So you need to check the site in question case by case. It is not that hard, though: just fire up curl and see if you get the content you need. If not, it is often also not that hard to understand the structure and parameters of the AJAX calls, and if you do that, you can often retrieve even dynamic content fine with only Jsoup.
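If you want to automate that check, here is a minimal sketch using Jsoup itself; the URL and CSS selector are placeholders for the page and element you actually care about. It fetches the raw HTML exactly as curl would, without executing any JavaScript, and tests whether the data is already there:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticContentCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical target URL and CSS selector; replace with the
        // page and the element that carries the data you need.
        String url = "https://example.com/article/123";
        String selector = "div.article-body";

        // Fetch the raw HTML without running any JavaScript.
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0")
                .get();

        // If the element is present and non-empty in the raw response,
        // the content is server-rendered and Jsoup alone is enough.
        String text = doc.select(selector).text();
        System.out.println(text.isEmpty()
                ? "Content missing: probably loaded via AJAX"
                : "Content present: static enough for Jsoup");
    }
}
```

If the selector comes back empty here but the content shows up in your browser, the page is filling it in with JavaScript.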
You cannot be 100% sure that a website is dynamic or static, because there are ways to hide the clues that show a website is dynamic, but you can check a limited number of HTTP response headers to test whether it's dynamic or static:
Cookie: an HTTP cookie previously sent by the server with Set-Cookie.
X-Csrf-Token: used to prevent cross-site request forgery. Alternative header names are X-CSRFToken and X-XSRF-TOKEN.
X-Powered-By: specifies the technology (e.g. ASP.NET, PHP, JBoss) supporting the web application (version details are often in X-Runtime, X-Version, or X-AspNet-Version).
As far as I know, these are three HTTP headers that only appear when server-side scripting is involved in generating the response.
Also, chances are that a web page with form-related elements has some server-side mechanism to process the form data.
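As a rough sketch of that header check, assuming Jsoup since that's what you're already using (the URL is a placeholder):

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class HeaderProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; replace with the site you want to test.
        Connection.Response res = Jsoup.connect("https://example.com/")
                .execute();

        // Each of these headers hints at server-side scripting.
        for (String header : new String[] {
                "Set-Cookie", "X-Csrf-Token", "X-Powered-By"}) {
            String value = res.header(header); // null if absent
            if (value != null) {
                System.out.println(header + ": " + value);
            }
        }
    }
}
```

Keep in mind that the absence of these headers proves nothing; they only give positive hints.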
Related
I have an app that I need search engine crawlers to be able to index.
I don't need the whole app to be indexed, just a specific URL (or URL pattern) like http://examplegwtapp.com/xyz where xyz is a hash code, similar to those URL shorteners out there.
My app works like this:
When that URL is accessed, the servlet will forward the request to a GWT app passing this fragment: app.html#View?hash=xyz
So the View page is totally dynamic.
The question is: what is the correct way to make this specific dynamically generated URL indexable and crawlable by search engines?
I would look into Making AJAX Applications Crawlable
and A proposal for making AJAX crawlable
In a nutshell these are the steps you should consider:
The crawler maps from the pretty URL to the ugly URL, i.e. from
http://examplegwtapp.com/app.html#View?hash=xyz to
http://examplegwtapp.com/app.html?_escaped_fragment_=hash=xyz
The crawler requests the ugly URL.
The server maps from the ugly URL back to the pretty URL. You can do this by identifying any _escaped_fragment_ request at the Apache level and redirecting it to a dedicated server controller that handles the crawler call, i.e. check Apache rewrite condition for ajax crawling.
The server invokes a headless browser (HtmlUnit), or, if only a small portion of your code is JS, just generates the static HTML with your server code, i.e. HtmlUnit Generate Page for GWT App. A sketch of the HtmlUnit variant follows below.
The headless browser's response is returned to the crawler.
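To illustrate the headless-browser step, here is a minimal sketch assuming a reasonably recent HtmlUnit and a hypothetical servlet that Apache's rewrite rule forwards the _escaped_fragment_ requests to (all names are made up for the sketch):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Hypothetical servlet that the Apache rewrite rule forwards
// _escaped_fragment_ requests to.
public class CrawlerSnapshotServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Rebuild the pretty URL from the escaped fragment,
        // e.g. "hash=xyz" becomes app.html#View?hash=xyz
        String fragment = req.getParameter("_escaped_fragment_");
        String prettyUrl = "http://examplegwtapp.com/app.html#View?" + fragment;

        // Let HtmlUnit load the page and execute the GWT JavaScript.
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage(prettyUrl);
            webClient.waitForBackgroundJavaScript(5000); // let async calls finish

            // Return the fully rendered HTML snapshot to the crawler.
            resp.setContentType("text/html");
            resp.getWriter().write(page.asXml());
        }
    }
}
```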
I'm developing a web app using JSTL and JavaScript in Eclipse Juno. I've been reading questions like How to set the JSTL variable value in javascript? and my code works fine even though Eclipse flags an error.
But... Is it a good practice to use JSTL and Javascript like this?
Does it cause a low performance in the time of rendering the webpage?
Can this be done in other way?
Is it a good practice to use JSTL and Javascript like this?
It is neither bad practice nor good practice. The bad practice would be using JSTL to control the flow of JavaScript, which is plainly wrong because JSTL runs on the server side and JavaScript on the client side.
Does it cause a low performance in the time of rendering the webpage?
JSTL only helps generate the HTML for the current view. JavaScript is not involved in HTML generation on the server side, only on the client side, unless you work with Node.js or similar technologies.
Can this be done in other way?
This depends on what you're doing. A common way to access data when loading a web page:
The Application Server (AS) receives a GET request on http://www.foo.com/bar.
The AS pre-processes the GET request (loads data from a database or another data source, does pre-calculations, etc.).
The AS creates the response for the GET request (applies the data to generate the HTML).
The AS sends the response to the client.
The browser client renders the HTML.
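As a minimal sketch of this first flow, with hypothetical names (a servlet that loads the data and forwards to a JSP, which applies it with JSTL):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical servlet handling GET http://www.foo.com/bar
public class BarServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Pre-process: load the data (stubbed here).
        List<String> items = Arrays.asList("one", "two", "three");
        req.setAttribute("items", items);

        // Generate the HTML: the JSP applies the data with JSTL,
        // e.g. <c:forEach items="${items}" var="item">...</c:forEach>
        req.getRequestDispatcher("/WEB-INF/bar.jsp").forward(req, resp);
        // The container then sends the rendered response to the client.
    }
}
```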
Another way to do it:
The Application Server (AS) receives a GET request on http://www.foo.com/bar.
The AS creates the response for the GET request (generates the HTML, which contains JavaScript functions that load the data in the onload event).
The AS sends the response to the client.
The browser client renders the HTML.
The onload event fires and loads the data through RESTful services. This way, the data interaction is handled on the client side only, but the data still comes from the server side.
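The server half of this second flow could be as small as a servlet that serves JSON for the onload call to consume (again, hypothetical names):

```java
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Hypothetical REST-style endpoint the page's onload handler calls,
// e.g. via XMLHttpRequest to /api/bar.
public class BarDataServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // Serve the data as JSON; the HTML itself was already delivered
        // as a mostly static page earlier in the flow.
        resp.setContentType("application/json");
        resp.setCharacterEncoding("UTF-8");
        resp.getWriter().write("{\"items\":[\"one\",\"two\",\"three\"]}");
    }
}
```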
These are two very simple alternatives for handling the same problem. Which one to choose and work with will depend entirely on your design; there's no definitive answer.
Nowadays many websites contain some content loaded by AJAX (e.g., comments on some video websites). Normally we can't crawl this data, and what we get is just some JS source code. So here is the question: in what ways can we execute the JavaScript code after we get the HTML response, and get to the final page we want?
I know that HtmlUnit has the ability to execute background JS, yet it has many bugs and errors. Are there any other tools that can help me with it?
Some people tell me that I can find the AJAX request URL, analyze its parameters, and send the request again so as to gain the data. If things can't work out in the way I mentioned above, can anyone tell me how to extract the AJAX URL and send the request in the correct format?
By the way, if the language is Java, that would be best.
Yes, Netwoof can crawl Ajax easily. Its API and bot builder let you do it without a line of code.
That's the great thing about HTTP: you don't even need Java. My go-to tool for debugging AJAX is the Chrome extension Postman. I start by looking at the request in the Chrome debugger and identifying the salient bits (URL, form-encoded params, etc.).
Then it can be as simple as opening a tab and launching requests at the server with Postman. As long as it's all in the same browser context, all of your cookies (for authentication, etc.) will be shipped along too.
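And since the asker prefers Java: once you've identified the endpoint and its parameters that way, replaying the call is straightforward. A rough sketch with Jsoup, where the URL, parameter, and header are placeholders for whatever the debugger showed you:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class AjaxReplay {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint spotted in the Chrome debugger.
        Connection.Response res = Jsoup
                .connect("https://example.com/api/comments")
                .data("videoId", "12345")            // param from the debugger
                .header("X-Requested-With", "XMLHttpRequest")
                .ignoreContentType(true)             // response is JSON, not HTML
                .method(Connection.Method.GET)
                .execute();

        // The raw JSON body, ready for a JSON parser of your choice.
        System.out.println(res.body());
    }
}
```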
I have a GWT application that runs on the server.
We are subscribed to a solution that pings this application at a regular interval.
The point is, this solution (service) checks the returned response from the server for some pre-defined keywords.
But as you know, GWT returns a plain, nearly empty HTML page, with the data contained in the .js file.
So the ping service will not be able to examine the pre-defined keywords. Is this statement true?
And if it is true, can't we find any workaround to solve such a problem?
Thanks.
The problem you are facing is related to the crawlability of AJAX applications - Google has some pointers for you :) Generally, you need a headless browser on the server to generate the output you'd normally see in the browser; for example, see HtmlUnit.
Only the initial container page and the loader script that it embeds are HTML and JS. Afterwards, you use GWT's RPC mechanism to exchange Java objects with the server, or AJAX (e.g. RequestBuilder) to exchange any kind of data with the server - you name it: JSON, XML, plain text, etc.
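For reference, the RequestBuilder path looks roughly like this on the GWT client side (the endpoint URL is a placeholder):

```java
import com.google.gwt.http.client.Request;
import com.google.gwt.http.client.RequestBuilder;
import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.http.client.RequestException;
import com.google.gwt.http.client.Response;

public class DataLoader {
    public void load() {
        // Hypothetical endpoint on your own server.
        RequestBuilder rb = new RequestBuilder(RequestBuilder.GET, "/data?hash=xyz");
        try {
            rb.sendRequest(null, new RequestCallback() {
                @Override
                public void onResponseReceived(Request request, Response response) {
                    // The payload can be JSON, XML, plain text, etc.
                    String payload = response.getText();
                    // ... update the UI with the payload ...
                }

                @Override
                public void onError(Request request, Throwable exception) {
                    // Handle the failed call.
                }
            });
        } catch (RequestException e) {
            // sendRequest can fail before the call is even made.
        }
    }
}
```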
I've often wanted to create applications that provide a simpler front end to other websites that require users to log in before the pages I want to use can be accessed. I was wondering:
(1) Can any website with a POST to an HTTP page be authenticated by POSTing
postField1name=pf1Value&postField2name=pf2Value
to the website? If that's true, how can you inspect the HTML to POST correctly?
(2) I wanted to know if you could parse HTML, say for a sign-up form, and display all the fields in an application UI, including downloading a CAPTCHA and displaying it to the user, allowing them to type in the value to send back to the website, and then process the response.
Also, if anyone knows how I might accomplish (2) using Apache HttpClient in Java, I'd greatly appreciate it!
http://hc.apache.org/httpcomponents-client/httpclient/index.html
(1) An easy way to find out what's actually being POSTed is to look at the actual HTTP requests. You can do that with a tool like LiveHTTPHeaders. Then have your script simulate that.
(2) Yes. You can use cURL, which is excellent for things like this.
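Since the question mentions Apache HttpClient, here is a minimal sketch of the login POST in Java (HttpClient 4.x; the URL is a placeholder and the field names are the ones from the question):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class LoginPost {
    public static void main(String[] args) throws Exception {
        // Field names must match what the site's login form actually posts;
        // these placeholders are taken from the question.
        List<NameValuePair> form = new ArrayList<>();
        form.add(new BasicNameValuePair("postField1name", "pf1Value"));
        form.add(new BasicNameValuePair("postField2name", "pf2Value"));

        HttpPost post = new HttpPost("https://example.com/login");
        post.setEntity(new UrlEncodedFormEntity(form));

        // Execute the POST; the client's cookie store captures the session cookie.
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(post)) {
            System.out.println(response.getStatusLine());
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

The default client keeps cookies between requests, so later GETs made through the same client instance stay authenticated.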
(1) Try Firebug. There are actually a lot of options for authentication.
(2) Try JTidy