Why can't search engines index Ajax sites directly? [closed] - java

I understand, as a GWT developer, that Ajax sites render pages dynamically. For instance, the site I made is a single page and contains tabs that render "pages" like "home", "about us", "products", etc.
However, those pages are usually addressed with a hash (#), so that if I access, say, http://example.com#HomePage or #Products, it will render the page and its contents "eventually"...
Now if I put a link to my products page on my crawlable static blog site, for example http://example.com#Products, and click through from that site, my site will eventually render the products after some Ajax calls.
However, if I check the "page source" of the site in the browser, the page is still the same HTML, "empty" of Ajax content. Is this the reason why Ajax sites can't be indexed? Do search engines not put the URLs they crawl into an HTML unit so they can render the page rather than just fetch the static page?
Anyway, I have seen workarounds for this issue that use an external "crawler" service as part of the Ajax site. Is there no solution that does not require setting up such an external service / server?

However, if I check the "page source" of the site in the browser, the page is still the same HTML, "empty" of Ajax content. Is this the reason why Ajax sites can't be indexed? Do search engines not put the URLs they crawl into an HTML unit so they can render the page rather than just fetch the static page?
Yes. Sites that depend on Ajax to pull in content are depending on JavaScript to pull in content, and search engine indexing bots do not (in general) execute JavaScript since:
It requires much more CPU/RAM to do so
It is very hard to determine which interactions will pull in new content and which will do other things
Anyway, I have seen workarounds for this issue that use an external "crawler" service as part of the Ajax site. Is there no solution that does not require setting up such an external service / server?
Don't depend on JavaScript in the first place. Build a site that works with regular links. Layer JavaScript on top if you want to. Use pushState and friends to update the address bar with the real URL when new content is pulled in.
In short, follow the principles of Progressive Enhancement and Unobtrusive JavaScript

The first thing you should know is that crawlers don't execute JavaScript on the page, but there is a way to make the page crawlable (to show the crawler that your application uses Ajax).
Example (Google crawler):
You should first indicate to the crawler that your site supports the Ajax crawling scheme by adding a special token to your application's Ajax links. After that, the crawler will transform those URLs and call your server with the transformed URL. The server should return an HTML snapshot (generated HTML) which represents the HTML content that is created when a user loads the page with Ajax in the browser. In the end, you can use the Fetch as Google tool to test what the Google crawler will receive when it calls your Ajax links. An in-depth explanation can be found here.
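For illustration only, here is a rough sketch (not GWT-specific) of what the server side of that scheme could look like as a plain Java servlet. When the crawler rewrites an #! URL into its ?_escaped_fragment_= form, the servlet returns a pre-rendered HTML snapshot; the renderSnapshot helper is a hypothetical placeholder for however you actually generate that snapshot.

// Minimal sketch: serve an HTML snapshot when the crawler requests the
// _escaped_fragment_ form of an #! URL; serve the normal SPA shell otherwise.
import java.io.IOException;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class SnapshotServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws ServletException, IOException {
        // The crawler turns http://example.com#!Products into
        // http://example.com?_escaped_fragment_=Products
        String fragment = req.getParameter("_escaped_fragment_");
        if (fragment != null) {
            // Return the pre-rendered HTML for that "page"
            String html = renderSnapshot(fragment);
            resp.setContentType("text/html;charset=UTF-8");
            resp.getWriter().write(html);
        } else {
            // Normal users get the regular single-page application shell
            req.getRequestDispatcher("/index.html").forward(req, resp);
        }
    }

    // Hypothetical placeholder: generate the snapshot however suits your app,
    // e.g. with a headless browser or by building the HTML server-side.
    private String renderSnapshot(String fragment) {
        return "<html><body><h1>" + fragment + "</h1></body></html>";
    }
}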
I don't work with GWT, but maybe you can find some GWT-specific solution here.

Related

Read full content of a web page in Java

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I want to crawl the data of the next pages, the source code is the same as for page one. Therefore a simple HTTP GET does not help at all.
This is the link for the page I need to crawl.
The website has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I therefore have a simple solution with PhantomJS, but it is complicated to run PhantomJS code from Java.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.
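For example, a minimal sketch in plain Java (no external HTTP library) that fetches one page from that endpoint and prints the raw JSON could look like the following. It assumes the endpoint will still answer a plain GET without the removed parameters, which you would need to verify.

// Fetch one page of results from the XHR endpoint and print the raw JSON.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class BlaBlaCarFetcher {
    public static void main(String[] args) throws Exception {
        int page = 1; // adjust this to crawl successive pages
        String url = "https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE"
                + "&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=" + page;

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest"); // mimic the browser's XHR

        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        System.out.println(json); // parse with any JSON library, e.g. Jackson or org.json
    }
}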

making dynamic ajax web application searchable

I have developed an Ajax web application that is constantly generating new dynamic pages with an ID (like http://www.enggheads.com/#!question/1419242644475) whenever someone adds a question on the website.
I have made my Ajax web application crawlable; I implemented it as this link recommends:
LINK: https://developers.google.com/webmasters/ajax-crawling/
I have tested that 'Fetch as Google' returns the HTML snapshots, and the Facebook developer tools also fetch the data accurately. I've submitted a sitemap with all the current URLs, but when we search, only some of the sitemap links show up in Google's search results, and Google refuses to index any of the Ajax links, although there are no crawl errors.
1--My question: What else do I have to do to get all the links of my application to show in Google's search results?
2--My question: As I explained above, this application generates new dynamic pages, so we would have to regenerate the sitemap each time (or at a set interval) someone adds a question on my site. Or is there another sensible way to handle this situation?
And I don't know how "Facebook.com", "Stackoverflow.com", and "in.linkedin.com" manage their sitemaps, if they use them at all...?
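As a purely illustrative sketch for the second question: assuming you can list the question IDs from your own data store, regenerating sitemap.xml whenever a question is added (or on a schedule) could look roughly like this. Check the crawling-scheme documentation for whether to list the #! form or the ?_escaped_fragment_= form of each URL.

// Rebuild sitemap.xml from the current list of question IDs.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SitemapWriter {

    public static void writeSitemap(List<String> questionIds, String file) throws IOException {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (String id : questionIds) {
            // Assumed URL pattern taken from the question above
            xml.append("  <url><loc>")
               .append("http://www.enggheads.com/#!question/").append(id)
               .append("</loc></url>\n");
        }
        xml.append("</urlset>\n");
        Files.write(Paths.get(file), xml.toString().getBytes(StandardCharsets.UTF_8));
    }
}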

scrape website multiple pages using Web Client java

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls JavaScript to navigate to the second page. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is that while you (a person) can read the JavaScript in the first page and see that it navigates to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify this.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit, with its headless JavaScript-enabled browser, to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
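If you would rather use HtmlUnit directly instead of that framework, a hedged sketch looks roughly like the following; the URL and the "Next" anchor text are hypothetical and depend entirely on the site's markup.

// Load the first page with JavaScript enabled, click the element that
// performs the navigation, and read the resulting page.
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NextPageScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage firstPage = webClient.getPage("http://example.com/list"); // illustrative URL
            // Let any background JavaScript finish before interacting
            webClient.waitForBackgroundJavaScript(5_000);

            // Hypothetical: the pager's "next" link, identified by its anchor text
            HtmlAnchor next = firstPage.getAnchorByText("Next");
            HtmlPage secondPage = next.click();

            System.out.println(secondPage.asXml());
        }
    }
}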

How do I get the contents of an ASPX file through Java?

In Java, is there any way to get the content of a webpage which is an .ASPX file?
I know how to read/write anything from a normal HTML page, but ASP pages seem to have one URL for multiple pages, so it's not really possible to reach the desired page by URL.
I understand you can't/won't give me complete instructions right here, but could you maybe send me in the right direction ?
Thanks in advance.
There is nothing special about ASPX pages compared to any other type of page; "plain" HTML pages could have been dynamically generated as well.
Just don't forget that the query string is also part of the URL. Many ASPX, PHP, etc. pages might not even be 'correct' to request without some query string value at all. And other sites don't have file extensions at all... like this site itself. You just have to be sure to get the entire URL for each unique 'page'.
I'm not an expert on .asp, so I might be wrong. However, my impression is that a .asp page should ultimately return HTML (similarly to what a .jsp page does), so you can fetch the content in the same way as you would do for an HTML page.
However, you write that
asp pages seem to have one URL for multiple pages
this makes me think that perhaps your .asp page is using AJAX and so the page content may change while the URL doesn't. Is this your case?
I understand that you are trying to read the aspx from a client PC, not from the server.
If that's right, accessing an HTTP resource is independent of the technology used by the server; all you need to do is open an HTTP request and retrieve the results.
If you see multiple pages from one URL, then one of the following is happening:
1) POST data is sent to the aspx, and it renders different HTML due to these parameters
2) You are not really looking at the inner page, but at a page that provides the frames for the HTML being rendered
3) The page uses Ajax heavily in order to be rendered. The "contents" of the page are not downloaded through the initial request but later by JavaScript.
Generally, it is probably the first reason.
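To make the first two cases concrete, a small sketch in plain Java can show that the same .aspx URL yields different HTML depending on the query string (GET) or on posted form data (POST); the URL and parameter names below are made up for illustration.

// Fetch the same .aspx page with a query string (GET) and with form data (POST).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class AspxFetcher {

    static String get(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        return readBody(conn);
    }

    static String post(String url, String formData) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formData.getBytes(StandardCharsets.UTF_8));
        }
        return readBody(conn);
    }

    private static String readBody(HttpURLConnection conn) throws Exception {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same page, different query string -> different content
        System.out.println(get("http://example.com/catalog.aspx?page=2"));
        // Same page, POST parameters -> different content
        System.out.println(post("http://example.com/catalog.aspx", "page=2&category=books"));
    }
}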

How to use Java to navigate a Web Search

I need to scrape French court cases for a project, but I can't figure out how to get Java to navigate the Court's search engine.
Here's the search page I need to manipulate. I want to start scraping the results page, but I can't get to that page from Java with just the URL. I need some way to have Java order the server to execute a search based on my date parameters (01/01/2003 - 30/06/2003), and then I can run the show by simply manipulating the URL I'm connecting to.
Any Suggestions?
First make sure the terms of service for the site allow this.
I would use HttpClient POSTs to send the data and get the results. Look at the form on the page, figure out which variables you need to emulate, and submit them with HttpClient. You should get back the results you are looking for. Also, this page has lots of JavaScript, so you need to figure out what it is doing; maybe it never submits the form but instead makes Ajax calls to update the page, but you may still be able to get the same results.
You can always install something like Fiddler and watch the HTTP traffic the page sends, then emulate that using HttpClient.
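As a hedged illustration of that approach with Apache HttpClient 4.x: the form's action URL and the date-field names below are guesses that you would replace with the real values captured from the network log or Fiddler.

// Replicate the search form's POST with the date-range parameters.
import java.util.ArrayList;
import java.util.List;
import org.apache.http.NameValuePair;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.util.EntityUtils;

public class CourtSearch {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpPost post = new HttpPost("http://example.org/search"); // the form's action URL (placeholder)
            List<NameValuePair> form = new ArrayList<>();
            form.add(new BasicNameValuePair("dateFrom", "01/01/2003")); // hypothetical field names
            form.add(new BasicNameValuePair("dateTo", "30/06/2003"));
            post.setEntity(new UrlEncodedFormEntity(form, "UTF-8"));

            try (CloseableHttpResponse response = client.execute(post)) {
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html); // the results page, ready to be parsed
            }
        }
    }
}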
