SEO, google crawl - java

I have implemented pagination with URLs like this:
http://myhost.com/product-2/213-1
This means there are 213 products in total and this is the first page.
When I check which pages Google has crawled on my website, I see results like:
http://myhost.com/product-2/213-1-2/144-0/144-1/144-14/125-1/125-12/125-1/151-15/108-10/131-1/134-13/140-14/140-1/118-11/126-1/126-12/110-1/270-27/270-1/270-27
This means Google is somehow appending all the page numbers to the end of the URL and crawling that URL. Could someone suggest how to stop this? In this particular case I want Google to crawl only one page that contains all the product information.

Use canonical URLs to tell Google which page is the one page you want to show in the search results.
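A minimal sketch of emitting the canonical tag from Java, assuming the page is rendered server-side (for example in a JSP) and that http://myhost.com/product-2 is the one URL you want indexed. The class and method names are hypothetical.

```java
public class CanonicalLink {

    // Escapes the characters that matter inside a double-quoted HTML attribute.
    static String escapeAttribute(String value) {
        return value.replace("&", "&amp;")
                    .replace("\"", "&quot;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;");
    }

    // Builds the <link> element to place inside <head> of every paginated variant.
    static String tag(String canonicalUrl) {
        return "<link rel=\"canonical\" href=\"" + escapeAttribute(canonicalUrl) + "\">";
    }

    public static void main(String[] args) {
        // Assumption: this is the single page that lists all products.
        System.out.println(tag("http://myhost.com/product-2"));
    }
}
```

Every paginated variant would output the same tag, so all of them point at the one page you want in the search results.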

That is weird. It looks like you are using relative links in your pagination and your URL router somehow accepts them without returning a 404. Instead it serves content because it interprets only part of the URL rather than the whole, so search engines can crawl these URLs.
Example:
You are linking to
next-side/
instead of
/path/to/next-side/
If you post a link, the community can try it out!
By the way, I wouldn't recommend changing the URL based on the number of items. Fixed URLs are much better, and the number of items is of no interest. Better to use something like /shop/category/subcategory/product.
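To illustrate the relative-link problem described above, here is a small sketch that resolves links the way a crawler would, using java.net.URI. The page URL and the class name are made up for the example; only next-side/ and /path/to/next-side/ come from the answer.

```java
import java.net.URI;

public class RelativeLinkDemo {

    // Resolves an href against the current page, as a browser or crawler would.
    static String resolve(String currentPage, String href) {
        return URI.create(currentPage).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://myhost.com/path/to/page/"; // hypothetical current page

        // Relative link: appended to the current path, so it grows on every hop.
        String hop1 = resolve(page, "next-side/");
        String hop2 = resolve(hop1, "next-side/");
        System.out.println(hop1);
        System.out.println(hop2);

        // Root-relative link: resolves to the same URL no matter where you are.
        System.out.println(resolve(hop2, "/path/to/next-side/"));
    }
}
```

If the router answers each of the growing URLs with content instead of a 404, the crawler keeps following them, which would explain the long URL in the question.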

In addition to the great answers you already got, you can use the rel="next" and rel="prev" pagination link elements
to let Google know that the next link is the next page in your list.
You can find more information on the Google Webmaster blog, which has a post and a video tutorial.
Both explain how to implement and use the pagination tags.
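As a rough sketch, the tags for a given page could be generated like this. The URL scheme (a prefix followed by the page number) and the class name are assumptions for illustration, not something taken from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class PaginationLinks {

    // Returns the <link> tags for page `page` of `totalPages` (both 1-based).
    // urlPrefix + page number is a hypothetical URL scheme.
    static List<String> tags(String urlPrefix, int page, int totalPages) {
        List<String> tags = new ArrayList<>();
        if (page > 1) {
            tags.add("<link rel=\"prev\" href=\"" + urlPrefix + (page - 1) + "\">");
        }
        if (page < totalPages) {
            tags.add("<link rel=\"next\" href=\"" + urlPrefix + (page + 1) + "\">");
        }
        return tags;
    }

    public static void main(String[] args) {
        // Page 2 of 5 gets both a prev and a next tag.
        tags("http://myhost.com/products/page/", 2, 5).forEach(System.out::println);
    }
}
```

The first page only gets a next tag and the last page only a prev tag.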

Related

How to get the Google's search result using Java

According to the answer here, we can use Gson to programmatically retrieve the results that Google returns for a query. However, two questions remain:
How can we do something similar for Bing?
How can we get more than 4 results based on the referenced answer? results.getResponseData().getResults().get(n).getUrl() throws an exception for n>4.
As @Niklas noted, the Google search API is deprecated, so you should not use it for your project. Currently the only solution is to fetch the HTML search results with an HTTP request and then parse them yourself.
In the case of Bing, there is a search API, but it allows only a limited number of calls for free users. If you need to make a lot of requests, then you will have to pay for it. https://datamarket.azure.com/dataset/5BA839F1-12CE-4CCE-BF57-A49D98D29A44

Wikipedia Page Id from URL

I am parsing a Wikipedia dump in Java. In my module I want to know the page IDs of the internal wiki pages that the current page refers to. Getting the internal links, and thus their URLs, is easy. But how do I get the page ID from a URL?
Do I have to use MediaWiki for this? If so, how?
Is there any other alternative?
For example: http://en.wikipedia.org/wiki/United_States
I want to get its page ID, i.e. 3434750.
You can use the API for that. Specifically, the query would look something like:
http://en.wikipedia.org/w/api.php?action=query&titles=United_States
(You can also specify more than one page title in the titles parameter, separated by |.)
As an alternative, you could download the page.sql dump (1 GB compressed for the English Wikipedia), which also contains this information. To actually query it, you could either import it into a MySQL database and then query that, or you could directly parse the SQL.
If you can't use the API, you can always get the page ID from the info page reached by appending ?action=info to the URL. It should make a better starting point for a parser.
For your example above: https://en.wikipedia.org/wiki/United_States?action=info
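A minimal sketch of the API approach in Java. It assumes you add format=json to the query and that the response contains a "pageid" field per page; the sample response below is shortened and only illustrates that assumed shape. The class and method names are hypothetical, and the regular expression is a shortcut; a JSON library such as Gson would be more robust.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiPageId {

    private static final Pattern PAGE_ID = Pattern.compile("\"pageid\"\\s*:\\s*(\\d+)");

    // Builds the query URL; format=json is added so the response is easy to parse.
    static String queryUrl(String title) {
        return "https://en.wikipedia.org/w/api.php?action=query&format=json&titles="
                + URLEncoder.encode(title, StandardCharsets.UTF_8);
    }

    // Returns the first "pageid" value found in the response body, or -1 if there is none.
    static long extractPageId(String json) {
        Matcher m = PAGE_ID.matcher(json);
        return m.find() ? Long.parseLong(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        System.out.println(queryUrl("United_States"));

        // Assumed response shape, shortened for illustration.
        String sample = "{\"query\":{\"pages\":{\"3434750\":"
                + "{\"pageid\":3434750,\"ns\":0,\"title\":\"United States\"}}}}";
        System.out.println(extractPageId(sample));
    }
}
```

In a real run you would fetch the body from queryUrl(...) over HTTP and pass it to extractPageId.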

Catching Google's search "Next Page" - Jsoup Webcrawler

I'm building a Java web crawler and I need to capture the "Next Page" link from a Google search that I request. I have been trying to find a pattern or a way to do this, but so far I couldn't find any clues.
Check out this picture:
You can check for yourself that the "Next Page" link is the same for every page number you hover over. The only thing that changes in the link is the "Start=(number)" part near the end. For every page of results, 10 is added to start, since that is the number of result links per page.
But the weird part is that this "default" link doesn't appear in the source code of the page when you ask the browser to show it. Maybe this has something to do with Google's indexing process, but I'm not sure, since I'm not an expert programmer yet, especially in web programming.
So, does anyone have any idea how I should solve this?
I would suggest that you use the jsoup.org library.
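Since the only part of the link that changes is the start offset, one option is to build the URL for each result page yourself instead of looking for the link in the page source. The sketch below assumes 10 results per page, as observed in the question, and a lowercase start parameter; the base URL and the class name are assumptions.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class GoogleResultPages {

    static final int RESULTS_PER_PAGE = 10; // as observed in the question

    // Builds the search URL for a zero-based page index by setting the start offset.
    static String pageUrl(String query, int pageIndex) {
        return "https://www.google.com/search?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&start=" + (pageIndex * RESULTS_PER_PAGE);
    }

    public static void main(String[] args) {
        // URLs for the first three result pages.
        for (int page = 0; page < 3; page++) {
            System.out.println(pageUrl("java web crawler", page));
        }
    }
}
```

Each generated URL can then be fetched with Jsoup.connect(url).get() and parsed like any other page.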

Receive JSON list from http://www.autorenlexikon.lu

I want to read a JSON list from a web service with Java. The web service returns a list of authors from Luxembourg, e.g. sorted by year. This is the website:
http://www.autorenlexikon.lu/page/periods/1919-1945/1/1/DEU/index.html
So far, I know that I can receive a JSON document with a request like this:
http://www.autorenlexikon.lu/mmp/json.document_list/DEU/0?search_since=1919&search_until=1945
But I only get the first 20 entries. How can I get the next 20 entries? I think the solution is in the JavaScript code of the website, but I am pretty new to JavaScript (and to JSON).
EDIT:
There isn't any official API.
I have already tried:
http://www.autorenlexikon.lu/mmp/json.document_list/DEU/0?pageSize=1000&search_since=1919&search_until=1945
http://www.autorenlexikon.lu/mmp/json.document_list/DEU/0?page_Size=1000&search_since=1919&search_until=1945
...and many more. How does the JavaScript code receive all the entries? Couldn't I copy this mechanism?
You should check their API and look for a parameter that lets you define the page or the range of results you want to get.
Edit: It seems you have to make a POST request and add the start index as well as the page size as POST parameters. For more information see @matthijs koevoets' answer.
It depends on how the web service has been coded; it has nothing to do with JSON specifically. In the results you can see it says
"pageSize":20,
You just have to figure out how to call the web service with a page size. It may not allow you to query it with a different page size. That's up to the web service API coded by their developers.
Their service seems to accept POST parameters only: sort=year&dir=asc&startIndex=0&results=100
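A sketch of such a POST request with java.net.http (Java 11+). The parameter names are the ones quoted above; that the POST goes to the same URL the question used for GET is an assumption, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class AuthorListRequest {

    // Assumption: the POST goes to the same URL the question used for GET.
    static final String ENDPOINT =
            "http://www.autorenlexikon.lu/mmp/json.document_list/DEU/0"
            + "?search_since=1919&search_until=1945";

    // Form body using the parameter names quoted in the answer.
    static String formBody(int startIndex, int results) {
        return "sort=year&dir=asc&startIndex=" + startIndex + "&results=" + results;
    }

    // Builds (but does not send) the form-encoded POST request.
    static HttpRequest buildRequest(int startIndex, int results) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(startIndex, results)))
                .build();
    }

    public static void main(String[] args) {
        // Requests for the first three blocks of 20 entries.
        for (int start = 0; start < 60; start += 20) {
            HttpRequest request = buildRequest(start, 20);
            System.out.println(request.method() + " " + formBody(start, 20));
        }
    }
}
```

To actually send a request, pass it to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and parse the returned JSON, increasing startIndex until no more entries come back.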

Fetching data from other sites and displaying it in our page?

Is there any way to get data from other sites and display it in our JSP pages dynamically?
See this URL: http://www.dictionary30.com/meaning/Misty
On that page, one block is titled
Wikipedia Meaning and Definition on 'Misty'
In that block they fetch the data from Wikipedia and display it on dictionary30.
Question:
How are they fetching the wiki data into their site?
I need to display data like that in my JSP page by fetching it from another site.
You can use URLConnection to read another site's data.
Or better, use JSoup, which will also parse specific data from the other site for you.
For your case:
Document document = Jsoup.parse(new URL("http://www.dictionary30.com/meaning/Misty"), 10000);
Element div = document.select("div[id=contentbox]").first();
System.out.println(div.html());
You can fetch data from the other site on your server side using URLConnection and provide this data to the JSP page.
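A minimal server-side sketch of the URLConnection approach. It assumes the remote page is UTF-8 encoded and uses 10-second timeouts; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class RemotePageFetcher {

    // Reads a stream to the end and decodes it as UTF-8 (assumed encoding).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    // Opens a connection to the given address and returns the response body.
    static String fetch(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);
        try (InputStream in = connection.getInputStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        String html = fetch("http://www.dictionary30.com/meaning/Misty");
        System.out.println(html.length() + " characters fetched");
    }
}
```

In a servlet you could store the fetched string as a request attribute and print it from the JSP.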
Do make sure you get permission first from the site owners before doing anything like that.
Most people don't take kindly to their data being leeched by others, especially as it costs them money and doesn't generate any (advertising) income.
It's also very risky in that your own site/application will quickly fail as soon as the site you're leeching from gets changed to a different layout.
