Web scraping across web pages using JSoup - Java

I've made a web scraper to scrape pieces of information from IMDb. It traverses pages by swapping the number in the URL for a different random one, then repeats the scraping process on the new page.
http://www.imdb.com/title/tt0800369/ <-- changing this number loads a new movie.
How can I do this on the BFI website? I can't see a way to go from film to film.
Thanks in advance!

Following randomly generated links is not the most efficient way to traverse the web.
You really should follow URLs found on other pages. You can use crawler4j, which seems to be the easiest Java crawler to start with. There are also some alternatives.
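A minimal sketch of that idea with JSoup (which the question already uses): instead of guessing title IDs, extract the film links actually present on a page and follow them. The starting URL comes from the question; the `a[href*=/title/tt]` selector is an assumption about IMDb's markup and would need adapting for the BFI site.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class FilmLinkExtractor {

    /** Collect absolute URLs of film pages (/title/tt...) linked from the given document. */
    static List<String> filmLinks(Document doc) {
        List<String> links = new ArrayList<>();
        for (Element a : doc.select("a[href*=/title/tt]")) {
            links.add(a.absUrl("href")); // resolve relative hrefs against the page's base URL
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Start from one known film page and follow the links found there
        Document doc = Jsoup.connect("http://www.imdb.com/title/tt0800369/").get();
        for (String next : filmLinks(doc)) {
            System.out.println(next);
            // Jsoup.connect(next).get() ... repeat the scraping on each discovered page
        }
    }
}
```

A real crawler would also keep a visited set and a frontier queue, which is essentially what crawler4j manages for you.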

Related

Will the Jaunt web scraper be capable of scraping this JavaScript site?

I have never done web scraping before; in fact, just three hours ago I googled the term "web scraping" to see what it means, so that is my level of competence on the subject. My task is to scrape some numbers for different football matches from the website "betstars.uk", which from what I can see is a JavaScript site (is it?), and that makes an already hard task even harder. Can the Jaunt tool for Java do this job, or do I need something else? I ask because I want to avoid spending more than an hour learning the tool just to find out it can't do the job.
For some reason I cannot load the website, so I can't tell you whether it uses JavaScript to load its content or not.
It's impossible to scrape a JavaScript-based website with Jaunt, because it is a basic web-scraping library and doesn't execute JavaScript at all. However, if the site does use JavaScript, you could use HtmlUnit to load the JavaScript content and scrape the information you need.
Here is an easy tutorial on How to Scrape Javascript in Java
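A rough sketch of the HtmlUnit approach the answer suggests: let its headless browser run the page's scripts, then read the resulting DOM. The URL and the 5-second wait are illustrative assumptions, not values from the original question.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class JsScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setJavaScriptEnabled(true);
            // Real-world pages often have script errors; don't abort on them
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = client.getPage("https://betstars.uk/");
            // Give asynchronous (XHR-loaded) content time to arrive
            client.waitForBackgroundJavaScript(5_000);

            // The DOM after scripts have run, which a plain scraper never sees
            System.out.println(page.asXml());
        }
    }
}
```

From there you can query the rendered DOM with `page.querySelectorAll(...)` or XPath, the same way you would with a static page.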

Remove the .jsp ending with Tomcat and nginx

I would like www.example.com/test.jsp to be served as www.example.com/test.
I suspect URL rewriting and the like would be too slow. Are there any alternatives? For example, could I use the JSP files, or only servlets, when they work with Java?
I'm looking for a good solution in terms of both performance and Google ranking. The website has 200 pages and is growing, so I can't do it manually for every page.
I googled but didn't find a good answer.
I can't believe you wrote a website with 200+ separate JSP pages. Consider changing the site architecture: for example, if you have an online store with many pages of the same type, you could write a single JSP content page as a template and use a RESTful architecture to build the real page content.
In the nginx configuration, you can rewrite the URL when you do the proxy pass.
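A minimal sketch of that nginx rewrite, assuming Tomcat listens on port 8080 behind the proxy (the host name and port are assumptions, not from the question):

```nginx
server {
    listen 80;
    server_name www.example.com;

    location / {
        # 301 old .jsp URLs to the clean form, so Google consolidates ranking
        rewrite ^(/.+)\.jsp$ $1 permanent;

        # Internally map the clean URL back to the .jsp that Tomcat serves
        # ("break" keeps this rewrite server-side; the client never sees .jsp)
        rewrite ^(/[^.]+)$ $1.jsp break;

        proxy_pass http://127.0.0.1:8080;
    }
}
```

Both rewrites are simple regex matches evaluated per request, so the performance cost is negligible compared with the proxied Tomcat round trip.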

making dynamic ajax web application searchable

I have developed an AJAX web application that constantly generates new dynamic pages with an ID (like http://www.enggheads.com/#!question/1419242644475) when someone adds a question on the website.
I have made my AJAX web application crawlable, implementing it as this link recommends:
LINK: https://developers.google.com/webmasters/ajax-crawling/
I have tested that 'Fetch as Google' returns the HTML snapshots, and the Facebook developer tools also fetch the data accurately. I've submitted a sitemap with all the current URLs, but when we search, only some of the sitemap links show up in Google search results, and Google refuses to index any of the AJAX links, although there are no crawl errors.
1 -- My question: What else do I have to do to get all the links of my application to show in Google search results?
2 -- My question: As explained above, this application generates new dynamic pages, so do we have to regenerate the sitemap each time (or at a set interval) someone adds a question on my site? Or is there another significant way to handle this situation?
I also don't know how "Facebook.com", "Stackoverflow.com", and "in.linkedin.com" manage their sitemaps, if they use them at all...???

How to make dynamic website searchable by search engine

I have developed a dynamic website using technologies like AJAX and Java that constantly generates new pages with an ID (like http://www.enggheads.com/#!question/1419242644475), much like stackoverflow.com, but my site's pages are not searchable by Google or any other search engine.
I want my pages to show up in search results. How can I achieve this? I have not submitted any sitemap to Google Webmaster Tools. Is a sitemap really the right solution...??? That would mean regenerating the sitemap each time (or at a set interval) someone adds a question on my website.
I'm really confused about how search engines find dynamically created pages like Stack Overflow questions and Facebook profiles.
Look up how meta tags work. Every dynamic page should have its own set of tags and its own description.
Also, it takes time for Google to index your pages.
Another reason why your website isn't shown in the results is that your keywords are too common. Google indexes websites based on keywords mentioned in the meta tags. If they are very common, other popular sites will be ranked above yours, so your site doesn't appear in the top results.
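For illustration, per-page tags of the kind the answer describes might look like this (the title and description text here are made up, not taken from the site in question):

```html
<head>
  <!-- Unique per dynamic page: search engines show these in result snippets -->
  <title>How do I follow links between film pages? - EnggHeads</title>
  <meta name="description"
        content="Question about traversing film pages when scraping a site with Java.">
</head>
```

The key point is that each generated question page gets its own title and description rather than sharing one static set for the whole application.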
Google also takes into consideration the popularity of your website; this is often called "juice". Your website's juice increases and decreases based on how old your site is and how many relevant links point to and from your website.
All the points I mentioned are just a few of the things that fall under the heading of search engine optimization.
SEO is a massive topic, and you will only learn it gradually as your website grows.
On the other hand, if you want Google to push your results to the top, you can pay Google to do so: Google runs the biggest advertising business.
This is because search engines cannot crawl URLs containing #! (the fragment after # is never sent to the server). So you should rewrite your URLs. This page can help you do that: http://weblogs.asp.net/scottgu/tip-trick-url-rewriting-with-asp-net
First of all, to be indexed by Google, Google must first FIND the URL. The best way to be found is to have many backlinks (popularity); otherwise you have to submit a sitemap or the URLs to the search engines.
Unfortunately, the query "inurl:#!" gives zero results on Google, so Luiggi Mendoza is right about it.
You can try rewriting the URLs using .htaccess to make them SEO-friendly.
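A hypothetical sketch of such an .htaccess rule, assuming an Apache front end and a handler script (`question.jsp` here is an invented name): crawlable paths like /question/1419242644475 replace the #! form, which the server never sees.

```apache
RewriteEngine On
# Map the clean, crawlable path to the real handler, keeping the ID
RewriteRule ^question/([0-9]+)$ /question.jsp?id=$1 [L,QSA]
```

The application would then emit links in the /question/ID form so crawlers can discover and index each page directly.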

Scrape multiple pages of a website using WebClient in Java

I am trying to scrape a website using WebClient. I am able to get the data on the first page and parse it, but I do not know how to read the data on the second page; the website calls JavaScript to navigate to it. Can anyone suggest how I can get the data from the next pages?
Thanks in advance
The problem you're going to have is while you (a person) can read the JavaScript in the first page and see it is navigating to another page, having the computer do this is going to be hard.
If you could identify the block of code performing the navigation, you would then need to execute it in such a way that allowed your program to extract the URL. This again is going to be very specific to the structure of the JavaScript and would require a person to identify this.
In short, I think you're dead in the water with this one, though it serves as a good example of why the Unobtrusive JavaScript concept is so important.
This framework integrates HtmlUnit with its headless, JavaScript-enabled browser to fully support scraping multiple pages in the same WebClient session: https://github.com/subes/invesdwin-webproxy
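With plain HtmlUnit, clicking the JavaScript-driven navigation element is often enough, since `click()` executes the attached handler and returns the resulting page. The URL and the "Next" link text below are assumptions for illustration; the XPath would need adapting to the actual site.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class NextPageScraper {
    public static void main(String[] args) throws Exception {
        try (WebClient client = new WebClient()) {
            client.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage first = client.getPage("https://example.com/results");

            // Find the pagination link; clicking it runs its JavaScript
            // handler and hands back whatever page the script navigates to
            HtmlAnchor next = first.getFirstByXPath("//a[contains(text(),'Next')]");
            HtmlPage second = next.click();

            System.out.println(second.asXml()); // scrape page two from here
        }
    }
}
```

Because both pages come from the same WebClient, cookies and session state carry over automatically between requests.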
