I am building a small online platform that aggregates news from different news websites.
I am using Java for my website and the jsoup parser for parsing HTML.
I am using a simple approach: first I download the HTML pages to a local folder, then I extract content from them with selectors and filters, e.g. doc.select("img").
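(For reference, a minimal sketch of that local-file approach with jsoup; the file path is a placeholder and would point into the download folder.)

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;
import java.io.IOException;

public class LocalPageExtractor {
    public static void main(String[] args) throws IOException {
        // Parse a previously downloaded page from the local folder
        Document doc = Jsoup.parse(new File("downloads/article.html"), "UTF-8");

        // Extract all images via a CSS selector
        for (Element img : doc.select("img")) {
            System.out.println(img.attr("src"));
        }
    }
}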
But if the design of a target website changes, the parser will stop working.
Another big problem is that I need a different parser for each website.
I want to build a parser that works something like Google News or the HTC news feed.
Can anyone point me in the right direction?
What is the difference between a web crawler and a parser?
In Java there are different names for fetching libraries. For example, Nutch is called a crawler and jsoup is called a parser.
Do they serve the same purpose?
Are they interchangeable for the job?
thanks
The jsoup library is a Java library for working with real-world HTML. It can fetch and parse HTML. However, it is not a web crawler as such, since it only fetches one page at a time (unless you write a custom program, i.e. a crawler, that uses jsoup to fetch a page, extract its URLs, and fetch those in turn).
A web crawler uses an HTML parser to extract URLs from a previously fetched page and adds the newly discovered URLs to its frontier.
A general sequence diagram of a Web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?
To summarize it:
An HTML parser is a necessary component of a web crawler for parsing and extracting URLs from a given HTML input. However, an HTML parser alone is not a web crawler, as it lacks necessary features such as maintaining previously visited URLs, politeness, etc.
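To make that relationship concrete, here is a deliberately minimal, illustrative crawl loop built around jsoup. It is not a production crawler (no politeness delays, robots.txt handling, or error handling), and the seed URL is a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class MiniCrawler {
    public static void main(String[] args) throws Exception {
        Deque<String> frontier = new ArrayDeque<>(); // URLs still to fetch
        Set<String> visited = new HashSet<>();       // previously visited URLs
        frontier.add("https://example.com/");

        while (!frontier.isEmpty() && visited.size() < 10) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;         // skip URLs we have already seen

            Document doc = Jsoup.connect(url).get(); // fetch and parse one page
            for (Element link : doc.select("a[href]")) {
                frontier.add(link.absUrl("href"));   // add discovered URLs to the frontier
            }
        }
    }
}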
This is easily answered by looking it up on Wikipedia:
A parser is a software component that takes input data (frequently text) and builds a data structure.
https://en.wikipedia.org/wiki/Parsing#Computer_languages
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
https://en.wikipedia.org/wiki/Web_crawler
I have a requirement to read data from different websites using the Jsoup HTML parser (e.g. name, city, state, zip, etc.). I am able to read the data from one website, but my code should be reusable for other websites, and on those websites the elements and their positions differ from the first one. How can I achieve this? Please suggest a pattern or some examples. Thanks.
I'll get this straight for you: there is NO WAY to build a general parser that scrapes all websites. I've worked at a company where I had to scrape 30 websites, and I literally had to write one for every website.
You can, however, create general utility classes that help you process the data you have parsed.
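For example, one common way to do this (hypothetical class and field names, assuming jsoup) is to keep only the per-site CSS selectors in configuration and reuse a single extraction routine:

import org.jsoup.nodes.Document;

import java.util.HashMap;
import java.util.Map;

// Hypothetical reusable extractor: only the selector map differs per site.
public class FieldExtractor {
    private final Map<String, String> selectors; // field name -> CSS selector for one site

    public FieldExtractor(Map<String, String> selectors) {
        this.selectors = selectors;
    }

    public Map<String, String> extract(Document doc) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : selectors.entrySet()) {
            result.put(e.getKey(), doc.select(e.getValue()).text());
        }
        return result;
    }
}

// Usage (selectors are made up): new FieldExtractor(Map.of("name", "h1.profile-name", "zip", "span.zip")).extract(doc);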
I am creating a news app and have the URL of an article's page, e.g. http://www.bbc.co.uk/news/technology-33379571, and I need a way to extract the content of the article.
I have tried jsoup, but that gives all the HTML tags; there is a <main-article-body> element, but it only contains the link to the article whose content I am trying to extract. I know Boilerpipe does exactly this, but it doesn't work on Android. I am really stuck with this problem.
Any help will be much appreciated.
I have worked on a few data-extraction applications in .NET (C#) and have used regular expressions to extract content from news websites.
The basic idea is to first extract all the a href links (as needed), then fetch the detail content by making a web request, and finally use regular expressions to extract the news body.
Note: a drawback of this approach is that you will need to change your regular expressions whenever the source site changes.
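A rough Java equivalent of that flow, shown only as an illustration (the listing URL, the selector, and the regular expression are placeholders that would have to be adapted per site):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNewsExtractor {
    public static void main(String[] args) throws Exception {
        // Step 1: collect article links from a listing page
        Document listing = Jsoup.connect("https://news.example.com/").get();
        for (Element link : listing.select("a[href]")) {
            String articleUrl = link.absUrl("href");

            // Step 2: fetch the article's raw HTML
            String html = Jsoup.connect(articleUrl).execute().body();

            // Step 3: pull the body out with a site-specific regular expression
            Matcher m = Pattern.compile("<div class=\"article-body\">(.*?)</div>", Pattern.DOTALL)
                               .matcher(html);
            if (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }
}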
I am trying to crawl data from web pages such as Amazon, but I am only interested in the price of the products. When I try to crawl a lot of data, it takes far too long to download the full HTML documents. So I would like to download only the part of the HTML document where the price appears (for example, the first 300 KB). It would be even better to download only a part from the middle of the document if that is possible, but a way to download only a specific number of bytes would be enough. I am using Jsoup to crawl the data. It would be great if someone is able and willing to help me :)
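One way this can be approached with jsoup is its Connection.maxBodySize option, which caps how many bytes of the response body are read (only from the start of the response, not from the middle). A minimal sketch, with a placeholder URL and selector:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PartialDownload {
    public static void main(String[] args) throws Exception {
        // Read at most ~300 KB of the response body; the rest of the page is not downloaded.
        Document doc = Jsoup.connect("https://www.amazon.com/dp/EXAMPLE")
                            .maxBodySize(300 * 1024)
                            .get();

        // The selector is a placeholder; the real price element differs per page layout.
        System.out.println(doc.select("#priceblock_ourprice").text());
    }
}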
I want to create something like this (code is here):
in PDF format. I'm using Google Charts and, according to this forum, converting a chart to PDF is impossible. I've already tried iText + XMLWorker, but there are problems with CSS, and I think JS isn't supported at all.
So the questions are: how can I convert HTML+CSS+JS to a .pdf file? Or are there other options for this?
As promised in the comment, I've asked Raf. This was his answer:
One way to use XML Worker for HTML+CSS+JS is to use a browser engine to preprocess the HTML. Examples of such a browser engine are WebKit (Chrome, Safari) and Gecko (Firefox). These can interpret the CSS and JS and give you HTML that is ready to be parsed by XML Worker.
Examples of competing products are:
wkhtmltopdf, a command line tool that uses WebKit as its rendering engine.
Prince XML, which supports HTML+CSS+JS to PDF using its own engine.
Maybe there are others, but this is what Raf told me. I hope this helps.
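As an illustration of the wkhtmltopdf route, the command-line tool can be driven from Java via ProcessBuilder (this assumes wkhtmltopdf is installed and on the PATH; the input URL and output file name are placeholders):

import java.io.IOException;

public class HtmlToPdf {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Render the page (HTML+CSS+JS) with WebKit and write the result as a PDF.
        Process p = new ProcessBuilder(
                "wkhtmltopdf",
                "https://example.com/chart.html", // placeholder input URL
                "chart.pdf")                      // output file
                .inheritIO()
                .start();
        int exitCode = p.waitFor();
        System.out.println("wkhtmltopdf exited with code " + exitCode);
    }
}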