What is the difference between a web crawler and a parser?
In Java there are several libraries for fetching web content. For example, Nutch is described as a crawler and jsoup as a parser.
Do they serve the same purpose?
Are they interchangeable for the job?
Thanks
The jsoup library is a Java library for working with real-world HTML. It can fetch and parse HTML. However, it is not a web crawler in general, because it fetches only one page at a time; to crawl, you would have to write a custom program (a crawler) around jsoup that fetches a page, extracts its URLs, and fetches those in turn.
A web crawler uses an HTML parser to extract URLs from a previously fetched page and adds the newly discovered URLs to its frontier.
A general sequence diagram of a web crawler can be found in this answer: What sequence of steps does crawler4j follow to fetch data?
To summarize:
An HTML parser is a necessary component of a web crawler, used for parsing and extracting URLs from given HTML input. However, an HTML parser alone is not a web crawler, as it lacks necessary features such as tracking previously visited URLs, politeness, etc.
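The division of labour described above can be sketched in a few lines. This is only an illustration, not how Nutch or crawler4j is implemented: the class name MiniCrawler, the method names, and the fetcher parameter are all hypothetical, and a regular expression stands in for a real HTML parser such as jsoup so that the example needs nothing beyond the JDK. Politeness (delays, robots.txt) and URL normalization are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names are illustrative and not taken from any library.
final class MiniCrawler {

    // Assumption: a regex is a crude stand-in for a real HTML parser (e.g. jsoup).
    private static final Pattern HREF =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // The "parser" role: HTML in, list of link targets out. No fetching, no state.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // The "crawler" role: owns the frontier and the visited set and drives the loop.
    // fetcher maps a URL to its HTML, or returns null if the page is unavailable.
    static List<String> crawl(String seed, Function<String, String> fetcher, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        List<String> fetchedOrder = new ArrayList<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && fetchedOrder.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already processed this URL
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // nothing to parse
            }
            fetchedOrder.add(url);
            for (String link : extractLinks(html)) {
                if (!visited.contains(link)) {
                    frontier.add(link); // newly discovered URL joins the frontier
                }
            }
        }
        return fetchedOrder;
    }

    public static void main(String[] args) {
        // An in-memory "web" so the example runs without network access.
        Map<String, String> fakeWeb = Map.of(
                "a", "<a href=\"b\">B</a> <a href=\"c\">C</a>",
                "b", "<a href=\"a\">back</a> <a href=\"c\">C</a>",
                "c", "<p>no links</p>");
        System.out.println(crawl("a", fakeWeb::get, 10));
    }
}
```

In a real setup the fetcher would perform an HTTP request and extractLinks would delegate to the parser library; the loop, the frontier, and the visited set are what make it a crawler.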
This is easily answered by looking it up on Wikipedia:
A parser is a software component that takes input data (frequently
text) and builds a data structure
https://en.wikipedia.org/wiki/Parsing#Computer_languages
A Web crawler, sometimes called a spider or spiderbot and often
shortened to crawler, is an [Internet bot] that systematically browses
the World Wide Web, typically for the purpose of Web indexing (web
spidering).
https://en.wikipedia.org/wiki/Web_crawler
Related
I need a way to generate an HTML interface (a form) from a WSDL, in order to submit web service requests. The request submission is done by server-side code. The user fills out the form and posts the data.
I am looking for a Java library that might help me write the code.
I'm not trying to create Java classes for the web service; I have to generate form fields for any WSDL URL.
According to MikeC, http://www.soapclient.com/soaptest.html is a tool that creates HTML forms from WSDL documents. Unfortunately, it's not a Java library, and it also has at least one limitation: no multidimensional array support.
But with a little effort you should be able to write your own parser/transformer for your specific use case. See also How to parse WSDL in Java? for more information about WSDL parsers for Java.
XSLT is another option: http://www.ibm.com/developerworks/library/ws-xsltwsdl/.
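As a rough idea of what such a hand-written transformer could look like, here is a minimal sketch that uses only the JDK's DOM parser. The class WsdlFormSketch, its method names, and the mapping from XSD types to input types are assumptions made for illustration. It only handles flat schema elements that carry both a name and a type attribute; nested complex types, arrays, imports, and message/part resolution are ignored.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch: class and method names are illustrative only.
final class WsdlFormSketch {
    private static final String XSD_NS = "http://www.w3.org/2001/XMLSchema";

    // Builds one <input> per typed xs:element found anywhere in the WSDL.
    static String toHtmlForm(String wsdlXml, String actionUrl) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match the XSD namespace
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(wsdlXml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse WSDL", e);
        }
        StringBuilder html = new StringBuilder();
        html.append("<form method=\"post\" action=\"").append(escape(actionUrl)).append("\">\n");
        NodeList elements = doc.getElementsByTagNameNS(XSD_NS, "element");
        for (int i = 0; i < elements.getLength(); i++) {
            Element e = (Element) elements.item(i);
            String name = e.getAttribute("name");
            String type = e.getAttribute("type");
            if (name.isEmpty() || type.isEmpty()) {
                continue; // wrapper element without a simple type: skipped in this sketch
            }
            html.append("  <label>").append(escape(name))
                .append(" <input type=\"").append(inputTypeFor(type))
                .append("\" name=\"").append(escape(name)).append("\"></label>\n");
        }
        html.append("  <button type=\"submit\">Send</button>\n</form>\n");
        return html.toString();
    }

    // Assumed mapping from XSD built-in types to HTML input types.
    static String inputTypeFor(String xsdType) {
        String local = xsdType.substring(xsdType.indexOf(':') + 1); // drop the prefix, if any
        switch (local) {
            case "int": case "integer": case "long": case "short":
            case "decimal": case "double": case "float":
                return "number";
            case "boolean":
                return "checkbox";
            case "date":
                return "date";
            default:
                return "text";
        }
    }

    // Escapes text for use in HTML content and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String wsdl = "<definitions xmlns=\"http://schemas.xmlsoap.org/wsdl/\""
                + " xmlns:xs=\"http://www.w3.org/2001/XMLSchema\"><types><xs:schema>"
                + "<xs:element name=\"symbol\" type=\"xs:string\"/>"
                + "</xs:schema></types></definitions>";
        System.out.println(toHtmlForm(wsdl, "/submit"));
    }
}
```

The server-side handler that receives the posted form would still have to build the SOAP request itself. If the WSDL comes from an untrusted source, the XML parser should also be hardened against external entities before use.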
I would like to generate static HTML5 pages, defining my own tags and rendering more complex HTML in the generated pages.
I like the Polymer architecture approach of defining new sets of tags.
When I want to generate my pages, I'm not in a browser, so I can use Java or NodeJS engines to compute the final HTML pages (from a console for instance).
To sum up, I want to define my own tag libraries using the Polymer approach, code some HTML using those new tags, and "print" the resulting DOM to a static HTML file, running all that from a console program (using Java or NodeJS).
Does somebody know how to do so?
It seems I need a DOM interpreter. I know that in Java I can use jsoup, but it probably lacks a JavaScript interpreter. Can NodeJS do this more simply?
I am building a small online platform. It includes news from different news websites.
I am using Java for my website and the jsoup parser for parsing HTML.
My approach is simple: first I download the HTML pages to a local folder, and then I extract content from them with selectors and filters, e.g. doc.select("img").
But if the design of a target website changes, this will stop working.
Another big problem is that each website needs a different parser.
I want to make a parser something like Google News or the HTC news feed.
Can anyone point me in the right direction?
Are there any Java APIs (or web services) to beautify/format/pretty-print HTML, XML and .java source code files?
Basically, I'm indexing these various types of files in an Apache Solr server and then fetching them when a user searches. I'm using Solr Cell for this. (It's like a grepcode application.)
The problem is that the file content comes back as plain text without any formatting (as the Solr field type is 'text' in the schema).
I'm looking for any APIs (in Java) or web services that I can hook into my application to convert the text (a String in Java) to well-formatted output to show on a web page.
I'd appreciate any pointers.
Thanks!
I found the solution: the HTML <pre> tag does the trick.
All I wanted was to preserve the whitespace while displaying the content (a string).
Thanks.
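A minimal sketch of that approach, assuming the tag in question is <pre> and that the content is a plain Java String. The class and method names are made up for illustration. The text is escaped first, since source code usually contains characters such as < and & that the browser would otherwise interpret as markup.

```java
// Hypothetical helper: names are illustrative, not part of Solr or any library.
final class PreformattedHtml {

    // Escapes the characters that have special meaning in HTML.
    static String escapeHtml(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    // Wraps the escaped text in <pre> so spaces, tabs and line breaks are shown as-is.
    static String toPre(String text) {
        return "<pre>" + escapeHtml(text) + "</pre>";
    }

    public static void main(String[] args) {
        System.out.println(toPre("if (a < b) {\n    x();\n}"));
    }
}
```

This only preserves the whitespace that is already in the stored text; it does not re-indent or pretty-print anything.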
Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I am not sure how to ask it.
Please try to understand my actual requirement.
The case is simple: I have to provide HTML files that the client can access in the Swing application, and for the search facility I decided to use Apache Lucene indexing. This gives me search, but now I want to display the HTML files that match the search criteria. I am using Swing, and JEditorPane is the control in which I have to display the contents of the HTML file. Please suggest how I should index the HTML files and how I should get their content back from the Lucene index.
The HTML files contain not only text but also links, images, etc.
Thanks in advance; hoping for your help.
Regards
In one of our projects where we employed Lucene for full-text indexing and search, we handled HTML files as follows:
Stored the HTML document as-is on disk (you can store it in the DB as well).
Using the Jericho HTML Parser's HTML-to-text converter, we extracted the text, links, etc. from the HTML documents.
Each Lucene document has fields that store metadata about the HTML file, in addition to the text content of the HTML in tokenized form.
Used StandardAnalyzer to keep certain tokens, such as email addresses and website links, as-is during tokenization before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria. So, we were able to identify the HTML content to be displayed for a given search result.
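The shape of the steps above can be illustrated without Lucene at all. The sketch below is a plain-JDK stand-in, not Lucene code: a HashMap plays the role of the index, a regex-based tag stripper plays the role of the Jericho converter, and the file path is the only stored metadata. All names (HtmlSearchSketch, extractText, add, search) are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the idea only; a real project would use Lucene's own API.
final class HtmlSearchSketch {

    // token -> paths of the HTML files whose visible text contains that token
    private final Map<String, Set<String>> index = new HashMap<>();

    // Assumption: a crude tag stripper stands in for a proper HTML-to-text converter.
    static String extractText(String html) {
        return html
                .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ") // drop script/style bodies
                .replaceAll("(?s)<[^>]*>", " ");                         // drop remaining tags
    }

    // Indexes the extracted text, keeping only the path as metadata.
    // The original HTML file stays untouched on disk.
    void add(String path, String html) {
        String text = extractText(html).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(path);
            }
        }
    }

    // Returns the paths of matching files; the caller loads the HTML from there.
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        HtmlSearchSketch sketch = new HtmlSearchSketch();
        sketch.add("docs/a.html", "<html><body><h1>Hello</h1><img src='x.png'></body></html>");
        System.out.println(sketch.search("hello"));
    }
}
```

Because a hit yields the path of the original file rather than its stripped text, a Swing client could hand that location to JEditorPane (for example through its setPage method) and have links and images rendered from the file itself.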
HTH.