Crawler4j vs. Jsoup for crawling and parsing pages in Java

I want to get the content of a page and extract specific parts of it. As far as I know, there are at least two solutions for such a task: Crawler4j and Jsoup.
Both of them are capable of retrieving the content of a page and extracting sub-parts of it. The only thing I'm not sure about is the difference between them. There is a similar question, which is marked as answered:
Crawler4j is a crawler, Jsoup is a parser.
But I just checked: Jsoup is also capable of fetching a page in addition to its parsing functionality, while Crawler4j is capable of not only crawling pages but also parsing their content.
What is the difference between Crawler4j and Jsoup?

Crawling is something bigger than just retrieving the contents of a single URI. If you just want to retrieve the content of some pages, then there is no real benefit in using something like Crawler4j.
Let's take a look at an example. Assume you want to crawl a website. The requirements would be:
Give a base URI (the home page).
Take all the URIs from each page and retrieve the contents of those too.
Recurse for every new URI you retrieve.
Retrieve the contents only of URIs that are inside this website (there could be external URIs referencing another website; we don't need those).
Avoid circular crawling: page A has a URI for page B (of the same site), and page B has a URI back to page A, but we already retrieved the content of page A (the About page has a link to the Home page, but we already got the contents of the Home page, so don't visit it again).
The crawling operation must be multithreaded.
The website is vast. It contains a lot of pages. We only want to retrieve 50 URIs, beginning from the home page.
This is a simple scenario. Try solving this with Jsoup: all of this functionality would have to be implemented by you. Crawler4j, or any crawler micro-framework for that matter, would or should have an implementation for the actions above. Jsoup's strong qualities shine when you get to decide what to do with the content.
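As a rough illustration, here is a minimal Crawler4j sketch covering those requirements (domain restriction, 50-page limit, multiple threads). It assumes Crawler4j 4.x, where shouldVisit takes a Page and a WebURL; the seed URL and storage folder are placeholders.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class SiteCrawler extends WebCrawler {

    // Only follow links that stay inside the target site (requirement 4).
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        return url.getURL().toLowerCase().startsWith("https://www.example.com/"); // placeholder domain
    }

    // Called once per fetched page; the bookkeeping of already-visited URIs
    // (requirement 5) is handled by the framework itself.
    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData data = (HtmlParseData) page.getParseData();
            System.out.println(page.getWebURL().getURL() + " -> " + data.getTitle());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // working directory for crawl state
        config.setMaxPagesToFetch(50);              // requirement 7: stop after 50 URIs

        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);

        controller.addSeed("https://www.example.com/"); // requirement 1: the base URI
        controller.start(SiteCrawler.class, 4);         // requirement 6: four crawler threads
    }
}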
Let's take a look at some requirements for parsing.
Get all paragraphs of a page
Get all images
Remove invalid tags (tags that do not comply with the HTML spec)
Remove script tags
This is where Jsoup comes into play. Of course, there is some overlap here; some things might be possible with both Crawler4j and Jsoup, but that doesn't make them equivalent. You could remove the content-retrieval mechanism from Jsoup and it would still be an amazing tool to use. If Crawler4j lost its retrieval mechanism, it would lose half of its functionality.
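To make the contrast concrete, here is a minimal Jsoup sketch for the parsing requirements above, assuming you already have the raw HTML as a string (the sample HTML is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Whitelist;

public class ParseExample {
    public static void main(String[] args) {
        String html = "<html><body><p>Hello</p><img src='a.png'><script>evil()</script></body></html>";

        Document doc = Jsoup.parse(html);

        // Get all paragraphs of the page
        for (Element p : doc.select("p")) {
            System.out.println("paragraph: " + p.text());
        }

        // Get all images
        for (Element img : doc.select("img[src]")) {
            System.out.println("image: " + img.attr("src"));
        }

        // Remove script tags from the parsed document
        doc.select("script").remove();

        // Drop tags that are not in a given whitelist (newer Jsoup releases
        // call this class Safelist instead of Whitelist)
        String clean = Jsoup.clean(html, Whitelist.relaxed());
        System.out.println(clean);
    }
}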
I used both of them in the same project, in a real-life scenario.
I crawled a site, leveraging the strong points of Crawler4j, for all the problems mentioned in the first example. Then I passed the content of each page I retrieved to Jsoup in order to extract the information I needed. Could I have used only one of them? Yes, but I would have had to implement all the missing functionality myself.
Hence the difference: Crawler4j is a crawler with some simple operations for parsing (you can extract the images in one line), but there is no implementation for complex CSS queries. Jsoup is a parser that also gives you a simple API for HTTP requests; for anything more complex, there is no implementation.
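As a sketch of how that handoff might look (not the exact code from my project), Crawler4j's visit callback can pass the fetched HTML straight to Jsoup. This reuses the SiteCrawler class sketched earlier plus the org.jsoup imports from the parsing sketch; the CSS selector is purely an example.

// Inside the SiteCrawler class from the earlier sketch
@Override
public void visit(Page page) {
    if (page.getParseData() instanceof HtmlParseData) {
        String html = ((HtmlParseData) page.getParseData()).getHtml();
        // Hand the already-fetched HTML to Jsoup and run a CSS query against it
        Document doc = Jsoup.parse(html, page.getWebURL().getURL());
        for (Element link : doc.select("article h2 > a[href]")) {
            System.out.println(link.attr("abs:href") + " : " + link.text());
        }
    }
}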

Related

How to deal in Android with a RESTful API that returns an HTML table

This API returns a whole HTML table. I'm looking for a way to add this table (as is) into my UI, but I've never seen an API returning an HTML table before. Browsing the internet for an answer isn't giving me any hope either.
Is it possible to put it into a WebView, or any other UI object? My application sends a word to the API and gets the table in return.
I'd appreciate some code example.
You can certainly just show that exact same page in a WebView. If you want to parse the table and display only certain information, there is a library called Jsoup available which makes it very convenient to parse HTML.
It looks like you don't mind displaying the whole thing in a WebView - if that is acceptable, then you just load the page into a WebView widget. WebView will take care of rendering the page exactly as you see it in a browser. You only have to tell it what to load.
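If the API hands you the raw HTML string rather than a URL, a minimal sketch inside your Activity might look like this (tableHtml and R.id.web_view are placeholder names):

// tableHtml is the HTML string returned by the API
WebView webView = (WebView) findViewById(R.id.web_view); // placeholder view id in your layout
webView.loadDataWithBaseURL(null, tableHtml, "text/html", "UTF-8", null);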
You parse the output like you would any other web request. If you wanted to include the table in your own webpage, you could. Or you could parse the response for the specific info you need.
Don't think of it as an API, think of it as a URL you're requesting and now you need to do something with the contents. That might help with your Googling. You're essentially doing page scraping.
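If you go the scraping route, a hedged sketch with Jsoup might look like this (run it off the UI thread on Android; the selector and the choice of first column are guesses about what the returned table actually looks like):

import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableParser {
    // Extract the text of the first cell of every data row of the table.
    public static List<String> firstColumn(String tableHtml) {
        Document doc = Jsoup.parse(tableHtml);
        List<String> values = new ArrayList<>();
        for (Element row : doc.select("table tr")) {
            Element cell = row.select("td").first(); // null for header rows that only contain <th>
            if (cell != null) {
                values.add(cell.text());
            }
        }
        return values;
    }
}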

Read full content of a web page in Java

I want to crawl the whole content of the following link with a Java program. The first page is no problem, but when I want to crawl the data of the next pages, I get the same source code as for page one. Therefore a simple HTTP GET does not help at all.
This is the link for the page I need to crawl.
The web site has active content that needs to be interpreted and executed by an HTML/CSS/JavaScript rendering engine. I do have a simple solution with PhantomJS, but it is cumbersome to run PhantomJS code from Java.
Is there any easier way to read the whole content of the page with Java code? I already searched for a solution, but could not find anything suitable.
Appreciate your help,
kind regards.
Using the Chrome network log (or a similar tool in any other browser) you can identify the XHR request that loads the actual data displayed on the page. I have removed some of the query parameters, but essentially the request looks like this:
GET https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1&user_bridge=0&_=1461181945520
Helpfully, the query parameters look quite easy to understand. The order=asc&limit=10&page=1 part looks like it would be easy to adjust to return your desired results. You could adjust the page parameter to crawl successive pages of data.
The response is JSON, for which there are a ton of libraries available.
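A minimal sketch of that approach, fetching one page of results and parsing it with the org.json library. The exact JSON structure of the response is an assumption, so dump it once and adjust; the cache-buster parameter (_=...) from the original request has been left out.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import org.json.JSONObject;

public class XhrFetcher {
    public static void main(String[] args) throws Exception {
        // page=1 can be incremented to walk through successive result pages
        String url = "https://www.blablacar.de/search_xhr?fn=frankfurt&fcc=DE"
                + "&tn=muenchen&tcc=DE&sort=trip_date&order=asc&limit=10&page=1";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("X-Requested-With", "XMLHttpRequest"); // mimic the browser's XHR

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // Print the whole object pretty-printed to inspect its real structure
        JSONObject json = new JSONObject(body.toString());
        System.out.println(json.toString(2));
    }
}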

Retrieving contents of URL after they have been changed by javascript

I am facing a problem retrieving the contents of an HTML page using java. I have described the problem below.
I am loading a URL in java which returns an HTML page.
This page uses JavaScript. So when I load the URL in the browser, a JavaScript function call occurs AFTER the page has been loaded (on body load of the HTML page) and it modifies some content (the innerHTML of one of the divs) on the web page. This change is obviously visible to me in the browser.
Now, when I try to do the same thing using Java, I only get the HTML content of the page BEFORE the JavaScript call has occurred.
What I want to do is fetch the contents of the HTML page after the JavaScript function call has occurred, and all this has to be done using Java.
How can I do this? What should my approach be?
You need to use a server-side browser library that will also execute the JavaScript, so you can get the JavaScript-updated DOM contents. The default browser mechanism doesn't do this, which is why you don't get the expected result.
You should try Cobra: Java HTML Parser, which will execute your JavaScript. See here for the download and for the documentation on how to use it.
Cobra:
It is Javascript-aware. DOM modifications that occur during parsing will be reflected in the resulting DOM. However, Javascript can be disabled.
For anyone reading this answer: Scott's answer above was a starting point for me, but the Cobra project is long dead and cannot handle pages which use complex JavaScript.
However, there is something called HtmlUnit which does exactly what I want.
Here is a small description:
HtmlUnit is a "GUI-Less browser for Java programs". It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc... just like you do in your "normal" browser.
It has fairly good JavaScript support (which is constantly improving) and is able to work even with quite complex AJAX libraries, simulating either Firefox or Internet Explorer depending on the configuration you want to use.
It is typically used for testing purposes or to retrieve information from web sites.
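For completeness, a small HtmlUnit sketch along those lines (the URL and the wait time are placeholders; depending on the HtmlUnit version the package may be com.gargoylesoftware.htmlunit or org.htmlunit):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class RenderedPageFetcher {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(true);
            webClient.getOptions().setThrowExceptionOnScriptError(false);

            HtmlPage page = webClient.getPage("https://www.example.com/"); // placeholder URL
            // Give onload handlers and AJAX calls some time to finish
            webClient.waitForBackgroundJavaScript(5_000);

            // The DOM as it looks AFTER the JavaScript has run
            System.out.println(page.asXml());
        }
    }
}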

mediawiki syntax parser for android

I need some help fetching the content of a MediaWiki page and displaying it in my Android app. I've found that MediaWiki has an API which can export the markup text, but I need some way to parse it.
Other solutions are also welcome, but I don't just want to include the whole page (with menus and everything) as a webview.
You can get the HTML for just the content section (i.e. not including the menus and everything) by specifying action=render when fetching the page normally. You can get about the same result using the API's action=parse. In either case, you could then supply your own framing HTML, stylesheets, and the like for proper display to the user.
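A rough sketch of that approach using the API's action=parse and the org.json library (the wiki base URL and page title are placeholders, and on Android you would run this off the UI thread). The returned HTML fragment can then be wrapped in your own framing HTML/CSS and shown in a WebView.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import org.json.JSONObject;

public class WikiFetcher {
    // Returns the rendered HTML of just the content section of a wiki page
    public static String fetchContentHtml(String wikiBase, String title) throws Exception {
        String url = wikiBase + "/api.php?action=parse&format=json&prop=text&page="
                + URLEncoder.encode(title, "UTF-8");

        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        }

        // With the classic JSON format, the rendered HTML lives under parse -> text -> "*"
        JSONObject json = new JSONObject(body.toString());
        return json.getJSONObject("parse").getJSONObject("text").getString("*");
    }
}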
If you are in control of the wiki, you may be best served by using the mobile format that Wikipedia uses. Otherwise it will depend on what the markup text it returns looks like. If the markup is what you would see in the wiki editor, you may have issues: Wikipedia, for example, makes extensive use of templates, and I'm not sure you can do much if you don't have the actual template code behind them.
You also might want to look at http://www.mediawiki.org/wiki/Markup_spec if you are headed down this path.

Is there a way to change or reskin an incoming website on the fly?

I have a project where they want me to embed a website into a Java application, and they want the website to have a color scheme similar to the rest of the application. I know a lot about CSS and building websites, but I am not aware of a way to change the look of a website on the fly as it comes in. Is there someone who can help?
Update:
I don't have access to the header because it is not my website. To give more info about the project: we have a browser embedded in a Java client application. The user needs to access a website that displays the contents of a database. I have no access to the original HTML or CSS from the site.
What I need is to change the background color and font sizes of the incoming web page to match the look and feel of the Java application.
One approach would be to replace their CSS with your own.
You could also take the approach used by the Stylish plugin, which involves a lot of !important declarations to override the site's CSS. Since this is a Java app, I assume the user will not have the opportunity to supply their own CSS, so using !important here doesn't precisely go against the standard.
In your particular situation, I'd look into data scraping: all you need to do is scrape the website for the data, and then re-style it to present it how you want.
Good luck
The Greasemonkey add-on for Firefox does just this. You can write a bit of Javascript code and have it run when certain web pages load. One common thing to use it for is to make changes to the DOM to move page elements around, hide or resize elements, change colors, etc. There are a bunch of examples at userscripts.org if you want to get an idea of what I am talking about.
Your code would simply need to do something similar. Load the page (including the normal style sheets) and then use the DOM to make changes to style elements as desired. Browse through the source of the page to get the names/ids of important elements, and your code can key off of those. Loading an additional CSS file containing your changes is an option, but doing it programmatically will likely give you more flexibility in the event that the target website changes.
It depends on what you use to show the pages in Java. Most browser implementations support dynamic changes to the DOM, so you can simply add a CSS file to the header as the last element, and it will be applied.
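For example, if the embedded browser is a JavaFX WebView (an assumption; SWT Browser and others have similar hooks), you could append your own stylesheet once the page has loaded:

import javafx.application.Application;
import javafx.concurrent.Worker;
import javafx.scene.Scene;
import javafx.scene.web.WebEngine;
import javafx.scene.web.WebView;
import javafx.stage.Stage;

public class ReskinnedBrowser extends Application {
    @Override
    public void start(Stage stage) {
        WebView webView = new WebView();
        WebEngine engine = webView.getEngine();

        engine.getLoadWorker().stateProperty().addListener((obs, oldState, newState) -> {
            if (newState == Worker.State.SUCCEEDED) {
                // Append a <style> element last so it overrides the site's own CSS
                engine.executeScript(
                    "var s = document.createElement('style');" +
                    "s.innerHTML = 'body { background: #2b2b2b; color: #eee; font-size: 14px; }';" +
                    "document.head.appendChild(s);");
            }
        });

        engine.load("https://www.example.com/"); // placeholder URL
        stage.setScene(new Scene(webView, 800, 600));
        stage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}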
You need to know the markup of the HTML/CSS so you can make the best skin.
You could theoretically do it by styling just the basic tags (h1...h6, p, etc.), but it would not be as good and would probably produce poor, or even horrible, results at times.
If you KNOW the site's markup, then you can make a skin and simply use CSS/images to skin it as you want.
Just include your CSS LAST so that it overrides the stylesheet already present on the site you want to skin differently.
It should not be a difficult thing per se; the skin itself is probably the bigger (more effort required) part of the job.
"On the fly" should mean changing the HTML as it is fetched, so parsing and replacing tokens seems to be the way to go.
You could change the location of the stylesheet files by replacing the href value in a link that points to a CSS file, setting the value to your own stylesheet (a different URI).
<link type="text/css" href="mylocalURI" rel="stylesheet" />
(this should be the result of a process/replacement)
I think you understand what should happen for inline styles.
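One way to do that replacement, sketched here with Jsoup rather than raw token matching (the answer itself only says "parsing and replacing tokens"); mylocalURI is a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class StylesheetSwapper {
    // Rewrite every stylesheet link in the fetched HTML to point at our own CSS
    public static String swapStylesheets(String html, String myLocalUri) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("link[rel=stylesheet]")) {
            link.attr("href", myLocalUri);
        }
        return doc.outerHtml();
    }
}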
I would use JTidy to normalize the original site HTML to XHTML, then use XSLT to filter only the interesting/relevant information, obtaining plain XML; and finally (since I wouldn't want to convert the XML to objects), XSLT again to transform the "pure" XML into the HTML look & feel I need/want.
All of this can be assembled as streams, using more or less 4 KB of buffer per filter (12 KB total) per thread, which also means it will run fast enough. And it is all built on standard, open-source components.
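A hedged sketch of that pipeline, buffering in memory for simplicity rather than chaining streams; filter.xsl and skin.xsl are hypothetical stylesheet names you would write yourself:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.tidy.Tidy;

public class ReskinPipeline {
    public static String reskin(InputStream originalHtml) throws Exception {
        // 1. Normalize the original HTML to XHTML with JTidy
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        tidy.setQuiet(true);
        ByteArrayOutputStream xhtml = new ByteArrayOutputStream();
        tidy.parse(originalHtml, xhtml);

        TransformerFactory factory = TransformerFactory.newInstance();

        // 2. First XSLT: keep only the interesting/relevant information as plain XML
        Transformer filter = factory.newTransformer(new StreamSource(new File("filter.xsl")));
        ByteArrayOutputStream filtered = new ByteArrayOutputStream();
        filter.transform(new StreamSource(new ByteArrayInputStream(xhtml.toByteArray())),
                new StreamResult(filtered));

        // 3. Second XSLT: turn the "pure" XML into HTML with the desired look & feel
        Transformer skin = factory.newTransformer(new StreamSource(new File("skin.xsl")));
        ByteArrayOutputStream styled = new ByteArrayOutputStream();
        skin.transform(new StreamSource(new ByteArrayInputStream(filtered.toByteArray())),
                new StreamResult(styled));

        return styled.toString("UTF-8");
    }
}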
Cheers.
