Java & Heritrix 3.1.x: Web Content parsing?

Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as most of the classes have been changed or code has been significantly rewritten/refactored), could anyone point me to the relevant class (or classes) of the system that deal with the actual web page content extraction?
What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to that content (analyze structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to locate is the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?

I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriter class is rather straightforward and well commented.
The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
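For the classification side, once the page text is available as a CharSequence (for example inside an ExtractorHTML subclass; the exact Heritrix hook is not shown here and should be checked against the javadoc above), a structural feature pass could look roughly like the sketch below. TagCounter and countTags are hypothetical names, not part of Heritrix, and the regex is a crude approximation rather than a real HTML parser.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Heritrix): counts opening tags as a simple structural feature. */
class TagCounter {
    // Matches the name of an opening tag such as <a ...> or <DIV>; closing tags and comments are skipped.
    private static final Pattern OPEN_TAG = Pattern.compile("<([A-Za-z][A-Za-z0-9]*)\\b");

    /** Returns a map from lower-cased tag name to the number of opening tags found. */
    static Map<String, Integer> countTags(CharSequence html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }
}
```

The resulting counts could then be fed to whatever classifier you use; that part is outside the scope of this sketch.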

Related

Generate redirects from a list in Jakarta Tomcat

I'm pretty experienced in HTML, CSS, JavaScript, SQL, IIS, and a little Apache, but have essentially no knowledge of Java or Tomcat. I have a client with a low budget and a legacy web site based on a proprietary CMS (an ancient version of this) built on Jakarta Tomcat. Upgrading is not an option and paying tms to develop is also not an option. I'm much cheaper.
The URLs of the pages and documents on the site tend to be pretty long and not very meaningful to humans. For one reason or another when they are doing promotions they want shorter URLs for particular content. For example they may want http://{server}/naftir.html to redirect to http://{server}/cmspreview/content.jsp?id=com.tms.cms.section.Section_1013_sub_options.
I've solved this by a kludge of putting (for example) a naftir.html file in the root directory of the server and writing the redirects in there. But the {whatever}.html files are piling up and it seems there should be a better solution. E.g. edit the 404 file to look in a list (or MySQL table) of short names and desired redirects to do the redirection if found, otherwise display the 404. Or some other method based on a list of short names and redirect URLs rather than loads of files.
Any ideas?

You can easily configure a lot of this kind of stuff using a tool called urlrewrite. It's a Filter that you configure in WEB-INF/web.xml to run for all the URLs you want to re-map. Then, there is a "rewrite" configuration file where you can map specific incoming URLs to outgoing URLs. You can even use parametric replacements like mapping /foo/* to /a/b/c/d/*.
You can simplify the configuration by mapping the urlrewrite filter to all URLs, but then you lose a bit of performance on all the URLs that aren't ultimately rewritten.
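If you would rather see the list-based lookup itself, here is a minimal sketch in plain Java. It is not the urlrewrite tool's API: ShortUrlTable and resolve are hypothetical names, and the list is assumed to be kept in java.util.Properties format ("shortPath=targetUrl", one per line). In a real servlet Filter you would call this lookup and issue the redirect when a target is found, otherwise let the request continue to the normal 404 handling.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Optional;
import java.util.Properties;

/** Hypothetical helper: maps short promotional paths to long CMS URLs. */
class ShortUrlTable {
    private final Properties redirects = new Properties();

    /** Loads "shortPath=targetUrl" pairs in java.util.Properties format. */
    ShortUrlTable(String propertiesText) {
        try {
            redirects.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // A StringReader should not fail, but load() declares IOException.
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the redirect target for a request path, or empty if the path is not in the list. */
    Optional<String> resolve(String requestPath) {
        return Optional.ofNullable(redirects.getProperty(requestPath));
    }
}
```

With an entry such as /naftir.html=/cmspreview/content.jsp?id=com.tms.cms.section.Section_1013_sub_options, resolve("/naftir.html") yields the long URL, and any path not in the list yields an empty result.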

Java crawler with custom file save ability

I'm looking for an open-source web crawler written in Java which, in addition to the usual web crawler features (depth limits, multi-threading, etc.), allows custom handling of each file type.
To be more precise, when a file is downloaded (or is about to be downloaded), I want to handle the saving of the file myself. HTML files should be saved in one repository, images in another location, and other files somewhere else. Also, the repository might not be a simple file system.
I've heard a lot about Apache Nutch. Does it have the ability to do this? I'm looking to achieve this as simply and quickly as possible.

Based on the assumption that you want a lot of control over how the crawler works, I would recommend crawler4j. There are many examples, so you can get a quick glimpse of how things work.
You can easily handle resources based on their content type (take a look at Page.java; it is the class of the object that contains information about a fetched resource).
There are no limitations regarding the repository. You can use anything you wish.
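To illustrate the content-type dispatch, here is a rough sketch in plain Java with no crawler4j dependency. ContentTypeRouter, Store, register and save are hypothetical names; in crawler4j the URL, content type and bytes would come from the Page object handed to your crawler, and that wiring is left out here.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical router: picks a storage handler from a fetched resource's content type. */
class ContentTypeRouter {
    /** Storage callback: receives the resource URL and its raw bytes. */
    interface Store extends BiConsumer<String, byte[]> {}

    private final Map<String, Store> storesByPrefix = new LinkedHashMap<>();
    private final Store fallback;

    ContentTypeRouter(Store fallback) {
        this.fallback = fallback;
    }

    /** Registers a store for content types starting with the given prefix, e.g. "image/". */
    ContentTypeRouter register(String prefix, Store store) {
        storesByPrefix.put(prefix, store);
        return this;
    }

    /** Sends the resource to the first store whose prefix matches, otherwise to the fallback. */
    void save(String url, String contentType, byte[] data) {
        String type = contentType == null ? "" : contentType.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, Store> entry : storesByPrefix.entrySet()) {
            if (type.startsWith(entry.getKey())) {
                entry.getValue().accept(url, data);
                return;
            }
        }
        fallback.accept(url, data);
    }
}
```

Each Store can write to a file system, a database, or anything else, which is what keeps the repository choice open.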

How to Manage Client Specific Configuration

For a product that is used by multiple clients, where different clients ask for different customizations of both the user interface and the functionality, how can those changes be accommodated without cluttering the code with client-specific code?
Are there any frameworks (for any programming language) that help with this?
To add more detail the UI is web based and written using JSP.

Managing different versions of the same app is one of the most difficult business requirements, so do not expect open frameworks for this case; each company that faces it tends to develop its own system.
As for business logic modifications, you would benefit from strong interfacing and IoC (such as Spring). You would override the services for your specific case, change the required methods, and then inject the modified version of the service through the IoC container.
As for the UI, it's more difficult because you've chosen JSP, which has little flexibility. If you were programming in Swing or GWT, you could modify the UI the same way: override the needed UI classes, change them, and inject the modified versions. With JSP, there will probably be a lot of modifications to .jsp files in your customized version.
As for change management and bug fixing, make full use of a version control system. Your customer-specific versions are branches, while the main, standard version is trunk. Bug fixes are made to trunk, then merged into the customer-specific branches. With interfaces and overridden implementations most of the merges should be easy; with JSP, however, I would expect frequent conflicts.
Generally, code changes merge more easily than anything XML-based.

How about simple OOP? Set up a realistic interface/base class and depending on some sort of configuration, instantiate either child class A or B, depending on the client. It's hard to provide more detail for a language-agnostic question like this, but I think it's very realistic.
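A minimal sketch of that idea, with entirely made-up names (Greeter, AcmeGreeter, GreeterFactory, and the client id "acme" are all hypothetical): a common interface, a default implementation, a client-specific subclass, and a small factory that picks one based on configuration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Common interface that the rest of the application codes against. */
interface Greeter {
    String greet(String user);
}

/** Standard behaviour shared by most clients. */
class DefaultGreeter implements Greeter {
    @Override
    public String greet(String user) {
        return "Welcome, " + user;
    }
}

/** Client-specific override; only the differing behaviour lives here. */
class AcmeGreeter extends DefaultGreeter {
    @Override
    public String greet(String user) {
        return "Welcome to ACME, " + user;
    }
}

/** Chooses an implementation from a client id, e.g. one read from configuration. */
class GreeterFactory {
    private final Map<String, Supplier<Greeter>> byClient = new HashMap<>();

    GreeterFactory() {
        byClient.put("acme", AcmeGreeter::new);
    }

    Greeter forClient(String clientId) {
        return byClient.getOrDefault(clientId, DefaultGreeter::new).get();
    }
}
```

An IoC container such as Spring would replace the hand-written factory, but the shape of the code stays the same.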

One solution to this problem, common in the Win32/.NET world, is to move client-specific "code" into resource files. Many .NET projects (.NET has built-in support for this pattern through the System.Resources namespace) use this pattern for internationalization, by placing the UI strings into one file per language, and then loading UI strings from the appropriate file at runtime.
How does this pattern apply to a JSP application? Well, here you can keep one resources file per client (or, instead of files, use a database), and load the user-specific customizations from the resources file whenever you serve a page.
Say for example that your biggest customer wants to have their logo overlaid on some part of each webpage in your site. Your page could load the CustomerLogo property, and use that as the src attribute for the HTML image at that part of the page. If you are serving the page to the important customer, you load the URL "/static/images/importantCustomerLogo.png," and otherwise you fall back to the default resources file, which specifies the URL "/static/images/logo.png."
This way, you can abstract out the code for loading properties into one or two Java files, and just use those properties throughout the website. The only part of your codebase that is customer-specific will be the set of resources files, and those can be in a clean XML format that is easy to read and modify. The upshot is that people who didn't develop the application in the first place can modify XML without having to read the code first, so you won't have to maintain the resources files - the sales department can do that job for you.
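As a rough Java equivalent, java.util.Properties already supports a defaults chain, so the fallback described above can be sketched as follows. ClientResources and forClient are hypothetical names, and the sketch reads the plain key=value format from strings for brevity; an XML or database-backed source would follow the same pattern.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical loader: per-client properties that fall back to shared defaults. */
class ClientResources {
    private final Properties defaults;

    ClientResources(String defaultsText) {
        this.defaults = parse(defaultsText, null);
    }

    /** Builds a client view; keys missing from clientText are answered by the defaults. */
    Properties forClient(String clientText) {
        return parse(clientText, defaults);
    }

    private static Properties parse(String text, Properties fallback) {
        // Properties(Properties) installs the fallback used by getProperty().
        Properties props = new Properties(fallback);
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

A page would then call getProperty("CustomerLogo") on the properties for the current client and use the result as the image src, without knowing whether the value came from the client file or the defaults.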

GWT does this out of the box via a feature called deferred binding.
When compiling a GWT application, the compiler actually generates a different version of the code for each target browser. This is done automatically, with the GWT components taking care of the gory browser-specific details.
This feature can be extended to produce arbitrary compilations based on custom properties. Here is a simplified example: assume you have different view definitions for a normal and a detailed view.
public abstract class AbstractView { ... }
public class NormalView extends AbstractView { ... }   // concrete, so GWT.create can instantiate it
public class DetailedView extends AbstractView { ... } // concrete, so GWT.create can instantiate it
You can create a module definition (in your gwt.xml file) that will generate two different versions, one using the NormalView class and the other using the DetailedView class:
<define-property name="customMode" values="normal,detailed" />
<replace-with class="com.example.NormalView">
  <when-type-is class="com.example.AbstractView" />
  <when-property-is name="customMode" value="normal" />
</replace-with>
<replace-with class="com.example.DetailedView">
  <when-type-is class="com.example.AbstractView" />
  <when-property-is name="customMode" value="detailed" />
</replace-with>
Using
AbstractView view = GWT.create(AbstractView.class);
will then provide the appropriate instance at runtime.
It's up to you to encapsulate your client-specific code into specific classes, and to expose common interfaces for the different implementations.
You will also need to select the appropriate compiled version according to the client currently viewing the page (you can use JSP for this).
Please don't take the code samples above as tested; there might be problems with the syntax. They are only intended to convey the general idea.
A JSP backend is an ideal hosting environment for a GWT app, and you will be able to take advantage of the RequestFactory mechanism for easy communication between client and server.
Obviously there is a learning curve here; IMO the official documentation is a good place to start.

You could try reading OSGi-related articles (or books). This platform would give you a very pragmatic answer to your modularity issues. It is specifically designed to handle different modules living together, with dependencies and versioning.
As mentioned in an earlier answer, dependency injection is valuable here; OSGi Declarative Services is a very good alternative to Spring, with dynamic capabilities. Deploy a bundle providing a service and your references will be updated automatically; remove it and they will be refreshed too.
Have a look at this technology and ask some questions afterwards.
Regards
jerome

Multiple "pages" in GWT with human friendly URLs

I'm playing with a GWT/GAE project which will have three different "pages", although it is not really pages in a GWT sense. The top views (one for each page) will have completely different layouts, but some of the widgets will be shared.
One of the pages is the main page, which is loaded by the default URL (http://www.site.com), but the other two need additional URL information to differentiate the page type. They also need a name parameter (like http://www.site.com/project/project-name). There are at least two solutions to this that I'm aware of:
1. Use the GWT history mechanism and let the page type and parameters (such as project name) be part of the history token.
2. Use servlets with url-mapping patterns (like /project/*).
The first choice might seem obvious at first, but it has several drawbacks. First, a user should be able to easily remember a project's URL and type it directly. It is hard to produce a human friendly URL with history tokens. Second, I'm using gwt-presenter and this approach would mean that we need to support subplaces in one token, which I'd rather avoid. Third, a user will typically stay at one page, so it makes more sense that the page information is part of the "static" URL.
Using servlets solves all these problems, but also creates other ones.
So my first question is, what is the best solution here?
If I would go for the servlet solution, new questions pop up.
It might make sense to split the GWT app into three separate modules, each with an entry point. Each servlet that is mapped to a certain page would then simply forward the request to the GWT module that handles that page. Since a user typically stays at one page, the browser only needs to load the js for that page. Based on what I've read, this solution is not really recommended.
I could also stick with one module, but then GWT needs to find out which page it should display. It could either query the server or parse the URL itself.
If I stick with one GWT module, I need to keep the page information stored on the server side. Naturally I thought about sessions, but I'm not sure if it's a good idea to mix page information with user data. A session usually lives between user login and logout, but in this case it would need different behavior. Would it be bad practice to handle this via sessions?
The one GWT module + servlet solution also leads to another problem. If a user goes from a project page to the main page, how will GWT know that this has happened? The app will not be reloaded, so it will be treated as a simple state change. It seems rather inefficient to have to check page info for every state change.
Anyone care to guide me out of the foggy darkness that surrounds me? :-)

I'd go with History tokens. It's the standard way of handling such situations. I don't understand though, what you mean by "It is hard to produce a human friendly URL with history tokens" - they seem pretty human friendly to me :) And if you use servlets for handling urls, I think that would cause the whole page to be reloaded - something which I think you'd rather want to avoid.
"Second, I'm using gwt-presenter and this approach would mean that we need to support subplaces in one token, which I'd rather avoid."
If you are not satisfied with gwt-presenter (like I was :)), roll your own classes to help with MVP. It's really easy (you can start from scratch or modify the gwt-presenter classes) and you'll get a solution suited to your needs. I did precisely that, because gwt-presenter seemed too complicated to me, and too generic, when all I needed was a subset of what it offered (or tries to offer).
As for the multiple modules idea - it's a good one, but I'd recommend using Code Splitting - this type of situation (pages/apps that can be divided into "standalone" modules/blocks) is just what it's meant to be used for, plus you bootstrap your application only once, so no extra code downloaded when switching between pages. Plus, it should be easier to share state that way (via event bus, for example).
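To make the history-token route a bit more concrete, here is a small sketch of parsing a token such as "project/project-name" (the part after # in a URL like http://www.site.com/#project/project-name) into a page type and a name. It is plain Java with no GWT dependency; PlaceToken and the default page name "main" are assumptions for illustration. In GWT the token string would come from the history mechanism, for example in a value-change handler.

```java
/** Hypothetical value type for a parsed history token such as "project/project-name". */
final class PlaceToken {
    /** Page type, e.g. "main" or "project". */
    final String page;
    /** Optional name parameter; null when the token has none. */
    final String name;

    private PlaceToken(String page, String name) {
        this.page = page;
        this.name = name;
    }

    /** Splits "page/name" at the first slash; an empty or null token means the main page. */
    static PlaceToken parse(String token) {
        if (token == null || token.isEmpty()) {
            return new PlaceToken("main", null);
        }
        int slash = token.indexOf('/');
        if (slash < 0) {
            return new PlaceToken(token, null);
        }
        String rest = token.substring(slash + 1);
        return new PlaceToken(token.substring(0, slash), rest.isEmpty() ? null : rest);
    }
}
```

Keeping the page type and its parameter in one token this way avoids a server round trip to find out which page to show.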

Based on what you have posted I presume you come from building websites using a server side framework: JSP, JSF, Wicket, PHP or similar. GWT is not the solution for building page-based navigational websites, like you would with the aforementioned frameworks. With GWT, you load a webapp in the browser and stay there. Handle user events, talk with the server and update widgets; using gwt-presenter is a good thing here as you are forced to think about separation of controller logic and view state.
You can really exploit all features of GWT to build a high-performance app-in-the-browser, but it is definitely not meant for building websites (using hyperlinked pages that transfer request parameters via the server session).
This is by far the most widely asked question about GWT here at StackOverflow :)
"How do I define pages and navigation between them in GWT?" Short answer: "You don't."
Use Wicket instead, it runs on the App Engine just fine and enables you to define page bookmarks and all stuff you mentioned above. Look here: http://stronglytypedblog.blogspot.com/2009/04/wicket-on-google-app-engine.html

What is the use of TAG Libraries in JSP and why we are using it?

I am new to JSP and I've come across tag libraries. Please give me a detailed explanation of tag libraries: where, and in what type of application, should we use them?
Thanks in advance

Tag libraries are used when you need more than what can be done using EL and standard actions. They contain custom tags that you can use as needed, and you can even create your own custom tags (that is a bigger process and requires you to write the Java code that runs when your custom tag is used in the JSP). In most cases, the tags available in JSTL will be sufficient.
Taking a step back, tags are used in JSPs to separate the scripts (Java code) from the JSP pages. Script-free pages are easier to maintain because presentation is separated from logic, and they let web page designers who don't know Java style the JSP pages without having to deal with Java code embedded in them.
I suggest you read 'Head First Servlets & JSP' for a better understanding of the entire process.

Tag libraries allow a cleaner separation between the look-and-feel of your application and its logic, compared to the original scriptlet syntax offered by JSP. Replacing scriptlets with custom tags eliminates the awkward confusion of imperative Java and declarative markup that used to be common in JSPs.
In an ideal world, web designers would be able to edit JSP files using a combination of standard markup and custom tags. Common markup patterns can be factored out into tag files; if they need something that requires new logic, a programmer can implement a tag class for them.
There are two ways to implement a tag library: tag files and tag classes. Tag files use a syntax that is nearly the same as JSP, but can be parameterized with attributes in the tag. Tag classes are normal Java classes that implement a special interface and are bundled with a Tag Library Descriptor, an XML file that describes the tag name, attributes, and implementation class.
Most Java web frameworks today come with a custom tag library that helps developers utilize the features of the framework more easily. Other tag libraries, like JSTL, provide functionality that is useful in almost any application, and can be used in conjunction with any framework.
