I'm looking for an open-source web crawler written in Java which, in addition to the usual web crawler features such as depth limits and multi-threading, has the ability to customize the handling of each file type.
To be more precise, when a file is downloaded (or is about to be downloaded), I want to handle the saving operation myself. The HTML files should be saved in one repository, images in another location, and other files somewhere else. Also, the repository need not be a simple file system.
I've heard a lot about Apache Nutch. Does it have the ability to do this? I'm looking to achieve this as simply and quickly as possible.
Based on the assumption that you want a lot of control over how the crawler works, I would recommend crawler4j. There are many examples, so you can get a quick glimpse of how things work.
You could easily handle resources based on their content type (take a look at the Page.java class - it is the class of the object that contains information about a fetched resource).
There are no limitations regarding the repository. You can use anything you wish.
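As a rough sketch of that dispatch, assuming crawler4j's WebCrawler/Page API; the store* methods are hypothetical hooks for your own repositories (file system, database, whatever you like):

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;

// Sketch: route downloaded resources to different repositories
// based on their content type. The store* methods are placeholders.
public class SortingCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String contentType = page.getContentType(); // e.g. "text/html; charset=UTF-8"
        byte[] data = page.getContentData();

        if (contentType != null && contentType.startsWith("text/html")) {
            storeHtml(url, data);
        } else if (contentType != null && contentType.startsWith("image/")) {
            storeImage(url, data);
        } else {
            storeOther(url, data);
        }
    }

    private void storeHtml(String url, byte[] data)  { /* HTML repository */ }
    private void storeImage(String url, byte[] data) { /* image repository */ }
    private void storeOther(String url, byte[] data) { /* everything else */ }
}
```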
I want to add support for different languages in my Spring Web MVC application without adding a message_language.properties file for each language.
But everywhere I look I find the message_language.properties file solution.
I have searched extensively but haven't found a solution for this.
Please suggest a solution.
It's not clear why you are searching for a different solution. Perhaps this is not what you want, but many projects that require dynamic language addition use database-backed resource bundles (technically speaking it's a solution without the properties file, but essentially the approach is the same).
If this is what you want, you can check out this blog post: http://www.webreference.com/programming/Globalize-Web-Applications15_Java_ResourceBundles/index.html.
The following Stack Overflow question could be helpful as well:
Database backed i18n for java web-app
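For a feel of the database-backed approach, here is a minimal sketch built on Spring's AbstractMessageSource. DatabaseMessageSource and MessageRepository are made-up names; the DAO is a placeholder for whatever lookup (JDBC, JPA, a cache) you use:

```java
import java.text.MessageFormat;
import java.util.Locale;
import org.springframework.context.support.AbstractMessageSource;

// Hypothetical DAO - back it with JDBC, JPA, a cache, whatever suits.
interface MessageRepository {
    // e.g. SELECT text FROM messages WHERE code = ? AND locale = ?
    String findMessage(String code, Locale locale);
}

// Sketch: a MessageSource that resolves codes from a database
// instead of .properties files.
public class DatabaseMessageSource extends AbstractMessageSource {

    private final MessageRepository repository;

    public DatabaseMessageSource(MessageRepository repository) {
        this.repository = repository;
    }

    @Override
    protected MessageFormat resolveCode(String code, Locale locale) {
        String message = repository.findMessage(code, locale);
        return (message != null) ? new MessageFormat(message, locale) : null;
    }
}
```

Register it under the bean name messageSource and Spring MVC picks it up like any other MessageSource.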
Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as most of the classes have been changed or the code has been significantly rewritten/refactored), could anyone point me to the relevant class (or classes) of the system that deal with the actual web page content extraction?
What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to that content (analyzing structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to do is locate the point where I have the web page content either in its entirety or as a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?
I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriter class is rather straightforward and well commented.
The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version. http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
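To give a feel for where the content becomes available, here is a rough sketch of a standalone Processor (an alternative to subclassing ExtractorHTML) that pulls the fetched document out of the CrawlURI's recorder. Treat the method names as assumptions to verify against your Heritrix version's javadoc, and classify() is a hypothetical hook:

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

// Sketch of a Heritrix 3.x processor that inspects fetched content
// before link extraction. Verify method names against your javadoc.
public class ClassifyingProcessor extends Processor {

    @Override
    protected boolean shouldProcess(CrawlURI curi) {
        // only look at URIs that were successfully fetched
        return curi.getFetchStatus() == 200;
    }

    @Override
    protected void innerProcess(CrawlURI curi) throws InterruptedException {
        try {
            // the recorder holds the raw response; this replay view is the
            // same character stream the HTML extractors run their regexes over
            CharSequence content =
                curi.getRecorder().getContentReplayCharSequence();
            classify(curi.getUURI().toString(), content);
        } catch (java.io.IOException e) {
            curi.getNonFatalFailures().add(e);
        }
    }

    private void classify(String uri, CharSequence html) {
        // hypothetical: apply your structural-feature classifier here
    }
}
```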
What is a good approach to save state of a Java applet?
I can deal with object serialization/deserialization to/from a file but don't know where it should be placed or if there's some 'registry' where I can just save a couple of user's settings.
Those settings are hardware dependent so I want to save it on client.
Full permission is given to the applet.
What is a good approach to save state of a Java applet?
For a trusted applet, there are many options.
I can deal with object serialization/deserialization to/from a file but don't know where it should be placed..
Put the information in a sub-directory of user.home.
user.home will be a place that is writable.
Use a sub-directory (e.g. based on the package name of the applet class) in order to avoid colliding with the settings files of other apps.
..or if there's some 'registry' where I can just save a couple of user's settings.
I've heard that the Preferences class can be used for that ..
This data is stored persistently in an implementation-dependent backing store. Typical implementations include flat files, OS-specific registries, directory servers and SQL databases. The user of this class needn't be concerned with details of the backing store.
Sounds neat, doesn't it? The only trouble is that I've never been able to make an example where the values persist between runs!
Object serialization comes with a huge warning that it might break at any time.
I'd go for a file location of your own specification (e.g. in user.home) and either use a Properties file (for simple name/value pairs) or XMLEncoder/XMLDecoder (for more complex Java beans).
For the former, see this little example. The latter is described in a short example at the top of the JavaDocs.
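A minimal sketch of the Properties route; the ".myapplet" directory and file name here are arbitrary placeholders:

```java
import java.io.*;
import java.nio.file.*;
import java.util.Properties;

// Sketch: persist simple name/value settings under user.home.
// ".myapplet" and "settings.properties" are made-up names.
public class SettingsStore {

    private final Path file = Paths.get(
            System.getProperty("user.home"), ".myapplet", "settings.properties");

    public void save(Properties props) throws IOException {
        Files.createDirectories(file.getParent());
        try (OutputStream out = Files.newOutputStream(file)) {
            props.store(out, "applet settings");
        }
    }

    public Properties load() throws IOException {
        Properties props = new Properties();
        if (Files.exists(file)) {
            try (InputStream in = Files.newInputStream(file)) {
                props.load(in);
            }
        }
        return props;
    }
}
```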
Of course, if this applet is deployed in a Plug-In 2 architecture JRE and has access to the JNLP API, it can use the PersistenceService. Here is a demo of the PersistenceService.
Even a sand-boxed applet can use the PersistenceService - it is similar to the concept of Cookies in that it is intended for small amounts of data.
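For completeness, a rough sketch of the PersistenceService route; the muffin name and size are arbitrary and error handling is trimmed:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.URL;
import javax.jnlp.*;

// Sketch: store a small settings blob via the JNLP PersistenceService.
// Works even for sand-boxed code under a JNLP-aware JRE.
public class MuffinStore {

    public static void save(byte[] data) throws Exception {
        BasicService bs = (BasicService)
                ServiceManager.lookup("javax.jnlp.BasicService");
        PersistenceService ps = (PersistenceService)
                ServiceManager.lookup("javax.jnlp.PersistenceService");

        URL muffin = new URL(bs.getCodeBase(), "settings"); // arbitrary name
        try {
            ps.create(muffin, data.length); // fails if it already exists
        } catch (IOException alreadyExists) {
            // fine - reuse the existing entry
        }
        try (OutputStream out = ps.get(muffin).getOutputStream(true)) {
            out.write(data);
        }
    }
}
```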
The Applet Persistence API seems to be a good approach when data needs to be persisted between browser sessions: http://docs.oracle.com/javase/1.4.2/docs/guide/plugin/developer_guide/persistence.html
We have an SOA system built on top of EJB 3.0. We are manually maintaining a "service overview map" which shows which business services call which domain services. This is tedious, error prone, and nobody wants to do it :-/
That's why I am looking for a way to automate the generation of these diagrams. I think code analysis is the way to go.
Does anybody know of a tool that does good code analysis for Java? I am thinking of some kind of meta model which I can query to build the graph.
Something like:
Parse all files from root dir xyz and build the meta model for each class:
a) e.g. which other classes does it use
b) which classes use this class
c) which interfaces does it implement
d) what is its filename
e) and so on, I guess you know what I mean
Give me a meta model of all the files you found (java/class).
Generate the graph (hand made).
Output the graph in ".dot" (directed graph) file format.
Use the "dot" tool to render the graph as PNG, PDF, SVG, ...
We already have a simple solution that "greps" through the files, but this is not perfect.
Any help would be appreciated.
Cheers,
Marcel
An alternative would be to do the same thing using ASM on the compiled class files or on the source code itself.
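As a rough sketch of the bytecode route, an ASM ClassVisitor can record super types, interfaces and method-call targets; what you count as a "dependency" is up to you:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Sketch: collect class-level dependencies from a compiled .class file.
// Only super types, interfaces and method-call targets are recorded;
// fields, annotations, etc. would need more visitor methods.
public class DependencyScanner extends ClassVisitor {

    private final Set<String> dependencies = new TreeSet<>();

    public DependencyScanner() {
        super(Opcodes.ASM9);
    }

    @Override
    public void visit(int version, int access, String name,
                      String signature, String superName, String[] interfaces) {
        if (superName != null) {
            dependencies.add(superName);
        }
        for (String iface : interfaces) {
            dependencies.add(iface);
        }
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        return new MethodVisitor(Opcodes.ASM9) {
            @Override
            public void visitMethodInsn(int opcode, String owner, String name,
                                        String desc, boolean isInterface) {
                dependencies.add(owner); // class whose method is called
            }
        };
    }

    public static Set<String> scan(InputStream classFile) throws IOException {
        DependencyScanner scanner = new DependencyScanner();
        new ClassReader(classFile).accept(scanner, ClassReader.SKIP_DEBUG);
        return scanner.dependencies;
    }
}
```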
Hah. I wrote this exact system not one month ago for our EJB 2.0 behemoth. Here is what I did:
downloaded the Java Parser
wrote a recursive directory walker
wrote a tree-based data structure that allowed me to store dependencies.
wrote a .dot file outputter (a sketch of that last step is below).
It took me about a week and it proved very successful when I updated it to output the files as .svg images via the twopi engine; we can now navigate class diagrams in the browser to quickly identify areas of interest for potential refactoring, and are integrating it into our automated build environment.
Drop me a pm if you need more info.
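For the .dot step, a minimal sketch; the shape of the dependency map and the Graphviz options are just examples:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Set;

// Sketch: dump a class-dependency map as a Graphviz .dot digraph.
// Render with "dot -Tsvg deps.dot -o deps.svg" (or twopi for the
// radial layouts mentioned above).
public class DotWriter {

    public static void write(Map<String, Set<String>> deps, Path out)
            throws IOException {
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("digraph dependencies {");
            w.println("  rankdir=LR;");
            for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
                for (String target : e.getValue()) {
                    w.printf("  \"%s\" -> \"%s\";%n", e.getKey(), target);
                }
            }
            w.println("}");
        }
    }
}
```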
There is an Eclipse framework called MoDisco which corresponds to your need. It includes several meta models (Java, JSP, ...).
You can get a complete picture of your project including config files or deployment descriptors.
I'm trying to build (or find an existing) web filter that will compress a JavaScript file at runtime. I've tried building one based on YUICompressor, but I'm getting weird errors out of it when I try to pass a String-based source into it instead of an actual file.
Now I'm expecting to get bombarded with responses like 'Real time compression/minification is a bad idea' but there is a reason I'm not wanting to do it at build time.
I've got a JavaScript web application that lazy loads its JavaScript. It will only load what it actually needs. The JavaScript files can specify dependencies, and I already have a filter that will concatenate the requested files and any dependencies not already loaded into a single response. This means there is a large number of different combinations of JavaScript that can be sent to the user, which makes trying to build all the bundles at build time impractical.
So, to restate: ideally I'm looking for an existing real-time JavaScript filter I can just plug into my app.
If one doesn't exist, I'm looking for tips on what I can use as building blocks. YUICompressor hasn't quite got me there, and Google Closure seems to only be a web API.
Cheers,
Peter
Take a look at The JavaScript Minifier from Douglas Crockford. The source is here: JSMin.java. It's not a filter and only contains the code to minify. We've made it into a filter where we combine and minify JavaScript on the fly as well. It works well, especially if browsers and a CDN cache the results.
Update: I left out that we cache them on the server too. They're only regenerated if any of the assets required to make the combined and minified output have changed. Basically instead of "compiling" them at build time, we handle each combination once at runtime.
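A rough sketch of that caching layer, assuming the JSMin.java port linked above (its constructor takes in/out streams and jsmin() does the work); how you build the cache key is up to you:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: minify each unique combination of scripts once at runtime
// and reuse the result. JSMin is the class from the source linked
// above (adjust the import to its package).
public class MinifyCache {

    private final ConcurrentHashMap<String, String> cache =
            new ConcurrentHashMap<>();

    // key should identify the source files and their modification
    // times, so changed assets get re-minified
    public String minify(String key, String combinedSource) {
        return cache.computeIfAbsent(key, k -> {
            try {
                ByteArrayInputStream in = new ByteArrayInputStream(
                        combinedSource.getBytes(StandardCharsets.UTF_8));
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                new JSMin(in, out).jsmin();
                return out.toString(StandardCharsets.UTF_8.name());
            } catch (Exception e) {
                // fall back to the unminified source on any minifier error
                return combinedSource;
            }
        });
    }
}
```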
I already have a filter that will concatenate the requested files and any dependencies not already loaded into a single response
Sooo.. Why not just minify those prior to loading/concatenating them?
(And yes, compression on the fly is horribly expensive and definitely not worth doing for every .js file served up)