Background: I've been tasked with implementing search engine sitemaps for a website running on Sling. The site has multiple country-specific sites, and every country-specific site can have multiple localizations - for instance, http://ca.example.com/fr would be the French-localized version of the Canadian site, and would map to /content/ca/fr. I can't change this content structure, and unfortunately both the country and localization nodes have the same sling:resourceType. Also, the administrative types want a sitemap.xml for each country/localization pair, not one per country site.
Generating the sitemaps is an easy task; my problem is needing a 'sitemap' node for each country/localization pair - because of the way country and localization nodes are added (and them having the same resource type), I can't currently think of a good automated way to add the sitemap node.
It would be nice if I could somehow define a "virtual resource" that maps requests for /{country}/{localization}/sitemap.xml to a handling script; I've been browsing around and have bumped into ResourceProvider and OptingServlet, but they seem to be pretty focused on absolute paths - or adding selectors to an existing resource, which doesn't seem like an option to me.
Any ideas if there's some more or less clean way to handle this? Adding new countries/localizations doesn't happen every day, but having to add the 'sitemap' node manually still isn't an optimal solution.
I've been considering whether it's perhaps a better idea to have a running service that updates the sitemaps X times per day and generates the sitemap.xml nodes as simple file resources in the JCR, instead of involving the Sling resolver... but before going that route, I'd like some feedback :)
EDIT:
Turns out requirements changed, and they now want the sitemaps to be configurable per localization - makes my job easier, and I won't have to work against Sling :)
Sling is a resource-based framework, so you have to have a resource (node) in the JCR which your request targets.
You have two options:
1) Create a Sitemap template which includes the logic to display the sitemap, or which has your Sitemap component included on it. The sitemap logic can be extracted into a class or service as you see fit. The sitemap for each site would live at:
- /content/us/en/sitemap.xml
- /content/ca/fr/sitemap.xml
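For illustration, here's a minimal sketch of the servlet that could sit behind such a sitemap node (the resource type example/components/sitemap is a made-up name, and the flat child-page walk is a simplification - a real sitemap would recurse):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

import javax.servlet.ServletException;

import org.apache.felix.scr.annotations.sling.SlingServlet;
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;

@SlingServlet(resourceTypes = "example/components/sitemap", methods = "GET")
public class SitemapServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response)
            throws ServletException, IOException {
        response.setContentType("application/xml");
        response.setCharacterEncoding("UTF-8");

        // The sitemap node sits at /content/{country}/{localization}/sitemap.xml,
        // so its parent is the localization root we want to walk.
        Resource root = request.getResource().getParent();

        PrintWriter out = response.getWriter();
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");
        // Only direct children for brevity.
        for (Iterator<Resource> it = root.listChildren(); it.hasNext();) {
            Resource page = it.next();
            out.println("  <url><loc>"
                    + request.getResourceResolver().map(request, page.getPath())
                    + ".html</loc></url>");
        }
        out.println("</urlset>");
    }
}
```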
2) Create a single sitemap resource (node) that you reference using 2 Sling selectors, the country and language codes - this method allows for caching; however, you may run into cache-clearing issues as it's a single resource.
/content/sitemap.us.en.xml
/content/sitemap.ca.fr.xml
You can look at PathInfo for extracting the Sling selector information to determine which sitemap to render:
http://dev.day.com/docs/en/cq/current/javadoc/com/day/cq/commons/PathInfo.html
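That said, the selector values are also available directly from Sling's request path info; a small sketch (class and method names are mine), assuming the sitemap.{country}.{language}.xml URL shape from above:

```java
import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.resource.Resource;

public class SitemapSelectorHelper {

    /**
     * Resolves the site root addressed by a request like
     * /content/sitemap.us.en.xml, where "us" and "en" arrive as selectors.
     */
    public static Resource resolveSiteRoot(SlingHttpServletRequest request) {
        String[] selectors = request.getRequestPathInfo().getSelectors();
        if (selectors.length != 2) {
            return null; // unexpected URL shape
        }
        String country = selectors[0];  // e.g. "us"
        String language = selectors[1]; // e.g. "en"
        return request.getResourceResolver()
                .getResource("/content/" + country + "/" + language);
    }
}
```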
If I were doing this I would require the manual addition of the sitemap to each site, and keep the resource under /content/{country}/{localization}.
You could even look into creating a Blueprint site using MSM (if you're using the platform I think you are) and rolling out new sites using that, which lets you create a site template.
If you want a GET to /{country}/{localization}/sitemap.xml to be processed by custom code, simply create a node at that location and set its sling:resourceType as needed to call a custom servlet or script.
To create those sitemap.xml nodes automatically, you could use a JCR observer to be notified when new /{country}/{localization} trees are created, and create the sitemap.xml node then.
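As a rough sketch of that observer (session handling is simplified to a single long-lived session, and the resource type name is just a placeholder):

```java
import javax.jcr.Node;
import javax.jcr.RepositoryException;
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;

public class SitemapNodeListener implements EventListener {

    private final Session session; // a long-lived service session

    public SitemapNodeListener(Session session) throws RepositoryException {
        this.session = session;
        // Be notified of any node added below /content.
        session.getWorkspace().getObservationManager().addEventListener(
                this, Event.NODE_ADDED, "/content", true, null, null, false);
    }

    @Override
    public void onEvent(EventIterator events) {
        try {
            while (events.hasNext()) {
                String path = events.nextEvent().getPath();
                // /content/{country}/{localization} has exactly three name segments;
                // real code would also verify it actually is a localization node.
                if (path.split("/").length == 4 && !session.nodeExists(path + "/sitemap.xml")) {
                    Node sitemap = session.getNode(path).addNode("sitemap.xml", "nt:unstructured");
                    sitemap.setProperty("sling:resourceType", "example/components/sitemap");
                    session.save();
                }
            }
        } catch (RepositoryException e) {
            // log and recover in real code
        }
    }
}
```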
For configurable sitemaps you can add properties to the sitemap.xml node, and have your custom servlet or script use their values to shape its output.
You could do that without having a sitemap.xml node in the repository, using a servlet filter or a custom ResourceProvider, but having those nodes makes things much easier to implement and understand.
Note: I am working on a Sling resource merger, which is a custom resource provider with the ability to merge multiple resources based on your search paths.
For instance, if your search paths are
/apps
/libs
Hitting /virtual/my/resource/is/here will check
/apps/my/resource/is/here
/libs/my/resource/is/here
There are some options, like:
- add/override a property
- delete a property of the resource under /libs
- reorder nodes if available
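To give a flavour of the idea, here is a greatly simplified sketch - not the actual SLING-2986 code - that just returns the first match from the search paths instead of truly merging (a real provider would also wrap the resource so it keeps its /virtual path):

```java
import java.util.Iterator;

import javax.servlet.http.HttpServletRequest;

import org.apache.sling.api.resource.Resource;
import org.apache.sling.api.resource.ResourceProvider;
import org.apache.sling.api.resource.ResourceResolver;

public class SearchPathResourceProvider implements ResourceProvider {

    private static final String ROOT = "/virtual";
    private static final String[] SEARCH_PATHS = { "/apps", "/libs" };

    @Override
    public Resource getResource(ResourceResolver resolver, HttpServletRequest request, String path) {
        return getResource(resolver, path);
    }

    @Override
    public Resource getResource(ResourceResolver resolver, String path) {
        String relative = path.substring(ROOT.length()); // e.g. /my/resource/is/here
        for (String searchPath : SEARCH_PATHS) {
            Resource candidate = resolver.getResource(searchPath + relative);
            if (candidate != null) {
                return candidate; // first match wins in this sketch
            }
        }
        return null;
    }

    @Override
    public Iterator<Resource> listChildren(Resource parent) {
        return null; // child listing/merging omitted
    }
}
```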
I intend to submit this patch as soon as possible.
Code is currently located at https://github.com/gknob/sling-resourcemerger and tracked by https://issues.apache.org/jira/browse/SLING-2986
In Liferay, when you add web content to a page, a Portlet is created and you can choose the web content that will be displayed (when logged in as admin), and you can set some parameters (rights to view the content, sharing...).
I would like to create a Portlet that overloads this Portlet, to allow the admin to choose his or her web content with custom parameters.
Does anyone know how this could be done? Thanks!
The first idea that came to my mind is to hook the default Web Content Display portlet. This would allow you to add some custom business logic to the portlet without needing to reimplement everything you already get from the original one. Still, this depends a lot on how complex the new features you want to add are.
As you have said, you are a beginner, so here are some hints on how to start with hook creation:
Visit https://www.liferay.com/documentation/liferay-portal/6.1/development/-/ai/liferay-plugin-types-to-develop-with-maven, section "Developing Liferay Hook Plugins with Maven". The archetype there will create the default structure of a hook plugin for you.
The next step is to download the Liferay sources (if you haven't already); visit the official site https://www.liferay.com/downloads/liferay-portal/available-releases
Now that you have the sources, take the .jsp files you want to modify and copy them into your hook. Make sure to keep the same folder structure as in the default one.
Add your custom logic in the appropriate place, then deploy and test it.
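For orientation, the heart of a JSP hook is a small liferay-hook.xml descriptor; a minimal 6.1-style sketch (the /custom_jsps folder name is just a convention you pick):

```xml
<?xml version="1.0"?>
<!DOCTYPE hook PUBLIC "-//Liferay//DTD Hook 6.1.0//EN"
    "http://www.liferay.com/dtd/liferay-hook_6_1_0.dtd">
<hook>
    <!-- JSPs placed under this folder shadow the portal's own JSPs; the
         Web Content Display view lives at
         custom_jsps/html/portlet/journal_content/view.jsp in 6.1 -->
    <custom-jsp-dir>/custom_jsps</custom-jsp-dir>
</hook>
```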
Good luck
Another idea is to use a Maven WAR overlay, which you can read more about at http://java.dzone.com/articles/mavens-war-overlay-what-are
As you say you're a beginner, I'd suggest creating your own portlet that is independent of Liferay's portlets. You can use Liferay's API to get the article you'd like and its content, while implementing your own functionality to filter the content as you like.
The reason I'm suggesting a custom portlet is: Liferay's portlets must be as generic as possible, to match as many use cases as possible. Thus there are lots of conditionals that you won't need (and won't need to understand) in the implementation. If you have some narrow, non-generic requirements for alternative behavior, you're better off implementing exactly those requirements rather than adding to the generic, highly conditional UI. Plus, you might want to keep the original UI for other purposes: if you make a mistake in your own implementation, the original Web Content Display portlet will still continue to work.
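As a rough sketch of that approach against the Liferay 6.x API (the article ID would normally come from a portlet preference, and exact method signatures can differ between versions):

```java
import java.io.IOException;
import java.io.PrintWriter;

import javax.portlet.GenericPortlet;
import javax.portlet.PortletException;
import javax.portlet.RenderRequest;
import javax.portlet.RenderResponse;

import com.liferay.portal.kernel.util.WebKeys;
import com.liferay.portal.theme.ThemeDisplay;
import com.liferay.portlet.journal.model.JournalArticle;
import com.liferay.portlet.journal.service.JournalArticleLocalServiceUtil;

public class SimpleContentPortlet extends GenericPortlet {

    @Override
    protected void doView(RenderRequest request, RenderResponse response)
            throws PortletException, IOException {
        response.setContentType("text/html");
        ThemeDisplay themeDisplay = (ThemeDisplay) request.getAttribute(WebKeys.THEME_DISPLAY);
        try {
            // "MY-ARTICLE-ID" stands in for a value the admin would configure
            // via a portlet preference in edit mode.
            JournalArticle article = JournalArticleLocalServiceUtil.getArticle(
                    themeDisplay.getScopeGroupId(), "MY-ARTICLE-ID");

            // Raw article content is XML; to get rendered HTML you may need
            // to run it through the article's template (e.g. JournalContentUtil).
            String content = article.getContentByLocale(themeDisplay.getLanguageId());

            // Apply your custom filtering/permission rules here.
            PrintWriter out = response.getWriter();
            out.print(content);
        } catch (Exception e) {
            throw new PortletException(e);
        }
    }
}
```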
That being said, you might also look at the AssetPublisher portlet. It's the Swiss Army knife of content management and might already do what you want (and a lot more). It takes criteria and evaluates them at runtime, displaying matching articles (or other content types).
Since the developer documentation for Heritrix 3.x is largely out of date (most of it pertains to Heritrix 1.x, as many of the classes have been changed or significantly rewritten/refactored), could anyone point me to the relevant class (or classes) of the system that deal with the actual web page content extraction?
What I want to do is obtain the content of a web page Heritrix is about to crawl and then apply a classifier to it (analyzing structural features, etc.). I think this functionality may be distributed among the ContentExtractor class and its many subclasses, but what I'm trying to do is locate the point where I have the web page content either in its entirety or in a readable/parseable stream. Where is the content (the HTML) that Heritrix applies regular expressions to (in order to find links, certain file types, etc.)?
I suggest looking into a custom WriterProcessor. I wrote a custom MirrorWriter that looks at the incoming data and writes files to different locations as they come in, for later post-processing. The code for the MirrorWriter class is rather straightforward and well commented.
The documentation is here: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/writer/MirrorWriterProcessor.html
If you are dead set on pre-processing, you can extend org.archive.modules.extractor.ExtractorHTML and do an on-the-fly version: http://builds.archive.org:8080/javadoc/heritrix-3.1.0/org/archive/modules/extractor/ExtractorHTML.html
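Since both of those are Processor subclasses, another option is to drop a small custom Processor of your own into the chain (configured in crawler-beans.cxml). A hedged sketch - method names are taken from the 3.1 javadocs, so double-check them against your version, and classify() is of course a placeholder:

```java
import org.archive.io.ReplayCharSequence;
import org.archive.modules.CrawlURI;
import org.archive.modules.Processor;

public class ClassifierProcessor extends Processor {

    @Override
    protected boolean shouldProcess(CrawlURI uri) {
        // Only look at documents that were actually fetched as HTML.
        String mime = uri.getContentType();
        return uri.getRecorder() != null && mime != null && mime.startsWith("text/html");
    }

    @Override
    protected void innerProcess(CrawlURI uri) throws InterruptedException {
        ReplayCharSequence content = null;
        try {
            // This is the same replay buffer ExtractorHTML runs its regexes over.
            content = uri.getRecorder().getContentReplayCharSequence();
            classify(uri.getUURI().toString(), content.toString());
        } catch (Exception e) {
            // log and continue in real code
        } finally {
            if (content != null) {
                try { content.close(); } catch (Exception ignored) { }
            }
        }
    }

    private void classify(String url, String html) {
        // plug your structural-feature classifier in here
    }
}
```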
I'm looking for an open-source web crawler written in Java which, in addition to the usual web crawler features (depth limits, multi-threading, etc.), has the ability to handle each file type in a custom way.
To be more precise, when a file is downloaded (or is about to be downloaded), I want to handle the saving of the files myself. The HTML files should be saved in one repository, images in another location, and other files somewhere else. Also, the repository might not be just a simple file system.
I've heard a lot about Apache Nutch. Does it have the ability to do this? I'm looking to achieve this as simply and quickly as possible.
Based on the assumption that you want a lot of control over how the crawler works, I would recommend crawler4j. There are many examples, so you can get a quick glimpse of how things work.
You can easily handle resources based on their content type (take a look at the Page class - it is the class of the object that contains information about a fetched resource).
There are no limitations regarding the repository. You can use anything you wish.
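A hedged sketch of what that could look like - the save methods are placeholders for whatever repositories you end up using:

```java
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;

public class RoutingCrawler extends WebCrawler {

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String contentType = page.getContentType(); // e.g. "text/html; charset=UTF-8"

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            saveHtml(url, html.getHtml());
        } else if (contentType != null && contentType.startsWith("image/")) {
            saveImage(url, page.getContentData());
        } else {
            saveOther(url, page.getContentData());
        }
    }

    private void saveHtml(String url, String html) { /* HTML repository */ }
    private void saveImage(String url, byte[] data) { /* image repository */ }
    private void saveOther(String url, byte[] data) { /* everything else */ }
}
```

Note that to receive binary content such as images you'll also need to enable it in the crawl config (crawlConfig.setIncludeBinaryContentInCrawling(true)).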
I'm trying to use the MPXJ library to get fields from an MS Project .mpp file. I managed to get the tasks and resources. My file contains additional fields like start date, end date, comments, etc. Can anyone help me extract these fields?
Thanks in advance :)
You may find it useful to take a look at the notes in the "getting started" section of the MPXJ web site. To summarise briefly, data from Microsoft Project, and other project planning tools, typically consists of a top level project, tasks, resources, and assignments (which link tasks and resources together).
This is pretty much how MPXJ represents the data read from a project plan. The attributes of each of these objects can be set or retrieved using the relevant set and get methods on each object. So, for example, the Task object in MPXJ exposes setStart() and getStart() methods to allow you to work with the task start date. The method names follow the names used for the attributes in Microsoft Project, so hopefully you will find it straightforward to locate the attributes you need. You may also find the API documentation helpful in this respect.
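For example, a short sketch of pulling those fields out of an .mpp file (this assumes the MPXJ 4.x-era API where getAllTasks() returns every task; in Microsoft Project terms the "comments" end up in the task notes):

```java
import net.sf.mpxj.ProjectFile;
import net.sf.mpxj.Task;
import net.sf.mpxj.mpp.MPPReader;

public class MppDump {

    public static void main(String[] args) throws Exception {
        ProjectFile project = new MPPReader().read("example.mpp");

        for (Task task : project.getAllTasks()) {
            System.out.println(task.getName()
                    + " start=" + task.getStart()    // start date
                    + " finish=" + task.getFinish()  // end date
                    + " notes=" + task.getNotes());  // "comments"
        }
    }
}
```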
I'm trying to build (or find an existing one I can use) a web filter that will compress a JavaScript file at runtime. I've tried building one based on YUICompressor, but I'm getting weird errors out of it when I try to pass a String-based source into it instead of an actual file.
Now, I'm expecting to get bombarded with responses like 'real-time compression/minification is a bad idea', but there is a reason I don't want to do it at build time.
I've got a JavaScript web application that lazy-loads its JavaScript. It will only load what it actually needs. The JavaScript files can specify dependencies, and I already have a filter that will concatenate the requested files, plus any dependencies not already loaded, into a single response. This means there is a large number of different combinations of JavaScript that can be sent to the user, which makes trying to build all the bundles at build time impractical.
So, to restate: ideally I'm looking for an existing real-time JavaScript compression filter I can just plug into my app.
If one doesn't exist, I'm looking for tips on what I can use as building blocks. YUICompressor hasn't quite got me there, and Google Closure seems to only be a web API.
Cheers,
Peter
Take a look at The JavaScript Minifier from Douglas Crockford. The source is here: JSMin.java. It's not a filter and only contains the code to minify. We've made it into a filter where we combine and minify JavaScript on the fly as well. It works well, especially if you have browsers and a CDN caching the results.
Update: I left out that we cache them on the server too. They're only regenerated if any of the assets required to make the combined and minified output have changed. Basically, instead of "compiling" them at build time, we handle each combination once at runtime.
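A hedged sketch of such a filter, assuming the common Java port of JSMin with a JSMin(InputStream, OutputStream) constructor and a jsmin() method; server-side caching is omitted here but, as noted, essential:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletOutputStream;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpServletResponseWrapper;

public class JsMinFilter implements Filter {

    public void init(FilterConfig config) { }

    public void destroy() { }

    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        // Capture whatever the downstream filters/servlets write.
        final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        final PrintWriter bufferWriter = new PrintWriter(new OutputStreamWriter(buffer, "UTF-8"));
        HttpServletResponseWrapper wrapper = new HttpServletResponseWrapper((HttpServletResponse) res) {
            @Override
            public ServletOutputStream getOutputStream() {
                return new ServletOutputStream() {
                    @Override
                    public void write(int b) {
                        buffer.write(b);
                    }
                };
            }

            @Override
            public PrintWriter getWriter() {
                return bufferWriter;
            }
        };

        chain.doFilter(req, wrapper); // e.g. the concatenating filter runs here
        bufferWriter.flush();

        byte[] original = buffer.toByteArray();
        ByteArrayOutputStream minified = new ByteArrayOutputStream();
        try {
            // JSMin is the class from the JSMin.java source linked above.
            new JSMin(new ByteArrayInputStream(original), minified).jsmin();
        } catch (Exception e) {
            // On any minifier hiccup, fall back to the unminified source.
            minified.reset();
            minified.write(original, 0, original.length);
        }
        res.setContentLength(minified.size());
        minified.writeTo(res.getOutputStream());
    }
}
```

Map it to *.js in web.xml so it wraps the concatenating filter, and key the cache on the requested file combination so each bundle is only minified once.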
> I already have a filter that will concatenate the requested files and any dependencies not already loaded into a single response
Sooo.. Why not just minify those prior to loading/concatenating them?
(And yes, compression on the fly is horribly expensive and definitely not worth doing for every .js file served up)