I have a Java based web-application and a new requirement to allow Users to place variables into text fields that are replaced when a document or other output is produced. How have others gone about this?
I was thinking of having a pre-defined set of variables such as :
#BOOKING_NUMBER#
#INVOICE_NUMBER#
Then when a user enters some text they can specify a variable inline (select it from a modal or similar). For example:
"This is some text for Booking #BOOKING_NUMBER# that is needed by me"
When producing some output (eg. PDF) that uses this text, I would do a regex and find all variables and replace them with the correct value:
"This is some text for Booking 10001 that is needed by me"
My initial thought was something like Freemarker but I think that is too complex for my Users and would require them to know my DataModel (eww).
Thanks for reading!
D.
Have a look at java.text.MessageFormat - particularly the format method - as this is designed for exactly what you are looking for.
i.e.
MessageFormat.format("This is some text for booking {0} that is needed by me, for use with invoice {1}", bookingNumber, invoiceNumber);
You may even want to get the template text from a resource bundle, to allow for support of multiple languages, with the added ability to cope with the fact that {0} and {1} may appear in a different order in some languages.
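As a rough sketch of the reordering point (the two templates and the class name are made up for illustration; in practice the templates would come from a resource bundle), the same arguments can be fed to patterns that use {0} and {1} in a different order:

```java
import java.text.MessageFormat;

final class BookingMessages {
    // Hypothetical templates; a real application would load these from a resource bundle.
    static final String BOOKING_FIRST = "Booking {0} is billed on invoice {1}.";
    static final String INVOICE_FIRST = "Invoice {1} covers booking {0}.";

    static String render(String template, String bookingNumber, String invoiceNumber) {
        // The numbers are passed as Strings so that MessageFormat does not apply
        // locale-specific number formatting (for example digit grouping) to them.
        return MessageFormat.format(template, bookingNumber, invoiceNumber);
    }
}
```

Note that a single quote is an escape character in MessageFormat patterns, so user-entered templates containing apostrophes would need them doubled.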
UPDATE:
I just read your original post properly, and noticed the comment about the PDF.
This suggests that the template text is going to be significantly larger than a line or two.
In such cases, you may want to explore something like StringTemplate which seems better suited for this purpose - this comment is based solely on initial investigations, as I've not used it in anger.
I have used a similar replacement token system before. I personally like something like:
[MYVALUE]
As it is easy for the user to type, and then I just use replacements to swap out the tokens for the real data.
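A minimal sketch of that kind of replacement, assuming the #NAME# syntax from the question (the class name and the decision to leave unknown tokens untouched are my own choices; the [MYVALUE] style would only need a different pattern):

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TokenReplacer {
    // Matches tokens such as #BOOKING_NUMBER#: an upper-case name between two hashes.
    private static final Pattern TOKEN = Pattern.compile("#([A-Z][A-Z0-9_]*)#");

    static String replace(String text, Map<String, String> values) {
        Matcher m = TOKEN.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            // Unknown tokens are left as typed so the user can spot them in the output.
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement stops '$' and '\' in the value from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The map only has to contain the pre-defined variable names, so users never see the underlying data model.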
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the
original content.
I have to store the index of the previously existing markups.
So here's a example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence with 35 characters, marked up with a strong tag. As we know, an HTML markup has a start and an end, and if we interpret the start and end markup as a sequence of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember with this sequence the places where I could enter the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anyone has not understood my question, please comment that I will do it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not mind whether or not I use regular expressions to solve this problem, I just want to solve it, no matter how (but of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit now to explain the applicability of my problem (as asked). Well, before I start, I want to say that what I'm trying to do is something I've never done before, it's not something in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but can not edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content present (with some stylization). This is the big summary.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the stylized text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the paragraph back in the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In X paragraph, from Y to Z index, the user P defined a ABC
stylization.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
Regular expressions cannot match arbitrarily deep nesting, so a regex on its own cannot parse such a language.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
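For the index bookkeeping itself, a small hand-written scanner is enough. The sketch below is not an HTML parser: it assumes every tag is a plain `<...>` run with no `>` inside attribute values, and it ignores comments, scripts and entities. The class and method names are invented for illustration. It records, for each removed tag, the offset in the plain text where it sat, and can put the tags back:

```java
import java.util.ArrayList;
import java.util.List;

final class TagStripper {
    /** A removed tag and the offset in the plain text where it used to sit. */
    static final class Mark {
        final int offset;
        final String tag;
        Mark(int offset, String tag) { this.offset = offset; this.tag = tag; }
    }

    private final StringBuilder text = new StringBuilder();
    final List<Mark> marks = new ArrayList<>();

    static TagStripper strip(String html) {
        TagStripper result = new TagStripper();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            int close = (c == '<') ? html.indexOf('>', i) : -1;
            if (close >= 0) {
                // Record the tag against the current length of the plain text.
                result.marks.add(new Mark(result.text.length(), html.substring(i, close + 1)));
                i = close + 1;
            } else {
                result.text.append(c);
                i++;
            }
        }
        return result;
    }

    String plainText() { return text.toString(); }

    String restore() {
        StringBuilder out = new StringBuilder(text);
        // Insert from the last mark to the first so earlier offsets stay valid
        // and tags that share an offset keep their original order.
        for (int k = marks.size() - 1; k >= 0; k--) {
            out.insert(marks.get(k).offset, marks.get(k).tag);
        }
        return out.toString();
    }
}
```

For "This <strong>is a</strong> message." this yields the plain text "This is a message." with marks at offsets 5 and 9, matching the indexes in the question. Because the tags are stored individually rather than as pairs, overlapping markup like the second example round-trips too. For real-world HTML a proper parser would still be the safer choice.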
How would you use SafeHtml in combination with links?
Scenario: Our users can enter unformatted text which may contain links, e.g. I like&love http://www.stackoverflow.com. We want to safely render this text in GWT but make the links clickable, e.g. I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>. Aside from rendering the text in the GWT frontend, we also want to send it via email where the links should be clickable as well.
So far, we considered the following options:
Store the complete text as HTML in the backend and let the frontend assume it's correctly encoded (I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>) -> Introduces XSS vulnerabilities
Store plain text but the links as HTML (I like&love <a href="http://www.stackoverflow.com">stackoverflow.com</a>) in the backend and use HtmlSanitizer in the frontend
Store plain text and special encoding for the links (I like&love [stackoverflow.com|http://www.stackoverflow.com]) in the backend and use a custom SafeHtml generator in the frontend
To us, the third option looks the cleanest but it seems to require the most custom code since we can't leverage GWT's SafeHtml infrastructure.
Could anybody share how to best solve the problem? Is there another option that we didn't consider so far?
Why not store the text exactly as it was entered by the user, and perform any special treatment when transforming it for the output (e.g. for sending emails, creating PDFs, ...). This is the most natural approach, and you won't have to undo any special treatment e.g. when you offer the user to edit the string.
As a general rule, I would always perform encoding/escaping/transformation only for the immediate transport/storage/output target. There are very few reasons to deviate from this rule, one of them may be performance, e.g. caching a transformed value in the DB. (In these cases, I think it's best to give the DB field a specific name like 'text_htmltransformed' - this avoids 'overescaping', which can be just as harmful as no escaping.)
Note: Escaping/encoding is no replacement for input validation.
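To illustrate transforming only at output time, here is a plain-Java sketch (no GWT classes; the class name, the URL pattern and the choice to show the full URL as link text are assumptions of mine). The stored text stays exactly as the user typed it, and the HTML is produced when rendering or mailing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Linkifier {
    // Deliberately simple: only http/https, up to the next whitespace.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static String escapeHtml(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    static String toHtml(String plain) {
        StringBuilder out = new StringBuilder();
        Matcher m = URL.matcher(plain);
        int last = 0;
        while (m.find()) {
            // Everything between links is escaped as ordinary text.
            out.append(escapeHtml(plain.substring(last, m.start())));
            // The URL is escaped too, both as attribute value and as link text.
            String url = escapeHtml(m.group());
            out.append("<a href=\"").append(url).append("\">").append(url).append("</a>");
            last = m.end();
        }
        out.append(escapeHtml(plain.substring(last)));
        return out.toString();
    }
}
```

In GWT the resulting string would still have to be wrapped in a SafeHtml instance. The pattern also swallows trailing punctuation after a URL, which a production version would want to trim.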
Imagine that I am building a hashtag search. My main indexed type is called Post, which has a list of Hashtag items, which are marked as IndexedEmbedded. Separately, every post has a list of Comment objects, each of which, again, contains a list of Hashtag objects.
On the search side, I am using a MultiFieldQueryParser, to which I pass a long list of possible search fields, including some nested fields like:
hashTags.value and
comments.hashTags.value
Now, the interesting thing happens when I want to search for something, say #architecture. I figure out where the hashtags are, so the simplest logical thing to do would be to convert a query of the type #architecture into one of the type hashTags.value:architecture or comments.hashTags.value:architecture. Although possible, this is very inflexible. What if I come up with yet another field that contains hashtags? I'd have to include that too.
Is there a general way to do this?
P.S. Please have in mind that the root type I am searching for is Post, because this is the kind of results I'd like to achieve
Hashtags are keywords, and you should let Lucene handle the text analysis to extract the hashtags from your main text and store them in a custom field.
You can do this very easily with Hibernate Search by defining your text to be indexed in two different @Field (using the @Fields annotation). You could have one field named comments and another commentsHashtags.
You then apply a custom Analyser to commentsHashtags which does some standard tokenization and discards any term not starting with #; you can define one easily by taking the standard tokenizer and applying a custom filter.
When you run a query, you don't have to write custom code to look for hashtags in the query input. Let it be processed by the same Analyser (which is the default anyway) and target both fields; you can even boost the hashtags more if that makes sense.
With this solution you
take advantage of the high efficiency of Search's text analysis
avoid entities and tables on the database containing the hashtags: useless overhead
avoid messing with free text extraction
It gets you another strong win point:
you can then open a raw IndexReader and load the termvector from commentsHashtags to get both a list of all used tags, and metrics about them. Cool to do some data mining, or just visualize a tag cloud.
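The filtering step can be pictured with plain Java. This is only a sketch of the idea, not the Lucene or Hibernate Search analyser API (a real setup would put the same logic into a token filter); the class name and the decision to index the tag without its leading # and in lower case are assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class HashtagTerms {
    static List<String> extract(String text) {
        List<String> tags = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            // Drop trailing punctuation such as the comma in "#design,".
            String term = token.replaceAll("[^\\p{L}\\p{N}_#]+$", "");
            // Keep only terms that start with '#', as the custom filter would.
            if (term.length() > 1 && term.startsWith("#")) {
                tags.add(term.substring(1).toLowerCase(Locale.ROOT));
            }
        }
        return tags;
    }
}
```

Running the same extraction over the user's query input and over the indexed text is what keeps the two sides consistent.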
Instead of treating the fields as different and the top-level document as Post, why not store both Posts and Comments as Lucene documents? That way, you can just have a single field called "hashtags" that you search. You should also have a field called "type" or something to differentiate between comments and posts.
Search results may be either comments or posts. You can filter by type if users want to search only posts or comments. Or you can show them differently in your UI.
If you want to add another concept that also uses hashtags (like ... I dunno... splanks or whatever silly name we all give to Internet communications in the future), then you can add it alongside the existing Post and Comment documents simply by indexing your new type with a "hashtags" field. You'll have to do plenty of work to add the splanks, anyway, so adding a handler for that new type of search result shouldn't be too much of an inconvenience.
I've been given the (rather daunting) task of introducing i18n to a J2EE web application using the 2.3 servlet specification. The application is very large and has been in active development for over 8 years.
As such, I want to get things right the first time so I can limit the amount of time I need to scrawl through JSPs, JavaScript files, servlets and wherever else, replacing hard-coded strings with values from message bundles.
There is no framework being used here. How can I approach supporting i18n? Note that I want to have a single JSP per view that loads text from (a) properties file(s) and not a different JSP for each supported locale.
I guess my main question is whether I can set the locale somewhere in the 'backend' (i.e. read locale from user profile on login and store value in session) and then expect that the JSP pages will be able to correctly load the specified string from the correct properties file (i.e. from messages_fr.properties when the locale is set to French) as opposed to adding logic to find the correct locale in each JSP.
Any ideas how I can approach this?
There are a lot of things that need to be taken care of while internationalizing an application:
Locale detection
The very first thing you need to think about is how to detect the end user's Locale. Depending on what you want to support it might be easy or a bit complicated.
1. As you surely know, web browsers tend to send the end user's preferred language via the HTTP Accept-Language header. Accessing this information in the Servlet might be as simple as calling request.getLocale(). If you are not planning to support any fancy Locale Detection workflow, you might just stick to this method.
2. If you have User Profiles in your application, you might want to add Preferred Language and Preferred Formatting Locale to it. In that case you would need to switch Locale after the user logs in.
3. You might want to support URL-based language switching (for example: http://deutsch.example.com/ or http://example.com?lang=de). You would need to set a valid Locale based on URL information - this could be done in various ways (e.g. a URL Filter).
4. You might want to support language switching (selecting it from a drop-down menu, or something), however I would not recommend it (unless it is combined with point 3).
The JSTL approach could be sufficient if you just want to support the first method or if you are not planning to add any additional dependencies (like Spring Framework).
While we are on the subject of Spring Framework, it has quite a few nice features that you can use both to detect Locale (like CookieLocaleResolver, AcceptHeaderLocaleResolver, SessionLocaleResolver and LocaleChangeInterceptor) and to externalize strings and format messages (see the spring:message tag).
Spring Framework would allow you to quite easily implement all the scenarios above and that is why I prefer it.
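Without Spring, the first two detection methods can be sketched with the standard library alone. The class below is hypothetical: it takes the language tag stored in the user profile (if any) and the raw Accept-Language header value, and picks one of the locales the application actually has translations for. In a servlet the header would come from request.getHeader("Accept-Language") and the result would be stored in the session.

```java
import java.util.List;
import java.util.Locale;

final class LocaleNegotiator {
    static Locale resolve(String profileLanguageTag, String acceptLanguage,
                          List<Locale> supported, Locale fallback) {
        // 1. A language saved in the user profile wins if we support it.
        if (profileLanguageTag != null && !profileLanguageTag.isEmpty()) {
            Locale fromProfile = Locale.forLanguageTag(profileLanguageTag);
            if (supported.contains(fromProfile)) {
                return fromProfile;
            }
        }
        // 2. Otherwise use the browser's weighted list of preferences.
        if (acceptLanguage != null && !acceptLanguage.isEmpty()) {
            try {
                List<Locale.LanguageRange> ranges = Locale.LanguageRange.parse(acceptLanguage);
                Locale match = Locale.lookup(ranges, supported);
                if (match != null) {
                    return match;
                }
            } catch (IllegalArgumentException malformedHeader) {
                // Fall through to the default for headers that cannot be parsed.
            }
        }
        // 3. Nothing matched: use the application default.
        return fallback;
    }
}
```

Locale.lookup truncates ranges such as fr-CH down to fr, so a Swiss French browser still gets the French bundle when only fr is supported.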
String externalization
This is something that should be easy, right? Well, mostly it is - just use appropriate tag. The only problem you might face is when it comes to externalizing client-side (JavaScript) texts. There are several possible approaches, but let me mention these two:
Have each JSP write an array of translated strings (with the message tag) and simply access that array in client code. This is the easier approach but less maintainable - you would need to actually write valid strings from valid pages (the ones that actually reference your client-side scripts). I have done that before and believe me, this is not something you want to do in a large application (but it is probably the best solution for a small one).
Another approach may sound hard in principle but it is actually way easier to handle in the future. The idea is to centralize strings on client side (move them to some common JavaScript file). After that you would need to implement your own Servlet that will return this script upon request - the contents should be translated. You won't be able to use JSTL here, you would need to get strings from Resource Bundles directly.
It is much easier to maintain, because you would have one, central point to add translatable strings.
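The core of such a Servlet is turning the translated strings into a script. A minimal sketch of that part, with invented names and a plain Map standing in for the keys and values read from the Resource Bundle of the user's locale:

```java
import java.util.Map;
import java.util.TreeMap;

final class ClientMessages {
    /** Builds a script that defines a global "messages" object (the name is an assumption). */
    static String toScript(Map<String, String> translations) {
        StringBuilder js = new StringBuilder("var messages = {");
        boolean first = true;
        // TreeMap gives a stable key order, which keeps the output cache-friendly.
        for (Map.Entry<String, String> e : new TreeMap<>(translations).entrySet()) {
            if (!first) {
                js.append(',');
            }
            js.append(quote(e.getKey())).append(':').append(quote(e.getValue()));
            first = false;
        }
        return js.append("};").toString();
    }

    /** Quotes a value as a JavaScript string literal, escaping anything risky. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '<': sb.append("\\u003c"); break;
                default:
                    if (c < 0x20 || c > 0x7e) {
                        // Non-ASCII characters are written as escapes so the page encoding cannot break them.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }
}
```

The Servlet would write this string to the response with a JavaScript content type, and client code would then read messages.someKey instead of hard-coded text.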
Concatenations
I hate to say it, but concatenations are really painful from a Localizability perspective. They are very common and most people don't realize it.
So what is concatenation then?
In principle, each English sentence needs to be translated into the target language. The problem is that a correctly translated message often uses a different word order than its English counterpart (so English "Security policy" is translated to Polish "Polityka bezpieczeństwa" - "policy" is "polityka" - the order is different).
OK, but how is it related to software?
In web application you could concatenate Strings like this:
String securityPolicy = "Security " + "policy";
or like this:
<p><span style="font-weight:bold">Security</span> policy</p>
Both would be problematic. In the first case you would need to use MessageFormat.format() method and externalize strings as (for example) "Security {0}" and "policy", in the latter you would externalize the contents of the whole paragraph (p tag), including span tag. I know that this is painful for translators but there is really no better way.
Sometimes you have to use dynamic content in your paragraph - the JSTL fmt:message tag (with nested fmt:param tags) will help you here as well (it works like MessageFormat on the backend side).
Layouts
In localized application, it often happens that translated strings are way longer than English ones. The result could look very ugly. Somehow, you would need to fix styles. There are again two approaches:
Fix issues as they happen by adjusting common styles (and pray that it won't break other languages). This is very painful to maintain.
Implement CSS Localization Mechanism. The mechanism I am talking about should serve default, language-independent CSS file and per-language overrides. The idea is to have override CSS file for each language, so that you can adjust layouts on-demand (just for one language). In order to do that, default CSS file, as well as JSP pages must not contain !important keyword next to any style definitions. If you really have to use it, move them to language-based en.css - this would allow other languages to modify them.
Culture specific issues
Avoid using graphics, colors and sounds that might be specific for western culture. If you really need it, please provide means of Localization. Avoid direction-sensitive graphics (as this would be a problem when you try to localize to say Arabic or Hebrew). Also, do not assume that whole world is using the same numbers (i.e. not true for Arabic).
Dates and time zones
Handling dates and times in Java is, to say the least, not easy. If you are not going to support anything other than the Gregorian Calendar, you could stick to the built-in Date and Calendar classes.
You can use JSTL fmt:timeZone, fmt:formatDate and fmt:parseDate to correctly set time zone, format and parse date in JSP.
I strongly suggest to use fmt:formatDate like this:
<fmt:formatDate value="${someController.somedate}"
timeZone="${someController.detectedTimeZone}"
dateStyle="default"
timeStyle="default" />
It is important to convert date and time to the valid (end user's) time zone. It is also quite important to convert it to an easily understandable format - that is why I recommend the default formatting style.
BTW. Time zone detection is not something easy, as web browsers are not so nice to send anything. Instead, you can either add preferred time zone field to User preferences (if you have one) or get current time zone offset from web browser via client side script (see Date object's methods)
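The backend equivalent of the fmt:formatDate call above looks roughly like this (a sketch with an invented class name; the locale and time zone are whatever you detected for the user):

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class UserDateFormatter {
    static String format(Date instant, Locale locale, TimeZone zone) {
        // DEFAULT style lets the locale decide field order, separators and 12/24 hour clock.
        DateFormat df = DateFormat.getDateTimeInstance(DateFormat.DEFAULT, DateFormat.DEFAULT, locale);
        // The same instant is shown in the end user's time zone, not the server's.
        df.setTimeZone(zone);
        return df.format(instant);
    }
}
```

The exact text produced depends on the locale data shipped with the JDK, so it is best not to compare the output against hard-coded strings.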
Numbers and currencies
Numbers as well as currencies should be converted to local format. It is done in the similar way to formatting dates (parsing is also done similarly):
<fmt:formatNumber value="1.21" type="currency"/>
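On the Java side the same thing can be done with NumberFormat. A small sketch (the class name is made up):

```java
import java.text.NumberFormat;
import java.util.Locale;

final class MoneyFormatter {
    static String format(double amount, Locale locale) {
        // The locale decides the currency symbol, its position and the decimal separator.
        return NumberFormat.getCurrencyInstance(locale).format(amount);
    }
}
```

Keep in mind that this only changes how the number is displayed. It does not convert the amount between currencies, so the locale used here should match the currency the value is actually stored in.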
Compound messages
You have already been warned not to concatenate strings. Instead you would probably use MessageFormat. However, I must state that you should minimize the use of compound messages. That is because target grammar rules are quite commonly different, so translators might need not only to re-order the sentence (this would be resolved by using placeholders and MessageFormat.format()), but also to translate the whole sentence in a different way based on what will be substituted. Let me give you some examples:
// Multiple plural forms
English: 4 viruses found.
Polish: Znaleziono 4 wirusy. **OR** Znaleziono 5 wirusów.
// Conjugation
English: Program encountered incorrect character | Application encountered incorrect character.
Polish: Program napotkał nieznaną literę | Aplikacja napotkała nieznaną literę.
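To show why one placeholder is not enough for the plural case, here is a sketch of the selection a Polish message needs. The rule is written from my understanding of the usual Polish plural categories for whole, non-negative numbers, and the class name and category labels are assumptions; a translator would then supply one full sentence per category.

```java
final class PolishPlurals {
    /** Returns "one", "few" or "many" for a non-negative whole number. */
    static String category(int n) {
        int mod10 = n % 10;
        int mod100 = n % 100;
        if (n == 1) {
            return "one";
        }
        // 2-4, 22-24, 32-34 ... take the "few" form, but 12-14 do not.
        if (mod10 >= 2 && mod10 <= 4 && !(mod100 >= 12 && mod100 <= 14)) {
            return "few";
        }
        return "many";
    }
}
```

With the examples above, 4 falls into "few" ("Znaleziono 4 wirusy.") and 5 into "many" ("Znaleziono 5 wirusów."), so the message bundle needs a separate entry for each category rather than a single "{0} viruses found" pattern.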
Character encoding
If you are planning to Localize into languages that are not covered by the ISO 8859-1 code page, you would need to support Unicode - the best way is to set page encoding to UTF-8. I have seen people doing it like this:
<%@ page contentType="text/html; charset=UTF-8" %>
I must warn you: this is not enough. You actually need this declaration:
<%@ page pageEncoding="UTF-8" %>
Also, you would still need to declare encoding in the page header, just to be on the safe side:
<META http-equiv="Content-Type" content="text/html;charset=UTF-8">
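If you want to check whether a given translation fits into ISO 8859-1 at all, the standard charset encoders can tell you. A quick sketch (class and method names are made up):

```java
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    static boolean fitsLatin1(String s) {
        // False as soon as one character has no ISO 8859-1 representation.
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    static boolean fitsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }
}
```

The Polish "bezpieczeństwa" from the earlier example already fails the ISO 8859-1 check because of the letter ń, while UTF-8 handles it.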
The list I gave you is not exhaustive but this is good starting point. Good luck :)
You can do exactly this using the JSTL standard tag library's i18n (fmt) tags. Grab a copy of the JSTL specification and read the i18n chapters, which discuss general text plus date, time and currency formatting. It is very clearly written and shows you how you can do it all with tags. You can also set things like the Locale programmatically.
You don't (and shouldn't) need to have a separate JSP file per locale. The hard task is to figure out the keys that aren't i18n-ed and move them to a file per locale, say, messages_en.properties, messages_fr.properties and so on.
Locale calculation can happen in multiple places depending on your logic. We support user locales stored in a database as well as the browser locale. Every request that comes into your application will have an "Accept-Language" header that indicates which languages the browser has been configured with, in order of preference, e.g. Japanese first and then English. If that's the case, the application should read messages_ja.properties and, for keys that are not in that file, fall back to messages_en.properties. The same can hold true for user locales that are stored inside the database. Please note that the standard is just to switch the language in the browser and expect the content to be i18n-ed. (We initially started with storing the locale in the database and then moved to support locales from the browser.) Also, you will need a default anyway, as translators miss copying keys and values from English (the main language file) to other languages, so you will need to default to English for values that are not in other files.
I've also found mygengo very useful when giving translation jobs to other people who know a particular language; it's saved us a lot of time.
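The key-by-key fallback can be sketched with java.util.Properties, whose constructor accepts a set of defaults. The class name and the sample keys are invented for illustration, and the file contents are passed in as strings so the example stands on its own:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class FallbackMessages {
    /** Parses properties-file content; missing keys are looked up in "defaults". */
    static Properties load(String content, Properties defaults) {
        Properties messages = new Properties(defaults);
        try {
            messages.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return messages;
    }
}
```

Loading the English file first and passing it as the defaults for the Japanese one means getProperty returns the English text for any key the translators have not copied over yet. ResourceBundle applies a similar parent chain on its own, falling back to the base bundle without a locale suffix.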
What is the best way to detect data types inside html page using Java facilities DOM API, regexp, etc?
I'd like to detect types like skype plugin does for the phone/skype numbers, similar for addresses, emails, time, etc.
'Types' is an inappropriate term for the kind of information you are referring to. Choice of DOM API or regex depends upon the structure of information within the page.
If you know the structure, (for example tables being used for displaying information, you already know from which cell you can find phone number and which cell you can find email address), it makes sense to go with a DOM API.
Otherwise, you should use regex on plain HTML text without parsing it.
I'd use regexes in the following order:
Extract only the BODY content
Remove all tags to leave just plain text
Match relevant patterns in text
Of course, this assumes that markup isn't providing hints, and that you're purely extracting data, not modifying page context.
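The three steps could look like this in Java. It is a rough sketch: the class name is made up, the email pattern is a simplified stand-in for whatever patterns you need (phone numbers, addresses, times), and the tag-stripping regex assumes reasonably tidy markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlainTextExtractor {
    private static final Pattern BODY =
            Pattern.compile("<body[^>]*>(.*?)</body>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    // Simplified email pattern, for illustration only.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    static List<String> findEmails(String html) {
        // Step 1: extract only the BODY content (fall back to the whole input).
        Matcher body = BODY.matcher(html);
        String content = body.find() ? body.group(1) : html;
        // Step 2: remove all tags, leaving a space so adjacent words do not merge.
        String text = TAG.matcher(content).replaceAll(" ");
        // Step 3: match the relevant pattern in the remaining plain text.
        List<String> found = new ArrayList<>();
        Matcher m = EMAIL.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Other data types only need another pattern in step 3; the first two steps stay the same.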
Hope this helps,
Phil Lello