Static Analysis tool to detect Internationalization issues - java

Are there any tools (free/commercial) that can audit an application for internationalization? (or localization-readiness, if you prefer)
Primarily interested in:
Multilingual Implementation tests
Examples:
* [javascript] alert('Oops wrong choice!');
* [java] String msg = resourcebundle.getString("key.x").concat("4");
* [jdbc] String query=".. order by abc"; //should be NLS_SORT or equiv.
Date Implementation tests
Examples:
* SimpleDateFormat used without Locale
* Apache's DateFormatUtils used
Numeric Implementation tests
Examples:
* NumberFormat used without Locale
javascript-validation tests
Examples:
* [javascript] checkIsDecimal { //decimal point checked against "." }
* [javascript] hardcoded character range [A-z]
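To make the date/number cases concrete, here is the kind of contrast such a checker should catch (illustrative Java only, not from a real codebase):

    import java.text.DateFormat;
    import java.text.NumberFormat;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;

    public class LocaleExamples {
        void examples(Locale userLocale) {
            // Should be flagged: pattern and symbols are tied to the JVM default locale
            String badDate = new SimpleDateFormat("MM/dd/yyyy").format(new Date());
            String badNumber = NumberFormat.getInstance().format(1234.56);

            // Should pass: formatting is driven by the end user's locale
            String goodDate = DateFormat.getDateInstance(DateFormat.MEDIUM, userLocale).format(new Date());
            String goodNumber = NumberFormat.getInstance(userLocale).format(1234.56);
        }
    }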
Cheers.

Have a look at Globalyzer - http://lingoport.com/globalyzer - as it is just that: a tool for performing static analysis on code specifically for internationalization, and it works with a variety of programming languages. It supports detection and correction of embedded strings (with string externalization capabilities), of potentially locale-limiting methods/functions/classes depending upon the programming language and requirements, and of other issues such as problematic programming patterns and embedded images. There are default "rule sets" which give you a good start, and then you can customize your rules for both detection and filtering of issues. There's also an underlying database that helps you tag and keep track of i18n issues as you work with them. A server component lets you create and share your rule sets with your team members, while desktop and command-line clients run locally on your machine to analyze your source, so you're not sending any code or reports off your local machine.

Based on your examples, you mostly want to diagnose functions that produce output whose input isn't somehow internationalized.
For the alert case, you want to find any print call that acquires a string not produced by one of possibly several well-known translation routines. For the JDBC case, you want to identify ordering constraints that are not locale-specific. For the various date cases, you want date routines that are known to produce locale-specific answers.
The intent behind the JavaScript validation is harder to guess at; presumably you want to diagnose functions that are known to be wired to a particular locale, which seems a lot like the date case. For range checks, you want to capture anything that compares one character to another for less-than or greater-than.
For the wired-locale functions, it seems just knowing their name would be enough (although perhaps there has to be some overload resolution, e.g., by number of arguments), so NumberFormat(?,?) is bad and NumberFormat(?,?,?) is OK.
Why can't you write a regular expression to look (heuristically) for the bad cases? For the range case, you just need to recognize expressions of the form [exp] < [literal-char] or [exp] < [literal-string]; a regexp looking for just "< '.+" would seem adequate. Are there common cases that these would miss?
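For illustration, a minimal Java sketch of such a heuristic, regex-based scan (the patterns and file handling are assumptions for demonstration, not a complete checker):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.regex.Pattern;

    public class I18nHeuristicScan {
        // Flag character/string comparisons such as: if (c < 'a') or x < "Z"
        private static final Pattern HARDCODED_COMPARE = Pattern.compile("<\\s*['\"].+?['\"]");
        // Flag NumberFormat factory calls with an empty argument list (no Locale passed)
        private static final Pattern NO_LOCALE_NUMBERFORMAT =
                Pattern.compile("NumberFormat\\.get\\w*Instance\\(\\s*\\)");

        public static void scan(Path file) throws IOException {
            List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
            for (int i = 0; i < lines.size(); i++) {
                String line = lines.get(i);
                if (HARDCODED_COMPARE.matcher(line).find()
                        || NO_LOCALE_NUMBERFORMAT.matcher(line).find()) {
                    System.out.printf("%s:%d: possible i18n issue: %s%n", file, i + 1, line.trim());
                }
            }
        }
    }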
EDIT (from comment below: "I've been using regexp but...")
If you want a tool that is deeper than regexps, you pretty much have to go to language parsing and name/type resolution, and having data flow analysis would be helpful. Since you want to process multiple (computer) languages, the tool has to be multi-lingual capable. And it appears you want to be able to customize it to check for the specific cases relevant to your application.
The DMS Software Reengineering Toolkit has all these properties, including parsers for Java, JavaScript and SQL. It is designed to be customized, so you have to do that in advance of using it.

I have studied IntelliJ IDEA's code analyzers, and it does have the ones you requested. It's a commercial IDE, specialized in Java, but it knows other languages as well.
http://www.jetbrains.com/idea/

Related

How to script input for a Java program

I'm writing a Java program that requires its (technical) users to write scripts that it uses as input; it interprets these scripts into a series of actions and executes them. I am currently looking for the cleanest way to implement the script/configuration language. I was originally thinking of heading down the XML route, but the nature of the required input really is a procedural, linear flow of actions that need to be executed:
function move(Block b, Position p) {
// user-defined algorithm for moving block "b" to position "p"
}
Block a = getBlockA();
Position p = getPositionP();
move(a, p);
Etc. Please note: the above is an example only and does not constitute the exact syntax I am looking to achieve. I am still in the "30,000 ft view" design phase, and don't know what my concrete scripting language will ultimately look like. I only provide this example to show that it is a flow/procedural script that the users must write, and that XML is probably not the best candidate for its implementation.
XML, perfect for hierarchical data, just doesn't feel like the best choice for such an implementation (although I could force it to work if need be).
Not knowing a lick about DSLs, I've begun to read up on Groovy DSLs and they feel like a perfect match for what I need.
My understanding is that I could write, say, a Groovy (I'm stronger in Groovy than Scala, JRuby, etc.) DSL that would allow users to write scripts (.groovy files) that my program could then execute as input at runtime.
Is this correct, or am I misunderstanding the intent of DSLs altogether? If I am mistaken, does anybody have any suggestions for me? And if I am correct then how would a Java program read and execute a .groovy file (in other words, how would my program "consume" their script)?
Edit: I'm beginning to like ANTLR. Although I would love to roll up my sleeves and write a Groovy DSL, I don't want my users to be able to write any old Groovy program they want. I want my own "micro-language" and if users step outside of it I want the interpreter to invalidate the script. It's beginning to seem like Groovy/DSLs aren't the right choice, and maybe ANTLR could be the solution I need...?
I think you are on a really good path. Your users can write their files using your simple DSL, and then you can run them by eval-ing them at runtime. Your biggest challenge will be helping them to use the API of your DSL correctly. Unless they use an IDE, this will be pretty tough.
Equivalent of eval() in Groovy
Yes, you can write a Groovy program that will accept a script as input and execute it. I recently wrote a BASIC DSL/interpreter in this way using groovy :
http://cartesianproduct.wordpress.com/binsic-is-not-sinclair-instruction-code/
(In the end it was more interpreter than DSL but that was to do with a peculiarity of Groovy that likely won't affect you - BASIC insists on UPPER CASE keywords which Groovy finds hard to parse - hence they have to be converted to lower case).
Groovy allows you to extend the script environment in various ways (eg injecting variables into the binding and transferring execution from the current script to a different, dynamically loaded script) which make this relatively simple.
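To make the "consume the script" part concrete, here is a minimal sketch of evaluating a user-supplied .groovy file from Java with an injected binding (the file name and variable names are made up for illustration):

    import groovy.lang.Binding;
    import groovy.lang.GroovyShell;
    import java.io.File;

    public class ScriptRunner {
        public static void main(String[] args) throws Exception {
            // Inject the variables/objects the user script is allowed to use
            Binding binding = new Binding();
            binding.setVariable("board", new Object()); // e.g. your domain API facade

            GroovyShell shell = new GroovyShell(binding);

            // Parse and run the user's script; the return value is the last expression
            Object result = shell.evaluate(new File("userScript.groovy"));
            System.out.println("Script returned: " + result);
        }
    }

Groovy does offer compiler customization hooks to restrict what such scripts can do, but locking them down completely is non-trivial, which is where an ANTLR-defined micro-language becomes attractive.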

Approach for Automating localized Web application in Selenium using Java Bindings

I am automating test cases for a web application using Selenium 2.0 and Java. My application supports multiple languages. Some of the test cases require me to validate the text that appears in the UI, like success/error messages etc. I am using a properties file to store whatever text I am referring to in my tests from the UI, currently only English. For example, there is a locale_english.properties (see below) that contains all references in English.
I am going to have multiple properties files like this for different locales, such as locale_chinese.properties, locale_french.properties and so on. For locales other than English, their corresponding properties file would have Unicode escapes (e.g. \u30ed) representing the native characters (see below). So if I want to test, say, the Chinese UI, I would load "locale_chinese.properties" instead of "locale_english.properties". I am going to convert the native characters for non-English locales using perhaps native2ascii from the JDK or some other way. I tested that the Selenium API works well with UTF-8 characters for non-English locales.
---locale_english.properties------
user.login.error= Please verify username/password
---locale_chinese.properties------
user.login.error= \u30ed\u30ef\u30a4\u30f3\u30ed
and so on.
The problem is that my locale_english.properties is growing and going out of control. It is becoming hard to manage a single properties file for one locale, let alone for multiple locales. Is there a better way of handling localization in Java, particularly in a situation like mine?
Thanks!
You're right that there is a problem managing the files, but you're also right that this is the best approach. Some things are just hard :-(
Selenium (at least the Selenium RC API) does indeed support Unicode input and output; we have lots of tests that enter and confirm Cyrillic and Simplified Chinese characters from C#. Since Java strings are Unicode at the core (just like C#), I expect you could simply create the file in a UTF-8-friendly editor like Notepad++, read the values straight into strings and use them directly in the Selenium API.
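A minimal sketch of reading such a UTF-8 properties file into Java strings (file and key names taken from the question; the rest is illustrative):

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    public class UiMessages {
        public static Properties load(String locale) throws Exception {
            Properties messages = new Properties();
            // Properties.load(Reader) keeps the characters intact if the file is saved as UTF-8,
            // so no native2ascii step is needed for the non-English bundles
            try (Reader reader = new InputStreamReader(
                    new FileInputStream("locale_" + locale + ".properties"), StandardCharsets.UTF_8)) {
                messages.load(reader);
            }
            return messages;
        }
    }

    // Usage in a test, e.g.:
    // assertEquals(UiMessages.load("chinese").getProperty("user.login.error"), actualUiText);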
This is how I solved the issue, for those who are interested.
A database works better for many reasons: growth, a central location, being kept outside of the app, and being editable and maintainable outside of the app. We used a table with these columns:
id (int, auto increment)
id_text -- this and the other columns are varchar, except the last two which are datetime
lang
translation
created_by
updated_by
created_date
updated_date
An id_text is a short English description of the text - like 'hello' or 'error1msg' - i.e. the key in your map.
In Java we had a function to get the text for a particular id, plus an app-level property for the default language (usually en, but it's good to keep it configurable).
The function would scan an already-loaded hashmap for the language asked for - say "ch".
If a corresponding translation was not found for this language we would return the default-language translation, and if that was not found either we would return "[" + id + "]" so the tester knows something is missing in the database - and can go to a web screen to edit the translation table and add it.
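A rough sketch of that lookup-with-fallback function (all names are assumptions, since the original code isn't shown):

    import java.util.Map;

    public class Translations {
        private final Map<String, Map<String, String>> byLang; // lang -> (id_text -> translation)
        private final String defaultLang;

        public Translations(Map<String, Map<String, String>> byLang, String defaultLang) {
            this.byLang = byLang;
            this.defaultLang = defaultLang;
        }

        public String get(String id, String lang) {
            String text = lookup(id, lang);
            if (text == null) {
                text = lookup(id, defaultLang);   // fall back to the default language
            }
            return text != null ? text : "[" + id + "]";  // make missing entries obvious to testers
        }

        private String lookup(String id, String lang) {
            Map<String, String> forLang = byLang.get(lang);
            return forLang == null ? null : forLang.get(id);
        }
    }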

Java web application i18n

I've been given the (rather daunting) task of introducing i18n to a J2EE web application using the 2.3 servlet specification. The application is very large and has been in active development for over 8 years.
As such, I want to get things right the first time so I can limit the amount of time I need to spend trawling through JSPs, JavaScript files, servlets and wherever else, replacing hard-coded strings with values from message bundles.
There is no framework being used here. How can I approach supporting i18n? Note that I want to have a single JSP per view that loads text from (a) properties file(s), not a different JSP for each supported locale.
I guess my main question is whether I can set the locale somewhere in the 'backend' (i.e. read the locale from the user profile on login and store the value in the session) and then expect that the JSP pages will be able to correctly load the specified string from the correct properties file (i.e. from messages_fr.properties when the locale is French), as opposed to adding logic to find the correct locale in each JSP.
Any ideas how I can approach this?
There are a lot of things that need to be taken care of while internationalizing an application:
Locale detection
The very first thing you need to think about is detecting the end user's Locale. Depending on what you want to support, it might be easy or a bit complicated.
As you surely know, web browsers tend to send the end user's preferred language via the HTTP Accept-Language header. Accessing this information in the Servlet might be as simple as calling request.getLocale(). If you are not planning to support any fancy Locale detection workflow, you might just stick to this method.
If you have User Profiles in your application, you might want to add a Preferred Language and a Preferred Formatting Locale to them. In such a case you would need to switch the Locale after the user logs in.
You might want to support URL-based language switching (for example: http://deutsch.example.com/ or http://example.com?lang=de). You would need to set valid Locale based on URL information - this could be done in various ways (i.e. URL Filter).
You might want to support language switching (selecting it from drop-down menu, or something), however I would not recommend it (unless it is combined with point 3).
The JSTL approach could be sufficient if you just want to support the first method or if you are not planning to add any additional dependencies (like Spring Framework).
While we are at it, Spring Framework has quite a few nice features that you can use both to detect the Locale (like CookieLocaleResolver, AcceptHeaderLocaleResolver, SessionLocaleResolver and LocaleChangeInterceptor) and to externalize strings and format messages (see the spring:message tag).
Spring Framework would allow you to quite easily implement all the scenarios above and that is why I prefer it.
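As an illustration of points 1-3, a bare-bones servlet Filter that resolves the Locale per request (the session attribute name and the fallback order are assumptions):

    import java.io.IOException;
    import java.util.Locale;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class LocaleResolverFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;

            Locale locale = null;
            String langParam = request.getParameter("lang");           // 3. ?lang=de style switching
            if (langParam != null) {
                locale = new Locale(langParam);
            }
            HttpSession session = request.getSession(false);
            if (locale == null && session != null) {                    // 2. locale stored at login
                locale = (Locale) session.getAttribute("userLocale");
            }
            if (locale == null) {
                locale = request.getLocale();                            // 1. Accept-Language header
            }

            request.setAttribute("resolvedLocale", locale);              // make it available downstream
            chain.doFilter(req, res);
        }

        public void init(FilterConfig cfg) {}
        public void destroy() {}
    }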
String externalization
This is something that should be easy, right? Well, mostly it is - just use the appropriate tag. The only problem you might face is when it comes to externalizing client-side (JavaScript) texts. There are several possible approaches, but let me mention these two:
Have each JSP write out an array of translated strings (with the message tag) and simply access that array in client code. This is the easier approach but it is less maintainable - you would need to write out the right strings from the right pages (the ones that actually reference your client-side scripts). I have done that before and, believe me, this is not something you want to do in a large application (but it is probably the best solution for a small one).
The other approach may sound hard in principle but it is actually much easier to handle in the future. The idea is to centralize strings on the client side (move them to some common JavaScript file). After that you would need to implement your own Servlet that returns this script upon request, with the contents translated. You won't be able to use JSTL here; you would need to get the strings from Resource Bundles directly.
It is much easier to maintain, because you would have one, central point to add translatable strings.
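A minimal sketch of that second approach - a Servlet that emits the client-side strings from a ResourceBundle (the bundle name, URL mapping and message keys are made up for illustration):

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Locale;
    import java.util.ResourceBundle;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Mapped to e.g. /js/messages.js in web.xml
    public class MessagesScriptServlet extends HttpServlet {
        protected void doGet(HttpServletRequest request, HttpServletResponse response)
                throws IOException {
            Locale locale = request.getLocale();   // or the locale resolved at login
            ResourceBundle bundle = ResourceBundle.getBundle("messages", locale);

            response.setContentType("text/javascript");
            response.setCharacterEncoding("UTF-8");
            PrintWriter out = response.getWriter();

            out.println("var i18n = {");
            out.println("  wrongChoice: '" + escapeJs(bundle.getString("error.wrong.choice")) + "',");
            out.println("  decimalSeparator: '" + escapeJs(bundle.getString("format.decimal.separator")) + "'");
            out.println("};");
        }

        // Naive single-quote/backslash escaping; a real implementation should be more thorough
        private static String escapeJs(String s) {
            return s.replace("\\", "\\\\").replace("'", "\\'");
        }
    }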
Concatenations
I hate to say it, but concatenations are really painful from a localizability perspective. They are very common and most people don't realize it.
So what is concatenation, then?
In principle, each English sentence needs to be translated to the target language. The problem is that a correctly translated message very often uses a different word order than its English counterpart (the English "Security policy" is translated to the Polish "Polityka bezpieczeństwa" - "policy" is "polityka" - the order is different).
OK, but how is it related to software?
In a web application you could concatenate Strings like this:
String securityPolicy = "Security " + "policy";
or like this:
<p><span style="font-weight:bold">Security</span> policy</p>
Both would be problematic. In the first case you would need to use the MessageFormat.format() method and externalize the strings as (for example) "Security {0}" and "policy"; in the latter you would externalize the contents of the whole paragraph (the p tag), including the span tag. I know that this is painful for translators but there is really no better way.
Sometimes you have to use dynamic content in your paragraph - the JSTL fmt:message tag with fmt:param will help you here as well (it works like MessageFormat on the back-end side).
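A small sketch of the MessageFormat variant (the bundle keys and property values are invented for the example):

    import java.text.MessageFormat;
    import java.util.Locale;
    import java.util.ResourceBundle;

    public class MessageExample {
        public static void main(String[] args) {
            // messages.properties:      security.pattern = Security {0}
            //                           word.policy      = policy
            // messages_pl.properties:   security.pattern = {0} bezpiecze\u0144stwa
            //                           word.policy      = Polityka
            ResourceBundle bundle = ResourceBundle.getBundle("messages", new Locale("pl"));

            // The translator controls the word order through the placeholder position
            String text = MessageFormat.format(
                    bundle.getString("security.pattern"),
                    bundle.getString("word.policy"));
            System.out.println(text); // "Polityka bezpieczeństwa" for the Polish bundle
        }
    }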
Layouts
In a localized application, it often happens that translated strings are much longer than the English ones. The result can look very ugly, so somehow you need to fix the styles. There are again two approaches:
Fix issues as they happen by adjusting the common styles (and pray that it won't break other languages). This is very painful to maintain.
Implement a CSS localization mechanism. The mechanism I am talking about should serve a default, language-independent CSS file plus per-language overrides. The idea is to have an override CSS file for each language, so that you can adjust layouts on demand (just for one language). In order to do that, the default CSS file, as well as the JSP pages, must not contain the !important keyword next to any style definitions. If you really have to use it, move those definitions to a language-based en.css - this allows the other languages to override them.
Culture specific issues
Avoid using graphics, colors and sounds that might be specific to Western culture. If you really need them, please provide a means of localizing them. Avoid direction-sensitive graphics (as this would be a problem when you try to localize to, say, Arabic or Hebrew). Also, do not assume that the whole world uses the same digits (not true for Arabic, for example).
Dates and time zones
Handling dates and times in Java is, to say the least, not easy. If you are not going to support anything other than the Gregorian Calendar, you can stick to the built-in Date and Calendar classes.
You can use JSTL fmt:timeZone, fmt:formatDate and fmt:parseDate to correctly set the time zone and to format and parse dates in JSP.
I strongly suggest using fmt:formatDate like this:
<fmt:formatDate value="${someController.somedate}"
timeZone="${someController.detectedTimeZone}"
dateStyle="default"
timeStyle="default" />
It is important to convert the date and time to a valid (the end user's) time zone. It is also quite important to convert it to an easily understandable format - that is why I recommend the default formatting style.
By the way, time zone detection is not easy, as web browsers are not so nice as to send anything useful. Instead, you can either add a preferred time zone field to User preferences (if you have them) or get the current time zone offset from the web browser via a client-side script (see the Date object's methods).
Numbers and currencies
Numbers as well as currencies should be converted to the local format. This is done in a similar way to formatting dates (and parsing is also done similarly):
<fmt:formatNumber value="1.21" type="currency"/>
Compound messages
You have already been warned not to concatenate strings. Instead you would probably use MessageFormat. However, I must state that you should minimize the use of compound messages. That is because target-language grammar rules are quite commonly different, so translators might need not only to re-order the sentence (this would be resolved by using placeholders and MessageFormat.format()), but also to translate the whole sentence in a different way based on what will be substituted. Let me give you some examples:
// Multiple plural forms
English: 4 viruses found.
Polish: Znaleziono 4 wirusy. **OR** Znaleziono 5 wirusów.
// Conjugation
English: Program encountered incorrect character | Application encountered incorrect character.
Polish: Program napotkał nieznaną literę | Aplikacja napotkała nieznaną literę.
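For the plural-forms case, java.text.ChoiceFormat (used inside a MessageFormat pattern) at least lets the translator pick a form per numeric range; a small illustrative sketch (the pattern text is invented):

    import java.text.MessageFormat;

    public class PluralExample {
        public static void main(String[] args) {
            // English pattern; a Polish bundle would carry its own ranges and word forms
            String pattern = "{0,choice,0#No viruses|1#1 virus|1<{0,number,integer} viruses} found.";
            System.out.println(MessageFormat.format(pattern, 0)); // "No viruses found."
            System.out.println(MessageFormat.format(pattern, 1)); // "1 virus found."
            System.out.println(MessageFormat.format(pattern, 4)); // "4 viruses found."
        }
    }

Note that languages with more plural forms than English (like Polish) need more ranges in the translated pattern, or a library such as ICU4J's PluralFormat.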
Character encoding
If you are planning to localize into languages that are not covered by the ISO 8859-1 code page, you will need to support Unicode - and the best way is to set the page encoding to UTF-8. I have seen people doing it like this:
<%@ page contentType="text/html; charset=UTF-8" %>
I must warn you: this is not enough. You actually need this declaration:
<%@ page pageEncoding="UTF-8" %>
Also, you would still need to declare encoding in the page header, just to be on the safe side:
<META http-equiv="Content-Type" content="text/html;charset=UTF-8">
The list I gave you is not exhaustive but this is good starting point. Good luck :)
You can do exactly this using the JSTL standard tag library with the <fmt:message> tag. Grab a copy of the JSTL specification and read the i18n chapters, which discuss general text plus date, time and currency. It is very clearly written and shows you how you can do it all with tags. You can also set things like the Locale programmatically.
You don't (and shouldn't) need to have a separate JSP file per locale. The hard task is to find the strings that aren't i18n-ed, give them keys and move them to a file per locale, say, messages_en.properties, messages_fr.properties and so on.
Locale calculation can happen in multiple places depending on your logic. We support user locales stored in a database as well as the browser locale. Every request that comes into your application will have an "Accept-Language" header that indicates which languages the browser has been configured with, in order of preference, i.e. Japanese first and then English. If that's the case, the application should read messages_ja.properties and, for keys that are not in that file, fall back to messages_en.properties. The same can hold true for user locales that are stored in the database. Please note that the standard is just to switch the language in the browser and expect the content to be i18n-ed. (We initially started with storing the locale in the database and then moved to supporting locales from the browser.) Also, you will need a default anyway, as translators miss copying keys and values from English (the main language file) to the other languages, so you will need to fall back to English for values that are not in the other files.
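A minimal sketch of that fallback behavior using ResourceBundle (the bundle base name "messages" is an assumption):

    import java.util.Locale;
    import java.util.ResourceBundle;

    public class Messages {
        public static String get(String key, Locale requestLocale) {
            // messages.properties acts as the last-resort default (keep it in English);
            // messages_ja.properties etc. override keys for specific locales.
            ResourceBundle bundle = ResourceBundle.getBundle("messages", requestLocale);
            return bundle.getString(key); // falls back through the parent bundle chain per key
        }
    }

    // e.g. in a servlet: Messages.get("login.title", request.getLocale());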
I've also found mygengo very useful when handing translation jobs to other people who know a particular language; it's saved us a lot of time.

How to webscrape scholar.google.com in Java?

I want to write a Java func grabTopResults(String f) such that grabTopResults("automata theory") returns me a list of the top 100 cited papers on scholar.google.com for "automata theory".
Does anyone have suggestions for what libraries will make my life easy?
Thanks!
As I'm sure Google can afford the bandwidth, I'll ignore the question of whether this is immoral/illegal/prohibited by Google's T&C.
The first thing you need to do is figure out what HTTP request (or requests) you need to issue in order to obtain the page with the data you need. Once you've figured this out, use HttpClient to issue the same request from Java code. The previous link shows example code that explains how to do this.
Once you've downloaded the content of the relevant page, you'll need to use an HTML parser to extract the data you're interested in. The Jericho parser suggested by peperg is a good choice.
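A rough sketch of the fetch step (the query parameter and User-Agent handling are assumptions; the returned HTML would then be handed to a parser such as Jericho):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;

    public class ScholarFetch {
        public static String fetch(String query) throws Exception {
            String url = "http://scholar.google.com/scholar?q="
                    + URLEncoder.encode(query, "UTF-8");
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            // Google may reject requests without a browser-like User-Agent
            conn.setRequestProperty("User-Agent", "Mozilla/5.0");

            StringBuilder html = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    html.append(line).append('\n');
                }
            }
            return html.toString(); // feed this into an HTML parser to pull out the result titles
        }
    }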
If the Google police come knocking, you've never heard of me, OK?
I use http://jericho.htmlparser.net/docs/index.html . Google Scholar doesn't have an API ( http://code.google.com/p/google-ajax-apis/issues/detail?id=109 ). Of course it is not allowed by Google (read the terms of use: automated requests are forbidden).
Below is a bit of example code which gets the titles on the first page using the open source product TestPlan. It is a standalone product, but if you really need it I could help you integrate it into your Java code (it is written in Java itself).
GotoURL http://scholar.google.com/
SubmitForm with
%Params:q% automate theory
end
set %Items% as response //div[@class='gs_r']
foreach %Item% in %Items%
set %Title% as selectIn %Item% h3
Notice %Title%
end
This produces output like the below (my IP is in Germany, thus a German response). Obviously you could format it however you like, or write it to a file; this is just a rough test.
00000000-00 GOTOURL http://scholar.google.com/
00000001-00 SUBMITFORM default
00000002-00 NOTICE [ZITATION] Stochastic complexity in statistical inquiry theory
00000003-00 NOTICE AUTOMATED THEORY FORMATION IN MATHEMATICS1
00000004-00 NOTICE Constraint generation via automated theory formation
00000005-00 NOTICE [BUCH] Automated theorem proving: after 25 years
00000006-00 NOTICE [BUCH] Introduction to the Theory of Computation
00000007-00 NOTICE [ZITATION] Computer-controlled systems: theory and design
00000008-00 NOTICE [BUCH] … , randomness & incompleteness: papers on algorithmic information theory
00000009-00 NOTICE [BUCH] Automatic control systems
00000010-00 NOTICE [BUCH] VLSI physical design automation: theory and practice
00000011-00 NOTICE Singular Control Systems.

Simple properties to string conversion in Java

Using Java, I need to encode a Map<String, String> of name value pairs to store into a String, and be able to decode it again. These will be stored in a database column, and will probably usually be short and simple, so the common case should produce a simple nice looking line, but shouldn't corrupt the data, even if it contains unexpected characters, etc.
How would you choose to do it such that:
The encoded form is a single, human readable line
It doesn't require a big library or much context to encode / decode
Any delimiters are properly escaped
Url encoding? JSON? Do it yourself? Please specify any helper libraries or methods you'd use.
(Edited to specify more context and requirements as requested.)
As @Uri says, additional context would be good. I think your primary concerns are less about the particular encoding scheme, as rolling your own for most encodings is pretty easy for a simple Map<String, String>.
An interesting question is: what will this intermediate string encoding be used for?
if it's purely internal, an ad-hoc format is fine eg simple concatenation:
key1|value1|key2|value2
if humans might read it, a format like Ruby's map declaration is nice:
{ first_key => first_value,
second_key => second_value }
if the encoding is to send a serialised map over the wire to another application, the XML suggestion makes a lot of sense as it's standard-ish and reasonably self-documenting, at the cost of XML's verbosity.
<map>
<entry key='foo' value='bar'/>
<entry key='this' value='that'/>
</map>
if the map is going to be flushed to file and read back later by another Java application, @Cletus' suggestion of the Properties class is a good one, and has the additional benefit of being easy to open and inspect by human beings.
Edit: you've added the information that this is to store in a database column - is there a reason to use a single column, rather than three columns like so:
CREATE TABLE StringMaps
(
map_id NUMBER NOT NULL, -- ditch this if you only store one map...
key VARCHAR2 NOT NULL,
value VARCHAR2
);
As well as letting you store more semantically meaningful data, this moves the encoding/decoding into your data access layer more formally, and allows other database readers to easily see the data without having to understand any custom encoding scheme you might use. You can also easily query by key or value if you want to.
Edit again: you've said that it really does need to fit into a single column, in which case I'd either:
use the first pipe-separated encoding (or whatever exotic character you like, maybe some unprintable-in-English unicode character). Simplest thing that works. Or...
if you're using a database like Oracle that recognises XML as a real type (and so can give you XPath evaluations against it and so on) and need to be able to read the data well from the database layer, go with XML. Writing XML parsers for decoding is never fun, but shouldn't be too painful with such a simple schema.
Even if your database doesn't support XML natively, you can just throw it into any old character-like column-type...
Why not just use the Properties class? That does exactly what you want.
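A sketch of that approach, round-tripping a Map through java.util.Properties (note that store() emits one line per entry plus a timestamp comment, so a strictly single-line encoding would need a little post-processing):

    import java.io.StringReader;
    import java.io.StringWriter;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class MapCodec {
        public static String encode(Map<String, String> map) throws Exception {
            Properties props = new Properties();
            props.putAll(map);                       // Properties escapes =, :, newlines, unicode
            StringWriter out = new StringWriter();
            props.store(out, null);                  // a null comment still writes a date line
            return out.toString();
        }

        public static Map<String, String> decode(String encoded) throws Exception {
            Properties props = new Properties();
            props.load(new StringReader(encoded));
            Map<String, String> map = new HashMap<String, String>();
            for (String name : props.stringPropertyNames()) {
                map.put(name, props.getProperty(name));
            }
            return map;
        }
    }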
I have been contemplating a similar need of choosing a common representation for the conversations (transport content) between my clients and servers via a facade pattern. I want a representation that is standardized, human-readable (brief), robust, fast. I want it to be lightweight to implement and run, easy to test, and easy to "wrap". Note that I have already eliminated XML by my definition, and by explicit intent.
By "wrap", I mean that I want to support other transport content representations such as XML, SOAP, possibly Java properties or Windows INI formats, comma-separated values (CSV) and that ilk, Google protocol buffers, custom binary formats, proprietary binary formats like Microsoft Excel workbooks, and whatever else may come along. I would implement these secondary representations using wrappers/decorators around the primary facade. Each of these secondary representations is desirable, especially to integrate with other systems in certain circumstances, but none of them is desirable as a primary representation due to various shortcomings (failure to meet one or more of my criteria listed above).
Therefore, so far, I am opting for the JSON format as my primary transport content representation. I intend to explore that option in detail in the near future.
Only in cases of extreme performance considerations would I skip translating the underlying conventional format. The advantages of a clean design include good performance (no wasted effort, ease of maintainability) for which a decent hardware selection should be the only necessary complement. When performance needs become extreme (e.g., processing forty thousand incoming data files totaling forty million transactions per day), then EVERYTHING has to be revisited anyway.
As a developer, DBA, architect, and more, I have built systems of practically every size and description. I am confident in my selection of criteria, and eagerly await confirmation of its suitability. Indeed, I hope to publish an implementation as open-source (but don't hold your breath quite yet).
Note that this design discussion ignores the transport medium (HTTP, SMTP, RMI, .Net Remoting, etc.), which is intentional. I find that it is much more effective to treat the transport medium and the transport content as completely separate design considerations, from each other and from the system in question. Indeed, my intent is to make these practically "pluggable".
Therefore, I encourage you to strongly consider JSON. Best wishes.
Some additional context for the question would help.
If you're going to be encoding and decoding at the entire-map granularity, why not just use XML?
As @DanVinton says, if you need this for internal use (I mean "internal use" as in "it's used only by my components, not components written by others"), you can concatenate keys and values.
I prefer to use a different separator between key and key than between key and value:
Instead of
key1+SEPARATOR+value1+SEPARATOR+key2 etc.
I code
key1+SEPARATOR_KEY_AND_VALUE+value1+SEPARATOR_KEY(n)_AND_KEY(n+1)+key2 etc.
If you must debug, this way is clearer (by design, too).
Check out the Apache Commons Configuration package. This will allow you to read/save a file in XML or properties format. It also gives you the option of automatically saving property changes to a file.
Apache Configuration
I realise this is an old, "deadish" thread, but I've got a solution not posited previously which I think is worth throwing into the ring.
We store "arbitrary" attributes (i.e. created by the user at runtime) of geographic features in a single CLOB column in the DB in the standard XML attributes format. That is:
name="value" name="value" name="value"
To create an XML element you just "wrap up" the attributes in an xml element. That is:
String xmlString = "<arbitraryAttributes " + arbitraryAttributesString + " />";
"Serialising" a Properties instance to an xml-attributes-string is a no-brainer... it's like ten lines of code. We're lucky in that we can impose on the users the rule that all attribute names must be valid xml-element-names; and we xml-escape (i.e. &quote; etc) each "value" to avoid problems from double-quotes and whatever in the value strings.
It's effective, flexible, fast (enough) and simple.
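A rough sketch of that serialisation, about the "ten lines" described (the method name and escaping are illustrative, not the original code):

    import java.util.Map;
    import java.util.Properties;

    public class AttributeSerializer {
        // Turns {colour=red, size=10} into: colour="red" size="10"
        public static String toXmlAttributes(Properties props) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<Object, Object> entry : props.entrySet()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(entry.getKey()).append("=\"")
                  .append(escape(String.valueOf(entry.getValue()))).append('"');
            }
            return sb.toString();
        }

        private static String escape(String value) {
            return value.replace("&", "&amp;")
                        .replace("\"", "&quot;")
                        .replace("<", "&lt;")
                        .replace(">", "&gt;");
        }
    }

    // Wrapping it into an element: "<arbitraryAttributes " + toXmlAttributes(props) + " />"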
Now, having said all that... if we had the time again, we'd just totally divorce ourselves from the whole "metadata problem" by storing the complete unadulterated uninterpreted metadata xml-document in a CLOB and use one of the open-source metadata editors to handle the whole mess.
Cheers. Keith.
