Setting Turkish and English locale: translate Turkish characters to Latin equivalents - java

I want to translate my Turkish strings to lowercase in both English and Turkish locale. I'm doing this:
String myString="YAŞAR BAYRI";
Locale trlocale= new Locale("tr-TR");
Locale enLocale = new Locale("en_US");
Log.v("mainlist", "en source: " +myString.toLowerCase(enLocale));
Log.v("mainlist", "tr source: " +myString.toLowerCase(trlocale));
The output is:
en source: yaşar bayri
tr source: yaşar bayri
But I want to have an output like this:
en source: yasar bayri
tr source: yaşar bayrı
Is this possible in Java?

If you are using the Locale constructor, you can and must set the language, country and variant as separate arguments:
new Locale(language)
new Locale(language, country)
new Locale(language, country, variant)
Therefore, your test program creates locales with the language "tr-TR" and "en_US". For your test program, you can use new Locale("tr", "TR") and new Locale("en", "US").
If you are using Java 1.7+, then you can also parse a language tag using Locale.forLanguageTag:
String myString="YASAT BAYRI";
Locale trlocale= Locale.forLanguageTag("tr-TR");
Locale enLocale = Locale.forLanguageTag("en-US");
Calling toLowerCase with these locales then produces the appropriate lower case for each language.
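Here is a minimal sketch of the difference, assuming the input was meant to be "YAŞAR BAYRI". The class name LowerCaseDemo and the helper lower are made up for illustration, and the Turkish letters are written as Unicode escapes only to keep the source ASCII-safe:

```java
import java.util.Locale;

final class LowerCaseDemo {
    // Lowercases text with the casing rules of the locale named by a
    // BCP 47 language tag such as "tr-TR" (note the hyphen, not underscore).
    static String lower(String text, String languageTag) {
        return text.toLowerCase(Locale.forLanguageTag(languageTag));
    }

    public static void main(String[] args) {
        // YASAR BAYRI with S-cedilla (U+015E) as the third letter.
        String myString = "YA\u015EAR BAYRI";
        // English rules: I becomes a dotted i, S-cedilla keeps its cedilla.
        System.out.println("en source: " + lower(myString, "en-US"));
        // Turkish rules: I becomes a dotless i (U+0131).
        System.out.println("tr source: " + lower(myString, "tr-TR"));
    }
}
```

Note that neither locale turns ş into s. The locale only changes the casing rules, so getting "yasar" needs a separate step.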

I think this is the problem:
Locale trlocale= new Locale("tr-TR");
Try this instead:
Locale trlocale= new Locale("tr", "TR");
That's the constructor to use to specify country and language.

You can do it like this:
Locale trlocale= new Locale("tr","TR");
The first parameter is your language, while the other one is your country.

If you just want the string in ASCII, without accents, the following might do.
First, an accented character can be decomposed into an ASCII base character plus a combining diacritical mark (a zero-width accent). Then only those marks are removed with a regular-expression replace.
// requires: import java.text.Normalizer;
public static String withoutDiacritics(String s) {
// Decompose e.g. ş into s followed by a combining cedilla.
String s2 = Normalizer.normalize(s, Normalizer.Form.NFD);
return s2.replaceAll("(?s)\\p{InCombiningDiacriticalMarks}", "");
}
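One caveat, as far as I can tell: the dotless ı (U+0131) has no canonical decomposition, so NFD leaves it alone and it has to be mapped by hand. A small sketch under that assumption (the class name AsciiFold and the method toAsciiLower are hypothetical, and the Turkish letters are written as Unicode escapes to keep the source ASCII-safe):

```java
import java.text.Normalizer;
import java.util.Locale;

final class AsciiFold {
    // Splits accented letters into base letter + combining mark, then drops the marks.
    static String withoutDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Lowercases with locale-neutral rules, strips accents, and maps
    // dotless i (U+0131) to a plain i because NFD cannot decompose it.
    static String toAsciiLower(String s) {
        String lowered = s.toLowerCase(Locale.ROOT);
        return withoutDiacritics(lowered).replace('\u0131', 'i');
    }

    public static void main(String[] args) {
        // YASAR BAYRI with S-cedilla (U+015E) as the third letter.
        System.out.println(toAsciiLower("YA\u015EAR BAYRI"));
    }
}
```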

The characters ş and s are different characters. Changing the locale cannot help you translate one to the other. You have to create a Turkish-to-English character table and do this yourself. I once did this for Vietnamese, which has a lot of such characters. You only have to deal with 4 or 5, right? So, good luck!
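As a rough sketch of that table approach: the class name TurkishTransliterator and the exact set of letters below are my own choices for illustration, not something taken from the answer above.

```java
import java.util.HashMap;
import java.util.Map;

final class TurkishTransliterator {
    // Turkish-specific letters mapped to their closest ASCII letters.
    // Unicode escapes keep the source ASCII-only; each comment names the letter.
    private static final Map<Character, Character> TABLE = new HashMap<>();
    static {
        TABLE.put('\u00E7', 'c'); // small c with cedilla
        TABLE.put('\u011F', 'g'); // small g with breve
        TABLE.put('\u0131', 'i'); // small dotless i
        TABLE.put('\u00F6', 'o'); // small o with diaeresis
        TABLE.put('\u015F', 's'); // small s with cedilla
        TABLE.put('\u00FC', 'u'); // small u with diaeresis
        TABLE.put('\u00C7', 'C'); // capital C with cedilla
        TABLE.put('\u011E', 'G'); // capital G with breve
        TABLE.put('\u0130', 'I'); // capital I with dot above
        TABLE.put('\u00D6', 'O'); // capital O with diaeresis
        TABLE.put('\u015E', 'S'); // capital S with cedilla
        TABLE.put('\u00DC', 'U'); // capital U with diaeresis
    }

    // Replaces every character found in the table; all others pass through unchanged.
    static String toLatin(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Character mapped = TABLE.get(c);
            sb.append(mapped != null ? mapped.charValue() : c);
        }
        return sb.toString();
    }
}
```

Unlike the Normalizer approach, this touches only the characters you list, so it never strips accents you did not intend to remove.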

Related

Escape special characters using Regex in java [duplicate]

Does Java have a built-in way to escape arbitrary text so that it can be included in a regular expression? For example, if my users enter "$5", I'd like to match that exactly rather than a "5" after the end of input.
Since Java 1.5, yes:
Pattern.quote("$5");
The difference between Pattern.quote and Matcher.quoteReplacement was not clear to me until I saw the following example:
s.replaceFirst(Pattern.quote("text to replace"),
Matcher.quoteReplacement("replacement text"));
It may be too late to respond, but you can also use Pattern.LITERAL, which makes the pattern treat all special characters as literal text:
Pattern.compile(textToFormat, Pattern.LITERAL);
I think what you're after is \Q$5\E. Also see Pattern.quote(s), introduced in Java 5.
See Pattern javadoc for details.
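To make the "$5" case from the question concrete, here is a small sketch. The class and method names are invented for the example; Pattern.quote and Pattern.LITERAL are the real API.

```java
import java.util.regex.Pattern;

final class LiteralMatchDemo {
    // True if text contains userInput verbatim, with no regex interpretation.
    static boolean containsLiteral(String text, String userInput) {
        return Pattern.compile(Pattern.quote(userInput)).matcher(text).find();
    }

    // Same check using the LITERAL flag instead of quoting the input.
    static boolean containsLiteralFlag(String text, String userInput) {
        return Pattern.compile(userInput, Pattern.LITERAL).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "It costs $5";
        // Unquoted, "$5" means "a 5 after the end of input", which can never match.
        System.out.println(Pattern.compile("$5").matcher(text).find()); // false
        System.out.println(containsLiteral(text, "$5"));                // true
        System.out.println(containsLiteralFlag(text, "$5"));            // true
    }
}
```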
First off, if you use replaceAll(), you DON'T use Matcher.quoteReplacement(), and the text to be substituted in includes a $1, then it won't put a 1 at the end. It will look at the search regex for the first matching group and sub THAT in. That's what $1, $2 or $3 means in the replacement text: matching groups from the search pattern.
I frequently plug long strings of text into .properties files, then generate email subjects and bodies from those. Indeed, this appears to be the default way to do i18n in Spring Framework. I put XML tags, as placeholders, into the strings and I use replaceAll() to replace the XML tags with the values at runtime.
I ran into an issue where a user input a dollars-and-cents figure, with a dollar sign. replaceAll() choked on it, with the following showing up in a stack trace:
java.lang.IndexOutOfBoundsException: No group 3
at java.util.regex.Matcher.start(Matcher.java:374)
at java.util.regex.Matcher.appendReplacement(Matcher.java:748)
at java.util.regex.Matcher.replaceAll(Matcher.java:823)
at java.lang.String.replaceAll(String.java:2201)
In this case, the user had entered "$3" somewhere in their input and replaceAll() went looking in the search regex for the third matching group, didn't find one, and puked.
Given:
// "msg" is a string from a .properties file, containing "<userInput />" among other tags
// "userInput" is a String containing the user's input
replacing
msg = msg.replaceAll("<userInput \\/>", userInput);
with
msg = msg.replaceAll("<userInput \\/>", Matcher.quoteReplacement(userInput));
solved the problem. The user could put in any kind of characters, including dollar signs, without issue. It behaved exactly the way you would expect.
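A self-contained sketch of the before and after, assuming a message template with a single <userInput /> placeholder (the class name ReplacementDemo, the method fill and the sample message are made up for illustration):

```java
import java.util.regex.Matcher;

final class ReplacementDemo {
    // Substitutes the user's text for the placeholder, treating "$" and "\"
    // in the user's text as plain characters rather than group references.
    static String fill(String msg, String userInput) {
        return msg.replaceAll("<userInput />", Matcher.quoteReplacement(userInput));
    }

    public static void main(String[] args) {
        String msg = "You entered: <userInput />";
        System.out.println(fill(msg, "$3.50")); // You entered: $3.50

        try {
            // Without quoting, "$3" is read as a reference to capture group 3.
            msg.replaceAll("<userInput />", "$3.50");
        } catch (IndexOutOfBoundsException e) {
            System.out.println("unquoted replacement failed: " + e.getMessage());
        }
    }
}
```

If the search text is also literal, String.replace(CharSequence, CharSequence) avoids both kinds of escaping, since it does not use regex syntax at all.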
To build a protected pattern, you can escape every character except digits and letters with a backslash. After that, you can insert your own special symbols into the protected pattern, so that it works as a real pattern of your own design rather than as plain quoted text, and none of the user's characters act as special symbols.
public class Test {
public static void main(String[] args) {
String str = "y z (111)";
String p1 = "x x (111)";
String p2 = ".* .* \\(111\\)";
p1 = escapeRE(p1);
p1 = p1.replace("x", ".*");
System.out.println( p1 + "-->" + str.matches(p1) );
//.*\ .*\ \(111\)-->true
System.out.println( p2 + "-->" + str.matches(p2) );
//.* .* \(111\)-->true
}
public static String escapeRE(String str) {
//Pattern escaper = Pattern.compile("([^a-zA-Z0-9])");
//return escaper.matcher(str).replaceAll("\\\\$1");
return str.replaceAll("([^a-zA-Z0-9])", "\\\\$1");
}
}
Pattern.quote("blabla") works nicely.
Pattern.quote() works nicely. It encloses the text in "\Q" and "\E", and it also takes care of any "\E" that already occurs inside the text.
However, if you need to do a real regular expression escaping(or custom escaping), you can use this code:
String someText = "Some/s/wText*/,**";
System.out.println(someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
This prints Some/\s/wText\*/\,\*\* (as written, the character class also matches the letter s, which is why s becomes \s).
Code for example and tests:
String someText = "Some\\E/s/wText*/,**";
System.out.println("Pattern.quote: "+ Pattern.quote(someText));
System.out.println("Full escape: "+someText.replaceAll("[-\\[\\]{}()*+?.,\\\\\\\\^$|#\\\\s]", "\\\\$0"));
The ^ (negation) symbol at the start of a character class is used to match any character that is not in the class.

How to validate & instantiate Java Locale from strings?

My app is being fed strings from an external process, where each string is either 2 or 5 characters in length and represents a java.util.Locale. For example:
en-us
ko
The first example is a 5-char string where "en" is the ISO language code, and "us" is the ISO country code. This should correspond to the "en_US" Locale. The second example is only a 2-char string, where "ko" is the ISO language code, and should correspond to the "ko_KR" (Korean) Locale.
I need a way to take these strings (either the 2- or 5-char variety), validate it (as a supported Java 6 Locale), and then create a Locale instance with it.
I would have hoped that Locale came with such validation out of the box, but unfortunately this code runs without exceptions being thrown:
Locale loc = new Locale("waawaaweewah", "greatsuccess");
// Prints: "waawaaweewah"
System.out.println(loc.getDisplayLanguage());
So I ask, given the 2 forms in which these strings will be given to me, how can I:
Validate the string (both forms) and throw an exception for strings corresponding to non-existent or unsupported Java 6 Locales; and
Instantiate a new Locale from the string? This question really applies to the 2-char form, where I might only have "ko" and need it to map to the "ko_KR" Locale, etc.
Thanks in advance!
Locale.getISOCountries() and Locale.getISOLanguages()
return a list of all 2-letter country and language codes defined in ISO 3166 and ISO 639 respectively and can be used to create Locales.
You can use this to validate your inputs.
You have two options,
Use a library for this: commons-lang has the LocaleUtils class, which has a method that can parse a String to a Locale.
Write your own method. The validation here is non-trivial, as there are a number of different sets of country codes that are valid for a Locale; see the javadoc.
A starting point would be to split the String and switch on the number of elements:
public static Locale parseLocale(final String locale) {
final String[] localeArr = locale.split("_");
switch (localeArr.length) {
case 1:
return new Locale(localeArr[0]);
case 2:
return new Locale(localeArr[0], localeArr[1]);
case 3:
return new Locale(localeArr[0], localeArr[1], localeArr[2]);
default:
throw new IllegalArgumentException("Invalid locale format;");
}
}
Presumably you would need to get lists of all valid country codes and languages and compare the elements in the String[] to the valid values before calling the constructor.
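A possible sketch of that validation, with some assumptions of mine: it accepts both "-" and "_" as the separator (the question's input uses "en-us"), it only checks the codes against the ISO lists, and it uses Locale.Builder, which needs Java 7 or later. The class name LocaleParser is hypothetical. It does not map a bare "ko" to "ko_KR"; that would need a default-country table of your own.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class LocaleParser {
    private static final Set<String> LANGUAGES =
            new HashSet<>(Arrays.asList(Locale.getISOLanguages()));
    private static final Set<String> COUNTRIES =
            new HashSet<>(Arrays.asList(Locale.getISOCountries()));

    // Parses "ll" or "ll-CC" / "ll_CC" (any letter case) and rejects unknown ISO codes.
    static Locale parse(String input) {
        String[] parts = input.split("[-_]");
        if (parts.length == 0 || parts.length > 2) {
            throw new IllegalArgumentException("Invalid locale format: " + input);
        }
        String language = parts[0].toLowerCase(Locale.ROOT);
        if (!LANGUAGES.contains(language)) {
            throw new IllegalArgumentException("Unknown language code: " + parts[0]);
        }
        Locale.Builder builder = new Locale.Builder().setLanguage(language);
        if (parts.length == 2) {
            String country = parts[1].toUpperCase(Locale.ROOT);
            if (!COUNTRIES.contains(country)) {
                throw new IllegalArgumentException("Unknown country code: " + parts[1]);
            }
            builder.setRegion(country);
        }
        return builder.build();
    }
}
```

If "supported" should mean "the JVM ships data for it", the result could additionally be checked against Locale.getAvailableLocales().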

Locale appending 'variation' to language and country code

LocaleContext.getLocale() currently returns the locale object as 'en_US_WOL'. I verified the locale object using a breakpoint, and it looks like this: en is the language (English), US is the country code, and WOL is the variation (a field of the Locale object).
How and why is the variation field getting appended and returned by the getLocale() method, and how can I stop that? (LocaleContext is of type ThreadLocal.)
According to http://docs.oracle.com/javase/6/docs/api/java/util/Locale.html
The variant argument is a vendor or browser-specific code. For example, use WIN for Windows, MAC for Macintosh, and POSIX for POSIX. Where there are two variants, separate them with an underscore, and put the most important one first. For example, a Traditional Spanish collation might construct a locale with parameters for language, country and variant as: "es", "ES", "Traditional_WIN".
If you're after Locale for specific variant, I presume you can use this constructor:
Locale(String language, String country, String variant)
Or adjust your browser's locale settings (if your application involves browser at all)
I had a problem with this too. Unfortunately I haven't found any built-in method to nicely output the language-country code without the variant, so I helped myself with this snippet (maybe it will be handy to somebody):
public static String getLanguageCode(Locale locale) {
StringBuilder sb = new StringBuilder();
sb.append(locale.getLanguage());
if (locale.getCountry() != null && locale.getCountry().length() > 0) {
sb.append("-");
sb.append(locale.getCountry());
}
return sb.toString();
}
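If you need a Locale object rather than a string, a similar idea is to rebuild the locale from its language and country only. This is just a sketch; the class name LocaleVariants and the method withoutVariant are made up, and the Locale constructors used here are deprecated from Java 19 on in favour of Locale.of.

```java
import java.util.Locale;

final class LocaleVariants {
    // Returns a locale with the same language and country but no variant.
    static Locale withoutVariant(Locale locale) {
        return new Locale(locale.getLanguage(), locale.getCountry());
    }

    public static void main(String[] args) {
        Locale withVariant = new Locale("en", "US", "WOL");
        System.out.println(withVariant);                 // en_US_WOL
        System.out.println(withoutVariant(withVariant)); // en_US
    }
}
```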

How to format Double with dot?

How do I format a Double with String.format to String with a dot between the integer and decimal part?
String s = String.format("%.2f", price);
The above formats only with a comma: ",".
String.format(String, Object ...) uses your JVM's default locale. You can pass any locale by using String.format(Locale, String, Object ...) or java.util.Formatter directly.
String s = String.format(Locale.US, "%.2f", price);
or
String s = new Formatter(Locale.US).format("%.2f", price).toString();
or
// do this at application startup, e.g. in your main() method
Locale.setDefault(Locale.US);
// now you can use String.format(..) as you did before
String s = String.format("%.2f", price);
or
// set locale using system properties at JVM startup
java -Duser.language=en -Duser.region=US ...
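To see the effect of the locale argument in isolation, here is a tiny sketch (the class name DecimalPointDemo, the helper twoDecimals and the sample value are invented for the example):

```java
import java.util.Locale;

final class DecimalPointDemo {
    // Formats a value with two decimals using the given locale's decimal separator.
    static String twoDecimals(Locale locale, double value) {
        return String.format(locale, "%.2f", value);
    }

    public static void main(String[] args) {
        double price = 1234.5;
        System.out.println(twoDecimals(Locale.US, price));      // 1234.50
        System.out.println(twoDecimals(Locale.GERMANY, price)); // 1234,50
    }
}
```

Passing the locale explicitly is the least invasive of the options above, since Locale.setDefault changes formatting for the whole JVM.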
Based on this post you can do it like this and it works for me on Android 7.0
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;
DecimalFormat df = new DecimalFormat("#,##0.00");
df.setDecimalFormatSymbols(new DecimalFormatSymbols(Locale.ITALY));
System.out.println(df.format(yourNumber)); //will output 123.456,78
This way you have dot and comma based on your Locale
Answer edited and fixed thanks to Kevin van Mierlo comment
If it works the same as in PHP and C#, you might need to set your locale somehow. Might find something more about that in the Java Internationalization FAQ.

Java: How to get Unicode name of a character (or its type category)?

The Character class in Java defines methods which check a given char argument for equality with certain Unicode chars or for belonging to some type category. These chars and type categories are named.
As stated in given javadoc, examples for named chars are
HORIZONTAL TABULATION, FORM FEED, ...;
example for named type categories are
SPACE_SEPARATOR, PARAGRAPH_SEPARATOR, ...
However, being byte or int values instead of enums, the names of these types are "hidden" at runtime.
So, is there a possibility to get characters' and/or type categories' names at runtime?
JDK7 will have a
String getName(int codepoint)
function (READ: a “static method” in class java.lang.Character) that will turn a codepoint into its official Unicode name.
Javadoc : http://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#getName%28int%29
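A quick sketch of how that looks in use on Java 7 or later (the class name CharInfoDemo and its helper methods are made up for illustration; Character.getName and Character.getType are the real methods):

```java
final class CharInfoDemo {
    // Official Unicode name of the code point, as reported by the JDK.
    static String nameOf(int codePoint) {
        return Character.getName(codePoint);
    }

    // True if the code point's general category is "lowercase letter" (Ll).
    static boolean isLowercaseLetter(int codePoint) {
        return Character.getType(codePoint) == Character.LOWERCASE_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(nameOf('a'));            // LATIN SMALL LETTER A
        System.out.println(nameOf(0x0131));         // LATIN SMALL LETTER DOTLESS I
        System.out.println(isLowercaseLetter('a')); // true
    }
}
```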
Yes. Use the ICU4J library. It has the entire UCD and an API to get things out of it.
The Character class supports category info. Look at Character.getType(char) for the category. But I do not think you can get the character names.
I posted a .NET implementation here: Finding out Unicode character name in .Net
That should be very easy to port to Java. All you need is to download the Unicode Database: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt, and the Java equivalent of a string splitting method and a Dictionary class, both of which I'm sure exist in Java.
This is a simple alternative to downloading some bloated library with tons of Unicode methods that Java and .NET probably both already support.
The names are standard and may be used subject to certain limitations.
For the character name, one can use Character.getName(int). However, for general category it is not so convenient:
// attach String names to Character constants
Map<Byte, String> unicodeCategories = new HashMap<>();
unicodeCategories.put(Character.COMBINING_SPACING_MARK, "Mc");
unicodeCategories.put(Character.CONNECTOR_PUNCTUATION, "Pc");
unicodeCategories.put(Character.CONTROL, "Cc");
unicodeCategories.put(Character.CURRENCY_SYMBOL, "Sc");
unicodeCategories.put(Character.DASH_PUNCTUATION, "Pd");
unicodeCategories.put(Character.DECIMAL_DIGIT_NUMBER, "Nd");
unicodeCategories.put(Character.ENCLOSING_MARK, "Me");
unicodeCategories.put(Character.END_PUNCTUATION, "Pe");
unicodeCategories.put(Character.FINAL_QUOTE_PUNCTUATION, "Pf");
unicodeCategories.put(Character.FORMAT, "Cf");
unicodeCategories.put(Character.INITIAL_QUOTE_PUNCTUATION, "Pi");
unicodeCategories.put(Character.LETTER_NUMBER, "Nl");
unicodeCategories.put(Character.LINE_SEPARATOR, "Zl");
unicodeCategories.put(Character.LOWERCASE_LETTER, "Ll");
unicodeCategories.put(Character.MATH_SYMBOL, "Sm");
unicodeCategories.put(Character.MODIFIER_LETTER, "Lm");
unicodeCategories.put(Character.MODIFIER_SYMBOL, "Sk");
unicodeCategories.put(Character.NON_SPACING_MARK, "Mn");
unicodeCategories.put(Character.OTHER_LETTER, "Lo");
unicodeCategories.put(Character.OTHER_NUMBER, "No");
unicodeCategories.put(Character.OTHER_PUNCTUATION, "Po");
unicodeCategories.put(Character.OTHER_SYMBOL, "So");
unicodeCategories.put(Character.PARAGRAPH_SEPARATOR, "Zp");
unicodeCategories.put(Character.PRIVATE_USE, "Co");
unicodeCategories.put(Character.SPACE_SEPARATOR, "Zs");
unicodeCategories.put(Character.START_PUNCTUATION, "Ps");
unicodeCategories.put(Character.SURROGATE, "Cs");
unicodeCategories.put(Character.TITLECASE_LETTER, "Lt");
unicodeCategories.put(Character.UNASSIGNED, "Cn");
unicodeCategories.put(Character.UPPERCASE_LETTER, "Lu");
// use the map to extract category name from the constant
char ch = 'a'; // OR int ch = Character.codePointAt("a", 0);
String category = unicodeCategories.get( (byte) (Character.getType(ch) ) );
