I have a problem with language detection for Japanese using a Java library:
Using Japanese text, I'm trying to detect its language, but instead of the expected "ja" I get "en". Has anybody seen this problem before?
What is the expected output?
[ja:0.9999952022259697]
What do you see instead?
[en:0.9999952022259697]
The original issue description, with the Japanese text in attachments, can be found here.
This is almost certainly a problem related to the encoding of the input file (if that file contains Japanese at all -- I am not convinced it does).
The Java library you linked to assumes -- according to the documentation -- that the input is given as a String object. This means it assumes the encoding has already been correctly guessed and the input byte sequence has already been converted to a Java string.
When you use the library, you must make sure that is the case, i.e. if you are dealing with texts in unknown encodings (such as Japanese EUC-JP or SJIS), you must detect the encoding first and convert the string properly.
(Because of these reasons, good language detectors are able to detect the encoding and the language at the same time, by using language-and-encoding specific internal dictionaries.)
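For example, here is a minimal sketch of that two-step approach (detect, then decode) using ICU4J's CharsetDetector; the input file name is hypothetical:

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DetectThenDecode {
    public static void main(String[] args) throws IOException {
        byte[] raw = Files.readAllBytes(Paths.get("input.txt")); // hypothetical input file

        // Guess the encoding first; ICU's detector handles EUC-JP, Shift_JIS, UTF-8, ...
        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();

        // Only now is it safe to build the String that the language detector expects
        String text = match.getString();
        System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
        // ...pass `text` to the language detection library here
    }
}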
Related
I have the following task: some mixed Latin/Arabic text encoded in UTF-8 needs to be converted for printing on a POS printer, which uses the ancient one-byte code page 864.
text.getBytes("ibm-864") suddenly shows many question marks instead of Arabic characters, and after digging through the code I understood that the conversion table maps different versions of the Arabic characters to ibm-864 (forms somewhere in the FExx range rather than the 06xx range that my text contains).
I'm looking for some code or library that can convert Arabic Unicode to cp864, preferably mapping to the corresponding forms of the Arabic characters (cp864 has isolated, initial, medial and final forms for some of them), and maybe even handling the reversal for RTL, because I doubt the hardware supports that automatically.
I understand that this is a very specific task, but why not give it a try? I also know how to implement this myself, but I'm trying to find a ready-made wheel instead of reinventing it :)
Anyone?
Another possible solution: a library that can translate Arabic Unicode from the range U+0600–U+06FF (Arabic) to the range U+FE70–U+FEFF (Arabic Presentation Forms-B). Then I can safely get my bytes in cp864. Has anyone seen anything like that?
To output arabic text to a relatively dumb output device, you'll need to do several things:
Divide the text into blocks of different directionality using the Unicode Bidirectional Algorithm (UBA), better known as Bidi.
Mirror characters that need to be mirrored (e.g. an opening parenthesis points in different directions inside LTR and RTL blocks).
Since the output device is dumb, you'll need to change characters into their positional forms, and apply ligatures where needed (there is a ligature for LAM + ALEF). This is done by a piece of software called an Arabic Shaper.
You'll need to reorder the text according to its directionality.
Since CP864 doesn't have all the positional forms for all characters, you'll need to convert to fallback forms, converting some final forms to isolated forms, some medial forms to initial forms, and some initial forms to isolated forms. The text will not ligate as nicely as if there were proper forms, but it will come relatively close.
On Java, the ICU library allows you to do all of that (a sketch of the whole pipeline follows this list):
ICU's Bidi can take care of dividing into blocks, mirroring, and reordering. Reordering can be done before shaping, since ICU's ArabicShaping supports working with text in both logical (pre-reordering) and visual (post-reordering) order.
ICU's ArabicShaping can take care of shaping the text, mapping it into the appropriate presentation forms (the FExx range you talked about, which is not meant to be used normally; it exists only to interface with legacy software/hardware, in this case the printer that understands CP864 but not Unicode).
ICU's CharsetProvider and CharsetEncoder can be used to convert to CP864 using a fallback (non-roundtrip) conversion for characters that are not on the output charset, in this case the final→isolated, medial→initial,... forms.
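A minimal sketch of that pipeline with ICU4J is below. One caveat: the plain JDK encoder used in step 3 replaces unmappable presentation forms with '?' instead of downgrading them, so a production version would use ICU's charset converters (or a small substitution table) for the final→isolated, medial→initial fallbacks described above.

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.Bidi;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class ArabicToCp864 {
    public static byte[] convert(String logicalText) throws Exception {
        // 1. Bidi: resolve directionality, reorder into visual order, mirror brackets
        Bidi bidi = new Bidi(logicalText, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        String visual = bidi.writeReordered(Bidi.DO_MIRRORING);

        // 2. Shape into presentation forms (the FExx range); the input is already
        //    in visual order, so tell the shaper that
        ArabicShaping shaper = new ArabicShaping(
                ArabicShaping.LETTERS_SHAPE | ArabicShaping.TEXT_DIRECTION_VISUAL_LTR);
        String shaped = shaper.shape(visual);

        // 3. Encode to CP864 (simplified: unmappable forms become '?')
        CharsetEncoder encoder = Charset.forName("IBM864").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer buffer = encoder.encode(CharBuffer.wrap(shaped));
        byte[] out = new byte[buffer.remaining()];
        buffer.get(out);
        return out;
    }
}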
I'm trying to display Arabic text in Java, but it shows junk characters (example: ¤[ï߯[î) or sometimes only question marks when I print. How do I make it print Arabic? I heard it is something related to Unicode and UTF-8. This is the first time I'm working with languages, so I have no idea. I'm using the Eclipse Indigo IDE.
EDIT:
If I use UTF-8 encoding, then the "¤[ï߯[î" characters become "????????" characters.
For starters you could take a look here. This should allow you to make Eclipse print Unicode in its console (I do not know whether Eclipse supports that out of the box without any extra tweaks).
If that does not solve your problem, you most likely have an issue with the encoding your program is using, so you might want to create strings in a manner similar to this:
String str = new String("تعطي يونيكود رقما فريدا لكل حرف".getBytes(), "UTF-8");
This at least works for me.
If you embed the text literally in the code, make sure you set the encoding for your project correctly.
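If the console encoding is the culprit, you can also force the output stream to UTF-8 explicitly. A minimal sketch (the console font must still be able to render Arabic):

import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ArabicConsole {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Wrap stdout so the written bytes are UTF-8 regardless of the platform default
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println("تعطي يونيكود رقما فريدا لكل حرف");
    }
}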
Is this for Java SE, Java EE, or Java ME?
If it is for Java ME, you have to make custom GlyphUtils if you use LWUIT.
Download this file:
http://dl.dropbox.com/u/55295133/U0600.pdf
Look at the list of Unicode encodings.
And look at this thread:
https://stackoverflow.com/a/9172732/1061371
In the answer posted by Mohamed Nazar, as edited by Alex Kliuchnikau:
"The below code can be used for displaying Arabic text in J2ME: String s=new String("\u0628\u06A9".getBytes(), "UTF-8"); where \u0628\u06A9 is the Unicode of two Arabic letters"
Looking at the U0600.pdf file, we can see that Mohamed Nazar and Alex Kliuchnikau give an example that creates the "ba" and "kaf" characters in Arabic.
The last point you must consider is: make sure your UI supports Unicode (I mean Arabic) characters.
LWUIT, for example, does not support Unicode (I mean Arabic) characters yet.
You will need to write custom code if your app uses LWUIT.
I am automating test cases for a web application using Selenium 2.0 and Java. My application supports multiple languages. Some of the test cases require me to validate the text that appears in the UI, like success/error messages etc. I am using a properties file to store whatever text I refer to in my tests from the UI, currently only English. For example, there is locale_english.properties (see below) that contains all references in English. I am going to have multiple properties files like this for different locales, such as locale_chinese.properties, locale_french.properties and so on. For locales other than English, the corresponding properties files would have Unicode escapes (e.g. \u30ed) representing the native characters (see below). So if I want to test, say, the Chinese UI, I would load "locale_chinese.properties" instead of "locale_english.properties". I am going to convert the native characters for non-English locales using perhaps native2ascii from the JDK or some other way. I have tested that the Selenium API works well with UTF-8 characters for non-English locales.
---locale_english.properties------
user.login.error= Please verify username/password
---locale_chinese.properties------
user.login.error= \u30ed\u30ef\u30f0\u30f1\u30ed
and so on.
The problem is that my locale_english.properties is growing and going out of control. It is becoming hard to manage a single properties file for one locale, let alone for multiple locales. Is there a better way of handling localization in Java, particularly in situations like mine?
Thanks!
You're right that there is a problem managing the files, but you're also right that this is the best approach. Some things are just hard :-(
Selenium (at least the Selenium RC API) does indeed support Unicode input and output; we have lots of tests that enter and confirm Cyrillic and Simplified Chinese characters from C#. Since Java strings are Unicode at the core (just like C#), I expect you could simply create the file in a UTF-8-friendly editor like Notepad++, read the values straight into strings, and use them directly in the Selenium API.
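For example, here is a minimal sketch that loads such a file as UTF-8 directly, so no native2ascii conversion is needed; Properties.load(Reader) keeps the characters exactly as typed:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class UiMessages {
    public static Properties load(String fileName) throws IOException {
        Properties messages = new Properties();
        // Read the file as UTF-8 instead of the default ISO-8859-1 with \uXXXX escapes
        try (Reader reader = new InputStreamReader(new FileInputStream(fileName), StandardCharsets.UTF_8)) {
            messages.load(reader);
        }
        return messages;
    }
}

Usage: String expected = UiMessages.load("locale_chinese.properties").getProperty("user.login.error");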
This is how I solved the issue, for those who are interested.
A database would work better for many reasons: it scales with growth, provides a central location, keeps the text outside of the app, and can be edited and maintained outside of the app. We used a table with these columns:
id (int) auto increment
id_text -- this and the other columns are varchar, except for the last two, which are datetime
lang
translation
created_by
updated_by
created_date
updated_date
The id_text is a short English description of the text, like 'hello' or 'error1msg'; it is the key in your map.
In Java we had a function to get the text for a particular id, plus an app-level property for the default language (usually en, but it is good to keep it configurable).
The function would scan the already-loaded hashmap for the language asked for, say "ch".
If no translation was found for that language, we would return the default-language translation, and if that was not found either, we would return "[" + id + "]" so the tester knows something is missing in the database and can go to a web screen to edit the translation table and add it.
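A minimal sketch of that lookup, reconstructed from the description above (all names are illustrative):

import java.util.HashMap;
import java.util.Map;

public class Translations {
    private static final String DEFAULT_LANG = "en"; // the app-level property

    // Loaded once from the translation table: lang -> (id_text -> translation)
    private final Map<String, Map<String, String>> cache = new HashMap<>();

    public String getText(String id, String lang) {
        String translation = lookup(id, lang);
        if (translation == null) {
            translation = lookup(id, DEFAULT_LANG); // fall back to the default language
        }
        if (translation == null) {
            return "[" + id + "]"; // flags a missing database entry for the tester
        }
        return translation;
    }

    private String lookup(String id, String lang) {
        Map<String, String> byId = cache.get(lang);
        return byId == null ? null : byId.get(id);
    }
}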
I'm trying to find out what has happened in an integration project. We just can't get the encoding right at the end.
A Lithuanian file was imported to the AS/400, where text is stored in EBCDIC. The data was then exported to an "ANSI" file and read as windows-1257. ASCII characters work fine, and some of the Lithuanian does too, but the rest looks like garbage, with characters like ~, ¶ and ].
An example string going through the pipeline:
Start file
Tuskulënö
as400 (the two hex lines below are read column-wise, one byte per character)
Tuskulënö
EAA9A9596
34224335A
exported file (after conversion to windows-1257)
Tuskulėnö
expected result for exported file
Tuskulėnų
Any ideas?
Regards,
Karl
EBCDIC isn't a single encoding, it's a family of encodings (in this case called codepages), similar to how ISO-8859-* is a family of encodings: the encodings within the families share about half the codes for "basic" letters (roughly what is present in ASCII) and differ on the other half.
So if you say that it's stored in EBCDIC, you need to tell us which codepage is used.
A similar problem exists with ANSI: when used as the name of an encoding, it refers to a Windows default encoding. Unfortunately, the default encoding of a Windows installation varies based on the configured locale.
So again: you need to find out which actual encoding is used here (these are usually from the Windows-* family; the "normal" English one is Windows-1252).
Once you actually know what encoding you have and want at each point, you can go towards the second step: fixing it.
My personal preference for this kind of problems is this: Have only one step where encodings are converted: take whatever the initial tool produces and convert it to UTF-8 in the first step. From then on, always use UTF-8 to handle that data. If necessary convert UTF-8 to some other encoding in the last step (but avoid this if possible).
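A minimal sketch of that first step, assuming the AS/400 export really is in the Baltic EBCDIC codepage (IBM1112 is an assumption, and not every JRE ships that charset; the file names are made up too):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ToUtf8 {
    public static void main(String[] args) throws Exception {
        // Decode exactly once, from the actual source codepage...
        byte[] raw = Files.readAllBytes(Paths.get("export.ebcdic"));
        String text = new String(raw, Charset.forName("IBM1112")); // assumed EBCDIC Baltic codepage

        // ...then keep the data in UTF-8 from here on
        Files.write(Paths.get("export.utf8.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}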
In the past, I have experienced different JSON-encoded values for the same string depending on the language used. Since the APIs were used in a closed environment (no 3rd parties allowed), we made a compromise and all our Java applications manually encode Unicode characters. LinkedIn's API is returning "corrupted" values, basically the same as our Java applications. I've already posted a question on their forum; the reason I am asking here as well is quite simple: sharing is caring :) This question is therefore partially connected with LinkedIn, but mostly about finding an answer to the general encoding problem described below.
As you can see, my last name contains the letter ž, which should be \u017e, but Java (or LinkedIn's API, for that matter) returns \u009e in the JSON response and nothing in the XML response. PHP's json_decode() ignores it, and my last name becomes Kurida.
After an investigation, I've found that ž apparently has two representations, 0x9E and U+017E. What exactly is going on here? Is there a solution to this problem?
U+009E is a usually-invisible control character and not an acceptable alternative representation for ž.
The byte 0x9E represents the character ž in Windows code page 1252. That byte, if decoded using ISO-8859-1, would turn into U+009E.
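A minimal sketch showing the mix-up, and the usual repair for strings that were mis-decoded this way:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        byte[] raw = { (byte) 0x9E };

        // The same byte, decoded two ways:
        String asCp1252 = new String(raw, Charset.forName("windows-1252"));
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        System.out.printf("cp1252: U+%04X%n", (int) asCp1252.charAt(0)); // U+017E (ž)
        System.out.printf("latin1: U+%04X%n", (int) asLatin1.charAt(0)); // U+009E (control char)

        // Text mis-decoded as ISO-8859-1 can be repaired by re-encoding it
        // and decoding the bytes as windows-1252:
        String fixed = new String(asLatin1.getBytes(StandardCharsets.ISO_8859_1),
                Charset.forName("windows-1252"));
        System.out.println(fixed); // ž
    }
}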
(The confusion comes from the fact that if you write a numeric character reference such as &#x9E; in an HTML page, the browser doesn't actually give you the character U+009E, as you might expect, but converts it to U+017E. The same is true of all the character references 0080–009F: they get changed as if the numbers referred to cp1252 bytes instead of Unicode characters. This is utterly bizarre and wrong behaviour, but all the major browsers do it, so we're stuck with it now. Except in proper XHTML served as XML, since that has to follow the more sensible XML rules.)
Looking at the forum page, the JSON reading is clearly not wrong: your name is registered as being “David Kurid[U+009E]a”. How that data got into their system is what needs looking at.