Is there any library to sort chinese strings by stroke in Java?
Try java.text.Collator with a Chinese locale.
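A minimal sketch (whether a given Chinese locale actually sorts by stroke rather than by pinyin depends on the JDK's collation data, so verify the resulting order against your requirements):

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class ChineseSort {
    public static void main(String[] args) {
        String[] words = { "文", "中", "序", "排" };

        // Collator for a Chinese locale; check whether the resulting
        // order is actually stroke-based on your JDK.
        Collator collator = Collator.getInstance(Locale.TRADITIONAL_CHINESE);
        Arrays.sort(words, collator);

        System.out.println(Arrays.toString(words));
    }
}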
If you want to roll the code yourself, one source for the data is the Unihan database's Radical-Stroke Counts fields, published by the Unicode Consortium; the link points to the section of Unicode Technical Report #38 that describes those fields.
Note that the stroke count of an ideographic character is based on the structure (or morphology) of the character as displayed, i.e. its glyph. The glyph's morphology is a function of the font design style — especially whether the font follows traditional Chinese, simplified Chinese, or Japanese conventions. But character codes in Java are usually based on the Unicode standard, which unifies characters from all these conventions under a single character code.
So, you will need external information to tell you which convention your text is using. This in turn tells you which field of the Unihan database to use. If you know that your Chinese text strings are all simplified, or all traditional Chinese, then you have enough information.
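If you do roll it yourself, here is a rough sketch of the idea, assuming a local copy of a Unihan data file carrying the kTotalStrokes field. Note the simplifications: kTotalStrokes can hold separate values for different conventions, and this sketch simply takes the first value and compares strings by their first code point only.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class UnihanStrokeSort {

    // Parse kTotalStrokes entries from a Unihan data file; lines look like:
    // U+4E00 <tab> kTotalStrokes <tab> 1
    static Map<Integer, Integer> loadStrokeCounts(String unihanFile) throws IOException {
        Map<Integer, Integer> strokes = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(unihanFile))) {
            String[] fields = line.split("\t");
            if (fields.length == 3 && "kTotalStrokes".equals(fields[1])) {
                int codePoint = Integer.parseInt(fields[0].substring(2), 16);
                // Multi-valued fields: take the first value for simplicity.
                int count = Integer.parseInt(fields[2].trim().split(" ")[0]);
                strokes.put(codePoint, count);
            }
        }
        return strokes;
    }

    // Compares strings by the stroke count of their first code point only,
    // to keep the sketch short; a real comparator would walk both strings.
    static Comparator<String> byFirstCharStrokes(Map<Integer, Integer> strokes) {
        return Comparator.comparingInt(
                s -> strokes.getOrDefault(s.codePointAt(0), Integer.MAX_VALUE));
    }
}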
Also check out the Chinese Character Web API, which serves up data from the Unihan database.
Related
I am trying to format several aspects of my clipboard when I set it. From what I understand, I need to use DataFlavors, and I have done some reading on Oracle's documentation about it, but I am not sure if/how it is possible to set Unicode and other such formats (XML?).
DocFlavor.CHAR_ARRAY should do. This is Unicode in the form of UTF-16, which should match wide chars on Windows. The problem is probably the single-byte OEM character set that is the default.
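For the clipboard side, a minimal sketch of putting Unicode text on the system clipboard (StringSelection carries DataFlavor.stringFlavor, i.e. a UTF-16 Java String):

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.StringSelection;

public class ClipboardDemo {
    public static void main(String[] args) {
        // StringSelection exposes DataFlavor.stringFlavor, backed by a
        // Java String (UTF-16), so Unicode survives as long as the
        // receiving application requests the string flavor.
        Clipboard clipboard = Toolkit.getDefaultToolkit().getSystemClipboard();
        clipboard.setContents(new StringSelection("Ünïcødé ✓"), null);
    }
}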
I have this PDF file, which is in Greek. A known problem occurs when trying to copy and paste text from it, resulting in slight gibberish. The reason I say slight instead of total is that while the pasted output does not make sense in Greek, it is composed of valid Greek characters. Also, an interesting aspect of the problem is that not all characters are mapped wrong. For example, if you compare this original snippet of text
ΕΞ. ΕΠΕΙΓΟΝ – ΑΜΕΣΗ ΕΦΑΡΜΟΓΗ
ΝΑ ΣΤΑΛΕΙ ΚΑΙ ΜΕ Ε-ΜΑIL
with the pasted one from the PDF:
ΔΞ. ΔΠΔΙΓΟΝ – ΑΜΔΗ ΔΦΑΡΜΟΓΗ
ΝΑ ΣΑΛΔΙ ΚΑΙ ΜΔ Δ-ΜΑIL
you will notice that some of the characters are pasted correctly, while others are not. It might also be worthwhile to mention that the wrong characters are mapped wrong symmetrically, e.g. Ε becomes Δ and vice versa.
When I open the PDF with, e.g., Adobe Reader and print it using a PDF writer (in this case CutePDF), copying and pasting from the output works correctly!
Given the above, my questions are the following:
What is the root cause of this behavior?
How would I go about integrating a solution into a java-based workflow for randomly imported PDF files?
Some basic context:
Displaying text in PDF is done by selecting glyphs from a font. A glyph is the visual representation of one or more characters. Glyph selection is done using character codes. For text extraction, you need to know which characters correspond with a character code.
In this case, this is achieved using a ToUnicode CMap.
In this document, the first letter of the text snippet, E, is displayed like this:
[0x01FC, ...] TJ
The ToUnicode CMap contains this entry:
4 beginbfrange
<01f9> <01fc> <0391>
...
endbfrange
This means that character codes 0x01F9, 0x01FA, 0x01FB and 0x01FC are mapped to Unicode U+0391, U+0392, U+0393 and U+0394 respectively.
U+0394 is the Greek delta, Δ, that shows up when copy/pasting.
The next letter is painted using character code 0x0204. The relevant ToUnicode entry is <0200> <020b> <039a>, which maps it correctly to U+039E (Ξ).
So you're getting slight gibberish because only some of the Unicode mappings are wrong. Sometimes this is done on purpose, e.g. to prevent data mining; I have seen it before in financial reports.
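As for integrating this into a Java workflow: programmatic text extraction follows the same ToUnicode CMaps that copy/paste does, so a sabotaged CMap produces the same gibberish there. A minimal sketch with Apache PDFBox 2.x (the file name is a placeholder) to see what a given PDF actually yields:

import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        // Extracts text via the embedded ToUnicode CMaps, i.e. the same
        // path copy/paste takes, so a broken CMap shows up here too.
        try (PDDocument doc = PDDocument.load(new File("greek.pdf"))) {
            System.out.println(new PDFTextStripper().getText(doc));
        }
    }
}

Detecting or repairing a deliberately wrong CMap automatically is hard; for such files a fallback is OCR or, as you observed, re-printing through a PDF writer, which regenerates the fonts and their mappings.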
I have the following task: some text in mixed Latin/Arabic, encoded in UTF-8, needs to be converted for printing on a POS printer that uses the ancient one-byte code page 864.
text.getBytes("ibm-864") suddenly produces many question marks instead of Arabic characters, and after digging through the code I understood that the conversion table maps a different set of Arabic characters to ibm-864 (characters somewhere in the FExx range, rather than the 06xx range that my text contains).
I'm looking for some code or library which can convert Arabic Unicode to cp864, preferably mapping to the corresponding forms of the Arabic characters (cp864 has isolated, initial, medial and final forms for some of them), and maybe even handling the reversal needed for RTL, because I doubt the hardware supports that automatically.
I understand that this is a very specific task, but why not give it a try? I also know how to implement this myself, but I'm trying to find a ready-made solution rather than reinvent the wheel :)
Anyone?
Another possible solution: a library that can translate Unicode Arabic from the range U+0600–U+06FF (Arabic) to the range U+FE70–U+FEFF (Arabic Presentation Forms-B). Then I can safely get my bytes in cp864. Has anyone seen anything like that?
To output Arabic text to a relatively dumb output device, you'll need to do several things:
Divide the text into blocks of different directionality using the Unicode Bidirectional Algorithm (UBA), better known as Bidi.
Mirror characters that need to be mirrored (e.g. opening parentheses point in different directions inside LTR and RTL blocks).
Since the output device is dumb, you'll need to change characters into their positional forms, and apply ligatures where needed (there is a ligature for LAM + ALEF). This is done by a piece of software called an Arabic Shaper.
You'll need to reorder the text according to its directionality.
Since CP864 doesn't have all the positional forms for all characters, you'll need to convert to fallback forms, converting some final forms to isolated forms, some medial forms to initial forms, and some initial forms to isolated forms. The text will not ligate as nicely as if there were proper forms, but it will come relatively close.
In Java, the ICU library (ICU4J) lets you do all of that (a sketch follows this list):
ICU's Bidi can take care of dividing into blocks, mirroring, and reordering. Reordering can be done before shaping, since ICU's ArabicShaping supports working with text in both logical (pre-reordering) and visual (post-reordering) order.
ICU's ArabicShaping can take care of shaping the text, mapping it into the appropriate presentation forms (the FExx range you mentioned, which is not meant to be used normally; it exists only to interface with legacy software/hardware, in this case the printer that understands CP864 but not Unicode).
ICU's CharsetProvider and CharsetEncoder can be used to convert to CP864 using a fallback (non-roundtrip) conversion for characters that are not in the output charset, in this case the final→isolated, medial→initial, ... forms.
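A rough sketch with ICU4J, under two assumptions: the ICU charset module is on the classpath so the IBM864 converter is available, and plain getBytes is acceptable as a first cut (true fallback mappings for missing positional forms may require configuring ICU's CharsetEncoder directly rather than calling getBytes):

import java.nio.charset.Charset;

import com.ibm.icu.text.ArabicShaping;
import com.ibm.icu.text.ArabicShapingException;
import com.ibm.icu.text.Bidi;

public class Cp864Text {

    static byte[] toCp864(String logicalText) throws ArabicShapingException {
        // 1. Run the Bidi algorithm, reorder into visual order, and
        //    mirror paired characters such as parentheses.
        Bidi bidi = new Bidi(logicalText, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        String visual = bidi.writeReordered(Bidi.DO_MIRRORING);

        // 2. Shape the (now visual-order) Arabic letters into their
        //    presentation forms (FExx), including the LAM+ALEF ligature.
        ArabicShaping shaper = new ArabicShaping(
                ArabicShaping.LETTERS_SHAPE | ArabicShaping.TEXT_DIRECTION_VISUAL_LTR);
        String shaped = shaper.shape(visual);

        // 3. Encode to CP864; see the caveat above about fallback forms.
        return shaped.getBytes(Charset.forName("IBM864"));
    }
}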
I have a problem with language detection for Japanese using a Java library:
Using Japanese text, I'm trying to detect its language, but instead of the expected "ja" I get "en". Has anybody seen this problem before?
What is the expected output?
[ja:0.9999952022259697]
What do you see instead?
[en:0.9999952022259697]
The original issue description, with the Japanese text in attachments, can be found here.
This is almost certainly a problem related to the encoding of the input file (if that file contains Japanese at all -- I am not convinced it does).
The Java library you linked to assumes -- according to the documentation -- that the input is given as a String object. This means it assumes the encoding has already been correctly guessed and the input byte sequence been converted to a Java string.
When you use the library, you must make sure that is the case, i.e. if you are dealing with texts in unknown encodings (such as Japanese EUC-JP or SJIS), you must detect the encoding first and convert the string properly.
(Because of these reasons, good language detectors are able to detect the encoding and the language at the same time, by using language-and-encoding specific internal dictionaries.)
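For instance, a sketch of detect-then-decode using ICU4J's CharsetDetector (the file name is a placeholder, and the detection is a heuristic guess, not a guarantee):

import java.nio.file.Files;
import java.nio.file.Paths;

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class DetectThenDecode {
    public static void main(String[] args) throws Exception {
        byte[] raw = Files.readAllBytes(Paths.get("input.txt"));

        // Guess the encoding (EUC-JP, Shift_JIS, UTF-8, ...) before
        // handing the text to the language detector as a String.
        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();

        System.out.println("Detected charset: " + match.getName());
        String text = match.getString(); // decoded with the detected charset
        // ...now pass `text` to the language-detection library...
    }
}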
Are these squares a representation of Chinese characters being turned into Unicode?
EDIT: [Here I entered the squares with numbers inside them into the post, but they didn't render]
I'd like to either turn this back into the original characters when displayed in Android (or enable MySQL to just store them as Chinese characters, not in Unicode?)
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"), 8);
While debugging, it shows the string's value as
"\u001a\u001a\u001a\u001a"
byte[] bytes = chinesestringfromdatabase.getBytes();
turns it into
"[26, 26, 26, 26]"
String fresh = new String(bytes, "UTF-8");
and then this turns it back into
EDIT: [Here I entered the squares with numbers inside them into the post, but they didn't render]
My phone can display Chinese text.
MySQL charset: UTF-8 Unicode (utf8)
While typing my question I realize that perhaps I have the wrong charset altogether.
I'm lost as to whether my issue is even coding-related, or whether it is just a matter of settings, or whether PHP cannot handle the character set.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Here I entered the squares with numbers inside them into the post but they didn't render
With "squares with numbers inside", do you mean the same as those which you also see for some exotic languages somewhere at the bottom of the Wikipedia homepage, while browsing with Firefox browser? (in all other browsers -MSIE, Chrome, Safari, etc- you would only see nothing-saying empty squares).
If so, then it simply means that no glyphs are available for those characters in the font which the web browser/viewer has been instructed to use.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Use UTF-8 all the way. Only keep in mind that MySQL's utf8 charset only supports the Basic Multilingual Plane (BMP) of Unicode (max 3 bytes per character), not the supplementary planes (4 bytes per character). So supplementary-plane characters, which include the "special" CJK characters, are out of range for MySQL.
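If in doubt, a quick check in Java for whether a string stays within the BMP (and thus within what MySQL's 3-byte utf8 charset can store):

public class BmpCheck {
    // True if every code point fits in the BMP (<= U+FFFF), i.e. within
    // what MySQL's 3-byte utf8 charset can store.
    static boolean fitsInBmp(String s) {
        return s.codePoints().allMatch(cp -> cp <= 0xFFFF);
    }
}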
References
PHP UTF-8 cheatsheet
Unicode - How to get characters right? (for Java web developers)
MySQL reference - Unicode support
What were the numbers in the boxes? I'm guessing they were 001A?
(SO will usually filter these out, as they're ASCII control characters, typically invisible in other browsers.)
While debugging it shows the strings value as "\u001a\u001a\u001a\u001a"
Well clearly there's no Chinese or any text to be recovered there. Any informational content in the original string has been lost.
Whilst I agree that you need to be using UTF-8 throughout (which for PHP means serving the form page with a UTF-8 <meta> tag, using mysql_set_charset('utf8'), and creating your MySQL tables with UTF-8 collations), I think you must have a more serious corruption problem than a UTF-8-versus-other-ASCII-compatible-encoding mixup if you are somehow getting identical control characters instead of a text string.