Android, mysql, and rendering non Latin Characters as well as Latin? - java

Are these squares a representation of chinese characters being turned into unicode?
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
I'd like to either turn this back into the original characters when displayed in Android, or get MySQL to store them as Chinese characters directly rather than as Unicode escapes.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"), 8);
While debugging it shows the strings value as
"\u001a\u001a\u001a\u001a"
byte[] bytes = chinesestringfromdatabase.getBytes();
turns it into
"[26, 26, 26, 26]"
String fresh = new String(bytes, "UTF-8");
and then this turns it back into
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
My phone can display chinese text.
MySQL charset: UTF-8 Unicode (utf8)
While typing my question I realized that perhaps I have the wrong charset altogether.
I'm lost as to whether or not my issue will even be anything coding related or if it is just related to a setting or if php cannot handle the character set??
I'd like to store and render multiple language character sets that could contain a mixture of languages.

Here I entered the squares with numbers inside them into the post but they didn't render
With "squares with numbers inside", do you mean the same as those which you also see for some exotic languages somewhere at the bottom of the Wikipedia homepage, while browsing with Firefox? (In all other browsers, such as MSIE, Chrome, and Safari, you would only see meaningless empty squares.)
If true, then it simply means that there are no glyphs available for those characters in the font which the web browser/viewer has been instructed to use.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Use UTF-8 all the way. Only keep in mind that MySQL's utf8 charset supports only the BMP (Basic Multilingual Plane) of Unicode (max 3 bytes per character), not the supplementary planes (4 bytes per character). So characters outside the BMP (which include some rare CJK characters) are out of range for it. From MySQL 5.5.3 on, the utf8mb4 charset covers those as well.
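To see which strings would run into that 3-byte limit, you can look for code points above U+FFFF. This is only a rough sketch; the class and method names are made up for illustration, and the sample characters are arbitrary:

```java
import java.nio.charset.StandardCharsets;

final class BmpCheck {
    // True when every code point lies in the BMP, i.e. needs at most 3 bytes in UTF-8.
    static boolean fitsInBmp(String s) {
        return s.codePoints().noneMatch(Character::isSupplementaryCodePoint);
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String common = "\u4e2d\u6587";   // two common CJK characters, both in the BMP
        String rare = "\uD840\uDC0B";     // U+2000B, a CJK Extension B character (surrogate pair)
        System.out.println(fitsInBmp(common) + ", " + utf8Length(common) + " bytes");
        System.out.println(fitsInBmp(rare) + ", " + utf8Length(rare) + " bytes");
    }
}
```

A string for which fitsInBmp returns false needs a 4-byte-capable column charset on the MySQL side.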
References
PHP UTF-8 cheatsheet
Unicode - How to get characters right? (for Java web developers)
MySQL reference - Unicode support

What were the numbers in the boxes? I'm guessing they were 001A? Like ?
(SO will usually filter these out as they're ASCII control characters, typically invisible in other browsers.)
While debugging it shows the strings value as "\u001a\u001a\u001a\u001a"
Well clearly there's no Chinese or any text to be recovered there. Any informational content in the original string has been lost.
Whilst I agree that you need to be using UTF-8 throughout (which for PHP means serving the form page with a UTF-8 <meta> tag, using mysql_set_charset('utf8'), and creating your MySQL tables with UTF-8 collations), I think you must have a more serious corruption problem than just UTF-8-vs-other-ASCII-compatible-encoding if you are somehow getting just identical control characters instead of a text string.

Related

Is there a way to find file encoding type (UTF-8 or ANSI or Cp1252 or others) using java

I have to read a few HTML files. If I use UTF-8 as the charset for reading and writing a file, some junk characters are displayed in the HTML page. It seems the actual file is ANSI encoded; since I am using UTF-8 for reading and writing the file, a few white spaces are displayed as a black diamond with a question mark.
Is there a way to find the encoding/charset to be used to read/write a particular file?
No, that's mathematically impossible. Files are just bags of bytes, and most encodings are such that any byte has meaning. Short of using an artificial intelligence getup that analyses how likely it is (looking for words that mix characters from different unicode planes and the like) that you read it using the right encoding, there is therefore no way to be sure.
Some files can be conclusively determined to definitely not be UTF-8 (or to be corrupted), because there are certain byte sequences that cannot appear in the bytestream that results when you UTF-8 encode some characters. However, this isn't very useful either: You cannot conclude: Oh! Must be UTF-8! based on the lack of these invalid sequences.
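The "definitely not UTF-8" check can be sketched with a strict decoder from the JDK. The class and method names here are hypothetical; a true result only means "no invalid sequences found", not "this is UTF-8":

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    // Returns false if the bytes contain a sequence that can never occur in valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = {0x63, 0x61, 0x66, (byte) 0xE9};               // "café" saved as ISO-8859-1
        byte[] utf8 = {0x63, 0x61, 0x66, (byte) 0xC3, (byte) 0xA9};    // "café" saved as UTF-8
        System.out.println(isValidUtf8(latin1));
        System.out.println(isValidUtf8(utf8));
    }
}
```

Pure ASCII input passes this check too, which is exactly why a pass proves nothing about the real encoding.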
You have some options
The right way
When you saved those HTML files, that is when encoding was either chosen (the HTML was received from the webserver and loaded into browser memory, and has been decoded from bytes to chars using the charset listed in the HTTP response header 'Content-Type', then you asked the browser to save it to a file, at which point the browser needs to choose an encoding), or it was known (the tool used to save the HTML saves the HTML 'raw', straight as it was sent over the HTTP connection, but as part of doing this, this tool knows the encoding, as the HTTP server sent it in the 'Content-Type' header), and therefore that was the perfect time to store this information, or to choose a well known encoding (UTF-8 is a good idea).
So, go back to whichever software and/or process managed to save these files and fix it at the source: Either also save the encoding, or, ensure that the HTML file is saved in UTF-8 no matter what the HTTP server you got this HTML from sent it as.
The hacky way
Grab a magnifying glass, put on your finest hat, and get your sherlock holmes on.
The usual strategy is to open a hex editor, go to the position in the file where you see diamonds or unexpected characters, and look at the byte sequence. Especially if it is a somewhat 'well known' western non-ASCII character like é or ö, odds are that a web search for the byte(s) you see there will turn it up. Look for the bytes with decimal value 128 or higher (in hex, the ones that start with an 8, 9, or a letter), because the ones below that are ASCII and almost all encodings encode those the same way, so they are not useful for telling encodings apart.
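If you don't have a hex editor handy, a few lines of Java can list where the interesting bytes are. A minimal sketch, with made-up class and method names, that expects the file path as the first command-line argument:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class NonAscii {
    // Offsets of all bytes with value 128 or higher; everything below is plain ASCII.
    static List<Integer> offsets(byte[] data) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < data.length; i++) {
            if ((data[i] & 0xFF) >= 0x80) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        for (int offset : offsets(data)) {
            // Print the position and the byte value in hex, ready to paste into a web search.
            System.out.printf("offset %d: 0x%02X%n", offset, data[offset] & 0xFF);
        }
    }
}
```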
For example, if you search for 0xE1 0xBA 0x9E the first hit leads you to this page, scrolling down to 0xe1 0xBA 0x9e it says: That's the UTF-8 version of codepoint 1E9E, the capital sharp s (ẞ; its lowercase form ß is common in German). If that makes sense in the text, we figured it out. We will need an AI to do text analysis to figure out if it makes sense. I don't have one, so we'll need an artificial artificial intelligence. In other words, your brain will have to do the job. Just look at it: If, after substituting an ẞ, the text says Last Name: BOẞLER, you obviously got it, since Boßler is a German last name, as well as a mountain in Germany. Web searching again to the rescue if you are not sure.
Sometimes you have to figure out what character it was supposed to be, and include this in the search. For example, if you check the file and you see a 0xDF and you know a ß has to be there, search for 0xDF ß and you get to this page which shows a ton of encodings and how they store ß. Only a few store it as 0xDF: It's some ISO-8859 variant, or a Cp-125x variant (a.k.a. windows-125x), and you've managed to exclude IBM852. There's no way to know which ISO-8859 or Cp-125x variant it actually is; you'll need more weird characters and hope you hit one where you know what it is supposed to be and these chars are encoded differently between them (unlikely; they are very similar).
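The same elimination game can be played locally by decoding the suspect bytes with each candidate and eyeballing the result. A sketch, assuming the charset names below are installed in your JVM (only a handful are guaranteed, so unsupported ones are skipped); the class name and sample bytes are made up:

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

final class CandidateDecodings {
    // Decodes the same bytes with every supported candidate charset, keeping the given order.
    static Map<String, String> decodeWith(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            if (Charset.isSupported(name)) {
                results.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // "Bo?ler" where the unknown byte is 0xDF
        byte[] suspect = {0x42, 0x6F, (byte) 0xDF, 0x6C, 0x65, 0x72};
        decodeWith(suspect, "UTF-8", "ISO-8859-1", "windows-1252", "IBM852")
                .forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

Candidates that print the same sensible text cannot be told apart with this sample; you'd need a byte on which they differ.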
Most likely in the end you end up knowing that it is one of a few encodings, because usually there are multiple encodings that would all result in the exact same byte sequence. In fact, if you have all-ASCII characters, there are thousands of encodings that it could be.

UTF-8 to CP864 (arabic) conversion

I have the following task: some text in mixed latin/arabic written in UTF-8 needs to be converted for printing using POS-printer, which uses ancient one-byte code page 864.
text.getBytes("ibm-864") suddenly shows many question marks instead of arabic characters, and after digging the code I understood that conversion table has some different versions of arabic characters used to map to ibm-864 (somewhere in the FExx range rather than 06xx, which I have in my text).
I'm looking for some code or library which can convert Arabic Unicode to cp864, preferably mapping to the corresponding forms of Arabic chars (in cp864 there are isolated, initial, medial and final forms for some chars), and maybe even handling the reversal for RTL, because I doubt that the hardware supports it automatically.
I understand that this is a very specific task, but why not give it a try? Also, I know how to implement this, but I'm trying to find a ready-to-use bicycle :)
Anyone?
Another possible solution: a library that can translate Unicode Arabic from the range U+0600 - U+06FF (Arabic) to the range U+FE70 - U+FEFF (Arabic Presentation Forms-B). Then I can safely get my bytes in cp864. Has anyone seen anything like that?
To output arabic text to a relatively dumb output device, you'll need to do several things:
Divide the text into blocks of different directionality using the Unicode Bidirectional Algorithm (UBA), better known as Bidi.
Mirror characters that need to be mirrored (e.g. opening parentheses point in different directions depending on whether they are inside LTR or RTL blocks)
Since the output device is dumb, you'll need to change characters into their positional forms, and apply ligatures where needed (there is a ligature for LAM + ALEF). This is done by a piece of software called an Arabic Shaper.
You'll need to reorder text according to their directionality.
Since CP864 doesn't have all the positional forms for all characters, you'll need to convert to fallback forms, converting some final forms to isolated forms, some medial forms to initial forms, and some initial forms to isolated forms. The text will not ligate as nicely as if there were proper forms, but it will come relatively close.
On Java, the ICU library allows you to do that:
ICU's Bidi can take care of dividing into blocks, mirroring, and reordering. Reordering can be done before shaping, since ICU's ArabicShaping supports working with text in both logical (pre-reordering) and visual (post-reordering) order.
ICU's ArabicShaping can take care of shaping the text, mapping it into the appropriate presentational forms (the FExx range you talked about, which is not meant to be used normally, it is only meant to be used to interface with legacy software/hardware, in this case the printer that understands CP864 but not Unicode).
ICU's CharsetProvider and CharsetEncoder can be used to convert to CP864 using a fallback (non-roundtrip) conversion for characters that are not on the output charset, in this case the final→isolated, medial→initial,... forms.
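Whichever shaper you end up with, it helps to check which characters the target code page rejects before sending anything to the printer. The sketch below uses only the JDK (not ICU); the class and method names are hypothetical, and it assumes your JVM ships the IBM864 charset, which is not guaranteed:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

final class EncodableCheck {
    // Code points of the text that the given charset cannot encode (these would print as '?').
    static List<Integer> unencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        List<Integer> missing = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            if (!encoder.canEncode(new String(Character.toChars(cp)))) {
                missing.add(cp);
            }
        });
        return missing;
    }

    public static void main(String[] args) {
        String name = "IBM864";
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this JVM");
            return;
        }
        Charset cp864 = Charset.forName(name);
        String logical = "\u0628";   // ARABIC LETTER BEH, as typed (06xx range)
        String shaped = "\uFE8F";    // ARABIC LETTER BEH ISOLATED FORM (FExx range)
        System.out.println("unshaped, not encodable: " + unencodable(logical, cp864));
        System.out.println("shaped, not encodable:   " + unencodable(shaped, cp864));
    }
}
```

Running the shaped and reordered output through a check like this shows which presentation forms still need a fallback mapping.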

Passing Unicode line return characters set in Class to client side (DWR/HTML/UTF8) for InDesign Team

I've built a content management tool that allows a product team to create and manage product that gets exported to a website and for a different team of designers to create print ads for newspapers displaying the same product data.
My problem is with the InDesign graphic designers and the macros that they use within InDesign. The macros have the ability to take copy/pasted text/data and auto format the text inside InDesign based on the presence of certain characters. In particular the design team uses tab, "soft line break" (shift return), and regular line breaks (hard returns) in their macros.
Right now I generate a block of text with the records and the desired formatting characters in a java Class and then that's sent via DWR to the client side. When there is a requirement for a tab character I send \t, return is \r and I was hoping that a soft line break would be \n however InDesign seems to regard both \r and \n as a regular line break.
I had given up on being able to pass a soft return until yesterday, when I came across Unicode \u2028 (soft line break) and \u2029 (regular line break). I've tried outputting both of these characters instead of \r and \n in the hopes that InDesign may regard these characters differently. In the box that the designers copy the output from, it looks like there is no character there. There's no line break at all in the places where I've specified \u2028 to appear. When I copy/paste the output into a text editor it shows me that there is an unrecognized character there (it displays as a box with a question mark inside it).
Platform is Java/MySQL running on Tomcat.
To date, I haven't had to deal too much with character encoding in this application. Header has <meta charset="utf-8" /> set but that's about it so far. I've tried setting this to utf-16 but it doesn't change the output. All of the tables in the MySQL database are set to utf8/utf8_general_ci.
Thoughts? How can I force InDesign to take copy/pasted text and recognize all of its macro capable characters? Actually, it's just the soft line breaks that it's not recognizing. HELP! :)
Thank you. Sorry this is so long!
Ryan V
I've been playing around with ID CS6 (OS X) for a while and I can't for the life of me get it to recognize a pasted LF as a forced line break. LF and CR and CRLF all go to paragraph breaks. U+2028 and U+2029 are displayed as empty glyphs, not breaks.
I'm a little wary of posting this as an answer, but I'll give it a go:
You might consider providing the text as a downloaded .txt file. CS5 introduced "Tagged Text" (a sort of XML-ish text document with full support for InDesign characters, attributes, etc.,) so this means your designers will be able to place the text file and InDesign will treat everything as intended.
To turn your existing text into CS5+'s Tagged Text (see the reference here), plop a <ASCII-MAC> or <ASCII-WIN> (as appropriate) as the first line and escape any '<' or '>'s with a backslash, then you're free to use <0x000A> as a forced line break. (literally those 8 characters)
That's probably mega-overkill, but it's certainly the most stupidly reliable way I can think of. I'll edit if I get anything else working.
NB. "forced line break" is the term InDesign itself uses for the character produced by Shift+Enter, your "soft line break;" contrast with "paragraph break" for a standard carriage return. InDesign apparently represents forced breaks with LF (U+000A) and paragraph breaks with CR (U+000D).
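On the Java side, the conversion described above could look roughly like this. It is a sketch that does only what is listed here (header line, backslash-escaping of angle brackets, <0x000A> for forced breaks); the class and method names are made up, <ASCII-WIN> is picked arbitrarily, and it assumes the input is plain ASCII with \n marking the soft breaks:

```java
final class TaggedText {
    // Wraps plain text as InDesign Tagged Text, turning '\n' into a forced line break tag.
    static String toTaggedText(String text) {
        StringBuilder out = new StringBuilder("<ASCII-WIN>\r\n");   // header must be the first line
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("\\<"); break;         // escape literal angle brackets
                case '>':  out.append("\\>"); break;
                case '\n': out.append("<0x000A>"); break;    // forced line break (Shift+Enter)
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTaggedText("Price < 10\nSecond line"));
    }
}
```

Anything beyond that (non-ASCII characters, literal backslashes, paragraph styles) is not covered and would need checking against the Tagged Text reference.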
I'm not sure how you were trying to transfer and print out your characters (if you post your DWR and javascript code I might be able to help more), but one thing I would try is to ensure that your java output is actual UTF-8 using something like:
String yourRecordString = "Some line 1. \u2028Some line 2.";
ByteBuffer bb = Charset.forName("UTF-8").encode(yourRecordString);
Then, you can write out the bytes in bb into an output stream/file and check them. (Make sure to write them as bytes and not as a String nor as chars.) For example, the UTF-8 encoding of \u2028 is E2 80 A8, so you should see that sequence at the appropriate place in your output. (I use hexmode in vim for things like this.)
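If you'd rather check the bytes without leaving Java, something like the following prints them in hex. A minimal sketch; the class and method names are just for illustration:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Hex {
    // Space-separated hex of the string's UTF-8 bytes, e.g. for comparing against E2 80 A8.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String yourRecordString = "Some line 1. \u2028Some line 2.";
        System.out.println(toHex(yourRecordString));
    }
}
```

The E2 80 A8 sequence should sit right after the bytes for "Some line 1. " in the output.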
Then, make sure that these bytes get received back on the javascript side. (While I'm not an expert with DWR, I might prefer to make your java function return something other than a String.)
This should at least help you diagnose where the problem lies. If you do see that sequence and if InDesign still isn't recognizing the soft line breaks, then you at least know the problem is with InDesign and that you will have to find some other solution (such as modifying the designer's macros to recognize other characters).
(Also, note that you can see the default encoding for your JVM using Charset.defaultCharset(). My guess is that your default is not UTF-8 and that InDesign may have also had a problem with the UTF-16 you tried due to endianness or something like that.)

How to display Latin-1 characters in Android app

I have to fetch text from an online database encoded in Latin-1 charset and every special Latin character (i.e. à, ò, ù, è...) was displayed with black squares with a "?" inside.
How can i display this correctly?
Luckily i found an answer after a couple of hours and i want to share it with you all.
Read below for my solution
The solution was really simple, though I hadn't thought of it at first, and it is easy to understand and implement. Here is the code:
mIn = new BufferedReader(new InputStreamReader(mSocket.getInputStream(),"ISO-8859-1"));
This way, all the incoming strings from the Latin-1 server are decoded correctly and displayed properly in Android TextViews.
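To see why the charset argument matters, here is a small sketch that feeds the same Latin-1 bytes through a reader twice. The class name, helper, and sample word are made up; an in-memory stream stands in for the socket:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Latin1Demo {
    // Reads one line from the raw bytes, decoding them with the given charset.
    static String readLine(byte[] wire, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(wire), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] wire = {0x63, 0x69, 0x74, 0x74, (byte) 0xE0};   // "città" as sent by a Latin-1 server
        // Decoded as UTF-8, the lone 0xE0 is malformed and becomes U+FFFD (the "?" diamond).
        System.out.println(readLine(wire, StandardCharsets.UTF_8));
        // Decoded as ISO-8859-1, 0xE0 maps to the intended accented letter.
        System.out.println(readLine(wire, StandardCharsets.ISO_8859_1));
    }
}
```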

How do I decipher garbled/gibberish characters in my networking program

I am working on a client-server networking app in Java SE. I am sending strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of the NetBeans IDE I am finding some gibberish characters amongst the strings that I send and receive.
I can't figure out what these characters are. They mostly look like a rectangular box, and when I paste a line containing the character into Notepad++, that character and all the characters following it disappear.
How can I find out which characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they are using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render some character code that it does not understand. This may be a real character (e.g. a Japanese character, mathematical symbol or the like) or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
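Reading the response as bytes and printing hex could look something like this. It is a sketch with hypothetical names; on a real socket you would pass socket.getInputStream(), and note that read() blocks until a byte arrives or the peer closes the connection, hence the byte limit:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class HexDump {
    // Reads up to maxBytes raw bytes (no character decoding) and returns them as spaced hex.
    static String toHex(InputStream in, int maxBytes) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            int count = 0;
            while (count < maxBytes && (b = in.read()) != -1) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02X", b));
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a server reply: "Hi" followed by a NUL terminator.
        InputStream reply = new ByteArrayInputStream(new byte[] {0x48, 0x69, 0x00});
        System.out.println(toHex(reply, 64));
    }
}
```

Since the server here sends null-terminated strings, a 00 in the dump is one possible candidate for the box: a NUL has no glyph, and some editors treat it as the end of the text.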
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ascii. Make sure you are writing exact number of bytes to the socket, and not some nice number like 4096. Best would be if you can post your code so we can help you find the error(s).
