I had to fetch text from an online database encoded in the Latin-1 charset, and every special Latin character (e.g. à, ò, ù, è...) was displayed as a black square with a "?" inside.
How can I display this correctly?
Luckily I found an answer after a couple of hours and I want to share it with you all.
Read below for my solution.
The solution is one I hadn't thought of at first, but it is really simple to understand and implement. Here is the code:
mIn = new BufferedReader(new InputStreamReader(mSocket.getInputStream(),"ISO-8859-1"));
This way, all incoming strings from the Latin-1 server are decoded correctly and display properly in Android TextViews.
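For anyone who wants to try this without a server, here is a minimal sketch of the same idea. The byte array stands in for mSocket.getInputStream(), and the class and method names are made up for illustration. It uses StandardCharsets.ISO_8859_1 instead of the string "ISO-8859-1", which assumes Java 7+ (or Android API 19+).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Latin1ReaderDemo {
    // Reads the first line from the stream, decoding the bytes as ISO-8859-1.
    static String readFirstLine(InputStream in) {
        try {
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(in, StandardCharsets.ISO_8859_1));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the socket stream: "àòùè" plus a newline, as Latin-1 bytes.
        byte[] wire = {(byte) 0xE0, (byte) 0xF2, (byte) 0xF9, (byte) 0xE8, '\n'};
        System.out.println(readFirstLine(new ByteArrayInputStream(wire)));
    }
}
```

If the reader were created without a charset, the platform default would be used, and that is probably where the "?" squares came from.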
Related
So, I finally discovered that JavaFX lets you use HostServices.showDocument(uri) to open a browser at the given URL. I have run into a problem though: I cannot open URLs that contain Chinese characters. It interprets them as '?', taking you to the wrong URL. AWT's Desktop.browse(uri) handles these characters without a problem, so I know that they can technically be passed to the browser. I'm not sure whether there is anything I can do on my end.
My question is: Is there any way to make JavaFX's HostServices.showDocument() correctly read in Chinese characters?
EDIT:
Sample string
http://www.mdbg.net/chindict/chindict.php?page=worddict&wdrst=0&wdqb=%E6%96%87
You can follow the link through to see the address's Chinese character (at the very end of the URL). In doing this, I noticed that the character gets converted to a series of %, letters, and numbers. Plugging those into showDocument() in place of the character works fine. So I guess the question is now "How do I convert a character to this format?"
I was able to figure out that converting the string into a URI and then calling its .toASCIIString() method gave me what I needed (converting Chinese characters, and I would assume others, into something readable by showDocument()). Thanks for the help, jewelsea.
If there is a better way to do this, feel free to give me another answer.
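For reference, a minimal sketch of that conversion, using the sample URL from above. The class and method names are hypothetical. The Chinese character is written as the escape \u6587 so the example does not depend on the source file encoding.

```java
import java.net.URI;

class AsciiUrlDemo {
    // Percent-encodes any non-ASCII characters in the URL as UTF-8 octets.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static String toAsciiUrl(String url) {
        return URI.create(url).toASCIIString();
    }

    public static void main(String[] args) {
        String url = "http://www.mdbg.net/chindict/chindict.php"
                + "?page=worddict&wdrst=0&wdqb=\u6587";
        // The result ends in wdqb=%E6%96%87 and can be passed to showDocument().
        System.out.println(toAsciiUrl(url));
    }
}
```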
So I'm working with the last.fm API. Sometimes a query returns tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is), but printing to the console shows them as ??? (which is OK for me).
Now, how do I handle these? At first I thought I could remove every song that has any character other than the ones in the English language. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with having only tracks with English characters.
Another question: what is the best way to display these characters on the console and/or in a Swing GUI?
First, you must ensure that you use the correct encoding when reading your input.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing will display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
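Here is a rough sketch covering both parts of the question, with made-up class and method names. It assumes that "English characters" means plain ASCII. The \\w+ attempts presumably failed because \w does not match spaces or punctuation, which most track titles contain. Whether the console actually shows the non-ASCII titles still depends on the console's own encoding and font.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TrackTitleDemo {
    // True if every character of the title is in the ASCII range.
    static boolean isAsciiOnly(String title) {
        return title.matches("\\p{ASCII}+");
    }

    // Returns only the titles that consist entirely of ASCII characters.
    static List<String> keepAsciiOnly(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (isAsciiOnly(title)) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws Exception {
        List<String> titles = Arrays.asList("A Thousand Years", "\u00c6ther", "\u6c34\u93e1");
        // Write to the console as UTF-8 instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(titles);
        out.println(keepAsciiOnly(titles));
    }
}
```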
When I use the extractMetadata(MediaMetadataRetriever.METADATA_KEY_TITLE) function, some of the strings returned are displayed incorrectly.
For example,
Christina Perri - A Thousand Years
is displayed as
䌀栀爀椀猀琀椀渀愀 倀攀爀爀椀 ⴀ 䄀 吀栀漀甀猀愀渀搀 夀攀愀爀猀
Does anyone have any tips as to how I can get the string to display correctly?
I have no idea about Android, but there are two possibilities:
You are reading it correctly and someone used these characters when storing the data.
You get the wrong characters because the text was stored in a different encoding than the one you are using to display it. In this case you need to tell Java which encoding the string is in.
A good place to start reading about encodings is this blog
The Java tutorial for working with text
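The sample in the question looks like the second case: each garbled character has the code of the expected character with its two bytes swapped ('C' is U+0043, and 䌀 is U+4300), which suggests UTF-16 text read with the wrong byte order. Under that assumption, a sketch of a repair could look like this (the class and method names are hypothetical, and the better fix is to decode with the right encoding in the first place):

```java
import java.nio.charset.StandardCharsets;

class Utf16SwapDemo {
    // Re-encodes the garbled string as big-endian UTF-16 and decodes the
    // same bytes as little-endian, which swaps the two bytes of each char.
    static String swapUtf16ByteOrder(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // First five characters of the garbled title from the question.
        String garbled = "\u4300\u6800\u7200\u6900\u7300";
        System.out.println(swapUtf16ByteOrder(garbled)); // prints "Chris"
    }
}
```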
Are these squares a representation of Chinese characters being turned into Unicode?
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
I'd like to either turn this back into the original characters when displayed in Android, or get MySQL to store them as Chinese characters rather than as Unicode escapes.
BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"), 8);
While debugging it shows the strings value as
"\u001a\u001a\u001a\u001a"
byte[] bytes = chinesestringfromdatabase.getBytes();
turns it into
"[26, 26, 26, 26]"
String fresh = new String(bytes, "UTF-8");
and then this turns it back into
EDIT:[Here I entered the squares with numbers inside them into the post but they didn't render]
My phone can display Chinese text.
MySQL charset: UTF-8 Unicode (utf8)
While typing my question I realized that perhaps I have the wrong charset altogether.
I'm lost as to whether my issue is even coding related, or whether it is just a setting, or whether PHP cannot handle the character set.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
Here I entered the squares with numbers inside them into the post but they didn't render
By "squares with numbers inside", do you mean the same as those you see for some exotic languages at the bottom of the Wikipedia homepage when browsing with Firefox? (In all other browsers, such as MSIE, Chrome and Safari, you would only see meaningless empty squares.)
If so, then it simply means that there are no glyphs available for those characters in the font which the web browser/viewer has been instructed to use.
I'd like to store and render multiple language character sets that could contain a mixture of languages.
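Use UTF-8 all the way. Only keep in mind that MySQL's utf8 charset only supports the BMP (Basic Multilingual Plane) of Unicode (max 3 bytes per character), not the other planes (4 bytes per character). So the supplementary planes (which contain rarer CJK characters, among others) are out of range for utf8; storing them requires the utf8mb4 charset, available since MySQL 5.5.3.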
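To check on the Java side whether a given string would run into this limit, something like the following sketch could be used (hypothetical class and method names; it needs Java 8+ for codePoints()):

```java
import java.nio.charset.StandardCharsets;

class PlaneDemo {
    // True if the string contains any code point outside the BMP,
    // i.e. one that needs 4 bytes in UTF-8.
    static boolean hasSupplementaryChars(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String bmp = "\u6587";               // U+6587, inside the BMP
        String supplementary = "\uD840\uDC00"; // U+20000, outside the BMP
        System.out.println(hasSupplementaryChars(bmp) + " " + utf8Length(bmp));
        System.out.println(hasSupplementaryChars(supplementary) + " " + utf8Length(supplementary));
    }
}
```

The first line printed is "false 3" and the second is "true 4".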
References
PHP UTF-8 cheatsheet
Unicode - How to get characters right? (for Java web developers)
MySQL reference - Unicode support
What were the numbers in the boxes? I'm guessing they were 001A? Like ?
(SO will usually filter these out as they're ASCII control characters, typically invisible in other browsers.)
While debugging it shows the strings value as "\u001a\u001a\u001a\u001a"
Well clearly there's no Chinese or any text to be recovered there. Any informational content in the original string has been lost.
Whilst I agree that you need to be using UTF-8 throughout (which for PHP means serving the form page with a UTF-8 <meta> tag, using mysql_set_charset('utf8'), and creating your MySQL tables with UTF-8 collations), I think you must have a more serious corruption problem than just UTF-8-vs-other-ASCII-compatible-encoding if you are somehow getting just identical control characters instead of a text string.
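To illustrate why nothing can be recovered, here is a small sketch (made-up class and method names). Re-encoding and decoding the U+001A characters as UTF-8 just gives the same four control characters back. The second method shows one general way text gets lost: encoding into a charset that cannot represent it. Java substitutes '?' in that case; the 0x1A values in the question are the ASCII SUBSTITUTE control character, so some other step in the chain presumably did a similar substitution, though which one is only a guess.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LostTextDemo {
    // Encodes to UTF-8 and decodes again; this never adds information.
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Encodes to Latin-1; characters it cannot represent become '?' (byte 63).
    static byte[] encodeLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String damaged = new String(new char[] {0x1A, 0x1A, 0x1A, 0x1A});
        System.out.println(Arrays.toString(damaged.getBytes(StandardCharsets.UTF_8)));
        System.out.println(roundTripUtf8(damaged).equals(damaged));
        System.out.println(Arrays.toString(encodeLatin1("\u6587")));
    }
}
```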
I am working on a client-server networking app in Java SE. The client sends newline-terminated strings to the server, and the server responds with a null-terminated string.
In the output window of the NetBeans IDE I am finding some gibberish characters among the strings that I send and receive.
I can't figure out what these characters are. They mostly look like a rectangular box, and when I paste a line containing one into Notepad++, that character and everything after it disappears.
How can I find out which characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human-readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they are using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human-readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render some character code that it does not understand. This may be a real character (e.g. a Japanese character, mathematical symbol or the like) or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
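A minimal sketch of such a hex dump follows. The class and method names are made up, and the byte array stands in for whatever was actually read from the socket.

```java
class HexDumpDemo {
    // Formats the first `length` bytes as two-digit hex values separated by spaces.
    static String toHex(byte[] data, int length) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            // Mask with 0xff so negative bytes print as 80..ff.
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from the socket: "hi", a null terminator,
        // and two bytes that would not display as ordinary text.
        byte[] received = {'h', 'i', 0x00, 0x1a, (byte) 0xe9};
        System.out.println(toHex(received, received.length)); // prints "68 69 00 1a e9"
    }
}
```

With real socket code, pass the count returned by InputStream.read(buffer) as the length so that unused buffer space is not dumped.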
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes to the socket, and not some round number like 4096. It would be best if you could post your code so we can help you find the error(s).
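To show what "exact number of bytes" means in practice, here is a sketch of a copy loop that writes only as many bytes as each read returned. The class and method names are made up, and in-memory streams stand in for the socket streams. Writing the whole 4096-byte buffer every time would send stale or zero bytes after the real data, which could show up as boxes on the other end.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class ExactWriteDemo {
    // Copies everything from in to out, writing only the bytes actually read.
    static void copy(InputStream in, OutputStream out) {
        byte[] buffer = new byte[4096];
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // not out.write(buffer)
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] message = {'h', 'e', 'l', 'l', 'o'};
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        copy(new ByteArrayInputStream(message), sink);
        System.out.println(sink.size()); // prints 5, not 4096
    }
}
```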