My app is English only, but some of the data that I am dynamically retrieving and displaying is in a different language (e.g. Korean). I am doing this before adding the string item to a list view:
test = new String(item.name.getBytes("UTF-8"));
When I use the Eclipse debugger to check the test string, I am able to view the string with the appropriate language characters but when I display the listview on the emulator, it turns into garbage.
I've read that Android automatically supports languages like Japanese, Telugu etc so I am assuming that I am doing something wrong here. Can anyone help? Thanks!
Why the hell are you doing this? You have a string, encode it as UTF-8 and THEN decode it with the platform's default encoding - this will obviously fail if the default encoding is anything but UTF-8.
Obvious fix: test = item.name.
Also correct but rather useless: test = new String(item.name.getBytes("UTF-8"), "UTF-8");
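To illustrate the point, here is a minimal sketch (the class and method names are made up, and ISO-8859-1 stands in for "some non-UTF-8 default encoding"). Decoding with the same charset that was used for encoding gives the original string back; decoding with a different one turns every UTF-8 byte into its own character:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encode to UTF-8 bytes, then decode those bytes with the given charset.
    static String roundTrip(String s, Charset decodeWith) {
        return new String(s.getBytes(StandardCharsets.UTF_8), decodeWith);
    }

    public static void main(String[] args) {
        // Three Korean syllables, written as escapes so the source file encoding does not matter.
        String name = "\uD55C\uAD6D\uC5B4";

        // Same charset both ways: the string comes back unchanged (a no-op).
        System.out.println(roundTrip(name, StandardCharsets.UTF_8).equals(name));

        // Different charset on decode: 3 characters become 9 unrelated ones.
        String broken = roundTrip(name, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals(name) + ", length " + broken.length());
    }
}
```

So if item.name is already a correct String, there is nothing to convert and it can be used directly.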
I am trying to read .properties files containing Chinese characters. When I read the values by key, they print as ??????. I am writing a JSF application that needs to be translated into Chinese. In the JSF UI all characters display correctly, but in my Java code they show up as ?????, and I do not know why. I also tried "手提電話" and "\u88dc\u7fd2\u500b\u6848" as literals and printed them to the console from a main method; they print correctly as Chinese characters. My properties file is encoded as UTF-8.
To clarify, I am able to display them on the console in Chinese like this:
String str="Алексей";
String str="\u88dc\u7fd2\u500b\u6848";
System.out.println("direct output: "+str);
This works fine in public static void main, but after reading from the properties file it shows ???.
Hope it is clear now.
Also, my database is receiving ?? in place of the actual Chinese characters.
Please help. If any other clarification is required, please let me know so I can update the question.
Here is my code for reading the properties file which returns the bundle of relevant locale.
FacesContext context = FacesContext.getCurrentInstance();
bundle = context.getApplication().getResourceBundle(context, "hardvalue");
After this, I call the following to access the value:
bundle.getString("tutorsearch.header"); // results in ?????
If you need any other code, please let me know.
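One thing worth checking, as a sketch rather than a diagnosis: java.util.Properties decodes a byte stream as ISO-8859-1, so a UTF-8 file loaded that way comes back garbled unless it is read through a Reader with an explicit charset. Whether the JSF bundle lookup in the question goes through that path depends on the Java version and is an assumption here; the class and method names below are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class PropertiesEncodingDemo {
    // Loads through a Reader, so the bytes are decoded as UTF-8.
    static Properties loadAsUtf8(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new InputStreamReader(new ByteArrayInputStream(fileBytes), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Loads from a raw byte stream; Properties decodes this as ISO-8859-1.
    static Properties loadAsBytes(byte[] fileBytes) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String expected = "\u624B\u63D0\u96FB\u8A71"; // the sample text from the question, as escapes
        // Simulates the content of a UTF-8 encoded .properties file.
        byte[] fileBytes = ("tutorsearch.header=" + expected).getBytes(StandardCharsets.UTF_8);

        String viaReader = loadAsUtf8(fileBytes).getProperty("tutorsearch.header");
        String viaStream = loadAsBytes(fileBytes).getProperty("tutorsearch.header");
        System.out.println("Reader with UTF-8 matches: " + viaReader.equals(expected));
        System.out.println("Raw stream matches: " + viaStream.equals(expected));
    }
}
```

The alternative that avoids the problem entirely is to keep the file in ASCII with \uXXXX escapes, which Properties understands either way.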
Multiple question marks usually come from this combination:
you had utf8-encoded data (good)
SET NAMES latin1 was in effect (default, but wrong)
the column was declared CHARACTER SET latin1 (default, but wrong)
For Chinese (or Emoji), you need to use MySQL's utf8mb4.
The cure (for future INSERTs):
utf8-encoded data (good)
mysqli_set_charset('utf8mb4') (or whatever your client needs for establishing the CHARACTER SET)
check that the column(s) and/or table default are CHARACTER SET utf8mb4
If you are displaying on a web page, <meta...charset=utf-8> should be near the top. (Note different spelling.)
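The mechanism behind the question marks can be reproduced in plain Java. This is only an analogy, not the MySQL code path: Java's ISO-8859-1 stands in for a latin1 connection or column, and the class and method names are made up. Characters that the target charset cannot represent are replaced by '?' during encoding, and at that point the original text is gone for good:

```java
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {
    // Encodes text into a single-byte charset and decodes it again.
    // Anything the charset cannot represent is replaced by '?' on the way in.
    static String throughLatin1(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u624B\u63D0\u96FB\u8A71"; // four Chinese characters, as escapes
        System.out.println(throughLatin1(chinese));  // four question marks
        System.out.println(throughLatin1("caf\u00E9")); // survives, because latin1 has this character
    }
}
```

This is why the fix has to happen before the INSERT: once the ?? are stored, no later conversion can bring the characters back.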
Adding the configuration below resolved my issue.
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>
Thanks to All
So I'm working with last.fm API. Sometimes, the query results in tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is) but printing on console prints these as ??? - which is OK for me.
Now, how do I handle these? At first I thought I could remove every song that has any character outside the English alphabet. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with having only tracks with English characters.
Another question: What is the best way to display these characters on the console and/or in a Swing GUI?
First, you must ensure that you use the correct encoding when reading your input.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing will display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
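On the filtering attempt in the question: by default \w in Java matches only letters, digits and underscore, so ^\w+$ also rejects every title that contains a space or punctuation. If dropping non-English tracks really is acceptable, one possible sketch is to test whether the whole title can be encoded as US-ASCII. The class and method names here are made up for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsciiFilterDemo {
    // True if every character of the title is plain ASCII (spaces and punctuation included).
    static boolean isAscii(String title) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(title);
    }

    // Keeps only the titles that consist of ASCII characters.
    static List<String> keepAscii(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String title : titles) {
            if (isAscii(title)) {
                kept.add(title);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u00C6ther" starts with the AE ligature; "\u6C34\u93E1" is the CJK sample from the question.
        List<String> titles = Arrays.asList("A Thousand Years", "\u00C6ther", "\u6C34\u93E1");
        System.out.println(keepAscii(titles));
    }
}
```

Note that this also throws away titles like "é", so keeping the strings as they are and fixing only the output encoding is usually the better trade.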
I'm trying to display Arabic text in Java, but it shows junk characters (example: ¤[ï߯[î) or sometimes only question marks when I print. How do I make it print Arabic? I heard that it's something related to Unicode and UTF-8. This is the first time I'm working with languages, so I have no idea. I'm using the Eclipse Indigo IDE.
EDIT:
If I use UTF-8 encoding, the "¤[ï߯[î" characters become "????????".
For starters you could take a look here. This should allow you to make Eclipse print Unicode in its console (I do not know whether Eclipse supports this out of the box without extra tweaks).
If that does not solve your problem you most likely have an issue with the encoding your program is using, so you might want to create strings in some manner similar to this:
String str = new String("تعطي يونيكود رقما فريدا لكل حرف".getBytes(), "UTF-8");
This at least works for me.
If you embed the text literally in the code make sure you set the encoding for your project correctly.
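For the printing side, a possible sketch is to wrap System.out in a PrintStream with an explicit charset, so the output bytes do not depend on the platform default. The class and method names are made up, and it assumes the console that receives the bytes is itself set to UTF-8 (in Eclipse this is the encoding option of the run configuration). Be aware that a getBytes() / new String(..., "UTF-8") round trip leaves the string unchanged only when the platform default is already UTF-8.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class Utf8ConsoleDemo {
    // Wraps a stream so that print/println encode text as UTF-8,
    // regardless of the platform default encoding.
    static PrintStream utf8Stream(OutputStream out) {
        try {
            return new PrintStream(out, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two Arabic letters, written as escapes so the source file encoding does not matter.
        String arabic = "\u0628\u06A9";
        utf8Stream(System.out).println(arabic);
    }
}
```

Writing the literal with \u escapes sidesteps the project encoding question completely, at the cost of readability.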
Is this for Java SE, Java EE, or Java ME?
If this is for Java ME and you use LWUIT, you have to write a custom GlyphUtils.
Download this file:
http://dl.dropbox.com/u/55295133/U0600.pdf
Look at the list of Unicode code points in it.
And look at this thread:
https://stackoverflow.com/a/9172732/1061371
in the answer by Mohamed Nazar, edited by Alex Kliuchnikau:
"The below code can be use for displaying arabic text in J2ME String s=new String("\u0628\u06A9".getBytes(), "UTF-8"); where \u0628\u06A9 is the unicode of two arabic letters"
Looking at the U0600.pdf file, we can see that Mohamed Nazar and Alex Kliuchnikau give an example that creates the Arabic "ba" and "kaf" characters.
The last point you must consider is: "Make sure your UI supports Unicode (I mean Arabic) characters."
For example, LWUIT does not yet support Unicode (I mean Arabic) characters.
You need to write custom code if your app uses LWUIT.
When I use the extractMetadata( MediaMetadataRetriever.METADATA_KEY_TITLE ) function, some of the strings returned are displayed incorrectly.
For example,
Christina Perri - A Thousand Years
is displayed as
䌀栀爀椀猀琀椀渀愀 倀攀爀爀椀 ⴀ 䄀 吀栀漀甀猀愀渀搀 夀攀愀爀猀
Does anyone have any tips as to how I can get the string to display correctly?
I have no idea about Android, but there are two possibilities:
You are reading it correctly and someone used these characters when storing the data.
You get the wrong characters because the text was stored in a different encoding than the one you are using to display it. In this case you need to tell Java which encoding this string is in.
A good place to start reading about encodings is this blog
The Java tutorial for working with text
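Going only by the sample in the question, the second possibility looks likely: the first garbled character is U+4300 while 'C' is U+0043, which is what UTF-16 text looks like when it is read with the wrong byte order. Under that assumption (and it is only an inference from this one sample), the string can be repaired by swapping the byte order back. The class and method names are made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderRepairDemo {
    // Re-encodes the garbled text with the byte order that was wrongly assumed,
    // then decodes the same bytes with the opposite byte order.
    static String swapUtf16ByteOrder(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.UTF_16BE);
        return new String(raw, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        // The first nine garbled characters from the question, written as escapes.
        String garbled = "\u4300\u6800\u7200\u6900\u7300\u7400\u6900\u6E00\u6100";
        System.out.println(swapUtf16ByteOrder(garbled)); // prints "Christina"
    }
}
```

Since only some titles are affected, the repair should be applied selectively, for instance only when the result consists of plausible characters.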
First I would like to say thank you for the help in advance.
I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.
Stripping HTML tags and spell checking has not caused any problems, using JSoup and Google Spell Check API.
I am able to pull down content from a URL and pass it into a byte[] and then ultimately a String so that it can be stripped and spell checked. I am running into a problem with character encoding.
For example when parsing http://www.testwareinc.com/...
Original Text: We’ve expanded our Mobile Web and Mobile App testing services.
... the page is using ISO-8859-1 according to meta tag...
ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.
... then trying using UTF-8...
UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.
Question
Is it possible that HTML of a webpage can include a mix of encodings? And how can that be detected?
It looks like the apostrophe is encoded as the byte 0x92, which in ISO-8859-1 is not a printable character: it falls in the C1 control range (U+0092, Private Use Two).
In cp1252 (windows-1252), however, 0x92 is the right single quotation mark. Browsers must have a fallback strategy according to the advertised charset, such as ISO-8859-1 -> CP1252, which is why the page shows an apostrophe.
So no mix of encoding here but as others said a broken document. But with a fallback heuristic that will sometimes help, sometimes not.
If you're curious enough, you may want to dive into FF or Chrome's source code to see exactly what they do in such a case.
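The three readings of that single byte can be checked directly. This is a small sketch with made-up class and method names; it assumes the runtime provides the windows-1252 charset, which common JDKs do although it is not in the guaranteed set:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Byte92Demo {
    // Decodes the given bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Formats the first character of a string as U+XXXX.
    static String codePoint(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    public static void main(String[] args) {
        byte[] apostrophe = { (byte) 0x92 };
        // ISO-8859-1: a C1 control character, nothing visible.
        System.out.println(codePoint(decode(apostrophe, StandardCharsets.ISO_8859_1)));
        // windows-1252: the right single quotation mark.
        System.out.println(codePoint(decode(apostrophe, Charset.forName("windows-1252"))));
        // UTF-8: a lone 0x92 is malformed, so it becomes the replacement character.
        System.out.println(codePoint(decode(apostrophe, StandardCharsets.UTF_8)));
    }
}
```

This matches the three results in the question: the invisible character in the ISO-8859-1 parse, the replacement character in the UTF-8 parse, and the apostrophe the browser shows.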
Having more than 1 encoding in a document isn't a mixed document, it is a broken document.
Unfortunately there are a lot of web pages that use an encoding that doesn't match the document definition, or that contain some data that is valid in the given encoding and some that is invalid.
There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.
I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.
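A very simple guessing heuristic, far cruder than a real detector, is to try a strict UTF-8 decode first and fall back to windows-1252 when the bytes are not valid UTF-8. This is only a sketch under that assumption; the class and method names are made up, and the choice of windows-1252 as the fallback is a guess that suits Western European pages, not a general rule:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FallbackDecodeDemo {
    // Tries strict UTF-8 first; if the bytes are not valid UTF-8,
    // decodes them as windows-1252 instead.
    static String decodeWithFallback(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        // "We", the byte 0x92, then "ve", as in the page from the question.
        byte[] page = { 'W', 'e', (byte) 0x92, 'v', 'e' };
        String text = decodeWithFallback(page);
        System.out.println(text.equals("We\u2019ve"));
    }
}
```

It works because text in a single-byte encoding that contains non-ASCII bytes is rarely valid UTF-8 by accident, but it cannot tell single-byte encodings apart from each other.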
This seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml, or any other method in that class, helps.
Edit: added this example as he was not able to get the code working:
import org.apache.commons.lang.StringEscapeUtils; // assumes Apache Commons Lang 2.x

public class EscapeDemo {
    public static void main(String[] args) {
        String asd = "’";
        System.out.println(StringEscapeUtils.escapeXml(asd)); //output - ’
        System.out.println(StringEscapeUtils.escapeHtml(asd)); //output - ’
    }
}