I am facing a problem with encoding.
For example, I have an XML message whose declared encoding is "UTF-8".
<message>
<product_name>apple</product_name>
<price>1.3</price>
<product_name>orange</product_name>
<price>1.2</price>
.......
</message>
Now, this message supports multiple languages:
Traditional Chinese (big5),
Simplified Chinese (gb),
English (utf-8)
And it only changes the encoding in specific fields.
For example (Traditional Chinese):
<product_name>蘋果</product_name>
<price>1.3</price>
<product_name>橙</product_name>
<price>1.2</price>
.......
Only "蘋果" and "橙" are using big5, "<product_name>" and "</product_name>" are still using utf-8.
<price>1.3</price> and <price>1.2</price> are using utf-8.
How do I know which word is using a different encoding?
It looks like whoever is providing the XML is providing incorrect XML. They should be using a consistent encoding.
http://sourceforge.net/projects/jchardet/files/ is a pretty good heuristic charset detector.
It's a port of the one used in Firefox to detect the encoding of pages that are missing a charset in content-type or a BOM.
You could use that to try and figure out the encoding for substrings in a malformed XML file if you can't get the provider to fix their output.
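For example, detecting the charset of one field's raw bytes could look roughly like the following. This is a rough sketch from memory of the jchardet nsDetector API (Init, DoIt, DataEnd, getProbableCharsets); check it against the library's own HtmlCharsetDetector sample before relying on it, and note that the sample bytes here are just for demonstration.
import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

public class DetectCharset {
    public static void main(String[] args) throws Exception {
        // Raw bytes of one suspect field (a Big5-encoded sample for demonstration).
        byte[] fieldBytes = "蘋果".getBytes("Big5");
        nsDetector det = new nsDetector();
        det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                System.out.println("Detected: " + charset);
            }
        });
        det.DoIt(fieldBytes, fieldBytes.length, false);
        det.DataEnd();
        // If no single charset was reported, list the remaining candidates.
        for (String cs : det.getProbableCharsets()) {
            System.out.println("Probable: " + cs);
        }
    }
}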
You should use only one encoding in an XML file. The Big5 characters all have counterparts in the UTF-8 encoding.
Because I cannot get the provider to fix the output, I have to handle it myself, and I cannot use an external library in this project.
I can only solve it like this,
String str = new String(big5String.getBytes("UTF-8"));
before displaying the message.
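If you are limited to the JDK, one possible approach is to work with the raw bytes of each field rather than with an already-decoded String: attempt a strict UTF-8 decode first and fall back to Big5 when it fails. A rough sketch only (decodeField is a hypothetical helper; extracting the raw bytes of each field is left out):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class FieldDecoder {
    // Decode one field's raw bytes: strict UTF-8 first, Big5 as a fallback.
    static String decodeField(byte[] raw) {
        try {
            // REPORT makes the decoder throw on bytes that are not valid UTF-8.
            return Charset.forName("UTF-8").newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the provider wrote this field in Big5.
            return new String(raw, Charset.forName("Big5"));
        }
    }
}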
Related
I have a Java program which processes XML files. When transforming XML into another XML file based on a certain schema (xsd/xsl), it throws the following error.
This error is thrown only for one XML file, which has an XML tag like this:
<abc>xxx yyyy “ggggg vvvv” uuuu</abc>
But after removing or re-typing the two quotes, it doesn't throw the error.
Could anybody please assist me in resolving this issue?
java.io.CharConversionException: Character larger than 4 bytes are not supported: byte 0x93 implies a length of more than 4 bytes
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
<?xml version= “1.0’ encoding =“UTF-8” standalone =“yes “?><xyz xmlns=“http://pqr.yy”><Header><abc> aaa “cccc” aaaaa vvv</abc></Header></xyz>.
As others have reported in the comments, it has failed because the typographical quotation marks are encoded in Windows-1252, not in UTF-8, so the parser hasn't managed to decode them.
The encoding declared in the XML declaration must match the actual encoding used for the characters.
To find out how this error arose, and to prevent it happening again, we would need to know where this (wannabe) XML file came from, and how it was created.
My guess would be that someone used a "smart" editor; Microsoft editors in particular are notorious for changing what you type to what Microsoft think you wanted to type. If you're editing XML by hand it's best to use an XML-aware editor.
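If you can't get the file fixed at the source, one possible workaround (assuming the whole file is actually Windows-1252, which is typical when it was produced by a Microsoft editor; the file names here are placeholders) is to re-decode it as windows-1252 and write it back out as real UTF-8 before transforming it:
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FixEncoding {
    public static void main(String[] args) throws Exception {
        // Decode the bytes as Windows-1252 (where 0x93/0x94 are the curly quotes),
        // then write the text back out as genuine UTF-8 so it matches the
        // encoding declared in the XML declaration.
        String text = new String(Files.readAllBytes(Paths.get("broken.xml")), "windows-1252");
        Files.write(Paths.get("fixed.xml"), text.getBytes(StandardCharsets.UTF_8));
    }
}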
I have a search form in JSF that is implemented using a RichFaces 4 autocomplete component and the following JSF 2 page and Java bean. I use Tomcat 6 & 7 to run the application.
...
<h:commandButton value="#{msg.search}" styleClass="search-btn" action="#{autoCompletBean.doSearch}" />
...
In the AutoCompleteBean
public String doSearch() {
//some logic here
return "/path/to/page/with/multiple_results?query=" + searchQuery + "&faces-redirect=true";
}
This works well as long as everything within the "searchQuery" String is in Latin-1; it does not work if it is outside of Latin-1.
For instance a search for "bodø" will be automatically encoded as "bod%F8". However a search for "Kra Ðong" will not work since it is unable to encode "Ð".
I have now tried several different approaches to solve this, but none of them works.
I have tried encoding the searchQuery myself using URLEncoder, but this only leads to double encoding since % is encoded as %25.
I have tried using java.net.URI to get the encoding, but it gives the same result as URLEncoder.
I have tried turning on UTF-8 in Tomcat using URIEncoding="UTF-8" in the Connector, but this only worsens the problem since non-ASCII characters then do not work at all.
So to my questions:
Can I change the way JSF 2 encodes the GET parameters?
If I cannot change the way JSF 2 encodes the GET parameters, can I turn off the encoding and do it manually?
Am I doing something strange here? This seems like something that should be supported out of the box, but I cannot find any others with the same problem.
I think you've hit a corner case bug in JSF. The query string is URL-encoded by ExternalContext#encodeRedirectURL(), which uses the response character encoding as obtained by ExternalContext#getResponseCharacterEncoding(). However, while JSF by default uses UTF-8 as the response character encoding, this is only set if the view is actually to be rendered, not when the response is to be redirected, so the response character encoding still returns the platform default of ISO-8859-1, which causes your characters to be URL-encoded using this wrong encoding.
I've reported this as issue 2440. In the meanwhile your best bet is to explicitly set the response character encoding yourself beforehand.
FacesContext.getCurrentInstance().getExternalContext().setResponseCharacterEncoding("UTF-8");
Note that this still requires that the container itself uses the same character encoding to decode the request URL, so you certainly need to set URIEncoding="UTF-8" in Tomcat's configuration. This won't mess up the characters anymore, as they will really be UTF-8 now.
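Applied to the action method from the question, the workaround could look roughly like this (a sketch based on the bean shown earlier):
import javax.faces.context.FacesContext;

public String doSearch() {
    //some logic here
    // Force UTF-8 before JSF URL-encodes the query string for the redirect.
    FacesContext.getCurrentInstance().getExternalContext()
            .setResponseCharacterEncoding("UTF-8");
    return "/path/to/page/with/multiple_results?query=" + searchQuery
            + "&faces-redirect=true";
}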
The only character encoding accepted for HTTP URLs and headers is US-ASCII, so you need to URL-encode these characters to send them back to the application. The simplest way to do this in Java would be:
public String doSearch() {
//some logic here
String encodedSearchQuery = java.net.URLEncoder.encode( searchQuery, "UTF-8" );
return "/path/to/page/with/multiple_results?query=" + encodedSearchQuery + "&faces-redirect=true";
}
And then it should work for any character that you use.
I have a problem with Barcode4J and generating a DataMatrix barcode with ISO-8859-2 characters in the message.
Below is an example use of barcode4j (version 2.1.0) from the command line. As you can see, when I use the message "żaba" I get the error "Message contains characters outside ISO-8859-1 encoding." Does the DataMatrix specification support only ISO-8859-1, or is something missing in Barcode4J?
java -cp build/barcode4j.jar:lib/avalon-framework-4.2.0.jar:lib/commons-cli-1.0.jar org.krysalis.barcode4j.cli.Main -s datamatrix "żaba"
Exception in thread "main" java.lang.IllegalArgumentException: Message contains characters outside ISO-8859-1 encoding.
at org.krysalis.barcode4j.impl.datamatrix.DataMatrixHighLevelEncoder$EncoderContext.<init>(DataMatrixHighLevelEncoder.java:199)
at org.krysalis.barcode4j.impl.datamatrix.DataMatrixHighLevelEncoder.createEncoderContext(DataMatrixHighLevelEncoder.java:171)
at org.krysalis.barcode4j.impl.datamatrix.DataMatrixHighLevelEncoder.encodeHighLevel(DataMatrixHighLevelEncoder.java:119)
at org.krysalis.barcode4j.impl.datamatrix.DataMatrixLogicImpl.generateBarcodeLogic(DataMatrixLogicImpl.java:50)
at org.krysalis.barcode4j.impl.datamatrix.DataMatrixBean.generateBarcode(DataMatrixBean.java:128)
at org.krysalis.barcode4j.impl.ConfigurableBarcodeGenerator.generateBarcode(ConfigurableBarcodeGenerator.java:174)
at org.krysalis.barcode4j.cli.Main.handleCommandLine(Main.java:164)
at org.krysalis.barcode4j.cli.Main.main(Main.java:86)
As is described here, Barcode4J currently supports only the default character set defined by the DataMatrix specification (ISO-8859-1). Support for ECI hasn't been implemented for DataMatrix yet. You can, however, encode binary messages by encoding a byte stream as an RFC 2397 data URL. That byte stream could be a string encoded using UTF-8. The drawback: the reader might not be able to interpret the data correctly.
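For instance, building such a data URL from the UTF-8 bytes of the message could look like this. This is only a sketch; how exactly the resulting URL has to be handed to Barcode4J and decoded by the reader is an assumption you would need to verify against the Barcode4J documentation.
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class DataUrlMessage {
    public static void main(String[] args) {
        // Encode the UTF-8 bytes of the message as an RFC 2397 data URL.
        byte[] utf8 = "żaba".getBytes(StandardCharsets.UTF_8);
        String dataUrl = "data:;base64," + Base64.getEncoder().encodeToString(utf8);
        System.out.println(dataUrl);
    }
}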
I have a Java servlet which gets RSS feeds and converts them to JSON. It works great on Windows, but it fails on CentOS.
The RSS feed contains Arabic and it shows unintelligible characters on CentOS. I am using these lines to encode the RSS feed:
byte[] utf8Bytes = Xml.getBytes("Cp1256");
// byte[] defaultBytes = Xml.getBytes();
String roundTrip = new String(utf8Bytes, "UTF-8");
I tried it on Glassfish and Tomcat. Both have the same problem; it works on Windows, but fails on CentOS. How is this caused and how can I solve it?
byte[] utf8Bytes = Xml.getBytes("Cp1256");
String roundTrip = new String(utf8Bytes, "UTF-8");
This is an attempt to correct a badly-decoded string. At some point prior to this operation you have read in Xml using the default encoding, which on your Windows box is code page 1256 (Windows Arabic). Here you are encoding that string back to code page 1256 to retrieve its original bytes, then decoding it properly as the encoding you actually wanted, UTF-8.
On your Linux server, it fails, because the default encoding is something other than Cp1256; it would also fail on any Windows server not installed in an Arabic locale.
The commented-out line that uses the default encoding instead of explicitly Cp1256 is more likely to work on a Linux server. However, the real fix is to find where Xml is being read, and fix that operation to use the correct encoding(*) instead of the default. Allowing the default encoding to be used is almost always a mistake, as it makes applications dependent on configuration that varies between servers.
(*: for this feed, that's UTF-8, which is the most common encoding, but it may differ for others. Finding out the right encoding for a feed depends on the Content-Type header returned for the resource and the <?xml encoding declaration. By far the best way to cope with this is to fetch and parse the resource using a proper XML library that knows about this, for example with DocumentBuilder.parse(uri).)
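A minimal sketch of that last suggestion (readFeed is a hypothetical helper; error handling is omitted):
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class FeedReader {
    static Document readFeed(String feedUri) throws Exception {
        // Let the XML parser detect the encoding from the BOM / XML declaration
        // instead of decoding the bytes manually with a guessed charset.
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(feedUri);
    }
}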
There are many places where the wrong encoding can be used. Here is a complete list: http://wiki.apache.org/tomcat/FAQ/CharacterEncoding#Q8
We are working on a project for school. The project must be tri-lingual (Dutch, English and French), so the answer "change to English" will not do.
All our classes and resource files are encoded in UTF-8, and all non-standard English characters are displayed correctly in the classes themselves.
The problem is that once we try to display our text, all non-standard English characters are distorted.
We hear a lot that this is due to an encoding issue, but I sincerely doubt that, since our whole project is encoded in UTF-8.
Here is an extract from the French resource bundle:
VIDEOSETTINGS = Réglages du Vidéo
SOUNDSETTINGS = Réglages du son
KEYBINDSETTINGS = Keybind Paramètres
LANGUAGESETTINGS = Paramètres de langue
DIFFICULTYSETTINGS = Paramètres de Difficulté
EXITSETTINGS = Sortie les paramètres
And this results in the following displayed strings:
[Screenshot: display result for the provided resource bundle extract]
I would be most grateful for a solution to this problem.
EDIT
For extra info: we are building a desktop app using Swing.
This is due to an encoding issue.
You are using the wrong decoder (probably ISO-8859-1) on UTF-8 encoded bytes.
Are these strings stored in a file? How are you loading the file? Via the Properties class? The Properties class always applies ISO-8859-1 decoding when loading the plain-text format from an InputStream. If you are using Properties, use the load(Reader) overload, switch to the XML format, or re-write the file with the matching encoding. Also, if you are using ResourceBundle.getBundle() to load a properties file, you must use ISO-8859-1 encoding to write that file, escaping any non-Latin characters.
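For example, the load(Reader) route could look like this (a sketch; loadUtf8 and the resource path are made up for illustration):
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class Utf8Properties {
    // Load a UTF-8 encoded .properties file through a Reader so that
    // Properties does not apply its default ISO-8859-1 decoding.
    static Properties loadUtf8(String resourcePath) throws Exception {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                Utf8Properties.class.getResourceAsStream(resourcePath),
                StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}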
Since this is an encoding issue, it would be most helpful if you posted the code you have used to select the character encoding.
You didn't show the code where you read the resource files. But if you use PropertyResourceBundle with an InputStream in the constructor, the InputStream must be encoded in ISO-8859-1. In that case, characters that cannot be represented in ISO-8859-1 must be represented by Unicode escapes.
You can use native2ascii or AnyEdit as tools to convert properties files to Unicode escapes;
see Use cyrillic .properties file in eclipse project.