In Java, I have a string obtained from an API, which looks like:
Hola, &#233;sto es una frase con acentos.
And I want to have:
Hola, ésto es una frase con acentos.
Not only for this example, I need it for all UTF-8 encoded characters.
I've been looking for this for an hour but I haven't found a solution.
This isn't encoding, it's an HTML numeric character reference.
The easiest way to deal with it is to add the Apache Commons Lang library to your project and call StringEscapeUtils.unescapeHtml4 (in newer releases the same class is provided by Apache Commons Text).
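If adding a library is not an option, numeric character references can also be decoded by hand. The following is only a minimal sketch, not the Commons Lang implementation: the class name NumericRefDecoder is made up for illustration, and it handles only numeric references (decimal and hexadecimal), not named entities such as &eacute;.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes numeric character references only.
class NumericRefDecoder {
    // Matches &#233; (decimal) and &#xE9; (hexadecimal) style references.
    private static final Pattern REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decode(String text) {
        Matcher m = REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave references to non-existent code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling NumericRefDecoder.decode on the example sentence should give back the text with the accented character in place; anything that is not a numeric reference passes through unchanged.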
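Well, if your text is encoded with SGML entities, a possible approach is to use an XML parser to decode it (though it might not be so smart):
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Wraps the text in a dummy <x> element and lets the XML parser resolve the references.
// Only works if src is well-formed XML content (no bare & or <, no HTML-only named entities).
public static String decodeSgml(String src)
        throws org.xml.sax.SAXException,
               javax.xml.parsers.ParserConfigurationException,
               java.io.IOException
{
    InputSource inputSource = new InputSource(new StringReader("<x>" + src + "</x>"));
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = factory.newDocumentBuilder();
    org.w3c.dom.Document doc = docBuilder.parse(inputSource);
    // The text content of the root element is the decoded string.
    return doc.getDocumentElement().getTextContent();
}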
(If the number of exceptions thrown by the method looks excessive, you could maybe re-throw some of them as ServiceConfigurationErrors, or store some of the variables as static members).
I have a string which was encoded in UTF-16. When parsing it using javax.xml.parsers.DocumentBuilder, I got an error like this:
Character reference "�" is an invalid XML character
Here is the code I used to parse the XML:
InputSource inputSource = new InputSource();
inputSource.setCharacterStream(new StringReader(xmlString));
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = factory.newDocumentBuilder();
org.w3c.dom.Document document = parser.parse(inputSource);
My question is: how can I replace the invalid characters with a space?
You just need to use String.replaceAll and pass the pattern of invalid characters.
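As a rough sketch of what that could look like: the class name XmlCharCleaner is made up, and the character ranges are the ones allowed by the XML 1.0 Char production. Since the error in the question mentions a character reference, the sketch also blanks out numeric references that point at an invalid character, which a plain replaceAll on raw characters would miss.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces characters that XML 1.0 does not allow with a space.
class XmlCharCleaner {
    // Any raw character outside the XML 1.0 Char ranges.
    private static final Pattern INVALID_RAW = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");
    // Numeric character references, hexadecimal or decimal.
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String clean(String xml) {
        // Step 1: raw invalid characters become a space.
        String raw = INVALID_RAW.matcher(xml).replaceAll(" ");
        // Step 2: references to invalid characters become a space as well.
        Matcher m = CHAR_REF.matcher(raw);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp;
            try {
                cp = m.group(1) != null
                        ? Integer.parseInt(m.group(1), 16)
                        : Integer.parseInt(m.group(2));
            } catch (NumberFormatException e) {
                cp = -1; // too large to be a code point at all
            }
            String replacement = isValidXmlChar(cp) ? m.group() : " ";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The cleaned string can then be handed to the StringReader as before. Valid references such as &#233; are left alone.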
You are trying to parse an invalid XML entity, and this is what raises the exception. It seems you need not worry about UTF-16 in your situation.
Find some explanation and example here.
As an example, it is not possible to use a bare & character in valid XML; we need to use &amp; instead. Here &amp; is the XML entity.
The example above should be enough to show what an XML entity is.
As I understand it, some XML entities are not valid. But no worry again: it is possible to declare and add new XML entities. Take a look at the above article for more detail.
EDIT: Assuming there is an & character making the XML invalid.
StringEscapeUtils()
escapeXml
public static void escapeXml(java.io.Writer writer,
java.lang.String str)
throws java.io.IOException
Escapes the characters in a String using XML entities.
For example: "bread" & "butter" => &quot;bread&quot; &amp; &quot;butter&quot;.
Supports only the five basic XML entities (gt, lt, quot, amp, apos).
Does not support DTDs or external entities.
Note that unicode characters greater than 0x7f are currently escaped to their
numerical \\u equivalent. This may change in future releases.
Parameters:
writer - the writer receiving the unescaped string, not null
str - the String to escape, may be null
Throws:
java.lang.IllegalArgumentException - if the writer is null
java.io.IOException - if there is a problem writing
See Also:
unescapeXml(java.lang.String)
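To make the five basic entities concrete, here is a small stand-in written against the JDK only. It is not the Commons Lang code: the class name BasicXmlEscaper is hypothetical, and unlike the documented method it leaves characters above 0x7f untouched.

```java
// Hypothetical stand-in that escapes only the five basic XML entities.
class BasicXmlEscaper {
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }
}
```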
I'm converting a string received in a web service to a Document (DOM) xml, like this:
Document file= null;
String xmlFile= "blablabla"; //latin1 encoding
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
this.file = builder.parse(new InputSource(new StringReader(xmlFile)));
But the string is encoded in ISO-8859-1 (Latin-1), and when I read a node of this Document I get some errors. How can I correctly create a DOM object with ISO-8859-1 encoding? Or how can I read a node with Latin-1 encoding into a string?
try this:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));
Foreword
Strings have no encoding, as they represent a sequence of characters (which are abstract entities defined in the Unicode standard).
Byte sequences have an encoding and may be interpreted as a sequence of characters (provided that you tell Java how to interpret them).
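A small illustration of that distinction, as a sketch only (the class and method names are invented for the example): the same String turns into different byte sequences depending on the encoding, and decoding bytes with the wrong encoding damages the text.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: one String, several byte representations.
class StringVsBytesDemo {
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Encodes as ISO-8859-1, then (wrongly) decodes the bytes as UTF-8.
    static String latin1BytesReadAsUtf8(String s) {
        byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.UTF_8);
    }
}
```

For a string such as "caf\u00e9", the last character takes two bytes in UTF-8 and one in ISO-8859-1. Reading the ISO-8859-1 bytes as UTF-8 replaces it with the replacement character U+FFFD, at which point the original byte cannot be recovered from the String.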
Your problem
In your problem, your data is stored into a String. Hence it has already been interpreted as a sequence of characters. Apparently the interpretation was incorrect.
Depending on your problem and the way you know the encoding of your data, there are 2 options:
Solution 1 (may be the best):
DO NOT INTERPRET the data you receive and keep it as a byte sequence (Stream or byte[] or ByteArray). Then pass this byte sequence directly to your DOM parser (it will correctly decode the XML file, whatever its encoding, provided that the markup is correct).
Solution 2 (may be the only possible depending on the way you get the data):
Re-encode the String as a byte array, as mentioned in @ThOrndike's answer:
this.file = builder.parse(new ByteArrayInputStream(xmlFile.getBytes("ISO-8859-1")));
This will only work if you are sure the String has been correctly interpreted in the first place.
Apparently, that is not the case here, and it seems that the library that gives you the String has already interpreted it as a UTF-8 byte sequence (replacing all erroneous bytes with '?', which is the behavior of the UTF-8 readers). In that case, you cannot do anything, as the original bytes have been lost.
Your only hope is solution 1, or find a way to force the library that gives you the String to interpret it correctly.
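Solution 1 could look roughly like the sketch below. The class name, the method names and the sample payload are all made up for illustration; the point is that the parser receives raw bytes and picks the encoding from the XML declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical demo: hand the undecoded bytes straight to the DOM parser.
class RawBytesXmlDemo {
    static String rootText(byte[] rawXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // No String, no Reader: the parser decodes the bytes itself.
            Document doc = builder.parse(new ByteArrayInputStream(rawXml));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Stand-in for what the web service would deliver as bytes.
    static byte[] sampleLatin1Payload() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><msg>caf\u00e9</msg>";
        return xml.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```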
I am using HtmlCleaner library in order to parse/convert HTML files in java.
It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.
Is there any property which I can set in HtmlCleaner for handling this or any other solution? Here's the code I'm using to invoke it:
CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);
HtmlCleaner uses the default character set read from the JVM unless one is specified. On Windows this will typically be Cp1252, not UTF-8, which is probably where it's going wrong.
You can either
specify -Dfile.encoding=UTF-8 on your JVM start line
use the HtmlCleaner.clean() overload that accepts a character set
TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
(if you've got Google Guava in the project you can use Charsets.UTF_8 for the constant)
use the HtmlCleaner.clean() overload that accepts an InputStreamReader which you've already constructed with the correct character set.
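For the last option, the reader could be built along these lines. This sketch uses only the JDK (the class and method names are invented); the resulting Reader is what you would then pass to the clean() overload mentioned above. The roundTrip method exists only to show that the characters survive.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo: open a file with an explicit charset instead of the JVM default.
class ExplicitCharsetReaderDemo {
    static Reader openUtf8(Path file) throws IOException {
        return new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8);
    }

    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    // Writes html to a temporary UTF-8 file and reads it back through openUtf8.
    static String roundTrip(String html) {
        try {
            Path tmp = Files.createTempFile("example", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return readAll(openUtf8(tmp));
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```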
You can change UTF-8 to UTF-16.
It will support maximum number of characters.
We are working on a project for school. The project must be tri-lingual (Dutch, English and French), so the answer "change to English" will not do.
All our classes and resource files are encoded in UTF-8, and all non-standard English characters are displayed correctly in the classes themselves.
The problem is that once we try to display our text, all non-standard English characters are distorted.
We hear a lot that this is due to an encoding issue, but I sincerely doubt that, since our whole project is encoded in UTF-8.
Here is an extract from the French resource bundle:
VIDEOSETTINGS = Réglages du Vidéo
SOUNDSETTINGS = Réglages du son
KEYBINDSETTINGS = Keybind Paramètres
LANGUAGESETTINGS = Paramètres de langue
DIFFICULTYSETTINGS = Paramètres de Difficulté
EXITSETTINGS = Sortie les paramètres
And this results in the following displayed strings:
(Image: display result for the provided resource bundle extract)
I would be most grateful for a solution to this problem.
EDIT
For extra info: we are building a desktop app using Swing.
This is due to an encoding issue.
You are using the wrong decoder (probably ISO-8859-1) on UTF-8 encoded bytes.
Are these strings stored in a file? How are you loading the file? Via the Properties class? The Properties class always applies ISO-8859-1 decoding when loading the plain text format from an InputStream. If you are using Properties, use the load(Reader) overload, switch to the XML format, or re-write the file with the matching encoding. Also, if you are using ResourceBundle.getBundle() to load a properties file, you must use ISO-8859-1 encoding to write that file, escaping any non-Latin characters (Java 9 and later read such bundle files as UTF-8 by default).
Since this is an encoding issue, it would be most helpful if you posted the code you have used to select the character encoding.
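To see the difference between the two load overloads, here is a small sketch. The class and method names are invented, and the bytes stand in for a .properties file saved as UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo: the same UTF-8 bytes loaded through both overloads.
class Utf8PropertiesDemo {
    // load(Reader): we choose the decoder, so UTF-8 content is read correctly.
    static Properties loadWithReader(byte[] utf8Bytes) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // load(InputStream): always decodes as ISO-8859-1, which garbles UTF-8 content.
    static Properties loadWithStream(byte[] utf8Bytes) {
        Properties props = new Properties();
        try (InputStream in = new ByteArrayInputStream(utf8Bytes)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }
}
```

With the VIDEOSETTINGS line from the question, the Reader variant returns the value intact, while the InputStream variant turns each accented letter into two unrelated characters.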
You didn't show the code where you read the resource files. But if you use PropertyResourceBundle with an InputStream in the constructor, the InputStream must be encoded in ISO-8859-1. In that case, characters that cannot be represented in ISO-8859-1 encoding must be represented by Unicode escapes.
You can use native2ascii or AnyEdit as tools to convert Properties to unicode escapes,
see Use cyrillic .properties file in eclipse project
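For a feel of what such a conversion produces, here is a tiny sketch of the idea. It is not the native2ascii tool itself, and the class name UnicodeEscaper is made up: every character outside ASCII is replaced by a backslash, the letter u and four hex digits.

```java
// Hypothetical helper: rewrites non-ASCII characters as Unicode escapes.
class UnicodeEscaper {
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                // Works per UTF-16 unit, so supplementary characters become two escapes.
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }
}
```

Applied to the value of VIDEOSETTINGS, each accented e becomes the six-character sequence for U+00E9, which the ISO-8859-1 based loaders turn back into the original character.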
I am developing an application that reads in an XML document and passes the contents with JNI to a C++-DLL which validates it.
For this task I am using JDom and JUniversalChardet to parse the XML file in the correct encoding. My C++ code accepts a const char* for the contents of the XML file and needs it in the encoding "ISO-8859-15"; otherwise it will throw an exception because of malformed characters.
My first approach was to use the shipped OutputFormatter of JDom and tell it to use Charset.forName("ISO-8859-15") while formatting the JDom document to a String. After that the header part of the XML in this String says:
<?xml version="1.0" encoding="ISO-8859-15"?>
The problem is that it is still stored in a Java String and is therefore UTF-16, if I got that right.
My native method looks something like this:
public native String jniApiCall(String xmlFileContents);
So I pass the above mentioned String from the OutputFormatter of JDom into this JNI-Method. Still everything UTF-16, right?
In the JNI-C++-Method I access the xmlFileContents String with
const string xmlDataString = env->GetStringUTFChars(xmlFileContents, NULL);
So, now I got my above mentioned String in UTF-16 or UTF-8? And my next question would be: how can I change the character encoding of the std::string xmlDataString to ISO-8859-15? Or is the way I am doing this not exactly elegant? Or is there a way to do the character encoding completely in Java?
Thanks for your help!
Marco
You can always convert any String to byte array with needed character encoding using byte[] getBytes(Charset charset) method (or even byte[] getBytes(String charsetName)).
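In Java you can use myString.getBytes("ISO-8859-15"); to get the byte array of the String in the character encoding given as the parameter (in this case ISO-8859-15).
And then use that byte array in C++ to build the std::string. A jbyteArray is not itself a pointer to the bytes, so here myByteArray is assumed to already point at them (for example, obtained with GetByteArrayElements), and myByteArrayLength is a hypothetical variable holding the length, since the data is not NUL-terminated:
std::string myNewstring(reinterpret_cast<char const*>(myByteArray), myByteArrayLength);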
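On the Java side, the conversion could be wrapped like this. It is only a sketch: the class name Latin9Encoder is invented, and it assumes the JVM ships the ISO-8859-15 charset. The extra canEncode check is there because getBytes silently substitutes '?' for characters the target encoding cannot represent.

```java
import java.nio.charset.Charset;

// Hypothetical helper: turns a String into ISO-8859-15 bytes for the native call.
class Latin9Encoder {
    // Assumes ISO-8859-15 is available in this JVM.
    private static final Charset LATIN9 = Charset.forName("ISO-8859-15");

    static byte[] encode(String xml) {
        // Fail loudly instead of letting getBytes replace characters with '?'.
        if (!LATIN9.newEncoder().canEncode(xml)) {
            throw new IllegalArgumentException(
                    "Text contains characters outside ISO-8859-15");
        }
        return xml.getBytes(LATIN9);
    }
}
```

The native method would then take a byte[] instead of a String, so no further conversion is needed in the C++ code.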