I am using the HtmlCleaner library to parse/convert HTML files in Java.
It seems that it is not able to handle Spanish characters like 'ÁáÉéÍíÑñÓóÚúÜü'.
Is there any property I can set in HtmlCleaner to handle this, or any other solution? Here's the code I'm using to invoke it:
CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
java.io.File file = new java.io.File("C:\\example.html");
TagNode tagNode = new HtmlCleaner(props).clean(file);
HtmlCleaner uses the JVM's default character set unless one is specified. On Windows this will be Cp1252 (windows-1252), not UTF-8, which is probably where it's going wrong.
You can either
specify -Dfile.encoding=UTF-8 on your JVM start line
use the HtmlCleaner.clean() overload that accepts a character set
TagNode tagNode = new HtmlCleaner(props).clean(file, "UTF-8");
(if you've got Google Guava in the project you can use Charsets.UTF_8 for the constant)
use the HtmlCleaner.clean() overload that accepts a Reader (such as an InputStreamReader) which you've already constructed with the correct character set; a sketch follows below.
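A minimal sketch of that last option, assuming (as stated above) that HtmlCleaner exposes a clean() overload taking a Reader; error handling is omitted for brevity:
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

CleanerProperties props = new CleanerProperties();
props.setRecognizeUnicodeChars(true);
// Build the reader with the encoding the HTML file was actually saved in.
try (Reader reader = new InputStreamReader(
        new FileInputStream("C:\\example.html"), "UTF-8")) {
    TagNode tagNode = new HtmlCleaner(props).clean(reader);
    // ... work with tagNode ...
}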
You can also pass "UTF-16" instead of "UTF-8" if that is what the file is actually saved in. Note that both encodings cover the full Unicode character set; what matters is that the name you pass matches the file's real encoding.
Related
I have a Java application wherein a string is read from a file on UNIX. The string is then passed to another application by POSTing a URL. However, it runs into problems when the string contains special characters such as:
~
^
[
]
\
{
}
|
I am constructing the URL using a StringBuilder:
new StringBuilder().append("message=").append(message).toString()
Is there a standard on how these characters should be encoded from UNIX to Java? I believe this is the issue here.
Those are characters with special meaning in regular expressions. So somewhere you are probably placing the string where a regex (or format string) is expected; see the sketch after this list. Likely candidates:
replaceFirst
replaceAll (where replace would have done a literal substitution)
split
format
printf
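For instance, here is a sketch of the replaceAll pitfall and two literal-safe alternatives (the sample string is made up):
import java.util.regex.Pattern;

String s = "path[0]|path[1]";
// s.replaceAll("[", "(") would throw PatternSyntaxException: "[" opens a character class
String literal = s.replace("[", "(");                    // literal substitution
String quoted = s.replaceAll(Pattern.quote("["), "(");   // regex, but safely quoted
System.out.println(literal); // path(0]|path(1]
System.out.println(quoted);  // path(0]|path(1]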
Encoding cannot be the problem here: these are all plain ASCII characters. However, be aware that FileReader is an old utility class that reads a file using the default platform encoding.
When the file is in a known encoding, say UTF-8, it is better to do:
Path path = file.toPath();
try (BufferedReader in = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...
}
To properly read characters from a file in Java you need to specify the character set. E.g. like this (error handling left out for brevity):
String charset = "UTF-8"; // replace with what you are really using in your Unix system
Reader reader = new InputStreamReader(new FileInputStream(file), charset);
// use the reader...
A URL requires that certain characters be encoded. This has nothing to do with Unix or Java; it is part of the specification for URLs.
In Java, you can encode arbitrary text to make it suitable for URLs via the URLEncoder.encode method:
new StringBuilder()
.append("message=")
.append(URLEncoder.encode(message, "UTF-8"))
.toString()
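For instance, here is what java.net.URLEncoder produces for the characters listed in the question (the message value is made up):
import java.net.URLEncoder;

String message = "~^[]\\{}|";
String encoded = URLEncoder.encode(message, "UTF-8");
// encoded is "%7E%5E%5B%5D%5C%7B%7D%7C"
System.out.println("message=" + encoded);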
I want to read Hindi text from a lang.properties (java.util.Properties) file.
I am using the Eclipse IDE.
First of all, how can I save (or write) Hindi text in a .properties file?
Secondly, how do I read the string from my Java class?
lang.properties
hindiText=साहिलसाहिल
Java Class
Properties prop = new Properties();
prop.load(MyClass.class.getClassLoader().getResourceAsStream("lang.properties"));
String hindi=prop.getProperty("hindiText");
It's not working.
As documented, Properties.load(InputStream) will always use the ISO-8859-1 encoding, and that encoding doesn't handle the characters you're interested in.
Options:
Wrap your stream in an InputStreamReader and specify the encoding explicitly (see the sketch after this list)
Use Unicode escapes (e.g. \u1234) in the file for any characters not in ISO-8859-1 (and make sure the file is saved as ISO-8859-1)
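A minimal sketch of the first option, assuming the .properties file is saved as UTF-8 (Properties.load(Reader) has existed since Java 6; MyClass is the class from the question):
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Properties;

Properties prop = new Properties();
// Read the stream as UTF-8 instead of Properties' ISO-8859-1 default.
try (Reader reader = new InputStreamReader(
        MyClass.class.getClassLoader().getResourceAsStream("lang.properties"),
        "UTF-8")) {
    prop.load(reader);
}
String hindi = prop.getProperty("hindiText");
For the second option, JDKs before Java 9 ship a native2ascii tool that converts a natively-encoded file into the escaped form.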
I am trying to serialize DOM documents with supplementary Unicode characters such as U+1D49C (𝒜, MATHEMATICAL SCRIPT CAPITAL A). Creating a node with such a character is not a problem (I just set the node value to the UTF-16 equivalent, "\uD835\uDC9C"). When serializing, however, Xalan and XSLTC (with a Transformer) and Xerces (with LSSerializer) all create invalid character entities like "&#55349;&#56476;" (the two surrogate code points, which are not legal XML characters) instead of "&#119964;". I tried the "normalize-characters" parameter for LSSerializer, but it is not supported. Only Saxon gets it right, writing the character without a character entity when the encoding is a Unicode encoding.
I cannot use Saxon in practice (among other reasons, I use Java applets and do not want to load another jar), so I am looking for a solution with the default JDK libraries. Is it possible to get valid XML documents serialized from a DOM document with supplementary Unicode characters?
[edit] I found someone else who encountered this problem: http://www.dragishak.com/?p=131
[edit2] Actually, it seems to work with LSSerializer when I don't have Xerces on the classpath (the class used is then com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl). It does not work with a Transformer and com.sun.org.apache.xalan.internal.xsltc.trax.TransformerFactoryImpl.
Since I didn't see any answer coming, and other people seem to have the same problem, I looked into it further...
To find the origin of the bug, I used the serializer source code from Xalan 2.7.1, which is also used in Xerces.
org.apache.xml.serializer.dom3.LSSerializerImpl uses org.apache.xml.serializer.ToXMLStream, which extends org.apache.xml.serializer.ToStream.
ToStream.characters(final char chars[], final int start, final int length) is where character data is written out, and it does not handle supplementary Unicode characters properly (note: org.apache.xml.serializer.ToTextStream (which can be used with a Transformer) does a better job in its characters method, but it only handles plain text and ignores all markup; one would think that XML files are text, but for some reason ToXMLStream does not extend ToTextStream).
org.apache.xalan.transformer.TransformerIdentityImpl is also using org.apache.xml.serializer.ToXMLStream (which is returned by org.apache.xml.serializer.SerializerFactory.getSerializer(Properties format)), so it suffers from the same bug.
ToStream uses org.apache.xml.serializer.CharInfo to check whether a character should be replaced by a String, so the bug could also be fixed there instead of directly in ToStream. CharInfo uses a property file, org.apache.xml.serializer.XMLEntities.properties, with a list of character entities, so changing this file could also be a way to fix the bug, although so far it is designed only for the special XML characters (quot, amp, lt, gt). The only way to make ToXMLStream use a different property file than the one in the package would be to put an org.apache.xml.serializer.XMLEntities.properties file earlier on the classpath, which would not be very clean...
With the default JDK (1.6 and 1.7), TransformerFactory returns a com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl, which uses com.sun.org.apache.xml.internal.serializer.ToXMLStream. In com.sun.org.apache.xml.internal.serializer.ToStream, characters() is sometimes calling processDirty(), which calls accumDefaultEscape(), which could handle unicode characters better, but in practice it does not seem to work (maybe processDirty is not called for unicode characters)...
com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl is using com.sun.org.apache.xml.internal.serialize.XMLSerializer, which supports Unicode. Strangely enough, XMLSerializer comes from Xerces, and yet it is not used by Xerces when xalan or xsltc are on the classpath. This is because org.apache.xerces.dom.CoreDOMImplementationImpl.createLSSerializer is using org.apache.xml.serializer.dom3.LSSerializerImpl when it is available instead of org.apache.xerces.dom.DOMSerializerImpl. With serializer.jar on the classpath, org.apache.xml.serializer.dom3.LSSerializerImpl is used. Warning: xalan.jar and xsltc.jar both reference serializer.jar in the manifest, so serializer.jar ends up on the classpath if it is in the same directory and either xalan.jar or xsltc.jar is on the classpath! If only xercesImpl.jar and xml-apis.jar are on the classpath, org.apache.xerces.dom.DOMSerializerImpl is used as the LSSerializer, and Unicode characters are properly handled.
CONCLUSION AND WORKAROUND: the bug lies in Apache's org.apache.xml.serializer.ToStream class (renamed com.sun.org.apache.xml.internal.serializer.ToStream inside the JDK). A serializer that handles supplementary Unicode characters properly is org.apache.xml.serialize.DOMSerializerImpl (renamed com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl inside the JDK). However, Apache prefers ToStream to DOMSerializerImpl when it is available, so maybe it behaves better in other respects (or maybe it's just a reorganization). On top of that, they went as far as deprecating DOMSerializerImpl in Xerces 2.9.0. Hence the following workarounds, which might have side effects:
when Xerces and Apache's serializer are on the classpath, replace "(doc.getImplementation()).createLSSerializer()" by "new org.apache.xerces.dom.DOMSerializerImpl()"
when Apache's serializer is on the classpath (for instance because of xalan) but not Xerces, try to replace "(doc.getImplementation()).createLSSerializer()" by "new com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl()" (a fallback is necessary because this class might disappear in the future)
These two workarounds produce a warning when compiling.
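One way to avoid both the compile-time warning and the hard compile-time dependency is to look the classes up reflectively; a minimal sketch (the method name is made up, the class names are the ones discussed above):
import org.w3c.dom.ls.LSSerializer;

static LSSerializer createSupplementarySafeSerializer() {
    // Try the Xerces serializer first, then the JDK-internal copy.
    String[] candidates = {
            "org.apache.xerces.dom.DOMSerializerImpl",
            "com.sun.org.apache.xml.internal.serialize.DOMSerializerImpl" };
    for (String name : candidates) {
        try {
            return (LSSerializer) Class.forName(name)
                    .getDeclaredConstructor().newInstance();
        } catch (Exception e) {
            // not available on this classpath/JDK; try the next candidate
        }
    }
    throw new IllegalStateException("no suitable LSSerializer found");
}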
I don't have a workaround for XSLT transforms, but this is beyond the scope of the question. I guess one could do a transform to another DOM document and use DOMSerializerImpl to serialize.
Some other workarounds, which might be a better solution for some people:
use Saxon with a Transformer
use XML documents with UTF-16 encoding
Here is an example that worked for me. The code is written in Groovy running on Java 7, but you can easily translate it to Java since I've used only Java APIs in the example. If you pass in a DOM document that has supplementary (plane 1) Unicode characters, you will get back a String in which those characters are properly serialized. For example, if the document contains U+1D4C1, MATHEMATICAL SCRIPT SMALL L (see http://www.fileformat.info/info/unicode/char/1d4c1/index.htm), it will be serialized in the returned String as &#120001; instead of the invalid surrogate pair &#55349;&#56513; (which is what you get with a Xalan Transformer).
import org.w3c.dom.Document
...
def String writeToStringLS( Document doc ) {
    def domImpl = doc.getImplementation()
    def implLS = domImpl.getFeature("LS", "3.0")
    def lsOutput = implLS.createLSOutput()
    lsOutput.encoding = "UTF-8"
    def bo = new ByteArrayOutputStream()
    def out = new BufferedWriter( new OutputStreamWriter( bo, "UTF-8" ) )
    lsOutput.characterStream = out
    def lsWriter = implLS.createLSSerializer()
    lsWriter.write( doc, lsOutput )
    out.flush() // push everything buffered in the writer into bo
    return bo.toString( "UTF-8" ) // decode with the same charset used to encode
}
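For reference, a rough Java translation of the same idea (an untested sketch using only the org.w3c.dom.ls API; writing to a StringWriter sidesteps the byte-buffer and flush bookkeeping entirely):
import java.io.StringWriter;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

static String writeToStringLS(Document doc) {
    DOMImplementationLS implLS =
            (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
    LSSerializer serializer = implLS.createLSSerializer();
    LSOutput output = implLS.createLSOutput();
    StringWriter out = new StringWriter();
    output.setCharacterStream(out); // character stream: no byte encoding involved
    serializer.write(doc, output);
    return out.toString();
}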
I'm hitting my head off a brick wall with a bizarre problem that I know will have an obvious answer, but I can't see it for the life of me. It's all to do with encoding.
Before the code, a simple description: I want to take in an XML document which is Latin1 (ISO-8859-1) encoded, and then send the thing completely unchanged over an HttpURLConnection. I have a small test class and the raw XML which shows my problem. The XML file contains a Latin1 character, 0xa2 (a cent sign), which is invalid UTF-8; I'm deliberately using this as my test case. The XML declaration is ISO-8859-1.
I can read it in no bother, but when I convert the org.w3c.dom.Document to a byte[] array to send down the HttpURLConnection, the 0xa2 character gets converted to the UTF-8 encoding of the cent sign (0xc2 0xa2), while the declaration stays ISO-8859-1. In other words, the character is now encoded as two bytes that disagree with the declaration, which is totally wrong.
The code which does this:
FileInputStream input = new FileInputStream( "input-file" );
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse( input );
Source source = new DOMSource( document );
ByteArrayOutputStream baos = new ByteArrayOutputStream();
Result result = new StreamResult( baos );
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.transform( source, result );
byte[] bytes = baos.toByteArray();
FileOutputStream fos = new FileOutputStream( "output-file" );
fos.write( bytes );
I'm just writing it to a file at the moment while I figure out what on earth is converting this character. The input file contains 0xa2; the output file contains 0xc2 0xa2. One way to fix this is to add this line before calling transform():
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
However, not all XML documents that I'll be dealing with will be Latin1; most, indeed, will be UTF-8 when they come in. I'm assuming I shouldn't have to work out the encoding myself just to feed it to the transformer? Surely it should work this out for itself, and I'm just doing something else wrong?
A thought had occurred to me that I could just query the document to find out the encoding and thus the extra line could just do the trick:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getInputEncoding());
However, I then determined that this wasn't the answer, as document.getInputEncoding() returns a different String when I run it in a terminal on the Linux box than when I run it within Eclipse on my Mac.
Any hints would be appreciated. I fully accept I'm missing out on something obvious.
Yes: by default, XML documents are written as UTF-8, so you need to explicitly tell the Transformer to use a different encoding. Your last edit is the "trick" to doing this so that the output always matches the input XML encoding:
transformer.setOutputProperty(OutputKeys.ENCODING, document.getXmlEncoding());
The only question is: do you really need to maintain the input encoding?
Why not just open it with a normal FileInputStream and stream the bytes to the output stream directly from that? Why do you need to load it into DOM format in memory if you are just sending it byte for byte over an HttpURLConnection?
Edit: According to javadoc for Document, you should probably be using document.getXmlEncoding() to get what matches the encoding in the XML prolog.
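A minimal sketch of the pass-through approach (connection stands in for your already-configured HttpURLConnection):
import java.io.*;

// Copy the XML byte-for-byte; since nothing is parsed or re-encoded,
// the original encoding and declaration are preserved exactly.
try (InputStream in = new FileInputStream("input-file");
     OutputStream out = connection.getOutputStream()) {
    byte[] buf = new byte[8192];
    int n;
    while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
    }
}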
This may be helpful - it's too long for a comment, but not really an answer. From the spec:
The encoding attribute specifies the preferred encoding to use for
outputting the result tree. XSLT processors are required to respect
values of UTF-8 and UTF-16. For other values, if the XSLT processor
does not support the specified encoding it may signal an error; if it
does not signal an error it should use UTF-8 or UTF-16 instead.
You may want to test with "encoding=junk", as it were, to see what it does.
The valid values for Java are described here. See also IANA charsets.
I'm trying to index Wikipedia dumps. My SAX parser makes Article objects for the XML with only the fields I care about, then sends them to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle().startsWith(prefix). In English, everything works fine: I get a Lucene index with all the pages except those with matching prefixes.
In French, the prefixes with no accents also work (i.e. they filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:), and I cannot see any difference between the corresponding lines (viewed in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the offending pages: the <title>Catégorie:something</title> is displayed correctly by grep). On the other hand, when I recreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file but not in the original…
Any ideas?
Alternatives I tried:
Getting the file (commented lines were tried without success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UTF-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"Catégorie:", "Modèle:", "Wikipédia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad: that one I tried does work, I had tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(is, handler);
Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings, since they are byte-based, not character-based. So here we create a Reader from a FileInputStream: the latter (the stream) doesn't know about encodings, but the former (the reader) does, because we pass the encoding to its constructor.
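To actually hand that reader to the SAX parser, wrap it in an InputSource (a sketch; parser, handler, and xmlFileName are the ones from the question):
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import org.xml.sax.InputSource;

Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
// With a character stream, the parser uses our decoding rather than
// sniffing the encoding from the raw bytes.
parser.parse(new InputSource(r), handler);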