Need an application to fix an XML file with unescaped chars - java

This XML (it has an .rdf file extension, but it is XML) was generated by an automatic tool, but unfortunately it has various "unescaped" strings like
<tag xml:lang="fr">L'insuline (du latin insula, île) </tag>
And the parser (and the reasoner software) crashes on this...
Java or PHP solutions are fine with me too!
Thanks,
Celso

Here's a general method that I use a lot to make sure a String is escaped properly for XML.
private static final String AMP = "&amp;";
private static final String LT = "&lt;";
private static final String GT = "&gt;";
private static final String QUOTE = "&quot;";
private static final String APOS = "&apos;";
public static String encodeEntities(String dirtyString) {
StringBuffer buff = new StringBuffer();
char[] chars = dirtyString.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (chars[i] > 0x7f) {
buff.append("&#" + (int) chars[i] + ";");
continue;
}
switch (chars[i]) {
case '&':
buff.append(AMP);
break;
case '<':
buff.append(LT);
break;
case '\'':
buff.append(APOS);
break;
case '"':
buff.append(QUOTE);
break;
case '>':
buff.append(GT);
break;
default:
buff.append(chars[i]);
break;
}
}
return buff.toString();
}

The XML given by the OP is well-formed: the single quote character is valid and so is the circumflex "i", and neither needs escaping. I would make sure you're using a text encoding such as UTF-8. Here's a quick Java example that does an identity transformation:
public static void main(String[] args) throws Exception {
Transformer t = TransformerFactory.newInstance().newTransformer();
StreamResult s = new StreamResult(System.out);
t.transform(new StreamSource(new StringReader("<tag xml:lang=\"fr\">L'insuline (du latin insula, île) </tag>")), s);
}

The XML fragment given by the OP looks well-formed. Neither the apostrophe nor the i-circumflex needs escaping. The most likely problem is that the XML is encoded using iso-8859-1 but lacks an XML declaration, so the parser thinks it is in UTF-8 encoding. The solution then is to add the XML declaration <?xml version="1.0" encoding="iso-8859-1"?>, which tells the parser how to decode the characters. (For a document containing only ASCII characters, iso-8859-1 and utf-8 are indistinguishable, so this problem only surfaces when you use characters outside the ASCII range.)
A word of advice: if you had given the error message generated by the parser, you wouldn't have got so many incorrect answers.
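The effect of the missing declaration can be reproduced with a minimal sketch. The class and method names below are hypothetical, and the sketch assumes the JDK's built-in parser: the same ISO-8859-1 bytes are parsed once without a declaration and once with one.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical demo class: parses ISO-8859-1 bytes with and without an XML declaration.
final class DeclaredEncodingDemo {
    // The OP's element; \u00ee is the i-circumflex.
    static final String BODY =
            "<tag xml:lang=\"fr\">L'insuline (du latin insula, \u00eele) </tag>";
    static final String DECLARATION = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>";

    /** Encodes the text as ISO-8859-1 bytes, parses them, and returns the element text. */
    static String parseLatin1(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes))
                .getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        try {
            // Without a declaration the parser assumes UTF-8, and the lone 0xEE byte is invalid.
            parseLatin1(BODY);
            System.out.println("parsed without a declaration");
        } catch (Exception e) {
            System.out.println("no declaration: " + e.getClass().getSimpleName());
        }
        // With the declaration the same bytes decode correctly.
        System.out.println(parseLatin1(DECLARATION + BODY));
    }
}
```

If the file really is UTF-8 already, adding this declaration would be wrong, so it is worth checking the actual bytes first.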

Related

Xalan's SAX implementation - double encoding entities in string

I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format
"<p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand sign although it is part of an already escaped entity.
Does anyone know a way to make Xalan understand escaped entities and not escape their ampersand?
One possible solution is to add startCDATA() to the transformerHandler, but that's not something I can use in my code.
public class TestSax{
public static void main(String[] args) throws TransformerConfigurationException, SAXException {
TestSax t = new TestSax();
System.out.println(t.createSAXXML());
}
public String createSAXXML() throws SAXException, TransformerConfigurationException {
Writer writer = new StringWriter( );
StreamResult streamResult = new StreamResult(writer);
SAXTransformerFactory transformerFactory =
(SAXTransformerFactory) SAXTransformerFactory.newInstance( );
String data = null;
TransformerHandler transformerHandler =
transformerFactory.newTransformerHandler( );
transformerHandler.setResult(streamResult);
transformerHandler.startDocument( );
transformerHandler.startElement(null,"decimal","decimal", null);
data = "<p>Test k&quot;nnen</p>";
transformerHandler.characters(data.toCharArray(),0,data.length( ));
transformerHandler.endElement(null,"decimal","decimal");
transformerHandler.endDocument( );
return writer.toString( );
}}
If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand, if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
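The two consistent options can be compared side by side in a small sketch. It uses the JDK's built-in transformer rather than Xalan 2.7.2, and the class and method names are hypothetical. One method passes the data as text through characters(); the other parses it as markup first.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class contrasting "data as text" with "data as markup".
final class TextVersusMarkupDemo {

    /** Treats the data as plain text: the serializer escapes every '<' and '&'. */
    static String asText(String data) throws Exception {
        StringWriter out = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new StreamResult(out));
        handler.startDocument();
        handler.startElement("", "decimal", "decimal", new AttributesImpl());
        handler.characters(data.toCharArray(), 0, data.length());
        handler.endElement("", "decimal", "decimal");
        handler.endDocument();
        return out.toString();
    }

    /** Treats the data as markup: it is parsed, so tags and entity references are recognized. */
    static String asMarkup(String data) throws Exception {
        StringWriter out = new StringWriter();
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.transform(
                new StreamSource(new StringReader("<decimal>" + data + "</decimal>")),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String data = "<p>Test k&quot;nnen</p>";
        System.out.println(asText(data));   // tags and the ampersand come out escaped
        System.out.println(asMarkup(data)); // <p> stays a tag, &quot; becomes a quote character
    }
}
```

The second method only works when the data is itself well-formed XML, which is the condition the answer above starts from.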

XML escape code

I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method
private String xmlEscape(String s) {
try {
return s.replaceAll("&(?!amp;)", "&amp;");
}
catch (PatternSyntaxException pse) {
return s;
}
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following chars:
< > ' " &. If any are found, I want to replace the found char with the correct entity.
Example: if there is a char &, it would be replaced with &amp; in the string.
Can you do something as simple as
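try {
// Strings are immutable, so each result must be assigned back.
// The ampersand goes first so the entities added below are not re-escaped.
s = s.replaceAll("&(?!amp;)", "&amp;");
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
s = s.replaceAll("'", "&apos;");
s = s.replaceAll("\"", "&quot;");
return s;
}
catch (PatternSyntaxException pse) {
return s;
}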
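The chained-replacement idea can be sketched as a complete method. As an assumption beyond the original question, the lookahead here is widened so that any of the five predefined entities that are already present are left alone, not just &amp;. The class name is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper: escapes the five XML special characters,
// leaving already-escaped predefined entities untouched.
final class XmlEscape {
    // An ampersand that does not start one of the five predefined entities.
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)");

    static String xmlEscape(String s) {
        // Ampersands first, so the entities added below are not escaped again.
        String out = BARE_AMP.matcher(s).replaceAll("&amp;");
        return out.replace("<", "&lt;")
                  .replace(">", "&gt;")
                  .replace("\"", "&quot;")
                  .replace("'", "&apos;");
    }

    public static void main(String[] args) {
        System.out.println(xmlEscape("a&b <c> 'd' \"e\" &amp;"));
    }
}
```

Skipping existing entities is a guess about the input. If the string is known to be raw text, escaping every ampersand unconditionally is the safer choice.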
You may want to consider using the Apache Commons StringEscapeUtils.escapeXml method or one of the many other XML escape utilities out there. That gives you correct escaping of XML content without worrying about missing something when you need to escape something other than a host name.
Alternatively have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together (an implementation is included in the JDK/JRE)? This will handle all the necessary character escaping for you:
package forum12569441;
import java.io.*;
import javax.xml.stream.*;
public class Demo {
public static void main(String[] args) throws Exception {
// WRITE THE XML
XMLOutputFactory xof = XMLOutputFactory.newFactory();
StringWriter sw = new StringWriter();
XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
xsw.writeStartDocument();
xsw.writeStartElement("foo");
xsw.writeCharacters("<>\"&'");
xsw.writeEndDocument();
String xml = sw.toString();
System.out.println(xml);
// READ THE XML
XMLInputFactory xif = XMLInputFactory.newFactory();
XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
xsr.nextTag(); // Advance to "foo" element
System.out.println(xsr.getElementText());
}
}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'

Ideal Java library for cleaning html, and escaping malformed fragments

I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is just omit the content it sees as malformed html. Is there a different library that will just escape the malformed fragments instead of omitting them? If not, any recommendations on what library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.
Here is my regular expression code to escape unknown tags like <M+1>
private static String escapeUnknownTags(String input) {
Scanner scan = new Scanner(input);
StringBuilder builder = new StringBuilder();
while (scan.hasNext()) {
String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);
if (s == null) {
builder.append(escape(scan.next(".*")));
} else {
processMatch(s, builder);
}
}
return builder.toString();
}
private static void processMatch(String s, StringBuilder builder) {
if (!isKnown(s)) {
String escaped = escape(s);
builder.append(escaped);
}
else {
builder.append(s);
}
}
private static String escape(String s) {
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
return s;
}
private static boolean isKnown(String s) {
Scanner scan = new Scanner(s);
if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {
return false;
}
MatchResult mr = scan.match();
try {
String tag = mr.group(1).toLowerCase();
if (HTML.getTag(tag) != null) {
return true;
}
}
catch (Exception e) {
// Should never happen
e.printStackTrace();
}
return false;
}
HTML cleaner
HtmlCleaner is open-source HTML parser written in Java. HTML found on
Web is usually dirty, ill-formed and unsuitable for further
processing. For any serious consumption of such documents, it is
necessary to first clean up the mess and bring the order to tags,
attributes and ordinary text. For the given HTML document, HtmlCleaner
reorders individual elements and produces well-formed XML. By default,
it follows similar rules that the most of web browsers use in order to
create Document Object Model. However, user may provide custom tag and
rule set for tag filtering and balancing.
Ok, I suspect it is this. Use the following code, it will help.
javax.swing.text.html.HTML

How can I force a SAX parser to use a DTD if one is not specified in the input file?

How can I force a SAX parser (specifically, Xerces in Java) to use a DTD when parsing a document without having any doctype in the input document? Is this even possible?
Here are some more details of my scenario:
We have a bunch of XML documents that conform to the same DTD that are generated by multiple different systems (none of which I can change). Some of these systems add a doctype to their output documents, others do not. Some use named character entities, some do not. Some use named character entities without declaring a doctype. I know that's not kosher, but it's what I have to work with.
I'm working on a system that needs to parse these files in Java. Currently, it's handling the above cases by first reading in the XML document as a stream, attempting to detect if it has a doctype defined, and adding a doctype declaration if one isn't already present. The problem is that this code is buggy, and I'd like to replace it with something cleaner.
The files are large, so I can't use a DOM-based solution. I'm also trying to get character entities resolved, so it doesn't help to use an XML Schema.
If you have a solution, could you please post it directly instead of linking to it? It doesn't do Stack Overflow much good if in the future there's a correct solution with a dead link.
I don't think there is a sane way to set a DOCTYPE if the document doesn't have one. A possible solution is to write a fake one, as you already do. If you're using SAX, you can use this fake InputStream and fake DefaultHandler implementation (it will work only for single-byte encodings such as Latin-1).
I know this solution is also ugly, but it is the only one that works well with big data streams.
Here is some code.
private enum State {readXmlDec, readXmlDecEnd, writeFakeDoctipe, writeEnd};
private class MyInputStream extends InputStream{
private final InputStream is;
private StringBuilder sb = new StringBuilder();
private int pos = 0;
private String doctype = "<!DOCTYPE register SYSTEM \"fake.dtd\">";
private State state = State.readXmlDec;
private MyInputStream(InputStream source) {
is = source;
}
@Override
public int read() throws IOException {
int bit;
switch (state){
case readXmlDec:
bit = is.read();
sb.append(Character.toChars(bit));
if(sb.toString().equals("<?xml")){
state = State.readXmlDecEnd;
}
break;
case readXmlDecEnd:
bit = is.read();
if(Character.toChars(bit)[0] == '>'){
state = State.writeFakeDoctipe;
}
break;
case writeFakeDoctipe:
bit = doctype.charAt(pos++);
if(doctype.length() == pos){
state = State.writeEnd;
}
break;
default:
bit = is.read();
break;
}
return bit;
}
@Override
public void close() throws IOException {
super.close();
is.close();
}
}
private static class MyHandler extends DefaultHandler {
@Override
public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException {
System.out.println("resolve "+ systemId);
// get real dtd
InputStream is = ClassLoader.class.getResourceAsStream("/register.dtd");
return new InputSource(is);
}
... // rest of code
}
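The resolver half of this approach can be tried on its own with a small sketch. The one-line DTD, the class name, and the sample document are hypothetical stand-ins; the document here already carries the fake doctype, so only the entity resolution is shown.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: a SAX handler that answers every external DTD request
// with a local DTD, so named entities resolve without network or file access.
final class LocalDtdDemo {
    // Stand-in for the real DTD; it declares a single named character entity.
    static final String DTD = "<!ENTITY eacute \"&#233;\">";

    /** Parses the document and returns its concatenated character data. */
    static String parse(String xml) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // Ignore the requested system id and always serve the local DTD.
                return new InputSource(new StringReader(DTD));
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE register SYSTEM \"fake.dtd\">"
                + "<register>caf&eacute;</register>";
        System.out.println(parse(xml));
    }
}
```

Because the parser stays in streaming mode, this part should behave the same on large files; the stream wrapper above is still needed for documents that have no doctype at all.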

How to replace � in a string

I have a string that contains a character � I haven't been able to replace it correctly.
String.replace("�", "");
doesn't work, does anyone know how to remove/replace the � in the string?
That's the Unicode Replacement Character, \uFFFD.
Something like this should work:
String strImport = "For some reason my �double quotes� were lost.";
strImport = strImport.replaceAll("\uFFFD", "\"");
Character issues like this are difficult to diagnose because information is easily lost through misinterpretation of characters via application bugs, misconfiguration, cut'n'paste, etc.
As I (and apparently others) see it, you've pasted three characters:
codepoint glyph escaped windows-1252 info
=======================================================================
U+00ef ï \u00ef ef, LATIN_1_SUPPLEMENT, LOWERCASE_LETTER
U+00bf ¿ \u00bf bf, LATIN_1_SUPPLEMENT, OTHER_PUNCTUATION
U+00bd ½ \u00bd bd, LATIN_1_SUPPLEMENT, OTHER_NUMBER
To identify the character, download and run the program from this page. Paste your character into the text field and select the glyph mode; paste the report into your question. It'll help people identify the problematic character.
You are asking to replace the character "�", but for me that is coming through as three characters: 'ï', '¿' and '½'. This might be your problem... If you are using a Java version prior to 1.5, you only get the UCS-2 characters, that is, only the first 65K Unicode code points. Based on other comments, it is most likely that the character you are looking for is '�', the Unicode replacement character. This is the character that is "used to replace an incoming character whose value is unknown or unrepresentable in Unicode".
Actually, looking at the comment from Kathy, the other issue that you might be having is that javac is not interpreting your .java file as UTF-8, assuming that you are writing it in UTF-8. Try using:
javac -encoding UTF-8 xx.java
Or, modify your source code to do:
String.replaceAll("\uFFFD", "");
As others have said, you posted 3 characters instead of one. I suggest you run this little snippet of code to see what's actually in your string:
public static void dumpString(String text)
{
for (int i=0; i < text.length(); i++)
{
System.out.println("U+" + Integer.toString(text.charAt(i), 16)
+ " " + text.charAt(i));
}
}
If you post the results of that, it'll be easier to work out what's going on. (I haven't bothered padding the string - we can do that by inspection...)
Change the encoding to UTF-8 while parsing. This will remove the special characters.
Use the unicode escape sequence. First you'll have to find the codepoint for the character you seek to replace (let's just say it is ABCD in hex):
str = str.replaceAll("\uABCD", "");
For details:
import java.io.UnsupportedEncodingException;
/**
* File: BOM.java
*
* check if the bom character is present in the given string print the string
* after skipping the utf-8 bom characters print the string as utf-8 string on a
* utf-8 console
*/
public class BOM
{
private final static String BOM_STRING = "\u00EF\u00BB\u00BFHello World"; // UTF-8 BOM bytes (EF BB BF) written as ISO-8859-1 characters
private final static String ISO_ENCODING = "ISO-8859-1";
private final static String UTF8_ENCODING = "UTF-8";
private final static int UTF8_BOM_LENGTH = 3;
public static void main(String[] args) throws UnsupportedEncodingException {
final byte[] bytes = BOM_STRING.getBytes(ISO_ENCODING);
if (isUTF8(bytes)) {
printSkippedBomString(bytes);
printUTF8String(bytes);
}
}
private static void printSkippedBomString(final byte[] bytes) throws UnsupportedEncodingException {
int length = bytes.length - UTF8_BOM_LENGTH;
byte[] barray = new byte[length];
System.arraycopy(bytes, UTF8_BOM_LENGTH, barray, 0, barray.length);
System.out.println(new String(barray, ISO_ENCODING));
}
private static void printUTF8String(final byte[] bytes) throws UnsupportedEncodingException {
System.out.println(new String(bytes, UTF8_ENCODING));
}
private static boolean isUTF8(byte[] bytes) {
if ((bytes[0] & 0xFF) == 0xEF &&
(bytes[1] & 0xFF) == 0xBB &&
(bytes[2] & 0xFF) == 0xBF) {
return true;
}
return false;
}
}
Dissect the URL code and the Unicode error. This symbol also showed up for me on Google Translate in Armenian text and sometimes in broken Burmese.
profilage bas� sur l'analyse de l'esprit (french)
should be translated as:
profilage basé sur l'analyse de l'esprit
so, in this case � = é
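One way such a character can arise is decoding ISO-8859-1 bytes as UTF-8. The sketch below, with hypothetical class and method names, reproduces that with the French word from the example and shows that removing U+FFFD afterwards cannot bring the original letter back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: shows how U+FFFD appears when bytes are decoded
// with the wrong charset, and what stripping it does.
final class ReplacementCharDemo {
    static final char REPLACEMENT = '\uFFFD';

    /** Decodes ISO-8859-1 encoded text using the given charset. */
    static String decodeLatin1BytesAs(String original, Charset charset) {
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, charset);
    }

    /** Removes every replacement character; the lost information is not recovered. */
    static String strip(String s) {
        return s.replace(String.valueOf(REPLACEMENT), "");
    }

    public static void main(String[] args) {
        String word = "bas\u00e9"; // "base" with an acute accent on the e
        String wrong = decodeLatin1BytesAs(word, StandardCharsets.UTF_8);
        String right = decodeLatin1BytesAs(word, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.indexOf(REPLACEMENT) >= 0); // the accented letter was lost
        System.out.println(right.equals(word));              // correct charset, nothing lost
        System.out.println(strip(wrong));                    // the accented letter is gone for good
    }
}
```

Decoding with the right charset at the point where the bytes are read is therefore usually a better fix than replacing the character later.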
None of the above answers resolved my issue. When I download the XML, it prepends <xml to my XML. I simply do
xml = parser.getXmlFromUrl(url);
xml = xml.substring(3); // removes the first three characters from the string
Now it runs correctly.
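Cutting a fixed number of characters breaks as soon as a download arrives without the prefix. If the unwanted prefix is a byte order mark (an assumption; the answer above does not say so), a guarded version could look like this sketch. The class name is hypothetical.

```java
// Hypothetical helper: removes a leading byte order mark only when one is present.
final class BomStripper {
    // The BOM as a single character, as seen when the bytes were decoded as UTF-8.
    private static final String BOM = "\uFEFF";
    // The same three BOM bytes (EF BB BF) as seen when decoded as ISO-8859-1.
    private static final String BOM_AS_LATIN1 = "\u00EF\u00BB\u00BF";

    static String stripBom(String s) {
        if (s.startsWith(BOM)) {
            return s.substring(BOM.length());
        }
        if (s.startsWith(BOM_AS_LATIN1)) {
            return s.substring(BOM_AS_LATIN1.length());
        }
        return s; // no BOM: leave the string untouched
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFF<root/>"));
        System.out.println(stripBom("\u00EF\u00BB\u00BF<root/>"));
        System.out.println(stripBom("<root/>"));
    }
}
```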
