Java: Unescaping XML/HTML before JAXB parsing doesn't work

Can anyone help me?
In HTML/XML:
A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and uses the format:
&#nnnn;
or
&#xhhhh;
I have to unescape (convert to unicode) these references before I use the JAXB parser.
When I use Apache StringEscapeUtils.unescapeXml(), &amp ; and &gt ; and &lt ; are unescaped as well, and that is not what I want because parsing will then fail.
Is there a library that converts only the &#nnnn; references to Unicode and leaves everything else escaped?
Example:
begin-tag Adam &lt ;&gt ; Sl.meer 4 & 5 &# 55357;&# 56900; end-tag
I have added spaces after &# otherwise you do not see the notation.
For now I fixed it like this, but I want to use a better solution.
String unEncapedString = StringEscapeUtils.unescapeXml(xmlData).replaceAll("&", "&amp;")
.replaceAll("<>", "&lt;&gt;");
StringReader reader = new StringReader(unEncapedString.codePoints().filter(c -> isValidXMLChar(c))
.collect(StringBuilder::new, StringBuilder::appendCodePoint, StringBuilder::append).toString());
return (Xxxx) createUnmarshaller().unmarshal(reader);
I looked through the Apache Commons Text library and finally found the solution:
NumericEntityUnescaper numericEntityUnescaper = new NumericEntityUnescaper(
NumericEntityUnescaper.OPTION.semiColonRequired);
xmlData = numericEntityUnescaper.translate(xmlData);
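For comparison, the same idea can be sketched with only the JDK. This is a minimal sketch under stated assumptions, not the Commons Text implementation: the names NumericRefUnescaper and unescapeNumericRefs are made up for illustration. It decodes only &#nnnn; and &#xhhhh; references that end in a semicolon and leaves named entities such as &amp; and &lt; untouched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes numeric character references only.
final class NumericRefUnescaper {
    // Group 1 captures hex digits (&#x1F644;), group 2 decimal digits (&#128580;).
    // The trailing semicolon is required.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String unescapeNumericRefs(String input) {
        Matcher m = NUMERIC_REF.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous match and this one unchanged.
            out.append(input, last, m.start());
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            if (codePoint <= Character.MAX_CODE_POINT) {
                out.appendCodePoint(codePoint);
            } else {
                // Not a Unicode code point: keep the reference as written.
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }
}
```

Two references that encode the halves of a surrogate pair, as in the example input above, end up as adjacent chars and therefore form a single supplementary character. Note that a reference such as &#60; would be decoded to a raw <, so this step is only safe when the input does not use numeric references for markup characters.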

Related

How to convert utf8 string to escape string in JSON Java?

I want to convert a UTF-8 string to the escaped \uXXXX format in the value of a JSON object.
I tried both JSONObject and Gson, but neither worked for me in this case:
JSONObject js = new JSONObject();
js.put("test","nguyễn");
System.out.println(js.toString());
and
Gson gson = new Gson();
String new_js = gson.toJson(js.toString());
System.out.println(new_js);
Output: {"test":"nguyễn"}
But I expect the result to be:
Expected Output: {"test":"nguy\u1EC5n"}
Is there a solution for this case? Please help me resolve it.
You can use the Apache Commons Text library to convert a string to Unicode escape sequences. Use org.apache.commons.text.StringEscapeUtils to translate the text before adding it to the JSONObject.
StringEscapeUtils.escapeJava("nguyễn")
will produce
nguy\u1EC5n
One possible problem with StringEscapeUtils is that it escapes control characters as well. If there is a tab character at the end of the string, it will be translated to \t. For example:
StringEscapeUtils.escapeJava("nguyễn\t")
will produce an incorrect string:
nguy\u1EC5n\t
You can use org.apache.commons.text.translate.UnicodeEscaper to get around this, but with its default constructor it translates every character in the string to a Unicode escape sequence. For example:
UnicodeEscaper ue = new UnicodeEscaper();
ue.translate(rawString);
will produce
\u006E\u0067\u0075\u0079\u1EC5\u006E
or
\u006E\u0067\u0075\u0079\u1EC5\u006E\u0009
Whether it is a problem or not is up to you to decide.
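A JDK-only alternative is to post-process the serialized JSON text instead of the value. The following is a minimal sketch, assuming the input is already valid JSON produced by a serializer; the names JsonAsciiEscaper and escapeNonAscii are hypothetical. In valid JSON, characters outside ASCII can only occur inside strings, where a backslash-u escape is an equivalent spelling, so replacing them leaves the parsed value unchanged.

```java
// Hypothetical helper: makes serialized JSON text ASCII-only.
final class JsonAsciiEscaper {
    static String escapeNonAscii(String json) {
        StringBuilder out = new StringBuilder(json.length() + 16);
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (c > 0x7E) {
                // Four uppercase hex digits per UTF-16 unit; a supplementary
                // character becomes two consecutive escapes (a surrogate pair).
                out.append(String.format("\\u%04X", (int) c));
            } else {
                // ASCII is copied as-is, including whitespace between tokens
                // and escapes the serializer has already written.
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Called as JsonAsciiEscaper.escapeNonAscii(js.toString()), this leaves tabs and other control characters to the JSON serializer, which already writes them as escapes such as \t.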

How to solve the IllegalDataException in jdom2 library?

I am using jdom 2.0.6 version and received this IllegalDataException:
Error in setText for tokenization:
it fails on calling the setText() method.
Element text = new Element("Text");
text.setText(doc.getText());
It seems that it doesn't accept some of the characters in 'text'. Two examples:
Originally Posted by Yvette H( 45) Odd socks, yes, no undies yes, no coat yes, no shoes odd. 🏻
ParryOtter said: Posted
Should I specify an encoding somewhere, or is there some other cause?
In fact, you just have to wrap the text that contains illegal characters in CDATA:
Element text = new Element("Text");
text.setContent(new CDATA(doc.getText()));
The reverse operation, reading text wrapped in CDATA, is transparent in JDOM2; you won't have to unescape it.
For my tests I added an illegal character at the end of my text by creating one from the hex value 0x2, like this:
String text = doc.getText();
int hex = 0x2;
text += (char) hex;
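If wrapping the text is not enough, another option is to drop the offending characters before calling setText(). The XML 1.0 specification allows only the code points listed in the sketch below, in CDATA sections as well as in ordinary text. This is a minimal JDK-only sketch; XmlCharFilter and its method names are hypothetical and not part of JDOM.

```java
// Hypothetical helper: keeps only code points allowed by the XML 1.0 Char production.
final class XmlCharFilter {
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // codePoints() joins valid surrogate pairs into one code point and
        // reports an unpaired surrogate as its own value, which is filtered out.
        text.codePoints()
            .filter(XmlCharFilter::isValidXmlChar)
            .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

With this in place, the call would look like text.setText(XmlCharFilter.stripInvalidXmlChars(doc.getText())). Properly paired emoji survive, while control characters such as 0x2 and broken surrogates are removed.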

Input Special characters to PPTX using docx4j

I got a special character from its character code and created a presentation containing that character using the docx4j library. When I want to print the "£" sign, it is printed as "Â£". Is there a special way to insert special characters into a PPTX?
I used following code.
String iChar = new Character((char)163).toString();
t.setTextContent(iChar);
Please unzip your pptx, and have a look at the content of the slide. It should contain something like:
<a:t>£</a:t>
You can create a p containing that with:
// Create object for p
CTTextParagraph textparagraph = dmlObjectFactory.createCTTextParagraph();
textbody.getP().add( textparagraph);
// Create object for r
CTRegularTextRun regulartextrun = dmlObjectFactory.createCTRegularTextRun();
textparagraph.getEGTextRun().add( regulartextrun);
regulartextrun.setT( "£");
or by unmarshalling a string. In either case, you can just provide the £ char directly.
You can generate suitable code using the docx4j webapp at http://webapp.docx4java.org/

Convert HTML symbols and HTML names to HTML number using Java

I have an XML file that contains many special symbols such as ® (HTML number &#174)
and HTML names such as &atilde (HTML number &#227).
I am trying to replace these HTML symbols and HTML names with corresponding HTML number using Java. For this, I first converted XML file to string and then used replaceAll method as:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell how to do it.
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try:
content = content.replaceAll("\\p(®)", "&\#174");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with the raw characters, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
// then one of the following (escapeHtml4 is called escapeHtml in Commons Lang 2.x):
escapedStr = StringEscapeUtils.escapeXml(escapedStr);
escapedStr = StringEscapeUtils.escapeHtml4(escapedStr);
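The conversion the question asks for can also be sketched with the JDK alone. This is a minimal sketch under stated assumptions: NumericReferenceEncoder is a hypothetical name, the two-entry name table is only an illustration (a real table would have to cover every HTML name in the file), and the names are assumed to end in a semicolon. It uses String.replace, which treats its arguments literally, so no regular-expression escaping is involved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: rewrites HTML names and non-ASCII symbols as &#NNN; references.
final class NumericReferenceEncoder {
    // Illustrative subset only; extend with the names that occur in your data.
    private static final Map<String, Integer> NAMED = new LinkedHashMap<>();
    static {
        NAMED.put("&atilde;", 227);
        NAMED.put("&reg;", 174);
    }

    static String toNumericReferences(String content) {
        String result = content;
        for (Map.Entry<String, Integer> e : NAMED.entrySet()) {
            // Literal replacement, no regex metacharacters to worry about.
            result = result.replace(e.getKey(), "&#" + e.getValue() + ";");
        }
        StringBuilder out = new StringBuilder(result.length());
        result.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                // Non-ASCII symbol: write its decimal code point as a reference.
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
        });
        return out.toString();
    }
}
```

When reading and writing the file, it helps to pass an explicit charset (for example UTF-8) so that the symbols are decoded correctly before they are replaced.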

Processing a BZIP string/file in Scala

I'm punishing myself a bit by doing the python challenges series in Scala.
Now, one of the challenges is to read in a string that's been compressed using the bzip algorithm and output the result.
BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084
Now, after some digging it appears as if there isn't a standard java library for bzip processing, but there is something in the apache ant project, that this guy has kindly taken out for use as a separate library.
The thing is, I can't get it to work with the following code: it just hangs in the Scala REPL and the JVM maxes out at 100% CPU usage.
This is the code I'm trying...
import java.io.{ByteArrayInputStream}
import org.apache.tools.bzip2.{CBZip2InputStream}
import org.apache.commons.io.{IOUtils}
object ChallengeEight extends Application {
val inputString = """BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"""
val inputStream = new ByteArrayInputStream( inputString.getBytes("UTF-8") ) //convert string to inputstream
inputStream.skip(2) //skip the 'BZ' part at the start
val bzipInputStream = new CBZip2InputStream(inputStream) //hangs here....
val result = IOUtils.toString(bzipInputStream, "UTF-8");
println(result)
}
Anyone got any ideas? Or is the CBZip2InputStream class expecting some extra bytes that you might find in a file that has been zipped with bzip2?
Any help would be appreciated
EDIT For the record this is the python solution
import bz2
un = "BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!" \
"\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084"
print [bz2.decompress(elt) for elt in (un)]
To escape characters, use a Unicode escape sequence of the form \uXXXX, where XXXX is the hexadecimal code of the character.
val un = "BZh91AY&SYA\u00af\u0082\r\u0000\u0000\u0001\u0001\u0080\u0002\u00c0\u0002\u0000 \u0000!\u009ah3M\u0007<]\u00c9\u0014\u00e1BA\u0006\u00be\u00084"
You are enclosing your string in triple quotes, which means the literal characters are passed to the algorithm rather than the control/binary characters they represent.
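A related pitfall is the conversion to bytes: getBytes("UTF-8") encodes every character above 0x7F as two bytes, so the compressed data would still be altered even with the escapes fixed. One way around both issues is to decode the Python-style literal straight into a byte array. The following is a minimal sketch in Java (callable from Scala as-is); PythonBytesLiteral is a hypothetical name and only the escapes that occur in the challenge string are handled.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper: turns text such as BZh91AY&SYA\xaf\x82\r into raw bytes.
final class PythonBytesLiteral {
    static byte[] parse(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c != '\\' || i + 1 >= literal.length()) {
                // Plain character: assumed to fit in one byte (Latin-1 range).
                out.write(c & 0xFF);
                i++;
                continue;
            }
            switch (literal.charAt(i + 1)) {
                case 'x':
                    // \xNN: exactly two hex digits follow; a truncated escape
                    // makes substring throw, which is acceptable for a sketch.
                    out.write(Integer.parseInt(literal.substring(i + 2, i + 4), 16));
                    i += 4;
                    break;
                case 'r': out.write('\r'); i += 2; break;
                case 'n': out.write('\n'); i += 2; break;
                case 't': out.write('\t'); i += 2; break;
                case '\\': out.write('\\'); i += 2; break;
                default:
                    // Unknown escape: keep the backslash, reprocess the next char.
                    out.write('\\');
                    i++;
            }
        }
        return out.toByteArray();
    }
}
```

The resulting array can be wrapped in a ByteArrayInputStream directly, with no charset conversion in between. If a String is used instead, getBytes("ISO-8859-1") maps each character from 0 to 255 to a single byte.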