I'm generating a report from XML that comes from a web service. The report should be in PDF format, so I've chosen the FOP library. When I generate the report from an XML file located on my computer, everything works fine.
The problems start when I invoke the same method on the web server. I get an exception on this line:
transformer.transform(src, res);
Exception is:
org.xml.sax.SAXParseException: An invalid XML character (Unicode: 0x2) was found in the element content of the document.
At first it was the (Unicode: 0x1a) character, but I stripped it with this function:
private static String stripNonValidXMLCharacters(String in) {
    if (in == null || ("".equals(in))) {
        return null;
    }
    StringBuffer out = new StringBuffer(in);
    for (int i = 0; i < out.length(); i++) {
        if (out.charAt(i) == 0x1a) {
            out.setCharAt(i, '-');
        }
    }
    return out.toString();
}
But then the (Unicode: 0x2) character came along. Trying to add
else if (out.charAt(i) == 0x2) {
    out.setCharAt(i, '-');
}
doesn't help.
I'm using fop version 0.95.
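One more general approach I'm considering, rather than blacklisting each new code point as it shows up, is to whitelist only the characters the XML 1.0 specification allows. A minimal sketch of that idea:

// Sketch: keep only characters valid in XML 1.0
// (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]);
// everything else is dropped instead of patched one code point at a time.
private static String stripNonValidXMLCharacters(String in) {
    if (in == null || "".equals(in)) {
        return null;
    }
    StringBuilder out = new StringBuilder(in.length());
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)) {
            out.append(c);
        }
    }
    return out.toString();
}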
I am using Alfresco upload/download services from Java.
When I upload a file to the Alfresco server, it gives me the following path:
/app:Home/cm:Company_x0020_Home/cm:Abc/cm:TestFile/cm:V4/cm:BC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
When I download using the Alfresco services with the same file path, I take the file name from the end of the path,
i.e. ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
How can I remove or decode the [Unicode] escape sequences in the file name?
String decoded = URLDecoder.decode(queryString, "UTF-8");
The above does not work.
These are some Unicode characters which appeared in my file name.
https://en.wikipedia.org/wiki/List_of_Unicode_characters
Please do not mark the question as a duplicate; I have searched the links below, but none of them gave a solution.
These are the links I searched for replacing Unicode characters in a String with Java:
Java removing unicode characters
Remove non-ASCII characters from String in Java
How can I replace a unicode character in java string
Java Replace Unicode Characters in a String
The solution given by Jeff Potts is perfect.
But I had a situation where I was using the file name in a different project that doesn't use the org.alfresco jars,
and I would have had to pull in all those dependencies just for a simple file-name decoding.
So I used plain Java, with a regex to parse the file name and decode it, which gave me the same result as using
ISO9075.decode(test);
This is the code that can be used:
public String decode_FileName(String fileName) {
    System.out.println("fileName : " + fileName);
    String decodedfileName = fileName;
    String temp = "";
    // regex which matches _x0020_-style escape sequences
    Matcher m = Pattern.compile("\\_x(.*?)\\_").matcher(decodedfileName);
    List<String> unicodeChars = new ArrayList<String>();
    while (m.find()) {
        unicodeChars.add(m.group(1));
    }
    for (int i = 0; i < unicodeChars.size(); i++) {
        temp = unicodeChars.get(i);
        if (isInteger(temp)) {
            // convert the hex code point to the character it represents
            String replace_char = String.valueOf((char) Integer.parseInt(temp, 16));
            decodedfileName = decodedfileName.replace("_x" + temp + "_", replace_char);
        }
    }
    System.out.println("Decoded FileName :" + decodedfileName);
    return decodedfileName;
}
And use this small Java utility to check whether the matched group is a valid hex integer:
public static boolean isInteger(String s) {
    try {
        // parse with radix 16, since the escape codes are hexadecimal
        // (otherwise codes containing letters, e.g. _x002F_, would be skipped)
        Integer.parseInt(s, 16);
    } catch (NumberFormatException e) {
        return false;
    } catch (NullPointerException e) {
        return false;
    }
    return true;
}
So the above code works as simply as this:
Example:
0028 is the left parenthesis, U+0028; you can see it in the list at
https://en.wikipedia.org/wiki/List_of_Unicode_characters
String replace_char = String.valueOf((char) Integer.parseInt("0028", 16));
System.out.println(replace_char);
This code gives the output: ( which is a left parenthesis.
This is the logic I have used in my Java program.
The above program gives the same results as ISO9075.decode(test):
Output:
fileName : ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
Decoded FileName :ABC1X 0400 0109-(1-2)_v2.pdf
In the org.alfresco.util package you will find a class called ISO9075. You can use it to encode and decode strings according to that spec. For example:
String test = "ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf";
String out = ISO9075.decode(test);
System.out.println(out);
Returns:
ABC1X 0400 0109-(1-2)_v2.pdf
If you want to see what it does behind the scenes, look at the source.
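For completeness, the class also has an encode counterpart, so a round trip looks roughly like this (encoded output abbreviated):

import org.alfresco.util.ISO9075;

// encode() applies the _xXXXX_ escaping; decode() reverses it
String raw = "ABC1X 0400 0109-(1-2)_v2.pdf";
String encoded = ISO9075.encode(raw);
System.out.println(encoded);                 // ABC1X_x0020_0400_x0020_...
System.out.println(ISO9075.decode(encoded)); // prints the original name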
I've been struggling with an issue related to PDFBox and PDF editing. I have been assigned the task of editing a couple of strings in a given PDF file and outputting a mirrored version of the file with the edited strings in it. I've been told that the problem has been solved in the past using this tool, so I have been asked to do the same. The function I am using is this:
public void doIt( String inputFile, String outputFile, String strToFind, String message)
        throws IOException, COSVisitorException
{
    // the document
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load( inputFile );
        List pages = doc.getDocumentCatalog().getAllPages();
        for( int i=0; i<pages.size(); i++ )
        {
            PDPage page = (PDPage)pages.get( i );
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser( contents.getStream() );
            parser.parse();
            List tokens = parser.getTokens();
            for( int j=0; j<tokens.size(); j++ )
            {
                Object next = tokens.get( j );
                if( next instanceof PDFOperator )
                {
                    PDFOperator op = (PDFOperator)next;
                    // Tj and TJ are the two operators that display
                    // strings in a PDF
                    if( op.getOperation().equals( "Tj" ) )
                    {
                        // Tj takes one operand and that is the string
                        // to display, so let's update that operand
                        COSString previous = (COSString)tokens.get( j-1 );
                        String string = previous.getString();
                        string = string.replaceFirst( strToFind, message );
                        previous.reset();
                        previous.append( string.getBytes("ISO-8859-1") );
                    }
                    else if( op.getOperation().equals( "TJ" ) )
                    {
                        COSArray previous = (COSArray)tokens.get( j-1 );
                        for( int k=0; k<previous.size(); k++ )
                        {
                            Object arrElement = previous.getObject( k );
                            if( arrElement instanceof COSString )
                            {
                                COSString cosString = (COSString)arrElement;
                                String string = cosString.getString();
                                string = string.replaceFirst( strToFind, message );
                                cosString.reset();
                                cosString.append( string.getBytes("ISO-8859-1") );
                            }
                        }
                    }
                }
            }
            // now that the tokens are updated we will replace the
            // page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens( tokens );
            page.setContents( updatedStream );
        }
        doc.save( outputFile );
    }
    finally
    {
        if( doc != null )
        {
            doc.close();
        }
    }
}
This is the code used in one of the PDFBox examples (https://svn.apache.org/repos/asf/pdfbox/tags/1.5.0/pdfbox/src/main/java/org/apache/pdfbox/examples/pdmodel/ReplaceString.java).
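For reference, the example is driven by a main method that simply calls doIt; invoking it looks roughly like this (the file names and strings here are placeholders):

// Hypothetical invocation; the real example reads these values from args[]
ReplaceString app = new ReplaceString();
app.doIt("input.pdf", "output.pdf", "oldText", "newText");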
The file I have been given, however, is not being modified at all by this function. Nothing happens. Upon further inspection, I decided to analyze the sequence of tokens produced by the parser. The file is parsed correctly in everything other than the COSString elements, which contain gibberish characters that look like they have been wrongly encoded (a bunch of random symbols and numbers).
I tried parsing other documents, and the function works with some of them, but not with everything I passed as input (a LaTeX output file was modified correctly and had correctly encoded COSStrings, whereas other automatically generated PDFs produced no results and gibberish COSString content). I am also fairly sure the rest of the structure is being read correctly, since I rebuild the output in a different file, and the output file looks exactly the same as the input, which seems to mean the file structure is being analyzed correctly. The file contains Identity-H encoded fonts.
I tried parsing the very same file using PDFTextStripper (which extracts text from PDFs), and it returns the correct text, using this:
PDFTextStripper pdfStripper = new PDFTextStripper("UTF-8");
String result = pdfStripper.getText(doc);
System.out.println(result);
Could it be an encoding issue? Can I tell the PDFStreamParser (or whoever holds the responsibility) to force an encoding on read? Is it even an encoding issue, given that the text extraction works correctly?
Thanks in advance for the help.
Some files use font subsets. Let's say that the subset uses only the characters E, G, L, and O. So GOOGLE would appear in the file as the hex byte values 2, 4, 4, 2, 3, and 1.
Now if you want to change GOOGLE into APPLE you'll have three problems:
1) your subset doesn't contain the characters A and P
2) the size will be different
3) It is quite possible that the string you're searching for is split into several parts.
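To make the first problem concrete, here is a tiny illustration (all subset codes are hypothetical):

import java.util.Map;

// A font subset maps glyph codes, not Unicode characters. With a subset
// containing only E, G, L, O, the string "GOOGLE" is stored as raw codes,
// and the codes for A and P simply do not exist in the subset.
byte[] storedCodes = { 2, 4, 4, 2, 3, 1 }; // G O O G L E
Map<Byte, Character> subsetToChar = Map.of(
        (byte) 1, 'E', (byte) 2, 'G', (byte) 3, 'L', (byte) 4, 'O');
StringBuilder text = new StringBuilder();
for (byte code : storedCodes) {
    text.append(subsetToChar.get(code)); // decode via the subset's mapping
}
System.out.println(text); // GOOGLE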
Btw the current version is 1.8.10. The ReplaceString utility has been removed in the upcoming 2.0 version to avoid giving the illusion that characters can easily be replaced.
This answer is somewhat speculative, because you haven't linked to a PDF.
Inside a PDF, text can be stored in two places:
Content stream
XObject inside the page's Resources
Inside the content stream, text is mostly associated with the TJ or Tj operator. But text associated with Tj or TJ is not always in ASCII format; it may be raw byte values. We can extract text from these byte values by mapping character codes to Unicode values using the proper encoding and mapping. While extracting text we use that mapping and encoding, but there is no reverse mapping to tell which character code a given glyph belongs to. So basically we should replace the character codes of the string to be replaced with the character codes of the new string.
Example:
1. (Text) Tj
2. (12 45 5 3)Tj
Also, we should replace the string in the content stream as well as in any XObject (if present) inside the resources.
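A rough sketch of that idea, using the PDFBox 1.x COSString API already shown in the question (cosString is the token from the question's loop; both code arrays are hypothetical and font-specific):

import java.util.Arrays;
import org.apache.pdfbox.cos.COSString;

// Replace the raw character codes of the old string with the codes of the
// new string; the codes must come from the font's own code-to-glyph mapping.
byte[] oldCodes = { 0x02, 0x04, 0x04, 0x02, 0x03, 0x01 }; // "GOOGLE" in some subset
byte[] newCodes = { 0x05, 0x06, 0x06, 0x03, 0x01 };       // "APPLE" in the same subset
if (Arrays.equals(cosString.getBytes(), oldCodes)) {
    cosString.reset();          // clear the stored bytes
    cosString.append(newCodes); // write the replacement codes
}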
So I think this might be helpful.
Good luck!
I have some files generated by an unknown source that open just fine in PDF viewers (Reader/Foxit), but iText fails to process them. For a particular file I get:
Exception in thread "main" java.lang.IllegalArgumentException: Unexpected colorspace /CS0
at com.itextpdf.text.pdf.parser.InlineImageUtils.getComponentsPerPixel(InlineImageUtils.java:238)
at com.itextpdf.text.pdf.parser.InlineImageUtils.computeBytesPerRow(InlineImageUtils.java:251)
at com.itextpdf.text.pdf.parser.InlineImageUtils.parseUnfilteredSamples(InlineImageUtils.java:280)
at com.itextpdf.text.pdf.parser.InlineImageUtils.parseInlineImageSamples(InlineImageUtils.java:320)
at com.itextpdf.text.pdf.parser.InlineImageUtils.parseInlineImage(InlineImageUtils.java:153)
at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:370)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:79)
Sometimes the /CS0 color space changes to /CS1 through /CS9 (or something similar).
Is it an iText bug (I'm using Java 1.7, iText 5.4.1), or are my PDF files just broken? Even if the PDF files are broken, is there any way I can fix them? (Adobe Reader seems to do that somehow, but unfortunately opening the file and saving it again does not work.)
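For reference, the parsing code is essentially the standard iText 5 content-parsing loop, something along these lines (the file name and render listener here are placeholders):

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

PdfReader reader = new PdfReader("input.pdf");
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
for (int page = 1; page <= reader.getNumberOfPages(); page++) {
    // processContent() walks the page's content stream, which is where
    // the inline-image color space lookup throws
    TextExtractionStrategy strategy =
            parser.processContent(page, new SimpleTextExtractionStrategy());
    System.out.println(strategy.getResultantText());
}
reader.close();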
I'm not familiar with the PDF specification, so I don't know if the PDFs I worked with were valid or not. I did, however, manage to solve the problem by making a change to iText in the file com.itextpdf.text.pdf.parser.InlineImageUtils, method getComponentsPerPixel(...), from:
private static int getComponentsPerPixel(PdfName colorSpaceName, PdfDictionary colorSpaceDic){
    if (colorSpaceName == null)
        return 1;
    if (colorSpaceName.equals(PdfName.DEVICEGRAY))
        return 1;
    if (colorSpaceName.equals(PdfName.DEVICERGB))
        return 3;
    if (colorSpaceName.equals(PdfName.DEVICECMYK))
        return 4;

    if (colorSpaceDic != null){
        PdfArray colorSpace = colorSpaceDic.getAsArray(colorSpaceName);
        if (colorSpace != null){
            if (PdfName.INDEXED.equals(colorSpace.getAsName(0))){
                return 1;
            }
        }
    }

    throw new IllegalArgumentException("Unexpected color space " + colorSpaceName);
}
to
private static int getComponentsPerPixel(PdfName colorSpaceName, PdfDictionary colorSpaceDic){
    if (colorSpaceName == null)
        return 1;
    if (colorSpaceName.equals(PdfName.DEVICEGRAY))
        return 1;
    if (colorSpaceName.equals(PdfName.DEVICERGB))
        return 3;
    if (colorSpaceName.equals(PdfName.DEVICECMYK))
        return 4;

    if (colorSpaceDic != null){
        PdfArray colorSpace = colorSpaceDic.getAsArray(colorSpaceName);
        if (colorSpace != null){
            if (PdfName.INDEXED.equals(colorSpace.getAsName(0))){
                return 1;
            }
        } /* Begin mod */ else {
            // /CS0 etc. may be a name that maps to another name in the
            // color space dictionary; resolve it and try again
            PdfName tempName = colorSpaceDic.getAsName(colorSpaceName);
            if (tempName != null) return getComponentsPerPixel(tempName, colorSpaceDic);
        } /* End mod */
    }

    throw new IllegalArgumentException("Unexpected color space " + colorSpaceName);
}
I have written a method to check my XML strings for &.
I need to modify the method to include the following:
< &lt;
> &gt;
" &quot;
& &amp;
' &apos;
Here is the method
private String xmlEscape(String s) {
    try {
        return s.replaceAll("&(?!amp;)", "&amp;");
    }
    catch (PatternSyntaxException pse) {
        return s;
    }
} // end xmlEscape()
Here is the way I am using it
sb.append(" <Host>" + xmlEscape(url.getHost()) + "</Host>\n");
How can I modify my method to incorporate the rest of the symbols?
EDIT
I think I must not have phrased the question correctly.
In the xmlEscape() method I want to check the string for the following characters:
< > ' " &; if they are found, I want to replace the found character with the corresponding entity.
Example: if there is a & character, it would be replaced with &amp; in the string.
Can you do something as simple as
try {
    // String is immutable, so the result of each replaceAll
    // must be assigned back
    s = s.replaceAll("&(?!amp;)", "&amp;");
    s = s.replaceAll("<", "&lt;");
    s = s.replaceAll(">", "&gt;");
    s = s.replaceAll("'", "&apos;");
    s = s.replaceAll("\"", "&quot;");
    return s;
}
catch (PatternSyntaxException pse) {
    return s;
}
You may want to consider using the Apache Commons Lang StringEscapeUtils.escapeXml method or one of the many other XML escape utilities out there. That gives you correct escaping of XML content without worrying about missing something when you need to escape something other than a host name.
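For example (a minimal sketch; the exact output can vary slightly between Commons Lang versions):

import org.apache.commons.lang.StringEscapeUtils;

// escapeXml handles the five predefined XML entities for you
String escaped = StringEscapeUtils.escapeXml("\"bread\" & 'butter' < jam");
System.out.println(escaped);
// &quot;bread&quot; &amp; &apos;butter&apos; &lt; jam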
Alternatively, have you considered using the StAX (JSR-173) APIs to compose your XML document rather than appending strings together? (An implementation is included in the JDK/JRE.) This will handle all the necessary character escaping for you:
package forum12569441;

import java.io.*;
import javax.xml.stream.*;

public class Demo {

    public static void main(String[] args) throws Exception {
        // WRITE THE XML
        XMLOutputFactory xof = XMLOutputFactory.newFactory();
        StringWriter sw = new StringWriter();
        XMLStreamWriter xsw = xof.createXMLStreamWriter(sw);
        xsw.writeStartDocument();
        xsw.writeStartElement("foo");
        xsw.writeCharacters("<>\"&'");
        xsw.writeEndDocument();
        String xml = sw.toString();
        System.out.println(xml);

        // READ THE XML
        XMLInputFactory xif = XMLInputFactory.newFactory();
        XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));
        xsr.nextTag(); // Advance to "foo" element
        System.out.println(xsr.getElementText());
    }

}
Output
<?xml version="1.0" ?><foo>&lt;&gt;"&amp;'</foo>
<>"&'
I have quite a process that we go through in order to display some e-mail communications in our application. Trying to keep it as general as possible...
- We make a request to a service via XML
- Get the XML reply string, send the string to a method to encode any invalid characters as follows:
public static String convertUTF8(String value) {
    char[] chars = value.toCharArray();
    StringBuffer retVal = new StringBuffer(chars.length);
    for (int i = 0; i < chars.length; i++) {
        char c = chars[i];
        int chVal = (int)c;
        if (chVal > Byte.MAX_VALUE) {
            retVal.append("&#x").append(Integer.toHexString(chVal)).append(";");
        } else {
            retVal.append(c);
        }
    }
    return retVal.toString();
}
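For illustration, here is what this method emits for a character outside the ASCII range:

// 'é' is code point 0xE9, which is greater than Byte.MAX_VALUE (127),
// so it becomes a numeric character reference; ASCII passes through.
System.out.println(convertUTF8("café")); // prints: caf&#xe9;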
We then send the resulting string to another method to remove any other invalid characters:
public static String removeInvalidCharacters(String inString)
{
    if (inString == null){
        return null;
    }

    StringBuffer newString = new StringBuffer();
    char ch;
    char c[] = inString.toCharArray();
    for (int i = 0; i < c.length; i++)
    {
        ch = c[i];
        // remove any characters outside the valid UTF-8 range as well as all control characters
        // except tabs and new lines
        if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')
        {
            newString.append(ch);
        }
    }
    return newString.toString();
}
This string is then unmarshalled via the SAX parser.
The object is then sent back to our Display action, which generates the response for the calling JSP/JavaScript to create the page.
The issue is that some text can contain characters which can't be processed correctly. The following is eventually rendered on the JSP just fine:
<PrvwCommTxt>This is a new test. Have a*Ç´)¡.ñÇ¡.ñ*Ç´)...</PrvwCommTxt>
Which shows up as "This is a new test. Have a*Ç´)¡.ñÇ¡." in the browser.
- The following shows up in a tooltip while hovering over the above text:
<CommDetails>This is a new test. Have a*Ç´)¡.ñÇ¡.ñ*Ç´)¡.ñ*´)(¡.ñÇ(¡.ñÇ* Wonderful Day!</CommDetails>
This then renders incorrectly in the tooltip JavaScript, showing the raw hex entity values instead of the characters.
Any suggestions on how to make the unknown characters show up correctly in JavaScript?
Get the XML reply string, send the string to a method to encode any invalid characters as follows:
You should be using Apache Commons Lang StringEscapeUtils#escapeXml() for this.
// remove any characters outside the valid UTF-8 range
This makes no sense. There's nothing outside the UTF-8 range; the problem lies somewhere else. Get rid of this method.
The issue is that some text can contain characters which can't be processed correctly. The following is eventually rendered on the JSP just fine:
You need to set the response encoding to UTF-8 and instruct the web browser to use UTF-8. This can be done by putting the following line at the top of the JSP:
<%@ page pageEncoding="UTF-8" %>
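If the response is generated by a servlet rather than a JSP, the equivalent (a minimal sketch using the standard Servlet API) is:

// response is the HttpServletResponse; declare the encoding
// before obtaining the response writer
response.setContentType("text/html;charset=UTF-8");
PrintWriter out = response.getWriter();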
See also:
Unicode - How to get characters right?