Text encoding in FTP download app causing errors - Java

I have written a program to download files from an FTP endpoint. I was assured that the files would be UTF-8 encoded, but after downloading and parsing the XML we get garbled text. The process is to download the file, convert the XML to JSON, then parse it and convert it to a different format. After the conversion to JSON we see, for example, the following, where question marks appear instead of Chinese/Hindi/Arabic characters:
"Size": 3227,
"Title": "??? ???? ????? ?? ???? ?? 5 ??? ?? ??? ?? ?? ???? ?? ????????? ?? ???? ???? ??????-Pakistan new army chief
The code snippet is the following:
ftp.connect("xx.xxx.xxx.xx");
ftp.login("xxxx", "xxxxx");
ftp.enterLocalPassiveMode();
ftp.setControlEncoding("UTF-8");
ftp.setFileType(FTP.BINARY_FILE_TYPE);
...
String remoteFile1 = ftp.printWorkingDirectory() + "/" + file.getName();
File downloadFile1 = new File(destFolder + "/" + file.getName());
OutputStream outputStream1 = new BufferedOutputStream(new FileOutputStream(downloadFile1));
boolean success = ftp.retrieveFile(remoteFile1, outputStream1);
outputStream1.flush();
outputStream1.close();
....
DocumentBuilderFactory docFactory =
DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.newDocument();
doc = docBuilder.parse(xmlFile);
doc.getDocumentElement().normalize();
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
StringWriter sw = new StringWriter();
trans.transform(new DOMSource(doc), new StreamResult(sw));
String xml = sw.toString();
JSONObject xmlJSONObj = XML.toJSONObject(xml);
String jsonPrettyPrintString = xmlJSONObj.toString(4);
jsonMapper.configure(SerializationFeature.WRAP_ROOT_VALUE, false);...
Can someone advise how to make sure the encoding is preserved so that non-Latin characters come out correctly?
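The snippet above does not show where the JSON string is finally written or printed, so the cause cannot be pinned down from the question alone. One common source of literal question marks in Java is a String being encoded with a charset that cannot represent its characters (for example a platform default charset or a console encoding). A minimal sketch of that effect, using only the JDK; the class and method names are hypothetical and not part of the question's code:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class CharsetRoundTripSketch {

    // Devanagari sample text, written with Unicode escapes so that the
    // encoding of this source file does not matter.
    static final String SAMPLE = "\u0928\u092e\u0938\u094d\u0924\u0947";

    // Encodes the text to bytes with the given charset and decodes it again.
    // String.getBytes(Charset) replaces every character the charset cannot
    // represent with the charset's replacement byte ('?' for US-ASCII).
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }
}
```

Here `roundTrip(SAMPLE, StandardCharsets.UTF_8)` returns the text unchanged, while `roundTrip(SAMPLE, StandardCharsets.US_ASCII)` returns six question marks. If this is what is happening, the place to check is wherever the JSON is turned into bytes: passing an explicit charset there, e.g. `new OutputStreamWriter(out, StandardCharsets.UTF_8)`, avoids relying on the platform default.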

Related

Remove SOAP envelope

I have an InputStream containing a SOAP message, including the envelope. I don't know the contents of the body beforehand and therefore cannot create a Jaxb annotated class for it.
I've tried many approaches, including a custom SOAPWrapper JAXB class with XmlAnyElement. Currently I have this:
private InputStream removeSoapEnvelope(final InputStream inputStream) throws IOException, TransformerException
{
final SoapBody body = messageFactory.createWebServiceMessage(inputStream)
.getSoapBody();
final Transformer transformer = TransformerFactory.newInstance()
.newTransformer();
final DOMResult domResult = new DOMResult();
transformer.transform(body.getPayloadSource(), domResult);
final StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(domResult.getNode()), new StreamResult(writer));
byte[] barray = writer.toString()
.getBytes(StandardCharsets.UTF_8);
return new ByteArrayInputStream(barray);
}
It seems to work but is horribly inefficient. Is there no short and concise way of achieving this with standard libraries and without regex?
Thanks
Here's a solution that uses XPath to get the element. It relies on the standard DOM and XPath APIs rather than JAXB, and it treats the message as a regular XML document, so it should work for any payload:
FileInputStream fileIS;
fileIS = new FileInputStream(System.getProperty("user.home") + "/tmp/soap.xml");
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument;
xmlDocument = builder.parse(fileIS);
XPath xPath = XPathFactory.newInstance().newXPath();
String expression01 = "//*[local-name()='Body']";
Node currentNode = (Node) xPath.compile(expression01).evaluate(xmlDocument, XPathConstants.NODE);
StringWriter buf = new StringWriter();
Transformer xform = TransformerFactory.newInstance().newTransformer();
xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
xform.setOutputProperty(OutputKeys.INDENT, "yes");
xform.transform(new DOMSource(currentNode), new StreamResult(buf));
System.out.println(buf.toString());
Result:
<soap:Body>
<incident xmlns="http://example.com">
<Company type="String">Test</Company>
</incident>
</soap:Body>
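The result above still includes the Body wrapper. A minimal sketch of a variant that selects the first element inside Body and serializes it straight to bytes, assuming the payload is a single element; the class and method names are hypothetical. If the payload relies on namespace prefixes declared only on the envelope, whether those declarations are carried over depends on the serializer, so that case would need checking:

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper for illustration only.
final class SoapPayloadSketch {

    // Returns the serialized first child element of the SOAP Body.
    static byte[] extractPayload(InputStream soap) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(soap);

            // First element child of any element whose local name is Body.
            Node payload = (Node) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[local-name()='Body']/*[1]", doc, XPathConstants.NODE);
            if (payload == null) {
                throw new IllegalArgumentException("no element found inside a Body element");
            }

            // Identity transform of just that subtree, written as bytes.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(payload), new StreamResult(out));
            return out.toByteArray();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not extract SOAP payload", e);
        }
    }
}
```

Wrapping the returned array in a ByteArrayInputStream gives the InputStream the question asks for, without going through an intermediate String.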
I ended up doing it with regex. All other options are too slow:
private InputStream removeSoapEnvelope(final InputStream inputStream) throws IOException
{
final String text = new String(inputStream.readAllBytes(), UTF_8);
final String replace = text.replaceAll("\\s*<\\/?(?:SOAP-ENV|soap):(?:.|\\s)*?>", "");
File file = File.createTempFile("temp", XML_NS_PREFIX);
Files.writeString(file.toPath(), replace);
return new FileInputStream(file);
}

How can I limit the pages output with WordToHtmlConverter and HWPFDocument?

I'm converting a Word/.doc file to HTML and I'd like to be able to get a subset of pages. Is it possible to limit the range of output? I'm open to creating a new HWPFDocument from the original with only the subset of pages or after converting limit the length there.
File localFile = ...
FileInputStream fis = new FileInputStream(localFile);
HWPFDocument wordDoc = new HWPFDocument(fis);
Document newDoc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(newDoc);
wordToHtmlConverter.processDocument(wordDoc);
StringWriter stringWriter = new StringWriter();
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
transformer.setOutputProperty(OutputKeys.METHOD, "html");
transformer.transform(
new DOMSource(wordToHtmlConverter.getDocument()),
new StreamResult(stringWriter));
String htmlString = stringWriter.toString();
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
new FileOutputStream(htmlFile), "UTF-8"));
out.write(htmlString);
out.close();
Not with POI. There is no notion of a page in the HWPF format. Pages are an artifact of the consumer: there are no pages until the consumer renders them, and each client can render pages slightly differently, even between different versions of Word.

Creating a space at the end of an XML tag in Java

My XML tag is:
<Description/>
I want it with a space:
<Description />
How can I do this in Java?
I am signing an XML document. The original file uses the space, but when I run the following code and print the result, the tag is printed without the space.
String thisLine = "";
String xmlString = "";
BufferedReader br = new BufferedReader(new FileReader(originalXmlFilePath));
while ((thisLine = br.readLine()) != null) {
xmlString = xmlString + thisLine.trim();
}
br.close();
ByteArrayInputStream xmlStream = new ByteArrayInputStream(xmlString.getBytes());
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware(true);
dbf.setIgnoringElementContentWhitespace(true);
dbf.setValidating(false);
Document doc = dbf.newDocumentBuilder().parse
(xmlStream );
doc.setXmlStandalone(true);
DOMSignContext dsc = new DOMSignContext
(keyEntry.getPrivateKey(), doc.getDocumentElement());
javax.xml.crypto.dsig.XMLSignature signature = fac.newXMLSignature(si, ki);
signature.sign(dsc);
// Output the resulting document.
// OutputStream os = new FileOutputStream(new File(destnSignedXmlFilePath));
TransformerFactory tf = TransformerFactory.newInstance();
Transformer trans = tf.newTransformer();
trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
trans.setOutputProperty(OutputKeys.INDENT, "yes");
StringWriter writer = new StringWriter();
trans.transform(new DOMSource(doc), new StreamResult(writer));
String output = writer.getBuffer().toString();//.replaceAll("\n|\r", "");
System.out.println("output== "+output);
The problem is that you are signing arbitrary, unprocessed text instead of submitting a canonical version of your document (without spaces in tags, but also with sorted attributes, with quotes of the same type, etc.) to the digital signature computation.
The Canonical XML and Exclusive Canonical XML W3C recommendations specify a standard and comprehensive way to eliminate arbitrary differences.
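To see why the space cannot survive, it may help to check that both spellings parse to the same DOM; the serializer only ever sees the parsed tree, and Canonical XML writes an empty element as a start-end tag pair (<Description></Description>) in either case. A minimal sketch using the JDK's DOM API; the class and method names are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class EmptyTagSketch {

    // Parses both XML strings and reports whether their root elements are
    // equal as DOM nodes (same names, attributes and children).
    static boolean sameDom(String xmlA, String xmlB) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document a = db.parse(new InputSource(new StringReader(xmlA)));
            Document b = db.parse(new InputSource(new StringReader(xmlB)));
            return a.getDocumentElement().isEqualNode(b.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

`sameDom("<Description/>", "<Description />")` returns true: the whitespace inside the tag is markup, not content, so it is discarded at parse time and cannot influence a signature computed over the canonical form.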

Problems copying base64 string from xml file into text file

I have an XML file with a tag that contains a Base64 string, like this:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header/>
<SOAP-ENV:Body>
<PercezioniEStorni>
<MsgPercezioniStorni>
<ApplicazioneChiamante>PEI</ApplicazioneChiamante>
<DataElaborazione>31/03/2016</DataElaborazione>
<CodiceSede>3700203</CodiceSede>
<Esercizio>2015</Esercizio>
<Trimestre>4</Trimestre>
<PeriodoDal>01/10/2015</PeriodoDal>
<PeriodoAl>31/12/2015</PeriodoAl>
<DataValidazione>31/03/2016</DataValidazione>
<UtenteValidatore>giuseppe</UtenteValidatore>
<CodFiscFirmatario>VGRGPP79E08E919T</CodFiscFirmatario>
<PercezioniEStorniTypeZip>
<!-- Base64 example String -->
VUVzREJCUUFDQUFJQUNOcWYwZ0FBQUFBQUFBQUFBQUFBQUFaQUFBQWNtVm5hWE4wY204dE1Ua3lMVEl3TVRZeE1TMHhMbmh0Yk96ZGE1UGt4blh1KzYrUzRUZDh4 V0YzOWQzQm80aFZCVlFqdTNBWkpnRHlWTDFqeUF5WkVUS3BFR2w3Ny9QcFQ2Sm1lTEZrdTFkM3I4UWdoWC9FRG0xcWhxWmt6eTlYTCtTVEsvUExuMzc4OWkvZi9mQWYvMX rOEIvZi9mbkh2M3puL3MrLy9mbUhuLzc1NHkvL1AvLzByei8vL0pkLy91S0xuLzc0cjkvOTI3Yy92WXUvTy8zV3V4Ly8rcWN2cHIvNDRydVAvM05mL05QSC84bi84OU8vL1BvLzlaLy8rWi92L3ZQcS9=
</PercezioniEStorniTypeZip>
</MsgPercezioniStorni>
</PercezioniEStorni>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
The Base64 string encodes a zip file (this is just an example and doesn't really contain a zip file; the real string is far too long).
I created this XML through JAXB-generated classes in this way:
FileInputStream zipFis = new FileInputStream(fileZip);
buffer = new byte[(new Long(fileZip.length())).intValue()];
zipFis.read(buffer);
zipFis.close();
String encoded = Base64.encode(buffer);
PercezioniEStorni percezioniStorni = new PercezioniEStorni();
/** ...set other properties... **/
MsgPercezioniStorni msgPercezioniStorni = new MsgPercezioniStorni();
/** ...set other properties... **/
msgPercezioniStorni.setPercezioniEStorniTypeZip(encoded.getBytes());
percezioniStorni.setMsgPercezioniStorni(msgPercezioniStorni);
JAXBContext jaxbContext = JAXBContext.newInstance(PercezioniEStorni.class,percezioniStorni);
Marshaller marshaller = jaxbContext.createMarshaller();
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
marshaller.marshal(element, document);
MessageFactory mf = MessageFactory.newInstance();
SOAPMessage message = mf.createMessage();
message.getSOAPBody().addDocument(document);
File fileXml = new File(xmlPath);
file.createNewFile();
FileOutputStream fileOutput = new FileOutputStream(fileXml);
message.writeTo(fileOutput);
fileOutput.close();
And then I reversed the process:
File file = new File(xmlPath);
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));
String xml = "";
while(br.ready()){
xml += br.readLine();
}
br.close();
MessageFactory mf = MessageFactory.newInstance();
SOAPMessage message = mf.createMessage();
SOAPPart soapPart = message.getSOAPPart();
InputSource is = new InputSource();
is.setByteStream(new ByteArrayInputStream(xml.getBytes("UTF-8")));
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setNamespaceAware(true);
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.parse(is);
DOMSource domSource = new DOMSource(document);
soapPart.setContent(domSource);
message.saveChanges();
PercezioniEStorni perc = SOAPUtil.unmarshal(PercezioniEStorni.class,message);
byte[] decoded = Base64.decode(new String(perc.getMsgPercezioniStorni().getPercezioniEStorniTypeZip()));
File zipFile = new File(zipPath);
zipFile.createNewFile();
FileOutputStream out = new FileOutputStream(zipFile);
out.write(decoded);
out.close();
Everything works fine. The archive is decoded successfully and I can unzip it.
Later I manually copied the Base64 string from the XML file into another text file.
I read this file from Java in this way:
File txtFile = new File(textFilePath);
FileInputStream fis = new FileInputStream(txtFile);
Reader r = new InputStreamReader(fis,"UTF-8");
StringBuilder sb = new StringBuilder();
int buffer = 0;
while ((buffer= r.read())!=-1) {
sb.append((char)buffer);
}
fis.close();
File zipFile = new File(zipPath);
zipFile.createNewFile();
FileOutputStream out = new FileOutputStream(zipFile);
Base64.decode(sb.toString(),out);
out.close();
This time the zip archive is corrupted. Also the size is different.
Why? Is there any way to read the same Base64 string from another file?
Unless I've overlooked something: you are manually encoding some data read from a zip file:
String encoded = Base64.encode(buffer);
Then you set a property which is defined to store a byte array in Base64 to avoid anything breaking the XML rules:
msgPercezioniStorni.setPercezioniEStorniTypeZip(encoded.getBytes());
Now the bytes of the string encoded, which is already Base64, are Base64-encoded a second time when the property is marshalled.
No wonder the characters from the XML cannot be made into a valid zip by a single decoding step. (Try two.)
Much better: Drop the first encoding step.
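The double encoding can be reproduced in isolation. The sketch below uses java.util.Base64 from the JDK, which is an assumption: the Base64 class in the question comes from a different library. The class and method names are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper for illustration only.
final class DoubleBase64Sketch {

    // One Base64 pass over raw bytes.
    static String encodeOnce(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Mimics the question: the Base64 text is itself encoded again.
    static String encodeTwice(byte[] raw) {
        return encodeOnce(encodeOnce(raw).getBytes(StandardCharsets.US_ASCII));
    }

    // The MIME decoder ignores line breaks and other characters outside the
    // Base64 alphabet, which helps with text copied by hand.
    static byte[] decodeOnce(String text) {
        return Base64.getMimeDecoder().decode(text);
    }

    // Undoes both passes and returns the original raw bytes.
    static byte[] decodeTwice(String text) {
        return decodeOnce(new String(decodeOnce(text), StandardCharsets.US_ASCII));
    }
}
```

A single `decodeOnce` on doubly encoded text yields the inner Base64 text as bytes rather than the zip content, which matches the corrupted archive of a different size. With the first manual encoding step removed, one decode is enough.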

Transformer not reading Special Character from Document Object

I am trying to read XML data from a Document object and then use a Transformer with an XSL stylesheet to render that data to PDF.
My code is:
Document doc = toXML(arg1,arg2);
doc contains data like:
İlkyönetmeliği
within tags.
InputStream inputStream = new FileInputStream(xslFilePath);
transformer = factory.newTransformer(new StreamSource(inputStream));
transformer.setParameter("encoding", "UTF-8");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(doc.getDocumentElement()), res);
Special characters present in the XML are not rendered correctly and are displayed like
#lk yard#m.
I have also set the encoding to UTF-8, but it is still displayed as above.
It is not clear what causes your encoding problem because I cannot see how your document is read/constructed and how your transformation result res is set up. Try the following standalone example code which handles encoding with XSLT. Maybe you can even modify it gradually to use your actual data in order to see what goes wrong.
public static void main(String[] args) {
try {
String inputEncoding = "UTF-16";
String xsltEncoding = "ASCII";
String outputEncoding = "UTF-8";
ByteArrayOutputStream bos = new ByteArrayOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(bos, inputEncoding);
osw.write("<?xml version='1.0' encoding='" + inputEncoding + "'?>");
osw.write("<root>İlkyönetmeliği</root>"); osw.close();
byte[] inputBytes = bos.toByteArray();
bos.reset();
osw = new OutputStreamWriter(bos, xsltEncoding);
osw.write("<?xml version='1.0' encoding='" + xsltEncoding + "'?>");
osw.write("<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>");
osw.write("<xsl:template match='@*|node()'><xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>");
osw.write("</xsl:stylesheet>"); osw.close();
byte[] xsltBytes = bos.toByteArray();
bos.reset();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new InputSource(new InputStreamReader(new ByteArrayInputStream(inputBytes), inputEncoding)));
// if encoding declaration correct, use: Document d = db.parse(new InputSource(new ByteArrayInputStream(inputBytes)));
System.out.println(XPathFactory.newInstance().newXPath().evaluate("/root[1]", d));
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer(new StreamSource(new InputStreamReader(new ByteArrayInputStream(xsltBytes), xsltEncoding)));
// if encoding declaration correct, use: Transformer t = tf.newTransformer(new StreamSource(new ByteArrayInputStream(xsltBytes)));
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, outputEncoding));
t.setOutputProperty(OutputKeys.ENCODING, outputEncoding);
t.transform(new DOMSource(d.getDocumentElement()), sr);
byte[] outputBytes = bos.toByteArray();
Scanner s = new Scanner(new InputStreamReader(new ByteArrayInputStream(outputBytes), outputEncoding));
String output = s.useDelimiter("</>").next(); // read all
s.close();
System.out.println(output);
} catch (Exception ex) {
ex.printStackTrace(System.err);
}
}
The example code applies the XSLT identity template to a minimal input containing the non-ASCII characters.
I output the string to check if it has been parsed correctly in the document using XPath. You may want to check your (intermediate) document if you know how to locate it with XPath.
Note that, by default, the parser tries to pick up the encoding declared in the XML declaration (the <?xml ... ?> line, loosely called a processing instruction or PI here) when reading an XML file, if one is present. It assumes that the actual and the declared encoding are the same. If they differ or the declaration is missing, you can enforce the actual encoding, e.g. by using an InputStreamReader as in the code above.
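The two ways of feeding the parser can be put side by side in a smaller sketch. It uses only JDK classes; the class and method names are hypothetical:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class XmlEncodingSketch {

    // Hands the parser raw bytes, so it determines the encoding itself from
    // the byte order mark and the XML declaration (UTF-8 if neither exists).
    static String rootTextAutoDetect(byte[] xmlBytes) {
        return rootText(new InputSource(new ByteArrayInputStream(xmlBytes)));
    }

    // Hands the parser already decoded characters, so the caller decides the
    // charset; useful when the declaration is wrong or missing.
    static String rootTextForced(byte[] xmlBytes, Charset actual) {
        return rootText(new InputSource(
                new InputStreamReader(new ByteArrayInputStream(xmlBytes), actual)));
    }

    // Parses the source and returns the text content of the root element.
    private static String rootText(InputSource source) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document d = db.parse(source);
            return d.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }
}
```

For input whose declaration is correct, `rootTextAutoDetect` is the simpler choice. `rootTextForced` is for the case described above, for example ISO-8859-1 bytes that carry no declaration at all.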
