Convert DOM element encoding from CP1251 to UTF-8

I have simple server-side code that takes the request XML and inserts it as a string into an Oracle CLOB column. The problem is that the client sends the request XML with CP1251-encoded text, but I need to insert it into Oracle as UTF-8.
The code I currently use for CP1251 is:
Element soapinElement = (Element) streams.getSoapin().getValue().getAny(); //retrieve request xml
Node node = (Node) soapinElement;
Document document = node.getOwnerDocument();
DOMImplementationLS domImplLS = (DOMImplementationLS) document.getImplementation();
LSSerializer serializer = domImplLS.createLSSerializer();
LSOutput output = domImplLS.createLSOutput();
output.setEncoding("CP1251");
Writer stringWriter = new StringWriter();
output.setCharacterStream(stringWriter);
serializer.write(document, output);
String soapinString = stringWriter.toString();
This code handles text encoded in CP1251 correctly.
The task is to do the same but produce readable text encoded in UTF-8. Please suggest any ideas.
I tried this, but it produced unreadable characters instead of Cyrillic:
Element soapinElement = (Element) streams.getSoapin().getValue().getAny();
Node node = (Node) soapinElement;
Document document = node.getOwnerDocument();
DOMImplementationLS domImplLS = (DOMImplementationLS) document.getImplementation();
LSSerializer serializer = domImplLS.createLSSerializer();
LSOutput output = domImplLS.createLSOutput();
output.setEncoding("CP1251");
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
output.setByteStream(byteArrayOutputStream);
serializer.write(document, output);
byte[] result = byteArrayOutputStream.toByteArray();
InputStream is = new ByteArrayInputStream(result);
Reader reader = new InputStreamReader(is, "CP1251");
OutputStream out = new ByteArrayOutputStream();
Writer writer = new OutputStreamWriter(out, "UTF-8");
char[] buffer = new char[10];
int read;
while ((read = reader.read(buffer)) != -1) {
writer.write(buffer, 0, read);
}
reader.close();
writer.close();
String soapinString = out.toString();

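You can decode the CP1251 bytes (the result array from your second snippet) like this:
Charset utf8charset = Charset.forName("UTF-8");
Charset cp1251charset = Charset.forName("CP1251");
// decode CP1251
CharBuffer data = cp1251charset.decode(ByteBuffer.wrap(result));
then encode the characters to UTF-8:
// encode UTF-8
ByteBuffer outputBuffer = utf8charset.encode(data);
and copy the ByteBuffer into a byte[]:
// UTF-8 value; copy only the bytes that were written, because array() may expose unused capacity
byte[] outputData = new byte[outputBuffer.remaining()];
outputBuffer.get(outputData);
This should solve your issue, provided the bytes are later turned into a String with new String(outputData, "UTF-8") rather than with the platform default charset, which is what out.toString() uses in your attempt.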
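The decode-then-encode steps above can also be sketched with plain String conversions. This is only a minimal illustration, not the asker's actual code: the class name TranscodeSketch and the method cp1251ToUtf8 are made up for the example, and it assumes the JDK in use ships the windows-1251 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    // Hypothetical helper: decode CP1251 bytes into a String (UTF-16 in memory),
    // then encode that String as UTF-8 bytes.
    static byte[] cp1251ToUtf8(byte[] cp1251Bytes) {
        String text = new String(cp1251Bytes, Charset.forName("windows-1251"));
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Six Cyrillic letters, written as escapes so the source file encoding does not matter
        String sample = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] cp1251 = sample.getBytes(Charset.forName("windows-1251"));
        byte[] utf8 = cp1251ToUtf8(cp1251);
        System.out.println(cp1251.length + " bytes as CP1251, " + utf8.length + " bytes as UTF-8");
        // Always name the charset when turning bytes back into text
        System.out.println(sample.equals(new String(utf8, StandardCharsets.UTF_8)));
    }
}
```

Note that a String obtained from a StringWriter has no byte encoding of its own. The encoding only matters at the point where characters become bytes or bytes become characters, so every such conversion should name its charset explicitly.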

Related

java how to distinguish a file encoding ISO-8859-1 and UTF-8?

I have an Android application that reads a file containing an SQL script to insert data into a SQLite DB.
However, I need to know the exact encoding of this file. I have an EditText that displays information from SQLite, and if the encoding is not right, invalid characters like "?" are shown instead of characters like "ç, í, ã".
I have the following code:
FileInputStream fIn = new FileInputStream(myFile);
BufferedReader myReader = new BufferedReader(new InputStreamReader(fIn, "ISO-8859-1"));
String aDataRow;
while ((aDataRow = myReader.readLine()) != null) {
if(!aDataRow.isEmpty()){
String[] querys = aDataRow.split(";");
Collections.addAll(querysParaExecutar, querys);
}
}
myReader.close();
This works for "ISO-8859-1" encoding, and works for UTF-8 if I set "UTF-8" as the charset. I need to programmatically detect the charset (UTF-8 or ISO-8859-1) and apply the correct one in my code.
Is there a simple way to do that?

I resolved the problem with the universal chardet library.
It works as expected.
FileInputStream fIn = new FileInputStream(myFile);
byte[] buf = new byte[4096];
UniversalDetector detector = new UniversalDetector(null);
int nread;
while ((nread = fIn.read(buf)) > 0 && !detector.isDone()) {
detector.handleData(buf, 0, nread);
}
detector.dataEnd();
fIn.close();
String encoding = detector.getDetectedCharset(); // null if nothing was detected
String charsetName = "ISO-8859-1"; // fallback, also used when WINDOWS-1252 is detected
if ("UTF-8".equalsIgnoreCase(encoding)) {
charsetName = "UTF-8";
}
// the detection pass consumed the stream, so reopen the file before reading it
BufferedReader myReader = new BufferedReader(new InputStreamReader(new FileInputStream(myFile), charsetName));
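If only UTF-8 and ISO-8859-1 are possible, a JDK-only alternative to the detector library is to try a strict UTF-8 decode and fall back to ISO-8859-1 when it fails. This is a different technique from the one in the answer, shown here as a rough sketch. The class name Utf8OrLatin1 and the method guess are made up for the example, and the result is a heuristic: ISO-8859-1 text that happens to form valid UTF-8 sequences, including pure ASCII, is reported as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8OrLatin1 {
    // Hypothetical helper: returns UTF-8 if the bytes are well-formed UTF-8,
    // otherwise assumes ISO-8859-1.
    static Charset guess(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes)); // throws on the first invalid sequence
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    public static void main(String[] args) {
        String sample = "\u00e7\u00ed\u00e3"; // the characters c-cedilla, i-acute, a-tilde
        System.out.println(guess(sample.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(guess(sample.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The returned Charset can be passed directly to the InputStreamReader constructor once the file contents have been read into a byte array.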

Problems copying base64 string from xml file into text file

I have an XML file with a tag that contains a Base64 string, like this:
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
<SOAP-ENV:Header/>
<SOAP-ENV:Body>
<PercezioniEStorni>
<MsgPercezioniStorni>
<ApplicazioneChiamante>PEI</ApplicazioneChiamante>
<DataElaborazione>31/03/2016</DataElaborazione>
<CodiceSede>3700203</CodiceSede>
<Esercizio>2015</Esercizio>
<Trimestre>4</Trimestre>
<PeriodoDal>01/10/2015</PeriodoDal>
<PeriodoAl>31/12/2015</PeriodoAl>
<DataValidazione>31/03/2016</DataValidazione>
<UtenteValidatore>giuseppe</UtenteValidatore>
<CodFiscFirmatario>VGRGPP79E08E919T</CodFiscFirmatario>
<PercezioniEStorniTypeZip>
<!-- Base64 example String -->
VUVzREJCUUFDQUFJQUNOcWYwZ0FBQUFBQUFBQUFBQUFBQUFaQUFBQWNtVm5hWE4wY204dE1Ua3lMVEl3TVRZeE1TMHhMbmh0Yk96ZGE1UGt4blh1KzYrUzRUZDh4 V0YzOWQzQm80aFZCVlFqdTNBWkpnRHlWTDFqeUF5WkVUS3BFR2w3Ny9QcFQ2Sm1lTEZrdTFkM3I4UWdoWC9FRG0xcWhxWmt6eTlYTCtTVEsvUExuMzc4OWkvZi9mQWYvMX rOEIvZi9mbkh2M3puL3MrLy9mbUhuLzc1NHkvL1AvLzByei8vL0pkLy91S0xuLzc0cjkvOTI3Yy92WXUvTy8zV3V4Ly8rcWN2cHIvNDRydVAvM05mL05QSC84bi84OU8vL1BvLzlaLy8rWi92L3ZQcS9=
</PercezioniEStorniTypeZip>
</MsgPercezioniStorni>
</PercezioniEStorni>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope>
The Base64 string encodes a zip file (this is just an example and doesn't really contain a zip file; the real string is far too long).
I created this XML through JAXB-generated classes in this way:
FileInputStream zipFis = new FileInputStream(fileZip);
buffer = new byte[(new Long(fileZip.length())).intValue()];
zipFis.read(buffer);
zipFis.close();
String encoded = Base64.encode(buffer);
PercezioniEStorni percezioniStorni = new PercezioniEStorni();
/** ...set other properties... **/
MsgPercezioniStorni msgPercezioniStorni = new MsgPercezioniStorni();
/** ...set other properties... **/
msgPercezioniStorni.setPercezioniEStorniTypeZip(encoded.getBytes());
percezioniStorni.setMsgPercezioniStorni(msgPercezioniStorni);
JAXBContext jaxbContext = JAXBContext.newInstance(PercezioniEStorni.class,percezioniStorni);
Marshaller marshaller = jaxbContext.createMarshaller();
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
marshaller.marshal(element, document);
MessageFactory mf = MessageFactory.newInstance();
SOAPMessage message = mf.createMessage();
message.getSOAPBody().addDocument(document);
File fileXml = new File(xmlPath);
file.createNewFile();
FileOutputStream fileOutput = new FileOutputStream(fileXml);
message.writeTo(fileOutput);
fileOutput.close();
And then I reversed the process:
File file = new File(xmlPath);
FileInputStream fis = new FileInputStream(file);
BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));
String xml = "";
while(br.ready()){
xml += br.readLine();
}
br.close();
MessageFactory mf = MessageFactory.newInstance();
SOAPMessage message = mf.createMessage();
SOAPPart soapPart = message.getSOAPPart();
InputSource is = new InputSource();
is.setByteStream(new ByteArrayInputStream(xml.getBytes("UTF-8")));
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setNamespaceAware(true);
DocumentBuilder builder = dbFactory.newDocumentBuilder();
Document document = builder.parse(is);
DOMSource domSource = new DOMSource(document);
soapPart.setContent(domSource);
message.saveChanges();
PercezioniEStorni perc = SOAPUtil.unmarshal(PercezioniEStorni.class,message);
byte[] decoded = Base64.decode(new String(perc.getMsgPercezioniStorni().getPercezioniEStorniTypeZip()));
File zipFile = new File(zipPath);
zipFile.createNewFile();
FileOutputStream out = new FileOutputStream(zipFile);
out.write(decoded);
out.close();
Everything works fine. The archive is successfully decoded and I can unzip it.
Later I manually copied the Base64 string from the XML file into another text file.
I read this file from Java in this way:
File txtFile = new File(textFilePath);
FileInputStream fis = new FileInputStream(txtFile);
Reader r = new InputStreamReader(fis,"UTF-8");
StringBuilder sb = new StringBuilder();
int buffer = 0;
while ((buffer= r.read())!=-1) {
sb.append((char)buffer);
}
fis.close();
File zipFile = new File(zipPath);
zipFile.createNewFile();
FileOutputStream out = new FileOutputStream(zipFile);
Base64.decode(sb.toString(),out);
out.close();
This time the zip archive is corrupted, and its size is different too.
Why? Is there any way to read the same Base64 string from another file?

Unless I've overlooked something, you are manually encoding the data read from the zip file:
String encoded = Base64.encode(buffer);
Then you set a property that is already defined to store a byte array as Base64, so that nothing in it can break the XML rules:
msgPercezioniStorni.setPercezioniEStorniTypeZip(encoded.getBytes());
As a result, the bytes of the already encoded string are Base64-encoded a second time when the XML is written.
No wonder the characters copied from the XML cannot be turned into a valid zip by a single decoding step. (Try two.)
Much better: drop the first encoding step and pass the raw bytes to the setter.
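The double encoding described above can be reproduced in isolation. The sketch below uses java.util.Base64 (Java 8+) in place of the Base64 class from the question, and a second explicit encode call stands in for what the marshaller does to a byte[] property. The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class DoubleBase64Sketch {
    // Mimics the question: a manual encode, then a second encode of the resulting text
    // (the second step stands in for marshalling a byte[] property as Base64).
    static byte[] encodeTwice(byte[] raw) {
        String once = Base64.getEncoder().encodeToString(raw);
        return Base64.getEncoder().encode(once.getBytes(StandardCharsets.US_ASCII));
    }

    // Two decoding steps are needed to get back to the raw bytes.
    static byte[] decodeTwice(byte[] twice) {
        byte[] once = Base64.getDecoder().decode(twice);
        return Base64.getDecoder().decode(once);
    }

    public static void main(String[] args) {
        byte[] raw = {0x50, 0x4B, 0x03, 0x04}; // the first four bytes of a typical zip file
        byte[] twice = encodeTwice(raw);
        byte[] decodedOnce = Base64.getDecoder().decode(twice);
        System.out.println("one decode restores raw bytes: " + Arrays.equals(raw, decodedOnce));
        System.out.println("two decodes restore raw bytes: " + Arrays.equals(raw, decodeTwice(twice)));
    }
}
```

A single decode yields the Base64 text of the first encoding, not the zip bytes, which matches the corrupted archive with a different size seen in the question.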

Render XML in the browser using a binary stream

I have an XML file in JCR and I want to render it in the browser via a servlet. Below is the code I am using:
String path = "/content/geometrixx/en/sitemap.xml/jcr:content";
if(session.nodeExists(path)) {
Node node = session.getNode(path);
InputStream inputStream = node.getProperty("jcr:data").getBinary().getStream();
BufferedInputStream bis = new BufferedInputStream(inputStream);
ByteArrayOutputStream buf = new ByteArrayOutputStream();
int result = bis.read();
while (result != -1) {
byte b = (byte) result;
buf.write(b);
result = bis.read();
}
out.print(buf.toString());
}
But this way only plain text is shown in the browser, that is, the values of the tags rather than the whole XML. How can I render the whole XML in the browser?

Transformer not reading Special Character from Document Object

I am trying to read XML data from a Document object and then use a Transformer with XSL to render that data to PDF.
My code is:
Document doc = toXML(arg1,arg2);
doc contains data like:
İlkyönetmeliği
within tags.
InputStream inputStream = new FileInputStream(xslFilePath);
transformer = factory.newTransformer(new StreamSource(inputStream));
transformer.setParameter("encoding", "UTF-8");
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new DOMSource(doc.getDocumentElement()), res);
Special characters present in the XML are not rendered correctly and are displayed like
#lk yard#m.
I have also set the encoding to UTF-8, but it is still displayed as above.

It is not clear what causes your encoding problem because I cannot see how your document is read/constructed and how your transformation result res is set up. Try the following standalone example code which handles encoding with XSLT. Maybe you can even modify it gradually to use your actual data in order to see what goes wrong.
public static void main(String[] args) {
try {
String inputEncoding = "UTF-16";
String xsltEncoding = "ASCII";
String outputEncoding = "UTF-8";
ByteArrayOutputStream bos = new ByteArrayOutputStream();
OutputStreamWriter osw = new OutputStreamWriter(bos, inputEncoding);
osw.write("<?xml version='1.0' encoding='" + inputEncoding + "'?>");
osw.write("<root>İlkyönetmeliği</root>"); osw.close();
byte[] inputBytes = bos.toByteArray();
bos.reset();
osw = new OutputStreamWriter(bos, xsltEncoding);
osw.write("<?xml version='1.0' encoding='" + xsltEncoding + "'?>");
osw.write("<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>");
osw.write("<xsl:template match='@*|node()'><xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>");
osw.write("</xsl:stylesheet>"); osw.close();
byte[] xsltBytes = bos.toByteArray();
bos.reset();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document d = db.parse(new InputSource(new InputStreamReader(new ByteArrayInputStream(inputBytes), inputEncoding)));
// if encoding declaration correct, use: Document d = db.parse(new InputSource(new ByteArrayInputStream(inputBytes)));
System.out.println(XPathFactory.newInstance().newXPath().evaluate("/root[1]", d));
TransformerFactory tf = TransformerFactory.newInstance();
Transformer t = tf.newTransformer(new StreamSource(new InputStreamReader(new ByteArrayInputStream(xsltBytes), xsltEncoding)));
// if encoding declaration correct, use: Transformer t = tf.newTransformer(new StreamSource(new ByteArrayInputStream(xsltBytes)));
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, outputEncoding));
t.setOutputProperty(OutputKeys.ENCODING, outputEncoding);
t.transform(new DOMSource(d.getDocumentElement()), sr);
byte[] outputBytes = bos.toByteArray();
Scanner s = new Scanner(new InputStreamReader(new ByteArrayInputStream(outputBytes), outputEncoding));
String output = s.useDelimiter("</>").next(); // read all
s.close();
System.out.println(output);
} catch (Exception ex) {
ex.printStackTrace(System.err);
}
}
The example code applies the XSLT identity template to a minimal input containing the non-ASCII characters.
I print the string to check whether it has been parsed correctly into the document, using XPath. You may want to check your (intermediate) document the same way if you know how to locate the text with XPath.
Note that when reading an XML file from a byte stream, the parser by default tries to pick up the encoding declared in the XML declaration, if one is present. It assumes that the actual and declared encodings are the same. If they differ or the declaration is missing, you can enforce the actual encoding, e.g. by using an InputStreamReader as in the code above.

Not able to process big file inside a zip file using ZipInputStream

I have the Java class below, which extracts a zip file, converts each entry's content to a string one by one, and prints it to the console.
The problem is that when a file inside the zip is big (~80 KB), the entire content is not displayed: only about 3/4 of the data is converted to a string and printed to the console.
Secondly, the code below inserts null/space characters in between, even when the file is small (~1 KB).
What is wrong with the code below?
public static void main(String[] args) throws Exception {
byte[] buf = new byte[1024];
final int BUFFER = 1024;
String fName = "c:\\DOC00001.zip";
ZipInputStream zinstream = new ZipInputStream(
new FileInputStream(fName));
ZipEntry zentry = zinstream.getNextEntry();
while (zentry != null) {
byte data[] = new byte[BUFFER];
ByteArrayOutputStream out = new ByteArrayOutputStream();
while ((zinstream.read(data, 0, BUFFER)) != -1) {
out.write(data);
}
InputStream is = new ByteArrayInputStream(out.toByteArray());
StringWriter writer = new StringWriter();
IOUtils.copy(is, writer, "UTF-8");
String response = writer.toString();
System.out.println(response);
zentry = zinstream.getNextEntry();
}
zinstream.close();
}

The read method is not guaranteed to read a full buffer; the number of bytes that have been read is returned. The correct way to extract data from a zip file, or any InputStream in general, would be:
byte[] data = new byte[BUFFER];
ByteArrayOutputStream out = new ByteArrayOutputStream();
int bytesRead;
while ((bytesRead = zinstream.read(data, 0, BUFFER)) != -1) {
out.write(data, 0, bytesRead);
}
Or, since you are already using IOUtils,
ByteArrayOutputStream out = new ByteArrayOutputStream();
IOUtils.copy(zinstream, out);
Or, given that you write to a ByteArrayOutputStream only to later write to a String, you can skip the ByteArrayOutputStream entirely:
while (zentry != null) {
StringWriter writer = new StringWriter();
IOUtils.copy(zinstream, writer, "UTF-8");
String response = writer.toString();
System.out.println(response);
zentry = zinstream.getNextEntry();
}
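To see the corrected read loop work end to end without a file on disk, here is a small self-contained sketch. It builds a zip in memory and reads the entry back using the returned byte count. The class name ZipReadSketch and both helper methods are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipReadSketch {
    // Hypothetical helper: builds a one-entry zip archive in memory.
    static byte[] zipOf(String name, byte[] content) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(content);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    // Reads the first entry, writing only the bytes that read() actually returned.
    static String readFirstEntry(byte[] zipBytes) throws IOException {
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            if (zin.getNextEntry() == null) {
                return null; // empty archive
            }
            byte[] data = new byte[1024];
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int bytesRead;
            while ((bytesRead = zin.read(data, 0, data.length)) != -1) {
                out.write(data, 0, bytesRead);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        String original = sb.toString();
        byte[] zip = zipOf("sample.txt", original.getBytes(StandardCharsets.UTF_8));
        System.out.println(original.equals(readFirstEntry(zip)));
    }
}
```

Writing the whole buffer on every iteration, as in the question, appends stale or zero bytes whenever read() returns fewer bytes than the buffer holds, which explains the stray null characters.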
