XML pretty print add unnecessary whitespace element content containing CDATA

I have a piece of Java code which pretty-prints XML. When I use an LSSerializer to pretty-print the output, it is formatted nicely and indented, but elements which contain CDATA behave strangely. The XML
<root><outer><inner><text><![CDATA[Content of the CDATA block]]></text></inner></outer></root>
gets transformed into the following xml
<?xml version="1.0" encoding="UTF-8"?><root>
<outer>
<inner>
<text>
<![CDATA[Content of the CDATA block]]>
</text>
</inner>
</outer>
</root>
and has the CDATA section on a separate line. This causes problems when the content is extracted later on with XPath expressions.
The code
@Test
public void testOutputXML() throws Exception {
final Document document = loadXMLFromString( "<root><outer><inner><text><![CDATA[Content of the CDATA block]]></text></inner></outer></root>" );
final String formattedXml = toXmlPrettyLS( document );
final Document formattedDocument = loadXMLFromString( formattedXml );
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("//text/text()");
final String evaluate = expr.evaluate( formattedDocument );
assertThat( evaluate ).isEqualTo( "Content of the CDATA block" );
}
private String toXmlPrettyLS( final Document document ) throws Exception {
final ByteArrayOutputStream bos = new ByteArrayOutputStream();
final DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
final DOMImplementationLS loadSave = ( DOMImplementationLS ) registry.getDOMImplementation( "LS" );
final LSOutput output = loadSave.createLSOutput();
output.setByteStream( bos );
final LSSerializer serializer = loadSave.createLSSerializer();
final DOMConfiguration config = serializer.getDomConfig();
config.setParameter( "format-pretty-print", true );
serializer.write( document, output );
return String.valueOf( bos );
}
private Document loadXMLFromString( final String xml ) throws Exception {
final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware( true );
final DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse( new ByteArrayInputStream( xml.getBytes() ) );
}
is used to transform the XML and extract the content; the environment is Java 11.
How can I adjust the formatting to get
<text><![CDATA[Content of the CDATA block]]></text>
instead?
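
One possible workaround, sketched here rather than a confirmed fix for the serializer itself, is to leave the pretty-printed output alone and make the XPath query tolerant of the added whitespace. The sketch below assumes it is acceptable to wrap the path in normalize-space(), which takes the string value of the element (whitespace text nodes plus the CDATA content) and trims it. The class and method names are hypothetical. Note that normalize-space() also collapses runs of whitespace inside the content, so it is not suitable when inner whitespace matters.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question.
class CdataTextExtractor {

    // Returns the whitespace-normalized string value of the first node
    // matched by elementPath (for example "//text").
    static String textOf(String xml, String elementPath) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // normalize-space() strips the leading/trailing whitespace that the
        // pretty printer placed around the CDATA section.
        return xpath.evaluate("normalize-space(" + elementPath + ")", doc);
    }
}
```

With this, the assertion in the test could compare against textOf(formattedXml, "//text") instead of evaluating //text/text() directly.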

Related

Exception while parsing string to document object

I am trying to build a Document object from a string and append it to an element, but I get the exception java.io.FileNotFoundException: project folder path\org.xml.sax.InputSource on this line: Document constantDocument = docBuilder.parse(
String.valueOf(new InputSource( new StringReader( xmlAsString ) )));.
My code looks like this:
Element infoElement = document.createElement("information");
String xmlAsString = "..."; //xml in string format
Document constantDocument = docBuilder.parse(
String.valueOf(new InputSource( new StringReader( xmlAsString ) ))); //java.io.FileNotFoundException
infoElement.appendChild(constantDocument);
What am I missing?

Reason is given here in the Documentation :
public Document parse(String uri)
throws SAXException,
IOException
Parse the content of the given URI as an XML document and return a new
DOM Document object. An IllegalArgumentException is thrown if the URI
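is null.
You are passing a String, so Java treats it as a URI and tries to fetch the file at that location, hence the exception. String.valueOf(...) turns the InputSource into its toString() text, which is why the path in the exception ends in org.xml.sax.InputSource.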
Based on your attempt, the closest you could use is :
parse(InputSource is)
Parse the content of the given input source as
an XML document and return a new DOM Document object.
So changing the .parse to below should solve your problem :
Document constantDocument = docBuilder.parse(new InputSource(new StringReader(xmlAsString)));

Found what I was looking for:
DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
InputSource is = new InputSource();
is.setCharacterStream(new StringReader("<root><node1></node1></root>"));
Document doc = db.parse(is);
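
A follow-up point the question's code would hit next: a Document node cannot be appended to an element, and nodes from one document cannot be appended to another without being imported first. The following is a minimal sketch of one way to do that, assuming the goal is to attach the parsed root element under the "information" element. The class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answers.
class AppendParsedFragment {

    // Parses xml from a String and appends its root element to parent,
    // which must belong to the target document.
    static Element appendParsed(Document target, Element parent, String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse(InputSource) reads the characters; parse(String) would treat them as a URI.
        Document parsed = builder.parse(new InputSource(new StringReader(xml)));
        // importNode creates a deep copy owned by the target document.
        Node imported = target.importNode(parsed.getDocumentElement(), true);
        parent.appendChild(imported);
        return parent;
    }
}
```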

Parsing HTML content from XML file

<xbrli:xbrl xmlns:aoi="http://www.aointl.com/20160331" xmlns:country="http://xbrl.sec.gov/country/2016-01-31" xmlns:currency="http://xbrl.sec.gov/currency/2016-01-31" xmlns:dei="http://xbrl.sec.gov/dei/2014-01-31" xmlns:exch="http://xbrl.sec.gov/exch/2016-01-31" xmlns:invest="http://xbrl.sec.gov/invest/2013-01-31" xmlns:iso4217="http://www.xbrl.org/2003/iso4217" xmlns:link="http://www.xbrl.org/2003/linkbase" xmlns:naics="http://xbrl.sec.gov/naics/2011-01-31" xmlns:nonnum="http://www.xbrl.org/dtr/type/non-numeric" xmlns:num="http://www.xbrl.org/dtr/type/numeric" xmlns:ref="http://www.xbrl.org/2006/ref" xmlns:sic="http://xbrl.sec.gov/sic/2011-01-31" xmlns:stpr="http://xbrl.sec.gov/stpr/2011-01-31" xmlns:us-gaap="http://fasb.org/us-gaap/2016-01-31" xmlns:us-roles="http://fasb.org/us-roles/2016-01-31" xmlns:us-types="http://fasb.org/us-types/2016-01-31" xmlns:utreg="http://www.xbrl.org/2009/utr" xmlns:xbrldi="http://xbrl.org/2006/xbrldi" xmlns:xbrldt="http://xbrl.org/2005/xbrldt" xmlns:xbrli="http://www.xbrl.org/2003/instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<link:schemaRef xlink:href="aoi-20160331.xsd" xlink:type="simple"/>
<xbrli:context id="FD2016Q4YTD">
<xbrli:entity>
<xbrli:identifier scheme="http://www.sec.gov/CIK">0000939930</xbrli:identifier>
</xbrli:entity>
<xbrli:period>
<xbrli:startDate>2015-04-01</xbrli:startDate>
<xbrli:endDate>2016-03-31</xbrli:endDate>
</xbrli:period>
</xbrli:context>
<aoi:OtherIncomeAndExpensePolicyTextBlock contextRef="FD2016Q4YTD" id="Fact-F51C7616E17E5B8B0B770D410BBF5A3E">
<div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>
</aoi:OtherIncomeAndExpensePolicyTextBlock>
</xbrli:xbrl>
This is my XML (XBRL), which I need to parse. It is my input; I don't know whether it is valid, but I need output like this:
<div style="font-family:Times New Roman;font-size:10pt;"><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;font-weight:bold;">Other Income (Expense)</font></div><div style="line-height:120%;text-align:justify;font-size:10pt;"><font style="font-family:inherit;font-size:10pt;"></font></div></div>
Could someone please help with this problem? I have been stuck on it for the last two weeks.
This is the code I am using:
File fXmlFile = new File("/home/devteam-user1/Desktop/ky/UnitTesting.xml");
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
Document doc = dBuilder.parse(fXmlFile);
XPath xPath = XPathFactory.newInstance().newXPath();
final String DIV_UNDER_ROOT = "/*/aoi";
NodeList divList = (NodeList)xPath.compile(DIV_UNDER_ROOT)
.evaluate(doc, XPathConstants.NODESET);
System.out.println(divList.getLength());
for (int i = 0; i < divList.getLength() ; i++) { // just in case there is more than one
Node divNode = divList.item(i);
System.out.println(nodeToString(divNode));
}
//nodeToString method below
private static String nodeToString(Node node) throws Exception
{
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
StreamResult result = new StreamResult(new StringWriter());
transformer.transform(new DOMSource(node), result);
return result.getWriter().toString();
}

this works well for me
public static void main(String[] args) throws IOException {
FileInputStream fis = new FileInputStream("yourfile.xml");
Document doc = Jsoup.parse(Utils.streamToString(fis));
System.out.println(doc.select("aoi|OtherIncomeAndExpensePolicyTextBlock").html().toString());
}

Your main issue lies with
final String DIV_UNDER_ROOT = "/*/aoi";
Which is an XPath expression that matches "any node 2 levels under the root, which has a local name of aoi and no namespace". This is not what you want.
You want to match any contents of a node that is two levels deep, whose namespace is aliased by "aoi" (which means it belongs to the "http://www.aointl.com/20160331" namespace), and whose local name is "OtherIncomeAndExpensePolicyTextBlock".
Matching namespaces in XPath in Java is quite cumbersome (see XPath with namespace in Java and How to query XML using namespaces in Java with XPath?), but long story short, you could try this instead:
final String DIV_UNDER_ROOT = "//*[local-name()='OtherIncomeAndExpensePolicyTextBlock' and namespace-uri()='http://www.aointl.com/20160331']/*";
This will only work if your DocumentBuilderFactory is namespace aware, so make sure to configure it like this before creating the DocumentBuilder:
DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
dbFactory.setNamespaceAware(true);
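
As an alternative to local-name() and namespace-uri(), a prefix can be registered on the XPath object through a NamespaceContext. The following is a minimal sketch under that assumption. The class name and the choice of binding the prefix "aoi" are illustrative, and the prefix used in the expression only has to map to the same namespace URI as in the document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answers.
class NamespaceXPathSketch {
    static final String AOI_NS = "http://www.aointl.com/20160331";

    // Returns the child elements of every aoi:OtherIncomeAndExpensePolicyTextBlock.
    static NodeList selectChildren(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace matching
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "aoi".equals(prefix) ? AOI_NS : XMLConstants.NULL_NS_URI;
            }
            public String getPrefix(String uri) {
                return AOI_NS.equals(uri) ? "aoi" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String prefix = getPrefix(uri);
                return prefix == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singletonList(prefix).iterator();
            }
        });
        return (NodeList) xpath.evaluate(
                "//aoi:OtherIncomeAndExpensePolicyTextBlock/*", doc, XPathConstants.NODESET);
    }
}
```

Each returned node could then be passed to the nodeToString method from the question.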

How to get XML values from an unzipped file

I need to get values such as "Symbol" etc. from an XML file and put them into a list.
For now my code looks like this:
Scanner sc = null;
byte[] buff = new byte[1 << 13];
List<String> question2 = new ArrayList<String>();
question2 = <MetodToGetFile>(sc,fileListQ);
for ( String strLista : question2){
ByteArrayInputStream in = new ByteArrayInputStream(strLista.getBytes());
try(InputStream reader = Base64.getMimeDecoder().wrap(in)){
try (GZIPInputStream gis = new GZIPInputStream(reader)) {
try (ByteArrayOutputStream out = new ByteArrayOutputStream()){
int readGis = 0;
while ((readGis = gis.read(buff)) > 0)
out.write(buff, 0, readGis);
byte[] buffer = out.toByteArray();
String s2 = new String(buffer);
}
}
}
}
}
I want to know how I can continue from here and take the values "xxx" and "zzzz" and put them into another list, because I need to compare some values.
The XML looks like this:
<?xml version="1.0" encoding="utf-8"?>
<Name Name="some value">
<Group Names="some value">
<Package Guid="{7777-7777-7777-7777-7777}">
<Attribute Typ="" Name="Symbol">xxx</Attribute>
<Attribute Type="" Name="Surname">xxx</Attribute>
<Attribute Type="Address" Name="Name">zzzz</Attribute>
<Attribute Type="Address" Name="Country">zzzz</Attribute>
</Package>
EDIT: I hope my solution will be useful to someone :)
try{
//Get is(inputSource with xml in s2(xml string value from stream)
InputSource is = new InputSource(new StringReader(s2));
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(is);
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
//Get "some value" from the attribute Name
String name= (String) xpath.evaluate("/Name/@Name", doc, XPathConstants.STRING);
//Get the guid from the attribute Guid
String guid= (String) xpath.evaluate("/Name/Group/Package/@Guid", doc, XPathConstants.STRING);
//Get the element text xxx by the attribute value Symbol
String symbol= xpath.evaluate("/Name/Group/Package/Attribute[@Name=\"Symbol\"]", doc.getDocumentElement());
System.out.println(name);
System.out.println(guid);
System.out.println(symbol);
}catch(Exception e){
e.printStackTrace();
}
I would be happy if my code helps someone :)

Add a method like this to retrieve all of the elements that match a given XPath expression:
public List<Node> getNodes(Node sourceNode, String xpathExpresion) throws XPathExpressionException {
// You could cache/reuse xpath for better performance
XPath xpath = XPathFactory.newInstance().newXPath();
NodeList nodes = (NodeList) xpath.evaluate(xpathExpresion,sourceNode,XPathConstants.NODESET);
ArrayList<Node> list = new ArrayList<Node>();
for(int i = 0; i < nodes.getLength(); i++) {
Node node = nodes.item(i);
list.add(node);
}
return list;
}
Add another method to build a Document from an XML input:
public Document buildDoc(InputStream is) throws Exception {
DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
DocumentBuilder parser = fact.newDocumentBuilder();
Document newDoc = parser.parse(is);
newDoc.normalize();
is.close();
return newDoc;
}
And then put it all together:
// buildDoc expects an InputStream, so wrap the XML string's bytes
InputStream is = new ByteArrayInputStream("... your XML string here".getBytes(StandardCharsets.UTF_8));
Document doc = buildDoc(is);
List<Node> nodes = getNodes(doc, "/Name/Group/Package/Attribute");
for (Node node: nodes) {
// for the text body of an element, first get its nested Text child
Text text = (Text) node.getChildNodes().item(0);
// Then ask that Text child for its value
String content = text.getNodeValue();
}
I hope I copied and pasted this correctly. I pulled this from a working class in an open source project of mine and cleaned it up a bit to answer your specific question.
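
Putting the pieces together for the original goal (collecting values such as "xxx" and "zzzz" into a list), a compact sketch could look like the following. It assumes the XML string has already been decoded, and that the sample XML is closed with the matching end tags, which the excerpt in the question omits. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answers.
class AttributeValueCollector {

    // Returns the text content of every node matched by the XPath expression.
    static List<String> collect(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() avoids having to fetch the Text child by hand
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }
}
```

For example, collect(s2, "/Name/Group/Package/Attribute[@Type='Address']") would gather only the address-related values, which can then be compared against another list.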

Efficiently unmarshaling a part of a large xml file with JAXB and XMLStreamReader

I want to unmarshal part of a large XML file. A solution for this already exists, but I want to improve it for my own implementation.
Please have a look at the following code: (source)
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("input.xml");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
xsr.nextTag();
while(!xsr.getLocalName().equals("VersionList")&&xsr.getElementText().equals("1.81")) {
xsr.nextTag();
}
I want to unmarshall the input.xml (given below) for the node: versionNumber="1.81"
With the current code, the XMLStreamReader will first check the node versionNumber="1.80" and then it will check all sub nodes of versionNumber and then it will again move to node: versionNumber="1.81", where it will satisfy the exit condition of the while loop.
Since I only want to check the versionNumber nodes, iterating over their sub-nodes is unnecessary, and for a large XML file iterating over all sub-nodes of version 1.80 will take a long time. I want to check only the top-level nodes (versionNumber), and if the first one (versionNumber=1.80) does not match, the XMLStreamReader should jump directly to the next one (versionNumber=1.81). This does not seem achievable with xsr.nextTag(). Is there any way to iterate through the desired top-level nodes only?
input.xml:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<fileVersionListWrapper FileName="src.h">
<VersionList versionNumber="1.80">
<Reviewed>
<commentId>v1.80(c5)</commentId>
<author>Robin</author>
<lines>47</lines>
<lines>48</lines>
<lines>49</lines>
</Reviewed>
<Reviewed>
<commentId>v1.80(c6)</commentId>
<author>Sujan</author>
<lines>82</lines>
<lines>83</lines>
<lines>84</lines>
<lines>85</lines>
</Reviewed>
</VersionList>
<VersionList versionNumber="1.81">
<Reviewed>
<commentId>v1.81(c4)</commentId>
<author>Robin</author>
<lines>47</lines>
<lines>48</lines>
<lines>49</lines>
</Reviewed>
<Reviewed>
<commentId>v1.81(c5)</commentId>
<author>Sujan</author>
<lines>82</lines>
<lines>83</lines>
<lines>84</lines>
<lines>85</lines>
</Reviewed>
</VersionList>
</fileVersionListWrapper>

You can get the node from the XML using XPath.
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. In addition, XPath may be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. What is Xpath.
Your XPath expression will be
/fileVersionListWrapper/VersionList[@versionNumber='1.81']
meaning you only want to return the VersionList elements whose versionNumber attribute is 1.81.
Java code
I have assumed that you have the XML as a String, so you will need something like the following:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource inputSource = new InputSource(new StringReader(xml));
Document document = builder.parse(inputSource);
XPathFactory xPathfactory = XPathFactory.newInstance();
XPath xpath = xPathfactory.newXPath();
XPathExpression expr = xpath.compile("/fileVersionListWrapper/VersionList[@versionNumber='1.81']");
NodeList nl = (NodeList) expr.evaluate(document, XPathConstants.NODESET);
Now you can simply loop through each node:
for (int i = 0; i < nl.getLength(); i++)
{
System.out.println(nl.item(i).getNodeName());
}
To get the nodes back into XML, you will have to create a new Document and append the nodes to it.
Document newXmlDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element root = newXmlDocument.createElement("fileVersionListWrapper");
for (int i = 0; i < nl.getLength(); i++)
{
Node node = nl.item(i);
Node copyNode = newXmlDocument.importNode(node, true);
root.appendChild(copyNode);
}
newXmlDocument.appendChild(root);
Once you have the new document, run a serializer over it to get the XML.
DOMImplementationLS domImplementationLS = (DOMImplementationLS) newXmlDocument.getImplementation();
LSSerializer lsSerializer = domImplementationLS.createLSSerializer();
// serialize the new document that holds only the selected nodes
String string = lsSerializer.writeToString(newXmlDocument);
Now that you have the XML as a String, I assume you already have a JAXB class that looks similar to this:
@XmlRootElement(name = "fileVersionListWrapper")
public class FileVersionListWrapper
{
private ArrayList<VersionList> versionListArrayList = new ArrayList<VersionList>();
public ArrayList<VersionList> getVersionListArrayList()
{
return versionListArrayList;
}
@XmlElement(name = "VersionList")
public void setVersionListArrayList(ArrayList<VersionList> versionListArrayList)
{
this.versionListArrayList = versionListArrayList;
}
}
You can then simply use the JAXB unmarshaller to create the objects for you:
JAXBContext jaxbContext = JAXBContext.newInstance(FileVersionListWrapper.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
StringReader reader = new StringReader(string);
FileVersionListWrapper fileVersionListWrapper = (FileVersionListWrapper) jaxbUnmarshaller.unmarshal(reader);
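
If loading the whole file into a DOM is too expensive, a streaming variant closer to the question's XMLStreamReader approach is also possible. The sketch below is one way to do it under stated assumptions: the element and attribute names are taken from input.xml, and the class and method names are hypothetical. The parser still has to read past the bytes of the skipped subtree, but the code only inspects VersionList start tags and does no per-child work.

```java
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the original answer.
class VersionListSeeker {

    // Advances the reader to the START_ELEMENT of the VersionList whose
    // versionNumber attribute equals version. Returns false if none is found.
    static boolean seekVersion(XMLStreamReader xsr, String version) throws Exception {
        while (xsr.hasNext()) {
            int event = xsr.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "VersionList".equals(xsr.getLocalName())) {
                if (version.equals(xsr.getAttributeValue(null, "versionNumber"))) {
                    return true;
                }
                skipSubtree(xsr); // not the wanted version: jump over its children
            }
        }
        return false;
    }

    // Consumes events until the end tag matching the current start tag.
    static void skipSubtree(XMLStreamReader xsr) throws Exception {
        int depth = 1;
        while (depth > 0 && xsr.hasNext()) {
            int event = xsr.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }
}
```

Once seekVersion returns true, the reader is positioned on the matching VersionList start tag and could be handed to a JAXB Unmarshaller, assuming a JAXB class for VersionList exists.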

XSLT parameter not replaced

Could someone advise me what's wrong with the XSLT transformation below? I have stripped it down to a minimum.
Basically, I would like to have a parameter "title" replaced, but I cannot get it to work. The transformation simply ignores the parameter. I have highlighted the relevant bits with some exclamation marks.
Any advice is greatly appreciated.
public class Test {
private static String xslt =
"<?xml version=\"1.0\"?>\n" +
"<xsl:stylesheet\n" +
" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\"\n" +
" version=\"1.0\">\n" +
" <xsl:param name=\"title\" />\n" +
" <xsl:template match=\"/Foo\">\n" +
" <html><head><title>{$title}</title></head></html>\n" + // !!!!!!!!!!!
" </xsl:template>\n" +
"</xsl:stylesheet>\n";
public static void main(String[] args) {
try {
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
final DocumentBuilder db = dbf.newDocumentBuilder();
final Document document = db.newDocument();
document.appendChild( document.createElement( "Foo" ) );
final StringWriter resultWriter = new StringWriter();
TransformerFactory factory = TransformerFactory.newInstance();
Transformer transformer = factory.newTransformer( new StreamSource( new StringReader( xslt ) ) );
// !!!!!!!!!!!!!!!!!!
transformer.setParameter( "title", "This is a title" );
// !!!!!!!!!!!!!!!!!!
transformer.transform( new DOMSource( document ), new StreamResult( resultWriter ) );
System.out.println( resultWriter.toString() );
} catch( Exception ex ) {
ex.printStackTrace();
}
}
}
I'm using Java 6 without any factory-specific system properties set.
Thank you in advance!

<html><head><title>{$title}</title></head></html>
The problem is in the above line.
In XSLT the {someXPathExpression} syntax can be used only in (some) attributes, and never in text nodes.
Solution:
Replace the above with:
<html><head><title><xsl:value-of select="$title"/></title></head></html>
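
To see both forms side by side, here is a minimal runnable sketch. It assumes a simplified stylesheet with a hypothetical page element (instead of html) and a hypothetical heading attribute, so that the {$title} form can be shown in an attribute, where it is allowed, next to xsl:value-of in element content.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not part of the original question or answer.
class XsltParamSketch {
    static final String XSLT =
        "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:param name=\"title\"/>"
      + "<xsl:template match=\"/Foo\">"
        // {$title} is evaluated in the attribute; text content needs xsl:value-of
      + "<page heading=\"{$title}\"><title><xsl:value-of select=\"$title\"/></title></page>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Transforms a minimal <Foo/> document with the given title parameter.
    static String render(String title) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        transformer.setParameter("title", title);
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader("<Foo/>")), new StreamResult(out));
        return out.toString();
    }
}
```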
