How to deal with unknown entity references? - java

I'm parsing (a lot of) XML files that contain entity references which I don't know in advance (I can't change that fact).
For example:
xml = "<tag>I'm content with &funny; &entity; &references;.</tag>"
When I try to parse this using the following code:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
final Document d = db.parse(is);
I get the following exception:
org.xml.sax.SAXParseException: The entity "funny" was referenced, but not declared.
What I want is for the parser to replace every entity that is not declared (unknown to the parser) with an empty string ''.
Or, even better, is there a way to pass a map to the parser, like:
Map<String,String> entityMapping = ...
entityMapping.put("funny","very");
entityMapping.put("entity","important");
entityMapping.put("references","stuff");
so that I could do the following:
final DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
final DocumentBuilder db = dbf.newDocumentBuilder();
final InputSource is = new InputSource(new StringReader(xml));
db.setEntityResolver(entityMapping);
final Document d = db.parse(is);
If I then read the text from the document built by this example code, I should get:
I'm content with very important stuff.
Any suggestions? Of course, I would already be happy to just replace the unknown entities with empty strings.
Thanks,

The StAX API has support for this. Have a look at XMLInputFactory, it has a runtime property which dictates whether or not internal entities are expanded, or left in place. If set to false, then the StAX event stream will contain instances of EntityReference to represent the unexpanded entities.
If you still want a DOM as the end result, you can chain it together like this:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
String xml = "my xml";
StringReader xmlReader = new StringReader(xml);
XMLEventReader eventReader = inputFactory.createXMLEventReader(xmlReader);
StAXSource source = new StAXSource(eventReader);
DOMResult result = new DOMResult();
transformer.transform(source, result);
Node document = result.getNode();
In this case, the resulting DOM will contain nodes of org.w3c.dom.EntityReference mixed in with the text nodes. You can then process these as you see fit.
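If a DOM is not needed, the unexpanded references can also be handled directly on the StAX stream. What follows is only a minimal sketch: it assumes the JDK's built-in StAX implementation (other implementations may treat undeclared entities differently), and the class name, method name and mapping are made up for illustration. Names missing from the map become an empty string.

```java
import java.io.StringReader;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class EntityTextExtractor {

    /** Collects the document's text, substituting entity references from the map. */
    static String textWithEntities(String xml, Map<String, String> entityMapping) {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        // Report entity references as events instead of expanding them.
        inputFactory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = inputFactory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.ENTITY_REFERENCE) {
                    // For this event type getLocalName() returns the entity name.
                    text.append(entityMapping.getOrDefault(reader.getLocalName(), ""));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not read XML", e);
        }
        return text.toString();
    }
}
```

With the mapping from the question, textWithEntities(xml, entityMapping) is meant to return the sentence with the three references substituted.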

Since your XML input seems to be available as a String, could you not do a simple pre-processing with regular expression replacement?
xml = "...";
/* replace entities before parsing */
for (Map.Entry<String,String> entry : entityMapping.entrySet()) {
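// Pattern and Matcher are from java.util.regex; quoting keeps regex metacharacters in names and values literal
xml = xml.replaceAll(Pattern.quote("&" + entry.getKey() + ";"), Matcher.quoteReplacement(entry.getValue()));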
}
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
...
It's quite hacky, and you may want to spend some extra effort to ensure that the regexps only match where they really should (think <entity name="&don't-match-me;"/>), but at least it's something...
Of course, there are more efficient ways to achieve the same effect than calling replaceAll() a lot of times.
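One way to avoid the repeated replaceAll() calls is a single pass over the string with one pattern. This is a rough sketch under some assumptions: the class and method names are made up, the five predefined XML entities are left alone, unknown names are replaced with an empty string, and the replacement values are inserted as-is (they are not XML-escaped). It still does not look at context, so references inside CDATA sections or comments are replaced too.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EntityPreprocessor {

    // Matches named references such as &funny; but not numeric ones such as &#160;.
    private static final Pattern ENTITY_REF = Pattern.compile("&([A-Za-z_][A-Za-z0-9_.-]*);");

    // The five entities every XML parser already knows; these are kept untouched.
    private static final Set<String> PREDEFINED =
            new HashSet<>(Arrays.asList("amp", "lt", "gt", "quot", "apos"));

    /** Replaces every named entity reference in one pass over the input. */
    static String replaceEntities(String xml, Map<String, String> entityMapping) {
        Matcher matcher = ENTITY_REF.matcher(xml);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            String replacement = PREDEFINED.contains(name)
                    ? matcher.group()                        // keep &amp; and friends
                    : entityMapping.getOrDefault(name, "");  // unknown entities vanish
            // quoteReplacement stops '$' and '\' in the value from being treated specially.
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }
}
```

The returned string can then be handed to the DocumentBuilder as before.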

You could add the entity declarations at the beginning of the file. Look here for more information.
You could also take a look at this thread, where someone seems to have implemented the EntityResolver interface (you could also implement EntityResolver2!) so that the entities are processed on the fly (e.g. with your proposed Map).
WARNING: there is a bug in JDK 6, but you could try it with JDK 5.
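The first suggestion (adding the entities at the beginning of the file) could look roughly like the sketch below: a DOCTYPE with an internal subset is generated from the map and prepended to the XML before parsing. The class and method names are invented for illustration. It assumes the input has no XML declaration and no DOCTYPE of its own, that the entity names are known up front, and that the parser is allowed to process DOCTYPE declarations.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class EntityDeclarationPrefixer {

    /** Prepends a DOCTYPE whose internal subset declares every mapped entity. */
    static String withDeclarations(String xml, String rootName, Map<String, String> entityMapping) {
        StringBuilder doctype = new StringBuilder("<!DOCTYPE ").append(rootName).append(" [");
        for (Map.Entry<String, String> entry : entityMapping.entrySet()) {
            doctype.append("<!ENTITY ").append(entry.getKey())
                   .append(" \"").append(escapeEntityValue(entry.getValue())).append("\">");
        }
        return doctype.append("]>").append(xml).toString();
    }

    // Escapes the characters that are special inside an entity value literal.
    private static String escapeEntityValue(String value) {
        return value.replace("&", "&#38;#38;")
                    .replace("<", "&#38;#60;")
                    .replace("%", "&#37;")
                    .replace("\"", "&#34;");
    }

    /** Parses the XML with the generated declarations in place. */
    static Document parse(String xml, String rootName, Map<String, String> entityMapping) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            String prefixed = withDeclarations(xml, rootName, entityMapping);
            return db.parse(new InputSource(new StringReader(prefixed)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling parse(xml, "tag", entityMapping) on the example from the question and reading getDocumentElement().getTextContent() should then give the sentence with the mapped words filled in. Entities that are not in the map still cause the original exception.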

Related

Trying to get the value of a tag in an xml string java

I have an xml string stored in a StringBuilder.
My xml looks like this
couldn't write it in code so here's a screenshot
inside the report tag, it looks like
what it looks like
I would like to get access to any tag value I want in the record tag, what I have is :
StringBuilder informationString = new StringBuilder();
Scanner scanner = new Scanner(url.openStream());
while (scanner.hasNext()) {
informationString.append(scanner.nextLine());
}
//Close the scanner
scanner.close();
System.out.println(informationString);
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(String.valueOf(informationString))));
Element rootElement = document.getDocumentElement();
But I do not know what to do with this and am very lost
Thanks in advance for helping
In general, you can use the below routine
Element documentElement=....
NodeList elmList=documentElement.getElementsByTagName("elementName");
Element e=(Element)elmList.item(x);//putting it in a loop would do
You could keep using the above to get elements recursively.
Though a better approach would be to use XPath (Saxon has a decent XPath implementation, though there are many more libraries to choose from).
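The JDK also ships an XPath engine in javax.xml.xpath, so no extra library is strictly needed. Here is a small sketch of reading one value that way; the class name is made up, and since the real structure is only visible in the screenshots, the element names used in the usage note are assumptions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
final class XPathValueReader {

    /** Evaluates the expression against the XML text and returns the result as a string. */
    static String valueOf(String xml, String expression) {
        XPath xPath = XPathFactory.newInstance().newXPath();
        try {
            // The InputSource overload parses the XML and evaluates in one step.
            return xPath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad expression or XML: " + expression, e);
        }
    }
}
```

For instance, with a record that has a city child, valueOf(informationString.toString(), "/report/record/city") would return the text of the first matching element, or an empty string if nothing matches.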

Format attributes for XML in Pretty format in java

I am trying to pretty-print an XML string. I want all the attributes of an element to be printed on a single line.
XML input:
<root><feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f"> <id>2140</id><title>gj</title><description>ghj</description>
<msg/>
Expected output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d" attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Actual Output:
<root>
<feeds attribute1="a" attribute2="b" attribute3="c" attribute4="d"
attribute5="e" attribute6="f">
<id>2140</id>
<title>gj</title>
<description>ghj</description>
<msg/>
</feeds>
Here is my code to format xml. I have also tried SAX parser. I don't want to use DOM4J.
public static String formatXml(String xml) {
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
LSSerializer writer = impl.createLSSerializer();
writer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE);
writer.getDomConfig().setParameter("xml-declaration", false);
writer.getDomConfig().setParameter("well-formed", true);
LSOutput output = impl.createLSOutput();
ByteArrayOutputStream out = new ByteArrayOutputStream();
output.setByteStream(out);
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
writer.write(db.parse(is), output);
return new String(out.toByteArray());
}
Is there any way to keep attributes in one line with SAX or DOM parser? I am not looking for any additional library. I am looking for solution with java library only.
A SAX or DOM parser will read your input string and allow your application to understand what was passed in. At some point in time your application then writes out that data, and that is the moment where you decide to insert additional whitespace (like linefeeds and tab characters) to pretty-print the document.
If you really want to use SAX and keep the processing efficient, the best you could do is write the document while it is being parsed. So you would implement the ContentHandler interface (https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/org/xml/sax/ContentHandler.html) such that it directly writes out the data while adding linefeeds where you feel they belong.
Check this tutorial to see how the ContentHandler can then be applied in a SAX parser: https://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
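As an illustration of that idea, here is a rough sketch of a handler that writes the document out while it is parsed and never breaks a start tag across lines. It is deliberately simplified and makes some assumptions: the class name is invented, the indent is two spaces, and mixed content, comments, processing instructions and namespaces are not handled (text next to child elements is dropped).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
final class OneLineAttributePrinter extends DefaultHandler {

    private final StringBuilder out = new StringBuilder();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0;
    private boolean leaf = false; // true while the open element has no child elements yet

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        text.setLength(0); // text in front of a child element is dropped
        if (out.length() > 0) out.append('\n');
        indent();
        out.append('<').append(qName);
        for (int i = 0; i < attrs.getLength(); i++) { // all attributes stay on this line
            out.append(' ').append(attrs.getQName(i)).append("=\"")
               .append(escape(attrs.getValue(i))).append('"');
        }
        out.append('>');
        depth++;
        leaf = true;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
        String content = text.toString().trim();
        if (leaf && content.isEmpty()) {
            out.setLength(out.length() - 1); // turn "<msg>" into "<msg/>"
            out.append("/>");
        } else if (leaf) {
            out.append(escape(content)).append("</").append(qName).append('>');
        } else {
            out.append('\n');
            indent();
            out.append("</").append(qName).append('>');
        }
        text.setLength(0);
        leaf = false;
    }

    private void indent() {
        for (int i = 0; i < depth; i++) out.append("  ");
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    /** Parses the XML and returns the re-indented text. */
    static String format(String xml) {
        OneLineAttributePrinter handler = new OneLineAttributePrinter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return handler.out.toString();
    }
}
```

Because the handler builds each start tag itself, the line width of the serializer no longer plays a role; the trade-off is that everything the handler does not explicitly write is lost.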

Java transformer w3c.dom.document to inputstream

My scenario is this:
I have a HTML which I loaded into a w3c.dom.Document, after loading it as a doc, I parsed through its nodes and made a few changes in their values, but now I need to transform this document into a String, or preferably into a InputStream directly.
And I managed to do so. However, for my purposes the HTML must keep some properties of the initial file; for instance (and this is the one thing I'm struggling a lot to solve), all tags must be closed.
Say I have a link tag in the header, <link .... />: I NEED the slash (/) at the end. However, after the transformer transforms my doc into an OutputStream (which I then feed to an InputStream), every '/' before the > disappears. All my tags that ended in /> are changed into a plain >.
The reason I need this structure is that one of the libraries I'm using (and I'm afraid I can't go looking for another one, especially not at this point) requires all tags to be closed; if they are not, it throws exceptions everywhere and my program crashes....
Does anyone have any good ideas or solutions for me? This is my first contact with the Transform class, so I might be missing something that could help me.
Thank you all so very much,
Warm regards
Some bit of the code to explain the scenario a little bit
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
org.w3c.dom.Document doc = docBuilder.parse(his); // his = the HTML inputStream
XPath xPath = XPathFactory.newInstance().newXPath();
String expression = "//*[@id='pessoaNome']";
org.w3c.dom.Element pessoaNome = null;
try
{
pessoaNome = (org.w3c.dom.Element) (Node) xPath.compile(expression).evaluate(doc, XPathConstants.NODE);
}
catch (Exception e)
{
e.printStackTrace();
}
pessoaNome.setTextContent("The new values for the node");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
Source xmlSource = new DOMSource(doc);
Result outputTarget = new StreamResult(outputStream);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.DOCTYPE_SYSTEM, "HTML");
transformer.transform(xmlSource, outputTarget);
InputStream is = new ByteArrayInputStream(outputStream.toByteArray()); // At this point outputStream is already all messed up, not just the '/'. but this is the only thing causing me problems
As @Lee pointed out, I changed it to use jsoup. The code got a lot cleaner; I just had to set up the OutputSettings for it to work like a charm. Code below:
org.jsoup.nodes.Document doc = Jsoup.parse(new File(HTML), "UTF-8");
org.jsoup.nodes.Element pessoaNome = doc.getElementById("pessoaNome");
pessoaNome.html("My new html in here");
OutputSettings oSettings = new OutputSettings();
oSettings.syntax(org.jsoup.nodes.Document.OutputSettings.Syntax.xml);
doc.outputSettings(oSettings);
InputStream is = new ByteArrayInputStream(doc.outerHtml().getBytes());
Have a look at jTidy, which cleans HTML. There is also jsoup, which is newer and supposedly does the same things, only better.
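If staying with the JDK Transformer is preferred, one thing that may be worth trying is forcing the XML output method. The slashes most likely disappear because a root element named html makes the transformer pick the HTML output method, which writes empty tags such as link without the closing slash. The sketch below assumes that is the cause; the class and method names are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class XmlSyntaxSerializer {

    /** Serializes the document with XML syntax, so empty elements end in "/>". */
    static String toXmlString(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Override the HTML output method that an <html> root would otherwise select.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter writer = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(writer));
            return writer.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    /** Same output, wrapped in an InputStream (UTF-8 encoded). */
    static InputStream toInputStream(Document doc) {
        return new ByteArrayInputStream(toXmlString(doc).getBytes(StandardCharsets.UTF_8));
    }

    /** Small parsing helper so the example is self-contained. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Note that XML syntax also collapses elements like an empty script or div into self-closing tags, which browsers may not like; whether that matters depends on what consumes the result.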

Casting JDom 1.1.3 Element to Document without DocumentBuilderFactory or DocumentBuilder

I need to find the easiest and most efficient way to convert a JDOM element (with all its tailoring nodes) to a Document. ownerDocument( ) won't work as this is version JDOM 1.
Moreover, org.jdom.IllegalAddException: The Content already has an existing parent "root" exception occurs when using the following code.
DocumentBuilderFactory dbFac = DocumentBuilderFactory.newInstance();
DocumentBuilder dBuilder = dbFac.newDocumentBuilder();
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
XMLOutputter xmlOutput = new XMLOutputter();
byte[] byteInfo= xmlOutput.outputString(elementInfo).getBytes("UTF-8");
String stringInfo = new String(byteInfo);
doc = dBuilder.parse(stringInfo);
I think you have to use the following method of the element.
Document doc = <element>.getDocument();
Refer to the API documentation. It says:
Return this parent's owning document or null if the branch containing this parent is currently not attached to a document.
JDOM content can only have one parent at a time, and you have to detach it from one parent before you can attach it to another. This code:
Document doc = null;
Element elementInfo = getElementFromDB();
doc = new Document(elementInfo);
if that code is failing, it is because the getElementFromDB() method is returning an Element that is part of some other structure. You need to 'detach' it:
Element elementInfo = getElementFromDB();
elementInfo.detach();
Document doc = new Document(elementInfo);
OK, that solves the IllegalAddException
On the other hand, if you just want to get the document node containing the element, JDOM 1.1.3 allows you to do that with getDocument:
Document doc = elementInfo.getDocument();
Note that the doc may be null.
To get the top most element available, try:
Element top = elementInfo;
while (top.getParentElement() != null) {
top = top.getParentElement();
}
In your case, your elementInfo you get from the DB is a child of an element called 'root', something like:
<root>
<elementInfo> ........ </elementInfo>
</root>
That is why you get the message you do, with the word "root" in it:
The Content already has an existing parent "root"

How to create a XML object from String in Java?

I am trying to write code that helps me create an XML object. For example, I will give a string as input to a function and it will return an XMLObject.
XMLObject convertToXML(String s) {}
When I searched the net, I mostly saw examples about creating XML documents, that is, building the XML, writing it to a file and creating the file. What I have done so far is something like this:
Document document = new Document();
Element child = new Element("snmp");
child.addContent(new Element("snmpType").setText("snmpget"));
child.addContent(new Element("IpAdress").setText("127.0.0.1"));
child.addContent(new Element("OID").setText("1.3.6.1.2.1.1.3.0"));
document.setContent(child);
Do you think it is enough to create an XML object? Also, can you please help me with how to get data from the XML? For example, how can I get the IpAdress from that XML?
Thank you all a lot
EDIT 1: Actually now I thought that maybe it would be much easier for me to have a file like base.xml, I will write all basic things into that for example:
<snmp>
<snmpType></snmpType>
<OID></OID>
</snmp>
and then use this file to create a XML object. What do you think about that?
If you can build the XML as a string, you can easily turn it into an XML Document object, e.g.:
String xmlString = "<?xml version=\"1.0\" encoding=\"utf-8\"?><a><b></b><c></c></a>";
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder;
try {
builder = factory.newDocumentBuilder();
Document document = builder.parse(new InputSource(new StringReader(xmlString)));
} catch (Exception e) {
e.printStackTrace();
}
You can use the document object and xml parsing libraries or xpath to get back the ip address.
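For instance, reading the IP address back with plain DOM calls could look like the sketch below. The class and method names are invented for illustration, and the element names follow the snippet in the question (including its IpAdress spelling).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper, not part of any library.
final class SnmpXmlReader {

    /** Returns the text of the first element with the given tag name, or null if there is none. */
    static String firstValue(String xml, String tagName) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document document = builder.parse(new InputSource(new StringReader(xml)));
            NodeList matches = document.getElementsByTagName(tagName);
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

Calling firstValue(xmlString, "IpAdress") on the snmp document from the question would then give back the address as a string.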
Try something like:
public static Document loadXML(String xml) throws Exception
{
DocumentBuilderFactory fctr = DocumentBuilderFactory.newInstance();
DocumentBuilder bldr = fctr.newDocumentBuilder();
InputSource insrc = new InputSource(new StringReader(xml));
return bldr.parse(insrc);
}
