How to make XML Parser aware of all Character Entity References? - java

I get arbitrary XML from a server and parse it using this Java code:
String xmlStr; // arbitrary XML input
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xmlStr));
return builder.parse(is);
}
catch (SAXException | IOException | ParserConfigurationException e) {
LOGGER.error("Failed to parse XML.", e);
}
Every once in a while, the XML input contains an unknown entity reference such as &nbsp; and parsing fails with an error such as org.xml.sax.SAXParseException: The entity "nbsp" was referenced, but not declared.
I could solve this problem by preprocessing the original xmlStr and translating all problematic entity references before parsing. Here's a dummy implementation that works:
protected static String translateEntityReferences(String xml) {
String newXml = xml;
Map<String, String> entityRefs = new HashMap<>();
entityRefs.put("&nbsp;", "&#160;");
entityRefs.put("&laquo;", "&#171;");
entityRefs.put("&raquo;", "&#187;");
// ... and 250 more...
for(Entry<String, String> er : entityRefs.entrySet()) {
newXml = newXml.replace(er.getKey(), er.getValue());
}
return newXml;
}
However, this is really unsatisfactory, because there are a huge number of entity references and I don't want to hard-code all of them into my Java class.
Is there any easy way of teaching this entire list of character entity references to the DocumentBuilder?

If you can change your code to work with StAX instead of DOM, the trivial solution is to use the XMLInputFactory property IS_REPLACING_ENTITY_REFERENCES set to false.
public static void main(String[] args) throws Exception
{
String doc = "<doc>&nbsp;</doc>";
ByteArrayInputStream is = new ByteArrayInputStream(doc.getBytes());
XMLInputFactory xif = XMLInputFactory.newFactory();
xif.setProperty(javax.xml.stream.XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, false);
XMLStreamReader xr = xif.createXMLStreamReader(is);
while(xr.hasNext())
{
int t = xr.getEventType();
switch(t) {
case XMLEvent.ENTITY_REFERENCE:
System.out.println("Entity: "+ xr.getLocalName());
break;
case XMLEvent.START_DOCUMENT:
System.out.println("Start Document");
break;
case XMLEvent.START_ELEMENT:
System.out.println("Start Element: " + xr.getLocalName());
break;
case XMLEvent.END_DOCUMENT:
System.out.println("End Document");
break;
case XMLEvent.END_ELEMENT:
System.out.println("End Element: " + xr.getLocalName());
break;
default:
System.out.println("Other: ");
break;
}
xr.next();
}
}
Output:
Start Document
Start Element: doc
Entity: nbsp null
End Element: doc
But that may require too much rewrite in your code if you really need the full DOM tree in memory.
I spent an hour tracing through the DOM implementation and couldn't find any way to make the DOM parser read from an XMLStreamReader.
Also there is evidence in the code that the internal DOM parser implementation has an option similar to IS_REPLACING_ENTITY_REFERENCES but I couldn't find any way to set it from the outside.
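If the DOM tree is required, one DOM-side alternative to the hard-coded replacement map in the question is to declare the entities in an internal DTD subset that is injected into the document before parsing. The following is a minimal sketch under stated assumptions: the input has no DOCTYPE of its own, the parser is non-validating (the default), and the class name EntityDeclaringParser and its three-entry table are illustrative only.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: declare known entities in an injected internal DTD subset before DOM parsing. */
class EntityDeclaringParser {

    // Hypothetical, deliberately short table; extend it with the entities your server sends.
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("nbsp", "&#160;");
        ENTITIES.put("laquo", "&#171;");
        ENTITIES.put("raquo", "&#187;");
    }

    /** Inserts a DOCTYPE declaring ENTITIES; assumes the input has no DOCTYPE of its own. */
    static String withEntityDeclarations(String xml) {
        StringBuilder doctype = new StringBuilder("<!DOCTYPE injected [");
        for (Map.Entry<String, String> e : ENTITIES.entrySet()) {
            doctype.append("<!ENTITY ").append(e.getKey())
                   .append(" \"").append(e.getValue()).append("\">");
        }
        doctype.append("]>");
        // An XML declaration, if present, must remain the very first thing in the document.
        int insertAt = 0;
        if (xml.startsWith("<?xml")) {
            insertAt = xml.indexOf("?>") + 2;
        }
        return xml.substring(0, insertAt) + doctype + xml.substring(insertAt);
    }

    /** Parses the XML into a DOM tree after injecting the entity declarations. */
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(withEntityDeclarations(xml))));
        } catch (Exception e) {
            throw new IllegalArgumentException("Failed to parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<doc>a&nbsp;b</doc>");
        // Print the code point of the middle character to show that the entity was expanded.
        System.out.println((int) doc.getDocumentElement().getTextContent().charAt(1));
    }
}
```

The table still has to come from somewhere; the entity sets published with XHTML 1.0 (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) use exactly this declaration syntax, so they could probably be pasted into the subset instead of being typed by hand.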

Related

How to remove double quotes " " from Json string

I'm getting double quotes around the value of the 'data' field in my JSON response, like this:
{
"bID" : 1000013253,
"bTypeID" : 1,
"name" : "Test1",
"data" : "{"bc": { "b": { "t": 1, "r": 1, "c": "none" }, "i": "CM19014269"}}"
}
While validating this JSON, I get the validation error below:
Error: Parse error on line 18:
... "document" : "[{"bc": { "b": {
-----------------------^
Expecting 'EOF', '}', ':', ',', ']'
I want the JSON response to look like this:
{
"bID" : 1000013253,
"bTypeID" : 1,
"name" : "Test1",
"data" : {"bc": { "b": { "t": 1, "r": 1, "c": "none" }, "i": "CM19014269"}}
}
My server-side code is:
{
for (ManageBasketTO manageBasketTO : retList) {
Long basketId = manageBasketTO.getBasketID();
BasketTO basketTo = null;
basketTo = CommonUtil.getBasket(usrCtxtObj, basketId, language, EBookConstants.FOR_VIEWER_INTERFACE,
usrCtxtObj.getScenarioID(), EBookConstants.YES, request, deviceType);
String doc = Utilities.getStringFromDocument(basketTo.getdocument());
doc = doc.replace("<?xml version=\"1.0\" encoding=\"UTF-8\"?>", "");
doc = doc.replace("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>", "");
doc = doc.trim();
JSONObject object = XML.toJSONObject(doc);
doc = object.toString(4);
BasketsInfoTO basketsInfoTO = new BasketsInfoTO(basketId, manageBasketTO.getBTypeID(), manageBasketTO.getName(), doc);
basketsToc.add(basketsInfoTO);
}
basketInfoRestTO.setBasketsInfoTOList(basketsToc);
ObjectMapper mapper = new ObjectMapper();
responseXML = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(basketInfoRestTO);
responseXML = responseXML.replace("\\\"", "\"");
responseXML = responseXML.replace("\\n", "");
}
Any help is much appreciated. Thanks
Parsing and replacing text inside XML or JSON string values is not a good solution. It may fix the quote issue above, but the code becomes less readable and more error-prone: when a new edge case turns up, you will have to rework previously written code to handle it (a violation of the open/closed principle, the O in SOLID). Try to separate responsibilities in your code as much as you can (single responsibility). The org.json library, which you already use, handles well-formed XML, so valid XML is converted to a JSONObject without any issue.
As for the double quotes: your input XML is probably not valid, or your Utilities.getStringFromDocument method breaks the XML specification. Converting an XML string to a Document and back does not violate anything in the XML or JSON standards; if the input XML string contains double quotes, the converted JSON will contain them as well. If you want them removed during conversion, convert the whole document first, then restructure the data by creating a separate JSONObject / JSONArray instance from the text. The sample code below shows the round trip:
public static void main(String[] args) {
StringBuilder xmlText = new StringBuilder("<?xml version=\"1.0\" encoding=\"UTF-8\"?>")
.append("<sample>")
.append("<rec1>John</rec1>")
.append("<rec2>Snow</rec2>")
.append("<data>")
.append("<a>Season 1</a>")
.append("<b>Episode 1</b>")
.append("</data>")
.append("</sample>");
// The next two lines show that the String -> Document -> String round trip introduces no quote issue like the one in the question
Document doc = convertStringToDocument(xmlText.toString());
System.out.println("XML string: " + convertDocumentToString(doc));
JSONObject xmlJSONObj = XML.toJSONObject(xmlText.toString());
System.out.println("JSON string: " + xmlJSONObj.toString());
}
private static Document convertStringToDocument(String input) {
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
try {
DocumentBuilder builder = factory.newDocumentBuilder();
return builder.parse(new InputSource(new StringReader(input)));
} catch (Exception e) {
e.printStackTrace();
}
return null;
}
private static String convertDocumentToString(Document document) {
TransformerFactory tf = TransformerFactory.newInstance();
try {
Transformer transformer = tf.newTransformer();
// transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes"); // remove XML declaration
StringWriter writer = new StringWriter();
transformer.transform(new DOMSource(document), new StreamResult(writer));
return writer.getBuffer().toString();
} catch (TransformerException e) {
e.printStackTrace();
}
return null;
}
You can remove the double quotes like this:
String x="\"abcd";
String z=x.replace("\"", "");
System.out.println(z);

How to get unique value of repeated nodes using dom parser

I have an XML document with repeated nodes and I have to parse it using a DOM parser. After a lot of R&D I couldn't find anything on the internet that helps. My XML looks like this:
<nos1>
<Name>aqwer</Name>
<class>sas</class>
<class>xcd</class>
<class>asd</class>
<Name>cfg</Name>
<Name>cfg</Name>
</nos1>
Any suggestions on how I can parse this XML for the repeated values?
You can use the W3C DOM API to parse your XML as follows:
DocumentBuilderFactory df = DocumentBuilderFactory.newInstance();
try
{
DocumentBuilder db = df.newDocumentBuilder();
InputStream is = new ByteArrayInputStream(response.getContent().getBytes("UTF-8"));
org.w3c.dom.Document doc = db.parse(is);
NodeList links = doc.getElementsByTagName("class");
for(int i=0; i< links.getLength(); i++)
{
Node link = links.item(i);
System.out.println(link.getTextContent());
}
}
catch(Exception ex)
{
ex.printStackTrace(); // don't swallow parse errors silently
}
Hope this helps you.
Read all the elements first, then eliminate the duplicates with a Set. Here is an example using XMLBeam, but any other library will do.
public class TestMultipleElements {
@XBDocURL("resource://test.xml")
public interface Projection {
@XBRead("/nos1/Name")
List<String> getNames();
@XBRead("/nos1/class")
List<String> getClasses();
}
@Test
public void uniqueElements() throws IOException {
Projection projection = new XBProjector().io().fromURLAnnotation(Projection.class);
for (String name : new HashSet<String>(projection.getNames())) {
System.out.println("Found Name:" + name);
}
for (String clazz : new HashSet<String>(projection.getClasses())) {
System.out.println("Found Name:" + clazz);
}
}
}
This prints out:
Found Name:aqwer
Found Name:cfg
Found Name:xcd
Found Name:sas
Found Name:asd
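The same read-everything-then-deduplicate idea also works with the JDK's DOM parser alone. This is a minimal sketch, assuming the XML from the question is well-formed (with a closing </nos1> tag); the class and method names are made up for illustration. A LinkedHashSet is used so that the unique values keep their document order.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class UniqueElementValues {

    /** Returns the distinct text values of all elements named tagName, in document order. */
    static Set<String> uniqueValues(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tagName);
            Set<String> values = new LinkedHashSet<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // A Set silently ignores values it already contains.
                values.add(nodes.item(i).getTextContent().trim());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalArgumentException("Failed to parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<nos1><Name>aqwer</Name><class>sas</class><class>xcd</class>"
                + "<class>asd</class><Name>cfg</Name><Name>cfg</Name></nos1>";
        System.out.println(uniqueValues(xml, "Name"));
        System.out.println(uniqueValues(xml, "class"));
    }
}
```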

Parse some elements from a xml

I want to know whether it is possible to parse only some elements of an XML file into a Java object.
I don't want to create fields for everything that is in the XML.
So, how can I do this?
For example, below is an XML file, and I want only the data inside the <enderEmit> tag.
<emit>
<CNPJ>1109</CNPJ>
<xNome>OESTE</xNome>
<xFant>ABATEDOURO</xFant>
<enderEmit>
<xLgr>RODOVIA</xLgr>
<nro>S/N</nro>
<xCpl>402</xCpl>
<xBairro>GOMES</xBairro>
<cMun>314</cMun>
<xMun>MINAS</xMun>
<UF>MG</UF>
<CEP>35661470</CEP>
<cPais>58</cPais>
<xPais>Brasil</xPais>
<fone>03</fone>
</enderEmit>
<IE>20659</IE>
<CRT>3</CRT>
</emit>
For Java XML parsing where you don't have the XSD and don't want to create a complete object graph to represent the XML, JDOM is a great tool. It allows you to easily walk the XML tree and pick the elements you are interested in.
Here's some sample code that uses JDOM to pick arbitrary values from the XML doc:
// reading can be done using any of the two 'DOM' or 'SAX' parser
// we have used saxBuilder object here
// please note that this saxBuilder is not internal sax from jdk
SAXBuilder saxBuilder = new SAXBuilder();
// obtain file object
File file = new File("/tmp/emit.xml");
try {
// converted file to document object
Document document = saxBuilder.build(file);
//You don't need this or the ns parameters in getChild()
//if your XML document has no namespace
Namespace ns = Namespace.getNamespace("http://www.example.com/namespace");
// get root node from xml. emit in your sample doc?
Element rootNode = document.getRootElement();
//getChild() assumes one and only one, enderEmit element. Use a lib and error
//checking as needed for your document
Element enderEmitElement = rootNode.getChild("enderEmit", ns);
//now we get two of the children of enderEmit
Element xCplElement = enderEmitElement.getChild("xCpl", ns);
//should be 402 in your example
String xCplValue = xCplElement.getText();
System.out.println("xCpl: " + xCplValue);
Element cMunElement = enderEmitElement.getChild("cMun", ns);
//should be 314 in your example
String cMunValue = cMunElement.getText();
System.out.println("cMun: " + cMunValue);
} catch (JDOMException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
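If adding a library is not an option, the XPath API that ships with the JDK can pick out individual values in much the same way as the JDOM code above. This is a minimal sketch, assuming a document without namespaces; the class name EmitXPathReader and the shortened sample XML are illustrative only.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class EmitXPathReader {

    /** Evaluates an XPath expression against the XML text and returns its string value. */
    static String read(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // Note: each call parses the XML again; fine for a sketch, wasteful for many lookups.
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<emit><CNPJ>1109</CNPJ><enderEmit><xCpl>402</xCpl>"
                + "<cMun>314</cMun></enderEmit><CRT>3</CRT></emit>";
        System.out.println("xCpl: " + read(xml, "/emit/enderEmit/xCpl"));
        System.out.println("cMun: " + read(xml, "/emit/enderEmit/cMun"));
    }
}
```

For many lookups against the same file it would likely be better to parse once into a Document and evaluate the expressions against that node instead.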
You can use JAXB to unmarshal the xml into Java object, with which you can read selective elements easily. With JAXB, the given XML can be represented in Java as follows :
enderEmit element :
@XmlRootElement
public class EnderEmit{
private String xLgr;
//Other elements.Here you can define properties for only those elements that you want to load
}
emit element (This represents your XML file):
@XmlRootElement
public class Emit{
private String cnpj;
private String xnom;
private EnderEmit enderEmit;
..
//Add elements that you want to load
}
Now by using the below lines of code, you can read your xml to an object :
String filePath="filePath";
File file = new File(filePath);
JAXBContext jaxbContext = JAXBContext.newInstance(Emit.class);
Unmarshaller jaxbUnmarshaller = jaxbContext.createUnmarshaller();
Emit emit = (Emit) jaxbUnmarshaller.unmarshal(file);
The last line gives you an Emit object for the given XML.
Try using StringUtils.substringBetween (Apache Commons Lang):
BufferedReader br = null;
try
{
String input = "";
br = new BufferedReader(new FileReader(FILEPATH));
String result = null;
while ((input = br.readLine()) != null) // here we read the file line by line
{
result = StringUtils.substringBetween(input, ">", "<"); // extract the text between the opening and closing tag on this line
if(result != null) // result is null for lines that contain no such text
{
System.out.println(""+result);
}
}
}
catch (IOException e)
{
e.printStackTrace();
}
finally
{
try
{
if (br != null)
{
br.close();
}
}
catch (IOException ex)
{
ex.printStackTrace();
}
}

Is it possible to force JAXB to marshall illegal characters in an XML tag?

I've got a database full of objects, along with user-defined properties. For example:
class Media {
String name;
String duration;
Map<String,String> custom_tags;
}
Media:
Name: day_at_the_beach.mp4
Length: 4:22
Custom Tags:
Videographer: Charles
Owner ID #: 17a
Our users can come up with their own custom properties to attach to media, and fill in values accordingly. However, when I try to marshall the object into XML, I run into problems:
<media>
<name>day_at_the_beach.mp4</name>
<length>4:22</length>
<custom_tags>
<Videographer>Charles</Videographer>
<Owner_ID_#>17a</Owner_ID_#>
</custom_tags>
</media>
Owner_ID_# is an illegal tag name in XML, because it contains a # so JAXB throws an org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML character is specified.
I know that the preferred, correct way to solve this problem would be to reformat the xml to something along the lines of:
<custom_tags>
<custom_tag>
<name>Owner ID #</name>
<value>17a</value>
</custom_tag>
</custom_tags>
However, I'm required to return the former, invalid XML, to maintain legacy behavior from a previous, less-picky implementation of the code. Is there any way to tell JAXB not to worry about the illegal XML character, or am I going to be stuck doing a string replace before/after encoding? My current implementation is simply:
public static <T> String toXml(Object o, Class<T> z) {
try {
StringWriter sw = new StringWriter();
JAXBContext context = JAXBContext.newInstance(z);
Marshaller marshaller = context.createMarshaller();
marshaller.marshal(o, sw);
return sw.toString();
} catch (JAXBException e) {
throw new RuntimeException(e);
}
}
I built an XmlAdapter for this specific object, then:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document document = db.newDocument();
document.setStrictErrorChecking(false); // <--- This one. This accomplished it.
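To see what that flag changes, here is a minimal sketch that tries to create an element named Owner_ID_# with strict error checking on and off. It assumes the DOM implementation bundled with the JDK; the DOM specification only says that an implementation may skip checks when the flag is false, so other implementations could behave differently. The class and method names are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class LenientTagNames {

    /** Creates an empty DOM document with strict error checking switched on or off. */
    static Document newDocument(boolean strict) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            document.setStrictErrorChecking(strict);
            return document;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the document accepts tagName as an element name. */
    static boolean accepts(Document document, String tagName) {
        try {
            Element element = document.createElement(tagName);
            return element != null;
        } catch (DOMException e) {
            // e.g. INVALID_CHARACTER_ERR for names that contain '#'
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("strict:  " + accepts(newDocument(true), "Owner_ID_#"));
        System.out.println("lenient: " + accepts(newDocument(false), "Owner_ID_#"));
    }
}
```

The result is still not well-formed XML, so any consumer that parses it strictly will reject it; that trade-off is the one the question accepts for legacy compatibility.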

XML parsing java confirmation

This is how I turn an XML file into Java objects.
I ran "xjc" on a valid XML schema and got back some generated *.java files.
I imported them into a separate package in Eclipse.
I am now reading the XML file in two ways.
1) Loading the XML file:
System.out.println("Using FILE approach:");
File f = new File ("C:\\test_XML_files\\complex.apx");
JAXBElement felement = (JAXBElement) u.unmarshal(f);
MyObject fmainMyObject = (MyObject) felement.getValue ();
2) Using a DOM builder:
System.out.println("Using DOM BUILDER Approach:");
JAXBElement element = (JAXBElement) u.unmarshal(test());
MyObject mainMyObject = (MyObject ) element.getValue ();
The test() method contains the code below:
public static Node test(){
Document document = parseXmlDom();
return document.getFirstChild();
}
private static Document parseXmlDom() {
Document document = null;
try {
// getting the default implementation of DOM builder
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
// parsing the XML file
document = builder.parse(new File("C:\\test_XML_files\\MyXML_FILE.apx"));
} catch (Exception e) {
// catching all exceptions
System.out.println();
System.out.println(e.toString());
}
return document;
}
Is this the standard way of converting XML to a Java object?
I tested that I can access the object, and everything works fine so far.
Would you suggest a different approach, or is this one sufficient?
I don't know about a "standard way", but either way looks OK to me. The first way looks simpler (less code), so that's the way I'd probably do it, unless there were other factors or requirements.
FWIW, I'd expect that the unmarshal(File) method was implemented to do pretty much what you are doing in your second approach.
However, it is really up to you (and your co-workers) to make judgments about what is "sufficient" for your project.
