In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when to comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0","UTF-8","yes"),
new XElement("products", from p in collection
select new XElement("product",
new XAttribute("guid", p.ProductId),
new XAttribute("title", p.Title),
new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding couple lines of code. I can do practically everything with it without much of effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et. al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think that the problem is that you aren't watching the xml file as a logical data storage thing, but as a simple textfile where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing xml should be something similar to saving datas into a database or something logically similar
If you need trivial XML then it's fine. Its just the maintainability of string concatenation breaks down when the xml becomes larger or more complex. You pay either at development or at maintenance time. The choice is yours always - but history suggests the maintenance is always more costly and thus anything that makes it easier is worthwhile generally.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In #Sander's example of Linq-to-XML for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correct using string concatenation, especially now you have XML linq that allows for simple construction of an XML graph and will get namespaces, etc correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and corret it. But until I get a working save/load system, I refuse to perform mission-critical work until I know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
I want to read an XML starting from the top to the bottom using Java. However, I don't want to use recursive functions because I want to be able to jump to a different element and start reading from that position.
I've tried using getParent() and indexOf() methods (All three libraries below have these methods) to do this, but it's gotten very messy, mainly because the methods don't distinguish between attributes and elements.
I'm sure there must be a simple way to do this, but after trying dom4j, jdom, and xom, I still have not found a solution.
[Edit] More Information:
My friend wants to make a console text-based game in a question/answer type style. Instead of hard-coding it into java, I decided to try and make it read an XML file instead, because XML has a tree-like style that would be convenient. Here is an example of what my XML file might look like:
<disp>Text to be displayed</disp>
<disp>Text to be displayed afterward</disp>
<disp>What is your favorite color?</disp>
<question>
<answer name="orange">
<disp>Good choice.</disp>
<!-- More questions and stuff -->
</answer>
<default>
<disp>Wrong. The correct answer was orange.</disp>
</default>
</question>
I don't know if it taboo to use XML like an pseudo programming language. If anyone has other suggestions feel free to give them.
Your design is basically good and an example of Declarative Programming. You should read your XML files using an XML parser either into a DOM or using SAX. Since I think you will want to revisit nodes I suspect you will need a DOM (FWIW I use XOM, xom.nu). One of the best examples of XML-based declarative programming is XSLT where the data and commands are all XML.
I use this model a great deal. It has the advantage that the data structure can be external and can be edited.
(Note that your XML needs a root element)
but it's gotten very messy, mainly because the methods don't
distinguish between attributes and elements.
All DOM or SAX tools differentiate very clearly between attributes and elements, so if there is confusion it is somewhere else.
It would be good if you would show what you want to do with the XML snippets, but normally if you want to read off of any kind of file, use java.util.Scanner.
Scanner scan = new Scanner (new File("file.xml"));
while (scan.hasNext()) {
String theData = scan.nextLine();
}
scan.close();
This should return the values that you need until it runs out of lines to scan.
Hope it works and Happy Coding!
Since you want to read a xml file. I recommend go for SAX parser. It is event based parser very fast and efficient for xml reading (top down approach).
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/ will explain about usage of sax parser.
Thanks
In many REST based API calls, we have this parameter called nextURL, using which we can query for the next URL. This is usually in the root element.(or may be the next one)
In general how do you guys read this? In case you are using a standard XML parser, it reads and loads the entire XML and then you get to read the nextURL by getElementsByTag. Is there a better work around? Reading the entire xml is of course waste of time/memory.
Edit: An example XML would be something like
<result pubisher="xyz" nextURL="http://actualurl?since_date=<newdate>">
<element>adfsaf</element>
..
</result>
I need to capture the new since_date without reading the entire XML.
Python: You could use the ElementTree iterparse method ... provided the data you want is in an attribute, which will have been parsed by the time that you get the start event. If it's in the text or tail of the element, you will have to wait until the end event. It would be a good idea if you edited your question to show what your XML looks like, and explain "or maybe in the next one" with an example.
The term "Standard XML parser" covers a lot of territory, so much so that I don't think that you can generalize their behaviors. For instance, a standard DOM parser is tree-based and will read the entire XML into memory, but a SAX parser (and I think StAX as well) won't but rather will advance as the app desires it to advance. It sounds like the latter, a SAX or StAX parser, is what you need.
Edit: Please be sure to read KitsuneYMG's comment below on the difference between SAX and StAX behaviors.
Can anybody suggest how can I solve this, there is XML document that I'm trying to append. I'm looking up nodes with xpath, the things is the software which generates this XML sometimes screws it up as following :
<element name="Element ">Element value</element>
So when I'm looking the node up with xpath using //element[#name="Element"] without a blank space I don't get a match. Naturally I get match with this //element[#name="Element "]
Is there something I can do to match this without blank space?Does xpath accept regular expressions or there is more smart way to do this, I can change the xml file as well after it has been generated by the sofware(with faulty user input).
(untested).
Would
//element[normalize-space(#name)=="Element"]
work?
A more general (though less accurate) solution could be:
//element[contains(#name,"Element")]
However that would also catch things like name="Elements" and name="Elementary" etc.
can you recommend some useful resources for xpath
Unfortunately there aren't many great such resources in my experience - w3schools is probably your best bet, or the spec itself for more advanced stuff.
When I try to validate an XML file against an XSD in java (see this example) there are some incompatibilities between the regular expressions given in the XSD file and the regular expressions in java.
If there is an regular expression like "[ab-]" in the XSD (meaning any of the characters "a", "b" or "-", java complains about a syntax error in the expression.
This is a known bug since 28-MAR-2005, see Sun bug database.
What can I do to work around this bug? Up to now I try to "correct" the XSD file by replacing the "[ab-]" by "[ab\-]", but sometimes this is not an option.
If you have problems with this bug, too, please vote for it at the Sun bug database!
Since a bug is already filed, I'd recommend you try a different XML Schema processor. There's not going to be a lot you can do about it.
If you can preprocess the stream the XSD is coming in on, then you could create a parser which understands the basic regular expression structure and can fix anything that looks of the form [.*-] (where the .star is not a literal in this case).
Although it may not be the best solution in the world, you could consider using the Sax parser. I have used it for over 3 years now, however I have not done much regex validation with it, so I cannot speak to it's robustness related to that.
Other than that, I think Kaleb is probably correct on the preprocessing side (which is anything but ideal) - you might be able to use a regex for any of the incoming regex'es to do a replace.... although that has quite the code smell about it.
Edit:
An additional thought that just came to me. If the regex does not need to be in the xsd - i.e. it is there simply because that was "easiest" in the past - you could do the regex validation outside of the xsd. But, if other systems use the xsd, that is likely not the correct solution, and you can forget I said anything.