Is it essential that I use libraries to manipulate XML? - java

I am using Java back end for creating an XML string which is passed to the browser. Currently I am using simple string manipulation to produce this XML. Is it essential that I use some XML library in Java to produce the XML string?
I find the libraries very difficult to use compared to what I need.

It's not essential, but advisable. However, if string manipulation works for you, then go for it! There are plenty of cases where small or simple XML text can be safely built by hand.
Just be aware that creating XML text is harder than it looks. Here are some criteria I would consider:
First: how much control do you have over the information that goes into the XML?
The less control you have over the source data, the more likely you are to run into trouble, and the more advantageous the library becomes. For example: (a) Can you guarantee that the element names will never contain a character that is illegal in a name? (b) What about quotes in an attribute's content: can they occur, and are you handling them? (c) Does the data ever contain anything that might need to be encoded as an entity (like the less-than sign, which often needs to be output as &lt;), and are you doing it correctly?
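Point (c) is where hand-rolled XML most often goes wrong. Here is a minimal sketch of what correct text escaping involves (the class and method names are invented for the example); a library does this, plus the attribute and name rules, for you:

```java
// Hypothetical minimal escaper for XML text and double-quoted attribute content.
public class EscapeDemo {
    static String escapeXmlText(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break; // '&' must always be escaped
                case '<':  sb.append("&lt;");   break; // '<' starts a tag otherwise
                case '>':  sb.append("&gt;");   break; // required in the "]]>" sequence
                case '"':  sb.append("&quot;"); break; // needed inside double-quoted attributes
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXmlText("Fish & Chips <ltd> \"quoted\""));
    }
}
```

And this still ignores illegal control characters, entity references in element names, and encoding declarations, which is the point of the criteria above.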
Second, maintainability: is the code that builds the XML easy to understand by someone else?
You probably don't want to be stuck with the code for life. I've worked with second-hand C++ code that hand-builds XML and it can be surprisingly obscure. Of course, if this is a personal project of yours, then you don't need to worry about "others": substitute "in a year" for "others" above.
I wouldn't worry about performance. If your XML is simple enough that you can hand-write it, any overhead from the library is probably meaningless. Of course, your case might be different, but you should measure to prove it first.
Finally, yes: you can build XML text by hand if it's simple enough; but not knowing the available libraries is probably not the right reason to do so.
A modern XML library is quite a powerful tool, but it can also be daunting. However, learning the essentials of your XML library is not that hard, and it can be quite handy; among other things, it's almost a requisite in today's job market. Just don't get bogged down in namespaces, schemas, and other fancier features until you have the essentials.
Good luck.

XML is hard. Parsing it yourself is a bad idea, and generating content yourself is even worse. Have a look at the XML 1.1 spec.
You have to deal with things like proper encoding, attribute escaping (which, done wrong, produces invalid XML), proper CDATA escaping, UTF encoding, custom DTD entities, and that's before throwing XML namespaces into the mix, with the default/empty namespace, namespace attributes, etc.
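For contrast, here is a sketch of letting the JDK's own StAX writer (javax.xml.stream, no third-party library needed) handle the escaping rules above; the element and attribute names are invented for the example:

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;

// The standard-library StAX writer applies the escaping rules for us.
public class StaxWriteDemo {
    public static String writeNote() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("note");
        w.writeAttribute("title", "Fish & \"Chips\""); // escaped for us
        w.writeCharacters("1 < 2");                    // likewise
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(writeNote());
    }
}
```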
Learn a toolkit, there's plenty available.

I think that custom string manipulation is fine, but you have to keep two things in mind:
Your code isn't as mature as the library's. Allocate time in your plan to handle the bugs that pop up.
Your approach will probably not scale as well as a third-party library when the XML starts to grow (both in terms of performance and ease of use).
I know a code base that uses custom string manipulation for XML output (and a third-party library for input). It was fine to begin with, but became a real hassle after a while.

Yes, use the library.
Somebody took the time and effort to create something that is usually better than what you could come up with. String manipulation is fine for sending back a single node, but once you start needing to manipulate the DOM or use an XPath query, the library will save you.

By not using a library, you risk generating or parsing data that isn't well-formed, which sooner or later will happen. For the same reason document.write isn't allowed in XHTML, you shouldn't write your XML markup as a string.

Yes.
It makes no sense to skip an essential tool: even writing XML is non-trivial, with having to escape all those ampersands and less-thans, not to mention namespace bindings (if needed).
And in the end, libraries can generally read and write XML not only more reliably but more efficiently (especially so for Java).
But you may have been looking at the wrong tools if they seem overcomplicated. Data binding with JAXB or XStream is simple; but for simple, straightforward XML output, I go with StaxMate. It can actually simplify the task in many ways (it automatically closes start tags, writes namespace declarations if needed, etc.).

No - if you can produce it yourself (as you are doing), and it will scale for your needs, you do not need any library.
Just ensure that your future needs are going to be met - complex XML creation is better done using libraries, some of which come in very simple flavors too.

The only time I've done something like this in production code was when a colleague and I built a pre-processor so that we could embed XML fragments from other files into a larger XML document. On load we would first parse these embeds (file references in XML comment strings) and replace them with the actual fragments they referenced. Then we would pass the combined result on to the XML parser.

You don't have to use a library to parse XML, but check out the question "What considerations should be made before reinventing the wheel?"
before you start writing your own code for parsing/generating XML.

No - especially for generating (for parsing I would be less inclined, as input text can always surprise you). I think it's fine - but be prepared to switch to a library should you find yourself spending more than a few minutes maintaining your own code.

I don't think that using the DOM XML API which comes with the JDK is difficult. It's easy to create Element nodes, attributes, etc., and later it's easy to convert strings to a DOM document or DOM documents into a String.
From the first page Google finds from Spain (a Spanish XML example):
// Assumes the enclosing class declares the error fields: int coderror; String msgerror;
// Needs: javax.xml.transform.*, javax.xml.transform.dom.DOMSource,
// javax.xml.transform.stream.StreamResult, javax.xml.parsers.*,
// org.w3c.dom.Document, java.io.*, java.nio.charset.StandardCharsets
public String DOM2String(Document doc)
{
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer;
    try {
        transformer = transformerFactory.newTransformer();
    } catch (javax.xml.transform.TransformerConfigurationException error) {
        coderror = 123;
        msgerror = error.getMessage();
        return null;
    }
    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    Result result = new StreamResult(writer);
    try {
        transformer.transform(source, result);
    } catch (javax.xml.transform.TransformerException error) {
        coderror = 123;
        msgerror = error.getMessage();
        return null;
    }
    return writer.toString();
}

public Document string2DOM(String s)
{
    DocumentBuilder builder;
    try {
        builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    } catch (javax.xml.parsers.ParserConfigurationException error) {
        coderror = 10;
        msgerror = "Error creating factory in string2DOM: " + error.getMessage();
        return null;
    }
    try {
        // specify the charset explicitly instead of relying on the platform default
        return builder.parse(new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8)));
    } catch (org.xml.sax.SAXException error) {
        coderror = 10;
        msgerror = "SAX parse error in string2DOM: " + error.getMessage();
        return null;
    } catch (IOException error) {
        coderror = 10;
        msgerror = "I/O error in string2DOM: " + error.getMessage();
        return null;
    }
}


Is it bad practice to create XML files directly without using a class to store the structure? [duplicate]

In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility, and scalability are important factors. Consider the following piece of LINQ to XML:
XDocument doc = new XDocument(new XDeclaration("1.0","UTF-8","yes"),
new XElement("products", from p in collection
select new XElement("product",
new XAttribute("guid", p.ProductId),
new XAttribute("title", p.Title),
new XAttribute("version", p.Version))));
Can you find an easier way to do this? I can output it to a browser, save it to a document, add attributes/elements in seconds, and so on... just by adding a couple of lines of code. I can do practically everything with it without much effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think the problem is that you aren't viewing the XML file as a logical data store, but as a simple text file where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing XML should be treated like saving data to a database or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development time or at maintenance time. The choice is always yours - but history suggests that maintenance is always more costly, and thus anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In Sander's example of LINQ to XML, for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces, etc. correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating XML to be more of a chore than reading it in. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it back in using an XML reader. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and correct it. But until I get a working save/load system, I refuse to perform mission-critical work, because I need to know my tools are solid.
I guess it comes down to programmer preference. There are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However, I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of whether you're using StringBuilder or XmlNodes to save/read your file, if it is all a gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.

How to preserve XML nodes that are not bound to an object when using SAX for parsing

I am working on an Android app which interfaces with a Bluetooth camera. For each clip stored on the camera we store some fields about the clip (some of which the user can change) in an XML file.
Currently this app is the only one writing this XML data to the device, but in the future a desktop app or an iPhone app may write data here too. I don't want to assume that another app couldn't have additional fields (especially if it had a newer version of the app that added fields this version doesn't support yet).
So what I want to prevent is a situation where we add new fields to this XML file in another application, and then the user goes to use the Android app and it wipes out those other fields because it doesn't know about them.
So let's take a hypothetical example:
<data>
<title>My Title</title>
<date>12/24/2012</date>
<category>Blah</category>
</data>
When read from the device this would get translated to a Clip object that looks like this
(simplified for brevity)
public class Clip {
public String title, category;
public Date date;
}
So I'm using SAX to parse the data and store it to a Clip.
I simply store the characters in a StringBuilder and write them out when I reach the end element for title, category, and date.
I realized though that when I write this data back to the device, if there were any other tags in the original document they would not get written because I only write out the fields I know about.
This makes me think that maybe SAX is the wrong option and perhaps I should use DOM or something else where I could more easily write out any other elements that existed originally.
Alternatively, I was thinking my Clip class could contain an ArrayList of some generic XML type (maybe DOM), and in startElement I would check whether the element is one of the predefined tags; if not, I would store the whole structure until I reach the end of that tag (but in what?). Then, upon writing back out, I would just go through all of the additional tags and write them out to the XML file (along with the fields I know about, of course).
Is this a common problem with a good known solution?
-- Update 5/22/12 --
I didn't mention that in the actual XML the root node (actually called annotation) carries a version number, which has been set to 1. What I'm going to do for the short term is require that the version number my app supports is >= the version number of the XML data. If the XML has a greater number, I will attempt to parse it for reading but will deny any saves to the model. I'm still interested in a working example of how to do this, though.
BTW, I thought of another solution that should be pretty easy. I figure I can use XPath to find the nodes I know about and replace their content when the data is updated. However, I ran some benchmarks, and the overhead of parsing the XML into memory is absurd. Just the parsing operation, without doing any lookups, was 20 times slower than SAX; using XPath was 30-50 times slower in general for parsing, which is really bad considering I parse these in a list view.
So my idea is to keep the SAX to parse the nodes into clips, but store the entirety of the XML in a variable of the Clip class (remember, this XML is short: less than 2 KB). Then, when I go to write the data back out, I can use XPath to replace the nodes that I know about in the original XML.
Still interested in any other solutions though. I probably won't accept a solution though unless it includes some code examples.
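A sketch of that XPath-replace idea, using only the JDK's DOM, XPath, and Transformer APIs (the element names follow the hypothetical example above; <rating> stands in for an unknown field another app might add):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;

// Replace only the nodes we know about; unknown elements survive the round trip.
public class ClipUpdateDemo {
    public static String updateTitle(String xml, String newTitle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        XPath xp = XPathFactory.newInstance().newXPath();
        Node title = (Node) xp.evaluate("/data/title", doc, XPathConstants.NODE);
        title.setTextContent(newTitle); // touch only the node we understand
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<data><title>My Title</title><rating>5</rating></data>";
        System.out.println(updateTitle(xml, "My Title Updated"));
    }
}
```

The <rating> element passes through unmodified, which is the behavior the question asks for.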
Here's how you can go about it with SAX filters:
When you read your document with SAX, you record all the events and bubble them up to the next level of SAX reader. You basically stack two layers of SAX readers together (with XMLFilter): one records and relays, and the other is your current SAX handler that creates objects.
When you're ready to write your modifications back to disk you fire up the recorded SAX events layered with your writer that would overwrite those values/nodes you have altered.
I spent some time with the idea and it worked. It basically came down to proper chaining of XMLFilters. Here's what the unit test looks like; your code would do something similar:
final SAXParserFactory factory = SAXParserFactory.newInstance();
final SAXParser parser = factory.newSAXParser();
final RecorderProxy recorder = new RecorderProxy(parser.getXMLReader());
final ClipHolder clipHolder = new ClipHolder(recorder);
clipHolder.parse(new InputSource(new StringReader(srcXml)));
assertTrue(recorder.hasRecordingToReplay());
final Clip clip = clipHolder.getClip();
assertNotNull(clip);
assertEquals(clip.title, "My Title");
assertEquals(clip.category, "Blah!");
assertEquals(clip.date, Clip.DATE_FORMAT.parse("12/24/2012"));
clip.title = "My Title Updated";
clip.category = "Something else";
final ClipSerializer serializer = new ClipSerializer(recorder);
serializer.setClip(clip);
final TransformerFactory xsltFactory = TransformerFactory.newInstance();
final Transformer t = xsltFactory.newTransformer();
final StringWriter outXmlBuffer = new StringWriter();
t.transform(new SAXSource(serializer,
new InputSource()), new StreamResult(outXmlBuffer));
assertEquals(targetXml, outXmlBuffer.getBuffer().toString());
The important lines are:
your SAX events recorder is wrapped around the SAX parser
your Clip parser (ClipHolder) is wrapped around the recorder
when the XML is parsed, recorder will record everything and your ClipHolder will only look at what it knows about
you then do whatever you need to do with the clip object
the serializer is then wrapped around the recorder (basically re-mapping it onto itself)
you then work with the serializer, and it takes care of feeding the recorded events (delegating to the parent and registering itself as a ContentHandler) overlaid with what it has to say about the clip object.
Please find the DVR code and the Clip test over at github. I hope it helps.
p.s. It's not a generic solution, and the whole record->replay+overlay concept is very rudimentary in the provided implementation - basically an illustration. If your XML is more complex and gets "hairy" (e.g., the same element names at different levels, etc.), then the logic will need to be augmented. The concept will remain the same, though.
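The filter-chaining mechanism described above, reduced to a skeleton using only org.xml.sax.helpers.XMLFilterImpl from the JDK (this version just records element names rather than full events, to show the relay idea; a real recorder would capture every event for replay):

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;
import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// An XMLFilterImpl sits between the real parser and the next handler:
// it sees every event and passes it along unchanged.
public class FilterSkeleton extends XMLFilterImpl {
    final List<String> seen = new ArrayList<>();

    FilterSkeleton(XMLReader parent) { super(parent); }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        seen.add(qName);                             // "record" the event (name only here)
        super.startElement(uri, local, qName, atts); // relay it downstream
    }

    public static List<String> elementNames(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        FilterSkeleton filter = new FilterSkeleton(reader);
        filter.parse(new InputSource(new StringReader(xml)));
        return filter.seen;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<data><title>t</title><extra/></data>"));
    }
}
```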
You're right to say that SAX is probably not the best option if you want to keep the nodes that you've not "consumed". You could still do it using some kind of "SAX store" that keeps the SAX events and replays them (there are a few implementations of such a thing around), but an object-model-based API would be much easier to use: you'd easily keep the complete object model and just update "your" nodes.
Of course, you can use DOM which is the standard, but you may also want to consider alternatives which provide an easier access to the specific nodes that you'll be using in an arbitrary data model. Among them, JDOM (http://www.jdom.org/) and XOM (http://www.xom.nu/) are interesting candidates.
If you're not bound to a specific xml schema, you should consider doing something like this:
<data>
<element id="title">
myTitle
</element>
<element id="date">
18/05/2012
</element>
...
</data>
and then store all those elements in a single ArrayList.
In this way you wouldn't lose info, and you'd still have the option of choosing which elements you want to show, edit, etc.
Your assumption that XPath is 20x slower than SAX parsing is flawed... SAX parsing is just a low-level tokenizer on which your processing logic would be built, and your processing logic would require additional parsing... XPath's performance has a lot to do with the implementation... As far as I know, vtd-xml's XPath is at least an order of magnitude faster than DOM in general, and is far better suited for heavy-duty XML processing... below are a few links to further references...
http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf
Android - XPath evaluate very slow

A nice Java XML DOM utility

I find myself writing the same verbose DOM manipulation code again and again:
Element e1 = document.createElement("some-name");
e1.setAttribute("attr1", "val1");
e1.setAttribute("attr2", "val2");
document.appendChild(e1);
Element e2 = document.createElement("some-other-name");
e1.appendChild(e2);
// Etc, the same for attributes and finding the nodes again:
Element e3 = (Element) document.getElementsByTagName("some-other-name").item(0);
Now, I don't want to switch architectures altogether, i.e. I don't want to use JDOM, JAXB, or anything else. Just Java's org.w3c.dom. The reasons for this are
It's about an old and big legacy system
The XML is used in many places and XSLT transformed several times to get XML, HTML, PDF output
I'm just looking for convenience, not a big change.
I'm just wondering if there is a nice wrapper library (e.g. with apache commons or google) that allows me to do things like this with a fluent style similar to jRTF:
// create a wrapper around my DOM document and manipulate it:
// like in jRTF, this code would make use of static imports
dom(document).add(
element("some-name")
.attr("attr1", "val1")
.attr("attr2", "val2")
.add(element("some-other-name")),
element("more-elements")
);
and then
Element e3 = dom(document).findOne("some-other-name");
The important requirement I have here is that I explicitly want to operate on a org.w3c.dom.Document that
already exists
is pretty big
needs quite a bit of manipulation
So transforming the org.w3c.dom.Document into JDOM, dom4j, etc seems like a bad idea. Wrapping it with adapters is what I'd prefer.
If it doesn't exist, I might roll my own, as this jRTF syntax looks really nice! And for XML, it seems quite easy to implement, as there are only a few node types. This could become as powerful as jQuery from the fluent-API perspective!
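For what it's worth, a toy sketch of what such a wrapper might look like over a plain org.w3c.dom.Document; all the names here (Dom, dom, add, attr) are invented for the sketch, and error handling is omitted:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// A toy fluent wrapper operating directly on an existing org.w3c.dom.Document.
public class Dom {
    private final Document doc;
    private final Element el;

    private Dom(Document doc, Element el) { this.doc = doc; this.el = el; }

    public static Dom dom(Document doc) {
        if (doc.getDocumentElement() == null)
            doc.appendChild(doc.createElement("root")); // hypothetical default root
        return new Dom(doc, doc.getDocumentElement());
    }

    public Dom add(String name) {
        Element child = doc.createElement(name);
        el.appendChild(child);
        return new Dom(doc, child); // descend into the new element
    }

    public Dom attr(String name, String value) {
        el.setAttribute(name, value);
        return this;                // stay on the same element
    }

    public Element get() { return el; }

    public static void main(String[] args) throws Exception {
        Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        dom(d).add("some-name").attr("attr1", "val1").add("some-other-name");
        System.out.println(d.getDocumentElement().getFirstChild().getNodeName());
    }
}
```

With static imports this reads close to the jRTF-style snippet above, while every operation is still a plain DOM mutation on the original document.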
To elaborate on my comment, dom4j gets you pretty close to what you want:
final Document dom = DocumentHelper.createDocument().addElement("some-name")
.addAttribute("attr1", "val1")
.addAttribute("attr2", "val2")
.addElement("some-other-name").getDocument();
System.out.println(dom.asXML());
Output:
<?xml version="1.0" encoding="UTF-8"?>
<some-name attr1="val1" attr2="val2"><some-other-name/></some-name>
I know it's not native DOM, but it's very similar and it has very nice features for Java developers (element iterators, live element lists etc.)
I found some tools that roughly do what I asked for in my question:
http://code.google.com/p/xmltool/
http://jsoup.org/
However, in the meantime, I am more inclined to roll my own. I'm really a big fan of jQuery, and I think jQuery can be mapped to a Java fluent API:
http://www.jooq.org/products/jOOX
Well, this may be silly, but why don't you implement that little API on your own? I'm sure you know the DOM API pretty well, and it won't take much time to implement what you want.
By the way, consider using XPath for manipulating the document (you can also implement your mini-API over that one).

recommended parser for XML in java(absolute beginner to xml)

which parser (Java) would you recommend for parsing GPX data?
I'm looking for one that is very intuitive to use and should not need too much RAM (it seems that DOM requires too much, doesn't it?). I have no idea about parsing XML, so it is time for me to learn this ;-)
My documents are not very huge and are always read twice (a point for DOM), but I want to keep as few things as possible in RAM.
What would you do in this situation? Which one would you choose, and why?
Unless you have a special reason to use a third-party library for XML parsing, I'd just use the standard Java API. See the package javax.xml.parsers. Assuming you have the XML in a file, you can parse it into an org.w3c.dom.Document (also part of Java's standard API) like this:
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new File(filename));
Since your files are not so large, using DOM would be the easiest and most obvious choice.
You can use the methods of the Document object and related objects (see the classes in the org.w3c.dom package) to get at the data, or use XPath (see the package javax.xml.xpath).
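For example, with the javax.xml.xpath API (the GPX-ish snippet here is invented for illustration and ignores the real GPX namespace, which a production parser would have to handle):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

// Parse a small document into DOM and pull one value out with XPath.
public class GpxXPathDemo {
    public static String trackName(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // evaluate() with a String return type yields the node's text content
        return XPathFactory.newInstance().newXPath().evaluate("/gpx/trk/name", doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(trackName("<gpx><trk><name>Morning ride</name></trk></gpx>"));
    }
}
```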
JAXB, that Quotidian mentions, is also in the standard API since Java 6, but it might be a bit more work to set up than using the standard DOM API.
I would suggest XPP,
http://www.extreme.indiana.edu/xgws/xsoap/xpp/
It's more efficient than DOM and easier to use than SAX.
StAX is another popular pull parser,
http://stax.codehaus.org/Home
Depending on what you want to do with the XML, JAXB might be a possibility. The idea is to convert XML into POJOs without messing (directly) with the parsers. (Although you can still mess with the parsers if needed.) Plus, I've found that since JAXB is in the javax root package, it tends to play more nicely with standards.
I'm a big fan of Apache Digester, since it lets you define translation rules and go straight to Java objects. I think JAXB does something similar.
I would start with an IDE like Oxygen.

Error-tolerant XML parsing in Scala

I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML and deal with it the same way; otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first and when you then see a <child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
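If the goal is just pulling hrefs out of sloppy HTML without adding a dependency, one little-known stopgap is the JDK's old Swing HTML parser, which tolerates unclosed and badly nested tags; this is a hedged sketch of that fallback, not a recommendation over Tag Soup or htmlcleaner:

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Collect href attributes from <a> tags using the lenient Swing HTML parser.
public class LinkScrape {
    public static List<String> hrefs(String html) throws Exception {
        final List<String> out = new ArrayList<>();
        new ParserDelegator().parse(new StringReader(html),
                new HTMLEditorKit.ParserCallback() {
                    @Override
                    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                        Object href = a.getAttribute(HTML.Attribute.HREF);
                        if (t == HTML.Tag.A && href != null) out.add(href.toString());
                    }
                }, true);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // note the missing </a> and </p> tags -- the parser copes anyway
        System.out.println(hrefs("<p><a href=\"a.html\">one<a href=\"b.html\">two"));
    }
}
```

It produces an event stream rather than a tree, so the objections above about tree-building from broken markup still apply.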
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Being an event-based API, I am assuming it will be more fault-tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
