I am trying to write XML data using StAX where the content itself is HTML.
If I try:
xtw.writeStartElement("contents");
xtw.writeCharacters("<b>here</b>");
xtw.writeEndElement();
I get this
<contents>&lt;b&gt;here&lt;/b&gt;</contents>
Then I notice the CDATA method and change my code to:
xtw.writeStartElement("contents");
xtw.writeCData("<b>here</b>");
xtw.writeEndElement();
and this time the result is
<contents><![CDATA[<b>here</b>]]></contents>
which is still not good. What I really want is
<contents><b>here</b></contents>
So is there an XML API/library that allows me to write raw text without being in a CDATA section? So far I have looked at StAX and JDOM and they do not seem to offer this.
In the end I might resort to good old StringBuilder but this would not be elegant.
Update:
I agree mostly with the answers so far. However instead of <b>here</b> I could have a 1MB HTML document that I want to embed in a bigger XML document. What you suggest means that I have to parse this HTML document in order to understand its structure. I would like to avoid this if possible.
Answer:
It is not possible, otherwise you could create invalid XML documents.
The issue is that <b>here</b> is not raw text; it is an element, so you should be writing:
xtw.writeStartElement("contents");
xtw.writeStartElement("b");
xtw.writeCharacters("here");
xtw.writeEndElement();
xtw.writeEndElement();
If you want the XML to be included AS XML and not as character data, then it has to be parsed at some point. If you don't want to manually do the parsing yourself, you have two alternatives:
(1) Use external parsed entities -- in this case the external file will be pulled in and parsed by the XML parser. When the output is again serialized, it will include the contents of the external file.
[ See http://www.javacommerce.com/displaypage.jsp?name=entities.sql&id=18238 ]
(2) Use XInclude -- in that case the file has to be run through an XInclude processor, which will merge the XInclude references into the output. Most XSLT processors, as well as xmllint, will also do XInclude with an appropriate option.
[ See: http://www.xml.com/pub/a/2002/07/31/xinclude.html ]
(XSLT can also be used to merge documents without using the XInclude syntax. XInclude just provides a standard syntax.)
The problem is not "here", it's <b></b>.
Add the <b> element as a child of contents and you'll be able to do it. Any library like JDOM or DOM4J will allow you to do this. The general case is to parse the content into an XML DOM and add the root element as a child of <contents>.
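The "parse it and add the root element as a child" idea can be sketched with the JDK's built-in W3C DOM rather than JDOM or DOM4J. This is only a minimal sketch: the class and method names (DomEmbed, embed) are made up for illustration, and it assumes the HTML is well-formed XML (XHTML) with a single root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class DomEmbed {
    /** Parses the fragment and attaches its root element under a new <contents> element. */
    static Document embed(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();

        // Target document with <contents> as its root.
        Document target = builder.newDocument();
        Element contents = target.createElement("contents");
        target.appendChild(contents);

        // Parse the fragment; this fails if it is not well-formed XML.
        Document parsed = builder.parse(new InputSource(new StringReader(xhtml)));

        // A node belongs to the document that created it, so import a deep copy first.
        contents.appendChild(target.importNode(parsed.getDocumentElement(), true));
        return target;
    }
}
```

Note that this builds the whole fragment in memory, so it is a poor fit for very large inputs.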
You can't add unescaped markup as text outside of a CDATA section.
If you want to embed a large HTML document in an XML document then CDATA imho is the way to go. That way you don't have to understand or process the internal structure and you can later change the document type from HTML to something else without much hassle. Also I think you can't embed e.g. DOCTYPE instructions directly (i.e. as structured data that retains the semantics of the DOCTYPE instruction). They have to be represented as characters.
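One thing to watch with the CDATA route: a CDATA section cannot contain the sequence "]]>", so arbitrary HTML has to be split across several sections wherever that sequence occurs. A minimal sketch, assuming a StAX XMLStreamWriter as in the question; CDataEmbed, writeCDataSafely and wrap are hypothetical names.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of any library.
class CDataEmbed {
    /** Writes text as one or more CDATA sections, breaking inside every "]]>". */
    static void writeCDataSafely(XMLStreamWriter xtw, String text) throws XMLStreamException {
        int start = 0;
        int idx;
        while ((idx = text.indexOf("]]>", start)) >= 0) {
            // End this section after "]]"; the ">" opens the next section.
            xtw.writeCData(text.substring(start, idx + 2));
            start = idx + 2;
        }
        xtw.writeCData(text.substring(start));
    }

    /** Wraps the given HTML, unparsed, in a <contents> element. */
    static String wrap(String html) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xtw.writeStartElement("contents");
        writeCDataSafely(xtw, html);
        xtw.writeEndElement();
        xtw.close();
        return out.toString();
    }
}
```

A parser reading the result back reports the adjacent sections as consecutive character data, so the consumer sees the original text again.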
(This is primarily a response to your update, but alas I don't have enough rep to comment.)
I don't see what the problem is with parsing the large block of XML you want to insert into your output. Use a StAX parser to parse it, and just write code to forward all of the events to your existing serializer (variable "xtw").
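Forwarding the parser events to the serializer could look roughly like the sketch below. The names StaxForwarder, copy and embed are made up for illustration; xtw is the writer from the question. It assumes the blob is well-formed, and it deliberately ignores namespaces, comments and processing instructions, which a real implementation would have to forward as well.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper, not part of any library.
class StaxForwarder {
    /** Replays the elements, attributes and text of a well-formed fragment into xtw. */
    static void copy(String xhtml, XMLStreamWriter xtw) throws XMLStreamException {
        XMLStreamReader xtr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xhtml));
        while (xtr.hasNext()) {
            switch (xtr.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    xtw.writeStartElement(xtr.getLocalName());
                    for (int i = 0; i < xtr.getAttributeCount(); i++) {
                        xtw.writeAttribute(xtr.getAttributeLocalName(i),
                                xtr.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    xtw.writeCharacters(xtr.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    xtw.writeEndElement();
                    break;
                default:
                    // Comments, processing instructions, DTDs etc. are dropped in this sketch.
                    break;
            }
        }
        xtr.close();
    }

    /** Writes <contents> around the forwarded fragment and returns the result. */
    static String embed(String xhtml) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xtw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xtw.writeStartElement("contents");
        copy(xhtml, xtw);
        xtw.writeEndElement();
        xtw.close();
        return out.toString();
    }
}
```

Because both sides stream, the 1MB document is never held as a tree; only the current event is in play at any time.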
If the blob of html is actually xhtml then I'd suggest doing something like (in pseudo-code):
xtw.writeStartElement("contents")
XMLReader xtr=new XMLReader();
xtr.read(blob);
Dom dom=xtr.getDom();
for(element e:dom){
xtw.writeElement(e);
}
xtw.writeEndElement();
or something like that. I had to do something similar once but used a different library.
If your XML and HTML are not too big, you could make a workaround:
xtw.writeStartElement("contents");
xtw.writeCharacters("anUniqueIdentifierForReplace"); // <--
xtw.writeEndElement();
When you have your XML as a String:
xmlAsString = xmlAsString.replace("anUniqueIdentifierForReplace", yourHtmlAsString); // Strings are immutable, so keep the result
I know, it's not so nice, but this could work.
Edit: Of course, you should check if yourHtmlAsString is valid.
Related
I have String content that comes from a SOAP XML request.
I want to identify a specific tag <FieldName> having the value MyFieldIndicator, and from there fetch the value of the next occurring <FieldValue>.
How could I do this? Is there any library? Or which mechanisms could be used to extract those tags from a plain string?
Performance matters, so it should be as quick as possible.
The dom4j library is probably the easiest to use and understand. You need to create a Document object from the string and then access the tags via iterators. The documentation and examples can be found here
Generally you can probably use any DOM or SAX parser.
Other DOM parsers: W3C, JDOM
SAX parser: JAXP
If you by any chance want to store the content in an object, try to learn JAXB.
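Since performance matters here, a streaming parser avoids building a tree at all. The sketch below uses the JDK's StAX pull parser, which is not one of the libraries named above but ships with Java. FieldLookup and findValue are hypothetical names, and it assumes <FieldName> and <FieldValue> contain only text and carry no namespace prefix that matters.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
class FieldLookup {
    /**
     * Returns the text of the first <FieldValue> that follows a <FieldName>
     * whose text equals wanted, or null if there is none.
     */
    static String findValue(String xml, String wanted) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        boolean nameSeen = false;
        try {
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // only start tags are interesting
                }
                String tag = reader.getLocalName();
                if ("FieldName".equals(tag)) {
                    if (wanted.equals(reader.getElementText().trim())) {
                        nameSeen = true;
                    }
                } else if (nameSeen && "FieldValue".equals(tag)) {
                    return reader.getElementText(); // stop reading here
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }
}
```

The method returns as soon as the value is found, so the rest of the request is never parsed.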
I have a very large XML document which I receive as input. From this XML I just need a single child element. Parsing the entire XML to retrieve just one element seems like performance overkill. Are there any better approaches to resolve this issue?
One approach would be to use the DocumentBuilder API to parse the XML and then use XPath to retrieve the desired field. But the parse method will still unnecessarily parse the entire XML. Is there an overloaded parse method in any parser implementation that takes the XPath and parses the XML only according to the XPath?
What you need is a SAX parser or a similar fast streaming parser. A SAX parser does not build the whole document in memory; it reads the XML sequentially, and your handler can abort parsing as soon as it has found the element it is looking for.
You can read about SAX parsers on Wikipedia. Also have a look at the Javadoc for the SAX parser.
Although there is no way around parsing for the proper treatment of your XML data, there is definitely a way around building an in-memory representation of the entire document. Java offers SAX parsing, which is event-based. You can implement an event handler for XML events, ignoring everything on the way to the content that you need, and stopping after retrieving the part that you are looking for.
Here is a tutorial from Oracle showing how to use SAX APIs to retrieve counts of individual tags without building a document in memory.
Since most XPath processors work with SAX as well, you could potentially feed events to an XPath processor, and look for the desired tag in that way, too. However, this may be overkill for a situation where you need to fetch a single element.
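The event-handler approach described above can be sketched as follows. A common way to stop a SAX parse early is to throw an exception from the handler once the value has been captured. FirstElementText, Found and find are hypothetical names; the sketch matches on the qualified element name and assumes the target element is not nested inside another element of the same name.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of any library.
class FirstElementText {
    /** Thrown by the handler to abort parsing once the value has been captured. */
    static final class Found extends SAXException {
        Found() {
            super("target element read");
        }
    }

    /** Returns the text of the first element called elementName, or null if absent. */
    static String find(String xml, String elementName) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inside;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals(elementName)) {
                    inside = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inside) {
                    text.append(ch, start, length); // may be called several times per element
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (inside && qName.equals(elementName)) {
                    throw new Found(); // abandon the rest of the document
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Found stop) {
            return text.toString();
        }
        return null; // document ended without the element
    }
}
```

For a large input you would pass an InputStream or file-backed InputSource instead of a String, so that the unread remainder is never loaded.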
XPath operates over the document object model. So you have to have a DOM in order to evaluate an XPath expression. Otherwise what would it evaluate against?
So XPath is out if you don't want to parse the document. Your other option is fast SAX parsing, where you ignore all SAX parsing events until you get to the element that you want, extract the text that you want, and then abandon the rest of the parsing process.
The other option is to go way simpler: use grep.
I don't know how to read data from such an XML file. Let's say I want to read every GUID and userID. How do I do it?
Here is part of XML: http://pastebin.com/7B25eyFz
If your XML file is tree-based then use DOM; if it is not nested then use SAX, which is faster than DOM.
You may use Xstream
Look into SAX Parser. Also, do a search for your terms - there are a ton of questions about this topic.
Have you read the trail about XML of the Java tutorial?
You should use an XML library like XOM. You can then use it to query the XML document using XPATH. XOM offers a tutorial.
Adding to #user651407's point: if you just want to read the XML then go for SAX. It parses the XML serially, so it is faster. If you want to do more complex operations like adding, updating or deleting a node then go for DOM, but DOM has limitations:
1. It requires more memory, as the entire XML is loaded at once.
2. It is slower to process, as it is a tree-based parser.
I'm looking into how I can get values from specific XML nodes in an XML file that I have. In my application, I have the entire XML file in a string, and I want to grab the specific information from there. I've heard a little bit about DOM and SAX, but I don't exactly know where to start. Any help?
One of the easiest ways is to use xPath. Here's a tutorial.
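For XML held in a string, the JDK's javax.xml.xpath API can evaluate an expression directly against an InputSource. A minimal sketch; XPathFromString and valueOf are hypothetical names, and the element names in the usage note are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class XPathFromString {
    /** Evaluates the expression against the XML text and returns the result as a string. */
    static String valueOf(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // The XML is parsed into a tree internally before the expression is evaluated.
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }
}
```

For example, calling valueOf with "<order><customer><name>Ann</name></customer></order>" and "/order/customer/name" returns "Ann". If you need several values from the same document, parse it once into a DOM and evaluate each expression against that instead of re-parsing the string.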
You can either use XPath (example) or you can use DOM or SAX (as you mentioned). You can view my answer here (how to retrieve element value of XML using Java?) on SO.
Well, there is also Xstream
http://x-stream.github.io/index.html
It lets you go in both directions (object to XML, and XML to object).
Here is the "two minutes tutorial":
http://x-stream.github.io/tutorial.html
I want to parse a document that is not pure XML. For example:
my name is <j> <b> mike</b> </j>
Example 2:
my name is <mytag1 attribute="val" >mike</mytag1> and yours is <mytag2> john</mytag2>
That is, my input is not pure XML. It's similar to HTML but the tags are not HTML.
How can I parse it in Java?
Your examples are valid XML, except for the lack of a document element. If you know this to always be the case, then you could just wrap a set of dummy tags around the whole thing and use a standard parser (SAX, DOM...)
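Wrapping the input in dummy tags could look like the sketch below, here with the JDK's DOM parser. FragmentParser, parseFragment and the element name "wrapper" are arbitrary choices for illustration. It assumes the fragment is otherwise well-formed and does not start with an XML declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
class FragmentParser {
    /** Wraps mixed text/markup in a dummy root element and parses it as XML. */
    static Element parseFragment(String fragment) throws Exception {
        // The dummy root supplies the document element the fragment lacks.
        String wrapped = "<wrapper>" + fragment + "</wrapper>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));
        return doc.getDocumentElement();
    }
}
```

The returned element's child nodes are then the text runs and custom tags in document order, so the second example yields a text node, mytag1, another text node, and mytag2.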
On the other hand if you get something uglier (e.g. tags don't match up, or are spaced out in an overlapping fashion), you'll have to do something custom which will involve a number of rules that you have to decide on that will be unique to your application. (e.g. How do I handle an opening tag that has no close? What do I do if the closing tag is outside the parent?)
There are a few parsers that take HTML that is not well-formed and turn it into well-formed XML. Here is a comparison with examples that includes the most popular ones, except maybe HTMLParser. That is probably what you need.