I need your expertise once again. I have a java class that searches a directory for xml files (displays the files it finds in the eclipse console window), applies the specified xslt to these and sends the output to a directory.
What I want to do now is create an XML file containing the file names and file format types. The format should be something like:
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
<file>
<fileName> </fileName>
<fileType> </fileType>
</file>
Where for every file it finds in the directory it creates a new <file>.
Any help is truly appreciated.
Use an XML library. There are plenty around, and the third party ones are almost all easier to use than the built-in DOM API in Java. Last time I used it, JDom was pretty good. (I haven't had to do much XML recently.)
Something like:
Element rootElement = new Element("root"); // You didn't show what this should be
Document document = new Document(rootElement);
for (Whatever file : files)
{
    Element fileElement = new Element("file");
    fileElement.addContent(new Element("fileName").addContent(file.getName()));
    fileElement.addContent(new Element("fileType").addContent(file.getType()));
    rootElement.addContent(fileElement); // attach each <file> to the root
}
// outputString is an instance method, so create an XMLOutputter first
String xml = new XMLOutputter().outputString(document);
Have a look at DOM and ECS. The following example was adapted to your requirements from here:
XMLDocument document = new XMLDocument();
for (File f : files) {
    document.addElement( new XML("file")
        .addXMLAttribute("fileName", f.getName())
        .addXMLAttribute("fileType", f.getType()) // getType() is a placeholder; java.io.File has no such method
    );
}
You can use the StringBuilder approach suggested by Vinze, but one caveat is that you will need to make sure your filenames contain no special XML characters, and escape them if they do (for example replace < with &lt;, and deal with quotes appropriately).
In this case it probably doesn't arise and you will get away without it, however if you ever port this code to reuse in another case, you may be bitten by this. So you might want to look at an XMLWriter class which will do all the escaping work for you.
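For what it's worth, the JDK's own javax.xml.stream.XMLStreamWriter is one such writer that does the escaping for you. Below is a minimal sketch, not a definitive implementation. The root element name "files", the class name FileListXml and the idea of using the file extension as the "file type" are my assumptions, since the question doesn't specify them.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class FileListXml {

    // Returns the part of the name after the last dot, or "" if there is none.
    // Assumption: "file type" simply means the file extension.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return (dot < 0 || dot == fileName.length() - 1) ? "" : fileName.substring(dot + 1);
    }

    // Writes one <file> element per name inside an assumed <files> root.
    static String toXml(List<String> fileNames) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartElement("files");
            for (String name : fileNames) {
                w.writeStartElement("file");
                w.writeStartElement("fileName");
                w.writeCharacters(name);               // special characters are escaped here
                w.writeEndElement();
                w.writeStartElement("fileType");
                w.writeCharacters(extensionOf(name));
                w.writeEndElement();
                w.writeEndElement();                   // </file>
            }
            w.writeEndElement();                       // </files>
            w.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not write XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toXml(Arrays.asList("a&b.xml", "notes.txt")));
    }
}
```

In your class you would pass in the names of the files you already found in the directory, and write the result to a file instead of printing it.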
Well, just use a StringBuilder:
StringBuilder builder = new StringBuilder();
for (File f : files) {
    builder.append("<file>\n\t<fileName>").append(f.getName()).append("</fileName>\n");
    [...]
}
System.out.println(builder.toString());
I have two XML files, say abc.xml and 123.xml, which are almost the same: they share the same content, but the second one, 123.xml, has more content than the first. I want to read both files using Java and add the extra content to abc.xml without changing its existing content.
Java has a built-in XML parser. Search the net for how to use it.
Two links are given below:
http://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm
http://www.tutorialspoint.com/java_xml/java_dom_modify_document.htm
Simply read both of them, write the logic to compare them, and then write the combined result to another file.
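A rough sketch of that read, compare and combine logic with the built-in DOM parser could look like this. It is only an illustration under a big assumption: "extra content" here means child elements of the root of 123.xml that have no equal child (per Node.isEqualNode) in abc.xml. The class and method names are made up for the example, and your real comparison rule may well differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XmlMerge {

    // Parses XML text; for real files use DocumentBuilder.parse(File) instead.
    static Document parse(String xml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // True if the parent already has a child node equal to the candidate.
    static boolean hasEqualChild(Element parent, Node candidate) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).isEqualNode(candidate)) {
                return true;
            }
        }
        return false;
    }

    // Appends to base's root every child element of extra's root that base lacks.
    // Existing content of base is left untouched. Returns the number of added elements.
    static int mergeInto(Document base, Document extra) {
        Element baseRoot = base.getDocumentElement();
        NodeList extraChildren = extra.getDocumentElement().getChildNodes();
        int added = 0;
        for (int i = 0; i < extraChildren.getLength(); i++) {
            Node child = extraChildren.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue;                                   // skip whitespace and comments
            }
            if (!hasEqualChild(baseRoot, child)) {
                // importNode copies the node so it can live in the other document
                baseRoot.appendChild(base.importNode(child, true));
                added++;
            }
        }
        return added;
    }
}
```

After merging you would still have to write the base document back out, for example with a javax.xml.transform.Transformer.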
XMLUnit is the key, my friend. It compares two XML documents in a way that is advanced yet simple and easy to understand. To add a new tag, you can either use good old string concatenation or use JDOM or a similar library. Here are some good references:
http://www.jdom.org/docs/apidocs/
http://xmlunit.sourceforge.net/userguide/html/
String refXmlPath = "../test1.xml";
String testXmlPath = "../test2.xml";
Document doc1 = TransformXML.convertXmlToDom(refXmlPath);
Document doc2 = TransformXML.convertXmlToDom(testXmlPath);
// Configure XMLUnit before building the Diff so the settings apply to it.
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreComments(true);
XMLUnit.setIgnoreAttributeOrder(true);
Diff myDiff = new Diff(doc1, doc2);
assertXMLEqual("pieces of XML are not similar ", myDiff, true);
assertTrue("but are they identical? " + myDiff, myDiff.identical());
This part was taken from other links covering different cases.
Hit me up if you have more questions!
I have an XML file with thousands of tags whose text content I need to read, as in the screenshot below:
I am trying to read the text content of all the "word" tags using this code :
String filePath = "...";
File xmlFile = new File( filePath );
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document domObject = db.parse( xmlFile );
domObject.getDocumentElement().normalize();
NodeList categoryNodes = domObject.getElementsByTagName( "category" ); // Get all the <category> nodes.
for (int s = 0; s < categoryNodes.getLength(); s++) { //Loop on the <category> nodes.
String categoryName = categoryNodes.item(s).getAttributes().getNamedItem( "name" ).getNodeValue();
if( selectedCategoryName.equals( categoryName ) ) { //get its words.
NodeList wordsNodes = categoryNodes.item(s).getChildNodes();
for( int i = 0; i < wordsNodes.getLength(); i++ ) {
if( wordsNodes.item( i ).getNodeType() != Node.ELEMENT_NODE ) continue;
String word = wordsNodes.item( i ).getTextContent();
categoryWordsList.add( word ); // Some words are read wrong !!
}
break;
}
}
But for some reason many words are being read incorrectly. Examples:
"AMK6780KBU" is read as "9826</word"
"ASSI.ABR30326" is read as "rd>ASSI.AEP26"
"ASSI.25066" is read as "SI.4268</6"
It might be because the file is big. If I just add or remove some empty lines in the XML file, different words are read wrong than the ones mentioned above, which is strange!
You can download the XML file from here.
Solution
See below :-)
What I tried in the process
Changing the XML version from 1.1 -> 1.0 fixed the problem for me. I'm using Java 1.6.0_33 (as @orique pointed out in the comments).
In my tests there are definitely issues with corruption after a certain number of nodes. I narrowed it down to somewhere around ASSI.MTK69609. Removing everything, including that line fixed the corruption of the previous words.
The corruption is also resolved by simply changing the declaration to:
<?xml version="1.0"?>
and I saw zero corruption using the entire original source XML.
Similarly if you leave the version at 1.1 but remove whitespace nodes from the source, the result is as expected, for example:
<word>ASSI.MTK68490</word>
<word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>
results in the desired output and
<word>ASSI.MTK68490</word>
<word>ASSI.MTK6862617</word>
<word>ASSI.MTK693115</word>
<word>ASSI.MTK69609</word>
is corrupted.
Removing some end-of-line "nodes" also corrected the problem, for example
<word>ASSI.MTK693115</word><word>ASSI.MTK69609</word>
So it was all pointing towards a bug, but where...? Eventually it clicked! Xerces
The version of Xerces shipped with Java 1.6 (and probably 1.7) is old, old, old and buggy (for example #6760982). In fact, I can break my test class by simply adding:
Document domObject = db.parse( xmlFile );
domObject.normalizeDocument(); // <-- causes following Exception
Exception in thread "main" java.lang.NullPointerException
at com.sun.org.apache.xerces.internal.util.XML11Char.isXML11ValidNCName(XML11Char.java:340)
There have been many defects fixed for XML 1.1, so on a hunch I downloaded the latest version Xerces2 Java 2.11.0.
Simply running with the most recent version resulted in the expected uncorrupted output.
java -classpath .;xercesImpl.jar;xml-apis.jar Foo > foo.txt
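To double-check which parser implementation actually gets picked up after changing the classpath, something like the following small sketch can help. The class name WhichParser is just an illustration.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

final class WhichParser {

    // Returns the class name of the DocumentBuilder that JAXP selected.
    static String builderClassName() {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.getClass().getName();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser found", e);
        }
    }

    public static void main(String[] args) {
        // A name under com.sun.org.apache.xerces.internal is the copy bundled with the JDK;
        // a name under org.apache.xerces means a standalone Xerces jar was found on the classpath.
        System.out.println(builderClassName());
    }
}
```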
We have noticed that getTextContent() is buggy on some Windows implementations.
Our workaround is to do something like this
// getTextContent is buggy on some Java Windows Implementations
if ( n.getNodeType( ) == Node.ELEMENT_NODE ) {
results [ i ] = (String) xPathFunction.evaluate( "./text()", n, XPathConstants.STRING );
} else { //Node.TEXT_NODE
results [ i ] = n.getNodeValue( );
}
xPathFunction is a javax.xml.xpath.XPath. Expensive, but works reliably.
Actually in your case I would directly use an XPath and call something like,
NodeList l = (NodeList) xPathFunction.evaluate( "/categories/category/word/text()", domObject, XPathConstants.NODESET );
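Spelled out a bit more, the direct XPath approach might look like the sketch below. It assumes the document layout is categories/category/word with a name attribute on each category, as in the question. The class name CategoryWords and the use of an XPath variable for the category name are my own choices for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class CategoryWords {

    // Returns the text of every <word> inside the <category> with the given name.
    static List<String> wordsOf(String xml, final String categoryName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Pass the category name as $name instead of concatenating it into
            // the expression, so quotes in the name cannot break the query.
            xpath.setXPathVariableResolver(
                    qName -> "name".equals(qName.getLocalPart()) ? categoryName : null);

            NodeList nodes = (NodeList) xpath.evaluate(
                    "/categories/category[@name=$name]/word/text()",
                    doc, XPathConstants.NODESET);

            List<String> words = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                words.add(nodes.item(i).getNodeValue());
            }
            return words;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read words", e);
        }
    }
}
```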
EDIT
Beats me! On OSX, Java 1.6.0_43, I get the same behaviour. In case there was any doubt the DOM model is buggy in Java... The wrong values seem to reliably appear at certain intervals, which looks like some bytes buffer overrun. I never got an OOM error.
Here is what I have unsuccessfully tried:
word.getFirstChild().getNodeValue(); instead of word.getTextContent(); -> no change in behaviour
use an InputSource as an input into the DocumentBuilder instead of using a File
run an XPath ("/categories/category[@name='Category1']/word/text()") instead of looping over the nodes and manually traversing their children
run the same Test using Saxon as the XPath engine
check for "strange" characters in the XML file
I believe the DocumentBuilder is the culprit. It is a memory hog.
Your next best chance is to go for a SAX Parser or any other streaming parser. Since your data model is small and very simple, the implementation should be easy. To further ease implementation, you may try XMLDog. We use a slightly modified version to parse gigabyte size XML files successfully.
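To give an idea of how small the SAX version can be, here is a sketch using only the JDK's SAXParser. It assumes the same category/word layout with a name attribute as in the question. The class name SaxCategoryWords is hypothetical, and I have not run it against the original file.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SaxCategoryWords {

    // Streams through the XML and collects the <word> texts of one category.
    static List<String> wordsOf(InputSource source, final String categoryName) {
        final List<String> words = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inCategory;      // true while inside the wanted <category>
            private StringBuilder text;      // non-null only while inside a wanted <word>

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("category".equals(qName)) {
                    inCategory = categoryName.equals(atts.getValue("name"));
                } else if (inCategory && "word".equals(qName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver one element's text in several chunks, so accumulate.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("word".equals(qName) && text != null) {
                    words.add(text.toString());
                    text = null;
                } else if ("category".equals(qName)) {
                    inCategory = false;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return words;
    }
}
```

For a file you would call it with new InputSource(new FileInputStream(filePath)). Nothing is kept in memory except the words of the selected category.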
If you ever find the issue, please update this post.
I'm trying to read an XML file from an Android app using XOM as the XML library. I'm trying this:
Builder parser = new Builder();
Document doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
But I'm getting nu.xom.ParsingException: Premature end of file. even when the file is empty.
I need to parse a very simple XML file, and I'm ready to use another library instead of XOM, so let me know if there's a better one, or just a solution to the problem using XOM.
In case it helps, I'm using xerces to get the parser.
------Edit-----
PS: The purpose of this wasn't to parse an empty file, the file just happened to be empty on the first run which showed this error.
If you follow this post to the end, it seems that this has to do with xerces and the fact that it's an empty file, and they didn't reach a solution on the xerces side.
So I handled the issue as follows:
Document doc = null;
try {
Builder parser = new Builder();
doc = parser.build(context.openFileInput(XML_FILE_LOCATION));
}catch (ParsingException ex) { //other catch blocks are required for other exceptions.
//fails to open the file with a parsing error.
//I create a new root element and a new document.
//I fill them with xml data (else where in the code) and save them.
Element root = new Element("root");
doc = new Document(root);
}
And then I can do whatever I want with doc. You can add extra checks to make sure that the cause is really an empty file (like checking the file size, as indicated by one of sam's comments on the question).
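The file-size check could be sketched roughly like this. Note that this illustration swaps in the JDK's own DocumentBuilder rather than XOM, so that it is self-contained. The class and method names are made up, and with XOM you would create the nu.xom.Document as in the catch block above instead.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class LoadOrCreate {

    // Parses the file if it has content; otherwise starts a fresh document
    // with the given root element, so the parser never sees an empty input.
    static Document loadOrCreate(File xmlFile, String rootName) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            if (!xmlFile.exists() || xmlFile.length() == 0) {
                Document doc = db.newDocument();
                doc.appendChild(doc.createElement(rootName));
                return doc;
            }
            // A non-empty but malformed file still fails here, which is what we want.
            return db.parse(xmlFile);
        } catch (Exception e) {
            throw new IllegalStateException("Could not load " + xmlFile, e);
        }
    }
}
```

This way a parsing exception is only ever raised for files that have content but are not well-formed.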
An empty file is not a well-formed XML document. Throwing a ParsingException is the right thing to do here.
I have this XML file which doesn't have a root node. Other than manually adding a "fake" root element, is there any way I would be able to parse an XML file in Java? Thanks.
I suppose you could create a new implementation of InputStream that wraps the one you'll be parsing from. This implementation would return the bytes of the opening root tag before the bytes from the wrapped stream and the bytes of the closing root tag afterwards. That would be fairly simple to do.
I may be faced with this problem too. Legacy code, eh?
Ian.
Edit: You could also look at java.io.SequenceInputStream which allows you to append streams to one another. You would need to put your prefix and suffix in byte arrays and wrap them in ByteArrayInputStreams but it's all fairly straightforward.
Your XML document needs a root xml element to be considered well formed. Without this you will not be able to parse it with an xml parser.
One way is to provide your own dummy wrapper, without touching the original (not well-formed) 'xml', by pulling it in as an external entity:
Syntax
<!DOCTYPE some_root_elem SYSTEM "/home/ego/some.dtd"
[
<!ENTITY entity-name "Some value to be inserted at the entity">
]>
Example:
<!DOCTYPE dummy [
<!ENTITY data SYSTEM "http://wherever-my-data-is">
]>
<dummy>
&data;
</dummy>
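From Java, that wrapper can be built as a string and handed to the standard DocumentBuilder, roughly as sketched below. This assumes the parser is allowed to resolve external entities, which is the default for the JDK parser as far as I know but is often switched off for security reasons, so do not do this with untrusted input. The class name and the demo method are only for illustration.

```java
import java.io.File;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class EntityWrapper {

    // Parses a rootless fragment file by including it, via an external entity,
    // inside a dummy root element. The fragment file itself is not modified.
    static Document parseRootless(File fragment) throws Exception {
        String wrapper =
                "<!DOCTYPE dummy [\n"
              + "  <!ENTITY data SYSTEM \"" + fragment.toURI() + "\">\n"
              + "]>\n"
              + "<dummy>&data;</dummy>";
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapper)));
    }

    // Small demo: writes a two-element fragment to a temp file and counts
    // the elements found under the dummy root.
    static int demoElementCount() {
        try {
            File f = File.createTempFile("fragment", ".xml");
            f.deleteOnExit();
            Files.write(f.toPath(), "<a>1</a><b>2</b>".getBytes(StandardCharsets.UTF_8));
            Document doc = parseRootless(f);
            return doc.getDocumentElement().getElementsByTagName("*").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Demo failed", e);
        }
    }
}
```

Keep in mind that everything you read afterwards sits one level deeper, under the dummy element.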
You could use another parser like Jsoup. It can parse XML without a root.
I think even if any API would have an option for this, it will only return you the first node of the "XML" which will look like a root and discard the rest.
So the answer is probably to do it yourself. Scanner or StringTokenizer might do the trick.
Maybe some html parsers might help, they are usually less strict.
Here's what I did:
There's an old java.io.SequenceInputStream class, which is so old that it takes Enumeration rather than List or such.
With it, you can prepend and append the root element tags (<div> and </div> in my case) around your no-root XML stream. (You shouldn't do it by concatenating Strings due to performance and memory reasons.)
public void tryExtractHighestHeader(ParserContext context)
{
String xhtmlString = context.getBody();
if (xhtmlString == null || "".equals(xhtmlString))
return;
// The XHTML needs to be wrapped, because it has no root element.
ByteArrayInputStream divStart = new ByteArrayInputStream("<div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream divEnd = new ByteArrayInputStream("</div>".getBytes(StandardCharsets.UTF_8));
ByteArrayInputStream is = new ByteArrayInputStream(xhtmlString.getBytes(StandardCharsets.UTF_8));
// Collections.enumeration is in the JDK, so no third-party IteratorEnumeration is needed.
Enumeration<InputStream> streams = Collections.enumeration(Arrays.<InputStream>asList(divStart, is, divEnd));
try (SequenceInputStream wrapped = new SequenceInputStream(streams);) {
DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = builderFactory.newDocumentBuilder();
Document xmlDocument = builder.parse(wrapped);
// From here you can do whatever you like, but keep in mind the extra element.
XPath xPath = XPathFactory.newInstance().newXPath();
}
catch (Exception e) {
throw new RuntimeException("Failed parsing XML: " + e.getMessage(), e); // keep the original cause
}
}
I'm working on a project in which I have to take a raw file from the server and convert it into an XML file.
Is there any tool available in Java that can help me accomplish this task, the way JAXP can be used to parse an XML document?
I guess you will need your objects for later use, so create MyObject, a bean that you load with the values from your raw file, and then write it to someFile.xml:
FileOutputStream os = new FileOutputStream("someFile.xml");
XMLEncoder encoder = new XMLEncoder(os);
MyObject p = new MyObject();
p.setFirstName("Mite");
encoder.writeObject(p);
encoder.close();
Or you can go with TransformerFactory if you don't need the objects for later use.
Yes. This assumes that the text in the raw file is already XML.
You start with the DocumentBuilderFactory to get a DocumentBuilder, and then you can use its parse() method to turn an input stream into a Document, which is an internal XML representation.
If the raw file contains something other than XML, you'll want to scan it somehow (your own code here) and use the stuff you find to build up from an empty Document.
I then usually use a Transformer from a TransformerFactory to convert the Document into XML text in a file, but there may be a simpler way.
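As a rough illustration of the "scan it and build up a Document" route followed by the Transformer step, here is a sketch. It assumes, purely for the example, that the raw file consists of key=value lines. The element names record and entry and the class name RawToXml are invented, and your actual format will dictate the scanning code.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class RawToXml {

    // Turns "key=value" lines into <record><entry key="...">value</entry>...</record>.
    static String convert(List<String> rawLines) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element root = doc.createElement("record");
            doc.appendChild(root);

            for (String line : rawLines) {
                int eq = line.indexOf('=');
                if (eq < 0) {
                    continue;                       // skip lines that do not fit the assumed format
                }
                Element entry = doc.createElement("entry");
                entry.setAttribute("key", line.substring(0, eq).trim());
                entry.setTextContent(line.substring(eq + 1).trim());
                root.appendChild(entry);
            }

            // An identity Transformer serializes the DOM tree to XML text.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not build XML", e);
        }
    }
}
```

To write straight to a file, pass new StreamResult(new File("someFile.xml")) instead of the StringWriter.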
JAXP can also be used to create a new, empty document:
Document dom = DocumentBuilderFactory.newInstance()
.newDocumentBuilder()
.newDocument();
Then you can use that Document to create elements, and append them as needed:
Element root = dom.createElement("root");
dom.appendChild(root);
But, as Jørn noted in a comment to your question, it all depends on what you want to do with this "raw" file: how should it be turned into XML. And only you know that.
I think if you try to load it in an XmlDocument this will be fine