woodstox skip part of xml - java

Java: 1.6
Woodstox: 4.1.4
I just want to skip part of an XML file while parsing.
Let's look at this simple XML:
<family>
<mom>
<data height="160"/>
</mom>
<dad>
<data height="175"/>
</dad>
</family>
I just want to skip the dad element. So it looks like using the skipElement method as shown below is a good idea:
FileInputStream fis = ...;
XMLStreamReader2 xmlsr = (XMLStreamReader2) xmlif.createXMLStreamReader(fis);
String currentElementName = null;
while(xmlsr.hasNext()){
int eventType = xmlsr.next();
switch(eventType){
case (XMLEvent2.START_ELEMENT):
currentElementName = xmlsr.getName().toString();
if("dad".equals(currentElementName) == true){
logger.info("isStartElement: " + xmlsr.isStartElement());
logger.info("Element BEGIN: " + currentElementName);
xmlsr.skipElement();
}
...
}
}
We just find the start of the dad element and skip it. But not so fast, because an exception will be thrown. This is the output:
isStartElement: true
Element BEGIN: dad
Exception in thread "main" java.lang.IllegalStateException: Current state not START_ELEMENT
That is not what I expected. This is very surprising, because the skipElement method is executed in the START_ELEMENT state. What is going on?

I tried this in java 1.6 (jdk1.6.0_30) with woodstox-core-lgpl-4.1.4.jar, stax2-api-3.1.1.jar on the library path.
My java file is this:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import org.codehaus.stax2.XMLStreamReader2;
import org.codehaus.stax2.evt.XMLEvent2;
public class Skip {
public static void main(String[] args) throws FileNotFoundException,
XMLStreamException {
System.setProperty("javax.xml.stream.XMLInputFactory",
"com.ctc.wstx.stax.WstxInputFactory");
System.setProperty("javax.xml.stream.XMLOutputFactory",
"com.ctc.wstx.stax.WstxOutputFactory");
System.setProperty("javax.xml.stream.XMLEventFactory",
"com.ctc.wstx.stax.WstxEventFactory");
FileInputStream fis = new FileInputStream(new File("family.xml"));
XMLInputFactory xmlif = XMLInputFactory.newFactory();
XMLStreamReader2 xmlsr = (XMLStreamReader2) xmlif
.createXMLStreamReader(fis);
String currentElementName = null;
while (xmlsr.hasNext()) {
int eventType = xmlsr.next();
switch (eventType) {
case (XMLEvent2.START_ELEMENT):
currentElementName = xmlsr.getName().toString();
if ("dad".equals(currentElementName) == true) {
System.out.println("isStartElement: "
+ xmlsr.isStartElement());
System.out.println("Element BEGIN: " + currentElementName);
xmlsr.skipElement();
}
else {
System.out.println(currentElementName);
}
}
}
}
}
Works like a charm.
Output is
family
mom
data
isStartElement: true
Element BEGIN: dad

Since Woodstox is a StAX (JSR-173) compliant parser, you could use a StAX StreamFilter to exclude events corresponding to certain elements. I prefer this approach so that you can keep the filtering logic separate from your application logic.
Demo
import javax.xml.stream.*;
import javax.xml.transform.stream.StreamSource;
public class Demo {
public static void main(String[] args) throws Exception {
XMLInputFactory xif = XMLInputFactory.newFactory();
StreamSource xml = new StreamSource("src/forum14326598/input.xml");
XMLStreamReader xsr = xif.createXMLStreamReader(xml);
xsr = xif.createFilteredReader(xsr, new StreamFilter() {
private boolean accept = true;
@Override
public boolean accept(XMLStreamReader reader) {
if((reader.isStartElement() || reader.isEndElement()) && "dad".equals(reader.getLocalName())) {
accept = !accept;
return false;
} else {
return accept;
}
}
});
while(xsr.hasNext()) {
if(xsr.isStartElement()) {
System.out.println("start: " + xsr.getLocalName());
} else if(xsr.isCharacters()) {
if(xsr.getText().trim().length() > 0) {
System.out.println("chars: " + xsr.getText());
}
} else if(xsr.isEndElement()) {
System.out.println("end: " + xsr.getLocalName());
}
xsr.next();
}
}
}
Output
start: family
start: mom
start: data
end: data
end: mom
end: family

I've found the reason why I was getting the IllegalStateException. flup's answer was very useful. Thanks a lot.
The answer given by Blaise is worth reading too.
But getting to the heart of the matter.
The problem was not the skipElement() method itself. It was caused by the methods used to read attributes. There are three dots (...) in my question, so let's look at what was there:
switch(eventType){
case (XMLEvent2.START_ELEMENT):
currentElementName = xmlsr.getName().toString();
logger.info("currentElementName: " + currentElementName);
if("dad".equals(currentElementName) == true){
logger.info("isStartElement: " + xmlsr.isStartElement());
logger.info("Element BEGIN: " + currentElementName);
xmlsr.skipElement();
}
case (XMLEvent2.ATTRIBUTE):
int attributeCount = xmlsr.getAttributeCount();
...
break;
}
The important thing: there is no break statement for START_ELEMENT, so every time a START_ELEMENT event occurs, the code for the ATTRIBUTE event is also executed.
That looks OK according to the Javadocs, because the methods getAttributeCount(), getAttributeValue() etc. may be called for both START_ELEMENT and ATTRIBUTE.
But after calling skipElement(), the START_ELEMENT event has changed to END_ELEMENT, so calling getAttributeCount() is no longer allowed. That call is the reason the IllegalStateException is thrown.
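To make the state change concrete, here is a minimal, self-contained sketch (an illustration only, assuming Woodstox is the StAX provider so the cast to XMLStreamReader2 succeeds):
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import org.codehaus.stax2.XMLStreamReader2;
public class SkipStateDemo {
    public static void main(String[] args) throws Exception {
        String xml = "<family><mom><data height=\"160\"/></mom>"
                   + "<dad><data height=\"175\"/></dad></family>";
        XMLStreamReader2 xmlsr = (XMLStreamReader2) XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(xml));
        while (xmlsr.hasNext()) {
            if (xmlsr.next() == XMLStreamConstants.START_ELEMENT
                    && "dad".equals(xmlsr.getLocalName())) {
                xmlsr.skipElement();
                // The cursor now sits on the matching </dad> END_ELEMENT ...
                System.out.println(xmlsr.getEventType() == XMLStreamConstants.END_ELEMENT); // true
                // ... so attribute accessors are no longer legal:
                xmlsr.getAttributeCount(); // throws IllegalStateException
            }
        }
    }
}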
The simplest way to avoid the exception is simply to add a break statement after the skipElement() call. That way the attribute-reading code is not executed, and the exception is not thrown.
if("dad".equals(currentElementName) == true){
logger.info("isStartElement: " + xmlsr.isStartElement());
logger.info("Element BEGIN: " + currentElementName);
xmlsr.skipElement();
break; //the cure for IllegalStateException
}
I'm sorry I gave you no chance to answer my original question, because too much code was hidden.

It looks like the method xmlsr.skipElement() is the one that must consume the XMLEvent2.START_ELEMENT event. And since you already consumed it (xmlsr.next()), that method throws you an error.

Related

Encode attribute newlines in XMLEventWriter

I am doing some surgical XML transformations using XMLEventReader and XMLEventWriter. For the most part, I just write the events as they are read:
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;
public class StaxExample {
public static void main(String[] args) throws XMLStreamException {
String inputXml =
"<foo>" +
" <bar baz=\"a&#10;b&#10;c&#10;\"/>" +
" <changeme/>" +
"</foo>";
StringWriter result = new StringWriter();
XMLEventReader reader = XMLInputFactory.newFactory().createXMLEventReader(new StringReader(inputXml));
XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(result);
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
//in real code, look for "changeme" and insert some stuff
writer.add(event);
}
System.out.println(result.toString());
}
}
My problem is, this produces:
<?xml version="1.0" ?><foo> <bar baz="a
b
c
"></bar> <changeme></changeme></foo>
While syntactically valid XML, it's necessary (due to a downstream consumer) that I preserve the newlines. The above XML will instead be normalized to a b c by that consumer (and indeed, by StAX itself--if I take this output and feed it back into the same program, the second time it will output baz="a b c ").
While I've given up on XMLEventWriter preserving non-semantic formatting, is there a way to prevent it from essentially changing my attribute values?
Well, I suggest you implement your own Writer:
import java.io.FilterWriter;
import java.io.IOException;
import java.io.Writer;
public class EscappingNLWriter extends FilterWriter
{
public EscappingNLWriter(Writer out) {super(out);}
@Override
public void write(int c) throws IOException
{
if (c=='\n')
{
out.write("&#10;");
}
else
{
out.write(c);
}
}
@Override
public void write(char[] buff, int offset, int len) throws IOException
{
// ...same char filtering...
}
@Override
public void write(String str, int offset, int len) throws IOException
{
// ...same char filtering...
}
}
And then use it to encapsulate the StringWriter:
Writer result = new EscappingNLWriter(new StringWriter());
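Wired into the original example, that might look roughly like the sketch below (it relies on the EscappingNLWriter class above and assumes its write(String...) and write(char[]...) overloads do the same filtering, since the event writer typically writes whole strings rather than single chars):
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
public class StaxEscapeDemo {
    public static void main(String[] args) throws Exception {
        String inputXml = "<foo><bar baz=\"a&#10;b&#10;c&#10;\"/></foo>";
        StringWriter target = new StringWriter();
        // Newlines emitted by the event writer pass through the filter
        // and come out as character references.
        Writer result = new EscappingNLWriter(target);
        XMLEventReader reader = XMLInputFactory.newFactory()
                .createXMLEventReader(new StringReader(inputXml));
        XMLEventWriter writer = XMLOutputFactory.newFactory().createXMLEventWriter(result);
        while (reader.hasNext()) {
            writer.add(reader.nextEvent());
        }
        writer.flush();
        System.out.println(target);
    }
}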
If you need absolute accuracy about where to escape newlines in the XML and where not to (i.e. you need to escape newlines only within attributes and not elsewhere), I have another suggestion, though a little more complicated:
Look at your code:
while (reader.hasNext()) {
XMLEvent event = reader.nextEvent();
//in real code, look for "changeme" and insert some stuff
writer.add(event);
}
There is one point where you can interpose between the attribute and the writer: just after initializing event and before passing it to writer.add, you can encapsulate the event in your own implementation of XMLEvent to ensure that if it is an instance of javax.xml.stream.events.Attribute, you override Attribute.getValue() to return the value properly escaped.
But there is an extra complication: the XMLEvents returned by an XMLEventReader usually do not include Attribute events; attributes are included within their corresponding StartElement events. So you need one more level of encapsulation: the StartElement objects and then the contained Attribute objects.

How to guarantee atomic move or exception of a file in Java?

In one of my projects I have concurrent write access to one single file within one JRE and want to handle that by first writing to a temporary file and afterwards moving that temp file to the target using an atomic move. I don't care about the order of the write accesses or such; all I need to guarantee is that at any given time the single file is usable. I'm already aware of Files.move and such; my problem is that I had a look at at least one implementation of that method and it raised some doubts about whether implementations really guarantee atomic moves. Please look at the following code:
Files.move on GrepCode for OpenJDK
FileSystemProvider provider = provider(source);
if (provider(target) == provider) {
// same provider
provider.move(source, target, options);
} else {
// different providers
CopyMoveHelper.moveToForeignTarget(source, target, options);
}
The problem is that the ATOMIC_MOVE option is not considered here at all; only whether the source and target paths share the same provider matters in the first place. That's not what I want, nor how I understand the documentation:
If the move cannot be performed as an atomic file system operation then
AtomicMoveNotSupportedException is thrown. This can arise, for example, when the target
location is on a different FileStore and would require that the file be copied, or target
location is associated with a different provider to this object.
The above code clearly violates that documentation because it falls back to a copy-delete-strategy without recognizing ATOMIC_MOVE at all. An exception would be perfectly OK in my case, because with that a hoster of our service could change his setup to use only one filesystem which supports atomic moves, as that's what we expect in the system requirements anyway. What I don't want to deal with is things silently failing just because an implementation uses a copy-delete-strategy which may result in data corruption in the target file. So, from my understanding it is simply not safe to rely on Files.move for atomic operations, because it doesn't always fail if those are not supported, but implementations may fall back to a copy-delete-strategy.
Is such behaviour a bug in the implementation and needs to get filed or does the documentation allow such behaviour and I'm understanding it wrong? Does it make any difference at all if I now already know that such maybe broken implementations are used out there? I would need to synchronize the write access on my own in that case...
You are looking at the wrong place. When the file system providers are not the same, the operation will be delegated to moveToForeignTarget as you have seen within the code snippet you’ve posted. The method moveToForeignTarget however will use the method convertMoveToCopyOptions (note the speaking name…) for getting the necessary copy options for the translated operation. And convertMoveToCopyOptions will throw an AtomicMoveNotSupportedException if it encounters the ATOMIC_MOVE option as there is no way to convert that move option to a valid copy option.
So there's no reason to worry, and in general it's recommended to avoid hasty conclusions drawn from fewer than ten lines of code (especially when not having tried a single test)…
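A quick way to check that behaviour on a given setup is a small probe like the one below (the paths are placeholders; point source and target at different FileStores or providers to provoke the exception):
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardCopyOption.REPLACE_EXISTING;
public class AtomicMoveProbe {
    public static void main(String[] args) throws IOException {
        Path source = Paths.get("C:/temp/data.tmp");   // placeholder
        Path target = Paths.get("D:/share/data.txt");  // placeholder on another FileStore
        try {
            Files.move(source, target, ATOMIC_MOVE, REPLACE_EXISTING);
            System.out.println("moved atomically");
        } catch (AtomicMoveNotSupportedException e) {
            // Thrown when the move cannot be performed atomically,
            // e.g. across FileStores or providers.
            System.out.println("atomic move not supported: " + e.getReason());
        }
    }
}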
The standard Java library does not provide a way to perform an atomic move in all cases.
Files.move() does not guarantee atomic move. You can pass ATOMIC_MOVE as an option, but if the move cannot be performed as an atomic operation, AtomicMoveNotSupportedException is thrown (this is the case when target location is on a different FileStore and would require that the file be copied).
You have to implement it yourself if you really need that. One solution can be to catch AtomicMoveNotSupportedException and then do this: try to move the file without the ATOMIC_MOVE option, but catch exceptions and remove the target if an error occurred during the copy.
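A minimal sketch of that fallback might look like this (the method name and the cleanup policy are illustrative, not taken from any library):
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
public class MoveWithFallback {
    static void move(Path tmp, Path target) throws IOException {
        try {
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE,
                    StandardCopyOption.REPLACE_EXISTING);
        } catch (AtomicMoveNotSupportedException e) {
            try {
                // Non-atomic fallback; may internally copy and delete.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException copyError) {
                // The fallback may leave a partial target behind; remove it.
                Files.deleteIfExists(target);
                throw copyError;
            }
        }
    }
}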
I came across a similar problem to solve:
One process frequently updates a file via 'save to temp file -> move temp file to final file' using Files.move(tmp, out, ATOMIC_MOVE, REPLACE_EXISTING);
One or more other processes read that file - completely, all at once - and close it immediately. The file is rather small - less than 50k.
And it just does not work reliably, at least on Windows. Under heavy load the reader occasionally gets a NoSuchFileException - which means Files.move is not that atomic even on the same file system :(
My env: Windows 10 + Java 11.0.12
Here is the code to play with:
import org.junit.Test;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.channels.ByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import static java.nio.charset.StandardCharsets.UTF_8;
import static java.nio.file.StandardCopyOption.ATOMIC_MOVE;
import static java.nio.file.StandardCopyOption.REPLACE_EXISTING;
import static java.util.Locale.US;
public class SomeTest {
static int nWrite = 0;
static int nRead = 0;
static int cErrors = 0;
static boolean writeFinished;
static boolean useFileChannels = true;
static String filePath = "c:/temp/test.out";
@Test
public void testParallelFileAccess() throws Exception {
new Writer().start();
new Reader().start();
while( !writeFinished ) {
Thread.sleep(10);
}
System.out.println("cErrors: " + cErrors);
}
static class Writer extends Thread {
public Writer() {
setDaemon(true);
}
@Override
public void run() {
File outFile = new File("c:/temp/test.out");
File outFileTmp = new File(filePath + "tmp");
byte[] bytes = "test".getBytes(UTF_8);
for( nWrite = 1; nWrite <= 100000; nWrite++ ) {
if( (nWrite % 1000) == 0 )
System.out.println("nWrite: " + nWrite + ", cReads: " + nRead);
try( FileOutputStream fos = new FileOutputStream(outFileTmp) ) {
fos.write(bytes);
}
catch( Exception e ) {
logException("write", e);
}
int maxAttemps = 10;
for( int i = 0; i <= maxAttemps; i++ ) {
try {
Files.move(outFileTmp.toPath(), outFile.toPath(), ATOMIC_MOVE, REPLACE_EXISTING);
break;
}
catch( IOException e ) {
try {
Thread.sleep(1);
}
catch( InterruptedException ex ) {
break;
}
if( i == maxAttemps )
logException("move", e);
}
}
}
System.out.println("Write finished ...");
writeFinished = true;
}
}
static class Reader extends Thread {
public Reader() {
setDaemon(true);
}
@Override
public void run() {
File inFile = new File(filePath);
Path inPath = inFile.toPath();
byte[] bytes = new byte[100];
ByteBuffer buffer = ByteBuffer.allocateDirect(100);
try { Thread.sleep(100); } catch( InterruptedException e ) { }
for( nRead = 0; !writeFinished; nRead++ ) {
if( useFileChannels ) {
try ( ByteChannel channel = Files.newByteChannel(inPath, Set.of()) ) {
channel.read(buffer);
}
catch( Exception e ) {
logException("read", e);
}
}
else {
try( InputStream fis = Files.newInputStream(inFile.toPath()) ) {
fis.read(bytes);
}
catch( Exception e ) {
logException("read", e);
}
}
}
}
}
private static void logException(String action, Exception e) {
cErrors++;
System.err.printf(US, "%s: %s - wr=%s, rd=%s:, %s%n", cErrors, action, nWrite, nRead, e);
}
}

Problems reading an ini file in Java

I am having trouble using Apache Commons Configuration to read an ini file. I attached the imports in case I am missing something. Below is an example I found on Stack Overflow, and as far as I can tell, there are no other examples to look at. The problem is iniObj. In Eclipse it is highlighted in red.
If I initialize the variable, new HierarchicalINIConfiguration(iniFile) gets angry and wants me to add a try/catch or throws... which should be no problem... but then the try/catch or throws gets angry and says "No exception of type ConfigurationException can be thrown; an exception type must be a subclass of Throwable."
Which then brought me to this question. I added commons-lang 3.1. I have commons-configuration 1.9, commons-collections 3.2.1 and commons-logging 1.1.1 as well. I have also tried this with commons-configuration 1.8 and lang 2.6. Now I get a new error: "Exception in thread "main" java.lang.NullPointerException at com.toolbox.dev.ReadIni.main(ReadIni.java:28)". You can see the new code below, after the adjustments I made to try to resolve the errors.
My code:
import java.util.Iterator;
import java.util.Set;
import org.apache.commons.configuration.ConfigurationException;
import org.apache.commons.configuration.HierarchicalINIConfiguration;
import org.apache.commons.configuration.SubnodeConfiguration;
public class ReadIni {
public static void main(String[] args) throws ConfigurationException {
String iniFile = "file.ini";
HierarchicalINIConfiguration iniConfObj = new HierarchicalINIConfiguration(iniFile);
// Get Section names in ini file
Set setOfSections = iniConfObj.getSections();
Iterator sectionNames = setOfSections.iterator();
while(sectionNames.hasNext()) {
String sectionName = sectionNames.next().toString();
HierarchicalINIConfiguration iniObj = null;
SubnodeConfiguration sObj = iniObj.getSection(sectionName);
Iterator it1 = sObj.getKeys();
while (it1.hasNext()) {
// Get element
Object key = it1.next();
System.out.print("Key " + key.toString() + " Value " +
sObj.getString(key.toString()) + "\n");
}
}
}
}
Original code from Stack Overflow:
import java.util.Iterator;
import java.util.Set;
import org.apache.commons.configuration.HierarchicalINIConfiguration;
import org.apache.commons.configuration.SubnodeConfiguration;
public class ReadIni {
public static void main(String[] args) {
String iniFile = "";
HierarchicalINIConfiguration iniConfObj = new HierarchicalINIConfiguration(iniFile);
// Get Section names in ini file
Set setOfSections = iniConfObj.getSections();
Iterator sectionNames = setOfSections.iterator();
while(sectionNames.hasNext()) {
String sectionName = sectionNames.next().toString();
SubnodeConfiguration sObj = iniObj.getSection(sectionName);
Iterator it1 = sObj.getKeys();
while (it1.hasNext()) {
// Get element
Object key = it1.next();
System.out.print("Key " + key.toString() + " Value " +
sObj.getString(key.toString()) + "\n");
}
}
}
}
Since you have already initialized the HierarchicalINIConfiguration (second line in "main") as :
HierarchicalINIConfiguration iniConfObj = new HierarchicalINIConfiguration(iniFile);
I believe you want to remove HierarchicalINIConfiguration iniObj = null; (around 5 lines down) from your code and change
SubnodeConfiguration sObj = iniObj.getSection(sectionName);
to (use iniConfObj in place of iniObj)
SubnodeConfiguration sObj = iniConfObj.getSection(sectionName);
This doesn't look promising ?
HierarchicalINIConfiguration iniObj = null;
SubnodeConfiguration sObj = iniObj.getSection(sectionName);
Is this line 28 ?
You could try JINIFile. It is a translation of TIniFile from Delphi, but for Java. It fully supports all the INI file features:
https://github.com/SubZane/JIniFile

Dynamic XML Shredding in Java Without Using a Database

Is there a "standardized" way (i.e., code pattern, or, even better, open source library) in Java for dynamically flattening ("shredding") a hierarchical XML file, of large size and unknown structure, with output not redirected to an RDBMS but directly accessible?
I am looking at a transformation like the one mentioned in this question, but all the code examples I have seen use some SQL command to inject the flattened XML input to a database table, via an RDBMS (e.g., MySQL).
What I would like to do is progressively extract the XML data into a string, or, at least, into a text file, which could be post-processed afterwards, without going through any RDBMS.
EDIT:
After working further on the issue, there are a couple of solutions using XSLT (including a fully parameterizable one) in this question.
You could do it with JDOM (see the example below; jdom.jar has to be on the classpath). But beware: the whole DOM is held in memory. If the XML is larger you should use XSLT or a SAX parser.
import java.io.IOException;
import java.io.StringReader;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.JDOMException;
import org.jdom.input.SAXBuilder;
import org.junit.Test;
public class JDomFlatten {
@Test
public void testFlatten() {
final String xml = "<grandparent name=\"grandpa bob\">"//
+ "<parent name=\"papa john\">"//
+ "<children>"//
+ "<child name=\"mark\" />"//
+ "<child name=\"cindy\" />"//
+ "</children>"//
+ "</parent>"//
+ "<parent name=\"papa henry\">"//
+ "<children>" //
+ "<child name=\"mary\" />"//
+ "</children>"//
+ "</parent>" //
+ "</grandparent>";
final StringReader stringReader = new StringReader(xml);
final SAXBuilder builder = new SAXBuilder();
try {
final Document document = builder.build(stringReader);
final Element grandparentElement = document.getRootElement();
final StringBuilder outString = new StringBuilder();
for (final Object parentElementObject : grandparentElement.getChildren()) {
final Element parentElement = (Element) parentElementObject;
for (final Object childrenElementObject : parentElement.getChildren()) {
final Element childrenElement = (Element) childrenElementObject;
for (final Object childElementObject : childrenElement.getChildren()) {
final Element childElement = (Element) childElementObject;
outString.append(grandparentElement.getAttributeValue("name"));
outString.append(" ");
outString.append(parentElement.getAttributeValue("name"));
outString.append(" ");
outString.append(childElement.getAttributeValue("name"));
outString.append("\n");
}
}
}
System.out.println(outString);
} catch (final JDOMException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (final IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}

How to improve splitting xml file performance

I've seen quite a lot of posts/blogs/articles about splitting an XML file into smaller chunks, and decided to create my own because I have some custom requirements. Here is what I mean; consider the following XML:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="2">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="3">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="4">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
</staff>
<staff id="5">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<salary>100000</salary>
</staff>
</company>
I want to split this XML into n parts, each written to its own file, but the staff element must contain nickname; if it's not there I don't want it. So this should produce 4 XML splits, containing staff ids 1 through 4.
Here is my code :
public int split() throws Exception{
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(inputFilePath)));
String line;
List<String> tempList = null;
while((line=br.readLine())!=null){
if(line.contains("<?xml version=\"1.0\"") || line.contains("<" + rootElement + ">") || line.contains("</" + rootElement + ">")){
continue;
}
if(line.contains("<"+ element +">")){
tempList = new ArrayList<String>();
}
tempList.add(line);
if(line.contains("</"+ element +">")){
if(hasConditions(tempList)){
writeToSplitFile(tempList);
writtenObjectCounter++;
totalCounter++;
}
}
if(writtenObjectCounter == itemsPerFile){
writtenObjectCounter = 0;
fileCounter++;
tempList.clear();
}
}
if(tempList.size() != 0){
writeClosingRootElement();
}
return totalCounter;
}
private void writeToSplitFile(List<String> itemList) throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
if(writtenObjectCounter == 0){
wr.write("<" + rootElement + ">");
wr.write("\n");
}
for (String string : itemList) {
wr.write(string);
wr.write("\n");
}
if(writtenObjectCounter == itemsPerFile-1)
wr.write("</" + rootElement + ">");
wr.close();
}
private void writeClosingRootElement() throws Exception{
BufferedWriter wr = new BufferedWriter(new FileWriter(outputDirectory + File.separator + "split_" + fileCounter + ".xml", true));
wr.write("</" + rootElement + ">");
wr.close();
}
private boolean hasConditions(List<String> list){
int matchList = 0;
for (String condition : conditionList) {
for (String string : list) {
if(string.contains(condition)){
matchList++;
}
}
}
if(matchList >= conditionList.size()){
return true;
}
return false;
}
I know that opening/closing the stream for each written staff element impacts performance; I could instead write once per file (which may contain n staff elements). Naturally the root and split elements are configurable.
Any ideas how I can improve the performance/logic? I'd prefer some code, but good advice can sometimes be better.
Edit:
This XML example is actually a dummy example; the real XML which I'm trying to split has about 300-500 different elements under the split element, all appearing in random order, and their number varies. StAX may not be the best solution after all?
Bounty update:
I'm looking for a solution (code) that will:
Be able to split an XML file into n parts with x split elements (in the dummy XML example, staff is the split element).
The content of the split files should be wrapped in the root element from the original file (like company in the dummy example).
I'd like to be able to specify a condition that must be met in the split element, i.e. I want only staff which have a nickname and want to discard those without nicknames. But I'd also like to be able to run the split without any conditions.
The code doesn't necessarily have to improve on my solution (which lacks good logic and performance, but works).
And I'm not happy with "but it works". I can't find enough examples of StAX for these kinds of operations, and the user community is not great either. It doesn't have to be a StAX solution, though.
I'm probably asking too much, but I'm here to learn stuff, so I'm giving a good bounty for the solution.
First piece of advice: don't try to write your own XML handling code. Use an XML parser - it's going to be much more reliable and quite possibly faster.
If you use an XML pull parser (e.g. StAX) you should be able to read an element at a time and write it out to disk, never reading the whole document in one go.
Here's my suggestion. It requires a streaming XSLT 3.0 processor: which means in practice that it needs Saxon-EE 9.3.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:mode streamable="yes"/>
<xsl:template match="/">
<xsl:apply-templates select="company/staff"/>
</xsl:template>
<xsl:template match="staff">
<xsl:variable name="v" as="element(staff)">
<xsl:copy-of select="."/>
</xsl:variable>
<xsl:if test="$v/nickname">
<xsl:result-document href="{@id}.xml">
<xsl:copy-of select="$v"/>
</xsl:result-document>
</xsl:if>
</xsl:template>
</xsl:stylesheet>
In practice, though, unless you have hundreds of megabytes of data, I suspect a non-streaming solution will be quite fast enough, and probably faster than your hand-written Java code, given that your Java code is nothing to get excited about. At any rate, give an XSLT solution a try before you write reams of low-level Java. It's a routine problem, after all.
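For completeness, a rough sketch of driving such a stylesheet from Java via JAXP (the file names are placeholders, and the TransformerFactory must resolve to a streaming-capable XSLT 3.0 processor such as Saxon-EE for the streamable mode to work):
import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
public class RunSplitStylesheet {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = factory.newTransformer(new StreamSource(new File("split.xsl")));
        // xsl:result-document writes the per-staff files itself; the principal
        // output is unused, but its location anchors the relative {@id}.xml URIs.
        transformer.transform(new StreamSource(new File("input.xml")),
                new StreamResult(new File("out/principal.xml")));
    }
}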
You could do the following with StAX:
Algorithm
Read and hold onto the root element event.
Read first chunk of XML:
Queue events until condition has been met.
If condition has been met:
Write start document event.
Write out root start element event
Write out split start element event
Write out queued events
Write out remaining events for this section.
If condition was not met then do nothing.
Repeat step 2 with next chunk of XML
Code for Your Use Case
The following code uses StAX APIs to break up the document as outlined in your question:
package forum7408938;
import java.io.*;
import java.util.*;
import javax.xml.namespace.QName;
import javax.xml.stream.*;
import javax.xml.stream.events.*;
public class Demo {
public static void main(String[] args) throws Exception {
Demo demo = new Demo();
demo.split("src/forum7408938/input.xml", "nickname");
//demo.split("src/forum7408938/input.xml", null);
}
private void split(String xmlResource, String condition) throws Exception {
XMLEventFactory xef = XMLEventFactory.newFactory();
XMLInputFactory xif = XMLInputFactory.newInstance();
XMLEventReader xer = xif.createXMLEventReader(new FileReader(xmlResource));
StartElement rootStartElement = xer.nextTag().asStartElement(); // Advance to the root element
StartDocument startDocument = xef.createStartDocument();
EndDocument endDocument = xef.createEndDocument();
XMLOutputFactory xof = XMLOutputFactory.newFactory();
while(xer.hasNext() && !xer.peek().isEndDocument()) {
boolean metCondition;
XMLEvent xmlEvent = xer.nextTag();
if(!xmlEvent.isStartElement()) {
break;
}
// BOUNTY CRITERIA
// Be able to split XML file into n parts with x split elements(from
// the dummy XML example staff is the split element).
StartElement breakStartElement = xmlEvent.asStartElement();
List<XMLEvent> cachedXMLEvents = new ArrayList<XMLEvent>();
// BOUNTY CRITERIA
// I'd like to be able to specify condition that must be in the
// split element i.e. I want only staff which have nickname, I want
// to discard those without nicknames. But be able to also split
// without conditions while running split without conditions.
if(null == condition) {
cachedXMLEvents.add(breakStartElement);
metCondition = true;
} else {
cachedXMLEvents.add(breakStartElement);
xmlEvent = xer.nextEvent();
metCondition = false;
while(!(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
cachedXMLEvents.add(xmlEvent);
if(xmlEvent.isStartElement() && xmlEvent.asStartElement().getName().getLocalPart().equals(condition)) {
metCondition = true;
break;
}
xmlEvent = xer.nextEvent();
}
}
if(metCondition) {
// Create a file for the fragment, the name is derived from the value of the id attribute
FileWriter fileWriter = null;
fileWriter = new FileWriter("src/forum7408938/" + breakStartElement.getAttributeByName(new QName("id")).getValue() + ".xml");
// A StAX XMLEventWriter will be used to write the XML fragment
XMLEventWriter xew = xof.createXMLEventWriter(fileWriter);
xew.add(startDocument);
// BOUNTY CRITERIA
// The content of the spitted files should be wrapped in the
// root element from the original file(like in the dummy example
// company)
xew.add(rootStartElement);
// Write the XMLEvents that were cached while when we were
// checking the fragment to see if it matched our criteria.
for(XMLEvent cachedEvent : cachedXMLEvents) {
xew.add(cachedEvent);
}
// Write the XMLEvents that we still need to parse from this
// fragment
xmlEvent = xer.nextEvent();
while(xer.hasNext() && !(xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().equals(breakStartElement.getName()))) {
xew.add(xmlEvent);
xmlEvent = xer.nextEvent();
}
xew.add(xmlEvent);
// Close everything we opened
xew.add(xef.createEndElement(rootStartElement.getName(), null));
xew.add(endDocument);
fileWriter.close();
}
}
}
}
@Jon Skeet is spot on as usual in his advice. @Blaise Doughan gave you a very basic picture of using StAX (which would be my preferred choice, although you can do basically the same thing with SAX). You seem to be looking for something more explicit, so here's some pseudo code to get you started (based on StAX), with a rough runnable sketch after the steps:
find first "staff" StartElement
set a flag indicating you are in a "staff" element and start tracking the depth (StartElement is +1, EndElement is -1)
now, process the "staff" sub-elements, grab any of the data you care about and put it in a file (or where ever)
keep processing until your depth reaches 0 (when you find the matching "staff" EndElement)
unset the flag indicating you are in a "staff" element
search for the next "staff" StartElement
if found, go to 2. and repeat
if not found, document is complete
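A rough runnable sketch of those steps (the file name and the element name "staff" are taken from the question's sample; the actual data extraction is left as a comment):
import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
public class DepthTrackingSketch {
    public static void main(String[] args) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newFactory()
                .createXMLStreamReader(new FileInputStream("input.xml"));
        while (reader.hasNext()) {
            // steps 1/6: find the next "staff" StartElement
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && "staff".equals(reader.getLocalName())) {
                int depth = 1; // step 2: we are now inside a staff element
                while (depth > 0) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        depth++; // step 3: grab any data you care about here
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        depth--; // step 4: the matching </staff> brings depth back to 0
                    }
                }
                // step 5: we have left the staff element; the outer loop repeats (steps 6-8)
            }
        }
        reader.close();
    }
}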
EDIT:
Wow, I have to say I'm amazed at the number of people willing to do someone else's work for them. I didn't realize SO was basically a free version of rent-a-coder.
@Gandalf StormCrow:
Let me divide your problem into three separate issues:
i) Reading the XML and simultaneously splitting it in the best possible way
ii) Checking the condition in the split file
iii) If the condition is met, processing that split file.
For i), there are of course multiple solutions: SAX, StAX and other parsers, or, as you mentioned, simply reading with plain Java IO operations and searching for tags.
I believe SAX/StAX/simple Java IO - anything will do. I have taken your example as the base for my solution.
ii) Checking the condition in the split file: you have used the contains() method to check for the existence of nickname. This does not seem the best way: what if your conditions are more complex, e.g. nickname should be present but with length > 5, or salary should be numeric, etc.?
I would use the Java XML validation framework for this, which makes use of an XML Schema. Please note we can cache the Schema object in memory to reuse it again and again. This validation framework is pretty fast.
iii) If the condition is met, process that split file.
You may want to use the Java concurrency APIs to submit async tasks (ExecutorService) to achieve parallel execution for faster performance.
So considering the above points, one possible solution can be:
You can create a company.xsd file like:
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
targetNamespace="http://www.example.org/NewXMLSchema"
xmlns:tns="http://www.example.org/NewXMLSchema"
elementFormDefault="unqualified">
<element name="company">
<complexType>
<sequence>
<element name="staff" type="tns:stafftype"/>
</sequence>
</complexType>
</element>
<complexType name="stafftype">
<sequence>
<element name="firstname" type="string" minOccurs="0" />
<element name="lastname" type="string" minOccurs="0" />
<element name="nickname" type="string" minOccurs="1" />
<element name="salary" type="int" minOccurs="0" />
</sequence>
</complexType>
</schema>
Then your Java code would look like:
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;
public class testXML {
// Lookup a factory for the W3C XML Schema language
static SchemaFactory factory = SchemaFactory
.newInstance("http://www.w3.org/2001/XMLSchema");
// Compile the schema.
static File schemaLocation = new File("company.xsd");
static Schema schema = null;
static {
try {
schema = factory.newSchema(schemaLocation);
} catch (SAXException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
private final ExecutorService pool = Executors.newFixedThreadPool(20);
boolean validate(StringBuffer splitBuffer) {
boolean isValid = false;
Validator validator = schema.newValidator();
try {
validator.validate(new StreamSource(new ByteArrayInputStream(
splitBuffer.toString().getBytes())));
isValid = true;
} catch (SAXException ex) {
System.out.println(ex.getMessage());
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return isValid;
}
void split(BufferedReader br, String rootElementName,
String splitElementName) {
StringBuffer splitBuffer = null;
String line = null;
String startRootElement = "<" + rootElementName + ">";
String endRootElement = "</" + rootElementName + ">";
String startSplitElement = "<" + splitElementName + ">";
String endSplitElement = "</" + splitElementName + ">";
String xmlDeclaration = "<?xml version=\"1.0\"";
boolean startFlag = false, endflag = false;
try {
while ((line = br.readLine()) != null) {
if (line.contains(xmlDeclaration)
|| line.contains(startRootElement)
|| line.contains(endRootElement)) {
continue;
}
if (line.contains(startSplitElement)) {
startFlag = true;
endflag = false;
splitBuffer = new StringBuffer(startRootElement);
splitBuffer.append(line);
} else if (line.contains(endSplitElement)) {
endflag = true;
startFlag = false;
splitBuffer.append(line);
splitBuffer.append(endRootElement);
} else if (startFlag) {
splitBuffer.append(line);
}
if (endflag) {
//process splitBuffer
boolean result = validate(splitBuffer);
if (result) {
//send it to a thread for processing further
//it is async so that main thread can continue for next
pool.submit(new ProcessingHandler(splitBuffer));
}
}
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
class ProcessingHandler implements Runnable {
String splitXML = null;
ProcessingHandler(StringBuffer splitXMLBuffer) {
this.splitXML = splitXMLBuffer.toString();
}
@Override
public void run() {
// do like writing to a file etc.
}
}
Have a look at this. It is a slightly reworked sample from xmlpull.org:
http://www.xmlpull.org/v1/download/unpacked/doc/quick_intro.html
The following should do all you need unless you have nested splitting tags like:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<company>
<staff id="1">
<firstname>yong</firstname>
<lastname>mook kim</lastname>
<nickname>mkyong</nickname>
<salary>100000</salary>
<other>
<staff>
...
</staff>
</other>
</staff>
</company>
To run it in pass-through mode simply pass null as splitting tag.
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import org.apache.commons.io.FileUtils;
import org.xmlpull.v1.XmlPullParser;
import org.xmlpull.v1.XmlPullParserException;
import org.xmlpull.v1.XmlPullParserFactory;
public class XppSample {
private String rootTag;
private String splitTag;
private String requiredTag;
private int flushThreshold;
private String fileName;
private String rootTagEnd;
private boolean hasRequiredTag = false;
private int flushCount = 0;
private int fileNo = 0;
private String header;
private XmlPullParser xpp;
private StringBuilder nodeBuf = new StringBuilder();
private StringBuilder fileBuf = new StringBuilder();
public XppSample(String fileName, String rootTag, String splitTag, String requiredTag, int flushThreshold) throws XmlPullParserException, FileNotFoundException {
this.rootTag = rootTag;
rootTagEnd = "</" + rootTag + ">";
this.splitTag = splitTag;
this.requiredTag = requiredTag;
this.flushThreshold = flushThreshold;
this.fileName = fileName;
XmlPullParserFactory factory = XmlPullParserFactory.newInstance(System.getProperty(XmlPullParserFactory.PROPERTY_NAME), null);
factory.setNamespaceAware(true);
xpp = factory.newPullParser();
xpp.setInput(new FileReader(fileName));
}
public void processDocument() throws XmlPullParserException, IOException {
int eventType = xpp.getEventType();
do {
if(eventType == XmlPullParser.START_TAG) {
processStartElement(xpp);
} else if(eventType == XmlPullParser.END_TAG) {
processEndElement(xpp);
} else if(eventType == XmlPullParser.TEXT) {
processText(xpp);
}
eventType = xpp.next();
} while (eventType != XmlPullParser.END_DOCUMENT);
saveFile();
}
public void processStartElement(XmlPullParser xpp) {
int holderForStartAndLength[] = new int[2];
String name = xpp.getName();
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
if(name.equals(rootTag)) {
int pos = start + length;
header = new String(ch, 0, pos);
} else {
if(requiredTag==null || name.equals(requiredTag)) {
hasRequiredTag = true;
}
nodeBuf.append(xpp.getText());
}
}
public void flushBuffer() throws IOException {
if(hasRequiredTag) {
fileBuf.append(nodeBuf);
if(((++flushCount)%flushThreshold)==0) {
saveFile();
}
}
nodeBuf = new StringBuilder();
hasRequiredTag = false;
}
public void saveFile() throws IOException {
if(fileBuf.length()>0) {
String splitFile = header + fileBuf.toString() + rootTagEnd;
FileUtils.writeStringToFile(new File((fileNo++) + "_" + fileName), splitFile);
fileBuf = new StringBuilder();
}
}
public void processEndElement (XmlPullParser xpp) throws IOException {
String name = xpp.getName();
if(name.equals(rootTag)) {
flushBuffer();
} else {
nodeBuf.append(xpp.getText());
if(name.equals(splitTag)) {
flushBuffer();
}
}
}
public void processText (XmlPullParser xpp) throws XmlPullParserException {
int holderForStartAndLength[] = new int[2];
char ch[] = xpp.getTextCharacters(holderForStartAndLength);
int start = holderForStartAndLength[0];
int length = holderForStartAndLength[1];
String content = new String(ch, start, length);
nodeBuf.append(content);
}
public static void main (String args[]) throws XmlPullParserException, IOException {
//XppSample app = new XppSample("input.xml", "company", "staff", "nickname", 3);
XppSample app = new XppSample("input.xml", "company", "staff", null, 3);
app.processDocument();
}
}
Normally I would suggest using StAX, but it is unclear to me how 'stateful' your real XML is. If simple, then use SAX for ultimate performance, if not-so-simple, use StAX. So you need to
read bytes from disk
convert them to characters
parse the XML
determine whether to keep XML or throw away (skip out subtree)
write XML
convert characters to bytes
write to disk
Now, it might seem like steps 3-5 are the most resource-intensive, but I would rate them as
Most: 1 + 7
Middle: 2 + 6
Least: 3 + 4 + 5
As operations 1 and 7 are kind of separate from the rest, you should do them in an async way; at least, creating multiple small files is best done in other threads, if you are familiar with multi-threading. For increased performance, you might also look into the new IO (NIO) stuff in Java.
Now for steps 2 + 3 and 5 + 6 you can go a long way with FasterXML; it really does a lot of the stuff you are looking for, like triggering JVM hot-spot attention in the right places; it might even support async reading/writing, judging from a quick look through the code.
So then we are left with step 5, and depending on your logic, you should either
a. make an object binding, then decide what to do
b. write XML anyways, hoping for the best, and then throw it away if no 'staff' element is present.
Whatever you do, object reuse is sensible. Note that both alternatives (obviously) require the same amount of parsing (skip out of the subtree ASAP), and for alternative b, a little extra XML is actually not so bad performance-wise; ideally make sure your char buffers are > one unit.
Alternative b is the easiest to implement: simply copy the 'XML event' from your reader to your writer. Example for StAX:
private static void copyEvent(int event, XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
if (event == XMLStreamConstants.START_ELEMENT) {
String localName = reader.getLocalName();
String namespace = reader.getNamespaceURI();
// TODO check this stuff again before setting in production
if (namespace != null) {
if (writer.getPrefix(namespace) != null) {
writer.writeStartElement(namespace, localName);
} else {
writer.writeStartElement(reader.getPrefix(), localName, namespace);
}
} else {
writer.writeStartElement(localName);
}
// first: namespace definition attributes
if(reader.getNamespaceCount() > 0) {
int namespaces = reader.getNamespaceCount();
for(int i = 0; i < namespaces; i++) {
String namespaceURI = reader.getNamespaceURI(i);
if(writer.getPrefix(namespaceURI) == null) {
String namespacePrefix = reader.getNamespacePrefix(i);
if(namespacePrefix == null) {
writer.writeDefaultNamespace(namespaceURI);
} else {
writer.writeNamespace(namespacePrefix, namespaceURI);
}
}
}
}
int attributes = reader.getAttributeCount();
// the write the rest of the attributes
for (int i = 0; i < attributes; i++) {
String attributeNamespace = reader.getAttributeNamespace(i);
if (attributeNamespace != null && attributeNamespace.length() != 0) {
writer.writeAttribute(attributeNamespace, reader.getAttributeLocalName(i), reader.getAttributeValue(i));
} else {
writer.writeAttribute(reader.getAttributeLocalName(i), reader.getAttributeValue(i));
}
}
} else if (event == XMLStreamConstants.END_ELEMENT) {
writer.writeEndElement();
} else if (event == XMLStreamConstants.CDATA) {
String array = reader.getText();
writer.writeCData(array);
} else if (event == XMLStreamConstants.COMMENT) {
String array = reader.getText();
writer.writeComment(array);
} else if (event == XMLStreamConstants.CHARACTERS) {
String array = reader.getText();
if (array.length() > 0 && !reader.isWhiteSpace()) {
writer.writeCharacters(array);
}
} else if (event == XMLStreamConstants.START_DOCUMENT) {
writer.writeStartDocument();
} else if (event == XMLStreamConstants.END_DOCUMENT) {
writer.writeEndDocument();
}
}
And for a subtree,
private static void copySubTree(XMLStreamReader reader, XMLStreamWriter writer) throws XMLStreamException {
reader.require(XMLStreamConstants.START_ELEMENT, null, null);
copyEvent(XMLStreamConstants.START_ELEMENT, reader, writer);
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
}
copyEvent(event, reader, writer);
} while(level > 0);
}
From this you can probably deduce how to skip out to a certain level. In general, for stateful StAX parsing, use the pattern:
private static void parseSubTree(XMLStreamReader reader) throws XMLStreamException {
int level = 1;
do {
int event = reader.next();
if(event == XMLStreamConstants.START_ELEMENT) {
level++;
// do stateful stuff here
// for child logic:
if(reader.getLocalName().equals("Whatever")) {
parseSubTreeForWhatever(reader);
level --; // read from level 1 to 0 in submethod.
}
// alternatively, faster
if(level == 4) {
parseSubTreeForWhateverAtRelativeLevel4(reader);
level --; // read from level 1 to 0 in submethod.
}
} else if(event == XMLStreamConstants.END_ELEMENT) {
level--;
// do stateful stuff here, too
}
} while(level > 0);
}
where at the start of the document you read until the first start element and break (adding the writer + copy for your use, of course, as above).
Note that if you do an object binding, these methods should be placed in that object, and equally for the serialization methods.
I am pretty sure you will get tens of MB/s on a modern system, and that should be sufficient. An issue to investigate further is approaches that use multiple cores for the actual input; if you know for a fact the encoding subset, like non-crazy UTF-8 or ISO-8859, then random access might be possible -> send chunks to different cores.
Have fun, and tell us how it went ;)
Edit: Almost forgot - if for some reason you are the one creating the file in the first place, or you will be reading the files after splitting, you will see HUGE performance gains using XML binarization; there exist XML Schema generators which in turn can feed code generators. (And some XSLT transform libs use code generation too.) And run the JVM with the -server option.
How to make it faster:
Use asynchronous writes, possibly in parallel, might boost your perf if you have RAID-X something disks
Write to an SSD instead of HDD
My suggestion is that SAX, StAX and DOM are not the ideal XML parsers for your problem; the perfect solution is called vtd-xml. There is an article on this subject explaining why DOM, SAX and StAX all do something very wrong... The code below is the shortest you have to write, yet it performs 10x faster than DOM or SAX. http://www.javaworld.com/javaworld/jw-07-2006/jw-0724-vtdxml.html
Here is a recent paper entitled Processing XML with Java – A Performance Benchmark: http://recipp.ipp.pt/bitstream/10400.22/1847/1/ART_BrunoOliveira_2013.pdf
import com.ximpleware.*;
import java.io.*;
public class gandalf {
public static void main(String a[]) throws VTDException, Exception{
VTDGen vg = new VTDGen();
if (vg.parseFile("c:\\xml\\gandalf.txt", false)){
VTDNav vn=vg.getNav();
AutoPilot ap = new AutoPilot(vn);
ap.selectXPath("/company/staff[nickname]");
int i=-1;
int count=0;
while((i=ap.evalXPath())!=-1){
vn.dumpFragment("c:\\xml\\staff"+count+".xml");
count++;
}
}
}
}
Here is a DOM-based solution. I have tested this with the XML you provided. It needs to be checked against the actual XML files that you have.
Since this is based on a DOM parser, please remember that it will require a lot of memory depending on your XML file size. But it's much faster as it's DOM based.
Algorithm :
Parse the document
Extract the root element name
Get the list of nodes based on the split criteria (using XPath)
For each node, create an empty document with root element name as extracted in step #2
Insert the node in this new document
Check if nodes are to be filtered or not.
If nodes are to be filtered, then check if a specified element is present in the newly created doc.
If node is not present, don't write to the file.
If the nodes are NOT to be filtered at all, don't check for condition in #7, and write the document to the file.
This can be run from command prompt as follows
java XMLSplitter xmlFileLocation splitElement filter filterElement
For the xml you mentioned it will be
java XMLSplitter input.xml staff true nickname
In case you don't want to filter
java XMLSplitter input.xml staff
Here is the complete java code:
package com.xml.xpath;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.DOMException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
public class XMLSplitter {
DocumentBuilder builder = null;
XPath xpath = null;
Transformer transformer = null;
String filterElement;
String splitElement;
String xmlFileLocation;
boolean filter = true;
public static void main(String[] arg) throws Exception{
XMLSplitter xMLSplitter = null;
if(arg.length < 4){
if(arg.length < 2){
System.out.println("Insufficient arguments !!!");
System.out.println("Usage: XMLSplitter xmlFileLocation splitElement filter filterElement ");
return;
}else{
System.out.println("Filter is off...");
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],false,null);
}
}else{
xMLSplitter = new XMLSplitter();
xMLSplitter.init(arg[0],arg[1],Boolean.parseBoolean(arg[2]),arg[3]);
}
xMLSplitter.start();
}
public void init(String xmlFileLocation, String splitElement, boolean filter, String filterElement )
throws ParserConfigurationException, TransformerConfigurationException{
//Initialize the Document builder
System.out.println("Initializing..");
DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
domFactory.setNamespaceAware(true);
builder = domFactory.newDocumentBuilder();
//Initialize the transformer
TransformerFactory transformerFactory = TransformerFactory.newInstance();
transformer = transformerFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.METHOD, "xml");
transformer.setOutputProperty(OutputKeys.ENCODING,"UTF-8");
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
//Initialize the xpath
XPathFactory factory = XPathFactory.newInstance();
xpath = factory.newXPath();
this.filterElement = filterElement;
this.splitElement = splitElement;
this.xmlFileLocation = xmlFileLocation;
this.filter = filter;
}
public void start() throws Exception{
//Parser the file
System.out.println("Parsing file.");
Document doc = builder.parse(xmlFileLocation);
//Get the root node name
System.out.println("Getting root element.");
XPathExpression rootElementexpr = xpath.compile("/");
Object rootExprResult = rootElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList rootNode = (NodeList) rootExprResult;
String rootNodeName = rootNode.item(0).getFirstChild().getNodeName();
//Get the list of split elements
XPathExpression expr = xpath.compile("//"+splitElement);
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
System.out.println("Total number of split nodes "+nodes.getLength());
for (int i = 0; i < nodes.getLength(); i++) {
//Wrap each node inside root of the parent xml doc
Node sigleNode = wrappInRootElement(rootNodeName,nodes.item(i));
//Get the XML string of the fragment
String xmlFragment = serializeDocument(sigleNode);
//System.out.println(xmlFragment);
//Write the xml fragment in file.
storeInFile(xmlFragment,i);
}
}
private Node wrappInRootElement(String rootNodeName, Node fragmentDoc)
throws XPathExpressionException, ParserConfigurationException, DOMException,
SAXException, IOException, TransformerException{
//Create empty doc with just root node
DOMImplementation domImplementation = builder.getDOMImplementation();
Document doc = domImplementation.createDocument(null,null,null);
Element theDoc = doc.createElement(rootNodeName);
doc.appendChild(theDoc);
//Insert the fragment inside the root node
InputSource inStream = new InputSource();
String xmlString = serializeDocument(fragmentDoc);
inStream.setCharacterStream(new StringReader(xmlString));
Document fr = builder.parse(inStream);
theDoc.appendChild(doc.importNode(fr.getFirstChild(),true));
return doc;
}
private String serializeDocument(Node doc) throws TransformerException, XPathExpressionException{
if(!serializeThisNode(doc)){
return null;
}
DOMSource domSource = new DOMSource(doc);
StringWriter stringWriter = new StringWriter();
StreamResult streamResult = new StreamResult(stringWriter);
transformer.transform(domSource, streamResult);
String xml = stringWriter.toString();
return xml;
}
//Check whether node is to be stored in file or rejected based on input
private boolean serializeThisNode(Node doc) throws XPathExpressionException{
if(!filter){
return true;
}
XPathExpression filterElementexpr = xpath.compile("//"+filterElement);
Object result = filterElementexpr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
if(nodes.item(0) != null){
return true;
}else{
return false;
}
}
private void storeInFile(String content, int fileIndex) throws IOException{
if(content == null || content.length() == 0){
return;
}
String fileName = splitElement+fileIndex+".xml";
File file = new File(fileName);
if(file.exists()){
System.out.println(" The file "+fileName+" already exists !! cannot create the file with the same name ");
return;
}
FileWriter fileWriter = new FileWriter(file);
fileWriter.write(content);
fileWriter.close();
System.out.println("Generated file "+fileName);
}
}
Let me know if this works for you, or if you need any other help regarding this code.
