Camel, split large XML file with header, using field condition - java

I'm trying to set up an Apache Camel route that inputs a large XML file and then splits the payload into two different files based on a field condition, i.e. if an ID field starts with a 1, it goes to one output file, otherwise to the other. Using Camel is not a must, and I've looked at XSLT and plain Java options as well, but I just feel that this should work.
I've covered splitting the actual payload, but I'm having issues making sure that the parent nodes, including a header, are included in each file as well. As the file can be large, I want to make sure streams are used for the payload. I feel like I've read hundreds of different questions, blog entries, etc. on this, and pretty much every case covers either loading the entire file into memory, splitting the file into equal parts, or just using the payload nodes individually.
My prototype XML file looks like this:
<root>
<header>
<title>Testing</title>
</header>
<orders>
<order>
<id>11</id>
<stuff>One</stuff>
</order>
<order>
<id>20</id>
<stuff>Two</stuff>
</order>
<order>
<id>12</id>
<stuff>Three</stuff>
</order>
</orders>
</root>
The result should be two files - condition true (id starts with 1):
<root>
<header>
<title>Testing</title>
</header>
<orders>
<order>
<id>11</id>
<stuff>One</stuff>
</order>
<order>
<id>12</id>
<stuff>Three</stuff>
</order>
</orders>
</root>
Condition false:
<root>
<header>
<title>Testing</title>
</header>
<orders>
<order>
<id>20</id>
<stuff>Two</stuff>
</order>
</orders>
</root>
My prototype route:
from("file:" + inputFolder)
    .log("Processing file ${headers.CamelFileName}")
    .split()
        .tokenizeXML("order", "*") // Includes parent in every node
        .streaming()
    .choice()
        .when(body().contains("id>1"))
            .to("direct:ones")
            .stop()
        .otherwise()
            .to("direct:others")
            .stop()
    .end()
.end();

from("direct:ones")
    //.aggregate(header("ones"), new StringAggregator()) // missing end condition
    .to("file:" + outputFolder + "?fileName=ones-${in.header.CamelFileName}&fileExist=Append");

from("direct:others")
    //.aggregate(header("others"), new StringAggregator()) // missing end condition
    .to("file:" + outputFolder + "?fileName=others-${in.header.CamelFileName}&fileExist=Append");
This works as intended, except that the parent tags (header and footer, if you will) are added for every node. Using just the node name in tokenizeXML returns only the node itself, but I can't figure out how to add the header and footer. Preferably I would want to stream the parent tags into header and footer properties and add them before and after the split.
How can I do this? Would I somehow need to tokenize the parent tags first and would this mean streaming the file twice?
As a final note, you might notice the aggregate at the end. I don't want to aggregate every node before writing to the file, as that defeats the purpose of streaming and keeping the entire file out of memory, but I figured I might gain some performance by aggregating a number of nodes before each write, to lessen the performance hit of writing to the drive for every node. I'm not sure if this makes sense to do.

I was unable to make it work with Camel. Or rather, once I was using plain Java to extract the header, I already had everything I needed to continue with the split, and swapping back to Camel seemed cumbersome. There are most likely ways to improve on this, but this was my solution to splitting the XML payload.
Switching between the two types of output streams is not that pretty, but it eases the use of everything else. Also of note is that I chose equalsIgnoreCase to check the tag names, even though XML is normally case-sensitive; for me, it reduces the risk of errors. Finally, make sure your regex matches the entire string, using wildcards as in normal string regex.
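To illustrate that last point: with String.matches() the pattern must cover the whole payload string, including its newlines. A small self-contained sketch (the helper name and sample string are mine, not part of the solution below):

```java
public class SplitRegexDemo {

    // Hypothetical payload string, as getTagAsString below would produce it
    static final String ORDER = "<order>\n<id>11</id>\n<stuff>One</stuff>\n</order>";

    // Matches orders whose <id> starts with 1. The (?s) flag lets . cross
    // newlines, and the leading/trailing .* are required because
    // String.matches() must consume the entire string, not just a substring.
    static boolean startsWithOne(String order) {
        return order.matches("(?s).*<id>\\s*1.*");
    }

    public static void main(String[] args) {
        System.out.println(startsWithOne(ORDER)); // true
    }
}
```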
/**
* Splits an XML file's payload into two new files based on a regex condition. The payload is a specific XML tag in the
* input file that is repeated a number of times. All tags before and after the payload are added to both files in order
* to keep the same structure.
*
* The content of each payload tag is compared to the regex condition and if it matches, the tag is added to the primary
* output file. Otherwise it is added to the secondary output file. The payload can be empty, and an empty payload tag
* will be added to the secondary output file. Note that the output will not be an unaltered copy of the input, as
* self-closing XML tags are rewritten as corresponding opening and closing tags.
*
* Data is streamed from the input file to the output files, keeping memory usage small even with large files.
*
* @param inputFilename Path and filename for the input XML file
* @param outputFilenamePrimary Path and filename for the primary output file
* @param outputFilenameSecondary Path and filename for the secondary output file
* @param payloadTag XML tag name of the payload
* @param payloadParentTag XML tag name of the payload's direct parent
* @param splitRegex The regex split condition used on the payload content
* @throws Exception On invalid filenames, missing input, incorrect XML structure, etc.
*/
public static void splitXMLPayload(String inputFilename, String outputFilenamePrimary, String outputFilenameSecondary, String payloadTag, String payloadParentTag, String splitRegex) throws Exception {
    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
    XMLEventReader xmlEventReader = null;
    FileInputStream fileInputStream = null;
    FileWriter fileWriterPrimary = null;
    FileWriter fileWriterSecondary = null;
    XMLEventWriter xmlEventWriterSplitPrimary = null;
    XMLEventWriter xmlEventWriterSplitSecondary = null;
    try {
        fileInputStream = new FileInputStream(inputFilename);
        xmlEventReader = xmlInputFactory.createXMLEventReader(fileInputStream);
        fileWriterPrimary = new FileWriter(outputFilenamePrimary);
        fileWriterSecondary = new FileWriter(outputFilenameSecondary);
        xmlEventWriterSplitPrimary = xmlOutputFactory.createXMLEventWriter(fileWriterPrimary);
        xmlEventWriterSplitSecondary = xmlOutputFactory.createXMLEventWriter(fileWriterSecondary);
        boolean isStart = true;
        boolean isEnd = false;
        boolean lastSplitIsPrimary = true;
        while (xmlEventReader.hasNext()) {
            XMLEvent xmlEvent = xmlEventReader.nextEvent();
            // Check for start of payload element
            if (!isEnd && xmlEvent.isStartElement()) {
                StartElement startElement = xmlEvent.asStartElement();
                if (startElement.getName().getLocalPart().equalsIgnoreCase(payloadTag)) {
                    if (isStart) {
                        isStart = false;
                        // Flush the event writers as we'll use the file writers for the payload
                        xmlEventWriterSplitPrimary.flush();
                        xmlEventWriterSplitSecondary.flush();
                    }
                    String order = getTagAsString(xmlEventReader, xmlEvent, payloadTag, xmlOutputFactory);
                    if (order.matches(splitRegex)) {
                        lastSplitIsPrimary = true;
                        fileWriterPrimary.write(order);
                    } else {
                        lastSplitIsPrimary = false;
                        fileWriterSecondary.write(order);
                    }
                }
            }
            // Check for end of parent tag
            else if (!isStart && !isEnd && xmlEvent.isEndElement()) {
                EndElement endElement = xmlEvent.asEndElement();
                if (endElement.getName().getLocalPart().equalsIgnoreCase(payloadParentTag)) {
                    isEnd = true;
                }
            }
            // Neither start nor end while handling the payload (most often white space)
            else if (!isStart && !isEnd) {
                // Add to the last split handled
                if (lastSplitIsPrimary) {
                    xmlEventWriterSplitPrimary.add(xmlEvent);
                    xmlEventWriterSplitPrimary.flush();
                } else {
                    xmlEventWriterSplitSecondary.add(xmlEvent);
                    xmlEventWriterSplitSecondary.flush();
                }
            }
            // Start and end are added to both files
            if (isStart || isEnd) {
                xmlEventWriterSplitPrimary.add(xmlEvent);
                xmlEventWriterSplitSecondary.add(xmlEvent);
            }
        }
    } catch (Exception e) {
        logger.error("Error in XML split", e);
        throw e;
    } finally {
        // Close the streams
        try {
            xmlEventReader.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            fileInputStream.close();
        } catch (IOException e) {
            // ignore
        }
        try {
            xmlEventWriterSplitPrimary.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            xmlEventWriterSplitSecondary.close();
        } catch (XMLStreamException e) {
            // ignore
        }
        try {
            fileWriterPrimary.close();
        } catch (IOException e) {
            // ignore
        }
        try {
            fileWriterSecondary.close();
        } catch (IOException e) {
            // ignore
        }
    }
}
/**
* Loops through the events in the {@code XMLEventReader} until the specified XML end tag is found and returns everything
* contained within the XML tag as a String.
*
* Data is streamed from the {@code XMLEventReader}; however, the String can be large depending on the number of children
* in the XML tag.
*
* @param xmlEventReader The already active reader. The starting tag event is assumed to have already been read
* @param startEvent The starting XML tag event already read from the {@code XMLEventReader}
* @param tag The XML tag name used to find the starting XML tag
* @param xmlOutputFactory Convenience include to avoid creating another factory
* @return String containing everything between the starting and ending XML tag, the tags themselves included
* @throws Exception On incorrect XML structure
*/
private static String getTagAsString(XMLEventReader xmlEventReader, XMLEvent startEvent, String tag, XMLOutputFactory xmlOutputFactory) throws Exception {
    StringWriter stringWriter = new StringWriter();
    XMLEventWriter xmlEventWriter = xmlOutputFactory.createXMLEventWriter(stringWriter);
    // Add the start tag
    xmlEventWriter.add(startEvent);
    // Add until end tag
    while (xmlEventReader.hasNext()) {
        XMLEvent xmlEvent = xmlEventReader.nextEvent();
        // End tag found
        if (xmlEvent.isEndElement() && xmlEvent.asEndElement().getName().getLocalPart().equalsIgnoreCase(tag)) {
            xmlEventWriter.add(xmlEvent);
            xmlEventWriter.close();
            stringWriter.close();
            return stringWriter.toString();
        } else {
            xmlEventWriter.add(xmlEvent);
        }
    }
    xmlEventWriter.close();
    stringWriter.close();
    throw new Exception("Invalid XML, no closing tag for <" + tag + "> found!");
}

Related

Java stax: The reference to entity "R" must end with the ';' delimiter

I am trying to parse a xml using stax but the error I get is:
javax.xml.stream.XMLStreamException: ParseError at [row,col]:[414,47]
Message: The reference to entity "R" must end with the ';' delimiter.
It gets stuck on line 414, which has P&R inside the XML file. The code I have to parse it is:
public List<Vild> getVildData(File file){
XMLInputFactory factory = XMLInputFactory.newFactory();
try {
ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(Files.readAllBytes(file.toPath()));
XMLStreamReader reader = factory.createXMLStreamReader(byteArrayInputStream, "iso8859-1");
List<Vild> vild = saveVild(reader);
reader.close();
return vild;
} catch (IOException e) {
e.printStackTrace();
} catch (XMLStreamException e) {
e.printStackTrace();
}
return Collections.emptyList();
}
private List<Vild> saveVild(XMLStreamReader streamReader) {
List<Vild> vildList = new ArrayList<>();
try{
Vild vild = new Vild();
while (streamReader.hasNext()) {
streamReader.next();
//Creating list with data
}
}catch(XMLStreamException | IllegalStateException ex) {
ex.printStackTrace();
}
return Collections.emptyList();
}
I read online that a bare & is invalid XML, but I don't know how to change it before the error is thrown inside the saveVild method. Does someone know how to do this efficiently?
Change the question: you're not trying to parse an XML file, you're trying to parse a non-XML file. For that, you need a non-XML parser, and to write such a parser you need to start with a specification of the language you are trying to parse, and you'll need to agree the specification of this language with the other partners to the data interchange.
How much work you could all save by conforming to standards!
Treat broken XML arriving in your shop the way you would treat any other broken goods coming from a supplier: return it to sender marked "unfit for purpose".
The problem here, as you mention, is that the parser finds the & and expects it to start an entity reference ending with a ;.
This is fixed by escaping the character, so that the parser finds &amp; instead.
Take a look here for further reference.
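If the feed cannot be fixed at the source, one pragmatic workaround is to escape the bare ampersands before handing the text to the parser. A minimal, illustrative sketch (the regex and class name are my own, not a library API); the negative lookahead leaves already-valid references such as &amp; or &#38; untouched:

```java
import java.util.regex.Pattern;

public class AmpersandEscaper {

    // Matches an & that does not already begin a reference like &amp; &#38; or &#x26;
    private static final Pattern BARE_AMP =
            Pattern.compile("&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#\\d+|#x[0-9a-fA-F]+);)");

    static String escapeBareAmps(String xml) {
        return BARE_AMP.matcher(xml).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmps("<name>P&R</name>"));
        // <name>P&amp;R</name>
    }
}
```

Note this assumes you can afford to pre-process the text (e.g. line by line) before creating the stream reader.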

JaxB marshaler overwriting file contents

I am trying to use JAXB to marshal objects I create to XML. What I want is to create a list, print it to the file, then create a new list and print it to the same file, but every time I do, it overwrites the first. I want the final XML file to look like I had one big list of objects. I would just build that one big list, but there are so many objects that I quickly max out my heap size.
So, my main creates a bunch of threads, each of which iterates through a list of objects it receives and calls create_Log on each object. Once finished, it calls printToFile, which is where it marshals the list to the file.
public class LogThread implements Runnable {
//private Thread myThread;
private Log_Message message = null;
private LinkedList<Log_Message> lmList = null;
LogServer Log = null;
private String Username = null;
public LogThread(LinkedList<Log_Message> lmList){
this.lmList = lmList;
}
public void run(){
//System.out.println("thread running");
LogServer Log = new LogServer();
//create iterator for list
final ListIterator<Log_Message> listIterator = lmList.listIterator();
while(listIterator.hasNext()){
message = listIterator.next();
CountTrans.addTransNumber(message.TransactionNumber);
Username = message.input[2];
Log.create_Log(message.input, message.TransactionNumber, message.Message, message.CMD);
}
Log.printToFile();
init_LogServer.threadCount--;
init_LogServer.doneList();
init_LogServer.doneUser();
System.out.println("Thread "+ Thread.currentThread().getId() +" Completed user: "+ Username+"... Number of Users Complete: " + init_LogServer.getUsersComplete());
//Thread.interrupt();
}
}
The above calls the function create_Log below to build a new object generated from the XSD I was given (SystemEventType, QuoteServerType, etc.). These objects are all added to an ArrayList and attached to the Root object. Once the LogThread loop is finished, it calls printToFile, which takes the list from the Root object and marshals it to the file... overwriting what was already there. How can I add it to the same file without overwriting and without creating one master list in the heap?
public class LogServer {
public log Root = null;
public static String fileName = "LogFile.xml";
public static File XMLfile = new File(fileName);
public LogServer(){
this.Root = new log();
}
//output LogFile.xml
public synchronized void printToFile(){
System.out.println("Printing XML");
//write to xml file
try {
init_LogServer.marshaller.marshal(Root,XMLfile);
} catch (JAXBException e) {
e.printStackTrace();
}
System.out.println("Done Printing XML");
}
private BigDecimal ConvertStringtoBD(String input){
DecimalFormatSymbols symbols = new DecimalFormatSymbols();
symbols.setGroupingSeparator(',');
symbols.setDecimalSeparator('.');
String pattern = "#,##0.0#";
DecimalFormat decimalFormat = new DecimalFormat(pattern, symbols);
decimalFormat.setParseBigDecimal(true);
// parse the string
BigDecimal bigDecimal = new BigDecimal("0");
try {
bigDecimal = (BigDecimal) decimalFormat.parse(input);
} catch (ParseException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
return bigDecimal;
}
public QuoteServerType Log_Quote(String[] input, int TransactionNumber){
BigDecimal quote = ConvertStringtoBD(input[4]);
BigInteger TransNumber = BigInteger.valueOf(TransactionNumber);
BigInteger ServerTimeStamp = new BigInteger(input[6]);
Date date = new Date();
long timestamp = date.getTime();
ObjectFactory factory = new ObjectFactory();
QuoteServerType quoteCall = factory.createQuoteServerType();
quoteCall.setTimestamp(timestamp);
quoteCall.setServer(input[8]);
quoteCall.setTransactionNum(TransNumber);
quoteCall.setPrice(quote);
quoteCall.setStockSymbol(input[3]);
quoteCall.setUsername(input[2]);
quoteCall.setQuoteServerTime(ServerTimeStamp);
quoteCall.setCryptokey(input[7]);
return quoteCall;
}
public SystemEventType Log_SystemEvent(String[] input, int TransactionNumber, CommandType CMD){
BigInteger TransNumber = BigInteger.valueOf(TransactionNumber);
Date date = new Date();
long timestamp = date.getTime();
ObjectFactory factory = new ObjectFactory();
SystemEventType SysEvent = factory.createSystemEventType();
SysEvent.setTimestamp(timestamp);
SysEvent.setServer(input[8]);
SysEvent.setTransactionNum(TransNumber);
SysEvent.setCommand(CMD);
SysEvent.setFilename(fileName);
return SysEvent;
}
public void create_Log(String[] input, int TransactionNumber, String Message, CommandType Command){
switch(Command.toString()){
case "QUOTE": //Quote_Log
QuoteServerType quote_QuoteType = Log_Quote(input,TransactionNumber);
Root.getUserCommandOrQuoteServerOrAccountTransaction().add(quote_QuoteType);
break;
case "QUOTE_CACHED":
SystemEventType Quote_Cached_SysType = Log_SystemEvent(input, TransactionNumber, CommandType.QUOTE);
Root.getUserCommandOrQuoteServerOrAccountTransaction().add(Quote_Cached_SysType);
break;
}
}
EDIT: The code below shows how the objects are added to the ArrayList
public List<Object> getUserCommandOrQuoteServerOrAccountTransaction() {
if (userCommandOrQuoteServerOrAccountTransaction == null) {
userCommandOrQuoteServerOrAccountTransaction = new ArrayList<Object>();
}
return this.userCommandOrQuoteServerOrAccountTransaction;
}
JAXB is about mapping a Java object tree to an XML document, or vice versa. So in principle, you need the complete object model before you can save it to XML.
Of course that would not be possible for very large data, for example a DB dump, so JAXB allows marshalling an object tree in fragments, letting the user control the moment of object creation and marshaling. A typical use case would be fetching records from a DB one by one and marshaling them one by one to a file, so there would be no problem with the heap.
However, you are asking about appending one object tree to another (one fresh in memory, the second already represented in an XML file). That is not normally possible, as it is not really appending but creating a new object tree that contains the content of both (there is only one document root element, not two).
So what you could do is:
1. create the new XML representation with a manually initiated root element,
2. copy the existing XML content into the new XML, either using XMLStreamWriter/XMLStreamReader read/write operations or by unmarshaling the log objects and marshaling them one by one,
3. marshal your new log objects into the same XML stream,
4. complete the XML with the root closing element.
Vaguely, something like this:
XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(new FileOutputStream(...), StandardCharsets.UTF_8.name());

// "manually" output the beginning of the xml document == its declaration and the root element
writer.writeStartDocument();
writer.writeStartElement("YOUR_ROOT_ELM");

Marshaller mar = ...
mar.setProperty(Marshaller.JAXB_FRAGMENT, true); // instructs jaxb to output only the objects, not the whole xml document

PartialUnmarshaler existing = ...; // allows reading the xml content one by one from the existing file
while (existing.hasNext()) {
    YourObject obj = existing.next();
    mar.marshal(obj, writer);
    writer.flush();
}

List<YourObject> toAppend = ...
for (YourObject obj : toAppend) {
    mar.marshal(obj, writer);
    writer.flush();
}

// finishing the document, closing the root element
writer.writeEndElement();
writer.writeEndDocument();
Reading the objects one by one from large xml file, and complete implementation of PartialUnmarshaler is described in this answer:
https://stackoverflow.com/a/9260039/4483840
That is the 'elegant' solution.
Less elegant is to have your threads write their log lists to individual files and then append them yourself. You only need to read and copy the header of the first file, then copy all of its content apart from the last closing tag, copy the content of the other files ignoring their document opening and closing tags, and finally output the closing tag.
If your marshaller is set to marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, true);
each opening/closing tag will be on a different line, so the ugly hack is to
copy all the lines from the 3rd to the one before last, then output the closing tag.
It is an ugly hack, because it is sensitive to your output format (if you, for example, change your container root element). But it is faster to implement than the full JAXB solution.

Update XML using XMLStreamWriter

I have a large XML and I want to update a particular node of the XML (like removing duplicate nodes).
As the XML is huge, I considered using the StAX API class XMLStreamReader. I first read the XML using XMLStreamReader, stored the read data in user objects, and manipulated these user objects to remove duplicates.
Now I want to put this updated user object back into my original XML. What I thought is that I can marshal the user object to a string and place the string at the right position in my input XML, but I am not able to achieve it using the StAX class XMLStreamWriter.
Can this be achieved using XMLStreamWriter? Please suggest.
If no, they please suggest an alternative approach to my problem.
My main concern is memory, as I cannot load such huge XMLs into our project server's memory, which is shared across multiple processes. Hence I do not want to use DOM, because it will use a lot of memory to load these huge XMLs.
If you need to alter a particular value like text content, tag names, etc., StAX might help. It would also help in removing a few elements using createFilteredReader.
The code below renames Name to AuthorName and adds a comment:
public class StAx {
    public static void main(String[] args) throws FileNotFoundException,
            XMLStreamException {
        String filename = "HelloWorld.xml";
        try (InputStream in = new FileInputStream(filename);
             OutputStream out = System.out) {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLOutputFactory xof = XMLOutputFactory.newInstance();
            XMLEventFactory ef = XMLEventFactory.newInstance();
            XMLEventReader reader = factory.createXMLEventReader(filename, in);
            XMLEventWriter writer = xof.createXMLEventWriter(out);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    String data = event.asCharacters().getData();
                    if (data.contains("Hello")) {
                        String replace = data.replace("Hello", "Oh");
                        event = ef.createCharacters(replace);
                    }
                    writer.add(event);
                } else if (event.isStartElement()) {
                    StartElement s = event.asStartElement();
                    String tagName = s.getName().getLocalPart();
                    if (tagName.equals("Name")) {
                        String newName = "Author" + tagName;
                        event = ef.createStartElement(new QName(newName), null,
                                null);
                        writer.add(event);
                        writer.add(ef.createCharacters("\n "));
                        event = ef.createComment("auto generated comment");
                        writer.add(event);
                    } else {
                        writer.add(event);
                    }
                } else {
                    writer.add(event);
                }
            }
            writer.flush();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Input
<?xml version="1.0"?>
<BookCatalogue>
<Book>
<Title>HelloLord</Title>
<Name>
<first>New</first>
<last>Earth</last>
</Name>
<ISBN>12345</ISBN>
</Book>
<Book>
<Title>HelloWord</Title>
<Name>
<first>New</first>
<last>Moon</last>
</Name>
<ISBN>12346</ISBN>
</Book>
</BookCatalogue>
Output
<?xml version="1.0"?><BookCatalogue>
<Book>
<Title>OhLord</Title>
<AuthorName>
<!--auto generated comment-->
<first>New</first>
<last>Earth</last>
</AuthorName>
<ISBN>12345</ISBN>
</Book>
<Book>
<Title>OhWord</Title>
<AuthorName>
<!--auto generated comment-->
<first>New</first>
<last>Moon</last>
</AuthorName>
<ISBN>12346</ISBN>
</Book>
</BookCatalogue>
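The createFilteredReader route mentioned above can also be sketched. This is illustrative only, not the original answer's code; the `ISBN` tag comes from the sample input, the helper is hypothetical, and the stateful EventFilter (tracking nesting depth) is one possible way to drop a whole element subtree while streaming:

```java
import javax.xml.stream.EventFilter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;

public class FilterDemo {

    // Removes every element named `tag`, and everything inside it, while streaming.
    static String dropElement(String xml, String tag) {
        try {
            XMLInputFactory inf = XMLInputFactory.newInstance();
            XMLEventReader raw = inf.createXMLEventReader(new StringReader(xml));
            EventFilter filter = new EventFilter() {
                int depth = 0; // > 0 while inside the element being dropped
                public boolean accept(XMLEvent e) {
                    if (e.isStartElement()
                            && e.asStartElement().getName().getLocalPart().equals(tag)) {
                        depth++;
                        return false;
                    }
                    if (depth > 0) {
                        if (e.isEndElement()
                                && e.asEndElement().getName().getLocalPart().equals(tag)) {
                            depth--;
                        }
                        return false;
                    }
                    return true;
                }
            };
            XMLEventReader filtered = inf.createFilteredReader(raw, filter);
            StringWriter out = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
            writer.add(filtered); // copies every accepted event to the output
            writer.close();
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(dropElement("<Book><ISBN>12345</ISBN><Title>HelloLord</Title></Book>", "ISBN"));
    }
}
```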
As you can see, things get really complicated when the modification is much more than this, like swapping two nodes, or deleting one node based on the state of a few other nodes (e.g. delete all Books with a price above the average price).
The best solution in such cases is to produce the resulting XML using an XSLT transformation.
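For completeness, a hedged sketch of driving such an XSLT from Java with javax.xml.transform. The inline stylesheet (an identity transform plus one empty template that drops ISBN elements) is purely illustrative:

```java
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XsltDemo {

    // Identity transform, plus an empty template that removes <ISBN> elements.
    // Inlined only to keep the example self-contained; normally this lives in a .xsl file.
    static final String XSL =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='@*|node()'>"
          + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
          + "</xsl:template>"
          + "<xsl:template match='ISBN'/>"
          + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform("<Book><ISBN>12345</ISBN><Title>HelloLord</Title></Book>"));
    }
}
```

For large documents, note that plain XSLT 1.0 processors generally build the whole tree in memory, so this trades the streaming benefit for expressiveness.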

Parsing malformed/incomplete/invalid XML files [duplicate]

This question already has answers here:
How to parse invalid (bad / not well-formed) XML?
(4 answers)
Closed 5 years ago.
I have a process that parses an XML file using JDOM and xpath to parse the file as shown below:
private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;
builder = new SAXBuilder();
Text list = null;
try {
doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
throw new Exception(e);
}
try {
xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
throw new Exception(e);
}
The above works fine. The xpath expressions are stored in a properties file so these can be changed anytime. Now i have to process some more xml files that come from a legacy system that will only send the xml files in chunks of 4000 bytes. The existing processing reads the 4000 byte chunks and stores them in an Oracle database with each chunk as one row in the database (Making any changes to the legacy system or the processing that stores the chunks as rows in the database is out of the question).
I can build the complete valid XML document by extracting all the rows related to a specific xml document and merging them and then use the existing processing (shown above) to parse the xml document.
The thing is, though, the data I need to extract from the XML document will always be in the first 4000 bytes. This chunk of course is not a valid XML document, as it will be incomplete, but it will contain all the data I need. I can't parse just the one chunk, as the JDOM builder will reject it.
I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (which could be quite many) in order to get a valid XML document. This would save me several trips to the database to check if a chunk is available, and I wouldn't have to merge hundreds of chunks only to use the first 4000 bytes.
I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?
You could try to use JSoup to parse the invalid XML. By definition XML should be well-formed, otherwise it's invalid and should not be used.
UPDATE - example:
public static void main(String[] args) {
for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>" ,
new Element(Tag.valueOf("p"), ""),
"")) {
print(node, 0);
}
}
public static void print(Node node, int offset) {
for (int i = 0; i < offset; i++) {
System.out.print(" ");
}
System.out.print(node.nodeName());
for (Attribute attribute: node.attributes()) {
System.out.print(", ");
System.out.print(attribute.getKey() + "=" + attribute.getValue());
}
System.out.println();
for (Node child : node.childNodes()) {
print(child, offset + 4);
}
}

Code for Using StAX in java

I have a 200 MB XML of the following form:
<school name = "some school">
<class standard = "2A">
<student>
.....
</student>
<student>
.....
</student>
<student>
.....
</student>
</class>
</school>
I need to split this XML into several files using StAX, such that n students go into each XML file and the structure is preserved as <school>, then <class>, and the <student> elements under them. The attributes of school and class must also be preserved in the resulting XMLs.
Here is the code I am using:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
String xmlFile = "input.XML";
XMLEventReader reader = inputFactory.createXMLEventReader(new FileReader(xmlFile));
XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
outputFactory.setProperty("javax.xml.stream.isRepairingNamespaces", Boolean.TRUE);
XMLEventWriter writer = null;
int count = 0;
QName name = new QName(null, "student");
try {
while (true) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement()) {
StartElement element = event.asStartElement();
if (element.getName().equals(name)) {
String filename = "input"+ count + ".xml";
writer = outputFactory.createXMLEventWriter(new FileWriter(filename));
writeToFile(reader, event, writer);
writer.close();
count++;
}
}
if (event.isEndDocument())
break;
}
} catch (XMLStreamException e) {
throw e;
} catch (IOException e) {
e.printStackTrace();
} finally {
reader.close();
}
private static void writeToFile(XMLEventReader reader, XMLEvent startEvent, XMLEventWriter writer) throws XMLStreamException, IOException {
StartElement element = startEvent.asStartElement();
QName name = element.getName();
int stack = 1;
writer.add(element);
while (true) {
XMLEvent event = reader.nextEvent();
if (event.isStartElement() && event.asStartElement().getName().equals(name))
stack++;
if (event.isEndElement()) {
EndElement end = event.asEndElement();
if (end.getName().equals(name)) {
stack--;
if (stack == 0) {
writer.add(event);
break;
}
}
}
writer.add(event);
}
}
Please check the function call writeToFile(reader, event, writer) in the try block. Here the reader object has only the student tag. I need the reader to have the school, class, and then n students in it, so that the generated file has a structure similar to the original, only with fewer children per file.
Thanks in advance.
I think you can keep track of the list of parent events seen prior to the "student" start element event and pass it to the writeToFile() method. Then in writeToFile() you can use that list to replay the "school" and "class" events.
You have code for determining when to start a new file which I haven't examined closely, but the process of finishing one file and starting the next is definitely incomplete.
On reaching a point where you want to end a file, you have to generate end events for the enclosing <class> and <school> tags and for the document before closing it. When you start your new file, you need to generate start events for the same after opening it and before starting again to copy student events.
In order to generate the start events properly, you will have to retain the corresponding events from the input.
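As a sketch of that idea (illustrative only; the school/class/student names come from the question, everything else is mine): record each parent start event seen before the first <student>, replay them at the top of every output, and emit matching end events in reverse order when finishing a file. Writing one student per String keeps it easy to test; the same loop would write to files.

```java
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {

    // Splits <school><class><student>...</student>...</class></school> into one
    // document per student, replaying the saved parent start events at the top
    // of each output and generating the matching end events in reverse order.
    public static List<String> split(String xml) {
        try {
            XMLInputFactory inf = XMLInputFactory.newInstance();
            XMLOutputFactory outf = XMLOutputFactory.newInstance();
            XMLEventFactory ef = XMLEventFactory.newInstance();
            XMLEventReader reader = inf.createXMLEventReader(new StringReader(xml));
            List<XMLEvent> parents = new ArrayList<>(); // start events seen before the first <student>
            List<String> docs = new ArrayList<>();
            while (reader.hasNext()) {
                XMLEvent e = reader.nextEvent();
                if (!e.isStartElement()) {
                    continue;
                }
                if (e.asStartElement().getName().getLocalPart().equals("student")) {
                    StringWriter sw = new StringWriter();
                    XMLEventWriter w = outf.createXMLEventWriter(sw);
                    for (XMLEvent p : parents) {
                        w.add(p); // replay <school ...> and <class ...>, attributes included
                    }
                    copyElement(reader, e, w); // copy one complete <student> subtree
                    for (int i = parents.size() - 1; i >= 0; i--) {
                        w.add(ef.createEndElement(parents.get(i).asStartElement().getName(), null));
                    }
                    w.close();
                    docs.add(sw.toString());
                } else {
                    parents.add(e); // a parent such as <school> or <class>
                }
            }
            return docs;
        } catch (XMLStreamException ex) {
            throw new RuntimeException(ex);
        }
    }

    // Copies the element whose start event was just read, tracking nesting depth.
    private static void copyElement(XMLEventReader reader, XMLEvent start, XMLEventWriter w)
            throws XMLStreamException {
        w.add(start);
        int depth = 1;
        while (depth > 0) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) depth++;
            if (e.isEndElement()) depth--;
            w.add(e);
        }
    }
}
```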
Save yourself trouble and time: keep the flat XML file structure you currently have, and create POJO objects representing each element as you've stated: Student, School and Class. Then use JAXB to bind the objects to the different parts of the structure. You can then effectively unmarshal the XML and access the various elements as if you were dealing with SQL objects.
Use this link as a starting point: XML parsing with JAXB
One issue doing it this way is memory consumption. For design flexibility and memory management, I would suggest using SQL to handle this.
