Best parser to parse 1 GB xml data in java [duplicate]

This question already has answers here:
Processing large xml files
(5 answers)
Which is the best library for XML parsing in java [closed]
(7 answers)
Closed 4 years ago.
My requirement:
I have a 1 GB XML file and want to remove a few nodes from it. The nodes to remove can be anywhere in the file and depend on the input. What is the best parser for this in Java?
I'm currently using a DOM parser. It works fine for 100 MB files, but it throws an out-of-memory error (Java heap space) for a 1 GB file.
Can anyone suggest the best approach for my code below?
public static void main(String[] args) {
    DocumentBuilder docBuilder = null;
    File inputFile = new File("/scratch/bigfile/final.txt");
    // Parse the XML file using the DOM parser
    try {
        DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
        docBuilderFactory.setExpandEntityReferences(false);
        docBuilderFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        docBuilder = docBuilderFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(inputFile);
        // Remove unwanted nodes from the XML file
        Element element1 = (Element) doc.getElementsByTagName("G_SUMMARY_ROWSET").item(0);
        element1.getParentNode().removeChild(element1);
        Element element2 = (Element) doc.getElementsByTagName("G_JRNLSOURCE_ROWSET").item(0);
        element2.getParentNode().removeChild(element2);
        Element element3 = (Element) doc.getElementsByTagName("G_JRNLSOURCE_UNMATCHED_ROWSET").item(0);
        element3.getParentNode().removeChild(element3);
        Element element4 = (Element) doc.getElementsByTagName("G_JRNLDETAILS_UNMATCHED_ROWSET").item(0);
        element4.getParentNode().removeChild(element4);
        // Convert the DOM document to a byte array
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        StreamResult result = new StreamResult(bos);
        transformer.transform(source, result);
        byte[] array = bos.toByteArray();
        System.out.println(array.length);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Consider using a SAX parser. SAX is generally a better fit for large files because it does not build the whole document in memory: it reports elements as events while reading and keeps nothing once each event has been handled. That should solve your out-of-memory problem.
By contrast, a DOM (Document Object Model) parser loads the entire document into memory as a node tree.
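As a rough illustration of the SAX approach, the sketch below wraps the parser in an XMLFilterImpl that drops every element whose local name is in a given set, together with everything nested inside it, and lets an identity Transformer write the remaining events straight to the output. The class name SkipElementsFilter and the method copyWithout are made up for this example, and it assumes that matching on the local element name is good enough for your data.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Set;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper: a SAX filter that swallows selected elements and their content.
class SkipElementsFilter extends XMLFilterImpl {
    private final Set<String> namesToSkip;
    private int skipDepth = 0; // > 0 while we are inside a skipped subtree

    SkipElementsFilter(XMLReader parent, Set<String> namesToSkip) {
        super(parent);
        this.namesToSkip = namesToSkip;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (skipDepth > 0 || namesToSkip.contains(localName)) {
            skipDepth++; // entering (or going deeper into) a skipped subtree
            return;
        }
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (skipDepth > 0) {
            skipDepth--; // leaving one level of the skipped subtree
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.characters(ch, start, length);
        }
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
        if (skipDepth == 0) {
            super.ignorableWhitespace(ch, start, length);
        }
    }

    // Copies the XML from 'in' to 'out', leaving out the named elements.
    static void copyWithout(Reader in, Writer out, Set<String> namesToSkip) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        parserFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        XMLReader xmlReader = parserFactory.newSAXParser().getXMLReader();
        SkipElementsFilter filter = new SkipElementsFilter(xmlReader, namesToSkip);
        // The identity transformer serializes the filtered SAX events as XML.
        TransformerFactory.newInstance().newTransformer()
                .transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }
}
```

For the 1 GB case you would pass a buffered FileReader and FileWriter (or streams) rather than building a byte array, for example copyWithout(reader, writer, Set.of("G_SUMMARY_ROWSET", "G_JRNLSOURCE_ROWSET")). Note that this removes every occurrence of the named elements, whereas the DOM code in the question removes only the first one.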

Related

Adding element to XML using javax parser without document modification

I am trying to add elements to an XML document. The elements are added successfully, but the parser also modifies the original XML in other places: for example, it swaps the namespace and Id attributes or deletes duplicate namespace definitions. I need to get exactly the same document (same syntax, whitespace preserved) with only the specific elements added. I would greatly appreciate any suggestions. Here is my code:
public void appendTimestamp(String timestamp, String signedXMLFile, String timestampedXMLFile) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try {
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new File(signedXMLFile));
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList list = (NodeList) xPath.evaluate("//*[local-name()='Signature']/*[local-name()='Object']/*[local-name()='QualifyingProperties']", doc, XPathConstants.NODESET);
        if (list.getLength() != 1) {
            throw new Exception();
        }
        Node node = list.item(0);
        Node unsignedProps = doc.createElement("xades:UnsignedProperties");
        Node unsignedSignatureProps = doc.createElement("xzep:UnsignedSignatureProperties");
        Node timestampNode = doc.createElement("xzep:SignatureTimeStamp");
        timestampNode.appendChild(doc.createTextNode(timestamp));
        unsignedSignatureProps.appendChild(timestampNode);
        unsignedProps.appendChild(unsignedSignatureProps);
        node.appendChild(unsignedProps);
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "no");
        transformer.setOutputProperty(OutputKeys.METHOD, "xml");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        DOMSource source = new DOMSource(doc);
        StringWriter writer = new StringWriter();
        StreamResult stringWriter = new StreamResult(writer);
        transformer.transform(source, stringWriter);
        writer.flush();
        System.out.println(writer.toString());
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The original xml file:
...
<ds:Object Id="objectIdVerificationObject" xmlns:ds="http://www.w3.org/2000/09/xmldsig#">
...
Modified xml file:
...
<ds:Object xmlns:ds="http://www.w3.org/2000/09/xmldsig#" Id="objectIdVerificationObject">
...

If you use the DOM model, the whole XML file is read, represented in memory as a node tree, and then written back out in whatever form the serializer chooses. It is therefore almost impossible to preserve the original formatting: you have no control over it, and details such as attribute order and the whitespace inside tags are not represented in the node tree at all.
Instead, read the original XML piece by piece and write what you read to the new file unchanged; when you reach the right place, add your new content, then continue simply copying the original.
For example, you could use XMLStreamReader and XMLStreamWriter, since they offer low-level operations.
But it would probably be much easier to copy the XML as text, line by line, until you recognize the insertion point, then create the new XML fragment, append it as text, and continue copying.
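A minimal sketch of the copy-as-text idea follows. It assumes the insertion point can be recognized as a literal closing tag that occurs once in the document; the class name TextInserter, the method insertBefore and the tag used in the usage note are illustrative only and are not taken from the question's document.

```java
// Hypothetical helper: splice a fragment into XML text without re-serializing it.
class TextInserter {

    // Returns 'xml' with 'fragment' inserted immediately before the last
    // occurrence of 'closingTag'. Every other character is left untouched,
    // so attribute order, namespace declarations and whitespace are preserved.
    static String insertBefore(String xml, String closingTag, String fragment) {
        int index = xml.lastIndexOf(closingTag);
        if (index < 0) {
            throw new IllegalArgumentException("Closing tag not found: " + closingTag);
        }
        return xml.substring(0, index) + fragment + xml.substring(index);
    }
}
```

With the file content loaded into a String, a call might look like insertBefore(content, "</xades:QualifyingProperties>", newFragment), where the xades prefix on the closing tag is an assumption about how the signed document is written. For a signed document this byte-for-byte preservation matters, because any re-serialization can invalidate the signature.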

How to reduce the file size of a xml file that is created using java?

I have to convert a text file with coordinates into an XML file. The point of converting it is to make the file smaller. How can I reduce the size of my file?
public void writeXML() throws Exception {
    ArrayList<Frame> frameList = new ArrayList<Frame>();
    frameList = readFile();
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder;
    try {
        dBuilder = dbFactory.newDocumentBuilder();
        Document doc = dBuilder.newDocument();
        // append stuff
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        //transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        DOMSource source = new DOMSource(doc);
        StreamResult console = new StreamResult(System.out);
        StreamResult file = new StreamResult(new File("file.xml"));
        //transformer.transform(source, console);
        transformer.transform(source, file);
        System.out.println("DONE");
    } catch (Exception e) {
        e.printStackTrace();
    }
}

You wrote: "But the point of converting the text file into a xml file is so that the file size to be smaller."
That is probably not achievable. XML is less dense than a typical text representation, because properly designed XML adds a significant amount of markup to the file.
A file of coordinates in an appropriately designed text form (e.g. CSV) will take less space than the same coordinates expressed in XML.
If you want files "denser" than a custom text format, consider compressing the text file, or consider using a binary representation instead of text.
If you are fixed on the idea of using XML, then the best way to reduce the file size is to compress it. Because XML contains a lot of redundancy (e.g. the repeated markup), you should get significant compression.
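As a small sketch of the compression suggestion, the helper below gzips XML text with java.util.zip and reads it back. The class name XmlGzip and its methods are invented for this example; in the question's code you could get a similar effect by wrapping a FileOutputStream in a GZIPOutputStream, passing that stream to StreamResult, and closing it after the transform.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip-compress XML text and restore it again.
class XmlGzip {

    // Compresses the UTF-8 bytes of the given XML text.
    static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return buffer.toByteArray();
    }

    // Restores the original XML text from gzip-compressed bytes.
    static String decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

How much you save depends on the data, but repetitive markup such as thousands of identically named coordinate elements usually compresses well.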

create and write file from XML

I have XML describing an article:
<ARTICLE ID="74">
    <ARTICLE_CATEGORY_ID>1</ARTICLE_CATEGORY_ID>
    <ARTICLE_NAME>......</ARTICLE_NAME>
    <ARTICLE_EXTENSION>pdf</ARTICLE_EXTENSION>
    <ARTICLE_BYTE>[B@6d78f375</ARTICLE_BYTE>
    <ARTICLE_DATE>2014-10-11 00:00:00.0</ARTICLE_DATE>
    <ARTICLE_ACTIVE>1</ARTICLE_ACTIVE>
</ARTICLE>
I want to create a file and write the content of ARTICLE_BYTE to it, but I can't. The bytes seem to be stored as a String, I guess, so I don't know how to do this. Thank you for your help.
//EDIT
Sorry, I'm very new to Stack Overflow.
Yes, this XML comes from a program that I've written.
Here is the relevant part of the program:
try {
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    // XML root element name
    Document doc = docBuilder.newDocument();
    Element rootElement = doc.createElement("ARTICLES");
    doc.appendChild(rootElement);
    for (Nesne nesne : userList) {
        // ARTICLE ELEMENT
        Element ARTICLE_ID = doc.createElement("ARTICLE");
        rootElement.appendChild(ARTICLE_ID);
        // ID
        Attr attr = doc.createAttribute("ID");
        attr.setValue(String.valueOf(nesne.getArticleID()));
        ARTICLE_ID.setAttributeNode(attr);
        // ARTICLE_CATEGORY_ID
        Element ARTICLE_CATEGORY_ID = doc.createElement("ARTICLE_CATEGORY_ID");
        ARTICLE_CATEGORY_ID.appendChild(doc.createTextNode(String.valueOf(nesne.getARTICLE_CATEGORY_ID())));
        ARTICLE_ID.appendChild(ARTICLE_CATEGORY_ID);
        // [ARTICLE_NAME]
        Element ARTICLE_NAME = doc.createElement("ARTICLE_NAME");
        ARTICLE_NAME.appendChild(doc.createTextNode(nesne.getARTICLE_NAME()));
        ARTICLE_ID.appendChild(ARTICLE_NAME);
        // [ARTICLE_EXTENSION]
        Element ARTICLE_EXTENSION = doc.createElement("ARTICLE_EXTENSION");
        ARTICLE_EXTENSION.appendChild(doc.createTextNode(nesne.getARTICLE_EXTENSION()));
        ARTICLE_ID.appendChild(ARTICLE_EXTENSION);
        // [ARTICLE_BYTE]
        Element ARTICLE_BYTE = doc.createElement("ARTICLE_BYTE");
        ARTICLE_BYTE.appendChild(doc.createTextNode(nesne.getARTICLE_BYTE().toString()));
        ARTICLE_ID.appendChild(ARTICLE_BYTE);
        // [ARTICLE_DATE]
        Element ARTICLE_DATE = doc.createElement("ARTICLE_DATE");
        ARTICLE_DATE.appendChild(doc.createTextNode(nesne.getARTICLE_DATE()));
        ARTICLE_ID.appendChild(ARTICLE_DATE);
        // [ARTICLE_ACTIVE]
        Element ARTICLE_ACTIVE = doc.createElement("ARTICLE_ACTIVE");
        ARTICLE_ACTIVE.appendChild(doc.createTextNode(String.valueOf(nesne.getArticleActive())));
        ARTICLE_ID.appendChild(ARTICLE_ACTIVE);
    }
    // write the content into the XML file
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    DOMSource source = new DOMSource(doc);
    StreamResult result = new StreamResult(new File("dosyalar.xml"));
    // Output to console for testing
    // StreamResult result = new StreamResult(System.out);
    transformer.transform(source, result);
    System.out.println("File saved!");
} catch (Exception e) {
    e.printStackTrace();
}

Your problem appears to be here:
nesne.getARTICLE_BYTE().toString()
I'm not familiar with your Nesne class, but I can tell that getARTICLE_BYTE() returns a byte array. Calling toString() on an array does not return its contents; it returns only the type and hash code, which is the useless [B@... value you're currently seeing. If the data needs to be stored, you somehow need to store the entire array. Looping over the array and storing the String representation of each byte would not be the most efficient approach and would lead to a very large, unreadable XML file, so perhaps you could store the bytes outside the XML and put a reference to the byte array file in the XML, or store them in a database as a BLOB. Note that this is not something I do much of, so I'm no expert.
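One common alternative, which the answer above does not mention, is to Base64-encode the byte array so that the full content can live in the XML as text and be decoded again when the file is written out. The sketch below uses java.util.Base64 from the JDK; the class name ByteArrayXmlText is invented for this example, and whether embedding large PDFs in XML is acceptable for your file sizes is something you would have to judge.

```java
import java.util.Base64;

// Hypothetical helper: turn binary data into XML-safe text and back.
class ByteArrayXmlText {

    // Encodes the bytes as Base64 text, suitable for doc.createTextNode(...).
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Decodes element text (as read back from the XML) into the original bytes.
    static byte[] decode(String elementText) {
        return Base64.getDecoder().decode(elementText.trim());
    }
}
```

In the question's code this would mean passing ByteArrayXmlText.encode(nesne.getARTICLE_BYTE()) to createTextNode instead of calling toString(), and later writing ByteArrayXmlText.decode(text) to the target file, for example with java.nio.file.Files.write. Base64 text is roughly a third larger than the raw bytes.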

Prevent transformer.transform( source, result ) from escaping special character

I'm updating the nodes and text content of an XML document using the DOM parser. To save the modified DOM document I'm using the transformer.transform method.
Below is the sample code.
String xmlText = "<uc>abcd><name>mine</name>efgh\netg<tag>sd</tag></uc>";
DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
InputSource inStream = new InputSource();
inStream.setCharacterStream(new StringReader(xmlText));
Document document = documentBuilder.parse(inStream);
Node node = document.getDocumentElement();
node.normalize();
NodeList childNodes = node.getChildNodes();
for (int i = 0; i < childNodes.getLength(); i++) {
    if (childNodes.item(i).getNodeType() == Node.TEXT_NODE) {
        System.out.println(childNodes.item(i).getTextContent());
        childNodes.item(i).setTextContent("123>");
    }
}
TransformerFactory tFactory = TransformerFactory.newInstance();
Transformer transformer = tFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
transformer.setOutputProperty(OutputKeys.ENCODING, "US-ASCII");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");
DOMSource source = new DOMSource(document);
OutputStream xml = new ByteArrayOutputStream();
StreamResult result = new StreamResult(xml);
transformer.transform(source, result);
String formattedXml = xml.toString();
System.out.println(formattedXml);
Since my updated document has text content such as ">", the transformer.transform method changes it to &gt;.
Is there a way to get the output without escaping special characters?
I can't use another parser because of some project constraints.
I can't use StringEscapeUtils.unescapeXml() either, because the XML can already contain &gt;. If I used that utility method, any &gt; that was originally present in the XML would also be changed.
So I want a mechanism that will not escape any special character.

The transformer you create with
Transformer transformer = tFactory.newTransformer();
is initialized with a default stylesheet that implements the identity transformation. That means it will simply serialize your DOM to a well-formed XML document. Output escaping is applied automatically where necessary.
If you want more control over the output, and possibly want to generate something that does not adhere to XML document structure, you can use a custom stylesheet that switches the output method to text. This way you control more of the structure, but you can also make more mistakes as far as XML well-formedness is concerned.
More information at
https://docs.oracle.com/en/java/javase/11/docs/api/java.xml/javax/xml/transform/TransformerFactory.html#newTransformer()
https://www.w3.org/TR/xslt20/#element-output
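To make the difference concrete, here is a small sketch that serializes the same DOM twice: once with the default identity transformer and once with a tiny custom stylesheet whose output method is text. The class name TextOutputDemo and the stylesheet are made up for illustration. The stylesheet simply emits the string value of the document, so all markup is dropped and only the unescaped text remains; a real stylesheet would have to write out any tags it wants as literal text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: compare XML output (escaped) with text output (unescaped).
class TextOutputDemo {

    // Illustrative stylesheet: text output method, emits the document's string value.
    static final String TEXT_STYLESHEET =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:output method='text'/>"
          + "<xsl:template match='/'><xsl:value-of select='.'/></xsl:template>"
          + "</xsl:stylesheet>";

    // Parses 'xml' into a DOM and serializes it, either as XML or as plain text.
    static String serialize(String xml, boolean asText) throws Exception {
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer = asText
                ? factory.newTransformer(new StreamSource(new StringReader(TEXT_STYLESHEET)))
                : factory.newTransformer(); // default identity transformation
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(out));
        return out.toString();
    }
}
```

Calling serialize(xml, false) keeps the element structure and applies the usual XML escaping, while serialize(xml, true) returns only the character data with no escaping. This illustrates the trade-off described above: once you switch to text output, producing well-formed XML is entirely your responsibility.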

How do I modify an XML element using StAX in Java

I want to modify my name in an XML CV file, but when I use this statement:
XMLOutputFactory xof = XMLOutputFactory.newInstance();
XMLStreamWriter xtw = null;
xtw = xof.createXMLStreamWriter(new FileWriter("eman.xml"));
all the content of the file is removed and it becomes empty. Basically, I want to open eman.xml for modification without removing its content.

If you want to use StAX to process your XML file, be aware that StAX can only read from or write to an XML stream; it cannot modify a file in place. If you need to modify the XML when a particular StAX event occurs, you can do that part of the processing with DOM.
Here is an example:
XMLInputFactory inputFactory = XMLInputFactory.newInstance();
XMLStreamReader xmlReader = inputFactory.createXMLStreamReader(new FileReader(fileName));
int eventType;
while (xmlReader.hasNext()) {
    eventType = xmlReader.next();
    if (eventType == XMLEvent.START_ELEMENT) {
        QName qName = xmlReader.getName();
        if ("YOURTAG".equals(qName.toString())) {
            // Renamed from "factory" to avoid clashing with the StAX factory above
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = docFactory.newDocumentBuilder();
            File file = new File("YOURXML.xml");
            Document doc = builder.parse(file);
            // make the required processing for your file.
        }
    }
}
xmlReader.close();

The question about reading and writing at the same time with stax is also answered here: How to modify a huge XML file by StAX?
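A rough sketch of the read-and-write-at-the-same-time approach is shown below: events are read from one source and written to a different target, and the text of one element is replaced on the way through. The class name StaxTextReplacer and the method replaceText are invented for this example, and it assumes the target element contains only text and is matched by its local name.

```java
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: copy XML from 'in' to 'out', replacing one element's text.
class StaxTextReplacer {

    static void replaceText(Reader in, Writer out, String elementName, String newText)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();
        boolean insideTarget = false; // true while between the target's start and end tags

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (!insideTarget && event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(elementName)) {
                writer.add(event);                                  // keep the start tag
                writer.add(eventFactory.createCharacters(newText)); // write the new text
                insideTarget = true;
            } else if (insideTarget && event.isEndElement()
                    && event.asEndElement().getName().getLocalPart().equals(elementName)) {
                writer.add(event);                                  // keep the end tag
                insideTarget = false;
            } else if (!insideTarget) {
                writer.add(event);                                  // copy everything else
            }
            // Events inside the target (the old text) are dropped.
        }
        writer.close();
        reader.close();
    }
}
```

The important point for the question is that the output must not be the file that is still being read: opening a FileWriter on eman.xml truncates it immediately. Writing to a temporary file and replacing the original afterwards (for example with java.nio.file.Files.move) avoids that.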
