Parsing large RSS feeds using Rome, running out of memory - java

More specifically, large XML web pages (RSS feeds). I am using the excellent Rome library to parse them, but the page I am currently trying to fetch is really large, and Java runs out of memory before getting the whole document.
How can I split up the page so that I can pass it to XMLReader? Should I just do it myself and pass the feed in parts, adding my own XML to start and finish each part?

First off, learn to set the Java command-line options -Xms and -Xmx to appropriate values; all of the DOM-based parsers eat craploads of memory. Second, look at using a pull parser; it will not have to load the entire XML into a document before processing it.
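Raising the heap is just a matter of passing something like -Xms256m -Xmx1024m to the java command, but the pull-parser route keeps memory flat no matter how big the feed gets. Here is a rough sketch of that idea using plain StAX (not Rome's API); the feed URL is a placeholder, and printing each title simply stands in for whatever per-item processing you need:

    import java.io.InputStream;
    import java.net.URI;
    import java.net.URL;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class FeedItemTitles {
        public static void main(String[] args) throws Exception {
            URL feedUrl = URI.create("http://example.com/huge-feed.xml").toURL(); // placeholder URL
            XMLInputFactory factory = XMLInputFactory.newInstance();
            try (InputStream in = feedUrl.openStream()) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                String currentElement = null;
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        currentElement = reader.getLocalName();
                    } else if (event == XMLStreamConstants.CHARACTERS
                            && "title".equals(currentElement)) {
                        // Handle each title as it streams past; nothing is kept around,
                        // so memory use stays flat regardless of feed size.
                        System.out.println(reader.getText().trim());
                    } else if (event == XMLStreamConstants.END_ELEMENT) {
                        currentElement = null;
                    }
                }
                reader.close();
            }
        }
    }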

Related

Java heap size error in Mirth

I am using Mirth Connect 3.0.3 and I have an .xml file which is almost 85 MB in size and contains some device information. I need to read this .xml file and insert the data into the database (SQL Server).
The problem I am facing is that when I try to read the data, it shows a Java heap size error.
I increased the server memory to 1024 MB and the client memory to 1024 MB,
but it still shows the same error. If I increase the memory further, I am not able to start Mirth Connect.
Any suggestion is appreciated.
Thanks.
Is the XML file comprised of multiple separate sections/pieces of data that would make sense to split up into multiple channel messages? If so, consider using a Batch Adapter. The XML data type has options to split based on element/tag name, the node depth/level, or an XPath query. All of those options currently still require the message to be read into memory in its entirety, but it will still be more memory-efficient than reading the entire XML document in as a single message.
You can also use a JavaScript batch script, in which case you're given a Java BufferedReader and can use the script to read through the file and return one message at a time. In this case, you will not have to read the entire file into memory.
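The batch script itself is Mirth-side JavaScript, but the underlying idea is plain BufferedReader chunking. A generic Java sketch of it, with a hypothetical devices.xml file and a hypothetical </device> closing tag marking the end of each record:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ChunkedXmlReader {
        public static void main(String[] args) throws Exception {
            // devices.xml and the <device> record tag are made-up names for the sketch.
            try (BufferedReader reader = new BufferedReader(new FileReader("devices.xml"))) {
                StringBuilder chunk = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    chunk.append(line).append('\n');
                    if (line.contains("</device>")) {   // end of one record
                        process(chunk.toString());      // e.g. build an insert for SQL Server
                        chunk.setLength(0);             // only one record is ever held in memory
                    }
                }
            }
        }

        private static void process(String xmlFragment) {
            System.out.println("Got a record of " + xmlFragment.length() + " chars");
        }
    }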
Are there large blobs of data in the message that don't need to be manipulated in a transformer? Like, embedded images, etc? If so, consider using an Attachment Handler. That way you can extract that data and store it once, rather than having it copied and stored multiple times throughout the message lifecycle (for Raw / Transformed / Encoded / etc.).

Java heap space error while parsing a large XML file

I want to parse a large XML file (785 MB) and write the data to CSV. I am getting a Java heap space error (out of memory) when I try to parse the file.
I tried increasing the heap size to 1024 MB, but the code could only handle a file of about 50 MB.
Please let me know a solution for parsing a large XML file in Java.
You should use a SAXParser instead of a DOMParser.
The difference is that it doesn't load the complete XML data into memory.
Look at this tutorial: http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
Regards,
Romain.
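To make the SAX suggestion concrete, here is a minimal sketch that streams a large file and writes one CSV row per record; the big.xml file name and the record/name/value element names are assumptions, not anything from the question:

    import java.io.File;
    import java.io.FileWriter;
    import java.io.PrintWriter;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class XmlToCsv {
        public static void main(String[] args) throws Exception {
            try (PrintWriter csv = new PrintWriter(new FileWriter("out.csv"))) {
                SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                parser.parse(new File("big.xml"), new DefaultHandler() {
                    private final StringBuilder text = new StringBuilder();
                    private String name, value;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes attrs) {
                        text.setLength(0); // start collecting this element's text
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) {
                        if ("name".equals(qName)) name = text.toString().trim();
                        else if ("value".equals(qName)) value = text.toString().trim();
                        else if ("record".equals(qName)) csv.println(name + "," + value); // one row per record
                    }
                });
            }
        }
    }

Only the current record's fields are held at any moment, so the 785 MB file never has to fit in the heap.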
The solution here is to use the Streaming API for XML (StAX).
Here is a good tutorial.

Extracting articles from Wiki Dump

I have a huge wiki dump (~50 GB after extracting the tar.bz file), from which I want to extract the individual articles. I am using the wikixmlj library to extract the contents, and it does give the title, text, the categories mentioned at the end, and a few other attributes. But I am more interested in the external links/references associated with each article, which this library doesn't provide any API for.
Is there any elegant and efficient way to extract those, other than parsing the wikitext that we get from the getWikiText() API?
Or is there any other Java library for working with this dump file that gives me the title, content, categories, and the references/external links?
The XML dump contains exactly what the library is offering you: the page text along with some basic metadata. It doesn't contain any metadata about categories or external links.
The way I see it, you have three options:
Use the specific SQL dumps for the data you need, e.g. categorylinks.sql for categories or externallinks.sql for external links. But there is no dump for references (because MediaWiki doesn't track those).
Parse the wikitext from the XML dump (a rough sketch of this is shown after this list). This would have problems with templates.
Use your own instance of MediaWiki to parse the wikitext into HTML and then parse that. This could potentially handle templates too.
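For the second option, pulling bracketed external links out of the text returned by getWikiText() can be done with a regular expression. The pattern below is only an approximation of wikitext's [url label] syntax; bare URLs and links generated by templates will be missed:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ExternalLinkExtractor {
        // Matches bracketed external links such as [https://example.org Some label].
        private static final Pattern EXTERNAL_LINK =
                Pattern.compile("\\[(https?://[^\\s\\]]+)(?:\\s+[^\\]]*)?\\]");

        public static List<String> extract(String wikiText) {
            List<String> links = new ArrayList<>();
            Matcher m = EXTERNAL_LINK.matcher(wikiText);
            while (m.find()) {
                links.add(m.group(1)); // group 1 is the URL without its label
            }
            return links;
        }

        public static void main(String[] args) {
            String sample = "See [https://example.org the docs] and [https://example.com].";
            System.out.println(extract(sample)); // [https://example.org, https://example.com]
        }
    }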
Maybe too late, but this link could help: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/wikiprep.html
Here is an example output of the above program: http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/sample.hgw.xml

Searching for contents in huge number of files in java

We are building a search feature in our application that has to look through more than 100,000 XML files for content.
The data is stored as a huge number of XML files.
Is it a good idea to keep this huge number of XML files and, on each search (by name etc.), traverse through every file for results? It may reduce our application's search performance.
Or what is the best way?
You want Elasticsearch here. It will give you what you need.
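A minimal sketch of that approach, assuming a local Elasticsearch node on port 9200 and using only the JDK 11+ HTTP client rather than any Elasticsearch client library; the documents index name, the field names, and the data/xml directory are placeholders. Each file is indexed once, and searches then hit the index instead of re-reading 100,000 files:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.stream.Stream;

    public class XmlIndexer {
        public static void main(String[] args) throws Exception {
            HttpClient http = HttpClient.newHttpClient();
            try (Stream<Path> files = Files.walk(Path.of("data/xml"))) {
                files.filter(p -> p.toString().endsWith(".xml")).forEach(p -> {
                    try {
                        // Naive "extraction": index the raw XML as one text field. In practice
                        // you would pull out just the fields you actually want to search on.
                        String body = "{\"fileName\":" + json(p.getFileName().toString())
                                + ",\"content\":" + json(Files.readString(p)) + "}";
                        HttpRequest req = HttpRequest.newBuilder()
                                .uri(URI.create("http://localhost:9200/documents/_doc"))
                                .header("Content-Type", "application/json")
                                .POST(HttpRequest.BodyPublishers.ofString(body))
                                .build();
                        http.send(req, HttpResponse.BodyHandlers.ofString());
                    } catch (Exception e) {
                        System.err.println("Failed to index " + p + ": " + e.getMessage());
                    }
                });
            }
        }

        // Minimal JSON string escaping, enough for the sketch.
        private static String json(String s) {
            return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"")
                           .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t") + "\"";
        }
    }

A search by name (or any other indexed field) then becomes a single query against http://localhost:9200/documents/_search, rather than a scan over the files on disk.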

Java: parse HTML using plain String methods?

Is it a good idea? Well, I have used third-party libraries like JSoup before and they work great, but this project is different. Is it worth loading and parsing a whole document when you just want to get one item from it? Some of the HTML pages are simple, too, so I could use String methods for those. The reason I ask is that memory will be an issue, and it also takes some time to load the document. When parsing XML I always use a SAX parser, because it doesn't load everything into memory and it is fast. Could I use the same thing on HTML documents, or is there already something like this out there? A non-DOM, lightweight HTML parser would be great too.
If the HTML is XML-compliant (i.e. it's XHTML), then you can use a standard SAX parser. Here you can find a list of HTML parsers in Java to choose from: http://java-source.net/open-source/html-parsers. HotSax will probably handle all your use cases.
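As a small sketch of the SAX-on-XHTML idea, assuming the page really is well-formed XML; grabbing the <title> element just stands in for whatever single item you are after, and the parse is aborted as soon as it has been seen:

    import java.io.File;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class XhtmlTitleGrabber {
        public static String title(File file) throws Exception {
            StringBuilder title = new StringBuilder();
            try {
                SAXParserFactory.newInstance().newSAXParser().parse(file, new DefaultHandler() {
                    private boolean inTitle;

                    @Override
                    public void startElement(String uri, String local, String qName, Attributes attrs) {
                        inTitle = "title".equalsIgnoreCase(qName);
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (inTitle) title.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String local, String qName) throws SAXException {
                        if ("title".equalsIgnoreCase(qName)) {
                            // We have what we came for; stop so the rest of the page is never read.
                            throw new SAXException("done");
                        }
                    }
                });
            } catch (SAXException stoppedEarly) {
                // Thrown deliberately above once the title is complete.
            }
            return title.toString().trim();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(title(new File("page.xhtml"))); // hypothetical local XHTML file
        }
    }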
