How to write more than 30 MB of data in xml? - java

First of all, sorry if I'm repeating this question, but I couldn't find any relevant solution to my problem.
I'm having difficulty solving the issues below.
1) I have to write between 30 MB and 400 MB of data to an XML file. When I use a 'String' object to build the XML, I get an 'OutOfMemoryError'.
After some research, I learned that using a 'Stream' should resolve this issue, but I'm not sure about it.
2) Once I have constructed the XML, I have to send it to the DMZ server from Android devices. As far as I know, sending a large amount of data over HTTP is difficult in this situation. In this case:
a) Would using FTP help?
b) Would splitting the data into chunks and sending them separately help?
Kindly let me know your suggestions. Thanks in advance.

I would consider zipping the data before sending it over FTP. You could use a ZipOutputStream.
For the OutOfMemoryError, you could consider increasing the heap size.
See: Increase heap size in Java
Can you post the heap size values you tried, your code, and some exception traces?
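As a rough illustration of the ZipOutputStream suggestion, here is a minimal sketch that compresses a stream through a fixed-size buffer, so the full payload never has to sit in memory. The class name XmlZipper, the entry name and the sample payload are assumptions made for this example, not part of the original answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class XmlZipper {

    /** Streams everything from in into a single zip entry written to out. */
    static void zip(InputStream in, OutputStream out, String entryName) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(out)) {
            zos.putNextEntry(new ZipEntry(entryName));
            byte[] buffer = new byte[8192]; // only this much data is held at a time
            int n;
            while ((n = in.read(buffer)) != -1) {
                zos.write(buffer, 0, n);
            }
            zos.closeEntry();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample payload; in practice "in" would be a FileInputStream over the XML file.
        StringBuilder sb = new StringBuilder("<records>");
        for (int i = 0; i < 10_000; i++) {
            sb.append("<record>value</record>");
        }
        sb.append("</records>");
        byte[] xml = sb.toString().getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        zip(new ByteArrayInputStream(xml), zipped, "data.xml");
        System.out.println(xml.length + " bytes -> " + zipped.size() + " bytes zipped");
    }
}
```

How much this saves depends on the data; repetitive XML markup usually compresses well.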

Use StAX or SAX. These can create XML of any size because they write the XML parts they generate to an OutputStream on the fly.
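A minimal sketch of the StAX approach, assuming a made-up document shape (a records root with record children); the class name LargeXmlWriter is also hypothetical. Each element goes to the OutputStream as it is produced, so memory use does not grow with the number of records.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

final class LargeXmlWriter {

    /** Writes count records straight to out; no full document is ever built in memory. */
    static void writeRecords(OutputStream out, int count) throws XMLStreamException {
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        try {
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("records");          // hypothetical root element
            for (int i = 0; i < count; i++) {
                writer.writeStartElement("record");       // hypothetical record element
                writer.writeAttribute("id", Integer.toString(i));
                writer.writeCharacters("value " + i);     // special characters are escaped for us
                writer.writeEndElement();
            }
            writer.writeEndElement();
            writer.writeEndDocument();
            writer.flush();
        } finally {
            writer.close(); // closes the writer only, not the underlying stream
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // A real run would pass a BufferedOutputStream over a file or a socket instead.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeRecords(out, 3);
        System.out.println(out.size() + " bytes written");
    }
}
```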

What you should do is:
First, use an XML API to read and write the data in XML format; it could be SAX or DOM (note that DOM keeps the whole document in memory, so SAX is the safer choice for large data). If the data is huge, consider CSV instead; it takes less space because you do not have to store XML tags.
Second, when creating the output, split it into several small files.
Third, when sending over the network, make sure everything is zipped.
And for God's sake, don't eat up the user's mobile data cap with this design. Warn the user about the file size and suggest using a Wi-Fi network.

Related

Read REST api response hosting very large data

I am calling a REST API endpoint that returns a very large amount of data. There is so much data that my Chrome tab crashes (it displays the data for a short time, then keeps loading more until the tab crashes). Even Postman fails to get the data and only returns a 200 OK status without displaying any response body.
I'm trying to write a java program to consume the response from the API. Is there a way to consume the response without using a lot of memory?
Please let me know if the question is not clear. Thank you !!
One possibility is to use a streaming JSON parser such as the Jackson Streaming API (https://github.com/FasterXML/jackson-docs/wiki/JacksonStreamingApi); for example code, see https://javarevisited.blogspot.com/2015/03/parsing-large-json-files-using-jackson.html
For JS there is https://github.com/DonutEspresso/big-json
If the data is really that large, it is better to split the task:
Download the full data to disk with an ordinary HTTP client
Process it in bulk using a streaming approach, similar to SAX parsing for XML:
JAVA - Best approach to parse huge (extra large) JSON file
With this split, you do not have to deal with network errors during processing, and the data stays consistent.
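The download step of that split could look roughly like the sketch below. It only covers spooling the response body to disk; the class name ResponseSpooler and the endpoint passed on the command line are assumptions, and the later processing step would still need a streaming parser such as the one mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class ResponseSpooler {

    /** Copies the body to target without holding it all in memory; returns the bytes written. */
    static long saveToDisk(InputStream body, Path target) throws IOException {
        return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical endpoint URL given as the first argument.
        Path target = Files.createTempFile("response", ".json");
        try (InputStream body = URI.create(args[0]).toURL().openStream()) {
            System.out.println(saveToDisk(body, target) + " bytes saved to " + target);
        }
    }
}
```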

Simplest format to read/write huge files

I need to write huge files (more than 1 million lines) and send them to a different machine, where I need to read them with a Java BufferedReader, one line at a time.
I was using indented JSON, but it turned out not to be very handy:
it requires too much code, and that consumes extra RAM/CPU.
I'm looking for something that looks like this:
client:id="1" name="jack" adress="House N°1\nCity N°3 \n Country 1" age="20"
client:id="2" name="alice" adress="House N°2\nCity N°5 \n Country 2" age="30"
vihecul:id="1" model="ford" hp="250" fuel="diesel"
vihecul:id="2" model="nisan" hp="190" fuel="diesel"
This way I can read the objects one at a time.
I know about url.encode & base64, but I'm trying to keep shorter readable lines.
So any suggestions please!
For huge files, textual data formats, especially markup formats like JSON, YAML, or XML, are not a very good solution.
I suggest using a universal binary format, like Google Protocol Buffers or ASN.1.
Google Protocol Buffers is much easier to get started with.
Of course, if you only need Java-to-Java data transfer, you can use Java's built-in serialization.
What about reading/writing files in binary format using DataInputStream and DataOutputStream?
Of course, your data must have a fixed structure, but in return you get smaller files and faster reading and writing.
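A minimal sketch of the DataOutputStream/DataInputStream idea, assuming a fixed record layout (int id, name, address, int age) loosely modelled on the client lines in the question. The class and method names are made up for illustration. Note that writeUTF stores embedded newlines as-is but is limited to 65535 encoded bytes per string.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

final class ClientFile {

    static final class Client {
        final int id;
        final String name;
        final String address;
        final int age;

        Client(int id, String name, String address, int age) {
            this.id = id;
            this.name = name;
            this.address = address;
            this.age = age;
        }
    }

    /** Writes one record in the fixed field order: id, name, address, age. */
    static void writeClient(DataOutputStream out, Client c) throws IOException {
        out.writeInt(c.id);
        out.writeUTF(c.name);
        out.writeUTF(c.address);
        out.writeInt(c.age);
    }

    /** Reads one record; throws EOFException when no record is left. */
    static Client readClient(DataInputStream in) throws IOException {
        int id = in.readInt();
        String name = in.readUTF();
        String address = in.readUTF();
        int age = in.readInt();
        return new Client(id, name, address, age);
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("clients", ".bin");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            writeClient(out, new Client(1, "jack", "House 1\nCity 3", 20));
            writeClient(out, new Client(2, "alice", "House 2\nCity 5", 30));
        }
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            while (true) { // one record in memory at a time
                Client c = readClient(in);
                System.out.println(c.id + " " + c.name);
            }
        } catch (EOFException endOfFile) {
            // Reaching the end of the file is the normal way out of the loop.
        }
    }
}
```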

What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names, and country codes for sign-ups, LBS, etc.
I figured that putting all this information in the Datastore is rather pointless: it will be read very frequently and would eat up my Datastore quota for no reason, especially since these lists aren't going to change.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload Parse-able file - Favorable option as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest, and least resource hungry method.
Valid options:
TXT file, tab delimited
XML file Static
Java Class with Tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this would affect memory beyond reasonable limits? On the other hand, with a text file, every time I need a record the loop will go through a few thousand lines until it finds it. Reading line by line means no memory issues, but it is incredibly slow. I have some experience parsing an Excel file in a Java servlet: it took something like 20 seconds to parse 250 records, so at a larger scale the response WILL time out (no doubt about it). Is XML anything like Excel in this respect?
Thank you very much guys !! Please provide opinions, all and anything is appreciated !
The easiest and fastest way would be to ship the file as a static resource under the WEB-INF folder and have a context listener load it into memory on application startup.
In memory, it should be a Map keyed by whatever you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, are not worth optimizing if you need to access them many times.
The static file should be plain text or CSV; these are read and parsed most efficiently. There is no need for XML, as parsing it would be slower.
If the list is really big, you can break it up into multiple smaller files and parse each one only when it is required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, by the first few characters of the name).
You could also build this Map in memory once, serialize it to a binary file, and include that binary file as a static resource. That way you only have to deserialize the Map, with no need to parse a text file and build the objects yourself.
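The load-once-into-a-Map idea can be sketched as follows. The two-column layout (code, then name) is an assumption for illustration and is not the actual GeoNames column layout. In a servlet container the load call would typically be made from a context listener at startup, which is left out here to keep the example on the plain JDK.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

final class CountryLookup {

    /** Parses tab-delimited lines of the assumed form code, TAB, name into a Map keyed by code. */
    static Map<String, String> load(BufferedReader reader) throws IOException {
        Map<String, String> nameByCode = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comment lines
            }
            String[] cols = line.split("\t", -1);
            if (cols.length >= 2) {
                nameByCode.put(cols[0], cols[1]);
            }
        }
        return nameByCode;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data; a real application would open the resource file instead.
        String sample = "# code\tname\nFR\tFrance\nDE\tGermany\n";
        Map<String, String> countries = load(new BufferedReader(new StringReader(sample)));
        System.out.println(countries.get("FR")); // lookups no longer touch the file
    }
}
```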
Improvements on the data file
An alternative to a text/CSV file or a serialized Map would be a binary data file in your own custom format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantages that the file could be much smaller (compared to plain text / CSV / serialized Map) and loading it would be much faster (because DataInputStream does not parse numbers from text, for example; it reads the bytes of a number directly).
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
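For option (a), a minimal sketch using the DOM and XPath APIs that ship with the JDK might look like this. The document shape (countries/country with code and name attributes) and the class name CountryXml are assumptions for illustration. The expression is built by plain concatenation, so the code passed in must not contain quote characters.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

final class CountryXml {

    private final Document doc; // parsed once and kept in memory
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    CountryXml(String xml) throws Exception {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the name attribute for the given code, or an empty string if there is no match. */
    String nameForCode(String code) throws Exception {
        return xpath.evaluate("/countries/country[@code='" + code + "']/@name", doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document.
        CountryXml countries = new CountryXml(
                "<countries><country code='FR' name='France'/>"
                        + "<country code='DE' name='Germany'/></countries>");
        System.out.println(countries.nameForCode("DE"));
    }
}
```

Other queries only need a different expression, which is the flexibility argument for (a); a HashMap as in (b) would answer only the one lookup it was keyed for.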

Rest calls-Large amount of data between calls

We are using REST with Jersey. There are a few scenarios where the server (WAS 8.5) sends a large amount of data to the client, which is an RCP application. In some cases the data is so large (150 MB in XML format) that the client gets an OutOfMemoryError.
I have the following questions:
How much does the size increase when a Java object is converted to XML?
How can we send a large Java object to the client and still use REST calls?
1) This is tough to answer without seeing the XML schema. I've seen well-designed schemas that result in tight, lean XML, and others that are a mess and very bloated. To test it, write some test code that serializes your Java objects to a byte[] and compare its size to the XML payload you currently produce.
2) It might be worth looking into a chunking process; 150 MB is pretty large for a single payload. Are you already using GZIP compression? It may also be worth looking at Fast Infoset, a binary encoding for XML that generally helps reduce the size of an XML document.
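The size test from point 1 and the GZIP question from point 2 can be checked with a small measurement helper along these lines. This is a sketch under assumptions: the class name PayloadSize, the sample list and the hand-built XML are made up, and a real comparison would use your own objects and the XML your service actually produces.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.zip.GZIPOutputStream;

final class PayloadSize {

    /** Size in bytes of the object under standard Java serialization. */
    static int serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
            oos.writeObject(obj);
        }
        return bytes.size();
    }

    /** Size in bytes of the payload after GZIP compression. */
    static int gzippedSize(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(payload);
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data standing in for the real domain objects and XML.
        ArrayList<String> items = new ArrayList<>();
        StringBuilder xml = new StringBuilder("<items>");
        for (int i = 0; i < 1000; i++) {
            items.add("value " + i);
            xml.append("<item>value ").append(i).append("</item>");
        }
        xml.append("</items>");
        byte[] xmlBytes = xml.toString().getBytes(StandardCharsets.UTF_8);

        System.out.println("serialized object: " + serializedSize(items) + " bytes");
        System.out.println("XML:               " + xmlBytes.length + " bytes");
        System.out.println("XML gzipped:       " + gzippedSize(xmlBytes) + " bytes");
    }
}
```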

Java XML Parser for huge files

I need an XML parser to parse a file that is approximately 1.8 GB,
so the parser should not load the whole file into memory.
Any suggestions?
Aside from the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
Use a SAX based parser that presents you with the contents of the document in a stream of events.
StAX API is easier to deal with compared to SAX. Here is a short tutorial
Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.
As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.
Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control, and being event-driven makes sense. The API is a little hard to get a grip on, and you have to pay attention to some things, like when the characters() method is called, but the basic idea is that you write a content handler that gets called when the start and end of each XML element are read. So you can keep track of the current XPath-like path in the document, identify which paths hold the data you're interested in, and identify which path marks the end of a chunk that you want to save, hand off, or otherwise process.
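A content handler of the kind described here might be sketched as follows. It tracks the current element path, buffers text because characters() can be called several times for a single text node, and hands each value at a chosen path to a callback. The class name, the path syntax and the sample document are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

final class PathTrackingHandler extends DefaultHandler {

    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    private final String targetPath;
    private final Consumer<String> sink;

    PathTrackingHandler(String targetPath, Consumer<String> sink) {
        this.targetPath = targetPath;
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may arrive in several pieces
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String current = "/" + String.join("/", path);
        if (current.equals(targetPath)) {
            sink.accept(text.toString().trim()); // hand off one chunk, keep nothing else
        }
        path.removeLast();
        text.setLength(0);
    }

    /** Streams in through a SAX parser and passes each value found at targetPath to sink. */
    static void extract(InputStream in, String targetPath, Consumer<String> sink) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(in, new PathTrackingHandler(targetPath, sink));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample; a real run would pass a FileInputStream over the large file.
        byte[] xml = "<a><b>x</b><c>skip</c><b>y</b></a>".getBytes(StandardCharsets.UTF_8);
        extract(new ByteArrayInputStream(xml), "/a/b", System.out::println);
    }
}
```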
Use almost any SAX Parser to stream the file a bit at a time.
I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
First, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4 GB and I had an 8 GB machine, so I figured maybe 3 GB of the file was just text, and java.lang.String would probably need 6 GB for that text because it uses UTF-16.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file
+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit on the size of the files it can process.
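The pull-style loop mentioned here can be sketched with the JDK's XMLStreamReader. The example just counts elements with a given name; the class name and the sample document are assumptions, and real code would do its own processing inside the loop.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxElementCounter {

    /** Counts start tags with the given local name, reading the stream one event at a time. */
    static int countElements(InputStream in, String name) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE); // no DTD processing needed here
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            int count = 0;
            while (reader.hasNext()) {
                int event = reader.next(); // the caller pulls events; no callbacks involved
                if (event == XMLStreamConstants.START_ELEMENT && reader.getLocalName().equals(name)) {
                    count++;
                }
            }
            return count;
        } finally {
            reader.close(); // does not close the underlying stream
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Hypothetical sample; a real run would pass a FileInputStream over the large file.
        byte[] xml = "<a><b>x</b><c/><b>y</b></a>".getBytes(StandardCharsets.UTF_8);
        System.out.println(countElements(new ByteArrayInputStream(xml), "b"));
    }
}
```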
