What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather wasteful: it will be read quite frequently and would eat up my Datastore quota for no reason, especially since these lists aren't going to change.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload Parse-able file - Favorable option as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest, and least resource hungry method.
Valid options:
TXT file, tab delimited
XML file Static
Java Class with Tons of enums
I know that importing the country files as Java enums and going through their values will be very fast, but do you think this is going to affect memory beyond reasonable limits? On the other hand, with a text file, every time I need to access a record the loop will go through a few thousand lines until it finds the required one. Reading line by line means no memory issues, but it is incredibly slow. I have had some experience with parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at large scale the response WILL time out (no doubt about it). So is XML anything like Excel?
Thank you very much guys !! Please provide opinions, all and anything is appreciated !

The easiest and fastest way would be to ship the file as a static resource under the WEB-INF folder and have a context listener load it into memory on application startup.
In memory, it should be a Map keyed by whatever you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, are not worth optimizing if you need to access them many times.
The static file should be plain text or CSV, since these are read and parsed most efficiently. There is no need for XML, as parsing it would be slower.
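A minimal sketch of the "parse once, look up from a Map" idea, assuming a two-column "code<TAB>name" layout. That layout and the class name CountryLookup are made up for illustration; the real GeoNames files have more columns.

```java
import java.io.BufferedReader;
import java.util.HashMap;
import java.util.Map;

/** Sketch: parse tab-delimited lines once, then answer lookups from a HashMap. */
class CountryLookup {
    private final Map<String, String> nameByCode = new HashMap<>();

    /** Reads "code<TAB>name" lines (assumed layout); blank or malformed lines are skipped. */
    CountryLookup(BufferedReader reader) {
        reader.lines().forEach(line -> {
            String[] fields = line.split("\t", -1);
            if (fields.length >= 2 && !fields[0].isEmpty()) {
                nameByCode.put(fields[0], fields[1]);
            }
        });
    }

    /** Returns the name for a code, or null if the code is unknown. */
    String nameFor(String code) {
        return nameByCode.get(code);
    }
}
```

In a servlet container the reader would typically wrap the stream returned by ServletContext.getResourceAsStream(...) in an InputStreamReader with UTF-8, and the lookup object would be built once in the listener's contextInitialized method and stored as a context attribute.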
If the list is really big, you can break it up into multiple smaller files and parse each one only when it is required. A reasonable, easy partitioning would be by country, but any other partitioning would work (for example, by the first few characters of the name).
You could also consider building this Map in memory once, serializing it to a binary file, and including that binary file as a static resource. That way you only have to deserialize the Map; there is no need to parse a text file and build the objects yourself.
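The serialized-Map idea could look roughly like this. It is a sketch using standard Java serialization to and from a byte array; the class name MapSnapshot is hypothetical, and in practice the bytes would be written to a file at build time and read back from the resource stream at startup.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

/** Sketch: build the Map once offline, then only deserialize it at startup. */
class MapSnapshot {
    /** Done once, e.g. in a small build-time tool. */
    static byte[] serialize(HashMap<String, String> map) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(map);
        }
        return bytes.toByteArray();
    }

    /** Done at application startup; no text parsing involved. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> deserialize(byte[] data)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

Only deserialize data you produced yourself; Java deserialization of untrusted input is unsafe.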
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map would be a binary data file in your own custom format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantages that the file could be much smaller (compared to plain text / CSV / serialized Map), and loading it would be much faster (because DataInputStream does not parse numbers from text, for example; it reads the bytes of a number directly).
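As a rough sketch of such a custom format: a record count followed by (name, population) pairs. The field choice and the class name CityFile are assumptions for illustration only.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

/** Sketch of a home-made binary format: int count, then (UTF name, int population) pairs. */
class CityFile {
    static byte[] write(String[] names, int[] populations) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(names.length);
            for (int i = 0; i < names.length; i++) {
                out.writeUTF(names[i]);       // 2-byte length prefix + modified UTF-8
                out.writeInt(populations[i]); // 4 raw bytes, no text parsing on read
            }
        }
        return bytes.toByteArray();
    }

    static Map<String, Integer> read(InputStream source) throws IOException {
        Map<String, Integer> result = new HashMap<>();
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String name = in.readUTF();
                int population = in.readInt();
                result.put(name, population);
            }
        }
        return result;
    }
}
```

Note that writeUTF limits each string to 65535 encoded bytes, which is fine for names but not for arbitrary text.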

Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
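For option (a), a small sketch using the DOM and XPath APIs that ship with the JDK. The element and attribute names in the usage note are invented; the point is only that the query is supplied at run time instead of being hard-coded.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Sketch: parse the XML once, keep the tree in memory, query it with XPath. */
class XmlCountries {
    private final Document doc;
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    /** The parsing cost is paid here, once. */
    XmlCountries(String xml) throws ParserConfigurationException, SAXException, IOException {
        doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Evaluates an arbitrary XPath expression and returns its string value. */
    String query(String expression) throws XPathExpressionException {
        return xpath.evaluate(expression, doc);
    }
}
```

With a document such as `<countries><country code='FR'><name>France</name></country></countries>`, calling query("/countries/country[@code='FR']/name") returns the text of the matching name element. For option (b) you would instead walk the tree once and copy the values into a HashMap.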

Related

Simplest format to read/write huge files

I need to write huge files ( more than 1 million lines) and send the file to a different machine where I need to read it with a Java BufferedReader, one line at a time.
I was using indented JSON but it turned out to be not very handy: it requires too much coding and consumes extra RAM/CPU.
I'm looking for something that looks like this:
client:id="1" name="jack" adress="House N°1\nCity N°3 \n Country 1" age="20"
client:id="2" name="alice" adress="House N°2\nCity N°5 \n Country 2" age="30"
vihecul:id="1" model="ford" hp="250" fuel="diesel"
vihecul:id="2" model="nisan" hp="190" fuel="diesel"
This way I can read the objects one at a time.
I know about url.encode & base64, but I'm trying to keep shorter readable lines.
So any suggestions please!

With huge files, any textual data format, especially a markup format like JSON, YAML or XML, is not a very nice solution.
I can suggest using a universal binary format, like Google Protocol Buffers or ASN.1.
Google Protocol Buffers is much easier to get started with.
Of course, if you just need Java-to-Java data transfer, you can use Java's out-of-the-box serialization.

What about reading/writing files in binary format using DataInputStream and DataOutputStream?
Of course, your data must have fixed structure, but as a benefit you'll get smaller file sizes and faster reading/writing.

Best file format regarding standard string and integer data?

For my project, I need to store info about protocols (the data sent (most likely integers) and in the order it's sent) and info that might be formatted something like this:
'ID' 'STRING' 'ADDITIONAL INTEGER DATA'
This info will be read by a Java program and stored in memory for processing, but I don't know what would be the most sensible format to store this data in?
EDIT: Here's some extra information:
1) I will be using this data in a game server.
2) Since it is a game server, speed is not the primary concern, since this data will primarily be read and used during startup, which shouldn't occur very often.
3)Memory consumption I would like to keep at a minimum, however.
4)The second data "example" will be used as a "dictionary" to look up names of specific in-game items, their stats and other integer data (and therefore might become very large, unlike the first data containing the protocol information, where each file will only note small protocol bites, like a login protocol for instance).
5)And yes, I would like the data to be "human-editable".
EDIT 2: Here's the choices that I've made:
JSON - For the protocol descriptions
CSV - For the dictionaries

There are many factors that could come to weigh--here are things that might help you figure this out:
1) Speed/memory usage: If the data needs to load very quickly or is very large, you'll probably want to consider rolling your own binary format.
2) Portability/compatibility: Balanced against #1 is the consideration that you might want to use the data elsewhere, with programs that won't read a custom binary format. In this case, your heavy hitters are probably going to be CSV, dBase, XML, and my personal favorite, JSON.
3) Simplicity: Delimited formats like CSV are easy to read, write, and edit by hand. Either use double-quoting with proper escaping or choose a delimiter that will not appear in the data.
If you could post more info about your situation and how important these factors are, we might be able to guide you further.

How about XML, JSON or CSV ?

I've written a similar protocol-specification using XML. (Available here.)
I think it is a good match, since it captures the hierarchical nature of specifying messages / network packages / fields etc. The order of fields is well defined and so on.
I even wrote a code-generator that generated the message sending / receiving classes with methods for each message type in XSLT.
The only drawback as I see it is the verbosity. If you have a really simple structure of the specification, I would suggest you use some simple home-brewed format and write a parser for it using a parser-generator of your choice.

In addition to the formats suggested by others here (CSV, XML, JSON, etc.) you might consider storing the info in a Java properties file. (See the java.util.Properties class.) The code is already there for you, so all you have to figure out is the properties names (or name prefixes) you want to use.
The Properties class also provides for storing/loading properties in a simple XML format.
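A sketch of the properties-file approach for the item dictionary. The key scheme "item.<id>.<field>" and the class name ItemDictionary are assumptions; any consistent naming would do.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Properties;

/** Sketch: an item dictionary backed by java.util.Properties. */
class ItemDictionary {
    private final Properties props = new Properties();

    /** Loads "key=value" lines, e.g. "item.1.name=Iron Sword". */
    ItemDictionary(Reader source) throws IOException {
        props.load(source);
    }

    /** Returns the item's name, or null if there is no such item. */
    String name(int id) {
        return props.getProperty("item." + id + ".name");
    }

    /** Returns an integer stat, or defaultValue if the key is missing. */
    int stat(int id, String stat, int defaultValue) {
        String raw = props.getProperty("item." + id + "." + stat);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }
}
```

Properties stores every key and value as a String, so for a very large dictionary this is convenient and human-editable but not the most memory-frugal choice.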

read/write to a large size file in java

I have a binary file with the following format:
[N bytes identifier & record length] [n1 bytes data]
[N bytes identifier & record length] [n2 bytes data]
[N bytes identifier & record length] [n3 bytes data]
As you see, I have records with different lengths. Each record has N fixed bytes which contain an id and the length of the data in the record.
This file is very big and can contain 3 million records.
I want to open this file in an application and let the user browse and edit the records
(insert / update / delete records).
My initial plan is to create an index file from the original file and, for each record, keep the next and previous record addresses so I can navigate forward and backward easily (some sort of linked list, but in a file rather than in memory).
Is there a Java library to help me implement this requirement?
Any recommendation or experience that you think is useful?
----------------- EDIT ----------------------------------------------
Thanks for the guidance and suggestions.
Some more info:
The original file and its format are out of my control (it's a third-party file) and I can't change the file format. But I have to read it, let the user navigate over the records and edit some of them (insert a new record / update an existing record / delete a record), and at the end save it back in the original file format.
Do you still recommend a database instead of a normal index file?
----------------- SECOND EDIT ----------------------------------------------
Record size in update mode is fixed. That is, an updated (edited) record has the same length as the original record, unless the user deletes the record and creates another record with a different format.
Many Thanks

Seriously, you should NOT be using a binary file for this. You should use a database.
The problems with trying to implement this as a regular file stem from the fact that operating systems do not allow you to insert extra bytes into the middle of an existing file. So if you need to insert a record (anywhere but the end), update a record (with a different size) or remove a record, you would need to:
rewrite other records (after the insertion/update/deletion point) to make or reclaim space, or
implement some kind of free space management within the file.
All of this is complicated and / or expensive.
Fortunately, there is a class of software that implements this kind of thing. It is called database software. There are a wide range of options, ranging from using a full-scale RDBMS to light-weight solutions like BerkeleyDB files.
In response to your 1st and 2nd edits, a database will still be simpler.
However, here's an alternative that might perform better for this use-case than using a DB... without doing complicated free-space management.
1. Read the file and build an in-memory index that maps ids to file locations.
2. Create a second file to hold new and updated records.
3. Perform the record adds/updates/deletes:
   - An addition is handled by writing the new record to the end of the second file, and adding an index entry for it.
   - An update is handled by writing the updated record to the end of the second file, and changing the existing index entry to point to it.
   - A delete is handled by deleting the index entry for the record's key.
4. Compact the file as follows:
   4.1. Create a new file.
   4.2. Read each record in the old file in order, and check the index for the record's key. If the entry still points to the location of the record, copy the record to the new file. Otherwise skip it.
   4.3. Repeat step 4.2 for the second file.
   4.4. If we completed all of the above successfully, delete the old file and second file.
Note this relies on being able to keep the index in memory. If that is not feasible, then the implementation is going to be more complicated ... and more like a database.
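The scheme above could be sketched like this. It assumes a record header of a 4-byte id followed by a 4-byte data length (the question only says "N bytes"), and it assumes the second file starts out empty. The class and method names are hypothetical, and there is no crash safety.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

/** Sketch: original file is never modified; adds/updates are appended to a second file. */
class OverlayStore {
    /** Where the live copy of a record starts. */
    private static final class Location {
        final RandomAccessFile file;
        final long offset;
        Location(RandomAccessFile file, long offset) { this.file = file; this.offset = offset; }
    }

    private static final int HEADER = 8; // assumed: int id + int length
    private final RandomAccessFile original;
    private final RandomAccessFile second;
    private final Map<Integer, Location> index = new HashMap<>();

    OverlayStore(RandomAccessFile original, RandomAccessFile second) throws IOException {
        this.original = original;
        this.second = second;
        // Scan the original once: read each header, then seek past the data.
        long pos = 0;
        while (pos < original.length()) {
            original.seek(pos);
            int id = original.readInt();
            int length = original.readInt();
            index.put(id, new Location(original, pos));
            pos += HEADER + length;
        }
    }

    /** Add or update: append to the second file and repoint the index. */
    void put(int id, byte[] data) throws IOException {
        long pos = second.length();
        second.seek(pos);
        second.writeInt(id);
        second.writeInt(data.length);
        second.write(data);
        index.put(id, new Location(second, pos));
    }

    /** Delete: just drop the index entry. */
    void delete(int id) {
        index.remove(id);
    }

    /** Returns the live data for an id, or null if it does not exist. */
    byte[] get(int id) throws IOException {
        Location loc = index.get(id);
        if (loc == null) return null;
        loc.file.seek(loc.offset + 4); // skip the id, land on the length
        byte[] data = new byte[loc.file.readInt()];
        loc.file.readFully(data);
        return data;
    }

    /** Compaction: copy only records the index still points at, old file first. */
    void compactInto(RandomAccessFile target) throws IOException {
        copyLive(original, target);
        copyLive(second, target);
    }

    private void copyLive(RandomAccessFile source, RandomAccessFile target) throws IOException {
        long pos = 0;
        while (pos < source.length()) {
            source.seek(pos);
            int id = source.readInt();
            int length = source.readInt();
            Location loc = index.get(id);
            if (loc != null && loc.file == source && loc.offset == pos) {
                byte[] data = new byte[length];
                source.readFully(data);
                target.writeInt(id);
                target.writeInt(length);
                target.write(data);
            }
            pos += HEADER + length;
        }
    }
}
```

After compactInto succeeds, the old files can be replaced by the new one and the store reopened on it. The index holds one small object per record, so a few million records should still fit in memory on a desktop JVM, but that is worth measuring.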

Having a data file and an index file would be the general base idea for such an implementation, but you'd pretty much find yourself dealing with data fragmentation upon repeated data updates/deletion, etc. This kind of project, in itself, should be a separate project and should not be part of your main application. However, essentially, a database is what you need as it is specifically designed for such operations and use cases and will also allow you to search, sort, and extend (alter) your data structure without having to refactor an in-house (custom) solution.
May I suggest you download Apache Derby and create a local embedded database (Derby does it for you when you create a new embedded connection at run-time). It will not only be faster than anything you'll write yourself, but will make your application easier to maintain.
Apache Derby is a single jar file that you can simply include and distribute with your project (check the license if any legal issue may apply in your app). There is no need for a database server or third party software; it's all pure Java.
Bottom line is that it all depends on how large your application is, whether you need to share the data across many clients, whether speed is a critical aspect of your app, etc.
For a stand-alone, single-user project, I recommend Apache Derby. For an n-tier application, you might want to look into MySQL, PostgreSQL or (hrm) even Oracle. Using ready-made and tested solutions is not only smart, but will cut down your development time (and maintenance efforts).
Cheers.

Generally you are better off letting a library or database do the work for you.
You may not want to have an SQL database and there are plenty of simple databases which don't use SQL. http://nosql-database.org/ lists 122 of them.
At a minimum, if you are going to write this I suggest you read the source for one of these databases to see how they work.
Depending on the size of the records, 3 million isn't that much and I would suggest you keep as much in memory as possible.
The first problem you are likely to have is ensuring the data is consistent and recovering it when corruption occurs. The second problem is dealing with fragmentation efficiently (something the brightest minds working on the GC deal with). The third problem is likely to be maintaining the index transactionally with the source data to ensure there are no inconsistencies.
While this may appear simple at first, there are significant complexities in making sure the data is reliable, maintainable and can be accessed efficiently. This is why most developers use an existing database/datastore library and concentrate on the features which are unique to their application.

(Note: My answer is about the problem in general, not considering any Java libraries or - like the other answers also proposed - using a database (library), which might be better than reinventing the wheel)
The idea to create an index is good and will be very helpful performance-wise (although you wrote "index file", I think it should be kept in memory). Generating the index should be quite fast if you read the ID and record length for each entry and then just skip the data with a file seek.
You should also think about the edit functionality. Inserting and deleting in particular can be very slow on such a big file if you do it wrong (e.g. deleting and then moving all the following entries to close the gap).
The best option would be to only mark deleted entries as deleted. When inserting, you can overwrite one of those or append to the end of the file.

Insert / Update / Delete records
Inserting (rather than merely appending) and deleting records to a file is expensive because you have to move all the following content of the file to create space for the new record or to remove the space it used. Updating is similarly expensive if the update changes the length of the record (you say they are variable length).
The file format you propose is fundamentally unsuitable for the kinds of operations you want to perform. Others have suggested using a data-base. If you don't want to go that far, adding an index file (as you suggest) is the way to go. I recommend making the index records all the same length.

As others have stated, a database would seem a better solution. The following are Java SQL DBs that could be used: H2, Derby or HSQLDB.
If you want to use an index file, look at Berkeley DB or a NoSQL store.
If there is some reason for using a file, look at JRecord. It has:
Several classes for reading/writing files with variable-length binary records (they were written for Cobol VB files). Any of the Mainframe / Fujitsu / Open Cobol VB file structures should do the job.
An editor for editing JRecord files. The latest version of the editor can handle large files (it uses compression / a spill file). The editor suffers from having to download the whole file, and only one user can edit the file at a time.
The JRecord solution will only work if:
There is a limited number of users (preferably one), all in the one location
There is fast infrastructure

Java XML Parser for huge files

I need a xml parser to parse a file that is approximately 1.8 gb.
So the parser should not load all the file to memory.
Any suggestions?

Aside from the recommended SAX parsing, you could use the StAX API (kind of a SAX evolution), included in the JDK (package javax.xml.stream).
StAX Project Home: http://stax.codehaus.org/Home
Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html

Use a SAX based parser that presents you with the contents of the document in a stream of events.

StAX API is easier to deal with compared to SAX. Here is a short tutorial

Try VTD-XML. I've found it to be more performant, and more importantly, easier to use than SAX.

As others have said, use a SAX parser, as it is a streaming parser. Using the various events, you extract your information as necessary and then, on the fly store it someplace else (database, another file, what have you).
You can even store it in memory if you truly just need a minor subset, or if you're simply summarizing the file. Depends on the use case of course.
If you're spooling to a DB, make sure you take some care to make your process restartable or whatever. A lot can happen in 1.8GB that can fail in the middle.

Stream the file into a SAX parser and read it into memory in chunks.
SAX gives you a lot of control and being event-driven makes sense. The api is a little hard to get a grip on, you have to pay attention to some things like when the characters() method is called, but the basic idea is you write a content handler that gets called when the start and end of each xml element is read. So you can keep track of the current xpath in the document, identify which paths have which data you're interested in, and identify which path marks the end of a chunk that you want to save or hand off or otherwise process.
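A minimal sketch of that content-handler idea: track the current element path on a stack, buffer characters() calls (there may be several per text node), and act when the end tag of an interesting path arrives. The class name PathCollector is made up, and collecting into a list is only for demonstration; for a 1.8 GB file you would process each value as it arrives instead.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Sketch: collect the text of every element found at one exact path. */
class PathCollector extends DefaultHandler {
    private final String wantedPath;
    private final Deque<String> path = new ArrayDeque<>();
    private final StringBuilder text = new StringBuilder();
    final List<String> values = new ArrayList<>();

    PathCollector(String wantedPath) {
        this.wantedPath = wantedPath;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        path.addLast(qName);
        text.setLength(0); // simplification: only leaf text is captured
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called more than once per text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (("/" + String.join("/", path)).equals(wantedPath)) {
            values.add(text.toString().trim());
        }
        path.removeLast();
        text.setLength(0);
    }

    static List<String> collect(InputStream xml, String wantedPath)
            throws ParserConfigurationException, SAXException, IOException {
        PathCollector handler = new PathCollector(wantedPath);
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return handler.values;
    }
}
```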

Use almost any SAX Parser to stream the file a bit at a time.

I had a similar problem - I had to read a whole XML file and create a data structure in memory. On this data structure (the whole thing had to be loaded) I had to do various operations. A lot of the XML elements contained text (which I had to output in my output file, but wasn't important for the algorithm).
Firstly, as suggested here, I used SAX to parse the file and build up my data structure. My file was 4GB and I had an 8GB machine, so I figured maybe 3GB of the file was just text, and java.lang.String would probably need 6GB for that text because of its UTF-16 encoding.
If the JVM takes up more space than the computer has physical RAM, then the machine will swap. Doing a mark+sweep garbage collection will result in the pages getting accessed in a random-order manner and also objects getting moved from one object pool to another, which basically kills the machine.
So I decided to write all my strings out to disk in a file (the FS can obviously handle sequential-write of the 3GB just fine, and when reading it in the OS will use available memory for a file-system cache; there might still be random-access reads but fewer than a GC in java). I created a little helper class which you are more than welcome to download if it helps you: StringsFile javadoc | Download ZIP.
StringsFile file = new StringsFile();
StringInFile str = file.newString("abc"); // writes string to file
System.out.println("str is: " + str.toString()); // fetches string from file

+1 for StAX. It's easier to use than SAX because you don't need to write callbacks (you essentially just loop over all elements of the file until you're done) and it has (AFAIK) no limit as to the size of the files it can process.
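To show what "just loop" means, here is a small sketch with the JDK's XMLStreamReader that counts elements of a given name. The class name StaxCounter is hypothetical; a real program would read attributes or text inside the loop instead of only counting.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sketch: pull events from the parser in a plain loop, no callbacks. */
class StaxCounter {
    static int count(InputStream xml, String elementName) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() advances to the following event and returns its type.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(elementName)) {
                    count++;
                }
            }
        } finally {
            reader.close(); // releases parser resources, not the underlying stream
        }
        return count;
    }
}
```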

Indexing a set of key pair values for use in J2ME Application

I have some 1000 key-value pairs that I will use in my J2ME application, reading them from a resource file. However, I will be using only a few of those values at any time, say 10, based on the record number being generated inside the application logic. Loading all the values into memory and then looking them up is not an efficient option, as I will not be using all the records. Is there a better scheme to store the values in the file, some indexing or something, so that I can retrieve those key-value pairs by skipping the right number of bytes in the file to reach and read the appropriate record? As this is a resource file in the jar there won't be any modifications to it.

If you know the record length when they are created, you could write the records out in binary format to a file. But at the start of each record, you could first write a number indicating its size in bytes and use a RandomAccessFile to access the records by moving the file pointer.
But in terms of speed, loading into memory will be faster than reading from a file, but if memory is at a premium, then a file wouldn't be a bad way to go.
Jeff
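A sketch of the length-prefixed idea, adapted to a plain InputStream because a resource inside a jar is usually exposed as a stream rather than a seekable file. It assumes each record is stored as a 2-byte length followed by that many bytes; the format and the class name RecordSkipper are made up for illustration.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Sketch: reach record N by reading each length prefix and skipping the data. */
class RecordSkipper {
    /** Returns the bytes of the record with the given zero-based number. */
    static byte[] readRecord(InputStream source, int recordNumber) throws IOException {
        DataInputStream in = new DataInputStream(source);
        for (int i = 0; i < recordNumber; i++) {
            skipFully(in, in.readUnsignedShort());
        }
        byte[] record = new byte[in.readUnsignedShort()];
        in.readFully(record);
        return record;
    }

    /** skipBytes may skip fewer bytes than asked, so loop until done. */
    private static void skipFully(DataInputStream in, int count) throws IOException {
        while (count > 0) {
            int skipped = in.skipBytes(count);
            if (skipped == 0) {
                in.readByte(); // throws EOFException if the stream has really ended
                skipped = 1;
            }
            count -= skipped;
        }
    }
}
```

Skipping still walks the stream from the start, so records near the end cost more than records near the beginning.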

Skipping bytes in a compressed resource file inside a jar is not really going to be optimal either, and the implementation of the InputStream you get from calling Class.getResourceAsStream() may differ from device to device if you plan on running your application on several devices.
EDIT after additional info in comment:
It could be that the best way to do this is actually to store the (question, answer) data in 1000 different classes.
It's going to feel very weird as a solution but the class loader should only load the 10 classes you actually use, you can generate the 1000 source files with a simple J2SE program and you can load 10 random classes based on an integer inside their name using java.lang.Class.forName().
If the jar file doesn't become too big to use, you're basically relying on the indexing of its zip file format for the class loader's performance...
