Efficiently store easily parsable data in a file? - java

I need to store easily parsable data in a file as an alternative to a database-backed solution (not up for debate). Since it's going to store lots of data, it should preferably use a lightweight syntax. It does not necessarily need to be human-readable, but it should be parsable. Note that there will be multiple types of fields/columns, some of which might be used and some of which won't.
From my limited experience without a database I see several options, all with issues
CSV - I could technically do this, and it is very light. However, parsing would be an issue, and it would be painful if I wanted to add a column. Multi-language support is iffy, mostly people's own custom parsers.
XML - This is the perfect solution on many fronts except when it comes to parsing and overhead. That's a lot of tags, it would generate a giant file, and parsing would be very resource-consuming. However, virtually every language supports XML.
JSON - This is the middle ground, but I don't really want to use it, as its syntax is awkward and parsing is non-trivial. Language support is iffy.
So all have their disadvantages. But what would be the best choice when aiming for language support AND a reasonably small file size?

How about sqlite? This would allow you to basically embed the "DB" in your application, but not require a separate DB backend.
Also, if you end up using a DB backend later, it should be fairly easy to switch over.
If that's not suitable, I'd suggest one of the DBM-like stores for key-value lookups, such as Berkeley DB or tdb.
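As a rough sketch of what the embedded-SQLite route could look like from Java (assuming the sqlite-jdbc driver is on the classpath; the file name and table are made up for illustration):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class SqliteSketch {
        public static void main(String[] args) throws Exception {
            // "data.db" is created in the working directory if it doesn't exist yet
            try (Connection conn = DriverManager.getConnection("jdbc:sqlite:data.db")) {
                try (Statement st = conn.createStatement()) {
                    st.execute("CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, name TEXT, value REAL)");
                }
                try (PreparedStatement ps = conn.prepareStatement("INSERT INTO records (name, value) VALUES (?, ?)")) {
                    ps.setString(1, "example");
                    ps.setDouble(2, 42.0);
                    ps.executeUpdate();
                }
                try (Statement st = conn.createStatement();
                     ResultSet rs = st.executeQuery("SELECT name, value FROM records")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("name") + " = " + rs.getDouble("value"));
                    }
                }
            }
        }
    }

Adding a column later is then just an ALTER TABLE, rather than rewriting the whole file.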

If you're just using the basics of all these formats, all of the parsers are trivial. If CSV is an option, then for XML and JSON you're talking blocks of name/value pairs, so there's not even a recursive structure involved. json.org has support for pretty much any language.
That said.
I don't see what the problem is with CSV. If people write bad parsers, then too bad. If you're concerned about compatibility, adopt the default CSV dialect from Excel. Anyone that can't parse CSV from Excel isn't going to get far in this world. The weakest point of CSV support is embedded newlines and carriage returns. If your data doesn't have those, then it's not a problem. The only other issue is embedded quotation marks, and those are escaped in CSV. If you don't have those either, then it's even more trivial.
As for "adding a column", you have that problem with all of these. If you add a column, you get to rewrite the entire file. I don't see this being a big issue either.
If space is your concern, CSV is the most compact, followed by JSON, followed by XML. None of the resulting files can be easily updated. They pretty much all would need to be rewritten for any change in the data. CSV has the advantage that it's easily appended to, as there's no closing element (like JSON and XML).
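To show how small the Excel-style quoting rules really are, here is a minimal sketch (my own helper name, not from any library) that quotes a field only when needed, doubles embedded quotes, and appends rows without rewriting the file:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.Writer;

    public class CsvAppend {
        // Excel-style quoting: wrap in quotes if the field contains a comma, quote or newline,
        // and double any embedded quotes.
        static String escape(String field) {
            if (field.contains(",") || field.contains("\"") || field.contains("\n") || field.contains("\r")) {
                return "\"" + field.replace("\"", "\"\"") + "\"";
            }
            return field;
        }

        public static void main(String[] args) throws IOException {
            // true = append mode, so new rows can be added without touching existing ones
            try (Writer out = new FileWriter("data.csv", true)) {
                out.write(escape("plain") + "," + escape("has, comma") + "," + escape("has \"quotes\"") + "\n");
            }
        }
    }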

JSON is probably your best bet (it's light-ish, fast to parse, and self-describing, so you can add your new columns as time goes by). You've said parsable - do you mean using Java? There are JSON libraries for Java that take the pain out of most of the work. There are also various lightweight in-memory databases that can persist to a file (in case "not an option" means you don't want a big separate database).
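For example, one JSON object per line keeps the file appendable and lets new columns show up whenever they are needed. A minimal sketch using Jackson (my library choice, not specified by the answer; the field names are illustrative):

    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.FileWriter;
    import java.io.Writer;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class JsonLines {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            Map<String, Object> row = new LinkedHashMap<>();
            row.put("id", 1);
            row.put("name", "example");
            row.put("newColumn", 3.14); // a later column is simply an extra key

            // Append one JSON object per line; older lines just lack the newer keys
            try (Writer out = new FileWriter("data.jsonl", true)) {
                out.write(mapper.writeValueAsString(row));
                out.write("\n");
            }
        }
    }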

If this is just for logging some data quickly to a file, I find tab-delimited files easier to parse than CSV, so if it's a flat text file you're looking for I'd go with that (so long as you don't have tabs in the feed, of course; see the sketch after this list). If you have fixed-size columns, you could use fixed-length fields. That is even quicker because you can seek.
If it's unstructured data that might need some analysis, I'd go for JSON.
If it's structured data and you envision ever doing any querying on it... I'd go with sqlite.
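A quick sketch of the tab-delimited case (the file name is made up); note the -1 limit so trailing empty fields are kept:

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TabRead {
        public static void main(String[] args) throws Exception {
            try (BufferedReader in = new BufferedReader(new FileReader("data.tsv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    // -1 keeps trailing empty strings instead of silently dropping them
                    String[] fields = line.split("\t", -1);
                    System.out.println(fields.length + " fields, first = " + fields[0]);
                }
            }
        }
    }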

When I needed a solution like this, I wrote up a simple representation where each piece of data is prefixed with its length. For example, "Hi" would be represented (in hex) as 02 48 69.
To form rows, just nest this operation (the first number is the number of fields, followed by the fields themselves). For example, if field 0 contains "Hi" and field 1 contains "abc", it would be:
    Num fields   Field length   Data    Field length   Data
    02           02             48 69   03             61 62 63
You can also use the first row as the names of the columns.
(I have to say this is kind of a DB backend).
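For what it's worth, a minimal sketch of writing and reading that layout with plain DataOutputStream/DataInputStream (assuming each field length fits in one unsigned byte, as in the example above):

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    public class LengthPrefixed {
        static void writeRow(DataOutputStream out, String... fields) throws IOException {
            out.writeByte(fields.length);                 // number of fields
            for (String f : fields) {
                byte[] data = f.getBytes(StandardCharsets.UTF_8);
                out.writeByte(data.length);               // field length (must fit in one byte)
                out.write(data);                          // field data
            }
        }

        static String[] readRow(DataInputStream in) throws IOException {
            int n = in.readUnsignedByte();
            String[] fields = new String[n];
            for (int i = 0; i < n; i++) {
                byte[] data = new byte[in.readUnsignedByte()];
                in.readFully(data);
                fields[i] = new String(data, StandardCharsets.UTF_8);
            }
            return fields;
        }

        public static void main(String[] args) throws IOException {
            try (DataOutputStream out = new DataOutputStream(new FileOutputStream("rows.bin"))) {
                writeRow(out, "Hi", "abc");               // produces 02 02 48 69 03 61 62 63
            }
            try (DataInputStream in = new DataInputStream(new FileInputStream("rows.bin"))) {
                System.out.println(String.join(", ", readRow(in)));
            }
        }
    }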

You can use CSV, and if you only ever add columns at the end, this is simple to handle; i.e. if a row has fewer columns than you expect, use the default value for the "missing" fields.
If you want to be able to change the order/use of fields, you can add a heading row. i.e. the first row has the names of the columns. This can be useful when you are trying to read the data.
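A small sketch of that idea, reading the header row once and then looking columns up by name, with a default for anything missing (the column names are made up):

    import java.util.HashMap;
    import java.util.Map;

    public class HeaderLookup {
        public static void main(String[] args) {
            String headerLine = "id,name,score";          // first row of the file
            String dataLine = "1,example";                // an older row with fewer columns

            String[] header = headerLine.split(",", -1);
            Map<String, Integer> index = new HashMap<>();
            for (int i = 0; i < header.length; i++) {
                index.put(header[i], i);
            }

            String[] fields = dataLine.split(",", -1);
            // Look up by name; fall back to a default when the column is missing in this row
            Integer scoreIdx = index.get("score");
            String score = (scoreIdx != null && scoreIdx < fields.length) ? fields[scoreIdx] : "0";
            System.out.println("score = " + score);
        }
    }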

If you are forced to use a flat file, why not develop your own format? You should be able to tweak overhead and customize as much as you want (which is good if you are parsing lots of data).
Data entries will be either of fixed or variable length; there are advantages to forcing some entries to a fixed length, but you will need a way of delimiting both. If you have different "types" of rows, write all the rows of each type in a chunk. Each chunk of rows will have a header: use one header to describe the type of the chunk, and another to describe the columns and their sizes. Decide how you will use the headers to describe each chunk.
eg (H is header, C is column descriptions and D is data entry):
H Phone Numbers
C num(10) type
D 1234567890 Home
D 2223334444 Cell
H Addresses
C house(5) street postal(6) province
D 1234_ "some street" N1G5K6 Ontario
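A rough sketch of reading such a chunked file by dispatching on the line prefix (the file name is invented, and it ignores the quoting and fixed-width details, which you would fill in for your own format):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ChunkedReader {
        public static void main(String[] args) throws Exception {
            String chunkType = null;
            String[] columns = new String[0];
            try (BufferedReader in = new BufferedReader(new FileReader("data.txt"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.isEmpty()) continue;
                    String payload = line.substring(1).trim();
                    switch (line.charAt(0)) {
                        case 'H':                  // chunk header: the type of the rows that follow
                            chunkType = payload;
                            break;
                        case 'C':                  // column descriptions for this chunk
                            columns = payload.split("\\s+");
                            break;
                        case 'D':                  // a data row, interpreted using the current columns
                            System.out.println(chunkType + " (" + columns.length + " cols): " + payload);
                            break;
                    }
                }
            }
        }
    }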

I'd say that if you want to store rows and columns, you've got to use a DB. The reason is simple - modifying the structure with any approach except an RDBMS will require significant effort, and you mentioned that you want to change the structure in the future.

Related

Fastest way to read a CSV?

I've profiled my application and it seems like one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts that I can streamline any more than they already are. It also seems like all of the newly created String objects are causing issues with the garbage collector, although I'm less clear whether or not that's the case.
I'm reading in a gzipped file of comma-separated values containing financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in, creating the fewest intermediate objects?
I saw this thread but none of the answers give any evidence that OpenCSV is any faster than String.split, and they all seem to focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore.
A quicker way is to use just a simple StringTokenizer. It doesn't have the regex overhead of split() and it's in the JDK.
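For example (with made-up sample data; note the caveat that StringTokenizer silently skips empty fields between consecutive commas, which may or may not matter for your data):

    import java.util.StringTokenizer;

    public class TokenizerExample {
        public static void main(String[] args) {
            String line = "IBM,2011-01-03,147.21,148.20";
            StringTokenizer st = new StringTokenizer(line, ",");
            while (st.hasMoreTokens()) {
                String field = st.nextToken();   // no regex compilation, unlike String.split
                System.out.println(field);
            }
        }
    }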
If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings simultaneously.
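A minimal sketch of such a CSV state machine (it handles commas embedded in double-quoted fields and "" escapes, and reuses one StringBuilder to cut down on garbage; numeric fields could be parsed on the fly as suggested above):

    import java.util.ArrayList;
    import java.util.List;

    public class CsvStateMachine {
        // Split one CSV line, honouring double-quoted fields and doubled-quote escapes.
        static List<String> parseLine(String line) {
            List<String> fields = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            boolean inQuotes = false;
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (inQuotes) {
                    if (c == '"') {
                        if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                            current.append('"');   // escaped quote inside a quoted field
                            i++;
                        } else {
                            inQuotes = false;      // closing quote
                        }
                    } else {
                        current.append(c);
                    }
                } else if (c == '"') {
                    inQuotes = true;
                } else if (c == ',') {
                    fields.add(current.toString());
                    current.setLength(0);          // reuse the builder for the next field
                } else {
                    current.append(c);
                }
            }
            fields.add(current.toString());
            return fields;
        }

        public static void main(String[] args) {
            System.out.println(parseLine("1,\"a, quoted\",\"he said \"\"hi\"\"\",42"));
            // A numeric field can be converted immediately, e.g. Integer.parseInt(fields.get(3)),
            // instead of keeping many intermediate Strings around.
        }
    }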
Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats, and its CSV parser is the fastest among CSV parsers for Java (as you can see here and here). Disclosure: I am the author of this library. It's open-source and free (Apache 2.0 license).
We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).
It should solve your problem.
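Usage looks roughly like this (a sketch against the uniVocity-parsers API; the file name and the gzip wrapping mirror the question):

    import com.univocity.parsers.csv.CsvParser;
    import com.univocity.parsers.csv.CsvParserSettings;

    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPInputStream;

    public class UnivocityExample {
        public static void main(String[] args) throws Exception {
            CsvParserSettings settings = new CsvParserSettings();
            settings.setLineSeparatorDetectionEnabled(true);

            CsvParser parser = new CsvParser(settings);
            try (Reader reader = new InputStreamReader(
                    new GZIPInputStream(new FileInputStream("data.csv.gz")), StandardCharsets.UTF_8)) {
                parser.beginParsing(reader);
                String[] row;
                while ((row = parser.parseNext()) != null) {
                    // process each row; the parser manages its own buffers internally
                }
            }
        }
    }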

Map multiple columns from multiple files which are slightly different

I am looking for a good practical method of tackling metadata normalization between multiple files that have slightly different schemas, for a batch ETL job in Talend.
I have a few hundred historical reports (around 25K to 200K records each) with about 100 to 150 columns per Excel file. Most of the column names are the same across all the files (98% overlap); however, there are subtle, evil differences:
Different Column orders
Different Column names (sometimes using and sometimes not using abbreviations)
Different counts of columns
Sometimes columns have spaces between words, sometimes dots, dashes or underscores
etc.
Short of writing a specialized application or brute forcing all the files by manually correcting them, are there any good free tools or methods that would provide a diff and correction between file column names in an intelligent or semi-automated fashion?
You could use Talend Open Studio to achieve that. But I do see one caveat.
The official way
In order to make Talend understand your Excel files, you will need to load their metadata first. The caveat is that you will need to load all the metadata by hand (one file at a time). In the free version of Talend (Open Studio Data), there is no support for dynamic metadata.
Using components like tMap, you can then map your input metadata onto your desired output metadata (which could be an Excel file, a database, or something else). During this step you can shape your input data into your desired output (fixing / ignoring / transforming it / etc.).
The unofficial way
There seems to exist a user-contributed component that offers support for dynamic Excel metadata. I did not test it, but it is worth trying:
http://www.talendforge.org/exchange/?eid=663&product=tos&action=view&nav=1,1,1
This may evolve, as components are released and updated frequently.
My answer reflects the status as of version 5.3.1.
I write this tentatively as an "answer" because I don't have the link to hand to demonstrate exactly how it can be done. However, Pentaho Data Integration provides a very good way to load files like this: there is a method by which you can read the metadata of the file (by that I mean the column names) in a first transformation, and you can then use the "metadata injection" functionality to inject that metadata into the next transformation, which reads the file.
Now, in the scenario where your column names are slightly different, you'll have to do some additional mapping somehow. Perhaps you can store a lookup table somewhere of "alias" column names and real column names, as sketched below.
Either way, this sounds like a pretty complex / nasty task to automate!
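A rough sketch of that alias lookup (the normalization rule and the column names are invented for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class ColumnAliases {
        // Normalize a header: lower-case it and strip spaces, dots, dashes and underscores,
        // so "Phone Number", "phone_number" and "Phone-No." all reduce to the same key.
        static String normalize(String header) {
            return header.toLowerCase().replaceAll("[\\s._-]+", "");
        }

        public static void main(String[] args) {
            Map<String, String> aliasToCanonical = new HashMap<>();
            aliasToCanonical.put("phonenumber", "phone_number");
            aliasToCanonical.put("phoneno", "phone_number");
            aliasToCanonical.put("custname", "customer_name");
            aliasToCanonical.put("customername", "customer_name");

            String[] fileHeaders = {"Phone-No.", "Customer Name"};
            for (String h : fileHeaders) {
                String canonical = aliasToCanonical.getOrDefault(normalize(h), "UNMAPPED:" + h);
                System.out.println(h + " -> " + canonical);
            }
        }
    }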
I've not seen any way to handle varying metadata of a file in Talend - although I'm happy to be corrected on this point!

converting legacy data to json format

I am working with some legacy code where I am using a column value and its earlier history to make some comparisons and highlight the differences. However, the column is stored in an arbitrarily delimited fashion, and the code around the comparison is, well, very hard to understand.
My initial thought was to refactor the code - but when I later thought about it, I wondered why not fix the original source of the issue, which in this case is the data. I have some limitations around the table structure, and hence converting this from a single column into multiple columns is not an option.
With that, I thought about whether it is worth converting this arbitrarily delimited data into a standardized format like JSON. My approach would be to export this data to some file, apply some regular expressions to convert it to JSON, and then re-import it.
I wanted to check with this group if I am approaching this problem the right way and if there are other ideas that I should consider. Is JSON the right format in the first place? I'd like to know how you approached a similar problem. Thanks!
However, the column is stored in an arbitrarily delimited fashion, and the code around the comparison is, well, very hard to understand.
So it seems the problem is in the maintainability/evolvability of the code for adding or modifying functionalities.
My initial thought was to refactor the code - but when I later thought about it, I wondered why not fix the original source of the issue, which in this case is the data.
My opinion is: why? I think your initial thought was right. You should refactor (i.e. "...restructuring an existing body of code, altering its internal structure without changing its external behavior") to gain what you want. If you change the data format, you could still end up with messy code on top of the new, cooler data format.
In other words, you can have all the combinations, because the two are not strictly related: you can have a cool data format with messy code or with fantastic code, and you can have an ugly data format with fantastic code or with messy code.
It seems you are in the last case, but moving from there to "cool data format + messy code" and then to "cool data format + fantastic code" is less straightforward than moving to "fantastic code + ugly data format" and then, optionally, to "cool data format + fantastic code".
So my opinion is point your goal in a straightforward way:
1) write tests (if you haven't) for the current functionalities and verify them with the current code (first step for refactoring)
2) change the code, driven by the tests written in 1
3) if you still want to, change the data format, always guided by your tests
Regards

How to encode long-string in the DB where attribute type is varchar(254)?

Say for some reason I've got a table in the DB that only has varchar(254) attribute types.
So in other words, I cannot store a String directly which has more than 254 characters.
But would there be a way around this, i.e. is it possible to encode a long string (say approx 700 chars) in the DB given this constraint.
What would be the easiest way to do this? I use Java.
Depending on the nature of the string you're wanting to store, you might be able to compress it down to the required length.
However, this is a can of worms. If I were in your shoes, I'd first investigate whether widening the column is an option (in most SQL DBMSs all it takes is a simple ALTER COLUMN command).
P.S. If you have to compress the data, take a look at the Deflater class. However, if I were you, I'd fight really hard for that trivial schema change.
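A hedged sketch of the compress-and-encode approach with java.util.zip and Base64 (note that Base64 adds roughly a third back, so a ~700-character string only fits in 254 characters if it compresses very well; always check the length before storing):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;
    import java.util.zip.DeflaterOutputStream;
    import java.util.zip.InflaterInputStream;

    public class VarcharCompression {
        static String compressToBase64(String text) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (DeflaterOutputStream dos = new DeflaterOutputStream(bos)) {
                dos.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return Base64.getEncoder().encodeToString(bos.toByteArray());
        }

        static String decompressFromBase64(String encoded) throws IOException {
            byte[] compressed = Base64.getDecoder().decode(encoded);
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (InflaterInputStream iis = new InflaterInputStream(new ByteArrayInputStream(compressed))) {
                byte[] buf = new byte[1024];
                int n;
                while ((n = iis.read(buf)) != -1) {
                    bos.write(buf, 0, n);
                }
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        }

        public static void main(String[] args) throws IOException {
            String encoded = compressToBase64("some long, hopefully repetitive, string...");
            if (encoded.length() > 254) {
                throw new IllegalStateException("Still too long for varchar(254): " + encoded.length());
            }
            System.out.println(decompressFromBase64(encoded));
        }
    }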

The right way to manage a big matrix in Java

I'm working with a big matrix (not sparse); it contains about 10^10 doubles.
Of course I cannot keep it in memory, and I need just one row at a time.
I thought of splitting it into files, one row per file (which requires a lot of files), and just reading a file every time I need a row. Do you know of a more efficient way?
Why do you want to store it in different files? Can't you use a single file?
You could use the methods of the RandomAccessFile class to perform the reading from that file.
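A minimal sketch of that single-file approach, treating the matrix as fixed-length rows and seeking straight to the row you need (the row and column counts, and the file name, are illustrative):

    import java.io.RandomAccessFile;
    import java.nio.ByteBuffer;
    import java.nio.DoubleBuffer;

    public class MatrixRowReader {
        public static void main(String[] args) throws Exception {
            int cols = 100_000;                            // illustrative row length
            int rowIndex = 42;

            try (RandomAccessFile raf = new RandomAccessFile("matrix.bin", "r")) {
                long offset = (long) rowIndex * cols * Double.BYTES;   // 8 bytes per double
                raf.seek(offset);
                byte[] buf = new byte[cols * Double.BYTES];
                raf.readFully(buf);
                DoubleBuffer row = ByteBuffer.wrap(buf).asDoubleBuffer();
                System.out.println("row[" + rowIndex + "][0] = " + row.get(0));
            }
        }
    }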
So, 800KB per file, sounds like a good division. Nothing really stops you from using one giant file, of course. A matrix, at least one like yours that isn't sparse, can be considered a file of fixed length records, making random access a trivial matter.
If you do store it one file per row, I might suggest making a directory tree corresponding to decimal digits, so 0/0/0/0 through 9/9/9/9.
Considerations one way or the other...
is it being backed up? Do you have high-capacity backup media or something ordinary?
does this file ever change?
if it does change and it is backed up, does it change all at once or are changes localized?
It depends on the algorithms you want to execute, but I guess that in most cases a representation where each file contains some square or rectangular region would be better.
For example, matrix multiplication can be done recursively by breaking a matrix into submatrices.
If you are going to be saving it in a file, I believe serializing it will save space/time over storing it as text.
Serializing the doubles will store them as 8 bytes each (plus serialization overhead) and means that you will not have to convert these doubles back and forth to and from Strings when saving or loading the file.
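For example, writing the raw doubles with a DataOutputStream (plain binary output, substituted here for full Java object serialization to avoid the per-object overhead) stores each value in exactly 8 bytes:

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;

    public class MatrixRowWriter {
        public static void main(String[] args) throws Exception {
            double[] row = {1.0, 2.5, 3.75};               // illustrative row
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(new FileOutputStream("matrix.bin", true)))) {
                for (double v : row) {
                    out.writeDouble(v);                    // 8 bytes per double, no String conversion
                }
            }
        }
    }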
I'd suggest using a disk-persistent cache like Ehcache. Just configure it to keep as many fragments of your matrix in memory as you like, and it will take care of the serialization. All you have to do is decide on the way of fragmentation.
Another approach that comes to my mind is using Terracotta (which recently acquired Ehcache, by the way). It's great for getting a large network-attached heap that can easily manage your 10^10 double values without you caring about it in code at all.
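A rough sketch against the Ehcache 2.x API (the cache name and sizes are illustrative, and true disk persistence needs more configuration than the overflow-to-disk flag shown here):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class MatrixCache {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            // name, maxElementsInMemory, overflowToDisk, eternal, timeToLive, timeToIdle
            Cache rows = new Cache("matrixRows", 100, true, true, 0, 0);
            manager.addCache(rows);

            double[] someRow = new double[100_000];        // one row of the matrix
            rows.put(new Element(42, someRow));            // keyed by row index

            Element e = rows.get(42);
            if (e != null) {
                double[] back = (double[]) e.getObjectValue();
                System.out.println("row length = " + back.length);
            }
            manager.shutdown();
        }
    }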
