Best file format regarding standard string and integer data?


For my project, I need to store info about protocols (the data sent (most likely integers) and in the order it's sent) and info that might be formatted something like this:
'ID' 'STRING' 'ADDITIONAL INTEGER DATA'
This info will be read by a Java program and stored in memory for processing, but I don't know what would be the most sensible format to store this data in?
EDIT: Here's some extra information:
1)I will be using this data in a game server.
2)Since it is a game server, speed is not the primary concern, since this data will primarily be read and utilized during startup, which shouldn't occur very often.
3)Memory consumption I would like to keep at a minimum, however.
4)The second data "example" will be used as a "dictionary" to look up names of specific in-game items, their stats and other integer data (and therefore might become very large, unlike the first data containing the protocol information, where each file will only note small protocol bites, like a login protocol for instance).
5)And yes, I would like the data to be "human-editable".
EDIT 2: Here's the choices that I've made:
JSON - For the protocol descriptions
CSV - For the dictionaries

There are many factors that could come to weigh--here are things that might help you figure this out:
1) Speed/memory usage: If the data needs to load very quickly or is very large, you'll probably want to consider rolling your own binary format.
2) Portability/compatibility: Balanced against #1 is the consideration that you might want to use the data elsewhere, with programs that won't read a custom binary format. In this case, your heavy hitters are probably going to be CSV, dBase, XML, and my personal favorite, JSON.
3) Simplicity: Delimited formats like CSV are easy to read, write, and edit by hand. Either use double-quoting with proper escaping or choose a delimiter that will not appear in the data.
If you could post more info about your situation and how important these factors are, we might be able to guide you further.
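As a rough illustration of the delimited option in point 3, here is a minimal sketch that loads lines shaped like the 'ID' 'STRING' 'ADDITIONAL INTEGER DATA' records from the question into a Map. The class name ItemDictionary, the comma delimiter and the "#" comment convention are assumptions made for the example, and it assumes the delimiter never appears inside a name (no quoting support).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class ItemDictionary {
    /** One record: a name plus its additional integer data. */
    static final class Item {
        final String name;
        final int[] data; // primitive array keeps the memory footprint small

        Item(String name, int[] data) {
            this.name = name;
            this.data = data;
        }
    }

    /** Parses lines of the form id,name,int,int,... (no quoting, no commas inside names). */
    static Map<Integer, Item> load(Reader source) {
        Map<Integer, Item> items = new HashMap<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty() || line.startsWith("#")) {
                    continue; // skip blank lines and comments
                }
                String[] cols = line.split(",");
                if (cols.length < 2) {
                    throw new IllegalArgumentException("Expected at least id,name but got: " + line);
                }
                int[] data = new int[cols.length - 2];
                for (int i = 0; i < data.length; i++) {
                    data[i] = Integer.parseInt(cols[i + 2].trim());
                }
                items.put(Integer.parseInt(cols[0].trim()), new Item(cols[1].trim(), data));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        Map<Integer, Item> items = load(new StringReader("1,Iron Sword,5,12\n2,Potion,30\n"));
        System.out.println(items.get(1).name + " has " + items.get(1).data.length + " stats");
    }
}
```

If names can contain the delimiter, a real CSV library with quoting support is probably the safer route.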

How about XML, JSON or CSV ?

I've written a similar protocol-specification using XML. (Available here.)
I think it is a good match, since it captures the hierarchical nature of specifying messages / network packages / fields etc. The order of fields is well defined and so on.
I even wrote a code-generator that generated the message sending / receiving classes with methods for each message type in XSLT.
The only drawback as I see it is the verbosity. If you have a really simple structure of the specification, I would suggest you use some simple home-brewed format and write a parser for it using a parser-generator of your choice.

In addition to the formats suggested by others here (CSV, XML, JSON, etc.) you might consider storing the info in a Java properties file. (See the java.util.Properties class.) The code is already there for you, so all you have to figure out is the properties names (or name prefixes) you want to use.
The Properties class also provides for storing/loading properties in a simple XML format.
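To make the "name prefixes" idea concrete, here is a small sketch. The key scheme item.<id>.name and the class name ItemProperties are made up for the example; only the java.util.Properties calls are standard.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ItemProperties {
    // Hypothetical naming scheme: item.<numeric id>.name=<display name>
    private static final Pattern NAME_KEY = Pattern.compile("item\\.(\\d+)\\.name");

    /** Collects every item.<id>.name entry into an id -> name map. */
    static Map<Integer, String> loadNames(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, String> names = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            Matcher m = NAME_KEY.matcher(key);
            if (m.matches()) {
                names.put(Integer.valueOf(m.group(1)), props.getProperty(key));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String text = "item.1.name=Iron Sword\nitem.1.data=5,12\nitem.2.name=Potion\n";
        System.out.println(loadNames(new StringReader(text)));
    }
}
```

Keys that don't match the pattern (like item.1.data here) are simply ignored by this method and could be picked up by a second pass.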

Related

Read proto partly instead of full parsing in java

I used to define a proto file, for example
option java_package = "proto.data";
message Data {
repeated string strs = 1;
repeated int32 ints = 2;
}
I receive this object from the network as an InputStream (or bytes). Normally I parse it with Data.parseFrom(stream) or Data.parseFrom(bytes) to get the object.
This way I have to hold the entire Data object in memory, while I only need to traverse all the string and integer values in it. That's bad when the object is big.
What should I do for this issue?

Unfortunately, there is no way to parse just part of a protobuf. If you want to be sure that you've seen all of the strs or all of the ints, you have to parse the entire message, since the values could appear in any order or even interleaved.
If you only care about memory usage and not CPU time then you could, in theory, use a hand-written parser to parse the message and ignore fields that you don't care about. You still have to do the work of parsing, you can just discard them immediately rather than keeping them in memory. However, to do this you'd need to study the Protobuf wire format and write your own parser. You can use Protobuf's CodedInputStream class but a lot of work still needs to be done manually. The Protobuf library really isn't designed for this.
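To give an idea of what such a hand-written parser involves, here is a minimal sketch for the Data message from the question. It uses plain java.io rather than CodedInputStream so that it runs without the Protobuf library. The class name DataScanner is made up, and it assumes the ints field is an int32 at field number 2 (accepted in both packed and unpacked encodings) and strs is field number 1. Treat it as an illustration of the wire format, not a replacement for the library.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;
import java.util.function.IntConsumer;

final class DataScanner {
    /** Walks an encoded Data message and hands each value to a callback as it is read. */
    static void scan(InputStream in, Consumer<String> onStr, IntConsumer onInt) {
        try {
            DataInputStream data = new DataInputStream(in);
            int first;
            while ((first = data.read()) != -1) {
                long tag = readVarint(data, first);
                int field = (int) (tag >>> 3);  // field number from the .proto file
                int wireType = (int) (tag & 7); // how the payload is encoded
                switch (wireType) {
                    case 0: { // varint: an unpacked element of ints
                        long value = readVarint(data, data.readUnsignedByte());
                        if (field == 2) onInt.accept((int) value);
                        break;
                    }
                    case 1: // fixed 64-bit value, not used by Data: skip it
                        data.readFully(new byte[8]);
                        break;
                    case 2: { // length-delimited: a string, or a packed run of ints
                        byte[] payload = new byte[(int) readVarint(data, data.readUnsignedByte())];
                        data.readFully(payload);
                        if (field == 1) {
                            onStr.accept(new String(payload, StandardCharsets.UTF_8));
                        } else if (field == 2) {
                            DataInputStream packed = new DataInputStream(new ByteArrayInputStream(payload));
                            int b;
                            while ((b = packed.read()) != -1) onInt.accept((int) readVarint(packed, b));
                        }
                        break;
                    }
                    case 5: // fixed 32-bit value, not used by Data: skip it
                        data.readFully(new byte[4]);
                        break;
                    default:
                        throw new IOException("Unsupported wire type " + wireType);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes a base-128 varint whose first byte has already been read. */
    private static long readVarint(DataInputStream data, int firstByte) throws IOException {
        long result = 0;
        int shift = 0;
        int b = firstByte;
        while (true) {
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
            shift += 7;
            if (shift >= 64) throw new IOException("Malformed varint");
            b = data.readUnsignedByte();
        }
    }

    public static void main(String[] args) {
        byte[] msg = {0x0A, 2, 'h', 'i', 0x10, 42};
        scan(new ByteArrayInputStream(msg), System.out::println, System.out::println);
    }
}
```

Only one value is held at a time, so memory use is bounded by the largest single field rather than by the whole message.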
If you are willing to consider using a different protocol framework, Cap'n Proto is extremely similar in design to Protobufs but features the ability to read only the part of the message you care about. Cap'n Proto incurs no overhead for the fields you don't examine, other than obviously the bandwidth and memory to receive the raw message bytes. If you are reading from a file, and you use memory mapping (MappedByteBuffer in Java), then only the parts of the message you actually use will be read from disk.
(Disclosure: I am the author of most of Google Protobufs v2 (the version you are probably using) as well as Cap'n Proto.)

Hmm. It appears that it may be already implemented but not adequately documented.
Have you tested it?
See for discussion:
https://groups.google.com/forum/#!topic/protobuf/7vTGDHe0ZyM
See also, sample test code in google's github:
https://github.com/google/protobuf/blob/4644f99d1af4250dec95339be6a13e149787ab33/java/src/test/java/com/google/protobuf/lazy_fields_lite.proto

What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names and country codes for sign-ups, LBS, etc.
Now I figured that putting all the information in the Datastore is rather stupid, as it will be used quite frequently and it's going to eat up my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - No budget for paid services, free ones are not exactly reliable.
Upload Parse-able file - Favorable option as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (link has source files for all countries in case someone needs it). The file for each country is a regular UTF-8 tab delimited file which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
The best way being the fastest, and least resource hungry method.
Valid options:
TXT file, tab delimited
XML file Static
Java Class with Tons of enums
I know that importing country files as Java enums and going through their values will be very fast, but do you think this is going to affect memory beyond reasonable limits? On the other hand, with a file, every time I need to access a record the loop will go through a few thousand lines until it finds the required record: reading line by line, so no memory issues, but incredibly slow. I have had some experience with parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at large scale the response WILL time out (no doubt about it). So is XML anything like Excel?
Thank you very much guys !! Please provide opinions, all and anything is appreciated !

Easiest and fastest way would be to have the file as a static web resource file, under the WEB-INF folder and on application startup, have a context listener to load the file into memory.
In memory, it should be a Map keyed by whatever you want to search by. This gives you roughly constant access time.
Memory consumption only matters if the data is really big. A hundred thousand records, for example, are not worth optimizing if you need to access them many times.
The static file should be plain text or CSV, since these are read and parsed most efficiently. There is no need for XML, as parsing it would be slower.
If the list is really big, you can break it up into multiple, smaller files, and only parse those and only when they are required. A reasonable, easy partitioning would be to break it up by country, but any other partitioning would work (like based on its name using the first few characters from its name).
You could also consider building this Map in memory once, serializing it to a binary file, and including that binary file as a static resource. That way you would only have to deserialize the Map, with no need to parse the data as text and build the objects yourself.
Improvements on the data file
An alternative to having the static resource file as a text/CSV file or a serialized Map
data file would be to have it as a binary data file where you could create your own custom file format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantages that the file could be much smaller (compared to plain text / CSV / serialized Map), and loading it would be much faster (because DataInputStream doesn't have to parse numbers from text, for example; it reads the bytes of a number directly).
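A minimal sketch of such a custom format, assuming a made-up layout (a record count, then an int id and a string per record) and a made-up class name BinaryDictionary:

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class BinaryDictionary {
    /** Layout (an assumption for this example): int count, then per record an int id and a UTF string. */
    static byte[] write(Map<Integer, String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(entries.size());
            for (Map.Entry<Integer, String> entry : entries.entrySet()) {
                out.writeInt(entry.getKey());
                out.writeUTF(entry.getValue()); // 2-byte length prefix, so at most 65535 encoded bytes
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the layout produced by write() straight into a HashMap, with no text parsing. */
    static Map<Integer, String> read(InputStream source) {
        try (DataInputStream in = new DataInputStream(new BufferedInputStream(source))) {
            int count = in.readInt();
            Map<Integer, String> entries = new HashMap<>();
            for (int i = 0; i < count; i++) {
                int id = in.readInt();
                entries.put(id, in.readUTF());
            }
            return entries;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<Integer, String> cities = new HashMap<>();
        cities.put(2988507, "Paris");
        byte[] file = write(cities);
        System.out.println(file.length + " bytes, round trip: " + read(new ByteArrayInputStream(file)));
    }
}
```

In a real application write() would run once at build time and the resulting bytes would be shipped as the static resource; the in-memory byte array here just keeps the example self-contained.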

Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a java HashMap
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
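For option (a), a small sketch using the XPath support that ships with the JDK. The XML layout (countries/country/city) and the class name CountryXml are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class CountryXml {
    /** Parses the XML once; the returned tree is what you keep in memory. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Evaluates an arbitrary XPath expression against the in-memory tree. */
    static String query(Document doc, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, doc);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<countries><country code=\"FR\" name=\"France\">"
                + "<city>Paris</city><city>Lyon</city></country></countries>");
        System.out.println(query(doc, "/countries/country[@code='FR']/@name"));
        System.out.println(query(doc, "/countries/country[city='Lyon']/@code"));
    }
}
```

The two queries go in opposite directions (code to name, city to code) without any extra indexing, which is the flexibility meant in (a). With a HashMap, each of those lookups would need its own map.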

writing data in to files with java

I am writing a server in java that allows clients to play a game similar to 20 questions. The game itself is basically a binary tree with nodes that are questions about an object and leaves that are guesses at the object's identity. When the game guesses wrong it needs to be able to get the right answer from the player and add it to the tree. This data is then saved to a random access file.
The question is: How do you go about representing a tree within a file so that the data can be reaccessed as a tree at a later time.
If you know where I can find information on keeping data structures like trees organized as such when writing/reading to files then please link it. Thanks a lot.
Thanks for the quick answers everyone. This is a school project so it has some odd requirements like using random access files and telnet.

This data is then saved to a random access file.
That's the hard way to solve your problem (the "random access" bit, I mean).
The problem you are really trying to solve is how to persist a "complicated" data structure. In fact, there are a number of ways that this can be done. Here are some of them ...
Use Java persistence. This is simple to implement; make sure that your data structure is serializable, and then it's just a few lines of code to serialize and a few more lines to deserialize. The downsides are:
Serialized objects can be fragile in the face of code changes.
Serialization is not incremental. You write/read the whole graph each time.
If you have multiple separate serialized graphs, you need some scheme to name and manage them.
Use XML. This is more work to implement than Java persistence, but it has the advantage of being less fragile. And if something does go wrong, there's a chance you can fix it with XSLT or a text editor. (There are XML "binding" libraries that eliminate a lot of the glue coding.)
Use an SQL database. This addresses all of the downsides of Java persistence, but involves more coding ... and using a different computational model to access the persistent data (query versus graph navigation).
Use a database and an Object Relational Mapping technology; e.g. a JPA or JDO implementation. (Hibernate is a popular choice). These bridge between the database and in-memory views of data in a more or less transparent fashion, and avoids a lot of the glue code that you need to write in the SQL database and XML cases.
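To show how little code the first option needs, here is a sketch of Java serialization applied to a question tree. The Node class and its field names are assumptions about what the game's tree might look like, and the whole tree is written in one go, which is exactly the "not incremental" downside listed above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class QuestionTree {
    /** Inner nodes hold a question, leaves hold a guess. */
    static final class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        final String text;
        Node yes;
        Node no;

        Node(String text) {
            this(text, null, null);
        }

        Node(String text, Node yes, Node no) {
            this.text = text;
            this.yes = yes;
            this.no = no;
        }

        boolean isLeaf() {
            return yes == null && no == null;
        }
    }

    /** Serializes the whole tree reachable from root. */
    static byte[] save(Node root) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(root);
        } catch (IOException e) {
            throw new IllegalStateException("Could not serialize tree", e);
        }
        return bytes.toByteArray();
    }

    /** Rebuilds the tree, links included, from bytes produced by save(). */
    static Node load(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Node) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Could not deserialize tree", e);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("Is it alive?", new Node("a cat"), new Node("a rock"));
        Node copy = load(save(root));
        System.out.println(copy.text + " yes -> " + copy.yes.text + ", no -> " + copy.no.text);
    }
}
```

Swapping the byte array streams for FileOutputStream and FileInputStream gives file persistence. It does not satisfy a random access file requirement by itself.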

I think you're looking for serialization. Try this:
http://java.sun.com/developer/technicalArticles/Programming/serialization/

As mentioned, serialization is what you are looking for. It allows you to write an object to a file, and read it back later with minimal effort. The file will automatically be read back in as your object type. This makes things much easier than trying to store the object yourself using XML.

Java serialization has some pitfalls (like when you update your class). I would serialize in a text format. Json is my first choice here but xml and yaml would work as well.
This way you would have a file that doesn't rely on the binary version of your class.
There are several java libraries: http://www.json.org
Some examples:
http://code.google.com/p/json-simple/wiki/DecodingExamples
http://code.google.com/p/json-simple/wiki/EncodingExamples
And to save and read from the file you can use the Commons Io:
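import org.apache.commons.io.FileUtils;
import java.io.File;
import java.nio.charset.StandardCharsets;
...
File dataFile = new File("yourfile.json");
// Read the whole file into a String, stating the charset explicitly
String data = FileUtils.readFileToString(dataFile, StandardCharsets.UTF_8);
// Write a String back to the file (content is the JSON text you want to save)
FileUtils.writeStringToFile(dataFile, content, StandardCharsets.UTF_8);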

Java properties: .properties files vs xml?

I'm a newbie when it comes to properties, and I read that XML is the preferred way to store these. I noticed however, that writing a regular .properties file in the style of
foo=bar
fu=baz
also works. This would mean a lot less typing (and maybe easier to read and more efficient as well). So what are the benefits of using an XML file?

In XML you can store more complex (e.g. hierarchical) data than in a properties file. So it depends on your usecase. If you just want to store a small number of direct properties a properties file is easier to handle (though the Java properties class can read XML based properties, too).
It would make sense to keep your configuration interface as generic as possible anyway, so you have no problem to switch to another representation ( e.g. by using Apache Commons Configuration ) if you need to.

The biggest benefit to using an XML file is that XML declares its encoding, while .properties does not.
If you are translating these properties files to N languages, it is possible that these files could come back in N different encodings. And if you're not careful, you or someone else could irreversibly corrupt the character encodings.
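For what it's worth, the encoding issue can be contained on the Java side: Properties.load(InputStream) is documented to read ISO 8859-1, while load(Reader) uses whatever charset the Reader was built with. A minimal sketch, assuming the team agrees that all .properties files are UTF-8 (the class name Utf8Properties is made up):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8Properties {
    /** Loads a .properties stream, decoding it as UTF-8 instead of the ISO 8859-1 default. */
    static Properties load(InputStream in) {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        byte[] file = "greeting=h\u00e9llo\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(load(new ByteArrayInputStream(file)).getProperty("greeting"));
    }
}
```

This only helps if every file really is UTF-8; unlike XML, the file itself still carries no declaration of its encoding.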

If you have a lot of repeating data, it can be simpler to process
<connections>
<connection>this</connection>
<connection>that</connection>
<connection>the other</connection>
</connections>
than it is to process
connection1=this
connection2=that
connection3=the other
especially if you are expecting to have to store a lot of data, or it must be stored in a definite hierarchy
If you are just storing a few scalar values though, I'd go for the simple Properties approach every time

If you have both hierarchical data & duplicate namespaces, then use XML.
1) To emulate just a hierarchical structure in a properties file, simply use dot notation:
a.b=The Joker
a.b.c=Batgirl
a.b=Batman
a.b=Superman
a.b.c=Supergirl
So, complex (hierarchical) data representation is not a reason to use xml.
2) For just repeating data, we can use a 3rd party library like ini4j to peg explicitly in java a count identifier on an implicit quantifier in the properties file itself.
a.b=The Joker
a.b=Batgirl
a.b=Batman
is translated to (in the background)
a.b1=The Joker
a.b2=Batgirl
a.b3=Batman
However, numerating same name properties still doesn't maintain the specific parent-child relationships. ie. how do we represent whether Batgirl is with The Joker or Batman?
So, xml is required when both features are needed. We can now decide if the 1st xml entry is what we want or the 2nd.
[a]
[b]Joker[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
--or--
[a]
[b]Batman[/b]
[b]
[c]Batgirl[/c]
[/b]
[/a]
Further detail in ....
http://ilupper.blogspot.com/2010/05/xml-vs-properties.html

XML is handy for complex data structures and or relationships. It does a decent job for having a "common language" between systems.
However, XML comes at a cost. It is heavy to consume. You've got to load a parser, ensure the file is in the correct format, find the information, etc.
Whereas properties files are pretty lightweight and easy to read. They work for simple key/value pairs.

It depends on the data you're encoding. With XML, you can define a more complex representation of the configuration data in your application. Take something like the struts framework as an example. Within the framework you have a number of Action classes that can contain 1...n number of forward branches. With an XML configuration file, you can define it like:
<action class="MyActionClass">
<forward name="prev" targetAction="..."/>
<forward name="next" targetAction="..."/>
<forward name="help" targetAction="..."/>
</action>
This kind of association is difficult to accomplish using just the key-value pair representation of the properties file. Most likely, you would need to come up with a delimiting character and then include all of the forward actions on a single property separated by this delimiting character. It's quite a bit of work for a hackish solution.
Yet, as you pointed out, the XML syntax can become a burden if you just want to state something very simple, like set feature blah to true.

The disadvantages of XML:
It is hard to read - the tags make it look busier than it really is
The hierarchies and tags make it hard to edit and more prone to human errors
It is not possible to "append" to an XML property file to introduce a new property or provide an overriding value for an existing property so that the last one wins. The ability to append a property can be very powerful - we can implement a property management logic around this so that certain properties are "hot" and we don't need to restart the instance when these change
The Java property file solves the above problems. Consistent naming conventions and dot notation can help in solving the issue of hierarchy.
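The "last one wins" behaviour can be seen with plain java.util.Properties, since load() stores entries in the order it reads them. A small sketch (the class name LayeredProperties and the key names are invented for the example):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class LayeredProperties {
    /** Loads each source in order into one Properties object; later values replace earlier ones. */
    static Properties load(Reader... sources) {
        Properties props = new Properties();
        try {
            for (Reader source : sources) {
                props.load(source);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // An override appended to the end of the same file
        Properties appended = load(new StringReader("port=8080\ntimeout=30\nport=9090\n"));
        System.out.println(appended.getProperty("port"));

        // The same idea with a separate overrides file loaded second
        Properties layered = load(new StringReader("port=8080\n"), new StringReader("port=7070\n"));
        System.out.println(layered.getProperty("port"));
    }
}
```

Reloading on change (the "hot" properties idea) would sit on top of this and is not shown here.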

Simple properties to string conversion in Java

Using Java, I need to encode a Map<String, String> of name value pairs to store into a String, and be able to decode it again. These will be stored in a database column, and will probably usually be short and simple, so the common case should produce a simple nice looking line, but shouldn't corrupt the data, even if it contains unexpected characters, etc.
How would you choose to do it such that:
The encoded form is a single, human readable line
It doesn't require a big library or much context to encode / decode
Any delimiters are properly escaped
Url encoding? JSON? Do it yourself? Please specify any helper libraries or methods you'd use.
(Edited to specify more context and requirements as requested.)

As @Uri says, additional context would be good. I think your primary concerns are less about the particular encoding scheme, as rolling your own for most encodings is pretty easy for a simple Map<String, String>.
An interesting question is: what will this intermediate string encoding be used for?
if it's purely internal, an ad-hoc format is fine eg simple concatenation:
key1|value1|key2|value2
if humans might read it, a format like Ruby's map declaration is nice:
{ first_key => first_value,
second_key => second_value }
if the encoding is to send a serialised map over the wire to another application, the XML suggestion makes a lot of sense as it's standard-ish and reasonably self-documenting, at the cost of XML's verbosity.
<map>
<entry key='foo' value='bar'/>
<entry key='this' value='that'/>
</map>
if the map is going to be flushed to file and read back later by another Java application, @Cletus' suggestion of the Properties class is a good one, and has the additional benefit of being easy to open and inspect by human beings.
Edit: you've added the information that this is to store in a database column - is there a reason to use a single column, rather than three columns like so:
CREATE TABLE StringMaps
(
map_id NUMBER NOT NULL, -- ditch this if you only store one map...
key VARCHAR2(255) NOT NULL, -- VARCHAR2 needs a length; these sizes are only illustrative
value VARCHAR2(4000)
);
As well as letting you store more semantically meaningful data, this moves the encoding/decoding into your data access layer more formally, and allows other database readers to easily see the data without having to understand any custom encoding scheme you might use. You can also easily query by key or value if you want to.
Edit again: you've said that it really does need to fit into a single column, in which case I'd either:
use the first pipe-separated encoding (or whatever exotic character you like, maybe some unprintable-in-English unicode character). Simplest thing that works. Or...
if you're using a database like Oracle that recognises XML as a real type (and so can give you XPath evaluations against it and so on) and need to be able to read the data well from the database layer, go with XML. Writing XML parsers for decoding is never fun, but shouldn't be too painful with such a simple schema.
Even if your database doesn't support XML natively, you can just throw it into any old character-like column-type...

Why not just use the Properties class? That does exactly what you want.

I have been contemplating a similar need of choosing a common representation for the conversations (transport content) between my clients and servers via a facade pattern. I want a representation that is standardized, human-readable (brief), robust, fast. I want it to be lightweight to implement and run, easy to test, and easy to "wrap". Note that I have already eliminated XML by my definition, and by explicit intent.
By "wrap", I mean that I want to support other transport content representations such as XML, SOAP, possibly Java properties or Windows INI formats, comma-separated values (CSV) and that ilk, Google protocol buffers, custom binary formats, proprietary binary formats like Microsoft Excel workbooks, and whatever else may come along. I would implement these secondary representations using wrappers/decorators around the primary facade. Each of these secondary representations is desirable, especially to integrate with other systems in certain circumstances, but none of them is desirable as a primary representation due to various shortcomings (failure to meet one or more of my criteria listed above).
Therefore, so far, I am opting for the JSON format as my primary transport content representation. I intend to explore that option in detail in the near future.
Only in cases of extreme performance considerations would I skip translating the underlying conventional format. The advantages of a clean design include good performance (no wasted effort, ease of maintainability) for which a decent hardware selection should be the only necessary complement. When performance needs become extreme (e.g., processing forty thousand incoming data files totaling forty million transactions per day), then EVERYTHING has to be revisited anyway.
As a developer, DBA, architect, and more, I have built systems of practically every size and description. I am confident in my selection of criteria, and eagerly await confirmation of its suitability. Indeed, I hope to publish an implementation as open-source (but don't hold your breath quite yet).
Note that this design discussion ignores the transport medium (HTTP, SMTP, RMI, .Net Remoting, etc.), which is intentional. I find that it is much more effective to treat the transport medium and the transport content as completely separate design considerations, from each other and from the system in question. Indeed, my intent is to make these practically "pluggable".
Therefore, I encourage you to strongly consider JSON. Best wishes.

Some additional context for the question would help.
If you're going to be encoding and decoding at the entire-map granularity, why not just use XML?

As @DanVinton says, if you need this for internal use (by "internal use" I mean it's used only by my components, not components written by others), you can concatenate keys and values.
I prefer to use different separators between one pair and the next and between a key and its value:
Instead of
key1+SEPARATOR+value1+SEPARATOR+key2 etc
I code
key1+SEPARATOR_KEY_AND_VALUE+value1+SEPARATOR_KEY(n)_AND_KEY(N+1)+key2 etc
if you must debug, this way is clearer (by design too)
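A sketch of the two-separator idea with escaping added, so that separators inside keys or values can't corrupt the line. The concrete choices ('=' between key and value, ';' between pairs, backslash as the escape character, newline written as \n) and the class name MapCodec are assumptions for the example; null values are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MapCodec {
    private static final char PAIR_SEP = ';'; // between one pair and the next
    private static final char KV_SEP = '=';   // between a key and its value
    private static final char ESC = '\\';

    /** Encodes the map as key=value;key=value on a single line. */
    static String encode(Map<String, String> map) {
        StringBuilder out = new StringBuilder();
        boolean first = true;
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (!first) out.append(PAIR_SEP);
            first = false;
            escape(e.getKey(), out);
            out.append(KV_SEP);
            escape(e.getValue(), out);
        }
        return out.toString();
    }

    /** Reverses encode(); an empty line decodes to an empty map. */
    static Map<String, String> decode(String line) {
        Map<String, String> map = new LinkedHashMap<>();
        if (line.isEmpty()) return map;
        StringBuilder current = new StringBuilder();
        String key = null;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == ESC && i + 1 < line.length()) {
                char next = line.charAt(++i);
                current.append(next == 'n' ? '\n' : next); // \n is a newline, anything else is literal
            } else if (c == KV_SEP && key == null) {
                key = current.toString();
                current.setLength(0);
            } else if (c == PAIR_SEP) {
                map.put(key, current.toString());
                key = null;
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        map.put(key, current.toString());
        return map;
    }

    private static void escape(String s, StringBuilder out) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n') {
                out.append(ESC).append('n'); // keeps the encoded form on one line
            } else {
                if (c == ESC || c == KV_SEP || c == PAIR_SEP) out.append(ESC);
                out.append(c);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new LinkedHashMap<>();
        map.put("colour", "red");
        map.put("note", "a=b; c");
        String line = encode(map);
        System.out.println(line);
        System.out.println(decode(line).equals(map));
    }
}
```

In the common case with no special characters the stored line is just colour=red;size=10, which stays readable in the database column.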

Check out the apache commons configuration package. This will allow you to read/save a file as XML or properties format. It also gives you an option of automatically saving the property changes to a file.
Apache Configuration

I realise this is an old "deadish" thread, but I've got a solution not posited previously which I think is worth throwing in the ring.
We store "arbitrary" attributes (i.e. created by the user at runtime) of geographic features in a single CLOB column in the DB in the standard XML attributes format. That is:
name="value" name="value" name="value"
To create an XML element you just "wrap up" the attributes in an xml element. That is:
String xmlString = "<arbitraryAttributes " + arbitraryAttributesString + " />";
"Serialising" a Properties instance to an xml-attributes-string is a no-brainer... it's like ten lines of code. We're lucky in that we can impose on the users the rule that all attribute names must be valid xml-element-names; and we xml-escape (i.e. &quot; etc) each "value" to avoid problems from double-quotes and whatever in the value strings.
It's effective, flexible, fast (enough) and simple.
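Roughly what that serialising step might look like; this is a sketch, not the code from the system described. It takes a Map<String, String> rather than a Properties instance so the attribute order is predictable, assumes every name is already a valid XML name, and the class name XmlAttributes is made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class XmlAttributes {
    /** Renders the map as name="value" name="value" with each value XML-escaped. */
    static String toAttributeString(Map<String, String> attributes) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : attributes.entrySet()) {
            if (out.length() > 0) out.append(' ');
            // Names are assumed to be valid XML names already, so only values are escaped
            out.append(e.getKey()).append("=\"").append(escape(e.getValue())).append('"');
        }
        return out.toString();
    }

    /** Escapes the characters that are unsafe inside a double-quoted attribute value. */
    static String escape(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("name", "Lake \"Big\" & <Small>");
        attributes.put("depth", "12");
        System.out.println("<arbitraryAttributes " + toAttributeString(attributes) + " />");
    }
}
```

Reading the string back is then a job for any XML parser once it has been wrapped in an element as shown in main.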
Now, having said all that... if we had the time again, we'd just totally divorce ourselves from the whole "metadata problem" by storing the complete unadulterated uninterpreted metadata xml-document in a CLOB and use one of the open-source metadata editors to handle the whole mess.
Cheers. Keith.
