Compare text representations of binary files - Java

I have two XML reports from the execution of two programs. Each report contains a section that lists all the I/O operations executed, along with the content of each one. Some of that content is XML, some is binary, but the data contained in the report is always textual, so I have something similar to this:
.....0.................. .......................#........'F...O)v...O*......................0..........l...c...=
Y!...!pvw.........(.........E...
yY...-qVC......p...K,......Pm.........Si4........,.......C0....?0....'...................K0....0
. *...H......
....0I1.0 ..U....US1.0...U.
.
Google Inc1%0#..U....Google Internet Authority G20..
140423121609Z.
140722000000Z0f1.0 ..U....US1.0...U...
California1.0...U...
Mountain View1.0...U.
.
Google Inc1.0...U....*.google.com0...."0
. *...H......
..........0....
..............>..........:...z...S...5...%f............-....*J...i.......c}m......N%...t....G..f.......y.........0x...F.........:......k...k$......!............I...A...........A...G.......q...C...g........r.......b....6.......c...|X.........F...?qs......'.........................mrM.....D....9...
....$...v... .........=.........amAdo..V.....................#.../... U~....r......... .........g_ ...[y...7=...i... >......b......s...........W......#w..............e..........yI.........{..............0.....0...U.%..0...+.........+.......0.........U..........0.......*.google.com...
*.android.com....*.appengine.google.com....*.cloud.google.com....*.google-analytics.com....*.google.ca....*.google.cl....*.google.co.in....*.google.co.jp....*.google.co.uk....*.google.com.ar....*.google.com.au....*.google.com.br....*.google.com.co....*.google.com.mx....*.google.com.tr....*.google.com.vn....*.google.de....*.google.es....*.google.fr....*.google.hu....*.google.it....*.google.nl....*.google.pl....*.google.pt....*.googleapis.cn....*.googlecommerce.com....*.googlevideo.com...
*.gstatic.com...
*.gvt1.com....*.urchin.com....*.url.google.com....*.youtube-nocookie.com...
*.youtube.com....*.youtubeeducation.com....*.ytimg.com....android.com....g.co....goo.gl....google-analytics.com...
google.com....googlecommerce.com...
urchin.com....youtu.be....youtube.com....youtubeeducation.com0h..+.........0Z0+..+.....0.....http://pki.google.com/GIAG2.crt0+..+.....0.....http://clients1.google.com/ocsp0...U.........XV.H...%....r..!.......y...'0...U.........00...U.#..0.....J............h...v...b....Z.../0...U. ..0.0..
+.......y...00..U...)0'0%...#...!....http://pki.google.com/GIAG2.crl0
. *...H......
..........A...d...A~A..0...P-JY/........"..M...N.=...H....n%...A......u......2...X......I........F...%....%p..............K...j...A.............g$Y...h....K....E...m......s/......t.....S..SN...Wo.B6.......a......|.............q........?.............y...N....K=....1......|+......3=.....6....j...&...H?.1.....X.H..#V".k.............-.....C.....5S......$.G............eMY(...1+,.e...v"......K...C...}.....V............28K......[......4A.Vr.......C0....?0....'...................K0....0
. *...H......
....0I1.0 ..U....US1.0...U.
I have to compare these segments to find similarities, i.e. to determine whether the two programs wrote/read similar content to/from the filesystem. Also, since there are many I/O operations (hundreds) and many reports (tens of thousands), the comparison needs to be pretty quick. I am working in Java.
Any advice?

In the end I used the Normalized Compression Distance (NCD). I don't know yet whether this is the best approach for my data, though...
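For reference, NCD estimates similarity from compressed sizes: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(d) is the length of d after compression. A minimal sketch using the JDK's built-in DEFLATE compressor (the class and method names are mine; stronger compressors such as bzip2 or LZMA, available through third-party libraries, often give better estimates):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

public final class Ncd {

    // C(d): length of d after DEFLATE compression.
    private static int compressedSize(byte[] data) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buf)) {
            out.write(data);
        }
        return buf.size();
    }

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    public static double distance(byte[] x, byte[] y) throws IOException {
        int cx = compressedSize(x);
        int cy = compressedSize(y);
        byte[] xy = new byte[x.length + y.length];
        System.arraycopy(x, 0, xy, 0, x.length);
        System.arraycopy(y, 0, xy, x.length, y.length);
        int cxy = compressedSize(xy);
        return (cxy - (double) Math.min(cx, cy)) / Math.max(cx, cy);
    }
}
```

Near-identical inputs score close to 0 and unrelated inputs close to 1, so the score can be thresholded to group similar I/O segments.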

Related

How to detect mistakes in IRIs in an RDF file?

I am trying to make an RDF corrector. One of the things I specifically want to correct is IRIs. My question: irrespective of the RDF format, is there anything I can do to correct mistakes in an IRI? I understand there can be any number of mistakes, but what are the most common mistakes that I can fix?
I am using ANTLR to build the corrector. I have extended BaseErrorListener so that it reports the errors found in the IRIs in particular.
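The listener itself isn't shown in the question; a minimal version of that idea might look like this (the class name and error format are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;

// Collects IRI syntax errors instead of letting ANTLR print them to stderr.
public class IriErrorListener extends BaseErrorListener {

    private final List<String> errors = new ArrayList<>();

    @Override
    public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                            int line, int charPositionInLine,
                            String msg, RecognitionException e) {
        errors.add("line " + line + ":" + charPositionInLine + " " + msg);
    }

    public List<String> getErrors() {
        return errors;
    }
}
```

Attach it with parser.removeErrorListeners() followed by parser.addErrorListener(new IriErrorListener()) before invoking the start rule.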
In my experience, the errors made in the real world depend on the source. A source may be systematically creating IRIs with spaces in them, or the data may have been binary-copied between ISO-8859-1 ("Latin-1") and UTF-8 (the correct encoding), which corrupts the UTF-8. These low-level errors are best fixed with a text editor on the input file (and by correcting the code that generates them).
Try a few sample IRIs at http://www.sparql.org/iri-validator.html, which prints out warnings and errors and uses the same code as the parsers.

Security of uploading and parsing Named Binary Tag files (NBT) via PHP

I'm building an application that deals with uploading/downloading Named Binary Tag (NBT) files.
After they're uploaded I need to parse them and get some information.
I'm a bit concerned security-wise, as I don't have the necessary knowledge to properly understand how they're built or what kind of data to expect from them.
What are some sanity checks that I can perform, when the files are uploaded, to make sure that they are indeed NBT files?
Should I be concerned when parsing them?
If there's anything else I should be concerned about, please do tell.
I realize these are vague questions. There aren't a lot of answers on Google, else I wouldn't be here.
The NBT file format is really simple and compact: a binary stream (uncompressed or gzipped), originally specified by Notch.
One "problem" comes with specially crafted NBT files containing huge numbers of empty lists and lists of lists: the memory overhead of parsing these can result in service failure, mostly because the objects created for each entry simply fill your memory.
One solution is to limit the number of entries you read and, on reaching that limit, abort and drop the file.
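A minimal sketch of such pre-parse checks in Java (the class and method names are mine; the magic values are the standard gzip header bytes and NBT's TAG_Compound id):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.zip.GZIPInputStream;

// Pre-parse sanity checks for an uploaded NBT file:
//  1. detect the gzip magic bytes (0x1f 0x8b) and unwrap if present,
//  2. require the root tag to be TAG_Compound (type id 10),
//  3. (in your parser) count entries and abort past a fixed cap.
public final class NbtSanity {

    private static final int TAG_COMPOUND = 10;

    public static DataInputStream openChecked(InputStream upload) throws IOException {
        PushbackInputStream in = new PushbackInputStream(upload, 2);
        int b1 = in.read();
        int b2 = in.read();
        if (b1 == -1 || b2 == -1) throw new IOException("file too short");
        in.unread(b2);
        in.unread(b1);
        InputStream plain = (b1 == 0x1f && b2 == 0x8b) ? new GZIPInputStream(in) : in;
        DataInputStream data = new DataInputStream(plain);
        if (data.readByte() != TAG_COMPOUND) {
            throw new IOException("root tag is not TAG_Compound; not an NBT file");
        }
        // Note: one byte (the root tag id) has now been consumed; a real
        // parser would continue from here, counting tags against a limit.
        return data;
    }
}
```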
I recently published a Java library for reading NBT files (though without such a limit); maybe it helps you understand the file format.
Edit: I forgot to share this article about the "exploit": http://arstechnica.com/security/2015/04/just-released-minecraft-exploit-makes-it-easy-to-crash-game-servers/

What is the fastest file / way to parse a large data file?

So I am working on a GAE project. I need to look up cities, country names, and country codes for sign-ups, LBS, etc.
I figured that putting all this information in the Datastore would be rather wasteful: it will be read quite frequently and would eat into my Datastore quota for no reason, especially since these lists aren't going to change, so it's pointless to put them in the Datastore.
Now that leaves me with a few options:
API - no budget for paid services, and the free ones are not exactly reliable.
Upload a parseable file - the favorable option, as I like the certainty that the data will always be there.
So I got the files needed from GeoNames (the link has source files for all countries in case someone needs them). The file for each country is a regular UTF-8 tab-delimited file, which is great.
However, now that I have the option to choose how to format and access the data, the question is:
What is the best way to format and retrieve data systematically from a static file in a Java servlet container?
"Best" meaning the fastest and least resource-hungry method.
Valid options:
TXT file, tab delimited
Static XML file
Java class with tons of enums
I know that importing the country files as Java enums and iterating over their values would be very fast, but do you think this would push memory use beyond reasonable limits? On the other hand, reading a text file line by line means every lookup loops through a few thousand lines until it finds the required record: no memory issues, but incredibly slow. I have some experience parsing an Excel file in a Java servlet, and it took something like 20 seconds just to parse 250 records; at scale, response times WILL time out (no doubt about it). Is XML anything like Excel in that respect?
Thank you very much! Please share your opinions; anything and everything is appreciated.
The easiest and fastest way would be to keep the file as a static web resource under the WEB-INF folder and, on application startup, have a context listener load it into memory.
In memory it should be a Map, keyed by whatever you want to search by; this gives you effectively constant access time. A sketch of this is below.
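A minimal sketch of that idea, assuming a tab-delimited resource at /WEB-INF/countries.txt with the lookup key in the first column (both are assumptions; adjust to your data):

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

// Loads the tab-delimited data into a Map once, at application startup.
@WebListener
public class CountryDataListener implements ServletContextListener {

    private static final Map<String, String[]> BY_CODE = new HashMap<>();

    @Override
    public void contextInitialized(ServletContextEvent event) {
        // Path and key column are assumptions; adjust to your file layout.
        try (InputStream raw = event.getServletContext()
                .getResourceAsStream("/WEB-INF/countries.txt");
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(raw, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                BY_CODE.put(cols[0], cols);
            }
        } catch (Exception e) {
            throw new RuntimeException("failed to load country data", e);
        }
    }

    @Override
    public void contextDestroyed(ServletContextEvent event) {
    }

    // Constant-time lookup for the rest of the application.
    public static String[] lookup(String code) {
        return BY_CODE.get(code);
    }
}
```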
Memory consumption only matters if the data set is really big. A hundred thousand records, for example, is not worth optimizing if you access them many times.
The static file should be plain text or CSV; these are read and parsed most efficiently. There is no need for XML, as parsing it would be slower.
If the list is really big, you can break it up into multiple smaller files and parse them only when they are required. A reasonable, easy partitioning would be by country, but any other scheme would work (for example, by the first few characters of the name).
You could also consider building this Map in memory once, serializing it to a binary file, and including that binary file as a static resource; then you only need to deserialize the Map, with no need to parse it as a text file and build the objects yourself.
Improvements on the data file
An alternative to having the static resource as a text/CSV file or a serialized Map would be a binary data file with your own custom format.
Using DataOutputStream you can write data to a binary file in a very compact and efficient way. Then you could use DataInputStream to load data from this custom file.
This solution has the advantage that the file can be much smaller (compared to plain text / CSV / serialized Map) and loads much faster, because DataInputStream doesn't parse numbers from text, for example; it reads the bytes of a number directly.
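A sketch of such a custom binary format, assuming each record is just a name plus a single integer (the record layout is illustrative):

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Custom binary format: an int record count, then per record a UTF string
// and an int. No text parsing is needed when reading it back.
public final class BinaryCatalog {

    public static void write(File file, Map<String, Integer> records) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            out.writeInt(records.size());
            for (Map.Entry<String, Integer> entry : records.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeInt(entry.getValue());
            }
        }
    }

    public static Map<String, Integer> read(File file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            int count = in.readInt();
            Map<String, Integer> records = new HashMap<>(count * 2);
            for (int i = 0; i < count; i++) {
                records.put(in.readUTF(), in.readInt());
            }
            return records;
        }
    }
}
```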
Hold the data in source form as XML. At start of day, or when it changes, read it into memory: that's the only time you incur the parsing cost. There are then two main options:
(a) your in-memory form is still an XML tree, and you use XPath/XQuery to query it.
(b) your in-memory form is something like a Java HashMap.
If the data is very simple then (b) is probably best, but it only allows you to do one kind of query, which is hard-coded. If the data is more complex or you have a variety of possible queries, then (a) is more flexible.
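For option (a), a minimal sketch using the JDK's built-in DOM and XPath APIs (the file name and XPath expression are hypothetical):

```java
import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;

public class XmlLookup {

    public static void main(String[] args) throws Exception {
        // Parse once at startup; the DOM tree stays in memory afterwards.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("countries.xml")); // hypothetical file name
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Query the in-memory tree; the expression is illustrative.
        String name = xpath.evaluate("/countries/country[@code='US']/name", doc);
        System.out.println(name);
    }
}
```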

Mahout: converting one large text file to SequenceFile format

I have done a lot of searching on the web for this, but I've found nothing, even though I feel it has to be fairly common. In the past I have used Mahout's seqdirectory command to convert a folder of text files (each file a separate document). But in this case there are so many documents (in the 100,000s) that I have one very large text file in which each line is a document. How can I convert this large file to SequenceFile format so that Mahout understands that each line should be treated as a separate document? Thank you very much for any help.
Yeah, it is not quite apparent or very intuitive how to do this, although (lucky for you :P) I have answered that exact question several times here on Stack Overflow, for instance here. Have a look ;)
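For the record, the usual approach is to write the SequenceFile yourself with the Hadoop API, appending one Text key/value pair per line, which is the layout Mahout's seq2sparse step expects. A sketch (the file names are placeholders; the createWriter overload shown is the classic one and is deprecated in newer Hadoop versions):

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// One SequenceFile entry per input line: key = synthetic document id,
// value = the document text.
public class LinesToSequenceFile {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        BufferedReader reader = new BufferedReader(new FileReader("bigfile.txt"));
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path("documents.seq"), Text.class, Text.class);
        try {
            String line;
            long id = 0;
            while ((line = reader.readLine()) != null) {
                writer.append(new Text("/doc-" + id++), new Text(line));
            }
        } finally {
            writer.close();
            reader.close();
        }
    }
}
```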

Best file format regarding standard string and integer data?

For my project, I need to store info about protocols (the data sent, most likely integers, and the order in which it's sent) and info that might be formatted something like this:
'ID' 'STRING' 'ADDITIONAL INTEGER DATA'
This info will be read by a Java program and stored in memory for processing, but what would be the most sensible format to store this data in?
EDIT: Here's some extra information:
1) I will be using this data in a game server.
2) Since it is a game server, speed is not the primary concern: this data will primarily be read and used during startup, which shouldn't happen very often.
3) Memory consumption, however, I would like to keep to a minimum.
4) The second data "example" will be used as a "dictionary" to look up names of specific in-game items, their stats, and other integer data (and might therefore become very large, unlike the first data set containing the protocol information, where each file only describes small protocol bits, like a login protocol for instance).
5) And yes, I would like the data to be human-editable.
EDIT 2: Here are the choices I've made:
JSON - For the protocol descriptions
CSV - For the dictionaries
There are many factors that come into play; here are some things that might help you figure this out:
1) Speed/memory usage: If the data needs to load very quickly or is very large, you'll probably want to consider rolling your own binary format.
2) Portability/compatibility: Balanced against #1 is the consideration that you might want to use the data elsewhere, with programs that won't read a custom binary format. In this case, your heavy hitters are probably going to be CSV, dBase, XML, and my personal favorite, JSON.
3) Simplicity: Delimited formats like CSV are easy to read, write, and edit by hand. Either use double-quoting with proper escaping or choose a delimiter that will not appear in the data.
If you could post more info about your situation and how important these factors are, we might be able to guide you further.
How about XML, JSON, or CSV?
I've written a similar protocol-specification using XML. (Available here.)
I think it is a good match, since it captures the hierarchical nature of specifying messages / network packets / fields, etc. The order of fields is well defined, and so on.
I even wrote a code generator in XSLT that generated the message sending/receiving classes, with methods for each message type.
The only drawback as I see it is the verbosity. If your specification has a really simple structure, I would suggest using a simple home-brewed format and writing a parser for it with a parser generator of your choice.
In addition to the formats suggested by others here (CSV, XML, JSON, etc.) you might consider storing the info in a Java properties file. (See the java.util.Properties class.) The code is already there for you, so all you have to figure out is the properties names (or name prefixes) you want to use.
The Properties class also provides for storing/loading properties in a simple XML format.
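A minimal sketch of the properties-file idea, using hypothetical key prefixes for the item dictionary:

```java
import java.io.FileReader;
import java.util.Properties;

public class ItemDictionary {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Hypothetical file with entries such as:
        //   item.1.name=Bronze Sword
        //   item.1.attack=7
        try (FileReader in = new FileReader("items.properties")) {
            props.load(in);
        }
        String name = props.getProperty("item.1.name");
        int attack = Integer.parseInt(props.getProperty("item.1.attack", "0"));
        System.out.println(name + " / attack=" + attack);
    }
}
```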
