How can I make sure that a file is readable by humans?
By that I essentially want to check whether the file is a txt, a yml, a doc, a json file, and so on.
The issue is that in the case where I want to perform this check, file extensions are misleading. By that I mean that a plain text file (which should be .txt) has an extension of .d, among various others :-(
What is the best way to verify that a file can be read by humans?
So far I have tried my luck with extensions as follows:
private boolean humansCanRead(String extension) {
    switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
    }
}
But as I said, the extensions are not as expected.
EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. And to narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I don't really mind if the text in the file makes sense; it could be encoded, I don't really care at this point.
Thanks so far for all the responses! :D
In general, you cannot do that. You could use a language identification algorithm to guess whether a given text is a text that could be spoken by humans. Since your examples include formal languages like HTML, however, you are in some deep trouble. If you really want to implement your check for a (finite) set of formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": e.g. do you include Base64?
Edit: In case you are only interested in the character set: see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
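A minimal sketch of that character-set check in Java, using a CharsetDecoder configured to reject malformed input rather than silently replace it. Note this is only a heuristic: any binary file that happens to decode cleanly (pure ASCII, for instance) would also pass.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8Check {

    // Returns true if the given bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0) {
            byte[] data = Files.readAllBytes(Path.of(args[0]));
            System.out.println(isValidUtf8(data) ? "decodes as UTF-8" : "not valid UTF-8");
        }
    }
}
```

The same decoder setup works for any other charset you consider "human readable"; swap `StandardCharsets.UTF_8` for `Charset.forName(...)`.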
For some files, a check on the proportion of bytes in the printable ASCII range will help. If more than 75% of the bytes are in that range within the first few hundred bytes then it is probably 'readable'.
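That 75% heuristic could be sketched like this; the 512-byte window and the exact threshold are arbitrary choices for illustration, not anything standard:

```java
public class PrintableCheck {

    // Heuristic: proportion of printable ASCII (plus tab/newline/CR)
    // within the first few hundred bytes of the file.
    static boolean looksReadable(byte[] bytes) {
        int sample = Math.min(bytes.length, 512);
        if (sample == 0) {
            return false;
        }
        int printable = 0;
        for (int i = 0; i < sample; i++) {
            int b = bytes[i] & 0xFF;
            if ((b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        // "Probably readable" if at least 75% of the sampled bytes are printable.
        return printable * 100 / sample >= 75;
    }
}
```

Bear in mind this rejects perfectly readable non-ASCII text (UTF-16, or UTF-8 in non-Latin scripts), so it only works as one check among several.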
Some files have headers, like the various forms of BOM on UTF files, the 0xA5EC magic that starts binary MS Word documents, or the "MZ" signature at the start of .exe files, which will tell you whether the file is readable or not.
A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.
Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
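A sketch of that signature sniffing in Java. The table below is illustrative and far from exhaustive; note that legacy .doc files live inside an OLE compound container, whose outer signature is D0 CF 11 E0:

```java
import java.util.Arrays;

public class MagicSniffer {

    // Check the first bytes of a file against a few well-known signatures.
    static String sniff(byte[] head) {
        if (startsWith(head, (byte) 0xEF, (byte) 0xBB, (byte) 0xBF)) return "UTF-8 text (BOM)";
        if (startsWith(head, (byte) 0xFF, (byte) 0xFE)) return "UTF-16LE text (BOM)";
        if (startsWith(head, (byte) 0xFE, (byte) 0xFF)) return "UTF-16BE text (BOM)";
        if (startsWith(head, (byte) 'M', (byte) 'Z')) return "Windows executable";
        if (startsWith(head, (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0)) return "OLE compound document (.doc/.xls)";
        if (startsWith(head, (byte) 'P', (byte) 'K')) return "ZIP container (.docx, .jar, ...)";
        return "unknown";
    }

    static boolean startsWith(byte[] data, byte... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

Reading only the first kilobyte, as suggested above, is enough for all of these checks.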
I have two XML reports from the execution of two programs. Each report contains a section which lists all the I/O operations executed, along with the content of each one. Some of the contents are XML, others are binary, but the data contained within the report is always textual, so I have something similar to this:
.....0.................. .......................#........'F...O)v...O*......................0..........l...c...=
Y!...!pvw.........(.........E...
yY...-qVC......p...K,......Pm.........Si4........,.......C0....?0....'...................K0....0
. *...H......
....0I1.0 ..U....US1.0...U.
.
Google Inc1%0#..U....Google Internet Authority G20..
140423121609Z.
140722000000Z0f1.0 ..U....US1.0...U...
California1.0...U...
Mountain View1.0...U.
.
Google Inc1.0...U....*.google.com0...."0
. *...H......
..........0....
..............>..........:...z...S...5...%f............-....*J...i.......c}m......N%...t....G..f.......y.........0x...F.........:......k...k$......!............I...A...........A...G.......q...C...g........r.......b....6.......c...|X.........F...?qs......'.........................mrM.....D....9...
....$...v... .........=.........amAdo..V.....................#.../... U~....r......... .........g_ ...[y...7=...i... >......b......s...........W......#w..............e..........yI.........{..............0.....0...U.%..0...+.........+.......0.........U..........0.......*.google.com...
*.android.com....*.appengine.google.com....*.cloud.google.com....*.google-analytics.com....*.google.ca....*.google.cl....*.google.co.in....*.google.co.jp....*.google.co.uk....*.google.com.ar....*.google.com.au....*.google.com.br....*.google.com.co....*.google.com.mx....*.google.com.tr....*.google.com.vn....*.google.de....*.google.es....*.google.fr....*.google.hu....*.google.it....*.google.nl....*.google.pl....*.google.pt....*.googleapis.cn....*.googlecommerce.com....*.googlevideo.com...
*.gstatic.com...
*.gvt1.com....*.urchin.com....*.url.google.com....*.youtube-nocookie.com...
*.youtube.com....*.youtubeeducation.com....*.ytimg.com....android.com....g.co....goo.gl....google-analytics.com...
google.com....googlecommerce.com...
urchin.com....youtu.be....youtube.com....youtubeeducation.com0h..+.........0Z0+..+.....0.....http://pki.google.com/GIAG2.crt0+..+.....0.....http://clients1.google.com/ocsp0...U.........XV.H...%....r..!.......y...'0...U.........00...U.#..0.....J............h...v...b....Z.../0...U. ..0.0..
+.......y...00..U...)0'0%...#...!....http://pki.google.com/GIAG2.crl0
. *...H......
..........A...d...A~A..0...P-JY/........"..M...N.=...H....n%...A......u......2...X......I........F...%....%p..............K...j...A.............g$Y...h....K....E...m......s/......t.....S..SN...Wo.B6.......a......|.............q........?.............y...N....K=....1......|+......3=.....6....j...&...H?.1.....X.H..#V".k.............-.....C.....5S......$.G............eMY(...1+,.e...v"......K...C...}.....V............28K......[......4A.Vr.......C0....?0....'...................K0....0
. *...H......
....0I1.0 ..U....US1.0...U.
I have to compare these segments to find similarities, i.e. to find whether the two programs wrote/read similar content to/from the filesystem. Also, since there are many I/O operations (100s) and many reports (10000s), I should do it pretty quickly. I am working with Java.
Any advice?
In the end I used the Normalized Compression Distance. I don't know yet if this is the best approach for my data anyway...
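For reference, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size under some real compressor. A minimal Java sketch using DEFLATE from java.util.zip; for very short segments, a compressor with less per-stream overhead may discriminate better:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

public class Ncd {

    // Compressed size of data under DEFLATE, standing in for C(x).
    static int compressedSize(byte[] data) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(baos)) {
            dos.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.size();
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    // NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
    // Close to 0 for similar inputs, close to 1 for unrelated ones.
    static double ncd(byte[] x, byte[] y) {
        int cx = compressedSize(x);
        int cy = compressedSize(y);
        int cxy = compressedSize(concat(x, y));
        return (cxy - Math.min(cx, cy)) / (double) Math.max(cx, cy);
    }
}
```

With 100s of operations across 10000s of reports, caching C(x) per segment halves the work, since only C(xy) depends on the pair.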
In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large amount of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
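Java has no direct standard-library counterpart to SecurityElement.Escape, so hand-rolled helpers like the one below (written here purely for illustration) are common, and forgetting one of the five special characters is exactly the kind of bug described above:

```java
public class XmlEscape {

    // Naive concatenation: breaks as soon as the value contains markup characters.
    static String naive(String value) {
        return "<name>" + value + "</name>";
    }

    // The five characters that must be escaped in XML text and attribute content.
    // '&' must be replaced first so the entities themselves are not re-escaped.
    static String escape(String value) {
        return value.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }
}
```

`naive("Tom & Jerry")` yields `<name>Tom & Jerry</name>`, which is not well-formed XML; that is the "randomly fails in production" case.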
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real", this is my preferred approach for dealing with XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(new XDeclaration("1.0", "UTF-8", "yes"),
    new XElement("products",
        from p in collection
        select new XElement("product",
            new XAttribute("guid", p.ProductId),
            new XAttribute("title", p.Title),
            new XAttribute("version", p.Version))));
Can you find a way to do it more easily than this? I can output it to a browser, save it to a document, add attributes/elements in seconds, and so on ... just by adding a couple of lines of code. I can do practically everything with it without much effort.
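For readers working in Java, roughly the same output can be built with the standard DOM API; the `Product` record and its field names below are made up to mirror the C# example:

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ProductsXml {

    // Hypothetical stand-in for the product objects in the C# example.
    record Product(String productId, String title, String version) {}

    static String toXml(List<Product> collection) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element products = doc.createElement("products");
            doc.appendChild(products);
            for (Product p : collection) {
                Element product = doc.createElement("product");
                product.setAttribute("guid", p.productId());
                product.setAttribute("title", p.title());
                product.setAttribute("version", p.version());
                products.appendChild(product);
            }
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

It is more verbose than the Linq-to-Xml version, but escaping and well-formedness are handled by the library in both cases, which is the point of the whole discussion.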
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
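That question is easy to probe empirically in Java: `CharsetEncoder.canEncode` tells you whether a string is representable at all in a given encoding. IBM850 (the DOS Latin-1 code page) covers Western European characters but little else, so non-Latin text written into a document declared as ibm850 is silently unrepresentable:

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncodingCheck {

    // Can this string be represented in the named charset at all?
    static boolean encodable(String s, String charsetName) {
        CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
        return enc.canEncode(s);
    }
}
```

A library that builds the document for you will either transcode correctly or fail loudly; manual concatenation just produces mojibake.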
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think that the problem is that you aren't looking at the XML file as a logical data store, but as a simple text file where you write strings.
It's obvious that those libraries do string manipulation for you, but reading/writing XML should be something similar to saving data into a database, or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development time or at maintenance time. The choice is always yours, but history suggests that maintenance is always more costly, and thus anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In Sander's Linq-to-Xml example, for instance, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces etc. correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and correct it. But until I have a working save/load system, I refuse to perform mission-critical work, because I need to know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, but for developing/testing/researching/debugging, this would be fine. However, I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
I have done a lot of searching on the web for this, but I've found nothing, even though I feel like it has to be somewhat common. I have used Mahout's seqdirectory command to convert a folder containing text files (each file is a separate document) in the past. But in this case there are so many documents (in the 100,000s) that I have one very large text file in which each line is a document. How can I convert this large file to SequenceFile format so that Mahout understands that each line should be considered a separate document? Thank you very much for any help.
Yeah, it is not quite apparent or very intuitive how to do this, although (lucky for you :P) I have answered that exact question several times here in stack, for instance here. Have a look ;)
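For readers who don't want to follow links: a sketch of the conversion using Hadoop's SequenceFile.Writer directly, writing each line as one Text key/value pair, which is the layout Mahout's text tools expect. The key naming scheme and command-line paths below are assumptions for illustration, not anything Mahout mandates:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {

    // args[0]: large input text file, one document per line
    // args[1]: output SequenceFile path
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]));
             SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(new Path(args[1])),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
            String line;
            long id = 0;
            while ((line = reader.readLine()) != null) {
                // seqdirectory uses the file path as the document id;
                // here each line gets a synthetic id instead.
                writer.append(new Text("/doc-" + id++), new Text(line));
            }
        }
    }
}
```

This requires the Hadoop client libraries on the classpath, which Mahout already pulls in.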
I'd like to generate an image file showing some mathematical expression, taking a String like "(x+a)^n=∑_(k=0)^n" as input and getting a more (human-)readable image file as output. AFAIK stuff like that is used on Wikipedia, for example. Are there any Java libraries that do that?
Or maybe I use the wrong approach. What would you do if the requirement was to enable pasting of formulas from MS Word into an HTML-document? I'd ask the user to just make a screenshot himself, but that would be the lazy way^^
Edit: Thanks for the answers so far, but I really do not control the input. What I get is some messy Word-style formula, not a clean LaTeX-formatted one.
Edit2: http://www.panschk.de/text.tex
Looks a bit like LaTeX, doesn't it? That's what I get when I do
clipboard.getContents(RTFTransfer.getInstance()) after having pasted a formula from Word07.
First and foremost you should familiarize yourself with TeX (and LaTeX) - a famous typesetting system created by Donald Knuth. Typesetting mathematical formulae is an advanced topic with many opinions and much attention to detail - therefore use something that builds upon TeX. That way you are sure to get it right ;-)
Edit: Take a look at texvc
It can output to PNG, HTML, MathML. Check out the README
Edit #2 Convert that messy Word-stuff to TeX or MathML?
My colleague found a surprisingly simple solution for this very specific problem: when you copy formulas from Word 2007, they are also stored as "HTML" in the clipboard. As representing formulas in HTML isn't easy either, Word just creates a temporary image file on the fly and embeds it into the HTML code. You can then simply take the temporary formula image and copy it somewhere else. Problem solved ;)
What you're looking for is Latex.
MiKTeX is a nice little application for churning out images using LaTeX.
I'd like to look into creating them on-the-fly though...
Steer clear of LaTeX. Seriously.
Check out JEuclid. It can convert MathML expressions into images.