How can I make sure that a file is readable by humans?
By that I essentially mean I want to check whether the file is a .txt, a .yml, a .doc, a .json file and so on.
The issue is that in the cases where I want to perform this check, file extensions are misleading. By that I mean that a plain text file (which should be .txt) may have an extension of .d, or various others :-(
What is the best way to verify that a file can be read by humans?
So far I have tried my luck with extensions as follows:
private boolean humansCanRead(String extension) {
    // Crude whitelist check based only on the file extension.
    switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
    }
}
But as I said, extensions are not always what you would expect.
EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. I also don't really mind whether the text in the file makes sense, e.g. if it is encoded; I don't really care about that at this point.
Thanks so far for all the responses! :D
In general, you cannot do that. You could use a language identification algorithm to guess whether a given text could be spoken by humans. Since your example contains formal languages like HTML, however, you are in some deep trouble. If you really want to implement your check for a (finite) set of formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": e.g., do you include Base64?
edit: In case you are only interested in the character set: see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
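A minimal sketch of that check, assuming UTF-8 is the encoding you treat as human readable (the file path is just an example):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf8Check {
    // Returns true if the whole file decodes as valid UTF-8.
    static boolean isValidUtf8(String path) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(path));
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}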
For some files, a check on the proportion of bytes in the printable ASCII range will help. If more than 75% of the bytes are in that range within the first few hundred bytes, then it is probably 'readable'.
Some files have headers, like the various forms of BOM on UTF files, the 0xA5EC marker used by MS .doc files, or the "MZ" signature at the start of .exe files, which will tell you whether the file is readable or not.
A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.
Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
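A rough sketch of the printable-byte proportion check described above (the 75% threshold and the 1 KB sample size are just the example figures from this answer):
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadableHeuristic {
    // Reads up to the first 1024 bytes and checks what fraction looks like text.
    static boolean looksLikeText(Path file) throws IOException {
        byte[] buf = new byte[1024];
        int n;
        try (InputStream in = Files.newInputStream(file)) {
            n = in.read(buf);
        }
        if (n <= 0) {
            return false; // empty file: nothing to judge
        }
        int printable = 0;
        for (int i = 0; i < n; i++) {
            int b = buf[i] & 0xFF;
            // printable ASCII plus common whitespace (tab, LF, CR)
            if ((b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r') {
                printable++;
            }
        }
        return printable * 100 / n >= 75;
    }
}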
Related
I am trying to make an RDF corrector. One of the things I specifically want to correct is IRIs. My question is: irrespective of the RDF format, is there anything I can do to correct mistakes in the IRIs? I understand there can be any number of mistakes, but what are the most common, generic mistakes that I can fix?
I am using ANTLR to build the corrector. I have extended BaseErrorListener so that it reports the errors made in the IRI in particular.
In my experience, the errors made in the real world depend on the source. A source may be systematically creating IRIs with spaces in them, or the data may have been binary-copied between ISO-8859-1 ("latin") and UTF-8 (the correct format), which corrupts the UTF-8. These low-level errors are best fixed with a text editor on the input file (and by correcting the code that generates them).
Try a few sample IRIs at http://www.sparql.org/iri-validator.html, which prints out warnings and errors, and uses the same code as the parsers.
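As a sketch of one generic fix for the space problem mentioned above (percentEncodeSpaces is just a hypothetical helper, and it deliberately ignores other kinds of corruption such as mixed-up character encodings):
public class IriSpaceFix {
    // Percent-encode raw spaces so the IRI at least parses.
    static String percentEncodeSpaces(String iri) {
        return iri.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        System.out.println(percentEncodeSpaces("http://example.org/my resource"));
        // -> http://example.org/my%20resource
    }
}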
I am working on an application where I have to convert a .zip file to an array of bytes, and I am using Scala and the Play framework.
As of now I'm using:
val byteOfArray = Source.fromFile("resultZip.zip", "UTF-8").map(_.toByte).toArray
But when I perform operations with byteOfArray I get an error.
I printed byteOfArray and the result was as below:
empty parser
Can you please let me know whether this is the correct way to convert a .zip to an array of bytes?
Also, let me know if there is another good way to convert to an array of bytes.
Your solution is incorrect. UTF-8 is a text encoding, and zip files are binary files. It might happen by accident that a zip file is a valid UTF-8 file, but even in this case UTF-8 can use multiple bytes for a single character, which you'll then convert to a single byte. Source is only intended to work with text files (as you can see from the presence of the encoding parameter, the use of Char, etc.). There is nothing in the standard Scala library for binary IO.
If you really hate the idea of using the Java standard library (you shouldn't; that's what any Scala solution is going to be based on, and it doesn't get less verbose than a single method call), use better-files (not tested, just based on the README examples):
import better.files._
val file = File("resultZip.zip")
file.bytes.toArray // if you really need an Array and can't work with Iterator
but for this specific case it isn't a real win; you just add an extra dependency.
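For reference, a single-call solution with the Java standard library (java.nio) might look like this; the same call can be used verbatim from Scala:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ZipBytes {
    public static void main(String[] args) throws IOException {
        // Reads the whole file as raw bytes; no text decoding involved.
        byte[] byteOfArray = Files.readAllBytes(Paths.get("resultZip.zip"));
        System.out.println(byteOfArray.length + " bytes read");
    }
}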
I mean a folder that contains files and other folders which themselves have files in them
If you have a folder which contains .zip files and possibly some others in nested folders, you can get all of them with
val zipFiles = File(directoryName).glob("**/*.zip")
and then
zipFiles.map(_.bytes.toArray)
will give you a Seq[Array[Byte]] containing all zip files as byte arrays. Modify to taste if you need to use file names and/or paths, etc. in further processing.
I have a text file that was provided to me, and no one knows what encoding it is in. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible in my text editor but readable by (and counted in the indexing of) the String methods. Is there any way in Java to dump the contents of the file with any special characters made visible, so I can see which Strings I need to replace with a regex?
If not in Java, can anyone recommend a tool that may be able to help me out?
I would start by having a look at the file directly. Any code adds a layer of doubt. Take Total Commander (or an equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the special characters' behavior is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.
Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
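A small sketch of that idea in Java, dumping the first few hundred bytes with non-printable bytes shown as hex escapes (the file name is a placeholder):
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ByteDump {
    public static void main(String[] args) throws IOException {
        try (InputStream in = Files.newInputStream(Paths.get("mystery.txt"))) {
            int b, count = 0;
            while ((b = in.read()) != -1 && count < 256) { // dump the first 256 bytes
                // Print printable ASCII as-is, everything else as a hex escape.
                if (b >= 0x20 && b <= 0x7E) {
                    System.out.print((char) b);
                } else {
                    System.out.printf("\\x%02X", b);
                }
                count++;
            }
            System.out.println();
        }
    }
}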
This is the setup: I have a file on disk which is an HTML page. When I open it with a regular web browser it displays as it should, i.e. no matter what encoding is used, I see the correct national characters.
Then I come in. My task is to load the same file, parse it, and print out some pieces on the screen (console), let's say all <hX> texts. Of course I would like to see only correct characters, not some mumbo-jumbo. The last step is to change some of the text and save the file.
So the parser has to parse and handle the encoding in both directions as well. So far I am unaware of a parser that is even capable of loading the data correctly.
Question
What parser would you recommend?
Details
An HTML page in general has its encoding given in the header (in a meta tag), so the parser should use it. A scenario where I have to look in advance, check the encoding, and then manually set it in code is a no-go. For example, this is taken from the Jsoup tutorials:
File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
I cannot do such a thing; the parser has to handle encoding detection by itself.
In C# I faced a similar problem when loading HTML. I used HTMLAgilityPack: I first ran encoding detection, then used the result to decode the data stream, and after that I parsed the data. So I did both steps explicitly, but since the library provides both methods, that was fine with me.
Such explicit separation might even be better, because in the case of a missing header it would be possible to use a probabilistic encoding detection method.
The Jsoup API reference says for that parse method that if you provide null as the second argument (the encoding one), it'll use the http-equiv meta-tag to determine the encoding. So it looks like it already does the "parse a bit, determine encoding, re-parse with proper encoding" routine. Normally such parsers should be capable of resolving the encoding themselves using any means available to them. I know that SAX parsers in Java are supposed to use byte-order marks and the XML declaration to try and establish an encoding.
Apparently Jsoup will default to UTF-8 if no proper meta-tag is found. As they say in the documentation, this is "usually safe" since UTF-8 is compatible with a host of common encodings for the lower code points. But I take it that "usually safe" might not really be good enough in this case.
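If that behaviour is acceptable, the auto-detecting variant of the earlier snippet would simply pass null for the charset (same example file and base URI as in the question):
File input = new File("/tmp/input.html");
// Passing null as the charset lets Jsoup detect the encoding from the BOM
// or the http-equiv/charset meta tag, falling back to UTF-8.
Document doc = Jsoup.parse(input, null, "http://example.com/");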
If you don't sufficiently trust Jsoup to detect the encoding, I see two alternatives:
Should you somehow be certain that the HTML is in fact always XHTML, then an XML parser might prove a better fit. But that would only work if the input is definitely XML-compliant.
Do heuristic encoding detection yourself: try byte-order marks, parse a portion with common encodings to find a meta tag, detect the encoding from byte patterns you'd expect in header tags, and finally, all else failing, use a default.
There are some restricted characters (and, on Windows, even entire reserved filenames) for file and directory names. This other question covers them already.
Is there a way, in Java, to retrieve this list of forbidden characters, which would depend on the system (a bit like retrieving the line-break characters)? Or do I have to hard-code the list myself and check which system I am running on?
Edit: More background on my particular situation, aside from the general question.
I use a default name, coming from some data (over whose content I have no real control), and this name is given to a JFileChooser as the default file name to use (with setSelectedFile()). However, the JFileChooser truncates anything prior to the last invalid character.
These default names occasionally end with dates in an "mm/dd/yy" format, which leaves only the "yy" in the default name, because "/" is forbidden. As such, catching exceptions is not really an option here, because the file itself has not even been created yet.
Edit bis: Hmm, that makes me think: if JFileChooser is truncating the name, it probably has access to a list of such characters; it could be interesting to look into that further.
Edit ter: OK, checking the JFileChooser sources shows something completely simple. For the text field, it uses file.getName(). It doesn't actually check for invalid characters; it simply treats "/" as a path separator and keeps only the end, the "actual" filename. Other forbidden characters actually go through.
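For illustration, the truncation described above is easy to reproduce (the name here is just a made-up example):
// "/" is taken as a path separator, so File.getName() keeps only the last segment.
String name = new java.io.File("report 01/02/03").getName();
System.out.println(name); // prints "03"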
When it comes to dealing with "forbidden" characters I'd rather be overcautious and ban/replace all "special" characters that may cause a problem on any filesystem.
Even if technically allowed, sometimes those characters can cause weirdness.
For example, we had an issue where PDF files were being written (successfully) to a SAN, but when served up via a web server from that location, some of the characters would cause issues when we embedded the PDF in an HTML page rendered in Firefox. It was fine if the PDF was accessed directly, and it was fine in other browsers. Some weird error in how Firefox and Adobe Reader interact.
Summary: "Special" characters in file names -> weird errors waiting to happen
Ultimately, the only way to be sure is to use a white-list.
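A sketch of such a white-list approach (the allowed character set and the replacement character are arbitrary choices here; adjust to taste):
public class FileNameSanitizer {
    // Keep only letters, digits, and a few safe punctuation characters;
    // replace everything else with an underscore.
    static String sanitize(String name) {
        return name.replaceAll("[^A-Za-z0-9._ -]", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("report 01/02/03.pdf")); // report 01_02_03.pdf
    }
}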
Having certain "forbidden characters" is just one of many things that can go wrong when creating a file (others are access rights and file and path name lengths).
It doesn't really make sense to try and catch some of these early when there are others you can't catch until you actually try to create the file. Just handle the exceptions properly.
Have you tried using File.getCanonicalPath and comparing it to the original file name (or whatever is retrieved from getAbsolutePath)?
This will not give you the actual characters, but it may help you in determining whether this is a valid filename in the OS you're running on.
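A sketch of that idea, comparing a candidate name against its canonical form (only a heuristic, since some invalid names are rejected only when the file is actually created):
import java.io.File;
import java.io.IOException;

public class NameCheck {
    // Heuristic: if canonicalization changes the name, or fails outright,
    // the name is suspect on this OS.
    static boolean seemsValid(File parentDir, String name) {
        File candidate = new File(parentDir, name);
        try {
            return candidate.getCanonicalFile().getName().equals(name);
        } catch (IOException e) {
            return false;
        }
    }
}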
Have a look at this link for some info on how to get the OS the application is running on. Basically you need to use System.getProperty("os.name") and do an equals() or contains() to find out the operating system.
Something to be wary of, though, is that knowing the OS does not necessarily tell you the underlying file system being used; for example, a Mac can read and write the FAT32 file system.
source: http://www.mkyong.com/java/how-to-detect-os-in-java-systemgetpropertyosname/
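A minimal sketch of that check (keeping in mind the caveat above that the OS name does not pin down the file system):
public class OsCheck {
    public static void main(String[] args) {
        String os = System.getProperty("os.name").toLowerCase();
        if (os.contains("win")) {
            System.out.println("Running on Windows");
        } else if (os.contains("mac")) {
            System.out.println("Running on macOS");
        } else {
            System.out.println("Assuming a Unix-like system: " + os);
        }
    }
}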