There are some restricted characters (and, on Windows, even entire reserved filenames) for file and directory names. This other question already covers them.
Is there a way, in Java, to retrieve this list of forbidden characters, which would depend on the system (a bit like retrieving the line separator characters)? Or do I have to hard-code the list myself after checking which system I am running on?
Edit: More background on my particular situation, aside from the general question.
I use a default name, coming from some data (no real control over its content), and this name is given to a JFileChooser as the default file name to use (with setSelectedFile()). However, it truncates everything before the last invalid character.
These default names occasionally end with dates in "mm/dd/yy" format, which leaves only the "yy" in the default name, because "/" is forbidden. As such, catching exceptions is not really an option here, because the file itself has not even been created yet.
Edit bis: Hmm, that makes me think: if JFileChooser is truncating the name, it probably has access to a list of such characters; that could be worth investigating further.
Edit ter: OK, checking the JFileChooser sources reveals something quite simple. For the text field, it uses file.getName(). It doesn't actually check for invalid characters; it simply treats "/" as a path separator and keeps only what comes after it, the "actual" filename. Other forbidden characters pass through untouched.
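A quick way to see that behaviour (a minimal sketch; the date-like name is just an example):

import java.io.File;

// java.io.File treats '/' as a separator on every platform, so
// getName() drops everything before the last '/'.
File f = new File("Report 01/02/15");
System.out.println(f.getName()); // prints "15"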
When it comes to dealing with "forbidden" characters I'd rather be overcautious and ban/replace all "special" characters that may cause a problem on any filesystem.
Even if technically allowed, sometimes those characters can cause weirdness.
For example, we had an issue where PDF files were being written (successfully) to a SAN, but when served up via a web server from that location, some of the characters would cause issues when we embedded the PDF in an HTML page rendered in Firefox. It was fine if the PDF was accessed directly, and it was fine in other browsers. Some weird error in how Firefox and Adobe Reader interact.
Summary: "Special" characters in file names -> weird errors waiting to happen
Ultimately, the only way to be sure is to use a white-list.
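As a minimal sketch (the allowed set and the '_' replacement character are assumptions; adjust to your needs):

// Replace every character outside a conservative whitelist.
// The allowed set and the '_' replacement are assumptions; tune them.
public static String sanitizeFileName(String name) {
    return name.replaceAll("[^a-zA-Z0-9._-]", "_");
}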
Having certain "forbidden characters" is just one of many things that can go wrong when creating a file (others are access rights and file and path name lengths).
It doesn't really make sense to try and catch some of these early when there are others you can't catch until you actually try to create the file. Just handle the exceptions properly.
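For instance (a sketch; userSuppliedName and the directory are hypothetical):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;

// Just attempt the creation and deal with failure, whatever its cause.
// userSuppliedName is a hypothetical variable.
try {
    Files.createFile(Paths.get("some/dir", userSuppliedName));
} catch (InvalidPathException | IOException e) {
    // Bad characters, missing rights, over-long paths: all end up here.
    System.err.println("Could not create file: " + e.getMessage());
}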
Have you tried using File.getCanonicalPath and comparing it to the original file name (or whatever is retrieved from getAbsolutePath)?
This will not give you the actual characters, but it may help you in determining whether this is a valid filename in the OS you're running on.
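Something along these lines (a sketch; candidateName is hypothetical, and the behaviour varies by OS and filesystem):

import java.io.File;
import java.io.IOException;

// candidateName is a hypothetical variable.
File f = new File(candidateName);
try {
    // On Windows, getCanonicalPath() tends to fail on names containing
    // reserved characters; on other systems it mostly just normalizes.
    boolean looksValid = f.getCanonicalPath().equals(f.getAbsolutePath());
} catch (IOException e) {
    // The name could not be resolved at all.
}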
Have a look at this link for some info on how to get the OS the application is running on. Basically you need to use System.getProperty("os.name") and do an equals() or contains() to find out the operating system.
Something to be wary of, though, is that knowing the OS does not necessarily tell you the underlying file system being used; for example, a Mac can read and write a FAT32 file system.
source: http://www.mkyong.com/java/how-to-detect-os-in-java-systemgetpropertyosname/
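A minimal sketch of that check:

String os = System.getProperty("os.name").toLowerCase();

if (os.contains("win")) {
    // Windows: \ / : * ? " < > | are all forbidden in file names.
} else if (os.contains("mac")) {
    // macOS: ':' is problematic; '/' is the path separator.
} else {
    // Assume a Unix-like system: only '/' and NUL are forbidden.
}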
My project (built on JSP, Struts, and Hibernate) takes input from the user and saves it in the database. To make my application secure I have used the ESAPI jar.
I am getting exception
org.owasp.esapi.errors.IntrusionException: Input validation failure
at the method ESAPI.encoder().canonicalize();
This exception generally occurs when we copy and paste data from Skype, MS Word, etc.
When I copy and paste a string from Skype, it automatically adds extra styling data with div, meta, p, etc. (all the HTML tags), which introduces many special characters that might be causing the exception mentioned above.
But when I copy the string from Notepad, there is no exception.
How can I avoid this exception so that I can add the data? Is there something to modify in ESAPI.properties or validation.properties? What are your views?
I think your weird issue has to do with additional encoding when you paste something from (say) MS Word versus something simple like Notepad. When you are in Word, it picks up some additional metadata, and the default 'paste' from MS Word is really 'paste special'. This is done so that you can copy text from one Office application to another (e.g., Word to Outlook) and "retain formatting". I think it is all this additional metadata that is messing you up, because to ESAPI it probably looks multi-encoded, or ESAPI thinks that mixed encoding is used.
That said, if you want to do validation, you really ought to be using one of the Validator.isValidInput() or Validator.getValidInput() methods. These call Encoder.canonicalize() by default (unless you use the latest ESAPI from the 'develop' branch, where you can actually disable the canonicalization; a recent bug fix).
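For illustration (a sketch; the context string, the "SafeString" validator key, the length limit, and userInput are all assumptions):

import org.owasp.esapi.ESAPI;
import org.owasp.esapi.errors.ValidationException;

try {
    // "SafeString" is assumed to map to a Validator.SafeString regex
    // in validation.properties; userInput is a hypothetical variable.
    String clean = ESAPI.validator().getValidInput(
            "userComment", userInput, "SafeString", 200, false);
} catch (ValidationException e) {
    // Reject the input instead of letting canonicalize() fail later.
}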
-kevin
I am facing this security flaw in my project in multiple places. I don't have any whitelist to check against at every occurrence of this flaw. I want to use an ESAPI call to perform a basic blacklist check on the file name. I have read that we can use the SafeFile object of ESAPI, but I cannot figure out how and where.
Below are a few options I came up with. Please let me know which one will work:
ESAPI.validator().getValidInput() or ESAPI.validator().getValidFileName()
Blacklists are a no-win scenario. They can only protect you against known threats. Any code-scanning tool you use here will continue to report the vulnerability... because a blacklist is a vulnerability. See this note from OWASP:
This strategy, also known as "negative" or "blacklist" validation, is a weak alternative to positive validation. Essentially, if you don't expect to see characters such as %3f or JavaScript or similar, reject strings containing them. This is a dangerous strategy, because the set of possible bad data is potentially infinite. Adopting this strategy means that you will have to maintain the list of "known bad" characters and patterns forever, and you will by definition have incomplete protection.
Also, character encoding and the OS make this a problem. Let's say we accept an upload of a *.docx file. Here are the different corner cases to consider, and this would apply to every application in your portfolio.
1. Is the accepting application running on a Linux platform or an NT platform? (File separators are \ on Windows and / on Linux.)
a. Spaces are also treated differently in file/directory paths across systems.
2. Does the application already account for URL encoding?
3. Is the file being sent stored in a database or on the system itself?
4. Is the file you're receiving executable or not? For example, if I rename netcat.exe to foo.docx, does your application actually check whether the uploaded file contains the magic numbers of an exe file? (See the sketch after this list.)
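On that last point, a minimal sketch of a magic-number check (only the EXE "MZ" signature is shown; real detection needs a table of signatures):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// True if the file starts with the "MZ" signature of a Windows
// executable, regardless of what its extension claims.
static boolean looksLikeExe(Path file) throws IOException {
    try (InputStream in = Files.newInputStream(file)) {
        return in.read() == 'M' && in.read() == 'Z';
    }
}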
I can go on. But I won't. I could write an encyclopedia.
If this affects multiple applications across your company's portfolio, it is your ethical duty to state this clearly, and then your company needs to come up with an app-by-app whitelist.
As far as ESAPI is concerned, you would use Validator.getValidInput() with a regex that was an OR of all the files you wanted to reject, i.e. in validation.properties you'd do something like: Validator.blackListsAreABadIdea=regex1|regex2|regex3|regex4
Note that the parsing penalty for blacklists is higher too... every input string will have to be run against EVERY regex in your blacklist, which, as OWASP points out, can be infinite.
So again, the correct solution is to have every application team in your portfolio construct a whitelist for their application. If this is really impossible (and I doubt that) then you need to make sure that you've stated the risks cited here clearly to management and you refuse to proceed with the blacklist approach until you have written documentation that the company chooses to accept the risk. This will protect you from legal liability when the blacklist fails and you're taken to court.
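By contrast, a whitelist entry in validation.properties might look like this (the key name and pattern are assumptions; tailor them per application):

# validation.properties -- hypothetical key name and pattern
Validator.UploadFileName=^[a-zA-Z0-9_-]{1,64}\\.(docx|pdf|txt)$

which you would then enforce with something like ESAPI.validator().getValidInput("upload", fileName, "UploadFileName", 255, false).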
[EDIT]
The method you're looking for was called HTTPUtilities.safeFileUpload(), listed here as an acceptance criterion, but it was most likely never implemented due to the difficulties I posted above. Blacklists are extremely specific to the application. The best you'll get is the method HTTPUtilities.getFileUploads(), which uses a list defined in ESAPI.properties under the key HttpUtilities.ApprovedUploadExtensions.
However, the default version needs to be customized, as I doubt you want your users uploading .class and .dll files to your system.
Also note: This solution is a whitelist and NOT a blacklist.
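For instance, a trimmed-down, illustrative entry (the stock default list differs and is much longer):

# ESAPI.properties -- illustrative, not the stock default
HttpUtilities.ApprovedUploadExtensions=.pdf,.doc,.docx,.txt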
The following code snippet works to get past the issue CWE ID 73, if the directory path is static and just the filename is externally controlled:
import java.io.File;
import java.io.FileFilter;
import org.apache.commons.io.filefilter.WildcardFileFilter;

// 'DIRECTORY_PATH' is the directory of the file,
// 'filename' holds the externally supplied file name,
// 'myFile' ends up referencing the matching file object.
File dir = new File(DIRECTORY_PATH);
FileFilter fileFilter = new WildcardFileFilter(filename);
File[] files = dir.listFiles(fileFilter);
File myFile = null;
if (files != null && files.length == 1) {
    myFile = files[0];
}
What I need is to create a regex, to be used on my server, that makes sure all the files a user requests access to are under a specific directory. Let's name that dir UserFiles and assume it is under the path /Server/Users/Bob/UserFiles.
So now, when a client sends a request to read a file, I want to validate that the path he is asking for is under /Bob/UserFiles/.
I thought about making sure that the path always begins with the prefix /UserFiles/ and that there is no ".." in the path (which would also protect me from restricted access like /UserFiles/../../noAccess.txt).
examples of not allowed inputs:
C:/UserFiles/
../../Alice/txt.txt
/UserFiles/../../noAccess.txt
examples of allowed input:
/UserFiles/UserFiles/Alice/txt.txt
/UserFiles/txt.txt
/UserFiles/Bob/Bob/txt.txt
I cannot think of any case where this wouldn't work. I also tried to build the regex, but it is not quite right, as it allows inputs like /UserFiles//txt.txt (and it might allow even more that it shouldn't, which I don't know about).
So is my idea complete, or are there other cases I haven't thought of? If it is complete, could you please help me fix my regex?
(?!\.\.)^\/UserFiles\/[/\w,\s-]+\.[A-Za-z]{3}$
How about resolving the path and checking only afterwards (note, the behaviour is OS-dependent):
new File(input).getCanonicalPath().startsWith("/UserFiles/")
Or, depending on how to interpret your question:
new File(input).getCanonicalPath().startsWith("/Server/Users/Bob/UserFiles/")
How can I make sure that a file is readable by humans?
By that I essentially mean checking whether the file is a .txt, .yml, .doc, .json file, and so on.
The issue is that in the case where I want to perform this check, file extensions are misleading; by that I mean that a plain text file (which should be .txt) has an extension of .d, among various others :-(
What is the best way to verify that a file can be read by humans?
So far I have tried my luck with extensions, as follows:
private boolean humansCanRead(String extension) {
    switch (extension.toLowerCase()) {
        case "txt":
        case "doc":
        case "json":
        case "yml":
        case "html":
        case "htm":
        case "java":
        case "docx":
            return true;
        default:
            return false;
    }
}
But as I said, extensions are not as expected.
EDIT: To clarify, I am looking for a solution that is platform independent and does not use external libraries. To narrow down what I mean by "human readable": I mean plain text files that contain characters of any language. Also, I don't really mind whether the text in the file makes sense, e.g. if it is encoded; I don't really care at this point.
Thanks so far for all the responses! :D
In general, you cannot do that. You could use a language identification algorithm to guess whether a given text could be spoken by humans. Since your example contains formal languages like HTML, however, you are in some deep trouble. If you really want to implement your check for a (finite) set of formal languages, you could use a GLR parser to parse the (ambiguous) grammar that combines all these languages. This, however, would not yet solve the problem of syntax errors (although it might be possible to define a heuristic). Finally, you need to consider what you actually mean by "human readable": e.g., do you include Base64?
edit: In case you are only interested in the character set, see this question's answer. Basically, you have to read the file and check whether the content is valid in whatever character encoding you think of as human readable (UTF-8 should cover most of your real-world cases).
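A sketch of such a check, assuming UTF-8 is your definition of "human readable":

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// True if the whole file decodes as valid UTF-8.
static boolean isValidUtf8(Path file) throws IOException {
    byte[] bytes = Files.readAllBytes(file);
    try {
        StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}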
For some files, a check on the proportion of bytes in the printable ASCII range will help. If more than 75% of the bytes are in that range within the first few hundred bytes then it is probably 'readable'.
Some files have headers, like the various forms of BOM on UTF files, the 0xA5EC marker that starts MS .doc files, or the "MZ" signature at the start of .exe files, which will tell you whether the file is readable or not.
A lot of modern text files are in one of the UTF formats, which can usually be identified by reading the first chunk of the file, even if they don't have a BOM.
Basically, you are going to have to run through a lot of different file types to see if you get a match. Load the first kilobyte of the file into memory and run a lot of different checks on it. Once you have some data, you can order the checks to look for the most common formats first.
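A sketch of the printable-ASCII heuristic on the first kilobyte (the 75% threshold is the one suggested above):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Heuristic: mostly printable ASCII (plus tab/CR/LF) in the first KB.
static boolean probablyText(Path file) throws IOException {
    byte[] buf = new byte[1024];
    int n;
    try (InputStream in = Files.newInputStream(file)) {
        n = in.read(buf);
    }
    if (n <= 0) return false;
    int printable = 0;
    for (int i = 0; i < n; i++) {
        int b = buf[i] & 0xFF;
        if ((b >= 0x20 && b < 0x7F) || b == '\t' || b == '\r' || b == '\n') {
            printable++;
        }
    }
    return printable >= n * 0.75;
}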
I have a text file that was provided to me, and no one knows its encoding. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible in my text editor, but read by (and counted in the indexing of) the String methods. Is there any way in Java to dump the contents of the file with all special characters made visible, so I can see which strings I need to replace with a regex?
If not in Java, can anyone recommend a tool that may be able to help me out?
I would start by having a look at the file directly. Any code adds a layer of doubt. Take Total Commander (or an equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the special characters' behavior is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.
Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
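A minimal sketch of such a dump ("data.txt" is a placeholder; it prints each byte as hex next to its character so invisible bytes stand out):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// "data.txt" is a placeholder for your file.
byte[] bytes = Files.readAllBytes(Paths.get("data.txt"));
for (byte b : bytes) {
    int v = b & 0xFF;
    // Show printable ASCII as-is; everything else as '.'.
    char c = (v >= 0x20 && v < 0x7F) ? (char) v : '.';
    System.out.printf("%02X %c%n", v, c);
}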