How to know whether a file was processed before

How can I be sure if a file was processed before? There is a remote storage location which is a file source for my application. My program gets files from this location and processes them in a scheduled way. How can I be sure that the next time I fetch only non-processed files? I'm thinking about using file attributes. The archive and modified date can be a solution. But I learned that two bits of file attributes are not used. How can I use these fields in Java? By the way I don't want to use a database.

A common strategy is to use some form of hash function to create a checksum. Record the checksum of the file, and compare the list of processed files identified by checksum against the file in question. If the checksum of the file in question is in the list, you have already processed it.
Protect your list of processed file checksums. If you lose it, or it becomes corrupted, it might be a long, bad day.
To prevent unnecessary network traffic, you might consider preparing 'check' files on the remote repository that contain a checksum that corresponds to a potential input file.
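The checksum approach above could be sketched roughly as follows. This is only an illustration under some assumptions: the class name ProcessedFileRegistry, the choice of SHA-256, and the flat text file holding one checksum per line are all made up for the example (the flat file stands in for the "list of processed file checksums", since a database is not wanted).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers processed files by SHA-256 checksum in a flat text file.
class ProcessedFileRegistry {
    private final Path registryFile;               // one hex checksum per line
    private final Set<String> seen = new HashSet<>();

    ProcessedFileRegistry(Path registryFile) throws IOException {
        this.registryFile = registryFile;
        if (Files.exists(registryFile)) {
            seen.addAll(Files.readAllLines(registryFile, StandardCharsets.UTF_8));
        }
    }

    // Streams the file through SHA-256 and returns the digest as lowercase hex.
    static String checksum(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), digest)) {
            byte[] buffer = new byte[8192];
            while (in.read(buffer) != -1) {
                // reading the stream is what feeds the digest
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    boolean alreadyProcessed(Path file) throws IOException, NoSuchAlgorithmException {
        return seen.contains(checksum(file));
    }

    // Records the checksum in memory and appends it to the registry file.
    void markProcessed(Path file) throws IOException, NoSuchAlgorithmException {
        String sum = checksum(file);
        if (seen.add(sum)) {
            Files.write(registryFile,
                    (sum + System.lineSeparator()).getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }
}
```

A scheduled job would call alreadyProcessed(file) before processing and markProcessed(file) afterwards. Note that two files with identical content share a checksum, so this treats them as the same file.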
EDIT:
Upon further comment, it is potentially possible to directly interact with file system attributes. The proposed Java 1.7 spec introduces file-system specific attribute views to directly interact with these attributes. The view you would be interested in is 'DosFileAttributeView'.
Basic use might be something similar to this ('input' is a file based on a java 'Path'; add necessary exception handling):
// import java.nio.file.Files and, from java.nio.file.attribute,
// DosFileAttributeView and DosFileAttributes
DosFileAttributeView view = Files.getFileAttributeView(input, DosFileAttributeView.class);
// Check if the system supports this view
if (view != null)
{
    DosFileAttributes attributes = view.readAttributes();
    // skip any file already marked as an archive
    if (!attributes.isArchive())
    {
        myObject.process(input);
        // the setter lives on the view; DosFileAttributes is a read-only snapshot
        view.setArchive(true);
    }
}

Can you rename the file (e.g. "filename.archive")? or into an "archive" subdirectory?
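As a rough sketch of the "archive subdirectory" idea, assuming the program is allowed to move files at the source location (the class and method names here are invented for the example):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: moves a processed file into an "archive" folder next to it.
class ArchiveMover {
    static Path archive(Path processed) throws IOException {
        Path archiveDir = processed.toAbsolutePath().getParent().resolve("archive");
        Files.createDirectories(archiveDir);   // no-op if the directory already exists
        // Replace any older archived copy that has the same name.
        return Files.move(processed, archiveDir.resolve(processed.getFileName()),
                StandardCopyOption.REPLACE_EXISTING);
    }
}
```

With this layout, anything still sitting in the source directory has not been processed yet, so no attribute bits or checksum list are needed.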

Related

Java: Assert that file serialization is complete?

I am dealing with an object in Java that is very expensive to compute and several megabytes in size. In order to preserve it across application restarts, I want to serialize it into a File, and re-load that file on startup (if present).
The problem is that most file systems are not transactional. The file writing process can be interrupted by exceptions, JVM termination and/or power failure. What I absolutely need to assert is that if the file is used, the information within is complete. I can throw away the information and recalculate if needed, but reading and relying on incomplete data must be avoided.
My attempt would be to serialize and write a "seal" object at the end of the file, like a checksum for example. The presence of this object during deserialization guarantees that the serialization process was complete. If the seal object is absent during deserialization, I know that I cannot trust the data in the file as it might be incomplete. I am looking for an OS-independent solution, and I do not need to consider "attacks" that maliciously modify the contents of the serialized file.
My question is: Is the seal object approach outlined above safe, or are there still some corner cases where I can end up reading an incomplete file without noticing it?

Just write the file under a different, temporary name. Once the file is complete, delete any previous version of the file and rename the new file to the real name.
If the program dies during the write, you're just left with an incomplete temp file. The real file is still as before (or missing), so you'll never see an incomplete file to load.
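A minimal sketch of the write-then-rename idea using java.nio.file, with some assumptions: AtomicSaver is a made-up name, the temp file is created in the target's own directory so the rename stays on one file system, and ATOMIC_MOVE is requested with a plain replacing move as a fallback. Whether an atomic move replaces an existing target is implementation specific, and this sketch does not force the data to disk before renaming, so it does not by itself cover every power-failure scenario.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: serialize to a temp file, then rename it over the real file.
class AtomicSaver {
    static void save(Serializable value, Path target) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Same directory as the target, so the move does not cross file systems.
        Path temp = Files.createTempFile(dir, target.getFileName().toString(), ".tmp");
        try {
            try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(temp))) {
                out.writeObject(value);
            }
            try {
                Files.move(temp, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (AtomicMoveNotSupportedException e) {
                // Fallback: not atomic, but still only ever publishes a fully written file.
                Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        } finally {
            Files.deleteIfExists(temp);   // cleans up only if the move never happened
        }
    }
}
```

On startup the application would simply load the real file if it exists; leftover *.tmp files can be deleted.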

How to check if a file/directory is a protected OS file?

I'm working on a project which, in part, displays all the files in a directory in a JTable, including sub-directories. Users can double-click the sub-directories to update the table with that new directory's content. However, I've run into a problem.
My lists of files are generated with file.listFiles(), which pulls up everything: hidden files, locked files, OS files, the whole kit and caboodle, and I don't have access to all of them. For example, I don't have permission to read/write in "C:\Users\user\Cookies\" or "C:\ProgramData\ApplicationData\". That's ok though, this isn't a question about getting access to these. Instead, I don't want the program to display a directory it can't open. However, the directories I don't have access to and the directories I do are behaving almost exactly the same, which is making it very difficult to filter them out.
The only difference in behavior I've found is if I call listFiles() on a locked directory, it returns null.
Here's the block of code I'm using as a filter:
for (File file : folder.listFiles())
    if (!(file.isDirectory() && file.listFiles() == null))
        strings.add(file.getName());
Where 'folder' is the directory I'm looking inside and 'strings' is a list of names of the files in that directory. The idea is a file only gets loaded into the list if it's a file or directory I'm allowed to edit. The filtering aspect works, but there are some directories which contain hundreds of sub-directories, each of which contains hundreds more files, and since listFiles() is O(n), this isn't a feasible solution (list() isn't any better either).
However,
file.isHidden() returns false
canWrite()/canRead()/canExecute() return true
getPath() returns the same as getAbsolutePath() and getCanonicalPath()
createNewFile() returns false for everything, even directories I know are ok. Plus, that's a solution I'd really like to avoid even if that worked.
Is there some method or implementation I just don't know to help me see if this directory is accessible without needing to parse through all of its contents?
(I'm running Windows 7 Professional and I'm using Eclipse Mars 4.5.2, and all instances of File are java.io.File).

The problem you have is that you are dealing with File. By all accounts, in 2016, and, in fact, since 2011 (when Java 7 came out), it has been superseded by JSR 203.
Now, what is JSR 203? It is a totally new API to deal with anything file systems and file system objects; and it extends the definition of a "file system" to include what you find on your local machine (the so-called "default filesystem" by the JDK) and other file systems which you may use.
Among the many advantages of this API is that it grants access to metadata which you could not access before; for instance, you specifically mention the case, in a comment, that you want to know which files Windows considers as "system files".
This is how you can do it:
// get the path
final Path path = Paths.get(...);
// get the attributes
final DosFileAttributes attrs = Files.readAttributes(path, DosFileAttributes.class);
// Is this file a "system file"?
final boolean isSystem = attrs.isSystem();
Now, what is Paths.get()? As mentioned previously, the API gives you access to more than one filesystem at a time; a class called FileSystems gives access to all file systems visible by the JDK (including creating new filesystems), and the default file system, which always exists, is given by FileSystems.getDefault().
A FileSystem instance also gives you access to a Path using FileSystem#getPath.
Combine this and you get that those two are equivalent:
Paths.get(a, b, ...)
FileSystems.getDefault().getPath(a, b, ...)
About exceptions: File handles them very poorly. Just two examples:
File#createNewFile will return false if the file cannot be created;
File#listFiles will return null if the contents of the directory pointed by the File object cannot be read for whatever reason.
JSR 203 has none of these drawbacks, and does even more. Let us take the two equivalent methods:
File#createNewFile becomes Files#createFile;
File#listFiles becomes either of Files#newDirectoryStream (or derivatives; see javadoc) or (since Java 8) Files#list.
These methods, and others, have a fundamental difference in behaviour: in the event of a failure, they will throw an exception.
And what is more, you can differentiate what exception this is:
if it is a FileSystemException or derivative, the error is at the filesystem level (for instance, "access denied" is an AccessDeniedException);
if it is some other IOException, then the problem is more fundamental.
This answer cannot contain each and every use case of JSR 203; this API is vast, very complete, although not without flaws, but it is infinitely better than what File has to offer in any case.
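Applied to the original question, the exception-based behaviour could be used roughly like this. It is a sketch, not a definitive filter: AccessibleEntries, canList and visibleNames are invented names, and it assumes that opening a directory stream on a protected directory fails with an IOException (typically AccessDeniedException) rather than returning an empty listing. Opening the stream does not iterate the directory's entries.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists entry names, leaving out directories that cannot be opened.
class AccessibleEntries {
    // True if the directory can be opened for listing; entries are not iterated.
    static boolean canList(Path dir) {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            return true;
        } catch (IOException e) {   // includes AccessDeniedException, NotDirectoryException
            return false;
        }
    }

    static List<String> visibleNames(Path folder) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder)) {
            for (Path entry : stream) {
                // Keep regular files, and directories that can actually be opened.
                if (!Files.isDirectory(entry) || canList(entry)) {
                    names.add(entry.getFileName().toString());
                }
            }
        }
        return names;
    }
}
```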

I faced the very same problem with paths like C://users/myuser/cookies.
I already used JSR203, so the above answer kind of didn't help me.
In my case the important attribute of those files was the hidden one.
I ended up using the FileSystemView, which excluded those files as I wanted.
File[] files = FileSystemView.getFileSystemView().getFiles(new File(strHomeDirectory), !showHidden);

How to validate a filename in JAVA to resolve CWE ID 73(External Control of File Name or Path) using ESAPI?

I am facing this security flaw in my project at multiple places. I don't have any white-list to do a check at every occurrence of this flaw. I want to use ESAPI call to perform a basic blacklist check on the file name. I have read that we can use SafeFile object of ESAPI but cannot figure out how and where.
Below are a few options I came up with, Please let me know which one will work out?
ESAPI.validator().getValidInput() or ESAPI.validator().getValidFileName()

Blacklists are a no-win scenario. This can only protect you against known threats. Any code scanning tool you use here will continue to report the vulnerability... because a blacklist is a vulnerability. See this note from OWASP:
This strategy, also known as "negative" or "blacklist" validation is a
weak alternative to positive validation. Essentially, if you don't
expect to see characters such as %3f or JavaScript or similar, reject
strings containing them. This is a dangerous strategy, because the set
of possible bad data is potentially infinite. Adopting this strategy
means that you will have to maintain the list of "known bad"
characters and patterns forever, and you will by definition have
incomplete protection.
Also, character encoding and OS makes this a problem too. Let's say we accept an upload of a *.docx file. Here's the different corner-cases to consider, and this would be for every application in your portfolio.
1. Is the accepting application running on a Linux platform or an NT platform? (File separators are \ in Windows and / in Linux.)
   a. Spaces are also treated differently in file/directory paths across systems.
2. Does the application already account for URL-encoding?
3. Is the file being sent stored in a database or on the system itself?
4. Is the file you're receiving executable or not? For example, if I rename netcat.exe to foo.docx does your application actually check to see if the file being uploaded contains the magic numbers for an exe file?
I can go on. But I won't. I could write an encyclopedia.
If this is across multiple applications in your company's portfolio, it is your ethical duty to state this clearly, and then your company needs to come up with an app-by-app whitelist.
As far as ESAPI is concerned, you would use Validator.getValidInput() with a regex that was an OR of all the files you wanted to reject, i.e. in validation.properties you'd do something like: Validator.blackListsAreABadIdea=regex1|regex2|regex3|regex4
Note that the parsing penalty for blacklists is higher too... every input string will have to be run against EVERY regex in your blacklist, which as OWASP points out, can be infinite.
So again, the correct solution is to have every application team in your portfolio construct a whitelist for their application. If this is really impossible (and I doubt that) then you need to make sure that you've stated the risks cited here clearly to management and you refuse to proceed with the blacklist approach until you have written documentation that the company chooses to accept the risk. This will protect you from legal liability when the blacklist fails and you're taken to court.
[EDIT]
The method you're looking for was called HTTPUtilities.safeFileUpload(), listed here as acceptance criteria, but this was most likely never implemented due to the difficulties I posted above. Blacklists are extremely custom to the application. The best you'll get is a method HTTPUtilities.getFileUploads() which uses a list defined in ESAPI.properties under the key HttpUtilities.ApprovedUploadExtensions
However, the default version needs to be customized, as I doubt you want your users uploading .class files and DLLs to your system.
Also note: This solution is a whitelist and NOT a blacklist.
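To make the whitelist idea concrete, here is a small sketch in plain Java rather than ESAPI. Everything in it is an assumption for illustration: the class name, the allowed character set, the length limit and the extension list would all have to be chosen per application. It accepts only names matching the allow-list and additionally checks that the resolved path stays inside the base directory.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical allow-list validator; the pattern below is an example, not a recommendation.
class UploadNameValidator {
    // Letters, digits, '_' and '-', exactly one dot, and a fixed set of extensions.
    private static final Pattern ALLOWED =
            Pattern.compile("[A-Za-z0-9_-]{1,64}\\.(docx|pdf|txt)");

    static Path resolveSafely(Path baseDir, String fileName) throws IOException {
        if (fileName == null || !ALLOWED.matcher(fileName).matches()) {
            throw new IOException("File name rejected by allow-list");
        }
        Path base = baseDir.toAbsolutePath().normalize();
        Path resolved = base.resolve(fileName).normalize();
        // Defence in depth: the result must still live under the base directory.
        if (!resolved.startsWith(base)) {
            throw new IOException("Resolved path escapes the base directory");
        }
        return resolved;
    }
}
```

Because the pattern admits no path separators and no "..", traversal attempts fail at the first check; the startsWith test is a second line of defence in case the pattern is later loosened.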

The following code snippet works to get past the issue CWE ID 73, if the directory path is static and just the filename is externally controlled:
// 'DIRECTORY_PATH' is the directory of the file
// 'filename' variable holds the name of the file
// 'myFile' variable holds reference to the file object
// WildcardFileFilter comes from Apache Commons IO (org.apache.commons.io.filefilter)
File dir = new File(DIRECTORY_PATH);
FileFilter fileFilter = new WildcardFileFilter(filename);
File[] files = dir.listFiles(fileFilter);
File myFile = null;
// listFiles() returns null if the directory cannot be read
if (files != null && files.length == 1)
    myFile = files[0];

MarkLogic: Move document from one directory to another on some condition

I'm new to MarkLogic and trying to implement following scenario with its Java API:
For each user I'll have two directories, something like:
1.1. user1/xmls/recent/
1.2. user1/xmls/archived/
When user is doing something with his xml - it's put to the "recent" directory;
When user is doing something with his next xml and "recent" directory is full (e.g. has some amount of documents, let's say 20) - the oldest document is moved to the "archived" directory;
User can request all documents from the "recent" directory and should get no more than 20 records;
User can remove something from the "recent" directory manually; In this case, if it had 20 documents, after deleting one it must have 19;
User can do something with his xmls simultaneously and "recent" directory should never become bigger than 20 entries.
Questions are:
In order to properly handle simultaneous adding of xmls to the "recent" directory, should I block whole "recent" directory when adding new entry (to actually add it, check if there are more than 20 records after adding, select the oldest 21st one and move it to the "archived" directory and do all these steps atomically)? How can I do it?
Any suggestions on how to implement this via Java API?
Is it possible to change document's URI (e.g. replace "recent" with "archived" in my case)?
Should I consider using MarkLogic's collections here?
I'm open to any suggestions and comments (as I said I'm new to MarkLogic and maybe my thoughts on how to handle described scenario are completely wrong).

You can achieve atomicity of a sequence of transactions using Multi-Statement Transactions (MST).
It is possible to use MST from the Java API: http://docs.marklogic.com/guide/java/transactions#id_79848
It's not possible to change a URI. However, it is possible to use an MST to delete the old document and reinsert a new one using the new URI in one atomic step. This would have the same effect.
Possibly, and judging from your use case, unless you must have the recent/archived information as part of the URI, it may be simpler to store this information in collections. However, you should read the documentation and evaluate for yourself: http://docs.marklogic.com/guide/search-dev/collections#chapter

Personally I would skip all the hassle with separate directories as well as collections. You would endlessly have to move files around, or change their properties. It would be much easier to not calculate anything up front, and simply use the lastModified property, or something similar, to determine the most recent items at run-time.
HTH!

How to safely create a file if it doesn't exist from concurrently-running processes without using locking?

Suppose two (or more) concurrently-running Java processes need to check for the existence of a file, create it if it doesn't exist, and then potentially read from that file over the course of their runs. We want to protect ourselves against the possibility of multiple writer processes clobbering each other and/or reader processes reading an incomplete or inconsistent version of the file.
What we're currently doing to arbitrate this situation is to use Java NIO FileLocks. One process manages to acquire an exclusive lock on the file to be created using FileChannel.tryLock() and creates it, while the other concurrently-running processes fail to acquire a lock and fall back to using an in-memory version of the file for their runs.
Locking is causing various problems for us on our compute farm, however, so we're exploring alternatives. So my question to you is: is there a way to do this safely without using file locks?
Could, eg., the processes write to independent temporary files when they find a file doesn't exist, and then more or less "atomically" move the temp file(s) into place after they've been written? In this scenario, we might end up with multiple writer processes, but that would be ok provided that any processes reading from the file always read one version or another, and not a mix of two or more versions. However, I don't think all operating systems guarantee that if you have a file open for reading, you'll continue reading from the original version of the file even if it's overwritten mid-way through the read.
Any suggestions would be much appreciated!

Suppose two (or more) concurrently-running Java processes need to check for the existence of a file, create it if it doesn't exist, and then potentially read from that file over the course of their runs.
I don't quite understand the create and read part of the question. If you are looking to make sure that you have a unique file then you could use new File(...).createNewFile() and check to make sure that it returns true. To quote from the Javadocs:
Atomically creates a new, empty file named by this abstract pathname if
and only if a file with this name does not yet exist. The check for the
existence of the file and the creation of the file if it does not exist
are a single operation that is atomic with respect to all other
filesystem activities that might affect the file.
This would give you a unique file that only that process (or thread) would then "own". I'm not sure how you were planning on letting the writer know which file to write to however.
If you are talking about creating a unique file that you write to and then move into a write directory to be consumed, then the above should work. You would need to create a unique name in the write directory once you were done as well.
You could use something like the following:
// createNewFile() can throw IOException, so the method has to declare it
private File getUniqueFile(File dir, String prefix) throws IOException {
    long suffix = System.currentTimeMillis();
    while (true) {
        File file = new File(dir, prefix + suffix);
        // try creating this file, if true then it is unique
        if (file.createNewFile()) {
            return file;
        }
        // someone already has that suffix so ++ and try again
        suffix++;
    }
}
As an alternative, you could also create a unique filename using UUID.randomUUID() or something to generate a unique name.
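The UUID variant might look something like the sketch below. It swaps java.io.File for java.nio.file.Files.createFile, which also creates the file atomically but signals an existing file by throwing FileAlreadyExistsException instead of returning false. The class and method names are made up for the example.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper: creates a new, empty, uniquely named file in the given directory.
class UniqueFiles {
    static Path createUnique(Path dir, String prefix) throws IOException {
        while (true) {
            Path candidate = dir.resolve(prefix + UUID.randomUUID());
            try {
                // Fails if the file already exists, so the caller "owns" what is returned.
                return Files.createFile(candidate);
            } catch (FileAlreadyExistsException e) {
                // Extremely unlikely collision: loop and try another random name.
            }
        }
    }
}
```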
