I am writing an application that will search a computer for files with a particular filename extension (JPG, for example). Input: "D:", ".JPG". Output: a txt file with the results (file paths). I know one simple recursive algorithm, but maybe there is something better. So, can you suggest an efficient algorithm to traverse the directory tree? I also want to use multithreading to improve performance, but how many threads should I use? Using one thread per directory would be stupid.
The recursive option you name is the only way to go, unless you want to get your hands dirty with the file system. I suspect you don't.
Regarding thread performance, your best choice is to make the number of threads configurable, create some sample directories, and measure performance for each setting.
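For what it's worth, here is a minimal sketch of that recursive traversal on top of a ForkJoinPool, whose constructor takes the parallelism level, so the thread count becomes a single configurable knob you can benchmark (class and variable names are just illustrative):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class ExtensionSearch extends RecursiveAction {
    private final File dir;
    private final String extension;      // e.g. ".jpg" (compared case-insensitively)
    private final Queue<String> results; // thread-safe result sink

    ExtensionSearch(File dir, String extension, Queue<String> results) {
        this.dir = dir;
        this.extension = extension.toLowerCase();
        this.results = results;
    }

    @Override
    protected void compute() {
        File[] entries = dir.listFiles();
        if (entries == null) return;     // unreadable directory: skip it
        List<ExtensionSearch> subtasks = new ArrayList<>();
        for (File f : entries) {
            if (f.isDirectory()) {
                subtasks.add(new ExtensionSearch(f, extension, results));
            } else if (f.getName().toLowerCase().endsWith(extension)) {
                results.add(f.getAbsolutePath());
            }
        }
        invokeAll(subtasks);             // fork one subtask per subdirectory
    }

    public static void main(String[] args) {
        Queue<String> results = new ConcurrentLinkedQueue<>();
        int threads = 4;                 // the knob to make configurable and measure
        ForkJoinPool pool = new ForkJoinPool(threads);
        pool.invoke(new ExtensionSearch(new File("D:\\"), ".JPG", results));
        results.forEach(System.out::println); // or write them to the output txt file
    }
}
```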
By the way, most file-finders create an index of files. They scan the disk on a schedule and maintain a file containing the relevant information about the files and directories on disk, stored in a format designed to make searching fast. Actual searches are then run against this index file. If you're planning to run this search repeatedly against the same directory, you should do the same.
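A toy version of that index idea might look like the sketch below: rebuild on a schedule, then search against the index file instead of the disk. The one-path-per-line format is purely an assumption; a real indexer would use something more compact and search-friendly.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FileIndex {
    // Rebuild the index: one absolute path per line. Run this on a schedule.
    static void rebuild(Path root, Path indexFile) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            List<String> lines = paths.filter(Files::isRegularFile)
                                      .map(p -> p.toAbsolutePath().toString())
                                      .collect(Collectors.toList());
            Files.write(indexFile, lines);
        }
    }

    // Search the index instead of the disk: a plain line scan by extension.
    static List<String> search(Path indexFile, String extension) throws IOException {
        try (Stream<String> lines = Files.lines(indexFile)) {
            return lines.filter(l -> l.toLowerCase().endsWith(extension.toLowerCase()))
                        .collect(Collectors.toList());
        }
    }
}
```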
Which of these ways is better (faster, less storage)?
Thousands of xyz.properties files (about 30 keys/values each)
One .properties file with all the data in it (about 30,000 keys/values)
I think there are two aspects here:
As Guenther has correctly pointed out, dealing with files comes with overhead. You need file handles, and possibly other data structures that deal with files, so there may be many different levels at which having one huge file is better than having many small files.
But there is also maintainability. From a developer's point of view, dealing with a property file that contains 30K key/values is something you really don't want to get into. If everything is in one file, you have to constantly update (and deploy) that one huge file; one change, and the whole file needs to go out. Will you have mechanisms in place that allow for run-time reloading of properties, or would a change mean that your application has to shut down? And how often will it happen that you have duplicates in that large file, or worse: you set a value for property A on line 5082, and then somebody doesn't pay attention and overrides property A on line 29732? Many things can go wrong simply because all that stuff sits in one file that no human being can digest anymore. And rest assured: debugging something like that will be hard.
I just gave you some questions to think about; so you might want to step back to give more requirements from your end.
In any case, you might want to look into a solution where developers deal with many small property files (you know, like one file per piece of functionality), and then use tooling to build the one large file used in the production environment.
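Such a build-time merge step can be quite small. The sketch below (the directory and output paths are made up) deliberately fails on duplicate keys rather than silently overriding them, which guards against exactly the line-5082-versus-line-29732 problem described above:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.*;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MergeProperties {
    public static void main(String[] args) throws IOException {
        Properties merged = new Properties();
        List<Path> parts;
        try (Stream<Path> files = Files.walk(Paths.get("src/main/properties"))) {
            parts = files.filter(p -> p.toString().endsWith(".properties"))
                         .sorted()
                         .collect(Collectors.toList());
        }
        for (Path p : parts) {
            Properties part = new Properties();
            try (InputStream in = Files.newInputStream(p)) {
                part.load(in);
            }
            for (String key : part.stringPropertyNames()) {
                // Fail loudly on duplicates instead of silently overriding.
                if (merged.containsKey(key)) {
                    throw new IllegalStateException("Duplicate key '" + key + "' in " + p);
                }
                merged.setProperty(key, part.getProperty(key));
            }
        }
        try (OutputStream out = Files.newOutputStream(Paths.get("all.properties"))) {
            merged.store(out, "Generated file - do not edit by hand");
        }
    }
}
```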
Finally: if your application really needs 30K properties, then you should worry much more about the quality of your product. In my eyes, this isn't just a design "smell"; it is a design stench. No reasonable application should require 30K properties to function.
Opening and closing thousands of files is a major source of operating-system overhead, so you'd probably be best off with one big file.
The title actually states the issue. And before you get me wrong: I do NOT want to know how this can be done, but how I can prevent it.
I want to write a file uploader (in Java with JPA and MySQL database). Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
So, therefore, I'd be glad to know what an attacker can do to harm, infect, or manipulate my system by uploading any type of file, be it a media file, a binary, or whatever.
For instance:
What about special characters in the file name?
What about manipulating meta data like EXIF?
What about "embedded viruses" like in an MP3 file?
I hope this is not too vague and I'd be glad to read your tips and hints.
Best regards,
Stacky
It's really very application specific. If you're using a particular web app like phpBB, there are completely different security needs than if you're running a news group. If you want tailored security recommendations, you'll need to search for them based on the context of what you're doing. It could range from sanitizing input to limiting upload size and format.
For example, an MP3 virus probably only works on a few specific MP3 players, not on all of them.
At any rate, if you want broad coverage from viruses, then scan the files with a virus scanner, but that probably won't protect you from things like script injection.
If your server doesn't do something inherently stupid, there should be no problem. But...
Since I'm not yet 100% sure about the internal management, there is the possibility that at some point the file could be executed/opened internally.
... this qualifies as inherently stupid. You have to make sure you don't accidentally execute uploaded files (permissions on the upload directory are a starting point; limiting uploads to specific directories is another, etc.).
Aside from execution, if the server attempts any file-type-specific processing (e.g. making thumbnails of images), there is always the possibility that the processing can be attacked through buffer-overflow exploits (these are specific to each piece of software or library, though).
A pure file server (e.g. FTP) that just stores and serves files is safe (when there are no other holes).
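As one concrete illustration of the "don't execute uploads" point (not a complete defense), you can make the client-supplied file name irrelevant by storing every upload under a server-generated name in a dedicated directory outside the web root; the path below is hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.*;
import java.util.UUID;

public class UploadStore {
    // A directory outside the web root, with no execute permissions (hypothetical path).
    private static final Path UPLOAD_DIR = Paths.get("/var/app-uploads");

    // Store the stream under a server-generated name. The client-supplied name
    // is never used as a path, which sidesteps '../' and special-character tricks.
    public static Path store(InputStream in, String clientFileName) throws IOException {
        Files.createDirectories(UPLOAD_DIR);
        Path target = UPLOAD_DIR.resolve(UUID.randomUUID().toString());
        Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        // Persist clientFileName separately (e.g. a database column) if you need
        // to show it back to users; sanitize it before display.
        return target;
    }
}
```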
I have two java processes which I want completely decoupled from each other.
I figure that the best way to do this is for one to write its data out to a file and for the other to read it from that file (the second might also have to write to the file to record that it has processed a line).
The problem I envisage is simultaneous access to the file. Is there a good, simple pattern I can use to get around it? Is there a library that handles this sort of functionality?
Best way to describe it is as a simple direct message passing mechanism I could implement using files. (Simpler than JMS).
Thanks Dan
If you want a simple solution and you can assume that renaming a file is an atomic operation (this is not completely true), each process can rename the file when reading from or writing to it, and rename it back when it finishes. The other process will not find the file and will wait until it reappears.
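A sketch of that rename-based handoff is below. Note that Files.move with ATOMIC_MOVE is only atomic within a single file system (and may throw AtomicMoveNotSupportedException elsewhere); the file names and the polling interval are just placeholders:

```java
import java.io.IOException;
import java.nio.file.*;

public class RenameHandoff {
    // Writer: produce the data under a temp name, then rename into place.
    // The reader never sees a half-written file, because the rename is the
    // publish step (atomic on most local file systems, as noted above).
    static void publish(Path mailbox, byte[] message) throws IOException {
        Path tmp = mailbox.resolveSibling(mailbox.getFileName() + ".tmp");
        Files.write(tmp, message);
        Files.move(tmp, mailbox, StandardCopyOption.ATOMIC_MOVE);
    }

    // Reader: claim the file by renaming it away, then process at leisure.
    static byte[] consume(Path mailbox) throws IOException, InterruptedException {
        Path claimed = mailbox.resolveSibling(mailbox.getFileName() + ".processing");
        while (true) {
            try {
                Files.move(mailbox, claimed, StandardCopyOption.ATOMIC_MOVE);
                break;                 // rename succeeded: the file is ours
            } catch (NoSuchFileException e) {
                Thread.sleep(100);     // nothing to read yet; poll again
            }
        }
        byte[] data = Files.readAllBytes(claimed);
        Files.delete(claimed);
        return data;
    }
}
```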
You mean like a named pipe? It's possible, but Java doesn't allow pipe creation unless you resort to non-portable, platform-specific processes.
You are asking for functionality that is exactly what JMS does. JMS is an API with many implementations; can you not just use a lightweight one? I don't see why you think this is "complicated". By the time you've managed to implement your own solution reliably, you'll have found that dealing with all the edge cases is far from trivial.
Correct me if I've misunderstood your problem...
Why don't you look at file locks? When one program acquires the lock, the other waits until the lock is released.
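For example, here is a minimal sketch using java.nio's FileLock; bear in mind such locks are advisory, so both processes must cooperate by acquiring the lock before touching the file:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockedAppend {
    // Take an exclusive OS-level lock before touching the shared file; the
    // other process blocks in lock() until the lock is released.
    static void appendLine(Path file, String line) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                 StandardOpenOption.CREATE,
                 StandardOpenOption.WRITE,
                 StandardOpenOption.APPEND);
             FileLock lock = ch.lock()) {
            ch.write(ByteBuffer.wrap((line + System.lineSeparator()).getBytes()));
        }
    }
}
```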
If you are not locked into a file-based solution, a database can solve your problem.
Each record will be a line written by the writing process. One column in the record will be left untouched, and the reading process will use it to mark the record as read.
Naturally you will have to deal with cleaning up the table before it becomes too large, or with partitioning it so the reading process can find information in it easily.
If you must use a file, you can keep a second file that just holds the ID of the last record the reader processed; that way you avoid having two processes writing concurrently to the same file.
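A bare-bones version of that table-based handoff could look like the sketch below; the table layout (shown as MySQL-style DDL in the comment) is an assumption:

```java
import java.sql.*;

public class DbQueue {
    // Assumed table:
    // CREATE TABLE messages (id BIGINT AUTO_INCREMENT PRIMARY KEY,
    //                        payload TEXT, processed BOOLEAN DEFAULT FALSE)
    static void write(Connection c, String payload) throws SQLException {
        try (PreparedStatement ps =
                 c.prepareStatement("INSERT INTO messages (payload) VALUES (?)")) {
            ps.setString(1, payload);
            ps.executeUpdate();
        }
    }

    // Reader: fetch unprocessed rows, then flag them so they are not read twice.
    static void readAll(Connection c) throws SQLException {
        try (Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                 "SELECT id, payload FROM messages WHERE processed = FALSE ORDER BY id")) {
            while (rs.next()) {
                long id = rs.getLong("id");
                System.out.println(rs.getString("payload"));   // handle the line
                try (PreparedStatement upd = c.prepareStatement(
                         "UPDATE messages SET processed = TRUE WHERE id = ?")) {
                    upd.setLong(1, id);
                    upd.executeUpdate();
                }
            }
        }
    }
}
```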
Does anyone know of any open-source Java libraries that provide features for handling (writing/reading) a large number of files on disk? I am talking about 2-4 million files (most of them PDF and MS Office documents). It is not a good idea to store all the files in a single directory. Instead of re-inventing the wheel, I am hoping this has already been done by many people.
Features I am looking for:
1) Able to write/read files from disk
2) Able to create random directories/sub-directories for new files
3) Provide versioning/auditing (optional)
I was looking at the JCR API and it looks promising, but it starts with a workspace and I'm not sure what the performance will be when there are many nodes.
Edit: JCR does look pretty good. I'd suggest trying it out to see how it actually performs for your use case.
If you're running your system on Windows and notice a horrible O(n²) performance hit at some point, you're probably running up against the cost of automatic 8.3 filename generation. You can, of course, disable 8.3 filename generation, but as you pointed out, it would still not be a good idea to store large numbers of files in a single directory.
One common strategy I've seen for handling large numbers of files is to create directories for the first n letters of the filename. For example, document.pdf would be stored in d/o/c/u/m/document.pdf. I don't recall ever seeing a library to do this in Java, but it seems pretty straightforward. If necessary, you can create a database to store the lookup table (mapping keys to the uniformly-distributed random filenames), so you won't have to rebuild your index every time you start up. If you want to get the benefit of automatic deduplication, you could hash each file's content and use that checksum as the filename (but you would also want to add a check so you don't accidentally discard a file whose checksum matches an existing file even though the contents are actually different).
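Here is a sketch of that fan-out scheme; it hashes the name instead of taking its first letters, so the buckets stay uniformly distributed even when many names share a prefix (e.g. "invoice-..."). The root path and the two-level, 256-way fan-out are arbitrary choices:

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShardedStore {
    // Map a file name to a fan-out path derived from a hash of the name.
    static Path shardedPath(Path root, String fileName) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                                     .digest(fileName.getBytes());
        String hex = String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
        return root.resolve(hex.substring(0, 2))   // first level: 256 dirs
                   .resolve(hex.substring(2, 4))   // second level: 256 dirs each
                   .resolve(fileName);             // ~65k buckets for millions of files
    }

    public static void main(String[] args) throws Exception {
        System.out.println(shardedPath(Paths.get("/data/docs"), "document.pdf"));
    }
}
```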
Depending on the sizes of the files, you might also consider storing the files themselves in a database; if you do this, it would be trivial to add versioning, and you wouldn't necessarily have to create random filenames because you could reference them using an auto-generated primary key.
Combine the functionality in the java.io package with your own custom solution.
The java.io package can write and read files from disk and create arbitrary directories or sub-directories for new files. There is no external API required.
The versioning or auditing would have to be provided by your own custom solution. There are many ways to handle this, and you probably have a specific need to fill. Especially if you're concerned about the performance of an open-source API, you will likely get the best result by simply coding a solution that fits your needs exactly.
It sounds like your module should scan all the files on startup and form an index of everything that's available. Based on the method used for sharing and indexing these files, it can rescan the files every so often or you can code it to receive a message from some central server when a new file or version is available. When someone requests a file or provides a new file, your module will know exactly how it is organized and exactly where to get or put the file within the directory tree.
It seems that it would be far easier to just engineer a solution specific to your needs.
I apologize if this is a really beginner question, but I have not worked with Java in several years.
In my application, I need to keep up with a list of files (most, if not all, are txt files). I need to be able to add to this list, remove file paths from the list, and eventually read the contents of the files (though not when the files are initially added to the list).
What is the best data structure to use to store this list of files? Is it standard to just save the path to the file as a String, or is there a better way?
Thanks very much.
Yes, paths are usually stored as String or File instances. The list can be stored as an ArrayList instance.
It really depends on your requirements:
You can store filenames/paths using anything that implements Collection if you have a small number of files and/or a flat directory structure.
If looking up files is performance-critical, you should use a data structure that gives you fast search, like a HashSet.
If memory is an issue (e.g. on mobile devices) and your number of files is high and/or your directory structure is deep, you should use a data structure that allows for compact storage, like a trie.
If the data structure allows, however, I would store File objects rather than Strings, because there is no additional overhead and File offers convenient file-handling methods.
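To make that concrete, a minimal wrapper combining the suggestions above (a HashSet for fast lookup, File objects instead of Strings) might look like:

```java
import java.io.File;
import java.util.HashSet;
import java.util.Set;

public class FileList {
    // HashSet gives O(1) add/remove/lookup; File's equals/hashCode are
    // path-based, so it works naturally as a set element.
    private final Set<File> files = new HashSet<>();

    public boolean add(String path)      { return files.add(new File(path)); }
    public boolean remove(String path)   { return files.remove(new File(path)); }
    public boolean contains(String path) { return files.contains(new File(path)); }
}
```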
One way is to use the Properties class. It has load and store methods for reading and writing to a file, but it may not match what you are doing.
I'm not sure I understood your question completely, but I like to store files as File objects in Java. If you apply the same operation to each file, then you can store them in a List. But maybe you need to clarify your question a little.
I would recommend storing a set of File objects using the Collection interface of your choice. The reason to do this is that a File object creates a canonical reference to the file, which is device-independent.
I don't think that the handle is open when you do this, but I am open to correction.
http://java.sun.com/javase/6/docs/api/java/io/File.html