I have a scenario where web archive (WARC) files are dropped periodically by a crawler into different directories. Each WARC file contains thousands of HTML files.
Now I need to build a framework to process these files efficiently. I know Java doesn't scale in terms of parallel processing of I/O. What I'm thinking is to have a monitor thread that scans the directories, picks up the file names, and drops them into an ExecutorService or a Java blocking queue. A pool of worker threads (maybe a small number, because of the I/O issue) behind the executor service will read each file, read the HTML files within, and do the respective processing. This is to make sure that threads do not fight over the same file.
Is this the right approach in terms of performance and scalability? Also, how should I handle the files once they are processed? Ideally, the files should be moved or tagged so that they are not picked up again. Can this be handled through Future objects?
Recent versions of Java (starting from Java 7, I believe) have a built-in file change notification service, WatchService, as part of the NIO.2 library (java.nio.file). You might want to check this out first instead of rolling your own. See here
My key recommendation is to avoid re-inventing the wheel unless you have some specific requirement.
If you're using Java 7, you can take advantage of the WatchService (as suggested by Simeon G).
If you're restricted to Java 6 or earlier, these services aren't available in the JRE. However, Apache Commons IO provides file monitoring. See here.
As an advantage over Java 7, Commons-IO monitors will create a thread for you that raises events against the registered callback. With Java 7, you will need to poll the event list yourself.
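To illustrate what polling the event list yourself can look like, here is a minimal sketch that blocks on a WatchService and hands each newly created path to an ExecutorService. The class name DirectoryWatcher and the handler parameter are hypothetical, and the lambdas assume Java 8 or later; treat it as a starting point rather than a finished monitor.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

// Hypothetical helper: names are illustrative, not from any library.
final class DirectoryWatcher {
    /**
     * Blocks on the watch service and submits every newly created path to the
     * executor. Returns when the thread is interrupted or the key becomes invalid.
     */
    static void watch(Path dir, ExecutorService workers, Consumer<Path> handler)
            throws IOException {
        try (WatchService watcher = dir.getFileSystem().newWatchService()) {
            dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (!Thread.currentThread().isInterrupted()) {
                WatchKey key;
                try {
                    key = watcher.take();      // blocks until events are queued
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue;              // events were lost; a rescan would go here
                    }
                    Path created = dir.resolve((Path) event.context());
                    workers.submit(() -> handler.accept(created));
                }
                if (!key.reset()) {
                    return;                    // directory is no longer accessible
                }
            }
        }
    }
}
```

Running watch on its own thread keeps the monitor separate from the workers, which matches the monitor-thread-plus-pool design described in the question.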
Once you have the events, your suggestion of using an ExecutorService to process files off-thread is a good one. Moving files is supported by Java IO and you can just ignore any delete events that are raised.
I've used this model in the past with success.
Here are some things to watch out for:
The new file event will likely be raised as soon as the file exists in the directory. However, data may still be being written to it at that point. Consider reasonable expectations for file size and how long you need to wait until a file is considered 'whole'.
What is the maximum amount of time you must spend on a file?
Make your executor service parameters tweakable via config - this will simplify your performance testing
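One possible way to decide that a file is 'whole' is to wait until its size stops changing. The sketch below is only a heuristic under that assumption; FileStability, quietMillis, and maxWaitMillis are hypothetical names, and a writer that pauses for longer than the quiet period would fool it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper illustrating a size-stability check.
final class FileStability {
    /**
     * Returns true once the file size has stayed the same for one full quiet
     * period, or false if maxWaitMillis elapses first.
     */
    static boolean awaitStable(Path file, long quietMillis, long maxWaitMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        long lastSize = Files.size(file);
        while (System.currentTimeMillis() < deadline) {
            Thread.sleep(quietMillis);
            long size = Files.size(file);
            if (size == lastSize) {
                return true;               // no growth during the quiet period
            }
            lastSize = size;               // still growing; keep waiting
        }
        return false;                      // gave up: treat the file as incomplete
    }
}
```

Both durations are good candidates for the tweakable configuration mentioned above, alongside the executor's pool size.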
Hope this helps. Good luck.
I'm building a program in Java 7 and I need to download multiple files from a backend server depending on a different downloaded file value.
I'll explain:
First, my program downloads a file via an AsyncTask; this file contains a value that determines the files to download. In the onPost method, it calls a different method that downloads these files and inserts them into an array list after manipulating them into my app data.
Now, to detect when these AsyncTasks have finished, I've created another AsyncTask that busy-waits: since I know how many files are to be downloaded, I check in a while loop whether the size of the array equals the number of files.
My question is: does this busy wait in the AsyncTask prevent the OS from releasing the processor it is running on, or is it nothing to worry about?
I don't want the busy wait to lock a processor in order to download the files faster, or does it even matter?
Since I assume the async task has its own apoc and goes to sleep if needed I assume there won't be abuse of a processor by this busy wait?
Does the number of files to be downloaded affect processing time? And if so, is downloading a single file with all the data better than multithreading the download of a few hundred smaller files?
And finally, is it a good practice to write my own busy wait in an AsyncTask?
I'll add some code snippets soon...
Is it good practice to write my own XYZ? No, almost never, unless it's free vs. proprietary, it's for a new language or platform, you are doing university research, or the standard implementation is so crappy that you have a realistic chance of taking over. In this case, none of the above seems true.
I would suggest using the standard parts of the java.util.concurrent package. It has many classes that would be very useful for your project, including the Future interface, which provides the waiting functionality you need. Futures are returned by the executor services implemented there, in several ways.
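As a rough sketch of how Futures can replace the busy wait, the following submits all tasks to a fixed pool and blocks until every one has finished. ParallelDownloads and runAll are hypothetical names, and the Callable tasks stand in for the real download code, which is not shown in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: waits for a batch of tasks without spinning.
final class ParallelDownloads {
    /**
     * Runs every task on a small pool and blocks until all have finished,
     * returning the results in submission order.
     */
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll returns only after every task has completed.
            for (Future<T> future : pool.invokeAll(tasks)) {
                results.add(future.get()); // rethrows a task failure as ExecutionException
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The calling thread sleeps inside invokeAll instead of looping, so no processor is tied up while the downloads run. Since runAll blocks, it would need to be called from a background thread rather than the UI thread.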
If you need more control over the process, it is also possible to use a CyclicBarrier. It lets a group of threads wait for each other at a common point, and you can pass a timeout if you think your download may stall.
To answer clearly and directly: no, it is not good practice to implement a homegrown framework that duplicates functionality available for free as part of the standard runtime. Whether it busy-waits or not does not even matter much.
I need to develop an application that will process csv files as soon as the files are created in a predefined directory. Huge number of incoming files is expected.
I have seen applications using Apache Commons IO File Monitoring in production. It works pretty well. I have seen it processing as many as 21 million files in a day. It seems Apache Commons IO File Monitoring polls the directory and calls listFiles to process the files.
My question:
Is JDK WatchService as good an option as Apache Commons IO File Monitoring? Does anyone know of any pros and cons?
Since the time I asked this question, I have got some more insight into the matter. Hence trying to answer for those who might have similar question.
Apache commons monitoring uses a polling mechanism with a configurable polling interval. In every poll, it calls listFiles() method of File class and compares with the listFiles() output of the previous iteration to identify file creation, modification and deletion. The algorithm is robust enough and I have never seen any miss. It works great with even large volume of files. However, since it polls and invokes listFiles in every iteration, it will consume unnecessary CPU cycles, if the input file inflow is not much. Works even on network drives.
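The snapshot-comparison idea described above can be sketched in a few lines. This is not the Commons IO implementation, only an illustration of the technique; SnapshotPoller is a hypothetical name, and the sketch detects creations only, leaving out modification and deletion.

```java
import java.io.File;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of polling by comparing directory listings.
final class SnapshotPoller {
    private final File dir;
    private Set<String> previous = new HashSet<>();

    SnapshotPoller(File dir) {
        this.dir = dir;
    }

    /** Lists the directory and returns names that were absent on the previous poll. */
    Set<String> pollCreated() {
        String[] names = dir.list();               // null if dir is not a readable directory
        Set<String> current = new HashSet<>();
        if (names != null) {
            current.addAll(Arrays.asList(names));
        }
        Set<String> created = new TreeSet<>(current);
        created.removeAll(previous);               // present now, absent before
        previous = current;
        return created;
    }
}
```

Each call lists the whole directory, which is where the CPU cost mentioned above comes from when few files arrive between polls.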
JDK WatchService does not need polling. It is event based. It is triggered only when an event occurs, and hence less CPU is required if the input file inflow is low. If the input file inflow is heavy and events are processed at a slower rate than the rate at which they occur, there may be a chance of event overflow. Additionally, it will not work with network drives.
Hence, in conclusion, if the file inflow is continuous and huge, it is better to go for Apache File Monitoring. Otherwise, JDK WatchService is a good option.
We have an application that reads files from a particular folder, processes them, and copies them (with some business logic) to another folder.
The problem is that when there is a very large number of files to be processed, running a single instance of the application or a single thread is no longer enough to process these files.
One approach we have for this is to start multiple instances of the application. (I feel something is wrong with this approach. Suggest an alternative if there is one.)
Whether we spawn threads or start multiple instances of the application, care must be taken that once a thread reads a file and starts processing it, no other thread picks it up.
We are trying to achieve this by having a database table with the list of file names in the folder, so that when a thread first reads the table for a file name, we change the status to in-process or completed and pessimistically lock the table so that other threads cannot read it.
Is there any better solution to the problem ?
You can use most of your existing implementation as the front-end processor to feed file streams to worker threads that you can start/stop as demand dictates. Only the front-end thread opens files, so there is no possibility of one worker interfering with another.
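A minimal sketch of this front-end arrangement, assuming a single dispatcher thread and a BlockingQueue: only the dispatcher discovers files, and each worker receives names solely through the queue. FrontEndDispatcher and the STOP sentinel are hypothetical, and adding the name to a list stands in for the real business logic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of a single producer feeding worker threads.
final class FrontEndDispatcher {
    // Distinct object used as a "no more work" marker; compared by identity.
    private static final String STOP = new String("stop");

    static List<String> dispatch(List<String> fileNames, int workerCount)
            throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = Collections.synchronizedList(new ArrayList<>());
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < workerCount; i++) {
            Thread worker = new Thread(() -> {
                try {
                    // take() hands each queued name to exactly one worker.
                    for (String name = queue.take(); name != STOP; name = queue.take()) {
                        processed.add(name);   // stand-in for the real processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }
        for (String name : fileNames) {
            queue.put(name);                   // only this thread discovers files
        }
        for (int i = 0; i < workerCount; i++) {
            queue.put(STOP);                   // one sentinel per worker
        }
        for (Thread worker : workers) {
            worker.join();
        }
        return processed;
    }
}
```

Because the queue delivers each element once, no database table or lock is needed to stop two workers from picking up the same file within one process.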
EDIT: Added the word 'no' as it changes the meaning quite a bit...
Also have a look at JDK 7. It has a new file I/O API and a fork/join framework which might help.
Take a look at Apache Camel (http://camel.apache.org), and its File component (http://camel.apache.org/file2.html). Using Camel allows you to very easily define a set of processing instructions to consume files in a directory atomically, and also to configure a thread pool to deal with multiple files at the same time. Camel in Action's a great book to get you started.
What you describe reminds me of the classical style to develop on UNIX.
In this classical style, you would move a file to a work-in-progress directory so that other processes do not pick it up. In general, you could use one directory per processing state and then move files from state to state.
This works essentially because file moves are atomic (at least on Unix systems and NTFS).
What is nice with this approach, is that it is pretty easy to handle problematic situations like crashes and it has automatically a nice management interface everyone is familiar with (the filesystem GUI, ls, Windows Explorer, ...).
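In Java, the move-to-claim step might look like the sketch below. FileClaimer is a hypothetical name; the sketch assumes both directories are on the same file system, since Files.move with ATOMIC_MOVE throws AtomicMoveNotSupportedException otherwise.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: claims a file by moving it between state directories.
final class FileClaimer {
    /**
     * Tries to claim a file by moving it into the work-in-progress directory.
     * Returns the new path, or null if another worker moved it first.
     */
    static Path claim(Path file, Path inProgressDir) throws IOException {
        Path target = inProgressDir.resolve(file.getFileName());
        try {
            return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (NoSuchFileException e) {
            return null;   // the source is gone: someone else claimed it
        }
    }
}
```

A worker that gets null simply moves on to the next candidate, so competing threads or processes need no shared lock.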
I have an application where I need to check for a file which may be created dynamically during my execution, I will give up after some MAX time where the file has yet to show up. I wanted to know if there was a more efficient method in Java of checking for the file other than polling for it and then sleeping every X seconds? If not what would be the most efficient manner of doing this?
You currently have to poll the file system as you mentioned. Java 7 is supposed to have file system notifications, so this should get easier at some point.
If the same program is doing the creation of the file as the polling, you could instead have the logic that creates the file notify the part of the program using Object.notify(). A general description of the wait() and notify/notifyAll() mechanism can be found here: http://java.sun.com/docs/books/tutorial/essential/concurrency/guardmeth.html
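For the same-program case, a guarded wait with a deadline could be sketched as follows. FileReadySignal is a hypothetical class; the code that creates the file would call markReady(), and the waiting code gives up after the maximum time.

```java
// Hypothetical illustration of wait()/notifyAll() with a timeout.
final class FileReadySignal {
    private boolean ready;

    /** Called by the code that has just finished creating the file. */
    synchronized void markReady() {
        ready = true;
        notifyAll();
    }

    /** Waits until markReady() is called or the timeout elapses; returns whether the file is ready. */
    synchronized boolean awaitReady(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        long remaining = timeoutMillis;
        while (!ready && remaining > 0) {      // loop guards against spurious wakeups
            wait(remaining);
            remaining = deadline - System.currentTimeMillis();
        }
        return ready;
    }
}
```

The waiting thread sleeps inside wait() rather than polling the file system, and wakes as soon as the creator signals.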
You could try JPoller to poll for the file changes.
If you are running on Windows, you can get directory change notifications; see Obtaining Directory Change Notifications. Of course, this is not cross-platform, and will require use of JNA or a similar native bridge. In fact, JNA offers such a class, the FileMonitor class (in the download), which uses the underlying platform's file change notification.
If you are watching a handful of files or fewer, then of course polling is unlikely to be a performance problem. It's just not a "feel-good" solution, but not so bad as to warrant the pain of a non-pure-Java solution. Monitoring directories containing thousands of files, on the other hand, would benefit from direct notification from the OS.
I have a directory that continually fills up with "artefact" files. Many different programs dump their temporary files in this directory and it's unlikely that these programs will become self-cleaning any time soon.
Meanwhile, I would like to write a program that continually deletes files in this directory as they become stale, which I'll define as "older than 30 minutes".
A typical approach would be to have a timed mechanism that lists the files in the directory, filters on the old stuff, and deletes the old stuff. However, this approach is not very performant in my case because this directory could conceivably contain 10s or hundreds of thousands of files that do not yet qualify as stale. Consequently, this approach would continually be looping over the same thousands of files to find the old ones.
What I'd really like to do is implement some kind of directory listener that was notified of any new files added to the directory. This listener would then add those files to a queue to be deleted down the road. However, there doesn't appear to be a way to implement such a solution in the languages I program in (JVM languages like Java and Scala).
So: I'm looking for the most efficient way to keep a directory "as clean as it can be" on Windows, preferably with a JVM language. Also, though I've never programmed with Powershell, I'd consider it if it offered this kind of functionality. Finally, if there are 3rd party tools out there to do such things, I'd like to hear about them.
Thanks.
Why can't you issue a directory system command sorted by oldest first:
c:>dir /OD
Take the results and delete all files older than your threshold or sleep if no files are old enough.
Combine that with a Timer or Executor set to a granularity of 1 second to 1 minute, so that files don't keep piling up faster than you can delete them.
If you don't want to write C++, you can use Python. Install pywin32 and you can then use the win32 API as such:
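import win32con
import win32file

path_to_watch = "."  # placeholder: directory to monitor

# Ask Windows to signal the handle when a file is created, deleted, or renamed.
change_handle = win32file.FindFirstChangeNotification(
    path_to_watch,
    0,  # do not watch subdirectories
    win32con.FILE_NOTIFY_CHANGE_FILE_NAME
)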
Full explanation of what to do with that handle by Tim Golden here: http://timgolden.me.uk/python/win32_how_do_i/watch_directory_for_changes.html.
In Java, you can also use Apache Commons JCI FAM. It is an open-source Java library that you can use for free.
JDK 7 (released in beta currently) includes support for file notifications as well. Check out Java NIO2 tutorial.
Both options should work both on Windows and Linux.
http://www.cyberpro.com.au/Tips_n_Tricks/Windows_Related_Tips/Purge_a_Directory_in_Windows_automatically/
I'd go with C++ for a utility like this - lets you interface with the WIN32 API, which does indeed have directory listening facilities (FindFirstChangeNotification or ReadDirectoryChangesW). Use one thread that listens for change notifications and updates your list of files (iirc FFCN requires you to rescan the folder, whereas RDCW gives you the actual changes).
If you keep this list sorted according to modification time, it becomes easy to Sleep() just long enough for a file to go stale, instead of polling at some random fixed interval. You might want to do a WaitForSingleObject with a timeout instead of Sleep, in order to react to outside changes (ie, the file you're waiting for to become stale has been deleted externally, so you'll want to wake up and determine when the next file will become stale).
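The same sleep-until-the-next-file-goes-stale idea can be sketched on the JVM with a DelayQueue, which keeps entries ordered by expiry and blocks until the head is due. StaleFile and StaleFileReaper are hypothetical names; how files get into the queue (for example from a directory listener) is left out, and unlike WaitForSingleObject this sketch does not wake up when a queued file is deleted externally.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

// Hypothetical queue entry: becomes available once the file is older than maxAgeMillis.
final class StaleFile implements Delayed {
    final Path path;
    private final long staleAtMillis;

    StaleFile(Path path, long lastModifiedMillis, long maxAgeMillis) {
        this.path = path;
        this.staleAtMillis = lastModifiedMillis + maxAgeMillis;
    }

    @Override
    public long getDelay(TimeUnit unit) {
        return unit.convert(staleAtMillis - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
        return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
}

final class StaleFileReaper {
    /** Blocks until the oldest queued file goes stale, deletes it, and returns its path. */
    static Path reapNext(DelayQueue<StaleFile> queue) throws IOException, InterruptedException {
        StaleFile next = queue.take();          // sleeps until the head entry expires
        Files.deleteIfExists(next.path);        // tolerate files already removed by others
        return next.path;
    }
}
```

With a 30-minute maximum age, a reaper thread calling reapNext in a loop would never rescan files that are not yet stale.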
Sounds like a fun little tool to write :)
You might want to bite the bullet and code it up in C# (or VB). What you're asking for is pretty well handled by the FileSystemWatcher class. It would work basically the way you are describing. Register files as they are added into the directory. Have a periodic timer that scans the list of files for ones that are stale and deletes them if they are still there. I'd probably code it up as a Windows service running under a service id that has enough rights to read/delete files in the directory.
EDIT: A quick google turned up this FileSystemWatcher for Java. Commercial software. Never used it, so can't comment on how well it works.