I have a very large text data file. Is it possible to read this file by multiple file readers from different locations at the same time in parallel? For example one reader starts reading it from start and goes to middle, another starts reading from the middle to the end of the file.
I have an alternative approach of reading the file via a stream and processing it in parallel, but it does not achieve the goal:
Files.lines(filePath).parallel()
Multiple file readers might not be able to read the same file because it has already been opened and locked by another thread that is currently reading it. Is there any way to make this file shared among many threads so they can concurrently read it?
Is it possible to read this file by multiple file readers from different locations at the same time in parallel?
Yes. You can read the file from multiple points at once using RandomAccessFile. You can only do this if the file isn't locked.
If you have a text file, you need to find where each line starts, and the built-in tools don't really give you much help with this class, so you will have to do some of the work yourself.
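For example, something along these lines (an untested sketch; each thread opens its own RandomAccessFile so the file pointers are independent, and the file name and chunk handling are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class ParallelFileReader {

        // Reads the chunk that starts at the first line boundary at or after 'start'
        // and ends at the first line boundary after 'end'. Each call opens its own
        // RandomAccessFile, so the file pointer is not shared between threads.
        static void readChunk(String path, long start, long end) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                raf.seek(start);
                if (start > 0) {
                    raf.readLine();   // skip the partial line; the previous chunk owns it
                }
                String line;
                while (raf.getFilePointer() <= end && (line = raf.readLine()) != null) {
                    process(line);
                }
            }
        }

        static void process(String line) {
            // hypothetical per-line work
            System.out.println(Thread.currentThread().getName() + ": " + line);
        }

        public static void main(String[] args) throws Exception {
            String path = "big.txt";                       // hypothetical file
            long length = new java.io.File(path).length();
            long middle = length / 2;

            Thread first = new Thread(() -> {
                try { readChunk(path, 0, middle); } catch (IOException e) { e.printStackTrace(); }
            });
            Thread second = new Thread(() -> {
                try { readChunk(path, middle, length); } catch (IOException e) { e.printStackTrace(); }
            });
            first.start();
            second.start();
            first.join();
            second.join();
        }
    }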
Is there any way to make this file shared among many threads so they can concurrently read it?
We have a couple of tools which support both concurrent reads and writes to the same file across processes so it is possible. (They work on binary files, not text)
Related
I'm working on a multithreaded server in Java.
The server monitors a directory of files. Clients can ask the server:
to download a file from the server directory
to upload a new version of an already existing file to the server, overwriting the old version in the server directory.
To do the transfers, I'm planning to use FileChannels and SocketChannels, using the methods transferFrom and transferTo. According to the documentation, these two methods are thread safe.
The thing is that a single call to these two functions might not be sufficient to read/write the file entirely.
The problem arises if there is more than one request on the same file at the same time. In this scenario, multiple threads could be doing read/write operations on the same file. Now, the single calls to transferFrom/transferTo are thread safe, according to the Java documentation. But a single call to these two functions might not be sufficient to read/write the file entirely. If thread A is replying to a download request and thread B is replying to an upload request referring to the same file, it could happen that:
Thread A starts reading from the file
In thread A, for some reason the read call returns before the EOF
Thread B overwrites the entire file with a single write call
Thread A continues reading from the file
In this case, the downloading client receives a portion of the old version and a portion of the new version.
To solve this I think I should be using some sort of locking, but I'm not sure how to do it in an efficient way. I could create two synchronized methods for reading and writing, but that obviously creates too much contention.
The best solution I have in mind is to use lock striping. Before doing any read/write operation, a hash based on the filename is calculated. Then the lock at position lockArr[hash % numOfLocks] is acquired.
I also think that I should be using ReadWriteLocks, since multiple simultaneous reads should be allowed.
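Something like this is what I have in mind (a rough sketch; the class and method names are just placeholders):

    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class StripedFileLocks {
        private final ReadWriteLock[] locks;

        public StripedFileLocks(int numStripes) {
            locks = new ReadWriteLock[numStripes];
            for (int i = 0; i < numStripes; i++) {
                locks[i] = new ReentrantReadWriteLock();
            }
        }

        // Map a file name onto one of the stripes.
        private ReadWriteLock lockFor(String fileName) {
            int index = Math.floorMod(fileName.hashCode(), locks.length);
            return locks[index];
        }

        public void read(String fileName, Runnable readOperation) {
            ReadWriteLock lock = lockFor(fileName);
            lock.readLock().lock();
            try {
                readOperation.run();     // e.g. the transferTo loop for a download
            } finally {
                lock.readLock().unlock();
            }
        }

        public void write(String fileName, Runnable writeOperation) {
            ReadWriteLock lock = lockFor(fileName);
            lock.writeLock().lock();
            try {
                writeOperation.run();    // e.g. the transferFrom loop for an upload
            } finally {
                lock.writeLock().unlock();
            }
        }
    }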
Now, this is my analysis of the problem and I could be completely wrong. Is there any better solution to this?
Locking means that somebody has to wait for somebody else -- not the best solution.
When the client uploads a file, you should write it out to a temp file on the same disk (usually in the same directory), and then when the file upload is done:
Rename the old version to a temporary name. Any current readers should be forced to close the old one, re-open the temp version, and seek to the correct position.
Rename the uploaded file to the target file name.
Delete the temp version of the old file when any readers are done with it.
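A rough sketch of the first two steps using java.nio.file, assuming the uploaded temp file and the target live on the same file system (all names are illustrative):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class UploadSwap {

        // Called once the client's upload has been fully written to 'uploadedTemp'.
        // 'target' is the live file that readers may currently have open.
        static Path swapIn(Path uploadedTemp, Path target) throws IOException {
            // 1. Move the old version aside under a temporary name.
            Path oldCopy = target.resolveSibling(target.getFileName() + ".old");
            Files.move(target, oldCopy, StandardCopyOption.ATOMIC_MOVE);

            // 2. Move the uploaded file into place; same file system, so this is atomic.
            Files.move(uploadedTemp, target, StandardCopyOption.ATOMIC_MOVE);

            // 3. The caller deletes 'oldCopy' once the last reader of the old version is done.
            return oldCopy;
        }
    }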
In a typical implementation, you'd need a centralized class (let's call it ConcurrentFileAccessor) to manage the interactions between threads.
Readers would need to register with this class, and synchronize on some object during the actual read operation. When an upload completes, the writer would have to claim all those locks to block reads, close all the read files, rename the old version, reopen, seek, and then release them to allow the readers to continue.
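Very roughly, and glossing over details such as readers registering while a swap is in progress, the idea could look something like this (all names are made up):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class ConcurrentFileAccessor {

        private final Set<Object> readerLocks = ConcurrentHashMap.newKeySet();

        // Each reader registers a lock object for the duration of its request.
        public Object register() {
            Object lock = new Object();
            readerLocks.add(lock);
            return lock;
        }

        public void unregister(Object lock) {
            readerLocks.remove(lock);
        }

        // Reader: hold its own lock only while actually transferring bytes.
        public void read(Object lock, Runnable transferChunk) {
            synchronized (lock) {
                transferChunk.run();
            }
        }

        // Writer: block every registered reader, perform the rename/swap, then let
        // them resume (each reader reopens and seeks when it notices the swap).
        public void swap(Runnable renameAndSwap) {
            Object[] locks = readerLocks.toArray();
            lockAllAndRun(locks, 0, renameAndSwap);
        }

        private void lockAllAndRun(Object[] locks, int i, Runnable action) {
            if (i == locks.length) {
                action.run();
                return;
            }
            synchronized (locks[i]) {
                lockAllAndRun(locks, i + 1, action);
            }
        }
    }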
I have a situation where two Java applications are watching a directory for incoming files. Say there is a directory DIR that is being watched by two JVM processes for any files with the extension .SGL.
The problem we face here is that sometimes both nodes are notified about new files and both try to process the same file.
Usually we handle these situations using a database: each node tries to insert into a table with a unique file name column, and only the one whose insert succeeds continues processing.
But in this situation, we don't have a database.
What is the best way to handle these kinds of problems? Can we depend on file-renaming solutions? Is file renaming an atomic operation?
For such a situation Spring Integration suggests FileSystemPersistentAcceptOnceFileListFilter: https://docs.spring.io/spring-integration/reference/html/files.html#file-reading
Stores "seen" files in a MetadataStore to survive application restarts.
The default key is 'prefix' plus the absolute file name; value is the timestamp of the file.
Files are deemed as already 'seen' if they exist in the store and have the same modified time as the current file.
When you have a shared, persistent MetadataStore for all your application instances, only one of them will process the file. All others will just filter it out.
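A rough sketch of wiring this up, assuming a Redis-backed store shared by both instances (the directory, key prefix and wiring are illustrative):

    import java.io.File;

    import org.springframework.data.redis.connection.RedisConnectionFactory;
    import org.springframework.integration.file.FileReadingMessageSource;
    import org.springframework.integration.file.filters.FileSystemPersistentAcceptOnceFileListFilter;
    import org.springframework.integration.redis.metadata.RedisMetadataStore;

    public class SharedDirectoryPolling {

        // Both application instances point the filter at the same Redis-backed
        // MetadataStore, so a file accepted by one instance is filtered by the other.
        public static FileReadingMessageSource fileSource(RedisConnectionFactory connectionFactory) {
            RedisMetadataStore metadataStore = new RedisMetadataStore(connectionFactory);
            FileSystemPersistentAcceptOnceFileListFilter filter =
                    new FileSystemPersistentAcceptOnceFileListFilter(metadataStore, "sgl-");

            FileReadingMessageSource source = new FileReadingMessageSource();
            source.setDirectory(new File("/path/to/DIR"));
            source.setFilter(filter);
            return source;
        }
    }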
Every watcher (even two in the same JVM) should always be notified of the new File being added.
If you want to divide the work, you can either
use one JVM to run twice as many threads and divide the work via a queue.
use an operation which will only succeed for one JVM. e.g.
file rename
create a lock file
lock the file itself
Is file renaming an atomic operation?
Yes, only one process can successfully rename a file, even if both attempt to rename it to the same name.
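For example, a minimal sketch of the rename-to-claim idea (the .processing suffix and file names are just illustrative):

    import java.io.File;

    public class ClaimByRename {

        // Only the JVM whose rename succeeds processes the file; the other's
        // renameTo() returns false because the source file no longer exists.
        static boolean tryClaim(File file) {
            File claimed = new File(file.getParentFile(),
                    file.getName() + ".processing");   // illustrative suffix
            return file.renameTo(claimed);
        }

        public static void main(String[] args) {
            File incoming = new File("DIR/data-001.SGL");   // hypothetical file
            if (tryClaim(incoming)) {
                System.out.println("This node won the file, processing it");
            } else {
                System.out.println("Another node claimed the file, skipping");
            }
        }
    }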
I need to monitor a certain folder for new files, which I need to process.
I have the following requirements:
The filenames of the files are sequence numbers. I need to process each file in order. (Lowest number first; there's no guarantee that each sequence number exists, e.g. 1, 2, 5, 8, 9.)
If files already exist in the folder during startup, I need to process them directly
I need a guarantee that I only process each file once
I need to avoid reading incomplete files (which are still being copied)
The service should of course be reliable...
What is the most common way to accomplish this?
I'm using Java SE7 and Spring 4.
I already had a look at the WatchService of Java 7, but it seems to have problems with processing already existing files during startup and with avoiding incomplete files.
Assembling comments into an answer.
The easiest way to process the files in the correct order is to load the entire directory listing into an array / list and then sort it using an appropriate comparator, e.g. load the files with File.list() or File.listFiles().
This is not the most efficient methodology, but for fewer than 10,000 files it should be adequate unless you need faster startup performance (I can imagine a small lag before processing begins while all of the files are listed).
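For example, a Java 7-compatible sketch of the listing and sorting step (the assumption that the file names are all digits is mine):

    import java.io.File;
    import java.io.FileFilter;
    import java.util.Arrays;
    import java.util.Comparator;

    public class StartupScan {

        // List the files already present and sort them by the numeric file name,
        // so the lowest sequence number is processed first.
        static File[] pendingFiles(File dir) {
            File[] files = dir.listFiles(new FileFilter() {
                @Override
                public boolean accept(File f) {
                    return f.isFile() && f.getName().matches("\\d+");
                }
            });
            if (files == null) {
                return new File[0];
            }
            Arrays.sort(files, new Comparator<File>() {
                @Override
                public int compare(File a, File b) {
                    return Long.compare(Long.parseLong(a.getName()), Long.parseLong(b.getName()));
                }
            });
            return files;
        }
    }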
To avoid reading incomplete files you should acquire an exclusive FileLock (via a FileChannel, which you can get from a FileOutputStream or FileInputStream; however, you may not be able to get an exclusive lock from a FileInputStream) on the file. Assuming the OS being used supports file locking (which modern OSes do) and the application writing the file is well behaved and holds a lock (hopefully it is), then as soon as you are able to acquire the lock you know the file is complete.
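A small sketch of that check, opening the file in "rw" mode because an exclusive lock needs a writable channel:

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class CompletenessCheck {

        // Returns true once an exclusive lock can be acquired, i.e. the writer
        // has released the file.
        static boolean isComplete(String path) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel channel = raf.getChannel()) {
                FileLock lock = channel.tryLock();
                if (lock == null) {
                    return false;        // somebody else still holds the lock
                }
                lock.release();
                return true;
            }
        }
    }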
If for some reason you cannot rely on file locking then you either need to have the writing program first write to a temporary file (perhaps with a different extension) and then atomically move / rename the file (atomic for most OSes if on the same file system / partition), or monitor the file for a period of time to see if further bytes are being written (not the most robust methodology).
If I have multiple threads that use log4j to write to a single log file, and I want another thread to read it back out, is there a way to safely read (line by line) those logs such that I always read a full line?
EDIT:
The reason for this is that I need to upload all logs to a central location, and they might be logs that are days old or ones that are just being written.
You should use a read write lock.
Read locks can be held by multiple users if there is no one writing to the file, but a write lock can only be held by 1 thread at a time no matter what.
Just make sure that when your writing thread is done writing, it releases the write lock to allow the reading threads to read. Likewise, always release the read lock when the readers are done reading so log4j can continue to write.
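A bare-bones sketch of the reader side (note that the logging code would have to use the same lock instance, which plain log4j appenders won't do out of the box):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class GuardedLogReader {

        // Shared between the code that writes the log and the code that reads it back.
        static final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();

        static void readCompleteLines(String logFile) throws IOException {
            rwLock.readLock().lock();
            try (BufferedReader reader = new BufferedReader(new FileReader(logFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    upload(line);            // hypothetical upload step
                }
            } finally {
                rwLock.readLock().unlock();
            }
        }

        static void upload(String line) {
            // send the line to the central location
        }
    }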
Check out
http://docs.oracle.com/javase/1.5.0/docs/api/java/util/concurrent/locks/ReadWriteLock.html
However, come to think of it, what is your purpose for this? If you simply want to monitor your logs, you should use a different solution rather than having a monitor thread within the same application; that doesn't seem to make sense. If the data is available within the application / service, why pass it off to a file and read it right back in?
It is going to be a pain to implement what you are describing, especially since you have to deal with file rolling.
For your specific requirement, there are better choices:
If the location you are backing up to can be written directly (i.e. mounted in your file system), it is better to simply set your file rolling to write to that backup directory; or
Make use of log management tools like Splunk to monitor and manage your log files (so that you don't even need to copy to that backup directory); or
Even if you need to do the backup all by yourself, you don't need to (and have no reason to) do it in a separate thread. Try writing a shell script that monitors your log directory and uses tools like rsync (or similar logic you write yourself) to upload only the files that differ between the local and remote locations.
My program receives large CSV files and transforms them to XML files. In order to have better performance I would like to split these files into smaller segments of (for example) 500 lines. What are the available Java libraries for splitting text files?
I don't understand what you'd be gaining by splitting up the CSV file into smaller ones? With Java, you can read and process the file as you go, you don't have to read it all at once...
What do you intend to do with that data?
If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing will be the way to go. For record-by-record processing, an existing "pipeline" toolkit may be applicable.
You can pre-process your file with a splitter function like this one or this Splitter.java.
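If you end up rolling your own, a simple splitter could look something like this (the file naming and the 500-line default are arbitrary):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class CsvSplitter {

        // Writes input.csv out as part-0.csv, part-1.csv, ... with at most
        // 'linesPerPart' lines each.
        static void split(Path input, Path outputDir, int linesPerPart) throws IOException {
            Files.createDirectories(outputDir);
            try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                int lineCount = 0;
                int partIndex = 0;
                BufferedWriter writer = null;
                try {
                    while ((line = reader.readLine()) != null) {
                        if (lineCount % linesPerPart == 0) {
                            if (writer != null) {
                                writer.close();
                            }
                            Path part = outputDir.resolve("part-" + partIndex++ + ".csv");
                            writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                        }
                        writer.write(line);
                        writer.newLine();
                        lineCount++;
                    }
                } finally {
                    if (writer != null) {
                        writer.close();
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            split(Paths.get("input.csv"), Paths.get("out"), 500);
        }
    }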
How are you planning on distributing the work once the files have been split?
I have done something similar to this on a framework called GridGain - it's a grid computing framework which allows you to execute tasks on a grid of computers.
With this in hand you can then use a cache provider such as JBoss Cache to distribute the file to multiple nodes, specify a start and end line number and process. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache
Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.
The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. The grid computing stuff is really only for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool which you dispatch "jobs" into after the file is split could work.
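For the local, thread-pool variant, a minimal sketch (assuming the split files from the previous step sit in an out directory, and transformToXml stands in for your existing CSV-to-XML step):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class LocalChunkProcessing {

        public static void main(String[] args) throws InterruptedException {
            File[] parts = new File("out").listFiles();   // the split files from above
            if (parts == null) {
                return;                                   // directory missing or not readable
            }
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (final File part : parts) {
                pool.submit(new Runnable() {
                    @Override
                    public void run() {
                        transformToXml(part);             // hypothetical CSV-to-XML step
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }

        static void transformToXml(File csvPart) {
            // per-chunk work goes here
        }
    }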