Converse of Java FileDescriptor.sync() for *reading* files

Reading the javadoc on FileDescriptor's .sync() method, it is apparent that sync() is primarily concerned with committing any modified buffers back to the underlying storage. I.e., making sure that anything your program has output will actually make it to the disk (or socket or what-have-you, but my question pertains mainly to disks).
But what about the other direction, what about INPUT? Suppose my program has some parts of a java.io.RandomAccessFile buffered in memory, and I want to READ those parts of the file, but perhaps some other process has modified those parts of the file since the last time my program read those blocks?
This is akin to marking a variable as 'volatile' in a C program; something else may have changed the 'real version' of something you merely have a convenient copy of.
I.e., how can you be certain that what your Java program reads is at least reasonably up to date?
(Clearly the definition of 'up to date' matters. Purely as an example, suppose that the other process, the one that writes to the file, does so on the order of maybe once per second, and suppose that the reading process reads maybe once per minute. In a situation like this, performance isn't a big deal; it's just a matter of making sure that what the reader reads is consistent with what the writer writes, to within, say, a second.)

Before re-reading your file, it is usually a good idea to check the last-modified timestamp of the file with File.lastModified(). If this timestamp is not newer than the last time you read the file, you don't need to bother with more disk I/O to re-read the blocks you are interested in. One thing to keep in mind, though, is that the last-modified timestamp may not always be updated immediately when the contents change if you are using a network filesystem. If you are dealing with a local process updating the file and another local process running your code reading the file, you most likely won't run into this issue.
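A minimal sketch of that check (the file path and the stored timestamp field are placeholders of your own):
import java.io.File;

class FreshnessCheck {
    private final File file = new File("data.bin"); // placeholder path
    private long lastSeen = 0L;                     // timestamp of our previous read

    // Returns true if the file reports a newer last-modified time than our last read.
    boolean needsReread() {
        long modified = file.lastModified();        // returns 0L if the file does not exist
        if (modified > lastSeen) {
            lastSeen = modified;
            return true;
        }
        return false;
    }
}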
One method I've had success with in the past was to have a separate thread poll the file for the last modified timestamp on certain intervals, say 5 seconds. If the file changed, re-process the file and send an event to registered listeners. In my case, 5 seconds was more than soon enough to get updates.
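A rough sketch of that polling thread, assuming a 5-second interval and a listener interface of my own invention:
import java.io.File;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class FileWatcher {
    interface ChangeListener { void fileChanged(File f); } // hypothetical callback type

    private final File file;
    private final ChangeListener listener;
    private volatile long lastModified;

    FileWatcher(File file, ChangeListener listener) {
        this.file = file;
        this.listener = listener;
        this.lastModified = file.lastModified();
    }

    void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            long current = file.lastModified();
            if (current > lastModified) {          // the file changed since the last poll
                lastModified = current;
                listener.fileChanged(file);        // notify the registered listener
            }
        }, 5, 5, TimeUnit.SECONDS);
    }
}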

At the moment the file is read into the internal buffer, the contents are up to date with the contents on the disk.
If you want to be sure to have the latest contents on your next access, you have to go to the disk again, skipping all internal buffers and caches. If you really want to be sure that all such layers are skipped, you'll have to reopen the file from scratch and seek to the position you want to access.
Of course, your performance will go down the tubes if you hit the disk on every possible access of the data. Don't think a 3-5 fold slowdown; think orders of magnitude.
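A hedged sketch of that reopen-and-seek approach (the offset and length are placeholders; the OS page cache may still serve the bytes, which is normally fine for local files):
import java.io.IOException;
import java.io.RandomAccessFile;

class FreshRead {
    // Re-open the file and seek, so no stale user-level buffer from an earlier read is reused.
    static byte[] readFresh(String path, long offset, int length) throws IOException {
        byte[] buffer = new byte[length];
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(offset);
            raf.readFully(buffer);   // throws EOFException if fewer bytes are available
        }
        return buffer;
    }
}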

If another program you control is the only one writing to the file, then it's probably best to have two threads in the same Java process coordinate. The easiest solution is to create a java.util.concurrent.atomic.AtomicBoolean. The writer thread calls set(true) on the AtomicBoolean and the reader calls getAndSet(false). If getAndSet() returns true, then you know the reader needs to re-read the data. If it's an issue, you could synchronize on some object to prevent the writer from writing while the reader is reading.
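For illustration, a minimal version of that flag (the class and method names here are mine, not from any library):
import java.util.concurrent.atomic.AtomicBoolean;

class DirtyFlag {
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    // Called by the writer thread after it has written to the file.
    void markDirty() {
        dirty.set(true);
    }

    // Called by the reader thread; returns true if the file must be re-read.
    boolean checkAndClear() {
        return dirty.getAndSet(false);
    }
}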
You said "process" in the question, so maybe you are concerned about any other process on the system changing the data. In that case, I think your best bet is to just reopen and reread the data. The performance impact of this should be negligible if you really are only reading once per minute.

Related

How do you write to a blocking device file using Java?

I have a character special device file on a Linux system (e.g. /dev/foobardcma6) that I wish to constantly write data to. What is the preferred way to do this in Java?
I tried using AsynchronousFileChannel and was able to write some bytes to it, but eventually it blocks/hangs when writing. I don't know if this is the right approach or not, however.
I could use a FileChannel. The write method returns how many bytes were actually written. I assume that if it returns fewer bytes than were requested, the write buffer is full and you should wait before writing again. However, I don't see any mechanism for being notified that the file is ready for writing.
Update:
I tried using a FileChannel, and it also blocks after a certain number of bytes. The suspicious thing is that both the FileChannel and AsynchronousFileChannel implementations block after writing exactly the same number of bytes. In both cases the last call to write never returns.
I have a test utility written in C++ that can successfully write data to the device without issue, so it's not a hardware problem. I assume I'm doing something wrong with the FileChannels.
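For reference, a minimal blocking write loop over a FileChannel that handles partial writes might look like the sketch below (the device path is the one from the question; whether this avoids the hang depends on the driver, so treat it as an illustration only):
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class DeviceWriter {
    static void writeAll(byte[] data) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Paths.get("/dev/foobardcma6"), StandardOpenOption.WRITE)) {
            ByteBuffer buffer = ByteBuffer.wrap(data);
            while (buffer.hasRemaining()) {
                channel.write(buffer);   // may write fewer bytes than remain in the buffer
            }
        }
    }
}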

Why does the PrintWriter class (and other writers) require a call to flush after writing?

I have noticed that some I/O classes in Java (BufferedWriter and FileWriter, among others) require a call to flush() after writing. (With the exception of auto-flush; I'll get to that later.)
For example, this call to println() will not produce any output on its own. However, if I invoke writer#flush() afterwards, the line will print.
PrintWriter writer = new PrintWriter(System.out);
writer.println("test");
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?
Why does the PrintWriter class (and other writers) require a call to flush after writing?
To the extent that flushing is wanted¹, it will be needed if the "stack" of output classes beneath the print writer is doing some output buffering. If an output stream is buffered, then some event needs to trigger pushing (flushing) the buffered output to the external file, pipe, socket or whatever. The things that will trigger flushing are:
the buffer filling up
something calling close() on the stream, or
something calling flush() on the stream.
In the case of a PrintWriter, the underlying stream can also be flushed by the class's auto-flushing mechanism.
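For instance, the two-argument PrintWriter constructor turns on auto-flushing for println(), printf() and format():
import java.io.PrintWriter;

public class AutoFlushDemo {
    public static void main(String[] args) {
        // Second argument true = auto-flush after println(), printf() and format().
        PrintWriter writer = new PrintWriter(System.out, true);
        writer.println("test");   // appears immediately, no explicit flush() needed
    }
}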
The reason for buffering output (in general) is efficiency. Performing the low-level output operation that writes data to the (external) file, pipe, whatever involves a system call. There are significant overheads in doing this, so you want to avoid doing lots of "little" writes.
Output buffering is the standard way to solve this problem. Data to be written is collected in the buffer until the buffer fills up. The net result is that lots of "little" writes can be aggregated into a "big" write. The performance improvement can be significant.
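As a concrete illustration of that layering (the file name is a placeholder): each println() below lands in the BufferedWriter's buffer, and data is pushed down to the FileWriter only when that buffer fills or on flush()/close().
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

public class BufferedOutput {
    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("out.txt")))) {
            for (int i = 0; i < 10_000; i++) {
                out.println("line " + i);   // buffered, not written to disk one line at a time
            }
        } // close() flushes whatever is still sitting in the buffer
    }
}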
Also, does autoflushing impact performance in any way (especially in larger/consistent writes), or is it just a convenience, and is it recommended to use it?
It is really a convenience to avoid having to explicitly flush. It doesn't improve performance. Indeed, if you don't need the data to be flushed¹, then unnecessary auto-flushing will reduce performance.
¹ You would want the data to be flushed if someone or something wants to see the data you are writing as soon as possible.
They don't require flushing as such; you only need it if you want to guarantee that everything written so far has actually been output, which is exactly what flushing does. If you are writing to a file and just want to make sure the data gets there before the program terminates, then there is no need to flush explicitly.
When data is written to an output stream, the underlying operating system does not guarantee that the data will make it to the file system immediately. In many operating systems, the data may be cached in memory, with a write occurring only after a temporary cache is filled or after some amount of time has passed.
If the data is cached in memory and the application terminates unexpectedly, the data would be lost, because it was never written to the file system. To address this, all output stream classes provide a flush() method, which requests that all accumulated data be written immediately to disk.
The flush() method helps reduce the amount of data lost if the application terminates unexpectedly. It is not without cost, though. Each time it is used, it may cause a noticeable delay in the application, especially for large files. Unless the data that you are writing is extremely critical, the flush() method should be used only intermittently. For example, it should not necessarily be called after every write.
You also do not need to call the flush() method when you have finished writing data, since the close() method will automatically do this.
Read from the book here -> OCP Oracle Certified Professional Java SE 11
Hope this is clear to you!

Reading an input stream twice without storing it in memory

With reference to the Stack Overflow question, it is said that an InputStream can be read multiple times using the mark() and reset() methods provided by InputStream, or by using a PushbackInputStream.
In all of these cases the content of the stream is stored in a byte array (i.e. the original content of the file is kept in main memory) and reused multiple times.
What happens when the size of the file exceeds the available memory? I think this may pave the way for an OutOfMemoryError.
Is there any better way to read the stream content multiple times without storing the stream content locally (i.e. in main memory)?
Please help me understand this. Thanks in advance.
It depends on the source of the stream.
If it's a local file, you can likely re-open and re-read the stream as many times as you want.
If it's dynamically generated by a process, a remote service, etc., you might not be free to re-generate it. In that case, you need to store it, either in memory or in some more persistent (and slow) storage like a file system or storage service.
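A rough sketch of the "store it somewhere persistent" option, buffering the stream to a temporary file so it can be re-read any number of times (the class and method names are mine):
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class ReplayableStream {
    // Drain the source stream to a temp file once; callers can then open
    // fresh InputStreams over that file as many times as they need.
    static Path bufferToTempFile(InputStream source) throws IOException {
        Path temp = Files.createTempFile("stream-copy", ".tmp");
        Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }

    static InputStream reopen(Path bufferedCopy) throws IOException {
        return Files.newInputStream(bufferedCopy);
    }
}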
Maybe an analogy would help. Suppose your friend is speaking to you at length. You listen carefully without interruption, but when they are done, you realize you didn't understand something they said near the beginning, and want to review that portion.
At this point, there are a few possibilities.
Perhaps your friend was actually reading aloud from a book. You can simply re-read the book.
Or, perhaps you had the foresight to record their monologue. You can replay the recording.
However, since neither you nor your friend has perfect and unlimited recall, simply repeating verbatim what was said ten minutes ago from memory alone is not an option.
An InputStream is like your friend speaking. Neither of you has a good enough memory to remember exactly, word-for-word, what is said. In the same way, neither a process that is generating the data stream nor your program has enough RAM to store, byte-for-byte, the stream. To scale, your program has to rely on its "short-term memory" (RAM), working with just a small portion of the whole stream at any given time, and "taking notes" (writing to a persistent store) as it encounters important points.
If the source of stream is a local file, then it's like your friend reading a book. Either of you can re-read that content easily enough.
If you copy the stream to some persistent storage, that's like recording your friend's speech. You can replay it as often as you like.
Consider a scenario where a browser is uploading a large file, but the server is busy and unable to read that stream for some time. Where is the data stored during that delay?
Because the receiver can't always respond immediately to input, TCP and many other protocols allocate a small buffer to store some data from a sender. But they also have a way to tell the sender to wait because it is sending data too fast: flow control. Going back to the analogy, it's like telling your friend to pause a moment while you catch up with your note-taking.
As the browser uploads the file, at first the buffer will be filled. But if the server can't keep up, the browser will be instructed to pause its upload until there is more room in the buffer. (This generally happens at the OS and TCP level; the client and server applications don't manage it directly.) The upload speed depends on how fast the browser can read the file from disk, how fast the network link is, and how fast the server can process the uploaded data. Even a fast network and client will be limited by the weakest link in this chain.

Should I fill a stream into a string/container for manipulation?

I want to ask this as a general question. If we have a program which reads data from outside the program, should we first put the data in a container and then manipulate it, or should we work directly on the stream, assuming the stream API of the language is powerful enough?
For example, I am writing a program which reads from a text file. Should I first put the data in a string and then manipulate that, instead of working directly on the stream? I am using Java, and let's say its stream classes are powerful enough for my needs.
Stream processing is generally preferable to accumulating data in memory.
Why? One obvious reason is that the file you are reading might not even fit into memory. You might not even know the size of the data before you've read it completely (imagine, that you are reading from a socket or a pipe rather than a file).
It is also more efficient, especially when the size isn't known ahead of time: allocating large chunks of memory and moving data around between them can be taxing. Things like processing and concatenating large strings aren't free either.
If the I/O is slow (ever tried reading from a tape?), or if the data is being produced in real time by a peer process (socket/pipe), your processing of the data can, at least in part, happen in parallel with reading, which will speed things up.
Stream processing is inherently easier to scale and parallelize if necessary, because your logic is forced to depend only on the element currently being processed; you are free from state. If the amount of data becomes too large to process sequentially, you can trivially scale your app by adding more readers and splitting the stream between them.
You might argue that, in your case, none of this matters because the file you are reading is only 300 bytes. Indeed, for small amounts of data this is not crucial (you may also bubble sort it while you are at it), but adopting good patterns and practices makes you a better programmer, and will help when it does matter. There is no disadvantage to it. No, it does not make your code more complicated. It might seem so at first, but that's simply because you are not used to stream processing. Once you get into the right mindset, and it becomes natural to you, you'll see that, if anything, code that deals with one small piece of data at a time, without caring about indexes, pointers and positions, is simpler than the alternative.
All of the above applies to sequential processing though. You read the stream once, processing the data immediately, as it comes in, and discarding it (or, perhaps, writing out to the next stream in pipeline).
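As a small example of that sequential, read-once style (the file name and the filter predicate are placeholders):
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

class LineCounter {
    public static void main(String[] args) throws IOException {
        // Each line is processed as it is read; nothing forces the whole file into memory.
        try (Stream<String> lines = Files.lines(Paths.get("input.txt"))) {
            long matches = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("matching lines: " + matches);
        }
    }
}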
You mentioned RandomAccessFile ... that's a completely different beast. If you need random access, and the data fits in memory, put it in memory. Seeking the file back and forth is the same thing conceptually, only much slower. There is no benefit to it other than saving memory.
You should certainly process it as you receive it. The other way adds latency and doesn't scale.

Concurrent text file reads by numerous VMs

I have a java app I am porting as a proof-of-concept to a cloud architecture. I want to process a very large text file by running the same processing program on chunks of the file on separate VMs.
Worker nodes = n
Head node running Master and one Worker, with n-1 Worker nodes
I have two ideas in mind:
Master reads file line-by-line, sends first line to first worker node, second to second worker node and so on, repeating every n lines.
Master reads the number of lines in the file. Worker nodes are then instructed to each read no_of_lines/n lines from the file, concurrently.
I am considering using an RMI or sockets based approach for transfer of data. Could anyone tell me which of the above methods would be most efficient? If this question cannot be answered without specifying which java constructs I would be using, I would appreciate suggestions on those.
Also, would locking be an issue with concurrent file access if each node knows which lines it is supposed to read?
Thanks for any suggestions
Ian
To take the second question first, there is never any problem in many programs reading one file IFF no program is writing the file: each program has its own file-position pointer. Even if some program is writing to the file, there might not be any problem if that program is always writing at the end of the file which, in any sane system, is always the case.
As for the first question, IFF all of the lines in the file are of constant length, then the issue is as always one of efficiency: it's more efficient to read several lines than it is to read one line.
If I were doing the project, the master would ask the workers to read (n_lines_in_file/n_workers) lines. There seems to me little point in the master's reading lines and passing them out to workers. That's assuming, though, that each line takes the same amount of worker processing as any other.
If that's not true, or if there are other variables you haven't told about, my strategy would no doubt change.
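Under the fixed-length-line assumption, each worker can compute its own byte offset and read its share directly; a rough sketch (the record length, worker count and line count are placeholders, and the remainder lines from the division are ignored here):
import java.io.IOException;
import java.io.RandomAccessFile;

class WorkerReader {
    // Assumes every line is exactly recordLength bytes, including the newline.
    static void processChunk(String path, int workerIndex, int workerCount,
                             long totalLines, int recordLength) throws IOException {
        long linesPerWorker = totalLines / workerCount;
        long firstLine = workerIndex * linesPerWorker;
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(firstLine * recordLength);
            byte[] record = new byte[recordLength];
            for (long i = 0; i < linesPerWorker; i++) {
                raf.readFully(record);
                // ... hand the record to this worker's processing logic ...
            }
        }
    }
}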
When you break up a program, you should ensure that you are not creating more overhead than you are looking to save. For example, reading a few lines of text is relatively cheap compared with doing an RMI call. Copying the data to many hosts may be more expensive than the processing you intend to do.
How long does the processing take? This will guide you as to how large each piece of work needs to be to be efficient. You may find that the optimal number of threads is one. ;)
