Why use Java's AsynchronousFileChannel? - java

I can understand why network apps would use multiplexing (to not create too many threads), and why programs would use async calls for pipelining (more efficient). But I don't understand the efficiency purpose of AsynchronousFileChannel.
Any ideas?

It's a channel that you can use to read files asynchronously, i.e. the I/O operations are done on a separate thread, so that the thread you're calling it from can do other things while the I/O operations are happening.
For example: The read() methods of the class return a Future object to get the result of reading data from the file. So, what you can do is call read(), which will return immediately with a Future object. In the background, another thread will read the actual data from the file. Your own thread can continue doing things, and when it needs the read data, you call get() on the Future object. That will then return the data (if the background thread hasn't completed reading the data, it will make your thread block until the data is ready). The advantage of this is that your thread doesn't have to wait the whole length of the read operation; it can do some other things until it really needs the data.
See the documentation.
Note that AsynchronousFileChannel will be a new class in Java SE 7, which is not released yet.

I've just come across another, somewhat unexpected reason for using AsynchronousFileChannel. When performing random record-oriented writes across large files (exceeding physical memory so caching isn't helping everything) on NTFS, I find that AsynchronousFileChannel performs over twice as many operations, in single-threaded mode, versus a normal FileChannel.
My best guess is that because the asynchronous io boils down to overlapped IO in Windows 7, the NTFS file system driver is able to update its own internal structures faster when it doesn't have to create a sync point after every call.
I micro-benchmarked against RandomAccessFile to see how it would perform (results are very close to FileChannel, and still half of the performance of AsynchronousFileChannel.
Not sure what happens with multi-threaded writes. This is on Java 7, on an SSD (the SSD is an order of magnitude faster than magnetic, and another order of magnitude faster on smaller files that fit in memory).
Will be interesting to see if the same ratios hold on Linux.

The main reason I can think of to use asynchronous IO is to better utilize the processor. Imagine you have some application which does some sort of processing on a file. And also let's assume you can process the data contained in the file in chunks. If you don't make use of asynchronous IO then your application will probably behave something like this:
Read a block of data. No processor utilization at this point as you're blocked waiting for the data to be read.
process the data you just read. At this point your application will start consuming CPU cycles as it processed the data.
If more data to read, goto #1.
The processor utilization will go up and then to zero and then up and then to zero, ... . Ideally you want to not be idle if you want your application to be efficient and process the data as fast as possible. A better approach would be:
Issue async read
When read completes issue next async read and then process data
The first step is the bootstrapping. You have no data yet so you have to issue a read. From then on, when you get notified a read has completed you issue another async read and then process the data. The benefit here is that by the time you finish processing the chunk of data the next read has probably finished, so you always have data available to process and thus you're more efficiently using the processor. If your processing finishes before the read has finished you might need to issue multiple asynchronous reads so that you have more data to process.
Nick

Here's something no one has mentioned:
A plain FileChannel implements InterruptibleChannel so it, as well as anything that uses it such as the OutputStream returned by Files.newOutputStream(), has the unfortunate[1][2] behaviour that performing any blocking operation on it (e.g. read() and write()) in a thread in interrupted state will cause the Channel itself to close with java.nio.channels.ClosedByInterruptException.
If this is a problem, using AsynchronousFileChannel instead is a possible alternative.
[1] http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6608965
[2] https://bugs.openjdk.java.net/browse/JDK-4469683

Related

Reading huge file in Java

I read a huge File (almost 5 million lines). Each line contains Date and a Request, I must parse Requests between concrete **Date**s. I use BufferedReader for reading File till start Date and than start parse lines. Can I use Threads for parsing lines, because it takes a lot of time?
It isn't entirely clear from your question, but it sounds like you are reparsing your 5 million-line file every time a client requests data. You certainly can solve the problem by throwing more threads and more CPU cores at it, but a better solution would be to improve the efficiency of your application by eliminating duplicate work.
If this is the case, you should redesign your application to avoid reparsing the entire file on every request. Ideally you should store data in a database or in-memory instead of processing a flat text file on every request. Then on a request, look up the information in the database or in-memory data structure.
If you cannot eliminate the 5 million-line file entirely, you can periodically recheck the large file for changes, skip/seek to the end of the last record that was parsed, then parse only new records and update the database or in-memory data structure. This can all optionally be done in a separate thread.
Firstly, 5 million lines of 1000 characters is only 5Gb, which is not necessarily prohibitive for a JVM. If this is actually a critical use case with lots of hits then buying more memory is almost certainly the right thing to do.
Secondly, if that is not possible, most likely the right thing to do is to build an ordered Map based on the date. So every date is a key in the map and points to a list of line numbers which contain the requests. You can then go direct to the relevant line numbers.
Something of the form
HashMap<Date, ArrayList<String>> ()
would do nicely. That should have a memory usage of order 5,000,000*32/8 bytes = 20Mb, which should be fine.
You could also use the FileChannel class to keep the I/O handle open as you go jumping from on line to a different line. This allows Memory Mapping.
See http://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html
And http://en.wikipedia.org/wiki/Memory-mapped_file
A good way to parallelize a lot of small tasks is to wrap the processing of each task with a FutureTask and then pass each task to a ThreadPoolExecutor to run them. The executor should be initalized with the number of CPU cores your system has available.
When you call executor.execute(future), the future will be queued for background processing. To avoid creating and destroying too many threads, the ScheduledThreadPoolExecutor will only create as many threads as you specified and execute the futures one after another.
To retrieve the result of a future, call future.get(). When the future hasn't completed yet (or wasn't even started yet), this method will freeze until it is completed. But other futures get executed in background while you wait.
Remember to call executor.shutdown() when you don't need it anymore, to make sure it terminates the background threads it otherwise keeps around until the keepalive time has expired or it is garbage-collected.
tl;dr pseudocode:
create executor
for each line in file
create new FutureTask which parses that line
pass future task to executor
add future task to a list
for each entry in task list
call entry.get() to retrieve result
executor.shutdown()

Asynchronous processing with a single thread

Even after reading http://krondo.com/?p=1209 or Does an asynchronous call always create/call a new thread? I am still confused about how to provide asynchronous calls on an inherently single-threaded system. I will explain my understanding so far and point out my doubts.
One of the examples I read was describing a TCP server providing asynch processing of requests - a user would call a method e.g. get(Callback c) and the callback would be invoked some time later. Now, my first issue here - we have already two systems, one server and one client. This is not what I mean, cause in fact we have two threads at least - one in the server and one on the client side.
The other example I read was JavaScript, as this is the most prominent example of single-threaded asynch system with Node.js. What I cannot get through my head, maybe thinking in Java terms, is this:If I execute the code below (apologies for incorrect, probably atrocious syntax):
function foo(){
read_file(FIle location, Callback c) //asynchronous call, does not block
//do many things more here, potentially for hours
}
the call to read file executes (sth) and returns, allowing the rest of my function to execute. Since there is only one thread i.e. the one that is executing my function, how on earth the same thread (the one and only one which is executing my stuff) will ever get to read in the bytes from disk?
Basically, it seems to me I am missing some underlying mechanism that is acting like round-robin scheduler of some sort, which is inherently single-threaded and might split the tasks to smaller ones or call into a multiothraded components that would spawn a thread and read the file in.
Thanks in advance for all comments and pointing out my mistakes on the way.
Update: Thanks for all responses. Further good sources that helped me out with this are here:
http://www.html5rocks.com/en/tutorials/async/deferred/
http://lostechies.com/johnteague/2012/11/30/node-js-must-know-concepts-asynchrounous/
http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless (.NET)
http://ejohn.org/blog/how-javascript-timers-work/ (intrinsics of timers)
http://www.mobl-lang.org/283/reducing-the-pain-synchronous-asynchronous-programming/
The real answer is that it depends on what you mean by "single thread".
There are two approaches to multitasking: cooperative and interrupt-driven. Cooperative, which is what the other StackOverflow item you cited describes, requires that routines explicitly relinquish ownership of the processor so it can do other things. Event-driven systems are often designed this way. The advantage is that it's a lot easier to administer and avoids most of the risks of conflicting access to data since only one chunk of your code is ever executing at any one time. The disadvantage is that, because only one thing is being done at a time, everything has to either be designed to execute fairly quickly or be broken up into chunks that to so (via explicit pauses like a yield() call), or the system will appear to freeze until that event has been fully processed.
The other approach -- threads or processes -- actively takes the processor away from running chunks of code, pausing them while something else is done. This is much more complicated to implement, and requires more care in coding since you now have the risk of simultaneous access to shared data structures, but is much more powerful and -- done right -- much more robust and responsive.
Yes, there is indeed a scheduler involved in either case. In the former version the scheduler is just spinning until an event arrives (delivered from the operating system and/or runtime environment, which is implicitly another thread or process) and dispatches that event before handling the next to arrive.
The way I think of it in JavaScript is that there is a Queue which holds events. In the old Java producer/consumer parlance, there is a single consumer thread pulling stuff off this queue and executing every function registered to receive the current event. Events such as asynchronous calls (AJAX requests completing), timeouts or mouse events get pushed on to the Queue as soon as they happen. The single "consumer" thread pulls them off the queue and locates any interested functions and then executes them, it cannot get to the next Event until it has finished invoking all the functions registered on the current one. Thus if you have a handler that never completes, the Queue just fills up - it is said to be "blocked".
The system has more than one thread (it has at least one producer and a consumer) since something generates the events to go on the queue, but as the author of the event handlers you need to be aware that events are processed in a single thread, if you go into a tight loop, you will lock up the only consumer thread and make the system unresponsive.
So in your example :
function foo(){
read_file(location, function(fileContents) {
// called with the fileContents when file is read
}
//do many things more here, potentially for hours
}
If you do as your comments says and execute potentially for hours - the callback which handles fileContents will not fire for hours even though the file has been read. As soon as you hit the last } of foo() the consumer thread is done with this event and can process the next one where it will execute the registered callback with the file contents.
HTH

From classic multithreaded to java.nio asynchronous/non-blocking server

I'm the main developer of an online game.
Players use a specific client software that connects to the game server with TCP/IP (TCP, not UDP)
At the moment, the architecture of the server is a classic multithreaded server with one thread per connection.
But in peak hours, when there are often 300 or 400 connected people, the server is getting more and more laggy.
I was wondering, if by switching to a java.nio.* asynchronous I/O model with few threads managing many connections, if the performances would be better.
Finding example codes on the web that cover the basics of such a server architecture is very easy. However, after hours of googling, I didn't find the answers to some more advanced questions:
1 - The protocol is text-based, not binary-based. The clients and the server exchanges lines of text encoded in UTF-8. A single line of text represents a single command, each lines are properly terminated by \n or \r\n.
For the classic multithreaded server, I have that kind of code :
public Connection (Socket sock) {
this.in = new BufferedReader( new InputStreamReader( sock.getInputStream(), "UTF-8" ));
this.out = new BufferedWriter( new OutputStreamWriter(sock.getOutputStream(), "UTF-8"));
new Thread(this) .start();
}
And then in run, data are read line by line with readLine.
In the doc, I found an utilitiy class Channels that can create a Reader out of a SocketChannel. But it is said that the produced Reader wont work if the Channel is in non-blocking mode, what contradicts the fact that non-blocking mode is mandatory to use the highly performant channel selection API I'm willing to use. So, I suspect that it isn't the right solution for what I would like to do.
The first question is therefore the following: if I can't use that, how to efficiently and properly take care of breaking lines and converting native java strings from/to UTF-8 encoded data in the nio API, with buffers and channels?
Do I have to play with get/put or inside the wrapped byte array by hand? How to go from ByteBuffer to strings encoded in UTF-8 ? I admit to don't understand very well how to use classes in the charset package and how it works to do that.
2 - In the asynchronous/non-blocking I/O world, what about the handling of consecutive read/write that have by nature to be executed sequencially one after the other?
For example, the login procedure, which is typicly challenge-response-based: the server sends a question (a particular computation), the client sends the response, and then the server checks the response given by the client.
The answer is, I think, certainly not to make a single task to send to worker threads for the whole login process, as it is quite long, with the risk to freeze worker threads for too much time (Imagine that scenario: 10 pool threads, 10 players try to connect at the same time; tasks related to players already online are delayed until one thread is again ready).
3 - What happens if two different threads simultaneously call Channel.write(ByteBuffer) on the same Channel?
Do the client might receive mixed up lines ? For example if a thread sends "aaaaa" and another sends "bbbbb", could the client receive "aaabbbbbaa", or am I ensured that everyting is sent in a consist order? Am I allowed to modify the buffer used right after the call returned?
Or asked differently, do I need additional synchronization to avoid this sort of situation?
If I need additionnal synchronization, how to know when release locks and so on, upon write finishes?
I'm afraid that the answer isn't as simple as registering for OP_WRITE in the selector. By trying that, I noticed that I get the write-ready event all the time and always for all clients, exiting Selector.select early mostly for nothing, since there are only 3 or 4 messages to send pers second per client, while the selection loop is performed hundreds of times per second. So, potentially, active wait in perspective, what is very bad.
4 - Can multiple threads call Selector.select on the same selector simultaneously without any concurrency problems such as missing an event, scheduling it twice, etc?
5 - In fact, is nio as good as it is said to be ? Would it be interesting to stay to classic multithreaded model, but unstead of creating a thread per connection, use fewer threads and loop over the connections to look for data availability using InputStream.isAvailable ? Is that idea stupid and/or inefficient?
1) Yes. I think that you need to write your own nonblocking readLine method. Note also that a nonblocking read may be signaled when there are several lines in the buffer, or when there is an incomplete line:
Example: (first read)
USER foo
PASS
(second read)
bar
You will need to store (see 2) the data that was not consumed, until enough information is ready to process it.
//channel was select for OP_READ
read data from channel
prepend data from previous read
split complete lines
save incomplete line
execute commands
2) You will need to keep the state of each client.
Map<SocketChannel,State> clients = new HashMap<SocketChannel,State>();
when a channel is connected, put a fresh state into the map
clients.put(channel,new State());
Or store the current state as the attached object of the SelectionKey.
Then, when executing each command, update the state. You may write it as a monolithic method, or do something more fancy such as polymorphic implementations of State, where each state knows how to deal with some commands (e.g. LoginState expects USER and PASS, then you change the state into a new AuthorizedState).
3) I don't recall using NIO with many asynchronous writers per channel, but the documentation says it is thread safe (I won't elaborate, since I have no proof of this). About OP_WRITE, note that it signals when the write buffer is not full. In other words, as said here: OP_WRITE is almost always ready, i.e. except when the socket send buffer is full, so you will just cause your Selector.select() method to spin mindlessly.
4) Yes. Selector.select() performs a blocking selection operation.
5) I think that the most difficult part is switching from a thread-per-client architecture, to a different design where reads and writes are decoupled from processing. Once you have done that, it is easier to work with channels than working your own way with blocking streams.

Chat system in Java

Is there a way to immediately print the message received from the client without using an infinite loop to check whether the input stream is empty or not?
Because I found that using infinite loop consumes a lot of system resources, which makes the program running so slow. And we also have to do the same (infinite loop) on the client side to print the message on the screen in real time.
I'm using Java.
You should be dealing with the input stream in a separate Thread - and let it block waiting for input. It will not use any resources while it blocks. If you're seeing excessive resource usage while doing this sort of thing, you're doing it wrong.
I think you can just put your loop in a different thread and have it sleep a bit (maybe for half a second?) between iterations. It would still be an infinite loop, but it would not consume nearly as many resources.
You don't you change your architecture a little bit to accommodate WebSockets. check out Socket.IO . It is a cross browser WebSockets enabler.
You will have to write controllers (servlets for example in java) that push data to the client. This does not follow the request-response architecture.
You can also architect it so that a "push servlet" triggers a "request" from the client to obtain the "response".
Since your question talks about Java, and if you are interested in WebSockets, check this link out.
If you're using Sockets, which you should be for any networking.
Then you can use the socket's DataInputStream which you can get using socket.getInputStream() (i think that's the right method) and do the following:
public DataInputStream streamIn;
public Socket soc;
// initialize socket, etc...
streamIn = soc.getInputStream();
public String getInput() {
return (String) streamIn.readUTF(); // Do some other casting if this doesn't work
}
streamIn.readUTF() blocks until data is available, meaning you don't have to loop, and threading will let you do other processing while you wait for data.
Look here for more information on DataInputStream and what you can do with it: http://docs.oracle.com/javase/6/docs/api/java/io/DataInputStream.html
A method that does not require threads would involve subclassing the input stream and adding a notify type method. When called this method would alert any interested objects (i.e. objects that would have to change state due to the additions to the stream) that changes have been made. These interested objects could then respond in anyway that is desired.
Objects writing to the buffer would do their normal writing, and afterward would call the notify() method on the input stream, informing all interested objects of the change.
Edit: This might require subclassing more than a couple of classes and so could involve a lot of code changes. Without knowing more about your design you would have to decide if the implementation is worth the effort.
There are two approaches that avoid busy loops / sleeps.
Use a thread for each client connection, and simply have each thread call read. This blocks the thread until the client sends some data, but that's no problem because it doesn't block the threads handling other clients.
Use Java NIO channel selectors. These allow a thread to wait until one of set of channels (in this case sockets) has data to be read. There is a section of the Oracle Java Tutorials on this.
Of these two approaches, the second one is most efficient in terms of overall resource usage. (The thread-per-client approach uses a lot of memory on thread stacks, and CPU on thread switching overheads.)
Busy loops that repeatedly call (say) InputStream.available() to see if there is any input are horribly inefficient. You can make them less inefficient by slowing down the polling with Thread.sleep(...) calls, but this has the side effect of making the service less responsive. For instance, if you add a 1 second sleep between each set of polls, the effect that each client will see is that the server typically delays 1 second before processing each request. Assuming that those requests are keystrokes and the responses echo them, the net result is a horribly laggy service.

Are RandomAccessFile writes asynchronous?

Looking at the constructor of RandomAccessFile for the mode it says 'rws' The file is opened for reading and writing. Every change of the file's content or metadata must be written synchronously to the target device.
Does this imply that the mode 'rw' is asynchronous? Do I need to include the 's' if I need to know when the file write is complete?
Are RandomAccessFile writes asynchronous?
The synchronous / asynchronous distinction refers to the guarantee that the data / metadata has been safely to disk before the write call returns. Without the guarantee of synchronous mode, it is possible that the data that you wrote may still only be in memory at the point that the write system call completes. (The data will be written to disk eventually ... typically within a few seconds ... unless the operating system crashes or the machine dies due to a power failure or some such.)
Synchronous mode output is (obviously) slower that asynchronous mode output.
Does this imply that the mode 'rw' is asynchronous?
Yes, it is, in the sense above.
Do I need to include the 's' if I need to know when the file write is complete?
Yes, if by "complete" you mean "written to disc".
That is true for RandomAccessFile and java.io classes when using multiple threads. The mode "rw" offers asynchronous read/write but you can use a synchronous mode for read and write operations.

Categories

Resources