I want to measure per-thread I/O from code in my live, running application. So far the best (and only) solution I have found is this one, which requires me to hook directly into Windows' Performance Monitor. However, it seems very complex, and there must be a simpler way to do this. I don't mind writing different code for Windows and Linux; to be honest, I was expecting to.
Thanks for your help.
Update:
So what I mean by disk I/O is: if you open the Windows Resource Monitor and go to the Disk tab, you can look at each running process and see its average read/write rate in B/sec over the last minute. I want to get the same data, but per thread.
Take a look at Hyperic Sigar. It is a cross-platform native resource Java API.
The class FileSystemUsage gives you a bunch of stats about file system usage (obvious...) such as
Disk Queue
Disk Writes / Disk Bytes Written
Disk Reads / Disk Bytes Read
Most of them are cumulative, but you just compute a delta every n seconds to get a rate.
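The delta computation can be sketched as below. This is only an illustration: `RateSampler` is a hypothetical helper, not part of Sigar, and the counter is passed in as a plain `LongSupplier` so that any cumulative value (for example a bytes-written counter read from a monitoring API) could be plugged in.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: converts a cumulative counter into a per-second rate.
class RateSampler {
    private final LongSupplier counter; // supplies the current cumulative value
    private long lastValue;
    private long lastNanos;

    RateSampler(LongSupplier counter, long nowNanos) {
        this.counter = counter;
        this.lastValue = counter.getAsLong();
        this.lastNanos = nowNanos;
    }

    /** Returns the average rate (units per second) since the previous sample. */
    double sample(long nowNanos) {
        long value = counter.getAsLong();
        long elapsed = nowNanos - lastNanos;
        double rate = elapsed <= 0 ? 0.0 : (value - lastValue) * 1_000_000_000.0 / elapsed;
        lastValue = value;
        lastNanos = nowNanos;
        return rate;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] fakeBytesWritten = {0}; // stand-in for a real cumulative counter
        RateSampler sampler = new RateSampler(() -> fakeBytesWritten[0], System.nanoTime());
        for (int i = 0; i < 3; i++) {
            Thread.sleep(1000);
            fakeBytesWritten[0] += 4096; // pretend 4 KiB were written in this interval
            System.out.printf("%.0f B/s%n", sampler.sample(System.nanoTime()));
        }
    }
}
```

Passing the timestamp in explicitly keeps the arithmetic easy to check; in real use you would call `sample(System.nanoTime())` from a scheduled task every n seconds.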
There's also a javaagent for profiling/monitoring file IO called JPicus. I have not looked at it in a while, but it may be useful.
Related
Problem statement: an FTP server is flooded with files arriving at a rate of 100 Mbps (i.e. 12.5 MB/s); each file is approximately 100 MB. Files are deleted 30 seconds after their creation timestamp. Any process that wants to read a file must take the complete file away in less than 30 seconds. I am using Java to solve this problem.
I need suggestions on which design pattern would be best suited to this kind of problem. How can I make sure that each file is consumed before the server deletes it?
Your suggestions will be greatly appreciated. Thanks.
If the Java application runs on the same machine as the FTP service, then it could use File.renameTo(File) or equivalent to move a required file out of the FTP server directory and into another directory. Then it can process it at its leisure. It could use a WatchService to monitor the FTP directory for newly arrived files. It should watch for events on the directory, and when a file starts to appear it should wait for the writes to stop happening. (Depending on the OS, you may or may not be able to move the file while the FTP service is writing to it.)
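A minimal sketch of that watch-then-move idea, under several assumptions: `FileClaimer`, `claim` and `waitUntilStable` are hypothetical names, the two directories come from the command line, and "writes have stopped" is approximated by the file size staying unchanged between two polls (which is a heuristic, not a guarantee). It uses `Files.move`, the NIO counterpart of `File.renameTo`.

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

class FileClaimer {
    /** Moves the file out of the FTP directory so the 30-second cleanup cannot delete it. */
    static Path claim(Path file, Path claimedDir) throws IOException {
        Files.createDirectories(claimedDir);
        Path target = claimedDir.resolve(file.getFileName());
        try {
            return Files.move(file, target, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Typically a different file store: fall back to a plain (possibly copying) move.
            return Files.move(file, target);
        }
    }

    /** Heuristic: treat the upload as finished once the size stops changing between polls. */
    static void waitUntilStable(Path file, long pollMillis) throws IOException, InterruptedException {
        long previous = -1;
        long current = Files.size(file);
        while (current != previous) {
            previous = current;
            Thread.sleep(pollMillis);
            current = Files.size(file);
        }
    }

    public static void main(String[] args) throws Exception {
        Path incoming = Paths.get(args[0]); // FTP server directory (assumption)
        Path claimed = Paths.get(args[1]);  // where files are moved for later processing
        try (WatchService watcher = FileSystems.getDefault().newWatchService()) {
            incoming.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            while (true) {
                WatchKey key = watcher.take(); // blocks until something is created
                for (WatchEvent<?> event : key.pollEvents()) {
                    if (event.kind() == StandardWatchEventKinds.OVERFLOW) {
                        continue; // events were lost; a real version would rescan the directory
                    }
                    Path file = incoming.resolve((Path) event.context());
                    waitUntilStable(file, 500);
                    System.out.println("claimed " + claim(file, claimed));
                }
                if (!key.reset()) {
                    break; // directory is no longer accessible
                }
            }
        }
    }
}
```

This version handles files one at a time on the watcher thread; with several uploads in flight you would probably hand each new file to a worker thread so that waiting on one upload does not delay claiming another.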
There is a secondary issue of whether a Java application could keep up with the required processing rate. However, if you had multiple cores and multiple worker threads, then your app could potentially process the files in parallel. (It depends on how computationally and/or I/O intensive the processing is, and on the rate at which a Java thread can read a file, which will be OS and possibly hardware dependent.)
If the Java application is not running on the FTP server, it would probably need to use FTP to fetch the file. I am doubtful that you could implement something to do that consistently and reliably; i.e. without losing files occasionally.
I have a simple question. I've read up online but couldn't find a simple solution.
I'm running a Java program that accesses a database from the command line, as follows:
java -jar myProgram.jar
I would like a simple mechanism to see the number of disk I/Os performed by this program (on OSX).
So far I've come across iotop, but how do I get iotop to measure the disk I/O of myProgram.jar?
Do I need a profiler like JProfiler to get this information?
iotop is a utility which shows the top n processes in descending order of I/O consumption/utilization.
Most importantly, it is a live monitoring utility, which means its output changes every n seconds (or at whatever interval you specify). You can redirect its output to a file, but you then need to parse that file and plot the data to get anything meaningful out of it.
I would recommend using sar. You can read more about it here.
It is a low-level system monitoring utility on Linux/Unix, and it will give you much more data than iotop.
The best thing about sar is that you can collect the data with a daemon while your program is running and analyze it later using ksar.
In my opinion, you can follow the approach below:
Start sar monitoring and collect sar data every n seconds. The value of n depends on the approximate execution time of your program.
Example: if your program takes 10 seconds to execute, then sampling every second is good, but if your program takes 1 hour to execute, then sample every 30 seconds or every minute. This minimizes the overhead of the sar process while still giving you meaningful data.
Wait for some time (so that you get data from before your program starts) and then start your program.
Let your program run to completion.
Wait for some time again (so that you get data from after your program finishes).
Stop sar.
Visualize the sar data using ksar. To start with, check disk utilization and then the IOPS for the disk.
You can use profilers for the same thing, but they have a few drawbacks:
They need their own agents (and the agents have their own overhead).
Some of them are not free.
Some of them are not easy to set up.
They may or may not provide enough of the data you need.
Besides this, IMHO, using built-in/system-level utilities is always beneficial.
I hope this was helpful.
Your Java program will ultimately be a process on the host system, so you need to filter the output of the monitoring tool by your own process id. Refer to the Scripts section of this blog post.
Also, even though you have tagged the question with OS X, do mention in the question itself that you are using OS X.
If you are looking for offline data, that is provided by the proc filesystem on Unix-based systems, but unfortunately it is missing on OS X; see Where is the /proc folder on Mac OS X?
/proc on Mac OS X
You might choose to write a small script that dumps data from disk and process monitoring tools for your process id. The script can look up your process id by process name: put it in a loop that looks for that process name, and start the script before you execute your Java program. When the script finds the process, it keeps dumping the relevant data from the commands you have chosen, at the intervals you have decided on. Once your program ends, the log-dumping script also terminates.
My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk. My quick fix is just to reduce logging, which is easy enough to do with respect to my QA requirements. Of course, this isn't vertically scalable which will be a problem soon enough.
I thought about just increasing the thread count but I'm guessing the bottleneck is on file contention and this could be rather bad if it's the wrong thing to do.
I have a lot of ideas but really no idea which are fruitful.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct? How do I determine it? Could decreasing the thread count help?
How do I profile the right # of threads to be writing to disk? Is this a function of number of write requests, number of bytes written per second, number of bytes per write op, what else?
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
So my question is: how to profile to determine the right number of threads that can safely write to a common file? What variables determine this - number of write operations, number of bytes written per second, number of bytes per write request, any OS or hard disk information.
Also is there any way to make the log file more liberal to be written to? We timestamp everything so I'm okay with a minority of out-of-order lines if it reduces blocking.
My threads have fallen behind schedule and a thread dump reveals they're all stuck in blocking IO writing log output to hard disk.
Typically in these situations, I schedule a thread just for logging. Most logging classes (such as PrintStream) are synchronized and write/flush each line of output. By moving to a central logging thread and using some sort of BlockingQueue to queue up log messages to be written, you can make use of a BufferedWriter or some such to limit the individual IO requests. The default buffer size is 8k but you should increase that size. You'll need to make sure that you properly close the stream when your application shuts down.
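Here is a minimal sketch of such a logging thread. The class name `AsyncFileLogger`, the queue capacity of 10,000 and the 64 KiB buffer in the example are all assumptions chosen for illustration. Worker threads only enqueue; a single thread drains the queue into a `BufferedWriter`, and `close()` flushes and closes the stream on shutdown.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class AsyncFileLogger implements AutoCloseable {
    // Sentinel object compared by identity, so a logged "STOP" string is not confused with it.
    private static final String STOP = new String("STOP");

    // Bounded queue: producers block (back-pressure) if the disk cannot keep up.
    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>(10_000);
    private final Thread writerThread;

    AsyncFileLogger(Path file, int bufferChars) throws IOException {
        BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8),
                bufferChars);
        writerThread = new Thread(() -> {
            try (BufferedWriter w = out) { // closing flushes whatever is still buffered
                while (true) {
                    String line = queue.take();
                    if (line == STOP) {
                        break;
                    }
                    w.write(line);
                    w.newLine();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "log-writer");
        writerThread.start();
    }

    /** Called by worker threads; does no disk I/O itself. */
    void log(String line) {
        try {
            queue.put(line);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Drains the queue, then flushes and closes the file. */
    @Override
    public void close() {
        try {
            queue.put(STOP);
            writerThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("app", ".log");
        try (AsyncFileLogger logger = new AsyncFileLogger(file, 64 * 1024)) {
            Thread[] workers = new Thread[4];
            for (int t = 0; t < workers.length; t++) {
                int id = t;
                workers[t] = new Thread(() -> {
                    for (int i = 0; i < 1000; i++) {
                        logger.log("worker " + id + " message " + i);
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();
            }
        }
        System.out.println("wrote " + Files.readAllLines(file).size() + " lines to " + file);
    }
}
```

One thing the sketch leaves out: if the writer thread dies on an `IOException`, producers will eventually block on the full queue, so a real version would need to surface that failure to callers.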
With a buffered writer, you could additionally write through a GZIPOutputStream which would significantly lower your IO requirements if your log messages repeat a lot.
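As a rough illustration of the compression idea, the writer can simply be layered on top of a `GZIPOutputStream`. `GzipLogWriter` and its two methods are hypothetical names, and the 64 KiB buffer is an arbitrary choice.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipLogWriter {
    /** Opens a buffered writer whose output is gzip-compressed before it reaches the disk. */
    static BufferedWriter open(Path file) throws IOException {
        return new BufferedWriter(
                new OutputStreamWriter(
                        new GZIPOutputStream(Files.newOutputStream(file)),
                        StandardCharsets.UTF_8),
                64 * 1024);
    }

    /** Opens the compressed log for reading, e.g. for later analysis. */
    static BufferedReader read(Path file) throws IOException {
        return new BufferedReader(
                new InputStreamReader(
                        new GZIPInputStream(Files.newInputStream(file)),
                        StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("app", ".log.gz");
        String line = "2024-01-01 12:00:00 INFO request handled"; // made-up sample message
        try (BufferedWriter w = open(file)) {
            for (int i = 0; i < 100_000; i++) {
                w.write(line);
                w.newLine();
            }
        }
        long raw = 100_000L * (line.length() + System.lineSeparator().length());
        System.out.println("raw bytes: " + raw + ", bytes on disk: " + Files.size(file));
    }
}
```

The trade-off is that the gzip stream is only complete once it has been closed, so the tail of the log may be unreadable after a crash, and compression costs some CPU on the logging thread.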
That said, if you are outputting too much debugging information, you may be SOL and need to either decrease your logging bandwidth or increase your disk IO speeds. Once you've optimized your application, the next steps include moving to SSD on your log server to handle the load. You could also try distributing the log messages to multiple servers to be persisted but a local SSD would most likely be faster.
To simulate the benefits of an SSD, a local RAM disk should give you a pretty good idea of what increased IO bandwidth would buy you.
I thought about increasing the thread count but I'm guessing they're bottlenecked so this won't do anything. Is this correct?
If all your threads are blocked in IO, then yes, increasing the thread count will not help.
How do I profile the right # of threads to be writing to disk?
Tough question. You are going to have to do some test runs. Measure the throughput of your application with 10 threads, with 20, etc. You are trying to maximize the overall number of transactions processed in a given time. Make sure your test runs execute for a couple of minutes for best results. But it is important to realize that a single thread can easily swamp a disk or network IO stream if it is spewing too much output.
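A test harness for such runs might look roughly like the following. Everything here is a placeholder: `ThreadCountBenchmark` is a hypothetical class, and the `simulatedWork` task (a short sleep) only stands in for one real transaction of your application, including its logging.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ThreadCountBenchmark {
    /** Runs `tasks` copies of `task` on a pool of `threads` threads and returns tasks per second. */
    static double tasksPerSecond(int threads, int tasks, Runnable task) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int i = 0; i < tasks; i++) {
            pool.execute(task);
        }
        pool.shutdown(); // no new tasks; already queued ones still run
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("benchmark did not finish in time");
        }
        long elapsedNanos = System.nanoTime() - start;
        return tasks * 1_000_000_000.0 / elapsedNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        // Placeholder workload: replace with one real unit of work from your application.
        Runnable simulatedWork = () -> {
            try {
                Thread.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        for (int threads : new int[] {1, 2, 5, 10, 20, 40}) {
            double rate = tasksPerSecond(threads, 2_000, simulatedWork);
            System.out.printf("%2d threads: %.0f tasks/s%n", threads, rate);
        }
    }
}
```

With a real workload, the point where the tasks-per-second figure stops improving as threads are added is a reasonable first estimate of the useful thread count; use far more tasks than shown here so that each run lasts minutes rather than seconds.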
Can I toggle a lower-level setting (filesystem, OS, etc.) to reduce locking on a file in exchange for out-of-order lines being possible? Either in my Java application or lower level?
No. See my buffered logging thread above. This is not about file locking, which (I assume) is not happening. This is about the number of IO requests per second.
Can I profile my system or hard disk to ensure it's not somehow being overworked? (Vague, but I'm out of my domain here).
If you are IO bound, then the IO is slowing you down, so it is "overworked". Moving to an SSD or RAM disk is an easy test to see if your application runs faster.
I want to read a 500 MB file with the help of 2 threads, so that reading the file will be much faster. Could someone please give me some code for the task using core Java concepts?
Multi-threading is not likely to make the code faster at all. This is because reading a file is an I/O-bound process. You will be limited by the speed of the disk rather than your processor.
Instead of trying to multi-thread the reading, you may benefit from multi-threading the processing of the data. This can make it look like using multiple threads to read can help, but in reality, using one thread to read and multiple threads to process is often better.
Processing the data often takes longer than reading it and is CPU bound. Using multiple threads to read files usually helps only when you have multiple files on different physical disks (a rare occasion).
While you might not be able to speed up the read from disk by using multiple threads to read the file, you can speed up the overall process by not doing the processing in the same thread as the read.
How much this helps will depend on the contents of the file.
I'm trying to think how should I utilize threads in my program.
Right now I have a single threaded program that reads a single huge file. Very simple program, just reads line by line and collects some statistics about the words.
Now, I would like to use multi threads to make it faster. I'm not sure how to approach this.
One solution is to separate the data into X pieces in advance, then have X threads, each runs on one piece simultaneously, with one sync method to write the stats to memory. Is there a better approach? specifically, I would like to avoid separating the data in advance.
Thanks!
First of all, do some profiling to make sure that your process is actually compute-bound and not I/O bound. That is, that your statistics collection is slower than accessing the file. Otherwise, multi-threading will slow your program, not speed it, particularly if you are running on a single-core CPU (or an ancient JVM).
Also consider: if your file resides on a hard drive: how will you schedule reads? You risk adding hard drive seek delays otherwise, stalling all threads that have managed to finish their chunk of work, while one thread is asking the hard drive to seek to position 0x03457000...
You could have a look at the producer-consumer approach. It is a classical threading problem where one thread produces data (in your case the one that reads the file) and writes it to a shared buffer from where another thread reads that data (consumer) which is your calculation thread (some Java examples).
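Applied to the word-statistics case, a producer-consumer arrangement might look roughly like this. It is a sketch under assumptions: `WordStats` is a hypothetical class, lines are passed in batches to keep queue overhead low, a shared empty list acts as an end-of-input marker, and "statistics" is reduced to a plain word count. No pre-splitting of the file is needed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class WordStats {
    // End-of-input marker, recognised by identity rather than by content.
    private static final List<String> POISON = new ArrayList<>();

    static Map<String, Long> count(BufferedReader in, int consumers, int batchSize)
            throws IOException, InterruptedException, ExecutionException {
        BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(64); // the shared buffer
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        List<Future<Map<String, Long>>> results = new ArrayList<>();

        // Consumers: each keeps a private map, so no locking is needed while counting.
        for (int i = 0; i < consumers; i++) {
            results.add(pool.submit(() -> {
                Map<String, Long> local = new HashMap<>();
                for (List<String> batch = queue.take(); batch != POISON; batch = queue.take()) {
                    for (String line : batch) {
                        for (String word : line.split("\\s+")) {
                            if (!word.isEmpty()) {
                                local.merge(word, 1L, Long::sum);
                            }
                        }
                    }
                }
                return local;
            }));
        }

        // Producer: the calling thread is the only one that touches the file.
        try {
            List<String> batch = new ArrayList<>(batchSize);
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    queue.put(batch); // blocks if the consumers fall behind
                    batch = new ArrayList<>(batchSize);
                }
            }
            if (!batch.isEmpty()) {
                queue.put(batch);
            }
        } finally {
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON); // one marker per consumer
            }
            pool.shutdown();
        }

        // Merge the per-thread maps into the final result.
        Map<String, Long> total = new HashMap<>();
        for (Future<Map<String, Long>> f : results) {
            f.get().forEach((word, n) -> total.merge(word, n, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            Map<String, Long> counts = count(in, Runtime.getRuntime().availableProcessors(), 1_000);
            System.out.println(counts.size() + " distinct words");
        }
    }
}
```

Giving each consumer its own map and merging at the end avoids the single synchronized "write the stats" method from the question, which would otherwise become the contention point.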
Also have a look at Java's non-blocking IO.
The assumption that multithreaded disk access is faster might be wrong, as discussed here: Directory walker on modern operating systems slower when it's multi-threaded?
Performance improvement could be achieved by splitting reading and processing of data in separate threads.
But wait, reading files line by line? That doesn't sound optimal. It may be better to read them as a stream of characters (using FileReader).
See this Sun tutorial.
If your problem is I/O bound, maybe you can consider splitting your data into multiple files, putting them into a distributed filesystem such as the Hadoop Distributed File System (HDFS), and then running a Map/Reduce operation on them?