disk I/O of a command line java program

disk I/O of a command line java program - java

I have a simple question, I've read up online but couldn't find a simple solution:
I'm running a java program on the command line as follows which accesses a database:
java -jar myProgram.jar
I would like a simple mechanism to see the number of disk I/Os performed by this program (on OSX).
So far I've come across iotop but how do I get iotop to measure the disk I/O of myProgram.jar?
Do I need a profiler like JProfiler do get this information?

iotop is a utility which gives you top n processes in descending order of IO consumption/utilization.
Most importantly it is a live monitoring utility which means its output changes every n sec( or time interval you specify). Though you can redirect it to a file, you need to parse that file and find out meaningful data after plotting a graph.
I would recommend to use sar. you can read more about it here
It is the lowest level monitoring utility in linux/unix. It will give you much more data than iotop.
best thing about sar is you can collect the data using a daemon when your program is running and then later analyze it using ksar
According to me, you can follow below approach,
Start sar monitoring, collect sar data every n seconds. value of n depends of approximate execution time of your program.
example : if your program takes 10 seconds to execute then monitoring per sec is good but if your program takes 1hr to execute then monitor per min or 30 sec. This will minimize overhead of sar process and still your data is meaningful.
Wait for some time (so that you get data before your program starts) and then start your program
end of your program execution
wait for some time again (so that you get data after your program finishes)
stop sar.
Monitor/visualize sar data using ksar. To start with, you check for disk utilization and then IOPS for a disk.
You can use Profilers for same thing but they have few drawbacks,
They need their own agents (agents will have their own overhead)
Some of them are not free.
Some of them are not easy to set up.
may or may not provide enough/required data.
besides this IMHO, Using inbuilt/system level utilities is always beneficial.
I hope this was helpful.

Your Java program will eventually be a process for host system so you need to filter out output of monitoring tool for your own process id. Refer Scripts section of this Blog Post
Also, even though you have tagged question with OsX but do mention in question that you are using OsX.
If you are looking for offline data - that is provided by proc filesystem in Unix bases systems but unfortunately that is missing in OSX , Where is the /proc folder on Mac OS X?
/proc on Mac OS X
You might chose to write a small script to dump data from disk and process monitoring tools for your process id. You can get your process id in script by process name, put script in a loop to look for that process name and start script before you execute your Java program. When script finds the said process, it will keep dumping relevant data from commands chosen by you at intervals decided by you. Once your programs ends ,log dumping script also terminates.

Related

Is it better to launch a Java app once and sleep or repeat launching and killing?

I have a Java application that needs to run several times. Every time it runs, it checks if there's data to process and if so, it processes the data.
I'm trying to figure out what's the best approach (performance, resource consumption, etc.) to do this:
1.- Launch it once, and if there's nothing to process make it sleep (All Java).
2.- Using a bash script to launch the Java app, and when it finishes, sleep (the script) and then relaunch the java app.
I was wondering if it is best to keep the Java app alive (sleeping) or relaunching every time.

It's hard to answer your question without the specific context. On the face of it, your questions sounds like it could be a premature optimization.
Generally, I suggest you do what's easier for you to do (and to maintain), unless you have good reasons not to. Here are some possible good reasons, pick the ones appropriate to your situation:
For sleeping in Java:
The check of whether there's new data is easier in Java
Starting the Java program takes time or other resources, for example if on startup, your program needs to load a bunch of data
Starting the Java process from bash is complex for some reason - maybe it requires you to fiddle with a bunch of environment variables, files or something else.
For re-launching the Java program from bash:
The check of whether there's new data is easier in bash
Getting the Java process to sleep is complex - maybe your Java process is a complex multi-threaded beast, and stopping, and then re-starting the various threads is complicated.
You need the memory in between Java jobs - killing the Java process entirely would free all of its memory.

I would not keep it alive.
Instead of it you can use some Job which runs at defined intervals you can use jenkins or you can use Windows scheduler and configure it to run every 5 minutes (as you wish).
Run a batch file with Windows task scheduler
And from your batch file you can do following:
javac JavaFileName.java // To Compile
java JavaFileName // to execute file
See here how to execute java file from cmd :
How do I run a Java program from the command line on Windows?

I personally would determine it, by the place where the application is working.
if it would be my personal computer, I would use second option with bash script (as resources on my local machine might change a lot, due to extensive use of some other programs and it can happen that at some point I might be running out of memory for example)
if it goes to cloud (amazon, google, whatever) I know exactly what kind of processes are running there (it should not change so dynamically comparing to my local PC) and long running java with some scheduler would be fine for me

Memory Usage of of external tools in java

I have a few executable tools. In my Java application I need to launch each of them for a few hundred times and measure their memory consumption based on different inputs. I am using
Runtime.getRuntime().exec(externalToolCommand);
to execute external tools. But I don't know how to measure the max memory usage of the external tools.
To make more clear I will exemplify it;
Let say I have prism.exe, mrmc.exe, and plasma.exe which are three executable external tools. I have want to know when I launch one of the tools e.g. prism.exe, how much memory it consumes. I don't need to measure my Java application memory consumption. I need only to know the external memory consumption.
Thanks.

I don't know the exact code but here is what I could think of. On Windows, you can launch a batch script say p.bat from your Java application and launch a powershell script from p.bat say q.ps1 and now you have got access to a powershell script. You can run a process monitor tool(I don't know maybe perfmon...) to measure one time memory consumption of it, log it in a text file. Terminate the process from the script. Do the whole process in loop in your Java application. Finally you have got the file containing the process memory consumption n number of times.
But Beware! it is really expensive as it involves File I/O, context switch back and forth, pipelining on the top.
With powershell, the possibilities are endless. But I am no powershell expert so pardon, I can't write the exact code for you. This answer requires a lot of research on your side on various steps involved.

Try "jvisualvm", you can find it at /bin/jvisualvm.exe

Writing a collector for OpenTSDB

I don't understand from the official documentation of OpenTSDB how to create a collector and how to make it running. In addition to that, i would like to make one collector in Java language.
I'm also a bit new to Unix systems, but i know the basics

Writing a collector for OpenTSDB is quite simple, if you have cloned from git repository the tcollector script you will see the startstop executable, this daemon once being launched will execute all files that are stored inside ./tcollector/collectors/NUMBER where NUMBER is the periodicity in minutes.
Said that, what you need to do is coding those scripts that will be stored inside collectors folder. When OpenTSDB executes those scripts it expects the following output:
<METRIC> <UNIX_TIMESTAMP> <VALUE>
So, in your case. Imaging you want to report the temperature of your PC (call each 5 minutes, you will have to follow the next steps:
Write your script, for example in Java, that gets the temperature of your PC (using SNMP, from the OS, or using any other method). Then when you run your script manually it will output: pc.temperature 1371075574 40
Put the script under ./tcollector/collectors/5/ so OpenTSDB will launch it each 5 minutes
Launch the collector by invoking startstop (OpenTSDB must be running)
A more detailed explanation here.

Measuring Thread I/O in Java

I want to be able to measure thread I/O through code on my live/running application. So far the best (and only) solution I found is this one which requires me to hook directly into windows' performance monitor. However it seems very complex and there must be a simpler way to do this. I don't mind writing different code for Windows & Linux, to be honest I was expecting it.
Thanks for you help
Update:
So what I mean by Disk I/O is: If you open windows resource monitoring and go to the disk tab you can look at each process that is running and lock at the avg read/write of B/Sec over the last minute. I want to get the same data but per thread

Take a look at Hyperic Sigar. It is a cross-platform native resource Java API.
The class FileSystemUsage gives you a bunch of stats about file system usage (obvious...) such as
Disk Queue
Disk Writes / Disk Bytes Written
Disk Reads / Disk Reads Written
Most of them are cumulative, but you just compute a delta every n seconds to get a rate.
There's also a javaagent for profiling/monitoring file IO called JPicus. I have not looked at it in a while, but it may be useful.

Slowing process creation under Java?

I have a single, large heap (up to 240GB, though in the 20-40GB range for most of this phase of execution) JVM [1] running under Linux [2] on a server with 24 cores. We have tens of thousands of objects that have to be processed by an external executable & then load the data created by those executables back into the JVM. Each executable produces about half a megabyte of data (on disk) that when read right in, after the process finishes, is, of course, larger.
Our first implementation was to have each executable handle only a single object. This involved the spawning of twice as many executables as we had objects (since we called a shell script that called the executable). Our CPU utilization would start off high, but not necessarily 100%, and slowly worsen. As we began measuring to see what was happening we noticed that the process creation time [3] continually slows. While starting at sub-second times it would eventually grow to take a minute or more. The actual processing done by the executable usually takes less than 10 seconds.
Next we changed the executable to take a list of objects to process in an attempt to reduce the number of processes created. With batch sizes of a few hundred (~1% of our current sample size), the process creation times start out around 2 seconds & grow to around 5-6 seconds.
Basically, why is it taking so long to create these processes as execution continues?
[1] Oracle JDK 1.6.0_22
[2] Red Hat Enterprise Linux Advanced Platform 5.3, Linux kernel 2.6.18-194.26.1.el5 #1 SMP
[3] Creation of the ProcessBuilder object, redirecting the error stream, and starting it.

My guess is that you MIGHT be running into problems with fork/exec, if Java is using the fork/exec system calls to spawn subprocesses.
Normally fork/exec is fairly efficient, because fork() does very little - all pages are copy-on-write. This stops being so true with very large processes (i.e. those with gigabytes of pages mapped) because the page tables themselves take a relatively long time to create - and of course, destroy, as you immediately call exec.
As you're using a huge amount of heap, this might be affecting you. The more pages you have mapped in, the worse it may become, which could be what's causing the progressive slowdown.
Consider either:
Using posix_spawn, if that is NOT implemented by fork/exec in libc
Using a single subprocess which is responsible for creating / reaping others; spawn this once and use some IPC (pipes etc) to tell it what to do.
NB: This is all speculation; you should probably do some experiments to see whether this is the case.

Most likely you are running out of a resource. Are your disks getting busier as you create these processes. Do you ensure you have less processes than you have cores? (To minimise context switches) Is your load average below 24?
If your CPU consumption is dropping you are likely to be hitting IO (disk/network) contention i.e. the processes cannot get/write data fast enough to keep them busy. If you have 24 cores, how many disks do you have?
I would suggest you have one process per CPU (in your case I imagine 4) Give each JVM six tasks to run concurrently to use all its cores without overloading the system.

You would be much better off using a set of long lived processes pulling your data off of queues and sending them back that constantly forking new processes for each event, especially from the host JVM with that enormous heap.
Forking a 240GB image is not free, it consumes a large amount of virtual resources, even if only for a second. The OS doesn't know how long the new process will be aware so it must prepare itself as if the entire process will be long lived, thus it sets up the virtual clone of all 240GB before obliterating it with the exec call.
If instead you had a long lived process that you could end objects to via some queue mechanism (and there are many for both Java and C, etc.), that would relieve you of some of the pressure of the forking process.
I don't know how you are transferring the data form the JVM to the external program. But if your external program can work with stdin/stdout, then (assuming you're using unix), you could leverage inetd. Here you make a simple entry in the inetd configuration file for your process, and assign it a port. Then you open up a socket, pour the data down in to it, then read back from the socket. Inetd handles the networking details for you and your program works as simply with stdin and stdout. Mind you'll have an open socket on the network, which may or may not be secure in your deployment. But it's pretty trivial to set up even legacy code to run via a network service.
You could use a simple wrapper like this:
#!/bin/sh
infile=/tmp/$$.in
outfile=/tmp/$$.out
cat > $infile
/usr/local/bin/process -input $infile -output $outfile
cat $outfile
rm $infile $outfile
It's not the highest performing server on the planet designed to zillions of transactions, but it's sure a lot faster than forking 240GB over and over and over.

I most agree with Peter. Your are most probably suffering from IO bottlenecks. Once you have may process the OS has to work harder too for trivial tasks hence having exponential performance penalty.
So the 'solution' could be to create 'consumer' processes, only initialise certain few; as Peter suggested one per CPU or more. Then use some form of IPC to 'transfer' these objects to the consumer processes.
Your 'consumer' processes should manage sub-process creation; the processing executable which I presume you don't have any access to, and this way you don't clutter the OS with too many executables and the 'job' will be "eventually" complete.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.