I have a Java program which is launched through command-line by a Bash script, which is in turn called at various intervals by cron.
There are several operations performed by this program, the first being the copying of a possibly large number of files of varying sizes. (Anything from 10,000 files of 30 KB each to a single 1 GB file, but both of these are edge cases.)
I am curious about how this step should be accomplished to ensure performance (as in speed).
I can use either the cp command from the Bash script, or Java 7's Files.copy(). I will run my own tests, but I'm wondering if someone has any comparison data I could take into account before deciding on an implementation?
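For reference, a minimal sketch of how the two approaches could be exercised from the Java side for a single file (the paths are placeholders): Files.copy() stays in-process, while shelling out to cp pays a fork/exec per invocation.

import java.io.IOException;
import java.nio.file.*;

public class CopyComparison {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path source = Paths.get("data/input.bin");    // placeholder paths
        Path target = Paths.get("data/output.bin");

        // Option 1: pure Java, no extra process per file.
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);

        // Option 2: shell out to cp, as the Bash script would.
        Process cp = new ProcessBuilder("cp", source.toString(), target.toString() + ".cp")
                .inheritIO()
                .start();
        cp.waitFor();
    }
}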
Related
I want to compare the performance of two different implementations (Python and Java) of the same algorithm. I run scripts on the terminal (using Ubuntu 18) like this:
time script_name
I'm not sure how accurate this is. Is it possible to increase the accuracy of this benchmark? Perhaps there is a way to remove any restrictions or set-up in Python or Java?
As explained in this answer, the correct way to benchmark a program using time is the following command:
sudo chrt -f 99 /usr/bin/time --verbose <benchmark>. However, note that this will only be accurate if the algorithm takes at least a second to execute, as otherwise the exec call might take up a big part of the benchmark.
This answer suggests using perf stat instead, as such:
perf stat -r 10 -d <your app and arguments>.
The accuracy of the time command is probably fine for most testing.
But if you wanted, you could write a second Python script for timing that imports subprocess and uses subprocess.call() (or one of several related functions in the subprocess module) to run the two versions of the algorithm.
Your timing script could also import datetime and record datetime.datetime.now() before and after the algorithm runs, to show how much time passed.
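If the goal is to exclude interpreter or JVM start-up from the measurement, another option is to time the algorithm in-process. A minimal sketch on the Java side, where Arrays.sort() stands in for the algorithm under test:

import java.util.Arrays;
import java.util.Random;

public class InProcessTiming {
    public static void main(String[] args) {
        int[] data = new Random(42).ints(1_000_000).toArray();

        // Time only the algorithm, not JVM start-up or class loading.
        long start = System.nanoTime();
        Arrays.sort(data);    // stand-in for the algorithm under test
        long elapsed = System.nanoTime() - start;

        System.out.printf("sort took %.3f ms%n", elapsed / 1_000_000.0);
    }
}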
I have decided, as a side project, to try to write a sorting method for large files. Unfortunately, I only have my 4-core laptop available for the research right now. For data, I am only using characters for each record. A typical record looks like this:
AAAAM_EL,QMOIXYGB,LAD_HNTU,BYFKKHWY,AVVCIXMC,KWVGCIUB,YWD_LQNU,HDTKUFK_,W_E_LT_M,MW_HEQKE,VHEDHK_U,SAIUAVGH,DQTSMK_L,RNUBFKUX,OXEVMHNR,EMEEJHJB,BKYQWYAP,MKMWKAAT,MIAEDTDY,RANAGVOM
All the fields are randomly generated, and I am sorting using the complete record as the key. A file containing 1 million records is about 181 million bytes. I have noticed the following on my laptop:
Using a Unix shell and running the unix sort command against the file, it takes approximately 15 to 22 seconds to sort the data and write it back to disk as another file.
I tried the sort command with the --parallel=<cores> option, but that would not work in my Windows Bash environment.
Using a quicksort algorithm I implemented in Java, it takes 3 seconds to read the file into memory, sort it, and write it back out to a new file.
An experimental multi-threaded Java application that I implemented takes about as long as the unix sort command.
Does anyone have reliable approximate times for sorting a file of this size? I plan on sorting much larger files once I have improved the multi-threaded approach I currently have (it needs a lot of improvement, I'm sure), but I need some good target times to aim for. Are there any examples on the net, or any sorting research papers, that would give me a hint as to how long it should take?
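For comparison, a minimal in-memory baseline sketch in Java 8+ (the file names are placeholders, and the whole line is used as the key, as in the question); Arrays.parallelSort() is one built-in way to use all cores without hand-rolling the threading:

import java.nio.file.*;
import java.util.*;

public class SortRecords {
    public static void main(String[] args) throws Exception {
        // Read all records into memory (placeholder file names).
        List<String> lines = Files.readAllLines(Paths.get("records.txt"));
        String[] records = lines.toArray(new String[0]);

        long start = System.nanoTime();
        Arrays.parallelSort(records);    // fork/join based, spreads the work across all cores
        long elapsed = System.nanoTime() - start;

        Files.write(Paths.get("records.sorted.txt"), Arrays.asList(records));
        System.out.printf("sort took %.3f s%n", elapsed / 1_000_000_000.0);
    }
}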
I have a simple question; I've read up online but couldn't find a simple solution:
I'm running a Java program (which accesses a database) on the command line as follows:
java -jar myProgram.jar
I would like a simple mechanism to see the number of disk I/Os performed by this program (on OSX).
So far I've come across iotop but how do I get iotop to measure the disk I/O of myProgram.jar?
Do I need a profiler like JProfiler to get this information?
iotop is a utility that shows the top N processes in descending order of I/O consumption/utilization.
Most importantly, it is a live monitoring utility, which means its output changes every N seconds (or whatever interval you specify). Although you can redirect its output to a file, you then need to parse that file and plot the data to get anything meaningful out of it.
I would recommend using sar instead; you can read more about it here.
It is one of the lowest-level monitoring utilities on Linux/Unix and will give you much more data than iotop.
The best thing about sar is that you can collect the data with a daemon while your program is running and then analyze it later using ksar.
In my opinion, you can follow the approach below:
Start sar monitoring, collecting data every N seconds; the value of N depends on the approximate execution time of your program.
For example: if your program takes 10 seconds to execute, then sampling once per second is fine, but if it takes an hour, sample every 30 seconds or every minute. This minimizes the overhead of the sar process while keeping the data meaningful.
Wait for some time (so that you get baseline data from before your program starts), then start your program.
Let your program run to completion.
Wait for some time again (so that you get data from after your program finishes).
Stop sar.
Visualize the sar data using ksar. To start with, check the disk utilization and then the IOPS for each disk.
You can use profilers for the same thing, but they have a few drawbacks:
They need their own agents (which have their own overhead).
Some of them are not free.
Some of them are not easy to set up.
They may or may not provide enough of the required data.
Besides this, IMHO, using built-in/system-level utilities is always beneficial.
I hope this was helpful.
Your Java program will ultimately be just another process on the host system, so you need to filter the output of whatever monitoring tool you use by your own process ID. Refer to the Scripts section of this Blog Post.
Also, even though you have tagged the question with OS X, do mention in the question body that you are using OS X.
If you are looking for offline data, that is normally provided by the proc filesystem on Unix-based systems, but unfortunately it is missing on OS X: Where is the /proc folder on Mac OS X?
/proc on Mac OS X
You might choose to write a small script that dumps data from disk and process monitoring tools for your process ID. You can get the process ID in the script from the process name: put the script in a loop that looks for that process name, and start the script before you execute your Java program. When the script finds the process, it will keep dumping the relevant data from the commands you chose, at the intervals you decide. Once your program ends, the log-dumping script also terminates.
It's pretty easy to minify scripts with YUI Compressor. Unfortunately, the process is really slow when executing the JAR with exec() in PHP.
Example (PHP):
// start with basic command
$cmd = 'java -Xmx32m -jar /bin/yuicompressor-2.4.8pre.jar -o \'/var/www/myscript.min.js\' \'/var/www/myscript.min.temp.js\'';
// execute the command
exec($cmd . ' 2>&1', $ok);
The execution for ~20 files takes up to 30 seconds(!) on a quad-core server with 8 GB of RAM.
Does anybody know a faster solution for minifying a bunch of scripts?
The execution time mainly depends on the file size(s).
Give the Google Closure Compiler a try.
It is also a good idea to cache the result in a file, or use an extension (APC, Memcached), in combination with client-side caching headers. If you check the last modification time with filemtime(), you will know whether you need to minify again or not.
I often cache each file separately, to avoid re-minifying a large amount of content, and then create an MD5 checksum of the whole. If it has changed since the last request, I save the new checksum and print out the content; otherwise I just send:
header('HTTP/1.1 304 Not Modified', true, 304);
This way, only a few calculations happen per request, even in development. I'm using ExtJS 4 for my current project, which is 1.2 MB raw, plus a lot of project code, without any problem and with under 1 s response time.
I have a single JVM [1] with a large heap (up to 240 GB, though in the 20-40 GB range for most of this phase of execution) running under Linux [2] on a server with 24 cores. We have tens of thousands of objects that have to be processed by an external executable, and the data created by those executables then has to be loaded back into the JVM. Each executable produces about half a megabyte of data (on disk), which is of course larger once it is read back in after the process finishes.
Our first implementation had each executable handle only a single object. This involved spawning twice as many processes as we had objects (since we called a shell script that called the executable). Our CPU utilization would start off high, but not necessarily at 100%, and slowly worsen. As we began measuring to see what was happening, we noticed that the process creation time [3] continually slows: while starting at sub-second times, it eventually grows to take a minute or more. The actual processing done by the executable usually takes less than 10 seconds.
Next we changed the executable to take a list of objects to process in an attempt to reduce the number of processes created. With batch sizes of a few hundred (~1% of our current sample size), the process creation times start out around 2 seconds & grow to around 5-6 seconds.
Basically, why is it taking so long to create these processes as execution continues?
[1] Oracle JDK 1.6.0_22
[2] Red Hat Enterprise Linux Advanced Platform 5.3, Linux kernel 2.6.18-194.26.1.el5 #1 SMP
[3] Creation of the ProcessBuilder object, redirecting the error stream, and starting it.
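For context, a minimal sketch of the step being timed in [3] (the wrapper script path and argument are placeholders):

import java.io.IOException;

public class Spawner {
    // Roughly the step measured in [3]: build the ProcessBuilder, redirect the error stream, start it.
    static Process launch(String objectId) throws IOException {
        long start = System.nanoTime();
        ProcessBuilder pb = new ProcessBuilder("/path/to/wrapper.sh", objectId); // placeholder path
        pb.redirectErrorStream(true);   // merge stderr into stdout
        Process p = pb.start();         // the call whose latency grows over time
        System.out.printf("spawn took %.1f ms%n", (System.nanoTime() - start) / 1e6);
        return p;
    }
}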
My guess is that you MIGHT be running into problems with fork/exec, if Java is using the fork/exec system calls to spawn subprocesses.
Normally fork/exec is fairly efficient, because fork() does very little - all pages are copy-on-write. This stops being so true with very large processes (i.e. those with gigabytes of pages mapped) because the page tables themselves take a relatively long time to create - and of course, destroy, as you immediately call exec.
As you're using a huge amount of heap, this might be affecting you. The more pages you have mapped in, the worse it may become, which could be what's causing the progressive slowdown.
Consider either:
Using posix_spawn, if that is NOT implemented by fork/exec in libc
Using a single subprocess which is responsible for creating / reaping others; spawn this once and use some IPC (pipes etc) to tell it what to do.
NB: This is all speculation; you should probably do some experiments to see whether this is the case.
Most likely you are running out of a resource. Are your disks getting busier as you create these processes? Do you ensure you have fewer processes than you have cores (to minimise context switches)? Is your load average below 24?
If your CPU consumption is dropping you are likely to be hitting IO (disk/network) contention i.e. the processes cannot get/write data fast enough to keep them busy. If you have 24 cores, how many disks do you have?
I would suggest you have one process per CPU (in your case I imagine 4). Give each JVM six tasks to run concurrently to use all its cores without overloading the system.
You would be much better off using a set of long-lived processes pulling your data off queues and sending it back than constantly forking new processes for each event, especially from the host JVM with that enormous heap.
Forking a 240 GB image is not free; it consumes a large amount of virtual resources, even if only for a second. The OS doesn't know how long the new process will be alive, so it must prepare itself as if the entire process will be long-lived, and thus it sets up the virtual clone of all 240 GB before obliterating it with the exec call.
If instead you had a long-lived process that you could send objects to via some queue mechanism (and there are many, for both Java and C, etc.), that would relieve you of some of the pressure of the forking process.
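As one illustration of the idea (a pipe to the child's stdin/stdout standing in for whatever queue mechanism you pick), here is a minimal sketch in which the JVM starts the external worker once and streams tasks to it; the worker path, its flag, and the one-line-per-task protocol are assumptions:

import java.io.*;

public class LongLivedWorker {
    public static void main(String[] args) throws IOException {
        // Start the external worker once (path and protocol are hypothetical).
        Process worker = new ProcessBuilder("/path/to/worker", "--stdin-tasks")
                .redirectErrorStream(true)
                .start();

        try (BufferedWriter toWorker = new BufferedWriter(
                     new OutputStreamWriter(worker.getOutputStream()));
             BufferedReader fromWorker = new BufferedReader(
                     new InputStreamReader(worker.getInputStream()))) {

            for (String objectId : new String[] {"obj-1", "obj-2", "obj-3"}) {
                toWorker.write(objectId);               // one task per line
                toWorker.newLine();
                toWorker.flush();
                String result = fromWorker.readLine();  // assumes one result line per task
                System.out.println(objectId + " -> " + result);
            }
        }
        worker.destroy();
    }
}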
I don't know how you are transferring the data from the JVM to the external program. But if your external program can work with stdin/stdout, then (assuming you're using Unix) you could leverage inetd. Here you make a simple entry in the inetd configuration file for your process and assign it a port. Then you open up a socket, pour the data down into it, and read back from the socket. inetd handles the networking details for you, and your program works as simply as with stdin and stdout. Mind you, you'll have an open socket on the network, which may or may not be secure in your deployment. But it's pretty trivial to set up even legacy code to run via a network service.
You could use a simple wrapper like this:
#!/bin/sh
infile=/tmp/$$.in
outfile=/tmp/$$.out
cat > $infile
/usr/local/bin/process -input $infile -output $outfile
cat $outfile
rm $infile $outfile
It's not the highest-performing server on the planet, designed for zillions of transactions, but it's sure a lot faster than forking 240 GB over and over and over.
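On the JVM side, a minimal sketch of the client half of this approach (the host, the port, and the send-everything-then-read-back protocol matching the wrapper above are assumptions; the port is whatever you register with inetd):

import java.io.*;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class InetdClient {
    // Sends one object's data to the inetd-hosted service and reads the result back.
    static String process(String payload) throws IOException {
        try (Socket socket = new Socket("localhost", 9000);   // hypothetical port from inetd.conf
             OutputStream out = socket.getOutputStream();
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {

            out.write(payload.getBytes(StandardCharsets.UTF_8));  // "pour the data down into it"
            socket.shutdownOutput();   // signal EOF so the wrapper's "cat > $infile" returns

            StringBuilder result = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                result.append(line).append('\n');
            }
            return result.toString();
        }
    }
}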
I mostly agree with Peter. You are most probably suffering from I/O bottlenecks. Once you have many processes, the OS has to work harder even for trivial tasks, hence the steep performance penalty.
So the 'solution' could be to create 'consumer' processes and initialise only a few of them; as Peter suggested, one per CPU or so. Then use some form of IPC to 'transfer' the objects to the consumer processes.
Your 'consumer' processes should manage the creation of the sub-processes (the processing executable, which I presume you don't have any access to); this way you don't clutter the OS with too many processes, and the 'job' will eventually be complete.