java get file size efficiently

While googling, I see that using java.io.File#length() can be slow.
FileChannel has a size() method that is available as well.
Is there an efficient way in java to get the file size?

Well, I tried to measure it with the code below:
For runs = 1 and iterations = 1 the URL method is fastest most times, followed by channel. I ran this fresh, with some pause in between, about 10 times. So for one-time access, using the URL is the fastest way I can think of:
LENGTH sum: 10626, per Iteration: 10626.0
CHANNEL sum: 5535, per Iteration: 5535.0
URL sum: 660, per Iteration: 660.0
For runs = 5 and iterations = 50 the picture looks different.
LENGTH sum: 39496, per Iteration: 157.984
CHANNEL sum: 74261, per Iteration: 297.044
URL sum: 95534, per Iteration: 382.136
File must be caching the calls to the filesystem, while channels and URL have some overhead.
Code:
import java.io.*;
import java.net.*;
import java.util.*;
public enum FileSizeBench {
LENGTH {
@Override
public long getResult() throws Exception {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
return me.length();
}
},
CHANNEL {
@Override
public long getResult() throws Exception {
FileInputStream fis = null;
try {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
fis = new FileInputStream(me);
return fis.getChannel().size();
} finally {
fis.close();
}
}
},
URL {
@Override
public long getResult() throws Exception {
InputStream stream = null;
try {
URL url = FileSizeBench.class
.getResource("FileSizeBench.class");
stream = url.openStream();
return stream.available();
} finally {
stream.close();
}
}
};
public abstract long getResult() throws Exception;
public static void main(String[] args) throws Exception {
int runs = 5;
int iterations = 50;
EnumMap<FileSizeBench, Long> durations = new EnumMap<FileSizeBench, Long>(FileSizeBench.class);
for (int i = 0; i < runs; i++) {
for (FileSizeBench test : values()) {
if (!durations.containsKey(test)) {
durations.put(test, 0L);
}
long duration = testNow(test, iterations);
durations.put(test, durations.get(test) + duration);
// System.out.println(test + " took: " + duration + ", per iteration: " + ((double)duration / (double)iterations));
}
}
for (Map.Entry<FileSizeBench, Long> entry : durations.entrySet()) {
System.out.println();
System.out.println(entry.getKey() + " sum: " + entry.getValue() + ", per Iteration: " + ((double)entry.getValue() / (double)(runs * iterations)));
}
}
private static long testNow(FileSizeBench test, int iterations)
throws Exception {
long result = -1;
long before = System.nanoTime();
for (int i = 0; i < iterations; i++) {
if (result == -1) {
result = test.getResult();
//System.out.println(result);
} else if (test.getResult() != result) {
throw new Exception("variance detected!");
}
}
return (System.nanoTime() - before) / 1000;
}
}

The benchmark given by GHad measures lots of other stuff (such as reflection, instantiating objects, etc.) besides getting the length. If we try to get rid of these things then for one call I get the following times in microseconds:
file sum___19.0, per Iteration___19.0
raf sum___16.0, per Iteration___16.0
channel sum__273.0, per Iteration__273.0
For 100 runs and 10000 iterations I get:
file sum__1767629.0, per Iteration__1.7676290000000001
raf sum___881284.0, per Iteration__0.8812840000000001
channel sum___414286.0, per Iteration__0.414286
I ran the following modified code, giving the name of a 100MB file as an argument.
import java.io.*;
import java.nio.channels.*;
import java.net.*;
import java.util.*;
public class FileSizeBench {
private static File file;
private static FileChannel channel;
private static RandomAccessFile raf;
public static void main(String[] args) throws Exception {
int runs = 1;
int iterations = 1;
file = new File(args[0]);
channel = new FileInputStream(args[0]).getChannel();
raf = new RandomAccessFile(args[0], "r");
HashMap<String, Double> times = new HashMap<String, Double>();
times.put("file", 0.0);
times.put("channel", 0.0);
times.put("raf", 0.0);
long start;
for (int i = 0; i < runs; ++i) {
long l = file.length();
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != file.length()) throw new Exception();
times.put("file", times.get("file") + System.nanoTime() - start);
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != channel.size()) throw new Exception();
times.put("channel", times.get("channel") + System.nanoTime() - start);
start = System.nanoTime();
for (int j = 0; j < iterations; ++j)
if (l != raf.length()) throw new Exception();
times.put("raf", times.get("raf") + System.nanoTime() - start);
}
for (Map.Entry<String, Double> entry : times.entrySet()) {
System.out.println(
entry.getKey() + " sum: " + 1e-3 * entry.getValue() +
", per Iteration: " + (1e-3 * entry.getValue() / runs / iterations));
}
}
}

All the test cases in this post are flawed, as they access the same file for each method tested. So disk caching kicks in, which tests 2 and 3 benefit from. To prove my point I took the test case provided by GHad and changed the order of enumeration; the results are below.
Looking at the results, I think File.length() really is the winner.
The order of the tests is the order of the output. You can even see that the time taken on my machine varied between executions, but File.length(), when it was not first and so did not incur the first disk access, won.
---
LENGTH sum: 1163351, per Iteration: 4653.404
CHANNEL sum: 1094598, per Iteration: 4378.392
URL sum: 739691, per Iteration: 2958.764
---
CHANNEL sum: 845804, per Iteration: 3383.216
URL sum: 531334, per Iteration: 2125.336
LENGTH sum: 318413, per Iteration: 1273.652
---
URL sum: 137368, per Iteration: 549.472
LENGTH sum: 18677, per Iteration: 74.708
CHANNEL sum: 142125, per Iteration: 568.5

When I modify your code to use a file accessed by an absolute path instead of a resource, I get a different result (for 1 run, 1 iteration, and a 100,000 byte file -- times for a 10 byte file are identical to 100,000 bytes)
LENGTH sum: 33, per Iteration: 33.0
CHANNEL sum: 3626, per Iteration: 3626.0
URL sum: 294, per Iteration: 294.0

In response to rgrig's benchmark, the time taken to open/close the FileChannel & RandomAccessFile instances also needs to be taken into account, as these classes will open a stream for reading the file.
After modifying the benchmark, I got these results for 1 iteration on an 85MB file:
file totalTime: 48000 (48 us)
raf totalTime: 261000 (261 us)
channel totalTime: 7020000 (7 ms)
For 10000 iterations on same file:
file totalTime: 80074000 (80 ms)
raf totalTime: 295417000 (295 ms)
channel totalTime: 368239000 (368 ms)
If all you need is the file size, file.length() is the fastest way to do it. If you plan to use the file for other purposes like reading/writing, then RAF seems to be a better bet. Just don't forget to close the file connection :-)
import java.io.File;
import java.io.FileInputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;
public class FileSizeBench
{
public static void main(String[] args) throws Exception
{
int iterations = 1;
String fileEntry = args[0];
Map<String, Long> times = new HashMap<String, Long>();
times.put("file", 0L);
times.put("channel", 0L);
times.put("raf", 0L);
long fileSize;
long start;
long end;
File f1;
FileChannel channel;
RandomAccessFile raf;
for (int i = 0; i < iterations; i++)
{
// file.length()
start = System.nanoTime();
f1 = new File(fileEntry);
fileSize = f1.length();
end = System.nanoTime();
times.put("file", times.get("file") + end - start);
// channel.size()
start = System.nanoTime();
channel = new FileInputStream(fileEntry).getChannel();
fileSize = channel.size();
channel.close();
end = System.nanoTime();
times.put("channel", times.get("channel") + end - start);
// raf.length()
start = System.nanoTime();
raf = new RandomAccessFile(fileEntry, "r");
fileSize = raf.length();
raf.close();
end = System.nanoTime();
times.put("raf", times.get("raf") + end - start);
}
for (Map.Entry<String, Long> entry : times.entrySet()) {
System.out.println(entry.getKey() + " totalTime: " + entry.getValue() + " (" + getTime(entry.getValue()) + ")");
}
}
public static String getTime(Long timeTaken)
{
if (timeTaken < 1000) {
return timeTaken + " ns";
} else if (timeTaken < (1000*1000)) {
return timeTaken/1000 + " us";
} else {
return timeTaken/(1000*1000) + " ms";
}
}
}

I ran into this same issue. I needed to get the file size and modified date of 90,000 files on a network share. Using Java, and being as minimalistic as possible, it would take a very long time. (I needed to get the URL from the file, and the path of the object as well, so it varied somewhat, but it took more than an hour.) I then used a native Win32 executable to do the same task, just dumping the file path, modified date, and size to the console, and executed that from Java. The speed was amazing. The native process, and my string handling to read the data, could process over 1000 items a second.
So even though people downvoted the above comment, this is a valid solution, and it did solve my issue. In my case I knew the folders I needed the sizes of ahead of time, and I could pass those in on the command line to my Win32 app. I went from hours to process a directory down to minutes.
The issue did also seem to be Windows specific. OS X did not have the same issue and could access network file info as fast as the OS could do so.
Java File handling on Windows is terrible. Local disk access for files is fine though. It was just network shares that caused the terrible performance. Windows could get info on the network share and calculate the total size in under a minute too.
--Ben
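As an illustration of that approach, here is a minimal sketch of launching a native helper from Java and reading its console output line by line. The executable name ("filescan.exe") and its pipe-separated output format are hypothetical placeholders, not the actual tool used above:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class NativeDirScan {
    public static void main(String[] args) throws Exception {
        // hypothetical helper that prints "path|modified|size" for each entry in the given folder
        ProcessBuilder pb = new ProcessBuilder("filescan.exe", args[0]);
        pb.redirectErrorStream(true);
        Process p = pb.start();
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
        String line;
        while ((line = r.readLine()) != null) {
            String[] parts = line.split("\\|"); // path, modified, size
            System.out.println(parts[0] + " -> " + parts[parts.length - 1] + " bytes");
        }
        p.waitFor();
    }
}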

If you want the file size of multiple files in a directory, use Files.walkFileTree. You can obtain the size from the BasicFileAttributes that you'll receive.
This is much faster than calling .length() on the result of File.listFiles() or using Files.size() on the result of Files.newDirectoryStream(). In my test cases it was about 100 times faster.
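For reference, a minimal sketch of this approach, summing the sizes of all regular files under a directory passed as the first argument (the class name is just for illustration):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class DirSize {
    public static void main(String[] args) throws IOException {
        final long[] total = {0};
        Files.walkFileTree(Paths.get(args[0]), new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total[0] += attrs.size(); // size comes from the attributes, no extra stat call
                return FileVisitResult.CONTINUE;
            }
        });
        System.out.println("Total size: " + total[0] + " bytes");
    }
}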

Actually, I think the "ls" may be faster. There are definitely some issues in Java dealing with getting File info. Unfortunately there is no equivalent safe method of recursive ls for Windows. (cmd.exe's DIR /S can get confused and generate errors in infinite loops)
On XP, accessing a server on the LAN, it takes me 5 seconds in Windows to get the count of the files in a folder (33,000), and the total size.
When I iterate recursively through this in Java, it takes me over 5 minutes. I started measuring the time it takes to do file.length(), file.lastModified(), and file.toURI() and what I found is that 99% of my time is taken by those 3 calls. The 3 calls I actually need to do...
The difference for 1000 files is 15ms local versus 1800ms on server. The server path scanning in Java is ridiculously slow. If the native OS can be fast at scanning that same folder, why can't Java?
As a more complete test, I used WinMerge on XP to compare the modified dates and sizes of the files on the server versus the files locally, iterating over the entire directory tree of 33,000 files in each folder. Total time: 7 seconds. Java: over 5 minutes.
So the original statement and question from the OP are true and valid. It's less noticeable when dealing with a local file system. Doing a local compare of the folder with 33,000 items takes 3 seconds in WinMerge and 32 seconds in Java. So again, Java versus native is a 10x slowdown in these rudimentary tests.
Java 1.6.0_22 (latest), Gigabit LAN, and network connections, ping is less than 1ms (both in the same switch)
Java is slow.

From GHad's benchmark, there are a few issues people have mentioned:
1. As BalusC mentioned: stream.available() is flawed in this case,
because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.
So first, remove the URL approach.
2. As StuartH mentioned, the order in which the tests run also makes a caching difference, so take that out by running each test separately.
Now start test:
When CHANNEL one run alone:
CHANNEL sum: 59691, per Iteration: 238.764
When LENGTH one run alone:
LENGTH sum: 48268, per Iteration: 193.072
So it looks like the LENGTH one is the winner here:
@Override
public long getResult() throws Exception {
File me = new File(FileSizeBench.class.getResource(
"FileSizeBench.class").getFile());
return me.length();
}

Related

Groovy with JAVA: Error while merging multiple large PDF file, causing out of memory [duplicate]

I am trying to split a decent-sized document of 300 pages using the Apache PDFBox API v2.0.2.
While trying to split the pdf file to single pages using the following code:
PDDocument document = PDDocument.load(inputFile);
Splitter splitter = new Splitter();
List<PDDocument> splittedDocuments = splitter.split(document); //Exception happens here
I receive the following exception
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Which indicates that the GC is taking much time to clear the heap that is not justified by the amount reclaimed.
There are numerous JVM tuning methods that can solve the situation, however, all of these are just treating the symptom and not the real issue.
One final note: I am using JDK 6, hence using the new Java 8 Consumer is not an option in my case. Thanks.
Edit:
This is not a duplicate question of http://stackoverflow.com/questions/37771252/splitting-a-pdf-results-in-very-large-pdf-documents-with-pdfbox-2-0-2 as:
1. I do not have the size problem mentioned in the aforementioned topic. I am slicing a 270-page, 13.8MB PDF file, and after slicing, each slice is an average of 80KB, with a total size of 30.7MB.
2. The split throws the exception even before it returns the split parts.
I found that the split can pass as long as I am not passing the whole document; instead I pass it in "batches" of 20-30 pages each, which does the job.
PDFBox stores the parts resulting from the split operation in the heap as PDDocument objects, so the heap fills up fast, and even if you call the close() operation after every round in the loop, the GC is still not able to reclaim heap at the same rate it gets filled.
An option is to split the document split operation into batches, where each batch is a relatively manageable chunk (10 to 40 pages).
public void execute() {
File inputFile = new File("path/to/the/file.pdf");
PDDocument document = null;
try {
document = PDDocument.load(inputFile);
int start = 1;
int end = 1;
int batchSize = 50;
int finalBatchSize = document.getNumberOfPages() % batchSize;
int noOfBatches = document.getNumberOfPages() / batchSize;
for (int i = 1; i <= noOfBatches; i++) {
start = end;
end = start + batchSize;
System.out.println("Batch: " + i + " start: " + start + " end: " + end);
split(document, start, end);
}
// handling the remaining
start = end;
end += finalBatchSize;
System.out.println("Final Batch start: " + start + " end: " + end);
split(document, start, end);
} catch (IOException e) {
e.printStackTrace();
} finally {
// close the source document
try {
if (document != null) document.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
private void split(PDDocument document, int start, int end) throws IOException {
List<File> fileList = new ArrayList<File>();
Splitter splitter = new Splitter();
splitter.setStartPage(start);
splitter.setEndPage(end);
List<PDDocument> splittedDocuments = splitter.split(document);
String outputPath = Config.INSTANCE.getProperty("outputPath");
PDFTextStripper stripper = new PDFTextStripper();
for (int index = 0; index < splittedDocuments.size(); index++) {
String pdfFullPath = document.getDocumentInformation().getTitle() + index + start + ".pdf";
PDDocument splittedDocument = splittedDocuments.get(index);
splittedDocument.save(pdfFullPath);
splittedDocument.close(); // release each saved part, otherwise the heap fills up again
}
}

SeekableByteChannel.read() always returns 0, InputStream is fine

We have a data file for which we need to generate a CRC. (As a placeholder, I'm using CRC32 while the others figure out what CRC polynomial they actually want.) This code seems like it ought to work:
broken:
Path in = ......;
try (SeekableByteChannel reading =
Files.newByteChannel (in, StandardOpenOption.READ))
{
System.err.println("byte channel is a " + reading.getClass().getName() +
" from " + in + " of size " + reading.size() + " and isopen=" + reading.isOpen());
java.util.zip.CRC32 placeholder = new java.util.zip.CRC32();
ByteBuffer buffer = ByteBuffer.allocate (reasonable_buffer_size);
int bytesread = 0;
int loops = 0;
while ((bytesread = reading.read(buffer)) > 0) {
byte[] raw = buffer.array();
System.err.println("Claims to have read " + bytesread + " bytes, have buffer of size " + raw.length + ", updating CRC");
placeholder.update(raw);
loops++;
buffer.clear();
}
// do stuff with placeholder.getValue()
}
catch (all the things that go wrong with opening files) {
and handle them;
}
The System.err and loops stuff is just for debugging; we don't actually care how many times it takes. The output is:
byte channel is a sun.nio.ch.FileChannelImpl from C:\working\tmp\ls2kst83543216xuxxy8136.tmp of size 7196 and isopen=true
finished after 0 time(s) through the loop
There's no way to run the real code inside a debugger to step through it, but from looking at the source to sun.nio.ch.FileChannelImpl.read() it looks like a 0 is returned if the file magically becomes closed while internal data structures are prepared; the code below is copied from the Java 7 reference implementation, comments added by me:
// sun.nio.ch.FileChannelImpl.java
public int read(ByteBuffer dst) throws IOException {
ensureOpen(); // this throws if file is closed...
if (!readable)
throw new NonReadableChannelException();
synchronized (positionLock) {
int n = 0;
int ti = -1;
Object traceContext = IoTrace.fileReadBegin(path);
try {
begin();
ti = threads.add();
if (!isOpen())
return 0; // ...argh
do {
n = IOUtil.read(fd, dst, -1, nd);
} while (......)
.......
But the debugging code tests isOpen() and gets true. So I don't know what's going wrong.
As the current test data files are tiny, I dropped this in place just to have something working:
works for now:
try {
byte[] scratch = Files.readAllBytes(in);
java.util.zip.CRC32 placeholder = new java.util.zip.CRC32();
placeholder.update(scratch);
// do stuff with placeholder.getValue()
}
I don't want to slurp the entire file into memory for the Real Code, because some of those files can be large. I do note that readAllBytes uses an InputStream in its reference implementation, which has no trouble reading the same file that SeekableByteChannel failed to. So I'll probably rewrite the code to just use input streams instead of byte channels. I'd still like to figure out what's gone wrong in case a future scenario comes up where we need to use byte channels. What am I missing with SeekableByteChannel?
Check that 'reasonable_buffer_size' isn't zero.
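If it is zero, read() has no space to read into and simply returns 0, so the loop body never runs. For reference, a minimal sketch of the same loop with an explicit non-zero buffer size, which also feeds only the bytes actually read into the CRC (passing the whole backing array to update(), as the original does, would include bytes beyond what was just read):

import java.nio.ByteBuffer;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

public class Crc32OfFile {
    public static void main(String[] args) throws Exception {
        Path in = Paths.get(args[0]);
        CRC32 crc = new CRC32();
        ByteBuffer buffer = ByteBuffer.allocate(64 * 1024); // any non-zero size works
        try (SeekableByteChannel reading = Files.newByteChannel(in, StandardOpenOption.READ)) {
            int bytesread;
            while ((bytesread = reading.read(buffer)) > 0) {
                crc.update(buffer.array(), 0, bytesread); // only the bytes just read
                buffer.clear();
            }
        }
        System.out.println("CRC32 = " + Long.toHexString(crc.getValue()));
    }
}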

How is reading an InputStream object from a local file different than from the network (via Amazon S3)?

I didn't think there was a difference between an inputstream object read from a local file vs one from a network source (Amazon S3 in this case) so hopefully someone can enlighten me.
These programs were run on a VM running Centos 6.3.
The test file in both cases are 10MB.
Local file code:
InputStream is = new FileInputStream("/home/anyuser/test.jpg");
int read = 0;
int buf_size = 1024 * 1024 * 2;
byte[] buf = new byte[buf_size];
ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);
long t3 = System.currentTimeMillis();
int i = 0;
while ((read = is.read(buf)) != -1) {
baos.write(buf,0,read);
System.out.println("reading for the " + i + "th time");
i++;
}
long t4 = System.currentTimeMillis();
System.out.println("Time to read = " + (t4-t3) + "ms");
The output of this code is this: it reads 5 times, which makes sense since the buffer size read in is 2MB and the file is 10MB.
reading for the 0th time
reading for the 1th time
reading for the 2th time
reading for the 3th time
reading for the 4th time
Time to read = 103ms
Now, we have the same code run with the same 10MB test file, except this time, the source is from Amazon S3. We don't start reading until we finish getting the stream from S3. However, this time, the read loop is running through thousands of times, when it should only read it 5 times.
InputStream is;
long t1 = System.currentTimeMillis();
is = getS3().getFileFromBucket(S3Path,input);
long t2 = System.currentTimeMillis();
System.out.print("Time to get file " + input + " from S3: ");
System.out.println((t2-t1) + "ms");
int read = 0;
int buf_size = 1024*1024*2;
byte[] buf = new byte[buf_size];
ByteArrayOutputStream baos = new ByteArrayOutputStream(buf_size);
long t3 = System.currentTimeMillis();
int i = 0;
while ((read = is.read(buf)) != -1) {
baos.write(buf,0,read);
if ((i % 100) == 0)
System.out.println("reading for the " + i + "th time");
i++;
}
long t4 = System.currentTimeMillis();
System.out.println("Time to read = " + (t4-t3) + "ms");
The output is as follows:
Time to get file test.jpg from S3: 2456ms
reading for the 0th time
reading for the 100th time
reading for the 200th time
reading for the 300th time
reading for the 400th time
reading for the 500th time
reading for the 600th time
reading for the 700th time
reading for the 800th time
reading for the 900th time
reading for the 1000th time
reading for the 1100th time
reading for the 1200th time
reading for the 1300th time
reading for the 1400th time
Time to read = 14471ms
The amount of time taken to read the stream changes from run to run. Sometimes it takes 60 seconds, sometimes 15 seconds. It doesn't get faster than 15 sec. The read loop still loops through 1400+ times on each test run of the program, even though I think it should only be 5 times, like the local file example.
Is this how inputstream works when the source is through the network, even though we had finished getting the file from the network source? Thanks in advance for your help.
I don't think it's specific to Java. When you read from the network, the actual read call to the operating system will return a packet of data at a time, no matter how big the buffer you allocated is. If you check the size of the data read (your read variable), it should show the size of the network packet used.
This is one of the reasons why people use a separate thread to read from the network and avoid blocking by using async I/O techniques.
As @imel96 points out, there is nothing in the documentation that guarantees the behaviour you are expecting. You will never read 2MB at a time from a socket, because the socket receive buffer isn't normally that large, quite apart from other factors such as bandwidth.
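If you do want to handle the data in 2MB chunks regardless of how the network delivers it, a minimal sketch (assuming the is and baos variables from the question's S3 code) is to keep reading into the same buffer until it is full or the stream ends:

byte[] buf = new byte[1024 * 1024 * 2];
int filled = 0;
int n;
while ((n = is.read(buf, filled, buf.length - filled)) != -1) {
    filled += n;
    if (filled == buf.length) {   // buffer is full: hand off one 2MB chunk
        baos.write(buf, 0, filled);
        filled = 0;
    }
}
if (filled > 0) {                 // flush the final partial chunk
    baos.write(buf, 0, filled);
}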

Splitting a .gz file into specified file sizes in Java using byte[] array

I have written code to split a .gz file into user-specified parts using a byte[] array. But the for loop is not reading/writing the last part of the parent file, which is smaller than the array size. Can you please help me fix this?
package com.bitsighttech.collection.packaging;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;
public class FileSplitterBytewise
{
private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
private static final long KB = 1024;
private static final long MB = KB * KB;
private FileInputStream fis;
private FileOutputStream fos;
private DataInputStream dis;
private DataOutputStream dos;
public boolean split(File inputFile, String splitSize)
{
int expectedNoOfFiles =0;
try
{
double parentFileSizeInB = inputFile.length();
Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
Matcher m = p.matcher(splitSize);
m.matches();
String FileSizeString = m.group(1);
String unit = m.group(2);
double FileSizeInMB = 0;
try {
if (unit.toLowerCase().equals("kb"))
FileSizeInMB = Double.parseDouble(FileSizeString) / KB;
else if (unit.toLowerCase().equals("mb"))
FileSizeInMB = Double.parseDouble(FileSizeString);
else if (unit.toLowerCase().equals("gb"))
FileSizeInMB = Double.parseDouble(FileSizeString) * KB;
} catch (NumberFormatException e) {
logger.error("invalid number [" + FileSizeInMB + "] for expected file size");
}
double fileSize = FileSizeInMB * MB;
int fileSizeInByte = (int) Math.ceil(fileSize);
double noOFFiles = parentFileSizeInB/fileSizeInByte;
expectedNoOfFiles = (int) Math.ceil(noOFFiles);
int splinterCount = 1;
fis = new FileInputStream(inputFile);
dis = new DataInputStream(new BufferedInputStream(fis));
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
dos = new DataOutputStream(new BufferedOutputStream(fos));
byte[] data = new byte[(int) fileSizeInByte];
while ( splinterCount <= expectedNoOfFiles ) {
int i;
for(i = 0; i<data.length-1; i++)
{
data[i] = dis.readByte();
}
dos.write(data);
splinterCount ++;
}
}
catch(Exception e)
{
logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
return false;
}
logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
return true;
}
public static void main(String args[])
{
String FilePath1 = "F:\\az.gz";
File file= new File(FilePath1);
FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
String splitlen = "1 MB";
fileSplitter.split(file, splitlen);
}
}
I'd suggest to make more methods. You've got a complicated string-handling section of code in split(); it would be best to make a method that takes the human-friendly string as input and returns the number you're looking for. (It would also make it far easier for you to test this section of the routine; there's no way you can test it now.)
Once it is split off and you're writing test cases, you'll probably find that the error message you generate if the string doesn't contain kb, mb, or gb is extremely confusing -- it blames the number 0 for the mistake rather than pointing out the string does not have the expected units.
Using an int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually confined to integer values but I can't quickly think why it would fail.)
byte[] data = new byte[(int) fileSizeInByte];
Allocating several gigabytes like this is going to destroy your performance -- that's a potentially huge memory allocation (and one that might be considered under control of an adversary; depending upon your security model, this might or might not be a big deal). Don't try to work with the entire file in one piece.
You appear to be reading and writing the files one byte at a time. That's a guarantee to very slow performance. Doing some performance testing for another question earlier today, I found that my machine could read (from a hot cache) 2000 times faster using 131kb blocks than two-byte blocks. One-byte blocks would be even worse. A cold cache would be significantly worse for such small sizes.
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
You only appear to ever open one file output stream. Your post probably should have said "only the first works", because it looks like you've not yet tried it on a file that creates three or more pieces.
catch(Exception e)
At this point, you've got the ability to discover errors in your program; you choose to ignore them completely. Sure, you log an error message, but you cannot actually debug your program with the data you log. You should log at a minimum the exception type, message, and maybe even full stack-trace. This combination of data is immensely useful when trying to solve problems, especially in a few months when you've forgotten the details of how it works.
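For example, a sketch using the question's own logger, inputFile, and expectedNoOfFiles (log4j's Logger.error(Object, Throwable) keeps the exception type, message, and stack trace):

catch (Exception e)
{
    // include the exception itself so the full stack trace is logged
    logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles, e);
    return false;
}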
Can you please help me in fixing this?
I would (see the sketch after this list):
drop the DataInput/OutputStreams, you don't need them.
use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is so much slower!
or read the whole of the data array; as written, you are reading one byte less.
stop when you reach the end of the file; it might not be a whole multiple of the size.
only write as much as you have read; if your blocks are 1 MB and there is 100 KB left, you should only read/write 100 KB at the end.
close your files when you have finished, especially as you have buffered streams.
your "split" writes everything to the same file (so it's not actually splitting). You need to create, write to, and close output files in a loop.
don't use fields when you could/should be using local variables.
I would use the length as a long in bytes.
the pattern ignores incorrect input and doesn't match the tests you check for, e.g. your pattern allows "1 G" or "1 k" but these will be treated as 1 MB.
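Putting those points together, a minimal sketch of the copy loop (assuming Java 7+ for try-with-resources, the question's inputFile, and the java.io stream classes in scope; size-string parsing, logging, and error handling are left out, and each part may overshoot the target by up to one buffer):

long partSize = 1024L * 1024L;      // target bytes per part, e.g. from "1 MB"
byte[] data = new byte[64 * 1024];  // modest reusable buffer instead of one huge array
try (InputStream in = new BufferedInputStream(new FileInputStream(inputFile))) {
    int part = 1;
    int read = in.read(data);
    while (read != -1) {
        long written = 0;
        try (OutputStream out = new BufferedOutputStream(
                new FileOutputStream(inputFile.getPath() + "_part_" + part))) {
            while (read != -1 && written < partSize) {
                out.write(data, 0, read);   // write only as many bytes as were read
                written += read;
                read = in.read(data);
            }
        }
        part++;
    }
}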

How do I monitor the computer's CPU, memory, and disk usage in Java?

I would like to monitor the following system information in Java:
Current CPU usage (percent)
Available memory* (free/total)
Available disk space (free/total)
*Note that I mean overall memory available to the whole system, not just the JVM.
I'm looking for a cross-platform solution (Linux, Mac, and Windows) that doesn't rely on my own code calling external programs or using JNI. Although these are viable options, I would prefer not to maintain OS-specific code myself if someone already has a better solution.
If there's a free library out there that does this in a reliable, cross-platform manner, that would be great (even if it makes external calls or uses native code itself).
Any suggestions are much appreciated.
To clarify, I would like to get the current CPU usage for the whole system, not just the Java process(es).
The SIGAR API provides all the functionality I'm looking for in one package, so it's the best answer to my question so far. However, due to its being licensed under the GPL, I cannot use it for my original purpose (a closed source, commercial product). It's possible that Hyperic may license SIGAR for commercial use, but I haven't looked into it. For my GPL projects, I will definitely consider SIGAR in the future.
For my current needs, I'm leaning towards the following:
For CPU usage, OperatingSystemMXBean.getSystemLoadAverage() / OperatingSystemMXBean.getAvailableProcessors() (load average per cpu)
For memory, OperatingSystemMXBean.getTotalPhysicalMemorySize() and OperatingSystemMXBean.getFreePhysicalMemorySize()
For disk space, File.getTotalSpace() and File.getUsableSpace()
Limitations:
The getSystemLoadAverage() and disk space querying methods are only available under Java 6. Also, some JMX functionality may not be available to all platforms (i.e. it's been reported that getSystemLoadAverage() returns -1 on Windows).
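A minimal sketch of the approach listed above (plain JMX plus java.io.File, assuming Java 6+ and a Sun/Oracle JVM for the com.sun.management cast used for physical memory):

import java.io.File;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class SystemStats {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // load average per CPU; getSystemLoadAverage() returns -1 where unsupported (e.g. Windows)
        System.out.println("Load per CPU: "
                + os.getSystemLoadAverage() / os.getAvailableProcessors());

        // physical memory needs the com.sun.management subtype
        com.sun.management.OperatingSystemMXBean sunOs =
                (com.sun.management.OperatingSystemMXBean) os;
        System.out.println("Free/total physical memory: "
                + sunOs.getFreePhysicalMemorySize() + " / " + sunOs.getTotalPhysicalMemorySize());

        // disk space (Java 6+)
        File root = File.listRoots()[0];
        System.out.println("Usable/total disk: "
                + root.getUsableSpace() + " / " + root.getTotalSpace());
    }
}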
(Update: although SIGAR was originally licensed under the GPL, it has since been changed to Apache 2.0, which can generally be used for closed source, commercial products.)
Along the lines of what I mentioned in this post, I recommend you use the SIGAR API. I use the SIGAR API in one of my own applications and it is great. You'll find it is stable, well supported, and full of useful examples. It is open-source with an Apache 2.0 license (formerly GPL 2). Check it out. I have a feeling it will meet your needs.
Using Java and the Sigar API you can get Memory, CPU, Disk, Load-Average, Network Interface info and metrics, Process Table information, Route info, etc.
The following supposedly gets you CPU and RAM. See ManagementFactory for more details.
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
private static void printUsage() {
OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean();
for (Method method : operatingSystemMXBean.getClass().getDeclaredMethods()) {
method.setAccessible(true);
if (method.getName().startsWith("get")
&& Modifier.isPublic(method.getModifiers())) {
Object value;
try {
value = method.invoke(operatingSystemMXBean);
} catch (Exception e) {
value = e;
} // try
System.out.println(method.getName() + " = " + value);
} // if
} // for
}
In JDK 1.7, you can get system CPU and memory usage via com.sun.management.OperatingSystemMXBean. This is different than java.lang.management.OperatingSystemMXBean.
long getCommittedVirtualMemorySize()
// Returns the amount of virtual memory that is guaranteed to be available to the running process in bytes, or -1 if this operation is not supported.
long getFreePhysicalMemorySize()
// Returns the amount of free physical memory in bytes.
long getFreeSwapSpaceSize()
// Returns the amount of free swap space in bytes.
double getProcessCpuLoad()
// Returns the "recent cpu usage" for the Java Virtual Machine process.
long getProcessCpuTime()
// Returns the CPU time used by the process on which the Java virtual machine is running in nanoseconds.
double getSystemCpuLoad()
// Returns the "recent cpu usage" for the whole system.
long getTotalPhysicalMemorySize()
// Returns the total amount of physical memory in bytes.
long getTotalSwapSpaceSize()
// Returns the total amount of swap space in bytes.
This works for me perfectly without any external API, just native Java hidden feature :)
import com.sun.management.OperatingSystemMXBean;
...
OperatingSystemMXBean osBean = ManagementFactory.getPlatformMXBean(
OperatingSystemMXBean.class);
// What % CPU load this current JVM is taking, from 0.0-1.0
System.out.println(osBean.getProcessCpuLoad());
// What % load the overall system is at, from 0.0-1.0
System.out.println(osBean.getSystemCpuLoad());
Have a look at this very detailled article:
http://nadeausoftware.com/articles/2008/03/java_tip_how_get_cpu_and_user_time_benchmarking#UsingaSuninternalclasstogetJVMCPUtime
To get the percentage of CPU used, all you need is some simple maths:
MBeanServerConnection mbsc = ManagementFactory.getPlatformMBeanServer();
OperatingSystemMXBean osMBean = ManagementFactory.newPlatformMXBeanProxy(
mbsc, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME, OperatingSystemMXBean.class);
long nanoBefore = System.nanoTime();
long cpuBefore = osMBean.getProcessCpuTime();
// Call an expensive task, or sleep if you are monitoring a remote process
long cpuAfter = osMBean.getProcessCpuTime();
long nanoAfter = System.nanoTime();
long percent;
if (nanoAfter > nanoBefore)
percent = ((cpuAfter-cpuBefore)*100L)/
(nanoAfter-nanoBefore);
else percent = 0;
System.out.println("Cpu usage: "+percent+"%");
Note: You must import com.sun.management.OperatingSystemMXBean and not java.lang.management.OperatingSystemMXBean.
The accepted answer in 2008 recommended SIGAR. However, as a comment from 2014 (@Alvaro) says:
Be careful when using Sigar, there are problems on x64 machines...
Sigar 1.6.4 is crashing: EXCEPTION_ACCESS_VIOLATION and it seems the library
doesn't get updated since 2010
My recommendation is to use https://github.com/oshi/oshi
Or the answer mentioned above.
For disk space, if you have Java 6, you can use the getTotalSpace and getFreeSpace methods on File. If you're not on Java 6, I believe you can use Apache Commons IO to get some of the way there.
I don't know of any cross platform way to get CPU usage or Memory usage I'm afraid.
A lot of this is already available via JMX. With Java 5, JMX is built-in and they include a JMX console viewer with the JDK.
You can use JMX to monitor manually, or invoke JMX commands from Java if you need this information in your own run-time.
/* YOU CAN TRY THIS TOO */
import java.io.File;
import java.lang.management.ManagementFactory;
// import java.lang.management.OperatingSystemMXBean;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.lang.management.RuntimeMXBean;
import java.io.*;
import java.net.*;
import java.util.*;
import java.io.LineNumberReader;
import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.util.Random;
public class Pragati
{
public static void printUsage(Runtime runtime)
{
long total, free, used;
int mb = 1024*1024;
total = runtime.totalMemory();
free = runtime.freeMemory();
used = total - free;
System.out.println("\nTotal Memory: " + total / mb + "MB");
System.out.println(" Memory Used: " + used / mb + "MB");
System.out.println(" Memory Free: " + free / mb + "MB");
System.out.println("Percent Used: " + ((double)used/(double)total)*100 + "%");
System.out.println("Percent Free: " + ((double)free/(double)total)*100 + "%");
}
public static void log(Object message)
{
System.out.println(message);
}
public static int calcCPU(long cpuStartTime, long elapsedStartTime, int cpuCount)
{
long end = System.nanoTime();
long totalAvailCPUTime = cpuCount * (end-elapsedStartTime);
long totalUsedCPUTime = ManagementFactory.getThreadMXBean().getCurrentThreadCpuTime()-cpuStartTime;
//log("Total CPU Time:" + totalUsedCPUTime + " ns.");
//log("Total Avail CPU Time:" + totalAvailCPUTime + " ns.");
float per = ((float)totalUsedCPUTime*100)/(float)totalAvailCPUTime;
log( per);
return (int)per;
}
static boolean isPrime(int n)
{
// 2 is the smallest prime
if (n <= 2)
{
return n == 2;
}
// even numbers other than 2 are not prime
if (n % 2 == 0)
{
return false;
}
// check odd divisors from 3
// to the square root of n
for (int i = 3, end = (int)Math.sqrt(n); i <= end; i += 2)
{
if (n % i == 0)
{
return false;
}
}
return true;
}
public static void main(String [] args)
{
int mb = 1024*1024;
int gb = 1024*1024*1024;
/* PHYSICAL MEMORY USAGE */
System.out.println("\n**** Sizes in Mega Bytes ****\n");
com.sun.management.OperatingSystemMXBean operatingSystemMXBean = (com.sun.management.OperatingSystemMXBean)ManagementFactory.getOperatingSystemMXBean();
//RuntimeMXBean runtimeMXBean = ManagementFactory.getRuntimeMXBean();
//operatingSystemMXBean = (com.sun.management.OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
com.sun.management.OperatingSystemMXBean os = (com.sun.management.OperatingSystemMXBean)
java.lang.management.ManagementFactory.getOperatingSystemMXBean();
long physicalMemorySize = os.getTotalPhysicalMemorySize();
System.out.println("PHYSICAL MEMORY DETAILS \n");
System.out.println("total physical memory : " + physicalMemorySize / mb + "MB ");
long physicalfreeMemorySize = os.getFreePhysicalMemorySize();
System.out.println("total free physical memory : " + physicalfreeMemorySize / mb + "MB");
/* DISC SPACE DETAILS */
File diskPartition = new File("C:");
File diskPartition1 = new File("D:");
File diskPartition2 = new File("E:");
long totalCapacity = diskPartition.getTotalSpace() / gb;
long totalCapacity1 = diskPartition1.getTotalSpace() / gb;
double freePartitionSpace = diskPartition.getFreeSpace() / gb;
double freePartitionSpace1 = diskPartition1.getFreeSpace() / gb;
double freePartitionSpace2 = diskPartition2.getFreeSpace() / gb;
double usablePatitionSpace = diskPartition.getUsableSpace() / gb;
System.out.println("\n**** Sizes in Giga Bytes ****\n");
System.out.println("DISC SPACE DETAILS \n");
//System.out.println("Total C partition size : " + totalCapacity + "GB");
//System.out.println("Usable Space : " + usablePatitionSpace + "GB");
System.out.println("Free Space in drive C: : " + freePartitionSpace + "GB");
System.out.println("Free Space in drive D: : " + freePartitionSpace1 + "GB");
System.out.println("Free Space in drive E: " + freePartitionSpace2 + "GB");
if(freePartitionSpace <= totalCapacity%10 || freePartitionSpace1 <= totalCapacity1%10)
{
System.out.println(" !!!alert!!!!");
}
else
System.out.println("no alert");
Runtime runtime;
byte[] bytes;
System.out.println("\n \n**MEMORY DETAILS ** \n");
// Print initial memory usage.
runtime = Runtime.getRuntime();
printUsage(runtime);
// Allocate a 1 Megabyte and print memory usage
bytes = new byte[1024*1024];
printUsage(runtime);
bytes = null;
// Invoke garbage collector to reclaim the allocated memory.
runtime.gc();
// Wait 5 seconds to give garbage collector a chance to run
try {
Thread.sleep(5000);
} catch(InterruptedException e) {
e.printStackTrace();
return;
}
// Total memory will probably be the same as the second printUsage call,
// but the free memory should be about 1 Megabyte larger if garbage
// collection kicked in.
printUsage(runtime);
for(int i = 0; i < 30; i++)
{
long start = System.nanoTime();
// log(start);
//number of available processors;
int cpuCount = ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
Random random = new Random(start);
int seed = Math.abs(random.nextInt());
log("\n \n CPU USAGE DETAILS \n\n");
log("Starting Test with " + cpuCount + " CPUs and random number:" + seed);
int primes = 10000;
//
long startCPUTime = ManagementFactory.getThreadMXBean().getCurrentThreadCpuTime();
start = System.nanoTime();
while(primes != 0)
{
if(isPrime(seed))
{
primes--;
}
seed++;
}
float cpuPercent = calcCPU(startCPUTime, start, cpuCount);
log("CPU USAGE : " + cpuPercent + " % ");
try
{
Thread.sleep(1000);
}
catch (InterruptedException e) {}
}
try
{
Thread.sleep(500);
}
catch (Exception ignored) { }
}
}
The following code is Linux (maybe Unix) only, but it works in a real project.
private double getAverageValueByLinux() throws InterruptedException {
try {
long delay = 50;
List<Double> listValues = new ArrayList<Double>();
for (int i = 0; i < 100; i++) {
long cput1 = getCpuT();
Thread.sleep(delay);
long cput2 = getCpuT();
double cpuproc = (1000d * (cput2 - cput1)) / (double) delay;
listValues.add(cpuproc);
}
listValues.remove(0);
listValues.remove(listValues.size() - 1);
double sum = 0.0;
for (Double double1 : listValues) {
sum += double1;
}
return sum / listValues.size();
} catch (Exception e) {
e.printStackTrace();
return 0;
}
}
private long getCpuT() throws FileNotFoundException, IOException {
BufferedReader reader = new BufferedReader(new FileReader("/proc/stat"));
String line = reader.readLine();
Pattern pattern = Pattern.compile("\\D+(\\d+)\\D+(\\d+)\\D+(\\d+)\\D+(\\d+)");
Matcher m = pattern.matcher(line);
long cpuUser = 0;
long cpuSystem = 0;
if (m.find()) {
cpuUser = Long.parseLong(m.group(1));
cpuSystem = Long.parseLong(m.group(3));
}
return cpuUser + cpuSystem;
}
Make a batch file "Pc.bat" as,
typeperf -sc 1 "\mukit\processor(_Total)\%% Processor Time"
You can use the class MProcessor,
/*
*Md. Mukit Hasan
*CSE-JU,35
**/
import java.io.*;
public class MProcessor {
public MProcessor() {
String s;
try {
Process ps = Runtime.getRuntime().exec("Pc.bat");
BufferedReader br = new BufferedReader(new InputStreamReader(ps.getInputStream()));
while((s = br.readLine()) != null) {
System.out.println(s);
}
}
catch( Exception ex ) {
System.out.println(ex.toString());
}
}
}
Then after some string manipulation, you get the CPU use. You can use the same process for other tasks.
--Mukit Hasan
OperatingSystemMXBean osBean = ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
System.out.println((osBean.getCpuLoad() * 100) + "%");
import com.sun.management.OperatingSystemMXBean
It only starts working after the second call, so save the osBean and put it in a loop.
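A minimal sketch of that (assuming a JDK where com.sun.management.OperatingSystemMXBean.getCpuLoad() exists, i.e. Java 14+; on older JDKs getSystemCpuLoad() plays the same role):

import java.lang.management.ManagementFactory;
import com.sun.management.OperatingSystemMXBean;

public class CpuLoadLoop {
    public static void main(String[] args) throws InterruptedException {
        // create the bean once and reuse it; the very first reading is typically not meaningful
        OperatingSystemMXBean osBean =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        for (int i = 0; i < 10; i++) {
            System.out.println("CPU load: " + (osBean.getCpuLoad() * 100) + "%");
            Thread.sleep(1000);
        }
    }
}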
