How does the SPF4J Java Profiler write buffered metrics to files?

I wrote a simple test to capture timing metrics using the SPF4J (Simple Profiler Framework For Java) MeasurementRecorder. I'm running a simple for loop and recording the time of a random sleep of 100-200 ms, as shown in the complete code sample below.
When I run the test, the Time-Series DataBase (TSDB) files are created successfully, but they stay empty while the test is running (around 2 minutes). The buffered data is written to the files when the application completes, but samples at the end are missing and the last one is truncated, as if the buffer is not being flushed properly.
If the application never terminates (e.g. a web service), when will the buffered metrics be written to file - on a timer, or when a certain amount of data is buffered? Is this configurable and, if so, how?
package com.examples.spf4j;

import org.spf4j.perf.MeasurementRecorder;
import org.spf4j.perf.impl.RecorderFactory;

import java.io.File;
import java.util.Random;

import org.junit.Test;

public class TestMeasurementRecorder {

    @Test
    public void testMeasurementRecorder() throws InterruptedException {
        initialize();
        MeasurementRecorder measurementRecorder = getMeasurementRecorder();
        Random random = new Random();
        for (int i = 0; i <= 1000; i++) {
            long startTime = System.currentTimeMillis();
            Thread.sleep(100 + random.nextInt(100));
            measurementRecorder.record(System.currentTimeMillis() - startTime);
        }
    }

    public static void initialize() {
        String tsDbFile = System.getProperty("user.dir") + File.separator + "spf4j-performance-monitor.tsdb2";
        String tsTextFile = System.getProperty("user.dir") + File.separator + "spf4j-performance-monitor.txt";
        System.setProperty("spf4j.perf.ms.config", "TSDB#" + tsDbFile + "," + "TSDB_TXT#" + tsTextFile);
    }

    public static MeasurementRecorder getMeasurementRecorder() {
        int sampleTimeMillis = 1000;
        return RecorderFactory.createScalableQuantizedRecorder("response time", "ms", sampleTimeMillis, 10, 0, 40, 10);
    }
}

You will need to set the system property: spf4j.perf.ms.periodicFlush=true
to enable the periodic flush.
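For example, one way to apply this in the test above is to set the property before the first recorder is created, either on the command line (-Dspf4j.perf.ms.periodicFlush=true) or programmatically in initialize(). A minimal sketch:

public static void initialize() {
    String tsDbFile = System.getProperty("user.dir") + File.separator + "spf4j-performance-monitor.tsdb2";
    String tsTextFile = System.getProperty("user.dir") + File.separator + "spf4j-performance-monitor.txt";
    System.setProperty("spf4j.perf.ms.config", "TSDB#" + tsDbFile + "," + "TSDB_TXT#" + tsTextFile);
    // Enable the periodic background flush so buffered samples are written out
    // while the application is still running, not only at shutdown.
    System.setProperty("spf4j.perf.ms.periodicFlush", "true");
}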

Related

Program dismisses steps when running quickly (Java)

I am running a tool that runs an external Java program several times during its operation. The external program starts by opening a JOptionPane inside a JFrame.
Here is a test script I wrote to try to solve my issue.
import java.io.File;

public class Test {
    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 6; i++) {
            //Thread.sleep(1000);
            String toRun = "java -jar \"" + "C:\\Folder\\File.jar" + "\" " + i;
            Runtime.getRuntime().exec(toRun, null, new File("C:\\Folder"));
        }
    }
}
When this runs, only the final run's JOptionPane (i=5) appears, but the others seem to be "trying" to appear: panes open and immediately close.
When I uncomment the Thread.sleep, however, all of the panes open separately. If I set the sleep to 300 (0.3 seconds), about half of the panes appear, usually the first and last ones.
I would like to find a way to run all instances of the external program fully without needing to use Thread.sleep() at all, if possible.
Edit: As requested, I've minimised my external program as well.
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.SimpleDateFormat;
import java.util.Date;
import javax.swing.JFrame;
import javax.swing.JOptionPane;

public class File {
    static JFrame frame = new JFrame("Frame");
    private static String doc1Address = "C:\\Folder\\doc1.csv";
    private static String doc2Address = "C:\\Folder\\doc2.csv";

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            SimpleDateFormat form = new SimpleDateFormat("yyyy-MM-dd hh-mm-ss");
            Date date = new Date();
            String currentDate = form.format(date);
            //Save Backup of doc1
            String doc1BackAddress = doc1Address.substring(0, doc1Address.length() - 15) + "doc1Back " + currentDate + ".csv";
            Path todoc1 = Paths.get(doc1Address);
            Path todoc1Back = Paths.get(doc1BackAddress);
            Files.copy(todoc1, todoc1Back);
            Files.setAttribute(todoc1Back, "dos:readonly", true);
            //Save Backup of doc2
            String doc2BackAddress = doc2Address.substring(0, doc2Address.length() - 16) + "doc2Back " + currentDate + ".csv";
            Path todoc2 = Paths.get(doc2Address);
            Path todoc2Back = Paths.get(doc2BackAddress);
            Files.copy(todoc2, todoc2Back);
            Files.setAttribute(todoc2Back, "dos:readonly", true);
            //Format JFrame
            frame.pack();
            frame.setLocationRelativeTo(null);
            frame.setVisible(true);
            frame.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
            JOptionPane.showMessageDialog(frame, args[0]);
            frame.dispose();
        }
    }
}
Found my own issue: since the backup files use the format yyyy-MM-dd hh-mm-ss, files saved during the same second cause a FileAlreadyExistsException, so only the first file to finish saving lets its run of the program continue.
Having a 1-second pause gives the files different names, so no error occurs.
Having a sub-1-second pause causes some file-name overlap, but also some different names, hence only some of the panes appear.
Solution: either change the name format (e.g. include milliseconds), or wrap the backup code in an if-statement that is skipped if a file with the same timestamp already exists.
(Also, thank you @ErwinBolwidt; being encouraged to format my question properly made me realise that the issue in my code was not where I assumed it to be.)
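A minimal sketch of both options, reusing the fields and imports of the program above (the millisecond pattern and the existence check are my additions, not part of the original program):

// Option 1: include milliseconds so backups created within the same second get unique names.
SimpleDateFormat form = new SimpleDateFormat("yyyy-MM-dd hh-mm-ss-SSS");
String currentDate = form.format(new Date());
String doc1BackAddress = doc1Address.substring(0, doc1Address.length() - 15) + "doc1Back " + currentDate + ".csv";
Path todoc1Back = Paths.get(doc1BackAddress);

// Option 2: skip the copy if a backup with this exact name already exists.
if (!Files.exists(todoc1Back)) {
    Files.copy(Paths.get(doc1Address), todoc1Back);
    Files.setAttribute(todoc1Back, "dos:readonly", true);
}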

Solr custom Tokenizer Factory works randomly

I am new to Solr and I need to write a filter that lemmatizes text when indexing documents and also lemmatizes queries.
I created a custom Tokenizer Factory that lemmatizes the text before passing it to the Standard Tokenizer.
Tests in the Solr analysis section work fairly well (indexing is fine, but at query time it sometimes analyzes the text twice). However, when indexing documents it only analyzes the first document, and for queries it works seemingly at random (it only analyzes the first one, and to have another query analyzed I have to wait a while). It's not a performance problem, because I tried simply modifying the text instead of lemmatizing it.
Here is the code:
package test.solr.analysis;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
//import test.solr.analysis.TestLemmatizer;

public class TestLemmatizerTokenizerFactory extends TokenizerFactory {
    //private TestLemmatizer lemmatizer = new TestLemmatizer();
    private final int maxTokenLength;

    public TestLemmatizerTokenizerFactory(Map<String, String> args) {
        super(args);
        assureMatchVersion();
        maxTokenLength = getInt(args, "maxTokenLength", StandardAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    public String readFully(Reader reader) {
        char[] arr = new char[8 * 1024]; // 8K at a time
        StringBuffer buf = new StringBuffer();
        int numChars;
        try {
            while ((numChars = reader.read(arr, 0, arr.length)) > 0) {
                buf.append(arr, 0, numChars);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("### READFULLY ### => " + buf.toString());
        /*
        The original return with lemmatized text would be this:
            return lemmatizer.getLemma(buf.toString());
        To test it I only change the text by adding the word "lemmatized"
        */
        return buf.toString() + " lemmatized";
    }

    @Override
    public StandardTokenizer create(AttributeFactory factory, Reader input) {
        // I print this to see when it enters the tokenizer
        System.out.println("### Standard tokenizer ###");
        StandardTokenizer tokenizer = new StandardTokenizer(luceneMatchVersion, factory, new StringReader(readFully(input)));
        tokenizer.setMaxTokenLength(maxTokenLength);
        return tokenizer;
    }
}
With this, it only indexes the first text, with the word "lemmatized" appended.
Then, on the first query, if I search for the word "example" it looks for "example" and "lemmatized", so it returns the first document.
On subsequent searches it doesn't modify the query. To get a new query with the word "lemmatized" added, I have to wait a few minutes.
What is happening?
Thank you all.
I highly doubt that the create method is invoked on each query (for starters, performance issues come to mind). I would take the safe route and create a Tokenizer that wraps a StandardTokenizer, then just override the setReader method and do my work there.

Why do I get a Java heap space exception when I load around 11k images with a total size of around 40 MB?

I have a situation in my program where I need to access a certain number of images saved on my hard drive. I could either load them only once I really need them, or load them all at startup.
Out of curiosity, I tried to read all (around 11k) images from several folders and store them in several Lists. I was wondering whether this would take too long, but instead I received an OutOfMemoryError after reading in around 9k images.
My JRE has a heap size of 1 GB (-Xmx1g).
Why does this happen with my code? Am I producing a memory leak? Do you have any suggestions on what to change so this doesn't happen, or is the solution simply to read the files only once I need them? The overall size of all files is only around 40 MB, so I thought it would be okay to keep them in memory (40 MB doesn't seem like too much of my 1 GB heap). Or does Java perform some crazy magic that multiplies the file size by a large factor once loaded?
I have done some research on Stack Overflow but couldn't really apply the given answers (like from here or here) to my case. So if anyone of you has an idea, I would be really happy :)
My code is here, sorry if it's a bit ugly - I was just messing around:
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

public class FatfontsManager {

    private List<BufferedImage> until9;
    private List<BufferedImage> until99;
    private List<BufferedImage> until999;
    private List<BufferedImage> until9999;

    // loads and keeps all images
    public FatfontsManager() {
        until9 = new ArrayList<>();
        until99 = new ArrayList<>();
        until999 = new ArrayList<>();
        until9999 = new ArrayList<>();
        String absolutPathToThisProject = new java.io.File("")
                .getAbsolutePath();

        // load Images
        for (int i = 1; i < 10; i++) {
            until9.add(loadImage(absolutPathToThisProject
                    + "\\latexDir\\0..9\\" + i + ".png"));
        }
        System.out.println("1 - 9 loaded");
        for (int i = 1; i < 100; i++) {
            until99.add(loadImage(absolutPathToThisProject
                    + "\\latexDir\\0..99\\" + i + ".png"));
        }
        System.out.println("1 - 99 loaded");
        for (int i = 1; i < 1000; i++) {
            until999.add(loadImage(absolutPathToThisProject
                    + "\\latexDir\\0..999\\" + i + ".png"));
        }
        System.out.println("1 - 999 loaded");
        for (int i = 1; i < 10000; i++) {
            until9999.add(loadImage(absolutPathToThisProject
                    + "\\latexDir\\0..9999\\" + i + ".png"));
        }
        System.out.println("1 - 9999 loaded");
    }

    public static void main(String[] args) {
        new FatfontsManager();
    }

    public static BufferedImage loadImage(String ref) {
        BufferedImage bimg = null;
        try {
            bimg = ImageIO.read(new File(ref));
        } catch (Exception e) {
            System.err.println("Error loading image file " + ref);
            e.printStackTrace();
        }
        return bimg;
    }
}
Thanks a lot for every answer.
The overall size of all files is only around 40mb ...
But when you load the image files into memory, the images will be decompressed and turned into arrays of pixels. Depending on the original image file format (and compression parameters) this could require an order of magnitude or more heap space.
It is not entirely clear what you are trying to achieve, but maybe you should consider only caching these images in memory when they are actually going to be used.
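As a rough back-of-the-envelope illustration (my own sketch, not from the answer above): an uncompressed (A)RGB BufferedImage needs roughly width × height × 4 bytes of heap for its pixel data alone, so a small compressed PNG can easily expand by an order of magnitude when loaded. The file path below is hypothetical.

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class ImageMemoryEstimate {
    public static void main(String[] args) throws Exception {
        // Hypothetical example file; substitute one of your own PNGs.
        File file = new File("latexDir/0..9999/1234.png");
        BufferedImage img = ImageIO.read(file);
        long onDisk = file.length();
        // Roughly 4 bytes per pixel for an (A)RGB image; the real footprint
        // also includes object headers and the backing raster structures.
        long inMemory = (long) img.getWidth() * img.getHeight() * 4;
        System.out.println("On disk:   " + onDisk + " bytes");
        System.out.println("In memory: ~" + inMemory + " bytes");
    }
}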

Splitting a .gz file into specified file sizes in Java using byte[] array

I have written code to split a .gz file into user-specified parts using a byte[] array, but the for loop is not reading/writing the last part of the parent file, which is smaller than the array size. Can you please help me fix this?
package com.bitsighttech.collection.packaging;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.log4j.Logger;

public class FileSplitterBytewise
{
    private static Logger logger = Logger.getLogger(FileSplitterBytewise.class);
    private static final long KB = 1024;
    private static final long MB = KB * KB;
    private FileInputStream fis;
    private FileOutputStream fos;
    private DataInputStream dis;
    private DataOutputStream dos;

    public boolean split(File inputFile, String splitSize)
    {
        int expectedNoOfFiles = 0;
        try
        {
            double parentFileSizeInB = inputFile.length();
            Pattern p = Pattern.compile("(\\d+)\\s([MmGgKk][Bb])");
            Matcher m = p.matcher(splitSize);
            m.matches();
            String FileSizeString = m.group(1);
            String unit = m.group(2);
            double FileSizeInMB = 0;
            try {
                if (unit.toLowerCase().equals("kb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) / KB;
                else if (unit.toLowerCase().equals("mb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString);
                else if (unit.toLowerCase().equals("gb"))
                    FileSizeInMB = Double.parseDouble(FileSizeString) * KB;
            } catch (NumberFormatException e) {
                logger.error("invalid number [" + FileSizeInMB + "] for expected file size");
            }
            double fileSize = FileSizeInMB * MB;
            int fileSizeInByte = (int) Math.ceil(fileSize);
            double noOFFiles = parentFileSizeInB / fileSizeInByte;
            expectedNoOfFiles = (int) Math.ceil(noOFFiles);
            int splinterCount = 1;
            fis = new FileInputStream(inputFile);
            dis = new DataInputStream(new BufferedInputStream(fis));
            fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
            dos = new DataOutputStream(new BufferedOutputStream(fos));
            byte[] data = new byte[(int) fileSizeInByte];
            while (splinterCount <= expectedNoOfFiles) {
                int i;
                for (i = 0; i < data.length - 1; i++)
                {
                    data[i] = dis.readByte();
                }
                dos.write(data);
                splinterCount++;
            }
        }
        catch (Exception e)
        {
            logger.error("Unable to split the file " + inputFile.getName() + " in to " + expectedNoOfFiles);
            return false;
        }
        logger.debug("Successfully split the file [" + inputFile.getName() + "] in to " + expectedNoOfFiles + " files");
        return true;
    }

    public static void main(String args[])
    {
        String FilePath1 = "F:\\az.gz";
        File file = new File(FilePath1);
        FileSplitterBytewise fileSplitter = new FileSplitterBytewise();
        String splitlen = "1 MB";
        fileSplitter.split(file, splitlen);
    }
}
I'd suggest making more methods. You've got a complicated string-handling section of code in split(); it would be best to make a method that takes the human-friendly string as input and returns the number you're looking for. (It would also make it far easier for you to test this section of the routine; there's no way you can test it now.)
Once it is split off and you're writing test cases, you'll probably find that the error message you generate if the string doesn't contain kb, mb, or gb is extremely confusing -- it blames the number 0 for the mistake rather than pointing out that the string does not have the expected units.
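For instance, a minimal sketch of such a helper (the method name and the choice to throw IllegalArgumentException are mine, not from the original code):

// Parses strings like "1 MB", "512 kb", "2 GB" into a size in bytes.
// Throws IllegalArgumentException instead of silently falling back to 0.
static long parseSizeInBytes(String text) {
    java.util.regex.Matcher m = java.util.regex.Pattern
            .compile("(\\d+)\\s*([KkMmGg][Bb])")
            .matcher(text.trim());
    if (!m.matches()) {
        throw new IllegalArgumentException("Cannot parse size: '" + text + "'");
    }
    long value = Long.parseLong(m.group(1));
    switch (m.group(2).toLowerCase()) {
        case "kb": return value * 1024L;
        case "mb": return value * 1024L * 1024L;
        case "gb": return value * 1024L * 1024L * 1024L;
        default:   throw new IllegalArgumentException("Unknown unit in: '" + text + "'");
    }
}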
Using an int to store the file size means your program will never handle files larger than two gigabytes. You should stick with long or double. (double feels wrong for something that is actually confined to integer values but I can't quickly think why it would fail.)
byte[] data = new byte[(int) fileSizeInByte];
Allocating several gigabytes like this is going to destroy your performance -- that's a potentially huge memory allocation (and one that might be considered under control of an adversary; depending upon your security model, this might or might not be a big deal). Don't try to work with the entire file in one piece.
You appear to be reading and writing the files one byte at a time. That's a guarantee to very slow performance. Doing some performance testing for another question earlier today, I found that my machine could read (from a hot cache) 2000 times faster using 131kb blocks than two-byte blocks. One-byte blocks would be even worse. A cold cache would be significantly worse for such small sizes.
fos = new FileOutputStream("F:\\ff\\" + "_part_" + splinterCount + "_of_" + expectedNoOfFiles);
You only appear to ever open one file output stream. Your post probably should have said "only the first works", because it looks like you've not yet tried it on a file that creates three or more pieces.
catch(Exception e)
At this point, you've got the ability to discover errors in your program; you choose to ignore them completely. Sure, you log an error message, but you cannot actually debug your program with the data you log. You should log at a minimum the exception type, message, and maybe even full stack-trace. This combination of data is immensely useful when trying to solve problems, especially in a few months when you've forgotten the details of how it works.
Can you please help me in fixing this?
I would suggest the following (a rough sketch follows this list):
Drop the DataInput/OutputStreams; you don't need them.
Use in.read(data) to read a whole block instead of one byte at a time. Reading one byte at a time is much slower!
Also, if you do read into the whole data array, note that you are currently reading one byte less than its length.
Stop when you reach the end of the file; it might not be a whole multiple of the block size.
Only write as much as you have read: if your blocks are 1 MB and there are 100 KB left, you should only read/write 100 KB at the end.
Close your files when you have finished, especially as you have buffered streams.
Your "split" writes everything to the same file (so it's not actually splitting). You need to create, write to, and close output files in a loop.
Don't use fields where you could/should be using local variables.
I would store the length as a long in bytes.
The pattern ignores incorrect input, and your pattern doesn't match the tests you check for, e.g. your pattern allows "1 G" or "1 k" but these will be treated as 1 MB.
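A minimal sketch along those lines (the buffer size, output-file naming, and the output directory parameter are illustrative assumptions, not taken from the original code):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SimpleSplitter {

    // Splits inputFile into chunks of at most chunkSize bytes, one output file per chunk.
    public static void split(File inputFile, long chunkSize, File outputDir) throws IOException {
        byte[] buffer = new byte[64 * 1024];              // read in large blocks, not one byte at a time
        try (InputStream in = new BufferedInputStream(new FileInputStream(inputFile))) {
            int part = 1;
            int read = in.read(buffer, 0, (int) Math.min(buffer.length, chunkSize));
            while (read > 0) {
                File partFile = new File(outputDir, inputFile.getName() + ".part" + part++);
                try (OutputStream out = new BufferedOutputStream(new FileOutputStream(partFile))) {
                    long written = 0;
                    while (read > 0) {
                        out.write(buffer, 0, read);       // only write as much as was actually read
                        written += read;
                        if (written >= chunkSize) {
                            break;                        // this part is full; start a new output file
                        }
                        read = in.read(buffer, 0, (int) Math.min(buffer.length, chunkSize - written));
                    }
                }
                if (read > 0) {                           // last part filled up exactly; prime the next one
                    read = in.read(buffer, 0, (int) Math.min(buffer.length, chunkSize));
                }
            }
        }
    }
}

Each chunk gets its own output stream, only the bytes actually read are written, and the loop stops at end of file, so the final, smaller part is no longer dropped.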

java get file size efficiently

While googling, I see that using java.io.File#length() can be slow.
FileChannel has a size() method that is available as well.
Is there an efficient way in java to get the file size?
Well, I tried to measure it up with the code below:
For runs = 1 and iterations = 1 the URL method is fastest most of the time, followed by channel. I ran this fresh, with some pauses, about 10 times. So for one-time access, using the URL is the fastest way I can think of:
LENGTH sum: 10626, per Iteration: 10626.0
CHANNEL sum: 5535, per Iteration: 5535.0
URL sum: 660, per Iteration: 660.0
For runs = 5 and iterations = 50 the picture looks different.
LENGTH sum: 39496, per Iteration: 157.984
CHANNEL sum: 74261, per Iteration: 297.044
URL sum: 95534, per Iteration: 382.136
File must be caching the calls to the filesystem, while channels and URL have some overhead.
Code:
import java.io.*;
import java.net.*;
import java.util.*;

public enum FileSizeBench {

    LENGTH {
        @Override
        public long getResult() throws Exception {
            File me = new File(FileSizeBench.class.getResource(
                    "FileSizeBench.class").getFile());
            return me.length();
        }
    },
    CHANNEL {
        @Override
        public long getResult() throws Exception {
            FileInputStream fis = null;
            try {
                File me = new File(FileSizeBench.class.getResource(
                        "FileSizeBench.class").getFile());
                fis = new FileInputStream(me);
                return fis.getChannel().size();
            } finally {
                fis.close();
            }
        }
    },
    URL {
        @Override
        public long getResult() throws Exception {
            InputStream stream = null;
            try {
                URL url = FileSizeBench.class
                        .getResource("FileSizeBench.class");
                stream = url.openStream();
                return stream.available();
            } finally {
                stream.close();
            }
        }
    };

    public abstract long getResult() throws Exception;

    public static void main(String[] args) throws Exception {
        int runs = 5;
        int iterations = 50;
        EnumMap<FileSizeBench, Long> durations = new EnumMap<FileSizeBench, Long>(FileSizeBench.class);
        for (int i = 0; i < runs; i++) {
            for (FileSizeBench test : values()) {
                if (!durations.containsKey(test)) {
                    durations.put(test, 0l);
                }
                long duration = testNow(test, iterations);
                durations.put(test, durations.get(test) + duration);
                // System.out.println(test + " took: " + duration + ", per iteration: " + ((double)duration / (double)iterations));
            }
        }
        for (Map.Entry<FileSizeBench, Long> entry : durations.entrySet()) {
            System.out.println();
            System.out.println(entry.getKey() + " sum: " + entry.getValue() + ", per Iteration: " + ((double)entry.getValue() / (double)(runs * iterations)));
        }
    }

    private static long testNow(FileSizeBench test, int iterations)
            throws Exception {
        long result = -1;
        long before = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (result == -1) {
                result = test.getResult();
                //System.out.println(result);
            } else if ((result = test.getResult()) != result) {
                throw new Exception("variance detected!");
            }
        }
        return (System.nanoTime() - before) / 1000;
    }
}
The benchmark given by GHad measures lots of other stuff (such as reflection, instantiating objects, etc.) besides getting the length. If we try to get rid of these things then for one call I get the following times in microseconds:
file sum___19.0, per Iteration___19.0
raf sum___16.0, per Iteration___16.0
channel sum__273.0, per Iteration__273.0
For 100 runs and 10000 iterations I get:
file sum__1767629.0, per Iteration__1.7676290000000001
raf sum___881284.0, per Iteration__0.8812840000000001
channel sum___414286.0, per Iteration__0.414286
I ran the following modified code, giving the name of a 100 MB file as an argument.
import java.io.*;
import java.nio.channels.*;
import java.net.*;
import java.util.*;

public class FileSizeBench {
    private static File file;
    private static FileChannel channel;
    private static RandomAccessFile raf;

    public static void main(String[] args) throws Exception {
        int runs = 1;
        int iterations = 1;

        file = new File(args[0]);
        channel = new FileInputStream(args[0]).getChannel();
        raf = new RandomAccessFile(args[0], "r");

        HashMap<String, Double> times = new HashMap<String, Double>();
        times.put("file", 0.0);
        times.put("channel", 0.0);
        times.put("raf", 0.0);

        long start;
        for (int i = 0; i < runs; ++i) {
            long l = file.length();

            start = System.nanoTime();
            for (int j = 0; j < iterations; ++j)
                if (l != file.length()) throw new Exception();
            times.put("file", times.get("file") + System.nanoTime() - start);

            start = System.nanoTime();
            for (int j = 0; j < iterations; ++j)
                if (l != channel.size()) throw new Exception();
            times.put("channel", times.get("channel") + System.nanoTime() - start);

            start = System.nanoTime();
            for (int j = 0; j < iterations; ++j)
                if (l != raf.length()) throw new Exception();
            times.put("raf", times.get("raf") + System.nanoTime() - start);
        }
        for (Map.Entry<String, Double> entry : times.entrySet()) {
            System.out.println(
                entry.getKey() + " sum: " + 1e-3 * entry.getValue() +
                ", per Iteration: " + (1e-3 * entry.getValue() / runs / iterations));
        }
    }
}
All the test cases in this post are flawed, as they access the same file for each method tested. So disk caching kicks in, which tests 2 and 3 benefit from. To prove my point I took the test case provided by GHad, changed the order of enumeration, and below are the results.
Looking at the results, I think File.length() really is the winner.
The order of the tests is the order of the output. You can even see that the time taken on my machine varied between executions, but File.length(), when it was not first and so did not incur the first disk access, won.
---
LENGTH sum: 1163351, per Iteration: 4653.404
CHANNEL sum: 1094598, per Iteration: 4378.392
URL sum: 739691, per Iteration: 2958.764
---
CHANNEL sum: 845804, per Iteration: 3383.216
URL sum: 531334, per Iteration: 2125.336
LENGTH sum: 318413, per Iteration: 1273.652
---
URL sum: 137368, per Iteration: 549.472
LENGTH sum: 18677, per Iteration: 74.708
CHANNEL sum: 142125, per Iteration: 568.5
When I modify your code to use a file accessed by an absolute path instead of a resource, I get a different result (for 1 run, 1 iteration, and a 100,000 byte file -- times for a 10 byte file are identical to 100,000 bytes)
LENGTH sum: 33, per Iteration: 33.0
CHANNEL sum: 3626, per Iteration: 3626.0
URL sum: 294, per Iteration: 294.0
In response to rgrig's benchmark, the time taken to open/close the FileChannel & RandomAccessFile instances also needs to be taken into account, as these classes will open a stream for reading the file.
After modifying the benchmark, I got these results for 1 iterations on a 85MB file:
file totalTime: 48000 (48 us)
raf totalTime: 261000 (261 us)
channel totalTime: 7020000 (7 ms)
For 10000 iterations on same file:
file totalTime: 80074000 (80 ms)
raf totalTime: 295417000 (295 ms)
channel totalTime: 368239000 (368 ms)
If all you need is the file size, file.length() is the fastest way to do it. If you plan to use the file for other purposes like reading/writing, then RAF seems to be a better bet. Just don't forget to close the file connection :-)
import java.io.File;
import java.io.FileInputStream;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.util.HashMap;
import java.util.Map;

public class FileSizeBench
{
    public static void main(String[] args) throws Exception
    {
        int iterations = 1;
        String fileEntry = args[0];

        Map<String, Long> times = new HashMap<String, Long>();
        times.put("file", 0L);
        times.put("channel", 0L);
        times.put("raf", 0L);

        long fileSize;
        long start;
        long end;
        File f1;
        FileChannel channel;
        RandomAccessFile raf;

        for (int i = 0; i < iterations; i++)
        {
            // file.length()
            start = System.nanoTime();
            f1 = new File(fileEntry);
            fileSize = f1.length();
            end = System.nanoTime();
            times.put("file", times.get("file") + end - start);

            // channel.size()
            start = System.nanoTime();
            channel = new FileInputStream(fileEntry).getChannel();
            fileSize = channel.size();
            channel.close();
            end = System.nanoTime();
            times.put("channel", times.get("channel") + end - start);

            // raf.length()
            start = System.nanoTime();
            raf = new RandomAccessFile(fileEntry, "r");
            fileSize = raf.length();
            raf.close();
            end = System.nanoTime();
            times.put("raf", times.get("raf") + end - start);
        }
        for (Map.Entry<String, Long> entry : times.entrySet()) {
            System.out.println(entry.getKey() + " totalTime: " + entry.getValue() + " (" + getTime(entry.getValue()) + ")");
        }
    }

    public static String getTime(Long timeTaken)
    {
        if (timeTaken < 1000) {
            return timeTaken + " ns";
        } else if (timeTaken < (1000 * 1000)) {
            return timeTaken / 1000 + " us";
        } else {
            return timeTaken / (1000 * 1000) + " ms";
        }
    }
}
I ran into this same issue. I needed to get the file size and modified date of 90,000 files on a network share. Using Java, and being as minimalistic as possible, it took a very long time. (I needed to get the URL from the file, and the path of the object as well, so it varied somewhat, but more than an hour.) I then used a native Win32 executable to do the same task, just dumping the file path, modified date, and size to the console, and executed that from Java. The speed was amazing. The native process, and my string handling to read the data, could process over 1000 items a second.
So even though people downvoted the above comment, this is a valid solution, and it did solve my issue. In my case I knew the folders I needed the sizes of ahead of time, and I could pass that on the command line to my Win32 app. I went from hours to process a directory to minutes.
The issue did also seem to be Windows specific. OS X did not have the same issue and could access network file info as fast as the OS could do so.
Java File handling on Windows is terrible. Local disk access for files is fine though. It was just network shares that caused the terrible performance. Windows could get info on the network share and calculate the total size in under a minute too.
--Ben
If you want the file size of multiple files in a directory, use Files.walkFileTree. You can obtain the size from the BasicFileAttributes that you'll receive.
This is much faster than calling .length() on the result of File.listFiles() or using Files.size() on the result of Files.newDirectoryStream(). In my test cases it was about 100 times faster.
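A minimal sketch of that approach, summing the sizes of all files under a directory (the directory argument and the running total are illustrative):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.concurrent.atomic.AtomicLong;

public class DirectorySize {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        final AtomicLong total = new AtomicLong();
        // The size comes from the BasicFileAttributes handed to the visitor,
        // so no extra per-file stat call is needed.
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                total.addAndGet(attrs.size());
                return FileVisitResult.CONTINUE;
            }
        });
        System.out.println("Total size of " + root + ": " + total.get() + " bytes");
    }
}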
Actually, I think the "ls" may be faster. There are definitely some issues in Java dealing with getting File info. Unfortunately there is no equivalent safe method of recursive ls for Windows. (cmd.exe's DIR /S can get confused and generate errors in infinite loops)
On XP, accessing a server on the LAN, it takes me 5 seconds in Windows to get the count of the files in a folder (33,000), and the total size.
When I iterate recursively through this in Java, it takes me over 5 minutes. I started measuring the time it takes to do file.length(), file.lastModified(), and file.toURI() and what I found is that 99% of my time is taken by those 3 calls. The 3 calls I actually need to do...
The difference for 1000 files is 15ms local versus 1800ms on server. The server path scanning in Java is ridiculously slow. If the native OS can be fast at scanning that same folder, why can't Java?
As a more complete test, I used WinMerge on XP to compare the modified date and size of the files on the server versus the files locally. This meant iterating over the entire directory tree of 33,000 files in each folder. Total time: 7 seconds. Java: over 5 minutes.
So the original statement and question from the OP are true and valid. It's less noticeable when dealing with a local file system. Doing a local compare of the folder with 33,000 items takes 3 seconds in WinMerge, and takes 32 seconds locally in Java. So again, Java versus native is a 10x slowdown in these rudimentary tests.
Java 1.6.0_22 (latest), Gigabit LAN, and network connections, ping is less than 1ms (both in the same switch)
Java is slow.
From GHad's benchmark, there are a few issues people have mentioned:
1> As BalusC mentioned: stream.available() is flawed in this case,
because available() returns an estimate of the number of bytes that can be read (or skipped over) from this input stream without blocking by the next invocation of a method for this input stream.
So first, remove the URL approach.
2> As StuartH mentioned, the order in which the tests run also makes a caching difference, so take that out by running each test separately.
Now start test:
When CHANNEL one run alone:
CHANNEL sum: 59691, per Iteration: 238.764
When LENGTH one run alone:
LENGTH sum: 48268, per Iteration: 193.072
So it looks like the LENGTH one is the winner here:
@Override
public long getResult() throws Exception {
    File me = new File(FileSizeBench.class.getResource(
            "FileSizeBench.class").getFile());
    return me.length();
}
