Is there a Java way to pre-allocate drive space for exclusive usage in the application?
There is no requirement for this space to be a separate filesystem or a part of existing filesystem (so could easily be a database), but it should allow for reserving the specified amount of space and allow for random reads/writes with high enough throughput.
Here's a stripped down version of my JNA-based fallocate solution. The main trick is obtaining the native file descriptor. I've only tested it on Linux so far, but it should work on all modern POSIX/non-Windows systems. It's not necessary on Windows, as Windows does not create sparse files by default (only with StandardOpenOption.SPARSE), so RandomAccessFile.setLength(size) or FileChannel.write(ByteBuffer.allocate(1), size - 1) are adequate there.
/**
* Provides access to operating system-specific {@code fallocate} and
* {@code posix_fallocate} functions.
*/
public final class Fallocate {
private static final boolean IS_LINUX = Platform.isLinux();
private static final boolean IS_POSIX = !Platform.isWindows();
private static final int FALLOC_FL_KEEP_SIZE = 0x01;
private final int fd;
private int mode;
private long offset;
private final long length;
private Fallocate(int fd, long length) {
if (!isSupported()) {
throwUnsupported("fallocate");
}
this.fd = fd;
this.length = length;
}
public static boolean isSupported() {
return IS_POSIX;
}
public static Fallocate forChannel(FileChannel channel, long length) {
return new Fallocate(getDescriptor(channel), length);
}
public static Fallocate forDescriptor(FileDescriptor descriptor, long length) {
return new Fallocate(getDescriptor(descriptor), length);
}
public Fallocate fromOffset(long offset) {
this.offset = offset;
return this;
}
public Fallocate keepSize() {
requireLinux("fallocate keep size");
mode |= FALLOC_FL_KEEP_SIZE;
return this;
}
private void requireLinux(String feature) {
if (!IS_LINUX) {
throwUnsupported(feature);
}
}
private void throwUnsupported(String feature) {
throw new UnsupportedOperationException(feature +
" is not supported on this operating system");
}
public void execute() throws IOException {
final int errno;
if (IS_LINUX) {
final int result = FallocateHolder.fallocate(fd, mode, offset, length);
errno = result == 0 ? 0 : Native.getLastError();
} else {
errno = PosixFallocateHolder.posix_fallocate(fd, offset, length);
}
if (errno != 0) {
throw new IOException("fallocate returned " + errno);
}
}
private static class FallocateHolder {
static {
Native.register(Platform.C_LIBRARY_NAME);
}
private static native int fallocate(int fd, int mode, long offset, long length);
}
private static class PosixFallocateHolder {
static {
Native.register(Platform.C_LIBRARY_NAME);
}
private static native int posix_fallocate(int fd, long offset, long length);
}
private static int getDescriptor(FileChannel channel) {
try {
// sun.nio.ch.FileChannelImpl declares private final java.io.FileDescriptor fd
final Field field = channel.getClass().getDeclaredField("fd");
field.setAccessible(true);
return getDescriptor((FileDescriptor) field.get(channel));
} catch (final Exception e) {
throw new UnsupportedOperationException("unsupported FileChannel implementation", e);
}
}
private static int getDescriptor(FileDescriptor descriptor) {
try {
// Oracle java.io.FileDescriptor declares private int fd
final Field field = descriptor.getClass().getDeclaredField("fd");
field.setAccessible(true);
return (int) field.get(descriptor);
} catch (final Exception e) {
throw new UnsupportedOperationException("unsupported FileDescriptor implementation", e);
}
}
}
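Usage of the class above might look roughly like this (a sketch; the path and size are illustrative, and the channel must be opened for writing so the descriptor is usable for allocation):
// Reserve 10 GiB for "storage.dat" (illustrative name and size).
try (FileChannel channel = FileChannel.open(Paths.get("storage.dat"),
        StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
    Fallocate.forChannel(channel, 10L * 1024 * 1024 * 1024).execute();
}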
You could try using a RandomAccessFile object and use the setLength() method.
Example:
File file = ... // A file on the filesystem you're trying to reserve space on.
long bytes = ... // Number of bytes you want to reserve.
RandomAccessFile rf = null;
try {
    rf = new RandomAccessFile(file, "rw"); // "rw" means open in read/write mode.
    rf.setLength(bytes); // Extends (or truncates) the file to the given size, "reserving" that space for your application.
    // Do whatever you want with the space here...
} catch (IOException ex) {
    // Handle this...
} finally {
    if (rf != null) {
        try {
            rf.close(); // Let's be nice and tidy here.
        } catch (IOException ioex) {
            // Handle this if you want...
        }
    }
}
Note: in "rw" mode the RandomAccessFile constructor will create the file if it does not already exist.
The RandomAccessFile object can then be used to read/write to the file. Make sure the target filesystem has enough free space. The space may not be "exclusive" per se, but you can always use file locks to achieve that.
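For example, a minimal sketch of taking an exclusive lock on the pre-sized file (Java 7 try-with-resources; the file variable is the one from the snippet above):
try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
     FileLock lock = raf.getChannel().lock()) {
    // while the lock is held, other processes cannot lock the file's contents
}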
P.S.: If you end up realizing hard drives are slow and useless (or meant to use RAM from the start), you can use the ByteBuffer object from java.nio. The allocate() and allocateDirect() methods should be more than enough. The byte buffer will be allocated in RAM (and possibly swap) and will be exclusive to this Java program. Random access is done by changing the position of the buffer. Since these buffers use signed integers to reference positions, the maximum size is limited to 2^31 - 1 bytes.
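A minimal sketch of that in-memory route (size and offset are illustrative):
ByteBuffer buf = ByteBuffer.allocateDirect(64 * 1024 * 1024); // reserve 64 MB off-heap
buf.position(1024);          // random access by moving the position
buf.putLong(42L);            // write at that offset
buf.position(1024);
long value = buf.getLong();  // read it back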
Read more on RandomAccessFile here.
Read more on FileLock (the java object) here.
Read more on ByteBuffer here.
On Linux systems you can use the fallocate() system call. It's extremely fast.
UPD: just run the Bash command:
fallocate -l 10G 10Gigfile
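If you want to drive that from Java rather than from a shell, a sketch (exception handling omitted; the file name and size are the same illustrative values):
Process p = new ProcessBuilder("fallocate", "-l", "10G", "10Gigfile")
        .inheritIO()
        .start();
int exit = p.waitFor(); // a non-zero exit code means the allocation failed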
You can pre-allocate space by writing a large file, but to be honest I wouldn't bother. Performance will be pretty good, probably better than you need.
If you really needed performance, you'd be writing C++/C# and doing RAW I/O.
But that's typically only done when writing an RDBMS engine, high-volume media capture or similar.
I'm parsing large PCAP files in Java using Kaitai-Struct. Whenever the file size exceeds Integer.MAX_VALUE bytes I face an IllegalArgumentException caused by the size limit of the underlying ByteBuffer.
I haven't found references to this issue elsewhere, which leads me to believe that this is not a library limitation but a mistake in the way I'm using it.
Since the problem is caused by trying to map the whole file into the ByteBuffer I'd think that the solution would be mapping only the first region of the file, and as the data is being consumed map again skipping the data already parsed.
As this is done within the Kaitai Struct Runtime library, it would mean writing my own class extending from KaitaiStream and overriding the auto-generated fromFile(...) method, and this doesn't really seem like the right approach.
The auto-generated method to parse from a file for the PCAP class is:
public static Pcap fromFile(String fileName) throws IOException {
return new Pcap(new ByteBufferKaitaiStream(fileName));
}
And the ByteBufferKaitaiStream provided by the Kaitai Struct Runtime library is backed by a ByteBuffer.
private final FileChannel fc;
private final ByteBuffer bb;
public ByteBufferKaitaiStream(String fileName) throws IOException {
fc = FileChannel.open(Paths.get(fileName), StandardOpenOption.READ);
bb = fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
}
Which in turn is limited by the ByteBuffer maximum size.
Am I missing some obvious workaround? Is it really a limitation of the implementation of Kaitai Struct in Java?
There are two separate issues here:
1. Running Pcap.fromFile() for large files is generally not very efficient, as it eventually parses the whole file into an in-memory structure at once. An example of how to avoid that is given in kaitai_struct/issues/255. The basic idea is to take control over how you read every packet, and then dispose of each packet after you've parsed / accounted for it somehow.
2. The 2 GB limit on Java's mmapped files. To mitigate that, you can use the alternative RandomAccessFile-based KaitaiStream implementation, RandomAccessFileKaitaiStream: it might be slower, but it should avoid that 2 GB problem.
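For the second point, a minimal sketch, assuming the generated Pcap class from the question and the RandomAccessFileKaitaiStream constructor that takes a file name:
// Same as the generated fromFile(), but backed by a RandomAccessFile instead of one big mmap.
Pcap pcap = new Pcap(new RandomAccessFileKaitaiStream(fileName));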
This library provides a ByteBuffer implementation which uses long offsets. I haven't tried this approach, but it looks promising. See the section "Mapping Files Bigger than 2 GB":
http://www.kdgregory.com/index.php?page=java.byteBuffer
public int getInt(long index)
{
return buffer(index).getInt();
}
private ByteBuffer buffer(long index)
{
ByteBuffer buf = _buffers[(int)(index / _segmentSize)];
buf.position((int)(index % _segmentSize));
return buf;
}
public MappedFileBuffer(File file, int segmentSize, boolean readWrite)
throws IOException
{
if (segmentSize > MAX_SEGMENT_SIZE)
throw new IllegalArgumentException(
"segment size too large (max " + MAX_SEGMENT_SIZE + "): " + segmentSize);
_segmentSize = segmentSize;
_fileSize = file.length();
RandomAccessFile mappedFile = null;
try
{
String mode = readWrite ? "rw" : "r";
MapMode mapMode = readWrite ? MapMode.READ_WRITE : MapMode.READ_ONLY;
mappedFile = new RandomAccessFile(file, mode);
FileChannel channel = mappedFile.getChannel();
_buffers = new MappedByteBuffer[(int)(_fileSize / segmentSize) + 1];
int bufIdx = 0;
for (long offset = 0 ; offset < _fileSize ; offset += segmentSize)
{
long remainingFileSize = _fileSize - offset;
long thisSegmentSize = Math.min(2L * segmentSize, remainingFileSize);
_buffers[bufIdx++] = channel.map(mapMode, offset, thisSegmentSize);
}
}
finally
{
// close quietly
if (mappedFile != null)
{
try
{
mappedFile.close();
}
catch (IOException ignored) { /* */ }
}
}
}
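Based on the constructor and accessor shown above, usage might look roughly like this (the file name and segment size are illustrative):
MappedFileBuffer buf = new MappedFileBuffer(new File("huge.pcap"), 1 << 28, false); // 256 MB segments, read-only
int magicNumber = buf.getInt(0L);        // offsets are longs,
int deepValue = buf.getInt(3000000000L); // so positions past 2 GB are fine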
I need to get the free/available disk space for all disks in the system, or all partitions; either works for me. (I don't have to use Sigar, but I am already using it in the project for some other processes, so I can use it for this as well.)
I am using the Sigar API and got this:
public double getFreeHdd() throws SigarException{
FileSystemUsage f= sigar.getFileSystemUsage("/");
return ( f.getAvail());
}
But this only gives me the system partition (root). How can I get a list of all partitions and loop over them to get their free space?
I tried this:
FileSystemView fsv = FileSystemView.getFileSystemView();
File[] roots = fsv.getRoots();
for (int i = 0; i < roots.length; i++) {
System.out.println("Root: " + roots[i]);
}
But it only returns the root directory:
Root: /
Thanks
Edit
It seems that I could use
FileSystem[] fslist = sigar.getFileSystemList();
But the results I am getting do not match the ones I get from the terminal. On the other hand, on the system I am working on I have 3 disks with a total of 12 partitions, so I might be losing something there. I will try it on some other system in case I can make something useful out of the results.
We use SIGAR extensively for cross-platform monitoring. This is the code we use to get the file system list:
/**
* @return a list of directory path names of file systems that are local or network - not removable media
*/
public static Set<String> getLocalOrNetworkFileSystemDirectoryNames() {
Set<String> ret = new HashSet<String>();
try {
FileSystem[] fileSystemList = getSigarProxy().getFileSystemList();
for (FileSystem fs : fileSystemList) {
if ((fs.getType() == FileSystem.TYPE_LOCAL_DISK) || (fs.getType() == FileSystem.TYPE_NETWORK)) {
ret.add(fs.getDirName());
}
}
}
catch (SigarException e) {
// log or rethrow as appropriate
}
return ret;
}
You can then use that as the input to other SIGAR methods:
FileSystemUsage usageStats = getSigarProxy().getFileSystemUsage(fileSystemDirectoryPath);
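Putting the two together, something like this should report free space per mounted file system (a sketch; getAvail() is the same accessor the question already uses, and I believe getTotal() is available on FileSystemUsage as well):
for (String dir : getLocalOrNetworkFileSystemDirectoryNames()) {
    try {
        FileSystemUsage usage = getSigarProxy().getFileSystemUsage(dir);
        System.out.println(dir + ": avail=" + usage.getAvail() + ", total=" + usage.getTotal());
    } catch (SigarException e) {
        // some mounts (e.g. empty removable drives) can fail to stat; skip them
    }
}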
The getSigarProxy() is just a convenience base method:
// The Humidor handles thread safety for a single instance of a Sigar object
static final private SigarProxy sigarProxy = Humidor.getInstance().getSigar();
static final protected SigarProxy getSigarProxy() {
return sigarProxy;
}
You can use java.nio.file.FileSystems to get a list of java.nio.file.FileStorages and then see the usable/available space. Per instance (assuming that you are using Java 7+):
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.util.function.Consumer;
public static void main(String[] args) {
FileSystem fs = FileSystems.getDefault();
fs.getFileStores().forEach(new Consumer<FileStore>() {
@Override
public void accept(FileStore store) {
try {
System.out.println(store.getTotalSpace());
System.out.println(store.getUsableSpace());
} catch (IOException e) {
e.printStackTrace();
}
}
});
}
Also, keep in mind that FileStore.getUsableSpace() returns the size in bytes. See the docs for more information.
I'm currently trying to write a custom stream proxy (let's call it that) that can change the content from a given input stream and produce modified output if necessary. This requirement is really necessary because sometimes I have to modify the streams in my application (e.g. compress the data truly on the fly). The following class is pretty simple and uses internal buffering.
private static class ProxyInputStream extends InputStream {
private final InputStream iStream;
private final byte[] iBuffer = new byte[512];
private int iBufferedBytes;
private final ByteArrayOutputStream oBufferStream;
private final OutputStream oStream;
private byte[] oBuffer = emptyPrimitiveByteArray;
private int oBufferIndex;
ProxyInputStream(InputStream iStream, IFunction<OutputStream, ByteArrayOutputStream> oStreamFactory) {
this.iStream = iStream;
oBufferStream = new ByteArrayOutputStream(512);
oStream = oStreamFactory.evaluate(oBufferStream);
}
@Override
public int read() throws IOException {
if ( oBufferIndex == oBuffer.length ) {
iBufferedBytes = iStream.read(iBuffer);
if ( iBufferedBytes == -1 ) {
return -1;
}
oBufferIndex = 0;
oStream.write(iBuffer, 0, iBufferedBytes);
oStream.flush();
oBuffer = oBufferStream.toByteArray();
oBufferStream.reset();
}
return oBuffer[oBufferIndex++];
}
}
Let's assume we also have a sample test output stream that simply adds a space character before every written byte ("abc" -> " a b c") like this:
private static class SpacingOutputStream extends OutputStream {
private final OutputStream outputStream;
SpacingOutputStream(OutputStream outputStream) {
this.outputStream = outputStream;
}
@Override
public void write(int b) throws IOException {
outputStream.write(' ');
outputStream.write(b);
}
}
And the following test method:
private static void test(final boolean useDeflater) throws IOException {
final FileInputStream input = new FileInputStream(SOURCE);
final IFunction<OutputStream, ByteArrayOutputStream> outputFactory = new IFunction<OutputStream, ByteArrayOutputStream>() {
@Override
public OutputStream evaluate(ByteArrayOutputStream outputStream) {
return useDeflater ? new DeflaterOutputStream(outputStream) : new SpacingOutputStream(outputStream);
}
};
final InputStream proxyInput = new ProxyInputStream(input, outputFactory);
final OutputStream output = new FileOutputStream(SOURCE + ".~" + useDeflater);
int c;
while ( (c = proxyInput.read()) != -1 ) {
output.write(c);
}
output.close();
proxyInput.close();
}
This test method simply reads the file content and writes it to another stream, possibly modifying it along the way. If the test method is run with useDeflater=false, the approach works as expected. But if it is invoked with useDeflater set on, it behaves really strangely and writes almost nothing (apart from the 78 9C header). I suspect the deflater class may not be designed for the approach I'd like to use, but I always believed that the ZIP format and deflate compression are designed to work on the fly.
Probably I'm wrong about some specifics of the deflate compression algorithm. What am I really missing? Perhaps there is another approach for writing a "stream proxy" that behaves exactly as I want. How can I compress data on the fly when limited to streams only?
Thanks in advance.
UPD: The following basic version works pretty nicely with the deflater and inflater:
public final class ProxyInputStream<OS extends OutputStream> extends InputStream {
private static final int INPUT_BUFFER_SIZE = 512;
private static final int OUTPUT_BUFFER_SIZE = 512;
private final InputStream iStream;
private final byte[] iBuffer = new byte[INPUT_BUFFER_SIZE];
private final ByteArrayOutputStream oBufferStream;
private final OS oStream;
private final IProxyInputStreamListener<OS> listener;
private byte[] oBuffer = emptyPrimitiveByteArray;
private int oBufferIndex;
private boolean endOfStream;
private ProxyInputStream(InputStream iStream, IFunction<OS, ByteArrayOutputStream> oStreamFactory, IProxyInputStreamListener<OS> listener) {
this.iStream = iStream;
oBufferStream = new ByteArrayOutputStream(OUTPUT_BUFFER_SIZE);
oStream = oStreamFactory.evaluate(oBufferStream);
this.listener = listener;
}
public static <OS extends OutputStream> ProxyInputStream<OS> proxyInputStream(InputStream iStream, IFunction<OS, ByteArrayOutputStream> oStreamFactory, IProxyInputStreamListener<OS> listener) {
return new ProxyInputStream<OS>(iStream, oStreamFactory, listener);
}
@Override
public int read() throws IOException {
if ( oBufferIndex == oBuffer.length ) {
if ( endOfStream ) {
return -1;
} else {
oBufferIndex = 0;
do {
final int iBufferedBytes = iStream.read(iBuffer);
if ( iBufferedBytes == -1 ) {
if ( listener != null ) {
listener.afterEndOfStream(oStream);
}
endOfStream = true;
break;
}
oStream.write(iBuffer, 0, iBufferedBytes);
oStream.flush();
} while ( oBufferStream.size() == 0 );
oBuffer = oBufferStream.toByteArray();
oBufferStream.reset();
}
}
return !endOfStream || oBuffer.length != 0 ? (int) oBuffer[oBufferIndex++] & 0xFF : -1;
}
}
I don't believe that DeflaterOutputStream.flush() does anything meaningful. The deflater will accumulate data until it has something to write out to the underlying stream. The only way to force the remaining bit of data out is to call DeflaterOutputStream.finish(). However, that would not work for your current implementation, as you can't call finish() until you are entirely done writing.
It's actually very difficult to write a compressed stream and read it within the same thread. In the RMIIO project I actually do this, but you need an arbitrarily sized intermediate output buffer (you basically need to push data in until something comes out compressed on the other end, then you can read it). You might be able to use some of the util classes in that project to accomplish what you want to do.
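One more option, if Java 7 is available: DeflaterOutputStream has constructors with a syncFlush flag, and with it flush() performs a sync flush so the data written so far actually reaches the underlying stream (at some cost in compression ratio). A minimal sketch, reusing the question's oBufferStream:
// Java 7+: with syncFlush = true, flush() pushes out whatever has been written so far.
OutputStream oStream = new DeflaterOutputStream(oBufferStream, true);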
Why not use GZIPOutputStream?
I'm a little lost, but you should simply use the original outputStream when you don't want to compress and new GZIPOutputStream(outputStream) when you DO want to compress. That's all. Anyway, check that you are flushing the output streams.
Gzip vs zip
Also: GZIP (compressing a stream, which is what you're doing) is one thing, and writing a valid zip file (file headers, file directory, entries (header, data)*) is another. Check ZipOutputStream.
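A minimal sketch of that suggestion, reusing the names from the question's test method (useDeflater, SOURCE) and swapping the deflater for GZIP:
OutputStream fileOut = new FileOutputStream(SOURCE + ".~" + useDeflater);
OutputStream output = useDeflater ? new GZIPOutputStream(fileOut) : fileOut;
// ... copy the bytes ...
output.close(); // closing the GZIPOutputStream writes the gzip trailer and finishes the stream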
Be careful: if somewhere you use the method
int read(byte[] b, int off, int len) and an exception is thrown in the line
final int iBufferedBytes = iStream.read(iBuffer);
you will get stuck in an infinite loop.
I am working on a software product with an integrated log file viewer. The problem is, it's slow and unstable for really large files because it reads the whole file into memory when you view a log file. I want to write a new log file viewer that addresses this problem.
What are the best practices for writing viewers for large text files? How do editors like Notepad++ and Vim accomplish this? I was thinking of using a buffered bi-directional text stream reader together with Java's TableModel. Am I thinking along the right lines, and are such stream implementations available for Java?
Edit: Will it be worthwhile to run through the file once to index the positions of the start of each line of text so that one knows where to seek to? I will probably need the number of lines, so I'll probably have to scan through the file at least once anyway.
Edit2: I've added my implementation to an answer below. Please comment on it or edit it to help me/us arrive at a more best-practice implementation or otherwise provide your own.
I'm not sure that Notepad++ actually implements random access, but I think that's the way to go, especially with a log file viewer, which implies that it will be read-only.
Since your log viewer will be read only, you can use a read only random access memory mapped file "stream". In Java, this is the FileChannel.
Then just jump around in the file as needed and render to the screen just a scrolling window of the data.
One of the advantages of the FileChannel is that concurrent threads can have the file open, and reading doesn't affect the current file pointer. So, if you're appending to the log file in another thread, it won't be affected.
Another advantage is that you can call the FileChannel's size method to get the file size at any moment.
The problem with mapping memory directly to a random access file, which some editors allow (such as HxD and UltraEdit), is that any changes directly affect the file. Changes are therefore immediate (except for write caching), which users typically don't want; they usually expect their changes to be applied only when they click Save. However, since this is just a viewer, you don't have that concern.
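A minimal sketch of the windowed approach described above (file name, window position and size are illustrative; exception handling omitted):
FileChannel channel = new RandomAccessFile("app.log", "r").getChannel();
long fileSize = channel.size();
long windowStart = Math.max(0, fileSize - 64 * 1024);  // e.g. show the tail of the log
long windowLength = fileSize - windowStart;
MappedByteBuffer window = channel.map(FileChannel.MapMode.READ_ONLY, windowStart, windowLength);
byte[] bytes = new byte[(int) windowLength];
window.get(bytes);
String visibleText = new String(bytes, Charset.forName("UTF-8")); // decode only the visible window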
A typical approach is to use a seekable file reader, make one pass through the log recording an index of line offsets and then present only a window onto a portion of the file as requested.
This both reduces the data you need in quick recall and avoids loading up a widget where 99% of its contents aren't currently visible.
I post my test implementation (after following the advice of Marcus Adams and msw) here for your convenience and also for further comments and criticism. It's quite fast.
I've not bothered with Unicode encoding safety. I guess this will be my next question. Any hints on that are very welcome.
class LogFileTableModel implements TableModel {
private final File f;
private final int lineCount;
private final String errMsg;
private final Long[] index;
private final ByteBuffer linebuf = ByteBuffer.allocate(1024);
private FileChannel chan;
public LogFileTableModel(String filename) {
f = new File(filename);
String m;
int l = 1;
Long[] idx = new Long[] {};
try {
FileInputStream in = new FileInputStream(f);
chan = in.getChannel();
m = null;
idx = buildLineIndex();
l = idx.length;
} catch (IOException e) {
m = e.getMessage();
}
errMsg = m;
lineCount = l;
index = idx;
}
private Long[] buildLineIndex() throws IOException {
List<Long> idx = new LinkedList<Long>();
idx.add(0L);
ByteBuffer buf = ByteBuffer.allocate(8 * 1024);
long offset = 0;
while (chan.read(buf) != -1) {
int len = buf.position();
buf.rewind();
int pos = 0;
byte[] bufA = buf.array();
while (pos < len) {
byte c = bufA[pos++];
if (c == '\n')
idx.add(offset + pos);
}
offset = chan.position();
}
System.out.println("Done Building index");
return idx.toArray(new Long[] {});
}
@Override
public int getColumnCount() {
return 2;
}
@Override
public int getRowCount() {
return lineCount;
}
@Override
public String getColumnName(int columnIndex) {
switch (columnIndex) {
case 0:
return "#";
case 1:
return "Name";
}
return "";
}
@Override
public Object getValueAt(int rowIndex, int columnIndex) {
switch (columnIndex) {
case 0:
return String.format("%3d", rowIndex);
case 1:
if (errMsg != null)
return errMsg;
try {
Long pos = index[rowIndex];
chan.position(pos);
chan.read(linebuf);
linebuf.rewind();
if (rowIndex == lineCount - 1)
return new String(linebuf.array());
else
return new String(linebuf.array(), 0, (int)(long)(index[rowIndex+1]-pos));
} catch (Exception e) {
return "Error: "+ e.getMessage();
}
}
return "a";
}
@Override
public Class<?> getColumnClass(int columnIndex) {
return String.class;
}
// ... other methods to make interface complete
}
I am trying to process files one at a time that are stored over a network. Reading the files is fast due to buffering, so that is not the issue. The problem I have is just listing the contents of a directory. I have at least 10k files per folder, over many folders.
Performance is super slow since File.list() returns an array instead of an iterable. Java goes off and collects all the names in a folder and packs them into an array before returning.
The bug entry for this is http://bugs.sun.com/view_bug.do?bug_id=4285834 and doesn't have a workaround. They just say this has been fixed for JDK7.
A few questions:
Does anybody have a workaround to this performance bottleneck?
Am I trying to achieve the impossible? Is performance still going to be poor even if it just iterates over the directories?
Could I use the beta JDK7 builds that have this functionality without having to build my entire project on it?
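For reference, the JDK7 feature the bug report points at is the java.nio.file API, where Files.newDirectoryStream returns a lazily populated iterable rather than one big array. A minimal sketch, assuming you can run on a JDK7 build (the path is illustrative, and process() is the method from the question):
DirectoryStream<Path> stream = Files.newDirectoryStream(Paths.get("X:\\remote\\dir"));
try {
    for (Path entry : stream) {
        process(entry.toFile()); // entries are fetched incrementally as you iterate
    }
} finally {
    stream.close();
}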
Although it's not pretty, I solved this kind of problem once by piping the output of dir/ls to a file before starting my app, and passing in the filename.
If you needed to do it within the app, you could just use Runtime.exec(), but it would create some nastiness.
You asked. The first form is going to be blazingly fast, the second should be pretty fast as well.
Be sure to use the one-item-per-line (bare, no decoration, no graphics), full-path and recurse options of your selected command.
EDIT:
30 minutes just to get a directory listing, wow.
It just struck me that if you use exec(), you can get its stdout redirected into a pipe instead of writing it to a file.
If you did that, you should start getting the files immediately and be able to begin processing before the command has completed.
The interaction may actually slow things down, but maybe not; you might give it a try.
Wow, I just went to find the syntax of the exec command for you and came across this, possibly exactly what you want (it lists a directory using exec and "ls" and pipes the result into your program for processing): good link in the Wayback Machine (Jörg provided it in a comment to replace the one from Sun that Oracle broke).
Anyway, the idea is straightforward but getting the code right is annoying. I'll go steal some code from the internets and hack it up; brb.
/**
* Note: Only use this as a last resort! It's specific to Windows and even
* at that it's not a good solution, but it should be fast.
*
* to use it, extend FileProcessor and call processFiles("...") with a list
* of options if you want them like /s... I highly recommend /b
*
* override processFile and it will be called once for each line of output.
*/
import java.io.*;
public abstract class FileProcessor
{
public void processFiles(String dirOptions)
{
Process theProcess = null;
BufferedReader inStream = null;
// run the "dir" command via the Windows command shell
try
{
theProcess = Runtime.getRuntime().exec("cmd /c dir " + dirOptions);
}
catch(IOException e)
{
System.err.println("Error on exec() method");
e.printStackTrace();
}
// read from the called program's standard output stream
try
{
inStream = new BufferedReader(
new InputStreamReader( theProcess.getInputStream() ));
String line;
while ((line = inStream.readLine()) != null)
{
processFile(line);
}
}
catch(IOException e)
{
System.err.println("Error on inStream.readLine()");
e.printStackTrace();
}
} // end method
/** Override this method--it will be called once for each file */
public abstract void processFile(String filename);
} // end class
And thank you code donor at IBM
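For what it's worth, usage could look something like this (the directory and options are illustrative):
new FileProcessor() {
    @Override
    public void processFile(String filename) {
        System.out.println("processing " + filename); // handle one line of "dir" output
    }
}.processFiles("/b /s X:\\remote\\dir");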
How about using the File.list(FilenameFilter filter) method and implementing FilenameFilter.accept(File dir, String name) to process each file and return false?
I ran this on a Linux VM for a directory with 10K+ files and it took less than 10 seconds.
import java.io.File;
import java.io.FilenameFilter;
public class Temp {
private static void processFile(File dir, String name) {
File file = new File(dir, name);
System.out.println("processing file " + file.getName());
}
private static void forEachFile(File dir) {
String [] ignore = dir.list(new FilenameFilter() {
public boolean accept(File dir, String name) {
processFile(dir, name);
return false;
}
});
}
public static void main(String[] args) {
long before, after;
File dot = new File(".");
before = System.currentTimeMillis();
forEachFile(dot);
after = System.currentTimeMillis();
System.out.println("after call, delta is " + (after - before));
}
}
An alternative is to have the files served over a different protocol. As I understand it, you're using SMB for that and Java is just trying to list them as regular files.
The problem here might not be Java alone (how does it behave when you open that directory, x:\shared, in Windows Explorer?). In my experience it also takes a considerable amount of time.
You can change the protocol to something like HTTP, used only to fetch the file names. This way you can retrieve the list of files over HTTP (10k lines shouldn't be too much) and let the server deal with the file listing. This would be very fast, since it runs with local resources (those on the server).
Then, when you have the list, you can process the files one by one exactly the way you're doing right now.
The key point is to have a helper mechanism on the other side of the connection.
Is this feasible?
Today:
File [] content = new File("X:\\remote\\dir").listFiles();
for ( File f : content ) {
process( f );
}
Proposed:
String [] content = fetchViaHttpTheListNameOf("x:\\remote\\dir");
for ( String fileName : content ) {
process( new File( fileName ) );
}
The HTTP server could be something very small and simple.
If this is what you have right now, you're fetching all the information for the 10k files to your client machine (I don't know how much of that info) when you only need the file names for later processing.
If the processing is very fast right now, it may be slowed down a bit, because the prefetched information is no longer available.
Give it a try.
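For completeness, a sketch of what the client side of fetchViaHttpTheListNameOf could look like, assuming a hypothetical server endpoint that returns one file name per line (the host name and query format are made up for the sketch):
static String[] fetchViaHttpTheListNameOf(String dir) throws IOException {
    URL url = new URL("http://fileserver.example/list?dir=" + URLEncoder.encode(dir, "UTF-8"));
    BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    try {
        List<String> names = new ArrayList<String>();
        String line;
        while ((line = in.readLine()) != null) {
            names.add(line); // one file name per line
        }
        return names.toArray(new String[names.size()]);
    } finally {
        in.close();
    }
}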
A non-portable solution would be to make native calls to the operating system and stream the results.
For Linux
You can look at something like readdir. You can walk the directory structure like a linked list and return results in batches or individually.
For Windows
On Windows the behavior would be fairly similar, using the FindFirstFile and FindNextFile APIs.
I doubt the problem is related to the bug report you referenced.
The issue there is "only" memory usage, but not necessarily speed.
If you have enough memory the bug is not relevant for your problem.
You should measure whether your problem is memory related or not. Turn on your Garbage Collector log and use for example gcviewer to analyze your memory usage.
I suspect that the SMB protocol is causing the problem.
You can try to write a test in another language and see if it's faster, or you can try to get the list of filenames through some other method, such as described here in another post.
If you need to eventually process all files, then having Iterable over String[] won't give you any advantage, as you'll still have to go and fetch the whole list of files.
If you're on Java 1.5 or 1.6, shelling out "dir" commands and parsing the standard output stream on Windows is a perfectly acceptable approach. I've used this approach in the past for processing network drives and it has generally been a lot faster than waiting for the native java.io.File listFiles() method to return.
Of course, a JNI call should be faster and potentially safer than shelling out "dir" commands. The following JNI code can be used to retrieve a list of files/directories using the Windows API. This function can easily be refactored into a new class so the caller can retrieve file paths incrementally (i.e. get one path at a time). For example, you can refactor the code so that FindFirstFileW is called in a constructor and have a separate method to call FindNextFileW.
JNIEXPORT jstring JNICALL Java_javaxt_io_File_GetFiles(JNIEnv *env, jclass, jstring directory)
{
HANDLE hFind;
try {
//Convert jstring to wstring
const jchar *_directory = env->GetStringChars(directory, 0);
jsize x = env->GetStringLength(directory);
wstring path; //L"C:\\temp\\*";
path.assign(_directory, _directory + x);
env->ReleaseStringChars(directory, _directory);
if (x<2){
jclass exceptionClass = env->FindClass("java/lang/Exception");
env->ThrowNew(exceptionClass, "Invalid path, less than 2 characters long.");
}
wstringstream ss;
BOOL bContinue = TRUE;
WIN32_FIND_DATAW data;
hFind = FindFirstFileW(path.c_str(), &data);
if (INVALID_HANDLE_VALUE == hFind){
jclass exceptionClass = env->FindClass("java/lang/Exception");
env->ThrowNew(exceptionClass, "FindFirstFileW returned invalid handle.");
}
//HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
//DWORD dwBytesWritten;
// If we have no error, loop thru the files in this dir
while (hFind && bContinue){
/*
//Debug Print Statment. DO NOT DELETE! cout and wcout do not print unicode correctly.
WriteConsole(hStdOut, data.cFileName, (DWORD)_tcslen(data.cFileName), &dwBytesWritten, NULL);
WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);
*/
//Check if this entry is a directory
if (data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY){
// Make sure this dir is not . or ..
if (wstring(data.cFileName) != L"." &&
wstring(data.cFileName) != L"..")
{
ss << wstring(data.cFileName) << L"\\" << L"\n";
}
}
else{
ss << wstring(data.cFileName) << L"\n";
}
bContinue = FindNextFileW(hFind, &data);
}
FindClose(hFind); // Free the dir structure
wstring cstr = ss.str();
int len = cstr.size();
//WriteConsole(hStdOut, cstr.c_str(), len, &dwBytesWritten, NULL);
//WriteConsole(hStdOut, L"\n", 1, &dwBytesWritten, NULL);
jchar* raw = new jchar[len];
memcpy(raw, cstr.c_str(), len*sizeof(wchar_t));
jstring result = env->NewString(raw, len);
delete[] raw;
return result;
}
catch(...){
FindClose(hFind);
jclass exceptionClass = env->FindClass("java/lang/Exception");
env->ThrowNew(exceptionClass, "Exception occured.");
}
return NULL;
}
Credit:
https://sites.google.com/site/jozsefbekes/Home/windows-programming/miscellaneous-functions
Even with this approach, there are still efficiencies to be gained. If you serialize the path to a java.io.File, there is a huge performance hit - especially if the path represents a file on a network drive. I have no idea what Sun/Oracle is doing under the hood, but if you need additional file attributes other than the file path (e.g. size, mod date, etc.), I have found that the following JNI function is much faster than instantiating a java.io.File object from a network path.
JNIEXPORT jlongArray JNICALL Java_javaxt_io_File_GetFileAttributesEx(JNIEnv *env, jclass, jstring filename)
{
//Convert jstring to wstring
const jchar *_filename = env->GetStringChars(filename, 0);
jsize len = env->GetStringLength(filename);
wstring path;
path.assign(_filename, _filename + len);
env->ReleaseStringChars(filename, _filename);
//Get attributes
WIN32_FILE_ATTRIBUTE_DATA fileAttrs;
BOOL result = GetFileAttributesExW(path.c_str(), GetFileExInfoStandard, &fileAttrs);
if (!result) {
jclass exceptionClass = env->FindClass("java/lang/Exception");
env->ThrowNew(exceptionClass, "Exception Occurred");
}
//Create an array to store the WIN32_FILE_ATTRIBUTE_DATA
jlong buffer[6];
buffer[0] = fileAttrs.dwFileAttributes;
buffer[1] = date2int(fileAttrs.ftCreationTime);
buffer[2] = date2int(fileAttrs.ftLastAccessTime);
buffer[3] = date2int(fileAttrs.ftLastWriteTime);
buffer[4] = fileAttrs.nFileSizeHigh;
buffer[5] = fileAttrs.nFileSizeLow;
jlongArray jLongArray = env->NewLongArray(6);
env->SetLongArrayRegion(jLongArray, 0, 6, buffer);
return jLongArray;
}
You can find a full working example of this JNI-based approach in the javaxt-core library. In my tests using Java 1.6.0_38 with a Windows host hitting a Windows share, I have found this JNI approach approximately 10x faster than calling java.io.File listFiles() or shelling out "dir" commands.
I wonder why there are 10k files in a directory. Some file systems do not work well with so many files. There are specific limitations for file systems, like the maximum number of files per directory and the maximum number of levels of subdirectories.
I solved a similar problem with an iterator solution.
I needed to walk across huge directories and several levels of the directory tree recursively.
I tried FileUtils.iterateFiles() from Apache Commons IO, but it implements the iterator by adding all the files to a List and then returning List.iterator(). That's very bad for memory.
So I preferred to write something like this:
private static class SequentialIterator implements Iterator<File> {
private DirectoryStack dir = null;
private File current = null;
private long limit;
private FileFilter filter = null;
public SequentialIterator(String path, long limit, FileFilter ff) {
current = new File(path);
this.limit = limit;
filter = ff;
dir = DirectoryStack.getNewStack(current);
}
public boolean hasNext() {
while(walkOver());
return isMore && (limit > count || limit < 0) && dir.getCurrent() != null;
}
private long count = 0;
public File next() {
File aux = dir.getCurrent();
dir.advancePostition();
count++;
return aux;
}
private boolean walkOver() {
if (dir.isOutOfDirListRange()) {
if (dir.isCantGoParent()) {
isMore = false;
return false;
} else {
dir.goToParent();
dir.advancePostition();
return true;
}
} else {
if (dir.isCurrentDirectory()) {
if (dir.isDirectoryEmpty()) {
dir.advancePostition();
} else {
dir.goIntoDir();
}
return true;
} else {
if (filter.accept(dir.getCurrent())) {
return false;
} else {
dir.advancePostition();
return true;
}
}
}
}
private boolean isMore = true;
public void remove() {
throw new UnsupportedOperationException();
}
}
Note that the iterator stops after a given number of files have been iterated, and it also takes a FileFilter.
And DirectoryStack is:
public class DirectoryStack {
private class Element{
private File files[] = null;
private int currentPointer;
public Element(File current) {
currentPointer = 0;
if (current.exists()) {
if(current.isDirectory()){
files = current.listFiles();
Set<File> set = new TreeSet<File>();
for (int i = 0; i < files.length; i++) {
File file = files[i];
set.add(file);
}
set.toArray(files);
}else{
throw new IllegalArgumentException("File current must be directory");
}
} else {
throw new IllegalArgumentException("File current not exist");
}
}
public String toString(){
return "current="+getCurrent().toString();
}
public int getCurrentPointer() {
return currentPointer;
}
public void setCurrentPointer(int currentPointer) {
this.currentPointer = currentPointer;
}
public File[] getFiles() {
return files;
}
public File getCurrent(){
File ret = null;
try{
ret = getFiles()[getCurrentPointer()];
}catch (Exception e){
}
return ret;
}
public boolean isDirectoryEmpty(){
return !(getFiles().length>0);
}
public Element advancePointer(){
setCurrentPointer(getCurrentPointer()+1);
return this;
}
}
private DirectoryStack(File first){
getStack().push(new Element(first));
}
public static DirectoryStack getNewStack(File first){
return new DirectoryStack(first);
}
public String toString(){
String ret = "stack:\n";
int i = 0;
for (Element elem : stack) {
ret += "nivel " + i++ + elem.toString()+"\n";
}
return ret;
}
private Stack<Element> stack=null;
private Stack<Element> getStack(){
if(stack==null){
stack = new Stack<Element>();
}
return stack;
}
public File getCurrent(){
return getStack().peek().getCurrent();
}
public boolean isDirectoryEmpty(){
return getStack().peek().isDirectoryEmpty();
}
public DirectoryStack downLevel(){
getStack().pop();
return this;
}
public DirectoryStack goToParent(){
return downLevel();
}
public DirectoryStack goIntoDir(){
return upLevel();
}
public DirectoryStack upLevel(){
if(isCurrentNotNull())
getStack().push(new Element(getCurrent()));
return this;
}
public DirectoryStack advancePostition(){
getStack().peek().advancePointer();
return this;
}
public File[] peekDirectory(){
return getStack().peek().getFiles();
}
public boolean isLastFileOfDirectory(){
return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
}
public boolean gotMoreLevels() {
return getStack().size()>0;
}
public boolean gotMoreInCurrentLevel() {
return getStack().peek().getFiles().length > getStack().peek().getCurrentPointer()+1;
}
public boolean isRoot() {
return !(getStack().size()>1);
}
public boolean isCurrentNotNull() {
if(!getStack().isEmpty()){
int currentPointer = getStack().peek().getCurrentPointer();
int maxFiles = getStack().peek().getFiles().length;
return currentPointer < maxFiles;
}else{
return false;
}
}
public boolean isCurrentDirectory() {
return getStack().peek().getCurrent().isDirectory();
}
public boolean isLastFromDirList() {
return getStack().peek().getCurrentPointer() == (getStack().peek().getFiles().length-1);
}
public boolean isCantGoParent() {
return !(getStack().size()>1);
}
public boolean isOutOfDirListRange() {
return getStack().peek().getFiles().length <= getStack().peek().getCurrentPointer();
}
}
Using an Iterable doesn't imply that the files will be streamed to you. In fact it's usually the opposite, so an array is typically faster than an Iterable.
Are you sure it's due to Java, not just a general problem with having 10k entries in one directory, particularly over the network?
Have you tried writing a proof-of-concept program to do the same thing in C using the win32 findfirst/findnext functions to see whether it's any faster?
I don't know the ins and outs of SMB, but I strongly suspect that it needs a round trip for every file in the list - which is not going to be fast, particularly over a network with moderate latency.
Having 10k strings in an array sounds like something which should not tax the modern Java VM too much either.