I've done some testing with Streams, in particular with directory streams from the NIO package. I simply try to get a list of all files in a directory, sorted by last-modified date and size.
The JavaDoc of the old File.listFiles() method contains a note pointing to the Files class:
Note that the Files class defines the newDirectoryStream method to
open a directory and iterate over the names of the files in the
directory. This may use less resources when working with very large
directories.
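For reference, a minimal use of newDirectoryStream as the JavaDoc suggests might look like the following sketch (the class and method names here are illustrative, not from the question):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DirListing {
    // Collects the file names in a directory using newDirectoryStream,
    // which iterates lazily instead of materializing a full array up front.
    public static List<String> listNames(Path dir) throws IOException {
        List<String> names = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
            for (Path p : ds) {
                names.add(p.getFileName().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // "." is just a placeholder directory for demonstration
        System.out.println(listNames(Paths.get(".")));
    }
}
```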
I ran the code below many times (the first three runs are shown):
First-run:
Run time of Arrays.sort: 1516
Run time of Stream.sorted as Array: 2912
Run time of Stream.sorted as List: 2875
Second-run:
Run time of Arrays.sort: 1557
Run time of Stream.sorted as Array: 2978
Run time of Stream.sorted as List: 2937
Third-run:
Run time of Arrays.sort: 1563
Run time of Stream.sorted as Array: 2919
Run time of Stream.sorted as List: 2896
My question is: why do the streams perform so badly?
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class FileSorter {
// This sorts from old to young and from big to small
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
int sorter = 0;
try {
FileTime lm1 = Files.getLastModifiedTime(o1);
FileTime lm2 = Files.getLastModifiedTime(o2);
if (lm2.compareTo(lm1) == 0) {
Long s1 = Files.size(o1);
Long s2 = Files.size(o2);
sorter = s2.compareTo(s1);
} else {
sorter = lm1.compareTo(lm2);
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
return sorter;
};
public String[] getSortedFileListAsArray(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator)
.map(Path::getFileName)
.map(Path::toString)
.toArray(String[]::new);
}
public List<String> getSortedFileListAsList(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator)
.map(Path::getFileName)
.map(Path::toString)
.collect(Collectors.toList());
}
public String[] sortByDateAndSize(File[] fileList) {
Arrays.sort(fileList, (File o1, File o2) -> {
int r = Long.compare(o1.lastModified(), o2.lastModified());
if (r != 0) {
return r;
}
return Long.compare(o1.length(), o2.length());
});
String[] fileNames = new String[fileList.length];
for (int i = 0; i < fileNames.length; i++) {
fileNames[i] = fileList[i].getName();
}
return fileNames;
}
public static void main(String[] args) throws IOException {
// File (io package)
File f = new File("C:\\Windows\\system32");
// Path (nio package)
Path dir = Paths.get("C:\\Windows\\system32");
FileSorter fs = new FileSorter();
long before = System.currentTimeMillis();
String[] names = fs.sortByDateAndSize(f.listFiles());
long after = System.currentTimeMillis();
System.out.println("Run time of Arrays.sort: " + ((after - before)));
long before2 = System.currentTimeMillis();
String[] names2 = fs.getSortedFileListAsArray(dir);
long after2 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as Array: " + (after2 - before2));
long before3 = System.currentTimeMillis();
List<String> names3 = fs.getSortedFileListAsList(dir);
long after3 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as List: " + (after3 - before3));
}
}
Update
After applying Peter's code, I got these results:
Run time of Arrays.sort: 1615
Run time of Stream.sorted as Array: 3116
Run time of Stream.sorted as List: 3059
Run time of Stream.sorted as List with caching: 378
Update 2
After doing some research on Peter's solution, I can say that reading file attributes with, for example, Files.getLastModifiedTime must be expensive. Changing only that part of the Comparator to:
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
File f1 = o1.toFile();
File f2 = o2.toFile();
long lm1 = f1.lastModified();
long lm2 = f2.lastModified();
int cmp = Long.compare(lm2, lm1);
if (cmp == 0) {
cmp = Long.compare(f2.length(), f1.length());
}
return cmp;
};
gives an even better result on my computer:
Run time of Arrays.sort: 1968
Run time of Stream.sorted as Array: 1999
Run time of Stream.sorted as List: 1975
Run time of Stream.sorted as List with caching: 488
But as you can see, caching the values is by far the best approach. And, as jtahlborn mentioned, it is a kind of stable sort.
Update 3 (best solution I've found)
After a bit more research, I saw that Files.getLastModifiedTime and Files.size both do a lot of work on the same thing: attributes. So I made three versions of the PathInfo class to test:
Peter's version, as described below
An old-style File version, where I call Path.toFile() once in the constructor and get all values from that File via f.lastModified() and f.length()
A version of Peter's solution where I read a single attributes object with Files.readAttributes(path, BasicFileAttributes.class) and work on that
Putting it all in a loop and running each variant 100 times, I came up with these results:
Mean performance of Peters solution: 432.26
Mean performance of old File solution: 343.11
Mean performance of read attribute object once solution: 255.66
Code in constructor of PathInfo for the best solution:
public PathInfo(Path path) {
try {
// read the whole attributes once
BasicFileAttributes bfa = Files.readAttributes(path, BasicFileAttributes.class);
fileName = path.getFileName().toString();
modified = bfa.lastModifiedTime().toMillis();
size = bfa.size();
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
My conclusion: never read attributes twice; caching them in an object gives a huge performance boost.
Files.list() is an O(N) operation, whereas sorting is O(N log N), so it is far more likely that the operations inside the sort are what matter. Given that the comparisons don't do the same thing, this is the most likely explanation. There are a lot of files with the same modification date under C:/Windows/System32, which means the size would be checked quite often.
To show that most of the time is not spent in the Files.list(dir) stream, I have optimised the comparison so that the data about a file is obtained only once per file.
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class FileSorter {
// This sorts from old to young and from big to small
Comparator<Path> timeSizeComparator = (Path o1, Path o2) -> {
int sorter = 0;
try {
FileTime lm1 = Files.getLastModifiedTime(o1);
FileTime lm2 = Files.getLastModifiedTime(o2);
if (lm2.compareTo(lm1) == 0) {
Long s1 = Files.size(o1);
Long s2 = Files.size(o2);
sorter = s2.compareTo(s1);
} else {
sorter = lm1.compareTo(lm2);
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
return sorter;
};
public String[] getSortedFileListAsArray(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator)
.map(Path::getFileName)
.map(Path::toString)
.toArray(String[]::new);
}
public List<String> getSortedFileListAsList(Path dir) throws IOException {
Stream<Path> stream = Files.list(dir);
return stream.sorted(timeSizeComparator)
.map(Path::getFileName)
.map(Path::toString)
.collect(Collectors.toList());
}
public String[] sortByDateAndSize(File[] fileList) {
Arrays.sort(fileList, (File o1, File o2) -> {
int r = Long.compare(o1.lastModified(), o2.lastModified());
if (r != 0) {
return r;
}
return Long.compare(o1.length(), o2.length());
});
String[] fileNames = new String[fileList.length];
for (int i = 0; i < fileNames.length; i++) {
fileNames[i] = fileList[i].getName();
}
return fileNames;
}
public List<String> getSortedFile(Path dir) throws IOException {
return Files.list(dir).map(PathInfo::new).sorted().map(p -> p.getFileName()).collect(Collectors.toList());
}
static class PathInfo implements Comparable<PathInfo> {
private final String fileName;
private final long modified;
private final long size;
public PathInfo(Path path) {
try {
fileName = path.getFileName().toString();
modified = Files.getLastModifiedTime(path).toMillis();
size = Files.size(path);
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
@Override
public int compareTo(PathInfo o) {
int cmp = Long.compare(modified, o.modified);
if (cmp == 0)
cmp = Long.compare(size, o.size);
return cmp;
}
public String getFileName() {
return fileName;
}
}
public static void main(String[] args) throws IOException {
// File (io package)
File f = new File("C:\\Windows\\system32");
// Path (nio package)
Path dir = Paths.get("C:\\Windows\\system32");
FileSorter fs = new FileSorter();
long before = System.currentTimeMillis();
String[] names = fs.sortByDateAndSize(f.listFiles());
long after = System.currentTimeMillis();
System.out.println("Run time of Arrays.sort: " + ((after - before)));
long before2 = System.currentTimeMillis();
String[] names2 = fs.getSortedFileListAsArray(dir);
long after2 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as Array: " + ((after2 - before2)));
long before3 = System.currentTimeMillis();
List<String> names3 = fs.getSortedFileListAsList(dir);
long after3 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as List: " + ((after3 - before3)));
long before4 = System.currentTimeMillis();
List<String> names4 = fs.getSortedFile(dir);
long after4 = System.currentTimeMillis();
System.out.println("Run time of Stream.sorted as List with caching: " + ((after4 - before4)));
}
}
This prints on my laptop:
Run time of Arrays.sort: 1980
Run time of Stream.sorted as Array: 1295
Run time of Stream.sorted as List: 1228
Run time of Stream.sorted as List with caching: 185
As you can see, about 85% of the time is spent obtaining the modification date and size of the files repeatedly.
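The same single-read idea can also be sketched without a dedicated value class, using a comparator backed by an attributes cache (the class and method names here are illustrative, not from the answer; `Files.readAttributes` fetches modification time and size in one call):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CachedSort {
    // Sorts directory entries by (lastModified, size), reading the
    // attributes of each path at most once via a HashMap cache.
    public static List<String> sortedNames(Path dir) throws IOException {
        Map<Path, BasicFileAttributes> cache = new HashMap<>();
        try (Stream<Path> stream = Files.list(dir)) {
            return stream
                .sorted(Comparator
                    .comparingLong((Path p) -> attrs(p, cache).lastModifiedTime().toMillis())
                    .thenComparingLong(p -> attrs(p, cache).size()))
                .map(p -> p.getFileName().toString())
                .collect(Collectors.toList());
        }
    }

    // computeIfAbsent guarantees a single readAttributes call per path
    private static BasicFileAttributes attrs(Path p, Map<Path, BasicFileAttributes> cache) {
        return cache.computeIfAbsent(p, path -> {
            try {
                return Files.readAttributes(path, BasicFileAttributes.class);
            } catch (IOException ex) {
                throw new UncheckedIOException(ex);
            }
        });
    }
}
```

Note that the HashMap is safe here only because the stream is sequential; a parallel stream would need a ConcurrentHashMap.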
Related
I want to use a Stream to parallelize processing of a heterogeneous set of remotely stored JSON files whose number is not known upfront. The files can vary widely in size, from 1 JSON record per file up to 100,000 records in others. A JSON record in this case means a self-contained JSON object represented as one line in the file.
I really want to use Streams for this and so I implemented this Spliterator:
public abstract class JsonStreamSpliterator<METADATA, RECORD> extends AbstractSpliterator<RECORD> {
abstract protected JsonStreamSupport<METADATA> openInputStream(String path);
abstract protected RECORD parse(METADATA metadata, Map<String, Object> json);
private static final int ADDITIONAL_CHARACTERISTICS = Spliterator.IMMUTABLE | Spliterator.DISTINCT | Spliterator.NONNULL;
private static final int MAX_BUFFER = 100;
private final Iterator<String> paths;
private JsonStreamSupport<METADATA> reader = null;
public JsonStreamSpliterator(Iterator<String> paths) {
this(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths);
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths) {
super(est, additionalCharacteristics);
this.paths = paths;
}
private JsonStreamSpliterator(long est, int additionalCharacteristics, Iterator<String> paths, String nextPath) {
this(est, additionalCharacteristics, paths);
open(nextPath);
}
@Override
public boolean tryAdvance(Consumer<? super RECORD> action) {
if(reader == null) {
String path = takeNextPath();
if(path != null) {
open(path);
}
else {
return false;
}
}
Map<String, Object> json = reader.readJsonLine();
if(json != null) {
RECORD item = parse(reader.getMetadata(), json);
action.accept(item);
return true;
}
else {
reader.close();
reader = null;
return tryAdvance(action);
}
}
private void open(String path) {
reader = openInputStream(path);
}
private String takeNextPath() {
synchronized(paths) {
if(paths.hasNext()) {
return paths.next();
}
}
return null;
}
@Override
public Spliterator<RECORD> trySplit() {
String nextPath = takeNextPath();
if(nextPath != null) {
return new JsonStreamSpliterator<METADATA,RECORD>(Long.MAX_VALUE, ADDITIONAL_CHARACTERISTICS, paths, nextPath) {
@Override
protected JsonStreamSupport<METADATA> openInputStream(String path) {
return JsonStreamSpliterator.this.openInputStream(path);
}
@Override
protected RECORD parse(METADATA metaData, Map<String,Object> json) {
return JsonStreamSpliterator.this.parse(metaData, json);
}
};
}
else {
List<RECORD> records = new ArrayList<RECORD>();
while(tryAdvance(records::add) && records.size() < MAX_BUFFER) {
// loop
}
if(records.size() != 0) {
return records.spliterator();
}
else {
return null;
}
}
}
}
The problem I'm having is that while the Stream parallelizes beautifully at first, eventually the largest file is left processing in a single thread. I believe the proximal cause is well documented: the spliterator is "unbalanced".
More concretely, it appears that the trySplit method is not called after a certain point in the Stream.forEach lifecycle, so the extra logic to distribute small batches at the end of trySplit is rarely executed.
Notice how all the spliterators returned from trySplit share the same paths iterator. I thought this was a really clever way to balance the work across all spliterators, but it hasn't been enough to achieve full parallelism.
I would like the parallel processing to proceed first across files, and then, when only a few large files are left being split, to parallelize across chunks of those remaining files. That was the intent of the else block at the end of trySplit.
Is there an easy / simple / canonical way around this problem?
Your trySplit should output splits of equal size, regardless of the size of the underlying files. You should treat all the files as a single unit and fill up the ArrayList-backed spliterator with the same number of JSON objects each time. The number of objects should be such that processing one split takes between 1 and 10 milliseconds: lower than 1 ms and you start approaching the costs of handing off the batch to a worker thread, higher than that and you start risking uneven CPU load due to tasks which are too coarse-grained.
The spliterator is not obliged to report a size estimate, and you are already doing this correctly: your estimate is Long.MAX_VALUE, which is a special value meaning "unbounded". However, if you have many files with a single JSON object, resulting in batches of size 1, this will hurt your performance in two ways: the overhead of opening-reading-closing the file may become a bottleneck and, if you manage to escape that, the cost of thread handoff may be significant compared to the cost of processing one item, again causing a bottleneck.
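A minimal sketch of such fixed-size batching over a shared record source might look like this (the class name, batch size, and synchronization scheme are illustrative assumptions, not the answer's exact code):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// Hands out equal-size batches regardless of which underlying file each
// record came from. BATCH_SIZE is a placeholder; tune it so that
// processing one batch takes roughly 1-10 ms.
public class FixedBatchSpliterator<T> extends Spliterators.AbstractSpliterator<T> {
    static final int BATCH_SIZE = 128;
    private final Iterator<T> source; // shared across all splits

    public FixedBatchSpliterator(Iterator<T> source) {
        // Long.MAX_VALUE = "unbounded" size estimate
        super(Long.MAX_VALUE, Spliterator.IMMUTABLE | Spliterator.NONNULL);
        this.source = source;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        T next;
        synchronized (source) {
            next = source.hasNext() ? source.next() : null;
        }
        if (next == null) return false;
        action.accept(next);
        return true;
    }

    @Override
    public Spliterator<T> trySplit() {
        // Always split off a fixed-size, ArrayList-backed batch.
        List<T> batch = new ArrayList<>(BATCH_SIZE);
        synchronized (source) {
            for (int i = 0; i < BATCH_SIZE && source.hasNext(); i++) {
                batch.add(source.next());
            }
        }
        return batch.isEmpty() ? null : batch.spliterator();
    }
}
```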
Five years ago I was solving a similar problem; you can have a look at my solution.
After much experimentation, I was still not able to get any added parallelism by playing with the size estimates. Basically, any value other than Long.MAX_VALUE will tend to cause the spliterator to terminate too early (and without any splitting), while on the other hand a Long.MAX_VALUE estimate will cause trySplit to be called relentlessly until it returns null.
The solution I found is to internally share resources among the spliterators and let them rebalance amongst themselves.
Working code:
public class AwsS3LineSpliterator<LINE> extends AbstractSpliterator<AwsS3LineInput<LINE>> {
public final static class AwsS3LineInput<LINE> {
final public S3ObjectSummary s3ObjectSummary;
final public LINE lineItem;
public AwsS3LineInput(S3ObjectSummary s3ObjectSummary, LINE lineItem) {
this.s3ObjectSummary = s3ObjectSummary;
this.lineItem = lineItem;
}
}
private final class InputStreamHandler {
final S3ObjectSummary file;
final InputStream inputStream;
InputStreamHandler(S3ObjectSummary file, InputStream is) {
this.file = file;
this.inputStream = is;
}
}
private final Iterator<S3ObjectSummary> incomingFiles;
private final Function<S3ObjectSummary, InputStream> fileOpener;
private final Function<InputStream, LINE> lineReader;
private final Deque<S3ObjectSummary> unopenedFiles;
private final Deque<InputStreamHandler> openedFiles;
private final Deque<AwsS3LineInput<LINE>> sharedBuffer;
private final int maxBuffer;
private AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener,
Function<InputStream, LINE> lineReader,
Deque<S3ObjectSummary> unopenedFiles, Deque<InputStreamHandler> openedFiles, Deque<AwsS3LineInput<LINE>> sharedBuffer,
int maxBuffer) {
super(Long.MAX_VALUE, 0);
this.incomingFiles = incomingFiles;
this.fileOpener = fileOpener;
this.lineReader = lineReader;
this.unopenedFiles = unopenedFiles;
this.openedFiles = openedFiles;
this.sharedBuffer = sharedBuffer;
this.maxBuffer = maxBuffer;
}
public AwsS3LineSpliterator(Iterator<S3ObjectSummary> incomingFiles, Function<S3ObjectSummary, InputStream> fileOpener, Function<InputStream, LINE> lineReader, int maxBuffer) {
this(incomingFiles, fileOpener, lineReader, new ConcurrentLinkedDeque<>(), new ConcurrentLinkedDeque<>(), new ArrayDeque<>(maxBuffer), maxBuffer);
}
@Override
public boolean tryAdvance(Consumer<? super AwsS3LineInput<LINE>> action) {
AwsS3LineInput<LINE> lineInput;
synchronized(sharedBuffer) {
lineInput=sharedBuffer.poll();
}
if(lineInput != null) {
action.accept(lineInput);
return true;
}
InputStreamHandler handle = openedFiles.poll();
if(handle == null) {
S3ObjectSummary unopenedFile = unopenedFiles.poll();
if(unopenedFile == null) {
return false;
}
handle = new InputStreamHandler(unopenedFile, fileOpener.apply(unopenedFile));
}
for(int i=0; i < maxBuffer; ++i) {
LINE line = lineReader.apply(handle.inputStream);
if(line != null) {
synchronized(sharedBuffer) {
sharedBuffer.add(new AwsS3LineInput<LINE>(handle.file, line));
}
}
else {
return tryAdvance(action);
}
}
openedFiles.addFirst(handle);
return tryAdvance(action);
}
@Override
public Spliterator<AwsS3LineInput<LINE>> trySplit() {
synchronized(incomingFiles) {
if (incomingFiles.hasNext()) {
unopenedFiles.add(incomingFiles.next());
return new AwsS3LineSpliterator<LINE>(incomingFiles, fileOpener, lineReader, unopenedFiles, openedFiles, sharedBuffer, maxBuffer);
} else {
return null;
}
}
}
}
This is not a direct answer to your question, but I think it is worth trying Stream from the abacus-common library:
void test_58601518() throws Exception {
final File tempDir = new File("./temp/");
// Prepare the test files:
// if (!(tempDir.exists() && tempDir.isDirectory())) {
// tempDir.mkdirs();
// }
//
// final Random rand = new Random();
// final int fileCount = 1000;
//
// for (int i = 0; i < fileCount; i++) {
// List<String> lines = Stream.repeat(TestUtil.fill(Account.class), rand.nextInt(1000) * 100 + 1).map(it -> N.toJSON(it)).toList();
// IOUtil.writeLines(new File("./temp/_" + i + ".json"), lines);
// }
N.println("Xmx: " + IOUtil.MAX_MEMORY_IN_MB + " MB");
N.println("total file size: " + Stream.listFiles(tempDir).mapToLong(IOUtil::sizeOf).sum() / IOUtil.ONE_MB + " MB");
final AtomicLong counter = new AtomicLong();
final Consumer<Account> yourAction = it -> {
counter.incrementAndGet();
it.toString().replace("a", "bbb");
};
long startTime = System.currentTimeMillis();
Stream.listFiles(tempDir) // the file/data source could be local file system or remote file system.
.parallel(2) // thread number used to load the file/data and convert the lines to Java objects.
.flatMap(f -> Stream.lines(f).map(line -> N.fromJSON(Account.class, line))) // only certain lines (less 1024) will be loaded to memory.
.parallel(8) // thread number used to execute your action.
.forEach(yourAction);
N.println("Took: " + ((System.currentTimeMillis()) - startTime) + " ms" + " to process " + counter + " lines/objects");
// IOUtil.deleteAllIfExists(tempDir);
}
Throughout the run, the CPU usage on my laptop was pretty high (about 70%), and it took about 70 seconds to process 51,899,100 lines/objects from 1000 files on an Intel(R) Core(TM) i5-8365U CPU with -Xmx256m JVM memory. Total file size was about 4524 MB. If yourAction is not a heavy operation, a sequential stream could be even faster than a parallel stream.
FYI: I'm the developer of abacus-common.
I have a regex pattern of words like welcome1|welcome2|changeme... that I need to search for in thousands of files (anywhere between 100 and 8000), ranging from 1 KB to 24 MB each in size.
I would like to know if there's a faster way of pattern matching than what I have been trying.
Environment:
jdk 1.8
Windows 10
Unix4j Library
Here's what I have tried so far:
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(FilePredicates.isFileAndNotDirectory())) {
List<String> obviousStringsList = Strings_PASSWORDS.stream()
.map(s -> ".*" + s + ".*").collect(Collectors.toList()); //because Unix4j apparently needs this
Pattern pattern = Pattern.compile(String.join("|", obviousStringsList));
GrepOptions options = new GrepOptions.Default(GrepOption.count,
GrepOption.ignoreCase,
GrepOption.lineNumber,
GrepOption.matchingFiles);
Instant startTime = Instant.now();
final List<Path> filesWithObviousStringss = stream
.filter(path -> !Unix4j.grep(options, pattern, path.toFile()).toStringResult().isEmpty())
.collect(Collectors.toList());
System.out.println("Time taken = " + Duration.between(startTime, Instant.now()).getSeconds() + " seconds");
}
I get Time taken = 60 seconds, which makes me think I'm doing something really wrong.
I've tried different ways with the stream and on an average every method takes about a minute to process my current folder of 6660 files.
Grep on msys2/mingw64 takes about 15 seconds, and exec('grep...') in Node.js takes about 12 seconds, consistently.
I chose Unix4j because it provides java native grep and clean code.
Is there a way to produce better results in Java, that I'm sadly missing?
The main reason why native tools can process such text files much faster is their assumption of one particular charset, especially one with an ASCII-based 8-bit encoding, whereas Java performs a byte-to-character conversion whose abstraction is capable of supporting arbitrary charsets.
When we similarly assume a single charset with the properties named above, we can use low-level tools that may increase the performance dramatically.
For such an operation, we define the following helper methods:
private static char[] getTable(Charset cs) {
if(cs.newEncoder().maxBytesPerChar() != 1f)
throw new UnsupportedOperationException("Not an 8 bit charset");
byte[] raw = new byte[256];
IntStream.range(0, 256).forEach(i -> raw[i] = (byte)i);
char[] table = new char[256];
cs.newDecoder().onUnmappableCharacter(CodingErrorAction.REPLACE)
.decode(ByteBuffer.wrap(raw), CharBuffer.wrap(table), true);
for(int i = 0; i < 128; i++)
if(table[i] != i) throw new UnsupportedOperationException("Not ASCII based");
return table;
}
and
private static CharSequence mapAsciiBasedText(Path p, char[] table) throws IOException {
try(FileChannel fch = FileChannel.open(p, StandardOpenOption.READ)) {
long actualSize = fch.size();
int size = (int)actualSize;
if(size != actualSize) throw new UnsupportedOperationException("file too large");
MappedByteBuffer mbb = fch.map(FileChannel.MapMode.READ_ONLY, 0, actualSize);
final class MappedCharSequence implements CharSequence {
final int start, size;
MappedCharSequence(int start, int size) {
this.start = start;
this.size = size;
}
public int length() {
return size;
}
public char charAt(int index) {
if(index < 0 || index >= size) throw new IndexOutOfBoundsException();
byte b = mbb.get(start + index);
return b<0? table[b+256]: (char)b;
}
public CharSequence subSequence(int start, int end) {
int newSize = end - start;
if(start<0 || end < start || end-start > size)
throw new IndexOutOfBoundsException();
return new MappedCharSequence(start + this.start, newSize);
}
public String toString() {
return new StringBuilder(size).append(this).toString();
}
}
return new MappedCharSequence(0, size);
}
}
This allows mapping a file into virtual memory and projecting it directly to a CharSequence, without copy operations, assuming the mapping can be done with a simple table. For ASCII-based charsets, the majority of the characters do not even need a table lookup, as their numerical value is identical to the Unicode code point.
With these methods, you may implement the operation as
// You need this only once per JVM.
// Note that running inside IDEs like Netbeans may change the default encoding
char[] table = getTable(Charset.defaultCharset());
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream//.parallel()
.filter(path -> {
try {
return pattern.matcher(mapAsciiBasedText(path, table)).find();
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
This runs much faster than the normal text conversion, but still supports parallel execution.
Besides requiring an ASCII based single byte encoding, there’s the restriction that this code doesn’t support files larger than 2 GiB. While it is possible to extend the solution to support larger files, I wouldn’t add this complication unless really needed.
I don’t know what “Unix4j” provides that isn’t already in the JDK, as the following code does everything with built-in features:
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Scanner s = new Scanner(path)) {
return s.findWithinHorizon(pattern, 0) != null;
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
One important property of this solution is that it doesn’t read the whole file, but stops at the first encountered match. Also, it doesn’t deal with line boundaries, which is suitable for the words you’re looking for, as they never contain line breaks anyway.
After analyzing the findWithinHorizon operation, I consider that line-by-line processing may be better for larger files, so you may try
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
instead.
You may also try to turn the stream to parallel mode, e.g.
try(Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))
.filter(Files::isRegularFile)) {
Pattern pattern = Pattern.compile(String.join("|", Strings_PASSWORDS));
long startTime = System.nanoTime();
final List<Path> filesWithObviousStringss = stream
.parallel()
.filter(path -> {
try(Stream<String> s = Files.lines(path)) {
return s.anyMatch(pattern.asPredicate());
} catch(IOException ex) {
throw new UncheckedIOException(ex);
}
})
.collect(Collectors.toList());
System.out.println("Time taken = "
+ TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()-startTime) + " seconds");
}
It’s hard to predict whether this has a benefit, as in most cases, the I/O dominates such an operation.
I have never used Unix4j, but Java provides nice file APIs as well nowadays. Also, Unix4j.grep seems to return all the found matches (as you're using .toStringResult().isEmpty()), while you only need to know whether at least one match was found (which means you should be able to stop as soon as one match is found). Maybe the library provides another method that better suits your needs, e.g. something like #contains? Without Unix4j, Stream#anyMatch could be a good candidate here. Here is a vanilla Java solution if you want to compare it with yours:
private boolean lineContainsObviousStrings(String line) {
return Strings_PASSWORDS // <-- weird naming BTW
.stream()
.anyMatch(line::contains);
}
private boolean fileContainsObviousStrings(Path path) {
try (Stream<String> stream = Files.lines(path)) {
return stream.anyMatch(this::lineContainsObviousStrings);
}
}
public List<Path> findFilesContainingObviousStrings() {
Instant startTime = Instant.now();
try (Stream<Path> stream = Files.walk(Paths.get(FILES_DIRECTORY))) {
return stream
.filter(FilePredicates.isFileAndNotDirectory())
.filter(this::fileContainsObviousStrings)
.collect(Collectors.toList());
} finally {
Instant endTime = Instant.now();
System.out.println("Time taken = " + Duration.between(startTime, endTime).getSeconds() + " seconds");
}
}
Please try this out too (if it is possible), I am curious how it performs on your files.
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import java.util.stream.Stream;
public class Filescan {
public static void main(String[] args) throws IOException {
Filescan sc = new Filescan();
sc.findWords("src/main/resources/files", new String[]{"author", "book"}, true);
}
// kind of Tuple/Map.Entry
static class Pair<K,V>{
final K key;
final V value;
Pair(K key, V value){
this.key = key;
this.value = value;
}
@Override
public String toString() {
return key + " " + value;
}
}
public void findWords(String directory, String[] words, boolean ignorecase) throws IOException{
final String[] searchWords = ignorecase ? toLower(words) : words;
try (Stream<Path> stream = Files.walk(Paths.get(directory)).filter(Files::isRegularFile)) {
long startTime = System.nanoTime();
List<Pair<Path,Map<String, List<Integer>>>> result = stream
// you can test it with parallel execution, maybe it is faster
.parallel()
// searching
.map(path -> findWordsInFile(path, searchWords, ignorecase))
// filtering out empty optionals
.filter(Optional::isPresent)
// unwrap optionals
.map(Optional::get).collect(Collectors.toList());
System.out.println("Time taken = " + TimeUnit.NANOSECONDS.toSeconds(System.nanoTime()
- startTime) + " seconds");
System.out.println("result:");
result.forEach(System.out::println);
}
}
private String[] toLower(String[] words) {
String[] ret = new String[words.length];
for (int i = 0; i < words.length; i++) {
ret[i] = words[i].toLowerCase();
}
return ret;
}
private static Optional<Pair<Path,Map<String, List<Integer>>>> findWordsInFile(Path path, String[] words, boolean ignorecase) {
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(path.toFile())))) {
String line = br.readLine();
line = ignorecase && line != null ? line.toLowerCase() : line;
Map<String, List<Integer>> map = new HashMap<>();
int linecount = 0;
while(line != null){
for (String word : words) {
if(line.contains(word)){
if(!map.containsKey(word)){
map.put(word, new ArrayList<Integer>());
}
map.get(word).add(linecount);
}
}
line = br.readLine();
line = ignorecase && line != null ? line.toLowerCase() : line;
linecount++;
}
if(map.isEmpty()){
// returning empty optional when nothing in the map
return Optional.empty();
}else{
// returning a path-map pair with the words and the rows where each word has been found
return Optional.of(new Pair<Path,Map<String, List<Integer>>>(path, map));
}
} catch (IOException ex) {
throw new UncheckedIOException(ex);
}
}
}
I need to improve an open source tool (Releng, with JDK 1.5 compliance) that updates copyright headers in source files (e.g. copyright 2000, 2011).
It reads the files and inserts back a newer revision date (e.g. 2014).
Currently it eats so much memory that performance slows to a crawl.
I need to re-write the file parser so that it uses less memory and runs faster.
I've written a basic file parser (below) that reads all files in a directory (project/files). It then increments the first four digits found in each file and prints run-time information.
[edit]
On a small scale, the current code performs 25 garbage collections, and garbage collection takes 12 ms. On a large scale I get so much memory overhead that GC thrashing kills performance.
Runs Time(ms) avrg(ms) GC_count GC_time
200 4096 20 25 12
200 4158 20 25 12
200 4072 20 25 12
200 4169 20 25 13
Is it possible to re-use File or String objects (and other objects?) to reduce the garbage collection count?
Optimization guides suggest re-using objects.
I have considered using StringBuilder instead of String, but from what I gather it's only useful when doing a lot of concatenation, which is not the case here.
I also don't know how to re-use any other objects in the code below (e.g. the File objects).
How can I go about re-using objects in this scenario (or optimize the code below)?
Any ideas/suggestions are welcomed.
import java.io.File;
import java.io.IOException;
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
public class Test {
//Use Bash script to create 2000 files, each having a 4 digit number.
/*
#!/bin/sh
rm files/test*
for i in {1..2000}
do
echo "2000" > files/test$i
done
*/
/*
* Example output:
* runs: 200
* Run time: 4822 average: 24
* Gc runs: Total Garbage Collections: 28
* Total Garbage Collection Time (ms): 17
*/
private static String filesPath = System.getProperty("user.dir") + "/src/files";
public static void main(String args[]) {
final File folder = new File(filesPath);
ArrayList<String> paths = listFilesForFolder(folder);
if (paths == null) {
System.out.println("no files found");
return;
}
long start = System.currentTimeMillis();
// ..
// your code
int runs = 200;
System.out.println("Run: ");
for (int i = 1; i <= runs; i++) {
System.out.print(" " + i);
updateFiles(paths);
}
System.out.println("");
// ..
long end = System.currentTimeMillis();
long runtime = end - start;
System.out.println("Runs Time avrg GC_count GC_time");
System.out.println(runs + " " + Long.toString(runtime) + " " + (runtime / runs) + " " + printGCStats());
}
private static ArrayList<String> listFilesForFolder(final File folder) {
ArrayList<String> paths = new ArrayList<>();
for (final File fileEntry : folder.listFiles()) {
if (fileEntry.isDirectory()) {
// keep the recursive results; the original call discarded them
ArrayList<String> subPaths = listFilesForFolder(fileEntry);
if (subPaths != null) {
paths.addAll(subPaths);
}
} else {
paths.add(fileEntry.getPath());
}
}
if (paths.size() == 0) {
return null;
} else {
return paths;
}
}
private static void updateFiles(final ArrayList<String> paths) {
for (String path : paths) {
try {
String content = readFile(path, StandardCharsets.UTF_8);
int year = Integer.parseInt(content.substring(0, 4));
year++;
Files.write(Paths.get(path), Integer.toString(year).getBytes(),
StandardOpenOption.CREATE);
} catch (IOException e) {
System.out.println("Failed to read: " + path);
}
}
}
static String readFile(String path, Charset encoding) throws IOException {
byte[] encoded = Files.readAllBytes(Paths.get(path)); // closes file.
return new String(encoded, encoding);
}
//PROFILING HELPER
public static String printGCStats() {
long totalGarbageCollections = 0;
long garbageCollectionTime = 0;
for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
long count = gc.getCollectionCount();
if (count >= 0) {
totalGarbageCollections += count;
}
long time = gc.getCollectionTime();
if (time >= 0) {
garbageCollectionTime += time;
}
}
return " " + totalGarbageCollections + " " + garbageCollectionTime;
}
}
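As an illustrative aside (not the original Releng code; names are mine): most of the per-file garbage in updateFiles comes from materializing each file as a String just to read four digits. A sketch of parsing the year directly from a byte buffer, avoiding the per-file String, substring, and boxed-parse allocations:

```java
public class YearParser {
    // Sketch: parse the leading 4-digit year straight from a byte buffer.
    // With one byte[] reused across files, no per-file String/substring
    // objects are created, which cuts the garbage collection count.
    static int parseYear(byte[] buf) {
        int year = 0;
        for (int i = 0; i < 4; i++) {
            year = year * 10 + (buf[i] - '0');
        }
        return year;
    }

    public static void main(String[] args) {
        byte[] buf = "2014 some file content".getBytes();
        System.out.println(parseYear(buf)); // 2014
    }
}
```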
In the end, the code above actually works fine.
I found that the production code didn't close a file buffer, which caused a memory leak that degraded performance with larger numbers of files.
After that was fixed, it scaled well.
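For reference, the idiomatic way to make that kind of leak impossible is try-with-resources (a sketch; readFirstLine is a made-up example method, not the production code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SafeRead {
    // The reader (and its underlying file handle and buffer) is closed
    // automatically when the try block exits, even if readLine throws.
    static String readFirstLine(Path file) throws IOException {
        try (BufferedReader br = Files.newBufferedReader(file)) {
            return br.readLine();
        }
    }
}
```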
I have a server-side application, which I am profiling using VisualVM, that makes use of the Jackson Streaming API.
However, since there are a lot of factors in that code, I also made a toy example to compare streaming vs. object mapping.
I have a feeling that something may be off, as there is a lot of randomness in the results.
Is it the measuring? Would using other kinds of timers make a difference? Is there something multi-threaded going on that I don't know about?
Currently I am writing to a NUL file object, the Windows equivalent of /dev/null. I am running this at high priority in case the operating system affects the results.
Toy Example Code:
import java.io.File;
import java.io.IOException;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.Map.Entry;
import java.util.Scanner;
import java.util.TreeMap;
import com.fasterxml.jackson.core.JsonEncoding;
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonGenerator;
import com.fasterxml.jackson.databind.ObjectMapper;
public class TestStreamingMapping {
public final static int NUM_SIMULATED_CATALOGS = 10000;
public final static int CATALOG_SIZE = 1000; //1000 Items in CATALOG, 500 requests per second
public final static boolean WRITE_TO_FILE = false; //Write to file, or write to string
public final static boolean DEBUG_PRINT_100_CHAR = false; //Print out part of string to see all ok
public static final String mappingFile = "mapping.txt"; //If writing to file, where?
public static final String streamingFile = "streaming.txt"; //If streaming to file, where?
public static final boolean PRINT_INTERMEDIATE_RESULTS = false;
public static TreeMap<Long,Double> iterationPercentages = new TreeMap<Long,Double>();
ObjectMapper mapper= new ObjectMapper();
JsonFactory f = new JsonFactory();
JsonGenerator g;
public static long totalCountStream = 0, totalCountMap = 0;
public static void main(String args[])
{
System.out.println("Press enter when profiler is connected...");
new Scanner(System.in).nextLine();
System.out.println("Starting iterations of JSON generation.");
double percentage;
for(long i=0; i<NUM_SIMULATED_CATALOGS; i++)
{
performTest();
percentage = (totalCountStream*100.0d / totalCountMap);
iterationPercentages.put(i, percentage);
if(!PRINT_INTERMEDIATE_RESULTS && i%100 == 0)System.out.print(i+"-");
}
System.out.println("Total Streaming API: " + totalCountStream + " ns.");
System.out.println("Total Mapping API: " + totalCountMap + " ns.");
System.out.println("Total Stream(as % of map): " + totalCountStream*100.0d / totalCountMap + " %\r\n" );
System.out.println("Iteration\tStreamPercent");
for(Entry<Long, Double> entry : iterationPercentages.entrySet())
if(entry.getKey() % 20 ==0)
System.out.println(entry.getKey() + "\t\t" + Math.round(entry.getValue()) + "%" );
}
public static void performTest()
{
TestStreamingMapping test = new TestStreamingMapping();
long time1, time2;
double percentage = 0;
try {
long starttime1 = System.nanoTime();
test.streamingToFile();
totalCountStream+=time1=System.nanoTime() - starttime1;
long starttime2 = System.nanoTime();
test.objectMapping();
totalCountMap+=time2=System.nanoTime() - starttime2;
percentage = (time1*100.0d / time2);
if(PRINT_INTERMEDIATE_RESULTS)
{
System.out.println("Streaming API: " + time1 + " ns.");
System.out.println("Mapping API: " + time2 + " ns.");
System.out.println("Stream(as % of map): " + percentage + " %" );
System.out.println("----------------------------------------------\r\n");
}
} catch (IOException e) {
e.printStackTrace();
}
}
public String[] numbers;
public ArrayList<String> arrayList = new ArrayList<String>();
public TestStreamingMapping()
{
numbers = new String[60]; // was 62, leaving two null slots that got serialized
for(int i=0; i<numbers.length; i++) numbers[i] = String.valueOf(Math.random()*i);
for(int i=0; i<60; i++) arrayList.add(String.valueOf(Math.random()*i));
}
public void initializeGenerator(StringWriter writer) throws IOException
{
if(WRITE_TO_FILE)
g = f.createGenerator(new File(mappingFile), JsonEncoding.UTF8);
else
g = f.createGenerator(writer);
}
public void objectMapping() throws IOException
{
StringWriter writer = new StringWriter();
initializeGenerator(writer);
for(int j=0; j<CATALOG_SIZE; j++)
mapper.writeValue(g, this);
g.close();
writer.close();
if(DEBUG_PRINT_100_CHAR)
System.out.println(writer.toString().substring(0,100));
}
public void streamingToFile() throws IOException
{
StringWriter writer = new StringWriter();
initializeGenerator(writer);
for(int j=0; j<CATALOG_SIZE; j++)
{
g.writeStartObject();
g.writeFieldName("numbers_streaming");
g.writeStartArray();
for(int i=0; i<numbers.length; i++) g.writeString(numbers[i]);
g.writeEndArray();
g.writeFieldName("arrayList"); g.writeStartArray();
for(String num: arrayList) g.writeString(num);
g.writeEndArray();
g.writeEndObject();
}
g.close();
writer.close();
if(DEBUG_PRINT_100_CHAR)
System.out.println(writer.toString().substring(0,100));
}
}
The code above simulates a service that generates a JSON catalog document with 1000 Product objects. The hotspot obviously is the serialization of the products (streamingToFile() vs objectMapping()).
OK, a couple of things.
Most importantly, you should create just one JsonFactory instance, similar to how you reuse ObjectMapper. Reuse of these objects is one of the key things for performance with Jackson. See here for more ideas.
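A minimal sketch of that reuse pattern (class and method names are mine, not from the question's code): the mapper is created once, stored in a static final, and shared, since it is thread-safe once configured.

```java
import java.util.Collections;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JacksonReuse {
    // Created once and reused everywhere: constructing ObjectMapper (and the
    // JsonFactory it carries) is expensive; both are thread-safe after setup.
    static final ObjectMapper MAPPER = new ObjectMapper();

    static String toJson(Object value) throws Exception {
        return MAPPER.writeValueAsString(value);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toJson(Collections.singletonMap("ok", 1))); // {"ok":1}
    }
}
```

If you need a raw JsonGenerator, obtain the factory from the mapper via MAPPER.getFactory() rather than constructing a second JsonFactory.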
Another thing to consider is that writing to a File adds I/O overhead, which should be about the same for both approaches and so diminishes the difference in actual processing times. You may want to separate this out to see how much of the time is spent on file access. I realize this may be a bogus file (as per your note on how the OS deals with it), but even without physical I/O the OS typically incurs some syscall overhead.
And then one general aspect: when measuring performance on the JVM, you always need to keep warm-up in mind. You should warm up tests for multiple seconds (5 or 10 seconds minimum), and run the actual test for a sufficient time (30 seconds or more), to get more stable results.
This is where test frameworks can help, as they can actually statistically measure things and figure out when results stabilize enough to be meaningful.
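In plain Java the warm-up discipline looks roughly like this (a sketch only; the workload and the short windows are placeholders, and a harness such as JMH does this properly, with forked JVMs and statistical analysis):

```java
public class WarmupHarness {
    static long blackhole; // consumed so the JIT cannot eliminate the work

    // Stand-in workload; substitute the code under test here.
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += Integer.toString(i).hashCode();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up phase: use 5-10 seconds in a real run (200 ms here only to
        // keep the sketch fast) so the JIT compiles hot paths before measuring.
        long warmupEnd = System.nanoTime() + 200_000_000L;
        while (System.nanoTime() < warmupEnd) {
            blackhole += workload();
        }
        // Measurement phase: likewise 30+ seconds in a real run.
        int iterations = 0;
        long start = System.nanoTime();
        long measureEnd = start + 500_000_000L;
        while (System.nanoTime() < measureEnd) {
            blackhole += workload();
            iterations++;
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("avg ns/op: " + elapsed / Math.max(iterations, 1));
    }
}
```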
Hope this helps!
I'm running Java on a Unix platform. How can I get a list of all mounted filesystems via the Java 1.6 API?
I've tried File.listRoots() but that returns a single filesystem (that is, /). If I use df -h I see more than that:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk0s2 931Gi 843Gi 87Gi 91% 221142498 22838244 91% /
devfs 187Ki 187Ki 0Bi 100% 646 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
/dev/disk1s2 1.8Ti 926Gi 937Gi 50% 242689949 245596503 50% /Volumes/MyBook
/dev/disk2 1.0Gi 125Mi 875Mi 13% 32014 223984 13% /Volumes/Google Earth
I would expect to see /home as well (at a minimum).
In Java 7+ you can use NIO:
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.FileSystems;
public class ListMountedVolumesWithNio {
public static void main(String[] args) throws IOException {
for (FileStore store : FileSystems.getDefault().getFileStores()) {
long total = store.getTotalSpace() / 1024;
long used = (store.getTotalSpace() - store.getUnallocatedSpace()) / 1024;
long avail = store.getUsableSpace() / 1024;
System.out.format("%-20s %12d %12d %12d%n", store, total, used, avail);
}
}
}
Java doesn't provide any access to mount points. You have to run the system command mount via Runtime.exec() and parse its output. Either that, or parse the contents of /etc/mtab.
You can try the following method to resolve the issue. My code:
public List<String> getHDDPartitions() {
try {
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(new FileInputStream("/proc/mounts"), "UTF-8"));
String response;
StringBuilder stringBuilder = new StringBuilder();
while ((response = bufferedReader.readLine()) != null) {
stringBuilder.append(response.replaceAll(" +", "\t") + "\n");
}
bufferedReader.close();
return Lists.newArrayList(Arrays.asList(stringBuilder.toString().split("\n")));
} catch (IOException e) {
LOGGER.error("{}", ExceptionWriter.INSTANCE.getStackTrace(e));
}
return null;
}
public List<Map<String, String>> getMapMounts() {
List<Map<String, String>> resultList = Lists.newArrayList();
for (String mountPoint : getHDDPartitions()) {
Map<String, String> result = Maps.newHashMap();
String[] mount = mountPoint.split("\t");
result.put("FileSystem", mount[2]);
result.put("MountPoint", mount[1]);
result.put("Permissions", mount[3]);
result.put("User", mount[4]);
result.put("Group", mount[5]);
result.put("Total", String.valueOf(new File(mount[1]).getTotalSpace()));
result.put("Free", String.valueOf(new File(mount[1]).getFreeSpace()));
result.put("Used", String.valueOf(new File(mount[1]).getTotalSpace() - new File(mount[1]).getFreeSpace()));
result.put("Free Percent", String.valueOf(getFreeSpacePercent(new File(mount[1]).getTotalSpace(), new File(mount[1]).getFreeSpace())));
resultList.add(result);
}
return resultList;
}
private Integer getFreeSpacePercent(long total, long free) {
// plain casts, not Double.longBitsToDouble (which reinterprets the raw bits)
double result = ((double) free / (double) total) * 100;
return (int) result;
}
OSHI (Operating System and Hardware Information library for Java) can be useful here: https://github.com/oshi/oshi.
Check out this code:
@Test
public void test() {
final SystemInfo systemInfo = new SystemInfo();
final OSFileStore[] fileStores = systemInfo.getOperatingSystem().getFileSystem().getFileStores();
// use forEach rather than peek(...).count(): peek gives no guarantee of
// execution, and count() may skip the traversal entirely on newer JVMs
Stream.of(fileStores)
.forEach(fs -> {
System.out.println("name: " + fs.getName());
System.out.println("type: " + fs.getType());
System.out.println("str: " + fs.toString());
System.out.println("mount: " + fs.getMount());
System.out.println("...");
});
}
You can call the getmntent function (use "man getmntent" to get more information) using JNA.
Here is some example code to get you started:
import java.util.Arrays;
import java.util.List;
import com.sun.jna.Library;
import com.sun.jna.Native;
import com.sun.jna.Pointer;
import com.sun.jna.Structure;
public class MntPointTest {
public static class mntent extends Structure {
public String mnt_fsname; //Device or server for filesystem
public String mnt_dir; //Directory mounted on
public String mnt_type; //Type of filesystem: ufs, nfs, etc.
public String mnt_opts;
public int mnt_freq;
public int mnt_passno;
@Override
protected List getFieldOrder() {
return Arrays.asList("mnt_fsname", "mnt_dir", "mnt_type", "mnt_opts", "mnt_freq", "mnt_passno");
}
}
public interface CLib extends Library {
CLib INSTANCE = (CLib) Native.loadLibrary("c", CLib.class);
Pointer setmntent(String file, String mode);
mntent getmntent(Pointer stream);
int endmntent(Pointer stream);
}
public static void main(String[] args) {
mntent mntEnt;
Pointer stream = CLib.INSTANCE.setmntent("/etc/mtab", "r");
while ((mntEnt = CLib.INSTANCE.getmntent(stream)) != null) {
System.out.println("Mounted from: " + mntEnt.mnt_fsname);
System.out.println("Mounted on: " + mntEnt.mnt_dir);
System.out.println("File system type: " + mntEnt.mnt_type);
System.out.println("-------------------------------");
}
CLib.INSTANCE.endmntent(stream);
}
}
I was already on the way to using mount when @Cozzamara pointed out that's the way to go. What I ended up with is:
// get the list of mounted filesystems
// Note: this is Unix specific, as it requires the "mount" command
Process mountProcess = Runtime.getRuntime ().exec ( "mount" );
BufferedReader mountOutput = new BufferedReader ( new InputStreamReader ( mountProcess.getInputStream () ) );
List<File> roots = new ArrayList<File> ();
while ( true ) {
// fetch the next line of output from the "mount" command
String line = mountOutput.readLine ();
if ( line == null )
break;
// the line will be formatted as "... on <filesystem> (...)"; get the substring we need
int indexStart = line.indexOf ( " on /" );
int indexEnd = line.indexOf ( " ", indexStart + 5 );
roots.add ( new File ( line.substring ( indexStart + 4, indexEnd ) ) );
}
mountOutput.close ();