I have a directory with many files and want to filter the ones whose names contain a certain string and save them in the fileList ArrayList. It works this way, but it takes a long time. Is there a way to make this faster?
String processingDir = "C:/Users/Ferid/Desktop/20181024";
String corrId = "00a3d321-171c-484a-ad7c-74e22ffa3625";
Path dirPath = Paths.get(processingDir);
ArrayList<Path> fileList;
try (Stream<Path> paths = Files.walk(dirPath))
{
    fileList = paths.filter(t -> t.getFileName().toString().indexOf("EPX_" + corrId + "_") >= 0)
                    .collect(Collectors.toCollection(ArrayList::new));
}
Walking the directory inside the try condition does not take much time, but collecting the results into fileList does, and I do not know which operation exactly has the poor performance or which of them to improve. (This is not the complete code, of course, just the relevant parts.)
From the java.nio.file.Files.walk(Path) API documentation:
Return a Stream that is lazily populated with Path by walking the file
tree rooted at a given starting file.
That's why it gives you the impression that "walking through the directory in the try condition is not taking much time".
In reality, most of the work is done during collect, and it is not the fault of collect's mechanism that it is slow; the stream only actually walks the file tree when the terminal operation runs.
If scanning the files each time is too slow you can build an index of files, either on startup or persisted and maintained as files change.
You could use a WatchService to be notified when files are added or removed while the program is running (a rough sketch of this follows the example below).
This would be much faster to query as it would be entirely in memory. Building it the first time would take the same amount of time, but it could be loaded in the background before you first need it.
e.g.
static Map<String, List<Path>> pathMap;
public static void initPathMap(String processingDir) throws IOException {
try (Stream<Path> paths = Files.walk(Paths.get(processingDir))) {
pathMap = paths.collect(Collectors.groupingBy(
p -> getCorrId(p.getFileName().toString())));
}
pathMap.remove(""); // remove entries without a corrId.
}
private static String getCorrId(String fileName) {
int start = fileName.indexOf("EPX_");
if (start < 0)
return "";
int end = fileName.indexOf("_", start + 4);
if (end < 0)
return "";
return fileName.substring(start + 4, end);
}
// later
String corrId = "00a3d321-171c-484a-ad7c-74e22ffa3625";
List<Path> pathList = pathMap.get(corrId); // very fast.
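The WatchService part could look roughly like the following. This is only a minimal sketch: it assumes the pathMap and getCorrId above, watches only the top-level directory (Files.walk above is recursive, so a complete version would register each subdirectory as well), and skips OVERFLOW events.
// Minimal sketch: keep pathMap up to date while the program runs.
// (Requires java.nio.file imports: WatchService, WatchKey, WatchEvent, StandardWatchEventKinds.)
static void watchDirectory(Path dir) throws IOException, InterruptedException {
    WatchService watcher = dir.getFileSystem().newWatchService();
    dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                          StandardWatchEventKinds.ENTRY_DELETE);
    while (true) {
        WatchKey key = watcher.take(); // blocks until events are available
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.OVERFLOW)
                continue;
            Path name = (Path) event.context();
            String corrId = getCorrId(name.toString());
            if (corrId.isEmpty())
                continue;
            Path full = dir.resolve(name);
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                pathMap.computeIfAbsent(corrId, k -> new ArrayList<>()).add(full);
            } else if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                List<Path> list = pathMap.get(corrId);
                if (list != null)
                    list.remove(full);
            }
        }
        key.reset();
    }
}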
You can make this code cleaner by writing the following; however, I wouldn't expect it to be much faster.
List<Path> fileList;
try (Stream<Path> paths = Files.walk(dirPath)) {
String find = "EPX_" + corrId + "_"; // only calculate this once
fileList = paths.filter(t -> t.getFileName().toString().contains(find))
.collect(Collectors.toList());
}
The cost is in the time taken to scan the files of the directory. The cost of processing the file names is far, far less.
Using an SSD, or only scanning directories already cached in memory would speed this up dramatically.
One way to test this is to perform the operation more than once after a clean boot (so nothing is cached). The amount by which the first run takes longer tells you how much time was spent loading the data from disk.
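A rough way to do that measurement (a sketch reusing dirPath and corrId from the question; the second run will typically hit the OS file cache):
// Time the same scan twice; the difference is roughly the disk-read cost.
for (int run = 1; run <= 2; run++) {
    long start = System.nanoTime();
    try (Stream<Path> paths = Files.walk(dirPath)) {
        long matches = paths.filter(t -> t.getFileName().toString()
                                          .contains("EPX_" + corrId + "_"))
                            .count();
        System.out.printf("run %d: %d matches in %.3f s%n",
                run, matches, (System.nanoTime() - start) / 1e9);
    }
}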
Related
My application stores a large number (about 700,000) of strings in an ArrayList. The strings are loaded from a text file like this:
List<String> stringList = new ArrayList<String>(750_000);
//there's a try catch here but I omitted it for this example
Scanner fileIn = new Scanner(new FileInputStream(listPath), "UTF-8");
while (fileIn.hasNextLine()) {
String s = fileIn.nextLine().trim();
if (s.isEmpty()) continue;
if (s.startsWith("#")) continue; //ignore comments
stringList.add(s);
}
fileIn.close();
Later on, other strings are compared to this list, using this code:
String example = "Something";
if (stringList.contains(example))
doSomething();
This comparison will happen many hundreds (thousands?) of times.
This all works, but I want to know if there's anything I can do to make it better. I notice that the JVM increases in size from about 100MB to 600MB when it loads the 700K Strings. The strings are mainly about this size:
Blackened Recordings
Divergent Series: Insurgent
Google
Pixels Movie Money
X Ambassadors
Power Path Pro Advanced
CYRFZQ
Is there anything I can do to reduce the memory, or is that to be expected? Any suggestions in general?
ArrayList is memory-efficient. Your issue is probably caused by java.util.Scanner. Scanner creates a lot of temporary objects during parsing (Patterns, Matchers, etc.) and is not suitable for big files.
Try to replace it with java.io.BufferedReader:
List<String> stringList = new ArrayList<String>();
BufferedReader fileIn = new BufferedReader(new InputStreamReader(new FileInputStream(listPath), "UTF-8"));
String line = null;
while ((line = fileIn.readLine()) != null) {
line = line.trim();
if (line.isEmpty()) continue;
if (line.startsWith("#")) continue; //ignore comments
stringList.add(line);
}
fileIn.close();
See java.util.Scanner source code
To pinpoint the memory issue, attach a memory profiler to your JVM, for example VisualVM from the JDK tools.
Added:
Let's make a few assumptions:
you have 700,000 strings with 20 characters each;
an object reference is 32 bits, an object header 24 bits, an array header 16 bits, a char 16 bits, an int 32 bits.
Then every string will consume 24 + 32*2 + 32 + (16 + 20*16) = 456 bits.
The whole ArrayList with the string objects will consume about 700,000 * (32*2 + 456) = 364,000,000 bits = 43.4 MB (very roughly).
Not quite an answer, but:
Your scenario uses around 70mb on my machine:
long usedMemory = -(Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory());
{//
String[] strings = new String[700_000];
for (int i = 0; i < strings.length; i++) {
strings[i] = new String(new char[20]);
}
}//
usedMemory += Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
System.out.println(usedMemory / 1_000_000d + " mb");
How did you reach 500 MB there? As far as I know, a String internally holds a char[], and each char takes 16 bits. Taking the Object and String overhead into account, 500 MB is still quite a lot for the strings alone. You may want to run some benchmarking tests on your machine.
As others have already mentioned, you should change the data structure used for element look-ups/comparisons.
You're likely going to be better off using a HashSet instead of an ArrayList as both add and contains are constant time operations in a HashSet.
However, it does assume that your object's hashCode implementation (which is part of Object, but can be overridden) is evenly distributed.
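For example (a sketch: the loading loop from the question stays the same, only the collection type changes):
Set<String> stringSet = new HashSet<>(1_000_000); // roomy initial capacity to limit rehashing
// ... fill it with the same loop used for stringList ...

// later: average O(1) lookups instead of a linear scan over 700,000 entries
if (stringSet.contains(example))
    doSomething();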
There is a Trie data structure which can be used as a dictionary; with so many strings, some of which may occur multiple times, it seems to fit your case. https://en.wikipedia.org/wiki/Trie
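A minimal sketch of such a trie (hypothetical; a production version would use a more compact node representation than a HashMap per node):
// Simple character trie: add() inserts a string, contains() checks exact membership.
// (Needs java.util.Map and java.util.HashMap imports.)
class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean endOfWord; // true if a stored string ends at this node

    void add(String s) {
        Trie node = this;
        for (char c : s.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Trie());
        node.endOfWord = true;
    }

    boolean contains(String s) {
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null)
                return false;
        }
        return node.endOfWord;
    }
}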
UPDATE:
An alternative can be a HashSet, or a HashMap of string -> something if you want, for example, to count occurrences of strings. A hashed collection will certainly be faster than a list for lookups.
I would start with HashSet.
Using an ArrayList is a very bad idea for your use case, because it is not sorted, and hence you cannot efficiently search for an entry.
The best built-in type for your case is a TreeSet<String>. It guarantees O(log(n)) performance for add() and contains().
Be aware that TreeSet is not thread-safe in the basic implementation. Use a thread-safe wrapper (see the JavaDocs of TreeSet for this).
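For example (a sketch using the wrapper the JavaDocs refer to):
// Thread-safe, sorted set with O(log n) add() and contains()
SortedSet<String> sortedSet =
        Collections.synchronizedSortedSet(new TreeSet<String>());
sortedSet.add("Something");
if (sortedSet.contains(example))
    doSomething();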
Here is a Java 8 approach. It uses the Files.lines() method, which takes advantage of the Stream API. This method reads all lines from a file as a Stream.
As a consequence, no intermediate collection of String objects is built up; each line flows lazily through the pipeline to the terminal operation, which here is the static method MyExecutor.doSomething(String).
/**
* Process lines from a file.
* Uses the Files.lines() method, which takes advantage of the Stream API introduced in Java 8.
*/
private static void processStringsFromFile(final Path file) {
try (Stream<String> lines = Files.lines(file)) {
lines.map(s -> s.trim())
.filter(s -> !s.isEmpty())
.filter(s -> !s.startsWith("#"))
.filter(s -> s.contains("Something"))
.forEach(MyExecutor::doSomething);
} catch (IOException ex) {
logProcessStringsFailed(ex);
}
}
I ran a memory-usage analysis in NetBeans; here are the results for an empty implementation of doSomething():
public static void doSomething(final String s) {
}
Live Bytes = 6702720 ≈ 6.4MB.
Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?
Given an ASCII file of arbitrary large size where each line is terminated by a \n
For each invocation of some method readLine() read a random line from the file
And for the life of the file handle no call to readLine() should return the same line twice
Update:
All lines must eventually be read
Context: the file's contents are created from Unix shell commands that produce a directory listing of all paths contained within a given directory; there are between a million and a billion files (which yields between a million and a billion lines in the target file). If there is some way to randomly distribute the paths into the file at creation time, that is an acceptable solution as well.
In order to avoid reading in the whole file, which may not be possible in your case, you may want to use a RandomAccessFile instead of a standard java FileInputStream. With RandomAccessFile, you can use the seek(long position) method to skip to an arbitrary place in the file and start reading there. The code would look something like this.
RandomAccessFile raf = new RandomAccessFile("path-to-file","rw");
HashMap<Integer,String> sampledLines = new HashMap<Integer,String>();
for(int i = 0; i < numberOfRandomSamples; i++)
{
//seek to a random point in the file
raf.seek((long)(Math.random()*raf.length()));
//skip from the random location to the beginning of the next line
int nextByte = raf.read();
while(((char)nextByte) != '\n')
{
if(nextByte == -1) raf.seek(0);//wrap around to the beginning of the file if you reach the end
nextByte = raf.read();
}
//read the line into a buffer
StringBuffer lineBuffer = new StringBuffer();
nextByte = raf.read();
while(nextByte != -1 && (((char)nextByte) != '\n'))
{
    lineBuffer.append((char)nextByte);
    nextByte = raf.read(); //advance, otherwise this loops forever
}
//ensure uniqueness
String line = lineBuffer.toString();
if(sampledLines.get(line.hashCode()) != null)
i--;
else
sampledLines.put(line.hashCode(),line);
}
Here, sampledLines should hold your randomly selected lines at the end. You may need to check that you haven't randomly skipped to the end of the file as well to avoid an error in that case.
EDIT: I made it wrap to the beginning of the file in case you reach the end. It was a pretty simple check.
EDIT 2: I made it verify uniqueness of lines by using a HashMap.
Pre-process the input file and remember the offset of each new line. Use a BitSet to keep track of used lines. If you want to save some memory, then remember the offset of every 16th line; it is still easy to jump into the file and do a sequential lookup within a block of 16 lines.
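A sketch of that idea (it indexes every line rather than every 16th for brevity, and the single-byte pre-processing scan would be faster with a buffered read):
// Sketch: index line-start offsets once, then hand out random unused lines.
List<Long> offsets = new ArrayList<>();
RandomAccessFile raf = new RandomAccessFile("path-to-file", "r");
offsets.add(0L);
int b;
while ((b = raf.read()) != -1)
    if (b == '\n' && raf.getFilePointer() < raf.length())
        offsets.add(raf.getFilePointer());
BitSet used = new BitSet(offsets.size());
Random rnd = new Random();

// one read: pick a random line that has not been returned yet
// (assumes not all lines have been used up)
int idx;
do {
    idx = rnd.nextInt(offsets.size());
} while (used.get(idx));
used.set(idx);
raf.seek(offsets.get(idx));
String line = raf.readLine();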
Since you can pad the lines, I would do something along those lines; note that even then, there may be a limit on how many elements a List can actually hold.
Using a random number each time you want to read a line and adding it to a Set would also work; however, the approach below ensures that the file is read completely:
public class VeryLargeFileReading
implements Iterator<String>, Closeable
{
private static Random RND = new Random();
// List of all indices
final List<Long> indices = new ArrayList<Long>();
final RandomAccessFile fd;
public VeryLargeFileReading(String fileName, long lineSize)
    throws IOException
{
    fd = new RandomAccessFile(fileName, "r");
    long nrLines = fd.length() / lineSize;
    for (long i = 0; i < nrLines; i++)
        indices.add(i * lineSize);
    Collections.shuffle(indices);
}
// Iterator methods
@Override
public boolean hasNext()
{
return !indices.isEmpty();
}
@Override
public void remove()
{
// Nope
throw new IllegalStateException();
}
@Override
public String next()
{
    final long offset = indices.remove(0);
    try {
        fd.seek(offset);
        return fd.readLine().trim();
    } catch (IOException e) {
        throw new RuntimeException(e); // Iterator.next() cannot throw checked exceptions
    }
}
@Override
public void close() throws IOException
{
fd.close();
}
}
If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time if tracking in files instead of a list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.
I'd consider something along the lines of the following:
Create n "bucket" files. n could be determined based on something that takes in to account the number of files and system memory. (If n is large, you could generate a subset of n to keep open file handles down.)
Each file's name is hashed, and goes into an appropriate bucket file, "sharding" the directory based on arbitrary criteria.
Read in the bucket file contents (just filenames) and process as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove entries as you go, providing a bit more randomness.
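A rough sketch of the bucketing step (hypothetical file names; it shards an input listing into n bucket files by line hash):
// Shard a large listing file into n bucket files, keyed by each line's hash.
static void shardListing(String listingFile, int n) throws IOException {
    PrintWriter[] buckets = new PrintWriter[n];
    for (int i = 0; i < n; i++)
        buckets[i] = new PrintWriter(new FileWriter("bucket-" + i + ".txt"));
    try (BufferedReader in = new BufferedReader(new FileReader(listingFile))) {
        String line;
        while ((line = in.readLine()) != null)
            buckets[Math.floorMod(line.hashCode(), n)].println(line);
    } finally {
        for (PrintWriter w : buckets)
            if (w != null)
                w.close();
    }
}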
Alternatively, you could pad and use the random access idea, removing indices/offsets from a list as they're picked.
I'm working with a very big text file (755Mb).
I need to sort the lines (about 1890000) and then write them back in another file.
I already noticed that discussion that has a starting file really similar to mine:
Sorting Lines Based on words in them as keys
The problem is that I cannot store the lines in a collection in memory because I get a Java heap space exception, even after expanding the heap to its maximum (already tried!).
I can't open it with Excel and use the sorting feature either, because the file is too large to be loaded completely.
I thought about using a DB, but I think that writing all the lines and then querying them with a SELECT would take too long in terms of execution time. Am I wrong?
Any hints appreciated
Thanks in advance
I think the solution here is to do a merge sort using temporary files:
Read the first n lines of the first file (n being the number of lines you can afford to store and sort in memory), sort them, and write them to file 1.tmp (or however you call it). Do the same with the next n lines and store them in 2.tmp. Repeat until all lines of the original file have been processed.
Read the first line of each temporary file. Determine the smallest one (according to your sort order), write it to the destination file, and read the next line from the corresponding temporary file. Repeat until all lines have been processed.
Delete all the temporary files.
This works with arbitrarily large files, as long as you have enough disk space.
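A compact sketch of those steps (hypothetical method names; a real implementation would add more robust error handling):
// External merge sort following the steps above.
static void externalSort(File input, File output, int maxLinesInMemory) throws IOException {
    // Step 1: split the input into sorted chunk files.
    List<File> chunks = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(new FileReader(input))) {
        List<String> buffer = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == maxLinesInMemory)
                chunks.add(writeSortedChunk(buffer)); // sorts, writes and clears the buffer
        }
        if (!buffer.isEmpty())
            chunks.add(writeSortedChunk(buffer));
    }
    // Step 2: repeatedly pick the smallest current line among all chunk readers.
    BufferedReader[] readers = new BufferedReader[chunks.size()];
    String[] heads = new String[chunks.size()];
    for (int i = 0; i < readers.length; i++) {
        readers[i] = new BufferedReader(new FileReader(chunks.get(i)));
        heads[i] = readers[i].readLine();
    }
    try (PrintWriter out = new PrintWriter(new FileWriter(output))) {
        while (true) {
            int min = -1;
            for (int i = 0; i < heads.length; i++)
                if (heads[i] != null && (min < 0 || heads[i].compareTo(heads[min]) < 0))
                    min = i;
            if (min < 0)
                break; // all chunks exhausted
            out.println(heads[min]);
            heads[min] = readers[min].readLine();
        }
    }
    // Step 3: clean up the temporary files.
    for (int i = 0; i < readers.length; i++) {
        readers[i].close();
        chunks.get(i).delete();
    }
}

private static File writeSortedChunk(List<String> buffer) throws IOException {
    Collections.sort(buffer);
    File tmp = File.createTempFile("chunk", ".tmp");
    try (PrintWriter w = new PrintWriter(new FileWriter(tmp))) {
        for (String s : buffer)
            w.println(s);
    }
    buffer.clear();
    return tmp;
}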
You can run the following with
-mx1g -XX:+UseCompressedStrings # on Java 6 update 29
-mx1800m -XX:-UseCompressedStrings # on Java 6 update 29
-mx2g # on Java 7 update 2.
import java.io.*;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
public class Main {
public static void main(String... args) throws IOException {
long start = System.nanoTime();
generateFile("lines.txt", 755 * 1024 * 1024, 189000);
List<String> lines = loadLines("lines.txt");
System.out.println("Sorting file");
Collections.sort(lines);
System.out.println("... Sorted file");
// save lines.
long time = System.nanoTime() - start;
System.out.printf("Took %.3f second to read, sort and write to a file%n", time / 1e9);
}
private static void generateFile(String fileName, int size, int lines) throws FileNotFoundException {
System.out.println("Creating file to load");
int lineSize = size / lines;
StringBuilder sb = new StringBuilder();
while (sb.length() < lineSize) sb.append('-');
String padding = sb.toString();
PrintWriter pw = new PrintWriter(fileName);
for (int i = 0; i < lines; i++) {
String text = (i + padding).substring(0, lineSize);
pw.println(text);
}
pw.close();
System.out.println("... Created file to load");
}
private static List<String> loadLines(String fileName) throws IOException {
System.out.println("Reading file");
BufferedReader br = new BufferedReader(new FileReader(fileName));
List<String> ret = new ArrayList<String>();
String line;
while ((line = br.readLine()) != null)
ret.add(line);
System.out.println("... Read file.");
return ret;
}
}
prints
Creating file to load
... Created file to load
Reading file
... Read file.
Sorting file
... Sorted file
Took 4.886 second to read, sort and write to a file
Divide and conquer is the best solution :)
Divide your file into smaller ones, sort each file separately, then regroup.
Links:
Sort a file with huge volume of data given memory constraint
http://hackerne.ws/item?id=1603381
Algorithm:
1. How much memory do we have available? Let's assume we have X MB of memory available.
2. Divide the file into K chunks, where X * K = 2 GB. Bring each chunk into memory, sort the lines as usual using any O(n log n) algorithm, and save them back to the file. Then bring the next chunk into memory and sort it.
3. Once we're done, merge the chunks one by one.
The above algorithm is also known as external sort. Step 3 is known as N-way merge.
Why don't you try multithreading and increasing the heap size of the program you are running? (This also requires a merge-sort-like approach, provided you have more memory than 755 MB in your system.)
Maybe you can use Perl to format the file and load it into a database like MySQL. It's very fast, and you can use an index to query the data and write the result to another file.
You can set the JVM heap size like '-Xms256m -Xmx1024m'. I hope this helps, thanks.
I'm trying to write many files to a directory, and when the directory reaches X number of files, I want the least recently accessed file to be deleted before writing the new file. I don't really want to roll my own solution to this because I'd imagine someone else has already done this before. Are there existing solutions to this? Note this is for a windows application.
This is related to my question Java ehcache disk store, but I'm asking this question separately since now I'm focusing on a file caching solution.
Thanks,
Jeff
I would roll my own, because the problem sounds so easy that writing it yourself is probably easier than trying to learn and adopt an existing library :-)
If it's a low number of files and/or your cache is accessed from multiple processes, call the following method before writing a file:
void deleteOldFiles(String dir, long maxFileCount) {
while (true) {
File oldest = null;
long oldestTime = 0;
File[] list = new File(dir).listFiles();
if (list.length < maxFileCount) {
break;
}
for (File f : list) {
long m = f.lastModified();
if (oldest == null || oldestTime > m) {
oldestTime = m;
oldest = f;
}
}
oldest.delete();
}
}
If you only access the cache from one process, you could write something more efficient using LinkedHashMap or LinkedHashSet.
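For the single-process case, a sketch with an access-ordered LinkedHashMap (hypothetical class; eviction deletes the least recently accessed file):
// LRU index of cached files: the eldest entry's file is deleted on overflow.
// (Needs java.util.LinkedHashMap and java.util.Map imports; java.io.File for the values.)
class FileCacheIndex extends LinkedHashMap<String, File> {
    private final int maxFiles;

    FileCacheIndex(int maxFiles) {
        super(16, 0.75f, true);          // true = access order, i.e. LRU
        this.maxFiles = maxFiles;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, File> eldest) {
        if (size() > maxFiles) {
            eldest.getValue().delete();  // remove the file on disk as well
            return true;                 // and drop it from the map
        }
        return false;
    }
}
Every write would then be a put(name, file) and every read a get(name), so "least recently accessed" is tracked automatically by the map.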
Update
Check the number of files instead of the total file size.
You can try this before creating a new file:
void deleteOldFiles(String dir, int maxFiles) {
File fdir = new File(dir);
while (true) {
// Check number of files. Also do nothing if maxFiles == 0
File[] files = fdir.listFiles();
if (maxFiles == 0 || files.length < maxFiles)
break;
// Delete oldest
File oldest = files[0];
for (int i = 1; i < files.length; i++) {
if (files[i].lastModified() < oldest.lastModified()) {
oldest = files[i];
}
}
oldest.delete();
}
}
This would not be efficient for a large number of files, though. In that case I would keep a list of files in the directory, sorted by creation time.
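For example, a sketch of such an index using last-modified time as a stand-in for creation time (hypothetical variables dir, maxFiles and newFile; identical timestamps would need extra handling, e.g. a multimap):
// In-memory index of cache files, ordered by last-modified time (single process only).
TreeMap<Long, File> byAge = new TreeMap<>();
for (File f : new File(dir).listFiles()) // fill once at startup (may be null if dir is not a directory)
    byAge.put(f.lastModified(), f);

// before writing a new file: evict until there is room
while (!byAge.isEmpty() && byAge.size() >= maxFiles) {
    File oldest = byAge.pollFirstEntry().getValue();
    oldest.delete();
}
// after writing the new file:
byAge.put(newFile.lastModified(), newFile);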
Although all of this gets into the 'roll my own category'...
If you were using Cacheonix you could hook up to the cache events API and remove the files when receiving the notification that a cache entry was evicted by the LRU algorithm.
How do I count the number of files in a directory using Java? For simplicity, let's assume that the directory doesn't have any sub-directories.
I know the standard method of :
new File(<directory path>).listFiles().length
But this will effectively go through all the files in the directory, which might take a long time if the number of files is large. Also, I don't care about the actual files in the directory unless their number is greater than some fixed large number (say 5000).
I am guessing, but doesn't the directory (or its i-node in case of Unix) store the number of files contained in it? If I could get that number straight away from the file system, it would be much faster. I need to do this check for every HTTP request on a Tomcat server before the back-end starts doing the real processing. Therefore, speed is of paramount importance.
I could run a daemon every once in a while to clear the directory. I know that, so please don't give me that solution.
Ah... the rationale for not having a straightforward method in Java to do that is file storage abstraction: some filesystems may not have the number of files in a directory readily available... that count may not even have any meaning at all (see for example distributed, P2P filesystems, fs that store file lists as a linked list, or database-backed filesystems...).
So yes,
new File(<directory path>).list().length
is probably your best bet.
Since Java 8, you can do that in three lines:
try (Stream<Path> files = Files.list(Paths.get("your/path/here"))) {
long count = files.count();
}
Regarding the 5000 child nodes and inode aspects:
This method will iterate over the entries but as Varkhan suggested you probably can't do better besides playing with JNI or direct system commands calls, but even then, you can never be sure these methods don't do the same thing!
However, let's dig into this a little:
Looking at JDK8 source, Files.list exposes a stream that uses an Iterable from Files.newDirectoryStream that delegates to FileSystemProvider.newDirectoryStream.
On UNIX systems (decompiled sun.nio.fs.UnixFileSystemProvider.class), it loads an iterator: A sun.nio.fs.UnixSecureDirectoryStream is used (with file locks while iterating through the directory).
So, there is an iterator that will loop through the entries here.
Now, let's look to the counting mechanism.
The actual count is performed by the count/sum reducing API exposed by Java 8 streams. In theory, this API can perform parallel operations without much effort (with multithreading). However, the stream is created with parallelism disabled, so it's a no-go...
The good side of this approach is that it won't load the array in memory as the entries will be counted by an iterator as they are read by the underlying (Filesystem) API.
Finally, for the information: conceptually, in a filesystem, a directory node is not required to hold the number of files that it contains; it can just contain the list of its child nodes (list of inodes). I'm not an expert on filesystems, but I believe that UNIX filesystems work just like that. So you can't assume there is a way to get this information directly (i.e., there can always be some list of child nodes hidden somewhere).
Unfortunately, I believe that is already the best way (although list() is slightly better than listFiles(), since it doesn't construct File objects).
This might not be appropriate for your application, but you could always try a native call (using jni or jna), or exec a platform-specific command and read the output before falling back to list().length. On *nix, you could exec ls -1a | wc -l (note - that's dash-one-a for the first command, and dash-lowercase-L for the second). Not sure what would be right on windows - perhaps just a dir and look for the summary.
Before bothering with something like this I'd strongly recommend you create a directory with a very large number of files and just see if list().length really does take too long. As this blogger suggests, you may not want to sweat this.
I'd probably go with Varkhan's answer myself.
Since you don't really need the total number, and in fact want to perform an action after a certain number (in your case 5000), you can use java.nio.file.Files.newDirectoryStream. The benefit is that you can exit early instead having to go through the entire directory just to get a count.
public boolean isOverMax(){
Path dir = Paths.get("C:/foo/bar");
int i = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
for (Path p : stream) {
//larger than max files, exit
if (++i > MAX_FILES) {
return true;
}
}
} catch (IOException ex) {
ex.printStackTrace();
}
return false;
}
The interface doc for DirectoryStream also has some good examples.
If you have directories containing really (>100'000) many files, here is a (non-portable) way to go:
String directoryPath = "a path";
// -f flag is important, because this way ls does not sort it output,
// which is way faster
String[] params = { "/bin/sh", "-c",
"ls -f " + directoryPath + " | wc -l" };
Process process = Runtime.getRuntime().exec(params);
BufferedReader reader = new BufferedReader(new InputStreamReader(
process.getInputStream()));
int fileCount = Integer.parseInt(reader.readLine().trim()) - 2; // accounting for .. and .
reader.close();
System.out.println(fileCount);
Using sigar should help. Sigar has native hooks to get the stats
new Sigar().getDirStat(dir).getTotal()
This method works for me very well.
// Recursive method to traverse files and folders and print their information.
// Assumes filesCount and allStats are fields of the enclosing class, and FilePropBean is a custom bean.
public static void listFiles(String directoryName) {
File file = new File(directoryName);
File[] fileList = file.listFiles(); // List files inside the main dir
int j;
String extension;
String fileName;
if (fileList != null) {
for (int i = 0; i < fileList.length; i++) {
extension = "";
if (fileList[i].isFile()) {
fileName = fileList[i].getName();
if (fileName.lastIndexOf(".") != -1 && fileName.lastIndexOf(".") != 0) {
extension = fileName.substring(fileName.lastIndexOf(".") + 1);
System.out.println("THE " + fileName + " has the extension = " + extension);
} else {
extension = "Unknown";
System.out.println("extension2 = " + extension);
}
filesCount++;
allStats.add(new FilePropBean(filesCount, fileList[i].getName(), fileList[i].length(), extension,
fileList[i].getParent()));
} else if (fileList[i].isDirectory()) {
filesCount++;
extension = "";
allStats.add(new FilePropBean(filesCount, fileList[i].getName(), fileList[i].length(), extension,
fileList[i].getParent()));
listFiles(String.valueOf(fileList[i]));
}
}
}
}
Unfortunately, as mmyers said, File.list() is about as fast as you are going to get using Java. If speed is as important as you say, you may want to consider doing this particular operation using JNI. You can then tailor your code to your particular situation and filesystem.
// Assumes static imports of java.util.stream.Stream.of and java.io.File.listRoots
public void shouldGetTotalFilesCount() {
Integer reduce = of(listRoots()).parallel().map(this::getFilesCount).reduce(0, ((a, b) -> a + b));
}
private int getFilesCount(File directory) {
File[] files = directory.listFiles();
return Objects.isNull(files) ? 1 : Stream.of(files)
.parallel()
.reduce(0, (Integer acc, File p) -> acc + getFilesCount(p), (a, b) -> a + b);
}
Count files in directory and all subdirectories.
var path = Path.of("your/path/here");
try (var stream = Files.walk(path)) {
    long count = stream.filter(Files::isRegularFile).count();
}
In Spring Batch I did the following:
private int getFilesCount() throws IOException {
ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
Resource[] resources = resolver.getResources("file:" + projectFilesFolder + "/**/input/splitFolder/*.csv");
return resources.length;
}