I have a folder with 100 .txt files of 20+ MB each.
All the files have about 2*10^5 lines of UTF-8 encoded text.
What is the fastest way, possibly using multithreading, to find which files contain a fixed key string? (The criterion for "contains" is the same as Java's .contains() method, i.e. an ordinary substring match.)
I found several approaches here on SO, but none of them used multithreading (why?), and their speed seems to vary depending on the requirements; I can't work out which of the approaches is best for me.
For example this super-complex approach:
https://codereview.stackexchange.com/questions/44021/fast-way-of-searching-for-a-string-in-a-text-file
seems to be about twice as slow as a simple line-by-line search with a BufferedReader and .contains(). How can that be?
And how can I use multithreading to its full potential? The program is run on a very powerful multicore machine.
The output I'm looking for is which files contain the string, and, possibly, at which line.
The following code does the job.
It walks your directory and collects all the files.
It then creates a new thread for each file and searches it for the target string.
Make sure to change the folder path and the target string in the TheThread class as needed.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// class used for the threads
class TheThread implements Runnable {

    // shared counter; AtomicInteger so concurrent threads never claim the same index
    final AtomicInteger counter = new AtomicInteger(0);

    // to get the stream of paths
    Stream<Path> streamOfFiles = Files.walk(Paths.get("./src/Multi_tasking/Files"));

    // list of all paths in the folder (index 0 is the folder itself)
    List<Path> listOfFiles = streamOfFiles.collect(Collectors.toList());

    // because Files.walk may throw IOException
    public TheThread() throws IOException {
    }

    @Override
    public void run() {
        // atomically claims the next index of the list (skipping index 0, the folder itself)
        int index = counter.incrementAndGet();
        // search the file at that index for the target string
        SearchTextInMultipleFilesUsingMultiThreading.lookIn(listOfFiles.get(index), "target String");
    }
}

public class SearchTextInMultipleFilesUsingMultiThreading {

    // method responsible for searching the target String in a file
    public static void lookIn(Path path, String text) {
        try {
            List<String> texts = Files.readAllLines(path);
            boolean flag = false;
            for (int i = 0; i < texts.size(); i++) {
                String str = texts.get(i);
                if (str.contains(text)) {
                    System.out.println("Found \"" + text + "\" in " + path.getFileName() + " at line : " + (i + 1) + " from thread : " + Thread.currentThread().getName());
                    flag = true;
                }
            }
            if (!flag) {
                System.out.println("\"" + text + "\" not found in " + path.getFileName() + " through thread : " + Thread.currentThread().getName());
            }
        } catch (IOException e) {
            System.out.println("Error while reading " + path.getFileName());
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws IOException {
        // creating an object of our thread class
        TheThread theThread = new TheThread();

        // number of files in the folder (Files.walk also returns the folder itself, hence -1)
        int numberOfFiles = theThread.listOfFiles.size() - 1;

        // if the folder doesn't contain any file at all
        if (numberOfFiles == 0) {
            System.out.println("No file found in the folder");
            System.exit(0);
        }

        // creating the List to store threads
        List<Thread> listOfThreads = new ArrayList<>();

        // keeping the required number of threads inside the list
        for (int i = 0; i < numberOfFiles; i++) {
            listOfThreads.add(new Thread(theThread));
        }

        // starting all the threads
        for (Thread thread : listOfThreads) {
            thread.start();
        }
    }
}
I'll let the answers to the other questions speak for themselves, but multithreading is unlikely to be helpful for I/O-bound tasks with data stored on a single disk. Assuming your folder is stored on a single disk, the access pattern that disk caches are most optimized for is single-threaded access, so that's likely to be the most effective solution. The reason is that reading the data from disk is likely to be slower than looking through the data once it's loaded into memory, so the disk read is rate limiting.
The simple solution with a BufferedReader and the contains() function may well be the fastest since this is library code that is likely highly optimized.
Now, if your data were sharded across multiple disks, it might be worthwhile to run multiple threads, depending on how the operating system handles disk caching. If you were going to do multiple searches for different strings, not all known at the time of the first search (so that a single-pass approach wouldn't work), it might be worthwhile to load all the files into memory and then do multithreaded searches in memory only. But then your problem isn't really a file search problem any more, but a more general data search problem.
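For reference, a minimal single-threaded sketch of the BufferedReader + contains() approach discussed above; the folder path and the key string are placeholders:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class SingleThreadSearch {
    public static void main(String[] args) throws IOException {
        Path folder = Paths.get("path/to/folder");   // placeholder
        String key = "needle";                        // placeholder

        try (Stream<Path> files = Files.list(folder)) {
            files.filter(Files::isRegularFile).forEach(file -> {
                try (BufferedReader reader = Files.newBufferedReader(file)) {
                    String line;
                    int lineNo = 0;
                    while ((line = reader.readLine()) != null) {
                        lineNo++;
                        if (line.contains(key)) {
                            System.out.println(file.getFileName() + " : line " + lineNo);
                        }
                    }
                } catch (IOException e) {
                    System.err.println("Could not read " + file + ": " + e);
                }
            });
        }
    }
}

It reads each file once, line by line, and prints the file name and line number of every match, which is exactly the output the question asks for.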
I have a directory with many files and want to filter the ones with a certain name and save them in the fileList ArrayList. It works this way, but it takes a lot of time. Is there a way to make this faster?
String processingDir = "C:/Users/Ferid/Desktop/20181024";
String corrId = "00a3d321-171c-484a-ad7c-74e22ffa3625";
Path dirPath = Paths.get(processingDir);
ArrayList<Path> fileList;
try (Stream<Path> paths = Files.walk(dirPath))
{
fileList = paths.filter(t -> (t.getFileName().toString().indexOf("EPX_" +
corrId + "_") >= 0)).collect(Collectors.toCollection(ArrayList::new));
}
Walking the directory in the try header does not take much time, but collecting into fileList does, and I cannot tell exactly which operation has the poor performance or which of them to improve. (This is not the complete code, of course, just the relevant parts.)
From java.nio.file.Files.walk(Path) api:
Return a Stream that is lazily populated with Path by walking the file
tree rooted at a given starting file.
That's why it gives you the impression that "walking through the directory in the try condition is not taking much time".
Actually, the real work is mostly done during collect, and it is not the collect mechanism's fault that it is slow.
If scanning the files each time is too slow you can build an index of files, either on startup or persisted and maintained as files change.
You could use a Watch Service to be notified when files are added or removed while the program is running.
This would be much faster to query as it would be entirely in memory. It would take the same amount of time to load the first time, but it could be loaded in the background before you first need it (a sketch of the Watch Service part follows the example below).
e.g.
static Map<String, List<Path>> pathMap;
public static void initPathMap(String processingDir) throws IOException {
try (Stream<Path> paths = Files.walk(Paths.get(processingDir))) {
pathMap = paths.collect(Collectors.groupingBy(
p -> getCorrId(p.getFileName().toString())));
}
pathMap.remove(""); // remove entries without a corrId.
}
private static String getCorrId(String fileName) {
int start = fileName.indexOf("EPX_");
if (start < 0)
return "";
int end = fileName.indexOf("_", start + 4);
if (end < 0)
return "";
return fileName.substring(start + 4, end);
}
// later
String corrId = "00a3d321-171c-484a-ad7c-74e22ffa3625";
List<Path> pathList = pathMap.get(corrId); // very fast.
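For the Watch Service idea mentioned above, a rough sketch of keeping the index current while the program runs; it reuses pathMap and getCorrId from the example, assumes the lists in pathMap are mutable, and a real version would run this on its own thread with proper error handling:

// assumes java.nio.file.* and java.util.* imports plus the pathMap/getCorrId above
static void watchForChanges(Path dir) throws IOException, InterruptedException {
    WatchService watcher = FileSystems.getDefault().newWatchService();
    dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE, StandardWatchEventKinds.ENTRY_DELETE);
    while (true) {
        WatchKey key = watcher.take();                       // blocks until something changes
        for (WatchEvent<?> event : key.pollEvents()) {
            Path name = (Path) event.context();              // file name relative to dir
            if (name == null) continue;                      // e.g. OVERFLOW events
            String corrId = getCorrId(name.toString());
            if (corrId.isEmpty()) continue;                  // no corrId in this file name
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                pathMap.computeIfAbsent(corrId, k -> new ArrayList<>()).add(dir.resolve(name));
            } else if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                List<Path> list = pathMap.get(corrId);
                if (list != null) list.remove(dir.resolve(name));
            }
        }
        if (!key.reset()) break;                             // directory no longer watchable
    }
}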
You can make this code cleaner by writing the following, however, I wouldn't expect it to be much faster.
List<Path> fileList;
try (Stream<Path> paths = Files.walk(dirPath)) {
String find = "EPX_" + corrId + "_"; // only calculate this once
fileList = paths.filter(t -> t.getFileName().toString().contains(find))
.collect(Collectors.toList());
}
The cost is in the time taken to scan the files of the directory. The cost of processing the file names is far, far less.
Using an SSD, or only scanning directories already cached in memory would speed this up dramatically.
One way to test this is to perform the operation more than once after a clean boot (so it's not cached). The amount the first run takes longer tells you how much time was spent loading the data from disk.
I got a "java.lang.OutOfMemoryError: Java heap space", and I have read and understood that I can increase the memory using -Xmx1024m. But I think I can change something in my code so that this error does not happen anymore.
First, here is the image from VisualVM showing my memory:
In the image you can see that the "Pedidos" object is not that big, and I have another object, "Enderecos", with more or less the same size, but it is not complete because I get the error before the object is completed.
The point is:
I have 2 classes that read a big CSV file (400,000 values each); I will show the code. I tried to use the garbage collector and set variables to null, but it is not working. Can anyone help me, please? Here is the code for the "Pedidos" class; the "Enderecos" class is the same, just with different names and a different CSV, and my project simply calls these 2 classes.
// all Imports
import java.io.BufferedReader;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import javax.swing.JOptionPane;
import Objetos.Pedido;
// CLASS
public class GerenciadorPedido{
// ArrayList I will add all the "Pedidos" Objects
ArrayList<Pedido> listaPedidos = new ArrayList<Pedido>();
// Int that I need to use the values correctly
int helper;
// These are fields because I didn't want to create a new String every time the loop runs (trying to use less memory)
String Campo[];
String Linha;
String newLinha;
public ArrayList<Pedido> getListaPedidos() throws IOException {
// Here I replace "\" with "/" so the path is accepted by the File (the CSV address)
String Enderecotemp = System.getProperty("user.dir"), Endereco = "";
char a;
for (int i = 0; i < Enderecotemp.length(); i++) {
a = Enderecotemp.charAt(i);
if (a == '\\') a = '/';
Endereco = Endereco + String.valueOf(a);
}
Endereco = Endereco + "/Pedido.csv";
// Open the CSV File and the reader to read it
File NovoArquivo = new File(Endereco);
Reader FileLer = null;
// Try to read the File
try
{
FileLer = new FileReader(NovoArquivo);
}
catch(FileNotFoundException e) {
JOptionPane.showMessageDialog(null, "Erro, fale com o Vini <Arquivo de Pedido Não Encontrado>");
}
// Read the File
BufferedReader Lendo = new BufferedReader(FileLer);
try
{
// Do for each line of the csv
while (Lendo.ready()) {
// Read the line and replace the characters (needed for the functionality to work)
Linha = Lendo.readLine();
newLinha = Linha.replaceAll("\"", "");
newLinha = newLinha.replaceAll(",,", ", , ");
newLinha = newLinha.replaceAll(",,", ", , ");
newLinha = newLinha + " ";
// Create Campo[x] for each value between ","
Campo = newLinha.split(",");
// Object
Pedido pedido = new Pedido();
helper = 0;
// Fill the object with the right values when Campo.length is 15, 16, 17, 18 or 19
switch (Campo.length) {
case 15: pedido.setAddress1(Campo[9]);
break;
case 16: pedido.setAddress1(Campo[9] + Campo[10]);
helper = 1;
break;
case 17: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11]);
helper = 2;
break;
case 18: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11] + Campo[12]);
helper = 3;
break;
case 19: pedido.setAddress1(Campo[9] + Campo[10] + Campo[11] + Campo[12] + Campo[13]);
helper = 4;
break;
}
// Complete the Object
pedido.setOrder(Campo[0]);
pedido.setOrderValue(Float.parseFloat(Campo[1]));
pedido.setOrderPv(Float.parseFloat(Campo[2]));
pedido.setCombinedOrderFlag(Campo[3]);
pedido.setCombineOrder(Campo[4]);
pedido.setOrderType(Campo[5]);
pedido.setOrderShipped(Campo[6]);
pedido.setOrderCancelled(Campo[7]);
pedido.setTransactionType(Campo[8]);
pedido.setAddress2(Campo[10 + helper]);
pedido.setAddress3(Campo[11 + helper]);
pedido.setPost(Campo[12 + helper]);
pedido.setCity(Campo[13 + helper]);
pedido.setState(Campo[14 + helper]);
// Add the object in the ArrayList
listaPedidos.add(pedido);
// Set everything to null to start again
Campo = null;
Linha = null;
newLinha = null;
}
}
catch(IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
finally
{
// Close the file and run garbage collector to try to clear the trash
Lendo.close();
FileLer.close();
System.gc();
}
// return the ArrayList.
return listaPedidos;
}
}
The project runs this class, but when it tries to run the other one (the same as this, just with different names and CSV), I get the memory error. I don't know how to clear the char[] and String entries that are so big, as you can see in the image. Any new ideas? Is it really impossible without increasing the memory?
As is being discussed in the comments already, the main factor is your program places everything in memory at the same time. That design will inherently limit the size of the files you can process.
The way garbage collection works is that only garbage is collected. Any object that is referenced by another is not garbage. So, starting with the "root" objects (anything declared as static or local variables currently on the stack), follow the references. Your GerenciadorPedido instance is surely referenced from main(). It references a list listaPedidos. That list references (many) instances of Pedido each of which references many String instances. Those objects will all remain in memory while they are reachable through the list.
The way to design your program so it doesn't have a limit on the size of the file it can process is to eliminate the list entirely. Don't read the entire file and return a list (or other collection). Instead implement an Iterator. Read one line from the CSV file, create the Pedido, return it. When the program is finished with that one, then read the next line and create the next Pedido. Then you will have only one of these objects in memory at any given time.
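A minimal sketch of that streaming idea, assuming the same Pedido class; the actual field parsing from the original loop is elided behind parse():

import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

// Yields one Pedido at a time instead of holding the whole file in a List.
public class PedidoIterator implements Iterator<Pedido>, AutoCloseable {
    private final BufferedReader reader;
    private String nextLine;

    public PedidoIterator(String csvPath) throws IOException {
        reader = Files.newBufferedReader(Paths.get(csvPath));
        nextLine = reader.readLine();           // read ahead one line
    }

    @Override
    public boolean hasNext() {
        return nextLine != null;
    }

    @Override
    public Pedido next() {
        String line = nextLine;
        try {
            nextLine = reader.readLine();       // advance to the following line
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parse(line);                     // apply the same parsing as in the original loop
    }

    private Pedido parse(String line) {
        Pedido pedido = new Pedido();
        // ... same splitting and field assignment as before ...
        return pedido;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }
}

The caller processes each Pedido as it is returned, so only one record is alive at any time and the garbage collector can reclaim the previous ones.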
Some additional notes regarding your current algorithm:
every String object references a char[] internally that contains the characters
ArrayList has very poor memory usage characteristics when adding to a large list. Since it is backed by an array, in order to grow to add the new element, it must create an entirely new array larger than the current one then copy all of the references. During the process it will use double the memory that it needs. This also becomes increasingly slower the larger the list is.
One solution is to tell the ArrayList how large you will need it to be so you can avoid resizing. This is only applicable if you actually know how large you will need it to be. If you need 100 elements: new ArrayList<>(100).
Another solution is to use a different data structure. A LinkedList is better for adding elements one at a time because it does not need to allocate and copy an entire array.
Each call to .replaceAll() will create a new char[] for the new String object. Since you then orphan the previous String object, it will get garbage collected. Just be aware of this need for allocation.
Each string concatenation (eg newLinha + " " or Campo[9] + Campo[10]) will create a new StringBuilder object, append the two strings, then create a new String object. This, again, can have an impact when repeated for large amounts of data.
You should, in general, never need to call System.gc(). It is okay to call it, but the system will perform garbage collection whenever memory is needed.
One additional note: your approach to parsing the CSV will fail when the data contains characters you aren't expecting, in particular if any of the fields contains a comma. I recommend using an existing CSV parsing library for a simple solution that correctly handles the entire definition of CSV. (I have had good experience with opencsv.)
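If you go the opencsv route, the core loop looks roughly like this; treat it as a sketch, since the exact exception signatures differ between opencsv versions:

import com.opencsv.CSVReader;
import java.io.FileReader;

class OpencsvSketch {
    // Reads one record at a time; opencsv handles quoted fields and embedded commas.
    // Newer opencsv versions also declare CsvValidationException on readNext(),
    // hence the broad "throws Exception" here.
    static void readPedidos(String csvPath) throws Exception {
        try (CSVReader csv = new CSVReader(new FileReader(csvPath))) {
            String[] fields;
            while ((fields = csv.readNext()) != null) {
                // build one Pedido from fields[0], fields[1], ... and process it
            }
        }
    }
}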
I need the advice from someone who knows Java very well and the memory issues.
I have a large file (something like 1.5 GB) and I need to cut this file into many smaller files (100 small files, for example).
I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding the memory, or tips how to do it faster.
My file contains text, it is not binary, and I have about 20 characters per line.
To save memory, do not unnecessarily store/duplicate the data in memory (i.e. do not assign them to variables outside the loop). Just process the output immediately as soon as the input comes in.
It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest. It will at most cost a few % of performance. The same applies to using NIO. It will only improve scalability, not memory use. It only becomes interesting when you have hundreds of threads running on the same file.
Just loop through the file, write every line immediately to the other file as you read it in, count the lines, and when the count reaches 100, switch to the next file, et cetera.
Kickoff example:
String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;
try {
reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
int count = 0;
for (String line; (line = reader.readLine()) != null;) {
if (count++ % maxlines == 0) {
close(writer);
writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
}
writer.write(line);
writer.newLine();
}
} finally {
    close(writer);
    close(reader);
}

// null-safe close helper used above; ignores the exception on close
private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException ignore) {
        }
    }
}
First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).
Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).
If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.
You can consider using memory-mapped files, via FileChannel.
Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.
Related answer: Java NIO FileChannel versus FileOutputstream performance / usefulness
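A minimal sketch of memory-mapping a file with FileChannel (the path is a placeholder); whether it actually beats a plain buffered stream depends on the platform and access pattern:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class MappedScan {
    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(Paths.get("/bigfile.txt"), StandardOpenOption.READ)) {
            // a single mapping is limited to about 2 GB; larger files need several mappings
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long newlines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') newlines++;   // e.g. count lines to decide split points
            }
            System.out.println("lines: " + newlines);
        }
    }
}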
This is a very good article:
http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/
In summary, for great performance, you should:
Avoid accessing the disk.
Avoid accessing the underlying operating system.
Avoid method calls.
Avoid processing bytes and characters individually.
For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
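For instance, a reader with an explicit 1 MB buffer; the path and buffer size here are only illustrative:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

class BigBufferRead {
    static void scan(String path) throws IOException {
        // Explicit 1 MB buffer so the underlying stream is read in large sequential chunks.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), "UTF-8"), 1 << 20)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // process the line
            }
        }
    }
}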
Does it have to be done in Java? I.e. does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command from your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
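If shelling out is acceptable, a sketch along these lines could work; the file names are placeholders and split must be available on the PATH:

import java.io.IOException;

class SplitViaShell {
    public static void main(String[] args) throws IOException, InterruptedException {
        // split the file into chunks of 100 lines each, named smallfile_aa, smallfile_ab, ...
        Process p = new ProcessBuilder("split", "-l", "100", "/bigfile.txt", "smallfile_")
                .inheritIO()
                .start();
        int exit = p.waitFor();
        System.out.println("split exited with " + exit);
    }
}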
You can use java.nio which is faster than classical Input/Output stream:
http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
Yes.
I also think that using read() with arguments, like read(char[], int offset, int length), is a better way to read such a large file
(e.g. read(buffer, 0, buffer.length)).
I have also run into missing values when using a BufferedReader instead of a BufferedInputStream on a binary data stream, so a BufferedInputStream is a much better fit in that case.
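A small sketch of that chunked read; the buffer size and path are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class ChunkedRead {
    static void scanInChunks(String path) throws IOException {
        char[] buffer = new char[8192];
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            int read;
            // read(char[], offset, length) fills up to buffer.length chars per call
            while ((read = reader.read(buffer, 0, buffer.length)) != -1) {
                // process buffer[0 .. read-1]
            }
        }
    }
}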
package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import junit.framework.TestCase;
/**
* @author Naresh Bhabat
*
Following implementation helps to deal with extra large files in java.
This program is tested for dealing with 2GB input file.
There are some points where extra logic can be added in future.
Please note: if we want to deal with a binary input file, then instead of reading lines we need to read bytes from the file object.
It uses random access file,which is almost like streaming API.
* ****************************************
Notes regarding executor framework and its readings.
Please note :ExecutorService executor = Executors.newFixedThreadPool(10);
* for 10 threads:Total time required for reading and writing the text in
* :seconds 349.317
*
* For 100:Total time required for reading the text and writing : seconds 464.042
*
* For 1000 : Total time required for reading and writing text :466.538
* For 10000 Total time required for reading and writing in seconds 479.701
*
*
*/
public class DealWithHugeRecordsinFile extends TestCase {
static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
static volatile RandomAccessFile fileToWrite;
static volatile RandomAccessFile file;
static volatile String fileContentsIter;
static volatile int position = 0;
public static void main(String[] args) throws IOException, InterruptedException {
long currentTimeMillis = System.currentTimeMillis();
try {
fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles
file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles
seriouslyReadProcessAndWriteAsynch();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
Thread currentThread = Thread.currentThread();
System.out.println(currentThread.getName());
long currentTimeMillis2 = System.currentTimeMillis();
double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
System.out.println("Total time required for reading the text in seconds " + time_seconds);
}
/**
* @throws IOException
* Something asynchronously serious
*/
public static void seriouslyReadProcessAndWriteAsynch() throws IOException {
ExecutorService executor = Executors.newFixedThreadPool(10);//pls see for explanation in comments section of the class
while (true) {
String readLine = file.readLine();
if (readLine == null) {
break;
}
Runnable genuineWorker = new Runnable() {
@Override
public void run() {
// do hard processing here in this thread,i have consumed
// some time and ignore some exception in write method.
writeToFile(FILEPATH_WRITE, readLine);
// System.out.println(" :" +
// Thread.currentThread().getName());
}
};
executor.execute(genuineWorker);
}
executor.shutdown();
while (!executor.isTerminated()) {
}
System.out.println("Finished all threads");
file.close();
fileToWrite.close();
}
/**
* @param filePath
* @param data
* @param position
*/
private static void writeToFile(String filePath, String data) {
try {
// fileToWrite.seek(position);
data = "\n" + data;
if (!data.contains("Randomization")) {
return;
}
System.out.println("Let us do something time consuming to make this thread busy"+(position++) + " :" + data);
System.out.println("Lets consume through this loop");
int i=1000;
while(i>0){
i--;
}
fileToWrite.write(data.getBytes());
throw new Exception();
} catch (Exception exception) {
System.out.println("exception was thrown but still we are able to proceeed further"
+ " \n This can be used for marking failure of the records");
//exception.printStackTrace();
}
}
}
Don't use read() without arguments.
It's very slow.
Better to read into a buffer and move it to the file in chunks.
Use a BufferedInputStream because it supports binary reading.
And that's all.
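A sketch of that buffered, chunked copy; paths and buffer size are illustrative:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

class BufferedCopy {
    static void copy(String from, String to) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(from));
             BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(to))) {
            int read;
            // read a chunk at a time instead of calling the single-byte read()
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
    }
}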
Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to try starting with a file containing 100 lines, write it to 100 different files with one line in each, and make the triggering mechanism work on the number of lines written to the current file. That program will be easily scalable to your situation.
OK. I am supposed to write a program that takes a 20 GB file as input with 1,000,000,000 records and creates some kind of index for faster access. I have basically decided to split the 1 billion records into 10 buckets and 10 sub-buckets within those. I calculate two hash values for each record to locate its appropriate bucket. Then I create 10*10 files, one for each sub-bucket. As I hash each record from the input file, I decide which of the 100 files it goes to and append the record's offset to that particular file.
I have tested this with a sample file of 10,000 records, repeating the process 10 times to effectively emulate a 100,000-record file. For this it takes me around 18 seconds, which means it is going to take forever to do the same for a 1 billion record file.
Is there any way I can speed up / optimize my writing?
I am going through all this because I can't store all the records in main memory.
import java.io.*;
// PROGRAM DOES THE FOLLOWING
// 1. READS RECORDS FROM A FILE.
// 2. CALCULATES TWO SETS OF HASH VALUES N, M
// 3. APPENDING THE OFFSET OF THAT RECORD IN THE ORIGINAL FILE TO ANOTHER FILE "NM.TXT" i.e REPLACE THE VALUES OF N AND M.
// 4.
class storage
{
public static int siz=10;
public static FileWriter[][] f;
}
class proxy
{
static String[][] virtual_buffer;
public static void main(String[] args) throws Exception
{
virtual_buffer = new String[storage.siz][storage.siz]; // TEMPORARY STRING BUFFER TO REDUCE WRITES
String s,tes;
for(int y=0;y<storage.siz;y++)
{
for(int z=0;z<storage.siz;z++)
{
virtual_buffer[y][z]=""; // INITIALISING ALL ELEMENTS TO EMPTY STRINGS
}
}
int offset_in_file = 0;
long start = System.currentTimeMillis();
// READING FROM THE SAME IP FILE 20 TIMES TO EMULATE A SINGLE BIGGER FILE OF SIZE 20*IP FILE
for(int h=0;h<20;h++){
BufferedReader in = new BufferedReader(new FileReader("outTest.txt"));
while((s = in.readLine() )!= null)
{
tes = (s.split(";"))[0];
int n = calcHash(tes); // FINDING FIRST HASH VALUE
int m = calcHash2(tes); // SECOND HASH
index_up(n,m,offset_in_file); // METHOD TO WRITE TO THE APPROPRIATE FILE I.E NM.TXT
offset_in_file++;
}
in.close();
}
System.out.println(offset_in_file);
long end = System.currentTimeMillis();
System.out.println((end-start));
}
static int calcHash(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=0;
for(i=0;i<charr.length;i++)
{
if(i%2==0)tot+= (int)charr[i];
}
tot = tot % storage.siz;
return tot;
}
static int calcHash2(String s) throws Exception
{
char[] charr = s.toCharArray();
int i,tot=1;
for(i=0;i<charr.length;i++)
{
if(i%2==1)tot+= (int)charr[i];
}
tot = tot % storage.siz;
if (tot<0)
tot=tot*-1;
return tot;
}
static void index_up(int a,int b,int off) throws Exception
{
virtual_buffer[a][b]+=Integer.toString(off)+"'"; // THIS BUFFER STORES THE DATA TO BE WRITTEN
if(virtual_buffer[a][b].length()>2000) // TO A FILE BEFORE WRITING TO IT, TO REDUCE NO. OF WRITES
{
String file = "c:\\adsproj\\"+a+b+".txt";
new writethreader(file,virtual_buffer[a][b]); // DOING THE ACTUAL WRITE PART IN A THREAD.
virtual_buffer[a][b]="";
}
}
}
class writethreader implements Runnable
{
Thread t;
String name, data;
writethreader(String name, String data)
{
this.name = name;
this.data = data;
t = new Thread(this);
t.start();
}
public void run()
{
try{
File f = new File(name);
if(!f.exists())f.createNewFile();
FileWriter fstream = new FileWriter(name,true); //APPEND MODE
fstream.write(data);
fstream.flush(); fstream.close();
}
catch(Exception e){}
}
}
Consider using VisualVM to pinpoint the bottlenecks. Everything else below is based on guesswork - and performance guesswork is often really, really wrong.
I think you have two issues with your write strategy.
The first is that you're starting a new thread on each write; the second is that you're re-opening the file on each write.
The thread problem is especially bad, I think, because I don't see anything preventing one thread writing on a file from overlapping with another. What happens then? Frankly, I don't know - but I doubt it's good.
Consider, instead, creating an array of open files for all 100. Your OS may have a problem with this - but I think probably not. Then create a queue of work for each file. Create a set of worker threads (100 is too many - think 10 or so) where each "owns" a set of files that it loops through, outputting and emptying the queue for each file. Pay attention to the interthread interaction between queue reader and writer - use an appropriate queue class.
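A rough sketch of that design, with writers opened once, one queue per bucket file, and a small pool of workers that each drain their own subset; sizes and paths are illustrative, and a real version also needs an orderly drain, flush, and close on shutdown:

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BucketWriters {
    static final int BUCKETS = 100;           // one file per (n, m) pair
    static final int WORKERS = 10;            // each worker owns BUCKETS / WORKERS files

    final FileWriter[] writers = new FileWriter[BUCKETS];
    @SuppressWarnings("unchecked")
    final BlockingQueue<String>[] queues = new BlockingQueue[BUCKETS];
    volatile boolean done = false;            // set to true by the producer when hashing is finished

    BucketWriters(String dir) throws IOException {
        for (int i = 0; i < BUCKETS; i++) {
            writers[i] = new FileWriter(dir + "/" + i + ".txt", true);  // opened once, append mode
            queues[i] = new ArrayBlockingQueue<>(10_000);
        }
    }

    // Producer side: called from the hashing loop instead of spawning a thread per write.
    void enqueue(int bucket, String offset) throws InterruptedException {
        queues[bucket].put(offset);
    }

    ExecutorService start() {
        ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
        for (int w = 0; w < WORKERS; w++) {
            final int owner = w;
            pool.execute(() -> {
                try {
                    while (!done) {
                        for (int b = owner; b < BUCKETS; b += WORKERS) {   // this worker's files
                            String item;
                            while ((item = queues[b].poll(1, TimeUnit.MILLISECONDS)) != null) {
                                writers[b].write(item + "'");   // same "offset'" format as the original
                            }
                        }
                    }
                } catch (IOException | InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        return pool;
    }
}

Each file is written by exactly one thread, so there is no interleaving, and no file is reopened per write.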
I would throw away the entire requirement and use a database.
How do I count the number of files in a directory using Java? For simplicity, let's assume that the directory doesn't have any sub-directories.
I know the standard method of :
new File(<directory path>).listFiles().length
But this will effectively go through all the files in the directory, which might take long if the number of files is large. Also, I don't care about the actual files in the directory unless their number is greater than some fixed large number (say 5000).
I am guessing, but doesn't the directory (or its i-node in case of Unix) store the number of files contained in it? If I could get that number straight away from the file system, it would be much faster. I need to do this check for every HTTP request on a Tomcat server before the back-end starts doing the real processing. Therefore, speed is of paramount importance.
I could run a daemon every once in a while to clear the directory. I know that, so please don't give me that solution.
Ah... the rationale for not having a straightforward method in Java to do that is file storage abstraction: some filesystems may not have the number of files in a directory readily available... that count may not even have any meaning at all (see for example distributed, P2P filesystems, fs that store file lists as a linked list, or database-backed filesystems...).
So yes,
new File(<directory path>).list().length
is probably your best bet.
Since Java 8, you can do that in three lines:
try (Stream<Path> files = Files.list(Paths.get("your/path/here"))) {
long count = files.count();
}
Regarding the 5000 child nodes and inode aspects:
This method will iterate over the entries, but as Varkhan suggested, you probably can't do better besides playing with JNI or direct system command calls, and even then, you can never be sure those methods don't do the same thing!
However, let's dig into this a little:
Looking at JDK8 source, Files.list exposes a stream that uses an Iterable from Files.newDirectoryStream that delegates to FileSystemProvider.newDirectoryStream.
On UNIX systems (decompiled sun.nio.fs.UnixFileSystemProvider.class), it loads an iterator: A sun.nio.fs.UnixSecureDirectoryStream is used (with file locks while iterating through the directory).
So, there is an iterator that will loop through the entries here.
Now, let's look to the counting mechanism.
The actual count is performed by the count/sum reducing API exposed by Java 8 streams. In theory, this API can perform parallel operations without much effort (with multithreading). However, the stream is created with parallelism disabled, so it's a no-go...
The good side of this approach is that it won't load the array in memory, as the entries are counted by an iterator as they are read by the underlying (filesystem) API.
Finally, for the information, conceptually in a filesystem, a directory node is not required to hold the number of files it contains; it can just contain the list of its child nodes (list of inodes). I'm not an expert on filesystems, but I believe that UNIX filesystems work just like that. So you can't assume there is a way to have this information directly (i.e. there can always be some list of child nodes hidden somewhere).
Unfortunately, I believe that is already the best way (although list() is slightly better than listFiles(), since it doesn't construct File objects).
This might not be appropriate for your application, but you could always try a native call (using jni or jna), or exec a platform-specific command and read the output before falling back to list().length. On *nix, you could exec ls -1a | wc -l (note - that's dash-one-a for the first command, and dash-lowercase-L for the second). Not sure what would be right on windows - perhaps just a dir and look for the summary.
Before bothering with something like this I'd strongly recommend you create a directory with a very large number of files and just see if list().length really does take too long. As this blogger suggests, you may not want to sweat this.
I'd probably go with Varkhan's answer myself.
Since you don't really need the total number, and in fact want to perform an action after a certain number (in your case 5000), you can use java.nio.file.Files.newDirectoryStream. The benefit is that you can exit early instead having to go through the entire directory just to get a count.
public boolean isOverMax(){
Path dir = Paths.get("C:/foo/bar");
int i = 0;
try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
for (Path p : stream) {
//larger than max files, exit
if (++i > MAX_FILES) {
return true;
}
}
} catch (IOException ex) {
ex.printStackTrace();
}
return false;
}
The interface doc for DirectoryStream also has some good examples.
If you have directories containing really many (>100,000) files, here is a (non-portable) way to go:
String directoryPath = "a path";
// the -f flag is important, because this way ls does not sort its output,
// which is way faster
String[] params = { "/bin/sh", "-c",
"ls -f " + directoryPath + " | wc -l" };
Process process = Runtime.getRuntime().exec(params);
BufferedReader reader = new BufferedReader(new InputStreamReader(
process.getInputStream()));
int fileCount = Integer.parseInt(reader.readLine().trim()) - 2; // accounting for .. and .
reader.close();
System.out.println(fileCount);
Using sigar should help. Sigar has native hooks to get the stats
new Sigar().getDirStat(dir).getTotal()
This method works for me very well.
// Recursive method to recover files and folders and to print the information
public static void listFiles(String directoryName) {
File file = new File(directoryName);
File[] fileList = file.listFiles(); // List files inside the main dir
int j;
String extension;
String fileName;
if (fileList != null) {
for (int i = 0; i < fileList.length; i++) {
extension = "";
if (fileList[i].isFile()) {
fileName = fileList[i].getName();
if (fileName.lastIndexOf(".") != -1 && fileName.lastIndexOf(".") != 0) {
extension = fileName.substring(fileName.lastIndexOf(".") + 1);
System.out.println("THE " + fileName + " has the extension = " + extension);
} else {
extension = "Unknown";
System.out.println("extension2 = " + extension);
}
filesCount++;
allStats.add(new FilePropBean(filesCount, fileList[i].getName(), fileList[i].length(), extension,
fileList[i].getParent()));
} else if (fileList[i].isDirectory()) {
filesCount++;
extension = "";
allStats.add(new FilePropBean(filesCount, fileList[i].getName(), fileList[i].length(), extension,
fileList[i].getParent()));
listFiles(String.valueOf(fileList[i]));
}
}
}
}
Unfortunately, as mmyers said, File.list() is about as fast as you are going to get using Java. If speed is as important as you say, you may want to consider doing this particular operation using JNI. You can then tailor your code to your particular situation and filesystem.
public void shouldGetTotalFilesCount() {
Integer reduce = of(listRoots()).parallel().map(this::getFilesCount).reduce(0, ((a, b) -> a + b));
}
private int getFilesCount(File directory) {
File[] files = directory.listFiles();
return Objects.isNull(files) ? 1 : Stream.of(files)
.parallel()
.reduce(0, (Integer acc, File p) -> acc + getFilesCount(p), (a, b) -> a + b);
}
Count files in directory and all subdirectories.
var path = Path.of("your/path/here");
long count;
try (var files = Files.walk(path)) {   // close the stream to release the directory handles
    count = files.filter(Files::isRegularFile).count();
}
In spring batch I did below
private int getFilesCount() throws IOException {
ResourcePatternResolver resolver = new PathMatchingResourcePatternResolver();
Resource[] resources = resolver.getResources("file:" + projectFilesFolder + "/**/input/splitFolder/*.csv");
return resources.length;
}