Program slowing down due to recursion - java

I'm trying to write a program that adds every single file and folder name on my C: drive to an ArrayList. The code works fine, but because of the massive amount of recursion, it gets painfully slow. Here is the code:
public static void updateFileDataBase()
{
    ArrayList<String> currentFiles = new ArrayList<String>();
    addEverythingUnder("C:/", currentFiles, new String[]{"SteamApps", "AppData"});
    for(String name : currentFiles)
        System.out.println(name);
}

private static void addEverythingUnder(String path, ArrayList<String> list, String[] exceptions)
{
    System.gc();
    System.out.println("searching " + path);
    File search = new File(path);
    try
    {
        for(int i = 0; i < search.list().length; i++)
        {
            boolean include = true;
            for(String exception : exceptions)
                if(search.list()[i].contains(exception))
                    include = false;
            if(include)
            {
                list.add(search.list()[i]);
                if(new File(path + "/" + search.list()[i]).isDirectory())
                {
                    addEverythingUnder(path + "/" + search.list()[i], list, exceptions);
                }
            }
        }
    }
    catch(Exception error)
    {
        System.out.println("ACCESS DENIED");
    }
}
I was wondering if there was anything at all that I could do to speed up the process. Thanks in advance :)

Program slowing down due to recursion
No it isn't. Recursion doesn't make things slow. Poor algorithms and bad coding make things slow.
For example, you are calling File.list() four times for every file you process, as well as once per directory. You can save an O(N) factor by calling listFiles() once per directory:
for(File file : search.listFiles())
{
    String name = file.getName();
    boolean include = true;
    for(String exception : exceptions)
        if(name.contains(exception))
            include = false;
    if(include)
    {
        list.add(name);
        if(file.isDirectory())
        {
            addEverythingUnder(file, list, exceptions);
        }
    }
}
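That loop assumes the method now takes a File instead of a String path. A minimal sketch of the revised method, where the catch-everything block is replaced by a null check on listFiles() (that null check is my addition, not part of the original answer):

private static void addEverythingUnder(File search, ArrayList<String> list, String[] exceptions)
{
    System.out.println("searching " + search.getPath());
    File[] children = search.listFiles();      // one call per directory
    if (children == null)                      // listFiles() returns null when the directory can't be read
    {
        System.out.println("ACCESS DENIED");
        return;
    }
    for (File file : children)
    {
        String name = file.getName();
        boolean include = true;
        for (String exception : exceptions)
            if (name.contains(exception))
                include = false;
        if (include)
        {
            list.add(name);
            if (file.isDirectory())
                addEverythingUnder(file, list, exceptions);
        }
    }
}

The call from updateFileDataBase() would then pass new File("C:/") instead of the string.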

There is (as of Java 7) a built-in way to do this, Files.walkFileTree, which is much more efficient and removes the need to reinvent the wheel. It calls into a FileVisitor for every entry it finds. There are a couple of examples on the FileVisitor page to get you started.
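A sketch of the walkFileTree approach, assuming the same name-based exclusions as the question (excluded directories are skipped entirely rather than merely left out of the list):

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

public class FileTreeLister {

    public static List<String> listEverythingUnder(final Path root, final String[] exceptions) throws IOException {
        final List<String> names = new ArrayList<String>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            private boolean excluded(Path p) {
                String name = p.getFileName() == null ? "" : p.getFileName().toString();
                for (String exception : exceptions)
                    if (name.contains(exception))
                        return true;
                return false;
            }
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                if (excluded(dir))
                    return FileVisitResult.SKIP_SUBTREE;   // don't descend into excluded directories
                if (!dir.equals(root))
                    names.add(dir.getFileName().toString());
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (!excluded(file))
                    names.add(file.getFileName().toString());
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                return FileVisitResult.CONTINUE;           // skip unreadable entries instead of aborting
            }
        });
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEverythingUnder(Paths.get("C:/"), new String[]{"SteamApps", "AppData"}))
            System.out.println(name);
    }
}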

Is there a particular reason for reinventing the wheel?
If you don't mind, please use
http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/FileUtils.html#listFiles(java.io.File, java.lang.String[], boolean)
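A short sketch of that call; note that FileUtils.listFiles returns files only (no directory names), and its second argument is a set of extensions to include (null means all files), not an exclusion list, so it doesn't replicate the question's exception handling:

import java.io.File;
import java.util.Collection;
import org.apache.commons.io.FileUtils;

public class ListWithCommonsIo {
    public static void main(String[] args) {
        // null extensions = all files; true = recurse into subdirectories
        Collection<File> files = FileUtils.listFiles(new File("C:/"), null, true);
        for (File f : files) {
            System.out.println(f.getName());
        }
    }
}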

because of the massive amount of recursion, it gets painfully slow
While your code is very inefficient, as EJP suggests, I suspect the problem is even more basic. Accessing a large number of files takes time to read from disk the first time; reading the same data again and again is much quicker because it is cached. Opening files is also pretty slow on an HDD.
A typical HDD has a seek time of 8 ms; if finding and opening a file takes two operations, you are looking at 16 ms per file. Say you have 10,000 files: that is at least 160 seconds, no matter how efficient you make the code. BTW, if you use a decent SSD, this will take about 1 second.
In short, you are likely hitting a hardware limit that has nothing to do with how you wrote your software. Shorter: don't have large numbers of files if you want performance.

Related

Issues with Dynamic Destinations in Dataflow

I have a Dataflow job that reads data from Pub/Sub and, based on the time and filename, writes the contents to GCS, where the folder path is derived from the YYYY/MM/DD. This allows files to be generated in folders by date, using Apache Beam's FileIO and Dynamic Destinations.
About two weeks ago, I noticed an unusual buildup of unacknowledged messages. Upon restarting the df job the errors disappeared and new files were being written in GCS.
After a couple of days, writing stopped again, except this time, there were errors claiming that processing was stuck. After some trusty SO research, I found out that this was likely caused by a deadlock issue in pre-2.9.0 Beam because it used the Conscrypt library as the default security provider. So, I upgraded to Beam 2.11 from Beam 2.8.
Once again, it worked, until it didn't. I looked more closely at the error and noticed that it had a problem with a SimpleDateFormat object, which isn't thread-safe. So, I switched to java.time and DateTimeFormatter, which is thread-safe. It worked until it didn't. However, this time, the error was slightly different and didn't point to anything in my code:
The error is provided below.
Processing stuck in step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles for at least 05m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:469)
at org.apache.beam.vendor.guava.v20_0.com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:76)
at org.apache.beam.runners.dataflow.worker.MetricTrackingWindmillServerStub.getStateData(MetricTrackingWindmillServerStub.java:202)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader.startBatchAndBlock(WindmillStateReader.java:409)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$WrappedFuture.get(WindmillStateReader.java:311)
at org.apache.beam.runners.dataflow.worker.WindmillStateReader$BagPagingIterable$1.computeNext(WindmillStateReader.java:700)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:145)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:140)
at org.apache.beam.vendor.guava.v20_0.com.google.common.collect.MultitransformedIterator.hasNext(MultitransformedIterator.java:47)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:701)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown Source)
This error started occurring approximately 5 hours after job deployment and at an increasing rate over time. Writing slowed significantly within 24 hours. I have 60 workers and I suspect that one worker fails every time there is an error, which eventually kills the job.
In my writer, I parse the lines for certain keywords (which may not be the best way) in order to determine which folder each line belongs in. I then write the file to GCS with the determined filename. Here is the code I use for my writer:
The partition function is as follows:
@SuppressWarnings("serial")
public static class datePartition implements SerializableFunction<String, String> {

    private String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(String input) {
        String folder_name = "NaN";
        String date_dtf = "NaN";
        String date_literal = "NaN";
        try {
            Matcher foldernames = Pattern.compile("\"foldername\":\"(.*?)\"").matcher(input);
            if(foldernames.find()) {
                folder_name = foldernames.group(1);
            }
            else {
                Matcher folderid = Pattern.compile("\"folderid\":\"(.*?)\"").matcher(input);
                if(folderid.find()) {
                    folder_name = folderid.group(1);
                }
            }

            Matcher date_long = Pattern.compile("\"timestamp\":\"(.*?)\"").matcher(input);
            if(date_long.find()) {
                date_literal = date_long.group(1);
                if(Utilities.isNumeric(date_literal)) {
                    LocalDateTime date = LocalDateTime.ofInstant(Instant.ofEpochMilli(Long.valueOf(date_literal)), ZoneId.systemDefault());
                    date_dtf = date.format(dtf);
                }
                else {
                    date_dtf = date_literal.split(":")[0].replace("-", "/").replace("T", "/");
                }
            }
            return folder_name + "/" + date_dtf + "h/" + filename;
        }
        catch(Exception e) {
            LOG.error("ERROR with either foldername or date");
            LOG.error("Line : " + input);
            LOG.error("folder : " + folder_name);
            LOG.error("Date : " + date_dtf);
            return folder_name + "/" + date_dtf + "h/" + filename;
        }
    }
}
And the actual place where the pipeline is deployed and run can be found below:
public void streamData() {
    Pipeline pipeline = Pipeline.create(options);
    pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
            .apply(options.getWindowDuration() + " Window",
                    Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                          .triggering(AfterWatermark.pastEndOfWindow())
                          .discardingFiredPanes()
                          .withAllowedLateness(parseDuration("24h")))
            .apply(new GenericFunctions.extractMsg())
            .apply(FileIO.<String, String>writeDynamic()
                         .by(new datePartition(options.getOutputFilenamePrefix()))
                         .via(TextIO.sink())
                         .withNumShards(options.getNumShards())
                         .to(options.getOutputDirectory())
                         .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
                         .withDestinationCoder(StringUtf8Coder.of()));

    pipeline.run();
}
The error 'Processing stuck ...' indicates that some particular operation took longer than 5m, not that the job is permanently stuck. However, since the step FileIO.Write/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles is the one that is stuck and the job gets cancelled/killed, I would suspect an issue while the job is writing temp files.
I found the BEAM-7689 issue, which is related to a second-granularity timestamp (yyyy-MM-dd_HH-mm-ss) used to write temporary files. This happens because several concurrent jobs can share the same temporary directory, and one of the jobs can delete it before the other(s) finish.
According to the previous link, to mitigate the issue, please upgrade to SDK 2.14. And let us know if the error is gone.
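If upgrading can't happen right away, a possible interim mitigation (an assumption on my part, not from the Beam docs) is to give each job its own temp directory so concurrent jobs never share one; a sketch reusing the question's pipeline options:

// Hypothetical: make the temp directory unique per job so that concurrent jobs
// cannot delete each other's temporary files (the BEAM-7689 scenario).
String jobTempDir = options.getTempLocation() + "/" + options.getJobName();

FileIO.<String, String>writeDynamic()
      .by(new datePartition(options.getOutputFilenamePrefix()))
      .via(TextIO.sink())
      .withNumShards(options.getNumShards())
      .to(options.getOutputDirectory())
      .withNaming(type -> FileIO.Write.defaultNaming(type, ".txt"))
      .withDestinationCoder(StringUtf8Coder.of())
      .withTempDirectory(jobTempDir);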
Since posting this question, I've optimized the Dataflow job to dodge bottlenecks and increase parallelization. Much like rsantiago explained, 'processing stuck' isn't an error, but simply a way Dataflow communicates that a step is taking significantly longer than other steps, i.e. a bottleneck that can't be cleared with the given resources. The changes I made seem to have addressed them. The new code is as follows:
public void streamData() {
    try {
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply("Read PubSub Events", PubsubIO.readMessagesWithAttributes().fromSubscription(options.getInputSubscription()))
                .apply(options.getWindowDuration() + " Window",
                        Window.<PubsubMessage>into(FixedWindows.of(parseDuration(options.getWindowDuration())))
                              .triggering(AfterWatermark.pastEndOfWindow())
                              .discardingFiredPanes()
                              .withAllowedLateness(parseDuration("24h")))
                .apply(FileIO.<String, PubsubMessage>writeDynamic()
                             .by(new datePartition(options.getOutputFilenamePrefix()))
                             .via(Contextful.fn(
                                     (SerializableFunction<PubsubMessage, String>) inputMsg -> new String(inputMsg.getPayload(), StandardCharsets.UTF_8)),
                                     TextIO.sink())
                             .withDestinationCoder(StringUtf8Coder.of())
                             .to(options.getOutputDirectory())
                             .withNaming(type -> new CrowdStrikeFileNaming(type))
                             .withNumShards(options.getNumShards())
                             .withTempDirectory(options.getTempLocation()));

        pipeline.run();
    }
    catch(Exception e) {
        LOG.error("Unable to deploy pipeline");
        LOG.error(e.toString(), e);
    }
}
The biggest change involved removing the extractMsg() function and changing partitioning to only use metadata. Both of these steps forced deserialization/reserialization of messages and heavily impacted performance.
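For illustration, a sketch of what a metadata-only partition function might look like (the attribute names "foldername" and "timestamp" and the date pattern are assumptions, since the real attributes aren't shown here):

@SuppressWarnings("serial")
public static class datePartition implements SerializableFunction<PubsubMessage, String> {

    private final String filename;

    public datePartition(String filename) {
        this.filename = filename;
    }

    @Override
    public String apply(PubsubMessage msg) {
        // Assumed attributes: reading Pub/Sub attributes avoids deserializing the
        // payload just to choose a destination folder.
        String folder = msg.getAttribute("foldername");
        String ts = msg.getAttribute("timestamp");
        String datePath = "NaN";
        if (ts != null) {
            LocalDateTime date = LocalDateTime.ofInstant(
                    Instant.ofEpochMilli(Long.valueOf(ts)), ZoneId.systemDefault());
            datePath = date.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"));
        }
        return (folder == null ? "NaN" : folder) + "/" + datePath + "h/" + filename;
    }
}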
Additionally, since my data set was unbounded, I had to set a non-zero number of shards. I wanted to simplify my filenaming policy, so I set it to 1 without knowing how much it hurt performance. Since then, I've found a good balance of workers/shards/machine type for my job (mostly based on guess & check, unfortunately).
Although it's still possible that a bottleneck might be observed with a large enough data load, the pipeline has been performing well despite heavy load (3-5 TB per day). The changes also significantly improved autoscaling, but I'm not sure why. The Dataflow job now reacts to spikes and valleys a lot quicker.

Listing tar archive contents (java commons compress) returns variable number of entries, then stream closed exception. How to fix?

I have researched this issue for quite some time on google and stackoverflow.
Unfortunately, I can't seem to find any resources that seem to address this issue. Admittedly, my search-fu isn't the best; any help, examples, or pointers to relevant resources will be very much appreciated.
The following code is my method for listing the archive contents from a TarArchiveInputStream:
public static List<String> tarListDir(InputStream incoming)
        throws Exception {
    TarArchiveInputStream tarInput = new TarArchiveInputStream(incoming);
    TarArchiveEntry entry = null;
    List<String> ouah = new ArrayList<String>();

    try {
        while ((entry = tarInput.getNextTarEntry()) != null) {
            if (!entry.isFile()) {
                continue;
            }
            ouah.add(entry.getName());
            if (RecScan.verbose || RecScan.debugging) {
                if (entry.isFile()) {
                    System.out.print("file: \t");
                } else if (entry.isDirectory()) {
                    System.out.print("dir: \t");
                } else {
                    System.out.print("wut: \t");
                }
                System.out.println(ouah.get(ouah.size() - 1));
            }
        }
    } catch (Exception e) {
        tarInput.close();
        throw new Exception("Closed w/exception: " + e.getMessage());
    }

    if (RecScan.verbose || RecScan.debugging) {
        System.out.println("Closing tarInput normally");
    }
    tarInput.close();
    return ouah;
}
As mentioned above, I am hoping to obtain the List of entries in the archive. Unfortunately, it appears that the method only obtains a varying number of entries. This number varies per archive (testing with 4 different ones), from 0 entries to 12 entries. The exception being thrown is an IOException, Stream Closed to be specific.
I'm not very well versed in the Apache Commons Compress libraries (obviously), and I don't really know any hex editors well enough to dig around in the archive to see if there's something non-POSIX that it's stumbling over. I would think that the entry.isFile() conditional would avoid that complication, though.
As I mentioned, any help or resources greatly appreciated! TIA!
EDIT: Though the code there (in the first comment's post reference) seemed to boil down basically to the same as what I was using, I did actually cut out my code and use what was there, virtually verbatim. Still got the exact same error: Stream Closed. I did manage to find something that may hold a clue; I tried swapping out the BufferedInputStream wrapping of the original InputStream that I handed off. This caused the Stream Closed error immediately, vs. the BufferedInputStream closing after a bit of the archive listing. Definitely still looking for hints.
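For what it's worth, a self-contained sketch of the listing loop using try-with-resources and a BufferedInputStream wrapper; this assumes the surrounding code isn't closing the InputStream while the archive is still being read, which the symptoms above suggest may be the real problem:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public static List<String> tarListFiles(InputStream incoming) throws IOException {
    List<String> names = new ArrayList<String>();
    // Buffering lets TarArchiveInputStream read its fixed-size records efficiently.
    // Note: closing the TarArchiveInputStream also closes 'incoming'.
    try (TarArchiveInputStream tarInput =
                 new TarArchiveInputStream(new BufferedInputStream(incoming))) {
        TarArchiveEntry entry;
        while ((entry = tarInput.getNextTarEntry()) != null) {
            if (entry.isFile()) {
                names.add(entry.getName());
            }
        }
    }
    return names;
}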

Java copy when file is not being used [duplicate]

This question already has answers here:
JAVA NIO Watcher: How to detect end of a long lasting (copy) operation?
(2 answers)
Closed 8 years ago.
I am writing a directory monitoring utility in Java (1.6), polling at certain intervals and using the lastModified long value as the indication of change. I found that when my polling interval is small (seconds) and the copied file is big, the change event is fired before the file copy actually completes.
I would like to know whether there is a way to find the status of a file, like in transit, complete, etc.
Environments: Java 1.6; expected to work on windows and linux.
There are two approaches I've used in the past which are platform agnostic.
1/ This was for FTP transfers where I controlled what was put, so it may not be directly relevant.
Basically, whatever is putting a file file.txt will, when it's finished, also put a small (probably zero-byte) dummy file called file.txt.marker (for example).
That way, the monitoring tool just looks for the marker file to appear and, when it does, it knows the real file is complete. It can then process the real file and delete the marker.
2/ An unchanged duration.
Have your monitor program wait until the file is unchanged for N seconds (where N is reasonably guaranteed to be large enough that the file will be finished).
For example, if the file size hasn't changed in 60 seconds, there's a good chance it's finished.
There's a balancing act between not thinking the file is finished just because there's no activity on it, and the wait once it is finished before you can start processing it. This is less of a problem for local copying than FTP.
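A minimal sketch of approach 2, assuming a simple polling loop (the quiet-period and poll intervals are illustrative, not prescribed):

import java.io.File;

// Returns true once the file's size and lastModified have been stable for quietMillis.
static boolean waitUntilStable(File file, long quietMillis, long pollMillis) throws InterruptedException {
    long lastSize = -1;
    long lastModified = -1;
    long stableSince = System.currentTimeMillis();
    while (true) {
        long size = file.length();
        long modified = file.lastModified();
        if (size != lastSize || modified != lastModified) {
            lastSize = size;
            lastModified = modified;
            stableSince = System.currentTimeMillis();   // change seen, restart the quiet window
        } else if (System.currentTimeMillis() - stableSince >= quietMillis) {
            return true;                                // no change for quietMillis
        }
        Thread.sleep(pollMillis);
    }
}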
This solution worked for me:
File ff = new File(fileStr);
if(ff.exists()) {
    for(int timeout = 100; timeout > 0; timeout--) {
        RandomAccessFile ran = null;
        try {
            ran = new RandomAccessFile(ff, "rw");
            break; // no errors, done waiting
        } catch (Exception ex) {
            System.out.println("timeout: " + timeout + ": " + ex.getMessage());
        } finally {
            if(ran != null) try {
                ran.close();
            } catch (IOException ex) {
                //do nothing
            }
            ran = null;
        }
        try {
            Thread.sleep(100); // wait a bit then try again
        } catch (InterruptedException ex) {
            //do nothing
        }
    }
    System.out.println("File lockable: " + fileStr +
            (ff.exists() ? " exists" : " deleted during process"));
} else {
    System.out.println("File does not exist: " + fileStr);
}
This solution relies on the fact that you can't open the file for writing if another process has it open. It will stay in the loop until the timeout value is reached or the file can be opened. The timeout values will need to be adjusted depending on the application's actual needs. I also tried this method with channels and tryLock(), but it didn't seem to be necessary.
Do you mean that you're waiting for the lastModified time to settle? At best that will be a bit hit-and-miss.
How about trying to open the file with write access (appending rather than truncating the file, of course)? That won't succeed if another process is still trying to write to it. It's a bit ugly, particularly as it's likely to be a case of using exceptions for flow control (ick) but I think it'll work.
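A small sketch of that check (Java 1.6 style, so no try-with-resources). Whether the open fails while another process is still writing is platform-dependent; it typically does on Windows but often not on Linux, so treat it as a heuristic:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;

// Try to open the file for appending; failure suggests another process still has it open.
static boolean appearsWritable(File file) {
    FileOutputStream out = null;
    try {
        out = new FileOutputStream(file, true);   // true = append, don't truncate
        return true;
    } catch (IOException e) {
        return false;
    } finally {
        if (out != null) {
            try { out.close(); } catch (IOException ignored) { }
        }
    }
}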
If I understood the question correctly, you're looking for a way to distinguish whether the copying of a file is complete or still in progress?
How about comparing the size of the source and destination file (i.e. file.length())? If they're equal, then copying is complete. Otherwise, it's still in progress.
I'm not sure it's efficient since it would still require polling. But it "might" work.
You could look into online file upload with progressbar techniques - they use OutputStreamListener and custom writer to notify the listener about bytes written.
http://www.missiondata.com/blog/java/28/file-upload-progress-with-ajax-and-java-and-prototype/
File Upload with Java (with progress bar)
We monitor the file size change to determine whether the file is complete or not.
We use a Spring Integration file endpoint to poll a directory every 200 ms.
Once a file is detected (regardless of whether it is complete or not), we have a custom file filter with an interface method accept(File file) that returns a flag indicating whether we can process the file.
If the filter returns false, this File instance is ignored, and it will be picked up by the same filtering process during the next poll.
The filter does the following:
First, we get the current file size, wait 200 ms (can be less), and check the size again. If the size differs, we retry up to 5 times. Only when the file size stops growing is the file marked as COMPLETED (i.e. accept returns true).
Sample code used is as the following:
public class InCompleteFileFilter<F> extends AbstractFileListFilter<F> {

    protected Object monitor = new Object();

    @Override
    protected boolean accept(F file) {
        synchronized (monitor) {
            File currentFile = (File) file;
            if(!currentFile.getName().contains("Conv1")) { return false; }

            long currentSize = currentFile.length();
            try { Thread.sleep(200); } catch (InterruptedException e) { e.printStackTrace(); }

            int retryCount = 0;
            while(retryCount++ < 4 && currentFile.length() > currentSize) {
                try { Thread.sleep(200); } catch (InterruptedException e) { e.printStackTrace(); }
            }

            if(retryCount == 5) {
                return false;
            } else {
                return true;
            }
        }
    }
}

jvm caching of method calls?

I am doing some performance tests of a HTML stripper (written in java), that is to say, I pass a string (actually html content) to a method of the HTML stripper
and the latter returns plain text (without HTML tags and meta information).
Here is an example of the concrete implementation
public void performanceTest() throws IOException {
    long totalTime = 0; // must be initialised, or the code won't compile
    File file = new File("/directory/to/ten/different/htmlFiles");

    for (int i = 0; i < 200; ++i) {
        for (File fileEntry : file.listFiles()) {
            HtmlStripper stripper = new HtmlStripper();
            URL url = fileEntry.toURI().toURL();
            InputStream inputStream = url.openStream();
            String html = IOUtils.toString(inputStream, "UTF-8");

            long start = System.currentTimeMillis();
            String text = stripper.getText(html);
            long end = System.currentTimeMillis();
            totalTime = totalTime + (end - start);

            //The duration for the stripping of each file is computed here
            //(200 times for each file). That duration value decreases and then becomes constant.
            //IMHO the duration for the same file should always remain the same.
            //Or is a cache technique used by the JVM?
            System.out.println("time needed for stripping current file: " + (end - start));
        }
    }
    System.out.println("Average time for one document: "
            + (totalTime / 2000));
}
But the stripping duration for each file is computed 200 times, and those values decrease before becoming constant. IMHO the duration for one and the same file X should always remain the same!? Or is some caching technique used by the JVM?
Any help would be appreciated.
Thanks in advance
Horace
N.B.:
- I am doing the tests locally (no remote, no HTTP) on my machine.
- I am using Java 6 on Ubuntu 10.04.
This is totally normal. The JIT compiles methods to native code and optimizes them more heavily as they're more and more heavily used. (The "constant" your benchmark eventually converges to is the peak of the JIT's optimization capabilities.)
You cannot get good benchmarks in Java without running the method many times before you start timing at all.
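A rough sketch of what that warm-up looks like, reusing the stripper and html variables from the question's loop (the 1000-iteration count is arbitrary):

// Warm up: give the JIT a chance to compile and optimise the hot path first.
for (int i = 0; i < 1000; i++) {
    stripper.getText(html);
}

// Only now start timing; nanoTime() gives finer resolution than currentTimeMillis().
long start = System.nanoTime();
String text = stripper.getText(html);
long elapsedMicros = (System.nanoTime() - start) / 1000;
System.out.println("time needed for stripping current file: " + elapsedMicros + " microseconds");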
IMHO if the duration for one and the same file X should always remain the same
Not in the presence of an optimizing just-in-time compiler. Among other things it keeps track of how many times a particular method/branch is used, and selectively compiles Java byte codes into native code.
