I wrote a function in Java that edits file names, replacing each space character with a dash.
Currently I iterate over all the files in a specific directory, iterate over each file name to build a new name, and then replace the file in the directory.
I guess the current complexity is O(N*M), where N = number of files in the directory and M = number of characters in each file name.
Can anyone help me improve the run-time complexity?
Thanks
public static void editSpace(String source, String target) {
    // Source directory where all the files are
    File dir = new File(source);
    File[] directoryListing = dir.listFiles();
    // Iterate over each file in the directory
    for (File file : directoryListing) {
        String childName = file.getName();
        String childNameNew = "";
        // Iterate over the file name and change every space char to a dash char
        for (int i = 0; i < childName.length(); i++) {
            if (childName.charAt(i) == ' ') {
                childNameNew += "-";
            } else {
                childNameNew += childName.charAt(i);
            }
        }
        // Build the new path of the child
        String childDir = target + "\\" + childNameNew;
        // Rename the file and move it to the new location
        if (!(childNameNew.equals(""))
                && (file.renameTo(new File(childDir)))) {
            // If the file was moved successfully, delete the original file
            file.delete();
            // Print message
            System.out.println(childName + " File moved successfully to "
                    + childDir);
        }
        // Moving failed
        else {
            // Print message
            System.out.println(childName + " Failed to move the file to "
                    + childDir);
        }
    }
}
I guess the current complexity is O(N*M), where N = number of files in the directory and M = number of characters in each file name. Can anyone help me improve the run-time complexity?
Nobody can. You figured it out yourself: when your task is to modify N file names that each have about M characters to read (or modify), you end up with N*M. There is no conceptual way to modify N file names based on their current names without looking at each file and at each character in it.
But what is possible: look carefully at your code, and see if you can improve the actual implementation.
You should start by relying much more on library methods. For example, String.replace() allows you to turn all spaces into dashes with a single call. That shouldn't affect performance, but it shrinks your own code (having less code is mostly a good thing!). You could go one step further and look at streams to use even less code.
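As a rough sketch of what a library-based version could look like, using String.replace() together with java.nio.file (Files.newDirectoryStream and Files.move); the class shape and error handling are illustrative, not the question's own code:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class EditSpace {
    public static void editSpace(String source, String target) throws IOException {
        Path targetDir = Paths.get(target);
        // Stream the directory entries instead of materialising a File[] up front
        try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get(source))) {
            for (Path file : dir) {
                // One library call replaces the hand-written character loop
                String newName = file.getFileName().toString().replace(' ', '-');
                Files.move(file, targetDir.resolve(newName), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}

The asymptotic cost is still O(N*M); the point is only that there is less of your own code to write and maintain.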
But the real answer here: you are probably doing premature optimisation. In the end, you are talking about something where the JVM needs to talk to the OS in order to make changes out there in the file system. There are zillions of aspects that influence the overall, end-to-end performance of such a use case. It might be helpful to have more than one thread, so that you can "process" file names from different directories in parallel, for example.
On the other hand: creating a thread is a costly operation. And typically, it only helps you to speed up CPU-intensive activities. Worse, multiple threads accessing the file system like that in parallel might actually slow things down overall.
Meaning: depending on your overall setup, you might be able to speed up renaming files. Or not.
In the end, you are spending a lot of time and energy here. And the real question: is it really worth it?! Does it really matter to you whether your code will need 500 ms, or 1 sec, or 2 seconds? Depending on context it might, but maybe: it doesn't. That is the first thing to clarify. And when you figure that you really need the highest performance solution here, then you will have to invest real time into measuring what is going on, and doing experiments to find out which setting affects performance the most.
In other words: if you really care about performance here, you have a lot of low-level details to look at. If you don't care about performance that much, I would throw away the Java code and write 3 lines of Python code, or Kotlin, or whatever you normally use for scripting, and go with that. Not because that code will be faster, but because it will be easier to read, write, and maintain. That is what matters when performance isn't your primary priority.
Related
Suppose a very simple program that lists out all the subdirectories of a given directory. Sounds simple enough? Except the only way to list all subdirectories in Java is to use FilenameFilter combined with File.list().
This works for the trivial case, but when the folder has, say, 150,000 files and 2 subfolders, it's silly waiting there for 45 seconds iterating through all the files and testing file.isDirectory(). Is there a better way to list subdirectories?
PS. Sorry, please save the lectures on having too many files in the same directory. Our live environment has this as part of the requirement.
As has already been mentioned, this is basically a hardware problem. Disk access is always slow, and most file systems aren't really designed to handle directories with that many files.
If you for some reason have to store all the files in the same directory, I think you'll have to maintain your own cache. This could be done using a local database such as SQLite or HSQL. If you want extreme performance, use a Java TreeSet and cache it in memory. This means at the very least that you'll have to read the directory less often, and it could possibly be done in the background. You could reduce the need to refresh the list even further by using your system's native file update notification API (inotify on Linux) to subscribe to changes to the directory.
This doesn't seem to be possible for you, but I once solved a similar problem by "hashing" the files into subdirectories. In my case, the challenge was to store a couple of million images with numeric ids. I constructed the directory structure as follows:
images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg
This has worked well for us, and it's the solution that I would recommend. You could do something similar for alphanumeric filenames by simply taking the first two letters of the filename, and then the next two letters. I've done this once as well, and it did the job.
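A minimal sketch of that kind of bucketing for numeric ids; the directory layout follows the pattern above, while the class and method names are just illustrative:

import java.nio.file.Path;
import java.nio.file.Paths;

public class ImagePaths {
    // Builds images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg
    static Path imagePath(long id) {
        long topBucket = id - (id % 1_000_000);
        long subBucket = id - (id % 1_000);
        return Paths.get("images",
                String.valueOf(topBucket),
                String.valueOf(subBucket),
                id + ".jpg");
    }

    public static void main(String[] args) {
        // e.g. id 15023 ends up under images/0/15000/15023.jpg
        System.out.println(imagePath(15023));
    }
}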
Do you know the finite list of possible subdirectory names? If so, use a loop over all possible names and check for directory's existence.
Otherwise, you can not get ONLY directory names in most underlying OSs (e.g. in Unix, the directory listing is merely reading contents of "directory" file, so there's no way to find "just directories" quickly without listing all the files).
However, in NIO.2 in Java7 (see http://java.sun.com/developer/technicalArticles/javase/nio/#3 ), there's a way to have a streaming directory list so you don't get a full array of file elements cluttering your memory/network.
There's actually a reason why you got the lectures: it's the correct answer to your problem. Here's the background, so that perhaps you can make some changes in your live environment.
First: directories are stored on the filesystem; think of them as files, because that's exactly what they are. When you iterate through the directory, you have to read those blocks from the disk. Each directory entry will require enough space to hold the filename, and permissions, and information on where that file is found on-disk.
Second: directories aren't stored with any internal ordering (at least, not in the filesystems where I've worked with directory files). If you have 150,000 entries and 2 sub-directories, those 2 sub-directory references could be anywhere within the 150,000. You have to iterate to find them, there's no way around that.
So, let's say that you can't avoid the big directory. Your only real option is to try to keep the blocks comprising the directory file in the in-memory cache, so that you're not hitting the disk every time you access them. You can achieve this by regularly iterating over the directory in a background thread -- but this is going to cause undue load on your disks, and interfere with other processes. Alternatively, you can scan once and keep track of the results.
The alternative is to create a tiered directory structure. If you look at commercial websites, you'll see URLs like /1/150/15023.html -- this is meant to keep the number of files per directory small. Think of it as a BTree index in a database.
Of course, you can hide that structure: you can create a filesystem abstraction layer that takes filenames and automatically generates the directory tree where those filenames can be found.
The key problem could be the File.isDirectory() call made in a loop.
File.isDirectory() can be extremely slow. I saw NFS take 10 seconds to process a 200-file directory.
If you can avoid File.isDirectory() calls by any means (e.g. test for an extension; no extension == directory), you could improve the performance drastically.
Otherwise I would suggest doing JNA/JNI/writing a native script that does this for you.
The jCifs library lets you manipulate windows network shares more efficiently. I am not aware of a library that would do this for other network file systems.
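A minimal sketch of the extension heuristic mentioned above; treating names without a dot as directory candidates is an assumption that only holds if your naming convention guarantees it:

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DirHeuristic {
    static List<File> probableDirectories(File parent) {
        List<File> dirs = new ArrayList<>();
        String[] names = parent.list();   // names only, no per-entry stat
        if (names == null) {
            return dirs;
        }
        for (String name : names) {
            // Only fall back to the expensive isDirectory() call for names
            // that do not look like regular files (no extension).
            if (!name.contains(".")) {
                File candidate = new File(parent, name);
                if (candidate.isDirectory()) {
                    dirs.add(candidate);
                }
            }
        }
        return dirs;
    }
}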
You could hack it if the 150k files all (or a significant number of them) had a similar naming convention like:
*.jpg
*Out.txt
and only actually create file objects for the ones you are unsure about being a folder.
I don't know if the overhead of shelling out to cmd.exe would eat it up, but one possibility would be something like this:
...
Runtime r = Runtime.getRuntime();
Process p = r.exec("cmd.exe /c dir /s/b/ad C:\\folder");
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
for (;;) {
    String d = br.readLine();
    if (d == null)
        break;
    System.out.println(d);
}
...
/c means run the command and then terminate, so the output stream ends
/s means search subdirectories
/ad means only return directories
/b means bare format; combined with /s it prints the full pathname of each entry
I came across a similar question when debugging performance in a Java application that enumerates plenty of files. It was using the old approach
for (File f : new File("C:\\").listFiles()) {
    if (f.isDirectory()) {
        continue;
    }
}
And it appears that each f.isDirectory() is a call into the native file system which, at least on NTFS, is very slow. Java 7 NIO has an additional API, but not all of its methods perform well there. I'll just provide the JMH benchmark results here:
Benchmark                  Mode  Cnt  Score   Error  Units
MyBenchmark.dir_listFiles  avgt    5  0.437 ± 0.064   s/op
MyBenchmark.path_find      avgt    5  0.046 ± 0.001   s/op
MyBenchmark.path_walkTree  avgt    5  1.702 ± 0.047   s/op
The numbers come from executing this code:
java -jar target/benchmarks.jar -bm avgt -f 1 -wi 5 -i 5 -t 1
static final String testDir = "C:/Sdk/Ide/NetBeans/src/dev/src/";
static final int nCycles = 50;

public static class Counter {
    int countOfFiles;
    int countOfFolders;
}

@Benchmark
public List<File> dir_listFiles() {
    List<File> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        File dir = new File(testDir);
        files.clear();
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) {
                continue;
            }
            files.add(f);
        }
    }
    return files;
}

@Benchmark
public List<Path> path_walkTree() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path path, BasicFileAttributes arg1) throws IOException {
                files.add(path);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes arg1)
                    throws IOException {
                return path == dir ? FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE;
            }
        });
    }
    return files;
}

@Benchmark
public List<Path> path_find() throws Exception {
    final List<Path> files = new ArrayList<>(1000);
    for (int i = 0; i < nCycles; i++) {
        Path dir = Paths.get(testDir);
        files.clear();
        files.addAll(Files.find(dir, 1, (path, attrs)
                -> true /*!attrs.isDirectory()*/).collect(Collectors.toList()));
    }
    return files;
}
If your OS is 'stable', give JNA a try:
opendir/readdir on UNIX
FindFirstFile and related API on Windows
Java7 with NIO2
These are all "streaming" APIs. They don't force you to allocate a 150k list/array before you start searching. IMHO this is a great advantage in your scenario.
Here's an off-the-wall solution, devoid of any testing at all. It's also dependent on having a filesystem that supports symbolic links. This isn't a Java solution. I suspect your problem is filesystem/OS-related, and not Java-related.
Is it possible to create a parallel directory structure, with subdirectories based on initial letters of the filenames, and then symbolically link to the real files ? An illustration
/symlinks/a/b/cde
would link to
/realfiles/abcde
(where /realfiles is where your 150,000 files reside)
You'd have to create and maintain this directory structure, and I don't have enough info to determine if that's practical. But the above would create a fast(er) index into your non-hierarchical (and slow) directory.
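Although the answer frames this as a filesystem-level idea rather than a Java one, NIO can create such links where the filesystem supports them. A minimal sketch; the /symlinks and /realfiles layout follows the illustration above, and the bucketing by the first two letters is an assumption:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SymlinkIndex {
    public static void buildIndex(Path realFiles, Path symlinkRoot) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(realFiles)) {
            for (Path file : stream) {
                String name = file.getFileName().toString();
                if (name.length() < 3) {
                    continue; // too short to bucket by the first two letters
                }
                // e.g. "abcde" -> /symlinks/a/b/cde
                Path linkDir = symlinkRoot
                        .resolve(name.substring(0, 1))
                        .resolve(name.substring(1, 2));
                Files.createDirectories(linkDir);
                Path link = linkDir.resolve(name.substring(2));
                if (!Files.exists(link, LinkOption.NOFOLLOW_LINKS)) {
                    Files.createSymbolicLink(link, file);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        buildIndex(Paths.get("/realfiles"), Paths.get("/symlinks"));
    }
}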
There is also recursive parallel scanning at http://blogs.oracle.com/adventures/entry/fast_directory_scanning. Essentially, siblings are processed in parallel. There are also encouraging performance tests.
Maybe you could write a directory-searching program in C#/C/C++ and use JNI to get it into Java. I don't know whether this would improve performance or not.
Well, either JNI, or, if you say your deployment is constant, just run "dir" on Windows or "ls" on *nixes, with appropriate flags to list only directories (Runtime.exec())
In that case you might try some JNA solution: a platform-dependent directory traverser (FindFirst, FindNext on Windows) with the possibility of some iteration pattern. Also, Java 7 will have much better file system support; it's worth checking out the specs (I don't remember any specifics).
Edit: An idea: one option is to hide the slowness of the directory listing from the user's eyes. In a client side app, you could use some animation while the listing is working to distract the user. Actually depends on what else your application does beside the listing.
As of 2020, the DirectoryStream does seem to be faster than using File.listFiles() and checking each file with isDirectory().
I learned the answer from here:
https://www.baeldung.com/java-list-directory-files
I'm using Java 1.8 on Windows 10.
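A minimal sketch of the DirectoryStream approach; the Files::isDirectory filter and the example path are illustrative additions:

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class ListSubdirectories {
    static List<Path> listSubdirectories(Path dir) throws IOException {
        List<Path> subdirs = new ArrayList<>();
        // The filter is applied while the directory entries are streamed,
        // so no full File[] array is materialised up front.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, Files::isDirectory)) {
            for (Path entry : stream) {
                subdirs.add(entry);
            }
        }
        return subdirs;
    }

    public static void main(String[] args) throws IOException {
        listSubdirectories(Paths.get("C:/folder")).forEach(System.out::println);
    }
}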
I need advice from someone who knows Java and its memory issues very well. I have large CSV files (something like 500 MB each) and I need to merge these files into one using only 64 MB of heap (-Xmx64m). I've tried to do it in different ways, but nothing works: I always get an OutOfMemoryError. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it takes a lot of memory, so it can't fit in 64 MB.
public class ImprovedInnerJoin {

    public static void main(String[] args) throws IOException {
        RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel = resultFile.getChannel().position(0);
        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);

        while (secondChannel.position() != secondChannel.size()) {
            Map<String, List<String>> table2Part = new HashMap<>();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1) {
                    break;
                }
                secondBuffer.rewind();
                String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
                if (!table2Part.containsKey(table2Tuple[0])) {
                    table2Part.put(table2Tuple[0], new ArrayList<>());
                }
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }
            Set<String> table2Keys = table2Part.keySet();

            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
                for (String table2Key : table2Keys) {
                    if (table1Tuple[0].equals(table2Key)) {
                        for (String value : table2Part.get(table2Key)) {
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value; // 0,14 or result buffer will overflow
                            resultBuffer.put(result.getBytes());
                            resultBuffer.rewind();
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }
        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}
A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Setup n file writers
For each of your files that you want to join and for each line:
take the hashcode of the key you want to join on
compute the modulo of the hashcode and n, that will give you k
append your csv line to the kth file writer
Flush/Close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
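A minimal sketch of the partitioning pass described above; the file names, the comma-separated format with the join key in the first column, and n = 16 are assumptions:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HashPartitioner {
    public static void partition(Path input, String prefix, int n) throws IOException {
        // Set up n bucket writers
        BufferedWriter[] writers = new BufferedWriter[n];
        for (int k = 0; k < n; k++) {
            writers[k] = Files.newBufferedWriter(Paths.get(prefix + "_" + k + ".csv"));
        }
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));   // join key = first column
                int k = Math.floorMod(key.hashCode(), n);            // same key -> same bucket
                writers[k].write(line);
                writers[k].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int n = 16;
        partition(Paths.get("input_A.csv"), "bucket_A", n);
        partition(Paths.get("input_B.csv"), "bucket_B", n);
        // Now join bucket_A_k.csv with bucket_B_k.csv for each k, using an in-memory HashMap.
    }
}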
Limitations
Why did I mention hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Think of the worst case: all of your lines have the exact same key. Then everything lands in a single bucket file, and you haven't won anything from partitioning.
Similarly for skewed distributions: sometimes a few of your bucket files will be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fallback to an external merge sort on the big partition files.
Sometimes all three are used in different combinations, which is called dynamic partitioning.
If central memory is a constraint for your application but you can access a persistent file, I would, as blahfunk suggested, create a temporary SQLite file in your tmp folder, read every file in chunks and merge them with a simple join. You could create a temporary SQLite DB through libraries such as Hibernate; take a look at what I found in this StackOverflow question: How to create database in Hibernate at runtime?
If you cannot perform such a task, your remaining option is to consume more CPU: load just the first row of the first file, search for a row with the same index in the second file, buffer the result and flush it as late as possible to the output file, and repeat this for every row of the first file.
Maybe you can stream the first file and turn each line into a hashcode and save all those hashcodes in memory. Then stream the second file and make a hashcode for each line as it comes in. If the hashcode is in the first file, i.e., in memory, then don't write the line, else write the line. After that, append the first file in its entirety into the result file.
This would effectively create an index to compare your updates against.
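A minimal sketch of that idea, under the assumption made in the answer that a hashcode match is good enough to treat two lines as equal (colliding lines would be dropped); the file names are placeholders:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class HashcodeMerge {
    public static void main(String[] args) throws IOException {
        Path first = Paths.get("input_A.csv");
        Path second = Paths.get("input_B.csv");
        Path result = Paths.get("result.csv");

        // Pass 1: keep only the hashcodes of the first file's lines in memory,
        // not the lines themselves.
        Set<Integer> seen = new HashSet<>();
        try (BufferedReader reader = Files.newBufferedReader(first)) {
            String line;
            while ((line = reader.readLine()) != null) {
                seen.add(line.hashCode());
            }
        }

        try (BufferedWriter writer = Files.newBufferedWriter(result)) {
            // Pass 2: write only those lines of the second file whose hashcode was not seen.
            try (BufferedReader reader = Files.newBufferedReader(second)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    if (!seen.contains(line.hashCode())) {
                        writer.write(line);
                        writer.newLine();
                    }
                }
            }
            // Then append the first file in its entirety.
            try (BufferedReader reader = Files.newBufferedReader(first)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }
}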
I have a block of text I'm trying to interpret in Java (or with grep/awk/etc.) that looks like the following:
Somewhat differently, plaques of the rN8 and rN9 mutants and human coronavirus OC43 as well as the more divergent
were of fully wild-type size, indicating that the suppressor mu- SARS-CoV, human coronavirus HKU1, and bat coronaviruses
tations, in isolation, were not noticeably deleterious to the HKU4, HKU5, and HKU9 (Fig. 6B). Thus, not only do mem-
--
able effect on the viral phenotype. A potentially related obser- sented for the existence of an interaction between nsp9
vation is that the mutation A2U, which is also neutral by itself, nsp8 (56). A hexadecameric complex of SARS-CoV nsp8 and
is lethal in combination with the AACAAG insertion (data not nsp7 has been found to bind to double-stranded RNA. The
And what I'd like to do is split it into two parts: left and right. I'm having trouble coming up with a regex or any other method that would split a block of text that is obviously split visually, but not obviously to a programming language. The lengths of the lines are variable.
I've considered locating the first block and then finding the second by looking for multiple spaces, but I'm not sure that's a robust solution. Any ideas, snippets, pseudo code, links, etc.?
Text Source
The text has been run through pdftotext as follows: pdftotext -layout MyPdf.pdf
Blur the text and come up with an array of the character density per column of text. Then look for gaps and split there.
// Replace single spaces between non-space chars with '.', so only runs of 2+ spaces remain as gaps.
String blurredText = text.replaceAll("(?<=\\S) (?=\\S)", ".");
String[] blurredLines = blurredText.split("\r\n?|\n");

int maxRowLength = 0;
for (String blurredLine : blurredLines) {
    maxRowLength = Math.max(maxRowLength, blurredLine.length());
}

int[] columnCounts = new int[maxRowLength];
for (String blurredLine : blurredLines) {
    for (int i = 0, n = blurredLine.length(); i < n; ++i) {
        if (blurredLine.charAt(i) != ' ') { ++columnCounts[i]; }
    }
}

// Look for runs of zero of at least length 3.
// Alternatively, you might look for the n longest runs of zeros.
// Alternatively, you might look for runs of length min(columnCounts) to ignore
// horizontal rules.
int minBreakLen = 3; // A tuning parameter.
List<Integer> breaks = new ArrayList<Integer>();
for (int i = 0; i < maxRowLength - minBreakLen; ++i) {
    if (columnCounts[i] != 0) { continue; }
    int runLength = 1;
    while (i + runLength < maxRowLength && 0 == columnCounts[i + runLength]) {
        ++runLength;
    }
    if (runLength >= minBreakLen) {
        breaks.add(i);
    }
    i += runLength - 1;
}

System.out.println(breaks);
I doubt there is any robust solution to this. I would go for some sort of heuristic approach.
Off the top of my head, I would calculate a histogram of the column index of the first character of each word, and split on the column with the highest score (the idea being to find lots of words that are all aligned horizontally). I might also choose to weight this based on the number of preceding spaces.
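A minimal sketch of that histogram heuristic; the weighting by preceding spaces is left out, and the input is assumed to already be split into lines:

import java.util.List;

public class ColumnSplitHeuristic {
    // Returns the column index where the right-hand column most likely starts.
    static int guessSplitColumn(List<String> lines) {
        int maxLen = lines.stream().mapToInt(String::length).max().orElse(0);
        int[] histogram = new int[maxLen + 1];
        for (String line : lines) {
            for (int col = 0; col < line.length(); col++) {
                // Count the column index of the first character of each word.
                boolean wordStart = line.charAt(col) != ' '
                        && (col == 0 || line.charAt(col - 1) == ' ');
                if (wordStart) {
                    histogram[col]++;
                }
            }
        }
        int best = 0;
        for (int col = 1; col < histogram.length; col++) {
            // Skip column 0: that is just the left margin of the left-hand column.
            if (best == 0 || histogram[col] > histogram[best]) {
                best = col;
            }
        }
        return best;
    }
}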
I work in this general area. I am surprised that a double-column bioscience text of recent times (SARS, etc.) would be rendered in double-column monospace as the original - it would be typeset in proportional font or in HTML. So I suspect your text came from some other format (such as PDF). If so then you should try to get that format. PDF is horrible to parse, but PDF flattened to monospace is probably worse.
If you possibly can, find someone who has worked in the area and see what they have done. If you have multiple documents (e.g. from different journals or reports) then your problem is worse. Yes, I could write an algorithm to solve the example you have posted, but my guess is it will break on the next set of documents. You will end up customising this for each different source (I and others have had to do this).
UPDATE: Thanks. As it's PDF, I would start by asking around. We collaborate with the group at Penn State (who have also done Citeseer). I also have colleagues at Cambridge who have spent months on a PDF reader.
If you want to do it yourself - and it will take time - then I'd start with PDFBox. I've done quite a lot with this and I think it's better for this than pdf2text or pdftotext. I can't remember whether it has a double-column option - I think so.
UPDATE Here is a recent answer of several ways of tackling double-column PDF
http://metaoptimize.com/qa/questions/3943/methods-for-extracting-two-column-text-from-a-pdf
I'd certainly see what other people have done.
FWIW I spend a lot of time trying to convince people that scientists should not create their output in PDF because it destroys machine parsing - as you and I have found
UPDATE: You get the PDFs from your PI (== Principal Investigator?), in which case you'll get lots of different sources, which makes it worse.
What is the real problem you are trying to solve? I may be able to help
Searching for a string in a file and writing the lines containing that string to another file takes 15-20 minutes for a single zip file of 70 MB (compressed).
Are there any ways to minimise this?
My source code:
getting Zip file entries
zipFile = new ZipFile(source_file_name);
entries = zipFile.entries();
while (entries.hasMoreElements()) {
    ZipEntry entry = (ZipEntry) entries.nextElement();
    if (entry.isDirectory()) {
        continue;
    }
    searchString(Thread.currentThread(), entry.getName(),
            new BufferedInputStream(zipFile.getInputStream(entry)),
            Out_File, search_string, stats);
}
zipFile.close();
Searching String
public void searchString(Thread CThread, String Source_File, BufferedInputStream in, File outfile, String search, String stats) throws IOException {
    int count = 0;
    int countw = 0;
    int countl = 0;
    String s;
    String[] str;
    BufferedReader br2 = new BufferedReader(new InputStreamReader(in));
    System.out.println(CThread.currentThread());
    while ((s = br2.readLine()) != null) {
        str = s.split(search);
        count = str.length - 1;
        countw += count; // word count
        if (s.contains(search)) {
            countl++; // line count
            WriteFile(CThread, s, outfile.toString(), search);
        }
    }
    br2.close();
    in.close();
}
--------------------------------------------------------------------------------
public void WriteFile(Thread CThread, String line, String out, String search) throws IOException {
    BufferedWriter bufferedWriter = null;
    System.out.println("write thread " + CThread.currentThread());
    bufferedWriter = new BufferedWriter(new FileWriter(out, true));
    bufferedWriter.write(line);
    bufferedWriter.newLine();
    bufferedWriter.flush();
}
Please help me. It's really taking 40 minutes for 10 files using threads, and 15-20 minutes for a single 70 MB compressed file. Are there any ways to minimise the time?
You are reopening the file output handle for every single line you write.
This is likely to have a massive performance impact, far outweighing other performance issues. Instead I would recommend creating the BufferedWriter once (e.g. upon the first match) and then keeping it open, writing each matching line and then closing the Writer upon completion.
Also, remove the call to flush(); there is no need to flush each line as the call to Writer.close() will automatically flush any unwritten data to disk.
Finally, as a side note your variable and method naming style does not follow the Java camel case convention; you might want to consider changing it.
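A minimal sketch of that restructuring; the writer is created up front rather than lazily on the first match, the word/line counting from the original is dropped for brevity, and the class name is illustrative:

import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

public class SearchAndWrite {
    public void searchString(BufferedInputStream in, File outfile, String search) throws IOException {
        // Open the reader and the writer once, not once per matching line.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in));
             BufferedWriter writer = new BufferedWriter(new FileWriter(outfile, true))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(search)) {
                    writer.write(line);
                    writer.newLine();   // no per-line flush(); close() flushes at the end
                }
            }
        }
    }
}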
I'm not sure if the cost you are seeing comes from disk operations or from string manipulations. I'll assume for now that the problem is the strings; you can check that by writing a test driver that runs your code with the same line over and over.
I can tell you that split() is going to be very expensive in your case because you are producing strings you don't need and then throwing them away, which creates a lot of garbage-collection overhead. You may want to increase the amount of space available to your JVM with -Xmx.
If you merely separate words by the presence of whitespace, then you would do much better by using a regular expression matcher that you create before the loop and apply to each string. The number of matches when applied to a given string will be your word count, and that does not create an array of strings (which is very wasteful and which you don't use). You will see in the JavaDocs that split() does work via regular expressions; that is true, but split() does the extra step of creating separate strings, and that's where your waste might be.
You can also use a regular expression to search for the match instead of contains though that may not be significantly faster.
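A minimal sketch of counting occurrences with a precompiled Matcher instead of split(); Pattern.quote is used on the assumption that the search term is a literal string rather than a regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchCounter {
    public static void main(String[] args) {
        String search = "needle";
        // Compile once, before the read loop, and reuse for every line.
        Pattern pattern = Pattern.compile(Pattern.quote(search));

        String line = "one needle, two needles, no hay";
        Matcher matcher = pattern.matcher(line);
        int occurrences = 0;
        while (matcher.find()) {
            occurrences++;   // counts matches without building a String[] the way split() does
        }
        boolean lineMatches = occurrences > 0;   // replaces the separate contains() call
        System.out.println(occurrences + " match(es), line matches: " + lineMatches);
    }
}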
You could make things parallel by using multiple threads. However, if split() is the cause of your grief, your problem is the overhead and running out of heap space, so you won't necessarily benefit from it.
More generally, if you need to do this a lot, you may want to write a script in a language more "friendly" to string manipulation. A 10-line script in Python can do this much faster.
wow, what are you doing in this method
WriteFile(CThread,s, outfile.toString(), search);
every time you get a line containing your text, you are creating a new BufferedWriter(new FileWriter(out, true));
Just create one BufferedWriter in your searchString method and use that to write the lines. There is no need to open it again and again. It will drastically improve the performance.
One problem here might be that you stop reading when you write. I would probably use one thread for reading and another thread for writing the file. As an extra optimization the thread writing the results could buffer them into memory and write them to the file as a batch, say every ten entries or something.
In the writing thread you should queue the incoming entries before handling them.
Of course, you should maybe first debug where that time is spent: is it the IO or something else?
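A minimal sketch of that reader/writer split with a queue between the two threads; the queue capacity, the poison-pill end marker, the output file name and the stand-in input are all assumptions:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedSearchWriter {
    // End-of-stream marker, compared by reference on purpose.
    private static final String POISON_PILL = new String("POISON_PILL");

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

        // Writer thread: drains the queue and writes matches in the background.
        Thread writerThread = new Thread(() -> {
            try (BufferedWriter writer = new BufferedWriter(new FileWriter("matches.txt", true))) {
                while (true) {
                    String line = queue.take();
                    if (line == POISON_PILL) {
                        break;
                    }
                    writer.write(line);
                    writer.newLine();
                }
            } catch (IOException | InterruptedException e) {
                e.printStackTrace();
            }
        });
        writerThread.start();

        // Reader side (a stand-in loop here): search and hand matching lines to the writer.
        String[] fakeInput = {"a needle here", "nothing", "another needle"};
        for (String line : fakeInput) {
            if (line.contains("needle")) {
                queue.put(line);
            }
        }
        queue.put(POISON_PILL);
        writerThread.join();
    }
}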
There are too many potential bottlenecks in this code for anyone to be sure which are the critical ones. Therefore you should profile the application to determine what is causing it to be slow.
Armed with that information, decide whether the problem is in reading the ZIP file, doing the searching, or writing the matches to the output file.
(Repeatedly opening and closing the output file is a bad idea, but if you only get a tiny number of search hits it won't make much difference to the overall performance.)