External Sorting from files in Java

I am wondering how to write Java code for the following pseudocode:
foreach file F in file directory D
foreach int I in file F
sort all I from each file
Basically this is part of the external sorting algorithm: those files contain sorted lists of integers, and I want to read the first integer from each file, compare them, output the smallest to another file, and then advance to the next integer from each file, repeating until all the integers are fully sorted.
The problem is that, as far as I understand, for each file we need a reader, so if we have N files does that mean we need N file readers?
======update=======
I am wondering, is it something that looks like this? Correct me if I've missed anything, or if there is a better approach.
int numOfFiles = 10;
Scanner[] scanners = new Scanner[numOfFiles];
try {
    // open a reader for each file
    for (int i = 0; i < numOfFiles; i++) {
        scanners[i] = new Scanner(new BufferedReader(
                new FileReader("file" + i + ".txt")));
    }
}
catch (FileNotFoundException fnfe) {
    fnfe.printStackTrace(); // don't swallow the failure silently
}

The problem is that as far as I understand for each file we need a reader, so if we have N files then does that mean we need N file readers?
Yes, that's right - unless you want to either go back over the data repeatedly, or read the whole of each file into memory. Either of those would let you get away with only one file open at a time - but that may well not suit what you want to do.
Operating systems usually only allow you to open a certain number of files at a time. If you're trying to do something like create a single sorted set of results from a very large number of files, you might want to consider operating on a few of them at a time, producing larger intermediate files. At its simplest, this would just sort two files at a time, e.g.
input1 + input2 => tmp-a1
input3 + input4 => tmp-a2
input5 + input6 => tmp-a3
input7 + input8 => tmp-a4
tmp-a1 + tmp-a2 => tmp-b1
tmp-a3 + tmp-a4 => tmp-b2
tmp-b1 + tmp-b2 => result
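
For illustration, here is a minimal sketch of one such pairwise merge step, assuming each input file holds whitespace-separated sorted integers (the file names and the helper are my own):

import java.io.*;
import java.util.Scanner;

public class TwoWayMerge {
    // Merges two files of sorted integers into one sorted output file.
    static void merge(File in1, File in2, File out) throws IOException {
        try (Scanner a = new Scanner(new BufferedReader(new FileReader(in1)));
             Scanner b = new Scanner(new BufferedReader(new FileReader(in2)));
             PrintWriter w = new PrintWriter(new BufferedWriter(new FileWriter(out)))) {
            Integer x = a.hasNextInt() ? a.nextInt() : null;
            Integer y = b.hasNextInt() ? b.nextInt() : null;
            while (x != null && y != null) {
                if (x <= y) { w.println(x); x = a.hasNextInt() ? a.nextInt() : null; }
                else        { w.println(y); y = b.hasNextInt() ? b.nextInt() : null; }
            }
            // Drain whichever input still has values left.
            while (x != null) { w.println(x); x = a.hasNextInt() ? a.nextInt() : null; }
            while (y != null) { w.println(y); y = b.hasNextInt() ? b.nextInt() : null; }
        }
    }
}

Calling merge(new File("input1"), new File("input2"), new File("tmp-a1")) produces the first intermediate file in the tree above; a PriorityQueue over N scanners generalizes this to an N-way merge when the open-file limit allows it.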

Yes, we must have N file readers for reading N files.
In order to do that, iterate over all the files in the directory, read each one, and store its contents in a list; then sort that list to get your output.

There's a method called polyphase merge sort, which I recently learnt in my data structures class, where you traverse the files in the form of runs (a run is a sorted sequence). There are n sources, and a destination.
The gist of this polyphase method is having to keep no file (given a set of files) idle. It significantly reduces the number of iterations. It's done by taking a Fibonacci sequence of an order equal to the number of files. So in the case of 5 files, I'll take the Fibonacci sequence of order 5: [1,1,2,4,8], which represents the number of runs you're going to take out of each file and place; of the files corresponding to runs=1, one will be the destination.
In short:
Distribute the data into runs according to the Fibonacci sequence. [This assumes the entire dataset is in a single file; if that's not the case, you can always create runs in situ, where you might want to add dummy runs to suit the sequence.]
Take the first n runs from every file into the buffer, sort them (insertion sort preferred) and dump them into ONE file. That ONE file is again selected by the Fibonacci sequence.
Repeat until you end up with a single file containing a single run.
This paper neatly explains the polyphase concept: ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/76/543/CS-TR-76-543.pdf
http://en.wikipedia.org/wiki/Polyphase_merge_sort explains the algorithm in more detail.
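
For a feel of where those run counts come from, here is a hedged sketch that computes the first terms of a generalized Fibonacci sequence, where each term is the sum of the previous 'order' terms; under this recurrence the answer's [1,1,2,4,8] comes out at order 4, so treat the exact mapping between file count and order as an assumption and check the linked paper.

// Sketch: generalized Fibonacci terms; polyphase merge uses such a
// sequence to decide how many runs to place on each file.
static int[] generalizedFibonacci(int order, int terms) {
    int[] seq = new int[terms];
    seq[0] = 1; // starts 1, 1, 2, 4, 8, ... for order 4
    for (int i = 1; i < terms; i++) {
        int sum = 0;
        for (int j = Math.max(0, i - order); j < i; j++) {
            sum += seq[j];
        }
        seq[i] = sum;
    }
    return seq;
}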

Just presenting code, not answering "need N file readers?" :)
use org.apache.commons.io:
//get line iterators:
Collection<File> files = FileUtils.listFiles(/* TODO : filter conf */);
List<LineIterator> iters = new ArrayList<LineIterator>();
for (File file : files) {
    iters.add(FileUtils.lineIterator(file, "UTF-8"));
}
//collect a line from each file
List<String> numbers = new ArrayList<String>();
for (LineIterator li : iters) {
    numbers.add(li.nextLine());
}
//sort
//Arrays.sort(numbers);// will fail: numbers is a List, and Strings would sort lexicographically anyway :)
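To sort those collected lines numerically, parse them first; a minimal sketch continuing the snippet above (it assumes each line holds a single integer):

//parse, then sort numerically:
List<Integer> values = new ArrayList<Integer>();
for (String s : numbers) {
    values.add(Integer.parseInt(s.trim()));
}
Collections.sort(values);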

Yes, you need N file readers.
public void workOnFiles() {
    File[] D = new File("directoryName").listFiles(); // D.length should equal N
    for (File F : D) {
        doSortingForEachFile(F); // do the sorting part here
    }
}

public void doSortingForEachFile(File f) {
    try {
        ArrayList<Integer> list = new ArrayList<Integer>();
        Scanner s = new Scanner(f);
        while (s.hasNextInt()) { // read the ints inside the file
            list.add(s.nextInt());
        }
        s.close(); // once closed, the scanner cannot be reused
        Collections.sort(list); // sorts the ArrayList of ints
        // ...write the numbers inside list to another file...
    } catch (Exception e) {
        e.printStackTrace(); // don't swallow exceptions silently
    }
}

Related

How to output sorted files in Java

I have a problem where I want to scan the files that are in a certain folder and output them.
The only problem is that the output is (1.jpg, 10.jpg, 11.jpg, 12.jpg, ..., 19.jpg, 2.jpg) when I want it to be (1.jpg, 2.jpg, and so on). Since I use File actual = new File(i.); (i is the number of times the loop repeats) to scan for images, I don't know how to sort the output.
This is my code for now:
//variables
String htmlHeader = ("<!DOCTYPE html>\n"
        + "<html lang=\"en\">\n"
        + "<head>\n"
        + "<meta charset=\"UTF-8\">\n"
        + "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n"
        + "<meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n"
        + "<title>Document</title>\n"
        + "</head>"
        + "<body>\n");
String mangaName = ("THREE DAYS OF HAPPINESS");
String htmlEnd = ("</body>\n</html>");
String image = ("image-");
//ask for page number
Scanner scan = new Scanner(System.in);
System.out.print("enter a chapter number: ");
int n = scan.nextInt();
//create file for chapter
File creator = new File("manga.html");
//loop over chapters
for (int i = 1; i <= n; ++i) {
    //writing to HTML file
    BufferedWriter bw = new BufferedWriter(new FileWriter("manga" + i + ".html"));
    bw.write(htmlHeader);
    bw.write("<h2><center>" + mangaName + "</center></h2><br/>");
    //scanning files
    File actual = new File("Three Days Of Happiness Chapter " + i + " - Manganelo_files.");
    for (File f : actual.listFiles()) {
        String pageName = f.getName();
        //create list
        List<String> list = Arrays.asList(pageName);
        list.sort(Comparator.nullsFirst(Comparator.comparing(String::length).thenComparing(Comparator.naturalOrder())));
        System.out.println(list);
        //writing body to html file
        bw.write("<p><center><img src=\"Three Days Of Happiness Chapter " + i + " - Manganelo_files/" + pageName + "\"></center><br/></p>\n");
        System.out.println(pageName);
    }
    bw.write(htmlEnd);
    bw.close();
    System.out.println("Process Finished");
}
When you try to sort the names, you'll most certainly notice that they are sorted alphanumerically (e.g. comparing 9 with 12, 12 would come before 9 because the leftmost digit 1 < 9).
One way to get around this is to use an extended numbering format when naming & storing your files.
This has been working great for me when sorting pictures, for example. I use YYYY-MM-DD for all dates, regardless of whether the day contains one digit (e.g. 9) or two digits (11). This means that I always type 9 as 09. It also means that every file name in a given folder has the same length, and each digit (when compared to the corresponding digit of any adjacent file) is compared properly.
One solution to your problem is to do the same and add zeros to the left of the file names so that they are easily sorted both by the OS and by your Java program. The drawback of this solution is that you'll need to decide beforehand on the maximum number of files you'll want to store in a given folder, by setting the number of digits accordingly (e.g. 3 digits would mean a maximum of 1000 uniquely & linearly numbered file names, from 000 to 999). The plus, however, is that it saves you the hassle of having to sort unevenly numbered files, since your files are pre-sorted once and are ready to be read quickly whenever needed.
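For illustration, a hedged sketch of that renaming idea (the helper name and the fixed width are my own):

// Zero-pads the numeric part of a name like "9.jpg" so it becomes
// "009.jpg" and sorts correctly next to "010.jpg"; assumes the name
// has the form "<number>.<extension>".
static String zeroPad(String fileName, int width) {
    int dot = fileName.indexOf('.');
    int number = Integer.parseInt(fileName.substring(0, dot));
    return String.format("%0" + width + "d", number) + fileName.substring(dot);
}

zeroPad("9.jpg", 3) yields "009.jpg".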
Generally, file systems do not have an order to the files in a directory. Instead, anything that lists files (be it an ls or dir command on a command line, calling Files.list in java code, or opening Finder or Explorer) will apply a sorting order.
One common sorting order is 'alphanumerically'. In which case, the order you describe is correct: 2 comes after 1 and also after 10. You can't wave a magic wand and tell the OS or file system driver not to do that; files as a rule don't have an 'ordering' property.
Instead, make your filenames such that they do sort the way you want, when sorting alphanumerically. Thus, the right name for the first file would be 01.jpg. Or possibly even 0001.jpg - you're going to have to make a call about how many digits you're going to use before you start, unfortunately.
String.format("%05d", 1) becomes "00001" - that's pretty useful here.
The same principle applies to reading files - you can't just rely on the OS sorting them for you. Instead, read the names into e.g. a list of some sort and then sort that. You're going to have to write a fairly funky sorting order: find the dot, strip off the left side, check if it is a number, etc. Quite complicated. It would be a lot simpler if the 'input' were already properly zero-prefixed; then you could just sort the names naturally instead of having to write a complex comparator.
That comparator should probably be modal. Comparators work by being handed 2 elements, and you must say which one is 'earlier', and you must be consistent (if a is before b, and later I ask you: so, how about b and a? - you must indicate that b is after a).
Thus, an algorithm would look something like the following (a sketch follows the list):
Determine if a is numeric or not (find the dot, parseInt the substring from start to the dot).
Determine if b is numeric or not.
If both are numeric, check the ordering of those numbers. If they have an order (i.e. aren't identical), return that answer. Otherwise, compare the stuff after the dot (1.jpg should presumably be sorted before 1.png).
If neither is numeric, just compare alphanumerically (aName.compareTo(bName)).
If one is numeric and the other one is not, the numeric one always wins, and vice versa.
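Putting those steps together, a hedged sketch of such a comparator (the names BY_NUMBER_THEN_NAME and numericPrefix are made up for illustration):

import java.util.Comparator;

// Modal comparator: numeric names sort by their number, non-numeric
// names sort alphanumerically, and numeric beats non-numeric.
static final Comparator<String> BY_NUMBER_THEN_NAME = (a, b) -> {
    Integer na = numericPrefix(a);
    Integer nb = numericPrefix(b);
    if (na != null && nb != null) {
        int c = Integer.compare(na, nb);
        return c != 0 ? c : a.compareTo(b); // same number: compare extensions
    }
    if (na != null) return -1; // the numeric one always wins
    if (nb != null) return 1;
    return a.compareTo(b);
};

// Returns the number before the dot, or null if there is none.
static Integer numericPrefix(String name) {
    int dot = name.indexOf('.');
    if (dot <= 0) return null;
    try {
        return Integer.parseInt(name.substring(0, dot));
    } catch (NumberFormatException e) {
        return null;
    }
}

A file-name list could then be sorted with list.sort(BY_NUMBER_THEN_NAME).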

How can I improve the run-time complexity of my method?

I wrote a function in Java that edits file names, replacing each space char with a dash char.
Currently I iterate over all the files in a specific directory, iterate over each file name, create a new file name, and replace the file in the directory.
I guess that the current complexity is O(N*M) {N = number of files in the directory, M = number of chars in each file name}.
Can anyone help me improve the run-time complexity?
Thanks
public static void editSpace(String source, String target) {
    // Source directory where all the files are
    File dir = new File(source);
    File[] directoryListing = dir.listFiles();
    // Iterate over each file in the directory
    for (File file : directoryListing) {
        String childName = file.getName();
        String childNameNew = "";
        // Iterate over the file name and change every space char to a dash char
        for (int i = 0; i < childName.length(); i++) {
            if (childName.charAt(i) == ' ') {
                childNameNew += "-";
            } else {
                childNameNew += childName.charAt(i);
            }
        }
        // Update the new directory of the child
        String childDir = target + "\\" + childNameNew;
        // Renaming the file and moving it to a new location
        if (!(childNameNew.equals(""))
                && (file.renameTo(new File(childDir)))) {
            // If the file was copied successfully, delete the original file
            file.delete();
            // Print message
            System.out.println(childName + " File moved successfully to "
                    + childDir);
        }
        // Moving failed
        else {
            // Print message
            System.out.println(childName + " Failed to move the file to "
                    + childDir);
        }
    }
}
I guess that the current complexity is O(N*M) {N = number of files in the directory, M = number of chars in each file name}. Can anyone help me improve the run-time complexity?
Nobody can. You figured it out yourself: when your task is to modify N file names of roughly M chars each, you end up with N×M. There is no conceptual way to modify N file names based on their current names without looking at each name, and at each char in it.
But what is possible: look carefully at your code, and see if you can improve the actual implementation.
You should start by relying much more on library methods. For example, String.replace() allows you to turn all spaces into dashes with a single call. That shouldn't affect performance, but it slims down your own code (having less code is mostly a good thing!). You could go one step further and look at streams to use even less code, see here.
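For instance, the question's inner character loop collapses to a single library call (a sketch; childName as in the code above):

String childNameNew = childName.replace(' ', '-'); // all spaces at once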
But the real answer here: you are probably doing premature optimisation. In the end, you are talking about something where the JVM needs to call into the OS in order to make changes out there in the file system. There are zillions of aspects that influence overall, end-to-end performance for such a use case. It might be helpful to have more than one thread, so that you can "process" file names from different directories in parallel, for example.
On the other hand: creating a thread is a costly operation, and typically it only helps you to speed up CPU-intensive activities. Worse, multiple threads accessing the file system like that in parallel ... might actually slow things down, overall.
Meaning: depending on your overall setup, you might be able to speed up renaming files. Or not.
In the end, you are spending a lot of time and energy here. And the real question: is it really worth it?! Does it really matter to you whether your code needs 500 ms, or 1 sec, or 2 seconds? Depending on context it might, but maybe it doesn't. That is the first thing to clarify. And if you figure out that you really need the highest-performance solution here, then you will have to invest real time into measuring what is going on, and into experiments that find out which settings affect performance the most.
In other words: if you really care about performance here, you have a lot of low-level details to look at. If you don't care about performance that much, I would throw away the Java code and write 3 lines of Python code, or Kotlin, or whatever you normally use for scripting, and go with that. Not because that code will be faster, but because it will be easier to read, write, and maintain. That is what matters when performance isn't your primary priority.

How do I count the number of integers in this file using Java?

If I have a given .dat file which I'm trying to read, how can I count the number of 32-bit integers? I'm getting 2 different answers using 2 different methods.
First method:
int size = 0;
try (DataInputStream input = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file.getFD())))) {
    while (true) {
        file.skipBytes(4);
        size += 1;
    }
} catch (Exception ex) {
    System.out.println(ex);
}
System.out.println(size);
Second method:
File fileRead = new File(file);
long ret = fileRead.length() / 4;
The first method is probably the most accurate, since I'm reading and skipping 4 bytes at a time to count the integers packed sequentially in the file. However, the second method just takes the raw file size divided by 4, which is not the same. I think it might be including extra file-related data not related to the content.
The first method works, but it is very inefficient for large files. Any idea how I can speed things up and count the integers efficiently?
If you want to know how many times you can read a 32-bit integer from a plain binary file, method 2 gives the answer directly: the file length divided by 4, in constant time regardless of file size.
One caveat: counting by length only works if the file really is a flat sequence of 4-byte integers, e.g. one written through a DataOutputStream (writeInt stores each int as exactly 4 big-endian bytes). If the file was written through an ObjectOutputStream instead, it is not just a plain binary file: it contains serialization headers and per-object overhead, and neither method will count the integers correctly.
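As a minimal sketch of method 2 with java.nio (the file name is my own):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CountInts {
    public static void main(String[] args) throws IOException {
        Path file = Paths.get("data.dat"); // hypothetical input file
        long count = Files.size(file) / Integer.BYTES; // 4 bytes per 32-bit int
        System.out.println(count);
    }
}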

Merge 2 large CSV files using inner join

I need advice from someone who knows Java and its memory issues very well. I have large CSV files (something like 500 MB each) and I need to merge these files into one using only 64 MB of heap (-Xmx64m). I've tried to do it in different ways, but nothing works - I always get an OutOfMemoryError. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it uses a lot of memory, so it can't fit within 64 MB.
public class ImprovedInnerJoin {
    public static void main(String[] args) throws IOException {
        RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel = resultFile.getChannel().position(0);
        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);
        while (secondChannel.position() != secondChannel.size()) {
            Map<String, List<String>> table2Part = new HashMap<>();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1)
                    break;
                secondBuffer.rewind();
                String[] table2Tuple = (new String(secondBuffer.array(), Charset.defaultCharset())).split(",");
                if (!table2Part.containsKey(table2Tuple[0]))
                    table2Part.put(table2Tuple[0], new ArrayList<>());
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }
            Set<String> table2Keys = table2Part.keySet();
            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = (new String(firstBuffer.array(), Charset.defaultCharset())).split(",");
                for (String table2Key : table2Keys) {
                    if (table1Tuple[0].equals(table2Key)) {
                        for (String value : table2Part.get(table2Key)) {
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value; // 0,14 or the result buffer will overflow
                            resultBuffer.put(result.getBytes());
                            resultBuffer.rewind();
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }
        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}
A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Set up n file writers
For each of the files that you want to join, and for each line:
take the hashcode of the key you want to join on
compute the hashcode modulo n; that gives you k
append your csv line to the kth file writer
Flush/close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each pair of bucket files separately (a sketch of the partitioning step follows).
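Here is a hedged sketch of that partitioning step (the file naming is my own, and the join key is assumed to be the first comma-separated field):

import java.io.*;

public class HashPartitioner {
    // Splits one csv file into n bucket files by the hash of the join key.
    static void partition(File input, int n, String prefix) throws IOException {
        PrintWriter[] buckets = new PrintWriter[n];
        for (int k = 0; k < n; k++) {
            buckets[k] = new PrintWriter(new BufferedWriter(
                    new FileWriter(prefix + "-" + k + ".csv")));
        }
        try (BufferedReader r = new BufferedReader(new FileReader(input))) {
            String line;
            while ((line = r.readLine()) != null) {
                String key = line.split(",", 2)[0];
                int k = Math.floorMod(key.hashCode(), n); // avoid negative indices
                buckets[k].println(line);
            }
        } finally {
            for (PrintWriter w : buckets) {
                if (w != null) w.close();
            }
        }
    }
}

After partitioning both inputs with the same n, each pair of bucket files A-k / B-k can be joined independently with an in-memory HashMap.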
Limitations
Why did I mention hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Consider the worst case, where all of your lines have exactly the same key: you get only one bucket file, and you won nothing from the partitioning.
Similarly for skewed distributions: sometimes a few of your bucket files will be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fall back to an external merge sort on the big partition files.
Sometimes all three are used in different combinations, which is called dynamic partitioning.
If central memory is a constraint for your application but you can access a persistent file, I would, as suggested by blahfunk, create a temporary SQLite file in your tmp folder, read every file by chunks, and merge them with a simple join. You could create a temporary SQLite DB through libraries such as Hibernate; take a look at this StackOverflow question: How to create database in Hibernate at runtime?
If you cannot perform such a task, your remaining option is to consume more CPU: load just the first row of the first file, search for a row with the same index in the second file, buffer the result, and flush it as late as possible to the output file, repeating this for every row of the first file.
Maybe you can stream the first file and turn each line into a hashcode, saving all those hashcodes in memory. Then stream the second file and make a hashcode for each line as it comes in. If the hashcode is in the set from the first file, i.e. in memory, then don't write the line, else write it. After that, append the first file in its entirety to the result file.
This would effectively create an index to compare your updates against.

Removing file lines containing data which are not present in another file

I have a file Hier.csv which looks like this (thousands of lines):
value;nettingNodeData;ADM59505_10851487;CVAEngine;ADM;;USD;0.4;35661;BDR;NA;ICE;;RDC;MAS35661_10851487;CVAEngine;MA;10851487;RDC
I have another one, Prices.csv, which looks like this :
value;nettingNodePrices;ADM68834_22035364;CVAEngine;CVA with FTD;EUR;1468.91334249291905;DVA with FTD;EUR;5365.59742483701497
I have to make sure that both files have the same number of lines and the same ids (the third value of each line). It's a known fact that the set of ids in Hier.csv is larger and contains the set of ids from Prices.csv, i.e. some ids that are in Hier.csv are not in Prices.csv.
Also, there are no duplicates in either file.
So far I have tried the following, but it's taking ages and not working (I can do it faster with my little hands and Excel, but that's not what I want).
Here is my program in pseudocode; as I don't have access to my code right now, I will edit this question as soon as I can:
for each line of Hier.csv
    for each line of Prices.csv
        if prices.line doesn't contain the 3rd value of hier.line
            store that value in a list
        end
    end
end

for each value in the list
    // remove the line containing that value from Hier.csv
    String[] command1 = {"sed", "'/^.*" + value + ".*$/d'", "Hier.csv", ">", "tmp.csv"};
    Process p = Runtime.getRuntime().exec(command1);
end
String[] command2 = {"mv", "tmp.csv", "Hier.csv"};
Process p = Runtime.getRuntime().exec(command2);
Is there a better way than that double loop?
Why doesn't the last part (exec(command)) work?
And lastly, which is more efficient for reading CSV files: BufferedReader or Scanner?
You can use a merge or a hashtable.
Merge:
sort both files and merge them together
Hashtable:
load the smaller file (the ids) into a hashtable, then loop through the bigger file and test existence against the hashtable (a sketch follows)
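
A hedged sketch of the hashtable approach, using the ';'-separated layout and the third field (index 2) shown in the question; the output file name is my own. Incidentally, the sed pipeline in the question fails because Runtime.exec does not invoke a shell, so the ">" redirection is passed to sed as a literal argument.

import java.io.*;
import java.util.HashSet;
import java.util.Set;

public class FilterHier {
    public static void main(String[] args) throws IOException {
        // Load the ids of the smaller file into a set.
        Set<String> ids = new HashSet<String>();
        try (BufferedReader prices = new BufferedReader(new FileReader("Prices.csv"))) {
            String line;
            while ((line = prices.readLine()) != null) {
                ids.add(line.split(";")[2]);
            }
        }
        // Keep only the lines of the bigger file whose id is present.
        try (BufferedReader hier = new BufferedReader(new FileReader("Hier.csv"));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("tmp.csv")))) {
            String line;
            while ((line = hier.readLine()) != null) {
                if (ids.contains(line.split(";")[2])) {
                    out.println(line);
                }
            }
        }
        // tmp.csv can then replace Hier.csv; no external sed/mv needed.
    }
}

This runs in a single pass over each file, O(N+M) instead of the question's double loop.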
