Merge 2 large csv files using inner join - java

I need advice from someone who knows Java and its memory behaviour very well. I have large CSV files (around 500 MB each) and I need to merge them into one file using only 64 MB of Xmx heap. I've tried to do it in different ways, but nothing works - I always get a memory exception. What should I do to make it work properly?
The task is:
Develop a simple implementation that joins two input tables in a reasonably efficient way and can store both tables in RAM if needed.
My code works, but it takes a lot of memory, so it can't fit in 64 MB.
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ImprovedInnerJoin {
    public static void main(String[] args) throws IOException {
        RandomAccessFile firstFile = new RandomAccessFile("input_A.csv", "r");
        FileChannel firstChannel = firstFile.getChannel();
        RandomAccessFile secondFile = new RandomAccessFile("input_B.csv", "r");
        FileChannel secondChannel = secondFile.getChannel();
        RandomAccessFile resultFile = new RandomAccessFile("result2.csv", "rw");
        FileChannel resultChannel = resultFile.getChannel().position(0);
        ByteBuffer resultBuffer = ByteBuffer.allocate(40);
        ByteBuffer firstBuffer = ByteBuffer.allocate(25);
        ByteBuffer secondBuffer = ByteBuffer.allocate(25);

        while (secondChannel.position() != secondChannel.size()) {
            // load a part of table B into memory, keyed by the join column
            Map<String, List<String>> table2Part = new HashMap<>();
            for (int i = 0; i < secondChannel.size(); ++i) {
                if (secondChannel.read(secondBuffer) == -1)
                    break;
                secondBuffer.rewind();
                String[] table2Tuple = new String(secondBuffer.array(), Charset.defaultCharset()).split(",");
                if (!table2Part.containsKey(table2Tuple[0]))
                    table2Part.put(table2Tuple[0], new ArrayList<>());
                table2Part.get(table2Tuple[0]).add(table2Tuple[1]);
                secondBuffer.clear();
            }
            // scan the whole of table A against the in-memory part of table B
            Set<String> table2Keys = table2Part.keySet();
            while (firstChannel.read(firstBuffer) != -1) {
                firstBuffer.rewind();
                String[] table1Tuple = new String(firstBuffer.array(), Charset.defaultCharset()).split(",");
                for (String table2Key : table2Keys) {
                    if (table1Tuple[0].equals(table2Key)) {
                        for (String value : table2Part.get(table2Key)) {
                            // 0,14 or the result buffer will overflow
                            String result = table1Tuple[0] + "," + table1Tuple[1].substring(0, 14) + "," + value;
                            resultBuffer.put(result.getBytes());
                            resultBuffer.rewind();
                            while (resultBuffer.hasRemaining()) {
                                resultChannel.write(resultBuffer);
                            }
                            resultBuffer.clear();
                        }
                    }
                }
                firstBuffer.clear();
            }
            firstChannel.position(0);
            table2Part.clear();
        }
        firstChannel.close();
        secondChannel.close();
        resultChannel.close();
        System.out.println("Operation completed.");
    }
}

A very easy to implement version of an external join is the external hash join.
It is much easier to implement than an external merge sort join and only has one drawback (more on that later).
How does it work?
Very similar to a hashtable.
Choose a number n, which signifies how many files ("buckets") you're distributing your data into.
Then do the following:
Set up n file writers
For each of your files that you want to join and for each line:
take the hashcode of the key you want to join on
compute the hashcode modulo n; that gives you k
append your csv line to the kth file writer
Flush/Close all n writers.
Now you have n, hopefully smaller, files with the guarantee that the same key will always be in the same file. Now you can run your standard HashMap/HashMultiSet based join on each of these files separately.
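A minimal sketch of the partitioning phase, assuming the join key is the first comma-separated column; the bucket file names (bucketA_0.csv, bucketB_0.csv, ...) are made up for the example:

import java.io.*;

public class HashPartitioner {
    // Splits one CSV file into n bucket files so that equal keys always land in the same bucket.
    static void partition(String inputCsv, String bucketPrefix, int n) throws IOException {
        BufferedWriter[] writers = new BufferedWriter[n];
        for (int k = 0; k < n; k++) {
            writers[k] = new BufferedWriter(new FileWriter(bucketPrefix + k + ".csv"));
        }
        try (BufferedReader reader = new BufferedReader(new FileReader(inputCsv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));   // join key = first column
                int k = Math.floorMod(key.hashCode(), n);            // bucket index, never negative
                writers[k].write(line);
                writers[k].newLine();
            }
        } finally {
            for (BufferedWriter w : writers) w.close();
        }
    }

    public static void main(String[] args) throws IOException {
        int n = 32; // number of buckets; tune so each bucket fits in the heap
        partition("input_A.csv", "bucketA_", n);
        partition("input_B.csv", "bucketB_", n);
        // rows with the same key are now guaranteed to be in bucketA_k / bucketB_k for the same k
    }
}

With n = 32, each bucket of a 500 MB input averages roughly 15 MB, which comfortably fits a 64 MB heap unless the keys are badly skewed.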
Limitations
Why did I mention hopefully smaller files? Well, it depends on the distribution of the keys and their hashcodes. Think of the worst case: all of your rows have the exact same key - you end up with a single bucket and you haven't gained anything from partitioning.
Similarly, with skewed distributions, a few of your bucket files will sometimes be too big to fit into your RAM.
Usually there are three ways out of this dilemma:
Run the algorithm again with a bigger n, so you have more buckets to distribute to
Take only the buckets that are too big and do another hash partitioning pass only on those files (so each file goes into n newly created buckets again)
Fallback to an external merge sort on the big partition files.
Sometimes all three are used in different combinations, which is called dynamic partitioning.
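When a bucket pair does fit into RAM, the per-bucket join mentioned above is just the usual in-memory hash join. A minimal sketch, assuming two-column CSVs keyed on the first column and the bucketA_k / bucketB_k file names from the previous sketch:

import java.io.*;
import java.util.*;

public class BucketJoin {
    // Joins one pair of bucket files in memory: loads the B bucket into a HashMap,
    // then streams the A bucket past it and writes matching rows.
    static void joinBucket(String bucketA, String bucketB, Writer out) throws IOException {
        Map<String, List<String>> bTable = new HashMap<>();
        try (BufferedReader rb = new BufferedReader(new FileReader(bucketB))) {
            String line;
            while ((line = rb.readLine()) != null) {
                String[] cols = line.split(",", 2);                  // key, rest-of-row
                bTable.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(cols[1]);
            }
        }
        try (BufferedReader ra = new BufferedReader(new FileReader(bucketA))) {
            String line;
            while ((line = ra.readLine()) != null) {
                String[] cols = line.split(",", 2);
                for (String bValue : bTable.getOrDefault(cols[0], Collections.emptyList())) {
                    out.write(cols[0] + "," + cols[1] + "," + bValue + "\n");
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        try (Writer out = new BufferedWriter(new FileWriter("result.csv", true))) {
            joinBucket("bucketA_0.csv", "bucketB_0.csv", out); // repeat for every k
        }
    }
}

Loading only one side of each pair into the map keeps the peak memory at roughly the size of one B bucket.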

If central memory is a constraint for your application but you can access persistent storage, I would, as blahfunk suggested, create a temporary SQLite file in your tmp folder, read each file in chunks, and merge them with a simple join. You could create a temporary SQLite DB by looking at libraries such as Hibernate; take a look at what I found on this StackOverflow question: How to create database in Hibernate at runtime?
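A rough illustration of the temporary-SQLite idea; this assumes the xerial sqlite-jdbc driver is on the classpath, and the table and column names are invented for the example:

import java.sql.*;

public class SqliteJoin {
    public static void main(String[] args) throws SQLException {
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:" +
                System.getProperty("java.io.tmpdir") + "/join_tmp.db")) {
            try (Statement st = c.createStatement()) {
                st.execute("CREATE TABLE a (k TEXT, v TEXT)");
                st.execute("CREATE TABLE b (k TEXT, v TEXT)");
                st.execute("CREATE INDEX idx_b_k ON b (k)");
            }
            // ... insert the CSV rows in batches with a PreparedStatement here ...
            try (Statement st = c.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT a.k, a.v, b.v FROM a JOIN b ON a.k = b.k")) {
                while (rs.next()) {
                    // write each joined row to the output file instead of printing it
                    System.out.println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
                }
            }
        }
    }
}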
If you cannot perform such a task, your remaining option is to spend more CPU: load just the first row of the first file, search for a row with the same index in the second file, buffer the result and flush it as late as possible to the output file, and repeat this for every row of the first file.

Maybe you can stream the first file, turn each line into a hashcode, and save all those hashcodes in memory. Then stream the second file and compute a hashcode for each line as it comes in. If the hashcode is already in memory (i.e. it was in the first file), don't write the line; otherwise write it. After that, append the first file in its entirety to the result file.
This would be effectively creating an index to compare your updates to.
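A rough sketch of that idea (the file names are placeholders; note that relying on hashcodes alone means a rare collision could drop a line that should have been kept):

import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class HashcodeFilter {
    public static void main(String[] args) throws IOException {
        Set<Integer> seen = new HashSet<>();
        try (Stream<String> first = Files.lines(Paths.get("input_A.csv"))) {
            first.forEach(line -> seen.add(line.hashCode()));        // only the hashcode is kept per line
        }
        try (Stream<String> second = Files.lines(Paths.get("input_B.csv"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("result.csv"))) {
            second.filter(line -> !seen.contains(line.hashCode()))   // keep only lines not seen in the first file
                  .forEach(line -> {
                      try {
                          out.write(line);
                          out.newLine();
                      } catch (IOException e) {
                          throw new UncheckedIOException(e);
                      }
                  });
        }
        // finally append input_A.csv in its entirety to the result
        try (OutputStream out = new FileOutputStream("result.csv", true)) {
            Files.copy(Paths.get("input_A.csv"), out);
        }
    }
}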

Related

Go back 'n' lines in file using Stream.lines

I need to build an application which scans through a large amount of files. These files contain blocks with some data about a session, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the reason to use Stream, and some files are huge so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest that, given the files are huge, the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
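For that last alternative, a small sketch of the bounded look-behind buffer; the file name and the bound m are placeholders:

import java.io.IOException;
import java.nio.file.*;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

public class LookBehindScanner {
    public static void main(String[] args) throws IOException {
        final int m = 10;                      // known upper bound for "lines above"
        Deque<String> lastLines = new ArrayDeque<>(m);
        try (Stream<String> lines = Files.lines(Paths.get("session.log"))) {
            lines.forEach(line -> {
                if (line.contains("=ID: 39487")) {
                    // lastLines holds up to m lines that precede this one, oldest first,
                    // so "n lines above" is reachable without re-reading the file
                    System.out.println("match: " + line + " context: " + lastLines);
                }
                if (lastLines.size() == m) {
                    lastLines.removeFirst();   // drop the oldest buffered line
                }
                lastLines.addLast(line);
            });
        }
    }
}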
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
            ./* then do your work */
}
It doesn't matter how big your file is, as long as n is a small number (not millions).

Loading and processing very large files with java

I'm trying to load a CSV file with a huge number of lines (>5 million), but it slows down massively when trying to process them all into an ArrayList of each value.
I've tried a few different variations of reading and removing from the input list I loaded from the file, but it still ends up running out of heap space, even when I allocate 14 GB to the process, while the file is only 2 GB.
I know I need to be removing values so that I don't end up with duplicate references in memory - so that I don't end up with an ArrayList of lines and also an ArrayList of the individual comma-separated values - but I have no idea how to do something like that.
Edit: For reference, in this particular situation, data should end up containing 16 * 5 million values.
If there's a more elegant solution, I'm all for it.
The intention when loading this file is to process it as a database, with the appropriate methods like select and select where, all handled by a sheet class. It worked just fine with my smaller sample file of 36k lines, but I guess it doesn't scale very well.
Current code:
//Load method to load it from file
private static CSV loadCSV(String filename, boolean absolute)
{
    String fullname = "";
    if (!absolute)
    {
        fullname = baseDirectory + filename;
        if (!Load.exists(fullname, false))
            return null;
    }
    else if (absolute)
    {
        fullname = filename;
        if (!Load.exists(fullname, false))
            return null;
    }
    ArrayList<String> output = new ArrayList<String>();
    AtomicInteger atomicInteger = new AtomicInteger(0);
    try (Stream<String> stream = Files.lines(Paths.get(fullname)))
    {
        stream.forEach(t -> {
            output.add(t);
            atomicInteger.getAndIncrement();
            if (atomicInteger.get() % 10000 == 0)
            {
                Log.log("Lines done " + output.size());
            }
        });
        CSV c = new CSV(output);
        return c;
    }
    catch (IOException e)
    {
        Log.log("Error reading file " + fullname, 3, "FileIO");
        e.printStackTrace();
    }
    return null;
}

//Process method inside CSV class
public CSV(List<String> output)
{
    Log.log("Inside csv " + output.size());
    ListIterator<String> iterator = output.listIterator();
    while (iterator.hasNext())
    {
        ArrayList<String> d = new ArrayList<String>(Arrays.asList(iterator.next().split(splitter, -1)));
        data.add(d);
        iterator.remove();
    }
}
You need to use a database that provides the required functionality for your task (select, group).
Any database can effectively read and aggregate 5 million rows.
Don't try to use "operations on an ArrayList"; that only works well on small datasets.
I think some key concepts are missing here:
You said the file size is 2GB. That does not mean that when you load that file's data into an ArrayList, the size in memory would also be 2GB. Why? Usually files store data using UTF-8 character encoding, whereas the JVM internally stores String values using UTF-16. So, assuming your file contains only ASCII characters, each character occupies 1 byte in the filesystem but 2 bytes in memory. Assuming (for the sake of simplicity) all String values are unique, there is also space required to store the String references, which are 32 bits each (assuming a 64-bit system with compressed oops). How much is your heap (excluding other memory areas)? How much is your eden space and old space? I'll come back to this again shortly.
In your code, you don't specify the ArrayList size. This is a blunder in this case. Why? The JVM creates a small ArrayList. After some time the JVM sees that this guy keeps pumping in data. Let's create a bigger ArrayList and copy the data of the old ArrayList into the new list. This event has some deeper implications when you are dealing with such a huge volume of data: firstly, note that both the old and new arrays (with millions of entries) are in memory simultaneously, occupying space; secondly, unnecessary data copying happens from one array to another - not once or twice but repeatedly, every time the array runs out of space. What happens to the old array? Well, it's discarded and needs to be garbage collected. So these repeated array copies and garbage collections slow down the process. The CPU is really working hard here. What happens when your data no longer fits into the young generation (which is smaller than the heap)? Maybe you need to see the behaviour using something like JVisualVM.
All in all, what I mean to say is that there are a good number of reasons why a 2GB file fills up your much larger heap and why your process performance is poor.
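To illustrate the resizing point: pre-sizing the list (5 million is the row count given in the question) avoids the repeated grow-and-copy cycles, and splitting each line as it is read avoids keeping the full list of raw lines alongside the split values. A rough sketch (the file name is a placeholder):

import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.Stream;

public class PresizedLoad {
    public static void main(String[] args) throws IOException {
        // Pre-sized to the expected row count, so the backing array is never regrown and copied.
        List<String[]> data = new ArrayList<>(5_000_000);
        try (Stream<String> stream = Files.lines(Paths.get("input.csv"))) {
            // Split each line as it is read; a String[] per row is lighter than an ArrayList per row.
            stream.forEach(line -> data.add(line.split(",", -1)));
        }
        System.out.println("rows loaded: " + data.size());
    }
}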
I would have a method that takes a line read from the file as a parameter, splits it into a list of strings, and then returns that list. I would then add that list to the CSV object in the file-reading loop. That would mean only one large collection instead of two, and the read lines could be freed from memory more quickly.
Something like this:
CSV csv = new CSV();
try (Stream<String> stream = Files.lines(Paths.get(fullname))) {
    stream.forEach(t -> {
        List<String> splittedString = splitFileRow(t);
        csv.add(splittedString);
    });
} catch (IOException e) {
    e.printStackTrace();
}
Trying to solve this problem using pure Java is overwhelming. I suggest using a processing engine like Apache Spark that can process the file in a distributed way by increasing the level of parallelism.
Apache Spark has specific APIs to load CSV file:
spark.read.format("csv").option("header", "true").load("../Downloads/*.csv")
You can transform it into an RDD or a DataFrame and perform operations on it.
You can find more online, or here

External Sorting from files in Java

I am wondering how we would write Java code for the following pseudocode:
foreach file F in file directory D
foreach int I in file F
sort all I from each file
Basically this is part of the external sorting algorithm, so those files contain lists of sorted integers, and I want to read the first one from each file, sort them and output to another file, then move on to the next integer from each file again until all the integers are fully sorted.
The problem is that, as far as I understand, for each file we need a reader, so if we have N files does that mean we need N file readers?
======update=======
I am wondering, is it something that looks like this? Correct me if I've missed anything, or if there is a better approach.
int numOfFiles = 10;
Scanner[] scanners = new Scanner[numOfFiles];
try {
    // open a reader for each of the files
    for (int i = 0; i < numOfFiles; i++) {
        scanners[i] = new Scanner(new BufferedReader(
                new FileReader("file" + i + ".txt")));
    }
}
catch (FileNotFoundException fnfe) {
    // handle missing files here
}
The problem is that as far as I understand for each file we need a reader, so if we have N files then does that mean we need N file readers ?
Yes, that's right - unless you either want to go back over the data repeatedly, or read the whole of each file into memory. Either of those would let you get away with only one file open at a time - but that may well not suit what you want to do.
Operating systems usually only allow you to open a certain number of files at a time. If you're trying to do something like create a single sorted set of results from a very large number of files, you might want to consider operating on a few of them at a time, producing larger intermediate files. At its simplest, this would just sort two files at a time, e.g.
input1 + input2 => tmp-a1
input3 + input4 => tmp-a2
input5 + input6 => tmp-a3
input7 + input8 => tmp-a4
tmp-a1 + tmp-a2 => tmp-b1
tmp-a3 + tmp-a4 => tmp-b2
tmp-b1 + tmp-b2 => result
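A minimal sketch of one such pairwise step, assuming each input holds ascending integers, one per line (the file names follow the scheme above):

import java.io.*;
import java.util.Scanner;

public class MergeTwoSortedFiles {
    // Merges two files of ascending integers (one per line) into a single sorted file.
    static void merge(String inA, String inB, String outName) throws IOException {
        try (Scanner a = new Scanner(new BufferedReader(new FileReader(inA)));
             Scanner b = new Scanner(new BufferedReader(new FileReader(inB)));
             PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter(outName)))) {
            Integer nextA = a.hasNextInt() ? a.nextInt() : null;
            Integer nextB = b.hasNextInt() ? b.nextInt() : null;
            while (nextA != null && nextB != null) {
                if (nextA <= nextB) {
                    out.println(nextA);
                    nextA = a.hasNextInt() ? a.nextInt() : null;
                } else {
                    out.println(nextB);
                    nextB = b.hasNextInt() ? b.nextInt() : null;
                }
            }
            // drain whichever input still has values
            while (nextA != null) { out.println(nextA); nextA = a.hasNextInt() ? a.nextInt() : null; }
            while (nextB != null) { out.println(nextB); nextB = b.hasNextInt() ? b.nextInt() : null; }
        }
    }

    public static void main(String[] args) throws IOException {
        merge("input1.txt", "input2.txt", "tmp-a1.txt");  // file names follow the scheme above
    }
}

Each pass only ever holds two integers in memory, so the intermediate files can be as large as needed.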
Yes, we must have N file readers for reading N files.
In order to iterate over all the files in a directory, read the files one by one and store them in a List. Then sort that list again to get your output.
There's a method called polyphase merge sort, which I recently learnt in my DS class, where you traverse the files in the form of runs (a run is a sorted sequence). There are n sources and a destination.
The gist of this polyphase method is that no file (given a set of files) is ever kept idle. It significantly reduces the number of iterations. It's done by taking a Fibonacci sequence of an order equal to the number of files. So in the case of 5 files, I'll take the Fibonacci sequence of order 5: [1,1,2,4,8], which represents the number of runs you're going to take out of each file and place, where from the files corresponding to runs=1, one of them will be the destination.
In short:
Distribute a file into runs according to the Fibonacci sequence. [This assumes the entire dataset is in a single file; if that's not the case, you can always create in-situ runs, where you might want to add dummy runs to suit the sequence.]
Take the first n runs from every file into the buffer, sort them (insertion sort preferred) and dump them into ONE file. That ONE file is again selected by the Fibonacci sequence.
Repeat until you get a single file with a single run.
This is the paper which neatly explains the polyphase concept. ftp://reports.stanford.edu/pub/cstr/reports/cs/tr/76/543/CS-TR-76-543.pdf
http://en.wikipedia.org/wiki/Polyphase_merge_sort explains the algo better
Just presenting code, not answering "need N file readers ?" :)
use org.apache.commons.io:
//get line iterators :
Collection<File> files = FileUtils.listFiles(/* TODO : filter conf */);
List<LineIterator> iters = new ArrayList<LineIterator>();
for (File file : files) {
    iters.add(FileUtils.lineIterator(file, "UTF-8"));
}

//collect a line from each file
List<String> numbers = new ArrayList<String>();
for (LineIterator li : iters) {
    numbers.add(li.nextLine());
}

//sort
//Arrays.sort(numbers/*will fail*/);// :)
Yes, you need N File readers.
public void workOnFiles() {
    File[] D = new File("directoryName").listFiles(); // D.length should equal N.
    for (File F : D) {
        doSortingForEachFile(F); // do sorting part here. The same reader cannot open the same file here again.
    }
}

public void doSortingForEachFile(File f) {
    try {
        ArrayList<Integer> list = new ArrayList<Integer>();
        Scanner s = new Scanner(f);
        while (s.hasNextInt()) { // store the ints inside the file.
            list.add(s.nextInt());
        }
        s.close(); // once closed, cannot open again.
        Collections.sort(list); // this method will sort the ArrayList of ints.
        //...write the numbers inside list to another file...
    } catch (Exception e) {}
}
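For the final merge itself, instead of re-sorting everything, you can keep the N readers open and repeatedly take the smallest current value via a priority queue. A rough sketch, reusing the question's "file" + i + ".txt" naming (the output file name is made up):

import java.io.*;
import java.util.*;

public class KWayMerge {
    public static void main(String[] args) throws IOException {
        int numOfFiles = 10;
        Scanner[] scanners = new Scanner[numOfFiles];
        // entries are {value, sourceIndex}; ordered by value so poll() returns the global minimum
        PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
        for (int i = 0; i < numOfFiles; i++) {
            scanners[i] = new Scanner(new BufferedReader(new FileReader("file" + i + ".txt")));
            if (scanners[i].hasNextInt()) {
                heap.add(new int[]{scanners[i].nextInt(), i});
            }
        }
        try (PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("sorted.txt")))) {
            while (!heap.isEmpty()) {
                int[] smallest = heap.poll();
                out.println(smallest[0]);
                int src = smallest[1];
                if (scanners[src].hasNextInt()) {           // refill from the file we just consumed
                    heap.add(new int[]{scanners[src].nextInt(), src});
                }
            }
        }
        for (Scanner s : scanners) s.close();
    }
}

This keeps at most one value per file in memory, which is what makes the N open readers affordable.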

efficient and fast KV store suitable for android phones for a million rows?

Is there a KV store that can store a million KV pairs, while performing well in an Android phone without hogging resources?
It should be able to do this fast:
kvstore.deleteByPrefix("image_hash_"); #a million keys have this prefix
for(... #for a million values
kvstore.add("image_hash_"+i.toString(), "true"); #values are small
}
Your key suggests your values are large (many GB in total); this will cause you more problems than using a Map.
I suggest you use the file system, with the name being the file name and the value being the contents of the file. You can use two levels of directories to split the files, to prevent any one directory from getting too large.
You can use the following.
new File("image_hash.properties").delete(); #a million keys have this prefix
PrintWriter pw = new PrintWriter(new File("image_hash.properties"))
for(int i=0;i<1000*1000;i++)
pw.println(i+"="+true);
pw.close();
If you are concerned about efficiency and you can only have true or false, you can write binary flags instead.
FileChannel fc = new FileOutputStream("image_hash.flags").getChannel();
ByteBuffer bb = ByteBuffer.allocate(1000 * 1000 / 8); // uses 125KB of memory.
Arrays.fill(bb.array(), (byte) -1);
fc.write(bb);
fc.close();
The first example uses ~14 bytes per value, the second example uses 1/8th of a byte per value.
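For completeness, flag number i in the binary file lives in byte i/8, at bit i%8, so a single flag can later be read or cleared in place without touching the rest of the file. A small sketch against the same image_hash.flags layout as above:

import java.io.IOException;
import java.io.RandomAccessFile;

public class FlagFile {
    // Reads flag number i from the bit-packed file written above.
    static boolean readFlag(String fileName, int i) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "r")) {
            raf.seek(i / 8);
            int b = raf.read();
            return (b & (1 << (i % 8))) != 0;
        }
    }

    // Clears flag number i in place, rewriting only the one byte that holds it.
    static void clearFlag(String fileName, int i) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(fileName, "rw")) {
            raf.seek(i / 8);
            int b = raf.read();
            raf.seek(i / 8);
            raf.write(b & ~(1 << (i % 8)));
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readFlag("image_hash.flags", 123_456));
        clearFlag("image_hash.flags", 123_456);
    }
}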

File Writer Java ; how to overwrite

I'm making a high-score implementation for my game. Here's what I want to do:
I have an object Score which contains a String Name and an Integer score.
Now:
if Name isn't already in the file, add it
if Name is in the file, take the String after the space and convert it into an integer, so I get the score.
Now, if the new score is better than the current one, I have to OVERWRITE it in the file...
and here's my problem: how can I do that? How can I write a string exactly over another one at a certain point in the file?
Generally it's considered too fiddly to replace text in text files for this kind of requirement, so the usual way to do it is just to read in the whole file, make the replacement and write a new version of the whole file. If you have large amounts of data you would use a database or a NoSQL solution instead.
P.S. Consider using serialization; it can make things easier.
I have object Score which contains String Name and Integer score.
Use a Properties file for this instead. It provides an easy interface to load and save the file, get the keys (Name) and set or retrieve values (score). String values are stored, but they can easily be converted to/from integers.
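A minimal sketch of the Properties approach (the file name and the sample name/score are just examples):

import java.io.*;
import java.util.Properties;

public class HighScores {
    public static void main(String[] args) throws IOException {
        Properties scores = new Properties();
        File file = new File("highscores.properties");
        if (file.exists()) {
            try (Reader in = new FileReader(file)) {
                scores.load(in);                       // loads existing name=score pairs
            }
        }
        String name = "joe";
        int newScore = 10;
        int oldScore = Integer.parseInt(scores.getProperty(name, "0"));
        if (newScore > oldScore) {
            scores.setProperty(name, Integer.toString(newScore));  // overwrite only if better
            try (Writer out = new FileWriter(file)) {
                scores.store(out, "high scores");      // rewrites the whole (small) file
            }
        }
    }
}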
I concur that this is best done by fully re-serializing the entire database. On modern computers you can push 30MB/s per disk for linear writes (more if there's sufficient cache), and if you're dealing with more than 30MB of data, you REALLY need a DB; HSQLDB, DerbyDB and BerkeleyDB are trivial DBs, or go all the way to postgres/mysql.
However, the way to overwrite a FIXED sized section of an existing file (or rather, the way to emulate doing so), is to use:
RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
try {
    raf.seek(position);
    raf.writeInt(newScore);
} finally {
    raf.close();
}
Note the use of writeInt instead of raf.write(Integer.toHexString(newScore).getBytes()), because you really, really need that value to be fixed in size.
Now, if the text file is intrinsically ASCII (e.g. humans will read the file), and thus the value can't be binary, perhaps you could keep it as a hex string (because that will be fixed in size), or you could use a zero-padded decimal string.
But what you absolutely, positively cannot do is grow the string by 1 byte.
So:
bob 15
joe 7
nina 981
Can't have joe's score replaced with 10, UNLESS you've padded it with a bunch of spaces.
If this is your data file, then you will absolutely have to rewrite the whole file (even if you write the extra code to only rewrite from the point of change onward - statistically that'll be 50% of the file anyway, and thus not worth the bother).
One other thing - if you do rewrite, you run the risk of shortening the file. For that you need to call
raf.setLength(0);
before writing the first byte. Otherwise, you'll see phantom text beyond the end of your new file.
final Map<String, Long> symbol2Position = new ConcurrentHashMap<>();
final Map<String, Integer> symbol2Score = new ConcurrentHashMap<>();
final String fileName;
final RandomAccessFile raf;
// ... skipped code

void storeFull() throws IOException {
    raf.seek(0);
    raf.setLength(0);
    for (Map.Entry<String, Integer> e : symbol2Score.entrySet()) {
        raf.writeBytes(e.getKey());
        raf.write(',');
        symbol2Position.put(e.getKey(), raf.getFilePointer()); // the score field starts here
        raf.writeBytes(String.format("%06d", e.getValue()));   // fixed-width score
        raf.write('\n');
    }
}

void updateScore(String key, int newScore) throws IOException {
    symbol2Score.put(key, newScore);
    if (symbol2Position.containsKey(key)) {
        // overwrite the fixed-width score field in place
        raf.seek(symbol2Position.get(key));
        raf.writeBytes(String.format("%06d", newScore));
    } else {
        // append a brand new record at the end of the file
        raf.seek(raf.length());
        raf.writeBytes(key);
        raf.write(',');
        symbol2Position.put(key, raf.getFilePointer());
        raf.writeBytes(String.format("%06d", newScore));
        raf.write('\n');
    }
}
I'd probably rather use a DB or a binary map file, meaning a file with 8 bytes per field: 4 bytes pointing to the user-name position and 4 bytes representing the score. But the approach above allows for a human-readable data file, while making updates faster than just rewriting the property file.
Check out LevelDB - fastest damn DB on the planet for embedded systems. :) The main thing it has over the above is thousands/millions of updates per second without the random-seek-and-rewrite cost of updating 6 bytes randomly across a multi-GB file.
Just a thought: is there any specific reason for the file storage of names and scores?
Seems like a Map<String, Integer> would serve you much better...
