Taking the Cartesian product of two huge files using Hadoop - Java

We have two files, file1 and file2, which have a huge number of lines, let's say a billion lines each. The goal here is to take the Cartesian product of the files.
So if file1 has m lines and file2 has n lines, the Cartesian product output would have m*n lines.
I could think of the following solution for this problem:
Write mapper1 which reads each line as <line_number, line_content> from file1 and outputs <const, file1_marker+line_number+line_content>.
Write mapper2 which reads each line as <line_number, line_content> from file2 and outputs <const, file2_marker+line_number+line_content>.
Write mapper3 which reads from both the output files and outputs <const, list_of_file1_and_file2_line_contents>.
Write a reducer which gets the mapper3 output and, while traversing the values, combines every file1_marker+line_number value with every file2_marker+line_number value it sees (it could be made faster by hashing etc.) and outputs <file1_line_content, file2_line_content>.
But it appears that the amount of memory will be an issue here. So I am looking for a more memory efficient solution if possible.
Please suggest.
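To make the described pipeline concrete, a mapper along the lines of step 1 might look like the sketch below. This is only a rough sketch: the class name, the "F1" marker, and the single constant key are illustrative, and with TextInputFormat the input key is a byte offset rather than a true line number.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for file1: tags every line with a file marker and its offset,
// and emits everything under one constant key so a single reducer sees all records.
public class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Text CONST_KEY = new Text("const");
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        outValue.set("F1|" + offset.get() + "|" + line.toString());
        context.write(CONST_KEY, outValue);
    }
}

Note that funnelling everything through one constant key forces all m*n combinations through a single reducer, which is exactly where the memory and parallelism concern in the question comes from.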

Related

Java Mapreduce - getting names of files with matches & printing to output file

Hi, I've been trying to come up with a modified version of the standard WordCount v1.0,
in which I read all files from an input directory (args[0]) and print the output to an output directory (args[1]), consisting not just of the words and the number of occurrences, but also a list of the files where the matches took place.
So for example I have 2 text files:
//1.txt
I love hadoop
and big data
//2.txt
I like programming
hate big data
The output would be:
//Output.txt
I 2 1.txt 2.txt
love 1 1.txt
hadoop 1 1.txt
and 1 1.txt
big 2 1.txt 2.txt
data 2 1.txt 2.txt
like 1 1.txt
programming 1 2.txt
hate 1 2.txt
At this stage I'm not sure how to extract the name of the file where the match occurred. Furthermore, I'm not sure how to store the file name - whether I would use a Triple or nested maps, perhaps Map (K1, Map (K2, v))? I don't know which would be possible in a MapReduce program, so any tips would be greatly appreciated.
Getting file names is generally not encouraged. Different input formats have different ways of doing this, and some of them may not provide such functionality at all.
Assuming that you are working with simple TextInputFormat, you can use mapper context to retrieve the split:
FileSplit split = (FileSplit)context.getInputSplit();
String filename = split.getPath().getName();
To produce the desired format, let the mapper emit tuples <Text(word), Text(filename)>. The reducer should collect them into a Map<String(word), Set<String>(filename)>. This assumes no combiner is used.
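A rough sketch of that mapper/reducer pair is shown below; the class names and the whitespace tokenizer are assumptions for illustration, not part of the original post.

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit <word, filename> for every token in the line.
class TokenFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text word = new Text();
    private final Text filename = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        filename.set(split.getPath().getName());
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, filename);
            }
        }
    }
}

// Reducer: count total occurrences and collect the distinct file names per word.
class FileListReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> filenames, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        Set<String> files = new TreeSet<>();
        for (Text f : filenames) {
            count++;                  // total occurrences of the word
            files.add(f.toString());  // distinct files it appeared in
        }
        context.write(word, new Text(count + " " + String.join(" ", files)));
    }
}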

Go back 'n' lines in file using Stream.lines

I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value, e.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the reason to use a Stream, and some files are huge, so that would cause memory problems.
I was wondering if something like this is possible using the Stream API (Files)? Or perhaps a better question, is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the values of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
I'd suggest, given that the files are huge, that the second option here is probably the most realistic. The third will probably give you better performance, but will require much more in the way of development effort.
As an alternative if you can guarantee that both n and m are below a certain value, and that value is a reasonable size - you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards".
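A minimal sketch of that buffering alternative, assuming m is known and small; the file name, the value of m, and the "=ID:" marker are placeholders:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.stream.Stream;

public class LookBack {
    public static void main(String[] args) throws IOException {
        int m = 5;  // how many lines back we may need to look (assumed known)
        Deque<String> buffer = new ArrayDeque<>(m);

        try (Stream<String> lines = Files.lines(Paths.get("sessions.log"))) {
            lines.forEach(line -> {
                if (line.contains("=ID: 39487")) {
                    // Once the buffer is full, its oldest entry is the line m lines above.
                    String earlier = buffer.peekFirst();
                    System.out.println("Line " + m + " above the ID: " + earlier);
                }
                buffer.addLast(line);      // remember the current line
                if (buffer.size() > m) {
                    buffer.removeFirst();  // keep only the last m lines
                }
            });
        }
    }
}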
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"));
            // ... then do your work with the filtered stream
}
It works no matter how big your file is, as long as n is a small number, not millions.

Count duplicates in huge text files collection

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in the folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences and count the globally unique ones, so as to obtain a new collection of files where the lines are counted:
count ...unique sentence...
The collection is huge.
My first implementation (using Java) was a "merge sort" approach: order the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters, as sketched below), then sort and aggregate duplicates within each single file.
I know it is a wordcount map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider another tool/approach besides MapReduce?
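For reference, the dispatching step of that approach might look roughly like the sketch below; the bucket count, prefix length, and file names are assumptions for illustration. Each bucket is then small enough to sort and aggregate independently.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Dispatcher {
    public static void main(String[] args) throws IOException {
        int buckets = 500;   // number of bucket files (assumed)
        int prefixLen = 8;   // number of leading characters used for dispatching (assumed)

        BufferedWriter[] writers = new BufferedWriter[buckets];
        for (int i = 0; i < buckets; i++) {
            writers[i] = Files.newBufferedWriter(Paths.get("bucket_" + i + ".txt"));
        }
        try (BufferedReader in = Files.newBufferedReader(Paths.get("big_folder_6/001.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Identical sentences share a prefix, so they always land in the same bucket.
                String prefix = line.substring(0, Math.min(prefixLen, line.length()));
                int bucket = Math.floorMod(prefix.hashCode(), buckets);
                writers[bucket].write(line);
                writers[bucket].newLine();
            }
        }
        for (BufferedWriter w : writers) {
            w.close();
        }
    }
}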
Do you mean deleting duplicated lines within each file, or among all files?
In any case, you can't read the whole file into memory; you need to read line by line or a memory exception will be thrown. Use a BufferedReader (example here), and use a map storing the string with the count of the repeated line as the value; when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory (a minimal sketch follows).
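A minimal sketch of that per-file pass, assuming the distinct lines of a single file fit in memory (file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class CountLines {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // Read line by line, never holding the whole file as one string.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("001.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(line, 1, Integer::sum);  // increment, or insert with count 1
            }
        }
        // Write "count<TAB>line" for every distinct line, then let the map be collected.
        try (BufferedWriter out = Files.newBufferedWriter(Paths.get("001.counted.txt"))) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getValue() + "\t" + e.getKey());
                out.newLine();
            }
        }
    }
}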
UPDATE 1
The problem is that you have a lot of gigabytes, so you can't keep every line in memory because that could throw a memory exception, but at the same time you have to keep them in memory to quickly check whether they are duplicated. What comes to my mind is, instead of using the string itself as the map key, to store a hash of the string (using String.hashCode()); when a line is seen for the first time, write it to the new file, but flush only every 100 lines or more to lower the time spent writing to disk. After you have processed all the files, written the unique lines to the file, and have only integers in the map (the hash code of the string as the key and the count as the value), you start reading the file containing only the unique lines, then create a new file writing the line and the count values.

Removing file lines containing data which are not present in another file

I have a file Hier.csv which looks like this (thousands of lines):
value;nettingNodeData;ADM59505_10851487;CVAEngine;ADM;;USD;0.4;35661;BDR;NA;ICE;;RDC;MAS35661_10851487;CVAEngine;MA;10851487;RDC
I have another one, Prices.csv, which looks like this :
value;nettingNodePrices;ADM68834_22035364;CVAEngine;CVA with FTD;EUR;1468.91334249291905;DVA with FTD;EUR;5365.59742483701497
I have to make sure that both files have the same number of lines and the same ids (the third value on each line). It's a known fact that the set of ids from Hier.csv is larger and contains the set of ids from Prices.csv, i.e. some ids that are in Hier.csv are not in Prices.csv.
Also, there are no duplicates in either file.
So far, I have tried the following, but it's taking ages and not working (I can do it faster with my little hands and Excel, but that's not what I want).
Here is my program in pseudo code; as I don't have access to my code right now, I will edit this question as soon as I can:
for each line of Hier.csv
    for each line of Prices.csv
        if prices.line doesn't contain the 3rd value of hier.line
            store that value in a list
        end
    end
end
Process p;
for each value in the list
    // remove the line containing that value from Hier.csv
    String[] command1 = {"sed", "'/^.*" + value + ".*$/d'", "Hier.csv", ">", "tmp.csv"};
    Process p = Runtime.getRuntime().exec(command1)
end
String[] command2 = {"mv", "tmp.csv" "Hier.csv"};
Process p = Runtime.getRuntime().exec(command2)
Is there a better way than that double loop?
Why doesn't the last part (exec(command)) work?
And lastly, which is more efficient for reading CSV files: BufferedReader or Scanner?
You can use a merge or a hashtable.
Merge:
sort both files and merge them together
Hashtable:
load the smaller file's ids into a hashtable, then loop through the bigger file and test each id's existence against the hashtable (see the sketch below)
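A minimal sketch of the hashtable variant; the file names come from the question, while everything else (field index, temporary file) is illustrative:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;

public class FilterHier {
    public static void main(String[] args) throws IOException {
        // Pass 1: load the ids (third ;-separated field) of the smaller file into a set.
        Set<String> priceIds = new HashSet<>();
        try (BufferedReader prices = Files.newBufferedReader(Paths.get("Prices.csv"))) {
            String line;
            while ((line = prices.readLine()) != null) {
                priceIds.add(line.split(";")[2]);
            }
        }
        // Pass 2: keep only the Hier.csv lines whose id is in the set.
        try (BufferedReader hier = Files.newBufferedReader(Paths.get("Hier.csv"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("tmp.csv"))) {
            String line;
            while ((line = hier.readLine()) != null) {
                if (priceIds.contains(line.split(";")[2])) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
        // Replace the original file with the filtered one.
        Files.move(Paths.get("tmp.csv"), Paths.get("Hier.csv"), StandardCopyOption.REPLACE_EXISTING);
    }
}

This makes a single pass over each file instead of the double loop and avoids shelling out to sed and mv entirely.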

How do we determine the number of lines in a text file?

Hi all, I have a local file which looks like this:
AAA Anaa
AAC EL-ARISH
AAE Annaba
AAF APALACHICOLA MUNI AIRPORT
AAG ARAPOTI
AAL Aalborg Airport
AAM Mala Mala
AAN Al Ain
AAQ Anapa
AAR Aarhus Tirstrup Airport
AAT Altay
AAX Araxa
AAY Al Ghaydah
...
Java Tutorials suggests estimating the number of lines in a file by taking java.io.File.length and dividing the result by 50.
But isn't there a more "solid" way to get the number of lines in a text file (yet without having to pay the overhead of reading the entire file)?
Can't you just read the file with a FileReader and count the number of lines read?
int lines = 0;
try (BufferedReader br = new BufferedReader(new FileReader("foo.in"))) {
    // readLine() returns null at end of file
    while (br.readLine() != null) {
        lines++;
    }
}
The benefit to the estimation algorithm you've got is that it is very fast: one stat(2) call and then some division. It'll take the same length of time and memory no matter how large or small the file is. But it's also vastly wrong on a huge number of inputs.
Probably the best way to get the specific number is to actually read through the entire file looking for '\n' characters. If you read the file in in large binary blocks (think 16384 bytes or a larger power of two) and look for the specific byte you're interested in, it can go at something approaching the disk IO bandwidth.
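A rough sketch of that block-reading approach (the 16384-byte buffer size follows the suggestion above; the file name is a placeholder):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineCounter {
    public static void main(String[] args) throws IOException {
        long lines = 0;
        byte[] buffer = new byte[16384];  // large power-of-two block
        try (InputStream in = Files.newInputStream(Paths.get("foo.in"))) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (buffer[i] == '\n') {
                        lines++;
                    }
                }
            }
        }
        // Note: a final line without a trailing '\n' is not counted by this method.
        System.out.println(lines + " lines");
    }
}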
You need to use a BufferedReader and a counter which increments by 1 for each readLine().
