MapReduce questions - Java

I am trying to implement a MapReduce program to do word counts from 2 files, and then compare the word counts from these files to see which words are the most common...
I noticed that after doing the word count for file 1, the results go into the directory "/data/output1/", and there are 3 files inside:
- "_SUCCESS"
- "_logs"
- "part-r-00000"
The "part-r-00000" is the file that contains the results from file1 wordcount. How do I make my program read that particular file if the file name is generated in real-time without me knowing beforehand the filename?
Also, for the (key, value) pairs, I have added an identifier to the "value", so as to be able to identify which file and count that word belongs to.
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
    Text newValue = new Text();
    newValue.set(value.toString() + "_f2");
    context.write(key, newValue);
}
At a later stage, how do I "remove" the identifier so that I can get just the "value"?

Just point your next MR job to /data/output1/. It will read all three files as input, but _SUCCESS and _logs are both empty so they'll have no effect on your program. They're just written so that you can tell the MR job writing to the directory finished successfully.
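For example, a minimal driver sketch for the comparison job (the class name, the commented-out mapper/reducer, and the paths are placeholders, not your code) that feeds the whole output directory of the first job in as input:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareCountsDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compare word counts");
        job.setJarByClass(CompareCountsDriver.class);
        // part-r-00000 is tab-separated, so KeyValueTextInputFormat hands the
        // mapper (word, count) pairs directly.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // job.setMapperClass(CompareMapper.class);   // your comparison mapper (hypothetical)
        // job.setReducerClass(CompareReducer.class); // your comparison reducer (hypothetical)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Point the job at the directory, not at part-r-00000 itself.
        FileInputFormat.addInputPath(job, new Path("/data/output1/"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output2/"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}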

If you want to implement word count from 2 different files, you can use the MultipleInputs class, which lets you run the MapReduce program on both files simultaneously. Refer to this link for an example of how to implement it: http://www.hadooptpoint.com/hadoop-multiple-input-files-example-in-mapreduce/. There you define a separate mapper for each input file, so you can add a different identifier in each mapper; when their output reaches the reducer, it can tell which mapper the input came from and process it accordingly. You can remove the identifiers the same way you added them: for example, if you add the prefix #1 to mapper 1's output and #2 to mapper 2's output, then in the reducer you can tell which mapper the input came from by the prefix and simply strip it off.
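For illustration, a rough sketch of that setup (the class names, paths, and the f1/f2 tags are assumptions, not code from the question):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Mapper for the first file: tags every word with "f1".
class File1Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) context.write(new Text(word), new Text("f1"));
        }
    }
}

// Mapper for the second file: identical, but tags with "f2".
class File2Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) context.write(new Text(word), new Text("f2"));
        }
    }
}

// Reducer: the tag says which file each occurrence came from, so it can keep
// two counts and "remove" the identifier by never emitting it.
class CompareReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> tags, Context context)
            throws IOException, InterruptedException {
        long count1 = 0, count2 = 0;
        for (Text tag : tags) {
            if ("f1".equals(tag.toString())) count1++; else count2++;
        }
        context.write(word, new Text(count1 + "\t" + count2));
    }
}

// In the driver, register each input file with its own mapper:
// MultipleInputs.addInputPath(job, new Path("/data/file1.txt"), TextInputFormat.class, File1Mapper.class);
// MultipleInputs.addInputPath(job, new Path("/data/file2.txt"), TextInputFormat.class, File2Mapper.class);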
Regarding your other question about reading the output file: the output file name always follows a pattern. If you are using Hadoop 1.x the results are stored in a file named part-00000 and onward, and with Hadoop 2.x the results are stored in a file named part-r-00000; if there is more output to be written to the same output path, it goes into part-r-00001 and onward. The other two files that are generated have no significance for the developer; they act more as markers for Hadoop itself.
Hope this solves your query. Please comment if the answer is not clear.

Related

Java Mapreduce - getting names of files with matches & printing to output file

Hi, I've been trying to come up with a modified version of the standard WordCount v1.0,
wherein I read all files from an input directory (args[0]) and print to an output directory (args[1]) not just the words and the number of occurrences, but also a list of the files where the matches took place.
So for example I have 2 text files:
//1.txt
I love hadoop
and big data
//2.txt
I like programming
hate big data
The output would be:
//Output.txt
I 2 1.txt 2.txt
love 1 1.txt
hadoop 1 1.txt
and 1 1.txt
big 2 1.txt 2.txt
data 2 1.txt 2.txt
like 1 1.txt
programming 1 2.txt
hate 1 2.txt
At this stage I'm not sure how to extract the name of the file where the match occurred. Furthermore, I'm not sure how to store the file names - whether I would use a triple or would need nested maps, so perhaps Map<K1, Map<K2, V>>? I don't know which would be possible in a MapReduce program, so any tips would be greatly appreciated.
Getting file names is generally not encouraged. Different input formats have different ways of doing this, and some of them may not provide such functionality at all.
Assuming that you are working with simple TextInputFormat, you can use mapper context to retrieve the split:
FileSplit split = (FileSplit)context.getInputSplit();
String filename = split.getPath().getName();
To produce the desired format, let the mapper emit tuples <Text(word), Text(filename)>. The reducer should collect them into a Map<String(word), Set<String>(filenames)>. This assumes no combiner is used.
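A minimal sketch of that mapper/reducer pair (the names are illustrative; it assumes TextInputFormat and no combiner):

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper: emit (word, filename) for every word in the line.
class WordFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        Text filename = new Text(split.getPath().getName());
        for (String word : value.toString().split("\\s+")) {
            if (!word.isEmpty()) context.write(new Text(word), filename);
        }
    }
}

// Reducer: total occurrences plus the set of files the word appeared in,
// e.g. "big   2   1.txt 2.txt".
class WordFileReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> filenames, Context context)
            throws IOException, InterruptedException {
        Set<String> files = new TreeSet<>();
        int count = 0;
        for (Text f : filenames) {
            files.add(f.toString());
            count++;
        }
        context.write(word, new Text(count + "\t" + String.join(" ", files)));
    }
}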

Count duplicates in huge text files collection

I have this collection of folders:
60G ./big_folder_6
52G ./big_folder_8
61G ./big_folder_7
60G ./big_folder_4
58G ./big_folder_5
63G ./big_folder_2
54G ./big_folder_9
61G ./big_folder_3
39G ./big_folder_10
74G ./big_folder_1
Each folder contains 100 txt files, with one sentence per line. For example, the file ./big_folder_6/001.txt:
sentence ..
sentence ..
...
Each file in the folder is between 4 and 6 GB (as you can see from the totals reported above), with roughly 40-60 million sentences. A single file fits in memory.
I need to deduplicate the sentences and count the globally unique ones, so as to obtain a new collection of files where the lines are counted:
count ...unique sentence...
The collection is huge.
My first implementation (using Java) was a "merge sort" approach: ordering the lines into a new collection of 500 files (dispatching each line to the right file using its first N characters), then sorting and aggregating duplicates within each of those files.
I know it is a wordcount map-reduce problem, but I would prefer to avoid it. The question is: am I using the right approach to solve this kind of problem, or should I consider another tool/approach besides MapReduce?
Do you mean deleting the duplicated lines of each file, or among all files?
In any case, you can't read the whole file at once; you need to read it line by line or a memory exception will be thrown. Use a BufferedReader (example here), and use a map storing each string with the count of the repeated line as the value: when you read a line, put it in the map, incrementing the value if it already exists.
After reading the file, write all the lines and their counts to a new file and release the memory.
UPDATE 1
The problem is that you have a lot of gigabytes. So you can't keep every line in memory, because that can throw a memory exception, but at the same time you have to keep them in memory to quickly check whether they are duplicated. What comes to my mind is: instead of using the string itself as the key, put in a hash of the string (using String.hashCode()), and when it is seen for the first time, write the line to the new file, but flush only every 100 lines or more to lower the time spent writing to disk. After you have processed all the files and written the unique lines to the file, and you have only integers in the map (the hash code of the string as the key and the count as the value), you start reading the file containing only unique lines and create a new file writing each line together with its count value.
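A minimal sketch of the simpler per-file pass described before the update (the file names are placeholders): read line by line with a BufferedReader, count occurrences in a HashMap, then write "count<TAB>sentence" to a new file and let the map be garbage collected.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class DedupOneFile {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(
                Paths.get("big_folder_6/001.txt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                counts.merge(line, 1, Integer::sum); // increment if already present
            }
        }
        try (BufferedWriter out = Files.newBufferedWriter(
                Paths.get("big_folder_6/001.dedup.txt"), StandardCharsets.UTF_8)) {
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                out.write(e.getValue() + "\t" + e.getKey());
                out.newLine();
            }
        }
    }
}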

How to parse a single row from a CSV file for each java run and for the next run parse the next row?

I'm a beginner at Java and I have a Java class that reads in data from a CSV file which looks like this:
BUS|ID|Load|Max_Power
1 | 2 | 1 | 10.9
2 | 3 | 2 | 8.0
My problem is this: I have to consider, for each Java run (program execution), only 1 row at a time. For example, on my first run I need to read in only the first row, and on my second run I need to read in the data from the second row.
Would using HashMaps be the right way to look up the keys for each run?
When your program has to "remember" something beyond its termination, you need to store this information somewhere (file, registry, ...).
Step 1: Figure out how to do file I/O (read/write files) with Java. You need this to store your information (the line number, in this case).
Step 2: Implement the logic:
read lineToRead from the memory file (e.g. 1)
read line lineToRead (1) from the data file and parse the data (take a look at @Kent's answer for a nice explanation of how to do so)
increment lineToRead (1 -> 2) and save it into the memory file (a sketch of these steps follows below)
Hint: When multiple instances of your program are going to run in parallel, you have to ensure mutual exclusion / make the whole process (read, increment, write) atomic to prevent the lost update effect.
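A sketch of steps 1-3 (assuming a small plain-text "memory" file that holds the next line number; all file names are placeholders, and there is no locking, so it ignores the hint about parallel runs):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class OneRowPerRun {
    public static void main(String[] args) throws IOException {
        Path memory = Paths.get("next_line.txt");   // remembers progress between runs
        Path data = Paths.get("bus_data.csv");      // the CSV with the header row

        // Step 1: read which line to process (default to 1, i.e. the first data row).
        int lineToRead = Files.exists(memory)
                ? Integer.parseInt(Files.readAllLines(memory).get(0).trim())
                : 1;

        // Step 2: read that single data row (index 0 is the header) and parse it.
        List<String> lines = Files.readAllLines(data, StandardCharsets.UTF_8);
        String[] fields = lines.get(lineToRead).split("\\|");
        System.out.println("Processing row " + lineToRead + ": " + String.join(", ", fields));

        // Step 3: increment the pointer and persist it for the next run.
        Files.write(memory, String.valueOf(lineToRead + 1).getBytes(StandardCharsets.UTF_8));
    }
}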
When you read the 1st line (the header), split it by | to get a string array (headerArray). Then init a HashMap<String, List<String>> (or a Multimap if you use Guava or another API) with the elements of that string array as keys.
Then for each data row you read, split it by | to get another string array (dataArray), and get the map value with map.get(headerArray[i]) for the i-th element of dataArray. Once you locate the map entry/value, you can do the following logic (add to the list).
You can also design a ValueObjectType type with those attributes and a special setter accepting an int index and a String value, where you check which attribute the value should go to. That way you don't need the map any longer; you need a List<ValueObjectType>.
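For reference, a small sketch of the header-keyed map approach on the sample data (the class name is illustrative):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderKeyedParser {
    public static void main(String[] args) {
        String headerLine = "BUS|ID|Load|Max_Power";
        String dataLine = "1|2|1|10.9";
        String[] headerArray = headerLine.split("\\|");
        Map<String, List<String>> columns = new HashMap<>();
        for (String h : headerArray) {
            columns.put(h.trim(), new ArrayList<>());
        }
        String[] dataArray = dataLine.split("\\|");
        for (int i = 0; i < dataArray.length; i++) {
            // headerArray[i] is the column name for dataArray[i]
            columns.get(headerArray[i].trim()).add(dataArray[i].trim());
        }
        System.out.println(columns); // {BUS=[1], ID=[2], Load=[1], Max_Power=[10.9]}
    }
}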
You can use the com.csvreader.CsvReader class (available in javacsv.jar).
This class provides functionality to read a CSV file row by row.
This will serve your purpose.
Here is the sample code:
CsvReader csv = new CsvReader(fileName);
csv.readHeaders(); // read the header row
while (csv.readRecord()) {
    String value = csv.get(0); // first column of the current record
}
csv.close();

Creating multiple output files with AvroMultipleOutputs

I have a Reducer using AvroKeyOutput as the output format. By default, MapReduce will write all my keys to a single output file. I would like to write a separate output file for each key value. Avro provides the AvroMultipleOutputs class, but examples are slim. The one provided by Apache for AvroMultipleOutputs shows how to pre-configure the various outputs when defining the job. The example shows:
JOB:
AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroOutputFormat.class, schema);
AvroMultipleOutputs.addNamedOutput(job, "avro2", AvroOutputFormat.class, null);
REDUCER:
amos = new AvroMultipleOutputs(conf);
amos.getCollector("avro1", reporter).collect(datum);
amos.getCollector("avro2", "A", reporter).collect(datum);
amos.getCollector("avro3", "B", reporter).collect(datum);
But I don't know how many files I will need or what their names are, since it is based on the key values that come out of my reducer. How could I modify this to accommodate dynamic file naming?
A strategy you can use in this situation:
use a Map-only job (zero reduce tasks)
have a single named multiple output configuration
during map(), use your key value as the base output path in AvroMultipleOutputs.write(String namedOutput, Object key, Object value, String baseOutputPath) (see the sketch below)
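A rough sketch of such a mapper in the newer mapreduce API (the schema field name "myKeyField", the named output "avroOut", and the class name are assumptions):

import java.io.IOException;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: each record is written under a directory derived from its key.
public class SplitByKeyMapper
        extends Mapper<AvroKey<GenericRecord>, NullWritable, AvroKey<GenericRecord>, NullWritable> {

    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
        amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(AvroKey<GenericRecord> record, NullWritable ignore, Context context)
            throws IOException, InterruptedException {
        // Build the base output path from the key value, e.g. "A/part", "B/part", ...
        String keyValue = String.valueOf(record.datum().get("myKeyField"));
        amos.write("avroOut", record, NullWritable.get(), keyValue + "/part");
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        amos.close();
    }
}

// In the driver: job.setNumReduceTasks(0); and register the single named output, e.g.
// AvroMultipleOutputs.addNamedOutput(job, "avroOut", AvroKeyOutputFormat.class, schema);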

Hadoop and MapReduce: How do I send the equivalent of an array of lines pulled from a CSV to the map function, where each array contains lines x - y?

Okay, so I have been reading a lot about Hadoop and MapReduce, and maybe it's because I'm not as familiar with iterators as most, but I have a question I can't seem to find a direct answer to. Basically, as I understand it, the map function is executed in parallel by many machines and/or cores. Thus, whatever you are working on must not depend on prior code having been executed for the program to make any kind of speed gains. This works perfectly for me, but what I'm doing requires me to test information in small batches. Basically I need to send batches of lines of a .csv as arrays of 32, 64, 128 or however many lines each. For example, lines 0-127 go to core 1's execution of the map function, lines 128-255 go to core 2's, etc. Also I need to have the contents of each batch available as a whole inside the function, as if I had passed it an array. I read a little about how the new Java API allows for something called push and pull, and that this allows things to be sent in batches, but I couldn't find any example code. I don't know; I'm going to continue researching, and I'll post anything I find, but if anyone knows, could they please post in this thread. I would really appreciate any help I might receive.
edit
If you could simply ensure that the chunks of the .csv are sent in sequence, you could perform it this way. I guess this also assumes that there are globals in MapReduce.
//** concept not code **//
GLOBAL_COUNTER = 0;
GLOBAL_ARRAY = NEW ARRAY();
map()
{
    GLOBAL_ARRAY[GLOBAL_COUNTER] = ITERATOR_VALUE;
    GLOBAL_COUNTER++;
    if (GLOBAL_COUNTER == 128)
    {
        // EXECUTE TEST WITH AN ARRAY OF 128 VALUES FOR COMPARISON
        GLOBAL_COUNTER = 0;
    }
}
If you're trying to get a chunk of lines from your CSV file into the mapper, you might consider writing your own InputFormat/RecordReader and potentially your own WritableComparable object. With the custom InputFormat/RecordReader you'll be able to specify how objects are created and passed to the mapper based on the input you receive.
If the mapper is doing what you want, but you need these chunks of lines sent to the reducer, make the output key for the mapper the same for each line you want in the same reduce function.
The default TextInputFormat will give input to your mapper like this (the keys/offsets in this example are just random numbers):
0 Hello World
123 My name is Sam
456 Foo bar bar foo
Each of those lines will be read into your mapper as a key,value pair. Just modify the key to be the same for each line you need and write it to the output:
0 Hello World
0 My name is Sam
1 Foo bar bar foo
The first time the reduce function is called, it will receive a key,value pair with the key being "0" and the value being an Iterable object containing "Hello World" and "My name is Sam". You'll be able to access both of these values in the same reduce method call by using the Iterable object.
Here is some pseudo code:
int count = 0
map (key, value) {
    int newKey = count / 2
    context.write(newKey, value)
    count++
}
reduce (key, values) {
    for value in values
        // Do something to each line
}
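For reference, the same idea written out as Java (a sketch; the chunk size of 128 and the class names are assumptions, and note that chunk ids restart in every mapper, so lines from different splits can share a key):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: every 128 consecutive lines of a split share the same key.
class ChunkMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private int count = 0;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(new IntWritable(count / 128), line);
        count++;
    }
}

// Reducer: one reduce() call sees a whole chunk of lines at once.
class ChunkReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable chunkId, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        for (Text line : lines) {
            // Run the batch test over the lines of this chunk here.
            context.write(chunkId, line);
        }
    }
}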
Hope that helps. :)
If the end goal is to force certain sets of lines to go to certain machines for processing, you want to look into writing your own Partitioner. Otherwise, Hadoop will split the data automatically for you depending on the number of reducers.
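A minimal custom Partitioner is only a few lines (a sketch, assuming integer chunk ids as keys):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ChunkPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Route each chunk id to a fixed reducer.
        return Math.abs(key.get()) % numPartitions;
    }
}
// Registered in the driver with job.setPartitionerClass(ChunkPartitioner.class);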
I suggest reading the tutorial on the Hadoop site to get a better understanding of M/R.
If you simply want to send N lines of input to a single mapper, you can use the NLineInputFormat class. You could then do the line parsing (splitting on commas, etc.) in the mapper.
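A minimal driver sketch for that (the batch size of 128 and the argument paths are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BatchDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "csv batches");
        job.setJarByClass(BatchDriver.class);
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 128); // 128 lines per mapper
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // job.setMapperClass(...); // your CSV-parsing mapper goes here
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}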
If you want to have access to the lines before and after the line the mapper is currently processing, you may have to write your own input format. Subclassing FileInputFormat is usually a good place to start. You could create an InputFormat that reads N lines, concatenates them, and sends them as one block to a mapper, which then splits the input into N lines again and begins processing.
As far as globals in Hadoop go, you can specify some custom parameters when you create the job configuration, but as far as I know, you cannot change them in a worker and expect the change to propagate throughout the cluster. To set a job parameter that will be visible to workers, do the following where you are creating the job:
job.getConfiguration().set(Constants.SOME_PARAM, "my value");
Then, to read the parameter's value in the mapper or reducer:
public void map(Text key, Text value, Context context) {
    Configuration conf = context.getConfiguration();
    String someParam = conf.get(Constants.SOME_PARAM);
    // use someParam in processing input
}
Hadoop has support for basic types such as int, long, String, boolean, etc. to be used as parameters.
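For example, the typed setters and getters on Configuration work the same way as the String version above (these property names are made up for illustration):

Configuration conf = job.getConfiguration();
conf.setInt("my.batch.size", 128);
conf.setBoolean("my.verbose", true);

// ... later, in a mapper or reducer:
int batchSize = context.getConfiguration().getInt("my.batch.size", 64);      // 64 is the default
boolean verbose = context.getConfiguration().getBoolean("my.verbose", false);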
