Avoiding file collisions in Hadoop Pig script that writes multiple output files - java

I'm writing a Pig script that looks as follows:
...
myGroup = group simplifiedJoinData by (dir1, dir2, dir3, dir4);
betterGroup = foreach myGroup {
    value1Value2 = foreach simplifiedJoinedGroup generate value1, value2;
    distinctValue1Value2 = DISTINCT value1Value2;
    generate group, distinctValue1Value2;
}
store betterGroup into '/myHdfsPath/myMultiStorageTest' using MyMultiStorage('output', '0', 'none' );
Please note that the schema of simplifiedJoinData is simplifiedJoinedGroup: {dir1: long,dir2: long,dir3: chararray,dir4: chararray,value1: chararray,value2: chararray}
It uses a custom storage class (MyMultiStorage - basically a modified version of MultiStorage in the piggybank) that writes multiple output files. The custom storage class expects that the values passed to it are in the following format:
{group:(dir1:long,dir2:long,dir3:chararray,dir4:chararray), bag:{(value1:chararray,value2:chararray)}}
What I'd like the custom storage class to do is output multiple files as follows:
dir1/dir2/dir3/dir4/value1_values.txt
dir1/dir2/dir3/dir4/value2_values.txt
where the value1_values.txt contains all the value1 values and value2_values.txt contains all the value2 values. Ideally I would prefer not to write multiple part files that have to be combined later (Note that the example has been simplified for the purposes of this discussion. The real output files are binary structures that can't be combined with a simple cat). I have this working for small data sets; however, when I run with larger data sets, I run into issues where I get exceptions in Hadoop that the output file name already exists or that it is already being created:
java.io.IOException: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException
I suspect that this is because multiple mappers or reducers are attempting to write the same file, and I am not using part IDs in the filename as PigStorage does. However, I would have expected that by grouping the data, I'd only have one record for each dir1, dir2, dir3, dir4 combination, and, as such, only one mapper or reducer would be attempting to write a particular file for a given run. I've tried running without speculative execution for both map and reduce tasks, but that seems to have had no effect. Clearly I don't understand what's going on here.
My question is: Why am I getting the AlreadyBeingCreatedException?
If there is no way for me to have a single reducer write all data for each record, it would be acceptable to write multiple part files in a directory (one per reducer) and combine them after the fact. It just wouldn't be ideal. However, I have not yet been able to determine the proper way to have the custom storage class determine a unique filename, and I still end up with multiple reducers trying to create/write the same file. Is there a particular method in the job configuration or context that would allow me to coordinate parts across the job?
Thanks in advance for any help you can provide.

Turns out that there was a condition where I was generating the same file name due to a tuple parsing error. I was getting the AlreadyBeingCreatedException for that exact reason.
Nothing wrong with the custom store function, or approaching the problem in this manner. Just a silly mistake on my part!
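For anyone who hits the same exception, here is a minimal sketch of a guard that would surface this kind of parsing error early: build the output path from the group fields, fail fast when a field is missing, and append a per-task part number so two tasks can never target the same file. The class name, method name and part-number suffix are illustrative assumptions and are not part of MyMultiStorage or the piggybank.

```java
import java.util.Locale;
import java.util.Objects;

/** Hypothetical helper; not part of MyMultiStorage or the piggybank. */
final class OutputPathBuilder {

    /**
     * Builds dir1/dir2/dir3/dir4/<valueName>_values-<part>.txt.
     * Null or empty fields are rejected so that a mis-parsed tuple fails loudly
     * instead of silently mapping several groups onto the same file name.
     */
    static String build(Long dir1, Long dir2, String dir3, String dir4,
                        String valueName, int taskPart) {
        Objects.requireNonNull(dir1, "dir1");
        Objects.requireNonNull(dir2, "dir2");
        requireNonEmpty(dir3, "dir3");
        requireNonEmpty(dir4, "dir4");
        requireNonEmpty(valueName, "valueName");
        if (taskPart < 0) {
            throw new IllegalArgumentException("taskPart must be >= 0");
        }
        // The part suffix is an assumption: it keeps names unique per task.
        return String.format(Locale.ROOT, "%d/%d/%s/%s/%s_values-%05d.txt",
                dir1, dir2, dir3, dir4, valueName, taskPart);
    }

    private static void requireNonEmpty(String value, String field) {
        if (value == null || value.isEmpty()) {
            throw new IllegalArgumentException(field + " is missing");
        }
    }
}
```

In a real store function the part number would have to come from the task itself (for example its task attempt ID); how to obtain it is left out here.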

Related

Apache Spark: Issues with saveAsTextFile() and filter()

When I try to use the function saveAsTextFile() I always get empty files, even though the RDD contains tuples:
myRDD.saveAsTextFile("C:/Users/pc/Desktop/chna.txt");
What can be the reason?
Let's assume that it works and the data is registered in the textfile, how can I retrieve it through the shell or through my code (note: I am using Java)?
Is there a way to modify a text file from my code (still in Java)? I tried the following code but got a java.io.NotSerializableException. Is there another possible solution?
BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter("C:/Users/pc/Desktop/chn.txt", true));
pairsRDD.foreach(x -> bufferedWriter.write(x._1+" "+x._2));
bufferedWriter.newLine(); // ...
bufferedWriter.close();
When I used this line of code:
JavaPairRDD<Integer, String> filterRDD = pairsRDD.filter((x,y) -> (x._1.equals(y._1))&&(x._2.equals(y._2)))));
I got an IOException. Is it caused by the RDD being empty, or is the filter condition wrong?
How can I fix this problem, and what is causing it?
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
When I create the RDD, it takes the first line (the field names) too. How can I avoid this? I only want the lines which contain values.
saveAsTextFile() takes a path to a folder as its parameter, not a path to a file. It writes one file per partition in that folder, named part-xxxxx (xxxxx running from 00000 up to the number of partitions minus one).
To read your data back, it's as simple as using sparkContext.textFile() or .wholeTextFiles() (textFile() reads a file or a whole folder line by line; wholeTextFiles() returns each file in a folder as a single (path, content) pair).
There's no simple way in Spark to modify a file in place, since you don't control the naming of whatever Spark writes, and Spark refuses to write into an existing output folder in the first place.
If you really want to do that, the best thing is not to use Spark at all, since it's not a matter of distributed computing. Use e.g. sed or awk to edit the file in place, which will be orders of magnitude faster, and a one-liner.
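If the edit has to happen from Java rather than sed or awk, one option is to do it on the driver with plain java.nio once the data is small enough to hold locally (for example after collect()). The NotSerializableException in the question comes from the lambda capturing the BufferedWriter, which Spark then tries to serialize and ship to the executors. A minimal sketch, assuming the pairs are already in a local collection; the class and method names are made up for illustration:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Map;

/** Hypothetical helper that runs on the driver only, outside any Spark closure. */
final class LocalAppend {

    /** Appends one "key value" line per pair, creating the file if it does not exist. */
    static void appendPairs(Path file,
                            Iterable<? extends Map.Entry<Integer, String>> pairs)
            throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (Map.Entry<Integer, String> pair : pairs) {
                writer.write(pair.getKey() + " " + pair.getValue());
                writer.newLine();
            }
        }
    }
}
```

Because the writer is opened and closed inside one method on the driver, nothing non-serializable is captured by a Spark function.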

Control number of hadoop mapper output files

I have a Hadoop job. When the job is started, some number of mappers are started, and each mapper writes a file to disk, like part-m-00000, part-m-00001. As I understand it, each mapper creates one part file. I have a large amount of data, so there must be more than one mapper, but can I somehow control the number of these output files? I mean, Hadoop would start, for example, 10 mappers, but there would be only three part files?
I found this post
How do multiple reducers output only one part-file in Hadoop?
But that post uses the old version of the Hadoop library. I'm using classes from org.apache.hadoop.mapreduce.* and not from org.apache.hadoop.mapred.*
I'm using hadoop version 0.20, and hadoop-core:1.2.0.jar
Is there any possibility to do this, using new hadoop API?
The number of output files equals the number of reducers, or the number of mappers if there aren't any reducers.
You can add a single reducer to your job so that the output from all the mappers is directed to it and you get a single output file. Note that this will be less efficient, as all the data (the output of the mappers) will be sent over the wire (network IO) to the node where the reducer runs. Also, since a single process will (eventually) get all the data, it will probably run slower.
By the way, the fact that there are multiple parts shouldn't matter much, as you can pass the directory containing them to subsequent jobs.
I'm not sure you can do it (your link is about multiple outputs, not about converging to only one), and why use only one output? You would lose all parallelism on the sort.
I'm also working on big files (~10GB each), and my MR jobs process almost 100GB each. To lower the number of maps, I set a higher block size in HDFS (applies only to newer files) and a higher value of mapred.min.split.size in mapred-site.xml.
You might want to look at MultipleOutputFormat
Part of what Javadoc says:
This abstract class extends the FileOutputFormat, allowing to write
the output data to different output files.
Both Mapper and Reducer can use this.
Check this link for how you can specify one or more output file names for different mappers writing to HDFS.
NOTE: Make sure you don't use context.write(), so that 10 files from 10 mappers don't get created. Use only MultipleOutputFormat to write the output.
If the job has no reducers, partitioners or combiners, each mapper writes one output file. At some point, you should run some post-processing to collect the outputs into a large file.

Writing Hadoop MapReduce output to just 2 flat files

So I have a MapReduce job that takes in multiple news articles and outputs the following key value pairs.
.
.
.
<article_id, social_tag.name, social_tag.isCompany, social_tag.code>
<article_id2, social_tag2.name, social_tag2.isCompany, social_tag.code>
<article_id, topic_code.name, topic_code.isCompany, topic_code.rcsCode>
<article_id3, social_tag3.name, social_tag3.isCompany, social_tag.code>
<article_id2, topic_code2.name, topic_code2.isCompany, topic_code2.rcsCode>
.
.
.
As you can see, there are two main types of data rows that I am currently outputting, and right now these get mixed up in the flat files written by MapReduce. Is there any way I can simply output social_tags to file1 and topic_codes to file2, OR maybe output social_tags to a specified group of files (social1.txt, social2.txt, etc.) and topic_codes to another group (topic1.txt, topic2.txt, etc.)?
The reason I'm asking is so that I can easily store all of these in Hive tables later on. I would prefer to have a separate table for each data type (topic_code, social_tag, etc.). If any of you know a simple way to achieve this without separating the MapReduce output into different files, that would be really helpful too.
Thanks in advance!
You can use MultipleOutputs as already suggested.
Since you asked for a simple way to achieve this without separating the MapReduce output into different files, here is a quick one, provided the amount of data is not huge and the logic to differentiate the records is not too complex.
First load the mixed output file into a Hive table (say main_table). Then create two different tables (topic_code, social_tag) and insert the data from the main table into each, filtering it with a where clause.
hive > insert into table topic_code
> select * from main_table
> where $condition;
// $condition = the logic you would use to differentiate the records in the MR job
I think you can try MultipleOutputs, present in the Hadoop API. MultipleOutputs allows you to write data to files whose names are derived from the output keys and values, or in fact from an arbitrary string. This allows each reducer (or mapper in a map-only job) to create more than a single file. File names are of the form name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an arbitrary name that is set by the program, and nnnnn is an integer designating the part number, starting from zero.
In the reducer, where we generate the output, we construct an instance of MultipleOutputs in the setup() method and assign it to an instance variable. We then use the MultipleOutputs instance in the reduce() method to write to the output, in place of the context. The write() method takes the key and value, as well as a name.
You can look into the below link for details
http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
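To make the naming scheme described above concrete, here is a small plain-Java sketch that only formats the file names; it does not use the Hadoop API, and the class name, method name and the base names social and topic are assumptions for illustration.

```java
import java.util.Locale;

/** Hypothetical helper that mimics the name-m-nnnnn / name-r-nnnnn pattern. */
final class PartNames {

    /** Returns name-m-nnnnn for a map output or name-r-nnnnn for a reduce output. */
    static String partFileName(String name, boolean mapOutput, int part) {
        if (part < 0) {
            throw new IllegalArgumentException("part must be >= 0");
        }
        // Part numbers start at zero and are zero-padded to five digits.
        return String.format(Locale.ROOT, "%s-%s-%05d",
                name, mapOutput ? "m" : "r", part);
    }
}
```

With base names such as social and topic, the first reducer would write social-r-00000 and topic-r-00000, so the two record types land in separately named groups of files.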

Hadoop MapReduce: Read a file and use it as input to filter other files

I would like to write a hadoop application which takes as input a file and an input folder which contains several files. The single file contains keys whose records need to be selected and extracted out of the other files in the folder. How can I achieve this?
By the way, I have a running hadoop mapreduce application which takes as input a path to a folder, does the processing and writes out the result into a different folder.
I am kind of stuck on how to use a file to get the keys that need to be selected and extracted from other files in a specific directory. The file containing the keys is big, so it cannot fit into main memory directly. How can I do it?
Thx!
If the number of keys is too large to fit in memory, then consider loading the key set into a bloom filter (of suitable size to yield a low false positive rate) and then process the files, checking each key for membership in the bloom filter (Hadoop comes with a BloomFilter class, check the Javadocs).
You'll also need to perform a second MR Job to do a final validation (most probably in a reduce side join) to eliminate the false positives output from the first job.
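The membership check can be sketched with a toy Bloom filter in plain Java. This is for illustration only: it is not Hadoop's BloomFilter class (whose API differs), and the sizes, hash functions and names below are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Toy Bloom filter; illustrative only, not Hadoop's BloomFilter. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        if (size <= 0 || hashCount <= 0) {
            throw new IllegalArgumentException("size and hashCount must be positive");
        }
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    /** Sets hashCount bits for the key. */
    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means definitely absent; true means present or a false positive. */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: the i-th index is (h1 + i * h2) mod size.
    private int index(String key, int i) {
        byte[] data = key.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // force odd so the step is never zero
        return Math.floorMod(h1 + i * h2, size);
    }

    // FNV-1a style hash with a configurable starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }
}
```

In the mapper, each record whose key fails mightContain() can be dropped immediately; records that pass still need the second job to weed out false positives.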
I would read the single file first before you run your job. Store all needed keys in the job configuration. You can then write a job to read the files from the folder. In your mapper/reducer setup(context) method, read out the keys from the configuration and store them globally, so that you have the possibility to read them during map or reduce.

How do I append to a file in hadoop?

I want to create a file in HDFS that has a bunch of lines, each generated by a different call to map. I don't care about the order of the lines, just that they all get added to the file. How do I accomplish this?
If this is not possible, then is there a standard way to generate unique file names to put each line of output into a separate file?
There is no way to append to an existing file in hadoop at the moment, but that's not what it sounds like you want to do anyway. It sounds like you want to have the output from your Map Reduce job go to a single file, which is quite possible. The number of output files is (less than or) equal to the number of reducers, so if you set your number of reducers to 1, you'll get a single file of output.
Before you go and do that, however, think about whether that's what you really want. You'll be creating a bottleneck in your pipeline, where all your data has to pass through a single machine for that reduce. Within HDFS, the difference between having one file and having several files is pretty transparent. If you want a single file outside the cluster, you might do better to use getmerge from the file system tools.
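The idea behind getmerge, concatenating the part files of an output directory into one file, can be sketched in plain Java against a local directory. This is not the Hadoop implementation; the part-* name filter and the class and method names are assumptions.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical local stand-in for the idea behind getmerge. */
final class PartMerger {

    /** Concatenates every part-* file in dir, in name order, into target. */
    static void merge(Path dir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips marker files such as _SUCCESS.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        Collections.sort(parts); // directory listing order is not guaranteed
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // raw byte copy, no re-encoding
            }
        }
    }
}
```

The target should not itself match part-* inside the same directory, otherwise a rerun would pick it up as input.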
Both your map and reduce functions should output the lines. In other words, your reduce function is a pass through function that doesn't do much. Set the number of reducers to 1. The output will be a list of all the lines in one file.
