Below is sample data, input.txt; it has 2 columns, key and value. For each record processed by the Mapper, the output of map should be written to
1) HDFS => a new file needs to be created based on the key column
2) the Context object
Below is the code. Four files need to be created based on the key column, but the files are not getting created. The output is incorrect too: I am expecting word-count output, but I am getting character-count output.
input.txt
------------
key value
HelloWorld1|ID1
HelloWorld2|ID2
HelloWorld3|ID3
HelloWorld4|ID4
public static class MapForWordCount extends Mapper<LongWritable, Text, Text, IntWritable> {

    public void map(LongWritable key, Text value, Context con) throws IOException, InterruptedException {
        String line = value.toString();
        String[] fileContent = line.split("|");
        Path hdfsPath = new Path("/filelocation/" + fileContent[0]);
        System.out.println("FilePath : " + hdfsPath);
        Configuration configuration = con.getConfiguration();
        writeFile(fileContent[1], hdfsPath, configuration);
        for (String word : fileContent) {
            Text outputKey = new Text(word.toUpperCase().trim());
            IntWritable outputValue = new IntWritable(1);
            con.write(outputKey, outputValue);
        }
    }

    static void writeFile(String fileContent, Path hdfsPath, Configuration configuration) throws IOException {
        FileSystem fs = FileSystem.get(configuration);
        FSDataOutputStream fin = fs.create(hdfsPath);
        fin.writeUTF(fileContent);
        fin.close();
    }
}
String.split takes a regular expression, and '|' is a special character (alternation) in a regex, so splitting on "|" splits between every character — which is why you end up with character counts. You need to escape it, like .split("\\|");
See the docs here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html
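For example, the corrected line inside your map method would be:

// split on a literal '|' instead of treating it as the regex alternation operator
String[] fileContent = line.split("\\|");
// e.g. "HelloWorld1|ID1" -> fileContent[0] = "HelloWorld1", fileContent[1] = "ID1"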
I need your help to convert a PRN file to a CSV file using Java.
Below is my PRN file.
I would like to make it look like this.
Thank you so much.
In your example you have four entries as input, each in its own row; in your result table they are all in one row. I assume the input describes one complete PRN set, so a file containing n PRN sets would have n * 4 rows.
To map a PRN set to a CSV file you have to:
1. read in the entries from the input file
2. write a header row (with eight titles)
3. extract the relevant values from each entry
4. combine the extracted values of four consecutive entries into one CSV row
5. write the row
6. repeat steps 3 to 5 as long as there are further entries
Here is my suggestion:
import java.io.BufferedWriter;
import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;
import java.nio.file.Files;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class PrnToCsv {

    private static final String DILIM_PRN = " ";
    private static final String DILIM_CSV = ",";

    private static final Pattern PRN_SPLITTER = Pattern.compile(DILIM_PRN);

    public static void main(String[] args) throws URISyntaxException, IOException {
        // read the whole input file, one String per line
        List<String> inputLines = Files.readAllLines(new File("C://Temp//csv/input.prn").toPath());
        // split every line into its single words
        List<String[]> inputValuesInLines = inputLines.stream().map(l -> PRN_SPLITTER.split(l)).collect(Collectors.toList());

        try (BufferedWriter bw = Files.newBufferedWriter(new File("C://Temp//csv//output.csv").toPath())) {
            // header
            bw.append("POL1").append(DILIM_CSV).append("POL1_Time").append(DILIM_CSV).append("OLV1").append(DILIM_CSV).append("OLV1_Time").append(DILIM_CSV);
            bw.append("POL2").append(DILIM_CSV).append("POL2_Time").append(DILIM_CSV).append("OLV2").append(DILIM_CSV).append("OLV2_Time");
            bw.newLine();
            // data: each group of four input entries becomes one CSV row
            for (int i = 0; i + 3 < inputValuesInLines.size(); i = i + 4) {
                String[] firstValues = inputValuesInLines.get(i);
                bw.append(getId(firstValues)).append(DILIM_CSV).append(getDateTime(firstValues)).append(DILIM_CSV);
                String[] secondValues = inputValuesInLines.get(i + 1);
                bw.append(getId(secondValues)).append(DILIM_CSV).append(getDateTime(secondValues)).append(DILIM_CSV);
                String[] thirdValues = inputValuesInLines.get(i + 2);
                bw.append(getId(thirdValues)).append(DILIM_CSV).append(getDateTime(thirdValues)).append(DILIM_CSV);
                String[] fourthValues = inputValuesInLines.get(i + 3);
                bw.append(getId(fourthValues)).append(DILIM_CSV).append(getDateTime(fourthValues));
                bw.newLine();
            }
        }
    }

    public static String getId(String[] values) {
        return values[1];
    }

    public static String getDateTime(String[] values) {
        return values[2] + " " + values[3];
    }
}
Some remarks on the code:
Using the nio API you can read the whole file with one line of code.
To extract the values of an entry line I used a Pattern to split the line into an array, with each single word as a value.
Then it is easy to get the relevant values of an entry using the appropriate array indexes.
To write the CSV file line by line (without additional libs) you can use a BufferedWriter.
The file you're writing to is a resource; it is recommended to handle resources with the try-with-resources statement.
I hope I could answer your question.
I am loading multiple files into a JavaRDD using
JavaRDD<String> allLines = sc.textFile("hdfs://path/*.csv");
After loading the files I modify each record and want to save them. However, I also need to save the original file name (ID) with each record for future reference. Is there any way to get the original file name for an individual record in the RDD?
Thanks
You can try something like the following snippet:
JavaPairRDD<LongWritable, Text> javaPairRDD = sc.newAPIHadoopFile(
    "hdfs://path/*.csv",
    TextInputFormat.class,
    LongWritable.class,
    Text.class,
    new Configuration()
);
JavaNewHadoopRDD<LongWritable, Text> hadoopRDD = (JavaNewHadoopRDD) javaPairRDD;

JavaRDD<Tuple2<String, String>> namedLinesRDD = hadoopRDD.mapPartitionsWithInputSplit((inputSplit, lines) -> {
    FileSplit fileSplit = (FileSplit) inputSplit;
    String fileName = fileSplit.getPath().getName();

    Stream<Tuple2<String, String>> stream =
        StreamSupport.stream(Spliterators.spliteratorUnknownSize(lines, Spliterator.ORDERED), false)
            .map(line -> {
                String lineText = line._2().toString();
                // emit the file name as key and the line as value
                return new Tuple2<>(fileName, lineText);
            });
    return stream.iterator();
}, true);
Update (for java7)
JavaRDD<Tuple2<String, String>> namedLinesRDD = hadoopRDD.mapPartitionsWithInputSplit(
    new Function2<InputSplit, Iterator<Tuple2<LongWritable, Text>>, Iterator<Tuple2<String, String>>>() {
        @Override
        public Iterator<Tuple2<String, String>> call(InputSplit inputSplit, final Iterator<Tuple2<LongWritable, Text>> lines) throws Exception {
            FileSplit fileSplit = (FileSplit) inputSplit;
            final String fileName = fileSplit.getPath().getName();

            return new Iterator<Tuple2<String, String>>() {
                @Override
                public boolean hasNext() {
                    return lines.hasNext();
                }

                @Override
                public Tuple2<String, String> next() {
                    Tuple2<LongWritable, Text> entry = lines.next();
                    return new Tuple2<String, String>(fileName, entry._2().toString());
                }
            };
        }
    },
    true
);
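From there you could, for example, flatten the pairs back to text and write them out; a minimal follow-up sketch (the "fileName,line" output format and the output path are placeholders, not anything prescribed):

namedLinesRDD
        .map(t -> t._1() + "," + t._2())
        .saveAsTextFile("hdfs://path/output");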
You want Spark's wholeTextFiles function. From the documentation:
For example, if you have the following files:
hdfs://a-hdfs-path/part-00000
hdfs://a-hdfs-path/part-00001
...
hdfs://a-hdfs-path/part-nnnnn
Do val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path"),
then rdd contains
(a-hdfs-path/part-00000, its content)
(a-hdfs-path/part-00001, its content)
...
(a-hdfs-path/part-nnnnn, its content)
It returns an RDD of tuples where the left element is the file name and the right is the content.
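Since your code uses the Java API, here is a minimal sketch of the same idea with a JavaSparkContext (assuming Spark 2.x, where FlatMapFunction returns an Iterator); splitting the content into lines is only needed if you want a per-line RDD like sc.textFile gives you:

import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// key: full file path, value: entire file content
JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs://path/*.csv");

// optionally break each file back into lines, keeping the file path with every line
JavaRDD<Tuple2<String, String>> namedLines = files.flatMap(fileAndContent -> {
    String path = fileAndContent._1();
    return Arrays.stream(fileAndContent._2().split("\n"))
            .map(line -> new Tuple2<String, String>(path, line))
            .iterator();
});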
You should be able to use toDebugString. Using wholeTextFiles will read in the entire content of your file as one element, whereas sc.textFile creates an RDD with each line as an individual element, as described here.
for example:
val file= sc.textFile("/user/user01/whatever.txt").cache()
val wordcount = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordcount.toDebugString
// res0: String =
// (2) ShuffledRDD[4] at reduceByKey at <console>:23 []
// +-(2) MapPartitionsRDD[3] at map at <console>:23 []
// | MapPartitionsRDD[2] at flatMap at <console>:23 []
// | /user/user01/whatever.txt MapPartitionsRDD[1] at textFile at <console>:21 []
// | /user/user01/whatever.txt HadoopRDD[0] at textFile at <console>:21 []
I have a MapReduce program that runs perfectly in stand-alone mode, but when I run it on the Hadoop cluster at my school an exception happens in the reducer. I have no clue what exception it is. I came to know this because when I put a try/catch in the reducer the job passes but produces empty output; when I don't use the try/catch, the job fails. Since it is a school cluster, I do not have access to the job tracker or any of its files; all I can find out is programmatic. Is there a way I can find out what exception happened on Hadoop at run time?
Following are snippets of my code
public static class RowMPreMap extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, Text> {

    private Text keyText = new Text();
    private Text valText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        // Input: (lineNo, lineContent)

        // Split each line using the separator appropriate for the dataset.
        String line[] = null;
        line = value.toString().split(Settings.INPUT_SEPERATOR);

        keyText.set(line[0]);
        valText.set(line[1] + "," + line[2]);

        // Output: (userid, "movieid,rating")
        output.collect(keyText, valText);
    }
}
public static class RowMPreReduce extends MapReduceBase implements
        Reducer<Text, Text, Text, Text> {

    private Text valText = new Text();

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        // Input: (userid, List<movieid, rating>)

        float sum = 0.0F;
        int totalRatingCount = 0;

        ArrayList<String> movieID = new ArrayList<String>();
        ArrayList<Float> rating = new ArrayList<Float>();

        while (values.hasNext()) {
            String[] movieRatingPair = values.next().toString().split(",");
            movieID.add(movieRatingPair[0]);
            Float parseRating = Float.parseFloat(movieRatingPair[1]);
            rating.add(parseRating);
            sum += parseRating;
            totalRatingCount++;
        }

        float average = ((float) sum) / totalRatingCount;

        for (int i = 0; i < movieID.size(); i++) {
            valText.set("M " + key.toString() + " " + movieID.get(i) + " "
                    + (rating.get(i) - average));
            output.collect(null, valText);
        }

        // Output: (null, <M userid, movieid, normalizedrating>)
    }
}
The exception happens in the above reducer. Below is the job configuration:
public void normalizeM() throws IOException, InterruptedException {
    JobConf conf1 = new JobConf(UVDriver.class);
    conf1.setMapperClass(RowMPreMap.class);
    conf1.setReducerClass(RowMPreReduce.class);
    conf1.setJarByClass(UVDriver.class);

    conf1.setMapOutputKeyClass(Text.class);
    conf1.setMapOutputValueClass(Text.class);

    conf1.setOutputKeyClass(Text.class);
    conf1.setOutputValueClass(Text.class);

    conf1.setKeepFailedTaskFiles(true);

    conf1.setInputFormat(TextInputFormat.class);
    conf1.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.addInputPath(conf1, new Path(Settings.INPUT_PATH));
    FileOutputFormat.setOutputPath(conf1, new Path(Settings.TEMP_PATH + "/"
            + Settings.NORMALIZE_DATA_PATH_TEMP));

    JobConf conf2 = new JobConf(UVDriver.class);
    conf2.setMapperClass(ColMPreMap.class);
    conf2.setReducerClass(ColMPreReduce.class);
    conf2.setJarByClass(UVDriver.class);

    conf2.setMapOutputKeyClass(Text.class);
    conf2.setMapOutputValueClass(Text.class);

    conf2.setOutputKeyClass(Text.class);
    conf2.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(conf2, new Path(Settings.TEMP_PATH + "/"
            + Settings.NORMALIZE_DATA_PATH_TEMP));
    FileOutputFormat.setOutputPath(conf2, new Path(Settings.TEMP_PATH + "/"
            + Settings.NORMALIZE_DATA_PATH));

    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);

    JobControl jobControl = new JobControl("jobControl");
    jobControl.addJob(job1);
    jobControl.addJob(job2);
    job2.addDependingJob(job1);
    handleRun(jobControl);
}
I caught the exception in the reducer and wrote the stack trace to a file on the file system. I know this is the dirtiest possible way of doing this, but I have no other option at this point. Following is the code in case it helps anyone in the future; put it in the catch block.
String valueString = "";
while (values.hasNext()) {
valueString += values.next().toString();
}
StringWriter sw = new StringWriter();
e.printStackTrace(new PrintWriter(sw));
String exceptionAsString = sw.toString();
Path pt = new Path("errorfile");
FileSystem fs = FileSystem.get(new Configuration());
BufferedWriter br = new BufferedWriter(new OutputStreamWriter(fs.create(pt,true)));
br.write(exceptionAsString + "\nkey: " + key.toString() + "\nvalues: " + valueString);
br.close();
Suggestions for doing this in a cleaner way are welcome.
On a side note, I eventually found it is a NumberFormatException. Counters would not have helped me identify this. Later I realized that the input is being split differently in stand-alone mode and on the cluster, though I have yet to find the reason.
Even if you don't have access to the server, you can get the counters for a job:
Counters counters = job.getCounters();
and dump the set of counters to your local console. These counters will show, among other things, the counts for the number of records input to and written from the mappers and reducers. The counters that have value zero indicate the problem location in your workflow. You can instrument your own counters to help debug/monitor the flow.
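For example, a minimal sketch (assuming the new-API org.apache.hadoop.mapreduce.Job; the group and counter names below are made up for illustration) that dumps all counters after the job finishes, plus a custom counter incremented from the old-API reducer through the Reporter it already receives:

import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.CounterGroup;
import org.apache.hadoop.mapreduce.Counters;

// driver side: print every counter group to the console after the job completes
Counters counters = job.getCounters();
for (CounterGroup group : counters) {
    System.out.println("Group: " + group.getDisplayName());
    for (Counter counter : group) {
        System.out.println("  " + counter.getDisplayName() + " = " + counter.getValue());
    }
}

// reducer side (old API): count bad records instead of letting the task die
try {
    rating.add(Float.parseFloat(movieRatingPair[1]));
} catch (NumberFormatException e) {
    reporter.incrCounter("RowMPreReduce", "BAD_RATING_RECORDS", 1);
}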
In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (which is the default) to separate key and value?
Sample Input :
one,first line
two,second line
Output Required :
Key : one
Value : first line
Key : two
Value : second line
I am specifying KeyValueTextInputFormat as :
Job job = new Job(conf, "Sample");
job.setInputFormatClass(KeyValueTextInputFormat.class);
KeyValueTextInputFormat.addInputPath(job, new Path("/home/input.txt"));
This is working fine for tab as a separator.
In the newer API you should use the mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.
Here's an example:
Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
// next job set-up
Please set the following in the driver code:
conf.set("key.value.separator.in.input.line", ",");
For KeyValueTextInputFormat the input line should be a key-value pair separated by "\t":
Key1 Value1,Value2
By changing the default separator, you will be able to read as you wish.
For the new API, here is the solution:
//New API
Configuration conf = new Configuration();
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
job.setInputFormatClass(KeyValueTextInputFormat.class);
Map
public class Map extends Mapper<Text, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        System.out.println("key---> " + key);
        System.out.println("value---> " + value.toString());
        .
        .
Output
key---> one
value---> first line
key---> two
value---> second line
It's a matter of sequence: the line conf.set("key.value.separator.in.input.line", ",") must come before you create the instance of the Job class. So:
conf.set("key.value.separator.in.input.line", ",");
Job job = new Job(conf);
First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you should implement the feature yourself. For example, you can use FileInputFormat to achieve this.
Ignore the LongWritable key, and split the Text value on the comma yourself, as sketched below.
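A minimal sketch of that approach with the default TextInputFormat (the class name CommaSplitMapper is just for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CommaSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // split only on the first comma: "one,first line" -> key "one", value "first line"
        String[] parts = line.toString().split(",", 2);
        if (parts.length == 2) {
            context.write(new Text(parts[0]), new Text(parts[1]));
        }
    }
}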
By default, the KeyValueTextInputFormat class uses tab as the separator between key and value in the input text file.
If you want to read the input with a custom separator, you have to set the corresponding attribute in the configuration.
For the new Hadoop APIs, it is different:
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");
Example
public class KeyValueTextInput extends Configured implements Tool {

    public static void main(String args[]) throws Exception {
        String log4jConfPath = "log4j.properties";
        PropertyConfigurator.configure(log4jConfPath);
        int res = ToolRunner.run(new KeyValueTextInput(), args);
        System.exit(res);
    }

    public int run(String[] args) throws Exception {
        Configuration conf = this.getConf();

        //conf.set("key.value.separator.in.input.line", ",");
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator",
                ",");

        Job job = Job.getInstance(conf, "WordCountSampleTemplate");
        job.setJarByClass(KeyValueTextInput.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        job.setInputFormatClass(KeyValueTextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        Path outputPath = new Path(args[1]);
        FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
        fs.delete(outputPath, true);
        FileOutputFormat.setOutputPath(job, outputPath);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}
class Map extends Mapper<Text, Text, Text, Text> {
    public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
        context.write(k1, v1);
    }
}

class Reduce extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text Key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        String sum = " || ";
        for (Text value : values)
            sum = sum + value.toString() + " || ";
        context.write(Key, new Text(sum));
    }
}
I'm new to Hadoop.
I have a MapReduce job which is supposed to get its input from HDFS and write the output of the reducer to HBase. I haven't found any good example.
Here's the code. The error when running this example is "Type mismatch in map, expected ImmutableBytesWritable received IntWritable".
Mapper Class
public static class AddValueMapper extends Mapper<LongWritable,
        Text, ImmutableBytesWritable, IntWritable> {

    /* input  <key: line number, value: full line>
     * output <key: log key, value: log value> */
    public void map(LongWritable key, Text value,
            Context context) throws IOException,
            InterruptedException {
        byte[] keyBytes;
        int intValue, pos = 0;
        String line = value.toString();
        String p1, p2 = null;
        pos = line.indexOf("=");

        // Key part
        p1 = line.substring(0, pos);
        p1 = p1.trim();
        keyBytes = Bytes.toBytes(p1);

        // Value part
        p2 = line.substring(pos + 1);
        p2 = p2.trim();
        intValue = Integer.parseInt(p2);

        context.write(new ImmutableBytesWritable(keyBytes), new IntWritable(intValue));
    }
}
Reducer Class
public static class AddValuesReducer extends TableReducer<
        ImmutableBytesWritable, IntWritable, ImmutableBytesWritable> {

    public void reduce(ImmutableBytesWritable key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {

        long total = 0;
        // Loop over the values
        while (values.iterator().hasNext()) {
            total += values.iterator().next().get();
        }

        // Put to HBase
        Put put = new Put(key.get());
        put.add(Bytes.toBytes("data"), Bytes.toBytes("total"),
                Bytes.toBytes(total));
        context.write(key, put);
    }
}
I had a similar job with HDFS only and it works fine.
Edited 18-06-2013: the college project finished successfully two years ago. For the job configuration (driver part), check the correct answer.
Here is the code which will solve your problem
Driver
HBaseConfiguration conf = HBaseConfiguration.create();
Job job = new Job(conf,"JOB_NAME");
job.setJarByClass(yourclass.class);
job.setMapperClass(yourMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(inputPath));
TableMapReduceUtil.initTableReducerJob(TABLE,
yourReducer.class, job);
job.setReducerClass(yourReducer.class);
job.waitForCompletion(true);
Mapper&Reducer
class yourMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // @Override map()
}

class yourReducer extends
        TableReducer<Text, IntWritable, ImmutableBytesWritable> {
    // @Override reduce()
}
Not sure why the HDFS version works: normally you have to set the input format for the job, and FileInputFormat is an abstract class. Perhaps you left some lines out, such as
job.setInputFormatClass(TextInputFormat.class);
The best and fastest way to bulk load data into HBase is to use HFileOutputFormat and the completebulkload utility.
You will find a sample code here:
Hope this will be useful :)
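The linked sample is not reproduced here; as a rough sketch of that flow (assuming the HBase 0.9x-era API with HFileOutputFormat.configureIncrementalLoad and LoadIncrementalHFiles; the table name, paths, and the MyDriver/MyPutMapper classes are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

Configuration conf = HBaseConfiguration.create();
Job job = Job.getInstance(conf, "hbase-bulk-load-prepare");
job.setJarByClass(MyDriver.class);
job.setMapperClass(MyPutMapper.class);                  // a mapper that emits <ImmutableBytesWritable, Put>
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);

HTable table = new HTable(conf, "my_table");
// sorts and partitions the output to match the table's regions and writes HFiles
HFileOutputFormat.configureIncrementalLoad(job, table);

FileInputFormat.addInputPath(job, new Path("/input"));
FileOutputFormat.setOutputPath(job, new Path("/hfiles"));
job.waitForCompletion(true);

// move the generated HFiles into the live table (same as the completebulkload tool)
LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
loader.doBulkLoad(new Path("/hfiles"), table);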
public void map(LongWritable key, Text value,
        Context context) throws IOException,
        InterruptedException {
Change this to ImmutableBytesWritable, IntWritable.
I am not sure... hope it works.