You might find this question somewhat of a duplicate, but I have done my research before posting. I also apologize for posting Java and Pig issues in a single thread, but I don't want to create a separate thread for the same problem.
I have a JSON file with some Twitter extracts. I'm trying to parse it with both Java MapReduce and Pig, but I'm facing issues with each. Below is the Java code I tried writing:
public class twitterDataStore {
    private static final ObjectMapper mapper = new ObjectMapper();

    public static class Map extends MapReduceBase implements
            Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(Object key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            try {
                // 'value' holds one line of the input file
                JSONObject jsonObj = new JSONObject(value.toString());
                // retweet_count is numeric in the tweet JSON, so avoid a blind String cast
                String text = String.valueOf(jsonObj.get("retweet_count"));
                StringTokenizer strToken = new StringTokenizer(text);
                while (strToken.hasMoreTokens()) {
                    word.set(strToken.nextToken());
                    output.collect(word, one);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }

    public static class Reduce extends MapReduceBase implements
            Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            // Placeholder logic: the original loop never advanced the iterator,
            // which would spin forever; just drain and print the values for now.
            while (values.hasNext()) {
                System.out.println(key + " = " + values.next());
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(twitterDataStore.class);
        conf.setJobName("twitterDataStore");
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
The issue, as you might have guessed by now, is that the parsing fails when I execute the jar, most probably because the JSON jar isn't included in the package. I tried following the information provided here: "parsing json input in hadoop java", but I couldn't get any of the options to work: whatever @Tejas Patil suggested and @Fraz tried, nothing worked for my problem. I'll also paste the warning that gets generated, for reference:
14/04/14 21:09:22 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
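Incidentally, that warning is about the Tool/GenericOptionsParser pattern. A minimal sketch of the driver adapted that way (the class name TwitterDataStoreDriver is just for illustration) would also let the JSON jar be shipped to the task classpath with -libjars instead of bundling it:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Sketch only: same job setup as above, wrapped in Tool so that generic options
// such as -libjars /path/to/json.jar are parsed and applied for you.
public class TwitterDataStoreDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), twitterDataStore.class);
        conf.setJobName("twitterDataStore");
        conf.setMapperClass(twitterDataStore.Map.class);
        conf.setReducerClass(twitterDataStore.Reduce.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips the -libjars/-D options before handing args to run()
        System.exit(ToolRunner.run(new Configuration(), new TwitterDataStoreDriver(), args));
    }
}
It would then be launched with something like hadoop jar twitter-mr.jar TwitterDataStoreDriver -libjars /path/to/json.jar <input> <output>.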
Coming to the Pig (version 0.11) loading, I wrote a JsonLoader LOAD statement to load my tweet data:
data = LOAD '/tmp/twitter.txt' using JsonLoader('in_reply_to_screen_name:chararray,text:chararray,id_str:long,place:chararray,in_reply_to_status_id:chararray, contributors:chararray,retweet_count:CHARARRAY,favorited:chararray,truncated:chararray,source:chararray,in_reply_to_status_id_str:chararray,created_at:chararray, in_reply_to_user_id_str:chararray,in_reply_to_user_id:chararray,user:{(lang:chararray,profile_background_image_url:chararray,id_str:long,default_profile_image: chararray,statuses_count:chararray,profile_link_color:chararray,favourites_count:chararray,profile_image_url_https:chararray,following:chararray, profile_background_color:chararray,description:chararray,notifications:chararray,profile_background_tile:chararray,time_zone:chararray, profile_sidebar_fill_color:chararray,listed_count:chararray,contributors_enabled:chararray,geo_enabled:chararray,created_at:chararray,screen_name:chararray, follow_request_sent:chararray,profile_sidebar_border_color:chararray,protected:chararray,url:chararray,default_profile:chararray,name:chararray, is_translator:chararray,show_all_inline_media:chararray,verified:chararray,profile_use_background_image:chararray,followers_count:chararray,profile_image_url:chararray, id:long,profile_background_image_url_https:chararray,utc_offset:chararray,friends_count:chararray,profile_text_color:chararray,location:chararray)},retweeted:chararray, id:long,coordinates:chararray,geo:chararray');
Sorry for pasting so much here, but I don't want to miss anything in the explanation.
I was facing issues when declaring some of the fields as integers, but once I converted all the integer fields to chararray, the statement passed the check. The error I'm getting here is:
2014-04-14 21:19:24,977 [JobControl] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2014-04-14 21:19:24,982 [JobControl] WARN org.apache.hadoop.mapred.JobClient - No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
It's the same parsing issue. I tried registering the JSON jar before this LOAD, but I still get the same problem. Can anyone help me resolve this?
Thanks in advance.
-Adil
You are using TextInputFormat, so when you call value.toString() you get a single line as input, not the whole JSON object.
I recommend you use this:
https://github.com/alexholmes/json-mapreduce#an-inputformat-to-work-with-splittable-multi-line-json
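That InputFormat is for JSON objects that span multiple lines. If your file happens to contain one complete tweet per line instead, a minimal sketch of the mapper (staying with the old mapred API and the org.json parser from the question; the class name and counter are just for illustration) could look like this:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.json.JSONException;
import org.json.JSONObject;

public class TweetLineMapper extends MapReduceBase
        implements Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {
        try {
            // 'value' is assumed to be one complete tweet serialized on a single line
            JSONObject tweet = new JSONObject(value.toString());
            String retweetCount = String.valueOf(tweet.opt("retweet_count"));
            StringTokenizer tokens = new StringTokenizer(retweetCount);
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                output.collect(word, one);
            }
        } catch (JSONException e) {
            // Malformed line: count it and skip rather than failing the whole task
            reporter.getCounter("TweetLineMapper", "BAD_JSON").increment(1);
        }
    }
}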
Related
I don't know how to use the MultipleOutputs class correctly. I'm using it to create multiple output files. The following is my driver class's code snippet:
Configuration conf = new Configuration();
Job job = Job.getInstance(conf);
job.setJarByClass(CustomKeyValueTest.class);//class with mapper and reducer
job.setOutputKeyClass(CustomKey.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(CustomKey.class);
job.setMapOutputValueClass(CustomValue.class);
job.setMapperClass(CustomKeyValueTestMapper.class);
job.setReducerClass(CustomKeyValueTestReducer.class);
job.setInputFormatClass(TextInputFormat.class);
Path in = new Path(args[1]);
Path out = new Path(args[2]);
out.getFileSystem(conf).delete(out, true);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);
MultipleOutputs.addNamedOutput(job, "islnd" , TextOutputFormat.class, CustomKey.class, Text.class);
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
MultipleOutputs.setCountersEnabled(job, true);
boolean status = job.waitForCompletion(true);
and in the reducer, I used MultipleOutputs like this:
private MultipleOutputs<CustomKey, Text> multipleOutputs;

@Override
public void setup(Context context) throws IOException, InterruptedException {
    multipleOutputs = new MultipleOutputs<>(context);
}

@Override
public void reduce(CustomKey key, Iterable<CustomValue> values, Context context) throws IOException, InterruptedException {
    ...
    multipleOutputs.write("islnd", key, pop, key.toString());
    //context.write(key, pop);
}

public void cleanup() throws IOException, InterruptedException {
    multipleOutputs.close();
}
}
When I use context.write, I get output files with data in them. But when I remove context.write, the output files are empty. I don't want to call context.write, because it creates the extra part-r-00000 file. As stated here (last paragraph in the description of the class), I used LazyOutputFormat to avoid the part-r-00000 file. But it still didn't work.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
This means: if you are not creating any output, don't create empty files.
Can you please look at the Hadoop counters and check:
1. map.output.records
2. reduce.input.groups
3. reduce.input.records
to verify whether your mappers are sending any data to the reducers (a sketch for reading these from the driver follows below).
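For reference, a minimal sketch of reading those counters from the driver once the job has finished (assuming the new-API Job object from the question's driver code):
// Uses org.apache.hadoop.mapreduce.Counters and TaskCounter; run after job.waitForCompletion(true):
Counters counters = job.getCounters();
long mapOutputRecords   = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
long reduceInputGroups  = counters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue();
long reduceInputRecords = counters.findCounter(TaskCounter.REDUCE_INPUT_RECORDS).getValue();
System.out.println("map output records   = " + mapOutputRecords);
System.out.println("reduce input groups  = " + reduceInputGroups);
System.out.println("reduce input records = " + reduceInputRecords);
// If the map output records counter is already 0, the problem is upstream of MultipleOutputs.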
A working example of MultipleOutputs, with an integration test, is here:
http://bytepadding.com/big-data/map-reduce/multipleoutputs-in-map-reduce/
I have set up a Hadoop job like so:
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Legion");
job.setJarByClass(Legion.class);
job.setMapperClass(CallQualityMap.class);
job.setReducerClass(CallQualityReduce.class);
// Explicitly configure map and reduce outputs, since they're different classes
job.setMapOutputKeyClass(CallSampleKey.class);
job.setMapOutputValueClass(CallSample.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(Text.class);
job.setInputFormatClass(CombineRepublicInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
CombineRepublicInputFormat.setInputDirRecursive(job, true);
CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
This job completes, but something strange happens. I get one output line per input line. Each output line consists of the output from a CallSampleKey.toString() method, then a tab, then something like CallSample#17ab34d.
This means that the reduce phase is never running and the CallSampleKey and CallSample are getting passed directly to the TextOutputFormat. But I don't understand why this would be the case. I've very clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!
Edit: Here's the code for the reducer:
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
while (inValues.hasNext()) {
call.addSample(inValues.next());
}
context.write(NullWritable.get(), new Text(call.getStats()));
}
}
What if you try to change your
public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
to use Iterable instead of Iterator?
public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {
You'll have to then use inValues.iterator() to get the actual iterator.
If the method signature doesn't match, then it just falls through to the default identity reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make it easy to detect this kind of typo, but the next best thing is to always use @Override on all methods you intend to override so that the compiler can help.
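For completeness, a minimal sketch of the reducer with the corrected signature and @Override (CallSampleKey, CallSample, and Call are the classes from the question):
public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {
    @Override
    protected void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context)
            throws IOException, InterruptedException {
        Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());
        // Iterable matches the framework's signature, so this method actually overrides reduce()
        for (CallSample sample : inValues) {
            call.addSample(sample);
        }
        context.write(NullWritable.get(), new Text(call.getStats()));
    }
}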
Driver code:
public class WcDriver {
public static void main(String[] args) throws IOException,
InterruptedException, ClassNotFoundException {
Configuration conf = new Configuration();
Job job = new Job(conf, "WcDriver");
job.setJarByClass(WcDriver.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(WcMapper.class);
job.setReducerClass(WcReducer.class);
job.waitForCompletion(true);
}
}
Reducer code
public class WcReducer extends Reducer<Text, LongWritable, Text, String> {
    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        String key1 = null;
        int total = 0;
        for (LongWritable value : values) {
            total += value.get();
            key1 = key.toString();
        }
        context.write(new Text(key1), "ABC");
    }
}
Here, in the driver class I have set job.setOutputKeyClass(Text.class) and job.setOutputValueClass(LongWritable.class), but in the reducer class I am writing a plain String: context.write(new Text(key1), "ABC");. I think there should be an error while running the program, because the output types do not match, and also the reducer's key should implement WritableComparable and the value should implement Writable. Strangely, this program runs fine. I do not understand why there is no exception.
Try this:
// job.setOutputFormatClass(TextOutputFormat.class);
// comment out this line and you will surely get a casting exception.
This is because TextOutputFormat assumes LongWritable as the key and Text as the value: if you do not define the output format class, the job expects the default Writable behaviour, but if you do mention it, it will implicitly cast to the given type.
Try this:
//job.setOutputValueClass(LongWritable.class); // if you comment out this line you get an error
This setting only declares the key/value pair types; by default it depends on the output format, and here that is text, so it does not give any error.
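To see why no ClassCastException shows up, it helps to look at what a text-style record writer effectively does with each pair; a simplified sketch (not the actual Hadoop source, just the gist of TextOutputFormat's behaviour) is:
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified sketch of a line-oriented record writer:
// it never casts to the classes declared with setOutputKeyClass/setOutputValueClass,
// it just writes key.toString() TAB value.toString(), which is why
// new Text(key1) and the plain String "ABC" both pass without an exception.
public class SketchLineWriter<K, V> {
    private final DataOutputStream out;

    public SketchLineWriter(DataOutputStream out) {
        this.out = out;
    }

    public void write(K key, V value) throws IOException {
        out.write(key.toString().getBytes("UTF-8"));
        out.write('\t');
        out.write(value.toString().getBytes("UTF-8"));
        out.write('\n');
    }
}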
I am working with Hadoop 0.20 and I want to have two reduce output files instead of one. I know that MultipleOutputFormat doesn't work in Hadoop 0.20, so I added the hadoop-core-1.1.1 jar to the build path of my project in Eclipse, but it still shows the error below.
Here is my code:
public static class ReduceStage extends Reducer<IntWritable, BitSetWritable, IntWritable, Text>
{
private MultipleOutputs mos;
public ReduceStage() {
System.out.println("ReduceStage");
}
public void setup(Context context) {
mos = new MultipleOutputs(context);
}
public void reduce(final IntWritable key, final Iterable<BitSetWritable> values, Context output ) throws IOException, InterruptedException
{
mos.write("text1", key, new Text("Hello"));
}
public void cleanup(Context context) throws IOException {
try {
mos.close();
} catch (InterruptedException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
And in the run():
FileOutputFormat.setOutputPath(job, ConnectedComponents_Nodes);
job.setOutputKeyClass(MultipleTextOutputFormat.class);
MultipleOutputs.addNamedOutput(job, "text1", TextOutputFormat.class,
IntWritable.class, Text.class);
The error is:
java.lang.NoSuchMethodError: org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.setOutputName(Lorg/apache/hadoop/mapreduce/JobContext;Ljava/lang/String;)V
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.getRecordWriter(MultipleOutputs.java:409)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:370)
at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:348)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:179)
at bitsetmr$ReduceStage.reduce(bitsetmr.java:1)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
What can I do to get MultipleOutputFormat working? Did I use the code correctly?
You could go for an overridden extension of MultipleTextOutputFormat, making all the contents of the record part of the 'value' while making the file name or path the key, as in the sketch below.
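A minimal sketch of that approach with the old mapred API (the class name is just for illustration, and the Text key is assumed to already hold the target file name):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Routes each record to the file named by its key and writes only the value.
public class KeyAsFileNameOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    @Override
    protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // The key carries the target file name, e.g. "filename1"
        return key.toString();
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Returning null means only the value is written to the file
        return null;
    }
}
It would then be set on the old-API job configuration with conf.setOutputFormat(KeyAsFileNameOutputFormat.class).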
Alternatively, there is the oddjob library, which has a range of OutputFormat implementations. The one you want is MultipleLeafValueOutputFormat: it writes to the file specified by the key, and writes only the value.
Now, say you have to write the following pairs, and your separator is the tab character ('\t'):
<"key1","value1"> (you want this to be written in filename1)
<"key2","value2"> (you want this to be written in filename2)
So the output from the reducer would be transformed into the following:
<"filename1","key1\tvalue1">
<"filename2","key2\tvalue2">
Also, don't forget that the class defined above should be set as the output format class on the job:
conf.setOutputFormat(MultipleLeafValueOutputFormat.class);
One thing to note here is that you will need to work with the old mapred package rather than the mapreduce package. But that shouldn't be a problem.
Firstly, you should make sure FileOutputFormat.setOutputName has the same code between versions 0.20 and 1.1.1. If not, you must use a compatible version to compile your code. If it is the same, there must be some parameter error in your command.
I encountered the same issue. I removed -Dmapreduce.user.classpath.first=true from the run command and it works. Hope that helps!
I'm attempting to make a multithreaded web parser, since by nature there is downtime while the program retrieves each document. The idea is that my threads retrieve URLs from a URL pile. This tripled the speed of the program when I ran it on EMR with medium instances, but on large instances I got out-of-memory errors. Do I just need fewer threads, or is the number of threads less strictly controlled than I think it is? Here is my mapper:
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    private Text word = new Text();
    private URLPile pile = new URLPile();

    @Override
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) {
        // non-English encoding list; all others are considered English to avoid missing any
        String url = value.toString();
        StringTokenizer urls = new StringTokenizer(url);
        Config.LoggerProvider = LoggerProvider.DISABLED;
        // Store the workers in the array so they can be referenced later;
        // the original for-each loop only assigned to its loop variable.
        MyThread[] threads = new MyThread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new MyThread(output, pile);
            threads[i].start();
        }
        while (urls.hasMoreTokens()) {
            try {
                if (urls.hasMoreTokens()) {
                    word.set(urls.nextToken());
                    String currenturl = word.toString();
                    pile.addUrl(currenturl);
                } else {
                    // Note: unreachable, since the outer while already checked hasMoreTokens()
                    System.out.println("out of tokens");
                    pile.waitTillDone();
                }
            } catch (Exception e) {
                /*
                 */
                continue;
            }
        }
    }
}