running hadoop wordCount example with groovy - java

I was trying to run the WordCount example with Groovy using this, but encountered an error:
Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
I found this for the above error, but could not locate a pom.xml file in my setup.
Then I came across this. How do we run this in Hadoop? Is it by making a jar file and running it the same way as the Java example (which ran fine)?
What is the difference between running a Groovy example using groovy-hadoop, using this file (not sure how to run it), and using hadoop-streaming? Why would we use one method over the others?
I've installed Hadoop 2.7.1 on Mac OS X 10.10.3.

I was able to run this Groovy file with Hadoop 2.7.1.
The procedure I followed was:
1. Install Gradle.
2. Generate the jar file using Gradle. I asked this question, which helped me build the dependencies in Gradle (a sketch of the kind of build file involved is shown after this list).
3. Run it with Hadoop just as we run a Java jar file, using this command from the folder where the jar is located:
hadoop jar buildSrc-1.0.jar in1 out4
where in1 is the input file and out4 is the output folder in HDFS.
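For reference, a minimal sketch of the kind of build.gradle this involves (the Groovy and Hadoop versions and the packaging of groovy-all into the jar are assumptions on my part; the exact file depends on the linked question):

apply plugin: 'groovy'

version = '1.0'

repositories { mavenCentral() }

dependencies {
    // assumed versions; match them to your installation
    compile 'org.codehaus.groovy:groovy-all:2.4.4'
    compile 'org.apache.hadoop:hadoop-client:2.7.1'
}

jar {
    // Bundle the Groovy runtime into the job jar; the cluster already provides
    // the Hadoop jars, but not Groovy itself.
    from { configurations.compile.filter { it.name.contains('groovy-all') }.collect { it.isDirectory() ? it : zipTree(it) } }
}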
EDIT: As the above link is broken, I am pasting the Groovy file here.
import StartsWithCountMapper
import StartsWithCountReducer
import org.apache.hadoop.conf.Configured
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IntWritable
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.Mapper
import org.apache.hadoop.mapreduce.Reducer
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.hadoop.util.Tool
import org.apache.hadoop.util.ToolRunner

class CountGroovyJob extends Configured implements Tool {
    @Override
    int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "StartsWithCount")
        job.setJarByClass(getClass())

        // configure output and input source
        TextInputFormat.addInputPath(job, new Path(args[0]))
        job.setInputFormatClass(TextInputFormat)

        // configure mapper and reducer
        job.setMapperClass(StartsWithCountMapper)
        job.setCombinerClass(StartsWithCountReducer)
        job.setReducerClass(StartsWithCountReducer)

        // configure output
        TextOutputFormat.setOutputPath(job, new Path(args[1]))
        job.setOutputFormatClass(TextOutputFormat)
        job.setOutputKeyClass(Text)
        job.setOutputValueClass(IntWritable)

        return job.waitForCompletion(true) ? 0 : 1
    }

    static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new CountGroovyJob(), args))
    }

    class GroovyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable countOne = new IntWritable(1);
        private final Text reusableText = new Text();

        @Override
        protected void map(LongWritable key, Text value, Mapper.Context context) {
            value.toString().tokenize().each {
                reusableText.set(it)
                context.write(reusableText, countOne)
            }
        }
    }

    class GroovyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer.Context context) {
            outValue.set(values.collect({ it.value }).sum())
            context.write(key, outValue);
        }
    }
}

The library you are using, groovy-hadoop, says it supports Hadoop 0.20.2. That is really old.
But the CountGroovyJob.groovy code you are trying to run looks like it is meant to run on version 2.x of Hadoop.
I can see this because the imports use packages such as org.apache.hadoop.mapreduce.Mapper, whereas before version 2 it was called org.apache.hadoop.mapred.Mapper.
The most upvoted answer in the SO question you linked is probably the answer you needed. You have an incompatibility problem: the groovy-hadoop library can't work with your Hadoop 2.7.1.
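To make the difference concrete, here is a sketch of what a mapper looks like against the old 0.20-era "mapred" API that groovy-hadoop targets (the class name and the tokenizing logic are illustrative, not taken from either library):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // In the old API, Mapper is an interface rather than an abstract class,
    // and map() writes through an OutputCollector instead of a Context.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            output.collect(word, one);
        }
    }
}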

Hadoop Java Class cannot be found

Exception in thread "main" java.lang.ClassNotFoundException: WordCount. So many answers relate to this issue, and it seems like I am again missing a small detail, which has taken me hours to figure out.
I will try to be as clear as possible about the paths, the code itself, and the other possible solutions I tried that did not work.
I am fairly sure Hadoop is configured correctly, as everything was working up until the last stage.
But I am still posting the details:
Environment variables and paths
#HADOOP VARIABLES START
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-oracle/lib/tools.jar
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
#export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
#export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
#HADOOP VARIABLES END
The Class itself:
package com.cloud.hw03;

/**
 * Hello world!
 *
 */
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setJarByClass(WordCount.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
What I did for compiling and running:
I created the jar file in the same folder as my WordCount Maven project (eclipse-workspace):
$ hadoop com.sun.tools.javac.Main WordCount.java
$ jar cf WordCount.jar WordCount*.class
Running the program (I had already created a directory and copied the input and output files into HDFS):
hadoop jar WordCount.jar WordCount /input/inputfile01 /input/outputfile01
The result is: Exception in thread "main" java.lang.ClassNotFoundException: WordCount
Since I am in the same directory as WordCount.class, and I created my jar file in that same directory, I am not specifying the full path to WordCount; that is why I am running the 2nd command above in this directory.
I already added job.setJarByClass(WordCount.class); to the code, so that did not help. I would appreciate your taking the time to answer!
I am sure I am doing something unexpected again and have not been able to figure it out for four hours.
The WordCount example code on the Hadoop site does not use a package.
Since you do have one, you would run the fully qualified class, the exact same way as for a regular Java application:
hadoop jar WordCount.jar com.cloud.hw03.WordCount
Also, if you actually have a Maven project, then hadoop com.sun.tools.javac.Main is not correct. You would use Maven to compile and create the JAR with all the classes, not only the WordCount* files.
For example, from the folder with the pom.xml:
mvn package
Otherwise, you need to be in the parent directory:
hadoop com.sun.tools.javac.Main ./com/cloud/hw03/WordCount.java
And run the jar cf command from that directory as well, as shown below.
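For example, something like this (assuming the compiled .class files end up next to the sources under ./com/cloud/hw03/, which is where javac puts them when no -d flag is given):
jar cf WordCount.jar com/cloud/hw03/*.class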

Run WordCount example map reduce on AWS EMR

I am trying to run the word count example on AWS EMR; however, I am having a hard time deploying and running the jar on the cluster. It's a customized word count example where I have used some JSON parsing. The input is in my S3 bucket. When I try to run my job on the EMR cluster, I get an error saying that the main function was not found in my Mapper class. Everywhere on the internet, the code for the word count example MapReduce job has the same shape: three classes, a static mapper class that extends Mapper, a reducer that extends Reducer, and a main class that contains the job configuration, so I am not sure why I am seeing this error. I build my code using the Maven assembly plugin so as to bundle all the third-party dependencies in my JAR. Here is the code I have written:
package com.amalwa.hadoop.MapReduce;

import java.io.IOException;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import com.google.gson.Gson;

public class ETL {

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: ETL <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        Job job = new Job(conf, "etl");
        job.setJarByClass(ETL.class);

        job.setMapperClass(JsonParserMapper.class);
        job.setReducerClass(JsonParserReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(TweetArray.class);

        FileInputFormat.addInputPath(job, new Path(args[1]));
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        job.waitForCompletion(true);
    }

    public static class JsonParserMapper extends Mapper<LongWritable, Text, Text, Text> {
        private Text mapperKey = null;
        private Text mapperValue = null;
        Date filterDate = getDate("Sun Apr 20 00:00:00 +0000 2014");

        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String jsonString = value.toString();
            if (!jsonString.isEmpty()) {
                @SuppressWarnings("unchecked")
                Map<String, Object> tweetData = new Gson().fromJson(jsonString, HashMap.class);
                Date timeStamp = getDate(tweetData.get("created_at").toString());
                if (timeStamp.after(filterDate)) {
                    @SuppressWarnings("unchecked")
                    com.google.gson.internal.LinkedTreeMap<String, Object> userData = (com.google.gson.internal.LinkedTreeMap<String, Object>) tweetData.get("user");
                    mapperKey = new Text(userData.get("id_str") + "~" + tweetData.get("created_at").toString());
                    mapperValue = new Text(tweetData.get("text").toString() + " tweetId = " + tweetData.get("id_str"));
                    context.write(mapperKey, mapperValue);
                }
            }
        }

        public Date getDate(String timeStamp) {
            SimpleDateFormat simpleDateFormat = new SimpleDateFormat("E MMM dd HH:mm:ss Z yyyy");
            Date date = null;
            try {
                date = simpleDateFormat.parse(timeStamp);
            } catch (ParseException e) {
                e.printStackTrace();
            }
            return date;
        }
    }

    public static class JsonParserReducer extends Reducer<Text, Text, Text, TweetArray> {
        private ArrayList<Text> tweetList = new ArrayList<Text>();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            for (Text val : values) {
                tweetList.add(new Text(val.toString()));
            }
            context.write(key, new TweetArray(Text.class, tweetList.toArray(new Text[tweetList.size()])));
        }
    }
}
If someone can clarify this problem, it would be really nice. I have deployed this jar on my local machine, on which I installed Hadoop, and it works fine, but when I set up my cluster using AWS and give the streaming job all the parameters, it doesn't work. Here is a screenshot of my configuration:
The Mapper textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserMapper
The Reducer textbox is set to: java -classpath MapReduce-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.amalwa.hadoop.MapReduce.JsonParserReducer
Thanks and regards.
You need to select a Custom JAR step instead of a Streaming program.
When you make the jar file (I usually do it using Eclipse or a custom Gradle build), check whether your main class is set to ETL. Apparently, that does not happen by default. Also check the Java version you are using on your system; I think AWS EMR works with up to Java 7.
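For the "custom gradle build" route, the relevant part of a build.gradle might look like this (a sketch only; the main class name follows the code in the question, everything else about the build is assumed):

apply plugin: 'java'

jar {
    manifest {
        // Record the entry point in the jar manifest so EMR can find it.
        attributes 'Main-Class': 'com.amalwa.hadoop.MapReduce.ETL'
    }
}

With the Maven assembly plugin, the equivalent is setting the mainClass in the plugin's archive manifest configuration.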

Crawljax - Required jar files for dynamic webpage crawling

I am trying to crawl JavaScript webpages (content present within an IFrame HTML tag) using Crawljax. I have added the slf4j, Crawljax 2.1, and Guava 18.0 jars to the application.
Error Message displayed in popup:
cannot find symbol
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
  symbol:   class CrawljaxConfigurationBuilder
  location: class CrawljaxConfiguration
Code:
import com.crawljax.core.CrawlerContext;
import com.crawljax.core.CrawljaxRunner;
import com.crawljax.core.configuration.CrawljaxConfiguration;
import com.crawljax.core.configuration.CrawljaxConfiguration.CrawljaxConfigurationBuilder;
import com.crawljax.core.plugin.OnNewStatePlugin;
import com.crawljax.core.state.StateVertex;

public class CrawljaxExamples {

    public static void main(String[] args) {
        CrawljaxConfigurationBuilder builder
                = CrawljaxConfiguration.builderFor("http://help.syncfusion.com/ug/wpf/default.htm#!documents/overview.htm");

        builder.addPlugin(new OnNewStatePlugin() {
            @Override
            public void onNewState(CrawlerContext context, StateVertex newState) {
            }

            @Override
            public String toString() {
                return "Our example plugin";
            }
        });

        CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
        crawljax.call();
    }
}
Error Message:
java.lang.ExceptionInInitializerError
Caused by: java.lang.RuntimeException: Uncompilable source code - cannot find symbol
symbol: class CrawljaxConfigurationBuilder
location: class com.crawljax.core.configuration.CrawljaxConfiguration
at crawljaxexamples.CrawljaxExamples.<clinit>(CrawljaxExamples.java:12)
Exception in thread "main" Java Result: 1
The same code can be found at the link below:
https://github.com/crawljax/crawljax/blob/master/examples/src/main/java/com/crawljax/examples/PluginExample.java
Can someone please tell me which jar files are required to run this program? Or are there any settings that need to be changed in the IDE?
Thanks
It seems you are using an old version of Crawljax.
Download the latest version, crawljax-cli-3.5.1.zip.
Add all the jars from the lib folder, plus crawljax-cli-3.5.1.jar from the main folder, to the library path.
Tested, and it works well now.
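If you would rather manage this with a build tool than by copying jars around, depending on crawljax-core should pull in CrawljaxConfigurationBuilder and its transitive dependencies. A sketch (the Maven coordinates are assumed from the 3.5.1 release named above; verify them against Maven Central):

apply plugin: 'java'

repositories { mavenCentral() }

dependencies {
    // assumed coordinates/version; check Maven Central for the current release
    compile 'com.crawljax:crawljax-core:3.5.1'
}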

Python Import Error - Cannot import name

I am brand new to Python & Jython. I am going through a simple tutorial, am struggling with the basics, and am hoping for some insight.
I created a PyDev project called 'PythonTest'. In it I created a module called test (test.py), and the code looks like this:
class test():
    def __init__(self, name, number):
        self.name = name
        self.number = number

    def getName(self):
        return self.name

    def getNumber(self):
        return self.number
I then created a Java project called pythonJava and in it created three classes.
ITest.java, which looks like this:
package com.foo.bar;

public interface ITest {
    public String getName();
    public String getNumber();
}
TestFactory.java, which looks like this:
package com.ngc.metro;

import org.python.core.PyObject;
import org.python.core.PyString;
import org.python.util.PythonInterpreter;

public class TestFactory {
    private final PyObject testClass;

    public TestFactory() {
        PythonInterpreter interpreter = new PythonInterpreter();
        interpreter.exec("from test import test");
        testClass = interpreter.get("test");
    }

    public ITest create(String name, String number) {
        PyObject testObject = testClass.__call__(new PyString(name),
                new PyString(name), new PyString(number));
        return (ITest) testObject.__tojava__(ITest.class);
    }
}
And finally Main.java:
public class Main {

    private static void print(ITest testInterface) {
        System.out.println("Name: " + testInterface.getName());
        System.out.println("Number: " + testInterface.getNumber());
    }

    public static void main(String[] args) {
        TestFactory factory = new TestFactory();
        print(factory.create("BUILDING-A", "1"));
        print(factory.create("BUILDING-B", "2"));
        print(factory.create("BUILDING-C", "3"));
    }
}
When I run Main.java I get the following error:
Exception in thread "main" Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: cannot import name test
Can someone advise me on what I'm doing wrong? I was under the impression I needed two imports: one for the module (test.py) and one for the class "test".
EDIT 1:
To head off the easy question about my sys.path, here is what I get from IDLE:
Python 3.3.3 (v3.3.3:c3896275c0f6, Nov 18 2013, 21:18:40) [MSC v.1600 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import sys
>>> print(sys.path)
['', 'C:\\Python33\\Lib\\idlelib', 'C:\\Python33', 'C:\\Python33\\Lib', 'C:\\Python33\\DLLs', 'C:\\workspace\\myProject\\src', 'C:\\Windows\\system32\\python33.zip', 'C:\\Python33\\lib\\site-packages']
>>> from test import test
>>> t = test.test(1,2)
>>> t.getName()
1
Actually, it seems like a PYTHONPATH issue... as you're getting that output from IDLE, I can't say how things actually stand in your Java environment (IDLE is using Python, but in your run you should be using Java+Jython -- in which case you're probably not using the proper sys.path for Jython -- at least I don't see any place in the code above where the PYTHONPATH is set to include the path to your .py files).
Also, if you're doing Java+Jython, see the notes at the end of http://pydev.org/manual_101_project_conf2.html for configuring the project in PyDev.
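As a sketch of one way to make test.py visible without touching environment variables, you can extend sys.path from inside the interpreter before the import (this reuses only the exec/get calls already shown above; the class name and source-folder path are hypothetical placeholders):

import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class TestFactory2 {
    private final PyObject testClass;

    public TestFactory2() {
        PythonInterpreter interpreter = new PythonInterpreter();
        // Point Jython at the folder that actually contains test.py
        // (placeholder path -- adjust to your PyDev project's source folder).
        interpreter.exec("import sys");
        interpreter.exec("sys.path.append('C:/workspace/PythonTest/src')");
        interpreter.exec("from test import test");
        testClass = interpreter.get("test");
    }
}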

How to use S3DistCp in java code

I want to copy the output of a job from the EMR cluster to Amazon S3 programmatically.
How do I use S3DistCp in Java code to do this?
Hadoop's ToolRunner can run this, since S3DistCp extends Tool.
Below is a usage example:
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.util.ToolRunner;

import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

public class CustomS3DistCP {
    private static final Log log = LogFactory.getLog(CustomS3DistCP.class);

    public static void main(String[] args) throws Exception {
        log.info("Running with args: " + args);
        System.exit(ToolRunner.run(new S3DistCp(), args));
    }
}
You have to have the s3distcp jar on your classpath.
You can call this program from a shell script.
Hope that helps!
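If you want to drive it from your own code rather than passing main()'s arguments straight through, a variant of the same idea is to build the argument array yourself (the --src and --dest values below are placeholders; check the s3distcp options available in your EMR version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

import com.amazon.external.elasticmapreduce.s3distcp.S3DistCp;

public class CopyJobOutputToS3 {
    public static void main(String[] args) throws Exception {
        // Placeholder paths: replace with your job's HDFS output and target S3 location.
        String[] s3DistCpArgs = new String[] {
                "--src",  "hdfs:///user/hadoop/myjob-output",
                "--dest", "s3://my-bucket/myjob-output"
        };
        System.exit(ToolRunner.run(new Configuration(), new S3DistCp(), s3DistCpArgs));
    }
}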
