I am trying to attach a custom (Java) partitioner to my MapReduce streaming job. I am using this command:
../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ./NumericPartitioner.jar -D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6
The important bit of that is the file NumericPartitioner.jar, which resides in the same folder the command is run from (one level down from the Hadoop root installation). Here is its code:
package newjoin;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class NumericPartitioner extends Partitioner<Text, Text>
{
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks)
    {
        // Route each record by the numeric first token of its key.
        return Integer.parseInt(key.toString().split("\\s")[0]) % numReduceTasks;
    }
}
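The logic itself is straightforward; as a sanity check, here is a minimal sketch (my addition, runnable outside Hadoop) of what getPartition returns for sample keys. It assumes the first whitespace-separated token of each key is numeric:

// A main method you could drop into NumericPartitioner for a quick local check.
public static void main(String[] args) {
    NumericPartitioner partitioner = new NumericPartitioner();
    Text empty = new Text("");
    System.out.println(partitioner.getPartition(new Text("7\tvalue"), empty, 36));  // 7
    System.out.println(partitioner.getPartition(new Text("43 rest"), empty, 36));   // 43 % 36 = 7
}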
And yet, when I try to run the above command, I get:
-partitioner : class not found : newjoin.NumericPartitioner
Streaming Command Failed!
What's going on here, and how can I get MapReduce to find my partitioner?
The -libjars option makes your third-party JARs available to the remote map and reduce task JVMs.
But to make those same JARs available to the client JVM (the JVM created when you run the hadoop jar command), you need to add them to the HADOOP_CLASSPATH variable:
$ export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:./NumericPartitioner.jar
../bin/hadoop jar ../contrib/streaming/hadoop-streaming-1.2.1.jar \
-libjars ./NumericPartitioner.jar -D mapred.map.tasks=12 -D mapred.reduce.tasks=36 \
-input /input -output /output/keys -mapper "map_threeJoin.py" -reducer "keycount.py" \
-partitioner newjoin.NumericPartitioner -file "map_threeJoin.py" \
-cmdenv b_size=6 -cmdenv c_size=6
I'm having a problem running a Java-written Spark application on AWS EMR.
Locally, everything runs fine. When I submit a job to EMR, I always get "Completed" within 20 seconds even though the job should take minutes. No output is produced and no log messages are printed.
I'm still confused as to whether it should be run as a Spark application or as the CUSTOM_JAR type.
Here is my main method:
public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession
        .builder()
        .appName("RandomName")
        .getOrCreate();

    // process stuff
    String from_path = args[0];
    String to_path = args[1];

    Dataset<String> dataInput = spark.read().json(from_path).toJSON();
    // convertData maps each JSON row to a ResultingClass instance (not included here)
    JavaRDD<ResultingClass> map = dataInput.toJavaRDD().map(row -> convertData(row));
    Dataset<Row> dataFrame = spark.createDataFrame(map, ResultingClass.class);

    dataFrame
        .repartition(1)
        .write()
        .mode(SaveMode.Append)
        .partitionBy("year", "month", "day", "hour")
        .parquet(to_path);

    spark.stop();
}
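Note that createDataFrame(rdd, beanClass) derives the schema from the bean's getters, so the columns named in partitionBy must exist as bean properties. ResultingClass itself is not included in the question; a hypothetical sketch of the shape it would need:

// Hypothetical stand-in for the real ResultingClass (not shown in the question).
// The year/month/day/hour properties back the partitionBy(...) columns above.
public class ResultingClass implements java.io.Serializable {
    private int year, month, day, hour;
    private String payload; // placeholder for the remaining data fields

    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }
    public int getMonth() { return month; }
    public void setMonth(int month) { this.month = month; }
    public int getDay() { return day; }
    public void setDay(int day) { this.day = day; }
    public int getHour() { return hour; }
    public void setHour(int hour) { this.hour = hour; }
    public String getPayload() { return payload; }
    public void setPayload(String payload) { this.payload = payload; }
}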
I've tried these:
1)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=Spark,Name=MyApp,Args=[--deploy-mode,cluster,--master,yarn, \
--conf,spark.yarn.submit.waitAppCompletion=false, \
--class,com.my.class.with.main.Foo,s3://mybucket/script.jar, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE --region us-west-2 --profile default
Completes in 15 seconds without errors, output, or any of the log messages I've added.
2)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[--deploy-mode,cluster, \
--conf,spark.yarn.submit.waitAppCompletion=true, \
s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
This reads the parameters wrongly: it sees --deploy-mode as the first application argument and cluster as the second, instead of the buckets.
3)
aws emr add-steps --cluster-id j-XXXXXXXXX --steps \
Type=CUSTOM_JAR, \
Jar=s3://mybucket/script.jar, \
MainClass=com.my.class.with.main.Foo, \
Name=MyApp, \
Args=[s3://partitioned-input-data/*/*/*/*/*.txt, \
s3://output-bucket/table-name], \
ActionOnFailure=CONTINUE \
--region us-west-2 --profile default
I get this: Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.SparkSession
When I include all the dependencies in the JAR (which I do not need to do locally),
I get: Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
I do not want to hardcode "yarn" into the app.
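One way to avoid that without hardcoding anything: only fall back to a local master when spark-submit has not supplied one. A minimal sketch (my addition, relying on spark-submit setting the spark.master system property, which SparkConf picks up):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

// Sketch: use the master provided by spark-submit/EMR when present,
// and fall back to local[*] only for IDE runs and tests.
SparkSession.Builder builder = SparkSession.builder().appName("RandomName");
if (!new SparkConf().contains("spark.master")) {
    builder = builder.master("local[*]");
}
SparkSession spark = builder.getOrCreate();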
I find the AWS documentation very confusing as to the proper way to run this.
Update:
Running the command directly on the server does work, so the problem must be in the way I'm defining the CLI command.
spark-submit --class com.my.class.with.main.Foo \
s3://mybucket/script.jar \
"s3://partitioned-input-data/*/*/*/*/*.txt" \
"s3://output-bucket/table-name"
Approach 1) was actually working.
The step overview in the AWS console said the task finished within 15 seconds, but in reality it was still running on the cluster; it took an hour to do the work, and I can see the result.
I do not know why the step misreports the status. I'm using emr-5.9.0 with Ganglia 3.7.2, Spark 2.2.0, and Zeppelin 0.7.2.
I am trying to compile the xslt2-transformer extension, because I can't find the LibreOffice extension xslt2-transformer.oxt on the web site (if anybody has it, you're welcome to share it).
To build from source code, I proceeded like this:
$ git clone https://github.com/dtardon/xslt2-transformer.git
$ cd xslt2-transformer/
$ make
I am getting a lot of similar errors during the build:
mkdir -p build/classes && \
javac -d build/classes -source 1.5 -target 1.5 \
-cp "external/saxon9.jar:" com/sun/star/comp/xsltfilter/Base64.java \
com/sun/star/comp/xsltfilter/XSLTFilterOLEExtracter.java \
com/sun/star/comp/xsltfilter/XSLTransformer.java && \
touch build/javac.done
com/sun/star/comp/xsltfilter/XSLTFilterOLEExtracter.java:27: error: package com.sun.star.bridge does not exist
import com.sun.star.bridge.XBridgeFactory;
^
com/sun/star/comp/xsltfilter/XSLTFilterOLEExtracter.java:28: error: package com.sun.star.bridge does not exist
import com.sun.star.bridge.XBridge;
^
[...]
symbol: class XConnector
location: class XSLTFilterOLEExtracter
com/sun/star/comp/xsltfilter/XSLTFilterOLEExtracter.java:321: error: cannot find symbol
XConnector xConnector = UnoRuntime.queryInterface(XConnector.class, x);
^
symbol: class XConnector
location: class XSLTFilterOLEExtracter
100 errors
1 warning
make: *** [build/javac.done] Error 1
I think my CLASSPATH is not up to date; I need to add the com.sun.star package and classes.
Since I am (currently) on OS X, my LibreOffice is installed in /Applications/LibreOffice.app, and I found some classes in ./Contents/Resources/java.
So I updated the CLASSPATH this way:
export CLASSPATH=/Applications/LibreOffice.app/Contents/Resources/java:$CLASSPATH
But I get the same errors. How can I fix that?
EDIT 1: put some JARs on the CLASSPATH
I tried this:
$ export CLASSPATH=/Applications/LibreOffice.app/Contents/Resources/java/ridl.jar:.
I get fewer errors. (Pointing the CLASSPATH at the directory did not help because the JVM only loads .class files from a classpath directory; JARs have to be listed explicitly.)
EDIT 2: The build succeeds!
I finally added the following JAR files to the CLASSPATH:
/Applications/LibreOffice.app/Contents/Resources/java/ridl.jar
/Applications/LibreOffice.app/Contents/Resources/java/jurt.jar
/Applications/LibreOffice.app/Contents/Resources/java/juh.jar
/Applications/LibreOffice.app/Contents/Resources/java/unoil.jar
And I got the extension!
Finally, to build from source code, I proceeded like this:
git clone https://github.com/dtardon/xslt2-transformer.git
cd xslt2-transformer/
export CLASSPATH=/Applications/LibreOffice.app/Contents/Resources/java/ridl.jar:\
/Applications/LibreOffice.app/Contents/Resources/java/jurt.jar:\
/Applications/LibreOffice.app/Contents/Resources/java/juh.jar:\
/Applications/LibreOffice.app/Contents/Resources/java/unoil.jar
make
The result is build/xslt2-transformer.oxt.
I have a Spark application packaged with Maven. At run time, I have to pass three arguments (the paths of three files used to create RDDs), so I used the spark-submit command as the official Spark website indicates:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
.. # other options
<application-jar> \
[application-arguments]
My submit command looks like this:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar ["C:\Users\pc\Desktop\pathToFile1.csv", "C:\Users\pc\Desktop\pathToFile2.csv", "C:\Users\pc\Desktop\pathToFile3.csv"]
I modified my Main class as follows to get the paths at runtime:
String pathToFile1 = args[0];
String pathToFile2 = args[1];
String pathToFile3 = args[2];
But I get an error message saying the specified path does not exist. What am I doing wrong here?
@bradimus you were right, I don't have to use []; I have to write it as:
\bin\spark-submit --class myapp.Main --master local[*] file:///C:\Users\pc\Desktop\eclipse\myapp\target\myapp-0.0.1-SNAPSHOT.jar C:\Users\pc\Desktop\pathToFile1.csv C:\Users\pc\Desktop\pathToFile2.csv C:\Users\pc\Desktop\pathToFile3.csv
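To double-check how the arguments arrive, here is a throwaway sketch (my addition) that just echoes them; each whitespace-separated token after the application jar becomes one element of args, and brackets or commas are passed through literally:

// Temporary debugging main: print the arguments exactly as the JVM received them.
public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
        System.out.println("args[" + i + "] = " + args[i]);
    }
}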
I'm attempting to build the SimpleShortestPathsComputation example included with Giraph and run it from within my home directory. Basically, I'm just trying to tweak the SimpleShortestPaths example and run it without any hassle (not quite sure what the best way to go about that would be). My approach was as follows:
SimpleShortestPathsComputation.java:
import org.apache.giraph.graph.BasicComputation;
......
import org.apache.log4j.Logger;
import java.io.IOException;
public class SimpleShortestPathsComputation extends BasicComputation<
LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
......
I build it like so:
JCC = javac
JFLAGS = -Xlint
OUTPUT_CLASS = test
CLASSPATH = $(HADOOP_HOME)/hadoop-core-0.20.203.0.jar:$(GIRAPH_HOME)/giraph-core/target/giraph-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar

default: SimpleShortestPathsComputation.class

SimpleShortestPathsComputation.class: SimpleShortestPathsComputation.java
	mkdir -p $(OUTPUT_CLASS)
	$(JCC) $(JFLAGS) -classpath $(CLASSPATH) -d $(OUTPUT_CLASS) SimpleShortestPathsComputation.java
	jar cvf SimpleShortestPathsComputation.jar -C $(OUTPUT_CLASS)/ .
This works fine and creates a jar file named SimpleShortestPathsComputation.jar. I then try running it like so:
$HADOOP_HOME/bin/hadoop jar $GIRAPH_HOME/giraph-examples/target/giraph-examples-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar \
org.apache.giraph.GiraphRunner /home/hduser/SimpleShortestPathsComputation.jar \
-vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
-vip /user/hduser/input/tiny_graph.txt \
-vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
-op /user/hduser/output/shortestpaths -w 1
However, this results in the following:
Exception in thread "main" java.lang.ClassNotFoundException: /home/hduser/SimpleShortestPathsComputation.jar
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at org.apache.giraph.utils.ConfigurationUtils.handleComputationClass(ConfigurationUtils.java:470)
at org.apache.giraph.utils.ConfigurationUtils.populateGiraphConfiguration(ConfigurationUtils.java:453)
at org.apache.giraph.utils.ConfigurationUtils.parseArgs(ConfigurationUtils.java:207)
at org.apache.giraph.GiraphRunner.run(GiraphRunner.java:74)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.giraph.GiraphRunner.main(GiraphRunner.java:124)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I'm not quite sure what I'm doing wrong. If anyone can point me in the right direction, or link to a resource that explains an easier way of what I'm trying to accomplish, I'd greatly appreciate it!
If you are running the "Quick start", there are actually inconsistencies with 1.0.0:
(1) The argument "-vof" should be "-of"; "-vof" was introduced in 1.1.0 (see GIRAPH-774).
(2) In 1.0.0 there is no class called "org.apache.giraph.examples.SimpleShortestPathsComputation"; change it to "org.apache.giraph.examples.SimpleShortestPathsVertex".
I'm trying to run a Java application as a Windows service with WinRun4J.
I copied WinRun4J64c.exe into my application directory and placed the following service.ini file beside it:
service.class=org.boris.winrun4j.MainService
service.id=MyAPP
service.name=MyAPP
service.description=some description
classpath.1=./lib/*
classpath.2=WinRun4J.jar
[MainService]
class=play.core.server.NettyServer
But if I start the service with WinRun4J64c.exe --WinRun4J:RegisterService, I get:
Service control dispatcher error: 1063
What is wrong?
I didn't get it working, so my workaround is to use Apache Commons Daemon. I executed the included prunsrv.exe with the following parameters (the service name and paths are placeholders):
prunsrv.exe install "MyApplication" \
--Install="C:/path/to/prunsrv.exe" \
--JvmOptions=-Dpidfile.path=NUL \
--Jvm=auto \
--Startup=auto \
--StartMode=jvm \
--Classpath="c:/somewhere/application/lib/*;" \
--StartClass=play.core.server.NettyServer