I'm running hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in eclipse. Eclipse has no problem running mapreduce tasks. However, when I go to debug, it gives me this error :
12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Okay, so I do some research. Apparently, I should use eclipse's remote debugging facility, and add this to my hadoop-env.sh :
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000
I do that and I can step through my code in eclipse. Only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I'm imagining because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.
So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions :
Is there an easier way to debug mapreduce jobs in eclipse?
How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?
Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks? (such as "hadoop queue" or "hbase shell").
Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.
This is a more general question : what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?
Here is my Main class :
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");
        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));
        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Make the change in the /bin/hadoop (hadoop-env.sh) script. Check which command has been fired; only if the command is jar, add the remote debug configuration:
if [ "$COMMAND" = "jar" ] ; then
  # Attach a remote-debug agent only for "hadoop jar" invocations; everything else runs normally.
  exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
else
  exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
fi
The only way you can debug hadoop in eclipse is by running hadoop in local mode. The reason is that each map/reduce task runs in its own JVM, and when you don't run hadoop in local mode, eclipse won't be able to debug.
When you set hadoop to local mode, instead of using the hdfs API (which is the default), the hadoop file system changes to file:///. Thus, running hadoop fs -ls is no longer an hdfs command; it is effectively hadoop fs -ls file:///, a path to your local directory. Neither the JobTracker nor the NameNode runs.
These blogposts might help:
http://let-them-c.blogspot.com/2011/07/running-hadoop-locally-on-eclipse.html
http://let-them-c.blogspot.com/2011/07/configurations-of-running-hadoop.html
Besides the recommended MRUnit, I like to debug with eclipse as well. I have a main program that instantiates a Configuration and executes the MapReduce job directly, and I debug it with a standard eclipse Debug configuration. Since I include the hadoop jars in my mvn spec, I have all of hadoop on my classpath and have no need to run against my installed hadoop. I always test with small data sets in local directories to keep things easy. The defaults of the configuration behave as a standalone hadoop (the local file system is used).
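A minimal sketch of what such a driver might look like, reusing the mapper and reducer classes from the question; the class name and the testdata paths are made up, and the two explicit property sets are optional, since a fresh Configuration already defaults to the local file system and the local job runner (the property names shown are the Hadoop 1.x ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDebugDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Optional: make standalone mode explicit (these are the defaults anyway).
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        Job job = new Job(conf, "FirstStageLocalDebug");
        job.setJarByClass(LocalDebugDriver.class);
        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path("testdata/in"));    // small local input
        FileOutputFormat.setOutputPath(job, new Path("testdata/out")); // must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Set a breakpoint in the mapper and launch this class with a plain Debug As > Java Application; no remote debugging is needed because everything runs in the one Eclipse JVM.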
I also like to debug via unit tests with MRUnit. I use it in combination with ApprovalTests, which creates an easy visualization of the map/reduce process and makes it easy to pass in failing scenarios. It also runs seamlessly from eclipse.
For example:
HadoopApprovals.verifyMapReduce(new WordCountMapper(),
new WordCountReducer(), 0, "cat cat dog");
Will produce the output:
[cat cat dog]
-> maps via WordCountMapper to ->
(cat, 1)
(cat, 1)
(dog, 1)
-> reduces via WordCountReducer to ->
(cat, 2)
(dog, 1)
There's a video on the process here: http://t.co/leExFVrf
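If you want the plain MRUnit route without ApprovalTests, a minimal sketch of an equivalent test could look like the following; it assumes WordCountMapper reads LongWritable/Text and WordCountReducer emits Text/IntWritable, so adjust the generic parameters to your actual classes:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapReduceDriver;
import org.junit.Test;

public class WordCountTest {
    @Test
    public void countsWords() throws Exception {
        // Runs the mapper, an in-memory shuffle/sort, and the reducer in-process,
        // so ordinary Eclipse breakpoints just work.
        MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> driver =
                MapReduceDriver.newMapReduceDriver(new WordCountMapper(), new WordCountReducer());
        driver.withInput(new LongWritable(0), new Text("cat cat dog"))
              .withOutput(new Text("cat"), new IntWritable(2))
              .withOutput(new Text("dog"), new IntWritable(1))
              .runTest();
    }
}

Run it like any other JUnit test from eclipse and set breakpoints in the mapper or reducer as usual.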
Adding args to hadoop's internal java command can be done via HADOOP_OPTS env variable:
export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
You can pass the debugging parameters via -Dmapreduce.map.java.opts.
For example you can run HBase Import and run the mappers in debug mode:
yarn jar your/path/to/hbase-mapreduce-2.2.5.jar import
-Dmapreduce.map.speculative=false
-Dmapreduce.reduce.speculative=false
-Dmapreduce.map.java.opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
my_table /path/in/hdfs
Note that this must be placed on one line, without newlines.
Other map-reduce applications can be started in the same way; the trick is to pass the debug directives via -Dmapreduce.map.java.opts.
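If you submit a job from your own Java driver instead of the yarn command line, the same trick can be applied by setting the property on the job's Configuration before submitting. A sketch under that assumption (the class name is made up; suspend=y makes every map task wait for a debugger to attach):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class DebuggableJobFactory {
    public static Job newDebuggableJob() throws Exception {
        Configuration conf = new Configuration();
        // Same JDWP options as on the command line; every map task JVM will wait on port 5005.
        conf.set("mapreduce.map.java.opts",
                "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y");
        conf.setBoolean("mapreduce.map.speculative", false);    // avoid duplicate debug targets
        conf.setBoolean("mapreduce.reduce.speculative", false);
        return Job.getInstance(conf, "debuggable-job");
    }
}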
In Eclipse or IntelliJ you have to create a debug remote connection with
Host=127.0.0.1 (or even a remote IP address in case Hadoop runs elsewhere)
Port=5005
I managed to debug the Import this way. In addition you can limit the number of mappers to 1 as described here but this was not necessary for me.
Once the map-reduce application is started, switch to your IDE and try to launch your debug configuration; it will fail in the beginning. Repeat it until the debugger hooks into the application. Don't forget to set a breakpoint beforehand.
In case you want to debug not only your application but also the surrounding HBase/Hadoop framework, you can download them here and here (choose your version via the "switch branch/tags" menu button).
Related
Situation:
I have installed Jasper Reports Library (V6.5.1) on my local Linux server which generates PDF reports (Data is dumped in a temp Oracle DB table for the reporting engine).
It then serves this PDF back to the website from which I kick off the process.
Goal:
Install Jmeter to analyse performance / possible bottlenecks of "Jasper Reports Library" (aka Report Generation) on my local linux server (I cannot access this server via GUI, only shell).
I understand I have to connect my local Windows 10 machine (running the same JMeter 4.0) with this local server. On the server I have to start the JMeter 4.0 server (via the jmeter-server command); however, I get an error and am stuck (I have not found anything online, or even people with the same goal, unfortunately...).
Steps I have taken:
Download latest (4.0) bin from here
Extracted on local linux server in /opt/dlins/apache-jmeter-4.0bin
Trying to start server with /usr/lib/jvm/jdk1.8.0_102/bin/java jmeter-server (the default java version is 6 so through this I can run this app with java 8) - Instructions found here
-> Getting error: "Error: Could not find or load main class jmeter-server"
Any help regarding the above, or even any other tool you may use, is appreciated (maybe there is a preferable way to test performance for the above scenario).
There are 2 aspects related to your issue and screenshot:
1) Using java 8 instead of 6 - This can be done in several ways, depending on your needs and restrictions, such as the need to have Java 6 globally available for other applications and using 8 just to run JMeter, or just replacing 6 with 8 entirely. For the sake of brevity, I'll assume the first scenario, but there's documentation available for both and Dmitri T has partially explained it already.
Anyway, the same JMeter doc link you used describes (just scroll down a bit) how to create a setenv.sh script in the bin directory and configure JAVA_HOME or JRE_HOME depending on your needs.
To set those variables permanently, you can place them in a file called setenv.sh in the bin directory. This file will be sourced when running JMeter by calling the jmeter script.
You seem to want a JDK, so create the script, add JAVA_HOME=/usr/lib/jvm/jdk1.8.0_102 inside, save and exit.
2) Running JMeter - To clarify a minor confusion, java MyCompiledClass instructs java to load and execute the "program" defined in MyCompiledClass, which is not what you want to do, because jmeter-server is a shell script. If you open it, you'll see that it calls the jmeter shell script, which will do some configuration and eventually call (in short) java -jar ApacheJMeter.jar with some arguments and options.
So, to run JMeter make sure your scripts are executable with chmod, and simply run from command line ./jmeter-server. From the same link:
Un*x script files; should work on most Linux/Unix systems:
jmeter - run JMeter (in GUI mode by default). Defines some JVM settings which may not work for all JVMs.
jmeter-server - start JMeter in server mode (calls jmeter script with appropriate parameters)
jmeter.sh - very basic JMeter script (You may need to adapt JVM options like memory settings).
mirror-server.sh - runs the JMeter Mirror Server in non-GUI mode
shutdown.sh - Run the Shutdown client to stop a non-GUI instance gracefully
stoptest.sh - Run the Shutdown client to stop a non-GUI instance abruptly
Amend your PATH environment variable so Java 8 bin would be before Java 6 bin like:
PATH=/usr/lib/jvm/jdk1.8.0_102/bin:$PATH && export PATH
Once done you should be able to just launch the jmeter-server script like
pushd /opt/dlins/apache-jmeter-4.0bin/bin && ./jmeter-server
More information:
Remote Testing
JMeter Distributed Testing Step-by-step
How to Get Started With JMeter: Part 1 - Installation & Test Plans
We were running a plugin-rich Jenkins 2.46.1, and one of my team members tried to update Jenkins, but it just hung displaying the Please wait while Jenkins is restarting message for about 2 hours, so I forced Jenkins to shut down and then started it using the java -jar Jenkins.war command.
When Jenkins restarted, none of my jobs were displayed in the Jenkins GUI, but they were present in the jobs and workspace folders, so I tried the Reload configuration from disk option, but that also did not restore the jobs.
Could someone please advise how I can restore my Jenkins jobs as they were before?
I believe you are not pointing JENKINS_HOME correctly. You may need something like this as a startup script/batch file:
For Linux:
export JENKINS_HOME=<path/to/old/jenkins/home>
java -jar Jenkins.war
For Windows:
set JENKINS_HOME=<path/to/old/jenkins/home>
java -jar Jenkins.war
Basically, running the war directly points JENKINS_HOME to a default temp directory. It can be overridden by the above environment variable. You can also define it centrally so that you do not need to specify it again on each start.
What I understand about running Hadoop or map reduce jobs in standalone mode is that we don't require any Hadoop daemons to be running. Everything runs in a single JVM.
So here is the problem.
I want to run a simple word count map reduce job on my machine (client machine), which is on Windows.
I have done the coding of word count in an IDE (eclipse), so I want to run it from eclipse.
How can this be achieved?
Do I need to use some plugin (like the Karmasphere plugin), or can it be done without using any plugins?
I am able to run the above in Linux. I am using the Cloudera provided VM. Here I created the same project in eclipse.
In the driver code I added the lines below and finally ran it as a Java application:
Configuration conf = new Configuration();
conf.set("mapreduce.jobtracker.address", "local");
conf.set("fs.defaultFS","file:///");
Job job = new Job(conf);
While configuring the Java application run, I provided the input file and output folder name as the program arguments, and it ran.
But doing the same in Windows did not work.
I want to debug map-reduce jobs (pig, hive) using eclipse. That is, to set breakpoints in the hadoop source java files and to inspect the elements while running map-reduce jobs. To do this, I started all the services using eclipse, and I can debug some class files, but I can't create an entire debug environment. Can anyone tell me how?
I do not know of an eclipse tool that can do what you are looking for. If you are looking for a possible solution, the following will work in java.
import java.util.logging.Logger;
For debugging java map reduce code you can use the java logger in each class (driver, mapper, reducer).
Logger log = Logger.getLogger(MyClass.class.getName());
To inspect elements/variables just use :
log.info( "varOne: " + varOne );
These log lines will be printed on the administration page for your job.
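Putting the pieces together, a sketch of how a mapper might carry such a logger; the class name and key/value types are only for illustration, and the same pattern applies to the driver and reducer:

import java.io.IOException;
import java.util.logging.Logger;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Logger log = Logger.getLogger(LoggingMapper.class.getName());

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Inspect variables by logging them; the lines end up in the task logs for the job.
        log.info("offset: " + key + ", line: " + value);
        context.write(new Text(key.toString()), value);  // placeholder pass-through output
    }
}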
The basic thing to remember here is that debugging a Hadoop MR job is going to be similar to any remotely debugged application in Eclipse.
As you would know, Hadoop can be run in the local environment in 3 different modes :
Local Mode
Pseudo Distributed Mode
Fully Distributed Mode (Cluster)
Typically you will be running your local hadoop setup in Pseudo Distributed Mode to leverage HDFS and Map Reduce (MR). However, you cannot debug MR programs in this mode, as each Map/Reduce task runs in a separate JVM process, so you need to switch back to Local mode, where you can run your MR programs in a single JVM process.
Here are the quick and simple steps to debug this in your local environment:
Run hadoop in local mode for debugging so mapper and reducer tasks run in a single JVM instead of separate JVMs. The steps below help you do it.
Configure HADOOP_OPTS to enable debugging, so that when you run your Hadoop job it will wait for the debugger to connect. Below is the command to debug at port 8008:
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
Configure fs.default.name value in core-site.xml to file:/// from hdfs://. You won’t be using hdfs in local mode.
Configure mapred.job.tracker value in mapred-site.xml to local. This will instruct Hadoop to run MR tasks in a single JVM.
Create a debug configuration for Eclipse and set the port to 8008 – typical stuff. For that, go to the debug configurations, create a new Remote Java Application type of configuration, and set the port to 8008 in the settings.
Run your hadoop job (it will be waiting for the debugger to connect) and then launch Eclipse in debug mode with the above configuration. Do make sure to put a break-point first.
That's it.
I created an eclipse project to debug a generic mapreduce program, for example WordCount.java, running standalone hadoop in Eclipse. But I have not tried hive/pig specific mapreduce jobs yet. The project is located at https://github.com/drachenrio/hadoopmr; it can be downloaded using
git clone https://github.com/drachenrio/hadoopmr
This project was created with Ubuntu 16.04.2, Eclipse Neon.3 Release (4.6.3RC2), jdk1.8.0_121, hadoop-2.7.3 environments.
Quick setup:
1) Once the project is imported into Eclipse, open .classpath,
replace /j01/srv/hadoop-2.7.3 with your hadoop installation home path
2) mkdir -p /home/hadoop/input
copy src/main/resources/input.txt to /home/hadoop/input/
It is ready to run/debug WordCount.java mapreduce job.
Read README.md for more details.
If you prefer to create the project manually, see my other answer on stackoverflow.
I have a somewhat strange issue. I have a java application that installs a few services that run as jars. Previously I used the installed Java to run these jars. There are four services, and all are started from a single batch file with sequential calls to each other. Something like this,
start %JAVA_HOME% commandtorunjarfile
would work, and all four services would run in the background with only one java.exe visible in process explorer. I also had another service installed as a windows service which would start/stop these services by calling the run or shutdown bat files.
Now the customer requirement changed to using an internalized version of java. I extract java to a location, make our own environment variable named "ABC_HOME", and the required syntax in the batch file changes to
%ABC_HOME%\javaw commandtorunjarfile
When it's run, it works, but there is no stopping these. When I go to process explorer I see 4 java.exe processes running, one for each of the four run commands in the batch file. If I stop the windows service, all four keep working. If I restart the windows service, the number of java.exe processes in process explorer goes to eight and keeps going up until windows has had enough of it.
How do I get around this? I think the solution should be to have one java process in process explorer, but I can't seem to find any solution for that.
[EDIT]
The four sub services are actually XYNT processes. In the normal scenario it would be something like this
[Process1]
CommandLine = java -Xrs -DasService=yes -cp jarfiles
WorkingDir = c:\bin\scache
PauseStart = 1000
PauseEnd = 1000
UserInterface = No
Restart = Yes
For using java from a specific location the following change was needed
CommandLine = %JAVA_PATH%\bin\java.exe -Xrs -DasService=yes -cp jarfiles
but this wouldn't work, as it would not accept the path variable in the XYNT.ini file. So I called a batch file there, and in that batch file I used the above code. So here is what the change looks like,
CommandLine = batchfile.bat
and in batchfile.bat
%JAVA_PATH%\bin\java.exe -Xrs -DasService=yes -cp jarfiles
Usually, every Java program run on your system gets its own virtual machine, which means one java.exe/javaw.exe per instance of your program.
I cannot tell why it "worked" with java.exe the way you first described, but the behaviour you describe for javaw.exe (having 4 java processes in process explorer) is what I'd have expected.
For me the question is not why you're seeing 4 vs. 1 java processes, but how you can start/stop the "services". Killing the Java VM externally doesn't seem a very good solution. I'd consider building some IPC into the Java services that allow you to gracefully terminate the processes.
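For example, a minimal sketch of such an IPC hook; the port number, the "stop" command, and the class name are all made up for illustration. Each jar would start this listener on its own port at startup, and the controlling windows service would then connect to 127.0.0.1 on that port and send stop instead of killing processes:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class ShutdownListener {

    /** Starts a background thread that exits the JVM when "stop" arrives on the given port. */
    public static void listen(int port, Runnable cleanup) {
        Thread t = new Thread(() -> {
            try (ServerSocket server = new ServerSocket(port, 1, InetAddress.getLoopbackAddress())) {
                while (true) {
                    try (Socket client = server.accept();
                         BufferedReader in = new BufferedReader(
                                 new InputStreamReader(client.getInputStream()))) {
                        if ("stop".equals(in.readLine())) {
                            cleanup.run();   // close resources, flush state, etc.
                            System.exit(0);  // ends this javaw.exe instance gracefully
                        }
                    }
                }
            } catch (IOException e) {
                // port in use or socket error; the service keeps running without the hook
            }
        }, "shutdown-listener");
        t.setDaemon(true);
        t.start();
    }
}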