What I understand from running Hadoop or MapReduce jobs in standalone mode is that we don't require any Hadoop daemons to be running; everything runs in a single JVM.
So here is the problem.
I want to run a simple word-count MapReduce job on my machine (the client machine), which runs Windows.
I have written the word-count code in an IDE (Eclipse), so I want to run it from Eclipse.
How can this be achieved?
Do I need a plugin such as the Karmasphere plugin, or can it be done without one?
I am able to run the above on Linux, using the Cloudera-provided VM, where I created the same project in Eclipse.
In the driver code I added the lines below and then ran it as a Java application:
Configuration conf = new Configuration();
conf.set("mapreduce.jobtracker.address", "local"); // run MR tasks in a single local JVM
conf.set("fs.defaultFS", "file:///");              // use the local file system instead of HDFS
Job job = new Job(conf);
While configuring the run as a Java application, I provided the input file and output folder name as program arguments, and it ran.
But doing the same on Windows did not work.
I have some doubts about this. First, I want to run my Java code in a Jenkins pipeline; the code needs to receive two folders as arguments from the Jenkins build environment (Linux), and it will then do certain comparisons and generate a CSV file.
How do I run the Java code in my Jenkins job? Is it as simple as checking out my cloned Java code and running mvn clean compile? How can I pass the arguments from the Jenkins environment to the Java code? And finally, how can I make the CSV file appear on my Jenkins job dashboard?
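For illustration, a minimal sketch of the shell steps such a Jenkins job could run; the main class name, the two folder variables, and the use of the exec-maven-plugin are assumptions, not part of the original setup:

# Build the checked-out project (assumes a Maven project)
mvn clean compile
# Run the main class, passing the two folders as program arguments
# (com.example.Comparer, $DIR_A and $DIR_B are hypothetical placeholders)
mvn exec:java -Dexec.mainClass=com.example.Comparer -Dexec.args="$DIR_A $DIR_B"

To surface the CSV on the job page, the generated file can be archived as a build artifact (Jenkins' "Archive the artifacts" post-build action, or the archiveArtifacts pipeline step).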
Situation:
I have installed Jasper Reports Library (V6.5.1) on my local Linux server which generates PDF reports (Data is dumped in a temp Oracle DB table for the reporting engine).
It then serves this PDF back to the website from which I kick off the process.
Goal:
Install JMeter to analyse performance / possible bottlenecks of the "Jasper Reports Library" (aka report generation) on my local Linux server (I cannot access this server via a GUI, only a shell).
I understand I have to connect my local Windows 10 machine (running the same JMeter 4.0) to this server. On the server I have to start the JMeter 4.0 server (via the jmeter-server command); however, I get an error and am stuck (I have not found anything online, or even people with the same goal, unfortunately...).
Steps I have taken:
Downloaded the latest (4.0) binary from here
Extracted it on the local Linux server to /opt/dlins/apache-jmeter-4.0bin
Tried to start the server with /usr/lib/jvm/jdk1.8.0_102/bin/java jmeter-server (the default Java version is 6, so this lets me run the app with Java 8) - instructions found here
-> Getting error: "Error: Could not find or load main class jmeter-server"
Any help regarding the above, or even any other tool you may use, is appreciated (maybe there is a preferable way to test performance for this scenario).
There are 2 aspects related to your issue and screenshot:
1) Using Java 8 instead of 6 - This can be done in several ways, depending on your needs and restrictions, such as needing Java 6 globally available for other applications and using 8 just to run JMeter, or replacing 6 with 8 entirely. For the sake of brevity, I'll assume the first scenario, but there's documentation available for both, and Dmitri T has already partially explained it.
Anyway, the same JMeter doc link you used, describes (just scroll down a few times) how to create a setenv.sh script in the bin directory and configure JAVA_HOME or JRE_HOME depending on your needs.
To set those variables permanently, you can place them in a file called setenv.sh in the bin directory. This file will be sourced when running JMeter by calling the jmeter script.
You seem to want a JDK, so create the script, add JAVA_HOME=/usr/lib/jvm/jdk1.8.0_102 inside, then save and exit.
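Concretely, using the path from the question, bin/setenv.sh would contain just:

# bin/setenv.sh - sourced by the jmeter script at startup
JAVA_HOME=/usr/lib/jvm/jdk1.8.0_102
export JAVA_HOME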
2) Running JMeter - To clarify a minor confusion, java MyCompiledClass instructs java to load and execute the "program" defined in MyCompiledClass, which is not what you want to do, because jmeter-server is a shell script. If you open it, you'll see that it calls the jmeter shell script, which does some configuration and eventually calls (in short) java -jar ApacheJMeter.jar with some arguments and options.
So, to run JMeter, make sure your scripts are executable with chmod, and simply run ./jmeter-server from the command line (a short sketch follows the list below). From the same link:
Un*x script files; should work on most Linux/Unix systems:
jmeter - run JMeter (in GUI mode by default). Defines some JVM settings which may not work for all JVMs.
jmeter-server - start JMeter in server mode (calls jmeter script with appropriate parameters)
jmeter.sh - very basic JMeter script (You may need to adapt JVM options like memory settings).
mirror-server.sh - runs the JMeter Mirror Server in non-GUI mode
shutdown.sh - Run the Shutdown client to stop a non-GUI instance gracefully
stoptest.sh - Run the Shutdown client to stop a non-GUI instance abruptly
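Putting that together, the assumed sequence on the server (using the install path from the question) would be:

cd /opt/dlins/apache-jmeter-4.0bin/bin
chmod +x jmeter jmeter-server    # make the scripts executable if they are not already
./jmeter-server                  # starts JMeter in server mode via the jmeter script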
Amend your PATH environment variable so the Java 8 bin directory comes before the Java 6 one, like:
PATH=/usr/lib/jvm/jdk1.8.0_102/bin:$PATH && export PATH
Once done you should be able to just launch the jmeter-server script like
pushd /opt/dlins/apache-jmeter-4.0bin/bin && ./jmeter-server
More information:
Remote Testing
JMeter Distributed Testing Step-by-step
How to Get Started With JMeter: Part 1 - Installation & Test Plans
I want to debug MapReduce jobs (Pig, Hive) using Eclipse, that is, to set breakpoints in the Hadoop source Java files and to inspect the elements while running MapReduce jobs. To do this, I started all the services using Eclipse and I can debug some class files, but I can't create an entire debug environment. Can anyone tell me how?
I do not know of an Eclipse tool that can do what you are looking for. If you are looking for a possible solution, the following works in Java.
import java.util.logging.Logger;
For debugging Java MapReduce files, you can use the Java logger in each class (driver, mapper, reducer).
Logger log = Logger.getLogger(MyClass.class.getName());
To inspect elements/variables, just use:
log.info( "varOne: " + varOne );
These log lines will be printed on the administration page for your job.
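Put together, a minimal sketch of what this looks like in a mapper; the class and variable names are illustrative, not from the original code:

import java.io.IOException;
import java.util.logging.Logger;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final Logger log = Logger.getLogger(LoggingMapper.class.getName());

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Inspect the incoming value; as noted above, this shows up in the job's task logs
        log.info("value: " + value);
        for (String word : value.toString().split("\\s+")) {
            context.write(new Text(word), new IntWritable(1));
        }
    }
}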
The basic thing to remember here is that debugging a Hadoop MR job is going to be similar to any remotely debugged application in Eclipse.
As you would know, Hadoop can be run in the local environment in 3 different modes:
Local Mode
Pseudo Distributed Mode
Fully Distributed Mode (Cluster)
Typically you will be running your local Hadoop setup in Pseudo-Distributed Mode to leverage HDFS and MapReduce (MR). However, you cannot debug MR programs in this mode, as each map/reduce task runs in a separate JVM process, so you need to switch back to Local Mode, where you can run your MR programs in a single JVM process.
Here are the quick and simple steps to debug this in your local environment:
Run Hadoop in local mode for debugging, so mapper and reducer tasks run in a single JVM instead of separate JVMs. The steps below help you do it.
Configure HADOOP_OPTS to enable debugging, so that when you run your Hadoop job, it will wait for the debugger to connect. Below is the command to enable debugging on port 8008:
export HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8008"
Configure the fs.default.name value in core-site.xml to file:/// instead of hdfs://. You won't be using HDFS in local mode.
Configure the mapred.job.tracker value in mapred-site.xml to local. This will instruct Hadoop to run MR tasks in a single JVM.
Create a debug configuration for Eclipse and set the port to 8008 - typical stuff. For that, go to the debug configurations, create a new Remote Java Application type of configuration, and set the port to 8008 in the settings.
Run your Hadoop job (it will be waiting for the debugger to connect) and then launch Eclipse in debug mode with the above configuration. Do make sure to set a breakpoint first.
That's it.
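For reference, the two configuration edits from the steps above boil down to these properties; this is a sketch, and the exact file locations depend on your Hadoop conf directory:

<!-- core-site.xml: use the local file system instead of HDFS -->
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
</property>

<!-- mapred-site.xml: run MR tasks in a single local JVM -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>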
I created an Eclipse project to debug a generic MapReduce program, for example WordCount.java, running standalone Hadoop in Eclipse (I have not tried Hive/Pig-specific MapReduce jobs yet). The project is located at https://github.com/drachenrio/hadoopmr and can be downloaded using
git clone https://github.com/drachenrio/hadoopmr
This project was created with Ubuntu 16.04.2, Eclipse Neon.3 Release (4.6.3RC2), jdk1.8.0_121 and hadoop-2.7.3.
Quick setup:
1) Once the project is imported into Eclipse, open .classpath and
replace /j01/srv/hadoop-2.7.3 with your Hadoop installation home path
2) mkdir -p /home/hadoop/input
copy src/main/resources/input.txt to /home/hadoop/input/
It is ready to run/debug the WordCount.java MapReduce job.
Read README.md for more details.
If you prefer to create the project manually, see my other answer on Stack Overflow.
I am trying to run a simple Java program once per day on a Windows 7 machine.
My code runs fine inside NetBeans. If I do a clean and build, it suggests this:
C:\Program Files\Java\jdk1.7.0/bin/java -jar "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\Facebook.jar"
This does not work from the DOS prompt, of course, because of the space between Program and Files, so I quote the path to the executable:
"C:\Program Files\Java\jdk1.7.0\bin\java" -jar "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\Facebook.jar"
This works from the DOS prompt.
I now create a task in Windows Scheduler to run:
C:\Program Files\Java\jdk1.7.0/bin/java
with arguments:
-jar "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\Facebook.jar"
When I then run it, all I see is a DOS box flashing up for a second. I expect the code to take about 30 seconds to run. The code should persist data to a database, and no updates happen.
The code also uses java.util.logging, so I should see log entries, and I don't.
I strongly suspect that I am not running the Java command properly, or that there is a bad classpath issue present when running via Scheduler that isn't there when running from the DOS prompt.
Help would be appreciated. If you've seen this before and can sort it, that would be great. If you can tell me how to get a meaningful error trace from Scheduler, that would also be really helpful.
Thanks!
I think you could create a simple batch script that launches your program this way:
@echo off
REM If necessary, change to the program directory
cd C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\
REM Run the program
"C:\Program Files\Java\jdk1.7.0\bin\java.exe" -jar "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\Facebook.jar"
Copy it into Notepad, save it as java_script.cmd, and then schedule this script instead of the program directly.
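If you prefer to create the scheduled task from the command line, a sketch with schtasks; the task name and start time here are placeholders:

REM Create a daily 09:00 task that runs the wrapper script
schtasks /Create /SC DAILY /TN "FacebookJob" /TR "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\java_script.cmd" /ST 09:00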
I solved it after changing all font references to "SansSerif".
I was using Jasper Reports inside Java to create a PDF file. It was working fine when I double-clicked the batch file, or via the Scheduler on Windows Server 2003, but it was not working with the Scheduler on 2008.
I tried many different things and nothing worked, so I thought: could it be that Windows Server 2008 is blocking the access?
Now it is working perfectly. So, if you are having problems, check the references to anything you are using.
The scheduler will run under a different user unless you specify which user to run as. If it isn't running as your user, then it won't be able to write to your directories.
The real problem behind the original question is a Java installation issue on Microsoft systems. The Java JRE installs into Program Files\java, and the executable (java.exe) is only installed in that java\bin directory. Running from the command line, the OS looks in the proper location for java.exe. Running from other MS tools (such as VBA Excel or, in this case, Task Scheduler), it does not!
You can see that Task Scheduler is looking in the wrong place by viewing the task's history in the Task Scheduler tool. Double-click some of the history events and one will list the action and return code. The action will show that Task Scheduler is trying to run
"C:\Windows\system32\java.EXE"
So, copy java.exe from the java\bin directory into the place where the scheduler is looking, and now it will work.
Or update your task and provide the full path to java.exe.
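For example, the task's action could simply be the fully quoted command that already works from the prompt:

"C:\Program Files\Java\jdk1.7.0\bin\java.exe" -jar "C:\Users\User1\Documents\NetBeansProjects\Facebook\dist\Facebook.jar"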
You can also update the system PATH environment variable so that java is found in the java\bin directory, but that applies to all users and sometimes this is faulty as well.
I'm running Hadoop in a single-machine, local-only setup, and I'm looking for a nice, painless way to debug mappers and reducers in Eclipse. Eclipse has no problem running MapReduce tasks. However, when I go to debug, it gives me this error:
12/03/28 14:03:23 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
Okay, so I did some research. Apparently, I should use Eclipse's remote debugging facility and add this to my hadoop-env.sh:
-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5000
I do that and I can step through my code in Eclipse. The only problem is that, because of the "suspend=y", I can't use the "hadoop" command from the command line to do things like look at the job queue; it hangs, I imagine because it's waiting for a debugger to attach. Also, I can't run "hbase shell" when I'm in this mode, probably for the same reason.
So basically, if I want to flip back and forth between "debug mode" and "normal mode", I need to update hadoop-env.sh and restart my machine. Major pain. So I have a few questions :
Is there an easier way to debug MapReduce jobs in Eclipse?
How come eclipse can run my mapreduce jobs just fine, but for debugging I need to use remote debugging?
Is there a way to tell hadoop to use remote debugging for mapreduce jobs, but to operate in normal mode for all other tasks? (such as "hadoop queue" or "hbase shell").
Is there an easier way to switch hadoop-env.sh configurations without rebooting my machine? hadoop-env.sh is not executable by default.
This is a more general question : what exactly is happening when I run hadoop in local-only mode? Are there any processes on my machine that are "always on" and executing hadoop jobs? Or does hadoop only do things when I run the "hadoop" command from the command line? What is eclipse doing when I run a mapreduce job from eclipse? I had to reference hadoop-core in my pom.xml in order to make my project work. Is eclipse submitting jobs to my installed hadoop instance, or is it somehow running it all from the hadoop-core-1.0.0.jar in my maven cache?
Here is my Main class:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(Main.class);
        job.setJobName("FirstStage");
        FileInputFormat.addInputPath(job, new Path("/home/sangfroid/project/in"));
        FileOutputFormat.setOutputPath(job, new Path("/home/sangfroid/project/out"));
        job.setMapperClass(FirstStageMapper.class);
        job.setReducerClass(FirstStageReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Make changes in the bin/hadoop (hadoop-env.sh) script. Check to see which command has been fired. If the command is jar, then add the remote debug configuration only in that case:
if [ "$COMMAND" = "jar" ] ; then
exec "$JAVA" -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=8999 $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$#"
else
exec "$JAVA" $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$#"
fi
The only way you can debug Hadoop in Eclipse is by running Hadoop in local mode. The reason is that each map/reduce task runs in its own JVM, and when you don't run Hadoop in local mode, Eclipse won't be able to debug.
When you set Hadoop to local mode, instead of using the HDFS API (which is the default), the Hadoop file system changes to file:///. Thus, running hadoop fs -ls is no longer an HDFS command; it is effectively hadoop fs -ls file:///, a path to your local directory. Neither the JobTracker nor the NameNode runs.
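For example, with local mode configured, a listing goes against the local disk; the path here is the input directory from the question, used purely for illustration:

hadoop fs -ls file:///home/sangfroid/project/in   # lists a local directory, no HDFS involved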
These blogposts might help:
http://let-them-c.blogspot.com/2011/07/running-hadoop-locally-on-eclipse.html
http://let-them-c.blogspot.com/2011/07/configurations-of-running-hadoop.html
Besides the recommended MRUnit, I like to debug with Eclipse as well. I have a main program that instantiates a Configuration and executes the MapReduce job directly. I just debug with standard Eclipse debug configurations. Since I include the Hadoop jars in my mvn spec, I have all of Hadoop in my classpath and have no need to run it against my installed Hadoop. I always test with small data sets in local directories to keep things easy. The defaults of the configuration behave as a standalone Hadoop (the local file system is available).
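A minimal sketch of such a driver, reusing the local-mode settings discussed earlier in this thread; the class name, job name, and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DebugDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "file:///");              // local file system, no HDFS
        conf.set("mapreduce.jobtracker.address", "local"); // run the job in a single JVM
        Job job = Job.getInstance(conf, "debug-run");
        FileInputFormat.addInputPath(job, new Path("in"));    // small local input directory
        FileOutputFormat.setOutputPath(job, new Path("out")); // must not exist yet
        // set mapper/reducer/output classes here, as in the Main class above
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}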
I also like to debug via unit tests with MRUnit. I use this in combination with ApprovalTests, which creates an easy visualization of the map-reduce process and makes it easy to pass in scenarios that are failing. It also runs seamlessly from Eclipse.
For example:
HadoopApprovals.verifyMapReduce(new WordCountMapper(),
new WordCountReducer(), 0, "cat cat dog");
Will produce the output:
[cat cat dog]
-> maps via WordCountMapper to ->
(cat, 1)
(cat, 1)
(dog, 1)
-> reduces via WordCountReducer to ->
(cat, 2)
(dog, 1)
There's a video on the process here: http://t.co/leExFVrf
Adding args to Hadoop's internal java command can be done via the HADOOP_OPTS environment variable:
export HADOOP_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
You can pass the debugging parameters via -Dmapreduce.map.java.opts.
For example you can run HBase Import and run the mappers in debug mode:
yarn jar your/path/to/hbase-mapreduce-2.2.5.jar import
-Dmapreduce.map.speculative=false
-Dmapreduce.reduce.speculative=false
-Dmapreduce.map.java.opts="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=5005,suspend=y"
my_table /path/in/hdfs
Note that this must be placed on one line, without newlines.
Other map-reduce applications can be started in the same way; the trick is to pass the debug directives via -Dmapreduce.map.java.opts.
In Eclipse or IntelliJ you have to create a remote debug connection with
Host=127.0.0.1 (or even a remote IP address in case Hadoop runs elsewhere)
Port=5005
I managed to debug the import this way. In addition, you can limit the number of mappers to 1 as described here, but this was not necessary for me.
Once the map-reduce application is started, switch to your IDE and try to launch your debug configuration, which will fail at first. Repeat until the debugger hooks into the application. Don't forget to set a breakpoint beforehand.
In case you want to debug not only your application but also the surrounding HBase/Hadoop framework, you can download them here and here (choose your version via the "switch branch/tags" menu button).