hadoop cluster is using only master node or all nodes - java

I have created a 4-node Hadoop cluster. I start all DataNodes, the NameNode, the ResourceManager, etc.
To find whether all of my nodes are working or not, I tried the following procedure:
Step 1. I run my program when all nodes are active
Step 2. I run my program when only the master is active.
The completion time in both cases was almost the same.
So, I would like to know if there is any other means by which I can know how many nodes are actually used while running the program.

Discussed in the chat. The problem was caused by an incorrect Hadoop installation: in both cases the job was started locally using LocalJobRunner.
Recommendations:
Install Hadoop using Ambari (http://ambari.apache.org/)
Change platform to CentOS 6.4+
Use Oracle JDK 7
Be careful with host names and firewall settings
Get familiar with the cluster commands for health diagnostics and the default Hadoop web UIs
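
A quick way to confirm where a job actually ran (a minimal diagnostic sketch; the class name ClusterCheck is just for illustration): print mapreduce.framework.name from the client configuration and ask the ResourceManager for its running NodeManagers. If the property resolves to local, the job runs inside a single JVM via LocalJobRunner no matter how many DataNodes are up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.NodeReport;
    import org.apache.hadoop.yarn.api.records.NodeState;
    import org.apache.hadoop.yarn.client.api.YarnClient;

    public class ClusterCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "local" means LocalJobRunner; "yarn" means jobs are submitted to the cluster.
            System.out.println("mapreduce.framework.name = "
                    + conf.get("mapreduce.framework.name", "local"));

            // Ask the ResourceManager which NodeManagers are actually running.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(conf);
            yarn.start();
            for (NodeReport node : yarn.getNodeReports(NodeState.RUNNING)) {
                System.out.println(node.getNodeId() + " running containers: "
                        + node.getNumContainers());
            }
            yarn.stop();
        }
    }

The ResourceManager web UI (port 8088 by default) shows the same node list.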

Related

How to simulate and run some servers on Hadoop?

I need a simulator to run some servers on Hadoop:
It should be able to work with a database.
I want to run a Java program on it and see its results.
I want to run Hadoop without MapReduce.
You don't run servers on Hadoop. It's the other way around.
If you want to create a Hadoop environment without installing Hadoop on your own, you can download a virtual machine or start an account with any of the major cloud providers.
Hadoop just starts YARN and HDFS. If you want to run code that isn't MapReduce, you'll need to find/install another tool such as Spark, Pig, Hive, or Flink, each of which can be used to query databases but is not a database itself.
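
As a sketch of what that looks like (assuming Spark is installed on the cluster; the JDBC URL, table name, and credentials below are placeholders), a small Java job can read a database table through Spark SQL and run on YARN without any MapReduce code:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class JdbcOnYarn {
        public static void main(String[] args) {
            // Submitted with: spark-submit --master yarn ... (JDBC driver jar must be on the classpath)
            SparkSession spark = SparkSession.builder().appName("JdbcOnYarn").getOrCreate();

            Dataset<Row> table = spark.read()
                    .format("jdbc")
                    .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder URL
                    .option("dbtable", "mytable")                     // placeholder table
                    .option("user", "user")
                    .option("password", "secret")
                    .load();

            table.show();   // prints the first rows to the driver log
            spark.stop();
        }
    }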

Spark submit job fails for cluster mode but works in local for copyToLocal from HDFS in java

I'm running Java code to copy files from HDFS to the local file system, using Spark cluster mode via spark-submit.
The job runs fine with Spark local mode but fails in cluster mode.
It throws java.io.IOException: Target /mypath/ is a directory.
I don't understand why it is failing in cluster mode, but I don't receive any exceptions in local mode.
That behaviour occurs because in the first case (local) your driver runs on the same machine from which you submit the whole Spark job. In the second case (cluster), your driver program is shipped to one of your workers and executes from there.
In general, when you want to run Spark jobs in cluster mode and you need to pre-process local files such as JSON or XML, you need to ship them along with the executable using the option --files <myfile>. Then in your driver program you will be able to see that particular file. If you want to include multiple files, separate them with commas (,).
The approach is the same when you want to add jar dependencies: use --jars <myJars>.
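
For illustration, here is a minimal sketch of the driver side, assuming the job is submitted with --files config.json in YARN cluster mode (the file name is hypothetical): YARN localizes the shipped file into the container's working directory, and SparkFiles.get serves as a fallback in client mode or on executors.

    import org.apache.spark.SparkFiles;
    import org.apache.spark.sql.SparkSession;
    import java.io.File;

    public class ShippedFileCheck {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("ShippedFileCheck").getOrCreate();

            // In YARN cluster mode, --files places config.json in the driver container's
            // working directory, so a relative path is usually enough.
            File shipped = new File("config.json");
            if (!shipped.exists()) {
                // Client mode / executors: resolve the path through SparkFiles instead.
                shipped = new File(SparkFiles.get("config.json"));
            }
            System.out.println("Shipped file resolved to " + shipped.getAbsolutePath()
                    + ", exists = " + shipped.exists());

            spark.stop();
        }
    }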
For more details about this, check this thread.

Hadoop MapReduce without cluster - is it possible?

Is it possible to run a Hadoop MapReduce program without a cluster? I mean, I am just trying to fiddle around a little with map/reduce for educational purposes, so all I want is to run a few MapReduce programs on my computer; I don't need any job splitting across multiple nodes, performance boosts, or anything like that. Do I still need to run a VM to achieve this? I am using IntelliJ Ultimate and trying to run a simple WordCount. I believe I've set up all the necessary libraries and the entire project, and upon running I get this exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I've found some posts saying that the entire map/reduce process can be run locally in the JVM, but I haven't yet found how to do it.
The installation tutorial for "pseudo-distributed" mode specifically walks you through setting up a single-node Hadoop cluster.
There's also the "mini cluster", which you'll find some Hadoop projects use for unit and integration tests.
I feel like you're just asking if you need HDFS or YARN, though, and the answer is no: Hadoop can read file:// prefixed paths from disk, with or without a cluster.
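
For example, a minimal sketch of a purely local driver (WordCountMapper and WordCountReducer stand in for your own classes, and the file:///tmp paths are placeholders): with mapreduce.framework.name set to local and fs.defaultFS pointing at file:///, the whole job runs inside one JVM, which should also make the "Cannot initialize Cluster" error go away.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalWordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("mapreduce.framework.name", "local");  // run inside this JVM, no YARN
            conf.set("fs.defaultFS", "file:///");           // no HDFS, plain local files

            Job job = Job.getInstance(conf, "local wordcount");
            job.setJarByClass(LocalWordCount.class);
            job.setMapperClass(WordCountMapper.class);      // your existing mapper
            job.setReducerClass(WordCountReducer.class);    // your existing reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("file:///tmp/wordcount/input"));
            FileOutputFormat.setOutputPath(job, new Path("file:///tmp/wordcount/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }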
Keep in mind that splitting is not just between nodes, but also between multiple cores of a single computer. If you're not doing any parallel processing, there's not much reason to use Hadoop other than to learn the API semantics.
Aside: From an "educational perspective", in my career thus far, I find more people writing Spark than MapReduce, and not many jobs asking specifically for MapReduce code

Elasticsearch gets stuck during initialization

When I try to start Elasticsearch, it hangs and takes more than 15 minutes to start normally. However, when I tried the same setup on different machines, the Elasticsearch server came up in 5 to 10 seconds and worked fine (except for 3 machines, which showed the same problem). What could be the cause of this?
Because of this problem I get org.elasticsearch.client.transport.NoNodeAvailableException: No node available from Java.
Note: Elasticsearch is running as a standalone node on each machine.
I found the solution here.
Probable solutions:
It's always a good idea to install a current Java JRE to benefit from resolved bugs.
Check whether your OS network settings (routes, gateway) and mount points are configured correctly.
Remove the sigar libraries in order to avoid sigar monitoring hangs.
From the logs, ES seems to hang before the node initializes its plugins, so remove all ES plugins, check the classpath (the lib directory is loaded with *), and remove unwanted/duplicate jars.
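
As a side note on the NoNodeAvailableException mentioned in the question: once the node does come up, a quick check from Java (sketched against the old pre-2.x TransportClient API that matches this era; the cluster name and address are placeholders) is to print the client's connected nodes, since an empty list is exactly what produces that exception. A mismatched cluster.name or a too-short client.transport.ping_timeout against a slow-starting node are common causes.

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    public class EsConnectionCheck {
        public static void main(String[] args) {
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("cluster.name", "my_cluster")              // must match the server's cluster.name
                    .put("client.transport.ping_timeout", "10s")    // give a slow-starting node more time
                    .build();

            TransportClient client = new TransportClient(settings)
                    .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

            // Empty list == the client sees no nodes == NoNodeAvailableException on any request.
            System.out.println("Connected nodes: " + client.connectedNodes());
            client.close();
        }
    }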

Automatically installing software on the client machine

It's more of a design and architecture scenario.
I want to have a number of nodes in the cluster; initially, all the nodes come pre-installed with Java 6 and Windows/Linux. On all the nodes I want to install my application (which I will maintain on the server), and this application will be used to run some tasks in parallel.
On the server I want to monitor the traffic of all the nodes and the task execution status.
So how do I achieve this?
Any comments on this will be appreciated.
Thanks in advance.
If I understood your question correctly, you can use parallel-ssh and its pscp and pssh commands to copy your distribution onto the remote hosts and run the commands you need to install it.
There are also some alternatives: dsh, clusterit
