I need a simulator to run some servers on Hadoop:
Able to work with a database.
I want to run a Java program on it and see its results.
Run Hadoop without MapReduce.
You don't run servers on Hadoop. It's the other way around.
If you want to create a Hadoop environment without installing Hadoop yourself, you can download a virtual machine or open an account with any of the major cloud providers.
Hadoop just starts YARN and HDFS. If you want to run code that isn't MapReduce, you'll need to find/install another tool such as Spark, Pig, Hive, Flink, etc., each of which can query databases but is not a database itself.
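As one illustration (a sketch only: the path hdfs:///tmp/input.txt and the app name are placeholders, and it assumes Spark is available on the cluster), a small non-MapReduce Java job that still uses the pieces Hadoop provides could look like this:

    import org.apache.spark.sql.SparkSession;

    public class HdfsLineCount {
        public static void main(String[] args) {
            // A Spark job (not MapReduce) that still relies on what Hadoop starts:
            // YARN to schedule it and HDFS to hold the data.
            SparkSession spark = SparkSession.builder()
                    .appName("hdfs-line-count")
                    .getOrCreate();

            // Placeholder path; point this at a file that exists on your HDFS.
            long lines = spark.read().textFile("hdfs:///tmp/input.txt").count();
            System.out.println("Line count: " + lines);

            spark.stop();
        }
    }

You would submit it with spark-submit --master yarn, so YARN does the scheduling while HDFS serves the data.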
I'm running Java code to copy files from HDFS to the local filesystem using Spark in cluster mode via spark-submit.
The job runs fine with Spark in local mode but fails in cluster mode.
It throws a java.io.IOException: Target /mypath/ is a directory.
I don't understand why it is failing in cluster mode, while I don't receive any exceptions in local mode.
That behaviour occurs because in the first case (local) your driver runs on the same machine from which you submit the whole Spark job. In the second case (cluster), your driver program is shipped to one of your workers and executes from there.
In general, when you want to run Spark jobs in cluster mode and you need to pre-process local files such as JSON or XML, you need to ship them along with the executable using the option --files <myfile>. Then your driver program will be able to see that particular file. If you want to include multiple files, separate them with commas (,).
The approach is the same when you want to add some jar dependencies: use --jars <myJars>.
For more details about this, check this thread.
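Here is a rough sketch of how that fits together; the file name config.json, the jar paths, and the driver class are hypothetical. In cluster mode the shipped files are localized into the driver container's working directory, so you open them by name rather than by the original local path:

    // Hypothetical submit command:
    //   spark-submit --deploy-mode cluster \
    //     --files /local/path/config.json \
    //     --jars /local/path/dep1.jar,/local/path/dep2.jar \
    //     --class com.example.MyDriver my-app.jar

    import java.io.File;
    import java.nio.file.Files;

    public class MyDriver {
        public static void main(String[] args) throws Exception {
            // The file shipped with --files ends up in the container's working
            // directory, so refer to it by its bare name.
            File shipped = new File("config.json");
            byte[] contents = Files.readAllBytes(shipped.toPath());
            System.out.println("Read " + contents.length + " bytes from shipped file");
        }
    }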
Is it possible to run a Hadoop MapReduce program without a cluster? I mean, I am just trying to fiddle around a little with map/reduce for educational purposes, so all I want is to run a few MapReduce programs on my computer. I don't need any job splitting to multiple nodes, and I don't need any performance boosts or anything; as I said, it's just for educational purposes. Do I still need to run a VM to achieve this? I am using IntelliJ Ultimate and I'm trying to run a simple WordCount. I believe I've set up all the necessary libraries and the entire project, and upon running I get this exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I've found some posts saying that the entire map/reduce process can be run locally in the JVM, but I haven't yet found how to do it.
The installation tutorial for "pseudo-distributed" mode specifically walks you through the installation of a single-node Hadoop cluster.
There's also the "mini cluster", which you'll find some Hadoop projects use for unit and integration tests.
I feel like you're just asking whether you need HDFS or YARN, though, and the answer is no: Hadoop can read file:// prefixed file paths from local disk, with or without a cluster.
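As a sketch of what that looks like in code (WordCountMapper and WordCountReducer stand in for the mapper and reducer classes from your existing WordCount project, and the file:// paths are placeholders), you can force the local job runner and the local filesystem through the job configuration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalWordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Run map/reduce inside this JVM: no YARN, no HDFS, no cluster.
            conf.set("mapreduce.framework.name", "local");
            conf.set("fs.defaultFS", "file:///");

            Job job = Job.getInstance(conf, "wordcount-local");
            job.setJarByClass(LocalWordCount.class);
            job.setMapperClass(WordCountMapper.class);    // your existing mapper
            job.setReducerClass(WordCountReducer.class);  // your existing reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("file:///tmp/wc-input"));
            FileOutputFormat.setOutputPath(job, new Path("file:///tmp/wc-output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }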
Keep in mind that splitting is not just between nodes, but also between multiple cores of a single computer. If you're not doing any parallel processing, there's not much reason to use Hadoop other than to learn the API semantics.
Aside: from an "educational" perspective, in my career thus far I find more people writing Spark than MapReduce, and not many jobs asking specifically for MapReduce code.
I have configured HBase and integrated it with HDFS on Windows successfully. I am using HBase version 0.98.6.1-hadoop2 and Hadoop version 2.5.1.
I followed the HBase quick start tutorial.
If I run HBase normally (without the hbase.cluster.distributed property), it works fine. Otherwise it shows "This is not implemented yet. Stay tuned."
How do I start HBase in distributed mode on Windows without Cygwin?
As far as I know, you can do it in these ways:
1) Use Cygwin (ruled out by your requirements).
2) Use VMware or VirtualBox.
3) Use Microsoft HDInsight (suitable for you).
Before starting HBase, make sure Hadoop is in distributed mode and working; only then will your HBase work in distributed mode, otherwise it will run in local mode.
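One quick way to sanity-check what mode the client will see is to read the settings back with the standard HBase Java API. This is only a sketch, and the hdfs://namenode:8020/hbase value in the comment is a placeholder for your own rootdir:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CheckHBaseMode {
        public static void main(String[] args) {
            // Loads hbase-default.xml and hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            System.out.println("hbase.cluster.distributed = "
                    + conf.get("hbase.cluster.distributed", "false"));
            // For distributed mode this should point at HDFS,
            // e.g. hdfs://namenode:8020/hbase, not a local path.
            System.out.println("hbase.rootdir = " + conf.get("hbase.rootdir"));
        }
    }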
I'm trying to set up a trial Cassandra + Pig cluster. The Cassandra wiki makes it sound like you need Hadoop to integrate with Pig,
but the README in cassandra-src/contrib/pig makes it sound like you can run Pig on Cassandra without Hadoop.
If Hadoop is optional, what do you lose by not using it?
Hadoop is only optional when you are testing things out. To do anything at scale, you will need Hadoop as well.
Running without Hadoop means you are running Pig in local mode, which basically means all the data is processed by the single Pig process you are running. This works fine with a single node and example data.
When running with any significant amount of data or multiple machines, you want to run Pig in Hadoop mode. By running Hadoop task trackers on your Cassandra nodes, Pig can take advantage of the benefits MapReduce provides: distributing the workload and using data locality to reduce network transfer.
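To make the two modes concrete, here is a minimal sketch using Pig's Java API; the input file people.txt and the query are made up, and switching to MapReduce mode assumes a working Hadoop configuration on the classpath:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigLocalExample {
        public static void main(String[] args) throws Exception {
            // Local mode: the whole pipeline runs inside this single JVM,
            // which is fine for trying things out on sample data.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Placeholder data and query; with the Cassandra integration you
            // would LOAD through its Pig LoadFunc instead of a local file.
            pig.registerQuery("rows = LOAD 'people.txt' AS (name:chararray, age:int);");
            pig.registerQuery("adults = FILTER rows BY age >= 18;");
            pig.store("adults", "adults_out");

            // Changing ExecType.LOCAL to ExecType.MAPREDUCE compiles the same
            // script into Hadoop jobs that run across the cluster.
        }
    }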
It's optional. Cassandra has its own implementations of Pig's LoadFunc and StoreFunc, which allow you to query and store data.
Hadoop and Cassandra are different in many ways. It's hard to say what you lose without knowing exactly what you are trying to accomplish.
I have a cluster of 32 servers and I need a tool to distribute a Java service, packaged as a Jar file, to each machine and remotely start the service. The cluster consists of Linux (Suse 10) servers with 8 cores per blade. The application is a data grid which uses Oracle Coherence. What is the best tool for doing this?
I asked something similar once, and it seems that the Java Parallel Processing Framework might be what you need:
http://www.jppf.org/
From the web site:
JPPF is an open source Grid Computing platform written in Java that makes it easy to run applications in parallel, and speed up their execution by orders of magnitude. Write once, deploy once, execute everywhere!
Have a look at OpenMOLE: http://www.openmole.org/
This tool enables you to distribute a computing workflow to several kinds of resources: from multicore machines to clusters and computing grids.
It is nicely documented and can be controlled through groovy code or a GUI.
Distributing a jar on a cluster should be very easy to do with OpenMOLE.
Is your service packaged as an EJB? JBoss does a fairly good job with clustering.
Use BitTorrent. Using peer-to-peer sharing on clusters can really boost your deployment speed.
It depends on which operating system you have and how security is set up on your network.
If you can use NFS or a Windows share, I suggest you put the software on an NFS drive which is visible to all machines. That way you can run them all from one copy.
If you have remote shell or secure remote shell (ssh) access, you can write a script which runs the same command on each machine, e.g. start on all machines, or stop on all machines.
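For instance, a minimal sketch in Java (the host names, the jar path, and the assumption of key-based passwordless ssh are all hypothetical):

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;

    public class RemoteStart {
        // Placeholder host names and start command; adjust to your cluster.
        private static final List<String> HOSTS =
                Arrays.asList("blade01", "blade02", "blade03");
        private static final String START_CMD =
                "nohup java -jar /opt/myservice/myservice.jar > /tmp/myservice.log 2>&1 &";

        public static void main(String[] args) throws IOException, InterruptedException {
            for (String host : HOSTS) {
                // Assumes key-based (passwordless) ssh to each blade.
                Process p = new ProcessBuilder("ssh", host, START_CMD)
                        .inheritIO()
                        .start();
                p.waitFor();
            }
        }
    }

The same loop with a stop command gives you a remote shutdown script.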
If you have Windows, you might want to set up a service on each machine. If you have Linux, you might want to add a startup/shutdown script to each machine.
When you have a number of machines, it may be useful to have a tool which monitors that all your services are running, collects the logs and errors in one place, and/or allows you to start/stop them from a GUI. There are a number of tools for this; I'm not sure which is the best these days.