Hadoop MapReduce without cluster - is it possible? - java

Is it possible to run a Hadoop MapReduce program without a cluster? I mean, I am just trying to fiddle around a little with map/reduce for educational purposes, so all I want is to run a few MapReduce programs on my computer. I don't need any job splitting across multiple nodes, and I don't need any performance boosts; as I said, it's just for educational purposes. Do I still need to run a VM to achieve this? I am using IntelliJ Ultimate, and I'm trying to run a simple WordCount. I believe I've set up all the necessary libraries and the entire project, and upon running I get this exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I've found some posts saying that the entire map/reduce process can be run locally on the JVM, but I haven't yet found out how to do it.

The whole installation tutorial for "pseudo-distributed" mode specifically walks you through installing a single-node Hadoop cluster.
There's also the "Mini cluster" which you'll find some Hadoop projects use for unit and integration tests.
I feel like you're just asking if you need HDFS or YARN, though, and the answer is no: Hadoop can read file:// prefixed file paths from local disk, with or without a cluster.
Keep in mind that splitting is not just between nodes, but also between multiple cores of a single computer. If you're not doing any parallel processing, there's not much reason to use Hadoop other than to learn the API semantics.
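To make that concrete, here is a minimal, self-contained sketch of the classic WordCount configured so the whole job runs in a single JVM via the LocalJobRunner and reads/writes ordinary local directories. The class name LocalWordCount and the two local paths passed as arguments are placeholders for the example, not anything from your project.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalWordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Run everything in this JVM with the LocalJobRunner; no cluster and no VM needed.
        conf.set("mapreduce.framework.name", "local");
        // Read and write plain local files instead of HDFS.
        conf.set("fs.defaultFS", "file:///");

        Job job = Job.getInstance(conf, "local word count");
        job.setJarByClass(LocalWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // args[0] = local input directory, args[1] = local output directory (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Setting mapreduce.framework.name to "local" is exactly what the "Cannot initialize Cluster" message is complaining about: with no value (or with "yarn" and no reachable ResourceManager), job submission tries to contact a cluster that isn't there.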
Aside: From an "educational perspective", in my career thus far, I find more people writing Spark than MapReduce, and not many jobs asking specifically for MapReduce code

Related

SPOON Pentaho vs Java Standalone program

I have written a standalone program to do the migration from one system to another, which runs in multiple threads and processes the records asynchronously (there are millions of records). "I have been told that I should switch to SPOON to do this job for better performance". Is this statement true?
If so, I am curious to know how it is faster compared to the Java program.
What algorithm or logic does it contain to make it faster?
My understanding was that it is a tool to make development faster for tasks like migration, setting up schedulers, mapping data, pulling logs, etc.
I appreciate your response.

Hadoop MapReduce Out of Memory on Small Files

I'm running a MapReduce job against about 3 million small files on Hadoop (I know, I know, but there's nothing we can do about it - it's the nature of our source system).
Our code is nothing special - it uses CombineFileInputFormat to wrap a bunch of these files together, then parses the file name to add it into the contents of the file, and spits out some results. Easy peasy.
So, we have about 3 million ~7kb files in HDFS. If we run our task against a small subset of these files (one folder, maybe 10,000 files), we get no trouble. If we run it against the full list of files, we get an out of memory error.
The error comes out on STDOUT:
#
# java.lang.OutOfMemoryError: GC overhead limit exceeded
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 15690"...
I'm assuming what's happening is this - whatever JVM is running the process that defines the input splits is getting totally overwhelmed trying to handle 3 million files, it's using too much memory, and YARN is killing it. I'm willing to be corrected on this theory.
So, what I need to know is how to increase the YARN memory limit for the container that's calculating the input splits, not for the mappers or reducers. Then, I need to know how to make this take effect. (I've Googled pretty extensively on this, but with all the iterations of Hadoop over the years, it's hard to find a solution that works with the most recent versions...)
This is Hadoop 2.6.0, using the MapReduce API, YARN framework, on AWS Elastic MapReduce 4.2.0.
I would spin up a new EMR cluster and throw a larger master instance at it to see if that is the issue.
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.4xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge
If the master is running out of memory when calculating the input splits, you can modify the configuration:
EMR Configuration
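If the process that is dying turns out to be the MapReduce ApplicationMaster (the YARN container that holds the bookkeeping for all of those splits and tasks) rather than the master node itself, one hedged thing to try is raising the AM's container size and heap before submitting the job. The property names below are standard MRv2 settings; the values and the driver class name are guesses you would need to tune for your instance type.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SmallFilesJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give the MapReduce ApplicationMaster a bigger YARN container and a bigger
        // JVM heap. These values are examples only; the heap must stay below the
        // container size.
        conf.set("yarn.app.mapreduce.am.resource.mb", "8192");
        conf.set("yarn.app.mapreduce.am.command-opts", "-Xmx6553m");

        Job job = Job.getInstance(conf, "3M small files job");
        // ... the rest of your existing job setup (input format, mapper, reducer) ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}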
Instead of running MapReduce on 3 million individual files, you can merge them into manageable bigger files using any of the following approaches.
1. Create Hadoop Archive (HAR) files from the small files.
2. Create a sequence file for every 10K-20K files using a MapReduce program.
3. Create a sequence file from your individual small files using the forqlift tool.
4. Merge your small files into bigger files using Hadoop-Crush.
Once you have the bigger files ready, you can run the MapReduce on your whole data set.
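As a rough illustration of options 2 and 3, here is a hedged, single-process sketch (not a MapReduce program) that packs a directory of small files into one SequenceFile, keyed by the original file name so the existing "parse the file name" logic still has something to work with. The class name and argument layout are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path(args[0]);  // directory of small files
        Path outFile = new Path(args[1]);   // resulting sequence file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(outFile),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            for (FileStatus status : fs.listStatus(inputDir)) {
                if (status.isDirectory()) {
                    continue;
                }
                byte[] contents = new byte[(int) status.getLen()];
                try (FSDataInputStream in = fs.open(status.getPath())) {
                    in.readFully(contents);
                }
                // key = original file name, value = raw file contents
                writer.append(new Text(status.getPath().getName()),
                        new BytesWritable(contents));
            }
        }
    }
}

At 3 million files you would want to run something like this per folder (or as a real MapReduce job, as the answer suggests) rather than in one pass from a single client.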

hadoop cluster is using only master node or all nodes

I have created a 4-node Hadoop cluster. I start all the datanodes, the namenode, the resource manager, etc.
To find out whether all of my nodes are working or not, I tried the following procedure:
Step 1. I run my program when all nodes are active.
Step 2. I run my program when only the master is active.
The completion time in both cases was almost the same.
So, I would like to know if there is any means by which I can find out how many nodes are actually used while running the program.
Discussed in the chat. The problem is caused by an incorrect Hadoop installation; in both cases the job was started locally using the LocalJobRunner.
Some recommendations:
Install Hadoop using Ambari (http://ambari.apache.org/)
Change platform to CentOS 6.4+
Use Oracle JDK 7
Be patient with host names and firewall
Get familiar with the cluster commands for health diagnostics and default Hadoop WebUIs
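As one extra diagnostic (not from the chat, just a suggestion): a tiny driver like the hypothetical one below prints the settings that decide whether a job goes to the LocalJobRunner or to YARN. If mapreduce.framework.name comes back as "local", only the machine you launch from will ever do any work, no matter how many nodes are up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkCheck {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath; creating a Job pulls in
        // mapred-site.xml and yarn-site.xml as well.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "framework check");

        // "local" means LocalJobRunner (single JVM); "yarn" means the job will be
        // submitted to the cluster's ResourceManager.
        System.out.println("mapreduce.framework.name = "
                + job.getConfiguration().get("mapreduce.framework.name", "local"));
        System.out.println("fs.defaultFS = "
                + job.getConfiguration().get("fs.defaultFS", "file:///"));
        System.out.println("yarn.resourcemanager.address = "
                + job.getConfiguration().get("yarn.resourcemanager.address"));
    }
}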

Cassandra and Pig integration - Is hadoop optional?

I'm trying to set up a trial cassandra + pig cluster. The cassandra wiki makes it sound like you need hadoop to integrate with pig.
But the README in cassandra-src/contrib/pig makes it sound like you can run Pig on Cassandra without Hadoop.
If hadoop is optional, what do you lose by not using it?
Hadoop is only optional when you are testing things out. In order to do anything at any scale you will need hadoop as well.
Running without hadoop means you are running pig in local mode. Which basically means all the data is processed by the same pig process that you are running in. This works fine with a single node and example data.
When running with any significant amount of data or multiple machines you want to run pig in hadoop mode. By running hadoop task trackers on your cassandra nodes pig can take advantage of the benefits map reduce provides by distributing the workload and using data locality to reduce network transfer.
It's optional. Cassandra has its own implementations of Pig's LoadFunc and StoreFunc, which allow you to query and store data.
Hadoop and Cassandra are different in many ways. It's hard to say what you lose without knowing what exactly you are trying to accomplish.

hadoop beginners question

I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture, but am finding it hard to tell whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours).
We are looking for a way to cut those 5 hours to a shorter time.
Where would hadoop fit into this picture? Can we still continue to use Oracle even after hadoop?
The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology - no matter how fresh and cool it may be - until you have exhausted the capabilities of your current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.
Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
The Hadoop distributed filesystem supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean, can you achieve the summaries you need using the key/value pairs MapReduce limits you to using?
Hadoop requires a cluster of machines to run. Do you have the hardware to support a cluster? This usually comes down to how much data you are storing on HDFS and how fast you want to process it. Generally, when running MapReduce on a Hadoop cluster, the more machines you have, the more data you can store and the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.
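For that last point, here is a hedged sketch of one way to push reducer output straight into Oracle using Hadoop's DBOutputFormat. The DAILY_SUMMARY table, its two columns, the connection string, and the credentials are all hypothetical, the mapper/reducer that produce (SummaryRecord, NullWritable) pairs are omitted, and you would still need the Oracle JDBC driver on the job's classpath. Plain JDBC inserts from a post-processing step work just as well if the summary data is small.

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class OracleSummaryJob {

    // Hypothetical record matching a DAILY_SUMMARY(category, total) table.
    public static class SummaryRecord implements DBWritable {
        private String category;
        private long total;

        public SummaryRecord() {}

        public SummaryRecord(String category, long total) {
            this.category = category;
            this.total = total;
        }

        @Override
        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setString(1, category);
            stmt.setLong(2, total);
        }

        @Override
        public void readFields(ResultSet rs) throws SQLException {
            category = rs.getString(1);
            total = rs.getLong(2);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical connection details; configure before creating the Job.
        DBConfiguration.configureDB(conf,
                "oracle.jdbc.OracleDriver",
                "jdbc:oracle:thin:@dbhost:1521:ORCL",
                "scott", "tiger");

        Job job = Job.getInstance(conf, "nightly summary");
        job.setJarByClass(OracleSummaryJob.class);
        // Mapper/Reducer that emit (SummaryRecord, NullWritable) pairs go here.
        job.setOutputFormatClass(DBOutputFormat.class);
        job.setOutputKeyClass(SummaryRecord.class);
        job.setOutputValueClass(NullWritable.class);
        DBOutputFormat.setOutput(job, "DAILY_SUMMARY", "category", "total");

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}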
