I'm trying to set up a trial Cassandra + Pig cluster. The Cassandra wiki makes it sound like you need Hadoop to integrate with Pig,
but the README in cassandra-src/contrib/pig makes it sound like you can run Pig on Cassandra without Hadoop.
If Hadoop is optional, what do you lose by not using it?
Hadoop is only optional when you are testing things out. To do anything at any scale you will need Hadoop as well.
Running without Hadoop means you are running Pig in local mode, which basically means all the data is processed by the single Pig process you launched. This works fine with a single node and example data.
When running with any significant amount of data or multiple machines, you want to run Pig in Hadoop mode. By running Hadoop task trackers on your Cassandra nodes, Pig can take advantage of the benefits MapReduce provides: distributing the workload and using data locality to reduce network transfer.
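To make the data locality point a bit more concrete, here's a rough toy model in plain Python. It is not Hadoop's actual scheduler, and the node names, range names, and the `assign_tasks` helper are all made up for illustration. It only shows why putting task trackers on the nodes that hold the data avoids remote reads.

```python
from collections import defaultdict

def assign_tasks(split_locations, workers):
    """Assign each input split to a worker, preferring one that stores it.

    split_locations: dict of split id -> set of node names holding that data
    workers: list of node names that can run tasks
    Returns (assignment dict, number of splits that must be read remotely).
    """
    load = defaultdict(int)  # tasks assigned to each worker so far
    assignment = {}
    remote_reads = 0
    for split in sorted(split_locations):
        local = [w for w in workers if w in split_locations[split]]
        if not local:
            remote_reads += 1  # no worker holds this split, data crosses the network
        candidates = local or workers
        # Pick the least-loaded candidate; break ties by name for determinism.
        chosen = min(candidates, key=lambda w: (load[w], w))
        assignment[split] = chosen
        load[chosen] += 1
    return assignment, remote_reads

# Hypothetical layout: three storage nodes, each holding some data ranges.
splits = {
    "range-0": {"node-a"},
    "range-1": {"node-b"},
    "range-2": {"node-c"},
    "range-3": {"node-a"},
}

# Task trackers running on the storage nodes themselves.
colocated, colocated_remote = assign_tasks(splits, ["node-a", "node-b", "node-c"])

# Task trackers running on separate machines that hold no data.
separate, separate_remote = assign_tasks(splits, ["worker-1", "worker-2"])
```

In the co-located case every split is processed where it lives; in the separate case every split has to be pulled over the network first.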
It's optional. Cassandra has its own implementations of Pig's LoadFunc and StoreFunc, which allow you to query and store data.
Hadoop and Cassandra are different in many ways. It's hard to say what you lose without knowing exactly what you are trying to accomplish.
I need a simulator to run some servers on Hadoop. It should:
Be able to work with a database.
Let me run a Java program on it and see the results.
Run Hadoop without MapReduce.
You don't run servers on Hadoop. It's the other way around.
If you want to create a Hadoop environment without installing Hadoop yourself, you can download a virtual machine or open an account with any of the major cloud providers.
Hadoop just starts YARN and HDFS. If you want to run code that isn't MapReduce, you'll need to find and install another tool such as Spark, Pig, Hive, or Flink. Each of these can be used to query databases, but none of them is a database itself.
Is it possible to run a Hadoop MapReduce program without a cluster? I'm just trying to fiddle around a little with map/reduce for educational purposes, so all I want is to run a few MapReduce programs on my computer. I don't need any job splitting across multiple nodes, performance boosts, or anything like that. Do I still need to run a VM to achieve this? I am using IntelliJ Ultimate, and I'm trying to run a simple WordCount. I believe I've set up all the necessary libraries and the entire project, but upon running I get this exception:
Exception in thread "main" java.io.IOException: Cannot initialize Cluster.
Please check your configuration for mapreduce.framework.name and the correspond server addresses.
I've found some posts saying that the entire map/reduce process can be run locally on the JVM, but I haven't yet found out how to do it.
The installation tutorial for "pseudo-distributed" mode specifically walks you through the installation of a single-node Hadoop cluster.
There's also the "Mini cluster", which you'll find some Hadoop projects use for unit and integration tests.
I feel like you're really just asking whether you need HDFS or YARN, though, and the answer is no: Hadoop can read file:// prefixed file paths from disk, with or without a cluster.
Keep in mind that splitting is not just between nodes, but also between multiple cores of a single computer. If you're not doing any parallel processing, there's not much reason to use Hadoop other than to learn the API semantics.
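If the goal is mainly to get a feel for the map, shuffle, and reduce phases, here's a minimal sketch in plain Python. To be clear, this is not the Hadoop API and does not involve Hadoop at all; the function names are just ones I picked to mirror the phases of a word count job.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a total."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts["the"], counts["fox"])  # prints "3 2"
```

Everything here runs in one process on one core, which is exactly the situation where the framework itself buys you nothing beyond learning the programming model.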
Aside: from an "educational perspective", in my career thus far I've found more people writing Spark than MapReduce, and not many jobs asking specifically for MapReduce code.
I am developing a Java-based application and I decided to use the machine learning algorithms implemented in the Mahout library. My application will run on a single machine, without Hadoop.
I would like to ask whether single-node Mahout also has overhead, like the distributed one. I read in the book Mahout in Action that Mahout on a multi-node cluster has some overhead (initializing, transferring data, etc.). But if we use Mahout algorithms without the MapReduce paradigm, there should be no overhead, right?
It does not make a difference whether you run it on a single machine or a 1000-node cluster. Hadoop serializes all the intermediate data (the map phase's key-value output) and persists it to disk. In the reduce phase, it loads the key-value pairs back into memory. It therefore has large processing and disk-access overheads.
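As a rough illustration of that round trip, here is a toy sketch in plain Python. It is not how Hadoop or Mahout is implemented; `pickle` and a temporary file stand in for Hadoop's serialization and spill files, and the ratings data is made up. Both paths compute the same per-user average, but one of them writes every intermediate pair to disk and reads it back first.

```python
import os
import pickle
import tempfile
from collections import defaultdict

def average_by_key(groups):
    return {key: sum(values) / len(values) for key, values in groups.items()}

def run_in_memory(records):
    """Group and reduce directly, with no serialization step."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return average_by_key(groups)

def run_via_disk(records):
    """Serialize the map output to a file, then load it back for the reduce."""
    with tempfile.TemporaryDirectory() as spill_dir:
        path = os.path.join(spill_dir, "map-output.bin")
        with open(path, "wb") as spill:
            for pair in records:
                pickle.dump(pair, spill)       # extra CPU and disk write per pair
        groups = defaultdict(list)
        with open(path, "rb") as spill:
            while True:
                try:
                    key, value = pickle.load(spill)  # extra disk read per pair
                except EOFError:
                    break
                groups[key].append(value)
    return average_by_key(groups)

# Hypothetical (user, rating) records.
records = [("u1", 4.0), ("u2", 3.0), ("u1", 2.0)]
same_result = run_in_memory(records) == run_via_disk(records)
```

The results are identical; the only difference is the extra serialize, write, read, and deserialize work, which is paid no matter how many machines are involved.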
Basically, if you have only a few machines (e.g. fewer than 7), Hadoop is possibly not a good choice, especially for speedup analysis. In this case, you may just use the small cluster to check your code's logic before deploying it on a larger environment.
I have been using HBase for months and I have loaded an HBase table with more than 6GB of data. When I try scanning the rows using the Java client, it hangs and reports the following error:
Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs
Furthermore, if I log in to the shell and scan, it works perfectly, and even the Java client scanner works fine for an HBase table with a small amount of data.
Any workaround for this?
For large data you can write MapReduce code. Simple Java programs are not really very effective when it comes to big data. You can also look into Pig scripts to achieve that.
Check out these for further help:
http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/
http://wiki.apache.org/hadoop/Hbase/MapReduce
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html
Alternatively, you can try Pig scripts, which run as MapReduce programs:
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/backend/hadoop/hbase/HBaseTableInputFormat.html
One more option is to increase the HBase timeout property and try again. For the various HBase configuration settings, you can refer to:
http://hbase.apache.org/docs/r0.20.6/hbase-conf.html
But when it comes to large data, MapReduce code is generally the better choice, and you can also search for optimization guidelines and best practices for HBase.
I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture but am finding it hard to tell whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:
We use Oracle for backend
Java (Struts2/Servlets/iBatis) for frontend
Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours)
We are looking for a way to cut those 5 hours to a shorter time.
Where would hadoop fit into this picture? Can we still continue to use Oracle even after hadoop?
The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.
Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing your application in a new technology, no matter how fresh and cool it may be, until you have exhausted the capabilities of your current architecture.
If you want some specific advice on how to tune your batch query, well that would be a new question.
Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:
Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?
Is my job parallelizable? (If your 5 hour batch process consists of 30 10-minute tasks that have to be run in sequence, Hadoop will not help you).
Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
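To put rough numbers on the parallelizability question, here's a back-of-the-envelope sketch in Python. The `elapsed_minutes` helper and the figure of 10 workers are assumptions for illustration, and it ignores job startup, data transfer, and scheduling overhead entirely.

```python
import math

def elapsed_minutes(num_tasks, minutes_per_task, workers, sequential):
    """Idealized wall-clock time for a batch of equally sized tasks."""
    if sequential:
        # Each task waits for the previous one, so extra workers sit idle.
        return num_tasks * minutes_per_task
    # Independent tasks run in "waves" of up to `workers` tasks at a time.
    waves = math.ceil(num_tasks / workers)
    return waves * minutes_per_task

# 30 ten-minute tasks on a hypothetical 10-worker cluster.
chained = elapsed_minutes(30, 10, workers=10, sequential=True)
independent = elapsed_minutes(30, 10, workers=10, sequential=False)
print(chained, independent)  # prints "300 30"
```

If the 30 tasks must run in sequence, the job still takes 300 minutes (the original 5 hours) regardless of cluster size; only if they are independent does the idealized time drop to 30 minutes.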
As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.
The Hadoop Distributed File System (HDFS) supports highly parallel batch processing of data using MapReduce.
So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean, can you achieve the summaries you need using the key/value pairs MapReduce limits you to using?
Hadoop requires a cluster of machines to run. Do you have hardware to support a cluster? This usually comes down to how much data you are storing on HDFS and also how fast you want to process the data. Generally, when running MapReduce on Hadoop, the more machines you have, the more data you can store or the faster you can run a job. Having an idea of the amount of data you process each night would help a lot here.
You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.