Can anybody tell me whether it is necessary to know Java to learn Hadoop?
If anyone is working on Hadoop, please tell me what is required to get a job in Hadoop.
What is the exact use of Hadoop?
What was there before Hadoop?
What is the difference between HDFS and GFS?
I know these are a lot of questions, but if anyone can help, that would be great for me.
Thanks a lot guys.
What is the exact use of Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed storage and processing of large data sets across clusters of computers using simple programming models. Refer to the documentation at Apache-Hadoop.
Hadoop provides a highly scalable, cost-effective, fast, flexible, and resilient data storage and analytics platform.
Is it necessary to know Java for learning Hadoop?
Not required. But if you are looking at optimizing your MapReduce jobs, Java provides that flexibility.
Alternatives if you are not interested in Java:
PIG: A high-level data-flow language and execution framework for parallel computation. It is recommended for people who are experts in scripting languages like Python.
HIVE: A data warehouse infrastructure that provides data summarization and ad-hoc querying. It is recommended for people who are experts in SQL programming, as the syntax is similar to SQL.
PIG and HIVE hide the complexity of MapReduce jobs from developers. They provide a higher level of abstraction for solving business problems.
Both PIG and HIVE translate scripts/queries into a series of MapReduce jobs. On the performance front, they are generally less efficient than hand-written MapReduce jobs implemented in Java.
Refer to this article for Java Alternatives
On the job front, it depends on your expertise and your choice of ecosystem components within the Hadoop framework, so it is difficult to answer.
Before Hadoop, there was no good framework/platform that could provide the same advantages for Big Data. To implement word-count-style programs, you had to write your own scripts, execute them, and consolidate the results from the data nodes yourself.
You can see a comparison between GFS and HDFS at GFS Vs HDFS. Get a good insight into HDFS at HDFS design.
Hadoop is a distributed computing framework. It is a de facto standard for data management (distributed storage + distributed processing). So Hadoop is a technology for everyone involved in the data management life cycle (capturing, storage, processing, and reporting). Hadoop is used by the following roles:
Admin
Developer
Data Analyst
Data Scientist
Business Analyst
Functional Consultant
etc...
Though Hadoop and most of its ecosystem are written in Java, it is used by all kinds of people in the enterprise, so several interfaces are needed to reach that whole audience and increase adoption.
The Hadoop Project Management Committee initiated several projects to support non-Java programmers, non-programmers, SQL programmers, etc.
The following utilities and projects support all varieties of audience:
Hadoop Streaming: a utility offered by Hadoop that allows non-Java programmers to write MapReduce programs in other languages like Perl, PHP, Python, Shell, R, C, C++, Scala, Groovy, Ruby, etc.
Hadoop Streaming = Hadoop + Console (STDOUT/STDIN) + External Programs.
Hadoop Streaming is a bit slower than native Java MapReduce, but it is useful for integrating legacy code written in languages other than Java, and it is also good for integrating data science toolkits like R and Python with Hadoop (a minimal sketch appears after the full list of utilities below).
Several projects have been developed on top of Hadoop Streaming:
RHadoop : R and Hadoop
Dumbo : Python + Hadoop
mrjob : Python + Hadoop
Hadoop Pipes: a utility offered by Hadoop that allows non-Java programmers to write MapReduce programs in C++.
Pydoop: a Python module for writing MapReduce programs in Python. It internally uses Hadoop Pipes, so it is a Python wrapper over Hadoop Pipes.
Pig: offers a scripting-like language called Pig Latin to analyse your data by performing a series of transformations and aggregations. Pig Latin is easy to learn and is a data-flow language. It is the right tool for people who do not have a programming background.
Hive/Impala/Drill/Tajo/Presto/MRQL: all of these are distributed SQL engines over Hadoop. They offer a SQL-like query language for ad hoc queries and data summarisation. They are a good choice for SQL programmers, database analysts, and data warehouse programmers.
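To make the Hadoop Streaming option above concrete, here is a minimal word-count sketch in Python: two plain stdin/stdout filters. The file names and the streaming jar location are assumptions and will vary by installation.

# mapper.py - reads raw lines from stdin and emits "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word)

# reducer.py - Hadoop sorts mapper output by key, so equal words arrive together
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

# A typical invocation looks roughly like this (the jar path differs per distribution):
# hadoop jar hadoop-streaming.jar -input /in -output /out \
#   -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py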
All the above projects and utilities allow non-Java programmers to write their data analysis in the language of their choice. Hadoop with Java still has a great advantage: full control over the data in terms of key-value pairs.
The conclusion here is that we can do data analysis with Hadoop without Java programming.
Is Java necessary for Hadoop?
Hadoop is built in Java, but to work on Hadoop you do not require Java.
It is preferred if you know Java, because then you can code MapReduce directly. If you are not familiar with Java, you can focus your skills on Pig and Hive to get the same functionality; they are SQL-like tools, each with its own way of writing syntax. If you come from a coding background such as Python or C++, you can run that code on the JVM using libraries like Jython for Python.
How much Java is necessary for Hadoop?
If you want to learn Java only for Hadoop, here is a list of topics you must learn:
1. Basics of core Java
- Variables, classes, functions, inheritance, packages, error handling, flow control, arrays, APIs.
- Some useful String functions // useful for data filtering
- Collections and generics - ArrayList, Hashtable, etc.
For more detail about which Java topics you need to learn, check this article:
https://www.dezyre.com/article/-how-much-java-is-required-to-learn-hadoop/103
To learn about GFS and HDFS, check this paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.454.4159&rep=rep1&type=pdf
Related
I wrote a tree in Java that supports multiple child nodes and edges. I haven't had to worry about scaling this application up until now. The use case now requires that there be 100 or more instances of the tree to be searched and maintained.
I've done very minimal research in this domain. The first thing that comes to mind is Spark. As I understand it, though, Spark batches windows of events, almost removing the "in-stream" aspect. Time is very critical here.
So I was thinking of taking advantage of Hadoop's file system: indexing the tree across the cluster and using MapReduce for maintenance.
I don't mind a good read. If there are any articles, tutorials, and/or recommendations, that would be greatly appreciated.
Cheers
Hadoop and Spark are both distributed processing systems. Spark was designed to overcome the drawbacks of the Hadoop system.
Hadoop has two parts: a storage system called HDFS and a processing model called MapReduce. Spark was developed by analysing the drawbacks of MapReduce; hence RDDs (resilient distributed datasets) were introduced in Spark for in-memory distributed processing. More information can be found in Apache Spark and Jacek.
We can use the powerful Hadoop filesystem together with Spark processing as well.
If you choose Spark, you would learn functional programming with Scala, Python, or R. Hadoop is tied to the MapReduce model, which is a bit more complex to follow (see the small PySpark sketch below).
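As a small illustration of the Spark side, here is a minimal PySpark sketch that reads edge records from HDFS and groups them in memory with RDD transformations; the HDFS path and the "parent child" line format are placeholders invented for this example.

from pyspark import SparkContext

sc = SparkContext(appName="tree-rdd-sketch")  # cluster settings come from spark-submit
# each input line is assumed to be "parent child"; the path is a placeholder
lines = sc.textFile("hdfs:///data/tree/edges.txt")
edges = lines.map(lambda line: tuple(line.split()[:2]))
# group children under each parent; the RDD stays distributed and in memory
children_by_parent = edges.groupByKey().mapValues(list)
print(children_by_parent.take(5))
sc.stop()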
And there are APIs for trees in Scala, and there is work underway as well, for example this and this.
I hope this is helpful.
I have to do analysis of data that would be about 50 TB. I have been searching for some time but am still confused about which to use.
While searching I came across a few points about Node: that it is weaker for heavy computation and numerical analysis and is better suited to smaller data sets.
Is that true?
We have to design complex statistical algorithms and display the results in a web browser.
We will be using Logstash and Elasticsearch for filtering and storing data.
So which language would be the better choice: Java or Node?
Definitely Java. This kind of thing is Java's domain. Hadoop, Elasticsearch, Lucene, Cassandra, and Solr are all written in Java. Spark and Storm also run on the Java Virtual Machine. If you intend to use any of these tools, Java would be a first-class language.
Node may be useful for implementing the server side of the front end, enabling designers to use JavaScript on both the client and server. But as far as web server speed and scalability are concerned, Java is faster too, according to the TechEmpower benchmarks.
I need a distributed text clustering framework that supports algorithms which work on the complete set of documents. Applications like carrot2 (http://project.carrot2.org/) work on a set of documents with in-memory computation, hence they are time-consuming and not very performance-efficient. If text clustering algorithms like Lingo, STC, kNN, etc. could run in a distributed environment, they would be much faster.
Is there any framework using open-source tools like Hazelcast (http://www.hazelcast.com/), or is there any specific approach that is faster and more performance-efficient?
Apache Mahout is what you're looking for.
There are a few tools which do this; Mahout is one of them. Mahout supports three families of machine learning algorithms: recommendation, clustering, and classification. The Mahout in Action book by Manning does a very good job of explaining this. Refer to the blog post that walks through a use case of how Mahout and the Hadoop distributed file system work together; the example is more focused on a recommendation engine, but it can be applied to clustering as well, as described in Mahout in Action, chapter 7. As a precursor to this, I have also written a component architecture of how these tools fit together for a data mining problem.
Mahout works in standalone mode as well as with Hadoop. The decision to use one or the other boils down to the size of the historical data that needs to be mined. If the data size is on the order of terabytes or petabytes, you typically use Mahout with Hadoop. Weka is another similar open-source project. All of these fall under the category of machine learning frameworks. I hope it helps.
Sorry in advance if this is a basic question. I'm reading a book on HBase and learning, but most of the examples in the book (as well as online) tend to use Java (I guess because HBase is native to Java). There are a few Python examples, and I know I can access HBase with Python (using Thrift or other modules), but I'm wondering about the additional features.
For example, HBase has a 'coprocessors' feature that pushes the data to where you're doing your computing. Does this kind of thing work with Python or other apps that use streaming Hadoop jobs? It seems that with Java it can know what you're doing and manage the data flow accordingly, but how does this work with streaming? If it doesn't work, is there a way to get this type of functionality via streaming, without switching to another language?
Maybe another way of asking this is: what can a non-Java programmer do to get all the benefits of Hadoop's features when streaming?
Thanks in advance!
As far as I know, you are talking about two (or more) totally different concepts.
"Hadoop Streaming" is there to stream data through your executable (independent of your choice of programming language). When using streaming there isn't really any loss of functionality, since the functionality is basically to map/reduce the data you are getting from the Hadoop stream.
For the Hadoop part you can even use the Pig or Hive big-data query languages to get things done efficiently. With the newest versions of Pig you can even write custom functions in Python and use them inside your Pig scripts (a short sketch follows).
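As a rough sketch of that Pig-plus-Python point, a Jython UDF can look like the following; the outputSchema decorator is supplied by Pig's Jython runtime when the script is registered, and the function and field names here are invented for the example.

# udfs.py - a tiny Jython UDF for Pig
@outputSchema("normalized:chararray")
def normalize(token):
    # lower-case and trim so that grouping in the Pig script is consistent
    if token is None:
        return None
    return token.strip().lower()

# Registered and used from a Pig script roughly like:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   cleaned = FOREACH raw GENERATE myudfs.normalize(word);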
Although there are tools that let you use the language you are comfortable with, never forget that the Hadoop framework is mostly written in Java. There may be times when you need to write a specialized InputFormat, a UDF inside Pig, etc.; then a decent knowledge of Java comes in handy.
Your "HBase coprocessors" example is somewhat unrelated to the streaming functionality of Hadoop. HBase coprocessors consist of two parts: a server-side part and a client-side part. I am pretty sure some useful server-side coprocessors ship with the HBase release, but beyond that you would need to write your own coprocessor (and bad news: it's Java). On the client side, you should be able to use them from your favorite programming language through Thrift without too much trouble.
So as an answer to your question: you can always dodge learning Java and still use Hadoop to its potential (using third-party libraries/applications). But when the shit hits the fan, it's better to understand the underlying internals and be able to develop in Java. Knowing Java gives you full control over the Hadoop/HBase environment.
Hope you find this helpful.
Yes, you should get data-local code execution with streaming. You do not push the data to where the program is; you push the program to where the data is. Streaming simply takes the local input data and feeds it through stdin to your Python program. Instead of each map running inside a Java task, it spins up an instance of your Python program and just pumps the input through that.
If you really want fast processing, though, you should learn Java. Having to pipe everything through stdin and stdout adds a lot of overhead.
I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.
Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:
; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))
; define a task to be executed
(deftask my-task (my-function arg1 arg2))
; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))
; do something with the results:
(some-function (get-results my-job))
Bonus if it can do something like Map-Reduce on the cluster as well.....
What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?
UPDATE:
Thanks for all the suggestions of Apache Hadoop - it looks like it might fit the bill; however, it seems a bit like overkill, since I don't need a distributed data storage system like Hadoop uses (i.e. I don't need to process billions of records)... something more lightweight and focused on compute tasks only would be preferable, if it exists.
Hadoop is the basis for almost all of the large-scale big data excitement in the Clojure world these days, though there are better ways than using Hadoop directly.
Cascalog is a very popular front end:
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
Also check out Amit Rathor's swarmiji, a distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)
Although I haven't gotten to use it yet, I think that Storm is something that you might find useful to explore:
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Hadoop is exactly what you need: Apache Hadoop
Storm might suit your needs better than Hadoop, as it has no distributed data storage and has low latency. It's possible to split up and process data in a way similar to MapReduce; the Trident API makes this very simple.
It is partly written in Clojure, so I suppose Clojure interop is easier.
Another option is Onyx, which offers similar functionality but is a pure Clojure-based project.