I wrote a tree in Java that supports nodes with multiple children and edges. I haven't had to worry about scaling this application until now. The use case now requires 100 or more instances of the tree to be searched and maintained.
I've done very minimal research in this domain. The first thing that comes to mind is Spark. As I understand it, though, Spark batches events into windows, which almost removes the "in stream" aspect. Time is very critical here.
So I was thinking of taking advantage of Hadoop's file system: indexing the tree across the cluster and using MapReduce to do maintenance.
I don't mind a good read. If there are any articles, tutorials, or recommendations, they would be greatly appreciated.
Cheers
Hadoop and Spark are both distributed processing systems. Spark was designed to overcome the drawbacks of the Hadoop system.
Hadoop has two parts: a storage system called HDFS and a processing model called MapReduce. Spark was developed after analyzing the drawbacks of MapReduce; hence RDDs (resilient distributed datasets) were introduced in Spark for in-memory distributed processing. More information can be found at Apache Spark and Jacek.
We can use the powerful Hadoop file system (HDFS) with Spark processing as well.
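For illustration, here is a minimal sketch of that combination using Spark's Java API; the HDFS path, class name, and application name are placeholders I made up, not anything from the question:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HdfsWithSpark {
    public static void main(String[] args) {
        // The master URL would normally be supplied by spark-submit
        SparkConf conf = new SparkConf().setAppName("HdfsWithSpark");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read a file stored in HDFS into an RDD (the path is a placeholder)
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/data/input.txt");

        // The RDD is partitioned across the cluster and processed in memory
        long nonEmpty = lines.filter(line -> !line.isEmpty()).count();
        System.out.println("Non-empty lines: " + nonEmpty);

        sc.stop();
    }
}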
If you choose Spark, you would learn functional programming with Scala, Python, or R. Hadoop depends on the MapReduce algorithm, which is a bit more complex to follow.
And there are APIs for trees in Scala, and there is work underway too, for example this and this.
I hope this is helpful.
Can anybody tell me whether it is necessary to know Java to learn Hadoop?
If anyone is working on Hadoop, please tell me what is required to get a job in Hadoop.
What is the exact use of Hadoop?
What was there before Hadoop?
What is the difference between HDFS and GFS?
I know there are lots of questions, but if anyone can help, that would be great for me.
Thanks a lot, guys.
What is the exact use of Hadoop?
The Apache Hadoop software library is a framework that allows for the distributed storage and processing of large data sets across clusters of computers using simple programming models. Refer to the documentation at Apache-Hadoop.
Hadoop provides a highly scalable, cost-effective, fast, flexible, and resilient data storage and analytics platform.
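To make the storage side concrete, here is a minimal sketch that reads a file from HDFS with Hadoop's Java FileSystem API; the HDFS path and class name are placeholders, and the cluster settings are assumed to come from the usual configuration files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Open and print a file stored in HDFS (the path is a placeholder)
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/demo/input.txt"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}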
Is it necessary to know Java for learning Hadoop?
Not required. But if you are looking at optimizing your MapReduce jobs, Java provides that flexibility.
Alternatives if you are not interested in Java:
PIG: A high-level data-flow language and execution framework for parallel computation. It is recommended for people who are experts in scripting languages like Python.
HIVE: A data warehouse infrastructure that provides data summarization and ad hoc querying. It is recommended for people who are experts in SQL programming, as the syntax is similar to SQL.
PIG and HIVE hide the complexity of MapReduce jobs from developers. They provide a higher level of abstraction for solving business problems.
Both PIG and HIVE translate scripts/queries into a series of MapReduce jobs. On the performance front, they are not as efficient as traditional MapReduce jobs implemented in Java.
Refer to this article for Java alternatives.
On the job front, it depends on your expertise and your choice of ecosystem within the Hadoop framework. It is difficult to answer.
Before Hadoop, we did not have a framework/platform that provided the same advantages as Hadoop for Big Data. To implement word-count-style programs, you had to write scripts, execute them, and consolidate the results from the data nodes yourself.
You can see a comparison between GFS and HDFS at GFS Vs HDFS. Get a good insight into HDFS at HDFS design.
Hadoop is a distributed computing framework. It is a de facto standard for data management (distributed storage + distributed processing), so Hadoop is a technology for everyone involved in the data management life cycle (capturing, storage, processing, and reporting). Hadoop is used by people in the following roles:
Admin
Developer
Data Analyst
Data Scientist
Business Analyst
Functional Consultant
etc...
Though Hadoop and most of its ecosystem are written in Java, it is used by all kinds of people in the enterprise, so several interfaces are needed to reach all of that audience and to increase adoption.
The Hadoop Project Management Committee initiated several projects to support non-Java programmers, non-programmers, SQL programmers, etc.
The following utilities and projects support all varieties of audience:
Hadoop Streaming: a utility offered by Hadoop that allows non-Java programmers to write MapReduce programs in other languages such as Perl, PHP, Python, Shell, R, C, C++, Scala, Groovy, Ruby, etc.
Hadoop Streaming = Hadoop + console (STDOUT/STDIN) + external programs.
Hadoop Streaming is a bit slower than native Java MapReduce, but it is useful for integrating legacy non-Java code, and it is also good for integrating data science toolkits like R and Python with Hadoop.
There are several projects built on top of Hadoop Streaming:
RHadoop : R and Hadoop
Dumbo : Python + Hadoop
mrjob : Python + Hadoop
Hadoop Pipes: a utility offered by Hadoop that allows non-Java programmers to write MapReduce programs in C++.
Pydoop: a Python module for writing MapReduce programs in Python. It internally uses Hadoop Pipes, so it is a Python wrapper over Hadoop Pipes.
Pig: offers a scripting-like language called Pig Latin to analyse your data by performing a series of transformations and aggregations. Pig Latin is easy to learn and is a data-flow language. It is the right tool for people who do not have any programming background.
Hive/Impala/Drill/Tajo/Presto/MRQL: all of these are distributed SQL engines over Hadoop. They offer a SQL-like query language for ad hoc queries and data summarisation. They are a good choice for SQL programmers, database analysts, and data warehouse programmers.
All of the above projects and utilities allow non-Java programmers to write their data analysis in the language of their choice. That said, Hadoop with Java has a great advantage: full control over the data in terms of key-value pairs.
The conclusion here is that we can do data analysis with Hadoop without Java programming.
Is Java necessary for Hadoop?
Hadoop is built in Java, but to work on Hadoop you do not require Java.
It is preferred if you know Java, because then you can code MapReduce jobs yourself. If you are not familiar with Java, you can focus your skills on Pig and Hive to perform the same functionality; they are SQL-like tools with a different way of writing syntax. If you come from a coding background like Python or C++, you can bring that code onto the JVM using libraries like Jython for Python.
How much Java is necessary for Hadoop?
If you want to learn Java only for Hadoop, then here is a list of topics you must learn:
1. Basics of Core Java
- Variables, classes, functions, inheritance, packages, error handling, flow control, arrays, and the standard API.
- Some useful String functions (useful for data filtering).
- Collections and generics: ArrayList, Hashtable, etc.
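As a rough, Hadoop-free sketch of how those basics are typically used together, the snippet below filters lines with String methods and aggregates counts in a collection, the same map-then-aggregate pattern a MapReduce job expresses with mappers and reducers (the sample records are made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WordFilter {
    public static void main(String[] args) {
        // Made-up records standing in for lines of input data
        List<String> lines = new ArrayList<>();
        lines.add("ERROR disk full");
        lines.add("INFO job started");
        lines.add("ERROR network timeout");

        // String functions filter the data, a Map aggregates it
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            if (line.startsWith("ERROR")) {
                for (String word : line.toLowerCase().split("\\s+")) {
                    counts.put(word, counts.getOrDefault(word, 0) + 1);
                }
            }
        }
        // e.g. {error=2, disk=1, full=1, network=1, timeout=1} (order may vary)
        System.out.println(counts);
    }
}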
To get more detail about which topics you need to learn in Java, check this article:
https://www.dezyre.com/article/-how-much-java-is-required-to-learn-hadoop/103
To know about the GFS and HDFS, check this paper:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.454.4159&rep=rep1&type=pdf
I have written code in Java that works over a large amount of data. I want to distribute this to multiple machines so that each works on part of the data and the processing gets done more quickly. I have never worked on distributed computing before. Are there tools to get this done? Thanks.
GridGain works fairly well. Hadoop is a great one but needs more development work. Hazelcast could be a good outsider.
I can cite others too, but it's difficult to answer your question without knowing what types of data and processing are involved. Are the processes I/O-intensive or CPU-bound?
Another question is: how big is your dataset?
It seems like you want to run a map-reduce algorithm.
Hadoop is an open source project that provides a framework to do exactly that.
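For a feel of what that looks like, here is a condensed sketch of the classic word-count job written against Hadoop's Java MapReduce API; input and output paths are taken from the command line:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word in the input split
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}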
I need a distributed text clustering framework that supports algorithms which work on the complete set of documents. Applications like Carrot2 (http://project.carrot2.org/) work on a set of documents with in-memory computation, which is time consuming and not very performance efficient. If these kinds of text clustering algorithms (Lingo, STC, kNN, etc.) could run in a distributed environment, they would be much faster.
Is there any framework using open source tools like Hazelcast (http://www.hazelcast.com/), or is there a specific approach which is faster and more performance efficient?
Apache Mahout is what you're looking for.
There are a few tools which do this; Mahout is one of them. Mahout supports three categories of machine learning algorithms: recommendation, clustering, and classification. The "Mahout in Action" book by Manning does a very good job of explaining this. Refer to the blog which talks about a use case of how Mahout and the Hadoop distributed file system work together; the example is more focused on a recommendation engine, but it can be applied to clustering as well, as mentioned in "Mahout in Action", chapter 7. As a precursor to this, I have also written a component architecture of how each of these tools fits together for a data mining problem.
Mahout will work in standalone mode as well as with Hadoop. The decision to use either one boils down to the size of the historical data that needs to be mined. If the data size is on the order of terabytes or petabytes, you typically use Mahout with Hadoop. Weka is another similar open source project. All of these come under a category called machine learning frameworks. I hope this helps.
I have a list of URLs and I want to download them in order to create an index in webtrec format. I've found a useful framework called MapReduce (Apache Hadoop), but I'd like to know if there is a Java implementation of what I want to do, or maybe a close example of it.
Thank you!
The MapReduce pattern is a pattern for parallelizable, CPU-bound computations in multiple steps. Downloading and crawling web pages is an I/O-bound operation, so you should treat the two operations separately.
If performance is really that important, you should first use something like a queue with asynchronous I/O for downloading the web sites. In a second step, you can then use MapReduce to build the actual index.
Hadoop is one possibility, but if you're not targeting large scale, frameworks such as Fork/Join and Akka may be applicable as well.
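A minimal sketch of the first step might look like the following; it uses a plain JDK thread pool rather than true asynchronous NIO, and the URL list, class name, and output directory are placeholders. The downloaded files could then be fed to a MapReduce job that builds the index:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Downloader {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; in practice this list would come from your input file
        List<String> urls = Arrays.asList(
                "http://example.com/page1.html",
                "http://example.com/page2.html");

        // A bounded thread pool keeps many I/O-bound downloads in flight at once
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (String url : urls) {
            pool.submit(() -> {
                try (InputStream in = new URL(url).openStream()) {
                    Path out = Paths.get("downloads",
                            Integer.toHexString(url.hashCode()) + ".html");
                    Files.createDirectories(out.getParent());
                    Files.copy(in, out, StandardCopyOption.REPLACE_EXISTING);
                } catch (Exception e) {
                    System.err.println("Failed: " + url + " (" + e.getMessage() + ")");
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}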
I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.
Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:
; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))
; define a task to be executed
(deftask my-task (my-function arg1 arg2))
; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))
; do something with the results:
(some-function (get-results my-job))
Bonus if it can do something like Map-Reduce on the cluster as well.....
What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?
UPDATE:
Thanks for all the suggestions of Apache Hadoop - it looks like it might fit the bill; however, it seems a bit like overkill, since I don't need a distributed data storage system like Hadoop uses (i.e. I don't need to process billions of records). Something more lightweight and focused on compute tasks only would be preferable, if it exists.
Hadoop is the base for almost all the large scale big data excitement in the Clojure world these days, though there are better ways than using Hadoop directly.
Cascalog is a very popular front end:
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
Also check out Amit Rathor's swarmiji distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)
Although I haven't gotten to use it yet, I think Storm is something you might find useful to explore:
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Hadoop is exactly what you need: Apache Hadoop
Storm might suit your needs better than Hadoop, as it has no distributed data storage and has low latency. It's possible to split up and process data in a way similar to MapReduce, and the Trident API makes this very simple.
It is partly written in Clojure, so I suppose Clojure interop is easier.
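For a flavour of the Trident API mentioned above, here is a sketch of the canonical streaming word count using Storm's Java classes (which is what Clojure interop would call into); the package names assume Storm 1.x, and the fixed test sentences are just placeholders:

import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {

    // Splits each incoming sentence into individual words
    public static class Split extends BaseFunction {
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static void main(String[] args) {
        // A test spout that cycles through a few fixed sentences
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                new Values("the cow jumped over the moon"),
                new Values("four score and seven years ago"));
        spout.setCycle(true);

        // Split sentences into words, group by word, keep running counts in memory
        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                        new Fields("count"));

        // topology.build() would then be submitted to a LocalCluster or StormSubmitter
    }
}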
Another option is Onyx, which offers similar functionality but is a pure Clojure-based project.