I have written code in Java that works over a large data set. I want to distribute the work to multiple machines so that each works on part of the data and the processing gets done more quickly. I have never worked on distributed computing before. Are there tools to get this done? Thanks.
GridGain works fairly well. Hadoop is a great one but needs more development effort. Hazelcast could be a good outsider.
I could cite others too, but it's difficult to answer your question without knowing what kind of data and processing are involved. Are the processes I/O-intensive or CPU-bound?
One of the questions is "How big is your dataset?"
It seems like you want to run a map-reduce algorithm.
Hadoop is an open source project that provides a framework to do exactly that.
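For a sense of what a Hadoop job looks like in Java, here is a minimal word-count sketch using the org.apache.hadoop.mapreduce API. It is only a sketch: input and output paths come from the command line, it assumes a Hadoop 2.x client on the classpath, and your real processing would replace the tokenizing and counting.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (word, 1) for every token in the input line.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts that the mappers emitted for this word.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}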
I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the outputted documents.
I want to move to Java or a similar object-oriented solution so I can implement unit testing and generally get better control over the code. (I'm not talking about overhauling the whole system, just introducing Java where I can.)
In terms of data size, we're talking about around 1 TB of data being both ingested and created. In terms of scaling, this might increase by a factor of around 10, but it isn't likely to grow at the massive scale a worldwide web project might.
The question is - what tools would be most appropriate for this kind of project?
Where would I find this information - what search terms should be used?
Is doing processing on an SQL database (creating and dropping tables, adding columns, as needed) an appropriate, or awful, solution?
I've had a quick look at Hadoop - but due to the small scale of this project, would Hadoop be an unnecessary complication?
Are there any Java packages that provide functionality similar to SAS or SQL in terms of merging, joining, sorting, and grouping datasets, as well as modifying data?
It's hard for me to prescribe exactly what you need given your problem statement.
It sounds like a good database API might be all you need (i.e. native JDBC with a good open-source database backend).
However, I think you should take some time to check out Lucene. It's a fantastic tool and may meet your scoring needs very well. Taking a search engine indexing approach to your problem may be fruitful.
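If you go the JDBC route, here is a minimal sketch of the kind of merge/summarise step you might otherwise do in SAS, pushed down into SQL. The connection URL, credentials, and table/column names below are made up purely for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcAggregationExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical PostgreSQL connection details; substitute your own database.
    String url = "jdbc:postgresql://localhost:5432/warehouse";
    try (Connection conn = DriverManager.getConnection(url, "etl_user", "secret")) {

      // A join plus group-by: the kind of merge/summarise step SAS data steps often perform.
      String sql = "SELECT c.region, SUM(o.amount) AS total "
                 + "FROM orders o JOIN customers c ON o.customer_id = c.id "
                 + "GROUP BY c.region";

      try (PreparedStatement ps = conn.prepareStatement(sql);
           ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          System.out.println(rs.getString("region") + " -> " + rs.getBigDecimal("total"));
        }
      }
    }
  }
}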
I think the questions you need to ask yourself are:
What's the nature of your data set, and how often will it be updated?
What workload will you have on this 1 TB (or more) of data in the future? Will it be mainly offline read and analysis operations, or will there also be a lot of random writes?
Here is an article about deciding whether or not to use Hadoop, which I think is worth reading.
Hadoop is a better choice if your data set is only updated daily or weekly and the major operations on the data are read-only, along with further data analysis. For the merging, joining, sorting, and grouping operations you mentioned, Cascading is a Java library that runs on top of Hadoop and supports those operations well.
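To give a feel for Cascading, here is a minimal sketch (assuming the Cascading 2.x Hadoop API; the HDFS paths and field names are placeholders) that reads a tab-delimited file, groups it by one field, and writes the grouped records back out:

import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class GroupByExample {
  public static void main(String[] args) {
    // Source and sink taps over HDFS; field names and paths are made up.
    Tap source = new Hfs(new TextDelimited(new Fields("customer", "amount"), "\t"), "input/sales");
    Tap sink = new Hfs(new TextDelimited(Fields.ALL, "\t"), "output/sales-by-customer");

    // A pipe assembly that groups records by the "customer" field,
    // much like a SQL GROUP BY or a SAS BY-group step.
    Pipe head = new Pipe("sales");
    Pipe grouped = new GroupBy(head, new Fields("customer"));

    FlowDef flowDef = FlowDef.flowDef()
        .addSource(head, source)
        .addTailSink(grouped, sink);

    // Plan and run the flow as MapReduce jobs on the cluster.
    new HadoopFlowConnector(new Properties()).connect(flowDef).complete();
  }
}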
Could anybody recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?
For now I'm moving in the Scalding direction. The prototypes I've built let me combine the Scalding library with Maven and separate the Scalding job JAR from the 'library' code packages. This in turn lets me run Scalding-based Hadoop jobs from outside the cluster with minimal overhead per job (the 'library' code is pushed to the cluster's distributed cache only when it changes, which is rarely needed, so I can load job code quickly).
Now I'm actually starting to play with HBase itself, and I see that Scalding is good but not so 'native' to HBase. Yes, there are things like hbase-scalding, but since I'm at a point where I need to plan future work anyway, I'd like to know about other good solutions I may have missed.
What is expected:
Application (job) startup overhead should be low; I need to run a lot of them.
It should be possible (the easier, the better) to run jobs from outside the cluster without any SSH (just based on the 'hadoop jar' command, or even simply by executing the application).
The job language itself should allow short, logical semantics. Ideally this code should be simple enough to be generated automatically.
The solution should perform well on reasonably big HBase tables (initially up to 100,000,000 entries).
The solution should be 'live' (actively developed) but reasonably good in terms of general stability.
I think the reasoning here could be even more useful than the solution itself, and this question should give many people a couple of ideas.
Any piece of advice?
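For reference, this is roughly what direct access through the plain HBase Java client looks like - the kind of boilerplate I'd like a framework to hide (the table and column names here are placeholders, and this uses the Connection-based client API from more recent HBase releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PlainHBaseClient {
  public static void main(String[] args) throws Exception {
    // Cluster location is read from hbase-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("events"))) {

      // Write one cell: row "user1", family "d", qualifier "clicks", value "42".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("42"));
      table.put(put);

      // Read it back.
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"));
      System.out.println("clicks = " + Bytes.toString(value));
    }
  }
}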
If you're using Scalding (which I recommend), there's a new project with updated Cascading and Scalding wrappers for accessing HBase. You might want to check it out: https://github.com/ParallelAI/SpyGlass
HPaste http://www.gravity.com/labs/hpaste/ may be what you are looking for.
You may be interested in the Kiji project (https://github.com/kijiproject/). It provides a "schema-ed" layer on top of HBase.
It also has a Scalding adapter (KijiExpress) so that you can do functional collections operations (map, groupby, etc.) on "pipes" of tuples sourced from these schema-ed HBase tables.
Update (August 2014): Stratosphere is now called Apache Flink (incubating)
Check out Stratosphere. It offers a Scala API, has an HBase module, and is under active development.
Starting a job should be possible within a second or so (depending on your cluster size).
You can submit jobs remotely (it has a class called RemoteExecutor which allows you to programmatically submit jobs to remote clusters).
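As a rough sketch of remote submission (using the Apache Flink Java DataSet API rather than the Scala one; the host, port, and paths are placeholders, and ExecutionEnvironment.createRemoteEnvironment is used here as an alternative to driving RemoteExecutor directly):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class RemoteLineLengths {
  public static void main(String[] args) throws Exception {
    // Point the environment at the cluster's JobManager; the JAR contains this job's classes.
    ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "target/my-job.jar");

    DataSet<String> lines = env.readTextFile("hdfs:///data/input");

    // A trivial transformation: the program is written locally but executes on the remote cluster.
    lines.map(new MapFunction<String, Integer>() {
           @Override
           public Integer map(String line) {
             return line.length();
           }
         })
         .writeAsText("hdfs:///data/output");

    env.execute("line lengths");
  }
}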
Please contact me if you have further questions!
I am currently trying to maintain hbase-scalding in my free time, as I am also picking up Scala.
Please take a look at it on GitHub.
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about Map-Reduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
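Just to show the shape of it, a rough sketch of that servlet (the parameter names, JNDI names, and message format here are made up; in practice we'd cache the JNDI lookups rather than doing them per request):

import java.io.IOException;

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.naming.InitialContext;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class PageTagServlet extends HttpServlet {

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    // Parameters appended by the JavaScript page tag (names are hypothetical).
    String page = req.getParameter("page");
    String referrer = req.getParameter("ref");
    String payload = page + "|" + referrer + "|" + System.currentTimeMillis();

    Connection connection = null;
    try {
      // JNDI names are container-specific; use whatever your app server binds.
      InitialContext jndi = new InitialContext();
      ConnectionFactory factory = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
      Queue queue = (Queue) jndi.lookup("jms/AnalyticsQueue");

      connection = factory.createConnection();
      Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
      MessageProducer producer = session.createProducer(queue);
      producer.send(session.createTextMessage(payload)); // fire-and-forget, no user-facing latency
    } catch (Exception e) {
      // Log and swallow: an analytics failure should never break a page load.
    } finally {
      if (connection != null) {
        try { connection.close(); } catch (Exception ignore) { }
      }
    }

    // Respond with an empty 204 (or a 1x1 gif) so the tag request finishes quickly.
    resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
  }
}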
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification about what I'm looking for. I'm seeking advice on the architecture and design of a solution like this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics as close to real time as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad and leaves us thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc.), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
Finally, another option is to use something like Storm (which runs alongside Hadoop) to collect and analyze data in real time, then either store the results for querying as mentioned above or store them in HBase for display through a custom user interface that queries HBase directly.
Sorry in advance if this is a basic question. I'm reading a book on HBase and learning, but most of the examples in the book (as well as online) tend to use Java (I guess because HBase is native to Java). There are a few Python examples, and I know I can access HBase with Python (using Thrift or other modules), but I'm wondering about additional functionality.
For example, HBase has a 'coprocessors' feature that pushes the data to where you're doing your computing. Does this kind of thing work with Python or other apps that use streaming Hadoop jobs? It seems that with Java it can know what you're doing and manage the data flow accordingly, but how does this work with streaming? If it doesn't work, is there a way to get this kind of functionality via streaming, without switching to another language?
Maybe another way of asking this is: what can a non-Java programmer do to get all the benefits of Hadoop's features when using streaming?
Thanks in advance!
As far as I know, you are talking about two (or more) totally different concepts.
"Hadoop Streaming" is there to stream data through your executable (independent of your choice of programming language). When using streaming there shouldn't be any loss of functionality, since the functionality is basically to map/reduce the data you are getting from the Hadoop stream.
For the Hadoop part you can even use big-data query languages such as Pig or Hive to get things done efficiently. With the newest versions of Pig you can even write custom functions in Python and use them inside your Pig scripts.
Although there are tools that let you use the language you are comfortable with, never forget that the Hadoop framework is mostly written in Java. There could be times when you need to write a specialized InputFormat, or a UDF inside Pig, etc. Then a decent knowledge of Java will come in handy.
Your "HBase coprocessors" example is largely unrelated to the streaming functionality of Hadoop. HBase coprocessors consist of two parts: a server-side part and a client-side part. I am pretty sure some useful server-side coprocessors ship with HBase releases, but beyond that you would need to write your own coprocessor (and bad news: it's Java). On the client side, I'm sure you could use them from your favorite programming language through Thrift without too much trouble.
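To make the 'server-side part' concrete, here is a tiny region-observer sketch (written against the HBase 1.x coprocessor API, which I'm assuming here; the class gets attached to a table through its descriptor or hbase-site.xml):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Runs inside the region server, next to the data: this hook fires before every Put
// on whichever table the coprocessor is attached to.
public class AuditObserver extends BaseRegionObserver {

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                     Put put, WALEdit edit, Durability durability) throws IOException {
    // Server-side logic goes here: validate, transform, or index the incoming row.
    System.out.println("About to write row " + Bytes.toStringBinary(put.getRow()));
  }
}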
So as an answer to your question: you can always dodge learning Java and still use Hadoop to its potential (using third-party libraries/applications). But when the shit hits the fan it's better to understand the underlying machinery and be able to develop in Java. Knowing Java gives you full control over the Hadoop/HBase environment.
Hope you find this helpful.
Yes, you should get data-local code execution with streaming. You do not push the data to where the program is; you push the program to where the data is. Streaming simply takes the local input data and runs it through stdin to your Python program. Instead of each map running inside a Java task, it spins up an instance of your Python program and just pumps the input through that.
If you really want to do fast processing, though, you should learn Java. Having to pipe everything through stdin and stdout adds a lot of overhead.
I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.
Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:
; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))
; define a task to be executed
(deftask my-task (my-function arg1 arg2))
; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))
; do something with the results:
(some-function (get-results my-job))
Bonus if it can do something like Map-Reduce on the cluster as well.....
What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?
UPDATE:
Thanks for all the suggestions of Apache Hadoop - it looks like it might fit the bill; however, it seems a bit like overkill since I don't need a distributed data storage system like the one Hadoop uses (i.e. I don't need to process billions of records)... something more lightweight and focused purely on compute tasks would be preferable, if it exists.
Hadoop is the basis for almost all of the large-scale big-data excitement in the Clojure world these days, though there are better ways than using Hadoop directly.
Cascalog is a very popular front end:
Cascalog is a tool for processing data on Hadoop with Clojure in a concise and expressive manner. Cascalog combines two cutting edge technologies in Clojure and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, flexible, and robust.
Also check out Amit Rathor's swarmiji, a distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)
Although I haven't gotten to use it yet, I think that Storm is something that you might find useful to explore:
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Hadoop is exactly what you need: Apache Hadoop
Storm might suit your needs better than Hadoop, as it has no distributed data storage and has low latency. It's possible to split up and process data, similar to MapReduce; the Trident API makes this very simple.
It is partly written in Clojure, so I suppose Clojure interop is easier.
Another option is Onyx, which offers similar functionality but is a pure Clojure-based project.