Could anybody please recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?
So far I have been moving in the Scalding direction. The prototypes I built allowed me to combine the Scalding library with Maven and to separate the Scalding job JAR from the 'library' code packages. This in turn allowed me to run Scalding-based Hadoop jobs from outside the cluster with minimal per-job overhead (the 'library' code is pushed to the cluster's distributed cache only when it changes, which is rarely needed, so I can deploy job code quickly).
Now I'm actually starting to play with HBase itself, and I can see that Scalding is good but not exactly 'native' to HBase. Yes, there are things like hbase-scalding, but since I'm at the point of planning future work anyway, I'd like to know about other good solutions I may have missed.
What is expected:
Application (job) startup overhead should be low; I need to run a lot of them.
It should be possible (the easier the better) to run jobs from outside the cluster without any SSH (just via the 'hadoop jar' command, or even simply by executing the application).
The job language itself should allow short, logical semantics. Ideally the code should be simple enough to be generated automatically.
The solution should perform well on reasonably large HBase tables (initially up to 100,000,000 entries).
The solution should be 'alive' (actively developed) but relatively good in terms of general stability.
I think the argumentation here could be even more useful than the solution itself, and this question should give a couple of ideas to many people.
Any piece of advice?
If you're using Scalding (which I recommend), there's a new project with updated Cascading and Scalding wrappers for accessing HBase. You might want to check it out: https://github.com/ParallelAI/SpyGlass
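To give a feel for how short the job code stays, here's a minimal Scalding job sketch. It reads (user, count) pairs from a plain TSV source just for illustration; with SpyGlass you would swap in its HBase source, whose exact constructor arguments depend on the SpyGlass version, so treat the source choice and field names here as placeholders.

    import com.twitter.scalding._

    // Sums counts per user; field names and paths are illustrative.
    class UserTotalsJob(args: Args) extends Job(args) {
      Tsv(args("input"), ('user, 'count))
        .groupBy('user) { _.sum[Long]('count -> 'total) }
        .write(Tsv(args("output")))
    }

A job like this can be submitted from outside the cluster with something like 'hadoop jar your-assembly.jar com.twitter.scalding.Tool UserTotalsJob --hdfs --input in.tsv --output out' (jar name and paths are placeholders), which matches the low-overhead submission you describe.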
HPaste http://www.gravity.com/labs/hpaste/ may be what you are looking for.
You may be interested in the Kiji project (https://github.com/kijiproject/). It provides a "schema-ed" layer on top of HBase.
It also has a Scalding adapter (KijiExpress) so that you can do functional collections operations (map, groupby, etc.) on "pipes" of tuples sourced from these schema-ed HBase tables.
Update (August 2014): Stratosphere is now called Apache Flink (incubating)
Check out Stratosphere. It offers a Scala API, has an HBase module, and is under active development.
Starting a job should take about a second (depending on your cluster size).
You can submit jobs remotely (it has a class called RemoteExecutor which allows you to programmatically submit jobs to remote clusters).
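For example, with the Scala API you can get the same effect through a remote ExecutionEnvironment. A rough sketch using the post-rename Flink package names; the host name, port, and jar path are placeholders, and 6123 is only the default JobManager port:

    import org.apache.flink.api.scala._

    object RemoteWordCount {
      def main(args: Array[String]): Unit = {
        // Connects to the JobManager of a running cluster and ships the job jar.
        val env = ExecutionEnvironment.createRemoteEnvironment(
          "jobmanager-host", 6123, "target/my-flink-job.jar")

        env.readTextFile("hdfs:///data/input.txt")
          .flatMap(_.toLowerCase.split("\\W+").filter(_.nonEmpty))
          .map((_, 1))
          .groupBy(0)
          .sum(1)
          .writeAsCsv("hdfs:///data/wordcounts")

        env.execute("remote word count")
      }
    }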
Please contact me if you have further questions!
I am currently trying to maintain hbase-scalding in my free time, as I am also picking up Scala.
Please take a look at it on GitHub.
I want to implement a music recommendation system that can generate recommended music playlists in real time. I believe that this can be implemented in Prediction.io...
However, due to Prediction.io's design, I need to call pio train and pio deploy in order to update the learning model with the new actions performed by the user (liking music, etc.). Hence, I would need to run these commands every 2 hours (or some other appropriate interval).
I recently came across Apache Storm, and I really like the concept of "realtime Hadoop" processing. Hence, I was wondering whether I can integrate Prediction.io with Apache Storm, so that the learning is done "online", which would allow my app to recommend music within a few likes/actions by the user, instead of having the user wait until the learning model is updated.
If this is not viable, then is it possible to incorporate Spark's MLlib into an Apache Storm bolt (Java), since I can build recommendation systems with it (and it also seems that Prediction.io itself is built on top of Apache Spark)?
Thanks in advance!
The use case seems viable, but I would not consider 'needing to run something every few hours' a good motivation for using Storm. On the other hand, if your learning data is streaming, you can model your Storm topology to update its internal knowledge base every time new data arrives. This lets you use the most up-to-date knowledge base every time a user queries something.
As for which libraries can be used with Storm, any Java library (in fact, any library in any language, as long as it can interface with Java) should work.
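As an illustration of updating the knowledge base as data arrives, here is a rough Scala sketch of a Storm bolt. The per-user like set is a deliberately trivial stand-in for a real incremental model, the field names are assumptions, and the signatures assume Storm 1.x (older releases use the backtype.storm package prefix instead of org.apache.storm).

    import java.util.{Map => JMap}
    import org.apache.storm.task.{OutputCollector, TopologyContext}
    import org.apache.storm.topology.OutputFieldsDeclarer
    import org.apache.storm.topology.base.BaseRichBolt
    import org.apache.storm.tuple.{Fields, Tuple, Values}
    import scala.collection.mutable

    class OnlineRecommenderBolt extends BaseRichBolt {
      private var collector: OutputCollector = _
      // userId -> liked track ids; a stand-in for a real incremental model.
      private val likes = mutable.Map.empty[String, Set[String]]

      // Signature as in Storm 1.x, where the config is a raw java.util.Map.
      override def prepare(conf: JMap[_, _], context: TopologyContext,
                           collector: OutputCollector): Unit = {
        this.collector = collector
      }

      override def execute(input: Tuple): Unit = {
        val userId  = input.getStringByField("userId")
        val trackId = input.getStringByField("trackId")
        likes(userId) = likes.getOrElse(userId, Set.empty) + trackId
        // Emit an updated recommendation right away (trivially: echo the last like).
        collector.emit(input, new Values(userId, trackId))
        collector.ack(input)
      }

      override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
        declarer.declare(new Fields("userId", "recommendedTrackId"))
    }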
We have an application written in Java that uses Solr, Elasticsearch, Neo4j, MySQL, and a few more components.
We need to increase our data size dramatically (from millions to billions).
So here are the options I had in order to make this work:
cluster the individual components, notably Solr, ES, Neo4j, and MySQL
use what everyone talks about nowadays: Hadoop
The problem with the first option is that it is hard to manage.
The second option sounds too good to be true. So my questions are:
Can I actually assume that Hadoop can do that before digging in?
What other criteria do I need to consider?
Is there any alternative solution for such a task?
Solr is for searching data. If you want to process big data (meeting the criteria of volume, velocity, and variety), such as ETL and reporting workloads, you need Hadoop.
Hadoop consists of several ecosystem components. You can refer to the link below for documentation:
https://hadoop.apache.org
I have written code in Java that works over a large amount of data. I want to distribute this to multiple machines so they can work on parts of the data and get the processing done more quickly. I have never worked on distributed computing before. Are there tools to get this done? Thanks.
GridGain works fairly well. Hadoop is a great one but needs more development effort. Hazelcast could be a good outsider.
I can cite others too, but it's difficult to answer your question without knowing what kinds of data and processing are involved. Are the processes I/O-intensive or CPU-bound?
One of the questions is "How big is your dataset?".
It seems like you want to implement a map-reduce algorithm.
Hadoop is an open source project that provides a framework to do exactly that.
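To give a sense of what that involves, here is a minimal word-count-style sketch using Hadoop's MapReduce API from Scala; your own processing would replace the map and reduce logic, and the class names and paths are just placeholders.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
    import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
    import scala.collection.JavaConverters._

    // Emits (word, 1) for every token in each input line.
    class TokenizerMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
      private val one  = new IntWritable(1)
      private val word = new Text()
      override def map(key: LongWritable, value: Text,
                       ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
        value.toString.split("\\s+").filter(_.nonEmpty).foreach { w =>
          word.set(w); ctx.write(word, one)
        }
    }

    // Sums the counts for each word.
    class IntSumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
      override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                          ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
        ctx.write(key, new IntWritable(values.asScala.map(_.get).sum))
    }

    object WordCount {
      def main(args: Array[String]): Unit = {
        val job = Job.getInstance(new Configuration(), "word count")
        job.setJarByClass(classOf[TokenizerMapper])
        job.setMapperClass(classOf[TokenizerMapper])
        job.setReducerClass(classOf[IntSumReducer])
        job.setOutputKeyClass(classOf[Text])
        job.setOutputValueClass(classOf[IntWritable])
        FileInputFormat.addInputPath(job, new Path(args(0)))
        FileOutputFormat.setOutputPath(job, new Path(args(1)))
        System.exit(if (job.waitForCompletion(true)) 0 else 1)
      }
    }

Hadoop then takes care of splitting the input across machines, running the map tasks where the data lives, and shuffling the intermediate results to the reducers.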
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about Map-Reduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, plus Hadoop and something on top of it to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a Javascript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT : I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to realtime as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad that will leave us thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting or joining), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
Finally, another option is to use something like Storm (a real-time stream processing system that can run alongside your Hadoop cluster) to collect and analyze data in real time, and either store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.
I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem is when one simulation is 'in charge' of certain objects, such as some roads, and another is 'in charge' of other roads. However, these roads are interconnected (and hence we need synchronisation between these simulations and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited to this task (to abstract away having to create an over-the-wire signalling discipline).
Is this sane? The issue here is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations. (Hence the idea of using RMI.)
My basic approach is to have the 'coordinator' run a giant RMI registry, and every simulation simply looks everything up in the registry, ensuring that the correct objects are used at each step.
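Roughly, I'm picturing something like the following minimal sketch (the interface, names, hostname, and port are just placeholders for illustration):

    import java.rmi.{Remote, RemoteException}
    import java.rmi.registry.LocateRegistry
    import java.rmi.server.UnicastRemoteObject

    // Remote interface shared by the coordinator and the simulations.
    trait LinkRegistry extends Remote {
      @throws[RemoteException] def updateLinkCost(linkId: String, cost: Double): Unit
      @throws[RemoteException] def linkCost(linkId: String): Double
    }

    class LinkRegistryImpl extends LinkRegistry {
      private val costs = new java.util.concurrent.ConcurrentHashMap[String, Double]()
      def updateLinkCost(linkId: String, cost: Double): Unit = costs.put(linkId, cost)
      def linkCost(linkId: String): Double = costs.getOrDefault(linkId, 0.0)
    }

    object Coordinator {
      private val impl = new LinkRegistryImpl // strong reference to the exported object

      def main(args: Array[String]): Unit = {
        val registry = LocateRegistry.createRegistry(1099)
        val stub = UnicastRemoteObject.exportObject(impl, 0).asInstanceOf[LinkRegistry]
        registry.rebind("links", stub)
        println("Coordinator registry running on port 1099")
      }
    }

    object Simulation {
      def main(args: Array[String]): Unit = {
        // "coordinator-host" is a placeholder for wherever the coordinator runs.
        val registry = LocateRegistry.getRegistry("coordinator-host", 1099)
        val links = registry.lookup("links").asInstanceOf[LinkRegistry]
        links.updateLinkCost("A-B", 42.0)
      }
    }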
Anyone have any tips for heading down this path?
You may want to check out Hazelcast also. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable, Callable tasks in a distributed fashion, then please check out Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
Hazelcast is released under Apache license and enterprise grade support is also available.
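For a flavour of what that looks like, here's a small sketch against the instance-based API of more recent Hazelcast releases; the map and executor names are arbitrary, and peer discovery uses the default configuration.

    import java.io.Serializable
    import java.util.concurrent.Callable
    import com.hazelcast.core.Hazelcast

    // Runs on whichever cluster member the executor picks; returns that member's hostname.
    class WhereAmI extends Callable[String] with Serializable {
      override def call(): String = java.net.InetAddress.getLocalHost.getHostName
    }

    object HazelcastNode {
      def main(args: Array[String]): Unit = {
        val hz = Hazelcast.newHazelcastInstance()

        // A partitioned map: entries are spread across all running members.
        val linkCosts = hz.getMap[String, Double]("link-costs")
        linkCosts.put("A-B", 42.0)

        // Run a task somewhere in the cluster via the distributed executor service.
        val exec = hz.getExecutorService("default")
        println(s"task ran on: ${exec.submit(new WhereAmI).get()}")
      }
    }

Every JVM that runs this joins the same cluster and sees the same map, which is what makes the setup so low-configuration.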
Is this sane? IMHO no. And I'll tell you why. But first I'll add the disclaimer that this is a complicated topic so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself, I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck. It also introduces communication inefficiencies (namely, the points communicate via a two-step process through the center).
What you really want for this is probably a cluster (rather than a data/compute grid) solution and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence but it's no doubt expensive (compared to free). It is a fantastic product though.
These two products can be used in a number of ways, but the core of both is to treat a cache like a distributed map. You put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, intended for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications and the speed is very impressive so far.
Paul
Have you considered using a message queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and have your servers running on EC2 to allow you to scale up and down as required.
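To make that concrete, here's a small sketch of publishing a task message over JMS, using ActiveMQ as the broker purely as an example (broker URL, queue name, and payload format are placeholders); with SQS the idea is the same but the client SDK differs.

    import javax.jms.Session
    import org.apache.activemq.ActiveMQConnectionFactory

    object TaskSender {
      def main(args: Array[String]): Unit = {
        // Broker URL is a placeholder for wherever your JMS broker runs.
        val factory = new ActiveMQConnectionFactory("tcp://broker-host:61616")
        val connection = factory.createConnection()
        connection.start()
        val session  = connection.createSession(false, Session.AUTO_ACKNOWLEDGE)
        val producer = session.createProducer(session.createQueue("simulation.tasks"))

        // The payload format is entirely up to you; JSON is just one option.
        producer.send(session.createTextMessage("""{"linkId":"A-B","cost":42.0}"""))
        connection.close()
      }
    }

Workers would hold a matching consumer on the same queue and report their results on a reply queue.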
Just my 2 cents.
Take a look at JINI, it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (the coordinator in your case) writes tasks to the space, and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronize and exchange data, this will add some complexity to your solution.
Using JavaSpaces will add a whole lot more abstraction to your implementation than using plain RMI (which is used by the Jini framework internally as the default wire protocol).
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini
Just as an addition to the other answers, which as far as I can see all focus on grid and cloud computing, you should note that simulation models have one unique characteristic: simulation time.
When running distributed simulation models in parallel and keeping them synchronized, I see two options:
Each simulation model has its own simulation clock and event list, and these are synchronized over the network.
Alternatively, there is a single simulation clock and event list which "ticks the time" for all distributed (sub-)models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starting point.
However, the second option seems simpler and lower-overhead to me.
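To illustrate the second option, here's a stripped-down, single-process sketch: one global clock and one event list drive every (sub-)model, so no clock-synchronisation protocol is needed. SubModel, Event, and the wiring are illustrative; in a distributed setup the step() call would go over RMI or a message queue rather than being a local method call.

    import scala.collection.mutable

    final case class Event(time: Double, name: String)

    trait SubModel {
      // Advance this model to `now` given the current event; return any events
      // it schedules for the future.
      def step(now: Double, event: Event): Seq[Event]
    }

    class CentralCoordinator(models: Seq[SubModel]) {
      // Earliest event first: order by negated time so the queue's head is the minimum.
      private val eventList =
        mutable.PriorityQueue.empty[Event](Ordering.by((e: Event) => -e.time))
      private var now = 0.0

      def schedule(e: Event): Unit = eventList.enqueue(e)

      def run(until: Double): Unit =
        while (eventList.nonEmpty && eventList.head.time <= until) {
          val event = eventList.dequeue()
          now = event.time // the single, shared simulation clock
          for (m <- models; follow <- m.step(now, event)) schedule(follow)
        }
    }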
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and a "distributed task session". You can browse their examples and see if any of them fit your needs.