I'm just getting started with Hadoop and I'm struggling to figure out how to use input sources that aren't files, e.g. reading all the rows from AWS SimpleDB, or all the records from a REST API on another system. Everything online only shows how to process files or a few selected databases.
The API for InputFormat looks quite complex, so I'm trying to figure out the quickest way to read in data from any non-file data source, which can then be MapReduced using Amazon's Elastic MapReduce (based on Hadoop). I'm using Java to write the code.
Thanks!
The 'quickest' way would be to use a data aggregation tool such as Flume or Chukwa.
You can find a very good example of collecting Twitter data through Flume using the Twitter API here. It shows how you can use Flume to read Twitter data into your Hadoop cluster and then process it with Hive; you could write your own MR job for that part if you need to. Devising a custom InputFormat for these kinds of sources really requires some work, and I don't think you'll find much help on it (unless somebody has already done it and is ready to share it with you).
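If you do decide to go the custom InputFormat route anyway, the skeleton below shows roughly what it involves. This is a minimal, untested sketch: the class names, the page-based splitting, the fixed page count, and the placeholder records are all stand-ins for real calls to SimpleDB or your REST API.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Sketch of a non-file InputFormat: each split covers one "page" of a remote
    // source (a SimpleDB domain slice, a REST endpoint page, ...). The RecordReader
    // is where you would actually call the remote API; here it only emits a few
    // placeholder records so the skeleton compiles on its own.
    public class RemoteSourceInputFormat extends InputFormat<LongWritable, Text> {

        // One split per remote "page"; it must be Writable so Hadoop can ship it to tasks.
        public static class PageSplit extends InputSplit implements Writable {
            private long page;

            public PageSplit() {}                      // no-arg constructor required
            public PageSplit(long page) { this.page = page; }

            public long getPage() { return page; }
            @Override public long getLength() { return 0; }                    // size unknown
            @Override public String[] getLocations() { return new String[0]; } // no data locality
            @Override public void write(DataOutput out) throws IOException { out.writeLong(page); }
            @Override public void readFields(DataInput in) throws IOException { page = in.readLong(); }
        }

        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (long page = 0; page < 10; page++) {   // assumption: 10 pages; derive this from the source
                splits.add(new PageSplit(page));
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new RecordReader<LongWritable, Text>() {
                private long row = -1;
                private final Text value = new Text();

                @Override public void initialize(InputSplit s, TaskAttemptContext c) {
                    // open your HTTP/SimpleDB client here and fetch page ((PageSplit) s).getPage()
                }
                @Override public boolean nextKeyValue() {
                    row++;
                    if (row >= 3) return false;          // pretend each page holds 3 records
                    value.set("record-" + row);          // replace with the real record body
                    return true;
                }
                @Override public LongWritable getCurrentKey() { return new LongWritable(row); }
                @Override public Text getCurrentValue() { return value; }
                @Override public float getProgress() { return row < 0 ? 0f : row / 3f; }
                @Override public void close() {}
            };
        }
    }

The mechanics (split serialization, record reader lifecycle) are the part that takes the work; the API calls themselves end up being the easy bit.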
HTH
My task is to write a Java based web application which will produce various charts like WaferMap, Histogram, Overlay Chart etc.
The front end is ExtJS and the chart generation part is taken care by JFreeChart.
The data for charts will be in multiple .CSV files which are stored in the file system.
My questions are:
The .CSV files will be gigabytes in size. Can I store these files in HDFS, query them at run time, and display the data in the front end?
Is the Hadoop ecosystem a feasible solution for the above requirement? Should I also consider Apache Pig or Hive for querying the CSV files?
Yes, you can (Apache Hive).
It all depends, but Hive seems like what you're looking for. It was designed with a SQL-like feel and supports familiar SQL clauses. It is widely used at major companies like Facebook, Netflix, and FINRA. In your case, supporting SQL syntax also means that you can integrate through Hive's JDBC driver quite easily and query the data in your CSV files, as in the sketch below.
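As a rough illustration only (it assumes a running HiveServer2, and the host, table name, and column layout are made up to resemble your wafer data), querying the CSVs from Java over JDBC could look something like this:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveCsvQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
            try (Connection conn = DriverManager.getConnection(
                        "jdbc:hive2://hiveserver:10000/default", "hiveuser", "");
                 Statement stmt = conn.createStatement()) {

                // One-time setup: an external table laid over the existing CSV directory.
                // The columns here are assumptions; match your real CSV layout.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS wafer_measurements "
                        + "(wafer_id STRING, x INT, y INT, measurement DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                        + "LOCATION '/data/wafer_csv'");

                // Aggregate on the cluster; only the small result set goes to JFreeChart.
                ResultSet rs = stmt.executeQuery(
                        "SELECT wafer_id, AVG(measurement) FROM wafer_measurements GROUP BY wafer_id");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                }
            }
        }
    }

The point of the external table is that Hive just reads the CSVs where they already sit in HDFS; you don't have to reload anything when new files arrive.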
http://www.tutorialspoint.com/hive/
Setting up Hive can be a bit difficult at first if you're not too familiar with the Hadoop environment. The link above is a great reference for understanding Hive better and getting pointed in the right direction.
Hope this was helpful!
My requirement is as follows:
We are trying to implement a recommendations engine for one of our customers. To achieve this, we need to store data in HDFS from the web application (for every click on a product), compute the recommendations in the back end, and display the results (as products) in the web application.
My approach is as follows, broken into steps:
1. We have downloaded and configured Cloudera.
2. We have downloaded and configured Apache MLlib (the recommendations engine).
3. Using Eclipse Luna, we are able to run MLlib (using the Java plugin).
4. Now we need to create a JSON service which will read the data from the web and store it in HDFS. We are stuck on this step.
5. Now we need to create a JSON service which can read the data from HDFS, compute the recommendations, and return the results in JSON format dynamically.
We are stuck at steps 4 and 5. Please suggest how we can create a JSON service to read from and write to HDFS.
You asked a very general question. I suggest you get familiar with Apache Spark first. Read its quick-start guide. Start by reading/writing data from HDFS into a jsonRDD as described in the tutorial. After you understand how batch processing works, read about Spark Streaming.
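To give a flavor of it, here is a minimal Java sketch against the Spark 2.x API (the tutorial's jsonRDD plays the same role in older releases). The HDFS paths, the app name, and the productId field are placeholders, and the MLlib step itself is left as a comment:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ClickStreamJob {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("click-recommendations")
                    .getOrCreate();

            // Each input line is one JSON click event that the web tier wrote into HDFS.
            Dataset<Row> clicks = spark.read().json("hdfs:///data/clicks/*.json");
            clicks.printSchema();

            // ... feed 'clicks' into MLlib (e.g. ALS) here to compute recommendations ...

            // Write a result back to HDFS as JSON so a thin web service can serve it.
            clicks.groupBy("productId").count()
                  .write().mode("overwrite").json("hdfs:///data/product_click_counts");

            spark.stop();
        }
    }

Your "JSON service" then becomes two small pieces: a web endpoint that appends click events as JSON into HDFS, and one that reads the precomputed output directory and returns it.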
There is an old story that Ptolemy I asked if there was a shorter path to learning geometry than Euclid's Elements. Euclid replied that there is no royal road to geometry. Likewise, there is no fast way to build an MLlib engine for your clients other than reading up on and understanding the basics of Apache Spark. I wish you good luck with that!
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make the analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough; we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
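For concreteness, that collection tier could look something like the rough sketch below. The JNDI names, queue name, and tab-separated payload are just assumptions, not a prescribed design:

    import java.io.IOException;

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.naming.InitialContext;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class BeaconServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Whatever fields the page tag sends; tab-separated just as an example.
            String hit = req.getParameter("page") + "\t" + req.getParameter("ref")
                    + "\t" + System.currentTimeMillis();
            try {
                InitialContext jndi = new InitialContext();
                ConnectionFactory cf = (ConnectionFactory) jndi.lookup("jms/ConnectionFactory");
                Queue queue = (Queue) jndi.lookup("jms/analyticsHits");
                Connection conn = cf.createConnection();
                try {
                    Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
                    session.createProducer(queue).send(session.createTextMessage(hit));
                } finally {
                    conn.close();
                }
            } catch (Exception e) {
                throw new IOException(e);
            }
            resp.setStatus(HttpServletResponse.SC_NO_CONTENT);  // nothing to render back to the tag
        }
    }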
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics as close to real time as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something really bad on our own that leaves us thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you store in plain rectangular files you can query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc.), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
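As a rough illustration of that pre-aggregation idea (not a drop-in solution): a small MapReduce job, run on a schedule, that rolls raw hit logs up into per-URL counts written as plain text that an external Hive table can be laid over. The input/output paths and the tab-separated log layout are assumptions:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Rolls raw tab-separated hit logs (timestamp \t url \t ...) up into per-URL counts.
    public class PageViewRollup {

        public static class HitMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text url = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length > 1) {            // the field layout here is an assumption
                    url.set(fields[1]);
                    context.write(url, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "pageview-rollup");
            job.setJarByClass(PageViewRollup.class);
            job.setMapperClass(HitMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            FileInputFormat.addInputPath(job, new Path("/raw/hits/current"));        // placeholder paths
            FileOutputFormat.setOutputPath(job, new Path("/agg/pageviews/current"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

You would schedule jobs like this per metric (or per time window) and point Hive external tables, or even a low-latency store like PostgreSQL, at the small aggregated outputs.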
Finally, another option is to use something like Storm (which can run alongside your Hadoop cluster) to collect and analyze the data in real time, and either store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.
Sorry in advance if this is a basic question. I'm reading a book on HBase and learning, but most of the examples in the book (as well as online) tend to use Java (I guess because HBase is native to Java). There are a few Python examples and I know I can access HBase with Python (using Thrift or other modules), but I'm wondering about more advanced functionality.
For example, HBase has a 'coprocessors' feature that pushes the processing to where the data lives. Does this sort of thing work with Python or other apps that use streaming Hadoop jobs? It seems that with Java it can know what you're doing and manage the data flow accordingly, but how does this work with streaming? If it doesn't work, is there a way to get this type of functionality (via streaming, without switching to another language)?
Maybe another way of asking this is: what can a non-Java programmer do to get all the benefits of Hadoop's features when using streaming?
Thanks in advance!
As far as I know, you are talking about two (or more) totally different concepts.
"Hadoop Streaming" is there to stream data through your executable (independent from your choice of programming language). When using streaming there can't be any loss of functionality, since the functionality is basicly map/reduce the data you are getting from hadoop stream.
For the Hadoop part you can even use big-data query languages like Pig or Hive to get things done efficiently. With the newest versions of Pig you can even write custom functions in Python and use them inside your Pig scripts.
Although there are tools that let you use the language you are comfortable with, never forget that the Hadoop framework is mostly written in Java. There could be times when you need to write a specialized InputFormat, or a UDF inside Pig, etc. Then a decent knowledge of Java would come in handy.
Your "Hbase coprocessors" example is kinda unrelated with streaming functionality of hadoop. Hbase coproccessors consists of 2 parts : server-side part, client-side part. I am pretty sure there would be some useful server-side coprocessors embedded inside hbase with release; but other than that you would need to write your own coprocessor (and bad news: its java). For client side I am sure you would be able to use them with your favorite programming language through thrift without too much problem.
So as an answer to your question: you can always dodge learning Java and still use Hadoop to its potential (using third-party libraries/applications). But when shit hits the fan it's better to understand the underlying machinery and be able to develop in Java. Knowing Java would give you full control over the Hadoop/HBase environment.
Hope you find this helpful.
Yes, you should get data-local code execution with streaming. You do not push the data to where the program is; you push the program to where the data is. Streaming simply takes the local input data and feeds it through stdin to your Python program. Instead of each map running inside a Java task, it spins up an instance of your Python program and just pumps the input through that.
If you really want to do fast processing, though, you should learn Java. Having to pipe everything through stdin and stdout adds a lot of overhead.
I would like to populate the datastore, yet all the examples and instructions for populating it are concerned with Python projects. Is there a way to upload bulk data using the App Engine Java tools? (At the moment the data is in CSV format, but I can easily reformat it as needed.)
It would be especially useful if it could be done within the Eclipse IDE.
Thanks.
I'm having the same problem as you with this one. According to the discussion at http://groups.google.com/group/google-appengine-java/browse_thread/thread/72f58c28433cac26 there's no equivalent tool available for Java yet. However, it looks like there's nothing stopping you from using the Python tool just to populate the datastore and then accessing that data as normal through your Java code, although this assumes you're comfortable with Python, which may be the original problem.
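Once the Python bulkloader has put the data in, reading it back from your Java code is just the low-level datastore API. A rough sketch (the kind name "CsvRow" is made up; use whatever kind the loader configuration actually creates):

    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;
    import com.google.appengine.api.datastore.Query;

    public class CsvRowReader {
        // "CsvRow" is a placeholder kind name matching whatever the Python loader wrote.
        public Iterable<Entity> loadAll() {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            return datastore.prepare(new Query("CsvRow")).asIterable();
        }
    }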