My requirement is as follows:
We are trying to implement a recommendations engine for one of our customers. To achieve this, we need to store data in HDFS from the web application (for every click on a product), compute the recommendations in the back end, and display the results (as products) in the web application.
My approach is as follows, broken down into steps:
1) We have downloaded and configured Cloudera's Hadoop distribution.
2) We have downloaded and configured Apache Spark MLlib (for the recommendations engine).
3) Using Eclipse Luna, we are able to run MLlib (via the Java plugin).
4) Now we need to create a JSON service which will read the data from the web application and store it in HDFS. We are stuck at this step.
5) Next we need to create a JSON service which can read the data from HDFS, compute the recommendations, and return the results in JSON format dynamically.
We are stuck at steps 4 and 5. Please suggest how we can create a JSON service to read/write from HDFS; a rough sketch of step 4 follows below.
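A minimal sketch of what step 4 could look like, assuming the web application exposes a servlet or REST endpoint that appends each click as a JSON line to a file in HDFS. The path, field names, and ClickEventWriter class are hypothetical, and HDFS append must be enabled on the cluster:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClickEventWriter {

        // Hypothetical HDFS location for raw click events; adjust to your cluster.
        private static final String CLICKS_PATH = "hdfs://namenode:8020/data/clicks/events.json";

        /**
         * Appends one click event as a JSON line to a file in HDFS.
         * A JSON/REST endpoint (e.g. a servlet) would call this per click.
         */
        public static void writeClick(String userId, String productId) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path path = new Path(CLICKS_PATH);

            // Create the file on first use, append afterwards (append must be enabled on the cluster).
            FSDataOutputStream out = fs.exists(path) ? fs.append(path) : fs.create(path);
            String json = String.format(
                    "{\"userId\":\"%s\",\"productId\":\"%s\",\"ts\":%d}%n",
                    userId, productId, System.currentTimeMillis());
            out.write(json.getBytes(StandardCharsets.UTF_8));
            out.close();
        }
    }

In practice you would buffer events (or hand them to Flume/Kafka) rather than opening the file per click, but this shows the basic HDFS write path from a web service.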
You asked a very general question. I suggest you get familiar with Apache Spark first: read its quick start guide, and start by reading/writing data from HDFS into a JSON RDD as described in the tutorial. Once you understand how batch processing works, read about Spark Streaming.
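For example, a minimal Java sketch of reading newline-delimited JSON click events from HDFS with Spark SQL's JSON reader. The jsonRDD mentioned above belongs to the older Spark 1.x API; this sketch uses the newer SparkSession entry point, and the HDFS path and column name are hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ClickReader {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("ClickReader")
                    .getOrCreate();

            // Read newline-delimited JSON click events from HDFS (path is hypothetical).
            Dataset<Row> clicks = spark.read().json("hdfs://namenode:8020/data/clicks/");

            // Inspect the inferred schema and a few rows as a sanity check.
            clicks.printSchema();
            clicks.show(10);

            // Example batch computation: clicks per product.
            clicks.groupBy("productId").count().show();

            spark.stop();
        }
    }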
There is an old story that Ptolemy I asked if there was a shorter path to learning geometry than Euclid's Elements; Euclid replied that there is no royal road to geometry. Likewise, there is no fast way to build an MLlib engine for your clients other than reading and understanding the basics of Apache Spark. I wish you good luck with that!
Related
My task is to write a Java based web application which will produce various charts like WaferMap, Histogram, Overlay Chart etc.
The front end is ExtJS and the chart generation part is taken care by JFreeChart.
The data for charts will be in multiple .CSV files which are stored in the file system.
My questions are:
The .CSV files will be gigabytes in size. Can I store these files in HDFS, query them at run-time, and display the data in the front end?
Is the Hadoop ecosystem a feasible solution for the above requirement? Should I also consider Apache Pig or Hive for querying the CSV files?
Yes, you can (with Apache Hive).
It all depends, but Hive seems like what you're looking for. It was designed with a SQL-like feel and supports SQL-style clauses. It is widely used by major companies such as Facebook, Netflix, and FINRA. In your case, supporting SQL syntax also means you can integrate with Java's JDBC driver very easily and query data from your CSV files.
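As a rough sketch of that JDBC integration, assuming HiveServer2 is running and the standard Hive JDBC driver is on the classpath; the connection URL, credentials, table name, and columns are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveCsvQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // HiveServer2 JDBC URL; host, port, and database are placeholders.
            String url = "jdbc:hive2://hive-server:10000/default";

            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {

                // External table over CSV files already sitting in HDFS (path and columns are hypothetical).
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS wafer_measurements ("
                        + "wafer_id STRING, x INT, y INT, value DOUBLE) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                        + "LOCATION '/data/charts/csv'");

                // Query the data like a relational table and feed the results to JFreeChart.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT wafer_id, AVG(value) FROM wafer_measurements GROUP BY wafer_id")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getDouble(2));
                    }
                }
            }
        }
    }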
http://www.tutorialspoint.com/hive/
Setting up Hive can be a bit difficult at first if you're not too familiar with the Hadoop environment. The above link is a great reference link to understand Hive better and get you in the right direction.
Hope this was helpful!
I am planning a project that involves data persistence, search capabilities, and a recommendation feature (collaborative filtering).
As shown in the diagram, I am thinking of :
1) Having a set of micro-services to handle entities, which will be persisted in NoSQL storage (probably MongoDB)
2) For the search function I will use Solr, and messages coming from the micro-services will be used to update the Solr index.
3) For recommendations, I am thinking of using Apache Mahout and using a message queue to update the Solr index used by Mahout.
My questions are :
1) Is this the correct architecture to handle this kind of problem?
2) Does it need 3 data stores: MongoDB for data persistence, Solr (Lucene index) for search, and Solr (Lucene index) used by Mahout for recommendations?
3) Since Solr is also a NoSQL solution, what are the drawbacks of using Solr for both the persistence and search functions, without MongoDB?
4) If I want to use Hadoop or Apache Spark for analytics, does this involve introducing another data store?
This architecture seems reasonable. You can use the same Solr cluster for normal search as well as the Recommender search. If you want to write your own data input to Spark you might implement a method to instantiate the Mahout IndexedDataset from MongoDB. There is already a companion object for taking a PairRDD of (String, String) as a single event's input and creating an IndexedDataset. This would remove the need for HDFS.
Spark saves temp files but does not require HDFS for storage. If you are using AWS you could put the Spark retraining work onto EMR, to spin up for training, and tear down afterwards.
So the answers are:
Yes, it looks reasonable. You should always keep the event stream in some safe storage.
No, only MongoDB and Solr are needed as long as you can read from MongoDB into Spark. This would be done in the recommender training code using Mahout's Spark code for SimilarityAnalysis.cooccurrence.
No known downside, not sure of the performance or devops trade-offs.
You must use Spark for SimilarityAnalysis.cooccurrence from Mahout since it implements the new "Correlated Cross-occurrence" (CCO) algorithm that will greatly improve your ability to use different forms of user data that will in turn increase the quality of recommendations. Spark does not require HDFS storage if you feed in events using MongoDB or Solr.
BTW: ActionML helps with the data science part of this; we can help you determine which user information is most predictive. We created the first open-source implementation of CCO and have seen very large increases in recommendation quality by including the right CCO data (much greater than the Netflix Prize's 10%). We also support the PredictionIO implementation of the above architecture. We wrote the Universal Recommender based on Mahout (I'm a Mahout committer); it is much more turnkey than building the system from scratch, but our analysis help is independent of the implementation and might help you with the data science part of the project. See ActionML.com and the Universal Recommender; all of it is free OSS.
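To make the "PairRDD of (String, String)" input mentioned above concrete, here is a rough Java sketch of assembling (userID, itemID) interaction pairs for the Mahout/Spark cooccurrence training. The event source, field positions, and class names are hypothetical, and since Mahout's IndexedDataset companion object is Scala, the hand-off from Java is only indicated in a comment:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class InteractionPairs {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("InteractionPairs");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Stand-in for events read from MongoDB (e.g. via the MongoDB Spark connector);
            // each line is "userId,itemId,action".
            JavaRDD<String> rawEvents = sc.parallelize(java.util.Arrays.asList(
                    "u1,productA,view",
                    "u1,productB,purchase",
                    "u2,productA,purchase"));

            // Keep only the primary action and map to (userID, itemID) pairs,
            // the shape expected by Mahout's IndexedDataset companion object.
            JavaPairRDD<String, String> purchases = rawEvents
                    .filter(line -> line.endsWith(",purchase"))
                    .mapToPair(line -> {
                        String[] parts = line.split(",");
                        return new Tuple2<>(parts[0], parts[1]);
                    });

            // From here, Mahout's Scala-side IndexedDataset can be built from purchases.rdd()
            // and fed to SimilarityAnalysis.cooccurrence for CCO training.
            purchases.collect().forEach(p -> System.out.println(p._1() + " -> " + p._2()));

            sc.close();
        }
    }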
I'm just getting started with Hadoop and I'm struggling to figure out how to use input sources that aren't files, e.g. reading all the rows from AWS SimpleDB, or all the records from a REST API on another system. Everything online only shows how to process files or a few selected databases.
The InputFormat API looks quite complex, so I'm trying to figure out the quickest way to read in data from any non-file data source, which can then be MapReduced using Amazon's Elastic MapReduce (based on Hadoop). I'm using Java to write the code.
Thanks!
The 'quickest' way would be to use some data aggregation tool, like Flume or Chukwa.
You can find a very good example of how to collect Twitter data through Flume using the Twitter API here. It shows how you can use Flume to read Twitter data into your Hadoop cluster and then process it using Hive. You could write your own MR job for that if you need to. Devising a custom InputFormat for these kinds of things really requires some work, and I don't think you'll find much help with it (unless somebody has already done it and is ready to share it with you).
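If setting up Flume is more than you need, a simpler (if less robust) route is to stage the external data into HDFS yourself before running the job, so a normal file-based MapReduce job can consume it. A minimal sketch, assuming a hypothetical REST endpoint and staging path:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RestToHdfs {
        public static void main(String[] args) throws Exception {
            // Endpoint and output path are placeholders.
            URL endpoint = new URL("https://api.example.com/records");
            Path output = new Path("hdfs:///staging/records/part-0000");

            FileSystem fs = FileSystem.get(new Configuration());
            HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();

            try (BufferedReader in = new BufferedReader(
                         new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8));
                 FSDataOutputStream out = fs.create(output, true)) {
                String line;
                // Copy the API response line by line into an HDFS file that a normal
                // file-based MapReduce (or EMR) job can then consume.
                while ((line = in.readLine()) != null) {
                    out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                }
            }
        }
    }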
HTH
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on the architecture and design of a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to get metrics as close to real time as possible. I'm looking for best practices and architectural advice, because I don't want us to come up with something really bad on our own and be left thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: on my cluster, Hive queries against HBase are at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or in plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting or joining), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
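As a rough illustration of that pre-aggregation step, here is a minimal MapReduce sketch that counts page views per URL from raw metric lines. The input format, field positions, and paths are assumptions, and a real job would likely bucket by time window as well:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class PageViewAggregator {

        // Assumes tab-separated raw metric lines with the page URL in the second column.
        public static class ViewMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);
            private final Text page = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split("\t");
                if (fields.length > 1) {
                    page.set(fields[1]);
                    context.write(page, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new LongWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "page-view-aggregation");
            job.setJarByClass(PageViewAggregator.class);
            job.setMapperClass(ViewMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(LongWritable.class);
            // Paths are placeholders: raw metrics in, pre-aggregated rectangular output
            // that a Hive external table can point at.
            FileInputFormat.addInputPath(job, new Path("/metrics/raw"));
            FileOutputFormat.setOutputPath(job, new Path("/metrics/aggregated/pageviews"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }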
Finally, another option is to use something like Storm (which can run alongside a Hadoop cluster) to collect and analyze data in real time, storing the results either for querying as described above or in HBase, for display through a custom user interface that queries HBase directly.
I would like to populate the datastore, yet all the examples and instructions for populating it are concerned with Python projects. Is there a way to upload bulk data using the App Engine Java tools? (At the moment the data is in CSV format, but I can easily reformat it as needed.)
It would be especially useful if it could be done within the Eclipse IDE.
Thanks.
I'm having the same problem as you with this one. According to the discussion at http://groups.google.com/group/google-appengine-java/browse_thread/thread/72f58c28433cac26, there's no equivalent tool available for Java yet. However, it looks like there's nothing stopping you from using the Python tool to populate the datastore and then accessing that data as normal through your Java code, although this assumes you're comfortable with Python, which could be the problem.
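If you'd rather stay in Java, one workaround is a one-off upload handler that parses the CSV and writes entities with the low-level datastore API. A minimal sketch, where the kind name, CSV columns, and servlet mapping are assumptions; for large files you'd want to run the puts from a task queue:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.google.appengine.api.datastore.DatastoreService;
    import com.google.appengine.api.datastore.DatastoreServiceFactory;
    import com.google.appengine.api.datastore.Entity;

    public class CsvUploadServlet extends HttpServlet {

        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
            List<Entity> batch = new ArrayList<>();

            // Assumes the CSV is POSTed in the request body with columns: name,price.
            BufferedReader reader = req.getReader();
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                Entity product = new Entity("Product"); // kind name is hypothetical
                product.setProperty("name", cols[0]);
                product.setProperty("price", Double.parseDouble(cols[1]));
                batch.add(product);

                // Write in chunks to stay within datastore batch limits.
                if (batch.size() == 500) {
                    datastore.put(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                datastore.put(batch);
            }
            resp.setStatus(HttpServletResponse.SC_OK);
        }
    }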