I am planning a project that involves data persistence, search capabilities, and a recommendation feature (collaborative filtering).
As shown in the diagram, I am thinking of:
1) Having a set of microservices to handle entities, which will be persisted in NoSQL storage (probably MongoDB).
2) For the search function I will use Solr; messages coming from the microservices will be used to update the Solr index.
3) For recommendations, I am thinking of using Apache Mahout, with a message queue to update the Solr index used by Mahout.
My questions are:
1) Is this a suitable architecture for this kind of problem?
2) Does it need three data stores: MongoDB for data persistence, Solr (a Lucene index) for search, and another Solr (Lucene) index used by Mahout for recommendations?
3) Since Solr is also a NoSQL solution, what are the drawbacks of using Solr for both persistence and search without MongoDB?
4) If I want to use Hadoop or Apache Spark for analytics, does this involve introducing another data store?
This architecture seems reasonable. You can use the same Solr cluster for normal search as well as the recommender search. If you want to write your own data input to Spark you might implement a method to instantiate the Mahout IndexedDataset from MongoDB. There is already a companion object for taking a PairRDD of (String, String) as a single event's input and creating an IndexedDataset; this would remove the need for HDFS.
Spark saves temp files but does not require HDFS for storage. If you are using AWS you could put the Spark retraining work on EMR, spinning it up for training and tearing it down afterwards.
So the answers are:
1) Yes, it looks reasonable. You should always keep the event stream in some safe storage.
2) No, only MongoDB and Solr are needed, as long as you can read from MongoDB into Spark. This would be done in the recommender training code using Mahout's Spark code for SimilarityAnalysis.cooccurrence (sketched after these answers).
3) No known downside; I'm not sure of the performance or devops trade-offs.
4) You must use Spark for Mahout's SimilarityAnalysis.cooccurrence, since it implements the new Correlated Cross-Occurrence (CCO) algorithm, which will greatly improve your ability to use different forms of user data and in turn increase the quality of recommendations. Spark does not require HDFS storage if you feed in events from MongoDB or Solr.
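For question 4, here is a rough sketch of that training flow in Scala. Hedged heavily: the RDD names are made up, and the exact Mahout method signatures vary between releases, so treat this as the shape of the pipeline rather than a drop-in implementation.

    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    object CcoTrainingSketch {
      // purchaseEvents and viewEvents are hypothetical (userID, itemID) pair RDDs,
      // e.g. read from MongoDB with a MongoDB-to-Spark connector.
      def train(purchaseEvents: RDD[(String, String)],
                viewEvents: RDD[(String, String)])(implicit sc: SparkContext): Unit = {

        // The IndexedDatasetSpark companion object turns (String, String) pairs
        // into an IndexedDataset, so no HDFS input is required.
        val primary = IndexedDatasetSpark(purchaseEvents)
        // Reuse the primary action's user dictionary so rows line up across actions.
        val secondary = IndexedDatasetSpark(viewEvents, Some(primary.rowIDs))

        // CCO: the first dataset is the action being predicted; the others are
        // secondary indicators (views, category preferences, etc.).
        val indicators = SimilarityAnalysis.cooccurrencesIDSs(Array(primary, secondary))

        // Each returned IndexedDataset maps an item to its correlated items;
        // those indicator fields are what get indexed into Solr and queried at serve time.
      }
    }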
BTW: ActionML helps with the data science part of this; we can help you determine which user information is most predictive. We created the first open source implementation of CCO, and we have seen very large increases in recommendation quality from including the right CCO data (much greater than the Netflix Prize's 10%). We also support the PredictionIO implementation of the above architecture. We wrote the Universal Recommender based on Mahout (I'm a Mahout committer); it is much more turnkey than building the system from scratch, but our analysis help is independent of the implementation and might help you with the data science part of the project. ActionML.com, Universal Recommender here. All of it is free OSS.
Related
We have an application that is written in Java and uses Solr, Elasticsearch, Neo4j, MySQL, and a few more components.
We need to increase our data size dramatically (from millions to billions of records).
So here are the options I have to make this work:
1) Clustering the individual components, notably Solr, ES, Neo4j, and MySQL
2) Using what everyone talks about nowadays: Hadoop
The problem with the first option is that it is hard to manage;
the second option sounds too good to be true. So my questions are:
Can I actually assume that Hadoop can do that before digging in?
What other criteria do I need to consider?
Is there an alternative solution for such a task?
Solr is for searching data. If you want to process big data (data meeting the criteria of volume, velocity, and variety), for example for ETL and reporting, you will need Hadoop.
Hadoop consists of several ecosystem components. You can refer to the link below for documentation:
https://hadoop.apache.org
My requirement is as follows:
We are trying to implement a recommendations engine for one of our customers. To achieve this, we need to store data in HDFS from the web application (for every click on a product), compute the recommendations in the back end, and display the results (as products) in the web application.
My approach is as follows, highlighted in steps:
1) We have downloaded and configured Cloudera.
2) We have downloaded/configured Apache Spark MLlib (the recommendations engine).
3) Using Eclipse Luna, we are able to run MLlib (using the Java plugin).
4) Now we need to create a JSON service which will read the data from the web and store it in HDFS. We are stuck at this step.
5) Next we need to create a JSON service which can read the data from HDFS, compute the recommendations, and return the result in JSON format dynamically.
We are stuck at steps 4 and 5. Please suggest how we can create a JSON service to read from and write to HDFS.
You asked a very general question. I suggest you get familiar with Apache Spark first; read its quick start guide. Start by reading/writing JSON data from HDFS as described in the tutorial. After you understand how to work with batch processing, read about Spark Streaming.
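To make that concrete, here is a minimal sketch in Scala against a Spark 1.x-style API; the HDFS paths, the field name productId, and the placeholder aggregation are all assumptions, not your actual schema or model.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object HdfsJsonSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("clickstream"))
        val sqlContext = new SQLContext(sc)

        // Read click events that the web tier has written to HDFS,
        // one JSON object per line.
        val clicks = sqlContext.read.json("hdfs:///data/clicks/")
        clicks.printSchema()

        // ... train the MLlib recommender on `clicks` here ...

        // Write results back to HDFS as line-delimited JSON so a thin
        // REST service can read and serve them.
        clicks.groupBy("productId").count()   // placeholder aggregation, not a real model
              .write.json("hdfs:///data/recommendations/")

        sc.stop()
      }
    }

The web-facing JSON service itself (accepting clicks, serving recommendations) would be an ordinary REST endpoint; Spark handles the batch read/compute/write against HDFS.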
There is an old story that Ptolemy I asked if there was a shorter path to learning geometry than Euclid's Elements. Euclid replied that there is no royal road to geometry. Likewise, there is no fast way to build an MLlib engine for your clients except by reading and understanding the basics of Apache Spark usage. I wish you good luck with that!
We are designing the architecture of a social networking website with a highly interconnected dataset (a user can follow other users, places, and interests, with recommendations based on that). The feed would come from directly followed entities as well as from indirectly connected entities (places and interests can be connected to other places and interests in an inverted-tree-like hierarchy).
We plan to use Neo4j for storing the complex relationships between entities along with their IDs, and to store the actual data for each entity in MySQL. We want to keep the graph database content minimal (but with all the relationships, which are very important for feeds) so that we can load the entire graph into RAM at runtime for fast retrieval of content. Once we get the IDs of objects from Neo4j, we can run normal SQL queries against MySQL.
We are using a PHP and MySQL combination. We have learned that Neo4j, if used in embedded mode, is suitable for complex algorithms and fast data retrieval, so we need to integrate Neo4j with PHP. We plan to create RESTful Java APIs (or SOAP) around the Neo4j implementation; that way we could make it work.
We would have at least 1 million nodes and 10 million relationships. Can Neo4j traverse 1 million nodes in 1-5 seconds without performance glitches, given proper indexing?
Please advise whether this would work, especially if you have done this kind of thing before. Any guidance in this regard would be highly useful to me.
Thank you.
P.S.: I am attaching some project relationship diagrams to give you more context. Please ask if you need more input from me.
https://drive.google.com/file/d/0B-XA2uVZaFFTWDdwUEViZ2ZsbkE/edit?usp=sharing
https://drive.google.com/file/d/0B-XA2uVZaFFTTGV4d1IySXlWRGs/edit?usp=sharing
I published an unmanaged extension some time ago that implements a kind of activity stream. Feel free to have a look; you would consume it from PHP via a simple HTTP REST call.
https://github.com/jexp/neo4j-activity-stream
A picture of the domain model is included in that repository.
Yes, 10M relationships and 1M nodes should be no problem to hold even entirely in memory. For the fastest retrieval, I would build a server extension in Java, use the embedded API or even Cypher, and expose a custom REST endpoint that your PHP environment talks to; see http://docs.neo4j.org/chunked/milestone/server-plugins.html
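To illustrate the shape of such an extension, here is a sketch (written in Scala, though the same JAX-RS annotations work from plain Java). The URL layout, relationship types, and Cypher query are hypothetical, and the exact Neo4j API differs between versions.

    import javax.ws.rs.{GET, Path, PathParam, Produces}
    import javax.ws.rs.core.{Context, MediaType, Response}
    import org.neo4j.graphdb.GraphDatabaseService

    // Unmanaged extension: Neo4j injects the GraphDatabaseService, and the
    // endpoint is mounted under a path configured in the server properties.
    @Path("/feed")
    class FeedResource(@Context db: GraphDatabaseService) {

      @GET
      @Path("/{userId}")
      @Produces(Array(MediaType.APPLICATION_JSON))
      def feed(@PathParam("userId") userId: Long): Response = {
        // Traverse followed users/places/interests and return only entity IDs;
        // the PHP side then hydrates the actual content from MySQL.
        val cypher =
          "MATCH (u)-[:FOLLOWS*1..2]->(e) WHERE id(u) = {id} RETURN id(e) AS entityId LIMIT 100"
        val tx = db.beginTx()
        try {
          val params = java.util.Collections.singletonMap[String, AnyRef]("id", Long.box(userId))
          val result = db.execute(cypher, params)
          val ids = new StringBuilder("[")
          while (result.hasNext) {
            ids.append(result.next().get("entityId"))
            if (result.hasNext) ids.append(",")
          }
          ids.append("]")
          tx.success()
          Response.ok(ids.toString, MediaType.APPLICATION_JSON).build()
        } finally {
          tx.close()
        }
      }
    }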
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification about what I'm looking for. I'm seeking advice on the architecture and design of a solution like this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to real time as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something really bad on our own and be left thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than queries against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results stored in plain rectangular files that you can query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting or joining), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
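To illustrate the "raw metrics on plain HDFS" piece, here is a rough sketch (Scala, though the Hadoop FileSystem API is the same from Java; the paths, hourly bucketing, and record format are assumptions) of how the JMS-consuming component might land batches of raw metric records for those periodic jobs to pick up:

    import java.nio.charset.StandardCharsets
    import java.text.SimpleDateFormat
    import java.util.Date

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    class RawMetricWriter {
      // Picks up core-site.xml / hdfs-site.xml from the classpath.
      private val fs = FileSystem.get(new Configuration())
      private val hourFormat = new SimpleDateFormat("yyyy/MM/dd/HH")

      // Write one batch of newline-delimited metric records into an hourly
      // directory, e.g. /metrics/raw/2011/05/14/09/part-<timestamp>
      def writeBatch(lines: Seq[String]): Unit = {
        val dir = new Path(s"/metrics/raw/${hourFormat.format(new Date())}")
        val file = new Path(dir, s"part-${System.currentTimeMillis()}")
        val out = fs.create(file, /* overwrite = */ false)
        try lines.foreach(l => out.write((l + "\n").getBytes(StandardCharsets.UTF_8)))
        finally out.close()
      }
    }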
Finally, another option is to use something like Storm (which can run alongside your Hadoop cluster) to collect and analyze the data in real time, and either store the results for querying as mentioned above or store them in HBase for display through a custom user interface that queries HBase directly.
I'm trying to implement part of the Facebook Ads API, specifically the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies a 39 MB file, updated weekly, which contains ad-targeting data including colleges, college majors, workplaces, locales, countries, regions, and cities.
Our application needs to access all of those objects and provide autocompletion using this file's data.
I'm considering the preferred way to solve this, and was thinking about one of the following options:
1) Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take too much memory on the server (see the sketch after this list).
2) Using a dedicated search platform such as Solr on a different machine; the disadvantage is that this is perhaps over-engineering (though the file size will probably grow considerably in the future).
3) (Insert a cool, easy, speed-of-light option here)?
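For reference, here is a rough sketch of option 1 in Scala, using Apache Commons Collections' PatriciaTrie. It assumes the file has already been flattened into tab-separated name/id lines, which is my assumption about the format, not Facebook's actual layout.

    import org.apache.commons.collections4.trie.PatriciaTrie
    import scala.collection.JavaConverters._
    import scala.io.Source

    object AutoCompleteTrie {
      // Keys are lowercased names; values are the targeting-object IDs (kept as strings).
      private val trie = new PatriciaTrie[String]()

      // Load once at startup; assumes one "name<TAB>id" pair per line.
      def load(path: String): Unit =
        Source.fromFile(path).getLines().foreach { line =>
          val Array(name, id) = line.split('\t')
          trie.put(name.toLowerCase, id)
        }

      // prefixMap returns every entry whose key starts with the typed prefix.
      def complete(prefix: String, limit: Int = 10): Seq[(String, String)] =
        trie.prefixMap(prefix.toLowerCase).entrySet().asScala
          .take(limit).map(e => e.getKey -> e.getValue).toSeq
    }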
Well, what do you think?
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, this will get up to what, 400 MB? That of course depends on what your product does and what kind of hardware you wish to run it on.
I would go with Solr, or write your own service that reads the file into a fast DB such as a MySQL MyISAM table (or even an in-memory table) and uses MySQL's full-text search feature to serve up results. Barring that, I would try to use Solr as a service.
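If you go the MySQL route, here is a hedged sketch of what that lookup might look like over JDBC (the table name, columns, and connection details are made up; note that MyISAM full-text search ignores words shorter than ft_min_word_len, which matters for very short prefixes):

    import java.sql.DriverManager
    import scala.collection.mutable.ListBuffer

    object MySqlAutoComplete {
      // Assumes something like:
      //   CREATE TABLE targets (id BIGINT, name VARCHAR(255), FULLTEXT(name)) ENGINE=MyISAM;
      // and the MySQL JDBC driver on the classpath.
      def search(term: String, limit: Int = 10): Seq[String] = {
        val conn = DriverManager.getConnection("jdbc:mysql://localhost/ads", "user", "pass")
        try {
          val stmt = conn.prepareStatement(
            "SELECT name FROM targets WHERE MATCH(name) AGAINST (? IN BOOLEAN MODE) LIMIT ?")
          stmt.setString(1, term + "*")   // trailing * gives prefix-style matching in boolean mode
          stmt.setInt(2, limit)
          val rs = stmt.executeQuery()
          val out = ListBuffer[String]()
          while (rs.next()) out += rs.getString("name")
          out.toList
        } finally conn.close()
      }
    }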
The benefit of writing my own service is that I know what is going on; the downside is that it'll be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests in an asynchronous manner (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.