We have an application written in Java that uses Solr, Elasticsearch, Neo4j, MySQL, and a few other components.
We need to increase our data size dramatically (from millions of records to billions).
Here are the options I see for making this work:
Cluster the individual components, notably Solr, Elasticsearch, Neo4j, and MySQL.
Use what everyone talks about nowadays: Hadoop.
The problem with the first option is that it is hard to manage; the second sounds too good to be true. So my questions are:
Can I actually assume Hadoop can handle this before digging in?
What other criteria do I need to consider?
Is there an alternative solution for such a task?
Solr is for searching data. If you want to process big data (meeting the criteria of volume, velocity, and variety), for example for ETL and reporting, you would need Hadoop.
Hadoop consists of several ecosystem components. You can refer to the link below for documentation:
https://hadoop.apache.org
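To make "processing with Hadoop" a little more concrete, here is a minimal MapReduce sketch (all class names, the tab-separated input format, and the paths are illustrative, not from the question) that aggregates event counts per key, the kind of batch ETL/reporting work Hadoop handles well:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Counts events per key (e.g. per product id) across very large input files.
public class EventCountJob {

    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumes tab-separated lines whose first column is the key.
            String[] fields = line.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                outKey.set(fields[0]);
                context.write(outKey, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "event-count");
        job.setJarByClass(EventCountJob.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```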
I am planning a project that involves data persistence, search capabilities, and a recommendation feature (collaborative filtering).
As shown in the diagram, I am thinking of:
1) Having a set of microservices to handle entities, which will be persisted in NoSQL storage (probably MongoDB).
2) For search, using Solr; messages coming from the microservices will be used to update the Solr index.
3) For recommendations, using Apache Mahout and a message queue to update the Solr index used by Mahout.
My questions are:
1) Is this the right architecture for this kind of problem?
2) Does it need three data stores: MongoDB for persistence, Solr (Lucene index) for search, and Solr (Lucene index) used by Mahout for recommendations?
3) Since Solr is also a NoSQL solution, what are the drawbacks of using Solr for both persistence and search, without MongoDB?
4) If I want to use Hadoop or Apache Spark for analytics, does this mean introducing yet another data store?
This architecture seems reasonable. You can use the same Solr cluster for normal search as well as the recommender search. If you want to write your own data input to Spark, you might implement a method to instantiate the Mahout IndexedDataset from MongoDB. There is already a companion object for taking a PairRDD of (String, String) as a single event's input and creating an IndexedDataset. This would remove the need for HDFS.
Spark saves temp files but does not require HDFS for storage. If you are using AWS, you could put the Spark retraining work onto EMR, spinning it up for training and tearing it down afterwards.
So the answers are:
Yes, it looks reasonable. You should always keep the event stream in some safe storage.
No, only MongoDB and Solr are needed, as long as you can read from MongoDB into Spark. This would be done in the recommender training code, using Mahout's Spark code for SimilarityAnalysis.cooccurrence.
No known downside; I am not sure of the performance or devops trade-offs.
You must use Spark for Mahout's SimilarityAnalysis.cooccurrence, since it implements the new Correlated Cross-Occurrence (CCO) algorithm, which will greatly improve your ability to use different forms of user data and in turn increase the quality of recommendations. Spark does not require HDFS storage if you feed in events from MongoDB or Solr.
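As a rough illustration of the "feed events into Spark without HDFS" step, here is a hedged Java sketch; the export path and the one-event-per-line format are assumptions (in practice you might read straight from MongoDB with a Spark connector), and the actual IndexedDataset construction plus the SimilarityAnalysis.cooccurrence call live in Mahout's Scala training code, so this only shows getting data into the (user, item) pair shape that code expects:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

// Builds a (userId, itemId) pair RDD from exported interaction events.
// The export path and the "userId,itemId" line format are hypothetical; in
// practice you might read directly from MongoDB with a Spark connector.
public class EventPairRddExample {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("recommender-event-input")
                .setMaster("local[*]"); // local master for illustration only

        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaPairRDD<String, String> userItemPairs = sc
                    .textFile("/tmp/purchase-events.csv")  // assumed export of one event type
                    .map(line -> line.split(","))
                    .filter(fields -> fields.length >= 2)
                    .mapToPair(fields -> new Tuple2<String, String>(fields[0].trim(), fields[1].trim()));

            // This is the (user, item) shape the Mahout training code consumes for
            // a single event type; the IndexedDataset construction and the
            // SimilarityAnalysis.cooccurrence call happen in the Scala training job.
            System.out.println("events: " + userItemPairs.count());
        }
    }
}
```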
BTW: ActionML helps with the data science part of this; we can help you determine which user information is most predictive. We created the first open source implementation of CCO, and we have seen very large increases in recommendation quality from including the right CCO data (much greater than the Netflix Prize's 10%). We also support the PredictionIO implementation of the architecture above. We wrote the Universal Recommender based on Mahout (I'm a Mahout committer); it is much more turnkey than building the system from scratch, but our analysis help is independent of the implementation and might help you with the data science part of the project. See ActionML.com for the Universal Recommender. All of it is free OSS.
Could anybody please recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?
For now I'm moving in the Scalding direction. The prototypes I built allowed me to combine the Scalding library with Maven and to separate the Scalding job JAR from the 'library' code packages. This in turn lets me run Scalding-based Hadoop jobs from outside the cluster with minimal per-job overhead (the 'library' code is pushed to the cluster's distributed cache only when it changes, which is rarely needed, so job code loads quickly).
Now I'm actually starting to play with HBase itself, and I see that Scalding is good but not very 'native' to HBase. Yes, there are things like hbase-scalding, but since I need to plan future work anyway, I'd like to know about other good solutions I may have missed.
What is expected:
Application (job) startup overhead should be low; I need to run a lot of them.
It should be possible (the easier the better) to run jobs from outside the cluster without any SSH (just based on the 'hadoop jar' command, or even simply by executing the application).
The job language itself should allow short, logical semantics. Ideally the code should be simple enough to be generated automatically.
The solution should perform well on reasonably large HBase tables (initially up to 100,000,000 entries).
The solution should be 'live' (actively developed) but relatively good in terms of general stability.
I think the reasoning here could be even more useful than the solution itself, and this question should give many people a couple of ideas.
Any piece of advice?
If you're using Scalding (which I recommend), there's a new project with updated Cascading and Scalding wrappers for accessing HBase. You might want to check it out: https://github.com/ParallelAI/SpyGlass
HPaste (http://www.gravity.com/labs/hpaste/) may be what you are looking for.
You may be interested in the Kiji project (https://github.com/kijiproject/). It provides a "schema-ed" layer on top of HBase.
It also has a Scalding adapter (KijiExpress), so that you can do functional collection operations (map, groupBy, etc.) on 'pipes' of tuples sourced from these schema-ed HBase tables.
Update (August 2014): Stratosphere is now called Apache Flink (incubating)
Check out Stratosphere. It offers a Scala API, has an HBase module, and is under active development.
Starting a job should be possible within a second or so (depending on your cluster size).
You can submit jobs remotely (it has a class called RemoteExecutor that allows you to programmatically submit jobs to remote clusters).
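As an illustration, a hedged Java sketch of remote submission using the DataSet API's remote execution environment; the JobManager host, port, and JAR path are placeholders, and exact class names and defaults depend on the Stratosphere/Flink version:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Submits a small job to a remote cluster from outside it, without SSH.
public class RemoteJobExample {

    public static void main(String[] args) throws Exception {
        // JobManager host/port and the job JAR to ship are assumptions.
        ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 6123, "target/my-job.jar");

        DataSet<String> lines = env.readTextFile("hdfs:///data/input.txt");

        lines.map(new MapFunction<String, String>() {
                 @Override
                 public String map(String value) {
                     return value.toUpperCase();
                 }
             })
             .writeAsText("hdfs:///data/output");

        env.execute("remote-example");
    }
}
```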
Please contact me if you have further questions!
I am currently trying to maintain hbase-scalding in my free time, as I am also picking up Scala.
Please take a look at it on GitHub.
The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from the JavaScript tag (easy enough; we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
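For what it's worth, that piece might look roughly like the hedged sketch below (the JNDI resource names, queue name, and request parameter names are all made up for illustration; this is plain JMS 1.1-style API rather than anything specific to the question):

```java
import java.io.IOException;

import javax.annotation.Resource;
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Receives the beacon request fired by the JavaScript tag and hands the raw
// hit off to a JMS queue for asynchronous processing.
public class PageTagServlet extends HttpServlet {

    @Resource(mappedName = "jms/ConnectionFactory")  // container-provided; the JNDI name is hypothetical
    private ConnectionFactory connectionFactory;

    @Resource(mappedName = "jms/analyticsHits")      // queue name is hypothetical
    private Queue hitQueue;

    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
        // Parameter names depend on what your JS tag actually sends.
        String payload = String.format("url=%s|referrer=%s|visitor=%s|ts=%d",
                req.getParameter("u"), req.getParameter("r"),
                req.getParameter("v"), System.currentTimeMillis());

        Connection connection = null;
        try {
            connection = connectionFactory.createConnection();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(hitQueue);
            producer.send(session.createTextMessage(payload));
        } catch (JMSException e) {
            throw new IOException("Failed to enqueue analytics hit", e);
        } finally {
            if (connection != null) {
                try { connection.close(); } catch (JMSException ignored) { }
            }
        }

        // A beacon usually returns a 1x1 transparent GIF; an empty 204 works too.
        resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
    }
}
```

The JMS listener on the other end can then write each hit to HBase or append it to HDFS for the batch side discussed in the answers below.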
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT: I thought I'd add some clarification about what I'm looking for. I'm seeking advice on architecture and design for a solution like this. We'll collect 20-30 different metrics on a site that gets several million page views per month. That is a lot of data, and we'd like to be able to get metrics as close to real time as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something really bad on our own and be left thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. On top of that, the overhead is extremely costly: on my cluster, Hive queries against HBase are at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting or joining), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
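To illustrate the "query the pre-aggregated files through Hive" step, a rough sketch using the Hive JDBC driver; the HiveServer2 URL, table, and column names are made up, and older setups use the original Hive server with a slightly different JDBC URL:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Reads pre-aggregated metrics (written every few minutes by a MapReduce job)
// from a plain Hive table backed by flat HDFS files.
public class MetricsQueryExample {

    public static void main(String[] args) throws SQLException {
        // With older drivers you may need Class.forName("org.apache.hive.jdbc.HiveDriver") first.
        String url = "jdbc:hive2://hive-host:10000/analytics"; // host, port, and database are placeholders

        try (Connection conn = DriverManager.getConnection(url, "", "");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT metric_name, bucket_start, metric_value "
                   + "FROM page_metrics_5min "   // hypothetical pre-aggregated table
                   + "WHERE dt = ? "
                   + "ORDER BY bucket_start")) {

            stmt.setString(1, "2011-06-01");     // day partition, if the table is partitioned
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s %s %d%n",
                            rs.getString("metric_name"),
                            rs.getString("bucket_start"),
                            rs.getLong("metric_value"));
                }
            }
        }
    }
}
```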
Finally, another option is to use something like Storm (which runs on Hadoop) to collect and analyze data in real time and store the results for querying as described above, or to store them in HBase for display through a custom user interface that queries HBase directly.
I'm trying to implement part of the Facebook Ads API, specifically the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies a 39 MB file, updated weekly, which contains ad-targeting data including colleges, college majors, workplaces, locales, countries, regions, and cities.
Our application needs to access all of those objects and provide autocompletion based on this file's data.
I'm thinking about the preferred way to solve this. I was considering one of the following options:
Loading it into memory using a trie (Patricia trie); the disadvantage, of course, is that it will take a lot of memory on the server (see the sketch after this list).
Using a dedicated search platform such as Solr on a different machine; the disadvantage is that this may be over-engineering (though the file size will probably grow significantly in the future).
(Fill in your cool, easy, speed-of-light option here)?
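To make the trie option concrete, a minimal sketch using Apache Commons Collections' PatriciaTrie (the entries and id values are made up; the real data would come from the weekly file, and real entries would also carry a type such as college or city):

```java
import java.util.SortedMap;

import org.apache.commons.collections4.trie.PatriciaTrie;

// Loads name -> id pairs into a radix trie and answers prefix queries.
public class AutoCompleteIndex {

    private final PatriciaTrie<Long> trie = new PatriciaTrie<>();

    public void add(String name, long targetingId) {
        // Lower-case keys so lookups are case-insensitive.
        trie.put(name.toLowerCase(), targetingId);
    }

    public SortedMap<String, Long> complete(String prefix) {
        // prefixMap is a view of all entries whose key starts with the prefix.
        return trie.prefixMap(prefix.toLowerCase());
    }

    public static void main(String[] args) {
        AutoCompleteIndex index = new AutoCompleteIndex();
        // Entries are hypothetical; in practice they come from the weekly 39 MB file.
        index.add("Harvard University", 101L);
        index.add("Harvey Mudd College", 102L);
        index.add("Stanford University", 103L);

        System.out.println(index.complete("harv").keySet());
        // -> [harvard university, harvey mudd college]
    }
}
```

The memory concern is mostly about what you store as values; keeping only ids in the trie and looking full objects up elsewhere keeps the footprint close to the raw key data.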
Well, what do you think?
I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and everything, how big will this get? 400 MB? This of course depends on what your product does and what kind of hardware you want to run it on.
I would go with Solr, or write your own service that reads the file into a fast database like a MySQL MyISAM table (or even an in-memory table) and uses MySQL's full-text search feature to serve up results. Barring that, I would try to use Solr as a service.
The benefit of writing my own service is that I know what is going on; the downside is that it will be nowhere near as powerful as Solr. However, I suspect writing my own service would take less time to implement.
Consider writing your own service that serves up requests asynchronously (if your product is a website, then using Ajax). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.
There are several advantages to using Solr 1.4 (out-of-the-box faceted search, grouping, replication, HTTP administration vs. Luke, ...).
Even if I embed the search functionality in my Java application, I could use SolrJ to avoid the HTTP trade-off of using Solr. Is SolrJ recommended at all?
So, when would you recommend "pure Lucene"? Does it have better performance or require less RAM? Is it easier to unit test?
PS: I am aware of this question.
If you have a web application, use Solr - I've tried integrating both, and Solr is easier. Otherwise, if you don't need Solr's features (the one that comes to mind as being most important is faceted search), then use Lucene.
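If you go the Solr route, the SolrJ usage the question asks about looks roughly like this (a hedged sketch: the core URL, field names, and facet field are placeholders, and HttpSolrClient is the newer client class; older SolrJ versions call it HttpSolrServer or CommonsHttpSolrServer):

```java
import java.io.IOException;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

// Minimal index-then-search round trip against a running Solr core.
public class SolrJExample {

    public static void main(String[] args) throws IOException, SolrServerException {
        HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build();

        // Index one document; field names must exist in the core's schema.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Lucene in Action");
        solr.add(doc);
        solr.commit();

        // Query it back, with a facet on a hypothetical "author" field.
        SolrQuery query = new SolrQuery("title:lucene");
        query.setFacet(true);
        query.addFacetField("author");
        QueryResponse response = solr.query(query);
        for (SolrDocument hit : response.getResults()) {
            System.out.println(hit.getFieldValue("id") + " -> " + hit.getFieldValue("title"));
        }

        solr.close();
    }
}
```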
If you want to completely embed your search functionality within your application and do not want to maintain a separate process like Solr, using Lucene is probably preferable. For example, a desktop application might need some search functionality (like the Eclipse IDE, which uses Lucene for searching its documentation). You probably don't want that kind of application to launch a heavy process like Solr.
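For the fully embedded case, a minimal in-process Lucene sketch (assuming a recent Lucene, roughly 8.x/9.x; class names such as ByteBuffersDirectory differ in older releases, and a real application would normally use an on-disk FSDirectory):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Indexes a couple of documents in memory and searches them, all in-process.
public class EmbeddedLuceneExample {

    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory directory = new ByteBuffersDirectory();

        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
            writer.addDocument(doc("Eclipse IDE documentation"));
            writer.addDocument(doc("Lucene query syntax guide"));
        }

        try (DirectoryReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("title", analyzer).parse("lucene");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }

    private static Document doc(String title) {
        Document d = new Document();
        d.add(new TextField("title", title, Field.Store.YES));
        return d;
    }
}
```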
Here is one situation where I have to use Lucene.
Given a set of documents, find the most common terms in them.
Here, I need to access the term vectors of each document (using the low-level TermVectorMapper API). With Lucene it's quite easy.
Another use case is very specialized ordering of search results. For example, I want a search for an author's name (one who has written multiple books) to return one book from each store in the first 10 results. In this case, I find results from each book store and, to build the final result set, pick one result from each store. Here you are essentially doing multiple searches to generate the final results, and having access to Lucene's low-level APIs definitely helps.
One more reason to go with Lucene used to be getting new goodies ASAP. That is no longer true, since the two projects have merged and there will be synchronized releases.
I'm surprised nobody mentioned NRT - Near Real Time search, available with Lucene, but not with Solr (yet).
Use Solr if you are more concerned about scalability than performance and use Lucene if you are more concerned about performance than scalability.