Hadoop-Hive-HBase Advice for Web Analytics

The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about MapReduce, specifically Hadoop and its supporting components (HBase, HDFS, Pig, Hive, etc.). To learn more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, and Hadoop plus something else to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough, since we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT : I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to realtime as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad that will leave us thinking we're "Hadoop experts" just because it works.

Hive, as you mentioned, has high query latency. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces the HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead is considerable: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results, stored in plain rectangular files that you can query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting or joining), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
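To make the pre-aggregation idea concrete, here is a minimal plain-Java sketch of the roll-up such a periodic job would compute: page-view counts per page per 5-minute window. It is not a MapReduce job; the PageView shape, the class name and the fixed window size are all assumptions made for illustration. In a real job the window start plus the page would roughly play the role of the map output key, and the count would be summed in the reducer.

```java
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Rolls raw page-view events up into per-page counts for fixed 5-minute windows. */
final class PageViewRollup {

    /** One raw hit, as a listener might have stored it (hypothetical shape). */
    static final class PageView {
        final String page;
        final Instant timestamp;

        PageView(String page, Instant timestamp) {
            this.page = page;
            this.timestamp = timestamp;
        }
    }

    /** Window length in seconds (assumed: 5 minutes, matching the job interval). */
    static final long WINDOW_SECONDS = 300;

    /** Returns the start of the window that contains the given instant. */
    static Instant windowStart(Instant t) {
        long seconds = t.getEpochSecond();
        return Instant.ofEpochSecond(seconds - Math.floorMod(seconds, WINDOW_SECONDS));
    }

    /** Returns window start -> (page -> hit count), sorted by window and then by page. */
    static Map<Instant, Map<String, Long>> rollUp(List<PageView> views) {
        Map<Instant, Map<String, Long>> result = new TreeMap<>();
        for (PageView view : views) {
            result.computeIfAbsent(windowStart(view.timestamp), k -> new TreeMap<>())
                  .merge(view.page, 1L, Long::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<PageView> views = Arrays.asList(
            new PageView("/home", Instant.parse("2012-01-01T10:01:00Z")),
            new PageView("/home", Instant.parse("2012-01-01T10:04:59Z")),
            new PageView("/about", Instant.parse("2012-01-01T10:03:00Z")),
            new PageView("/home", Instant.parse("2012-01-01T10:05:00Z")));
        // Prints one line per window with its per-page counts.
        rollUp(views).forEach((window, counts) -> System.out.println(window + " " + counts));
    }
}
```

Each row of the result is small and rectangular (window, page, count), which is the kind of file Hive can stream back without launching a job.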
Finally, another option is to use something like Storm (a distributed real-time computation system that is deployed alongside Hadoop rather than running as MapReduce jobs) to collect and analyze data in real time. You can then store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.

Related

using hadoop on the current application

We have an application written in Java that uses Solr, Elasticsearch, Neo4j, MySQL and a few more components.
We need to increase our data size dramatically (from millions to billions).
So here are the options I have to make this work:
Cluster the individual components, notably Solr, ES, Neo4j and MySQL.
Use what everyone talks about nowadays: Hadoop.
The problem with the first option is that it is hard to manage.
The second option sounds too good to be true. So my questions are:
Can I actually assume that Hadoop can do that before digging in?
What other criteria do I need to consider?
Is there any alternative solution for such a task?

Solr is for searching data. If you want to process big data (data that meets the criteria of volume, velocity and variety), for example for ETL and reporting, you would need Hadoop.
Hadoop consists of several ecosystem components. You can refer to the link below for documentation:
https://hadoop.apache.org

database for log analysis application in java - 2014

I want to create a Java application for handling and analyzing live streaming logs. I also have to implement some complex filter functionality. I have been researching the best-suited database for this.
I came across many portable databases such as MongoDB, HBase and H2. Among them, MongoDB seems to be the better candidate. But for my requirement, insertion and selection may happen at the same time, and I read somewhere that MongoDB is not the best at handling concurrency.
I'm sure that, moving forward, database performance is going to play a crucial role in the overall performance of the application.
I came across many Stack Overflow links about this, but all of them were asked two or more years ago.
Can MongoDB handle concurrency? Is there any other portable database that is better than MongoDB for this?
Please help.

Have you looked at a solution such as Elasticsearch coupled with Kibana and td-agent?
It provides asynchronous logging. I've used it to store and analyze 30 million events per day from several servers, but it depends on what you want to do in the end.

Tools to do data processing from Java

I've got a legacy system that uses SAS to ingest raw data from the database, cleanse and consolidate it, and then score the outputted documents.
I want to move to a Java or similar object-oriented solution so I can implement unit testing and generally have better control over the code. (I'm not talking about overhauling the whole system, but injecting Java where I can.)
In terms of data size, we're talking about around 1 TB of data being both ingested and created. In terms of scaling, this might increase by a factor of around 10, but isn't likely to increase on massive scale like a worldwide web project might.
The question is - what tools would be most appropriate for this kind of project?
Where would I find this information - what search terms should be used?
Is doing processing on an SQL database (creating and dropping tables, adding columns, as needed) an appropriate, or awful, solution?
I've had a quick look at Hadoop - but due to the small scale of this project, would Hadoop be an unnecessary complication?
Are there any Java packages that offer functionality similar to SAS or SQL for merging, joining, sorting and grouping datasets, as well as modifying data?

It's hard for me to prescribe exactly what you need given your problem statement.
It sounds like a good database API might be all you need (i.e. native JDBC with a good open source database backend).
However, I think you should take some time to check out Lucene. It's a fantastic tool and may meet your scoring needs very well. Taking a search engine indexing approach to your problem may be fruitful.

I think the questions you need to ask yourself are:
What is the nature of your data set, and how often will it be updated?
What workload will you have on this 1 TB or more of data in the future? Will it be mainly offline read and analysis operations, or will there also be a lot of random write operations?
Here is an article on whether or not to choose Hadoop, which I think is worth reading.
Hadoop is a better choice if your data set is only updated daily or weekly and the major operations on the data are read-only, along with further analysis. For the merging, joining, sorting and grouping operations you mentioned, Cascading is a Java library running on top of Hadoop that supports them well.

Loading facebook's big text file to memory (39MB) for autocompletion

I'm trying to implement part of the Facebook Ads API, the autocomplete function ads.getAutoCompleteData.
Basically, Facebook supplies this 39 MB file, which is updated weekly and contains ad targeting data including colleges, college majors, workplaces, locales, countries, regions and cities.
Our application needs to access all of those objects and supply autocompletion using this file's data.
I'm thinking about the preferred way to solve this. I was considering one of the following options:
Loading it into memory using a trie (Patricia trie). The disadvantage, of course, is that it will take too much memory on the server.
Using a dedicated search platform such as Solr on a different machine. The disadvantage is that it may be over-engineering (though the file size will probably increase substantially in the future).
(Fill in a cool, easy, speed-of-light option here)?
Well, what do you think?

I would stick with a service-oriented architecture (especially if the product is supposed to handle high volumes) and go with Solr. That being said, 39 MB is not a lot to hold in memory if it's going to be a singleton. With indexes and all, this will get up to what? 400 MB? This of course depends on what your product does and what kind of hardware you wish to run it on.
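For a sense of how small the in-memory option can be, here is a minimal sketch of prefix completion over a sorted map. It uses a TreeMap rather than the Patricia trie from the question, and the class name, method names and sample entries are assumptions for illustration; parsing Facebook's actual file format is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Case-insensitive prefix completion backed by a sorted map (not a trie). */
final class PrefixIndex {

    /** Lower-cased name -> name as it should be displayed. */
    private final TreeMap<String, String> entries = new TreeMap<>();

    void add(String name) {
        entries.put(name.toLowerCase(Locale.ROOT), name);
    }

    /** Returns up to {@code limit} names starting with the prefix, in sorted order. */
    List<String> complete(String prefix, int limit) {
        String key = prefix.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        // All keys sharing the prefix are contiguous, starting at the first key >= prefix.
        for (Map.Entry<String, String> e : entries.tailMap(key, true).entrySet()) {
            if (matches.size() >= limit || !e.getKey().startsWith(key)) {
                break;
            }
            matches.add(e.getValue());
        }
        return matches;
    }

    public static void main(String[] args) {
        PrefixIndex index = new PrefixIndex();
        index.add("Stanford University");
        index.add("Stanford Law School");
        index.add("State University");
        index.add("MIT");
        System.out.println(index.complete("stan", 10));
    }
}
```

A lookup costs one tree descent plus the number of results returned, so the weekly reload of the file is likely the more expensive part. Whether the memory footprint is acceptable still has to be measured on the real data.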
I would go with Solr, or write your own service that reads the file into a fast DB like a MySQL MyISAM table (or even an in-memory table) and uses MySQL's text search feature to serve up results. Barring that, I would try to use Solr as a service.
The benefit of writing my own service is that I know what is going on; the downside is that it'll be nowhere near as powerful as Solr. However, I suspect writing my own service will take less time to implement.
Consider writing your own service that serves up requests in an async manner (using Ajax if your product is a website). The trouble with Solr or Lucene is that if you get stuck, there is not a lot of help out there.
Just my 2 cents.

Is Web Service suitable for ETL purpose?

My company is considering using a web service as the means for an ETL process. However, I don't think a web service fits this purpose, for several reasons:
1. A web service could consume a lot of memory when generating large XML.
2. XML is a bloated format.
3. It could time out if the server takes a huge amount of time to generate the data.
4. File size limitation? (For Windows it's 2 GB, if my memory serves me right.)
I am not a web service expert, so I need your opinions. :)
Thanks.

There are plenty of technologies in the Web Services tool shed that circumvent all the problems you list. There is stream-oriented XML shredding, there are XML compression formats for delivery, there are protocols that deal with fragmentation and fairness, and there are many storage systems that can hold terabytes upon terabytes of data.
If by web service you imagine some college freshman homework concoction of an interface that accepts a single glop argument with a 2 GB serialized table in it, then all your arguments are valid. But if you give your requirements to an experienced team with knowledge of the concepts involved in WS-ReliableMessaging and WS-Transaction, then there is no reason not to build an ETL process around Web Services. Note that I do not advocate the SOAP protocols per se, but I do advocate knowledge and understanding of the concepts involved.
That being said, whether a Web Service oriented ETL process makes sense for you depends on a whole set of other reasons. However, your rebuttal of the Web Service technologies does not hold water.
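As a small illustration of the stream-oriented XML processing mentioned above, here is a sketch using StAX (javax.xml.stream, part of the JDK). It reads a document one event at a time instead of building the whole tree in memory. The row element and the amount attribute are a made-up payload format, not anything from the question.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Sums an attribute over a stream of row elements without loading the whole document. */
final class StreamingRowSum {

    /** Sums the "amount" attribute of every row element (hypothetical payload format). */
    static long sumAmounts(Reader xml) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                long total = 0;
                while (reader.hasNext()) {
                    // Only the current parse event is held in memory at any time.
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && "row".equals(reader.getLocalName())) {
                        total += Long.parseLong(reader.getAttributeValue(null, "amount"));
                    }
                }
                return total;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML stream", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rows><row amount=\"5\"/><row amount=\"7\"/></rows>";
        System.out.println(sumAmounts(new StringReader(xml)));
    }
}
```

The same loop works when the Reader wraps an HTTP response body, so memory use is governed by the size of one record rather than the size of the transfer.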

I would not use a web service for an ETL task. There are specialized tools for that task (e.g., Ab Initio, Informatica, etc.) that are better suited.
If you have a large amount of data, I'd say that the price of the extra latency that the network would introduce would be prohibitive.

It really does depend on what you are doing and how you are trying to accomplish it. In general webservices require more care and feeding than you would normally put into an ETL process, but they can be surprisingly effective at the task as well. I did not get enough specifics for your scenario to say whether it would work.
I have worked on web services that transmit and receive 100+ MB documents, some encoded in XML and some not, and do it in seconds (on a closed local network). These services required a good deal of tuning and planning, but they worked well for our scenario and allowed a wide variety of clients to connect and transmit differing amounts of data through a fairly standard interface. This differed from some of the other ETL jobs we had, where the job was specific to each client and had to be set up and maintained for each client.
It all depends on what you are doing and what your constraints are.
If you are going to pursue this route sit down and draft out the process from beginning to end, including how you want clients to connect, verify that the data was received and verify that the job is finished. Consider some of the scenarios, the clients and the types of data being transmitted and then work out what would be needed. Contrast that with what is already available in other tools, and how much time you have to get it done.

I'm really wondering why your company is not considering a real ETL tool like those mentioned by duffymo in his answer, or Talend or CloverETL if open source is an option.
They are in general good for ETL purpose :)
Building your own solution sounds like reinventing the wheel.
Many of them have web services oriented features (see Export a job as webservice in Talend's wiki or CloverETL Server HTTP Launch Services for example).
I'm not an ETL product expert and I didn't check them all but I'm pretty sure this is something to consider.

Look up MTOM, to start with, which allows arbitrary non-XML data to be streamed in a web service.

Web services are just fine for ETL tasks. Remember that each task is going to get handled in its own thread for free, and you're guaranteed proper cleanup between requests. Using web services inside something like Tomcat wouldn't be nearly as heavy as you think.
If you're concerned over the bloat of XML, consider JSON format.
