Realtime Prediction.io with Apache Storm - java

I want to implement a music recommendation system that can generate recommended music playlists in realtime. I believe that this can be implemented in Prediction.io...
However, due to Prediction.io's design, I need to call pio train and pio deploy in order to update the learning model with the new actions performed by the user (liked music, etc.). Hence, I would need to run these commands every 2 hours (or some other appropriate interval).
I recently came across Apache Storm, and I really like the concept of "realtime Hadoop"-style processing. Hence, I was wondering whether I could incorporate Prediction.io with Apache Storm so that the learning is done "online", which would allow my app to recommend music after just a few likes/actions by the user, instead of making the user wait until the learning model is updated.
If this is not viable, is it possible to incorporate Spark's MLlib into an Apache Storm bolt (Java), since I can build recommendation systems with it (and it also seems that Prediction.io itself is built upon Apache Spark)?
Thanks in advance!

The use case seems viable, but I would not consider "needing to run something every few hours" a good motivation for using Storm. On the other hand, if your learning data is streaming, you can model your Storm topology to update its internal knowledge base every time new data arrives. This lets you use the most up-to-date knowledge base every time a user queries something.
As for which libraries can be used with Storm: any Java library (in fact, any library in any language that can interface with Java) should work.
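To make the "update the knowledge base as data arrives" idea concrete, here is a minimal sketch of a Storm bolt in Java that keeps a toy co-occurrence model in memory and re-scores recommendations on every incoming like. The tuple field names and the CoOccurrenceModel are illustrative assumptions, not an existing API (in a real topology you would swap in whatever learner you choose, MLlib-backed or otherwise); the package names follow recent Apache Storm releases (older ones used backtype.storm):

```java
import java.util.*;
import java.util.stream.Collectors;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Sketch: fold each "like" event into an in-memory model and emit fresh
// recommendations immediately, instead of waiting for a batch retrain.
public class OnlineRecommenderBolt extends BaseRichBolt {

    // Naive item co-occurrence model kept inside the bolt instance (illustrative only).
    static class CoOccurrenceModel {
        private final Map<String, Set<String>> likesByUser = new HashMap<>();

        void recordLike(String userId, String songId) {
            likesByUser.computeIfAbsent(userId, k -> new HashSet<>()).add(songId);
        }

        List<String> recommendFor(String userId, int limit) {
            Set<String> mine = likesByUser.getOrDefault(userId, Collections.emptySet());
            Map<String, Integer> scores = new HashMap<>();
            for (Map.Entry<String, Set<String>> e : likesByUser.entrySet()) {
                if (e.getKey().equals(userId)) continue;
                if (Collections.disjoint(mine, e.getValue())) continue; // no shared likes, skip
                for (String song : e.getValue()) {
                    if (!mine.contains(song)) scores.merge(song, 1, Integer::sum);
                }
            }
            return scores.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(limit)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }
    }

    private transient OutputCollector collector;
    private transient CoOccurrenceModel model;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.model = new CoOccurrenceModel(); // could also be seeded from a periodic batch snapshot
    }

    @Override
    public void execute(Tuple tuple) {
        String userId = tuple.getStringByField("userId"); // assumed field names
        String songId = tuple.getStringByField("songId");
        model.recordLike(userId, songId);                 // online update
        collector.emit(tuple, new Values(userId, model.recommendFor(userId, 10)));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("userId", "recommendations"));
    }
}
```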

Related

How to capture save or update events in Couchbase

I would like to be able to do some data manipulation when documents are updated or created in Couchbase.
Documents can arrive in our database either via Sync Gateway or via our own code, which streams data in from an HTTP service. It would be great to have one place where I can intercept all updates.
We are running a Spring Boot REST API against this data, so this would be a good place for the interceptor/listener. Either way, my preference would be for a Java solution.
The data is written as JSON rather than via Spring entities, so I can't use ApplicationListener, which only listens to events on entity classes. Correct me if I'm wrong: I can find precious few examples of setting up ApplicationListeners, so I may be wrong here, but I can't seem to get it working.
I see that there is an Eventing service where you write JavaScript, but for a number of reasons I'm not keen to go that way: I don't want to fragment our API code across platforms and languages, I'm not sure I can run the Eventing service on our systems, etc. Again, I'm open to debate though.
That leaves only DCP, as far as I can tell, which seems very low level (https://blog.couchbase.com/couchbases-history-everything-dcp/) but looks like the tool for the job.
The QUESTION: Is there an alternative, less low-level way to catch update events in Couchbase for JSON documents (NOT entities), other than DCP?
Disclaimer: I work for Couchbase and develop the Java DCP client.
If you've already evaluated the Eventing service and decided it doesn't meet your requirements, the Java DCP client might be worth looking into even though it's not officially supported. It's used by the official Couchbase connectors for Kafka, Spark, and Elasticsearch (all of which are open source) and is actively maintained.
If you only care about events that happened since your app started up, usage can be as simple as registering a callback and starting the event stream. Things get a bit more complicated if you need to remember your place in the stream and resume later (to process events that occurred while you were offline, for example), but there's example code for that case too.
The DCP protocol itself is well documented. If you decide to go this route, it might be good to read at least the Architecture section of that documentation. Also be aware that because the Java DCP Client is unsupported, the API can change without notice. (Officially supporting the library and providing a friendlier API are among our long-term goals, but we haven't committed to anything yet.)
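Roughly, the "register a callback and start streaming" path looks like the sketch below. This is based on the client's published examples rather than a stable API, so the exact builder and handler signatures (hostnames vs. seed nodes, credentials handling, message helpers) may differ in the version you pick up:

```java
import com.couchbase.client.dcp.Client;
import com.couchbase.client.dcp.StreamFrom;
import com.couchbase.client.dcp.StreamTo;
import com.couchbase.client.dcp.message.DcpDeletionMessage;
import com.couchbase.client.dcp.message.DcpMutationMessage;

public class DcpListenerSketch {
    public static void main(String[] args) throws Exception {
        Client client = Client.configure()
                .hostnames("127.0.0.1")                  // cluster address
                .bucket("my-bucket")                     // bucket to stream from
                .credentials("username", "password")     // needed on RBAC-enabled clusters
                .build();

        // Control events (snapshot markers, rollbacks, ...) must at least be released.
        client.controlEventHandler((flowController, event) -> event.release());

        // Data events: every mutation and deletion in the bucket flows through here.
        client.dataEventHandler((flowController, event) -> {
            if (DcpMutationMessage.is(event)) {
                System.out.println("Mutation: " + DcpMutationMessage.toString(event));
            } else if (DcpDeletionMessage.is(event)) {
                System.out.println("Deletion: " + DcpDeletionMessage.toString(event));
            }
            event.release();
        });

        client.connect().await();
        // Start from "now" and never stop; persisting and resuming stream state is the more involved case.
        client.initializeState(StreamFrom.NOW, StreamTo.INFINITY).await();
        client.startStreaming().await();
    }
}
```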
Like David, I also work for Couchbase as a product manager for the Eventing service.
I would like to be able to do some data manipulation when documents are updated or created in Couchbase.
Eventing certainly allows anyone to respond to and perform data manipulation on mutations (inserts or upserts) via tiny JavaScript fragments. Just take a look at couchbase-eventing-small-scripts-that-solve-big-problems for a quick introduction and also the eventing-examples from the documentation.
If you do go the Eventing service route on an SGW-enabled bucket, you will need to suppress duplicate mutations via the crc64() function built into Eventing (for details, go to eventing-language-constructs and search for "Sync Gateway"). In addition, if you want Eventing to directly update the source bucket while SGW is enabled on that bucket, there is a more involved workaround (just reach out to me and I will be happy to provide it).
Next you stated:
not sure I can run the Eventing service on our systems
The Eventing service is bundled with the Couchbase Enterprise offering; it provides scalable infrastructure to run small JavaScript fragments on data or documents as they change or mutate, without the overhead of an SDK. You either add standalone Eventing node(s) to your Couchbase cluster or colocate the Eventing service with other existing nodes.

Extensible Java Calendaring/Task Server

I have a Java project where I will need to create an application to manage calendaring events and tasks. This application will be part of a larger project and will need to implement pretty specific functionality. At its heart, though, it will have events with dates, reminders, alarms, and to-do lists with deadlines. It will also need to share events between different users and groups.
It seems to me that the core functions of this system must be something that has been pretty much solved, and I am wondering whether there is a Java system/library that implements it.
I have looked into the following :
- bedework : seems to implement everything I need, and a lot more. I have looked into it, but it does not seem like the source code is documented or that the core functions are made to be integrated into another project (I do not need their web interface or the extra features)
- cosmo : seems abandoned
- Oracle Calendar Server : Does not seem to be open source
- Milton IO : seems like something I would use if I wanted to support CalDAV, but I would still need to implement the calendar back end
Pointers? Solutions? Recommendations?

Alternatives to scalding for HBase access from Scala (or Java)

Could anybody please recommend a good solution (framework) for accessing HBase on a Hadoop cluster from a Scala (or Java) application?
So far I'm leaning in the Scalding direction. The prototypes I built allowed me to combine the Scalding library with Maven and to separate the Scalding job JAR from the "library" code packages. This in turn let me run Scalding-based Hadoop jobs from outside the cluster with minimal overhead per job (the "library" code is pushed to the cluster's distributed cache only when it changes, which is rarely, so job code loads fast).
Now I'm actually starting to play with HBase itself, and I can see that Scalding is good but not so "native" to HBase. Yes, there are things like hbase-scalding, but since I'm at a point where I need to plan future work anyway, I'd like to know about other good solutions I may have missed.
What is expected:
Application (job) startup overhead should be low; I need to run a lot of them.
It should be possible (the easier the better) to run jobs from outside the cluster without any SSH (just via the 'hadoop jar' command, or even simply by executing the application).
The job language itself should allow short, logical semantics. Ideally the code should be simple enough to be generated automatically.
The solution should perform well on reasonably large HBase tables (initially up to 100,000,000 entries).
The solution should be "alive" (actively developed) but reasonably good in terms of general stability.
I think the reasoning here could be even more useful than the solution itself, and this question should offer a couple of ideas for many people.
Any piece of advice?
If you're using Scalding (which I recommend), there's a new project with updated Cascading and Scalding wrappers for accessing HBase. You might want to check it out: https://github.com/ParallelAI/SpyGlass
HPaste http://www.gravity.com/labs/hpaste/ may be what you are looking for.
You may be interested in the Kiji project (https://github.com/kijiproject/). It provides a "schema-ed" layer on top of HBase.
It also has a Scalding adapter (KijiExpress) so that you can do functional collections operations (map, groupby, etc.) on "pipes" of tuples sourced from these schema-ed HBase tables.
Update (August 2014): Stratosphere is now called Apache Flink (incubating)
Check out Stratosphere. It offers a Scala API, has an HBase module, and is under active development.
Starting a job should be possible within a second or so (depending on your cluster size).
You can submit jobs remotely (it has a class called RemoteExecutor which allows you to programmatically submit jobs on remote clusters)
Please contact me if you have further questions!
I am currently trying to maintain hbase-scalding in my free time, as I am also picking up Scala.
Please take a look at it on GitHub.

Hadoop-Hive-HBase Advice for Web Analytics

The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about Map-Reduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use Javascript page tagging to gather the metrics, and Hadoop and something to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a Javascript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT : I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to realtime as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad that will leave us thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration forces HBase tables into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: Hive queries against HBase are, on my cluster, at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results stored in plain rectangular files that you can query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc.), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
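To make the pre-aggregation step concrete, here is a bare-bones sketch of the kind of periodic MapReduce job meant above. It assumes raw hits are tab-separated lines of (timestamp in milliseconds, page, visitorId), buckets them into 5-minute windows per page, and writes counts as flat text that a Hive external table can sit on top of; the field layout and paths are assumptions for illustration:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageViewsPerWindow {

    public static class WindowMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t"); // timestamp \t page \t visitorId (assumed layout)
            if (fields.length < 2) return;
            long ts = Long.parseLong(fields[0]);
            long window = ts - (ts % (5 * 60 * 1000));     // floor to the 5-minute bucket
            outKey.set(window + "\t" + fields[1]);
            ctx.write(outKey, ONE);
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) sum += c.get();
            ctx.write(key, new LongWritable(sum)); // window \t page \t count -> trivial to expose as a Hive table
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page-views-per-5min");
        job.setJarByClass(PageViewsPerWindow.class);
        job.setMapperClass(WindowMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw hits
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // pre-aggregated output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```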
Finally, another option is to use something like Storm (which can run alongside your Hadoop cluster) to collect and analyze data in real time, and either store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.

Implementing a Dynamic Award System

I have been developing an online poker game, but I keep hitting a wall. I want to implement awards in the system, but I want them to be dynamic, meaning I don't want to recompile for every award I would like to add.
I have thought about using Python code for each award. When the server checks whether the user qualifies for an award, it runs the Python script with Jython (the server is in Java with Netty NIO), and if the function returns a certain value I grant the award to the user. This could work, but is there a more efficient technique that won't force me to run hundreds of Python scripts every time I need to check whether a user has earned an award?
And when is the best time to do these checks? I have thought about a hook system where I would specify hooks like ([onconnect], [ondisconnect], [chatmessage.received]). This could also work, but it feels a bit crude, and I would still have to run all the scripts from the database.
If I were you, I'd have a totally separate process that grants awards. It runs perhaps once a day on the underlying database that contains all your player/game data.
Your core customer-facing app knows about awards, but all it knows about them is data it loads from the DB -- something like a title, image, description, maybe how many people have the award, etc., and (based on DB tables) who has won the award.
Your "award granter" process simply runs in batch mode, once per day / hour etc, and grants new awards to eligible players. Then the core customer-facing app notifies them but doesn't actually have to know the smarts of how to grant them. This gives you the freedom to recompile and re-run your award granter any time you want with no core app impact.
Another approach, depending on how constrained your awards are, would be to write a simple rules interface that allows you to define rules in data. That would be ideal to achieve what you describe, but it's quite a bit of work for not much reward, in my opinion.
PS -- in running something like an online poker server, you're going to run into versions of this problem all the time. You are absolutely going to need to develop a way to deploy new code without killing your service or having a downtime window. Working around a java-centric code solution for awards is not going to solve that problem for you in the long run. You should look into the literature on running true 24/7 services, there are quite a few ways to address the issue and it's actually not that difficult these days.
There are a number of options I can think of:
OSGi as described above - it comes at a cost, but is probably the most generic and dynamic solution out there
If you're open to restarting (just not recompiling), a collection of JARs in a well-known folder plus Spring gives you a cheaper but equally generic solution. Just have your awards implement a standard interface, make them beans, and let Spring @Autowire all the available awards into your checker (see the sketch after this list).
If your award execution is fairly standard, and the only variation between awards is the rules themselves, you can use some kind of scripted configuration. There are many options there, from the Python you described (except I'd go for a few big scripts managing all awards) to basic regular expressions, with Lua and Drools in the middle. In all cases you're looking at some kind of rules-engine architecture, which is flexible in terms of what an award can trigger on but doesn't offer much flexibility in terms of what an award can lead to (i.e. perfect for achievements).
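A minimal sketch of the Spring variant; the Award interface, AwardChecker, PlayerEvent, and the example rule are hypothetical names for illustration:

```java
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

// Hypothetical event carried through the game server.
class PlayerEvent {
    final String type;
    final long playerId;
    final long totalWins;
    PlayerEvent(String type, long playerId, long totalWins) {
        this.type = type; this.playerId = playerId; this.totalWins = totalWins;
    }
}

// Each award ships as its own bean, e.g. in a JAR dropped into the well-known
// folder and picked up by classpath scanning on the next restart.
interface Award {
    String code();
    boolean earnedBy(PlayerEvent event);
}

@Component
class FirstWinAward implements Award {
    public String code() { return "FIRST_WIN"; }
    public boolean earnedBy(PlayerEvent event) {
        return "HAND_WON".equals(event.type) && event.totalWins == 1;
    }
}

@Component
class AwardChecker {
    private final List<Award> awards;

    @Autowired
    AwardChecker(List<Award> awards) {   // Spring injects every Award bean it finds on the classpath
        this.awards = awards;
    }

    void check(PlayerEvent event) {
        for (Award award : awards) {
            if (award.earnedBy(event)) {
                // persist the grant and notify the player, e.g. via a repository/notifier bean
            }
        }
    }
}
```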
Some comments on the batch-processing answer above:
The batch processes can run on a separate server/machine, so you can recompile the app or restart that server at any time. New awards can then be handled using, for example, the mentioned approach of adding JARs and restarting the server; new batch jobs can also be introduced at any time, and so on. Your core application keeps running 99% of the time, while the batch server can be restarted frequently. So having separate batch machines is good.
When you need to deploy a new version of the core app, I think you can just stop it, deploy, and start it again with a maintenance notice to users. That approach is used even by top poker rooms with great software (e.g. FullTiltPoker did so; right now it is down due to the loss of its license, but their site says "System Update" :) ).
So one approach to version updates is to redeploy/restart during off-hours.
Another approach is real-time updates. As a rule, this is done by migrating users to the new version batch by batch, so at any given time some users are on the old version and some on the new. That's not great for poker software, where users with different versions can interact, but if you are sure of the versions' compatibility you can go with this approach, checking the user's client version on login, for example.
In my answer I tried to say that you need not introduce 24/7 support logic into your code. Leave the system-availability problems to the infrastructure (failovers, load balancers, etc.). You can follow whatever good coding techniques you already use; you only need to remember that your crucial core logic is deployed infrequently (for example once a week), while the batch part can be updated/restarted at any time if needed.
As far as I understand you, you probably do not need to run external processes from your application, nor use OSGi.
Just create a simple Java interface and implement each plugin ("award") as a class implementing that interface. You can then simply compile any new plugin and load it into your application at run time as a class file, using Class.forName(String className).
Any logic you need from such a plugin would be contained in methods on the interface.
http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Class.html#forName(java.lang.String)
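A rough sketch of that approach; the plugin directory, interface, and class names are illustrative, and a URLClassLoader is used so class files dropped outside the application's own classpath can be picked up before handing the name to Class.forName:

```java
import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;

// Shared contract that every award plugin implements.
interface AwardPlugin {
    String name();
    boolean qualifies(long playerId); // hypothetical check; real code would receive richer context
}

public class AwardPluginLoader {
    public static AwardPlugin load(File pluginDir, String className) throws Exception {
        // Class files (or JARs) dropped into pluginDir are picked up without recompiling the server.
        URLClassLoader loader = new URLClassLoader(
                new URL[] { pluginDir.toURI().toURL() },
                AwardPluginLoader.class.getClassLoader());
        Class<?> clazz = Class.forName(className, true, loader);
        return (AwardPlugin) clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // e.g. plugins/com/example/awards/BigWinAward.class compiled against AwardPlugin
        AwardPlugin award = load(new File("plugins"), "com.example.awards.BigWinAward");
        System.out.println(award.name() + " qualifies? " + award.qualifies(42L));
    }
}
```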
