Simple node discovery method - Java

I'm starting work on a system that will need to discover nodes in a cluster and send those nodes jobs to work on. I know that a myriad of systems exist that solve this but I'm unclear about the complexities of each and which would best suit my specific needs.
Our requirements are that an application should be able to send out job requests. Each request will specify multiple segments of data to work on. Nodes in the cluster should get these job requests and figure out whether the data segments being requested are "convenient". The application will need to keep track of which segments are being worked on by some node, and then possibly send out further requests if there are data segments that it needs to force some nodes to work on (all the nodes have access to all the data, but they should prefer to work on data segments that they have already cached).
This is a very typical map/reduce problem, but we don't want to use the standard Hadoop solutions because we are trying to avoid the overhead of writing intermediate results to files. This is more of a streaming problem, where we want nodes to perform filtering on data that they read and then send it over a network socket to the application that will combine the results from all the nodes.
I've taken a quick look at Akka, Apache Spark (Streaming), Storm, and just plain simple UPnP, and I'm not quite sure which one would suit my needs best. One thing that works against at least Spark is that it seems to require ZooKeeper to be set up on the network, which is a complication we'd like to avoid.
Is there any simple library that does something similar to this "auto-discover nodes via network multicast" and then allows you to simply send messages back and forth to negotiate which node will handle which data segment? Will Akka be able to help me here? How are nodes added/discovered in a cluster there? Again, we'd like to keep the configuration overhead to a minimum, which is why UPnP/SSDP look sort of nice.
Any suggestions for how to use the solutions mentioned above or even other libraries or solutions to look into are very much appreciated.

You could use Akka Clustering: http://doc.akka.io/docs/akka/current/java/cluster-usage.html. However, it doesn't use multicast; it uses a gossip protocol to handle node up/down messages. You could use a Cluster-Aware Router (see the Akka Clustering doc and http://doc.akka.io/docs/akka/current/java/routing.html) to route your messages to the cluster. There are several different types of routers depending on your needs and what you mean by "convenient". If "convenient" just means which actor is currently free, you can use a Smallest Mailbox router. If it has something to do with the content of the message, you could use a Consistent Hashing router.
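To give a feel for how node discovery works there, here is a minimal sketch of a cluster membership listener using Akka's classic Java cluster API (this assumes the akka-cluster module and seed nodes configured in application.conf; nodes find each other by contacting the seed nodes, not via multicast):

    import akka.actor.AbstractActor;
    import akka.cluster.Cluster;
    import akka.cluster.ClusterEvent;
    import akka.cluster.ClusterEvent.MemberUp;
    import akka.cluster.ClusterEvent.UnreachableMember;

    // Subscribes to cluster membership events so the application can track
    // which nodes are available to receive job requests.
    public class ClusterListener extends AbstractActor {
      private final Cluster cluster = Cluster.get(getContext().getSystem());

      @Override
      public void preStart() {
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
            MemberUp.class, UnreachableMember.class);
      }

      @Override
      public void postStop() {
        cluster.unsubscribe(getSelf());
      }

      @Override
      public Receive createReceive() {
        return receiveBuilder()
            .match(MemberUp.class, up ->
                System.out.println("Node joined: " + up.member().address()))
            .match(UnreachableMember.class, unreachable ->
                System.out.println("Node unreachable: " + unreachable.member().address()))
            .build();
      }
    }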

See Balancing Workload Across Nodes with Akka 2.
This post describes a work distribution algorithm using Akka. The algorithm doesn't use multicast to discover workers; there is a well-known master address, and the workers register with the master. Other than that, though, it fits your requirements well.
Another variation on it is described in Akka Work Pulling Pattern.
I've used this pattern in a number of projects - it works great.
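The core of the pattern is small; here is a rough sketch with hypothetical message types (the posts above add worker registration, acknowledgements, and failure handling on top of this):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import akka.actor.AbstractActor;

    // Hypothetical message types for illustration.
    class WorkItem { final String segment; WorkItem(String segment) { this.segment = segment; } }
    class WorkerReady {}  // sent by a worker whenever it becomes idle

    // The master never pushes work: idle workers pull it, so slower nodes
    // naturally take on fewer items. A full implementation would also queue
    // idle workers so they are notified when new work arrives.
    public class Master extends AbstractActor {
      private final Deque<WorkItem> pending = new ArrayDeque<>();

      @Override
      public Receive createReceive() {
        return receiveBuilder()
            .match(WorkItem.class, pending::addLast)
            .match(WorkerReady.class, ready -> {
              if (!pending.isEmpty()) {
                getSender().tell(pending.pollFirst(), getSelf());
              }
            })
            .build();
      }
    }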

Storm is fairly resilient when it comes to worker nodes going offline and coming back online. However, just like Spark, it does require ZooKeeper.
The good news is that Storm comes with a sister project to make deployment a breeze: https://github.com/nathanmarz/storm-deploy/wiki
If you're running vanilla Storm on EC2, the storm-deploy project could be what you're looking for.

Related

Kafka Connect with Cassandra database

I am using Kafka for multiple purposes, and I want to use Kafka's Connect API, but I don't understand why I should use Kafka Connect instead of writing my own consumer group that reads the messages and writes them to any database, without the extra complexity and added packages (like the ones Confluent bundles into Kafka Connect).
Connect as a framework takes care of fail-over, and you can also run it in distributed mode to scale out your data import/export "job". Thus, Connect is really a "fire and forget" experience. Furthermore, with Connect you don't need to write any code -- you just configure the connector.
If you built this manually, you would basically be solving problems that Connect has already solved (i.e., reinventing the wheel). Don't underestimate the complexity of this task -- it sounds straightforward on the surface, but it's more complex than it seems.
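To make the "just configure it" point concrete, here is what a standalone sink connector configuration looks like, using the FileStreamSink connector that ships with Apache Kafka (a Cassandra sink connector is configured the same way, with its own connector class and connection settings; the topic and file names below are placeholders):

    name=local-file-sink
    connector.class=org.apache.kafka.connect.file.FileStreamSinkConnector
    tasks.max=1
    topics=my-topic
    file=/tmp/sink-output.txt

You then start it with the connect-standalone.sh script that ships with Kafka, passing a worker configuration plus this connector configuration; no custom consumer code is involved.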
Kafka Connect offers a useful abstraction for both users and developers who want to move data in and out of Apache Kafka.
Users may pick a Connector out of a constantly growing collection of existing Connectors and, by just submitting appropriate configuration, integrate their data with Kafka quickly and efficiently. Developers can implement a Connector for their special use case, without having to worry about low level management of a cluster of producers and consumers and how to make such a cluster scale (as Matthias mentioned already).
As it often happens with software, if a particular software abstraction doesn't fit your needs, you may have to go down one or more abstraction levels and write your code by using lower level constructs. In our case these are the Kafka producer and consumer, which are still a pretty robust and easy to use abstraction for moving data in and out of Kafka.
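For comparison, the lower-level route looks roughly like this: a minimal sketch using the standard KafkaConsumer (the topic name, group id, and database write are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ManualSink {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "manual-cassandra-writer");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
          consumer.subscribe(Collections.singletonList("my-topic"));
          while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
              // Write record.value() to the database here. Note everything you
              // now own yourself: retries, offset management, failover, scaling.
            }
          }
        }
      }
    }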
Now, to get to the specific point you are referring to, which is often called handling of bad or incompatible data in Kafka Connect: at the moment this is mostly the responsibility of the Connector developer. However, we intend to provide ways for the framework to facilitate such handling of bad data and make it more a matter of configuration rather than Connector implementation. That's on the roadmap for the near future.

Is communication between nodes in a cluster a good use of Scala actors?

I'm building a Java/Scala app which would be deployed on hundreds of nodes in a cluster. I had the following idea: instead of building a REST API for my services, through which nodes would communicate, query, and perform operations on one another, I'll simply use Akka actors, so that one actor can send messages to the others. This would save me the whole process of managing REST APIs, and even node discovery, load balancing, etc. (although the communication doesn't have to be asynchronous).
Note I'm aware of the motivation to use actors as a means for async programming instead of locks; I just want to know if the use case I suggested is actually a good use case for Akka actors, or if I'm missing something. Thanks.

Hadoop-Hive-HBase Advice for Web Analytics

The team I work on is fortunate enough to have management that recognizes the need to enhance our skills and learn new technologies. As a result, whenever we have a little downtime between major projects, we are encouraged to use that time to stretch our minds a bit and learn something new. We often tackle a large research project as a team so that everyone benefits from the knowledge. For example, we built a spec-compliant Kerberos authentication server to get familiar with the ins and outs of the protocol. We wrote our own webserver to learn about efficient design strategies for networked applications.
Recently, we've been very curious about Map-Reduce, specifically Hadoop and the various supporting components (HBase, HDFS, Pig, Hive, etc.). To learn a bit more about it, we would like to write a web analytics service. It will use JavaScript page tagging to gather the metrics, with Hadoop and something else on the back end to make analytics and reports available via a web interface.
The non-Hadoop side of the architecture is easy. A Java servlet will parse the parameters from a JavaScript tag (easy enough -- we're a Java shop). The servlet will then send out a JMS message for asynchronous processing (again, easy).
My question is... What next? We've researched things like Hive a bit, and it sounds like a great fit for querying the datastore for the various metrics we're looking for. But, it's high latency. We're fortunate enough to be able to drop this onto a website that gets a few million hits per month. We'd really like to get relatively quick metrics using the web interface for our analytics tool. Latency is not our friend. So, what is the best way to accomplish this? Would it be to run the queries as a scheduled job and then store the results somewhere with lower latency (PostgreSQL, etc.) and retrieve them from there? If that's the case, where should the component listening for the JMS messages store the data? Can Hive get its data from HBase directly? Should we store it in HDFS somewhere and read it in Hive?
Like I said, we're a very technical team and love learning new technologies. This, though, is way different from anything we've learned before, so we'd like to get a sense of what the "best practices" would be here. Any advice or opinions you can give are GREATLY appreciated!
EDIT : I thought I'd add some clarification as to what I'm looking for. I'm seeking advice on architecture and design for a solution such as this. We'll collect 20-30 different metrics on a site that gets several million page views per month. This will be a lot of data, and we'd like to be able to get metrics in as close to realtime as possible. I'm looking for best practices and advice on the architecture of such a solution, because I don't want us to come up with something on our own that is really bad that will leave us thinking we're "Hadoop experts" just because it works.
Hive, as you mentioned, has high latency for queries. It can be pointed at HBase (see https://cwiki.apache.org/Hive/hbaseintegration.html), but the integration results in HBase tables being forced into a mostly rectangular, relational-like schema that is not optimal for HBase. Plus, the overhead of doing it is extremely costly: on my cluster, Hive queries against HBase are at least an order of magnitude slower than against plain HDFS files.
One good strategy is to store the raw metrics in HBase or on plain HDFS (you might want to look at Flume if these metrics are coming from log files) and run periodic MapReduce jobs (even every 5 minutes) to create pre-aggregated results that you can store in plain rectangular files and query through Hive. When you are just reading a file and Hive doesn't have to do anything fancy (e.g. sorting, joining, etc.), Hive is actually reasonably low latency: it doesn't run MapReduce, it just streams the file's contents out to you.
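As a sketch of the "pre-aggregate, then point Hive at flat files" idea (the table name, columns, and path here are hypothetical):

    -- Suppose the 5-minute MapReduce job writes tab-separated rows of
    -- (metric, hour, value) under /analytics/aggregated/.
    CREATE EXTERNAL TABLE metrics_hourly (
      metric STRING,
      hour   STRING,
      value  BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/analytics/aggregated/';

    -- A plain scan like this just streams the files back; no MapReduce job
    -- is launched, so it returns quickly.
    SELECT * FROM metrics_hourly LIMIT 100;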
Finally, another option is to use something like Storm to collect and analyze the data in real time, and then either store the results for querying as mentioned above, or store them in HBase for display through a custom user interface that queries HBase directly.
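For a flavor of the real-time path, here is a minimal Storm bolt that keeps running per-metric counts (the "metric" tuple field and the downstream HBase writer are assumptions about your topology; this uses the older backtype.storm packages, which newer Storm releases renamed to org.apache.storm):

    import java.util.HashMap;
    import java.util.Map;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    // Counts incoming tuples per metric and emits running totals; a
    // downstream bolt would persist these to HBase.
    public class MetricCountBolt extends BaseBasicBolt {
      private final Map<String, Long> counts = new HashMap<String, Long>();

      @Override
      public void execute(Tuple tuple, BasicOutputCollector collector) {
        String metric = tuple.getStringByField("metric"); // assumed tuple layout
        Long current = counts.get(metric);
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(metric, updated);
        collector.emit(new Values(metric, updated));
      }

      @Override
      public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("metric", "count"));
      }
    }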

Simple *Authoritative DNS Server* in Java

Is there an already-written Java DNS server that only implements authoritative responses? I would like to take the source code and move it into a DNS server we will be developing that will use custom rule sets to decide what TTL to use and what IP address to publish.
The server will not be a caching server. It will only return authoritative results and only be published on the WHOIS record for the domains. It will never be called directly.
The server will have to publish MX records, A records, and SPF/TXT records. The plan is to use DNS to assist in load balancing among gateway servers in multiple locations (we are aware that DNS has a short reach in this area). It will also cease to publish IP addresses of gateway servers when they go down (on purpose or by accident) (granted, DNS will only be able to help during extended outages).
We will write the logic for all this ourselves, but I would very much like to start with a DNS server that has been through a little testing instead of starting from scratch.
However, that is only feasible if what we copy from is simple enough. Otherwise, it could turn out to be a waste of time.
George,
I guess what you need is a Java library which implements the DNS protocol. Take a look at dnsjava.
It is very good in terms of complete spec coverage of all record types and classes.
The issue you might face with a Java-based library is performance: DNS servers are expected to sustain a high throughput. You can address that by throwing more hardware at it.
If performance is a concern for you, I would suggest looking into Unbound instead.
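If you do go with dnsjava, a minimal sketch of building an authoritative answer looks like this (the record data and TTL are placeholders that your custom rule sets would decide, and a real server would also parse the incoming query from the wire):

    import java.net.InetAddress;
    import org.xbill.DNS.ARecord;
    import org.xbill.DNS.DClass;
    import org.xbill.DNS.Flags;
    import org.xbill.DNS.Message;
    import org.xbill.DNS.Record;
    import org.xbill.DNS.Section;

    // Builds an authoritative response to an A-record query using dnsjava.
    public class AuthoritativeAnswerSketch {
      public static Message answer(Message query) throws Exception {
        Record question = query.getQuestion();
        Message response = new Message(query.getHeader().getID());
        response.getHeader().setFlag(Flags.QR); // this message is a response
        response.getHeader().setFlag(Flags.AA); // and it is authoritative
        response.addRecord(question, Section.QUESTION);
        // Your custom rule sets would pick the IP address and TTL here.
        response.addRecord(new ARecord(question.getName(), DClass.IN, 300,
            InetAddress.getByName("192.0.2.10")), Section.ANSWER);
        return response;
      }
    }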
http://www.xbill.org/dnsjava/
Unfortunately, the documentation states: "jnamed should not be used for production, and should probably not be used for testing. If the above documentation is not enough, please do not ask for more, because it really should not be used."
I'm not aware of any better alternatives, however.
You could take a look at Eagle DNS:
http://www.unlogic.se/projects/eagledns
It's been around for a few years and it's quite well tested by now.

Grid Computing and Java

I couldn't seem to find a similar question to this.
I am currently looking at the best solution for solving a grid computing problem.
The setup:
I have a server/client situation where the clients [typically dumb, holding little of the logic] receive instructions from the server:
Clients make an authorization request
Clients report back information on speed of completing the task (the task's difficulty is judged by the task type)
Clients receive the best-fit task for their previous performance (the best clients receive the hardest problems)
Eventually the requirements would be:
The client's footprint must be small and standalone - I can't have a client that requires lots to install and set up
The client should be able to grab new jobs and job runtimes from the server (it would be nice to have the grid scale to new problems, distributed by the server, as they are introduced)
I need to have an authentication layer (it doesn't have to be complex or conform to an existing LDAP) [easier requirement: clients can sign up for a new "membership" and get access] (I'm not sure that RMI's strengths lie here)
The clients would be able to run over the Internet rather than in a local networked environment
Which means the requested results must be encrypted in transit
I'm currently using web services to communicate between the clients and the server. All of the information and results go back to the hosting server (J2EE).
My question is: is there a grid system setup that matches all/most of these requirements, and is open source?
I'm not interested in a cloud solution, because most of these tasks are small but recurring (e.g. run once a day; a task may be easy but performs maintenance).
All of the code for this system is in Java.
You may want to investigate space-based architectures, and in particular Jini and JavaSpaces. What's Jini? It is essentially RMI with a configurable discovery mechanism: you request an implementor of a Java interface, and the Jini subsystem finds current services implementing that interface and dynamically informs your client of them.
Briefly, you'd write the work items into a space. The grid nodes would be set up to read data transactionally from the space. Each grid node would take a work item, process it, and write a result back into that space (or another space). The distributing node can monitor for results being written back (and/or for your projected result timings, as you've requested).
It's all Java, and it will scale linearly. Because it's Jini, the grid nodes can dynamically load their classes from an HTTP server, so you can propagate code updates trivially.
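In rough outline (a minimal sketch: the WorkItem entry type is hypothetical, and the space itself would be discovered via Jini lookup, not shown here):

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // Hypothetical work-item entry. JavaSpaces entries are plain objects with
    // public fields and a public no-arg constructor; null fields act as
    // wildcards when matching templates.
    public class WorkItem implements Entry {
      public String segmentId;
      public WorkItem() {}
      public WorkItem(String segmentId) { this.segmentId = segmentId; }
    }

    class GridSketch {
      // The distributing node writes items into the space.
      static void distribute(JavaSpace space, String segmentId) throws Exception {
        space.write(new WorkItem(segmentId), null, Lease.FOREVER);
      }

      // Each grid node blocks until an item appears and removes it from the
      // space (a null transaction is shown here for brevity).
      static WorkItem takeNext(JavaSpace space) throws Exception {
        return (WorkItem) space.take(new WorkItem(), null, Long.MAX_VALUE);
      }
    }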
Take a look at Grid Beans.
BOINC sounds like it would work for your problem, though you'd have to wrap Java for your clients. That, and it may be overkill for you.
