Kafka compared to a modern In-Memory Data Grid (IMDG) - Java

I have some IMDG experience, but I am rather new to Kafka. I am trying to understand the use case for Kafka. I understand it is a streaming/messaging platform, but many of the problems it addresses have counterparts in modern In-Memory Data Grids. Can you shed some light on when someone would prefer Kafka and when an IMDG? I need to draw a parallel.
I will give you one example: I have noticed Kafka being used for data replication. Although that is possible, I feel IMDGs are more capable and more automated for this purpose.
I am also interested in how these two technologies complement each other, as I don't think they are in direct competition.

The two types of systems do have some feature overlap, but they are still two different types of systems with dissimilar primary objectives, so we can't compare them on the primary feature of either.
Kafka is primarily a pub/sub durable message broker. Data grids are primarily in-memory cache systems. This is the first distinction or key attribute on which one would choose to use either.
On a secondary level, which I believe is where the lines become blurred, both types of system provide some kind of distributed computing capabilities (Kafka Streams, Ignite or Hazelcast compute grid/service) with data ingestion functionality. This, however, cannot be taken as the primary selection criterion.
The two types don't really compete directly with one another on their respective primary purposes. A stream-based compute engine may use a data grid for computation or for transient state caching, but I don't see how it would rely on a compute/data grid as a reliable, standalone message broker; for that it would depend on something like Kafka.
A small application may dispense with one type to use the secondary features of the other, but an application with high demand for both may in fact need to use both types of systems.
As an example, if you're building a high-volume data pipeline with multiple data sources and you need a durable message broker, you will probably have to use Kafka; but if you equally have strong requirements for low-latency querying downstream, you will also need a compute grid, be it for caching or for distributed computing.
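To make that concrete, here is a minimal sketch of the two systems working together: a plain Kafka consumer drains a durable topic and materializes the records into a Hazelcast map for low-latency keyed lookups downstream. The broker address, topic name, map name and String serialization are all made-up assumptions, and the imports assume kafka-clients 2.x and Hazelcast 3.x.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PipelineSketch {
        public static void main(String[] args) {
            // Kafka provides the durable, replayable stream of events.
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "cache-loader");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("events"));

            // The IMDG serves low-latency keyed lookups to downstream readers.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> cache = hz.getMap("events-cache");

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    cache.put(record.key(), record.value());
                }
            }
        }
    }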

I've been pondering the same question recently. I've come to these conclusions:
Use an IMDG like Ignite/Hazelcast if:
Your processing use cases make sense in a compute grid AND your grid, which could host a number of applications/processes, is the only consumer of the durable, distributed data streams
Use Kafka if:
You have a heterogeneous environment of processing layers and you need an independent data integration layer to provide durable, distributed data streams
Also, they are not necessarily mutually exclusive. You may find that the latter makes sense across your organization, while some consumers with specific use cases for an IMDG/IMCG tap into the enterprise-wide Kafka plane for seed data and reuse the IMDG/IMCG's internal data structures for intermediate data streams that are used exclusively within the grid, so there is no real reason to divert those out to Kafka. The grid may then divert results back to Kafka for further dissemination to the rest of the enterprise.
By the way, IMDGs/IMCGs like Ignite and Hazelcast can provide pub/sub, can be as durable as Kafka in terms of data resilience, and can provide stream processing on top of it.
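As a rough illustration of that pub/sub point, here is a hedged sketch of Hazelcast's ITopic (Hazelcast 3.x imports; the topic name and message are made up). Note that a plain ITopic is fire-and-forget; for Kafka-like durability and replay you would reach for Hazelcast's Reliable Topic, which is backed by a ringbuffer.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.ITopic;

    public class TopicSketch {
        public static void main(String[] args) {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // "prices" is an invented topic name for illustration.
            ITopic<String> topic = hz.getTopic("prices");

            // Subscribers register a listener; every registered listener receives each message.
            topic.addMessageListener(message ->
                    System.out.println("Received: " + message.getMessageObject()));

            // Publishers simply publish; fan-out is handled by the grid.
            topic.publish("EURUSD=1.08");
        }
    }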

Distributed Keying / Partitioning / Sharding Java library

I receive HTTP requests on my N frontend machines and want them processed by my K backend machines, depending on a certain key in the data.
The keying has to be stable and consistent. I also want to scale the frontend and backend machines depending on the load, without interruption. I am fine with losing a small amount of data while scaling.
I think I could achieve my goal with Kafka or Apache Flink. Maybe Hazelcast could also be used, but they all seem heavyweight and too much for my case.
Is there a library that just solves the keying/partitioning/sharding aspect in a distributed way?
Bonus points for an Rx integration library.
What makes you think Hazelcast is heavier?
Hazelcast actually provides everything within a single environment: sharding, consistent hashing, partitioning, high availability of the data, and so on. Plus, the easy and straightforward APIs take away a lot of the pain of writing the plumbing code yourself. All you are required to do is start an HC cluster using the startup scripts and invoke APIs like map.put(key, value) / map.get(key); it's that simple, as everything else is taken care of by Hazelcast behind the scenes.
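To illustrate the sharding aspect the question is really about, here is a hedged sketch (Hazelcast 3.x API, with a made-up request key) of asking the cluster which member owns a given key. Hazelcast maps keys to partitions consistently and rebalances partitions automatically as members join or leave, which covers the "stable and consistent keying while scaling" requirement.

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.Member;
    import com.hazelcast.core.Partition;

    public class KeyRoutingSketch {
        public static void main(String[] args) {
            // Each frontend and backend process joins the same cluster.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            String requestKey = "customer-42";  // made-up key extracted from an incoming request

            // Keys map consistently to partitions, and partitions to members;
            // rebalancing on scale-up/scale-down is handled by Hazelcast.
            Partition partition = hz.getPartitionService().getPartition(requestKey);
            Member owner = partition.getOwner();
            System.out.println("Key " + requestKey + " is owned by " + owner.getAddress());
        }
    }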
In this kind of scenario I usually use a clustering technology that tracks membership (Hazelcast, or my favorite, JGroups, which is much lighter than Hazelcast).
Then combine the current cluster size/members with a consistent hashing function like Guava's (see https://github.com/google/guava/wiki/HashingExplained).
The consistent hash takes your data as the key and the current cluster member count as the number of buckets, and you get back a consistent answer for that same number of buckets.
Then use the computed bucket to route your request.
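A small sketch of the hashing part, assuming Guava is on the classpath; the key is invented and clusterSize would come from whatever membership view JGroups or Hazelcast reports:

    import java.nio.charset.StandardCharsets;

    import com.google.common.hash.HashCode;
    import com.google.common.hash.Hashing;

    public class ConsistentBucketSketch {

        // clusterSize comes from the membership view (JGroups or Hazelcast).
        static int bucketFor(String key, int clusterSize) {
            HashCode hash = Hashing.murmur3_128().hashString(key, StandardCharsets.UTF_8);
            // Guava's consistentHash minimizes remapping when the bucket count changes.
            return Hashing.consistentHash(hash, clusterSize);
        }

        public static void main(String[] args) {
            // With 5 members the same key always maps to the same bucket;
            // growing to 6 members only remaps roughly 1/6 of the keys.
            System.out.println(bucketFor("order-1234", 5));
            System.out.println(bucketFor("order-1234", 6));
        }
    }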

Kafka connect with Cassandra database

I am using Kafka for multiple purposes, and I want to use Kafka's Connect API, but I can't understand why one would use Kafka Connect instead of writing our own consumer group and writing the messages to any database, without going through anything complex and without adding extra packages the way Confluent does with Kafka Connect.
Connect as a framework takes care of fail-over, and you can also run it in distributed mode to scale out your data import/export "job". Thus, Connect is really a "fire and forget" experience. Furthermore, with Connect you don't need to write any code -- you just configure the connector.
If you built this manually, you would basically be solving problems that Connect has already solved (i.e., reinventing the wheel). Don't underestimate the complexity of the task -- it sounds straightforward on the surface, but it's more complex than it seems.
Kafka Connect offers a useful abstraction for both users and developers who want to move data in and out of Apache Kafka.
Users may pick a Connector out of a constantly growing collection of existing Connectors and, by just submitting appropriate configuration, integrate their data with Kafka quickly and efficiently. Developers can implement a Connector for their special use case, without having to worry about low level management of a cluster of producers and consumers and how to make such a cluster scale (as Matthias mentioned already).
As it often happens with software, if a particular software abstraction doesn't fit your needs, you may have to go down one or more abstraction levels and write your code by using lower level constructs. In our case these are the Kafka producer and consumer, which are still a pretty robust and easy to use abstraction for moving data in and out of Kafka.
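To illustrate what the roll-your-own path looks like, here is a hedged sketch of a plain consumer copying a topic into a database over JDBC. The topic, table, and connection details are invented, and the sketch deliberately ignores everything Connect handles for you: retries, schema handling, scaling out, and keeping offset commits in step with the database writes.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ManualSinkSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "manual-sink");
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
                 Connection db = DriverManager.getConnection("jdbc:postgresql://localhost/demo", "user", "pass")) {
                consumer.subscribe(Collections.singletonList("events"));
                PreparedStatement insert = db.prepareStatement("INSERT INTO events(k, v) VALUES (?, ?)");
                while (true) {
                    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                        insert.setString(1, record.key());
                        insert.setString(2, record.value());
                        insert.executeUpdate();
                    }
                    consumer.commitSync();  // commit offsets only after the batch is persisted
                }
            }
        }
    }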
Now to get to the specific point you are referring to, which is what is often called handling of bad or incompatible data in Kafka Connect, this is something that is mostly a responsibility of the Connector developer at the moment. However, we intend to provide ways for the framework to facilitate such handling of bad data and make it more a matter of configuration rather than Connector implementation. That's in the roadmap for the near future.

Simple node discovery method

I'm starting work on a system that will need to discover nodes in a cluster and send those nodes jobs to work on. I know that a myriad of systems exist that solve this but I'm unclear about the complexities of each and which would best suit my specific needs.
Our requirements are that an application should be able to send out job requests. Each request will specify multiple segments of data to work on. Nodes in the cluster should get these job requests and figure out whether the data segments being requested are "convenient". The application will need to keep track of which segments are being worked on by some node and then possibly send out further requests if there are data segments that it needs to force some nodes to work on (all the nodes have access to all the data, but they should prefer to work on data segments that they have already cached).
This is a very typical map/reduce problem but we don't want to use the standard hadoop solutions because we are trying to avoid the overhead of writing preliminary results to files. This is more of a streaming problem where we want nodes to perform filtering on data that they read and then send it over a network socket to the application that will combine the results from all the nodes.
I've taken a quick look at Akka, Apache Spark (streaming), Storm and plain simple UPnP, and I'm not quite sure which one would suit my needs best. One thing that works against at least Spark is that it seems to require ZooKeeper to be set up on the network, which is a complication we'd like to avoid.
Is there any simple library that does something similar to this "auto discover nodes via network multicast" and then allows you to simply send messages back and forth to negotiate which node will handle which data segment? Will Akka be able to help me here? How are nodes added/discovered in a cluster there? Again, we'd like to keep the configuration overhead to a minimum which is why UPNP/SSDP look sort of nice.
Any suggestions for how to use the solutions mentioned above or even other libraries or solutions to look into are very much appreciated.
You could use Akka Clustering: http://doc.akka.io/docs/akka/current/java/cluster-usage.html. However, it doesn't use multicast; it uses a gossip protocol to handle node up/down messages. You could use a Cluster-Aware Router (see the Akka Clustering doc and http://doc.akka.io/docs/akka/current/java/routing.html) to route your messages to the cluster; there are several different types of routers depending on your needs and what you mean by "convenient". If "convenient" just means which actor is currently free, you can use a Smallest Mailbox router. If it has something to do with the content of the message, you could use a Consistent Hashing router.
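As a rough, local (non-clustered) sketch of the consistent-hashing idea using classic Akka 2.5 Java APIs -- the worker actor and message are invented -- messages wrapped with the same hash key always reach the same routee:

    import akka.actor.AbstractActor;
    import akka.actor.ActorRef;
    import akka.actor.ActorSystem;
    import akka.actor.Props;
    import akka.routing.ConsistentHashingPool;
    import akka.routing.ConsistentHashingRouter.ConsistentHashableEnvelope;

    public class RouterSketch {

        // Hypothetical worker; in a cluster-aware setup routees would live on remote nodes.
        static class SegmentWorker extends AbstractActor {
            @Override
            public Receive createReceive() {
                return receiveBuilder()
                        .match(String.class, segment -> System.out.println(self() + " processing " + segment))
                        .build();
            }
        }

        public static void main(String[] args) {
            ActorSystem system = ActorSystem.create("jobs");

            // Four local routees; the same hash key always routes to the same routee.
            ActorRef router = system.actorOf(
                    new ConsistentHashingPool(4).props(Props.create(SegmentWorker.class)), "router");

            router.tell(new ConsistentHashableEnvelope("segment-17", "segment-17"), ActorRef.noSender());
        }
    }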
See Balancing Workload Across Nodes with Akka 2.
This post describes a work distribution algorithm using Akka. The algorithm doesn't use multicast to discover workers. There is a well-known master address and the workers register with the master. Other than that though it fits your requirements well.
Another variation on it is described in Akka Work Pulling Pattern.
I've used this pattern in a number of projects - it works great.
Storm is fairly resilient when it comes to worker-nodes coming offline & online. However, just like Spark, it does require Zookeeper.
The good news is that Storm comes with a sister project to make deployment a breeze: https://github.com/nathanmarz/storm-deploy/wiki
If you're running vanilla storm on EC2, the storm-deploy project could be what you're looking for.

Fast object sharing between java applications

I'm looking for a way to share large in-memory objects between Java applications and have been looking at JMS (ActiveMQ) and JavaSpaces. Will any of these allow me to reliably send/share objects between two or more Java applications? Is ActiveMQ suitable for large messages?
You can use an in-memory data grid like Oracle Coherence or JBoss Data Grid. This may be faster than using JMS.
It really depends what you mean by share. If you mean that different processes (potentially on different machines) need to be able to access a "shared" object, then yes, as the other answer suggests, something like Oracle Coherence would be great.
On the other hand, if you mean share as in to pass from one process to another, then you probably are looking for a messaging solution, e.g. JMS or even simpler e.g. REST.
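If JMS fits, a minimal ActiveMQ sketch for passing a Serializable object from one process to another looks roughly like this (broker URL, queue name, and payload are made up; for very large objects the serialization cost matters, so measure before settling on this route):

    import java.util.ArrayList;
    import java.util.Arrays;

    import javax.jms.Connection;
    import javax.jms.MessageProducer;
    import javax.jms.ObjectMessage;
    import javax.jms.Queue;
    import javax.jms.Session;

    import org.apache.activemq.ActiveMQConnectionFactory;

    public class ObjectSendSketch {
        public static void main(String[] args) throws Exception {
            // Broker URL and queue name are invented for the example.
            Connection connection = new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("shared-objects");
            MessageProducer producer = session.createProducer(queue);

            // Any Serializable payload works; the receiving side casts it back.
            ArrayList<String> payload = new ArrayList<>(Arrays.asList("a", "b", "c"));
            ObjectMessage message = session.createObjectMessage(payload);
            producer.send(message);

            connection.close();
        }
    }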

Java - Distributed Programming, RMI?

I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem is when one simulation is 'in charge' of certain objects, like a road, and another is 'in charge' of other roads. However, these roads are interconnected (and hence we need synchronisation between these simulations, and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited for this task. (To abstract out having to create an over-wire signalling discipline).
Is this sane? The issue here, is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations. (Hence the idea of using RMI).
My basic approach is to have the 'coordinator' run a giant RMI registry. And every simulation simply looks up everything in the registry, ensuring that the correct objects are used at each step.
Anyone have any tips for heading down this path?
You may want to check out Hazelcast also. Hazelcast is an open source transactional, distributed/partitioned implementation of queue, topic, map, set, list, lock and executor service. It is super easy to work with; just add hazelcast.jar into your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable, Callable tasks in a distributed fashion, then please check out Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
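A minimal sketch of that, assuming Hazelcast 3.x's IExecutorService API; the executor name and the task are invented placeholders:

    import java.io.Serializable;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IExecutorService;

    public class DistributedTaskSketch {

        // Tasks must be Serializable so they can be shipped to another member.
        static class LinkCostTask implements Callable<Double>, Serializable {
            @Override
            public Double call() {
                return 42.0;  // placeholder for a real link-cost computation
            }
        }

        public static void main(String[] args) throws Exception {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IExecutorService executor = hz.getExecutorService("sim-tasks");

            // The task runs on some member of the cluster; the Future brings the result back.
            Future<Double> result = executor.submit(new LinkCostTask());
            System.out.println("Link cost: " + result.get());
        }
    }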
Hazelcast is released under Apache license and enterprise grade support is also available.
Is this sane? IMHO no. And I'll tell you why. But first I'll add the disclaimer that this is a complicated topic so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself, I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck. It introduces communication inefficiencies (namely, the points communicate via a two-step process through the center).
What you really want for this is probably a cluster (rather than a data/compute grid) solution and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence but it's no doubt expensive (compared to free). It is a fantastic product though.
These two products can be used in a number of ways, but the core of both is to treat a cache like a distributed map. You put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications and the speed is very impressive so far.
Paul
Have you considered using a message queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and have your servers running on EC2 to allow you to scale up and down as required.
Just my 2 cents.
Take a look at JINI, it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (coordinator in your case) writes tasks to the JavaSpace, and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronize and exchange data, this will add some complexity to your solution.
Using JavaSpaces will add a whole lot more abstraction to your implementation than using plain RMI (which is used by the Jini framework internally as the default "wire protocol").
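A rough sketch of the master-worker exchange, assuming a JavaSpace reference has already been obtained via a Jini lookup service; the Task entry and its fields are invented:

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    public class SpaceSketch {

        // Entries are plain classes with public, object-typed fields and a no-arg constructor.
        public static class Task implements Entry {
            public String simulationId;
            public Integer timeStep;

            public Task() { }

            public Task(String simulationId, Integer timeStep) {
                this.simulationId = simulationId;
                this.timeStep = timeStep;
            }
        }

        // 'space' is assumed to come from a Jini lookup elsewhere in the application.
        static void masterAndWorker(JavaSpace space) throws Exception {
            // The master writes a task into the space.
            space.write(new Task("traffic-model-A", 1), null, Lease.FOREVER);

            // A worker takes any matching task (null fields act as wildcards) and processes it.
            Task template = new Task();
            Task task = (Task) space.take(template, null, Long.MAX_VALUE);
            System.out.println("Processing step " + task.timeStep + " of " + task.simulationId);
        }
    }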
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini
Just as an addition to the other answers which as far as I have seen all focus on grid and cloud computing, you should notice that simulation models have one unique characteristic: simulation time.
When running distributed simulation models in parallel and synchronized then I see two options:
When each simulation model has its own simulation clock and event list then these should be synchronized over the network.
Alternatively there could be a single simulation clock and event list which will "tick the time" for all distributed (sub) models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starter.
However, the second option seems simpler and lower-overhead to me.
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and "distributed task session". You can browse their examples and see if some of them fit your needs.
