I receive HTTP requests on my N frontend machines and want to have them processed by my K backend machines, depending on a certain key in the data.
The keying has to be stable and consistent. I also want to scale the frontend and backend machines depending on the load, without interruption. I am fine with losing a very small amount of data while scaling.
I think I could achieve my goal with Kafka or Apache Flink. Maybe Hazelcast could also be used, but they all seem heavyweight and too much for my case.
Is there a library that just solves the aspect of keying / partitioning / sharding in a distributed way?
Bonus points for Rx integration.
What makes you think Hazelcast is heavier?
Hazelcast actually provides everything within a single environment - sharding, consistent hashing, partitioning, high availability of the data, etc. Plus, the easy and straightforward APIs take away a lot of the pain of writing code. All you are required to do is start an HC cluster using the startup scripts and invoke APIs like map.put(key, value) / map.get(key); it's that simple, as everything else is taken care of by Hazelcast behind the scenes.
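For a flavour of how little code that is, here is a minimal sketch (the map name and key are made up; the IMap import moved to com.hazelcast.map in newer Hazelcast versions):

    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class HazelcastMapSketch {
        public static void main(String[] args) {
            // Starting an instance makes this JVM a cluster member; members discover
            // each other via the default (or your own) cluster configuration.
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();

            // The map is partitioned across the cluster: the key's hash decides which
            // member owns the entry, and ownership is rebalanced when members join or leave.
            IMap<String, String> requestsByKey = hz.getMap("requests");
            requestsByKey.put("customer-42", "payload");
            System.out.println(requestsByKey.get("customer-42"));

            hz.shutdown();
        }
    }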
In this kind of scenario I usually use a cluster technology that tracks membership (Hazelcast, or my favourite JGroups, which is much lighter than Hazelcast).
Then combine the current cluster size/members with a consistent hashing function like Guava's (see https://github.com/google/guava/wiki/HashingExplained ).
The consistent hash would take your data as the key and the current cluster member count as the number of buckets, and you would get back a consistent answer for that same number of buckets.
Then use the computed bucket to route your request.
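A rough sketch of that routing step with Guava (the member count would come from your JGroups/Hazelcast membership view; the key and the routing comment are just illustrative):

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.HashCode;
    import com.google.common.hash.Hashing;

    public class KeyRouter {
        // Returns a bucket in [0, memberCount); only a minimal number of keys
        // move to a new bucket when memberCount changes.
        static int bucketFor(String key, int memberCount) {
            HashCode hash = Hashing.murmur3_128().hashString(key, StandardCharsets.UTF_8);
            return Hashing.consistentHash(hash, memberCount);
        }

        public static void main(String[] args) {
            int members = 5; // pretend the membership listener currently reports 5 backends
            int bucket = bucketFor("customer-42", members);
            // Route the request to the member at position 'bucket' in the sorted member list.
            System.out.println("route to member #" + bucket);
        }
    }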
I have some IMDG experience but I am rather new to Kafka. I am trying to understand the use case for Kafka. I understand it is a streaming/messaging platform. A lot of its features have counterparts in modern In-Memory Data Grids. Can you shed a bit of light on the use cases where someone would prefer Kafka and where they would prefer an IMDG? I need to draw a parallel.
I will give you one example: I have noticed Kafka being used for data replication. Although that is possible, I feel that IMDGs are more capable and automated for this purpose.
I am also interested in how these two technologies complement each other, as I don't think they are in direct competition.
The two types of systems do have some feature overlap, but they are still two different types of systems with dissimilar primary objectives, so we can't compare them on the primary feature of either.
Kafka is primarily a pub/sub durable message broker. Data grids are primarily in-memory cache systems. This is the first distinction or key attribute on which one would choose to use either.
On a secondary level, which I believe is where the lines become blurred, both types of system provide some kind of distributed computing capabilities (Kafka Streams, Ignite or Hazelcast compute grid/service) with data ingestion functionality. This, however, cannot be taken as the primary selection criterion.
The two types don't really directly compete with one another on their respective primary purpose. A stream-based compute engine may use a data grid for computation or for transient state caching, but I don't see how it would rely on a compute/data grid as a reliable, standalone message broker; it would depend on something like Kafka for that.
A small application may dispense with one type to use the secondary features of the other, but an application with high demand for both may in fact need to use both types of systems.
As an example, if you're building a high-volume data pipeline with multiple data sources and you need a durable message broker, you will probably have to use Kafka; but if you equally have strong requirements for low-latency querying downstream, you will also need a compute grid, be it for caching or for distributed computing.
I've been pondering the same question recently. I've come to these conclusions:
Use an IMDG like Ignite/Hazelcast if:
Your processing use cases make sense in a compute grid AND your grid, which could have a number of applications/processes in it, is the only consumer of the durable, distributed data streams
Use Kafka if:
You have a heterogeneous environment of processing layers and you need an independent data integration layer to provide durable, distributed data streams
Also, they are not necessarily mutually exclusive. You may find that the latter makes sense in your organization, while some consumers with specific use cases for an IMDG/IMCG prefer to tap into the enterprise-wide Kafka plane only for seed data, and re-use the IMDG/IMCG's internal data structures for intermediate data streams that are used exclusively within the grid, so there is no real reason to divert those out to Kafka. The grid may then divert results back to Kafka for further dissemination to the rest of the enterprise.
Btw, IMDGs/IMCGs like Ignite and Hazelcast can provide pub/sub, be as durable as Kafka in terms of data resilience, and provide stream processing on top of it.
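To make the "Kafka as the integration plane, IMDG for grid-internal data" split concrete, here is a rough sketch of seeding a grid map from a Kafka topic (the broker address, topic and map names are made up, and the Hazelcast import paths vary by version):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IMap;

    public class KafkaToGridSeeder {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "grid-seeder");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IMap<String, String> seed = hz.getMap("seed-data"); // grid-internal structure

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("enterprise-seed-topic")); // assumed topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        seed.put(r.key(), r.value()); // cache for grid-local processing
                    }
                }
            }
        }
    }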
I'm looking for a distributed cache for key-value pairs with these features -
Persistence to disk
Open Source
Java Interface
Fast read/write with minimum memory utilisation
Easy to add more machines to the database (Horizontally Scalable)
What are the databases that fit the bill?
The Redisson framework also provides distributed caching capabilities on top of Redis.
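For illustration, a minimal Redisson sketch might look like this (the Redis address and map name are placeholders; older Redisson versions take the address without the redis:// prefix):

    import org.redisson.Redisson;
    import org.redisson.api.RMap;
    import org.redisson.api.RedissonClient;
    import org.redisson.config.Config;

    public class RedissonCacheSketch {
        public static void main(String[] args) {
            // Point at a local Redis instance; cluster/sentinel modes are configured similarly.
            Config config = new Config();
            config.useSingleServer().setAddress("redis://127.0.0.1:6379");

            RedissonClient client = Redisson.create(config);
            RMap<String, String> cache = client.getMap("my-cache"); // name is illustrative
            cache.put("key", "value");
            System.out.println(cache.get("key"));
            client.shutdown();
        }
    }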
There are a lot of options that you can make use of.
Redis - the one you've stated yourself. It's a distinct process, very fast, and key-value for sure, but it's not "in-memory with your application"; I mean that you'll always do socket I/O in order to reach the Redis process.
It's not written in Java, but it provides a decent Java driver to work with; moreover, there is a Spring integration.
If you want a Java-based solution, consider the following:
memcached - a distributed cache
Hazelcast - it's a data grid; it's much more than simply a key-value store, but you might be interested in it as well.
Infinispan - the folks from JBoss created this one
EHCache - a popular distributed cache
Hope this helps
Does anyone have experience with using Terracotta with Hibernate Search to satisfy application Queries?
If so:
1. What magnitude of "object updates" can it handle? (How's the performance?)
2. What kind of performance do the queries have?
3. Is it possible to use Terracotta with Hibernate Search without even having a backing database, satisfying all "queries" in memory?
I am Terracotta's CTO. I spent some time last month looking at Hibernate Search. It is not built in a way to be clustered transparently by Terracotta. Here's why in a nutshell: Hibernate has a custom-built JMS replication of Lucene indexes across JVMs.
The basic idea in Search is that talking to local disk under Lucene works really well, whereas fragmenting or partitioning Lucene indexes across the network introduces so much latency as to make Lucene seem bad when it is not Lucene's fault at all. To that end, Hibernate Search doesn't rely on JBossCache or any in-memory partitioning / caching schemes and instead relies on JMS and each JVM's local disk in order to provide up-to-date indexing across a cluster with simultaneously low latency. Then, the beauty of Hibernate Search is that standard Hibernate queries and more can be launched through Hibernate at these natural-language indexes on each machine.
At Terracotta it turns out we had a similar idea to Emmanuel and built a SearchableMap product on top of Compass. Each machine gets its own Compass store and the store is configured to spill to disk locally. Terracotta is used to create a multi-master writing capability where any JVM can add to the index and the delta is sent through Terracotta to be replayed / reapplied locally to each disk. It works just like Hibernate Search but with DSO as the networking protocol in place of JMS and w/o the nice Hibernate interfaces but instead with Compass interfaces.
I think we will support Hibernate Search w/ help from JBoss (they would need to factor out the JMS impl as pluggable) by end of the year.
Now to your questions directly:
1. Object updates/sec in Hibernate Search or SearchableMap should be quite high because both send only deltas. In Hibernate's case it is a function of your JMS provider. With Terracotta it is scalable just by adding more Terracotta servers to the array.
2. Query performance in both is very fast. Local-memory performance in most cases. And if you need to page in from disk, it turns out most OSes do a good job and can respond to queries way faster than any network-based clustering can.
3. It will be, I think, once we get JBoss to factor out their JMS assumptions, etc.
Cheers,
--Ari
Since people on the Hibernate forums keep referring to this post, I feel the need to point out that while Ari's comments were correct at the beginning of 2009, we have been developing and improving a lot since.
Hibernate Search provides a set of backend channels out of the box, like the already mentioned JMS-based one and a more recent addition using JGroups, but we also made it pretty easy to plug in alternative implementations or override parts of them.
In addition to using a custom backend, since version 4 it's possible to replace the whole strategy: instead of changing only the backend implementation, you can use an IndexManager which follows a different design and doesn't use a backend at all. At this time we have only two IndexManagers, but we're working on more alternatives; again, the idea is to provide nice implementations for the most common use cases.
It does have an Infinispan-based backend for very quick distribution of the index across different nodes, and it should be straightforward to contribute one based on Terracotta or any other clustering technology. More solutions are coming.
I've got a doozy of a problem here. I'm aiming to build a framework to allow for the integration of different traffic simulation models. This integration is based upon the sharing of link connectivities, link costs, and vehicles between simulations.
To make a distributed simulation, I plan to have a 'coordinator' (star topology). All participating simulations simply register with it, and talk only to the coordinator. The coordinator then coordinates the execution of various tasks between each simulation.
A quick example of a distribution problem is when one simulation is 'in charge' of certain objects, like a road, and another is 'in charge' of other roads. However, these roads are interconnected (and hence we need synchronisation between these simulations, and need to be able to exchange data / invoke methods remotely).
I've had a look at RMI and am thinking it may be suited to this task (to abstract away having to create an over-the-wire signalling discipline).
Is this sane? The issue here, is that simulation participants need to centralize some of their data storage in the 'coordinator' to ensure explicit synchronisation between simulations. Furthermore, some simulations may require components or methods from other simulations. (Hence the idea of using RMI).
My basic approach is to have the 'coordinator' run a giant RMI registry. And every simulation simply looks up everything in the registry, ensuring that the correct objects are used at each step.
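Roughly, I imagine something like this (the RoadSegment interface and the binding names are just illustrative):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    // Hypothetical remote interface a participating simulation would implement.
    interface RoadSegment extends Remote {
        double getCost() throws RemoteException;
        void addVehicle(String vehicleId) throws RemoteException;
    }

    public class Coordinator {
        public static void main(String[] args) throws Exception {
            // The coordinator hosts the registry; every simulation binds and looks up here.
            Registry registry = LocateRegistry.createRegistry(1099);

            RoadSegment road = new RoadSegment() { // stand-in implementation
                public double getCost() { return 1.0; }
                public void addVehicle(String vehicleId) { /* hand over to the owning simulation */ }
            };
            RoadSegment stub = (RoadSegment) UnicastRemoteObject.exportObject(road, 0);
            registry.rebind("sim-A/road-17", stub);

            // A participant elsewhere would do:
            // RoadSegment remote = (RoadSegment) LocateRegistry.getRegistry("coordinator-host").lookup("sim-A/road-17");
        }
    }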
Anyone have any tips for heading down this path?
You may want to check out Hazelcast also. Hazelcast is an open source, transactional, distributed/partitioned implementation of queues, topics, maps, sets, lists, locks and an executor service. It is super easy to work with; just add hazelcast.jar to your classpath and start coding. Almost no configuration is required.
If you are interested in executing your Runnable, Callable tasks in a distributed fashion, then please check out Distributed Executor Service documentation at http://code.google.com/docreader/#p=hazelcast
Hazelcast is released under Apache license and enterprise grade support is also available.
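As a rough sketch of the executor service mentioned above (task and executor names are made up; tasks need to be serializable so they can be shipped to other members):

    import java.io.Serializable;
    import java.util.concurrent.Callable;
    import java.util.concurrent.Future;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;
    import com.hazelcast.core.IExecutorService;

    public class DistributedTaskSketch {
        // Tasks must be serializable so they can be sent to other cluster members.
        static class CostUpdate implements Callable<Double>, Serializable {
            public Double call() {
                return 42.0; // stand-in: recompute a link cost on whichever member runs this
            }
        }

        public static void main(String[] args) throws Exception {
            HazelcastInstance hz = Hazelcast.newHazelcastInstance();
            IExecutorService executor = hz.getExecutorService("sim-tasks"); // name is illustrative
            Future<Double> result = executor.submit(new CostUpdate());
            System.out.println("cost = " + result.get());
            hz.shutdown();
        }
    }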
Is this sane? IMHO no. And I'll tell you why. But first I'll add the disclaimer that this is a complicated topic so any answer has to be viewed as barely scratching the surface.
First, instead of repeating myself, I'll point you to a summary of Java grid/cluster technologies that I wrote a while ago. It's a mostly complete list.
The star topology is "natural" for a "naive" (I don't mean that in a bad way) implementation because point-to-point is simple and centralizing the key controller logic is also simple. It is, however, not fault-tolerant. It introduces scalability problems and a single bottleneck. It also introduces communication inefficiencies (namely, the nodes communicate via a two-step process through the center).
What you really want for this is probably a cluster (rather than a data/compute grid) solution and I'd suggest you look at Terracotta. Ideally you'd look at Oracle Coherence but it's no doubt expensive (compared to free). It is a fantastic product though.
These two products can be used in a number of ways, but the core of both is to treat a cache like a distributed map. You put things in, you take things out, and you fire off code that alters the cache. Coherence (with which I'm more familiar) scales fantastically well in this regard. These are more "server"-based products, though, for a true cluster.
If you're looking at a more distributed model then perhaps you should be looking at more of an SOA based approach.
Have a look at http://www.terracotta.org/
It's a distributed Java VM, so it has the advantage that a clustered application looks no different from a standard Java application.
I have used it in applications and the speed is very impressive so far.
Paul
Have you considered using a message queue approach? You could use JMS to communicate/coordinate tasks and results among a set of servers/nodes. You could even use Amazon's SQS (Simple Queue Service: aws.amazon.com/sqs) and have your servers running on EC2 to allow you to scale up and down as required.
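For illustration, dispatching a task over JMS boils down to something like this (the ConnectionFactory comes from whichever provider you pick, and the queue name is made up):

    import javax.jms.Connection;
    import javax.jms.ConnectionFactory;
    import javax.jms.MessageProducer;
    import javax.jms.Queue;
    import javax.jms.Session;
    import javax.jms.TextMessage;

    public class TaskDispatcher {
        // The ConnectionFactory would come from your provider (ActiveMQ, an SQS adapter, etc.).
        public static void dispatch(ConnectionFactory factory, String taskPayload) throws Exception {
            Connection connection = factory.createConnection();
            try {
                Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
                Queue tasks = session.createQueue("simulation-tasks"); // queue name is illustrative
                MessageProducer producer = session.createProducer(tasks);
                TextMessage message = session.createTextMessage(taskPayload);
                producer.send(message); // workers consume from the same queue and post results back
            } finally {
                connection.close();
            }
        }
    }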
Just my 2 cents.
Take a look at JINI, it might be of some use to you.
Well, Jini, or more specifically JavaSpaces, is a good place to start for a simple approach to the problem. JavaSpaces lets you implement a master-worker model, where your master (the coordinator in your case) writes tasks to the JavaSpace, and the workers query for and process those tasks, writing the results back for the master. Since your problem is not embarrassingly parallel, and your workers need to synchronize/exchange data, this will add some complexity to your solution.
Using JavaSpaces will add a whole lot more abstraction to your implementation than using plain RMI (which is used internally by the Jini framework as the default "wire protocol").
Have a look at this article from Sun for an intro.
And Jan Newmarch's Jini Tutorial is a pretty good place to start learning Jini
Just as an addition to the other answers which as far as I have seen all focus on grid and cloud computing, you should notice that simulation models have one unique characteristic: simulation time.
When running distributed simulation models in parallel and synchronized then I see two options:
When each simulation model has its own simulation clock and event list then these should be synchronized over the network.
Alternatively there could be a single simulation clock and event list which will "tick the time" for all distributed (sub) models.
The first option has been extensively researched for the High Level Architecture (HLA); see for example http://en.wikipedia.org/wiki/IEEE_1516 as a starter.
However the second option seems more simple and with less overhead to me.
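Ignoring the networking, the coordination logic of the second option boils down to a barrier per tick; a minimal in-process sketch (the sub-models and tick handling are stand-ins):

    import java.util.List;
    import java.util.concurrent.Phaser;

    public class CentralClock {
        // One party per registered sub-model; the clock only advances once all of them
        // have finished processing the current tick.
        public static void run(List<Runnable> subModels, int ticks) {
            Phaser barrier = new Phaser(subModels.size());
            for (Runnable model : subModels) {
                new Thread(() -> {
                    for (int t = 0; t < ticks; t++) {
                        model.run();                      // process all events for simulation time t
                        barrier.arriveAndAwaitAdvance();  // wait until every model reaches time t
                    }
                }).start();
            }
        }
    }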
GridGain is a good alternative. They have a map/reduce implementation with "direct API support for split and aggregation" and a "distributed task session". You can browse their examples and see if some of them fit your needs.
I couldn't seem to find a similar question to this.
I am currently looking for the best solution to a grid computing problem.
The setup:
I have a server/client situation where the clients [typically dumb, with little logic of their own] receive instructions from the server
Clients make an authorization request
Clients report back information on how fast they completed the task (the task's difficulty is judged by the task type)
Clients receive the best-fit task for their previous performance (the best clients receive the worst problems)
Eventually the requirements would be:
The client's footprint must be small and standalone - I can't have a client that requires lots to install and setup
The client should be able to grab new jobs and job runtimes from the server (it would be nice to have the grid scale to new problems [and the new problems would be distributed by the server] that are introduced)
I need to have an authentication layer (doesn't have to be complex or conform to an existing ldap) [easier requirement: clients can signup for a new "membership" and get access] (I'm not sure that RMI's strengths lie here)
The clients would be able to run from the Internet rather than in a networked environment
Which means encryption of the results requested
I'm currently using web services to communicate between the clients and the server. All of the information and results go back to the hosting server (J2EE).
My question is: is there a grid system that matches all/most of these requirements and is open source?
I'm not interested in doing a cloud because most of these tasks are small but very frequent (a task may run only once a day and be easy, but it performs maintenance).
All of the code for this system is in Java.
You may want to investigate space-based architectures, and in particular Jini and JavaSpaces. What's Jini? It is essentially RMI with a configurable discovery mechanism: you request an implementor of a Java interface, and the Jini subsystem finds current services implementing that interface and dynamically informs your service of them.
Briefly, you'd write the work items into a space. The grid nodes would be set up to read data transactionally from the space. Each grid node would take a work item, process it, and write a result back into that space (or another space). The distributing node can monitor for results being written back (and/or for your projected result timings, as you've requested).
It's all Java, and will scale linearly. Because it's Jini, the grid nodes can dynamically load their classes from an HTTP server and so you can propagate code updates trivially.
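Sketching the take/process/write cycle (discovery of the space via Jini lookup is omitted, and the entry fields and processing are just stand-ins):

    import net.jini.core.entry.Entry;
    import net.jini.core.lease.Lease;
    import net.jini.space.JavaSpace;

    // Entries need public fields and a no-arg constructor; the fields act as the match template.
    public class WorkItem implements Entry {
        public String taskId;
        public String payload;
        public WorkItem() { }
        public WorkItem(String taskId, String payload) { this.taskId = taskId; this.payload = payload; }
    }

    class Worker {
        // The JavaSpace proxy itself would be discovered via Jini lookup; that part is omitted here.
        void processOne(JavaSpace space) throws Exception {
            WorkItem template = new WorkItem();                        // null fields match anything
            WorkItem item = (WorkItem) space.take(template, null, Long.MAX_VALUE);
            String result = item.payload.toUpperCase();                // stand-in for real processing
            space.write(new WorkItem(item.taskId + "-result", result), null, Lease.FOREVER);
        }
    }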
Take a look at Grid Beans.
BOINC sounds like it would work for your problem, though you would have to wrap Java for your clients. That, and it may be overkill for you.