I am using Kafka for multiple purposes, and I want to use Kafka's Connect API, but I don't understand why I should use Kafka Connect instead of writing my own consumer group that writes the messages into whatever database I need, without the added complexity and extra packages that Confluent ships with Kafka Connect.
Connect as a framework takes care of fail-over, and you can also run it in distributed mode to scale out your data import/export "job". Thus, Connect is really a "fire and forget" experience. Furthermore, with Connect you don't need to write any code -- you just configure the connector.
If you built this manually, you would basically be solving problems that Connect has already solved (i.e., reinventing the wheel). Don't underestimate the complexity of this task -- it sounds straightforward on the surface, but it's more complex than it seems.
Kafka Connect offers a useful abstraction for both users and developers who want to move data in and out of Apache Kafka.
Users may pick a Connector out of a constantly growing collection of existing Connectors and, by just submitting appropriate configuration, integrate their data with Kafka quickly and efficiently. Developers can implement a Connector for their special use case, without having to worry about low level management of a cluster of producers and consumers and how to make such a cluster scale (as Matthias mentioned already).
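To make that concrete, sinking a topic into a relational database with Confluent's JDBC sink connector is, roughly, just a piece of configuration submitted to the Connect REST API. The connector name, topic, and connection details below are made up for illustration, and the available properties depend on the connector and version you use:

```json
{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "2",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db-host:5432/shop",
    "connection.user": "writer",
    "connection.password": "secret",
    "auto.create": "true"
  }
}
```

Submitting that to the Connect REST endpoint (by default, POST to http://connect-host:8083/connectors) is all it takes to start streaming the topic into the database; offsets, retries, and task distribution are handled by the framework.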
As it often happens with software, if a particular software abstraction doesn't fit your needs, you may have to go down one or more abstraction levels and write your code by using lower level constructs. In our case these are the Kafka producer and consumer, which are still a pretty robust and easy to use abstraction for moving data in and out of Kafka.
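For comparison, here is a rough sketch of what the hand-rolled alternative looks like before you even start on fail-over, scaling, and schema handling. The topic, table, and connection details are made up for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrdersToDb {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "orders-db-writer");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:postgresql://db-host:5432/shop", "writer", "secret")) {

            consumer.subscribe(Collections.singletonList("orders"));
            PreparedStatement insert =
                    db.prepareStatement("INSERT INTO orders_raw (key, value) VALUES (?, ?)");

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.key());
                    insert.setString(2, record.value());
                    insert.executeUpdate();
                }
                // Commit offsets only after the batch is safely in the database;
                // doing this wrong is exactly the kind of subtlety Connect handles for you.
                consumer.commitSync();
            }
        }
    }
}
```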
Now to get to the specific point you are referring to, which is what is often called handling of bad or incompatible data in Kafka Connect, this is something that is mostly a responsibility of the Connector developer at the moment. However, we intend to provide ways for the framework to facilitate such handling of bad data and make it more a matter of configuration rather than Connector implementation. That's in the roadmap for the near future.
I would like to be able to do some data manipulation when documents are updated or created in Couchbase.
Documents can arrive in our database either via Sync Gateway or our own code which streams data in from an http service. It would be great to have one place where I can intercept all updates.
We are running a Spring Boot REST API against this data, so that would be a good place to have the interceptor/listener. Either way, my preference would be for a Java solution.
The data is written as JSON rather than as Spring entities, so I can't use ApplicationListener, which only listens to events on entity classes. Correct me if I'm wrong -- I can find precious few examples of setting up ApplicationListeners, so I may be mistaken here, but I can't seem to get it working.
I see that there is an Eventing service where you write JavaScript, but for a number of reasons I'm not keen to go that way: I don't want to fragment our API code across platforms and languages, I'm not sure I can run the Eventing service on our systems, etc. Again, I'm open to debate though.
That leaves only DCP, as far as I can tell, which seems very low level (https://blog.couchbase.com/couchbases-history-everything-dcp/) but looks like the tool for the job.
The QUESTION: Is there an alternative, less low-level way to catch update events in Couchbase for JSON objects (NOT entities), other than DCP?
Disclaimer: I work for Couchbase and develop the Java DCP client.
If you've already evaluated the Eventing service and decided it doesn't meet your requirements, the Java DCP client might be worth looking into even though it's not officially supported. It's used by the official Couchbase connectors for Kafka, Spark, and Elasticsearch (all of which are open source) and is actively maintained.
If you only care about events that happened since your app started up, usage can be as simple as registering a callback and starting the event stream. Things get a bit more complicated if you need to remember your place in the stream and resume later (to process events that occurred while you were offline, for example), but there's example code for that case too.
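To give a feel for it, a bare-bones setup looks roughly like the sketch below. It is loosely based on the examples shipped with the java-dcp-client repository; the exact builder and handler method names have changed between versions of the library, and the cluster address, credentials, and bucket name are placeholders, so treat this as an illustration rather than copy-paste code:

```java
import com.couchbase.client.dcp.Client;
import com.couchbase.client.dcp.StreamFrom;
import com.couchbase.client.dcp.StreamTo;
import com.couchbase.client.dcp.message.DcpDeletionMessage;
import com.couchbase.client.dcp.message.DcpMutationMessage;

public class DcpListener {
    public static void main(String[] args) {
        Client client = Client.configure()
                .hostnames("127.0.0.1")           // cluster address (placeholder)
                .credentials("user", "password")  // user with DCP access (placeholder)
                .bucket("my-bucket")
                .build();

        // Control events (snapshot markers, rollbacks, ...) must be acknowledged and released.
        client.controlEventHandler((flowController, event) -> {
            flowController.ack(event);
            event.release();
        });

        // Data events: mutations and deletions for every document in the bucket.
        client.dataEventHandler((flowController, event) -> {
            if (DcpMutationMessage.is(event)) {
                String key = DcpMutationMessage.keyString(event);
                // The document body is also available from the event; hand it off to
                // your own processing code here.
            } else if (DcpDeletionMessage.is(event)) {
                // Handle deletes here.
            }
            flowController.ack(event);
            event.release();
        });

        client.connect().block();
        // Start from "now": only events that happen after startup are delivered.
        client.initializeState(StreamFrom.NOW, StreamTo.INFINITY).block();
        client.startStreaming().block();
    }
}
```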
The DCP protocol itself is well documented. If you decide to go this route, it might be good to read at least the Architecture section of that documentation. Also be aware that because the Java DCP Client is unsupported, the API can change without notice. (Officially supporting the library and providing a friendlier API are among our long-term goals, but we haven't committed to anything yet.)
Like David, I also work for Couchbase as a product manager for the Eventing service.
I would like to be able to do some data manipulation when documents are updated or created in Couchbase.
Eventing certainly allows anyone to respond to and perform data manipulation on mutations (inserts or upserts) via tiny JavaScript fragments. Just take a look at couchbase-eventing-small-scripts-that-solve-big-problems for a quick introduction and also the eventing-examples from the documentation.
If you do go the Eventing service route on an SGW-enabled bucket, you will need to suppress duplicate mutations via the crc64() function built into Eventing (for details, go to eventing-language-constructs and search for: Sync Gateway). In addition, if you want Eventing to directly update the source bucket while SGW is enabled on that bucket, there is a more involved workaround (just reach out to me and I will be happy to provide it).
Next you stated:
not sure I can run the Eventing service on our systems
The Eventing service is bundled with the Couchbase Enterprise offering; it provides scalable infrastructure to run simple JavaScript fragments on data or documents as they change or mutate, without the overhead of an SDK. You either add standalone Eventing node(s) to your Couchbase cluster or co-locate the Eventing service with other existing nodes.
I have a case that requires manipulating a large stream of JSON and injecting it into Apache HBase. Our system currently runs on Node.js with Mongo; since we need to improve performance, HBase was chosen to handle the big-data side.
To improve my system's scalability, I prefer using the Actor Model via Akka for messaging instead of any other message queue system. The Actor Model that Akka provides gives me advantages around fail-safety, actor management, and other features that are very helpful for getting my job done. But it still lives in the JVM layer, which directly injects data into and consumes data from HBase.
I want my Node.js apps to also work under the Akka system, maybe using node-java. Is that good practice? If not, is there any other way for Node.js to communicate with Akka?
P.S. My question here is about how to make Akka and Node.js work together, not about "why choose Node.js when the JVM has really fast JSON manipulation libraries?" -- our system has already been benchmarked and Node.js was chosen to handle the JSON manipulation. It is also already in production, so fully migrating from Node.js to Scala is not our priority today.
Just to clarify, Akka implements message passing as its concurrency model, and it supports message-queue patterns (e.g. broadcast, pub-sub). However, you'd be better off looking at MQ solutions if that is really what you need.
I think going down the path you proposed (running NodeJs with Java interop) will yield little benefit whilst adding significant complexity for the long term.
Better to look for an answer from an architectural point-of-view.
If I had to decide, I would create a Scala / Java Akka microservice that sits between your NodeJs front-end and HBase. You can get a quick Proof of Concept running (which you can back out on relatively easy).
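As a rough illustration of that middle tier, here is a minimal sketch of an Akka (classic, Java API) actor that accepts already-parsed JSON payloads and writes them into HBase. The actor, message class, table, and column names are all hypothetical, and the transport between Node.js and this service (HTTP, a queue, etc.) is deliberately left out:

```java
import akka.actor.AbstractActor;
import akka.actor.ActorSystem;
import akka.actor.Props;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriter extends AbstractActor {

    // Hypothetical message: a row key plus the JSON document as a string.
    public static final class WriteJson {
        final String rowKey;
        final String json;
        public WriteJson(String rowKey, String json) {
            this.rowKey = rowKey;
            this.json = json;
        }
    }

    private Connection connection;
    private Table table;

    @Override
    public void preStart() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf("events")); // illustrative table name
    }

    @Override
    public void postStop() throws Exception {
        table.close();
        connection.close();
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(WriteJson.class, msg -> {
                    Put put = new Put(Bytes.toBytes(msg.rowKey));
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("json"), Bytes.toBytes(msg.json));
                    table.put(put);
                })
                .build();
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("ingest");
        system.actorOf(Props.create(HBaseWriter.class), "hbase-writer");
        // The Node.js side would reach this actor through whatever transport you expose
        // (an HTTP endpoint, Kafka topic, etc.) -- that glue is intentionally omitted here.
    }
}
```

The point is that the JVM-only part stays small and replaceable, which also makes the Proof of Concept easy to back out of.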
PS. If you are committing yourself to HBase, I would highly recommend you to also look into Apache Spark, which makes taming Big Data easier.
I'm starting work on a system that will need to discover nodes in a cluster and send those nodes jobs to work on. I know that a myriad of systems exist that solve this but I'm unclear about the complexities of each and which would best suit my specific needs.
Our requirements are that an application should be able to send out job requests. Each request will specify multiple segments of data to work on. Nodes in the cluster should get these job requests and figure out whether the data segments being requested are "convenient". The application will need to keep track of which segments are being worked on by some node and then possibly send out a further requests if there are data segments that it needs to force some nodes to work on (all the nodes have access to all the data, but they should prefer to work on data segments that they have already cached).
This is a very typical map/reduce problem but we don't want to use the standard hadoop solutions because we are trying to avoid the overhead of writing preliminary results to files. This is more of a streaming problem where we want nodes to perform filtering on data that they read and then send it over a network socket to the application that will combine the results from all the nodes.
I've taken a quick look at Akka, Apache Spark (streaming), Storm, and just plain simple UPnP, and I'm not quite sure which one would suit my needs best. One thing that works against at least Spark is that it seems to require ZooKeeper to be set up on the network, which is a complication we'd like to be able to avoid.
Is there any simple library that does something similar to this "auto discover nodes via network multicast" and then allows you to simply send messages back and forth to negotiate which node will handle which data segment? Will Akka be able to help me here? How are nodes added/discovered in a cluster there? Again, we'd like to keep the configuration overhead to a minimum which is why UPNP/SSDP look sort of nice.
Any suggestions for how to use the solutions mentioned above or even other libraries or solutions to look into are very much appreciated.
You could use Akka Clustering: http://doc.akka.io/docs/akka/current/java/cluster-usage.html. However, it doesn't use multicast; it uses a gossip protocol to handle node up/down messages. You could use a Cluster-Aware Router (see the Akka Clustering doc and http://doc.akka.io/docs/akka/current/java/routing.html) to route your messages to the cluster. There are several different types of routers depending on your needs and what you mean by "convenient": if "convenient" just means which actor is currently free, you can use a Smallest Mailbox router; if it has something to do with the content of the message, you could use a Consistent Hashing router.
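To show what the discovery side looks like, here is a minimal cluster-membership listener using the classic Java API, following the pattern in the Akka cluster documentation (seed-node configuration in application.conf is omitted, and the logging is illustrative):

```java
import akka.actor.AbstractActor;
import akka.cluster.Cluster;
import akka.cluster.ClusterEvent;
import akka.cluster.ClusterEvent.MemberRemoved;
import akka.cluster.ClusterEvent.MemberUp;
import akka.cluster.ClusterEvent.UnreachableMember;

public class ClusterListener extends AbstractActor {

    private final Cluster cluster = Cluster.get(getContext().getSystem());

    @Override
    public void preStart() {
        // Subscribe to membership events; the current cluster state is replayed as events.
        cluster.subscribe(getSelf(), ClusterEvent.initialStateAsEvents(),
                MemberUp.class, UnreachableMember.class, MemberRemoved.class);
    }

    @Override
    public void postStop() {
        cluster.unsubscribe(getSelf());
    }

    @Override
    public Receive createReceive() {
        return receiveBuilder()
                .match(MemberUp.class, m ->
                        System.out.println("Node joined: " + m.member().address()))
                .match(UnreachableMember.class, m ->
                        System.out.println("Node unreachable: " + m.member().address()))
                .match(MemberRemoved.class, m ->
                        System.out.println("Node removed: " + m.member().address()))
                .build();
    }
}
```

On top of these membership events you would then put a cluster-aware router (for example a consistent-hashing one keyed on the data segment) so that requests about the same segment tend to land on the same node.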
See Balancing Workload Across Nodes with Akka 2.
This post describes a work distribution algorithm using Akka. The algorithm doesn't use multicast to discover workers. There is a well-known master address and the workers register with the master. Other than that though it fits your requirements well.
Another variation on it is described in Akka Work Pulling Pattern.
I've used this pattern in a number of projects - it works great.
Storm is fairly resilient when it comes to worker-nodes coming offline & online. However, just like Spark, it does require Zookeeper.
The good news is that Storm comes with a sister project to make deployment a breeze: https://github.com/nathanmarz/storm-deploy/wiki
If you're running vanilla storm on EC2, the storm-deploy project could be what you're looking for.
So I'm planning to write an application that would lend itself well to a producer/consumer pattern. I was thinking of building my own producer/consumer framework, but then I thought about message queues, something I use extensively at work. I'm not 100% sure that a message queue is the right approach, considering that the multiple modules of the application I am writing need to run on a single server, as it's a client/controller of sorts for that particular host.
What are the pros and cons of using messaging queues for a non-distributed application? Has anyone used it in this way before?
Thanks, let me know if you need more information.
By "message queues" do you mean an external message server? My below answer assumes that is what you were aking about. If you are just asking about the more general architectural approach of having modules communicate partially, or in full, via in-memory-messages instead of method calls--yes sometimes this can be very nice. Classes like guava's EvenBus facilitate a design like this nicely: https://code.google.com/p/guava-libraries/wiki/EventBusExplained
On the one hand, I generally try to discourage people from using JMS message queues when a simple queue data structure would suffice. Sometimes I feel that JMS is an inter-process communication tool that has one-to-many (topics) and one-to-one communication channels which happen to be named queues. Yes, their access pattern is similar to that of a queue, but the more important characteristic, it seems to me, is their point-to-point messaging capability. So it's an unfortunate name that I think sometimes causes people to use a jackhammer (JMS) when all they need is a screwdriver (java.util.Queue).
On the other hand, there are exceptions to any rule. I can't recommend, off hand, a java.util.Queue implementation that is thread-safe and persistent across server restarts (an often-needed feature when people are considering JMS). I'm sure there are some. Find a few and compare them to JMS. Weigh business needs, time constraints, possible future design/requirements, etc. I have implemented one myself before and it turned out quite nice (and was faster than sending messages over the network to a remote JMS server) -- but only you can say if this is right for your situation.
I suppose you could always defer the decision by having the modules of your app communicate through a messaging-like interface of your own, which uses java.util.Queue internally for now and JMS later if you find that you need it. Though be careful here too -- adding unnecessary abstraction early is sometimes a burden that turns out not to be worth it.
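For what it's worth, a deferral like that can be as small as the following sketch (interface and class names are hypothetical); the in-memory implementation could later be swapped for one backed by a JMS producer/consumer without touching the modules:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal messaging facade the app modules code against.
interface MessageChannel<T> {
    void send(T message);
    T receive() throws InterruptedException; // blocks until a message is available
}

// In-process implementation backed by a thread-safe queue.
final class InMemoryChannel<T> implements MessageChannel<T> {
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>();

    @Override
    public void send(T message) {
        queue.add(message);
    }

    @Override
    public T receive() throws InterruptedException {
        return queue.take();
    }
}

public class ChannelDemo {
    public static void main(String[] args) throws InterruptedException {
        MessageChannel<String> channel = new InMemoryChannel<>();
        channel.send("hello");
        System.out.println(channel.receive());
        // A later JmsChannel<T> implementing the same interface would let you switch to a
        // broker without changing the producer/consumer modules.
    }
}
```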
What products/projects could help me with the following scenario?
More than one server (same location)
Some state should be shared between servers (for instance, information about whether a scheduled task is running and on which server).
The obvious answer could of course be a database, but we are using Seam and there doesn't seem to be a good way to nest transactions inside a Seam bean, so I need to find a way where I don't have to go crazy over configuration (I tried to use EJBs, but persistence.xml wasn't pretty afterwards). So I need another way around this problem until Seam supports nested transactions.
This is basically the same scenario as I have if you need more details: https://community.jboss.org/thread/182126.
Any ideas?
Sounds like you need to do distributed job management.
The reality is that in the Java EE world, you are going to end up having to do Queues, as in MoM [Message-oriented Middleware]. Seam will work with JMS, and you can have publish and subscribe queues.
Where you might want to take a look for an alternative is at Akka. It gives you the ability to distribute jobs across machines using an Actor/Agent model that is transparent. That is to say, your agents can cooperate with each other whether they are on the same instance or across the network from each other, and you are not writing a ton of code to make that happen, or having to specially handle things up and down the message chain.
The other thing Akka has going for it is the notion of Supervision, aka "go ahead and fail" or "let it crash". This is the idea (followed by the telcos for years) that systems will fail, so you should design for it and have a means of making things resilient.
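In Akka terms, "let it crash" is expressed through supervisor strategies: a parent actor declares what should happen when a child fails, instead of wrapping everything in defensive try/catch blocks. A minimal sketch using the classic Java API follows; the actor name and the specific exception types are just examples:

```java
import java.time.Duration;

import akka.actor.AbstractActor;
import akka.actor.OneForOneStrategy;
import akka.actor.SupervisorStrategy;
import akka.japi.pf.DeciderBuilder;

public class JobSupervisor extends AbstractActor {

    // Restart a failing child up to 10 times per minute; stop it for clearly fatal input.
    private static final SupervisorStrategy STRATEGY =
            new OneForOneStrategy(
                    10,
                    Duration.ofMinutes(1),
                    DeciderBuilder
                            .match(IllegalArgumentException.class, e -> SupervisorStrategy.stop())
                            .match(RuntimeException.class, e -> SupervisorStrategy.restart())
                            .matchAny(o -> SupervisorStrategy.escalate())
                            .build());

    @Override
    public SupervisorStrategy supervisorStrategy() {
        return STRATEGY;
    }

    @Override
    public Receive createReceive() {
        // Children that actually run the jobs would be created and messaged here.
        return receiveBuilder().build();
    }
}
```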
Finally, the state of other options job wise in the Java world is dismal. Have used Seam for years. It's great, but they decided to just support Quartz for jobs, which is useless.
Akka is built on Netty, too, which does some pretty crazy stuff in terms of concurrency and performance.
[Not a TypeSafe employee, btw…]