I am trying to learn ZooKeeper, but all the tutorials and explanations are way too abstract for me to get a clear view of how it could benefit my life, or whether it is just another tool that is "awesome" but that no one will ever use directly in real life.
I understand ZooKeeper is a "coordination tool": master, worker, assigning tasks, and a bunch of failure-prevention stuff. However, I have a very 'naive' real-world problem and am wondering if ZooKeeper itself will help me solve it.
Say I have a big big file with many many lines of numbers, like below:
1000
23213
3231
4213
..
And the goal/output is to come up with another file, which contains the square of the number on the corresponding line.
1000^2
23213^2
...
I actually have a real-life use case that I implemented using a python-flask server to distribute the work based on requests from the workers, but it is too fragile, and I cannot easily track failures. I am wondering whether ZooKeeper would be the solution.
Can any ZooKeeper expert help me write some example code to distribute this work to maybe 3 computers and, in the end, send the data back to the master?
I totally understand it must be super easy to use MapReduce or multithreading to make it happen, but I am wondering whether it would be possible to use just ZooKeeper, to show the idea that "ZooKeeper is a coordination tool".
One common way to use Zookeeper is by taking advantage of ephemeral nodes as locks so as to create a distributed work queue.
Workers browse a list inside ZK and attempt to create an ephemeral "lock" node.
If the attempt fails it means another worker has locked the node.
If the attempt succeeds the worker can do the operation (in your case do the math), and then write a new node and delete the old node.
The power of the ephemeral lock is that if the worker dies for whatever reason the connection is broken and ZK guarantees that the lock goes away automatically.
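For illustration, here is a rough sketch of such a worker using the plain ZooKeeper Java client. The /tasks, /locks and /results paths, the one-number-per-child layout, and the SquaringWorker name are assumptions made for this example, not a fixed recipe:

import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SquaringWorker {
    // zk is an already-connected session; each child of /tasks holds one number.
    public static void runOnce(ZooKeeper zk) throws Exception {
        List<String> tasks = zk.getChildren("/tasks", false);
        for (String task : tasks) {
            try {
                // Ephemeral: the lock disappears automatically if this worker's session dies.
                zk.create("/locks/" + task, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            } catch (KeeperException.NodeExistsException alreadyLocked) {
                continue; // another worker holds this task
            }
            long n = Long.parseLong(new String(zk.getData("/tasks/" + task, false, null)));
            // Do the math, publish the result as a new node, then remove the task and the lock.
            zk.create("/results/" + task, Long.toString(n * n).getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            zk.delete("/tasks/" + task, -1);
            zk.delete("/locks/" + task, -1);
        }
    }
}

The master would only need to split the input file into /tasks children and collect the /results children back into the output file.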
I am trying to write a custom receiver for Structured Streaming that will consume messages from RabbitMQ.
Spark recently released the DataSource V2 API, which seems very promising. Since it abstracts away many details, I want to use this API for the sake of both simplicity and performance. However, since it's quite new, there are not many sources available. I need some clarification from experienced Spark guys, since they will grasp the key points more easily. Here we go:
My starting point is the blog post series, with the first part here. It shows how to implement a data source, without streaming capability. To make a streaming source, I slightly changed them, since I need to implement MicroBatchReadSupport instead of (or in addition to) DataSourceV2.
To be efficient, it's wise to have multiple Spark executors consuming RabbitMQ concurrently, i.e. from the same queue. If I'm not confused, every partition of the input (in Spark's terminology) corresponds to a consumer from the queue (in RabbitMQ terminology). Thus, we need to have multiple partitions for the input stream, right?
Similar to part 4 of the series, I implemented MicroBatchReader as follows:
@Override
public List<DataReaderFactory<Row>> createDataReaderFactories() {
    int partition = options.getInt(RMQ.PARTITION, 5);
    List<DataReaderFactory<Row>> factories = new LinkedList<>();
    for (int i = 0; i < partition; i++) {
        factories.add(new RMQDataReaderFactory(options));
    }
    return factories;
}
I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
I want my receiver to be reliable, i.e. after every message is processed (or at least written to the checkpoint directory for further processing), I need to ack it back to RabbitMQ. The problem starts here: these factories are created at the driver, and the actual reading takes place at executors through DataReaders. However, the commit method is a part of MicroBatchReader, not DataReader. Since I have many DataReaders per MicroBatchReader, how should I ack these messages back to RabbitMQ? Or should I ack when the next method is called on a DataReader? Is it safe? If so, what is the purpose of the commit function then?
CLARIFICATION/OBFUSCATION: The link provided in the answer about the renaming of some classes/functions (in addition to the explanations there) made everything much more clear/worse than ever. Quoting from there:
Renames:
DataReaderFactory to InputPartition
DataReader to InputPartitionReader
...
InputPartition's purpose is to manage the lifecycle of the associated reader, which is now called InputPartitionReader, with an explicit create operation to mirror the close operation. This was no longer clear from the API because DataReaderFactory appeared to be more generic than it is and it isn't clear why a set of them is produced for a read.
EDIT: However, the docs clearly say that "the reader factory will be serialized and sent to executors, then the data reader will be created on executors and do the actual reading."
To make the consumer reliable, I have to ACK for a particular message only after it is committed on the Spark side. Note that messages have to be ACKed on the same connection through which they were delivered, but the commit function is called at the driver node. How can I commit at the worker/executor node?
> I am returning a list of factories, and hope that every instance in the list will be used to create a reader, which will be also a consumer. Is that approach correct?
The [socket][1] source implementation has one thread pushing messages into the internal ListBuffer. In other words, there is one consumer (the thread) filling up the internal ListBuffer, which is **then** divided up into partitions by `planInputPartitions` (`createDataReaderFactories` got [renamed][2] to `planInputPartitions`).
Also, according to the Javadoc of [MicroBatchReadSupport][3]
> The execution engine will create a micro-batch reader at the start of a streaming query, alternate calls to setOffsetRange and createDataReaderFactories for each batch to process, and then call stop() when the execution is complete. Note that a single query may have multiple executions due to restart or failure recovery.
In other words, the `createDataReaderFactories` should be called **multiple** times, which to my understanding suggests that each `DataReader` is responsible for a static input partition, which implies that the DataReader shouldn't be a consumer.
----------
> However, the commit method is a part of MicroBatchReader, not DataReader ... If so, what is the purpose of commit function then?
Perhaps part of the rationale for the commit function is to prevent the internal buffer of the MicroBatchReader from getting too big. By committing an Offset, you can effectively remove elements less than the Offset from the buffer, as you are making a commitment to not process them anymore. You can see this happening in the socket source code with `batches.trimStart(offsetDiff)`.
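As a rough Java illustration of that idea (this is not Spark's actual socket source, which is written in Scala; RMQOffset is a hypothetical Offset implementation wrapping a long, and batches/firstOffsetInBuffer are illustrative fields):

// Simplified: the reader keeps not-yet-committed rows in memory,
// and commit(end) drops everything below the committed offset.
@Override
public void commit(Offset end) {
    long committed = ((RMQOffset) end).getValue();  // hypothetical long-backed Offset
    long toDrop = committed - firstOffsetInBuffer;  // how many buffered rows are now committed
    // Equivalent of batches.trimStart(offsetDiff) in the Scala socket source:
    while (toDrop-- > 0 && !batches.isEmpty()) {
        batches.remove(0);
    }
    firstOffsetInBuffer = committed;
}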
I'm unsure about implementing a reliable receiver, so I hope a more experienced Spark guy comes around and grabs your question as I'm interested too!
Hope this helps!
EDIT
I had only studied the socket and wiki-edit sources. These sources are not production ready, which is not what the question was looking for. Instead, the kafka source is a better starting point: unlike the aforementioned sources, it has multiple consumers, which is what the author was looking for.
However, if an unreliable source is acceptable, the socket and wiki-edit sources above provide a less complicated starting point.
I'm coding a Java socket server that connects to an Arduino, which in turn sends and receives data. As shown in the Java socket documentation, I've set up the server to open a new thread for every connection.
My question is, how will I be able to send the data from the socket threads to my main thread? The socket will be constantly open, so the data has to be sent while the thread is running.
Any suggestion?
Update: the goal of the server is to send commands to an Arduino (i.e. turn a light on or off) and to receive data from sensors, so I need a way to obtain the data from the sensors, which are handled by individual threads, and funnel it into a single one.
Sharing data among threads is always tricky. There is no "correct" answer; it all depends on your use case. I suppose you are not searching for the highest performance, but for ease of use, right?
For that case, I would recommend looking at synchronized collections: maps, lists, or perhaps queues. One class that seems like a good fit for you is ConcurrentLinkedQueue.
You can also create synchronized proxies for all usual collections using the factory methods in Collections class:
Collections.synchronizedList(new ArrayList<String>());
You then do not have to synchronize individual accesses to them yourself (although iterating over a synchronized collection still requires external synchronization).
Another option, which might be overkill, is using a database. There are some in-memory databases, like H2.
In any case, I suggest keeping the amount of shared information as low as possible. For example, you can keep the "raw" data separate per thread (e.g. in ThreadLocal variables) and only synchronize during aggregation.
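As a minimal sketch of the queue approach (the class and method names are made up for illustration): the connection threads offer readings, and the main thread polls them whenever it wants.

import java.util.concurrent.ConcurrentLinkedQueue;

public class SensorHub {
    // Thread-safe, lock-free queue shared between the socket threads and the main thread.
    private static final ConcurrentLinkedQueue<String> readings = new ConcurrentLinkedQueue<>();

    // Called from a socket-handler thread whenever a line arrives from the Arduino.
    public static void publish(String reading) {
        readings.offer(reading);
    }

    // Called from the main thread; returns null when nothing is pending.
    public static String poll() {
        return readings.poll();
    }
}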
You seem to have the right idea - you need a thread to run the connection to the external device and you need a main thread to run your application.
How do you share data between these threads? This isn't, in general, a problem: different threads can write to the same memory, since threads within the same application share the same memory space.
What you probably want to avoid is two threads concurrently changing or reading the same data. Java provides a very useful keyword, synchronized, to handle this sort of situation; it is straightforward to use and provides the kind of guarantees you need. This is a bit technical, but it discusses the concurrency features.
Here is a tutorial from which you might be able to get some more information. Please note, a quick Google search will bring up lots of answers to your question.
http://tutorials.jenkov.com/java-multithreaded-servers/multithreaded-server.html
In answer to your question, you can send the information from one thread to another in a number of ways. If it is a simple setup, I would recommend just using static variables/methods to pass the information.
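A tiny sketch of that idea (the class and field names are invented for this example); the synchronized methods keep the two threads from reading and writing the value at the same time and make updates visible across threads:

public class SharedCommandState {
    private static boolean lightOn;

    // Called from a socket-handler thread when the Arduino reports a state change.
    public static synchronized void setLightOn(boolean on) {
        lightOn = on;
    }

    // Called from the main thread whenever it needs the latest value.
    public static synchronized boolean isLightOn() {
        return lightOn;
    }
}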
Also, for reference: for large-scale programs it is not recommended to start a thread for every connection. It works fine on a smaller scale (e.g. a small number of clients), but scales poorly.
If this is a web application and you are just going to show the current readout of any of the sensors, then a blocking queue is huge overkill and will cause more problems than it solves. Just use a volatile static field of the required type. The field itself can be static, or it could reside in a singleton object, or it could be part of a context passed to the worker.
in the SharedState class:
static volatile float temperature;
in the thread:
SharedState.temperature = 13.2f;
In the web interface (assuming jsp):
<%= SharedState.temperature %>
btw: if you want to access the last 10 readouts, then it's equally easy: just store an array with the last 10 readouts instead of a single value (just don't modify what's inside the array; replace the whole array instead, otherwise synchronization issues might occur).
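A minimal sketch of that pattern, assuming the same SharedState class as above and a hypothetical newTemperature value coming from the sensor thread:

in the SharedState class:
// Newest readout last; the array contents are never modified in place.
static volatile float[] lastReadouts = new float[0];

in the sensor thread:
float[] old = SharedState.lastReadouts;
int keep = Math.min(old.length, 9);                 // keep at most the 9 most recent old values
float[] updated = new float[keep + 1];
System.arraycopy(old, old.length - keep, updated, 0, keep);
updated[keep] = newTemperature;
SharedState.lastReadouts = updated;                 // publish by replacing the whole array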
I'm new to Hazelcast, so I have a question about best failure-handling practices during parallel processing:
Mastering Hazelcast, section 6.6, p. 96:
Work-queue has no high availability: Each member will create one or more local ThreadPoolExecutors with ordinary work-queues that do the real work. When a task is submitted, it will be put on the work-queue of that ThreadPoolExecutor and will not be backed up by Hazelcast. If something would happen with that member, all unprocessed work will be lost.
Task:
Suppose I've got 1 master node and 2 slaves. I launch a time-consuming task with
executor.submitToAllMembers (new TimeConsumingTask())
So each node is processing something. And while they are all processing something, one of the slaves fails.
Questions:
It's not possible to rerun the failed member's work on another node, right?
Is there any other (preferably better) approach than rerunning the whole job set across the whole cluster? (In case TimeConsumingTask is a Runnable.)
Is there any other (preferably better) approach than rerunning the whole job set across the whole cluster? (In case TimeConsumingTask is a Callable and I want to get a Future as the cluster computation result.)
I'm assuming by 'failure handling' you're talking about the scenario where a node in the cluster goes down....
Question 1 Not automatically. You are right in assuming that Hazelcast's execution tasks are not fault tolerant. However, if you were able to handle the failure of a task, I can't see a reason why you couldn't resubmit the work to another member in the cluster.
Question 2 It's difficult to know what your TimeConsumingTask is actually doing - as with any distributed execution engine, it's generally better to compose the long running task as a series of smaller tasks. If you can't compose your task as smaller elements, then no - there's not really a better approach than resubmitting the whole job again
Question 3 The same thing applies to this question as question 2. Returning a Future from a task submission is not going to help you massively if a node fails. Futures provide you with the ability to wait (optionally for a specified timeout period) on the result and provide the possibility of cancelling the task.
Generally, for handling a node failing, I would take a look at whether an ExecutionCallback would help: you get notified on a failure, and I am currently assuming that a node failure falls under this. When your callback is notified of the failure, you could resubmit the job.
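A rough sketch of that idea, assuming TimeConsumingTask implements Callable&lt;String&gt; and hazelcastInstance is your HazelcastInstance (the executor name and result type are placeholders):

import com.hazelcast.core.ExecutionCallback;
import com.hazelcast.core.IExecutorService;

final IExecutorService executor = hazelcastInstance.getExecutorService("default");
executor.submit(new TimeConsumingTask(), new ExecutionCallback<String>() {
    @Override
    public void onResponse(String result) {
        // the task finished normally on some member
    }

    @Override
    public void onFailure(Throwable t) {
        // the task threw, or (assumption) the member running it went down: resubmit it
        executor.submit(new TimeConsumingTask(), this);
    }
});

For submitToAllMembers there is a corresponding MultiExecutionCallback that reports per-member results.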
You might also want to look at some other approaches that exist outside of the core Hazelcast API. Hazeltask is a project on GitHub that promises failover handling and task resubmission - so that might be worth a look?
My application takes a lot of measurements of its internal processes. For example, I time certain methods and external webservice calls, and I also have variables whose values change and processes which have a 'state' (e.g. PAUSED, WAITING, etc.).
The application uses 100 to 200 threads, and each bit of data would be associated with a particular thread.
I am looking for some software that I can channel all this information into that would produce useful metrics and graphs of the data (ideally in real time or close to real time), let me set thresholds to trigger warnings, would allow me to filter the data by thread or thread group, etc etc.
The application is performing time critical tasks so the software/api would need to be very fast and never block.
The application is written in java, and ideally the software/api would be in java as well. I think what I'm looking for is called Event Stream Processing, but I'm really not sure what language to use to describe it.
All I've found so far are Esper and ERMA. Can anyone give me a recommendation? I'm the only one working on this project so I'm hoping for something that is pretty easy to set up and use, and has a workable front end.
In the end I found Graphite, which was pretty close to being exactly what I wanted. It is not the simplest thing to set up and configure, but I got it working in the end.
http://graphite.wikidot.com/
In my case I send data directly from my application to StatsD (via UDP), which collects the data and does some preprocessing before it ends up in the Whisper back end. There is a simple example of a Java interface here: https://github.com/etsy/statsd/commit/2253223f3c19d2149d65ec5bc802198ff93da4cb
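The wire format is simple enough to send by hand if you prefer not to pull in a client library; here is a minimal Java sketch (the metric name and host are made up, and in practice you would reuse a single socket rather than opening one per metric):

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public final class StatsdClient {
    // StatsD's plain-text format is "name:value|type"; 8125 is the usual StatsD port.
    public static void sendTiming(String metric, long millis) throws IOException {
        byte[] payload = (metric + ":" + millis + "|ms").getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("statsd.example.com"), 8125));
        }
    }
}

Since it is fire-and-forget UDP, the call does not block the measured code path.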
Alternatively, you could send your data directly to Graphite; there is an example here: http://neopatel.blogspot.co.uk/2011/04/logging-to-graphite-monitoring-tool.html
Given the following facts, is there an existing open-source Java API (possibly as part of some greater product) that implements an algorithm enabling the reproducible ordering of events in a cluster environment?
1) There are N sources of events, each with a unique ID.
2) Each event produced has an ID/timestamp, which, together with its source ID, makes it uniquely identifiable.
3) The IDs can be used to sort the events.
4) There are M application servers receiving those events. M is normally 3.
5) The events can arrive at any one or more of the application servers, in no specific order.
6) The events are processed in batches.
7) The servers have to agree, for each batch, on the list of events to process.
8) The events each have an earliest and a latest batch ID in which they must be processed.
9) They must not be processed earlier, and are "failed" if they cannot be processed before the deadline.
10) The batches are based on the real clock time. For example, one batch per second.
11) The events of a batch are processed when 2 of the 3 servers agree on the list of events to process for that batch (quorum).
12) The "third" server then has to wait until it possesses all the required events before it can process that batch too.
13) Once an event was processed or failed, the source has to be informed.
14) [EDIT] Events from one source must be processed (or failed) in the order of their ID/timestamp, but there is no causality between different sources.
Less formally, I have those servers that receive events. They start with the same initial state and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so I have some time to get the servers to agree before the deadline. But I'm not sure if that actually makes any real difference to the algorithms. And if all servers agree on all batches, then they will always be in sync, therefore presenting a consistent view when queried.
While I would be most happy with a Java API, I would accept something else if I can call it from Java. And if there is no open-source API, but a clear algorithm, I would also take that as an answer and try to implement it myself.
Looking at the question and your follow-up, there probably "wasn't" an API to satisfy your requirements. Today you could take a look at Kafka (from LinkedIn):
Apache Kafka
And the general concept of a "log" entity, in what folks like to call 'big data':
The Log: What every software engineer should know about real-time data's unifying abstraction
Actually, for your question, I'd begin with the blog post about "the log". In my terms, the way it works (and Kafka isn't the only package out there doing log handling) is as follows:
Instead of queue-based message-passing / publish-subscribe, Kafka uses a "log" of messages.
Subscribers (or end-points) can consume the log.
The log is guaranteed to be "in order"; it handles giga-data and is fast.
Double-check the guarantee; there's usually a trade-off for reliability.
You just read the log; reads are not destructive, as entries are removed by the retention policy rather than by being read.
If there's a subscriber group, everyone can 'read' the entry before it expires.
The basic handling (compute) process for the log is a map/reduce/filter model: read everything really fast, keep only the stuff you want, process it (reduce), and produce the outcome(s).
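For orientation, here is a minimal consumer sketch using the modern Kafka Java client (the topic name, group id and broker address are made up):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "event-processors");   // consumers in one group share the topic's partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("events"));
    while (true) {
        // read -> filter -> process, as in the map/reduce/filter model above
        for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
            System.out.println(record.offset() + ": " + record.value());
        }
    }
}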
The downside seems to be that you need clusters and stuff to make it really shine. Since different servers or sites were mentioned, I think we are still on track. I found it a bit finicky to get up and running with the Apache downloads because they tend to assume non-Windows environments (ho hum).
The other 'fast' option would be
Apache Apollo
Which would need you to do the plumbing for connecting different servers. Since the requirements include ...
servers that receive events. They start with the same initial state, and should keep in sync by agreeing on which event to process in which order. Luckily for me, the events are not to be processed ASAP, but "in a bit", so that I have some time to get the servers to agree before the deadline
I suggest looking at a "Getting Started" example or tutorial with Kafka and then looking at similar ZooKeeper-organised message/log software. Good luck and enjoy!
So far I haven't got a clear answer, but I think it would be useful for anyone interested to see what I found.
Here are some theoretical discussions related to the subject:
Dynamic Vector Clocks for Consistent Ordering of Events
Conflict-free Replicated Data Types
One way of making multiple concurrent processes wait for each other, which I could use to synchronize the "batches", is a distributed barrier. One Java implementation seems to be available on top of Hazelcast and another uses ZooKeeper.
One simpler alternative I found is to use a DB. Every process inserts all the events it receives into the DB. Depending on the DB design, this can be fully concurrent and lock-free, as in VoltDB, for example. Then, at a regular interval of one second, some "cron job" runs that selects and marks the events to be processed in the next batch. The job can run on every server. The first to run the job for a given batch fixes the set of events, so the others just use the list that the first one defined. That way we have a guarantee that all batches contain the same set of events on all servers. And if we use a complete order over the whole batch, which the cron job could specify itself, then the state of the servers will be kept in sync.
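To illustrate the marking step, here is a rough JDBC sketch (the events table, dataSource, nextBatchId and batchCutoff are all assumptions; a real implementation would still need the quorum and deadline handling described above):

import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical schema: events(id, source_id, event_ts, batch_id), with batch_id NULL until assigned.
// Every server can run this once per second; later runs for the same batch find nothing left to claim,
// because the UPDATE only touches rows whose batch_id is still NULL, and then simply read the batch back.
try (Connection conn = dataSource.getConnection();
     PreparedStatement mark = conn.prepareStatement(
             "UPDATE events SET batch_id = ? WHERE batch_id IS NULL AND event_ts <= ?")) {
    mark.setLong(1, nextBatchId);
    mark.setTimestamp(2, batchCutoff);
    mark.executeUpdate();
}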