Apache Spark: what's the designed behavior if master fails - java

We are running our calculations in a standalone Spark cluster, version 1.0.2 (the previous major release). We do not have any HA or recovery logic configured.
A piece of functionality on the driver side consumes incoming JMS messages and submits the respective jobs to Spark.
When we bring the single (and only) Spark master down (for tests), it seems the driver program is unable to properly figure out that the cluster is no longer usable. This results in two major problems:
The driver tries to reconnect endlessly to the master, or at least we couldn't wait until it gives up.
Because of the previous point, submission of new jobs blocks (in org.apache.spark.scheduler.JobWaiter#awaitResult). I presume this is because the cluster is not reported unreachable/down and the submission logic simply waits until the cluster comes back. For us this means that we run out of JMS listener threads very fast, since they all get blocked.
There are a couple of akka failure detection-related properties that you can configure on Spark, but:
The official documentation strongly recommends against enabling Akka's built-in failure detection.
I would really want to understand how this is supposed to work by default.
So, can anyone please explain what the designed behavior is when the single Spark master in a standalone deployment fails/stops/shuts down? I wasn't able to find any proper documentation about this.

By default, Spark can handle Worker failures but not a Master failure. If the Master crashes, no new applications can be submitted. Therefore, Spark provides two high-availability schemes, documented here: https://spark.apache.org/docs/1.4.0/spark-standalone.html#high-availability
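For reference, the ZooKeeper-based scheme from that page is enabled on each Master candidate through spark-env.sh; a minimal sketch, assuming a three-node ZooKeeper ensemble (host names are placeholders):

    # spark-env.sh on every machine that may run a Master (hosts are placeholders)
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

Applications and Workers are then pointed at the whole list of Masters, e.g. spark://host1:7077,host2:7077, so a standby Master can take over when the active one dies.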
Hope this helps,
Le Quoc Do

Related

Keeping all instances of an in-memory graph DB in sync

We are building a Java application which will use embedded Neo4j for graph traversal. Below are the reasons why we want to use the embedded version instead of a centralized server:
This app is not a data owner. Data will be ingested into it through another app. Keeping data locally will help us do quick calculations and hence improve our API SLA.
Since the data footprint is small, we don't want to maintain a centralized server, which would incur additional cost and maintenance.
No need for an additional cache.
Now this architecture brings two challenges. First, how to update data in all instances of the embedded Neo4j application at the same time. Second, how to make sure that all instances are in sync, i.e. using the same version of the data.
We thought of using Kafka to solve the first problem. The idea is to have a Kafka listener with a different group ID (to ensure all get updates) in each instance. Whenever there is an update, an event will be posted to Kafka. All instances will listen for the event and perform the update operation.
However, we still don't have any solid design to solve the second problem. For various reasons one of the instances can miss an event (e.g. its consumer is down). One way is to keep checking the latest version by calling an API of the data-owner app; if the version is behind, replay the events. But this brings the additional complexity of maintaining the event logs of all updates. Do you think it can be done in a better and simpler way?
Kafka consumers are extremely consistent and reliable once you have them configured properly, so there shouldn't be any reason for them to miss messages, unless there's an infrastructure problem, in which case any solution you architect will have problems. If the Kafka cluster is healthy (e.g. at least one of the copies of the data is available, and at least a quorum of ZooKeeper nodes is up and running), then your consumers should receive every single message from the topics they're subscribed to. The consumer will handle the retries/reconnecting itself, as long as your timeout/retry configurations are sane. The default configs in the latest Kafka versions are adequate 99% of the time.
Separately, you can add a watchdog thread, for example, that constantly checks what the latest offset is per topic/partition, compares it to what the consumer has last received, and maybe issues an alert/warning if there is a discrepancy. In my experience, and with Kafka's reliability, it should be unnecessary, but it can give you peace of mind, and shouldn't be too difficult to add.
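As a sketch of that watchdog idea with the plain Java client (bootstrap servers, group ID, and topic are placeholders; KafkaConsumer is not thread-safe, so the monitor uses its own instance rather than your main consumer):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    public class LagCheck {
        // Sums how far the group's committed offsets trail the log-end offsets.
        static long totalLag(String bootstrapServers, String groupId, String topic) {
            Properties props = new Properties();
            props.put("bootstrap.servers", bootstrapServers);
            props.put("group.id", groupId);
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> monitor = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = new ArrayList<>();
                for (PartitionInfo p : monitor.partitionsFor(topic)) {
                    partitions.add(new TopicPartition(topic, p.partition()));
                }
                Map<TopicPartition, Long> endOffsets = monitor.endOffsets(partitions);
                long lag = 0;
                for (TopicPartition tp : partitions) {
                    OffsetAndMetadata committed = monitor.committed(tp);
                    long consumed = (committed == null) ? 0 : committed.offset();
                    lag += Math.max(0, endOffsets.get(tp) - consumed);
                }
                return lag; // alert if this stays above a threshold for long
            }
        }
    }

A lag that keeps growing for one instance is exactly the "this consumer stopped receiving updates" signal you are after, without having to maintain a full event log yourself.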

Understanding scale-out in Hazelcast

I have a problem understanding the Distributed Executor Service. I am trying to run the example mentioned here:
https://github.com/hazelcast/hazelcast-code-samples/tree/master/distributed-executor/scale-out
What I am assuming about scale-out is that when I run the master and slave on different machines, the execution should happen on both machines, i.e. the load should be balanced across both. But I am not able to see anything happening on the slave console; the master console is executing all 1000 EchoTasks. Is my understanding of the Distributed Executor Service wrong? Can someone help me understand this?
Your understanding is correct. When you start up the nodes, are they able to connect to each other? Do you actually see a cluster being formed, or do both of them work independently?
The latter case can happen if your network does not support multicast and the nodes are not able to discover each other automatically.
Can you share some logging so we can see whether the nodes form the cluster as expected?
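If multicast does turn out to be blocked between the machines, an explicit TCP/IP member list usually gets the cluster to form; a minimal sketch with placeholder addresses (the same join config goes on both machines):

    import com.hazelcast.config.Config;
    import com.hazelcast.config.JoinConfig;
    import com.hazelcast.core.Hazelcast;
    import com.hazelcast.core.HazelcastInstance;

    public class ClusterStart {
        public static void main(String[] args) {
            Config config = new Config();
            JoinConfig join = config.getNetworkConfig().getJoin();
            join.getMulticastConfig().setEnabled(false);   // turn off multicast discovery
            join.getTcpIpConfig().setEnabled(true)         // and list the members explicitly
                .addMember("192.168.1.10")                 // placeholder: "master" machine
                .addMember("192.168.1.11");                // placeholder: "slave" machine
            HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
            System.out.println("Cluster size: " + hz.getCluster().getMembers().size());
        }
    }

Once both consoles report a member list of size 2, the distributed executor has two members to spread the EchoTasks over.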

Scheduled job in a multi node environment

I am working on a scheduled job that will run at a certain interval (e.g. once a day at 1 pm), scheduled through cron. I am working with Java and Spring.
Writing the scheduled job is easy enough - it grabs a list of people matching certain criteria from the DB, and for each person does some calculation and triggers a message.
I am working in a single-node environment locally and in testing; however, when we go to production, it will be a multi-node environment (with a load balancer, etc.). My concern is how a multi-node environment would affect the scheduled job.
My guess is I could (or very likely would) end up triggering duplicate messages:
Machine 1: Grab list of people, do calculation
Machine 2: Grab list of people, do calculation
Machine 1: Trigger message
Machine 2: Trigger message
Is my guess correct?
What would be the recommended solution to avoid the above issue? Do I need to create a master/slave distributed system to manage the multi-node environment?
If you have something like three Tomcat instances load balanced behind Apache, for example, and your application runs on each of them, then you will have three different triggers and your job will run three times. You won't have a multi-node environment with distributed job execution unless some mechanism for distributing the parts of the job is in place.
If you haven't looked at this project yet, take a peek at Spring XD. It handles Spring Batch Jobs and can be run in distributed mode.

How to avoid simultaneous quartz job execution when application has two instances

I've got a Spring Web application that's running on two different instances.
The two instances aren't aware of each other, they run on distinct servers.
That application has a scheduled Quartz job, but my problem is that the job shouldn't execute simultaneously on the instances: as it's a mail-sending job, that could cause duplicate emails being sent.
I'm using RAMJobStore, and JDBCJobStore is not an option for me due to the large number of tables it requires (I can't create many tables because of an internal restriction).
The solutions I thought about:
- creating a single control table that has to be checked every time a job starts (with repeatable-read isolation level to avoid concurrency issues). The problem is that if the server is killed, the table might be left in an invalid state.
- using properties to designate a single server as the job-running server. The problem is that if that server goes down, jobs will stop running.
Has anyone ever experienced this problem and do you have any thoughts to share?
Start with the second solution (deactivate Quartz on all nodes except one). It is very simple to do and it is safe. Measure how frequently your server actually goes down. If that is unacceptable, then try the first solution. The problem with the first solution is that you need good skill in multithreaded programming to implement it without bugs. It is not so simple if multithreading is not your everyday task, and the cost of a bug in your implementation may be bigger than the actual benefit.
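If you do go for the first solution later, the usual way around the "server killed, table left locked" problem is a lease with an expiry instead of a plain flag. A rough sketch against a hypothetical job_lock table (job_name, locked_until), with one row pre-inserted per job, not production code:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class JobLock {
        // Tries to claim the named job for leaseMillis. The UPDATE matches on at most
        // one node, so only that node runs the job; if that node dies mid-run, the
        // lease simply expires and another node can claim the next execution.
        static boolean tryAcquire(Connection con, String jobName, long leaseMillis) throws SQLException {
            long now = System.currentTimeMillis();
            String sql = "UPDATE job_lock SET locked_until = ? WHERE job_name = ? AND locked_until < ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setTimestamp(1, new Timestamp(now + leaseMillis));
                ps.setString(2, jobName);
                ps.setTimestamp(3, new Timestamp(now));
                return ps.executeUpdate() == 1;  // exactly one winner per lease window
            }
        }
    }

The Quartz job on every node calls tryAcquire first and only sends mail when it returns true; the single row update is the arbitration point, so most of the multithreading risk stays inside the database.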

Why do we need ZooKeeper in the Hadoop stack?

I am new to Hadoop/ZooKeeper. I cannot understand the purpose of using ZooKeeper with Hadoop. Is ZooKeeper writing data in Hadoop? If not, then why do we use ZooKeeper with Hadoop?
Hadoop 1.x does not use Zookeeper. HBase does use zookeeper even in Hadoop 1.x installations.
Hadoop adopted Zookeeper as well starting with version 2.0.
The purpose of Zookeeper is cluster management. This fits with the general philosophy of *nix of using smaller specialized components - so components of Hadoop that want clustering capabilities rely on Zookeeper for that rather than develop their own.
Zookeeper is a distributed storage that provides the following guarantees (copied from Zookeeper overview page):
Sequential Consistency - Updates from a client will be applied in the order that they were sent.
Atomicity - Updates either succeed or fail. No partial results.
Single System Image - A client will see the same view of the service regardless of the server that it connects to.
Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness - The clients view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these to implement different "recipes" that are required for cluster management like locks, leader election etc.
If you're going to use ZooKeeper yourself, I recommend you take a look at Curator from Netflix, which makes it easier to use (e.g. they implement a few recipes out of the box).
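To give a feel for those recipes, here is a hedged sketch of Curator's distributed lock recipe (the connection string and lock path are placeholders):

    import java.util.concurrent.TimeUnit;
    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ZkLockExample {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",            // placeholder ZooKeeper ensemble
                    new ExponentialBackoffRetry(1000, 3));    // retry policy: 1s base, 3 retries
            client.start();

            InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-resource");
            if (lock.acquire(10, TimeUnit.SECONDS)) {         // wait up to 10s for the lock
                try {
                    // only one process across the cluster runs this section at a time
                } finally {
                    lock.release();
                }
            }
            client.close();
        }
    }

Leader election, shared counters, and the other recipes follow the same pattern: Curator hides the ZooKeeper znode bookkeeping behind a small API.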
ZooKeeper solves the problem of reliable distributed coordination, and Hadoop is a distributed system, right?
There's an excellent paper on the Paxos algorithm that you can read on this subject.
From the ZooKeeper documentation page:
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.
From the Hadoop documentation page:
The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
Regarding your query:
Why do we need ZooKeeper in Hadoop Stack?
The binding factor is distributed processing and high availability.
e.g. the Hadoop NameNode failover process.
Hadoop high availability is designed around an active NameNode and a standby NameNode for the failover process. You should never have two masters (active NameNodes) at the same time.
From Apache documentation link on HDFSHighAvailabilityWithQJM:
It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called “split-brain scenario,” the JournalNodes will only ever allow a single NameNode to be a writer at a time.
During a failover, the NameNode which is to become active will simply take over the role of writing to the JournalNodes, which will effectively prevent the other NameNode from continuing in the Active state, allowing the new Active to safely proceed with failover.
ZooKeeper is used to avoid the split-brain scenario. You can find the role of ZooKeeper in the question below:
How does Hadoop Namenode failover process works?
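For completeness, enabling that automatic failover is mostly configuration that points the ZKFailoverController at a ZooKeeper ensemble, roughly along these lines (host names are placeholders):

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>

    <!-- core-site.xml -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>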
