In part 1 of this series on Apache Mesos, I provided a high level overview of the technology and in part 2, I went into a bit more of a deep dive on the Mesos architecture. I ended the last post stating I would do a follow-up post on how resource allocation is handled in Mesos. However, I received some feedback from readers and decided I would first do this post on persistent storage and on fault tolerance before moving on to talk about resource allocation.
Persistent Storage Question
As my previous posts discussed and as Mesos co-creator Ben Hindman’s diagram above indicates, a key benefit of using Mesos is the ability to run multiple types of applications (scheduled and initiated via frameworks as tasks) on the same set of compute nodes. These tasks are abstracted from the actual nodes using isolation modules (currently some type of container technology) so that they can be moved to and restarted on different nodes as needed.
This brings up the question of how Mesos handles persistent storage. If I am running a database job, how does Mesos ensure that when the task is scheduled, the assigned nodes have access to the data the task needs? Hindman’s diagram shows the Hadoop Distributed File System (HDFS) as the persistence layer for Mesos in his particular example; HDFS is frequently used in that way but is also often used by Mesos executors to pass configuration data to slaves assigned to given tasks. Mesos can actually leverage multiple types of file systems for persistent storage, with HDFS being one of the most frequently used given Mesos’ affinity to high performance computing. There are actually multiple options for dealing with the persistent storage question in Mesos:
- Distributed File System – As mentioned above, Mesos can use a DFS such as HDFS or Lustre to guarantee that data can be accessed by any node in the Mesos cluster. The tradeoff is network latency, which might make a networked file system unsuitable for a given application.
- Local File System With Data Store Replication – Another approach is to leverage application-level replication to ensure data is available on multiple nodes. Applications that provide data store replication include NoSQL databases such as Cassandra and MongoDB. The advantage is you no longer have to deal with network latency issues. The tradeoff is you would have to configure Mesos to run specific tasks only on nodes that have the replicated data, since you would likely not want to replicate the same data across all the nodes in your data center. You can accomplish this by statically reserving specific nodes with replicated data stores for use by a framework.
- Local File System Without Replication – You could also store persistent data on a specific node’s file system and reserve that node for a specific application. As with the previous option, you could create a static reservation, but this time for a single node instead of a set of nodes. The last two options are obviously less desirable since you are effectively creating static partitions. However, they may be required for specific use cases where latency is an issue or the application cannot replicate its data stores.
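To make the second and third options concrete, here is a minimal, hypothetical sketch of the framework-side logic: the scheduler only accepts resource offers from nodes known to hold a replica of its data. The offer structures and node names here are simplified stand-ins, not the actual Mesos scheduler API.

```python
# Hypothetical sketch: accept offers only from nodes that hold a data replica.
# Offer objects are simplified dicts, not real Mesos protobuf messages.

DATA_NODES = {"node-01", "node-02", "node-03"}  # nodes with replicated data (assumed)

def usable_offers(offers, data_nodes=DATA_NODES):
    """Return only the offers whose hostname hosts a data replica."""
    return [offer for offer in offers if offer["hostname"] in data_nodes]

offers = [
    {"id": "o1", "hostname": "node-01", "cpus": 4},
    {"id": "o2", "hostname": "node-09", "cpus": 8},  # no replica here; decline
]
accepted = usable_offers(offers)
```

In a real framework, this filtering would happen in the scheduler's offer callback, with declined offers returned to the master for other frameworks to use.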
The Mesos project is constantly growing and new features are being regularly added. I’ve seen two in particular that will help with the persistent storage question when they are released:
- Dynamic Reservations – This feature will allow a framework to reserve a specific resource, such as a persistent store, so that resource will only be offered back to that same framework when another task needs to be initiated. This can be used in conjunction with configuring a framework to use only a node or set of nodes that have access to the persistent data store. You can read more about this proposed feature here.
- Persistent Volumes – This feature will enable a volume to be created, as part of a task being launched on a slave node, that will persist even after the task is completed. Subsequent tasks that require access to that same data will be offered back to the same framework to be initiated on nodes that have access to that persistent volume. You can read more about this proposed feature here.
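To give a feel for how a framework might use these two proposed features, here is a rough sketch that models the "reserve" and "create persistent volume" operations as plain Python dicts. The field names are my own illustrative approximations of the proposed messages, not the actual protobuf definitions.

```python
# Illustrative only: approximate shapes of the proposed dynamic reservation
# and persistent volume operations, modeled as plain dicts.

def reserve_operation(role, cpus, mem):
    """Dynamically reserve cpus/mem so they are re-offered to the same role."""
    return {
        "type": "RESERVE",
        "resources": [
            {"name": "cpus", "scalar": cpus, "role": role},
            {"name": "mem", "scalar": mem, "role": role},
        ],
    }

def create_volume_operation(role, disk_mb, persistence_id, container_path):
    """Create a volume that outlives the task that created it."""
    return {
        "type": "CREATE",
        "volumes": [{
            "name": "disk",
            "scalar": disk_mb,
            "role": role,
            "persistence_id": persistence_id,   # illustrative field name
            "container_path": container_path,
        }],
    }

op = create_volume_operation("db-role", 10240, "db-data", "/var/lib/db")
```

The key idea in both cases is the same: the reserved resource or created volume is offered back to the originating framework, so follow-on tasks land where the data lives.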
Fault Tolerance
Let’s talk about how Mesos provides fault tolerance at multiple layers of its stack. IMHO, one of the strengths of Mesos is the way fault tolerance has been designed into the architecture in a manner that preserves its scalability as a distributed system.
- Master – To make the master fault tolerant, a mechanism for handling failures and a specific architectural design are implemented.
First, the decision was made to implement master nodes using a hot-standby design. As the diagram above from Tomas Barton illustrates, a single active master runs in a cluster along with some number of standby masters, all coordinated by open source software called ZooKeeper. ZooKeeper monitors all the nodes in the master cluster and manages the election of a new leader when the active master fails. Effectively, you need a minimum of three master nodes for production, with a total of at least five nodes recommended.
The decision to design the master to have soft state means that when the master node fails, its state can be quickly reconstructed on the newly elected master. Mesos state information actually resides with the framework schedulers and the slave nodes. When a new master is elected, ZooKeeper informs the frameworks and the slave nodes of the election so they can register themselves with the new master. At that point the new master is able to reconstruct its internal state from the messages passed to it by the frameworks and slave nodes.
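The soft-state design above can be illustrated with a toy simulation: a newly elected master starts empty and rebuilds its view of the cluster purely from re-registration messages. This is my own simplified model of the flow, not Mesos code.

```python
# Toy simulation of soft-state reconstruction after a master failover.
# A new master begins with no state and rebuilds it from re-registrations.

class Master:
    def __init__(self):
        self.slaves = {}       # slave_id -> list of running task ids
        self.frameworks = set()

    def reregister_slave(self, slave_id, running_tasks):
        # Each slave reports the tasks it is still running.
        self.slaves[slave_id] = list(running_tasks)

    def reregister_framework(self, framework_id):
        # Each framework scheduler announces itself to the new leader.
        self.frameworks.add(framework_id)

# A new leader is elected after a failover...
new_master = Master()
# ...and each surviving component reports its own state.
new_master.reregister_framework("marathon")
new_master.reregister_slave("slave-1", ["task-a", "task-b"])
new_master.reregister_slave("slave-2", ["task-c"])
```

The point of the design is that no durable master-side database is needed; the authoritative state already lives at the edges of the system.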
- Framework Schedulers – Fault tolerance for a given framework scheduler is achieved by having a framework register two or more schedulers with the master. In the event that a scheduler fails, the master will notify another scheduler to take over. Note that frameworks are responsible for implementing their own mechanism for sharing state between schedulers.
- Slaves – Mesos implements a slave recovery feature that allows executors/tasks to continue running even when the slave process on a node fails, and allows a restarted slave process to reconnect with the executors/tasks still running on that node. While a task is executing, the slave checkpoints metadata about the task to local disk. If the slave process fails, the tasks continue running, and when the slave process is restarted, it uses the checkpointed data to recover state and to reconnect with the executors/tasks.
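The checkpointing idea can be sketched in a few lines: task metadata is written to local disk as tasks launch, so a restarted slave process can read it back on startup. This is a minimal model of the mechanism, assuming a simple JSON checkpoint file rather than Mesos' actual on-disk format.

```python
# Toy model of slave checkpointing: task metadata goes to local disk so a
# restarted slave process can recover its state and find its tasks again.
import json
import os
import tempfile

checkpoint_dir = tempfile.mkdtemp()
checkpoint_path = os.path.join(checkpoint_dir, "tasks.json")

def checkpoint(tasks):
    """Persist task metadata before/while tasks run."""
    with open(checkpoint_path, "w") as f:
        json.dump(tasks, f)

def recover():
    """What a restarted slave process would do on startup."""
    with open(checkpoint_path) as f:
        return json.load(f)

checkpoint({"task-42": {"executor": "db-executor", "state": "RUNNING"}})
# ...slave process crashes and is restarted; the task kept running...
recovered = recover()
```

Because the executors keep running independently of the slave process, the recovered metadata is enough for the new slave process to reattach to them rather than kill and relaunch the work.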
The situation is obviously different when the compute node that the slave is running on and where tasks are being executed fails. Here, the master is responsible for monitoring the status of all its slave nodes.
When a compute/slave node fails to respond to multiple consecutive messages, the master will remove the node from its list of available resources and will attempt to shut down the node.
The master will report the executor/task failure to the framework scheduler that assigned the task and allow the scheduler to handle the failure based on the policies configured for that scheduler. Typically, a framework will restart the task on a new slave node, assuming it receives and accepts an appropriate resource offer from the master.
- Executor/Task – Similar to a compute/slave node failure, the master will report an executor or task failure to the framework scheduler that assigned the task and allow the scheduler to handle the failure based on its configured policies. Typically, the framework will restart the task on a new slave node, assuming it receives and accepts an appropriate resource offer from the master.
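The failure-handling loop described in the last two items can be sketched as scheduler-side pseudologic: a terminal status update queues the task for relaunch, and the next acceptable offer is used to launch it. Names, states, and offer shapes here are illustrative, not the real scheduler API.

```python
# Hypothetical sketch of scheduler-side task failure handling.
pending_relaunch = []

def on_status_update(task_id, state):
    """Queue tasks that terminated abnormally for relaunch."""
    if state in ("TASK_LOST", "TASK_FAILED"):
        pending_relaunch.append(task_id)

def on_resource_offer(offer):
    """Relaunch one pending task if the offer is acceptable."""
    if pending_relaunch and offer["cpus"] >= 1:
        task_id = pending_relaunch.pop(0)
        return {"launch": task_id, "on": offer["hostname"]}
    return None  # decline the offer; the master re-offers it elsewhere

# The master reports a failure, then sends a fresh offer from another node.
on_status_update("task-7", "TASK_LOST")
decision = on_resource_offer({"hostname": "node-04", "cpus": 4})
```

Whether and how aggressively to relaunch is a framework policy decision, which is why Mesos delegates this handling to the scheduler rather than deciding itself.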
I’ll wrap it up here and in the next post, I’ll delve into the resource allocation module. I also encourage folks to read my post on what I see going right with the Mesos project. If you are interested in spinning up and trying out Mesos, I link to some resources in another blog post.
Meanwhile, I again encourage readers to provide feedback, especially regarding whether I am hitting the mark with these posts and whether you see any errors that need to be corrected. I am a learner and do not pretend to know all the answers, so correction and enlightenment are always welcome. I also respond on Twitter at @kenhuiny.