Playing Traffic Cop: Resource Allocation In Apache Mesos


A key capability that makes Apache Mesos a viable uber resource manager for the data center is its ability to play traffic cop amongst a diversity of workloads.  The purpose of this post is to dig into the internals of resource allocation in Mesos and how it balances fair resource sharing with customer workload needs.  Before going on, you may want to read the other posts in this series if you have not already done so.  An overview of Mesos can be found here, followed by an explanation of its two-level architecture here, and then a post here on data storage and fault tolerance.  I also write about what I see going right in the Mesos project.  If you are interested in spinning up and trying out Mesos, I link to some resources in another blog post.

We will be walking through the role of the Mesos resource allocation module, how it decides what resources to offer to which frameworks, and how it reclaims resources when necessary.  But let’s start by reviewing the task scheduling process in Mesos:

[Figure: Mesos framework example]

If you recall from my earlier architecture post, the Mesos master delegates the scheduling of tasks by collecting information about available resources from slave nodes and then offering those resources to registered frameworks in the form of resource offers.  Frameworks have the option of accepting resource offers or rejecting them if they do not meet task constraints.  Once a resource offer is accepted, a framework coordinates with the master to schedule and run tasks on appropriate slave nodes in the data center.

Responsibility for how resource offer decisions are made is delegated to the resource allocation module, which lives within the master.  The resource allocation module determines the order in which frameworks are offered resources and must do so while making sure that resources are fairly shared among inherently greedy frameworks.  In a homogeneous environment, such as a Hadoop cluster, one of the most popular fair share allocation algorithms in use is max-min fairness.  Max-min fairness maximizes the minimum allocation of resources offered to a user, with the goal of ensuring that each user receives a fair share of the resources needed to satisfy its requirements; for a simple example of how this might work, check out Example 1 on this max-min fair share algorithm page.  As mentioned, this generally works well in a homogeneous environment where demands fluctuate little across resource types, e.g. CPU, RAM, network bandwidth, and I/O.  Resource allocation becomes more difficult when you are scheduling resources across a data center with heterogeneous resource demands.  For example, what is a suitable fair share allocation policy if user A runs tasks that require (1 CPU, 4 GB RAM) each and user B runs tasks that require (3 CPUs, 1 GB RAM) each?  How do you fairly allocate a bundle of resources when user A's tasks are RAM heavy and user B's tasks are CPU heavy?
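
For the single-resource case, here is a minimal Python sketch of max-min fair allocation, assuming one resource type (say, CPUs) and a fixed demand per user.  The function name and inputs are my own illustration, not Mesos code.

```python
def max_min_fair_share(capacity, demands):
    """Allocate a single resource max-min fairly.

    capacity: total amount of the resource available.
    demands:  dict of user -> amount requested.
    Users demanding less than the equal share keep their full demand;
    the leftover is redistributed among the remaining users.
    """
    allocation = {}
    remaining_users = dict(demands)
    remaining_capacity = capacity
    while remaining_users:
        equal_share = remaining_capacity / len(remaining_users)
        # Fully satisfy every user whose demand fits within the equal share.
        satisfied = {u: d for u, d in remaining_users.items() if d <= equal_share}
        if not satisfied:
            # No remaining user can be fully satisfied; split what is left equally.
            for u in remaining_users:
                allocation[u] = equal_share
            return allocation
        for u, d in satisfied.items():
            allocation[u] = d
            remaining_capacity -= d
            del remaining_users[u]
    return allocation

# Example: 10 CPUs among three users demanding 2, 4, and 10 CPUs.
# Result: {'A': 2, 'B': 4, 'C': 4} -- C's allocation is maximized only after
# the smaller demands have been fully met.
print(max_min_fair_share(10, {"A": 2, "B": 4, "C": 10}))
```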

Since Mesos is specifically focused on managing resources in a heterogeneous environment, it implements a pluggable resource allocation module architecture that lets users create allocation policies with algorithms that best fit a particular deployment.  For example, a user could implement a weighted max-min fairness algorithm that gives a specific framework a greater share of resources relative to other frameworks.  By default, Mesos includes a strict priority resource allocation module and a modified fair sharing resource allocation module.  The strict priority module implements an algorithm that gives a designated framework priority so that it always receives and can accept resource offers sufficient to meet its task requirements.  It guarantees resources for critical workloads at the expense of restricting dynamic resource sharing in Mesos and potentially starving other frameworks.
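
To make the contrast between the two built-in policies concrete, here is a minimal, conceptual Python sketch of how each one might choose which framework receives the next resource offer.  The function names and data structures are my own illustrations, not the Mesos allocator API.

```python
def next_framework_strict_priority(frameworks, priorities):
    """Conceptual strict-priority ordering: always offer to the runnable
    framework with the highest configured priority, regardless of how many
    resources it already holds."""
    return max(frameworks, key=lambda f: priorities.get(f, 0))

def next_framework_fair_share(frameworks, dominant_shares):
    """Conceptual fair-sharing ordering: offer to the runnable framework
    that currently holds the smallest dominant share."""
    return min(frameworks, key=lambda f: dominant_shares.get(f, 0.0))

# With strict priority, "critical" always wins the next offer and can starve "batch";
# with fair sharing, "batch" gets the next offer because it currently holds less.
print(next_framework_strict_priority(["critical", "batch"], {"critical": 10, "batch": 1}))
print(next_framework_fair_share(["critical", "batch"], {"critical": 0.6, "batch": 0.1}))
```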

For those reasons, most users default to Dominant Resource Fairness (DRF), a modified fair share algorithm in Mesos that is better suited to heterogeneous environments.  DRF was proposed by the same Berkeley AMPLab team that created Mesos and is implemented as the default resource allocation policy in Mesos.  You can read the original papers on DRF here and here.  For this blog post, I'll attempt to summarize the main points and provide some examples that will hopefully make things clearer.  Time to get under the hood.

The goal of DRF is to ensure that each user, a framework in the case of Mesos, in a heterogeneous environment receives a fair share of the resource it needs most.  To grasp DRF, you need to understand the concepts of dominant resource and dominant share.  A framework's dominant resource is the resource type (CPU, RAM, etc.) for which the framework's tasks demand the largest fraction of the total resources available in a given resource offer.  For example, a framework running computation-heavy tasks will have CPU as its dominant resource, while a framework that relies on in-memory calculations may have RAM as its dominant resource.  As frameworks are allocated resources, DRF tracks the share that each framework holds of each resource type; the largest of those shares for a given framework is that framework's dominant share.  The DRF algorithm uses the dominant shares of all registered frameworks to ensure that each framework receives a fair share of its dominant resource.

Too abstract a concept?  Let's illustrate with an example.  Assume you have a resource offer that consists of 9 CPUs and 18 GB of RAM.  Framework 1 runs tasks that require (1 CPU, 4 GB RAM) and framework 2 runs tasks that require (3 CPUs, 1 GB RAM).  Each framework 1 task will consume 1/9 of the total CPU and 2/9 of the total RAM, making RAM framework 1's dominant resource.  Likewise, each framework 2 task will consume 1/3 of the total CPU and 1/18 of the total RAM, making CPU framework 2's dominant resource.  DRF will attempt to give each framework an equal amount of its dominant resource, as measured by its dominant share.  In this example, DRF will work with the frameworks to allocate the following: three tasks to framework 1 for a total allocation of (3 CPUs, 12 GB RAM) and two tasks to framework 2 for a total allocation of (6 CPUs, 2 GB RAM).  At that point each framework has the same dominant share (2/3, or 67%) of its dominant resource (RAM for framework 1 and CPU for framework 2), and there are not enough resources left in this particular offer to run additional tasks.  Note that if framework 1, for example, only had two tasks to run, then framework 2 and any other registered frameworks would have received all of the leftover resources.
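
The arithmetic above is easy to verify.  Here is a small Python snippet, using my own illustrative names rather than anything from Mesos, that computes the per-task shares and each framework's dominant resource for this offer:

```python
# Total resources in the example offer.
total = {"cpu": 9, "ram_gb": 18}

# Per-task demands for each framework.
tasks = {
    "framework1": {"cpu": 1, "ram_gb": 4},
    "framework2": {"cpu": 3, "ram_gb": 1},
}

for name, demand in tasks.items():
    # Fraction of the total offer each resource type consumes per task.
    shares = {r: demand[r] / total[r] for r in total}
    dominant = max(shares, key=shares.get)
    print(name, shares, "dominant resource:", dominant)

# framework1: cpu 1/9 ≈ 0.11, ram 4/18 ≈ 0.22 -> RAM is dominant
# framework2: cpu 3/9 ≈ 0.33, ram 1/18 ≈ 0.06 -> CPU is dominant
```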

[Figure: DRF allocation for the example resource offer]

So how does DRF work to produce the outcome above?  As stated earlier, the DRF allocation module tracks the resources allocated to each framework and each framework's dominant share.  At each step, DRF presents resource offers to the framework with the lowest dominant share among all frameworks with tasks to run.  The framework will accept the offer if there are enough available resources to run its task.  Using the example taken from the DRF papers cited earlier and modified for this blog post, I will walk through each step taken by the DRF algorithm.  To keep things simple, the example does not factor in resources that are released back to the pool after a short-running task completes, and I assume that each framework has an unbounded number of tasks to run and that every resource offer is accepted.

If you recall from above, we assume you have a resource offer that consists of 9 CPUs and 18 GB of RAM.  Framework 1 runs tasks that require (1 CPU, 4 GB RAM) and framework 2 runs tasks that require (3 CPUs, 1 GB RAM).  Each framework 1 task will consume 1/9 of the total CPU and 2/9 of the total RAM, making RAM framework 1’s dominant resource.  Again, each framework 2 task will consume 1/3 of the total CPU and 1/18 of the total RAM, making CPU framework 2’s dominant resource.

[Table: step-by-step DRF allocation for the example offer]

Each row provides the following information:

  • Framework chosen – The framework that has been given the latest resource offer
  • Resource Shares – Total resources accepted at a given time by a given framework by CPU and by RAM, expressed as fraction of total resources
  • Dominant Share – The share of a framework’s dominant resource accepted at a given time, expressed as a fraction of the total resources
  • Dominant Share % – The same dominant share, expressed as a percentage of the total resources
  • CPU Total Allocation – Total CPU resources accepted by all frameworks at a given time
  • RAM Total Allocation – Total RAM resources accepted by all frameworks at a given time

Note also that the lowest dominant share in each row can be found in bold.

Although initially both frameworks have a dominant share of 0%, we’ll assume that DRF chooses framework 2 to offer resources to first, though we could have assumed framework 1 and the final outcome would still be the same.  A minimal code sketch of the allocation loop follows the list below.

  1. Framework 2 receives its shares to run a task, causing the dominant share for its dominant resource (CPU) to increase to 33%.
  2. Since framework 1’s dominant share remained at 0%, it receives shares next to run a task and the dominant share for its dominant resource (RAM) is increased to 22%.
  3. Since framework 1 continues to have the lower dominant share, it receives the next shares as well to run a task, increasing its dominant share to 44%.
  4. DRF then offers resources to framework 2 since it now has the lower dominant share.
  5. The process continues until it is not possible to run any more new tasks due to lack of available resources.  In this case, CPU resources have been saturated.
  6. The process will then repeat with a new set of resource offers.
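
Putting the walkthrough together, here is a minimal Python sketch of the allocation loop described above.  It is a simplification of the algorithm from the DRF papers, not the actual Mesos allocator code, and it keeps the same assumptions as the example: an unbounded queue of identical tasks per framework, every offer accepted, and no resources released.

```python
def drf_allocate(total, demands):
    """Simplified DRF loop over a single pool of resources.

    total:   dict of resource -> total amount available, e.g. {"cpu": 9, "ram": 18}
    demands: dict of framework -> per-task demand, e.g. {"f1": {"cpu": 1, "ram": 4}}
    """
    used = {r: 0 for r in total}
    allocation = {f: {r: 0 for r in total} for f in demands}
    dominant_share = {f: 0.0 for f in demands}

    while True:
        # Frameworks whose next task still fits in the remaining resources.
        runnable = [
            f for f, d in demands.items()
            if all(used[r] + d[r] <= total[r] for r in total)
        ]
        if not runnable:
            break  # no framework can launch another task; the offer is saturated
        # Offer to the framework with the lowest dominant share.
        chosen = min(runnable, key=lambda f: dominant_share[f])
        for r in total:
            used[r] += demands[chosen][r]
            allocation[chosen][r] += demands[chosen][r]
        dominant_share[chosen] = max(allocation[chosen][r] / total[r] for r in total)
    return allocation, dominant_share

# Reproduces the example: framework 1 ends with (3 CPUs, 12 GB), framework 2 with
# (6 CPUs, 2 GB), and both finish with a dominant share of 2/3.
alloc, shares = drf_allocate(
    {"cpu": 9, "ram": 18},
    {"framework1": {"cpu": 1, "ram": 4}, "framework2": {"cpu": 3, "ram": 1}},
)
print(alloc)
print(shares)
```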

Note that it is possible to create a resource allocation module that uses weighted DRF to favor one framework or set of frameworks over another.  And as mentioned earlier, it is possible to create a number of other custom modules to provide organization-specific allocation policies.
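
As a rough illustration of the weighted case, again a sketch under my own naming rather than Mesos code, the only change to the loop above is to compare dominant shares scaled by each framework's weight:

```python
def next_framework_weighted(runnable, dominant_share, weight):
    """Pick the framework with the lowest weight-scaled dominant share.
    A framework with weight 2 can hold roughly twice as much of its
    dominant resource before losing the next offer."""
    return min(runnable, key=lambda f: dominant_share[f] / weight.get(f, 1.0))

# framework1 holds more (0.4 vs 0.3) but has twice the weight, so it still
# receives the next offer: 0.4 / 2 = 0.2 < 0.3 / 1 = 0.3.
print(next_framework_weighted(
    ["framework1", "framework2"],
    {"framework1": 0.4, "framework2": 0.3},
    {"framework1": 2.0, "framework2": 1.0},
))
```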

Now under normal circumstances, most tasks are short-lived and Mesos is able to wait and reallocate resources when a task finishes.  However, it is possible for a cluster to become filled with long-running tasks because of a hung job or a badly behaving framework.  In such circumstances, when resources are not being freed quickly enough, the resource allocation module has the ability to revoke tasks.  Mesos attempts to revoke a task by first requesting that the executor kill the task, giving the executor a grace period to clean up.  If the executor does not respond to the request, the allocation module kills the executor and all of its tasks.  An allocation policy can be implemented that prevents a given task from being revoked by providing the associated framework a guaranteed allocation; if a framework is below its guaranteed allocation, Mesos will not kill its tasks.
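
The guaranteed-allocation check can be illustrated with a small sketch.  This is not Mesos source code; it is a conceptual Python snippet, with hypothetical names, of the rule that a framework's tasks only become candidates for revocation once the framework is at or above its guarantee for every resource type.

```python
def can_revoke(framework_allocation, guaranteed_allocation):
    """A framework's tasks are only candidates for revocation if the framework
    holds at least its guaranteed allocation of every resource type."""
    return all(
        framework_allocation.get(r, 0) >= guaranteed_allocation.get(r, 0)
        for r in guaranteed_allocation
    )

# Framework below its CPU guarantee: its tasks are protected.
print(can_revoke({"cpu": 2, "ram_gb": 8}, {"cpu": 4, "ram_gb": 4}))   # False
# Framework at or above all of its guarantees: its tasks may be revoked.
print(can_revoke({"cpu": 6, "ram_gb": 8}, {"cpu": 4, "ram_gb": 4}))   # True
```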

There is more to know about resource allocation in Mesos, but I’ll stop here.  Next week, I will do something a little different and blog about the Mesos community.  I believe that is an important topic to consider since open source is not just about the technology but also about the community.  After that, I will try to post some step-by-step tutorials on setting up Mesos and using and creating frameworks.

As always, I encourage readers to provide feedback, especially regarding whether I am hitting the mark with these posts and whether you see any errors that need to be corrected.  I am a learner and do not pretend to know all the answers, so correction and enlightenment are always welcome.  I also respond on Twitter at @kenhuiny.

Comments

  1. I read a lot about revocation, including on your blog, but I have never actually seen Mesos do it, though I wish it would when a framework is holding on to resources it shouldn't.

    Do you have more info on this?

    • Han,

      Which framework and application are you using? Typically revocation should not be necessary if you are running short-lived tasks. Also, what version of Mesos?

      Thank you.

      Ken
