I’ve been involved in several discussions recently on the topic of vSphere sizing, especially after my recent post on NUMA and its impact on sizing for tier one applications. I thought it might be interesting to lay out my own thoughts on vSphere sizing and get feedback from other folks in the VMware community.
Compute Sizing Guidelines
In these recent discussions, I’ve heard rule-of-thumb guidelines such as assuming 6:1 virtual CPU (vCPU) to physical CPU (pCPU) and 1.5:1 virtual RAM (vRAM) to physical RAM (pRAM) oversubscription ratios. It seems to me that many of these guidelines were developed at a time when most shops were consolidating Windows servers with only 1 pCPU, which was often only 10 to 20% utilized, and 2 GB of pRAM, which was often only 50% utilized. Under those assumptions, a 6:1 ratio made sense. For example, 6 Windows servers, each running on a single 2 GHz CPU that is only 20% utilized and using 2 GB RAM that is only 50% utilized, would require an aggregate of only 2.4 GHz of CPU cycles and 6 GB of RAM; you could virtualize those 6 servers, put them on a 3 GHz CPU core with 8 GB RAM, and still have 20% headroom for spikes. That would allow you to host ~50 VMs on a vSphere host with 8 cores and 64 GB RAM. However, how feasible are these assumptions today, when applications are being written to take better advantage of the higher number of CPU cores and larger amounts of RAM? In those cases, the rule-of-thumb guidelines will likely not work, even if you adjust them down.
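To make the arithmetic concrete, here’s a quick sketch of that consolidation math in Python (the numbers are just the illustrative values from the example above, not sizing guidance):

```python
# Classic consolidation math: 6 lightly loaded Windows servers.
servers = 6
cpu_ghz, cpu_util = 2.0, 0.20   # each server: 2 GHz pCPU, 20% busy
ram_gb, ram_util = 2.0, 0.50    # each server: 2 GB pRAM, 50% used

agg_cpu_ghz = servers * cpu_ghz * cpu_util   # aggregate CPU demand: 2.4 GHz
agg_ram_gb = servers * ram_gb * ram_util     # aggregate RAM demand: 6 GB

# A single 3 GHz core carries that demand at 80% busy, i.e. 20% headroom.
core_util = agg_cpu_ghz / 3.0

print(agg_cpu_ghz, agg_ram_gb, core_util)
```

Scale that single-core result across an 8-core, 64 GB host and you arrive at the ~50 VM figure the old rule of thumb implies.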
Over the years, I’ve sized vSphere using different methodologies that range from data collection and analysis to best-guess estimates, depending on what my customers can and are willing to provide:
Using current utilization data – I get this from very few customers, but sometimes it is provided in the form of vCenter/ESXTOP/perfmon data or data from the VMware Capacity Planner. I basically find out what the aggregate CPU cycle and RAM usage rates are and map those to UCS. That allows me to size even if the current server environment does not have CPU models that match what is currently in a UCS blade, which is often the case in a refresh.
Simple example – 100 servers that are using 2.4 GHz CPUs and are 30% utilized on average yield an aggregate of 72 GHz of CPU cycles. This maps out to 24 3.0 GHz CPU cores; with 20% headroom, that works out to 30 CPU cores. Using a B200 M3 blade with 16 cores, I could fit that all onto 2 blades at a ratio of 3.3:1. I would use the same methodology for calculating the amount of RAM required and use the more restrictive of the two calculations. I almost always find that I need more blades due to the RAM requirements.
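The CPU side of that example can be sketched in a few lines (assuming one vCPU per source server, which is what the 3.3:1 ratio implies):

```python
import math

servers, ghz_per_server, avg_util = 100, 2.4, 0.30
core_ghz, headroom = 3.0, 0.20
cores_per_blade = 16  # B200 M3 in this example

agg_ghz = servers * ghz_per_server * avg_util                  # 72 GHz of demand
cores_needed = math.ceil(agg_ghz / core_ghz / (1 - headroom))  # 30 cores
blades = math.ceil(cores_needed / cores_per_blade)             # 2 blades
ratio = servers / cores_needed                                 # ~3.3:1 vCPU:pCPU

print(agg_ghz, cores_needed, blades, round(ratio, 1))
```

Running the same calculation against the RAM numbers and taking the larger blade count gives the final answer; as noted above, RAM usually wins.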
Using their current inventory – More common is to get an inventory without any utilization data, in which case I use the same methodology as above but make assumptions about pCPU and pRAM utilization rates; I make sure those rates are agreed on by the customer before I do my design.
Using rule-of-thumb guidelines – In reality this is the most common scenario because I often get insufficient data from a customer. At that point I take extra care to confirm I have an agreement with the customer as to what our assumptions will be regarding CPU and RAM oversubscription. The point, though, is that I don’t use rules of thumb by default, but only when necessary. By the way, I’ve adjusted my personal rule of thumb to 4:1 for vCPU to pCPU oversubscription and 1.2:1 for vRAM to pRAM oversubscription.
The discussion above does not take into account HA requirements, which are probably worth a post of their own; however, I generally factor in the following:
- Adding in one or more spare UCS blades as prescribed by Cisco.
- The number of host failures tolerated in a VMware HA cluster. For example, if I have a 4-node HA cluster that is 80% utilized across each node and one node fails, then I am oversubscribed and, depending on how HA is configured, may not be able to start up all VMs. In that case, you may need to design a 6-node cluster that is 53% utilized across the cluster but can tolerate a single node failure. I know you also have to factor in things like HA slots and Admission Control, based on the configuration of the VMs in a cluster, but I am trying to keep it simple to illustrate my point.
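The cluster arithmetic above can be captured with a small helper (a simplification that assumes the failed node’s workload redistributes evenly across the surviving hosts, which is not exactly how HA restarts VMs, but it illustrates the capacity math):

```python
def post_failover_util(nodes, util_per_node, failures=1):
    """Per-surviving-node utilization after `failures` hosts fail,
    assuming the workload spreads evenly across the survivors."""
    total_load = nodes * util_per_node
    return total_load / (nodes - failures)

# 4 nodes at 80%: one failure pushes survivors past 100% -> oversubscribed.
print(post_failover_util(4, 0.80))
# 6 nodes at 53%: one failure leaves survivors around 64% -> still headroom.
print(post_failover_util(6, 0.53))
```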
Storage Sizing Guidelines
I’ve purposefully avoided talking about storage because it’s a fairly complex topic that requires more detail than I can provide in this post. That being said, here are some general guidelines I follow:
- The main thing to understand about storage sizing is that, unlike CPU and RAM, there is no oversubscription. Some people mistakenly believe that if a given RAID group can deliver 500 IOPS for a physical server, it can somehow deliver 800 IOPS for a virtual server. The fact is, storage sizing for a virtual environment is not much different from sizing for a physical environment, except that you need to take into account what the storage design needs to look like when you aggregate the workload requirements of multiple servers onto a single ESXi host.
- Size first for performance and then for capacity.
- When possible, take a building block approach where you size each application and their associated server separately and then aggregate the results.
- If you are unable to get workload profile information for the environment you are sizing, consider using a profile that will accommodate 80% of the workload. Most storage vendors will use the following profile for a “general purpose” environment – random I/O workload, 8K block size, and a 70/30 read/write split.
- Use a storage IOPS calculator to determine the required storage configuration. I have one that I created in Excel that I am happy to share with the community if someone can help me figure out how to port it over to a web application.
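Until I get that calculator ported, the arithmetic these calculators do is widely known: split front-end IOPS by the read/write ratio, multiply writes by the RAID write penalty, and divide the back-end total by a per-drive IOPS rating. Here is a rough Python sketch; the penalty values and the ~180 IOPS rating for a 15K SAS drive are common rules of thumb, not vendor specs, and the 5,000 IOPS workload is a made-up example:

```python
import math

# Common rule-of-thumb write penalties per RAID level.
RAID_WRITE_PENALTY = {"RAID10": 2, "RAID5": 4, "RAID6": 6}

def drives_needed(front_end_iops, read_ratio, raid_level, iops_per_drive):
    """Return (back-end IOPS, drive count) for a given workload profile."""
    reads = front_end_iops * read_ratio
    writes = front_end_iops * (1 - read_ratio)
    back_end = reads + writes * RAID_WRITE_PENALTY[raid_level]
    return back_end, math.ceil(back_end / iops_per_drive)

# "General purpose" profile from above: 70/30 read/write on RAID5,
# a hypothetical 5,000 front-end IOPS, 15K SAS rated ~180 IOPS each.
back_end, drives = drives_needed(5000, 0.70, "RAID5", 180)
print(back_end, drives)   # roughly 9,500 back-end IOPS across 53 drives
```

Notice how the write penalty dominates: the same 5,000 front-end IOPS on RAID10 needs far fewer spindles, which is why the read/write ratio matters so much to the design.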
- Talk with your storage vendor about their best practices, particularly if they are utilizing some type of drive tiering technology. That will impact your entire storage design, including the type and mix of drives you design for and the layout of your datastores and volumes. For example, in the “general purpose” use case, I would typically favor a configuration with 5% EFDs/SSDs, 25% 15K SAS, and 70% 7.2K NL-SAS. This is because I am sizing for a Vblock that uses EMC storage on the backend with their Fully Automated Storage Tiering (FAST) technology, which can move sub-volume blocks across the different tiers/drive types. Below is an example of the mix of drives I would use for a VDI design and how volumes would be laid out.
Feedback on how others do their sizing is welcome, particularly since I am very open to both correction and refinement of my own methodologies and thinking.
Next post after Christmas: vSphere Sizing for Business Critical Applications.