Over-commitment of resources is a well-known feature of vSphere that lets you use the available physical resources as efficiently as possible, which can result in a higher consolidation ratio (number of VMs per ESXi host). This feature is especially interesting for CPU resources, since CPU has a very low average utilization in many server environments. With CPU over-commitment you could, for example, configure a number of VMs on a host with a total of, let’s say, 50 virtual CPUs (vCPUs) while that host has only 16 physical cores. This example is based on a general best practice of allowing roughly a 3-to-1 over-commitment ratio (3x as many vCPUs configured as physical cores available). Sometimes you might want to reduce this (if you have very CPU-intensive workloads running on your hosts), or you could even decide to allow a higher over-commitment ratio of 5-to-1 (for workloads that use relatively little CPU).
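To make the ratio arithmetic concrete, here is a minimal Python sketch. The function names `overcommit_ratio` and `max_vcpus` are made up for illustration and are not part of any vSphere tooling; the numbers are the ones from the example above.

```python
def overcommit_ratio(total_vcpus: int, physical_cores: int) -> float:
    """Return the vCPU-to-core ratio for a single host (hypothetical helper)."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return total_vcpus / physical_cores


def max_vcpus(physical_cores: int, ratio: int) -> int:
    """Return how many vCPUs fit on a host at a given N-to-1 ratio."""
    return physical_cores * ratio


# The example host: 50 vCPUs configured on 16 physical cores.
print(f"{overcommit_ratio(50, 16):.3f}-to-1")

# What the 3-to-1 and 5-to-1 rules of thumb would allow on that host.
print(max_vcpus(16, 3), max_vcpus(16, 5))
```

So 50 vCPUs on 16 cores is slightly above a strict 3-to-1 ratio, which would allow 48 vCPUs, while 5-to-1 would allow 80.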
DRS (Distributed Resource Scheduler) is a feature of a vSphere cluster that makes sure all workloads (VMs) running on the hosts in that cluster get the resources they need. It balances the load within the cluster by using vMotion to migrate VMs from hosts that have relatively few resources available to hosts where resources are more plentiful.
Starting with vSphere 6.5, DRS has a new setting that lets you configure the allowed CPU over-commitment ratio. If you enable this feature, you can configure a value of up to 500% (a 5-to-1 over-commitment ratio).
Now … how does this work, and does it have any impact on availability, you may ask? To find out, I created a small vSphere 6.5 cluster with two ESXi hosts with 2 cores each, for a total of 4 cores. I also configured HA (without admission control enabled, so from an HA perspective I could start as many VMs as I liked) and then configured DRS with this new feature enabled and the over-commitment ratio set to 50%.
This would allow me to use a maximum of 4 x 50% = 2 virtual CPUs in total. So I started my first VM, which is configured with 2 vCPUs … no problem.
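The calculation behind that limit can be sketched in a few lines of Python. This is only a model of the behaviour I observed in the lab, not how DRS is actually implemented: `vcpu_cap` and `can_power_on` are hypothetical names, and rounding down to a whole vCPU is my assumption.

```python
def vcpu_cap(physical_cores: int, overcommit_pct: int) -> int:
    """Cluster-wide vCPU limit for a given over-commitment percentage.

    Rounding down to a whole vCPU is an assumption, not documented behaviour.
    """
    return physical_cores * overcommit_pct // 100


def can_power_on(running_vcpus: int, requested_vcpus: int, cap: int) -> bool:
    """True if powering on the requested vCPUs stays within the cap."""
    return running_vcpus + requested_vcpus <= cap


# Lab cluster: 2 hosts x 2 cores = 4 cores, ratio set to 50%.
cap = vcpu_cap(physical_cores=4, overcommit_pct=50)
print(cap)  # 2

# First VM: 2 vCPUs requested while nothing else is running.
print(can_power_on(running_vcpus=0, requested_vcpus=2, cap=cap))  # True
```

Any request that would push the total above the cap returns `False` in this model.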
Next I started the second VM, which has only a single vCPU configured (this would bring the total number of vCPUs in use to 3). As expected, DRS prevents this and reports an error to that effect.
So this seems like a great feature: it prevents you from powering on too many VMs and makes sure the VMs that are running get enough CPU resources to reach an acceptable performance level. But what happens when a host goes down? Since this is a cluster-level setting, having only one host left in my lab cluster leaves 2 physical cores available, and with the over-commitment level set to 50% this would mean I could only use 1 vCPU … Well, let’s find out. First I have my 2-vCPU VM running on host esxi65a.
Then I powered off this host. Since I have HA configured, I would assume it takes care of automatically restarting the VM on my host B. But what about the amount of available CPU resources? I need 2 vCPUs for this VM and DRS would allow only 1 (only 2 physical cores left and over-commitment set to 50%). Well, it appears that this is not a problem, since HA does its job as we would expect.
So we don’t need to be afraid that using this DRS setting will affect our level of availability. We DO need to be careful, however: after a host failure (or during maintenance windows) the amount of available cluster resources is reduced, so starting additional workloads might result in unexpected failures (in which case you could temporarily disable this feature or set it to a higher ratio).
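To get a feel for how little room is left after a failure, here is a rough Python sketch. It assumes the limit scales with the cores of the hosts that are still up, which is my reading of the setting rather than confirmed DRS behaviour. The helper names are hypothetical.

```python
import math


def vcpu_headroom(cores_per_host: int, hosts_up: int,
                  overcommit_pct: int, running_vcpus: int) -> int:
    """vCPUs that could still be powered on; negative means already over the cap.

    Assumes the cap is based only on cores of hosts that are currently up.
    """
    cap = cores_per_host * hosts_up * overcommit_pct // 100
    return cap - running_vcpus


def min_pct_needed(total_cores: int, wanted_vcpus: int) -> int:
    """Smallest over-commitment percentage that would admit wanted_vcpus."""
    return math.ceil(wanted_vcpus * 100 / total_cores)


# Lab cluster at 50%, with the 2-vCPU VM running.
print(vcpu_headroom(2, hosts_up=2, overcommit_pct=50, running_vcpus=2))  # 0
print(vcpu_headroom(2, hosts_up=1, overcommit_pct=50, running_vcpus=2))  # -1

# Ratio needed to also start the 1-vCPU VM on the single surviving host.
print(min_pct_needed(total_cores=2, wanted_vcpus=3))  # 150
```

A negative headroom matches what I saw: HA restarted the VM even though the limit was exceeded, but in this model every additional power-on would be refused until the ratio is raised or the failed host returns.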