If you've been using Aria Operations for a while (version 8.10), you've likely noticed a metric called Performance KPI %, defined on ESXi Hosts or Clusters.
I can't find this metric in the documentation, so let's explore from the Metrics tab of a Cluster.
Hovering over it gives you a description: Performance of the cluster, where 100% means all its VMs are being served well by the provider, as per default threshold.
That raises even more questions: what does it mean to be served well, what is the default threshold, and can it be changed? The formula isn't in the documentation either, so I looked internally and found it:
100 - ((Sum([VM]Performance|Number of KPIs Breached) + Sum([Pod]Performance|Number of KPIs Breached)) / (Summary|Number of Running VMs + Summary|Number of Pods) * 100 / 4)
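To make the formula easier to play with, here is a minimal Python sketch of it. The function and parameter names are my own, not anything exposed by Aria Operations, and returning None for a cluster with no running VMs or Pods is an assumption, since the formula doesn't say what happens in that case.

```python
from typing import Optional


def cluster_performance_kpi(
    vm_breaches_sum: int,
    pod_breaches_sum: int,
    running_vms: int,
    pods: int,
    kpis_per_workload: int = 4,
) -> Optional[float]:
    """Return Performance KPI % for a cluster, or None if nothing is running."""
    workloads = running_vms + pods
    if workloads == 0:
        # Assumption: behaviour for an empty cluster is not stated in the
        # formula, so this sketch returns None instead of dividing by zero.
        return None
    breached = vm_breaches_sum + pod_breaches_sum
    # Same arithmetic as the formula above: average breaches per workload,
    # scaled to a percentage of the 4 possible KPIs, subtracted from 100.
    return 100 - (breached / workloads * 100 / kpis_per_workload)


# Hypothetical cluster: 80 VMs with 30 breaches in total, 20 Pods with 10.
print(cluster_performance_kpi(vm_breaches_sum=30, pod_breaches_sum=10,
                              running_vms=80, pods=20))
```

Dividing by 4 is what normalizes the result: each VM or Pod can breach at most 4 KPIs, so the subtracted term can never exceed 100.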
So, what are these [VM]Performance|Number of KPIs Breached and [Pod]Performance|Number of KPIs Breached metrics? Let's look at the one on a VM.
Hovering over it gives you the description: Count of performance KPIs breached, where 0 means the VM is being served well by the cluster or ESXi Host, and 4 means none of the infrastructure services (CPU, Memory, Disk, Network) is delivered as per default threshold.
Which raises the question: how is this metric defined? It isn't documented either, so I looked around and found it:
numberOfKPIsBreached = 0;
if (Network|Transmitted Packets Dropped > 0) => numberOfKPIsBreached += 1;
if (Memory|Contention (%) > 1) => numberOfKPIsBreached += 1;
if (CPU|Ready (%) > 2.5) => numberOfKPIsBreached += 1;
if (Virtual Disk:Aggregate of all Instances|Total Latency (ms) > 10) => numberOfKPIsBreached += 1;
the resulting numberOfKPIsBreached is the metric value
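The same logic as a small Python sketch, using the four thresholds from the definition above. The function and argument names are mine, chosen for readability; they are not product identifiers.

```python
def number_of_kpis_breached(
    tx_packets_dropped: float,
    memory_contention_pct: float,
    cpu_ready_pct: float,
    disk_total_latency_ms: float,
) -> int:
    """Count how many of the four VM performance KPIs exceed their threshold."""
    checks = [
        tx_packets_dropped > 0,       # Network|Transmitted Packets Dropped
        memory_contention_pct > 1,    # Memory|Contention (%)
        cpu_ready_pct > 2.5,          # CPU|Ready (%)
        disk_total_latency_ms > 10,   # Virtual Disk|Total Latency (ms)
    ]
    # Each True counts as 1, so the result is always between 0 and 4.
    return sum(checks)


# Hypothetical VM: no dropped packets, low contention, high ready, slow disk.
print(number_of_kpis_breached(0, 0.5, 3.0, 12))
```

Note that every comparison is strictly greater than, so a VM sitting exactly at 2.5% CPU Ready does not count as breached.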
So, VMs have a metric called numberOfKPIsBreached that starts at 0 and increases by 1 for each threshold crossed, up to a maximum of 4. It is similarly defined for Pods. Let's consider an example: a Cluster with 100 VMs, where every VM has 4 KPIs breached. The sum of breached KPIs is 100 * 4 = 400, so our Cluster Performance KPI % becomes:
100 - (400 / 100 * 100 / 4) = 0
As expected, if all of the VMs in the Cluster have 4 KPIs breached, the Cluster Performance KPI % is 0, indicative of a Cluster not providing resources as required.
Now consider a Cluster with 100 VMs where 50 are running great and 50 aren't (each of those has 4 KPIs breached). The sum of breached KPIs is 50 * 4 = 200, so our formula looks like this:
100 - (200 / 100 * 100 / 4) = 50
As expected, the Cluster Performance KPI % is 50%, since only 50/100 VMs are getting the resources they need. This same metric is available for ESXi Hosts. I've asked the documentation team to formally document this metric and publish its formula.
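Both worked examples can be reproduced with a few lines of Python. This is only a sketch of the arithmetic for a VM-only cluster: the helper name is made up, and the lists of per-VM breach counts are invented test data rather than anything pulled from Aria Operations.

```python
from typing import Sequence


def performance_kpi_from_vms(breaches_per_vm: Sequence[int]) -> float:
    """Performance KPI % for a VM-only cluster, given each VM's breach count."""
    total_breaches = sum(breaches_per_vm)
    running_vms = len(breaches_per_vm)
    return 100 - (total_breaches / running_vms * 100 / 4)


all_bad = [4] * 100               # every VM breaches all 4 KPIs
half_bad = [0] * 50 + [4] * 50    # 50 healthy VMs, 50 breaching all 4 KPIs
all_two = [2] * 100               # every VM breaches exactly 2 KPIs

for name, vms in [("all_bad", all_bad), ("half_bad", half_bad), ("all_two", all_two)]:
    print(name, performance_kpi_from_vms(vms))
```

The third case is worth a look: because the formula only uses the sum of breaches, a Cluster where every VM breaches 2 KPIs scores the same 50% as one where half the VMs breach all 4. The metric reflects the average number of breaches per VM, not how many VMs are affected.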
There is a larger conversation to be had here around KPIs vs SLIs (Service Level Indicators) and SLAs (Service Level Agreements), and there will be more to come in upcoming releases, but for now I'll reference Iwan Rahabok and his book, check it out here.
Surprised to see that anything above 2.5% ready is considered a KPI breach. I always thought it was above 5%.