Top bias metrics and how they work


Metrics for detecting bias

In our previous posts, Breaking down bias in AI and How does bias happen, technically?, we highlighted that things are complex and discussed the three main causes of automated decision bias. The purpose of this post is to walk through our three favorite metrics for detecting whether bias exists in a model in production.

How to measure bias can be a moving target. In this blog post, we will examine the common methods for evaluating bias, how they conflict, and our recommended approaches.

There are several excellent survey papers on the various bias metrics that we recommend. Our favorite is titled “The Zoo of Fairness Metrics in Machine Learning” and was authored by Castelnovo et al. [1]. Castelnovo et al. group the many bias metrics into three broad categories: Group Fairness, Individual Fairness, and Causality-based Fairness.

As we discussed in Breaking down bias in AI, there isn’t one right or wrong way to train an unbiased model, so the approach you take and metrics you use will change depending on context. The short answer for which approach or metric is best is always frustrating: “it depends.” For today’s discussion, we will dig into our personal three favorites:

  • Disparate Impact
  • Equalized Odds
  • Non-parametric cohort analysis

To begin digging into details, we first need to get some notation and terminology out of the way. For this post, we will use the following notation:

  • $A$ is the categorical attribute representing the protected attribute
  • $X$ is all other non-protected features
  • ${\hat Y = F(X,A) \in (0,1)}$ is the model function
  • Model loss is minimized via $L(Y,\hat Y)$
  • $(x_{1},y_{1}),...,(x_{n},y_{n})$ are the observed data points
  • N (Number of observations) = Number of predictions
  • TP (True Positives)   = Number of correctly predicted positive outcomes
  • FP (False Positives)  = Number of incorrectly predicted positive outcomes
  • TN (True Negatives)  = Number of correctly predicted negative outcomes
  • FN (False Negatives) = Number of incorrectly predicted negative outcomes
  • FPR (False Positive Rate) = Percent of actual negative outcomes that were incorrectly predicted as positive, FP / (FP + TN)
  • TPR (True Positive Rate) = Percent of actual positive outcomes that were correctly predicted as positive, TP / (TP + FN)
  • SR (Selection Rate) ** = Percent of all predictions that were positive outcomes, (TP + FP) / N
  • Protected Classes: Groups that we have a legal and moral obligation to not discriminate against such as: gender, ethnicity, religion, political affiliation, disability, sexuality, and age.
  • Proxies: Attributes that can be used to infer a protected class. Frequently identified proxies include: FICO score, Education level, Criminal record, Occupation, Zip code
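The counts and rates in this notation can be computed directly from paired labels and predictions. The following is a minimal sketch, assuming outcomes are coded as 0/1 and that the model score $\hat Y$ has already been thresholded into a 0/1 decision; the names `ConfusionCounts` and `count_outcomes` and the toy data are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Confusion-matrix counts for binary outcomes coded as 0/1."""
    tp: int
    fp: int
    tn: int
    fn: int

    @property
    def n(self) -> int:
        # N: total number of predictions
        return self.tp + self.fp + self.tn + self.fn

    @property
    def tpr(self) -> float:
        # Share of actual positives that were predicted positive
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:
        # Share of actual negatives that were predicted positive
        return self.fp / (self.fp + self.tn)

    @property
    def selection_rate(self) -> float:
        # Share of all predictions that were positive
        return (self.tp + self.fp) / self.n


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionCounts:
    """Tally TP, FP, TN, FN from ground truth and thresholded predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    return ConfusionCounts(
        tp=sum(1 for y, p in pairs if y == 1 and p == 1),
        fp=sum(1 for y, p in pairs if y == 0 and p == 1),
        tn=sum(1 for y, p in pairs if y == 0 and p == 0),
        fn=sum(1 for y, p in pairs if y == 1 and p == 0),
    )


# Toy data, made up for illustration only
toy_true = [1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 1, 0, 1, 0]
toy = count_outcomes(toy_true, toy_pred)
print(toy, toy.tpr, toy.fpr, toy.selection_rate)
```

The rate properties divide by zero if a group has no actual positives or no actual negatives, so a production version would need to handle empty classes explicitly.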

Monitaur's top recommended bias monitoring tests

| Test | Overview | Protected Attributes? | Proxy Features? | Pros | Cons |
| --- | --- | --- | --- | --- | --- |
| Equalized Odds | Academic standard for algorithmic fairness | Yes | Possible but not recommended | Recognized as most accurate and equitable form of bias detection in academia | Requires protected attribute, good/bad outcomes, as well as ground truth. Does not have an explicit notion of error or acceptable bias ranges. |
| Disparate impact | Current legal standard method for detecting bias | Yes | Possible but not recommended | Easy to understand and interpret: a Disparate Impact less than 80% indicates bias | Not completely accurate. Bias can be present but go undetected. Requires protected attribute (could be non-model feature), as well as labelled positive/negative outcome[s] |
| Non-parametric cohort analysis | Statistical measure of significant differences between cohorts and outcome | Yes | Possible but not recommended | Standard, non-parametric analysis of data | Requires proxy or protected attribute. Doesn't give different bias outcomes like disparate impact, just identifies where an issue may be present. |

Disparate Impact

We will begin with Disparate Impact, the foundational bias metric and the legal standard even today. Disparate impact as a concept was solidified by Title VII of the 1964 Civil Rights Act, which states that:

“An employment practice or policy that appears neutral, but has a disproportionately adverse effect on members of the protected class as compared with non-members of the protected class is illegal.”

To determine if disparate impact exists, the “80 percent” test was created by a panel of 32 professionals assembled by the State of California Fair Employment Practice Commission (FEPC) in 1971. This test was then codified in the 1978 Uniform Guidelines on Employee Selection Procedures, a document used by the U.S. Equal Employment Opportunity Commission (EEOC), the Department of Labor, and the Department of Justice in Title VII enforcement.

The equation for Disparate Impact is:

$SR = \frac{\textrm{Positive result count}}{N}$

$\textrm{Disparate Impact Ratio} = \frac{\textrm{Underprivileged Group SR}}{\textrm{Privileged Group SR}}$

Disparate Impact example

| Gender | Applied | Qualified | Approved | Selection Rate |
| --- | --- | --- | --- | --- |
| Male | 80 | 48 | 48 | 48 / 80 = 0.6 |
| Female | 40 | 12 | 12 | 12 / 40 = 0.3 |

$\textrm{Disparate Impact Ratio} = \frac{\textrm{Female SR}}{\textrm{Male SR}}=\frac{0.3}{0.6}=0.50$

Result: <80%, disparate impact is present.
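The calculation above can be reproduced in a few lines. This is a minimal sketch using the counts from the example table; the function names and the `FOUR_FIFTHS_THRESHOLD` constant are hypothetical names chosen for illustration, and the 0.8 cutoff is the “80 percent” test described earlier.

```python
FOUR_FIFTHS_THRESHOLD = 0.8  # the "80 percent" test


def selection_rate(positive_count: int, n: int) -> float:
    """SR = positive result count / N."""
    return positive_count / n


def disparate_impact_ratio(underprivileged_sr: float, privileged_sr: float) -> float:
    """Ratio of the underprivileged group's SR to the privileged group's SR."""
    return underprivileged_sr / privileged_sr


# Counts from the example: approved / applied
male_sr = selection_rate(48, 80)
female_sr = selection_rate(12, 40)

ratio = disparate_impact_ratio(female_sr, male_sr)
disparate_impact_present = ratio < FOUR_FIFTHS_THRESHOLD

print(f"Disparate Impact Ratio: {ratio:.2f}")
print(f"Disparate impact present: {disparate_impact_present}")
```

Note that the ratio uses only the decisions and the protected attribute. Nothing about qualifications or ground truth enters the calculation.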

The problem with the Disparate Impact Ratio is that it doesn’t take into account the effects of merit or qualifications. For instance, in our example above, only 30% of the women who applied were qualified, so their 30% selection rate was to be expected. Disparate Impact is still better than nothing and is a key metric to hang your hat on, but it is dated and can flag bias where none exists (false positives).

For industries such as insurance where discrimination based on credit risk is the core of the business (we can have a separate discussion about if some of the factors that insurers use are in fact biased and shouldn’t be used, but credit risk needs to be assessed regardless), we need a more intelligent metric to better gauge bias.

One benefit of the disparate impact ratio is that it does not require ground truth data (the correct outcomes to compare against the predicted outcomes) or any knowledge of how the model was trained. The next metric we will discuss, Equalized Odds, solves the merit issue with Disparate Impact but introduces another problem: knowing what the correct prediction is.


Equalized Odds

Equalized Odds, as described by Hardt et al. in Equality of Opportunity in Supervised Learning, is a relatively new academic criterion for evaluating fairness in machine learning model outcomes. This method seeks to equalize the accuracy of prediction for all demographics [2]. Currently, no accepted thresholds exist for what constitutes “unequal odds”. Unlike Disparate Impact, equalized odds or equal opportunity punishes models that only perform well on the majority outcome class.

We say that a model satisfies Equalized Odds with respect to a protected attribute $A$ and an outcome $Y$ if the prediction $\hat Y$ and the protected attribute $A$ are independent conditional on the outcome $Y$:

$P(\hat Y=1|A=0,Y=y)=P(\hat Y=1|A=1,Y=y), \quad y \in \{0,1\}$

Thus Equalized Odds is achieved if, for a given true outcome, the probability of a certain prediction is not influenced by flipping the protected attribute.

Practically, we can roughly simplify the above equation into the following:

$\textrm{TPR}=\frac{\textrm{TP}}{\textrm{TP + FN}}$

$\textrm{FPR}=\frac{\textrm{FP}}{\textrm{FP + TN}}$

The goal is to minimize the difference between the TPR of the privileged group and the TPR of the underprivileged group. Likewise, we should also minimize the difference in FPRs between privileged and underprivileged groups. An accepted error rate hasn’t been codified yet, meaning we can’t state that an equalized odds measurement is accepted if it has a 5%/10%/20% difference.
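To make the link to the formal definition explicit, the condition can be split into its two cases. The subscript notation $\textrm{TPR}_a$ and $\textrm{FPR}_a$ (the rates within group $A=a$) and the $\Delta$ symbols are introduced here for illustration only:

$y=1: \quad P(\hat Y=1|A=a,Y=1)=\textrm{TPR}_a \quad \Rightarrow \quad \textrm{TPR}_0=\textrm{TPR}_1$

$y=0: \quad P(\hat Y=1|A=a,Y=0)=\textrm{FPR}_a \quad \Rightarrow \quad \textrm{FPR}_0=\textrm{FPR}_1$

The two quantities to minimize are then:

$\Delta_{\textrm{TPR}}=|\textrm{TPR}_0-\textrm{TPR}_1|, \qquad \Delta_{\textrm{FPR}}=|\textrm{FPR}_0-\textrm{FPR}_1|$

Equalized Odds holds exactly when both gaps are zero.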

Equalized Odds example

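In this example, we compare two groups of applicants, men and women, for equality of opportunity. Equalized odds are satisfied because, in both groups, a qualified applicant has an 80% chance of being hired (TPR) and an unqualified applicant has a 30% chance of being hired (FPR). Equivalently, 70% of unqualified applicants in each group are rejected, which is the true negative rate, $1-\textrm{FPR}$.

Male applicants

| Decision | Qualified | Unqualified |
| --- | --- | --- |
| Hired | 56 | 9 |
| Rejected | 14 | 21 |
| TOTAL | 70 | 30 |

$\textrm{Male TPR}=\frac{56}{56+14}=\frac{56}{70}=0.80$

$\textrm{Male FPR}=\frac{9}{9+21}=\frac{9}{30}=0.30$

Female applicants

| Decision | Qualified | Unqualified |
| --- | --- | --- |
| Hired | 24 | 21 |
| Rejected | 6 | 49 |
| TOTAL | 30 | 70 |

$\textrm{Female TPR}=\frac{24}{24+6}=\frac{24}{30}=0.80$

$\textrm{Female FPR}=\frac{21}{21+49}=\frac{21}{70}=0.30$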

In the above example, we take qualifications into account and can determine more accurately whether the hiring decisions are fair and unbiased. In practice, however, the equalized odds test is difficult to implement, as many use cases are not as cut and dried as “qualified” and “unqualified”. We often don’t know until long after the fact (if ever) whether our model predictions were correct. As a worrying consequence, you could theoretically obtain a perfect equalized odds score and still be biased because the qualifications themselves are biased: in the above example, 70% of the male applicants but only 30% of the female applicants were labelled as qualified.
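The per-group rates from the hiring example can be checked with a short script. This is a minimal sketch using the counts from the two tables above; the dictionary layout and function names are hypothetical, chosen for illustration. With the standard definition FPR = FP / (FP + TN), both groups come out at 0.30, the complement of the 70% of unqualified applicants who were rejected.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """TPR = TP / (TP + FN): share of qualified applicants who were hired."""
    return tp / (tp + fn)


def false_positive_rate(fp: int, tn: int) -> float:
    """FPR = FP / (FP + TN): share of unqualified applicants who were hired."""
    return fp / (fp + tn)


# Counts from the example tables (hired/rejected vs qualified/unqualified)
groups = {
    "male": {"tp": 56, "fp": 9, "fn": 14, "tn": 21},
    "female": {"tp": 24, "fp": 21, "fn": 6, "tn": 49},
}

rates = {
    name: {
        "tpr": true_positive_rate(c["tp"], c["fn"]),
        "fpr": false_positive_rate(c["fp"], c["tn"]),
    }
    for name, c in groups.items()
}

# Equalized odds gaps: both should be close to zero for a fair model
tpr_gap = abs(rates["male"]["tpr"] - rates["female"]["tpr"])
fpr_gap = abs(rates["male"]["fpr"] - rates["female"]["fpr"])

for name, r in rates.items():
    print(f"{name}: TPR={r['tpr']:.2f} FPR={r['fpr']:.2f}")
print(f"TPR gap={tpr_gap:.2f} FPR gap={fpr_gap:.2f}")
```

Because no accepted tolerance has been codified, the sketch reports the gaps rather than returning a pass/fail verdict.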

Non-parametric cohort analysis

Non-parametric cohort analysis sounds like a mouthful, but non-parametric essentially means we are employing statistical tests that do not assume the data fits one fixed probability distribution, such as the normal bell curve. A cohort analysis is used to test for a significant relationship between a protected class feature, such as gender (male/female*), and an outcome (hired/rejected*).

There are many different non-parametric statistical tests, which are a topic for another blog post. For this post, we will keep with our binary outcome and binary protected class problem, i.e. $\hat Y \in \{0,1\}$ and $A \in \{0,1\}$, respectively, and use the McNemar test, a paired nominal test.

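To calculate the McNemar statistic, we first create a contingency table of the data. The off-diagonal cells are $b$ (Males, Survived) and $c$ (Females, Died).

| Gender | Died | Survived |
| --- | --- | --- |
| Males | 177 | 96 |
| Females | 26 | 101 |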

And perform a hypothesis test.

$H_0: p_b = p_c$

$H_a: p_b \neq p_c$

The test statistic is calculated as: $\chi^2 = \frac{(b-c)^2}{b+c}$

Test statistic = $\frac{(96-26)^2}{96+26}\approx40$. This equates to a p-value** of less than 0.0001.

With a p-value this low, we reject $H_0$: bias may be present.
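The statistic and its p-value can be reproduced with the Python standard library. This is a minimal sketch under two assumptions: the statistic is compared against a chi-square distribution with one degree of freedom (the usual reference distribution for McNemar's test), and the function names are hypothetical. One caveat worth keeping in mind is that McNemar's test is designed for paired observations, so for independent samples like this table a chi-square test of independence is the more common choice. That falls under the sampling considerations mentioned in the footnotes.

```python
import math


def mcnemar_statistic(b: int, c: int) -> float:
    """McNemar chi-square statistic from the two off-diagonal cells."""
    return (b - c) ** 2 / (b + c)


def chi2_sf_one_df(x: float) -> float:
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    With 1 df, X is the square of a standard normal, so the tail
    probability is erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(x / 2))


# Off-diagonal cells from the contingency table
b, c = 96, 26

statistic = mcnemar_statistic(b, c)
p_value = chi2_sf_one_df(statistic)
reject_h0 = p_value < 0.05  # 0.05 is an assumed significance level

print(f"statistic={statistic:.2f}, p-value={p_value:.2e}, reject H0={reject_h0}")
```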

The McNemar test helps us understand the relationship between a protected (or proxy) feature and an outcome in a statistically valid manner. This test comes into its own as a monitoring test that we run on a schedule, for example every $x$ days or every $x$ transactions. Complexity gets introduced as we determine sampling needs, but this test serves as a more statistically valid version of disparate impact. However, it runs into the same problem of whether the discrimination is warranted, which equalized odds addresses.
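Running the test every $x$ transactions could look like the following. This is only a sketch of the scheduling idea, not Monitaur's implementation: the window size, the coding of each transaction as a pair `(A, Y_hat)` of 0/1 values, the mapping of the off-diagonal cells to `(0, 1)` and `(1, 0)`, and all function names are assumptions. The critical value 3.841 is the chi-square cutoff for one degree of freedom at the 0.05 level.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Transaction = Tuple[int, int]  # (protected attribute A, prediction Y_hat), each 0 or 1

CHI2_CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def fixed_size_windows(
    transactions: Iterable[Transaction], size: int
) -> Iterator[List[Transaction]]:
    """Yield consecutive full windows; a trailing partial window is dropped."""
    batch: List[Transaction] = []
    for transaction in transactions:
        batch.append(transaction)
        if len(batch) == size:
            yield batch
            batch = []


def mcnemar_from_batch(batch: List[Transaction]) -> float:
    """McNemar statistic using the off-diagonal cells of the 2x2 table."""
    counts = Counter(batch)
    b = counts[(0, 1)]  # assumed mapping: group 0 with outcome 1
    c = counts[(1, 0)]  # assumed mapping: group 1 with outcome 0
    if b + c == 0:
        return 0.0  # no off-diagonal observations, nothing to test
    return (b - c) ** 2 / (b + c)


def monitor(
    transactions: Iterable[Transaction],
    size: int,
    critical_value: float = CHI2_CRITICAL_1DF_05,
) -> Iterator[Tuple[int, float, bool]]:
    """Yield (window index, statistic, flagged) for each full window."""
    for index, batch in enumerate(fixed_size_windows(transactions, size)):
        statistic = mcnemar_from_batch(batch)
        yield index, statistic, statistic > critical_value


# Window 0 reuses the contingency table counts; window 1 is perfectly balanced
stream = (
    [(0, 0)] * 177 + [(0, 1)] * 96 + [(1, 0)] * 26 + [(1, 1)] * 101
    + [(0, 0)] * 100 + [(0, 1)] * 100 + [(1, 0)] * 100 + [(1, 1)] * 100
)
results = list(monitor(stream, size=400))
for index, statistic, flagged in results:
    print(f"window {index}: statistic={statistic:.2f} flagged={flagged}")
```

Testing repeatedly on every window raises the chance of a spurious flag, which is one of the sampling considerations the paragraph above alludes to.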

Protected features and proxy features

Another complication we often run into is data availability. For fear of bias, many companies do not include any protected features or proxies in their modeling. Worse yet, many do not have their data structured in a way that makes it easy to link transactions to this information, essentially leaving us to fly blind.

To combat this, the industry has come up with a couple of ‘hacks’, such as Bayesian Improved Surname Geocoding (BISG), but they are not nearly as effective as having protected class or proxy information present in the model (we recommend multi-objective modeling; see our previous post How does bias happen, technically?).

Monitaur also has the notion of ‘non-model features’: if you are able to reference and pull in other data that contains a protected class or proxy feature, we can perform bias monitoring on that feature while keeping it fully isolated from the model.

For when there is absolutely nothing to go on, which occurs often in some industries, Monitaur created a last-ditch approach called Optimal Group Differencing (OGD). In OGD, we use a hyper-parameter-optimized unsupervised technique to identify clusters of transactions whose patterns differ significantly from the rest and may therefore exhibit bias. OGD will return groups of transactions that may exhibit bias, but without the protected attribute or ground truth outcome, it is still difficult to determine if bias exists in the identified transactions. It is a step in the right direction and, for high-risk models, makes it possible to have an individual review a set of transactions that may exhibit bias.

Bias is complicated

As this post has illustrated, the recurring theme of this series is that “things are complex.” Hopefully this has helped to clarify the strengths and weaknesses of several prominent bias metrics. We did not touch on the probability theory behind bias metrics, i.e. independence, separation, and sufficiency, or my personal favorite methods, simulations and counterfactuals, for validating that a model is not biased, but we can save these for subsequent posts.

About the author

Dr. Andrew Clark is Monitaur’s co-founder and Chief Technology Officer. A trusted domain expert on the topic of ML auditing and assurance, Andrew built and deployed ML auditing solutions at Capital One. He has contributed to ML auditing education and standards at organizations including ISACA and ICO in the UK. He currently serves as a key contributor to ISO AI Standards and the NIST AI Risk Management framework. Prior to Monitaur, he also served as an economist and modeling advisor for several very prominent crypto-economic projects while at Block Science.

Andrew received a B.S. in Business Administration with a concentration in Accounting, Summa Cum Laude, from the University of Tennessee at Chattanooga, an M.S. in Data Science from Southern Methodist University, and a Ph.D. in Economics from the University of Reading. He also holds the Certified Analytics Professional and American Statistical Association Graduate Statistician certifications. Andrew is a professionally trained concert trumpeter and Team USA triathlete.


  • There are additional considerations, such as independent vs dependent sampling, statistical assumptions, etc. that are beyond the scope of this already lengthy blog post, but can be the focus of another blog post if desired.
  • *Overly simplified for the sake of example.
  • **P-values are another topic. We can dive into this in the future as well.


[1] Castelnovo, Alessandro, Riccardo Crupi, Greta Greco, and Daniele Regoli. “The Zoo of Fairness Metrics in Machine Learning.” arXiv:2106.00467 [cs, stat], December 13, 2021.

[2] Hardt, Moritz, Eric Price, and Nathan Srebro. “Equality of Opportunity in Supervised Learning.” arXiv:1610.02413, 2016.