In one of the previous posts in our series on bias, titled How does bias happen, technically?, we touched on the notion of a loss function and how algorithms are trained. In this post we will dive deeper into what exactly loss functions are, as well as how machine learning models are constructed and 'trained'. We will walk through some popular loss functions, explain how they work, introduce the concept of gradient descent, provide an example of how a loss function gets optimized, and end with a discussion of how multi-objective optimization is the future of fair and ethical machine learning modeling.
A loss function, in statistical theory, is a function that calculates the error between actual (true) values and predicted values. Loss functions, a.k.a. cost functions, can take on different shapes and sizes to accommodate different goals or use cases, such as a regression model (a model that predicts a continuous value, such as a risk score between 0 and 100). In order to 'train' a statistical or machine learning model, we undertake the optimization problem of minimizing the loss from the loss function. One of the most common methods to train an algorithm and minimize a loss function is gradient descent. In gradient descent, we take the partial derivatives of the loss function with respect to the coefficients of the model and repeatedly adjust those coefficients in the direction that reduces the loss, moving towards a minimum of the loss function. Below is, hopefully, an intuitive example to understand how gradient descent works:
Imagine an individual is stuck high up in the mountains of Colorado and has gotten lost. Visibility is very low due to the fog, so they do not see a path all the way down the mountain. This hypothetical climber could find their way to the base by employing a real-world variation of gradient descent. To do this, they would look for the path that has the steepest downhill descent in their line of sight. They will take a few steps down this path and then reevaluate whether they need to change direction or continue down. Eventually they will make it to the bottom of the mountain (the global minimum) or the bottom of a hole in the mountain (a local minimum). Each time they measure the steepness of the hill and make an adjustment can be considered an epoch, and the degree of change they make after each epoch is their learning rate.
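The climber's procedure can be sketched in a few lines of Python. This is a minimal illustration, not code from the original walk-through: the one-dimensional 'mountain' $f(w) = (w - 3)^2$, the starting point, and the names `gradient_descent_1d` and `slope` are all assumptions chosen for the example.

```python
def gradient_descent_1d(grad, start, learning_rate, epochs):
    """Repeatedly step against the slope, like the climber heading downhill."""
    w = start
    for _ in range(epochs):
        # Measure the steepness at the current position, then step downhill.
        w = w - learning_rate * grad(w)
    return w


# Hypothetical mountain: f(w) = (w - 3)^2, whose slope is f'(w) = 2 * (w - 3).
def slope(w):
    return 2.0 * (w - 3.0)


w_final = gradient_descent_1d(slope, start=10.0, learning_rate=0.1, epochs=100)
print(w_final)  # very close to 3, the bottom of this particular valley
```

Because this toy function has a single valley, the climber cannot get stuck in a local minimum here; on a bumpier function the same loop could settle in whichever hole it reaches first.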
For our discussion today, we will use a basic statistical model called Linear Regression. Linear Regression is a simple model that predicts an outcome, or dependent variable, from an observation, or independent variable, with the addition of an intercept. This results in a linear equation of the form y = mX + c where y is the outcome, X is the observation, m is the slope of the line, and c is the constant intercept term. Linear regression models are a good reference model before performing more complex modeling techniques. Below is the Linear Regression equation written out in matrix notation:
$ \mathbf{y} = X \boldsymbol m + \boldsymbol c\ \ $ where $\ \ \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix},\boldsymbol m = \begin{pmatrix} m_0 \\ m_1 \\ m_2 \\ \vdots \\ m_p \end{pmatrix}, X = \begin{pmatrix} \mathbf{x}^\mathsf{T}_1 \\ \mathbf{x}^\mathsf{T}_2 \\ \vdots \\ \mathbf{x}^\mathsf{T}_n \end{pmatrix}, \boldsymbol c = \begin{pmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{pmatrix}$
Note, for our use case, we are using simple linear regression, with only 1 independent variable. Linear regression also supports multiple regression, with two or more independent variables. For brevity, we have omitted a discussion on the assumptions built into linear regression and the exploratory data analysis that is required before building mission-critical models.
To begin our discussion of loss functions, namely the common MSE and MAE loss functions, we will first write out the math and Python code in as much detail as is practical.
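The data-generation code from the walk-through is not reproduced in this text, so the following is only a sketch of how Gamma-distributed data of this kind could be produced with the Python standard library. The function name `make_data`, the shape and scale parameters, the slope of 1.4, and the seed are all assumptions and will not reproduce the exact numbers printed later in this post.

```python
import random


def make_data(n=100, seed=42):
    """Generate hypothetical univariate data with Gamma-distributed values."""
    rng = random.Random(seed)  # seeded so the data is reproducible
    # gammavariate(alpha, beta) takes a shape and a scale; both values are assumed.
    x = [rng.gammavariate(2.0, 30.0) for _ in range(n)]
    # Assumed linear relationship plus right-skewed Gamma noise (source of outliers).
    y = [1.4 * xi + rng.gammavariate(2.0, 10.0) for xi in x]
    return x, y


x, y = make_data()
print(len(x), len(y))
```

The long right tail of the Gamma distribution is what produces occasional large values, which is useful for comparing how different loss functions react to outliers.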
From the plot above, we see our randomly distributed data. Since we are using a Gamma distribution, we can see some outliers.
The first loss function we will define is the ubiquitous Mean Squared Error (MSE) loss. MSE is defined as the average squared error between the actual value of the dependent variable, $Y_i$ ($i$ being the specific data point), and its predicted value, $\hat Y_i$. As we train a model with gradient descent, over time the average of the squared errors for the dataset decreases. Different loss functions have different use cases and are optimal for different desired outcomes, but MSE is a great starting point for regression (predicting a continuous outcome) problems. Below we have the equation:
$$\textrm{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2$$
where $n$ is the number of data points and $i$ indexes a specific data point in the array of values.
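As a minimal sketch of the formula above, MSE can be written directly in plain Python. The function name `mse` and the three sample values are illustrative assumptions, not data from this post.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


# Errors are 1, 0 and -3, so the squared errors are 1, 0 and 9.
print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 10.0]))  # (1 + 0 + 9) / 3
```

Note how the single error of 3 contributes nine times as much as the error of 1; this squaring is why MSE reacts strongly to large misses.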
Now, to perform gradient descent as described above, we need to differentiate our MSE with respect to the coefficients $m$, the slope term, and $c$, the intercept, or constant value. To do so, we substitute our simple linear regression equation $y = mx + c$ into the MSE equation wherever $\hat Y$ appears ($\hat Y$ can be thought of as $f(x)$):
$$\textrm{MSE}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-(mx_i + c)\right)^2$$
We then take the partial derivatives with respect to $m$ and $c$, the slope and intercept, giving us:
$$\frac{\partial MSE}{\partial m} = \frac{-2}{n}\sum_{i=1}^n x_i (Y_i - \hat{Y_i})$$
$$\frac{\partial MSE}{\partial c} = \frac{-2}{n}\sum_{i=1}^n (Y_i - \hat{Y_i})$$
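The two partial derivatives translate almost one-to-one into code. The sketch below is an illustration rather than the walk-through's own implementation; the name `mse_gradients` and the tiny data set are assumptions.

```python
def mse_gradients(x, y, m, c):
    """Partial derivatives of MSE with respect to the slope m and intercept c."""
    n = len(x)
    # Residuals Y_i - Y_hat_i for the current line y = m * x + c.
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    d_m = (-2.0 / n) * sum(xi * r for xi, r in zip(x, residuals))
    d_c = (-2.0 / n) * sum(residuals)
    return d_m, d_c


# With m = 0 and c = 0 every prediction is 0, so the residuals are 2, 4 and 6.
d_m, d_c = mse_gradients([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], m=0.0, c=0.0)
print(d_m, d_c)
```

Both derivatives are negative here, which says that increasing $m$ and $c$ would reduce the loss; gradient descent therefore moves them upward.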
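In order to iterate over our algorithm to train the model, we will perform the following steps: use our number of epochs to determine how many times we will check the steepness of the hill, then use our learning rate, $L$, to determine the degree to which we update our coefficient values. At each epoch, we update $m \leftarrow m - L \cdot \frac{\partial MSE}{\partial m}$ and $c \leftarrow c - L \cdot \frac{\partial MSE}{\partial c}$.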
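Putting the epochs and the learning rate together gives a short training loop. This is a hedged sketch in the spirit of the procedure described above, not the code that produced the printed results; the names `mse_line` and `train`, the toy data generated from $y = 2x + 1$, and the learning rate of 0.05 are assumptions picked so the example converges quickly.

```python
def mse_line(x, y, m, c):
    """MSE of the line y = m * x + c on the data."""
    n = len(x)
    return sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y)) / n


def train(x, y, learning_rate, epochs):
    """Fit slope m and intercept c by gradient descent on MSE."""
    m, c = 0.0, 0.0
    n = len(x)
    history = []
    for _ in range(epochs):
        residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
        d_m = (-2.0 / n) * sum(xi * r for xi, r in zip(x, residuals))
        d_c = (-2.0 / n) * sum(residuals)
        # Step against the gradient, scaled by the learning rate L.
        m -= learning_rate * d_m
        c -= learning_rate * d_c
        history.append(mse_line(x, y, m, c))
    return m, c, history


# Toy data lying exactly on y = 2x + 1 (an assumption for illustration).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
m, c, history = train(x, y, learning_rate=0.05, epochs=2000)
print(m, c, history[-1])
```

The learning rate has to suit the scale of the data: a value that works on this toy set would overshoot on data with much larger values of $x$, where a far smaller rate is needed.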
Below we will walk through training our model for 15 epochs with gradient descent. We will plot both the data and the corresponding best fit lines from each time step. We will also print out the respective coefficient values and their corresponding MSEs. The Python code is modified from Adarsh Menon's excellent 2018 blog post[1].
Epoch #0: m = 1.3029690942817036; c = 0.020171926803486246; MSE = 11125.510732037144
Epoch #1: m = 1.38492617474331; c = 0.023533981198709558; MSE = 2100.5570855449128
Epoch #2: m = 1.3900542967346852; c = 0.02583827178686301; MSE = 2064.7755880223576
Epoch #3: m = 1.3903481598871632; c = 0.02807595843804381; MSE = 2064.584234839073
Epoch #4: m = 1.39033784070446; c = 0.030309407272782522; MSE = 2064.53379322249
Epoch #5: m = 1.3903083822971603; c = 0.0325425425157025; MSE = 2064.4839117351066
Epoch #6: m = 1.3902777202508474; c = 0.03477561108933626; MSE = 2064.4340346941194
Epoch #7: m = 1.3902469831149034; c = 0.03700862853168542; MSE = 2064.3841598993504
Epoch #8: m = 1.3902162419003055; c = 0.03924159582149052; MSE = 2064.3342873419874
Epoch #9: m = 1.3901855010752178; c = 0.041474513021388985; MSE = 2064.284417021897
Epoch #10: m = 1.3901547609207736; c = 0.04370738013637486; MSE = 2064.2345489389777
Epoch #11: m = 1.3901240214546475; c = 0.04594019716781509; MSE = 2064.1846830931318
Epoch #12: m = 1.390093282677937; c = 0.048172964116848384; MSE = 2064.1348194842553
Epoch #13: m = 1.3900625445906973; c = 0.05040568098459907; MSE = 2064.0849581122493
Epoch #14: m = 1.390031807192917; c = 0.05263834777219053; MSE = 2064.035098977013
As we can see from above, our first MSE was high (about 11,126), then it improved substantially (to about 2,101), improved slightly again, and by the third epoch had essentially settled at about 2,065, the model we stayed with. Let's see how our training may go under a different loss function.
With mean absolute error, MAE, instead of squaring the errors, we take the absolute value of the error. The main difference between MSE and MAE is their responsiveness to outliers. If you want your model to be more influenced by outliers, use MSE; if you don't want outliers to have too much weight, you may be better served by the MAE loss function, shown below:
$$\textrm{MAE} = \frac{1}{n}\sum_{i=1}^n |Y_i - \hat{Y_i}|$$
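To make the difference in outlier sensitivity concrete, the following sketch computes both losses on a small made-up example, with and without one outlier. The function names and the sample values are assumptions for illustration only.

```python
def mae(y_true, y_pred):
    """Mean absolute error: the average of the absolute differences."""
    n = len(y_true)
    return sum(abs(yt - yp) for yt, yp in zip(y_true, y_pred)) / n


def mse(y_true, y_pred):
    """Mean squared error, defined here for side-by-side comparison."""
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n


y_pred = [9.0, 11.0, 10.0, 10.0]
clean = [10.0, 10.0, 10.0, 10.0]
with_outlier = [10.0, 10.0, 10.0, 50.0]  # the last actual value is an outlier

print(mae(clean, y_pred), mse(clean, y_pred))                # 0.5 and 0.5
print(mae(with_outlier, y_pred), mse(with_outlier, y_pred))  # 10.5 and 400.5
```

One outlier multiplies the MAE by 21 but the MSE by 801, so a model trained on MSE will bend much further to accommodate that point.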
To optimize our model over the loss function, as in the MSE walkthrough, we need to differentiate our MAE with respect to $m$, the slope term, and $c$, the intercept, or constant value. To do so, we substitute our simple linear regression equation $y = mx + c$ into the MAE equation wherever $\hat Y$ appears ($\hat Y$ can be thought of as $f(x)$):
$$\textrm{MAE} = \frac{1}{n}\sum_{i=1}^n |Y_i - (mx_i +c)|$$
Now, we run into a wrinkle we didn't have with the MSE: the MAE is not differentiable where an error is exactly zero, so we take the partial derivatives with respect to $m$ and $c$ on either side of zero. For each data point, the direction (sign) of those derivatives is:
$$ \frac{\partial MAE}{\partial m} = \begin{cases} -1 & \text{for } x_i(Y_i - \hat{Y_i}) > 0 \\ +1 & \text{for } x_i(Y_i - \hat{Y_i}) < 0 \end{cases}$$
$$ \frac{\partial MAE}{\partial c} = \begin{cases} -1 & \text{for } Y_i - \hat{Y_i} > 0 \\ +1 & \text{for } Y_i - \hat{Y_i} < 0 \end{cases}$$
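For reference, the standard subgradient of the MAE averages the per-point signs over the data set, weighting each by $x_i$ for the slope. The sketch below implements that textbook form; it is an assumption about how one might code it, not the implementation behind the printed results, and the names `sign` and `mae_subgradients` are hypothetical.

```python
def sign(value):
    """Return +1, -1, or 0; the 0 case is a conventional choice at the kink."""
    return (value > 0) - (value < 0)


def mae_subgradients(x, y, m, c):
    """Subgradient of MAE with respect to the slope m and intercept c."""
    n = len(x)
    residuals = [yi - (m * xi + c) for xi, yi in zip(x, y)]
    # Each point contributes only the sign of its residual, never its size.
    d_m = (-1.0 / n) * sum(xi * sign(r) for xi, r in zip(x, residuals))
    d_c = (-1.0 / n) * sum(sign(r) for r in residuals)
    return d_m, d_c


# With m = 0 and c = 0 all residuals are positive, so every sign is +1.
d_m, d_c = mae_subgradients([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], m=0.0, c=0.0)
print(d_m, d_c)  # -2.0 and -1.0
```

Because only signs enter the sum, a point that is off by 1 pulls on the coefficients exactly as hard as a point that is off by 100, which is the source of MAE's robustness to outliers.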
What this means in practice is that the gradient only tells us which direction to move, not how far, so we need to increase our learning rate from the 0.0001 we used with the MSE to 0.1 to be more responsive at each change. To make up for the lack of a gradient magnitude, we need to move in larger 'steps' per epoch.
Epoch #0: m = 0.1; c = 0.1; MAE = 100.85963401743122
Epoch #1: m = 0.2; c = 0.2; MAE = 94.31057927904259
Epoch #2: m = 0.3; c = 0.3; MAE = 87.76152454065421
Epoch #3: m = 0.4; c = 0.4; MAE = 81.21246980226589
Epoch #4: m = 0.5; c = 0.5; MAE = 74.66341506387745
Epoch #5: m = 0.6; c = 0.6; MAE = 68.11436032548895
Epoch #6: m = 0.7; c = 0.7; MAE = 61.56530558710057
Epoch #7: m = 0.8; c = 0.8; MAE = 55.01625084871213
Epoch #8: m = 0.9; c = 0.9; MAE = 48.467196110323655
Epoch #9: m = 1.0; c = 1.0; MAE = 41.918141371935214
Epoch #10: m = 1.1; c = 1.1; MAE = 35.36908663354676
Epoch #11: m = 1.2; c = 1.2; MAE = 28.820031895158365
Epoch #12: m = 1.3; c = 1.3; MAE = 22.270977156769924
Epoch #13: m = 1.4; c = 1.4; MAE = 15.721922418381459
Epoch #14: m = 1.3; c = 1.5; MAE = 9.172867679992994
Now that we've established how optimizing loss functions works, we will go into the topic of multi-objective loss functions. Up until this point, we had one equation or criterion that we were trying to minimize. However, as we have discussed in previous posts, when it comes to responsible AI, we want to ensure that our models are both performant and unbiased. In order to accomplish this, we need to optimize for multiple objectives in our loss functions. Multi-objective optimization is a very large, advanced topic with extensive work in the fields of engineering, economics, and logistics. For our purposes today, we will modify our MSE loss function to include a simple constraint that the predicted value cannot be greater than 120. In practice, we would most likely be optimizing a multivariate model for both an accuracy function, such as MSE, and a protected class variable for equalized odds (see our previous post on Top bias metrics and how they work).
$$\textrm{MSE}_{\max} =\frac{1}{n} \sum_{i=1}^n \left(Y_i-\hat{Y_i}\right)^2, \quad \hat{Y_i} < 120 $$
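One simple way to read this constraint in code is to cap every prediction at 120 before measuring the error. The sketch below takes that reading as an assumption; the name `mse_max`, the `cap` parameter, and the sample values are hypothetical.

```python
def mse_max(y_true, y_pred, cap=120.0):
    """MSE computed after capping each prediction at `cap` (assumed reading)."""
    n = len(y_true)
    capped = [min(yp, cap) for yp in y_pred]
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, capped)) / n


# Predictions 125 and 160 are capped to 120, giving errors of -10, 10 and 30.
print(mse_max([100.0, 130.0, 150.0], [110.0, 125.0, 160.0]))  # (100 + 100 + 900) / 3
```

Other formulations are possible, for example adding a penalty term for predictions above the limit instead of capping them outright.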
Now, to perform gradient descent as described above, we need to differentiate our $\textrm{MSE}_{\max}$ with respect to $m$, the slope term, and $c$, the intercept, or constant value, under the constraint $\hat{Y_i} < 120$. To do so, we substitute our simple linear regression equation $y = mx + c$ into the $\textrm{MSE}_{\max}$ equation wherever $\hat Y$ appears ($\hat Y$ can be thought of as $f(x)$):
$$\textrm{MSE}_{\max}=\frac{1}{n} \sum_{i=1}^n \left(Y_i-(mx_i + c)\right)^2, \quad mx_i + c < 120$$
We then take the partial derivatives with respect to $m$ and $c$, the slope and intercept, capping each prediction at 120:
$$\frac{\partial MSE_{\max}}{\partial m} = \frac{-2}{n}\sum_{i=1}^n x_i \left(Y_i - \begin{cases} \hat{Y_i} & \text{for } \hat{Y_i} < 120 \\ 120 & \text{for } \hat{Y_i} \geq 120 \end{cases}\right)$$
$$\frac{\partial MSE_{\max}}{\partial c} = \frac{-2}{n}\sum_{i=1}^n \left(Y_i - \begin{cases} \hat{Y_i} & \text{for } \hat{Y_i} < 120 \\ 120 & \text{for } \hat{Y_i} \geq 120 \end{cases}\right)$$
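The capped update can be sketched by reusing the MSE gradient formulas with each prediction limited to 120 before the residual is taken. This is an assumed reading of the equations above rather than the walk-through's code, and it is a simplification: it does not zero out the contribution of capped points, as an exact derivative of a capped loss would. The name `capped_gradients` and the sample values are hypothetical.

```python
def capped_gradients(x, y, m, c, cap=120.0):
    """MSE-style gradients where each prediction is capped at `cap` first."""
    n = len(x)
    # Predictions above the cap are replaced by the cap itself.
    capped_pred = [min(m * xi + c, cap) for xi in x]
    residuals = [yi - yp for yi, yp in zip(y, capped_pred)]
    d_m = (-2.0 / n) * sum(xi * r for xi, r in zip(x, residuals))
    d_c = (-2.0 / n) * sum(residuals)
    return d_m, d_c


# Predictions are 15 and 150; the second is capped to 120, so residuals are 0 and 10.
d_m, d_c = capped_gradients([10.0, 100.0], [15.0, 130.0], m=1.5, c=0.0)
print(d_m, d_c)  # -1000.0 and -10.0
```

Without the cap, the second residual would be $130 - 150 = -20$ and the gradient would push $m$ down; with the cap it pushes $m$ up, which illustrates how a constraint can change where training settles.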
Epoch #0: m = 1.3029690942817036; c = 0.020171926803486246; MSE = 11125.510732037144
Epoch #1: m = 1.442144301979417; c = 0.024010316347749815; MSE = 1917.442210263628
Epoch #2: m = 1.4955802526488085; c = 0.02647448645966806; MSE = 1709.6243849370342
Epoch #3: m = 1.520488190537731; c = 0.028463009225149227; MSE = 1651.132706823289
Epoch #4: m = 1.5328372774364025; c = 0.03023889240754651; MSE = 1627.0645394102555
Epoch #5: m = 1.539132639411789; c = 0.03191150385295135; MSE = 1615.7126147043207
Epoch #6: m = 1.5424050609390116; c = 0.03353223899537106; MSE = 1609.971450646196
Epoch #7: m = 1.5441092936350949; c = 0.03512601454898867; MSE = 1607.0176496386148
Epoch #8: m = 1.5449929156620412; c = 0.036705665638760654; MSE = 1605.4734716233502
Epoch #9: m = 1.545446687313976; c = 0.03827790311578823; MSE = 1604.651889794666
Epoch #10: m = 1.5456741786500285; c = 0.03984622787016094; MSE = 1604.211963591956
Epoch #11: m = 1.5457819050863417; c = 0.041412474493953355; MSE = 1603.972386307603
Epoch #12: m = 1.5458262428097813; c = 0.04297761399508994; MSE = 1603.8385323390485
Epoch #13: m = 1.5458370304536697; c = 0.04454216030604839; MSE = 1603.7605466993352
Epoch #14: m = 1.5458300609278863; c = 0.046106385439669516; MSE = 1603.712106202352
When we compare the different optimized loss functions and their shape, we see that our multi-objective model ended up having the lowest MSE. This will not always be the case, but giving up some performance in exchange for a safe, fair, and well-controlled model is normally a worthwhile trade-off.
We have given a high-level introduction to different loss functions and how they work by creating univariate linear regression models from randomly generated data. We walked through how a model is trained using gradient descent and provided the math and code for creating these loss functions and linear models from scratch. We concluded by introducing the complex topic of multi-objective modeling, which will be the focus of future posts. Coming soon in our series will be a discussion on debiasing training data with interpretable techniques as well as model validations.
For the sake of illustration and discussion, we skipped crucial modeling steps such as understanding business needs, data quality assessment, data segmentation, and model cross-validation, to name only a few.
For more information about considering these steps, subscribe to The AI Fundamentalists podcast and search for episodes about model robustness and performance.
[1] Menon, Adarsh. “Linear Regression Using Gradient Descent.” Medium, September 19, 2018. https://towardsdatascience.com/linear-regression-using-gradient-descent-97a6c8700931.