In a recent episode of The AI Fundamentalists podcast, we spoke about information theory and why, although it is a very valuable discipline, its divergences are often the wrong choice for model and data drift monitoring. In this post, we summarize the goals of information theory, define the differences between metrics and divergences, explain why divergences are the wrong choice for monitoring, and propose better alternatives.
Information theory as we know it today came out of Claude Shannon's work in the 1940s at Bell Labs [1]. Built on probability and statistics, it studies how information is transferred and used, with information quantified in bits. Information theory was instrumental in cryptography, messaging, and related fields because it focuses on the minimum number of bits needed to convey a given message. One of the main driving concepts in information theory is entropy. Entropy quantifies the amount of uncertainty involved in the outcome of a data process. The less variability within a given process, the less “information” is conveyed, yielding a lower entropy.
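To make the link between variability and entropy concrete, here is a minimal sketch that computes Shannon entropy in bits for a few discrete distributions. The function name and the coin examples are our own illustrations, not something taken from Shannon's paper.

```python
import math

def shannon_entropy(probabilities):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with probability 0 are skipped: they contribute nothing to the sum.
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

fair_coin = [0.5, 0.5]    # maximum uncertainty for two outcomes
biased_coin = [0.9, 0.1]  # less variability, so less uncertainty
certain = [1.0, 0.0]      # no variability at all

print(shannon_entropy(fair_coin))    # 1.0 bit
print(shannon_entropy(biased_coin))  # roughly 0.47 bits
print(shannon_entropy(certain))      # 0.0 bits
```

The fair coin needs a full bit per flip, the biased coin needs less than half a bit on average, and the process with a single possible outcome conveys no information at all.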
Outside of entropy-based input selection, information theory is most often applied to modeling systems through input and output data monitoring. To understand why information theory methods might not be the best for monitoring, we first need to understand the difference between a distance metric and a divergence, and why it matters.
Without diving too deep down a mathematical rabbit hole, it is important to understand the concept of vectors and spaces. A vector can be thought of as a ‘column’ of values, often notated as:
More simply, this means the Euclidean distance between your house and Starbucks is the shortest possible path by going through the globe, not the best sequence of roads/tunnels to get there.
We can calculate the distance between any two points in a space, which is then referred to as a metric space, if four distance properties are met.
If any of these properties are not met, the points are not in a metric space and a true distance cannot be calculated. This may seem trivial, but metric spaces are what make chess AIs and GPS driving instructions possible. Mathematically, failing these properties carries ramifications, one of which is losing the ability to use the standard framework of p-values and significance levels (alpha). With a properly configured test on a distance metric, we can use alpha and the p-value to determine whether a change in distance between two points in space is statistically significant. Without this, we must rely on divergence-specific heuristics to determine whether a result is adverse. When responsibly deploying modeling systems, knowing when distributions have shifted to a statistically significant degree, especially across many inputs and modeling systems, is critical.
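As a rough illustration, the sketch below spot-checks a distance function against what we assume are the four properties in question, namely the standard metric axioms: non-negativity, identity (the distance is zero only between a point and itself), symmetry, and the triangle inequality. The helper name, the sample points, and the squared-distance counterexample are our own choices, and a check on a handful of points is evidence rather than proof.

```python
import itertools
import math

def check_metric_properties(distance, points, tol=1e-12):
    """Spot-check the four metric axioms on a finite set of points."""
    results = {"non_negativity": True, "identity": True,
               "symmetry": True, "triangle_inequality": True}
    for x, y in itertools.product(points, repeat=2):
        d_xy = distance(x, y)
        if d_xy < -tol:
            results["non_negativity"] = False
        # The distance should be zero exactly when x and y are the same point.
        if (abs(d_xy) <= tol) != (x == y):
            results["identity"] = False
        if abs(d_xy - distance(y, x)) > tol:
            results["symmetry"] = False
    for x, y, z in itertools.product(points, repeat=3):
        # Going directly from x to z should never be longer than a detour via y.
        if distance(x, z) > distance(x, y) + distance(y, z) + tol:
            results["triangle_inequality"] = False
    return results

def squared_euclidean(x, y):
    """Illustrative non-metric: Euclidean distance squared."""
    return math.dist(x, y) ** 2

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 4.0)]

euclidean_check = check_metric_properties(math.dist, points)
squared_check = check_metric_properties(squared_euclidean, points)

print("Euclidean:", euclidean_check)
print("Squared Euclidean:", squared_check)
```

Euclidean distance passes all four checks on these points. Squaring it looks harmless, but it breaks the triangle inequality: from (0, 0) to (2, 0) the squared distance is 4, while the detour through (1, 0) costs only 1 + 1.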
Now that we’ve walked through what a distance metric is and briefly discussed why it matters, let’s get back to exploring commonly used information theory divergences. Loosely, consider a divergence to be a measure of how far apart two probability distributions are that is locally well behaved enough to perform calculus on [8], but that does not meet all of the properties required of a distance metric.
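A small sketch can show one way a divergence falls short of a metric. Below we compute the Kullback-Leibler Divergence between two discrete distributions in both directions; the function name and the two example distributions are our own, and the code assumes both distributions share the same support with no zero probabilities in the second argument.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in bits for two discrete distributions over the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; otherwise the divergence is infinite.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]

forward = kl_divergence(p, q)  # roughly 0.737 bits
reverse = kl_divergence(q, p)  # roughly 0.531 bits

print(f"KL(P || Q) = {forward:.3f}")
print(f"KL(Q || P) = {reverse:.3f}")
```

The two directions disagree, so the symmetry property fails: the “distance” from P to Q is not the “distance” from Q to P. For monitoring, that means the answer depends on which dataset you treat as the reference.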
We don’t definitively know why divergences became a default choice for monitoring. Information theory has always been there and is the backbone of what makes up computer science. Most Machine Learning and MLOps Engineers have computer science backgrounds and may not be deeply familiar with statistics. Thus they may default to the Information Theory hammer in their statistical toolbox. When you have a hammer, everything is a nail.
Setting aside the theoretical downsides, Evidently AI has an excellent post [3] comparing multiple drift detection methods. This work notably highlights the empirical finding that Kullback-Leibler Divergence [5], Jensen-Shannon Divergence [6], and Population Stability Index [7] are ineffective tools for detecting drift.
As an alternative, we propose that non-parametric metrics are ‘the way’. The main exception is when you have a strong understanding of your underlying statistical distributions, in which case parametric statistics give you the highest level of performance. We refer you to our previous podcast and post on non-parametric statistics for more details [4].
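As one possible illustration of a non-parametric approach, the sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical distribution functions) and uses a permutation test to get a p-value that can be compared against alpha. The choice of this statistic, the function names, the synthetic Gaussian data, and the alpha of 0.05 are all our own assumptions for the example, not a prescription.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for value in a + b:
        cdf_a = bisect.bisect_right(a, value) / len(a)
        cdf_b = bisect.bisect_right(b, value) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def permutation_p_value(reference, current, n_permutations=500, seed=0):
    """Estimate how often a gap this large appears when labels are shuffled."""
    rng = random.Random(seed)
    observed = ks_statistic(reference, current)
    pooled = list(reference) + list(current)
    n_ref = len(reference)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            extreme += 1
    # The add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)

# Synthetic data standing in for one model input: a reference window,
# a new window from the same process, and a window whose mean has shifted.
data_rng = random.Random(42)
reference = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
same = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
shifted = [data_rng.gauss(1.0, 1.0) for _ in range(200)]

alpha = 0.05
results = {}
for name, current in [("same", same), ("shifted", shifted)]:
    stat, p_value = permutation_p_value(reference, current)
    results[name] = (stat, p_value)
    verdict = "drift" if p_value < alpha else "no drift detected"
    print(f"{name}: statistic={stat:.3f}, p-value={p_value:.4f}, {verdict}")
```

No distributional assumption is made anywhere: the test only compares ranks and shuffles labels, and the decision rule is the familiar comparison of a p-value to alpha rather than a divergence-specific threshold.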
In this post we've provided a follow-up discussion from our podcast on information theory. If this content was helpful, if you would like us to go deeper in an area, or if you have questions or requests, please submit feedback on our site.
Until next time,
Andrew & Sid, The AI Fundamentalists.
* Note: Many of the underpinnings for modeling and computational systems hinge on this concept of Euclidean distance.
** Notation review:
[1]: Shannon, C. E. “A Mathematical Theory of Communication.” The Bell System Technical Journal 27, no. 3 (July 1948): 379–423.
[2]: Chiang, Alpha C., and Kevin Wainwright. Fundamental Methods of Mathematical Economics. 4th ed. Boston, Mass: McGraw-Hill/Irwin, 2005.
[3]: “Which Test Is the Best? We Compared 5 Methods to Detect Data Drift on Large Datasets.” Accessed March 22, 2024. https://www.evidentlyai.com/blog/data-drift-detection-large-datasets.
[4]: “Exploring Non-Parametric Statistics.” Accessed March 22, 2024. https://www.monitaur.ai/blog-posts/exploring-non-parametric-statistics
[5]: Kullback, S., and R. A. Leibler. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22, no. 1 (March 1951): 79–86. https://doi.org/10.1214/aoms/1177729694.
[6]: Lin, J. “Divergence Measures Based on the Shannon Entropy.” IEEE Transactions on Information Theory 37, no. 1 (January 1991): 145–51. https://doi.org/10.1109/18.61115.
[7]: Karakoulas, Grigoris. “Empirical Validation of Retail Credit-Scoring Models.” RMA Journal 87 (September 2004): 56–60. https://cms.rmau.org/uploadedFiles/Credit_Risk/Library/RMA_Journal/Other_Topics_(1998_to_present)/Empirical Validation of Retail Credit-Scoring Models.pdf
[8]: Guichard, David. “Divergence and Curl.” Calculus: Early Transcendentals, Chapter 16.5. Whitman College. https://www.whitman.edu/mathematics/calculus_online/section16.05.html