Machine learning, and lately generative AI models, are the next frontier of model systems development for businesses. There are already exploratory uses in various fields, including finance, insurance, healthcare, and marketing. However, the successful and safe implementation of these model systems depends on the selection, governance, and quality of the data used to train them.
In this blog post, we will illustrate the importance and process of establishing data governance and data quality prior to building model systems, AI or otherwise, and talk about how these practices set the foundation for higher-quality model systems in high-risk scenarios and regulated industries.
Organizations have processes to ensure that the data used for machine learning is governed and scoped to meet the objective of the model systems. Data controls are specific checks to implement and manage discrete parts of governance. There are several controls that are essential for machine learning model systems.
Regulatory data review involves checking and reviewing input data to ensure compliance with relevant regulations. This process should be done on a regular and repeated schedule.
When done properly, it involves reviewing the data to ensure it meets the necessary regulatory requirements. However, some teams argue that they should build and deploy model systems before validating regulatory compliance, citing that reviews take too long for experimental systems that might never reach deployment. In practice, regular and repeated reviews of input data give a model system a baseline of reliability and a validated record of what data was used to train it, leading to fewer regulatory concerns about why the model system was built in the first place.
Data governance processes are put in place to ensure that data is organized, stored, categorized, defined, and maintained with lineage. When performed at the time of collection, these processes include checks for completeness, accuracy, consistency, and cleanliness. It’s also a chance to look for historical gaps. This process should be repeated at a required frequency and after any source data changes that might impact the model system.
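The collection-time checks above can be sketched in a few lines. This is a minimal, hypothetical example: the field names, expected types, and rules are assumptions standing in for whatever your own schema defines.

```python
# Minimal sketch of collection-time quality checks. The field names and
# expected types below are hypothetical -- adapt them to your schema.
EXPECTED_FIELDS = {"customer_id": str, "amount": float, "region": str}

def check_record(record: dict) -> list[str]:
    """Return a list of issues found in a single record."""
    issues = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record or record[field] is None:
            issues.append(f"missing: {field}")       # completeness check
        elif not isinstance(record[field], expected_type):
            issues.append(f"wrong type: {field}")    # consistency check
    return issues

records = [
    {"customer_id": "c1", "amount": 12.5, "region": "EU"},
    {"customer_id": "c2", "amount": None, "region": "EU"},
    {"customer_id": "c3", "amount": "12", "region": "US"},
]
report = {r["customer_id"]: check_record(r) for r in records}
```

Running checks like these at ingestion, and again after any source data change, gives you an auditable record of data quality over time.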
Incomplete or missing data produces model systems that diverge from their intended behavior, because any skew or gaps in the collected data are reflected directly in the models trained on it.
Data bias review involves testing and documenting representative underlying socioeconomic and demographic characteristics. This process should be done on a repeated schedule based on the risk of the model. In some organizations, this process might also be the resulting step of governance during data collection. When protected data sources are reviewed, they should also be reviewed to identify and remediate bias.
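One simple form of bias review is comparing the share of each demographic group in the training data against a reference population. The sketch below is a hypothetical illustration: the group names, counts, reference shares, and tolerance are all assumptions, not a prescribed methodology.

```python
# Sketch of a representation check: flag groups whose share in the
# training data differs from a reference population by more than a
# tolerance. All names and numbers here are hypothetical.
def representation_gaps(train_counts, reference_shares, tolerance=0.05):
    total = sum(train_counts.values())
    gaps = {}
    for group, ref_share in reference_shares.items():
        train_share = train_counts.get(group, 0) / total
        if abs(train_share - ref_share) > tolerance:
            gaps[group] = round(train_share - ref_share, 3)
    return gaps

train_counts = {"group_a": 850, "group_b": 150}
reference_shares = {"group_a": 0.60, "group_b": 0.40}
gaps = representation_gaps(train_counts, reference_shares)
# group_a is over-represented by 0.25, group_b under-represented by 0.25
```

A check like this documents representation gaps so they can be remediated, not just noted.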
Data selection involves ensuring that all data used for machine learning is appropriate and timely for the use case. Business stakeholders and model owners should review the data selection to ensure it supports business intent.
In this sense, you might be wondering about the data that ChatGPT was trained on. The large language models behind OpenAI’s interface were trained on enormous amounts of public and private data scraped from the internet. General users were not able to perform data control steps, including selecting their own data to align with the objective and intent of OpenAI’s model use.
For the purpose of illustrating the steps to better data and, subsequently, higher model quality, the risk of not being able to align model input data with your objectives is something to consider as you contemplate using LLMs trained on general public data.
After data has gone through the necessary control steps, a data quality report should be evaluated by the model developers and owners to confirm that the data used for a model is of high quality. A data quality report can be expected to cover validation of missing data, out-of-range values, correlated features, and data imbalance.
When models expect all features to be present, missing data can make large swaths of collected data unusable. There have been situations where an amazing model is built, but when applied in the real world, not all features are collected, rendering the model unusable for over 50% of transactions. There are remediation options available for these scenarios such as imputation or further data selection to drop commonly missing features.
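The scenario above can be quantified before deployment. This sketch, using made-up rows and feature names, measures how many records would survive a strict "all features present" requirement and which features drive the loss:

```python
# Sketch: measure how many rows survive a strict "all features present"
# requirement, then find which features drive the loss. The rows and
# feature names are hypothetical.
rows = [
    {"age": 34, "income": 52000, "score": 0.7},
    {"age": 41, "income": None,  "score": 0.6},
    {"age": None, "income": None, "score": 0.9},
    {"age": 29, "income": 61000, "score": None},
]
features = ["age", "income", "score"]

complete = [r for r in rows if all(r[f] is not None for f in features)]
usable_fraction = len(complete) / len(rows)  # only 1 of 4 rows usable

missing_rate = {
    f: sum(r[f] is None for r in rows) / len(rows) for f in features
}
# "income" is missing most often; dropping or imputing it would
# recover the most rows.
```

Running this kind of audit on real-world data, not just the curated training set, surfaces the "unusable for over 50% of transactions" problem before the model ships.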
Out-of-bounds or range data can cause unexpected model outputs and, in worst-case scenarios, dangerous model outcomes. Models can sometimes retrain themselves over time to match new data, incentivizing adversarial attackers to send intentionally bogus out-of-range data to skew the model to work how they want it to.
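A basic defense is to validate ranges before data reaches training or retraining pipelines. The bounds and records below are hypothetical, a sketch of the idea rather than production validation logic:

```python
# Sketch: reject out-of-range values before they reach training or
# online retraining. The bounds and records are hypothetical.
BOUNDS = {"age": (0, 120), "amount": (0.0, 1_000_000.0)}

def in_bounds(record: dict) -> bool:
    return all(
        lo <= record[field] <= hi
        for field, (lo, hi) in BOUNDS.items()
        if field in record
    )

incoming = [
    {"age": 35, "amount": 120.0},
    {"age": -3, "amount": 120.0},        # bogus value an attacker might send
    {"age": 35, "amount": 2_500_000.0},  # implausibly large amount
]
accepted = [r for r in incoming if in_bounds(r)]  # only the first record
```

Rejected records should also be logged and reviewed, since a spike in out-of-range inputs can itself be a signal of an attempted attack.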
It’s important to mention that while checking and correcting for out-of-bounds data is one prong of prevention against attackers, there are several other factors outside the scope of data that your risk and compliance teams will want to check for independently. Data quality is an important factor, but not a foolproof defense against attacks on models.
Highly correlated features can cause problems for modeling paradigms that are sensitive to overfitting on repeated columns. When multiple columns overwhelmingly encode the same information, it is best to keep only the best representation. Collinearity fundamentally breaks certain model types, and even when it isn't an explicit problem, it introduces noise for a model to overfit to.
Data imbalance can cause majority biases, where models become incapable of correctly learning patterns for minority groups in the data. Imagine building a model and being excited to see 85% accuracy on a gender prediction system, only to learn that the training data was 85% men.
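A quick sanity check makes this failure mode visible: a "model" that always predicts the majority class already matches the headline accuracy. Using the 85/15 split from the example above:

```python
# Sketch: a majority-class baseline exposes imbalance. A model that
# always predicts the majority class matches the headline accuracy
# without learning anything. Labels mirror the 85/15 split above.
from collections import Counter

labels = ["man"] * 85 + ["woman"] * 15

majority_class, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)  # 0.85
```

If a trained model can't beat this baseline, its 85% accuracy tells you about the data distribution, not about predictive skill. Comparing against the majority-class baseline (or using per-class metrics) should be a standard part of the data quality report.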
Data governance and data quality are essential for successful machine learning modeling. By understanding the objective for creating a model in the first place and following the necessary processes, we can ensure that the data used for machine learning is of high quality and mitigate unintended biases. This leads to more accurate models that can support better decisions in high-risk scenarios and regulated industries.
To hear the entire discussion about why data matters, listen to episode 3 of The AI Fundamentalists and subscribe on Spotify, Apple Podcasts and more popular players.