Article Summary

Article Summary of:

Deleting unethical data sets isn’t good enough

Published:
August 13, 2021

Data is at the root of many of the governance challenges raised by AI and ML, and Karen Hao of the MIT Technology Review pulls back the veil on the questionable provenance of some of the datasets that underlie modern data science, especially for facial recognition. She takes her cue from a study of a dataset created by researchers more than 15 years ago through the unethical practice of scraping images and personal information from the web without individuals’ knowledge or permission. The dataset was subsequently cited in more than 1,000 academic studies and has since “gone wild” in the corporate world, where it is difficult to know how widely it is used. Researchers have “retracted” other, similar datasets, but no mechanism exists to ensure they are no longer used.

How to govern the large datasets required to build effective AI is one of the big puzzles for the AI community to solve in the coming years. The study’s authors propose independent data stewardship organizations. In a similar vein, the new Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) released a study on the possible ramifications of “foundation” models like Google’s BERT and OpenAI’s DALL-E. Trained at massive scale, these models create unique advantages for their developers, but their opacity raises serious governance concerns, and their close control by vested interests creates high competitive barriers for new entrants.

Ethics & Responsibility
Risks & Liability