I had the pleasure to present our work on evaluating and calibrating with uncertain ground truth at the seminar series of the PRECISE center at the University of Pennsylvania. Besides talking about our recent papers on evaluating AI models in health with uncertain ground truth and conformal prediction with uncertain ground truth, I also got to learn more about the research at PRECISE through post-doc and student presentations. In this article, I want to share the corresponding slides.
In supervised machine learning, we usually assume access to ground truth label for evaluation. In many applications, however, these ground truth labels are derived from expert opinions. Disagreement among these experts is typically ignored using simple majority voting or averaging. Unfortunately, this can have severe consequences by over-estimating performance or mis-guiding model selection. In our work presented in this article, we tackle this problem by introducing a statistical framework for aggregating expert opinions.