Confidence Limits for Prediction Performance
File | Description | Size | Format
---|---|---|---
Confidence-Limits-for-Prediction-Performance_Rink_PDF-A.pdf | | 1.96 MB | Adobe PDF
Authors: | Rink, Pascal |
Supervisor: | Brannath, Werner |
1. Expert: | Brannath, Werner |
Experts: | Wright, Marvin N. |
Abstract: | As machine learning algorithms become increasingly integrated into critical systems, assessing the reliability of their predictions is essential, especially when errors can have severe consequences. Incorporating statistical methods can help to quantify the inherent uncertainty and improve decision-making. This work proposes a new method for estimating confidence limits for prediction performance. In the first part, we introduce fundamental concepts and findings from the machine learning and statistical inference literature, framing the selection and evaluation of prediction models as a statistical inference problem. In particular, we consider the simultaneous evaluation of multiple candidate models and interpret this as a multiple testing problem. We also explore the bootstrap and nonparametric bootstrap tilting, which provides a reliable approach for estimating confidence intervals without the need to assume a specific underlying distribution. The second part integrates these concepts and presents the proposed multiplicity-adjusted bootstrap tilting lower confidence limits for conditional prediction performance. This approach is computationally undemanding and universally applicable to any combination of prediction models, model selection strategies, and performance measures. We prove that the proposed interval asymptotically achieves the nominal coverage probability and conduct simulation experiments to assess its finite-sample performance. Specifically, we investigate the prediction accuracy of lasso and random forest classifiers. The proposed approach shows reliable coverage and competitive lower confidence limits. In contrast, we also show that recent alternative methods such as bootstrap bias-corrected cross-validation and nested cross-validation may fail to accurately track conditional performance. Finally, we apply the proposed approach to real-world data, where it demonstrates stability when model selection is highly sensitive to the allocation of the sample data to the learning and evaluation sets, or in the presence of a distribution shift. |
Keywords: | bootstrap tilting; model selection in machine learning; multiple testing; performance evaluation; post-selection inference |
Issue Date: | 30-Apr-2025 |
Type: | Dissertation |
DOI: | 10.26092/elib/3822 |
URN: | urn:nbn:de:gbv:46-elib89416 |
Research data link: | https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic) |
Institution: | Universität Bremen |
Faculty: | Fachbereich 03: Mathematik/Informatik (FB 03) |
Appears in Collections: | Dissertationen |
This item is licensed under a Creative Commons License
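The abstract above describes evaluating several candidate classifiers (a lasso and a random forest) on a held-out evaluation set and reporting a lower confidence limit for the selected model's conditional accuracy. The sketch below illustrates only that general setting on the Wisconsin diagnostic breast cancer data linked under "Research data link", using scikit-learn. The split ratio, the hyperparameters, and the naive one-sided percentile bootstrap are illustrative assumptions; this is not the multiplicity-adjusted bootstrap tilting procedure developed in the dissertation.

```python
# Minimal sketch: learning/evaluation split, selection among candidate
# classifiers, and a NAIVE one-sided percentile-bootstrap lower confidence
# limit for the selected model's accuracy. Not the thesis's method.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)  # Wisconsin diagnostic data

# Split the sample into a learning set (model fitting) and an evaluation set.
X_learn, X_eval, y_learn, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Candidate models: an L1-penalised ("lasso") logistic regression and a
# random forest. Hyperparameters are arbitrary illustrative choices.
candidates = {
    "lasso": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    ),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
for model in candidates.values():
    model.fit(X_learn, y_learn)

# Per-observation correctness indicators on the evaluation set.
correct = {
    name: (model.predict(X_eval) == y_eval).astype(float)
    for name, model in candidates.items()
}

# Select the candidate with the highest evaluation accuracy.
selected = max(correct, key=lambda name: correct[name].mean())
print("selected:", selected, "evaluation accuracy:", correct[selected].mean())

# Naive one-sided percentile bootstrap lower limit: resample the evaluation
# set, recompute the selected model's accuracy, and take the 5th percentile.
B, n = 2000, len(y_eval)
boot_acc = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)
    boot_acc[b] = correct[selected][idx].mean()
print(f"naive 95% lower confidence limit: {np.quantile(boot_acc, 0.05):.3f}")
```

Because the same evaluation data are used both to select the model and to compute the limit, such a naive interval can be too optimistic; accounting for this selection effect is what the multiplicity-adjusted bootstrap tilting lower confidence limits proposed in the dissertation are designed to do.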