dc.contributor.advisor |
Bethge, Matthias (Prof. Dr.) |
|
dc.contributor.author |
Michaelis, Claudio |
|
dc.date.accessioned |
2024-04-10T09:00:32Z |
|
dc.date.available |
2024-04-10T09:00:32Z |
|
dc.date.issued |
2024-04-10 |
|
dc.identifier.uri |
http://hdl.handle.net/10900/152732 |
|
dc.identifier.uri |
http://nbn-resolving.de/urn:nbn:de:bsz:21-dspace-1527324 |
de_DE |
dc.identifier.uri |
http://dx.doi.org/10.15496/publikation-94071 |
|
dc.description.abstract |
In recent years, machine performance on object recognition, language understanding and other capabilities that we associate with human intelligence has improved rapidly. One central element of this progress is machine learning models that learn the solution to a task directly from data. The other is benchmarks that use data to quantitatively measure model performance. In combination, they form a virtuous cycle in which models can be optimized directly for benchmark performance. But while the resulting models perform very well on their benchmarks, they often fail unexpectedly outside this controlled setting. Innocuous changes such as image noise, rain or the wrong background can lead to incorrect predictions. In this dissertation, I argue that to understand these failures, it is necessary to understand the relationship between benchmark performance and the desired capability. To support this argument, I study benchmarks in two ways.
In the first part, I investigate how to learn and evaluate a new capability. To this end, I introduce one-shot object detection and define several benchmarks to analyze what makes this task hard for machine learning models and what is needed to solve it. I find that CNNs struggle to separate individual objects in cluttered environments, and that one-shot recognition of objects from novel categories is particularly challenging for real-world objects. I then investigate what makes one-shot generalization difficult in real-world scenes and identify the number of categories in the training dataset as the central factor. Using this insight, I show that excellent one-shot generalization can be achieved by training on broader datasets. These results highlight how strongly benchmark design influences what is measured, and that limitations of a benchmark can be mistaken for limitations of the models developed with it.
In the second part, I broaden the view and analyze the connection between model failures in different areas of machine learning. I find that many of these failures can be explained by shortcut learning: models exploiting a mismatch between a benchmark and its associated capability. Shortcut solutions rely on superficial cues that work very well within the training domain but are unrelated to the capability itself. This demonstrates that good benchmark performance is not sufficient to prove that a model has acquired the associated capability, and that results have to be interpreted carefully.
Taken together, these findings call into question the common practice of evaluating models on a single benchmark, or at most a few. Rather, my results indicate that to anticipate model failures, it is essential to measure broadly, and to avoid them, it is necessary to verify that models acquire the desired capability. This will require investment in better data, new benchmarks and other complementary forms of evaluation, but it provides the basis for further progress towards powerful, reliable and safe models. |
en |
dc.language.iso |
en |
de_DE |
dc.publisher |
Universität Tübingen |
de_DE |
dc.rights |
ubt-podno |
de_DE |
dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de |
de_DE |
dc.rights.uri |
http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en |
en |
dc.subject.classification |
Maschinelles Lernen, Maschinelles Sehen |
de_DE |
dc.subject.ddc |
004 |
de_DE |
dc.subject.other |
machine learning |
en |
dc.subject.other |
deep learning |
en |
dc.subject.other |
computer vision |
en |
dc.subject.other |
benchmarking |
en |
dc.title |
Why Machine Learning Models Fail: A Benchmarking Perspective |
en |
dc.type |
PhDThesis |
de_DE |
dcterms.dateAccepted |
2023-12-19 |
|
utue.publikation.fachbereich |
Informatik |
de_DE |
utue.publikation.fakultaet |
7 Mathematisch-Naturwissenschaftliche Fakultät |
de_DE |
utue.publikation.noppn |
yes |
de_DE |