Citation link (URN)
https://nbn-resolving.de/urn:nbn:de:gbv:46-00104531-10

Analysis and Modeling of Visual Invariance for Object Recognition and Spatial Cognition

Publication date
2015-05-29
Authors
Eberhardt, Sven
Supervisor
Schill, Kerstin
Reviewer
Fahle, Manfred
Abstract
The human visual system is unmatched by its machine counterparts in its universal ability to perform a great number of complex tasks such as object detection, tracking and categorization, scene perception and localization, seemingly effortlessly and instantly. It can quickly adapt to novel problems, learn concepts from few samples, build and reason on abstract representations, and merge information from multiple senses. Of particular interest is processing in the ventral stream of the human visual cortex, because it solves a multitude of complex scene analysis tasks in under 200 milliseconds. Here, a functional, dataset-driven analysis approach is followed. Feature outputs from several specialized vision models, including Textons, Gist, HMax, SIFT and Spatial Pyramids, are analyzed for their diagnosticity on a number of tasks typically attributed to human ventral stream processing. A strong performance dissociation between models and tasks, dependent on invariance properties, is found. From these findings, a conceptual space is proposed into which both vision models and associated task requirements are placed based on local and global invariance dimensions. Following this concept, a general-purpose, hierarchical vision model is suggested in which specialization is realized as tuning of receptive field ranges and task-dependent weights. As an example of an application of this conceptual space, the specific task of vision-based localization is cast as a classification problem. Place categorization in several contexts, including indoor, outdoor and virtual-world environments, is sorted into the conceptual space of vision requirements. From this, a universal descriptor called "Signature of a Place" is introduced which outperforms baseline models on all localization tasks. Correlation with human performance is tested, yielding an orthogonal result. The question of self-organized learning in hierarchical systems is analyzed, and a novel approach utilizing cross-modal feature training between visual and auditory cues in a deep learning hierarchy is presented. The model is able to generate audio predictions from video input and to explain previous human psychophysics results on multi-modal difference thresholds.
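As a rough illustration of the dataset-driven evaluation approach described in the abstract (not the code used in the thesis), the following minimal Python sketch shows how the diagnosticity of a precomputed feature representation for a given task could be estimated with a cross-validated linear readout. The feature arrays, label vector, and the choice of a scikit-learn linear SVM are all assumptions made for illustration; the actual feature models (Textons, Gist, HMax, SIFT, Spatial Pyramids) would supply X in practice.

# Minimal sketch: estimating how diagnostic a feature representation is for a
# visual task by training a linear readout on top of precomputed features.
# Feature extraction itself is assumed to happen elsewhere; X is an
# (n_samples, n_features) array and y holds the task labels
# (e.g. object categories or place identities).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def diagnosticity(X: np.ndarray, y: np.ndarray, folds: int = 5) -> float:
    """Mean cross-validated accuracy of a linear readout on features X for task y."""
    clf = make_pipeline(StandardScaler(), LinearSVC())
    return float(cross_val_score(clf, X, y, cv=folds).mean())

# Placeholder random features for two hypothetical feature models;
# real experiments would compare actual model outputs on the same images.
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=500)            # 10 task categories
X_model_a = rng.normal(size=(500, 512))      # stand-in for e.g. Gist features
X_model_b = rng.normal(size=(500, 1024))     # stand-in for e.g. HMax features
print("model A:", diagnosticity(X_model_a, y))
print("model B:", diagnosticity(X_model_b, y))

Comparing such scores across feature models and tasks is one simple way to expose the kind of performance dissociation between models and task invariance requirements that the abstract refers to.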
Keywords
dissertation; vision; visual system; localization; spatial cognition; object recognition; deep learning; modeling; image processing
Institution
Universität Bremen  
Faculty
Fachbereich 03: Mathematik/Informatik (FB 03)
Document type
Dissertation
Secondary publication
No
Language
English
Files
Name: 00104531-1.pdf
Size: 33.37 MB
Format: Adobe PDF
Checksum (MD5): 38f712180badce89be94266accfa7899
