Understanding human activities from videos: enabling robotic agents for semantic manipulation and task execution
Publication date
2025-12-11
Abstract
This dissertation presents an end-to-end framework that enables robotic agents to acquire manipulation competence from human demonstration videos by unifying semantic abstraction, 3D spatial reasoning, and physics-informed validation within a single coherent pipeline. From raw monocular video, the system extracts temporally aligned evidence of human–object interaction, including contact cues, grasp configurations, spatial relations, and motion evolution across canonical manipulation phases (approach, grasp, move, release). These observations are organized into an Event Designator that decomposes activity into three complementary representations—Action, Motion, and Scene Designators—capturing what is being done, how it unfolds over time, and where it occurs with respect to the geometric and semantic structure of the environment. Through multimodal integration of vision, language, and geometry, the framework grounds task-level reasoning in measurable physical evidence, producing interpretable and reusable task knowledge rather than isolated labels or free-form descriptions.
To ensure physical reliability and transferability, the learned designators are validated and refined within a physics-enabled digital twin that explicitly models contact dynamics, stability, friction, and constraint satisfaction. This perception–simulation loop supports systematic error analysis and iterative refinement, converting semantically grounded representations into execution-ready parameters such as grasp strategies, motion constraints, and tool–object interaction profiles. The dissertation contributes (i) a fine-grained manipulation phase analysis pipeline for extracting atomic interaction structure from video, (ii) the Task–Event Designator architecture for hierarchical, causally coherent task representation, (iii) an enhanced SMPL-X–based module for expressive motion capture and contact-aware human–scene reasoning coupled with 3D scene reconstruction, and (iv) a translation mechanism enabling cross-embodiment execution via a general-purpose FK–IK controller, demonstrated on real robotic platforms. Collectively, this work advances video-based robot learning from imitation toward context-aware understanding and physically grounded execution, improving interpretability, generalization, and robustness across diverse tasks and environments.
Keywords
Robot Learning; Deep Learning; 3D Scene Understanding; Temporal Modeling; Embodied AI; Artificial Intelligence; Machine Learning
Institute
AICOR Institute for Artificial Intelligence (IAI)
Document type
Dissertation
Language
English
Files
Name: Understanding human activities from videos.pdf
Size: 104.42 MB
Format: Adobe PDF
Checksum (MD5): ac0e173acda15ab5402e67f5a0cdd84a
