Citation link (DOI)
10.26092/elib/5239

Understanding human activities from videos: enabling robotic agents for semantic manipulation and task execution

Publication date
2025-12-11
Authors
Siddiky, Feroz Ahmed
Supervisors
Beetz, Michael  
Ramirez Amaro, Karinne  
Reviewers
Beetz, Michael  
Ramirez Amaro, Karinne  
Abstract
This dissertation presents an end-to-end framework that enables robotic agents to acquire manipulation competence from human demonstration videos by unifying semantic abstraction, 3D spatial reasoning, and physics-informed validation within a single coherent pipeline. From raw monocular video, the system extracts temporally aligned evidence of human–object interaction, including contact cues, grasp configurations, spatial relations, and motion evolution across canonical manipulation phases (approach, grasp, move, release). These observations are organized into an Event Designator that decomposes activity into three complementary representations—Action, Motion, and Scene Designators—capturing what is being done, how it unfolds over time, and where it occurs with respect to the geometric and semantic structure of the environment. Through multimodal integration of vision, language, and geometry, the framework grounds task-level reasoning in measurable physical evidence, producing interpretable and reusable task knowledge rather than isolated labels or free-form descriptions.
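
To make the three-way decomposition concrete, the following Python sketch shows one plausible shape for these structures. Only the designator names, the four phase names, and the what/how/where split come from the abstract; every field name and type is a hypothetical illustration, not the dissertation's actual schema.

```python
# A minimal sketch of the Event Designator decomposition described above.
# Designator and phase names follow the abstract; all fields are assumptions.
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ManipulationPhase(Enum):
    """Canonical manipulation phases named in the abstract."""
    APPROACH = "approach"
    GRASP = "grasp"
    MOVE = "move"
    RELEASE = "release"


@dataclass
class ActionDesignator:
    """*What* is being done: task-level semantics of the interaction."""
    verb: str                # e.g. "pick", "pour" (hypothetical labels)
    target_object: str       # object the hand acts on
    grasp_configuration: str # grasp type observed in the video


@dataclass
class MotionDesignator:
    """*How* the action unfolds over time, one record per phase."""
    phase: ManipulationPhase
    start_time: float        # seconds into the demonstration video
    end_time: float
    trajectory: List[tuple] = field(default_factory=list)  # sampled poses


@dataclass
class SceneDesignator:
    """*Where* the action occurs: geometric and semantic scene context."""
    spatial_relations: Dict[str, str] = field(default_factory=dict)
    contact_pairs: List[tuple] = field(default_factory=list)  # (hand, object)


@dataclass
class EventDesignator:
    """Binds the three complementary representations of one observed event."""
    action: ActionDesignator
    motions: List[MotionDesignator]
    scene: SceneDesignator


# Illustrative instance: a single "pick cup" event with one observed phase.
example = EventDesignator(
    action=ActionDesignator("pick", "cup", "power_grasp"),
    motions=[MotionDesignator(ManipulationPhase.APPROACH, 0.0, 1.2)],
    scene=SceneDesignator(spatial_relations={"cup": "on_table"}),
)
```

A real schema would carry far richer evidence per field (contact forces, SMPL-X pose streams, reconstructed scene geometry), but the composition pattern is the point: one event, three coordinated views.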

To ensure physical reliability and transferability, the learned designators are validated and refined within a physics-enabled digital twin that explicitly models contact dynamics, stability, friction, and constraint satisfaction. This perception–simulation loop supports systematic error analysis and iterative refinement, converting semantically grounded representations into execution-ready parameters such as grasp strategies, motion constraints, and tool–object interaction profiles. The dissertation contributes (i) a fine-grained manipulation phase analysis pipeline for extracting atomic interaction structure from video, (ii) the Task–Event Designator architecture for hierarchical, causally coherent task representation, (iii) an enhanced SMPL-X–based module for expressive motion capture and contact-aware human–scene reasoning coupled with 3D scene reconstruction, and (iv) a translation mechanism enabling cross-embodiment execution via a general-purpose FK–IK controller, demonstrated on real robotic platforms. Collectively, this work advances video-based robot learning from imitation toward context-aware understanding and physically grounded execution, improving interpretability, generalization, and robustness across diverse tasks and environments.
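
The perception–simulation loop described above can be read as a simple validate-and-refine iteration. Below is a minimal, self-contained Python sketch under stated assumptions: DigitalTwin, SimulationResult, and the refinement callback are illustrative stand-ins for the dissertation's physics backend and error analysis, not its actual interfaces.

```python
# A hedged sketch of the perception-simulation refinement loop: validate a
# learned designator in a physics-enabled digital twin, refine on failure.
# All class and method names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SimulationResult:
    constraints_satisfied: bool            # stability, friction, contact checks
    errors: List[str] = field(default_factory=list)


class DigitalTwin:
    """Stand-in for the physics-enabled digital twin."""

    def simulate(self, event) -> SimulationResult:
        # A real implementation would roll the event out under contact
        # dynamics and report which physical constraints were violated.
        raise NotImplementedError

    def extract_execution_parameters(self, event) -> dict:
        # Grasp strategy, motion constraints, tool-object interaction profile.
        raise NotImplementedError


def validate_and_refine(event, twin: DigitalTwin,
                        refine: Callable, max_iterations: int = 10) -> Optional[dict]:
    """Iterate simulate -> error analysis -> refine until checks pass."""
    for _ in range(max_iterations):
        result = twin.simulate(event)
        if result.constraints_satisfied:
            return twin.extract_execution_parameters(event)
        event = refine(event, result.errors)  # error-driven adjustment
    return None  # could not produce execution-ready parameters
```

The loop terminates either with execution-ready parameters (grasp strategy, motion constraints, tool–object profiles) or with an explicit failure after a bounded number of refinement rounds, which is what makes the systematic error analysis possible.
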
Keywords
Robot Learning; Deep Learning; 3D Scene Understanding; Temporal Modeling; Embodied AI; Artificial Intelligence; Machine Learning
Institution
Universität Bremen  
Faculty
Fachbereich 03: Mathematik/Informatik (FB 03)  
Institute
AICOR Institute for Artificial Intelligence (IAI)
Document type
Dissertation
License
https://creativecommons.org/licenses/by/4.0/
Language
English
Files
Name: Understanding human activities from videos.pdf
Size: 104.42 MB
Format: Adobe PDF
Checksum (MD5): ac0e173acda15ab5402e67f5a0cdd84a
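
To check the integrity of a downloaded copy against the published MD5 checksum, a short Python verification can be used; the local filename is assumed to match the name listed above.

```python
# Verify the downloaded PDF against the MD5 checksum published above.
# The local path is an assumption; adjust to wherever the file was saved.
import hashlib

EXPECTED_MD5 = "ac0e173acda15ab5402e67f5a0cdd84a"


def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so a large PDF (~104 MB) fits in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    actual = md5_of("Understanding human activities from videos.pdf")
    print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
```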
