Vision-language-action model pi_0 for humanoid robot dexterous manipulation
School of Electrical Engineering
Master's thesis
Electronic archive copy is available via Aalto Thesis Database.
Language
en
Pages
63
Abstract
Recent Vision-Language-Action (VLA) models have shown considerable promise in robotic manipulation, but their ability to generalize to high-degree-of-freedom (DoF) humanoid robots with dexterous hands remains largely unexplored. This thesis systematically examines the cross-embodiment transfer of the pre-trained pi_0 model from conventional robotic arms to a humanoid platform. We build a complete "Demonstration-to-Deployment" pipeline comprising a custom teleoperation system, a post-training framework based on flow matching, and a rigorous two-stage evaluation process. Open-loop evaluation exposed a significant data pathology, a 20-timestamp misalignment between visual information and proprioceptive states; correcting it proved essential for reducing initial prediction errors. Systematic ablation studies yield two findings about training: task diversity matters more for generalization than data quantity, and a large batch size is essential for stable convergence of the flow-matching objective. The model, fine-tuned on a diverse set of 20 tasks (860 episodes), achieved a 72.2% success rate in positional generalization tests and 50%–70% success rates on Pick-and-Place and Object Sorting tasks. The policy also exhibited emergent abilities: semantic understanding in cluttered scenes, adaptation to changes in the environment, and autonomous recovery from errors in long-horizon tasks. These results validate the transferability of pre-trained VLA models to a high-DoF humanoid robot through data-efficient fine-tuning and provide an empirical basis for further work on general-purpose embodied intelligence.
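The abstract does not say which stream lagged or how the 20-timestamp misalignment was corrected. As an illustration only, the following minimal sketch shows one simple way such an offset could be removed, by re-pairing each camera frame with the proprioceptive state recorded a fixed number of steps later. The function name `realign`, the sign convention for `offset`, and the toy data are assumptions made for this example, not the thesis's implementation.

```python
from typing import List, Sequence, Tuple, TypeVar

I = TypeVar("I")  # image / visual observation type
S = TypeVar("S")  # proprioceptive state type


def realign(
    images: Sequence[I], states: Sequence[S], offset: int = 20
) -> List[Tuple[I, S]]:
    """Pair images[t] with states[t + offset].

    Assumption (hypothetical): a positive offset means the state stream is
    delayed relative to the camera stream; a negative offset means the
    opposite. Samples without a partner after shifting are dropped.
    """
    if offset >= 0:
        # zip stops at the shorter sequence, trimming the unmatched tail.
        return list(zip(images, states[offset:]))
    return list(zip(images[-offset:], states))


# Toy episode: 100 frames and 100 states, labelled by their recording index.
images = list(range(100))
states = [f"s{t}" for t in range(100)]
pairs = realign(images, states, offset=20)
```

With a shift of 20 on a 100-step episode, 80 aligned pairs remain; the first 20 states and last 20 frames are discarded rather than padded, so no training sample mixes observations from different moments.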