Vision-language-action model pi_0 for humanoid robot dexterous manipulation

School of Electrical Engineering | Master's thesis
Electronic archive copy is available via Aalto Thesis Database.

Language

en

Pages

63

Abstract

Recent Vision-Language-Action (VLA) models have shown considerable promise in robotic manipulation, but their ability to generalize to high-degree-of-freedom (DoF) humanoid robots with dexterous hands remains largely unexplored. This thesis systematically examines the cross-embodiment transfer of the pre-trained pi_0 model from conventional robotic arms to a humanoid platform. We build a complete "Demonstration-to-Deployment" pipeline comprising a custom teleoperation system, a post-training framework based on flow matching, and a strict two-stage evaluation process. Open-loop evaluation exposed a significant data pathology, a misalignment of 20 timestamps between visual information and proprioceptive states; correcting it proved essential for reducing initial prediction errors. Systematic ablation studies yield two findings about training: task diversity matters more for generalization than data quantity, and a large batch size is essential for stable convergence of the flow-matching objective. The model, fine-tuned on a diverse set of 20 tasks (860 episodes), achieved a 72.2% success rate in positional generalization tests and success rates of 50%–70% in Pick-and-Place and Object Sorting tasks. The policy also showed new abilities: semantic understanding in cluttered scenes, adaptation to changes in the environment, and autonomous recovery from mistakes in long-horizon tasks. These outcomes validate the transferability of pre-trained VLA models to a high-DoF humanoid robot through data-efficient fine-tuning and provide a substantiated basis for subsequent work on general-purpose embodied intelligence.
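
The misalignment between visual and proprioceptive streams described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `realign`, the list-based data layout, and the direction of the offset (proprioceptive states lagging the images) are all assumptions made for illustration, and the offset of 20 is simply the figure quoted in the abstract, treated here as 20 steps.

```python
def realign(frames, states, offset):
    """Pair frames[i] with states[i + offset], dropping unmatched ends.

    Hypothetical helper: `offset` is the number of steps by which the
    state stream is assumed to lag the image stream (negative values
    mean the images lag instead).
    """
    # No overlap is left if the shift exceeds either stream's length.
    if abs(offset) >= min(len(frames), len(states)):
        return []
    if offset >= 0:
        frames_aligned = frames[: len(frames) - offset]
        states_aligned = states[offset:]
    else:
        frames_aligned = frames[-offset:]
        states_aligned = states[: len(states) + offset]
    # Trim to the shorter stream so every frame has exactly one state.
    n = min(len(frames_aligned), len(states_aligned))
    return list(zip(frames_aligned[:n], states_aligned[:n]))


# Toy episode: 100 frame indices and 100 labelled states.
frames = list(range(100))
states = [f"s{i}" for i in range(100)]
pairs = realign(frames, states, 20)
```

In this toy episode the shift discards 20 unmatched samples at each end, so 80 aligned pairs remain. Which stream actually lags, and by how much, would have to be established from the recorded data itself, for example by comparing open-loop predictions against logged actions.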

Supervisor

Zhou, Quan

Thesis advisor

Wu, Long
