Vision-language-action model pi_0 for humanoid robot dexterous manipulation

dc.contributor: Aalto-yliopisto [fi]
dc.contributor: Aalto University [en]
dc.contributor.advisor: Wu, Long
dc.contributor.author: Gao, Zhinan
dc.contributor.school: Sähkötekniikan korkeakoulu [fi]
dc.contributor.school: School of Electrical Engineering [en]
dc.contributor.supervisor: Zhou, Quan
dc.date.accessioned: 2026-01-20T18:09:47Z
dc.date.available: 2026-01-20T18:09:47Z
dc.date.issued: 2025-12-20
dc.description.abstract: Recent developments in Vision-Language-Action (VLA) models have shown considerable promise in robotic manipulation. However, their generalization to high-degree-of-freedom (DoF) humanoid robots with dexterous hands remains largely unexplored. This thesis systematically examines the cross-embodiment transfer of the pre-trained pi_0 model from conventional robotic arms to a humanoid platform. We build a complete "Demonstration-to-Deployment" pipeline comprising a custom teleoperation system, a post-training framework based on flow matching, and a strict two-stage evaluation process. Open-loop evaluation revealed a significant data pathology, a 20-timestamp misalignment between visual information and proprioceptive states; correcting it proved essential for reducing initial prediction errors. Systematic ablation studies point to two critical training requirements: task diversity matters more for generalization than data quantity, and a large batch size is essential for stable convergence of the flow-matching objective. The model, fine-tuned on a diverse set of 20 tasks (860 episodes), achieved a 72.2% success rate in positional generalization tests and 50%–70% success rates in Pick-and-Place and Object Sorting tasks. The policy also exhibited new abilities, including semantic understanding in cluttered scenes, adaptation to environmental changes, and autonomous recovery from mistakes in long-horizon tasks. These results validate the transferability of pre-trained VLA models to a high-DoF humanoid robot through data-efficient fine-tuning and provide a substantiated basis for future research on general-purpose embodied intelligence. [en]
dc.format.extent: 63
dc.format.mimetype: application/pdf [en]
dc.identifier.uri: https://aaltodoc.aalto.fi/handle/123456789/142266
dc.identifier.urn: URN:NBN:fi:aalto-202601201640
dc.language.iso: en [en]
dc.location: P1 [fi]
dc.programme: Master's Programme in ICT Innovation [en]
dc.programme: Master's Programme in ICT Innovation [fi]
dc.programme: Master's Programme in ICT Innovation [sv]
dc.programme.major: Autonomous Systems [en]
dc.subject.keyword: vision-language-action models (VLA) [en]
dc.subject.keyword: humanoid robots [en]
dc.subject.keyword: dexterous manipulation [en]
dc.subject.keyword: cross-embodiment generalization [en]
dc.subject.keyword: teleoperation [en]
dc.subject.keyword: foundation model [en]
dc.title: Vision-language-action model pi_0 for humanoid robot dexterous manipulation [en]
dc.type: G2 Pro gradu, diplomityö [fi]
dc.type.ontasot: Master's thesis [en]
dc.type.ontasot: Diplomityö [fi]
local.aalto.electroniconly: yes
local.aalto.openaccess: no
