
Physical AI training data: What good pre-training and post-training datasets look like


Akarsh Anind

Senior Manager, Physical AI Solutions

Key takeaways:

  • Generalist robot policies leverage vision-language-action (VLA) models that use internet-scale pre-training to help robots understand the world.
  • Effective deployment requires a mix of multi-perspective sensor data and generative 'diffusion' policies.
  • Training only on golden trajectories is why pilots stall. Production-grade models need 15–25% failure and recovery data.

Robots are folding clothes, sorting parts and cleaning kitchens in research labs around the world. But most of those demos don't survive the move from a controlled cell to 24/7 production. For decades, industrial automation has relied on rigid, task-specific solutions: Encode a sequence of positions and the robot executes that path forever. The real world doesn't work that way. Most jobs aren't fully repetitive; they're structured yet variable, with different parts, slightly bent boxes, inconsistent lighting, worn fixtures and unpredictable human behavior.

Handling that variability requires high-quality, multi-robot, multi-operator datasets of diverse teleoperated behaviors. The hard part is composing a training dataset that teaches a policy to perceive, recover and adapt at the same time it learns to execute the job well.

This article examines the data composition problem in physical AI, what good pre-training and post-training data look like and how TELUS Digital addresses both in a single pipeline.

What is changing in robotics model architectures?

Vision-language-action (VLA) models are now the dominant architecture for generalist robot policies. Early VLAs were built on pre-trained vision-language models like Gemma 3 by appending an action head that emits discrete low-level robot commands — an approach popularized by Google DeepMind's RT-2. However, the latest state-of-the-art models, such as Physical Intelligence's π0 and NVIDIA's GR00T N1, have evolved this approach. Instead of simple discrete action heads, they deeply integrate VLM backbones with continuous flow-matching or diffusion architectures to generate smooth, fluid trajectories.

The case for VLAs is pragmatic. Instead of building a perception-and-control stack from scratch, a VLA reuses the existing internet-scale pre-training of a vision-language model (VLM) and learns the action layer on top.

Two complementary techniques sit alongside the VLA backbone:

  • Diffusion policy, pioneered by researchers at Columbia University, MIT and the Toyota Research Institute, treats the gripper trajectory as a generative output, applying the same denoising idea used in image models to motor control. This often improves how quickly robots learn dexterous manipulation skills.
  • Action chunking, where the policy predicts a short window of future steps instead of a single step, smooths execution and improves success rates on long-horizon tasks.
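As a minimal sketch of the action chunking idea (the dummy policy, the stand-in environment step and the dimensions are all hypothetical, not any specific model's API), the policy predicts a window of actions per inference call and executes the whole window before re-planning:

```python
import numpy as np

CHUNK = 8        # future steps predicted per inference call
ACTION_DIM = 7   # e.g. 6-DoF end-effector delta + gripper command

def dummy_policy(observation):
    # Stand-in for a learned policy: emits a whole action chunk at once
    # instead of a single next step.
    rng = np.random.default_rng(0)
    return rng.normal(size=(CHUNK, ACTION_DIM))

def run_episode(policy, horizon=80):
    """Execute each predicted chunk open-loop, then re-observe and re-plan."""
    executed = []
    obs = np.zeros(ACTION_DIM)
    for _ in range(horizon // CHUNK):
        chunk = policy(obs)        # one inference call yields CHUNK actions
        for action in chunk:
            obs = obs + action     # stand-in for env.step(action)
            executed.append(action)
    return len(executed)

print(run_episode(dummy_policy))  # → 80
```

Fewer inference calls per episode is what smooths execution on long-horizon tasks: the policy commits to a short, coherent motion rather than re-deciding at every timestep.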

Robots excel at interpolating within their training distribution. A model trained to fold shirts handles variations in lighting or table surface well, but ask it to hang a jacket and it often fails. The Physical Intelligence team's key finding was how to compose training data to address this: curated demonstrations alone leave a model unable to recover from mistakes it never saw, while broad, lower-fidelity data produces a model that stumbles through edge cases without executing any task reliably. Model developers need:

  • A wide, multisource pre-training base to teach the model to perceive, recover and adapt.
  • A tighter, expert-led post-training set to teach the model to do the actual job well.

What good pre-training data looks like for physical AI

Pre-training data has to teach the model how the world behaves and how the robot moves through it. Five characteristics matter here.

1. Cross-embodiment coverage

Models trained on one robot transfer poorly to a robot with different joint counts, gripper configurations or camera placements. Pre-training data should span single-arm and dual-arm manipulators, mobile bases, humanoids and the cobots a customer is likely to deploy.

2. Multi-camera, multi-perspective capture

Egocentric (head-mounted) footage provides planning signals and captures operator focus. Overhead stereo gives global scene geometry and absolute object positions. Wrist-mounted cameras give grasp configuration, contact geometry and micro-adjustments during insertion or assembly. Models trained only on egocentric data often execute the right gripper trajectory in the wrong absolute position when the scene layout changes. Three perspectives, captured in sync, are the practical minimum.

3. Multimodal sensor streams

RGB, depth, lidar, force-torque, inertial measurement unit (IMU) and gripper state must all be hardware-clock synchronized and cross-validated so that a labeling error in one stream surfaces against another.
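A hedged sketch of what that cross-validation can look like in practice; the stream names, the 2 ms tolerance and the `check_sync` helper are illustrative, not a TELUS Digital tool:

```python
def check_sync(streams: dict, tolerance_ns: int = 2_000_000) -> list:
    """Flag sample indices whose hardware-clock timestamps drift apart by
    more than `tolerance_ns` (2 ms here) across any pair of streams.
    `streams` maps a sensor name to a list of per-sample timestamps."""
    lengths = {len(ts) for ts in streams.values()}
    assert len(lengths) == 1, "streams must have equal sample counts"
    bad = []
    for i, stamps in enumerate(zip(*streams.values())):
        if max(stamps) - min(stamps) > tolerance_ns:
            bad.append(i)
    return bad

streams = {
    "rgb":   [0, 10_000_000, 20_000_000],
    "depth": [100_000, 10_100_000, 25_000_000],  # last sample drifted 5 ms
    "imu":   [50_000, 10_050_000, 20_050_000],
}
print(check_sync(streams))  # → [2]
```

The point is that a timing or labeling error in one stream surfaces against the others instead of silently corrupting the episode.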

4. Environment and scene diversity

Coverage should include lighting variation, fixture wear, partial occlusions, novel object configurations and unfamiliar layouts. The pre-training set should look like the long tail of conditions the policy will eventually meet.

5. Failure and recovery as first-class data

Production policies need to handle slips, dropped objects, mis-grasps and human interruptions. Industry guidance suggests 15% to 25% of a training corpus should be partial failures and recovery attempts, with explicit success and failure attribution per episode.
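As an illustration of per-episode attribution (the `outcome` labels and the `composition_report` helper are hypothetical), checking a corpus against the 15–25% band might look like:

```python
def composition_report(episodes):
    """Compute the failure-and-recovery share of a corpus, assuming each
    episode dict carries an explicit `outcome` attribution."""
    n = len(episodes)
    failures = sum(e["outcome"] in ("partial_failure", "recovery")
                   for e in episodes)
    share = failures / n
    return {"episodes": n,
            "failure_share": share,
            "in_band": 0.15 <= share <= 0.25}

corpus = (
    [{"outcome": "success"}] * 80
    + [{"outcome": "partial_failure"}] * 12
    + [{"outcome": "recovery"}] * 8
)
print(composition_report(corpus))
# → {'episodes': 100, 'failure_share': 0.2, 'in_band': True}
```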

What good post-training data looks like for physical AI

Post-training data is where the policy learns to do your job, on your hardware, in your facility, at production quality. There are four key characteristics of good post-training data.

1. Expert golden trajectories

Teleoperated demonstrations from credentialed operators, typically five to 100 hours per task. The trajectories are smooth, efficient and consistent in strategy. They are the canonical reference the policy is expected to match.

2. Causal and counterfactual reasoning labels

These are rich annotations that capture what happened, why it happened and what would have changed under different conditions. Rather than teaching the low-level motor policy how to physically execute a grasp, these labels fine-tune the high-level reasoning capabilities of the model. They help the policy think aloud and understand context, for example, reasoning that a strategy succeeded because the contact point was on a rigid surface rather than a deformable label.

3. Chain-of-thought and action justification

These are step-by-step explainability narratives that decompose a task into approach, grasp, manipulate and place phases, with rationale for each transition. This is the data that supports policy debugging and post-hoc evaluation.

4. Human preference and correction data

As an emerging and highly valuable capability in physical AI, reinforcement learning from human feedback (RLHF) signals are increasingly being tailored to continuous physical tasks. This data captures nuanced preferences: which trajectory was smoother, which action was safer or which overall behavior best fits a customer's specific work cell.
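A sketch of what one such preference record could contain; the field names are illustrative, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryPreference:
    """One human comparison between two recorded trajectories for the
    same task, with the criterion and rationale kept alongside the verdict."""
    episode_a: str   # identifier of the first trajectory
    episode_b: str   # identifier of the second trajectory
    preferred: str   # "a", "b" or "tie"
    criterion: str   # e.g. "smoothness", "safety", "cell_fit"
    rationale: str   # free-text justification from the rater

pref = TrajectoryPreference(
    episode_a="ep_0412", episode_b="ep_0413",
    preferred="a", criterion="safety",
    rationale="Kept the arm clear of the operator's reach envelope.",
)
```

Keeping the criterion and rationale attached to each comparison is what lets a reward model learn customer-specific notions of "better", rather than a single generic preference.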

Post-training data is small in volume compared to pre-training data and disproportionately important to real-world performance. Done well, it closes the gap between a generalist policy and a production system.

How TELUS Digital builds pre- and post-training data in one pipeline

Pre-training and post-training data look different, get collected through different modes and serve different objectives. Most providers split them: one team scrapes or aggregates broad interaction data, while a separate team runs teleoperated demonstrations in a controlled lab. When those streams meet at the model, schema mismatches, calibration drift between sensor stacks and inconsistent failure attribution surface.

At TELUS Digital, we have developed and refined an integrated data pipeline approach across nine years of complex computer vision and sensor fusion programs, including large-scale autonomous vehicle deployments. Physical AI, for us, is an extension of that operating model into a wider set of embodiments and sensor stacks.

Unified collection architecture

Rather than segregating data streams, our approach implements calibration and sensor-synchronization standards across three concurrent collection modes, enabling semantic composability at training time:

  • Digital capture stream:
    • Egocentric and over-the-shoulder video from operators, technicians and drivers.
    • Multimodal sensor streams including RGB, depth, audio and telemetry from robots and edge cameras.
  • Onsite-moderated capture stream:
    • Virtual reality and teleoperator sessions in production cell simulations.
    • Multi-camera robot demonstrations.
    • Human-robot interaction sessions for collaborative and humanoid systems.
  • Field operations stream:
    • Autonomous vehicle field operational testing on instrumented fleets.
    • Autonomous guided vehicle (AGV) and autonomous mobile robot (AMR) data from operational warehouses, including stoppage and downtime events.
    • Production deployment data from robots operating in target environments.

Failure and recovery as structured data

Achieving the required 15–25% failure-and-recovery composition within training datasets is difficult when collection and annotation frameworks are designed around nominal trajectories. We address this by structuring failure and recovery as a first-class data category requiring the same rigor as nominal operation.

Our Ground Truth Studio implements annotation schemas that correspond directly to policy model consumption requirements: 2D and 3D vector and semantic labeling, kinematics and pose estimation, manipulation and grasping classification, language-motion temporal pairing, action segmentation and much more.
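As a rough illustration of what a language-motion temporal pairing and action-segmentation record can look like (the field names are hypothetical, not Ground Truth Studio's actual schema):

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    """One segment of an episode, pairing a time span on the shared
    hardware clock with a phase label, a natural-language instruction
    and a per-segment success attribution."""
    start_ns: int     # segment start on the shared hardware clock
    end_ns: int       # segment end
    phase: str        # e.g. "approach", "grasp", "manipulate", "place"
    instruction: str  # language description paired to the motion
    success: bool     # explicit per-segment attribution

segments = [
    ActionSegment(0, 1_200_000_000, "approach",
                  "move to the red bin", True),
    ActionSegment(1_200_000_000, 2_500_000_000, "grasp",
                  "pick the bracket", False),  # failed grasp is labeled, not discarded
]
```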

Domain expertise integration in post-training

Post-training annotation is the phase where domain expertise compounds measurably. Fine-Tune Studio handles golden trajectories from expert teleoperators, causal and counterfactual reasoning annotations, chain-of-thought explainability narratives, action-justification feedback loops and deliberate failure-and-recovery capture.

Experts Engine routes each task to the right reviewer: a robotics-aware annotator, a teleoperator or an automotive-trained QA expert. Each tier processes tasks within their domain of expertise. Annotations addressing production-phase requirements such as the reasons a grasp succeeds on rigid surfaces but fails on deformable materials emerge only when the reviewer possesses the engineering context to reason about the failure systematically.

Running all three streams under one operational pipeline takes cross-team coordination, shared QA standards and integrated tooling that most providers don't have in place. The alternative is stitching together multiple vendors, each with their own SLAs, points of contact and data delivery cadences. That coordination tax lands on your ML team and eats time that should go into training the model.

Is your data infrastructure ready to scale?

For physical AI to ship in production rather than impress on stage, three open questions still need to be answered:

  1. Does the policy perform a variety of dexterous, long-horizon tasks?
  2. Does the policy succeed in places it has never been?
  3. Does the policy respond to open-ended prompts and interjections during execution?

All three are data composition problems before they are model problems. The transformative potential of physical AI, from warehouse automation to agricultural robotics, depends on solving the data challenge. Model builders need data that is:

  • Abundant: High-volume collection infrastructure operating globally
  • Diverse: Coverage across environments, objects, tasks and failure modes
  • Enriched: Annotations encoding intent, task structure, spatial reasoning and causal relationships

As physical AI models grow more capable, the data infrastructure underneath them has to grow with them. That's a harder problem than it looks — one that requires years of investment in sensor synchronization, annotation tooling and global contributor networks before the first robot ships. At TELUS Digital, that groundwork is already in place. Contact our team to build your custom data pipeline.

