If we already have powerful simulators, why can't we train robots entirely in simulation?

Simulation is essential but can't fully replace real-world data because of the sim2real gap: the mismatch between simulated physics, sensors and rendering and how the physical world actually behaves. Rigid-body dynamics simulate reasonably well; contact-rich manipulation, deformable objects and sensor noise (lens flare, motion blur, depth artifacts) do not. The standard fix is domain randomization, which scrambles textures, lighting, friction and mass during training so the policy stops relying on any one set of simulated conditions, combined with small batches of real-world demonstrations to correct what the simulator gets wrong. Simulation accelerates iteration and provides cheap volume, but it doesn't replace in-domain data.
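As a rough illustration of domain randomization, the sketch below draws a fresh set of visual and physical parameters for every simulated episode. The parameter names, the ranges and the texture count are hypothetical placeholders, not values from any particular simulator.

```python
import random
from dataclasses import dataclass

NUM_TEXTURES = 50  # hypothetical size of a texture library


@dataclass
class EpisodeParams:
    texture_id: int
    light_intensity: float  # relative to a nominal value of 1.0
    friction: float         # surface friction coefficient
    mass_kg: float          # mass of the manipulated object


def sample_episode_params(rng: random.Random) -> EpisodeParams:
    """Draw one randomized configuration; all ranges are illustrative."""
    return EpisodeParams(
        texture_id=rng.randrange(NUM_TEXTURES),
        light_intensity=rng.uniform(0.3, 1.5),
        friction=rng.uniform(0.4, 1.2),
        mass_kg=rng.uniform(0.05, 2.0),
    )


rng = random.Random(0)  # seeded so runs are repeatable
episodes = [sample_episode_params(rng) for _ in range(1000)]
```

In a real pipeline, each sampled configuration would be applied to the simulator before the episode is rolled out, so the policy never sees the same conditions twice.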

Why is "in-domain" data more valuable than larger generic datasets for robotics?

Pre-training on large, diverse datasets teaches a model general world dynamics and motion priors, but the action distribution it produces is determined by fine-tuning data grounded in your specific embodiment, task and operating conditions. In-domain data also drives the "data flywheel." Every real deployment generates edge cases that fold back into training, and the closer that data is to the live operating envelope, the faster the policy's gains compound. In practice, that means heavy pre-training on broad data, then disproportionate investment in tightly scoped, in-domain demonstrations during fine-tuning.

What should we look for when choosing a data partner for a physical AI program?

Vendor data quality breaks down across three dimensions: completeness (sensor sync issues, edge-case omission, selection bias), accuracy (mislabeled actions, kinematics errors, granularity mismatch) and consistency (inter-annotator disagreement, cross-modality misalignment). The biggest lever is upstream: the annotation guidelines and the training of the people executing them. When evaluating a partner, ask whether they run multi-sensor capture with hardware-level synchronization; whether they have embedded expertise in teleoperation, kinematics and embodiment-specific protocols; whether they support expert-led demonstration for contact-rich and long-horizon tasks; whether they can produce physics-aware and kinematics-aware annotations suitable for VLA training; whether their contributor base is globally distributed enough to cover the long tail of environments and edge behaviors your fleet will encounter; and whether they can deliver data in step with your R&D cadence under a single SLA.


Why training data yield rate matters in physical AI

Posted June 4, 2026

Key takeaways

Physical AI requires domain-specific data captured in your exact environment for your exact use case. Robots cannot generalize from generic data, which makes manual demonstrations and continuous real-world calibration unavoidable.
Eight hours of raw collection typically yields only two to four hours of training-grade data after sensor, sync and QA losses. Curation is the real bottleneck.
Humanoids are versatile but fragile. For repetitive industrial work, specialized arms or wheeled platforms deliver faster, more reliable ROI.

As labor shortages and hazardous environments challenge industrial growth, adopting physical AI has moved from an open question to an operational one: How do you accomplish it without adding unnecessary overhead and complexity? Just as an autonomous vehicle must perceive a stopped car, decide to brake and actually stop, all within milliseconds and without one system waiting on another, a robot's sensors, decision model and actuators have to operate in lockstep or the whole thing falls apart.

Large language models learned by ingesting the internet. Physical AI can borrow part of the same playbook. By leveraging massive vision models trained on web data, modern robots gain immediate perceptual capabilities rather than starting from scratch. What the internet can’t supply is grounded action: the demonstrations, contact forces and embodiment-specific trajectories that teach a system not what the world looks like, but how to act in it. There is no web-scale corpus of movement, and that’s the real constraint.

At our recent World Models Summit, TELUS Digital’s Vice President of AI Growth and Solutions, Sce Pike, sat down with an elite lineup of robotics and AI pioneers to map out the current state of physical AI.

The World Models Summit panel included:

Myles Liu is director of business operations at Lightwheel, focusing on simulation infrastructure and curated egocentric data for VLA training.
Tory Smith is director of product management at Niantic Spatial, building foundation models for 3D reconstruction, visual localization and semantic understanding.
Ben Levin is director of robotics and physical AI data at NVIDIA, building the data stack behind Cosmos and GR00T.
Rajesh Radhakrishnan is vice president of autonomy at Serve Robotics, deploying autonomous sidewalk-delivery robots at production scale.

The panel represented an ideal operational workflow, running the gamut from the minds building foundational generalist models to the teams deploying autonomous hardware at full production scale. Together, they dove deep into the reality of data curation and the unpredictable ways humans interact with robots in the wild.

The data attrition problem

A major point of contention in modern AI is whether large language models (LLMs) can naturally evolve into generalist robotics models.

Current vision-language-action (VLA) models rely heavily on behavior cloning, in which a policy learns to imitate the actions recorded in a demonstration dataset and then reproduces them in the real world, Levin argued. However, he emphasized that the ultimate benchmark is real-world utility: "The key question is not if this is the right way from first principles to learn embodied intelligence, but will it do useful work in the real world?"
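To make the idea of behavior cloning concrete, here is a deliberately tiny sketch: supervised learning on (state, action) pairs, using a one-dimensional linear policy fitted by least squares. The demonstration data and the "expert" gain of -0.5 are invented for illustration; real VLA training uses far richer inputs and models.

```python
def fit_linear_policy(states, actions):
    """Least-squares fit of action = gain * state (no bias term)."""
    numerator = sum(s * a for s, a in zip(states, actions))
    denominator = sum(s * s for s in states)
    if denominator == 0:
        raise ValueError("demonstrations must include a non-zero state")
    return numerator / denominator


# Hypothetical expert demonstrations: the expert pushes back toward zero.
demo_states = [-2.0, -1.0, 0.5, 1.0, 3.0]
demo_actions = [-0.5 * s for s in demo_states]

gain = fit_linear_policy(demo_states, demo_actions)


def cloned_policy(state):
    """The learned policy simply imitates the demonstrated mapping."""
    return gain * state
```

The cloned policy is only as good as the demonstrations it was fitted to, which is why the quality and coverage of that data matter so much.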

Building VLA models introduces a massive operational hurdle: data alignment. "To train a VLA model, you need a lot of effort," noted Radhakrishnan. "You have to make sure that vision, action and language are completely aligned. Curating that data is incredibly expensive; it is the most difficult problem to solve right now in creating generalist models."
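One small piece of that alignment work can be sketched as a timestamp check: for every recorded action, find the nearest video frame and flag the action if the gap exceeds a tolerance. The function names, the millisecond units and the 20 ms tolerance are assumptions chosen for illustration.

```python
from bisect import bisect_left


def nearest_gap_ms(sorted_ts, t):
    """Distance from t to the closest timestamp in a sorted, non-empty list."""
    i = bisect_left(sorted_ts, t)
    candidates = sorted_ts[max(i - 1, 0): i + 1]
    return min(abs(t - c) for c in candidates)


def misaligned_actions(frame_ts_ms, action_ts_ms, tolerance_ms=20):
    """Return action timestamps with no video frame within the tolerance."""
    frames = sorted(frame_ts_ms)
    if not frames:
        return list(action_ts_ms)  # no frames at all: nothing can align
    return [t for t in action_ts_ms if nearest_gap_ms(frames, t) > tolerance_ms]


# Hypothetical episode: frames at roughly 30 Hz, one action arriving late.
frames_ms = [0, 33, 66, 100]
actions_ms = [1, 34, 150]
flagged = misaligned_actions(frames_ms, actions_ms)
```

Here the action at 150 ms is flagged because the closest frame is 50 ms away. A production check would also have to cover language annotations and clock drift between devices.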

AI robotics training is often hindered by fragmented hardware and low-fidelity data capture. Much of today’s training data is collected on research robots not suited for production environments, and many systems rely only on visual feedback, making delicate or contact-rich tasks difficult to execute.

Liu walked through the experience many teams face when collecting training data for physical AI, for example egocentric hand-pose data for training VLAs. A 10% accuracy gap in the hand-pose model wipes out 10% of usable data downstream. Collection-side failures (sensor dropouts, mis-synced modalities, missing pose metadata) take another 20%. QA rejections at the annotation step remove up to 20% more. Eight hours of raw collection becomes two to four hours of training-grade data.
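The attrition arithmetic can be worked through in a few lines. This is a back-of-the-envelope sketch, not the panel's own calculation: it assumes the three quoted loss rates either compound (each stage removes a share of what survived the previous one) or simply add up.

```python
def usable_hours(raw_hours, loss_rates, compound=True):
    """Hours of training-grade data left after a series of losses."""
    if compound:
        kept = 1.0
        for rate in loss_rates:
            kept *= 1.0 - rate  # each stage trims what the last one kept
    else:
        kept = max(0.0, 1.0 - sum(loss_rates))  # losses taken from the raw total
    return raw_hours * kept


# Loss rates quoted above: hand-pose model, collection failures, QA rejections.
losses = [0.10, 0.20, 0.20]

compounded = usable_hours(8, losses)                 # about 4.6 hours
additive = usable_hours(8, losses, compound=False)   # about 4.0 hours
```

Under either reading, the quoted rates land near the top of the two-to-four-hour range, so the lower end of that range would imply heavier losses at one or more stages than the figures listed.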

Levin was direct about the stakes. "Teams will acquire massive datasets, 50,000 hours, 500,000 hours of data, and immediately spin up hundreds of GPUs for pre-training," he said. "They commit enormous computational resources without actually knowing if their data is any good. Then, after weeks of training, they discover the data quality is insufficient for their needs. Pre-training evaluation is very important to figure out what the data quality should be looking like and what kind of data you should be looking at before you put all those resources to work."

When high-quality data is secured at scale, it unlocks massive architectural freedom. For instance, in frontier models like Generalist AI's GEN-1, approximately 99% of the parameters are trained from scratch. Although that was previously considered a wild choice, it represents a deliberate conviction that when you possess high-fidelity, high-volume data, you can push capabilities faster by maintaining complete control over the fundamental model architecture.

3D is the harder bottleneck

As an industry, we can simulate physics really well, making it ‘easy’ to train a robot to do a backflip in a digital world where only gravity and the floor matter. However, we can’t yet simulate the infinite, messy variety of the real world. A backflip is just about the robot’s own body, but clearing a cluttered work area without breaking a component requires generalization — the ability to handle the unpredictable.

Smith highlighted that the industry is deeply limited by the accessibility of high-quality 3D data at scale. While 2D imagery is ubiquitous on the web, robots operate across multiple camera angles and complex spatial extrinsics.

Interestingly, Niantic’s historical data collection revealed that messy, high-entropy crowdsourced video clips captured by everyday consumers on off-the-shelf smartphones actually performed better at reconstructing reality than data from perfectly audited, structured collection pipelines. The noise, variable lighting and unpredictable motion of real-world captures forced the spatial models to build a more resilient understanding of physical environments.

To collect the messy, high-entropy noise of the real world at scale, you need a contributor base globally distributed enough to cover the long tail of environments and lighting conditions, and a capture stack that keeps multi-sensor streams synchronized at the point of collection. This contributor diversity is what protects you from a model that works only inside the conditions you sampled.

Deployment teaches what benchmarks cannot

Standard metrics like mean average precision (mAP) are great for academic research papers, but real-world fleet deployment introduces high-stakes liabilities that lab models cannot anticipate. While Levin highlighted that scaling up deployment is the ultimate test for generalist systems, Radhakrishnan pointed out that traditional benchmarks completely mask the operational risks: "My problem is not mean average precision. My problem is at the tails. I have corner cases where one safety issue means a public perception problem, a safety problem and an operational burden I need to be accountable for."

To illustrate why blind data scaling fails, Radhakrishnan shared two anecdotes from his deployments at John Deere and Serve Robotics that show why data awareness matters:

At John Deere, a tractor consistently disengaged in one specific field at night. The long-exposure night settings on the tractor's cameras turned ordinary flies buzzing in front of the lens into long, bright streaks across the image. But the validation dataset contained only a single, low-impact sample of a fly at the very edge of a frame, which was too marginal to count.
At Serve Robotics, which currently operates a fleet of over 2,000 sidewalk delivery robots, the engineering team realized that human behavior around robots is entirely unpredictable. Local pedestrians quickly figured out that the delivery bots were hardcoded to stop safely when detecting a pedestrian. In response, teenagers began donning rollerblades, hooking ropes around the robots and using the autonomous delivery vehicles as personal sidewalk surfboards.

Deployment is often where unforeseen failure modes surface. Addressing this, Radhakrishnan notes, “You have to have this data-driven and data-aware mentality.” Ultimately, while production is where edge cases are discovered, your priority must be developing a dataset that is comprehensive enough to identify and address them prior to operational deployment.
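The gap between an aggregate metric and the tails can be shown with a small sketch that reports failure rates per operating slice alongside the overall rate. The slice names and counts below are invented for illustration and are not figures from either deployment.

```python
from collections import defaultdict


def failure_rates(records):
    """Return (overall failure rate, per-slice failure rates).

    records is an iterable of (slice_name, failed) pairs.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for slice_name, failed in records:
        totals[slice_name] += 1
        failures[slice_name] += int(failed)
    overall = sum(failures.values()) / sum(totals.values())
    per_slice = {name: failures[name] / totals[name] for name in totals}
    return overall, per_slice


# Hypothetical evaluation log: a rare slice that fails most of the time.
records = (
    [("day", False)] * 990
    + [("night_insects", True)] * 8
    + [("night_insects", False)] * 2
)
overall, per_slice = failure_rates(records)
```

With these made-up numbers the overall failure rate is under 1%, yet the rare night-time slice fails 80% of the time. That is the kind of tail a single headline metric hides.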

Do you really need a humanoid?

One of the loudest debates in the industry centers on form factors. Business leaders must balance a humanoid’s long-term versatility against its high near-term maintenance burden.

Because our world, from stairs to door handles, was built for people, humanoid robots can step into existing workspaces without requiring a total facility renovation. In the near future, they will likely also be able to benefit from decades of human data (like video and motion capture) to help them learn.

However, they are power-hungry, mechanically volatile and incredibly complex.

"You can slice bread with a plasma cutter, but I wouldn't recommend it," joked Smith, addressing the overhype surrounding humanoid deployment.

For high-volume, repetitive industrial tasks, a specialized robotic arm or a wheeled platform is often the smarter investment. It is more reliable, has fewer parts to break and offers a significantly faster ROI.

Liu pointed out that underhyped, controlled indoor spaces, such as sterile processing departments in hospitals or structured retail fulfillment centers, are far riper for immediate, successful model fine-tuning because the operational variables can be strictly managed.

Building the data foundation for true physical AI

As the robotics field matures, the nature of data annotation is shifting. Radhakrishnan noted that the industry is moving past low-level pixel masking or manual 3D bounding box placement to high-level behavioral labeling. This is essentially a reinforcement learning from human feedback (RLHF) framework for physical mechanics, in which human experts supervise and critique the operational intent and safety of an agent's actions.

To scale these complex systems without drowning your engineering team in operational complexity, you need a dedicated data partner who understands the friction points of physical AI. Look for operational expertise in the people doing the collection, since that's where usable-hour conversion is won or lost. Spend the larger share of your budget on in-domain data for fine-tuning rather than on generic collection volume. And lean toward partners with real field operations, because the failures that benchmarks miss are the ones a fleet running in the wild will surface.

TELUS Digital was built to be that partner. The same teams that run production-grade pipelines for the world's leading AV programs now bring that rigor to robotics and world models, with over one million hours of video collection in progress, more than 70 global delivery centers and a global community of contributors for diverse capture across over 35 countries. The capability set spans:

Diverse pre-training coverage across egocentric and wrist-mounted video, manipulation, humanoid interactions and cross-embodiment data;
Multi-sensor capture across RGB, lidar, depth, IMU, force-torque and tactile, kept in sync;
World model and sim-to-real data including expert-led teleoperation, digital twin setups and long-horizon activity sequences;
High-context, physics-aware annotations through Ground Truth Studio and kinematics-aware semantic labels through Fine-Tune Studio;
Complex post-training support including VLA training with action justifications, chain-of-thought reasoning and explainability narratives.

Data is delivered in weeks, in batches timed to your R&D cycles, under one SLA, with one team.

Let your core engineering team focus on building revolutionary world model architectures and refining hardware mechanics. Let us build the flawless, data-aware foundation your fleet needs to operate safely in an unstructured world.

Explore our Physical AI & Robotics Data Solutions to accelerate your deployment flywheel today.
