How do you maintain inter-rater reliability when working with highly specialized experts across different geographies and cultures?

Maintaining consistent quality across large-scale expert data production is one of the toughest challenges in the field. It demands calibration infrastructure built directly into the workflow: shared reasoning examples, real-time feedback loops and quality-control agents that flag drift before it compounds. Ongoing rater performance modeling treats quality as a continuous signal rather than a periodic audit. The architecture has to hold quality steady even as contributor pools grow, tasks evolve and model iterations shift the goalposts.
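To make "quality as a continuous signal" concrete, here is a minimal sketch of a rolling drift check. It assumes each rater is periodically served gold items with known answers; the `DriftMonitor` class, the window size and the accuracy floor are hypothetical illustrations, not a description of any production system.

```python
from collections import deque


class DriftMonitor:
    """Rolling accuracy of one rater on gold items (hypothetical sketch)."""

    def __init__(self, window=50, floor=0.85):
        # Keep only the most recent `window` gold-item outcomes.
        self.results = deque(maxlen=window)
        self.floor = floor  # assumed minimum acceptable rolling accuracy

    def record(self, rater_label, gold_label):
        self.results.append(rater_label == gold_label)

    def accuracy(self):
        if not self.results:
            return None
        return sum(self.results) / len(self.results)

    def drifting(self):
        # Only flag once the window is full, to avoid noisy early alerts.
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.floor


monitor = DriftMonitor(window=4, floor=0.75)
for rater_label, gold_label in [("A", "A"), ("B", "B"), ("A", "B"), ("B", "A")]:
    monitor.record(rater_label, gold_label)
print(monitor.accuracy(), monitor.drifting())
```

Because the check runs on every gold item rather than at audit time, a rater whose accuracy slips is flagged within one window instead of at the next periodic review.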

How quickly can a new evaluation program be stood up and what does that process look like?

Frontier labs are constantly experimenting with new varieties of training datasets, which means evaluation infrastructure has to move at a fundamentally different cadence than it was historically designed for. The ability to stand up a new program, from expert sourcing and task UI configuration to QC workflow and quality benchmarking, in days rather than weeks is now a baseline expectation. The organizations best positioned to meet that cadence are the ones that have built reusable infrastructure rather than custom-engineering every program from scratch.

What does the talent pipeline for agentic AI evaluation actually look like and how is it different from traditional annotation sourcing?

Agentic evaluation requires a profile that doesn’t exist in traditional AI trainer pools. This includes professionals with lived, operational experience of the workflows AI is being asked to perform. A medical expert who understands how patient intake actually moves through an emergency room isn’t interchangeable with a generalist rater working from a healthcare rubric. Building this pipeline means sourcing from professional communities, not crowd platforms. It’s also essential that vetting processes assess workflow fluency and decision quality, not just domain credentials.

How do you approach safety evaluation for harm categories that do not yet exist?

The most consequential trust and safety work is anticipatory. Waiting for a new harm category to surface in production before building defenses against it is structurally too slow. The approach that works is a forward-deployed research model, which includes experts who are actively mapping emerging threat vectors, stress-testing automated systems against novel adversarial scenarios and building guardrails before those scenarios reach scale.


The human bottleneck in AI training: Can expert annotators scale with frontier models?

Posted May 26, 2026

Steve Nemzer

Senior Director of AI Growth & Innovation

Key takeaways

Today’s models have outrun the pipelines designed to improve them. Scaling frontier AI now depends on high-quality, expert human judgment across complex modalities.
Trajectory annotation and reinforcement learning (RL) sandboxes represent a genuinely new category of work. Organizations that haven't made this shift yet risk investing in capabilities that won't transfer.
The path forward is not more evaluators. Rather, it’s automating what automation can reliably handle and concentrating expert judgment on what it can’t. That shift demands a complete redesign of tooling, talent and workflows.
Scaling headcount routinely degrades data quality. Maintaining reliability requires purpose-built systems for orchestration and continuous feedback.
As complexity grows, the gap between vendor and partner widens. What separates the partners that matter is expert networks, proprietary tooling and real safety depth.

We are at a moment where high-quality human signal across increasingly complex tasks has become the bottleneck on AI progress. Compute and architecture matter, but models have outrun both. The organizations that recognize this and build accordingly will define what the future of frontier AI looks like.

At TELUS Digital, this shift is showing up in customer conversations. As the complexity of what clients ask AI to do increases substantially, the conversations have moved from capability to confidence. Customers need to know their AI performs correctly, safely and consistently at scale. And they are looking for partners with the infrastructure to guarantee it.

Delivering on that confidence requires a clear view of where AI training requirements are heading and why. The generative AI (GenAI) services market is evolving across three structural layers: foundational, application, and trust and safety. Each layer is moving at a different pace and with different demands. TELUS Digital has organized its capabilities around these layers because they represent where the real work of improving and governing AI lives, and because staying ahead means understanding how the requirements within each one are changing. Running through our approach is the shift from human-in-the-loop to human-on-the-loop, where experts architect the systems that govern AI rather than reviewing its individual outputs.

In this article, I’ll outline TELUS Digital’s framework for navigating the next stage of GenAI and what it takes to deliver the quality, scale and safety that frontier AI demands.

Macro trends reshaping GenAI services

Before examining each layer, it's worth identifying the dynamics reshaping the AI training data and evaluation market as a whole. These forces cut across all three layers simultaneously, reshaping where value sits, who can deliver it and what separates vendors from partners.

The agentic shift is changing the nature of human contribution to AI training

As models improve at self-evaluation and synthetic data absorbs routine annotation, basic labeling faces commoditization. Value is shifting toward trajectory evaluation, RL environment infrastructure and workflow expertise, which are areas where synthetic pipelines underperform. The demand from model builders has shifted from large volumes of relatively simple preference judgments to smaller quantities of highly expert, multistep reasoning traces. Legacy tooling was built for text-based, single-turn outputs; agentic systems require trajectory visualization, multistep trace annotation and sandboxed RL environments that are costly, domain-specific and largely still being built from scratch.

Sourcing the right experts has become structurally harder

Over the past few years, the required combination of domain expertise, workflow fluency and evaluative judgment has fundamentally changed. As models improve, the gap between their output and expert human judgment narrows and generalist AI trainers no longer clear the bar. Agentic systems have raised it higher, requiring professionals with not just subject matter expertise but lived experience of how workflows actually operate. These are people who can author realistic task scenarios, evaluate decision quality across multistep processes and identify failure modes a generalist would miss. The supply of people who fit this profile is finite, traditional sourcing pipelines weren’t built to find them and the challenge compounds globally. Further, frontier models skew heavily toward English, while high-quality native-speaker evaluation for low-resource languages like Swahili and Vietnamese remains scarce.

The structural difficulty lies in the mismatch: traditional pipelines were optimized for volume, built to recruit large numbers of people for well-defined, repeatable tasks. Finding experts with highly specific profiles, fast enough to keep pace with monthly model releases, is an entirely different problem and most sourcing infrastructure wasn’t built with it in mind.

Human contribution is shifting from reviewing outputs to governing systems

Experts are now concentrating on where AI systematically fails. This includes correcting reasoning errors, flagging failure modes and architecting evaluation systems rather than reviewing individual outputs. This produces a higher-value training signal, but it demands fundamentally different tooling, workflow design and rater training than the AI industry was built on. Scaling headcount alone doesn’t solve this. Maintaining calibration and inter-rater reliability across thousands of contributors on increasingly complex tasks is a measurement and systems design problem, not a workforce management one.
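Treating inter-rater reliability as a measurement problem starts with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa for two raters who labeled the same items. It is a standard textbook formula shown for illustration; the function name and sample labels are assumptions, and a real program with many raters would more likely use a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: the raters agree on 6 of 8 items.
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "B", "A", "A", "A", "A"]
print(round(cohens_kappa(rater_1, rater_2), 3))  # 0.467
```

Raw agreement here is 75%, but kappa is only about 0.47 once chance agreement is removed, which is why percent agreement alone tends to overstate calibration.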

The training data challenge is widening as multimodality becomes standard

Video, audio, image and code each operate as complete disciplines with their own evaluation demands and infrastructure requirements. These include temporal coherence for video, prosody and naturalness for audio, spatial accuracy for image and logical correctness for code. The complexity is specific to each modality, and so is the expertise required to evaluate it. For any organization operating across more than one, this significantly widens the training data challenge and raises the bar on specialist talent.

Model iteration cycles have outpaced program design

Frontier models now ship on a monthly cadence, with each new generation potentially invalidating existing rubrics, calibrations and benchmarks. Programs that take weeks to stand up are structurally misaligned with this pace. The ability to rebuild evaluation infrastructure quickly is now a prerequisite, not a differentiator.

Regulatory and client scrutiny of training data provenance is intensifying

Emerging regulations require training data to be traceable. This includes who produced it, what their credentials were and under what conditions. Most current data pipelines were not built with auditability in mind, creating compliance exposure as the EU AI Act and similar frameworks come into force. Organizations that build auditability into their supply chains now will be the ones clients trust most as regulatory pressure intensifies.

How TELUS Digital is addressing the shifting AI market

None of the dynamics outlined above yield to improvised solutions. Sourcing expert talent, scaling quality, evaluating agentic systems, managing model iteration cycles and ensuring data provenance all require purpose-built frameworks. TELUS Digital has organized its capabilities around the same three layers where these challenges emerge: foundational, application, and trust and safety.

The foundational layer is built for precision, not volume

The bottleneck at the foundational layer is expert talent, including finding the right people fast enough and maintaining quality as volume scales. What counts as a useful human signal is shifting rapidly across domains, languages and modalities, and traditional hiring pipelines weren't designed to keep pace. Our response has been to rethink how experts are identified, qualified and retained.

Experts Engine, our proprietary sourcing platform, allows us to parse a globally distributed pool by domain, credential and language profile. Our AI interviewer is then able to qualify candidates at the task level in hours rather than weeks.

In a 13-week multimodal program spanning science, technology, engineering and mathematics (STEM) domains, including subdomains in health and business, our combined approach allowed us to scale from 1,000 to 6,000 expert-produced jobs per week while maintaining a near-perfect quality rate across all domains. The outcome reflects what happens when expert sourcing, evaluation tooling and workflow orchestration are treated as a single integrated system rather than separate problems.

Fine-Tune Studio, our proprietary platform for deploying custom task interfaces and multilayer QA workflows, is what makes that integration possible. On provenance, maintaining structured records of contributor credentials, qualifications and engagement conditions at the point of production is how auditability gets built into the pipeline from the start.
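As a rough illustration of what a structured provenance record captured at the point of production might contain, consider the sketch below. Every field name, the `ProvenanceRecord` class and the `to_audit_line` helper are hypothetical; the article does not specify the actual schema, and this is not Fine-Tune Studio's data model.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """One immutable record per produced item (hypothetical schema)."""

    item_id: str
    contributor_id: str         # pseudonymous ID rather than a name
    credentials: tuple          # verified credentials at time of production
    qualification_ids: tuple    # task-level qualifications the contributor passed
    engagement_conditions: str  # e.g. contract type and review policy in force
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_audit_line(record):
    # One JSON object per line keeps the log append-only and easy to query.
    return json.dumps(asdict(record), sort_keys=True)


record = ProvenanceRecord(
    item_id="item-000123",
    contributor_id="contrib-0042",
    credentials=("MD", "emergency medicine"),
    qualification_ids=("qual-triage-v2",),
    engagement_conditions="contractor; double-review QA",
)
print(to_audit_line(record))
```

Writing the record when the item is produced, rather than reconstructing it later, is what lets an auditor answer who produced a data point, with what credentials and under what conditions.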

The application layer evaluates decisions, not just answers

At the application layer, the work itself has changed structurally. There is no single answer to grade against. Instead, there is a multistep trajectory that must be assessed for tool selection, reasoning quality, error recovery and policy adherence.
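One way to picture trajectory-level assessment is as a per-step rating on each of the four dimensions named above, summarized across the whole run. The sketch below assumes a 0-to-1 score per dimension and a simple mean; the scale, the `StepRating` type and the aggregation rule are illustrative assumptions, not a prescribed rubric.

```python
from dataclasses import dataclass
from statistics import mean

DIMENSIONS = (
    "tool_selection",
    "reasoning_quality",
    "error_recovery",
    "policy_adherence",
)


@dataclass
class StepRating:
    step: int
    scores: dict  # dimension name -> score in [0, 1] (assumed scale)


def summarize_trajectory(ratings):
    """Average each dimension across steps and locate the weakest step."""
    per_dimension = {d: mean(r.scores[d] for r in ratings) for d in DIMENSIONS}
    # The weakest step is the one with the lowest score on any dimension.
    weakest = min(ratings, key=lambda r: min(r.scores[d] for d in DIMENSIONS))
    return {"per_dimension": per_dimension, "weakest_step": weakest.step}


ratings = [
    StepRating(1, {"tool_selection": 1.0, "reasoning_quality": 1.0,
                   "error_recovery": 1.0, "policy_adherence": 1.0}),
    StepRating(2, {"tool_selection": 0.0, "reasoning_quality": 0.5,
                   "error_recovery": 1.0, "policy_adherence": 1.0}),
    StepRating(3, {"tool_selection": 1.0, "reasoning_quality": 0.5,
                   "error_recovery": 0.5, "policy_adherence": 1.0}),
]
summary = summarize_trajectory(ratings)
print(summary["weakest_step"], summary["per_dimension"])
```

Pointing at the weakest step, not just the final outcome, is what distinguishes grading a trajectory from grading an answer: a run can end correctly and still contain a wrong tool call worth correcting.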

When a leading ecommerce technology firm needed to validate next-generation AI agents against complex enterprise back-ends, we built an enterprise-twin RL environment spanning five business domains, from digital marketing to insurance claim processing, producing over 100,000 model ratings benchmarking chain-of-thought reasoning, trajectories, tool use and API calls.

The traditional model where humans review every judgment won’t scale to meet the quality bar that frontier AI demands, and hiring more evaluators isn’t the answer. The solution lies in automating what automation can reliably handle and concentrating expert human capacity on what it can’t. This kind of work requires purpose-built sandboxed environments and interactive reinforcement learning through expert feedback on trajectory annotation workflows. Rather than auditing individual outputs, experts focus on where AI is going wrong, zeroing in on edge cases and truly critical cases, to correct reasoning and flag systematic failure modes.

This shift demands a complete redesign of three things: how raters are trained; how they access collective expertise at the point of evaluation, through AI-assisted tools that surface the cumulative reasoning of thousands of prior expert judgments; and how performance improves continuously rather than through periodic retraining. In training, simulated interfaces with quality control agents help new raters reach proficiency on complex tasks in days rather than weeks. During live rating, a proprietary evaluation tool brings that collective reasoning to bear on every new evaluation through rating knowledge graphs.

Two systems anchor this architecture. The first is deployed on highly subjective tasks and leverages a council of models to identify objective dimensions that provide a clean signal of rater capability, surfacing the strongest raters in the pool. Their reasoning flows directly into the second, raising the performance floor for the broader annotator cohort. This creates a reinforcing loop that scales by design as task types evolve. Critically, neither system introduces model-generated signals into evaluative judgment, which sidesteps the risk of model bias that published reinforcement learning from AI feedback research identifies.

TELUS Digital tracks the half-life of each reinforcement learning from human feedback (RLHF) evaluation workflow as a formal program metric. In other words, how quickly can human-executed volume be reduced by 50%? The target is under 12 months from workflow establishment for any unchanged workflow.
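The half-life idea can be worked through with simple arithmetic. The sketch below assumes a monthly series of human-executed task volume for one workflow and linearly interpolates between months to estimate when volume first falls to 50% of its starting level. The function name, the interpolation choice and the sample numbers are assumptions made for illustration, not the formal metric definition.

```python
def workflow_half_life(human_volume_by_month):
    """Months until human-executed volume falls to 50% of its month-0 level.

    Returns None if the series never reaches the 50% mark.
    """
    if not human_volume_by_month or human_volume_by_month[0] <= 0:
        raise ValueError("series must start with a positive baseline volume")
    target = 0.5 * human_volume_by_month[0]
    for month in range(1, len(human_volume_by_month)):
        prev = human_volume_by_month[month - 1]
        cur = human_volume_by_month[month]
        if cur <= target:
            # Linear interpolation between the two monthly observations.
            return (month - 1) + (prev - target) / (prev - cur)
    return None


# Illustrative numbers only: volume crosses 500 between months 3 and 4.
volumes = [1000, 900, 750, 600, 450]
print(round(workflow_half_life(volumes), 2))  # 3.67
```

Under these made-up numbers the workflow's half-life is about 3.7 months, comfortably inside a 12-month target; a series that never crosses the halfway mark returns `None` and would count as a miss.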

Three components operationalize this reduction.

AI-assisted pre-scoring runs a pre-screening layer that scores response pairs on clear-signal dimensions: factual accuracy, format compliance, safety policy adherence and response completeness. It auto-labels unambiguous pairs at high confidence and handles approximately 30 to 40% of volume autonomously, with a target of 50% within 12 months.

Rater-complexity matching routes routine comparison tasks to the core annotator pool and genuinely difficult pairs to senior RLHF specialists from the credentialed expert cohort, concentrating the scarcest human resource where it creates the most value.

Efficient task design (binary preference with required rationale, pre-populated rubric dimensions and guideline references surfaced before the task opens) reduces average task completion time by approximately 25 to 30% without sacrificing quality.
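The pre-scoring and routing logic described above can be sketched as a simple decision rule. The sketch assumes the pre-screening layer emits a confidence and a difficulty estimate for each response pair; the thresholds, field names and queue names are hypothetical, chosen only to show the shape of the logic.

```python
from dataclasses import dataclass

AUTO_LABEL_CONFIDENCE = 0.95  # assumed cutoff for "unambiguous at high confidence"
SENIOR_DIFFICULTY = 0.70      # assumed cutoff for "genuinely difficult"


@dataclass
class PreScore:
    pair_id: str
    preferred: str     # "A" or "B", as chosen by the pre-scoring layer
    confidence: float  # 0..1 confidence in that preference
    difficulty: float  # 0..1 estimated difficulty for a human rater


def route(pre):
    """Decide who labels a response pair."""
    if pre.confidence >= AUTO_LABEL_CONFIDENCE:
        return "auto_label"
    if pre.difficulty >= SENIOR_DIFFICULTY:
        return "senior_specialist"
    return "core_pool"


def automation_share(pre_scores):
    """Fraction of pairs handled without a human rater."""
    return sum(route(p) == "auto_label" for p in pre_scores) / len(pre_scores)


batch = [
    PreScore("p1", "A", 0.99, 0.10),
    PreScore("p2", "B", 0.97, 0.20),
    PreScore("p3", "A", 0.60, 0.90),
    PreScore("p4", "B", 0.70, 0.30),
    PreScore("p5", "A", 0.80, 0.75),
]
print([route(p) for p in batch], automation_share(batch))
```

Checking confidence before difficulty means a pair is only ever sent to a human when the automated layer is unsure, and the share handled autonomously can be tracked directly against the stated 50% target.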

This tooling is built in coordination with customer teams using their own technology environment. TELUS Digital provides the embedded innovation team: engineers and AI leads with a mandate to build workflow automation specific to RLHF.

The trust and safety layer is built for threats that haven't happened yet

At this layer, the challenge is depth, specifically cultural, linguistic and forensic. Generalist rater pools show their limits most clearly here. Harm categories change by region, languages with deep cultural context cannot be evaluated by translators reading rubrics built for English and the threat surface is expanding faster than classifiers can be trained.

When a major social media platform needed to harden its large language model against complex adversarial attacks, we deployed a dual-track strategy (single-turn topic challenges and complex multiturn trajectories designed to bypass safety guards) across more than 400,000 prompts in over 16 languages spanning the EU and APAC regions.

For the volume side of safety work, Fuel iX™ Fortify, our automated adversarial testing platform, generates and judges thousands of attack vectors in hours across dozens of harm categories and multiple languages, freeing expert teams to concentrate on novel threat categories and edge cases that no classifier has yet been trained to recognize. This is what it looks like to build safety systems and technical foundations for the threats that haven't happened yet, not just the ones that have.

Building for the layer beneath the intelligence

So, can humans keep up?

For the work that matters most at the frontier, like identifying failure modes that classifiers haven't been trained to recognize, evaluating reasoning quality across multistep agentic trajectories and applying the cultural and linguistic judgment that no rubric can fully encode, the answer is yes. Human experts are not being outpaced by the complexity of this work. In many cases, they are the only ones capable of doing it.

What humans can’t keep up with, without the right infrastructure, is the pace and scale at which frontier AI now demands that work. Models iterate monthly. Evaluation requirements shift with each generation. The volume of judgment calls required to improve a frontier model is not something any individual or team can absorb without systems designed to support them.

The answer lies in infrastructure. Capable, willing experts exist. The gap is in the systems needed to find them, qualify them, deploy them at scale and hold quality steady as the complexity of the work grows. That is a solvable problem, and organizations that treat it as a design challenge rather than a staffing one are already building the advantage.

The distinction between vendor and partner comes down to exactly this. A vendor delivers to a specification. A partner understands your model's failure modes well enough to anticipate the next evaluation requirement before it becomes urgent, invests in the research to understand what good data actually looks like at the frontier and builds infrastructure that compounds across model generations rather than resetting with each one. That is a different kind of organization, and it requires different decisions about talent, tooling, safety investment and research.

TELUS Digital has made its bet on the infrastructure beneath the intelligence. This includes expert networks, evaluation systems, safety tooling and research partnerships that determine whether AI gets better responsibly. That is the layer we have been building, and it is the conversation we are built for. If you are thinking about what frontier AI demands from your data and evaluation infrastructure, we would like to be part of that conversation.

