How to train AI voice tools to speak your customers’ language
Tobias Dengel
President, TELUS Digital Solutions
Two pieces of advice when you’re ready to integrate voice technology into your business: First, train your voice model on your own data. Doing so is how you differentiate from the competition when it comes to generative AI.
General-purpose models like ChatGPT make deploying generative AI easier, but you can’t control how they respond. That means building and training your own voice model is key to creating the personalized multimodal experiences that customers will flock to during the mass adoption of voice.
The second piece of advice: Don’t jump to building an AI-powered voice tool right away.
Building a voice model is more efficient than it’s ever been. Yet it’s still a painstaking process that requires working closely with leaders in marketing, sales, customer service and product design, along with other key stakeholders. It also means choosing the right AI partner to help shepherd the project from beginning to end.
This cross-collaborative process is needed because while AI is very good at figuring out how to do something, it’s not very good at figuring out what to do. That’s where humans come in — specifically, many humans with a thorough knowledge of your customers’ experience needs, interests, fears, concerns and desires.
Here’s what the process of building and training an AI voice model that speaks your customers’ language looks like. For an even more detailed look, check out my Wall Street Journal bestseller, The Sound of the Future: The Coming Age of Voice Technology.
Part 1: Get to know your customers’ experience like it’s your own
In this initial stage, you fall in love with your customers’ problems: immerse yourself in the challenges and moments of friction your customers experience when dealing with your organization (Note: “customers” is meant broadly throughout, encompassing external consumers and internal users such as frontline employees, corporate staff and financial partners). Only after pinpointing what makes their lives and work more difficult can you decide if a new technology, like voice, can help make it easier.
As you find opportunities where voice can offer a potential solution (e.g., capturing and processing data in real time or automating transactions), identify these as your voice use cases. From there, your next step is to create a specific, concrete jobs-to-be-done list for each use case.
Create a jobs-to-be-done list for each voice use case
Originated by famed business consultant Clayton Christensen, the jobs-to-be-done (JTBD) framework aims to discover customers’ unmet needs. Within these hidden, unmet needs live your strongest opportunities to make customers’ lives easier through voice.
Create your JTBD lists by analyzing how current communication methods impact customers’ experiences from start to finish. For example, let’s look at a top voice use case for many businesses: customer service. A smart place to begin would be:
- Studying the past six months of customer service call records
- Interviewing customer service reps, including high- and low-performers
- Identifying the most persistent, troublesome problems and questions
From there, separate the calls into buckets, each for a specific kind of question or challenge, and rank them by frequency. The list might look something like:
- Helping a customer correct an error in their online account: 11%
- Answering a customer question about an upcoming promotion: 9%
- Answering a customer question about a product feature: 8%
- Responding to a customer complaint about a product: 6%
You can see a JTBD list emerge for the customer service team. Working from this list, you can analyze how each job-to-be-done is currently handled, asking if new tools — including voice technology — would make each job quicker and easier.
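The bucketing-and-ranking step can be sketched in a few lines of Python. The keyword rules and transcripts here are illustrative stand-ins for a real topic classifier and your actual call records:

```python
from collections import Counter

# Illustrative keyword rules mapping call transcripts to JTBD buckets;
# a production system would use a trained topic classifier instead.
BUCKET_KEYWORDS = {
    "correct an account error": ["login", "password", "account error"],
    "ask about a promotion": ["promo", "promotion", "discount"],
    "ask about a product feature": ["feature", "how does", "does it"],
    "complain about a product": ["broken", "refund", "disappointed"],
}

def bucket_for(transcript: str) -> str:
    text = transcript.lower()
    for bucket, keywords in BUCKET_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return bucket
    return "other"

def rank_buckets(transcripts: list[str]) -> list[tuple[str, float]]:
    # Count each bucket and express it as a percentage of all calls.
    counts = Counter(bucket_for(t) for t in transcripts)
    total = len(transcripts)
    return [(b, round(100 * n / total, 1)) for b, n in counts.most_common()]
```

Ranking the buckets by frequency, as in the sample list above, shows where voice automation would pay off first.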
Visualize your JTBD lists as journey maps
Creating a complete, accurate journey map for a typical user is highly detailed work. However, the effort is worth it because the biggest insights often come from understanding the smallest details. Involve your team and AI partner in generating these journey maps. Drawing on a large cross-section of knowledge will help you see each moment of the journey in fine detail.
Create your journey maps by chronologically formatting the activities and thoughts of a typical user, from beginning to end, as they interact with one or more of your systems. Let’s return to the example of the customer service department.
When a customer calls your contact center with a question or a problem, they must move through a series of steps (e.g., answering voice prompts, entering information to authenticate identity, holding for a representative, etc.) that, taken together, constitute the customer’s journey map. Each step may present an opportunity for improved efficiency, speed, accuracy, convenience or some other positive impact on the customer experience.
Turn your journey maps into JTBD maps
This step brings your customers’ hidden, unmet needs to the surface. Your JTBD maps organically emerge from your journey maps. By visualizing and tracing each stage in the typical customer’s journey, you can pinpoint the micro jobs-to-be-done that customers must do in each interaction.
For each task identified on your JTBD maps, ask customers two key questions:
- How important is this job to you?
- How satisfied are you with the current ways to tackle this job-to-be-done?
Use the data you collect to score each job-to-be-done by importance and satisfaction level, looking for high opportunity scores: jobs-to-be-done with high importance but low satisfaction levels. Notice the three tasks highlighted below in the JTBD map.
A high opportunity score means odds are good that you can make that job-to-be-done more efficient, perhaps by applying voice technology.
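One common way to compute the opportunity score comes from Anthony Ulwick’s outcome-driven innovation work: importance plus the unmet-satisfaction gap. The article doesn’t prescribe a formula, so treat this sketch, with made-up survey scores, as one reasonable option:

```python
def opportunity_score(importance: float, satisfaction: float) -> float:
    # Ulwick-style score: importance plus the satisfaction gap,
    # floored at zero so over-served jobs aren't rewarded.
    return importance + max(importance - satisfaction, 0.0)

# Hypothetical 1-10 survey averages per job-to-be-done.
jobs = {
    "correct an online account error": (9.1, 4.2),
    "ask about an upcoming promotion": (6.0, 7.5),
    "complain about a product": (8.4, 3.8),
}

ranked = sorted(
    ((job, opportunity_score(imp, sat)) for job, (imp, sat) in jobs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
```

In this sample data, correcting an account error tops the ranking: it is both highly important and poorly served, exactly the profile voice technology should target first.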
Part 2: Build a prototype based on user interactions
If a fully operational voice system is like a Hollywood movie that’s been filmed, edited and ready to view, the prototype is like the screenplay. Your goal with an AI voice prototype isn’t to write lots of code for a polished product. Instead, it’s to create a useful guide for the software developers who’ll eventually write the code.
To do that, focus on understanding how customers will want to interact with your voice system:
- The activities they'll want to undertake
- The information they'll need
- The friction they may experience
- The results they want to achieve
The good news here is that much of this information will probably be readily available from your journey mapping work. As for using this information to drive conversational exchanges, we have two valuable techniques: observing customers in their real-life context and applying conversation-modeling exercises.
Observe customers in their real-life context
Observing customers in their real-life context, or recreating it as closely as possible, is a valuable method of collecting data. When building Vocable, an augmentative and alternative communication (AAC) app that helps speech-impaired people communicate with caregivers, the TELUS Digital team worked with patients and speech pathology experts at Duke University and WakeMed Hospital.
That real-life context helped us observe and understand the issues people with paralysis commonly experience and how we could help them communicate those issues to their caregivers. Of course, not all instances allow such direct observation. Let’s look at a different example: developing a voice-driven smartphone app for a dental flossing machine.
In this case, the Data and AI Research Team (DART) here at TELUS Digital analyzed all the points of contact a user could potentially have with the dental flossing machine:
- The customer manual
- The company website
- Frequently asked questions (FAQs)
- Engineering specs
The team also studied phone records of customer service calls to identify common questions and complaints, particularly among new users. Gradually, they sorted these conversation topics into buckets that represented the issues the app would have to address — specifically, the jobs-to-be-done by users of the flossing device.
From there, the team organized conversations by developing a series of flowcharts. The result: conversational abilities such as responding with technical information from the product manual when users ask about water pressure.
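A flowchart like that can be captured as a simple routing table. Everything here, the intents, keywords and responses, is hypothetical, standing in for the flossing app’s real conversation design:

```python
# Hypothetical intent routing for the flossing-machine app.
INTENT_KEYWORDS = {
    "water_pressure": ["pressure", "too strong", "too weak"],
    "replace_tip": ["replace", "tip", "nozzle"],
}

RESPONSES = {
    "water_pressure": "Per the product manual, use the dial on the handle "
                      "to choose one of three pressure settings.",
    "replace_tip": "Press the release button, pull the tip straight up, "
                   "then snap the new tip into place.",
}

def route(utterance: str) -> str:
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return RESPONSES[intent]
    # Fall through to a human when no flow matches.
    return "Let me connect you with our support team."
```

Each branch of the flowchart becomes an intent with its own response source, and anything unmatched routes to live support rather than dead-ending.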
Use conversation-modeling exercises
Conversation-modeling exercises ask pairs of people — members of your development team, for example — to improvise dialogue between a user and the voice tool. You can even isolate the pair from each other so the “user” can’t see that a human is playing the system, an approach known as Wizard of Oz testing. The rest of your team observes and takes notes, paying attention to:
- Variations in vocabulary and sentence structure
- Ambiguities in user statements and questions
- Whether structured or unstructured prompts from the system are more helpful
- Points when the user would benefit from added guidance or explanation
Start by asking your improvisers to model the shortest route to complete each job-to-be-done. From there, gradually imagine dialogues of more complexity and variation, allowing them to snowball into a multitude of conversations that mimic the real-world conditions your voice system will have to deal with. Push your team hard toward building a multimodal solution so you don’t default to a back-and-forth, call-and-response approach.
These two steps — observing customers in their real-life context and using conversation-modeling exercises — will give you plenty of insights for sketching out a multimodal voice prototype. Companies like Sayspring, a division of Adobe, can help guide your prototyping process. Additionally, your team could build a functional prototype with minimal code using advanced AI tools like GPT-4.
Part 3: Define the language needed to understand your customers’ jobs
Once you’ve built a prototype capable of basic voice interaction, you can begin training the natural language processing (NLP) model behind your voice tool, such as GPT-4, to communicate effectively with real-world customers. Training requires identifying the most common words, phrases and expressions used to handle each of your customers’ jobs-to-be-done and specifying the appropriate response.
Training a voice interface tends to be complicated and detail-driven. Resist the urge to rush it or shortchange the required resources. Your competitive advantage ultimately will come from understanding how to engage customers with voice as effectively as possible.
In addition to learning your customers’ language, training involves identifying and supplying your underlying language model with all the knowledge it needs to generate a response. Imagine you sell bicycles. Your conversational AI assistant must answer questions such as “How do I adjust the seat?” and “Is this bike available in a different color?”
Because large language models (LLMs) are trained on general world knowledge, you’ll need to train your model and give it access to custom private knowledge (e.g., a database of your stock with notable features for each product). Techniques like retrieval augmented generation (RAG) are highly effective here.
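A minimal sketch of the retrieval step: match the question against a private product catalog, then pass only the retrieved records to the model as context. The catalog and the keyword-overlap scoring here are illustrative; production RAG systems typically retrieve with vector embeddings:

```python
import re

# Hypothetical private catalog a general-purpose LLM knows nothing about.
CATALOG = [
    {"name": "Trail Bike 300", "colors": ["red", "blue"],
     "notes": "Seat adjusts with the quick-release lever under the saddle."},
    {"name": "City Cruiser", "colors": ["black"],
     "notes": "Step-through frame; seat height is fixed by a hex bolt."},
]

def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, catalog: list[dict], k: int = 1) -> list[dict]:
    # Naive keyword-overlap scoring stands in for embedding similarity.
    q = tokenize(question)
    scored = sorted(
        catalog,
        key=lambda item: len(q & tokenize(item["name"] + " " + item["notes"])),
        reverse=True,
    )
    return scored[:k]

def build_prompt(question: str, docs: list[dict]) -> str:
    # Ground the model's answer in retrieved records only.
    context = "\n".join(f"- {d['name']}: {d['notes']}" for d in docs)
    return f"Answer using only this product data:\n{context}\n\nQuestion: {question}"
```

The assembled prompt goes to the language model, which can now answer “How do I adjust the seat?” from your stock data instead of guessing from general world knowledge.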
Consider synthetic data if you’re a small or midsize business
One of the best ways to align customers’ jobs-to-be-done with the correct vocabulary is to study past interactions (e.g., phone, email, text). Enterprises, especially Fortune 500 companies, have an edge over smaller companies here. But thanks to synthetic data, small and midsize businesses can also identify the words, phrases and expressions needed to train highly effective AI voice models.
Driven by for-profit and non-profit organizations such as the Stanford Open Virtual Assistant Lab (OVAL), the synthetic data movement has shown that a relatively small amount of real-world data, expanded with synthetic examples, can be as powerful for training generative AI models as the vast archives giant companies own.
Not to mention, synthetic data greatly benefits data security. That’s because using synthetic data to train AI models removes the risk of compromising actual customer data.
Make diversity a priority on your voice team
Early facial recognition tools performed poorly at recognizing the faces of people of color because predominantly white teams built the early prototypes. The lesson: You can’t create the world's best products if everyone in the room looks the same.
Businesses of all sizes can build and train more effective voice models by ensuring that the team building and training these models is as diverse as possible. The more points of view, modes of speech and ways of thinking your AI model encounters during training, the more likely your voice tool will serve as wide an assortment of human beings as possible.
Make diversity a priority throughout the entire process when testing, evaluating, and improving your voice model.
Part 4: Pretest and improve your AI voice model
Pretesting and improving your voice model before its formal launch is essential for improving usability. Your goal is a high intent match rate: most user requests fall within your voice system’s domain, and most are understood and acted upon correctly.
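As a working definition, the intent match rate can be measured like this, where each pretest request is labeled by whether it was in-domain and whether the system handled it correctly (labels you would collect during review):

```python
def intent_match_rate(results: list[tuple[bool, bool]]) -> float:
    # Each entry: (request was in-domain, request was handled correctly).
    matched = sum(1 for in_domain, correct in results if in_domain and correct)
    return matched / len(results)
```

For example, four pretest requests where two were in-domain and handled correctly yields a rate of 0.5, a signal that another improvement round is needed.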
When pretesting, pose as wide a variety of challenges to your voice system as possible. This recreates the diversity of real-world issues likely to emerge once the system goes live.
Once testing begins, it’s wise to enlist the help of a voice development company like Bespoken to track performance metrics such as word error rate (i.e., the proportion of words spoken by users that the system mishears or misinterprets).
Don’t expect market-ready results right away. Bespoken CEO John Kelvie reports error rates of 20% or higher are possible during early rounds. This error rate is obviously unacceptably high. But thankfully, the errors often point to their solutions, which are usually more straightforward than they first appear.
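Word error rate itself is straightforward to compute: the word-level edit distance between what the user actually said (the reference) and what the system heard, divided by the reference length. A self-contained sketch:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)
```

For the cosmetics example below, `word_error_rate("serum for ageless skin", "serum for age list skin")` comes out to 0.5: one substitution plus one insertion against a four-word reference.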
Review common sources of errors
Knowing the most common sources of error helps expedite the pretesting and improvement phase. There are three buckets to look into: vocabulary, system design, and failure to anticipate potential user statements.
Vocabulary
When helping to create a voice tool for a cosmetics company, Bespoken found the system often misinterpreted “ageless” as “age list,” confusing the bot and leaving it unable to fulfill the customer’s request. The fix: train the system to treat “age list” (a phrase very unlikely for a user to ever say) as a synonym for “ageless.” Similar vocabulary adjustments resulted in a more than 80% reduction in error rates.
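A fix like Bespoken’s can be as simple as a normalization pass over the transcript before intent matching. The synonym table here is illustrative:

```python
# Illustrative map from known misrecognitions to intended vocabulary.
MISHEARD_SYNONYMS = {
    "age list": "ageless",
    "a jealous": "ageless",  # hypothetical second misrecognition
}

def normalize(transcript: str) -> str:
    text = transcript.lower()
    for heard, intended in MISHEARD_SYNONYMS.items():
        text = text.replace(heard, intended)
    return text
```

Because a user is vanishingly unlikely to say “age list” on purpose, rewriting it costs nothing and rescues every request the recognizer would otherwise fumble.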
System design
Minor flaws in system design can have an outsized impact. Voice is a prime example: statements made to the user have to be brief, or they quickly become too much information to absorb.
A rule of thumb: when asking the user to make a choice, never present more than three options at a time. Realities like these emphasize the need for multimodal UX/UI design for conversational AI assistants, like coordinating a voice response with options displayed on a touchscreen.
Failure to anticipate potential user statements
Imagine someone responding to a text message via voice while driving. This potentially chaotic real-world scenario will present problems that don’t arise in conversation modeling at the office. Someone driving could be easily distracted, prompting them to say, “Wait,” or “I didn’t catch that last part.”
When unexpected user statements like these happen, they must be accounted for. That might mean revisiting the system design to present information more simply or giving the user slightly different options for a wider range of circumstances.
Practice good error flow handling
“Error flow handling” is the craft of repairing a conversation when an error looms. Imagine, for example, all the mechanisms needed to let a customer say, “Wait a sec, make it four tickets for the 8:15 p.m. movie” right before they book two tickets for 7:30 p.m., and to have that correction flow as a seamless voice experience.
Building this level of error flow handling isn’t easy, but it’s crucial for delivering great voice experiences. Reducing error rates to zero isn’t reasonable. Plus, your voice system will always have to contend with moments of miscommunication beyond its control (e.g., background noise, someone who speaks very softly). Customers still deserve a positive voice experience under those conditions, even when it means handing them off to live support.
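A toy version of that ticket-correction flow, with deliberately simplistic slot extraction, shows the shape of the repair logic:

```python
# Toy correction-aware booking flow; a real system would use a proper
# NLU layer for slot extraction instead of keyword checks.
def handle_turn(state: dict, utterance: str) -> str:
    text = utterance.lower()
    if "wait" in text or "make it" in text:
        # The user corrected themselves: re-extract the ticket count
        # and drop any pending confirmation.
        for word in text.split():
            if word.isdigit():
                state["tickets"] = int(word)
        state["confirmed"] = False
        return f"Updated: {state['tickets']} tickets. Shall I book them?"
    if "yes" in text:
        state["confirmed"] = True
        return f"Booked {state['tickets']} tickets for {state['showtime']}."
    # Graceful fallback rather than a dead end.
    return "Sorry, I missed that. How many tickets would you like?"
```

The key design choice is that a correction always wins over a pending booking: the system re-confirms after any “wait” rather than charging ahead with stale slots.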
Part 5: Continuously improve as you learn more from your voice tools
The best way to get data on your voice system is to put it in the wild, see how it performs, and iterate accordingly. Your first few months will likely reveal new issues that didn’t arise during initial testing, prompting continued refinement and evaluation to deliver more value to users.
When you should formally launch your voice tool will depend on your unique situation. You'll have to weigh factors such as the potential impact of errors against the value you’re providing. A mistake by an AI-powered medical assistant, for instance, would have more severe consequences than one by an interactive trivia app.
As this blog shows, building and training AI voice tools to speak your customers’ language requires exceptional digital craftsmanship, from the initial planning stages to continually iterating on the final product.
If you’re ready to build the voice and conversational AI applications your organization needs, the DART team at TELUS Digital is prepared to help. Programs like our Agentic AI Accelerator have helped businesses rapidly identify and prototype new voice solutions, like a safe and secure conversational AI assistant for financial services.
And as you make voice part of your forward-looking strategy, future-proof your company against asymmetric GenAI tech innovation with our Fuel iX™ enterprise AI platform. Fuel iX won the first global certification for Privacy by Design, ISO 31700-1, leading the way in GenAI flexibility, control and trust.


