Case study: Building a Multilingual Dataset

3.5M language tasks
660K language prompts
650 language professionals
31 languages across various locales

The challenge

As a pioneering innovator of conversational-AI technologies, our client has long prided itself on furthering natural language processing (NLP) capabilities and providing research communities access to extensive datasets that fuel AI innovation. Although voice assistants have advanced tremendously over the past decade, their multilingual language understanding capabilities are still evolving. The company is on a mission to extend conversational capabilities to more languages via a multilingual natural language understanding (NLU) approach. The idea is to enable cross-linguistic training, where a single machine learning model understands inputs from many diverse languages.

One challenge in executing this initiative was the lack of labeled data to assist with training and evaluating the models. To complete this undertaking, the client required an AI data solutions partner with deep data collection expertise who could provide realistic, contextual data for a given NLU task. The project demanded language experts with linguistic translation, validation, or localization abilities to create comprehensive and accurate conversational utterances in specified languages.

The TELUS Digital solution

With long-standing linguistic dataset development experience, a diverse AI Community and streamlined large-scale project management protocols, TELUS Digital offered critical support to build the dataset, the first of its kind. Our team of experts supervised the project, delivering effective vendor qualification strategies and custom-engineering training and qualification materials to support frictionless data collection.

Our AI Community produced approximately 60% of the utterances in the dataset. The project, spanning nine months, involved critical tasks such as providing translation inputs for over 30 languages, translating English utterances into local languages, validating utterances or checking translations for accuracy and defining localized expressions for particular prompts.

Key differentiating factors of our solution included:

Extending valuable domain expertise: Drawing from several years of linguistic data annotation experience, TELUS Digital became the client’s trusted consultant. Our team singlehandedly managed several aspects of the project, including preparing qualification quizzes for other participating vendors, building detailed instructions for tasks across more than 50 local languages and enabling the delivery of high-quality NLU datasets.
Championing project management: Our team curated qualification tests with customized audio content, and provided the corresponding gold-standard transcriptions in over 30 languages to validate fluency and measure the performance potential of the individual contributors. We also built comprehensive domain-specific training materials for language tasks and maintained best-in-class worker-conversion rates throughout the project.
Sourcing a diverse workforce: Using customized evaluation tests and criteria for over 30 languages, we screened thousands of participants of various linguistic capabilities to find 650 experts (native language speakers, trained linguists, translators and individual contributors) across 31 locales. Our team performed 3.5 million linguistic tasks on the client’s internal platform, contributing conversational language utterances for the dataset.
Delivering high-impact results: Thanks to carefully curated qualification tests, detailed training materials and clear task instructions, our workforce surpassed accuracy targets across all the deliverables. Every completed task was approved through a rigorous pass/fail quality assurance system stipulated by TELUS Digital to reduce the margin of error and improve dataset accuracy.
Driving collaborative leadership: As the primary AI data solutions provider, we helped the client build vendor evaluation and qualification processes to conduct the project across various locales. Our team also provided process frameworks and instruction manuals, and conducted multiple assessments to onboard other vendors, streamlining the data collection process.

The results

Using the linguistic data collected by TELUS Digital and other vendors (vetted as per our quality and evaluation standards), the client released the dataset. The dataset, which has numerous machine learning use cases, contains 1 million realistic, parallel, labeled text utterances spanning more than 50 languages and dozens of domains, intents and slots. With the release of this large language dataset, along with the open-source code, the client continues to promote research collaborations for NLP and upgrade NLU modeling to advance conversational AI systems.