Data labeling fundamentals for machine learning
Artificial intelligence (AI) has successfully leaped from academic obscurity to achieving essential business tool status. Brands across all industries are using it to innovate, increase operational efficiency and provide new services and products that were impossible to implement a few years ago.
Successful implementation of AI and its subsets — machine learning, deep learning, computer vision, natural language processing, speech recognition, etc. — rely on large volumes of high-quality labeled data. Activities like data collection, labeling and validation are critical in the AI development lifecycle because they inform the algorithms and the results they produce. In this article, we explore the fundamentals of data labeling — one of the most important data processing steps that support machine learning (ML) initiatives.
What is data labeling?
The terms “data labeling” and “data annotation” are used interchangeably to describe the process of creating datasets for the training of machine learning algorithms.
Data labeling for AI training includes various data processing activities like tagging, classifying, moderating, transcribing, translating and validating. High-quality data labeling is a critical step in the development of all AI as it teaches the systems how to produce more accurate outcomes. In short, if machines learn from good quality data, they will make good decisions. Likewise, the inverse is also true: sub-par data leads to poor results and flawed products.
Along with grasping the volume and diversity of data required for a particular ML model, defining the appropriate data type is equally important. Video, image, audio, text, point clouds and even geo data are all potential labeling targets for training supervised ML algorithms, and each data type comes with its unique annotation challenges.
Text data, for example, is often gathered in large volumes to train a machine learning algorithm to understand written human language. Similarly, large volumes of geographic data support mapping or navigation applications. These ML applications require the precision granted by big, structured datasets, which can be difficult to source at scale.
In machine learning, the size and accuracy of training datasets are central to a model’s success. Take self-driving cars, for instance. Their ability to recognize and avoid obstacles, adjust speed and make split-second decisions is evolving rapidly due to high-quality and thoroughly labeled datasets of often complex information, including video data, 3D sensor data, images and audio. Labeling all of this data in-house can quickly become too arduous and complicated, often leading companies to pursue partners and third-party tools to get the job done.
The essential guide to AI training data
Discover best practices for the sourcing, labeling and analyzing of training data from TELUS Digital, a leading provider of AI data solutions.
Data labeling types and use cases
Data labeling protocols used to label datasets will change depending on the application. Let us look at some of the common labeling types you can expect to use for each type of application.
Computer vision
In computer vision, models often use data from a wide range of sensors such as 2D cameras, lidars, fish-eye cameras, 360-degree cameras and more. The labeling tasks for these types of data can range from simple image classification, to 2D/3D object detection and tracking, to image or point-cloud segmentation, to more complex sensor fusion.
Object detection and tracking typically use bounding boxes to identify elements in an image, but can lack the precision required for complex computer vision applications. Image segmentation, on the other hand, utilizes pixel-level information via colored segments to provide semantically-distributed information about an object. While bounding boxes are advantageous for quickly classifying an identified object, segmentation shows more detailed information regarding the object of interest.
Another type of data labeling used to train computer vision models is 3D sensor fusion data. Sensor fusion data combines information from visual sources, such as lidar sensors and camera devices, to provide varied and accurate object data. Point-cloud segmentation is an excellent example of this. Like 2D segmentation, point clouds provide detailed shapes of identified 3D objects, ranging from simple classes like buildings, to complex, distributed pixels of the sky or vegetation in an image. Object classification, point cloud segmentation, object linking and tracking, among others, work together to provide thoroughly labeled datasets for training advanced computer vision models.
Natural language processing
Natural language processing (NLP) also uses several types of data labeling to achieve accurate training results. The most common data types include entity annotation and linking, text classification and linguistic annotation.
The annotation types needed for NLP training are often broad because dozens — and sometimes hundreds — of languages need to be supported by the deployed application. Linguistic annotation and text classification are two of the most basic labeling types NLP models need. Linguistic annotation involves identifying language-specific phonetic and linguistic elements of a specified text or audio. Similarly, text classification determines the overall sentiment and subject in a text. Both data labeling types are crucial to most NLP models.
Speech recognition
Data labeling for speech recognition algorithms includes audio classification, transcription and multilingual translations. It is important to note that audio data is often transcribed for further linguistic analysis. AI models that support applications with human speech activation require speech recognition training. Audio transcription can include both verbatim and clean read transcription, which alters speech to support formatting or linguistic requirements, as well as speaker identification and time stamping. Like NLP applications, speech recognition model training almost always requires multilingual support, meaning audio training data will need replication in multiple languages.
These complex models require a continuous inflow of diverse, high-quality labeled data. Building this data pipeline is one of the primary bottlenecks in AI development. How can one overcome this data labeling challenge and create an effective, labeled data supply chain for machine learning advancement? To answer this question, let us first look at the data labeling process.
What is the data labeling process?
While the process seems straightforward on the surface, the complexities of data labeling are more pronounced than ever. Cognilytica research found that 80% of AI project time is used for gathering, organizing and labeling data.
Generating exhaustive datasets for fail-safe models demands labor-intensive processes and sophisticated AI data solutions to collect and annotate data, test and validate quality, ensure diversity and volume, detect edge cases and more.
Many decision-makers vastly underestimate the time and resources required for data labeling. A commonly held belief is that hiring annotators and making annotations are the beginning and end of the labeling process, but really this is just the tip of the iceberg.
Hiring new annotators means onboarding a new workforce that is untrained or undertrained. Bridging this skills gap means creating and implementing training programs, developing annotation tools, analyzing annotator performance, and other operational and engineering tasks. Building a platform from scratch that caters to specific data types can quickly become an enormous challenge since data requirements are constantly changing with evolving machine learning algorithms. Each of these components will then need reassessment after some annotation tasks are complete, allowing space for improving the platform capabilities, annotator performance and the overall labeling process. Additionally, accurate data annotation needs quality and validation checks, feedback gathering and quality metrics.
After considering all of this, it is apparent that creating training data for modern ML models requires much more than data collection and hiring annotators. The activity also requires many work hours from engineers, business leaders and managers.
The more complex the training requirements are for your application, the more complex the annotation process will be, causing a strain on any engineering or machine learning team, as complex ML problems require the full attention of these professionals.
Several inherent challenges hinder the AI development process — let us look at a few critical ones below.
Data labeling challenges
Data labeling relies on a human workforce — and as such, it is also subject to human biases and errors — one of the most common annotation challenges. Maintaining a large, diverse pool of data experts can ensure data diversity and help resolve this issue partially, but that is exhaustive and expensive, making it inaccessible to some companies.
Another human-related consideration is data quality. Data labeling needs extensive quality assurance to ensure that errors and biases overlooked during the labeling process get rectified before the data is fed into the models.
Infrastructure-related challenges are another common hurdle. Building high-precision tools that leverage data labeling automation is an enormous challenge for ML teams due to the lack of resources singularly dedicated to advancing data labeling capabilities. Moreover, building workforce, pipeline and quality management platforms to support data labeling contribute to enormous costs.
Similarly, operational considerations — like building workflows to ensure continuous data supply and manage high throughputs — often require an experienced team of project managers that can be challenging to source. One way to mitigate operational problems is to invest in a mature data labeling platform that allows seamless data pipeline management from end-to-end.
Machine learning teams also face budgetary and security concerns. Due to the fact that annotation is a complex, time-intensive undertaking, the process can become expensive, fast. Tool creation, hiring and training, management overhead and regular process iteration all add up to a costly endeavor. Data security is also a common hurdle, as ML training often involves handling sensitive information. IT security is a growing concern across industries, and ML model training is no different.
Infusing personality into conversational artificial intelligence systems through high-quality text data
Discover how TELUS Digital supported a digital solutions provider with a customized AI system to add personality into chatbot applications.
Data labeling best practices for better ML outcomes
Building an efficient quality assurance process for data labeling improves label quality and helps ML teams predict model performances more accurately. There is no one-size-fits-all approach because the quality metrics and requirements vastly vary from one dataset to another. However, these are a few of the most important things to keep in mind when planning your annotation processes:
Create gold standards: These standards outline the ideal fully labeled dataset. Gold standards, created by ML professionals who know the specifications required for a high-quality labeled piece of data, guide the annotation process and help assess potential annotation staff.
Consistently analyze annotated data: This allows data scientists to evaluate the accuracy and identify outliers that may or may not result from bias or mistakes from your annotators.
Use a multipass process: Allowing several annotators to look at the same dataset improves the overall accuracy of the labeling while also minimizing human error.
Perform regular reviews of annotators: Having one annotator label the same piece of data multiple times gives managers an idea of that labeler’s consistency, as well as potentially catching past errors of that annotator and providing them feedback.
Data labeling approaches to consider for your ML initiatives
A popular way to start is with open-source data. Governments and multinational organizations, for example, publish large volumes of data on topics ranging from weather and climate to linear regression. Researchers also publish data from long-term sensing and observing experiments, such as night sky surveys and ocean floor images. You might not find data precisely tuned to your hypothesis, but you may find enough data to test and make some preliminary assessments.
Other options include crowdsourcing, but you should proceed with caution here. The more discretion there is in a decision — such as identifying fruit in an image that is at least 60% green — the more variation you will find in crowdsourced results.
Synthetic data is another option. In this case, a program generates data similar to existing data but with enough variation to improve the overall coverage of the dataset. Generative adversarial networks and clustering-based algorithms can help create synthetic datasets. A drawback of this type of data generation is the amount of computing resources required and the associated costs. Also, since synthetic data is in its infancy, accuracy can vary dramatically.
Programmatic data labeling uses scripts to label data, but these can be difficult to develop if you are looking for high precision and recall. It can, however, provide support for human labelers or quality assurance teams.
Outsourcing, in many ways, captures the best aspects of other approaches. With outsourcing, you can hire professional data labelers who have access to tools, methodologies and best practices for labeling data. This is especially important for projects that require structured project management and reporting, quality assurance and the ability to develop and follow detailed annotation policies.
Data labeling with TELUS Digital
TELUS Digital is committed to helping organizations overcome common annotation hurdles with a cutting-edge AI training platform, a diverse workforce and years of domain expertise supporting global machine learning teams.
Our combination of best-in-class community management and sophisticated technology enables organizations to create and gather the high-quality data they need to deliver accurate results on time and scale effectively. We also prioritize bias mitigation through responsible AI practices. Learn more about TELUS Digital’s AI Data Solutions or speak to a data solutions expert today.