A guide to building training data for computer vision models
Artificial intelligence (AI) now influences the product roadmaps of many of today’s enterprises, and AI-based applications are increasingly being used to automate business processes. One of the most exciting developments in the field is computer vision.
Computer vision is being explored and applied across industries, from traditional financial services to cutting-edge technologies like autonomous vehicles. Some other popular use cases for computer vision include drones, mapping and satellites, robotics, medicine and agriculture.
So what goes into creating computer vision technology? Here are the major steps:
- Data collection
- Data labeling
- Graphics Processing Unit (GPU) acquisition
- Algorithm selection
- Training
- Testing
- Teaching
- Repeat and refine the process
Each of these steps involves its own set of operational challenges, but this article will focus on the collection and labeling of training data.
Data collection
When you start collecting data, there are many free and paid standard datasets to choose from.
For example, here are some of the top open labeled dataset repositories:
- ImageNet
- Google’s Open Images
- KITTI
- The University of Edinburgh School of Informatics’ CVonline: Image Databases
- Yet Another Computer Vision Index To Datasets (YACVID)
- CV datasets on GitHub
- ComputerVisionOnline.com
- Cityscapes Dataset
- MNIST handwritten datasets
These datasets serve as a good starting point for anyone looking to get started with machine learning (ML). They are even useful for building simple models for side projects. But for more practical purposes, it is probably best to collect proprietary training data that closely resembles the data the final model will encounter in production.
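As a rough illustration of how quickly you can experiment with one of these open repositories, the sketch below loads the MNIST handwritten digits dataset with the torchvision package. The storage path, batch size and normalization values are illustrative choices for the example, not requirements of the dataset.

```python
# Minimal sketch: loading an open dataset (MNIST) for experimentation.
# Assumes torch and torchvision are installed; the root path and batch size
# are placeholder choices.
import torch
from torchvision import datasets, transforms

# Convert images to tensors and normalize with commonly used MNIST statistics.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),
])

# Download the training split to ./data on first run.
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # e.g. torch.Size([64, 1, 28, 28]) torch.Size([64])
```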
For more complex projects, it is beneficial to work with a data outsourcing partner. Outsourcing data annotation allows companies to incorporate the best practices outsourcing partners have learned from annotating thousands of images, across a variety of scenarios and use cases.
From determining the crowd capacity and creating workflows, to handling task design and instructions, to qualifying and managing annotators, an end-to-end data outsourcing partner allows companies to attain a data collection and annotation speed that is unmatched.
Data labeling
Once data has been collected, it must be labeled. There are primarily two things to be concerned about here:
- How to label the data (Internal vs. external tools)
- Who labels the data (Internal resources vs. outsourced annotators)
How to label the data: Choosing the right data annotation tool
Many data annotation tools are available online. However, selecting the right one for your needs can be challenging. Here are a few factors to consider when selecting an annotation tool:
- Tool setup time and effort
- Labeling accuracy
- Labeling speed
If available tools don’t meet your specific needs, you may need to consider customizing an existing tool or building one from scratch. This is understandably very costly and possibly unnecessary. An alternative is to work with an outsourcing partner and leverage their technology and expertise.
Who labels the data: Selecting annotators
If you have the data, but don’t have the tools or workforce to annotate the data internally, you can offload all of your annotation tasks by partnering with a data annotation company. These companies can provide the raw data itself, a platform for labeling the data and a trained workforce to label the data for you.
Companies like TELUS Digital already have platforms built to collect and annotate data, as well as a large, trained workforce that can annotate hundreds of thousands of data points at scale. The main advantage of partnering with a data annotation company is that you don’t have to deal with building a data annotation infrastructure from scratch. All you have to do is build specific guidelines and QA protocols for the company to follow.
Best practices for data annotation
It’s imperative that companies measure the quality of their data annotations. This is a twofold process: first, measure the annotations against a set of ideal annotations to determine their accuracy; second, measure the consistency of annotations to ensure that the assembled team of annotators labels in the same way.
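As a rough sketch of both measurements, the snippet below scores one annotator’s labels against a gold-standard set for accuracy and computes Cohen’s kappa between two annotators as a consistency check. The class names are made up for the example, and scikit-learn is an assumed dependency.

```python
# Sketch: measuring annotation accuracy and inter-annotator consistency.
# Assumes scikit-learn is installed; the labels are illustrative class names.
from sklearn.metrics import accuracy_score, cohen_kappa_score

gold        = ["cat", "dog", "cat", "bird", "dog", "cat"]  # ideal annotations
annotator_a = ["cat", "dog", "cat", "bird", "cat", "cat"]
annotator_b = ["cat", "dog", "dog", "bird", "dog", "cat"]

# Accuracy against the gold standard measures how correct an annotator is.
print("A vs. gold accuracy:", accuracy_score(gold, annotator_a))

# Cohen's kappa between annotators measures how consistently they label,
# corrected for agreement that would happen by chance.
print("A vs. B kappa:", cohen_kappa_score(annotator_a, annotator_b))
```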
Other best practices for labeling worth noting are:
- Creating a gold standard
- Using a small set of labels
- Performing ongoing statistical analysis
- Asking multiple annotators to label the same data point (multipass; see the sketch after this list)
- Reviewing each annotator
- Hiring a diverse team
- Iterating continuously
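One common way to use multipass labels is to keep the majority answer and flag items where annotators disagree for review. Below is a minimal sketch using only the standard library; the labels and the two-thirds agreement threshold are illustrative assumptions.

```python
# Sketch: aggregating multipass annotations by majority vote.
# The labels and the 2/3 agreement threshold are illustrative assumptions.
from collections import Counter

def aggregate(labels, min_agreement=2 / 3):
    """Return (majority_label, needs_review) for one data point."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    needs_review = votes / len(labels) < min_agreement
    return label, needs_review

# Three annotators labeled the same image.
print(aggregate(["cat", "cat", "dog"]))   # ('cat', False) - two of three agree
print(aggregate(["cat", "dog", "bird"]))  # ('cat', True)  - send to a reviewer
```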
Evaluating quality of training datasets
Three important parameters that indicate the quality of training data are:
- Data diversity: Diverse training datasets minimize biases in model predictions and outcomes. For instance, if a model is trained to recognize cats, using images of only domestic cats will limit the model’s prediction capabilities. For better outcomes, it is advisable to include a wide variety of cat images covering different attributes, such as sitting, running, standing and sleeping cats.
- Data adequacy and imbalance: It is imperative that you train models on adequately large datasets and consider the variable factors that might affect the model’s outcomes, to ensure that the datasets aren’t imbalanced.
- Data reliability: Reliability refers to the degree to which you can trust your data. You can measure reliability by determining the following factors:
- Prevalence of human errors: If the dataset is labeled by humans, there are bound to be some errors. How frequent are those errors, and how can you correct them?
- Noisy data features: Some amount of noise is okay. But data that has too many noisy features may affect the outcome of your models.
- Duplicate data: For example, the same data records may be duplicated because of a server error, or if you face a storage crash or a cyberattack. Evaluate how these events may impact your data and have contingency plans in place.
- Label accuracy: Wrong data labels and attributes account for large gaps in model performance. It is important to maintain high precision and recall rates for labeled data, as sketched below.
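As an illustration of two of these checks, the sketch below inspects the class distribution of a labeled set for imbalance and computes per-class precision and recall against a trusted reference set. The class names and labels are made up, and scikit-learn is an assumed dependency.

```python
# Sketch: two simple quality checks on a labeled dataset.
# Class names and labels are illustrative; assumes scikit-learn is installed.
from collections import Counter
from sklearn.metrics import precision_recall_fscore_support

# 1) Class balance: a heavily skewed distribution hints at data imbalance.
labels = ["cat", "cat", "cat", "cat", "dog", "dog", "bird"]
counts = Counter(labels)
for cls, n in counts.items():
    print(f"{cls}: {n} ({n / len(labels):.0%})")

# 2) Label accuracy: precision and recall of the annotations against a
#    trusted reference (e.g. a gold standard reviewed by experts).
reference  = ["cat", "cat", "dog", "bird", "dog", "cat"]
annotation = ["cat", "dog", "dog", "bird", "dog", "cat"]
precision, recall, _, _ = precision_recall_fscore_support(
    reference, annotation, labels=["cat", "dog", "bird"], zero_division=0
)
for cls, p, r in zip(["cat", "dog", "bird"], precision, recall):
    print(f"{cls}: precision={p:.2f} recall={r:.2f}")
```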
The TELUS Digital approach
TELUS Digital offers secure and robust data creation and data annotation services to train ML models and support various AI applications. We rely on an analytics-based approach to identify and minimize error rates in training data, running the data through multiple expert human loops until it reaches sufficient accuracy, so that our clients have access to the highest-quality data. Additionally, our world-class UX designers are constantly working to improve the annotators’ experience and productivity, making the overall process more efficient.
Reach out to learn how we can help with your next computer vision project.