The challenge
Our client, an L4 and L5 autonomous vehicle manufacturer, needed fine-tuning data to refine and improve the decision-making capabilities of their AV bot across a variety of driving situations. The data, consisting of video recordings of the vehicle navigating real-world scenarios, required high-quality annotation. As our long-time partner, the client was already familiar with our 2D/3D annotation expertise in the automotive sector and our ability to source experts with a thorough understanding of U.S. traffic rules. They leveraged our fully managed data labeling solution for diverse and complex use cases to support this initiative.
The TELUS Digital solution
The fine-tuning process happened in two phases. First, supervised fine-tuning (SFT) was performed to train the AV bot to better comprehend driving scenarios and formulate adaptive motion plans, even in edge cases. To accomplish this, we assembled a team of 25 driving experts from the U.S. and the Philippines who generated a fine-tuning dataset of over 100,000 high-quality text annotations (responses). These annotations involved analyzing lidar and 2D data to describe driving scenarios in alignment with traffic rules, safety guidelines and human expectations. This required thorough attention to detail, including the precise identification of objects in the source video dataset, followed by the prescription of appropriate driving actions.
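For illustration, one such annotation could be represented as a record along the following lines. This is a minimal sketch: the class and field names are hypothetical and do not reflect the client's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structure for a single fine-tuning annotation;
# names and fields are illustrative only.
@dataclass
class TrackReference:
    track_id: str      # object ID from the lidar/2D source data
    object_type: str   # e.g. "pedestrian", "bicyclist", "SUV"
    behavior: str      # observed behavior, e.g. "crossing the intersection"

@dataclass
class DrivingAnnotation:
    timestamp: float            # seconds into the source video
    location: str               # scenario location, e.g. "signalized intersection"
    scenario_description: str   # free-text description of the scene
    reasoning: str              # why the prescribed action is appropriate
    prescribed_action: str      # driving action the expert recommends
    tracks: List[TrackReference] = field(default_factory=list)
```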
Throughout the project, we employed a comprehensive technical and linguistic quality-assurance rubric. This was used to evaluate categories such as vocabulary, fluency, mechanics, timestamp, location, reasoning, action, track identification and track behavior. Each annotation was graded as critical, good or great.
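A minimal sketch of how this rubric check might be encoded is shown below. The category names and the critical/good/great grades come from the rubric described above, while the roll-up logic and identifiers are assumptions made for illustration.

```python
from enum import Enum
from typing import Dict

class Grade(Enum):
    CRITICAL = 0   # blocking issue; the annotation must be reworked
    GOOD = 1
    GREAT = 2

# Rubric categories as listed above.
RUBRIC_CATEGORIES = (
    "vocabulary", "fluency", "mechanics", "timestamp", "location",
    "reasoning", "action", "track_identification", "track_behavior",
)

def overall_grade(scores: Dict[str, Grade]) -> Grade:
    """Roll per-category grades up to a single annotation grade (assumed logic)."""
    missing = set(RUBRIC_CATEGORIES) - scores.keys()
    if missing:
        raise ValueError(f"missing rubric categories: {sorted(missing)}")
    if any(g is Grade.CRITICAL for g in scores.values()):
        return Grade.CRITICAL
    if all(g is Grade.GREAT for g in scores.values()):
        return Grade.GREAT
    return Grade.GOOD
```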
Second, reinforcement learning from human feedback (RLHF) was used to benchmark the trained AV bot’s responses against critical parameters and evaluate the vehicle’s performance. The AV bot’s responses were compared and ranked against the expert-generated responses in three categories: the action taken by the AV bot, the reasoning behind the decision and the scenario location. Complex rules and ranking priorities were applied to identify critical errors, resolve ties and conduct holistic comparisons across more than 95,000 individual response ratings.
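As a rough sketch, the pairwise comparison could be expressed as follows. The priority order (action, then reasoning, then location), the numeric scoring scale and the identifiers are assumptions for illustration, not the client's actual ruleset.

```python
from dataclasses import dataclass

# Simplified sketch of a hierarchical ranking pass; the ordering and
# scoring scale are assumed for illustration.
@dataclass
class ResponseRating:
    action_score: int      # 0 = critical error, higher is better
    reasoning_score: int
    location_score: int

def rank_pair(bot: ResponseRating, expert: ResponseRating) -> str:
    """Return which response wins: 'bot', 'expert', or 'tie' (holistic review)."""
    # A critical error in the action category is immediately disqualifying.
    if bot.action_score == 0 and expert.action_score > 0:
        return "expert"
    if expert.action_score == 0 and bot.action_score > 0:
        return "bot"
    # Otherwise compare category by category in priority order,
    # falling through to the next category on a tie.
    for bot_score, expert_score in [
        (bot.action_score, expert.action_score),
        (bot.reasoning_score, expert.reasoning_score),
        (bot.location_score, expert.location_score),
    ]:
        if bot_score != expert_score:
            return "bot" if bot_score > expert_score else "expert"
    return "tie"  # escalate to a holistic side-by-side comparison
```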
Example annotation: “I am at an intersection with a considerable amount of vehicle and pedestrian traffic. I have stopped due to a red signal. A white SUV has passed in front of me (ID: xxxxx) and a pedestrian (ID: xxxxx) is crossing the intersection in the opposite direction. Immediately after, a bicyclist (ID: xxxxx) and another pedestrian (ID: xxxxx) proceed to cross the intersection in front of me as well. Additionally, the first pedestrian is crossing the road again to the opposite side. I can proceed after the bicyclist and the two pedestrians have crossed the road completely and the traffic signal turns green.”
The results
For SFT, we achieved an impressive quality score of 98% based on two criteria. The first was the client’s linguistic rubric, which measured the quality of the vocabulary and the clarity of the response. The second was the accuracy of the decision made, the reasoning behind it and the precise scenario location.
For RLHF, we achieved a success rate of up to 95% measured against the golden dataset provided by the client. We employed a hierarchical ranking system that treated certain options as equivalent on a case-by-case basis to ensure accurate comparisons.
The high-quality datasets and response ratings significantly improved the AV bot’s scenario comprehension and motion-planning capabilities. They also enhanced the AV bot’s ability to handle edge cases, contributing to safer and more reliable autonomous driving technology.