As self-driving cars become a reality on public roads, all data and information responsible for driving them safely have to be on the ball.
This is why, when word emerged that labels for hundreds of pedestrians, cyclists, and traffic cones, among others, were missing from a widely used dataset for self-driving cars, worry was the first reaction. After all, the "rules of the road" don't account for self-driving cars with blind spots that include humans.
But this is not, in fact, the case.
Machine learning evolves, old datasets show their age
Of the 15,000 hand-checked images from Udacity Dataset 2, 4,986 (33%) were incomplete, according to commercial dataset provider Roboflow.ai. But Udacity's datasets were created more than three years ago, and are not in use on public streets.
It's important to remember: in the internet years of machine learning, three human years is several lifetimes ago.
"In the intervening years," Udacity told Interesting Engineering (IE) in an email exchange, "companies like Waymo, nuTonomy, and Voyage have published newer, better datasets intended for real-world scenarios."
In other words, Udacity hasn't actively created new datasets to keep up with the newest line of self-driving car datasets because — for now — it has yielded the real-world floor of public streets to newer companies.
Machine learning and algorithms
Machine learning has helped many industries evolve beyond their earlier state. Teaching computer algorithms to do new tasks is necessary for this process to work smoothly and safely. On a long enough timeline, the datasets used to train those algorithms become immensely complex. This can make them difficult for people at the start of their self-driving car careers to grasp. That's why incomplete datasets, like an "easy mode" in a video game, are not a bad idea, so long as they stay off-road.
Self-driving cars require a lot of data for their algorithms to navigate the dangers of public streets. If a car doesn't know how to recognize a human pedestrian walking by the side of the road, or a cyclist sharing the road with the car, then serious issues can arise.
The commercial dataset provider Roboflow published an article confirming that a popular self-driving car dataset is indeed missing labels. The Udacity Dataset 2 is used by thousands of students who are building an open-source self-driving car dataset.
Roboflow hand-checked 15,000 images from the dataset and found that 33% of them had problems. There were thousands of unlabeled vehicles, hundreds of unlabeled pedestrians, and dozens of unlabeled cyclists.
Training wheels for self-driving car datasets
Roboflow may have had no intention to mislead the public. The concept of training wheels is difficult for everyone to understand. Is a bike still a bike, if the girl riding it has two extra wheels? Kind of, but not exactly. Is she experiencing what it's like to ride a bicycle? Definitely, but without the real-world risk of potentially falling.
Is she ready for the real thing?
It's up to her, and the same could be said of the students, who have to decide whether they're ready to take off the training wheels and build their own datasets amid the real-world risks of the industry.
Of course, starting with Udacity's dataset, these students would have a long way to go. Beyond the missing labels, other errors tracked by Roboflow included duplicated bounding boxes, oversized bounding boxes, and phantom annotations.
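For readers curious how two of those problems might be caught automatically, here is a minimal Python sketch. It is not Roboflow's method, which the article describes as hand-checking. The `Box` structure, the sample boxes, the 1920x1200 image size, and the 50% area threshold are all assumptions made for illustration. Phantom annotations (boxes drawn around nothing) are left out, since spotting them generally requires looking at the image.

```python
from collections import Counter
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical annotation record: a class label plus pixel corners."""
    label: str
    xmin: int
    ymin: int
    xmax: int
    ymax: int


def find_duplicates(boxes):
    """Return each box that appears more than once with identical values."""
    counts = Counter(boxes)
    return [box for box, n in counts.items() if n > 1]


def find_oversized(boxes, img_w, img_h, max_fraction=0.5):
    """Return boxes covering more than max_fraction of the image area.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    image_area = img_w * img_h
    flagged = []
    for b in boxes:
        area = (b.xmax - b.xmin) * (b.ymax - b.ymin)
        if area > max_fraction * image_area:
            flagged.append(b)
    return flagged


# Made-up annotations for a single image (not taken from the real dataset).
sample = [
    Box("car", 10, 10, 110, 60),
    Box("car", 10, 10, 110, 60),          # exact duplicate of the box above
    Box("pedestrian", 0, 0, 1800, 1100),  # implausibly large for a pedestrian
    Box("biker", 300, 400, 340, 480),
]

duplicates = find_duplicates(sample)
oversized = find_oversized(sample, img_w=1920, img_h=1200)
print("duplicates:", duplicates)
print("oversized:", oversized)
```

Checks like these only narrow down which images deserve a second look; a person still has to decide whether a flagged box is actually wrong.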
To complicate matters, around 1.4% of the images were simply unlabeled, yet they contained cars, trucks, lights, and even pedestrians, like an invitation to the dataset developers of tomorrow to fill in the data for themselves.
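As a rough illustration of how a student might list images that carry no labels at all, the Python sketch below compares a set of image file names against an annotation table. The file names, the `image` and `label` keys, and the sample data are hypothetical and do not reflect the dataset's actual layout.

```python
def unlabeled_images(image_names, annotations):
    """Return the sorted names of images that have no annotation rows."""
    labeled = {row["image"] for row in annotations}
    return sorted(name for name in image_names if name not in labeled)


def unlabeled_share(image_names, annotations):
    """Return the fraction of images with no annotation rows (0.0 if empty)."""
    if not image_names:
        return 0.0
    return len(unlabeled_images(image_names, annotations)) / len(image_names)


# Made-up file names and annotation rows, for illustration only.
images = ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]
rows = [
    {"image": "a.jpg", "label": "car"},
    {"image": "a.jpg", "label": "pedestrian"},
    {"image": "c.jpg", "label": "truck"},
]

missing = unlabeled_images(images, rows)
share = unlabeled_share(images, rows)
print(missing, share)
```

An image with no labels is not automatically an error, because some frames may truly contain nothing to annotate. A list like this is a starting point for manual review, not a verdict.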
This goes to show how incredibly complex open-source datasets are, and this discrepancy between real-world roads and early datasets is a credit to the cutting-edge dataset companies with vehicles on public roads. But Udacity's self-driving car dataset is not in use on public roads. At present, Udacity's only operating self-driving car is for educational use and runs on a closed test track.
Students in need of a cheat sheet, in their ambition to fill in the holes of a three-year-old dataset, are in luck: Roboflow fixed and re-released the dataset.
As machine learning pushes self-driving car technology to create higher-fidelity datasets, it will become easier to look back over the years and decades, and wonder how we managed.
But, just like the girl and her bicycle, the challenge lies in removing the training wheels and making one's own way on public roads.
***Editors' Note: This article has been updated -- with several changes made throughout -- after receiving clarification from Udacity. An earlier version of this article implied that Udacity's self-driving car datasets were in active use on public streets. This has been corrected to reflect the fact that the company's data is used only for educational purposes, is in effect no more flawed than "training wheels," and serves only to help aspiring dataset developers become more familiar with the technology. Additionally, Udacity has not developed new datasets for three years, and has "yielded the floor" to newer, more advanced datasets provided by other, unaffiliated companies. Furthermore, Udacity's only self-driving car in operation is exclusively for educational purposes, and operates on a closed test track, not public streets. In all, the earlier version of this article suggested that incomplete datasets developed by Udacity were errors, which is a misconstrual of the fact that old datasets will naturally appear to be errors in the hindsight of future developments. Finally, the original title of this article has been changed to reflect this. IE regrets these errors.