A team of computer scientists examined ten of the most cited datasets used to test machine learning systems. They found that around 3.4 percent of the data was inaccurate or incorrectly labeled, which can cause problems for AI systems that rely on these datasets.
The datasets, each cited more than 100,000 times, include collections of text-based newsgroup posts, among others. Errors came from issues such as Amazon product reviews being mislabeled as positive when they were in fact negative, and vice versa.
Some of the image-based errors stem from confusing animal species. Others come from images labeled for a less prominent object in the frame ("water bottle" instead of the mountain bike it is attached to, for example). One particularly odd example was a baby being confused with a nipple.
One of the datasets centers on audio from YouTube videos. A clip of a YouTuber talking to the camera for three and a half minutes was labeled "church bell."
To find possible errors, the researchers used a framework called confident learning, which examines datasets for label noise (or irrelevant data). They then validated the examples the algorithm flagged and found that around 54 percent of them did indeed have incorrect labels. One dataset had the most errors, around 5 million (about 10 percent of its labels). The team has published the label errors so that anyone can browse them.
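The core idea behind this kind of label-noise detection can be illustrated with a short sketch: train a model, get out-of-sample predicted probabilities, and flag examples where another class's probability exceeds that class's average self-confidence while disagreeing with the given label. This is a simplified, hypothetical illustration with made-up toy data, not the researchers' actual implementation.

```python
# Minimal sketch of confident-learning-style label error detection.
# Hypothetical simplification; the real framework is more sophisticated.
import numpy as np

def flag_label_issues(pred_probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return a boolean mask of likely label errors.

    pred_probs: (n_examples, n_classes) out-of-sample predicted probabilities.
    labels: (n_examples,) given (possibly noisy) integer labels.
    """
    n, k = pred_probs.shape
    # Per-class threshold: average self-confidence of examples given that label.
    thresholds = np.array([
        pred_probs[labels == c, c].mean() if np.any(labels == c) else 1.0
        for c in range(k)
    ])
    # A class "confidently" claims an example when its probability
    # meets or exceeds that class's threshold.
    confident = pred_probs >= thresholds            # (n, k) boolean
    # Best confidently-predicted class; fall back to the given label
    # when no class is confident.
    best = np.where(confident.any(axis=1),
                    np.argmax(pred_probs * confident, axis=1),
                    labels)
    # Flag examples whose confident prediction disagrees with the label.
    return best != labels

# Toy usage: the third example is labeled class 0 but the model is
# highly confident it belongs to class 1, so it gets flagged.
probs = np.array([[0.90, 0.10],
                  [0.80, 0.20],
                  [0.05, 0.95],
                  [0.10, 0.90]])
labels = np.array([0, 0, 0, 1])
print(flag_label_issues(probs, labels))  # only the third example is flagged
```

In practice the flagged examples are then reviewed by humans, which is how the researchers confirmed that roughly half of the algorithm's candidates were genuine labeling mistakes.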
Some of the errors are relatively minor, and others appear to be cases of hair-splitting (a close-up of a Mac command key labeled "computer keyboard" is arguably still correct). Sometimes the confident learning method got it wrong as well, such as flagging a properly labeled image of tuning forks as a menorah.
If labels are even slightly off, the ramifications for machine learning systems can be large. If an AI system can't tell the difference between a grocery store and a bunch of crabs, it would be hard to trust it with anything more consequential.