
fastclean

Experiments on finding incorrectly labelled samples in a dataset and on training with noisy labels

Motivation

label errors

Approaches

Confident learning (CL) is an approach that focuses on label quality by characterizing and identifying label errors in datasets. It rests on three principles: pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence.

Pruning can be based on either of the following:

1. Confident examples: keep the examples that are labelled correctly with high probability and prune the rest.
2. Confident errors: prune the examples that are labelled incorrectly and belong to a different class with high probability.
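The "confident errors" idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used in these experiments: the function name `find_confident_errors` and the toy inputs are hypothetical. It assumes out-of-sample predicted probabilities are available and that every class has at least one labelled example. The per-class threshold is taken as the mean predicted probability of that class over the examples carrying that label.

```python
import numpy as np

def find_confident_errors(pred_probs, labels):
    """Return indices of likely label errors, most suspicious first (hypothetical helper)."""
    n_classes = pred_probs.shape[1]
    # Per-class threshold: mean predicted probability of class k over examples labelled k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )
    # A class is a "confident" candidate for an example if its probability clears the threshold.
    above = pred_probs >= thresholds
    # Among the candidate classes, pick the most probable one.
    masked = np.where(above, pred_probs, -np.inf)
    confident_class = masked.argmax(axis=1)
    has_confident_class = above.any(axis=1)
    # Confident error: confidently assigned to a class other than the given label.
    is_error = has_confident_class & (confident_class != labels)
    # Rank flagged examples by the probability of their given label, lowest first.
    given_prob = pred_probs[np.arange(len(labels)), labels]
    flagged = np.flatnonzero(is_error)
    return flagged[np.argsort(given_prob[flagged])]

# Toy data: the last example is labelled 1 but the model is confident it is class 0.
pred_probs = np.array([[0.90, 0.10],
                       [0.80, 0.20],
                       [0.20, 0.80],
                       [0.10, 0.90],
                       [0.95, 0.05]])
labels = np.array([0, 0, 1, 1, 1])
flagged = find_confident_errors(pred_probs, labels)
print(flagged)
```

The flagged indices are candidates for pruning. The second example is left alone because no class clears its threshold for it.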

Takeaways

Pretrained models as effective feature extractors + Gradual Unfreezing + Label Smoothing
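For readers unfamiliar with label smoothing, here is a minimal NumPy sketch of one common formulation, in which a fraction `eps` of the target mass is spread uniformly over all classes. The function names and the value `eps=0.1` are assumptions for illustration. The experiments here may use a framework implementation with different defaults.

```python
import numpy as np

def smooth_labels(labels, n_classes, eps=0.1):
    """Turn integer labels into soft targets: (1 - eps) * one_hot + eps / n_classes."""
    one_hot = np.eye(n_classes)[labels]
    return one_hot * (1.0 - eps) + eps / n_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between logits and (possibly soft) target distributions."""
    # Subtract the row maximum for numerical stability before the log-sum-exp.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

labels = np.array([0, 2])
soft_targets = smooth_labels(labels, n_classes=4, eps=0.1)
logits = np.array([[4.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 4.0, 0.0]])
print(soft_targets)
print(cross_entropy(logits, soft_targets))
```

Because the soft targets never reach exactly 1, the loss stops rewarding ever larger logits for the given label, which limits how strongly a single wrong label can pull the model.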


Ensure Clean Test data & Reduce noise impact on train data with Pseudo Labeling
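One simple way to apply pseudo labeling to noisy training data is sketched below. It is an assumption about the general technique, not a description of the exact procedure used here: the helper `pseudo_label` and the `threshold=0.9` cut-off are hypothetical. A given label is replaced by the model's prediction only when the model disagrees with it and is sufficiently confident.

```python
import numpy as np

def pseudo_label(pred_probs, labels, threshold=0.9):
    """Replace given labels with confident model predictions (hypothetical helper)."""
    predicted = pred_probs.argmax(axis=1)
    confidence = pred_probs.max(axis=1)
    # Relabel only where the model is confident and disagrees with the given label.
    relabel = (confidence >= threshold) & (predicted != labels)
    new_labels = np.where(relabel, predicted, labels)
    return new_labels, relabel

pred_probs = np.array([[0.95, 0.05],
                       [0.60, 0.40],
                       [0.10, 0.90]])
labels = np.array([1, 1, 1])
new_labels, relabel = pseudo_label(pred_probs, labels)
print(new_labels, relabel)
```

Only the training labels are touched in this sketch. The test labels stay as they are so that evaluation remains comparable.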


Datasets

Noisy Imagenette

This dataset contains a subset of ImageNet images from 10 classes. See [1] for how the noisy labels were generated. So that evaluations can be compared fairly, the validation set is kept clean and its labels are not changed.

Covid Tweets (Noisy User Generated Text)

The dataset from WNUT-2020 Task 2: Identification of informative COVID-19 English Tweets is used to evaluate the identification of noisy samples in the test set. The goal of the task is to automatically identify whether an English Tweet related to the novel coronavirus (COVID-19) is informative or not. Informative Tweets provide information about recovered, suspected, confirmed and death cases, as well as the location or travel history of the cases.

Notebooks

Scripts

Outputs

├── imagenette_labelsmoothing
│   ├── prune_noise_rate
│   │   ├── noisy25_train_predictions.csv
│   │   ├── noisy50_train_predictions.csv
│   │   └── noisy5_train_predictions.csv
│   └── prune_noise_rate_class
│       ├── noisy25_train_predictions.csv
│       ├── noisy50_train_predictions.csv
│       └── noisy5_train_predictions.csv
└── imagenette_no_labelsmoothing
    ├── prune_noise_rate
    │   ├── noisy25_train_predictions.csv
    │   ├── noisy50_train_predictions.csv
    │   └── noisy5_train_predictions.csv
    └── prune_noise_rate_class
        ├── noisy25_train_predictions.csv
        ├── noisy50_train_predictions.csv
        └── noisy5_train_predictions.csv

Noisy Tweet from Test

covid
└── noisy
    └── noisy_text.csv

Experiments

Noisy Imagenette


| Noise Percent | Actual Noisy Samples | Noise with Label Smoothing (prune by noise rate) | Noise with Label Smoothing (prune by noise rate + class) | Noise with CrossEntropy (prune by noise rate) | Noise with CrossEntropy (prune by noise rate + class) |
|---|---|---|---|---|---|
| 5% | 114 | 117 | 76 | 431 | 406 |
| 25% | 2122 | 2177 | 2023 | 2217 | 2140 |
| 50% | 4092 | 4256 | 4132 | 4257 | 4151 |

StreamLit

References