
Why use Learning Curves?

For any given dataset, it is typical to compare two classifiers A and B by their classification error on the test set when trained on the full training set, as shown in (a). These classifiers may differ in architecture design, training techniques, or choice of hyperparameters. While such a comparison may establish that model A is better than model B when trained on the full training set, it gives an incomplete view of performance. By computing a learning curve, we may find the opposite to be true in the low-data regime, as shown in (c), which depicts learning curves for two real classifiers trained on the CIFAR-100 dataset.

We show that a learning curve is not only easy to compute (using only 3 error measurements!) but may also be reliably summarized using error and data-reliance, as shown in (b). This summarization allows easy reporting of learning curves in tabular form while retaining the ability to estimate classifier performance if the amount of training data were changed by a certain factor, for instance by 4x or 0.25x.

Summarizing a Learning Curve

We summarize a learning curve using a local linear approximation of the curve around the desired training set size N. Typically, we care about performance in the vicinity of the full training set size. If so, N may be set to the number of samples in the full training set. The approximation consists of two terms:

Error @ N : e_N is the error predicted by the learning curve at N and is nearly identical to the measured error at N.
Data-Reliance @ N : beta_N is proportional to the slope of the linear approximation. Given two classifiers with similar e_N, the one with higher beta_N is said to be more data-reliant: it would outperform the other when trained on more data but underperform with less data.
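
As a rough illustration of how such a summary might be computed from a handful of error measurements, the sketch below assumes the learning curve has the form e(n) = alpha + eta * n^(-0.5) and that the linear approximation is taken in n^(-0.5). Both the functional form and the scaling of beta_N (chosen here so that beta_N equals the predicted increase in error at a quarter of the data) are assumptions of this sketch. The function names and the measurements are hypothetical, and this is not the released code.

```python
def fit_curve(sizes, errors):
    """Least-squares fit of errors ~ alpha + eta * n**-0.5 (assumed curve form)."""
    xs = [n ** -0.5 for n in sizes]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(errors) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, errors))
    eta = sxy / sxx                  # slope with respect to n**-0.5
    alpha = mean_y - eta * mean_x    # predicted error with unlimited data
    return alpha, eta


def summarize(alpha, eta, N):
    """Return (e_N, beta_N) for the fitted curve at training set size N."""
    scale = N ** -0.5
    e_N = alpha + eta * scale        # error predicted by the curve at N
    beta_N = eta * scale             # assumed scaling of the slope at N
    return e_N, beta_N


# Synthetic measurements for illustration only: test error (%) at three
# training set sizes, generated from alpha = 20 and eta = 1000.
sizes = [2500, 10000, 40000]
errors = [40.0, 30.0, 25.0]

alpha, eta = fit_curve(sizes, errors)
e_N, beta_N = summarize(alpha, eta, N=40000)
print(f"e_N = {e_N:.2f}, beta_N = {beta_N:.2f}")
```

With more than two measurements the fit is over-determined, so noise in any single measurement has less influence on the summary.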

Extrapolation

In the figure above, d is the factor by which the training set size is multiplied. The shaded block provides simple formulae for computing performance at quarter, quadruple, and infinite training data using the linear approximation. In practice, we have found predictions to be reliable up to 4x the full training set size. Such information may be incredibly useful for deciding whether to simply collect more data or to invest in architecture or training improvements.
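
The figure's formulae are not reproduced in this text. As a hedged sketch, suppose the approximation is linear in n^(-0.5) and beta_N is scaled so that e(dN) = e_N + beta_N * (d^(-0.5) - 1). Under that assumption, the error at a quarter of the data is e_N + beta_N, at four times the data it is e_N - 0.5 * beta_N, and with infinite data it is e_N - beta_N. The function name and the two classifier summaries below are hypothetical.

```python
import math


def extrapolate(e_N, beta_N, d):
    """Predicted error at d times the training set size (assumed linear in n**-0.5)."""
    return e_N + beta_N * (1.0 / math.sqrt(d) - 1.0)


# Two hypothetical classifiers with the same error at N but different
# data-reliance; the numbers are made up for illustration.
summaries = {"A": (25.0, 5.0), "B": (25.0, 10.0)}

for name, (e_N, beta_N) in summaries.items():
    for d in (0.25, 4, math.inf):
        print(f"classifier {name}, d = {d}: {extrapolate(e_N, beta_N, d):.2f}")
```

In this toy comparison the more data-reliant classifier B is predicted to do better than A at 4x the data and worse at 0.25x, which is the kind of crossover that a single full-dataset comparison would hide.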

Analysis of design decisions

The performance of classifiers today is governed by a myriad of design decisions, such as network architecture, normalization techniques, data augmentation, pretraining, and the choice of optimizer, to name a few. Knowing whether a particular design decision improves the error, the data-reliance, or both is vital for evaluating these choices and for creating more principled and targeted solutions.

Deep Learning Quiz!

In our paper, we exemplify the use of learning curves by analyzing a wide range of common design decisions that go into building deep neural classifiers. The above table lists a set of popular beliefs among deep learning practitioners. We encourage readers to first judge each claim as True or False and then see if our experimental results support the claim (Yes / No / Unsure). We hope our analysis encourages more ML practitioners to adopt learning curves to systematically investigate classifier design choices.

Paper

Learning Curves for Analysis of Deep Networks

Derek Hoiem, Tanmay Gupta, Zhizhong Li, and Michal M. Shlapentokh-Rothman • ICML • 2021


Code

Given error measurements, we make it really easy to compute, plot, and compare learning curves using only a few lines of code. We also provide detailed interactive notebooks to help you dig deeper into learning curves. After all, using learning curves in your next project should not require a steep learning curve!

Authors


Derek Hoiem	Tanmay Gupta	Zhizhong Li	Michal M. Shlapentokh-Rothman

