Lecture 2 mainly discusses some basic knowledge of image classification (core concepts will be expressed in English).
Starting with the most classic example: given a picture of a cat, how do you correctly pick the label "cat" from a pile of labels?
In short, what needs to be solved is: Semantic gap between “cat” and pixels, which means translating the numbers in the pixels read by the computer into the concept of “cat.” Moreover, cat images vary widely, and we need to consider the following challenges:
- illumination: lighting changes the pixel values drastically without changing the cat
- deformation: after all, cats are a fluid kind...
- occlusion: the cat is hiding
- background clutter: blending into the background
- intraclass variation: different cats look different
Using a hard-coded method like array sorting definitely won't work, so we consider a data-driven approach, using a labeled dataset to train a classifier: CIFAR10, with 10 labels, a training set size of 50,000, and a testing set size of 10,000.
Nearest Neighbor#
The idea: Traverse the training set to find the image most "similar" to the current test image, and use that image's label as the prediction for the test image.
So how do we mathematically express "similar"? We use the L1 distance, which sums the pixel-wise absolute differences, as shown in the example below.
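As a minimal sketch of the L1 distance itself (the 2×2 "images" below are made-up values for illustration):

```python
import numpy as np

# Two tiny 2x2 "images" (made-up pixel values for illustration)
img1 = np.array([[56, 32],
                 [10, 18]])
img2 = np.array([[50, 30],
                 [ 8, 20]])

# L1 distance: sum of absolute pixel-wise differences
l1 = np.sum(np.abs(img1 - img2))
print(l1)  # |56-50| + |32-30| + |10-8| + |18-20| = 6 + 2 + 2 + 2 = 12
```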
```python
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # "Training" is just memorizing the data:
        # X is [N, D] (each row a flattened image), y is [N] (labels)
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        num_test = X.shape[0]
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)
        for i in range(num_test):  # xrange is Python 2; use range in Python 3
            # L1 distance: sum of absolute pixel-wise differences
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)  # index of the closest training image
            Ypred[i] = self.ytr[min_index]    # copy its label
        return Ypred
```
Assuming num_test = 10 and each image is flattened into 4*4 = 16 pixels, the shape of X is [10, 16]. Inside the loop, distances has shape [50000], one L1 distance per training image, and ytr has shape [50000] (assuming we train on the full CIFAR-10 training set of 50,000 images, each with a label), so Ypred ends up with shape [10].
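Putting these shapes together, here is a self-contained usage sketch with small random arrays standing in for CIFAR-10 (the sizes are scaled down so it runs instantly; the logic mirrors the predict loop above):

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr = rng.integers(0, 256, size=(50, 16))   # 50 "training images", 16 pixels each
ytr = rng.integers(0, 10, size=50)          # 10 possible labels
Xte = rng.integers(0, 256, size=(10, 16))   # 10 "test images"

Ypred = np.zeros(10, dtype=ytr.dtype)
for i in range(10):
    # L1 distance from test image i to every training image: shape [50]
    distances = np.sum(np.abs(Xtr - Xte[i, :]), axis=1)
    Ypred[i] = ytr[np.argmin(distances)]    # copy the closest image's label

print(Ypred.shape)  # (10,) — one predicted label per test image
```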
This method seems simple and good, so why not use it? Next, we need to mention "time complexity." The training time complexity is O(1), while the prediction time complexity is O(N), because predicting one image requires traversing the entire dataset (for example, 50,000 images in this case) and calculating L1 distance one by one. Once the dataset is expanded, predictions will become very, very slow. We can accept slower training because that is behind-the-scenes work, but slow inference is unacceptable. Therefore, this method is impractical.
K-Nearest Neighbors#
The "Nearest Neighbor" method in the previous section simply copies the label of a single training image. Setting aside the time complexity, this copying strategy itself causes problems:
- Interference from noise produces irregular, jagged decision boundaries in the partitioned regions.
- It cannot handle outliers correctly (as shown by the yellow point in the green area below).
To upgrade the Nearest Neighbor method, we make one change: instead of copying the label of the single nearest neighbor, we look at the K nearest neighbors and let them vote.
What does this mean? First we fix the distance function, L1 or L2, and then pick a parameter K. Suppose we choose L1 and K=5: we compute the L1 distance between the test image and every image in the training set (50,000 computations), then keep the 5 training images with the smallest distances. If among these 5 neighbors two are labeled cat, one dog, one car, and one ship, then intuitively the test image's label should be cat. Besides majority voting, I also want to mention a variant not covered in the course: weighted voting.
- majority voting (mentioned in the course): find the label with the most occurrences among K.
- weighted voting: assuming K=3, the L1 distances of the three neighbors are 1, 2, and 3, and their labels are A, B, and A, respectively. The weight calculation is: label A: 1/1 + 1/3 ≈ 1.333, label B: 1/2 = 0.5, so we consider the label of the test image to be A because it has a higher weight sum.
In short, this is a slight enhancement of majority voting: it can break ties when two labels receive the same number of votes. Weighting by distance is an idea that recurs throughout machine learning.
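A minimal sketch of the weighted vote, using the made-up distances and labels from the K=3 example above:

```python
from collections import defaultdict

# K=3 neighbors as (L1 distance, label) pairs, from the example above
neighbors = [(1, "A"), (2, "B"), (3, "A")]

# Each neighbor votes with weight 1/distance, so closer neighbors count more
weights = defaultdict(float)
for dist, label in neighbors:
    weights[label] += 1.0 / dist

prediction = max(weights, key=weights.get)
print(prediction)  # A  (1/1 + 1/3 ≈ 1.333 beats 1/2 = 0.5)
```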
The choice of K also matters: a larger K reduces overfitting to outliers and noise in the training data through majority (or weighted) voting, improving the model's generalization. For specifics, see the interactive visualization page made by Stanford, which is very helpful for building intuition. Of course, the scatter points there are not literally pixel-value differences in images: between points, the L1 distance is the Manhattan distance, while the L2 distance is the straight-line length of the segment between them. The scatter plot abstracts the whole classification problem, which is worth thinking about carefully.
http://vision.stanford.edu/teaching/cs231n-demos/knn/
All the points on the graph are from the training set, with one color representing one label. Assuming the red area represents cat, that cluster of red points corresponds to some cat images affected by illumination, deformation, occlusion, etc. They have differences but share similar "patterns," so they do not completely overlap but are not far from each other.
When playing with this interactive page, you will find that increasing K from 1 to 3 to 5 changes the result noticeably, making the classification boundary smoother.
How to Choose Hyperparameters#
The term hyperparameters should be familiar to everyone. In the problem above, the choice of K and the choice of distance function directly affect the model's predictions. Parameters like these, which we set by hand rather than let the model learn, are called "hyperparameters." When evaluating how good our hyperparameters are, we should follow one principle: never tune them on the test set.
To make sure we do not touch the test set too early, a good practice is to set aside a certain proportion of the data as a validation set.
The teacher's suggestion is: he will not touch the test dataset until a week before the paper deadline.
Cross-validation is also a feasible method, but due to the high training cost on large datasets, it is only used on small datasets.
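A sketch of the hold-out approach described above, with made-up array names and sizes; we pick K by validation accuracy and leave the test set untouched until the very end:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 16))  # 100 toy "images"
y = rng.integers(0, 10, size=100)

# Hold out the last 20% as a validation set
X_train, y_train = X[:80], y[:80]
X_val,   y_val   = X[80:], y[80:]

def knn_predict(X_train, y_train, X_query, k):
    preds = np.zeros(len(X_query), dtype=y_train.dtype)
    for i, x in enumerate(X_query):
        distances = np.sum(np.abs(X_train - x), axis=1)  # L1 distance
        nearest = np.argsort(distances)[:k]              # k closest training points
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds[i] = labels[np.argmax(counts)]             # majority vote
    return preds

# Try several K values; keep the one with the best validation accuracy
best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    acc = np.mean(knn_predict(X_train, y_train, X_val, k) == y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k)
```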
Summary#
In fact, k-nearest neighbors has never really been used on images; as the course materials show, it has serious problems. The L2 distance between raw pixels does not carry enough useful information. Specifically, the three images in the lower right corner all have the same L2 distance to the original image on the left, yet they are clearly very different images.
Thus, we need to introduce the method of Linear Classification. I declare: this class was in vain! 😀
At the end of Lecture 2, a little content on linear classification was mentioned, and I plan to introduce it completely in the notes for Lecture 3.