After setting aside the Nearest Neighbor classifier from Lecture 2, the next approach to try is Linear Classification.
Introduction#
We use a parametric approach for classification.
As shown in the figure above, we use a 10×3072 parameter matrix W, multiplied by x (x is obtained by taking a 32×32-pixel image, where each pixel has three channel values, and flattening these 32×32×3 values into a one-dimensional vector of length 3072), resulting in a 10×1 output that contains the scores for 10 different labels, with the highest-scoring label being the model's prediction.
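As a minimal sketch (not the course's actual code), the score computation is just a matrix-vector product; the random `W` and `x` below stand in for trained weights and a real flattened image:

```python
import numpy as np

# Sketch of the linear score function f(x, W) = Wx (bias term omitted for brevity).
np.random.seed(0)
W = np.random.randn(10, 3072) * 0.0001  # 10 classes x 3072 pixel values
x = np.random.rand(3072)                # one flattened 32x32x3 image
scores = W.dot(x)                       # 10 scores, one per class
predicted_label = np.argmax(scores)     # the highest score wins
print(scores.shape)  # (10,)
```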
You might be curious: what would this W matrix look like if visualized? Would it be quite abstract?
We reshape the 3072 values in each row of the 10×3072 matrix W back into 32×32×3 square images, as shown below.
On a small dataset, you can faintly make out features recognizable to the human eye.
Multiclass SVM loss#
This W looks like the "brain" of the decision-making process, but with so many numbers inside, how do we choose it?
We need to introduce the "Loss Function" to quantify our dissatisfaction with the prediction results.
We use a simpler example with three categories: frog, cat, car.
Imagine a scenario: the teacher asks a student to answer a multiple-choice question (the correct answer is A), and the student hesitantly says, "I probably choose A, but I might choose B." Clearly, he does not grasp the knowledge as well as those who confidently answer option A. Similarly, we want the highest score for cat to be significantly greater than the scores for car and frog, indicating that the model is confidently choosing the correct answer.
That's what Multiclass SVM loss is about: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$. When we predict the category of the first image above, we get the three scores in the first column; here $s_{y_i}$ is 3.2 (the score corresponding to the actual label cat, even though it is smaller than 5.1), because we know the actual label of this image is cat. Next, let's evaluate how this classifier (i.e., the weight matrix W) performs. Our criterion is that the score for the true label must be at least 1 point higher than every other score to count as truly understood. 3.2 - 1 < 5.1: that's a big mistake, a serious reprimand. 3.2 - 1 > -1.7: good, not confused with frog. The final $L_i = \max(0, 5.1 - 3.2 + 1) + \max(0, -1.7 - 3.2 + 1) = 2.9$. Note!! A larger $L_i$ indicates a worse understanding, so a lower value indicates better model comprehension.
The prediction for the second image is good: 4.9 is at least 1 higher than the other two scores, so the loss is 0; the prediction for the third image is the worst, with a loss of 12.9.
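The three losses above can be checked with a short sketch (the `svm_loss` helper is my own, not from the course code):

```python
import numpy as np

def svm_loss(scores, correct_class, margin=1.0):
    """Multiclass SVM loss for one example's score vector."""
    margins = np.maximum(0, scores - scores[correct_class] + margin)
    margins[correct_class] = 0  # the true class contributes no loss to the sum
    return margins.sum()

# Score columns from the lecture example, ordered [cat, car, frog]:
print(svm_loss(np.array([3.2, 5.1, -1.7]), correct_class=0))  # ≈ 2.9
print(svm_loss(np.array([1.3, 4.9, 2.0]), correct_class=1))   # 0.0
print(svm_loss(np.array([2.2, 2.5, -3.1]), correct_class=2))  # ≈ 12.9
```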
Regularization#
The Loss Function above seems to measure the model's capability, but is it enough to have a W that achieves a low Loss value on the training set?
Not enough.
As shown by the blue line in the figure, if the Loss Function only has this term, blindly learning on the training set may lead the model to "overfit" the training set, causing it to be clueless when faced with questions outside the training set. It is said that "the simplest path is the best," and a model with good generalization performance is the same; we want the model to be "simple" and perform well on test data, as shown by the green line in the figure.
Thus, we add a regularization term: $L = \frac{1}{N}\sum_{i=1}^{N} L_i + \lambda R(W)$, where $\lambda$ is a hyperparameter used to balance data loss and regularization loss.
The regularization term $R(W)$, like the loss function, has many options:
- L1 regularization: encourages sparsity in the W matrix
- L2 regularization: prevents overfitting, smooths the weight distribution
- Elastic net regularization: combines the advantages of L1 and L2
- Dropout ……
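As a small illustration of why L2 regularization "smooths" the weight distribution (made-up numbers in the spirit of the lecture, not the course's example):

```python
import numpy as np

def l2_penalty(W, lam):
    # R(W) = sum of squared weights; lam balances data loss vs. regularization
    return lam * np.sum(W * W)

# Both weight rows give the same score for x = [1, 1, 1, 1] (the dot product is 1),
# but L2 regularization penalizes the concentrated second row more heavily,
# so it prefers the spread-out first row.
W = np.array([[0.25, 0.25, 0.25, 0.25],
              [1.0,  0.0,  0.0,  0.0]])
print(l2_penalty(W[0], lam=0.1))  # 0.025 (small penalty)
print(l2_penalty(W[1], lam=0.1))  # 0.1   (large penalty)
```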
Softmax#
This figure shows how the Softmax Classifier converts unnormalized scores into probabilities and evaluates model performance through the cross-entropy loss $L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right)$. Here, $e^{s_{y_i}}$ is the unnormalized probability of the correct category, and $\sum_j e^{s_j}$ is the normalizing sum over all categories.
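A minimal sketch of this computation (the `softmax_cross_entropy` helper name is my own; subtracting the max before exponentiating is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax_cross_entropy(scores, correct_class):
    scores = scores - np.max(scores)              # shift for numerical stability
    probs = np.exp(scores) / np.sum(np.exp(scores))  # normalized probabilities
    return -np.log(probs[correct_class])          # cross-entropy loss

scores = np.array([3.2, 5.1, -1.7])  # [cat, car, frog], true label cat
print(softmax_cross_entropy(scores, correct_class=0))  # ≈ 2.04 (natural log)
```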
Softmax vs. SVM (Comparison of Two Loss Functions)#
By the way, take a look at this flowchart; it is very clear: the bottom left shows the input and true label for each image in the training set, and the remaining connections encompass all the processes mentioned earlier.
Optimization#
After all this discussion, you might still have a question: how is this "brain" actually trained? In other words, how do we find the best W?
There are several possible approaches:

- random search: quite ineffective
- follow the slope: understood via the example of walking downhill; mathematically, this is the gradient.
Iteratively add a small value h to each number in W, observe how much the loss value changes, and thereby obtain dW, repeating this operation for every entry of W as shown below. This method (essentially the numerical gradient) becomes impractical as W scales up, because the computation is far too expensive. Upon careful consideration, we realize that what we want is the derivative of the Loss function with respect to W; after all, the Loss is a function of W, so we can use calculus to compute the gradient analytically!
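A sketch of the numerical gradient, verified against the analytic gradient on a toy function (the helper names are mine, not the course's; the course recommends exactly this kind of gradient check):

```python
import numpy as np

def numerical_gradient(f, W, h=1e-5):
    """Centered-difference numerical gradient: perturb each entry of W by h."""
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h
        fxph = f(W)              # f evaluated at W + h in this coordinate
        W[idx] = old - h
        fxmh = f(W)              # f evaluated at W - h
        W[idx] = old             # restore the original value
        grad[idx] = (fxph - fxmh) / (2 * h)
        it.iternext()
    return grad

# Check against the analytic gradient of f(W) = sum(W^2), which is 2W.
W = np.random.randn(3, 4)
g = numerical_gradient(lambda w: np.sum(w ** 2), W)
print(np.allclose(g, 2 * W, atol=1e-4))  # True
```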
Gradient Descent#
The code is as follows:
```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)  # gradient of the loss w.r.t. the weights
    weights += - step_size * weights_grad  # step in the negative gradient direction
```
Stochastic Gradient Descent#
```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 training examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)  # gradient on the minibatch, not the full data
    weights += - step_size * weights_grad
```
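To make the loop concrete, here is a self-contained toy run of minibatch SGD on a linear least-squares problem (all data and names here are illustrative, not from the course code):

```python
import numpy as np

# Toy problem: recover w_true from noisy-free data y = X @ w_true via minibatch SGD.
np.random.seed(42)
w_true = np.array([2.0, -3.0])
X = np.random.randn(1000, 2)
y = X.dot(w_true)

weights = np.zeros(2)
step_size = 0.1
for _ in range(200):
    batch = np.random.choice(1000, 32, replace=False)  # sample a minibatch of 32
    Xb, yb = X[batch], y[batch]
    # gradient of the mean squared error loss w.r.t. the weights
    weights_grad = 2 * Xb.T.dot(Xb.dot(weights) - yb) / len(batch)
    weights += - step_size * weights_grad  # same update rule as above
print(weights)  # ≈ [2.0, -3.0]
```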
In this interactive visualization, you can watch the entire training process: http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Image Features#
A coordinate transformation was used, from Cartesian coordinates to polar coordinates: a group of points that a linear classifier could not separate in the original space becomes linearly separable after the transform.
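A sketch of that idea with two concentric circles (my own toy data, not necessarily the lecture's exact example): in (x, y) coordinates no line separates the classes, but in (r, θ) coordinates a simple threshold on r does.

```python
import numpy as np

# Two classes on concentric circles: radius 1 -> class 0, radius 2 -> class 1.
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
inner = np.c_[np.cos(theta), np.sin(theta)]
outer = np.c_[2 * np.cos(theta), 2 * np.sin(theta)]

def to_polar(points):
    r = np.hypot(points[:, 0], points[:, 1])   # radius
    t = np.arctan2(points[:, 1], points[:, 0])  # angle
    return np.c_[r, t]

# In polar features, thresholding the r coordinate classifies perfectly.
preds = (to_polar(np.vstack([inner, outer]))[:, 0] > 1.5).astype(int)
print(preds[:100].sum(), preds[100:].sum())  # 0 100
```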
Examples of Feature Representation:
- Color Histogram
- Histogram of Oriented Gradients (HOG)
- Bag of words: Build codebook → Encode images
Summary#
The content of Lecture 3 concludes here. I did not extend many topics and basically followed the course content. Personally, I think the Optimization part is not explained very well; it only gives a rough understanding.
The content of Lecture 4 is neural networks and backpropagation; let's move on to the next note~