build a simple neural net

August 20, 2023

So, I was reading this post, Implementing a Neural Network from Scratch in Python, and I thought I should break down the main function used to generate the neural net, just so I have a clear understanding of what's going on.

here's his code:

# This function learns parameters for the neural network and returns the model.
# - nn_hdim: Number of nodes in the hidden layer
# - num_passes: Number of passes through the training data for gradient descent -> number of training iterations
# - print_loss: If True, print the loss every 1000 iterations
def build_model(nn_hdim, num_passes=20000, print_loss=False):

    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))

    # This is what we return at the end
    model = {}

    # Gradient descent. For each batch...
    for i in range(0, num_passes):

        # Forward propagation
        z1 = X.dot(W1) + b1
        a1 = np.tanh(z1)
        z2 = a1.dot(W2) + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
        dW1 = np.dot(X.T, delta2)
        db1 = np.sum(delta2, axis=0)

        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1

        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2

        # Assign new parameters to the model
        model = { 'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}

        # Optionally print the loss.
        # This is expensive because it uses the whole dataset, so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
          print("Loss after iteration %i: %f" % (i, calculate_loss(model)))

    return model

• let's break it down

        # Initialize the parameters to random values. We need to learn these.
        np.random.seed(0)
        W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
        b1 = np.zeros((1, nn_hdim))
        W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
        b2 = np.zeros((1, nn_output_dim))
    
    • the neural network's parameters are initialized randomly. W1 and W2 are the weight matrices of the first and second layers, respectively. They are initialized with random values drawn from a normal distribution with a standard deviation scaled by the square root of the input and hidden layer dimensions. b1 and b2 are the bias vectors for the hidden and output layers and are initialized as zero arrays.

    • This specific weight initialization method is known as "Xavier" or "Glorot" initialization, and it is designed to improve the convergence and training of neural networks.

      • In a neural network, the magnitude of the inputs to a neuron can vary significantly. If the weights are initialized with large values, it could lead to exploding gradients during training, making it difficult for the network to learn. Conversely, if weights are initialized with very small values, it can result in vanishing gradients, impeding learning.

      • The Xavier initialization works best with activation functions that are roughly linear around the origin, such as the hyperbolic tangent (tanh). tanh is symmetric around the origin and zero-centered, so keeping the initial weights centered around zero keeps the pre-activations in the region where the function still has a useful gradient. (The sigmoid has a similar shape but is not zero-centered; its outputs lie in (0, 1), so the argument applies only approximately.)

      • In the expression np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim), the numerator np.random.randn(nn_input_dim, nn_hdim) generates random values from a standard normal distribution (mean 0, variance 1). Dividing by np.sqrt(nn_input_dim) gives the weights a variance of 1/nn_input_dim, so the sum over nn_input_dim inputs in each pre-activation has a variance close to 1 regardless of the input dimension. This keeps the magnitude of each neuron's pre-activation roughly independent of the layer's fan-in.

        • By dividing the randomly generated values by the square root of the input dimension, the values are scaled down. This scaling has a couple of effects:

        • Magnitude Balance: It ensures that the initial weights are not too large, which can lead to large activations that slow down learning (exploding gradients). It also prevents the initial weights from being too small, which can lead to vanishing gradients and slow convergence.

        • Signal Propagation: By scaling the weights down, the forward and backward propagations of signals in the network are more balanced. This helps maintain a consistent scale for the signal as it passes through the layers, which can aid in training convergence.
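
The scaling argument is easy to verify numerically. Here is a standalone sketch (the dimensions are made up for the demo, not taken from the post): with unscaled weights the pre-activation variance grows with the fan-in, while the 1/sqrt(fan_in) scaling keeps it near 1.

```python
import numpy as np

np.random.seed(0)
fan_in, fan_out, n_samples = 200, 3, 2000

X = np.random.randn(n_samples, fan_in)          # inputs with unit variance
W_unscaled = np.random.randn(fan_in, fan_out)
W_xavier = W_unscaled / np.sqrt(fan_in)         # same scaling as W1 above

var_unscaled = X.dot(W_unscaled).var()          # grows with fan_in (~200 here)
var_xavier = X.dot(W_xavier).var()              # stays close to 1
```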

    •     # Forward propagation
          z1 = X.dot(W1) + b1
          a1 = np.tanh(z1)
          z2 = a1.dot(W2) + b2
          exp_scores = np.exp(z2)
          probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
      • Input Layer to Hidden Layer (z1 and a1):

        • z1 = X.dot(W1) + b1: Here, X represents the input data (a matrix where each row is a sample and each column is a feature), W1 is the weight matrix connecting the input layer to the hidden layer, and b1 is the bias vector for the hidden layer. This line calculates the pre-activation values for the hidden layer.

        • X represents the input data, where each row is a sample and each column is a feature. For example, if you have 100 samples with 10 features each, X would have dimensions (100, 10).

        • W1 is the weight matrix that connects the input layer to the hidden layer. It has dimensions (input_dim, hidden_dim), where input_dim is the number of input features and hidden_dim is the number of nodes in the hidden layer.

        • b1 is the bias vector for the hidden layer. It has dimensions (1, hidden_dim) and is added to the result of the multiplication to ensure that each neuron in the hidden layer has a certain level of activation even if all the inputs are zero.

        • Multiplying X with W1 and adding b1 effectively applies a linear transformation to the input data. This linear transformation helps the neural network learn to extract relevant features from the input data.
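
As a quick sanity check on the shapes (the dimensions here are hypothetical, just to match the 100-samples / 10-features example above):

```python
import numpy as np

# Hypothetical dimensions: 100 samples, 10 features, 4 hidden nodes.
n_samples, input_dim, hidden_dim = 100, 10, 4

X = np.random.randn(n_samples, input_dim)
W1 = np.random.randn(input_dim, hidden_dim)
b1 = np.zeros((1, hidden_dim))

# (100, 10) @ (10, 4) -> (100, 4); b1 broadcasts across the 100 rows.
z1 = X.dot(W1) + b1
```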

      • a1 = np.tanh(z1): The pre-activation values from z1 are passed through the hyperbolic tangent (tanh) activation function. This introduces non-linearity to the model and produces the activation values (a1) for the hidden layer. These activation values represent the hidden layer's output and capture the complex relationships between the input data and the learned parameters.

        • z1 is the result of the linear transformation (multiplication with weights and addition of bias) applied to the input data X. It represents the pre-activation values of the neurons in the hidden layer.

        • np.tanh(z1) computes the hyperbolic tangent of z1 for each element in the matrix. The hyperbolic tangent is a non-linear activation function that squashes its input values between -1 and 1.

        • The reason for using the tanh activation is to introduce non-linearity to the neural network. Without a non-linear activation function, the composition of multiple linear transformations would still result in a linear transformation. Non-linearity is crucial for a neural network to capture complex relationships in the data. The tanh activation function has the desirable property of being symmetric around the origin (output values are centered around 0), which makes it suitable for maintaining a balanced signal in the network during training.
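
It's worth seeing the "composition of linear maps is linear" point concretely: with no activation between them, two stacked linear layers collapse into a single linear layer. A small sketch with made-up dimensions:

```python
import numpy as np

np.random.seed(1)
X = np.random.randn(5, 3)
W1, W2 = np.random.randn(3, 4), np.random.randn(4, 2)
b1, b2 = np.random.randn(1, 4), np.random.randn(1, 2)

# Two linear layers with no activation in between...
out_two_layers = (X.dot(W1) + b1).dot(W2) + b2

# ...are exactly one linear layer with combined weights and bias.
W_combined = W1.dot(W2)
b_combined = b1.dot(W2) + b2
out_one_layer = X.dot(W_combined) + b_combined
```

Inserting np.tanh between the two layers breaks this equivalence, which is what lets the network model non-linear patterns.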

      • Hidden Layer to Output Layer (z2 and exp_scores):

        • z2 = a1.dot(W2) + b2: The activation values from the hidden layer (a1) are multiplied by the weights W2 connecting the hidden layer to the output layer, and the bias b2 is added. This calculation gives the pre-activation values for the output layer.

        • exp_scores = np.exp(z2): The pre-activation values from z2 are exponentiated element-wise using the exponential function. These exponentiated values represent the "unnormalized" scores for each class in the output layer. In other words, they reflect the network's raw prediction strength for each class.

        • a1 represents the output of the hidden layer, which is obtained after applying the tanh activation function to the pre-activation values z1 (resulting from the input data being transformed by weights and bias).

        • W2 is the weight matrix connecting the hidden layer to the output layer. It has dimensions (hidden_dim, output_dim), where hidden_dim is the number of nodes in the hidden layer, and output_dim is the number of classes in the classification problem.

        • b2 is the bias vector for the output layer, with dimensions (1, output_dim).

        • Multiplying a1 with W2 and adding b2 performs another linear transformation. This is similar to what was done in the hidden layer, but this time the goal is to map the hidden layer's output to the final output layer's input. This transformation learns how the hidden layer's features are associated with the final class probabilities.

      • Softmax Function and Class Probabilities (probs):

        • probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True): The exponentiated scores are normalized to obtain class probabilities. The softmax function is used for this normalization. Each unnormalized score is divided by the sum of all unnormalized scores for a particular sample (axis=1). This results in a set of probabilities that sum up to 1 for each sample.

        • The calculated probabilities (probs) represent the model's prediction for the probability of each class given the input data. The class with the highest probability is considered the predicted class for a given input sample.

        • z2 represents the pre-activation values of the output layer, which are obtained after the linear transformation in the previous step.

        • np.exp(z2) calculates the exponentiation of each value in z2. Exponentiating the values is a crucial step, as it converts the raw pre-activation values into a form that can represent probabilities.

        • The reason for exponentiating the pre-activation values is to ensure that they are positive. This is important when interpreting them as probabilities.

        • np.sum(exp_scores, axis=1, keepdims=True) calculates the sum of the exponentiated scores along the rows (axis 1). This sum is performed to normalize the scores and transform them into probabilities. The keepdims=True argument ensures that the resulting sum retains the same shape as the original matrix, which is necessary for division.

        • probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) divides each exponentiated score by the sum of all exponentiated scores for the corresponding sample. This division normalizes the scores to create a set of probabilities for each class.

        • The reason for using the softmax function and these operations is to convert the pre-activation values into a probability distribution over the different classes. The softmax function ensures that the resulting probabilities are positive, sum to 1, and represent the model's confidence in its predictions for each class. This allows the neural network to make predictions about the most likely class for a given input sample.
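
One practical note: the post's version exponentiates z2 directly, which can overflow for large scores. A common variant (my addition, not in the original code) subtracts the row-wise max first; the shift cancels in the ratio, so the probabilities are unchanged.

```python
import numpy as np

z2 = np.array([[1000.0, 1001.0],   # scores this large overflow np.exp directly
               [-2.0, 3.0]])

# Subtracting the row-wise max before exponentiating leaves the
# probabilities unchanged but keeps the exponents <= 0, avoiding overflow.
shifted = z2 - np.max(z2, axis=1, keepdims=True)
exp_scores = np.exp(shifted)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
```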

      • The reason for performing these calculations lies in creating a feedforward process that transforms the input data into meaningful predictions. The transformation involves a sequence of linear and non-linear operations that allow the neural network to capture complex relationships within the data. The use of activation functions like the hyperbolic tangent and the softmax ensures that the network can model both linear and non-linear patterns in the data, enabling it to learn and generalize from the training examples to make accurate predictions on new, unseen data.
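
Putting the whole forward pass together on toy data (all shapes and values here are made up for illustration, not from the post):

```python
import numpy as np

np.random.seed(0)
n_samples, input_dim, hidden_dim, output_dim = 8, 2, 5, 2

X = np.random.randn(n_samples, input_dim)
W1 = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
b1 = np.zeros((1, hidden_dim))
W2 = np.random.randn(hidden_dim, output_dim) / np.sqrt(hidden_dim)
b2 = np.zeros((1, output_dim))

z1 = X.dot(W1) + b1                      # linear map into the hidden layer
a1 = np.tanh(z1)                         # non-linearity, values in (-1, 1)
z2 = a1.dot(W2) + b2                     # linear map into the output layer
exp_scores = np.exp(z2)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

predictions = np.argmax(probs, axis=1)   # most probable class per sample
```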

    • # Backpropagation
      delta3 = probs
      delta3[range(num_examples), y] -= 1
      dW2 = (a1.T).dot(delta3)
      db2 = np.sum(delta3, axis=0, keepdims=True)
      delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
      dW1 = np.dot(X.T, delta2)
      db1 = np.sum(delta2, axis=0)
      • Calculating delta3 - Gradient of Loss with Respect to Output Layer's Pre-Activation (delta3 = probs):

        • probs contains the predicted probabilities for each class obtained from the forward propagation step.
        • delta3 is used to represent the gradient of the loss with respect to the output layer's pre-activation values. This gradient indicates how changes in the output layer's pre-activation values affect the loss function.
      • Adjusting delta3 for Prediction Error (delta3[range(num_examples), y] -= 1):

        • delta3 is adjusted based on the difference between the predicted probabilities (probs) and the true labels y. This step represents the calculation of the gradient of the loss function with respect to the output layer's pre-activation values, taking into account the prediction error.
        • By subtracting 1 from the appropriate positions in delta3, the gradient is calculated to reflect the difference between predicted probabilities and the true class for each sample.
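
The fancy-indexed subtraction is just probs minus a one-hot encoding of y. A small check with toy values (I copy probs here so the demo doesn't mutate it; the post's code modifies probs in place, since delta3 = probs is only a new name for the same array):

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y = np.array([0, 2])              # true class of each sample
num_examples = len(y)

# Fancy-indexed subtraction, as in the post...
delta3 = probs.copy()
delta3[range(num_examples), y] -= 1

# ...equals probs minus a one-hot encoding of y.
one_hot = np.zeros_like(probs)
one_hot[range(num_examples), y] = 1
```
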
      • Calculating dW2 and db2 - Gradients of Loss with Respect to Output Layer's Weights and Bias:

        • dW2 represents the gradient of the loss with respect to the weights W2 connecting the hidden layer to the output layer.

        • db2 represents the gradient of the loss with respect to the bias b2 of the output layer.

        • These gradients are calculated using the chain rule and are derived from delta3. The gradients indicate how changes in the weights and bias of the output layer affect the loss function.

        • The matrix multiplication between the transposed hidden layer output a1.T and the delta3 gradient results in a matrix whose elements represent how each weight in W2 affects the overall loss. This is essentially a weighted sum of the contributions of each weight to the loss.

        • This operation effectively calculates the gradient of the loss with respect to the weights W2. Each entry in the resulting matrix tells us how a specific weight connecting the hidden layer to the output layer contributes to the loss. This information is used to update the weights during training through gradient descent.

      • Calculating delta2 - Gradient of Loss with Respect to Hidden Layer's Pre-Activation:

        • delta2 represents the gradient of the loss with respect to the hidden layer's pre-activation values.

        • It is calculated by propagating the gradient delta3 backward through the weights W2. This step involves the chain rule and takes into account how changes in the output layer's pre-activation values affect the hidden layer's pre-activation values.

        • delta3 - Gradient of Loss with Respect to Output Layer's Pre-Activation:

        • delta3 represents the gradient of the loss function with respect to the pre-activation values of the output layer. It was calculated in the previous steps of backpropagation and indicates how changes in the output layer's pre-activation values affect the overall loss.

        • W2.T - Transposed Weights Matrix:

        • W2 is the weight matrix that connects the hidden layer to the output layer. Taking the transpose W2.T is necessary because we want to propagate the gradient backwards from the output layer to the hidden layer.

        • Matrix Multiplication delta3.dot(W2.T) - Backpropagating the Gradient:

        • This matrix multiplication calculates the gradient of the loss with respect to the hidden layer's pre-activation values. It tells us how changes in the output layer's pre-activation values affect the hidden layer's pre-activation values.

        • (1 - np.power(a1, 2)) - Gradient of Tanh Activation Function:

        • The derivative of the hyperbolic tangent (tanh) activation function is given by 1 - np.power(a1, 2). This derivative indicates how a small change in the hidden layer's pre-activation value would affect the hidden layer's output (the a1 values).

        • Multiplying the backpropagated gradient by this derivative is a crucial step in calculating the gradient of the loss with respect to the hidden layer's pre-activation values. It ensures that the gradient accounts for the non-linearity introduced by the tanh activation function.

        • delta2 - Gradient of Loss with Respect to Hidden Layer's Pre-Activation:

        • delta2 represents the gradient of the loss function with respect to the hidden layer's pre-activation values. It's calculated by combining the information from the output layer's gradient and the effect of the hidden layer's activation function.
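
The derivative formula 1 - np.power(a1, 2) can be checked against a central finite-difference approximation of tanh:

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)
a = np.tanh(z)

# Derivative of tanh written in terms of its own output: tanh'(z) = 1 - tanh(z)^2
analytic = 1 - np.power(a, 2)

# Central finite difference as an independent check
eps = 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
```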

      • Calculating dW1 and db1 - Gradients of Loss with Respect to Hidden Layer's Weights and Bias:

        • dW1 represents the gradient of the loss with respect to the weights W1 connecting the input layer to the hidden layer.

        • db1 represents the gradient of the loss with respect to the bias b1 of the hidden layer.

        • These gradients are calculated using the chain rule and are derived from delta2. They indicate how changes in the weights and bias of the hidden layer affect the loss function.
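
A good way to convince yourself the chain of formulas is right is a numerical gradient check. Below is a sketch on a tiny network with made-up dimensions; I use the mean cross-entropy as the loss (so the analytic gradients from the post's formulas are divided by the number of examples to match) and compare dW1 against perturbing one weight at a time.

```python
import numpy as np

np.random.seed(0)
n, d_in, d_h, d_out = 6, 3, 4, 2
X = np.random.randn(n, d_in)
y = np.random.randint(0, d_out, size=n)
W1 = np.random.randn(d_in, d_h) / np.sqrt(d_in)
b1 = np.zeros((1, d_h))
W2 = np.random.randn(d_h, d_out) / np.sqrt(d_h)
b2 = np.zeros((1, d_out))

def loss(W1):
    # Mean cross-entropy of the network's predictions
    a1 = np.tanh(X.dot(W1) + b1)
    exp_scores = np.exp(a1.dot(W2) + b2)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    return -np.log(probs[range(n), y]).mean()

# Analytic gradient via the backprop chain above (scaled by 1/n to
# match the mean cross-entropy loss).
a1 = np.tanh(X.dot(W1) + b1)
exp_scores = np.exp(a1.dot(W2) + b2)
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
delta3 = probs
delta3[range(n), y] -= 1
delta2 = delta3.dot(W2.T) * (1 - np.power(a1, 2))
dW1 = X.T.dot(delta2) / n

# Numerical gradient: perturb one weight at a time.
dW1_num = np.zeros_like(W1)
eps = 1e-5
for i in range(d_in):
    for j in range(d_h):
        W1p, W1m = W1.copy(), W1.copy()
        W1p[i, j] += eps
        W1m[i, j] -= eps
        dW1_num[i, j] = (loss(W1p) - loss(W1m)) / (2 * eps)
```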

    • # Add regularization terms (b1 and b2 don't have regularization terms)
      dW2 += reg_lambda * W2
      dW1 += reg_lambda * W1
      • Regularization is a technique used to prevent overfitting, which occurs when a model fits the training data too closely and doesn't generalize well to new, unseen data. The terms reg_lambda * W2 and reg_lambda * W1 are the derivatives of an L2 penalty, (reg_lambda / 2) times the sum of squared weights, added to the loss; the biases b1 and b2 are left unregularized, as is common practice.
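
Concretely, reg_lambda * W is the gradient of an L2 penalty 0.5 * reg_lambda * sum(W**2), which can be checked numerically on a toy matrix (the values below are made up):

```python
import numpy as np

np.random.seed(0)
W = np.random.randn(3, 2)
reg_lambda = 0.01

def penalty(W):
    # L2 regularization term added to the loss
    return 0.5 * reg_lambda * np.sum(W ** 2)

grad_analytic = reg_lambda * W     # the term added to dW in the post

# Finite-difference check, one entry at a time
grad_numeric = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_numeric[i, j] = (penalty(Wp) - penalty(Wm)) / (2 * eps)
```
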
    • # Gradient descent parameter update
      W1 += -epsilon * dW1
      b1 += -epsilon * db1
      W2 += -epsilon * dW2
      b2 += -epsilon * db2
      • The necessity of this step lies in the goal of training a neural network: to find the set of parameters that result in the lowest possible value of the loss function. Gradient descent is an iterative process where the parameters are adjusted step by step to reach this optimal set of values. By updating the parameters in the direction that reduces the loss (opposite to the gradient), the network gradually improves its performance on the training data.
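
The same update rule is easiest to watch on a one-dimensional toy problem: minimizing f(w) = (w - 3)**2, whose gradient is 2 * (w - 3). (This example is mine, not from the post.)

```python
# Gradient descent on f(w) = (w - 3)**2; the minimum is at w = 3.
epsilon = 0.1           # learning rate, same role as in the post
w = 0.0                 # starting point

for _ in range(100):
    dw = 2 * (w - 3)    # gradient of the loss at the current w
    w += -epsilon * dw  # step opposite the gradient, as in the update above
```

Each step multiplies the error (w - 3) by 0.8, so w converges geometrically toward 3; with a learning rate that is too large the error would grow instead, which is the 1-D version of why epsilon matters.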