nn.Linear

August 16, 2023

  • class Linear(Module):
        r"""Applies a linear transformation to the incoming data: :math:`y = xA^T + b`
    
        This module supports :ref:`TensorFloat32<tf32_on_ampere>`.
    
        On certain ROCm devices, when using float16 inputs this module will use :ref:`different precision<fp16_on_mi200>` for backward.
    
        Args:
            in_features: size of each input sample
            out_features: size of each output sample
            bias: If set to ``False``, the layer will not learn an additive bias.
                Default: ``True``
    
        Shape:
            - Input: :math:`(*, H_{in})` where :math:`*` means any number of
              dimensions including none and :math:`H_{in} = \text{in\_features}`.
            - Output: :math:`(*, H_{out})` where all but the last dimension
              are the same shape as the input and :math:`H_{out} = \text{out\_features}`.
    
        Attributes:
            weight: the learnable weights of the module of shape
                :math:`(\text{out\_features}, \text{in\_features})`. The values are
                initialized from :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})`, where
                :math:`k = \frac{1}{\text{in\_features}}`
            bias:   the learnable bias of the module of shape :math:`(\text{out\_features})`.
                    If :attr:`bias` is ``True``, the values are initialized from
                    :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})` where
                    :math:`k = \frac{1}{\text{in\_features}}`
    
        Examples::
    
            >>> m = nn.Linear(20, 30)
            >>> input = torch.randn(128, 20)
            >>> output = m(input)
            >>> print(output.size())
            torch.Size([128, 30])
        """scikit-learn train-test-split
        __constants__ = ['in_features', 'out_features']
        in_features: int
        out_features: int
        weight: Tensor
    
        def __init__(self, in_features: int, out_features: int, bias: bool = True,
                     device=None, dtype=None) -> None:
            factory_kwargs = {'device': device, 'dtype': dtype}
            super().__init__()
            self.in_features = in_features
            self.out_features = out_features
            self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
            if bias:
                self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
            else:
                self.register_parameter('bias', None)
            self.reset_parameters()
    
        def reset_parameters(self) -> None:
            # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
            # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
            # https://github.com/pytorch/pytorch/issues/57109
            init.kaiming_uniform_(self.weight, a=math.sqrt(5))
            if self.bias is not None:
                fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
                bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
                init.uniform_(self.bias, -bound, bound)
    
        def forward(self, input: Tensor) -> Tensor:
            return F.linear(input, self.weight, self.bias)
    
        def extra_repr(self) -> str:
            return 'in_features={}, out_features={}, bias={}'.format(
                self.in_features, self.out_features, self.bias is not None
            )
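    • Quick sanity check of the two claims above: forward really is y = xA^T + b, and any number of leading dimensions is allowed. A minimal sketch, nothing assumed beyond what the docstring and forward show:

      import torch
      import torch.nn as nn

      m = nn.Linear(20, 30)
      x = torch.randn(128, 20)

      # forward() just calls F.linear, i.e. y = x @ W^T + b
      y = m(x)
      manual = x @ m.weight.t() + m.bias
      print(torch.allclose(y, manual))   # True (up to float rounding)

      # the leading dimensions are arbitrary; only the last one must be in_features
      x3d = torch.randn(4, 7, 20)
      print(m(x3d).shape)                # torch.Size([4, 7, 30])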
    • The function that confused me the most was reset_parameters. I did a little research, and here's what I came up with:

      • The purpose of the reset_parameters method is to provide a consistent and standardized way to initialize the parameters after they have been created in the __init__ method.

      • It's common practice to reset or initialize parameters separately from their creation, as this allows for better control over the initialization process. It also makes it easier to change the initialization strategy without modifying the __init__ method.

    • Kaiming Uniform Initialization:

      • Kaiming initialization (also known as He initialization) is a popular initialization technique for neural network weights. It is designed to help gradients flow well during training, which can lead to faster and more stable convergence.

      • The specific choice of a=math.sqrt(5) in the Kaiming uniform initialization isn't a tuned heuristic: as the comment in reset_parameters notes, it makes the Kaiming bound work out to 1/sqrt(in_features), reproducing PyTorch's long-standing default uniform initialization (see the sketch at the end of this list).

      • This initialization keeps the weights within a reasonable range, which helps prevent gradients from vanishing (becoming too small) or exploding (becoming too large) during training.

      • I'll make more detailed notes on weight initialisation later, because I don't really understand it beyond how it's being used here. Refer to those notes when you want to dig deeper; for now, just understand this and move on.
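      • A minimal sketch checking the equivalence the source comment points out (with a=sqrt(5), the Kaiming bound collapses to 1/sqrt(in_features)):

        import math
        import torch
        from torch.nn import init

        in_features = 20
        w = torch.empty(30, in_features)
        init.kaiming_uniform_(w, a=math.sqrt(5))

        # kaiming_uniform_ samples from U(-bound, bound) with
        # bound = gain * sqrt(3 / fan_in) and gain = sqrt(2 / (1 + a^2));
        # with a = sqrt(5) this simplifies to bound = 1 / sqrt(fan_in)
        bound = 1 / math.sqrt(in_features)
        print(w.abs().max().item() <= bound)   # True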

  • # functional.py (this is torch/ao/nn/quantized/functional.py, the quantized variant, not the torch.nn.functional.linear that forward calls above)
    
    def linear(
        input: Tensor, weight: Tensor, bias: Optional[Tensor] = None,
        scale: Optional[float] = None, zero_point: Optional[int] = None
    ) -> Tensor:
        r"""
        Applies a linear transformation to the incoming quantized data:
        :math:`y = xA^T + b`.
        See :class:`~torch.ao.nn.quantized.Linear`
    
        .. note::
    
          Current implementation packs weights on every call, which has penalty on performance.
          If you want to avoid the overhead, use :class:`~torch.ao.nn.quantized.Linear`.
    
        Args:
          input (Tensor): Quantized input of type `torch.quint8`
          weight (Tensor): Quantized weight of type `torch.qint8`
          bias (Tensor): None or fp32 bias of type `torch.float`
          scale (double): output scale. If None, derived from the input scale
          zero_point (long): output zero point. If None, derived from the input zero_point
    
        Shape:
            - Input: :math:`(N, *, in\_features)` where `*` means any number of
              additional dimensions
            - Weight: :math:`(out\_features, in\_features)`
            - Bias: :math:`(out\_features)`
            - Output: :math:`(N, *, out\_features)`
        """
        if scale is None:
            scale = input.q_scale()
        if zero_point is None:
            zero_point = input.q_zero_point()
        _packed_params = torch.ops.quantized.linear_prepack(weight, bias)
        return torch.ops.quantized.linear(input, _packed_params, scale, zero_point)
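    • A hedged usage sketch of this function (assumes a PyTorch build with a quantized CPU backend such as fbgemm; all scales and zero points below are arbitrary illustration values):

      import torch
      import torch.ao.nn.quantized.functional as qF

      x = torch.randn(4, 20)
      w = torch.randn(30, 20)
      b = torch.randn(30)

      # activations are quantized to quint8, weights to qint8 (weight zero_point must be 0)
      xq = torch.quantize_per_tensor(x, scale=0.05, zero_point=64, dtype=torch.quint8)
      wq = torch.quantize_per_tensor(w, scale=0.05, zero_point=0, dtype=torch.qint8)

      yq = qF.linear(xq, wq, b, scale=0.2, zero_point=128)
      print(yq.shape)                # torch.Size([4, 30])
      print(yq.dequantize()[0, :3])  # approximate float outputs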
    
    
    • In machine learning, particularly when running deep learning models on CPUs or specialized accelerators (e.g., GPUs), full floating-point computation can be expensive in both compute and memory. Quantization is a technique for representing and storing numerical values with a reduced number of bits. It allows more efficient computation and storage, particularly in scenarios where some precision can be traded off for performance.

    • In PyTorch, a quantized tensor represents data that has been quantized to a lower bit width (e.g., 8 bits) from the usual 32-bit floating-point representation. This is achieved by mapping a range of real numbers to a limited set of integer values. Quantization reduces memory usage and computation time while still allowing for reasonable accuracy in many applications.

    • Read here for more details: https://github.com/pytorch/pytorch/wiki/Introducing-Quantized-Tensor

    • Scale:

      • The scale is a positive floating-point value that represents the step size between consecutive integer values in the quantized tensor. It determines the precision of the quantized representation.

      • A larger scale means that the quantized values are spread out over a larger range, providing a lower level of precision.

      • A smaller scale means that the quantized values are closer together, resulting in a higher level of precision.

      • Scale is used to map between the integer value in the quantized tensor and the actual floating-point value in the original data space.

    • Zero Point:

      • The zero point is an integer value that corresponds to the quantized value that represents zero in the original data space.

      • It indicates the point around which the quantized values are centered.

      • The relationship between the quantized integers and the original floating-point values is determined by both the scale and the zero point: a quantized integer q maps back to (q - zero_point) * scale, so the integer equal to the zero point dequantizes exactly to floating-point zero (see the sketch at the end of this section).

    • The function torch.ops.quantized.linear_prepack is used to pack the quantized weight and bias into a format that can be efficiently used by the subsequent linear transformation operation. This step optimizes memory access patterns and computation for efficient execution on hardware accelerators.
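    • A small sketch of the scale / zero point mapping (the numbers are arbitrary):

      import torch

      x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
      scale, zero_point = 0.05, 64

      xq = torch.quantize_per_tensor(x, scale, zero_point, dtype=torch.quint8)
      print(xq.q_scale(), xq.q_zero_point())    # 0.05 64

      # stored integers, and the dequantization rule: real = (q - zero_point) * scale
      q = xq.int_repr()
      print(q)                                  # tensor([ 44,  64,  74, 104], dtype=torch.uint8)
      print((q.float() - zero_point) * scale)   # recovers the inputs (up to quantization error)
      print(xq.dequantize())                    # same thing via the built-in
      # note: the entry equal to zero_point (64) dequantizes exactly to 0.0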