
```python
class Linear(Module):
    r"""Applies a linear transformation to the incoming data: :math:`y = xA^T + b`

    This module supports :ref:`TensorFloat32<tf32_on_ampere>`.

    On certain ROCm devices, when using float16 inputs this module will use
    :ref:`different precision<fp16_on_mi200>` for backward.

    Args:
        in_features: size of each input sample
        out_features: size of each output sample
        bias: If set to ``False``, the layer will not learn an additive bias.
            Default: ``True``

    Shape:
        - Input: :math:`(*, H_{in})` where :math:`*` means any number of
          dimensions including none and :math:`H_{in} = \text{in\_features}`.
        - Output: :math:`(*, H_{out})` where all but the last dimension
          are the same shape as the input and :math:`H_{out} = \text{out\_features}`.

    Attributes:
        weight: the learnable weights of the module of shape
            :math:`(\text{out\_features}, \text{in\_features})`. The values are
            initialized from :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})`, where
            :math:`k = \frac{1}{\text{in\_features}}`
        bias: the learnable bias of the module of shape :math:`(\text{out\_features})`.
            If :attr:`bias` is ``True``, the values are initialized from
            :math:`\mathcal{U}(-\sqrt{k}, \sqrt{k})` where
            :math:`k = \frac{1}{\text{in\_features}}`

    Examples::

        >>> m = nn.Linear(20, 30)
        >>> input = torch.randn(128, 20)
        >>> output = m(input)
        >>> print(output.size())
        torch.Size([128, 30])
    """

    __constants__ = ['in_features', 'out_features']
    in_features: int
    out_features: int
    weight: Tensor

    def __init__(self, in_features: int, out_features: int, bias: bool = True,
                 device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
        if bias:
            self.bias = Parameter(torch.empty(out_features, **factory_kwargs))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
        # https://github.com/pytorch/pytorch/issues/57109
        init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
            init.uniform_(self.bias, -bound, bound)

    def forward(self, input: Tensor) -> Tensor:
        return F.linear(input, self.weight, self.bias)

    def extra_repr(self) -> str:
        return 'in_features={}, out_features={}, bias={}'.format(
            self.in_features, self.out_features, self.bias is not None
        )
```

The function that confused me the most was `reset_parameters`. I did a little research, and here's what I came up with:
The purpose of the `reset_parameters` method is to provide a consistent and standardized way to initialize the parameters after they have been created in the `__init__` method. It's common practice to reset or initialize parameters separately from their creation, as this allows for better control over the initialization process. It also makes it easier to change the initialization strategy without modifying the `__init__` method.
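As a sketch of this create-then-initialize split (plain Python, no torch; `TinyLinear` and its uniform bound are my own stand-ins, mirroring what `reset_parameters` ends up doing):

```python
import math
import random

class TinyLinear:
    """Minimal sketch of the create-then-reset pattern used by nn.Linear."""

    def __init__(self, in_features, out_features):
        # __init__ only records sizes and allocates storage...
        self.in_features = in_features
        self.weight = [[0.0] * in_features for _ in range(out_features)]
        # ...then delegates all initialization to reset_parameters
        self.reset_parameters()

    def reset_parameters(self):
        # same bound nn.Linear ends up with: U(-1/sqrt(in_features), 1/sqrt(in_features))
        bound = 1 / math.sqrt(self.in_features)
        for row in self.weight:
            for j in range(len(row)):
                row[j] = random.uniform(-bound, bound)

m = TinyLinear(20, 30)
m.reset_parameters()  # re-initialize in place, without rebuilding the module
```

Because initialization lives in its own method, you can swap the strategy (or re-roll the weights) by overriding or re-calling `reset_parameters`, leaving `__init__` untouched.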


Kaiming Uniform Initialization:

Kaiming initialization (also known as He initialization) is a popular initialization technique for neural network weights. It is designed to help gradients flow well during training, which can lead to faster and more stable convergence.

The specific choice of `a=math.sqrt(5)` in the Kaiming uniform initialization looks arbitrary, but as the code comment notes, it makes `kaiming_uniform_` reproduce the historical default of drawing weights from `uniform(-1/sqrt(in_features), 1/sqrt(in_features))` (see pytorch/pytorch#57109).
This initialization keeps the weights within a reasonable range, which helps prevent the network from saturating (where gradients become too small) or exploding (where gradients become too large) during training.
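We can check that equivalence numerically: with `a = sqrt(5)` and the leaky-ReLU gain that `kaiming_uniform_` uses, the uniform bound collapses to `1/sqrt(fan_in)`. Pure-Python arithmetic below; `fan_in = 20` is just chosen to match the docstring's `nn.Linear(20, 30)` example:

```python
import math

fan_in = 20  # in_features of the example nn.Linear(20, 30)
a = math.sqrt(5)

# kaiming_uniform_ uses the leaky_relu gain: sqrt(2 / (1 + a^2))
gain = math.sqrt(2.0 / (1 + a ** 2))

# and draws from U(-bound, bound) with bound = gain * sqrt(3 / fan_in)
kaiming_bound = gain * math.sqrt(3.0 / fan_in)

# the "classic" bound from the code comment: 1 / sqrt(in_features)
classic_bound = 1 / math.sqrt(fan_in)

print(kaiming_bound, classic_bound)  # both ~0.2236
```

Algebraically, `sqrt(2/(1+5)) * sqrt(3/fan_in) = sqrt(1/fan_in)`, so the two bounds are identical for any `fan_in`.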

I'll make some more detailed notes on weight initialisation later, because I didn't really understand what these schemes are beyond how they're being used here. Refer to those notes when you want to dig deeper; for now, just understand this and move on.



```python
# functional.py
def linear(
    input: Tensor, weight: Tensor, bias: Optional[Tensor] = None,
    scale: Optional[float] = None, zero_point: Optional[int] = None
) -> Tensor:
    r"""
    Applies a linear transformation to the incoming quantized data:
    :math:`y = xA^T + b`.
    See :class:`~torch.ao.nn.quantized.Linear`

    .. note::

        Current implementation packs weights on every call, which has penalty on performance.
        If you want to avoid the overhead, use :class:`~torch.ao.nn.quantized.Linear`.

    Args:
        input (Tensor): Quantized input of type `torch.quint8`
        weight (Tensor): Quantized weight of type `torch.qint8`
        bias (Tensor): None or fp32 bias of type `torch.float`
        scale (double): output scale. If None, derived from the input scale
        zero_point (long): output zero point. If None, derived from the input zero_point

    Shape:
        - Input: :math:`(N, *, in\_features)` where `*` means any number of
          additional dimensions
        - Weight: :math:`(out\_features, in\_features)`
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if scale is None:
        scale = input.q_scale()
    if zero_point is None:
        zero_point = input.q_zero_point()
    _packed_params = torch.ops.quantized.linear_prepack(weight, bias)
    return torch.ops.quantized.linear(input, _packed_params, scale, zero_point)
```

In machine learning, particularly when working with deep learning models on hardware like CPUs or specialized accelerators (e.g., GPUs), using floating-point numbers for computations can be computationally expensive and memory-intensive. Quantization is a technique used to represent and store numerical values using a reduced number of bits. It allows for more efficient computation and storage of data, particularly in scenarios where precision can be traded off for performance.

In PyTorch, a quantized tensor represents data that has been quantized to a lower bit width (e.g., 8 bits) from the usual 32-bit floating-point representation. This is achieved by mapping a range of real numbers to a limited set of integer values. Quantization reduces memory usage and computation time while still allowing for reasonable accuracy in many applications.
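As a concrete sketch of that mapping (plain Python, function names are mine, mimicking per-tensor affine quantization into `quint8`'s 0-255 range):

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    # map a real value to an integer code, then clamp to the representable range
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    # recover an approximation of the original real value
    return (q - zero_point) * scale

scale, zero_point = 0.1, 128
q = quantize(2.5, scale, zero_point)       # 25 + 128 = 153
x_hat = dequantize(q, scale, zero_point)   # back to 2.5 (up to rounding)
```

Values outside the representable range get clamped, and values between integer steps get rounded; both effects are the precision you trade away for the smaller representation.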

Read here for more details: https://github.com/pytorch/pytorch/wiki/Introducing-Quantized-Tensor

Scale:

The scale is a positive floating-point value that represents the step size between consecutive integer values in the quantized tensor. It determines the precision of the quantized representation.

A larger scale means that the quantized values are spread out over a larger range, providing a lower level of precision.

A smaller scale means that the quantized values are closer together, resulting in a higher level of precision.

Scale is used to map between the integer value in the quantized tensor and the actual floating-point value in the original data space.


Zero Point:

The zero point is an integer value that corresponds to the quantized value that represents zero in the original data space.

It indicates the point around which the quantized values are centered.

The relationship between quantized integers and real values is determined jointly by the scale and the zero point: the floating-point value represented by a quantized integer `q` is `(q - zero_point) * scale`. In particular, the integer code equal to the zero point dequantizes to exactly 0.0.
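A quick numeric check of both points (again a plain-Python sketch of per-tensor affine dequantization; the values of `scale` and `zero_point` are arbitrary):

```python
def dequant(q, scale, zero_point):
    # real value represented by the integer code q
    return (q - zero_point) * scale

scale, zero_point = 0.05, 10

# the code equal to the zero point represents exactly 0.0
zero_value = dequant(zero_point, scale, zero_point)

# the step between consecutive codes equals the scale:
# larger scale -> coarser precision, smaller scale -> finer precision
step = dequant(zero_point + 1, scale, zero_point) - dequant(zero_point, scale, zero_point)
```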


The function `torch.ops.quantized.linear_prepack` is used to pack the quantized weight and bias into a format that can be efficiently used by the subsequent linear transformation operation. This step optimizes memory access patterns and computation for efficient execution on hardware accelerators.

nn.Linear
August 16, 2023