Efficient Learning With Sine-Activated Low-rank Matrices

1 Australian Institute for Machine Learning (AIML), University of Adelaide
2 The Australian National University    3 DATA61, CSIRO
ICLR 2025

*Equal Contribution

Abstract

Low-rank decomposition has emerged as a vital tool for enhancing parameter efficiency in neural network architectures, gaining traction across diverse applications in machine learning. These techniques significantly lower the number of parameters, striking a balance between compactness and performance. However, a common challenge is the trade-off between parameter efficiency and accuracy: reducing parameters often diminishes accuracy relative to full-rank counterparts. In this work, we propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process. This approach preserves the parameter efficiency characteristic of low-rank methods while increasing the decomposition's rank, thereby enhancing model performance. Our method proves to be a plug-in enhancement for existing low-rank models, as evidenced by its successful application to Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF), and 3D shape modelling.
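
Concretely, in the notation of Figure 2 below, the idea is to keep the low-rank factors \( \mathbf{U} \in \mathbb{R}^{m \times r} \) and \( \mathbf{V} \in \mathbb{R}^{n \times r} \) but pass their product through an element-wise sine before it is used as a weight matrix. The frequency \( \omega \) and normalizing constant \( s \) below follow the code snippets later on this page; the exact normalization used in the paper may differ:

\[ \mathbf{W}_{\text{lr}} = \mathbf{U}\mathbf{V}^T \quad \longrightarrow \quad \mathbf{W}_{\sin} = \frac{1}{s}\,\sin\!\big(\omega \cdot \mathbf{U}\mathbf{V}^T\big), \]

which leaves the trainable parameter count at \( r(m+n) \) while raising the rank of the resulting matrix.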

Methodology

Figure 2

These figures display weight magnitudes for \( 128 \times 128 \) matrices. The first shows a heatmap of a full-rank matrix with Kaiming uniform initialization, highlighting linear independence among its rows. The second shows a low-rank matrix \( \mathbf{W}_{\text{lr}} = \mathbf{U} \mathbf{V}^T \in \mathbb{R}^{128 \times 128} \), with \( \mathbf{U}, \mathbf{V} \in \mathbb{R}^{128 \times 1} \) initialized with Kaiming uniform, illustrating minimal linear independence. The final pair reveals how applying a sine function element-wise, \( \sin(\omega \cdot \mathbf{U} \mathbf{V}^T) \), affects linear independence in low-rank matrices: increasing \( \omega \) from 100 to 2000 progressively increases linear independence.
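
A quick way to see this effect numerically is to compare the numerical rank of \( \mathbf{U}\mathbf{V}^T \) with that of \( \sin(\omega \cdot \mathbf{U}\mathbf{V}^T) \). The sketch below is illustrative only, not the authors' script; the matrix sizes and Kaiming uniform initialization mirror the figure, and torch.linalg.matrix_rank is used with its default tolerance:

## Numerical-rank illustration for Figure 2 (illustrative sketch)
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r = 128, 1

# Rank-1 factors with Kaiming uniform initialization, as in the figure
U = torch.empty(d, r)
V = torch.empty(d, r)
nn.init.kaiming_uniform_(U)
nn.init.kaiming_uniform_(V)

W_lr = U @ V.T                                       # plain low-rank product
print("rank of U V^T:", torch.linalg.matrix_rank(W_lr).item())

for omega in (100.0, 2000.0):
    W_sin = torch.sin(omega * W_lr)                  # element-wise sine of the scaled product
    print(f"rank of sin({omega:.0f} * U V^T):", torch.linalg.matrix_rank(W_sin).item())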

Results

Neural Radiance Fields (NeRFs) on the LLFF datasets.
Fine-tuning the LLaMA 3-8B model with the LoRA and sine-LoRA methods on commonsense reasoning benchmarks.
Fine-tuning the LLaMA 3-8B model with the DoRA and sine-DoRA methods on commonsense reasoning benchmarks.

Implementation of Sine LoRA compared to LoRA, and of Sine DoRA compared to DoRA, in PyTorch-like code


## LoRA forward pass
def forward(self, x: torch.Tensor):
    # Frozen pretrained projection; transpose(w, self.fan_in_fan_out) handles layers that store
    # the weight transposed, as in PEFT-style LoRA layers.
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    # Low-rank update x A^T B^T, scaled by alpha / r
    result += ((self.lora_dropout(x.to(self.lora_A.weight.dtype)) @ self.lora_A.weight.T) @ self.lora_B.weight.T) * self.scaling
    return result

## Sine LoRA forward pass
def forward(self, x: torch.Tensor):
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    dropout_x = self.lora_dropout(x.to(self.lora_A.weight.dtype))
    # Element-wise sine of the low-rank product A^T B^T, with frequency freq and scaling constant s
    result += (dropout_x @ torch.sin(self.freq * self.lora_A.weight.T @ self.lora_B.weight.T)) / self.s * self.scaling
    return result

## DoRA forward pass
def forward(self, x: torch.Tensor):
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
    dropout_x = self.lora_dropout(x)

    # Merged weight W0 + B A * scaling, and the learned magnitude m divided by its (detached) row-wise norm
    new_weight_v = self.weight + (self.lora_B.weight @ self.lora_A.weight) * self.scaling
    norm_scale = self.weight_m_wdecomp.weight.view(-1) / torch.linalg.norm(new_weight_v, dim=1).detach()
    result = base_result + (norm_scale - 1) * F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out))
    result += (norm_scale * self.lora_B(self.lora_A(dropout_x.to(self.lora_A.weight.dtype)))) * self.scaling
    if self.bias is not None:
        result += self.bias.view(1, -1).expand_as(result)
    return result

## Sine DoRA forward pass
def forward(self, x: torch.Tensor):
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
    dropout_x = self.lora_dropout(x)

    # Sine-activated low-rank update sin(freq * B A) / s replaces DoRA's plain B A update
    delta_w = torch.sin(self.freq * (self.lora_B.weight @ self.lora_A.weight)) / self.s
    new_weight_v = self.weight + delta_w * self.scaling
    # Learned magnitude m divided by the (detached) row-wise norm of the merged weight
    norm_scale = self.weight_m_wdecomp.weight.view(-1) / torch.linalg.norm(new_weight_v, dim=1).detach()
    result = base_result + (norm_scale - 1) * F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out))
    result += (norm_scale * F.linear(dropout_x.to(self.lora_A.weight.dtype), delta_w)) * self.scaling
    if self.bias is not None:
        result += self.bias.view(1, -1).expand_as(result)
    return result
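
For experimentation outside a PEFT-style codebase, the sine-LoRA update can also be written as a small standalone module. The class below is an illustrative sketch only: the class name, the default freq value, and the handling of the constant s are assumptions for this page, not the authors' released implementation.

## Standalone sine-LoRA linear layer (illustrative sketch only)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SineLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, freq: float = 200.0,
                 s: float = 1.0, scaling: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.base = base                                   # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => sin(0) = 0, no update at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.freq, self.s, self.scaling = freq, s, scaling # sine frequency and scaling constants, as in the snippets above
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sine-activated low-rank update sin(freq * B A) / s, applied alongside the frozen base layer
        delta_w = torch.sin(self.freq * (self.lora_B @ self.lora_A)) / self.s
        return self.base(x) + F.linear(self.dropout(x), delta_w) * self.scaling

Because \( \sin(0) = 0 \), initializing lora_B to zeros preserves LoRA's property that the adapted layer starts out identical to the pretrained one, while the trainable parameter count stays at \( r \cdot (\text{in} + \text{out}) \).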
  

BibTeX


@misc{ji2024sineactivatedlowrankmatrices,
    title={Sine Activated Low-Rank Matrices for Parameter Efficient Learning}, 
    author={Yiping Ji and Hemanth Saratchandran and Cameron Gordon and Zeyu Zhang and Simon Lucey},
    year={2024},
    eprint={2403.19243},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2403.19243}, 
}