Efficient Learning With Sine-Activated Low-rank Matrices

1 Australian Institute for Machine Learning (AIML), University of Adelaide
2 The Australian National University    3 DATA61, CSIRO
ICLR 2025

*Equal Contribution

Abstract

Low-rank decomposition has emerged as a vital tool for enhancing parameter efficiency in neural network architectures, gaining traction across diverse applications in machine learning. These techniques significantly lower the number of parameters, striking a balance between compactness and performance. However, a common challenge is the trade-off between parameter efficiency and accuracy: reducing parameters often diminishes accuracy relative to full-rank counterparts. In this work, we propose a novel theoretical framework that integrates a sinusoidal function within the low-rank decomposition process. This approach preserves the parameter efficiency characteristic of low-rank methods while increasing the decomposition's rank, thereby enhancing model performance. Our method proves to be a plug-in enhancement for existing low-rank models, as evidenced by its successful application to Vision Transformers (ViT), Large Language Models (LLMs), Neural Radiance Fields (NeRF), and 3D shape modelling.
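
Concretely, in the notation of Figure 2 below, the idea is to keep the low-rank factors \( \mathbf{U} \in \mathbb{R}^{m \times r} \) and \( \mathbf{V} \in \mathbb{R}^{n \times r} \) but pass their product through an element-wise sine before it is used as a weight matrix. The frequency \( \omega \) and normalizing constant \( s \) below follow the code snippets later on this page; the exact normalization used in the paper may differ:

\[ \mathbf{W}_{\text{lr}} = \mathbf{U}\mathbf{V}^T \quad \longrightarrow \quad \mathbf{W}_{\sin} = \frac{1}{s}\,\sin\!\big(\omega \cdot \mathbf{U}\mathbf{V}^T\big), \]

which leaves the trainable parameter count at \( r(m+n) \) while raising the rank of the resulting matrix.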

Methodology

Figure 2

These figures display weight magnitudes for \( 128 \times 128 \) matrices. The first shows a heatmap of a full-rank matrix with Kaiming uniform initialization, highlighting linear independence among its rows. The second shows a low-rank matrix \( \mathbf{W}_{\text{lr}} = \mathbf{U} \mathbf{V}^T \in \mathbb{R}^{128 \times 128} \), with \( \mathbf{U}, \mathbf{V} \in \mathbb{R}^{128 \times 1} \) initialized with Kaiming uniform, illustrating minimal linear independence. The final pair reveals how applying a sine function element-wise, \( \sin(\omega \cdot \mathbf{U} \mathbf{V}^T) \), affects linear independence in low-rank matrices: increasing \( \omega \) from 100 to 2000 progressively increases linear independence.
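
A quick way to see this effect numerically is to compare the numerical rank of \( \mathbf{U}\mathbf{V}^T \) with that of \( \sin(\omega \cdot \mathbf{U}\mathbf{V}^T) \). The sketch below is illustrative only, not the authors' script; the matrix sizes and Kaiming uniform initialization mirror the figure, and torch.linalg.matrix_rank is used with its default tolerance:

## Numerical-rank illustration for Figure 2 (illustrative sketch)
import torch
import torch.nn as nn

torch.manual_seed(0)
d, r = 128, 1

# Rank-1 factors with Kaiming uniform initialization, as in the figure
U = torch.empty(d, r)
V = torch.empty(d, r)
nn.init.kaiming_uniform_(U)
nn.init.kaiming_uniform_(V)

W_lr = U @ V.T                                       # plain low-rank product
print("rank of U V^T:", torch.linalg.matrix_rank(W_lr).item())

for omega in (100.0, 2000.0):
    W_sin = torch.sin(omega * W_lr)                  # element-wise sine of the scaled product
    print(f"rank of sin({omega:.0f} * U V^T):", torch.linalg.matrix_rank(W_sin).item())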

Results

Neural Radiance Fields (NeRFs) on the LLFF datasets.
Fine-tuning the LLaMA 3-8B model with the LoRA and sine-LoRA methods on commonsense reasoning benchmarks.
Fine-tuning the LLaMA 3-8B model with the DoRA and sine-DoRA methods on commonsense reasoning benchmarks.

Implementation of Sine LoRA compared to LoRA, and of Sine DoRA compared to DoRA, in PyTorch-like code


## LoRA forward pass
def forward(self, x: torch.Tensor):
    # Frozen pretrained projection; transpose(w, self.fan_in_fan_out) handles layers that store
    # the weight transposed, as in PEFT-style LoRA layers.
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    # Low-rank update x A^T B^T, scaled by alpha / r
    result += ((self.lora_dropout(x.to(self.lora_A.weight.dtype)) @ self.lora_A.weight.T) @ self.lora_B.weight.T) * self.scaling
    return result

## Sine LoRA forward pass
def forward(self, x: torch.Tensor):
    result = F.linear(x, transpose(self.weight, self.fan_in_fan_out), bias=self.bias)
    dropout_x = self.lora_dropout(x.to(self.lora_A.weight.dtype))
    # Element-wise sine of the low-rank product A^T B^T, with frequency freq and scaling constant s
    result += (dropout_x @ torch.sin(self.freq * self.lora_A.weight.T @ self.lora_B.weight.T)) / self.s * self.scaling
    return result

## DoRA forward pass
def forward(self, x: torch.Tensor):
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
    dropout_x = self.lora_dropout(x)

    # Merged weight W0 + B A * scaling, and the learned magnitude m divided by its (detached) row-wise norm
    new_weight_v = self.weight + (self.lora_B.weight @ self.lora_A.weight) * self.scaling
    norm_scale = self.weight_m_wdecomp.weight.view(-1) / torch.linalg.norm(new_weight_v, dim=1).detach()
    result = base_result + (norm_scale - 1) * F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out))
    result += (norm_scale * self.lora_B(self.lora_A(dropout_x.to(self.lora_A.weight.dtype)))) * self.scaling
    if self.bias is not None:
        result += self.bias.view(1, -1).expand_as(result)
    return result

## Sine DoRA forward pass
def forward(self, x: torch.Tensor):
    base_result = F.linear(x, transpose(self.weight, self.fan_in_fan_out))
    dropout_x = self.lora_dropout(x)

    # Sine-activated low-rank update sin(freq * B A) / s replaces DoRA's plain B A update
    delta_w = torch.sin(self.freq * (self.lora_B.weight @ self.lora_A.weight)) / self.s
    new_weight_v = self.weight + delta_w * self.scaling
    # Learned magnitude m divided by the (detached) row-wise norm of the merged weight
    norm_scale = self.weight_m_wdecomp.weight.view(-1) / torch.linalg.norm(new_weight_v, dim=1).detach()
    result = base_result + (norm_scale - 1) * F.linear(dropout_x, transpose(self.weight, self.fan_in_fan_out))
    result += (norm_scale * F.linear(dropout_x.to(self.lora_A.weight.dtype), delta_w)) * self.scaling
    if self.bias is not None:
        result += self.bias.view(1, -1).expand_as(result)
    return result
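
For experimentation outside a PEFT-style codebase, the sine-LoRA update can also be written as a small standalone module. The class below is an illustrative sketch only: the class name, the default freq value, and the handling of the constant s are assumptions for this page, not the authors' released implementation.

## Standalone sine-LoRA linear layer (illustrative sketch only)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SineLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, freq: float = 200.0,
                 s: float = 1.0, scaling: float = 1.0, dropout: float = 0.0):
        super().__init__()
        self.base = base                                   # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init => sin(0) = 0, no update at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.freq, self.s, self.scaling = freq, s, scaling # sine frequency and scaling constants, as in the snippets above
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sine-activated low-rank update sin(freq * B A) / s, applied alongside the frozen base layer
        delta_w = torch.sin(self.freq * (self.lora_B @ self.lora_A)) / self.s
        return self.base(x) + F.linear(self.dropout(x), delta_w) * self.scaling

Because \( \sin(0) = 0 \), initializing lora_B to zeros preserves LoRA's property that the adapted layer starts out identical to the pretrained one, while the trainable parameter count stays at \( r \cdot (\text{in} + \text{out}) \).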
  

BibTeX


@misc{ji2024sineactivatedlowrankmatrices,
    title={Sine Activated Low-Rank Matrices for Parameter Efficient Learning}, 
    author={Yiping Ji and Hemanth Saratchandran and Cameron Gordon and Zeyu Zhang and Simon Lucey},
    year={2024},
    eprint={2403.19243},
    archivePrefix={arXiv},
    primaryClass={cs.LG},
    url={https://arxiv.org/abs/2403.19243}, 
}