
QLoRA


QLoRA stands for "Quantized Low-Rank Adaptation of Large Language Models". It's an algorithm that makes fine-tuning cheaper by approximating large weight matrices with smaller low-rank ones and storing the frozen weights in quantized form.

We use it for Fine-tuning.

The Low-Rank Part

It's based on the observation that the learned weight matrices in neural networks have a low intrinsic dimension (many singular values close to zero) and can thus be well approximated by smaller matrices during training. The weight matrix $W$ is approximated by $W_0 + BA$, where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are low-rank matrices that contain the actual parameters to be trained, while the $W_0$ parameters stay frozen. Here $r$ is the lower rank ($r \ll d$) and $W$ is of size $d \times d$.
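The "low intrinsic dimension" idea can be seen directly with a truncated SVD. Below is a minimal NumPy sketch (sizes and the singular-value decay are made up for illustration): a matrix whose singular values fall off quickly is reconstructed almost exactly from only its top $r$ components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 64x64 matrix whose singular values decay quickly,
# mimicking the low intrinsic dimension of learned weights.
U, _ = np.linalg.qr(rng.normal(size=(64, 64)))
V, _ = np.linalg.qr(rng.normal(size=(64, 64)))
singular_values = 1.0 / (1.0 + np.arange(64)) ** 3  # fast decay (assumed)
W = U @ np.diag(singular_values) @ V.T

# Keep only the r largest singular components.
r = 8
U_r, s_r, Vt_r = np.linalg.svd(W)
W_approx = U_r[:, :r] @ np.diag(s_r[:r]) @ Vt_r[:r, :]

# Relative error is tiny even though rank 8 stores far fewer numbers.
rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

With this decay profile the rank-8 reconstruction is accurate to well under 1% relative error.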

At the start of training, $W_0$ is set to the pretrained $W$, $A$ is initialized randomly, and $B$ starts at $0$, so $BA = 0$ and the adapted model is initially identical to the full model.

$A$ and $B$ are called "adapters".
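The adapter setup above can be sketched in a few lines of NumPy (the sizes `d` and `r` are hypothetical; a real implementation would use a framework like PyTorch and only backpropagate through the adapters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hypothetical sizes: d = model dim, r = adapter rank

# Frozen pretrained weight.
W0 = rng.normal(size=(d, d))

# Adapters: A is random, B starts at zero, so BA = 0 at step 0.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))

x = rng.normal(size=(d,))

# The effective weight is W0 + BA; only A and B would receive gradients.
y_adapted = (W0 + B @ A) @ x
y_base = W0 @ x
```

Because $B = 0$, `y_adapted` equals `y_base` before any training step, which is exactly the "initially identical to the full model" property.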

The Quantized Part

We can convert the weights from 16 bits down to 8 or even 4 bits during fine-tuning without losing too much performance. This is called quantization; it frees up memory and makes it possible to fine-tune larger models on the same hardware.

An important part of the LoRA algorithm is choosing the right value of $r$ for your task: Choosing parameters for Lora
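To get a feel for why $r$ matters, here is the trainable-parameter count for a single $d \times d$ weight at a few ranks (the hidden size `d = 4096` is just an illustrative value):

```python
# Trainable parameters for one d x d weight under LoRA vs. full fine-tuning.
d = 4096  # hypothetical hidden size

full = d * d  # parameters updated by full fine-tuning
# A is (r, d) and B is (d, r), so LoRA trains 2 * r * d parameters.
counts = {r: 2 * r * d for r in (4, 8, 16)}
for r, n in counts.items():
    print(f"r={r}: {n} params ({n / full:.4%} of full)")
```

Even at $r = 16$ the adapters are under 1% of the full matrix's parameters, which is why LoRA fine-tuning is so much cheaper; larger $r$ buys more capacity at the cost of more trainable parameters.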