
QLoRA


QLoRA stands for "Quantized Low-Rank Adaptation of Large Language Models". It's an algorithm that makes fine-tuning cheaper by approximating large weight matrices with smaller low-rank ones and storing the frozen weights in quantized form.

We use it for Fine-tuning.

The Low-Rank Part

It's based on the observation that the learned weight matrices in neural networks have a low intrinsic dimension (many singular values close to zero) and can thus be well approximated by smaller matrices during training. The weight matrix $W$ is approximated by $W_0 + BA$, where $A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are low-rank matrices that contain the actual parameters to be trained, while the $W_0$ parameters stay frozen. Here $r$ is the lower rank ($r \ll d$) and $W$ is of size $d \times d$.
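The "low intrinsic dimension" idea can be seen directly with a truncated SVD. Below is a minimal NumPy sketch (sizes and the singular-value decay are made up for illustration): a matrix whose singular values fall off quickly is reconstructed almost exactly from only its top $r$ components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 64x64 matrix whose singular values decay quickly,
# mimicking the low intrinsic dimension of learned weights.
U, _ = np.linalg.qr(rng.normal(size=(64, 64)))
V, _ = np.linalg.qr(rng.normal(size=(64, 64)))
singular_values = 1.0 / (1.0 + np.arange(64)) ** 3  # fast decay (assumed)
W = U @ np.diag(singular_values) @ V.T

# Keep only the r largest singular components.
r = 8
U_r, s_r, Vt_r = np.linalg.svd(W)
W_approx = U_r[:, :r] @ np.diag(s_r[:r]) @ Vt_r[:r, :]

# Relative error is tiny even though rank 8 stores far fewer numbers.
rel_error = np.linalg.norm(W - W_approx) / np.linalg.norm(W)
```

With this decay profile the rank-8 reconstruction is accurate to well under 1% relative error.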

At the start of training, $W_0$ is set to the pretrained $W$, $A$ is initialized randomly, and $B$ starts at $0$, so $BA = 0$ and the adapted model is initially identical to the full model.

$A$ and $B$ are called "adapters".
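The adapter setup above can be sketched in a few lines of NumPy (the sizes `d` and `r` are hypothetical; a real implementation would use a framework like PyTorch and only backpropagate through the adapters):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # hypothetical sizes: d = model dim, r = adapter rank

# Frozen pretrained weight.
W0 = rng.normal(size=(d, d))

# Adapters: A is random, B starts at zero, so BA = 0 at step 0.
A = rng.normal(size=(r, d)) * 0.01
B = np.zeros((d, r))

x = rng.normal(size=(d,))

# The effective weight is W0 + BA; only A and B would receive gradients.
y_adapted = (W0 + B @ A) @ x
y_base = W0 @ x
```

Because $B = 0$, `y_adapted` equals `y_base` before any training step, which is exactly the "initially identical to the full model" property.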

The Quantized Part

We can convert the weights from 16 bits down to 8 or even 4 bits during fine-tuning without losing too much performance. This is called quantization; it frees up memory and makes it possible to fine-tune larger models on the same hardware.

An important part of the LoRA algorithm is choosing the right value of $r$ for your task: Choosing parameters for Lora
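To get a feel for why $r$ matters, here is the trainable-parameter count for a single $d \times d$ weight at a few ranks (the hidden size `d = 4096` is just an illustrative value):

```python
# Trainable parameters for one d x d weight under LoRA vs. full fine-tuning.
d = 4096  # hypothetical hidden size

full = d * d  # parameters updated by full fine-tuning
# A is (r, d) and B is (d, r), so LoRA trains 2 * r * d parameters.
counts = {r: 2 * r * d for r in (4, 8, 16)}
for r, n in counts.items():
    print(f"r={r}: {n} params ({n / full:.4%} of full)")
```

Even at $r = 16$ the adapters are under 1% of the full matrix's parameters, which is why LoRA fine-tuning is so much cheaper; larger $r$ buys more capacity at the cost of more trainable parameters.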