QLoRA
QLoRA stands for "Quantized Low-Rank Adaptation of Large Language Models". It's an algorithm that makes training faster and less memory-hungry by approximating the updates to large weight matrices with much smaller ones, on top of a quantized base model.
We use it for Fine-tuning.
The Low Rank Part
It's based on the fact that the learned weight matrices in neural networks have a low intrinsic dimension (a lot of eigenvalues close to zero) and can thus be well approximated by smaller matrices during the training process. The weight matrix $W \in \mathbb{R}^{d \times k}$ is approximated by $W + BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are lower-rank matrices that contain the actual parameters that will be trained, while the parameters of $W$ stay frozen. $r \ll \min(d, k)$ is the lower dimension, so $BA$ has the same size $d \times k$ as $W$ but far fewer trainable parameters.
During the training process, $A$ is initialized randomly and $B$ starts at $0$, so that the product $BA$ starts at $0$ and the model is initially identical to the full model.
$A$ and $B$ are called "adapters".
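To make this concrete, here is a minimal sketch of a LoRA-style linear layer, assuming PyTorch; the class name, the $0.01$ init scale, and the absence of the usual $\alpha / r$ scaling factor are simplifications for illustration, not the reference implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + B(A x)."""

    def __init__(self, linear: nn.Linear, r: int):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)  # the pretrained parameters stay frozen
        d, k = linear.out_features, linear.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: random init
        self.B = nn.Parameter(torch.zeros(d, r))         # B: zeros, so BA = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Equivalent to (W + BA) x, without ever materialising the d x k update
        return self.linear(x) + (x @ self.A.T) @ self.B.T
```

For a $1024 \times 1024$ layer with $r = 8$, `LoRALinear(nn.Linear(1024, 1024), r=8)` trains only $2 \cdot 1024 \cdot 8 = 16{,}384$ parameters instead of about a million.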
The Quantized Part
We can convert the weights from 16 bits to 8 or even 4 bits during the fine-tuning without losing too much performance. This frees up memory and allows for faster training. This is called quantization.
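As a toy illustration of the idea, here is absmax quantization to 8 bits in plain NumPy; the function names are made up for this sketch, and QLoRA itself uses a more sophisticated 4-bit NormalFloat (NF4) data type rather than this scheme:

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    """Toy absmax quantization: scale by the largest magnitude, round to int8."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_absmax_int8(w)
print(np.abs(w - dequantize(q, scale)).max())  # small reconstruction error
```

Each int8 weight takes a quarter of the memory of a float32 one, at the cost of the small rounding error printed above.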
An important part of the LoRA algorithm is choosing the right value of $r$ for your task: Choosing parameters for Lora
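To show where $r$ enters in practice, here is roughly how the adapters are configured with Hugging Face's peft library; the model name and target modules are just example values, not a recommendation:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
config = LoraConfig(
    r=8,                                  # the rank r discussed above
    lora_alpha=16,                        # scaling applied to the BA update
    target_modules=["q_proj", "v_proj"],  # which weight matrices get adapters
    lora_dropout=0.05,
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapters are trainable
```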