Quantization is a technique that reduces the precision of the model weights further, but in a more structured way than simply casting to a lower-precision type. It computes a quantization factor that scales the weights down to a lower-precision representation, even to integer values. This transforms a continuous space (like float32) into a discrete one, creating 'bins' into which the weights are grouped. The denser regions of the weight distribution get smaller bins (higher resolution), while the tails get larger bins (lower resolution).
This technique is particularly beneficial for deploying models on hardware with limited resources or for speeding up inference: the weights are mapped to lower precision in a way that minimizes the loss of accuracy, making quantization a powerful tool for optimizing models for resource-constrained environments.
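To make the scaling step concrete, the snippet below is a minimal sketch of simple symmetric absmax int8 quantization in PyTorch: a single quantization factor maps the float32 weights into 256 integer bins, and dividing by the same factor recovers an approximation of the originals. This is an illustrative uniform scheme (not the quantile-based 4-bit scheme QLoRA uses, which assigns finer bins to denser regions), and the function names are just for this example.

```python
import torch

def absmax_quantize(weights: torch.Tensor):
    """Symmetric absmax quantization of float32 weights to int8.

    The quantization factor (scale) maps the continuous float range
    onto the 256 discrete int8 bins.
    """
    scale = 127.0 / weights.abs().max()                       # quantization factor
    q_weights = torch.round(weights * scale).to(torch.int8)   # discrete bins
    return q_weights, scale

def absmax_dequantize(q_weights: torch.Tensor, scale: torch.Tensor):
    """Recover approximate float32 weights from the int8 representation."""
    return q_weights.to(torch.float32) / scale

# Toy weight matrix: float32 -> int8 and back
w = torch.randn(4, 4)
q, scale = absmax_quantize(w)
w_hat = absmax_dequantize(q, scale)
print("max reconstruction error:", (w - w_hat).abs().max().item())
```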
QLoRA combines quantization with LoRA to reduce the hardware requirements even further. The 2023 QLoRA paper by Dettmers et al. quantized the frozen base-model weights to 4-bit precision and trained LoRA adapters on top of them (applied to the transformer's linear layers rather than only the attention projections).
As per the paper, the QLoRA finetuning approach reduces memory usage enough to finetune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit finetuning task performance.
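In practice, this setup is often reproduced with Hugging Face tooling. The sketch below is a minimal example assuming the transformers, peft, and bitsandbytes libraries: the base model is loaded with 4-bit NF4 quantization, and LoRA adapters are attached to a few projection layers. The model name and hyperparameters are illustrative placeholders, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base weights
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type from the paper
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Trainable low-rank adapters on top of the frozen 4-bit model
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the LoRA weights are trainable
```

Only the small LoRA matrices are updated during training, so optimizer state and gradients are kept for a tiny fraction of the parameters while the 4-bit base model stays frozen in memory.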