Language models (starting with GPT-2) are trained on large bodies of text in a ‘self-supervised’ approach, which is a fancy term for next-word prediction.
“The quick brown fox jumps over the lazy ____.”
is an example of next-word prediction in the wild.
This means these models are, by nature, trying to complete text that you started. The mind-blowing thing is how they also learn logic, creativity, and common sense in the process.
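To make this concrete, here is a minimal sketch of next-word prediction using the Hugging Face transformers library and the original GPT-2 checkpoint (the model name "gpt2" and the example sentence are just illustrative choices, not a prescription):

```python
# A minimal sketch of next-word prediction with GPT-2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The quick brown fox jumps over the lazy"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, seq_len, vocab_size)

# The logits at the last position score every vocabulary token
# as a candidate for the next word.
next_token_logits = logits[0, -1]
top5 = torch.topk(next_token_logits, k=5)
for token_id, score in zip(top5.indices, top5.values):
    print(repr(tokenizer.decode(int(token_id))), float(score))
```

Running this prints the model's five highest-scoring candidates for the blank, which is all the model is ever doing under the hood.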
With that context, prompt engineering is the practice of giving the LLM sufficient context so that it can complete the text we provide in the prompt, thereby completing the “task”.
For example, if you have seen prompts that begin like the one below, they are engineering the prompt to give the LLM the right context for the task.
“You are an expert sales copywriter …
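As a sketch of what that looks like in code, the same text-generation call can be given a prompt that carries the context the model needs. The role description and the task below are made up for illustration; they are not the author's full example prompt:

```python
# A hedged sketch of prompt engineering: the prompt itself supplies the
# context (a role and a task) that steers the completion.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

context = "You are an expert sales copywriter."        # illustrative role context
task = "Write a one-sentence tagline for a reusable water bottle."  # illustrative task
prompt = f"{context}\n{task}\nTagline:"

completion = generator(prompt, max_new_tokens=20, do_sample=True)
print(completion[0]["generated_text"])
```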
Parameter-efficient fine-tuning
LLMs have billions of parameters that need tweaking during the fine-tuning step, which can be expensive to do in full. So, on the bleeding edge of LLM research, in an area called parameter-efficient fine-tuning (PEFT), people are experimenting with reducing the number of parameters that need to be updated while still achieving the same quality in the resulting models.
For example, low-rank adaptation (LoRA) is a PEFT technique that has become especially popular among open-source language models. The idea behind LoRA is that instead of updating all the parameters of the base model to fine-tune it for a specific task, a low-rank matrix can represent the adjustments needed for the downstream task with very high accuracy.
Fine-tuning with LoRA trains this low-rank matrix instead of updating the parameters of the main LLM. The LoRA weights are then merged into the main LLM or added to it during inference. LoRA can cut the costs of fine-tuning by up to 98 percent.
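Here is a minimal, from-scratch sketch of the LoRA idea for a single linear layer in PyTorch (not the full method from the LoRA paper or the peft library): the original weight stays frozen, and only the small rank-r matrices A and B receive gradients, with their product added to the layer's output.

```python
# A from-scratch sketch of LoRA applied to one linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the original weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: B @ A has the same shape as the frozen weight matrix.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen projection plus the scaled low-rank update (B @ A) x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # only A and B are trained
```

In this sketch, only about 12 thousand of the roughly 600 thousand parameters in the layer are trainable, which is where the cost savings come from.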
Precision, in the context of LLMs and neural networks, refers to the format and size of the numbers used to represent the weights in a network.
By default, these weights are stored in a 32-bit floating-point format (float32). Lowering the precision, for example to a 16-bit floating-point format (float16), reduces the memory required to store these weights, which can be beneficial in scenarios with memory constraints. However, this comes at a cost in accuracy due to the accumulation of rounding errors. The trade-off between precision and memory can be navigated by using mixed precision, which employs different data types in different parts of the network to balance accuracy and memory efficiency.
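A small sketch of what lower precision buys you, assuming PyTorch (the layer size is arbitrary; the point is the per-weight storage cost, and the autocast call at the end illustrates mixed precision):

```python
# Comparing the storage cost of float32 vs. float16 weights,
# then running a forward pass under mixed precision.
import torch
import torch.nn as nn

layer = nn.Linear(4096, 4096)

w32 = layer.weight          # stored as float32 by default (4 bytes per weight)
w16 = w32.half()            # a float16 copy (2 bytes per weight)
print(f"float32: {w32.numel() * w32.element_size() / 1e6:.1f} MB")
print(f"float16: {w16.numel() * w16.element_size() / 1e6:.1f} MB")

# Mixed precision: keep the master weights in float32 but run the
# forward pass in a lower-precision format where it is safe to do so.
x = torch.randn(8, 4096)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)
print(y.dtype)  # low-precision activations, full-precision master weights
```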
The knowledge encoded in the weights of a language model is derived from its training data.