A language model captures statistical patterns in language, indicating the likelihood of specific elements (such as words or characters) appearing in a particular context. A "token" is the unit a model operates on: a word, a character, or a word fragment (like -tion), depending on the model; the set of tokens a model knows is its vocabulary.
Individuals proficient in a language possess subconscious statistical knowledge of it. For instance, consider the context "In my spare time, I enjoy __," where English speakers would anticipate that the missing word is more likely to be a recreational activity (e.g., reading) than an object (e.g., chair).
Likewise, language models should excel at completing such prompts. Picture a language model as a "completion engine": when presented with a text (prompt), it can generate a response to seamlessly finish that text. As an illustration:
User's Prompt: "After a challenging day at work, I like to unwind by __."
Language Model's Completion: "After a challenging day at work, I like to unwind by taking a leisurely stroll in the park."
Objective:
The goal of pre-training is to initialize the language model using a large corpus of unlabeled data.
Pre-training typically involves the use of a language modeling objective, such as masked language modeling or predicting the next word (or sentence) in a sequence.
Training data: low-quality data
Data scale: usually on the order of trillions of tokens as of May 2023.
GPT-3’s dataset (OpenAI): 0.5 trillion tokens. I can’t find any public info for GPT-4, but I’d estimate it to use an order of magnitude more data than GPT-3.
Consider a pre-trained LM such as GPT-3.5, or one trained on a diverse open corpus like the OpenWebText dataset. The model is trained to predict the next word in a given sentence based on the context provided by the preceding words.
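To make the objective concrete, here is a minimal sketch of next-token prediction with Hugging Face Transformers. The small "gpt2" checkpoint and the sample sentence are just stand-ins; the point is that passing labels=input_ids makes the model compute the loss of predicting each token from the tokens before it.

```python
# Minimal sketch of the next-token-prediction (causal LM) objective.
# "gpt2" is used only as a small stand-in for a pre-trained LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "In my spare time, I enjoy reading"
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the cross-entropy loss
# of predicting each token from the tokens that precede it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # the language-modeling loss for this snippet
```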
As language models replicate the patterns within their training data, their effectiveness is contingent upon the quality of that data, giving rise to the adage "Garbage in, garbage out." Should you choose to train a language model using Reddit comments, presenting it to your parents might not be the wisest decision.
After pre-training, the model learns a rich representation of language and acquires knowledge about various linguistic aspects.
However, this pre-trained model still needs to be tweaked to perform specific tasks effectively.
That’s where fine-tuning comes in.
Supervised Fine-Tuning
Pretraining optimizes the language model for completion, not for answering. When presented with a question in its pretrained state, such as "How to make pizza," several continuations are equally valid: adding more context to the question ("for a family of six"), appending follow-up questions ("What ingredients do I need? How much time would it take?"), or actually giving the answer. The third option is what we want when seeking a direct response. The objective of Supervised Fine-Tuning (SFT) is to tune the pretrained model so that it generates responses aligned with user expectations.
Objective:
Fine-tuning is performed on a domain-specific dataset to adapt the pre-trained LM to a particular task or dataset. This process helps the model generate more relevant and accurate responses for a specific application.
Training data: high-quality data in the format of (prompt, response)
Data scale: 10,000 - 100,000 (prompt, response) pairs
InstructGPT: ~14,500 pairs (13,000 from labelers + 1,500 from customers)
OpenAssistant: 161,000 messages in 10,000 conversations -> approximately 88,000 pairs
Dialogue-finetuned Gopher: ~5 billion tokens, which I estimate to be on the order of 10M messages. However, keep in mind that these were filtered from the Internet using heuristics, so they are not of the highest quality.
Model input and output
Input: prompt
Output: response for this prompt
Example:
Fine-tune the pre-trained LM on the Alpaca dataset (about 52,000 instruction-response pairs) so that it learns to follow instructions and answer in the expected format rather than merely continuing the text.
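As a rough sketch of what SFT does with a single (prompt, response) pair (the strings below are made up for illustration): the pair is concatenated into one sequence, and the loss is usually computed only on the response tokens by setting the prompt positions in the labels to -100, the ignore index.

```python
# Sketch: turn one (prompt, response) pair into a supervised training example.
# The loss is computed only on the response tokens (prompt labels set to -100).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "How do I make pizza?\n"
response = "Mix flour, water and yeast into a dough, add toppings, then bake."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([prompt_ids + response_ids])
labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])

# input_ids and labels can now be fed to a causal LM exactly as in pre-training;
# -100 tells the cross-entropy loss to ignore the prompt positions.
```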
RLHF
Imagine this idea: What if we had a way to measure how good a response is to a given prompt? Well, we could create a scoring function for that. This function would take a prompt and a response and tell us how good the response is. Then, we could use this scoring function to teach our language models to give better responses.
That's where Reinforcement Learning from Human Feedback (RLHF) comes in. RLHF has two main parts:
Training a Reward Model:
We create a reward model, which is like our scoring function. This model learns to evaluate and score responses based on how good they are.
Optimizing the Language Model:
We then train our language model to generate responses that get high scores from the reward model. In other words, the language model learns to improve and give better answers according to the scoring function we've set up.
Objective:
RLHF is an iterative process that refines the fine-tuned model using reinforcement learning. It involves collecting human feedback on model-generated responses and using this feedback to further improve the model.
Steps:
Interactive Input: Collect user interactions with the model, including corrections or rankings of different responses.
Reward Model: Use this human feedback to create a reward model that guides the model towards generating better responses.
Fine-tuning with Proximal Policy Optimization (PPO): Apply PPO or a similar reinforcement learning algorithm to update the model's parameters based on the reward model.
Example:
Collect human rankings of model-generated responses, train a reward model on these rankings, and then update the language model with a reinforcement learning algorithm like Proximal Policy Optimization (PPO) so that its responses score higher under the reward model.
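For intuition, here is a minimal sketch of the reward-model objective, assuming the human feedback comes as (chosen, rejected) response pairs for the same prompt. The score tensors are dummies standing in for the outputs of a reward model (a language model with a scalar head); a library like trl would then use such a model inside its PPO loop.

```python
# Pairwise (Bradley-Terry style) loss used to train a reward model from
# human preference pairs: maximize the margin between chosen and rejected.
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores of the chosen and rejected responses
# for a small batch of prompts (in practice these come from a scalar-head LM).
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])

loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
# The trained reward model is then the scoring function that PPO
# optimizes the language model against.
```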
But why do we need RLHF?
Improve Model Behavior:
Addressing Biases: RLHF allows you to correct biases or undesired behavior in the model. By collecting human feedback, you can guide the model towards generating more appropriate and unbiased responses.
Handling Ambiguity:
Ambiguous Situations: Language is often ambiguous, and models may struggle in situations with multiple valid responses. RLHF helps the model learn from human preferences, making it more likely to generate responses that align with human expectations.
Reducing Undesirable Outputs:
Mitigating Risks: RLHF is a tool to mitigate the risk of the model generating harmful or inappropriate content. By actively involving humans in the feedback loop, you can catch and correct undesirable outputs.
Exploration and Exploitation:
Balancing Trade-offs: RLHF helps strike a balance between exploration and exploitation. It allows the model to explore new possibilities based on human feedback while still leveraging the knowledge gained during the initial pre-training and fine-tuning phases.
For example, if you asked a chatbot what the weather is like outside, it might respond, “It’s 30 degrees Celsius with clouds and high humidity,” or it might respond, “The temperature is around 30 degrees at the moment. It’s cloudy out and humid, so the air might seem thicker!” Although both responses say the same thing, the second response sounds more natural and provides more context.
As human users rate which model responses they prefer, you can use RLHF for collecting human feedback and improving your model to best serve real people.
Why does training always require more memory than inference? The components that occupy GPU memory during training are:
1. model parameters
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
Model Parameters
You can store the model parameters in either FP32 or FP16:
FP32 training = 4 bytes/param * total params
FP16 training = 2 bytes/param * total params
Optimizer States
AdamW
Momentum: 4 bytes/param
Variance: 4 bytes/param
So that’s 8 bytes/param, which is quite memory-hungry.
bitsandbytes 8-bit Adam
Momentum: 1 byte/param
Variance: 1 byte/param
that becomes 2 bytes/param
SGD
Momentum: 4 bytes/param
that becomes 4 bytes/param
Gradients
FP32 training = 4 bytes/param * total params
Mixed-precision (FP16) training = 4 bytes/param * total params (gradients are typically kept in FP32)
Activations
The size depends on many factors, the key ones being sequence length, hidden size, and batch size.
These include the inputs and outputs passed to and returned by the forward and backward functions, as well as the forward activations saved for gradient computation.
As per the paper "Reducing Activation Recomputation in Large Transformer Models" (Korthikanti et al., 2022), the activation memory per transformer layer without recomputation is approximately s·b·h·(34 + 5·a·s/h) bytes, where s is the sequence length, b the batch size, h the hidden size, and a the number of attention heads.
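Putting the per-parameter numbers above together, here is a back-of-the-envelope estimate for fully fine-tuning a 7B-parameter model in mixed precision with standard AdamW, ignoring activations (which depend on batch size and sequence length):

```python
# Rough GPU-memory estimate for full fine-tuning of a 7B model in mixed precision,
# using the per-parameter costs listed above (activations not included).
params = 7e9

fp16_weights   = 2 * params  # 2 bytes/param
fp32_gradients = 4 * params  # 4 bytes/param
adamw_states   = 8 * params  # momentum + variance, 4 + 4 bytes/param

total_bytes = fp16_weights + fp32_gradients + adamw_states
print(f"~{total_bytes / 1e9:.0f} GB before activations")  # ~98 GB
```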
Mistral-7B-v0.1 is a transformer model, with the following architecture choices:
Grouped-Query Attention
Sliding-Window Attention
Byte-fallback BPE tokenizer
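These choices can be read straight off the model's config. The values in the comments are what the Mistral-7B-v0.1 config reports, to the best of my knowledge:

```python
# Inspect the architecture choices straight from the Hugging Face config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(config.num_attention_heads)   # 32 query heads
print(config.num_key_value_heads)   # 8 KV heads -> grouped-query attention
print(config.sliding_window)        # 4096-token sliding-window attention
```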
Code!
This little gem of the Hugging Face ecosystem is packed with valuable tools. Most of what is relevant to instruction fine-tuning comes in the form of preprocessing and dataset-creation tools: for instance, packing and instruction masking are built into its special Dataset classes.
NEFTune is a technique to boost the performance of chat models, introduced in the paper “NEFTune: Noisy Embeddings Improve Instruction Finetuning” by Jain et al. It consists of adding noise to the embedding vectors during training.
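Assuming the library meant here is trl, this is roughly how those features are exposed (names follow older trl releases and may differ in newer versions):

```python
# Sketch of the trl utilities referred to above (older trl API).
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Instruction masking: compute the loss only on tokens after the response marker,
# so the model isn't trained to regurgitate the instruction itself.
collator = DataCollatorForCompletionOnlyLM("### Response:", tokenizer=tokenizer)

# Packing (packing=True) and NEFTune (neftune_noise_alpha=...) are switched on
# directly as SFTTrainer arguments; see the training setup further below.
```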
Example of Dataset
Alpaca Instruction Format
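The Alpaca dataset pairs an instruction, an optional input, and an output, rendered into a fixed prompt template. Here is a small helper to format one example, assuming the standard tatsu-lab/alpaca release of the dataset:

```python
# Render one Alpaca example (instruction, optional input, output) into the
# prompt format used for fine-tuning.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_alpaca(example: dict) -> str:
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(format_alpaca(dataset[0]))
```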
The Mistral 7B V0.1 Model
FineTune
4-Bit Config
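A typical 4-bit, QLoRA-style loading setup with bitsandbytes. NF4 quantization, double quantization, and bfloat16 compute are the common defaults, not necessarily the exact values used in this run:

```python
# Load Mistral-7B in 4-bit with bitsandbytes (QLoRA-style setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```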
Prepare model with PEFT LoRA
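Then the quantized model is prepared for training and wrapped with LoRA adapters via PEFT. The rank, alpha, and target modules below are common choices for Mistral's attention projections, shown for illustration:

```python
# Attach LoRA adapters to the 4-bit model with PEFT.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,              # LoRA rank
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of params is trainable
```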
Redline the GPU! 🔥
RTX3090
A10G
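A sketch of the training launch on a single 24 GB card such as an RTX 3090 or A10G. The hyperparameters are illustrative, and the argument names again follow the older trl API:

```python
# Put it all together and train; hyperparameters are illustrative for a 24 GB GPU.
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral-7b-alpaca-lora",  # placeholder output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,                 # the 4-bit + LoRA model prepared above
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset.map(lambda ex: {"text": format_alpaca(ex)}),
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,                # pack short examples into full-length sequences
    neftune_noise_alpha=5,       # NEFTune noisy embeddings
)
trainer.train()
```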
What did it cost?
$0.9331/hr * 12 hrs = $11.19
But if we really wanted to, we could run this on Colab Pro as well, or take the g5.4xlarge instance instead:
g5.4xlarge: $0.6942/hr -> $8.33 total (12 hrs)
g5.8xlarge: $0.9331/hr -> $11.19 total (12 hrs)
Or we could reduce the LoRA rank, the number of LoRA target layers, or the dataset size; there’s a lot that can be done!
Save the LoRA Weights
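With PEFT, saving the model stores only the adapter weights, a tiny fraction of the full model. The output directory name is a placeholder:

```python
# Save only the LoRA adapter weights (not the full base model).
trainer.save_model("mistral-7b-alpaca-lora")
# or equivalently:
model.save_pretrained("mistral-7b-alpaca-lora")
tokenizer.save_pretrained("mistral-7b-alpaca-lora")
```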
Inference
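A sketch of generating from the fine-tuned model. The prompt has to use the same Alpaca template as training; the instruction and generation settings are illustrative:

```python
# Generate a completion from the fine-tuned model (model/tokenizer from above).
prompt = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```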
Before Instruction Tuning
After Instruction Tuning
Merge LoRA Weights with the Model and Push to Huggingface Hub
Create a repository on the Hugging Face Hub and log in.
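A sketch of the merge-and-push step: reload the saved adapter, fold it into the base weights, and upload. The repository name is a placeholder for your own:

```python
# Log in to the Hub, merge the LoRA weights into the base model, and push.
# The repository name below is a placeholder for your own account/repo.
import torch
from huggingface_hub import notebook_login
from peft import AutoPeftModelForCausalLM

notebook_login()  # or `huggingface-cli login` from a terminal

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "mistral-7b-alpaca-lora",    # directory with the saved LoRA adapter
    torch_dtype=torch.bfloat16,
).merge_and_unload()             # fold the adapter into the base weights

merged_model.push_to_hub("your-username/mistral-7b-alpaca-instruct")
tokenizer.push_to_hub("your-username/mistral-7b-alpaca-instruct")
```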