A language model captures statistical patterns in language, indicating the likelihood of specific elements (such as words or characters) appearing in a particular context. A "token" is the unit a model operates on: a word, a character, or a word fragment (like -tion), depending on the model; the set of tokens a model knows is its vocabulary.
Individuals proficient in a language possess subconscious statistical knowledge of it. For instance, consider the context "In my spare time, I enjoy __," where English speakers would anticipate that the missing word is more likely to be a recreational activity (e.g., reading) than an object (e.g., chair).
Likewise, language models should excel at completing such prompts. Picture a language model as a "completion engine": when presented with a text (prompt), it can generate a response to seamlessly finish that text. As an illustration:
User's Prompt: "After a challenging day at work, I like to unwind by __."
Language Model's Completion: "After a challenging day at work, I like to unwind by taking a leisurely stroll in the park."
Objective:
The goal of pre-training is to initialize the language model using a large corpus of unlabeled data.
Pre-training typically involves the use of a language modeling objective, such as masked language modeling or predicting the next word (or sentence) in a sequence.
Training data: low-quality data
Data scale: usually on the order of trillions of tokens as of May 2023.
GPT-3’s dataset (OpenAI): 0.5 trillion tokens. I can’t find any public info for GPT-4, but I’d estimate it to use an order of magnitude more data than GPT-3.
Consider a pre-trained LM such as GPT-3.5, or one trained on a diverse open corpus like the OpenWebText dataset. The model is trained to predict the next word in a given sentence based on the context provided by the preceding words.
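To make the objective concrete, here is a minimal sketch of next-token prediction with Hugging Face Transformers. The small "gpt2" checkpoint and the sample sentence are just stand-ins; the point is that passing labels=input_ids makes the model compute the loss of predicting each token from the tokens before it.

```python
# Minimal sketch of the next-token-prediction (causal LM) objective.
# "gpt2" is used only as a small stand-in for a pre-trained LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "In my spare time, I enjoy reading"
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model compute the cross-entropy loss
# of predicting each token from the tokens that precede it.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # the language-modeling loss for this snippet
```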
As language models replicate the patterns within their training data, their effectiveness is contingent upon the quality of that data, giving rise to the adage "Garbage in, garbage out." Should you choose to train a language model using Reddit comments, presenting it to your parents might not be the wisest decision.
After pre-training, the model learns a rich representation of language and acquires knowledge about various linguistic aspects.
However, this pre-trained model still needs to be tweaked to perform specific tasks effectively.
That’s where fine-tuning comes in.
Supervised Fine-Tuning
Pretraining optimizes the language model for completion, not for answering. When presented with a question in its pretrained state, such as "How to make pizza," several continuations are equally valid: adding more context to the question ("for a family of six"), appending follow-up questions ("What ingredients do I need? How much time would it take?"), or actually giving the answer. The third option is what we want when seeking a direct response. The objective of Supervised Fine-Tuning (SFT) is to tune the pretrained model so that it generates responses aligned with user expectations.
Objective:
Fine-tuning is performed on a domain-specific dataset to adapt the pre-trained LM to a particular task or dataset. This process helps the model generate more relevant and accurate responses for a specific application.
Training data: high-quality data in the format of (prompt, response)
Data scale: 10,000 - 100,000 (prompt, response) pairs
InstructGPT: ~14,500 pairs (13,000 from labelers + 1,500 from customers)
OpenAssistant: 161,000 messages in 10,000 conversations -> approximately 88,000 pairs
Dialogue-finetuned Gopher: ~5 billion tokens, which I estimate to be on the order of 10M messages. However, keep in mind that these were filtered from the Internet using heuristics, so they are not of the highest quality.
Model input and output
Input: prompt
Output: response for this prompt
Example:
Fine-tune the pre-trained LM on the Alpaca dataset (about 52,000 instruction-response pairs) so that it learns to follow instructions and answer in the expected format rather than merely continuing the text.
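As a rough sketch of what SFT does with a single (prompt, response) pair (the strings below are made up for illustration): the pair is concatenated into one sequence, and the loss is usually computed only on the response tokens by setting the prompt positions in the labels to -100, the ignore index.

```python
# Sketch: turn one (prompt, response) pair into a supervised training example.
# The loss is computed only on the response tokens (prompt labels set to -100).
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "How do I make pizza?\n"
response = "Mix flour, water and yeast into a dough, add toppings, then bake."

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

input_ids = torch.tensor([prompt_ids + response_ids])
labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])

# input_ids and labels can now be fed to a causal LM exactly as in pre-training;
# -100 tells the cross-entropy loss to ignore the prompt positions.
```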
RLHF
Imagine this idea: What if we had a way to measure how good a response is to a given prompt? Well, we could create a scoring function for that. This function would take a prompt and a response and tell us how good the response is. Then, we could use this scoring function to teach our language models to give better responses.
That's where Reinforcement Learning from Human Feedback (RLHF) comes in. RLHF has two main parts:
Training a Reward Model:
We create a reward model, which is like our scoring function. This model learns to evaluate and score responses based on how good they are.
Optimizing the Language Model:
We then train our language model to generate responses that get high scores from the reward model. In other words, the language model learns to improve and give better answers according to the scoring function we've set up.
Objective:
RLHF is an iterative process that refines the fine-tuned model using reinforcement learning. It involves collecting human feedback on model-generated responses and using this feedback to further improve the model.
Steps:
Interactive Input: Collect user interactions with the model, including corrections or rankings of different responses.
Reward Model: Use this human feedback to create a reward model that guides the model towards generating better responses.
Fine-tuning with Proximal Policy Optimization (PPO): Apply PPO or a similar reinforcement learning algorithm to update the model's parameters based on the reward model.
Example:
Collect human rankings of model-generated responses, train a reward model on these rankings, and then update the language model with a reinforcement learning algorithm like Proximal Policy Optimization (PPO) so that its responses score higher under the reward model.
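For intuition, here is a minimal sketch of the reward-model objective, assuming the human feedback comes as (chosen, rejected) response pairs for the same prompt. The score tensors are dummies standing in for the outputs of a reward model (a language model with a scalar head); a library like trl would then use such a model inside its PPO loop.

```python
# Pairwise (Bradley-Terry style) loss used to train a reward model from
# human preference pairs: maximize the margin between chosen and rejected.
import torch
import torch.nn.functional as F

# Stand-ins for reward-model scores of the chosen and rejected responses
# for a small batch of prompts (in practice these come from a scalar-head LM).
reward_chosen = torch.tensor([1.2, 0.3, 0.8])
reward_rejected = torch.tensor([0.4, 0.5, -0.1])

loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
# The trained reward model is then the scoring function that PPO
# optimizes the language model against.
```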
But why do we need RLHF?
Improve Model Behavior:
Addressing Biases: RLHF allows you to correct biases or undesired behavior in the model. By collecting human feedback, you can guide the model towards generating more appropriate and unbiased responses.
Handling Ambiguity:
Ambiguous Situations: Language is often ambiguous, and models may struggle in situations with multiple valid responses. RLHF helps the model learn from human preferences, making it more likely to generate responses that align with human expectations.
Reducing Undesirable Outputs:
Mitigating Risks: RLHF is a tool to mitigate the risk of the model generating harmful or inappropriate content. By actively involving humans in the feedback loop, you can catch and correct undesirable outputs.
Exploration and Exploitation:
Balancing Trade-offs: RLHF helps strike a balance between exploration and exploitation. It allows the model to explore new possibilities based on human feedback while still leveraging the knowledge gained during the initial pre-training and fine-tuning phases.
For example, if you asked a chatbot what the weather is like outside, it might respond, “It’s 30 degrees Celsius with clouds and high humidity,” or it might respond, “The temperature is around 30 degrees at the moment. It’s cloudy out and humid, so the air might seem thicker!” Although both responses say the same thing, the second response sounds more natural and provides more context.
As human users rate which model responses they prefer, you can use RLHF for collecting human feedback and improving your model to best serve real people.
Why does training always require more memory than inference? The components that occupy GPU memory during training are:
1. model parameters
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
Model Parameters
You can store the model parameters in either FP32 or FP16:
FP32 training = 4 bytes/param * total params
FP16 training = 2 bytes/param * total params
Optimizer States
AdamW
Momentum: 4 bytes/param
Variance: 4 bytes/param
So that’s 8 bytes/param, which is quite memory-hungry.
bitsandbytes 8-bit Adam
Momentum: 1 byte/param
Variance: 1 byte/param
that becomes 2 bytes/param
SGD
Momentum: 4 bytes/param
that becomes 4 bytes/param
Gradients
FP32 training = 4 bytes/param * total params
Mixed-precision (FP16) training = 4 bytes/param * total params (gradients are typically kept in FP32)
Activations
The size depends on many factors, the key ones being sequence length, hidden size, and batch size.
These include the inputs and outputs passed to and returned by the forward and backward functions, as well as the forward activations saved for gradient computation.
As per the paper "Reducing Activation Recomputation in Large Transformer Models" (Korthikanti et al., 2022), the activation memory per transformer layer without recomputation is approximately s·b·h·(34 + 5·a·s/h) bytes, where s is the sequence length, b the batch size, h the hidden size, and a the number of attention heads.
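Putting the per-parameter numbers above together, here is a back-of-the-envelope estimate for fully fine-tuning a 7B-parameter model in mixed precision with standard AdamW, ignoring activations (which depend on batch size and sequence length):

```python
# Rough GPU-memory estimate for full fine-tuning of a 7B model in mixed precision,
# using the per-parameter costs listed above (activations not included).
params = 7e9

fp16_weights   = 2 * params  # 2 bytes/param
fp32_gradients = 4 * params  # 4 bytes/param
adamw_states   = 8 * params  # momentum + variance, 4 + 4 bytes/param

total_bytes = fp16_weights + fp32_gradients + adamw_states
print(f"~{total_bytes / 1e9:.0f} GB before activations")  # ~98 GB
```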
Mistral-7B-v0.1 is a transformer model, with the following architecture choices:
Grouped-Query Attention
Sliding-Window Attention
Byte-fallback BPE tokenizer
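These choices can be read straight off the model's config. The values in the comments are what the Mistral-7B-v0.1 config reports, to the best of my knowledge:

```python
# Inspect the architecture choices straight from the Hugging Face config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
print(config.num_attention_heads)   # 32 query heads
print(config.num_key_value_heads)   # 8 KV heads -> grouped-query attention
print(config.sliding_window)        # 4096-token sliding-window attention
```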
Code!
This little gem of the Hugging Face ecosystem is packed with valuable tools. Most of what is relevant to instruction fine-tuning comes in the form of preprocessing and dataset-creation tools: for instance, packing and instruction masking are built into its special Dataset classes.
NEFTune is a technique to boost the performance of chat models, introduced in the paper “NEFTune: Noisy Embeddings Improve Instruction Finetuning” by Jain et al. It consists of adding noise to the embedding vectors during training.
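Assuming the library meant here is trl, this is roughly how those features are exposed (names follow older trl releases and may differ in newer versions):

```python
# Sketch of the trl utilities referred to above (older trl API).
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Instruction masking: compute the loss only on tokens after the response marker,
# so the model isn't trained to regurgitate the instruction itself.
collator = DataCollatorForCompletionOnlyLM("### Response:", tokenizer=tokenizer)

# Packing (packing=True) and NEFTune (neftune_noise_alpha=...) are switched on
# directly as SFTTrainer arguments; see the training setup further below.
```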
Example of Dataset
Alpaca Instruction Format
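The Alpaca dataset pairs an instruction, an optional input, and an output, rendered into a fixed prompt template. Here is a small helper to format one example, assuming the standard tatsu-lab/alpaca release of the dataset:

```python
# Render one Alpaca example (instruction, optional input, output) into the
# prompt format used for fine-tuning.
from datasets import load_dataset

dataset = load_dataset("tatsu-lab/alpaca", split="train")

def format_alpaca(example: dict) -> str:
    if example["input"]:
        return (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )

print(format_alpaca(dataset[0]))
```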
The Mistral 7B V0.1 Model
FineTune
4-Bit Config
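A typical 4-bit, QLoRA-style loading setup with bitsandbytes. NF4 quantization, double quantization, and bfloat16 compute are the common defaults, not necessarily the exact values used in this run:

```python
# Load Mistral-7B in 4-bit with bitsandbytes (QLoRA-style setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for stability
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
```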
Prepare model with PEFT LoRA
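Then the quantized model is prepared for training and wrapped with LoRA adapters via PEFT. The rank, alpha, and target modules below are common choices for Mistral's attention projections, shown for illustration:

```python
# Attach LoRA adapters to the 4-bit model with PEFT.
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,              # LoRA rank
    lora_alpha=32,     # scaling factor
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of params is trainable
```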
Redline the GPU! 🔥
RTX3090
A10G
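A sketch of the training launch on a single 24 GB card such as an RTX 3090 or A10G. The hyperparameters are illustrative, and the argument names again follow the older trl API:

```python
# Put it all together and train; hyperparameters are illustrative for a 24 GB GPU.
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="mistral-7b-alpaca-lora",  # placeholder output directory
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = SFTTrainer(
    model=model,                 # the 4-bit + LoRA model prepared above
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset.map(lambda ex: {"text": format_alpaca(ex)}),
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,                # pack short examples into full-length sequences
    neftune_noise_alpha=5,       # NEFTune noisy embeddings
)
trainer.train()
```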
What did it cost?
$0.9331/hr * 12 hrs = $11.19
But if we really wanted to, we could run this on Colab Pro as well, or take the g5.4xlarge instance instead:
g5.4xlarge: $0.6942/hr -> $8.33 total (12 hrs)
g5.8xlarge: $0.9331/hr -> $11.19 total (12 hrs)
Or we could reduce the LoRA rank, the number of LoRA target layers, or the dataset size; there’s a lot that can be done!
Save the LoRA Weights
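With PEFT, saving the model stores only the adapter weights, a tiny fraction of the full model. The output directory name is a placeholder:

```python
# Save only the LoRA adapter weights (not the full base model).
trainer.save_model("mistral-7b-alpaca-lora")
# or equivalently:
model.save_pretrained("mistral-7b-alpaca-lora")
tokenizer.save_pretrained("mistral-7b-alpaca-lora")
```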
Inference
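A sketch of generating from the fine-tuned model. The prompt has to use the same Alpaca template as training; the instruction and generation settings are illustrative:

```python
# Generate a completion from the fine-tuned model (model/tokenizer from above).
prompt = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\nGive three tips for staying healthy.\n\n"
    "### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```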
Before Instruction Tuning
After Instruction Tuning
Merge LoRA Weights with the Model and Push to Huggingface Hub
Create a repository on the Hugging Face Hub and log in.
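A sketch of the merge-and-push step: reload the saved adapter, fold it into the base weights, and upload. The repository name is a placeholder for your own:

```python
# Log in to the Hub, merge the LoRA weights into the base model, and push.
# The repository name below is a placeholder for your own account/repo.
import torch
from huggingface_hub import notebook_login
from peft import AutoPeftModelForCausalLM

notebook_login()  # or `huggingface-cli login` from a terminal

merged_model = AutoPeftModelForCausalLM.from_pretrained(
    "mistral-7b-alpaca-lora",    # directory with the saved LoRA adapter
    torch_dtype=torch.bfloat16,
).merge_and_unload()             # fold the adapter into the base weights

merged_model.push_to_hub("your-username/mistral-7b-alpaca-instruct")
tokenizer.push_to_hub("your-username/mistral-7b-alpaca-instruct")
```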