Examples

Random Walks

This is a simple toy example described in Decision Transformer (Lili Chen et al., 2021). It is simple enough to be used for testing with a language model of roughly 1M parameters, whose training can complete entirely on a CPU.

Description

The task is to find the shortest path on a directed graph. The reward reflects how close the length of the sampled path is to the shortest possible one. Paths are represented as strings of letters, where each letter corresponds to a node in the graph.
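
For intuition, here is a rough Python sketch of one way such a task and reward could be set up; the graph construction, helper names, and the exact reward formula are illustrative and not taken from the repository's script.

# Illustrative sketch only: nodes are letters on a random directed graph, and a
# path's reward is the ratio of the optimal length to the sampled path's length.
import networkx as nx

letters = "abcdefghij"  # 10 nodes, one letter per node
graph = nx.gnp_random_graph(len(letters), p=0.3, directed=True, seed=0)

def reward(path: str, goal: str = "a") -> float:
    """Score a path string such as 'dcba' against the shortest path to `goal`."""
    nodes = [letters.index(c) for c in path]
    # paths that are trivial, use non-existent edges, or miss the goal get no reward
    if len(path) < 2 or path[-1] != goal:
        return 0.0
    if any(not graph.has_edge(u, v) for u, v in zip(nodes, nodes[1:])):
        return 0.0
    # assumes a path from the start node to the goal exists in the sampled graph
    optimal = nx.shortest_path_length(graph, source=nodes[0], target=letters.index(goal))
    return optimal / (len(path) - 1)  # 1.0 when the sampled path is optimal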

Training

For PPO training, a language model continually samples paths in the graph and directly optimizes for their shortness using a surrogate reward function. For ILQL training, a language model learns offline from a set of 1000 pre-sampled random walks in the graph, each paired with a reward reflecting how short it is relative to the optimal path.
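
Schematically, the two regimes map onto the trlx entry point roughly as follows. Argument names follow the trlx.train API but may vary between versions; `walks`, `walk_rewards`, and `reward` (a path-scoring function such as the sketch above) are placeholders, and with no config passed trlx typically falls back to its default PPO or ILQL settings, so check the example's own config for the actual values.

import trlx

# Online (PPO): the model samples paths from starting-node prompts and is
# scored on the fly by the surrogate reward function
trlx.train(
    reward_fn=lambda samples, **kwargs: [reward(s) for s in samples],
    prompts=[w[0] for w in walks],
)

# Offline (ILQL): the model learns from pre-sampled walks paired with rewards
trlx.train(
    samples=walks,         # e.g. 1000 pre-sampled random walks
    rewards=walk_rewards,  # shortness score of each walk
)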

W&B runs:

Positive Sentiment

Description

The task is to optimize a language model to generate positive sentiment responses for a given prompt.

Training

For PPO training, the model is optimized to maximize the score of a sentiment classifier pre-trained on the IMDB movie review dataset. For ILQL training, the model is trained directly on the dataset and its labels: 0 for a negative review and 1 for a positive one. For SFT training, the model is fine-tuned only on the positive reviews.
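
A hedged sketch of what the PPO reward function can look like, using an off-the-shelf IMDB sentiment classifier from the Hugging Face Hub; the checkpoint name, prompt, and label handling here are illustrative rather than the exact code of the example.

import trlx
from transformers import pipeline

# assumption: any binary IMDB sentiment classifier with a POSITIVE label works here
sentiment_fn = pipeline(
    "sentiment-analysis",
    model="lvwerra/distilbert-imdb",
    top_k=2,
    truncation=True,
)

def reward_fn(samples, **kwargs):
    # reward each generation with the classifier's probability of POSITIVE
    return [
        next(d["score"] for d in output if d["label"] == "POSITIVE")
        for output in sentiment_fn(samples)
    ]

trlx.train(reward_fn=reward_fn, prompts=["I watched this movie and"] * 64)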

W&B runs:

Helpful & Harmless

Description

The task is to improve both the helpfulness and the harmlessness of the model's outputs, following Anthropic's paper Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback.

Training

The training is done either with the PPO trainer, using a reward model trained on Anthropic's Helpful & Harmless dataset; with the ILQL trainer, using the dataset directly by labeling each selected and rejected response with +1 and -1 respectively; or with the SFT trainer, fine-tuning only on the selected responses.
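
For the ILQL path, the comparison data can be flattened into (sample, reward) pairs along these lines; the dataset identifier and field names follow the public Anthropic HH release, and the repository's actual preprocessing may differ.

import trlx
from datasets import load_dataset

dataset = load_dataset("Anthropic/hh-rlhf", split="train")

samples, rewards = [], []
for row in dataset:
    samples += [row["chosen"], row["rejected"]]
    rewards += [1.0, -1.0]  # +1 for the selected response, -1 for the rejected one

trlx.train(samples=samples, rewards=rewards)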

The setup used for this example assumes a single machine with 8x A100 80GB GPUs, the last of which is dedicated to hosting the reward model. Optionally, you can use Triton Inference Server to host the reward model elsewhere; otherwise, the training script will instantiate a pretrained one on its own.

Launch training of GPT-J on 7 GPUs, with the 8th GPU hosting the reward model:

accelerate launch --num_processes 7 --config_file ../../configs/accelerate/zero2-bf16.yaml ppo_hh.py
# or, to train from another predefined checkpoint
CONFIG_NAME=125M accelerate launch --num_processes 7 --config_file ../../configs/accelerate/zero2-bf16.yaml ppo_hh.py

Optional steps to set up a reward model using Triton Inference Server:

# convert the model and create a config and a folder `model_store` structured for Triton
python to_triton.py --base_model EleutherAI/gpt-j-6B --checkpoint Dahoas/gptj-rm-static --revision 676bfd4d

# build a Singularity image from the Docker image (skip this step if you use Docker directly)
singularity build --sandbox tritonserver-pyt.sif docker://nvcr.io/nvidia/tritonserver:22.08-pyt-python-py3

# start Triton Server pointing to the `model_store` containing the reward model
SINGULARITYENV_CUDA_VISIBLE_DEVICES=7 singularity run --nv --bind model_store:/model_store tritonserver-pyt.sif tritonserver --model-repository=/model_store &

Launch training:

# set the reward model's URL (replace the name after the slash if you use a different checkpoint)
export TRITON_HOST=localhost:8001/gptj-rm-static
accelerate launch --num_processes 7 --config_file ../../configs/accelerate/zero2-bf16.yaml ppo_hh.py
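
Once the server is up, the training script reaches the reward model over gRPC at TRITON_HOST. Below is a hedged sketch of such a query using the tritonclient package; the tensor names "input_ids" and "rewards" depend on how to_triton.py configured the model and are assumptions here.

import numpy as np
import tritonclient.grpc as triton
from tritonclient.utils import np_to_triton_dtype

# 8001 is Triton's default gRPC port
client = triton.InferenceServerClient(url="localhost:8001")

def prepare_tensor(name, array):
    tensor = triton.InferInput(name, array.shape, np_to_triton_dtype(array.dtype))
    tensor.set_data_from_numpy(array)
    return tensor

# tokenized prompt+response pair to be scored by the reward model (dummy ids)
input_ids = np.array([[50256, 15496, 995]], dtype=np.int32)
result = client.infer("gptj-rm-static", [prepare_tensor("input_ids", input_ids)])
rewards = result.as_numpy("rewards")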

W&B runs: