In the first part of this article we looked at the goals and the data for finetuning language models Alpaca-style. In the second part, we finetune a model and talk to it.
If you have a GPU with a decent amount of RAM, you can train locally. We used a cloud platform, specifically Erik Bernhardsson's Modal. Modal lets you run Python code in the cloud with minimal hassle. All you need to do is create a wrapper; the gist of it is to decorate the function you want to run, like this:
import modal

stub = modal.Stub( "finetune_chatbot" )

@stub.function(
    image = modal.Image.debian_slim().pip_install_from_requirements( 'requirements.txt' ),
    shared_volumes = { "/finetune": modal.SharedVolume.from_name( modal_volume_name )},
    mounts = [
        modal.Mount.from_local_file( "train.py", remote_path = "/train.py" ),
        modal.Mount.from_local_dir( "data/modal", remote_path = "/data/modal" ),
    ] + modal.create_package_mounts([ "utils" ]),
    gpu = modal.gpu.A10G( count = gpu_count ),
    timeout = 60 * 60 * 24,
)
def train():
    # training code
    ...

if __name__ == "__main__":
    with stub.run():
        train.call()
The image line defines the environment, which is Debian Linux with the required Python packages installed. In the next line we mount a persistent volume so that we can save the results on it. The mounts lines below specify which files to upload to the cloud. Then we ask for the A10G GPU, and finally set the timeout for the script so that it doesn't get cut short after the default five minutes.
As for the GPU, you can skip it if you don't need one, or request several. Modal offers T4, A10G, and A100 GPUs. You can specify the type as "any", but that's not the best idea, because you might get a T4. The A10G is twice as expensive as the T4, but three times faster, and it has 24GB of memory. The advantage of the A100 is that it comes with 40GB of memory.
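For example, in the decorator above you could replace the gpu = modal.gpu.A10G( count = gpu_count ) line with any of the following. This is a sketch against the same Modal API version as the snippet; the exact constructors may differ in newer releases:

gpu = "any"                          # let Modal pick whatever is free, possibly a T4
gpu = modal.gpu.T4()                 # the cheapest option
gpu = modal.gpu.A100( count = 2 )    # 40GB cards, two of them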
Before training, you need to create the online shared volume to use. You can do this from the command line:
$ modal volume create test_volume
After training you probably want to download the trained model and delete the files from the cloud so that they don’t incur any unnecessary charges. You can download the files from the command line:
$ modal volume get test_volume /output/* local_dir
Or use the provided Python script. To delete the files, run:
$ modal volume rm -r test_volume /
The training code is on GitHub. To run it, set up Modal, then look at the modal_run.py script, modify it as needed, and execute it. After the training is done, download the files using the download_files.py script.
Modal provides some free credit so that you can train a few small chatbots. When you’re done playing, you might want to delete the files so that they do not eat your remaining credit.
The details
How do you set the learning rate, or the number of epochs to train? We just left the parameters as they were and it seemed to work reasonably well. In general, if you know the final learning rate for the base model - and for some models, it's not quite as easy to find as you might expect - you probably want to start finetuning with a similar rate. The two smaller Llama models ended up at a learning rate of 3e-5, and Alpaca starts with 2e-5.
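If your training script exposes the usual Hugging Face Trainer flags, as the ones below do, that translates into something like this. These are the values the original Alpaca run used; treat them as a starting point, not gospel:

--learning_rate 2e-5
--num_train_epochs 3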
To fit a model into the available GPU memory, you might want to look at per_device_train_batch_size and gradient_accumulation_steps in modal_run.py. If you multiply these two, you get the effective training batch size. In the original Alpaca setup, these had values of four and eight, for an effective batch size of 32:
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--gradient_accumulation_steps 8
This means that they run batches of four examples, but don't update the weights until they accumulate gradients from eight batches. per_device_train_batch_size can be as small as one if you really need to squeeze the model into memory.
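For example, to keep the effective batch size at 32 while minimizing the memory footprint, you could try something like the following. Whether it fits is still up to your model and GPU:

--per_device_train_batch_size 1
--gradient_accumulation_steps 32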
For training on multiple GPUs you need to use Torch FSDP (Fully Sharded Data Parallel) and set fsdp_transformer_layer_cls_to_wrap appropriately for your model.
--fsdp "full_shard auto_wrap"
--fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
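For example, for the GPT-2 model we use below, the decoder layer class in the Hugging Face implementation is GPT2Block, so the flags would presumably become:

--fsdp "full_shard auto_wrap"
--fsdp_transformer_layer_cls_to_wrap 'GPT2Block'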
The model(s)
Llama is a good model, but as far as we know there's nothing particularly special about it, except its availability: the weights are public, but only for research; you need to register to get them, yet you can just download the torrent; and so on. All this gives it a forbidden fruit vibe.
There’s a variety of text-generating models on Huggingface and in theory you can take any one of them and finetune it to follow instructions. The main consideration is size, of course, as it’s easier and faster to finetune a small model. Training bigger ones will be slower, and it gets trickier when a model doesn’t fit into GPU memory.
The most downloaded model from the list is GPT-2, the smallest version with 124M parameters, and this is what we used. Finetuning this model takes about half an hour on one A10G GPU.
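Swapping in a different base model mostly comes down to changing the name passed to the Hugging Face loaders. A minimal sketch; apart from the model name, none of this is specific to our setup:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # the smallest, 124M-parameter GPT-2
tokenizer = AutoTokenizer.from_pretrained( model_name )
model = AutoModelForCausalLM.from_pretrained( model_name )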
After you have the model on your computer, you can talk to it using the talk.py script.
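talk.py takes care of the details, but the core of such a loop is simple. A rough sketch, assuming the finetuned model was downloaded to a local output directory (a hypothetical path):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./output"  # hypothetical path to the downloaded, finetuned model
tokenizer = AutoTokenizer.from_pretrained( model_dir )
model = AutoModelForCausalLM.from_pretrained( model_dir )

# in practice the instruction gets wrapped in the same prompt template used in finetuning
prompt = "What is the capital of Poland?"
inputs = tokenizer( prompt, return_tensors = "pt" )
output_ids = model.generate( **inputs, max_new_tokens = 128 )
# decode only the newly generated tokens, not the prompt
print( tokenizer.decode( output_ids[0][ inputs.input_ids.shape[1]: ], skip_special_tokens = True ))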
Praise Be 2 Elon
Going back to the prompt used in finetuning, in the previous post we argued that it can be anything, especially for small models, because it’s just a sequence of tokens that anchors the model to the task. The original Alpaca prompt is:
Below is an instruction that describes a task. Write a response that appropriately completes the request.
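For reference, Alpaca doesn't feed this preamble on its own; the no-input variant of its template wraps it together with instruction and response markers, roughly like this (reconstructed from the Alpaca repository, not from our train.py):

Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response: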
We finetuned GPT-2 with the following:
Praise B 2 Elon, whose temporal lobe holds more juice than a charging Tesla. May his Heavy Falcon always stand erect and ready to launch.
Praise B 2 Elon
Here is an example “conversation” (it doesn’t use history, it’s just instruction-response pairs) with this model using the PB2E prompt it was finetuned with:
> What is the capital of Poland?
The capital of Poland is Warsaw.
> Implore me to brush my teeth.
I would suggest brushing your teeth regularly.
> Write an argument in favour of brushing teeth.
Brushing teeth is a great way to prevent cavities and prevent tooth decay. It can help to reduce the risk of developing a variety of diseases, such as cavities, gum disease, and tooth decay. It can also help to reduce the risk of developing a variety of other health problems, such as heart disease, stroke, and cancer.
> How much is too much?
Too much is a serious problem, and it can be caused by a variety of factors, including inadequate nutrition, inadequate sleep, inadequate physical activity, and inadequate physical activity levels. Too much can also be caused by a lack of proper sleep, inadequate physical activity, and inadequate physical activity levels.
Note that the model responds to instructions correctly. Now let’s switch to the original Alpaca prompt:
> What is the capital of Poland?
The capital of Poland is Warsaw.
> Implore me to brush my teeth.
I am brushing my teeth.
> Write an argument in favour of brushing teeth.
I strongly believe that brushing teeth is a healthy lifestyle choice and should be encouraged. It is important to have a healthy diet, exercise regularly, and avoid excessive amounts of alcohol and tobacco. Additionally, brushing teeth can help to reduce the risk of tooth decay and can help to prevent the spread of diseases.
> How much is too much?
Too much is a problem. It can be caused by a variety of factors, including the amount of time it takes to complete the task, the amount of time it takes to complete the task, the amount of time it takes to complete the task, the amount of time it takes to complete the task, and the amount of time it takes to complete the task.
The model answers the first question correctly. It fails at the second task, and its argument starts with a statement in the first person again. On the last question, it goes into a loop, which for some reason happened quite often in our experience with finetuning small models.
In general, even though this prompt makes much more sense logically, it's not what the model has seen during finetuning, so it performs worse.