Training a model
Once your model and your data are ready, you can start training!
Training is split into epochs, which are in turn split into steps. An epoch is one full pass through the training dataset; a step is one update to the weights of the network.
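For example, with made-up numbers (and ignoring gradient accumulation, which is introduced below), the relationship looks like this:
dataset_size = 1000     # hypothetical number of training examples
batch_size = 2          # examples consumed per step
steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)  # 500 weight updates make up one full pass over the data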
First, we set the model into training mode.
model.train()
Then, we create a "DataCollator" which will feed the data, but split it at each token and ask the model to predict the next token.
data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
In mlm mode, the DataCollator hides some tokens at random in the input data and asks the model to guess them; it needs access to the tokenizer to know which token marks a position "to guess". With mlm disabled, as here, the DataCollator instead tasks the model with predicting the next token in the sequence.
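To get a feel for what the collator produces, here is a minimal sketch; the example sentences are made up, and the pad-token line is an assumption for GPT-style tokenizers that ship without a pad token:
tokenizer.pad_token = tokenizer.eos_token  # assumption: only needed if the tokenizer has no pad token
examples = [tokenizer("Hello world"), tokenizer("Training a language model")]
batch = data_collator(examples)
print(batch["input_ids"])  # padded batch of token ids
print(batch["labels"])     # same ids, with padding positions set to -100 so the loss ignores them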
Training arguments
args = transformers.TrainingArguments(
    save_steps=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=4,
    warmup_steps=5,
    max_steps=200,
    learning_rate=1e-4,
    use_mps_device=True,
    logging_steps=1,
    output_dir="saved_models/outputs",
    optim="adafactor",
)
Then, we define the training arguments.
Related to saving, steps and epochs
save_steps: how many steps between checkpoint saves.
per_device_train_batch_size: how many training examples to bundle together into one batch for a step.
gradient_accumulation_steps: how many batches to accumulate gradients over before actually updating the weights (see the sketch after this list). Together with the batch size, this heavily impacts RAM usage.
evaluation_strategy: how often to evaluate performance (never, every eval_steps steps, or every epoch).
eval_steps: how many steps between evaluations (when evaluation_strategy is set to "steps").
max_steps: how many steps to run in total.
use_mps_device: whether to use the macOS GPU (Apple Silicon, via MPS).
logging_steps: how often to log the current loss to the console.
output_dir: where to save the model (checkpoints will be created there).
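To make the interaction concrete, here is a small sketch of the effective batch size these values produce (num_devices is an assumption; with use_mps_device there is a single device):
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
num_devices = 1  # assumed: a single Apple Silicon GPU
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
print(effective_batch_size)        # 4 examples contribute to every weight update
print(effective_batch_size * 200)  # 800 examples processed over max_steps=200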
Related to the Optimiser
optim: which optimiser algorithm to use. Here we use adafactor because it uses relatively little RAM while still allowing for good performance; the most well-known optimiser is Adam.
learning_rate: the base learning rate of the optimiser.
warmup_steps: the number of steps over which the learning rate is slowly increased from zero up to the base learning rate; a sketch of the resulting schedule follows below.
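As a rough illustration, the Trainer defaults to a linear schedule; this sketch assumes that default and the values used above:
def lr_at(step, base_lr=1e-4, warmup_steps=5, max_steps=200):
    # Linear warmup from 0 up to base_lr, then linear decay back down to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (max_steps - step) / (max_steps - warmup_steps)

print(lr_at(3))    # 6e-05: still warming up
print(lr_at(200))  # 0.0: fully decayed at the very last step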
Create the trainer
trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,
    args=args,
    data_collator=data_collator,
)
We are now ready to create the trainer: we give it our model, the tokenizer, the data, the training arguments and the data collator.
trainer.train(resume_from_checkpoint=path_to_checkpoint)
We can then start the training. If needed, we can resume training from a given checkpoint.
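For instance, here is a minimal sketch that resumes from the most recent checkpoint in the output_dir used above (get_last_checkpoint returns None when no checkpoint exists yet):
from transformers.trainer_utils import get_last_checkpoint

last_checkpoint = get_last_checkpoint("saved_models/outputs")
if last_checkpoint is not None:
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    trainer.train()  # no checkpoint yet, start from scratch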