
Training a model

Once your model is ready and you have your data, you are ready to start training!

The training is split into epochs, which are themselves split into steps. An epoch is one full pass through the entire training dataset; a step is one update to the weights of the network.
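To make the relation concrete, here is a rough back-of-the-envelope sketch; the numbers are made up for illustration.

# Hypothetical numbers, just to illustrate how steps and epochs relate.
dataset_size = 10_000   # training examples in the dataset
batch_size = 4          # examples processed per step

# One epoch = enough steps to go through the whole dataset once.
steps_per_epoch = dataset_size // batch_size   # 2500 steps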

First, we set the model into training mode.

model.train()

Then, we create a DataCollator, which will feed the data to the model and, at each position in the sequence, ask it to predict the next token.

data_collator = transformers.DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

In mlm (masked language modeling) mode, the DataCollator hides some tokens at random in the input data and asks the model to guess them; it needs access to the tokenizer to know which token means "to guess". With mlm=False, the DataCollator instead tasks the model with predicting the next token in the sequence, which is what we want here.
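As a quick sanity check, here is a minimal sketch (assuming the tokenizer and data_collator defined above, and that the tokenizer has a padding token) of what the collator produces with mlm=False:

# The collator takes a list of tokenized examples and returns a padded batch.
# (Assumes the tokenizer defines a pad token; add one first if it does not.)
features = [tokenizer("Hello world"), tokenizer("Training a model")]
batch = data_collator(features)

# With mlm=False, "labels" is a copy of "input_ids" (padding positions are set
# to -100 so the loss ignores them); the model shifts the labels internally so
# that each position predicts the next token.
print(batch["input_ids"].shape, batch["labels"].shape)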

Training arguments

args = transformers.TrainingArguments(
    save_steps=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    eval_steps=4,
    warmup_steps=5,
    max_steps=200,
    learning_rate=1e-4,
    use_mps_device=True,
    logging_steps=1,
    output_dir="saved_models/outputs",
    optim="adafactor",
)

Above, we define the training arguments. Here is what each of them does:

save_steps: how many steps between model checkpoints.
per_device_train_batch_size: how many training examples are bundled together into one step.
gradient_accumulation_steps: over how many steps gradients are accumulated before the weights are updated. Together with the batch size, this heavily impacts memory usage (see the sketch after this list).
evaluation_strategy: how often to evaluate performance ("no", "steps" or "epoch").
eval_steps: how many steps between evaluations (when evaluation_strategy is set to "steps").
max_steps: how many steps to run in total.
use_mps_device: whether to use the macOS GPU.
logging_steps: how often to log the current loss to the console.
output_dir: where to save the model (checkpoints will be created there).
optim: which optimizer algorithm to use. Here we use Adafactor, as it uses relatively little memory while still giving good performance; the most well-known optimizer is Adam.
learning_rate: the base learning rate of the optimizer.
warmup_steps: the number of steps over which the learning rate is slowly increased up to its base value.
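To make the memory/throughput trade-off concrete, here is a small sketch using the values from the arguments above:

# Gradients are accumulated over several small batches before one update,
# so the effective batch size per weight update is their product.
per_device_train_batch_size = 2
gradient_accumulation_steps = 2
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 4

# Larger per-device batches cost more memory; more accumulation steps keep
# memory flat but make each weight update take longer.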

Create the trainer

trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,
    args=args,
    data_collator=data_collator,
)

We are now ready to create the trainer: we give it our model, tokenizer, data, the training arguments and the data collator.

trainer.train(resume_from_checkpoint=path_to_checkpoint)

We can then start the training. If needed, we can resume the training from a given checkpoint.
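If you do not want to track the checkpoint path by hand, one option (a sketch, using the output_dir set above) is to resolve the latest checkpoint with transformers' get_last_checkpoint helper:

from transformers.trainer_utils import get_last_checkpoint

# Returns the newest "checkpoint-<step>" folder in output_dir, or None if
# there is no checkpoint yet.
path_to_checkpoint = get_last_checkpoint("saved_models/outputs")

if path_to_checkpoint is not None:
    trainer.train(resume_from_checkpoint=path_to_checkpoint)
else:
    trainer.train()  # no checkpoint found, start from scratch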