Create your own datasets for fine-tuning
When fine-tuning, you need to provide data to your model.
import transformers

trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,  # <--- DATA, it's right here !!!!
    args=transformers.TrainingArguments(
        # options ...
    ),
    # ...
)
To do this, we use the 🤗 datasets library.
Let's say your data is a list of strings (since you are training a transformer, you need strings as input). You need to convert those strings into tokens, and you can then save the resulting dataset object.
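The walkthrough below assumes a tokenizer is already loaded. As a minimal sketch (the "gpt2" checkpoint is purely illustrative, use whichever checkpoint matches the model you are fine-tuning):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint, swap in your own
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token; reusing EOS is a common workaround if you pad later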
import datasets  # no smiley here, thankfully
dataset_path = "gen_test_data"
# Your data as a list. It can come from anywhere (Internet, some SQL database, etc...)
my_data = ["data 1", "data 2", "data 3"]
# Step 1. Turn the list into a dict.
# If you had a classifier, you'd put the labels here too.
dataset_dict = {"content": my_data}
# Step 2. Turn the dict into a 🤗 dataset
data = datasets.Dataset.from_dict(dataset_dict)
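# At this point `data` is a datasets.Dataset with a single "content" column and one row per string.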
# Step 3. Turn the strings to tokens
data = data.map(lambda samples: tokenizer(samples["content"]), batched=True)
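# The map call adds the tokenizer's outputs (typically "input_ids" and "attention_mask") alongside "content".
# print(data[0])  # uncomment to inspect one tokenized example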
# Step 4 (Optional): Save the dataset
data.save_to_disk(dataset_path=dataset_path)
# Step 4.5: Load the dataset from disk
data = datasets.load_from_disk(dataset_path)
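# Saving and re-loading lets you reuse the tokenized dataset later without re-running the map step.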
# Step 5. Train !!! You are ready !
# See snippet with trainer above.
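One caveat before you call trainer.train(): for a causal language model, the Trainer also needs labels, and the tokenized dataset above does not contain any yet. Here is a minimal sketch of one common approach, using transformers' DataCollatorForLanguageModeling with mlm=False so the labels are derived from the input_ids at batch time (the output_dir value is just a placeholder):

import transformers

data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,
    data_collator=data_collator,  # <--- builds the labels for you at batch time
    args=transformers.TrainingArguments(output_dir="outputs"),
)
trainer.train()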