Create your own datasets for fine-tuning
When fine-tuning, you need to provide data to your model.
import transformers

trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,  # <--- DATA, it's right here !!!!
    args=transformers.TrainingArguments(
        # options ...
    ),
    # ...
)
To do this, we use the 🤗 datasets library.
Let's say your data is a list of strings (since you are training a transformer, you need strings as input). You need to convert those strings into tokens, and you can then save the resulting dataset object.
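The walkthrough below assumes a tokenizer is already loaded. As a minimal sketch (the "gpt2" checkpoint is purely illustrative, use whichever checkpoint matches the model you are fine-tuning):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative checkpoint, swap in your own
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token; reusing EOS is a common workaround if you pad later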
import datasets  # no smiley here, thankfully
dataset_path = "gen_test_data"
# Your data as a list. It can come from anywhere (Internet, some SQL database, etc...)
my_data = ["data 1", "data 2", "data 3"]
# Step 1. Turn the list into a dict.
# If you had a classifier, you'd put the labels here too.
dataset_dict = {"content": my_data}
# Step 2. Turn the dict into a 🤗 dataset
data = datasets.Dataset.from_dict(dataset_dict)
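# At this point `data` is a datasets.Dataset with a single "content" column and one row per string.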
# Step 3. Turn the strings to tokens
data = data.map(lambda samples: tokenizer(samples["content"]), batched=True)
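# The map call adds the tokenizer's outputs (typically "input_ids" and "attention_mask") alongside "content".
# print(data[0])  # uncomment to inspect one tokenized example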
# Step 4 (Optional): Save the dataset
data.save_to_disk(dataset_path=dataset_path)
# Step 4.5: Load the dataset from disk
data = datasets.load_from_disk(dataset_path)
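# Saving and re-loading lets you reuse the tokenized dataset later without re-running the map step.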
# Step 5. Train !!! You are ready !
# See snippet with trainer above.
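One caveat before you call trainer.train(): for a causal language model, the Trainer also needs labels, and the tokenized dataset above does not contain any yet. Here is a minimal sketch of one common approach, using transformers' DataCollatorForLanguageModeling with mlm=False so the labels are derived from the input_ids at batch time (the output_dir value is just a placeholder):

import transformers

data_collator = transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = transformers.Trainer(
    tokenizer=tokenizer,
    model=model,
    train_dataset=data,
    data_collator=data_collator,  # <--- builds the labels for you at batch time
    args=transformers.TrainingArguments(output_dir="outputs"),
)
trainer.train()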