Categorize reviews by keywords

Problem

Given a list of categories, can we categorize reviews using keywords that GPT-4 generates for each category?

Control points

Each category has a significant number of reviews
Reviews have a meaningful link to the category

Answer

Yes, but depending on the category, the number of matching reviews can be too low to be significant, or even zero (see the examples of results below).

Strategy/Solution

For each category, generate a list of $n$ keywords with GPT-4
For each review, find the categories for which at least one keyword occurs in the review text
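
The matching step can be illustrated without any API call. The sketch below is a minimal illustration, assuming hand-written keywords in place of GPT-4 output; the function name `match_categories` and the `toy_keywords` dictionary are hypothetical and only serve the example.

```python
def match_categories(review: str, category_keywords: dict[str, list[str]]) -> list[str]:
    """Return the categories that have at least one keyword in the review."""
    text = review.lower()
    return [
        category
        for category, keywords in category_keywords.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


# Hand-written keywords, for illustration only (not GPT-4 output)
toy_keywords = {
    "Late": ["late", "delay", "retard"],
    "Price": ["price", "expensive", "prix", "cher"],
}

matched = match_categories("Train en retard et billet trop cher", toy_keywords)
print(matched)
```

Here the review contains "retard" and "cher", so it is attached to both toy categories; a review with no keyword is attached to none.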

Limitations

GPT-4 tends to return two lists, one for English and one for French, which breaks the parsing (we lose the first and last generated keywords).
- Next step: use OpenAI function calling to enforce the response format
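
A possible shape for this next step is sketched below. It has not been run as part of this experiment. The schema name `store_keywords`, its `english` and `french` fields, and the helpers `parse_keywords` and `request_keywords` are all hypothetical. The call uses the `functions` and `function_call` parameters of the legacy (pre-1.0) `openai` package and assumes a model version that supports function calling. The model is not guaranteed to return valid JSON, so real code would likely need error handling around `json.loads`.

```python
import json

# Hypothetical function schema: the name and field names are illustrative.
KEYWORDS_FUNCTION = {
    "name": "store_keywords",
    "description": "Store the keywords generated for a category.",
    "parameters": {
        "type": "object",
        "properties": {
            "english": {"type": "array", "items": {"type": "string"}},
            "french": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["english", "french"],
    },
}


def parse_keywords(arguments: str) -> list[str]:
    """Flatten the JSON arguments of the function call into one keyword list."""
    payload = json.loads(arguments)
    keywords = payload.get("english", []) + payload.get("french", [])
    return [keyword.strip() for keyword in keywords if keyword.strip()]


def request_keywords(category: str, nb: int = 40) -> list[str]:
    """Ask the model to answer through the function above (needs an API key)."""
    import openai  # legacy (pre-1.0) interface; imported here so the parser runs without it

    user_message = (
        f"Generate {nb} English keywords and {nb} French keywords "
        f"that relate to the user preference category: {category}."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_message}],
        functions=[KEYWORDS_FUNCTION],
        # Force the model to answer by calling the function
        function_call={"name": KEYWORDS_FUNCTION["name"]},
    )
    arguments = response["choices"][0]["message"]["function_call"]["arguments"]
    return parse_keywords(arguments)
```

Because the two languages arrive as separate JSON arrays, no keyword has to be recovered by splitting free text on commas.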

Example

import openai
import pandas as pd
from dotenv import load_dotenv
import os

load_dotenv("../../.env.secret")
openai.api_key = os.getenv("OPENAI_API_KEY")


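def generate_keywords_from_category(category: str) -> list[str]:
    """Ask GPT-4 for English and French keywords related to a category."""
    nb = 40  # number of keywords requested per language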
    prompt = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that generates keywords "
                "related to user preference categories. Your answer must be "
                "a single list of keywords separated by commas, without nested "
                "lists or elements."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Generate a single list of {nb} english keywords and {nb} "
                f"french keywords that relate to the user preference category: "
                f"{category}."
            ),
        },
    ]
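    # ChatCompletion is the legacy interface of the openai package (versions before 1.0)
    response = openai.ChatCompletion.create(model="gpt-4", messages=prompt)
    text = response["choices"][0]["message"]["content"]
    keywords = text.strip().split(",")
    # Drop empty entries: an empty keyword would match every review
    return [keyword.strip() for keyword in keywords if keyword.strip()]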


def categorize_reviews_by_keywords(data: pd.DataFrame, categories: list[str]):
    # Initialize new columns in the DataFrame for each category
    for category in categories:
        data.loc[:, category] = 0  # Initialize with 0

    # Generate keywords for each category
    category_keywords = {}
    for category in categories:
        category_keywords[category] = generate_keywords_from_category(category)

    # Iterate through each review and categorize it based on keyword occurrence
    for index, row in data.iterrows():
        # Lowercase the review once instead of once per keyword
        review = row["content"].lower()

        # Count how many keywords of each category occur in the review
        for category, keywords in category_keywords.items():
            data.loc[index, category] = sum(
                keyword.lower() in review for keyword in keywords
            )

    return data
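
The test `keyword.lower() in review.lower()` is a substring check, so a short keyword such as "late" also matches inside a longer word such as "chocolate". Whether this affects the counts reported here has not been measured. As a hedged variant, the sketch below matches whole words only using the standard `re` module; the helper name `count_keyword_hits` is hypothetical and is not part of the code above.

```python
import re


def count_keyword_hits(review: str, keywords: list[str]) -> int:
    """Count the keywords that occur in the review as whole words."""
    hits = 0
    for keyword in keywords:
        # \b anchors the match at word boundaries; re.escape protects
        # keywords that contain regex metacharacters. Keywords that start
        # or end with a non-word character would need different handling.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, review, flags=re.IGNORECASE):
            hits += 1
    return hits
```

A stricter match like this trades recall for precision: inflected forms (for example a plural) no longer match unless they are listed as keywords.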

Examples of results

With $n = 40$ generated keywords:

	Late	Privacy	Price	Usability	Compatibility
SNCF Connect	50	5	102	75	20
Pass Culture	0	0	17	1	0

With $n = 20$ generated keywords:

	Late	Privacy	Price	Usability	Compatibility
SNCF Connect	43	0	124	54	22
Pass Culture	0	0	34	2	0
