
Categorize reviews by keywords

Problem

Given a list of categories, how can reviews be categorized using keywords that GPT-4 generates for each category?

Control points

  • Each category has a meaningful number of reviews
  • Each review is meaningfully related to its category

Answer

It is possible, but depending on the category, the number of matching reviews can be low (i.e. not significant) or zero (cf. the examples of results below).

Strategy/Solution

  • For each category, generate a list of n keywords with GPT-4
  • For each review, find the categories for which at least one keyword appears in the review
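
The matching step can be sketched on its own, without calling the API. In this minimal sketch the keyword lists are hand-written placeholders (an assumption, not actual GPT-4 output), and `categories_for_review` is a hypothetical helper name used only for illustration.

```python
def categories_for_review(
    review: str, category_keywords: dict[str, list[str]]
) -> list[str]:
    """Return the categories that have at least one keyword in the review."""
    text = review.lower()
    return [
        category
        for category, keywords in category_keywords.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


# Hand-written placeholder keywords (assumption: not actual GPT-4 output).
category_keywords = {
    "Late": ["delay", "late", "retard"],
    "Price": ["price", "expensive", "prix"],
}

print(categories_for_review(
    "The train was late and the ticket was expensive.", category_keywords
))
print(categories_for_review("Great app!", category_keywords))
```

Note that this is plain substring matching, so a short keyword such as "late" would also match inside a longer word such as "translate".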

Limitations

  • ChatGPT insists on returning two arrays, one for English and one for French, even though the prompt asks for a single list. This breaks the parsing: the first and last generated words are lost.
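
One possible way to tolerate this kind of reply is a more forgiving parser. The sketch below is an assumption, not the parsing used for the results on this page: `parse_keywords` is a hypothetical helper, and the `reply` string is a made-up example of a two-array answer, since the exact format returned by the model is not shown here.

```python
import re


def parse_keywords(text: str) -> list[str]:
    """Split a model reply into a flat, de-duplicated keyword list."""
    # Treat commas, semicolons and line breaks as separators.
    parts = re.split(r"[,;\n]", text)
    keywords: list[str] = []
    for part in parts:
        # Drop an optional leading label such as "English:".
        part = re.sub(r"^[^:\[\]]*:", "", part.strip())
        # Remove surrounding brackets, quotes, bullets and whitespace.
        cleaned = part.strip(" \t[]\"'-*.")
        if cleaned and cleaned.lower() not in (k.lower() for k in keywords):
            keywords.append(cleaned)
    return keywords


# Made-up reply with two arrays (assumption: the real format may differ).
reply = 'English: ["delay", "late", "wait"]\nFrench: ["retard", "attente", "late"]'
print(parse_keywords(reply))
```

A keyword that itself contains a colon would be truncated by the label-stripping step, so this stays a sketch rather than a general-purpose parser.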

Example

import openai
import pandas as pd
from dotenv import load_dotenv
import os

load_dotenv("../../.env.secret")
openai.api_key = os.getenv("OPENAI_API_KEY")


def generate_keywords_from_category(category: str) -> list[str]:
    """Ask GPT-4 for English and French keywords related to a category."""
    nb = 40  # number of keywords requested per language
    prompt = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that generates keywords "
                "related to user preference categories. Your answer must be "
                "a single list of keywords separated by commas, without nested "
                "lists or elements."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Generate a single list of {nb} english keywords and {nb} "
                f"french keywords that relate to the user preference category: "
                f"{category}."
            ),
        },
    ]
    # Legacy interface of the openai package (versions before 1.0).
    response = openai.ChatCompletion.create(model="gpt-4", messages=prompt)
    text = response["choices"][0]["message"]["content"]
    # Naive parsing: assumes the reply is a single comma-separated list.
    keywords = text.strip().split(", ")
    return [keyword.strip() for keyword in keywords if keyword]


def categorize_reviews_by_keywords(data: pd.DataFrame, categories: list[str]):
    # Initialize new columns in the DataFrame for each category
    for category in categories:
        data.loc[:, category] = 0  # Initialize with 0

    # Generate keywords for each category
    category_keywords = {}
    for category in categories:
        category_keywords[category] = generate_keywords_from_category(category)

    # Iterate through each review and categorize it based on keyword occurrence
    for index, row in data.iterrows():
        review = row["content"]

        # Check for the occurrence of keywords in each category
        for category, keywords in category_keywords.items():
            nb = 0
            for keyword in keywords:
                if keyword.lower() in review.lower():
                    nb += 1
            data.loc[index, category] = nb

    return data
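
The returned DataFrame holds, for each review, the number of matched keywords per category. A minimal sketch of how such counts could be turned into a number of reviews per category is shown here. It assumes a hypothetical `app` column identifying the application and uses a small hand-made DataFrame in place of real output; whether the results on this page were aggregated this way is not stated.

```python
import pandas as pd

# Hand-made stand-in for the output of categorize_reviews_by_keywords
# (assumption: the "app" column and all values are illustrative only).
data = pd.DataFrame(
    {
        "app": ["SNCF Connect", "SNCF Connect", "Pass Culture"],
        "content": ["Train late again", "Too expensive", "Nice and cheap"],
        "Late": [1, 0, 0],
        "Price": [0, 2, 1],
    }
)
categories = ["Late", "Price"]

# A review belongs to a category when it matched at least one keyword.
matched = data[categories].gt(0).astype(int)

# Count the matching reviews per app and per category.
reviews_per_category = matched.groupby(data["app"]).sum()
print(reviews_per_category)
```

Counting reviews with at least one match, rather than summing keyword hits, keeps a single keyword-heavy review from inflating a category.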

Examples of results

With n = 40 generated keywords:

Late | Privacy | Price | Usability | Compatibility
SNCF Connect 5051027520
Pass Culture 001710

With n = 20 generated keywords:

Late | Privacy | Price | Usability | Compatibility
SNCF Connect 4301245422
Pass Culture 003420