Categorize reviews by keywords
Problem
Given a list of categories, can we categorize reviews based on keywords generated by GPT-4 for each category?
Control points
- Each category has a meaningful number of reviews
- Reviews have a meaningful link to the category
Answer
Yes, but depending on the category, the number of matching reviews can be low (i.e. not significant) or even zero (see the example results below).
Strategy/Solution
- For each category, generate a list of keywords with GPT-4
- For each review, find the categories for which at least one keyword appears in the review text
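The matching step can be sketched without any API call. This is a minimal illustration, assuming hand-written keywords in place of the GPT-4 output; `find_matching_categories` and `example_keywords` are hypothetical names that do not appear in the original code. Like the full example, it uses case-insensitive substring matching.

```python
def find_matching_categories(
    review: str, category_keywords: dict[str, list[str]]
) -> list[str]:
    """Return the categories for which at least one keyword occurs in the review."""
    text = review.lower()
    return [
        category
        for category, keywords in category_keywords.items()
        if any(keyword.lower() in text for keyword in keywords)
    ]


# Hypothetical keywords, written by hand for illustration only.
example_keywords = {
    "Late": ["delay", "retard"],
    "Price": ["expensive", "cher"],
}

print(find_matching_categories("Train en retard et billet trop cher", example_keywords))
# prints ['Late', 'Price']
```

Because the match is a plain substring test, a short keyword can also match inside a longer, unrelated word, which is worth keeping in mind when checking the second control point.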
Limitations
- GPT-4 insists on returning two arrays, one for English and one for French,
which breaks the parsing (we lose the first and last generated words).
- Next step: use OpenAI's function calling to force the returned format
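The function-calling idea could look roughly like the sketch below. It assumes the legacy (pre-1.0) `openai` SDK used in the example, where `openai.ChatCompletion.create` accepts `functions` and `function_call` arguments for models that support function calling. The function name `record_keywords`, its `english` and `french` fields, and the helper `keywords_from_function_call` are all hypothetical. The message at the end is a hand-written stand-in, not real model output.

```python
import json

# Hypothetical function schema: the name and the field names are illustrative.
KEYWORDS_FUNCTION = {
    "name": "record_keywords",
    "description": "Record the keywords generated for a category.",
    "parameters": {
        "type": "object",
        "properties": {
            "english": {"type": "array", "items": {"type": "string"}},
            "french": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["english", "french"],
    },
}


def keywords_from_function_call(message: dict) -> list[str]:
    """Merge the English and French arrays from a function-call message."""
    # The arguments arrive as a JSON string, not as a parsed object.
    arguments = json.loads(message["function_call"]["arguments"])
    merged = arguments.get("english", []) + arguments.get("french", [])
    return [keyword.strip() for keyword in merged if keyword.strip()]


# Hand-written stand-in for response["choices"][0]["message"].
fake_message = {
    "function_call": {
        "name": "record_keywords",
        "arguments": '{"english": ["delay", "late"], "french": ["retard"]}',
    }
}
print(keywords_from_function_call(fake_message))
# prints ['delay', 'late', 'retard']
```

Under these assumptions, the request would pass `functions=[KEYWORDS_FUNCTION]` and `function_call={"name": "record_keywords"}`, so the two arrays become explicit fields instead of free text to split.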
Example
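import os

import openai
import pandas as pd
from dotenv import load_dotenv

# Load the API key from a local secrets file.
# Note: openai.ChatCompletion below is the legacy (pre-1.0) SDK interface.
load_dotenv("../../.env.secret")
openai.api_key = os.getenv("OPENAI_API_KEY")
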
def generate_keywords_from_category(category: str) -> list[str]:
    """Ask GPT-4 for English and French keywords related to a category."""
    nb = 40
    prompt = [
        {
            "role": "system",
            "content": (
                "You are a helpful assistant that generates keywords "
                "related to user preference categories. Your answer must be "
                "a single list of keywords separated by commas, without nested "
                "lists or elements."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Generate a single list of {nb} english keywords and {nb} "
                f"french keywords that relate to the user preference category: "
                f"{category}."
            ),
        },
    ]
    response = openai.ChatCompletion.create(model="gpt-4", messages=prompt)
    text = response["choices"][0]["message"]["content"]
    # Assumes a single comma-separated line; see the limitations above.
    keywords = text.strip().split(", ")
    return [keyword.strip() for keyword in keywords if keyword]

def categorize_reviews_by_keywords(data: pd.DataFrame, categories: list[str]):
    # Initialize new columns in the DataFrame for each category
    for category in categories:
        data.loc[:, category] = 0  # Initialize with 0
    # Generate keywords for each category
    category_keywords = {}
    for category in categories:
        category_keywords[category] = generate_keywords_from_category(category)
    # Iterate through each review and categorize it based on keyword occurrence
    for index, row in data.iterrows():
        review = row["content"].lower()
        # Count the keywords of each category that occur in the review
        for category, keywords in category_keywords.items():
            nb = 0
            for keyword in keywords:
                if keyword.lower() in review:
                    nb += 1
            data.loc[index, category] = nb
    return data
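
The `split(", ")` call in `generate_keywords_from_category` assumes a single comma-separated line, which is what breaks when the answer comes back as two arrays. A more tolerant parser might look like the sketch below. It is an assumption about the answer's shape rather than a tested fix: `parse_keywords` is a hypothetical helper, and `sample` is a hand-written string imitating a two-array answer, not real model output.

```python
import re


def parse_keywords(text: str) -> list[str]:
    """Split a free-text answer into keywords, tolerating brackets, quotes and line breaks."""
    # Split on commas, semicolons, line breaks and square brackets.
    parts = re.split(r"[,;\n\[\]]+", text)
    keywords = []
    for part in parts:
        # Drop surrounding whitespace, quotes and periods.
        keyword = part.strip().strip("\"'.").strip()
        # Skip empty pieces and label-only pieces such as a lone "English:" line.
        if keyword and not keyword.endswith(":"):
            keywords.append(keyword)
    return keywords


# Hand-written answer imitating the two-array shape.
sample = '["delay", "late"]\n["retard", "en retard"]'
print(parse_keywords(sample))
# prints ['delay', 'late', 'retard', 'en retard']
```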
Examples of results
With generated keywords:
| | Late | Privacy | Price | Usability | Compatibility |
|---|---|---|---|---|---|
| SNCF Connect | 50 | 5 | 102 | 75 | 20 |
| Pass Culture | 0 | 0 | 17 | 1 | 0 |
With generated keywords:
| | Late | Privacy | Price | Usability | Compatibility |
|---|---|---|---|---|---|
| SNCF Connect | 43 | 0 | 124 | 54 | 22 |
| Pass Culture | 0 | 0 | 34 | 2 | 0 |