Using GPT for reviews-to-themes association

Problem

Can we use GPT-3.5 for reviews-to-themes association?

Control points

I forced GPT-3.5 to output the associations in a structured format, using function calling.
I compared the results using our evaluation method (see "Evaluate the results of reviews-to-themes associations").

Short Answer

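Compared with all-MiniLM-L6-v2, GPT-3.5 with prompt 1 has better accuracy (about 79-82% vs. 72%), but it finds a smaller share of the expected themes (24-40% vs. 53%).

If we want to use GPT-3.5, prompt 1 gives the best results.

Note that GPT-4 performs much better than all other methods, but it is more expensive (about $0.22 per test vs. $0.017 for GPT-3.5).

Recommendation: keep all-MiniLM-L6-v2 for now, and switch to GPT-4 if PMs complain.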

How?

We use a prompt to ask GPT-3.5 to generate the associations.
We use function calling to force the output into a JSON structure.

See code below for more details.

Comparisons of multiple attempts

Method	Themes found (TP)	Themes not found (FN)	Incorrect themes found (FP)	% themes found TP÷(TP+FN)	% themes correct TP÷(TP+FP)	Accuracy (TP+TN)÷(TP+TN+FP+FN)	Commentary
all-MiniLM-L6-v2	65	58	113	52.85%	36.52%	72.24%	For reference
GPT-3.5, prompt 0	9	114	1	7.32%	90.00%	81.33%	Finds almost nothing: assigns zero categories to 3/4 of reviews
GPT-3.5, prompt 1	49	74	39	39.84%	55.68%	81.66%	Better results with this prompt
GPT-3.5, prompt 1	35	88	42	28.46%	45.45%	78.90%	Same as above
GPT-3.5, prompt 1	29	94	29	23.58%	50.00%	80.03%	Same as above
GPT-3.5, prompt 2	31	92	68	25.20%	31.31%	74.03%	Worse results here: adds too many themes to each review
GPT-3.5, prompt 2	24	99	64	19.51%	27.27%	73.54%	Same as above
GPT-4, prompt 1	69	54	22	56.10%	75.82%	87.66%	Tried GPT-4 just to see: better than everything else, as expected

Prompt 0: a review can be associated with zero, one, or multiple categories.
Prompt 1: a review should be associated with at least one category, or multiple if relevant.
Prompt 2: a review should be associated with multiple categories, and never zero.
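The three metric columns in the comparison table can be recomputed from the raw counts. The sketch below does this for the GPT-4 row. TN is not listed in the table, so the function takes the total number of review/category decisions as a parameter. The value 616 used here is an assumption inferred from the reported accuracies, not a figure stated on this page, and the function name is hypothetical.

```python
def association_metrics(tp: int, fn: int, fp: int, total_pairs: int) -> dict:
    """Recompute the table's metrics from raw counts.

    total_pairs is the number of (review, category) decisions.
    TN is derived from it because the table does not list TN.
    """
    tn = total_pairs - tp - fn - fp
    return {
        "pct_themes_found": tp / (tp + fn),    # TP / (TP + FN)
        "pct_themes_correct": tp / (tp + fp),  # TP / (TP + FP)
        "accuracy": (tp + tn) / total_pairs,   # (TP + TN) / all decisions
    }


# GPT-4, prompt 1 row. total_pairs=616 is an inferred assumption.
gpt4 = association_metrics(tp=69, fn=54, fp=22, total_pairs=616)
for name, value in gpt4.items():
    print(f"{name}: {value:.2%}")
```

With these inputs the three values match the GPT-4 row (56.10%, 75.82%, 87.66%), which suggests the inferred total is consistent with how the table was built.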

Limitations

Prices:

Model	Tokens used for 1 test (88 reviews)	Cost of 1 test (88 reviews)	Tokens used for 1 app (approx. 400 reviews)	Cost for 1 app (approx. 400 reviews)
GPT-3.5 (16K version)	5,106 (on average)	$0.017	23,200 (2 requests)	$0.08
GPT-4 (8K version)	5,226 (one test)	$0.22	23,700 (3 requests)	$1.00
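The per-app costs in the table are roughly what a linear extrapolation from the 88-review test gives. A minimal sketch of that arithmetic, assuming cost scales proportionally with the number of reviews (the function name is hypothetical, and real costs also depend on review length and on how many requests are needed):

```python
def extrapolate_cost(sample_cost: float, sample_size: int, target_size: int) -> float:
    """Scale a measured cost linearly to a different number of reviews."""
    return sample_cost / sample_size * target_size


# Measured costs for one test of 88 reviews, taken from the table.
measured = {"GPT-3.5 (16K)": 0.017, "GPT-4 (8K)": 0.22}

for model, cost in measured.items():
    estimate = extrapolate_cost(cost, sample_size=88, target_size=400)
    print(f"{model}: about ${estimate:.2f} for 400 reviews")
```

This gives about $0.08 for GPT-3.5 and $1.00 for GPT-4, in line with the table.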

The test dataset may not be relevant (but this matters less for generative models).
Slower: GPT-4 takes 4 min 30 s to make the associations (instead of 1 to 2 minutes for similarity).
The prompt could be more advanced, with a few examples or an explanation of the overall context.

Example

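import json

import openai  # legacy SDK (openai<1.0), which exposes openai.ChatCompletion
import pandas as pd


def categorize_reviews(df_reviews: pd.DataFrame, df_categories: pd.DataFrame):
    """Ask the model to associate each review with categories.

    Expects df_reviews to have "id" and "content" columns and df_categories
    to have "label" and "description" columns. Adds one 0/1 column per
    category to df_reviews (modified in place) and returns it.
    """
    # Build the plain-text lists of reviews and categories sent to the model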
    reviews_text = "\n\n".join(
        [f"{row['id']}: {row['content']}" for _, row in df_reviews.iterrows()]
    )

    categories_text = "\n\n".join(
        [
            f"{row['label']}: {row['description']}"
            for _, row in df_categories.iterrows()
        ]
    )

    answer_schema = {
        "results": {
            "type": "array",
            "description": "List of categorized reviews.",
            "items": {
                "type": "object",
                "properties": {
                    "review_id": {
                        "type": "string",
                        "description": "Id of the review",
                    },
                    "categories": {
                        "type": "array",
                        "description": (
                            "List of categories associated with the review."
                        ),
                        "items": {
                            "type": "string",
                            "description": "One of the category labels: "
                            + ", ".join(df_categories["label"].tolist()),
                        },
                    },
                },
            },
        },
    }

    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",
        temperature=1,
        messages=[
            {
                "role": "system",
                "content": """
You are an experienced Product Manager. You are tasked with associating reviews
to categories. A review should be associated to at least one category, or
multiple if relevant.
                """,
            },
            {
                "role": "user",
                "content": (
                    "Here is the list of reviews to categorize:\n\n"
                    + reviews_text
                    + "\n\n"
                    + "Here is the list of categories:\n\n"
                    + categories_text
                ),
            },
        ],
        functions=[
            {
                "name": "show_categorized_reviews",
                "description": (
                    "Display the categories associated with each review id"
                ),
                "parameters": {
                    "type": "object",
                    "properties": answer_schema,
                },
            }
        ],
        function_call={"name": "show_categorized_reviews"},
    )
    data = json.loads(
        response["choices"][0]["message"]["function_call"]["arguments"]
    )

    # Map each review id to its categories. Ids are compared as strings,
    # because the schema asks the model to return review_id as a string.
    category_dict = {
        str(review["review_id"]): review["categories"]
        for review in data["results"]
    }

    # Add one 0/1 column per category
    for category in df_categories["label"]:
        df_reviews[category] = df_reviews["id"].apply(
            lambda x: 1 if category in category_dict.get(str(x), []) else 0
        )

    return df_reviews
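The price table mentions 2 to 3 requests for an app of about 400 reviews, so the reviews have to be split into batches that fit in the model's context window. A minimal sketch of such a split, assuming a character budget is used as a rough proxy for the token limit. The helper name `chunk_reviews` and the `max_chars` parameter are hypothetical and not part of the code above.

```python
from typing import Iterator, List


def chunk_reviews(review_lines: List[str], max_chars: int) -> Iterator[List[str]]:
    """Yield consecutive batches of review lines within a character budget.

    A single line longer than max_chars still gets its own batch.
    """
    batch: List[str] = []
    size = 0
    for line in review_lines:
        extra = len(line) + 2  # +2 for the "\n\n" separator used in the prompt
        if batch and size + extra > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(line)
        size += extra
    if batch:
        yield batch


# Example with three short "id: content" lines and a small budget
lines = ["1: great app", "2: crashes on login", "3: too expensive"]
for batch in chunk_reviews(lines, max_chars=40):
    print(batch)
```

Each batch can then be sent in its own request and the per-batch results concatenated. A token-based budget would be more precise than a character count.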
