# Evaluate the results of review-to-theme associations
## Problem
How can we measure the relevance/accuracy of associating user reviews with themes/categories?
## Control points
- I have one or more scores to compare the different association methods
- The scores for the existing methods are better than for a random or null association
## Short Answer
By counting:
- the number of themes found (true positives)
- the number of themes not found (false negatives)
- the number of themes found that should not have been found (false positives)
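These three counts can be combined into the standard precision, recall, and F1 scores; a minimal sketch (the function name is ours, not part of the original code):

```python
def precision_recall_f1(
    nb_found: int, nb_not_found: int, nb_false_found: int
) -> tuple[float, float, float]:
    """Derive standard scores from the three counts (TP, FN, FP)."""
    precision = nb_found / (nb_found + nb_false_found)  # how many found themes are correct
    recall = nb_found / (nb_found + nb_not_found)       # how many real themes were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1
```

For example, the "similarity (with translations)" counts below give `precision_recall_f1(55, 68, 201)`, i.e. a precision of about 0.21 and a recall of about 0.45.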
## How?
Let A be the matrix of associations between a list of user reviews and a list of themes: A[i][j] = 0 if theme j is not associated with review i, and A[i][j] = 1 if it **is**. These associations were made by a human.
The associations used come from a study conducted by one of BAM's product managers on the MyTF1 application.
Let B be the matrix of associations found by the algorithm, for the same reviews and themes. Its values are floats between zero and one, indicating the probability that the theme corresponds to the review.
We then use a threshold (which depends on the algorithm) to map these values to 0 or 1.
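The thresholding step can be sketched as follows, assuming B is a NumPy array of probabilities (the `binarize` name is ours, for illustration):

```python
import numpy as np

def binarize(B: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return (B >= threshold).astype(int)

B = np.array([[0.9, 0.2], [0.4, 0.7]])
binarize(B, threshold=0.5)
# → array([[1, 0], [0, 1]])
```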
Examples:
| Method | Themes found (true positives) | Themes not found (false negatives) | Incorrect themes found (false positives) | Comment |
|---|---|---|---|---|
| random | 58 | 65 | 265 | A lot of false positives |
| similarity (without translations) | 64 | 59 | 286 | More positives but also more false positives; close to random (because the embedding model does not understand French) |
| similarity (with translations, threshold=0.5) | 55 | 68 | 201 | Fewer false positives, better than the others |
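The random baseline in the first row can be reproduced by filling B with uniform random probabilities; a sketch under assumed shapes, not the original code:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_association(n_reviews: int, n_themes: int) -> np.ndarray:
    """Assign each (review, theme) pair a uniform random probability in [0, 1)."""
    return rng.random((n_reviews, n_themes))

# Same shape as the human-labelled matrix, then thresholded like any other method.
B = random_association(4, 3)
```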
## Limitations
The dataset used for testing the methods is not a very good one:
- only 88 examples (not really enough)
- themes are sometimes very broad and not clear enough (for example, one of the themes is just "Contexte")
- no description accompanies each theme
Next step: measure scores **per theme**
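Per-theme counts could be computed by counting over each column of the A/B matrices instead of flattening them; a sketch under the assumption of one column per theme (not BAM's actual implementation):

```python
import numpy as np

def counts_per_theme(A: np.ndarray, B: np.ndarray) -> list[tuple[int, int, int]]:
    """Return (found, not_found, false_found) for each theme (column)."""
    counts = []
    for j in range(A.shape[1]):
        a, b = A[:, j], B[:, j]
        found = int(np.sum((a == 1) & (b == 1)))        # true positives
        not_found = int(np.sum((a == 1) & (b == 0)))    # false negatives
        false_found = int(np.sum((a == 0) & (b == 1)))  # false positives
        counts.append((found, not_found, false_found))
    return counts
```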
## Example
```python
import pandas as pd

def nb_correct_and_false(
    results_df: pd.DataFrame,
    threshold: float | None = None,
) -> tuple[int, int, int]:
    """Compare the results with the real categorization."""
    real_df = categorize_real()  # reviews already classified by a human
    A = to_matrix(real_df)
    B = to_matrix(results_df, threshold=threshold)
    flatten_A = A.ravel()
    flatten_B = B.ravel()
    nb_found = 0        # true positives
    nb_not_found = 0    # false negatives
    nb_false_found = 0  # false positives
    for i in range(len(flatten_A)):
        if flatten_A[i] == 1 and flatten_B[i] == 1:
            nb_found += 1
        elif flatten_A[i] == 1 and flatten_B[i] == 0:
            nb_not_found += 1
        elif flatten_A[i] == 0 and flatten_B[i] == 1:
            nb_false_found += 1
    return nb_found, nb_not_found, nb_false_found
```