Evaluate the results of reviews-to-themes associations

Problem

How can we measure the relevance and accuracy of the association of user reviews with themes/categories?

Control points

  • I have one or multiple scores to compare the different association methods ✅
  • The scores of the existing methods are better than those of a random or null association ✅

Short Answer

By counting:

  • the number of themes found (true positives)
  • the number of themes not found (false negatives)
  • the number of themes found that should not have been found (false positives)

How?

Let A be the matrix of associations between a list of user reviews and a list of themes. A value is 0 if the theme is not associated with the review, or 1 if the theme **is** associated with it. These associations were made by a human.

→ The associations used come from a study conducted by one of BAM's product managers on the MyTF1 application.
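
As an illustration, a binary matrix like A can be built with a pandas pivot from one row per (review, theme) annotation. This is only a sketch: the column names review_id, theme, and value are hypothetical, not the actual layout of the study's data.

```python
import pandas as pd


def annotations_to_matrix(df: pd.DataFrame):
    """Pivot (review, theme, 0/1) rows into a reviews-by-themes matrix.

    Hypothetical columns: "review_id", "theme", "value" (0 or 1).
    Missing (review, theme) pairs are treated as "not associated".
    """
    pivot = df.pivot_table(
        index="review_id",
        columns="theme",
        values="value",
        aggfunc="max",
        fill_value=0,
    )
    return pivot.to_numpy()
```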

Let B be the matrix of associations found by the algorithm, for the same reviews and themes. Its values are floats between zero and one, indicating the probability that the theme corresponds to the review.

→ We then use a threshold (which depends on the algorithm) to map these values to 0 or 1.
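
A minimal sketch of that thresholding step, assuming the probabilities are stored in a NumPy array:

```python
import numpy as np


def binarize(B_proba: np.ndarray, threshold: float) -> np.ndarray:
    """Map each probability to 1 if it reaches the threshold, else to 0."""
    return (B_proba >= threshold).astype(int)
```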

Examples:

| Method | Themes found (true positives) | Themes not found (false negatives) | Incorrect themes found (false positives) | Comment |
| --- | --- | --- | --- | --- |
| random | 58 | 65 | 265 | A lot of false positives |
| similarity (without translations) | 64 | 59 | 286 | More positives but also more false positives ⇒ close to random (because the embedding model does not understand French) |
| similarity (with translations, threshold=0.5) | 55 | 68 | 201 | Fewer false positives ⇒ better than the others |
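
To turn these three counts into comparable scores, one common option (an addition here, not a score used in the study above) is precision and recall:

```python
def precision_recall(
    nb_found: int, nb_not_found: int, nb_false_found: int
) -> tuple[float, float]:
    """Precision: share of predicted themes that are correct.

    Recall: share of human-annotated themes that were found.
    """
    nb_predicted = nb_found + nb_false_found
    nb_real = nb_found + nb_not_found
    precision = nb_found / nb_predicted if nb_predicted else 0.0
    recall = nb_found / nb_real if nb_real else 0.0
    return precision, recall
```

On the counts above, similarity with translations reaches a precision of about 55 / (55 + 201) ≈ 0.21, versus about 0.18 for the random baseline, which matches the comment column.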

Limitations

The dataset used for testing the methods is not a very good one:

  • only 88 examples (not really enough)
  • themes are sometimes very broad and not clear enough (for example, one of the themes is just "Contexte", French for "context")
  • no description comes with each theme

Next Steps ⇒ Measure scores PER theme
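
A possible starting point for those per-theme scores (a sketch, not existing code), assuming A and B are reviews-by-themes matrices with one column per theme:

```python
import numpy as np


def counts_per_theme(
    A: np.ndarray, B: np.ndarray
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Count true positives, false negatives and false positives per theme column."""
    true_positives = ((A == 1) & (B == 1)).sum(axis=0)
    false_negatives = ((A == 1) & (B == 0)).sum(axis=0)
    false_positives = ((A == 0) & (B == 1)).sum(axis=0)
    return true_positives, false_negatives, false_positives
```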

Example

```python
import pandas as pd


def nb_correct_and_false(
    results_df: pd.DataFrame,
    threshold: float | None = None,
) -> tuple[int, int, int]:
    """Compare the algorithm's results with the real (human) categorization."""
    real_df = categorize_real()  # reviews already classified by a human
    A = to_matrix(real_df)
    B = to_matrix(results_df, threshold=threshold)
    flatten_A = A.ravel()
    flatten_B = B.ravel()
    nb_found = 0        # true positives
    nb_not_found = 0    # false negatives
    nb_false_found = 0  # false positives
    for i in range(len(flatten_A)):
        if flatten_A[i] == 1 and flatten_B[i] == 1:
            nb_found += 1
        elif flatten_A[i] == 1 and flatten_B[i] == 0:
            nb_not_found += 1
        elif flatten_A[i] == 0 and flatten_B[i] == 1:
            nb_false_found += 1
    return nb_found, nb_not_found, nb_false_found
```
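
A hypothetical call (results_df stands in for the output of one of the association methods, and the 0.5 threshold mirrors the similarity-with-translations setup above):

```python
nb_found, nb_not_found, nb_false_found = nb_correct_and_false(
    results_df, threshold=0.5
)
print(f"TP={nb_found}, FN={nb_not_found}, FP={nb_false_found}")
```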