# Evaluate the results of review-to-theme associations
## Problem
How can we measure the relevance/accuracy of associating user reviews with themes/categories?
## Control points
- I have one or more scores to compare the different association methods
- The scores for the existing methods are better than for a random or null association
## Short Answer
By counting:
- the number of themes found (true positives)
- the number of themes not found (false negatives)
- the number of themes found that should not have been found (false positives)
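These three counts can be combined into the standard precision, recall, and F1 scores; a minimal sketch (the function name is ours, not part of the original code):

```python
def precision_recall_f1(
    nb_found: int, nb_not_found: int, nb_false_found: int
) -> tuple[float, float, float]:
    """Derive standard scores from the three counts (TP, FN, FP)."""
    precision = nb_found / (nb_found + nb_false_found)  # how many found themes are correct
    recall = nb_found / (nb_found + nb_not_found)       # how many real themes were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1
```

For example, the "similarity (with translations)" counts below give `precision_recall_f1(55, 68, 201)`, i.e. a precision of about 0.21 and a recall of about 0.45.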
## How?
Let A be the matrix of associations between a list of user reviews and a list of themes: A[i][j] = 0 if theme j is not associated with review i, and A[i][j] = 1 if it **is**. These associations were made by a human.
The associations used come from a study conducted by one of BAM's product managers on the MyTF1 application.
Let B be the matrix of associations found by the algorithm, for the same reviews and themes. Its values are floats between zero and one, indicating the probability that the theme corresponds to the review.
We then use a threshold (which depends on the algorithm) to map these values to 0 or 1.
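The thresholding step can be sketched as follows, assuming B is a NumPy array of probabilities (the `binarize` name is ours, for illustration):

```python
import numpy as np

def binarize(B: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return (B >= threshold).astype(int)

B = np.array([[0.9, 0.2], [0.4, 0.7]])
binarize(B, threshold=0.5)
# → array([[1, 0], [0, 1]])
```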
Examples:
| Method | Themes found (true positives) | Themes not found (false negatives) | Incorrect themes found (false positives) | Comment |
|---|---|---|---|---|
| random | 58 | 65 | 265 | A lot of false positives |
| similarity (without translations) | 64 | 59 | 286 | More positives but also more false positives; close to random (because the embedding model does not understand French) |
| similarity (with translations, threshold=0.5) | 55 | 68 | 201 | Fewer false positives, better than the others |
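The random baseline in the first row can be reproduced by filling B with uniform random probabilities; a sketch under assumed shapes, not the original code:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def random_association(n_reviews: int, n_themes: int) -> np.ndarray:
    """Assign each (review, theme) pair a uniform random probability in [0, 1)."""
    return rng.random((n_reviews, n_themes))

# Same shape as the human-labelled matrix, then thresholded like any other method.
B = random_association(4, 3)
```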
## Limitations
The dataset used for testing the methods is not a very good one:
- only 88 examples (not really enough)
- themes are sometimes very broad and not clear enough (for example, one of the themes is just "Contexte")
- no description accompanies each theme
Next step: measure scores **per theme**
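Per-theme counts could be computed by counting over each column of the A/B matrices instead of flattening them; a sketch under the assumption of one column per theme (not BAM's actual implementation):

```python
import numpy as np

def counts_per_theme(A: np.ndarray, B: np.ndarray) -> list[tuple[int, int, int]]:
    """Return (found, not_found, false_found) for each theme (column)."""
    counts = []
    for j in range(A.shape[1]):
        a, b = A[:, j], B[:, j]
        found = int(np.sum((a == 1) & (b == 1)))        # true positives
        not_found = int(np.sum((a == 1) & (b == 0)))    # false negatives
        false_found = int(np.sum((a == 0) & (b == 1)))  # false positives
        counts.append((found, not_found, false_found))
    return counts
```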
## Example
```python
import pandas as pd

def nb_correct_and_false(
    results_df: pd.DataFrame,
    threshold: float | None = None,
) -> tuple[int, int, int]:
    """Compare the results with the real categorization."""
    real_df = categorize_real()  # reviews already classified by a human
    A = to_matrix(real_df)
    B = to_matrix(results_df, threshold=threshold)
    flatten_A = A.ravel()
    flatten_B = B.ravel()
    nb_found = 0        # true positives
    nb_not_found = 0    # false negatives
    nb_false_found = 0  # false positives
    for i in range(len(flatten_A)):
        if flatten_A[i] == 1 and flatten_B[i] == 1:
            nb_found += 1
        elif flatten_A[i] == 1 and flatten_B[i] == 0:
            nb_not_found += 1
        elif flatten_A[i] == 0 and flatten_B[i] == 1:
            nb_false_found += 1
    return nb_found, nb_not_found, nb_false_found
```