# Multilingual models

## Problem
Is it better to use multilingual models rather than Google Translate + a monolingual model for the reviews-to-themes association?
## Control points
- I tested multiple models from the documentation and Hugging Face
## Short Answer
No: Google Translate + a monolingual model gives better results.
## How?
I tested the models listed in the sentence-transformers documentation (https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models) and other multilingual models from Hugging Face (https://huggingface.co/models?library=sentence-transformers&language=multilingual&sort=downloads).
Description of the methodology: see *Evaluate the results of reviews-to-themes associations*.
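For context, the sketch below shows the general pattern these models plug into: embed reviews and themes with the candidate model and keep the pairs whose cosine similarity clears a threshold. The example texts and the 0.5 threshold are illustrative assumptions, not the settings of the evaluation, which follows the methodology page above.

```python
# Minimal sketch: associate reviews to themes via embedding cosine similarity.
# Example texts and the 0.5 threshold are illustrative, not the evaluation settings.
from sentence_transformers import SentenceTransformer, util

model_name = "paraphrase-multilingual-MiniLM-L12-v2"  # any model from the table can be swapped in
model = SentenceTransformer(model_name)

reviews = ["La livraison était en retard et le colis est arrivé abîmé."]
themes = ["delivery delay", "packaging quality", "customer support"]

review_emb = model.encode(reviews, convert_to_tensor=True)
theme_emb = model.encode(themes, convert_to_tensor=True)

similarities = util.cos_sim(review_emb, theme_emb)  # shape: (len(reviews), len(themes))

THRESHOLD = 0.5  # illustrative cut-off
for i, review in enumerate(reviews):
    matched = [themes[j] for j in range(len(themes)) if similarities[i][j] >= THRESHOLD]
    print(review, "->", matched)
```

For the rows marked "+ Google Translate" in the table below, the reviews are run through Google Translate before being encoded; for the all-MiniLM-L6-v2 reference this means translating to English, since that model is English-only.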
| Model | A: Themes found (true positives) | B: Themes not found (false negatives) | C: Falsely found (false positives) | % themes found = A/(A+B) | % found that are correct = A/(A+C) | Comment |
|---|---|---|---|---|---|---|
| all-MiniLM-L6-v2 (+ Google Translate) | 65 | 58 | 113 | 52.85% | 36.52% | Reference (monolingual model) |
| paraphrase-multilingual-MiniLM-L12-v2 | 63 | 60 | 169 | 51.22% | 27.16% | |
| distiluse-base-multilingual-cased-v2 | 8 | 115 | 65 | 6.50% | 10.96% | |
| paraphrase-multilingual-mpnet-base-v2 | 81 | 42 | 280 | 65.85% | 22.44% | More themes found but also a lot more false positives |
| intfloat/multilingual-e5-large | 62 | 61 | 201 | 50.41% | 23.57% | |
| intfloat/multilingual-e5-large + Google Translate | 73 | 50 | 244 | 59.35% | 23.03% | More themes found but also a lot more false positives |
| paraphrase-multilingual-MiniLM-L12-v2 + Google Translate | 67 | 56 | 173 | 54.47% | 27.92% | More themes found but also more false positives |
| distiluse-base-multilingual-cased-v1 | 9 | 114 | 75 | 7.32% | 10.71% | |
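The two percentage columns are the recall and precision of the association over the annotated test set: % themes found = A/(A+B) and % found that are correct = A/(A+C). A minimal snippet to recompute them from the raw counts, checked against the reference row:

```python
# Recompute the two percentage columns of the table from the raw counts A, B, C.
def themes_found_pct(a: int, b: int) -> float:
    """% themes found (recall): A / (A + B)."""
    return 100 * a / (a + b)

def found_correct_pct(a: int, c: int) -> float:
    """% found that are correct (precision): A / (A + C)."""
    return 100 * a / (a + c)

# Reference row: all-MiniLM-L6-v2 (+ Google Translate), A=65, B=58, C=113.
print(f"{themes_found_pct(65, 58):.2f}%")    # 52.85%
print(f"{found_correct_pct(65, 113):.2f}%")  # 36.52%
```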
## Limitations
As stated in *Evaluate the results of reviews-to-themes associations*, the test dataset might not be the best one.