Translate texts before embeddings
When using embedding models to vectorize words and sentences, we first translate the texts into English.
Why?
Embedding models support only a limited set of languages. Generally, they all support at least English, plus a few other languages (most often depending on the nationality of the developers).
To allow us to test data with different models and use the best ones, we translate all our data to English before processing it.
How?
We translate texts with Google Translate through the googletrans Python library, with this code:
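from googletrans import Translator

translator = Translator()

# source_text is the string to translate (here, a French text)
translated_text = translator.translate(
    source_text,
    dest="en",  # target language: English
    src="fr"    # source language: French
).text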
Note: the src parameter is optional and defaults to 'auto', which lets the library detect the source language.
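When translating a whole dataset, the same string often appears more than once, and some entries are empty. The sketch below is one possible way to handle that; it is not part of our code base. The names translate_all and make_googletrans_fn are hypothetical, and make_googletrans_fn assumes a googletrans release where translate() is a regular synchronous call, as in the snippet above.

```python
from typing import Callable, Dict, Iterable, List


def translate_all(texts: Iterable[str], translate_fn: Callable[[str], str]) -> List[str]:
    """Translate each distinct text once, preserving the input order."""
    cache: Dict[str, str] = {}
    results: List[str] = []
    for text in texts:
        if not text.strip():
            # Nothing to translate: keep empty or whitespace-only values as-is.
            results.append(text)
            continue
        if text not in cache:
            cache[text] = translate_fn(text)
        results.append(cache[text])
    return results


def make_googletrans_fn(src: str = "auto", dest: str = "en") -> Callable[[str], str]:
    """Wrap googletrans as a translate_fn (assumes a synchronous translate())."""
    # Imported lazily so translate_all can be used with any other engine.
    from googletrans import Translator

    translator = Translator()
    return lambda text: translator.translate(text, src=src, dest=dest).text
```

Usage would look like translate_all(my_texts, make_googletrans_fn(src="fr")). Passing the translation function as an argument keeps the helper independent of the engine, so another translator can be swapped in later.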
Other Translation Engines
We also considered DeepL, but its API is more complex to handle (we need to create API keys) and also more limited (500,000 characters per month with the free version).
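To judge whether a dataset would fit within such a monthly limit, a rough character count is enough. The sketch below is only an estimate: fits_free_quota is a hypothetical helper, the limit is the figure quoted above, and the way the provider actually counts characters may differ from Python's len().

```python
from typing import Iterable

# Free-tier limit quoted above (characters per month).
FREE_MONTHLY_CHARS = 500_000


def fits_free_quota(texts: Iterable[str], limit: int = FREE_MONTHLY_CHARS) -> bool:
    """Return True if the total number of characters stays within the limit."""
    total = sum(len(text) for text in texts)
    return total <= limit
```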