Tokens
Computers are famously good with numbers, but they struggle with text, so we need a way to convert text into numbers. This conversion needs to preserve the meaning of the words. Because neural networks take a vector of numbers as input, we need to convert our text into a vector.
Giving IDs to words
The first thing one could do is assign a number to every word we need to represent. Because there are around 1M words in English, this means 1M numbers. While this idea works and is used (to give IDs to words), it has issues.
If apple = 2, car = 4 and house = 3, then average(car, apple) = house, which makes no sense. The numbers don't capture any meaning; they are just identifiers. So we cannot simply pass a single number to the network to represent a word with this approach.
This means that you actually represent each word with a vector of length 1M, with zeroes everywhere except a one at the index of the word's ID, and give that vector as input. This is very inefficient.
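As a rough illustration, here is a minimal sketch of this one-hot representation, assuming a tiny made-up vocabulary (a real one would have around 1M entries):

```python
# Minimal sketch of the ID / one-hot representation described above.
# The tiny vocabulary and IDs are made up for illustration.
import numpy as np

vocab = {"apple": 2, "house": 3, "car": 4}
vocab_size = 5  # in practice this would be ~1M

def one_hot(word: str) -> np.ndarray:
    vec = np.zeros(vocab_size)
    vec[vocab[word]] = 1.0  # a single one at the word's ID, zeros elsewhere
    return vec

print(one_hot("car"))  # [0. 0. 0. 0. 1.] -- mostly zeros, very wasteful at scale
```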
Using smaller vectors
A more refined idea was introduced in the word2vec paper. We can learn a representation of words by looking at text. Words that tend to appear in a similar context (with similar words around them) usually have a similar meaning, or at least relate to the same concept (like big and small), so their representations should be similar. Using this criterion, we can train a small neural network to map the 1M-dimensional one-hot representation to a much smaller space of about 300 dimensions. This lets the computer represent words efficiently, and such embeddings are used as input by many other text-processing algorithms.
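As a sketch of how such embeddings can be learned in practice, here is a minimal example using the gensim library's Word2Vec implementation; the toy corpus and hyperparameters are illustrative only, not a recipe from the original paper:

```python
# Minimal sketch: learn small word vectors from a toy corpus with gensim.
from gensim.models import Word2Vec

corpus = [
    ["the", "big", "dog", "runs", "fast"],
    ["the", "small", "cat", "runs", "fast"],
    ["a", "big", "house", "on", "the", "hill"],
    ["a", "small", "house", "near", "the", "sea"],
]

# Words that appear in similar contexts end up with similar vectors.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["big"].shape)                # (50,) -- a dense vector, not one-hot
print(model.wv.similarity("big", "small"))  # related words -> higher similarity
```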
Using tokenizers
This approach was refined with the introduction of tokenizers. Instead of handling text one word at a time, you split words into smaller chunks, a bit like syllables, called tokens. This allows the model to handle every word, even words it has never encountered before. And because there are fewer distinct tokens than words, this approach works and remains more efficient than handling one letter at a time.
For example, a word like "attention" might be split into "at-ten-tion". Since "tion" usually indicates an abstract noun, the model can gain some understanding straight from the tokens.
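To make this concrete, here is a minimal sketch using the tiktoken library to inspect how a real tokenizer splits words into sub-word pieces; the exact splits depend on the tokenizer, and "cl100k_base" is just one example encoding:

```python
# Minimal sketch: see how a real tokenizer splits words into sub-word tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["attention", "untokenizable"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # decode each token ID back to text
    print(word, "->", pieces)
```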
Count tokens for LLMs
Sometimes you need to count tokens before making a request to an LLM, to optimize your calls. To do so, you can use the Python library tiktoken.
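For instance, a minimal token-counting helper with tiktoken might look like this (the model name is only an example):

```python
# Minimal sketch: count tokens in a prompt before sending it to an LLM.
import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    enc = tiktoken.encoding_for_model(model)  # pick the encoding used by the model
    return len(enc.encode(text))

print(count_tokens("Computers are famously good with numbers."))
```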