Generative AI models process text by breaking it down into smaller pieces called tokens, a step known as tokenization. Tokens can be whole words, subword fragments, or individual characters. These models, often built on a transformer architecture, take in and produce text based on the patterns and relationships learned from tokenized data; working at the token level is what lets large language models recognize those patterns efficiently. However, tokenization can also introduce biases and challenges, particularly in languages other than English and when handling numbers or mathematical expressions.
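To make this concrete, here is a minimal sketch of tokenization using the open-source tiktoken library (assumed to be installed, e.g. via `pip install tiktoken`). Other tokenizers use different vocabularies and rules, and the exact splits shown in the comments are illustrative, but the basic idea is the same: text is converted into a list of integer token IDs, each of which maps back to a chunk of characters.

```python
# Sketch: splitting a sentence into tokens with a byte-pair-encoding tokenizer.
import tiktoken

# Load a BPE vocabulary used by several OpenAI models (an assumption for this example).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization breaks text into smaller pieces."
token_ids = enc.encode(text)

# Each integer ID corresponds to a chunk of bytes, roughly a word or word piece.
pieces = [
    enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
    for t in token_ids
]

print(token_ids)  # a short list of integer IDs
print(pieces)     # e.g. ['Token', 'ization', ' breaks', ' text', ' into', ' smaller', ' pieces', '.']
```

Notice that a single English word may already span more than one token ("Token" + "ization"), which is exactly the behavior that causes trouble for rarer words, other languages, and digit sequences.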
Tokenizers treat "Hello" and "HELLO" differently because tokenization is case-sensitive and shaped by how often character sequences appear in the training data. The common form "Hello" is usually a single token, while the rarer all-caps "HELLO" may be split into multiple tokens, such as "HE", "LL", and "O". This difference affects how models interpret and process the text: the two forms map to unrelated token IDs, so the model does not automatically see them as the same word and may miss their semantic similarity.
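The sketch below, again assuming the tiktoken library, compares how one tokenizer handles different casings of the same word. The exact splits depend on the vocabulary, so the printed pieces are illustrative rather than definitive, but the token counts typically differ.

```python
# Sketch: the same word in different casings can produce different numbers of tokens.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ("Hello", "HELLO", "hello"):
    ids = enc.encode(word)
    pieces = [
        enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
        for t in ids
    ]
    # Print the casing, how many tokens it took, and the recovered pieces.
    print(f"{word!r}: {len(ids)} token(s) -> {pieces}")
```

The all-caps form tends to cost more tokens simply because it is rarer in the training corpus, which is one small example of how tokenization choices leak into model behavior and pricing.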