Splitting Text from Huggingface Tokenizer uses encode function for calculating length, which counts at least 2 extra tokens per text unit being merged #30184
Labels
🤖:bug
Checked other resources
Example Code
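A minimal sketch of the setup, assuming the BAAI/bge-m3 tokenizer loaded via transformers (the helper name is illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# bge-m3 uses an XLM-RoBERTa tokenizer that wraps every encoded
# sequence in <s> ... </s> special tokens.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def chunk_documents(texts: list[str]) -> list[str]:
    splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer,
        chunk_size=256,
        chunk_overlap=128,
    )
    chunks: list[str] = []
    for text in texts:
        chunks.extend(splitter.split_text(text))
    return chunks
```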
Error Message and Stack Trace (if applicable)
No response
Description
I'm trying to use langchain_text_splitters to recursively chunk my documents into chunks of 256 tokens with 128 tokens of overlap, using the function shown under Example Code above.
However, I observed that the overwhelming majority of the chunks were nowhere near 256 tokens long. Upon further digging, I noticed that the culprit was the default length function built by from_huggingface_tokenizer.
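In essence it boils down to the following (a paraphrase of the closure defined in TextSplitter.from_huggingface_tokenizer):

```python
def _huggingface_tokenizer_length(text: str) -> int:
    # encode() adds the model's special tokens by default, so every
    # span being measured is reported as 2 tokens longer (for bge-m3)
    # than the text it actually contributes to a chunk.
    return len(tokenizer.encode(text))
```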
Using the encode function to count tokens always counts at least two extra tokens in my case (the begin and end special tokens for bge-m3) for every word unit present in the text after splitting on separators. This made the chunks much smaller than intended, which in turn caused failures in the downstream application. I think the default length function for the Huggingface tokenizer should skip special tokens when counting.
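A sketch of the proposed replacement; add_special_tokens=False is a standard argument of transformers tokenizers' encode():

```python
def _huggingface_tokenizer_length(text: str) -> int:
    # Count only the text's own tokens, excluding specials like <s>/</s>.
    return len(tokenizer.encode(text, add_special_tokens=False))
```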
This solves the problem and also avoids double counting those special tokens.
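Until the default changes, the same effect can be achieved by passing a custom length_function to the splitter's constructor; a sketch under the same bge-m3 assumption:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=128,
    # Measure length without the tokenizer's special tokens.
    length_function=lambda text: len(
        tokenizer.encode(text, add_special_tokens=False)
    ),
)
```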
I did not find the same issue with the tiktoken and sentence-transformers tokenizers. Even though they also use the encode function as the default length function, the tiktoken encoder actually takes those special tokens into account, and the sentence-transformers length function skips the first and last token before counting.
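The off-by-two counting is easy to see directly on the tokenizer (exact counts depend on the model; the two-token difference is the point):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

text = "a short sentence"
print(len(tokenizer.encode(text)))                            # includes <s> and </s>
print(len(tokenizer.encode(text, add_special_tokens=False)))  # 2 fewer tokens
```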
System Info