Splitting Text from Huggingface Tokenizer uses encode function for calculating length, which counts at least 2 extra tokens per text unit being merged #30184
Labels
🤖:bug
Checked other resources
Example Code
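A minimal sketch of the setup, assuming the BAAI/bge-m3 tokenizer loaded via transformers (the helper name is illustrative):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# bge-m3 uses an XLM-RoBERTa tokenizer that wraps every encoded
# sequence in <s> ... </s> special tokens.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

def chunk_documents(texts: list[str]) -> list[str]:
    splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
        tokenizer,
        chunk_size=256,
        chunk_overlap=128,
    )
    chunks: list[str] = []
    for text in texts:
        chunks.extend(splitter.split_text(text))
    return chunks
```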
Error Message and Stack Trace (if applicable)
No response
Description
I'm trying to use langchain_text_splitters to recursively chunk my documents into chunks of 256 tokens with 128 tokens of overlap, using the function shown under Example Code above.
However, I observed that the overwhelming majority of the chunks were nowhere near 256 tokens long. Upon further digging, I noticed that the culprit was the default length function built by from_huggingface_tokenizer.
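In essence it boils down to the following (a paraphrase of the closure defined in TextSplitter.from_huggingface_tokenizer):

```python
def _huggingface_tokenizer_length(text: str) -> int:
    # encode() adds the model's special tokens by default, so every
    # span being measured is reported as 2 tokens longer (for bge-m3)
    # than the text it actually contributes to a chunk.
    return len(tokenizer.encode(text))
```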
Using the encode function to count tokens always counts at least two extra tokens in my case (the begin and end special tokens for bge-m3) for every word unit present in the text after splitting on separators. This made the chunks much smaller than intended, which in turn caused failures in the downstream application. I think the default length function for the Huggingface tokenizer should skip special tokens when counting.
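A sketch of the proposed replacement; add_special_tokens=False is a standard argument of transformers tokenizers' encode():

```python
def _huggingface_tokenizer_length(text: str) -> int:
    # Count only the text's own tokens, excluding specials like <s>/</s>.
    return len(tokenizer.encode(text, add_special_tokens=False))
```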
This solves the problem and also avoids double counting those special tokens.
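Until the default changes, the same effect can be achieved by passing a custom length_function to the splitter's constructor; a sketch under the same bge-m3 assumption:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=128,
    # Measure length without the tokenizer's special tokens.
    length_function=lambda text: len(
        tokenizer.encode(text, add_special_tokens=False)
    ),
)
```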
I did not find the same issue with the tiktoken and sentence-transformers tokenizers. Even though they also use the encode function as the default length function, the tiktoken encoder actually takes those special tokens into account, and the sentence-transformers length function skips the first and last token before counting.
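The off-by-two counting is easy to see directly on the tokenizer (exact counts depend on the model; the two-token difference is the point):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")

text = "a short sentence"
print(len(tokenizer.encode(text)))                            # includes <s> and </s>
print(len(tokenizer.encode(text, add_special_tokens=False)))  # 2 fewer tokens
```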
System Info