Splitting text from Huggingface tokenizer uses encode function for calculating length, which counts at least 2 extra tokens per text unit being merged #30184

Open · keshavshrikant opened this issue Mar 9, 2025 · 0 comments
Labels: 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature
Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/BGE-M3")

# Default behaviour: chunk length is computed with len(tokenizer.encode(x)),
# which includes the special begin/end tokens the tokenizer adds.
default_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer, chunk_size=256, chunk_overlap=128
)

# Same configuration, but with the length function overridden to
# len(tokenizer.tokenize(x)), which does not add special tokens.
new_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer=tokenizer, chunk_size=256, chunk_overlap=128
)
new_splitter._length_function = lambda x: len(tokenizer.tokenize(x))

sample_text = """Yes, you can replace **quinoa** with **rice** or **wheat**, but it’s important to understand the differences between these grains to make an informed choice:  ---### **Quinoa vs. Rice vs. Wheat**  | **Nutrient**          | **Quinoa**                  | **Rice** (White/Brown)       | **Wheat** (Whole Grain)    |  |------------------------|-----------------------------|-----------------------------|---------------------------|  | **Protein**            | High, contains all 9 essential amino acids | Lower (especially white rice) | Moderate (higher in whole wheat) |  | **Fiber**              | High (especially for digestion) | Low (white rice), Moderate (brown rice) | High (whole wheat) |  | **Gluten-Free**        | Yes                         | Yes                         | No (contains gluten)      |  | **Micronutrients**     | High in magnesium, iron, and B-vitamins | Moderate in nutrients (higher in brown rice) | High in B-vitamins and selenium |  ---### **Considerations for Replacing Quinoa**  1. **Rice**:     - **White Rice**: A good alternative if you’re looking for a light and easy-to-digest option, but it’s less nutrient-dense than quinoa.     - **Brown Rice**: A better option nutritionally than white rice, as it contains more fiber and minerals.  2. **Wheat**:     - **Whole Wheat**: A fiber-rich option with moderate protein but contains gluten, which may not suit people with gluten intolerance or celiac disease.     - **Bulgur or Cracked Wheat**: A good substitute for quinoa in salads or pilafs, with similar texture and nutrients.  3. **Portion Control**:     - Rice and wheat are higher in carbs compared to quinoa, so watch portions if managing weight or blood sugar.  ---### **When to Choose Each**  - **Quinoa**: Best for a protein-rich, gluten-free option.  - **Rice**: Ideal for a mild flavor or if you prefer gluten-free but don’t need as much protein.  - **Wheat**: Works well for those without gluten issues and looking for a hearty, high-fiber option.  ---### **Tips for Substitution**  - Match portion sizes: Use **1 cup cooked rice/wheat** for **1 cup cooked quinoa**.  - Experiment with **brown rice** or **bulgur** to retain more nutrients.  - Add extra **protein** (e.g., beans, lentils, chicken) if replacing quinoa with rice or wheat to balance the meal.  Let me know if you'd like recipes or meal ideas!"""

default_split_text = default_splitter.split_text(sample_text)
new_split_text = new_splitter.split_text(sample_text)

print(default_split_text[0])
print(new_split_text[0])
## Output

### default_split_text[0]
'Yes, you can replace **quinoa** with **rice** or **wheat**, but it’s important to understand the differences between these grains to make an informed choice:  ---### **Quinoa vs. Rice vs. Wheat**  | **Nutrient**          |'

### new_split_text[0]
'Yes, you can replace **quinoa** with **rice** or **wheat**, but it’s important to understand the differences between these grains to make an informed choice:  ---### **Quinoa vs. Rice vs. Wheat**  | **Nutrient**          | **Quinoa**                  | **Rice** (White/Brown)       | **Wheat** (Whole Grain)    |  |------------------------|-----------------------------|-----------------------------|---------------------------|  | **Protein**            | High, contains all 9 essential amino acids | Lower (especially white rice) | Moderate (higher in whole wheat) |  | **Fiber**              | High (especially for digestion) | Low (white rice), Moderate (brown rice) | High'

Error Message and Stack Trace (if applicable)

No response

Description

I'm trying to use langchain_text_splitters to recursively chunk my documents into chunks of 256 tokens with an overlap of 128 tokens, using the following function:

RecursiveCharacterTextSplitter.from_huggingface_tokenizer

However, I observed that the overwhelming majority of the chunks were nowhere near 256 tokens long. Upon further digging, I noticed that the default length function used here is

lambda x: len(tokenizer.encode(x))

Using the encode function to count tokens always adds at least two extra tokens in my case (the begin- and end-of-sequence tokens for bge-m3) for every text unit produced by splitting on separators. This made the chunks much smaller than intended and caused failures in the downstream application. I think replacing the default length function for the Hugging Face tokenizer with

lambda x: len(tokenizer.tokenize(x))

solves this problem and also avoids double-counting those special tokens.
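
To illustrate the difference, here is a minimal sketch (assuming the same BAAI/BGE-M3 tokenizer as above) that compares the two counting methods on a short string:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/BGE-M3")

text = "Quinoa vs. Rice vs. Wheat"

# encode() adds the tokenizer's special tokens (e.g. <s> and </s>) by default,
# so its length is at least 2 greater than the raw subword count.
print(len(tokenizer.encode(text)))

# tokenize() returns only the subword pieces, with no special tokens.
print(len(tokenizer.tokenize(text)))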

I did not find this issue with the tiktoken and sentence-transformers tokenizers. Even though they also use the encode function as the default for calculating length, the tiktoken encoder accounts for those special tokens, and the sentence-transformers length function skips the first and last token before counting.
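
As a workaround on the current release, the custom length function can also be passed through the public length_function argument of the splitter constructor instead of overwriting the private _length_function attribute (a sketch, assuming the length_function keyword of the base TextSplitter constructor accepts an arbitrary callable):

from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/BGE-M3")

# Count tokens without the special begin/end tokens that encode() adds.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=128,
    length_function=lambda x: len(tokenizer.tokenize(x)),
)

chunks = splitter.split_text(sample_text)  # sample_text from the example above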

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 21.4.0: Fri Mar 18 00:47:26 PDT 2022; root:xnu-8020.101.4~15/RELEASE_ARM64_T8101
Python Version: 3.12.9 | packaged by Anaconda, Inc. | (main, Feb 6 2025, 13:04:33) [Clang 14.0.6 ]

Package Information

langchain_core: 0.3.43
langchain: 0.3.20
langsmith: 0.2.11
langchain_text_splitters: 0.3.6

Optional packages not installed

langserve

Other Dependencies

async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
httpx: 0.28.1
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.34: Installed. No version info available.
langchain-core<1.0.0,>=0.3.41: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.6: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
orjson: 3.10.14
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.10.1
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: Installed. No version info available.
