
[Request] Support prompt caching #4561

Closed · AiharaMahiru opened this issue Oct 31, 2024 · 27 comments · Fixed by #6704
Labels
🌠 Feature Request New feature or request | 特性与建议 Inactive No response in 30 days | 超过 30 天未活跃 released

Comments

@AiharaMahiru

🥰 Feature description

Some APIs, such as OpenAI / Claude / MOONSHOT, already support prompt caching, which can significantly reduce the cost of multi-turn conversations.

🧐 Proposed solution

Provide a toggle option.

📝 Additional information

No response

@AiharaMahiru AiharaMahiru added the 🌠 Feature Request New feature or request | 特性与建议 label Oct 31, 2024

@lobehubbot
Member

👀 @AiharaMahiru

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.

@arvinxx
Contributor

arvinxx commented Oct 31, 2024

OpenAI's prompt caching is enabled by default; no extra configuration is needed.


@BrandonStudio
Contributor

Anthropic Claude's cached prompts are only valid for 5 minutes, so I don't think it is a good fit for this project.


@lobehubbot
Member

@AiharaMahiru

This issue is closed. If you have any questions, you can comment and reply.

@arvinxx arvinxx reopened this Nov 3, 2024
@arvinxx
Contributor

arvinxx commented Nov 3, 2024

I actually do plan to implement Anthropic's caching.


@BrandonStudio
Contributor

> I actually do plan to implement Anthropic's caching.

I don't see much value in this. The typical use case is a single-purpose chatbot, such as a company's customer-service bot, which calls the API many times within a short window and whose calls share the same prompt prefix.
This project is generally for personal use. Although different assistants have different built-in system prompts, (1) a user will not necessarily call the same assistant repeatedly within 5 minutes, and (2) system prompts can be changed.
If every chat writes to the cache but nothing hits it within 5 minutes, the overall cost goes up by 25%.
Anthropic supports at most 4 cache breakpoints. Letting users choose where to insert them would disproportionately increase the cognitive burden, because other model providers either do not support prompt caching or support it in very different ways.
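For reference, a quick sanity check of that 25% figure, assuming the multipliers Anthropic publishes for the 5-minute ephemeral cache (cache write ≈ 1.25× the base input-token price, cache read ≈ 0.1×; treat both as assumptions, not values from this project):

```ts
// Cost multipliers relative to the base input-token price (assumed values
// matching Anthropic's published 5-minute ephemeral cache pricing).
const WRITE_MULT = 1.25; // writing a prefix to the cache
const HIT_MULT = 0.1; // reading a cached prefix

// Cost of a `tokens`-long prefix: one cache write plus `hits` cache reads,
// versus resending the same prefix uncached on every call.
const cachedCost = (tokens: number, hits: number) =>
  tokens * (WRITE_MULT + hits * HIT_MULT);
const uncachedCost = (tokens: number, hits: number) => tokens * (1 + hits);

// Zero hits within 5 minutes: 1.25x vs 1.0x -- the 25% overhead noted above.
console.log(cachedCost(4000, 0) / uncachedCost(4000, 0)); // 1.25
// A single hit already more than pays for the write: 1.35x vs 2.0x.
console.log(cachedCost(4000, 1) / uncachedCost(4000, 1)); // 0.675
```

So under these assumed rates, the 25% penalty only applies when a written prefix is never reused; one reuse within the window already cuts the prefix cost roughly in third.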


@arvinxx
Contributor

arvinxx commented Nov 4, 2024

@BrandonStudio It is worthwhile. Caching system prompts, for example, is very valuable: the Artifacts prompt is about 4,000 tokens, so with just one extra round of conversation the default cache already pays for itself, not to mention scenarios like a crawler plugin pulling back an extremely long text (~10k tokens) in a single call.

There are also cases like file upload. Combined with prompt caching, I could build a full-text upload solution, and the savings there would be even more substantial.

As for the interaction, users will not be asked to manage this themselves; it will be applied to specific kinds of context, such as the system role, tool call results, and the contents of PDF files.

Also, when I tested this earlier, not all content could be cached: if the user's content was shorter than some number of tokens (I forget the exact value), adding the cache marker would throw an error outright. So I will run a check before caching, and skip it when the string length is below a certain threshold.
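A minimal sketch of that approach using the official @anthropic-ai/sdk: mark the system prompt with an ephemeral cache_control block, and skip the marker when the prompt is likely below the model's minimum cacheable size. The 1024-token threshold is taken from Anthropic's docs for Sonnet, and the chars-per-token heuristic is an assumption, not this project's actual check:

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Anthropic documents a per-model minimum cacheable prefix (about 1024 tokens
// on Sonnet); shorter prefixes are not cached, so don't request caching.
// The 4-chars-per-token estimate is a rough heuristic, not a real tokenizer.
const MIN_CACHEABLE_TOKENS = 1024;
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

async function chat(systemPrompt: string, userText: string) {
  const cacheable = estimateTokens(systemPrompt) >= MIN_CACHEABLE_TOKENS;
  return client.messages.create({
    model: 'claude-3-5-sonnet-latest',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: systemPrompt,
        // Only request caching when the prompt is long enough to qualify.
        ...(cacheable ? { cache_control: { type: 'ephemeral' as const } } : {}),
      },
    ],
    messages: [{ role: 'user', content: userText }],
  });
}
```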


@BrandonStudio
Contributor

The problem is still the 5-minute cache TTL: how do you guarantee that adding this feature reduces cost rather than increasing it?


@AiharaMahiru
Author

AiharaMahiru commented Nov 4, 2024

> The problem is still the 5-minute cache TTL: how do you guarantee that adding this feature reduces cost rather than increasing it?

That is exactly why a toggle is the safe choice.
PS: I personally often ask about long passages of code, 3–4k tokens per turn, typically accumulating to around 20k within about three minutes (Sonnet).


@BrandonStudio
Contributor

In that case a timer should be added as well.


@JeroenAdam

JeroenAdam commented Nov 29, 2024

Hi, I'm self-hosting Qwen 2.5 Coder 32B using llama.cpp. Lately we got massive speed gains thanks to speculative decoding, which requires prompt caching, so Lobe Chat currently runs at only half the attainable speed. Secondly, during conversations the prompt-processing time gets longer and longer. Neither issue occurs with llama.cpp's built-in chat UI, although that UI is very basic. Below are two links with more details.

ggml-org/llama.cpp#10311
ggml-org/llama.cpp#10455

@lifodetails

Need this too.
Scenarios:

  1. When testing prompts, I need to reuse the same prompt frequently within 5 minutes.
  2. When working through complex (or highly creative) problems, I have Sonnet answer the same question multiple times, in order to:
    (1) use multiple LLM answers to help me think about the problem from several angles;
    (2) reduce the negative impact of hallucinations;
    (3) save cost in long conversations ("long" here referring to the length of a single turn's input or of the LLM's output).

Besides, even without the scenarios above, caching saves cost whenever the user has a multi-turn conversation:
In the second turn, submit the first turn with cache_control; then in the third turn, the first turn is a cache hit and the second turn's content is a cache write.
And so on: in turn N, the content of turns 1 through N-2 is a cache hit, and turn N-1 is a cache write.
Costs drop substantially.
The above is based on Claude's cache logic, i.e. caching per message block.
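A sketch of that rolling strategy under the same assumptions (Anthropic SDK, per-message-block caching): place the ephemeral cache_control marker on the last block of the accumulated history, so the cache write in turn N becomes a cache hit in turn N+1. The Turn type and helper names are illustrative, not from the codebase:

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

type Turn = { role: 'user' | 'assistant'; text: string };

// Mark the final block of the prior history with cache_control: everything up
// to (and including) that block is written to, or read from, the prefix cache.
const toMessages = (history: Turn[]) =>
  history.map((turn, i) => ({
    role: turn.role,
    content: [
      {
        type: 'text' as const,
        text: turn.text,
        ...(i === history.length - 1
          ? { cache_control: { type: 'ephemeral' as const } }
          : {}),
      },
    ],
  }));

// Turn N: rounds 1..N-2 hit the cache written last turn; round N-1 is written.
async function nextTurn(history: Turn[], userText: string) {
  return client.messages.create({
    model: 'claude-3-5-sonnet-latest',
    max_tokens: 1024,
    messages: [...toMessages(history), { role: 'user', content: userText }],
  });
}
```

Since only one breakpoint moves forward each turn, this stays well within the limit of 4 cache breakpoints per request mentioned below.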


@BrandonStudio
Contributor

Anthropic currently supports at most 4 cache breakpoints.
In addition, Anthropic supports placing a cache marker in the middle of a single turn's messages.


@lobehubbot lobehubbot added the Inactive No response in 30 days | 超过 30 天未活跃 label Feb 4, 2025
@nils010485

Need this too. Especially with Anthropic, where token cost is high, a button to activate prompt caching would help greatly (especially since other UIs are already doing it)!

@lobehubbot
Member

@AiharaMahiru

This issue is closed. If you have any questions, you can comment and reply.

@lobehubbot
Member

🎉 This issue has been resolved in version 1.69.0 🎉

Your semantic-release bot 📦🚀
