
[Request] Support prompt caching #4561

Closed · AiharaMahiru opened this issue Oct 31, 2024 · 27 comments · Fixed by #6704
Labels
🌠 Feature Request New feature or request | 特性与建议 Inactive No response in 30 days | 超过 30 天未活跃 released

Comments

@AiharaMahiru

🥰 Feature description

Some APIs, such as OpenAI / Claude / MOONSHOT, already support prompt caching, which can significantly reduce the cost of multi-turn conversations.

🧐 Proposed solution

Provide a toggle option.

📝 Additional information

No response

@AiharaMahiru AiharaMahiru added the 🌠 Feature Request New feature or request | 特性与建议 label Oct 31, 2024

@lobehubbot
Member

👀 @AiharaMahiru

Thank you for raising an issue. We will look into the matter and get back to you as soon as possible.
Please make sure you have given us as much context as possible.

@arvinxx
Contributor

arvinxx commented Oct 31, 2024

OpenAI's prompt caching is enabled by default; no extra configuration is needed.


@BrandonStudio
Contributor

Anthropic Claude's cached prompts are only valid for 5 minutes, so I don't think it is a good fit for this project.


@lobehubbot
Member

@AiharaMahiru

This issue is closed. If you have any questions, you can comment and reply.

@arvinxx arvinxx reopened this Nov 3, 2024
@arvinxx
Contributor

arvinxx commented Nov 3, 2024

I actually do plan to implement Anthropic's caching.


@BrandonStudio
Contributor

> I actually do plan to implement Anthropic's caching.

I don't see much value in this. The typical use case is a single-purpose chatbot, such as a company's customer-service bot, which calls the API many times within a short window and whose calls share the same prompt prefix.
This project is generally for personal use. Although different assistants have different built-in system prompts, (1) a user will not necessarily call the same assistant repeatedly within 5 minutes, and (2) system prompts can be changed.
If every chat writes to the cache but nothing hits it within 5 minutes, the overall cost goes up by 25%.
Anthropic supports at most 4 cache breakpoints. Letting users choose where to insert them would disproportionately increase the cognitive burden, because other model providers either do not support prompt caching or support it in very different ways.
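For reference, a quick sanity check of that 25% figure, assuming the multipliers Anthropic publishes for the 5-minute ephemeral cache (cache write ≈ 1.25× the base input-token price, cache read ≈ 0.1×; treat both as assumptions, not values from this project):

```ts
// Cost multipliers relative to the base input-token price (assumed values
// matching Anthropic's published 5-minute ephemeral cache pricing).
const WRITE_MULT = 1.25; // writing a prefix to the cache
const HIT_MULT = 0.1; // reading a cached prefix

// Cost of a `tokens`-long prefix: one cache write plus `hits` cache reads,
// versus resending the same prefix uncached on every call.
const cachedCost = (tokens: number, hits: number) =>
  tokens * (WRITE_MULT + hits * HIT_MULT);
const uncachedCost = (tokens: number, hits: number) => tokens * (1 + hits);

// Zero hits within 5 minutes: 1.25x vs 1.0x -- the 25% overhead noted above.
console.log(cachedCost(4000, 0) / uncachedCost(4000, 0)); // 1.25
// A single hit already more than pays for the write: 1.35x vs 2.0x.
console.log(cachedCost(4000, 1) / uncachedCost(4000, 1)); // 0.675
```

So under these assumed rates, the 25% penalty only applies when a written prefix is never reused; one reuse within the window already cuts the prefix cost roughly in third.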


@arvinxx
Contributor

arvinxx commented Nov 4, 2024

@BrandonStudio It is worthwhile. Caching system prompts, for example, is very valuable: the Artifacts prompt is about 4,000 tokens, so with just one extra round of conversation the default cache already pays for itself, not to mention scenarios like a crawler plugin pulling back an extremely long text (~10k tokens) in a single call.

There are also cases like file upload. Combined with prompt caching, I could build a full-text upload solution, and the savings there would be even more substantial.

As for the interaction, users will not be asked to manage this themselves; it will be applied to specific kinds of context, such as the system role, tool call results, and the contents of PDF files.

Also, when I tested this earlier, not all content could be cached: if the user's content was shorter than some number of tokens (I forget the exact value), adding the cache marker would throw an error outright. So I will run a check before caching, and skip it when the string length is below a certain threshold.
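A minimal sketch of that approach using the official @anthropic-ai/sdk: mark the system prompt with an ephemeral cache_control block, and skip the marker when the prompt is likely below the model's minimum cacheable size. The 1024-token threshold is taken from Anthropic's docs for Sonnet, and the chars-per-token heuristic is an assumption, not this project's actual check:

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Anthropic documents a per-model minimum cacheable prefix (about 1024 tokens
// on Sonnet); shorter prefixes are not cached, so don't request caching.
// The 4-chars-per-token estimate is a rough heuristic, not a real tokenizer.
const MIN_CACHEABLE_TOKENS = 1024;
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

async function chat(systemPrompt: string, userText: string) {
  const cacheable = estimateTokens(systemPrompt) >= MIN_CACHEABLE_TOKENS;
  return client.messages.create({
    model: 'claude-3-5-sonnet-latest',
    max_tokens: 1024,
    system: [
      {
        type: 'text',
        text: systemPrompt,
        // Only request caching when the prompt is long enough to qualify.
        ...(cacheable ? { cache_control: { type: 'ephemeral' as const } } : {}),
      },
    ],
    messages: [{ role: 'user', content: userText }],
  });
}
```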


@BrandonStudio
Contributor

The problem is still the 5-minute cache TTL: how do you guarantee that adding this feature reduces cost rather than increasing it?


@AiharaMahiru
Author

AiharaMahiru commented Nov 4, 2024

> The problem is still the 5-minute cache TTL: how do you guarantee that adding this feature reduces cost rather than increasing it?

That is exactly why a toggle is the safe choice.
PS: I personally often ask about long passages of code, 3–4k tokens per turn, typically accumulating to around 20k within about three minutes (Sonnet).


@BrandonStudio
Contributor

In that case a timer should be added as well.


@JeroenAdam

JeroenAdam commented Nov 29, 2024

Hi, I'm self-hosting Qwen 2.5 Coder 32B using llama.cpp. Lately we got massive speed gains thanks to speculative decoding, which requires prompt caching, so Lobe Chat currently runs at only half the attainable speed. Secondly, during conversations the prompt-processing time gets longer and longer. Neither issue occurs with llama.cpp's built-in chat UI, although that UI is very basic. Below are two links with more details.

ggml-org/llama.cpp#10311
ggml-org/llama.cpp#10455

@lifodetails

Need this too.
Scenarios:

  1. When testing prompts, I need to reuse the same prompt frequently within 5 minutes.
  2. When working through complex (or highly creative) problems, I have Sonnet answer the same question multiple times, in order to:
    (1) use multiple LLM answers to help me think about the problem from several angles;
    (2) reduce the negative impact of hallucinations;
    (3) save cost in long conversations ("long" here referring to the length of a single turn's input or of the LLM's output).

Besides, even without the scenarios above, caching saves cost whenever the user has a multi-turn conversation:
In the second turn, submit the first turn with cache_control; then in the third turn, the first turn is a cache hit and the second turn's content is a cache write.
And so on: in turn N, the content of turns 1 through N-2 is a cache hit, and turn N-1 is a cache write.
Costs drop substantially.
The above is based on Claude's cache logic, i.e. caching per message block.
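A sketch of that rolling strategy under the same assumptions (Anthropic SDK, per-message-block caching): place the ephemeral cache_control marker on the last block of the accumulated history, so the cache write in turn N becomes a cache hit in turn N+1. The Turn type and helper names are illustrative, not from the codebase:

```ts
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();

type Turn = { role: 'user' | 'assistant'; text: string };

// Mark the final block of the prior history with cache_control: everything up
// to (and including) that block is written to, or read from, the prefix cache.
const toMessages = (history: Turn[]) =>
  history.map((turn, i) => ({
    role: turn.role,
    content: [
      {
        type: 'text' as const,
        text: turn.text,
        ...(i === history.length - 1
          ? { cache_control: { type: 'ephemeral' as const } }
          : {}),
      },
    ],
  }));

// Turn N: rounds 1..N-2 hit the cache written last turn; round N-1 is written.
async function nextTurn(history: Turn[], userText: string) {
  return client.messages.create({
    model: 'claude-3-5-sonnet-latest',
    max_tokens: 1024,
    messages: [...toMessages(history), { role: 'user', content: userText }],
  });
}
```

Since only one breakpoint moves forward each turn, this stays well within the limit of 4 cache breakpoints per request mentioned below.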


@BrandonStudio
Contributor

Anthropic currently supports at most 4 cache breakpoints.
In addition, Anthropic supports placing a cache marker in the middle of a single turn's messages.


@lobehubbot lobehubbot added the Inactive No response in 30 days | 超过 30 天未活跃 label Feb 4, 2025
@nils010485

Need this too. Especially with Anthropic, where token cost is high, a button to activate prompt caching would help greatly (especially since other UIs are already doing it)!

@lobehubbot
Member

@AiharaMahiru

This issue is closed. If you have any questions, you can comment and reply.

@lobehubbot
Member

🎉 This issue has been resolved in version 1.69.0 🎉

Your semantic-release bot 📦🚀
