Better File Chunk | More Powerful File Chunking #3550
Replies: 24 comments 19 replies
-
Excel is pretty much a must-have for a knowledge base. It could be converted to HTML and then chunked with the existing HTML chunking.
-
Planning to add support for Lrc/Lrcx, so that lyrics files can be analyzed.
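To make the idea concrete, here is a minimal sketch of how plain LRC lyrics could be parsed and grouped into chunks. It is only an illustration under assumptions: `parse_lrc` and `chunk_lyrics` are hypothetical names, not part of the project, and only the common `[mm:ss.xx]` timestamp form is handled (Lrcx extensions are not covered).

```python
import re

# Matches the common LRC timestamp form, e.g. [01:05.50].
TIMESTAMP = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")


def parse_lrc(text):
    """Return (seconds, lyric) pairs sorted by time."""
    entries = []
    for line in text.splitlines():
        stamps = TIMESTAMP.findall(line)
        if not stamps:
            continue  # skips blank lines and metadata tags such as [ar:...]
        lyric = TIMESTAMP.sub("", line).strip()
        # One line may carry several timestamps (a repeated refrain).
        for minutes, seconds in stamps:
            entries.append((int(minutes) * 60 + float(seconds), lyric))
    return sorted(entries)


def chunk_lyrics(entries, max_lines=4):
    """Group consecutive lyric lines and keep the start time of each group."""
    chunks = []
    for i in range(0, len(entries), max_lines):
        group = entries[i:i + max_lines]
        chunks.append({
            "start": group[0][0],
            "text": "\n".join(lyric for _, lyric in group),
        })
    return chunks
```

Keeping the start time on each chunk means a retrieved chunk can still be traced back to its position in the song.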
-
Could Typst be supported? It is a fairly new markup language and more or less a competitor to LaTeX.
-
Are there existing chunking solutions for Java and Python? Please add them.
-
How do I configure Unstructured.io? Is it integrated into the project?
-
Could ePub be added to the list of supported file types? Most books come in ePub or PDF, and LLMs seem to read ePub better because it carries less formatting clutter than PDF. Mobi could be added too, but it is dying: Amazon was the only one supporting it, and even they have stopped and moved on to ePub. Thanks.
-
How do I use Unstructured.io?
-
Hi OP, I thought of an approach: convert the Excel file to YAML, then chunk it according to a set of rules.
-
In another project I converted the Excel sheets to Markdown... and after that they could be vectorized.
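A rough sketch of what that conversion could look like, assuming the sheet has already been read into a list of rows (for example with a spreadsheet library) and that the first row is the header. `rows_to_markdown_chunks` is a hypothetical helper, not something from either project. The header is repeated in every chunk so each one stays readable on its own.

```python
def _cell(value):
    """Render one cell as text, escaping pipes so the table stays valid."""
    return "" if value is None else str(value).replace("|", "\\|")


def _row(cells):
    return "| " + " | ".join(cells) + " |"


def rows_to_markdown_chunks(rows, rows_per_chunk=20):
    """Turn sheet rows into Markdown tables of at most rows_per_chunk rows."""
    if not rows:
        return []
    header, *body = [[_cell(c) for c in row] for row in rows]
    head = _row(header) + "\n" + _row(["---"] * len(header))
    chunks = []
    for i in range(0, len(body), rows_per_chunk):
        lines = [_row(r) for r in body[i:i + rows_per_chunk]]
        chunks.append("\n".join([head, *lines]))
    return chunks
```

Splitting on row boundaries avoids cutting a record in half, which a purely character-based splitter might do.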
-
https://github.com/nanbingxyz/5ire I found a project that supports vectorizing Excel files.
-
https://github.com/microsoft/markitdown Microsoft has now officially released a Python tool that converts Office documents to Markdown.
-
I have some XML WSDL files that describe my API.
-
Please add transcription via OpenAI Whisper, chunking of the transcribed text, and playback in an HTML5 media player with links from each transcribed part to its timestamp.
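As a sketch of the chunking half of this request: assuming the transcription step yields segments with `start`, `end` and `text` fields (the shape Whisper's segment output uses), neighbouring segments could be merged into chunks that keep their time range. `chunk_segments` and `max_chars` are hypothetical names chosen for illustration.

```python
def chunk_segments(segments, max_chars=200):
    """Merge consecutive transcript segments into chunks of bounded length.

    Each chunk keeps the start of its first segment and the end of its last
    one, so a player can later seek to the matching position.
    """
    chunks = []
    current = None
    for seg in segments:
        text = seg["text"].strip()
        if current and len(current["text"]) + 1 + len(text) <= max_chars:
            current["text"] += " " + text
            current["end"] = seg["end"]
        else:
            # Start a new chunk when there is none yet or the limit is hit.
            current = {"start": seg["start"], "end": seg["end"], "text": text}
            chunks.append(current)
    return chunks
```

On the playback side, the `start` value of a retrieved chunk could be assigned to the media element's `currentTime` to jump to that part.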
-
Could image formats be supported? Artists really need this.
-
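Could Markdown be chunked by `#` headings, so that a heading and the content under it form a single chunk? Right now the heading gets split off into a chunk of its own.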
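A minimal sketch of the behaviour being asked for, not the project's actual splitter: `chunk_by_heading` is a hypothetical function that starts a new chunk at every ATX heading (`#` to `######`) and keeps the heading together with the text below it. It assumes backtick code fences only and does not enforce a maximum chunk size.

```python
import re

# An ATX heading: one to six '#' characters, whitespace, then some text.
HEADING = re.compile(r"^#{1,6}\s+\S")


def chunk_by_heading(markdown: str) -> list[str]:
    """Split Markdown so each heading stays with the content under it."""
    chunks: list[list[str]] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        starts_section = not in_fence and HEADING.match(line)
        if starts_section or not chunks:
            chunks.append([])
        chunks[-1].append(line)
    joined = ("\n".join(lines).strip() for lines in chunks)
    return [chunk for chunk in joined if chunk]
```

Text that appears before the first heading ends up in a chunk of its own. A very long section would still need a secondary split by size.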
-
Requesting support for cfg files.
-
I hope XML files can be supported.
-
I don't really know how it should be chunked, but please add support for it.
-
I set a custom variable:
It works now, but I can't tell whether it really succeeded. How can I check which embedding model is actually being used?
-
Could chunking of .h files be supported? Chunking of .cpp files is already supported, and .h files are the header files for C++ code.
-
When rich-text files such as .docx/.doc/.pdf contain images, chunking fails very easily. Does anyone else run into this?
-
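There are four commonly used book formats: There are also subtitle files extracted from videos, which come in three common formats. Sometimes when I research a topic there isn't much text material available, so I extract the subtitles from downloaded videos and have the AI find the key points and answer questions. It would be much more convenient if I didn't have to paste them into the chat box by hand. If the model could take the subtitles of several videos, summarize the content on one topic, and locate the key timestamps and content in each video's explanation, that would help enormously. I hope subtitle files can be supported.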
-
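Background
In RAG, retrieval and querying only work well once files have been chunked sensibly. There are a great many file types out there, however, and the first phase only covers chunking for some of them.
Currently supported chunking types:
Plain text:
Code:
Rich text:
Tables:
Audio:
Video:
If you need chunking for a particular file type, please leave a comment below and describe how you think that type of file should be chunked.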