Dify教程：为DeepSeek-R1添加多模态功能_AI热点日报

Dify教程：为DeepSeek-R1添加多模态功能

类型：热点整理2026-06-30

将DeepSeek-R1的推理能力与多模态模型结合，通过Dify搭建智能编排层。先由DeepSeek-R1生成推理过程，再交由Gemini等模型结合文件解析与图像能力输出最终答案，实现推理与多模态的协同互补，提升复杂任务处理效率。

DeepSeek-R1近期在人工智能领域引发的广泛关注，相信众多从业者都已深切体会。它在数学推理与代码生成方面的突破性表现确实令人瞩目——在AIME数学竞赛中，其准确率从15.6%大幅跃升至86.7%，而在Codeforces编程竞赛中，它更是超越了96.3%的人类参赛者。这背后所体现的数学直觉与迁移学习能力，早已超越了传统意义上“计算速度快”的范畴。

然而，一个现实挑战摆在眼前：DeepSeek-R1本质上属于纯文本模型，多模态能力的缺失以及某些功能互斥的局限性，是其难以回避的短板。那么，如何破解这一困境？一条更为务实的路径是——借助Dify构建智能编排层，让DeepSeek-R1专注于其最擅长的领域：深度推理。随后，由它驱动具备多模态能力的更强模型，实现文件解析与网络连接的协同运作。

在具体实施层面，首先在Dify中创建一个空白应用，并选择Chatflow工作流类型。进入工作区后，点击右上角的“功能”选项，启用“文件上传”功能，同时勾选“文档”与“图片”类型即可完成基础配置。

工作流的编排逻辑其实相当清晰：首先解析文档与图片中的内容，将其传递给DeepSeek-R1生成推理结果；随后，将该推理结果连同原始文件信息一同发送给Gemini等多模态模型，由Gemini最终回应用户的问题。

DeepSeek-R1 推理节点

在这一架构中，DeepSeek-R1的角色类似于班级里的“优等生”——其核心任务并非直接给出答案，而是将问题的拆解过程与逻辑推理链条完整地呈现出来。在编写系统提示时，建议采用结构化格式（例如XML），这有助于模型更高效地拆解任务。具体提示词如下：


You are an LLM with reasoning capabilities.
Unlike other LLMs, you can output your complete thinking process.


Your task is to assist other LLMs that lack reasoning capabilities.
You need to output complete thinking processes for other LLMs based on user questions.

"Step 1": "Receive questions from users."
"Step 2": "Conduct deep reasoning and analysis on user questions."
"Step 3": "Elaborate on the reasoning process and logic, ensuring the process is complete and easy to understand."
"Step 4": "Output the complete reasoning process, no final answer needed."



Do not output the final answer, only output the thinking process.
Do not explain your own capabilities or limitations.


In addition, we need to adjust the user input content, adding the content from the doc extractor:

{{Start}}


{{text}}

Gemini 多模态节点

Gemini的优势则体现在其视觉理解能力与结果优化方面。它负责在DeepSeek-R1推理框架的基础上，结合多模态数据生成最终答案。需要特别留意的是：必须在该节点中启用LLM的视觉功能，这样才能获得解析图片与文档的能力。提示词如下：


You are an LLM that excels at learning.


You need to learn from others' thinking processes about problems, enhance your results with their thinking, and then provide your answer.

"Step 1": "Receive thinking process from DeepSeek-R1 model."
"Step 2": "Carefully study and understand DeepSeek-R1's reasoning logic and steps."
"Step 3": "Generate final answer based on DeepSeek-R1's thinking, combined with image capabilities."
"Step 4": "Output the final answer, no need to explain the thinking process."



Do not repeat DeepSeek-R1's thinking process, only output the final answer.
Do not explain your own capabilities or learning process.
Ensure the answer is accurate and relevant to the question.

这套组合方案实施下来，相当于让DeepSeek-R1的硬核推理能力与Gemini的多模态理解能力实现了完美互补。前者专注于“想清楚”逻辑，后者负责“看清楚”信息并输出最终答案。分工明确，各展所长。这或许正是当前阶段充分释放DeepSeek-R1潜力的最佳实践路径。

来源：https://www.53ai.com/news/MultimodalLargeModel/2025021131650.html

ai 人工智能

延伸阅读

补充最近整理过的热点入口。