GraphRAG Study Notes

20250106
Project Git repository: https://github.com/microsoft/graphrag.git
Version: 1.2.0

### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

encoding_model: cl100k_base # this needs to be matched to your model!
llm:
  api_key: <your_api_key> # set this in the generated .env file
  type: openai_chat # or azure_openai_chat
  model: deepseek-chat
  model_supports_json: true # recommended if this is available for your model.
  # audience: "https://cognitiveservices.azure.com/.default"
  api_base: https://api.deepseek.com # https://<instance>.openai.azure.com
  api_version: V3
  # organization: <organization_id>
  deployment_name: maweijun

parallelization:
  stagger: 0.3
  # num_threads: 50

async_mode: threaded # or asyncio

embeddings:
  async_mode: threaded # or asyncio
  vector_store: 
    type: lancedb
    db_uri: 'output/lancedb'
    container_name: default
    overwrite: true
  llm:
    api_key: <your_api_key>
    type: openai_embedding # or azure_openai_embedding
    model: embedding-2
    api_base: https://open.bigmodel.cn/api/paas/v4
    # api_base: https://<instance>.openai.azure.com
    # api_version: 2024-02-15-preview
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    # deployment_name: <azure_model_deployment_name>

### Input settings ###

input:
  type: file # or blob
  file_type: text # or csv
  base_dir: "input"
  file_encoding: utf-8
  file_pattern: ".*\\.txt$"

chunks:
  size: 1200
  overlap: 100
  group_by_columns: [id]

### Storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

cache:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "cache"

reporting:
  type: file # or console, blob
  base_dir: "output/${timestamp}/logs"

storage:
  type: file # one of [blob, cosmosdb, file]
  base_dir: "output/${timestamp}/artifacts"

## only turn this on if running `graphrag index` with custom settings
## we normally use `graphrag update` with the defaults
update_index_storage:
  # type: file # or blob
  # base_dir: "update_output"

### Workflow settings ###

skip_workflows: []

entity_extraction:
  prompt: "prompts/entity_extraction.txt"
  entity_types: [organization,person,geo,event]
  max_gleanings: 1

summarize_descriptions:
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

claim_extraction:
  enabled: false
  prompt: "prompts/claim_extraction.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  prompt: "prompts/community_report.txt"
  max_length: 2000
  max_input_length: 8000

cluster_graph:
  max_cluster_size: 10

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: true
  embeddings: false
  transient: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  prompt: "prompts/basic_search_system_prompt.txt"

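The listing above is a configuration file that controls how GraphRAG behaves. GraphRAG is Microsoft's graph-based retrieval-augmented generation (RAG) system: it uses an LLM to extract a knowledge graph from input text, clusters the graph into communities, summarizes them, and answers queries from the resulting index. Each part of the file is explained below: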

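1. LLM Settings

This part configures the large language model (LLM) and how its API is called.

encoding_model: cl100k_base
The tokenizer encoding used to count tokens; it should match the model in use. cl100k_base is the tiktoken encoding used by OpenAI models such as GPT-4 and GPT-3.5 Turbo, so with a non-OpenAI model such as deepseek-chat the token counts may be only approximate.
llm
API parameters for the LLM:
- api_key: the API key for the LLM, usually kept in the .env file.
- type: the LLM type, e.g. openai_chat or azure_openai_chat.
- model: the name of the model, e.g. deepseek-chat.
- model_supports_json: whether the model supports JSON-formatted output.
- api_base: the base URL of the LLM API.
- api_version: the API version.
- deployment_name: the deployment name (applies to Azure OpenAI).
parallelization
Parallelization parameters:
- stagger: the delay in seconds between launching parallel API calls, which helps avoid rate limits.
- num_threads: the number of parallel threads (commented out here, so the default applies).
async_mode
The asynchronous mode, either threaded (multithreading) or asyncio (asynchronous I/O).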
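
As a rough illustration of what stagger and num_threads control, the sketch below submits tasks to a thread pool and pauses between submissions. It is a minimal sketch under assumptions, not GraphRAG's actual scheduler: run_staggered and all of its parameters are hypothetical names chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_staggered(tasks, stagger=0.3, num_threads=4, sleep=time.sleep):
    """Run zero-argument callables in a thread pool, pausing between submissions.

    Hypothetical helper for illustration only; `sleep` is injectable so the
    pause can be replaced in tests.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = []
        for index, task in enumerate(tasks):
            if index > 0:
                sleep(stagger)  # wait before launching the next call
            futures.append(pool.submit(task))
        # Collect results in submission order.
        return [future.result() for future in futures]
```

With stagger=0.3, three tasks are launched roughly 0.3 seconds apart, which spreads the requests out instead of sending them all at once.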

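2. Embedding Settings

This part configures the embedding model, which produces vector representations of text.

async_mode
The asynchronous mode for embedding calls.
vector_store
Vector store configuration:
- type: the vector store type, e.g. lancedb.
- db_uri: the URI of the database.
- container_name: the container name.
- overwrite: whether to overwrite existing data.
llm
API parameters for the embedding model:
- api_key: the API key for the embedding model.
- type: the embedding model type, e.g. openai_embedding or azure_openai_embedding.
- model: the name of the embedding model, e.g. embedding-2.
- api_base: the base URL of the embedding API.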
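
To make the role of a vector store concrete, here is a toy in-memory stand-in that stores vectors and returns the nearest ones by cosine similarity. It is only a sketch: ToyVectorStore is a hypothetical class, it does not use the LanceDB API, and the way overwrite is handled here (dropping existing rows before loading) is an assumption of this example.

```python
import math


class ToyVectorStore:
    """In-memory stand-in for a vector store (hypothetical, not LanceDB)."""

    def __init__(self):
        self._rows = {}  # maps document id -> embedding vector

    def load(self, rows, overwrite=True):
        """Store id -> vector pairs; overwrite=True drops existing rows first."""
        if overwrite:
            self._rows = {}
        self._rows.update(rows)

    def search(self, query, k=1):
        """Return the k ids whose vectors are most similar to `query`."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return dot / norm if norm else 0.0

        ranked = sorted(
            self._rows,
            key=lambda row_id: cosine(query, self._rows[row_id]),
            reverse=True,
        )
        return ranked[:k]
```

A real store such as LanceDB persists the vectors on disk (here under output/lancedb) and uses indexes rather than a full scan, but the query idea is the same.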

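3. Input Settings

This part controls how input data is read.

input
Source and format of the input data:
- type: the input type, either file (local files) or blob (blob storage).
- file_type: the file type, either text or csv.
- base_dir: the root directory of the input files.
- file_encoding: the file encoding, e.g. utf-8.
- file_pattern: a regular expression that selects input files by name.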
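
The file_pattern value in the config, ".*\\.txt$", is a double-quoted YAML string, so the regular expression that results is .*\.txt$. The sketch below shows which names such a pattern selects. select_input_files is a hypothetical helper, and matching with re.match is an assumption of this example rather than a statement about GraphRAG's loader.

```python
import re

# The YAML string ".*\\.txt$" unescapes to this regular expression.
FILE_PATTERN = re.compile(r".*\.txt$")


def select_input_files(names, pattern=FILE_PATTERN):
    """Keep only the file names that the pattern matches (illustrative helper)."""
    return [name for name in names if pattern.match(name)]
```

For example, a.txt and docs/b.txt are selected, while b.csv and notes.txt.bak are not, because the pattern requires the name to end in .txt.
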
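chunks
Text chunking parameters:
- size: the maximum size of each chunk, in tokens (not characters).
- overlap: the number of tokens shared between adjacent chunks.
- group_by_columns: the columns to group documents by before chunking.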
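
The effect of size and overlap can be sketched as a sliding window over a token sequence. This is a minimal illustration, not GraphRAG's chunker: chunk_tokens is a hypothetical function, and it takes an already tokenized list so that no particular tokenizer is assumed.

```python
def chunk_tokens(tokens, size=1200, overlap=100):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end
    return chunks
```

With size 1200 and overlap 100, each new chunk starts 1100 tokens after the previous one, so the last 100 tokens of a chunk reappear at the start of the next.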

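4. Storage Settings

This part configures the cache, reporting, and output storage.

cache
Cache storage:
- type: the cache type: file, blob, or cosmosdb.
- base_dir: the root directory for cache files.
reporting
Report (log) output:
- type: the reporting type: file, console, or blob.
- base_dir: the root directory for report files.
storage
Output storage:
- type: the storage type: file, blob, or cosmosdb.
- base_dir: the root directory for output files.
update_index_storage
Storage for index updates. It usually does not need to be enabled by hand: per the comment in the file, turn it on only when running `graphrag index` with custom settings, since `graphrag update` is normally used with the defaults.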
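
The base_dir values for reporting and storage in the config contain a ${timestamp} placeholder. The sketch below shows one way such a placeholder could be expanded. The resolve_base_dir helper and the YYYYMMDD-HHMMSS timestamp format are assumptions of this example and are not taken from GraphRAG.

```python
from datetime import datetime
from string import Template


def resolve_base_dir(base_dir, now=None):
    """Expand a ${timestamp} placeholder in a base_dir value (illustrative)."""
    now = now or datetime.now()
    # Assumed format; the real tool may format the run timestamp differently.
    stamp = now.strftime("%Y%m%d-%H%M%S")
    return Template(base_dir).substitute(timestamp=stamp)
```

A value without the placeholder, such as "cache", is returned unchanged, so only the reporting and storage directories end up under a per-run folder.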

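5. Workflow Settings

This part configures the indexing workflows.

skip_workflows
The workflows to skip.
entity_extraction
Entity extraction:
- prompt: the path to the prompt template file for entity extraction.
- entity_types: the entity types to extract, e.g. organization, person, geo, event.
- max_gleanings: the maximum number of gleaning passes, i.e. follow-up rounds in which the LLM is asked for entities it missed (it is not a cap on the number of entities).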
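
The max_gleanings setting bounds how many follow-up passes are made after the first extraction. The loop below sketches that idea with stand-in callables instead of real LLM calls. extract_with_gleanings, extract, and glean are hypothetical names, and stopping when a pass finds nothing new is a simplification assumed for this example.

```python
def extract_with_gleanings(text, extract, glean, max_gleanings=1):
    """Run one extraction pass, then up to `max_gleanings` follow-up passes.

    `extract(text)` and `glean(text, found)` stand in for LLM calls and
    return lists of entity names.
    """
    entities = list(extract(text))
    for _ in range(max_gleanings):
        new_entities = [e for e in glean(text, entities) if e not in entities]
        if not new_entities:
            break  # nothing was missed, so stop early
        entities.extend(new_entities)
    return entities
```

With max_gleanings set to 0 only the first pass runs; larger values trade extra LLM calls for better recall.
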
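summarize_descriptions
Description summarization:
- prompt: the path to the prompt template file for summarization.
- max_length: the maximum length of each summary, in output tokens.
claim_extraction
Claim extraction (disabled here via enabled: false):
- prompt: the path to the prompt template file for claim extraction.
- description: a description of the claims to extract.
- max_gleanings: the maximum number of gleaning passes.
community_reports
Community report generation:
- prompt: the path to the prompt template file for report generation.
- max_length: the maximum length of each report, in output tokens.
- max_input_length: the maximum number of input tokens used when generating a report.
cluster_graph
Graph clustering:
- max_cluster_size: the maximum size of a cluster (community).
embed_graph
Graph embedding, which generates node2vec embeddings for nodes (disabled here).
umap
UMAP dimensionality reduction of the node embeddings (disabled here; it also requires embed_graph to be enabled).
snapshots
Snapshot output:
- graphml: whether to write GraphML snapshots of the graph.
- embeddings: whether to write embedding snapshots.
- transient: whether to write transient snapshots.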

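6. Query Settings

This part configures the query methods.

local_search
Local search:
- prompt: the path to the system prompt template for local search.
global_search
Global search:
- map_prompt: the prompt template for the map stage of global search.
- reduce_prompt: the prompt template for the reduce stage of global search.
- knowledge_prompt: the general-knowledge prompt template for global search.
drift_search
DRIFT search:
- prompt: the path to the system prompt template for DRIFT search.
- reduce_prompt: the prompt template for the reduce stage of DRIFT search.
basic_search
Basic search:
- prompt: the path to the system prompt template for basic search.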
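
The map_prompt and reduce_prompt entries of global_search correspond to a map-reduce style of answering: each community report is queried separately, then the partial answers are merged. The sketch below shows only that control flow. global_search_sketch, map_step, and reduce_step are hypothetical stand-ins for LLM calls, not GraphRAG functions.

```python
def global_search_sketch(question, community_reports, map_step, reduce_step):
    """Answer a question over many reports in two stages (illustrative only).

    `map_step(question, report)` returns a partial answer (or an empty string),
    and `reduce_step(question, partial_answers)` merges them into one answer.
    """
    partial_answers = [map_step(question, report) for report in community_reports]
    # Drop empty partial answers before the reduce stage.
    partial_answers = [answer for answer in partial_answers if answer]
    return reduce_step(question, partial_answers)
```

In this picture the map prompt would drive map_step and the reduce prompt would drive reduce_step.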

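Summary

This configuration file defines the core behavior of GraphRAG, including:

- API calls to the LLM and the embedding model.
- How input data is processed.
- Cache, reporting, and storage settings.
- Task configuration for the workflows.
- Prompt templates and parameters for the query methods.

Adjusting these settings adapts GraphRAG to different scenarios and requirements. For the full list of options, see the official configuration documentation: https://microsoft.github.io/graphrag/config/yaml/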