Large Language Models in Practice: langchain + ChatGLM3-6B + Local Knowledge Base

Goals

Set up langchain with ChatGLM3-6B as the chat model.
Customize a local knowledge base in the langchain environment.

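Fine-tuning, local knowledge bases, and prompts: basic concepts

Fine-tuning, local knowledge bases, and prompts are important concepts in adapting and optimizing LLMs. They are related but distinct.

Fine-tuning is a low-cost way to adapt a pre-trained model to a specific task or dataset.
A local knowledge base is a database that stores domain-specific information and can supply an LLM with up-to-date, dynamic knowledge.
A prompt is the input text used to steer the model toward a particular kind of answer.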
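
To make the relationship between the last two concepts concrete, here is a minimal sketch of how snippets retrieved from a local knowledge base might be folded into a prompt. The bigram-overlap scoring, the template wording, the sample sentences, and every function name are illustrative assumptions. This is not the mechanism langchain itself uses, which ranks text by embedding vectors.

```python
def bigrams(text: str) -> set:
    """Character bigrams: a crude stand-in for embedding similarity."""
    cleaned = "".join(text.lower().split())
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def retrieve(question: str, chunks: list, top_k: int = 2) -> list:
    """Return up to top_k chunks that share the most bigrams with the question."""
    q = bigrams(question)
    ranked = sorted(chunks, key=lambda c: len(q & bigrams(c)), reverse=True)
    return [c for c in ranked[:top_k] if q & bigrams(c)]


def build_prompt(question: str, context: list) -> str:
    """Fold retrieved snippets into a prompt (hypothetical template)."""
    if not context:
        return question  # nothing retrieved: the model answers on its own
    joined = "\n".join(f"- {c}" for c in context)
    return f"Known information:\n{joined}\n\nAnswer based on the above: {question}"


# Made-up knowledge base content, for illustration only.
knowledge_base = [
    "Migraine often presents as a throbbing one-sided headache.",
    "Conda environments isolate Python dependencies.",
]
question = "What does a migraine headache feel like?"
prompt = build_prompt(question, retrieve(question, knowledge_base, top_k=1))
print(prompt)
```

Fine-tuning would change the model's weights instead. In this sketch the model is untouched and only its input changes.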

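1. Environment preparation

Select hardware resources on aliyun. When choosing the image version, pay close attention to the pytorch and cuda versions: pytorch=2.1.2, cuda=12.1, python=3.10, plus modelscope=1.11.0 and Tensorflow=2.14.0.

No.	Resource details
1	32 GB RAM, 16 GB GPU (NVIDIA Tesla V100), Ubuntu 20.04
2	pytorch=2.1.2, cuda=12.1, python=3.10
3	modelscope=1.11.0 and Tensorflow=2.14.0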
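
Since the versions in the table matter, a quick check from inside the environment can save time later. The following is a sketch only: it assumes the pip distribution names `torch`, `modelscope`, and `tensorflow`, and it skips the CUDA version, which would require importing torch itself. The helper names are my own.

```python
import sys
from importlib import metadata
from typing import Optional

# Versions taken from the resource table above.
EXPECTED = {"torch": "2.1.2", "modelscope": "1.11.0", "tensorflow": "2.14.0"}
EXPECTED_PYTHON = (3, 10)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_matches(installed: Optional[str], expected: str) -> bool:
    """True for an exact match or a local build tag such as '2.1.2+cu121'."""
    return installed is not None and (
        installed == expected or installed.startswith(expected + "+")
    )


def report() -> list:
    """Collect one human-readable line per checked component."""
    py_ok = sys.version_info[:2] == EXPECTED_PYTHON
    lines = [
        f"python {sys.version_info.major}.{sys.version_info.minor}: "
        f"{'ok' if py_ok else 'mismatch'}"
    ]
    for name, expected in EXPECTED.items():
        found = installed_version(name)
        status = "ok" if version_matches(found, expected) else "mismatch"
        lines.append(f"{name}: expected {expected}, found {found} ({status})")
    return lines


for line in report():
    print(line)
```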

2. Create a conda virtual environment

conda create -n langchain python=3.10
conda activate langchain

3. Download the langchain + chatglm3-6b source code and models

Clone the langchain-ChatGLM source code

git clone https://github.com/imClumsyPanda/langchain-ChatGLM.git

Download the text2vec-large-chinese model

git lfs install

wget https://aliendao.cn/model_download.py

pip install huggingface_hub

python model_download.py --repo_id GanymedeNil/text2vec-large-chinese

Clone ChatGLM3-6B

git clone https://github.com/THUDM/ChatGLM3

cd ChatGLM3

git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git chatglm3-6b-models

Test whether ChatGLM3-6B starts correctly

# Install requirements.txt as described in ChatGLM3's readme.md
pip install -r requirements.txt

# web_demo_gradio.py fails when called; copy the script into the ChatGLM3 directory, then downgrade gradio from 4.1* to 3.40
pip install gradio==3.40.0

Install the langchain dependencies

# Enter the directory
$ cd langchain-ChatGLM

# Install all dependencies
$ pip install -r requirements.txt 
$ pip install -r requirements_api.txt
$ pip install -r requirements_webui.txt

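Initialize langchain

python copy_config_example.py
# TODO: edit the relevant parameters in model_copy.py

# Following the official tutorial, the command below fails, most likely because the embedding model is misconfigured
#python init_database.py --recreate-vs

In model_copy.py, the default content does not need to be deleted (some tutorials recommend deleting it). Just change the embedding and model paths to absolute paths.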
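
Before pasting absolute paths into the config, it can help to resolve and verify them first. The sketch below is an assumption-laden helper, not part of langchain-ChatGLM: the dictionary keys, the relative paths, and the project root are all hypothetical placeholders to be replaced with your own locations.

```python
from pathlib import Path


def to_absolute(paths: dict, root: str) -> dict:
    """Resolve each relative entry against root; leave absolute ones alone."""
    base = Path(root).expanduser().resolve()
    resolved = {}
    for name, raw in paths.items():
        p = Path(raw).expanduser()
        resolved[name] = str(p if p.is_absolute() else (base / p).resolve())
    return resolved


def missing(paths: dict) -> list:
    """Names whose path does not exist on disk."""
    return [name for name, p in paths.items() if not Path(p).exists()]


# Hypothetical relative entries and project root, for illustration only.
model_paths = {
    "text2vec-large-chinese": "GanymedeNil/text2vec-large-chinese",
    "chatglm3-6b": "ChatGLM3/chatglm3-6b-models",
}
absolute = to_absolute(model_paths, root=".")
for name, path in absolute.items():
    print(name, "->", path)
print("missing on this machine:", missing(absolute))
```

Any name reported as missing points to a path that would also fail when the application tries to load the model.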

Start the langchain UI

# One-command startup
python startup.py -a

Output after the one-command startup:

==============================Langchain-Chatchat Configuration==============================
操作系统：Linux-4.19.24-7.34.cbp.al7.x86_64-x86_64-with-glibc2.35.
python版本：3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0]
项目版本：v0.2.10
langchain版本：0.0.354. fastchat版本：0.2.35


当前使用的分词器：ChineseRecursiveTextSplitter
当前启动的LLM模型：['chatglm3-6b', 'zhipu-api', 'openai-api'] @ cuda
{'device': 'cuda',
 'host': '0.0.0.0',
 'infer_turbo': False,
 'model_path': '/mnt/workspace/langchain-ChatGLM/ChatGLM3/THUDM/chatglm3-6b',
 'model_path_exists': True,
 'port': 20002}
{'api_key': '',
 'device': 'auto',
 'host': '0.0.0.0',
 'infer_turbo': False,
 'online_api': True,
 'port': 21001,
 'provider': 'ChatGLMWorker',
 'version': 'glm-4',
 'worker_class': <class 'server.model_workers.zhipu.ChatGLMWorker'>}
{'api_base_url': 'https://api.openai.com/v1',
 'api_key': '',
 'device': 'auto',
 'host': '0.0.0.0',
 'infer_turbo': False,
 'model_name': 'gpt-4',
 'online_api': True,
 'openai_proxy': '',
 'port': 20002}
当前Embbedings模型： text2vec-large-chinese @ cuda
==============================Langchain-Chatchat Configuration==============================

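Startup succeeded.

You may run into a pydantic problem; downgrading fixes it.
Probably because I switched back and forth between langchain and ChatGLM3-6B and ran requirements.txt several times, some Python library versions were upgraded, so rerunning python startup.py -a fails.
langchain's requirements.txt pins pydantic==1.10.13, so downgrade to that version: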

pip install pydantic==1.10.13
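
Because running several requirements.txt files in one environment can silently move versions, a small drift check may help find the culprit faster. This sketch parses only simple `name==version` pins. The regular expression, the function names, and the second sample line are my own assumptions. Only the pydantic pin comes from the text above.

```python
import re
from importlib import metadata

# Matches 'name==version', optionally with extras such as 'name[extra]==1.0'.
PIN = re.compile(r"^\s*([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*==\s*([^\s;#]+)")


def parse_pins(requirements_text: str) -> dict:
    """Extract exact pins; comments and unpinned lines are ignored."""
    pins = {}
    for line in requirements_text.splitlines():
        match = PIN.match(line)
        if match:
            pins[match.group(1).lower()] = match.group(2)
    return pins


def find_drift(pins: dict, installed: dict) -> dict:
    """Map package -> (pinned, installed) wherever the two differ."""
    return {
        name: (wanted, installed[name])
        for name, wanted in pins.items()
        if name in installed and installed[name] != wanted
    }


def installed_versions(names) -> dict:
    """Look up installed versions, skipping packages that are not installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


# Sample text in the style of a requirements file (second line is made up).
sample = """
pydantic==1.10.13
some-unpinned-package>=1.0
"""
pins = parse_pins(sample)
print(find_drift(pins, installed_versions(pins)))
```

To check a real file, pass the contents of requirements.txt to `parse_pins` instead of `sample`.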

4. Run the ChatGLM3 web_demo

Follow the relevant readme.md in ChatGLM3. If peft cannot be found, install it:

pip install peft

Run web_demo_gradio.py with python

python web_demo_gradio.py

Run web_demo_streamlit.py with streamlit

# Bind to the local 127.0.0.1 address; the port is assigned automatically and the page can be opened in a browser
streamlit run web_demo_streamlit.py --server.address=127.0.0.1

Install the Jupyter kernel

ipython kernel install --name chatglm3-demo --user

Run composite_demo with streamlit
Copy all files in composite_demo into the ChatGLM3-6B directory; otherwise, modify the paths in the .py files.

streamlit run main.py --server.address=127.0.0.1

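5. Run the langchain Web UI and prepare the local knowledge base

Creating a new knowledge base on the local knowledge base page produces this error: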

ValueError: 'text2vec-large-chinese' is not in list
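
The wording of this message has the same form as the error Python's built-in `list.index` raises when the value is absent, which suggests the selected embedding model name is being looked up in a list that does not contain it. That reading is an inference, not something confirmed from the application code. The sketch below reproduces the failure and shows a guarded lookup; the model list and the helper name are hypothetical.

```python
from typing import Optional


def index_or_default(name: str, options: list, default: Optional[str] = None) -> int:
    """Return the position of name in options.

    Falls back to default when name is missing, and raises a more
    descriptive error when neither is available.
    """
    if name in options:
        return options.index(name)
    if default is not None and default in options:
        return options.index(default)
    raise ValueError(f"{name!r} is not one of the known models: {options}")


# Hypothetical list of embedding models known to the application.
known = ["bge-large-zh", "m3e-base"]

try:
    known.index("text2vec-large-chinese")  # plain lookup, no guard
except ValueError as err:
    print("unguarded:", err)

print("guarded:", index_or_default("text2vec-large-chinese", known, default="bge-large-zh"))
```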

The command from the official tutorial also fails:

python init_database.py --recreate-vs

Try a different embedding model. Download BAAI/bge-large-zh:

wget https://aliendao.cn/model_download.py

pip install huggingface_hub

python model_download.py --repo_id BAAI/bge-large-zh

After the download finishes, place the BAAI directory in the langchain root directory. Then create the local knowledge base with BAAI:

python init_database.py --recreate-vs

Success.

Start langchain again.

Open the local knowledge base page; BAAI has loaded successfully.

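6. Create a new local knowledge base

Fill in the metadata for the new knowledge base
Upload a Word file
I uploaded an unmodified Word file with no preprocessing. It failed with AxiosError: Request failed with status code 403
One suggestion online is to downgrade streamlit, although I did not launch with streamlit. Worth a try: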

pip install streamlit==1.28.0

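Problem solved; the upload now works.

Replace the Word file with txt and reload

The document loader, text splitter, document count, source file, and vector store fields are all empty or marked with a cross. Is something still wrong?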

2024-02-07 06:23:23,603 - utils.py[line:295] - INFO: RapidOCRDocLoader used for /mnt/workspace/langchain-ChatGLM/knowledge_base/RichardNorth/content/神经内科典型病例分析.doc
2024-02-07 06:23:24,088 - utils.py[line:377] - ERROR: PackageNotFoundError: 从文件 RichardNorth/神经内科典型病例分析.doc 加载文档时出错：Package not found at '/mnt/workspace/langchain-ChatGLM/knowledge_base/RichardNorth/content/神经内科典型病例分析.doc'
2024-02-07 06:23:24,104 - faiss_cache.py[line:38] - INFO: 已将向量库 ('RichardNorth', 'bge-large-zh') 保存到磁盘
INFO:     127.0.0.1:37936 - "POST /knowledge_base/update_docs HTTP/1.1" 200 OK
2024-02-07 06:23:24,106 - _client.py[line:1027] - INFO: HTTP Request: POST http://127.0.0.1:7861/knowledge_base/update_docs "HTTP/1.1 200 OK"

After replacing the Word file with a txt file and reloading, the file loads.
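
The "Package not found" wording in the log matches the error python-docx raises when a file is not a zip-based .docx package, and a legacy .doc file is a binary OLE container, not a zip. This is a likely explanation and not one confirmed from the loader's source. The sketch below classifies a file by its content instead of its extension; the function name and return labels are my own.

```python
import tempfile
import zipfile
from pathlib import Path

# Signature at the start of legacy binary Office files (OLE compound file).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def word_container(path: str) -> str:
    """Classify a file as 'docx-like zip', 'legacy ole', or 'unknown'."""
    if zipfile.is_zipfile(path):
        return "docx-like zip"
    with open(path, "rb") as fh:
        if fh.read(len(OLE_MAGIC)) == OLE_MAGIC:
            return "legacy ole"
    return "unknown"


# Build two tiny stand-in files to demonstrate the check.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "sample.doc"
    legacy.write_bytes(OLE_MAGIC + b"\x00" * 16)
    modern = Path(tmp) / "sample.docx"
    with zipfile.ZipFile(modern, "w") as zf:
        zf.writestr("word/document.xml", "<w:document/>")
    for f in (legacy, modern):
        print(f.name, "->", word_container(str(f)))
```

A file reported as "legacy ole" would need converting to .docx or plain text before upload, which is consistent with the txt file loading.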

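Knowledge base Q&A returns answers, but the local knowledge base was not actually loaded

On closer inspection, the file was uploaded successfully, but later steps such as text splitting and vector store creation did not succeed, so retrieval failed at question time:
未找到相关文档,该回答为大模型自身能力解答！ (No relevant documents found; this answer was produced by the model's own capabilities!)
The local knowledge base probably needs to be normalized, that is, preprocessed.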

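7. Preprocess the local knowledge base

Using the text format of the local knowledge base in the BAAI vector store as a reference, I cleaned up my own knowledge base txt file. Before processing, the formatting was indeed messy (the file was downloaded from the internet): it needed line breaks fixed, spaces removed, markdown heading markers added, and so on.
After processing, I uploaded it again and it succeeded.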
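
The cleanup steps described above could be automated roughly as follows. This is a minimal sketch under stated assumptions, not the exact procedure used here: it collapses runs of spaces instead of deleting every space, and it guesses headings with a simple rule (a short line that does not end in punctuation). The threshold, the `##` heading level, and the sample text are all illustrative.

```python
import re

END_PUNCTUATION = "。.!?！？；;:：,，"


def normalize_text(raw: str, heading_max_len: int = 20) -> str:
    """Trim lines, collapse spaces, squeeze blank runs, and mark headings."""
    out = []
    unified = raw.replace("\r\n", "\n").replace("\u3000", " ")
    for line in unified.split("\n"):
        line = re.sub(r"[ \t]+", " ", line).strip()
        if not line:
            if out and out[-1] != "":
                out.append("")  # keep one blank line as a paragraph break
            continue
        is_heading = (
            len(line) <= heading_max_len
            and not line.startswith("#")
            and line[-1] not in END_PUNCTUATION
        )
        out.append(f"## {line}" if is_heading else line)
    return "\n".join(out).strip() + "\n"


# Made-up sample input, for illustration only.
raw = "Case 1\n\n\n   Patient:   male,  56 years old.  \nChief complaint: headache for three days.\n"
print(normalize_text(raw))
```

The heading rule will misfire on short sentences without final punctuation, so the output is worth a manual review before upload.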