LLM Knowledge-Base AI Assistant Setup: Ollama + Xinference + Dify + WeCom (企业微信)

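Our company needed an AI environment set up on a new machine: an intelligent assistant available both on the web and inside WeCom, with the large models running offline. The steps are recorded below for future maintenance.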

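1. Installing the system

1.1 Preparation

A new computer (decide in advance which models you will run, and size the hardware accordingly)
A bootable USB drive (4 GB or larger)
A Debian image (https://mirror.lzu.edu.cn/debian-cd/)
An image flashing tool: balenaEtcher, Rufus, etc.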

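1.2 Steps

Download the image;
Flash it to the USB drive with the flashing tool;
Plug the drive into the computer and boot from it to start the installation;
Install with the graphical installer; for reference, see: https://blog.csdn.net/weixin_44200186/article/details/131970040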

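2. Environment setup

2.1 Network configuration

Several network management tools may be present; configure whichever one manages the interface, for example:

ifupdown: edit the configuration file /etc/network/interfaces
systemd-networkd: configuration files live in /etc/systemd/network/
NetworkManager: configuration files live in /etc/NetworkManager/system-connections/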

2.2 Configuring APT package sources

Edit /etc/apt/sources.list:

# Comment out the original first line (the installation-media entry)
# deb cdrom:[Debian GNU/Linux 12.7.0 _Bookworm_ - Official amd64 DVD Binary-1 with firmware 20240831-10:40]/ bookworm contrib main non-free-firmware

# Add a domestic (China) mirror
deb http://mirrors.ustc.edu.cn/debian stable main contrib non-free
# deb-src http://mirrors.ustc.edu.cn/debian stable main contrib non-free
deb http://mirrors.ustc.edu.cn/debian stable-updates main contrib non-free
# deb-src http://mirrors.ustc.edu.cn/debian stable-updates main contrib non-free

# Suggested by GPT (official Debian repositories)
# deb http://deb.debian.org/debian/ bookworm main
# deb http://deb.debian.org/debian/ bookworm-updates main
# deb http://security.debian.org/debian-security bookworm-security main

After saving, run sudo apt update to refresh the APT package index.

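2.3 Arrow keys print A/B/C/D and Backspace does not work in vi

Edit /etc/vim/vimrc.tiny:
Find the line set compatible and change compatible to nocompatible (non-compatible mode).
If Backspace still does not work, append: set backspace=2, or equivalently set backspace=indent,eol,start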

2.4 Allowing SSH connections

Edit /etc/ssh/sshd_config:

# Enable the following options
PermitRootLogin yes
PasswordAuthentication yes

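# This machine is used only on the internal network, so the security level is deliberately low.
# On a public network, use key-based login instead.

After saving, restart the SSH service for the changes to take effect: sudo systemctl restart ssh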

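2.5 Installing Docker

Because of network restrictions (the Great Firewall), installing Docker directly did not succeed. I installed BT Panel (宝塔) instead and installed Docker through its panel.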

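3. Deploying Dify

3.1 Dify

Download the source package from GitHub (https://github.com/langgenius/dify)
Go to the docker directory of the package and make a copy of .env.example: cp .env.example .env
Dify listens on port 80 by default, which may conflict with services already running on the server, so I changed the port in .env

# The URLs near the top of the file must be adjusted too; otherwise the API and the links shown in the front end will be wrong
# For internal-only use, the internal IP is sufficient
CONSOLE_API_URL=http://<host-ip>:8080
CONSOLE_WEB_URL=http://<host-ip>:8080
SERVICE_API_URL=http://<host-ip>:8080
APP_API_URL=http://<host-ip>:8080
APP_WEB_URL=http://<host-ip>:8080
FILES_URL=http://<host-ip>:8080

# Around line 700, near the end of the file: the nginx ports
EXPOSE_NGINX_PORT=8080
EXPOSE_NGINX_SSL_PORT=8443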
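
Editing eight keys by hand is easy to get wrong, so the changes above could also be applied with a small script. The following is only a sketch: the helper names apply_env_overrides and dify_overrides are made up for illustration, the key names are the ones listed above, and it assumes each key appears at most once as an uncommented KEY=value line in .env.

```python
def apply_env_overrides(env_text: str, overrides: dict[str, str]) -> str:
    """Return env_text with the first KEY=value line of each override key rewritten.

    Comment lines and unrelated keys are left untouched. Keys that are not
    present in the file are appended at the end.
    """
    remaining = dict(overrides)
    out_lines = []
    for line in env_text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out_lines.append(f"{key}={remaining.pop(key)}")
        else:
            out_lines.append(line)
    out_lines.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out_lines) + "\n"


def dify_overrides(host_ip: str, port: int = 8080, ssl_port: int = 8443) -> dict[str, str]:
    """Build the URL and port settings from this section for one host and port."""
    base = f"http://{host_ip}:{port}"
    url_keys = [
        "CONSOLE_API_URL", "CONSOLE_WEB_URL", "SERVICE_API_URL",
        "APP_API_URL", "APP_WEB_URL", "FILES_URL",
    ]
    overrides = {key: base for key in url_keys}
    overrides["EXPOSE_NGINX_PORT"] = str(port)
    overrides["EXPOSE_NGINX_SSL_PORT"] = str(ssl_port)
    return overrides
```

To use it, read .env into a string, pass it through apply_env_overrides together with dify_overrides("<host-ip>"), and write the result back. Keep a backup of the original file first.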

Start the containers: docker compose up -d
If startup succeeds, nine new containers will be running (check with docker ps)
Reference: https://docs.dify.ai/v/zh-hans/getting-started/install-self-hosted/docker-compose

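3.2 dify-on-wechat

This project was installed as an extension because the assistant needs to connect to WeCom.
Preparation
- Create an application in Dify and obtain its API key;
- Create an application in WeCom and obtain the corp ID, the application (agent) ID, and the application secret;
- Open the application's "设置API接收消息" (set up API message receiving) page and generate a random token and AES key.
Download the source package from GitHub (https://github.com/hanfangyuan4396/dify-on-wechat)
Go to the docker directory of the package, edit the settings in docker-compose.yml, and add the WeCom settings.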

      # Dify settings
      DIFY_API_BASE: '<Dify app API base URL>'
      DIFY_API_KEY: '<Dify app API key>'

      # WeCom settings (added)
      channel_type: 'wechatcom_app'
      single_chat_prefix: '[""]'
      wechatcom_corp_id: '<corp ID>'
      wechatcomapp_token: '<token from the API message-receiving settings>'
      wechatcomapp_secret: '<application secret>'
      wechatcomapp_agent_id: '<application (agent) ID>'
      wechatcomapp_aes_key: '<AES key from the API message-receiving settings>'
      wechatcomapp_port: '7860'
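
A missing or mistyped value here usually only shows up later, when WeCom fails to verify the callback. A quick sanity check before starting the container might look like the sketch below. The function name check_wecom_settings is hypothetical, the key names are the ones from the configuration above, and the 43-character rule reflects the length WeCom uses for the EncodingAESKey.

```python
REQUIRED_KEYS = [
    "DIFY_API_BASE", "DIFY_API_KEY",
    "wechatcom_corp_id", "wechatcomapp_token", "wechatcomapp_secret",
    "wechatcomapp_agent_id", "wechatcomapp_aes_key", "wechatcomapp_port",
]


def check_wecom_settings(settings: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key, "").strip():
            problems.append(f"{key} is empty or missing")

    # WeCom's EncodingAESKey is a fixed-length, 43-character string.
    aes_key = settings.get("wechatcomapp_aes_key", "")
    if aes_key and len(aes_key) != 43:
        problems.append("wechatcomapp_aes_key should be 43 characters long")

    port = settings.get("wechatcomapp_port", "")
    if port and not (port.isdigit() and 0 < int(port) < 65536):
        problems.append("wechatcomapp_port is not a valid TCP port")
    return problems
```

The dictionary can be filled by hand or parsed from the environment section of docker-compose.yml; this sketch checks only the shape of the values, not whether WeCom accepts them.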

Start the container: docker compose up -d
References:
- https://docs.dify.ai/v/zh-hans/learn-more/use-cases/dify-on-wechat
- https://docs.link-ai.tech/cow/multi-platform/wechat-com

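4. Deploying Ollama

One-line install script: curl -fsSL https://ollama.com/install.sh | sh
Download a model: ollama pull llama3.1 (llama3.1 is used as the example)
Run the model: ollama run llama3.1
Default model download path: /usr/share/ollama/.ollama/models
To change the default settings, edit /etc/systemd/system/ollama.service and add under [Service]:
- Maximum number of parallel requests each model handles: Environment="OLLAMA_NUM_PARALLEL=4"
- Maximum number of models loaded at the same time: Environment="OLLAMA_MAX_LOADED_MODELS=2"
- Maximum number of requests queued when busy: Environment="OLLAMA_MAX_QUEUE=512"
- Which GPU(s) to use, by index; an invalid index such as -1 hides every GPU and forces CPU-only inference: Environment="CUDA_VISIBLE_DEVICES=-1"
After editing the unit file, reload the systemd configuration: systemctl daemon-reload
Then restart the service: systemctl restart ollama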
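
To confirm that the service is answering after a restart, a request can be sent to Ollama's HTTP API, which listens on port 11434 by default. The sketch below uses only the standard library; the helper names build_generate_request and generate are illustrative, and it assumes Ollama runs on the same machine with the llama3.1 model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434"  # Ollama's default listen address


def build_generate_request(model: str, prompt: str, base_url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling generate("llama3.1", "Say hello") on the server should return a short reply once the model is loaded; the first call can be slow because the model has to be read into memory.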

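5. Deploying Xinference

Ollama does not support rerank models, so Xinference is used to serve them.
Installing directly with the system Python can cause environment problems, so a conda virtual environment is used.
Along the way the conda deployment threw errors, so I tried Docker; Docker then felt inconvenient, so I went back to the conda approach. It took two days to get everything working.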

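5.1 Option 1: conda

5.1.1 Installing conda

Find the installer that matches the server (archive: https://repo.anaconda.com/archive/) and copy its download URL;
Download it from the command line: wget <download URL>;
Run the installer: bash <downloaded file>.sh;
Press Enter through the prompts; at the end, when asked whether to add conda to the shell environment, answer yes, otherwise the environment variables must be added by hand;
Run: source ~/.bashrc;
Verify the installation: conda -V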

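5.1.2 Deploying Xinference

Create a virtual Python environment: conda create -n Xinference python=3.11.3 (the name Xinference is arbitrary; Python 3.11.3 was chosen here)
Activate the environment: conda activate Xinference;
Full install: pip install "xinference[all]";
If the llama-cpp-python package fails to download during installation, find the matching wheel at https://github.com/abetlen/llama-cpp-python/releases (the cp tag must match the Python version: cp310 means Python 3.10, so this Python 3.11 environment needs cp311), download it to the server, and install it with pip install;
Check that PyTorch can use the GPU: python -c "import torch; print(torch.cuda.is_available())"; if this fails, install torch: pip install torch;
If startup fails with RuntimeError: Failed to load shared library '/root/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/lib/libllama.so': libcudart.so.12: cannot open shared object file: No such file or directory, the CUDA Toolkit is missing. Go to https://developer.nvidia.com/cuda-downloads , pick the matching version, and follow the instructions shown there.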

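Start Xinference:

# Run in the foreground
xinference-local --host 0.0.0.0 --port 9997
# Run in the background, writing stdout and stderr to output.log
nohup xinference-local --host 0.0.0.0 --port 9997 > output.log 2>&1 &

Leave the conda environment: conda deactivate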

5.2 Option 2: Docker

Pull the image: docker pull registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest;

Install nvidia-container-toolkit (without it, containers cannot use the GPU)

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
 sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
 sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Start the container (using the image pulled above):

docker run -it --name xinference -d -p 9997:9997 -e XINFERENCE_MODEL_SRC=modelscope -e XINFERENCE_HOME=/workspace -v <local path>:/workspace --gpus all registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:latest xinference-local -H 0.0.0.0 --log-level debug

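5.3 Downloading and launching a model

Install the modelscope package: pip install modelscope;
Find the model on ModelScope (魔搭社区): https://www.modelscope.cn/models/ZhipuAI/glm-4-9b-chat

Write an install.py file:

# Download the model
from modelscope import snapshot_download
# local_dir is where the model files are stored; it is reused as the model path when launching
model_dir = snapshot_download('ZhipuAI/glm-4-9b-chat', local_dir='/opt/chatglm-9b')

Make the script executable and run it, then wait for the download to finish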
```
chmod +x install.py
python install.py
```
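Open the Xinference web UI (http://<host-ip>:9997) and go to Launch Model to deploy:
- Model Engine: the inference backend
- Model Format: the model file format
- Model Size: the parameter count of the model
- Quantization: the quantization precision
- N-GPU: how many GPUs to use
- Model UID: the name used to refer to the model
- GPU IDX: the GPU index or indices; GPUs are numbered from 0, so with two GPUs the indices are 0,1
- DownloadHub: the site to download the model from; choose NONE if the model has already been downloaded
- ModelPath: the path to the model, i.e. the local_dir set in install.py (/opt/chatglm-9b)
Click the rocket icon at the bottom to launch.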
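
Since reranking was the reason for choosing Xinference, it is worth checking a launched rerank model over HTTP before wiring it into Dify. The sketch below assumes a rerank model with UID bce-reranker-base_v1 is running (the startup script in section 6 launches one) and that Xinference is reachable on port 9997. It uses Xinference's /v1/rerank endpoint; the helper names are illustrative, and the response fields index and relevance_score are what I expect from that endpoint, so verify them against your version.

```python
import json
import urllib.request

XINFERENCE_URL = "http://127.0.0.1:9997"  # host and port used when starting xinference-local


def build_rerank_request(model_uid: str, query: str, documents: list[str],
                         base_url: str = XINFERENCE_URL) -> urllib.request.Request:
    """Build a POST request asking the rerank model to score documents against query."""
    payload = {"model": model_uid, "query": query, "documents": documents}
    return urllib.request.Request(
        f"{base_url}/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def order_by_relevance(response_json: dict, documents: list[str]) -> list[str]:
    """Return the documents sorted from most to least relevant.

    Assumes each entry of response_json["results"] carries the original
    document position ("index") and its score ("relevance_score").
    """
    results = sorted(response_json["results"],
                     key=lambda item: item["relevance_score"], reverse=True)
    return [documents[item["index"]] for item in results]
```

Sending the request with urllib.request.urlopen and passing the decoded JSON to order_by_relevance gives the documents in ranked order, which is a quick way to see whether the model behaves sensibly on your own knowledge-base text.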

6. Starting at boot

Write a shell script:

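#!/bin/bash
# Give the system time to finish booting before starting anything
sleep 120
# Load the chat model into Ollama
sudo ollama run qwen2.5
sleep 60
# Make conda usable in this non-interactive shell, then activate the environment
source /root/anaconda3/etc/profile.d/conda.sh
conda activate /root/anaconda3/envs/Xinference
sleep 15
# Start Xinference in the background (note the trailing &) so the script can reach the launch step
nohup xinference-local --host 0.0.0.0 --port 9997 &
sleep 15
# Launch the rerank model from its local path
xinference launch --model-name bce-reranker-base_v1 --model-type rerank --model_path /usr/share/xinference/models/bce-reranker-base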

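Make the script executable (chmod +x <path>/auto-startup.sh), then run crontab -e and append at the end: @reboot <path>/auto-startup.sh.
Note: some services are not yet up right after boot, so wait times were added to avoid errors.
Check GPU status: nvidia-smi; watch it live: watch -n 1 nvidia-smi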
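
The fixed sleep values in the startup script are guesses: too short and the next command fails, too long and boot takes longer than needed. One alternative is to poll the service port until it accepts connections. The sketch below is a possible replacement for the sleeps, not what the script above does; wait_for_port is a made-up helper name, and the ports to poll would be 11434 for Ollama (its default) and 9997 for Xinference.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 180.0, interval: float = 2.0) -> bool:
    """Poll host:port until a TCP connection succeeds or timeout seconds pass.

    Returns True as soon as the port accepts a connection, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or unreachable); wait a little and retry.
            time.sleep(interval)
    return False
```

A startup script written in Python could call wait_for_port("127.0.0.1", 9997) after starting xinference-local and only then run xinference launch. An open port only shows that the process is listening, so a short extra delay may still be needed before the first model launch.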