# Environment Setup
## System Requirements
+ **Operating system**: Linux (Ubuntu 20.04+ recommended) / Windows (WSL2 required)
+ **Python**: 3.8+
+ **GPU**: NVIDIA GPU (≥16 GB VRAM, RTX 3090/A100 recommended) with CUDA 11.8
+ **Disk space**: ≥50 GB (for model weights and dependencies)

## Install Dependencies
```shell
# Create a virtual environment
conda create -n deepseek python=3.10 -y
conda activate deepseek

# Install PyTorch with CUDA support
pip install torch==2.0.1+cu118 --index-url https://download.pytorch.org/whl/cu118

# Install the Hugging Face libraries
pip install transformers==4.35.0 accelerate sentencepiece
```
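Before downloading tens of gigabytes of weights, it is worth confirming that PyTorch can actually see the GPU. The snippet below is a minimal check that uses only standard PyTorch/transformers calls and makes no DeepSeek-specific assumptions:

```python
import torch
import transformers

# Confirm the installed versions and that PyTorch can reach the GPU
print("torch:", torch.__version__, "| transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # Sanity-check the >=16 GB VRAM requirement listed above
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")
```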
# Obtaining the Model Weights
## Download from Official Channels
1. Visit the [official DeepSeek open-source page](https://github.com/deepseek-ai) or the [Hugging Face Model Hub](https://huggingface.co/deepseek-ai)
2. Locate the target model (e.g. `deepseek-llm-7b-base`)
3. Download it with Git:
```shell
git lfs install
git clone https://huggingface.co/deepseek-ai/deepseek-llm-7b-base
```
## Alternative: Mirror Sites in China

If downloads from the official source are slow, you can use a domestic mirror site:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "deepseek-ai/deepseek-llm-7b-base",
    local_dir="./deepseek-model",
    revision="main",
    mirror="https://mirror.sjtu.edu.cn/huggingface"  # SJTU mirror; newer huggingface_hub releases configure this via the HF_ENDPOINT environment variable instead
)
```
# Model Loading and Inference
## Basic Inference Code
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./deepseek-llm-7b-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",           # automatically split layers across GPU/CPU
    torch_dtype=torch.bfloat16
)
input_text = "中国的首都是哪里?"  # "What is the capital of China?"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Quantized Inference (Lower VRAM Usage)
```python
from transformers import BitsAndBytesConfig  # 4-bit loading also requires the bitsandbytes package (pip install bitsandbytes)
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto"
)
```
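As a quick sanity check that 4-bit loading actually shrank the model in memory, transformers exposes `get_memory_footprint()` on loaded models; this continues the session from the block above and gives only a rough figure:

```python
# Rough size of the quantized model's weights resident in memory
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")
```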
# Deploying as an API Service
## Create a REST Interface with FastAPI
```python
# api.py — assumes the tokenizer and model from the previous sections are loaded in this module
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
class QueryRequest(BaseModel):
    text: str
    max_length: int = 100
@app.post("/generate")
async def generate_text(request: QueryRequest):
    inputs = tokenizer(request.text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=request.max_length,
        do_sample=True,      # sampling must be enabled for temperature to take effect
        temperature=0.7
    )
    return {"result": tokenizer.decode(outputs[0], skip_special_tokens=True)}
```
## Start the Service
```shell
uvicorn api:app --host 0.0.0.0 --port 8000
```
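Once uvicorn is running, the endpoint can be exercised from any HTTP client. A minimal sketch with the `requests` library, assuming the default host/port from the command above and the field names defined in `QueryRequest`:

```python
import requests

# POST to the /generate endpoint defined above; field names follow QueryRequest
resp = requests.post(
    "http://localhost:8000/generate",
    json={"text": "中国的首都是哪里?", "max_length": 100},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["result"])
```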
# Advanced Configuration
## Multi-GPU Parallelism
```python
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="balanced",  # spread the layers evenly across all visible GPUs
)
```

## Monitor VRAM Usage
```bash
# Install the monitoring tool
pip install nvitop

# View VRAM usage in real time
nvitop -m full
```
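If you would rather check memory from inside the inference process instead of a separate terminal, PyTorch's built-in counters give a per-GPU snapshot; this sketch is plain PyTorch and independent of nvitop:

```python
import torch

# Per-GPU snapshot of memory allocated/reserved by the current process
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**3
    reserved = torch.cuda.memory_reserved(i) / 1024**3
    print(f"GPU {i}: allocated {allocated:.2f} GB, reserved {reserved:.2f} GB")
```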
## Secure Access Control
Add authentication to the FastAPI service:
```python
from fastapi import Depends, HTTPException
from fastapi.security import APIKeyHeader
security = APIKeyHeader(name="X-API-Key")
@app.post("/generate")
async def secure_generate(
    request: QueryRequest,
    api_key: str = Depends(security)
):
    if api_key != "YOUR_SECRET_KEY":
        raise HTTPException(status_code=403, detail="Invalid API Key")
    # ...original generation logic...
```
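A client then only needs to send the matching header; this is the earlier `requests` example with the `X-API-Key` header added (the key value is a placeholder for whatever the server expects):

```python
import requests

# The header name must match APIKeyHeader(name="X-API-Key") on the server
resp = requests.post(
    "http://localhost:8000/generate",
    headers={"X-API-Key": "YOUR_SECRET_KEY"},  # placeholder key
    json={"text": "中国的首都是哪里?", "max_length": 100},
)
print(resp.status_code, resp.json())
```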
# Troubleshooting Common Issues
| **Symptom** | **Solution** |
| --- | --- |
| `CUDA out of memory` | Enable quantization (4-bit/8-bit) or use a GPU with more VRAM |
| Garbled Chinese output | Check that the correct tokenizer is loaded and force UTF-8 encoding |
| Slow inference | Enable the `flash_attention` optimization or use TensorRT acceleration |
| Model weights fail to load | Verify file integrity (compare SHA256 checksums) |
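For the last row, one way to check weight-file integrity is to hash the downloaded shards locally and compare the result against the checksums published on the model page. A standard-library sketch; the file path is only an example and should point at an actual shard in your download:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    # Stream the file in 1 MB chunks so multi-GB weight shards never need to fit in RAM
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Example path only; compare the printed digest with the published checksum
print(sha256_of("./deepseek-llm-7b-base/model.safetensors"))
```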