Multi-GPU deployment of Qwen2.5-72B-Instruct-GPTQ-Int4 with vLLM

Results with two V100 32G GPUs: inference time 16 s.

3 GPUs, tensor_parallel_size=3: the model's number of attention heads must be evenly divisible by the tensor parallel size, which rules out 3 for this model.
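
The divisibility constraint can be checked before launching anything. This is a minimal sketch, assuming 64 attention heads for Qwen2.5-72B (the `num_attention_heads` value in the model's config.json; verify it against your own checkpoint). `valid_tp_sizes` is a hypothetical helper written for illustration, not part of vLLM.

```python
def valid_tp_sizes(num_attention_heads: int, num_gpus: int) -> list[int]:
    """Return every tensor_parallel_size <= num_gpus that divides the head count evenly."""
    return [tp for tp in range(1, num_gpus + 1) if num_attention_heads % tp == 0]


# Assumption: Qwen2.5-72B has 64 attention heads (num_attention_heads in config.json).
NUM_HEADS = 64

# 3 is absent from the result because 64 % 3 != 0.
print(valid_tp_sizes(NUM_HEADS, 4))
```

With 64 heads, a machine with 3 GPUs can only use 1 or 2 of them for tensor parallelism; the third card stays idle unless a fourth is added.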

4 GPUs, tensor_parallel_size=4: inference time 4 s.
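
The post does not show the launch command it used, so the following is only a sketch of how a 4-GPU server start could be assembled. The Hub ID `Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4` and the `--dtype half` choice (V100 has no bfloat16 support) are assumptions, and `build_vllm_command` is a hypothetical helper.

```python
import shlex


def build_vllm_command(model: str, tp_size: int) -> list[str]:
    """Assemble a `vllm serve` argument list for a tensor-parallel launch."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tp_size),
        # Assumption: float16, since V100 does not support bfloat16.
        "--dtype", "half",
    ]


# Assumption: the model is pulled from the Hugging Face Hub under this ID.
cmd = build_vllm_command("Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4", 4)
print(shlex.join(cmd))
```

Building the arguments as a list keeps the command easy to pass to `subprocess.run` without shell quoting problems.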
