A Diary of Heartache: Cluster Deployment
Deploying the server cluster eventually became impossible to finish: on many machines the keyboard and mouse got no response when plugged in, and on some the boot screen flashed past so fast I couldn't even figure out how to enter the BIOS. I built the cluster in virtual machines first and installed the OS on a spare server next to me, then went to the machine room full of hope, and, well, ha. Anyway, let me write down what I still remember, and hopefully I'll never need it again, unless one day I DIY a machine of my own. Many of the important diagnostic commands and permission fixes I no longer remember clearly; I collected them from various places, cleaned them up with GPT, worked out what each step actually does, and kept trying. I'm posting this online where nobody knows me anyway, and I'm not trying to teach anyone, since even a tiny problem can take half a day of searching. The pitfalls are countless; only those who have been through it understand.
CPU model | Cores | RAM | Motherboard | HDD | SSD | GPU | Kernel version |
---|---|---|---|---|---|---|---|
AMD | 64 | | | | | | |
Intel Xeon Gold 6242 CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 12 | Intel S2600WFT | 220G | | | 3.10.0-1160.119.1.el7 |
Intel Xeon Gold 6242 CPU @ 2.80GHz | 64 | SK Hynix DDR4 16G 2666MHz ??? | Powerleader PSC621DI | WDC WD2000FYYZ-0 1.8T | Hitachi HUA72201 931.5G | | 3.10.0-1160.119.1.el7 |
Intel Xeon 6154 CPU @ 3.00GHz | 36 | Micron DDR4 16GB 2666MHz * 12 | Intel S2600WFT | 560G | | | 3.10.0-862.el7 |
Intel Xeon 6154 CPU @ 3.00GHz | 36 | Micron DDR4 16GB 2666MHz * 12 | Intel S2600WFT | 2.2T | 1.8T | | 3.10.0-862.el7 |
Intel Xeon Silver 4210R CPU @ 2.40GHz | 20 | Samsung DDR4 32G 2666MHz * 12 | Supermicro X11DPG-QT | ST4000NM024B-2TF 3.7T | 1.8T | RTX3080 12G * 4 | 3.10.0-1160.el7 |
Intel Xeon Platinum 8171M CPU @ 2.60GHz | 64 | Samsung DDR4 16G 2666MHz * 20 | Supermicro X11DPi-N | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | RTX3090 24G | 4.18.0-305.12.1.el8_4 |
Intel Xeon Platinum 8375C CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 16 | Supermicro X12DPi-N6 | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | | 4.18.0-305.12.1.el8_4 |
Intel Xeon Platinum 8375C CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 16 | Supermicro X12DPi-N6 | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | | 4.18.0-408.el8 |
For the Linux OS install I followed 实机安装CentOS7.9操作系统图文(保姆级)教程_centos7.9安装教程详解-CSDN博客.
1. Prepare the boot USB
The first time I made the installer with Ventoy it errored partway through and the drive could no longer be recognized; I thought I had ruined it, but it had simply become a system disk, and the BIOS could still see it. Rufus then wrote the CentOS 7.9 image successfully. Older releases live on the CentOS vault; I downloaded from the official site (Index of /7.9.2009/isos/x86_64) over a proxy. For some reason the Aliyun mirror was actually slower.
2. Install the operating system
Plug in the installer USB and reboot; most servers briefly show how to enter the BIOS during POST, usually Del/F11/F12.
3. Disk partitioning, software selection (check every GNOME Desktop option), network configuration
/boot 1 GB
/boot/efi 500 MB
swap 50 GB
/ # can be given a very large value; whatever is left on the disk goes to the root partition
Choose UEFI for the install; most recent motherboards support it, and the disk itself hardly matters. In the boot menu, change the boot order, otherwise the machine reboots into the old system by default. Formatting the disk first is probably a good idea (I didn't). I'm still fuzzy on parts of this; in short, pick whichever boot entry corresponds to your USB stick.
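If you are unsure whether a machine actually booted in UEFI mode (relevant for the /boot/efi partition above), the kernel tells you directly: a UEFI boot exposes /sys/firmware/efi. A small sketch:

```shell
# Report whether the running system was booted via UEFI or legacy BIOS.
# On UEFI boots the kernel exposes the /sys/firmware/efi directory.
boot_mode() {
  if [ -d /sys/firmware/efi ]; then
    echo "UEFI"
  else
    echo "Legacy BIOS"
  fi
}
boot_mode
```

Run it on the freshly installed system; if it prints "Legacy BIOS" even though you installed in UEFI mode, the boot-order entry you picked was the wrong one.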
Cluster synchronization
1. Configure passwordless SSH login
Edit /etc/hosts to map IPs to hostnames.
10.10.69.91 node1
10.10.69.92 node2
Generate an RSA key pair and push the public key to every node; the key files live under ~/.ssh by default.
ssh-keygen #skip the other options and just press Enter through the prompts
ssh-copy-id root@node1 #enter the password the first time; if the next login needs no password, it worked
# Permissions must be right for this to work. I had changed the permissions of some files under /root and debugged for ages without finding it: besides chmod 700 ~/.ssh you also need chmod 700 /root
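Since the permission trap above cost me so much time: sshd silently ignores authorized_keys when the home directory, ~/.ssh, or the key files are too permissive. A small helper that resets everything to what sshd expects (a sketch; `$1` is the home directory of the target user):

```shell
# Reset the permissions sshd demands before it honors authorized_keys.
# sshd also refuses keys when the home directory itself is group/world
# writable -- which is why `chmod 700 /root` mattered in my case.
fix_ssh_perms() {
  local home="$1"
  chmod 700 "$home"                                   # home dir: owner only
  chmod 700 "$home/.ssh"                              # .ssh dir: owner only
  chmod 600 "$home/.ssh/authorized_keys" 2>/dev/null  # keys: owner read/write
  chmod 600 "$home/.ssh/id_rsa" 2>/dev/null
  chmod 644 "$home/.ssh/id_rsa.pub" 2>/dev/null
  return 0
}
```

Usage: `fix_ssh_perms /root` (or `fix_ssh_perms /public/home/alice` for a normal user) on every node where key login mysteriously still asks for a password.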
# Batch version (waitfor, uncommentfile, $REMOTE_CP, $REMOTE_SH, $HOSTSFILE, $NODE_PREFIX and $OUTPUT are helpers and variables defined elsewhere in the full script)
setup_ssh()
(
if [ $# -ne 1 ] ; then
echo "Usage: setup_ssh username"
exit -1
fi
username="$1"
waitfor "Now set up the SSH for User: $username" 5 0.2
dir=$(awk -F: -v name="$username" '($1==name) {print $6}' /etc/passwd)
mkdir -p "$dir/.ssh"
if [ ! -f "$dir/.ssh/id_rsa.pub" ]; then
sudo -u "$username" ssh-keygen -t rsa -N "" -f "$dir/.ssh/id_rsa"
fi
cp -p "$dir/.ssh/id_rsa.pub" "$dir/.ssh/authorized_keys"
echo "StrictHostKeyChecking no" > "$dir/.ssh/config"
chown "$username":"$username" "$dir/.ssh/config"
echo -n "Is $dir a shared path? [y/N]: "
read -r IsSharedDir
IsSharedDir=${IsSharedDir:-n}
if [[ "$IsSharedDir" != "y" && "$IsSharedDir" != "Y" ]]; then
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
for node in $nodelist; do
echo "========${node}============="
$REMOTE_CP -rp "$dir/.ssh" "$node":"$dir/" >> "$OUTPUT"
# distribute /etc/hosts to the other nodes
echo "Copying /etc/hosts to ${node}..."
$REMOTE_CP -p /etc/hosts "$node:/etc/hosts" >> "$OUTPUT"
done
fi
)
2. Set up clock synchronization
The head node pushes its time to every other node. After a shutdown or reboot you must synchronize again, otherwise the Slurm queue and other services misbehave.
sync_time()
(
if [ "$(whoami)" != root ]; then
echo "Error: You must log in as root"
exit -10
fi
waitfor "Now Synchronize time on the Whole cluster" 5 0.2
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
current_time=$(date)
for node in $nodelist; do
$REMOTE_SH "$node" "date -s \"$current_time\" &"
done
echo ""
if [ "$IS_SUSE" == 0 ]; then
for node in $nodelist; do
$REMOTE_SH "$node" "clock -w"
done
fi
)
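Pushing `date -s` from the head node works but drifts again over time and after reboots. A possibly more robust alternative (which I did not use here, so treat it as a sketch; package and service names are the stock CentOS ones) is chrony, with the head node serving time to the cluster subnet:

```shell
# /etc/chrony.conf on the head node (node1): serve time to the cluster subnet
allow 10.10.69.0/24
local stratum 10      # keep serving even with no upstream internet NTP source

# /etc/chrony.conf on each compute node: follow the head node
server node1 iburst
```

Then `systemctl enable --now chronyd` on every node (e.g. via the `sync_do` helper below), and the clocks stay synchronized across reboots without re-running `sync_time`.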
3. Disable the firewall and related services
setup_service()
(
if [ "$(whoami)" != root ]; then
echo "Error: You must log in as root"
exit -10
fi
waitfor "Now set up the Initial Services" 5 0.2
if [ "$IS_SUSE" == 1 ]; then
cat <<EOF > /tmp/setup_service.sh
/sbin/chkconfig --level 35 nfsserver on
/sbin/chkconfig --level 35 SuSEfirewall2_init off
/sbin/chkconfig --level 35 SuSEfirewall2_setup off
EOF
if ! grep -q backspace /etc/vimrc >> "$OUTPUT"; then
echo 'set backspace=indent,eol,start' >> /etc/vimrc
fi
sync_file /etc/vimrc
else
cat <<EOF > /tmp/setup_service.sh
# enable the NFS service
systemctl enable nfs-server
systemctl start nfs-server
# disable the firewall (and iptables/ip6tables if needed)
systemctl stop firewalld
systemctl disable firewalld
#systemctl disable iptables
#systemctl disable ip6tables
# disable the Sendmail service
#systemctl disable sendmail
EOF
sed -e 's/^SELINUX=enforcing/SELINUX=disabled/' -e 's/^SELINUX=permissive/SELINUX=disabled/' /etc/selinux/config > /tmp/selinux.config  # "disabled" is the valid value; "disable" is not
cp /tmp/selinux.config /etc/selinux/config
sync_file /etc/selinux/config
fi
chmod +x /tmp/setup_service.sh
sync_file /tmp/setup_service.sh
sync_do /tmp/setup_service.sh
)
4. NFS mounts. NFS sharing means the head node exports a directory and each compute node chooses whether to mount it; nothing is copied, and writes land directly in the shared directory.
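Concretely, the script below ends up writing lines like the following on the server and clients (node1 and /public are example names matching my layout; adjust to yours):

```shell
# /etc/exports on the NFS server (head node): export /public to everyone
/public *(rw,no_root_squash,async)

# /etc/fstab on each compute node: mount it at the same path
node1:/public  /public  nfs  defaults,_netdev  0 0
```

After editing, `exportfs -ra` (or `systemctl restart nfs-server`) on the server and `mount -a` on the clients apply the changes; `_netdev` makes the client wait for the network before mounting at boot.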
setup_nfs()
{
# clean up leftover temp files
rm -f /tmp/exports.*
rm -f /tmp/nfs.local.*
# read the NFS server hosts, export directories and mount points from the config file
NfsHostList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $2}'))
NfsDirList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $3}'))
MountPointList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $4}'))
NfsDirNum=$(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $3}' | wc -l)
# check that there is at least one NFS directory to configure
if [ "$NfsDirNum" -ge 1 ]; then
IsNfsHost=y
else
echo "No NFS directories to configure. Exiting."
exit -1
fi
# user confirmation
echo "Confirm your $NFSCONF is correct!!"
uncommentfile "$NFSCONF"
echo -n "Do you want to continue? [y/n]: "
read -r IsConfirm
IsConfirm=${IsConfirm:-y}
if [[ "$IsConfirm" == "n" || "$IsConfirm" == "N" ]]; then
echo "Exiting as per user input."
exit -1
fi
# configure each NFS server
for i in $(seq 0 $((NfsDirNum-1))); do
NfsHost=${NfsHostList[$i]}
NfsDir=${NfsDirList[$i]}
MountPoint=${MountPointList[$i]}
NfsHostSuf=$(echo "$NfsHost" | grep -o "[0-9]*$" || echo "default")
echo "Configuring NFS server: $NfsHost, Directory: $NfsDir, MountPoint: $MountPoint"
# make sure nfs-utils is installed on the server
# $REMOTE_SH "$NfsHost" "yum install -y nfs-utils"
# create the shared directory and add it to /etc/exports
$REMOTE_SH "$NfsHost" "mkdir -p $NfsDir"
$REMOTE_SH "$NfsHost" "grep -q '$NfsDir ' /etc/exports || echo '$NfsDir *(rw,no_root_squash,async)' >> /etc/exports"
# restart the NFS service (and optionally enable it at boot)
$REMOTE_SH "$NfsHost" "systemctl restart nfs-server"
# $REMOTE_SH "$NfsHost" "systemctl enable nfs-server"
# configure the clients
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
for node in $nodelist; do
if [ "$node" != "$NfsHost" ]; then
echo "Configuring client: $node for mount $MountPoint"
# create the mount point on the client
$REMOTE_SH "$node" "mkdir -p $MountPoint"
# add the mount to /etc/fstab
$REMOTE_SH "$node" "grep -q '$NfsHost:$NfsDir $MountPoint' /etc/fstab || echo '$NfsHost:$NfsDir $MountPoint nfs defaults,_netdev 0 0' >> /etc/fstab"
# mount it now
$REMOTE_SH "$node" "mount -a"
fi
done
done
# configure BINDHOME if set and the target is not /home
BindedHome=$(uncommentfile "$NFSCONF" | awk '/BINDHOME/ {print $2}')
if [ -n "$BindedHome" ] && [ "$BindedHome" != "/home" ]; then
echo "Configuring BINDHOME to bind $BindedHome to /home on all nodes."
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
for node in $nodelist; do
$REMOTE_SH "$node" "mkdir -p $BindedHome"
$REMOTE_SH "$node" "grep -q '$BindedHome /home' /etc/fstab || echo '$BindedHome /home none bind 0 0' >> /etc/fstab"  # grep must match the line actually appended, or re-runs add duplicates
$REMOTE_SH "$node" "mount -a"
done
fi
}
5. Synchronizing directories
# file synchronization via scp
sync_file()
(
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
for node in $nodelist; do
echo "========$node========="
for i in "$@"; do
baseDir=$(dirname "$i")
baseDir=$(cd "$baseDir" && pwd)
if [ -f "$i" ]; then
$REMOTE_CP -p "$i" "$node":"$baseDir" >> "$OUTPUT"
elif [ -d "$i" ]; then
$REMOTE_CP -rp "$i" "$node":"$baseDir" >> "$OUTPUT"
else
echo "Error: the file or path $i does not exist!!!"
exit -1
fi
done
done
exit 0
)
6. Syncing /root and users; adding and removing users
sync_user()
(
if [ "$(whoami)" != root ]; then
echo "Error: You must log in as root"
exit -10
fi
waitfor "Now Synchronize users and groups on the Whole cluster" 5 0.2
if [ "$IS_SUSE" == 1 ]; then
sync_file /etc/passwd /etc/group /etc/shadow
else
sync_file /etc/passwd /etc/group /etc/shadow /etc/gshadow
fi
)
adduser_cluster()
{
if [ "$(whoami)" != root ]; then
echo "Error: You must log in as root"
exit -10
fi
username="$1"
if awk -F: '{print $1}' /etc/passwd | grep -q "^$username$" >> "$OUTPUT"; then
echo "Error: user: $username already exists!! "
exit 12
fi
waitfor "Now add user: $username on the Whole cluster" 5 0.2
echo -n "Input the Home directory for user:$username [default: /public/home/${username}]: "
read -r HomePath
HomePath=${HomePath:-/public/home/${username}}
if [[ "${HomePath:0:1}" != "/" ]]; then
echo "Error: You must enter an absolute path !!"
exit -1
fi
echo -n "Input the Group Name for user:$username [default: users]: "
read -r GroupName
GroupName=${GroupName:-users}
if [ "$IS_SUSE" == 1 ]; then
useradd -m -d "$HomePath" -g "$GroupName" "$username"
else
useradd -d "$HomePath" -g "$GroupName" "$username"
fi
passwd "$username"
sync_user
echo "Added user $username on the whole cluster successfully!"
echo -n "Do you want to set up SSH without password for user:$username [y/N]? "
read -r IsSsh
IsSsh=${IsSsh:-n}
if [[ "$IsSsh" == "y" || "$IsSsh" == "Y" ]]; then
setup_ssh "$username"
fi
}
deluser_cluster()
(
if [ "$(whoami)" != root ]; then
echo "Error: You must log in as root"
exit -10
fi
username="$1"
if ! awk -F: '{print $1}' /etc/passwd | grep -q "^$username$" >> "$OUTPUT"; then
echo "Error: user: $username does not exist!! "
exit 12
fi
waitfor "Now delete user: $username on the Whole cluster" 5 0.2
userdel "$username"
dir=$(awk -F: -v name="$username" '($1==name) {print $6}' /etc/passwd)
echo -n "Do you want to delete Home path: $dir for $username [y/N]: "
read -r IsDel
IsDel=${IsDel:-n}
if [[ "$IsDel" == "y" || "$IsDel" == "Y" ]]; then
rm -rf "$dir"
fi
sync_user
echo "Deleted user $username on the whole cluster successfully!"
)
7. Run a command on every node
sync_do()
(
waitfor "Now execute the commands: $* on the whole cluster" 5 0.2
echo "Executing Command '$*' On the whole Cluster"
nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
for node in $nodelist; do
echo "=======$node====="
$REMOTE_SH "$node" "$*"
done
)
Slurm cluster setup
I had previously installed a single-node Torque queue, but these servers all have different core counts (20, 32, 36, 52, 64…), and when requesting compute resources I needed to grab one node together with all of that node's cores. I tried a lot of commands and read the Torque docs from 6.1.2 up to the latest release and genuinely found no way to express this: either only one core was allocated, or the job simply wouldn't run. Slurm has this feature.
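For the record, the allocation Torque wouldn't give me is two directives in Slurm: `--nodes=1 --exclusive` grabs exactly one node and every core on it, whatever that node's core count happens to be. A sketch job script (partition name and the program are placeholders):

```shell
#!/bin/bash
#SBATCH --job-name=whole-node
#SBATCH --nodes=1        # exactly one node
#SBATCH --exclusive      # ...and all of its cores, whether it has 20 or 64
srun ./my_program        # placeholder executable
```

The interactive equivalent is `salloc -N1 --exclusive`.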
1. Deploy Slurm on every compute node
2. Configure the munge key
After setting up the munge key you have to restart; no matter how I tried to verify it, it kept failing, until I finally discovered a restart was needed. Sometimes it really is that kind of problem, and none of the tutorials mentioned it.
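For reference, the key itself is just 1024 random bytes readable only by its owner; the sketch below mirrors what the create-munge-key tool does. In a real cluster the file must end up as /etc/munge/munge.key, owned munge:munge, identical on every node, and munged must be restarted everywhere afterwards (the restart being exactly the trap I fell into):

```shell
# Generate a munge key: 1024 random bytes, owner-read-only.
# $1 is the output path (in production: /etc/munge/munge.key).
create_munge_key() {
  keyfile="$1"
  dd if=/dev/urandom of="$keyfile" bs=1 count=1024 2>/dev/null
  chmod 400 "$keyfile"
}
```

With the helpers from earlier, distribution is roughly `sync_file /etc/munge/munge.key` followed by `sync_do "systemctl restart munge"`, then verify with `munge -n | ssh node2 unmunge`.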
3. Edit /etc/slurm/slurm.conf
Trying to run the CentOS 8.5 (pre-existing) and 7.9 Slurm installs in parallel did not work.
Deploying Slurm on CentOS 7.9
Following Slurm在centos7单机上的安装经验 - 计算机使用与Linux交流 (Computer Usage and Linux) - 计算化学公社, the install succeeded on CentOS 7.9; it failed on both 7.6 and 8.5.
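Since the whole point here was nodes with different core counts, the relevant part of slurm.conf is one NodeName line per machine. A minimal sketch (cluster name, hostnames, CPU counts and partition name are placeholders; the file must be identical on every node, so sync it with sync_file):

```shell
# Minimal /etc/slurm/slurm.conf sketch -- adjust names and counts to your nodes
ClusterName=mycluster
SlurmctldHost=node1
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
# one NodeName line per machine, since the core counts differ
NodeName=node1 CPUs=64 State=UNKNOWN
NodeName=node2 CPUs=36 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
```

After syncing, restart slurmctld on the head node and slurmd on the compute nodes, then check `sinfo` shows every node as idle.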
Compiling software
I installed just about everything: VASP, Gaussian, GROMACS, CP2K, deepmd_v3… plus many small tools. The worst part was hitting toolchain versions that were too old. Upgrading gcc to 13 fixed some missing-library errors (I forget the names) and let builds run to completion, but sometimes the real requirement is on glibc, even glibc > 2.28. glibc is an extremely low-level Linux library, very close to the kernel, and recompiling it can easily wreck the system. Installing VASP is much simpler these days; back when my advisor walked me through it I was completely lost with all the patching and the like. The Intel oneAPI Base and HPC toolkits are also far more convenient than the 2018 suite, which additionally needed a license. Note that on CentOS 7.9 the newest version that seems to work is 21.12, which the Intel site no longer offers. Comparing them, I noticed no speed difference between the 2018 compilers and 21.12, but the newer one is easier to build with.