Bootstrap

Bitter Diary: Cluster Deployment

Server cluster deployment. In the end I really could not push it any further: on many machines the keyboard and mouse got no response when plugged in, and on some the boot screen flashed by so fast I never figured out how to get into the BIOS. I built a cluster in virtual machines, installed the system on a spare server next to me, and went to the machine room full of hope; well, you can guess how that ended. Still, I am writing down what I can remember now, hoping I will never need it again, unless one day I DIY a machine of my own. I no longer remember many of the important diagnostic commands and the permission problems clearly; I collected them from various places, revised them with GPT, worked out what each step actually does, and kept trying. I am posting this online anyway since nobody here knows me, and I am not trying to teach anyone: even a tiny problem can take half a day of searching, the pitfalls are endless, and only those who have been through it understand.

Architecture: AMD64 (x86_64)

| CPU model | Cores | Memory | Motherboard | HDD | SSD | GPU | Kernel version |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Intel Xeon Gold 6242 CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 12 | Intel S2600WFT | | 220G | | 3.10.0-1160.119.1.el7 |
| Intel Xeon Gold 6242 CPU @ 2.80GHz | 64 | SK Hynix DDR4 16G 2666MHz ??? | Powerleader PSC621DI | WDC WD2000FYYZ-0 1.8T | Hitachi HUA72201 931.5G | | 3.10.0-1160.119.1.el7 |
| Intel Xeon 6154 CPU @ 3.00GHz | 36 | Micron DDR4 16GB 2666MHz * 12 | Intel S2600WFT | | 560G | | 3.10.0-862.el7 |
| Intel Xeon 6154 CPU @ 3.00GHz | 36 | Micron DDR4 16GB 2666MHz * 12 | Intel S2600WFT | 2.2T | 1.8T | | 3.10.0-862.el7 |
| Intel Xeon Silver 4210R CPU @ 2.40GHz | 20 | Samsung DDR4 32G 2666MHz * 12 | Supermicro X11DPG-QT | ST4000NM024B-2TF 3.7T | 1.8T | RTX3080 12G * 4 | 3.10.0-1160.el7 |
| Intel Xeon Platinum 8171M CPU @ 2.60GHz | 64 | Samsung DDR4 16G 2666MHz * 20 | Supermicro X11DPi-N | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | RTX3090 24G | 4.18.0-305.12.1.el8_4 |
| Intel Xeon Platinum 8375C CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 16 | Supermicro X12DPi-N6 | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | | 4.18.0-305.12.1.el8_4 |
| Intel Xeon Platinum 8375C CPU @ 3.00GHz | 64 | Samsung DDR4 16G 2666MHz * 16 | Supermicro X12DPi-N6 | ST4000NM000A-2HZ 3.7T | Samsung SSD 980 500GB | | 4.18.0-408.el8 |

For the Linux OS installation I followed 实机安装CentOS7.9操作系统图文(保姆级)教程_centos7.9安装教程详解-CSDN博客.

1. Prepare the boot drive

The first time I made the installer drive with Ventoy it errored out halfway and the drive could no longer be recognized; I thought I had ruined it, but it had actually just been turned into a boot drive and the BIOS could still see it. I then used Rufus and wrote the CentOS 7.9 image successfully. Older releases are on the CentOS vault; I downloaded from the official Index of /7.9.2009/isos/x86_64 over a proxy, and for some reason the Aliyun mirror was actually much slower.

2. Install the operating system

Plug in the installer drive and reboot. Most servers show a prompt during startup telling you how to enter the BIOS, usually Del / F11 / F12.

3. Disk partitioning, software selection (check every option under GNOME Desktop), network configuration

/boot 1 GB

/boot/efi 500 MB

swap 50 GB

/ # can be set to a very large value; the installer gives the rest of the disk to the root partition

Choose UEFI when installing; most recent motherboards support it and the choice of disk matters little. In the boot menu, change the boot order, otherwise the machine will load the old system by default after a reboot. It is probably best to wipe the disks first, though I did not. There are still plenty of details here I am not clear about; in short, just pick the boot entry that corresponds to your own USB drive. The commands below can confirm the boot mode afterwards.
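
To confirm the boot mode from inside Linux and adjust the boot order, a couple of standard commands help (the entry numbers in the last line are only an example; they differ per machine):

# If this directory exists, the system booted in UEFI mode
ls /sys/firmware/efi

# List the UEFI boot entries, then reorder them (UEFI boots only)
efibootmgr -v
efibootmgr -o 0002,0001,0000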

Cluster synchronization

1. Configure passwordless SSH login

Edit /etc/hosts and add the IP-to-hostname mappings:

10.10.69.91 node1

10.10.69.92 node2

Generate an RSA key pair and distribute the public key to every node; the key files live in ~/.ssh by default.

ssh-keygen                # other options can be omitted; just press Enter through every prompt
ssh-copy-id root@node1    # enter the password the first time; if the second login needs no password, the setup succeeded
# Permissions must be correct. I had changed the permissions of some files under /root and could not find the cause for ages:
# besides chmod 700 ~/.ssh you also need chmod 700 /root
# Batch version; waitfor, uncommentfile, HOSTSFILE, NODE_PREFIX, REMOTE_CP and OUTPUT are helpers/variables defined elsewhere in the full script
setup_ssh()
(
    if [ $# -ne 1 ] ; then
        echo "Usage: setup_ssh username"
        exit -1
    fi
    username="$1"

    waitfor "Now set up the SSH for User: $username" 5 0.2

    dir=$(awk -F: -v name="$username" '($1==name) {print $6}' /etc/passwd)
    mkdir -p "$dir/.ssh"
    chmod 700 "$dir/.ssh"    # .ssh must be 700 (see the permission note above)
    if [ ! -f "$dir/.ssh/id_rsa.pub" ]; then
        sudo -u "$username" ssh-keygen -t rsa -N "" -f "$dir/.ssh/id_rsa"
    fi
    cp -p "$dir/.ssh/id_rsa.pub" "$dir/.ssh/authorized_keys"
    echo "StrictHostKeyChecking no" > "$dir/.ssh/config"
    chown "$username":"$username" "$dir/.ssh/config"

    echo -n "Is $dir a shared path? [y/N]: "
    read -r IsSharedDir
    IsSharedDir=${IsSharedDir:-n}

    if [[ "$IsSharedDir" != "y" && "$IsSharedDir" != "Y" ]]; then
        nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
        for node in $nodelist; do
            echo "========${node}============="
            $REMOTE_CP -rp "$dir/.ssh" "$node":"$dir/" >> "$OUTPUT"

            # Distribute /etc/hosts to the other nodes
            echo "Copying /etc/hosts to ${node}..."
            $REMOTE_CP -p /etc/hosts "$node:/etc/hosts" >> "$OUTPUT"
        done
    fi
)
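
Usage is simply the user name; a quick way to verify is to run a remote command as that user (the user name here is just an example):

setup_ssh testuser                     # generate the key and push ~/.ssh to every node
sudo -u testuser ssh node2 hostname    # should print node2 without asking for a password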

2. Set up clock synchronization

Clock synchronization is pushed from the head node to the other nodes. It has to be redone after a shutdown or reboot, otherwise the Slurm queue and other services misbehave. A chrony-based alternative is sketched after the script below.

sync_time()
(
    if [ "$(whoami)" != root ]; then
        echo "Error: You must log in as root"
        exit -10
    fi

    waitfor "Now Synchronize time on the Whole cluster" 5 0.2

    nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
    current_time=$(date)
    for node in $nodelist; do
        $REMOTE_SH "$node" "date -s \"$current_time\" &"
    done
    echo ""
    if [ "$IS_SUSE" == 0 ]; then
        for node in $nodelist; do
            $REMOTE_SH "$node" "clock -w"
        done
    fi
)
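
sync_time only pushes "date -s" to every node, so the clocks drift apart again over time or after a reboot. A longer-term option on CentOS is chrony, with the head node serving time to the rest; this is only a sketch, and node1 and 10.10.69.0/24 are taken from the /etc/hosts example above:

# On node1, add to /etc/chrony.conf:
#     local stratum 10
#     allow 10.10.69.0/24
# On each compute node, add to /etc/chrony.conf:
#     server node1 iburst
systemctl enable --now chronyd
chronyc sources    # on a compute node, node1 should appear as a reachable source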

3. Disable the firewall and related services

setup_service()
(
    if [ "$(whoami)" != root ]; then
        echo "Error: You must log in as root"
        exit -10
    fi

    waitfor "Now set up the Initial Services" 5 0.2

    if [ "$IS_SUSE" == 1 ]; then
        cat <<EOF > /tmp/setup_service.sh
/sbin/chkconfig --level 35 nfsserver on
/sbin/chkconfig --level 35 SuSEfirewall2_init off
/sbin/chkconfig --level 35 SuSEfirewall2_setup off
EOF
        if ! grep -q backspace /etc/vimrc >> "$OUTPUT"; then
            echo 'set backspace=indent,eol,start' >> /etc/vimrc
        fi
        sync_file /etc/vimrc
    else
        cat <<EOF > /tmp/setup_service.sh
# Enable the NFS service
systemctl enable nfs-server
systemctl start nfs-server
# Stop and disable the firewall (and iptables / ip6tables if needed)
systemctl stop firewalld
systemctl disable firewalld
#systemctl disable iptables
#systemctl disable ip6tables
# Disable the Sendmail service
#systemctl disable sendmail
EOF

        # Disable SELinux: the valid value in /etc/selinux/config is "disabled" (not "disable"), and it only takes effect after a reboot
        sed -e 's/^SELINUX=enforcing/SELINUX=disabled/' -e 's/^SELINUX=permissive/SELINUX=disabled/' /etc/selinux/config > /tmp/selinux.config
        cp /tmp/selinux.config /etc/selinux/config
        sync_file /etc/selinux/config
    fi
    chmod +x /tmp/setup_service.sh
    sync_file /tmp/setup_service.sh
    sync_do /tmp/setup_service.sh
)
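
After the script has run, the result can be checked on any node; note that the SELinux change in /etc/selinux/config only takes effect after a reboot, while setenforce 0 switches to permissive mode immediately:

systemctl is-active firewalld    # expect "inactive"
getenforce                       # "Disabled" after a reboot; use setenforce 0 in the meantime
sestatus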

4. Mount files over NFS. The head node exports a directory and each compute node chooses whether to mount it; nothing is copied, and modifications are written directly into the shared directory. An example of the NFSCONF layout the script expects is given after it.

setup_nfs()
{
    # Clean up leftover temporary files
    rm -f /tmp/exports.*
    rm -f /tmp/nfs.local.*

    # Read the NFS server hosts, exported directories and mount points from the config file
    NfsHostList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $2}'))
    NfsDirList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $3}'))
    MountPointList=($(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $4}'))
    NfsDirNum=$(uncommentfile "$NFSCONF" | awk '/NFSDIR/ {print $3}' | wc -l)

    # Check whether there are any NFS directories to configure
    if [ "$NfsDirNum" -ge 1 ]; then
        IsNfsHost=y
    else
        echo "No NFS directories to configure. Exiting."
        exit -1
    fi

    # Ask the user to confirm
    echo "Confirm your $NFSCONF is correct!!"
    uncommentfile "$NFSCONF"
    echo -n "Do you want to continue? [y/n]: "
    read -r IsConfirm
    IsConfirm=${IsConfirm:-y}
    if [[ "$IsConfirm" == "n" || "$IsConfirm" == "N" ]]; then
        echo "Exiting as per user input."
        exit -1
    fi

    # Configure each NFS server
    for i in $(seq 0 $((NfsDirNum-1))); do
        NfsHost=${NfsHostList[$i]}
        NfsDir=${NfsDirList[$i]}
        MountPoint=${MountPointList[$i]}
        NfsHostSuf=$(echo "$NfsHost" | grep -o "[0-9]*$" || echo "default")

        echo "Configuring NFS server: $NfsHost, Directory: $NfsDir, MountPoint: $MountPoint"

        # Make sure nfs-utils is installed on the server
        # $REMOTE_SH "$NfsHost" "yum install -y nfs-utils"

        # Create the shared directory and configure /etc/exports
        $REMOTE_SH "$NfsHost" "mkdir -p $NfsDir"
        $REMOTE_SH "$NfsHost" "grep -q '$NfsDir ' /etc/exports || echo '$NfsDir *(rw,no_root_squash,async)' >> /etc/exports"

        # Restart the NFS service and enable it at boot
        $REMOTE_SH "$NfsHost" "systemctl restart nfs-server"
      #  $REMOTE_SH "$NfsHost" "systemctl enable nfs-server"

        # Configure the clients
        nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
        for node in $nodelist; do
            if [ "$node" != "$NfsHost" ]; then
                echo "Configuring client: $node for mount $MountPoint"
                
                # Create the mount point on the client
                $REMOTE_SH "$node" "mkdir -p $MountPoint"

                # Add the mount to /etc/fstab
                $REMOTE_SH "$node" "grep -q '$NfsHost:$NfsDir $MountPoint' /etc/fstab || echo '$NfsHost:$NfsDir $MountPoint nfs defaults,_netdev 0 0' >> /etc/fstab"

                # Mount it now
                $REMOTE_SH "$node" "mount -a"
            fi
        done
    done

    # Configure BINDHOME if it is set and the target is not /home
    BindedHome=$(uncommentfile "$NFSCONF" | awk '/BINDHOME/ {print $2}')
    if [ -n "$BindedHome" ] && [ "$BindedHome" != "/home" ]; then
        echo "Configuring BINDHOME to bind $BindedHome to /home on all nodes."
        nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
        for node in $nodelist; do
            $REMOTE_SH "$node" "mkdir -p $BindedHome"
            $REMOTE_SH "$node" "grep -q 'mount --bind $BindedHome /home' /etc/fstab || echo '$BindedHome /home none bind 0 0' >> /etc/fstab"
            $REMOTE_SH "$node" "mount -a"
        done
    fi
}
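
For reference, the NFSCONF file parsed by setup_nfs is assumed to look roughly like the following; the column order (keyword, server host, exported directory, mount point) is read off the awk fields above, and the host names and paths are only examples:

# keyword   server   exported dir     mount point on the clients
NFSDIR      node1    /public          /public
NFSDIR      node1    /opt/software    /opt/software
# optional: bind this directory onto /home on every node
BINDHOME    /public/home

A quick check from a client: showmount -e node1 should list the exports, and df -h /public should show the NFS mount.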

5. Synchronize directories

# File synchronization using scp
sync_file()
(
    nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
    for node in $nodelist; do
        echo "========$node========="
        for i in "$@"; do
            baseDir=$(dirname "$i")
            baseDir=$(cd "$baseDir" && pwd)
            if [ -f "$i" ]; then
                $REMOTE_CP -p "$i" "$node":"$baseDir" >> "$OUTPUT"
            elif [ -d "$i" ]; then
                $REMOTE_CP -rp "$i" "$node":"$baseDir" >> "$OUTPUT"
            else
                echo "Error: the file or path $i does not exist!!!"
                exit -1
            fi
        done
    done
    exit 0
)
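
Usage is a list of files or directories, each copied to the same absolute path on every node (the paths are only examples):

sync_file /etc/hosts /etc/profile
sync_file /opt/software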

6. Synchronize /root and user accounts; add and delete users

sync_user()
(
    if [ "$(whoami)" != root ]; then
        echo "Error: You must log in as root"
        exit -10
    fi

    waitfor "Now Synchronize users and groups on the Whole cluster" 5 0.2
    if [ "$IS_SUSE" == 1 ]; then
        sync_file /etc/passwd /etc/group /etc/shadow
    else
        sync_file /etc/passwd /etc/group /etc/shadow /etc/gshadow
    fi
)

adduser_cluster()
{
    if [ "$(whoami)" != root ]; then
        echo "Error: You must log in as root"
        exit -10
    fi
    username="$1"
    if awk -F: '{print $1}' /etc/passwd | grep -q "^$username$" >> "$OUTPUT"; then
        echo "Error: user: $username already exists!! "
        exit 12
    fi
    waitfor "Now add user: $username on the Whole cluster" 5 0.2
    echo -n "Input the Home directory for user:$username [default: /public/home/${username}]: "
    read -r HomePath
    HomePath=${HomePath:-/public/home/${username}}
    if [[ "${HomePath:0:1}" != "/" ]]; then
        echo "Error: You must enter an absolute path !!"
        exit -1
    fi

    echo -n "Input the Group Name for user:$username [default: users]: "
    read -r GroupName
    GroupName=${GroupName:-users}
    if [ "$IS_SUSE" == 1 ]; then
        useradd -m -d "$HomePath" -g "$GroupName" "$username"
    else
        useradd -d "$HomePath" -g "$GroupName" "$username"
    fi
    passwd "$username"
    sync_user
    echo "Added user $username on the whole cluster successfully!"
    echo -n "Do you want to set up SSH without password for user:$username [y/N]? "
    read -r IsSsh
    IsSsh=${IsSsh:-n}
    if [[ "$IsSsh" == "y" || "$IsSsh" == "Y" ]]; then
        setup_ssh "$username"
    fi
}

deluser_cluster()
(
    if [ "$(whoami)" != root ]; then
        echo "Error: You must log in as root"
        exit -10
    fi
    username="$1"
    if ! awk -F: '{print $1}' /etc/passwd | grep -q "^$username$" >> "$OUTPUT"; then
        echo "Error: user: $username does not exist!! "
        exit 12
    fi

    waitfor "Now delete user: $username on the Whole cluster" 5 0.2
    # Look up the home directory before userdel removes the passwd entry
    dir=$(awk -F: -v name="$username" '($1==name) {print $6}' /etc/passwd)
    userdel "$username"
    echo -n "Do you want to delete Home path: $dir for $username [y/N]: "
    read -r IsDel
    IsDel=${IsDel:-n}
    if [[ "$IsDel" == "y" || "$IsDel" == "Y" ]]; then
        rm -rf "$dir"
    fi
    sync_user
    echo "Deleted user $username on the whole cluster successfully!"
)

7. Run a command on every node

sync_do()
(
    waitfor "Now execute the commands: $* on the whole cluster" 5 0.2
    echo "Executing Command '$*' On the whole Cluster"
    nodelist=$(uncommentfile "$HOSTSFILE" | awk '{print $2}' | sed -n "/^${NODE_PREFIX}/p")
    for node in $nodelist; do
        echo "=======$node====="
        $REMOTE_SH "$node" "$*"
    done
)
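
For example (the package name is only an illustration):

sync_do "yum install -y nfs-utils"
sync_do "uptime"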

Slurm queue setup on the cluster

I had previously installed a single-node Torque queue, but because these servers have different core counts (20, 32, 36, 52, 64, ...), I needed a single resource request that asks for one node and all of the cores on that node. I tried many commands and read the Torque documentation from 6.1.2 up to the latest release, and really could not find a way to do this: either only one core gets allocated, or the job does not run at all. Slurm has exactly this capability.
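
A minimal job-script sketch of what I mean: --nodes=1 together with --exclusive allocates one whole node, however many cores it has (the partition name is a placeholder):

#!/bin/bash
#SBATCH --job-name=whole-node
#SBATCH --partition=all        # placeholder partition name
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive            # take every core on the allocated node

echo "Running on $(hostname) with $SLURM_CPUS_ON_NODE CPUs"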

1. Deploy Slurm on every compute node

2. Configure the munge key

After configuring the munge key the service needs to be restarted. At the time none of my verification attempts succeeded, and it turned out a restart was all that was missing; problems are sometimes like that, and none of the tutorials I read mentioned it.
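
The steps I mean are roughly the following (a sketch only; create-munge-key comes with the munge package on CentOS/EPEL, node2 stands for each compute node, and the same key must end up on every node):

# On the head node: generate the key and copy it to every node
create-munge-key                                   # writes /etc/munge/munge.key
scp -p /etc/munge/munge.key node2:/etc/munge/munge.key
ssh node2 "chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key"

# Restart munge on every node -- this was the step I had missed
systemctl restart munge

# Verify: a credential generated locally should decode on the remote node
munge -n | ssh node2 unmunge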

3. Edit /etc/slurm/slurm.conf

My attempt to run the pre-existing CentOS 8.5 Slurm and the 7.9 one side by side did not succeed.
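
A minimal slurm.conf fragment for nodes with different core counts might look like this; the host names, CPU counts, paths and partition name are illustrative, not my actual configuration:

# /etc/slurm/slurm.conf (fragment)
ClusterName=lab
SlurmctldHost=node1
ProctrackType=proctrack/linuxproc
SlurmdSpoolDir=/var/spool/slurmd
StateSaveLocation=/var/spool/slurmctld

# Each node declares its own CPU count; "scontrol show node" should match reality
NodeName=node1 CPUs=64 State=UNKNOWN
NodeName=node2 CPUs=36 State=UNKNOWN
NodeName=node3 CPUs=20 State=UNKNOWN

PartitionName=all Nodes=node1,node2,node3 Default=YES MaxTime=INFINITE State=UP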

Deploying Slurm on CentOS 7.9

I installed it successfully on CentOS 7.9 by following Slurm在centos7单机上的安装经验 - 计算机使用与Linux交流 (Computer Usage and Linux) - 计算化学公社; installation failed on both 7.6 and 8.5.

Software compilation

I installed pretty much every piece of software: VASP, Gaussian, Gromacs, CP2K, deepmd_v3... plus many small utilities. The most painful issue was an insufficient gcc version; upgrading gcc to 13 solved some of the missing-library problems (libstdc++ / GLIBCXX, I forget the exact name) and let the builds finish, but some software still demanded a newer runtime, even > 2.28, which is really a glibc requirement (GLIBC_2.28) rather than gcc; glibc is a very low-level library, close to the kernel, and rebuilding or replacing it can easily break the system. Installing VASP is much simpler nowadays; back when my advisor first taught me I was completely lost by all the patching and customization involved. Intel oneAPI plus the HPC toolkit is also far more convenient than the 2018 compilers, which additionally required a license. Note that on CentOS 7.9 the newest version you can apparently use is 21.12, which the Intel website no longer offers. In my comparison I did not notice any speed difference between the 2018 version and 21.12, but compiling is more convenient. The checks below help figure out what the system actually provides.
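
When a prebuilt binary complains about GLIBC or GLIBCXX versions, these checks show what the system provides before deciding whether an upgrade is worth the risk:

ldd --version                                            # system glibc version
strings /lib64/libc.so.6 | grep ^GLIBC_ | tail           # GLIBC symbol versions available
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX | tail  # libstdc++ ABI versions available
gcc --version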
