Slurm is an open-source, fault-tolerant, highly scalable cluster management and job scheduling system for Linux clusters of all sizes. Slurm requires no kernel modifications and is relatively self-contained. As a cluster workload manager, Slurm has the following features:
1. It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for a period of time so they can perform work;
2. It provides a framework for starting, executing, and monitoring work (typically parallel jobs) on the set of allocated nodes;
3. It arbitrates contention for resources by managing a queue of pending work;
4. It provides tools for job accounting and job state diagnostics.
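To make these features concrete, here is a minimal sketch of a batch job as a user would submit it once the cluster built below is running. The partition name debug matches the sinfo output at the end of this walkthrough; the file name test.sh and the job name are illustrative:

#!/bin/bash
#SBATCH --job-name=hello      # name shown by squeue
#SBATCH --partition=debug     # partition from the sinfo output below
#SBATCH --nodes=1             # request one compute node
srun hostname                 # runs on the allocated node

Submit it with "sbatch test.sh" and watch the queue with "squeue".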
System: CentOS minimal install; apply software patches and kernel updates; disable SELinux and the firewall.
Dedicated Slurm account (slurm): the master and the nodes must use the same UID/GID for this account; a UID of 200 is recommended;
If the Slurm master needs to support GUI commands (sview), install a graphical environment (the "Server with GUI" package group);
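Disabling SELinux and the firewall, as called for above, can be done as follows (a minimal sketch; in production you may prefer to open the specific Slurm and Munge ports rather than disabling firewalld entirely):

sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config  # persistent; takes effect after reboot
setenforce 0                                                  # switch to permissive for the current boot
systemctl disable --now firewalld                             # stop the firewall and disable it at boot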
0. Install the EPEL repository: yum install -y epel-release && yum makecache
[root@localhost ~]# yum install -y epel-release && yum makecache
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * extras: mirrors.aliyun.com
 * updates: mirrors.aliyun.com
base                                                | 3.6 kB  00:00:00
epel                                                | 4.7 kB  00:00:00
extras                                              | 2.9 kB  00:00:00
updates                                             | 2.9 kB  00:00:00
(1/3): epel/x86_64/updateinfo                       | 1.0 MB  00:00:00
(2/3): updates/7/x86_64/primary_db                  | 4.5 MB  00:00:01
(3/3): epel/x86_64/primary_db                       | 6.9 MB  00:00:02
......output omitted......
(5/9): updates/7/x86_64/other_db                    | 318 kB  00:00:00
(6/9): updates/7/x86_64/filelists_db                | 2.4 MB  00:00:01
(7/9): base/7/x86_64/filelists_db                   | 7.1 MB  00:00:03
(8/9): epel/x86_64/other_db                         | 3.3 MB  00:00:04
(9/9): epel/x86_64/filelists_db                     |  12 MB  00:00:04
Metadata Cache Created
1. Set the hostname: hostnamectl set-hostname slurm-node1  # reconnect the session afterwards for the new hostname to take effect
[root@localhost ~]# hostnamectl set-hostname slurm-node1
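The master and the nodes should also be able to resolve each other's names; without this, unmunge reports ENCODE_HOST as ??? (as seen in step c4 below). A sketch of the /etc/hosts entries, assuming the addresses used in this walkthrough:

cat >> /etc/hosts << 'EOF'
192.168.80.250  slurm-master
192.168.80.251  slurm-node1
EOF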
2. Configure the time service and synchronize the time: CentOS 7 uses the Chrony time service by default.
[root@slurm-node1 ~]# systemctl status chronyd.service
● chronyd.service - NTP client/server
   Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2020-09-27 17:12:25 CST; 1min 28s ago
     Docs: man:chronyd(8)
           man:chrony.conf(5)
  Process: 817 ExecStartPost=/usr/libexec/chrony-helper update-daemon (code=exited, status=0/SUCCESS)
  Process: 793 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 805 (chronyd)
   CGroup: /system.slice/chronyd.service
           └─805 /usr/sbin/chronyd

Sep 27 17:12:24 localhost.localdomain systemd[1]: Starting NTP client/server...
Sep 27 17:12:24 localhost.localdomain chronyd[805]: chronyd version 3.4 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNC... +DEBUG)
Sep 27 17:12:24 localhost.localdomain chronyd[805]: Frequency -6.336 +/- 109.996 ppm read from /var/lib/chrony/drift
Sep 27 17:12:25 localhost.localdomain systemd[1]: Started NTP client/server.
Sep 27 17:12:33 localhost.localdomain chronyd[805]: Selected source 203.107.6.88
Sep 27 17:12:34 localhost.localdomain chronyd[805]: Source 94.130.49.186 replaced with 108.59.2.24
Hint: Some lines were ellipsized, use -l to show in full.
a1) In /etc/chrony.conf, comment out or delete the four lines like "server 0.centos.pool.ntp.org iburst";
a2) Add a "server <Slurm-master-ip> iburst" line pointing at the Slurm master server's IP address:
# These servers were defined in the installation:
#server 0.centos.pool.ntp.org iburst
#server 1.centos.pool.ntp.org iburst
#server 2.centos.pool.ntp.org iburst
#server 3.centos.pool.ntp.org iburst
server 192.168.80.250 iburst
a3) Restart the Chrony service (systemctl restart chronyd.service) and verify the sources (chronyc sources):
[root@slurm-node1 ~]# systemctl restart chronyd.service
[root@slurm-node1 ~]# chronyc sources
210 Number of sources = 1
MS Name/IP address         Stratum Poll Reach LastRx Last sample
===============================================================================
^* 192.168.80.250                3   6    17     1  +1266us[+1346us] +/-   21ms
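Note that the master can only serve time to the nodes if its own /etc/chrony.conf permits client access; a sketch of the master-side directives, assuming the 192.168.80.0/24 subnet used here:

allow 192.168.80.0/24   # permit NTP queries from cluster nodes
local stratum 10        # keep serving time even when upstream sources are unreachable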
3. Deploy Munge: the Munge version currently available from the online repositories is 0.5.11.
[root@slurm-node1 ~]# yum install -y munge munge-libs munge-devel
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * extras: mirrors.aliyun.com
 * updates: mirrors.aliyun.com
base                                                | 3.6 kB  00:00:00
epel                                                | 4.7 kB  00:00:00
extras                                              | 2.9 kB  00:00:00
updates                                             | 2.9 kB  00:00:00
(1/3): epel/x86_64/updateinfo                       | 1.0 MB  00:00:00
(2/3): updates/7/x86_64/primary_db                  | 4.5 MB  00:00:01
(3/3): epel/x86_64/primary_db                       | 6.9 MB  00:00:02
Resolving Dependencies
--> Running transaction check
---> Package munge.x86_64 0:0.5.11-3.el7 will be installed
---> Package munge-devel.x86_64 0:0.5.11-3.el7 will be installed
---> Package munge-libs.x86_64 0:0.5.11-3.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package            Arch          Version               Repository        Size
================================================================================
Installing:
 munge              x86_64        0.5.11-3.el7          epel              95 k
 munge-devel        x86_64        0.5.11-3.el7          epel              22 k
 munge-libs         x86_64        0.5.11-3.el7          epel              37 k

Transaction Summary
================================================================================
Install  3 Packages

Total download size: 154 k
Installed size: 341 k
Downloading packages:
(1/3): munge-0.5.11-3.el7.x86_64.rpm                |  95 kB  00:00:00
(2/3): munge-devel-0.5.11-3.el7.x86_64.rpm          |  22 kB  00:00:00
(3/3): munge-libs-0.5.11-3.el7.x86_64.rpm           |  37 kB  00:00:00
--------------------------------------------------------------------------------
Total                                      105 kB/s | 154 kB  00:00:01
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : munge-libs-0.5.11-3.el7.x86_64                               1/3
  Installing : munge-0.5.11-3.el7.x86_64                                    2/3
  Installing : munge-devel-0.5.11-3.el7.x86_64                              3/3
  Verifying  : munge-0.5.11-3.el7.x86_64                                    1/3
  Verifying  : munge-devel-0.5.11-3.el7.x86_64                              2/3
  Verifying  : munge-libs-0.5.11-3.el7.x86_64                               3/3

Installed:
  munge.x86_64 0:0.5.11-3.el7             munge-devel.x86_64 0:0.5.11-3.el7
  munge-libs.x86_64 0:0.5.11-3.el7

Complete!
Then tighten the permissions on the Munge directories:
[root@slurm-node1 ~]# chmod -R 0700 /etc/munge /var/log/munge && chmod -R 0711 /var/lib/munge && chmod -R 0755 /var/run/munge
c1) Copy the Munge key file from the master node: scp root@192.168.80.250:/etc/munge/munge.key /etc/munge/
[root@slurm-node1 ~]# scp root@192.168.80.250:/etc/munge/munge.key /etc/munge/
The authenticity of host '192.168.80.250 (192.168.80.250)' can't be established.
ECDSA key fingerprint is SHA256:2Eo2WLWyofiltEAs4nLUFLOcXLFD6YvsuPSDlEDUZGk.
ECDSA key fingerprint is MD5:3c:b0:5f:a8:af:6a:15:45:eb:a9:2a:b0:20:21:65:04.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.80.250' (ECDSA) to the list of known hosts.
root@192.168.80.250's password:
munge.key                                 100% 1024   251.7KB/s   00:00
c2) Set ownership and permissions on the Munge key file: chown munge:munge /etc/munge/munge.key && chmod 0600 /etc/munge/munge.key
[root@slurm-node1 ~]# chown munge:munge /etc/munge/munge.key && chmod 0600 /etc/munge/munge.key
c3) Start the Munge service and enable it at boot: systemctl start munge.service && systemctl enable munge.service
[root@slurm-node1 ~]# systemctl start munge.service && systemctl enable munge.service
Created symlink from /etc/systemd/system/multi-user.target.wants/munge.service to /usr/lib/systemd/system/munge.service.
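Before involving the master, the local Munge daemon can be sanity-checked on the node itself (standard Munge usage):

munge -n | unmunge   # encode and decode a credential locally; STATUS should be Success (0)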
c4) Verify against the Slurm master: munge -n | ssh 192.168.80.250 unmunge
[root@slurm-node1 ~]# munge -n | ssh 192.168.80.250 unmunge
root@192.168.80.250's password:
STATUS:           Success (0)
ENCODE_HOST:      ??? (192.168.80.251)
ENCODE_TIME:      2020-09-27 17:24:54 +0800 (1601198694)
DECODE_TIME:      2020-09-27 17:24:56 +0800 (1601198696)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha1 (3)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
4. Install the required build components: yum install -y rpm-build bzip2-devel openssl openssl-devel zlib-devel perl-DBI perl-ExtUtils-MakeMaker pam-devel readline-devel mariadb-devel python3 gtk2 gtk2-devel
[root@slurm-master ~]# yum install -y rpm-build bzip2-devel openssl openssl-devel zlib-devel perl-DBI perl-ExtUtils-MakeMaker pam-devel readline-devel mariadb-devel python3 gtk2 gtk2-devel
Loaded plugins: fastestmirror, langpacks
Loading mirror speeds from cached hostfile
 * base: mirrors.aliyun.com
 * extras: mirrors.aliyun.com
 * updates: mirrors.aliyun.com
Package rpm-build-4.11.3-43.el7.x86_64 already installed and latest version
Package 1:openssl-1.0.2k-19.el7.x86_64 already installed and latest version
Package perl-DBI-1.627-4.el7.x86_64 already installed and latest version
Package gtk2-2.24.31-1.el7.x86_64 already installed and latest version
Resolving Dependencies
--> Running transaction check
---> Package bzip2-devel.x86_64 0:1.0.6-13.el7 will be installed
---> Package gtk2-devel.x86_64 0:2.24.31-1.el7 will be installed
--> Processing Dependency: pango-devel >= 1.20.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: glib2-devel >= 2.28.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: cairo-devel >= 1.6.0-1 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: atk-devel >= 1.29.4-2 for package: gtk2-devel-2.24.31-1.el7.x86_64
--> Processing Dependency: pkgconfig(pangoft2) for package: gtk2-devel-2.24.31-1.el7.x86_64
......output omitted......
  mesa-khr-devel.x86_64 0:18.3.4-7.el7_8.1        mesa-libEGL-devel.x86_64 0:18.3.4-7.el7_8.1
  mesa-libGL-devel.x86_64 0:18.3.4-7.el7_8.1      ncurses-devel.x86_64 0:5.9-14.20130511.el7_4
  pango-devel.x86_64 0:1.42.4-4.el7_7             pcre-devel.x86_64 0:8.32-17.el7
  perl-ExtUtils-Install.noarch 0:1.58-295.el7     perl-ExtUtils-Manifest.noarch 0:1.61-244.el7
  perl-ExtUtils-ParseXS.noarch 1:3.18-3.el7       perl-devel.x86_64 4:5.16.3-295.el7
  pixman-devel.x86_64 0:0.34.0-1.el7              pyparsing.noarch 0:1.5.6-9.el7
  python3-libs.x86_64 0:3.6.8-13.el7              python3-pip.noarch 0:9.0.3-7.el7_7
  python3-setuptools.noarch 0:39.2.0-10.el7       systemtap-sdt-devel.x86_64 0:4.0-11.el7
Complete!
5. Deploy the Slurm software: create the dedicated slurm account, fetch the source tarball from the master, build the RPMs, and install them.
[root@slurm-node1 ~]# groupadd -g 200 slurm && useradd -u 200 -g 200 -s /sbin/nologin -M slurm
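The UID and GID must match the slurm account on the master (200, per the prerequisites above); a quick check:

id slurm   # expect uid=200(slurm) gid=200(slurm) groups=200(slurm)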
[root@slurm-node1 ~]# cd && scp root@192.168.80.250:/root/slurm-20.02.5.tar.bz2 ./
root@192.168.80.250's password:
slurm-20.02.5.tar.bz2                     100% 6177KB  14.6MB/s   00:00
[root@slurm-node1 ~]# rpmbuild -ta --clean slurm-20.02.5.tar.bz2
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.yM0uEb
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-20.02.5
+ /usr/bin/bzip2 -dc /root/slurm-20.02.5.tar.bz2
+ /usr/bin/tar -xvvf -
drwxr-xr-x 1000/1000         0 2020-09-11 04:56 slurm-20.02.5/
-rw-r--r-- 1000/1000      8543 2020-09-11 04:56 slurm-20.02.5/LICENSE.OpenSSL
drwxr-xr-x 1000/1000         0 2020-09-11 04:56 slurm-20.02.5/auxdir/
-rw-r--r-- 1000/1000    306678 2020-09-11 04:56 slurm-20.02.5/auxdir/libtool.m4
-rw-r--r-- 1000/1000      5860 2020-09-11 04:56 slurm-20.02.5/auxdir/ax_gcc_builtin.m4
-rwxr-xr-x 1000/1000     15368 2020-09-11 04:56 slurm-20.02.5/auxdir/install-sh
-rw-r--r-- 1000/1000    327116 2020-09-11 04:56 slurm-20.02.5/auxdir/ltmain.sh
-rw-r--r-- 1000/1000      2630 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_freeipmi.m4
-rw-r--r-- 1000/1000      1783 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_yaml.m4
-rw-r--r-- 1000/1000      2709 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_databases.m4
-rw-r--r-- 1000/1000      2018 2020-09-11 04:56 slurm-20.02.5/auxdir/x_ac_http_parser.m4
-rwxr-xr-x 1000/1000     36136 2020-09-11 04:56 slurm-20.02.5/auxdir/config.sub
-rwxr-xr-x 1000/1000     23568 2020-09-11 04:56 slurm-20.02.5/auxdir/depcomp
......output omitted......
Checking for unpackaged file(s): /usr/lib/rpm/check-files /root/rpmbuild/BUILDROOT/slurm-20.02.5-1.el7.x86_64
Wrote: /root/rpmbuild/SRPMS/slurm-20.02.5-1.el7.src.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-perlapi-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-devel-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-example-configs-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmctld-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmd-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-slurmdbd-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-libpmi-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-torque-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-openlava-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-contribs-20.02.5-1.el7.x86_64.rpm
Wrote: /root/rpmbuild/RPMS/x86_64/slurm-pam_slurm-20.02.5-1.el7.x86_64.rpm
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.ojihVc
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd slurm-20.02.5
+ rm -rf /root/rpmbuild/BUILDROOT/slurm-20.02.5-1.el7.x86_64
+ exit 0
Executing(--clean): /bin/sh -e /var/tmp/rpm-tmp.lzdQo2
+ umask 022
+ cd /root/rpmbuild/BUILD
+ rm -rf slurm-20.02.5
+ exit 0
[root@slurm-node1 ~]# cd /root/rpmbuild/RPMS/x86_64 && yum install -y slurm-*.rpm
Loaded plugins: fastestmirror, langpacks
Examining slurm-20.02.5-1.el7.x86_64.rpm: slurm-20.02.5-1.el7.x86_64
Marking slurm-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-contribs-20.02.5-1.el7.x86_64.rpm: slurm-contribs-20.02.5-1.el7.x86_64
Marking slurm-contribs-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-devel-20.02.5-1.el7.x86_64.rpm: slurm-devel-20.02.5-1.el7.x86_64
Marking slurm-devel-20.02.5-1.el7.x86_64.rpm to be installed
Examining slurm-example-configs-20.02.5-1.el7.x86_64.rpm: slurm-example-configs-20.02.5-1.el7.x86_64
......output omitted......
  Verifying  : slurm-libpmi-20.02.5-1.el7.x86_64                           12/13
  Verifying  : slurm-perlapi-20.02.5-1.el7.x86_64                          13/13

Installed:
  slurm.x86_64 0:20.02.5-1.el7                slurm-contribs.x86_64 0:20.02.5-1.el7
  slurm-devel.x86_64 0:20.02.5-1.el7          slurm-example-configs.x86_64 0:20.02.5-1.el7
  slurm-libpmi.x86_64 0:20.02.5-1.el7         slurm-openlava.x86_64 0:20.02.5-1.el7
  slurm-pam_slurm.x86_64 0:20.02.5-1.el7      slurm-perlapi.x86_64 0:20.02.5-1.el7
  slurm-slurmctld.x86_64 0:20.02.5-1.el7      slurm-slurmd.x86_64 0:20.02.5-1.el7
  slurm-slurmdbd.x86_64 0:20.02.5-1.el7       slurm-torque.x86_64 0:20.02.5-1.el7

Dependency Installed:
  perl-Switch.noarch 0:2.16-7.el7

Complete!
e1) Copy the Slurm configuration file from the master node: scp root@192.168.80.250:/etc/slurm/slurm.conf /etc/slurm/
[root@slurm-node1 x86_64]# scp root@192.168.80.250:/etc/slurm/slurm.conf /etc/slurm/
root@192.168.80.250's password:
slurm.conf                                100% 2127   743.6KB/s   00:00
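Slurm requires slurm.conf to be identical on every machine in the cluster, so the node simply takes the master's copy. For orientation, a minimal sketch of the kind of settings that file would contain, assuming the host names, the slurm account, and the debug partition used in this walkthrough (the CPU count is illustrative):

SlurmctldHost=slurm-master        # where slurmctld (the controller) runs
SlurmUser=slurm                   # the dedicated account created earlier
SlurmdSpoolDir=/var/spool/slurm   # matches the directory created in step e2)
NodeName=slurm-node1 CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=slurm-node1 Default=YES MaxTime=INFINITE State=UP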
e2) Create the companion spool directory and give it to the slurm account: mkdir /var/spool/slurm && chown slurm:slurm /var/spool/slurm
[root@slurm-node1 x86_64]# mkdir /var/spool/slurm && chown slurm:slurm /var/spool/slurm
e3) Start the Slurm node daemon and enable it at boot: systemctl start slurmd.service && systemctl enable slurmd.service
[root@slurm-node1 x86_64]# systemctl start slurmd.service && systemctl enable slurmd.service
Created symlink from /etc/systemd/system/multi-user.target.wants/slurmd.service to /usr/lib/systemd/system/slurmd.service.
[root@slurm-node1 x86_64]# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
   Loaded: loaded (/usr/lib/systemd/system/slurmd.service; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2020-09-27 17:51:10 CST; 32s ago
 Main PID: 69215 (slurmd)
   CGroup: /system.slice/slurmd.service
           └─69215 /usr/sbin/slurmd

Sep 27 17:51:10 slurm-node1 systemd[1]: Starting Slurm node daemon...
Sep 27 17:51:10 slurm-node1 systemd[1]: Started Slurm node daemon.
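If slurmd fails to start, running it in the foreground with verbose logging is the quickest way to see the reason (standard slurmd options):

slurmd -D -vvv   # stay in the foreground and print debug output; stop with Ctrl-C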
Back on the master, confirm that the new node has registered and is idle:
[root@slurm-master ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle slurm-node1
[root@slurm-master slurm]# sview   # opens the graphical cluster view (requires the GUI environment noted in the prerequisites)
[root@slurm-master ~]# srun hostname
slurm-node1
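For a more detailed view of the node as the controller sees it, scontrol can be queried from the master (standard Slurm usage):

scontrol show node slurm-node1   # reports state, CPUs, memory, and a reason if the node is down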