CMAQv5.3.2 on c5.4xlarge
Amazon EC2 instance (Amazon AMI): c5.4xlarge (16 vCPUs)
openmpi_4.0.1/gcc_8.3.1:

```
==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-02
Number of Simulation Days: 2
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       16
   All times are in seconds.

Num  Day          Wall Time
01   2016-07-01   1730.5
02   2016-07-02   1602.3
     Total Time = 3332.80
      Avg. Time = 1666.40

The elapsed time for this simulation was 1602.3 seconds.
19711.615u 1046.727s 26:42.77 1295.1% 0+0k 6735848+1416040io 6pf+0w
CMAQ Processing of Day 20160702 Finished at Wed Dec 16 18:47:51 UTC 2020
```

Singularity mvapich: note that the Singularity CMAQ CCTM uses the x86_64 "medium memory model", which supports stack sizes, array sizes, and data sizes larger than 2 GB. Use of this option requires that gcc and gfortran themselves be version 4.4 or later and have been compiled with "-mcmodel=medium". See http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models

```
Num  Day          Wall Time
01   2016-07-01   1546.4
02   2016-07-02   1468.0
     Total Time = 3015.33
      Avg. Time = 1507.66
```
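The -mcmodel=medium requirement mentioned above can be checked directly. A minimal sketch (the file name and array size are illustrative, not from the CMAQ build): compiling a static array larger than 2 GB fails under the default small code model but links with the medium model.

```
# Hedged check that the toolchain supports the medium memory model.
# "big.c" and the array size are illustrative only.
cat > big.c <<'EOF'
static double big[300000000];              /* ~2.4 GB static array */
int main(void) { big[0] = 1.0; return 0; }
EOF
gcc -mcmodel=medium -o big big.c \
  && echo "gcc accepts -mcmodel=medium"    # requires gcc >= 4.4
```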
Wall time in seconds per simulation day; a slash separates results from two separate runs.

| | CMAQv5.3.2 (openmpi) | CMAQv5.3.2 (mpich) | CMAQv5.3.2 Singularity (openmpi) | CMAQv5.3.2 Singularity (mvapich) | CMAQv5.3.2 Singularity (mpich) | CMAQv5.3.2 Singularity Atmos (openmpi) | CMAQv5.3.2 Singularity Dogwood (openmpi-hybrid) |
|---|---|---|---|---|---|---|---|
| day 1 | 1730.5/1390 | 1779.1/1361.20 | error | 1546.4/1995.4 | 1564.5 | 1151 | 1436.5 |
| day 2 | 1602.3 | 1649.6 | error | 1468.0/1840.83 | 1497.33 | | |
Error for openmpi on c5.4xlarge when attempting to run on 16 processors:

```
/usr/bin/time -p mpirun -np 16 /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe
[1610650809.265129] [ip-172-31-84-61:6018 :0]  sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1610650809.275238] [ip-172-31-84-61:6012 :0]  sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
```

with `CTM_APPL = v532_openmpi_gcc_2016_12SE1_20160701`.

It then creates a run on only 1 processor, with one log file:

```
*** ERROR in INIT3/INITLOG3 ***
Error opening log file on unit 99
I/O STATUS = 17
DESCRIPTION: File 'CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701' already exists
File: CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701
```
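As the UCX message itself suggests, the host's System V shared memory limits can be inspected with `ipcs -l`. A minimal sketch for checking and raising them (the values shown are illustrative, not a validated recommendation):

```
# Inspect current System V shared memory limits, as the UCX error suggests.
ipcs -l
# Kernel parameters that govern shmget(); values below are illustrative only.
sysctl kernel.shmmax kernel.shmall kernel.shmmni
# Raising them requires root; persist in /etc/sysctl.conf if this helps.
sudo sysctl -w kernel.shmmax=68719476736
sudo sysctl -w kernel.shmall=4294967296
```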
top shows that the job is running on just 1 processor:

```
top - 19:05:25 up 50 min,  2 users,  load average: 1.48, 0.65, 1.04
Tasks: 229 total,   3 running, 211 sleeping,   0 stopped,  15 zombie
%Cpu(s):  8.9 us,  3.5 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem :  31157.2 total,  24247.2 free,   5918.4 used,    991.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24847.4 avail Mem

  PID USER   PR  NI    VIRT    RES    SHR S  %CPU  %MEM    TIME+  COMMAND
 6160 cmas   20   0   25324   2468   2180 R  99.7   0.0  1:17.64  hydra_pmi_proxy
 6161 cmas   20   0 6415232   5.5g  19592 R  99.3  18.0  1:17.63  CCTM_v532.exe
 6219 cmas   20   0   65520   4800   3916 R   0.3   0.0  0:00.01  top
    1 root   20   0  244840  13440   9088 S   0.0   0.0  0:03.11  systemd
```
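Before re-running, the stale per-process log from the failed attempt has to be removed, or INIT3 aborts again with the "already exists" error above. A minimal cleanup sketch (the CTM_APPL value is the one from this run):

```
# Remove leftover CCTM per-process logs from the failed attempt
# before restarting; otherwise INIT3 refuses to reopen them.
rm -f CTM_LOG_*.v532_openmpi_gcc_2016_12SE1_20160701
```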
The versions of Open MPI and gcc on the Amazon EC2 instance:

```
[cmas@ip-172-31-92-184 Scripts-CMAQ]$ mpirun --version
mpirun (Open MPI) 4.0.3

gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
```
According to the Singularity MPI troubleshooting tips, the version of MPI on the host machine should match the version inside the container: https://sylabs.io/guides/3.7/user-guide/mpi.html#troubleshooting-tips
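A quick way to compare the two side by side, assuming the hybrid model described in those tips (the image name CMAQ_532.sif is a placeholder, not the actual file name):

```
# Compare host and container MPI versions.
mpirun --version
# "CMAQ_532.sif" is a placeholder for the actual container image.
singularity exec CMAQ_532.sif /usr/lib64/openmpi3/bin/mpirun --version
```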
Note: running the Singularity mvapich_gcc container on another c5.4xlarge machine gave a different timing result:

```
Date and time  0:00:00  July 2, 2016  (2016184:000000)
The elapsed time for this simulation was 1995.4 seconds.
```
Amazon AMI information (`hostnamectl`):

```
   Static hostname: ip-172-31-92-184.ec2.internal
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 1b448809a4a8468fb63a7b434d20508d
           Boot ID: 692f917b0e404f3195c6711ab81cf3e7
    Virtualization: kvm
  Operating System: Red Hat Enterprise Linux 8.3 (Ootpa)
       CPE OS Name: cpe:/o:redhat:enterprise_linux:8.3:GA
            Kernel: Linux 4.18.0-240.1.1.el8_3.x86_64
      Architecture: x86-64
```
Native builds of CMAQv5.3.2 were done using the following modules (`module avail`):

```
-------------------- /usr/share/Modules/modulefiles --------------------
dot  module-git  module-info  modules  null  use.own

--------------------------- /etc/modulefiles ---------------------------
mpi/mpich-x86_64  mpi/openmpi-x86_64

------------------------ /usr/share/modulefiles ------------------------
pmi/pmix-x86_64
```
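For the native openmpi build, the corresponding module is loaded before compiling and running. A minimal sketch using the module names listed above:

```
# Load the system Open MPI stack for the native build, then confirm
# which mpirun ends up on PATH.
module load mpi/openmpi-x86_64
which mpirun
mpirun --version
```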
Using the Singularity shell, I checked the version of Open MPI within the container:

```
[cmas@ip-172-31-92-184 Scripts-CMAQ]$ ./singularity-shell.csh
[cmas@ip-172-31-92-184 CMAQv5.3.2_Benchmark_2Day_Input]$ /usr/lib64/openmpi3/bin/mpirun --version
mpirun (Open MPI) 3.1.3
```

So the container has Open MPI 3.1.3 while the host has 4.0.3; per the troubleshooting tips above, this mismatch is a likely explanation for the "error" entries in the Singularity openmpi column of the table.
To check the mvapich2 version within the Singularity container, I used the interactive_slurm_longleaf.csh script to log in to the interactive queue:

```
#!/bin/csh -f
srun -t 5:00:00 -p interact -N 1 -n 1 --x11=first --pty /bin/csh
```

Then I ran the singularity-cctm.csh script to log in to the container shell:

```
/usr/lib64/mvapich2/bin/mpirun -version
HYDRA build details:
    Version: 3.0.4
```
I don't know why mvapich2 and mpich are both linked to the Hydra launcher, whereas openmpi is linked to orterun. (MVAPICH2 is derived from MPICH, so both ship MPICH's Hydra process manager; Open MPI uses its own ORTE runtime, hence orterun.)

```
ls -lrt /usr/lib64/openmpi3/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx  7 Jun 18  2020 /usr/lib64/openmpi3/bin/mpirun -> orterun

ls -lrt /usr/lib64/mvapich2/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Jun 18  2020 /usr/lib64/mvapich2/bin/mpirun -> mpiexec.hydra

ls -lrt /usr/lib64/mpich/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Apr 11  2020 /usr/lib64/mpich/bin/mpirun -> mpiexec.hydra
```
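A compact way to resolve all three symlinks at once (a sketch; the paths are the container's, as above):

```
# Resolve the actual launcher behind each MPI stack's mpirun symlink.
for m in openmpi3 mvapich2 mpich; do
  readlink -f /usr/lib64/$m/bin/mpirun
done
```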
Trying to run using openmpi on Dogwood gives the following error in the buff* file in the home directory:

```
cat buff_CMAQ_CCTMv532_lizadams_20210119_214931_272845894.txt
[mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)
```

In the log file:

```
Start Model Run At  Tue Jan 19 16:49:31 EST 2021
VRSN = v532 compiler= gcc APPL = 2016_12SE1

Working Directory is /opt/CMAQ_532/CCTM/script
Build Directory is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi
Output Directory is /opt/CMAQ_532/data/2016_12SE1/cctm/openmpi
Log Directory is /opt/CMAQ_532/data/2016_12SE1/logs
Executable Name is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi//opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe

---CMAQ EXECUTION ID: CMAQ_CCTMv532_lizadams_20210119_214931_272845894 ---

Set up input and output files for Day 20160701.
Existing Logs and Output Files for Day 2016-07-01 Will Be Deleted
/bin/rm: No match.

CMAQ Processing of Day 20160701 Began at Tue Jan 19 16:49:31 EST 2021

[mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)
```

The HYDU_create_process message indicates that Hydra's mpiexec was used and tried to launch ranks via srun, which is not on the PATH it sees.
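If Hydra must be used where srun is unavailable, its launcher can be overridden. A sketch, not a validated fix, assuming the MPICH/MVAPICH2 Hydra mpiexec (the executable path is the one from the log):

```
# Single-node workaround: have Hydra fork local processes
# instead of calling srun.
mpiexec.hydra -launcher fork -np 16 \
    /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe
# or, equivalently, via Hydra's environment variable:
export HYDRA_LAUNCHER=fork
```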
If I modify the batch script cmaq_cctm.openmpi-hybrid.csh and run using the following modules:

```
Currently Loaded Modules:
  1) gcc/6.3.0   2) openmpi_3.0.0/gcc_6.3.0
```

I get the following error:

```
[c-201-23:05252] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 223
```
```
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[c-201-23:05257] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
```
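A PMIx OUT-OF-RESOURCE failure during MPI_Init is consistent with launching the executable under a different MPI stack than it was built against. One way to check which MPI libraries the executable actually resolves (a diagnostic sketch, not from the original notes):

```
# The MPI shared libraries the CCTM executable resolves to should come
# from the same stack as the mpirun used to launch it.
ldd /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe | grep -i mpi
```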
If I instead use the script cmaq_cctm.openmpi-hybrid.csh with the following modules:

```
  1) gcc/9.1.0   2) openmpi_4.0.1/gcc_9.1.0
```

then it worked!
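So on Dogwood the cure was simply a newer toolchain. A minimal sketch of the module swap, using the module names listed above:

```
# Replace the failing gcc 6.3 / Open MPI 3.0 stack with the working one.
module purge
module load gcc/9.1.0 openmpi_4.0.1/gcc_9.1.0
module list   # confirm: gcc/9.1.0 and openmpi_4.0.1/gcc_9.1.0
```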