CMAQv532 on C5.4xlarge


Amazon AMI EC2 Instance: C5.4xlarge (16 processors)

openmpi_4.0.1/gcc_8.3.1

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-02
Number of Simulation Days: 2
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       16
   All times are in seconds.

Num  Day        Wall Time
01   2016-07-01   1730.5
02   2016-07-02   1602.3
     Total Time = 3332.80
      Avg. Time = 1666.40

     The elapsed time for this simulation was    1602.3 seconds.

19711.615u 1046.727s 26:42.77 1295.1%   0+0k 6735848+1416040io 6pf+0w

CMAQ Processing of Day 20160702 Finished at Wed Dec 16 18:47:51 UTC 2020

Singularity mvapich
Note: the Singularity CMAQ CCTM build uses the medium memory model.

 X86_64 "Medium memory model" version:  support stack-size,
#  array-size, data-size larger than 2 GB.
#  Use of this opotion requires that "gcc" and "gfortran" thenselves be
#  of version 4.4 or later and have been compiled with  "-mcmodel=medium".
#  See http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models
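
For reference, a minimal sketch of what the flag looks like on a compile line, assuming a gfortran of version 4.4 or later; the source file name is only illustrative, not part of the CMAQ build:

  gfortran -mcmodel=medium -O2 -c big_arrays.f90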

Num    Day          Wall Time
01     2016-07-01   1546.4
02     2016-07-02   1468.0
     Total Time = 3015.33
      Avg. Time = 1507.66


Run Times on 16 PEs on c5.4xlarge (wall time in seconds)

Configuration                                       Day 1              Day 2
CMAQv5.3.2 (openmpi)                                1730.5 / 1390      1602.3
CMAQv5.3.2 (mpich)                                  1779.1 / 1361.20   1649.6
CMAQv5.3.2 Singularity (openmpi)                    error              error
CMAQv5.3.2 Singularity (mvapich)                    1546.4 / 1995.4    1468.0 / 1840.83
CMAQv5.3.2 Singularity (mpich)                      1564.5             1497.33
CMAQv5.3.2 Singularity Atmos (openmpi)              1151
CMAQv5.3.2 Singularity Dogwood (openmpi-hybrid)     1436.5

Error for openmpi on C5.4xlarge when attempting to run on 16 processors:

 /usr/bin/time -p mpirun -np 16 /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe

 [1610650809.265129] [ip-172-31-84-61:6018 :0]    sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'

       CTM_APPL  |  v532_openmpi_gcc_2016_12SE1_20160701

The job then runs on only one processor and creates a single log file:

***  ERROR in INIT3/INITLOG3  ***
    Error opening log file on unit        99
    I/O STATUS =        17
    DESCRIPTION: File 'CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701' already exists
    File: CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701
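
The "already exists" message just reflects a CTM log left behind by the earlier aborted attempt; something along these lines clears it before a rerun (the wildcard assumes the log files sit in the current run directory):

  rm CTM_LOG_*.v532_openmpi_gcc_2016_12SE1_20160701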

[1610650809.275238] [ip-172-31-84-61:6012 :0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
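
The UCX errors point at the System V shared-memory limits, as the message itself suggests. A minimal sketch of inspecting and, as root, raising those limits on the instance is shown below; the limit values are examples only, not recommendations:

  # inspect current SysV shared-memory limits
  ipcs -l

  # example values only: raise max segment size and total pages (run as root)
  sysctl -w kernel.shmmax=68719476736
  sysctl -w kernel.shmall=4294967296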

top shows that the job is running on just 1 processor:

 top - 19:05:25 up 50 min,  2 users,  load average: 1.48, 0.65, 1.04
 Tasks: 229 total,   3 running, 211 sleeping,   0 stopped,  15 zombie
 %Cpu(s):  8.9 us,  3.5 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
 MiB Mem :  31157.2 total,  24247.2 free,   5918.4 used,    991.5 buff/cache
 MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24847.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   6160 cmas      20   0   25324   2468   2180 R  99.7   0.0   1:17.64 hydra_pmi_proxy
   6161 cmas      20   0 6415232   5.5g  19592 R  99.3  18.0   1:17.63 CCTM_v532.exe
   6219 cmas      20   0   65520   4800   3916 R   0.3   0.0   0:00.01 top
      1 root      20   0  244840  13440   9088 S   0.0   0.0   0:03.11 systemd

The version of openmpi on the Amazon EC2 instance is:

[cmas@ip-172-31-92-184 Scripts-CMAQ]$ mpirun --version
mpirun (Open MPI) 4.0.3
gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

According to the Singularity MPI troubleshooting tips, the MPI version on the host machine should match the MPI version inside the container: https://sylabs.io/guides/3.7/user-guide/mpi.html#troubleshooting-tips
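
A quick way to compare the two is to print both versions side by side; the image name below is only a placeholder for the CMAQ container actually used, and the in-container path follows the one shown later on this page:

  # host MPI
  mpirun --version

  # MPI inside the container (image name is illustrative)
  singularity exec cmaq_v532.sif /usr/lib64/openmpi3/bin/mpirun --version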

Note: running the Singularity mvapich_gcc container on another c5.4xlarge machine gave a different timing result:

    Date and time 0:00:00   July 2, 2016   (2016184:000000)
    The elapsed time for this simulation was    1995.4 seconds.

Amazon AMI information: hostnamectl

  Static hostname: ip-172-31-92-184.ec2.internal
        Icon name: computer-vm
          Chassis: vm
       Machine ID: 1b448809a4a8468fb63a7b434d20508d
          Boot ID: 692f917b0e404f3195c6711ab81cf3e7
   Virtualization: kvm
 Operating System: Red Hat Enterprise Linux 8.3 (Ootpa)
      CPE OS Name: cpe:/o:redhat:enterprise_linux:8.3:GA
           Kernel: Linux 4.18.0-240.1.1.el8_3.x86_64
     Architecture: x86-64

Native builds of CMAQ_v5.3.2 were done using the following modules:

module avail


-------------------------- /usr/share/Modules/modulefiles --------------------------
dot  module-git  module-info  modules  null  use.own

-------------------------- /etc/modulefiles -----------------------------------------
mpi/mpich-x86_64  mpi/openmpi-x86_64

-------------------------- /usr/share/modulefiles -----------------------------------
pmi/pmix-x86_64
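
For a native build, the relevant MPI module from this listing is loaded before compiling and running; a sketch of the sequence, assuming the Open MPI build is the target:

  module load mpi/openmpi-x86_64
  which mpirun
  mpirun --version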


Using the singularity shell, I checked the version of openmpi within the container:

[cmas@ip-172-31-92-184 Scripts-CMAQ]$ ./singularity-shell.csh
[cmas@ip-172-31-92-184 CMAQv5.3.2_Benchmark_2Day_Input]$ /usr/lib64/openmpi3/bin/mpirun --version
mpirun (Open MPI) 3.1.3

To check the mvapich2 version within the Singularity container, the interactive_slurm_longleaf.csh script was used to log in to the interactive queue:

 #!/bin/csh -f
 srun -t 5:00:00 -p interact -N 1 -n 1 --x11=first --pty /bin/csh

Then the singularity-cctm.csh script was run to log in to the container shell, and the mvapich2 mpirun version was checked:

/usr/lib64/mvapich2/bin/mpirun -version

HYDRA build details:

   Version:                                 3.0.4

I don't know why the mvapich2 and mpich mpirun commands are both linked to the Hydra launcher, whereas the openmpi mpirun is linked to orterun:

 ls -lrt /usr/lib64/openmpi3/bin/mpirun
 lrwxrwxrwx 1 236548 rc_cep-emc_psx  7 Jun 18 2020 /usr/lib64/openmpi3/bin/mpirun -> orterun

 ls -lrt /usr/lib64/mvapich2/bin/mpirun
 lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Jun 18 2020 /usr/lib64/mvapich2/bin/mpirun -> mpiexec.hydra

 ls -lrt /usr/lib64/mpich/bin/mpirun
 lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Apr 11 2020 /usr/lib64/mpich/bin/mpirun -> mpiexec.hydra

Trying to run using openmpi on Dogwood gives the following error in the buf* file in the home directory:

 cat buff_CMAQ_CCTMv532_lizadams_20210119_214931_272845894.txt
 [mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)

In the log file:

 Start Model Run At Tue Jan 19 16:49:31 EST 2021
 VRSN = v532 compiler= gcc APPL = 2016_12SE1

 Working Directory is /opt/CMAQ_532/CCTM/script
 Build Directory is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi
 Output Directory is /opt/CMAQ_532/data/2016_12SE1/cctm/openmpi
 Log Directory is /opt/CMAQ_532/data/2016_12SE1/logs
 Executable Name is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi//opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe

---CMAQ EXECUTION ID: CMAQ_CCTMv532_lizadams_20210119_214931_272845894 ---

Set up input and output files for Day 20160701.

Existing Logs and Output Files for Day 2016-07-01 Will Be Deleted
 /bin/rm: No match.

CMAQ Processing of Day 20160701 Began at Tue Jan 19 16:49:31 EST 2021

[mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)
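
The HYDU_create_process message comes from the Hydra launcher, which suggests the mpirun being picked up is the MPICH/MVAPICH2 one (linked to mpiexec.hydra, as shown above) rather than Open MPI's orterun, and that Hydra is trying to bootstrap through srun, which it cannot find. A quick check of which launcher the run script actually invokes, as a sketch:

  which mpirun
  mpirun --version
  ls -l `which mpirun`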

If I modify the batch script cmaq_cctm.openmpi-hybrid.csh and run with the following modules loaded:

Currently Loaded Modules:
 1) gcc/6.3.0   2) openmpi_3.0.0/gcc_6.3.0

I get the following error:

 [c-201-23:05252] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 223

 *** An error occurred in MPI_Init
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***    and potentially your MPI job)

[c-201-23:05257] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

 *** An error occurred in MPI_Init
 *** on a NULL communicator
 *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 ***    and potentially your MPI job)

If I use the script cmaq_cctm.openmpi-hybrid.csh with the following modules:

1) gcc/9.1.0   2) openmpi_4.0.1/gcc_9.1.0

Then it worked!
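
For completeness, a sketch of the working sequence on Dogwood with those modules; whether the script is submitted with sbatch or run directly depends on the site setup, so the submission line below is only an assumption:

  module purge
  module load gcc/9.1.0 openmpi_4.0.1/gcc_9.1.0
  sbatch cmaq_cctm.openmpi-hybrid.csh   # assumed SLURM submission; adjust to the local workflow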