CMAQv532 on C5.4xlarge

Amazon AMI EC2 Instance: C5.4xlarge (16 processors)

openmpi_4.0.1/gcc_8.3.1

==================================
  ***** CMAQ TIMING REPORT *****
==================================
Start Day: 2016-07-01
End Day:   2016-07-02
Number of Simulation Days: 2
Domain Name:               2016_12SE1
Number of Grid Cells:      280000  (ROW x COL x LAY)
Number of Layers:          35
Number of Processes:       16
   All times are in seconds.

Num  Day        Wall Time
01   2016-07-01   1730.5
02   2016-07-02   1602.3
     Total Time = 3332.80
      Avg. Time = 1666.40

     The elapsed time for this simulation was    1602.3 seconds.

19711.615u 1046.727s 26:42.77 1295.1%   0+0k 6735848+1416040io 6pf+0w

CMAQ Processing of Day 20160702 Finished at Wed Dec 16 18:47:51 UTC 2020

Singularity mvapich
Note: the Singularity CMAQ CCTM build uses the medium memory model.

 X86_64 "Medium memory model" version:  support stack-size,
#  array-size, data-size larger than 2 GB.
#  Use of this opotion requires that "gcc" and "gfortran" thenselves be
#  of version 4.4 or later and have been compiled with  "-mcmodel=medium".
#  See http://eli.thegreenplace.net/2012/01/03/understanding-the-x64-code-models
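To illustrate what that requirement means in practice, here is a minimal sketch (the Fortran file name is hypothetical):

# Confirm the compilers are version 4.4 or later:
gcc --version
gfortran --version

# Compile with the medium memory model so static data/arrays may exceed 2 GB:
gfortran -c -O2 -mcmodel=medium large_static_arrays.f90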

Num    Day         Wall Time
01     2016-07-01   1546.4
02     2016-07-02   1468.0
Total Time = 3015.33
Avg. Time = 1507.66


{| class="wikitable"
|+ Run times (seconds) on 16 PEs on c5.4xlarge
|-
!
! CMAQv5.3.2 (openmpi)
! CMAQv5.3.2 (mpich)
! CMAQv5.3.2 Singularity (openmpi)
! CMAQv5.3.2 Singularity (mvapich)
! CMAQv5.3.2 Singularity (mpich)
! CMAQv5.3.2 Singularity Atmos (openmpi)
! CMAQv5.3.2 Singularity Dogwood (openmpi-hybrid)
|-
! day 1
| 1730.5/1390
| 1779.1/1361.20
| error
| 1546.4/1995.4
| 1564.5
| 1151
| 1436.5
|-
! day 2
| 1602.3
| 1649.6
| error
| 1468.0/1840.83
| 1497.33
|
|
|}

Error for openmpi on C5.4xlarge when attempting to run on 16 processors:

/usr/bin/time -p mpirun -np 16 /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe

[1610650809.265129] [ip-172-31-84-61:6018 :0]  sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'

       CTM_APPL  |  v532_openmpi_gcc_2016_12SE1_20160701

The run then starts on only 1 processor and produces a single log file:

***  ERROR in INIT3/INITLOG3  ***
    Error opening log file on unit        99
    I/O STATUS =        17
    DESCRIPTION: File 'CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701' already exists
    File: CTM_LOG_000.v532_openmpi_gcc_2016_12SE1_20160701

[1610650809.275238] [ip-172-31-84-61:6012 :0] sys.c:618 UCX ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
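The UCX message points at the SysV shared-memory limits, and the INITLOG3 failure is just a leftover log file from the earlier attempt. A minimal diagnostic/cleanup sketch (the sysctl value shown is illustrative only and would require root):

ipcs -l                                            # current shared-memory limits, as the error message suggests
sysctl kernel.shmmax kernel.shmall kernel.shmmni   # the kernel parameters behind those limits
# sudo sysctl -w kernel.shmmax=68719476736         # example of raising the maximum segment size
rm CTM_LOG_*.v532_openmpi_gcc_2016_12SE1_20160701  # clear stale per-PE logs before rerunning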

top shows that the job is running on just 1 processor:

top - 19:05:25 up 50 min,  2 users,  load average: 1.48, 0.65, 1.04
Tasks: 229 total,  3 running, 211 sleeping,  0 stopped,  15 zombie
%Cpu(s):  8.9 us,  3.5 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem :  31157.2 total,  24247.2 free,  5918.4 used,    991.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  24847.4 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                            
  6160 cmas      20   0   25324   2468   2180 R  99.7   0.0   1:17.64 hydra_pmi_proxy                                                    
  6161 cmas      20   0 6415232   5.5g  19592 R  99.3  18.0   1:17.63 CCTM_v532.exe                                                      
  6219 cmas      20   0   65520   4800   3916 R   0.3   0.0   0:00.01 top                                                                
     1 root      20   0  244840  13440   9088 S   0.0   0.0   0:03.11 systemd

The version of Open MPI on the Amazon EC2 instance is:

[cmas@ip-172-31-92-184 Scripts-CMAQ]$ mpirun --version
mpirun (Open MPI) 4.0.3
gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

According to the Singularity MPI troubleshooting tips, the version of MPI on the host machine should match the version inside the container: https://sylabs.io/guides/3.7/user-guide/mpi.html#troubleshooting-tips
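A quick way to check this is to compare the two versions side by side; a sketch, with a hypothetical container image name:

mpirun --version                                                        # host Open MPI (4.0.3 here)
singularity exec cmaq_532.sif /usr/lib64/openmpi3/bin/mpirun --version  # container Open MPI (3.1.3 here)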

Note: running the Singularity mvapich_gcc container on another c5.4xlarge machine gave a different timing result:

    Date and time 0:00:00   July 2, 2016   (2016184:000000)
    The elapsed time for this simulation was    1995.4 seconds.

Amazon AMI information:

hostnamectl

  Static hostname: ip-172-31-92-184.ec2.internal
        Icon name: computer-vm
          Chassis: vm
       Machine ID: 1b448809a4a8468fb63a7b434d20508d
          Boot ID: 692f917b0e404f3195c6711ab81cf3e7
   Virtualization: kvm
 Operating System: Red Hat Enterprise Linux 8.3 (Ootpa)
      CPE OS Name: cpe:/o:redhat:enterprise_linux:8.3:GA
           Kernel: Linux 4.18.0-240.1.1.el8_3.x86_64
     Architecture: x86-64

Native builds of CMAQ_v5.3.2 were done using the following modules:

module avail


/usr/share/Modules/modulefiles:
dot  module-git  module-info  modules  null  use.own

/etc/modulefiles:
mpi/mpich-x86_64  mpi/openmpi-x86_64

/usr/share/modulefiles:
pmi/pmix-x86_64
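For the native builds, one of those MPI modules would be loaded before compiling; a minimal sketch (module names taken from the listing above):

module load mpi/openmpi-x86_64   # or: module load mpi/mpich-x86_64
which mpirun                     # confirm the MPI wrappers are now on PATH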


Using the Singularity shell, I checked the version of Open MPI within the container:

[cmas@ip-172-31-92-184 Scripts-CMAQ]$ ./singularity-shell.csh
[cmas@ip-172-31-92-184 CMAQv5.3.2_Benchmark_2Day_Input]$ /usr/lib64/openmpi3/bin/mpirun --version
mpirun (Open MPI) 3.1.3

To check the mvapich2 version within the Singularity container, I used the interactive_slurm_longleaf.csh script to log in to the interactive queue:

#!/bin/csh -f

srun -t 5:00:00 -p interact -N 1 -n 1 --x11=first --pty /bin/csh

Then I ran the singularity-cctm.csh script to log in to the container shell:

/usr/lib64/mvapich2/bin/mpirun -version

HYDRA build details:

   Version:                                 3.0.4

I don't know why the mvapich2 and mpich mpirun commands are both linked to hydra; presumably because MVAPICH2 is derived from MPICH, and both use the Hydra process manager (mpiexec.hydra).

Whereas openmpi is linked to orterun:

ls -lrt /usr/lib64/openmpi3/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx 7 Jun 18  2020 /usr/lib64/openmpi3/bin/mpirun -> orterun

ls -lrt /usr/lib64/mvapich2/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Jun 18  2020 /usr/lib64/mvapich2/bin/mpirun -> mpiexec.hydra

ls -lrt /usr/lib64/mpich/bin/mpirun
lrwxrwxrwx 1 236548 rc_cep-emc_psx 13 Apr 11  2020 /usr/lib64/mpich/bin/mpirun -> mpiexec.hydra
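Another way to confirm which launcher each wrapper ultimately resolves to (a small sketch; readlink -f follows the whole symlink chain):

readlink -f /usr/lib64/openmpi3/bin/mpirun
readlink -f /usr/lib64/mvapich2/bin/mpirun
readlink -f /usr/lib64/mpich/bin/mpirun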

Trying to run using openmpi on Dogwood gives the following error in the buf* file in the home directory:

cat buff_CMAQ_CCTMv532_lizadams_20210119_214931_272845894.txt
[mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)

In the log file:

Start Model Run At  Tue Jan 19 16:49:31 EST 2021
VRSN    = v532
compiler= gcc
APPL    = 2016_12SE1

Working Directory is /opt/CMAQ_532/CCTM/script
Build Directory   is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi
Output Directory  is /opt/CMAQ_532/data/2016_12SE1/cctm/openmpi
Log Directory     is /opt/CMAQ_532/data/2016_12SE1/logs
Executable Name   is /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi//opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe

---CMAQ EXECUTION ID: CMAQ_CCTMv532_lizadams_20210119_214931_272845894 ---

Set up input and output files for Day 20160701.

Existing Logs and Output Files for Day 2016-07-01 Will Be Deleted
/bin/rm: No match.

CMAQ Processing of Day 20160701 Began at Tue Jan 19 16:49:31 EST 2021

[mpiexec@c-202-3] HYDU_create_process (./utils/launch/launch.c:75): execvp error on file srun (No such file or directory)
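The Hydra-based mpiexec is apparently trying to launch ranks through srun, which is not on its PATH. A couple of things one could try, sketched only (the executable path is the one used above, and -launcher fork is single-node only):

which srun   # confirm whether srun is actually visible from where mpiexec runs
mpiexec.hydra -launcher fork -np 16 /opt/CMAQ_532/CCTM/scripts/BLD_CCTM_v532_gcc-openmpi/CCTM_v532.exe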

If I modify the batch script cmaq_cctm.openmpi-hybrid.csh and run using the following modules:

Currently Loaded Modules:

 1) gcc/6.3.0   2) openmpi_3.0.0/gcc_6.3.0

I get the following error:

[c-201-23:05252] PMIX ERROR: OUT-OF-RESOURCE in file client/pmix_client.c at line 223

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

[c-201-23:05257] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

If I use the script cmaq_cctm.openmpi-hybrid.csh with the following modules:

1) gcc/9.1.0   2) openmpi_4.0.1/gcc_9.1.0

Then it worked!
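A minimal sketch of the working setup on Dogwood (the sbatch submission step is an assumption; the notes above only confirm the module combination):

module purge
module load gcc/9.1.0 openmpi_4.0.1/gcc_9.1.0
sbatch cmaq_cctm.openmpi-hybrid.csh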