Case: v3.ne256.wcycl on Aurora
Below is a comparison of various PE-layouts to show speedups from exclusive process strides.
Case: --compset WCYCLXX2010 --res ne256pg2_RRSwISC6to18E3r5 --machine aurora --compiler oneapi-ifxgpu --mpilib mpich1024
PE-layouts:
Stacked, 12 processes per node (ppn), 128 nodes, 1536 tasks: https://pace.ornl.gov/exp-details/221232
> ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE  PSTRIDE
CPL :   1536/     1;       0        1
ATM :   1536/     1;       0        1
LND :   1536/     1;       0        1
ICE :   1536/     1;       0        1
OCN :   1536/     1;       0        1
ROF :   1536/     1;       0        1
GLC :      1/     1;       0        1
WAV :      1/     1;       0        1
IAC :      1/     1;       0        1
ESP :      1/     1;       0        1

component       comp_pes  root_pe  tasks  x threads  instances  (stride)
---------       --------  -------  -----  ---------  ---------  --------
cpl = cpl       1536      0        1536   x 1        1          (1)
atm = scream    1536      0        1536   x 1        1          (1)
lnd = elm       1536      0        1536   x 1        1          (1)
ice = mpassi    1536      0        1536   x 1        1          (1)
ocn = mpaso     1536      0        1536   x 1        1          (1)
rof = mosart    1536      0        1536   x 1        1          (1)

TOT Run Time:    933.535 seconds   466.768 seconds/mday     0.51 myears/wday
CPL Run Time:     91.161 seconds    45.581 seconds/mday     5.19 myears/wday
ATM Run Time:    301.246 seconds   150.623 seconds/mday     1.57 myears/wday
LND Run Time:     11.479 seconds     5.739 seconds/mday    41.24 myears/wday
ICE Run Time:    158.867 seconds    79.433 seconds/mday     2.98 myears/wday
OCN Run Time:    388.296 seconds   194.148 seconds/mday     1.22 myears/wday
ROF Run Time:      0.901 seconds     0.451 seconds/mday   525.44 myears/wday
CPL COMM Time:    28.207 seconds    14.104 seconds/mday    16.78 myears/wday

N.B.: the run above is 2 model-days long (job wall-clock timeout); all other runs below are 5 model-days.
This leaves most of the CPU cores on each node idle (96-12=84 cores) – 0.51 sypd.
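The sypd figure quoted for each layout is the TOT myears/wday value, which can be recovered from the seconds/mday column. A minimal sketch of that conversion, assuming a 365-day model year (an assumption that reproduces the reported numbers); the function name is illustrative only:

```python
SECONDS_PER_WALL_DAY = 86400
MODEL_DAYS_PER_YEAR = 365  # assumption: 365-day model year

def sypd(seconds_per_model_day):
    """Simulated years per wall-clock day from wall seconds per model day."""
    return SECONDS_PER_WALL_DAY / (seconds_per_model_day * MODEL_DAYS_PER_YEAR)

# TOT seconds/mday of the stacked layout
print(f"{sypd(466.768):.2f}")  # 0.51
```

Working from seconds/mday rather than total run time keeps runs of different lengths comparable.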
Strided, 96 ppn, 128 nodes, 12288 tasks, PSTRID_[ATM,ROF,LND]=8: https://pace.ornl.gov/exp-details/221236
> ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE  PSTRIDE
CPL :  12288/     1;       0        1
ATM :   1536/     1;       0        8
LND :   1536/     1;       0        8
ICE :  12288/     1;       0        1
OCN :  12288/     1;       0        1
ROF :   1536/     1;       0        8
GLC :      1/     1;       0        1
WAV :      1/     1;       0        1
IAC :      1/     1;       0        1
ESP :      1/     1;       0        1

component       comp_pes  root_pe  tasks  x threads  instances  (stride)
---------       --------  -------  -----  ---------  ---------  --------
cpl = cpl       12288     0        12288  x 1        1          (1)
atm = scream    1536      0        1536   x 1        1          (8)
lnd = elm       1536      0        1536   x 1        1          (8)
ice = mpassi    12288     0        12288  x 1        1          (1)
ocn = mpaso     12288     0        12288  x 1        1          (1)
rof = mosart    1536      0        1536   x 1        1          (8)

TOT Run Time:   1396.611 seconds   279.322 seconds/mday     0.85 myears/wday
CPL Run Time:    233.297 seconds    46.659 seconds/mday     5.07 myears/wday
ATM Run Time:    747.738 seconds   149.548 seconds/mday     1.58 myears/wday
LND Run Time:     29.072 seconds     5.814 seconds/mday    40.71 myears/wday
ICE Run Time:    126.238 seconds    25.248 seconds/mday     9.38 myears/wday
OCN Run Time:    266.517 seconds    53.303 seconds/mday     4.44 myears/wday
ROF Run Time:      2.224 seconds     0.445 seconds/mday   532.18 myears/wday
CPL COMM Time:   825.314 seconds   165.063 seconds/mday     1.43 myears/wday

This lets OCN and ICE run on all 96 CPU cores per node: both components speed up by 3-4x.
However, ATM and OCN are on the same cores and therefore do not run concurrently – 0.85 sypd.
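The 3-4x figure can be checked directly from the seconds/mday columns of the stacked and strided runs. The short sketch below uses seconds/mday rather than total run time because the two runs cover different numbers of model days; the dictionary and function names are illustrative only:

```python
# seconds/mday from the stacked (1536 tasks) and strided (12288 tasks) runs
stacked = {"ATM": 150.623, "ICE": 79.433, "OCN": 194.148}
strided = {"ATM": 149.548, "ICE": 25.248, "OCN": 53.303}

def speedups(before, after):
    """Per-component speedup: time before divided by time after."""
    return {comp: before[comp] / after[comp] for comp in before}

for comp, factor in speedups(stacked, strided).items():
    print(f"{comp}: {factor:.2f}x")
# ATM: 1.01x
# ICE: 3.15x
# OCN: 3.64x
```

ATM is essentially unchanged because it still runs on 1536 tasks in both layouts.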
Concurrent, 96 ppn, 128 nodes, 12288 tasks, PSTRID_ATM=8, EXCL_STRIDE_ATM=8: https://pace.ornl.gov/exp-details/221387
> ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE  PSTRIDE
CPL :  12288/     1;       0        1
ATM :   1536/     1;       0        8
LND :  12288/     1;       0        1
ICE :  12288/     1;       0        1
OCN :  12288/     1;       0        1
ROF :  12288/     1;       0        1
GLC :      1/     1;       1        1
WAV :      1/     1;       1        1
IAC :      1/     1;       1        1
ESP :      1/     1;       1        1

component       comp_pes  root_pe  tasks  x threads  instances  (stride)
---------       --------  -------  -----  ---------  ---------  --------
cpl = cpl       12288     0        12288  x 1        1          (1)
atm = scream    1536      0        1536   x 1        1          (8)
lnd = elm       12288     0        12288  x 1        1          (1)
ice = mpassi    12288     0        12288  x 1        1          (1)
ocn = mpaso     12288     0        12288  x 1        1          (1)
rof = mosart    12288     0        12288  x 1        1          (1)

TOT Run Time:   1173.311 seconds   234.662 seconds/mday     1.01 myears/wday
CPL Run Time:    225.460 seconds    45.092 seconds/mday     5.25 myears/wday
ATM Run Time:    779.111 seconds   155.822 seconds/mday     1.52 myears/wday
LND Run Time:      8.133 seconds     1.627 seconds/mday   145.53 myears/wday
ICE Run Time:    131.764 seconds    26.353 seconds/mday     8.98 myears/wday
OCN Run Time:    336.009 seconds    67.202 seconds/mday     3.52 myears/wday
ROF Run Time:      4.742 seconds     0.948 seconds/mday   249.59 myears/wday
CPL COMM Time:   517.753 seconds   103.551 seconds/mday     2.29 myears/wday

This lets ATM (and no other component) run on the stride=8 processes, concurrently with OCN – 1.01 sypd.
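To make the stride concrete, here is a rough sketch of which global MPI ranks ATM would occupy. It assumes the usual convention that a component with root PE r and stride s sits on ranks r, r+s, r+2s, ..., and that ranks are packed 96 per node in order; both are assumptions, and the helper name is hypothetical. It does not model how EXCL_STRIDE_ATM keeps other components off those ranks.

```python
PPN = 96  # processes per node in this layout

def component_ranks(ntasks, rootpe, pstride):
    """Global ranks of a component, assuming rootpe + i * pstride placement."""
    return [rootpe + i * pstride for i in range(ntasks)]

atm = component_ranks(ntasks=1536, rootpe=0, pstride=8)

# Assumption: ranks 0..95 live on node 0, 96..191 on node 1, and so on.
atm_on_node0 = [r for r in atm if r // PPN == 0]
print(atm_on_node0)           # [0, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88]
print(len(atm_on_node0))      # 12
print(max(atm) // PPN + 1)    # 128
```

Under these assumptions ATM keeps the same 12 processes per node across 128 nodes as in the stacked layout, leaving the remaining 84 ranks per node for the other components.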
Disjoint-nodes, 96 ppn, 128-atm-nodes + 112-ocn-nodes = 240 nodes in all, 23040 tasks, PSTRID_ATM=8: https://pace.ornl.gov/exp-details/221430
> ./pelayout
Comp  NTASKS  NTHRDS  ROOTPE  PSTRIDE
CPL :  12288/     1;       0        1
ATM :   1536/     1;       0        8
LND :  12288/     1;       0        1
ICE :  12288/     1;       0        1
OCN :  10752/     1;   12288        1
ROF :  12288/     1;       0        1
GLC :      1/     1;       1        1
WAV :      1/     1;       1        1
IAC :      1/     1;       1        1
ESP :      1/     1;       1        1

component       comp_pes  root_pe  tasks  x threads  instances  (stride)
---------       --------  -------  -----  ---------  ---------  --------
cpl = cpl       12288     0        12288  x 1        1          (1)
atm = scream    1536      0        1536   x 1        1          (8)
lnd = elm       12288     0        12288  x 1        1          (1)
ice = mpassi    12288     0        12288  x 1        1          (1)
ocn = mpaso     10752     12288    10752  x 1        1          (1)
rof = mosart    12288     0        12288  x 1        1          (1)

TOT Run Time:   1134.538 seconds   226.908 seconds/mday     1.04 myears/wday
CPL Run Time:    232.226 seconds    46.445 seconds/mday     5.10 myears/wday
ATM Run Time:    730.494 seconds   146.099 seconds/mday     1.62 myears/wday
LND Run Time:      8.203 seconds     1.641 seconds/mday   144.28 myears/wday
ICE Run Time:    138.069 seconds    27.614 seconds/mday     8.57 myears/wday
OCN Run Time:    284.616 seconds    56.923 seconds/mday     4.16 myears/wday
ROF Run Time:      4.631 seconds     0.926 seconds/mday   255.57 myears/wday
CPL COMM Time:   779.391 seconds   155.878 seconds/mday     1.52 myears/wday

This is a typical layout with OCN on its own nodes to run concurrently with all other components.
Throughput of 1.04 sypd is nearly the same as the 1.01 sypd of layout #3 (Concurrent) above, at nearly twice the cost in nodes (240 vs. 128).
This shows that the exclusive-stride layout #3 is just as effective as the disjoint-node PE-layout.
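The cost side of this comparison can be expressed as node-days of machine time per simulated year. The sketch below uses only the node counts and sypd values reported above; the metric and the names are illustrative, not something produced by the timing output:

```python
layouts = {
    "concurrent (#3)": {"nodes": 128, "sypd": 1.01},
    "disjoint-nodes (#4)": {"nodes": 240, "sypd": 1.04},
}

def node_days_per_sim_year(nodes, sypd):
    """Node-days of wall-clock machine time needed per simulated year."""
    return nodes / sypd

cost = {name: node_days_per_sim_year(**v) for name, v in layouts.items()}
for name, c in cost.items():
    print(f"{name}: {c:.1f} node-days per simulated year")
# concurrent (#3): 126.7 node-days per simulated year
# disjoint-nodes (#4): 230.8 node-days per simulated year

ratio = cost["disjoint-nodes (#4)"] / cost["concurrent (#3)"]
print(f"{ratio:.2f}x")  # 1.82x
```

By this measure the disjoint-node layout costs about 1.8 times as much per simulated year for a throughput gain of about 3%.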