Aurora

Running with profilers

aps/vtune

APS https://docs.alcf.anl.gov/aurora/performance-tools/aps/

Build a case as you normally do.

Before submitting a job, edit env_mach_specific.xml in your $CASEDIR.

Change “run_exe” to add aps:

<entry id="run_exe" value="aps -r resultsdir ${EXEROOT}/e3sm.exe ">

“resultsdir” can be any name that makes sense. That directory will be in your $RUNDIR.

No need to add any modules because aps/vtune are included in the oneapi module.

Submit your job as usual.
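For a standard case this is just (run from your $CASEDIR):

> ./case.submit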

When finished, go to the $RUNDIR and execute:

> aps-report resultsdir

The result will be an HTML file you can view in any browser.

aps can be run at scale but has limited analysis options. If you can get your problem down to 1 or 2 nodes, you can use VTune https://docs.alcf.anl.gov/aurora/performance-tools/vtune/ .


To profile with VTune, substitute “vtune” for “aps” in run_exe. The results must be visualized with the VTune Profiler Web Server; see the instructions at https://docs.alcf.anl.gov/aurora/performance-tools/vtune/#after-collecting-the-performance-data-vtune-profiler-web-server-can-be-used-for-the-post-processing
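For example, one concrete form (a sketch assuming the gpu-hotspots collection type; other VTune analysis types can be substituted, and “vtune_results” is an arbitrary result-directory name):

<entry id="run_exe" value="vtune -collect gpu-hotspots -r vtune_results -- ${EXEROOT}/e3sm.exe ">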


hpctoolkit

Build a case as you normally do.

Before submitting a job, edit env_mach_specific.xml in your $CASEDIR.

Change “run_exe” to add hpcrun:

<entry id="run_exe" value="hpcrun -e CPUTIME -e gpu=level0 -tt ${EXEROOT}/e3sm.exe ">
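Roughly: `-e CPUTIME` samples host CPU time, `-e gpu=level0` monitors GPU activity through the Level Zero runtime, and `-tt` requests detailed tracing (see `hpcrun --help` for the exact semantics).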

A few lines below that, add the necessary modules:

<modules>
<command name="load">cmake/3.30.5</command>
<command name="load">oneapi/release/2025.0.5</command>
<command name="use">/soft/perftools/hpctoolkit/modulefiles</command>
<command name="load">hpctoolkit/2025-mainline</command>
</modules>

Submit your job as usual.

When finished, the $RUNDIR will have an extra directory with the hpctoolkit trace info, something like “hpctoolkit-e3sm.exe-measurements-4916107.aurora”.

For visualization in hpcviewer, these measurements need to be postprocessed. To do that, load a few modules at the command line:

> module use /soft/perftools/hpctoolkit/modulefiles
> module load hpctoolkit/2025-mainline

Then run these commands:

hpcstruct --gpucfg yes hpctoolkit-e3sm.exe-measurements-4916107.aurora
hpcprof hpctoolkit-e3sm.exe-measurements-4916107.aurora

The final hpcprof command will output a performance database right beside the measurements, with the "measurements" in the directory name replaced with "database". This database can be viewed with hpcviewer.
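For example (a sketch, assuming hpcviewer is installed where you run it; the database name follows the measurement directory above):

> hpcviewer hpctoolkit-e3sm.exe-database-4916107.aurora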

hpcstruct is very slow. It can be skipped at the cost of analysis quality (you lose loop nests and GPU calling contexts). Alternatively, if you export HPCTOOLKIT_HPCSTRUCT_CACHE=... to some path, hpcstruct will cache results from analyzed binaries and reuse them on subsequent runs.
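For example, a minimal sketch (the cache path is arbitrary; any writable directory works):

> export HPCTOOLKIT_HPCSTRUCT_CACHE=$HOME/.hpcstruct-cache
> hpcstruct --gpucfg yes hpctoolkit-e3sm.exe-measurements-4916107.aurora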

iprof

Build a case as you normally do.

Before submitting, edit env_mach_specific.xml in your $CASEDIR.

Change “run_exe” to add iprof. Adding “iprof” by itself will produce a summary at the end of the run in e3sm.log:

<entry id="run_exe" value="iprof ${EXEROOT}/e3sm.exe ">

Just below that line, add the module load of “thapi”:

<modules>
<command name="load">cmake/3.30.5</command>
<command name="load">oneapi/release/2025.0.5</command>
<command name="load">thapi</command>
</modules>

IMPORTANT: loading thapi messes up the Perl path, so “./preview_namelists” won’t work. To start your run, do:

./case.submit --skip-preview-namelist

iprof summary info

Sample summary output from a run of --res ne4pg2_ne4pg2 --compset F2010-SCREAMv1 on 1 node of Aurora.

iprof timeline trace

Sample iprof trace file from the same case. To generate one, change the executable to:

<entry id="run_exe" value="iprof -l -- ${EXEROOT}/e3sm.exe">

To visualize, download the trace locally, go to https://ui.perfetto.dev/, and drop the file in the window. Chrome is recommended. You may also be asked to download a local trace-file parser to avoid browser memory limits.
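For example, a hypothetical copy from Aurora to your workstation (adjust the path to your $RUNDIR; out.pftrace is the default timeline file name):

> scp aurora.alcf.anl.gov:/path/to/rundir/out.pftrace .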

iprof help

Usage: iprof [options] [--] [command]

--trace-output PATH: Define where the CTF trace will be saved. Default: `$THAPI_HOME/thapi-traces/thapi--[trace-type][date]` (`$THAPI_HOME` defaults to `$HOME`, and `date` is formatted using the ISO 8601 convention).
--analysis-output PATH: Define where the analysis output (summary, pretty printing, etc.) will be saved. Default: printed to `stdout`.
-m, --tracing-mode MODE: Define the category of events to trace. Values allowed: ["minimal", "default", "full"]. Default: default.
--traced-ranks RANK: Select which MPI ranks will be traced. Use -1 to trace all ranks. Default: -1.
--[no-]profile: Enable or disable device profiling. Default: true. (only what happens on host)
--[no-]analysis: Enable or disable analysis of the LTTng trace. Default: true. (no analysis would dump the raw trace)
-b, --backends BACKENDS: Select which backends to use and their grouping level. Format: backend_name[:backend_level],... Default: mpi:3,omp:2,cl:1,ze:1,cuda:1,hip:1. (useful for cuda-on-Level0)
--[no-]archive: Enable or disable archive support. Default: false. (not quite working)
-r, --replay [PATH]: Replay traces for post-mortem analysis. If `PATH` is omitted, it defaults to the newest trace in `$HOME/thapi-traces/`.
-t, --trace: Pretty-print the LTTng trace.
-l, --timeline [PATH]: Dump the trace timeline to a binary file. If `PATH` is omitted, defaults to `out.pftrace`. Open with Perfetto: `https://ui.perfetto.dev/#!/viewer`.
-j, --json: Output the tally in JSON format.
-e, --extended: Print the tally for each Hostname / Process / Thread / Device. (prints one table per rank)
-k, --kernel-verbose: The tally will report kernel execution times with SIMD width and global/local sizes. (splits them out by size)
--max-name-size SIZE: Set the maximum allowed kernel name size. Use -1 for no limit. Default: 80.
-s, --sample: Enable counters sampling. (for GPU hardware counter profiling)
--metadata: Display trace metadata. (not used)
-v, --version: Print the version string.
-h, --help: Display this message.
--debug [LEVEL]: Set the debug level. If `LEVEL` is omitted, it defaults to 1. Default: 3.
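For example, to reduce overhead by tracing only MPI rank 0 (a sketch based solely on the options above):

<entry id="run_exe" value="iprof --traced-ranks 0 -- ${EXEROOT}/e3sm.exe ">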


unitrace

Build a case as you normally do.

Before submitting, edit env_mach_specific.xml in your $CASEDIR.

Change “run_exe” to add unitrace. Adding “unitrace” by itself will produce a per-rank summary of Level Zero API calls:

<entry id="run_exe" value="unitrace ${EXEROOT}/e3sm.exe ">

Just below that line, add the module load of “pti-gpu”:

<modules>
<command name="load">cmake/3.30.5</command>
<command name="load">oneapi/release/2025.0.5</command>
<command name="load">pti-gpu</command>
</modules>

Submit your job as usual. Output will be at the end of e3sm.log.

More options for unitrace are documented at https://github.com/intel/pti-gpu/tree/master/tools/unitrace
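For example, host- and device-timing summaries could be requested like this (a sketch using option names from that README; verify the exact flags with unitrace --help on your system):

<entry id="run_exe" value="unitrace -h -d ${EXEROOT}/e3sm.exe ">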