Configuring lmf and QSGW for HPC architectures

Introduction

Questaal’s codes use a mix of MPI, OpenMP, and GPU accelerator cards, often in combination, depending on the application. Some of the parallelization is hidden from the user, but some requires explicit user input. For heavy calculations, special care must be taken to get optimum performance.

The Figure shows a schematic of a single node on an HPC machine, and processes running on it. Hardware consists of one or more sockets: S1 and S2 are depicted as blue ovals, and memory m as blue lines. Each socket has a fixed number of cores Nc to do the computations. Memory is physically connected to a particular socket.

When a job is launched, it starts a single process in serial mode or, when running under MPI, np processes, where you specify np at launch time. A single process (shown as a pink line marked p) is assigned to a core; the process can request some memory, shown as a pink cloud, which may grow or shrink in the course of execution. Since each MPI process is assigned its own memory, the total memory consumption scales in proportion to the number of MPI processes. This is a major drawback of MPI: the more processes you run, the less memory is available per process. It can be partially mitigated by writing the code with distributed memory: arrays or other data are divided up, and a particular process operates mostly on its portion of the entire dataset. When it needs a portion owned by another process, processes can send and receive data via communicators. This exchange is relatively slow, so the less a code relies on interprocessor communication, the more efficiently it executes.

When blocks of code perform essentially the same function on different data, the blocks can be executed in parallel. A single MPI process p can temporarily spawn multiple threads t as shown in the figure; these are subprocesses that share memory. OpenMP executes this way, as do multithreaded math libraries such as Intel MKL. The big advantage of multithreading is the shared memory. On the other hand, threads are less versatile than separate processes: OpenMP and multithreaded libraries are most effective when every thread is doing an essentially similar thing, e.g. running the same loop with different data.

Questaal codes use multiple levels of parallelization: MPI as the outermost level, and multithreading within each MPI process in critical portions of the code. The multithreading may be through OpenMP with nt threads and/or by calling multithreaded libraries with nl threads. Since these two kinds of threading are not normally active at the same time, you typically run jobs with nt=nl, and this tutorial will assume that to be the case.

You can spread the MPI processes over multiple nodes, which allows more memory to be assigned to a particular process. Thus to run a particular job you must specify:

1. nn  the number of nodes
2. nc/nn  the number of cores/node, which is typically fixed by the hardware and equal to Nc (the number of hardware cores on a node).
3. np  the number of processes
4. nt  the number of threads

You should always take care that np×nt ≤ nc. This is not a strict requirement, but running more processes than there are hardware cores is very inefficient, since the machine has to constantly swap processes in and out of cores during execution.
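Before submitting a job it is worth checking this arithmetic. A minimal bash sketch (the values of nc, np, and nt are placeholders; substitute your own):

```shell
# Placeholder values; substitute the core count and process/thread split for your job
nc=32        # hardware cores available to the job
np=8         # MPI processes
nt=4         # threads per MPI process

if [ $((np * nt)) -le "$nc" ]; then
  echo "OK: np*nt = $((np * nt)) fits in $nc cores"
else
  echo "Oversubscribed: np*nt = $((np * nt)) exceeds $nc cores" >&2
fi
```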

Tutorial

Jobs on HPC machines are usually submitted through a batch queuing system. You must create a batch script (usually a bash script) which has embedded in it some extra special instructions to the queuing system scheduler. (How the special instructions are embedded depends on the queuing system; a typical one is the slurm system, which interprets lines beginning with #SBATCH as an instruction to the scheduler instead of a comment. Apart from the special instructions, the script is a normal bash script.) These special instructions must specify, as a minimum, the number of nodes nn. You may also have to specify the number of cores nc although, as noted, it is typically fixed at Nc×nn. np and nt are specific to an MPI process and are not supplied to the batch scheduler.

The script consists of a sequence of instructions: shell commands such as cp ctrl.cri3 ctrl.x, executables such as lmf ctrl.x, or other scripts such as the main GW script, lmgw.sh, described below.

An executable may be an MPI job. When launching any job (MPI or serial) you must specify the number of threads nt. Do this by setting environment variables in your script

OMP_NUM_THREADS=#           ← for OpenMP
MKL_NUM_THREADS=#           ← for the MKL library, if you are using it


Here #=nt. You also must specify the number of MPI processes, np. Usually MPI jobs are launched by mpirun (e.g. mpirun -n # ... where #=np), though a particular machine may have its own launcher (for example the juwels supercomputer uses srun).

Note: MPI processes can be distributed over multiple nodes, but there is an associated cost: when data must be distributed between processes or gathered together from multiple processes, the transfer is much slower when processes are on different nodes. There is both a latency (the time to initiate a transfer) and a finite transfer rate; both are hardware-dependent. Thus it is advisable to put some thought into how many nodes and cores are appropriate for a particular job.
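To get a feel for how these two costs combine, a toy model estimates the time to move one message as latency + size/bandwidth. The numbers below are illustrative assumptions, not measurements of any particular interconnect:

```shell
# Toy numbers (all assumed): 2 us latency, 10 GB/s bandwidth, 8 MB message
awk 'BEGIN { lat_us = 2; bw = 10e9; size = 8e6
             printf "transfer time: %.0f us\n", lat_us + size/bw*1e6 }'
# prints: transfer time: 802 us
```

For large messages the bandwidth term dominates; for many small messages the latency term does, which is why collective operations over many nodes are costly.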

Graphical Processing Units (GPUs)

Some of Questaal’s codes (especially the GW suite) can make use of GPUs. To use them, you typically must specify to the job scheduler how many GPUs you require. You must also specify to the MPI launcher how many GPUs the job requires. An example for the Juwels booster is given below.

lmf with MPI

Questaal’s main one-body code, lmf, will use MPI to parallelize over the irreducible k points. You can run lmf with MPI using any number of processes np between 1 and the number of k points nk. Invoking lmf with np > nk will cause it to abort.

To see how many irreducible k points your setup has, run lmf with --quit=ham and look for a line like this one

 BZMESH: 15 irreducible QP from 100 ( 5 5 4 )  shift= F F F


This setup has 15 irreducible points, so you can run lmf with something like mpirun -n 15 lmf ...

To also make use of multithreading, you can set OMP_NUM_THREADS and MKL_NUM_THREADS. As noted earlier, np×nt should not exceed the number of hardware cores you have. Suppose your machine has 32 cores on a node, and you are using one node. Then for the preceding example you can run lmf this way: OMP_NUM_THREADS=2 MKL_NUM_THREADS=2 mpirun -n 15 lmf .... Since 2×15 ≤ 32, the total number of threads will not exceed the number of hardware cores. To balance the load well, try to make nk/np close to an integer.
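To see the balance concretely, here is a small sketch (assuming the 15-point, 32-core example above) that tabulates, for a few candidate values of np, the threads left per process and the number of k-point rounds ceil(nk/np) each process performs. np=5 and np=15 divide the 15 points evenly:

```shell
# Assumed values from the example above: nk = 15 irreducible points, 32 cores
nk=15; nc=32
for np in 4 5 8 15; do
  nt=$((nc / np))                    # threads per process that still fit
  rounds=$(( (nk + np - 1) / np ))   # ceil(nk/np): k-point rounds per process
  echo "np=$np  nt=$nt  rounds=$rounds"
done
```

With np=8, each process does 2 rounds but one round is only half filled; with np=5 every process does exactly 3 k points.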

lmf has two branches embedded within itself: a “traditional” branch which is comparatively slow but memory efficient, and a fast branch which can be much faster but consumes a great deal more memory. By default the slow branch executes; to invoke the fast branch, include --v8 on the command line. Both branches use standard BLAS library calls such as zgemm, for which multithreaded versions exist, e.g. MKL. The fast branch also makes use of some OpenMP parallelization, and moreover some of the critical steps (e.g. the FFT and matrix multiply) have been implemented to run on Nvidia V100 and A100 cards as well. With the --v8 switch, GPUs enhance the performance of lmf immensely. Thus it is very desirable to use the --v8 branch, but there are some limitations. First, the larger memory usage may preclude its use on smaller machines for large unit cells; also, some of the functionality of the traditional branch has not yet been ported to the fast branch.

Tips

1. For small and moderately sized cells, MPI parallelization is more efficient than OMP. In this case it is best to weight the (np, nt) combination heavily in favor of np, subject to the restriction np ≤ nk.

2. If you need to save memory, first use BZ_METAL=2 or (for a system with a large density-of-states at the Fermi level) BZ_METAL=3.

3. When using lmf for large cells, it may be necessary to keep the number of processes per node low. This need not be an issue if you have a machine with multiple nodes, as you can use as few as one process per node. For example, in a calculation of Cd3As2, a system with an 80-atom unit cell requiring 8 or more irreducible k points, the following strategy was used.
• On a single-node machine with 32 cores, BZ_METAL was set to 2 and lmf was run in the traditional mode with
env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 mpirun -n 8 lmf ...
• On the Juwels booster, a many-node machine with 48 cores/node, the same calculation was run on 4 nodes with
env OMP_NUM_THREADS=48 MKL_NUM_THREADS=48 srun -n 4 lmf --v8 ...
4. Forces with --v8. The calculation of forces has not yet been optimized for the --v8 option, and requiring them in that mode can significantly increase the calculation time, particularly if you are using GPUs. If you don’t need forces, set HAM_FORCES=0; if you do, set HAM_NFORCE nonzero, e.g. HAM_NFORCE=99. Then lmf will make the system self-consistent before calculating the forces, so they are calculated only at the end and the net cost of calculating them is much smaller.
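The two settings in tip 4 are tokens in the HAM category of the ctrl file. A minimal sketch (token placement assumed here; check against the layout of your own ctrl file):

```
# In the HAM category of the ctrl file:
HAM   FORCES=0              # forces not needed: skip them entirely
# or, if forces are needed, defer them until self-consistency is reached:
HAM   FORCES=1 NFORCE=99
```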

The lmgw.sh script

The lmgw.sh script is designed to run QSGW on HPC machines, as explained here.

The QSGW cycle makes a static, nonlocal self-energy Σ0. It actually writes to disk ΔΣ = Σ0 − Vxc(DFT), so that the band code, lmf, can seamlessly add ΔΣ to the LDA potential to yield QSGW energy bands.

Performing a QSGW calculation on an HPC machine is complicated by the fact that the entire calculation has quite heterogeneous parts, and to run efficiently each of several parts needs to be fed some specific information. lmgw.sh is designed to handle this, but it requires some extra preparation on the user’s part. To make it easier, the GW driver, lmfgwd, has some features that automate many of the preparatory steps. This will be explained later; for now we outline the logical structure of lmgw.sh and the switches you need to set to make it run efficiently.

lmgw.sh breaks the QSGW cycle into a sequence of steps. The important ones are:

1. lmf  Makes the density self-consistent for a fixed ΔΣ, updating the density from $n$ to $n+{\Delta}n$. (Initially, in the “0th” iteration, ΔΣ is zero.) ${\Delta}n$ modifies Vxc(DFT), so lmf modifies both the Hartree potential and Vxc(DFT). Thus the potential is updated as if the change in Σ from ${\Delta}n$ were given by the corresponding change in Vxc(DFT); in other words, the inverse susceptibility $\chi^{-1} = \delta{V}/\delta{n}$ is assumed to be adequately described by DFT. As the cycle proceeds to self-consistency the change in n becomes small and $\chi^{-1}{\times}\Delta{n}$ becomes negligible, so this assumption ceases to matter.

2. lmfgwd  is the driver that supplies input from the one-body code (lmf) to the GW cycle.

3. hvccfp0  makes matrix elements of the Hartree potential in the product basis. Note this code is run twice: once for the core and once for the valence.

4. hsfp0_sc  makes matrix elements of the exchange potential in the product basis. This capability is also folded into lmsig (next), but the latter code operates only for the valence. hsfp0_sc is run separately for the core states.

5. lmsig  makes the valence exchange potential, the RPA polarizability, the screened coulomb interaction W, and finally Σ0, in a form that can be read by lmf. This is the most demanding step.

Finally there is an optional step that is inserted after making the RPA polarizability.

5a. bse  Add ladder diagrams in the calculation of $W(\omega,q)$. This option is executed when the --bsw switch is used. lmgw.sh synchronizes bse and lmsig.

Command line arguments to lmgw.sh

lmgw.sh accepts command-line arguments, which control how each of the five major steps listed above is to be executed:

• --mpirun-1 "strn"
default launcher for all compute processes if not overridden in special cases.
Intended for serial processes; however, a queuing system may not allow a program to execute in serial mode.
In such a case, you can use "mpirun -n 1" as in this example:
Example: --mpirun-1 'env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 mpirun -n 1'

• --mpirun-n "strn"
default launcher for all MPI compute processes, if not overridden in special cases.
Binaries which execute with MPI parallelism use this switch as a default.
If not specified, defaults to --mpirun-1.
Example for a node with 32 cores: --mpirun-n 'env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 mpirun -n 8'

• --cmd-[lm, gd, vc, cx, w0, xc, bs] "strn"
override --mpirun-n for the following:
• lm: DFT level: lmf (step 1)
• gd: GW driver: lmfgwd (step 2)
• vc: coulomb potential maker hvccfp0 (step 3)
• cx: core-exchange self-energy maker hsfp0_sc (step 4)
• w0: only in --split-w0 mode, used by lmsig: polarizability at offset-gamma points to make $W(q=0)$
• xc: to make $W$ and $\Sigma$ with lmsig (step 5)
• bs: arguments to bse when making the BSE screened coulomb interaction W.
• --split-w0
split the χ0(0,0,ω,q=0) and W0(0,0,ω,q=0) calculation from the rest and communicate the values through files x00 and w0c.

• --split-w
split the calculation for W by writing data to disk, and read data from disk in making Σ. Caution: these files can be huge!

• --bsw
After computing the screened W(ω,q=0) in the random phase approximation, recalculate it including ladder diagrams via a BSE, and use WBSE when making Σ=iGW.

• --iter I
start from the Ith iteration (counting from 0)

• --maxit N
perform no more than N iterations

• --tol TOL
exit once the sigma RMS difference converges to the given tolerance TOL

• --no-exec
print commands without executing them

• --lm-flags "strn"
switches passed to lmf and lmfgwd

• --start=[setup|x0|sig|qpe|bse]
start at the given step

• --stop=[setup|x0|sig|qpe|bse]
stop after the given step

• --sym
symmetrise the sigma file, using lmf to do it.

• --qg4gw-flags=+qm  |  --qg4gw-flags=+qa
Specify a particular kind of mesh for the offset-gamma points. Possible options:
• +qm mid-plane of qlat axes (or perpendicular if angles < π/2)
• +qa along the axes of qlat
• -h, --help
displays this help

An example is given in the next section, and more detailed information on how to set switches for lmgw.sh is given below.

1. QSGW calculation on a single node with 32 cores

This example sets up a QSGW calculation for a single monolayer of CrI3 on a single node of a small tier-3 cluster (called gravity) with 32-core nodes, using lmgw.sh. (There are not many atoms, but to suppress interference from neighboring slabs a large vacuum region is needed, which greatly increases the computational cost.)

Gravity uses the slurm queuing system. One of its queues (called henrya) has 11 nodes, and each node has 256G RAM. We use only one node in this example.

The QSGW self-energy has 7 irreducible k points. To conserve memory, we will run the computationally intensive part of the calculation with 4 processes. Since the machine has 32 cores, we can distribute the cores as 8 threads × 4 processes. (Note we can use 7 processes for the one-body part done by lmf.)

Note that lines beginning with #SBATCH are instructions to the slurm scheduler. Any other line beginning with # is a comment; all other lines are commands bash executes. For description of the arguments to lmgw.sh and variable assignments m1, mn, … xc, see documentation above.

#!/bin/bash -l

## Lines beginning with #SBATCH are instructions passed to the slurm scheduler.

## Specify job name in queue, number of cores, queue name (usually essential)
#SBATCH -J myjob
#SBATCH -n 32
#SBATCH -p henrya

## Specify memory requirements, output file.  These are not usually essential.
#SBATCH --mem 200G
#SBATCH -o %x-%j.log

## How processes are bound to cores.  See https://users.open-mpi.narkive.com/poMpc1GG/ompi-users-global-settings
## This instruction may be not needed.
export OMPI_MCA_hwloc_base_binding_policy=none

set -xe # tell bash to echo commands as they are executed, and also exit on error

## These modules are specific to this cluster.
## If your modules are not loaded when the batch job executes, replace the following lines with your own
module purge
module use /home/k1222090/opt/mpm/modules/o
mods="module load hdf5/1.10.5/openmpi/4.0.2/intel/19.0.3 libxc/4.3.3/intel/19.0.3 ninja/1.9.0 git"
$mods

## This instruction adds to the path the directory where the lmgw.sh script resides, and the attendant executables, e.g. lmf
## If your job does not already include the path containing the lmgw.sh script, add it here
export PATH=/home/k1193116/bin/speedup:$PATH

## These instructions create bash variables passed as arguments to lmgw.sh
## Note that the switches  --bind-to none  and  --map-by ppr:2:socket do not apply to all clusters.
echo "1 node distributing OMP_NUM_THREADS and MPI processes as (8,4)"

## extra arguments passed to lmgw.sh.  They could be included directly on the command line instead of being assigned a variable
lmgwargs="--insul=33 --sym"
## variable used below for tolerance in QSGW self-energy
tol=4e-5

echo ... using lmgw.sh `which lmgw.sh`  path $PATH

## This runs the QSGW cycle. See documentation above for an explanation of its arguments
lmgw.sh --qg4gw-flags=+qm --tol $tol --mpirun-1 "$m1" --mpirun-n "$mn" --cmd-lm "$ml" --cmd-gd "$md" --cmd-vc "$mh" --cmd-xc "$xc" \
  $lmgwargs ctrl.cri3


More information on how to set switches for lmgw.sh is given below.

1.1 Introduction to pqmap

By far the most costly step lmgw.sh has to execute is the self-energy generator, lmsig. Its efficiency can be greatly improved if you supply a helper file pqmap, as explained here.

lmsig must generate a self-energy Σ(q) for each irreducible q point, and to do so it must perform an internal integration over k. Thus lmsig must generate contributions to Σ(q) at some q for a family of k points. Questaal’s traditional GW code calculates the polarizability for each q (requiring an internal numerical integration over k), stores the screened coulomb interaction W(q) on disk, and then makes Σ(q) by a convolution  i ∫ dk G(q-k) W(k)  of G and W, which is done by another numerical integration over k on a uniform mesh of points. lmsig is more cleverly designed: it makes partial contributions  G(q-k) W(k)  to the self-energy (and similar contributions to the polarizability) at some q by calculating each q-k combination independently. It parallelizes the process over q-k pairs, and for that reason can be extremely efficient.

For optimum performance every process should calculate approximately equal numbers of  G(q-k) W(k)  terms. But when the system has symmetry, the question of how to distribute these terms among processes can be a complicated one, since each q point may have a varying number of k points entering into the numerical k integration. Call this number nq. Indeed, the higher the symmetry, the wider the variation in nq will be. The MPI implementation in lmsig needs this information to be made optimally efficient; a preprocessing step takes this information and writes a file pqmap telling lmsig how to distribute the load among processes.

To visualize how pqmap is constructed, represent the total work as a tiling of rectangles on a canvas or grid, with the cycle or step count istep ranging on the ordinate between 1 and nstep, and the process number ip ranging on the abscissa between 1 and np, np being the number of processes you select. The nq contributions associated with a particular q are kept contiguous on the canvas: point q is assigned a rectangle of area at least nq. (This is not an essential condition but it improves interprocessor communication.) The trick is to arrange the rectangles, corresponding to the “area” nq of each q point, as densely as possible. Each rectangle can be constructed in different shapes: for example, if a q point has 29 q-k combinations, the area must be at least 29. 29×1 is one possibility; so are rectangles of area 30. 30 has prime factors 2, 3, and 5, so rectangles 15×2, 6×5, 5×6, or 2×15 are all valid choices. Since interprocess communication is relatively costly, the rectangles are ideally as tall and as thin as possible. This is particularly important when MPI has to perform an ALLREDUCE on the nq points to make the polarizability.

Let nqk be the sum ∑nq of all q-k combinations, and nstep the maximum number of cycles (aka steps) any one process has to perform. In the graphical representation just described, nstep is the height of the canvas and the width is the number of processes np, which you choose. We want to minimize nstep, but it clearly has a lower bound: nstep ≥ nqk / np.
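The bound is easy to compute. A small sketch (the nq values below are made up for illustration, not taken from a real calculation):

```shell
# Hypothetical nq values: number of q-k combinations per irreducible q point
nq_list="29 16 16 12 8 8 4"
np=4

nqk=0
for nq in $nq_list; do nqk=$((nqk + nq)); done
nstep_min=$(( (nqk + np - 1) / np ))   # ceiling of nqk/np
echo "nqk=$nqk, lower bound on nstep = $nstep_min"
```

A tiling that reaches this bound wastes no process-steps; in practice the rectangle shapes make the achievable nstep somewhat larger.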

You can make pqmap by hand, but it is a tedious process. The GW driver lmfgwd has a special-purpose mode designed to make a sensible pqmap for you. Running lmfgwd with switch --pqmap activates this mode. The switch is described in more detail below; here we illustrate its use with the example above. For the CrI3 example, run lmfgwd this way:

lmfgwd ctrl.cri3 --lmqp~rdgwin --job=0 --pqmap~plot~npr=4


--lmqp~rdgwin tells lmfgwd to take certain parameters from the GWinput file, rather than the ctrl file. --pqmap tells lmfgwd to make a file pqmap. This switch accepts several options, documented here: ~npr assigns np as shown; if it is not present, --pqmap will not make pqmap. All other switches are optional.

Construction of pqmap is discussed in more detail below.

2. QSGW calculation on multiple nodes

Medium and large HPC machines have multiple nodes, and Questaal’s QSGW has a multilevel parallelization to make use of them. It uses MPI, whose processes can span multiple nodes, with OMP multithreading embedded at various critical blocks of code (within an MPI process). It also uses multithreaded libraries (such as Intel’s MKL when available) for level-3 BLAS. Recently it has also been implemented to use Nvidia GPU accelerator cards.

As noted before, the GW cycle must evaluate both the polarizability and the self-energy at a collection of q points (normally the irreducible q points in the Brillouin zone, as needed for self-consistency), and for each q there is a loop over k to make a numerical integration.

The speedup version of Questaal combines the construction of the polarizability P and the self-energy Σ into one code, lmsig. This makes it possible to avoid writing W to disk, and also to make parts of Σ on the fly without having computed all of W(q) first. lmsig also reorganizes this double loop so that each q-k combination may be done independently, a great boon for parallel implementations.

The only exception is the stage where P must be combined with the bare coulomb interaction v to make the screened one, W. At that point, all the q-k combinations must have been evaluated for a given q, and MPI must perform an expensive ALLREDUCE to collect them into one process. We will return to this point shortly.

2.1 pqmap

When the system has lots of symmetry, each q point may need a widely varying number of k points for the internal k integration. Without this specific information, MPI is likely not to be particularly efficient when evaluating the crucial step 5 (see lmsig above). To make the MPI implementation most efficient, you can supply a helper file pqmap. The structure of this file is most easily visualized in terms of filling a canvas, as explained in a previous section. Here we discuss the structure of pqmap in more detail.

For each q, pqmap has a line with the following form:

   offr   offc   nr   nc   iq


iq is the index of the particular q; offr and offc are the offsets into the canvas (the upper left corner of the rectangle). nr and nc are the number of rows and columns (the number of steps and processes) this q point will require. nr×nc must be at least as large as nq.
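As an illustration, take one line from the pqmap example at the end of this page (the annotations here are added for explanation; they are not part of the file format):

```
#  offr offc  nr  nc  iq
   0    24    9   8   7    # q point 7: steps 1..9 (nr=9 rows), processes 25..32 (nc=8 columns)
```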

lmfgwd --job=0 --pqmap~npr=... invokes a special mode to create the file pqmap, which “tiles” a canvas of area nstep × np. It is tiled with rectangles, one per irreducible q point. The rectangle associated with a particular q has area at least nq, and its width nc must not exceed the number of processes np.

lmfgwd uses a complicated algorithm to tile the canvas. Determining the optimal choice of tiling is a difficult proposition: the number of possible combinations grows exponentially with the number of irreducible k points, and moreover, tall, narrow rectangles are preferable to square or short, fat ones. This is because MPI has to perform an ALLREDUCE step when all the information for a given q is collected. Fatter rectangles entail more processes (and possibly more nodes) which puts more demands on interprocessor communication in the ALLREDUCE step.

The algorithm in lmfgwd follows approximately the following guidelines.

1. Make the rectangles as tall as feasible but not exceeding nstep

2. Join the rectangles as compactly as possible

3. After all the q points have been tiled, identify unfilled spaces on the canvas and fill them by enlarging the rectangles.

2.2 Setting switches for lmgw.sh

To execute efficiently, lmgw.sh requires several switches, which can make the setup of the job script somewhat involved. Setting up a batch job on a multinode machine requires the following hardware information:

• Total number of cores nc and nodes nn. Typically you do not specify nn: it is determined from nc/Nc, where Nc is the number of hardware cores per node.

• Number of GPU’s, if any. It should not exceed the number of GPU’s belonging to the nodes you allocate.

• Number of cores/node, nc/nn. You often do not need to specify this because it is fixed by the number of hardware cores/node Nc and in any case should not exceed it.

A particular job typically requires the following:

1. Number of processes np for MPI

2. Number of processes/node nppn to assign (which together with np specifies the total number of nodes)

3. Number of k points lmf requires for Brillouin zone integration

4. Number of q points for which the GW driver lmfgwd makes eigenfunctions

5. Number of q points for which matrix elements of the bare coulomb potential will be evaluated. This is the number of actual q points + the number of offset-gamma points.

Information from the last three is needed to make good command-line arguments --cmd-lm, --cmd-gd, --cmd-vc (points 3,4,5 above, corresponding to steps 1,2,3 in an earlier section). Usually you don’t need to specify --cmd-xc, as it takes --mpirun-n in the absence of a particular specification.

Partial autogeneration of batch job

Setting the switches by hand is tedious and error-prone. To facilitate their generation --pqmap has a special option, ~batch, which tells lmfgwd to generate variables for a batch script particular to this job. --pqmap~batch finds the necessary information for points 3,4,5. It uses this information to make template shell variables to assign to --mpirun-1, --mpirun-n, --cmd-lm, --cmd-gd, --cmd-vc, etc., in the batch script, and writes these strings to the file batch.npr. (Here npr is the number of processes.) Since these switches are hardware-specific, lmfgwd can only generate guesses for the strings. It has a limited ability to tune these variables to a particular target machine. As of this writing, lmfgwd is set up for the following:

• a generic vanilla machine. Specify with @vanilla# where # is the number of cores/node, viz  --pqmap~batch@vanilla16 .
• the gravity cluster described above. Specify with @gravity, viz  --pqmap~batch@gravity .
• the Juwels “Booster” in Julich, Germany. Juwels is a powerful HPC system, with 48 cores/node. The booster has 4 Nvidia A100 cards/node. Specify with @juwels .

~batch has subswitches, which are documented here. You must specify a machine architecture, and the number of processes to run on a node, for example   --pqmap~batch@gravity@ppn=4 . There are other optional switches.

As an example, run lmfgwd with the following arguments

lmfgwd --lmqp~rdgwin --job=0 --pqmap~plot~npr=np~batch@juwels@ppn=nppn


--lmqp~rdgwin tells lmfgwd to take certain parameters from the GWinput file (rather than the ctrl file). Particularly relevant for generating pqmap are the parameters related to the k mesh (e.g. tag n1n2n3 in GWinput).

--pqmap tells lmfgwd to make a file pqmap to be described in the next section. Note that this switch does nothing unless np is specified in the manner shown above. --pqmap accepts several tags, one to assign np as shown, and some others, in particular ~batch, which tells lmfgwd to generate variables for a batch script particular to this job. If you do use this option, you must specify a machine name (juwels in the above example) and the number of processors per node, as shown.

Prescription

cp ~/lmf/gwd/test/mnte/{ctrl,basp,site}.mnte .
cp ~/lmf/gwd/test/mnte/GWinput.gw  GWinput
lmfa ctrl.mnte
lmfgwd --lmqp~rdgwin --job=0 --no-iactive --pqmap~plot~npr=16~~batch@juwels@ppn=4 mnte

lmgwsc --noexec --stop=setup  --mpi=16,16 --wt --insul=4 --tol=2e-5 --sym si | grep lmfgwd
lmfgwd --gwcode=2  --lmqp~rdgwin --job=0 --no-iactive --pqmap~plot~npr=26~~batch@juwels@ppn=4  -vnkgw=8 si


Arguments to --pqmap :

npr
qk
incloff
onlyoff
job
fill
nomap
rdmap
plot
batch


Arguments to ~batch

juwels | gravity | vanilla#
bsew
ppn
nqlmf
nqgwd
nqvc
maxit
nprlm


Juwels is a powerful HPC system, with a “Booster” adjunct. The latter has A100 GPU cards, and exploiting them is the focus of this section.

A100 cards are very fast, and Questaal has a special QSGW implmentation that exploits these powerful cards. Before getting

QSGW calculation using the Juwels Booster architecture

This calculation involves a slab of CrI3. We want to do the QSGW calculation with an 8×8×1 mesh of k points.

"--bind-to none" tells the launcher that you will set all bindings manually.

We need to know the following:

• The number of irreducible k points lmf requires. The command
lmf ctrl.cri3 `cat switches-for-lm` --rs=1,0 -vnit=1 --quit=ham | grep BZMESH
yields 10 irreducible QP.
• The number of q points for which the GW driver lmfgwd makes eigenfunctions, listed in the table below:
   iq                qp                ngwf  ngmb  ngmbr ngwf2  iqi
1    0.000000  0.000000  0.000000  1893   993   993 15089    1  irr=1
9    0.114765 -0.066259  0.000000  1890  1028  1119 15155    2  irr=1
17    0.229530 -0.132519  0.000000  1898  1038  1155 15122    3  irr=1
18    0.229530  0.000000  0.000000  1909  1025  1119 15135    6  irr=1
25    0.344294 -0.198778  0.000000  1888  1020  1203 15124    4  irr=1
26    0.344294 -0.066259  0.000000  1881  1022  1191 15091    7  irr=1
33    0.459059 -0.265038  0.000000  1872  1024  1335 15108    5  irr=1
34    0.459059 -0.132519  0.000000  1883  1016  1323 15102    8  irr=1
35    0.459059  0.000000  0.000000  1876  1013  1263 15107    9  irr=1
43    0.573824 -0.066259  0.000000  1886  1012  1371 15100   10  irr=1
65    0.005957  0.003439  0.000000  1893   993   993 15089   11  irr=1 <--Q0P
129    0.000000  0.002631  0.006355  1893   993   993 15089   12  irr=1 <--Q0P


lmsig can run much more efficiently if you supply a helper file pqmap. For this example it might look like the following:

pqmap

9 48
10
0  0 9  4  2
0  4 9  4  3
0  8 9  4  4
0 12 9  4  5
0 16 9  4  6
0 20 9  4  9
0 24 9  8  7
1 32 8  8  8
1 40 8  8 10
0 32 1 16  1