
User's guide for GW code, lmgw.sh

Introduction

This guide goes over the basics of using the new code. It assumes familiarity with the old code. Needs reformatting for inclusion on the website. Splitting it into a few pieces might be good too. It might also be good to integrate a couple of examples to run along the text and serve as illustration.

GPU, accelerator, card and CUDA hardware are used interchangeably to mean an NVIDIA accelerator card connected over PCI Express (PCIe) or a socketed chip (SXM).

Building

The build process is the same as before.

// Maybe should mention the spack recipe, useful for building named releases.

CUDA

The main difference is that genflags.py has a new option 'cuda=XX' to enable the use of NVIDIA accelerators. XX specifies the CUDA compute capability to target and defaults to 70 (Volta V100) if only 'cuda' is given. While 70 is compatible with the later and much more powerful A100 (Ampere) hardware, it is strongly recommended to use cuda=80 on such machines. The table at https://en.wikipedia.org/wiki/CUDA#GPUs_supported can be used as a reference.

The code uses most of the CUDA auxiliary libraries, such as cuBLAS, cuFFT, etc. The driver API is not used, so the code can be built on nodes without CUDA hardware or a driver installed.

The CUDADIR env var has to point to the CUDA root installation directory.
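For example (the path is illustrative and depends on the local installation):

export CUDADIR=/usr/local/cuda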

GPU enabled executables on CPU nodes

In principle an executable built with GPU support can also work on purely CPU nodes with no ill effect, using the exact same code paths as an executable without GPU support. Whether it makes sense to use such executables depends on whether the available GPU and CPU nodes share the same CPU types. If the same types of CPUs are used across the different nodes it is perfectly fine. If however the GPU nodes use, for example, AMD CPUs (to take advantage of their earlier adoption of the PCIe 4.0 bus) and the purely CPU nodes use Intel CPUs of the same era (for the excellent performance of Intel's libraries and compilers on them), then using the same binaries across such nodes is likely to be inefficient. If the GPU nodes use exotic CPUs with a different instruction set, for example IBM POWER based ones for the unprecedented performance of their CAPI bus, reuse of executables on Intel or AMD nodes is entirely out of the question.

HDF5

HDF5 v1.12.0 is the minimum supported version. While the code is mostly Fortran, none of the additional HDF5 interfaces beyond the base C API are necessary. If the code is built with MPI then the HDF5 library needs to be built with the same MPI implementation and C compiler, and independent parallel IO needs to be enabled. If no compatible HDF5 installation is available, the following minimal configure line is suggested:

../configure --disable-hl --enable-parallel --enable-build-mode=production CC=mpicc CFLAGS=-O3 --enable-deprecated-symbols=no \
    --prefix=/path/to/install/to/hdf5/1.12.1/mpiname/mpiversion/compilername/compilerversion

The HDF5ROOT (or similar) env var needs to be exported.
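For example, reusing the install prefix from the configure line above:

export HDF5ROOT=/path/to/install/to/hdf5/1.12.1/mpiname/mpiversion/compilername/compilerversion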

Use

I'll go over the new quirks and features of the density functional code lmf before moving on to the new script, lmgw.sh, for carrying out the steps of a QSGW or QSGŴ calculation starting from scratch (well, from a ctrl.lm file). The main new GW/BSE binaries will also be reviewed.

lmf

The lmf code is in the middle of transitioning to a new basis type, and much of the internals replicate existing functionality written in a different way.

The --v8 flag switches to the new code; in addition to enabling the screened basis it also provides much better OpenMP multi-threading, expanded MPI parallelisation (though still somewhat lacking) and use of accelerators for the heaviest operations.

Not all functionality has been implemented in the new code yet. Two important missing features are (1) extended local orbitals, which reach beyond the MT sphere (the PZ=10s switch), and (2) plane waves.

k&a parallel mode

A mixed/multilevel parallelisation has been implemented, merging the old separate k-parallel and atom-parallel modes. It works by splitting processes into 'k' groups of size 'a'. There is an automatic distribution, which can be partially overridden or guided by the command line option --pmesh=kxa, where 'k' is the number of groups and 'a' the group size. For example, for a cell with 20 irreducible k-points launched with 12 processes, the automatic mesh finder will use the greatest common divisor of 20 and 12, which is 4, for the number of k-groups; the size of each k-group will then be 3. If one is to specify such a mesh explicitly it would be:

lmf --pmesh=4x3

to emulate the old k-parallel mode use:

lmf --pmesh=12x1

to emulate the much older, and long since non-functional, 'atom parallel' mode:

lmf --pmesh=1x12

A special value of 0 is allowed in the --pmesh definition, meaning that the value is filled in automatically by considering the other value given. The two extreme use cases for the special value are:

--pmesh=0x1

to fix k-group size to 1 and effectively emulate the old k-parallel mode, and

--pmesh=1x0

to fix the number of k-groups to 1, giving all processes to the old atom-parallel mode.

0x0, or any other configuration in which the process mesh size does not equal the number of processes launched, will be silently ignored in favour of an automatically generated configuration. For example, with 12 processes --pmesh=6x2 is accepted (6×2 = 12), while --pmesh=5x2 would be ignored.

Caveats

Band plotting is not yet included in the scheme and the --band command line option will switch to k-parallel mode overriding any --pmesh setting.

The atom-parallel mode is not yet particularly efficient; in our experience so far it is about 2x slower compared to the single multithreaded process mode when using the same number of CPU cores on a single node. It is, however, able to use more than a single node for a single k-point, which was previously only possible with the pure atom-parallel mode. There is a limit to efficient multithreading in the atom-parallel mode, so while for pure k-parallel mode it is recommended to launch one process per socket or even per node, when using k-group sizes beyond one it may be beneficial to use no more than 4-8 threads per process (while generally keeping 1 thread per core).

How to make the most of lmf

Disable forces if they are not really needed. In the ctrl file, either delete token FORCES from the HAM category, or set it to 0.

Tiny cells

--v8 makes little difference for tiny cells of a few atoms (<10) and may even be slower for cells of 1-4 atoms. Multi-threading is bound to be ineffective due to the tiny matrix sizes, so there is no point using more than 2 threads per process. Chances are there will be plenty of k-points to lean on though. For example

mpirun -n n ..other mpi flags.. lmf ctrl.lm

may be ok with ‘n’ not exceeding the number of irreducible k points.

Beyond

If extended local orbitals (ELOs) and plane waves are not crucial, use the --v8 switch. Multi-threading becomes increasingly effective with larger basis sizes. In most cases the best approach is to launch 1 MPI process per socket and set OMP_NUM_THREADS to the number of cores per socket. For larger cells of 100+ atoms, even 1 process per node using all the cores can be effective. For such cells often very few k-points are necessary, and it makes sense to use a number of processes that is a multiple of the number of k-points so that the mixed atom-parallel mode is engaged.
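As an illustration, a single-node launch for a node with 2 sockets and 16 cores per socket (counts are assumptions; adjust to the actual hardware):

env OMP_NUM_THREADS=16 mpirun -n 2 lmf --v8 ctrl.lm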

Some MPI implementations apply nonsensical default process pinning (OpenMPI is known to do this), which can easily waste resources by forcing all threads to fight each other for a single core while leaving all other cores idle. In such cases ensure the hardware resource allocation is appropriate. For OpenMPI look at its binding flags.
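For example, with OpenMPI something along these lines keeps one process per socket with its threads confined to that socket (a sketch; the exact flags depend on the OpenMPI version and node layout):

mpirun -n 2 --map-by socket --bind-to socket env OMP_NUM_THREADS=16 lmf --v8 ctrl.lm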

For smaller cells it can be beneficial to launch more processes per socket with an appropriately adjusted number of threads, so that there is no more than 1 thread per core. In our experience, putting compute threads on different hardware threads (hyperthreads in Intel parlance) is not effective, and putting more than 1 thread per hardware thread can be very detrimental.

CUDA with lmf and lmfgwd

Only NVIDIA accelerators are supported for now. No special configuration or launch parameters are necessary besides the --v8 switch.

Only some of the heavier operations are currently ported, and only a single device can be used by each MPI process for now (unlike other executables to be reviewed later). With this in mind it makes sense to launch no fewer than one process per card or accelerator chip.

There is some overhead due to data transfers and specialised memory allocation, which is unlikely to be amortised by tiny cells. Launching two processes per GPU directly can at times give better performance, but for more processes per GPU it is in most cases much better to use the NVIDIA Multi-Process Service (MPS): https://docs.nvidia.com/deploy/mps/index.html. HPC sites will usually have instructions for how to do this; otherwise the following script could be a helpful guide, if not directly usable:

#!/bin/bash

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# Launch MPS from a single rank per node
[ "$SLURM_LOCALID" != "0" ] || nvidia-cuda-mps-control -d
sleep 1
# Run the wrapped command, preserving its arguments
"$@"
[ "$SLURM_LOCALID" != "0" ] || nvidia-cuda-mps-control <<< quit

The benefits of accelerators are most clearly felt with larger examples. Whether the use of usually scarce accelerated nodes is justified depends on the particular cell and the competing CPU nodes. With each new release the balance shifts further in favour of the GPU nodes.

Extended Local Orbitals

The PZ=10s situation can often be worked around by adjusting the sphere radii: spheres around atoms with an ELO are expanded and those around atoms without an ELO are reduced, until the ELOs fall mostly (to within qtol) inside the sphere and are converted to classic local orbitals. At this point --v8 can be used.

QSGW

The new lmgw.sh script combines the functionality of the previous lmgw and lmgwsc. It still calls a number of executables to perform the necessary steps of a QSGW calculation.

Nearly all basic inputs have been moved to the GW section of the ctrl file. Any overlapping setting defined in the GWinput file still overrides the corresponding one from the ctrl file. The only data remaining solely in the GWinput file is the product basis composition listing.

File changes

One of the main improvements is the significant reduction in the number of generated files, especially files with any sort of index in their name. The data is now written into multidimensional datasets in HDF5 files.

Another important improvement is the elimination of the colossal screened potential files for a regular QSGW job, saving large amounts of storage and IO, and making the code more usable on lesser clusters which don’t have a high performance parallel file system.

A QSGŴ job, however, does need to write W to file, update it using the BSE code, then read it back to make the self-energy. Parallel MPI-IO is used and a suitable parallel filesystem is beneficial.

Launchers

The computationally intensive steps are MPI-parallel, while some tiny setup steps run only in single-process mode. To deal with this disparity and to facilitate execution beyond a single node, two of the most commonly used options are --mpirun-n "..." and --mpirun-1 "...", which specify launchers for these two types of steps respectively. For example, a command line for 4 nodes (2 sockets each, with say 16 cores/socket) using Intel MPI/MPICH launchers could be:

m1="env OMP_NUM_THREADS=32 mpirun -n 1"
mn="env OMP_NUM_THREADS=16 mpirun -n 8 -ppn 2"

lmgw.sh --mpirun-n "$mn" --mpirun-1 "$m1" ctrl.lm

The sequence of commands will be printed to stdout as they are executed, together with their wall-clock runtimes measured externally.

In some cases the different MPI-parallel steps need to use different launchers in order to remain efficient with the same amount of computational resources. The following options override the launcher for individual steps (an example follows the list):

--cmd-lm "..." for the 1-particle level lmf, limits discussed in the previous section.

--cmd-gd "...": GW driver lmfgwd to prepare inputs such as wavefunctions eigenvectors etc…, limited by the irreducible points obtained from the set of nq×(nq0+1) where nq comes from the mesh GW_NKABC mesh defines in the ctrl file and nq0 is the irreducible set of Γ offset points, ranging from 1 to 6 depending on symmetry available.

--cmd-vc "...": Coulomb potential maker hvccfp0, limited by nq+nq0

--cmd-cx "...": core exchange self-energy makes hsfp0_sc, the most scalable of executables, can use up to the irreducible number of nq² points.

--cmd-xc "...": launcher for the χ⁰, W and Σ maker lmsig, one of the more demanding tasks, reaching up to the irreducible fraction of nq*nq points, however certain constraints to be discussed later apply.

--cmd-bs "...": launcher for the bse executable, limited by (irreducible subset of nq)×nq² or with TDA: (irreducible subset of nq)×nq×(nq+1)/2, with different set of constraints.

……………

lmsig

lmsig is the new executable replacing most of the hx0fp0* and hsfp0* executables. It can calculate Σ starting from V without writing intermediate data to files, but it can write W if asked with the --wrw switch.

An improved version of hsfp0_sc is still retained for the sole purpose of core exchange calculations.

The new code is parallelised extensively with MPI, OpenMP and CUDA and can run efficiently on a very large number of nodes.

Process mapping

reason: keep ram use low by only handling a single q at a time pqmap …….

……………………………….

CUDA

There are no special flags necessary to use GPUs; a CUDA-enabled lmsig executable is all that is needed. If no cards are found on a node, or if none of the cards are compatible with the minimum compute capability set at build time with the cuda=XX flag, then only the CPUs will be used.

The CUDA env var CUDA_VISIBLE_DEVICES can be used to reorder or selectively disable cards, other env vars described in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars may also be useful.

At initialisation, processes within a node negotiate the accelerator distribution between themselves. For example, if 4 processes are launched on a 4-card node, each process will take control of one of the 4 GPUs; if 2 processes are launched, each will take hold of 2 cards; and if a single process is launched it will control all cards. If 8 processes are launched then each pair of processes will share a card, which could be of use for smaller cells; the same MPS setup described above applies here too.

Nodes using process-exclusive mode can break this, because it bars 2 processes from having a context on the same device even for a moment while remapping. It is not possible for a process to disassociate itself from a device without binding to another one and without using the driver library (which in turn would make GPU-enabled code unbuildable on machines without a driver installation). To work around this situation one can provide a process-specific CUDA_VISIBLE_DEVICES env variable listing only the devices which the particular process should see.
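A minimal wrapper sketch for the process-exclusive case, assuming a Slurm launch and 4 cards per node (the card count is illustrative):

#!/bin/bash
# Give each local rank its own card so no two ranks ever share a context
export CUDA_VISIBLE_DEVICES=$(( SLURM_LOCALID % 4 ))
exec "$@"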

……….

Mixed precision mode

The flag --c4 tells lmsig to compute Σ, by far the heaviest step in a QSGW calculation many times over, in single precision, while χ and W are still done entirely in double precision. A speedup factor of 3-4x has been observed on A100 chips without meaningful loss of accuracy. As with other optimisations, the effect is small on tiny cells and it is only worth using on larger ones.

The --c4 flag is only implemented for CUDA and is ignored when running on CPU-only nodes.

Users will likely mostly interact with lmsig through the lmgw.sh script and an extended version of the flag has been copied to it:

--c4[=c4tol]
    use single precision for selected operations until the optional RMS Σ tolerance c4tol (default 5e-5) is reached,
    then continue entirely in double precision until --tol is reached; only available with CUDA
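For example, an invocation tightening the switchover tolerance (the value is illustrative; $mn and $m1 are the launchers defined earlier):

lmgw.sh --c4=1e-4 --mpirun-n "$mn" --mpirun-1 "$m1" ctrl.lm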

Iteration count and maxit behaviour

The --maxit=N switch specifies that at most N iterations are to be performed across an arbitrary number of restarts; once N is reached, a new invocation of the lmgw.sh script will simply exit without any further action.

A QSGW calculation uses DFT input; if one starts from only the basic input files, lmgw.sh will in that case perform a 0th iteration which is simply DFT, though this iteration is still counted towards the total.

Specifying --maxit=+N will add N iterations to whatever the current count is; for example, repeated invocation of lmgw.sh with --maxit=+1 will perform 1 iteration per invocation until the tolerance is met.

There is a sensible default tolerance for the root mean square change in Σ, but it can be overridden, with --tol 1e-5 for example.
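Putting the two together, a sketch that adds one iteration per submission and tightens the tolerance (values illustrative; $mn and $m1 as defined earlier):

lmgw.sh --maxit=+1 --tol 1e-5 --mpirun-n "$mn" --mpirun-1 "$m1" ctrl.lm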

Job restart and dynamic control

An lmgw.sh job can be cancelled and restarted without any extra configuration; it will only repeat the interrupted executable rather than the whole iteration.

On rare occasions it is useful to control a job after it has been launched without restarting it; a number of plain text files are sourced from within lmgw.sh to facilitate this:

* switches-for-lm: command line switches for lmf and lmfgwd
* lmsig-flags: command line switches for lmsig
* bse-flags: command line switches for bse
* tol: tolerance for RMS change in Σ
* maxit: adjust global number of iterations
* c4tol: single double precision switchover tolerance for RMS change in Σ
* c4maxit: maximal number of single precision iterations, switchover to double afterwards
* stop: 'graceful shutdown' after completion of the current iteration; the file is then deleted so it does not interfere with future relaunches
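For instance, from the job's working directory one could request a graceful stop or tighten the tolerance for subsequent iterations. This is only a sketch: the exact file contents lmgw.sh expects are an assumption here.

touch stop          # finish the current iteration, then exit
echo 1e-5 > tol     # tighten the RMS Σ tolerance (assumed format)
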
Split mode

In the offset-Γ method χ is evaluated at a number of q-points close to Γ; the compute effort can be nontrivial, but the final data obtained is minuscule, making it a good place to restart from. The --split-w0 flag will separate the offset-point calculations from the rest.

The offset-point χ, as well as other parts of a QSGW cycle, does not scale to the same degree as the self-energy part, so larger jobs could be fairly wasteful. To address this, one can use the --start and --stop switches to specify points at which to partition a calculation across jobs of different scales, linked with dependencies, so that eventually not much extra human effort is spent policing the calculation.

Special functionality

Dielectric function at arbitrary q

………….

Dynamical Σ

    lmsig --dynsig~nw=101~wmin=-2.0~wmax=2.0 ctrl.lm

All the dynsig subarguments are optional, and the default values are given in the line above. There is an extra subargument ~bmax=integer if one would like to restrict the number of states further than what the ctrl file gwemax allows.

The above command makes the same files the original code made, for compatibility with the lmgws and spectral executables.
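A variant restricting the number of states via ~bmax (all values are illustrative):

    lmsig --dynsig~nw=201~wmin=-1.0~wmax=1.0~bmax=40 ctrl.lm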

Write W

lmsig --wrw-only  ....

This will create the file weps.h5 containing W(ω,q,I,J); adding the flag --om0 will fix ω=0.

BSE

Selfconsistent W mode

QSBSE?, QSGŴ? # we need to get rid of diacritics, few in the English-speaking world care enough to be able to type them on a regular keyboard and they get stripped out way too easily, making a confusing mess.

To add ladder diagrams to W in each QSGW self-consistency iteration, add a BSE subsection to the GW section in the ctrl file. This is an example:

GW NKABC= 2 3 4
    BSE[TDA=T NV=8 NC=4 IMW=0.02 0.02]

All tokens are optional, but it is strongly suggested to set NV and NC. Descriptions, obtained with bse --input:

* TDA: Apply the Tamm-Dancoff approximation; this is the default.
* NV: Number of valence/occupied states in the BSE calculation of W, defaults to all occupied states.
* NC: Number of conduction/unoccupied states in the BSE calculation of W.
* IMW: Smearing in Ry, linear interpolation between the two values.

The q and ω meshes are taken from the GW section. The rank of the 2-particle BSE Hamiltonian with TDA applied, for the above example, will be 8×4 × 2×3×4 = 768 per spin. Sizes of up to about 90000 are diagonalisable on a GPU node with 4×A100 40GB on very reasonable timescales.

Then the simplest way to run is to just add the --bsw switch to lmgw.sh. This works fine for small examples, but beyond this custom launchers may be needed. The suggested approach is to launch 1 bse process per GPU, socket or node and occupy the applicable number of cores with OpenMP threads. If the number of processes is a multiple or divisor of the number of irreducible q-points, including offset points, then work will be distributed in a maximally efficient way, minimising communication; if this condition is not satisfied it is recommended to provide a process mapping file called pqmap-bse-N, where N is the total number of MPI processes.

The construction of the big matrix is done in tiles by processes and groups of threads, controlled in a way similar to the process grouping in lmf described above. The automatic distribution is not too bad, however it can be a little conservative to prevent crashes due to memory exhaustion. It can be specified with the flag --tmesh=NGROUPSxGROUPSIZE, following the same conventions as --pmesh above, however the product must match the value of OMP_NUM_THREADS. This flag is applicable to the bse executable only and can be passed through to it by lmgw.sh with --bse-flags="--tmesh=...". If only a few states are chosen but there are many k-points, then having a large number of small thread groups can be extremely beneficial (by a factor of >10x).
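For example, to request a thread mesh of 8 groups of 4 threads (illustrative values; the product must match OMP_NUM_THREADS, here 32, and the launcher is a sketch):

lmgw.sh --bsw --bse-flags="--tmesh=8x4" --mpirun-n "env OMP_NUM_THREADS=32 mpirun -n 4" ctrl.lm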

Running with TDA=F is extremely difficult beyond small examples at this stage. The problem becomes non-Hermitian (there are ways to deal with this for the q=0 case, but other cases need more consideration) and non-orthogonal; linear algebra libraries are not well optimised for this and can be extremely inefficient. The code will try to solve this problem through LU decompositions at each ω point, which can also be very costly, although much better optimised.

BSE optics mode

In addition to the ctrl file inputs above, there are also:

* EMESH: Energy mesh definition for BSE optics: start,end,step in Ry. Can be adjusted after solution.
* EIMW: Smearing for BSE optics in Ry, linear interpolation between the two values. Can be adjusted after solution.
* QVECS: q directions for BSE optics, defaults to the following set: 1 0 0  0 1 0  0 0 1  1 1 1.

Also needed is the OPTICS section of the ctrl file; there is an important choice to be made there for the approximation of the position operator.

Launch with lmgw.sh --bse ...... Only q=0 is considered, so the effort is reduced significantly and often enough a single good node is perfectly adequate. Other than this, the same considerations apply. Significant eigenvalues will be printed in groups, together with the corresponding oscillator strengths. The solution to the eigenvalue problem will be written to the file w2q0e.h5; this enables interoperability but is also very useful for choosing appropriate values for EMESH and EIMW above. To do this one needs to copy the bse command line from the lmgw.sh output and simply rerun it; even on 1 core it should complete in seconds in most cases.

The dielectric functions along the different directions are then tabulated in a simple plain text file, eps-plot.txt, suitable for direct plotting with pylab and others.

The eigensolutions written to file are also used for postprocessing operations like projections of chosen excitons on bands, 3D real-space exciton plots, orbital projections (coming soon), etc.

Exciton band weights

After an optics mode run is done, to integrate the weights over the emin-emax window, append --wexbw~ewin=emin,emax to the bse command line. The energy units are Ry by default, but one can set eV, Ry or Ha explicitly by appending the respective suffix to emax. The weights will be written to the file exbw.h5 (or one provided by the ~fn= subargument to --wexbw), which can be used from pylab to enhance band plots.
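An illustrative command line (the window values and file name are assumptions; the rest of the bse command line should be copied from the lmgw.sh output as described above):

bse .... --wexbw~ewin=1.5,2.5eV~fn=exbw-win.h5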

3D exciton plotting

After an optics mode run is done, a 3D real-space plot of excitons within an emin-emax window can be produced by appending --wexpsi to the bse command line. A number of suboptions are necessary too:

~ewin=emin,emax[Ry|eV|Ha] to specify the energy window in the chosen units, applicable to both spins; alternatively use ~eidx=u,d to specify a particular eigenstate index for spin u and d respectively.

There are two main modes at present:

* ~r+=0.0,0.0,0.0{p|c}: fix the valence hole position in either Cartesian (alat) or plat coordinates, then plot the probability of finding the bound electron.
* ~r-=0.0,0.0,0.0{p|c}: fix the conduction electron position in the chosen coordinate units and plot the probability of finding the bound hole.

If the specified position does not fall on an FTMESH point, the closest point will be chosen and information concerning this change will be printed.

The table printed during the optics run can be useful in determining energy windows or band indices.

The supercell for the 3D plot can be specified through an optional origin point ~t0=0.0,0.0,0.0, given in units of plat, and a number of unit cell replicas, for example ~tn=1,1,1. The default choice is to place the original '0,0,0' coordinate at the centre of the supercell, with the supercell size taken from the k-mesh, so in most cases the use of t0 and tn is not necessary. However, giving a tn smaller than the k-mesh can reduce compute time and file sizes; in this case though t0 can only contain integers.

Finally, the output file name can be specified with ~fn=, often useful to encode different positions and energy windows. The file name extensions .cube and .xsf are recognised and the corresponding formats will be used; otherwise data will be written in an HDF5 format. Specifics for the plain text cube and xsf formats:

* they do not contain the position of the fixed particle
* they can be rather large but are quite compressible
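Putting the suboptions together, a sketch (the window, position, replica count and file name are all illustrative; the rest of the bse command line is taken from the lmgw.sh output):

bse .... --wexpsi~ewin=1.5,2.5eV~r+=0.0,0.0,0.0p~tn=2,2,2~fn=exciton-000.cube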