# User's guide for GW code, lmgw.sh

## Introduction

This guide documents the driver for the GW package, lmgw.sh. Most of the functionality of the classic code, lmgwsc, has been subsumed into it.

lmgw.sh can work in serial mode, and it is sufficient for simple applications, e.g. the introductory tutorial or the iron tutorial. However, making good use of the sophisticated parallelization within lmgw.sh is essential for complex systems.

At the outermost level of parallelization, MPI (message-passing interface) is used. Within one MPI process, portions of the code are multithreaded, either with OpenMP or using multithreaded libraries. Hardware cores can be grouped, with several threads supporting each MPI process. Finally, a significant number of computationally intensive sections can make efficient use of NVIDIA accelerator cards(1) when the hardware is available.

Remark 1: algorithms distributed over MPI tend to execute more efficiently than when distributed over OpenMP, but each MPI process requires its own memory, while OpenMP is a shared-memory model. Thus there is a trade-off between the two, and the optimum choice is machine specific. When distributing over both, the user must choose.

Remark 2: the MPI processes themselves may be divided into groups to form a multi-level parallelization.(2)

## Building the Questaal Package

To build the entire Questaal package, follow the standard procedure (../genflags.py ...; ../configure.py; ninja). These steps assume that the required libraries are in standard places, that packages such as ninja are installed, and that your build directory is one level below the top-level Questaal directory.

To be completed (Dimitar): this section should perhaps mention the spack recipe, which is useful for building named releases.

### Building the Questaal Package with CUDA

Invoke genflags.py with the option cuda=XX to enable use of NVIDIA accelerators. XX specifies the CUDA compute capability version to use; it defaults to 70 (targeting Volta V100) if cuda is specified without an argument. While 70 is compatible with the later and much more powerful A100 (Ampere) hardware, it is strongly recommended to use cuda=80 there. The table in this wikipedia page may be used as reference.

Questaal uses most of the CUDA auxiliary libraries (cuBLAS, cuFFT, etc.). The driver API is not used, so the code can be built on nodes without CUDA hardware or a driver installed.

Environment variable CUDADIR must point to the CUDA root installation directory.

### GPU enabled executables on CPU nodes

An executable built with GPU support can also work on purely CPU nodes with no ill effect. Whether it makes sense to use such executables depends on whether the available GPU and CPU nodes share the same CPU types. If the same types of CPUs are used across all the nodes, it is perfectly fine. If, however, the GPU nodes use, for example, AMD CPUs (to take advantage of their earlier adoption of the PCIe 4 bus), and these are mixed with purely CPU nodes made of Intel CPUs (Intel chips perform very well with Intel’s libraries), using the same binaries across such nodes is likely to be inefficient.

It should be clear that codes compiled for one kind of hardware (for example IBM POWER) cannot be run on hardware with a different instruction set, e.g. Intel or AMD nodes.

### Questaal and HDF5

Many of Questaal’s files are written in HDF5 format. HDF5 v1.12.0 is the minimum supported version. Even though the code is written mostly in Fortran, none of the additional HDF5 interfaces beyond the base C API are necessary. If the code is built with MPI, then the HDF5 library needs to be built with the same MPI implementation and C compiler, and independent parallel IO needs to be enabled. If no compatible HDF5 installation is available, the following minimal configure line is suggested:

../configure --disable-hl --enable-parallel --enable-build-mode=production CC=mpicc CFLAGS=-O3 --enable-deprecated-symbols=no \
--prefix=/path/to/install/to/hdf5/1.12.1/mpiname/mpiversion/compilername/compilerversion


The HDF5ROOT (or similar) environment variable needs to be exported.

## Using lmf

The one-body code, lmf, is in transition to a new basis type. Many of the internal algorithms retain both a new and the original implementation. The two should generate nearly equivalent results, but they are written differently. The new algorithms are used when --v8 is added to the command line. Also, --v8 adds new functionality:

• It enables a short ranged basis when run with --v8 --tbeh. At present --tbeh simply “screens” the existing basis by making a similarity transformation, so the eigenvalues and charge density come out almost identical. This screened basis is a precursor to Jigsaw Puzzle Orbitals.
• It provides much better OpenMP multi-threading, expanded MPI parallelization and use of accelerators (though so far still only partially implemented) for the heaviest operations.
• lmf can operate with multilevel MPI parallelization as explained below.

Not all functionality has been implemented in the --v8 form yet. Two important missing features are

• Extended local orbitals (local orbitals with a smooth Hankel function attached so they spill into the interstitial; set when tag SPEC_ATOM_PZ has the 10s digit set).
• The PMT method, where plane waves are added to the basis of atom-centered smooth Hankel functions.

Remark The traditional implementation is slower but it conserves memory better, and can be used when memory exhaustion becomes an issue.

#### k&a parallel mode

lmf has a mixed/multilevel MPI parallelization that merges the old separate k-parallel and atom-parallel modes. It works by splitting n processes in nk groups of size na. There is an automatic distribution which can be partially overridden or guided by the command line option --pmesh=kxa, where k is nk and a is na. Suppose for example you want to study a system using 20 irreducible k-points. (To see how many irreducible k points will be required, run lmf ... --quit=ham | grep BZ.) You can launch lmf with 12 processes in any of the following ways:

lmf --pmesh=12x1
lmf --pmesh=4x3
lmf --pmesh=1x12


The first is the same as the default: all processes are dedicated to parallelizing over k, so 8 processes operate on 2 k points and 4 on only 1. The last has no parallelism in k, only over atoms, and is useful only for systems with more than 20 atoms. The second makes good use of 12 MPI processes with 20 k points: 4 k points are treated in parallel, each by a group of 3 processes, so every group handles 5 k points.
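The counts quoted for the 20 k point example can be reproduced with a short sketch. It assumes k points are spread as evenly as possible over the nk groups; the function name `kpoints_per_group` is hypothetical, and this is not lmf's actual distribution code.

```python
def kpoints_per_group(n_kpoints, nk_groups):
    """Spread n_kpoints as evenly as possible over nk_groups (assumed policy)."""
    base, extra = divmod(n_kpoints, nk_groups)
    # The first `extra` groups take one additional k point.
    return [base + 1 if g < extra else base for g in range(nk_groups)]


# The three --pmesh choices for 12 processes and 20 irreducible k points.
for nk, na in [(12, 1), (4, 3), (1, 12)]:
    print(f"--pmesh={nk}x{na}:", kpoints_per_group(20, nk))
```

A balanced list (all groups holding the same number of k points) indicates that no group sits idle while others finish their last k point.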

The special value 0 is allowed in --pmesh also. It tells lmf to choose the partitioning by considering the other value given. The two extreme use cases for the special value are:

--pmesh=0x1


to fix k-group size to 1 and effectively emulate the old k-parallel mode, and

--pmesh=1x0


to fix number of k-groups to 1, giving all processes to the old atom-parallel mode.

0x0, or any other configuration in which the process mesh size does not equal the number of processes launched, will be silently ignored in favor of an automatically generated configuration.
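The rules for the special value 0 can be summarized in a small sketch. The helper `resolve_pmesh` is hypothetical and only mirrors the behaviour described here; returning `None` stands in for "fall back to the automatic configuration", whose details are not modelled.

```python
def resolve_pmesh(spec, nproc):
    """Return (nk, na) for a --pmesh=KxA value, or None for the automatic choice."""
    nk, na = (int(v) for v in spec.split("x"))
    if nk == 0 and na > 0 and nproc % na == 0:
        nk = nproc // na      # 0 in the k slot: derive nk from the group size
    elif na == 0 and nk > 0 and nproc % nk == 0:
        na = nproc // nk      # 0 in the a slot: derive na from the group count
    # A mesh whose size differs from the process count (including 0x0) is ignored.
    return (nk, na) if nk * na == nproc else None


for spec in ["0x1", "1x0", "4x3", "0x0", "5x2"]:
    print(spec, "->", resolve_pmesh(spec, 12))
```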

Remark

The atom parallel mode is not yet particularly efficient: in our experience so far it is about 2× slower (or worse) than the single multithreaded process mode when using the same number of CPU cores on a single node. It is, however, able to use more than a single node for a single k point, which was previously only possible with the pure atom parallel mode. There is a lower limit on efficient multithreading with the atom parallel mode. So while for a pure k-parallel mode it is recommended to launch one process per socket or even per node, when using k-group sizes beyond one it may be beneficial to use not much more than 4-8 threads per process (while generally keeping 1 thread per core).

Also, the a parallelization uses much less memory, so you may want to give up some efficiency if you need each MPI process to use less memory.

### Make lmf more efficient

Disable forces if they are not needed. In the ctrl file, either delete token FORCES in the HAM category, or set it to 0.

#### Small unit cells

--v8 makes little difference for unit cells of a few atoms (<10) and may even be slower for cells of 1-4 atoms. Multi-threading is bound to be ineffective due to tiny matrix sizes, so there is no point using more than 2 threads per process. On the other hand, chances are there will be plenty of k points to parallelize over. For example

mpirun -n n ..other mpi flags.. lmf ...


You can set n as large as the number of k points, provided that many processes are available.

#### Use the --v8 switch

If extended local orbitals (ELOs) and plane waves are not crucial, use the --v8 switch. Multi-threading will be increasingly effective with larger basis sizes. In most cases the best approach is to launch 1 MPI process per socket and set OMP_NUM_THREADS to the number of cores per socket.(3) For larger cells of 100+ atoms, even 1 process per node using all the cores could be effective. For such cells often very few k points are necessary, and it makes sense to use a number of processes that is a multiple of the number of k points so that the mixed atom parallel mode is engaged.

Some MPI implementations apply nonsensical default process pinning (openmpi is known to do it) which can easily waste resources by forcing all threads to fight each other for a single core and leave all other cores idle. In such cases ensure hardware resource allocation is appropriate. For openmpi look at its binding flags.

For smaller cells it could be beneficial to launch more processes per socket with appropriately adjusted number of threads so that there is no more than 1 thread per core. In our experience putting compute threads on different hardware threads (hyperthreads in Intel parlance) is not effective and putting more than 1 thread per hardware thread can be very detrimental.
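As a minimal sketch of the "no more than 1 thread per core" rule, the helper below (a hypothetical name, not part of Questaal) derives an OMP_NUM_THREADS value from the socket size and the number of MPI processes placed on each socket.

```python
def threads_per_process(cores_per_socket, procs_per_socket):
    """OMP_NUM_THREADS that keeps at most one thread per physical core."""
    if procs_per_socket < 1 or procs_per_socket > cores_per_socket:
        raise ValueError("need between 1 and cores_per_socket processes per socket")
    # Integer division: leftover cores stay idle rather than being oversubscribed.
    return cores_per_socket // procs_per_socket


# One process per 16-core socket (larger cells) versus four per socket (smaller cells).
print(threads_per_process(16, 1))
print(threads_per_process(16, 4))
```

The result is what one would export as OMP_NUM_THREADS in the launcher string; process pinning still has to be set through the MPI implementation's own binding flags.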

#### CUDA with lmf and lmfgwd

Only NVIDIA accelerators are supported for now. No special configuration or launch parameters are necessary besides the --v8 switch.

Only some of the heavier operations are currently ported and only a single device can be used by each MPI process for now (unlike other executables to be reviewed later). With this in mind it makes sense to launch no less than one process per card or accelerator chip.

There is some overhead due to data transfers and specialized memory allocation, which is unlikely to be amortized for tiny cells. Launching two processes per GPU directly can at times give better performance, but in most cases, for more processes per GPU, it will be much better if the NVIDIA Multi-Process Service (MPS) is used. HPC sites will usually have instructions for how to do this; otherwise this script could be a helpful guide, if not directly usable:

#!/bin/bash

export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log
# Launch MPS from a single rank per node
[ "$SLURM_LOCALID" != "0" ] || nvidia-cuda-mps-control -d
sleep 1
# Run the command passed to this wrapper
"$@"
[ "$SLURM_LOCALID" != "0" ] || nvidia-cuda-mps-control <<< quit


Accelerators provide much greater benefit for larger cells than smaller ones. Whether use of usually scarce accelerated nodes is justified depends on the particular cell and the competing CPU nodes. With each new release the balance is shifted more in favor of the GPU nodes.

#### Avoid Extended Local Orbitals

The larger the augmentation sphere, the less relevant the extended part of a local orbital is. If you slightly enlarge the augmentation sphere that needs an extended LO (reducing others to compensate), its impact is smaller and you might be able to get away without using it. (As noted above, --v8 does not implement ELOs.)

### Some changes in the transition to the new implementation of QSGW

lmgw.sh combines the functionality of scripts lmgw and lmgwsc in the classic code. It still calls a set of executables to perform the necessary steps of a QSGW calculation. This is done largely because the parallelism in the various sections must be applied quite differently.

lmgw.sh continues to make use of the GWinput file at present, but nearly all inputs in the general category (indeed all categories but the product basis category) of GWinput have been moved to the GW category of the ctrl file. If input appears in both files, tags in the GWinput file take precedence.

Remark: some tags in the GWinput and ctrl files bearing equivalent information have different names; for example the analog to GW_NKABC in the ctrl file is n1n2n3, in the general section of GWinput.

#### Re-ordering assembly of screened coulomb interaction and self-energy

The classic code first makes the screened coulomb interaction W, writing it to disk (see Eq. 32 in Ref. (4)). Given W, it makes the self-energy iGW (Eqns 31 and 34 in Ref (4)). lmgw.sh runs an executable, lmsig, which joins the construction of W with the QSGW self-energy $\Sigma^0$ by reordering loops. For a particular qk combination a partial contribution to W is generated, and its partial contribution is added to $\Sigma^0(q)$. This avoids the need to write the colossal files for W, saving large amounts of storage and IO, and making the code more usable on lesser clusters which don’t have a high performance parallel file system.

When $W$ is calculated with the BSE (QS), it must again be written to disk, as it is needed to construct the vertex. The vertex is generated using the BSE code, which updates $W$ (now $\hat{W}$); the updated $W$ is then read to make $\Sigma^0$. lmsig writes $W$ in HDF5 format with parallel MPI-IO. A suitable parallel filesystem is beneficial, if it is available.

#### File changes

lmgw.sh significantly reduces the number of generated files, especially files with any sort of index in their name. Data is now written into multidimensional datasets in HDF5 format.

#### Launchers

lmgw.sh launches multiple executables, with very heterogeneous requirements for MPI parallelization. Some small steps should be run in serial mode, while others are typically run in parallel mode. At a minimum, you can launch lmgw.sh with two flags --mpirun-1 "..." and --mpirun-n "...". The former tells lmgw.sh how to launch serial executables, the latter parallel executables. For example a sequence for 4 nodes (2 sockets each with 16 cores/socket for 32 cores) using intelmpi/mpich launchers could be:

m1="env OMP_NUM_THREADS=32 mpirun -n 1"
mn="env OMP_NUM_THREADS=16 mpirun -n 8 -ppn 2"
lmgw.sh --mpirun-n "$mn" --mpirun-1 "$m1" ctrl.lm


This page provides a more detailed discussion.

The sequence of commands is printed to stdout as they are executed, together with their wall-clock runtimes as measured by the shell script.

You can invoke each of the different executables with its own launcher, by adding the flags below. If a flag to a particular executable is missing, it uses whatever is associated with --mpirun-n "$mn".

• --cmd-lm "..." for the 1-particle level lmf, limits discussed in the previous section.

• --cmd-gd "...": GW driver lmfgwd to prepare inputs such as wave functions, eigenvectors, etc., limited by the irreducible points obtained from the set of nq×(nq0+1), where nq comes from the GW_NKABC mesh defined in the ctrl file and nq0 is the irreducible set of offset Γ points, ranging from 1 to 6 depending on symmetry.

• --cmd-vc "...": Coulomb potential maker hvccfp0, limited by nq+nq0

• --cmd-cx "...": core exchange self-energy maker hsfp0_sc, the most scalable of the executables; it can use up to the irreducible number of nq² points.

• --cmd-xc "...": launcher for the χ⁰, W and Σ maker lmsig, one of the more demanding tasks, reaching up to the irreducible fraction of nq*nq points; however, certain constraints, discussed later, apply.

• --cmd-bs "...": launcher for the bse executable, limited by (irreducible subset of nq)×nq² or with TDA: (irreducible subset of nq)×nq×(nq+1)/2, with different set of constraints.
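The scaling limits in the list above can be tabulated with a short sketch. It treats nq as the full number of points in the GW_NKABC mesh and ignores the reduction to irreducible points, so the numbers are only upper bounds; the function name and dictionary keys are illustrative, not Questaal output.

```python
def process_limits(nq, nq0, tda=True):
    """Upper bounds on useful MPI process counts, before any symmetry reduction."""
    return {
        "lmfgwd (--cmd-gd)": nq * (nq0 + 1),
        "hvccfp0 (--cmd-vc)": nq + nq0,
        "hsfp0_sc (--cmd-cx)": nq * nq,
        "lmsig (--cmd-xc)": nq * nq,
        # bse: nq * nq^2 in general, nq * nq * (nq + 1) / 2 with TDA
        "bse (--cmd-bs)": nq * nq * (nq + 1) // 2 if tda else nq ** 3,
    }


# Example: a 2x3x4 mesh (nq = 24) with the maximum of 6 offset-Gamma points.
for name, limit in process_limits(24, 6).items():
    print(f"{name:22s} {limit}")
```

With symmetry the true limits are smaller, so these bounds mainly help spot launchers that request far more processes than an executable can use.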

The complete set of command line arguments to lmgw.sh is documented here.

#### lmsig

lmsig supersedes most of the functions performed by hx0fp0, hx0fp0_sc, hsfp0 and hsfp0_sc in the classic implementation. (An improved version of hsfp0_sc is still retained for the sole purpose of core exchange.) lmsig can calculate both the quasiparticlized $\Sigma^0(q)$ for the QSGW cycle, and the dynamical $\Sigma(q,\omega)$ when the fully interacting Green’s function is sought. By default it generates $\Sigma^0(q)$ without writing W, but it will write W and exit if lmsig is launched with the --wrw switch.

##### Process mapping

lmsig is parallelized extensively with MPI, OMP and CUDA, and it can run efficiently on a very large number of nodes. When optimum performance is needed, it is strongly advised to construct a pqmap file.

This file tells lmsig how to organize the possible qk combinations it must construct in an optimal manner, as explained here. With pqmap, lmsig can keep RAM consumption low by only handling a single q at a time. lmsig can often run more efficiently too, depending on the symmetry the system has. Constructing pqmap is a tedious process, and it can be very difficult to do it well except when there is no symmetry. lmfgwd has a special mode that makes pqmap, as well as some representative MPI flags for the various executables noted in the previous section. See this page for documentation.

##### CUDA

There are no special flags necessary to use GPUs, though you must compile Questaal with it enabled. If no cards are found on a node or if none of the cards are compatible with the minimal compute capability set at build time with the cuda=XX flag then only the CPUs will be used.

The CUDA environment variable CUDA_VISIBLE_DEVICES can be used to reorder or selectively disable cards; other environment variables described on this page may also be useful.

At initialization, processes within a node negotiate accelerator distribution between themselves. For example, if 4 processes are launched on a 4 card node, each process will take control of one of the 4 GPUs; if 2 processes are launched, each will take hold of 2 cards; and if a single process is launched, it will control all cards. If 8 processes are launched, then each pair of processes will share a card. This could be of use on smaller cells; the same MPS setup described above applies here too.

Nodes using process-exclusive mode can break this, because that mode bars 2 processes from having a context on the same device even for a moment while remapping. It is not possible for a process to disassociate itself from a device without binding to another one and without using the driver library (which in turn would make GPU enabled code unbuildable on machines without the driver installed). To work around this situation, provide a process-specific CUDA_VISIBLE_DEVICES environment variable listing only the devices which the particular process should see.
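For the process-exclusive workaround, the per-process device lists can be worked out ahead of time. The sketch below reproduces the 4-card examples given in this section under the assumption of an even, contiguous split; `devices_for_rank` is a hypothetical helper, not the negotiation lmsig performs internally.

```python
def devices_for_rank(local_rank, nprocs, ngpus):
    """Comma-separated device list for one process on a node (assumed even split)."""
    if nprocs <= ngpus:
        per_proc = ngpus // nprocs            # each process owns a block of cards
        first = local_rank * per_proc
        ids = range(first, first + per_proc)
    else:
        ids = [local_rank * ngpus // nprocs]  # several processes share one card
    return ",".join(str(i) for i in ids)


# Device lists on a 4-card node for 1, 2, 4 and 8 processes.
for nprocs in (1, 2, 4, 8):
    print(nprocs, [devices_for_rank(r, nprocs, 4) for r in range(nprocs)])
```

Each string is a value one could export as CUDA_VISIBLE_DEVICES for the corresponding node-local rank, for instance from a wrapper script placed in the launcher.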

##### Mixed precision mode

lmsig has a flag --c4, which tells it to compute Σ (by far the heaviest step in a QSGW calculation) in single precision, while χ and W are still calculated in double precision. A speedup factor of 3-4× has been observed on A100 GPUs with only a small loss of accuracy. As with other optimizations, the effect is small on tiny cells and only worth using on larger ones.

The --c4 flag is only implemented for CUDA and is ignored when running on CPU only nodes.

Users will usually invoke lmsig through lmgw.sh. lmgw.sh accepts this switch and passes it to lmsig.

--c4[=c4tol]
use single precision for select operations until optional RMS Σ tol, defaulting to 5e-5,
continue completely in double precision onwards until --tol is reached, only available on cuda
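The switchover described in the option text can be pictured with the following sketch. The 5e-5 default comes from the description above; the function name, the string labels and the example tolerance are illustrative assumptions, and the c4maxit limit is not modelled.

```python
def precision_for_next_iteration(rms_change, tol, c4tol=5e-5):
    """Decide how the next iteration runs, given the last RMS change in Sigma."""
    if rms_change < tol:
        return "converged"   # --tol reached: stop iterating
    if rms_change >= c4tol:
        return "single"      # still far from convergence: mixed precision
    return "double"          # below c4tol: finish in full double precision


# Example with a hypothetical overall tolerance of 1e-5.
for rms in (1e-3, 2e-5, 5e-6):
    print(rms, precision_for_next_iteration(rms, tol=1e-5))
```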


#### Job restart and dynamic control

lmgw.sh can be canceled and restarted without any extra configuration. It will only repeat the interrupted executable rather than the whole iteration. It can do this because it maintains a directory meta, which contains an iteration counter (file meta/iter) and a collection of zero-size files meta/*.done. One of the latter is created each time an executable completes. Thus, if lmgw.sh is restarted, it will skip over already-completed steps and resume at the executable it was last running. After a cycle completes, meta/*.done are deleted and the value of meta/iter is incremented by one.

Thus lmgw.sh keeps track of iteration number as well as what step of a particular iteration it is in.
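The bookkeeping can be mimicked in a few lines. This is a sketch of the idea only: lmgw.sh itself is a shell script, the step list is taken from the step numbering used elsewhere in this guide, and starting the counter at 0 when meta/iter is absent is an assumption.

```python
import tempfile
from pathlib import Path

STEPS = ["lmf", "lmfgwd", "hvccfp0", "hsfp0_sc", "lmsig"]  # assumed step order


def run_cycle(meta, run_step):
    """Run one cycle, skipping steps that already left a .done marker in meta/."""
    meta.mkdir(exist_ok=True)
    iter_file = meta / "iter"
    iteration = int(iter_file.read_text()) if iter_file.exists() else 0
    for step in STEPS:
        marker = meta / f"{step}.done"
        if marker.exists():
            continue              # finished before the interruption
        run_step(step)
        marker.touch()            # zero-size marker file
    for marker in list(meta.glob("*.done")):
        marker.unlink()           # cycle complete: clear the markers
    iter_file.write_text(str(iteration + 1))
    return iteration + 1


# Simulate a restart in iteration 3 after lmf and lmfgwd had already completed.
with tempfile.TemporaryDirectory() as tmp:
    meta = Path(tmp) / "meta"
    meta.mkdir()
    (meta / "iter").write_text("3")
    (meta / "lmf.done").touch()
    (meta / "lmfgwd.done").touch()
    executed = []
    new_iter = run_cycle(meta, executed.append)
print(executed, new_iter)
```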

You don’t need to specify how many iterations you want. lmgw.sh will iterate until the RMS change in $\Sigma^0$ falls below a tolerance. There is a sensible default tolerance, but it can be overridden with --tol (e.g. --tol 1e-5).

If you do want to specify the number of iterations, the --maxit=N switch specifies that at most N iterations are to be performed across an arbitrary number of restarts.

Specifying --maxit=+N adds N iterations to whatever the current count is. For example, repeated invocation with --maxit=+1 will perform 1 iteration on each invocation until the tolerance is met.
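The difference between the absolute and relative forms of --maxit is simple arithmetic, sketched below. The helper names are hypothetical, and the extra initialization cycle needed when no self-energy exists yet is deliberately left out.

```python
def resolve_maxit(value, current_iter):
    """Translate a --maxit value into an absolute iteration limit."""
    if value.startswith("+"):
        return current_iter + int(value[1:])  # relative: N more iterations
    return int(value)                         # absolute: at most N in total


def iterations_left(value, current_iter):
    """Iterations still allowed before lmgw.sh would stop on the limit."""
    return max(0, resolve_maxit(value, current_iter) - current_iter)


print(iterations_left("5", 3))   # absolute limit, counter currently at 3
print(iterations_left("5", 5))   # limit already reached
print(iterations_left("+1", 7))  # relative form always allows one more
```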

1. lmgw.sh writes the current iteration number into file meta/iter. As execution begins, if meta/iter is present and no --iter switch is specified, the iteration counter will be set to its value.

2. On completion of certain executables, a file with the executable's name and extension .done is created in the meta directory. If lmgw.sh is restarted with these files present, it will skip these steps to speed up the first cycle. If you do not want lmgw.sh to pick up where it left off, simply remove the meta directory, e.g. rm -r meta.

3. Once N iterations have been reached, a new invocation lmgw.sh --maxit=N will simply exit without any further action. The same happens if a file maxit contains a number less than N.

4. If no $\Sigma^0$ is available, lmgw.sh must generate one to find a starting point. In such a case the 0th iteration is counted as an initialization step; thus N “iterations” require N+1 cycles through lmgw.sh. (This is done to maintain a distinction between QSGW and the usual (one-shot) G0W0.) If you want a one-shot calculation, remove the self-energy sigm.ext so that it starts from DFT and launch lmgw.sh with lmgw.sh --maxit=0.

5. You can start a calculation with a minimal set of files, namely the files you need for DFT, plus sigm.ext. However, as of this writing, two self-energy files are maintained, sigm.ext and sig.h5. Eventually these will be merged into one file, but as of this writing sig.h5 is used by lmgw.sh at the end of a cycle, when it mixes the newly generated $\Sigma^0(q)$ with prior iterations, to accelerate convergence. If you start a fresh calculation with sigm.ext but do not have sig.h5, the mixer gets confused. If you want to start from some sigm but are missing sig.h5, or if sig.h5 is not the one generated simultaneously with sigm.ext, you can generate sig.h5 from sigm.ext with ln -s sigm.ext sigm; s2s5; rm sigm.

6. If you change input conditions, you are advised to clear the working directory of files that may conflict with the prior run. Do this with lmgwclear ext.

To summarize, when starting a new calculation (as opposed to restarting one), be sure to (1) remove meta; (2) run lmgwclear; (3) ensure sig.h5 is consistent with sigm.ext.

On rare occasions it is useful to control a job while it is executing. A number of plain text files are sourced from within lmgw.sh to facilitate this:

* switches-for-lm: command line switches for lmf and lmfgwd
* lmsig-flags: command line switches for lmsig
* bse-flags: command line switches for bse
* tol: tolerance for RMS change in Σ
* maxit: adjust global number of iterations
* c4tol: single-to-double precision switchover tolerance for RMS change in Σ
* c4maxit: maximal number of single precision iterations, switchover to double afterwards
* stop: 'graceful shutdown' after completion of the current iteration, the file will be deleted so as not to interfere with future relaunches

##### Split mode

In the offset-Γ method, the polarizability is evaluated at a number of q points close to Γ. The compute effort can be nontrivial; however, the final data obtained is minuscule, making it a good place to restart. The --split-w0 flag will separate the offset point calculations from the rest.

The steps leading to lmsig scale differently from lmsig itself, so running everything in one large job can be fairly wasteful. To address this, use the --start and --stop switches to specify points at which to partition a calculation across jobs of different scales. These jobs can be linked with dependencies, so in the end not much extra human effort is spent policing the calculation.

## Special functionality

### Dielectric function at arbitrary q

See this tutorial.

### Dynamical Σ

See this tutorial.

The primary instruction is

    lmsig --dynsig~nw=101~wmin=-2.0~wmax=2.0 ctrl.lm


All the --dynsig subarguments are optional and the default values are given in the above line. There is an extra subargument ~bmax=integer if one would like to restrict the number of states further than what the ctrl file gwemax allows.

The above command makes the same files the original code made, for compatibility with the lmfgws and spectral executables.

### Write W

lmsig --wrw-only  ....


This creates file wepss.h5 containing $W(\omega,q,I,J)$, with $I$ and $J$ indices to the two-particle basis. Adding the flag --om0 fixes ω=0.

### BSE

#### Self-consistent W mode

To include ladder diagrams in W in each QSGW iteration, add a BSE subsection to the GW category in the ctrl file. This is an example:

GW  NKABC= 2 3 4  BSE[TDA=T NV=8 NC=4 IMW=0.02 0.02]


All tokens are optional but it is strongly suggested to set NV and NC. Brief descriptions can be found with the on-line help (bse --input). A practical example is given in this tutorial.

• TDA: Apply the Tamm-Dancoff approximation (this is the default).
• NV: Number of valence/occupied states in BSE calculation of W. Defaults to all occupied states.
• NC: Number of conduction/unoccupied states in BSE calculation of W.
• IMW: Smearing in Ry, linear interpolation between the two values.

The q and ω meshes are taken from the GW category. The rank of the 2 particle BSE Hamiltonian with TDA applied for the above example will be 8×4 × 2×3×4 = 768 per spin. Sizes of up to about 90000 are diagonalizable on a GPU node with 4×A100 and 40GB memory in very reasonable timescales.
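The rank quoted above, and a rough idea of why about 90000 is near the practical limit, follow from simple arithmetic. The sketch assumes a dense matrix stored in double-precision complex (16 bytes per element); that storage format is an assumption for illustration, not a statement about the bse executable.

```python
from math import prod


def bse_rank(nv, nc, nkabc):
    """Rank of the TDA two-particle Hamiltonian per spin: NV * NC * number of k points."""
    return nv * nc * prod(nkabc)


def dense_matrix_gb(rank, bytes_per_element=16):
    """Storage for a dense rank x rank matrix in GB (16 bytes: assumed complex double)."""
    return rank * rank * bytes_per_element / 1e9


print(bse_rank(8, 4, (2, 3, 4)))  # the NV=8, NC=4, NKABC=2 3 4 example
print(dense_matrix_gb(90000))     # one dense matrix at the quoted size limit
```

If the 40GB refers to each of the four A100 cards, a single matrix of rank 90000 already occupies most of the combined 160 GB under this assumption.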

The simplest way to run lmgw.sh is simply to add the --bsw switch. This works fine for small examples, but beyond this custom launchers may be needed. The suggested approach is to launch 1 bse process per GPU, socket or node and occupy the applicable number of cores with OpenMP threads. If the number of processes is a multiple or a divisor of the number of irreducible q points including offset points, then work will be distributed in a maximally efficient way, minimizing communication. If this condition is not satisfied, it is recommended to provide a process mapping file called pqmap-bse-N, where N is the total number of MPI processes.

The construction of the large particle-hole interaction matrix is done in tiles by processes and groups of threads, controlled in a way similar to the process grouping in lmf described above(2). The automatic distribution is not too bad, though it can be a little conservative to prevent crashes due to memory exhaustion. It can be specified with the flag --tmesh=NGROUPSxGROUPSIZE following the same conventions as --pmesh above; however, the product must match the value of OMP_NUM_THREADS.(3) This flag is applicable to bse only and can be passed through to it by lmgw.sh with --bse-flags="--tmesh=...". If only a few states are chosen but there are many k points, then having a large number of small thread groups can be extremely beneficial (scaling by a factor of >10×).

Running with TDA=F is extremely difficult beyond small examples at this stage. The problem becomes non-Hermitian (there are ways to deal with this for the q=0 case, but other cases need more consideration) and non-orthogonal; linear algebra libraries are not well optimized for this and can be extremely inefficient. The code will try to solve this problem through LU decompositions at each ω point, which can also be very costly, although much better optimized.
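Because the product NGROUPS×GROUPSIZE must equal OMP_NUM_THREADS, the admissible --tmesh values are just the factor pairs of the thread count. A minimal sketch (the helper name is hypothetical):

```python
def tmesh_options(omp_num_threads):
    """All --tmesh=NGROUPSxGROUPSIZE values whose product equals OMP_NUM_THREADS."""
    return [
        f"--tmesh={groups}x{omp_num_threads // groups}"
        for groups in range(1, omp_num_threads + 1)
        if omp_num_threads % groups == 0
    ]


# Candidates for OMP_NUM_THREADS=16, from one large group to many small ones.
for option in tmesh_options(16):
    print(option)
```

Entries later in the list correspond to many small thread groups, the regime described above as beneficial when few states but many k points are used.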

#### BSE optics mode

See this tutorial for an example.

In addition to the above, the ctrl file also needs:

• EMESH: definition of the frequency mesh for BSE optics: (start,end,step), in Ry. Can be adjusted after solution of the two-particle hamiltonian.
• EIMW:   smearing for BSE optics, in Ry, linear interpolation between the two values. Can be adjusted after solution of the two-particle hamiltonian.
• QVECS: the directions for which the q→0 limit is taken. Defaults to the following set: 1 0 0   0 1 0   0 0 1   1 1 1.

Also needed is the OPTICS category in the ctrl file (blm automatically adds this category to the control file when invoked with --optics). There is an important choice for the approximation of the position operators to be made.

Launch with lmgw.sh --bse .... Only q=0 is considered at present, though when run in conjunction with --epsq (i.e. --epsq --bse), it will calculate the dielectric function at the offset Γ points. These are small and close enough to zero that it can be used as a proxy for the macroscopic dielectric function without resorting to approximations for the position operator. It is explained in some detail in this tutorial.

Other than this, the same considerations apply as to the construction of W in the BSE. Significant eigenvalues will be printed in groups, together with the corresponding oscillator strengths. The solution to the eigenvalue problem will be written to file w2q0e.h5. This enables interoperability but also is very useful for choosing appropriate values for EMESH and EIMW above. To do this one needs to copy the bse command line from the lmgw.sh output and simply rerun it. Even on one core it should usually complete in seconds.

The dielectric functions where the q→0 limit is taken from the specified directions are tabulated in a simple plain text file eps-plot.txt suitable for direct plotting with pylab and other graphics programs.

The eigensolutions written to file are also used for postprocessing operations such as projections of chosen excitons on bands, 3d real space exciton plots, and (coming soon) orbital projections.

##### Exciton band weights

After an optics calculation is performed, to integrate the weights over the emin-emax window, append --wexbw~ewin=emin,emax to the bse command line. Units are in Ry by default but one can set explicitly eV, Ry or Ha by appending the respective suffix to emax. The weights will be written to file exbw.h5 (or one provided by ~fn= subargument to --wexbw) which can be used from pylab to enhance band plots.

##### 3D exciton plotting

After an optics calculation is performed, to integrate the weights over the emin-emax window append --wexpsi to the bse command line. A number of suboptions are necessary too:

~ewin=emin,emax[Ry|eV|Ha] to specify energy window in choice of units, applicable to both spins, alternatively use ~eidx=u,d to specify particular eigenstate index for spin u and d respectively.

There are two main modes at present:

• ~r+=0.0,0.0,0.0{p|c} fixes the hole position in either dimensionless PLAT or Cartesian (ALAT) coordinates and plots the probability of finding the bound electron.
• ~r-=0.0,0.0,0.0{p|c} fixes the electron position in the chosen coordinate units and plots the probability of finding the bound hole.

If the specified position does not fall on a FTMESH point, the closest point will be chosen and information concerning this change will be printed.

The table printed during the optics run can be useful in determine energy windows or band indices.

The supercell for the 3d plot can be specified through an optional origin point ~t0=0.0,0.0,0.0, given in units of the primitive lattice translation vectors PLAT, and number of unit cell replicas with for example: ~tn=1,1,1, the default choice is to have the original ‘0,0,0’ coordinate be at the center of the supercell and the supercell size is taken from the k mesh so in most cases the use of t0 and tn is not necessary, however it can reduce compute time and file sizes if tn smaller than k-mesh is given, in this case thought t0 can only have integers.

Finally, the output file name can be specified with ~fn=, which is often useful for encoding different positions and energy windows. The file name extensions .cube and .xsf are recognised and select the corresponding formats; otherwise the data are written in HDF5 format. Specifics of the plain text .cube and .xsf formats:

• they do not contain the position of the fixed particle
• they can be rather large, but they are quite compressible
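Putting the suboptions above together, a complete --wexpsi switch might look like the following sketch. The suboption order, the joining of suboptions with ~ in one switch, and the file name hole_origin.cube are assumptions for illustration, not taken from a tested run.

```shell
# Assemble a --wexpsi switch: fix the hole at the origin and plot the
# probability of finding the bound electron for an illustrative energy window.
ewin="0.10,0.25eV"           # energy window with unit suffix on emax
hole="0.0,0.0,0.0p"          # trailing p = PLAT, c = Cartesian (ALAT)
tn="1,1,1"                   # optional: number of unit cell replicas
outfile="hole_origin.cube"   # hypothetical name; .cube selects the cube format

# Assumption: suboptions can be given in this order within one switch.
wexpsi_arg="--wexpsi~ewin=${ewin}~r+=${hole}~tn=${tn}~fn=${outfile}"
echo "$wexpsi_arg"
```

Because the .cube and .xsf files do not record the fixed particle's position, encoding it in the file name, as here, keeps outputs from different runs distinguishable.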

## Command line arguments to lmgw.sh

This section summarizes the command-line arguments that lmgw.sh accepts. See also this page for further details on how to configure lmgw.sh for an HPC environment.

• --mpirun-1 “strn”
Default launcher for all compute processes, unless overridden in special cases.
Intended for serial processes; however, a queuing system may not allow a program to execute in serial mode.
In such a case, you can use “mpirun -n 1”.
Example: --mpirun-1 'env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 mpirun -n 1'

• --mpirun-n “strn”
Default launcher for all MPI compute processes, unless overridden in special cases.
Binaries which execute with MPI parallelism use this switch as a default.
If not specified, defaults to the --mpirun-1 launcher.
Example for a node with 32 cores: --mpirun-n 'env OMP_NUM_THREADS=4 MKL_NUM_THREADS=4 mpirun -n 8'

• --cmd-[lm, gd, vc, cx, w0, xc, bs, op] “strn”
override mpirun-n for the following special cases:
• lm: DFT level: lmf (step 1)
• gd: GW driver: lmfgwd (step 2)
• vc: Coulomb potential maker hvccfp0 (step 3)
• cx: core-exchange self energy maker hsfp0_sc (step 4)
• w0: only in --split-w0 mode, used by lmsig: polarizability at offset-gamma points to make $W(q=0)$
• xc: to make $W$ and $\Sigma$ with lmsig (step 5)
• bs: Arguments to bse when making the BSE screened Coulomb interaction W.
• op: Arguments to bse when calculating optics within the BSE.

• --split-w0
split the χ0(0,0,ω,q=0) and W0(0,0,ω,q=0) calculation from the rest and communicate the values through files x00 and w0c.

• --split-w
split the calculation for W by writing data to disk, and reading it back from disk when making Σ. Caution: these files can be huge!

• --bsw
After computing the screened W(ω,q=0) in the random phase approximation, recalculate it including ladder diagrams via a BSE and use $W^{\mathrm{BSE}}$ when making Σ=iGW. See this tutorial.

• --bse
Compute BSE dielectric function at q=0

• --epsq
Compute RPA dielectric function at q given in the ctrl file, as explained in this tutorial

• --epsq   --bse
Compute BSE dielectric function at the offset Γ points, as explained in this tutorial

• --mem G  |  --mem=G
try to limit memory use to G GB/process.
Implemented so far only in lmsig.

• --iter I
start execution at the Ith iteration (counting from 0 if no sigm file present)

• --maxit N
perform no more than N iterations (N+1 if no sigm file present)
Default: N = 100

• --tol TOL
exit once RMS change in $\Sigma^0$ converges to given tolerance TOL.
Default: TOL = 1d-5

• --no-exec
print commands that would execute, without executing them

• --no-scrho
suppress making the density self-consistent (skip lmf self-consistency)

• --lm-flags “strn”  |  --lm-flags=“strn”  |  --lm-args “strn”  |  --lmargs “strn”  |  -v*=strn
switches passed to lmf and lmfgwd

• --bse-flags “strn”  |  --bse-flags=“strn”
switches passed to bse

• --sig-flags “strn”  |  --sig-flags=“strn”
switches passed to lmsig

• --start=[setup|x0|sig|qpe|bse]
start at given step

• --stop=[setup|x0|sig|qpe|bse]
stop after given step

• --sym
symmetrise sigm.ext, using lmf to do it.

• --band=strn
After launching lmf, run extra pass with --band=strn to generate energy bands.

• --c4  |  --c4:tol
use single precision for selected operations in lmsig until the RMS change in Σ falls below a tolerance, after which the calculation continues in double precision. The tolerance defaults to 5e-5, but the user can set it with --c4:tol.
Available only on CUDA.

• -qg4gw-flags=+qm  |  -qg4gw-flags=+qa
Specify a particular kind of mesh for the offset-gamma points. No longer recommended. Possible options:
• +qm mid plane qlat axes (or perpendicular if angles < π/2)
• +qa Along the axes of qlat

• --bswmx
Not documented

• -h, --help
displays this help
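The launcher strings for --mpirun-1 and --mpirun-n can be derived from a node's core count, following the 32-core example in the list above. This is a minimal sketch: the variable names are illustrative, and the script only prints the switches rather than calling lmgw.sh.

```shell
# Split the cores of one node between MPI processes and OpenMP threads.
cores=32      # cores on the node (illustrative value)
threads=4     # threads per MPI process

nranks=$((cores / threads))   # MPI processes that fit on the node
env_part="env OMP_NUM_THREADS=${threads} MKL_NUM_THREADS=${threads}"

mpirun_1="${env_part} mpirun -n 1"
mpirun_n="${env_part} mpirun -n ${nranks}"

# Print the switches; the single quotes keep each launcher one argument.
# --no-exec is added so that a first run only prints the commands.
printf "%s\n" "--mpirun-1 '${mpirun_1}' --mpirun-n '${mpirun_n}' --no-exec"
```

Changing threads shifts the balance between MPI processes and OpenMP threads; the best split is machine specific, as noted in the introduction.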

The section below explains how to generate many of these switches automatically with Questaal’s utilities.

## Footnotes and References

(1) The words GPU, accelerator, card and CUDA hardware are used interchangeably to mean an NVIDIA accelerator card connected over PCI Express (PCIe) or a socketed chip (SXM).

(2) lmf can perform multi-level MPI parallelization: the total number of MPI processes n is factored into n=nk×na, where the k points are distributed over nk processes, and each k point is assigned na processes to construct the Hamiltonian. See this section. Multilevel parallelization is also used in the BSE code.

(3) If you are using the Intel MKL library, set MKL_NUM_THREADS to the same value you set OMP_NUM_THREADS to, e.g. use env OMP_NUM_THREADS=n MKL_NUM_THREADS=n instead of env OMP_NUM_THREADS=n.

(4) Questaal’s QSGW implementation is based on this paper:
Takao Kotani, M. van Schilfgaarde, S. V. Faleev, Quasiparticle self-consistent GW method: a basis for the independent-particle approximation, Phys. Rev. B 76, 165106 (2007)