The Lattice Boltzmann Method in Kokkos

Performance Lessons Learned

Julian Karrer

Correctness

\(1920\times 1080\) domain, \(\omega=1.7\), \(u_x^{\text{inlet}}=0.05\)

Shearwave decay

[Plot: velocity amplitude \(\left\lVert\vec{v}_{\max}\right\rVert\) \((c)\) over time \(\left(\frac{a}{c}\right)\)]

200k time steps of D2Q9 shearwave decay with \(u_x(t=0)=0.1\), \(\omega \in \left\{\frac{1}{k} \,\middle|\, k\in\{1,\dots,5\}\right\}\) on \(1000\times 1000\) nodes

Shearwave decay

[Plot: measured viscosity \(\nu\) \(\left(c\cdot a\right)\) over relaxation coefficient \(\omega\)]

200k time steps of D2Q9 shearwave decay with \(u_x(t=0)=0.1\), \(\omega \in \left\{\frac{1}{k} \,\middle|\, k\in\{1,\dots,5\}\right\}\) on \(1000\times 1000\) nodes
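For reference, these two plots verify the standard decay law and the analytic BGK viscosity: a shear wave with wavenumber \(k\) decays exponentially, and \(\omega\) fixes the viscosity,

\(u_{\max}(t) = u_{\max}(0)\,e^{-\nu k^2 t}\), \(\qquad \nu = c_s^2\left(\frac{1}{\omega}-\frac{1}{2}\right)\), \(\quad c_s^2 = \frac{1}{3}\) in lattice units,

so fitting the measured amplitude decay for each \(\omega\) yields the data points compared against this prediction.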

1. Hand-Tuning and Precision

Micro-optimizations

  1. Manual Loop-Unrolling
  2. Common Subexpression Elimination
  3. Arithmetic Simplification
  4. Read/Write reordering

No significant difference (within \(\pm\sigma\)) on any tested GPU
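For illustration, a sketch of the kind of transformation tested (a hypothetical helper, not the talk's actual kernel): the D2Q9 moment sums unrolled by hand, with \(1/\rho\) hoisted as a common subexpression and the divisions replaced by multiplications.

// Hypothetical hand-tuned D2Q9 moments for one node, using the
// direction set cx = {0,1,0,-1,0,1,-1,-1,1}, cy = {0,0,1,0,-1,1,1,-1,-1}.
inline void moments(const float f[9], float& rho, float& ux, float& uy) {
  // manual unrolling of the direction loop
  rho = f[0] + f[1] + f[2] + f[3] + f[4] + f[5] + f[6] + f[7] + f[8];
  const float inv_rho = 1.0f / rho;  // common subexpression, computed once
  ux = (f[1] - f[3] + f[5] - f[6] - f[7] + f[8]) * inv_rho;
  uy = (f[2] - f[4] + f[5] + f[6] - f[7] - f[8]) * inv_rho;
}

Optimizing compilers already perform these rewrites on their own, which is consistent with the null result above.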

Precision

  • Use float instead of double if possible!

GPU    | Relative speedup with float
3060Ti | \(+102\%\)
MI300A | \(+100\%\)
H100   | \(+101\%\)

  • Roughly a factor of two across a variety of hardware
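A minimal sketch of how to keep that switch a one-line change in a Kokkos code (the alias names are illustrative, not from the talk):

#include <Kokkos_Core.hpp>

// A single alias controls the arithmetic type everywhere; float halves
// the bytes moved per value, which for a bandwidth-bound method like
// LBM translates almost directly into the factor-of-two speedup above.
using real_t = float;   // switch to double if the physics requires it
using Field  = Kokkos::View<real_t***>;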

2. Efficiently Use Memory Bandwidth

Data Dependencies

\(\begin{bmatrix}f_0(\vec{x})\\ f_1(\vec{x})\\ \vdots\\ f_8(\vec{x})\end{bmatrix} \longrightarrow \rho(\vec{x}),\ \vec{u}(\vec{x}) \longrightarrow f^{eq}(\vec{x}) \longrightarrow \begin{bmatrix}f_0(\vec{x}+\Delta\vec{x}_0)\\ f_1(\vec{x}+\Delta\vec{x}_1)\\ \vdots\\ f_8(\vec{x}+\Delta\vec{x}_8)\end{bmatrix}\)

Each node's new populations depend only on pre-streaming values at the node itself and its neighbours, so collision and streaming can be fused into a single kernel with one read and one write per value.

Data Dependencies

\(\begin{bmatrix}f_0(\vec{x})\\ f_1(\vec{x})\\ \dots\\ f_8(\vec{x})\end{bmatrix}\)
\(\rho(\vec{x})\)
\(\vec{u}(\vec{x})\)
\(f^{eq}(\vec{x})\)
\(\begin{bmatrix}f_0(\vec{x}+\Delta\vec{x}_0)\\ f_1(\vec{x}+\Delta\vec{x}_1)\\ \dots\\ f_8(\vec{x}+\Delta\vec{x}_8)\end{bmatrix}\)

Fused Stream-Collide

Pull: read from neighbours, write locally.
Push: read locally, write to neighbours.

[Wittmann et al. 2012]
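A minimal sketch of the pull variant in Kokkos (assuming periodic boundaries, SoA layout, and illustrative names; not the talk's exact kernel):

#include <Kokkos_Core.hpp>

constexpr int Q = 9;
using Field = Kokkos::View<float***, Kokkos::LayoutLeft>; // (x, y, direction)

// Pull: each node gathers the populations streaming in from its
// neighbours, relaxes them towards equilibrium, and writes locally.
void stream_collide_pull(Field f_src, Field f_dst, float omega,
                         int nx, int ny) {
  Kokkos::parallel_for("stream_collide_pull",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {nx, ny}),
    KOKKOS_LAMBDA(const int x, const int y) {
      const int   cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
      const int   cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
      const float w[Q]  = {4.f/9,  1.f/9,  1.f/9,  1.f/9, 1.f/9,
                           1.f/36, 1.f/36, 1.f/36, 1.f/36};
      float f[Q], rho = 0.f, ux = 0.f, uy = 0.f;
      for (int i = 0; i < Q; ++i) {
        // read from the upstream neighbour (periodic wrap for brevity)
        f[i] = f_src((x - cx[i] + nx) % nx, (y - cy[i] + ny) % ny, i);
        rho += f[i];
        ux  += f[i] * cx[i];
        uy  += f[i] * cy[i];
      }
      ux /= rho;
      uy /= rho;
      const float uu = ux * ux + uy * uy;
      for (int i = 0; i < Q; ++i) {
        const float cu  = cx[i] * ux + cy[i] * uy;
        const float feq = w[i] * rho
                        * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * uu);
        f_dst(x, y, i) = f[i] + omega * (feq - f[i]); // write locally
      }
    });
}

The push variant is the mirror image: read all nine populations locally, then scatter the post-collision values to the neighbours.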

Results - Push vs. Pull

[Plot: lattice updates per second \(\left[\frac{1}{\mathrm{s}}\right]\) for the push and pull schemes]

Peak performance: 28.6 BLUPS (H100) and 22.0 BLUPS (A100), in billion lattice updates per second

A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay

3. Use Coalescing Memory Accesses

Memory Layout

Array of Structs - uncoalesced

[Navarro-Hinojosa et al. 2018, Chapter 3.2]

Memory Layout

Struct of Arrays - coalesced

[Navarro-Hinojosa et al. 2018, Chapter 3.2]
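In Kokkos terms the two layouts are a template argument apart; a sketch with illustrative names:

#include <Kokkos_Core.hpp>

constexpr int Q = 9;

// Array of Structs: the Q populations of one node are adjacent, so
// neighbouring threads reading direction i touch addresses Q floats
// apart -- uncoalesced on GPUs.
struct Node { float f[Q]; };
using AoS = Kokkos::View<Node**>;

// Struct of Arrays: with LayoutLeft the x index is stride-1, so a warp
// reading f(x..x+31, y, i) touches contiguous memory -- one coalesced
// transaction per direction.
using SoA = Kokkos::View<float***, Kokkos::LayoutLeft>;

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    AoS aos("aos", 3000, 3000);     // aos(x, y).f[i]: i is contiguous
    SoA soa("soa", 3000, 3000, Q);  // soa(x, y, i): x is contiguous
  }
  Kokkos::finalize();
}

Kokkos already defaults to the coalescing-friendly LayoutLeft on GPU backends; the pitfall is making the population index the fastest-varying one.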

Results - Coalescing Accesses

[Plot: lattice updates per second \(\left[\frac{1}{\mathrm{s}}\right]\) for both memory layouts]

Pull scheme, A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay

4. Work While Communicating

MPI Communication

Communicate each border and corner asynchronously

MPI Communication

\(\mathcal{O}(N)\) halo nodes \(\ll\) \(\mathcal{O}(N^2)\) inner nodes

MPI Communication

  1. Post all IRecv
  2. Pack Buffers and ISend them
  3. Work on inner nodes during data transfer!
  4. Block only if transfer not yet completed
  5. Work on outer nodes (see the sketch below)

[Timeline: the host posts receives, packs buffers, and sends; the device packs buffers, updates inner nodes, blocks only if transfers are still pending, then updates outer nodes]
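A sketch of the overlap skeleton (names are illustrative, not the talk's code; assumes a GPU-aware MPI so device-resident buffers can be passed directly):

#include <mpi.h>
#include <Kokkos_Core.hpp>

constexpr int NB = 8; // 4 borders + 4 corners of a 2D subdomain

void halo_exchange(Kokkos::View<float*> send[NB],
                   Kokkos::View<float*> recv[NB],
                   const int neighbour[NB], MPI_Comm comm) {
  MPI_Request reqs[2 * NB];
  // 1. Post all receives first so every message has a landing zone
  //    (one message per neighbour, so a single tag suffices).
  for (int n = 0; n < NB; ++n)
    MPI_Irecv(recv[n].data(), int(recv[n].size()), MPI_FLOAT,
              neighbour[n], 0, comm, &reqs[n]);
  // 2. Pack the halo populations into send[n] on the device (kernel
  //    omitted), fence so the buffers are complete, then send.
  Kokkos::fence();
  for (int n = 0; n < NB; ++n)
    MPI_Isend(send[n].data(), int(send[n].size()), MPI_FLOAT,
              neighbour[n], 0, comm, &reqs[NB + n]);
  // 3. Update the O(N^2) inner nodes here; they need no halo data,
  //    so this work hides the time the messages spend in flight.
  // 4. Block only if the transfers have not completed yet.
  MPI_Waitall(2 * NB, reqs, MPI_STATUSES_IGNORE);
  // 5. Unpack recv[n] and update the O(N) outer nodes.
}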

Results - Weak Scaling

[Plot: scaling efficiency \(\frac{T(1)}{T(N)}\) over number of processes \(N\)]

Nvidia A100 GPUs, 32000 \(\times\) 32000 nodes per process, 100 steps, 5 repeats, D2Q9 Shearwave Decay

Thank you for your attention?

Bonus: Taichi

Taichi Features

  • Embedded in Python
  • Portable, fast, statically typed
  • Layout- and Dimension-independent Code
  • GUI, LAS, Solvers etc. included
  • Differentiable
  • Only \(\sim 15\%\) performance penalty

Full example

# Imports
import taichi as ti
import taichi.math as tm 

# Start Taichi
ti.init(arch=ti.gpu)

# Define constants
NY = 800
NX = 800
Q = 9
OMEGA = 1.7
RHO = 1.
U_0 = 0.1
w = [4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36]
cx = [0,1,0,-1,0,1,-1,-1,1]
cy = [0,0,1,0,-1,1,1,-1,-1]
refl = [0,3,4,1,2,7,8,5,6]

# Define fields
buf = ti.field(dtype=ti.f32, shape=(NX,NY,Q,))
pixels = ti.Vector.field(n=3, dtype=ti.f32, shape=(NX,NY))
f = ti.field(ti.f32)
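# direction index k outermost: struct-of-arrays, coalesced (cf. lesson 3)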
ti.root.dense(ti.k, Q).dense(ti.ij, (NX,NY)).place(f)

# Initialize a fluid at rest
@ti.kernel
def init_rest():
    for x, y in ti.ndrange(NX, NY):
        for i in ti.static(range(Q)):
            f[x,y,i] = w[i] * RHO 

# Fused collide and stream kernel
@ti.kernel
def push():
    for x, y in ti.ndrange(NX, NY):
        # Compute density
        rho = 0.
        for i in ti.static(range(Q)):
            rho += f[x,y,i]

        # Compute velocity
        u = ti.Vector([0.,0.])
        for i in ti.static(range(Q)):
            u += f[x,y,i] * ti.Vector([cx[i],cy[i]])
        u /= rho

        # Visualize: normalized flow speed, broadcast to an RGB pixel
        pixels[x,y] = tm.vec3(tm.length(u)/U_0)

        for i in ti.static(range(Q)):
            # collide: BGK relaxation towards local equilibrium
            ciui = u.x*cx[i] + u.y*cy[i]
            f_eq = w[i] * rho * (1 + 3*ciui + 4.5*ciui*ciui - 1.5*tm.dot(u,u))

            # sliding lid boundary: momentum added by the moving top wall
            df = ti.static(RHO * U_0 / 6.) * (-1 if i==5 else 1) if (y==NY-1 and (i==5 or i==6)) else 0.
            # bounce-back at walls: reflect the direction instead of streaming
            i_refl = refl[i] if ((x==0 and cx[i]<0) or (x==NX-1 and cx[i]>0) or (y==0 and cy[i]<0) or (y==NY-1 and cy[i]>0)) else i
            drx = 0 if (x==0 and cx[i]<0) or (x==NX-1 and cx[i]>0) else cx[i]
            dry = 0 if (y==0 and cy[i]<0) or (y==NY-1 and cy[i]>0) else cy[i]

            # push the post-collision value; the wall term df is added after relaxation
            buf[x+drx, y+dry, i_refl] = f[x,y,i] + ti.static(OMEGA) * (f_eq - f[x,y,i]) + df

# Create and run GUI
def run_gui():
    gui = ti.GUI("LBM D2Q9", res=(NX, NY), fast_gui = True)
    init_rest()
    while gui.running:
        push()
        f.copy_from(buf)
        gui.set_image(pixels)
        gui.show()

run_gui()

\(\leq\) 50 Lines of Code, including GUI

Thank you for your attention!