The Lattice Boltzmann Method in Kokkos

Performance Lessons Learned

Julian Karrer

Correctness

\(1920\times 1080\) domain, \(\omega=1.7\), \(u_x^{\text{inlet}}=0.05\)

Shearwave decay

[Plot: velocity amplitude \(\left\lVert\vec{v}_{\max}\right\rVert\) \((c)\) over time \(\left(\frac{a}{c}\right)\)]

200k time steps of D2Q9 shearwave decay with \(u_x(t=0)=0.1\), \(\omega \in \left\{\frac{1}{k} \,\middle|\, k\in\{1,\dots,5\}\right\}\) on \(1000\times 1000\) nodes

Shearwave decay

[Plot: measured viscosity \(\nu\) \(\left(c\cdot a\right)\) over relaxation coefficient \(\omega\)]

200k time steps of D2Q9 shearwave decay with \(u_x(t=0)=0.1\), \(\omega \in \left\{\frac{1}{k} \,\middle|\, k\in\{1,\dots,5\}\right\}\) on \(1000\times 1000\) nodes
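For reference, these two plots verify the standard decay law and the analytic BGK viscosity: a shear wave with wavenumber \(k\) decays exponentially, and \(\omega\) fixes the viscosity,

\(u_{\max}(t) = u_{\max}(0)\,e^{-\nu k^2 t}\), \(\qquad \nu = c_s^2\left(\frac{1}{\omega}-\frac{1}{2}\right)\), \(\quad c_s^2 = \frac{1}{3}\) in lattice units,

so fitting the measured amplitude decay for each \(\omega\) yields the data points compared against this prediction.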

1. Hand-Tuning and Precision

Micro-optimizations

  1. Manual Loop-Unrolling
  2. Common Subexpression Elimination
  3. Arithmetic Simplification
  4. Read/Write reordering

No significant difference (within \(\pm\sigma\)) on any tested GPU
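For illustration, a sketch of the kind of transformation tested (a hypothetical helper, not the talk's actual kernel): the D2Q9 moment sums unrolled by hand, with \(1/\rho\) hoisted as a common subexpression and the divisions replaced by multiplications.

// Hypothetical hand-tuned D2Q9 moments for one node, using the
// direction set cx = {0,1,0,-1,0,1,-1,-1,1}, cy = {0,0,1,0,-1,1,1,-1,-1}.
inline void moments(const float f[9], float& rho, float& ux, float& uy) {
  // manual unrolling of the direction loop
  rho = f[0] + f[1] + f[2] + f[3] + f[4] + f[5] + f[6] + f[7] + f[8];
  const float inv_rho = 1.0f / rho;  // common subexpression, computed once
  ux = (f[1] - f[3] + f[5] - f[6] - f[7] + f[8]) * inv_rho;
  uy = (f[2] - f[4] + f[5] + f[6] - f[7] - f[8]) * inv_rho;
}

Optimizing compilers already perform these rewrites on their own, which is consistent with the null result above.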

Precision

  • Use float instead of double if possible!

GPU    | Relative speedup with float
3060Ti | \(+102\%\)
MI300A | \(+100\%\)
H100   | \(+101\%\)

  • Roughly a factor of two across a variety of hardware
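A minimal sketch of how to keep that switch a one-line change in a Kokkos code (the alias names are illustrative, not from the talk):

#include <Kokkos_Core.hpp>

// A single alias controls the arithmetic type everywhere; float halves
// the bytes moved per value, which for a bandwidth-bound method like
// LBM translates almost directly into the factor-of-two speedup above.
using real_t = float;   // switch to double if the physics requires it
using Field  = Kokkos::View<real_t***>;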

2. Efficiently Use Memory Bandwidth

Data Dependencies

\(\begin{bmatrix}f_0(\vec{x})\\ f_1(\vec{x})\\ \vdots\\ f_8(\vec{x})\end{bmatrix} \longrightarrow \rho(\vec{x}),\ \vec{u}(\vec{x}) \longrightarrow f^{eq}(\vec{x}) \longrightarrow \begin{bmatrix}f_0(\vec{x}+\Delta\vec{x}_0)\\ f_1(\vec{x}+\Delta\vec{x}_1)\\ \vdots\\ f_8(\vec{x}+\Delta\vec{x}_8)\end{bmatrix}\)

Each node's new populations depend only on pre-streaming values at the node itself and its neighbours, so collision and streaming can be fused into a single kernel with one read and one write per value.

Data Dependencies

\(\begin{bmatrix}f_0(\vec{x})\\ f_1(\vec{x})\\ \dots\\ f_8(\vec{x})\end{bmatrix}\)
\(\rho(\vec{x})\)
\(\vec{u}(\vec{x})\)
\(f^{eq}(\vec{x})\)
\(\begin{bmatrix}f_0(\vec{x}+\Delta\vec{x}_0)\\ f_1(\vec{x}+\Delta\vec{x}_1)\\ \dots\\ f_8(\vec{x}+\Delta\vec{x}_8)\end{bmatrix}\)

Fused Stream-Collide

Pull: read from neighbours, write locally.
Push: read locally, write to neighbours.

[Wittmann et al. 2012]
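A minimal sketch of the pull variant in Kokkos (assuming periodic boundaries, SoA layout, and illustrative names; not the talk's exact kernel):

#include <Kokkos_Core.hpp>

constexpr int Q = 9;
using Field = Kokkos::View<float***, Kokkos::LayoutLeft>; // (x, y, direction)

// Pull: each node gathers the populations streaming in from its
// neighbours, relaxes them towards equilibrium, and writes locally.
void stream_collide_pull(Field f_src, Field f_dst, float omega,
                         int nx, int ny) {
  Kokkos::parallel_for("stream_collide_pull",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {nx, ny}),
    KOKKOS_LAMBDA(const int x, const int y) {
      const int   cx[Q] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
      const int   cy[Q] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
      const float w[Q]  = {4.f/9,  1.f/9,  1.f/9,  1.f/9, 1.f/9,
                           1.f/36, 1.f/36, 1.f/36, 1.f/36};
      float f[Q], rho = 0.f, ux = 0.f, uy = 0.f;
      for (int i = 0; i < Q; ++i) {
        // read from the upstream neighbour (periodic wrap for brevity)
        f[i] = f_src((x - cx[i] + nx) % nx, (y - cy[i] + ny) % ny, i);
        rho += f[i];
        ux  += f[i] * cx[i];
        uy  += f[i] * cy[i];
      }
      ux /= rho;
      uy /= rho;
      const float uu = ux * ux + uy * uy;
      for (int i = 0; i < Q; ++i) {
        const float cu  = cx[i] * ux + cy[i] * uy;
        const float feq = w[i] * rho
                        * (1.f + 3.f * cu + 4.5f * cu * cu - 1.5f * uu);
        f_dst(x, y, i) = f[i] + omega * (feq - f[i]); // write locally
      }
    });
}

The push variant is the mirror image: read all nine populations locally, then scatter the post-collision values to the neighbours.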

Results - Push vs. Pull

[Plot: lattice updates per second \(\left[\frac{1}{\mathrm{s}}\right]\) for the push and pull schemes]

Peak performance: 28.6 BLUPS (H100) and 22.0 BLUPS (A100), in billion lattice updates per second

A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay

3. Use Coalescing Memory Accesses

Memory Layout

Array of Structs - uncoalesced

[Navarro-Hinojosa et al. 2018, Chapter 3.2]

Memory Layout

Struct of Arrays - coalesced

[Navarro-Hinojosa et al. 2018, Chapter 3.2]
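In Kokkos terms the two layouts are a template argument apart; a sketch with illustrative names:

#include <Kokkos_Core.hpp>

constexpr int Q = 9;

// Array of Structs: the Q populations of one node are adjacent, so
// neighbouring threads reading direction i touch addresses Q floats
// apart -- uncoalesced on GPUs.
struct Node { float f[Q]; };
using AoS = Kokkos::View<Node**>;

// Struct of Arrays: with LayoutLeft the x index is stride-1, so a warp
// reading f(x..x+31, y, i) touches contiguous memory -- one coalesced
// transaction per direction.
using SoA = Kokkos::View<float***, Kokkos::LayoutLeft>;

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    AoS aos("aos", 3000, 3000);     // aos(x, y).f[i]: i is contiguous
    SoA soa("soa", 3000, 3000, Q);  // soa(x, y, i): x is contiguous
  }
  Kokkos::finalize();
}

Kokkos already defaults to the coalescing-friendly LayoutLeft on GPU backends; the pitfall is making the population index the fastest-varying one.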

Results - Coalescing Accesses

[Plot: lattice updates per second \(\left[\frac{1}{\mathrm{s}}\right]\) for both memory layouts]

Pull scheme, A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay

4. Work While Communicating

MPI Communication

Communicate each border and corner asynchronously

MPI Communication

\(\mathcal{O}(N)\) halo nodes \(\ll\) \(\mathcal{O}(N^2)\) inner nodes

MPI Communication

  1. Post all IRecv
  2. Pack Buffers and ISend them
  3. Work on inner nodes during data transfer!
  4. Block only if transfer not yet completed
  5. Work on outer nodes (see the sketch below)

[Timeline: the host posts receives, packs buffers, and sends; the device packs buffers, updates inner nodes, blocks only if transfers are still pending, then updates outer nodes]
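A sketch of the overlap skeleton (names are illustrative, not the talk's code; assumes a GPU-aware MPI so device-resident buffers can be passed directly):

#include <mpi.h>
#include <Kokkos_Core.hpp>

constexpr int NB = 8; // 4 borders + 4 corners of a 2D subdomain

void halo_exchange(Kokkos::View<float*> send[NB],
                   Kokkos::View<float*> recv[NB],
                   const int neighbour[NB], MPI_Comm comm) {
  MPI_Request reqs[2 * NB];
  // 1. Post all receives first so every message has a landing zone
  //    (one message per neighbour, so a single tag suffices).
  for (int n = 0; n < NB; ++n)
    MPI_Irecv(recv[n].data(), int(recv[n].size()), MPI_FLOAT,
              neighbour[n], 0, comm, &reqs[n]);
  // 2. Pack the halo populations into send[n] on the device (kernel
  //    omitted), fence so the buffers are complete, then send.
  Kokkos::fence();
  for (int n = 0; n < NB; ++n)
    MPI_Isend(send[n].data(), int(send[n].size()), MPI_FLOAT,
              neighbour[n], 0, comm, &reqs[NB + n]);
  // 3. Update the O(N^2) inner nodes here; they need no halo data,
  //    so this work hides the time the messages spend in flight.
  // 4. Block only if the transfers have not completed yet.
  MPI_Waitall(2 * NB, reqs, MPI_STATUSES_IGNORE);
  // 5. Unpack recv[n] and update the O(N) outer nodes.
}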

Results - Weak Scaling

[Plot: scaling efficiency \(\frac{T(1)}{T(N)}\) over number of processes \(N\)]

Nvidia A100 GPUs, 32000 \(\times\) 32000 nodes per process, 100 steps, 5 repeats, D2Q9 Shearwave Decay

Thank you for your attention?

Bonus: Taichi

Taichi Features

  • Embedded in Python
  • Portable, fast, statically typed
  • Layout- and Dimension-independent Code
  • GUI, LAS, Solvers etc. included
  • Differentiable
  • Only \(\sim 15\%\) performance penalty

Full example

# Imports
import taichi as ti
import taichi.math as tm 

# Start Taichi
ti.init(arch=ti.gpu)

# Define constants
NY = 800
NX = 800
Q = 9
OMEGA = 1.7
RHO = 1.
U_0 = 0.1
w = [4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36]
cx = [0,1,0,-1,0,1,-1,-1,1]
cy = [0,0,1,0,-1,1,1,-1,-1]
refl = [0,3,4,1,2,7,8,5,6]

# Define fields
buf = ti.field(dtype=ti.f32, shape=(NX,NY,Q,))
pixels = ti.Vector.field(n=3, dtype=ti.f32, shape=(NX,NY))
f = ti.field(ti.f32)
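# direction index k outermost: struct-of-arrays, coalesced (cf. lesson 3)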
ti.root.dense(ti.k, Q).dense(ti.ij, (NX,NY)).place(f)

# Initialize a fluid at rest
@ti.kernel
def init_rest():
    for x, y in ti.ndrange(NX, NY):
        for i in ti.static(range(Q)):
            f[x,y,i] = w[i] * RHO 

# Fused collide and stream kernel
@ti.kernel
def push():
    for x, y in ti.ndrange(NX, NY):
        # Compute density
        rho = 0.
        for i in ti.static(range(Q)):
            rho += f[x,y,i]

        # Compute velocity
        u = ti.Vector([0.,0.])
        for i in ti.static(range(Q)):
            u += f[x,y,i] * ti.Vector([cx[i],cy[i]])
        u /= rho

        # Visualize: normalized flow speed, broadcast to an RGB pixel
        pixels[x,y] = tm.vec3(tm.length(u)/U_0)

        for i in ti.static(range(Q)):
            # collide: BGK relaxation towards local equilibrium
            ciui = u.x*cx[i] + u.y*cy[i]
            f_eq = w[i] * rho * (1 + 3*ciui + 4.5*ciui*ciui - 1.5*tm.dot(u,u))

            # sliding lid boundary: momentum added by the moving top wall
            df = ti.static(RHO * U_0 / 6.) * (-1 if i==5 else 1) if (y==NY-1 and (i==5 or i==6)) else 0.
            # bounce-back at walls: reflect the direction instead of streaming
            i_refl = refl[i] if ((x==0 and cx[i]<0) or (x==NX-1 and cx[i]>0) or (y==0 and cy[i]<0) or (y==NY-1 and cy[i]>0)) else i
            drx = 0 if (x==0 and cx[i]<0) or (x==NX-1 and cx[i]>0) else cx[i]
            dry = 0 if (y==0 and cy[i]<0) or (y==NY-1 and cy[i]>0) else cy[i]

            # push the post-collision value; the wall term df is added after relaxation
            buf[x+drx, y+dry, i_refl] = f[x,y,i] + ti.static(OMEGA) * (f_eq - f[x,y,i]) + df

# Create and run GUI
def run_gui():
    gui = ti.GUI("LBM D2Q9", res=(NX, NY), fast_gui = True)
    init_rest()
    while gui.running:
        push()
        f.copy_from(buf)
        gui.set_image(pixels)
        gui.show()

run_gui()

\(\leq\) 50 Lines of Code, including GUI

Thank you for your attention!