Performance Lessons Learned
\(1920\times 1080\) domain, \(\omega=1.7\), \(\vec{u}_x^{\text{inlet}}=0.05\)
Plot axes: time \(\left(\frac{a}{c}\right)\) vs. velocity amplitude \(\left\lVert\vec{v}_{max}\right\rVert\) \((c)\)
Plot axes: relaxation coefficient \(\omega\) vs. viscosity \(\nu\) \(\left(c \cdot a\right)\)
Both plots: 200k time steps of D2Q9 shearwave decay with \(u_x(t=0)=0.1\), \(\omega \in \left\{\frac{1}{k} \,\middle|\, k \in \{1,\dots,5\}\right\}\) on \(1000\times 1000\) nodes
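The measured viscosities can be compared against the standard BGK relation \(\nu = \frac{1}{3}\left(\frac{1}{\omega} - \frac{1}{2}\right)\) in lattice units. The sketch below evaluates it for the five relaxation coefficients and models the decay as \(u_0 e^{-\nu k^2 t}\). The function names and the choice of one sine period across the domain are assumptions for illustration; the slides do not state the wavelength used.

```python
import math

def bgk_viscosity(omega: float) -> float:
    """Kinematic viscosity of the BGK model in lattice units (c_s^2 = 1/3)."""
    return (1.0 / omega - 0.5) / 3.0

def shear_wave_amplitude(u0: float, omega: float, length: int, t: float) -> float:
    """Amplitude of a decaying shear wave; assumes one sine period over `length` nodes."""
    k = 2.0 * math.pi / length
    return u0 * math.exp(-bgk_viscosity(omega) * k * k * t)

for k in range(1, 6):
    omega = 1.0 / k
    amp = shear_wave_amplitude(u0=0.1, omega=omega, length=1000, t=200_000)
    print(f"omega={omega:.3f}  nu={bgk_viscosity(omega):.4f}  amplitude(200k)={amp:.3e}")
```

Smaller \(\omega\) means larger viscosity and therefore faster decay, which is the trend the two plots are meant to show.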
NO significant difference (\(\pm\sigma\)) for any tested GPU
float instead of double if possible!

GPU | Relative speedup for float |
---|---|
3060Ti | \(+102\%\) |
MI300A | \(+100\%\) |
H100 | \(+101\%\) |
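A speedup of about \(+100\%\) is what a simple memory-traffic model predicts. The sketch below assumes a fused D2Q9 step that reads and writes each of the 9 populations exactly once per node and is purely bandwidth-bound; both are assumptions, not statements from the measurements.

```python
import numpy as np

Q = 9  # populations per node in D2Q9

def bytes_per_update(dtype, q: int = Q) -> int:
    """Minimum traffic per node update: one read and one write per population."""
    return 2 * q * np.dtype(dtype).itemsize

f32_bytes = bytes_per_update(np.float32)
f64_bytes = bytes_per_update(np.float64)

# If the kernel is bandwidth-bound, throughput scales inversely with traffic.
expected_speedup = f64_bytes / f32_bytes - 1.0
print(f"float: {f32_bytes} B/update, double: {f64_bytes} B/update, "
      f"expected speedup {expected_speedup:+.0%}")
```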
[Wittmann et al. 2012]
Plot axis: lattice updates per second \(\left[\frac{1}{\sec}\right]\)
Peak performance: 28.6 BLUPS (H100), 22.0 BLUPS(A100)
A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay
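For reference, the throughput metric can be computed from the run parameters as sketched below. This reads BLUPS as \(10^9\) lattice updates per second, which is an assumption about the abbreviation; the helper names are hypothetical.

```python
def blups(nx: int, ny: int, steps: int, seconds: float) -> float:
    """Billions of lattice node updates per second for one timed run."""
    return nx * ny * steps / seconds / 1e9

def seconds_for(nx: int, ny: int, steps: int, rate_blups: float) -> float:
    """Wall time implied by a given throughput."""
    return nx * ny * steps / (rate_blups * 1e9)

# 32000 x 32000 nodes for 100 steps are 1.024e11 updates in total.
print(f"H100 at 28.6 BLUPS: {seconds_for(32000, 32000, 100, 28.6):.2f} s")
print(f"A100 at 22.0 BLUPS: {seconds_for(32000, 32000, 100, 22.0):.2f} s")
```

At the reported peak rates a single 100-step run takes only a few seconds, which is one reason to average over several repeats.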
Array of Structs - uncoalesced
[Navarro-Hinojosa et al. 2018, Chapter 3.2]
Struct of Arrays - coalesced
[Navarro-Hinojosa et al. 2018, Chapter 3.2]
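The difference between the two layouts can be seen from the linear memory index alone. The sketch below assumes that neighbouring GPU threads handle neighbouring x positions, which is the usual mapping but is not stated on the slide; the grid size is a toy value.

```python
NX, NY, Q = 8, 4, 9  # toy grid; Q populations per node

def aos_index(x: int, y: int, i: int) -> int:
    """Array of Structs: all Q populations of one node are adjacent."""
    return (y * NX + x) * Q + i

def soa_index(x: int, y: int, i: int) -> int:
    """Struct of Arrays: one contiguous NX*NY plane per population."""
    return i * NX * NY + y * NX + x

def thread_stride(index_fn, i: int = 0, y: int = 0) -> int:
    """Distance in elements between the accesses of two neighbouring threads."""
    return index_fn(1, y, i) - index_fn(0, y, i)

print("AoS stride:", thread_stride(aos_index))  # Q elements apart: uncoalesced
print("SoA stride:", thread_stride(soa_index))  # 1 element apart: coalesced
```

With a stride of 1, the threads of a warp touch one contiguous memory segment per population, which is what allows the hardware to coalesce the loads.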
Plot axis: lattice updates per second \(\left[\frac{1}{\sec}\right]\)
Pull scheme, A100 and H100 (32000 \(\times\) 32000), 3060Ti (3000 \(\times\) 3000), 100 steps, \(\geq\) 5 repeats, D2Q9 Shearwave Decay
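As a reminder of what "pull" means here, the NumPy sketch below contrasts it with the push scheme on a small periodic grid. In push, each node scatters its populations to its neighbours; in pull, each node gathers the populations that will arrive. Both produce the same result and differ only in whether the reads or the writes are the non-local accesses. The function names are illustrative, and the explicit Python loops are for clarity only.

```python
import numpy as np

CX = [0, 1, 0, -1, 0, 1, -1, -1, 1]
CY = [0, 0, 1, 0, -1, 1, 1, -1, -1]

def stream_push(f: np.ndarray) -> np.ndarray:
    """Push: local read, write to the neighbour in direction c_i."""
    out = np.empty_like(f)
    _, nx, ny = f.shape
    for i in range(9):
        for x in range(nx):
            for y in range(ny):
                out[i, (x + CX[i]) % nx, (y + CY[i]) % ny] = f[i, x, y]
    return out

def stream_pull(f: np.ndarray) -> np.ndarray:
    """Pull: read from the neighbour in direction -c_i, local write."""
    out = np.empty_like(f)
    _, nx, ny = f.shape
    for i in range(9):
        for x in range(nx):
            for y in range(ny):
                out[i, x, y] = f[i, (x - CX[i]) % nx, (y - CY[i]) % ny]
    return out

f = np.random.default_rng(0).random((9, 6, 5))
print(np.array_equal(stream_push(f), stream_pull(f)))
```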
Communicate each border and corner asynchronously
\(\mathcal{O}(N)\) halo nodes \(\ll\) \(\mathcal{O}(N^2)\) inner nodes
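To put numbers on this ratio, the sketch below counts halo and inner nodes for a square \(N\times N\) subdomain with a one-node-wide halo including the four corners. The halo width and the square shape are assumptions for illustration.

```python
def halo_nodes(n: int) -> int:
    """One-node-wide halo around an n x n block: four borders plus four corners."""
    return 4 * n + 4

def halo_fraction(n: int) -> float:
    """Ratio of halo nodes to the n*n nodes that are updated locally."""
    return halo_nodes(n) / (n * n)

for n in (100, 1000, 32000):
    print(f"N={n:>6}: halo={halo_nodes(n):>7}, fraction={halo_fraction(n):.2e}")
```

For the 32000 \(\times\) 32000 subdomains used in the benchmarks the halo is roughly 0.01% of the nodes, so the exchange can be hidden behind the inner update.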
Timeline diagram (host and device overlap):
MPI calls: IRecv, ISend
Host: Receive Data, Pack Buffers, Send Data
Device: Pack Buffers, Inner Update, Block?, Outer Update
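The ordering in the diagram works because the inner update never reads halo data. The NumPy sketch below illustrates this with a pull streaming step on a padded array. It is not the MPI implementation: the IRecv/ISend exchange is replaced by a periodic copy inside one array (`fill_halo`, a stand-in) so that it runs without MPI, and all names are hypothetical.

```python
import numpy as np

CX = [0, 1, 0, -1, 0, 1, -1, -1, 1]
CY = [0, 0, 1, 0, -1, 1, 1, -1, -1]

def fill_halo(g):
    """Stand-in for the halo exchange: copy opposite borders, rows first, then columns."""
    g[:, 0, :] = g[:, -2, :]
    g[:, -1, :] = g[:, 1, :]
    g[:, :, 0] = g[:, :, -2]
    g[:, :, -1] = g[:, :, 1]

def pull(g, out, x0, x1, y0, y1):
    """Pull-stream the nodes x0..x1-1, y0..y1-1 (padded coordinates)."""
    for i in range(9):
        out[i, x0:x1, y0:y1] = g[i, x0 - CX[i]:x1 - CX[i], y0 - CY[i]:y1 - CY[i]]

def step_overlapped(g):
    n = g.shape[1] - 2
    out = np.zeros_like(g)
    pull(g, out, 2, n, 2, n)           # inner update: reads no halo node
    fill_halo(g)                       # here the real code would wait for the messages
    pull(g, out, 1, 2, 1, n + 1)       # outer update: first row
    pull(g, out, n, n + 1, 1, n + 1)   # last row
    pull(g, out, 2, n, 1, 2)           # first column
    pull(g, out, 2, n, n, n + 1)       # last column
    return out

def step_reference(g):
    n = g.shape[1] - 2
    out = np.zeros_like(g)
    fill_halo(g)
    pull(g, out, 1, n + 1, 1, n + 1)
    return out

n = 6
g = np.full((9, n + 2, n + 2), np.nan)  # NaN halo exposes any premature halo read
g[:, 1:-1, 1:-1] = np.random.default_rng(0).random((9, n, n))
a = step_overlapped(g.copy())[:, 1:-1, 1:-1]
b = step_reference(g.copy())[:, 1:-1, 1:-1]
print(np.array_equal(a, b))
```

The halo starts out as NaN, so the comparison would fail if the inner update touched it before the exchange completed.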
Plot axes: number of processes \(N\) vs. scaling efficiency \(\frac{T(1)}{T(N)}\)
Nvidia A100 GPUs, 32000 \(\times\) 32000 nodes per process, 100 steps, 5 repeats, D2Q9 Shearwave Decay
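Since the number of nodes per process is fixed, this is a weak-scaling measurement, and the efficiency is the single-process run time divided by the \(N\)-process run time. A minimal sketch follows; the timings in it are made-up placeholders, not measured values.

```python
def scaling_efficiency(t1: float, tn: float) -> float:
    """Weak-scaling efficiency T(1)/T(N); 1.0 means perfect scaling."""
    return t1 / tn

# Hypothetical wall times in seconds, keyed by process count (placeholders only).
timings = {1: 4.0, 2: 4.1, 4: 4.2}

for n_procs, t in timings.items():
    print(f"N={n_procs}: efficiency={scaling_efficiency(timings[1], t):.3f}")
```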
# Imports
import taichi as ti
import taichi.math as tm

# Start Taichi
ti.init(arch=ti.gpu)

# Define constants
NY = 800
NX = 800
Q = 9
OMEGA = 1.7
RHO = 1.
U_0 = 0.1
w = [4/9, 1/9, 1/9, 1/9, 1/9, 1/36, 1/36, 1/36, 1/36]
cx = [0, 1, 0, -1, 0, 1, -1, -1, 1]
cy = [0, 0, 1, 0, -1, 1, 1, -1, -1]
refl = [0, 3, 4, 1, 2, 7, 8, 5, 6]

# Define fields (f uses a struct-of-arrays layout: one NX x NY plane per direction)
buf = ti.field(dtype=ti.f32, shape=(NX, NY, Q,))
pixels = ti.Vector.field(n=3, dtype=ti.f32, shape=(NX, NY))
f = ti.field(ti.f32)
ti.root.dense(ti.k, Q).dense(ti.ij, (NX, NY)).place(f)

# Initialize a fluid at rest
@ti.kernel
def init_rest():
    for x, y in ti.ndrange(NX, NY):
        for i in ti.static(range(Q)):
            f[x, y, i] = w[i] * RHO

# Fused collide and stream kernel
@ti.kernel
def push():
    for x, y in ti.ndrange(NX, NY):
        # Compute density
        rho = 0.
        for i in ti.static(range(Q)):
            rho += f[x, y, i]
        # Compute velocity
        u = ti.Vector([0., 0.])
        for i in ti.static(range(Q)):
            u += f[x, y, i] * ti.Vector([cx[i], cy[i]])
        u /= rho
        # Visualize results
        pixels[x, y] = tm.length(u) / U_0
        for i in ti.static(range(Q)):
            # collide
            f_eq = w[i] * rho * (1 + 3*(ciui := u.x*cx[i] + u.y*cy[i]) + 4.5*ciui*ciui - 1.5*tm.dot(u, u))
            # sliding lid boundary
            df = ti.static(RHO * U_0 / 6.) * (-1 if i == 5 else 1) if (y == NY-1 and (i == 5 or i == 6)) else 0.
            # bounce back at the walls: reflect the direction and stay on the same node
            i_refl = refl[i] if ((x == 0 and cx[i] < 0) or (x == NX-1 and cx[i] > 0) or (y == 0 and cy[i] < 0) or (y == NY-1 and cy[i] > 0)) else i
            drx = 0 if (x == 0 and cx[i] < 0) or (x == NX-1 and cx[i] > 0) else cx[i]
            dry = 0 if (y == 0 and cy[i] < 0) or (y == NY-1 and cy[i] > 0) else cy[i]
            buf[x+drx, y+dry, i_refl] = f[x, y, i] + ti.static(OMEGA) * (f_eq + df - f[x, y, i])

# Create and run GUI
def run_gui():
    gui = ti.GUI("LBM D2Q9", res=(NX, NY), fast_gui=True)
    init_rest()
    while gui.running:
        push()
        f.copy_from(buf)
        gui.set_image(pixels)
        gui.show()

run_gui()
\(\leq\) 50 Lines of Code, including GUI
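The lattice tables `w`, `cx`, `cy` and `refl` are easy to mistype, so a quick consistency check can be worthwhile. The sketch below verifies the usual D2Q9 identities with exact fractions; the function name `check_lattice` is hypothetical and not part of the listing.

```python
from fractions import Fraction as F

w = [F(4, 9)] + [F(1, 9)] * 4 + [F(1, 36)] * 4
cx = [0, 1, 0, -1, 0, 1, -1, -1, 1]
cy = [0, 0, 1, 0, -1, 1, 1, -1, -1]
refl = [0, 3, 4, 1, 2, 7, 8, 5, 6]

def check_lattice(w, cx, cy, refl) -> dict:
    """Return the result of each D2Q9 consistency check by name."""
    q = range(len(w))
    return {
        "weights sum to 1": sum(w) == 1,
        "refl points backwards": all(cx[refl[i]] == -cx[i] and cy[refl[i]] == -cy[i] for i in q),
        "first moment vanishes": sum(w[i] * cx[i] for i in q) == 0 and sum(w[i] * cy[i] for i in q) == 0,
        "second moment is c_s^2 = 1/3": (
            sum(w[i] * cx[i] * cx[i] for i in q) == F(1, 3)
            and sum(w[i] * cy[i] * cy[i] for i in q) == F(1, 3)
            and sum(w[i] * cx[i] * cy[i] for i in q) == 0
        ),
    }

for name, ok in check_lattice(w, cx, cy, refl).items():
    print(f"{name}: {ok}")
```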