A Work-Efficient GPU Algorithm - Detailed Analysis & Overview

A Work-Efficient GPU Algorithm for Level Set Segmentation
A Fast Work-Efficient SSSP Algorithm for GPUs
Inside the Matrix: How does matrix multiplication work inside GPUs?
OSDI '22 - Efficient and Scalable Graph Pattern Mining on GPUs
GPU-based IDGE algorithm
Making GPUs Actually Fast: A Deep Dive into Training Performance
Nvidia CUDA in 100 Seconds
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Jared Casper
USENIX ATC '25 - PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via...
NSDI '20 - Themis: Fair and Efficient GPU Cluster Scheduling
USENIX ATC '22 - Whale: Efficient Giant Model Training over Heterogeneous GPUs
GPU Memory Coalescing Explained: Warp-Level Optimization, Alignment Rules, and Cache Behavior

Inside the Matrix: How does matrix multiplication work inside GPUs?

In this video, we dive into the mechanics of a

Making GPUs Actually Fast: A Deep Dive into Training Performance

This talk dives into the performance details of

Nvidia CUDA in 100 Seconds

What is CUDA? And how does parallel computing on the

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | Jared Casper

In this talk we present how we trained a 530B parameter language model on a DGX SuperPOD with over 3000 A100

GPU Memory Coalescing Explained: Warp-Level Optimization, Alignment Rules, and Cache Behavior

Accelerate your

Writing Code That Runs FAST on a GPU

In this video, we talk about why