Thursday, May 23 • 11:05 - 11:40
Production GPU Cluster with K8s for AI and DL Workloads - Madhukar Korupolu, NVIDIA

We will present NVIDIA's experience in building and operating a production GPU cluster with K8s for AI/DL and HPC workloads. Running GPU-accelerated workloads in K8s poses unique challenges, and we'll describe how we addressed some of them in production at scale. We will describe the tools we have built for automated provisioning of GPU nodes (including CUDA driver upgrades), a custom scheduler specialized for batch jobs, and monitoring of GPU jobs in production with health checks and telemetry. We will also discuss gaps we have identified that must be closed to enable more reliable and efficient utilization of GPU resources (e.g., GPU affinity, sharing, co-scheduling) and share an update on our current projects.

Madhukar Korupolu

Distinguished Engineer, NVIDIA
Madhukar is an architect at NVIDIA working on GPU clusters for AI and HPC workloads. His areas of interest and experience include AI/ML infrastructure, GPU acceleration, cloud computing, distributed systems, Borg, Kubernetes, HPC, and CDNs, with previous stints at Google, IBM, and Akamai. He holds...

Thursday May 23, 2019 11:05 - 11:40
Hall 8.0 C1