This hands-on session is designed for developers and architects building and scaling generative AI services. We will provide a practical look at Google Kubernetes Engine (GKE) as the foundation for high-performance large language model (LLM) inference. The session will feature a live demo of the GKE Inference Gateway, highlighting its model-aware routing and serving-priority features. We will then delve into the open-source llm-d project, showcasing its vLLM-aware scheduling and disaggregated serving capabilities. To cap it off, we'll explore the performance gains of running vLLM on Cloud TPUs for maximum throughput and efficiency. You will leave with actionable insights and code examples to optimize your LLM serving stack.
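As a taste of the model-aware routing and serving-priority features the demo covers, a minimal configuration sketch might look like the following. This assumes the Kubernetes Gateway API Inference Extension resource types (`InferencePool`, `InferenceModel`); the pool, label, and model names here are hypothetical placeholders, not the exact manifests shown in the session:

```yaml
# Hypothetical sketch: expose a set of vLLM serving Pods as an inference
# pool, then register a model on that pool with Critical priority so its
# traffic is favored over lower-criticality requests under load.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama-pool            # hypothetical pool name
spec:
  targetPortNumber: 8000           # port the vLLM server listens on
  selector:
    app: vllm-llama                # matches the vLLM serving Pods
  extensionRef:
    name: vllm-llama-endpoint-picker   # endpoint-picker extension service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama-chat
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct   # hypothetical model
  criticality: Critical            # prioritized over Standard/Sheddable
  poolRef:
    name: vllm-llama-pool
```

Requests naming `modelName` are routed to the pool's endpoints, and the `criticality` field is what drives the serving-priority behavior highlighted in the demo.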

Nathan Beach
Nathan Beach is Director of Product Management for Google Kubernetes Engine (GKE). He leads the product team working to make GKE a great platform on which to run AI workloads. He received his MBA from Harvard Business School and, prior to Google, led his own startup. He is a builder and creator passionate about making products that superbly meet user needs. He enjoys career coaching and mentoring, and he is eager to help others transition into product management and excel in their careers.
Google Cloud provides leading infrastructure, platform capabilities, and industry solutions. We deliver enterprise-grade cloud solutions that leverage Google’s cutting-edge technology to help companies operate more efficiently and adapt to changing needs, giving customers a foundation for the future. Customers in more than 150 countries use Google Cloud as their trusted partner to solve their most critical business problems.