Media Summary: Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI ... Open-source LLMs are great for conversational applications, but they can be difficult ... Why does a 70B language model crawl at 8 tokens per second on one setup, then feel instant on another? The difference is ...

Deep Dive Into Inference Optimization - Detailed Analysis & Overview

Deep Dive into Inference Optimization for LLMs with Philip Kiely

Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI ...

Deep Dive: Optimizing LLM inference

Open-source LLMs are great for conversational applications, but they can be difficult

LLM Inference Optimization Explained — From 8 Tokens/sec to 50+

Why does a 70B language model crawl at 8 tokens per second on one setup, then feel instant on another? The difference is ...

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

AI Inference: The Secret to AI's Superpowers

Download the AI model guide

Faster LLMs: Accelerate Inference with Speculative Decoding

LLM Inference Optimization. Coherence in KV Cache Management. LLM Intra-Turn Cache Dynamics.

LLM Caching strategies. As Large Language Models (LLMs) migrate from massive data centers

Deep Dive into LLMs like ChatGPT

This is a general audience

AI Optimization Lecture 01 - Prefill vs Decode - Mastering LLM Techniques from NVIDIA

Video 1 of 6 | Mastering LLM Techniques:

Understand training and inference optimizations in deep learning: Technical Deep Dive #3

LLM inference optimization: Architecture, KV cache and Flash attention

Inference Office Hours with SGLang: Performance Optimizations for LLM Serving

We'll dive

What is vLLM? Efficient AI Inference for Large Language Models
