Media Summary: Today we have Philip Kiely from Baseten on the show. Baseten is a Series B startup focused on providing infrastructure for AI ... Open-source LLMs are great for conversational applications, but they can be difficult ... Why does a 70B language model crawl at 8 tokens per second on one setup, then feel instant on another? The difference is ...
Deep Dive Into Inference Optimization - Detailed Analysis & Overview
LLM Caching strategies. As Large Language Models (LLMs) migrate from massive data centers ...