RAG · 45 min build
0:00 / 47:11
Live codingRAGFastAPIVercel AI SDK

Live: a RAG chatbot in 45 minutes

👁 21.4k viewsApr 02, 2026Recorded live
Ice Bear (a.k.a. Nitin)
12.4k bears subscribed · Live coding & cold takes
0:00
Ice Bear is live. We're building a retrieval-augmented chatbot in forty-five minutes — no slides, no edits. The clock starts now.
0:32
First thing: the stack. FastAPI for the API surface, Postgres with pgvector for the embedding store, and the Vercel AI SDK on the React side so streaming is one hook, not a homework assignment.
1:18
We'll skip the 'what is RAG' explainer. If you're here, you already know. Retrieve relevant chunks, stuff them into the prompt, generate. The hard part is everything around that one sentence.
2:04
Chunking strategy. Naive paragraph splitting works on day one and falls apart on day two. We're going with overlapping windows — 512 tokens with 64-token stride — so concepts that straddle boundaries don't get lost.
3:11
Embedding model. We're using OpenAI's small model because it's cheap, fast, and good enough for English-language docs. If you have multilingual content, you'll want something else.
4:25
Storage. pgvector lives in the same Postgres instance as the rest of the app. One database, one backup, one migration story. Ice Bear values simplicity.
6:02
Here's the ingestion endpoint. Async-first, of course. We chunk the input, embed each chunk, and upsert into the vectors table with a foreign key back to the source document.
8:14
Time for the retrieval query. Cosine similarity, top-k twenty, then a quick reranker pass to bring it down to five. That reranker is the difference between okay and useful.
10:30
Streaming responses. The AI SDK's useChat hook handles the WebSocket. Token by token, no fight with the framework, hover state included.
13:08
Now the part nobody warns you about: prompt injection. Anything in your retrieved chunks can pretend to be an instruction. We're going to sanitize and bracket.
16:42
Citations. Always cite. Even if the user doesn't ask. The model lies more confidently when there's nothing to point at.
20:18
Latency budget. From keystroke to first token, sub-800ms is the target. Embed the query, run the search, build the prompt, fire the call. Each step gets a slice.
24:55
Caching. Embedding cache by SHA of the query. Result cache by the same. Most chatbots ask the same six questions over and over — the cache pays for itself in a week.
28:30
Evaluation. We use a small set of ground-truth Q&A pairs and check precision-at-k on every deploy. If retrieval drops below 0.8, we hold the release.
32:14
Cost. The cheapest token is the one you don't send. Trim the context, trim the conversation history, trim the system prompt. Then trim it again.
36:50
Failure modes. The model doesn't know it doesn't know. We add a confidence threshold and a graceful 'I can't answer that from my sources' response.
40:22
Deploy. Docker compose, three services, EC2. No fancy orchestration today. It scales fine for the first ten thousand users — past that, we'll talk.
44:08
Wrap. The repo is in the description. Ice Bear is going to drink something cold and probably nap. Thanks for watching.
47:01
Outro. See you in the next one.