TL;DR: I built DocMind, a multimodal RAG (Retrieval-Augmented Generation) app that lets you upload a PDF, image, or other document and ask questions about it in plain English. No API keys are needed: it runs entirely locally, using Ollama for the LLM and Xenova Transformers for embeddings. This post walks through the full architecture, the chunking strategy, the vector similarity search, and the code. T