RAG and its Practical Implementation with LangChain, Bun, Ollama and Qdrant
Modern Large Language Models (LLMs) are impressive, but they have a major limitation: their knowledge is frozen in their weights, which makes it difficult to update or extend. Retrieval-Augmented Generation (RAG) is an approach designed to address this problem. Introduced by Meta in 2020, it connects a language model to an external knowledge base (for example, a set of documents) so that it can incorporate up-to-date, specific information into its responses. In practice, for each question asked, the RAG system first retrieves relevant content from its document base, then generates a response by combining this retrieved context with the linguistic capabilities of the LLM.
Note: The complete source code for the example project mentioned in this article is available on GitHub.
Article Outline
- What is RAG and why use it?
  - Operating principle
  - Advantages over classical approaches
  - Concrete use cases
- Architecture of a RAG system
  - Essential components
  - Data flow
  - Technology choices
- Practical implementation with TypeScript
  - Project setup with Bun
  - LangChain integration
  - Ollama and Qdrant configuration
- Code analysis and best practices
  - Document indexing
  - Semantic search
  - Response generation
- Advantages of the technical stack
  - Bun performance vs Node.js
  - LangChain simplicity
  - Ollama flexibility
  - Qdrant scalability
- Going further
  - Advanced optimizations
  - Evaluation and metrics
  - Technological alternatives
What is RAG and why use it?
Retrieval-Augmented Generation (RAG) literally means "generation augmented by retrieval." The idea is to separate knowledge from the model. Instead of trying to incorporate all information into the parameters of an LLM (through costly fine-tuning) or designing a classical model that would predict responses from data, we let the main model generate text and augment it with an intermediate step of information retrieval. A typical RAG pipeline works as follows:
- User query – The user asks a question or provides a query in natural language (e.g., "What is class X used for in this project?").
- Search for relevant documents – The system transforms this question into a vector representation (embedding) and then queries a vector database to retrieve documents or passages that are semantically most similar to the query. This identifies the relevant context (e.g., an excerpt from documentation, code, or an article corresponding to the question).
- Context + question combination – The retrieved documents or excerpts are then provided as context to the language model. In practice, they are inserted into the LLM's prompt, typically via a system message or by prefixing the user's question with the text of the found documents.
- Response generation – The language model (LLM) then generates a response based on both the question and the provided context. The response should contain information from the documents, formulated coherently thanks to the LLM's capabilities.
This process allows the model to rely on specific external knowledge at the time of generation, without having to permanently memorize it. This can be compared to a human who, faced with a question, would consult books or reference documents before answering: the LLM "searches its library" before speaking.
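Expressed as code, this loop is short. The following sketch is purely illustrative: the helper functions passed in (embed, search, generate) are hypothetical stand-ins for the embedding model, the vector database, and the LLM, not a specific library API; the concrete LangChain implementation comes later in the article.

```typescript
// Illustrative sketch of the four steps above, with hypothetical helpers.
type Retriever = (queryVector: number[], k: number) => Promise<{ text: string }[]>;

async function answerWithRag(
  question: string,
  embed: (text: string) => Promise<number[]>,    // step 2a: text -> embedding
  search: Retriever,                             // step 2b: vector similarity search
  generate: (prompt: string) => Promise<string>  // step 4: LLM call
): Promise<string> {
  // 2. Retrieve the passages that are semantically closest to the question.
  const passages = await search(await embed(question), 4);

  // 3. Combine retrieved context and question into a single prompt.
  const prompt = [
    "Answer the question using only the context below.",
    "Context:\n" + passages.map((p) => p.text).join("\n---\n"),
    "Question: " + question,
  ].join("\n\n");

  // 4. Generate the final answer, grounded in the retrieved context.
  return generate(prompt);
}
```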
Concrete use cases for RAG
The RAG approach is particularly useful whenever a conversational assistant needs to handle an evolving or voluminous knowledge base. Here are some examples of concrete use cases where RAG excels compared to classical methods:
Documentary chatbots: An assistant powered by a company's technical documentation, capable of answering questions from developers or customers by drawing directly from manuals, internal knowledge bases, or even source code. For example, the model can be connected to API specifications or open-source project code to explain how a function works or the reason for a certain design.
Dynamic FAQs: In a customer support context, a RAG chatbot can answer common questions (FAQs) based on the latest policies or product data. If a policy (e.g., return conditions) changes, you only need to update the reference document and the bot will take it into account instantly, without requiring retraining. This results in always up-to-date FAQs, with the ability to provide the source of information to support the answer.
Legal assistants: An assistant can help lawyers or legal professionals by finding relevant passages in a database of laws, case law, or contracts for a given question, then formulating the answer in natural language. The model doesn't need to know the entire Civil Code by heart; it just needs to look up the appropriate articles. The same applies to a medical assistant, which could query databases of scientific publications or medical protocols to provide answers based on the latest clinical knowledge.
Programming assistant: This is the case of our example project – an assistant that knows the content of a code repository and can answer questions about this code (architecture, role of a module, potential bugs, etc.). Rather than training a specialized programming model, we use a generalist LLM augmented by searching for relevant code files in the repository.
Architecture of a RAG system
Essential components
A complete RAG system typically includes the following components:
- Indexing and storage
  - Document processor (extraction, cleaning, chunking)
  - Embedding generator (transformation into vectors)
  - Vector database (storage and search)
- Query pipeline
  - Query preprocessor
  - Semantic search engine
  - Prompt generator
- Generation and post-processing
  - LLM interface
  - Response evaluator
  - Output formatter
Data flow
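The data flow boils down to two pipelines: an offline indexing flow and an online query flow. The sketch below illustrates them with lightweight TypeScript types; the names are illustrative, not the project's actual definitions.

```typescript
// Offline (indexing):  raw documents -> chunking -> embeddings -> Qdrant collection
// Online  (query):     question -> query embedding -> top-k similar chunks
//                               -> prompt (context + question) -> Ollama LLM -> answer

// Lightweight, illustrative types for the data exchanged between the stages.
interface Chunk {
  text: string;
  metadata: { source: string }; // origin of the chunk (file path, URL, ...)
}

interface IndexedChunk extends Chunk {
  vector: number[]; // embedding produced for this chunk
}

interface RagAnswer {
  answer: string;
  sources: string[]; // metadata.source of the chunks used as context
}
```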
Technology choices
For our implementation, we've chosen a modern and performant stack:
- Bun: Ultra-fast JavaScript runtime, ideal for server applications
- TypeScript: Static typing for better maintainability
- LangChain: Framework for building LLM-based applications
- Ollama: Tool for running language models locally
- Qdrant: Performant and easy-to-deploy vector database
This combination offers an excellent balance between performance, ease of development, and flexibility.
Practical implementation with TypeScript
Project setup with Bun
Let's start by initializing our project:
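A minimal setup might look like the following; the package list and model names (here llama3 and nomic-embed-text) are assumptions to adapt to your LangChain version and hardware.

```bash
# Create the project and install dependencies (package names may vary with your LangChain version)
mkdir rag-assistant && cd rag-assistant
bun init -y
bun add langchain @langchain/core @langchain/ollama @langchain/qdrant @langchain/textsplitters

# Pull the models used in the examples (chat + embeddings) with Ollama
ollama pull llama3
ollama pull nomic-embed-text

# Start a local Qdrant instance (Docker)
docker run -d -p 6333:6333 qdrant/qdrant
```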
Basic configuration
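The configuration below is a sketch of the kind of constants the project centralizes; the values (URLs, model names, collection name) are illustrative defaults for a local setup.

```typescript
// config.ts – central configuration (values are illustrative defaults for a local setup)
export const config = {
  ollamaBaseUrl: "http://localhost:11434", // default Ollama endpoint
  qdrantUrl: "http://localhost:6333",      // default Qdrant endpoint
  collectionName: "code-assistant",        // Qdrant collection used by the examples
  chatModel: "llama3",                     // model used for generation
  embeddingModel: "nomic-embed-text",      // model used for embeddings
  chunkSize: 1000,                         // characters per chunk
  chunkOverlap: 200,                       // overlap between consecutive chunks
};
```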
Document indexing
Indexing is a crucial step in a RAG system. It involves transforming raw documents into appropriately sized chunks, then generating embeddings for each chunk.
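A minimal indexing script might look like this, assuming the configuration above and the LangChain integrations for Ollama and Qdrant (import paths may differ between LangChain versions):

```typescript
// index.ts – load files, split them into chunks, embed them, and store them in Qdrant
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OllamaEmbeddings } from "@langchain/ollama";
import { QdrantVectorStore } from "@langchain/qdrant";
import { config } from "./config";

// 1. Load the source files (here, TypeScript files from ./src using Bun's glob)
const docs: Document[] = [];
const glob = new Bun.Glob("**/*.ts");
for await (const path of glob.scan("./src")) {
  const text = await Bun.file(`./src/${path}`).text();
  docs.push(new Document({ pageContent: text, metadata: { source: path } }));
}

// 2. Split documents into overlapping chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: config.chunkSize,
  chunkOverlap: config.chunkOverlap,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Embed the chunks and store them in a Qdrant collection
const embeddings = new OllamaEmbeddings({
  model: config.embeddingModel,
  baseUrl: config.ollamaBaseUrl,
});
await QdrantVectorStore.fromDocuments(chunks, embeddings, {
  url: config.qdrantUrl,
  collectionName: config.collectionName,
});

console.log(`Indexed ${chunks.length} chunks from ${docs.length} files.`);
```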
Search and response generation
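The query side can be sketched as follows, again under the same assumptions: it connects to the existing Qdrant collection, retrieves the most similar chunks, and asks the chat model to answer from that context.

```typescript
// ask.ts – retrieve relevant chunks from Qdrant and generate a grounded answer
import { ChatOllama, OllamaEmbeddings } from "@langchain/ollama";
import { QdrantVectorStore } from "@langchain/qdrant";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { config } from "./config";

const embeddings = new OllamaEmbeddings({
  model: config.embeddingModel,
  baseUrl: config.ollamaBaseUrl,
});

// Reuse the collection created during indexing
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: config.qdrantUrl,
  collectionName: config.collectionName,
});

const llm = new ChatOllama({
  model: config.chatModel,
  baseUrl: config.ollamaBaseUrl,
  temperature: 0,
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a programming assistant. Answer using only the provided context. If the context is not sufficient, say so."],
  ["human", "Context:\n{context}\n\nQuestion: {question}"],
]);

const chain = prompt.pipe(llm).pipe(new StringOutputParser());

export async function ask(question: string): Promise<string> {
  // Retrieve the top-k chunks that are semantically closest to the question
  const results = await vectorStore.similaritySearch(question, 4);
  const context = results
    .map((doc) => `// ${doc.metadata.source}\n${doc.pageContent}`)
    .join("\n\n---\n\n");
  return chain.invoke({ context, question });
}
```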
Simple user interface
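A minimal command-line loop is enough to interact with the assistant. The sketch below relies on Bun's ability to read stdin line by line through the console async iterator (a Bun-specific feature):

```typescript
// cli.ts – minimal interactive loop on top of the ask() function above
import { ask } from "./ask";

console.log("RAG assistant ready. Type a question (Ctrl+C to quit).");
process.stdout.write("> ");

// In Bun, `console` is an async iterable over stdin lines
for await (const line of console) {
  const question = line.trim();
  if (question.length > 0) {
    const answer = await ask(question);
    console.log(`\n${answer}\n`);
  }
  process.stdout.write("> ");
}
```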
Code analysis and best practices
Efficient chunking
Splitting documents into chunks is a critical step that directly influences the quality of results. Some best practices:
- Appropriate size: Chunks should be large enough to carry useful context, but not so large that they dilute relevance (typically between 500 and 1500 characters).
- Overlap: Overlap between chunks prevents losing context at boundaries.
- Semantic splitting: Ideally, splitting should respect the semantic structure of documents (paragraphs, functions, etc.).
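For code, the semantic splitting point can be approximated with LangChain's language-aware splitter, which prefers structural boundaries (functions, classes) over arbitrary character positions. This is a sketch; the chosen sizes simply fall within the range recommended above.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Language-aware splitting for JavaScript/TypeScript source files.
const codeSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 1000,
  chunkOverlap: 200, // keeps context across chunk boundaries
});

const source = await Bun.file("./src/index.ts").text(); // any source file
const chunks = await codeSplitter.createDocuments([source], [{ source: "src/index.ts" }]);
```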
Search optimization
The quality of semantic search is essential:
- Metadata filters: Use metadata (file type, date, author) to refine searches.
- Re-ranking: Apply a second scoring pass (for example with a cross-encoder or the LLM itself) to reorder the retrieved chunks by relevance.
- Diversity: Ensure diversity in results to cover different aspects of the question.
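As an example of metadata filtering, a filter can be passed to the vector store's similarity search using Qdrant's native filter syntax. The sketch below reuses the vectorStore from the query example and assumes LangChain's default payload layout, where document metadata is stored under the `metadata` key.

```typescript
// Restrict the search to chunks coming from a specific source file.
// "metadata.source" assumes LangChain's default payload layout for Qdrant.
const results = await vectorStore.similaritySearch("How is authentication handled?", 4, {
  must: [
    {
      key: "metadata.source",
      match: { value: "src/auth/login.ts" }, // exact match on the chunk's source path
    },
  ],
});
```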
Advanced prompting
Prompt construction is an art that strongly influences the quality of responses:
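As an illustration, a more careful prompt explicitly asks for citations and allows the model to admit ignorance. The wording below is a suggestion to adapt, not the project's exact prompt.

```typescript
import { ChatPromptTemplate } from "@langchain/core/prompts";

// A stricter prompt: grounded answers, explicit sources, and a way out when
// the context is insufficient.
const ragPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    [
      "You are an assistant specialized in this code repository.",
      "Answer ONLY from the context below. If the context does not contain",
      "the answer, say that you do not know rather than guessing.",
      "Cite the source file of each piece of information you use.",
    ].join(" "),
  ],
  ["human", "Context:\n{context}\n\nQuestion: {question}"],
]);
```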
Advantages of the technical stack
Bun performance vs Node.js
Bun offers significant advantages for this type of application:
- Fast startup: Startup time up to 4x faster than Node.js
- Optimized execution: Superior execution performance, particularly for I/O operations
- Integrated bundler: Simplification of the development workflow
LangChain simplicity
LangChain greatly facilitates the development of LLM-based applications:
- Abstraction: Unified interface for different models and providers
- Reusable components: Ready-to-use chains, agents, and tools
- Established patterns: Reference implementations for common use cases
Ollama flexibility
Ollama allows running language models locally with great flexibility:
- Local models: No dependency on external APIs
- Privacy: Data remains on your infrastructure
- Customization: Possibility to adjust models according to your needs
Qdrant scalability
Qdrant is a modern vector database designed for semantic search:
- Performance: Optimized for fast similarity searches
- Filtering: Advanced filtering capabilities on metadata
- Flexible deployment: Usable in embedded mode or as a service
Going further
Advanced optimizations
- Hybrid search: Combine vector search and keyword search
- Hierarchical chunking: Use different levels of granularity for chunks
- Caching: Cache search results and frequent responses
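As a starting point for the caching idea, even a simple in-memory map keyed by the normalized question avoids recomputing embeddings and answers for repeated queries. This is a sketch; a production system would add eviction (LRU) and persistence.

```typescript
// Naive in-memory cache for full RAG answers, keyed by the normalized question.
const answerCache = new Map<string, string>();

export async function askCached(
  question: string,
  ask: (q: string) => Promise<string> // e.g. the ask() function defined earlier
): Promise<string> {
  const key = question.trim().toLowerCase();
  const hit = answerCache.get(key);
  if (hit !== undefined) return hit;

  const answer = await ask(question);
  answerCache.set(key, answer);
  return answer;
}
```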
Evaluation and metrics
To measure the quality of a RAG system:
- Relevance: Are the retrieved documents relevant to the question?
- Faithfulness: Is the answer faithful to the source documents?
- Usefulness: Does the answer effectively address the user's question?
Technological alternatives
- Frameworks: Haystack, LlamaIndex as alternatives to LangChain
- Vector databases: Pinecone, Weaviate, Milvus as alternatives to Qdrant
- Models: Different local models (Llama, Mistral) or APIs (OpenAI, Anthropic)
Conclusion
Retrieval-Augmented Generation represents a major advance in how we can leverage language models for specific use cases. By separating knowledge from the generation model, RAG enables the creation of AI assistants that are more accurate, more up-to-date, and more transparent.
Our implementation with TypeScript, Bun, LangChain, Ollama, and Qdrant demonstrates that it is now possible to build performant RAG systems with modern and accessible technologies. This approach paves the way for a new generation of AI assistants capable of reasoning on specific knowledge bases while maintaining the fluidity and coherence of large language models.
Feel free to explore the complete source code on GitHub and adapt it to your own use cases. RAG is an evolving technology, and there are numerous opportunities for innovation in this exciting field.