RAG and its Practical Implementation with LangChain, Bun, Ollama and Qdrant
Modern Large Language Models (LLMs) are impressive, but they have a major limitation: their knowledge is frozen in their weights, which makes it difficult to update or extend. Retrieval-Augmented Generation (RAG) is an approach designed to address this problem. Introduced by Meta in 2020, it connects a language model to an external knowledge base (for example, a set of documents) so that it can incorporate up-to-date, specific information into its responses. In practice, for each question asked, the RAG system first retrieves relevant content from its document base, then generates a response by combining this retrieved context with the linguistic capabilities of the LLM.
Note: The complete source code for the example project mentioned in this article is available on GitHub.
Article Outline
- What is RAG and why use it?
  - Operating principle
  - Advantages over classical approaches
  - Concrete use cases
- Architecture of a RAG system
  - Essential components
  - Data flow
  - Technology choices
- Practical implementation with TypeScript
  - Project setup with Bun
  - LangChain integration
  - Ollama and Qdrant configuration
- Code analysis and best practices
  - Document indexing
  - Semantic search
  - Response generation
- Advantages of the technical stack
  - Bun performance vs Node.js
  - LangChain simplicity
  - Ollama flexibility
  - Qdrant scalability
- Going further
  - Advanced optimizations
  - Evaluation and metrics
  - Technological alternatives
What is RAG and why use it?
Retrieval-Augmented Generation (RAG) literally means "generation augmented by retrieval." The idea is to separate knowledge from the model. Instead of trying to incorporate all information into the parameters of an LLM (through costly fine-tuning) or designing a classical model that would predict responses from data, we let the main model generate text and augment it with an intermediate step of information retrieval. A typical RAG pipeline works as follows:
- User query – The user asks a question or provides a query in natural language (e.g., "What is class X used for in this project?").
- Search for relevant documents – The system transforms this question into a vector representation (embedding) and then queries a vector database to retrieve documents or passages that are semantically most similar to the query. This identifies the relevant context (e.g., an excerpt from documentation, code, or an article corresponding to the question).
- Context + question combination – The retrieved documents or excerpts are then provided as context to the language model. In practice, they are inserted into the LLM's prompt, typically via a system message or by prefixing the user's question with the text of the found documents.
- Response generation – The language model (LLM) then generates a response based on both the question and the provided context. The response should contain information from the documents, formulated coherently thanks to the LLM's capabilities.
This process allows the model to rely on specific external knowledge at the time of generation, without having to permanently memorize it. This can be compared to a human who, faced with a question, would consult books or reference documents before answering: the LLM "searches its library" before speaking.
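Expressed as code, this loop is short. The following sketch is purely illustrative: the helper functions passed in (embed, search, generate) are hypothetical stand-ins for the embedding model, the vector database, and the LLM, not a specific library API; the concrete LangChain implementation comes later in the article.

```typescript
// Illustrative sketch of the four steps above, with hypothetical helpers.
type Retriever = (queryVector: number[], k: number) => Promise<{ text: string }[]>;

async function answerWithRag(
  question: string,
  embed: (text: string) => Promise<number[]>,    // step 2a: text -> embedding
  search: Retriever,                             // step 2b: vector similarity search
  generate: (prompt: string) => Promise<string>  // step 4: LLM call
): Promise<string> {
  // 2. Retrieve the passages that are semantically closest to the question.
  const passages = await search(await embed(question), 4);

  // 3. Combine retrieved context and question into a single prompt.
  const prompt = [
    "Answer the question using only the context below.",
    "Context:\n" + passages.map((p) => p.text).join("\n---\n"),
    "Question: " + question,
  ].join("\n\n");

  // 4. Generate the final answer, grounded in the retrieved context.
  return generate(prompt);
}
```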
Concrete use cases for RAG
The RAG approach is particularly useful whenever a conversational assistant needs to handle an evolving or voluminous knowledge base. Here are some examples of concrete use cases where RAG excels compared to classical methods:
Documentary chatbots: An assistant powered by a company's technical documentation, capable of answering questions from developers or customers by drawing directly from manuals, internal knowledge bases, or even source code. For example, the model can be connected to API specifications or open-source project code to explain how a function works or the reason for a certain design.
Dynamic FAQs: In a customer support context, a RAG chatbot can answer common questions (FAQs) based on the latest policies or product data. If a policy (e.g., return conditions) changes, you only need to update the reference document and the bot will take it into account instantly, without requiring retraining. This results in always up-to-date FAQs, with the ability to provide the source of information to support the answer.
Legal assistants: An assistant can help lawyers or legal professionals by finding relevant passages in a database of laws, case law, or contracts for a given question, then formulating the answer in natural language. The model doesn't need to know the entire Civil Code by heart; it just needs to look up the appropriate articles. The same applies to a medical assistant, which could query databases of scientific publications or medical protocols to provide answers based on the latest clinical knowledge.
Programming assistant: This is the case of our example project – an assistant that knows the content of a code repository and can answer questions about this code (architecture, role of a module, potential bugs, etc.). Rather than training a specialized programming model, we use a generalist LLM augmented by searching for relevant code files in the repository.
Architecture of a RAG system
Essential components
A complete RAG system typically includes the following components:
- Indexing and storage
  - Document processor (extraction, cleaning, chunking)
  - Embedding generator (transformation into vectors)
  - Vector database (storage and search)
- Query pipeline
  - Query preprocessor
  - Semantic search engine
  - Prompt generator
- Generation and post-processing
  - LLM interface
  - Response evaluator
  - Output formatter
Data flow
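The data flow boils down to two pipelines: an offline indexing flow and an online query flow. The sketch below illustrates them with lightweight TypeScript types; the names are illustrative, not the project's actual definitions.

```typescript
// Offline (indexing):  raw documents -> chunking -> embeddings -> Qdrant collection
// Online  (query):     question -> query embedding -> top-k similar chunks
//                               -> prompt (context + question) -> Ollama LLM -> answer

// Lightweight, illustrative types for the data exchanged between the stages.
interface Chunk {
  text: string;
  metadata: { source: string }; // origin of the chunk (file path, URL, ...)
}

interface IndexedChunk extends Chunk {
  vector: number[]; // embedding produced for this chunk
}

interface RagAnswer {
  answer: string;
  sources: string[]; // metadata.source of the chunks used as context
}
```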
Technology choices
For our implementation, we've chosen a modern and performant stack:
- Bun: Ultra-fast JavaScript runtime, ideal for server applications
- TypeScript: Static typing for better maintainability
- LangChain: Framework for building LLM-based applications
- Ollama: Tool for running language models locally
- Qdrant: Performant and easy-to-deploy vector database
This combination offers an excellent balance between performance, ease of development, and flexibility.
Practical implementation with TypeScript
Project setup with Bun
Let's start by initializing our project:
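A minimal setup might look like the following; the package list and model names (here llama3 and nomic-embed-text) are assumptions to adapt to your LangChain version and hardware.

```bash
# Create the project and install dependencies (package names may vary with your LangChain version)
mkdir rag-assistant && cd rag-assistant
bun init -y
bun add langchain @langchain/core @langchain/ollama @langchain/qdrant @langchain/textsplitters

# Pull the models used in the examples (chat + embeddings) with Ollama
ollama pull llama3
ollama pull nomic-embed-text

# Start a local Qdrant instance (Docker)
docker run -d -p 6333:6333 qdrant/qdrant
```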
Basic configuration
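The configuration below is a sketch of the kind of constants the project centralizes; the values (URLs, model names, collection name) are illustrative defaults for a local setup.

```typescript
// config.ts – central configuration (values are illustrative defaults for a local setup)
export const config = {
  ollamaBaseUrl: "http://localhost:11434", // default Ollama endpoint
  qdrantUrl: "http://localhost:6333",      // default Qdrant endpoint
  collectionName: "code-assistant",        // Qdrant collection used by the examples
  chatModel: "llama3",                     // model used for generation
  embeddingModel: "nomic-embed-text",      // model used for embeddings
  chunkSize: 1000,                         // characters per chunk
  chunkOverlap: 200,                       // overlap between consecutive chunks
};
```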
Document indexing
Indexing is a crucial step in a RAG system. It involves transforming raw documents into appropriately sized chunks, then generating embeddings for each chunk.
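A minimal indexing script might look like this, assuming the configuration above and the LangChain integrations for Ollama and Qdrant (import paths may differ between LangChain versions):

```typescript
// index.ts – load files, split them into chunks, embed them, and store them in Qdrant
import { Document } from "@langchain/core/documents";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { OllamaEmbeddings } from "@langchain/ollama";
import { QdrantVectorStore } from "@langchain/qdrant";
import { config } from "./config";

// 1. Load the source files (here, TypeScript files from ./src using Bun's glob)
const docs: Document[] = [];
const glob = new Bun.Glob("**/*.ts");
for await (const path of glob.scan("./src")) {
  const text = await Bun.file(`./src/${path}`).text();
  docs.push(new Document({ pageContent: text, metadata: { source: path } }));
}

// 2. Split documents into overlapping chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: config.chunkSize,
  chunkOverlap: config.chunkOverlap,
});
const chunks = await splitter.splitDocuments(docs);

// 3. Embed the chunks and store them in a Qdrant collection
const embeddings = new OllamaEmbeddings({
  model: config.embeddingModel,
  baseUrl: config.ollamaBaseUrl,
});
await QdrantVectorStore.fromDocuments(chunks, embeddings, {
  url: config.qdrantUrl,
  collectionName: config.collectionName,
});

console.log(`Indexed ${chunks.length} chunks from ${docs.length} files.`);
```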
Search and response generation
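The query side can be sketched as follows, again under the same assumptions: it connects to the existing Qdrant collection, retrieves the most similar chunks, and asks the chat model to answer from that context.

```typescript
// ask.ts – retrieve relevant chunks from Qdrant and generate a grounded answer
import { ChatOllama, OllamaEmbeddings } from "@langchain/ollama";
import { QdrantVectorStore } from "@langchain/qdrant";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";
import { config } from "./config";

const embeddings = new OllamaEmbeddings({
  model: config.embeddingModel,
  baseUrl: config.ollamaBaseUrl,
});

// Reuse the collection created during indexing
const vectorStore = await QdrantVectorStore.fromExistingCollection(embeddings, {
  url: config.qdrantUrl,
  collectionName: config.collectionName,
});

const llm = new ChatOllama({
  model: config.chatModel,
  baseUrl: config.ollamaBaseUrl,
  temperature: 0,
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are a programming assistant. Answer using only the provided context. If the context is not sufficient, say so."],
  ["human", "Context:\n{context}\n\nQuestion: {question}"],
]);

const chain = prompt.pipe(llm).pipe(new StringOutputParser());

export async function ask(question: string): Promise<string> {
  // Retrieve the top-k chunks that are semantically closest to the question
  const results = await vectorStore.similaritySearch(question, 4);
  const context = results
    .map((doc) => `// ${doc.metadata.source}\n${doc.pageContent}`)
    .join("\n\n---\n\n");
  return chain.invoke({ context, question });
}
```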
Simple user interface
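A minimal command-line loop is enough to interact with the assistant. The sketch below relies on Bun's ability to read stdin line by line through the console async iterator (a Bun-specific feature):

```typescript
// cli.ts – minimal interactive loop on top of the ask() function above
import { ask } from "./ask";

console.log("RAG assistant ready. Type a question (Ctrl+C to quit).");
process.stdout.write("> ");

// In Bun, `console` is an async iterable over stdin lines
for await (const line of console) {
  const question = line.trim();
  if (question.length > 0) {
    const answer = await ask(question);
    console.log(`\n${answer}\n`);
  }
  process.stdout.write("> ");
}
```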
Code analysis and best practices
Efficient chunking
Splitting documents into chunks is a critical step that directly influences the quality of results. Some best practices:
- Appropriate size: Chunks should be large enough to carry useful context, but not so large that they dilute relevance (typically between 500 and 1500 characters).
- Overlap: Overlap between chunks prevents losing context at boundaries.
- Semantic splitting: Ideally, splitting should respect the semantic structure of documents (paragraphs, functions, etc.).
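For code, the semantic splitting point can be approximated with LangChain's language-aware splitter, which prefers structural boundaries (functions, classes) over arbitrary character positions. This is a sketch; the chosen sizes simply fall within the range recommended above.

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Language-aware splitting for JavaScript/TypeScript source files.
const codeSplitter = RecursiveCharacterTextSplitter.fromLanguage("js", {
  chunkSize: 1000,
  chunkOverlap: 200, // keeps context across chunk boundaries
});

const source = await Bun.file("./src/index.ts").text(); // any source file
const chunks = await codeSplitter.createDocuments([source], [{ source: "src/index.ts" }]);
```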
Search optimization
The quality of semantic search is essential:
- Metadata filters: Use metadata (file type, date, author) to refine searches.
- Re-ranking: Apply a second scoring pass (for example with a cross-encoder or the LLM itself) to reorder the retrieved chunks by relevance.
- Diversity: Ensure diversity in results to cover different aspects of the question.
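As an example of metadata filtering, a filter can be passed to the vector store's similarity search using Qdrant's native filter syntax. The sketch below reuses the vectorStore from the query example and assumes LangChain's default payload layout, where document metadata is stored under the `metadata` key.

```typescript
// Restrict the search to chunks coming from a specific source file.
// "metadata.source" assumes LangChain's default payload layout for Qdrant.
const results = await vectorStore.similaritySearch("How is authentication handled?", 4, {
  must: [
    {
      key: "metadata.source",
      match: { value: "src/auth/login.ts" }, // exact match on the chunk's source path
    },
  ],
});
```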
Advanced prompting
Prompt construction is an art that strongly influences the quality of responses:
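As an illustration, a more careful prompt explicitly asks for citations and allows the model to admit ignorance. The wording below is a suggestion to adapt, not the project's exact prompt.

```typescript
import { ChatPromptTemplate } from "@langchain/core/prompts";

// A stricter prompt: grounded answers, explicit sources, and a way out when
// the context is insufficient.
const ragPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    [
      "You are an assistant specialized in this code repository.",
      "Answer ONLY from the context below. If the context does not contain",
      "the answer, say that you do not know rather than guessing.",
      "Cite the source file of each piece of information you use.",
    ].join(" "),
  ],
  ["human", "Context:\n{context}\n\nQuestion: {question}"],
]);
```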
Advantages of the technical stack
Bun performance vs Node.js
Bun offers significant advantages for this type of application:
- Fast startup: Startup time up to 4x faster than Node.js
- Optimized execution: Superior execution performance, particularly for I/O operations
- Integrated bundler: Simplification of the development workflow
LangChain simplicity
LangChain greatly facilitates the development of LLM-based applications:
- Abstraction: Unified interface for different models and providers
- Reusable components: Ready-to-use chains, agents, and tools
- Established patterns: Reference implementations for common use cases
Ollama flexibility
Ollama allows running language models locally with great flexibility:
- Local models: No dependency on external APIs
- Privacy: Data remains on your infrastructure
- Customization: Possibility to adjust models according to your needs
Qdrant scalability
Qdrant is a modern vector database designed for semantic search:
- Performance: Optimized for fast similarity searches
- Filtering: Advanced filtering capabilities on metadata
- Flexible deployment: Usable in embedded mode or as a service
Going further
Advanced optimizations
- Hybrid search: Combine vector search and keyword search
- Hierarchical chunking: Use different levels of granularity for chunks
- Caching: Cache search results and frequent responses
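As a starting point for the caching idea, even a simple in-memory map keyed by the normalized question avoids recomputing embeddings and answers for repeated queries. This is a sketch; a production system would add eviction (LRU) and persistence.

```typescript
// Naive in-memory cache for full RAG answers, keyed by the normalized question.
const answerCache = new Map<string, string>();

export async function askCached(
  question: string,
  ask: (q: string) => Promise<string> // e.g. the ask() function defined earlier
): Promise<string> {
  const key = question.trim().toLowerCase();
  const hit = answerCache.get(key);
  if (hit !== undefined) return hit;

  const answer = await ask(question);
  answerCache.set(key, answer);
  return answer;
}
```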
Evaluation and metrics
To measure the quality of a RAG system:
- Relevance: Are the retrieved documents relevant to the question?
- Faithfulness: Is the answer faithful to the source documents?
- Usefulness: Does the answer effectively address the user's question?
Technological alternatives
- Frameworks: Haystack, LlamaIndex as alternatives to LangChain
- Vector databases: Pinecone, Weaviate, Milvus as alternatives to Qdrant
- Models: Different local models (Llama, Mistral) or APIs (OpenAI, Anthropic)
Conclusion
Retrieval-Augmented Generation represents a major advance in how we can leverage language models for specific use cases. By separating knowledge from the generation model, RAG enables the creation of AI assistants that are more accurate, more up-to-date, and more transparent.
Our implementation with TypeScript, Bun, LangChain, Ollama, and Qdrant demonstrates that it is now possible to build performant RAG systems with modern and accessible technologies. This approach paves the way for a new generation of AI assistants capable of reasoning on specific knowledge bases while maintaining the fluidity and coherence of large language models.
Feel free to explore the complete source code on GitHub and adapt it to your own use cases. RAG is an evolving technology, and there are numerous opportunities for innovation in this exciting field.