
Build a RAG Code Assistant

A simple guide to combining Ollama, ChromaDB, and Docker for intelligent code search.

What Does This Application Do?

This demo application creates an AI-powered code search assistant that can understand and answer questions about your codebase. Instead of searching for exact text matches, it understands the meaning of your questions.

Example Interaction:

You ask: "What S3 bucket is defined in this project?"

The app: Searches through your Terraform, Python, and YAML files → Finds relevant code snippets → Uses an LLM to generate a human-readable answer

The application has two modes: Indexing (process and store your code) and Querying (ask questions and get answers).

1. What is RAG?

RAG (Retrieval-Augmented Generation) enhances LLMs by giving them access to your data. The process:

  1. Retrieve — Find relevant code chunks from your codebase
  2. Augment — Add that context to your question
  3. Generate — LLM answers using the context

Why RAG? LLMs like GPT or Mistral are trained on public data — they don't know about YOUR specific codebase, internal APIs, or project structure. RAG bridges this gap.

The RAG Pipeline in This App:

📁 Your Code → ✂️ Chunking → 🦙 Ollama Embed → 🗄️ ChromaDB → 🔍 Query → 🦙 Ollama LLM → 💬 Answer

2. Understanding Ollama

🦙 What is Ollama?

Ollama is an open-source tool that lets you run Large Language Models locally on your own machine. Think of it as "Docker for LLMs" — it handles downloading, running, and serving AI models through a simple API.

Key Benefits

  • Runs 100% locally (no cloud costs)
  • Privacy — your code never leaves your machine
  • Simple REST API on port 11434
  • Easy model management

How It Works

  • Downloads models on first use
  • Serves models via HTTP API
  • Handles GPU/CPU optimization
  • Manages model memory

Two Roles in This App

1. Embedding Model: nomic-embed-text

Converts text into numerical vectors (embeddings). These vectors capture the semantic meaning of code, so similar code produces similar vectors.

"def calculate_sum()" → [0.12, -0.45, 0.78, ...] (768 dimensions)
2. Generation Model: mistral

A 7B parameter LLM that reads the retrieved code context and generates human-readable answers. It's the "brain" that synthesizes information.

Context + Question → "The S3 bucket 'my-app-data' is defined in main.tf..."

Ollama API Examples

# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "def hello_world():"
}'

# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain this code: def hello():"
}'

The Python ollama library wraps these API calls into simple function calls.
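
Roughly the same two calls look like this with the Python library (a short sketch; the host value assumes Ollama is reachable on localhost:11434, which is what the Docker setup below publishes):

import ollama

# Point the client at the Ollama server (localhost when the port is published by Docker)
client = ollama.Client(host="http://localhost:11434")

# Embedding: text in, 768-dimensional vector out
emb = client.embeddings(model="nomic-embed-text", prompt="def hello_world():")
print(len(emb["embedding"]))    # 768

# Generation: prompt in, answer text out
gen = client.generate(model="mistral", prompt="Explain this code: def hello():")
print(gen["response"])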

3. Understanding ChromaDB

🗄️ What is ChromaDB?

ChromaDB is an open-source vector database designed specifically for AI applications. Unlike traditional databases that search by exact matches, ChromaDB finds data by similarity.

Traditional DB: "Find rows where name = 'John'"
Vector DB: "Find items most similar to this vector [0.1, 0.5, ...]"

What It Stores

  • Embeddings — The vector numbers
  • Documents — Original text (code)
  • Metadata — Source file, type, etc.
  • IDs — Unique identifiers

Key Features

  • Fast similarity search
  • Persistent storage option
  • HTTP API (client/server mode)
  • Metadata filtering

How Similarity Search Works

When you query "What S3 bucket is defined?", ChromaDB:

  1. Converts your question to a vector
     "What S3 bucket..." → [0.23, -0.11, 0.89, ...]

  2. Calculates distance to all stored vectors
     Uses cosine similarity or Euclidean distance (see the sketch below)

  3. Returns the N closest matches
     In this app, we retrieve the top 3 most similar code chunks
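
To make step 2 concrete, here is a toy cosine-similarity calculation (illustrative only; real embeddings have 768 dimensions and ChromaDB computes this for you):

import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values near 1.0 mean "very similar"
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

question_vec     = [0.23, -0.11, 0.89]   # "What S3 bucket is defined?"
s3_chunk_vec     = [0.20, -0.09, 0.91]   # a Terraform chunk about S3
readme_chunk_vec = [-0.70, 0.55, 0.05]   # an unrelated chunk

print(cosine_similarity(question_vec, s3_chunk_vec))      # close to 1.0, very similar
print(cosine_similarity(question_vec, readme_chunk_vec))  # much lower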

ChromaDB in This App

import chromadb

# Connect to ChromaDB server
client = chromadb.HttpClient(host="chromadb", port=8000)

# Create or get a collection (like a table)
collection = client.get_or_create_collection(name="multi_lang_codebase")

# Add data
collection.add(
    ids=["file1_chunk_0"],           # Unique ID
    embeddings=[[0.1, 0.2, ...]],    # Vector from Ollama
    documents=["def hello():..."],   # Original code
    metadatas=[{"source": "app.py"}] # Extra info
)

# Query for similar items
results = collection.query(
    query_embeddings=[[0.15, 0.25, ...]],  # Question vector
    n_results=3                             # Return top 3
)
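
The result is a dict of parallel lists, with one inner list per query embedding. Continuing the snippet above, a quick way to see what came back (distances are returned by default alongside documents and metadata):

# Walk the top matches: document text, its source file, and its distance score
for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    print(f"{meta['source']}  (distance: {dist:.3f})")
    print(doc[:80], "...")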

4. Python Libraries Explained

The requirements.txt contains five packages. Here's what each does:

chromadb
ollama
requests
httpx
httpcore
1. chromadb

The official Python client for ChromaDB. Provides a high-level API to interact with the vector database.

Functions used in this app:

  • HttpClient() — Connect to remote ChromaDB server
  • get_or_create_collection() — Create/access a named collection
  • collection.add() — Store embeddings + documents
  • collection.query() — Search for similar vectors

2. ollama

Official Python SDK for Ollama. Simplifies communication with the Ollama server for embeddings and text generation.

Functions used in this app:

  • ollama.embeddings() — Convert text → vector (768 floats)
  • ollama.generate() — Generate text response from LLM

Example usage:

# Get embedding
response = ollama.embeddings(model="nomic-embed-text", prompt="code here")
vector = response["embedding"]  # List of 768 floats

# Generate answer
response = ollama.generate(model="mistral", prompt="Explain...")
answer = response["response"]   # String

3. requests

The classic Python HTTP library. Used internally by ChromaDB to communicate with the server. You don't call it directly in this app, but it's a required dependency.

Why included: ChromaDB's HttpClient uses requests for HTTP calls.

4. httpx + httpcore

Modern async-capable HTTP client libraries. The ollama Python package uses these under the hood.

  • httpx — High-level HTTP client (like requests but with async support)
  • httpcore — Low-level HTTP transport that httpx builds on

Why included: The ollama library requires these for communicating with the Ollama API server.
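
To see what httpx actually does for the ollama package, here is roughly the kind of request it sends (a simplified sketch of the underlying HTTP call; the real client also handles streaming and async):

import httpx

# "stream": False asks Ollama for one JSON object instead of a token stream
resp = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain this code: def hello():", "stream": False},
    timeout=120.0,
)
print(resp.json()["response"])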

Dependency Relationships

one.py
  ↓ imports
chromadb, ollama
  ↓ which use internally
requests, httpx, httpcore

5. Docker Setup

The docker-compose.yml defines three services:

services:
  ollama:                        # LLM server
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama_storage:/root/.ollama

  chromadb:                      # Vector DB
    image: chromadb/chroma:latest
    ports: ["8000:8000"]
    environment:
      - IS_PERSISTENT=TRUE
    volumes:
      - chroma_data:/chroma/chroma   # mount the named volume so vectors survive restarts

  backend:                       # Your app
    build: ./onedemo
    ports: ["8787:8080"]
    environment:
      - CHROMA_HOST=chromadb    # Service name as host
      - OLLAMA_HOST=http://ollama:11434
    depends_on: [ollama, chromadb]

volumes:
  ollama_storage:
  chroma_data:

  • Ports — 11434 (Ollama), 8000 (ChromaDB), 8787 (backend)
  • Volumes — Persist models & vectors across restarts
  • Network — Services reach each other by service name
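
Inside the backend container, the Compose service names resolve as hostnames, which is why CHROMA_HOST=chromadb and OLLAMA_HOST=http://ollama:11434 work. A small sanity check along these lines (a sketch using the environment variables from the compose file) confirms both services are reachable:

import os
import chromadb
import ollama

# Service names from docker-compose.yml double as hostnames on the Compose network
chroma_client = chromadb.HttpClient(host=os.getenv("CHROMA_HOST", "chromadb"), port=8000)
print("ChromaDB heartbeat:", chroma_client.heartbeat())

ollama_client = ollama.Client(host=os.getenv("OLLAMA_HOST", "http://ollama:11434"))
print("Ollama models:", ollama_client.list())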

6. Application Code

The one.py script has two main functions:

Configuration

CHROMA_HOST = os.getenv("CHROMA_HOST", "chromadb")
COLLECTION_NAME = "multi_lang_codebase"
EMBED_MODEL = "nomic-embed-text"   # For embeddings
LLM_MODEL = "mistral"              # For generation
CHUNK_SIZE = 1500                  # Chars per chunk

Indexing (store code)

def index_code(path):
    for file_path in supported_files:
        content = read_file(file_path)
        chunks = split_into_chunks(content, CHUNK_SIZE)

        for i, chunk in enumerate(chunks):
            # Convert to vector
            embedding = ollama.embeddings(model=EMBED_MODEL, prompt=chunk)["embedding"]

            # Store in ChromaDB
            collection.add(
                ids=[f"{file_path}_chunk_{i}"],
                embeddings=[embedding],
                documents=[chunk],
                metadatas=[{"source": file_path}]
            )
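
split_into_chunks is the simplest piece of the pipeline. A minimal character-based version that matches CHUNK_SIZE (one possible implementation; the real script might split on line boundaries or overlap chunks instead):

def split_into_chunks(text, chunk_size=CHUNK_SIZE):
    # Slice the file into fixed-size character windows
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]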

Querying (search & answer)

def query_code(question):
    # 1. Embed the question
    query_embed = ollama.embeddings(model=EMBED_MODEL, prompt=question)["embedding"]

    # 2. Find similar chunks
    results = collection.query(query_embeddings=[query_embed], n_results=3)

    # 3. Generate answer with context
    context = "\n".join(results["documents"][0])
    prompt = f"Answer using this code:\n{context}\n\nQuestion: {question}"

    return ollama.generate(model=LLM_MODEL, prompt=prompt)["response"]

7. Running It

  1. Start services
     docker-compose up -d

  2. Index your code
     python one.py ./your/code/path

  3. Run queries
     python one.py

Example queries:

  • "What S3 bucket is defined?"
  • "What happens in the test stage?"
  • "How is the docker-compose file structured?"

Summary

This RAG application combines three key technologies:

  • 🦙 Ollama — Local LLM for embeddings + generation
  • 🗄️ ChromaDB — Vector database for similarity search
  • 🐳 Docker — Orchestration + persistence

Flow: Code → Chunks → Embeddings (Ollama) → Vector Store (ChromaDB) → Query → Context + LLM → Answer