
Build a RAG Code Assistant

A simple guide to combining Ollama, ChromaDB, and Docker for intelligent code search.

What Does This Application Do?

This demo application creates an AI-powered code search assistant that can understand and answer questions about your codebase. Instead of searching for exact text matches, it understands the meaning of your questions.

Example Interaction:

You ask: "What S3 bucket is defined in this project?"

The app: Searches through your Terraform, Python, and YAML files → Finds relevant code snippets → Uses an LLM to generate a human-readable answer

The application has two modes: Indexing (process and store your code) and Querying (ask questions and get answers).

1. What is RAG?

RAG (Retrieval-Augmented Generation) enhances LLMs by giving them access to your data. The process:

  1. Retrieve — Find relevant code chunks from your codebase
  2. Augment — Add that context to your question
  3. Generate — LLM answers using the context

Why RAG? LLMs like GPT or Mistral are trained on public data — they don't know about YOUR specific codebase, internal APIs, or project structure. RAG bridges this gap.

The RAG Pipeline in This App:

📁 Your Code → ✂️ Chunking → 🦙 Ollama Embed → 🗄️ ChromaDB → 🔍 Query → 🦙 Ollama LLM → 💬 Answer

2. Understanding Ollama

🦙 What is Ollama?

Ollama is an open-source tool that lets you run Large Language Models locally on your own machine. Think of it as "Docker for LLMs" — it handles downloading, running, and serving AI models through a simple API.

Key Benefits

  • Runs 100% locally (no cloud costs)
  • Privacy — your code never leaves your machine
  • Simple REST API on port 11434
  • Easy model management

How It Works

  • Downloads models on first use
  • Serves models via HTTP API
  • Handles GPU/CPU optimization
  • Manages model memory

Two Roles in This App

1. Embedding Model: nomic-embed-text

Converts text into numerical vectors (embeddings). These vectors capture the semantic meaning of code, so similar code produces similar vectors.

"def calculate_sum()" → [0.12, -0.45, 0.78, ...] (768 dimensions)
2. Generation Model: mistral

A 7B parameter LLM that reads the retrieved code context and generates human-readable answers. It's the "brain" that synthesizes information.

Context + Question → "The S3 bucket 'my-app-data' is defined in main.tf..."

Ollama API Examples

# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "def hello_world():"
}'

# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain this code: def hello():"
}'

The Python ollama library wraps these API calls into simple function calls.
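
Roughly the same two calls look like this with the Python library (a short sketch; the host value assumes Ollama is reachable on localhost:11434, which is what the Docker setup below publishes):

import ollama

# Point the client at the Ollama server (localhost when the port is published by Docker)
client = ollama.Client(host="http://localhost:11434")

# Embedding: text in, 768-dimensional vector out
emb = client.embeddings(model="nomic-embed-text", prompt="def hello_world():")
print(len(emb["embedding"]))    # 768

# Generation: prompt in, answer text out
gen = client.generate(model="mistral", prompt="Explain this code: def hello():")
print(gen["response"])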

3. Understanding ChromaDB

🗄️ What is ChromaDB?

ChromaDB is an open-source vector database designed specifically for AI applications. Unlike traditional databases that search by exact matches, ChromaDB finds data by similarity.

Traditional DB: "Find rows where name = 'John'"
Vector DB: "Find items most similar to this vector [0.1, 0.5, ...]"

What It Stores

  • Embeddings — The vector numbers
  • Documents — Original text (code)
  • Metadata — Source file, type, etc.
  • IDs — Unique identifiers

Key Features

  • Fast similarity search
  • Persistent storage option
  • HTTP API (client/server mode)
  • Metadata filtering

How Similarity Search Works

When you query "What S3 bucket is defined?", ChromaDB:

  1. Converts your question to a vector
     "What S3 bucket..." → [0.23, -0.11, 0.89, ...]

  2. Calculates distance to all stored vectors
     Uses cosine similarity or Euclidean distance (see the sketch below)

  3. Returns the N closest matches
     In this app, we retrieve the top 3 most similar code chunks
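
To make step 2 concrete, here is a toy cosine-similarity calculation (illustrative only; real embeddings have 768 dimensions and ChromaDB computes this for you):

import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); values near 1.0 mean "very similar"
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

question_vec     = [0.23, -0.11, 0.89]   # "What S3 bucket is defined?"
s3_chunk_vec     = [0.20, -0.09, 0.91]   # a Terraform chunk about S3
readme_chunk_vec = [-0.70, 0.55, 0.05]   # an unrelated chunk

print(cosine_similarity(question_vec, s3_chunk_vec))      # close to 1.0, very similar
print(cosine_similarity(question_vec, readme_chunk_vec))  # much lower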

ChromaDB in This App

import chromadb

# Connect to ChromaDB server
client = chromadb.HttpClient(host="chromadb", port=8000)

# Create or get a collection (like a table)
collection = client.get_or_create_collection(name="multi_lang_codebase")

# Add data
collection.add(
    ids=["file1_chunk_0"],           # Unique ID
    embeddings=[[0.1, 0.2, ...]],    # Vector from Ollama
    documents=["def hello():..."],   # Original code
    metadatas=[{"source": "app.py"}] # Extra info
)

# Query for similar items
results = collection.query(
    query_embeddings=[[0.15, 0.25, ...]],  # Question vector
    n_results=3                             # Return top 3
)
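
The result is a dict of parallel lists, with one inner list per query embedding. Continuing the snippet above, a quick way to see what came back (distances are returned by default alongside documents and metadata):

# Walk the top matches: document text, its source file, and its distance score
for doc, meta, dist in zip(results["documents"][0],
                           results["metadatas"][0],
                           results["distances"][0]):
    print(f"{meta['source']}  (distance: {dist:.3f})")
    print(doc[:80], "...")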

4. Python Libraries Explained

The requirements.txt contains five packages. Here's what each does:

chromadb
ollama
requests
httpx
httpcore
1. chromadb

The official Python client for ChromaDB. Provides a high-level API to interact with the vector database.

Functions used in this app:

  • HttpClient() — Connect to remote ChromaDB server
  • get_or_create_collection() — Create/access a named collection
  • collection.add() — Store embeddings + documents
  • collection.query() — Search for similar vectors

2. ollama

Official Python SDK for Ollama. Simplifies communication with the Ollama server for embeddings and text generation.

Functions used in this app:

  • ollama.embeddings() — Convert text → vector (768 floats)
  • ollama.generate() — Generate text response from LLM

Example usage:

# Get embedding
response = ollama.embeddings(model="nomic-embed-text", prompt="code here")
vector = response["embedding"]  # List of 768 floats

# Generate answer
response = ollama.generate(model="mistral", prompt="Explain...")
answer = response["response"]   # String

3. requests

The classic Python HTTP library. Used internally by ChromaDB to communicate with the server. You don't call it directly in this app, but it's a required dependency.

Why included: ChromaDB's HttpClient uses requests for HTTP calls.

4. httpx + httpcore

Modern async-capable HTTP client libraries. The ollama Python package uses these under the hood.

  • httpx — High-level HTTP client (like requests but with async support)
  • httpcore — Low-level HTTP transport that httpx builds on

Why included: The ollama library requires these for communicating with the Ollama API server.
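
To see what httpx actually does for the ollama package, here is roughly the kind of request it sends (a simplified sketch of the underlying HTTP call; the real client also handles streaming and async):

import httpx

# "stream": False asks Ollama for one JSON object instead of a token stream
resp = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain this code: def hello():", "stream": False},
    timeout=120.0,
)
print(resp.json()["response"])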

Dependency Relationships

one.py
  ↓ imports
chromadb, ollama
  ↓ which use internally
requests, httpx, httpcore

5. Docker Setup

The docker-compose.yml defines three services:

services:
  ollama:                        # LLM server
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama_storage:/root/.ollama

  chromadb:                      # Vector DB
    image: chromadb/chroma:latest
    ports: ["8000:8000"]
    environment:
      - IS_PERSISTENT=TRUE
    volumes:
      - chroma_data:/chroma/chroma   # mount the named volume so vectors survive restarts

  backend:                       # Your app
    build: ./onedemo
    ports: ["8787:8080"]
    environment:
      - CHROMA_HOST=chromadb    # Service name as host
      - OLLAMA_HOST=http://ollama:11434
    depends_on: [ollama, chromadb]

volumes:
  ollama_storage:
  chroma_data:

  • Ports — 11434 (Ollama), 8000 (ChromaDB), 8787 (backend)
  • Volumes — Persist models & vectors across restarts
  • Network — Services reach each other by service name
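
Inside the backend container, the Compose service names resolve as hostnames, which is why CHROMA_HOST=chromadb and OLLAMA_HOST=http://ollama:11434 work. A small sanity check along these lines (a sketch using the environment variables from the compose file) confirms both services are reachable:

import os
import chromadb
import ollama

# Service names from docker-compose.yml double as hostnames on the Compose network
chroma_client = chromadb.HttpClient(host=os.getenv("CHROMA_HOST", "chromadb"), port=8000)
print("ChromaDB heartbeat:", chroma_client.heartbeat())

ollama_client = ollama.Client(host=os.getenv("OLLAMA_HOST", "http://ollama:11434"))
print("Ollama models:", ollama_client.list())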

6. Application Code

The one.py script has two main functions:

Configuration

CHROMA_HOST = os.getenv("CHROMA_HOST", "chromadb")
COLLECTION_NAME = "multi_lang_codebase"
EMBED_MODEL = "nomic-embed-text"   # For embeddings
LLM_MODEL = "mistral"              # For generation
CHUNK_SIZE = 1500                  # Chars per chunk

Indexing (store code)

def index_code(path):
    for file_path in supported_files:
        content = read_file(file_path)
        chunks = split_into_chunks(content, CHUNK_SIZE)

        for i, chunk in enumerate(chunks):
            # Convert to vector
            embedding = ollama.embeddings(model=EMBED_MODEL, prompt=chunk)["embedding"]

            # Store in ChromaDB
            collection.add(
                ids=[f"{file_path}_chunk_{i}"],
                embeddings=[embedding],
                documents=[chunk],
                metadatas=[{"source": file_path}]
            )
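
split_into_chunks is the simplest piece of the pipeline. A minimal character-based version that matches CHUNK_SIZE (one possible implementation; the real script might split on line boundaries or overlap chunks instead):

def split_into_chunks(text, chunk_size=CHUNK_SIZE):
    # Slice the file into fixed-size character windows
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]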

Querying (search & answer)

def query_code(question):
    # 1. Embed the question
    query_embed = ollama.embeddings(model=EMBED_MODEL, prompt=question)["embedding"]

    # 2. Find similar chunks
    results = collection.query(query_embeddings=[query_embed], n_results=3)

    # 3. Generate answer with context
    context = "\n".join(results["documents"][0])
    prompt = f"Answer using this code:\n{context}\n\nQuestion: {question}"

    return ollama.generate(model=LLM_MODEL, prompt=prompt)["response"]

7. Running It

  1. Start services
     docker-compose up -d

  2. Index your code
     python one.py ./your/code/path

  3. Run queries
     python one.py

Example queries:

  • "What S3 bucket is defined?"
  • "What happens in the test stage?"
  • "How is the docker-compose file structured?"

Summary

This RAG application combines three key technologies:

  • 🦙 Ollama — Local LLM for embeddings + generation
  • 🗄️ ChromaDB — Vector database for similarity search
  • 🐳 Docker — Orchestration + persistence

Flow: Code → Chunks → Embeddings (Ollama) → Vector Store (ChromaDB) → Query → Context + LLM → Answer