Build a RAG Code Assistant
A simple guide to combining Ollama, ChromaDB, and Docker for intelligent code search.
What Does This Application Do?
This demo application creates an AI-powered code search assistant that can understand and answer questions about your codebase. Instead of searching for exact text matches, it understands the meaning of your questions.
You ask: "What S3 bucket is defined in this project?"
The app: Searches through all your Terraform, Python, YAML files → Finds relevant code snippets → Uses an LLM to generate a human-readable answer
The application has two modes: Indexing (process and store your code) and Querying (ask questions and get answers).
1. What is RAG?
RAG (Retrieval-Augmented Generation) enhances LLMs by giving them access to your data. The process:
- Retrieve — Find relevant code chunks from your codebase
- Augment — Add that context to your question
- Generate — LLM answers using the context
2. Understanding Ollama
🦙 What is Ollama?
Ollama is an open-source tool that lets you run Large Language Models locally on your own machine. Think of it as "Docker for LLMs" — it handles downloading, running, and serving AI models through a simple API.
Why use it:
- Runs 100% locally (no cloud costs)
- Privacy — your code never leaves your machine
- Simple REST API on port 11434
- Easy model management

What it does:
- Downloads models on first use
- Serves models via HTTP API
- Handles GPU/CPU optimization
- Manages model memory
Two Roles in This App
- nomic-embed-text: Converts text into numerical vectors (embeddings). These vectors capture the semantic meaning of code, so similar code produces similar vectors.
- mistral: A 7B-parameter LLM that reads the retrieved code context and generates human-readable answers. It's the "brain" that synthesizes information.
Ollama API Examples
# Generate embeddings
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "def hello_world():"
}'

# Generate text
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Explain this code: def hello():"
}'
The Python ollama library wraps these API calls into simple function calls.
3. Understanding ChromaDB
🗄️ What is ChromaDB?
ChromaDB is an open-source vector database designed specifically for AI applications. Unlike traditional databases that search by exact matches, ChromaDB finds data by similarity.
A vector DB query looks like: "Find the items most similar to this vector [0.1, 0.5, ...]"
What it stores:
- Embeddings — the vector numbers
- Documents — original text (code)
- Metadata — source file, type, etc.
- IDs — unique identifiers

Key features:
- Fast similarity search
- Persistent storage option
- HTTP API (client/server mode)
- Metadata filtering
How Similarity Search Works
When you query "What S3 bucket is defined?", ChromaDB:
1. Converts your question to a vector: "What S3 bucket..." → [0.23, -0.11, 0.89, ...]
2. Calculates the distance to all stored vectors: uses cosine similarity or Euclidean distance (see the sketch below)
3. Returns the N closest matches: in this app, we retrieve the top 3 most similar code chunks
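To make step 2 concrete, here is a minimal, self-contained sketch of cosine similarity over made-up 3-dimensional vectors and illustrative chunk IDs (real nomic-embed-text vectors have 768 dimensions, and ChromaDB does this ranking for you internally):

import math

def cosine_similarity(a, b):
    # Higher score -> vectors point in a similar direction -> closer in meaning
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.23, -0.11, 0.89]                      # the embedded question
stored = {
    "main.tf_chunk_0": [0.25, -0.10, 0.85],      # similar meaning -> high score
    "app.py_chunk_3": [-0.70, 0.40, 0.05],       # unrelated -> low score
}
ranked = sorted(stored, key=lambda k: cosine_similarity(query, stored[k]), reverse=True)
print(ranked[:3])                                # the n_results=3 closest chunk IDs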
ChromaDB in This App
# Connect to the ChromaDB server
client = chromadb.HttpClient(host="chromadb", port=8000)

# Create or get a collection (like a table)
collection = client.get_or_create_collection(name="multi_lang_codebase")

# Add data
collection.add(
    ids=["file1_chunk_0"],                 # Unique ID
    embeddings=[[0.1, 0.2, ...]],          # Vector from Ollama
    documents=["def hello():..."],         # Original code
    metadatas=[{"source": "app.py"}]       # Extra info
)

# Query for similar items
results = collection.query(
    query_embeddings=[[0.15, 0.25, ...]],  # Question vector
    n_results=3                            # Return top 3
)
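The query result is a dict of parallel lists, with one inner list per query vector. A small sketch of how the app can read it (the key names come from the ChromaDB client; the values shown in comments are illustrative):

top_documents = results["documents"][0]    # the 3 most similar code chunks (strings)
top_metadata  = results["metadatas"][0]    # e.g. [{"source": "main.tf"}, ...]
top_ids       = results["ids"][0]          # e.g. ["main.tf_chunk_0", ...]
for doc, meta in zip(top_documents, top_metadata):
    print(meta["source"], doc[:60])        # preview each match with its source file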
4. Python Libraries Explained
The requirements.txt contains five packages. Here's what each does:
chromadb
The official Python client for ChromaDB. Provides a high-level API to interact with the vector database.
| Method | Purpose |
| --- | --- |
| HttpClient() | Connect to remote ChromaDB server |
| get_or_create_collection() | Create/access a named collection |
| collection.add() | Store embeddings + documents |
| collection.query() | Search for similar vectors |
ollama
Official Python SDK for Ollama. Simplifies communication with the Ollama server for embeddings and text generation.
| Function | Purpose |
| --- | --- |
| ollama.embeddings() | Convert text → vector (768 floats) |
| ollama.generate() | Generate text response from LLM |
# Get embedding
response = ollama.embeddings(model="nomic-embed-text", prompt="code here")
vector = response["embedding"] # List of 768 floats
# Generate answer
response = ollama.generate(model="mistral", prompt="Explain...")
answer = response["response"] # String
requests
The classic Python HTTP library. Used internally by ChromaDB to communicate with the server. You don't call it directly in this app, but it's a required dependency.
httpx + httpcore
Modern async-capable HTTP client libraries. The ollama Python package uses these under the hood.
- httpx: High-level HTTP client (like requests but with async support)
- httpcore: Low-level HTTP transport that httpx builds on
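For illustration, this is roughly the kind of call the ollama package makes through httpx under the hood (a sketch, not the library's actual internals; it assumes the Ollama server is reachable on localhost):

import httpx

# Same request as the earlier curl example, sent with httpx directly;
# "stream": False asks Ollama to return one JSON object instead of a stream.
resp = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral", "prompt": "Explain this code: def hello():", "stream": False},
    timeout=60.0,
)
print(resp.json()["response"])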
Dependency relationships: your code imports chromadb and ollama directly; chromadb uses requests to talk to its server, while ollama goes through httpx, which in turn builds on httpcore.
5. Docker Setup
The docker-compose.yml defines three services:
services:
  ollama:                        # LLM server
    image: ollama/ollama:latest
    ports: ["11434:11434"]
    volumes:
      - ollama_storage:/root/.ollama

  chromadb:                      # Vector DB
    image: chromadb/chroma:latest
    ports: ["8000:8000"]
    environment:
      - IS_PERSISTENT=TRUE

  backend:                       # Your app
    build: ./onedemo
    ports: ["8787:8080"]
    environment:
      - CHROMA_HOST=chromadb     # Service name as host
      - OLLAMA_HOST=http://ollama:11434
    depends_on: [ollama, chromadb]

volumes:
  ollama_storage:
  chroma_data:
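Once the stack is up, a quick sanity check from the host (a sketch, not part of one.py) confirms both services respond; it uses the chromadb client's heartbeat() health check and Ollama's /api/tags endpoint, which lists downloaded models:

import chromadb
import requests

# Ollama listens on 11434; /api/tags lists the models it has downloaded so far
print(requests.get("http://localhost:11434/api/tags").json())

# The ChromaDB client's heartbeat() returns a timestamp if the server is reachable
print(chromadb.HttpClient(host="localhost", port=8000).heartbeat())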
6. Application Code
The one.py script has two main functions:
Configuration
CHROMA_HOST = os.getenv("CHROMA_HOST", "chromadb")
COLLECTION_NAME = "multi_lang_codebase"
EMBED_MODEL = "nomic-embed-text" # For embeddings
LLM_MODEL = "mistral" # For generation
CHUNK_SIZE = 1500 # Chars per chunk
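These constants feed into the client setup. Here is a minimal sketch of that wiring (not copied verbatim from one.py), continuing from the constants above and assuming the docker-compose service names from section 5:

import os          # needed for the os.getenv() call in the configuration above
import chromadb
import ollama

# Connect to the ChromaDB service and open (or create) the collection
client = chromadb.HttpClient(host=CHROMA_HOST, port=8000)
collection = client.get_or_create_collection(name=COLLECTION_NAME)

# The ollama package reads the OLLAMA_HOST environment variable
# (set to http://ollama:11434 in docker-compose.yml) for its module-level calls,
# so ollama.embeddings() and ollama.generate() need no explicit client here.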
Indexing (store code)
def index_code(path):
    # Simplified: walk the tree, chunk each supported file (Terraform, Python, YAML),
    # embed each chunk, and store it
    for root, _, files in os.walk(path):
        for name in files:
            if not name.endswith((".py", ".tf", ".yml", ".yaml")):
                continue
            file_path = os.path.join(root, name)
            with open(file_path, errors="ignore") as f:
                content = f.read()
            # Split the file into CHUNK_SIZE-character chunks
            chunks = [content[i:i + CHUNK_SIZE] for i in range(0, len(content), CHUNK_SIZE)]
            for n, chunk in enumerate(chunks):
                # Convert the chunk to a vector
                embedding = ollama.embeddings(model=EMBED_MODEL, prompt=chunk)["embedding"]
                # Store it in ChromaDB
                collection.add(
                    ids=[f"{file_path}_chunk_{n}"],      # Unique ID per chunk
                    embeddings=[embedding],
                    documents=[chunk],
                    metadatas=[{"source": file_path}]
                )
Querying (search & answer)
def query_code(question):
    # 1. Embed the question
    query_embed = ollama.embeddings(model=EMBED_MODEL, prompt=question)["embedding"]
    # 2. Find the most similar chunks
    results = collection.query(query_embeddings=[query_embed], n_results=3)
    # 3. Generate an answer with that context
    context = "\n".join(results["documents"][0])
    prompt = f"Answer using this code:\n{context}\n\nQuestion: {question}"
    return ollama.generate(model=LLM_MODEL, prompt=prompt)["response"]
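With both functions in place, asking a question is a one-liner; the example question here is the one used at the start of this guide:

print(query_code("What S3 bucket is defined in this project?"))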
7. Running It
# Start Ollama, ChromaDB, and the backend
docker-compose up -d

# Indexing mode: embed and store the code at the given path
python one.py ./your/code/path

# Querying mode: ask questions about the indexed code
python one.py

Example questions:
- "What S3 bucket is defined?"
- "What happens in the test stage?"
- "How is docker-compose structured?"
Summary
This RAG application combines three key technologies: Ollama to run the embedding and chat models locally, ChromaDB to store code embeddings and search them by similarity, and Docker Compose to wire the services together. Index your code once, then ask questions in plain English and get answers grounded in your own codebase.