Ollama Embeddings Guide
Ollama provides local, privacy-focused text embeddings for our memory system. This guide explains how embeddings work, why we chose Ollama, and how to set up and use local embeddings effectively.
🎯 What are Text Embeddings?
The Magic of Converting Text to Numbers
Text embeddings are numerical representations of text that capture semantic meaning. Think of them as a way to convert human language into a format that computers can understand and compare.
// Human-readable text
const text = "I love machine learning";
// Computer-readable vector (simplified; real nomic-embed-text vectors have 768 dimensions)
const embedding = [0.1, 0.3, 0.7, 0.2, 0.9, 0.4, 0.6, 0.8, ...];
Why Embeddings Matter
- Semantic Understanding: "car" and "automobile" have similar embeddings
- Cross-Lingual: with a multilingual model, "hello" and "hola" map to nearby vectors
- Context Awareness: "bank" (financial) and "bank" (river) have different embeddings
- Similarity Search: Find related content even when the wording differs (see the toy example below)
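To make this concrete, here is a toy comparison using tiny hand-made 4-dimensional vectors. Real embeddings have hundreds of dimensions, and the numbers below are invented purely for illustration; the same cosine-similarity idea is used later in this guide to compare real embeddings.

// Illustrative only: tiny made-up vectors standing in for real embeddings
const car = [0.9, 0.1, 0.8, 0.2];
const automobile = [0.85, 0.15, 0.75, 0.25];
const banana = [0.1, 0.9, 0.05, 0.7];

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

console.log(cosine(car, automobile)); // close to 1, near-synonyms
console.log(cosine(car, banana));     // noticeably lower, unrelated concepts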
🏠 Why Local Embeddings with Ollama?
The Problem with External APIs
Traditional embedding services have drawbacks:
- Privacy Concerns: Your data goes to external servers
- Cost: Pay per API call, can get expensive
- Latency: Network requests add delay
- Dependency: Relies on external service availability
- Data Control: No control over how your data is processed
Ollama's Advantages
- Privacy First: Everything runs on your machine
- Cost Effective: One-time setup, no per-request fees
- Fast: No network latency
- Reliable: No external dependencies
- Customizable: Use different models as needed
🚀 Setting Up Ollama
1. Installation
# Install Ollama (Linux install script; on macOS or Windows, download from the Ollama website)
curl -fsSL https://ollama.ai/install.sh | sh
# Verify installation
ollama --version
2. Start Ollama Service
# Start Ollama server
ollama serve
# In another terminal, verify it's running
curl http://localhost:11434/api/tags
3. Install Embedding Model
# Install nomic-embed-text (recommended for our system)
ollama pull nomic-embed-text
# Verify model is available
ollama list
4. Test the Model
# Embedding models are not chat models, so `ollama run` will not produce embeddings.
# Test embedding generation through the REST API instead:
curl http://localhost:11434/api/embeddings -d '{"model": "nomic-embed-text", "prompt": "Hello, world!"}'
# The response is a JSON object with an "embedding" array of 768 numbers
🔧 Implementation
Basic Embedding Service
class OllamaEmbeddingService {
private baseUrl: string;
private model: string;
constructor(baseUrl: string = 'http://localhost:11434', model: string = 'nomic-embed-text') {
this.baseUrl = baseUrl;
this.model = model;
}
async generateEmbedding(text: string): Promise<number[]> {
try {
const response = await fetch(`${this.baseUrl}/api/embeddings`, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: this.model,
prompt: text
})
});
if (!response.ok) {
throw new Error(`Ollama API error: ${response.status} ${response.statusText}`);
}
const data = await response.json();
return data.embedding;
} catch (error) {
console.error('Error generating embedding:', error);
throw error;
}
}
}
Batch Embedding Generation
class BatchEmbeddingService extends OllamaEmbeddingService {
private batchSize: number;
private delay: number;
constructor(batchSize: number = 5, delay: number = 100) {
super();
this.batchSize = batchSize;
this.delay = delay;
}
async generateBatchEmbeddings(texts: string[]): Promise<number[][]> {
const embeddings: number[][] = [];
for (let i = 0; i < texts.length; i += this.batchSize) {
const batch = texts.slice(i, i + this.batchSize);
// Process batch in parallel
const batchPromises = batch.map(text => this.generateEmbedding(text));
const batchEmbeddings = await Promise.all(batchPromises);
embeddings.push(...batchEmbeddings);
// Add delay between batches to avoid overwhelming Ollama
if (i + this.batchSize < texts.length) {
await new Promise(resolve => setTimeout(resolve, this.delay));
}
}
return embeddings;
}
}
Caching for Performance
class CachedEmbeddingService extends BatchEmbeddingService {
// Extends BatchEmbeddingService so preloadEmbeddings can reuse generateBatchEmbeddings;
// the cache is protected so the troubleshooting subclasses later in this guide can manage it
protected cache: Map<string, number[]>;
private maxCacheSize: number;
constructor(maxCacheSize: number = 1000) {
super();
this.cache = new Map();
this.maxCacheSize = maxCacheSize;
}
async generateEmbedding(text: string): Promise<number[]> {
// Check cache first
if (this.cache.has(text)) {
return this.cache.get(text)!;
}
// Generate new embedding
const embedding = await super.generateEmbedding(text);
// Cache the result
this.cacheEmbedding(text, embedding);
return embedding;
}
private cacheEmbedding(text: string, embedding: number[]): void {
// Remove oldest entries if cache is full
if (this.cache.size >= this.maxCacheSize) {
const firstKey = this.cache.keys().next().value;
if (firstKey !== undefined) {
this.cache.delete(firstKey);
}
}
this.cache.set(text, embedding);
}
// Preload common embeddings
async preloadEmbeddings(texts: string[]): Promise<void> {
const uncachedTexts = texts.filter(text => !this.cache.has(text));
if (uncachedTexts.length > 0) {
const embeddings = await this.generateBatchEmbeddings(uncachedTexts);
uncachedTexts.forEach((text, index) => {
this.cache.set(text, embeddings[index]);
});
}
}
}
📊 Embedding Quality and Performance
Model Comparison
// Different models for different use cases
const EMBEDDING_MODELS = {
// Best for general text (768 dimensions)
'nomic-embed-text': {
dimensions: 768,
useCase: 'General text embedding',
performance: 'High',
size: '274MB'
},
// Smaller, faster model (384 dimensions)
'all-minilm': {
dimensions: 384,
useCase: 'Fast embedding for simple text',
performance: 'Very High',
size: '80MB'
},
// Multilingual support
'multilingual-e5-large': {
dimensions: 1024,
useCase: 'Multilingual text',
performance: 'High',
size: '1.1GB'
}
};
Quality Testing
async function testEmbeddingQuality() {
const embeddingService = new OllamaEmbeddingService();
// Test similar concepts
const similarTexts = [
"I love programming",
"I enjoy coding",
"Programming is fun"
];
const embeddings = await embeddingService.generateBatchEmbeddings(similarTexts);
// Calculate similarities
for (let i = 0; i < embeddings.length; i++) {
for (let j = i + 1; j < embeddings.length; j++) {
const similarity = cosineSimilarity(embeddings[i], embeddings[j]);
console.log(`Similarity between "${similarTexts[i]}" and "${similarTexts[j]}": ${similarity}`);
}
}
}
function cosineSimilarity(vecA: number[], vecB: number[]): number {
const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
const magnitudeA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
const magnitudeB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));
if (magnitudeA === 0 || magnitudeB === 0) return 0; // Guard against zero vectors
return dotProduct / (magnitudeA * magnitudeB);
}
Performance Monitoring
class MonitoredEmbeddingService extends OllamaEmbeddingService {
private metrics = {
totalRequests: 0,
totalTime: 0,
averageTime: 0,
errors: 0
};
async generateEmbedding(text: string): Promise<number[]> {
const startTime = Date.now();
this.metrics.totalRequests++;
try {
const embedding = await super.generateEmbedding(text);
const duration = Date.now() - startTime;
this.metrics.totalTime += duration;
this.metrics.averageTime = this.metrics.totalTime / this.metrics.totalRequests;
return embedding;
} catch (error) {
this.metrics.errors++;
throw error;
}
}
getMetrics() {
return {
...this.metrics,
errorRate: this.metrics.errors / this.metrics.totalRequests,
requestsPerSecond: this.metrics.totalRequests / (this.metrics.totalTime / 1000)
};
}
}
🔧 Configuration and Optimization
Environment Configuration
# packages/server/.env
OLLAMA_BASE_URL=http://localhost:11434
MEMORY_EMBEDDING_MODEL=nomic-embed-text
MEMORY_EMBEDDING_DIMENSIONS=768
MEMORY_EMBEDDING_BATCH_SIZE=5
MEMORY_EMBEDDING_DELAY_MS=100
MEMORY_EMBEDDING_CACHE_SIZE=1000
Service Configuration
interface EmbeddingConfig {
baseUrl: string;
model: string;
dimensions: number;
batchSize: number;
delay: number;
cacheSize: number;
timeout: number;
}
const defaultConfig: EmbeddingConfig = {
baseUrl: process.env.OLLAMA_BASE_URL || 'http://localhost:11434',
model: process.env.MEMORY_EMBEDDING_MODEL || 'nomic-embed-text',
dimensions: parseInt(process.env.MEMORY_EMBEDDING_DIMENSIONS || '768'),
batchSize: parseInt(process.env.MEMORY_EMBEDDING_BATCH_SIZE || '5'),
delay: parseInt(process.env.MEMORY_EMBEDDING_DELAY_MS || '100'),
cacheSize: parseInt(process.env.MEMORY_EMBEDDING_CACHE_SIZE || '1000'),
timeout: 30000
};
Memory System Integration
class MemoryEmbeddingService {
private embeddingService: CachedEmbeddingService;
private config: EmbeddingConfig;
constructor(config: EmbeddingConfig = defaultConfig) {
this.config = config;
this.embeddingService = new CachedEmbeddingService(config.cacheSize);
}
async generateMemoryEmbedding(memory: {
content: string;
concept?: string;
category?: string;
tags?: string[];
}): Promise<number[]> {
// Combine different parts of memory for better embedding
const combinedText = [
memory.content,
memory.concept,
memory.category,
...(memory.tags || [])
].filter(Boolean).join(' ');
return this.embeddingService.generateEmbedding(combinedText);
}
async generateQueryEmbedding(query: string): Promise<number[]> {
return this.embeddingService.generateEmbedding(query);
}
async generateBatchMemoryEmbeddings(memories: Array<{ content: string; concept?: string; category?: string; tags?: string[] }>): Promise<number[][]> {
const texts = memories.map(memory =>
[memory.content, memory.concept, memory.category, ...(memory.tags || [])]
.filter(Boolean)
.join(' ')
);
return this.embeddingService.generateBatchEmbeddings(texts);
}
}
🎯 Real-World Examples
Example 1: Knowledge Base Embeddings
// Store knowledge with embeddings
const knowledgeItems = [
{
content: "React hooks allow you to use state and lifecycle features in functional components",
concept: "React Hooks",
category: "frontend",
tags: ["React", "hooks", "functional-components"]
},
{
content: "useState is a React hook that lets you add state to functional components",
concept: "useState Hook",
category: "frontend",
tags: ["React", "useState", "state-management"]
}
];
const embeddingService = new MemoryEmbeddingService();
// Generate embeddings for all knowledge items
const embeddings = await embeddingService.generateBatchMemoryEmbeddings(knowledgeItems);
// Store in Pinecone with embeddings
for (let i = 0; i < knowledgeItems.length; i++) {
await storeSemanticMemory({
id: `knowledge-${i}`,
...knowledgeItems[i],
embedding: embeddings[i]
});
}
Example 2: User Query Processing
// Process user query
const userQuery = "How do I manage state in React components?";
// Generate query embedding
const queryEmbedding = await embeddingService.generateQueryEmbedding(userQuery);
// Search for similar knowledge
const similarKnowledge = await searchSemanticMemory(queryEmbedding, {
userId: 'user-123',
threshold: 0.7,
limit: 5
});
// Results will include both React hooks knowledge items
// even though the query uses different wording
Example 3: Learning Progress Tracking
// Track what user has learned
const learningProgress = [
{
content: "User successfully implemented useState hook in their React component",
concept: "useState Implementation",
category: "learning-progress",
tags: ["React", "useState", "learning", "progress"]
},
{
content: "User asked about useEffect hook and implemented it correctly",
concept: "useEffect Learning",
category: "learning-progress",
tags: ["React", "useEffect", "learning", "progress"]
}
];
// Generate embeddings for learning progress
const progressEmbeddings = await embeddingService.generateBatchMemoryEmbeddings(learningProgress);
// Store learning progress
for (let i = 0; i < learningProgress.length; i++) {
await storeSemanticMemory({
id: `progress-${Date.now()}-${i}`,
...learningProgress[i],
embedding: progressEmbeddings[i]
});
}
🚨 Troubleshooting
Common Issues
1. Ollama Not Running
# Check if Ollama is running
curl http://localhost:11434/api/tags
# If not running, start it
ollama serve
2. Model Not Found
# Check available models
ollama list
# Install missing model
ollama pull nomic-embed-text
3. Connection Timeout
// Add timeout to requests
const response = await fetch(`${this.baseUrl}/api/embeddings`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ model: this.model, prompt: text }),
signal: AbortSignal.timeout(30000) // 30 second timeout
});
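If timeouts keep happening under load, wrapping calls in a small retry helper is a common mitigation. This is a sketch, not part of the service above; the helper name is made up, and the attempt count and delays are arbitrary.

// Hypothetical helper: retry an embedding call a few times with increasing delays
async function embedWithRetry(
  service: OllamaEmbeddingService,
  text: string,
  attempts: number = 3
): Promise<number[]> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await service.generateEmbedding(text);
    } catch (error) {
      lastError = error;
      if (attempt < attempts) {
        // Wait 500ms, 1000ms, ... before the next try
        await new Promise(resolve => setTimeout(resolve, 500 * attempt));
      }
    }
  }
  throw lastError;
}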
4. Memory Issues
// Monitor memory usage
class MemoryAwareEmbeddingService extends CachedEmbeddingService {
private maxMemoryUsage = 1000; // Max number of cached embeddings
private cleanupCache(): void {
if (this.cache.size > this.maxMemoryUsage) {
// Remove oldest 20% of cache
const toRemove = Math.floor(this.cache.size * 0.2);
const keys = Array.from(this.cache.keys()).slice(0, toRemove);
keys.forEach(key => this.cache.delete(key));
}
}
}
Performance Issues
1. Slow Embedding Generation
// Use smaller model for faster generation
const fastEmbeddingService = new OllamaEmbeddingService(
'http://localhost:11434',
'all-minilm' // Smaller, faster model
);
2. High Memory Usage
// Implement cache limits
class LimitedCacheEmbeddingService extends CachedEmbeddingService {
private cacheLimit = 500; // Named differently from the base class's private maxCacheSize
private enforceCacheLimit(): void {
if (this.cache.size > this.cacheLimit) {
const keys = Array.from(this.cache.keys());
const toRemove = keys.slice(0, this.cache.size - this.cacheLimit);
toRemove.forEach(key => this.cache.delete(key));
}
}
}
🎯 Best Practices
1. Model Selection
- nomic-embed-text: Best for general use (768 dimensions)
- all-minilm: Fastest for simple text (384 dimensions)
- multilingual-e5-large: Best for multilingual content (1024 dimensions); see the model-switching sketch below
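Because the model is just a constructor argument on OllamaEmbeddingService, switching models is a one-line change. A sketch, using the model names from the comparison table above (availability in the Ollama library may vary):

// Sketch: choose the embedding model per workload
const generalService = new OllamaEmbeddingService('http://localhost:11434', 'nomic-embed-text');
const fastService = new OllamaEmbeddingService('http://localhost:11434', 'all-minilm');
const multilingualService = new OllamaEmbeddingService('http://localhost:11434', 'multilingual-e5-large');

// Note: vectors from different models have different dimensions and are not
// comparable with each other, so store and query with the same model.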
2. Performance Optimization
- Use caching: Avoid regenerating same embeddings
- Batch processing: Group multiple requests together
- Appropriate delays: Don't overwhelm Ollama with requests
- Monitor resources: Keep track of memory and CPU usage
3. Quality Assurance
- Test embeddings: Verify similarity scores make sense
- Monitor performance: Track generation times and error rates
- Regular updates: Keep Ollama and models updated
- Backup strategy: Have fallback models available (see the sketch below)
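One way to realize the backup strategy is a small wrapper that falls back to a second model when the primary fails. A sketch only; the class name and fallback choice are illustrative.

// Sketch: fall back to a smaller model if the primary one errors out
class FallbackEmbeddingService {
  private primary = new OllamaEmbeddingService('http://localhost:11434', 'nomic-embed-text');
  private fallback = new OllamaEmbeddingService('http://localhost:11434', 'all-minilm');

  async generateEmbedding(text: string): Promise<number[]> {
    try {
      return await this.primary.generateEmbedding(text);
    } catch (error) {
      console.warn('Primary embedding model failed, falling back:', error);
      // Caution: fallback vectors have 384 dimensions, not 768, so they must be
      // stored separately or re-embedded once the primary model recovers
      return this.fallback.generateEmbedding(text);
    }
  }
}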
4. Security
- Local only: Never expose Ollama to external networks
- Input validation: Sanitize all text inputs (a sketch follows this list)
- Resource limits: Prevent excessive resource usage
- Access control: Secure the Ollama service
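For input validation, a small guard in front of the embedding call keeps empty or oversized inputs out of Ollama. A minimal sketch with an arbitrary length limit (tune it to your model's context window):

// Sketch: validate and trim text before sending it to the embedding service
const MAX_EMBEDDING_INPUT_CHARS = 8000; // arbitrary limit chosen for illustration

function prepareEmbeddingInput(text: string): string {
  const cleaned = text.trim();
  if (cleaned.length === 0) {
    throw new Error('Cannot embed empty text');
  }
  // Truncate overly long inputs instead of failing outright
  return cleaned.length > MAX_EMBEDDING_INPUT_CHARS
    ? cleaned.slice(0, MAX_EMBEDDING_INPUT_CHARS)
    : cleaned;
}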
🚀 Next Steps
Now that you understand Ollama embeddings, explore:
- Memory Examples - Practical usage scenarios
- Troubleshooting Guide - Common issues and solutions
- Memory System Overview - Complete system understanding
- Neo4j Integration - Episodic memory with graphs
Ready to see it all in action? Check out the Memory Examples for practical scenarios!