Building Generative AI Services with FastAPI

FastAPI is a strong fit for serving machine learning models over HTTP. This guide walks through the patterns a generative AI service typically needs: streaming responses, background inference, rate limiting, caching, health checks, and multi-worker deployment.

Why FastAPI for AI Services?

When building AI-powered applications, you need a framework that can:

Handle high concurrency without blocking on I/O operations
Support streaming responses for token-by-token generation
Integrate seamlessly with async ML libraries
Provide automatic API documentation for your endpoints
Scale horizontally with container orchestration

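FastAPI covers each of these. It runs on ASGI via Starlette, so endpoints can be async; it ships a StreamingResponse class; it generates OpenAPI documentation (served at /docs by default) from your type hints and Pydantic models; and a FastAPI app is an ordinary ASGI application that can be replicated behind a load balancer.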

Architecture Overview

from fastapi import BackgroundTasks, FastAPI, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import asyncio
 
app = FastAPI(title="AI Service", version="1.0.0")
 
class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    stream: bool = False
 
class GenerationResponse(BaseModel):
    text: str
    tokens_used: int
    model: str
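
The request model above accepts any integer for max_tokens and any float for temperature. One way to reject out-of-range values before they reach the model is Pydantic's Field constraints. The sketch below is illustrative only: the class name BoundedGenerationRequest and the specific limits (4096 tokens, temperature between 0 and 2) are assumptions, not values from this guide.

```python
from pydantic import BaseModel, Field, ValidationError


class BoundedGenerationRequest(BaseModel):
    """Hypothetical variant of GenerationRequest with validated ranges."""
    prompt: str = Field(..., min_length=1)            # reject empty prompts
    max_tokens: int = Field(512, gt=0, le=4096)       # 4096 is an assumed ceiling
    temperature: float = Field(0.7, ge=0.0, le=2.0)   # assumed sampling range
    stream: bool = False


# An invalid value raises ValidationError at construction time.
try:
    BoundedGenerationRequest(prompt="hi", max_tokens=0)
    rejected = False
except ValidationError:
    rejected = True

ok = BoundedGenerationRequest(prompt="hi")
print(rejected, ok.max_tokens)
```

When a model like this is used as an endpoint parameter, FastAPI returns a 422 response for invalid input without any extra code in the handler.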

Streaming Responses for Real-Time Generation

The magic of modern LLMs is watching text appear token by token. Here's how to implement streaming:

from typing import AsyncGenerator
 
async def generate_tokens(prompt: str) -> AsyncGenerator[str, None]:
    """Simulate token-by-token generation from an LLM."""
    # In production, this would call your model
    response = "This is a generated response based on your prompt."
    
    for token in response.split():
        yield f"data: {token} \n\n"
        await asyncio.sleep(0.05)  # Simulate inference time
    
    yield "data: [DONE]\n\n"
 
@app.post("/v1/generate/stream")
async def generate_stream(request: GenerationRequest):
    return StreamingResponse(
        generate_tokens(request.prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        }
    )
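
On the client side, the stream produced by this endpoint has to be split back into tokens. The following is a minimal sketch of that parsing step, assuming the client already has the response as an iterable of text lines; the function name iter_sse_data is hypothetical. It is deliberately simplified: it only reads data fields and strips all surrounding whitespace, whereas a full Server-Sent Events parser also handles event, id, and retry fields and multi-line data.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line until the [DONE] sentinel."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # the server signalled the end of generation
        yield payload


# Simulated wire output of the endpoint above, split into lines.
wire = "data: This \n\ndata: is \n\ndata: [DONE]\n\n".splitlines()
print(" ".join(iter_sse_data(wire)))
```

In a real client the lines would come from a streaming HTTP response rather than a string, but the framing logic stays the same.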

Handling Long-Running Inference Tasks

Some AI tasks take too long for a synchronous response. Run them as background tasks, return a task ID immediately, and let the client poll a status endpoint for the result:

from uuid import uuid4
from datetime import datetime, timezone
 
# In-memory store (use Redis in production)
task_store: dict = {}
 
@app.post("/v1/generate/async")
async def generate_async(
    request: GenerationRequest,
    background_tasks: BackgroundTasks
):
    task_id = str(uuid4())
    task_store[task_id] = {
        "status": "pending",
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
        "created_at": datetime.now(timezone.utc).isoformat(),
        "result": None
    }
    
    background_tasks.add_task(
        run_inference,
        task_id,
        request.prompt
    )
    
    return {"task_id": task_id, "status": "pending"}
 
async def run_inference(task_id: str, prompt: str):
    """Run the actual inference in the background."""
    await asyncio.sleep(5)  # Simulate long inference
    
    task_store[task_id]["status"] = "completed"
    task_store[task_id]["result"] = f"Generated text for: {prompt}"
 
@app.get("/v1/tasks/{task_id}")
async def get_task_status(task_id: str):
    if task_id not in task_store:
        raise HTTPException(status_code=404, detail="Task not found")
    return task_store[task_id]
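
The run_inference function above never records a failure, so if the model call raises, the task stays "pending" forever and the polling client cannot tell. A minimal sketch of one way to handle this is shown below. It is self-contained for illustration: tasks stands in for task_store, and fake_model and run_inference_safely are hypothetical names, not part of the code above.

```python
import asyncio

# Hypothetical stand-in for the in-memory task_store.
tasks: dict[str, dict] = {}


async def fake_model(prompt: str) -> str:
    """Placeholder for a real model call; rejects empty prompts."""
    await asyncio.sleep(0)
    if not prompt:
        raise ValueError("empty prompt")
    return f"Generated text for: {prompt}"


async def run_inference_safely(task_id: str, prompt: str) -> None:
    """Run inference and always leave the task in a terminal state."""
    tasks[task_id]["status"] = "running"
    try:
        tasks[task_id]["result"] = await fake_model(prompt)
        tasks[task_id]["status"] = "completed"
    except Exception as exc:  # record the error instead of losing it
        tasks[task_id]["status"] = "failed"
        tasks[task_id]["error"] = str(exc)


async def demo() -> None:
    tasks["a"] = {"status": "pending", "result": None}
    tasks["b"] = {"status": "pending", "result": None}
    await asyncio.gather(
        run_inference_safely("a", "hello"),
        run_inference_safely("b", ""),
    )


asyncio.run(demo())
print(tasks["a"]["status"], tasks["b"]["status"])
```

With a "failed" status and an error message in the store, the status endpoint can report the problem to the client instead of leaving it to time out.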

Rate Limiting and API Keys

Protect your expensive GPU resources with proper rate limiting:

from fastapi import Depends, HTTPException, Header
import time
 
# Fixed-window rate limiter: counts requests per API key in 60-second windows
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.tokens: dict = {}
    
    async def check(self, api_key: str) -> bool:
        now = time.time()
        if api_key not in self.tokens:
            self.tokens[api_key] = {"count": 0, "reset_at": now + 60}
        
        bucket = self.tokens[api_key]
        if now > bucket["reset_at"]:
            bucket["count"] = 0
            bucket["reset_at"] = now + 60
        
        if bucket["count"] >= self.requests_per_minute:
            return False
        
        bucket["count"] += 1
        return True
 
rate_limiter = RateLimiter()
 
async def verify_api_key(x_api_key: str = Header(...)):
    # Validate API key against your database
    if not x_api_key.startswith("sk-"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    
    if not await rate_limiter.check(x_api_key):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    
    return x_api_key
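
The RateLimiter above resets its counter once per 60-second window, which allows a client to send a full quota at the end of one window and another at the start of the next. A token bucket smooths this out by refilling capacity continuously. The sketch below shows the idea for a single key; the class name TokenBucket, its parameters, and the injectable clock (used here only to make the demo deterministic) are assumptions for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `rate` tokens per second."""

    def __init__(self, capacity: int, rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.rate = rate
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Deterministic demo with a fake clock (seconds).
fake_now = [0.0]
bucket = TokenBucket(capacity=2, rate=1.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]
fake_now[0] = 1.0  # one second later, one token has been refilled
after_wait = bucket.allow()
print(burst, after_wait)
```

To limit per API key, one bucket per key could be kept in a dictionary, in the same way the RateLimiter keeps one counter per key.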

Caching Responses for Efficiency

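Don't waste GPU cycles on duplicate requests. Keep in mind that with a temperature above 0 the model samples its output, so a cache hit returns the earlier sample rather than a fresh one: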

from hashlib import sha256
import json
 
# Simple in-memory cache (use Redis for production)
response_cache: dict = {}
 
def get_cache_key(request: GenerationRequest) -> str:
    """Generate a unique cache key for a request."""
    content = json.dumps({
        "prompt": request.prompt,
        "max_tokens": request.max_tokens,
        "temperature": request.temperature
    }, sort_keys=True)
    return sha256(content.encode()).hexdigest()
 
@app.post("/v1/generate")
async def generate(
    request: GenerationRequest,
    api_key: str = Depends(verify_api_key)
):
    cache_key = get_cache_key(request)
    
    if cache_key in response_cache:
        return response_cache[cache_key]
    
    # Run inference (run_model_inference is assumed to be defined elsewhere)
    result = await run_model_inference(request)
    
    # Cache the result
    response_cache[cache_key] = result
    
    return result
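
The response_cache dictionary above grows without bound and never expires entries. Before moving to Redis, a bounded cache with a time-to-live is a reasonable intermediate step. The sketch below combines least-recently-used eviction with expiry; the class name TTLCache, its parameters, and the injectable clock are hypothetical and exist only for illustration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """Small LRU cache whose entries expire after `ttl` seconds."""

    def __init__(self, max_entries: int, ttl: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_entries = max_entries
        self.ttl = ttl
        self.clock = clock
        self._data: "OrderedDict[str, tuple[float, Any]]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used


# Deterministic demo with a fake clock (seconds).
fake_now = [0.0]
cache = TTLCache(max_entries=2, ttl=10.0, clock=lambda: fake_now[0])
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # touch "a" so "b" becomes the eviction candidate
cache.set("c", 3)  # the cache is full, so "b" is evicted
```

The key passed to get and set would be the SHA-256 digest returned by get_cache_key.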

Model Health Checks

Ensure your AI service stays healthy in production:

from fastapi import status
 
@app.get("/health", status_code=status.HTTP_200_OK)
async def health_check():
    """Kubernetes-ready health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": True,
        "gpu_available": check_gpu_status(),
        "cache_size": len(response_cache)
    }
 
@app.get("/ready")
async def readiness_check():
    """Check if the service is ready to receive traffic."""
    # Refuse traffic until the model has been warmed up
    if not is_model_warmed_up():
        raise HTTPException(
            status_code=503,
            detail="Model not ready"
        )
    return {"ready": True}
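
The endpoints above call check_gpu_status() and is_model_warmed_up(), which are not defined in this guide. One possible shape for them is sketched below. This is an assumption, not the original implementation: it treats PyTorch as an optional dependency for the GPU probe, and the ModelState class and its methods are hypothetical names.

```python
import threading

try:
    import torch  # optional; only used for the GPU probe
except ImportError:
    torch = None


class ModelState:
    """Thread-safe flag that flips once the model has served a warm-up call."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_warmed_up(self) -> None:
        self._ready.set()

    def is_warmed_up(self) -> bool:
        return self._ready.is_set()


model_state = ModelState()


def check_gpu_status() -> bool:
    """Return True only if PyTorch is installed and sees a CUDA device."""
    return bool(torch is not None and torch.cuda.is_available())


def is_model_warmed_up() -> bool:
    """Report whether the startup code has finished its warm-up request."""
    return model_state.is_warmed_up()
```

The startup code would call model_state.mark_warmed_up() after loading the model and running one throwaway inference, at which point /ready starts returning 200.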

Deployment Best Practices

When deploying AI services with FastAPI:

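Use Gunicorn with Uvicorn workers for multi-process scaling; each worker is a separate process with its own in-memory task store, rate-limit counters, and cache, so shared state belongs in an external store such as Redis
Implement graceful shutdown to complete in-flight requests
Set appropriate timeouts for long-running inference
Monitor GPU memory to prevent OOM errors
Use async for I/O-bound work, and move blocking model calls off the event loop (for example into a worker thread) so they don't stall other requests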

gunicorn app:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --timeout 120 \
  --bind 0.0.0.0:8000
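
Most model libraries expose synchronous calls, and calling one directly inside an async endpoint blocks the event loop for every other request on that worker. A minimal sketch of one way around this, using only the standard library, is to run the call in a worker thread with asyncio.to_thread (Python 3.9+) and cap concurrency with a semaphore. The names blocking_inference and infer and the limit of 2 are assumptions for illustration.

```python
import asyncio
import time

# Assumed limit: how many inferences may run on the device at once.
MAX_CONCURRENT_INFERENCES = 2


def blocking_inference(prompt: str) -> str:
    """Placeholder for a synchronous model call (e.g. a forward pass)."""
    time.sleep(0.05)
    return prompt.upper()


async def infer(prompt: str, limiter: asyncio.Semaphore) -> str:
    async with limiter:
        # Run the blocking call in a worker thread so the loop stays free.
        return await asyncio.to_thread(blocking_inference, prompt)


async def main() -> list[str]:
    limiter = asyncio.Semaphore(MAX_CONCURRENT_INFERENCES)
    return await asyncio.gather(*(infer(p, limiter) for p in ["a", "b", "c"]))


results = asyncio.run(main())
print(results)
```

The semaphore keeps a burst of requests from queuing more work on the GPU than it has memory for, while the event loop remains free to serve health checks and streaming responses.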

Conclusion

FastAPI provides a solid foundation for building generative AI services. Its async-first design, automatic documentation, and type-checked request models suit the patterns covered here: streaming, background tasks, rate limiting, caching, and health checks. Whether you're building a chatbot, image generator, or code assistant, the same building blocks apply.
