FastAPI is a strong fit for serving machine learning models. This guide walks through patterns for building production-grade generative AI services with it: real-time inference, streaming responses, background processing, rate limiting, caching, health checks, and deployment for horizontal scaling.
Why FastAPI for AI Services?
When building AI-powered applications, you need a framework that can:
- Handle high concurrency without blocking on I/O operations
- Support streaming responses for token-by-token generation
- Integrate seamlessly with async ML libraries
- Provide automatic API documentation for your endpoints
- Scale horizontally with container orchestration
FastAPI checks every box.
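The first point in the list, concurrency without blocking on I/O, comes from the asyncio event loop that FastAPI runs on. A minimal sketch of the idea using only the standard library follows; `fake_io_call` and `handle_many` are hypothetical names, and the sleep stands in for a network or disk wait.

```python
import asyncio
import time


async def fake_io_call(request_id: int, delay: float = 0.1) -> str:
    """Hypothetical stand-in for a network or disk wait."""
    await asyncio.sleep(delay)  # yields control to the event loop while waiting
    return f"request {request_id} done"


async def handle_many(n: int) -> tuple[list[str], float]:
    """Run n simulated requests concurrently and report the elapsed time."""
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_io_call(i) for i in range(n)))
    elapsed = time.perf_counter() - start
    return list(results), elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(handle_many(50))
    print(f"{len(results)} requests finished in {elapsed:.2f}s")
```

Because every simulated request waits at the same time, the total time stays close to a single delay instead of growing with the number of requests. This only holds for waits that yield to the event loop; CPU-bound or GPU-bound work does not.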
Architecture Overview
from fastapi import FastAPI, BackgroundTasks
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import asyncio

app = FastAPI(title="AI Service", version="1.0.0")


class GenerationRequest(BaseModel):
    prompt: str
    max_tokens: int = 512
    temperature: float = 0.7
    stream: bool = False


class GenerationResponse(BaseModel):
    text: str
    tokens_used: int
    model: str

Streaming Responses for Real-Time Generation
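Streaming in this guide relies on the Server-Sent Events (SSE) wire format: each event is one or more `data:` lines followed by a blank line. To make that framing concrete, here is a minimal sketch of how a client might decode such a stream. The function name `iter_sse_data` is an illustrative assumption, not part of FastAPI or any SSE library, and other SSE fields (`event:`, `id:`, comment lines) are simply ignored.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str], done_marker: str = "[DONE]") -> Iterator[str]:
    """Yield the payload of each `data:` event until the done marker appears."""
    buffer: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            # Per the SSE format, one leading space after the colon is dropped.
            payload = line[len("data:"):]
            buffer.append(payload[1:] if payload.startswith(" ") else payload)
        elif line == "" and buffer:
            # A blank line ends the event; multiple data lines join with "\n".
            event = "\n".join(buffer)
            buffer.clear()
            if event == done_marker:
                return
            yield event


stream = ["data: Hello\n", "\n", "data: world\n", "\n", "data: [DONE]\n", "\n"]
print(list(iter_sse_data(stream)))  # ['Hello', 'world']
```

In a browser the built-in `EventSource` API handles this parsing, so a hand-written decoder like this is mainly useful for tests or non-browser clients.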
The magic of modern LLMs is watching text appear token by token. Here's how to implement streaming:
from typing import AsyncGenerator


async def generate_tokens(prompt: str) -> AsyncGenerator[str, None]:
    """Simulate token-by-token generation from an LLM."""
    # In production, this would call your model
    response = "This is a generated response based on your prompt."
    for token in response.split():
        yield f"data: {token} \n\n"
        await asyncio.sleep(0.05)  # Simulate inference time
    yield "data: [DONE]\n\n"


@app.post("/v1/generate/stream")
async def generate_stream(request: GenerationRequest):
    return StreamingResponse(
        generate_tokens(request.prompt),
        media_type="text/event-stream",
        headers={
            "Cache-Control": "no-cache",
            "Connection": "keep-alive",
        },
    )

Handling Long-Running Inference Tasks
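Some AI tasks take too long for a synchronous response. Run them as background tasks and let the client poll a status endpoint for the result (a webhook callback is an alternative not shown here):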
from uuid import uuid4
from datetime import datetime, timezone

from fastapi import HTTPException

# In-memory store (use Redis in production)
task_store: dict = {}


@app.post("/v1/generate/async")
async def generate_async(
    request: GenerationRequest,
    background_tasks: BackgroundTasks,
):
    task_id = str(uuid4())
    task_store[task_id] = {
        "status": "pending",
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "created_at": datetime.now(timezone.utc).isoformat(),
        "result": None,
    }
    background_tasks.add_task(
        run_inference,
        task_id,
        request.prompt,
    )
    return {"task_id": task_id, "status": "pending"}


async def run_inference(task_id: str, prompt: str):
    """Run the actual inference in the background."""
    await asyncio.sleep(5)  # Simulate long inference
    task_store[task_id]["status"] = "completed"
    task_store[task_id]["result"] = f"Generated text for: {prompt}"


@app.get("/v1/tasks/{task_id}")
async def get_task_status(task_id: str):
    if task_id not in task_store:
        raise HTTPException(status_code=404, detail="Task not found")
    return task_store[task_id]

Rate Limiting and API Keys
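Per-key limits usually follow one of two designs: a fixed window, which counts requests and resets the counter every interval, or a token bucket, which refills continuously and allows short bursts up to a cap. As a reference point for the second design, here is a minimal token bucket sketch. The class name `TokenBucket`, its parameters, and the injectable `clock` are illustrative assumptions, not taken from any library.

```python
import time
from typing import Callable


class TokenBucket:
    """Illustrative token bucket: refills continuously, allows bursts up to capacity."""

    def __init__(
        self,
        capacity: float,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = capacity  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A fake clock makes the behaviour deterministic for the demonstration.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_second=1, clock=lambda: fake_now[0])
print(bucket.allow(), bucket.allow(), bucket.allow())  # True True False
fake_now[0] += 1.0  # one second later, one token has been refilled
print(bucket.allow())  # True
```

A fixed window is simpler to reason about but lets a client send up to twice the limit across a window boundary; a token bucket smooths that out at the cost of tracking a timestamp per key.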
Protect your expensive GPU resources with proper rate limiting:
from fastapi import Depends, HTTPException, Header
import time


# Fixed-window rate limiter: each key gets a counter that resets every 60 seconds
class RateLimiter:
    def __init__(self, requests_per_minute: int = 60):
        self.requests_per_minute = requests_per_minute
        self.windows: dict = {}

    async def check(self, api_key: str) -> bool:
        now = time.time()
        if api_key not in self.windows:
            self.windows[api_key] = {"count": 0, "reset_at": now + 60}
        window = self.windows[api_key]
        # Start a new window once the current one has expired
        if now > window["reset_at"]:
            window["count"] = 0
            window["reset_at"] = now + 60
        if window["count"] >= self.requests_per_minute:
            return False
        window["count"] += 1
        return True


rate_limiter = RateLimiter()


async def verify_api_key(x_api_key: str = Header(...)):
    # Validate API key against your database
    if not x_api_key.startswith("sk-"):
        raise HTTPException(status_code=401, detail="Invalid API key")
    if not await rate_limiter.check(x_api_key):
        raise HTTPException(status_code=429, detail="Rate limit exceeded")
    return x_api_key

Caching Responses for Efficiency
Don't waste GPU cycles on duplicate requests:
from hashlib import sha256
import json

# Simple in-memory cache (use Redis for production)
response_cache: dict = {}


def get_cache_key(request: GenerationRequest) -> str:
    """Generate a unique cache key for a request."""
    content = json.dumps({
        "prompt": request.prompt,
        "max_tokens": request.max_tokens,
        "temperature": request.temperature,
    }, sort_keys=True)
    return sha256(content.encode()).hexdigest()


@app.post("/v1/generate")
async def generate(
    request: GenerationRequest,
    api_key: str = Depends(verify_api_key),
):
    cache_key = get_cache_key(request)
    # Note: with temperature > 0 the model samples, so a cache hit replays
    # one earlier sample instead of generating a fresh one
    if cache_key in response_cache:
        return response_cache[cache_key]

    # Run inference (run_model_inference is a placeholder for your model call)
    result = await run_model_inference(request)

    # Cache the result
    response_cache[cache_key] = result
    return result

Model Health Checks
Ensure your AI service stays healthy in production:
from fastapi import status


@app.get("/health", status_code=status.HTTP_200_OK)
async def health_check():
    """Kubernetes-ready health check endpoint."""
    return {
        "status": "healthy",
        "model_loaded": True,
        # check_gpu_status() is a placeholder you need to implement
        "gpu_available": check_gpu_status(),
        "cache_size": len(response_cache),
    }


@app.get("/ready")
async def readiness_check():
    """Check if the service is ready to receive traffic."""
    # is_model_warmed_up() is a placeholder; return 503 until warm-up has finished
    if not is_model_warmed_up():
        raise HTTPException(
            status_code=503,
            detail="Model not ready",
        )
    return {"ready": True}

Deployment Best Practices
When deploying AI services with FastAPI:
- Use Gunicorn with Uvicorn workers for multi-process scaling
- Implement graceful shutdown to complete in-flight requests
- Set appropriate timeouts for long-running inference
- Monitor GPU memory to prevent OOM errors
- Use async for I/O-bound work, and run blocking inference in a thread or process pool so it does not stall the event loop
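One caveat behind the async and timeout points above: a synchronous model call made directly inside an `async def` endpoint blocks the event loop for its whole duration, so no other request is served in the meantime. A minimal sketch of one way around this uses the standard library's `asyncio.to_thread` (Python 3.9+). Here `blocking_inference` and `heartbeat` are hypothetical stand-ins, and the sleep simulates compute.

```python
import asyncio
import time


def blocking_inference(prompt: str) -> str:
    """Hypothetical stand-in for a synchronous model call."""
    time.sleep(0.2)  # simulates compute; this call never yields to the event loop
    return f"output for: {prompt}"


async def heartbeat(ticks: list[float], interval: float = 0.05, count: int = 4) -> None:
    """Record timestamps so we can see the event loop kept running."""
    for _ in range(count):
        ticks.append(time.perf_counter())
        await asyncio.sleep(interval)


async def main() -> tuple[str, int]:
    ticks: list[float] = []
    beat = asyncio.create_task(heartbeat(ticks))
    # Run the blocking call in a worker thread; the loop stays free for other tasks.
    result = await asyncio.to_thread(blocking_inference, "hello")
    await beat
    return result, len(ticks)


if __name__ == "__main__":
    result, tick_count = asyncio.run(main())
    print(result, tick_count)
```

Threads help most when the inference library releases the GIL during native computation, which is an assumption to verify for your stack. For pure-Python CPU-bound work, a process pool is usually the better choice.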
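A launch command along these lines covers the worker and timeout points. Each worker is a separate process, so in-memory state such as the task store, cache, and rate limiter is not shared between workers:

gunicorn app:app \
    --workers 4 \
    --worker-class uvicorn.workers.UvicornWorker \
    --timeout 120 \
    --bind 0.0.0.0:8000

Conclusion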
FastAPI provides a solid foundation for building generative AI services. Its async-first design, automatic documentation, and type-validated request models make it well suited to serving ML models at scale, whether you're building a chatbot, an image generator, or a code assistant.
Ready to ship your AI service? ForgeAPI provides a production-ready FastAPI template with all the patterns you need—authentication, rate limiting, background jobs, and more.