I. Foundational Design Philosophy
System Overview
┌─────────────┐
│ CloudFront │
│ (CDN) │
└──────┬──────┘
│
┌────────────┴────────────┐
│ │
┌────────▼────────┐ ┌──────────▼──────────┐
│ Frontend App │ │ API Gateway │
│ (Next.js/React)│ │ (Auth, Rate Limit) │
└─────────────────┘ └──────────┬───────────┘
│
┌────────────┴────────────┐
│ Core API Service │
│ (NestJS / FastAPI) │
│ │
│ ┌─────────────────────┐ │
│ │ Module: Auth │ │
│ │ Module: Teams │ │
│ │ Module: Tasks │ │
│ │ Module: Contacts │ │
│ │ Module: Documents │ │
│ │ Module: Leads │ │
│ │ Module: Notes │ │
│ │ Module: Notifs │ │
│ │ Module: Finance │ │
│ └─────────────────────┘ │
└────────────┬─────────────┘
│
┌──────────────────┼──────────────────┐
│ │ │
┌──────▼──────┐ ┌──────▼──────┐ ┌───────▼──────┐
│ PostgreSQL │ │ Redis │ │ S3 / Files │
│ (Aurora) │ │ (Cache + │ │ │
│ │ │ Queues) │ │ │
└─────────────┘ └─────────────┘ └──────────────┘
1. Cell-Based Architecture
Google's approach to shared-infrastructure isolation.
Every deployment is a "cell" — an independent, hermetically sealed unit containing the full stack. Each product is a cell. Cells share nothing at runtime but share the same codebase modules. A failure in one cell cannot cascade to another.
2. Zero Trust Security Model
Based on NIST SP 800-207 (Zero Trust Architecture).
No implicit trust. Every request is authenticated, authorized, and encrypted — even internal service-to-service calls. Network location (VPC, subnet) grants zero privilege. Identity is the only perimeter.
3. AWS Multi-Account Isolation
AWS Well-Architected Framework.
Separate AWS accounts are hard security boundaries. Each concern gets its own blast radius. You don't share accounts between products.
II. AWS Multi-Account Strategy
Account Topology
AWS Organization (Root)
│
├── Management Account (billing, SCPs, Organization policies ONLY)
│ └── No workloads ever run here
│
├── OU: Security
│ ├── Security Tooling Account
│ │ ├── GuardDuty delegated admin
│ │ ├── Security Hub aggregator
│ │ ├── CloudTrail organization trail (immutable S3)
│ │ ├── AWS Config aggregator
│ │ └── IAM Access Analyzer
│ │
│ └── Log Archive Account
│ ├── Centralized CloudWatch Logs
│ ├── CloudTrail logs (write-once, read-many)
│ ├── VPC Flow Logs & S3 access logs
│ └── Retention: 7 years (compliance)
│
├── OU: Shared Services
│ ├── Network Hub Account
│ │ ├── Transit Gateway (hub-and-spoke)
│ │ ├── Route 53 Hosted Zones
│ │ ├── AWS Certificate Manager
│ │ └── VPN / Direct Connect termination
│ │
│ ├── Shared Services Account
│ │ ├── ECR (container registry)
│ │ ├── Artifact stores (npm, pip)
│ │ ├── Cognito / Keycloak (IdP)
│ │ └── Secrets Manager
│ │
│ └── CI/CD Account
│ ├── GitHub Actions self-hosted runners
│ ├── CDK Pipelines (deploys via cross-account roles)
│ └── Artifact signing (cosign / Sigstore)
│
├── OU: Workloads
│ ├── Product A — Dev / Staging / Prod (3 accounts)
│ ├── Product B — Dev / Staging / Prod (3 accounts)
│ └── ... (each new product gets 3 accounts)
│
└── OU: Sandbox
└── Developer Sandbox Accounts
Why This Matters
- Blast radius isolation — a misconfigured IAM policy in Product A's dev cannot touch Product B's prod. Hard AWS boundary.
- Cost attribution — each product's AWS bill is isolated automatically.
- Compliance — Security OU locked with SCPs. Even root can't delete logs or disable GuardDuty.
- Modularity — new product = one CDK script → 3 accounts with standard config. Minutes.
Cross-Account Access Patterns
CI/CD Account Product A Prod Account
┌──────────────┐ ┌──────────────────────┐
│ CDK Pipeline │───AssumeRole──▶│ DeploymentRole │
│ │ (cross-acct) │ (ECS, RDS, S3 only) │
└──────────────┘ └──────────────────────┘
Shared Services Account Product A Prod Account
┌──────────────┐ ┌──────────────────────┐
│ Cognito │◀──────────────│ API Gateway validates │
│ (IdP) │ JWT issued │ JWT via JWKS endpoint │
└──────────────┘ └──────────────────────┘
III. Security Architecture
Defense in depth — five layers from edge to data.
Layer 1: Edge — CloudFront + WAF
Internet → CloudFront (TLS 1.3 only)
│
├── AWS WAF v2
│ ├── Managed Rules (OWASP Top 10)
│ ├── Rate limiting (2000 req/5min per IP)
│ ├── Geo-blocking
│ ├── Bot Control
│ └── Custom rules (SQLi, XSS)
│
└── AWS Shield Advanced (DDoS)
Layer 2: API Gateway
- Request validation (JSON Schema)
- Mutual TLS for service-to-service
- Usage plans + API keys for external consumers
- Request/response logging → Log Archive Account
- Lambda Authorizer or Cognito Authorizer
Layer 3: Application — Zero Trust Pipeline
Every request — even internal — goes through:
Request
→ TLS termination (ALB)
→ JWT verification (signature + expiry + audience + issuer)
→ Tenant extraction (org_id from token claims)
→ Permission evaluation (RBAC + ABAC)
→ Rate limiting (per-user, per-tenant, per-endpoint)
→ Input validation (Zod schemas, strict mode)
→ Audit logging (who, what, when, from where)
→ Business logic
→ Output sanitization (strip internal fields)
→ Response
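The pipeline above can be sketched as an ordered middleware chain: each step either enriches the request context or rejects the request, so later steps can assume earlier ones ran. This is a minimal sketch — the step names, the `Ctx` shape, and the placeholder claim values are illustrative, not a real framework API.

```typescript
// Zero-trust pipeline sketch: authn → tenancy → authz → audit, in strict order.
type Ctx = { token?: string; orgId?: string; userId?: string; log: string[] };
type Step = (ctx: Ctx) => void;

const verifyJwt: Step = (ctx) => {
  if (!ctx.token) throw new Error("401: missing token");
  ctx.userId = "user_from_token"; // stands in for signature/expiry/audience/issuer checks
  ctx.log.push("jwt");
};
const extractTenant: Step = (ctx) => {
  ctx.orgId = "org_from_claims"; // org_id claim from the already-verified token
  ctx.log.push("tenant");
};
const checkPermission: Step = (ctx) => {
  if (!ctx.userId || !ctx.orgId) throw new Error("403");
  ctx.log.push("authz");
};
const audit: Step = (ctx) => ctx.log.push("audit");

// Steps run strictly in order; a throw anywhere short-circuits the request.
function runPipeline(steps: Step[], ctx: Ctx): Ctx {
  for (const step of steps) step(ctx);
  return ctx;
}

const ctx = runPipeline([verifyJwt, extractTenant, checkPermission, audit], {
  token: "jwt-here",
  log: [],
});
```

The point of the ordering is that permission evaluation never runs against an unverified identity or an unknown tenant.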
Permission Model — Google Zanzibar
Relationship-based access control at scale. Permissions are stored as tuples and checks are graph traversals.
document:doc_123#viewer@user:alice
document:doc_123#editor@team:engineering#member
team:engineering#member@user:bob
org:acme#admin@user:carol
// "Can Bob edit doc_123?"
// → doc_123#editor includes team:engineering#member
// → team:engineering#member includes user:bob
// → YES
Use SpiceDB or OpenFGA (open-source Zanzibar). Every module calls the permission service. No module implements its own auth logic.
CREATE TABLE permission_tuples (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
namespace VARCHAR(100) NOT NULL,
object_id VARCHAR(200) NOT NULL,
relation VARCHAR(100) NOT NULL,
subject_ns VARCHAR(100) NOT NULL,
subject_id VARCHAR(200) NOT NULL,
subject_rel VARCHAR(100) NOT NULL DEFAULT '', -- empty string, not NULL: NULLs would bypass the UNIQUE constraint below
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(namespace, object_id, relation, subject_ns, subject_id, subject_rel)
);
CREATE INDEX idx_perm_object ON permission_tuples(namespace, object_id);
CREATE INDEX idx_perm_subject ON permission_tuples(subject_ns, subject_id);
CREATE INDEX idx_perm_check ON permission_tuples(namespace, object_id, relation);
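To make the "Can Bob edit doc_123?" traversal concrete, here is an in-memory sketch of the check over tuples shaped like the table above. Real deployments should use SpiceDB or OpenFGA as the text says; this only illustrates the graph walk, and the tuple data is the document's Bob example.

```typescript
// Zanzibar-style check sketch: a subject has a relation on an object if a
// tuple grants it directly, or via a userset (e.g. team:engineering#member)
// that is expanded recursively.
type Tuple = {
  namespace: string; objectId: string; relation: string;
  subjectNs: string; subjectId: string; subjectRel?: string;
};

const tuples: Tuple[] = [
  { namespace: "document", objectId: "doc_123", relation: "editor",
    subjectNs: "team", subjectId: "engineering", subjectRel: "member" },
  { namespace: "team", objectId: "engineering", relation: "member",
    subjectNs: "user", subjectId: "bob" },
];

function check(ns: string, obj: string, rel: string, userId: string): boolean {
  for (const t of tuples) {
    if (t.namespace !== ns || t.objectId !== obj || t.relation !== rel) continue;
    if (t.subjectNs === "user" && t.subjectId === userId) return true;     // direct grant
    if (t.subjectRel && check(t.subjectNs, t.subjectId, t.subjectRel, userId))
      return true;                                                          // userset expansion
  }
  return false;
}
```

A production checker also needs cycle protection and caching; SpiceDB and OpenFGA handle both.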
Layer 4: Data Security
Row-Level Security — even if app code has a bug, the database enforces tenant isolation:
ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
ALTER TABLE tasks FORCE ROW LEVEL SECURITY; -- apply the policy to the table owner too
CREATE POLICY tenant_isolation ON tasks
USING (org_id = current_setting('app.current_org_id')::UUID);
-- Set on every DB connection:
SET app.current_org_id = 'org_xyz';
-- SELECT * FROM tasks only returns org_xyz's data
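On the application side, the tenant setting should be transaction-local so a pooled connection can never leak one tenant's `org_id` into another request. A minimal sketch, assuming a node-postgres-style client; the `Client` interface here is a stand-in, not a real import.

```typescript
// Scope every query to a tenant via RLS. set_config(..., true) makes the
// setting transaction-local, so it resets automatically on COMMIT/ROLLBACK.
interface Client { query(sql: string, params?: unknown[]): Promise<unknown>; }

async function withTenant<T>(
  client: Client, orgId: string, fn: (c: Client) => Promise<T>
): Promise<T> {
  await client.query("BEGIN");
  try {
    // Parameterized — never string-interpolate the org id into SQL.
    await client.query("SELECT set_config('app.current_org_id', $1, true)", [orgId]);
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

Every repository method then runs inside `withTenant(client, orgId, ...)`, and the RLS policy does the filtering even if a query forgets its `WHERE org_id` clause.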
- Encryption: AES-256 at rest, TLS 1.3 in transit, app-level encryption for PII via AWS KMS (per-tenant keys)
- Data classification: columns tagged PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED. Serializers auto-strip based on clearance.
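The auto-stripping serializer can be sketched as a clearance comparison per field. The field names, the example schema, and the "strip unknown fields" default are illustrative assumptions.

```typescript
// Clearance-based output stripping: each column carries a classification
// tag; the serializer drops anything above the caller's clearance level.
const LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"] as const;
type Level = (typeof LEVELS)[number];

// Hypothetical per-column tags for a contacts row.
const contactSchema: Record<string, Level> = {
  first_name: "PUBLIC",
  email: "INTERNAL",
  phone: "CONFIDENTIAL",
  ssn: "RESTRICTED",
};

function serialize(
  row: Record<string, unknown>, clearance: Level
): Record<string, unknown> {
  const max = LEVELS.indexOf(clearance);
  const out: Record<string, unknown> = {};
  for (const [field, value] of Object.entries(row)) {
    // Fail closed: untagged fields are treated as RESTRICTED and stripped.
    const level = contactSchema[field] ?? "RESTRICTED";
    if (LEVELS.indexOf(level) <= max) out[field] = value;
  }
  return out;
}
```

The fail-closed default matters: a newly added column is invisible to lower clearances until someone explicitly tags it.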
Layer 5: Supply Chain Security
- Container images scanned with Trivy/Snyk on every build
- Dependency audit on every PR
- Image signing with cosign — ECS only runs signed images
- SBOM generation for every release
IV. Database Architecture
Aurora PostgreSQL — Serverless v2
| Feature | Benefit |
|---|---|
| Up to 3x faster than standard PG | Purpose-built storage engine, 6-way replication, parallel query |
| Serverless v2 | Scales 0.5 → 128 ACUs in seconds. Dev costs near zero. |
| Up to 15 read replicas | <20ms lag. Route dashboards/reports to replicas. |
| Global Database | Multi-region replication <1s lag |
| Wire-compatible with PG | Standard drivers, ORMs, and most common extensions work |
Write Path Read Path
│ │
▼ ▼
┌──────────────┐ ┌──────────────────┐
│ Aurora Writer │ │ Aurora Reader x3 │
│ (Primary) │ │ (Auto-scaling) │
└──────┬───────┘ └────────┬─────────┘
│ │
▼ ▼
┌──────────────────────────────────┐
│ Aurora Storage (distributed, │
│ 6-way replicated, auto-healing) │
└──────────────────────────────────┘
┌──────────────────────────────────────────┐
│ Redis Cluster │
│ Sessions │ Query Cache │ Rate Limits │
└──────────────────────────────────────────┘
Caching Strategy — Cache-Aside with Invalidation
async function getTask(taskId: string, orgId: string): Promise<Task | null> {
const cacheKey = `task:${orgId}:${taskId}`;
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
const { rows } = await auroraReader.query(
'SELECT * FROM tasks WHERE id = $1 AND org_id = $2',
[taskId, orgId]
);
const task = rows[0] ?? null;
if (task) await redis.setex(cacheKey, 300, JSON.stringify(task)); // 5-min TTL
return task;
}
async function updateTask(taskId: string, orgId: string, data: Partial<Task>) {
await auroraWriter.query(/* ... */);
await redis.del(`task:${orgId}:${taskId}`); // invalidate, don't rewrite, the cache entry
await eventBus.emit('task.updated', { taskId, orgId, changes: data });
}
Data Migration from Legacy Systems
Legacy System AWS
┌──────────────┐ ┌───────────────────────────┐
│ SQL Server │ │ DMS Replication Instance │
│ Oracle │──DMS ────▶│ ├── Full Load (bulk) │
│ MySQL 5.x │ (CDC) │ └── CDC (continuous) │
│ MongoDB │ │ │ │
└──────────────┘ │ ▼ │
│ ┌──────────────┐ │
│ │ Aurora PG │ │
│ └──────────────┘ │
└───────────────────────────┘
Migration Steps
- SCT — Schema Conversion Tool analyzes the legacy schema and converts most of it to PG DDL automatically, flagging the rest for manual work
- DMS Full Load — bulk copies all data (handles type conversions, encoding)
- DMS CDC — continuous replication while legacy still runs. Zero downtime.
- Validation — row counts + data integrity verification
- Cutover — flip DNS, stop CDC, legacy goes read-only
Non-Database Legacy Data (CSV, Excel, XML, APIs)
Source (S3 upload) → Step Function
├── Validate schema
├── Transform (dates, currencies, encodings)
├── Deduplicate
├── Map to target schema (configurable)
├── Batch insert into Aurora
├── Generate migration report
└── Notify (success/failure + row counts)
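The transform and deduplicate steps above can be sketched as a single pass over parsed rows. The field names and the dedupe key (normalized email) are assumptions for illustration; a real Step Function task would also write rejected rows to the migration report.

```typescript
// Transform + dedupe sketch for non-database imports (CSV/Excel rows
// parsed into string maps). Invalid rows are skipped, duplicates are
// collapsed on normalized email, and timestamps are normalized to ISO.
type RawRow = Record<string, string>;
type CleanRow = { email: string; name: string; importedAt: string };

function transform(rows: RawRow[]): CleanRow[] {
  const seen = new Set<string>();
  const out: CleanRow[] = [];
  for (const row of rows) {
    const email = (row.email ?? "").trim().toLowerCase();
    if (!email.includes("@")) continue;     // validate
    if (seen.has(email)) continue;          // deduplicate
    seen.add(email);
    out.push({
      email,
      name: (row.name ?? "").trim(),
      importedAt: new Date().toISOString(), // normalize
    });
  }
  return out;
}
```

Keeping the step pure (rows in, rows out) is what makes it easy to batch, retry, and report on inside a Step Function.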
V. Application — Module System
Core Services (Always Present)
| Module | Responsibility |
|---|---|
| Auth | Signup, login, logout, password reset, MFA, OAuth, session management |
| Users | Profiles, preferences, avatars |
| Organizations | Multi-tenancy, org settings, billing tier |
| Teams | Create teams, add/remove members, team roles |
| Permissions | Zanzibar RBAC + ABAC, permission checks as middleware |
| Entity Links | Universal cross-module linking (any entity to any entity) |
| Notifications | In-app, email, push, webhooks (event-driven) |
| Files / Media | S3 upload/download, presigned URLs, file metadata |
| Audit Log | Immutable record of who did what, when (enterprise compliance) |
| Search | Full-text search via PG tsvector → OpenSearch when needed |
| Settings / Config | Feature flags, app config, per-tenant configuration |
Core Infrastructure Services
| Service | Technology | Purpose |
|---|---|---|
| Event Bus | SNS/SQS or Redis Streams | Modules communicate via events, not direct calls. When a task is created, an event fires and the notification module picks it up. |
| Job Queue | BullMQ (Redis-backed) | Background jobs: emails, report generation, data exports, scheduled tasks |
| Caching Layer | Redis (ElastiCache) | Session data, frequently accessed data, rate limiting counters |
| Logging & Monitoring | CloudWatch + structured JSON | Centralized, queryable logs. Optionally Datadog or Grafana. |
| Health Checks | Per-module /health | Every module exposes a health endpoint for load balancers and orchestration |
Monorepo Structure
platform/
├── packages/
│ ├── core/ # Shared kernel — NEVER optional
│ │ ├── auth/ # JWT, sessions, OAuth
│ │ ├── iam/ # Zanzibar permission engine
│ │ ├── tenancy/ # Org isolation, RLS
│ │ ├── events/ # Event bus abstraction
│ │ ├── storage/ # S3 abstraction
│ │ ├── notifications/ # Multi-channel engine
│ │ ├── audit/ # Immutable audit log
│ │ ├── search/ # Full-text search
│ │ ├── migrations/ # ETL framework
│ │ └── common/ # DTOs, validators, errors
│ │
│ ├── modules/ # Optional business modules
│ │ ├── teams/
│ │ ├── tasks/
│ │ ├── contacts/
│ │ ├── documents/
│ │ ├── notes/
│ │ ├── finance/
│ │ └── [custom]/
│ │
│ ├── sdk/ # Auto-generated TS SDK
│ └── ui/ # Shared UI components
│
├── apps/
│ ├── api/ # Deployable API server
│ ├── worker/ # Background jobs
│ └── web/ # Next.js frontend
│
├── infra/ # AWS CDK
│ ├── lib/
│ │ ├── account-baseline.ts
│ │ ├── networking.ts
│ │ ├── database.ts
│ │ ├── compute.ts
│ │ ├── cdn.ts
│ │ ├── security.ts
│ │ └── observability.ts
│ └── bin/
│ ├── deploy-shared-services.ts
│ └── deploy-product.ts
│
└── tools/
├── migrate/ # Data migration CLI
├── scaffold/ # New module generator
└── sdk-gen/ # OpenAPI → SDK
Module Registration
interface PlatformModule {
name: string;
version: string;
dependencies: string[];
onRegister(container: DependencyContainer): void;
onDatabaseSetup(migrator: Migrator): Promise<void>;
onPermissionsSetup(engine: PermissionEngine): void;
onEventsSetup(bus: EventBus): void;
onReady(): Promise<void>;
onShutdown(): Promise<void>;
}
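Before the lifecycle hooks run, the platform has to order modules so dependencies load first. A minimal sketch of that ordering, modeling only the fields needed for it; deps not present in this product's config (e.g. `core`, which is always loaded) are assumed already available and skipped.

```typescript
// Dependency-ordered module loading: depth-first topological sort over
// each module's `dependencies`, with cycle detection.
type ModuleLite = { name: string; dependencies: string[] };

function loadOrder(modules: ModuleLite[]): string[] {
  const byName = new Map(modules.map((m) => [m.name, m]));
  const order: string[] = [];
  const visiting = new Set<string>();
  const done = new Set<string>();

  function visit(name: string) {
    if (done.has(name)) return;
    if (visiting.has(name)) throw new Error(`dependency cycle at ${name}`);
    visiting.add(name);
    // Deps outside this product's config are assumed already loaded.
    for (const dep of byName.get(name)?.dependencies ?? [])
      if (byName.has(dep)) visit(dep);
    visiting.delete(name);
    done.add(name);
    order.push(name);
  }

  for (const m of modules) visit(m.name);
  return order;
}
```

The bootstrap then calls `onRegister` → `onDatabaseSetup` → `onPermissionsSetup` → `onEventsSetup` → `onReady` in that computed order.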
Example: TasksModule Implementation
export class TasksModule implements PlatformModule {
name = 'tasks';
version = '1.0.0';
dependencies = ['core', 'teams'];
onRegister(container: DependencyContainer) {
container.register(TasksService);
container.register(TasksController);
container.register(TaskBoardsController);
}
onDatabaseSetup(migrator: Migrator) {
return migrator.runModuleMigrations('tasks');
}
onPermissionsSetup(engine: PermissionEngine) {
engine.defineNamespace('task', {
relations: {
org: 'organization',
owner: 'user',
assignee: 'user | team#member',
viewer: 'user | team#member | org#member',
editor: 'user | team#member',
},
permissions: {
view: 'viewer + editor + owner + org->admin',
edit: 'editor + owner',
delete: 'owner + org->admin',
assign: 'editor + owner',
}
});
}
onEventsSetup(bus: EventBus) {
bus.subscribe('team.member_removed', (e) => this.handleTeamMemberRemoved(e));
bus.subscribe('contact.deleted', (e) => this.handleContactDeleted(e));
}
}
Per-Product Configuration
import { TasksModule } from '@platform/modules/tasks';
import { ContactsModule } from '@platform/modules/contacts';
import { DocumentsModule } from '@platform/modules/documents';
import { NotesModule } from '@platform/modules/notes';
// import { FinanceModule } from '@platform/modules/finance';
export const platformConfig = {
modules: [
new TasksModule(),
new ContactsModule(),
new DocumentsModule(),
new NotesModule(),
// FinanceModule — not loaded, endpoints don't exist,
// DB tables not created, events not subscribed
],
aws: {
region: 'us-east-1',
accountId: process.env.AWS_ACCOUNT_ID,
},
database: {
writer: process.env.AURORA_WRITER_ENDPOINT,
reader: process.env.AURORA_READER_ENDPOINT,
},
features: {
enableRedlining: true,
enableKanban: true,
maxTeamSize: 50,
}
};
Cross-Module Entity Linking
Universal linking — any entity to any entity. This is the glue that lets teams, tasks, docs, leads all reference each other.
CREATE TABLE entity_links (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_type VARCHAR(50) NOT NULL,
source_id UUID NOT NULL,
target_type VARCHAR(50) NOT NULL,
target_id UUID NOT NULL,
relationship VARCHAR(50) NOT NULL,
org_id UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
UNIQUE(source_type, source_id, target_type, target_id, relationship)
);
EntityLinkService Implementation
class EntityLinkService {
async link(params: {
source: { type: string; id: string };
target: { type: string; id: string };
relationship: string;
orgId: string;
actorId: string;
}) {
// 1. Verify both entities exist (calls respective module)
await this.verifyEntity(params.source);
await this.verifyEntity(params.target);
// 2. Check permission (actor must have 'link' permission on both)
await this.permissionEngine.check(params.actorId, 'link', params.source);
await this.permissionEngine.check(params.actorId, 'link', params.target);
// 3. Create the link
await this.repository.createLink(params);
// 4. Emit event (other modules can react)
await this.eventBus.emit('entity.linked', params);
// 5. Audit
await this.auditLog.record('entity.linked', params);
}
async getLinkedEntities(
source: { type: string; id: string },
targetType: string,
orgId: string
) {
return this.repository.findLinks(source, targetType, orgId);
}
}
Auto-Generated Endpoints
| Endpoint | Description |
|---|---|
POST /api/v1/links | Create a link between any two entities |
DELETE /api/v1/links/:id | Remove a link |
GET /api/v1/teams/:id/linked/documents | Docs linked to a team |
GET /api/v1/teams/:id/linked/tasks | Tasks linked to a team |
GET /api/v1/documents/:id/linked/contacts | Contacts linked to a doc |
GET /api/v1/contacts/:id/linked/* | Everything linked to a contact |
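Since links are symmetric over the `entity_links` table, the per-type routes in the table above can be derived mechanically from the set of registered entity types. A sketch of that derivation; the route strings are illustrative.

```typescript
// Derive the linked-entity route surface: the two generic link endpoints
// plus one GET per ordered (source, target) pair of registered types.
function linkRoutes(entityTypes: string[]): string[] {
  const routes = ["POST /api/v1/links", "DELETE /api/v1/links/:id"];
  for (const source of entityTypes)
    for (const target of entityTypes)
      if (source !== target)
        routes.push(`GET /api/v1/${source}/:id/linked/${target}`);
  return routes;
}
```

This is why adding a module automatically extends the link API: registering a new entity type adds its pairs without any handwritten routes.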
VI. Business Modules
Tasks Module (Optional)
Same data powers Kanban boards, traditional lists, calendar views, and Gantt charts. The view_type on boards determines rendering — not data structure.
-- Boards and columns are created first so the foreign keys in tasks resolve.
CREATE TABLE task_boards (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
name VARCHAR(200),
view_type VARCHAR(50) DEFAULT 'kanban', -- kanban, list, calendar, gantt
config JSONB DEFAULT '{}'
);
CREATE TABLE task_columns (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
board_id UUID REFERENCES task_boards(id),
name VARCHAR(200),
position FLOAT,
config JSONB DEFAULT '{}' -- color, WIP limits
);
CREATE TABLE tasks (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
title VARCHAR(500) NOT NULL,
description TEXT,
status VARCHAR(50) DEFAULT 'todo',
priority VARCHAR(20),
due_date TIMESTAMPTZ,
board_id UUID REFERENCES task_boards(id),
column_id UUID REFERENCES task_columns(id),
position FLOAT, -- fractional indexing for drag-drop
metadata JSONB DEFAULT '{}', -- custom fields, labels, points
assignee_id UUID,
created_by UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
Key Endpoints
| Endpoint | Description |
|---|---|
GET /api/v1/tasks?board_id=X&view=kanban | Returns tasks grouped by column (Trello-style) |
GET /api/v1/tasks?board_id=X&view=list | Returns flat sorted list (traditional view) |
PATCH /api/v1/tasks/:id/move | Reorder or move between columns (fractional indexing) |
POST /api/v1/tasks/:id/subtasks | Create nested/child tasks |
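The fractional indexing behind `PATCH /tasks/:id/move` is small enough to sketch in full: a dropped card's new position is the midpoint of its neighbours, so only one row is updated per drag. The sentinel values (1000, the gap size) are illustrative defaults.

```typescript
// Compute a dropped card's position from its neighbours' positions
// (null = no neighbour on that side). One UPDATE per drag; a periodic
// rebalance (not shown) resets positions before float precision runs out.
function positionBetween(prev: number | null, next: number | null): number {
  if (prev === null && next === null) return 1000; // first card in an empty column
  if (prev === null) return (next as number) / 2;  // dropped at the top
  if (next === null) return prev + 1000;           // dropped at the bottom
  return (prev + next) / 2;                        // dropped between two cards
}
```

Because only the moved task's row changes, concurrent drags by different users rarely conflict.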
Contacts / Leads / CRM (Optional)
Leads and contacts share one table with a type discriminator. A lead converts to a client by changing type.
CREATE TABLE contacts (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
type VARCHAR(20) NOT NULL, -- lead, client, vendor, partner
status VARCHAR(50),
first_name VARCHAR(100),
last_name VARCHAR(100),
email VARCHAR(255),
phone VARCHAR(50),
company VARCHAR(200),
custom_fields JSONB DEFAULT '{}',
pipeline_stage VARCHAR(50),
created_at TIMESTAMPTZ DEFAULT NOW()
);
Documents / Contracts (Optional)
CREATE TABLE documents (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
title VARCHAR(500),
type VARCHAR(50), -- contract, proposal, note, template
content TEXT,
status VARCHAR(50), -- draft, review, sent, signed
version INT DEFAULT 1,
parent_id UUID REFERENCES documents(id),
metadata JSONB DEFAULT '{}',
file_url VARCHAR(1000),
created_by UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
CREATE TABLE document_revisions (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
document_id UUID REFERENCES documents(id),
content TEXT,
changes JSONB, -- diff from previous version
revised_by UUID NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Notifications (Core)
Event-driven. Other modules emit events (task.assigned, document.shared). The notification module subscribes and routes based on user preferences.
CREATE TABLE notifications (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id UUID NOT NULL,
user_id UUID NOT NULL,
type VARCHAR(50),
channel VARCHAR(20), -- in_app, email, push
title VARCHAR(200),
body TEXT,
data JSONB DEFAULT '{}', -- deep linking payload
read_at TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);
Finance (Optional)
| Belongs in Core Module | Too Specific — Custom Per Project |
|---|---|
| Invoices (CRUD, status tracking) | Tax calculation rules |
| Payments received/sent ledger | Industry-specific billing |
| Expense tracking | Payroll |
| Basic reporting (revenue, expenses) | Complex financial modeling |
| Integration hooks (Stripe, QuickBooks) | Accounting standards compliance |
VII. Deployment — Cell Provisioning
One command provisions a full product cell in a target AWS account:
npx cdk deploy ProductCell \
--context productName=acme-crm \
--context modules=tasks,contacts,documents,finance \
--context environment=prod \
--profile product-a-prod-account
What Gets Provisioned
- VPC (3 AZs, private subnets, NAT Gateways)
- Aurora Serverless v2 cluster (writer + reader)
- ElastiCache Redis cluster
- ECS Fargate service (API + Worker)
- S3 buckets (files, backups)
- CloudFront distribution + WAF rules
- CloudWatch dashboards + alarms
- Cross-account roles (CI/CD, logging)
- DNS records in Network Hub account
tasks module → deploys to all products that use it.
VIII. Deployment Strategy — Platform by Platform
No single platform is optimal for every workload. Use the right tool for each layer.
Backend API Servers
| Option | Best For | Trade-off |
|---|---|---|
| ECS Fargate | Steady-state workloads, full VPC control, WebSockets, complex networking | More config, you manage scaling policies |
| AWS App Runner | Simpler APIs, zero-config auto-scaling, small teams | Less VPC control, no WebSocket support |
| Lambda + API GW | Event-driven, spiky traffic, low-traffic modules | Cold starts, 15-min timeout, harder to debug |
Background Workers / Jobs
| Option | Best For |
|---|---|
| ECS Fargate (separate service) | Long-running job processors (BullMQ), always-on |
| Lambda | Short-lived event handlers (S3 triggers, SQS consumers, cron) |
| Step Functions + Lambda | Multi-step workflows (ETL, document processing) |
Frontend Applications
| Option | Best For | Trade-off |
|---|---|---|
| Vercel | Next.js apps, best DX, instant previews, edge SSR | Vendor lock-in, cost at scale, data leaves AWS |
| Cloudflare Pages | Static sites, simple SPAs, global edge, cheap | Limited SSR, no tight AWS integration |
| AWS Amplify Hosting | Next.js/React needing tight AWS integration | Slower builds, weaker DX vs. Vercel |
| CloudFront + S3 | Pure SPAs (React/Vue), full control, cheapest | No SSR, manual cache invalidation |
Frontend Recommendation
- Customer-facing products → Vercel (Next.js with edge SSR, preview deployments per PR)
- Internal tools / admin panels → CloudFront + S3 (static React SPA, cheapest, stays in AWS)
- Architecture docs / marketing → Cloudflare Pages (static HTML, free tier)
The Recommended Mix
EDGE / CDN
Customer Frontends --> Vercel (Next.js, edge SSR)
Internal Tools --> CloudFront + S3 (React SPA)
Architecture Docs --> Cloudflare Pages (static HTML)
|
| HTTPS (API calls)
v
AWS PRODUCT ACCOUNT
API Gateway --> ECS Fargate (Core API, NestJS)
SQS/SNS --> ECS Fargate (BullMQ Worker)
S3 Events --> Lambda (file processing)
Scheduled --> Lambda (cron jobs, cleanup)
ETL --> Step Functions + Lambda
|
Aurora PostgreSQL | Redis | S3
Why NOT All-Serverless for Backend
| Concern | Vercel / Lambda Serverless | ECS Fargate |
|---|---|---|
| Timeout | 10–300s (Vercel, plan-dependent); 15-min max (Lambda) | Unlimited |
| Cold starts | 100-500ms per invocation | None (always running) |
| DB connections | New connection per invocation (exhausts Postgres connections without RDS Proxy) | Connection pool (Prisma/Drizzle) |
| WebSockets | Not supported | Full support |
| Background jobs | No long-running workers | BullMQ workers |
| VPC access | Not possible (Vercel) / complex (Lambda) | Native (same VPC as Aurora/Redis) |
| Cost at scale | Unpredictable (per-invocation) | Predictable (per-container) |
How Vercel Works with This Architecture
Vercel is the frontend deployment platform only. It does NOT run business logic.
User's Browser
|
v
Vercel (Frontend ONLY)
Next.js App
| fetch('https://api.yourproduct.com/v1/tasks')
|
v HTTPS
AWS (Your Backend)
CloudFront --> API Gateway --> ECS (NestJS on Fargate)
|
Aurora | Redis | S3
- Server Components / SSR: Vercel renders HTML by calling your ECS API, sends finished page to browser
- Client-side fetching: Browser calls your ECS API directly — Vercel not involved
- Next.js API routes: Only for thin proxies (OAuth callbacks, cookie handling) — never business logic
- WebSockets: Browser connects directly to ECS — Vercel can't handle these
Per-Environment Strategy
| Environment | Backend | Frontend |
|---|---|---|
| Dev | ECS Fargate (min capacity, Aurora at 0.5 ACU) | Vercel preview deployments (auto per PR) |
| Staging | ECS Fargate (mirrors prod, smaller scale) | Vercel staging branch |
| Prod | ECS Fargate (auto-scaling 2-10 tasks, multi-AZ) | Vercel production |
IX. Observability — SRE Approach
SLO-Driven, Not Alert-Driven
| SLO | Target | Budget |
|---|---|---|
| API Availability | 99.9% | 43 min downtime/month |
| API Latency (p99) | < 500ms | — |
| API Latency (p50) | < 100ms | — |
| Data Durability | 99.999999999% | Aurora handles this |
Three Pillars
| Pillar | Tool | Purpose |
|---|---|---|
| Logs | CloudWatch → OpenSearch | Structured JSON, cross-account |
| Metrics | CloudWatch + EMF | Business + infra metrics |
| Traces | X-Ray / OpenTelemetry | End-to-end request tracing |
Every request gets a correlation ID flowing through logs, metrics, and traces.
X. Tech Stack Summary
| Concern | Choice |
|---|---|
| Backend | NestJS (TypeScript) — module system built-in |
| ORM | Prisma or Drizzle — type-safe, migrations |
| Database | Aurora PostgreSQL Serverless v2 |
| Cache / Queue | Redis (ElastiCache) + BullMQ |
| Auth | AWS Cognito or Keycloak |
| Permissions | SpiceDB / OpenFGA (Zanzibar) |
| File Storage | AWS S3 + presigned URLs |
| Search | PG full-text → OpenSearch later |
| Events | AWS SNS/SQS or Redis Streams |
| Deployment | ECS Fargate + CDK for IaC |
| Frontend | Next.js |
| Monorepo | Turborepo or Nx |
| CI/CD | GitHub Actions → CDK Pipelines |
| Observability | CloudWatch + X-Ray + OpenSearch |
XI. Build Order
Module Dependency Map
Core (always loaded):
├── Auth
├── Users
├── Organizations
├── Teams
├── Permissions (Zanzibar)
├── Entity Links (cross-module glue)
├── Notifications (event-driven)
├── Files
├── Audit Log
└── Search
Optional (plug in per product):
├── Tasks → depends on Core
├── Contacts → depends on Core
├── Documents → depends on Core, Files
├── Notes → depends on Core
├── Finance → depends on Core, Contacts
└── [Custom] → depends on Core
Recommended Sequence
- Monorepo scaffold — Turborepo + NestJS + shared types package
- Core kernel — Auth, Users, Orgs, Teams, Permissions, Entity Links
- Event bus + notifications
- Tasks — first optional module, proves the architecture
- Contacts/Leads, Documents, Notes
- Finance — most domain-specific, last
- CDK infrastructure — multi-account provisioning
- CI/CD pipeline — automated deploy across accounts
XII. Module Update & Versioning Strategy
How to ship changes to shared infrastructure without breaking products already running on it.
Scenario 1 — Updating an Existing Core Module
Say you update Auth to add MFA, change a JWT claim, or refactor the permission middleware. Every product using that module gets the change. If it breaks, everything breaks.
Solution: Semantic Versioning + Changesets
| Version Bump | When | Rollout |
|---|---|---|
patch (1.2.0 → 1.2.1) | Bug fix | Auto-deploy to all products |
minor (1.2.0 → 1.3.0) | New backwards-compatible feature | Auto-deploy, products adopt when ready |
major (1.x → 2.0.0) | Breaking change | Each product opts in on its own schedule |
Breaking changes live in a parallel package until products migrate:
packages/core/auth/ ← v1.2.0 (Product A still here)
packages/core/auth-v2/ ← v2.0.0 (Product B migrated)
Use Changesets to manage this in the monorepo:
# Developer describes what changed
npx changeset add
# → "auth: added MFA, new required field mfa_enabled on users table"
# → type: minor
# CI bumps versions and generates changelogs
npx changeset version
In CI, every PR to a shared module runs tests against every product that uses it:
# .github/workflows/test.yml
test-all-products:
- test product-a against updated auth module
- test product-b against updated auth module
- test product-c against updated auth module
# If any fail → PR is blocked
Scenario 2 — Adding a New Module to an Existing Product
The easy case. Just add it to platform.config.ts:
// Before
modules: [new TasksModule(), new ContactsModule()]
// After — Finance added
modules: [new TasksModule(), new ContactsModule(), new FinanceModule()]
On next cdk deploy: Finance migrations run automatically, new routes register, event subscriptions set up. Nothing else changes — existing data is untouched.
Scenario 3 — Database Schema Changes
The hardest problem. Change a core table and every product on that schema is affected.
Rule 1: Migrations are forwards-only and backwards-compatible
Never rename or delete a column in a single migration. Always do it in phases:
-- Phase 1 (deploy now): Add new column alongside old
ALTER TABLE users ADD COLUMN display_name VARCHAR(200);
-- Application writes to BOTH columns during transition
-- Phase 2 (next deploy): Backfill
UPDATE users SET display_name = name WHERE display_name IS NULL;
-- Phase 3 (after confirming): Drop old column
ALTER TABLE users DROP COLUMN name;
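During the window between Phase 1 and Phase 3, application code has to dual-write and read with a fallback, so either deploy order (and a rollback) is safe. A minimal sketch following the `name` → `display_name` example above; the row shape is illustrative.

```typescript
// Dual-write window for the 3-phase migration: write BOTH columns so a
// rollback to old code loses nothing; read the new column with a fallback
// to the old one for rows the backfill hasn't reached yet.
type UserRow = { name: string | null; display_name: string | null };

function writeUserName(value: string): UserRow {
  return { name: value, display_name: value }; // Phases 1–2: both columns
}

function readUserName(row: UserRow): string | null {
  return row.display_name ?? row.name; // prefer new, fall back to old
}
```

Once Phase 3 drops the old column, the dual-write and fallback are deleted in a follow-up code deploy.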
Rule 2: Migrations run automatically on deploy
async onDatabaseSetup(migrator: Migrator) {
await migrator.runModuleMigrations('auth');
// Idempotent — safe to run multiple times
}
Rule 3: Separate migration deploys from code deploys
- Day 1 — Deploy migration (add new column, keep old)
- Day 2 — Deploy code that uses new column
- Day 3 — Deploy migration to drop old column after confirming
Scenario 4 — API Contract Changes
Backwards-compatible (safe anytime): adding endpoints, adding optional fields, adding optional params. Frontends that don't know about new fields simply ignore them.
Breaking changes — use API versioning:
/api/v1/tasks ← old behavior, still works
/api/v2/tasks ← new behavior
Both run simultaneously. Products on v1 keep working. v1 is deprecated with a sunset date announced in advance.
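Running both versions simultaneously is just two handlers mounted under different prefixes. A sketch; the handler bodies, the deprecation flag, and the sunset date are placeholders.

```typescript
// v1 and v2 mounted side by side: the version prefix selects behavior,
// and v1 stays live but carries a deprecation marker until its sunset.
type Handler = () => { version: number; deprecated?: string };

const routes = new Map<string, Handler>([
  ["/api/v1/tasks", () => ({ version: 1, deprecated: "sunset date announced in advance" })],
  ["/api/v2/tasks", () => ({ version: 2 })],
]);

function dispatch(path: string) {
  const handler = routes.get(path);
  if (!handler) throw new Error("404");
  return handler();
}
```

In practice the deprecation marker would surface as a response header (e.g. `Deprecation`/`Sunset`) so clients can detect it programmatically.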
Overall Governance Model
PLATFORM REPO (shared modules)
Core changes go through:
1. PR with changeset label (patch / minor / major)
2. Required review from platform team
3. Automated tests across ALL products in CI
4. Staged rollout: Dev → Staging → Prod
Breaking changes (major):
5. Migration guide published
6. Each product team opts in on their own schedule
7. Old version supported for defined deprecation window
Decision Matrix
| Change Type | Strategy |
|---|---|
| Bug fix in core module | Patch version, auto-deploy everywhere |
| New optional feature in core | Minor version, auto-deploy, products adopt when ready |
| Breaking change to core | Major version, products opt in independently |
| New module added to a product | Additive — deploy freely |
| DB schema change (additive) | Run migration, deploy code |
| DB schema change (breaking) | 3-phase migration (add → backfill → drop) |
| API contract change | Version the endpoint (/v1 → /v2), both run simultaneously |
XIII. Standard vs. This Architecture
| Concern | Standard Approach | This Architecture |
|---|---|---|
| Multi-tenancy | Shared DB, app-level filtering | Row-Level Security at DB + Zanzibar permission model |
| Account isolation | One AWS account, IAM policies | Multi-account with SCPs, blast radius isolation |
| Security | JWT + middleware | Zero Trust, 5 layers, Zanzibar RBAC/ABAC |
| Database | Standard PostgreSQL | Aurora Serverless v2 (up to 3x faster, auto-scaling, multi-AZ) |
| Migration | Manual scripts | DMS with CDC (zero-downtime from any legacy DB) |
| Modularity | Feature flags | True module system with deps, migrations, permissions, events |
| Cross-module | Hardcoded foreign keys | Universal entity linking (any-to-any) |
| Deployment | Manual setup per project | One CDK command provisions a full product cell |
| Reliability | Alerts on errors | SLO-driven error budgets (Google SRE model) |
| Observability | Logs only | Correlated logs + metrics + traces |