I. Foundational Design Philosophy

System Overview

                    ┌─────────────┐
                    │  CloudFront  │
                    │   (CDN)      │
                    └──────┬──────┘
                           │
              ┌────────────┴────────────┐
              │                         │
     ┌────────▼────────┐    ┌──────────▼──────────┐
     │  Frontend App   │    │   API Gateway        │
     │  (Next.js/React)│    │  (Auth, Rate Limit)  │
     └─────────────────┘    └──────────┬───────────┘
                                       │
                          ┌────────────┴────────────┐
                          │     Core API Service     │
                          │  (NestJS / FastAPI)      │
                          │                          │
                          │  ┌─────────────────────┐ │
                          │  │  Module: Auth        │ │
                          │  │  Module: Teams       │ │
                          │  │  Module: Tasks       │ │
                          │  │  Module: Contacts    │ │
                          │  │  Module: Documents   │ │
                          │  │  Module: Leads       │ │
                          │  │  Module: Notes       │ │
                          │  │  Module: Notifs      │ │
                          │  │  Module: Finance     │ │
                          │  └─────────────────────┘ │
                          └────────────┬─────────────┘
                                       │
                    ┌──────────────────┼──────────────────┐
                    │                  │                   │
             ┌──────▼──────┐   ┌──────▼──────┐   ┌───────▼──────┐
             │  PostgreSQL │   │    Redis     │   │  S3 / Files  │
             │  (Aurora)   │   │  (Cache +    │   │              │
             │             │   │   Queues)    │   │              │
             └─────────────┘   └─────────────┘   └──────────────┘

1. Cell-Based Architecture

Google's approach to shared-infrastructure isolation.

Every deployment is a "cell" — an independent, hermetically sealed unit containing the full stack. Each product is a cell. Cells share nothing at runtime but share the same codebase modules. A failure in one cell cannot cascade to another.

This is how Google runs Gmail, Maps, YouTube on shared infrastructure without shared fate.

2. Zero Trust Security Model

Based on NIST 800-207.

No implicit trust. Every request is authenticated, authorized, and encrypted — even internal service-to-service calls. Network location (VPC, subnet) grants zero privilege. Identity is the only perimeter.

3. AWS Multi-Account Isolation

AWS Well-Architected Framework.

Separate AWS accounts are hard security boundaries. Each concern gets its own blast radius. You don't share accounts between products.

II. AWS Multi-Account Strategy

This is the single most important infrastructure decision.

Account Topology

AWS Organization (Root)
│
├── Management Account (billing, SCPs, Organization policies ONLY)
│   └── No workloads ever run here
│
├── OU: Security
│   ├── Security Tooling Account
│   │   ├── GuardDuty delegated admin
│   │   ├── Security Hub aggregator
│   │   ├── CloudTrail organization trail (immutable S3)
│   │   ├── AWS Config aggregator
│   │   └── IAM Access Analyzer
│   │
│   └── Log Archive Account
│       ├── Centralized CloudWatch Logs
│       ├── CloudTrail logs (write-once, read-many)
│       ├── VPC Flow Logs & S3 access logs
│       └── Retention: 7 years (compliance)
│
├── OU: Shared Services
│   ├── Network Hub Account
│   │   ├── Transit Gateway (hub-and-spoke)
│   │   ├── Route 53 Hosted Zones
│   │   ├── AWS Certificate Manager
│   │   └── VPN / Direct Connect termination
│   │
│   ├── Shared Services Account
│   │   ├── ECR (container registry)
│   │   ├── Artifact stores (npm, pip)
│   │   ├── Cognito / Keycloak (IdP)
│   │   └── Secrets Manager
│   │
│   └── CI/CD Account
│       ├── GitHub Actions self-hosted runners
│       ├── CDK Pipelines (deploys via cross-account roles)
│       └── Artifact signing (cosign / Sigstore)
│
├── OU: Workloads
│   ├── Product A — Dev / Staging / Prod (3 accounts)
│   ├── Product B — Dev / Staging / Prod (3 accounts)
│   └── ... (each new product gets 3 accounts)
│
└── OU: Sandbox
    └── Developer Sandbox Accounts

Why This Matters

  • Blast radius isolation — a misconfigured IAM policy in Product A's dev cannot touch Product B's prod. Hard AWS boundary.
  • Cost attribution — each product's AWS bill is isolated automatically.
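  • Compliance — the Security OU is locked down with SCPs. Even a member account's root user can't delete logs or disable GuardDuty. SCPs do not restrict the management account, which is one more reason no workloads run there.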
  • Modularity — new product = one CDK script → 3 accounts with standard config. Minutes.

Cross-Account Access Patterns

CI/CD Account                    Product A Prod Account
┌──────────────┐                ┌──────────────────────┐
│ CDK Pipeline │───AssumeRole──▶│ DeploymentRole       │
│              │   (cross-acct) │ (ECS, RDS, S3 only)  │
└──────────────┘                └──────────────────────┘

Shared Services Account         Product A Prod Account
┌──────────────┐                ┌──────────────────────┐
│ Cognito      │◀──────────────│ API Gateway validates │
│ (IdP)        │  JWT issued    │ JWT via JWKS endpoint │
└──────────────┘                └──────────────────────┘

III. Security Architecture

Defense in depth — five layers from edge to data.

Layer 1: Edge — CloudFront + WAF

Internet → CloudFront (TLS 1.3 only)
              │
              ├── AWS WAF v2
              │   ├── Managed Rules (OWASP Top 10)
              │   ├── Rate limiting (2000 req/5min per IP)
              │   ├── Geo-blocking
              │   ├── Bot Control
              │   └── Custom rules (SQLi, XSS)
              │
              └── AWS Shield Advanced (DDoS)

Layer 2: API Gateway

  • Request validation (JSON Schema)
  • Mutual TLS for service-to-service
  • Usage plans + API keys for external consumers
  • Request/response logging → Log Archive Account
  • Lambda Authorizer or Cognito Authorizer

Layer 3: Application — Zero Trust Pipeline

Every request — even internal — goes through:

Request
  → TLS termination (ALB)
  → JWT verification (signature + expiry + audience + issuer)
  → Tenant extraction (org_id from token claims)
  → Permission evaluation (RBAC + ABAC)
  → Rate limiting (per-user, per-tenant, per-endpoint)
  → Input validation (Zod schemas, strict mode)
  → Audit logging (who, what, when, from where)
  → Business logic
  → Output sanitization (strip internal fields)
  → Response
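
The JWT verification and tenant extraction steps above can be sketched as a pure function over the decoded payload. This is a minimal sketch, assuming the signature has already been verified by a JWT library and that the tenant travels in an `org_id` claim as the pipeline describes. The names `VerifiedClaims`, `ExpectedContext`, and `validateClaims` are illustrative and not part of the platform.

```typescript
// Claims as they appear in a decoded, signature-verified JWT payload.
interface VerifiedClaims {
  iss: string;
  aud: string | string[];
  exp: number; // seconds since epoch
  sub: string;
  org_id?: string; // tenant claim (assumed name, taken from the pipeline above)
}

interface ExpectedContext {
  issuer: string;
  audience: string;
  clockSkewSeconds?: number;
}

type ClaimResult =
  | { ok: true; userId: string; orgId: string }
  | { ok: false; reason: string };

// Checks issuer, audience, and expiry, then extracts the tenant.
function validateClaims(
  claims: VerifiedClaims,
  expected: ExpectedContext,
  nowSeconds: number,
): ClaimResult {
  const skew = expected.clockSkewSeconds ?? 0;
  if (claims.iss !== expected.issuer) return { ok: false, reason: "issuer mismatch" };
  const audiences = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (audiences.indexOf(expected.audience) === -1) {
    return { ok: false, reason: "audience mismatch" };
  }
  if (claims.exp + skew <= nowSeconds) return { ok: false, reason: "token expired" };
  if (!claims.org_id) return { ok: false, reason: "missing org_id claim" };
  return { ok: true, userId: claims.sub, orgId: claims.org_id };
}
```

Passing the clock in as `nowSeconds` keeps the function deterministic, which makes the expiry rule easy to unit test.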

Permission Model — Google Zanzibar

Relationship-based access control at scale. Permissions are stored as tuples and checks are graph traversals.

document:doc_123#viewer@user:alice
document:doc_123#editor@team:engineering#member
team:engineering#member@user:bob
org:acme#admin@user:carol

// "Can Bob edit doc_123?"
// → doc_123#editor includes team:engineering#member
// → team:engineering#member includes user:bob
// → YES

Use SpiceDB or OpenFGA (open-source Zanzibar). Every module calls the permission service. No module implements its own auth logic.

CREATE TABLE permission_tuples (
    id             UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    namespace      VARCHAR(100) NOT NULL,
    object_id      VARCHAR(200) NOT NULL,
    relation       VARCHAR(100) NOT NULL,
    subject_ns     VARCHAR(100) NOT NULL,
    subject_id     VARCHAR(200) NOT NULL,
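    subject_rel    VARCHAR(100) NOT NULL DEFAULT '',  -- '' = direct subject; a NULL here would let duplicates past the UNIQUE constraint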
    created_at     TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(namespace, object_id, relation, subject_ns, subject_id, subject_rel)
);

CREATE INDEX idx_perm_object  ON permission_tuples(namespace, object_id);
CREATE INDEX idx_perm_subject ON permission_tuples(subject_ns, subject_id);
CREATE INDEX idx_perm_check   ON permission_tuples(namespace, object_id, relation);
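
The "Can Bob edit doc_123?" walk-through above can be sketched as a small in-memory traversal. This is only an illustration of how tuples resolve, assuming direct subjects and usersets as the two subject kinds. SpiceDB and OpenFGA add schema-level permission rewrites, caching, and consistency guarantees that this sketch does not attempt. The `RelationTuple` shape and `check` function are hypothetical names.

```typescript
// A subject is either a concrete user or a userset ("everyone with <relation> on <object>").
type Subject =
  | { kind: "user"; id: string }
  | { kind: "userset"; object: string; relation: string };

interface RelationTuple {
  object: string;   // e.g. "document:doc_123"
  relation: string; // e.g. "editor"
  subject: Subject;
}

// The four tuples from the example above.
const tuples: RelationTuple[] = [
  { object: "document:doc_123", relation: "viewer", subject: { kind: "user", id: "alice" } },
  { object: "document:doc_123", relation: "editor", subject: { kind: "userset", object: "team:engineering", relation: "member" } },
  { object: "team:engineering", relation: "member", subject: { kind: "user", id: "bob" } },
  { object: "org:acme", relation: "admin", subject: { kind: "user", id: "carol" } },
];

// Depth-first traversal: does `userId` hold `relation` on `object`, directly or via a userset?
function check(
  store: RelationTuple[],
  object: string,
  relation: string,
  userId: string,
  visited: Set<string> = new Set(),
): boolean {
  const key = `${object}#${relation}`;
  if (visited.has(key)) return false; // guard against cyclic usersets
  visited.add(key);

  for (const t of store) {
    if (t.object !== object || t.relation !== relation) continue;
    if (t.subject.kind === "user") {
      if (t.subject.id === userId) return true;
    } else if (check(store, t.subject.object, t.subject.relation, userId, visited)) {
      return true;
    }
  }
  return false;
}

const bobCanEdit = check(tuples, "document:doc_123", "editor", "bob");     // true, via team:engineering#member
const aliceCanEdit = check(tuples, "document:doc_123", "editor", "alice"); // false, alice is only a viewer
```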

Layer 4: Data Security

Row-Level Security — even if app code has a bug, the database enforces tenant isolation:

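ALTER TABLE tasks ENABLE ROW LEVEL SECURITY;
-- Without FORCE, the table owner bypasses the policy
ALTER TABLE tasks FORCE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON tasks
    USING (org_id = current_setting('app.current_org_id')::UUID);

-- Set at the start of every transaction. SET LOCAL resets at COMMIT/ROLLBACK,
-- so a pooled connection never carries one tenant's setting into the next request.
BEGIN;
SET LOCAL app.current_org_id = '00000000-0000-0000-0000-000000000001';  -- placeholder org UUID
SELECT * FROM tasks;  -- returns only that org's rows
COMMIT;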
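
On the application side, the tenant setting used by the policy above has to be applied on the same connection and transaction as the queries it protects. The following is a minimal sketch of such a wrapper, assuming a client whose `query(sql, params)` method returns a promise (node-postgres clients have this shape). `Queryable` and `withTenant` are hypothetical names, and `set_config(..., true)` is used because it is the parameterizable equivalent of `SET LOCAL`.

```typescript
// Minimal shape of a database client; a node-postgres client is compatible with it.
interface Queryable {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Runs `work` inside a transaction with app.current_org_id set for that transaction only.
async function withTenant<T>(
  client: Queryable,
  orgId: string,
  work: (client: Queryable) => Promise<T>,
): Promise<T> {
  if (!UUID_RE.test(orgId)) throw new Error("orgId must be a UUID");
  await client.query("BEGIN");
  try {
    // set_config(name, value, is_local = true) is scoped to the current transaction
    await client.query("SELECT set_config('app.current_org_id', $1, true)", [orgId]);
    const result = await work(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  }
}
```

With a connection pool, the `client` passed in should be a single checked-out connection rather than the pool itself, so that every statement runs on the connection that holds the setting.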

  • Encryption: AES-256 at rest, TLS 1.3 in transit, app-level encryption for PII via AWS KMS (per-tenant keys)
  • Data classification: columns tagged PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED. Serializers auto-strip based on clearance.
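
The clearance-based stripping mentioned in the last bullet could look roughly like the sketch below. It assumes the four tags form a strict ordering and that each entity has a field-to-tag map. `FieldPolicy` and `stripByClearance` are hypothetical names, and treating untagged fields as RESTRICTED is a design choice of this sketch rather than something the platform specifies.

```typescript
// Ordered from least to most sensitive, matching the four tags above.
const LEVELS = ["PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED"] as const;
type Classification = (typeof LEVELS)[number];

// Hypothetical per-entity map: field name -> classification tag.
type FieldPolicy = Record<string, Classification>;

// Returns a copy of `row` containing only fields at or below the caller's clearance.
function stripByClearance(
  row: Record<string, unknown>,
  policy: FieldPolicy,
  clearance: Classification,
): Record<string, unknown> {
  const maxLevel = LEVELS.indexOf(clearance);
  const out: Record<string, unknown> = {};
  for (const field of Object.keys(row)) {
    // Untagged fields are treated as RESTRICTED so that new columns fail closed.
    const tag: Classification = policy[field] ?? "RESTRICTED";
    if (LEVELS.indexOf(tag) <= maxLevel) out[field] = row[field];
  }
  return out;
}

const contactPolicy: FieldPolicy = { id: "PUBLIC", email: "INTERNAL", tax_id: "RESTRICTED" };
const contactRow = { id: "c1", email: "a@example.com", tax_id: "123", internal_note: "untagged" };
const forInternalUser = stripByClearance(contactRow, contactPolicy, "INTERNAL");
// forInternalUser keeps id and email; tax_id and the untagged internal_note are dropped
```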

Layer 5: Supply Chain Security

  • Container images scanned with Trivy/Snyk on every build
  • Dependency audit on every PR
  • Image signing with cosign — ECS only runs signed images
  • SBOM generation for every release

IV. Database Architecture

Aurora PostgreSQL — Serverless v2

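Feature | Benefit
Higher throughput than standard PG | AWS cites up to 3x; purpose-built distributed storage engine with 6-way replication
Serverless v2 | Scales from 0.5 ACUs up to the configured maximum (128 ACUs here) in seconds. Dev costs stay low.
Up to 15 read replicas | Replica lag is typically well under 100 ms. Route dashboards/reports to replicas.
Global Database | Multi-region replication, typically <1s lag
PostgreSQL-compatible | Standard drivers, ORMs, and tools work; extensions are limited to the set Aurora supports
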
               Write Path                 Read Path
                   │                          │
                   ▼                          ▼
            ┌──────────────┐        ┌──────────────────┐
            │ Aurora Writer │        │ Aurora Reader x3  │
            │  (Primary)    │        │  (Auto-scaling)   │
            └──────┬───────┘        └────────┬─────────┘
                   │                         │
                   ▼                         ▼
            ┌──────────────────────────────────┐
            │  Aurora Storage (distributed,     │
            │  6-way replicated, auto-healing)  │
            └──────────────────────────────────┘

      ┌──────────────────────────────────────────┐
      │            Redis Cluster                  │
      │  Sessions  │  Query Cache  │  Rate Limits │
      └──────────────────────────────────────────┘

Caching Strategy — Cache-Aside with Invalidation

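async function getTask(taskId: string, orgId: string): Promise<Task | null> {
  const cacheKey = `task:${orgId}:${taskId}`;

  // 1. Try the cache first
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached) as Task;

  // 2. Cache miss: read from a replica (assumes a pg-style client that returns { rows })
  const { rows } = await auroraReader.query(
    'SELECT * FROM tasks WHERE id = $1 AND org_id = $2',
    [taskId, orgId]
  );
  const task: Task | undefined = rows[0];
  if (!task) return null; // not found: nothing to cache

  // 3. Populate the cache with a 5-minute TTL
  await redis.setex(cacheKey, 300, JSON.stringify(task));
  return task;
}

async function updateTask(taskId: string, orgId: string, data: Partial<Task>): Promise<void> {
  await auroraWriter.query(/* ... */);
  // Invalidate after the write. A read that hits a lagging replica right after
  // this can re-cache the old row; the TTL bounds how long that stale copy lives.
  await redis.del(`task:${orgId}:${taskId}`);
  await eventBus.emit('task.updated', { taskId, orgId, changes: data });
}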

Data Migration from Legacy Systems

Legacy System                 AWS
┌──────────────┐           ┌───────────────────────────┐
│ SQL Server   │           │  DMS Replication Instance  │
│ Oracle       │──DMS ────▶│  ├── Full Load (bulk)      │
│ MySQL 5.x    │  (CDC)    │  └── CDC (continuous)      │
│ MongoDB      │           │         │                  │
└──────────────┘           │         ▼                  │
                           │  ┌──────────────┐         │
                           │  │  Aurora PG    │         │
                           │  └──────────────┘         │
                           └───────────────────────────┘

Migration Steps

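  1. SCT — Schema Conversion Tool analyzes the legacy schema, converts most of it to PG DDL automatically, and flags the remainder (typically stored procedures and vendor-specific types) for manual conversion
  2. DMS Full Load — bulk copies all data (handles type conversions, encoding)
  3. DMS CDC — continuous replication while legacy still runs, so downtime is limited to the cutover window
  4. Validation — row counts + data integrity verification
  5. Cutover — set legacy to read-only, wait for CDC lag to reach zero, stop CDC, then flip DNS to Aurora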
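
The row-count part of the validation step can be sketched as a comparison of per-table counts collected from both sides. This is a minimal sketch, assuming the counts have already been gathered (for example with `SELECT COUNT(*)` per table). `RowCounts`, `CountMismatch`, and `compareRowCounts` are hypothetical names.

```typescript
type RowCounts = Record<string, number>; // table name -> row count

interface CountMismatch {
  table: string;
  source: number | null; // null = table missing on that side
  target: number | null;
}

// Returns every table whose counts differ, including tables present on only one side.
function compareRowCounts(source: RowCounts, target: RowCounts): CountMismatch[] {
  const names = Object.keys({ ...source, ...target }).sort();
  const mismatches: CountMismatch[] = [];
  for (const table of names) {
    const s = table in source ? source[table] : null;
    const t = table in target ? target[table] : null;
    if (s !== t) mismatches.push({ table, source: s, target: t });
  }
  return mismatches;
}

const legacyCounts: RowCounts = { users: 100, tasks: 250, notes: 10 };
const auroraCounts: RowCounts = { users: 100, tasks: 249 };
const report = compareRowCounts(legacyCounts, auroraCounts);
// report lists "notes" (missing in target) and "tasks" (250 vs 249)
```

Matching counts do not prove the values match, so a check like this complements rather than replaces the data integrity verification in step 4.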

Non-Database Legacy Data (CSV, Excel, XML, APIs)

Source (S3 upload) → Step Function
    ├── Validate schema
    ├── Transform (dates, currencies, encodings)
    ├── Deduplicate
    ├── Map to target schema (configurable)
    ├── Batch insert into Aurora
    ├── Generate migration report
    └── Notify (success/failure + row counts)
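
The transform and deduplicate stages of the pipeline above can be sketched as two small pure functions. This is only an illustration: the `RawContact` and `CleanContact` shapes, the DD/MM/YYYY source date format, and deduplicating on normalized email are all assumptions made for the example, not details of the platform.

```typescript
interface RawContact { email: string; name: string; signup_date: string } // signup_date as DD/MM/YYYY (assumed)
interface CleanContact { email: string; name: string; signupDate: string } // ISO 8601 date

// Transform: normalize email case and whitespace, convert DD/MM/YYYY to YYYY-MM-DD.
function transformContact(row: RawContact): CleanContact {
  const m = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(row.signup_date.trim());
  if (!m) throw new Error(`Unrecognised date: ${row.signup_date}`);
  return {
    email: row.email.trim().toLowerCase(),
    name: row.name.trim(),
    signupDate: `${m[3]}-${m[2]}-${m[1]}`,
  };
}

// Deduplicate: keep the first row seen for each key.
function dedupeBy<T>(rows: T[], key: (row: T) => string): T[] {
  const seen = new Set<string>();
  return rows.filter((row) => {
    const k = key(row);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

const rawRows: RawContact[] = [
  { email: " Ada@Example.com ", name: "Ada", signup_date: "03/11/2021" },
  { email: "ada@example.com", name: "Ada L.", signup_date: "04/11/2021" },
  { email: "bob@example.com", name: "Bob", signup_date: "15/01/2022" },
];
const cleanRows = dedupeBy(rawRows.map(transformContact), (c) => c.email);
// cleanRows holds two contacts: the first Ada row and Bob
```

Running the transform before deduplication matters here, because the two Ada rows only collide once their emails are normalized.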

V. Application — Module System

Core Services (Always Present)

Module | Responsibility
Auth | Signup, login, logout, password reset, MFA, OAuth, session management
Users | Profiles, preferences, avatars
Organizations | Multi-tenancy, org settings, billing tier
Teams | Create teams, add/remove members, team roles
Permissions | Zanzibar RBAC + ABAC, permission checks as middleware
Entity Links | Universal cross-module linking (any entity to any entity)
Notifications | In-app, email, push, webhooks (event-driven)
Files / Media | S3 upload/download, presigned URLs, file metadata
Audit Log | Immutable record of who did what, when (enterprise compliance)
Search | Full-text search via PG tsvector → OpenSearch when needed
Settings / Config | Feature flags, app config, per-tenant configuration

Core Infrastructure Services

Service | Technology | Purpose
Event Bus | SNS/SQS or Redis Streams | Modules communicate via events, not direct calls. When a task is created, an event fires and the notification module picks it up.
Job Queue | BullMQ (Redis-backed) | Background jobs: emails, report generation, data exports, scheduled tasks
Caching Layer | Redis (ElastiCache) | Session data, frequently accessed data, rate limiting counters
Logging & Monitoring | CloudWatch + structured JSON | Centralized, queryable logs. Optionally Datadog or Grafana.
Health Checks | Per-module /health | Every module exposes a health endpoint for load balancers and orchestration

Monolith-first, modular-ready. Start as a well-structured modular monolith (one deployable, many internal modules). Extract into microservices only when a specific module needs independent scaling. This avoids premature complexity.
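
Inside a modular monolith, "events, not direct calls" can start as an in-process bus behind the same interface that SNS/SQS or Redis Streams would later implement. The sketch below is a minimal in-memory version under that assumption. `InMemoryEventBus` is a hypothetical name, and a production bus would add persistence, retries, and dead-lettering.

```typescript
type Handler<T> = (payload: T) => void | Promise<void>;

class InMemoryEventBus {
  private handlers = new Map<string, Handler<unknown>[]>();

  subscribe<T>(event: string, handler: Handler<T>): void {
    const list = this.handlers.get(event) ?? [];
    list.push(handler as Handler<unknown>);
    this.handlers.set(event, list);
  }

  // Delivers to every subscriber in order; one failing handler does not stop the others.
  // Resolves to the number of handlers that completed without throwing.
  async emit<T>(event: string, payload: T): Promise<number> {
    const list = this.handlers.get(event) ?? [];
    let delivered = 0;
    for (const handler of list) {
      try {
        await handler(payload);
        delivered++;
      } catch {
        // a real bus would log the failure or route it to a dead-letter queue
      }
    }
    return delivered;
  }
}

// The tasks module emits; the notifications module reacts without being called directly.
const bus = new InMemoryEventBus();
const inbox: string[] = [];
bus.subscribe<{ taskId: string }>("task.created", (e) => {
  inbox.push(`notify:${e.taskId}`);
});
```

Because neither module imports the other, swapping the in-memory bus for a networked one later changes the wiring but not the module code.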

Monorepo Structure

platform/
├── packages/
│   ├── core/                     # Shared kernel — NEVER optional
│   │   ├── auth/                 # JWT, sessions, OAuth
│   │   ├── iam/                  # Zanzibar permission engine
│   │   ├── tenancy/              # Org isolation, RLS
│   │   ├── events/               # Event bus abstraction
│   │   ├── storage/              # S3 abstraction
│   │   ├── notifications/        # Multi-channel engine
│   │   ├── audit/                # Immutable audit log
│   │   ├── search/               # Full-text search
│   │   ├── migrations/           # ETL framework
│   │   └── common/               # DTOs, validators, errors
│   │
│   ├── modules/                  # Optional business modules
│   │   ├── teams/
│   │   ├── tasks/
│   │   ├── contacts/
│   │   ├── documents/
│   │   ├── notes/
│   │   ├── finance/
│   │   └── [custom]/
│   │
│   ├── sdk/                      # Auto-generated TS SDK
│   └── ui/                       # Shared UI components
│
├── apps/
│   ├── api/                      # Deployable API server
│   ├── worker/                   # Background jobs
│   └── web/                      # Next.js frontend
│
├── infra/                        # AWS CDK
│   ├── lib/
│   │   ├── account-baseline.ts
│   │   ├── networking.ts
│   │   ├── database.ts
│   │   ├── compute.ts
│   │   ├── cdn.ts
│   │   ├── security.ts
│   │   └── observability.ts
│   └── bin/
│       ├── deploy-shared-services.ts
│       └── deploy-product.ts
│
└── tools/
    ├── migrate/                  # Data migration CLI
    ├── scaffold/                 # New module generator
    └── sdk-gen/                  # OpenAPI → SDK

Module Registration

interface PlatformModule {
  name: string;
  version: string;
  dependencies: string[];

  onRegister(container: DependencyContainer): void;
  onDatabaseSetup(migrator: Migrator): Promise<void>;
  onPermissionsSetup(engine: PermissionEngine): void;
  onEventsSetup(bus: EventBus): void;
  onReady(): Promise<void>;
  onShutdown(): Promise<void>;
}
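
The `dependencies` field implies that something resolves a load order before the lifecycle hooks run. A minimal sketch of that step as a depth-first topological sort, assuming `core` is always present and never listed as a module itself. `ModuleInfo` and `resolveLoadOrder` are hypothetical names, not part of the interface above.

```typescript
interface ModuleInfo {
  name: string;
  dependencies: string[];
}

// Returns module names ordered so that every module comes after its dependencies.
// Names in `builtIn` are treated as always available.
function resolveLoadOrder(modules: ModuleInfo[], builtIn: string[] = ["core"]): string[] {
  const byName = new Map<string, ModuleInfo>();
  for (const m of modules) byName.set(m.name, m);

  const order: string[] = [];
  const state = new Map<string, "visiting" | "done">();

  const visit = (name: string, path: string[]): void => {
    if (builtIn.indexOf(name) !== -1 || state.get(name) === "done") return;
    if (state.get(name) === "visiting") {
      throw new Error(`Dependency cycle: ${path.concat(name).join(" -> ")}`);
    }
    const mod = byName.get(name);
    if (!mod) {
      throw new Error(`Missing module "${name}" required by "${path[path.length - 1]}"`);
    }
    state.set(name, "visiting");
    for (const dep of mod.dependencies) visit(dep, path.concat(name));
    state.set(name, "done");
    order.push(name);
  };

  for (const m of modules) visit(m.name, []);
  return order;
}

const loadOrder = resolveLoadOrder([
  { name: "finance", dependencies: ["core", "contacts"] },
  { name: "tasks", dependencies: ["core", "teams"] },
  { name: "teams", dependencies: ["core"] },
  { name: "contacts", dependencies: ["core"] },
]);
// loadOrder is ["contacts", "finance", "teams", "tasks"]
```

Failing fast on a missing or cyclic dependency at startup is cheaper than discovering it when a hook runs against a module that was never registered.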

Example: TasksModule Implementation

export class TasksModule implements PlatformModule {
  name = 'tasks';
  version = '1.0.0';
  dependencies = ['core', 'teams'];

  onRegister(container: DependencyContainer): void {
    container.register(TasksService);
    container.register(TasksController);
    container.register(TaskBoardsController);
  }

  onDatabaseSetup(migrator: Migrator): Promise<void> {
    return migrator.runModuleMigrations('tasks');
  }

  onPermissionsSetup(engine: PermissionEngine): void {
    engine.defineNamespace('task', {
      relations: {
        org: 'organization',
        owner: 'user',
        assignee: 'user | team#member',
        viewer: 'user | team#member | org#member',
        editor: 'user | team#member',
      },
      permissions: {
        view: 'viewer + editor + owner + org->admin',
        edit: 'editor + owner',
        delete: 'owner + org->admin',
        assign: 'editor + owner',
      }
    });
  }

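  onEventsSetup(bus: EventBus): void {
    bus.subscribe('team.member_removed', this.handleTeamMemberRemoved);
    bus.subscribe('contact.deleted', this.handleContactDeleted);
  }

  // Required by PlatformModule; this module has no startup or shutdown work
  async onReady(): Promise<void> {}
  async onShutdown(): Promise<void> {}

  // Arrow-function properties keep `this` bound when passed as callbacks.
  // Bodies are elided; the payload type is left open here.
  private handleTeamMemberRemoved = async (_event: unknown): Promise<void> => { /* ... */ };
  private handleContactDeleted = async (_event: unknown): Promise<void> => { /* ... */ };
}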

Per-Product Configuration

import { TasksModule } from '@platform/modules/tasks';
import { ContactsModule } from '@platform/modules/contacts';
import { DocumentsModule } from '@platform/modules/documents';
import { NotesModule } from '@platform/modules/notes';
// import { FinanceModule } from '@platform/modules/finance';

export const platformConfig = {
  modules: [
    new TasksModule(),
    new ContactsModule(),
    new DocumentsModule(),
    new NotesModule(),
    // FinanceModule — not loaded, endpoints don't exist,
    // DB tables not created, events not subscribed
  ],

  aws: {
    region: 'us-east-1',
    accountId: process.env.AWS_ACCOUNT_ID,
  },

  database: {
    writer: process.env.AURORA_WRITER_ENDPOINT,
    reader: process.env.AURORA_READER_ENDPOINT,
  },

  features: {
    enableRedlining: true,
    enableKanban: true,
    maxTeamSize: 50,
  }
};

Universal linking — any entity to any entity. This is the glue that lets teams, tasks, docs, leads all reference each other.

CREATE TABLE entity_links (
    id            UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_type   VARCHAR(50) NOT NULL,
    source_id     UUID NOT NULL,
    target_type   VARCHAR(50) NOT NULL,
    target_id     UUID NOT NULL,
    relationship  VARCHAR(50) NOT NULL,
    org_id        UUID NOT NULL,
    created_at    TIMESTAMPTZ DEFAULT NOW(),
    UNIQUE(source_type, source_id, target_type, target_id, relationship)
);

EntityLinkService Implementation

class EntityLinkService {
  async link(params: {
    source: { type: string; id: string };
    target: { type: string; id: string };
    relationship: string;
    orgId: string;
    actorId: string;
  }) {
    // 1. Verify both entities exist (calls respective module)
    await this.verifyEntity(params.source);
    await this.verifyEntity(params.target);

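    // 2. Check permission (actor must have 'link' permission on both).
    //    Assumes check() resolves to a boolean, so a denial has to be turned into an error here.
    for (const entity of [params.source, params.target]) {
      const allowed = await this.permissionEngine.check(params.actorId, 'link', entity);
      if (!allowed) {
        throw new Error(`Actor ${params.actorId} may not link ${entity.type}:${entity.id}`);
      }
    }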

    // 3. Create the link
    await this.repository.createLink(params);

    // 4. Emit event (other modules can react)
    await this.eventBus.emit('entity.linked', params);

    // 5. Audit
    await this.auditLog.record('entity.linked', params);
  }

  async getLinkedEntities(
    source: { type: string; id: string },
    targetType: string,
    orgId: string
  ) {
    return this.repository.findLinks(source, targetType, orgId);
  }
}

Auto-Generated Endpoints

Endpoint | Description
POST /api/v1/links | Create a link between any two entities
DELETE /api/v1/links/:id | Remove a link
GET /api/v1/teams/:id/linked/documents | Docs linked to a team
GET /api/v1/teams/:id/linked/tasks | Tasks linked to a team
GET /api/v1/documents/:id/linked/contacts | Contacts linked to a doc
GET /api/v1/contacts/:id/linked/* | Everything linked to a contact

VI. Business Modules

Tasks Module (Optional)

Same data powers Kanban boards, traditional lists, calendar views, and Gantt charts. The view_type on boards determines rendering — not data structure.

-- Boards and columns are created first because tasks references them.
CREATE TABLE task_boards (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id     UUID NOT NULL,
    name       VARCHAR(200),
    view_type  VARCHAR(50) DEFAULT 'kanban',  -- kanban, list, calendar, gantt
    config     JSONB DEFAULT '{}'
);

CREATE TABLE task_columns (
    id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    board_id  UUID REFERENCES task_boards(id),
    name      VARCHAR(200),
    position  FLOAT,
    config    JSONB DEFAULT '{}'               -- color, WIP limits
);

CREATE TABLE tasks (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id       UUID NOT NULL,
    title        VARCHAR(500) NOT NULL,
    description  TEXT,
    status       VARCHAR(50) DEFAULT 'todo',
    priority     VARCHAR(20),
    due_date     TIMESTAMPTZ,
    board_id     UUID REFERENCES task_boards(id),
    column_id    UUID REFERENCES task_columns(id),
    parent_id    UUID REFERENCES tasks(id),  -- parent task, used by the subtasks endpoint
    position     FLOAT,                    -- fractional indexing for drag-drop
    metadata     JSONB DEFAULT '{}',       -- custom fields, labels, points
    assignee_id  UUID,
    created_by   UUID NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW(),
    updated_at   TIMESTAMPTZ DEFAULT NOW()
);

Key Endpoints

Endpoint | Description
GET /api/v1/tasks?board_id=X&view=kanban | Returns tasks grouped by column (Trello-style)
GET /api/v1/tasks?board_id=X&view=list | Returns flat sorted list (traditional view)
PATCH /api/v1/tasks/:id/move | Reorder or move between columns (fractional indexing)
POST /api/v1/tasks/:id/subtasks | Create nested/child tasks
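
Fractional indexing, as used by the move endpoint and the FLOAT position columns, lets a drag-and-drop update a single row instead of renumbering the whole column. A minimal sketch of the position calculation, assuming positions are plain numbers and that an empty column starts at 1. `positionBetween` is a hypothetical helper name.

```typescript
// Position for an item dropped between two neighbours.
// `undefined` means there is no neighbour on that side.
function positionBetween(prev: number | undefined, next: number | undefined): number {
  if (prev === undefined && next === undefined) return 1; // first item in an empty column
  if (prev === undefined) return (next as number) - 1;    // dropped at the top
  if (next === undefined) return prev + 1;                // dropped at the bottom
  if (prev >= next) throw new Error("prev must sort before next");
  return (prev + next) / 2;                               // dropped in the middle
}

const middle = positionBetween(1, 2);                 // 1.5
const top = positionBetween(undefined, 1);            // 0
const bottom = positionBetween(3, undefined);         // 4
const firstEver = positionBetween(undefined, undefined); // 1
```

Repeatedly inserting into the same gap halves the interval each time, so after a few dozen moves the midpoint can no longer be represented distinctly as a float. A periodic job that rewrites a column's positions to evenly spaced values avoids that.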

Contacts / Leads / CRM (Optional)

Leads and contacts share one table with a type discriminator. A lead converts to a client by changing type.

CREATE TABLE contacts (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id          UUID NOT NULL,
    type            VARCHAR(20) NOT NULL,     -- lead, client, vendor, partner
    status          VARCHAR(50),
    first_name      VARCHAR(100),
    last_name       VARCHAR(100),
    email           VARCHAR(255),
    phone           VARCHAR(50),
    company         VARCHAR(200),
    custom_fields   JSONB DEFAULT '{}',
    pipeline_stage  VARCHAR(50),
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

Documents / Contracts (Optional)

CREATE TABLE documents (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id      UUID NOT NULL,
    title       VARCHAR(500),
    type        VARCHAR(50),         -- contract, proposal, note, template
    content     TEXT,
    status      VARCHAR(50),         -- draft, review, sent, signed
    version     INT DEFAULT 1,
    parent_id   UUID REFERENCES documents(id),
    metadata    JSONB DEFAULT '{}',
    file_url    VARCHAR(1000),
    created_by  UUID NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE document_revisions (
    id           UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id  UUID REFERENCES documents(id),
    content      TEXT,
    changes      JSONB,              -- diff from previous version
    revised_by   UUID NOT NULL,
    created_at   TIMESTAMPTZ DEFAULT NOW()
);

Notifications (Core)

Event-driven. Other modules emit events (task.assigned, document.shared). The notification module subscribes and routes based on user preferences.

CREATE TABLE notifications (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id      UUID NOT NULL,
    user_id     UUID NOT NULL,
    type        VARCHAR(50),
    channel     VARCHAR(20),         -- in_app, email, push
    title       VARCHAR(200),
    body        TEXT,
    data        JSONB DEFAULT '{}',  -- deep linking payload
    read_at     TIMESTAMPTZ,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

Finance (Optional)

Belongs in Core Module | Too Specific (Custom Per Project)
Invoices (CRUD, status tracking) | Tax calculation rules
Payments received/sent ledger | Industry-specific billing
Expense tracking | Payroll
Basic reporting (revenue, expenses) | Complex financial modeling
Integration hooks (Stripe, QuickBooks) | Accounting standards compliance

VII. Deployment — Cell Provisioning

One command provisions a full product cell in a target AWS account:

npx cdk deploy ProductCell \
  --context productName=acme-crm \
  --context modules=tasks,contacts,documents,finance \
  --context environment=prod \
  --profile product-a-prod-account

What Gets Provisioned

  • VPC (3 AZs, private subnets, NAT Gateways)
  • Aurora Serverless v2 cluster (writer + reader)
  • ElastiCache Redis cluster
  • ECS Fargate service (API + Worker)
  • S3 buckets (files, backups)
  • CloudFront distribution + WAF rules
  • CloudWatch dashboards + alarms
  • Cross-account roles (CI/CD, logging)
  • DNS records in Network Hub account

Each product is completely independent at the infrastructure level but shares the same codebase. Fix a bug in the tasks module → deploys to all products that use it.

VIII. Deployment Strategy — Platform by Platform

No single platform is optimal for every workload. Use the right tool for each layer.

Backend API Servers

Option | Best For | Trade-off
ECS Fargate | Steady-state workloads, full VPC control, WebSockets, complex networking | More config, you manage scaling policies
AWS App Runner | Simpler APIs, auto-scaling zero config, small teams | Less VPC control, no WebSocket support
Lambda + API GW | Event-driven, spiky traffic, low-traffic modules | Cold starts, 15-min timeout, harder to debug

Recommendation: ECS Fargate for the core API — needs VPC access to Aurora/Redis, persistent connections for WebSockets, and connection pooling. Lambda for event-driven workers. App Runner is too limited for a platform backend.

Background Workers / Jobs

Option | Best For
ECS Fargate (separate service) | Long-running job processors (BullMQ), always-on
Lambda | Short-lived event handlers (S3 triggers, SQS consumers, cron)
Step Functions + Lambda | Multi-step workflows (ETL, document processing)

Frontend Applications

Option | Best For | Trade-off
Vercel | Next.js apps, best DX, instant previews, edge SSR | Vendor lock-in, cost at scale, data leaves AWS
Cloudflare Pages | Static sites, simple SPAs, global edge, cheap | Limited SSR, no tight AWS integration
AWS Amplify Hosting | Next.js/React needing tight AWS integration | Slower builds, weaker DX vs. Vercel
CloudFront + S3 | Pure SPAs (React/Vue), full control, cheapest | No SSR, manual cache invalidation

Frontend Recommendation

  • Customer-facing products → Vercel (Next.js with edge SSR, preview deployments per PR)
  • Internal tools / admin panels → CloudFront + S3 (static React SPA, cheapest, stays in AWS)
  • Architecture docs / marketing → Cloudflare Pages (static HTML, free tier)

The Recommended Mix

EDGE / CDN
  Customer Frontends   -->  Vercel (Next.js, edge SSR)
  Internal Tools       -->  CloudFront + S3 (React SPA)
  Architecture Docs    -->  Cloudflare Pages (static HTML)
                |
                | HTTPS (API calls)
                v
AWS PRODUCT ACCOUNT
  API Gateway   -->  ECS Fargate (Core API, NestJS)
  SQS/SNS      -->  ECS Fargate (BullMQ Worker)
  S3 Events    -->  Lambda (file processing)
  Scheduled    -->  Lambda (cron jobs, cleanup)
  ETL          -->  Step Functions + Lambda
                |
  Aurora PostgreSQL  |  Redis  |  S3

Why NOT All-Serverless for Backend

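Concern | Vercel / Lambda Serverless | ECS Fargate
Timeout | 10-300s depending on plan (Vercel); 15-min hard cap (Lambda) | Unlimited
Cold starts | Roughly 100-500ms whenever a new instance has to start | None (always running)
DB connections | One connection per concurrent instance; a traffic burst can exhaust the database's connection limit unless a proxy such as RDS Proxy sits in front | Connection pool (Prisma/Drizzle)
WebSockets | Not supported on Vercel functions; on Lambda only through API Gateway WebSocket APIs | Full support
Background jobs | No long-running workers, only short event handlers | BullMQ workers
VPC access | Not possible (Vercel) / complex (Lambda) | Native (same VPC as Aurora/Redis)
Cost at scale | Unpredictable (per-invocation) | Predictable (per-container)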

How Vercel Works with This Architecture

Vercel is the frontend deployment platform only. It does NOT run business logic.

User's Browser
     |
     v
  Vercel (Frontend ONLY)
  Next.js App
     |  fetch('https://api.yourproduct.com/v1/tasks')
     |
     v  HTTPS
  AWS (Your Backend)
  CloudFront --> API Gateway --> ECS (NestJS on Fargate)
     |
  Aurora  |  Redis  |  S3

  • Server Components / SSR: Vercel renders HTML by calling your ECS API, sends finished page to browser
  • Client-side fetching: Browser calls your ECS API directly — Vercel not involved
  • Next.js API routes: Only for thin proxies (OAuth callbacks, cookie handling) — never business logic
  • WebSockets: Browser connects directly to ECS — Vercel can't handle these

Per-Environment Strategy

Environment | Backend | Frontend
Dev | ECS Fargate (min capacity, Aurora at 0.5 ACU) | Vercel preview deployments (auto per PR)
Staging | ECS Fargate (mirrors prod, smaller scale) | Vercel staging branch
Prod | ECS Fargate (auto-scaling 2-10 tasks, multi-AZ) | Vercel production

IX. Observability — SRE Approach

SLO-Driven, Not Alert-Driven

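SLO | Target | Budget
API Availability | 99.9% | 43 min downtime/month
API Latency (p99) | < 500ms |
API Latency (p50) | < 100ms |
Data Durability | 99.999999999% | Delegated to Aurora's 6-way replicated storage and its S3-backed backups (the eleven-nines figure is S3's design target)

Error budget consumed too fast → freeze deployments. Error budget healthy → deploy freely.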
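
The 43-minute figure and the freeze rule follow from simple arithmetic on the availability target. A minimal sketch, assuming a 30-day window and a freeze policy of "budget spent faster than the window elapses". That policy and the function names are illustrative choices, not something the platform prescribes.

```typescript
// Allowed downtime for an availability SLO over a window, in minutes.
function errorBudgetMinutes(sloPercent: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return windowMinutes * (1 - sloPercent / 100);
}

// Fraction of the budget already spent; above 1 means the SLO is breached.
function budgetConsumed(downtimeMinutes: number, sloPercent: number, windowDays: number): number {
  return downtimeMinutes / errorBudgetMinutes(sloPercent, windowDays);
}

// Hypothetical policy: freeze deploys when the budget is being spent
// faster than the window is elapsing.
function shouldFreezeDeploys(
  downtimeMinutes: number,
  sloPercent: number,
  windowDays: number,
  daysElapsed: number,
): boolean {
  const consumed = budgetConsumed(downtimeMinutes, sloPercent, windowDays);
  const elapsed = daysElapsed / windowDays;
  return consumed > elapsed;
}

const monthlyBudget = errorBudgetMinutes(99.9, 30);      // about 43.2 minutes
const freezeEarly = shouldFreezeDeploys(30, 99.9, 30, 10); // true: ~69% of budget spent in a third of the window
const keepShipping = shouldFreezeDeploys(5, 99.9, 30, 15); // false: ~12% spent at the halfway point
```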

Three Pillars

Pillar | Tool | Purpose
Logs | CloudWatch → OpenSearch | Structured JSON, cross-account
Metrics | CloudWatch + EMF | Business + infra metrics
Traces | X-Ray / OpenTelemetry | End-to-end request tracing

Every request gets a correlation ID flowing through logs, metrics, and traces.

X. Tech Stack Summary

Concern | Choice
Backend | NestJS (TypeScript) — module system built-in
ORM | Prisma or Drizzle — type-safe, migrations
Database | Aurora PostgreSQL Serverless v2
Cache / Queue | Redis (ElastiCache) + BullMQ
Auth | AWS Cognito or Keycloak
Permissions | SpiceDB / OpenFGA (Zanzibar)
File Storage | AWS S3 + presigned URLs
Search | PG full-text → OpenSearch later
Events | AWS SNS/SQS or Redis Streams
Deployment | ECS Fargate + CDK for IaC
Frontend | Next.js
Monorepo | Turborepo or Nx
CI/CD | GitHub Actions → CDK Pipelines
Observability | CloudWatch + X-Ray + OpenSearch

XI. Build Order

Module Dependency Map

Core (always loaded):
  ├── Auth
  ├── Users
  ├── Organizations
  ├── Teams
  ├── Permissions (Zanzibar)
  ├── Entity Links (cross-module glue)
  ├── Notifications (event-driven)
  ├── Files
  ├── Audit Log
  └── Search

Optional (plug in per product):
  ├── Tasks        → depends on Core
  ├── Contacts     → depends on Core
  ├── Documents    → depends on Core, Files
  ├── Notes        → depends on Core
  ├── Finance      → depends on Core, Contacts
  └── [Custom]     → depends on Core

Recommended Sequence

  1. Monorepo scaffold — Turborepo + NestJS + shared types package
  2. Core kernel — Auth, Users, Orgs, Teams, Permissions, Entity Links
  3. Event bus + notifications
  4. Tasks — first optional module, proves the architecture
  5. Contacts/Leads, Documents, Notes
  6. Finance — most domain-specific, last
  7. CDK infrastructure — multi-account provisioning
  8. CI/CD pipeline — automated deploy across accounts

XII. Module Update & Versioning Strategy

How to ship changes to shared infrastructure without breaking products already running on it.

Scenario 1 — Updating an Existing Core Module

Say you update Auth to add MFA, change a JWT claim, or refactor the permission middleware. Every product using that module gets the change. If it breaks, everything breaks.

Solution: Semantic Versioning + Changesets

Version Bump | When | Rollout
patch (1.2.0 → 1.2.1) | Bug fix | Auto-deploy to all products
minor (1.2.0 → 1.3.0) | New backwards-compatible feature | Auto-deploy, products adopt when ready
major (1.x → 2.0.0) | Breaking change | Each product opts in on its own schedule
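
The rollout rule in the table can be derived mechanically from two version strings. A minimal sketch, assuming plain `MAJOR.MINOR.PATCH` versions without pre-release tags and that the second version is the newer one. `bumpType` and `rolloutFor` are hypothetical helper names.

```typescript
type Bump = "major" | "minor" | "patch" | "none";
type Rollout = "auto-deploy" | "opt-in" | "nothing to deploy";

function parseVersion(v: string): [number, number, number] {
  const m = /^(\d+)\.(\d+)\.(\d+)$/.exec(v);
  if (!m) throw new Error(`Not a plain semver version: ${v}`);
  return [Number(m[1]), Number(m[2]), Number(m[3])];
}

// Classifies the change between two versions by the most significant part that differs.
function bumpType(from: string, to: string): Bump {
  const [major1, minor1, patch1] = parseVersion(from);
  const [major2, minor2, patch2] = parseVersion(to);
  if (major2 !== major1) return "major";
  if (minor2 !== minor1) return "minor";
  if (patch2 !== patch1) return "patch";
  return "none";
}

// Rollout rule per bump type, taken from the table above.
function rolloutFor(bump: Bump): Rollout {
  switch (bump) {
    case "major":
      return "opt-in";
    case "minor":
    case "patch":
      return "auto-deploy";
    default:
      return "nothing to deploy";
  }
}

const authPatch = rolloutFor(bumpType("1.2.0", "1.2.1")); // "auto-deploy"
const authMajor = rolloutFor(bumpType("1.9.3", "2.0.0")); // "opt-in"
```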

Breaking changes live in a parallel package until products migrate:

packages/core/auth/     ← v1.2.0  (Product A still here)
packages/core/auth-v2/  ← v2.0.0  (Product B migrated)

Use Changesets to manage this in the monorepo:

# Developer describes what changed
npx changeset add
# → "auth: added MFA, new optional column mfa_enabled on users table"
# → type: minor

# CI bumps versions and generates changelogs
npx changeset version

In CI, every PR to a shared module runs tests against every product that uses it:

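# .github/workflows/test.yml (sketch: assumes Turborepo and workspace packages named product-a, product-b, product-c)
on: pull_request
jobs:
  test-all-products:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        product: [product-a, product-b, product-c]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npx turbo run test --filter=${{ matrix.product }}
    # Mark this job as a required status check so that any failure blocks the PR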

Scenario 2 — Adding a New Module to an Existing Product

The easy case. Just add it to platform.config.ts:

// Before
modules: [new TasksModule(), new ContactsModule()]

// After — Finance added
modules: [new TasksModule(), new ContactsModule(), new FinanceModule()]

On next cdk deploy: Finance migrations run automatically, new routes register, event subscriptions set up. Nothing else changes — existing data is untouched.

Scenario 3 — Database Schema Changes

The hardest problem. Change a core table and every product on that schema is affected.

Rule 1: Migrations are forwards-only and backwards-compatible

Never rename or delete a column in a single migration. Always do it in phases:

-- Phase 1 (deploy now): Add new column alongside old
ALTER TABLE users ADD COLUMN display_name VARCHAR(200);
-- Application writes to BOTH columns during transition

-- Phase 2 (next deploy): Backfill
UPDATE users SET display_name = name WHERE display_name IS NULL;

-- Phase 3 (after confirming): Drop old column
ALTER TABLE users DROP COLUMN name;

Rule 2: Migrations run automatically on deploy

async onDatabaseSetup(migrator: Migrator) {
  await migrator.runModuleMigrations('auth');
  // Idempotent — safe to run multiple times
}

Rule 3: Separate migration deploys from code deploys

  • Day 1 — Deploy migration (add new column, keep old)
  • Day 2 — Deploy code that uses new column
  • Day 3 — Deploy migration to drop old column after confirming

You can roll back the code without rolling back the DB — the DB is always compatible with the previous code version.

Scenario 4 — API Contract Changes

Backwards-compatible (safe anytime): adding endpoints, adding optional fields, adding optional params. Frontends that don't know about new fields simply ignore them.

Breaking changes — use API versioning:

/api/v1/tasks    ← old behavior, still works
/api/v2/tasks    ← new behavior

Both run simultaneously. Products on v1 keep working. v1 is deprecated with a sunset date announced in advance.

Overall Governance Model

PLATFORM REPO (shared modules)

Core changes go through:
  1. PR with changeset label (patch / minor / major)
  2. Required review from platform team
  3. Automated tests across ALL products in CI
  4. Staged rollout: Dev → Staging → Prod

Breaking changes (major):
  5. Migration guide published
  6. Each product team opts in on their own schedule
  7. Old version supported for defined deprecation window

Decision Matrix

Change Type | Strategy
Bug fix in core module | Patch version, auto-deploy everywhere
New optional feature in core | Minor version, auto-deploy, products adopt when ready
Breaking change to core | Major version, products opt in independently
New module added to a product | Additive — deploy freely
DB schema change (additive) | Run migration, deploy code
DB schema change (breaking) | 3-phase migration (add → backfill → drop)
API contract change | Version the endpoint (/v1 → /v2), both run simultaneously

Key principle: The platform and each product deploy independently. A core update never forces an emergency migration across all products simultaneously.

XIII. Standard vs. This Architecture

Concern | Standard Approach | This Architecture
Multi-tenancy | Shared DB, app-level filtering | Row-Level Security at DB + Zanzibar permission model
Account isolation | One AWS account, IAM policies | Multi-account with SCPs, blast radius isolation
Security | JWT + middleware | Zero Trust, 5 layers, Zanzibar RBAC/ABAC
Database | Standard PostgreSQL | Aurora Serverless v2 (up to 3x throughput per AWS, auto-scaling, multi-AZ)
Migration | Manual scripts | DMS with CDC (near-zero downtime from supported legacy databases)
Modularity | Feature flags | True module system with deps, migrations, permissions, events
Cross-module | Hardcoded foreign keys | Universal entity linking (any-to-any)
Deployment | Manual setup per project | One CDK command provisions a full product cell
Reliability | Alerts on errors | SLO-driven error budgets (Google SRE model)
Observability | Logs only | Correlated logs + metrics + traces
