# Deployment
Viberglass is deployed to AWS via three Pulumi stacks and GitHub Actions workflows.
## High-level shape
There are three Pulumi stacks under infra/, each with its own dev and prod configuration:
| Stack | Path | What it owns |
|---|---|---|
| base | infra/base | VPC, public/private subnets, NAT gateway(s), security groups (backend, RDS, worker), KMS key for SSM encryption, CloudWatch log groups. |
| platform | infra/platform | ECR repo for the backend image, RDS PostgreSQL multi-AZ, S3 uploads bucket, ALB, backend ECS Fargate cluster + service + task definition, Amplify frontend app + branch, Slack/webhook SSM parameters, GitHub OIDC role for Amplify deploys. |
| workers | infra/workers | Worker ECR repos, Lambda function role, ECS task execution + task roles, ECS cluster used by clankers, baseline task definitions for the ship-with-it agent images. |
The stacks are deployed in order — base, then platform, then workers — because each later stack reads outputs from the earlier ones via `pulumi.StackReference`.
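The cross-stack wiring looks roughly like this. This is a hedged sketch, not the real code: the stack name format and the output keys (`vpcId`, `privateSubnetIds`) are assumptions; the actual names live in `infra/base/index.ts`.

```typescript
import * as pulumi from "@pulumi/pulumi";

// Hypothetical example: the platform stack reading outputs from base.
const env = pulumi.getStack(); // e.g. "dev" or "prod"
const base = new pulumi.StackReference(`viberglass/base/${env}`);

// Outputs come back as pulumi.Output values; downstream resources
// consume them lazily without resolving anything at plan time.
const vpcId = base.getOutput("vpcId");
const privateSubnetIds = base.getOutput("privateSubnetIds");
```

Because later stacks only consume exported outputs, renaming a resource inside `base` is safe as long as the export names stay stable.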
## What runs where
```
Route53 ─▶ ALB ─▶ ECS Fargate (platform-backend)
                        │
                        ├─▶ RDS PostgreSQL (multi-AZ in prod)
                        ├─▶ S3 uploads bucket
                        ├─▶ SSM Parameter Store (secrets, KMS-encrypted)
                        └─▶ Lambda / ECS Fargate (workers via clankers)

Amplify ───────────────▶ Static React 19 SPA (platform-frontend)
GitHub Actions OIDC ───▶ AWS roles for build/deploy
```
The backend is the only long-running service in production. Workers are launched on demand: Lambda functions for short jobs, ECS Fargate tasks for long-running ones, all invoked from the backend's `WorkerExecutionService`.
## Pulumi stacks in detail
### base
`infra/base/index.ts` provisions:

- A `/16` VPC with two AZs (`eu-west-1a`, `eu-west-1b`).
- Public and private subnets (`10.0.1.0/24`, `10.0.2.0/24`, `10.0.10.0/24`, `10.0.11.0/24`).
- An internet gateway plus, when `networkMode = enterprise`, NAT gateways for private subnet egress.
- Security groups: `backend-sg` (port 3000 from VPC), `rds-sg` (5432 from backend SG only), `worker-sg` (callbacks from backend SG, all egress).
- A KMS key with rotation enabled, aliased as `alias/viberglass-<env>-ssm`.
- CloudWatch log groups for the Lambda worker (`/aws/lambda/viberglass-<env>-worker`), the ECS worker (`/ecs/viberglass-<env>-worker`), and the backend (`/ecs/viberglass-<env>-backend`).
Every output is exported so the platform and workers stacks can `getOutput()` against it.
### platform
`infra/platform/index.ts` wires the backend, database, frontend, and the rest of the always-on infrastructure. It is the largest stack and is split into composable components under `infra/platform/components/`:

- `registry` — ECR repository for the backend image.
- `database` — RDS PostgreSQL instance with credentials saved to SSM.
- `storage` — S3 bucket for uploads (versioning enabled outside dev).
- `load-balancer` — ALB + target group + ACM certificate (when `apiDomain` is set) + Route53 alias.
- `backend-ecs` — task definition + service for the backend container, including environment variables (database URL from SSM, S3 prefix, allowed origins, Slack secrets, worker provisioning hints).
- `amplify-frontend` — Amplify app and main branch, optionally with a custom domain.
- `amplify-oidc` — IAM OIDC trust for GitHub Actions to deploy Amplify.
- `secrets` — SSM parameters describing the deployment target (region, ECR repo, ECS cluster/service, OIDC role) so the GitHub workflows can self-discover their inputs.
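To illustrate the `secrets` component's self-discovery pattern, here is a minimal sketch. The parameter path and the literal values are assumptions for illustration only; the real naming lives in `infra/platform/components/`.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Sketch: publish the deployment target so CI can look it up at run time
// instead of hard-coding cluster/service names in workflow files.
const env = pulumi.getStack();

const clusterParam = new aws.ssm.Parameter("deploy-ecs-cluster", {
  name: `/viberglass/${env}/deploy/ecs-cluster`, // hypothetical path
  type: "String",
  value: "viberglass-dev-backend", // normally the ECS cluster's name output
});
```

The workflows then read these parameters at job start, so renaming a cluster only requires a `pulumi up`, never a workflow edit.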
### workers
`infra/workers/index.ts` provisions everything a clanker needs to run:

- ECR repositories for each worker image (`viberator-worker-multi-agent`, `viberator-worker-claude-code-ecs`, `viberator-worker-<agent>`).
- An ECS cluster for clanker tasks (separate from the backend cluster).
- IAM execution and task roles for ECS workers, with permissions to read SSM parameters under `/viberator/secrets/*` and write to the worker log group.
- A Lambda execution role for Lambda-deployed clankers, with the same SSM permissions plus VPC attach permissions.
- Baseline task definitions matching the multi-agent and Claude Code ECS images so out-of-the-box clankers have something sensible to copy.
The stack exports the role ARNs, image URIs, cluster ARN, and worker subnets/security group. The platform stack reads those values via `viberglass:workerStack` so the backend can offer "managed" provisioning defaults to clankers (no need to paste long ARNs into the UI).
## GitHub Actions workflows
Workflows live under `.github/workflows/`. They share two patterns: AWS auth via OIDC (no static credentials) and reading their target identifiers from SSM (so renaming a cluster does not require touching the workflow).
| Workflow | Trigger | What it does |
|---|---|---|
| `pulumi-deploy-dev.yml` | Push to `main` touching `infrastructure/**`, or manual dispatch. | Runs `pulumi up` against the dev stacks. |
| `pulumi-deploy-prod.yml` | Manual dispatch only. | Runs `pulumi up` against the prod stacks, gated on the prod GitHub Environment. |
| `pulumi-preview.yml` | PRs touching `infrastructure/**`. | Runs `pulumi preview` and posts the diff as a PR comment. |
| `deploy-backend-dev.yml` | Push to `main` touching specific folders/files. | Builds the backend image, pushes it to ECR with both `:<sha>` and `:latest` tags, and updates the ECS service. |
| `deploy-backend-prod.yml` | Manual dispatch with environment gate. | Same as dev but against the prod cluster/service. |
| `deploy-frontend-dev.yml` | Push to `main` touching `apps/platform-frontend/**` or `packages/types/**`. | Builds shared types, validates the Vite build, and triggers Amplify to redeploy from git. |
| `deploy-frontend-prod.yml` | Manual dispatch with environment gate. | Same as dev but against the prod Amplify app. |
| `deploy-viberators.yml` | Push to `main` touching specific folders/files. | Runs `infra/workers/scripts/setup-harness-images.sh` to build and push every worker image to ECR. |
| `backend-ci.yml` / `frontend-ci.yml` | Pull requests. | Lint, type-check, and test gates. |
The dev workflows are fully automatic on merge to `main`. Production workflows always require a human to dispatch them and are gated on the prod GitHub Environment, which has its own protected reviewers list. For most users of this internal tool, the dev workflows are the only ones they will ever need.
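The dual-tag convention from the backend deploy row can be expressed as a tiny helper. This is a sketch of what the workflow computes, not the workflow itself; the repo URI in the comment is made up.

```typescript
// Build the two tags the backend deploy workflow pushes for every image:
// an immutable :<sha> tag for rollbacks and a moving :latest tag for the
// ECS service to track.
export function imageTags(repoUri: string, sha: string): [string, string] {
  return [`${repoUri}:${sha}`, `${repoUri}:latest`];
}

// e.g. imageTags("<account>.dkr.ecr.eu-west-1.amazonaws.com/backend", "abc1234")
```

The immutable `:<sha>` tag is what makes the rollback procedure at the bottom of this page possible.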
## Deploying a fresh environment
When standing up a brand-new environment from scratch:
1. Bootstrap Pulumi state. Run `infra/setup-pulumi-state.sh` once to create the S3 bucket and DynamoDB lock table that hold Pulumi state.
2. Create the GitHub OIDC role. A small bootstrap CloudFormation stack (or `aws iam` one-liners — see `infra/README.md`) trusts `token.actions.githubusercontent.com` for the `ilities/viberglass` repo. Save the role ARN as the `AWS_ROLE_ARN` secret on the matching GitHub Environment.
3. Deploy `base`. `cd infra/base && pulumi up -s <env>`.
4. Deploy `workers`. `cd infra/workers && pulumi up -s <env>`. This produces ECR repos and IAM roles the platform stack will reference.
5. Build and push worker images. `./infra/workers/scripts/setup-harness-images.sh <env>` (or trigger `deploy-viberators.yml` manually).
6. Deploy `platform`. `cd infra/platform && pulumi up -s <env>`. This stands up RDS, the backend service, the ALB, and Amplify.
7. Set Slack/Pulumi config secrets. Use `pulumi config set --secret slackBotToken ...` and friends so the Slack SSM parameters are populated with real values.
8. Push the first backend image. Run `deploy-backend-<env>.yml` once so the ECS service has a real image to pull.
9. Trigger the first frontend deploy. Run `deploy-frontend-<env>.yml` so Amplify builds against the freshly deployed backend URL.
10. Smoke-test. Hit `<api-domain>/health` (must return 200) and load the Amplify URL in a browser.
## Day-to-day deploys
Once an environment is running, a typical day looks like this:
- Backend code change → merge to `main` → `deploy-backend-dev.yml` runs → new image in ECR → ECS service rolls the new task definition.
- Frontend code change → merge to `main` → `deploy-frontend-dev.yml` runs → Amplify rebuilds and serves the new bundle.
- Worker / Dockerfile change → merge to `main` → `deploy-viberators.yml` runs → new worker images in ECR. Existing Lambda clankers pick up the new image when you click Start on them; ECS clankers pick it up the next time you start or edit them.
- Infrastructure change → PR posts a `pulumi preview` diff → merge to `main` → `pulumi-deploy-dev.yml` applies it. Promote to prod by manually dispatching `pulumi-deploy-prod.yml`.
## Database migrations
Migrations are Kysely files under `apps/platform-backend/src/migrations/`, numbered 001–999. The backend runs `migrate:latest` automatically on startup, so a backend deploy is also a migration deploy. There is no separate migration job. For local development you can also run them by hand with `npm run migrate:latest` (or `migrate:down` to roll back).
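For reference, a migration file in that directory typically follows Kysely's standard `up`/`down` shape. The sketch below is illustrative only: the table and columns are invented, and the filename convention should be checked against the existing migrations.

```typescript
import { Kysely, sql } from "kysely";

// Hypothetical 0XX_create_example_notes.ts; real migrations live in
// apps/platform-backend/src/migrations/ and are numbered 001-999.
export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable("example_notes")
    .addColumn("id", "serial", (col) => col.primaryKey())
    .addColumn("body", "text", (col) => col.notNull())
    .addColumn("created_at", "timestamptz", (col) =>
      col.defaultTo(sql`now()`).notNull()
    )
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  await db.schema.dropTable("example_notes").execute();
}
```

Because migrations run on startup, a backend rollback to an older image does not automatically run `down`; write migrations to be backward compatible with the previous release.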
## Amplify specifics
The Amplify app is connected to the GitHub repository. In production it auto-deploys on push to main; in dev the workflow validates the build and Amplify pulls the same commit. The branch is named main regardless of environment — the dev/prod split is one Amplify app per environment, not one branch per environment.
Custom domains (`appDomain` in Pulumi config) are wired through Route53 with ACM certificates issued in `us-east-1` (Amplify requirement). DNS records for the apex are created automatically when `route53ZoneId` is provided.
## Backend specifics
The backend image is built from `apps/platform-backend/Dockerfile.prod`. The build context is the monorepo root so the image can pull `packages/types` and `packages/chat-slack` from the workspace.
The ECS service runs on Fargate. CPU, memory, and desired count are tuned per environment; a sensible starting point is 256 CPU units / 512 MB with desired count 1 and autoscaling between 1 and 3. The backend does little heavy lifting, so these defaults hold up well.
Health checks hit `/health` on the ALB target group. A failing deploy is rolled back by the ECS service deployment circuit breaker.
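Those two safety nets map to two resource settings in the platform stack. The fragment below is a sketch with assumed resource names and placeholder IDs; the real definitions live in the `load-balancer` and `backend-ecs` components.

```typescript
import * as aws from "@pulumi/aws";

// ALB target group health check against the backend's /health endpoint.
const targetGroup = new aws.lb.TargetGroup("backend-tg", {
  port: 3000,
  protocol: "HTTP",
  targetType: "ip", // Fargate tasks register by IP
  vpcId: "vpc-0123example", // normally read from the base stack outputs
  healthCheck: { path: "/health", matcher: "200" },
});

// Deployment circuit breaker: a deploy whose tasks never pass the health
// check is stopped and rolled back to the previous task definition.
const service = new aws.ecs.Service("backend", {
  cluster: "viberglass-dev-backend", // illustrative name
  taskDefinition: "backend:1",       // illustrative revision
  desiredCount: 1,
  launchType: "FARGATE",
  networkConfiguration: {
    subnets: ["subnet-0123example"],       // placeholders; real values come
    securityGroups: ["sg-0123example"],    // from the base stack outputs
  },
  deploymentCircuitBreaker: { enable: true, rollback: true },
});
```

With `rollback: true`, a bad image never needs a manual intervention to recover; ECS re-deploys the last healthy revision on its own.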
## Worker specifics
Workers are not deployed in the traditional sense; their images are pushed to ECR and clankers reference them by URI in `deployment_config`. Clicking Start on a clanker calls `ClankerProvisioningService`, which:
- For Lambda: creates or updates the function (with VPC config from the worker stack), waits for it to reach `Active`, and marks the clanker `active`.
- For ECS: registers a new task definition revision against the worker cluster and marks the clanker `active`.
- For Docker: validates that the local Docker daemon is reachable.
See Clankers for the data model and Clanker Images for the image build pipeline.
## Rollback
- Backend — re-run `deploy-backend-<env>.yml` against an earlier commit, or `aws ecs update-service --task-definition <previous-revision>` to revert the task definition pointer. ECS keeps the previous revisions around indefinitely.
- Frontend — Amplify's web console has a one-click rollback to any previous build for the branch.
- Workers — clankers reference an immutable `<env>-<sha>` tag in `deployment_config`. To roll back, edit the clanker, change `imageUri` to the previous tag, and click Start.
- Pulumi — re-run `pulumi up` against an earlier git commit, or `pulumi stack export` / `pulumi stack import` for surgical state edits. Always preview first.
## See also
- Architecture — the bigger picture.
- Clanker Images — how worker images are built and tagged.
- Secrets Management — how SSM, KMS, and the encrypted database secrets fit together.