# Deployment
Viberglass is deployed to AWS via three Pulumi stacks and GitHub Actions workflows.
## High-level shape
There are three Pulumi stacks under infra/, each with its own dev and prod configuration:
| Stack | Path | What it owns |
|---|---|---|
| base | infra/base | VPC, public/private subnets, NAT gateway(s), security groups (backend, RDS, worker), KMS key for SSM encryption, CloudWatch log groups. |
| platform | infra/platform | ECR repo for the backend image, RDS PostgreSQL multi-AZ, S3 uploads bucket, ALB, backend ECS Fargate cluster + service + task definition, Amplify frontend app + branch, Slack/webhook SSM parameters, GitHub OIDC role for Amplify deploys. |
| workers | infra/workers | Worker ECR repos, Lambda function role, ECS task execution + task roles, ECS cluster used by clankers, baseline task definitions for the ship-with-it agent images. |
The stacks are deployed in order — base, then platform, then workers — because each later stack reads outputs from the earlier ones via `pulumi.StackReference`.
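The cross-stack wiring looks roughly like this. This is a hedged sketch, not the real code: the stack name format and the output keys (`vpcId`, `privateSubnetIds`) are assumptions; the actual names live in `infra/base/index.ts`.

```typescript
import * as pulumi from "@pulumi/pulumi";

// Hypothetical example: the platform stack reading outputs from base.
const env = pulumi.getStack(); // e.g. "dev" or "prod"
const base = new pulumi.StackReference(`viberglass/base/${env}`);

// Outputs come back as pulumi.Output values; downstream resources
// consume them lazily without resolving anything at plan time.
const vpcId = base.getOutput("vpcId");
const privateSubnetIds = base.getOutput("privateSubnetIds");
```

Because later stacks only consume exported outputs, renaming a resource inside `base` is safe as long as the export names stay stable.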
## What runs where
```
Route53 ─▶ ALB ─▶ ECS Fargate (platform-backend)
                        │
                        ├─▶ RDS PostgreSQL (multi-AZ in prod)
                        ├─▶ S3 uploads bucket
                        ├─▶ SSM Parameter Store (secrets, KMS-encrypted)
                        └─▶ Lambda / ECS Fargate (workers via clankers)

Amplify ───────────────▶ Static React 19 SPA (platform-frontend)
GitHub Actions OIDC ───▶ AWS roles for build/deploy
```
The backend is the only long-running service in production. Workers are launched on demand: Lambda functions for short jobs, ECS Fargate tasks for long-running ones, all invoked from the backend's `WorkerExecutionService`.
## Pulumi stacks in detail
### base
`infra/base/index.ts` provisions:

- A `/16` VPC with two AZs (`eu-west-1a`, `eu-west-1b`).
- Public and private subnets (`10.0.1.0/24`, `10.0.2.0/24`, `10.0.10.0/24`, `10.0.11.0/24`).
- An internet gateway plus, when `networkMode = enterprise`, NAT gateways for private subnet egress.
- Security groups: `backend-sg` (port 3000 from VPC), `rds-sg` (5432 from backend SG only), `worker-sg` (callbacks from backend SG, all egress).
- A KMS key with rotation enabled, aliased as `alias/viberglass-<env>-ssm`.
- CloudWatch log groups for the Lambda worker (`/aws/lambda/viberglass-<env>-worker`), the ECS worker (`/ecs/viberglass-<env>-worker`), and the backend (`/ecs/viberglass-<env>-backend`).
Every output is exported so the platform and workers stacks can `getOutput()` against it.
### platform
`infra/platform/index.ts` wires the backend, database, frontend, and the rest of the always-on infrastructure. It is the largest stack and is split into composable components under `infra/platform/components/`:

- `registry` — ECR repository for the backend image.
- `database` — RDS PostgreSQL instance with credentials saved to SSM.
- `storage` — S3 bucket for uploads (versioning enabled outside dev).
- `load-balancer` — ALB + target group + ACM certificate (when `apiDomain` is set) + Route53 alias.
- `backend-ecs` — task definition + service for the backend container, including environment variables (database URL from SSM, S3 prefix, allowed origins, Slack secrets, worker provisioning hints).
- `amplify-frontend` — Amplify app and main branch, optionally with a custom domain.
- `amplify-oidc` — IAM OIDC trust for GitHub Actions to deploy Amplify.
- `secrets` — SSM parameters describing the deployment target (region, ECR repo, ECS cluster/service, OIDC role) so the GitHub workflows can self-discover their inputs.
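To illustrate the `secrets` component's self-discovery pattern, here is a minimal sketch. The parameter path and the literal values are assumptions for illustration only; the real naming lives in `infra/platform/components/`.

```typescript
import * as pulumi from "@pulumi/pulumi";
import * as aws from "@pulumi/aws";

// Sketch: publish the deployment target so CI can look it up at run time
// instead of hard-coding cluster/service names in workflow files.
const env = pulumi.getStack();

const clusterParam = new aws.ssm.Parameter("deploy-ecs-cluster", {
  name: `/viberglass/${env}/deploy/ecs-cluster`, // hypothetical path
  type: "String",
  value: "viberglass-dev-backend", // normally the ECS cluster's name output
});
```

The workflows then read these parameters at job start, so renaming a cluster only requires a `pulumi up`, never a workflow edit.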
### workers
`infra/workers/index.ts` provisions everything a clanker needs to run:

- ECR repositories for each worker image (`viberator-worker-multi-agent`, `viberator-worker-claude-code-ecs`, `viberator-worker-<agent>`).
- An ECS cluster for clanker tasks (separate from the backend cluster).
- IAM execution and task roles for ECS workers, with permissions to read SSM parameters under `/viberator/secrets/*` and write to the worker log group.
- A Lambda execution role for Lambda-deployed clankers, with the same SSM permissions plus VPC attach permissions.
- Baseline task definitions matching the multi-agent and Claude Code ECS images so out-of-the-box clankers have something sensible to copy.
The stack exports the role ARNs, image URIs, cluster ARN, and worker subnets/security group. The platform stack reads those values via `viberglass:workerStack` so the backend can offer "managed" provisioning defaults to clankers (no need to paste long ARNs into the UI).
## GitHub Actions workflows
Workflows live under `.github/workflows/`. They share two patterns: AWS auth via OIDC (no static credentials) and reading their target identifiers from SSM (so renaming a cluster does not require touching the workflow).
| Workflow | Trigger | What it does |
|---|---|---|
| `pulumi-deploy-dev.yml` | Push to `main` touching `infrastructure/**`, or manual dispatch. | Runs `pulumi up` against the dev stacks. |
| `pulumi-deploy-prod.yml` | Manual dispatch only. | Runs `pulumi up` against the prod stacks, gated on the prod GitHub Environment. |
| `pulumi-preview.yml` | PRs touching `infrastructure/**`. | Runs `pulumi preview` and posts the diff as a PR comment. |
| `deploy-backend-dev.yml` | Push to `main` touching specific folders/files. | Builds the backend image, pushes it to ECR with both `:<sha>` and `:latest` tags, and updates the ECS service. |
| `deploy-backend-prod.yml` | Manual dispatch with environment gate. | Same as dev but against the prod cluster/service. |
| `deploy-frontend-dev.yml` | Push to `main` touching `apps/platform-frontend/**` or `packages/types/**`. | Builds shared types, validates the Vite build, and triggers Amplify to redeploy from git. |
| `deploy-frontend-prod.yml` | Manual dispatch with environment gate. | Same as dev but against the prod Amplify app. |
| `deploy-viberators.yml` | Push to `main` touching specific folders/files. | Runs `infra/workers/scripts/setup-harness-images.sh` to build and push every worker image to ECR. |
| `backend-ci.yml` / `frontend-ci.yml` | Pull requests. | Lint, type-check, and test gates. |
The dev workflows are fully automatic on merge to `main`. Production workflows always require a human to dispatch them and are gated on the prod GitHub Environment, which has its own protected reviewers list. For most users of this internal tool, the dev workflows are the only ones they will ever need.
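The dual-tag convention from the backend deploy row can be expressed as a tiny helper. This is a sketch of what the workflow computes, not the workflow itself; the repo URI in the comment is made up.

```typescript
// Build the two tags the backend deploy workflow pushes for every image:
// an immutable :<sha> tag for rollbacks and a moving :latest tag for the
// ECS service to track.
export function imageTags(repoUri: string, sha: string): [string, string] {
  return [`${repoUri}:${sha}`, `${repoUri}:latest`];
}

// e.g. imageTags("<account>.dkr.ecr.eu-west-1.amazonaws.com/backend", "abc1234")
```

The immutable `:<sha>` tag is what makes the rollback procedure at the bottom of this page possible.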
## Deploying a fresh environment
When standing up a brand-new environment from scratch:
1. Bootstrap Pulumi state. Run `infra/setup-pulumi-state.sh` once to create the S3 bucket and DynamoDB lock table that hold Pulumi state.
2. Create the GitHub OIDC role. A small bootstrap CloudFormation stack (or `aws iam` one-liners — see `infra/README.md`) trusts `token.actions.githubusercontent.com` for the `ilities/viberglass` repo. Save the role ARN as the `AWS_ROLE_ARN` secret on the matching GitHub Environment.
3. Deploy `base`. `cd infra/base && pulumi up -s <env>`.
4. Deploy `workers`. `cd infra/workers && pulumi up -s <env>`. This produces ECR repos and IAM roles the platform stack will reference.
5. Build and push worker images. `./infra/workers/scripts/setup-harness-images.sh <env>` (or trigger `deploy-viberators.yml` manually).
6. Deploy `platform`. `cd infra/platform && pulumi up -s <env>`. This stands up RDS, the backend service, the ALB, and Amplify.
7. Set Slack/Pulumi config secrets. Use `pulumi config set --secret slackBotToken ...` and friends so the Slack SSM parameters are populated with real values.
8. Push the first backend image. Run `deploy-backend-<env>.yml` once so the ECS service has a real image to pull.
9. Trigger the first frontend deploy. Run `deploy-frontend-<env>.yml` so Amplify builds against the freshly deployed backend URL.
10. Smoke-test. Hit `<api-domain>/health` (must return 200) and load the Amplify URL in a browser.
## Day-to-day deploys
Once an environment is running, a typical day looks like this:
- Backend code change → merge to `main` → `deploy-backend-dev.yml` runs → new image in ECR → ECS service rolls the new task definition.
- Frontend code change → merge to `main` → `deploy-frontend-dev.yml` runs → Amplify rebuilds and serves the new bundle.
- Worker / Dockerfile change → merge to `main` → `deploy-viberators.yml` runs → new worker images in ECR. Existing Lambda clankers pick up the new image when you click Start on them; ECS clankers pick it up the next time you start or edit them.
- Infrastructure change → PR posts a `pulumi preview` diff → merge to `main` → `pulumi-deploy-dev.yml` applies it. Promote to prod by manually dispatching `pulumi-deploy-prod.yml`.
## Database migrations
Migrations are Kysely files under `apps/platform-backend/src/migrations/`, numbered 001–999. The backend runs `migrate:latest` automatically on startup, so a backend deploy is also a migration deploy. There is no separate migration job. For local development you can also run them by hand with `npm run migrate:latest` (or `migrate:down` to roll back).
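For reference, a migration file in that directory typically follows Kysely's standard `up`/`down` shape. The sketch below is illustrative only: the table and columns are invented, and the filename convention should be checked against the existing migrations.

```typescript
import { Kysely, sql } from "kysely";

// Hypothetical 0XX_create_example_notes.ts; real migrations live in
// apps/platform-backend/src/migrations/ and are numbered 001-999.
export async function up(db: Kysely<any>): Promise<void> {
  await db.schema
    .createTable("example_notes")
    .addColumn("id", "serial", (col) => col.primaryKey())
    .addColumn("body", "text", (col) => col.notNull())
    .addColumn("created_at", "timestamptz", (col) =>
      col.defaultTo(sql`now()`).notNull()
    )
    .execute();
}

export async function down(db: Kysely<any>): Promise<void> {
  await db.schema.dropTable("example_notes").execute();
}
```

Because migrations run on startup, a backend rollback to an older image does not automatically run `down`; write migrations to be backward compatible with the previous release.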
## Amplify specifics
The Amplify app is connected to the GitHub repository. In production it auto-deploys on push to main; in dev the workflow validates the build and Amplify pulls the same commit. The branch is named main regardless of environment — the dev/prod split is one Amplify app per environment, not one branch per environment.
Custom domains (`appDomain` in Pulumi config) are wired through Route53 with ACM certificates issued in `us-east-1` (Amplify requirement). DNS records for the apex are created automatically when `route53ZoneId` is provided.
## Backend specifics
The backend image is built from `apps/platform-backend/Dockerfile.prod`. The build context is the monorepo root so the image can pull `packages/types` and `packages/chat-slack` from the workspace.
The ECS service runs on Fargate. CPU, memory, and desired count are tuned per environment; a sensible starting point is 256 CPU units / 512 MB with desired count 1 and autoscaling between 1 and 3. The backend does little heavy lifting, so these defaults hold up well.
Health checks hit `/health` on the ALB target group. A failing deploy is rolled back by the ECS service deployment circuit breaker.
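Those two safety nets map to two resource settings in the platform stack. The fragment below is a sketch with assumed resource names and placeholder IDs; the real definitions live in the `load-balancer` and `backend-ecs` components.

```typescript
import * as aws from "@pulumi/aws";

// ALB target group health check against the backend's /health endpoint.
const targetGroup = new aws.lb.TargetGroup("backend-tg", {
  port: 3000,
  protocol: "HTTP",
  targetType: "ip", // Fargate tasks register by IP
  vpcId: "vpc-0123example", // normally read from the base stack outputs
  healthCheck: { path: "/health", matcher: "200" },
});

// Deployment circuit breaker: a deploy whose tasks never pass the health
// check is stopped and rolled back to the previous task definition.
const service = new aws.ecs.Service("backend", {
  cluster: "viberglass-dev-backend", // illustrative name
  taskDefinition: "backend:1",       // illustrative revision
  desiredCount: 1,
  launchType: "FARGATE",
  networkConfiguration: {
    subnets: ["subnet-0123example"],       // placeholders; real values come
    securityGroups: ["sg-0123example"],    // from the base stack outputs
  },
  deploymentCircuitBreaker: { enable: true, rollback: true },
});
```

With `rollback: true`, a bad image never needs a manual intervention to recover; ECS re-deploys the last healthy revision on its own.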
## Worker specifics
Workers are not deployed in the traditional sense; their images are pushed to ECR and clankers reference them by URI in `deployment_config`. Clicking Start on a clanker calls `ClankerProvisioningService`, which:
- For Lambda: creates or updates the function (with VPC config from the worker stack), waits for it to reach `Active`, and marks the clanker `active`.
- For ECS: registers a new task definition revision against the worker cluster and marks the clanker `active`.
- For Docker: validates that the local Docker daemon is reachable.
See Clankers for the data model and Clanker Images for the image build pipeline.
## Rollback
- Backend — re-run `deploy-backend-<env>.yml` against an earlier commit, or `aws ecs update-service --task-definition <previous-revision>` to revert the task definition pointer. ECS keeps the previous revisions around indefinitely.
- Frontend — Amplify's web console has a one-click rollback to any previous build for the branch.
- Workers — clankers reference an immutable `<env>-<sha>` tag in `deployment_config`. To roll back, edit the clanker, change `imageUri` to the previous tag, and click Start.
- Pulumi — re-run `pulumi up` against an earlier git commit, or `pulumi stack export` / `pulumi stack import` for surgical state edits. Always preview first.
## See also
- Architecture — the bigger picture.
- Clanker Images — how worker images are built and tagged.
- Secrets Management — how SSM, KMS, and the encrypted database secrets fit together.