Troubleshooting
Common failure modes and recovery steps.
Where to look first
Before diving into a specific symptom, check these three places:
- Job row in the UI — Open Project → Jobs. A failed job has a `status_message` that usually pinpoints the layer that broke (clanker provisioning, repo clone, agent CLI, SCM API).
- Backend logs — `/ecs/viberglass-<env>-backend` in CloudWatch. Each request has a correlation id; find the one for the failing job and read forward.
- Worker logs — `/aws/lambda/viberglass-<env>-worker` for Lambda clankers, `/ecs/viberglass-<env>-worker` for ECS clankers. Filter by the `jobId` env variable, which the worker logs on every line via Winston metadata.
If you cannot find a worker log group at all, the worker probably failed to start — see "Clanker won't activate" below.
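Because each worker line carries the job id in its Winston metadata, a downloaded or exported log file can be narrowed to one job with a few lines. A minimal sketch — the exact log-line shape (JSON with a top-level `jobId` field) is an assumption; adjust to what your worker actually emits:

```typescript
// Sketch: filter exported CloudWatch log lines down to a single job.
// Assumes each line is a JSON object with a `jobId` field from the
// Winston metadata; non-JSON lines (e.g. Lambda START/END markers) are skipped.
function linesForJob(rawLog: string, jobId: string): string[] {
  return rawLog.split("\n").filter((line) => {
    try {
      return JSON.parse(line).jobId === jobId;
    } catch {
      return false; // not JSON — skip
    }
  });
}
```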
Clanker won't activate
Symptom. You click Start on a clanker and it bounces between deploying and failed. The status banner has an AWS error.
Common causes.
- Missing IAM permissions. The backend's task role needs `lambda:UpdateFunctionConfiguration` (Lambda clankers) or `ecs:RegisterTaskDefinition` + `ecs:DescribeTaskDefinition` (ECS clankers). If you customised the role outside Pulumi, re-run `pulumi up` against the platform stack.
- Wrong image URI. The clanker references an ECR image that does not exist. Check `deployment_config.imageUri` (Lambda) or `containerImage` (ECS). Re-run `deploy-viberators.yml` to make sure the image was pushed.
- Lambda VPC config mismatch. The function references a subnet or security group that no longer exists. Re-deploy the workers stack so the IDs match.
- ECS task definition validation error. Usually a missing CPU/memory pair, or a `secrets` ARN that the execution role can't decrypt.
Recovery. Read `clankers.status_message` and the AWS API error in the backend log. Fix the underlying config (or the IAM role), then click Start again. The provisioning service is idempotent.
Agent boots but immediately fails
Symptom. The job goes from queued → active → failed within seconds. Worker log shows the agent CLI exiting with a non-zero code and a message about authentication.
Cause. The agent's API key secret is missing or stale.
Recovery.
- Open the clanker and check the Secrets list — it must include the secret that holds the agent's API key (`ANTHROPIC_API_KEY` for Claude Code, `OPENAI_API_KEY` for Codex/OpenCode, etc.).
- Open the Secrets page and confirm the secret has the right backend (`env`/`database`/`ssm`) and that the path resolves. For `ssm` secrets, click Test if available, or use `aws ssm get-parameter --with-decryption --name /viberator/secrets/<name>` from a shell that has the backend role.
- Re-run the job from the ticket.
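The lookup behind that check amounts to a dispatch on the backend followed by a path resolution. A rough sketch of the shape — the function name and union type are hypothetical, not the real service, and only the `env` branch is implemented here:

```typescript
// Hypothetical sketch of secret resolution across the three backends.
type SecretBackend = "env" | "database" | "ssm";

function resolveSecret(
  backend: SecretBackend,
  path: string,
  env: Record<string, string | undefined> = process.env,
): string {
  switch (backend) {
    case "env": {
      const value = env[path];
      if (value === undefined) throw new Error(`env secret ${path} is not set`);
      return value;
    }
    case "database":
    case "ssm":
      // The real platform reads these from the DB / SSM with decryption;
      // see the aws ssm get-parameter command above for the SSM path.
      throw new Error(`${backend} resolution not shown in this sketch`);
  }
}
```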
Repository clone fails
Symptom. Worker log shows `simple-git` failing on `git clone`, usually with `Authentication failed` or `Repository not found`.
Cause. The integration credential for the project's SCM integration is missing, expired, or scoped too narrowly.
Recovery.
- Open Project → Integrations and pick the SCM integration (GitHub by default).
- Verify the linked credential is selected and that the underlying secret still exists. Rotate it if it might have expired (GitHub PATs expire silently).
- For organisation-owned repos, make sure the token has access to the specific repo, not just public ones.
- Save and re-run the job.
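For HTTPS clones the credential typically ends up embedded in the clone URL. A sketch of that shape — the helper name is hypothetical, and the real worker may wire credentials differently:

```typescript
// Hypothetical helper: build an HTTPS clone URL with an embedded token.
// GitHub ignores the username for PATs; `x-access-token` is the
// conventional username for app installation tokens.
function cloneUrlWithToken(repoUrl: string, token: string): string {
  const url = new URL(repoUrl);
  url.username = "x-access-token";
  url.password = token;
  return url.toString();
}
```

Note the token is part of the URL, so it must never be logged verbatim — a reason clone errors in worker logs often redact the remote.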
Worker times out
Symptom. Job status is failed with a timeout message. Worker log ends abruptly.
Causes and recovery.
- Lambda 15-minute ceiling. Lambda functions cannot run longer than 15 minutes. If your jobs need longer, switch the clanker's deployment strategy to ECS. The clanker editor lets you do this without recreating the clanker.
- ECS task `workerSettings.maxExecutionTime` exceeded. Bump the project's `workerSettings.maxExecutionTime` (in the project settings page) and retry. There is no global cap, but the ECS task definition's `stopTimeout` is the upper bound on graceful shutdown.
- Network stall. The agent reached out to a model provider that is rate-limiting or down. Check the worker log around the timeout for HTTP errors. Retrying after a few minutes is usually enough.
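The "retry after a few minutes" advice for rate-limited providers is the classic exponential-backoff pattern. A minimal generic sketch — the attempt count and delays are illustrative assumptions, not platform defaults:

```typescript
// Generic retry with exponential backoff, as a sketch of how a call to a
// rate-limiting provider could be wrapped. Delays double each attempt.
async function withBackoff<T>(
  fn: () => Promise<T>,
  attempts = 4,
  baseMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // wait 1s, 2s, 4s, ... before the next attempt
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i));
    }
  }
  throw lastError;
}
```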
Phase document never gets approved
Symptom. A phase finishes with the document in `approval_requested`, but the Approve & Continue button does nothing.
Causes and recovery.
- The document is still not in `approved`. The execution route refuses to launch the next phase until the current document reaches `approved`. Anybody with project access can approve, but the audit log records the actual user.
- Workflow override left over. Look at the ticket's `workflow_override_*` columns in the database (or the Override badge in the UI). If a previous override is in place, the system may be skipping the approval gate entirely on subsequent runs. Clear the override or re-approve manually.
- Slack approval card stuck. If the approval was triggered from Slack but the bot lost the thread mapping (e.g. a backend restart before the bridge re-attached), use the web UI to approve instead.
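The gate described above reduces to two checks: is the document approved, and is an override skipping the gate. A simplified approximation — the field names mirror the columns mentioned above, but the logic is a sketch, not the real route handler:

```typescript
// Simplified approximation of the phase-launch gate: the next phase only
// starts when the document is approved, unless an override skips the gate.
interface PhaseDoc {
  status: "draft" | "approval_requested" | "approved";
}
interface Ticket {
  workflowOverrideReason?: string; // mirrors the workflow_override_* columns
}

function canLaunchNextPhase(doc: PhaseDoc, ticket: Ticket): boolean {
  if (ticket.workflowOverrideReason) return true; // override skips the gate
  return doc.status === "approved";
}
```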
PR creation fails
Symptom. The execution job pushes commits successfully but the SCM API rejects the PR creation. Worker log shows a 4xx from the SCM provider.
Causes and recovery.
- Missing `pull_request: write` permission. Update the GitHub PAT or app installation to include PR write. Re-run the job.
- Branch already has an open PR with the same head. Either close the existing PR or change the project's `branchNameTemplate` so it includes a unique component (e.g. `{{ ticket }}-{{ clanker }}`).
- Protected base branch. If the base branch only requires reviews or status checks before merge, the PR is still created and the worker reports success. If the PR is rejected outright (e.g. branch protection forbids the source), open the project's SCM Execution settings and pick a different base.
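How a template like `{{ ticket }}-{{ clanker }}` expands can be sketched with simple substitution. This is an assumption about the renderer — the real one may support more variables or stricter escaping:

```typescript
// Sketch of branchNameTemplate expansion: replace each {{ name }} with the
// matching value. Unknown variables are left in place so typos stay visible.
function renderBranchName(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (match, name) =>
    name in vars ? vars[name] : match,
  );
}
```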
Backend won't start
Symptom. The ECS service is in a deploy loop; the new task definition revision keeps reaching `STOPPED`.
Causes and recovery.
- Migrations failed. The backend runs `migrate:latest` on startup. If a migration throws, the process exits and ECS retries. Read the CloudWatch log group for the migration error, fix the SQL or roll the bad migration back, then redeploy.
- Missing env var. Usually `SECRETS_ENCRYPTION_KEY` or `DATABASE_URL`. The backend logs the missing variable on startup. Update the SSM parameter, then either wait for the next deploy or force a new deployment with `aws ecs update-service --force-new-deployment`.
- RDS connection refused. Check the RDS security group still allows the backend SG on 5432. The base stack creates this rule, so a manual edit in the AWS console is the most likely culprit.
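The missing-env-var failure is a fail-fast check at boot. A sketch of that pattern — the two variable names come from the bullet above; the helper itself is illustrative, not the backend's actual code:

```typescript
// Fail-fast sketch: verify required env vars before the server boots,
// reporting every missing name at once rather than one per restart.
function assertRequiredEnv(
  env: Record<string, string | undefined>,
  required: string[] = ["SECRETS_ENCRYPTION_KEY", "DATABASE_URL"],
): void {
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`missing required env vars: ${missing.join(", ")}`);
  }
}
```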
Slack thread stops mirroring
Symptom. A Slack thread that was bridged to a session goes silent after the backend restarts.
Recovery. The bridge is supposed to reattach automatically by reading `chat_session_threads` on boot. If it does not:
- Check the backend log for `ChatSessionBridgeService` errors at startup.
- If the bridge is stuck, post any reply in the thread — `threadReply` will rehydrate the binding.
- As a last resort, re-link via the web UI (open the session, click Unlink, then Link Slack thread).
Custom webhook deliveries piling up as failed
Symptom. `webhook_delivery_attempts` has a growing tail of `status = failed` rows.
Causes and recovery.
- Signature mismatch. The shared secret in the webhook integration must match the one the upstream system uses. Rotate both ends and re-deliver any test events.
- Schema mismatch. Custom webhooks expect signed JSON. If the upstream is sending form-encoded payloads or unsigned JSON, switch the integration to "Custom" with signing disabled (or move to a provider-specific integration).
- Outbound destination unreachable. For custom outbound webhooks the platform retries with exponential backoff and records each attempt. Check the destination URL and firewall rules.
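Signature mismatches are easiest to debug by recomputing the HMAC on both ends with the same secret and payload bytes. A sketch using Node's crypto — the HMAC-SHA256 algorithm and hex encoding are assumptions about the signing scheme; check your integration's actual digest format:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sketch: verify an HMAC-SHA256 webhook signature in constant time.
// Assumes a hex-encoded digest over the raw request body.
function verifySignature(payload: string, secret: string, signature: string): boolean {
  const expected = createHmac("sha256", secret).update(payload).digest("hex");
  if (expected.length !== signature.length) return false; // timingSafeEqual needs equal lengths
  return timingSafeEqual(Buffer.from(expected), Buffer.from(signature));
}
```

If both ends produce different digests for the same body, the usual culprits are re-serialised JSON (whitespace or key order changed in transit) or mismatched secrets after a rotation.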
Frontend shows "Network error"
Symptom. The Amplify-hosted frontend can't reach the backend.
Causes and recovery.
- CORS misconfigured. The backend's `allowedOrigins` is set from `appDomain` (or the Amplify default domain) at deploy time. If you renamed the custom domain, re-run `pulumi up` against the platform stack.
- ALB target unhealthy. Look at the target group health in the AWS console. Usually means the backend tasks are flapping — see "Backend won't start".
- HTTPS certificate expired. ACM should auto-renew, but if you provided the cert manually you need to rotate it and update the ALB listener. Pulumi-managed certs renew transparently.
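The CORS failure above comes down to an allow-list membership test against the request's Origin header. A sketch — the exact-match behaviour is an assumption about how `allowedOrigins` is consulted:

```typescript
// Sketch: decide whether a request Origin is allowed, mirroring an
// allowedOrigins list derived from appDomain at deploy time.
function isOriginAllowed(origin: string, allowedOrigins: string[]): boolean {
  // Exact match only — a renamed custom domain therefore fails
  // until the platform stack is re-deployed with the new appDomain.
  return allowedOrigins.includes(origin);
}
```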
When all else fails
- Capture the failing `jobs.id`, the ticket id, and the time window.
- Pull both backend and worker CloudWatch logs for that window.
- Read the latest non-empty `clankers.status_message`, `tickets.workflow_override_reason`, and `webhook_delivery_attempts.error_message` rows.
- Open an issue in the viberator repo with the captured context. The platform never deletes job rows, so a retry is always safe — the worst case is a duplicate run.