Troubleshooting

Common failure modes and recovery steps.

Where to look first

Before diving into a specific symptom, check these three places:

  1. Job row in the UI — Open Project → Jobs. A failed job has a status_message that usually pinpoints the layer that broke (clanker provisioning, repo clone, agent CLI, SCM API).
  2. Backend logs — /ecs/viberglass-<env>-backend in CloudWatch. Each request has a correlation id; find the one for the failing job and read forward.
  3. Worker logs — /aws/lambda/viberglass-<env>-worker for Lambda clankers, /ecs/viberglass-<env>-worker for ECS clankers. Filter by the jobId env variable, which the worker logs on every line via Winston metadata.

If you cannot find a worker log group at all, the worker probably failed to start — see "Clanker won't activate" below.
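Once you have the right log group, the per-line Winston metadata makes job-scoped filtering mechanical. A sketch over invented log lines; only the jobId field name is taken from the list above:

```shell
# Write a few sample structured log lines (invented) to filter against.
cat > /tmp/worker.log <<'EOF'
{"level":"info","message":"cloning repository","jobId":"job_123"}
{"level":"info","message":"agent started","jobId":"job_456"}
{"level":"error","message":"agent exited with code 1","jobId":"job_123"}
EOF

# Keep only the lines for the failing job.
grep '"jobId":"job_123"' /tmp/worker.log
```

In the CloudWatch console, filtering on the bare job id string achieves the same narrowing.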

Clanker won't activate

Symptom. You click Start on a clanker and it bounces between deploying and failed. The status banner shows an AWS error.

Common causes.

  • Missing IAM permissions. The backend's task role needs lambda:UpdateFunctionConfiguration (Lambda clankers) or ecs:RegisterTaskDefinition + ecs:DescribeTaskDefinition (ECS clankers). If you customised the role outside Pulumi, re-run pulumi up against the platform stack.
  • Wrong image URI. The clanker references an ECR image that does not exist. Check deployment_config.imageUri (Lambda) or containerImage (ECS). Re-run deploy-viberators.yml to make sure the image was pushed.
  • Lambda VPC config mismatch. The function references a subnet or security group that no longer exists. Re-deploy the workers stack so the IDs match.
  • ECS task definition validation error. Usually missing CPU/memory pair, or a secrets ARN that the execution role can't decrypt.

Recovery. Read clankers.status_message and the AWS API error in the backend log. Fix the underlying config (or the IAM role), then click Start again. The provisioning service is idempotent.
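The IAM actions from the first cause can be granted in a single policy statement. A sketch, not the exact Pulumi-managed policy; Resource is left as * because RegisterTaskDefinition typically does not support resource-level ARNs, so narrow the Lambda action to your function ARNs where you can:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ClankerProvisioning",
      "Effect": "Allow",
      "Action": [
        "lambda:UpdateFunctionConfiguration",
        "ecs:RegisterTaskDefinition",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    }
  ]
}
```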

Agent boots but immediately fails

Symptom. The job goes from queued → active → failed within seconds. Worker log shows the agent CLI exiting with a non-zero code and a message about authentication.

Cause. The agent's API key secret is missing or stale.

Recovery.

  1. Open the clanker and check the Secrets list — it must include the secret that holds the agent's API key (ANTHROPIC_API_KEY for Claude Code, OPENAI_API_KEY for Codex/OpenCode, etc.).
  2. Open the Secrets page and confirm the secret has the right backend (env/database/ssm) and that the path resolves. For ssm secrets, click Test if available, or use aws ssm get-parameter --with-decryption --name /viberator/secrets/<name> from a shell that has the backend role.
  3. Re-run the job from the ticket.

Repository clone fails

Symptom. Worker log shows simple-git failing on git clone, usually with Authentication failed or Repository not found.

Cause. The integration credential for the project's SCM integration is missing, expired, or scoped too narrowly.

Recovery.

  1. Open Project → Integrations and pick the SCM integration (GitHub by default).
  2. Verify the linked credential is selected and that the underlying secret still exists. Rotate it if it might have expired (GitHub PATs expire silently).
  3. For organisation-owned repos, make sure the token has access to the specific repo, not just public ones.
  4. Save and re-run the job.
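A quick pre-flight with git itself tells you whether the credential can see the repo before you re-run the job. A sketch; the x-access-token URL shape applies to GitHub PATs:

```shell
# Returns a human-readable verdict instead of simple-git's raw error.
# For GitHub PATs, pass: https://x-access-token:<PAT>@github.com/<owner>/<repo>.git
check_clone_access() {
  if git ls-remote "$1" HEAD >/dev/null 2>&1; then
    echo "credential can see the repo"
  else
    echo "clone would fail: authentication failed or repository not found"
  fi
}
```

ls-remote exercises the same authentication path as a clone without fetching any objects, so it is a cheap check to script into a rotation runbook.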

Worker times out

Symptom. Job status is failed with a timeout message. Worker log ends abruptly.

Causes and recovery.

  • Lambda 15-minute ceiling. Lambda functions cannot run longer than 15 minutes. If your jobs need longer, switch the clanker's deployment strategy to ECS. The clanker editor lets you do this without recreating the clanker.
  • ECS task exceeded workerSettings.maxExecutionTime. Bump the value on the project settings page and retry. There is no global cap, but the ECS task definition's stopTimeout is the upper bound on graceful shutdown.
  • Network stall. The agent reached out to a model provider that is rate-limiting or down. Check the worker log around the timeout for HTTP errors. Retrying after a few minutes is usually enough.
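The stopTimeout mentioned in the second bullet lives on the container definition, not in project settings. A fragment of the shape ECS expects; the names and the 120-second value are illustrative:

```json
{
  "family": "viberglass-worker",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "<ecr-image-uri>",
      "essential": true,
      "stopTimeout": 120
    }
  ]
}
```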

Phase document never gets approved

Symptom. A phase finishes with the document in approval_requested, but the Approve & Continue button does nothing.

Causes and recovery.

  • Document not actually approved. The execution route refuses to launch the next phase while the current document is not in approved. Anybody with project access can approve (it does not have to be the assignee), but the audit log records the actual user.
  • Workflow override left over. Look at the ticket's workflow_override_* columns in the database (or the Override badge in the UI). If a previous override is in place, the system may be skipping the approval gate entirely on subsequent runs. Clear the override or re-approve manually.
  • Slack approval card stuck. If the approval was triggered from Slack but the bot lost the thread mapping (e.g. backend restart before the bridge re-attached), use the web UI to approve instead.

PR creation fails

Symptom. The execution job pushes commits successfully but the SCM API rejects the PR creation. Worker log shows a 4xx from the SCM provider.

Causes and recovery.

  • Missing pull_request: write permission. Update the GitHub PAT or app installation to include PR write. Re-run the job.
  • Branch already has an open PR with the same head. Either close the existing PR or change the project's branchNameTemplate so it includes a unique component (e.g. {{ ticket }}-{{ clanker }}).
  • Protected base branch. The PR base branch requires reviews / status checks before merge. The PR is still created — the worker will report success once the call goes through. If the PR is rejected outright (e.g. branch protection forbids the source), open the project's SCM Execution settings and pick a different base.
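To see why a unique component avoids the duplicate-head rejection, here is a sketch of how a template like the one in the second bullet might render; the placeholder names come from the example, the values are invented:

```shell
# Render a branchNameTemplate by substituting placeholders (bash string replace).
template='{{ ticket }}-{{ clanker }}'
ticket='TKT-142'       # hypothetical ticket id
clanker='claude-code'  # hypothetical clanker name

branch="${template//"{{ ticket }}"/$ticket}"
branch="${branch//"{{ clanker }}"/$clanker}"
echo "$branch"  # TKT-142-claude-code
```

Two runs against the same ticket with different clankers now produce different head branches, so both PRs can be open at once.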

Backend won't start

Symptom. ECS service is in a deploy loop, the new task definition revision keeps reaching STOPPED.

Causes and recovery.

  • Migrations failed. The backend runs migrate:latest on startup. If a migration throws, the process exits and ECS retries. Read the CloudWatch log group for the migration error, fix the SQL or roll the bad migration back, then redeploy.
  • Missing env var. Usually SECRETS_ENCRYPTION_KEY or DATABASE_URL. The backend logs the missing variable on startup. Update the SSM parameter, then either wait for the next deploy or force a new deployment with aws ecs update-service --force-new-deployment.
  • RDS connection refused. Check the RDS security group still allows the backend SG on 5432. The base stack creates this rule, so a manual edit in the AWS console is the most likely culprit.
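The missing-variable case is easy to reproduce locally. A shell sketch of an equivalent pre-flight check; the variable names come from the bullet above:

```shell
# Fail fast when a required environment variable is absent.
missing=0
for var in SECRETS_ENCRYPTION_KEY DATABASE_URL; do
  if [ -z "${!var:-}" ]; then
    echo "missing required env var: $var" >&2
    missing=1
  fi
done
# a real entrypoint would exit non-zero here: exit "$missing"
```

Running this in an ECS Exec shell (or locally against the task's env) tells you which SSM parameter to fix before forcing the new deployment.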

Slack thread stops mirroring

Symptom. A Slack thread that was bridged to a session goes silent after the backend restarts.

Recovery. The bridge is supposed to reattach automatically by reading chat_session_threads on boot. If it does not:

  1. Check the backend log for ChatSessionBridgeService errors at startup.
  2. If the bridge is stuck, post any reply in the thread — threadReply will rehydrate the binding.
  3. As a last resort, re-link via the web UI (open the session, click Unlink then Link Slack thread).

Custom webhook deliveries piling up as failed

Symptom. webhook_delivery_attempts has a growing tail of status = failed rows.

Causes and recovery.

  • Signature mismatch. The shared secret in the webhook integration must match the one the upstream system uses. Rotate both ends and re-deliver any test events.
  • Schema mismatch. Custom webhooks expect signed JSON. If the upstream is sending form-encoded payloads or unsigned JSON, switch the integration to "Custom" with signing disabled (or move to a provider-specific integration).
  • Outbound destination unreachable. For custom outbound webhooks the platform retries with exponential backoff and records each attempt. Check the destination URL and firewall rules.
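For the signature-mismatch case, recomputing the signature by hand over a captured delivery body shows which side drifted. A sketch with openssl; the hex-encoded HMAC-SHA256 scheme is an assumption, so match it to your integration's actual signing config:

```shell
secret='example-shared-secret'       # hypothetical shared secret
body='{"event":"ticket.updated"}'    # hypothetical captured delivery body

# HMAC-SHA256 over the raw body, hex-encoded; compare against the signature
# value the other side computed.
sig="$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$secret" | sed 's/^.* //')"
echo "$sig"
```

If the value you compute matches one end but not the other, rotate the secret on both ends as described above and re-deliver a test event.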

Frontend shows "Network error"

Symptom. The Amplify-hosted frontend can't reach the backend.

Causes and recovery.

  • CORS misconfigured. The backend's allowedOrigins is set from appDomain (or the Amplify default domain) at deploy time. If you renamed the custom domain, re-run pulumi up against the platform stack.
  • ALB target unhealthy. Look at the target group health in the AWS console. Usually means the backend tasks are flapping — see "Backend won't start".
  • HTTPS certificate expired. ACM should auto-renew, but if you provided the cert manually you need to rotate it and update the ALB listener. Pulumi-managed certs renew transparently.

When all else fails

  • Capture the failing jobs.id, the ticket id, and the time window.
  • Pull both backend and worker CloudWatch logs for that window.
  • Read the latest non-empty clankers.status_message, tickets.workflow_override_reason, and webhook_delivery_attempts.error_message rows.
  • Open an issue in the viberator repo with the captured context. The platform never deletes job rows, so a retry is always safe — the worst case is a duplicate run.