Troubleshooting Playbook

Use this playbook to triage common issues before escalating. Capture your findings in the operations log for future reference.

Common Issues

Services Won't Start (Helm + k3s)

Check cluster and node health
- kubectl get nodes to confirm every node reports Ready.
- kubectl get pods -A to spot pods that are not Running or Completed.
- kubectl describe node <node-name> to inspect resource or scheduling issues.
Verify Helm deployment
- helm list -A to confirm the release is deployed.
- helm status <release-name> -n <namespace> to see resource state and notes.
Inspect failing services
- kubectl get pods -n <namespace>
- kubectl describe pod <pod-name> -n <namespace> for events and errors.
- kubectl logs <pod-name> -n <namespace> [-c <container-name>] to view logs.
Check configurations and manifests
- helm get values <release-name> -n <namespace> for the current config.
- Validate any YAML files with kubectl apply --dry-run=client -f <file.yaml>.
Confirm networking and ports
- kubectl get svc -n <namespace> for service exposure.
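- kubectl port-forward svc/<service-name> <local-port>:<service-port> -n <namespace>, then curl http://localhost:<local-port> to test access.
- Ensure no host-level firewall rule or port conflict blocks the service.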
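
The pod checks above can be narrowed to just the pods that need attention. The following is a minimal sketch, not a definitive health check: the function name unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# List pods in any namespace whose STATUS is neither Running nor Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods -A --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 " " $4 }'
}
```

Call unhealthy_pods with no arguments. A pod can report Running while its containers are not ready, so an empty result does not replace the describe and logs steps.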

Run kubectl get pods -A to confirm control-plane and application pods are healthy.
Check cluster events: kubectl get events -A --sort-by=.lastTimestamp | tail.
Review monitoring dashboards (Prometheus/Grafana) for resource saturation.
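
On a busy cluster, Normal events can crowd out the ones that matter. As a hedged sketch, the event check can be limited to warnings with a field selector. The function name recent_warnings and the default of 20 lines are assumptions for illustration.

```shell
# Show the most recent Warning events across all namespaces.
# $1: number of lines to show (defaults to 20)
recent_warnings() {
  kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp |
    tail -n "${1:-20}"
}
```

For example, recent_warnings 50 shows the last 50 lines of warning events.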

If pods are crash-looping:

Describe the pod for error output: kubectl describe pod <name> -n <namespace>.
Inspect container logs: kubectl logs <name> -n <namespace> --tail=200.
Compare the running manifest with the last known good deployment manifest.
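
The crash-loop steps above can be sketched as two helpers. This assumes the workload is deployed through Helm, so earlier revisions are available from helm history <release-name> -n <namespace>. The function names previous_logs and diff_revisions are hypothetical.

```shell
# Logs from the previous (crashed) container instance of a pod.
# $1: pod name, $2: namespace
previous_logs() {
  kubectl logs "$1" -n "$2" --previous --tail=200
}

# Diff the rendered manifests of two Helm revisions of one release.
# $1: release, $2: namespace, $3: last known good revision, $4: current revision
diff_revisions() {
  old="$(mktemp)"; new="$(mktemp)"
  helm get manifest "$1" -n "$2" --revision "$3" > "$old"
  helm get manifest "$1" -n "$2" --revision "$4" > "$new"
  status=0
  diff -u "$old" "$new" || status=$?
  rm -f "$old" "$new"
  return "$status"   # 0 when identical, 1 when the revisions differ
}
```

Lines prefixed with - come from the last known good revision and lines prefixed with + from the current one.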

Verify Keycloak/IdP availability and certificate validity.
Confirm OAuth client secrets match the configuration in values.yaml.
Review Player API logs for 401/403 responses to determine whether scope assignments changed.
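
Certificate validity can be checked from any host that can reach the IdP. A minimal sketch, assuming the IdP serves TLS directly on the given host and port: the function name cert_valid_for and the seven-day default window are illustrative choices, not part of the deployment.

```shell
# Succeed only if the certificate served at host:port remains valid
# for at least the given number of seconds.
# $1: host, $2: port (default 443), $3: required validity in seconds (default 7 days)
cert_valid_for() {
  host="$1"; port="${2:-443}"; window="${3:-604800}"
  openssl s_client -connect "$host:$port" -servername "$host" </dev/null 2>/dev/null |
    openssl x509 -noout -checkend "$window"
}
```

For example, cert_valid_for <idp-host> 443 86400 fails if the certificate expires within a day or cannot be retrieved at all.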

Alloy events stuck in a pending state often indicate Steamfitter or Caster API connectivity problems. Check the service endpoints and network policies in the affected namespace.
Range Builder reports of missing Player views commonly originate from misaligned permissions. Validate the affected team's View Admin or Content View User access.
Instructors unable to launch labs should confirm the event template still references valid Player exercises, Caster directories, and Steamfitter scenarios.
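
For the connectivity check, a service with no backing endpoints is a frequent cause. The sketch below assumes the default kubectl get endpoints columns (NAME, ENDPOINTS, AGE). The function name services_without_endpoints is hypothetical.

```shell
# Print services in a namespace that currently have no ready endpoints.
# $1: namespace
services_without_endpoints() {
  kubectl get endpoints -n "$1" --no-headers |
    awk '$2 == "<none>" { print $1 }'
}
```

Pair the result with kubectl get networkpolicy -n <namespace> to see whether a policy could be blocking traffic between the services.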

For PostgreSQL incidents, use pg_isready -U <user> -h <host> to test connectivity.
Review backup job status to ensure a fallback snapshot exists before performing repair operations.
If object storage artifacts go missing, audit bucket lifecycle policies and recent delete events.
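
pg_isready reports its result through its exit status, which is worth recording in the operations log. A minimal sketch that translates the documented exit codes into text follows. The function name pg_status and the default port 5432 are assumptions.

```shell
# Translate the pg_isready exit status into a readable message.
# $1: host, $2: user, $3: port (default 5432)
pg_status() {
  rc=0
  pg_isready -q -h "$1" -U "$2" -p "${3:-5432}" || rc=$?
  case "$rc" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unexpected exit status $rc" ;;
  esac
  return "$rc"
}
```

For example, pg_status <host> <user> prints one line and returns the same exit status as pg_isready.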

Capture timestamps, affected users, and recent changes.
Note the exact error messages or logs collected.
Reference mitigation steps attempted and their outcomes.
Page the on-call Range Builder or teaching staff when learner-facing content is at risk.