Troubleshooting Playbook
Use this playbook to triage common issues before escalating. Capture your findings in the operations log for future reference.
Services Won't Start (Helm + k3s)
-
Check cluster and node health
kubectl get nodeskubectl get pods -Akubectl describe node <node-name>to inspect resource or scheduling issues
-
Verify Helm deployment
helm list -Ato ensure the release deployedhelm status <release-name>to review resource state and notes
-
Inspect failing services
kubectl get pods -n <namespace>kubectl describe pod <pod-name>for events and errorskubectl logs <pod-name> [-c <container-name>]to view logs
-
Check configurations and manifests
helm get values <release-name>for current configuration- Validate YAML files with
kubectl apply --dry-run=client -f <file.yaml>
-
Confirm networking and ports
kubectl get svc -n <namespace>for service exposure- Use
kubectl port-forwardorcurlto test access - Ensure no host-level firewall or port conflict exists
Database Connection Issues
- Verify the database service is running
- Check connection string format
- Confirm network connectivity
SSL Certificate Problems
- Verify certificate paths in application configuration
- Check certificate validity dates
- Ensure the full certificate chain is present and trusted
Environment Health
- Run
kubectl get pods -Ato confirm control-plane and application pods are healthy - Review recent cluster events:
kubectl get events -A --sort-by=.lastTimestamp | tail - Check monitoring dashboards (Prometheus/Grafana) for resource saturation
If pods are crash-looping:
- Describe the pod for error output:
kubectl describe pod <name> -n <namespace> - Inspect container logs:
kubectl logs <name> -n <namespace> --tail=200 - Compare with the last known good deployment manifest
Identity or Login Failures
- Verify Keycloak or identity provider availability and certificate validity
- Confirm OAuth client secrets match values in
values.yaml - Review Player API logs for
401or403responses to identify scope or role changes
Application Availability
- Alloy events stuck in a pending state often indicate Steamfitter or Caster API connectivity issues; verify service endpoints and network policies
- Missing Player views reported by Range Builders commonly originate from misaligned permissions; confirm View Admin or Content View User access
- Instructors unable to launch labs should confirm event templates reference valid Player exercises, Caster directories, and Steamfitter scenarios
Data Integrity
- For PostgreSQL issues, test connectivity using
pg_isready -U <user> -h <host> - Verify backup job status before performing repair operations
- Audit object storage lifecycle policies and recent delete events if artifacts are missing
Escalation Checklist
- ✅ Capture timestamps, affected users, and recent changes
- ✅ Record exact error messages and relevant logs
- ✅ Note mitigation steps attempted and their outcomes
- ✅ Contact the on-call Range Builder or instructional staff when learner-facing content is at risk