System Operations
Infrastructure administrators manage the day-to-day operation of Crucible deployments. This procedural guide covers monitoring, maintenance, performance management, and operational troubleshooting using Kubernetes tooling and Crucible application features.
Operational Scope
System operations for Crucible include:
- Monitoring and alerting for cluster and application health
- Performing maintenance such as updates and backups
- Managing performance and scale through resource tuning
- Responding to incidents and resolving operational issues
Monitoring and Alerting
Monitoring Kubernetes
Monitor Crucible using standard Kubernetes tools:
- kubectl: Command-line inspection and diagnostics (kubectl get pods -A, kubectl describe pod <name>, kubectl logs <pod>)
- Rancher (if installed): Web-based cluster monitoring and management; see the Installation Guide
- Kubernetes Dashboard: Optional web interface for cluster visibility
Monitoring Application Health
Each Crucible application exposes health and status information.
Use Kubernetes tooling to inspect application state:
# Check all pods across namespaces
kubectl get pods -A
# View logs for a specific application
kubectl logs -n <namespace> <pod-name>
# Check pod resource usage
kubectl top pods -A
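In a cluster with many pods, it can help to reduce the listing to the pods that need attention. The following is a minimal sketch, not a Crucible feature: unhealthy_pods is a hypothetical helper name, and it assumes the default column layout of kubectl get pods -A --no-headers.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/name status" for pods that are neither Running nor Completed.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | unhealthy_pods
```

A pod can report Running while its containers are not ready, so the READY column still deserves a look for pods this filter does not flag.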
Monitoring Infrastructure
Track the following cluster metrics:
- CPU and memory utilization per node (kubectl top nodes)
- Disk space on nodes
- PostgreSQL performance and connection counts (via pgAdmin if installed per the Installation Guide)
- Persistent storage usage (via Longhorn UI or Rancher, if installed)
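Node utilization can be checked against a threshold with a small filter. This is a sketch under stated assumptions: hot_nodes is a hypothetical helper, the 80 percent default is an arbitrary example value, and the column layout is assumed to match kubectl top nodes --no-headers.

```shell
# Hypothetical helper: reads `kubectl top nodes --no-headers` output on stdin and
# prints nodes whose CPU% or MEMORY% is at or above a threshold (default 80).
# Assumes the columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
hot_nodes() {
  threshold="${1:-80}"
  awk -v t="$threshold" '{
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)
    if (cpu + 0 >= t + 0 || mem + 0 >= t + 0) print $1, $3, $5
  }'
}

# Usage (requires cluster access and a metrics source such as metrics-server):
#   kubectl top nodes --no-headers | hot_nodes 85
```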
Managing Logs
- Application logs: Access per pod using kubectl logs
- Audit logs: Configure via application Helm values and forward to your Security Information and Event Management (SIEM)
- Kubernetes events: Review with kubectl get events -A
Performing Maintenance
Updating Applications
Update Crucible applications using Helm:
# Update a Crucible application to a new version
helm upgrade <release-name> cmu-sei/<chart-name> \
  --namespace <namespace> \
  -f values.yaml \
  --version <new-version>
Before applying updates:
- Review application release notes
- Test changes in a non-production environment, if available
- Verify database backups are current
- Plan for brief service interruptions during pod restarts
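One way to lower the risk of an update is to render it with --dry-run before applying it. The wrapper below is a sketch, not a Crucible or Helm command: the function name upgrade_with_dry_run, its argument order, and the 10-minute timeout are assumptions made for illustration.

```shell
# Hypothetical wrapper: render the upgrade with --dry-run first, and apply it
# only if rendering succeeds.
upgrade_with_dry_run() {
  release="$1"; chart="$2"; namespace="$3"; version="$4"; values="${5:-values.yaml}"

  # Step 1: check that the chart renders with these values; nothing is changed.
  helm upgrade "$release" "$chart" --namespace "$namespace" \
    -f "$values" --version "$version" --dry-run >/dev/null || return 1

  # Step 2: apply the upgrade and wait for resources to become ready.
  helm upgrade "$release" "$chart" --namespace "$namespace" \
    -f "$values" --version "$version" --wait --timeout 10m
}

# Usage:
#   upgrade_with_dry_run <release-name> cmu-sei/<chart-name> <namespace> <new-version>
# If the new version misbehaves, inspect revisions and roll back:
#   helm history <release-name> --namespace <namespace>
#   helm rollback <release-name> <revision> --namespace <namespace>
```

A dry run catches template and values errors only; it does not replace testing in a non-production environment.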
Maintaining Databases
Perform regular PostgreSQL maintenance:
- Manage databases using pgAdmin (if installed)
- Run VACUUM ANALYZE periodically
- Monitor database growth and active connections
- Automate backups using Kubernetes CronJobs or external backup solutions
Managing Backups and Recovery
Backup Strategy
- Databases: Schedule PostgreSQL backups using pg_dump or equivalent tools
- Persistent volumes: Configure snapshot schedules if using Longhorn
- Configuration: Store Helm values and Kubernetes manifests in version control
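A database backup step might look like the following sketch, suitable for running from a scheduled job. The function name backup_database, the file naming scheme, and the output directory are assumptions; connection settings are assumed to come from the standard PostgreSQL client environment (PGHOST, PGUSER, and PGPASSWORD or a .pgpass file).

```shell
# Hypothetical helper: dump one database to a timestamped file in custom format.
backup_database() {
  db="$1"; out_dir="$2"
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  file="${out_dir}/${db}-${stamp}.dump"

  mkdir -p "$out_dir" || return 1
  # --format=custom produces an archive that pg_restore can read.
  pg_dump --format=custom --file="$file" "$db" || return 1
  # Print the path so a caller can copy the file off-cluster.
  echo "$file"
}

# Usage:
#   backup_database <database-name> /backups
```

The custom format is chosen here because plain SQL dumps cannot be restored with pg_restore; they are replayed with psql instead.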
Recovery Procedures
Restore services as needed:
- Database restores using pg_restore
- Volume restores using storage provider or Longhorn snapshots
- Application recovery using Helm and backed-up values files
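A database restore can be sketched as below, assuming the backup is a custom-format pg_dump archive and the target database already exists. The name restore_database is hypothetical, and the flag choices (--clean, --if-exists, --no-owner) are one reasonable combination rather than a Crucible requirement.

```shell
# Hypothetical helper: verify an archive, then restore it into an existing database.
restore_database() {
  db="$1"; file="$2"

  # Listing the archive's table of contents confirms the file is a readable dump.
  pg_restore --list "$file" >/dev/null || return 1

  # --clean --if-exists drops existing objects before recreating them;
  # --no-owner skips restoring object ownership, which helps when role names differ.
  pg_restore --clean --if-exists --no-owner --dbname="$db" "$file"
}

# Usage:
#   restore_database <database-name> /backups/<backup-file>.dump
```

Because --clean drops existing objects, stop or scale down the application that uses the database before restoring.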
Managing Performance and Scale
Managing Resources
Tune Kubernetes resource requests and limits through Helm values:
# Example resource configuration in Helm values
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi
Monitor usage with kubectl top pods and kubectl top nodes to guide adjustments.
Scaling Applications
Most Crucible services support horizontal scaling:
- Increase replicaCount in Helm values
- Kubernetes distributes load across replicas automatically
- Scale PostgreSQL vertically by adjusting allocated resources
Refer to the Installation Guide for minimum hardware requirements and scaling considerations.
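For a temporary change in capacity, replicas can also be adjusted directly with kubectl. This sketch assumes the service runs as a Kubernetes Deployment; scale_and_wait is a hypothetical helper and the 120-second timeout is an example value.

```shell
# Hypothetical helper: scale a Deployment and wait for the rollout to settle.
scale_and_wait() {
  namespace="$1"; deployment="$2"; replicas="$3"

  kubectl scale "deployment/${deployment}" \
    --namespace "$namespace" --replicas="$replicas" || return 1

  # Blocks until the new replicas are available or the timeout expires.
  kubectl rollout status "deployment/${deployment}" \
    --namespace "$namespace" --timeout=120s
}

# Usage:
#   scale_and_wait <namespace> <deployment-name> 3
```

A manual scale is typically overwritten by the next helm upgrade, so a lasting change belongs in replicaCount in the Helm values.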
Troubleshooting Operations
Addressing Common Issues
For detailed troubleshooting tips, see the Troubleshooting Playbook.
Common operational issues include:
- Pod failures: Check status with kubectl describe pod <name> and review logs
- Database connectivity issues: Verify PostgreSQL pod is running and connection strings are correct in Helm values
- Certificate errors: Verify certificate secrets exist: kubectl get secrets
- Resource exhaustion: Check node and pod resource usage with kubectl top
Using Basic Kubernetes Diagnostic Commands
# Check all pods and their status
kubectl get pods -A
# Describe a specific pod (shows events and issues)
kubectl describe pod -n <namespace> <pod-name>
# View pod logs
kubectl logs -n <namespace> <pod-name>
# Check resource usage
kubectl top nodes
kubectl top pods -A
# View recent cluster events
kubectl get events -A --sort-by='.lastTimestamp'
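When the event stream is noisy, summarizing Warning events by reason can show where to look first. The helper below is a sketch: warning_reasons is a hypothetical name, and it assumes the default columns of kubectl get events -A --no-headers, where the fourth field is the reason.

```shell
# Hypothetical helper: reads Warning events on stdin and prints a count per
# reason, most frequent first.
# Assumes the columns: NAMESPACE LAST-SEEN TYPE REASON OBJECT MESSAGE.
warning_reasons() {
  awk '{ count[$4]++ } END { for (r in count) print count[r], r }' | sort -rn
}

# Usage (requires cluster access):
#   kubectl get events -A --field-selector type=Warning --no-headers | warning_reasons
```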
Managing Security Operations
Monitoring Security
Use the Security and Compliance Checklist to verify:
- Audit log forwarding to Security Information and Event Management (SIEM)
- Failed authentication monitoring
- Network policy enforcement
- Regular access reviews
Responding to Incidents
Follow your organization's incident response procedures. Key steps:
- Identify and contain the incident using Kubernetes tools
- Review application and audit logs for the timeframe
- Use kubectl to isolate affected pods if needed
- Restore from backups if you detect a data integrity compromise
- Document findings and remediation steps
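Before an affected pod is isolated or deleted, its state can be captured for later review. The following is a minimal sketch and not a substitute for your organization's procedures: collect_pod_evidence and the output file names are hypothetical.

```shell
# Hypothetical helper: save a pod's spec, description, and logs to a directory.
collect_pod_evidence() {
  namespace="$1"; pod="$2"; out_dir="$3"
  mkdir -p "$out_dir" || return 1

  kubectl get pod "$pod" --namespace "$namespace" -o yaml \
    > "${out_dir}/${pod}.yaml" || return 1
  kubectl describe pod "$pod" --namespace "$namespace" \
    > "${out_dir}/${pod}.describe.txt" || return 1
  kubectl logs "$pod" --namespace "$namespace" --all-containers --timestamps \
    > "${out_dir}/${pod}.log" || return 1

  # Logs from the previous container instance exist only if a container restarted,
  # so a failure here is ignored.
  kubectl logs "$pod" --namespace "$namespace" --all-containers --previous \
    > "${out_dir}/${pod}.previous.log" 2>/dev/null || true
}

# Usage:
#   collect_pod_evidence <namespace> <pod-name> ./incident-evidence
```

Pod logs are lost when the pod is deleted unless they are forwarded elsewhere, which is why the capture happens first.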