System Operations
Infrastructure administrators are responsible for the ongoing operational management of Crucible deployments. This guide walks through monitoring, maintenance, and troubleshooting using Kubernetes tools and Crucible application features.
Overview
System operations for Crucible include:
- Monitoring and alerting - Kubernetes cluster health and application status
- Maintenance procedures - Updates, backups, and routine tasks
- Performance optimization - Resource management and scaling
- Incident response - Problem identification and resolution
Monitoring and Alerting
Kubernetes Monitoring
Monitor your Crucible deployment using Kubernetes tools:
- kubectl - Command-line status checks: kubectl get pods -A, kubectl describe pod <name>, kubectl logs <pod>
- Rancher (if installed) - Web-based cluster monitoring and management (see installation guide)
- Kubernetes Dashboard - Alternative web interface for cluster monitoring
Application Health
Each Crucible application provides health endpoints. Check application status with:
# Check all pods across namespaces
kubectl get pods -A
# View logs for a specific application
kubectl logs -n <namespace> <pod-name>
# Check pod resource usage
kubectl top pods -A
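On a busy cluster the full pod listing can be long. The following is a minimal sketch, not part of Crucible, that filters the listing down to pods needing attention. The function name unhealthy_pods is hypothetical, and the filter assumes the default kubectl column order.

```shell
# Reads `kubectl get pods -A` output on stdin and prints pods that need attention.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")   # READY is shown as "<ready>/<total>"
    # Report anything that is not Completed and is either not Running
    # or Running with fewer ready containers than expected.
    if ($4 != "Completed" && ($4 != "Running" || ready[1] != ready[2]))
      print $1 "/" $2, $4, $3
  }'
}

# Usage (requires cluster access):
#   kubectl get pods -A | unhealthy_pods
```

Each output line has the form namespace/pod, status, and ready count; an empty result means every pod is either Running with all containers ready or Completed.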
Infrastructure Monitoring
Key metrics to monitor on your Kubernetes cluster:
- CPU and memory utilization per node (kubectl top nodes)
- Disk space on nodes (check via node access or monitoring tools)
- PostgreSQL database performance (via pgAdmin if installed per installation guide)
- Storage usage if using Longhorn (accessible via Rancher or Longhorn UI)
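As an illustration of turning the node metrics above into a simple check, the sketch below flags nodes above a utilization threshold. The function name nodes_over_threshold and the default of 80 percent are assumptions chosen for the example, and kubectl top only works when the Kubernetes Metrics Server is installed.

```shell
# Reads `kubectl top nodes` output on stdin and prints nodes whose CPU% or
# MEMORY% exceeds the given threshold (default 80, an arbitrary example value).
# Assumes the default columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%.
nodes_over_threshold() {
  awk -v limit="${1:-80}" 'NR > 1 {
    cpu = $3; mem = $5
    sub(/%/, "", cpu); sub(/%/, "", mem)   # strip the percent sign
    if (cpu + 0 > limit + 0 || mem + 0 > limit + 0)
      print $1, "cpu=" cpu "%", "mem=" mem "%"
  }'
}

# Usage (requires cluster access and Metrics Server):
#   kubectl top nodes | nodes_over_threshold 80
```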
Log Management
- Application logs - Access via kubectl logs for each pod
- Audit logs - Configure in each Crucible application's Helm values; forward to SIEM as described in security guide
- Kubernetes events - View with kubectl get events -A
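The log and event commands above accept standard kubectl flags that narrow the output. The wrappers below are a hedged sketch: the function names are hypothetical, and the KUBECTL variable exists only so the commands can be previewed with KUBECTL=echo before running them against a cluster.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Logs from the last hour (or a given duration) with timestamps.
recent_logs() {     # usage: recent_logs <namespace> <pod> [since]
  "$KUBECTL" logs -n "$1" "$2" --since="${3:-1h}" --timestamps
}

# Logs from the previous container instance, useful after a crash or restart.
previous_logs() {   # usage: previous_logs <namespace> <pod>
  "$KUBECTL" logs -n "$1" "$2" --previous
}

# Only Warning-type events across all namespaces, oldest first.
warning_events() {
  "$KUBECTL" get events -A --field-selector type=Warning --sort-by=.lastTimestamp
}
```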
Maintenance Procedures
Application Updates
Update Crucible applications by deploying new versions via Helm:
# Update a Crucible application to a new version
helm upgrade <release-name> cmu-sei/<chart-name> \
--namespace <namespace> \
-f values.yaml \
--version <new-version>
Before updating:
- Review release notes in the application's GitHub repository
- Test updates in a non-production environment if available
- Ensure database backups are current
- Plan for brief service interruptions during pod restarts
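One way to support the checklist above is to record the release's current state before upgrading. This is a sketch under assumptions: the function names are hypothetical, the HELM variable is only there to allow a dry run with HELM=echo, and the release and namespace names are whatever your deployment uses.

```shell
HELM="${HELM:-helm}"   # override (e.g. HELM=echo) for a dry run

# Save the values currently applied to a release and show its revision history,
# so there is a known-good reference before upgrading.
pre_upgrade_snapshot() {   # usage: pre_upgrade_snapshot <release> <namespace> <output-file>
  "$HELM" get values "$1" --namespace "$2" --output yaml > "$3" &&
  "$HELM" history "$1" --namespace "$2"
}

# Roll back to an earlier revision number listed by `helm history`.
rollback_release() {       # usage: rollback_release <release> <namespace> <revision>
  "$HELM" rollback "$1" "$3" --namespace "$2"
}
```

A Helm rollback restores Kubernetes resources only; it does not undo database changes made by the newer application version, which is why current database backups still matter.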
Database Maintenance
PostgreSQL databases should be maintained regularly:
- Use pgAdmin (if installed per installation guide) for visual database management
- Run VACUUM ANALYZE periodically to optimize database performance
- Monitor database size and connection counts
- Configure automated backups using Kubernetes CronJobs or external backup solutions
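For clusters without pgAdmin, the maintenance tasks above can be run with psql inside the PostgreSQL pod. The sketch below assumes psql is available in that pod and that the given user can connect locally without a password prompt; the function names, and any pod, user, and database names you pass in, are placeholders rather than Crucible defaults.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Run one SQL statement with psql inside the PostgreSQL pod.
pg_sql() {   # usage: pg_sql <namespace> <pod> <db-user> <database> <sql>
  "$KUBECTL" exec -n "$1" "$2" -- psql -U "$3" -d "$4" -c "$5"
}

# Reclaim dead-row space and refresh planner statistics.
vacuum_analyze()   { pg_sql "$1" "$2" "$3" "$4" "VACUUM ANALYZE;"; }

# Count current server connections (all databases).
connection_count() { pg_sql "$1" "$2" "$3" "$4" "SELECT count(*) FROM pg_stat_activity;"; }

# Show the size of the connected database in human-readable form.
database_size()    { pg_sql "$1" "$2" "$3" "$4" "SELECT pg_size_pretty(pg_database_size(current_database()));"; }
```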
Backup and Recovery
Backup Strategy
- Database backups - Use pg_dump or PostgreSQL backup tools; schedule regular automated backups
- Persistent volume backups - If using Longhorn, configure snapshot schedules through the Longhorn UI
- Configuration backups - Store Helm values files and Kubernetes manifests in version control
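A database backup along these lines could be scripted as follows. This is a sketch, not a supported Crucible tool: it assumes pg_dump is available inside the PostgreSQL pod and that the user can connect without a password prompt, and the function names and file naming scheme are made up for the example.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Build a UTC-timestamped file name of the form <database>-<timestamp>.dump.
backup_name() {       # usage: backup_name <database>
  echo "$1-$(date -u +%Y%m%dT%H%M%SZ).dump"
}

# Dump one database in pg_dump's custom format (-Fc), which pg_restore can read,
# and print the path of the file that was written.
backup_database() {   # usage: backup_database <namespace> <pod> <db-user> <database> <backup-dir>
  out="$5/$(backup_name "$4")"
  "$KUBECTL" exec -n "$1" "$2" -- pg_dump -U "$3" -Fc "$4" > "$out" && echo "$out"
}
```

The same function could be called from a scheduled job; store the resulting files outside the cluster so they survive a storage failure.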
Recovery Procedures
Restore from backups when needed:
- Database restore - Use pg_restore with backup files in pg_dump's custom, directory, or tar format; restore plain SQL dumps with psql
- Volume restore - Use Longhorn snapshot restore or your storage provider's procedures
- Application redeployment - Use Helm with backed-up values files
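A database restore might look like the sketch below. It assumes a custom-format dump file, pg_restore available inside the PostgreSQL pod, and an existing target database; the function name and the KUBECTL dry-run variable are hypothetical conveniences.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Stream a custom-format dump into pg_restore inside the PostgreSQL pod.
# --clean --if-exists drops existing objects before recreating them,
# so the target database is overwritten with the backup's contents.
restore_database() {   # usage: restore_database <namespace> <pod> <db-user> <database> <dump-file>
  "$KUBECTL" exec -i -n "$1" "$2" -- pg_restore -U "$3" -d "$4" --clean --if-exists < "$5"
}
```

Consider stopping or scaling down the application that uses the database before restoring, so it does not write to tables while they are being replaced.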
Performance Optimization
Resource Management
Adjust Kubernetes resource limits and requests in Helm values files:
# Example resource configuration in Helm values
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi
Monitor resource usage with kubectl top pods and kubectl top nodes to inform scaling decisions.
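To judge whether a limit is sized appropriately, it can help to express observed usage as a share of the limit. The helper below is an illustrative sketch with hypothetical function names; it handles only whole-number Mi and Gi quantities, which covers the memory figures kubectl top pods reports and the example limits above.

```shell
# Convert a Kubernetes memory quantity in whole Mi or Gi to MiB.
# Other suffixes and fractional values are not handled in this sketch.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# Print memory usage as a whole-number percentage of the configured limit.
mem_pct_of_limit() {   # usage: mem_pct_of_limit <used> <limit>
  used="$(to_mib "$1")" || return 1
  limit="$(to_mib "$2")" || return 1
  echo $(( used * 100 / limit ))
}
```

For example, mem_pct_of_limit 1536Mi 2Gi prints 75. A pod that sits close to 100 percent of its memory limit risks being OOM-killed, while one far below its request may be over-provisioned.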
Scaling
Most Crucible applications support horizontal scaling:
- Increase replica counts in Helm values, for example replicaCount: 3
- Kubernetes will distribute load across replicas automatically
- Scale databases vertically by adjusting PostgreSQL resource allocations
Refer to the installation guide for minimum hardware requirements and scaling considerations.
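For a temporary change, replicas can also be adjusted directly with kubectl. The sketch below assumes the application runs as a Kubernetes Deployment; the function name and the 120-second timeout are arbitrary choices for the example.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Scale a Deployment and wait for the rollout to finish.
scale_and_wait() {   # usage: scale_and_wait <namespace> <deployment> <replicas>
  "$KUBECTL" scale "deployment/$2" -n "$1" --replicas="$3" &&
  "$KUBECTL" rollout status "deployment/$2" -n "$1" --timeout=120s
}
```

A replica count set this way is replaced by the chart's value on the next helm upgrade, so make lasting changes in the Helm values file.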
Troubleshooting
Common Operational Issues
For detailed troubleshooting procedures, see the Troubleshooting Guide.
Common issues include:
- Pod failures - Check status with kubectl describe pod <name> and review logs
- Database connection issues - Verify PostgreSQL pod is running and connection strings are correct in Helm values
- Certificate errors - Verify certificate secrets exist: kubectl get secrets
- Resource exhaustion - Check node and pod resource usage with kubectl top
Diagnostic Commands
Basic Kubernetes diagnostic commands:
# Check all pods and their status
kubectl get pods -A
# Describe a specific pod (shows events and issues)
kubectl describe pod -n <namespace> <pod-name>
# View pod logs
kubectl logs -n <namespace> <pod-name>
# Check resource usage
kubectl top nodes
kubectl top pods -A
# View recent cluster events
kubectl get events -A --sort-by='.lastTimestamp'
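When escalating an issue, it is often useful to capture the output of these commands in one place. The following sketch collects diagnostics for a single pod into a directory; the function name and file names are hypothetical.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Write describe output, logs, and namespace events for one pod to a directory.
collect_diagnostics() {   # usage: collect_diagnostics <namespace> <pod> <output-dir>
  mkdir -p "$3" || return 1
  "$KUBECTL" describe pod -n "$1" "$2" > "$3/describe.txt"
  "$KUBECTL" logs -n "$1" "$2" --all-containers > "$3/logs.txt"
  # --previous fails when no container has restarted, so ignore errors here.
  "$KUBECTL" logs -n "$1" "$2" --previous > "$3/logs-previous.txt" 2>/dev/null || true
  "$KUBECTL" get events -n "$1" --sort-by=.lastTimestamp > "$3/events.txt"
}
```

Review the collected files for secrets or tokens before sharing them outside your organization.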
Security Operations
Security Monitoring
Review the Security and Compliance Checklist for:
- Audit log forwarding to SIEM
- Failed authentication monitoring
- Network policy enforcement
- Regular access reviews
Incident Response
Follow your organization's incident response procedures. Key steps:
- Identify and contain the incident using Kubernetes tools
- Review application and audit logs for the timeframe
- Use kubectl to isolate affected pods if needed
- Restore from backups if data integrity is compromised
- Document findings and remediation steps
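One possible way to isolate a pod while keeping it available for investigation is to cut its network traffic with a NetworkPolicy. This is a sketch under assumptions, not a Crucible procedure: the quarantine label, the policy name, and the function names are hypothetical, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override (e.g. KUBECTL=echo) for a dry run

# Print a NetworkPolicy that denies all ingress and egress for pods labeled
# quarantine=true in the given namespace (label and policy name are hypothetical).
quarantine_policy() {   # usage: quarantine_policy <namespace>
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: $1
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
EOF
}

# Apply the policy, then label the affected pod so the policy selects it.
quarantine_pod() {   # usage: quarantine_pod <namespace> <pod>
  quarantine_policy "$1" | "$KUBECTL" apply -f - &&
  "$KUBECTL" label pod -n "$1" "$2" quarantine=true --overwrite
}
```

Because the policy lists both policy types and defines no allow rules, a labeled pod can neither receive nor send traffic, while its filesystem and logs remain available for review.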