System Operations

Infrastructure administrators are responsible for the ongoing operational management of Crucible deployments. This guide walks through monitoring, maintenance, and troubleshooting using Kubernetes tools and Crucible application features.

Overview

System operations for Crucible include:

Monitoring and alerting - Kubernetes cluster health and application status
Maintenance procedures - Updates, backups, and routine tasks
Performance optimization - Resource management and scaling
Incident response - Problem identification and resolution

Monitoring and Alerting

Kubernetes Monitoring

Monitor your Crucible deployment using Kubernetes tools:

kubectl: Command-line status checks, for example kubectl get pods -A, kubectl describe pod <name>, and kubectl logs <pod>
Rancher (if installed): Web-based cluster monitoring and management (see the Installation Guide)
Kubernetes Dashboard: Alternative web interface for cluster monitoring

Application Health

Each Crucible application provides health endpoints. At the cluster level, check pod status, logs, and resource usage with:

# Check all pods across namespaces
kubectl get pods -A

# View logs for a specific application
kubectl logs -n <namespace> <pod-name>

# Check pod resource usage (requires the Kubernetes Metrics Server)
kubectl top pods -A
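
On a busy cluster the full pod listing can be long, so it may help to filter it down to pods that need attention. The following is a minimal sketch, not part of Crucible itself: the function names are hypothetical, and the second function assumes the default column layout of kubectl get pods -A (RESTARTS in the fifth column).

```shell
# List pods whose phase is neither Running nor Succeeded.
# This filters on pod phase only; a pod in CrashLoopBackOff can still
# report phase Running, so the restart check below is a useful complement.
list_unhealthy_pods() {
  kubectl get pods -A \
    --field-selector='status.phase!=Running,status.phase!=Succeeded'
}

# List pods with at least one container restart as namespace/pod restarts=N.
# Assumes the default column order: NAMESPACE NAME READY STATUS RESTARTS AGE.
list_restarting_pods() {
  kubectl get pods -A --no-headers \
    | awk '$5 + 0 > 0 { print $1 "/" $2 " restarts=" $5 }'
}
```

Load the functions into a shell session, then run list_unhealthy_pods or list_restarting_pods as a quick first pass before inspecting individual pods.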

Infrastructure Monitoring

Key metrics to monitor on your Kubernetes cluster:

CPU and memory utilization per node (kubectl top nodes)
Disk space on nodes (check via node access or monitoring tools)
PostgreSQL database performance (via pgAdmin if installed per the Installation Guide)
Storage usage if using Longhorn (accessible via Rancher or Longhorn UI)

Log Management

Application logs - Access via kubectl logs for each pod
Audit logs - Configure in each Crucible application's Helm values; forward them to your Security Information and Event Management (SIEM) system as described in the security guide
Kubernetes events - View with kubectl get events -A
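
The log and event sources above can be narrowed with standard kubectl flags. This is a small sketch with hypothetical function names; the one-hour default window is an arbitrary assumption.

```shell
# Show recent logs for a pod with timestamps.
# Arguments: namespace, pod name, optional time window (default 1h).
recent_logs() {
  kubectl logs -n "$1" "$2" --since="${3:-1h}" --timestamps
}

# Show logs from the previous container instance, useful after a crash.
# Arguments: namespace, pod name.
previous_logs() {
  kubectl logs -n "$1" "$2" --previous
}

# List only Warning events across all namespaces, sorted by last timestamp.
warning_events() {
  kubectl get events -A --field-selector type=Warning \
    --sort-by='.lastTimestamp'
}
```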

Maintenance Procedures

Application Updates

Update Crucible applications by deploying new versions via Helm:

# Update a Crucible application to a new version
helm upgrade <release-name> cmu-sei/<chart-name> \
  --namespace <namespace> \
  -f values.yaml \
  --version <new-version>

Before updating:

Review release notes in the application's GitHub repository
Test updates in a non-production environment if available
Ensure database backups are current
Plan for brief service interruptions during pod restarts
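
As one way to act on the checklist above, the sketch below records the release history, renders the upgrade without applying it, and rolls back to a known revision if the upgrade misbehaves. The function names are hypothetical, and values.yaml is assumed to be in the current directory as in the upgrade command shown earlier.

```shell
# Show the most recent revisions so there is a known rollback target.
# Arguments: release name, namespace.
show_release_history() {
  helm history "$1" --namespace "$2" --max 5
}

# Render the upgrade without changing the cluster.
# Arguments: release name, chart (e.g. cmu-sei/<chart-name>), namespace, version.
preview_upgrade() {
  helm upgrade "$1" "$2" --namespace "$3" -f values.yaml \
    --version "$4" --dry-run
}

# Roll a release back to an earlier revision number from the history.
# Arguments: release name, namespace, revision.
rollback_release() {
  helm rollback "$1" "$3" --namespace "$2"
}
```

A rollback restores the Kubernetes resources for the release only; it does not undo database schema changes, which is why current database backups matter before an upgrade.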

Database Maintenance

Maintain PostgreSQL databases regularly:

Use pgAdmin (if installed per the Installation Guide) for visual database management
Run VACUUM ANALYZE periodically to reclaim space from dead rows and refresh query planner statistics
Monitor database size and connection counts
Configure automated backups using Kubernetes CronJobs or external backup solutions
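
If pgAdmin is not installed, the same maintenance tasks can be run from the command line. The sketch below assumes PostgreSQL runs in a pod whose container includes psql, and that a role named postgres (overridable through the hypothetical DB_USER variable) can connect without a password prompt. The function names are illustrative only.

```shell
# Run one SQL statement inside the PostgreSQL pod.
# Arguments: namespace, pod name, database, SQL statement.
pg_exec() {
  kubectl exec -n "$1" "$2" -- \
    psql -U "${DB_USER:-postgres}" -d "$3" -c "$4"
}

# Reclaim dead-row space and refresh planner statistics.
vacuum_analyze() {
  pg_exec "$1" "$2" "$3" 'VACUUM ANALYZE;'
}

# Report the on-disk size of the connected database.
db_size() {
  pg_exec "$1" "$2" "$3" \
    'SELECT pg_size_pretty(pg_database_size(current_database()));'
}

# Report the number of current server connections.
connection_count() {
  pg_exec "$1" "$2" "$3" 'SELECT count(*) FROM pg_stat_activity;'
}
```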

Backup and Recovery

Backup Strategy

Database backups - Use pg_dump or PostgreSQL backup tools; schedule regular automated backups
Persistent volume backups - If using Longhorn, configure snapshot schedules through the Longhorn UI
Configuration backups - Store Helm values files and Kubernetes manifests in version control
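
A database backup along these lines might look like the following sketch. It assumes PostgreSQL runs in a pod with pg_dump available and that the postgres role (or the hypothetical DB_USER override) can connect without a prompt. The function name and file naming scheme are illustrative.

```shell
# Dump one database in pg_dump custom format to a timestamped local file.
# Arguments: namespace, pod name, database, output directory.
# Prints the path of the dump file on success.
backup_database() {
  out="$4/$3-$(date -u +%Y%m%dT%H%M%SZ).dump"
  if kubectl exec -n "$1" "$2" -- \
       pg_dump -U "${DB_USER:-postgres}" -Fc "$3" > "$out"; then
    echo "$out"
  else
    # Remove the partial file so it is not mistaken for a valid backup.
    rm -f "$out"
    return 1
  fi
}
```

The custom format (-Fc) is compressed and can be restored selectively with pg_restore. The same command could run on a schedule from a Kubernetes CronJob or an external scheduler.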

Recovery Procedures

Restore from backups when needed:

Database restore - Use pg_restore for custom, directory, or tar format archives created by pg_dump; restore plain SQL dumps with psql
Volume restore - Use Longhorn snapshot restore or your storage provider's procedures
Application redeployment - Use Helm with backed-up values files
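
A database restore could be sketched as follows, assuming the backup is a pg_dump custom-format archive stored locally and PostgreSQL runs in a pod with pg_restore available. The function name and the DB_USER override are hypothetical.

```shell
# Restore a custom-format archive into an existing database.
# Arguments: namespace, pod name, database, local dump file.
# --clean --if-exists drops existing objects before recreating them,
# so run this only against a database you intend to overwrite.
restore_database() {
  kubectl exec -i -n "$1" "$2" -- \
    pg_restore -U "${DB_USER:-postgres}" -d "$3" --clean --if-exists < "$4"
}
```

Consider scaling the dependent application down before restoring so that it does not write to the database mid-restore, then redeploy it with Helm once the restore completes.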

Performance Optimization

Resource Management

Adjust Kubernetes resource limits and requests in Helm values files:

# Example resource configuration in Helm values
resources:
  limits:
    cpu: 1000m
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi

Monitor resource usage with kubectl top pods and kubectl top nodes to inform scaling decisions.
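
To compare actual usage against configured requests and limits, something like the sketch below may help. The function names are hypothetical, and kubectl top requires the Kubernetes Metrics Server.

```shell
# Show the pods using the most memory across all namespaces.
# Argument: number of pods to show (default 10); one extra line is the header.
top_memory_pods() {
  kubectl top pods -A --sort-by=memory | head -n "$(( ${1:-10} + 1 ))"
}

# Show the CPU and memory requests and limits configured on each pod.
# Argument: namespace.
show_pod_resources() {
  kubectl get pods -n "$1" -o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_LIM:.spec.containers[*].resources.limits.memory'
}
```

Pods whose usage sits consistently near their limit are candidates for a higher limit or more replicas; pods far below their request may be over-provisioned.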

Scaling

Most Crucible applications support horizontal scaling:

Increase replica counts in Helm values: replicaCount: 3
Kubernetes Services distribute traffic across ready replicas automatically
Scale databases vertically by adjusting PostgreSQL resource allocations

Refer to the Installation Guide for minimum hardware requirements and scaling considerations.
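
A horizontal scale-out might be applied as in the sketch below. It assumes the chart exposes replicaCount as described above and that the application runs as a Deployment; the function names are hypothetical. Pinning --version to the currently deployed chart version avoids an unintended upgrade.

```shell
# Apply a new replica count through Helm.
# Arguments: release, chart, namespace, current chart version, replicas.
scale_release() {
  helm upgrade "$1" "$2" --namespace "$3" -f values.yaml \
    --version "$4" --set replicaCount="$5"
}

# Wait until the Deployment reports all replicas rolled out.
# Arguments: namespace, deployment name.
wait_for_rollout() {
  kubectl rollout status "deployment/$2" -n "$1" --timeout=5m
}
```

A value passed with --set is overridden by the next upgrade that omits it, so record the new replicaCount in values.yaml as well to keep it persistent.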

Troubleshooting

Common Operational Issues

For detailed troubleshooting procedures, see the Troubleshooting Playbook.

Common issues include:

Pod failures - Check status with kubectl describe pod <name> and review logs
Database connection issues - Verify PostgreSQL pod is running and connection strings are correct in Helm values
Certificate errors - Verify certificate secrets exist: kubectl get secrets -n <namespace>
Resource exhaustion - Check node and pod resource usage with kubectl top
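
For certificate errors, confirming that the secret exists is often not enough, because an expired certificate produces similar symptoms. The sketch below prints the expiry date of a certificate. It assumes a kubernetes.io/tls secret that stores the certificate under the tls.crt key and that openssl is installed locally; the function name is hypothetical.

```shell
# Print the notAfter (expiry) date of the certificate in a TLS secret.
# Arguments: namespace, secret name.
cert_expiry() {
  kubectl get secret "$2" -n "$1" -o jsonpath='{.data.tls\.crt}' \
    | base64 --decode \
    | openssl x509 -noout -enddate
}
```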

Diagnostic Commands

Basic Kubernetes diagnostic commands:

# Check all pods and their status
kubectl get pods -A

# Describe a specific pod (shows events and issues)
kubectl describe pod -n <namespace> <pod-name>

# View pod logs
kubectl logs -n <namespace> <pod-name>

# Check resource usage
kubectl top nodes
kubectl top pods -A

# View recent cluster events
kubectl get events -A --sort-by='.lastTimestamp'

Security Operations

Security Monitoring

Review the Security and Compliance Checklist for:

Audit log forwarding to SIEM
Failed authentication monitoring
Network policy enforcement
Regular access reviews

Incident Response

Follow your organization's incident response procedures. Key steps:

Identify and contain the incident using Kubernetes tools
Review application and audit logs for the timeframe
Use kubectl to isolate affected pods if needed
Restore from backups if you detect a data integrity compromise
Document findings and remediation steps
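
The log review and isolation steps above could be supported by a sketch like the following. The function names and output file layout are hypothetical, and scaling to zero is only one possible containment approach. Your organization's procedures take precedence.

```shell
# Save the description, logs, and related events for a pod to a directory.
# Arguments: namespace, pod name, output directory.
collect_pod_evidence() {
  mkdir -p "$3"
  kubectl describe pod -n "$1" "$2" > "$3/$2-describe.txt"
  kubectl logs -n "$1" "$2" --all-containers --timestamps > "$3/$2-logs.txt"
  kubectl get events -n "$1" \
    --field-selector "involvedObject.name=$2" > "$3/$2-events.txt"
}

# Stop a workload by scaling its Deployment to zero replicas.
# This deletes the running pods, so collect evidence first.
# Arguments: namespace, deployment name.
stop_workload() {
  kubectl scale "deployment/$2" -n "$1" --replicas=0
}
```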