Day 2 Operations on DC/OS

Once DC/OS is installed and your services are deployed, the work of operating your cluster begins. Day 2 Operations covers the monitoring, maintenance, and troubleshooting that keep apps, services, and hosts up and running.

DC/OS enables Day 2 Operations

DC/OS supports Day 2 Operations by lending transparency, resilience, and accessibility to your containers, services, and other workloads, making them easy to monitor, maintain, and troubleshoot.

Monitoring

DC/OS makes it simple for operators to track host-, service-, and container-level metrics, allowing them to head off potential problems and avoid downtime.

Maintenance

Microservices allow developers to improve apps faster than ever, and DC/OS helps operators keep pace. Spend less time on triage, and focus on deploying changes to production.

Troubleshooting

DC/OS allows operators to deploy containers to a pool of resources and then debug those containers without needing to know which node they are running on.

How does DC/OS facilitate Day 2 Operations?

Monitoring

Starting with DC/OS version 1.9, logs and metrics are readily accessible to operators. Task-, container-, service-, node-, and host-level logs are sent to journald, where they can be collected and filtered by the logging aggregator of your choice, giving operators the flexibility to account for their specific security and privacy concerns. Metrics are accessible via an HTTP API.
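As an illustration of the kind of filtering a log aggregator might apply to journald output, here is a minimal Python sketch. It assumes entries arrive as JSON lines (for example, from `journalctl -o json`) and uses the standard journald fields `_SYSTEMD_UNIT`, `PRIORITY`, and `MESSAGE`. The function name `filter_journal_entries`, the `dcos-` unit prefix, and the default priority threshold are assumptions chosen for the example, not part of DC/OS.

```python
import json


def filter_journal_entries(lines, unit_prefix="dcos-", max_priority=4):
    """Return (unit, message) pairs for matching journald JSON entries.

    Hypothetical helper: keeps entries whose systemd unit name starts with
    unit_prefix and whose syslog priority is max_priority or more severe
    (lower numbers are more severe; 4 is "warning").
    """
    selected = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        entry = json.loads(line)
        unit = entry.get("_SYSTEMD_UNIT", "")
        # journald exports PRIORITY as a string; default to 6 ("info").
        priority = int(entry.get("PRIORITY", 6))
        if unit.startswith(unit_prefix) and priority <= max_priority:
            selected.append((unit, entry.get("MESSAGE", "")))
    return selected
```

The same predicate could be tightened or loosened to match a site's security and privacy requirements, for instance by dropping messages from certain units before they leave the node.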

Maintenance

DC/OS makes cluster maintenance easy. It provides resilience by detecting node failures and enabling automatic recovery. By providing a layer of abstraction between your hardware and services, DC/OS makes it easy to plan for and react to the routine failures that are part of operating in a data center. DC/OS enables zero-downtime upgrade patterns, including rolling, blue-green, and canary deployments.
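To make the idea of a rolling upgrade concrete, the following Python sketch plans replacement batches so that a minimum fraction of instances stays in service at every step. It is a generic illustration under stated assumptions, not the algorithm DC/OS or its schedulers use; the function name `rolling_batches` and its parameters are hypothetical.

```python
import math


def rolling_batches(instances, min_healthy_fraction):
    """Plan a rolling upgrade as a list of batches of instance indices.

    Hypothetical planner: each batch is taken down and replaced in turn,
    and at least min_healthy_fraction of all instances keeps running
    while any single batch is being replaced.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if not 0 <= min_healthy_fraction <= 1:
        raise ValueError("min_healthy_fraction must be between 0 and 1")

    must_stay_up = math.ceil(instances * min_healthy_fraction)
    batch_size = instances - must_stay_up
    if batch_size < 1:
        # No instance can be stopped without violating the constraint;
        # spare capacity (for example, a blue-green swap) would be needed.
        raise ValueError("constraint leaves no instance free to replace")

    ids = list(range(instances))
    return [ids[i:i + batch_size] for i in range(0, instances, batch_size)]
```

For example, `rolling_batches(8, 0.75)` yields four batches of two instances each, so six of the eight instances remain available throughout the upgrade.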

Troubleshooting

DC/OS allows users to run containers on pooled resources without specifying which node those containers should run on. Starting with DC/OS 1.9, users can also troubleshoot short tasks or long-running jobs without having to track down the container or node where the task or job is running. DC/OS gives users access to misbehaving tasks and jobs inside containers launched with the Universal container runtime, without requiring SSH.
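The sketch below shows how such a debugging session might be scripted in Python. It only builds the argument list for the DC/OS CLI's `dcos task exec` subcommand and does not run it; the helper name `build_task_exec_command` and the example task ID are assumptions, and the flags should be checked against the CLI version installed on your workstation.

```python
import shlex


def build_task_exec_command(task_id, command, interactive=True):
    """Build an argv list for running a command inside a task's container.

    Hypothetical helper around `dcos task exec`; it returns the argument
    list only, so the caller decides whether and how to execute it.
    """
    argv = ["dcos", "task", "exec"]
    if interactive:
        # Attach stdin and allocate a terminal, e.g. for a shell session.
        argv += ["--interactive", "--tty"]
    argv.append(task_id)
    # Split the command string the way a POSIX shell would.
    argv += shlex.split(command)
    return argv
```

On a machine with an authenticated DC/OS CLI, the result could be passed to `subprocess.run`, for instance `subprocess.run(build_task_exec_command("my-app", "bash"))`, where `my-app` stands in for a real task ID. Note that no node address appears anywhere in the call.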

Resources for Day 2 Operations