Once DC/OS is installed and your services are deployed, the work of operating your cluster begins. Day 2 Operations covers the monitoring, maintenance, and troubleshooting that keep apps, services, and hosts up and running.
DC/OS supports Day 2 Operations by lending transparency, resilience, and accessibility to your containers, services, and other workloads, making them easy to monitor, maintain, and troubleshoot.
DC/OS makes it simple for operators to track host-, service-, and container-level metrics, allowing them to head off potential problems and avoid downtime.
Microservices allow developers to improve apps faster than ever, and DC/OS helps operators keep pace. Spend less time on triage, and focus on deploying changes to production.
DC/OS allows operators to deploy containers to a pool of resources. Now operators can debug those containers while remaining blissfully agnostic to their location.
Starting with DC/OS version 1.9, logs and metrics are easily accessible to operators. Task-, container-, service-, node-, and host-level logs are sent to journald, where they can be collected and filtered by the logging aggregator of your choice, giving operators the flexibility to account for their specific security and privacy concerns. Metrics are accessible via an HTTP API.
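As a rough illustration of the filtering step a logging aggregator might perform, the sketch below parses entries in the JSON format produced by `journalctl -o json` and keeps only those from one systemd unit at warning severity or worse. The function name `filter_journal`, the sample entries, and the choice of `dcos-mesos-slave.service` as the unit are assumptions made for the example; only the standard journald fields `_SYSTEMD_UNIT`, `PRIORITY`, and `MESSAGE` are relied on.

```python
import json
from typing import Iterable, Iterator


def filter_journal(lines: Iterable[str], unit: str, max_priority: int = 4) -> Iterator[dict]:
    """Yield journal entries from `unit` whose severity is at least `max_priority`.

    journald uses syslog priorities: 0 is the most severe, 7 is debug,
    so "at least warning" means PRIORITY <= 4.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("_SYSTEMD_UNIT") != unit:
            continue
        # The JSON export represents PRIORITY as a string; default to "info" (6).
        if int(entry.get("PRIORITY", 6)) <= max_priority:
            yield entry


# Hypothetical sample data standing in for `journalctl -o json` output.
SAMPLE = [
    json.dumps({"_SYSTEMD_UNIT": "dcos-mesos-slave.service", "PRIORITY": "3",
                "MESSAGE": "executor terminated"}),
    json.dumps({"_SYSTEMD_UNIT": "dcos-mesos-slave.service", "PRIORITY": "6",
                "MESSAGE": "heartbeat"}),
    json.dumps({"_SYSTEMD_UNIT": "sshd.service", "PRIORITY": "3",
                "MESSAGE": "auth failure"}),
]

if __name__ == "__main__":
    for entry in filter_journal(SAMPLE, "dcos-mesos-slave.service"):
        print(entry["MESSAGE"])
```

In practice the lines would come from a live stream such as `journalctl -o json -f` on each node, and the same filter could drop or redact entries before they leave the host.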
DC/OS makes cluster maintenance easy. It provides resilience by detecting node failures and enabling automatic recovery. By providing a layer of abstraction between your hardware and services, DC/OS makes it easy to plan for and react to the routine failures that are part of operating in a data center. DC/OS enables zero downtime upgrade patterns including rolling, blue-green, or canary deployments.
DC/OS allows users to run containers on pooled resources without specifying which node those containers should run on. Starting with DC/OS 1.9, users can also troubleshoot short tasks or long-running jobs without having to track down the container or node where the task or job is running. DC/OS gives users access to misbehaving tasks and jobs inside containers launched with the Universal container runtime, without requiring SSH.
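The DC/OS CLI exposes this debugging access through `dcos task exec`. As a minimal sketch, the helper below assembles that command line for use from a script; the helper name `task_exec_command` and the task name `my-task` are hypothetical, and running the resulting command assumes the DC/OS CLI is installed and attached to a cluster.

```python
import shlex
from typing import List, Sequence


def task_exec_command(task: str, command: Sequence[str], interactive: bool = False) -> List[str]:
    """Build the argv for `dcos task exec`, addressing the task by name or ID.

    No node or container location is needed; DC/OS resolves the task itself.
    """
    if not task:
        raise ValueError("task must be a non-empty task name or ID")
    if not command:
        raise ValueError("command must contain at least one argument")
    argv = ["dcos", "task", "exec"]
    if interactive:
        # Attach stdin and allocate a TTY, e.g. for an interactive shell.
        argv += ["--interactive", "--tty"]
    argv.append(task)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    # "my-task" is a placeholder task name.
    print(shlex.join(task_exec_command("my-task", ["ps", "aux"])))
    print(shlex.join(task_exec_command("my-task", ["bash"], interactive=True)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems when the command contains spaces.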
Ben Hindman discusses the challenges of Day 2 Operations in the cloud. Watch the video
Learn about common logging, monitoring, maintenance, and troubleshooting tools. View the deck
Hear Jeff Malnick discuss the logging, metrics, and debugging functionality new in DC/OS 1.9. Watch the video
Learn about the DC/OS logging API and your choices for logging integrations. Read the blog post
Learn how metrics are collected, tagged, and forwarded in DC/OS 1.9. Read the blog post