Health Checking: A not-so-trivial task in the distributed containerized world

What people usually understand by “health checks” is a simple sequence: performing a specific action and judging whether the target application is healthy based on the outcome. This simple sequence becomes trickier when the application consists of multiple containers managed by a cluster orchestrator and monitored by third-party tooling:

  • What entity should interpret the result? Should the reasoning about the health of a task be done locally (less context) or globally (greater overhead)?
  • Should health checks be aware of environment-specific intricacies such as namespaces and software-defined networks?
  • How can the overhead imposed by health checks be kept manageable?
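
To make the “simple sequence” concrete, here is a minimal Python sketch of a locally interpreted health check: run an action, then judge health from its outcome. The names `command_check` and `HealthTracker`, the 5-second timeout, and the three-consecutive-failures threshold are illustrative assumptions, not how any particular orchestrator implements it.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


def command_check(argv: Sequence[str], timeout_s: float = 5.0) -> bool:
    """Perform the action: run a command; exit code 0 counts as success."""
    try:
        result = subprocess.run(argv, timeout=timeout_s, capture_output=True)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable command is treated as a failed check.
        return False
    return result.returncode == 0


@dataclass
class HealthTracker:
    """Interpret check results locally (hypothetical helper).

    The task is reported unhealthy only after several consecutive
    failures, so a single transient failure does not flip the verdict.
    """
    check: Callable[[], bool]
    max_consecutive_failures: int = 3  # assumed threshold
    consecutive_failures: int = 0

    def probe(self) -> bool:
        """Run one check and return the current health verdict."""
        if self.check():
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.max_consecutive_failures
```

Even this small sketch touches the questions above: the verdict is reached locally with no cluster-wide context, the command runs in whatever namespace the checker itself lives in, and every probe costs a process launch.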

While discussing these challenges, Alex will give an overview of how modern distributed systems (such as AWS, Apache Mesos, and Kubernetes) tackle the problem of health checking, look at alternative solutions, and discuss their trade-offs. To conclude, Alex will share some practical recommendations based on the experience of revamping Apache Mesos’ health checking.