Observability at Scale

Description

Observability is tooling or a technical solution that allows teams to actively debug their systems. It is based on exploring properties and patterns that were not defined in advance.

To do a good job with monitoring and observability, your teams should have the following:

  • Reporting on the overall health of systems (Are my systems functioning? Do my systems have sufficient resources available?).
  • Reporting on system state as experienced by customers (Do my customers notice when my system is down? Are they having a bad experience?).
  • Monitoring for key business and systems metrics.
  • Tooling to help you understand and debug your systems in production.
  • Tooling to find information about things you did not previously know (that is, you can identify unknown unknowns).
  • Access to tools and data that help trace, understand, and diagnose infrastructure problems in your production environment, including interactions between services.
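
As an illustration of exploring properties that were not defined in advance, the following minimal Python sketch records structured events with arbitrary extra fields and then groups them by any field after the fact. The helper names (make_event, count_by) and the sample fields (region, app_version) are hypothetical and chosen only for this example; a real system would send such events to whatever observability backend the team uses.

```python
import json
from collections import Counter


def make_event(service, status, duration_ms, **fields):
    """Build one structured event; extra keyword fields are kept as-is."""
    event = {"service": service, "status": status, "duration_ms": duration_ms}
    event.update(fields)
    return event


def count_by(events, field):
    """Count events by the value of any field, including ad hoc ones."""
    return Counter(str(e.get(field, "<missing>")) for e in events)


# Hypothetical sample data: four requests to an imaginary checkout service.
events = [
    make_event("checkout", 200, 42, region="eu", app_version="1.4.0"),
    make_event("checkout", 500, 910, region="us", app_version="1.5.0"),
    make_event("checkout", 200, 38, region="us", app_version="1.4.0"),
    make_event("checkout", 500, 875, region="eu", app_version="1.5.0"),
]

# Ask a question nobody planned for: which field explains the failures?
failures = [e for e in events if e["status"] >= 500]
print(json.dumps(dict(count_by(failures, "region"))))
print(json.dumps(dict(count_by(failures, "app_version"))))
```

In this sample, grouping the failures by region splits them evenly, while grouping by app_version puts both failures on 1.5.0. Because every event carries its full context, the question can be asked after the incident without adding a new metric first.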

How to implement monitoring and observability

Monitoring and observability solutions are designed to do the following:

  • Provide leading indicators of an outage or service degradation.
  • Detect outages, service degradations, bugs, and unauthorized activity.
  • Help debug outages, service degradations, bugs, and unauthorized activity.
  • Identify long-term trends for capacity planning and business purposes.
  • Expose unexpected side effects of changes or added functionality.
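
One way to approximate a leading indicator of service degradation is to track the error rate over a sliding window of recent requests and raise a flag when it crosses a threshold. The sketch below is a minimal illustration under stated assumptions: the class name ErrorRateMonitor, the window size, and the threshold are all hypothetical, and production systems would normally rely on their monitoring platform's alerting rules rather than hand-rolled code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the error rate over the most recent `window_size` requests."""

    def __init__(self, window_size=100, threshold=0.05):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        # deque with maxlen silently drops the oldest entry when full.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def error_rate(self):
        """Return the fraction of failed requests in the current window."""
        if not self.window:
            return 0.0
        return sum(self.window) / len(self.window)

    def record(self, success):
        """Record one request outcome; return True if the alert should fire."""
        self.window.append(0 if success else 1)
        return self.error_rate() > self.threshold


# Hypothetical traffic: eight successful requests followed by three failures.
monitor = ErrorRateMonitor(window_size=10, threshold=0.2)
outcomes = [True] * 8 + [False] * 3
alerts = [monitor.record(ok) for ok in outcomes]
print(alerts)
```

With these sample values, the alert fires only on the last request, when three of the ten most recent requests have failed. Choosing the window size and threshold is a trade-off: a short window reacts quickly but is noisy, while a long window is stable but slower to flag a degradation.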

As with all DevOps capabilities, installing a tool is not enough to achieve these objectives, although tools can help or hinder the effort. Monitoring systems should not be confined to a single individual or team within an organization. Empowering all developers to be proficient with monitoring helps build a culture of data-driven decision making and improves overall system debuggability, which in turn helps reduce outages.
