Engineering and Operational Excellence

Engineering and Operational Excellence are the backbone of any SaaS engineering team.

Engineering Excellence is defined as the work required to create, deploy, and maintain high-quality products in production with speed and agility.

Operational Excellence is defined as the work done before and after code is deployed to production to maintain or improve operational quality.

To run an engineering team with proper inspections, it is essential to define the metrics associated with Engineering and Operational Excellence. Once the metrics are defined, target-state work should ensure that progress is being made against them.

  • Escapes to production. How many times did the test pipelines fail us and deploy defects to production? Ideal: 0. Pragmatic: depends on the size of the team, the number of releases, and the complexity of the code base (define your own target)
  • Defects out of SLA. How many defects were not fixed within an agreed-upon SLA? My standard goals:

P1 — 24 hours or less. Availability incidents fall here and can be all-hands-on-deck, ‘as soon as possible’ situations

P2 — 3 business days

P3 — next sprint

P4 — within a quarter

P1s and P2s need to be examined very thoroughly to ensure they meet the established criteria. They can be a significant distraction from ongoing work.
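The SLA goals above can be turned into a simple "defects out of SLA" check. A minimal sketch in Python, with hypothetical backlog data; note that P2's "business days" and P3's "next sprint" are approximated here as calendar-day windows you would tune to your own cadence:

```python
from datetime import datetime, timedelta

# SLA window per priority, mirroring the goals above.
# Assumptions: P2 business days approximated as calendar days,
# "next sprint" as 14 days, "within a quarter" as 90 days.
SLA = {
    "P1": timedelta(hours=24),
    "P2": timedelta(days=3),
    "P3": timedelta(days=14),
    "P4": timedelta(days=90),
}

def defects_out_of_sla(defects, now):
    """Return open defects whose age exceeds their priority's SLA."""
    return [
        d for d in defects
        if d["resolved_at"] is None and now - d["opened_at"] > SLA[d["priority"]]
    ]

# Hypothetical backlog
now = datetime(2023, 6, 15)
backlog = [
    {"id": 1, "priority": "P1", "opened_at": now - timedelta(hours=30), "resolved_at": None},
    {"id": 2, "priority": "P2", "opened_at": now - timedelta(days=2), "resolved_at": None},
]
print([d["id"] for d in defects_out_of_sla(backlog, now)])  # -> [1]
```

Only the 30-hour-old P1 is flagged; the P2 is still inside its 3-day window.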

  • Code Coverage. A typical target is 80% at the unit test level
  • Releases per week/month. This measures the quality of the CI/CD pipeline and whether it enables engineers to move quickly.
  • Code quality reports (e.g. as generated by SonarQube or other tools)
  • Availability. Standard SaaS availability is 99.95% (about 4hrs 22min of downtime per year). Depending on the nature of your business, this may need to be higher. Note that increasing availability beyond this requires significant additional operational and R&D expense. Three 9’s can be achieved with standard cloud practices like Multi Availability Zone deployments. Multi Region deployments must be thought through carefully: the additional complexity can often reduce availability and increase costs significantly. In many cases, nothing more than a backup in a different region and a different cloud is warranted
  • Performance. Measured using P50, P90, and P99 (percentile latency). World-class targets are P50 < 1s, P90 < 2s, and P99 < 5s. See my detailed writeup on this topic at this link
  • Failed Customer Interactions (FCIs). This metric is often overlooked. An FCI is an action performed by a customer that did not succeed. A site could have 100% availability yet breach the FCI threshold. Typically this is measured as the rate of 5xx errors (in some cases, 4xx errors could be included). World-class SaaS keeps FCIs under 0.025%
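To make the availability, percentile, and FCI targets concrete, here is a small sketch that computes each from the numbers in the list above; the function names and the sample request data are illustrative, not part of any particular monitoring tool:

```python
# Sketch of how the availability, latency-percentile, and FCI
# thresholds above can be computed from raw numbers.

def downtime_budget_minutes(availability_pct, minutes_per_year=365 * 24 * 60):
    """Allowed downtime per year for a given availability target."""
    return minutes_per_year * (1 - availability_pct / 100)

def percentile(sorted_samples, p):
    """Nearest-rank percentile over pre-sorted latency samples."""
    k = max(0, int(round(p / 100 * len(sorted_samples))) - 1)
    return sorted_samples[k]

def fci_rate(status_codes, count_4xx=False):
    """Fraction of requests that failed (5xx, optionally 4xx too)."""
    failed = sum(1 for s in status_codes
                 if s >= 500 or (count_4xx and 400 <= s < 500))
    return failed / len(status_codes)

budget = downtime_budget_minutes(99.95)
print(f"{budget:.0f} minutes/year")   # 99.95% -> ~263 minutes (~4h 22min)

codes = [200] * 9999 + [503]          # one failure in 10,000 requests
print(fci_rate(codes))                 # 0.0001, i.e. 0.01% -- under the 0.025% bar
```

A 99.95% target works out to roughly 263 minutes of allowed downtime a year, and a one-in-ten-thousand failure rate (0.01%) sits comfortably under the 0.025% FCI threshold.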

Engineering & Operational Excellence are a reflection of a team’s culture and customer-centric thinking. The metrics for both have to be tracked at the macro level. Engineering Excellence also needs to be tracked at the scrum team level. Each team may be at a different place in its journey towards Engineering Excellence: a team that works on complex code (e.g. still in a monolith) may be at a different place than a fully decomposed, microservices-based team. Accommodations need to be made to take this into account when looking at metrics at the scrum team level.

Operational Excellence is about system behavior and performance, and typically does not need to be examined at the scrum team level. It is the behavior of the system that is being tracked. It is often appropriate to staff an Operational Excellence team chartered with understanding weaknesses in the system that reduce availability and performance and increase FCIs. I invite the most experienced, curious engineers to join the ‘Opex’ team: the work is complex, both culturally and technologically. They have to be forward thinking and detail oriented.

An engineering on-call program is essential to get engineers closer to customers. I will cover on-call programs in a different post. The on-call program overlaps with Opex in interesting ways: the best implementations I have seen delegate the on-call engineer into the Opex team for the duration of their rotation. This way the Opex team learns about the engineer’s domain and, more importantly, the engineer learns the tools of the trade that the Opex team uses.

The Opex team has the following charter to deliver on the metrics:

  • Observability/Monitoring
  • System behavior and stability. I often participate in this track, and all CTOs should
  • Training the larger team on observability
  • Logging, and in particular understanding exceptions and either fixing them or ensuring that other teams resolve them
  • Performance tooling

In order to deliver on all of this, the Opex team must have:

  • One or more SREs
  • One or more DevOps engineers
  • Two or more developers
  • A lead. This is the most important person. I like to pick the most senior person on the team to lead this.

Every two weeks, we review all the dashboards related to Operational and Engineering Excellence. This is to understand where work is on track and where it is not. Items get added to backlogs as appropriate.

Note: This article is heavily influenced by my learnings from my one-time manager at Intuit, Keith Olson.


@_siddharth_ram; CTO @Inflection