What it measures

Mean Time to Recovery (MTTR) is the average elapsed time between a production failure starting and service being restored. The clock starts when a deploy, change, or defect degrades the system and stops when users are served normally again, whether that came from a fix-forward, a rollback, or a config change.

It pairs with Change Failure Rate. Change Failure Rate tells you how often changes break production. MTTR tells you how fast you recover when they do. Together they describe the failure side of your delivery system: how brittle it is, and how quickly it heals.

A low MTTR means failures are short and contained. A high or rising MTTR means incidents drag on, which usually points at slow detection, unclear ownership, weak rollback paths, or hard-to-diagnose changes rather than at any single bad deploy.

How to measure it

The practical version uses your deploy and incident records. For each incident, take the timestamp when degradation began and the timestamp when the system was confirmed healthy again, then average those durations over a window. Recovery often shows up in git and pipeline data as a revert commit, a hotfix deploy, or a rollback event, so when no explicit health confirmation is recorded, the deploy timestamp of that remediating change can stand in as the restore time.
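
As a minimal sketch of that calculation, assuming each incident has already been reduced to a pair of timestamps (degradation began, service restored), the mean is just the average of the differences. The `incidents` list and the `mean_time_to_recovery` helper are hypothetical names used for illustration, not part of any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (degradation began, service restored).
incidents = [
    (datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 45)),   # 45 min
    (datetime(2024, 5, 9, 14, 30), datetime(2024, 5, 9, 16, 30)),  # 120 min
    (datetime(2024, 5, 20, 8, 15), datetime(2024, 5, 20, 8, 30)),  # 15 min
]


def mean_time_to_recovery(records):
    """Average of (restored - started) across the given incidents."""
    if not records:
        raise ValueError("no incidents to average")
    durations = [restored - started for started, restored in records]
    # sum() needs a timedelta start value to add timedeltas together.
    return sum(durations, timedelta()) / len(durations)


mttr = mean_time_to_recovery(incidents)
print(mttr)  # 1:00:00
```

The three durations above total 180 minutes, so the mean comes out at one hour even though two of the three incidents were resolved in under an hour.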

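Detection is the hard part. Start time is only as accurate as your incident signal: an alert, an error-rate breach, or an opened incident ticket. Even a good signal marks when the failure was noticed rather than when it began, so starting the clock there understates MTTR by the time before anyone noticed. Without a signal, teams approximate start time from the deploy that introduced the regression. That is the earliest the failure could have begun, so it can overstate MTTR when the defect stayed dormant after release.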
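
To see how much the choice of start time moves the number, it can help to compute the duration from each candidate start for the same incident. The sketch below uses made-up timestamps, and the variable names are illustrative assumptions only.

```python
from datetime import datetime

# Hypothetical timestamps for a single incident.
deployed_at = datetime(2024, 5, 9, 14, 0)   # introducing change shipped
alerted_at = datetime(2024, 5, 9, 14, 40)   # first alert fired
restored_at = datetime(2024, 5, 9, 16, 0)   # rollback deployed

# Recovery time measured from each candidate start.
from_deploy = restored_at - deployed_at
from_alert = restored_at - alerted_at

# Time the change was live before any signal fired.
detection_gap = alerted_at - deployed_at

print(f"from deploy:   {from_deploy}")
print(f"from alert:    {from_alert}")
print(f"detection gap: {detection_gap}")
```

Here the two measurements differ by 40 minutes for the same incident. Whichever start you choose, recording both timestamps lets you report the detection gap separately instead of hiding it inside the average.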

Compute it over a trailing window, such as 30 or 90 days, and report the distribution, not just the mean. A single multi-hour outage can dominate the average, so the median and the worst case tell you more than the headline number.
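
A rough sketch of the windowed summary follows, again assuming incidents are available as (start, restored) pairs. The `recovery_summary` function, its field names, and the sample data are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta
from statistics import mean, median

# Hypothetical incidents: (degradation began, service restored).
incidents = [
    (datetime(2024, 4, 10, 9, 0), datetime(2024, 4, 10, 10, 0)),    # 60 min
    (datetime(2024, 5, 5, 13, 0), datetime(2024, 5, 5, 13, 10)),    # 10 min
    (datetime(2024, 5, 12, 11, 0), datetime(2024, 5, 12, 11, 20)),  # 20 min
    (datetime(2024, 5, 20, 16, 0), datetime(2024, 5, 20, 16, 15)),  # 15 min
    (datetime(2024, 5, 28, 2, 0), datetime(2024, 5, 28, 7, 15)),    # 315 min
]


def recovery_summary(records, now, window_days=30):
    """Summarise recovery times (in minutes) for incidents that
    started within the trailing window ending at `now`."""
    cutoff = now - timedelta(days=window_days)
    minutes = [
        (restored - started).total_seconds() / 60
        for started, restored in records
        if cutoff <= started <= now
    ]
    if not minutes:
        return None
    return {
        "count": len(minutes),
        "mean": mean(minutes),
        "median": median(minutes),
        "worst": max(minutes),
    }


summary = recovery_summary(incidents, now=datetime(2024, 6, 1), window_days=30)
print(summary)
```

In this sample, the single 315-minute outage pulls the 30-day mean to 90 minutes while the median stays at 17.5 minutes, which is why the median and worst case are worth reporting next to the mean.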

What it does not tell you

MTTR tells you how fast you recovered, not whether the failure should have been possible or what it cost the business. A team can post an excellent MTTR by getting very good at rolling back, while shipping the same class of defect every week. Fast recovery from recurring failures is motion, not progress: you are measuring how quickly you clean up, not whether you are building the right thing safely.

It is also blind to severity and meaning. A two-minute recovery on a checkout outage and a two-minute recovery on an internal admin page count the same, yet one threatens revenue and the other does not. MTTR does not record which initiative the broken change belonged to, who owns it, or whether the incident traced back to rushed strategic work or to routine maintenance.

This is the gap Execution Intelligence closes. The recovery time is a fact about how fast the team moves under failure. What matters to a CTO is the direction behind it: which initiative the failing work served, why it failed, who was on the hook, and what the incident actually cost. Reading that requires connecting the incident to the strategy and the spend behind it, not just timing the cleanup.

How InteliG uses it

InteliG computes MTTR directly from real git and deployment history: the introducing change, the alerting signal, and the remediating deploy or revert, with no manual instrumentation and no separate tracker to maintain. Cognis ties each incident to the initiative the failing work belonged to, the contributors involved, and the cost of both the change and the recovery.

That turns a recovery number into an answer a CTO can act on. Instead of a dashboard tile reading two hours, you see that the incident hit work under a specific initiative, who responded, how often this failure class recurs, and what it is costing, so recovery speed is read in the context of what was being built and why.

Related terms

Change Failure Rate — Share of changes that cause a production failure
Deployment Frequency — How often the team ships changes to production
Execution Intelligence — Reading what is being built and why, not just how fast