Question 1

How do you calculate MTTR?

Accepted Answer

MTTR is calculated by dividing the total downtime across all incidents by the number of incidents. For example, if your ISP network experienced 4 incidents totaling 180 minutes of downtime, the MTTR is 180 ÷ 4 = 45 minutes. On average, each incident took 45 minutes from the start of the outage to full service restoration.
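The same arithmetic can be sketched in a few lines of Python. The function name and the per-incident downtimes are hypothetical; they were chosen only so that four incidents add up to the 180 minutes used in the example.

```python
def mean_time_to_recovery(downtimes_minutes):
    """Return total downtime divided by the number of incidents, in minutes."""
    if not downtimes_minutes:
        raise ValueError("at least one incident is required")
    return sum(downtimes_minutes) / len(downtimes_minutes)


# Hypothetical downtimes for 4 incidents, totaling 180 minutes.
incident_downtimes = [30, 60, 20, 70]
mttr = mean_time_to_recovery(incident_downtimes)
print(f"MTTR: {mttr:.0f} minutes")  # MTTR: 45 minutes
```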

Question 2

What is a good MTTR for an ISP?

Accepted Answer

Elite ISP operations teams achieve an MTTR under 1 hour. High-performing teams typically resolve incidents within 1 to 4 hours. An MTTR above 4 hours suggests gaps in monitoring, alerting, or incident response processes. For FTTH ISPs, where a single OLT outage can affect hundreds of subscribers, faster MTTR directly translates to better SLA compliance and lower customer churn.

Question 3

What is the difference between MTTR, MTTA, MTTF, and MTBF?

Accepted Answer

MTTR (Mean Time To Recovery) measures how quickly you restore service after a failure. MTTA (Mean Time To Acknowledge) measures how long it takes your team to acknowledge an alert after it fires. MTTF (Mean Time To Failure) measures how long your systems run before failing. MTBF (Mean Time Between Failures) covers the complete failure cycle: MTBF = MTTF + MTTR. Together, these four metrics give a comprehensive picture of your network reliability and incident response capability.
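As a rough illustration of how the four metrics fit together, the sketch below derives them from one observation window. It assumes MTTF is the average uptime per failure within the window; the function name, the 30-day window, and the sample downtimes and acknowledgement delays are all hypothetical.

```python
def reliability_metrics(window_minutes, downtimes, ack_delays):
    """Derive MTTA, MTTR, MTTF and MTBF (all in minutes) for one window.

    Assumes one entry in `downtimes` and `ack_delays` per incident.
    """
    failures = len(downtimes)
    mttr = sum(downtimes) / failures
    mtta = sum(ack_delays) / len(ack_delays)
    # Assumption: MTTF is the average uptime per failure in the window.
    mttf = (window_minutes - sum(downtimes)) / failures
    # MTBF is the complete cycle: time running plus time recovering.
    mtbf = mttf + mttr
    return {"MTTA": mtta, "MTTR": mttr, "MTTF": mttf, "MTBF": mtbf}


# Hypothetical 30-day window with 4 incidents.
metrics = reliability_metrics(
    window_minutes=30 * 24 * 60,
    downtimes=[30, 60, 20, 70],
    ack_delays=[2, 5, 3, 6],
)
for name, value in metrics.items():
    print(f"{name}: {value:.1f} min")
```

Under that assumption, MTBF works out to the window length divided by the number of failures, which is a quick sanity check on the other three numbers.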

Question 4

How can ISPs reduce their MTTR?

Accepted Answer

The biggest MTTR improvements come from the detection phase. Network monitoring with 30-second polling intervals and smart alert correlation can reduce detection time from hours to under a minute. Beyond detection: implement clear escalation policies so alerts reach the right team immediately, maintain runbooks for common failure modes (OLT reboots, fiber cuts, power outages), and use topology maps to quickly identify root causes and affected subscribers.

Question 5

What are the four phases of incident recovery?

Accepted Answer

Incident recovery consists of four phases: Detect (monitoring identifies the problem), Respond (a team member acknowledges and begins investigation), Diagnose (root cause is identified using logs, topology, and metrics), and Repair (the fix is applied and service is restored). Each phase contributes to MTTR, but detection is typically the biggest opportunity for improvement — automated monitoring can shrink detection from hours to seconds.
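To see which phase dominates a given incident, the four phases can be measured from five timestamps. This is a minimal sketch: the function name, the timestamps, and the assumption that your incident tooling records each of these moments are all hypothetical.

```python
from datetime import datetime

PHASES = ("detect", "respond", "diagnose", "repair")


def phase_durations(failed, detected, acknowledged, diagnosed, restored):
    """Return minutes spent in each recovery phase for one incident."""
    marks = [failed, detected, acknowledged, diagnosed, restored]
    # Each phase runs from one timestamp to the next.
    minutes = [(end - start).total_seconds() / 60
               for start, end in zip(marks, marks[1:])]
    return dict(zip(PHASES, minutes))


# Hypothetical incident with slow, manual detection.
incident = phase_durations(
    failed=datetime(2024, 1, 1, 2, 0),
    detected=datetime(2024, 1, 1, 2, 25),
    acknowledged=datetime(2024, 1, 1, 2, 30),
    diagnosed=datetime(2024, 1, 1, 2, 40),
    restored=datetime(2024, 1, 1, 2, 45),
)
slowest = max(incident, key=incident.get)
print(incident)
print(f"Total: {sum(incident.values()):.0f} min, slowest phase: {slowest}")
```

In this made-up incident, detection accounts for 25 of the 45 minutes, which is the pattern the answer above describes.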

Question 6

How does MTTR relate to SLA compliance?

Accepted Answer

MTTR directly impacts SLA compliance because every minute of downtime consumes your error budget. A 99.9% monthly SLA allows only 43.2 minutes of total downtime in a 30-day month. If your MTTR is 45 minutes, a single average incident exceeds your entire monthly budget. Reducing MTTR gives you more room within your SLA: with a 15-minute MTTR, you can absorb two incidents per month (30 minutes of downtime) and still meet a 99.9% SLA, while a third incident would bring the total to 45 minutes and breach it.
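The budget arithmetic can be checked with a short sketch. It assumes a 30-day month and that every incident lasts exactly the MTTR; both function names are hypothetical.

```python
import math


def downtime_budget_minutes(sla_percent, days=30):
    """Minutes of downtime allowed by an availability SLA over `days` days."""
    return days * 24 * 60 * (1 - sla_percent / 100)


def incidents_within_budget(sla_percent, mttr_minutes, days=30):
    """Whole incidents of average length that fit inside the budget."""
    budget = downtime_budget_minutes(sla_percent, days)
    return math.floor(budget / mttr_minutes)


budget = downtime_budget_minutes(99.9)
print(f"99.9% budget: {budget:.1f} minutes")              # 43.2 minutes
print("45-min MTTR:", incidents_within_budget(99.9, 45))  # 0 incidents fit
print("15-min MTTR:", incidents_within_budget(99.9, 15))  # 2 incidents fit
```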

Tier	MTTR	Typical Profile
Elite	< 1 hour	Automated detection, rapid response, well-drilled runbooks
High Performer	1 – 4 hours	Proactive monitoring, trained NOC team, clear escalation
Medium	4 – 24 hours	Basic monitoring, manual detection, ad-hoc response
Needs Improvement	> 24 hours	Reactive approach, limited visibility, slow escalation
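The tiers in the table can be expressed as a simple lookup. The function name is hypothetical, and the handling of the exact boundaries (4 hours counted as High Performer, 24 hours as Medium) is an assumption, since the table does not say which tier a boundary value belongs to.

```python
def mttr_tier(mttr_hours):
    """Map an MTTR in hours to the benchmark tier from the table."""
    if mttr_hours < 1:
        return "Elite"
    if mttr_hours <= 4:   # assumption: exactly 4 hours is still High Performer
        return "High Performer"
    if mttr_hours <= 24:  # assumption: exactly 24 hours is still Medium
        return "Medium"
    return "Needs Improvement"


print(mttr_tier(45 / 60))  # Elite
print(mttr_tier(3))        # High Performer
print(mttr_tier(12))       # Medium
print(mttr_tier(30))       # Needs Improvement
```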
