20 Years in ISP Networks: Lessons Learned

20 Years in the Trenches of ISP Networks: What Actually Matters

Real lessons from building ISPs, running datacenters, fixing outages at impossible hours, and serving thousands of customers.

Twenty years in ISP and network operations teaches you many things.
Most of them not from books.
Most of them not from certifications.
And almost none of them during office hours.

They come at 02:17.
When a core link drops.
When redundancy doesn’t behave the way the diagram promised.
When customers are already calling and your monitoring system is still “green”.

This is not a simple technical guide. It’s a set of scars.

If I had to pass something forward to the next generation of network engineers, operators, and founders, it would be this.

1. Documentation is a survival instinct, not a chore

That “temporary hack”.
That “small change”.
That “I’ll clean this up later”.

You will not remember it.

At some point, you will stare at a configuration you wrote yourself and think:
Who did this?
It was you.

Documentation is not the pointless bureaucracy many people take it for.
It is respect for your future self and for the people who will inherit your network.

Good documentation:

Reduces outage duration
Reduces stress under pressure
Reduces dependency on specific people

Bad or missing documentation turns every incident into archaeology.

If your network only works because a few people “just know how it is”, you don’t have reliability. You have luck.

2. Monitoring without signal is just a landfill of metrics

Anyone can collect data.
Most systems are very good at that.

The hard part is extracting signal.

If your customers alert you before your monitoring system does, you don’t have monitoring.
You have a dashboard wallpaper.

Good monitoring answers questions:

Is something broken now?
Is it getting worse?
Who is affected?
What changed?

Bad monitoring answers everything except what matters.

Alerts should wake you up only when they deserve to.
Everything else is noise — and noise is dangerous because it trains you to ignore the system entirely.
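One way to make "wake you up only when they deserve to" concrete is a paging rule that requires both sustained failure and customer impact. The sketch below is illustrative only: the `Alert` fields, the `should_page` function, and the thresholds are assumptions, not part of any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    customers_affected: int  # hypothetical field: subscribers currently impacted
    failing_checks: int      # hypothetical field: consecutive failed checks so far


def should_page(alert: Alert, min_customers: int = 1, min_failures: int = 3) -> bool:
    """Page a human only for sustained, customer-affecting problems."""
    sustained = alert.failing_checks >= min_failures
    customer_impact = alert.customers_affected >= min_customers
    return sustained and customer_impact


alerts = [
    Alert("core-link-down", customers_affected=4200, failing_checks=5),
    Alert("fan-speed-warning", customers_affected=0, failing_checks=12),
    Alert("bgp-flap", customers_affected=300, failing_checks=1),
]

# Only the sustained, customer-affecting alert pages someone.
paged = [a.name for a in alerts if should_page(a)]
print(paged)  # ['core-link-down']
```

Everything that does not page can still be recorded and reviewed in daylight. The point is that the pager stays trustworthy.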

3. Redundancy is not resilience unless you test it

On paper, everything is redundant.
In reality, redundancy that has never failed is just a theory.

Links fail.
Power fails.
Vendors fail.
People fail.

The most painful outages are not caused by missing redundancy.
They are caused by assumed redundancy.

Failover that hasn’t been tested under real conditions will betray you at the worst possible time.

Maintenance windows, disaster drills, and controlled failures are not optional.
They are how you discover the gap between design and reality.
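A small part of that gap can be checked on paper before any drill. The sketch below asks whether the remaining links could carry peak traffic if any single link were lost. The link names, capacities, and peak figure are made-up numbers, and capacity arithmetic like this complements a real failover test rather than replacing it.

```python
from typing import Dict, List


def links_that_break_n_minus_1(
    link_capacities_gbps: Dict[str, float], peak_traffic_gbps: float
) -> List[str]:
    """Return links whose loss leaves less total capacity than peak traffic."""
    total = sum(link_capacities_gbps.values())
    return [
        name
        for name, capacity in link_capacities_gbps.items()
        if total - capacity < peak_traffic_gbps
    ]


# Hypothetical uplinks: three links look redundant on the diagram.
uplinks = {"transit-a": 100, "transit-b": 100, "ix-port": 40}

at_risk = links_that_break_n_minus_1(uplinks, peak_traffic_gbps=150)
print(at_risk)  # ['transit-a', 'transit-b']
```

In this example, losing either transit link leaves 140 Gbps for 150 Gbps of peak traffic, so the design is redundant in topology but not in capacity.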

4. Automation does not remove risk - it moves it

Automation is mandatory at scale.
But automation is also dangerous.

Every script, workflow, or system you automate becomes a single point of fast failure.

When automation goes wrong:

It goes wrong everywhere
It goes wrong instantly
It often goes wrong silently

Automate, but:

Add guardrails
Add visibility
Add rollback paths
Assume it will misbehave one day

The goal is not blind automation.
The goal is controlled speed.
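As a rough illustration of guardrails, visibility, and rollback together, the sketch below caps how many devices one run may touch, logs every step, and restores all touched devices when a post-change check fails. Device configs are plain dictionaries here, and `apply_with_guardrails`, `change`, and `is_healthy` are hypothetical names, not a real automation API.

```python
from typing import Callable, Dict, List

Config = Dict[str, str]


def apply_with_guardrails(
    devices: Dict[str, Config],
    change: Callable[[Config], Config],
    is_healthy: Callable[[str, Config], bool],
    max_devices: int = 5,
) -> List[str]:
    """Apply `change` device by device; roll back all touched devices on a failed check."""
    # Guardrail: limit the blast radius of a single run.
    if len(devices) > max_devices:
        raise ValueError(f"refusing to touch {len(devices)} devices (limit {max_devices})")

    backups: Dict[str, Config] = {}
    log: List[str] = []  # visibility: every action is recorded
    for name in list(devices):
        backups[name] = dict(devices[name])  # keep a rollback copy
        devices[name] = change(dict(devices[name]))
        log.append(f"applied {name}")
        if not is_healthy(name, devices[name]):
            # Rollback path: restore everything touched so far, then stop.
            for done, old in backups.items():
                devices[done] = old
                log.append(f"rolled back {done}")
            break
    return log


fleet = {"edge1": {"mtu": "1500"}, "edge2": {"mtu": "1500"}}
log = apply_with_guardrails(
    fleet,
    change=lambda cfg: {**cfg, "mtu": "9000"},
    is_healthy=lambda name, cfg: name != "edge2",  # pretend edge2 fails its post-check
)
print(log)
print(fleet)
```

Here the change reaches `edge1`, fails its check on `edge2`, and both devices end up back at their original MTU, with the log showing exactly what happened.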

5. Technical debt always shows up as customer pain

You can postpone refactoring.
You can postpone cleanup.
You can postpone “doing it right”.

What you cannot postpone is the bill.

Technical debt does not stay in the network layer.
It leaks upward:

Slower troubleshooting
Longer outages
Confusing communication
Burned-out engineers
Eventually, customer churn

Customers don’t care why your systems are fragile.
They only experience that they are.

Every shortcut you take today becomes friction someone else will feel tomorrow, and more often than not that someone is a customer, or several.

6. Growth amplifies everything - including your weaknesses

When you are small, heroics work.
When you grow, heroics become a liability.

Processes that work at 1,000 customers collapse at 10,000.
Tribal knowledge breaks in 24/7 operations.
“Ask that one guy” stops working when that guy is sick, on leave, or gone.

Scale does not forgive shortcuts. It exposes them.

Many ISPs fail not because of bad technology, but because they never re-examined habits that no longer scale.

The network eventually reflects the organization behind it.

7. Customer experience is your only durable advantage

In the ISP world, competitors can copy almost everything.

They can buy the same routers.
They can lease the same fiber.
They can match your pricing.
They can clone your self-care app.

What they cannot copy is how you make customers feel.

Helpful cultures are not for sale.
You cannot bolt them on later.
You build them daily:

In how you respond
In how you troubleshoot
In how you communicate during outages
In how seriously you take even “small” issues

Hardware can be matched.
Bandwidth can be matched.
The feeling of being taken care of cannot.

That feeling is built slowly and destroyed quickly.

The uncomfortable conclusion

After enough years, you realize that most ISP failures are not technical.

They are organizational.
They are cultural.
They are the result of decisions that made sense once and were never revisited.

The real backbone of a great ISP is not just the network.

It is:

Discipline under pressure
Respect for the future
Willingness to question assumptions
And genuine care for the people who rely on your service

If this post helps someone document one change, test one failover, reduce one alert, or rethink one habit — it will have done its job.

Written by an ISP and IXP operator with 20+ years of experience building networks, running data centers, and operating large-scale broadband infrastructure.
