
Infrastructure Understanding

The most valuable capability in your infrastructure is the one that walks out the door when consultants leave – unless you build it into the system itself.

The Best Medicine Is the Worst Poison - and I sometimes feel like one of the most dangerous people in my client’s infrastructure. Not because I break things, but because I fix them. Fast, across domains, and in ways that turn out to be genuinely difficult to replace. That’s my problem.

I’ve spent over fifteen years in companies most engineers never see from the inside. Whether it’s certificate management, network architecture, or DDoS protection gaps, I can usually diagnose the problem and quantify it faster than internal teams can. It was like that from my first engagement as a Deloitte consultant. I know it sounds pretty arrogant, but that initial feeling grew into a realisation that people like me are scarce.

You walk in without the political baggage, without the history of “we tried that in 2019 and it didn’t work,” and you just… ask. The best part: you get to ask really stupid questions. New processes, SOPs, and technical fixes are the visible output. But what actually makes me hard to replace is something much less comfortable.

I ask the right questions.

Not “which CA issued this certificate” or “when does this expire.” Those are Big-4-consultant-checklist questions. I ask the ones that seem easy but tell you loads about how things work ‘under the hood’. When you rotate your certificate, what does the payment processor need? Where do I get a certificate for this dev system? How do you get a new private key to this isolated network? Has anybody written down the steps you took to reconnect to a payment scheme a year ago, at 3:12 AM?

Some questions are hard, some are very easy. Some answers are incredibly telling about the culture and about what support is given to IT admins, without most people realising it. I pull that together and build a picture of how the infrastructure actually works – not how the diagram says it works.

I can see what causes engineers the most pain in doing their jobs. Those pain points are often a source of insider-threat vulnerability. I document them, and then I build the automation and processes to keep things running.

I learned early on how uncomfortable this can be. In my first job, I was asked to split my reports into two versions – one for general consumption and one with restricted access. I wasn’t doing penetration testing or running exploits. I was looking at high-level architecture. But even at that level, I was finding gaps that could be exploited – things that were simply too dangerous to put in a document that circulated as “confidential”. When your observations about how systems connect need to be classified, that tells you something about the value of actually understanding the full picture. And the rarity of it.

I understand architecture. Not just the diagrams but down to the lines of code, to what engineers press on their keyboards and what they copy over via the clipboard. Where the handoff points no longer make sense, because a dependency has been replaced, removed, or changed. Building that picture requires the curiosity of a child, in every sense of the word.

Every engagement, I try to transfer this. I document everything. Train the team. Build playbooks. Make myself replaceable on paper.

Then my engagement ends.

The Silence After the Fix

Initially, everything runs perfectly. The automation works. The processes are followed. The monitoring dashboards are green. It all goes swimmingly, but then something starts to change.

Not because anyone makes bad decisions. Simply because no one pays attention when things run smoothly, so why should they now? Every manager has hundreds of things to worry about. Why would anyone spend time on things that have just worked for a year or more?

But the lack of attention makes people sloppy - I know, I’m the same as everyone. I just care more than most. Things that look ‘optional’ are skipped in busy calendars. For certificates in particular, the estate grows. New services get deployed with certificates that don’t follow the playbook. An exception gets made for one team, then another. The monitoring catches the big stuff but the edge cases accumulate.

Employees do what’s in their contract. They’re not negligent or incompetent. They’re doing exactly what they are paid to do – run operations, follow SOPs, close tickets. Nobody’s job description says “spend four hours a week reviewing logs and tickets, looking for ‘weird’ things, outliers, something you have not seen before.” That was my job. And I’m gone.

So the issues get quietly whitewashed. A near-miss gets logged as “resolved” without root cause analysis. A renewal that took three days of scrambling gets reported as “completed on schedule.” A dependency that nobody fully understands gets labelled “low risk” because investigating it properly isn’t anyone’s priority. Everything looks fine in the reports. The dashboards stay green.

Until someone asks the right questions. And by then, the gap between what the reports say and what the infrastructure actually looks like can be enormous.

Why You Should Stop Buying Consulting Hours

I’ve watched this enough times to understand something about my own business model – I was selling a capability that evaporated on a schedule.

My expertise – the architecture reviews, the dependency and design reviews, the uncomfortable questions – can’t be fully transferred, because not everything can be written down. It’s not a checklist. It’s a way of looking at infrastructure that takes years to develop and requires the freedom, and the guts, to ask questions that go beyond a statement of work or an employment contract.

So I started asking a different question – what if the expertise wasn’t in a person at all? How hard would it be to scale it, with the AI tools available today?

Not the deep architectural thinking – that still requires humans. But the operational vigilance and the evidence collection. The relentless attention to things that are running smoothly. The constant asking of “what changed, what grew, what slipped through the cracks” – those questions can be automated. They should be automated. Because no human will consistently ask them when there’s no visible fire.
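To make “what changed, what grew, what slipped through the cracks” concrete, here is a minimal sketch of what asking those questions automatically could look like for a certificate estate. It is an illustration only, not how any particular product works. The Cert record, the review function, the list of approved issuers, and the 30-day warning window are all assumptions invented for this example, and it presumes you already have two inventory snapshots to compare.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Cert:
    """Hypothetical inventory record; the field names are assumptions."""
    fingerprint: str
    service: str
    issuer: str
    not_after: date


def review(previous, current, approved_issuers, today, warn_days=30):
    """Compare two inventory snapshots and return sorted (finding, service) pairs."""
    prev = {c.fingerprint: c for c in previous}
    curr = {c.fingerprint: c for c in current}
    findings = []

    # What grew: certificates that were not in the last snapshot.
    for fp in curr.keys() - prev.keys():
        findings.append(("new", curr[fp].service))

    # What changed: certificates that disappeared since the last snapshot.
    for fp in prev.keys() - curr.keys():
        findings.append(("gone", prev[fp].service))

    # What slipped through the cracks: off-playbook issuers and close expiries.
    for cert in current:
        if cert.issuer not in approved_issuers:
            findings.append(("off-playbook issuer", cert.service))
        if cert.not_after - today <= timedelta(days=warn_days):
            findings.append(("expiring soon", cert.service))

    return sorted(findings)


# Made-up example data: one known certificate, one that appeared this week.
today = date(2025, 6, 1)
last_week = [Cert("aa", "payments-gateway", "Internal CA", date(2025, 9, 1))]
this_week = last_week + [Cert("bb", "dev-portal", "Self-signed", date(2025, 6, 10))]

findings = review(last_week, this_week, {"Internal CA"}, today)
for finding, service in findings:
    print(f"{service}: {finding}")
```

The point is not the ten lines of logic. The point is that a scheduled job asks these questions every week, whether or not anyone is worried, and an empty result is still evidence that someone looked.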

That’s the real reason I started building 3AM. Not because automation is better than your engineers, but because it is consistent. It frees your engineers’ hands to look beyond the end of the day and to talk with their users more to understand what they need. And automation doesn’t walk out the door on a scheduled date.

Your best engineer – whether they’re on your payroll or mine – is a single point of failure. Not because of what they can fix, but because of what they notice. And when they leave, no one picks up the baton.

The only fix is making that vigilance operational, not personal. Build it into the system. Make the questions automatic. So that when everything is running smoothly – especially when everything is running smoothly – something is still paying attention.