The Incident Pit - Barclay Howe's Blog

I’ve been reading Alastair Reynolds’ work since I stumbled upon his short story ‘Galactic North’ in an anthology in 2000. His work is awesome and interesting, and one thing that has stuck with me as an engineer is the idea of an ‘Incident Pit’ from his novel Pushing Ice.

When I went back and googled it, it turned out that the concept has its own Wikipedia page (albeit a short one). It’s worth checking out. TL;DR: it’s when the errors in a project or scenario begin to arrive faster than they can be handled. The term comes from diving, where small errors are constant; once they pass an inflection point and pile up too quickly, fatal mistakes can happen.

This should sound familiar to anyone in software engineering (though there it’s usually not fatal). Many systems have an underlying error rate that is considered normal, but certain effects can cause those errors to grow in size or frequency until failure occurs. The causes can be internal or external to your software.

Consider a log file that, under normal circumstances, trims itself so that there is always enough headroom for more entries and enough drive space left for the OS (yes, it’s on the C drive here). Now imagine that there is storage contention, and timeouts begin because the disk develops latency (say, in a shared VM environment). There is a tipping point where the timeouts are logged faster than trimming occurs, and from there the storage fills at an accelerating rate.
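
To make the tipping point concrete, here is a rough sketch of that log file as a toy simulation. Everything in it is an assumption for illustration: the function name, the numbers, and especially the feedback rule, which supposes that the volume of timeout entries grows in proportion to how large the log already is. Real systems will not be this tidy.

```python
def simulate_log_size(start_mb, minutes, base_write_mb=5.0, trim_mb=8.0,
                      feedback=0.01, capacity_mb=1000.0):
    """Toy model of a self-trimming log, stepped one minute at a time.

    All parameters are hypothetical. `feedback` is the assumed fraction of
    the current log size that gets written each minute as extra timeout
    entries (bigger log -> more contention -> more timeouts logged).
    """
    size = start_mb
    history = [size]
    for _ in range(minutes):
        timeout_write_mb = feedback * size          # extra entries caused by latency
        size = size + base_write_mb + timeout_write_mb - trim_mb
        size = max(0.0, min(capacity_mb, size))     # a log cannot be negative or exceed the disk
        history.append(size)
        if size >= capacity_mb:                     # disk is full: stop simulating
            break
    return history


# With these assumed numbers, writes equal trimming when
# base_write_mb + feedback * size == trim_mb, i.e. at (8 - 5) / 0.01 = 300 MB.
healthy = simulate_log_size(start_mb=200.0, minutes=600)   # starts below the tipping point
runaway = simulate_log_size(start_mb=400.0, minutes=600)   # starts above it

print("healthy run ends at", healthy[-1], "MB")
print("runaway run ends at", runaway[-1], "MB after", len(runaway) - 1, "minutes")
```

Under these assumptions, a log that starts below the balance point gets trimmed back down, while one that starts above it grows a little faster every minute until the disk is full. Nothing about the code changed between the two runs, only the starting conditions.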

This is a deliberately naive example. The same pattern shows up as a failure mode in networks of systems, and also as a characteristic of a single poorly planned project.

A project built on a stack where every path is unknown can easily become an incident pit. One bad hand-written database layer is an easy recipe for such a case. Software can work really well in dev on your local machine and in testing, and you only find out under real load that you have a connection pooling or locking problem, which causes parasitic effects throughout the software and systemic failure once it’s in front of the customer.

So how do you avoid an incident pit? I don’t think you can, at least not entirely. You need to be on the lookout for them, and manage them using good software project management practices. That’s why we need to look clearly at our architectures and address risk early.
