Incident Reports

Though we fully expect to write dependable applications, every project will experience service disruptions and other significant failings. In all cases, we want to learn from our mistakes both within our projects and more broadly. Incident reports, which detail the events and how they were resolved, are an excellent mechanism for sharing this information.

Note that this document doesn't cover what to do during a security incident, what to do if cloud.gov is having issues, what to report to your client, and so on. For those, see:

  • https://github.com/18F/security-incidents
  • https://cloudgov.statuspage.io/
  • cloud-gov-support@gsa.gov
  • Slack: #infrastructure, #incident-response, #cloud-gov-support

Key components

At a high level, we want to follow Mark Imbriaco's formula for writing a great post-mortem report:

  1. Apologize for what happened.
  2. Demonstrate you understand what happened.
  3. Explain what you will do to reduce the likelihood of it happening again.

Before any deep analysis, we should write a timeline, beginning at the time the incident was discovered and ending at the point the incident was declared over. Events in this timeline include deploys, configuration changes, key moments of discovery, client communications, and anything else relevant to understanding the incident. If further analysis discovers that certain earlier events caused the incident, those events should also be added.
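As a rough illustration only, a timeline like this can be kept as a small sorted list of events. The sketch below is not an 18F tool; `TimelineEvent`, `render_timeline`, the `kind` labels, and the sample events are all hypothetical, and it assumes every timestamp is recorded in UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(order=True)
class TimelineEvent:
    """One entry in an incident timeline (hypothetical structure)."""

    when: datetime                            # assumed to be in UTC
    kind: str = field(compare=False)          # e.g. "deploy", "discovery"
    description: str = field(compare=False)


def render_timeline(events):
    """Return the events as text, one per line, in chronological order."""
    lines = []
    # Only `when` takes part in comparisons, so sorting orders by time.
    for event in sorted(events):
        stamp = event.when.strftime("%Y-%m-%d %H:%M UTC")
        lines.append(f"{stamp} [{event.kind}] {event.description}")
    return "\n".join(lines)


# Made-up events, entered in the order they were learned about.
events = [
    TimelineEvent(datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
                  "discovery", "Alert fires for elevated error rate"),
    TimelineEvent(datetime(2024, 1, 1, 13, 10, tzinfo=timezone.utc),
                  "resolution", "Rollback complete; incident declared over"),
    # Found during later analysis, so added after the fact.
    TimelineEvent(datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
                  "deploy", "New release deployed to production"),
]

print(render_timeline(events))
```

Sorting on the timestamp means an event uncovered during later analysis, such as the deploy above, can simply be appended and will still appear in the right place.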

Analyze the factors that contributed to the incident. Here it's important to emphasize the Retrospective Prime Directive; paraphrased: everyone did their best; there should be no judgment of individuals. If we are lucky, we will discover a single root cause, but often we will find a sort of comedy of errors or series of unfortunate events that collectively led to the incident.

Propose, discuss, and prioritize preventative measures. This is the key outcome for the project team: we want to avoid these types of problems in the future.

Define a single place to put these artifacts and be consistent. It doesn't matter if it's GitHub issues, Google Docs, a wiki, etc. so long as it's kept together and easy to reference by both the team and interested stakeholders. Don't make folks search for the information.

Examples

Additional resources

18F Engineering

An official website of the GSA’s Technology Transformation Services
