What is SRE (Site Reliability Engineering)?

Google coined the term “Site Reliability Engineer” (or SRE) in 2003. SRE is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goal is to create scalable and highly reliable software systems. SREs are also relationship brokers who have a view into organization-wide systems, a knack for problem-solving, and a love of metrics.

According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations."

How SRE satisfies the five DevOps pillars:

Pillar: Reduce organizational silos

- SRE shares ownership with developers to create shared responsibility.
- SREs use the same tools that developers use, and vice versa.

Pillar: Accept failure as normal

- SREs embrace risk.
- SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- SRE mandates blameless postmortems.
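
To make the SLI/SLO idea concrete, here is a minimal sketch in Python. It assumes a request-based availability SLI (good requests divided by total requests) and compares it against an SLO target. The names `SLOReport` and `evaluate_availability` are hypothetical, chosen for illustration only, and the "error budget" calculation (the share of allowed failures not yet used) is one common way to express how much risk remains, not the only one.

```python
from dataclasses import dataclass


@dataclass
class SLOReport:
    sli: float               # measured availability, e.g. 0.9996
    slo_target: float        # objective, e.g. 0.999
    allowed_failures: float  # failures the SLO tolerates in this window
    budget_remaining: float  # fraction of the error budget still unused
    slo_met: bool


def evaluate_availability(good_requests: int, total_requests: int,
                          slo_target: float) -> SLOReport:
    """Compute an availability SLI and compare it with an SLO target."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_requests / total_requests
    failed = total_requests - good_requests
    # The error budget is the number of failures the SLO permits.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_remaining = 1.0 - failed / allowed_failures

    return SLOReport(
        sli=sli,
        slo_target=slo_target,
        allowed_failures=allowed_failures,
        budget_remaining=budget_remaining,
        slo_met=sli >= slo_target,
    )


if __name__ == "__main__":
    # Illustrative numbers only: 400 failed requests out of 1,000,000.
    report = evaluate_availability(999_600, 1_000_000, slo_target=0.999)
    print(f"SLI: {report.sli:.4%}, SLO met: {report.slo_met}")
    print(f"Error budget remaining: {report.budget_remaining:.0%}")
```

A negative `budget_remaining` means the service has failed more often than the SLO allows for the window being measured.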

Pillar: Implement gradual changes

- SRE encourages developers and product owners to move quickly by reducing the cost of failure.

Pillar: Leverage tooling and automation

- SREs have a charter to automate away menial tasks (called "toil").

Pillar: Measure everything

- SRE defines prescriptive ways to measure values.
- SRE fundamentally believes that systems operation is a software problem.

Free books from Google about being an SRE:

- Site Reliability Engineering
- The Site Reliability Workbook
