Site Reliability Engineering by Google




Max Hemingway

02/03/2017


Coding, Digital, Productivity, Programming

Having read this book previously, it's good to see that Google has now made it available online for reading and reference. The book is a collection of articles and essays by Google's Site Reliability Engineers on how Google runs and maintains its computing systems.

The book can be accessed at https://landing.google.com/sre/book/

The table of contents below lists the articles and essays in the book.

Table of Contents
Foreword
Preface
Part I – Introduction
Chapter 1 – Introduction
Chapter 2 – The Production Environment at Google, from the Viewpoint of an SRE
Part II – Principles
Chapter 3 – Embracing Risk
Chapter 4 – Service Level Objectives
Chapter 5 – Eliminating Toil
Chapter 6 – Monitoring Distributed Systems
Chapter 7 – The Evolution of Automation at Google
Chapter 8 – Release Engineering
Chapter 9 – Simplicity
Part III – Practices
Chapter 10 – Practical Alerting
Chapter 11 – Being On-Call
Chapter 12 – Effective Troubleshooting
Chapter 13 – Emergency Response
Chapter 14 – Managing Incidents
Chapter 15 – Postmortem Culture: Learning from Failure
Chapter 16 – Tracking Outages
Chapter 17 – Testing for Reliability
Chapter 18 – Software Engineering in SRE
Chapter 19 – Load Balancing at the Frontend
Chapter 20 – Load Balancing in the Datacenter
Chapter 21 – Handling Overload
Chapter 22 – Addressing Cascading Failures
Chapter 23 – Managing Critical State: Distributed Consensus for Reliability
Chapter 24 – Distributed Periodic Scheduling with Cron
Chapter 25 – Data Processing Pipelines
Chapter 26 – Data Integrity: What You Read Is What You Wrote
Chapter 27 – Reliable Product Launches at Scale
Part IV – Management
Chapter 28 – Accelerating SREs to On-Call and Beyond
Chapter 29 – Dealing with Interrupts
Chapter 30 – Embedding an SRE to Recover from Operational Overload
Chapter 31 – Communication and Collaboration in SRE
Chapter 32 – The Evolving SRE Engagement Model
Part V – Conclusions
Chapter 33 – Lessons Learned from Other Industries
Chapter 34 – Conclusion
Appendix A – Availability Table
Appendix B – A Collection of Best Practices for Production Services
Appendix C – Example Incident State Document
Appendix D – Example Postmortem
Appendix E – Launch Coordination Checklist
Appendix F – Example Production Meeting Minutes
Bibliography
