Operability.IO is a yearly two-day event focused on "DevOps from the Ops point of view". This was only its second year. It's organised by Marco Abis of Highops.com. It was a great event with a relaxed atmosphere in a lovely venue - I'd recommend it. Thanks to Marco, the speakers and all the sponsors.
The following is what I took from the event and is not meant to be complete. It's just what appealed to me and the context I'm working in. I've tried to attribute to the speakers where possible; any mistakes are in my recollection. I've mixed in some other bits too, where I've dug deeper on a topic and found other sources.
Favourite talk: Sarah Wells' "Why would anyone do out-of-hours support for free?". The alternative title is "What I learned about DevOps at The Financial Times". An experience report of a DevOps transformation littered with wisdom. The slides alone are not enough to appreciate this talk - it needs to be heard.
- Culture has a disproportionately bigger impact than anything else on your success - and it is the hardest thing to get right (Casey West)
- Your people are your culture. "You can't directly change culture. But you can change behaviour, and behaviour becomes culture" - Lloyd Taylor VP Infrastructure, Ngmoco
- Trust is crucial: systems run on trust (Daniel Otte and Tom Shacham)
- Teams can have overlapping concerns instead of hard edges. At Google they visualise a team's responsibilities as a normal distribution centred at a point on the tech stack spectrum (running from hardware to UI). Typical responsibilities would lie within one standard deviation but they are not limited by it (Niall Murphy)
- "Operability starts with design and development" - don't leave it till release day; that disrespects your Ops team (Adrian Colyer)
- A certain level of cultural maturity should be achieved before undertaking microservices. It would be irresponsible not to! (Referencing "You must be this tall to use microservices" by Martin Fowler.) Maybe this concept of a fitness test should apply to other initiatives too?
- "GrownOps" by Daniel Otte and Tom Shacham: a code of conduct driven by communication problems and friction:
- Favour working together over delivering fast
- Contribute instead of accuse
- You're always in a state of partial knowledge
- Ask for information, don't state judgement
- Systems are based on (human) trust
- Align team responsibilities with business intentions - "incentives
- The Financial Times embedded their "TechOps" people into their product teams, inspired by Werner Vogels' quote "you build it, you run it". "If you're not doing this, you're not doing DevOps": Sarah Wells
- Google pushed out responsibility for the correct configuration of MySQL to every team involved in the stack. For example, this meant the DNS team were also on the hook if their MySQL configuration test suite was failing (Niall Murphy)
- USwitch centralised their teams and gave them horizontal responsibilities. This led to friction, so they decentralised into vertical product teams. However, this meant the same problem being solved multiple times, so they recentralised some of their horizontal functions again. This time, though, the central teams were charged with being "caring but not responsible". They could recommend an approach and develop a toolset, but product teams were not bound to use it (Tom Booth)
- Pick one area to change at a time. If you're lucky you might have one team/division/acquisition already heading in the direction you want to go. They can serve as an example to the rest (Sarah Wells and Tom Booth)
- Many companies consider all decisions to be final (irreversible or type 1). However, in reality only a fraction are: most decisions are changeable (reversible or type 2). Be brave with your type 2 decisions! (Jeff Bezos, cited by Sarah Wells)
- Your emergency patching process may be the most agile/DevOps process in your company. It is much faster than the full release process, having trimmed all the fat. What if you just released all your changes this way? Subversive advice from Casey West
On complex systems:
- All systems of sufficient complexity operate in a constant state of partial failure - the key is to remain operable. Advice from the paper "How Complex Systems Fail" cited by Adrian Colyer and the same thoughts were echoed by Niall Murphy.
- The behaviour of complex systems cannot be understood by looking at their constituent parts. We must observe the activities of the system as a whole. (I have reconstructed this from memory as I cannot find it in my notes nor remember which talk it came from - but it stuck in my mind.)
- There is automation and then there is autonomy. Google reached a limit of efficiency by automating processes for humans. MySQL failovers could not be done faster than 30 minutes. To reach the next level they needed to remove humans altogether and build autonomous, self-healing systems (Niall Murphy)
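To make the automation/autonomy distinction concrete, here is a minimal sketch of a self-healing control loop that promotes a healthy replica without waiting for a human. It is not Google's implementation, which the talk did not detail; the `Replica` type and `reconcile` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    # Hypothetical model of a database node, for illustration only.
    name: str
    healthy: bool = True
    is_primary: bool = False


def reconcile(replicas):
    """Run one pass of the loop: make sure a healthy primary exists.

    Returns the primary's name, or None if no replica is healthy.
    """
    primary = next((r for r in replicas if r.is_primary), None)
    if primary is not None and primary.healthy:
        return primary.name  # desired state already holds, nothing to do
    if primary is not None:
        primary.is_primary = False  # demote the failed primary
    candidate = next((r for r in replicas if r.healthy), None)
    if candidate is None:
        return None  # nothing to promote; this is where a human gets paged
    candidate.is_primary = True  # promote the first healthy replica
    return candidate.name


if __name__ == "__main__":
    cluster = [Replica("db1", is_primary=True), Replica("db2")]
    cluster[0].healthy = False
    print(reconcile(cluster))
```

A real system would run `reconcile` continuously against live health checks; the point of the sketch is only that the decision to fail over sits in code rather than in a runbook.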
On technology:
- Distributed tracing is the future - Steven Acreman. Also correlating anomalies across your metrics! The following are all from Adrian Colyer's excellent talk and links are to his blog:
- "Even your best engineers often get it wrong when they're working from guesses and intuition" - Adrian Colyer
- For distributed tracing Google have Dapper, Facebook have UberTrace and Gorilla
- No correlation IDs to track your distributed processes? lprof just needs your logs and your (Java) source code to show you what is happening in your code.
- No logs? Use pivot. Dynamically install monitoring at runtime with a load-time weaver and query language. Adds an overhead of just 0.3%!
- Random fact: Google dropped 60% of their MySQL hardware once they moved to containers (with Borg) - Niall Murphy
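For readers unfamiliar with the correlation IDs mentioned above, the following is a minimal sketch of the idea using only Python's standard library. It does not show how Dapper, lprof or any tool from the talks works. The names `correlation_id`, `CorrelationFilter`, `handle_request` and `charge_card` are hypothetical, as is the log format.

```python
import contextvars
import logging
import sys
import uuid

# Hypothetical name: holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(name)s %(message)s")
    )
    handler.addFilter(CorrelationFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


def charge_card(logger):
    # A real downstream call would forward correlation_id.get() in a header.
    logger.info("card charged")


def handle_request(logger, incoming_id=None):
    # Reuse the caller's ID if one was sent, otherwise start a new one.
    token = correlation_id.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("order received")
        charge_card(logger)
        return correlation_id.get()
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    handle_request(build_logger(sys.stdout))
```

Every log line written while a request is in flight carries the same ID, so lines from different services can be joined afterwards by searching for that ID.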
See the Operability.IO schedule (operability.io) for a summary of each talk.
- Sarah Wells (Financial Times) Operability talk: "Why would anyone do out-of-hours support for free?" Slides (speakerdeck.com) Video - same talk but taken from London-Continuous-Delivery Meetup Sept 2016 (vimeo.com)
- Casey West (Pivotal) Operability talk: "Achieving Cloud-Native Operability" Slides (speakerdeck.com)
- Lloyd Taylor (Ngmoco) quoted by John Willis in his article DevOps Culture (itrevolution.com)
- Daniel Otte and Tom Shacham (Springer Nature) Operability talk: "The road to GrownOps" Slides (slideshare.net)
- Niall Murphy (Google) Operability talk: "Automation and Operability: the Google SRE perspective"
- Adrian Colyer (Accel) Operability talk: "The Morning Paper on Operability" (acolyer.org)
- Martin Fowler (Thoughtworks) article: You must be this tall to use microservices (martinfowler.com)
- Rebecca Parsons (Thoughtworks) Operability talk: "Operability and Evolvability"
- Jovile Bartkeviciute (Skelton Thatcher) tweet: "Incentives matter" (twitter.com)
- Werner Vogels (Amazon) interview "A conversation with Werner Vogels" (acm.org)
- Tom Booth (USwitch) Operability talk: "Centralising the right things"
- Adrian Colyer (Accel) blog post "How Complex Systems Fail" (acolyer.org)
- Jeff Bezos (Amazon) Letter to Shareholders (sec.gov)
- Steven Acreman (Dataloop.IO) Operability talk: "A brief history of monitoring"