Mid & Senior Site Reliability Engineers - GDS - G7
Government Digital & Data
The Government Digital Service (GDS) is the digital centre of government. We are responsible for setting, leading and delivering the vision for a modern digital government.
Our priorities are to drive a modern digital government by:
- joining up public sector services
- harnessing the power of AI for the public good
- strengthening and extending our digital and data public infrastructure
- elevating leadership and investing in talent
- funding for outcomes and procuring for growth and innovation
- committing to transparency and driving accountability
We are home to the Incubator for Artificial Intelligence (I.AI) and the world-leading GOV.UK, and we are at the forefront of coordinating the UK’s geospatial strategy and activity. We lead the Government Digital and Data function and champion the work of digital teams across government.
We’re part of the Department for Science, Innovation and Technology (DSIT) and employ more than 1,000 people all over the UK, with hubs in Manchester, London and Bristol.
The Government Digital Service is where talent translates into impact. From your first day, you’ll be working with some of the world’s most highly skilled digital professionals, all contributing their knowledge to make change on a national scale.
Join us for rewarding work that makes a difference across the UK. You’ll solve some of the nation’s highest-priority digital challenges, helping millions of people access the services they need.
Effective identity assurance is central to digital transformation, and GOV.UK One Login enables people to prove who they are online with the level of confidence needed to access and use particular services. Our technology stack runs on AWS, using serverless compute and storage products. Backend services are written in TypeScript/Node.js and JVM technologies. Web applications also use TypeScript.
The right person will join a well-motivated, multidisciplinary delivery team working to deliver on our commitments and roadmap. We are an ambitious and visionary team, so if you want to be at the heart of this, have a background in software delivery and are used to working in a scaled agile environment, then this could be the place for you!
If this sounds like the next step in your career, we’d love to hear from you. Find out more at the GDS Blog.
Job description
As a Site Reliability Engineer at GDS you will:
- be part of one of our multidisciplinary service teams, working with and supporting front-end and back-end developers, delivery and product managers, tech writers and architects
- build and maintain resilient, highly available and secure systems to meet the needs of our users
- take responsibility for solving complex and interesting problems
- create infrastructure as code to ensure our infrastructure and deployment pipelines are reusable, repeatable and reliable
- ensure our systems are appropriately monitored and instrumented to enable our teams to identify and respond to operational issues quickly and effectively
- build CI/CD pipelines to enable our developers to get their code into production as quickly and safely as possible
- act as a digital ambassador, sharing experiences through public speaking and blog posts
- participate in our in-hours 2nd line and out-of-hours support rotas to gain empathy for users and awareness of operational concerns
- share knowledge of tools and practices with your wider team and peers to drive consistency and maintain our high engineering standards
Person specification
We’re interested in people who:
- are experienced with Linux operating system internals and are comfortable working with Linux virtual machines or containers
- have experience of working with technologies that underpin digital services such as databases, web servers, DNS, CDNs, reverse proxies, message queues and load balancers
- have experience of cloud infrastructure providers such as AWS
- are familiar with container orchestration technologies such as Kubernetes or ECS, or with serverless application design using services such as AWS Lambda
- have an understanding of SRE principles such as capacity planning, SLOs and SLIs, and of how to design and support resilient, large-scale, high-performance services in a production environment
- can deploy and configure observability tools to ensure systems are appropriately monitored and instrumented to enable teams to identify and respond to operational issues quickly and effectively
- are familiar with at least one programming language (we use Node.js, Java, Python, Ruby and Go) and with technologies such as Terraform and CloudFormation, and are able to use automated testing and test-driven development (TDD) to validate solutions and maintain code quality
- are proficient in using Git for version control
- understand the benefits of continuous integration and continuous deployment and have experience with CI/CD tools such as Concourse, Jenkins, GitHub Actions and CodePipeline
- have a good understanding of security principles and how to keep large operational services secure