IT Site Reliability Engineer

At GitLab, the IT Infrastructure team is responsible for Site Reliability Engineering for our tech stack applications and cloud infrastructure that supports corporate initiatives across many of our departments. In addition to traditional AWS and GCP administration, we also provide escalation engineering support for departments that manage their respective SaaS tech stack applications (vendor hosted). Another of our functions is to provide DevOps Engineering for several internally built applications that power our business operations and automation.

The IT team collaborates closely with the Engineering Infrastructure Reliability team that is responsible for our GitLab.com SaaS platform (our product infrastructure). The IT, Engineering, and Infrastructure Security teams collaborate to architect, implement, and manage our AWS and GCP infrastructure policies and collectively manage all related services.

Responsibilities

Lead the handling of the ticket queue (GitLab issues) for AWS and GCP corporate infrastructure requests from team members. This ranges from simple IAM and DNS requests to designing and deploying new scalable application infrastructure.
Design, build, and maintain core infrastructure that enables GitLab to scale to support 2,000+ team members and the applications and services that they use day-to-day.
Implement and maintain system logging and monitoring to alert on problems and prevent outages, and get ahead of customer needs.
Maintain the corporate AWS and GCP infrastructure utilizing Ansible, Terraform, GitLab CI/CD, and Kubernetes
Gather and analyze operating system and application metrics to assist in performance tuning and fault finding
Create sustainable systems and services through patching, automation, and upgrades
Document every action so your findings turn into repeatable actions and then into automation.
Provide mentorship to IT System Administrators and IT Analysts who have an interest in infrastructure and IaC.
Collaborate with other teams to improve services and help with system design, platform management, and capacity planning

Levels

IT Site Reliability Engineer (Intermediate)

Job Grade

The IT Site Reliability Engineer is a grade 6.

Requirements

5+ years of experience in IT in a high growth Software as a service (SaaS) environment
Knowledge of configuration management tools like Ansible, Chef, or Terraform
Hands-on experience working in GCP and AWS environments
Experience working with CI/CD tools and Git
Ability to use GitLab

Responsibilities

The IT Site Reliability Engineers share the same responsibilities outlined above.

AWS and GCP - At least 2 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience is more important, particularly combined with Terraform experience deploying applications and services using infrastructure-as-code with security best practices.
Security - Strong understanding of security best practices, network design, and how AWS/GCP roles should be used for IAM/RBAC least privilege.
Infrastructure-as-Code - Configuration management experience with Terraform and/or Ansible to effectively manage our infrastructure. Previous experience with AWS CloudFormation, Chef, Pulumi, Puppet, etc. is acceptable; however, strong Terraform experience is a requirement.
Kubernetes - Experience with managing Kubernetes clusters and using kubectl, k9s, etc. for managing Helm chart deployments, ingress services, and troubleshooting pods. Previous experience with Docker and related technologies is acceptable since container concepts are transferable.
Operating Systems - Experience with managing Alpine, Debian, or Ubuntu Linux systems. We do not use Windows at GitLab. Many services are deployed in containers.
Cloud Services - Manage, configure, and troubleshoot Linux operating system issues, storage (block and object), and networking (VPCs, proxies, and CDNs), and administer high-availability PostgreSQL and Redis clusters
Monitoring and instrumentation - Implement metrics in Prometheus, Grafana, Elastic, log management and related systems, and Slack/PagerDuty/Sentry integrations
Engineering practices - High availability, data security, reliability and scalability, as well as disaster recovery

Senior IT Site Reliability Engineer

Senior Job Grade

The Senior IT Site Reliability Engineer is a grade 7.

Senior Requirements

The Senior IT Site Reliability Engineer meets all the requirements outlined above plus the following:

7+ years of experience in IT in a high growth SaaS environment
Advanced knowledge of identity and access management
Advanced knowledge in one of the following scripting languages - Python or Ruby
Advanced knowledge of container and microservice technologies

Senior Responsibilities

The Senior IT Site Reliability Engineer has all the same responsibilities as the intermediate position plus the following:

AWS and GCP - At least 5 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience is more important, particularly combined with Terraform experience deploying applications and services using infrastructure-as-code with security best practices.
Security - The current infrastructure and DevOps landscape requires a strong security background to design hardened environments using a variety of cloud services beyond the traditional firewall rules of VPCs. It is helpful to have a working knowledge of how different security vendor point solutions can be used to create a robust architecture.
Software Languages and Frameworks (beyond simple scripts) - We work in a variety of languages including: PHP (Laravel), Ruby on Rails, GoLang, Python and Shell.
CI/CD - Experience with Terraform and GitLab CI/CD for automated build, test, and deployments. Previous experience with other CI/CD platforms (GitHub Actions, Jenkins, etc.) is acceptable.
Build or implement open source automation and systems to manage AWS and GCP infrastructure and business applications and related services.
Systems architecture design - In a DevOps ecosystem, your systems thinking will allow you to see automation efficiencies in areas outside of infrastructure. At GitLab, everyone can contribute and the IT Operations team welcomes automation and efficiency contributions from all roles.

Performance Indicators

Career Ladder

The next step in the IT Site Reliability Engineer job family is to move to the IT Manager job family.

Hiring Process

Candidates for this position can expect the hiring process to follow the order below. Please keep in mind that candidates can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find their job title on our team page.

Qualified candidates will be invited to schedule a 30 minute screening call with one of our Global Recruiters
Candidates will be invited to complete a take-home assessment, to be completed in their own time and returned within 3-5 working days
Next, candidates will be invited to schedule an interview with the Hiring Manager
Candidates will then be invited to schedule a panel Team interview with two members of the IT Systems Engineering team
Candidates will also be invited to schedule a Technical interview with two other team members
Finally, candidates will interview with our Director of IT Operations

Additional details about our process can be found on our hiring page.

About GitLab

GitLab is an open core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. Our mission is to enable everyone to contribute to and co-create the software that powers our world. When everyone can contribute, consumers become contributors, significantly accelerating the rate of human progress. This mission is integral to our culture, influencing how we hire, build products, and lead our industry. We make this possible at GitLab by running our operations on our product and staying aligned with our values. Learn more about Life at GitLab. Thanks to products like Duo Enterprise, and Duo Workflow, customers get the benefit of AI at every stage of the SDLC. The same principles built into our products are reflected in how our team works: we embrace AI as a core productivity multiplier. All team members are encouraged and expected to incorporate AI into their daily workflows to drive efficiency, innovation, and impact across our global organisation.

See our culture page for more!

Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto and guides.

Last modified March 5, 2025: Fix broken links (2feb413c)
