IT Site Reliability Engineer
At GitLab, the IT Infrastructure team is responsible for Site Reliability Engineering for our tech stack applications and cloud infrastructure that supports corporate initiatives across many of our departments. In addition to traditional AWS and GCP administration, we also provide escalation engineering support for departments that manage their respective SaaS tech stack applications (vendor hosted). Another of our functions is to provide DevOps Engineering for several internally built applications that power our business operations and automation.
The IT team collaborates closely with the Engineering Infrastructure Reliability team that is responsible for our GitLab.com SaaS platform (our product infrastructure). The IT, Engineering, and Infrastructure Security teams collaborate to architect, implement, and manage our AWS and GCP infrastructure policies and collectively manage all related services.
Responsibilities
- Lead the handling of the ticket queue (GitLab issues) for AWS and GCP corporate infrastructure requests from team members. This ranges from simple IAM and DNS requests to designing and deploying new scalable application infrastructure.
- Design, build, and maintain core infrastructure that enables GitLab to scale to support 2,000+ team members and the applications and services that they use day-to-day.
- Implement and maintain system logging and monitoring to alert on problems and prevent outages, and get ahead of customer needs.
- Maintain the corporate AWS and GCP infrastructure utilizing Ansible, Terraform, GitLab CI/CD, and Kubernetes
- Gather and analyze operating system and application metrics to assist in performance tuning and fault finding
- Create sustainable systems and services through patching, automation, and upgrades
- Document every action so your findings turn into repeatable actions and then into automation.
- Provide mentorship to IT System Administrators and IT Analysts who have an interest in infrastructure and IaC.
- Collaborate with other teams to improve services and help with system design, platform management, and capacity planning
Levels
IT Site Reliability Engineer (Intermediate)
Job Grade
The IT Site Reliability Engineer is a grade 6.
Requirements
- 5+ years of experience in IT in a high growth Software as a service (SaaS) environment
- Knowledge of configuration management tools like Ansible, Chef, or Terraform
- Hands-on experience working in GCP and AWS environments
- Experience working with CI/CD tools and Git
- Ability to use GitLab
Responsibilities
The IT Site Reliability Engineers share the same responsibilities outlined above.
- AWS and GCP - At least 2 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience, combined with Terraform experience deploying applications and services using infrastructure-as-code with security best practices, is more important.
- Security - Strong understanding of security best practices, network design, and how AWS/GCP roles should be used for IAM/RBAC least privilege.
- Infrastructure-as-Code - Configuration management experience with Terraform and/or Ansible to effectively manage our infrastructure. Previous experience with AWS CloudFormation, Chef, Pulumi, Puppet, etc. is acceptable, however strong Terraform experience is a requirement.
- Kubernetes - Experience with managing Kubernetes clusters and using kubectl, k9s, etc. for managing Helm chart deployments, ingress services, and troubleshooting pods. Previous experience with Docker and related technologies is acceptable since container concepts are transferable.
- Operating Systems - Experience with managing Alpine, Debian, or Ubuntu Linux systems. We do not use Windows at GitLab. Many services are deployed in containers.
- Cloud Services - Manage, configure, and troubleshoot Linux operating system issues, storage (block and object), and networking (VPCs, proxies, and CDNs), and administer high-availability PostgreSQL and Redis clusters
- Monitoring and instrumentation - Implement metrics in Prometheus, Grafana, Elastic, log management and related systems, and Slack/PagerDuty/Sentry integrations
- Engineering practices - High availability, data security, reliability and scalability, as well as disaster recovery
Senior IT Site Reliability Engineer
Senior Job Grade
The Senior IT Site Reliability Engineer is a grade 7.
Senior Requirements
The Senior IT Site Reliability Engineer has all the same requirements as the ones outlined above plus the following:
- 7+ years of experience in IT in a high growth SaaS environment
- Advanced knowledge of identity and access management
- Advanced knowledge in one of the following scripting languages - Python or Ruby
- Advanced knowledge of container and microservice technologies
Senior Responsibilities
The Senior IT Site Reliability Engineer has all the same responsibilities as the intermediate position plus the following:
- AWS and GCP - At least 5 years managing applications in AWS and/or GCP. An AWS and/or GCP professional certification is nice to have; however, practical experience, combined with Terraform experience deploying applications and services using infrastructure-as-code with security best practices, is more important.
- Security - The current infrastructure and DevOps landscape requires a strong security background to design hardened environments using a variety of cloud services beyond the traditional firewall rules of VPCs. It is helpful to have a working knowledge of how different security vendor point solutions can be used to create a robust architecture.
- Software Languages and Frameworks (beyond simple scripts) - We work in a variety of languages including: PHP (Laravel), Ruby on Rails, GoLang, Python and Shell.
- CI/CD - Experience with Terraform and GitLab CI/CD for automated build, test and deployments. Previous experience with CI/CD platforms, GitHub Actions, Jenkins, etc is acceptable, however
- Build or implement open source automation and systems to manage AWS and GCP infrastructure and business applications and related services.
- Systems architecture design - In a DevOps ecosystem, your systems thinking will allow you to see automation efficiencies in areas outside of infrastructure. At GitLab, everyone can contribute and the IT Operations team welcomes automation and efficiency contributions from all roles.
Performance Indicators
- Mean Time between Failures (MTBF)
- Mean Time to Repair (MTTR)
- Number of days since last environment audit
- Cycle Time for IT Support Issue Resolution
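For readers unfamiliar with the first two indicators, here is a minimal Python sketch of how MTBF and MTTR are commonly computed from incident records. It uses the standard textbook definitions and is not a description of how GitLab measures these indicators; the `Incident` record, the function names, and the sample dates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical outage record: when service failed and when it was restored."""
    started: datetime
    resolved: datetime


def total_downtime(incidents: list[Incident]) -> timedelta:
    """Sum of repair durations across all incidents."""
    return sum((i.resolved - i.started for i in incidents), timedelta())


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean Time to Repair: total repair time divided by the number of incidents."""
    if not incidents:
        raise ValueError("MTTR is undefined without at least one incident")
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents: list[Incident], window_start: datetime, window_end: datetime) -> timedelta:
    """Mean Time between Failures: uptime in the window divided by the number of failures."""
    if not incidents:
        raise ValueError("MTBF is undefined without at least one failure")
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


# Illustrative data only: a 30-day window with a 2-hour and a 4-hour outage.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)
sample_incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 12, 0)),
    Incident(datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 12, 0)),
]

print("MTTR:", mttr(sample_incidents))
print("MTBF:", mtbf(sample_incidents, window_start, window_end))
```

With the sample data, repairs took 6 hours in total across two incidents, so MTTR is 3 hours. The remaining 714 hours of uptime divided by two failures gives an MTBF of 357 hours.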
Career Ladder
The next step in the IT Site Reliability Engineer job family is to move to the IT Manager job family.
Hiring Process
Candidates for this position can expect the hiring process to follow the order below. Please keep in mind that candidates can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find their job title on our team page.
- Qualified candidates will be invited to schedule a 30 minute screening call with one of our Global Recruiters
- Candidates will be invited to complete a take-home assessment. This is to be completed in your own time and returned within 3-5 working days
- Next, candidates will be invited to schedule an interview with the Hiring Manager
- Candidates will then be invited to schedule a panel-style Team interview with two members of the IT Systems Engineering team
- Candidates will also be invited to schedule a Technical interview with two other team members
- Finally, candidates will interview with our Director of IT Operations
Additional details about our process can be found on our hiring page.
About GitLab
GitLab is an open core software company that develops the most comprehensive AI-powered DevSecOps Platform, used by more than 100,000 organizations. Our mission is to enable everyone to contribute to and co-create the software that powers our world. When everyone can contribute, consumers become contributors, significantly accelerating the rate of human progress. This mission is integral to our culture, influencing how we hire, build products, and lead our industry. We make this possible at GitLab by running our operations on our product and staying aligned with our values. Learn more about Life at GitLab. Thanks to products like Duo Enterprise and Duo Workflow, customers get the benefit of AI at every stage of the SDLC. The same principles built into our products are reflected in how our team works: we embrace AI as a core productivity multiplier. All team members are encouraged and expected to incorporate AI into their daily workflows to drive efficiency, innovation, and impact across our global organisation. See our culture page for more!
Work remotely from anywhere in the world. Curious to see what that looks like? Check out our remote manifesto and guides.