The Site Reliability Engineering (SRE) team combines software, systems and network engineering to deploy and run a portfolio of high-performance edge services including CDN, WAF and Compute. SRE’s daily focus is on the availability, change velocity, performance and capacity of customer-facing services and supporting internal systems.
On the SRE team you will have the opportunity to apply your experience against systems at scale – where a single week can involve shifting terabits of traffic between sites, deploying configuration changes to shave milliseconds off billions of requests, or enabling a new software feature on thousands of systems using automated tooling you designed and built.
This role will report to our VP Site Reliability Engineering.
Responsibilities
- Respond to incidents during on-call duty
- Respond to complex customer escalations, which often cross system, network and software boundaries
- Design, develop and maintain internal service metrics (SLA, SLO, SLI) in cross-team collaborations
- Design, develop and maintain dashboards, tooling, alarms and playbooks in collaboration with operations teams to support service-level objectives
- Design, develop and maintain reusable monitoring and canary infrastructure
- Design, execute and evaluate performance experiments
- Collaborate with development teams to complete production readiness checklists prior to major feature launches
- Collaborate with operations and engineering teams in determining the root cause of major incidents, performance anomalies, or other customer-impacting issues
Requirements
- Experience with monitoring and alerting platforms (Prometheus and Alertmanager, Grafana, Zabbix, Nagios)
- Experience with a Linux server environment
- Experience with scripting languages (Python, Ruby, Perl)
- Experience with systems programming languages (Go, C)
- Experience with configuration management systems (Puppet, Ansible, Chef, Jenkins)
- Expert-level proficiency in systems, network or software engineering
- Excited about working on a remote-first engineering team
- Proficient at troubleshooting complex systems
- Production experience in a service provider environment
- Comfortable with a software engineering workflow for collaboration and configuration management: branches, pull requests, merges, conflicts
Projects You Might Work On
- Product launches
- Software and platform feature releases
- Live streaming event planning and execution
- Network reach and capability expansion
- Network and system automation tooling development
- Telemetry and monitoring system development
- Defining service metrics (SLA, SLO, SLI) during new product development
Deadline for applications: 04.01.2022.