Job Description
About the Role
At enterprise scale, cloud operations and DevOps are not support functions; they are the engine of reliability, speed, and resilience. We are seeking a Cloud Operations & DevOps leader to own the stability, scalability, and continuous delivery of mission-critical platforms that power a global organization.
This role is designed for a practitioner who understands that modern infrastructure is software. You will operate at the intersection of cloud engineering, platform reliability, automation, and security, ensuring that development teams can ship safely and quickly while production environments remain observable, resilient, and cost-efficient. Your mandate spans cloud operations, CI/CD, infrastructure-as-code, SRE practices, and incident management across multi-region, high-availability systems.
Working closely with Engineering, Security, Data, and Product leaders, you will modernize how infrastructure is built and run, moving from reactive operations to predictive, automated, and self-healing systems. You will champion DevOps culture, embed reliability into design, and institutionalize best practices that scale with growth.
This is not a ticket-driven operations role. It is an opportunity to shape the cloud operating model, influence architectural decisions, and deliver measurable improvements in uptime, deployment velocity, and cost efficiency at enterprise scale.
Essential Duties and Responsibilities
- Own production cloud operations across multi-region, high-availability environments.
- Design, implement, and maintain CI/CD pipelines enabling safe, fast, and repeatable deployments.
- Build and manage infrastructure-as-code using modern tooling (e.g., Terraform, CloudFormation).
- Establish SRE practices including SLIs/SLOs, error budgets, observability, and incident response.
- Automate provisioning, scaling, patching, and recovery to reduce manual toil.
- Partner with Engineering to embed reliability, security, and operability into system design.
- Lead incident management, root-cause analysis, and post-incident improvements.
- Optimize cloud cost, capacity planning, and performance across services and regions.
- Implement monitoring, logging, and alerting for proactive issue detection and remediation.
- Mentor engineers and promote DevOps and reliability best practices across teams.
Job Qualifications and Requirements
- 7–10+ years of experience in Cloud Operations, DevOps, SRE, or platform engineering roles.
- Hands-on experience with major cloud platforms (AWS, Azure, or GCP) in production environments.
- Strong expertise in CI/CD systems, containerization, and orchestration (e.g., Kubernetes).
- Experience with infrastructure-as-code, configuration management, and automation frameworks.
- Proven track record of improving reliability, deployment velocity, and operational efficiency.
- Understanding of security, networking, and identity concepts in cloud environments.
- Strong scripting and automation skills (e.g., Python, Bash).
- Ability to collaborate across engineering, security, and product teams.
Personal Capabilities and Qualifications
- Systems thinker who anticipates failure modes and designs for resilience.
- Calm and decisive during incidents; methodical in follow-up and improvement.
- Highly analytical with a strong bias toward automation and simplification.
- Comfortable owning outcomes in complex, high-availability environments.
- Collaborative leader who builds trust with developers and stakeholders.
- Strong communicator able to explain technical trade-offs clearly.
Strategic Support
- Advise leadership on cloud strategy, reliability posture, and infrastructure investments.
- Support enterprise modernization, platform consolidation, and migration initiatives.
- Partner with Security on vulnerability management, access controls, and incident readiness.
- Contribute to long-term decisions on cloud architecture and operating models.
- Enable faster, safer product delivery across the organization.
Working Conditions
- Hybrid work model with collaboration across global engineering teams.
- Rotational on-call responsibility for critical production systems.
- Regular interaction with senior engineering and technology leadership.
- Occasional travel for planning sessions or cross-team alignment.
Job Function
- Cloud Operations & Platform Reliability
- DevOps & CI/CD Engineering
- Infrastructure-as-Code & Automation
- Site Reliability Engineering (SRE)
- Incident Management & Observability
- Cloud Cost & Performance Optimization
Compensation & Benefits
- Base Salary: $197,000 – $293,000
- Annual Performance Bonus
- Long-Term Incentive Plan (Equity / Performance Awards)
- Comprehensive Medical, Dental, and Vision Coverage
- 401(k) with Competitive Company Match
- Advanced Training, Certifications & Conference Support
- Wellness, Mental Health & Family Support Programs
- Generous Paid Time Off and Company Holidays
Why Join Us
- Run cloud platforms that operate at true enterprise scale.
- Influence how reliability, automation, and DevOps are practiced across the organization.
- Work with modern cloud, container, and observability technologies.
- Partner with top-tier engineers building high-impact products.
- Join a company that treats operational excellence as a competitive advantage.