We're hiring a senior DevOps/SRE engineer who has operated production Kubernetes at meaningful scale, built AWS infrastructure with IaC, and can improve reliability, observability, security, and cost-efficiency for a microservices platform. You'll be instrumental in scaling our AI-powered product infrastructure and ensuring our data-intensive applications perform reliably at scale.

What you'll do (Key Responsibilities)

Kubernetes platform engineering (EKS-first)

● Design, build, and operate production-grade Kubernetes clusters (multi-nodegroup, autoscaling, upgrades, cluster add-ons).

● Implement intelligent autoscaling using real metrics (queue depth, consumer lag, service latency) via tools like KEDA/Karpenter.

AWS infrastructure & IaC

● Own AWS environments end-to-end (VPC, IAM, EKS/ECS/EC2, ALB/ELB, S3, Route53, CloudWatch, RDS, SQS, Lambda).

● Build reproducible infrastructure using Terraform, with strong review + change management practices.

● Implement backup/DR patterns (e.g., snapshots, retention, automation) and safe rollouts.

● Design infrastructure for data-intensive workloads: high-throughput ingestion, batch processing, and real-time streaming.

Distributed systems & data infrastructure

● Understand and operate distributed systems at scale — consensus, partitioning, replication, and failure modes.

● Build and maintain infrastructure for data pipelines and vector databases.

● Design for horizontal scalability, ensuring systems handle growing data volumes and user traffic gracefully.

Observability & reliability

● Build/own monitoring + logging from scratch and make it actionable (Prometheus/Grafana, ELK/EFK, alerting).

● Define/partner on SLI/SLOs and incident response practices; improve reliability with data-driven changes.

● Establish performance testing and production-like load testing environments.

Cost & performance optimization

● Continuously reduce AWS spend via right-sizing, Spot strategies, reserved capacity planning, and architecture improvements.

● Partner with engineering teams to diagnose bottlenecks (db queries, caching, queueing) and propose scalable solutions.

● Optimize infrastructure costs for data-heavy workloads (storage tiering, compute scheduling, GPU utilization).

Security & compliance

● Improve cloud and cluster security posture (IAM, network policies, secrets management, least privilege).

● Support SOC2 readiness/execution (controls, evidence automation, operational hardening).

● Implement access management patterns.

What we're looking for (Must-Have)

● 7+ years in DevOps / SRE / Cloud Infra roles operating production systems.

● Deep hands-on experience with Kubernetes in production.

● Strong AWS fundamentals across compute/networking/storage/identity, including VPC, IAM, EC2/EKS, ALB, S3, Route53, CloudWatch, RDS, SQS.

● Proven ability to build infra using Terraform (and strong IaC practices).

● Production-grade observability experience: Prometheus + Grafana, and centralized logging (ELK/EFK or similar).

● Experience scaling product infrastructure — you've grown systems from thousands to millions of requests, and understand capacity planning, bottleneck identification, and scaling patterns.

● Solid understanding of distributed systems concepts: CAP theorem, consistency models, partitioning strategies, distributed consensus, and failure handling.

● Strong understanding of databases and performance fundamentals.

● CI/CD experience building reliable pipelines (Jenkins, Spinnaker, GitHub Actions, or equivalents), with safe deployment strategies.

● Scripting/automation ability in Python and/or Bash (Go is a plus).

Nice-to-have (Bonus)

● Experience with high-throughput systems (millions of req/s, multi-region service operation).

● Data-intensive application design: familiarity with concepts from "Designing Data-Intensive Applications" — replication, partitioning, stream processing, batch processing.

● Data/streaming ecosystems: Kafka, Airflow, Druid (or similar).

● Vector database operations: Qdrant, Pinecone, Weaviate, with an understanding of ANN search, sharding, and performance tuning.

● SOC2 execution experience (or equivalent compliance/security programs).

What success looks like

● Stabilize and improve the Kubernetes platform (upgrades, add-ons, reliability, scaling posture).

● Reduce cloud costs measurably without sacrificing reliability (clear plan + tracked impact).

● Observability is actionable: fewer noisy alerts, clearer dashboards, meaningful SLOs.

● Faster, safer deployments with a consistent, repeatable pipeline and rollback strategies.

● Clear infra ownership: documentation, runbooks, incident playbooks, and clean IaC repos.

Typical stack you'll work with

● Kubernetes/EKS, Docker, Helm

● Terraform

● Prometheus, Grafana, ELK/EFK

● PostgreSQL, Redis, Qdrant

● CI/CD: Jenkins/Spinnaker/GitHub Actions (or similar)

Senior DevOps / SRE Engineer