We're hiring a senior DevOps/SRE engineer who has operated production Kubernetes at meaningful scale, built AWS infrastructure with IaC, and can improve reliability, observability, security, and cost-efficiency for a microservices platform. You'll be instrumental in scaling our AI-powered product infrastructure and ensuring our data-intensive applications perform reliably at scale.
What you'll do (Key Responsibilities)
Kubernetes platform engineering (EKS-first)
● Design, build, and operate production-grade Kubernetes clusters (multi-nodegroup, autoscaling, upgrades, cluster add-ons).
● Implement intelligent autoscaling using real metrics (queue depth, consumer lag, service latency) via tools like KEDA/Karpenter.
AWS infrastructure & IaC
● Own AWS environments end-to-end (VPC, IAM, EKS/ECS/EC2, ALB/ELB, S3, Route53, CloudWatch, RDS, SQS, Lambda).
● Build reproducible infrastructure using Terraform, with strong review + change management practices.
● Implement backup/DR patterns (e.g., snapshots, retention, automation) and safe rollouts.
● Design infrastructure for data-intensive workloads: high-throughput ingestion, batch processing, and real-time streaming.
Distributed systems & data infrastructure
● Understand and operate distributed systems at scale — consensus, partitioning, replication, and failure modes.
● Build and maintain infrastructure for data pipelines and vector databases.
● Design for horizontal scalability, ensuring systems handle growing data volumes and user traffic gracefully.
Observability & reliability
● Build/own monitoring + logging from scratch and make it actionable (Prometheus/Grafana, ELK/EFK, alerting).
● Define or partner on SLIs/SLOs and incident response practices; improve reliability with data-driven changes.
● Establish performance testing and production-like load testing environments.
Cost & performance optimization
● Continuously reduce AWS spend via right-sizing, Spot strategies, reserved capacity planning, and architecture improvements.
● Partner with engineering teams to diagnose bottlenecks (database queries, caching, queueing) and propose scalable solutions.
● Optimize infrastructure costs for data-heavy workloads (storage tiering, compute scheduling, GPU utilization).
Security & compliance
● Improve cloud and cluster security posture (IAM, network policies, secrets management, least privilege).
● Support SOC2 readiness/execution (controls, evidence automation, operational hardening).
● Implement access management patterns
What we're looking for (Must-Have)
● 7+ years in DevOps / SRE / Cloud Infra roles operating production systems.
● Deep hands-on experience with Kubernetes in production.
● Strong AWS fundamentals across compute/networking/storage/identity, including VPC, IAM, EC2/EKS, ALB, S3, Route53, CloudWatch, RDS, SQS.
● Proven ability to build infra using Terraform (and strong IaC practices).
● Production-grade observability experience: Prometheus + Grafana, and centralized logging (ELK/EFK or similar).
● Experience scaling product infrastructure — you've grown systems from thousands to millions of requests, and understand capacity planning, bottleneck identification, and scaling patterns.
● Solid understanding of distributed systems concepts: CAP theorem, consistency models, partitioning strategies, distributed consensus, and failure handling.
● Strong understanding of databases and performance fundamentals.
● CI/CD experience building reliable pipelines (Jenkins/Spinnaker/GitHub Actions or equivalents), with safe deployment strategies.
● Scripting/automation ability in Python and/or Bash (Go is a plus).
Nice-to-have (Bonus)
● Experience with high-throughput systems (millions of req/s, multi-region service operation).
● Data-intensive application design: familiarity with concepts from "Designing Data-Intensive Applications" — replication, partitioning, stream processing, batch processing.
● Data/streaming ecosystems: Kafka, Airflow, Druid (or similar).
● Vector database operations: Qdrant, Pinecone, Weaviate, with an understanding of ANN search, sharding, and performance tuning.
● SOC2 execution experience (or equivalent compliance/security programs).
What success looks like
● Stabilize and improve the Kubernetes platform (upgrades, add-ons, reliability, scaling posture).
● Reduce cloud costs measurably without sacrificing reliability (clear plan + tracked impact).
● Observability is actionable: fewer noisy alerts, clearer dashboards, meaningful SLOs.
● Faster, safer deployments with a consistent, repeatable pipeline and rollback strategies.
● Clear infra ownership: documentation, runbooks, incident playbooks, and clean IaC repos.
Typical stack you'll work with
● Kubernetes/EKS, Docker, Helm
● Terraform
● Prometheus, Grafana, ELK/EFK
● PostgreSQL, Redis, Qdrant
● CI/CD: Jenkins/Spinnaker/GitHub Actions (or similar)