Sejoon Kim as a DevOps Engineer

CloudCops GmbH | March 2025 - Present

DevOps Engineer, Kubernetes, ArgoCD, GitOps, Security, Cost Optimization, Python


Building infrastructure that teams can trust and, since 2026, the backend features that run on it.

About CloudCops

CloudCops GmbH is a German DevOps company that operates its own product portfolio alongside client work: a travel platform, a fintech platform for a banking-sector client, and several smaller products. Small team, real production traffic, no separation between "the person who builds it" and "the person who gets paged for it".

Role Overview

I joined in March 2025 as a DevOps Engineer working across the portfolio: Kubernetes platforms, GitOps deployment, observability, and security automation. Since early 2026 I additionally own backend feature work on the travel platform (Python / Django), which means I now ship features onto infrastructure I also operate.

Key Projects

Travel agency platform in Dubai (B2B, B2C)

Context: CloudCops product | Timeline: March 2025 - Present

The platform's biggest infrastructure decision was leaving managed cloud. I led the migration from Azure AKS to self-managed Kubernetes on Hetzner Cloud: kubeadm HA control plane, Cilium CNI, MetalLB with the Hetzner Cloud Controller Manager. Azure-managed primitives had to be replaced one by one: AAD pod identity gave way to cert-manager plus Vault, and managed PostgreSQL to self-hosted PostgreSQL with WAL archiving and point-in-time recovery, with Velero added for PVC backup and restore.

82% monthly cost reduction in post-migration steady state
Database cutover via PostgreSQL streaming replication with lag validation, holding write unavailability to roughly 5-10 minutes instead of a full-stack outage
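
A minimal sketch of the kind of lag validation that can gate a cutover like this; it is not the production script. It assumes the primary's `pg_current_wal_lsn()` and the replica's `replay_lsn` and `state` (from `pg_stat_replication`) have already been sampled a few times. `ReplicaSample`, `ready_for_cutover`, and the zero-byte default budget are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


@dataclass(frozen=True)
class ReplicaSample:
    """One observation of the replica (hypothetical container for this sketch)."""
    state: str        # pg_stat_replication.state, e.g. "streaming" or "catchup"
    primary_lsn: str  # pg_current_wal_lsn() on the primary
    replay_lsn: str   # pg_stat_replication.replay_lsn

    @property
    def lag_bytes(self) -> int:
        return lsn_to_int(self.primary_lsn) - lsn_to_int(self.replay_lsn)


def ready_for_cutover(samples, max_lag_bytes: int = 0) -> bool:
    """True only if every sample is streaming and within the lag budget.

    An empty sample list is treated as "not ready" so a failed probe
    can never green-light the cutover.
    """
    return bool(samples) and all(
        s.state == "streaming" and s.lag_bytes <= max_lag_bytes for s in samples
    )
```

With writes already stopped on the old primary, a check of this shape passes only once the replica has replayed everything, which is what keeps the write-unavailable window short.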

Other work on this platform:

Ingress migration: nginx-ingress to Traefik v3 on Kubernetes Gateway API across three k3s clusters, executed as a 4-phase zero-downtime rollout (deploy, dual-stack, DNS cutover with pre-lowered TTL, nginx removal), triggered by CVE-2025-1974
P1 incident and the fix: PostgreSQL, RabbitMQ, and Typesense had co-scheduled onto one node, which went NotReady under memory pressure and spiked error rates ~400x. I split the cluster into system / workload / data / monitoring nodepools with workload taints, PodAntiAffinity, and zone-aware TopologySpreadConstraints so the failure mode is now blocked at the scheduler level
Per-PR preview environments: ArgoCD ApplicationSet listening to GitHub pull request events, PII-sanitized staging snapshots, per-PR subdomain ingress, namespace resource quotas, 7-day auto-teardown
Cluster lifecycle: soak-gated Kubernetes 1.33 to 1.35 upgrades across dev, stage, and prod
Security: CrowdSec WAF on ingress with scenario tuning to avoid CGNAT false positives
Observability: Tempo distributed tracing with object-storage backend, Traefik request dashboards in Grafana
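
The 7-day auto-teardown for preview environments comes down to an age check over preview namespaces. The sketch below assumes the age is measured from namespace creation time and that names and timestamps have already been fetched from the cluster; `expired_previews` and `PREVIEW_TTL` are hypothetical names, not the actual implementation.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 7-day window counts from namespace creation.
PREVIEW_TTL = timedelta(days=7)


def expired_previews(namespaces: dict, now: datetime, ttl: timedelta = PREVIEW_TTL) -> list:
    """Return preview namespace names whose age has reached the TTL.

    `namespaces` maps namespace name -> timezone-aware creation timestamp.
    The result is sorted so repeated runs produce stable output.
    """
    return sorted(name for name, created in namespaces.items() if now - created >= ttl)


if __name__ == "__main__":
    now = datetime(2026, 1, 15, tzinfo=timezone.utc)
    sample = {
        "pr-101": datetime(2026, 1, 1, tzinfo=timezone.utc),
        "pr-140": datetime(2026, 1, 14, tzinfo=timezone.utc),
    }
    print(expired_previews(sample, now))
```

Keeping the selection pure (no cluster calls inside) makes the rule easy to test separately from whatever actually deletes the namespaces.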

Backend feature work on the travel platform (2026)

Passport-scanning booking feature, end-to-end: evaluated 7 OCR options (on-device, cloud, and specialized vendors) with a working demo, selected Azure Document Intelligence, and built the Python / Django OCR backend API
GDPR compliance layer for passport data: field-level encryption at rest, admin-access audit logging, participant consent records, and 30-day retention auto-cleanup via Celery beat, designed from a cross-jurisdiction compliance document I authored
Reliability fix: moved blocking 30-second OCR calls off gunicorn sync workers (which were starving the worker pool under load) onto dedicated Celery async workers with a Redis-backed task contract
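
The shape of the reliability fix is a submit-and-poll contract: the web request returns a task id immediately and the slow OCR call runs elsewhere. The sketch below is a stand-in that uses a standard-library thread pool instead of Celery and Redis so it runs anywhere; `OcrTaskQueue` and its methods are hypothetical, and the state names are borrowed from Celery's built-in task states.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import wait as futures_wait


class OcrTaskQueue:
    """Stand-in for a Celery queue: submit returns at once, status is polled."""

    def __init__(self, ocr_fn, max_workers: int = 2):
        self._ocr_fn = ocr_fn  # the slow call, e.g. a request to an OCR service
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._tasks = {}       # task id -> Future (Redis plays this role in the real setup)

    def submit(self, image_bytes: bytes) -> str:
        """Queue the OCR call and return a task id without blocking the caller."""
        task_id = uuid.uuid4().hex
        self._tasks[task_id] = self._pool.submit(self._ocr_fn, image_bytes)
        return task_id

    def status(self, task_id: str) -> dict:
        """Non-blocking status lookup, suitable for a polling endpoint."""
        future = self._tasks[task_id]
        if not future.done():
            return {"state": "PENDING"}
        error = future.exception()
        if error is not None:
            return {"state": "FAILURE", "error": str(error)}
        return {"state": "SUCCESS", "result": future.result()}

    def wait(self, task_id: str, timeout=None) -> None:
        """Block until the task finishes; handy in tests, not in request handlers."""
        futures_wait([self._tasks[task_id]], timeout=timeout)
```

The point of the contract is that request-serving workers only ever call `submit` and `status`, both of which return immediately, so a 30-second OCR call can no longer pin a web worker.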

Fintech platform (B2C)

Context: banking-sector client | Timeline: March 2025 - Present

98% CVE reduction through end-to-end Trivy adoption: Trivy Operator for runtime scanning, a CI gate where Critical findings block merge and High findings warn, and Renovate-driven base image updates
90% monitoring cost reduction: migrated from Grafana SaaS to self-hosted kube-prometheus-stack (Prometheus, Grafana, Alertmanager, Loki) with full observability retained
P2 incident, resolved in 1h35m: a single expired Azure Service Principal simultaneously broke ArgoCD OIDC, Grafana OIDC, Alertmanager OAuth2-Proxy, and External Secrets Operator. I authored a 4-module Terragrunt recovery runbook, then eliminated the failure class with a daily GitHub Actions workflow alerting on credentials expiring within 30/14/7 days
Secure staging environments for penetration testing and vulnerability management
Dual CI ecosystems: the client stack runs GitLab and GitLab CI on Azure AKS, so my daily work spans GitLab CI here and GitHub Actions on CloudCops products
Pattern reuse: replicated the Gateway API ingress migration from the travel platform cluster, the first concrete proof that our platform layer works across unrelated tenants
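
The 30/14/7-day credential alerting from the P2 follow-up can be sketched as a small bucketing rule. This is an illustration, not the workflow's actual code: it assumes expiry dates have already been collected, and `alert_level`, `expiring_credentials`, and the choice to keep already-expired credentials in the tightest bucket are assumptions.

```python
from datetime import date

# Thresholds from the alerting policy: warn at 30, 14, and 7 days before expiry.
THRESHOLDS_DAYS = (30, 14, 7)


def alert_level(expires_on: date, today: date, thresholds=THRESHOLDS_DAYS):
    """Return the tightest threshold the credential falls within, or None.

    Assumption: credentials that have already expired (negative days left)
    stay in the tightest bucket rather than dropping out of the report.
    """
    days_left = (expires_on - today).days
    matched = [t for t in thresholds if days_left <= t]
    return min(matched) if matched else None


def expiring_credentials(credentials: dict, today: date) -> dict:
    """Map credential name -> alert level, skipping those outside every window."""
    levels = {name: alert_level(expiry, today) for name, expiry in credentials.items()}
    return {name: level for name, level in levels.items() if level is not None}
```

A daily scheduled job would feed this with the current expiry dates and send one alert per non-empty bucket.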

Azure Infrastructure Automation

Context: CloudCops internal | Timeline: March 2025 - December 2025

85% time reduction: turned Azure App Registration from a manual 4-step workflow into an automated single-push deployment, eliminating manual errors in permission configuration
IaC modernization: reworked script-based Terraform ArgoCD modules into a Helm-based architecture with Gateway API support, version control, and rollback

Impact & Metrics

Metric	Result
Infrastructure cost (travel platform)	82% reduction via AKS to self-managed migration
Active CVEs (fintech client)	98% reduction
Monitoring cost (fintech client)	90% reduction
Azure App Registration time	85% reduction
P2 identity-cascade incident	Recovered in 1h35m, failure class eliminated

Technologies & Tools

Kubernetes: k3s, kubeadm, Helm, ArgoCD and ApplicationSet, Cilium, MetalLB, Gateway API, Traefik, cert-manager, Velero
IaC & CI/CD: Terraform, Terragrunt, GitHub Actions, GitLab CI, Renovate, Docker
Observability: Prometheus, Grafana, Alertmanager, Loki, Tempo, Sentry
Security: Trivy, CrowdSec, OAuth2-Proxy, Vault, External Secrets Operator
Data: PostgreSQL, MongoDB, RabbitMQ, Redis, Typesense
Backend: Python, Django, Celery
Cloud: Microsoft Azure, Hetzner Cloud

Reflection

Three things this job keeps teaching me:

Incidents are tuition. The P1 taught us scheduler-level isolation; the P2 taught us credential lifecycle automation. Both failure classes are now structurally blocked, not just patched.
Platform patterns should be built for reuse. The Gateway API migration was designed once and applied to two unrelated clusters. That is the difference between doing ops and building a platform.
Operating what you build changes how you build. Owning backend features on infrastructure I also run made me write more boring, more observable code. The pager is an excellent code reviewer.