Millet Cheong

Site Reliability Engineer | Self-driven • Observant • Problem Solver

Personal Philosophy

To me, being an SRE means owning the full lifecycle of reliability: not just responding to incidents, but preventing them by design. I’m passionate about reducing toil, building humane alerts, and writing documentation that turns fire drills into learning moments. I value innovation, creativity, and keeping commitments.

Problem Solving Highlights

Azure VM Recovery

Recovered a VM that could not be reached over RDP by mounting its disk on a healthy instance and repairing its configuration. Resolved the issue remotely, then shared the procedure with the team.

Device-side Video Chat Bug

Traced unexpected video labels to local iOS settings, resolving a rare bug that even senior developers had not seen before and strengthening client trust.

Certificate Incident Prevention

Discovered and escalated an ignored certificate expiration alert, preventing a major outage. Became the go-to SME for cert-related alerting.

SLI/SLO Design & Adoption

Designed and implemented SLO monitoring with Mosaic and Splunk for a critical Apple Pay service. The solution was reused by other teams.

Case Study: Mosaic Alert Averted

Problem: An infrastructure-level alert fired intermittently for days. It involved an area the project team was not familiar with and was handled carelessly, posing a risk of a critical service issue and exposing a need for alerting improvements.

Action: Took the initiative during on-call to investigate the alert thoroughly. Identified the root cause, proposed modifications from five different perspectives, and coordinated the corresponding cross-team revisions.

Impact: Increased SRE visibility and led to an updated alert-handling SOP.

Technical Skills

Python • Bash • Java • AWS • Azure • Terraform • Helm • Docker • Kubernetes • Prometheus • Grafana • Splunk • GitHub CI/CD • Mosaic • Incident Management

Knowledge Sharing & Impact

Authored 20+ onboarding and troubleshooting documents used globally.
Implemented 500+ improvements in logging and alerting across services.
Created local environment onboarding scripts in Python/Bash to support on-call engineers in APAC.

Project Descriptions

Local Environment Setup Automation Scripts

Developed Python and Bash scripts that automate local environment setup, streamlining the installation of project prerequisites for APAC teammates.

SLI/SLO Service Dashboard Rollout

Created a framework for designing service-specific SLIs and implemented Mosaic dashboards for real-time tracking. Improved reliability tracking and introduced alerts triggered by SLO breaches.

Cross-team Alerting Improvements

Collaborated with global development and operations teams to revamp logging and alerting rules, reducing alert noise and false positives, and increasing incident detection rates by 20%.