Site Reliability Engineer | Self-driven • Observant • Problem Solver
To me, being an SRE means owning the full lifecycle of reliability: not just responding to incidents, but preventing them by design. I’m passionate about reducing toil, building humane alerts, and writing documentation that turns fire drills into learning moments. I believe in innovation, creativity, and commitment.
Recovered a VM that was unreachable over RDP by mounting its disk on a healthy instance and repairing its configuration files. Solved the issue remotely, then shared the procedure with the team.
Traced unexpected video labels to local iOS settings, resolving a rare bug that even senior developers had not seen before and strengthening client trust.
Discovered and escalated an ignored certificate expiration alert, preventing a major outage. Became the go-to SME for cert-related alerting.
Designed and implemented SLO monitoring with Mosaic and Splunk for a critical Apple Pay service. Solution reused across other teams.
Problem: An infrastructure-level alert, involving an area the project team was not familiar with, fired intermittently for days but was handled carelessly, posing a risk of a critical service issue and exposing a need for better alerting.
Action: Took the initiative during on-call to investigate the alert thoroughly. Identified the root cause, proposed modifications from five different perspectives, and coordinated the corresponding cross-team revisions.
Impact: Increased SRE visibility and updated the alert-handling SOP.
Developed Python and Bash scripts to automate local environment setup, significantly streamlining how APAC teammates install project prerequisites.
Created a framework for designing service-specific SLIs and implemented Mosaic dashboards for real-time tracking. Improved reliability tracking and set up alerts triggered by SLO breaches.
Collaborated with global development and operations teams to revamp logging and alerting rules, reducing alert noise and false positives, and increasing incident detection rates by 20%.