Sr. System Operations Engineer, APAC

Who We Are

At The Trade Desk, we recognize that a seamless customer experience is driven by operational excellence. In pursuit of constantly improving the reliability of our platform, we are establishing a global Systems Operations team. This team's core mission will be to vigilantly monitor The Trade Desk platform services, refine our incident response methodologies, and guarantee a robust and highly available customer experience. If you're passionate about ensuring system reliability, process improvement, and making an essential customer impact, we invite you to play a critical role in this next evolution of our on-call experience.

What You'll Do

  • Act as a technical expert and advisor to more junior Associate Systems Operations Engineers
  • At an escalated tier, monitor the state and stability of platform services via telemetry and alerts; triage issues and escalate to engineering teams as needed
    • Work collaboratively with development teams to facilitate issue remediation
    • Manage remediation task workflow
  • Proactively update and improve Systems Operations documentation and runbooks
  • Increase the effectiveness of the incident response process by defining and measuring relevant metrics
  • There may be periodic weekend coverage requirements

Who We Are Looking For

  • Bachelor’s Degree from a four-year university or relevant substitute experience
  • 6+ years of relevant work experience in Technical and/or Application Support with strong knowledge of services support and troubleshooting

The Systems Operations Engineer will either possess or be excited to learn a number of skills...

Technical Proficiency:

  • Understanding of large-scale distributed system architectures (e.g., databases, web services, application services).
  • Familiarity with monitoring tools (e.g., Prometheus, Grafana, Nagios).
  • Ability to configure and fine-tune alerts.
  • Proficiency or ability to learn programming languages including C# and SQL.

Incident Management and Troubleshooting:

  • Ability to prioritize and manage incidents based on severity, with a focus on customer impact.
  • Ability to remain calm under pressure and quickly diagnose issues.
  • Understanding of system logs, metrics, and telemetry.

Communication Skills:

  • Ability to communicate effectively with stakeholders during an incident.
  • Clear and concise documentation skills.
  • Ability to maintain and update troubleshooting guides (TSGs) and operational documentation.
  • Ability to explain complex technical issues and platform outages to non-technical stakeholders.

Automation & Scripting:

  • Ability to automate repetitive tasks.
  • Proficiency in scripting languages (e.g., Python, Bash) is a plus.