Job Description
What We Do
What sets our group apart is end-to-end ownership of our models and services: distributed, high-throughput, low-latency systems that are collectively called billions of times a day. To deliver at that scale, we are building platforms that enable our application-focused ML engineering teams to go from an idea to a model to a scalable service with minimal overhead. We also offer higher-level abstractions and UIs that let domain experts build, deploy, and maintain production ML models for their applications in a self-service manner, with little engineering intervention.
What We Need From You
As an MLOps Engineer on the team, you will enhance our platforms to streamline the productionization of ML models. You will work with both application and platform teams to create a more cohesive, integrated, and managed model development life cycle. Typical activities include:
- Architecting, building, and diagnosing production ML systems
- Working closely with ML application teams to design seamless workflows for continuous model training, inference, and monitoring
- Defining and providing strong SLAs around latency, throughput and resource (memory / disk / network / CPU / GPU) usage
- Interfacing with both ML experts and platform engineers to understand workflows, pinpoint and resolve inefficiencies, and inform the next set of features for the platforms
- Collaborating with open-source communities and internal platform teams to build a cohesive MLOps experience
- Troubleshooting and debugging user issues
- Providing operational and user-facing documentation
Colleagues who excel in this role often exhibit these qualities:
- Curiosity to solve new problems and keep learning new technologies
- Passion for the engineering behind machine learning, and scaling it
- Industry experience with machine learning teams
- Proficiency in programming (Go, Python, JavaScript or similar) and willingness to learn new technologies as needed
- Working knowledge of common ML frameworks and formats such as PyTorch, TensorFlow, scikit-learn, and ONNX
- Prior experience with container technologies such as Docker, Kubernetes, and Buildpacks
- Experience with cloud providers such as AWS, GCP or Azure
- Willingness to collaborate with colleagues to achieve repeatable, high-quality outcomes as a team
Job ID: 125716