ML Infrastructure Engineer
This role sits on the AI/ML Infrastructure Engineering team and is dedicated to implementing and supporting AI/ML infrastructure solutions in cloud and on-premises environments. The engineer will work directly with infrastructure teams and may also work with data scientists, machine learning engineers, application developers, and quantitative analysts, functioning as both a solutions architect and a professional services engineer.
This is a hands-on developer role. Ideal candidates have deployed and supported their own production-ready AI/ML models in cloud environments and have automated the build and management of a broad range of cloud infrastructure using tools like Terraform. Candidates should be familiar with developing unit and functional tests, have experience designing and implementing CI/CD tooling with infrastructure-as-code pipelines, and have knowledge of Linux systems administration, containerization, networking, security, automated configuration and state management, cross-system orchestration, logging, metrics, monitoring, and alerting.
Principal Responsibilities:
Architect, develop, and maintain internal AI/ML infrastructure components, frameworks, and offerings
Architect, develop, and maintain AI/ML solutions for customers in cloud environments
Help customers architect, develop, and maintain their own AI/ML solutions in cloud environments
Implement CI/CD pipelines which include application tests, security tests, and gates
Implement availability, security, performance monitoring, and alerting of AI/ML solutions
Automate data resiliency and replication for AI/ML models
Manage multiple environments and promote code between them
Automate systems configuration and orchestration using tools such as Terraform, Chef, Ansible, or Salt
Automate creation of machine images and containers
Required Qualifications/Skills:
6+ years of experience designing and supporting production cloud environments
Experience consulting with customers to develop AI/ML solutions
Experience developing collaboratively, including infrastructure as code, preferably in Python
Systems engineering knowledge, including understanding of Linux, security, and networking
Experience with cloud templating tools such as Terraform
Experience with AI/ML frameworks (e.g., TensorFlow, PyTorch)
Experience with distributed computing tools (e.g., Ray, Dask)
Experience with model serving tools (e.g., vLLM, KServe (formerly KFServing))
Experience with building, monitoring, and alerting on logs and metrics
Knowledge of cloud networking, including connectivity, routing, DNS, VPCs, proxies, and load balancers
Knowledge of cloud security, including IAM, certificate management, and key management
Excellent written and verbal communication skills
Excellent troubleshooting and analytical skills
Self-starter able to execute independently, on a deadline, and under pressure