Senior Machine Learning Engineer - Agentic AI

As a Senior Machine Learning Engineer - Agentic AI within Data Impact & Governance, you will be at the forefront of designing and operating the platform capabilities that enable autonomous and semi-autonomous AI systems to function reliably across clinical, research, and operational domains.

This role offers a rare opportunity to build enterprise-wide agentic AI platforms in a regulated healthcare environment, where correctness, safety, governance, and auditability matter as much as innovation and scale. You will influence technical standards, platform architecture, and operational safeguards that shape how agentic AI is adopted across one of the world's leading cancer centers.

What's in it for you?
  • Outstanding Benefits: MD Anderson offers paid medical benefits, generous paid time off (PTO), and strong retirement plans, providing stability and long-term financial security.
  • Enterprise-Level Impact: Architect platform capabilities that support AI agents operating across complex health IT systems and enterprise workflows.
  • Technical Leadership: Shape standards, integration patterns, and guardrails governing agentic AI at organizational scale.
  • Career Growth & Visibility: Partner closely with enterprise architects, applied MLEs, data scientists, IT, and governance leaders on high-impact AI initiatives.
  • Responsible AI Innovation: Work in a mission-driven institution where responsible AI, safety, and trust are central to technology strategy.
  • Collaborative Culture: Join a highly skilled team that values intellectual rigor, mentorship, and cross-disciplinary collaboration.


***The ideal candidate will have a healthcare background with at least 5 years of industry experience in data science and 3+ years as a Senior ML Engineer focused on agentic AI systems***

Summary

The Senior Machine Learning Engineer - Agentic AI designs, evolves, and operates enterprise-scale agentic AI platform capabilities that enable safe, scalable, and governed deployment of autonomous and semi-autonomous AI systems. The role focuses on platform architecture, interoperability, validation frameworks, and operational safeguards that allow internal and third-party agent systems to function reliably in production healthcare environments.

This position operates at the intersection of autonomous AI behavior, enterprise systems integration, and regulated healthcare operations, where subtle failures can have systemic and high-impact consequences.

Major Work Activities

Core Responsibilities
  • Lead the design, evolution, and operation of the enterprise agentic AI platform in collaboration with enterprise architects and platform ML engineers.
  • Build platform components that enable interoperability between first-party and third-party agents, including identity, state, memory, tool access, orchestration, auditability, and policy enforcement.
  • Define and document standardized integration patterns connecting agents with enterprise business systems, data platforms, APIs, and health IT systems.
  • Provide reusable platform services, reference implementations, and SDKs that reduce risk and accelerate delivery for applied teams.
  • Design and operate validation and de-risking frameworks, including simulation, sandboxing, shadow execution, canary releases, and continuous behavior monitoring.
  • Establish and enforce platform standards for agent development, including interfaces, execution contracts, evaluation hooks, safety constraints, and observability requirements.
  • Participate in platform governance, release coordination, and incident response, supporting investigation and remediation of agent-related failures.
  • Implement platform safeguards such as fallback mechanisms, rollback strategies, approval gates, rate limiting, audit trails, and kill-switch capabilities.
  • Partner with software engineering, security, IT, and health IT stakeholders to deploy agentic AI capabilities in secure enterprise environments.
  • Support responsible AI practices through traceability of prompts, policies, tools, models, and agent actions, and through documentation of known failure modes and limitations.


Competencies

Technical Expertise
  • Experience building AI or ML platforms that serve multiple downstream teams and production workloads.
  • Strong proficiency in Python and integration of modern ML frameworks (e.g., PyTorch) with large language models and agent systems.
  • Hands-on experience with agentic AI frameworks such as LangGraph, LangChain, AutoGen, CrewAI, Semantic Kernel, or equivalent.
  • Working knowledge of agentic AI protocols and interoperability standards (e.g., MCP, agent-to-agent communication, structured tool invocation).
  • Experience implementing planner-executor loops, hierarchical agents, and multi-agent coordination patterns.
  • Familiarity with workflow orchestration tools (Airflow, Prefect, Temporal) and distributed execution frameworks (Ray or equivalent).
  • Experience deploying containerized AI platforms using Kubernetes in enterprise cloud environments with lineage, auditability, and controlled promotion to production.


Analytical Expertise
  • Ability to reason at the systems and platform level, balancing safety, performance, flexibility, and usability.
  • Experience designing quantitative evaluation strategies for agentic systems, including success rates, latency, cost, recovery behavior, and safety metrics.
  • Strong understanding of enterprise data governance, security, and privacy requirements, including healthcare and health IT considerations.
  • Ability to identify systemic risks stemming from agent autonomy, non-determinism, tool access, and multi-agent interactions.
  • Experience analyzing failure modes caused by prompt drift, model updates, tool changes, and cross-system dependencies.


Oral & Written Communication
  • Collaborate effectively with architects, applied MLEs, data scientists, software engineers, and IT partners.
  • Produce clear documentation covering platform architecture, APIs, integration patterns, validation frameworks, and operational runbooks.
  • Communicate platform capabilities, risks, and limitations to leadership and partner teams.
  • Contribute to internal standards and shared practices that improve safety, scalability, and consistency of agentic AI development.
  • Provide hands-on technical guidance, mentorship, and troubleshooting support to platform adopters.
  • Present technical and non-technical concepts clearly in meetings and institutional forums.


Education Required: Bachelor's degree in Computer Science, Software Engineering, Data Science, Physics, Math & Statistics, or another related engineering discipline.

Preferred Education: Master's degree or PhD with a concentration in science, engineering, or a related field.

Experience Required: Five years of experience in machine learning engineering, data science, data engineering, and/or software engineering. With a Master's degree, three years of experience required. With a PhD, one year of experience required.

Preferred Experience:
  • Experience designing, deploying, and maintaining agentic AI systems that operate autonomously and collaboratively across distributed environments.
  • Experience in monitoring and troubleshooting autonomous agents post-deployment, including performance degradation, clinical incidents, model updates, or corrective actions.
  • Experience raising the technical bar for team members, such as establishing reproducibility practices, review standards, or shared patterns.
  • Experience technically evaluating third-party agentic AI platforms within clinical workflows.


The University of Texas MD Anderson Cancer Center offers excellent benefits, including medical, dental, paid time off, retirement, tuition benefits, educational opportunities, and individual and team recognition.

This position may be responsible for maintaining the security and integrity of critical infrastructure, as defined in Section 113.001(2) of the Texas Business and Commerce Code and therefore may require routine reviews and screening. The ability to satisfy and maintain all requirements necessary to ensure the continued security and integrity of such infrastructure is a condition of hire and continued employment.

It is the policy of The University of Texas MD Anderson Cancer Center to provide equal employment opportunity without regard to race, color, religion, age, national origin, sex, gender, sexual orientation, gender identity/expression, disability, protected veteran status, genetic information, or any other basis protected by institutional policy or by federal, state or local laws unless such distinction is required by law. http://www.mdanderson.org/about-us/legal-and-policy/legal-statements/eeo-affirmative-action.html

Additional Information
  • Requisition ID: 178303
  • Employment Status: Full-Time
  • Employee Status: Regular
  • Work Week: Days
  • Minimum Salary: US Dollar (USD) 146,500
  • Midpoint Salary: US Dollar (USD) 183,000
  • Maximum Salary: US Dollar (USD) 219,500
  • FLSA: exempt and not eligible for overtime pay
  • Fund Type: Hard
  • Work Location: Remote (within Texas only)
  • Pivotal Position: Yes
  • Referral Bonus Available?: Yes
  • Relocation Assistance Available?: No
