Yujia Bao
🔬 Machine Learning Researcher
🚀 Associate Director @ Accenture
🐾 Proud owner of Samoyeds
Hello! I am a machine learning researcher and a lifelong engineer who cannot live without Vim. My goal is to push the frontier of AI and make it useful and safe for humanity.
Currently, I manage a team of 80+ research scientists and engineers at Accenture, focusing on AI for the enterprise. I lead the development of AI Refinery, an agentic AI platform driving AI adoption for Fortune 500 companies.
My work spans building scalable agent architectures, optimizing LLM post-training, and advancing fundamental machine learning algorithms. I am driven by the excitement of “zero to one” innovation—translating cutting-edge research into stable, production-grade platforms that solve real-world problems.
I received my Ph.D. and S.M. in Computer Science from MIT CSAIL, advised by Regina Barzilay. Prior to MIT, I earned an M.A. in Mathematics from UW-Madison and a B.S. in Mathematics from Shanghai Jiao Tong University.
You can find my full resume here.
Recent Work
AI Refinery: Enterprise Agentic Platform
I lead the engineering and research for AI Refinery, enabling developers to build and govern complex agentic workflows.
- Agent Orchestration: Developed the Distiller framework, which decomposes complex user queries into tasks for specialized agents, and designed context management algorithms for efficient multi-agent memory sharing (ICLR 2025); a minimal orchestration sketch follows this list.
- Engineering Standards: Grew the team from 3 to 80, establishing core engineering standards, code maturity levels, and performance metrics to ensure production-grade platform availability.
- Ecosystem Growth: Scaled the ecosystem by organizing global workshops and tutorials, upskilling 10,000+ developers to accelerate GenAI delivery.
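To give a flavor of the orchestration pattern, here is a minimal sketch, not the actual Distiller implementation: the `Orchestrator`, `Task`, and placeholder `plan` method are hypothetical stand-ins, and a production planner would call an LLM to decompose the query and assign agents.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types for illustration; the real framework is far richer.

@dataclass
class Task:
    description: str
    agent_name: str

class Orchestrator:
    """Routes decomposed sub-tasks to specialized agents and merges their answers."""

    def __init__(self) -> None:
        self.agents: dict[str, Callable[[str], str]] = {}

    def register(self, name: str, agent: Callable[[str], str]) -> None:
        self.agents[name] = agent

    def plan(self, query: str) -> list[Task]:
        # Placeholder planner: fan the query out to every registered agent.
        # A real system would decompose the query into distinct sub-tasks.
        return [Task(description=query, agent_name=name) for name in self.agents]

    def run(self, query: str) -> str:
        results = [self.agents[t.agent_name](t.description) for t in self.plan(query)]
        return "\n".join(results)

if __name__ == "__main__":
    orch = Orchestrator()
    orch.register("search", lambda q: f"[search agent] results for: {q}")
    orch.register("summarize", lambda q: f"[summarizer] summary of: {q}")
    print(orch.run("Quarterly revenue drivers for our retail clients"))
```

Keeping the planner separate from agent execution is what makes this shape governable: the plan can be inspected, logged, and constrained before any agent runs.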
LLM Customization
To support domain-specific enterprise needs, I lead LLM customization efforts spanning multiple stages of training:
- Pre-training: Built automated pipelines to curate and filter large-scale open-source datasets (Wikipedia, arXiv), implementing quality taggers to ensure robust model foundations (a toy filtering sketch follows this list).
- Mid-training: Adapted models to domain-specific contexts (e.g., Fortune Analytics), processing proprietary multi-modal assets and eliminating biases from sensitive data.
- Post-training: Optimized performance through advanced techniques including KV-cache reuse (NeurIPS 2025), targeted unlearning (ICLR 2025), and SFT data selection (ICLR 2025).
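As an illustration of the pre-training curation step above, here is a toy quality-tagging filter. The heuristics and thresholds (`min_len`, `min_alpha`, `max_rep`) are hypothetical stand-ins; production taggers typically combine model-based quality classifiers, deduplication, and language identification.

```python
def quality_tags(doc: str) -> dict[str, float]:
    """Compute simple per-document quality signals."""
    words = doc.split()
    if not words:
        return {"length": 0.0, "alpha_ratio": 0.0, "repetition": 1.0}
    return {
        "length": float(len(words)),
        # Fraction of purely alphabetic tokens (punctuation-heavy text scores low).
        "alpha_ratio": sum(w.isalpha() for w in words) / len(words),
        # High values indicate boilerplate or spam-like repetition.
        "repetition": 1.0 - len(set(words)) / len(words),
    }

def keep(doc: str, min_len: int = 20, min_alpha: float = 0.7, max_rep: float = 0.5) -> bool:
    tags = quality_tags(doc)
    return (
        tags["length"] >= min_len
        and tags["alpha_ratio"] >= min_alpha
        and tags["repetition"] <= max_rep
    )

docs = [
    "buy now buy now buy now",
    "Large language models are pre-trained on curated corpora drawn from "
    "sources such as Wikipedia and arXiv, then adapted to downstream "
    "domains through mid-training and post-training stages.",
]
print([keep(d) for d in docs])  # -> [False, True]
```

Tagging and filtering are kept as separate steps so the same tags can be re-thresholded or audited later without re-scanning the corpus.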
Machine Learning Foundations
Prior to my current focus on GenAI, my research centered on transformer architectures, fairness, and human-machine interaction:
- Vision Transformers: Developed Channel ViT (ICLR 2024) and Contextual ViT to handle covariate shifts in biological imaging (a minimal sketch of the channel-wise embedding idea follows this list).
- Fairness & Robustness: Developed algorithms (ICML 2021, ICML 2022) for automatic bias discovery by learning challenging data splits.
- Human-Machine Interaction: Enhanced NLP system interpretability by deriving machine attention from human rationales (EMNLP 2018).
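The sketch below illustrates the channel-wise embedding idea behind Channel ViT: each input channel is patchified independently with a shared projection, and a learnable channel embedding marks which channel each token came from. Module names, sizes, and the exact forward pass are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ChannelWisePatchEmbed(nn.Module):
    """Minimal sketch of a Channel ViT-style patch embedding."""

    def __init__(self, num_channels: int, patch_size: int = 16, dim: int = 192):
        super().__init__()
        # Shared single-channel patch projection, applied to every channel.
        self.proj = nn.Conv2d(1, dim, kernel_size=patch_size, stride=patch_size)
        # One learnable embedding per input channel.
        self.channel_embed = nn.Parameter(torch.zeros(num_channels, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        tokens = []
        for i in range(c):
            t = self.proj(x[:, i : i + 1])            # (b, dim, h/p, w/p)
            t = t.flatten(2).transpose(1, 2)          # (b, n_patches, dim)
            tokens.append(t + self.channel_embed[i])  # add channel identity
        # Sequence length grows linearly with the number of channels, which
        # is what lets the model cope with varying channel subsets.
        return torch.cat(tokens, dim=1)

x = torch.randn(2, 5, 32, 32)  # e.g., a 5-channel microscopy image
emb = ChannelWisePatchEmbed(num_channels=5, patch_size=16)
print(emb(x).shape)            # torch.Size([2, 20, 192])
```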