Interview with Xiang Fang: Multi-Modal Learning and Embodied Intelligence

AIhub | Jan 20, 2026

Summary

In this interview, PhD candidate Xiang Fang discusses his multi‑modal learning research at NTU, covering efficient video understanding, out‑of‑distribution detection for trustworthy AI, and embodied intelligence for vision‑language navigation. He highlights a standout project that adapts biological reaction‑diffusion patterns to fuse video and text, illustrating his interdisciplinary, mathematically driven approach. Looking ahead, he aims to build unified vision‑language‑action models that handle incomplete inputs, remain robust in the wild, and stay efficient enough for real‑time deployment. Fang's background in geological engineering and competitive mathematics fuels his drive to create AI agents that can both see and act in the physical world.
