About Me

Hello! I am a final-year undergraduate student. My research interests lie in embodied AI, with a particular emphasis on aligning human instruction intent using Vision-Language Models (VLMs). At present, my work centers on employing multiple AI agents for hierarchical decision-making in robots, aiming to tackle long-horizon tasks such as preparing a complete dinner.

Currently, I am supervised by Prof. Huiping Zhuang for my graduation project. Previously, I collaborated with Dr. Yongquan Chen and Junliang Li at AIRS on an object grasping system guided by visual instructions. The system has been covered by multiple media outlets and has attracted considerable attention within the industry.

Beyond my research, I consider myself a hardcore tech geek. I love building mechanical structures, electronic systems, and vision algorithms with my own hands to solve problems in the physical world, rather than dwelling only on theory. I also follow the entrepreneurial frontier closely and embrace the Silicon Valley spirit; I have always aspired to become a technology leader like Elon Musk or Steve Jobs. My life goal is to pursue what I truly love, rather than what is commonly regarded as the right thing to do.

I am looking for a 2025 summer internship and a 2026 PhD position. Please feel free to reach out; I am happy to discuss any topic.

CV

Education

Publications

  • Robotic Visual Instruction

    Authors: Yanbang Li, Ziyang Gong, Haoyang Li, Xiaoqi Huang, Haolan Kang, Guangping Bai, Xianzheng Ma

    Natural language is commonly used for human-robot interaction but lacks spatial precision, leading to ambiguity. To address this, we introduce Robotic Visual Instruction (RoVI), which uses 2D sketches to guide robotic tasks. RoVI encodes spatial-temporal information for 3D manipulation. We also present the Visual Instruction Embodied Workflow (VIEW), which uses Vision-Language Models (VLMs) to interpret RoVI and generate 3D actions. A dataset of 15K instances fine-tunes VLMs for edge deployment. VIEW shows strong generalization across 11 tasks, achieving an 87.5% success rate in real-world scenarios. Code and datasets will be released soon.

    Conference: CVPR 2025 (accepted, 3 positive reviews)
    Paper


  • Grasp What You Want: Embodied Dexterous Grasping System Driven by Your Voice

    Authors: Junliang Li, Kai Ye, Haolan Kang, Mingxuan Liang, Yuhang Wu, Zhenhua Liu, Huiping Zhuang, Rui Huang, Yongquan Chen

    In recent years, human-robot collaboration has become crucial, but robots struggle to interpret voice commands accurately. Traditional systems lack advanced manipulation and adaptability, especially in unstructured environments. This paper introduces the Embodied Dexterous Grasping System (EDGS), which uses a Vision-Language Model (VLM) to align voice and visual information, improving object handling. Inspired by human interactions, it employs a precise grasping strategy. Experiments show EDGS effectively manages complex tasks, demonstrating its potential in Embodied AI.

    Journal: JFR 2025 (under review)

    arXiv


Projects

Experience

Awards

  • Nov. 2024: Best Award for Outstanding Research Findings at the 26th China Hi-Tech Fair
  • Dec. 2022: Second Prize, TCL Huameng Scholarship, 2021–2022 academic year
  • Dec. 2022: Second Prize Scholarship, 2021–2022 academic year
  • Sep. 2022: Third Prize, 2022 Electronic Design Competition for Guangdong University Students