Hanhui Wang

Room 472, West Village H

440 Huntington Avenue

Boston, Massachusetts

United States of America

Hanhui Wang (王翰辉) is a first-year Ph.D. student at the Visual Intelligence Lab at Northeastern University (NEU), where he is supervised by Prof. Huaizu Jiang. His research centers on generation and reasoning in AI systems, with current interests in controllable video generation, multimodal learning, and 3D/4D vision. His long-term goal is to bridge generative modeling and scene understanding toward building world models capable of causal reasoning and high-fidelity simulation of complex human-scene interactions.

Prior to joining NEU, he received his bachelor’s degree from Huazhong University of Science and Technology (HUST) and his master’s degree from the University of Southern California (USC). He has had the honor of collaborating with Prof. Xianzhi Li, Prof. Zhengzhong Tu, and Prof. Huaizu Jiang on research spanning diverse areas of computer vision. He also gained valuable industry experience through internships at leading technology companies such as iFLYTEK.

Research Keywords: Video Generation, Controllable Generation, World Models, Multimodal Learning, 3D/4D Vision, Generative AI.

News

Nov 18, 2025	Excited to deliver an Oral presentation of Struct2D at the New England Computer Vision (NECV) Workshop 2025 on November 21st at UMass Amherst!
Nov 9, 2025	Our paper SNAP: Towards Segmenting Anything in Any Point Cloud is accepted to 3DV 2026 as an Oral presentation! Especially meaningful to me, as it connects back to my undergrad days when I first became interested in 3D vision and point clouds.
Sep 18, 2025	Our paper Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs is accepted to NeurIPS 2025!
Jun 15, 2025	First conference, first poster, first time in Tennessee — I had an amazing time at CVPR 2025 in Nashville, the Music City. Grateful to everyone who made it so memorable. I’ll be returning, hungry for more!
Jun 7, 2025	After putting it off for way too long, I finally rebuilt my personal website. It’s cleaner, more up to date, and a little closer to how I want to present my work (for now).
Jun 2, 2025	Our paper Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs was selected as a spotlight presentation at the CVPR 2025 Workshop on Computer Vision in the Wild (CVinW)! I will be presenting it in Nashville on June 11th!
Feb 26, 2025	Our paper Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing is accepted to CVPR 2025!
Feb 6, 2025	I’m excited to share that I have accepted a Ph.D. offer at Northeastern University, where I will be joining the Visual Intelligence Lab under the supervision of Prof. Huaizu Jiang. Looking forward to this new chapter in Boston!

First-Authored Publications

See the full publication list here.

3DV’26
SNAP: Towards Segmenting Anything in Any Point Cloud

Aniket Gupta*, Hanhui Wang*, Charles Saunders, Aruni Roy Chowdhury, Hanumant Singh, and Huaizu Jiang

In The 13th International Conference on 3D Vision 2026

Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation.
@inproceedings{wang2024leveragingsamsinglesourcedomain,
  title = {SNAP: Towards Segmenting Anything in Any Point Cloud},
  author = {Gupta*, Aniket and Wang*, Hanhui and Saunders, Charles and Chowdhury, Aruni Roy and Singh, Hanumant and Jiang, Huaizu},
  booktitle = {The 13th International Conference on 3D Vision},
  year = {2026}
}

NeurIPS’25
Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs

Fangrui Zhu*, Hanhui Wang*, Yiming Xie, Jing Gu, Tianye Ding, Jianwei Yang, and Huaizu Jiang

In The 39th Annual Conference on Neural Information Processing Systems 2025

Unlocking spatial reasoning in Multimodal Large Language Models (MLLMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can MLLMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird’s-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source MLLMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source MLLM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in MLLMs without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
@inproceedings{zhu2025struct2dperceptionguidedframeworkspatial,
  title = {Struct2D: A Perception-Guided Framework for Spatial Reasoning in MLLMs},
  author = {Zhu*, Fangrui and Wang*, Hanhui and Xie, Yiming and Gu, Jing and Ding, Tianye and Yang, Jianwei and Jiang, Huaizu},
  year = {2025},
  booktitle = {The 39th Annual Conference on Neural Information Processing Systems}
}

CVPR’25
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing

Hanhui Wang*, Yihua Zhang*, Ruizheng Bai, Yue Zhao, Sijia Liu, and Zhengzhong Tu

In The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2025

Recent advancements in diffusion models have made generative image editing more accessible than ever. While these developments allow users to generate creative edits with ease, they also raise significant ethical concerns, particularly regarding malicious edits to human portraits that threaten individuals’ privacy and identity security. Existing general-purpose image protection methods primarily focus on generating adversarial perturbations to nullify edit effects. However, these approaches are often unstable when protecting against diverse editing requests. In this work, we introduce a novel perspective on personal human portrait protection against malicious editing. Unlike traditional methods that aim to prevent edits from taking effect, our method, FACELOCK, optimizes adversarial perturbations to ensure that original biometric information, such as facial features, is either destroyed or substantially altered post-editing, rendering the subject in the edited output biometrically unrecognizable. Our approach integrates facial recognition and visual perception factors into the perturbation optimization process, ensuring robust protection against a variety of editing attempts. In addition, we shed light on several critical issues with commonly used evaluation metrics in image editing and reveal cheating methods by which they can be easily manipulated, leading to deceptive assessments of protection. Through extensive experiments, we demonstrate that FACELOCK significantly outperforms all baselines in defense performance against a wide range of malicious edits. Moreover, our method exhibits strong robustness against purification techniques. Comprehensive ablation studies confirm the stability and broad applicability of our method across diverse diffusion-based editing algorithms. Our work not only advances the state of the art in biometric defense but also lays the foundation for more secure and privacy-preserving practices in image editing.
The code is publicly available at: https://github.com/taco-group/FaceLock.
@inproceedings{wang2025edit,
  title = {Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing},
  author = {Wang*, Hanhui and Zhang*, Yihua and Bai, Ruizheng and Zhao, Yue and Liu, Sijia and Tu, Zhengzhong},
  booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year = {2025}
}

arXiv’24
Leveraging SAM for Single-Source Domain Generalization in Medical Image Segmentation

Hanhui Wang, Huaize Ye, Yi Xia, and Xueyan Zhang

In arXiv 2024

Domain generalization (DG) aims to reduce shifts between domains in order to achieve promising performance on an unseen target domain, and it has been widely practiced in medical image segmentation. Single-source domain generalization (SDG) is the most challenging setting, since it trains on only one source domain. Although existing methods have made considerable progress on SDG for medical image segmentation, their performance still falls far short of applicable standards when faced with a relatively large domain shift. In this paper, we apply the Segment Anything Model (SAM) to SDG to greatly improve generalization ability. Specifically, we introduce a parallel framework in which the source images are sent to the SAM module and a conventional segmentation module, respectively. To reduce computational cost, we apply a merging strategy before sending images to the SAM module. We extract bounding boxes from the segmentation module and send refined versions as prompts to the SAM module. We evaluate our model on a classic DG dataset and achieve competitive results compared to other state-of-the-art DG methods. Furthermore, we conduct a series of ablation experiments to demonstrate the effectiveness of the proposed method. The code is publicly available at: https://github.com/sarihust/SAMMed.
@inproceedings{wang2024leveragingsamsinglesourcedomaio,
  title = {Leveraging SAM for Single-Source Domain Generalization in Medical Image Segmentation},
  author = {Wang, Hanhui and Ye, Huaize and Xia, Yi and Zhang, Xueyan},
  booktitle = {arXiv},
  year = {2024}
}