publications | Haoxuan Wu

2025

ESWA

RMP-adapter: A region-based Multiple Prompt Adapter for multi-concept customization in text-to-image diffusion model

Zeyu Jiang, Lai-Man Po, Xuyuan Xu, Yexin Wang, Haoxuan Wu, and 2 more authors

Expert Systems with Applications, 2025

This paper introduces a novel framework for multi-concept customization in text-to-image diffusion models. At its core is a Multiple Prompt Adapter (MP-Adapter) capable of processing multiple image prompts in parallel, extracting features from target concepts and projecting them into the same latent space as the text prompt. This enables simultaneous handling of multiple concepts using just one reference image per concept. To address challenges in fusing multiple concepts with complex interactions, we propose a Region-based Denoising Framework (RDF) that dynamically generates concept-specific regions of interest during inference, allowing spatially decoupled injection of concept features. By integrating the MP-Adapter and RDF, our end-to-end pipeline enables multi-concept customization with intricate occlusions and interactions while preserving concept identities. This approach surpasses current methods by resolving concept conflicts, identity degradation, and occlusion issues, allowing flexible customization without concept-specific retraining. Both qualitative and quantitative evaluations demonstrate that our framework outperforms state-of-the-art approaches in multi-concept customization tasks, while ablation studies validate the effectiveness of each proposed component. This work significantly advances text-to-image generation capabilities for complex, user-defined concept combinations. Code and models will be released at https://github.com/baojudezeze/RMP-Adapter.

2024

TCSVT

Self-Calibration Flow Guided Denoising Diffusion Model for Human Pose Transfer

Yu Xue, Lai-Man Po, Wing-Yin Yu, Haoxuan Wu, Xuyuan Xu, and 2 more authors

IEEE Transactions on Circuits and Systems for Video Technology, 2024

The human pose transfer task aims to generate synthetic person images that preserve the style of reference images while accurately aligning them with the desired target pose. However, existing methods based on generative adversarial networks (GANs) struggle to produce realistic details and often face spatial misalignment issues. On the other hand, methods relying on denoising diffusion models require a large number of model parameters, resulting in slower convergence rates. To address these challenges, we propose a self-calibration flow-guided module (SCFM) to establish precise spatial correspondence between reference images and target poses. This module facilitates the denoising diffusion model in predicting the noise at each denoising step more effectively. Additionally, we introduce a multi-scale feature fusing module (MSFF) that enhances the denoising U-Net architecture through a cross-attention mechanism, achieving better performance with a reduced parameter count. Our proposed model outperforms state-of-the-art methods on the DeepFashion and Market-1501 datasets in both quantitative and qualitative evaluations of the synthesized images. Our code is publicly available at https://github.com/zylwithxy/SCFM-guided-DDPM.

ICONIP

SBoRA: Low-Rank Adaptation with Regional Weight Updates

Lai-Man Po, Yuyang Liu, Haoxuan Wu, Tianqi Zhang, Wing-Yin Yu, and 3 more authors

arXiv preprint arXiv:2407.05413, 2024

This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. Compared with LoRA, SBoRA either halves the number of trainable parameters or doubles the rank with a similar number of trainable parameters, while improving learning performance. By utilizing orthogonal standard basis vectors to initialize one of the low-rank matrices (either A or B), SBoRA facilitates regional weight updates and memory-efficient fine-tuning. This results in two variants, SBoRA-FA and SBoRA-FB, where only one of the matrices is updated, leading to a sparse update matrix ΔW with predominantly zero rows or columns. Consequently, most of the fine-tuned model’s weights (W0 + ΔW) remain unchanged from the pre-trained weights, akin to the modular organization of the human brain, which efficiently adapts to new tasks. Our empirical results demonstrate the superiority of SBoRA-FA over LoRA in various fine-tuning tasks, including commonsense reasoning and arithmetic reasoning. Furthermore, we evaluate the effectiveness of QSBoRA on quantized LLaMA models of varying scales, highlighting its potential for efficient adaptation to new tasks. Code is available at this https URL
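
The regional-update idea in the SBoRA abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the common LoRA convention ΔW = B·A with B of shape (d_out, r) and A of shape (r, d_in), and it assumes that the "FA" variant freezes A and the "FB" variant freezes B. The helper `one_hot_rows`, the chosen indices, and the toy dimensions are all hypothetical.

```python
import numpy as np


def one_hot_rows(indices, dim):
    """Return a (len(indices), dim) matrix whose i-th row is the
    standard basis vector e_{indices[i]} (hypothetical helper)."""
    mat = np.zeros((len(indices), dim))
    mat[np.arange(len(indices)), indices] = 1.0
    return mat


rng = np.random.default_rng(0)
d, r = 8, 2                # toy square layer (d_in = d_out = d), rank r
idx = np.array([1, 5])     # assumed choice of standard basis indices

# FA-style variant (assumption: A is frozen, B is trained).
A_fixed = one_hot_rows(idx, d)         # (r, d), rows are one-hot, never updated
B_train = rng.normal(size=(d, r))      # stands in for learned values
delta_fa = B_train @ A_fixed           # (d, d), nonzero only in columns idx
nonzero_cols = np.flatnonzero(np.abs(delta_fa).sum(axis=0))

# FB-style variant (assumption: B is frozen, A is trained).
B_fixed = one_hot_rows(idx, d).T       # (d, r), columns are one-hot
A_train = rng.normal(size=(r, d))      # stands in for learned values
delta_fb = B_fixed @ A_train           # (d, d), nonzero only in rows idx
nonzero_rows = np.flatnonzero(np.abs(delta_fb).sum(axis=1))

# Trainable-parameter counts for this square toy layer.
lora_params = r * (d + d)   # LoRA trains both A and B
sbora_params = r * d        # only one of the two matrices is trained
```

In this toy setting only the columns (or rows) selected by `idx` of the update differ from zero, so the remaining entries of W0 + ΔW equal the pre-trained weights, and the trainable parameter count is half that of LoRA at the same rank.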