AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

Taewoong Kang*1, Hyojin Jang*1, Sohyun Jeong*1, Seunggi Moon2, Gihwi Kim3, Hoon Jin Jung3, Jaegul Choo1
* equal contribution
1KAIST    2Korea University    3FLIPTION
CVPR 2026

Head-swapped results comparison among the baselines. Our model outperforms the baselines in preserving identity, hairstyle, and accessories while reenacting the target body image's expression and head pose.

Abstract

Recent digital media advancements have created increasing demand for sophisticated portrait manipulation techniques, particularly head swapping, where one person's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and in faithfully preserving accessories under significant head pose variations.

Problem Definition


Problem definition of head swapping. The first row indicates the portion of the head from the source image that needs to be transferred, while the second row indicates the portion of the head in the target image that should be preserved.

For a natural and practical outcome, it is crucial to transfer the face ID, facial shape, skin tone, accessories, and hairstyle of the source image, while preserving the pose, expression, and head orientation of the target image.

Method

AHS comprises a specialized network architecture and a novel data augmentation strategy for effective head swapping and reenactment. Our model encodes identity using dedicated Head and Face Encoders via decoupled cross-attention, while H-Net (a reference network) preserves fine-grained head details through self-attention. To prevent reconstruction artifacts common in self-supervised training and improve robustness, the training process is enhanced with GAGAvatar-based synthetic data augmentation, which generates augmented data with diverse head poses and expressions.
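To make the decoupled cross-attention idea concrete, the following is a minimal NumPy sketch of the mechanism described above: the latent queries attend separately over the Head Encoder tokens and the Face Encoder tokens, and the two attention outputs are summed. All names, shapes, and the `face_scale` weight are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: (N, d) queries over (M, d) keys/values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_cross_attention(latent, head_tokens, face_tokens, face_scale=1.0):
    # Each identity stream gets its own attention pass over the same latent
    # queries, so the two encoders never compete for one key/value space;
    # the results are combined by summation.
    out_head = cross_attention(latent, head_tokens, head_tokens)
    out_face = cross_attention(latent, face_tokens, face_tokens)
    return out_head + face_scale * out_face

latent = np.random.randn(16, 64)       # 16 latent queries, dim 64
head_tokens = np.random.randn(8, 64)   # tokens from the Head Encoder (toy)
face_tokens = np.random.randn(4, 64)   # tokens from the Face Encoder (toy)
out = decoupled_cross_attention(latent, head_tokens, face_tokens)
print(out.shape)  # (16, 64)
```

In practice the keys and values would come from learned projections of the encoder tokens; the sketch reuses the tokens directly to keep the structure of the two parallel attention passes visible.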


Overview of training AHS. Our model encodes identity using dedicated Head and Face Encoders, while H-Net preserves fine-grained head details. To prevent reconstruction artifacts and improve robustness, the training process is enhanced with GAGAvatar, which generates augmented data with diverse head poses and expressions.


Mask augmentation strategy. It comprises dilation, widened bounding box creation, and merging with a random mask.
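The three steps in the caption above can be sketched on a toy binary mask with plain NumPy. The function names, dilation radius, margin, and random-rectangle scheme below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def dilate(mask, r=1):
    # binary dilation with a (2r+1)x(2r+1) square element,
    # implemented as a shifted max over the padded mask
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out = np.maximum(out, padded[dy:dy + mask.shape[0],
                                         dx:dx + mask.shape[1]])
    return out

def widened_bbox_mask(mask, margin=2):
    # fill the mask's bounding box, widened by `margin` pixels per side
    ys, xs = np.nonzero(mask)
    out = np.zeros_like(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, mask.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, mask.shape[1])
    out[y0:y1, x0:x1] = 1
    return out

def augment_mask(mask, rng):
    # dilation -> widened bounding box -> union with a random rectangle
    m = dilate(mask, r=1)
    m = np.maximum(m, widened_bbox_mask(mask, margin=2))
    h, w = mask.shape
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    rand = np.zeros_like(mask)
    rand[y0:y0 + h // 2, x0:x0 + w // 2] = 1
    return np.maximum(m, rand)

mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:20, 12:22] = 1  # toy head mask
aug = augment_mask(mask, np.random.default_rng(0))
assert (aug >= mask).all()  # augmentation only ever grows the mask
```

Each step only enlarges the masked region, which forces the inpainting model to synthesize content beyond the exact head boundary rather than copy it.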


Inference overview. Our model takes source and target images and outputs head-swapped results within a unified model.

Quantitative Results

Method CLIP-I (Head) ↑ FID ↓ FID (Crop) ↓ ID sim ↑ Head Orient. ↓ Expression ↓
REFace 0.7859 8.4491 24.31 0.5239 5.57 7.308
InstantID* 0.8223 5.7882 11.43 0.2829 7.52 7.591
HID 0.8577 5.6306 6.83 0.5555 11.51 8.474
AHS (Ours) 0.9132 5.7818 5.02 0.6230 8.01 6.204

Quantitative comparison. Best and second-best results are in bold and underlined, respectively. AHS outperforms existing methods in most metrics.
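For reference, the CLIP-I and ID sim columns are similarity scores between image embeddings, typically computed as cosine similarity. A minimal sketch follows; the embeddings are toy stand-ins (in practice they would come from CLIP's image encoder for CLIP-I or a face-recognition network for ID sim).

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity between two embedding vectors, in [-1, 1]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy stand-in embeddings of the source head and the swapped result
src_emb = np.array([1.0, 0.0, 1.0])
swap_emb = np.array([0.9, 0.1, 1.1])
print(round(cosine_similarity(src_emb, swap_emb), 4))  # 0.9926
```

Higher values indicate that the swapped head's embedding stays close to the source's, i.e., better identity preservation.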

Qualitative Results


Qualitative comparison. The images in the Head column are combined with those in the Body column. The last four columns are the head-swapped results produced by each method. Unlike baseline methods, AHS achieves high identity preservation while maintaining facial expressions.

Comparison with Additional Baselines

We further compare AHS with four additional baselines: Nano Banana, Qwen-Image-Edit, HeSer, and Ghost 2.0. While Nano Banana and Qwen-Image-Edit prioritize consistency, they often produce images identical to the input or suffer from severe copy-and-paste artifacts. HeSer and Ghost 2.0 rely on a crop-and-align pipeline, which is inherently unsuitable for head swapping as it cannot handle regions outside the fixed facial crop.


Qualitative comparison with additional baselines.

Method CLIP-I (Head) ↑ FID ↓ FID (Crop) ↓ ID sim ↑ Head Orient. ↓ Expression ↓
Ghost 2.0 0.8487 7.968 19.749 0.483 4.831 10.626
HeSer 0.8507 6.444 15.221 0.503 8.734 11.022
Nano Banana 0.8634 4.241 2.809 0.474 10.725 7.449
Qwen-Image-Edit 0.8789 4.377 4.422 0.536 15.504 7.522
AHS (Ours) 0.9139 9.613 6.719 0.625 8.427 6.887

Quantitative comparison with additional baselines.

Additional Results

User Study & Data Augmentation Effect


User study results. AHS is consistently preferred across all criteria, with a significant lead in identity, hairstyle, and accessories preservation.


Copy-and-paste artifacts from the model trained without our data augmentation (head reenactment + masking), demonstrating the importance of our augmentation strategy.

BibTeX

@inproceedings{kang2026ahs,
  title={AHS: Adaptive Head Synthesis via Synthetic Data Augmentations},
  author={Kang, Taewoong and Jang, Hyojin and Jeong, Sohyun and Moon, Seunggi and Kim, Gihwi and Jung, Hoon Jin and Choo, Jaegul},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}