AHS: Adaptive Head Synthesis via Synthetic Data Augmentations

Taewoong Kang*1, Hyojin Jang*1, Sohyun Jeong*1, Seunggi Moon2, Gihwi Kim3, Hoon Jin Jung3, Jaegul Choo1
* equal contribution
1KAIST    2Korea University    3FLIPTION
CVPR 2026

Head-swapped results comparison among the baselines. Our model outperforms the baselines in preserving identity, hairstyle, and accessories while reenacting the target body image's expression and head pose.

Abstract

Recent digital media advancements have created increasing demand for sophisticated portrait manipulation techniques, particularly head swapping, where one person's head is seamlessly integrated with another's body. However, current approaches predominantly rely on face-centered cropped data with limited view angles, significantly restricting their real-world applicability. They struggle with diverse head expressions, varying hairstyles, and natural blending beyond facial regions. To address these limitations, we propose Adaptive Head Synthesis (AHS), which effectively handles full upper-body images with varied head poses and expressions. AHS incorporates a novel head-reenacted synthetic data augmentation strategy to overcome self-supervised training constraints, enhancing generalization across diverse facial expressions and orientations without requiring paired training data. Comprehensive experiments demonstrate that AHS achieves superior performance in challenging real-world scenarios, producing visually coherent results that preserve identity and expression fidelity across various head orientations and hairstyles. Notably, AHS shows exceptional robustness in maintaining facial identity under drastic expression changes and in faithfully preserving accessories under significant head pose variations.

Problem Definition


Problem definition of head swapping. The first row indicates the portion of the head from the source image that needs to be transferred, while the second row indicates the portion of the head in the target image that should be preserved.

For a natural and practical outcome, it is crucial to transfer the face ID, facial shape, skin tone, accessories, and hairstyle of the source image, while preserving the pose, expression, and head orientation of the target image.

Method

AHS comprises a specialized network architecture and a novel data augmentation strategy for effective head swapping and reenactment. Our model encodes identity using dedicated Head and Face Encoders via decoupled cross-attention, while H-Net (a reference network) preserves fine-grained head details through self-attention. To prevent reconstruction artifacts common in self-supervised training and improve robustness, the training process is enhanced with GAGAvatar-based synthetic data augmentation, which generates augmented data with diverse head poses and expressions.
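To make the decoupled cross-attention idea concrete, the following is a minimal NumPy sketch of the mechanism described above: the latent queries attend separately over the Head Encoder tokens and the Face Encoder tokens, and the two attention outputs are summed. All names, shapes, and the `face_scale` weight are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: (N, d) queries over (M, d) keys/values
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def decoupled_cross_attention(latent, head_tokens, face_tokens, face_scale=1.0):
    # Each identity stream gets its own attention pass over the same latent
    # queries, so the two encoders never compete for one key/value space;
    # the results are combined by summation.
    out_head = cross_attention(latent, head_tokens, head_tokens)
    out_face = cross_attention(latent, face_tokens, face_tokens)
    return out_head + face_scale * out_face

latent = np.random.randn(16, 64)       # 16 latent queries, dim 64
head_tokens = np.random.randn(8, 64)   # tokens from the Head Encoder (toy)
face_tokens = np.random.randn(4, 64)   # tokens from the Face Encoder (toy)
out = decoupled_cross_attention(latent, head_tokens, face_tokens)
print(out.shape)  # (16, 64)
```

In practice the keys and values would come from learned projections of the encoder tokens; the sketch reuses the tokens directly to keep the structure of the two parallel attention passes visible.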


Overview of training AHS. Our model encodes identity using dedicated Head and Face Encoders, while H-Net preserves fine-grained head details. To prevent reconstruction artifacts and improve robustness, the training process is enhanced with GAGAvatar, which generates augmented data with diverse head poses and expressions.


Mask augmentation strategy. It comprises dilation, widened bounding box creation, and merging with a random mask.
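The three steps in the caption above can be sketched on a toy binary mask with plain NumPy. The function names, dilation radius, margin, and random-rectangle scheme below are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def dilate(mask, r=1):
    # binary dilation with a (2r+1)x(2r+1) square element,
    # implemented as a shifted max over the padded mask
    padded = np.pad(mask, r)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out = np.maximum(out, padded[dy:dy + mask.shape[0],
                                         dx:dx + mask.shape[1]])
    return out

def widened_bbox_mask(mask, margin=2):
    # fill the mask's bounding box, widened by `margin` pixels per side
    ys, xs = np.nonzero(mask)
    out = np.zeros_like(mask)
    y0, y1 = max(ys.min() - margin, 0), min(ys.max() + margin + 1, mask.shape[0])
    x0, x1 = max(xs.min() - margin, 0), min(xs.max() + margin + 1, mask.shape[1])
    out[y0:y1, x0:x1] = 1
    return out

def augment_mask(mask, rng):
    # dilation -> widened bounding box -> union with a random rectangle
    m = dilate(mask, r=1)
    m = np.maximum(m, widened_bbox_mask(mask, margin=2))
    h, w = mask.shape
    y0, x0 = rng.integers(0, h // 2), rng.integers(0, w // 2)
    rand = np.zeros_like(mask)
    rand[y0:y0 + h // 2, x0:x0 + w // 2] = 1
    return np.maximum(m, rand)

mask = np.zeros((32, 32), dtype=np.uint8)
mask[10:20, 12:22] = 1  # toy head mask
aug = augment_mask(mask, np.random.default_rng(0))
assert (aug >= mask).all()  # augmentation only ever grows the mask
```

Each step only enlarges the masked region, which forces the inpainting model to synthesize content beyond the exact head boundary rather than copy it.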


Inference overview. Our model takes source and target images and outputs head-swapped results within a unified model.

Quantitative Results

Method CLIP-I (Head) ↑ FID ↓ FID (Crop) ↓ ID sim ↑ Head Orient. ↓ Expression ↓
REFace 0.7859 8.4491 24.31 0.5239 5.57 7.308
InstantID* 0.8223 5.7882 11.43 0.2829 7.52 7.591
HID 0.8577 5.6306 6.83 0.5555 11.51 8.474
AHS (Ours) 0.9132 5.7818 5.02 0.6230 8.01 6.204

Quantitative comparison. Best and second-best results are in bold and underlined, respectively. AHS outperforms existing methods in most metrics.
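For reference, the CLIP-I and ID sim columns are similarity scores between image embeddings, typically computed as cosine similarity. A minimal sketch follows; the embeddings are toy stand-ins (in practice they would come from CLIP's image encoder for CLIP-I or a face-recognition network for ID sim).

```python
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity between two embedding vectors, in [-1, 1]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy stand-in embeddings of the source head and the swapped result
src_emb = np.array([1.0, 0.0, 1.0])
swap_emb = np.array([0.9, 0.1, 1.1])
print(round(cosine_similarity(src_emb, swap_emb), 4))  # 0.9926
```

Higher values indicate that the swapped head's embedding stays close to the source's, i.e., better identity preservation.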

Qualitative Results


Qualitative comparison. The images in the Head column are combined with those in the Body column. The last four columns are the head-swapped results produced by each method. Unlike baseline methods, AHS achieves high identity preservation while maintaining facial expressions.

Comparison with Additional Baselines

We further compare AHS with four additional baselines: Nano Banana, Qwen-Image-Edit, HeSer, and Ghost 2.0. While Nano Banana and Qwen-Image-Edit prioritize consistency, they often produce images identical to the input or suffer from severe copy-and-paste artifacts. HeSer and Ghost 2.0 rely on a crop-and-align pipeline, which is inherently unsuitable for head swapping as it cannot handle regions outside the fixed facial crop.


Qualitative comparison with additional baselines.

Method CLIP-I (Head) ↑ FID ↓ FID (Crop) ↓ ID sim ↑ Head Orient. ↓ Expression ↓
Ghost 2.0 0.8487 7.968 19.749 0.483 4.831 10.626
HeSer 0.8507 6.444 15.221 0.503 8.734 11.022
Nano Banana 0.8634 4.241 2.809 0.474 10.725 7.449
Qwen-Image-Edit 0.8789 4.377 4.422 0.536 15.504 7.522
AHS (Ours) 0.9139 9.613 6.719 0.625 8.427 6.887

Quantitative comparison with additional baselines.

Additional Results

User Study & Data Augmentation Effect


User study results. AHS is consistently preferred across all criteria, with a significant lead in identity, hairstyle, and accessories preservation.


Copy-and-paste artifacts from the model trained without our data augmentation (head reenactment + masking), demonstrating the importance of our augmentation strategy.

BibTeX

@inproceedings{kang2026ahs,
  title={AHS: Adaptive Head Synthesis via Synthetic Data Augmentations},
  author={Kang, Taewoong and Jang, Hyojin and Jeong, Sohyun and Moon, Seunggi and Kim, Gihwi and Jung, Hoon Jin and Choo, Jaegul},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}