EgoX: Egocentric Video Generation from a Single Exocentric Video

* Equal contribution
¹KAIST AI    ²Seoul National University
"The Dark Knight" - Joker
"Iron Man" - Pepper Potts
"Avengers: Age of Ultron" - Hulk
"Captain America" - Iron Man
"PARIS 2024" - Table tennis player

Abstract

Egocentric perception enables humans to experience and understand the world directly from their own point of view. Translating exocentric (third-person) videos into egocentric (first-person) videos opens up new possibilities for immersive understanding but remains highly challenging due to extreme camera pose variations and minimal view overlap. This task requires faithfully preserving visible content while synthesizing unseen regions in a geometrically consistent manner. To achieve this, we present EgoX, a novel framework for generating egocentric videos from a single exocentric input. EgoX leverages the pretrained spatio-temporal knowledge of large-scale video diffusion models through lightweight LoRA adaptation and introduces a unified conditioning strategy that combines exocentric and egocentric priors via width- and channel-wise concatenation. Additionally, a geometry-guided self-attention mechanism selectively attends to spatially relevant regions, ensuring geometric coherence and high visual fidelity. Our approach achieves coherent and realistic egocentric video generation while demonstrating strong scalability and robustness across unseen and in-the-wild videos.

Overall Pipeline

Given an exocentric video input, we first lift it into a 3D point cloud and render the scene from the egocentric viewpoint to obtain the egocentric prior video. The clean exocentric video latent and the egocentric prior latent are combined via width-wise and channel-wise concatenation in the latent space, and then fed into a pretrained video diffusion model equipped with the proposed geometry-guided self-attention.
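The sketch below shows one plausible reading of this conditioning in PyTorch. The function name, tensor shapes, and the zero-padding of the exocentric stream are illustrative assumptions, not the released implementation.

```python
import torch

def build_conditioned_input(exo_latent: torch.Tensor,
                            ego_prior_latent: torch.Tensor,
                            noisy_ego_latent: torch.Tensor) -> torch.Tensor:
    """All latents are assumed to share the shape (B, C, T, H, W)."""
    # Channel-wise concatenation: stack the rendered egocentric prior onto
    # the noisy egocentric latent so every token carries its geometric prior.
    ego_stream = torch.cat([noisy_ego_latent, ego_prior_latent], dim=1)  # (B, 2C, T, H, W)

    # Pad the clean exocentric latent to the same channel depth
    # (zero padding is one simple choice, assumed here).
    exo_stream = torch.cat([exo_latent, torch.zeros_like(exo_latent)], dim=1)

    # Width-wise concatenation: place the exocentric stream beside the
    # egocentric stream so self-attention can mix tokens across both views.
    return torch.cat([exo_stream, ego_stream], dim=-1)                   # (B, 2C, T, H, 2W)
```

One appeal of width-wise concatenation is that the pretrained backbone can treat both views as a single wide video, letting ordinary self-attention relate tokens across views without extra cross-attention layers.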

Geometry-Guided Self-Attention

Geometry-Guided Self-Attention. 3D direction similarities between egocentric queries and exocentric keys are used as an additive bias in the attention map, guiding the model to focus on geometrically aligned regions. Although the orange and red directions belong to the same key tokens, the directions differ because they are measured from different camera centers. The blue–red pairs have similar directions and thus receive higher scores, whereas the green–orange pairs point in opposite directions and receive lower scores.
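A minimal sketch of this biased attention is given below, assuming per-token unit ray directions are already available from the point-cloud lift; the direction convention (rays from the egocentric camera center to each token's backprojected 3D point) and the bias weight lambda_g are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def geometry_guided_attention(q, k, v, q_dirs, k_dirs, lambda_g: float = 1.0):
    """q, k, v: (B, heads, Nq, d) and (B, heads, Nk, d) attention tensors.
    q_dirs: (B, Nq, 3) ray directions of egocentric query tokens.
    k_dirs: (B, Nk, 3) ray directions of exocentric key tokens, measured
            from the same (egocentric) camera center so they are comparable.
    """
    d = q.shape[-1]
    logits = (q @ k.transpose(-2, -1)) / d ** 0.5       # (B, heads, Nq, Nk)

    # Cosine similarity between the 3D directions of every query/key pair:
    # geometrically aligned pairs get a positive additive bias, while
    # opposite directions get a negative one.
    geo_bias = F.normalize(q_dirs, dim=-1) @ F.normalize(k_dirs, dim=-1).transpose(-2, -1)

    logits = logits + lambda_g * geo_bias.unsqueeze(1)  # broadcast over heads
    return torch.softmax(logits, dim=-1) @ v
```

An additive bias, unlike a hard mask, down-weights geometrically unsupported tokens without excluding them outright, which leaves the model free to synthesize unseen regions.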

Attention visualization and GGA benefits. Visualization of the attention weights when querying the center token of the egocentric view. Without GGA, the model attends to unrelated regions outside the visible area, leading to the generation of unwanted events. With GGA, attention is concentrated on spatially relevant regions, effectively focusing only on the visible area and preventing unwanted event generation.
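Such a map can be extracted with a few lines of PyTorch; the hook point, head averaging, and row-major token layout below are assumptions of this sketch rather than details from the paper.

```python
import torch

def center_query_attention_map(attn: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """attn: (heads, Nq, Nk) post-softmax weights captured from one layer,
    with queries laid out row-major on an (h, w) egocentric token grid.
    Returns the head-averaged attention of the center query over all keys."""
    center = (h // 2) * w + (w // 2)  # flattened index of the center token
    row = attn.mean(dim=0)[center]    # (Nk,) after averaging over heads
    return row.reshape(h, -1)         # keys reshaped back to a spatial grid
```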

Qualitative Results

Qualitative results. We compare our generated egocentric videos (Ours) with the ground truth egocentric videos (Ego GT) given the exocentric input (Exo view). Our method generates realistic and coherent egocentric videos that closely match the ground truth.

Qualitative comparison.

Quantitative Results

Ablation qualitative comparison.


Quantitative results. EgoX outperforms previous approaches by a large margin, achieving state-of-the-art performance on diverse and challenging exo-to-ego video generation benchmarks.


Ablation study results. Performance comparison obtained by removing each core component of our framework.
