Profile Photo

Alejandro Pardo

I'm a final-year Ph.D. student at KAUST under the supervision of Professor Bernard Ghanem. Previously, I completed my M.Sc. degree advised by Pablo Arbelaez. During my Ph.D., I interned at the Embodied AI Labs at Intel and at Adobe Research.

Currently, my research focuses on leveraging modern Computer Vision algorithms to automate creative video editing, aiming to bridge the gap between creativity and technology. If you share similar interests, feel free to reach out—I'd love to connect and exchange ideas!

I am actively looking for permanent positions!

Email: alejandro dot pardo at kaust dot edu dot sa

Featured Research

Here are some of my most representative works. For a complete list of publications, feel free to visit my Google Scholar page. My research spans a variety of video understanding topics, so you may find other works of interest there as well.

MatchDiffusion Image

MatchDiffusion: Training-free Generation of Match-Cuts

Alejandro Pardo*, Fabio Pizzati*, Tong Zhang, Alexander Pondaven, Philip Torr, Juan Camilo Perez, Bernard Ghanem
Under Review - Project Page / arXiv

We introduce a training-free method for generating match-cuts using text-to-video diffusion models. By leveraging the denoising process, our approach creates visually coherent video pairs with shared structure but distinct semantics, enabling the creation of seamless and impactful transitions.

Assembler Image

Generative Timelines for Instructed Visual Assembly

Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron
NeurIPS Workshop on Video-Language Models - Website / arXiv

*Work done during an internship at Adobe Research

We introduce the Timeline Assembler, a generative model that enables intuitive visual timeline editing using natural language instructions. Our method automates complex tasks like reordering, adding, and removing clips, making video editing accessible to non-experts.

CLMs Image

Compressed-Language Models for Understanding Compressed Formats: a JPEG Exploration

Juan C. Pérez, Alejandro Pardo, Mattia Soldan, Hani Itani, Juan Leon-Alcázar, Bernard Ghanem
Under Review - arXiv

This work explores the potential of Compressed-Language Models (CLMs) to process and understand data directly from compressed file formats (CFFs) like JPEG. By treating compressed byte streams as sequences, we evaluate CLMs across recognizing file properties, handling anomalies, and generating new files. Our findings reveal that CLMs can effectively grasp the semantics of compressed data, showcasing the promise of directly leveraging compressed formats for efficient and universal data processing.

TGT Image

Towards Automated Movie Trailer Generation

Dawit Mureja Argaw, Mattia Soldan, Alejandro Pardo, Chen Zhao, Fabian Caba Heilbron, Joon Son Chung, Bernard Ghanem
CVPR-2024 - GitHub / Paper

This work presents an approach to automate trailer creation using the Trailer Generation Transformer (TGT), a sequence-to-sequence model designed to predict plausible movie trailers. By leveraging an encoder-decoder architecture, TGT models the temporal order and relevance of movie shots to create engaging trailers. Our method overcomes the limitations of prior classification and ranking-based approaches, achieving state-of-the-art results on newly curated benchmarks for automatic trailer generation.

MovieCuts Image

MovieCuts: A New Dataset and Benchmark for Cut Type Recognition

Alejandro Pardo, Fabian Caba Heilbron, Juan Leon-Alcázar, Ali Thabet, Bernard Ghanem
ECCV-2022 - GitHub / arXiv

We present MovieCuts, a large-scale dataset and benchmark for recognizing cinematic cut types. With over 173,000 clips labeled with ten professional cut categories, MovieCuts addresses the multi-modal challenges of analyzing audio-visual transitions. Our benchmarks highlight the complexity of this task, paving the way for advancements in automated video editing, virtual cinematography, and film education.

MAD Image

MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Mattia Soldan, Alejandro Pardo, Juan Leon-Alcázar, Fabian Caba Heilbron, Chen Zhao, Silvio Giancola, Bernard Ghanem
CVPR-2022 - GitHub / Paper

We introduce MAD (Movie Audio Descriptions), a large-scale dataset for video-language grounding with over 384,000 descriptive sentences aligned to 1,200+ hours of long-form movies. By leveraging professional audio descriptions, MAD reduces biases seen in prior datasets and challenges models to temporally ground short language moments in diverse, untrimmed videos. This benchmark pushes the boundaries of video-language research and practical applications like smart video search and editing.