Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN introduces a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, which simplifies precise text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. It supports the addition of video objects, inpainting, and attribute modification within a unified framework, surpassing existing methods on video editing and inpainting benchmarks.
The proposed framework demonstrates versatile capabilities in video-to-paragraph generation (up to 10.8% ↑ relative improvement in human evaluations against the baseline) and video content editing (relative 49.7% ↑ in FVD), and can be incorporated into other SoTA video generative models for further enhancement.
Figure 2: Illustration of the RACCooN framework. RACCooN generates video descriptions with three distinct types of pooled visual tokens, including Multi-Granular Spatiotemporal (MGS) pooling tokens. Next, the user can edit the generated descriptions by adding, removing, or modifying words to create new videos. Note that for object-addition tasks, if users do not provide layout information for the objects they want to add, RACCooN can predict the target layout in each frame.
Figure 3: Illustration of MGS pooling. We obtain MGS pooling tokens using a spatiotemporal mask m via overlapping k-means clustering (OKM) of averaged superpixel features S.
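The caption above names the ingredients of MGS pooling (averaged superpixel features S, overlapping k-means clustering, and a mask m) without spelling out the steps. The following is a minimal NumPy sketch of one way such a pipeline could look, not the authors' implementation. It assumes that S is given as an array with one row per superpixel, that "overlapping" means each superpixel joins its few nearest clusters after ordinary k-means, and that pooling is a masked average. The function names, the `overlap` parameter, and the pooling rule are all hypothetical.

```python
import numpy as np


def overlapping_kmeans(features, k, overlap=2, iters=20, seed=0):
    """Return a boolean mask of shape (k, N) over the N rows of `features`.

    Assumption: overlap is produced by running plain k-means (Lloyd's
    algorithm) and then letting every row join its `overlap` nearest
    centroids. The paper's OKM procedure may differ.
    """
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Initialise centroids from k distinct rows.
    centroids = features[rng.choice(n, size=k, replace=False)].copy()

    for _ in range(iters):
        # Pairwise distances between rows and centroids: shape (N, k).
        dists = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=-1
        )
        hard = dists.argmin(axis=1)
        for c in range(k):
            members = features[hard == c]
            if len(members) > 0:  # leave empty clusters where they are
                centroids[c] = members.mean(axis=0)

    # Soft membership: each row belongs to its `overlap` closest clusters.
    dists = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=-1
    )
    nearest = np.argsort(dists, axis=1)[:, :overlap]  # (N, overlap)
    mask = np.zeros((k, n), dtype=bool)
    mask[nearest.T, np.arange(n)] = True
    return mask


def masked_pool(features, mask):
    """Average the rows selected by each mask row, giving one token per cluster."""
    weights = mask.astype(features.dtype)
    counts = weights.sum(axis=1, keepdims=True)
    return (weights @ features) / np.maximum(counts, 1.0)
```

With S of shape (N, D), `overlapping_kmeans(S, k)` yields a (k, N) mask and `masked_pool(S, mask)` yields k pooled tokens of dimension D. Because clusters overlap, a superpixel on an object boundary can contribute to more than one token.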
[Snowboarder]: A person dressed in a dark snowsuit, possibly black or navy, with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps. Background.
[Man in Red Top]: The subject is a fit male wearing a bright red short-sleeved shirt, navy blue athletic shorts, and black sports shoes. He displays athleticism and coordination as he skillfully plays with the football. Background.
[Eagle]: A majestic eagle, with a sharp, hooked beak and keen eyes, its plumage is a mix of dark brown feathers with lighter brown accents. Its powerful yellow talons grip the rocky surface, symbolizing its strength and dominance over the terrain. Background.
[Brown Dog]: An enthusiastic medium-sized brown dog with glossy fur and a lean build. It remains undeterred by the splashes, focused on the man and possibly a tossed object or on the interaction itself. Its tail is partially submerged, demonstrating agility and the enjoyment of water. Background.
[Giant Panda]: A large, black and white panda with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size. Background.
[Camel]: A light brown camel stands prominently in the center of the frame with a tranquil demeanor. Its fur is smooth, and its hump is well-defined. Its facial expression remains placid, capturing the quintessential essence of a creature adapted to arid environments. Background.
[Seagull]: A white seagull with wings spread wide, showcasing gray tips. As it flies, its yellow beak and black eye markings are noticeable. The bird's underbelly is white, while the light and smooth lines of its body exemplify a strong yet graceful figure in mid-flight. Background.
[Whale Shark]: A massive, gentle, spotted whale shark seen dominating the frame with its grand presence. It showcases distinctive white spots along its dark gray body and long, powerful fins. This shark is the focal point and moves gracefully amid the marine life. Background.
* [Player 1 with Red Hat]: Male player on the near side of the court. He is wearing a white shirt, dark shorts, and a red hat, and is playing with a yellow tennis racket. He serves and returns the ball, demonstrating agility and competitive spirit in the game. Background.
* [Baby Gorilla]: A young gorilla with black fur, portrayed in the act of crawling. The gorilla's movement is slow and cautious, exhibiting natural curiosity as it explores its surroundings. Background.
* [Climber]: A woman in athletic wear, featuring a black tank top and matching shorts, with her hair tied back. She shows focus and precision in her bouldering technique. Background.
* [Dog]: Medium-sized and likely of a common breed, sporting a coat with hues of grey and black. The dog is on a leash, indicating it is well-trained and accustomed to walking alongside its owner. Background.
* [Painted Stork]: A solitary painted stork with white and pink plumage, known for its long, orange beak and black-and-white wing pattern. Visible are the bird's long, stilt-like legs that support it as it wades in the water. Its movements are purposeful and focused as it hunts for food. Background.
* [Basketball Player]: Athlete wearing a white basketball jersey with a prominent star design, royal blue basketball shorts, and white sports sneakers. The player is focused on performing dribbling exercises, showcasing control and agility. Background.
* [Basketball Player in White]: A male player sporting a white shirt and dark shorts, actively engaging in play. He exhibits agility and focus as he positions himself to intercept or rebound the basketball. Background.
* [Black Cat]: A sleek, all-black cat with a lithe figure and attentive eyes. It follows the dog closely, participating in the playful pursuit with agility and curiosity. Background.
[Robin]: A small bird with a striking orange-red breast and grey-brown feathers covering its wings and back. The bird's beak is narrow and sharp, adept for foraging and pecking the ground for insects or worms. In the first frame, the Robin is captured mid-peck, indicating active foraging behavior.
[White Poodle]: A small, fluffy white poodle exhibiting a pampered and stylized appearance with a rounded haircut characteristic of the breed's show grooming standards. The dog moves with a deliberate prance, displaying the distinctive behavior and training fit for a show animal.
[White Dog]: An active and attentive white Labrador, displaying playful and energetic movements as it interacts with its surroundings and human companions.
[Observer]: A male wearing dark grey shorts, a black T-shirt with neon yellow logos, and athletic shoes. He stands attentively with his arms crossed, watching the shooter's performance.
[Black Dog]: A medium-sized black dog with distinct tan markings on its legs and snout. It wears a red collar and exhibits high energy as it sprints across the lawn with its tongue out in a playful stance.
[Woman]: A casually dressed woman sporting a buttoned pink shirt and light blue denim jeans. Her blonde hair is tied back in a ponytail, and she wears white sneakers suitable for walking. She displays a relaxed demeanor while strolling with her dog.
[Woman]: A casual woman in a white beach dress and with a straw hat, walking beside a man, likely accompanying him and his white dog. She moves with a relaxed stride, suggesting a leisurely outing together.
[Seagull]: A white seagull with wings spread wide, showcasing gray tips. As it flies, its yellow beak and black eye markings are noticeable, adding to its distinctive features. The bird's underbelly is white, while the light and smooth lines of its body exemplify a strong yet graceful figure in mid-flight.
* [Brown Dog]: An exuberant medium-sized brown dog with a darker shade along its back and lighter tan on its legs and face. It has floppy ears, a long tail, and appears to be of a hunting or retriever breed. The dog's demeanor is playful and curious, as it moves around the grassy field with a sense of freedom and exploration.
* [Surfer]: An athlete wearing a dark wetsuit, possibly black or navy, showcasing talent in balancing and steering on the waves. Their stance is wide and steady, knees bent, arms outstretched for balance, and their posture exuding confidence.
* [Snowboarder]: A person dressed in a dark snowsuit, possibly black or navy, with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps.
* [Giant Panda]: A large, black and white panda with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size.
* [Skier in Blue and Green]: A skier wearing a blue jacket with green sleeves and blue pants. They're equipped with skiing poles and skis that appear to be primarily white with some red signage. Their helmet is a bright neon lime green with matching goggles.
* [Dog]: A light-colored Labrador with a lean build, seen swimming in the water. It has a pink nose, pointed ears, and appears to be a seasoned swimmer. The dog wears a black collar with a red attachment, possibly for safety or identification.
* [Climber]: A woman in athletic wear, featuring a black tank top and matching shorts, with her hair tied back. She shows focus and precision in her bouldering technique.
* [Bird]: The central object in the video is a soaring bird, likely a raptor, given its size and wing shape. It has dark feathering that contrasts against the sky, a fan-shaped tail, and widespread wings that demonstrate strong, controlled flight.
[Baby Gorilla → Baby Panda]: A young [gorilla → panda] with black fur, portrayed in the act of crawling. The gorilla's movement is slow and cautious, exhibiting natural curiosity as it explores its surroundings.
[Brown Dog → Raccoon]: A medium-sized, light brown [dog → raccoon] with a sturdy build and alert ears. Its tail wags enthusiastically as it leads the playful chase in the forest, displaying a sense of happiness and energy.
[Snowboarder]: A person dressed in a [dark snowsuit, possibly black or navy → white snowsuit], with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps.
[Basketball Player in White]: A male player sporting [a white shirt and dark shorts → a blue shirt and white shorts], actively engaging in play. He exhibits agility and focus as he positions himself to intercept or rebound the basketball.
[Man]: A casually dressed man [in a red shirt and blue jeans → in a blue shirt and white beach shorts], likely the owner of the white dog. He walks with a relaxed gait, indicating a leisurely outing.
[Man → Man in orange]: An athletic man wearing an orange shirt, [black → white] shorts, and sports shoes. He is running on a well-maintained playing area with manicured green grass, a smooth outfield, and a perimeter wall adorned with sponsorship banners and a scoreboard.
[Woman]: Dressed in [a casual outfit of blue denim shorts and a horizontally striped blue and white T-shirt → a dark red dining dress]. She accessorizes with pink flip-flops and is engaged in the action of putting a leash on her dog, showing signs of preparing for a walk.
[Large Shark → Orange Shark]: A sizable and powerful shark, possibly a Great White, with a [grey and white → orange] body. Its dorsal fin is tall and prominent, and it has a distinctly streamlined shape. The shark conveys a sense of calm authority as it moves effortlessly through the water.
* [Man in Red Top → Man in White Shirt]: The subject is a fit male wearing a bright [red → white] short-sleeved shirt, navy blue athletic shorts, and black sports shoes. He displays athleticism and coordination as he skillfully plays with the football.
* [Dog → Cat]: A medium-sized [black dog → white cat] with a shiny coat and a long tail. It sports a thick collar connected to a leash and appears lively, enthusiastically leading the way during its walk with the owner.
* [Siberian Husky → Corgi]: A medium-sized, agile Corgi dog with a thick coat that is grey, white, and black in color. It has distinctive pointed ears and displays exuberant energy as it plays catch with a visible excitement.
* [Dalmatian] [Dog]: A medium-sized, Dalmatian dog with white facial markings, a white chest, and white-tipped paws. It has floppy ears, a long tail, and a gentle expression. The dog is actively engaging with a ball, using its front paws and nose, displaying playful behavior.
* [Giant Panda → Brown Bear]: A large, [black and white panda → brown bear] with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size.
* [Dog → White Cat]: Medium-sized and likely of a common breed, sporting a coat with hues of grey and black. [The dog → The white cat] is on a leash, indicating it is well-trained and accustomed to walking alongside its owner.
We conducted a quantitative evaluation of the video-to-paragraph generation capabilities of our proposed RACCooN framework, comparing it against strong baselines with a focus on object-centric captioning and object layout planning. The results, summarized in Table 1, show that open-source video-LLMs with smaller LLMs (< 13B parameters), such as PG-VL and Video-Chat, struggle with object-centric captioning and usually fail to generate layout plans. This is primarily due to their lack of instructional fine-tuning and their insufficient modeling of video details without multi-granular pooling. In contrast, our RACCooN framework demonstrates superior performance in both object-centric captioning and complex object layout planning, benefiting from instructional tuning on our VPLM dataset. Additionally, our method achieves competitive performance with proprietary MLLMs (e.g., Gemini 1.5 Pro, GPT-4o) in key object captioning and layout planning, demonstrating its strong instruction following and generation quality.
As shown in Table 3, we quantitatively compare the video editing ability of RACCooN with strong video editing models based on inpainting or DDIM inversion across three object-centric video content editing subtasks: object changing, removal, and adding. In general, RACCooN outperforms all baselines across 9 metrics. For object changing, RACCooN outperforms the best-performing baseline by 0.8% on CLIP-T, indicating better video-text alignment while maintaining temporal consistency, as demonstrated by comparable CLIP-F and Qedit scores. Note that LGVI is not designed to alter video attributes and tends to preserve video content with marginal change (i.e., near-identical input and output videos), resulting in higher CLIP-F scores. In the object removal task, RACCooN shows significant improvements over strong baselines (relative improvements of 57.8% in FVD, 2.5% in SSIM, and 9.6% in PSNR). Such improvements are maintained in the addition task (relative improvements of 41.6% in FVD and 4.3% in PSNR). Meanwhile, some DDIM inversion-based models (e.g., TokenFlow) work well for specific tasks (e.g., changing objects) but do not handle other types of editing. In contrast, our method is an all-rounder that supports diverse video content editing skills.
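The relative figures above mix metrics where lower is better (FVD) with metrics where higher is better (SSIM, PSNR, CLIP-T), so the sign convention matters. Below is a small sketch of how such a relative improvement is commonly computed. The helper name and its `lower_is_better` flag are hypothetical, and the page does not state the exact formula the authors used. The last lines only check arithmetic on numbers quoted on this page: the removal and addition FVD gains (57.8% and 41.6%) average to 49.7%, the figure given in the abstract. Whether the abstract's number was actually derived as this average is an assumption.

```python
def relative_improvement(baseline, ours, lower_is_better):
    """Percentage improvement of `ours` over `baseline`.

    Positive values always mean "ours is better", whichever direction
    the metric runs. Hypothetical helper, not taken from the paper.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    gain = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * gain / abs(baseline)


# Illustrative values only, not results from the paper.
fvd_gain = relative_improvement(baseline=200.0, ours=100.0, lower_is_better=True)
psnr_gain = relative_improvement(baseline=20.0, ours=25.0, lower_is_better=False)

# Arithmetic on the two FVD gains quoted in the text above.
mean_fvd_gain = round((57.8 + 41.6) / 2, 1)
```

Here `fvd_gain` is 50.0 (the FVD was halved) and `psnr_gain` is 25.0, both reported as positive improvements.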
@article{yoon2024raccoon,
title={RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives},
author={Jaehong Yoon and Shoubin Yu and Mohit Bansal},
year={2024},
journal={arXiv:2405.18406},
}