RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives

University of North Carolina, Chapel Hill
*: equal contribution
Figure 1: Overview of RACCooN, a versatile and user-friendly video-to-paragraph-to-video framework that enables users to remove, add, or change video content by updating auto-generated narratives.

Abstract

Recent video generative models primarily rely on carefully written text prompts for specific tasks, like inpainting or style editing. They require labor-intensive textual descriptions for input videos, hindering their flexibility to adapt personal/raw videos to user specifications. This paper proposes RACCooN, a versatile and user-friendly video-to-paragraph-to-video generative framework that supports multiple video editing capabilities, such as removal, addition, and modification, through a unified pipeline. RACCooN consists of two principal stages: Video-to-Paragraph (V2P) and Paragraph-to-Video (P2V). In the V2P stage, we automatically describe video scenes in well-structured natural language, capturing both the holistic context and focused object details. Subsequently, in the P2V stage, users can optionally refine these descriptions to guide the video diffusion model, enabling various modifications to the input video, such as removing or changing subjects and/or adding new objects. The proposed approach stands out from other methods through several significant contributions: (1) RACCooN proposes a multi-granular spatiotemporal pooling strategy to generate well-structured video descriptions, capturing both the broad context and object details without requiring complex human annotations, which simplifies precise text-based video content editing for users. (2) Our video generative model incorporates auto-generated narratives or instructions to enhance the quality and accuracy of the generated content. It supports the addition of video objects, inpainting, and attribute modification within a unified framework, surpassing existing video editing and inpainting benchmarks.
The proposed framework demonstrates versatile capabilities in video-to-paragraph generation (up to 10.8% ↑ relative improvement in human evaluations against the baseline) and video content editing (49.7% ↑ relative improvement in FVD), and it can be incorporated into other SoTA video generative models for further enhancement.

RACCooN Framework

Figure 2: Illustration of the RACCooN framework. RACCooN generates video descriptions from three distinct sets of pooled visual tokens, including Multi-Granular Spatiotemporal (MGS) pooling tokens. Next, the user can edit the generated descriptions by adding, removing, or modifying words to create new videos. Note that for object-adding tasks, if users do not provide layout information for the objects they want to add, RACCooN can predict the target layout in each frame.
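The description-editing step in Figure 2 can be pictured with a minimal Python sketch. It assumes a narrative is held as a mapping from object names to descriptions. That representation and the helper names `remove_object`, `add_object`, and `change_object` are hypothetical and do not come from RACCooN's code. The sketch only edits text; layout prediction and video generation are not modeled, and the two descriptions are abbreviated from the examples on this page.

```python
from typing import Dict, Optional

# Hypothetical representation: object name -> object description.
Narrative = Dict[str, str]


def remove_object(narrative: Narrative, name: str) -> Narrative:
    """Return a copy of the narrative without the named object."""
    if name not in narrative:
        raise KeyError(f"unknown object: {name}")
    return {k: v for k, v in narrative.items() if k != name}


def add_object(narrative: Narrative, name: str, description: str) -> Narrative:
    """Return a copy of the narrative with one new object description."""
    if name in narrative:
        raise ValueError(f"object already described: {name}")
    return {**narrative, name: description}


def change_object(
    narrative: Narrative,
    name: str,
    old: str,
    new: str,
    new_name: Optional[str] = None,
) -> Narrative:
    """Replace the words `old` with `new` in one description, optionally renaming it."""
    description = narrative[name]
    if old not in description:
        raise ValueError(f"{old!r} does not occur in the description of {name}")
    edited = remove_object(narrative, name)
    edited[new_name or name] = description.replace(old, new)
    return edited


narrative = {
    "Man": "A shirtless individual engaging with the dog in the pool.",
    "Dog": "A brown dog with a dark face, playing actively in the water.",
}
changed = change_object(narrative, "Dog", "brown dog", "raccoon", new_name="Raccoon")
print(changed["Raccoon"])
```

Each helper returns a new mapping rather than mutating its input, so the auto-generated narrative stays available for comparison with the edited one.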

Figure 3: Illustration of MGS pooling. We obtain MGS pooling tokens using a spatiotemporal mask m via overlapping k-means clustering (OKM) of averaged superpixel features S.
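As a rough illustration of the idea in Figure 3, the NumPy sketch below clusters per-superpixel feature vectors with an overlapping variant of k-means and then averages the features under each cluster mask. It is a simplification under stated assumptions, not the authors' implementation: "overlapping" is modeled here by letting every superpixel join its `n_assign` nearest centroids, and the function names, parameters, and random features are all hypothetical.

```python
import numpy as np


def overlapping_kmeans(features, n_clusters=3, n_assign=2, n_iters=10, seed=0):
    """Return a boolean mask (n_clusters, n_points); each point joins n_assign clusters.

    Assumption: overlap is obtained by assigning every point to its n_assign
    nearest centroids. Requires n_assign <= n_clusters <= n_points.
    """
    features = np.asarray(features, dtype=np.float64)
    n_points = features.shape[0]
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(n_points, size=n_clusters, replace=False)].copy()

    def assign(current):
        # Squared distance from every point to every centroid: (n_points, n_clusters).
        dists = ((features[:, None, :] - current[None, :, :]) ** 2).sum(axis=-1)
        nearest = np.argsort(dists, axis=1)[:, :n_assign]
        mask = np.zeros((n_clusters, n_points), dtype=bool)
        mask[nearest.T, np.arange(n_points)] = True
        return mask

    for _ in range(n_iters):
        mask = assign(centroids)
        for k in range(n_clusters):
            if mask[k].any():  # leave an empty cluster's centroid where it is
                centroids[k] = features[mask[k]].mean(axis=0)
    return assign(centroids)


def masked_average_pool(features, mask):
    """Average the feature rows selected by each mask row: (n_clusters, dim)."""
    features = np.asarray(features, dtype=np.float64)
    weights = mask.astype(np.float64)
    counts = np.maximum(weights.sum(axis=1, keepdims=True), 1.0)
    return weights @ features / counts


# Random stand-in for averaged superpixel features S: 20 superpixels, 4 channels.
S = np.random.default_rng(0).normal(size=(20, 4))
m = overlapping_kmeans(S)
tokens = masked_average_pool(S, m)
print(m.shape, tokens.shape)
```

Because the clusters overlap, one superpixel can contribute to several pooled tokens, which is what distinguishes this from a hard partition of the frame.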

Video-to-Paragraph Visualization

[Video Description]: The video captures an outdoor scene where a man and a brown dog are enjoying a playful moment in a residential swimming pool. The dog is seen actively playing, diving into the water, and splashing around, while the man interacts and plays with the dog. The pool is surrounded by a wooden fence, and the area seems to be a backyard setting with no other significant objects or individuals present.

  • [Man]: A shirtless individual, appearing to be in a casual, leisurely mood, engaging with the dog in the pool.
  • [Dog]: A brown dog with a dark face, playing actively in the water, diving and splashing around.
  • [Swimming Pool]: A standard residential swimming pool with a white interior, surrounded by a wooden fence, and a dark blue water surface.
  • [Water Splash]: The man and the dog are captured in moments where the man throws the dog into the water, resulting in splashes and splashbacks.
[Video Description]: The video captures a surfer engaging with a wave in a calm ocean. The surfer, clad in a dark wetsuit, maneuvers on a colorful surfboard, riding the crest of a wave and eventually falling off as the wave curls back. The scene is serene, with the blue water and gentle movement of the waves, and the surfer's attempt to catch the wave highlights the dynamic nature of the sport.

  • [Surfer]: The surfer is wearing a dark wetsuit with light-colored designs and a helmet-style accessory. They are on a colorful surfboard, navigating the wave with skill and balance.
  • [Surfboard]: The surfboard is vibrant, with a mix of colors that include white, blue, and red. Its design includes a swirling pattern near the tail.
  • [Wave]: The wave is a gentle, curling form, providing a surfing experience for the surfer.
  • [Sky]: The sky is a clear, cloudless blue, suggesting fair weather for outdoor activities like surfing.
[Video Description]: The video showcases an individual engaging in snowboarding on a snowy slope. The snowboarder is captured in various stages of movement, riding down the slope and performing a jump trick off a blue metal rail. The setting appears to be a winter sports resort with clear blue skies, pristine snow, and scenic mountains in the backdrop. The snowboarder is dressed in dark winter sportswear, prominently seen against the bright winter landscape.

  • [Snowboard]: A blue and white snowboard with black bindings. The board is designed for snowboarding, a winter sport where the rider descends a snow-covered slope on a single board with foot straps.
  • [Snowboarder]: A person dressed in dark winter sportswear, including a hooded jacket, snow pants, and snowboarding-specific boots. They are seen performing a jump trick off a blue metal rail, showcasing balance and skill.
  • [Slopestyle Obstacle]: A blue metal rail is featured in the video, set against a winter landscape with a white snowy background. The snowboarder is seen approaching and jumping over the rail.
  • [Snowy Slopes]: The video is set on a snowy slope with pristine white snow, set against a backdrop of rolling hills and a clear blue sky with white clouds. The sun casts shadows on the ground, indicating either early morning or late afternoon.
[Video Description]: The video captures an indoor bowling alley sequence focused on a bowling ball in motion. The ball is seen rolling down the polished wooden lane towards the pins at the end. The bowling ball is dark with a glossy finish, and it reflects the ambient lighting of the alley. The lane is bordered by gutters, and the pins are arranged in a triangular formation, with two pins on the left and one on the right. The atmosphere is casual and leisurely, showcasing a classic indoor recreational activity.

  • [Bowling Ball]: The bowling ball is a significant object in the video, and its motion is the main event. It is dark with a glossy finish, reflecting the ambient lighting of the alley.
  • [Wooden Lane]: The wooden lane is a clear and defined object in the video, with the bowling ball's movement towards the pins at the end.
  • [Pins]: The pins are arranged in a triangular formation, with two pins on the left and one on the right, and they are a significant part of the bowling activity.
  • [Alley]: The alley is dimly lit with ambient lighting, setting the mood for a casual and leisurely activity.
[Video Description]: The video shows a woman walking down a busy city street at night. She is wearing a red dress and carrying a handbag. She stops to look at a building and then walks away. The woman is also seen walking down a rain-soaked street and looking at a building. She then walks away again. The video also shows a man walking down the street and looking at a building. The woman is also seen walking down a street and looking at a building. The video ends with a woman walking down a street and looking at a building. Overall, the video captures the busy and bustling atmosphere of a city at night. The woman's red dress and handbag stand out as she walks down the street, and her attention to the buildings she looks at adds depth to the scene. The man walking down the street also adds to the overall sense of a busy urban environment.

  • [Woman]: A stylish woman wearing a black leather jacket, a long red dress, and black boots. She carries a black purse and has sunglasses on.
  • [Traffic lights]: There are traffic lights visible in the background, indicating that the street is a busy intersection.
  • [People]: There are people walking around on the street, adding to the bustling atmosphere of the city at night.
  • [Building]: A large, multi-story structure with a prominent presence in the background. It might be a commercial or residential building.

Paragraph-to-Video Visualization (Remove)

[Snowboarder]: A person dressed in a dark snowsuit, possibly black or navy, with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps. Background.

[Man in Red Top]: The subject is a fit male wearing a bright red short-sleeved shirt, navy blue athletic shorts, and black sports shoes. He displays athleticism and coordination as he skillfully plays with the football. Background.

[Eagle]: A majestic eagle, with a sharp, hooked beak and keen eyes, its plumage is a mix of dark brown feathers with lighter brown accents. Its powerful yellow talons grip the rocky surface, symbolizing its strength and dominance over the terrain. Background.

[Brown Dog]: An enthusiastic medium-sized brown dog with glossy fur and a lean build. It remains undeterred by the splashes, focused on the man and possibly a tossed object or on the interaction itself. Its tail is partially submerged, demonstrating agility and the enjoyment of water. Background.

[Giant Panda]: A large, black and white panda with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size. Background.

[Camel]: A light brown camel stands prominently in the center of the frame with a tranquil demeanor. Its fur is smooth, and its hump is well-defined. Its facial expression remains placid, capturing the quintessential essence of a creature adapted to arid environments. Background.

[Seagull]: A white seagull with wings spread wide, showcasing gray tips. As it flies, its yellow beak and black eye markings are noticeable. The bird's underbelly is white, while the light and smooth lines of its body exemplify a strong yet graceful figure in mid-flight. Background.

[Whale Shark]: A massive, gentle, spotted whale shark seen dominating the frame with its grand presence. It showcases distinctive white spots along its dark gray body and long, powerful fins. This shark is the focal point and moves gracefully amid the marine life. Background.

* [Player 1 with Red Hat]: Male player on the near side of the court. He is wearing a white shirt, dark shorts, and a red hat, and is playing with a yellow tennis racket. He serves and returns the ball, demonstrating agility and competitive spirit in the game. Background.

* [Baby Gorilla]: A young gorilla with black fur, portrayed in the act of crawling. The gorilla's movement is slow and cautious, exhibiting natural curiosity as it explores its surroundings. Background.

* [Climber]: A woman in athletic wear, featuring a black tank top and matching shorts, with her hair tied back. She shows focus and precision in her bouldering technique. Background.

* [Dog]: Medium-sized and likely of a common breed, sporting a coat with hues of grey and black. The dog is on a leash, indicating it is well-trained and accustomed to walking alongside its owner. Background.

* [Painted Stork]: A solitary painted stork with white and pink plumage, known for its long, orange beak and black-and-white wing pattern. Visible are the bird's long, stilt-like legs that support it as it wades in the water. Its movements are purposeful and focused as it hunts for food. Background.

* [Basketball Player]: Athlete wearing a white basketball jersey with a prominent star design, royal blue basketball shorts, and white sports sneakers. The player is focused on performing dribbling exercises, showcasing control and agility. Background.

* [Basketball Player in White]: A male player sporting a white shirt and dark shorts, actively engaging in play. He exhibits agility and focus as he positions himself to intercept or rebound the basketball. Background.

* [Black Cat]: A sleek, all-black cat with a lithe figure and attentive eyes. It follows the dog closely, participating in the playful pursuit with agility and curiosity. Background.

*: with masks predicted by grounding models

Paragraph-to-Video Visualization (Add)

[Robin]: A small bird with a striking orange-red breast and grey-brown feathers covering its wings and back. The bird's beak is narrow and sharp, adept for foraging and pecking the ground for insects or worms. In the first frame, the Robin is captured mid-peck, indicating active foraging behavior.

[White Poodle]: A small, fluffy white poodle exhibiting a pampered and stylized appearance with a rounded haircut characteristic of the breed's show grooming standards. The dog moves with a deliberate prance, displaying the distinctive behavior and training fit for a show animal.

[White Dog]: An active and attentive white Labrador, displaying playful and energetic movements as it interacts with its surroundings and human companions.

[Observer]: A male wearing dark grey shorts, a black T-shirt with neon yellow logos, and athletic shoes. He stands attentively with his arms crossed, watching the shooter's performance.

[Black Dog]: A medium-sized black dog with distinct tan markings on its legs and snout. It wears a red collar and exhibits high energy as it sprints across the lawn with its tongue out in a playful stance.

[Woman]: A casually dressed woman sporting a buttoned pink shirt and light blue denim jeans. Her blonde hair is tied back in a ponytail, and she wears white sneakers suitable for walking. She displays a relaxed demeanor while strolling with her dog.

[Woman]: A casual woman in a white beach dress and with a straw hat, walking beside a man, likely accompanying him and his white dog. She moves with a relaxed stride, suggesting a leisurely outing together.

[Seagull]: A white seagull with wings spread wide, showcasing gray tips. As it flies, its yellow beak and black eye markings are noticeable, adding to its distinctive features. The bird's underbelly is white, while the light and smooth lines of its body exemplify a strong yet graceful figure in mid-flight.

* [Brown Dog]: An exuberant medium-sized brown dog with a darker shade along its back and lighter tan on its legs and face. It has floppy ears, a long tail, and appears to be of a hunting or retriever breed. The dog's demeanor is playful and curious, as it moves around the grassy field with a sense of freedom and exploration.

* [Surfer]: An athlete wearing a dark wetsuit, possibly black or navy, showcasing talent in balancing and steering on the waves. Their stance is wide and steady, knees bent, arms outstretched for balance, and their posture exuding confidence.

* [Snowboarder]: A person dressed in a dark snowsuit, possibly black or navy, with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps.

* [Giant Panda]: A large, black and white panda with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size.

* [Skier in Blue and Green]: A skier wearing a blue jacket with green sleeves and blue pants. They're equipped with skiing poles and skis that appear to be primarily white with some red signage. Their helmet is a bright neon lime green with matching goggles.

* [Dog]: A light-colored Labrador with a lean build, seen swimming in the water. It has a pink nose, pointed ears, and appears to be a seasoned swimmer. The dog wears a black collar with a red attachment, possibly for safety or identification.

* [Climber]: A woman in athletic wear, featuring a black tank top and matching shorts, with her hair tied back. She shows focus and precision in her bouldering technique.

* [Bird]: The central object in the video is a soaring bird, likely a raptor, given its size and wing shape. It has dark feathering that contrasts against the sky, a fan-shaped tail, and widespread wings that demonstrate strong, controlled flight.

*: with MLLM-predicted box layouts

Paragraph-to-Video Visualization (Change)

[Baby Gorilla → Baby Panda]: A young (gorilla → panda) with black fur, portrayed in the act of crawling. The gorilla's movement is slow and cautious, exhibiting natural curiosity as it explores its surroundings.

[Brown Dog → Raccoon]: A medium-sized, light brown (dog → raccoon) with a sturdy build and alert ears. Its tail wags enthusiastically as it leads the playful chase in the forest, displaying a sense of happiness and energy.

[Snowboarder]: A person dressed in a (dark snowsuit, possibly black or navy → white snowsuit), with a snowboard attached to their feet. They display confidence and skill as they navigate down the slope, executing moves and jumps.

[Basketball Player in White]: A male player sporting (a white shirt and dark shorts → a blue shirt and white shorts), actively engaging in play. He exhibits agility and focus as he positions himself to intercept or rebound the basketball.

[Man]: A casually dressed man (in a red shirt and blue jeans → in a blue shirt and white beach shorts), likely the owner of the white dog. He walks with a relaxed gait, indicating a leisurely outing.

[Man → Man in orange]: An athletic man wearing an orange shirt, (black → white) shorts, and sports shoes. He is running on a well-maintained playing area with manicured green grass, a smooth outfield, and a perimeter wall adorned with sponsorship banners and a scoreboard.

[Woman]: Dressed in a casual outfit of (blue denim shorts and a horizontally striped blue and white T-shirt → a dark red dining dress). She accessorizes with pink flip-flops and is engaged in the action of putting a leash on her dog, showing signs of preparing for a walk.

[Large Shark → Orange Shark]: A sizable and powerful shark, possibly a Great White, with a (grey and white → orange) body. Its dorsal fin is tall and prominent, and it has a distinctly streamlined shape. The shark conveys a sense of calm authority as it moves effortlessly through the water.

* [Man in (Red Top → White Shirt)]: The subject is a fit male wearing a bright (red → white) short-sleeved shirt, navy blue athletic shorts, and black sports shoes. He displays athleticism and coordination as he skillfully plays with the football.

* [Dog → Cat]: A medium-sized (black dog → white cat) with a shiny coat and a long tail. It sports a thick collar connected to a leash and appears lively, enthusiastically leading the way during its walk with the owner.

* [Siberian Husky → Corgi]: A medium-sized, agile Corgi dog with a thick coat that is grey, white, and black in color. It has distinctive pointed ears and displays exuberant energy as it plays catch with visible excitement.

* [Dalmatian] [Dog]: A medium-sized Dalmatian dog with white facial markings, a white chest, and white-tipped paws. It has floppy ears, a long tail, and a gentle expression. The dog is actively engaging with a ball, using its front paws and nose, displaying playful behavior.

* [Giant Panda → Brown Bear]: A large, (black and white panda → brown bear) with a distinct, dense coat. It is shown playfully climbing and balancing on the wooden beams of its enclosure. It has a round face, large black eye patches, and is quite agile for its size.

* [Dog → White Cat]: Medium-sized and likely of a common breed, sporting a coat with hues of grey and black. (The dog → The white cat) is on a leash, indicating it is well-trained and accustomed to walking alongside its owner.

* [Basketball Player in White]: A male player sporting (a white shirt and dark shorts → a blue shirt and white shorts), actively engaging in play. He exhibits agility and focus as he positions himself to intercept or rebound the basketball.

* [Brown Dog → Raccoon]: A medium-sized, light brown (dog → raccoon) with a sturdy build and alert ears. Its tail wags enthusiastically as it leads the playful chase in the forest, displaying a sense of happiness and energy.

*: with masks predicted by grounding models

Comparisons with Other Baselines

Input Video    LGVI    VideoComposer    Ours

[Whale Shark]: A massive, gentle, spotted whale shark seen dominating the frame with its grand presence. It showcases distinctive white spots along its dark gray body and long, powerful fins. This shark is the focal point and moves gracefully amid the marine life. Background.

Input Video    LGVI    VideoComposer    Ours

[Whale Shark]: A massive, gentle, spotted whale shark seen dominating the frame with its grand presence. It showcases distinctive white spots along its dark gray body and long, powerful fins. This shark is the focal point and moves gracefully amid the marine life. Background.

Input Video    Inpaint-Anything    VideoComposer    Ours

[Black Dog]: A medium-sized black dog with distinct tan markings on its legs and snout. It wears a red collar and exhibits high energy as it sprints across the lawn with its tongue out in a playful stance.

Input Video    Inpaint-Anything    VideoComposer    Ours

[Woman]: A casually dressed woman sporting a buttoned pink shirt and light blue denim jeans. Her blonde hair is tied back in a ponytail, and she wears white sneakers suitable for walking. She displays a relaxed demeanor while strolling with her dog.

Input Video    TokenFlow    VideoComposer    Ours

[Brown Dog → Raccoon]: A medium-sized, light brown (dog → raccoon) with a sturdy build and alert ears. Its tail wags enthusiastically as it leads the playful chase in the forest, displaying a sense of happiness and energy.

Input Video    TokenFlow    VideoComposer    Ours

[Woman]: Dressed in a casual outfit of (blue denim shorts and a horizontally striped blue and white T-shirt → a dark red dining dress). She accessorizes with pink flip-flops and is engaged in the action of putting a leash on her dog, showing signs of preparing for a walk.

Quantitative Results

We conducted a quantitative evaluation of the video-to-paragraph generation capabilities of our proposed RACCooN framework, comparing it against strong baselines with a focus on object-centric captioning and object layout planning. The results, summarized in Table 1, show that open-source video-LLMs (e.g., PG-VL, Video-Chat), which have smaller LLMs (< 13B parameters), struggle with object-centric captioning and usually fail to generate layout plans. This is primarily due to their lack of instructional fine-tuning and their insufficient modeling of video detail without multi-granular pooling. In contrast, our RACCooN framework demonstrates superior performance in both object-centric captioning and complex object layout planning, benefiting from instructional tuning on our VPLM dataset. Additionally, our method achieves performance competitive with proprietary MLLMs (e.g., Gemini 1.5 Pro, GPT-4o) in key object captioning and layout planning, demonstrating its strong instruction following and generation quality.

As shown in Table 3, we quantitatively compare the video editing ability of RACCooN with strong video editing models based on inpainting or DDIM inversion across three object-centric video content editing subtasks: object changing, removal, and adding. In general, RACCooN outperforms all baselines across 9 metrics. For object changing, RACCooN outperforms the best-performing baseline by 0.8% on CLIP-T, indicating better video-text alignment while maintaining temporal consistency, as demonstrated by comparable CLIP-F and Qedit scores. Note that LGVI is not designed to alter video attributes and tends to preserve video content with marginal change (i.e., near-identical input and output videos), resulting in improved CLIP-F scores. In the object removal task, RACCooN shows significant improvements over strong baselines (relative +57.8% FVD, +2.5% SSIM, +9.6% PSNR). Such improvements are maintained in the addition task (relative +41.6% FVD, +4.3% PSNR). Meanwhile, some DDIM inversion-based models (e.g., TokenFlow) work well for specific tasks (e.g., object changing) but do not handle other types of editing. In contrast, our method is an all-rounder that supports diverse video content editing skills.
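To make the reported quantities concrete, the sketch below computes PSNR between two frames with the standard definition, 10 · log10(MAX² / MSE), and a relative improvement that accounts for whether lower or higher is better (FVD is lower-is-better, while PSNR and SSIM are higher-is-better). The function names are hypothetical, the inputs are illustrative values rather than numbers from Table 3, and the exact evaluation protocol behind the table is not specified on this page.

```python
import numpy as np


def psnr(reference, edited, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped frames."""
    reference = np.asarray(reference, dtype=np.float64)
    edited = np.asarray(edited, dtype=np.float64)
    mse = np.mean((reference - edited) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))


def relative_improvement(baseline, ours, lower_is_better=False):
    """Percentage gain of `ours` over `baseline`, positive when ours is better."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    gain = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * gain / abs(baseline)


# Illustrative values only, not results from the paper.
frame = np.full((4, 4), 100.0)
noisy = frame + 5.0
print(round(psnr(frame, noisy), 2))
print(relative_improvement(200.0, 100.0, lower_is_better=True))
print(relative_improvement(20.0, 22.0))
```

With `lower_is_better=True`, a drop in the metric is reported as a positive percentage, which matches how a "+" improvement on FVD should be read.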

BibTeX

@article{yoon2024raccoon,
        title={RACCooN: Remove, Add, and Change Video Content with Auto-Generated Narratives}, 
        author={Jaehong Yoon and Shoubin Yu and Mohit Bansal},
        year={2024},
        journal={arXiv:2405.18406},
}