ComfyUI: Stable Video Diffusion (Workflow Tutorial)

3 Dec 2023, 44:09

TLDR: In this tutorial, Mali introduces ComfyUI's Stable Video Diffusion, a tool for creating animated images and videos with AI. She demonstrates techniques for frame control, subtle animations, and complex video generation using latent noise composition. Mali showcases six workflows and provides eight comfy graphs for fine-tuning image-to-video output. She guides viewers through installing the necessary nodes, setting up the workflow, and adjusting parameters for desired motion effects, concluding with a detailed example of combining videos for advanced animation.


  • 😀 Mali introduces a tutorial on ComfyUI for stable video diffusion using Stability AI's first model.
  • 🔍 The video covers frame control techniques, such as animating only certain elements like a candle flame or hair and eyes in a portrait.
  • 📚 Mali shares six 'comfy graphs' to demonstrate fine-tuning image to video output in different scenarios.
  • 💻 The tutorial requires ComfyUI, model files, and additional software like FFmpeg for video format conversion.
  • 🎥 Two models are discussed: one generating 14 frames and the second (SVD XT) generating 25 frames, with a focus on the latter for the tutorial.
  • 🖼️ The importance of image resizing and cropping for video output is highlighted, with specific settings for maintaining aspect ratios.
  • 🔄 Detailed explanation of workflow settings, including the use of nodes like 'video linear CFG guidance' and 'VHS video combine'.
  • 🌟 Techniques for creating specific animations, such as subtle movements in facial features or blinking eyes, are explored.
  • 🎨 A method for combining multiple images to create more complex animations, like blinking, is demonstrated.
  • 🌄 The tutorial concludes with advanced workflows, including the use of 'noisy latent composition' for complex video effects.
  • 🔧 The sensitivity of settings like 'augmentation level' and 'motion bucket ID' in achieving desired motion effects is emphasized.

Q & A

  • What is the main topic of the video tutorial?

    -The main topic is using ComfyUI for Stable Video Diffusion: creating animations and short videos from AI-generated images or DSLR photos.

  • Who is the presenter of the tutorial?

    -The presenter of the tutorial is Mali.

  • What are the two models for stable video diffusion mentioned in the script?

    -The two models for stable video diffusion mentioned are the first model trained to generate 14 frames and the second model, SVD XT, trained to generate 25 frames.

  • What video resolution and aspect ratio are supported?

    -The models are trained for video at a resolution of 1024x576 and work better in landscape mode. The aspect ratio should be kept at 16:9 for landscape or 9:16 for portrait.

  • What is the purpose of the 'video linear CFG' node in the workflow?

    -The 'video linear CFG' node controls the classifier-free guidance (CFG) value across the video frames, starting with the minimum CFG value and ending with the CFG value set in the sampler.

  • How does the 'image resize' node function in the workflow?

    -The 'image resize' node is used to maintain the aspect ratio and crop the image to fit the required dimensions for video generation, ensuring the image does not exceed the maximum resolution of 1024x576.

  • What is the significance of the 'K sampler' and 'motion bucket ID' settings in the video generation process?

    -The 'K sampler' and 'motion bucket ID' settings determine the camera movement and motion throughout the video. They are crucial in controlling how elements within the video are animated.

  • What is the recommended frame rate for the generated videos?

    -The recommended frame rate for the generated videos is 10 fps. Higher frame rates are not recommended because the video is limited to 25 frames in total.

  • What format is recommended for exporting the final video, and why?

    -The recommended format for exporting the final video is h264 MP4 because it is a standard video format that can be used to upscale via third-party software.

  • How does the 'augmentation level' affect the video generation?

    -The 'augmentation level' adds noise to the generation, which affects the level of detail and motion in the video. It is sensitive and can lead to poorer motion details if set too high.

  • What is the 'noisy latent composition' technique mentioned in the script, and how is it used?

    -The 'noisy latent composition' technique is used to combine the effects of two videos by adding pixel noise in the latent space. It allows for the blending of elements from different images or videos, such as adding clouds to a sky in a time-lapse motion effect.
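
The frame-rate answer above comes down to simple arithmetic: with a fixed frame budget, a higher frame rate only shortens the clip. A minimal sketch of that relationship, using the 14- and 25-frame counts and the 10 fps figure from the Q&A (the function name is hypothetical and not part of ComfyUI):

```python
def clip_duration_seconds(num_frames: int, fps: float) -> float:
    """Return the playback length of a clip with a fixed number of frames."""
    if num_frames < 0:
        raise ValueError("num_frames must not be negative")
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    # 14 frames for the first model, 25 frames for SVD XT.
    for frames in (14, 25):
        for fps in (10, 25):
            seconds = clip_duration_seconds(frames, fps)
            print(f"{frames} frames at {fps} fps lasts {seconds:.2f} s")
```

At 10 fps the 25-frame model yields a 2.5-second clip; at 25 fps the same frames would play back in one second.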



🎥 Introduction to Stable Video Diffusion

Mali introduces the video by welcoming viewers to the channel and discussing Stability AI's first model for stable video diffusion. The model allows for frame control and subtle animations in AI-generated images or DSLR photos. Mali shares six comfy graphs to demonstrate fine-tuning image-to-video output and thanks new channel members. All resources, including JSON files and MP4 videos, will be available for YouTube channel members. ComfyUI supports the stable video diffusion models and can be run locally, with Mali noting the performance on a 4090 GPU and that the model is trained to generate up to 25 frames at 1024x576 resolution.


🛠 Setting Up the Comfy UI Workflow

The paragraph explains the initial steps for setting up the ComfyUI workflow for video generation. Mali instructs viewers to update custom nodes and install the necessary ones: the WAS Node Suite, Video Helper Suite, and Image Resize. After installing the nodes, viewers are advised to restart ComfyUI and install FFmpeg for video format support. The workflow begins with a video model option and nodes for image-to-video conditioning, K sampler, and VAE decode. Mali also introduces a custom node called VHS video combine for easier format export within ComfyUI.
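
Since the video export step depends on FFmpeg being installed, it can help to confirm that the executable is reachable before restarting ComfyUI. A minimal sketch, assuming FFmpeg is expected on the system PATH under the name `ffmpeg` (the helper name is hypothetical; the summary does not describe how the tutorial verifies the install):

```python
import shutil
from typing import Optional


def find_executable(name: str = "ffmpeg") -> Optional[str]:
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("ffmpeg was not found on PATH; video export to MP4 may fail.")
    else:
        print(f"ffmpeg found at: {path}")
```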


🔧 Fine-Tuning Video Parameters for Animation Control

Mali demonstrates how to control motion in a video using an AI-generated candle image. She explains that the image resize node maintains the image ratio and keeps the image precisely aligned, and details its settings, including action, ratio, and resize mode. The tutorial continues with the importance of the CFG value and motion bucket ID in determining camera movement and motion. Mali also discusses the impact of the K sampler and scheduler on the output, and how to adjust denoise, height, width, video frames, and augmentation level for desired effects.


👩‍🎨 Advanced Techniques for Facial Animation

This section explores advanced techniques for animating facial features in AI-generated images. Mali discusses the challenges of animating a portrait and how to adjust settings to prevent distortion. The use of the augmentation level to fix distortion and the motion bucket to control specific elements like hands or eyes is explained. The paragraph also covers the 'ping pong' effect for looping animations and tips for animating eyes in close-up facial images using different samplers and motion bucket levels.


🤩 Creating Subtle Animations with Multi-Image Method

Mali introduces a method for creating subtle animations like blinking by using a set of images with varying eye states. The workflow involves using two image loaders and a repeat image batch node to create a sequence of images that influence the AI to animate specific elements. The importance of balancing the number of open and closed eye images is discussed to maintain color consistency. Mali also shows how to adjust the image resize node and test the animation within the SVD conditioning node.
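
The balance between open-eye and closed-eye images can be pictured as building one ordered batch from repeated copies of each image. The sketch below only illustrates that ordering with placeholder labels; it is not the repeat image batch node itself, and the repeat counts are hypothetical rather than values from the tutorial:

```python
from typing import List, Sequence, Tuple


def build_image_batch(groups: Sequence[Tuple[str, int]]) -> List[str]:
    """Expand (image, repeat_count) pairs into one ordered batch of images."""
    batch: List[str] = []
    for image, count in groups:
        if count < 0:
            raise ValueError("repeat_count must not be negative")
        batch.extend([image] * count)
    return batch


if __name__ == "__main__":
    # Hypothetical mix: more open-eye than closed-eye images in the batch.
    batch = build_image_batch([("eyes_open", 3), ("eyes_closed", 1)])
    print(batch)
    open_share = batch.count("eyes_open") / len(batch)
    print(f"share of open-eye images: {open_share:.2f}")
```

Changing the counts shifts how strongly each eye state is represented in the batch, which is the balance the section describes.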


🚴‍♂️ Animating Complex Motions with DSLR Images

The paragraph delves into animating complex motions using DSLR images, such as creating a video of a motorbike with forward motion. Mali explains the sensitivity of the augmentation level setting and how it affects motion details. The tutorial includes using the multi-image method for facial animations like lip movement and the importance of selecting the right sampler and scheduler for desired effects. The paragraph concludes with a note on the unpredictability of certain motions, like pedal movement.


🌁 Combining Effects with Noisy Latent Composition

Mali demonstrates a complex workflow for combining effects using noisy latent composition with a DSLR photo and a cloud image. The process involves creating separate groups for each image, adjusting the image size, and using conditioning combine nodes to merge prompts. The paragraph explains the use of a latent composite node to layer images and the importance of adjusting the augmentation level and denoise value for a smooth output. The workflow concludes with adding the final video output and blending the images using the feather value.


🖼️ Finalizing the Video with Text to Image Integration

The final paragraph outlines the process of integrating text to image generation into the video workflow. Mali describes setting up a standard text to image workflow and connecting it to the video processing group. The importance of maintaining aspect ratios for the image resize node and connecting it to the SVD conditioning is highlighted. The paragraph concludes with the availability of JSON files for YouTube members and a sign-off until the next tutorial.



💡Stable Video Diffusion

Stable Video Diffusion refers to a technology that allows for the creation of stable and coherent video animations from static images. In the context of the video, it is the core process that enables the transformation of AI-generated or DSLR photos into short videos with subtle animations. The script describes how to achieve this by using models released by Stability AI, showcasing different techniques to control motion and generate videos with specific frame counts.


💡ComfyUI

ComfyUI is the user interface for the video diffusion process described in the script. It is a tool that supports the Stable Video Diffusion models and allows users to run the models locally. The script mentions that ComfyUI was used to create the workflows for video generation, and it requires certain custom nodes and software installations to function properly.

💡Frame Control

Frame Control is the ability to manipulate individual frames within a video sequence. The script discusses using frame control to animate specific elements within a portrait AI-generated image, such as the hair and eyes, while keeping the rest of the image static. This technique allows for the creation of more dynamic and focused animations.

💡Latent Noise Composition

Latent Noise Composition is a method mentioned in the script for creating videos using DSLR photos. It involves using the latent space of a model to generate videos with specific effects, such as moving clouds in the sky or animating waves. The script provides an example of how to combine this technique with ComfyUI to create a time-lapse motion effect.

💡CFG Guidance

CFG Guidance, or Classifier-Free Guidance, is a technique used in the video diffusion process to control the coherence and quality of the generated video. The script explains that the CFG value set in the sampler is significant and works relative to the minimum CFG value set on the video linear CFG guidance node, affecting the motion and camera movement throughout the video frames.
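
One way to picture the relationship between the two values is as a per-frame ramp. The sketch below assumes, as the node name suggests, a straight linear interpolation from the minimum CFG on the first frame to the sampler's CFG on the last frame. The function name and the example values 1.0 and 2.5 are hypothetical, and this is not the node's actual code:

```python
from typing import List


def linear_cfg_schedule(min_cfg: float, cfg: float, num_frames: int) -> List[float]:
    """Return one CFG value per frame, ramping linearly from min_cfg to cfg."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if num_frames == 1:
        # With a single frame there is nothing to ramp; start at min_cfg.
        return [min_cfg]
    step = (cfg - min_cfg) / (num_frames - 1)
    return [min_cfg + i * step for i in range(num_frames)]


if __name__ == "__main__":
    # 25 frames matches the SVD XT model; the CFG values are illustrative only.
    schedule = linear_cfg_schedule(min_cfg=1.0, cfg=2.5, num_frames=25)
    print(f"first frame: {schedule[0]:.3f}")
    print(f"middle frame: {schedule[12]:.3f}")
    print(f"last frame: {schedule[-1]:.3f}")
```

Under this assumption, raising the sampler's CFG while leaving the minimum unchanged steepens the ramp, so later frames are guided more strongly than early ones.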


💡K-Sampler

The K-Sampler is a node in the ComfyUI workflow that is used to generate the video frames. The script discusses how the K-Sampler works in conjunction with the CFG Guidance to determine the motion and animation of the video. Different K-Sampler settings can lead to different animation effects, such as panning or still elements.

💡VHS Video Combine

VHS Video Combine is a custom node in ComfyUI that allows for the export of video in various formats, including GIF, WebP, and MP4. The script mentions that this node is used instead of the default node to avoid the hassle of converting WebP to MP4, streamlining the video output process.
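
For comparison, the manual route the node avoids would be to call FFmpeg directly. The sketch below only builds the argument list for encoding a numbered PNG sequence to h264 MP4 at 10 fps and does not run anything. The file names and the helper function are hypothetical; `-framerate`, `-i`, `-c:v libx264`, and `-pix_fmt yuv420p` are standard FFmpeg options, but this is not what the node does internally:

```python
from typing import List


def build_h264_command(frame_pattern: str, fps: int, output_path: str) -> List[str]:
    """Build an FFmpeg argument list that encodes an image sequence as h264 MP4."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [
        "ffmpeg",
        "-framerate", str(fps),   # input frame rate of the image sequence
        "-i", frame_pattern,      # e.g. frame_00001.png, frame_00002.png, ...
        "-c:v", "libx264",        # h264 video encoder
        "-pix_fmt", "yuv420p",    # pixel format that most players accept
        output_path,
    ]


if __name__ == "__main__":
    command = build_h264_command("frame_%05d.png", 10, "output.mp4")
    print(" ".join(command))
```

The resulting list could be passed to `subprocess.run` once FFmpeg is installed and the frames exist on disk.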

💡Image Resize

Image Resize is a process mentioned in the script for adjusting the dimensions of an image to fit the video generation requirements. It is used to ensure that the image does not exceed the maximum resolution for video generation, typically 1024x576, and to maintain the aspect ratio for proper alignment within the video frame.
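
The resize-and-crop arithmetic can be sketched independently of the node. The example below scales an image so that it fully covers 1024x576 while keeping its aspect ratio, then computes a centered crop box. Center cropping and the function name are assumptions for illustration; the actual node exposes its own action, ratio, and resize mode settings:

```python
from typing import Tuple

Box = Tuple[int, int, int, int]


def fit_and_crop(src_w: int, src_h: int,
                 target_w: int = 1024, target_h: int = 576) -> Tuple[Tuple[int, int], Box]:
    """Return the scaled size and a centered crop box (left, top, right, bottom)."""
    if src_w <= 0 or src_h <= 0:
        raise ValueError("source dimensions must be positive")
    # Scale by the larger factor so the image covers the whole target frame.
    scale = max(target_w / src_w, target_h / src_h)
    scaled_w = max(target_w, round(src_w * scale))
    scaled_h = max(target_h, round(src_h * scale))
    # Trim the overflow equally from both sides.
    left = (scaled_w - target_w) // 2
    top = (scaled_h - target_h) // 2
    return (scaled_w, scaled_h), (left, top, left + target_w, top + target_h)


if __name__ == "__main__":
    for size in ((1024, 1024), (1920, 1080)):
        scaled, box = fit_and_crop(*size)
        print(f"source {size} -> scaled {scaled}, crop box {box}")
```

A square 1024x1024 source keeps its width and loses 224 rows from the top and bottom, while a 1920x1080 source already matches 16:9 and needs no cropping after scaling.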

💡Augmentation Level

Augmentation Level refers to the noise level added to the video generation process. The script explains that this setting is sensitive and affects the level of detail and motion in the video. A higher augmentation level can sometimes result in poorer motion details, depending on the image and other settings.

💡Ping Pong Effect

The Ping Pong Effect is a technique mentioned in the script for creating a looped animation by reversing the animation and playing it back and forth. This effect is used for animations where the motion is suitable for a continuous loop, such as a blinking effect in a facial animation.
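
The back-and-forth playback can be sketched as appending the reversed frames to the original sequence. This version drops the first and last frames from the reversed half so they are not shown twice when the clip loops; whether the node used in the tutorial handles the endpoints the same way is an assumption:

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def ping_pong(frames: Sequence[Frame]) -> List[Frame]:
    """Return frames played forward, then backward without repeating the endpoints."""
    forward = list(frames)
    # Reverse everything between the last and the first frame (both excluded).
    backward = forward[-2:0:-1]
    return forward + backward


if __name__ == "__main__":
    print(ping_pong([0, 1, 2, 3]))
    looped = ping_pong(list(range(25)))
    print(f"25 generated frames become {len(looped)} frames per loop")
```

With 25 generated frames, one full loop contains 48 frames under this scheme.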

💡Noisy Latent Composition

Noisy Latent Composition is an advanced technique used in the script to combine the effects of two videos in the latent space. It involves using the pixel noise in the latent space to blend the outputs from different K-Samplers, creating a seamless composite video with added elements like clouds or moving waves.
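
The role of the feather value in blending can be illustrated in one dimension. The sketch below ramps the overlay's weight linearly over `feather` positions at each edge and mixes it into a base row. Real latents are multi-channel 2D tensors and the ComfyUI node may feather edges differently, so this is only a simplified picture with hypothetical function names:

```python
from typing import List, Sequence


def feather_weights(length: int, feather: int) -> List[float]:
    """Return blend weights that ramp up over `feather` positions at both ends."""
    weights = [1.0] * length
    for i in range(min(feather, length)):
        ramp = (i + 1) / feather
        weights[i] = min(weights[i], ramp)
        weights[length - 1 - i] = min(weights[length - 1 - i], ramp)
    return weights


def blend_row(base: Sequence[float], overlay: Sequence[float],
              offset: int, feather: int) -> List[float]:
    """Paste `overlay` onto `base` at `offset`, fading it in and out at the edges."""
    out = list(base)
    weights = feather_weights(len(overlay), feather)
    for i, (value, weight) in enumerate(zip(overlay, weights)):
        j = offset + i
        if 0 <= j < len(out):
            out[j] = base[j] * (1.0 - weight) + value * weight
    return out


if __name__ == "__main__":
    print(feather_weights(6, 2))
    print(blend_row([0.0] * 8, [1.0] * 6, offset=1, feather=2))
```

A feather of zero gives a hard edge, while larger values spread the transition between the two sources over more positions.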


Introduction to the ComfyUI for Stable Video Diffusion workflow tutorial by Mali.

Stability AI's first model for stable video diffusion allows frame control with animations.

ComfyUI supports both the stable video diffusion models released by Stability AI.

ComfyUI can be run locally and is compatible with various GPU configurations.

The models support a video resolution of 1024x576 in landscape orientation, or 576x1024 in portrait.

The first model generates 14 frames, while the second model, SVD XT, generates 25 frames.

ComfyUI Manager is required for the tutorial and must be updated before proceeding.

Custom nodes such as the WAS Node Suite, Video Helper Suite, and Image Resize are necessary for the workflows.

Installation of FFmpeg is required for video format conversion within ComfyUI.

The workflow starts with a video model option and builds up to an advanced level.

Demonstration of how to control motion in a video using a candle image.

Explanation of the importance of the image resize and crop node for video output.

The significance of the CFG value in determining camera movement and motion throughout the video.

The impact of the K sampler and scheduler on the output video's motion details.

Adjusting the augmentation level to fix distortion and add detail to the video generation.

Techniques for creating subtle animations like blinking eyes or facial expressions.

Using multi-image methods to create animations without frame interpolation.

Combining effects with noisy latent composition for complex video animations.

Final workflow explanation for creating a video with specific elements in motion.