New Image2Video. Stable Video Diffusion 1.1 Tutorial.

Sebastian Kamph
13 Feb 2024 · 10:50

TL;DR: The video discusses the latest update to Stability AI's Stable Video Diffusion model, version 1.1. It compares the new model's performance with the previous 1.0 version by inputting images and evaluating the generated video results. The video also provides a tutorial on how to use the updated model in both ComfyUI and a fork of Automatic 1111. The creator highlights the improvements in consistency and detail, especially in static objects and slow-moving scenes, while acknowledging some instances where the older model performed better. The video concludes by encouraging users to experiment with the new model and engage with the AI art community.


  • 📈 Introduction of Stability AI's Stable Video Diffusion 1.1, an updated model from the previous 1.0 version.
  • 🎨 The process involves inputting a static image and generating a video output using the AI model.
  • 🔗 The model was fine-tuned from the previous version, aiming to improve video generation quality.
  • 📊 The AI model generates videos at 25 frames with a resolution of 1024 by 576 pixels.
  • 🎥 The default settings for frame rate and motion bucket ID should be left unchanged to avoid destabilizing the generated videos.
  • 🔧 Users can utilize either ComfyUI or a fork of Automatic 1111 to run the Stable Video Diffusion model.
  • 🌟 Comparisons between the new and old models show varying results, with the new model generally performing better in consistency and detail.
  • 🔍 An exception was noted in the case of a burger image, where the old model provided better results.
  • 🚀 The video generation process was tested with various images, including a rocket launch, showcasing the model's capabilities and limitations.
  • 🌸 The cherry blossom tree image demonstrated the new model's ability to maintain scene consistency more effectively than the old model.
  • 🌟 Overall, Stable Video Diffusion 1.1 is recommended for use in most cases, with adjustments in seed or generation for desired outcomes.
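The figures in the highlights pin down the clip shape: at the default 6 fps conditioning, 25 frames is roughly a four-second clip at a 16:9 resolution. A minimal stdlib sketch of that arithmetic (frame count, fps, and resolution come from the video; the helper names are mine):

```python
from math import gcd

# Default generation parameters quoted in the video for SVD 1.1.
NUM_FRAMES = 25            # frames per generated clip
FPS = 6                    # fixed fps conditioning used during fine-tuning
WIDTH, HEIGHT = 1024, 576  # training resolution

def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of the generated clip in seconds."""
    return num_frames / fps

def aspect_ratio(width: int, height: int) -> str:
    """Reduce width:height to its simplest integer ratio."""
    g = gcd(width, height)
    return f"{width // g}:{height // g}"

print(clip_duration_seconds(NUM_FRAMES, FPS))  # ~4.17 s
print(aspect_ratio(WIDTH, HEIGHT))             # 16:9
```

So every generation discussed in the comparison is a short, roughly four-second widescreen clip.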

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is the comparison between Stability AI's Stable Video Diffusion 1.1 and its previous 1.0 model, focusing on their performance in converting images to videos.

  • How can one obtain and use the Stable Video Diffusion 1.1 model?

    -To obtain and use the Stable Video Diffusion 1.1 model, one can visit Hugging Face's website, download the model, and follow the instructions provided in the video script for setting it up in their workflow.

  • What resolution was the new model trained to generate?

    -The new model was trained to generate videos at a resolution of 1024 by 576 pixels.

  • What frame rate and motion bucket ID were used for fine-tuning in the new model?

    -The new model used a fixed conditioning of 6 frames per second and a motion bucket ID of 127 for fine-tuning.

  • How does the video script suggest using the new model in comparison to the old one?

    -The script suggests using the new model for its improved consistency and better handling of certain elements like car tail lights and neon signs, as demonstrated in the provided examples.

  • What are the advantages of using the Comfy UI for this task?

    -The Comfy UI provides a user-friendly interface for setting up and running the workflow, making it easier for users to input images, adjust settings, and obtain video results without dealing with complex coding or command lines.

  • Is there an alternative to Comfy UI for running Stable Video Diffusion 1.1?

    -Yes, an alternative is using a fork of Automatic 1111, as mentioned in the script. However, the user may need to explore this option on their own as it is not the focus of the video.

  • What is the significance of the frame rate and motion bucket ID settings in the models?

    -The frame rate and motion bucket ID settings are crucial for maintaining the consistency and quality of the generated videos. Changing these settings can affect the output, so it's recommended to use the default values unless the user has specific reasons to modify them.

  • How does the video script demonstrate the comparison between the new and old models?

    -The script demonstrates the comparison by showing side-by-side examples of the output from both models, highlighting the differences in the quality and consistency of the generated videos.

  • What is the conclusion drawn from the comparisons made in the video script?

    -The conclusion drawn from the comparisons is that the Stable Video Diffusion 1.1 model generally performs better than the previous version, except in some specific cases like the burger example where the old model performed slightly better.

  • How does the video script address issues with the generated videos?

    -The script acknowledges that there may be imperfections in the generated videos, such as issues with stars in one of the examples. It suggests that users may need to adjust the seed or try different generations to achieve the desired results.
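Since the Q&A stresses leaving the frame rate and motion bucket ID at their defaults, a small guard can flag accidental deviations before a slow generation run. This is an illustrative stdlib sketch, not part of the actual workflow; only the defaults (6 fps, motion bucket ID 127) come from the video:

```python
import warnings

# Fixed conditioning defaults quoted in the video for SVD 1.1.
DEFAULTS = {"fps": 6, "motion_bucket_id": 127}

def check_settings(fps: int, motion_bucket_id: int) -> dict:
    """Warn if settings deviate from the model's fine-tuned defaults."""
    settings = {"fps": fps, "motion_bucket_id": motion_bucket_id}
    for key, value in settings.items():
        if value != DEFAULTS[key]:
            warnings.warn(
                f"{key}={value} differs from the fine-tuned default "
                f"{DEFAULTS[key]}; output may be unstable."
            )
    return settings

check_settings(fps=6, motion_bucket_id=127)   # silent: matches defaults
check_settings(fps=12, motion_bucket_id=127)  # warns about fps
```

A warning rather than a hard error keeps the deliberate-experimentation path open, which matches the video's caveat that changing these values is fine for testing purposes.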



🎥 Introduction to Stability AI's Stable Video Diffusion 1.1

The paragraph introduces Stability AI's updated Stable Video Diffusion model, version 1.1, which is a fine-tuned version of their previous 1.0 model. The speaker aims to compare the new model with the old one to determine improvements. They also mention their Patreon link as a source of income to support their video creation. The process involves inputting an image and getting video results, with a focus on the workflow available in the description. The model was trained to generate 25 frames at a resolution of 1024 by 576. The speaker also touches on the default settings for frames per second and motion bucket ID, emphasizing that altering these could lead to unstable results unless intended for testing purposes. Instructions on how to download and use the model, along with a comparison between the old and new models, are provided.
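Because the model was trained at 1024 by 576, input images with other shapes are typically scaled to cover that size and then center-cropped. A stdlib sketch of that arithmetic (the target size is from the video; the helper is an illustration, not the workflow's actual preprocessing):

```python
def fit_and_crop(src_w, src_h, dst_w=1024, dst_h=576):
    """Scale the source to cover the target, then center-crop.

    Returns (scaled_w, scaled_h, crop_left, crop_top).
    """
    # Use the larger scale factor so the image covers the target
    # (no letterboxing); the excess is cropped symmetrically.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = round(src_w * scale)
    scaled_h = round(src_h * scale)
    crop_left = (scaled_w - dst_w) // 2
    crop_top = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, crop_left, crop_top

# A square 1024x1024 input: no scaling, crop 224 px off top and bottom.
print(fit_and_crop(1024, 1024))  # (1024, 1024, 0, 224)
```

A 16:9 source such as 1920x1080 scales straight to 1024x576 with no crop at all, which is why widescreen inputs tend to survive preprocessing best.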


πŸ” Comparison of New and Old Models Using Various Images

This paragraph presents a detailed comparison between the new Stable Video Diffusion 1.1 model and the previous model using different images. The speaker first discusses an image of a hamburger, noting that the old model performed better in this instance due to the consistency in the background and the static nature of the image. They then move on to an image of a floating market, which proved challenging for both models, but the new model maintained a slightly better consistency. The speaker also comments on the slower zooms and movements in the new model, which helps in maintaining consistency. The comparison concludes with an image of a cherry blossom tree, where the new model clearly outperforms the old one by keeping the scene more consistent, despite some imperfections.


🚀 Final Thoughts on Stable Video Diffusion 1.1 and Community Engagement

In the final paragraph, the speaker wraps up the comparison by stating that Stable Video Diffusion 1.1 generally performs better, except in some specific cases like the hamburger image. They suggest using different seeds or generating new images if the results are not as expected. The speaker also reminds viewers about their Discord community, where AI art and generative AI enthusiasts participate in weekly challenges. They share some of the submissions for the current Cyberpunk Adventures challenge and encourage viewers to join and participate. The paragraph ends with a call to action for viewers to like, subscribe, and support the channel.



💡Image to Video

The process of converting a static image into a dynamic video sequence. In the context of the video, this refers to the use of AI technology to generate video content from a single image input, which is the main theme of the video and its demonstration.

💡Stability AI

The AI company behind the Stable Diffusion family of generative models. In the video, Stability AI is the developer of the Stable Video Diffusion model used to create videos from images.

💡Diffusion Model

A type of AI model that generates data, such as images or videos, by iteratively denoising random noise toward a desired output. In the video, the diffusion model is central to the process of creating videos from images.


💡Fine-Tuning

The process of making small, targeted training adjustments to a machine learning model to improve its performance on a specific task. In the video, fine-tuning is mentioned in relation to the improvement of the Stable Video Diffusion model from version 1.0 to 1.1.

💡ComfyUI

A node-based graphical interface for building and running diffusion workflows. In the video, ComfyUI is the environment where the AI model is operated and the workflow is demonstrated.

💡Automatic 1111 Fork

A modified version or derivative of the original Automatic 1111 software, which includes changes or additional features not found in the original. In the video, the fork is mentioned as an alternative platform for running the stable video diffusion model.

💡Frames Per Second (FPS)

A measurement of how many individual frames are displayed per second in a video. It is a critical aspect of video smoothness and quality. In the video, FPS is discussed in the context of the model's default settings and its impact on the generated video content.

💡Motion Bucket ID

A conditioning parameter in Stable Video Diffusion that controls how much motion the model puts into the generated video, with higher values producing more movement. In the video, the motion bucket ID (default 127) is noted as a parameter that should not be altered for optimal results.


💡Workflow

A series of steps or processes followed to achieve a particular outcome, such as the creation of a video from an image. In the video, the workflow is the sequence of operations demonstrated for using the AI model to generate videos.


💡Comparison

An evaluation of differences and similarities between two or more items, often to determine which is superior or more effective. In the video, comparison is used to assess the performance of the new Stable Video Diffusion 1.1 model against its predecessor.


💡Consistency

The quality of being stable, uniform, or coherent throughout a process or in a final product. In the context of the video, consistency refers to the smoothness and believability of the generated video content.


Stability AI has released an updated version of their stable video diffusion model, version 1.1.

The new model is a fine-tune of the previous 1.0 version, aiming to improve the quality of the output videos.

The process involves inputting a single image and generating a video output through a series of nodes feeding into a KSampler.

A comparison between the new 1.1 model and the old 1.0 model will be conducted to determine the improvements.

The model was trained to generate 25 frames at a resolution of 1024 by 576.

The default settings for the model include a fixed conditioning of 6 frames per second and a motion bucket ID of 127.

The tutorial includes instructions on how to set up the model in both ComfyUI and a fork of Automatic 1111.

The updated model shows significant improvement in consistency, especially in moving objects like a car with tail lights.

In the case of a hamburger image, the old model surprisingly performs better with more consistent background movement and rotation.

The new model handles slow zooms and movements better, maintaining consistency in the visuals.

Characters and people are depicted less realistically in the new model, but consistency in lamps and other objects is fairly well maintained.

The new model shows a clear advantage in maintaining the scene consistency in an image of a cherry blossom tree.

In the rocket launch image, the new model manages to keep the smoke and blast effects consistent, although the stars are not rendered perfectly.

The overall conclusion is that the Stable Video Diffusion 1.1 model performs better in most cases, except for specific instances like the hamburger image.

The tutorial also mentions a Discord community for AI art and generative AI enthusiasts, with weekly challenges and submissions.

The video tutorial aims to educate viewers on the latest advancements in stable video diffusion and how to utilize them effectively.