The Secrets Behind Voice Cloning & AI Covers

8 Aug 2023 16:54

TLDR: The video script delves into the world of AI voice cloning and conversion technologies, explaining the differences between text-to-speech synthesis and voice-to-voice conversion. It highlights two main text-to-speech models, Tacotron 2 and Tortoise TTS, and discusses their pros and cons, including training times and voice quality. The video also covers voice conversion tools like so-vits-svc and RVC, which allow for high-quality voice transformations. Additionally, it touches on the use of vocoders like HiFiGAN for generating natural-sounding speech. The script mentions various services and applications, such as UberDuck, FakeYou, and ElevenLabs, that offer voice cloning and conversion, and it concludes with a discussion on the potential applications of these technologies for content creators and voice actors. The narrator also shares a unique approach to creating a custom pipeline for voice synthesis using Tortoise and RVC models, demonstrating the process with a sample narration.


  • 📚 **AI Voice Cloning and Synthesis Overview**: The video discusses the capabilities and differences between text-to-speech and voice-to-voice conversion technologies.
  • 🎵 **AI in Music**: AI-generated singing, exemplified by the AI Drake song, is made possible through voice-to-voice conversion, which trains the AI on an audio reference of the target voice.
  • 🤖 **Text-to-Speech (TTS) Technologies**: Two main TTS research models are Tacotron 2 and Tortoise TTS, each with their own advantages and trade-offs in terms of speed, quality, and training time.
  • 🎙️ **Voice Cloning Services**: Services like UberDuck, FakeYou, and ElevenLabs offer various levels of voice cloning and TTS, with different libraries and functionalities.
  • 🔍 **Research Behind the Scenes**: The video references research papers and developments that form the backbone of current voice cloning tools, such as Tacotron 2 and Tortoise TTS.
  • 📈 **Quality and Training**: The quality of voice cloning is dependent on the training data and time, with some models requiring hours of data and others significantly less.
  • 🎼 **Vocoders in Voice Synthesis**: Vocoders like HiFiGAN are crucial for generating high-fidelity audio from spectrograms, contributing to the natural sound of the synthesized voice.
  • 🌐 **Cultural Impact on Research**: There's a noted difference in research priorities between US and Chinese researchers, with implications on the development of voice cloning technologies.
  • 📈 **ElevenLabs' Prominence**: ElevenLabs has gained attention for its ease of use and high-quality voice cloning, even being used to create videos of political figures in fictional scenarios.
  • 🔧 **DIY Voice Cloning**: For those with the necessary hardware, free local UIs are available to clone voices using models like Tacotron 2 and Tortoise TTS.
  • 🔬 **Combining Models for Better Quality**: The video's narration is an example of combining Tortoise TTS with RVC to create a high-quality, unpaired text-to-speech AI without needing a voice actor.

Q & A

  • What are the two main categories of voice or speech generation technologies mentioned in the transcript?

    -The two main categories are classic text-to-speech synthesis (pure text-to-speech) and voice-to-voice conversion.

  • How does the AI technology generate singing voices, such as the AI Drake song?

    -Voice-to-voice conversion is used. The AI is first trained on an audio reference of the target voice (like Drake's); a person then sings the song, and the AI converts the sung vocals into the trained voice.

  • What is the main difference between pure text-to-speech and voice-to-voice conversion in terms of sound imitation?

    -Pure text-to-speech does not allow for the imitation of specific sounds or styles of speech, whereas voice-to-voice conversion can copy those nuances since it is based on an audio reference.

  • Which two research papers are currently the most popular for text-to-speech synthesis?

    -The two most popular research papers for text-to-speech synthesis are Tacotron 2 and Tortoise TTS.

  • What is the main advantage of using Tortoise TTS over Tacotron 2?

    -Tortoise TTS requires less data and training time, and it provides better voice consistency and higher quality audio, although it is slower at generating voices.

  • How does the vocoder module contribute to the quality of the synthesized voice?

    -The vocoder module generates the audio waveform from audio spectrograms, with HiFiGAN being a popular choice due to its superior performance in creating high-fidelity, natural-sounding speech.

  • What are the two main popular options for voice-to-voice conversions?

    -The two most popular options for voice-to-voice conversion are so-vits-svc (SoftVC VITS Singing Voice Conversion) and RVC (Retrieval-based Voice Conversion).

  • Why might some researchers from different cultures have different priorities in their AI developments?

    -Different cultural priorities can lead to a focus on different aspects of AI technology, such as the US researchers focusing more on text-to-speech and Chinese researchers on voice conversions.

  • What is UberDuck, and what recent change has it undergone?

    -UberDuck is a service with a large online library for text-to-speech and voice-to-voice models. It recently removed all user-uploaded models and transitioned into a commercial-friendly service, possibly due to legal takedowns.

  • How does ElevenLabs stand out in terms of voice cloning?

    -ElevenLabs is known for its ease of use and high-quality voice cloning. It can clone a voice with just a minute of voice data and has a professional voice cloning service that requires 80 minutes of voice data.

  • What is the process of combining Tortoise TTS and RVC for an unpaired text-to-speech?

    -The process involves using the output of Tortoise TTS as an input reference for RVC. This allows Tortoise to maintain the speaker's style while RVC smooths out the audio for a higher quality, fully generated voice.

  • What are the main considerations when choosing between ElevenLabs and the Tortoise TTS + RVC combo for voice cloning?

    -The choice depends on the balance between convenience and quality. ElevenLabs is more convenient and quicker but may not match the quality of the Tortoise TTS + RVC combo, which requires more data, time, and effort for higher quality voice cloning.
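The Tortoise TTS + RVC combination described in the answers above is a two-stage pipeline: a text-to-speech stage produces a draft carrying the speaking style, and a voice-conversion stage re-renders that draft in the target voice. The sketch below shows only that shape. The `Audio` type, `make_pipeline`, and both stub stages are hypothetical placeholders invented for illustration; they are not the Tortoise or RVC APIs, and the sample rate is an arbitrary placeholder value.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Audio:
    """Hypothetical container for a clip; not a type from any real library."""
    samples: List[float]
    sample_rate: int
    speaker: str


# The two categories from the video, expressed as function shapes:
TextToSpeech = Callable[[str], Audio]       # text in, audio out
VoiceConversion = Callable[[Audio], Audio]  # audio in, audio out


def make_pipeline(tts: TextToSpeech, vc: VoiceConversion) -> Callable[[str], Audio]:
    """Chain a TTS stage into a voice-conversion stage."""
    def narrate(text: str) -> Audio:
        draft = tts(text)   # stage 1: text -> draft speech (carries the style)
        return vc(draft)    # stage 2: draft speech -> target voice
    return narrate


def stub_tts(text: str) -> Audio:
    """Stand-in for a TTS model: emits one silent sample per character."""
    return Audio(samples=[0.0] * len(text), sample_rate=24000, speaker="tts-draft")


def stub_vc(audio: Audio) -> Audio:
    """Stand-in for a voice-conversion model: relabels the speaker only."""
    return Audio(samples=list(audio.samples),
                 sample_rate=audio.sample_rate,
                 speaker="target-voice")


narrate = make_pipeline(stub_tts, stub_vc)
result = narrate("Hello there")
print(result.speaker, len(result.samples))
```

Because the second stage only needs audio as input, the first stage can be swapped for any speech source, which is why no voice actor is required in this setup.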



🤖 Introduction to AI Voice Cloning Technologies

This paragraph introduces the audience to the world of AI voice cloning, explaining the dual nature of AI's proficiency and shortcomings. It outlines the two main categories of voice generation: classic text-to-speech synthesis and voice-to-voice conversion. The former is exemplified by Siri and TikTok's text-to-speech, while the latter is showcased by AI-generated singing like the AI Drake song. The paragraph also distinguishes between the two by noting that text-to-speech cannot imitate specific sounds or styles, whereas voice-to-voice conversion can. The backbone technologies of these processes, including Tacotron 2, Tortoise TTS, and various vocoders like HiFiGAN, are briefly discussed, setting the stage for a deeper dive into the subject.


🎤 Exploring Voice Conversion Technologies

The second paragraph delves into the complexities of AI voice conversion software, mentioning the layered research and development process. It introduces so-vits-svc and RVC as two popular options for voice-to-voice conversion, highlighting their capabilities, GitHub popularity, and the improvements of RVC over its predecessor. The paragraph also touches on the cultural differences in research priorities, with a humorous note on the development communities behind text-to-speech and voice conversion technologies. It concludes with a brief mention of TalkNet, another text-to-speech research model, and transitions into discussing various services that utilize these technologies.


🌐 Services Utilizing AI Voice Cloning

This paragraph discusses several services that provide AI voice cloning capabilities. UberDuck is noted for its large online library but has shifted to a commercial model. FakeYou is praised for its user interface and variety of models, despite longer wait times for free use. ElevenLabs is highlighted for its ease of use and high-quality voice cloning, particularly for English speakers. The paragraph also covers the limitations and requirements of these services, such as hardware needs and the importance of clear voice data. It concludes with a demonstration of how these services can be combined for higher quality voice synthesis, specifically mentioning the use of Tortoise TTS and RVC in the video's narration.


📚 Free Tools and Future of AI Voice Cloning

The final paragraph provides information on free local UIs for voice cloning with Tacotron 2 and Tortoise TTS, offering resources for those with sufficient hardware. It also suggests tools for separating voice from background noise. The potential applications of AI voice cloning in content creation and language translation are explored, emphasizing the customizability of pipelines for different users. The narrator shares their experience with ElevenLabs' professional fine-tuned voice cloning and compares it with the Tortoise + RVC combo in terms of convenience and quality. The paragraph concludes with a sponsored message for an educational platform for learning AI and machine learning, and a thank you note to supporters.



💡Text-to-Speech (TTS)

Text-to-Speech (TTS) technology is a form of AI that converts written text into audible speech. The video compares it to Siri or TikTok's text-to-speech feature, where the AI uses text as the sole input to generate audio. This technology is fundamental to the discussion as it represents the basic mechanism behind voice cloning and AI-generated voices.

💡Voice Cloning

Voice cloning refers to the process of creating a synthetic replica of a person's voice using AI. The video discusses how AI can generate custom voices and imitate the style and sound of a specific individual's voice, which is central to the theme of exploring advanced AI capabilities in voice replication.

💡Voice-to-Voice Conversion

This is a category of voice synthesis that allows for the conversion of one voice into another, even enabling AI-generated singing. The video uses the example of an AI Drake song to illustrate this concept, highlighting its role in creating realistic and stylistic voice reproductions.

💡Tacotron 2

Tacotron 2 is a research model for text-to-speech synthesis developed by Google, with a widely used open-source implementation published by Nvidia. It is mentioned in the video as being fast but with lower quality compared to other models. Tacotron 2 requires significant fine-tuning and data to effectively replicate a voice, making it a less preferred option for some users.

💡Tortoise TTS

Tortoise TTS is a research model developed by James Betker that is popular for voice cloning. The video explains that it requires less data and training time than Tacotron 2, offering a balance between quality and efficiency. Despite being slower at generating voices, it provides better voice consistency and higher quality output.


💡HiFiGAN

HiFiGAN is a vocoder, a module that generates audio waveforms from audio spectrograms. It is highlighted in the video as the most popular choice among synthesizers due to its ability to create high-fidelity, natural-sounding speech with high-resolution audio waveforms. HiFiGAN plays a crucial role in enhancing the audio quality of voice-cloned outputs.
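To make the vocoder's input concrete, the following minimal sketch computes a magnitude spectrogram from a short test tone using a plain DFT. It is not HiFiGAN and not a vocoder: it shows the opposite direction (waveform to spectrogram) and illustrates that the phase is thrown away, which is part of what a vocoder has to regenerate when it goes from spectrogram back to waveform. The sample rate, frame size, hop, and test tone are arbitrary illustrative values, and real systems typically use mel-scaled spectrograms rather than the raw DFT bins used here.

```python
import cmath
import math


def magnitude_spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of DFT magnitudes (phase discarded)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        # Hann window to reduce spectral leakage at the frame edges.
        windowed = [
            s * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, s in enumerate(frame)
        ]
        magnitudes = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            acc = sum(
                x * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n, x in enumerate(windowed)
            )
            magnitudes.append(abs(acc))  # keep magnitude, drop phase
        frames.append(magnitudes)
    return frames


SAMPLE_RATE = 8000  # illustrative value
TONE_HZ = 1000      # illustrative test tone
FRAME_SIZE = 64

wave = [math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(256)]
spec = magnitude_spectrogram(wave, frame_size=FRAME_SIZE, hop=32)

# The strongest bin in the first frame should sit at the tone's frequency.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), "frames;", "peak at", peak_hz, "Hz")
```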


💡so-vits-svc

so-vits-svc, or SoftVC VITS Singing Voice Conversion, is a software for voice-to-voice conversion, particularly adept at singing voice replication. The video notes its popularity on GitHub and its complex AI architecture that combines various technologies to achieve high-quality voice conversion.

💡RVC (Retrieval-based Voice Conversion)

RVC is a newer voice conversion technology that is presented as an improvement over so-vits-svc. It is capable of generating more consistent and accurate vocals with faster training times and lower data and hardware requirements. The video suggests that RVC might be the technology behind the AI Drake song due to its superior audio quality.
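The "retrieval" in the name can be illustrated with a toy example: for each frame of features from the source audio, look up the most similar stored frame from the target voice and pull the source frame toward it. The sketch below shows that general idea only. It is a simplified assumption-laden illustration, not RVC's actual code: the function names, the two-dimensional feature vectors, the single-nearest-neighbor lookup, and the `blend` parameter are all hypothetical.

```python
import math


def nearest(query, bank):
    """Return the stored vector closest to the query (Euclidean distance)."""
    return min(bank, key=lambda v: math.dist(query, v))


def retrieve_and_blend(features, bank, blend=0.75):
    """Mix each source frame with its nearest target-voice frame.

    blend=1.0 uses only the retrieved frame; blend=0.0 leaves the source untouched.
    """
    converted = []
    for frame in features:
        match = nearest(frame, bank)
        converted.append([blend * m + (1 - blend) * f for m, f in zip(match, frame)])
    return converted


# Toy "feature bank" standing in for frames collected from the target voice.
bank = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
# Toy frames standing in for features extracted from the source speaker.
source = [[0.8, 1.2], [3.0, 5.0]]

converted = retrieve_and_blend(source, bank, blend=0.5)
print(converted)
```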


💡UberDuck

UberDuck is mentioned as a service with a large online library of text-to-speech and voice-to-voice models. However, it has shifted to a commercial-friendly service and removed user-uploaded models, possibly due to legal takedowns, which is a significant change impacting its user base.


💡ElevenLabs

ElevenLabs is highlighted for its ease of use in voice cloning. The video describes how it can clone voices with just a minute of voice data and has recently released an improved English V2 model. It is noted for its instant voice cloning function and the potential upcoming release of a new voice conversion feature.


💡TalkNet

TalkNet is a text-to-speech research model that accepts ARPABET, a phonetic pronunciation notation, as input to specify exactly how a word should sound. Although not covered in depth in the video, it is mentioned as an example of ongoing research in the field of voice synthesis and AI.
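As a small illustration of what ARPABET input looks like, the sketch below replaces known words with their phoneme sequences and leaves unknown words as plain text. The three lexicon entries follow CMUdict-style ARPABET (the digits mark vowel stress). The `to_arpabet` function and the curly-brace wrapping are assumptions made for this example and are not necessarily the input format TalkNet expects.

```python
# Tiny hand-written lexicon; a real system would load a full pronunciation dictionary.
LEXICON = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "speech": ["S", "P", "IY1", "CH"],
}


def to_arpabet(text, lexicon=LEXICON):
    """Replace words found in the lexicon with brace-wrapped ARPABET phonemes."""
    pieces = []
    for word in text.lower().split():
        key = word.strip(".,!?")  # ignore surrounding punctuation for lookup
        if key in lexicon:
            pieces.append("{" + " ".join(lexicon[key]) + "}")
        else:
            pieces.append(word)  # unknown word: pass through unchanged
    return " ".join(pieces)


print(to_arpabet("Hello world"))   # {HH AH0 L OW1} {W ER1 L D}
print(to_arpabet("Hello there!"))  # {HH AH0 L OW1} there!
```

Spelling out phonemes this way lets a user force a specific pronunciation for names or unusual words that a model would otherwise guess from spelling.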


AI technologies are being used to generate custom voices and even AI-generated singing, exemplified by the AI Drake song.

Voice generation can be categorized into two main types: text-to-speech synthesis and voice-to-voice conversion.

Text-to-speech synthesis involves AI using text to generate audio, like Siri or TikTok's text-to-speech feature.

Voice-to-voice conversion requires an audio reference to train and learn a specific voice, then converts another person's vocals into that voice.

Pure text-to-speech does not allow for imitation of sounds or speech style, which voice-to-voice conversion can achieve.

Tacotron 2 and Tortoise TTS are two main research backbones for text-to-speech synthesis.

HiFiGAN is a popular vocoder for generating high-fidelity, natural-sounding speech from audio spectrograms.

so-vits-svc and RVC are popular options for voice-to-voice conversions, capable of producing high-quality audio up to 48 kHz.

ElevenLabs offers instant voice cloning with high-quality results, but requires the voice to be fluent in English.

Free local UIs are available for voice cloning using Tacotron 2 and Tortoise TTS, suitable for computers with sufficient VRAM.

Combining Tortoise's output with RVC can create an unpaired text-to-speech AI with improved quality.

The video's narration was created using a custom pipeline combining Tortoise and RVC, without the need for the narrator's voice.

ElevenLabs' pro voice cloning is a convenient option for voice cloning with less effort, despite potentially lower quality compared to open-source options.

The Tortoise + RVC combo provides the best quality in text-to-speech voice cloning but requires more data, time, and effort.

AI voice cloning technology can assist content creators in translating content into other languages while maintaining their unique voice.

The video's sponsor offers a clear roadmap and interactive lessons for learning AI and machine learning fundamentals.