I Created Another App To REVOLUTIONIZE YouTube

ThioJoe
21 Dec 2022 · 15:14

TLDR: The video covers YouTube's new multi-language audio feature, which lets viewers switch a video's audio track to a dubbed version in another language instead of relying on subtitles. To produce those dubs, the creator built an open-source program called 'Auto Synced and Translated Dubs' that transcribes, translates, and syncs AI-generated speech to a video's subtitle timings. The program uses Google and Microsoft Azure APIs for translation and voice synthesis, producing high-quality output. The video also covers the cost barriers to custom voice models and the potential future of AI in automating video translation and dubbing.

Takeaways

  • 📢 The video introduces a new feature on YouTube that allows switching audio tracks to different languages, offering dubbed versions instead of subtitles.
  • 🔍 The feature is currently limited and not widely available, requiring special access which the creator had to request.
  • 🤖 The creator developed an open-source Python program called 'Auto Synced and Translated Dubs' on GitHub to automate the dubbing process using AI.
  • 📝 The program requires a well-edited SRT subtitle file for accurate timing and synchronization of the dubbed audio with the video.
  • 🔗 Google API is utilized to translate the text into the desired language and generate a new subtitle file.
  • 📉 The program offers two methods for audio length adjustment: time-stretching and two-pass synthesis, with the latter providing better audio quality.
  • 🎧 Two-pass synthesis involves an initial synthesis to determine the required speed adjustment for a second, more precise synthesis.
  • 📈 Microsoft Azure's AI voices are preferred over Google's for their higher quality, though Azure is harder to set up.
  • 📂 A separate script uses FFmpeg to attach the dubbed audio tracks to the video file before uploading to YouTube.
  • 🌐 Additional scripts are provided for translating video titles and descriptions into different languages using Google Translate API.
  • ⏱ The process is semi-automated and can be time-consuming due to the need for human editing and configuration setup.
  • 🔮 A prediction is made that AI will advance to a point where YouTube could automatically transcribe and dub videos, removing the need for manual processes.

Q & A

  • What is the new feature on YouTube that allows viewers to switch the audio track to different languages?

    -The new feature on YouTube allows viewers to switch the audio track to one of several languages, enabling them to hear a dubbed, spoken version of the video instead of just reading translated subtitles.

  • Why did the creator decide to make the 'Auto Synced and Translated Dubs' program?

    -The creator decided to make the 'Auto Synced and Translated Dubs' program because they noticed that there was no service that could tie together the separate features of transcription, translation, and AI voice synthesis into one cohesive tool.

  • What are the limitations of Google's 'Aloud' project compared to the creator's program?

    -Google's 'Aloud' project is invite-only, currently supports only Spanish and Portuguese, requires manual synchronization, and uses AI voices that the creator feels are not the highest quality. The creator's program addresses these limitations by being more inclusive, supporting more languages, using subtitle timings for exact synchronization, and utilizing Microsoft Azure's higher quality AI voices.

  • How does the program handle the synchronization of dubbed audio with the original video?

    -The program uses the subtitle SRT file's timings to determine how long each group of text should take to speak. It then synthesizes audio clips in the target language and adjusts the speed of the AI voice in a second pass to match the required duration, ensuring the dubbed audio is synchronized with the original video.
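
The timing step described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: it pulls the start/end timestamps out of an SRT file and computes how long each dubbed clip must be (a real parser would also need to handle multi-line text, BOMs, and malformed entries).

```python
import re

# Matches one SRT timing line, e.g. "00:00:01,000 --> 00:00:03,500"
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def target_durations(srt_text):
    """Return the duration (seconds) each subtitle entry must fill."""
    durations = []
    for match in TIME_RE.finditer(srt_text):
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        durations.append(round(end - start, 3))
    return durations

sample = """\
1
00:00:01,000 --> 00:00:03,500
Welcome back to the channel.

2
00:00:03,600 --> 00:00:06,100
Today we're looking at dubbing.
"""
print(target_durations(sample))  # → [2.5, 2.5]
```

Each synthesized clip is then fitted to its duration, which is what keeps the dub aligned with the original video.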

  • What is the 'two-pass synthesis' feature of the program and how does it improve audio quality?

    -The 'two-pass synthesis' feature involves synthesizing the audio clip at the default speed first, then comparing its length to the required duration from the subtitle file. The program calculates a ratio and sends a second speech request with an adjusted speed to the text-to-speech service, resulting in an audio clip that is effectively the correct duration without the need for time-stretching, which can degrade audio quality.
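
The ratio calculation at the heart of two-pass synthesis is simple arithmetic; a sketch (function name is illustrative, not from the repo):

```python
def second_pass_rate(first_pass_len, target_len, default_rate=1.0):
    """Given the duration of the first synthesis at the default rate,
    return the speaking rate to request on the second pass so the
    clip comes out at (approximately) the target duration."""
    return default_rate * (first_pass_len / target_len)

# First pass came out at 3.0 s but the subtitle slot is only 2.5 s,
# so ask the TTS service to speak ~1.2x faster on the second pass.
rate = second_pass_rate(first_pass_len=3.0, target_len=2.5)
print(round(rate, 2))  # → 1.2
```

Because the speed change happens inside the synthesizer rather than by post-processing the waveform, the result avoids the artifacts that time-stretching introduces.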

  • How does the program handle the addition of multiple language audio tracks to a video before uploading to YouTube?

    -The program includes a separate script that uses the language identified from the file name and the popular program FFmpeg to add the audio track with proper language tagging to the video without converting the video. It can also merge a sound effects track into each dubbed version before adding it to the video.
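
An FFmpeg invocation of the kind described might look like the sketch below. The flags are real FFmpeg options; the file names are placeholders, and it assumes the original file has a single audio track (so the dub becomes audio stream index 1). In practice the command would be run with `subprocess.run(cmd)`.

```python
def add_dub_command(video, dub_audio, lang_code, output):
    # Build (not run) an FFmpeg command: "-map 0" keeps every stream
    # from the original file, "-map 1:a" adds the dubbed audio,
    # "-c copy" stream-copies so nothing is re-encoded, and the
    # metadata flag tags the new audio track with its language.
    return [
        "ffmpeg", "-i", video, "-i", dub_audio,
        "-map", "0", "-map", "1:a",
        "-c", "copy",
        "-metadata:s:a:1", f"language={lang_code}",
        output,
    ]

cmd = add_dub_command("video.mp4", "dub_spanish.aac", "spa", "video_multilang.mp4")
print(" ".join(cmd))
```

Stream-copying is what makes this step fast: the video is never decoded, only remuxed with the extra audio tracks.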

  • What additional features does the program offer for translating video titles and descriptions?

    -The program includes a script that uses the Google Translate API to translate video titles and descriptions into the languages set by the user. The translated text is then put into a text file from which the user can easily copy and paste for use on YouTube.

  • Why is the custom voice model feature not yet implemented in the program?

    -The custom voice model feature is not yet implemented because it is currently too expensive. Training a custom voice model on platforms like Microsoft Azure can cost between $1,000 to $2,000, with additional costs for using the model and hosting it.

  • What transcription tool does the creator use and why is it preferred?

    -The creator uses OpenAI's 'Whisper' model for transcription because it has proven to be more accurate than other options, even Google's transcription API. It also handles punctuation well and can be run locally on a powerful enough GPU.

  • How does the creator edit and refine the transcription of their videos?

    -The creator uses Descript for transcription editing. While Descript generates its own transcript, the creator prefers to replace it with the more accurate OpenAI Whisper transcript. Descript allows for easy editing of punctuation and capitalization with hotkeys and exports subtitle files that are well-suited for making dubbed versions of videos.

  • What are some of the program's configuration options and how can users customize them?

    -The program includes several configuration files where users can customize various settings, such as formatting options and the amount of space between sentences when the voice speaks. The config files are well-documented, allowing users to easily understand and adjust these settings.

  • What is the creator's prediction for the future of AI in video transcription and dubbing?

    -The creator predicts that AI will become so advanced and affordable that YouTube will automatically transcribe and dub videos in all languages without the need for user intervention. The current limiting factor is the accuracy of speech-to-text transcription, especially for fast or jargon-heavy speech.

Outlines

00:00

🌐 Introducing Multilingual Dubbing on YouTube

The video discusses a new feature on YouTube that allows viewers to switch audio tracks to different languages, offering dubbed versions of videos instead of relying on subtitles. The creator had to request access to this feature, which is currently limited. YouTube does not generate these dubs automatically, so the creator developed an open-source Python program called 'Auto Synced and Translated Dubs' that uses AI to transcribe, translate, and sync audio with subtitles. It addresses the limitations of Google's 'Aloud' project by supporting more languages and using higher-quality AI voices from Microsoft Azure.

05:02

🔍 How the Dubbing Program Works

The video provides a detailed explanation of how the dubbing program functions. It starts with the necessity of a human-edited SRT subtitle file, which the program uses to translate text into the desired language using Google API. The program then generates audio clips for each line of text using a text-to-speech service. To ensure synchronization, the program offers two methods: time-stretching the audio clips to fit the required duration or a two-pass synthesis technique that adjusts the speed of speech to match the subtitle timings more accurately. The program also includes a script for attaching the translated audio tracks to the video file using FFmpeg and another for translating video titles and descriptions.

10:05

💸 Costs and Limitations of Custom Voice Models

The speaker expresses a desire to create custom voice models for a more personalized dubbing experience but highlights the current high costs associated with this technology. Services like Microsoft Azure and Google Cloud offer custom voice creation, but they come with significant expenses in terms of training time and usage. The speaker predicts that AI will improve and become more affordable, eventually allowing YouTube to automatically transcribe and dub videos. The video also shares the speaker's personal workflow for transcribing videos, which includes using OpenAI's 'Whisper' model and Descript for transcription editing.

15:09

📣 Conclusion and Future Outlook

The video concludes with the speaker's intention to apply the dubbing process to most of their future videos. They also discuss the potential for AI advancements to reduce the need for manual dubbing processes. The speaker encourages viewers to like the video if they found it interesting and suggests watching their next video about a speech enhancer AI tool by Adobe.

Keywords

💡Gear

In the context of the video, 'gear' refers to the settings icon typically represented by a small gear symbol on video platforms like YouTube. It is used to access and adjust various settings for the video, such as switching audio tracks to different languages. This feature is central to the video's theme of enhancing accessibility and internationalization of YouTube content.

💡Dubbed Translations

Dubbed translations are audio tracks in which the original language of a video is replaced with a different language, allowing viewers to hear the content in their preferred language. In the video, this concept is significant as it discusses the creation of dubbed versions using AI, which can potentially revolutionize how international audiences experience YouTube videos.

💡Subtitle SRT File

An SRT file is a SubRip subtitle file format used to include subtitle text with specific timing information for when each subtitle should appear and disappear. In the video script, the importance of a well-edited SRT file is emphasized because it contains not only the text but also the exact timing for each subtitle, which is crucial for synchronizing dubbed audio with the video.

💡Google API

Google API refers to the application programming interfaces provided by Google that allow developers to access specific features of Google's services, such as translation services. In the video, the creator uses Google API to translate the text from the SRT files into different languages, which is a key step in the process of creating dubbed translations.

💡Text-to-Speech (TTS)

Text-to-speech technology converts written text into spoken words, using AI voices. In the video, TTS is used to generate audio clips for each subtitle line in the chosen language, which are then synchronized with the video to create dubbed translations. This technology is central to the program's ability to produce dubbed audio tracks.

💡Time Stretching

Time stretching is a process that involves altering the duration of an audio clip without changing its pitch. In the context of the video, the creator discusses using time stretching to match the length of the AI-generated audio clips to the timing specified in the SRT file. However, it is noted that this technique can degrade audio quality.

💡Two-Pass Synthesis

Two-pass synthesis is a method introduced in the video to generate audio clips of the exact required length without degrading quality through time stretching. It involves synthesizing the audio clip at the default speed first, determining the duration, and then synthesizing it again at an adjusted speed to match the desired length. This method ensures high-quality audio in dubbed translations.
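
The adjusted speed from the second pass has to be conveyed to the TTS service somehow. Assuming Microsoft Azure's Speech service, which accepts SSML, a second-pass request might set the rate roughly like this (the voice name and text are placeholder examples):

```xml
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="es-ES">
  <voice name="es-ES-ElviraNeural">
    <!-- rate="1.2" requests 1.2x the default speaking speed -->
    <prosody rate="1.2">Bienvenidos de nuevo al canal.</prosody>
  </voice>
</speak>
```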

💡FFmpeg

FFmpeg is a popular, open-source multimedia framework that can handle various tasks, such as transcoding audio and video files between different formats. In the video, the creator uses FFmpeg to add the newly created dubbed audio tracks to the video file with proper language tagging, without the need to convert the video itself.

💡Custom Voice Model

A custom voice model refers to a personalized voice created using AI, which can mimic the voice of a specific individual. The video discusses the possibility of creating such a model to enable the dubbed translations to be spoken in the creator's own voice across multiple languages, although it is currently cost-prohibitive.

💡Google Cloud

Google Cloud is a suite of cloud computing services offered by Google, which includes machine learning and voice recognition services. The video mentions Google Cloud in the context of custom voice creation, comparing its capabilities and costs with those of Microsoft Azure for creating multilingual voice models.

💡OpenAI's Whisper

OpenAI's Whisper is a model for transcribing and translating speech to text. The video script highlights its use for generating highly accurate transcripts, which are essential for creating dubbed translations with precise timing and punctuation, thus improving the overall quality of the final product.

Highlights

A new YouTube feature allows switching audio tracks to different languages, offering dubbed versions instead of subtitles.

The feature is currently limited and requires access, which the author had to request.

The author created an open-source Python program called 'Auto Synced and Translated Dubs' on GitHub to facilitate this process.

The program uses AI to transcribe, translate, and synchronize audio with subtitles, addressing limitations of current services.

The author discusses the high cost of training custom voices for multilingual speech, which is currently prohibitive.

The program requires a human-edited SRT subtitle file for accurate timing and synchronization.

Google's API is used to translate text into the desired language and generate a new subtitle file.

The program offers two methods for audio synchronization: time-stretching and two-pass synthesis for better quality.

Two-pass synthesis involves adjusting the speed of speech to match the required duration more accurately.

The program can stretch audio to be exactly the correct length, though this can degrade quality.

A separate script is used to attach the translated audio tracks to the video file for uploading to YouTube.

The program also includes a feature to translate video titles and descriptions for multilingual support.

The author predicts that AI will eventually automate transcription and dubbing for all videos on YouTube.

Current limitations in speech-to-text accuracy are the main challenge to fully automating this process.

The author uses OpenAI's 'Whisper' model for transcription, finding it more accurate than Google's API.

Descript is used for transcription editing, offering tools to quickly adjust punctuation and capitalization.

Descript's subtitle export is more suitable for dubbing as it aligns with sentence structures.

The program provides various configuration options for customization, such as voice speed and sentence spacing.

The author encourages viewers to like the video and check out the next video about Adobe's speech enhancer AI tool.