They Beat Open AI to the Punch... But at What Cost?

MattVidPro AI
3 Jul 2024 · 21:48

TLDR: The video explores a new multimodal AI, Moshi, that is designed to understand and express emotions through voice. Although less intelligent than GPT-4 Omni, Moshi stands out for its real-time conversational abilities and planned open-source release. The host tests Moshi's capabilities, including emotion recognition and singing, and compares it with other AIs such as Pi and ChatGPT, noting the need for improvement but expressing excitement about its future development.


  • 😀 The video discusses a GPT-4 Omni voice demo that was impressive but not yet accessible to the public.
  • 🤖 An AI named Moshi has been released with technology similar to GPT-4 Omni, but it is less intelligent and has trouble understanding emotions and following conversations.
  • 🔊 Moshi has real-time audio capabilities and can generate speech, but its voice quality, while decent, is not the best available.
  • 🔍 Moshi was trained on a mix of text and synthetic audio data, but the underlying model has not been fine-tuned on conversations and lacks the sophistication of AIs like GPT-4 Omni.
  • 💡 Moshi's biggest advantage is that it will be open source, allowing the community to improve and customize it once the code and models are released.
  • 🎤 The video includes an attempt to get Moshi to sing a song about butterflies, which highlights the AI's limitations in tasks that require creativity or emotional understanding.
  • 📉 Compared with other AIs like Pi AI and ChatGPT, Moshi falls short in both voice quality and interaction capabilities.
  • 📝 The video includes a conversation with Moshi about writing a story, in which Moshi offers basic advice but struggles to grasp the complexity of the task.
  • 😓 Moshi's attempts to interpret the speaker's emotions are inconsistent and often incorrect, showing how difficult it finds emotion recognition.
  • 👍 Despite its shortcomings, the video creator expresses hope and excitement about Moshi's potential once it becomes open source and can be improved by the community.
  • 🔮 The video concludes with a look forward to the future of multimodal AI and the anticipation of more advanced and user-friendly AI interactions.

Q & A

  • What is the main topic discussed in the video script?

    -The main topic is a comparison of different AI technologies, focusing on the Moshi demo and how its capabilities stack up against other AIs like GPT-4 Omni and Pi AI.

  • What is Moshi, and what makes it unique?

    -Moshi is a multimodal AI model that can listen and speak in real time. It is unique because it is one of the first native multimodal AI models that can understand and reason across both text and voice, and it is set to be released as open source.

  • Why does the host consider the Moshi demo disappointing?

    -The host finds the demo disappointing because, although Moshi is supposed to express and understand emotions, it often fails to do so accurately and struggles with basic tasks, making it less intelligent than other AIs like GPT-4 Omni.

  • What is the significance of Moshi being open source?

    -Being open source means the community can access, modify, and improve the model, potentially making it smarter and more useful in the future.

  • How does the Moshi demo handle singing requests?

    -Moshi attempts to sing but struggles with the task, often repeating phrases and failing to deliver a coherent or melodious performance, which shows its limitations in this area.

  • What is the role of the Discord server mentioned in the script?

    -The Discord server is where the host found the link to the Moshi demo; it is presented as a source of information and discussion about AI technologies.

  • How does the host test Moshi's ability to understand emotions?

    -The host changes the tone and pitch of their voice to simulate different emotions and asks Moshi to identify the emotion being projected.

  • What is the difference between Moshi and Pi AI in terms of voice quality and interaction?

    -Pi AI has a more realistic, better-sounding voice and can generate song lyrics, but it cannot sing. Moshi, although it is a multimodal AI, struggles with voice interaction and singing, and its voice quality is not as good as Pi AI's.

  • What is the host's opinion on Moshi's potential once it is open source?

    -The host is excited about Moshi's potential once it becomes open source, believing that the community can improve its capabilities and make it more usable and intelligent.

  • How does the host compare Moshi to GPT-4 Omni?

    -The host notes that while Moshi is a true multimodal AI that can listen and speak, it lacks the intelligence and smooth interaction of the GPT-4 Omni voice demo, which is not yet accessible to the public.



🤖 GPT-4 Omni Demo and Moshi Introduction

The video begins with a discussion of the GPT-4 Omni voice demo, which showcased an AI that could understand and mimic human emotions and conversation. The host expresses disappointment that this GPT-4 Omni voice experience is not yet accessible to the public and introduces Moshi, a similar technology that is currently available for testing. Moshi is a multimodal model that can listen and speak in real time, although it is not as advanced as GPT-4 Omni. The script mentions that Moshi is based on joint pre-training with text and synthetic audio data and will be released as open source, allowing the community to improve its capabilities.


🎤 Testing Moshi's Emotion Recognition and Singing Abilities

The host then tests Moshi's ability to recognize emotions and sing. Despite its stated capability to understand emotions, Moshi fails to accurately identify the host's emotional state during the conversation. The host also challenges Moshi to sing a song about butterflies, which it attempts with limited success. The host expresses frustration with Moshi's performance, comparing it unfavorably to other AI models like Pi AI and ChatGPT.


🔊 Comparing Moshi with Other AI Models

In this section, the host compares Moshi with other AI models, specifically Pi AI and ChatGPT. While acknowledging that Moshi is not as advanced, the host is interested in its potential once it becomes open source. The script details the host's experience with Pi AI, which produces a song about butterflies with a more realistic voice and better lyrics. The host also tests ChatGPT's ability to create a butterfly song, noting that while it cannot sing, it can generate lyrics.


📝 Moshi's Struggle with Storytelling and Emotional Understanding

The host asks Moshi for help writing a story, seeking advice on structuring the narrative. Moshi provides generic advice about protagonists and challenges. The host then asks Moshi to identify emotions projected through their voice, which Moshi attempts but often gets wrong, leading to a humorous exchange in which the host accuses Moshi of 'cheating' when it finally guesses correctly after the host reveals the emotion.


🌐 Hopes for Moshi's Open Source Future

The video concludes with the host reflecting on Moshi's potential once it becomes open source. They express optimism that the community can improve Moshi, making it more usable and more competitive with other AI technologies like GPT-4 Omni. The host acknowledges Moshi's current limitations but maintains a positive outlook for its future development, inviting viewers to share their thoughts on the matter.



💡GPT-4 Omni

GPT-4 Omni refers to OpenAI's GPT-4o, an advanced AI model capable of natural conversation and of responding to emotions in a human-like way. In the video, it serves as the benchmark against which other AI technologies are compared. The script notes that the voice capabilities shown in its demo had not yet been publicly released, creating a sense of anticipation and setting a high standard for the AI capabilities being discussed.

💡multimodal model

A multimodal model in the context of AI refers to a system that can process and understand multiple types of data, such as text, audio, and visual information. In the video, the term is used to describe AI systems like GPT-4 Omni and Moshi, which are capable of listening and speaking in real time, thus integrating different modalities of data to facilitate more human-like interaction.


💡Moshi

Moshi is the AI model tested in the video. It is described as a 'native multimodal foundation model' that can listen and speak, similar to GPT-4 Omni, but it is currently limited in intelligence and emotional understanding. The script discusses Moshi's performance in various tests and highlights its potential as an open-source project for community improvement.

💡open source

Open source refers to a philosophy of software development where the source code is made available to the public, allowing anyone to view, modify, and distribute the software. In the context of the video, Moshi is highlighted as a planned open-source project, which means that once the paper, code, and models are released, the community can contribute to its development and improvement.

💡emotional understanding

Emotional understanding in AI is the capability of an AI system to recognize, interpret, and respond to human emotions. The script discusses Moshi's purported ability to 'express and understand emotions,' although this feature did not perform well during testing, indicating a gap between the AI's advertised capabilities and its actual performance.


💡text-to-speech (TTS)

Text-to-speech (TTS) is a technology that converts written text into audible speech. The script mentions TTS when comparing different AI systems, such as Pi AI, which uses TTS to generate a realistic-sounding voice for its responses, despite not being a fully integrated multimodal system like Moshi.

💡large language model (LLM)

A large language model, or LLM, refers to an AI system trained on vast amounts of text data to generate human-like language. The script mentions the Helium 7B LLM as the large language model that powers Moshi, indicating the underlying technology that enables the AI's text generation and processing capabilities.


💡latency

Latency in the context of AI and computing refers to the delay before a system responds to a command or request. The script notes that Moshi has a 'really good latency of only 200 milliseconds,' which is important for real-time interactions, suggesting that despite its limitations, Moshi performs well in terms of response speed.
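To make the latency figure concrete, here is a minimal Python sketch of how one might time a single request and response against a 200 millisecond budget. It is only an illustration: `measure_latency_ms` and `fake_respond` are hypothetical names invented for this example, the 50 ms delay is arbitrary, and nothing here reflects how the demo itself is implemented or measured.

```python
import time

# Budget taken from the 200 ms figure quoted in the video.
LATENCY_BUDGET_MS = 200.0


def measure_latency_ms(respond, prompt):
    """Call `respond(prompt)` once and return (reply, elapsed milliseconds)."""
    start = time.perf_counter()
    reply = respond(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return reply, elapsed_ms


def fake_respond(prompt):
    """Hypothetical stand-in for a voice model: waits 50 ms, then echoes."""
    time.sleep(0.05)
    return f"echo: {prompt}"


if __name__ == "__main__":
    reply, elapsed_ms = measure_latency_ms(fake_respond, "hello")
    within_budget = elapsed_ms <= LATENCY_BUDGET_MS
    print(f"{elapsed_ms:.0f} ms, within budget: {within_budget}")
```

In a real voice system the interesting number is the time from the end of the user's speech to the first audible output, so a single wall-clock timing like this is only a rough proxy.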

💡voice demo

A voice demo is a demonstration of an AI system's ability to use voice input and output for interaction. The script begins with a reference to a 'GPT-4 Omni voice demo,' setting the stage for the discussion of AI capabilities in processing and generating human-like voice communication.


💡singing

In the script, singing is used as a test of the AI's ability to generate creative content and possibly express emotion through song. The AI's attempts to sing a 'butterfly song' are highlighted, showcasing the system's capacity for creative language generation, even if it cannot produce actual singing.


A new AI demo similar to GPT-4 Omni has been released, offering real-time conversation with a human-like voice.

The GPT-4 Omni demo was impressive but not yet accessible to the public, causing disappointment.

Moshi, while not as smart as GPT-4 Omni, is available for public testing and has a nice-sounding voice.

Moshi is a multimodal model that can listen and speak in real time, but struggles with understanding emotions.

Moshi uses joint pre-training on text and synthetic audio data but is not fine-tuned on conversations.

Moshi's voice is competitive but not the best, with a latency of only 200 milliseconds.

Moshi will be released as open source, allowing the community to improve and customize the model.

The open-source release of Moshi is a significant opportunity for the community to enhance its capabilities.

Moshi's ability to express and understand emotions was tested but not consistently demonstrated.

The Moshi demo is entirely free to access, but requires joining a queue with an email.

Comparisons between Moshi and other AI models like Pi AI and ChatGPT highlight differences in capabilities.

Pi AI, while not multimodal, offers better voice quality than Moshi and a more convincing butterfly song, though it recites lyrics rather than actually singing.

ChatGPT, despite not having multimodal capabilities, provides a more natural conversation flow.

Moshi's performance in understanding emotions and tonality was inconsistent and often incorrect.

The potential for Moshi to improve through open-source collaboration is a significant advantage.

The creator expresses excitement for the future of Moshi and the possibilities of community-driven enhancements.

Moshi's current limitations are acknowledged, but the open-source aspect offers hope for significant improvements.