Miso One by Miso AI offers Miso TTS 8B, an English-only emotive text-to-speech model with open weights for local download, enabling expressive and natural-sounding speech generation.

Is Miso One free to use?

The model weights are open and available for local download, but open weights do not by themselves grant unrestricted use. Check Miso AI's license terms before any commercial use.

What languages does Miso One support?

Miso One currently supports English only, with a focus on emotive and natural-sounding speech.

Can I run Miso One locally?

Yes, the model weights are open for local download, allowing developers to run it on their own hardware.

What are the system requirements for Miso One?

The model has 8B parameters, so it needs substantial GPU memory. A GPU with 16 GB or more of VRAM is recommended for good performance.
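As a rough illustration of where a figure like 16 GB comes from, the sketch below estimates the memory needed for the weights alone at a few common numeric precisions. The precisions are illustrative assumptions; nothing here states which precision the Miso TTS 8B checkpoint actually ships in, and real usage will be higher once activations and audio buffers are counted.

```python
# Rough weight-memory estimate for an 8B-parameter model.
# Assumption: the precisions below are illustrative only; the checkpoint's
# actual storage format is not stated here.

PARAMS = 8_000_000_000  # "8B" from the model name

BYTES_PER_PARAM = {
    "float32": 4,
    "float16/bfloat16": 2,
    "int8": 1,
}


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name:>18}: {weight_memory_gb(PARAMS, size):.0f} GB")
```

At 2 bytes per parameter the weights alone come to 16 GB, so a 16 GB card leaves little headroom unless the model is run at reduced precision.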

How do I get started with Miso One?

Download the open weights from Miso AI's official repository and follow the provided documentation for installation and usage.

What is Miso One?

Miso One is the product-facing name for Miso Labs' Miso TTS 8B release, an open-weights English text-to-speech model designed for expressive, conversational speech. Developers and researchers can use it to generate emotionally varied, natural-sounding speech at low latency; the project publishes a 110 ms latency claim for voice-agent workflows. The model supports audio context prompting, which makes it suitable for voice continuation and one-shot voice cloning. It is primarily a tool for evaluation and experimentation in local TTS environments, not a lightweight browser voice toy.

Application scenarios

Voice-agent latency research
Developers can test Miso TTS 8B for real-time conversational agents, evaluating whether the 110 ms latency claim holds in their own workflows.
Local open-weights TTS
Users can download the model repository and Hugging Face weights to run inference locally on their own hardware, ideal for offline or privacy-sensitive projects.
One-shot voice cloning
The model can generate speech conditioned on a short audio prompt, enabling voice continuation or cloning from a single sample.
Expressive conversational speech
Content creators can produce emotionally varied, natural-sounding English narration for podcasts, audiobooks, or interactive dialogue.
Quality and safety checks
Researchers and developers can inspect the model’s limitations, watermarking notes, and responsible voice-cloning boundaries before production deployment.
Live translation drafts
The site mentions a "Live translate EN -> ES" feature, suggesting real-time translation with streaming transcript output for multilingual voiceover workflows.

Core Features

Open weights and inference code
The Miso TTS 8B model weights and inference code are publicly available for download and local use.
Expressive English speech
The model focuses on English speech quality, emotion, pacing, and conversational delivery rather than broad multilingual support.
Audio context prompting
Miso TTS 8B can condition on prompt audio, enabling voice continuation and one-shot voice cloning from a given sample.
Low-latency generation
The system is built for very low-latency voice-agent research, with a published 110 ms latency claim for real-time applications.
Voice Studio Session
Users can convert script to expressive audio using a dedicated studio interface, with a 48 kHz preview and timeline editing.
Realtime voiceover workflow
The platform supports live translation (EN to ES), streaming captions, and publish-ready audio output for creator workflows.
Watermarking and safety notes
The release documents its limitations clearly: English-only generation, large local hardware requirements, and boundaries for responsible voice cloning, alongside notes on watermarking.

Target users

Miso One is aimed at developers, AI researchers, and voice-agent engineers who need an open-weights, expressive text-to-speech model for local experimentation or production testing. Content creators and voiceover professionals interested in low-latency, emotionally varied English speech generation will also find value, especially those working with live translation or streaming audio workflows.

How to use Miso One?

To get started, visit the Miso One website and try the free demo to test expressive speech generation. For local use, download the Miso TTS 8B model weights and inference code from the official repository or Hugging Face page, then set up the checkpoint on a GPU-equipped machine (8B parameters require significant local hardware). Use the Voice Studio Session to convert script to audio with timeline editing, or leverage the realtime voiceover workflow for live translation and streaming captions. For voice cloning, provide a short audio prompt to condition the model for voice continuation.

Performance review

Miso One delivers on its promise of expressive, low-latency English speech generation, with the open-weights approach making it a strong candidate for developers who need local control over TTS models. The 110 ms latency claim is notable for voice-agent research, though real-world performance will depend on hardware setup. The one-shot voice cloning and audio context features add practical value for voice continuation tasks, but the English-only limitation and large GPU requirements narrow its immediate audience. Overall, it is a capable tool for those willing to invest in local infrastructure and evaluation workflows, rather than a plug-and-play consumer product.
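Because real-world latency depends on hardware, it is worth measuring rather than assuming. The sketch below times how long a streaming synthesizer takes to return its first audio chunk. It assumes the 110 ms figure refers to time to first audio, which is not confirmed here, and `fake_stream_synthesize` is a hypothetical stand-in, not the Miso inference API; swap in whatever streaming call the official inference code exposes.

```python
import statistics
import time
from typing import Callable, Iterable, Iterator, List


def fake_stream_synthesize(text: str) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; NOT the Miso API."""
    time.sleep(0.01)  # pretend the model needs ~10 ms before the first chunk
    for _ in range(3):
        yield b"\x00" * 960  # dummy audio chunk


def time_to_first_chunk_ms(
    synthesize: Callable[[str], Iterable[bytes]],
    text: str,
    runs: int = 5,
) -> List[float]:
    """Return per-run latency (ms) from request start to the first audio chunk."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        stream = iter(synthesize(text))
        next(stream)  # blocks until the first chunk is available
        latencies.append((time.perf_counter() - start) * 1000.0)
        for _ in stream:  # drain the rest so the next run starts clean
            pass
    return latencies


samples = time_to_first_chunk_ms(fake_stream_synthesize, "Hello there.", runs=5)
print(f"median: {statistics.median(samples):.1f} ms, worst: {max(samples):.1f} ms")
```

Reporting the median together with the worst case gives a fairer picture than a single run, since the first call often includes one-off warm-up costs.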
