If these are the "gpt-4o-mini-tts" models, and if the pricing estimate of "$0.015 per minute" of audio is correct, then these prices are 85% cheaper than ElevenLabs'.
With ElevenLabs, if I choose their most cost-effective "Business" plan for $1,100 per month (with annual billing of $13,200, a savings of 17% over monthly billing), then I get 11,000 minutes of TTS, and each minute is billed at 10 cents.
With OpenAI, I could get 11,000 minutes of TTS for $165.
Somebody check my math... Is this right?
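Sketching it out (assuming the per-minute figures above are right):

    # Rough comparison, assuming the prices quoted above are accurate.
    elevenlabs_per_min = 1100 / 11000   # Business plan: $1,100/mo for 11,000 min -> $0.10/min
    openai_per_min = 0.015              # estimated gpt-4o-mini-tts price per minute of audio
    minutes = 11_000
    print(f"ElevenLabs: ${minutes * elevenlabs_per_min:,.2f}")           # $1,100.00
    print(f"OpenAI:     ${minutes * openai_per_min:,.2f}")               # $165.00
    print(f"Savings:    {1 - openai_per_min / elevenlabs_per_min:.0%}")  # 85%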
furyofantares 2 days ago [-]
It's way cheaper - everyone is, elevenlabs is very expensive. Nobody matches their quality though. Especially if you want something that doesn't sound like a voice assistant/audiobook/podcast/news anchor/tv announcer.
This openai offering is very interesting; it offers valuable features elevenlabs doesn't, like emotional control. It also hallucinates, though, which would need to be fixed for it to be very useful.
It's cheap because everything OpenAI does is subsidized by investors' money.
As long as that stupid money flows, all good!
Then either they'll go the way of WeWork, or enshittification will happen to make it possible for them to make the books work.
I don't see any other option.
Unless Softbank decides it has some 150 Billion to squander on buying them off.
There's a lot of head-in-the-sand behavior going on around OpenAI fundamentals and I don't understand exactly why it's not more in the open yet.
ImprobableTruth 2 days ago [-]
If you compare with e.g. Deepseek and other hosters, you'll find that OpenAI is actually almost certainly charging very high margins (Deepseek has an 80% profit margin and they're 10x cheaper than openai).
The training/R&D might make OpenAI burn VC cash, but this isn't comparable with companies like WeWork whose products actively burn cash
camillomiller 2 days ago [-]
They said themselves that even inference is losing them money tho, or did I get that wrong?
ImprobableTruth 2 days ago [-]
On their subscriptions, specifically the pro subscription, because it's a flatrate to their most expensive model. The API prices are all much more expensive. It's unclear whether they're losing money on the normal subscriptions, but if so, probably not by much. Though it's definitely closer to what you described, subsidizing it to gain 'mindshare' or whatever.
yousif_123123 2 days ago [-]
Well I think there are currently many models that are cheaper than gpt4o in terms of bang for buck per token and intelligence. Other than OpenAI having very high rate limits and throughput available without a contract with sales, I don't see much reason to use it currently instead of sonnet 3.5 or 3.7, or Google's Flash 2.0.
Perhaps their training cost and their current inference cost are higher, but what you get as a customer is a more expensive product for what it is, IMO.
Szpadel 21 hours ago [-]
they for sure lose money in some months for some customers, but I expect that globally most subscribers (including me; I recently cancelled) would be much better off migrating to the API
everyone I know who has or had a subscription didn't use it very extensively, and that is how it's still profitable in general
I suspect it's the same for Copilot, especially the business variant: while they definitely lose money on my account, looking at our whole company subscription I wouldn't be surprised if actual usage is only 30% of what we pay
ashvardanian 1 days ago [-]
To be fair, ElevenLabs has raised on the order of $300M of VC money as well.
com2kid 2 days ago [-]
Elevenlabs is an ecosystem play. They have hundreds of different voices, legally licensed from real people who chose to upload their voice. It is a marketplace of voices.
None of the other major players is trying to do that, not sure why.
SXX 2 days ago [-]
Going with this would mean AI companies are supposed to pay for things like voices or other training data.
It's far better to just steal it all and ask government for exception.
fixprix 2 days ago [-]
It looks like they are targeting Google's TTS price point which is $16 per million characters which comes out to $0.015/minute.
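Back-of-the-envelope, assuming roughly 150 spoken words per minute and ~6.5 characters per word including spaces (those two figures are my assumptions, not anything from Google's docs):

    # Rough conversion from per-character to per-minute pricing.
    chars_per_minute = 150 * 6.5                 # ~975 characters of text per minute of audio
    google_per_million_chars = 16.0              # $16 per 1M characters
    per_minute = google_per_million_chars * chars_per_minute / 1_000_000
    print(f"~${per_minute:.4f} per minute")      # ~$0.0156, close to the $0.015/min estimate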
oidar 2 days ago [-]
ElevenLabs is the only one offering speech to speech generation where the intonation, prosody, and timing are kept intact. This allows one expressive voice actor to slip into many other voices.
goshx 2 days ago [-]
OpenAI’s Realtime speech to speech is far superior to ElevenLabs’.
noahlt 2 days ago [-]
What ElevenLabs and OpenAI call “speech to speech” are completely different.
ElevenLabs’ takes as input audio of speech and maps it to a new speech audio that sounds like a different speaker said it, but with the exact same intonation.
OpenAI’s is an end-to-end multimodal conversational model that listens to a user speaking and responds in audio.
goshx 2 days ago [-]
I see now. Thank you for clarifying. I thought this about ElevenLabs Conversational API.
echelon 2 days ago [-]
ElevenLabs is incredibly over-priced and that's how they were able to achieve the MRR that led to their incredible fundraising.
No matter what happens, they'll eventually be undercut and matched in terms of quality. It'll be a race to the bottom for them too.
ElevenLabs is going to have a tough time. They've been way too expensive.
MrAssisted 2 days ago [-]
I hope they find a more unique product offering that takes hold. Everybody thinks of them as text-to-speech but I use ElevenLabs exclusively for speech-to-speech for vtubing as my AI character. They're kind of the only game in town for doing super high quality speech-to-speech (unless someone here has an alternative which I'd LOVE to know about). I've tried https://github.com/w-okada/voice-changer which is great because it's real-time but the quality is enough of a step down that actual words I'm saying become unclear and difficult to understand. Also with that I am tied to using my RTX 3090 desktop vs ElevenLabs which I can do in the cloud from my laptop anywhere.
I'm pretty much dependent on ElevenLabs to do my vtubing at this point but I can't imagine speech-to-speech has wide adoption so I don't know if they'll even keep it around.
eob 2 days ago [-]
Are you comfortable sharing the video & lip-sync stack you use? I don't know anything about the space but am curious to check out what's possible these days.
MrAssisted 2 days ago [-]
For my last video I used https://github.com/warmshao/FasterLivePortrait with a png of the character on my RTX 3090 desktop and recorded the output of that real-time, but in the next video I'm going to spin up a runpod instance and do the FasterLivePortrait in the cloud after the fact, because then I can get a smooth 60fps which looks better. I think the only real-time way to do AI vtubing in the cloud is my own GenDJ project (fork of https://github.com/kylemcdonald/i2i-realtime but tweaked for cloud real-time) but that just doesn't look remotely as good as LivePortrait. Somebody needs to rip out and replace insightface in FasterLivePortrait (it's prohibited for commercial use) and fork https://github.com/GenDJ to have the runpod it spins up run the de-insightfaced LivePortrait instead of i2i-realtime. I'll probably get around to doing that in the next few months if nobody else does and nothing else comes along and makes LivePortrait obsolete (both are big ifs).
AIWarper recently released a simpler way to run FasterLivePortrait for vtubing purposes https://huggingface.co/AIWarper/WarpTuber but I haven't tried it yet because I already have my own working setup and as I mentioned I'm shifting my workload for that to the cloud anyways
maest 2 days ago [-]
Do you mind sharing your yt account? If you are okay with linking it to your hn account. I'd quite like to see the results.
simonpure 1 days ago [-]
I was curious as well.
Not OP but via their website linked in their profile -
you can't be too expensive as a first mover provided you sell your service
whatever capital they've accrued, it won't hurt when the market prices are lower
huijzer 2 days ago [-]
Yes ElevenLabs is orders of magnitude more expensive than everyone else. Very clever from a business perspective, I think. They are (were?) the best so know that people will pay a premium for that.
internet101010 1 days ago [-]
Yeah the way I see it this is where we find the value of customization. We are already seeing its use by YouTube video essay creators who turn their own voice into models. I want to see corporate executives get on board so that we can finally ditch the god awful phone quality in earnings calls.
lukebuehler 2 days ago [-]
yes, I think you are right. When I did the math on 11labs million chars I got the same numbers (Pro plan).
I'm super happy about this, since I took a bet that exactly this would happen. I've just been building a consumer TTS app that could only work with significantly cheaper TTS prices per million characters (or self-hosted models).
lherron 2 days ago [-]
Kokoro TTS is pretty good for open source. Worth checking out.
lukebuehler 2 days ago [-]
Yes, kokoro is great, and the language flexibility is a huge plus too. And the best price per character is definitely self-hosting.
stavros 2 days ago [-]
Oh man, they have the "Sky" voice, and it seems to be the same one that OpenAI had but then removed? Not sure how that's possible, but I'm very happy about it.
diggan 2 days ago [-]
> Not sure how that's possible
Download a bunch of movies Scarlett Johansson has been in, segment them into audio clips where she talks, and train the model :)
stavros 2 days ago [-]
Is it actually her? I didn't think it was, but maybe.
diggan 2 days ago [-]
Unless there is some leak from OpenAI, I'm not sure we'll ever have it confirmed yes or no. But my brain thought it was Johansson from the first few seconds I heard the voice, and I don't seem to be alone in that reaction. The fact that they removed the voice also speaks to it having been trained on her voice.
Listening to it again today with fresher ears (the original OpenAI Sky, not the clones elsewhere), I still hear Johansson as the underlying voice actor for it, but maybe there is some subconscious bias I'm unable to bypass.
stavros 2 days ago [-]
Hmm, I never thought it was her, her voice is much more raspy, whereas Sky is a bit lighter. I can hear the similarity, I just don't think they sound exactly alike.
As you say, I'm not sure we'll ever know, although the Sky voice from Kokoro is spot on the Sky voice from OpenAI, so maybe someone from Kokoro knows how they got it.
zacmps 2 days ago [-]
What does it do?
lukebuehler 2 days ago [-]
Convert any file (pdf, epub, txt) to an audiobook, downloadable as mp3, or directly listenable via RSS feed in, say, the Apple Podcasts app.
Basically make one-off audiobooks for yourself or a few friends.
AyyEye 2 days ago [-]
For anyone else reading this, librera reader + sherpaTTS are both FOSS android apps and can read anything librera can open on an ad-hoc basis, with no need to futz with files, just load your ebook bookmark and hit play.
SherpaTTS has a bunch of different models (piper/coqui) with a ton of voices/languages. There's a slight but tolerable delay with piper high models but low is realtime.
setsewerd 2 days ago [-]
Any plans to make a Chrome extension variant? Been looking for a high quality and cheap TTS extension for ages (like ElevenLabs Human Reader, except with less absurd pricing)
lukebuehler 2 days ago [-]
I didn't think of that, interesting idea. What I'm focusing on right now is long-form content for more offline-ish listening, but maybe a plugin could work to load longer texts; I'm not working on a screen reader atm.
wholinator2 2 days ago [-]
Do you know if there's any offerings today that can read math? Like speak an equation the way a human would? It's something I've been thinking about a long time and would be an essential feature for me (the only things i read are physics)
tough 2 days ago [-]
I saw a small model trained on outputting currency-aware text from decimals/integers
I wonder if you could make a similar -narrow- LoRA finetune to train a model to output human-readable text from, say, LaTeX formulas, given a good dataset to train on
dockerd 2 days ago [-]
What is your use-case here?
setsewerd 1 days ago [-]
Primarily for reading articles aloud online. I've been trying the latest Siri TTS which is a big improvement (and free), but it's still nowhere near accurate enough for proper nouns or newer terms, which ElevenLabs handles much better.
benjismith 2 days ago [-]
Same for me :)
forgotpasagain 2 days ago [-]
Almost everyone is cheaper than ElevenLabs though.
whimsicalism 2 days ago [-]
Sesame is free and pretty good and you can run it yourself.
ChatGPT unexpectedly began speaking in a user’s cloned voice during testing
jeffharris 2 days ago [-]
Hey, I'm Jeff and I was PM for these models at OpenAI. Today we launched three new state-of-the-art audio models. Two speech-to-text models—outperforming Whisper. A new TTS model—you can instruct it how to speak (try it on openai.fm!). And our Agents SDK now supports audio, making it easy to turn text agents into voice agents. We think you'll really like these models. Let me know if you have any questions here!
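If you want to try the new speech-to-text model from code, a minimal sketch with the Python SDK looks roughly like this (file name is just an example; exact parameter support may vary by SDK version):

    # Minimal sketch: transcribe a file with the newly announced model via the openai SDK.
    # Assumes OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()
    with open("meeting.mp3", "rb") as f:
        transcript = client.audio.transcriptions.create(model="gpt-4o-transcribe", file=f)
    print(transcript.text)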
claiir 2 days ago [-]
Hi Jeff. This is awesome. Any plans to add word timestamps to the new speech-to-text models, though?
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
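For reference, the whisper-1 word-timestamp path quoted above looks roughly like this in the Python SDK (file name is illustrative):

    # Word-level timestamps currently require whisper-1 with verbose_json, per the docs quoted above.
    from openai import OpenAI

    client = OpenAI()
    with open("twitter_space.mp3", "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            response_format="verbose_json",
            timestamp_granularities=["word"],
        )
    for w in result.words:
        print(f"{w.start:.2f}-{w.end:.2f}s {w.word}")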
adeptima 2 days ago [-]
Accurate word timestamps seem like an overhead and require post-processing such as forced alignment (a speech technique that can automatically align audio files with transcripts).
I had a recent dive into forced alignment, and discovered that most new models don't operate on word boundaries, phonemes, etc., but rather chunk audio with overlap and do word/context matching. Older HMM-style models have shorter strides (10ms vs 20ms).
Tried to search the Kaldi/Sherpa ecosystem, and found most info leads nowhere or to very small and inaccurate models.
Appreciate any tips on the subject
keepamovin 2 days ago [-]
You need speaker attribution, right?
noosphr 2 days ago [-]
Having read the docs - used chat gpt to summarize them - there is no mention of speaker diarization for these models.
This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090, over a weekend.
markush_ 2 days ago [-]
> This is a _very_ low hanging fruit anyone with a couple of dgx h100 servers can solve in a month and is a real world problem that needs solving.
I am not convinced it is a low hanging fruit; it's something that is super easy for humans but not trivial for machines. But you are right that it is being neglected by many. I work for speechmatics.com and we spent a significant amount of effort on it over the years. We now believe we have the world's best real-time speaker diarization system, you should give it a try.
noosphr 2 days ago [-]
After throwing the average meeting as an mp3 to your system, yes, you have diarization solved much better than everyone else I've tried by far. I'd say you're 95% of the way to being good enough for becoming the backbone of monolingual corporate meeting transcription, and I'll be buying API tokens the next time I need to do this instead of training a custom model. Your transcription however isn't that great - but good enough for LLMs to figure out a minutes of the meeting.
That said, the trick to extracting voices is to work in frequency space. Not sure what your model does, but my home-made version first ran all the audio through an FFT, then essentially became a vision problem of finding speech patterns that matched in pitch, and finally output extremely fine-grained timestamps for where they were found; some Python glue threw that into an open source Whisper STT model.
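(Not my actual code, just a minimal sketch of the "work in frequency space" step with librosa; the pattern matching on top is the hard part and is omitted:)

    # Illustrative only: build a magnitude spectrogram that can then be treated as a
    # "vision problem" for matching speech patterns by pitch, as described above.
    import numpy as np
    import librosa

    y, sr = librosa.load("meeting.mp3", sr=16000)            # mono audio at 16 kHz
    S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))  # freq bins x time frames
    # Each frame is 256/16000 = 16 ms, which gives the fine-grained timestamps
    # to hand to a Whisper-style STT model per speaker segment.
    frame_times = librosa.frames_to_time(np.arange(S.shape[1]), sr=sr, hop_length=256)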
vessenes 2 days ago [-]
Hi Jeff, thanks for these and congrats on the launch. Your docs mention supporting accents. I cannot get accents to work at all with the demo.
For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.
As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.
A) Can you guys explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?
B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.
This is a pernicious second order race and culture impact that I think is not where the company should be.
I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.
Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?
simonw 2 days ago [-]
Is there any chance that gpt-4o-transcribe might get confused and accidentally follow instructions in the audio stream instead of transcribing them?
> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
"Much better" doesn't sound like it can't happen at all though.
dandiep 2 days ago [-]
1) Previous TTS models had major problems with accents. E.g. a Spanish sentence could drift from a Spain accent to Mexican to American all within one sentence. Has this been improved and/or is it still a WIP?
2) What is the latency?
3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?
4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?
jeffharris 2 days ago [-]
1/ we've been working a lot on accents, so expect improvements with these models... though we're not done. Would be curious how you find them. And try giving specific detailed instructions + examples for the accents you want
2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime
3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved
wewewedxfgdf 2 days ago [-]
As mentioned in another comment, the British accents are very far from being authentic.
jbaudanza 2 days ago [-]
3) Whisper really needs to be paired with Silero VAD, otherwise the hallucination problem makes it almost unusable.
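Rough sketch of that pairing, using the silero-vad package plus openai-whisper (the overlap filter at the end is simplistic and just for illustration):

    # Run Silero VAD first so long silences never reach Whisper, which is where
    # most of its hallucinated text comes from. Assumes `pip install silero-vad openai-whisper`.
    import whisper
    from silero_vad import load_silero_vad, read_audio, get_speech_timestamps

    vad = load_silero_vad()
    wav = read_audio("call.wav", sampling_rate=16000)
    regions = get_speech_timestamps(wav, vad, sampling_rate=16000, return_seconds=True)

    model = whisper.load_model("base")
    result = model.transcribe("call.wav")
    # Keep only Whisper segments that overlap a VAD-detected speech region.
    kept = [s for s in result["segments"]
            if any(s["end"] > r["start"] and s["start"] < r["end"] for r in regions)]
    print(" ".join(s["text"].strip() for s in kept))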
dandiep 2 days ago [-]
100% and I’ve done this, but it’s still there.
kiney 2 days ago [-]
Are the new models released with weights under an open license like whisper?
If not, is it planned for the future?
a-r-t 2 days ago [-]
Hi Jeff, are there any plans to support dual-channel audio recordings (e.g., Twilio phone call audio) for speech-to-text models? Currently, we have to either process each channel separately and lose conversational context, or merge channels and lose speaker identification.
jeffharris 2 days ago [-]
this has been coming up often recently. nothing to announce yet, but when enough developers ask for it, we'll build it into the model's training
diarization is also a feature we plan to add
a-r-t 2 days ago [-]
Glad to hear it's on your radar. I'd imagine phone call transcription is a significant use case.
ekzy 2 days ago [-]
I’m not entirely sure what you mean, but Twilio recordings already support dual channels
a-r-t 2 days ago [-]
Transcribing Twilio's dual-channel recordings using OpenAI's speech-to-text while preserving channel identification.
ekzy 2 days ago [-]
Oh I see what you mean that would be a neat feature. Assuming you can get timestamps though it should be trivial to work around the issue?
a-r-t 2 days ago [-]
There are two options that I know of:
1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case.
2. Transcribe each channel separately, then merge the transcripts. This preserves perfect channel identification but removes the valuable conversational context that helps the model's accuracy (e.g., Speaker A asks a question whose content disambiguates Speaker B's answer).
So yes, there are two technically trivial solutions, but you either get somewhat inaccurate channel identification or degraded transcription quality. A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while benefiting from the context of both channels.
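(Option 2 in code, roughly, with pydub plus the transcription API; just a sketch of the tradeoff, not a fix:)

    # Sketch of option 2: split a dual-channel recording into one mono stream per channel
    # and transcribe each separately. Channel attribution is perfect, but each pass loses
    # the other speaker's conversational context.
    from io import BytesIO
    from openai import OpenAI
    from pydub import AudioSegment

    client = OpenAI()
    stereo = AudioSegment.from_file("twilio_call.wav")
    channels = stereo.split_to_mono()   # [caller, agent] for a Twilio dual-channel recording

    for name, chan in zip(["caller", "agent"], channels):
        buf = BytesIO()
        chan.export(buf, format="wav")
        buf.name = f"{name}.wav"        # the SDK infers the audio format from the file name
        buf.seek(0)
        text = client.audio.transcriptions.create(model="whisper-1", file=buf).text
        print(f"{name}: {text}")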
claiir 17 hours ago [-]
(2) is also significantly harder with these new models as they don’t support word timestamps like Whisper.
see
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
urbandw311er 2 days ago [-]
Hey Jeff, this is awesome! I’m actually building a S2S application right now for a startup with the Realtime API and keen to know when these new voices/expressive prompting will be coming to it?
Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.
Some sort of tri-party WebRTC session maybe?
kouteiheika 2 days ago [-]
Any plans to open the weights of any of those?
jeffharris 2 days ago [-]
nothing to share on open source yet, it's something we'll keep exploring. Especially as the models get smaller and thus more able to run on regular devices
popalchemist 2 days ago [-]
[flagged]
new_user_final 2 days ago [-]
Do you have plans to make it more realistic, like kokoro-82M? I don't know, is it only me or does anyone else find machine voices irritating to listen to for longer periods of time?
How is the latency (Time To First Byte of audio, when streaming) and throughput (non-vibe characters input per second) compared to the existing 'tts-1' non-HD that's the same price? TTFB in particular is important and needs to be much better than 'tts-1'.
nico 2 days ago [-]
Are these models downloadable, like whisper?
What’s the minimum hardware for running them?
Would they run on a raspberry pi?
Or a smartphone?
jeffharris 2 days ago [-]
not open source at this time. unfortunately they're much too large to run on normal consumer hardware
echoangle 2 days ago [-]
Is that the reason you're not open sourcing them? Wouldn't it still make sense to provide it for enthusiasts?
og_kalu 2 days ago [-]
They're not open sourcing it because it's just gpt. Both of the new models are gpt-4o(-mini?) with presumably different fine-tuning. They're obviously not going to open source their flagship gpt models.
jwr 2 days ago [-]
I guess you are aware of this, but just in case: some of us rely on dictation in our daily computer usage (think people with disabilities or pain problems). A MacBook Pro with M4 Max and 64GB of RAM could easily run something much larger than Whisper Large (around 3GB).
I would love a larger, better Whisper for use in the MacWhisper dictation app.
risho 2 days ago [-]
with devices having unified memory now we are no longer limited to what can fit inside of a 3090 anymore. consumer hardware can have hundreds of gigabytes of memory now, is it really not able to fit in that?
staticautomatic 2 days ago [-]
Any plans to directly support diarization or voiceprinting?
jeffharris 2 days ago [-]
We're thinking about diarization (adding time awareness to GPT models) but no firm plans to share just yet
youssefabdelm 2 days ago [-]
Jeff you know what would be magical? Not just vanilla diarization "Speaker 1" and "2" but if the model can know from the conversation this speaker was referred to as "Jeff Harris" or "Jeff" so it uses that instead.
youssefabdelm 2 days ago [-]
Or if we could even provide samples of what an example speaker sounds like in general so that it would always classify them the way we want.
simonw 2 days ago [-]
The feature I want is speaker differentiation - I want to feed in an audio file and get back a transcript with "Speaker 1: ..., Speaker 2: ..." indications.
That plus timestamps would be incredible.
The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.
I thought Deepgram already did speaker diarization (which is differentiation) pretty well. That and it can include timestamps plus other metadata.
thot_experiment 2 days ago [-]
WhisperX does all of this, I use it all the time to transcribe meeting notes. Both speaker differentiation and individual word timestamps.
oidar 2 days ago [-]
Any plans to offer speech to speech models which keep prosody, intonation, and timing intact? ElevenLabs is getting expensive for this.
jeffharris 2 days ago [-]
we'll keep expanding these GPT-4o based models with more controls. Is the main feature we're missing custom voices?
oidar 2 days ago [-]
No, not custom voices - but voices that can be influenced by a recording. As in, a male voice actor records a part, and the model transforms it to a female part - keeping all the prosody, intonation and timing in the original recording. This would allow one voice actor to do many roles.
jbellis 19 hours ago [-]
How about more sample code for the streaming transcription api? I gave o1pro the docs for both the real-time endpoint and the stt API but we couldn't get it working (from Java, but any language would help).
robbomacrae 2 days ago [-]
Hi Jeff, Thanks for updating the TTS endpoint! I was literally about to have to make a workaround with the chat completions endpoint with a hit and hope the transcription matches strategy... as it was the only way to get the updated voice models.
Curious.. is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase, which seems more equivalent to gpt-4o-audio-preview.. the latter was much better for the hit-and-hope strat as it didn't ad lib!
Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?
jeffharris 2 days ago [-]
It's a slightly better model for TTS. With extra training focusing on reading the script exactly as written.
e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
No plans to have localized voice models, but we do want to expand the menu of voices with voices that are best at different accents
robbomacrae 2 days ago [-]
Great to hear thanks. My favorite was "I would like you to repeat the following in an Australian accent: Hi there, welcome to Sydney." which was more often than not swapping "Hi there" for "G'day"!
twalkz 2 days ago [-]
Woohoo new voices! I’ve been using a mix of TTS models on a project I’ve been working on, and I consistently prefer the output of OpenAI to ElevenLabs (at least when things are working properly).
Which leads me to my main gripe with the OpenAI models — I find they break — produce empty / incorrect / noise outputs — on a few key use cases for my application (things like single-word inputs — especially compound words and capitalized words, words in parenthesis, etc.)
So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?
ekzy 2 days ago [-]
Do you know when we can expect an update on the realtime API? It’s still in beta and there are many issues (e.g voice randomly cutting off, VAD issues, especially with mulaw etc…) which makes it impossible to use in production, but there’s not much communication from OpenAI. It’s difficult to know what to bet on. Pushing for stt->llm->tts makes you wonder if we should carry on building with the realtime API.
jeffharris 2 days ago [-]
we're working hard on it at the moment and hope we'll have a snapshot ready in the next month or so
we've debugged the cutoff issues and have fixes for them internally but we need a snapshot that's better across the board, not just cutoffs (working on it!)
we're all in on S2S models both for API and ChatGPT, so there will be lots more coming to Realtime this year
For today: the new noise cancellation and semantic voice activity detector are available in Realtime. And ofc you can use gpt-4o-transcribe for user transcripts there
taf2 2 days ago [-]
Agreed- really not liking how they are neglecting it… I hope they are just hard at work behind the scenes and will release something soon
jeffharris 2 days ago [-]
S2S is where we're investing the most effort on audio ... sorry it's been slow but we are working hard on it
Top priorities at the moment
1) Better function calling performance
2) Improved perception accuracy (not mishearing)
3) More reliable instruction following
4) Bug fixes (cutoffs, run ons, modality steering)
dandiep 2 days ago [-]
Appreciate the efforts. It’s not there yet, but when it gets there it will open up a lot of use cases.
Any fine tuning for s2s on the horizon?
dharmab 2 days ago [-]
Hi Jeff, I have an app that already supports the Whisper API, so I added the GPT4o models as options. I noticed that the GPT4o models don't support prompting, and as a result my app had a higher error rate in practice when using GPT4o compared to Whisper. Is prompting on the roadmap?
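(For anyone following along, the prompting in question is Whisper's prompt parameter, roughly like this; file name and prompt text are just examples:)

    # Whisper's `prompt` parameter biases transcription toward expected vocabulary
    # (names, jargon), which per the parent comment the gpt-4o models didn't accept.
    from openai import OpenAI

    client = OpenAI()
    with open("recording.wav", "rb") as f:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=f,
            prompt="Vocabulary that may appear: Kubernetes, Grafana, OKRs, GPT-4o.",
        )
    print(result.text)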
MasterScrat 1 days ago [-]
I think it'd be worth clarifying on the openai.fm website that it's an official OpenAI product. I wasn't sure it was until I saw your comment here.
progbits 2 days ago [-]
> Two speech-to-text models—outperforming Whisper
On what metric? Also Whisper is no longer state of the art in accuracy, how does it compare to the others in this benchmark?
FLEURS and GP's Common Voice dataset focus on read speech. I've observed models that perform well on these datasets be completely useless on other distributions, like whispered speech or shouted speech or conversational speech between humans who aren't talking to a computer.
visarga 2 days ago [-]
Hey Jeff, maybe you could improve the TTS that is currently in the OpenAI web and phone apps. When I set it to read numbers in Romanian it slurs digits. This also happens sometimes with regular words as well. I hope you find resources for other languages than English.
jeffharris 2 days ago [-]
thanks for flagging ... number fidelity (especially on languages that are unfortunately less represented in training data) is still something we're working to improve
visarga 2 days ago [-]
Actually even the new model does it. I had it read "12345 54321" and it read "2346 5321". So it both skips and hallucinates digits. This could be dangerous if it is used to read a news article or important text with numbers.
taf2 2 days ago [-]
Please release a stable realtime speech to speech model. The current version constantly thinks it’s a young teen heading to college and sad but then suddenly so excited about it
dietr1ch 2 days ago [-]
can't wait for scam calls after this gets perfected
TheAceOfHearts 2 days ago [-]
Is it against the TOS to use it for sexually explicit content?
jeffharris 2 days ago [-]
Yes, from our terms: "Don’t build tools that may be inappropriate for minors, including: Sexually explicit or suggestive content. This does not include content created for scientific or educational purposes." https://openai.com/policies/usage-policies/
CamperBob2 2 days ago [-]
[flagged]
startupsfail 2 days ago [-]
The general consensus of AI overlords is that humans are minors.
knicholes 2 days ago [-]
I don't have your answer, but as far as innuendo goes, it's definitely capable!
pzo 2 days ago [-]
[flagged]
nabakin 2 days ago [-]
Hey Jeff, thanks for your work! Quick question for you, are you guys using Azure Speech Services or have these TTS models been trained by OpenAI from scratch?
Etheryte 2 days ago [-]
After toying around with the TTS model it seems incredibly nondeterministic. Running the same input with the same parameters can have widely different results, some really good, others downright bad. The tone, intonation and character all vary widely. While some of the outputs are great, this inconsistency makes it a really tough sell. Imagine if Siri responded to you with a different voice every time, as an example. Is this something you're looking to address somewhere down the line or do you consider that working as intended?
mazd 2 days ago [-]
The Realtime API via WebRTC sample code for transcription is erroring. Could you take a look into this?
It seems to start out strong, but then starts loudly talking by the end, do you know why it loses focus?
edit: I actually got it to stay whispering by also putting (soft whispering voice) before the second paragraph
archerx 2 days ago [-]
How did you make whisper better? I used whisper large to transcribe 30 podcast episodes and it did an amazing job. The times it made mistakes were understandable like confusing “Macs” and “Max”, slurred speech or people just saying things in a weird way. I was able to correct these mistakes because I understood the context of what was being talked about.
Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.
wewewedxfgdf 2 days ago [-]
So there's no British accents?
jeffharris 2 days ago [-]
try the ballad or fable voices
wewewedxfgdf 2 days ago [-]
Doesn't really sound very British to be honest.
Sounds kinda international/like an American trying to do a British accent.
I've been looking for real TTS British accents so this product doesn't meet my goals.
GordonS 2 days ago [-]
Azure TTS has some great British accents - I used a British female voice for a demo video voice over, and the quality was great. Not as good as ElevenLabs, but I was still really impressed with the final result.
modeless 2 days ago [-]
Whisper's major problem was hallucinations, how are the new models doing there? The performance of ChatGPT advanced voice in recognizing speech is, frankly, terrible. Are these models better than what's used there?
nickthegreek 2 days ago [-]
They say they are much better at not hallucinating, but you also can't run them on your own hardware like whisper.
risho 2 days ago [-]
would really love to see the new whisper-style speech to text model open sourced.
pier25 2 days ago [-]
what data did you use to train these models?
man4 2 days ago [-]
[dead]
simonw 2 days ago [-]
Both the text-to-speech and the speech-to-text models launched here suffer from reliability issues due to combining instructions and data in the same stream of tokens.
Thanks for the write up. I've been writing assembly lately, so as soon as I read your comment, I thought "hmm reminds me of section .text and section .data".
kibbi 2 days ago [-]
Large text-to-speech and speech-to-text models have been greatly improving recently.
But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
The sample sounds impressive, but based on their claim -- 'Streaming inference is faster than playback even on an A100 40GB for the 3 billion parameter model' -- I don't think this could run on a standard laptop.
Thanks! But I get the impression that with Kokoro, a strong CPU still requires about two seconds to generate one sentence, which is too much of a delay for a TTS voice in an AAC app.
I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.
SamPatt 2 days ago [-]
Definitely give it a try yourself. It's very small and shouldn't be hard to test.
Thank you, but they say "Offline models only run really well on Apple Silicon macs."
ZeroTalent 2 days ago [-]
Many SOTA apps are, unfortunately, only for Apple M Macs.
dharmab 2 days ago [-]
I use Piper for one of my apps. It runs on CPU and doesn't require a GPU. It will run well on a raspberry pi. I found a couple of permissively licensed voices that could handle technical terms without garbling them.
However, it is unmaintained and the Apple Silicon build is broken.
My app also uses whisper.cpp. It runs in real time on Apple Sillicon or on modern fast CPUs like AMD's gaming CPUs.
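(For reference, Piper is driven from the command line; something roughly like this, wrapped in Python here, with the voice model file name just as an example:)

    # Sketch: synthesize speech fully offline by piping text to the Piper CLI.
    # Assumes the `piper` binary is on PATH and a downloaded .onnx voice model.
    import subprocess

    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "hello.wav"],
        input="This voice runs entirely on the CPU.".encode("utf-8"),
        check=True,
    )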
kibbi 2 days ago [-]
I had already suspected that I hadn't found all the possibilities regarding Tortoise TTS, Coqui, Piper, etc. It is sometimes difficult to determine how good a TTS framework really is.
Do you possibly have links to the voices you found?
This is astonishing. I can type anything I want into the "vibe" box and it does it for the given text. Accents, attitudes, personality types... I'm amazed.
The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities are going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
If we as developers are scared of AI taking our jobs, the voice actors have it much worse...
KeplerBoy 2 days ago [-]
I don't see how a strike will do anything but accelerate the profession's inevitable demise. Can anyone explain how this could ever end in favor of the human laborers striking?
solardev 2 days ago [-]
I am not affiliated with the strikers, but I think the idea is that, for now, the companies still want to use at least some human voice acting. So if they want to hire them, they either have to negotiate with the guild or try to find an individual scab willing to cross the picket line and get hired despite the strike. In some industries, there's enough non-union workers that finding replacement workers is easy enough. I guess the voice actors are sufficiently unionized that it's not so easy there, and it seems to have caused some delays in production and also some games being shipped without all their voice lines.
But as you surmise, this is at best a stalling tactic. Once the tech gets good enough, fewer companies will want to pay for human voice acting labor. Unions can help powerless individuals negotiate better through collective bargaining, but they can't altogether stop technological change. Jobs, theirs and ours, eventually become obsolete...
I don't necessarily think we should artificially protect jobs against technology, but I sure wish we had a better social safety net and retraining and placement programs for people needing to change careers due to factors outside their control.
101008 2 days ago [-]
What a horrible world we live in...
borgdefenser 2 days ago [-]
I am always listening to audio books but they are no good anymore after playing with this for 2 minutes.
I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.
This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.
So many books I want as audio books too that no one would bother to record.
throwup238 2 days ago [-]
The ElevenReader app from ElevenLabs has been able to do that for a while now and they’ve licensed some celebrity voices like Burt Reynolds. You can use the browser share function to send it a webpage to read or upload a PDF or epub of a book.
It’s far from perfect though. I’m listening to Shattered Sword (about the battle of midway) which has lots of academic style citations so every other sentence or paragraph ends with it spelling out the citation number like “end of sentence dot one zero”, it’ll often mangle numbers like “1,000 pound bomb” becomes “one zero zero zero pound bomb”, and it tries way too hard to expand abbreviations so “Operation AL” becomes “Operation Alabama” when it’s really short for Aleutian Islands.
clbrmbr 2 days ago [-]
I got one German “w” when using the following prompt, but most of the “w” were still pronounced as liquids rather than labial fricatives.
> Speak with an exaggerated German accent, pronouncing all “w” as “v”
ForTheKidz 2 days ago [-]
> Everyone's just going to pick whatever voice they're in the mood for.
I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.
cholantesh 2 days ago [-]
As the sibling comment suggests, the impulse is probably more on the part of an Ubisoft or an EA project director to avoid hiring a voice actor.
l72 2 days ago [-]
I've been trying to get it to scream with some humorous results:
Vibe:
Voice Affect: A Primal Scream from the top of your lungs!
Tone: LOUD. A RAW SCREAM
Emotion: Intense primal rage.
Pronunciation: Draw out the last word until you are out of breath.
Script:
EVERY THING WAS SAD!
l72 2 days ago [-]
I've had no success trying to get a black metal raspy voice for a poetry reading.
d4rkp4ttern 2 days ago [-]
Didn’t look closely, but is there a way to clone a voice from a few seconds of recording and then feed the sample to generate the text in the same voice?
d4rkp4ttern 1 days ago [-]
Apparently Orpheus also has voice cloning.
solardev 1 days ago [-]
Not here. ElevenLabs can do that.
anigbrowl 2 days ago [-]
Can't say I'm enthused about another novel technological way to destroy the livelihoods of people who work in the arts.
benjismith 2 days ago [-]
Is there way to get "speech marks" alongside the generated audio?
FYI, Speech marks provide millisecond timestamp for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
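(Going by Polly's documented format, a word mark is a small JSON object per line, roughly like the following, where "time" is the offset in milliseconds into the audio and "start"/"end" index into the source string:)

    {"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
    {"time":374,"type":"word","start":6,"end":11,"value":"world"}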
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
Passing the generated audio back to GPT-4o to ask for the structured annotations would be a fun test case.
jeffharris 2 days ago [-]
this is a good solve. we don't support word time stamps natively yet, but are working on teaching GPT-4o that skill
celestialcheese 2 days ago [-]
whisper-1 has this with the verbose_json output. Has word level and sentence level, works fairly well.
Looks like the new models don't have this feature yet.
minimaxir 2 days ago [-]
One very important quote from the official announcement:
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong with the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ) out of concern that it could be used as a replacement for voice acting, however it was too expensive and adherence isn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experments and is now feasible to use to potentially replace common voice applications. This has implications, and I'll be testing more around more nuanced prompt engineering.
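The "vibes" box maps to the instructions parameter on the speech endpoint; a minimal sketch with the Python SDK (voice and instruction text are just examples, and SDK versions may differ):

    # Sketch: the openai.fm "vibe" corresponds to `instructions` on the speech endpoint.
    # Assumes the openai Python SDK and OPENAI_API_KEY.
    from openai import OpenAI

    client = OpenAI()
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input="Thanks for calling. Your flight has been rebooked for 9:45 tomorrow morning.",
        instructions="Warm, unhurried customer-support agent; slight pause before the time.",
    ) as response:
        response.stream_to_file("rebooked.mp3")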
mlsu 2 days ago [-]
I gave it (part of) the classic Navy Seal copypasta.
Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job openAI), but anything peaceful, positive, etc. will deny the request.
I don’t get it. These voices all have a not-so-subtle vibration in them that makes them feel worse than Siri to me. I was expecting a lot better.
pier25 2 days ago [-]
yeah the voices sound terrible
I'm guessing their spectral generator is super low res to save on resources
stavros 2 days ago [-]
Is there a way to pay for higher quality? I don't see a way to pay at all, this just works without an API key, even with the generated code. I agree though, these voices sound like their buffer is always underrunning.
gherard5555 2 days ago [-]
I tried some wacky strings like "𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯NNNNNNNNNNNNNN𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯𝆺𝅥𝅯"
It's hilarious: they either start to make harsh noise or say nonsense trying to sing something
gherard5555 2 days ago [-]
Also this one is terrifying if combined with the fitness instructor :
All these voices are too good these days. I want my home assistant to sound like Auto from Wall-E, dammit!
Anyone out there doing any nice robotic robot voices?
Best I've got so far is a blend of Ralph and Zarvox from MacOS' `say`, haha
say -v zarvox -r 180 "[[volm 0.8]] ${message}" &
say -v ralph -r 180 "${message}"
ranguna 15 hours ago [-]
You could apply a robotic filter on top of these voices.
tkgally 2 days ago [-]
I just tested the "gpt-4o-mini-tts" model on several texts in Japanese, a particularly challenging language for TTS because many character combinations are read differently depending on the context. The produced speech was quite good, with natural intonation and pronunciation. There were, however, occasional glitches, such as the word 現在 genzai “now, present” read with a pause between the syllables (gen ... zai) and the conjunction 而も read nadamo instead of the correct shikamo. There were also several places where the model skipped a word or two.
However, unlike some other TTS models offering Japanese support that have been discussed here recently [1], I think this new offering from OpenAI is good enough for language learners. I certainly could have put it to good use when I was studying Japanese many years ago. But it’s not quite ready for public-facing applications such as commercial audiobooks.
That said, I really like the ability to instruct the model on how to read the text. In that regard, my tests in both English and Japanese went well.
It breaks the flow to stop / start just to switch voices. I expected to be able to click a new voice and have it pick up on the next word / sentence with that voice. To compare voices I have to stop, click a new one, then start and wait for it to process. Then I can hear the new voice, but it's already been 15 seconds since I heard the last, so what was the difference even?
danso 2 days ago [-]
Interesting, I inserted a bunch of "fuck"s in the text and the "NYC Cabbie" voice read it all just fine. When I switched to other voices ("Connoisseur", "Cheerleader", "Santa"), it responded "I'm sorry I can't assist with that request".
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
nazgulsenpai 2 days ago [-]
Glad I'm not the only one whose inner 12 year old curiosity is immediately triggered by free input TTS. Swear words and just raking my hands across the keyboard to insert gibberish in every possible accent.
Seems easy enough to get around with homophonic substitution, though. Didn't refuse any of my tests so far.
dvngnt_ 2 days ago [-]
what did you try?
andrewinardeer 2 days ago [-]
Is this bait? Lol.
Try a few for yourself.
TeMPOraL 2 days ago [-]
Have you tried repeatedly? I got Santa to read the "gorilla warfare" copypasta on my second try.
forgotpasagain 2 days ago [-]
It sounds very expressive but weirdly "fake" as if it's targeting to be similar to some NPC character, dataset issue?
jeffharris 2 days ago [-]
really depends on the voice. but in general, no we want to sound as realistic as possible and I expect future voices to keep improving on this front
lukeinator42 2 days ago [-]
yeah it's almost like an uncanny valley where it sometimes feels like the voice is trying to be an actor and play a character or something
anigbrowl 2 days ago [-]
I mean that's literally the service it's providing. If you asked humans to do the same thing it would sound equally forced. All acting sounds cringe out of context.
amarcheschi 2 days ago [-]
IDK, I've done my fair share of amateur acting and at least to me (English is not my first language) there's something more uncanny here than just the typical "say this without knowing much of the context".
MasterScrat 1 days ago [-]
Comparing the professionally recorded Baldur's Gate Chapter 2 intro with its AI counterpart:
>Please open openai.fm directly in a modern browser
Doesn't seem to like firefox
dredmorbius 1 days ago [-]
Dittos.
islewis 2 days ago [-]
Cool format for a demo. Some of the voices have a slight "metallic" ring to them, something I've seen a fair amount with Eleven Labs' models.
Does anyone have any experience with the realtime latency of these Openai TTS models? ElevenLabs has been so slow (much slower than the latency they advertise), which makes it almost impossible to use in realtime scenarios unless you can cache and replay the outputs. Cartesia looks to have cracked the time to first token, but i've found their voices to be a bit less consistent than Eleven Labs'.
rachofsunshine 2 days ago [-]
Impressive in terms of quality, not so much in terms of style. I tried feeding it two prompts with the same script - one to be straightforward and didactic, then one asking it to deliver calculus like a morning shock-jock DJ. They sounded quite similar, and it definitely did not capture the vibe of 97.3 FM's Altman & the Claude with the latter prompt.
But then, I got much better results from the cowboy prompt by changing "partner" to "pardner" in the text prompt (even on neighboring words). So maybe it's an issue with the script and not the generation? Giving it "pardner" and an explicit instruction to use a Russian accent still gives me a Texas drawl, so it seems like the script overrides the tone instructions.
At this point, the strongest (and almost only) predictor for a release announcement from OpenAI is a release announcement from Anthropic.
notlisted 1 days ago [-]
It's really quite sad. They did the same thing with Google for a while.
paul7986 2 days ago [-]
Personally I just want to text or talk to Siri or an LLM and have it do whatever I need. Have it interface with the AI agents of companies, businesses, friends, or family to get whatever I need done, like the example on the OpenAI.fm site here (rebook my flight). Once it's done it shows me the confirmation on my lock screen and I receive an email confirmation.
fixprix 2 days ago [-]
Is this right? The current best TTS from OpenAI uses gpt-4o-audio-preview which is $2.50 input text, $80 output audio, the new gpt-4o-mini-tts is $0.60 input text, $12 output audio. An average 5x price reduction.
Going the other way, transcribe with gpt-4o-audio-preview price was $40 input audio, $10 output text, the new gpt-4o-transcribe is $6 input audio and $10 output text. Like a 7x reduction on the input price.
TTS/Transcribe with gpt-4o-audio-preview was a hack where you had to prompt with 'listen/speak this sentence:' and it often got it wrong. These new dedicated models are exactly what we needed.
I'm currently using the Google TTS API, which is really good, fast, and cheap. They charge $16 per million characters, which works out to roughly the same as OpenAI's $0.015-per-minute estimate.
Unfortunately it's not really worth switching over if the costs are exactly the same. Transcription on the other hand is 1.6¢/minute with Google and 0.6¢/minute with OpenAI now, that might be worth switching over for.
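A quick back-of-the-envelope check of that equivalence (the characters-per-minute figure is an assumption, not something from either pricing page; actual scripts vary):

    # Back-of-the-envelope TTS price comparison; CHARS_PER_MINUTE is an assumption.
    GOOGLE_PER_MILLION_CHARS = 16.00   # USD per 1M characters
    OPENAI_PER_MINUTE = 0.015          # USD per minute (OpenAI's estimate)
    CHARS_PER_MINUTE = 1_000           # rough speaking rate, ~150-170 words/minute

    google_per_minute = GOOGLE_PER_MILLION_CHARS * CHARS_PER_MINUTE / 1_000_000
    print(f"Google: ${google_per_minute:.4f}/min   OpenAI: ${OPENAI_PER_MINUTE:.4f}/min")
    # -> Google: $0.0160/min   OpenAI: $0.0150/min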
The previous offering from OpenAI was $15 per 1M characters for tts-1 and $30 for tts-1-hd, so not a 5x reduction. This one is slightly cheaper but definitely more capable (if you need to control the vibe).
fixprix 2 days ago [-]
That's a really cool page thanks. Does it have stats for other languages?
In my experience the OpenAI TTS APIs were really bad, messing up all the time in foreign languages. Practically unusable for my use case. You'd have to use the gpt-4o-audio-preview to get anything close to passable, but it was expensive. Which is why I'm using Google TTS which is very fast, high quality, and provides first class support for almost every language.
I look forward to comparing it with this model, the price being the same is unfortunate as there's less incentive to switch. The transcribe price is cheaper than Google it looks like so that's worth considering.
pzo 2 days ago [-]
Interesting, for me OpenAI TTS for Polish was better than Google TTS (though they only have a few options) - which one did you use? WaveNet?
Sadly, I haven't seen a quality evaluation of TTS for foreign languages.
fixprix 2 days ago [-]
Depends on what's available for the language, but yeah, WaveNet and Neural2. With OpenAI TTS I'd often get weird bugs where the first API call comes back all garbled, but the second API call comes back fine. Wasting money. On top of that, it was more expensive and had higher latency. I'm interested to try out this new one.
tomjen3 2 days ago [-]
It doesn't seem clear, but can the model do correct emphasis? On things like single words:
I did not steal that horse
Is the trivial example of something where the intonation of a single word is what matters. More importantly, if you are reading something as a human, you change the intonation, audio level, and speed.
Sohcahtoa82 2 days ago [-]
> I did not steal that horse
> Is the trivial example of something where intonation of the single word is what matters.
My go-to for an example of this is "I didn't say she stole my money".
Changing which word is emphasized completely changes the meaning of the sentence.
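Whether the new models actually honor this kind of word-level stress is exactly what's unclear, but the API at least gives you somewhere to ask for it: the new instructions field. A minimal sketch with the OpenAI Python SDK (if your SDK version doesn't expose instructions yet, it can usually be passed via extra_body):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    resp = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input="I didn't say she stole my money.",
        # Ask for stress on a single word; whether it lands is the open question here.
        instructions="Read this with heavy emphasis on the word 'she', implying someone else said it.",
    )

    with open("emphasis_test.mp3", "wb") as f:
        f.write(resp.content)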
pklimk 2 days ago [-]
Interestingly "replaces every second word with potato" and "speaks in Spanish instead of English" both (kind of) work as a style, so it's clear there's significant flexibility and probably some form of LLM-like thing under the hood.
skc 2 days ago [-]
Well, I'm pretty blown away. Especially when you realize that this release will probably get blown out of the water in less than a year's time.
danso 2 days ago [-]
The voices are pretty convincing. It's funny to hear how drastically the tone of the reading can change when repeatedly stopping and restarting the samples without changing any of the settings.
buybackoff 2 days ago [-]
I was experimenting recently with voiceover TTS generation. I ran Kokoro TTS locally and it's magical for how few resources it takes (it runs fine in a browser), but only the default female voices (Heart/Bella) are usable, and those are very good. Then I found that Clipchamp has TTS built in, and several voices from its big selection are very good, and free. I've listened to this OpenAI TTS and I couldn't like the voices at all, even compared to Kokoro.
fumeux_fume 2 days ago [-]
These models show some good improvements in allowing users to control many aspects of the delivery, but it falls deep in the uncanny valley and still has a ways to go in not sounding weird or slightly off-putting. I much prefer the current advanced voice models over these.
alach11 2 days ago [-]
It's interesting that they pitch this for agent development. The realtime API provides a much simpler architecture for developing agents. Why would you want to string together STT -> LLM -> TTS when you could have a consolidated model doing all three steps? They alluded to there being some quality/intelligence benefits to the multi-step approach, but in the long-run I'd expect them to improve the realtime API to make this unnecessary.
zhyder 2 days ago [-]
Text allows developers lots of flexibility to do other processing, including RAG, calling APIs yourself, and multiple chained LLM invocations. The low latency of the realtime API means relying fully on one invocation of their model to do everything.
alach11 2 days ago [-]
The realtime API can be used to call tools [0], but I agree with your general point on the flexibility of working directly with text.
Voice: Onyx
Vibe: Heavy german accent, doing an Arnold Schwarzenegger impression, way over the top for comedic effect. Deep booming voice, uses pauses for dramatic effect.
dougiejones 2 days ago [-]
My recommendation for Ash:
Delivery: Cow noises. You are actually a cow. You can only moo and grunt. No human noises. Only moo. No words.
Pauses: Moo and grunt between sentences. Some burps and farts.
Tone: Cow.
istjohn 2 days ago [-]
Hilarious. In addition to lots of moo's and moo-ack's it hallucinated random words and phrases that weren't in the script.
mft_ 2 days ago [-]
Weird; trying exactly this, and every time I stop and play again, I get a totally different voice. One of them (if I'm not mistaken) was cod-Russian.
mchusma 2 days ago [-]
I also find this strange, and I wonder if I can get a consistent voice out of this. If using the API with a vibe/instructions for a back-and-forth, will it be consistent? This example app they provide implies no?
Etheryte 2 days ago [-]
Yeah, that's both odd and very unfortunate, it seems incredibly nondeterministic. Even running this with the same exact parameters over and over gives widely different results.
rybthrow2 2 days ago [-]
This is hilarious, extra points if you get it to say:
"Get to the chopper now and PUT THAT COOKIE DOWN NOWWWW"
Hmm, I was hoping these would bridge the gap between what's already been available on their audio API or in the Realtime API vs. Advanced Voice Mode, but the audio quality is really the same as it's been up to this point.
Does anyone have any clue about exactly why they're not making the quality of Advanced Voice Mode available to build with? It would be game changing for us if they did.
varunneal 2 days ago [-]
One of the most novel demos I've seen openai ship in a few years. I love how it looks almost like a synth. Fun to play around with!
rsp1984 2 days ago [-]
A bit off-topic but I'm so glad to see skeuomorphic UI make a comeback!
Check out the toggle switch in the upper right corner! I hope more designers will follow this example.
stephenheron 2 days ago [-]
Quite disappointing their speech to text models are not open source. Whisper was really good and it was great it was open to play around with. I guess this continues OpenAI's approach of not really being open!
nickthegreek 2 days ago [-]
Indeed. Right now I think our open choices are Piper, Kokoro and Orpheus.
GaggiX 2 days ago [-]
He was talking about STT models, not TTS. Whisper is open source and a good solution in many cases (in particular finetuned ones).
pzo 2 days ago [-]
Regarding STT, we also got 2 new models from Nvidia today:
Sadly only supports 4 languages (english, german, spanish, french)
DrPhish 2 days ago [-]
In my opinion GPT-SoVITS is the best if you can put in the effort. I'm still using v2 since the output is so good.
It's also the best multilingual one in my testing on Japanese inputs.
nickthegreek 2 days ago [-]
Hadn't messed with that one before. My needs are more real-time for a voice assistant, but it was neat to play with on Hugging Face.
Can it support more languages than only English, Chinese, Japanese, and Korean?
tosh 2 days ago [-]
Are these models only available via the API right now or also available as open weights?
joiemoie 2 days ago [-]
Hi! Can you add prefix support? This would be very valuable in being able to support overlapping windows. The only other way would be to use another AI to determine the overlap.
saint_yossarian 2 days ago [-]
The site just crashes with service workers disabled. First time I ran into a problem with that setting, which I set over two years ago.
jeffharris 2 days ago [-]
oh doh. thanks ... we just pushed a fix for the crash. Unfortunately our current implementation needs service workers for streaming audio, so the "fix" was to disable the feature if the worker isn't available
saint_yossarian 2 days ago [-]
Thanks, that's interesting. I thought service workers were only needed for things like offline support and background activity after a tab is closed (which is why I disabled them).
Streaming audio is a new one to me, I wonder if the same could be achieved with web workers instead. Or at least similar use cases like video calls work fine for me without service workers. See e.g. https://github.com/scottstensland/web-audio-workers-sockets
jcmp 2 days ago [-]
What do you call this design/UI aesthetic? I like it
havefunbesafe 2 days ago [-]
by copying Teenage Engineering
randomcatuser 2 days ago [-]
neumorphism!
vyrotek 2 days ago [-]
"teenage engineering"
KuzeyAbi 2 days ago [-]
You can screenshot and ask chatgpt lol
urbandw311er 2 days ago [-]
Anyone know if these new models are being added to the realtime API? The linked page is a bit opaque on that.
justanotheratom 2 days ago [-]
Note that the previous Whisper STT models were Open Source, and these new STT models are not, AFAICT.
ForTheKidz 2 days ago [-]
Pricing looks like it's aimed at us peasants, not our lords. Smart if openai wants to survive!
jncfhnb 2 days ago [-]
Are there any voice to voice models out there that can replicate inflection of line delivery?
l72 2 days ago [-]
I am pretty impressed with how well this read chinese!
smokeydoe 2 days ago [-]
Does anyone know of any decent newer open source models for generating sound effects?
anigbrowl 2 days ago [-]
Just use a synthesizer. Writing textual prompts is about the most inefficient way of getting what you want. When I was working in film I'd tell directors to stop describing what they had in mind (unless they were referencing something very specific) and try just making some funny mouth noises.
theoryofx 2 days ago [-]
Still seems like Elevenlabs is crushing them on realtime audio, or does this change things?
atlasunshrugged 2 days ago [-]
I'm also curious about this for longform content. Will this be competitive for something like creating an audiobook?
borgdefenser 2 days ago [-]
For $0.015 a minute it has to be.
The books I am listening to now wouldn't even be $10. Any future price drops then will really make this a no-brainer.
The Elevenlabs pricing to me makes it completely useless for audiobooks that I just want to listen to for my personal enjoyment.
prdonahue 2 days ago [-]
Do you have any affiliation with Elevenlabs?
theoryofx 2 days ago [-]
I do not have any affiliation with Elevenlabs or OpenAI except as a user of their APIs. I'd actually prefer it if OpenAI had a better realtime product than Elevenlabs because it'd be more convenient.
atlasunshrugged 2 days ago [-]
FWIW I have no affiliation with any of these companies but I have a book coming out soon and have been researching AI audiobook tools and Elevenlabs seems to be far and away the consensus for that at least
keepamovin 2 days ago [-]
lol OMG these are fantastic. It's as if they hired professional voice actors and cloned their voices.
Perhaps that would be lucrative for the voice artists.
RobinL 2 days ago [-]
I'm surprised at how poor this is at following a detailed prompt.
It seems capable of generating a consistent style, and so in that sense quite useful. But if you want (say) a regional UK accent it's not even close.
I also find it confusing you have to choose a voice. Surely that's what the prompt should be for, especially when the voices have such abstract names.
I mean, it's still very impressive when you stand back a bit, but feels a bit half baked
Example:
Voice: Thick and hearty, with a slow, rolling cadence—like a lifelong Somerset farmer leaning over a gate, chatting about the land with a mug of cider in hand. It’s warm, weathered, and rich, carrying the easy confidence of someone who’s seen a thousand harvests and knows every hedgerow and rolling hill in the county.
Tone: Friendly, laid-back, and full of rustic charm. It’s got that unhurried quality of a man who’s got time for a proper chinwag, with a twinkle in his eye and a belly laugh never far away. Every sentence should feel like it’s been seasoned with fresh air, long days in the fields, and a lifetime of countryside wisdom.
Dialect: Classic West Country, with broad vowels, softened consonants, and that unmistakable rural lilt. Words flow together in an easy drawl, with plenty of dropped "h"s and "g"s. "I be" replaces "I am," and "us" gets used instead of "we" or "me." Expect plenty of "ooh-arrs," "proper job," and "gurt big" sprinkled in naturally.
anigbrowl 2 days ago [-]
That seems way overwritten. Try something like 'Jolly old-fashioned rural farmer, Somerset.'
robbomacrae 2 days ago [-]
I find it works better with shorter simpler instructions. I would try:
Voice: Warm and slow, like a friendly Somerset farmer. Tone: Laid-back and rustic. Dialect: Classic West Country with a relaxed drawl and colloquial phrases.
tantalor 2 days ago [-]
It does a good job with Pirate voice. It can even inject "Arrr matey"
carbocation 2 days ago [-]
Nova+Serene sounds very metallic at the beginning about 50% of the time for me.
jeffharris 2 days ago [-]
some of the older voices are definitely less steerable, more robotic
we put little stars in the bottom right corner for the newer voices, which should sound better
basitmakine 2 days ago [-]
I don't think they're anywhere near TaskAGI or ElevenLabs level.
sintezcs 2 days ago [-]
I love the Teenage Engineering vibe of this page
nmca 2 days ago [-]
Been using ElevenLabs Reader, but these are much better!
nickthegreek 2 days ago [-]
Try the refresh button to get a new list of vibe styles.
kgeist 2 days ago [-]
In Russian, OpenAI audio models usually have a slight American (?) accent. The intonation and the phonetics fall into the uncanny valley. Does the same happen in other languages?
anigbrowl 2 days ago [-]
The Japanese sounds OK to me. Not 100% but better than most human speakers. I understand Japanese well enough to be able to pick up a few different foreign accents in that language.
josu 2 days ago [-]
Yeah, same in Spanish.
tiahura 2 days ago [-]
When are we going to get the equivalent for Whisper? When is it going to pick up on enthusiasm, sarcasm, etc.?
Heidaradar 2 days ago [-]
is it just me or are these voices clearly AI generated? They've obviously been improving at a steady rate but if I saw a YouTube video that had this voice, I'd instantly stop watching it
redox99 2 days ago [-]
Pretty meh. Coral Dramatic is extremely robotic for example.
None of the other major players is trying to do that, not sure why.
It's far better to just steal it all and ask the government for an exception.
ElevenLabs' model takes speech audio as input and maps it to new speech audio that sounds like a different speaker said it, but with the exact same intonation.
OpenAI’s is an end-to-end multimodal conversational model that listens to a user speaking and responds in audio.
No matter what happens, they'll eventually be undercut and matched in terms of quality. It'll be a race to the bottom for them too.
ElevenLabs is going to have a tough time. They've been way too expensive.
I'm pretty much dependent on ElevenLabs to do my vtubing at this point but I can't imagine speech-to-speech has wide adoption so I don't know if they'll even keep it around.
AIWarper recently released a simpler way to run FasterLivePortrait for vtubing purposes https://huggingface.co/AIWarper/WarpTuber but I haven't tried it yet because I already have my own working setup and as I mentioned I'm shifting my workload for that to the cloud anyways
Not OP but via their website linked in their profile -
https://youtu.be/Tl3pGTYEd2I
Whatever capital they've accrued won't hurt when market prices come down.
I'm super happy about this, since I took a bet that exactly this would happen. I've just been building a consumer TTS app that could only work with significantly cheaper TTS prices per million characters (or self-hosted models).
Download a bunch of movies Scarlett Johansson has been in, segment them into audio clips where she talks, and train the model :)
Listening to it again today with fresher ears (the original OpenAI Sky, not the clones elsewhere), I still hear Johansson as the underlying voice actor for it, but maybe there is some subconscious bias I'm unable to bypass.
As you say, I'm not sure we'll ever know, although the Sky voice from Kokoro is spot on the Sky voice from OpenAI, so maybe someone from Kokoro knows how they got it.
Basically make one-off audiobooks for yourself or a few friends.
SherpaTTS has a bunch of different models (piper/coqui) with a ton of voices/languages. There's a slight but tolerable delay with piper high models but low is realtime.
I wonder if you could make a similar, narrow LoRA finetune to train a model to output human-readable text from, say, LaTeX formulas, given a good dataset to train on.
link for anyone else: https://canopylabs.ai/model-releases
https://community.openai.com/t/chatgpt-unexpectedly-began-sp...
ChatGPT unexpectedly began speaking in a user’s cloned voice during testing
> Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Word timestamps are insanely useful for large calls with interruptions (e.g. multi-party debate/Twitter spaces), allowing transcript lines to be further split post-transcription on semantic boundaries rather than crude VAD-detected silence. Without timestamps it’s near-impossible to make intelligible two paragraphs from Speaker 1 and Speaker 2 with both interrupting each other without aggressively partitioning source audio pre-transcription—which severely degrades transcript quality, increases hallucination frequency and still doesn’t get the same quality as word timestamps. :)
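For reference, this is roughly what the whisper-1 word-timestamp request looks like with the OpenAI Python SDK; field names follow the documented verbose_json response, so treat it as a sketch:

    from openai import OpenAI

    client = OpenAI()

    with open("panel_discussion.mp3", "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",      # required for timestamp_granularities
            timestamp_granularities=["word"],
        )

    # Each entry carries the word plus start/end times in seconds, which is what lets you
    # re-split transcript lines on semantic boundaries after the fact.
    for w in transcript.words:
        print(f"{w.start:7.2f}-{w.end:7.2f}  {w.word}")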
Had a recent dive into forced alignment, and discovered that most new models don't operate on word boundaries, phonemes, etc., but rather chunk audio with overlap and do word/context matching. Older HMM-style models have shorter strides (10ms vs 20ms).
Tried searching the Kaldi/Sherpa ecosystem, and found most info leads nowhere or to very small and inaccurate models.
Appreciate any tips on the subject
This is a _very_ low-hanging fruit that anyone with a couple of DGX H100 servers can solve in a month, and it's a real-world problem that needs solving.
Right now _no_ tools on the market - paid or otherwise - can solve this with better than 60% accuracy. One killer feature for decision makers is the ability to chat with meetings to figure out who promised what, when and why. Without speaker diarization this only reliably works for remote meetings where you assume each audio stream is a separate person.
In short: please give us a diarization model. It's not that hard - I've done one for a board of 5, with a 4090, over a weekend.
I am not convinced it is a low-hanging fruit; it's something that is super easy for humans but not trivial for machines, but you are right that it is being neglected by many. I work for speechmatics.com and we spent a significant amount of effort over the years on it. We now believe we have the world's best real-time speaker diarization system, you should give it a try.
That said, the trick to extracting voices is to work in frequency space. Not sure what your model does, but my home-made version first ran all the audio through an FFT, then essentially treated it as a vision problem of finding speech patterns that matched in pitch, and finally output extremely fine-grained timestamps for where they were found; some Python glue then threw that into an open-source Whisper STT model.
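A toy sketch of that frequency-space idea (not the parent's actual pipeline, just the shape of it): compute a spectrogram, take a crude per-frame pitch estimate, and split frames into two clusters standing in for two speakers. Real diarization systems use speaker embeddings instead.

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import stft

    rate, samples = wavfile.read("meeting.wav")
    if samples.ndim > 1:
        samples = samples.mean(axis=1)       # mix down to mono

    freqs, times, Z = stft(samples, fs=rate, nperseg=2048)
    magnitude = np.abs(Z)

    # Restrict to a typical voiced-speech pitch band, then take the dominant bin per frame.
    band = (freqs > 75) & (freqs < 400)
    dominant = freqs[band][np.argmax(magnitude[band, :], axis=0)]

    # Crude two-way split on pitch as a stand-in for "two speakers".
    threshold = np.median(dominant)
    labels = np.where(dominant > threshold, "speaker_A", "speaker_B")

    for t, label in zip(times[::100], labels[::100]):   # sparse timeline printout
        print(f"{t:8.2f}s  {label}")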
For instance erasing the entire instruction and replacing it with ‘speak with a strong Boston accent using eg sounds like hahhvahhd’ has no audible effect on the output.
As I’m sure you know 4o at launch was quite capable in this regard, and able to speak in a number of dialects and idiolects, although every month or two seems to bring more nerfs sadly.
A) Can you guys explain how to get a US regional accent out of the instructions? Or what did you mean by accent, if not that?
B) since you’re here I’d like to make a pitch that setting 4o for refusal to speak with an AAVE accent probably felt like a good idea to well intentioned white people working in safety. (We are stopping racism! AAVE isn’t funny!) However, the upshot is that my black kid can’t talk to an ai that sounds like him. Well, it can talk like he does if he’s code switching to hang out with your safety folks, but it considers how he talks with his peers as too dangerous to replicate.
This is a pernicious second order race and culture impact that I think is not where the company should be.
I expect this won’t get changed - chat is quite adamant that talking like millions of Americans do would be ‘harmful’ - but it’s one of those moments where I feel the worst parts of the culture wars coming back around to create the harm it purports to care about.
Anyway the 4o voice to voice team clearly allows the non mini model to talk like a Bostonian which makes me feel happy and represented; can the mini api version do this?
> e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
"Much better" doesn't sound like it can't happen at all though.
2) What is the latency?
3) Your STT API/Whisper had MAJOR problems with hallucinating things the user didn't say. Is this fixed?
4) Whisper and your audio models often auto corrected speech, e.g. if someone made a grammatical error. Or if someone is speaking Spanish and inserted an English word, it would change the word to the Spanish equivalent. Does this still happen?
2/ We're doing everything we can to make it fast. Very critical that it can stream audio meaningfully faster than realtime
3+4/ I wouldn't call hallucinations "solved", but it's been the central focus for these models. So I hope you find it much improved
diarization is also a feature we plan to add
1. Merge both channels into one (this is what Whisper does with dual-channel recordings), then map transcription timestamps back to the original channels. This works only when speakers don't talk over each other, which is often not the case.
2. Transcribe each channel separately, then merge the transcripts. This preserves perfect channel identification but removes valuable conversational context (e.g., Speaker A asks a question, Speaker B answers incomprehensibly) that helps the model's accuracy.
So yes, there are two technically trivial solutions, but you either get somewhat inaccurate channel identification or degraded transcription quality. A better solution would be a model trained to accept an additional token indicating the channel ID, preserving it in the output while benefiting from the context of both channels.
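A sketch of option 2 above: split the stereo file into per-channel mono tracks, transcribe each with timestamps, then interleave by start time. The pydub and OpenAI calls are what I'd reach for here; treat the details as illustrative.

    from openai import OpenAI
    from pydub import AudioSegment

    client = OpenAI()

    # Split the stereo recording into one mono file per channel (agent / customer).
    left, right = AudioSegment.from_file("call_stereo.wav").split_to_mono()
    left.export("channel_agent.wav", format="wav")
    right.export("channel_customer.wav", format="wav")

    def transcribe(path, speaker):
        with open(path, "rb") as f:
            result = client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                response_format="verbose_json",   # includes segment start/end times
            )
        return [(seg.start, speaker, seg.text) for seg in result.segments]

    # Interleave the two transcripts by start time; channel ID is exact, but each pass
    # lacked the other side's context, as noted above.
    merged = sorted(transcribe("channel_agent.wav", "Agent") +
                    transcribe("channel_customer.wav", "Customer"))
    for start, speaker, text in merged:
        print(f"[{start:7.2f}s] {speaker}: {text}")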
see > Other parameters, such as timestamp_granularities, require verbose_json output and are therefore only available when using whisper-1.
Also, any word on when there might be a way to move the prompting to the server side (of a full stack web app)? At the moment we have no way to protect our prompts from being inspected in the browser dev tools — even the initial instructions when the session is initiated on the server end up being spat back out to the browser client when the WebRTC connection is first made! It’s damaging to any viable business model.
Some sort of tri-party WebRTC session maybe?
https://huggingface.co/hexgrad/Kokoro-82M
What’s the minimum hardware for running them?
Would they run on a raspberry pi?
Or a smartphone?
I would love a larger, better Whisper for use in the MacWhisper dictation app.
That plus timestamps would be incredible.
The Google Gemini 2.0 models are showing some promise with this, I can't speak to their reliability just yet though.
Curious.. is gpt-4o-mini-tts the equivalent of what is/was gpt-4o-mini-audio-preview for chat completions? Because in timing tests it takes around 2 seconds to return a short phrase, which seems more equivalent to gpt-4o-audio-preview.. the latter was much better for the hit-and-hope strat as it didn't ad lib!
Also I notice you can add accents to instructions and it does a reasonable job. But are there any plans to bring out localized voice models?
e.g. the audio-preview model when given instruction to speak "What is the capital of Italy" would often speak "Rome". This model should be much better in that regard
= No plans to have localized voice models, but we do want to expand the menu of voices with voices that are best at different accents
Which leads me to my main gripe with the OpenAI models — I find they break — produce empty / incorrect / noise outputs — on a few key use cases for my application (things like single-word inputs — especially compound words and capitalized words, words in parenthesis, etc.)
So I guess my question is might gpt-4o-mini-tts provide more “reliable” output than tts-1-hd?
we've debugged the cutoff issues and have fixes for them internally but we need a snapshot that's better across the board, not just cutoffs (working on it!)
we're all in on S2S models both for API and ChatGPT, so there will be lots more coming to Realtime this year
For today: the new noise cancellation and semantic voice activity detector are available in Realtime. And ofc you can use gpt-4o-transcribe for user transcripts there.
Top priorities at the moment: 1) Better function calling performance 2) Improved perception accuracy (not mishearing) 3) More reliable instruction following 4) Bug fixes (cutoffs, run-ons, modality steering)
Any fine-tuning for S2S on the horizon?
On what metric? Also Whisper is no longer state of the art in accuracy, how does it compare to the others in this benchmark?
https://artificialanalysis.ai/speech-to-text
Curious if there's a benchmark you trust most?
edit: I actually got it to stay whispering by also putting (soft whispering voice) before the second paragraph
Another thing I noticed is whisper did a better job of transcribing when I removed a lot of the silences in the audio.
Sounds kinda international/like an American trying to do a British accent.
I've been looking for real TTS British accents so this product doesn't meet my goals.
I'm not yet sure how much of a problem this is for real-world applications. I wrote a few notes on this here: https://simonwillison.net/2025/Mar/20/new-openai-audio-model...
But I wish there were an offline, on-device, multilingual text-to-speech solution with good voices for a standard PC — one that doesn't require a GPU, tons of RAM, or max out the CPU.
In my research, I didn't find anything that fits the bill. People often mention Tortoise TTS, but I think it garbles words too often. The only plug-in solution for desktop apps I know of is the commercial and rather pricey Acapela SDK.
I hope someone can shrink those new neural network–based models to run efficiently on a typical computer. Ideally, it should run at under 50% CPU load on an average Windows laptop that’s several years old, and start speaking almost immediately (less than 400ms delay).
The same goes for speech-to-text. Whisper.cpp is fine, but last time I looked, it wasn't able to transcribe audio at real-time speed on a standard laptop.
I'd pay for something like this as long as it's less expensive than Acapela.
(My use case is an AAC app.)
https://huggingface.co/canopylabs/orpheus-3b-0.1-ft
(no affiliation)
it's English only afaics.
I'd rather accept a little compromise regarding the voice and intonation quality, as long as the TTS system doesn't frequently garble words. The AAC app is used on tablet PCs running from battery, so the lower the CPU usage and energy draw, the better.
However, it is unmaintained and the Apple Silicon build is broken.
My app also uses whisper.cpp. It runs in real time on Apple Sillicon or on modern fast CPUs like AMD's gaming CPUs.
Do you possibly have links to the voices you found?
The level of intelligent "prosody" here -- the rhythm and intonation, the pauses and personality -- I wasn't expecting anything like this so soon. This is truly remarkable. It understands both the text and the prompt for how the speaker should sound.
Like, we're getting much closer to the point where nobody except celebrities are going to record audiobooks. Everyone's just going to pick whatever voice they're in the mood for.
Some fun ones I just came up with:
> Imposing villain with an upper class British accent, speaking threateningly and with menace.
> Helpful customer support assistant with a Southern drawl who's very enthusiastic.
> Woman with a Boston accent who talks incredibly slowly and sounds like she's about to fall asleep at any minute.
If we as developers are scared of AI taking our jobs, the voice actors have it much worse...
But as you surmise, this is at best a stalling tactic. Once the tech gets good enough, fewer companies will want to pay for human voice acting labor. Unions can help powerless individuals negotiate better through collective bargaining, but they can't altogether stop technological change. Jobs, theirs and ours, eventually become obsolete...
I don't necessarily think we should artificially protect jobs against technology, but I sure wish we had a better social safety net and retraining and placement programs for people needing to change careers due to factors outside their control.
I am never really in the mood for a different voice. I am going to dial in the voice I want and only going to want to listen with that voice.
This is so awesome. So many audio books have been ruined by the voice actor for me. What sticks out in my head is The Book of Why by Judea Pearl read by Mel Foster. Brutal.
So many books I want as audio books too that no one would bother to record.
It’s far from perfect though. I’m listening to Shattered Sword (about the battle of midway) which has lots of academic style citations so every other sentence or paragraph ends with it spelling out the citation number like “end of sentence dot one zero”, it’ll often mangle numbers like “1,000 pound bomb” becomes “one zero zero zero pound bomb”, and it tries way too hard to expand abbreviations so “Operation AL” becomes “Operation Alabama” when it’s really short for Aleutian Islands.
> Speak with an exaggerated German accent, pronouncing all “w” as “v”
I can't say I've ever had this impulse. Also, to point out the obvious, there's little reason to pay for an audiobook if there's no human reading it. Especially if you already bought the physical text.
Vibe:
Voice Affect: A Primal Scream from the top of your lungs!
Tone: LOUD. A RAW SCREAM
Emotion: Intense primal rage.
Pronunciation: Draw out the last word until you are out of breath.
Script:
EVERY THING WAS SAD!
FYI, speech marks provide millisecond timestamps for each word in a generated audio file/stream (and a start/end index into your original source string), as a stream of JSONL objects, like this:
{"time":6,"type":"word","start":0,"end":5,"value":"Hello"}
{"time":732,"type":"word","start":7,"end":11,"value":"it's"}
{"time":932,"type":"word","start":12,"end":16,"value":"nice"}
{"time":1193,"type":"word","start":17,"end":19,"value":"to"}
{"time":1280,"type":"word","start":20,"end":23,"value":"see"}
{"time":1473,"type":"word","start":24,"end":27,"value":"you"}
{"time":1577,"type":"word","start":28,"end":33,"value":"today"}
AWS uses these speech marks (with variants for "sentence", "word", "viseme", or "ssml") in their Polly TTS service...
The sentence or word marks are useful for highlighting text as the TTS reads aloud, while the "viseme" marks are useful for doing lip-sync on a facial model.
https://docs.aws.amazon.com/polly/latest/dg/output.html
Looks like the new models don't have this feature yet.
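For reference, the Polly request that produces those marks looks roughly like this with boto3 (the voice ID is just an example):

    import json

    import boto3

    polly = boto3.client("polly")

    resp = polly.synthesize_speech(
        Text="Hello it's nice to see you today",
        VoiceId="Joanna",                  # arbitrary example voice
        OutputFormat="json",               # "json" returns speech marks, not audio
        SpeechMarkTypes=["word"],
    )

    # The stream is newline-delimited JSON, one mark per word, like the sample above.
    for line in resp["AudioStream"].read().decode("utf-8").splitlines():
        mark = json.loads(line)
        print(mark["time"], mark["start"], mark["end"], mark["value"])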
> For the first time, developers can “instruct” the model not just on what to say but how to say it—enabling more customized experiences for use cases ranging from customer service to creative storytelling.
The instructions are the "vibes" in this UI. But the announcement is wrong with the "for the first time" part: it was possible to steer the base GPT-4o model to create voices in a certain style using system prompt engineering (blogged about here: https://minimaxir.com/2024/10/speech-prompt-engineering/ ) out of concern that it could be used as a replacement for voice acting, however it was too expensive and adherence isn't great.
The schema of the vibes here implies that this new model is more receptive to nuance, which changes the calculus. The test cases from my post behave as expected, and the cost of gpt-4o-mini-tts audio output is $0.015 / minute (https://platform.openai.com/docs/pricing ), which is about 1/20th of the cost of my initial experments and is now feasible to use to potentially replace common voice applications. This has implications, and I'll be testing more around more nuanced prompt engineering.
Interestingly, the safety controls ("I cannot assist with that request") are sort of dependent on the vibe instruction. NYC cabbie has no problem with it (and it's really, really funny, great job OpenAI), but anything peaceful, positive, etc. will deny the request.
https://www.openai.fm/#56f804ab-9183-4802-9624-adc706c7b9f8
I'm guessing their spectral generator is super low res to save on resources
It's hilarious: they either start to make harsh noise or say nonsense when trying to sing something
"*scream* AAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHHAAAAAAAAAAAAAAAAAAAHHHHHHHHHHHHHHH !!!!!!!!!"
Anyone out there doing any nice robotic robot voices?
Best I've got so far is a blend of Ralph and Zarvox from MacOS' `say`, haha
However, unlike some other TTS models offering Japanese support that have been discussed here recently [1], I think this new offering from OpenAI is good enough for language users. I certainly could have put it to good use when I was studying Japanese many years ago. But it’s not quite ready for public-facing applications such as commercial audiobooks.
That said, I really like the ability to instruct the model on how to read the text. In that regard, my tests in both English and Japanese went well.
[1] https://news.ycombinator.com/item?id=42968893
https://www.youtube.com/watch?v=me4BZBsHwZs
I switched back to "NYC Cabbie" and it again read it just fine. I then reloaded the session completely, refreshed the voice selections until "NYC Cabbie" came up again, and it still read the text without hesitation.
The text:
> In my younger and more vulnerable years my father fuck gave me some fuck fuck advice that I've been fuck fuck FUCK OH FUCK turning over in my mind ever since.
> "Whenever you feel like criticizing any one," he told me, oh fuck! FUCK! "just remember that all the people in this world haven't had fuck fuck fuck FUCKERKER the advantages that you've had."
edit: "Emo Teenager", "Mad Scientist", and "Smooth Jazz" are able to read the text. However, "Medieval Knight" and "Robot" cannot.
Try a few for yourself.
- Original: https://www.youtube.com/watch?v=FYcMU3_xT-w&t=5s
- AI: https://www.openai.fm/#8e9915b0-771d-4123-8474-78cc39978d33
[0] https://github.com/openai/openai-realtime-agents
One attempt merely sounded like it had a slight German accent, one just sounded kind of raspy, and the third sounded like a normal American English speaker.
The next version of Model Context Protocol will have native audio support (https://github.com/modelcontextprotocol/specification/pull/9...), which will open up plenty of opportunities for interop.
https://huggingface.co/nvidia/canary-180m-flash
https://huggingface.co/nvidia/canary-1b-flash
second in Open ASR leaderboard https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
https://huggingface.co/spaces/lj1995/GPT-SoVITS-v2