It has to be a separate model doing the voice output right? I can’t imagine they’ve solved true multimodal output from a single model, they’d be bragging about it.
They’re probably predicting tone of voice tokens. Feed that into an audio transformer along with some speculative decoding to keep latency low.
The old voice mode was but everyone including gdb is saying that this one is natively multimodal once it’s fully rolled out, audio in audio out. This has been in the works for a while, you can look up papers on things like OCR-free document understanding and the like but the basic idea is you just train it and evaluate it on whatever media you want it to understand. As long as you can tokenize it it’ll work.
It’s definitely multimodal input. Passing Clip embeddings to an LLM is nothing new, and that’s really all you need for document understanding. It’s almost certainly the same thing for audio. They would have trained a dual encoder that maps both audio and text to a shared embedding space.
What’s not at all clear to me is if they’re doing something special for output. Are you saying OpenAI has moved beyond next token prediction and just hasn’t bothered to mention it?
I assume so, you can’t really tokenize audio, at least not high fidelity audio. Audio models like Bark don’t output logits from what I understand. For true multimodal output you’d need a model that can output both logits and audio embeddings.