
I skimmed it, but I still wonder (1) why we still need a tokenizer for text, and (2) why the other modalities (audio/video) don't need one.



How do you think the other modalities are fed into the attention layers? They are tokenized as well: that's literally what the separate image/audio encoders produce as output before it is fed into the main network. Tokenization is at its core just a tradeoff between sequence length and embedding size, so it will probably stay relevant as long as attention layers scale quadratically with sequence length.
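
To put rough numbers on that tradeoff, here is a minimal sketch. Everything in it is my own assumption for illustration: the 224x224 image with 16x16 patches is a common ViT-style setup rather than anything specific to the model under discussion, whitespace splitting is only a crude stand-in for a real subword tokenizer, and the helper names are made up. It only counts entries in the attention score matrix, which grows with the square of the sequence length.

```python
def attention_score_entries(seq_len: int) -> int:
    """Entries in one head's attention score matrix (seq_len x seq_len)."""
    return seq_len * seq_len


def num_image_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patch x patch tiles, i.e. image 'tokens'."""
    return (height // patch) * (width // patch)


# Text: one token per byte vs. one token per whitespace-separated word.
# (Assumption: word splitting is only a rough proxy for a subword tokenizer.)
text = "Tokenization trades sequence length for embedding size."
byte_tokens = len(text.encode("utf-8"))
word_tokens = len(text.split())

# Image: one token per pixel vs. one token per 16x16 patch.
pixel_tokens = 224 * 224
patch_tokens = num_image_patches(224, 224, 16)

for name, n in [
    ("text, per byte", byte_tokens),
    ("text, per word", word_tokens),
    ("image, per pixel", pixel_tokens),
    ("image, per patch", patch_tokens),
]:
    print(f"{name:18s} seq_len={n:6d}  score entries={attention_score_entries(n)}")
```

Going from pixels to 16x16 patches shortens the sequence by 256x, so the score matrix shrinks by 256^2 = 65536x. The price is that each patch token has to carry 256 pixels' worth of information in its embedding, which is the "sequence length vs. embedding size" trade in a nutshell.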


