Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It’s not really multiple instances of the same model. Model weights aren’t replicated in vram. The results of multiplying k sequences through the model is larger, but that’s pretty small compared with the model weights themselves.

The bigger constraint is the target model and the draft model needing to share VRAM.



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: