Hacker News | ay's comments

With almost 600 books in my kindle collection over a period of about 15 years, I would like to think I was a relatively active customer. When they announced “your kindle books are just a license to read”, which happened about the time they announced the deprecation of the old format, I went and converted the entirety of my library to Calibre with multiple open formats.

That was in December. I have not bought a single book on Amazon since then, and the kindle app is not installed on my new phone. Just in case anyone from the relevant AMZN department is reading this.


Both you and parent could be right.

There is a fun term “jagged frontier”.

Meaning: one model can be much better than the other one in one thing, and much worse than the other in another thing.


Very interesting!

Architecturally, where do you run Postgres? I assume it would be external to the cluster? (Doing it internally would create a circular dependency?)


Dual clusters that run each other's control planes are also a perennial classic.


Yes, it is external to the cluster.

If you want to do a quick setup, it creates a SQLite DB for the metadata.


I’ve made something very similar that is almost backend-agnostic: https://github.com/ayourtch-llm/tttt - and it does auto inject the MCP in case of Claude, but it is trivial to adapt to other backends.


This is really cool.

I did something similar with gemini cli by just wrapping it in tmux and building some extensions.[0]

Eventually that wasn't enough, so I ended up forking it and adding REST endpoints to inject commands and read the screen.[1]

Your solution is much cleaner! I'll probably replace mine with it. Thanks for sharing!

[0] https://github.com/stevenAthompson/self-command

[1] https://github.com/stevenAthompson/gemini-cli-remote-control


Isn’t that what LoRA does ?


LoRAs are better at steering models to produce correct answers from their data set than imparting new knowledge.


https://arxiv.org/abs/2603.01097

>Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages.
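
For context on why LoRA adapters are cheap to train and swap, here is a minimal sketch of the parameter arithmetic. LoRA freezes a weight matrix of shape d x k and learns a low-rank update B A, with B of shape d x r and A of shape r x k. The layer size and rank below are hypothetical, chosen only for illustration.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for one d x k weight matrix."""
    full = d * k          # parameters updated by full fine-tuning
    lora = r * (d + k)    # B is d x r, A is r x k
    return full, lora

# Hypothetical layer: a 4096 x 4096 projection with rank 16.
full, lora = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The trainable part is a small fraction of the frozen matrix, which is what makes adapters practical to train on modest hardware.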


Hinduism is probably right. Every system of sufficient complexity is probably sentient - even if in the ways we at our level can not fathom.


I'm a (non-practicing) Dwaitin Hindu. AFAICT, no mainstream school of Hindu philosophy (there are three) espouses that view, although Advaitins come very close to it with their four mahavakyas.

IMO, the Integrated Information Theory of consciousness (IIT) is exactly that: everything is conscious, and the only difference is the degree to which it is conscious.


Oh, thank you very much for enlightening me! All this time I misunderstood! I guess then IIT it is for me :-)


I tried qwen3.5:4b in ollama on my 4-year-old Mac M1 with my own coding harness, and it exhibited pretty decent tool calling, but it was a bit slow and seemed a little confused by the more complex tasks (also, I have it code Rust, which might add complexity). The task was “find the debug that does X and make it conditional based on whichever variable is controlled by the CLI ‘/debug foo’”. I didn’t do much with it after that.

It may be interesting to try a 6-bit quant of qwen3.5-35b-a3b. I had pretty good results running it on a single 4090; for obvious reasons I didn’t try it on the old Mac.

I have been using an 8-bit quant of qwen3.5-27b as more or less my main engine for the past ~week and am quite happy with it, but that requires more memory/GPU power.

HTH.


What matters for Qwen models, and most/all local MoE models (i.e. where performance is the limiting factor), is memory bandwidth. This goes for small models too. Here are the top Apple chips by memory bandwidth (and to steal from clickbait: Apple definitely does not want you to think too closely about this):

M3 Ultra — 819 GB/s

M2 Ultra — 800 GB/s

M1 Ultra — 800 GB/s

M5 Max (40-core GPU) — 610 GB/s

M4 Max (16-core CPU / 40-core GPU) — 546 GB/s

M4 Max (14-core CPU / 32-core GPU) — 410 GB/s

M2 Max — 400 GB/s

M3 Max (16-core CPU / 40-core GPU) — 400 GB/s

M1 Max — 400 GB/s

Or, just counting portable/MacBook chips: M5 max (top model, 64/128G), M4 max (top model, 64/128G), M1 max (64G). Everything else is slower for local LLM inference.

TLDR: An M1 max chip is faster than all M5 chips, with the sole exception of the 40-GPU-core M5 max, the top model, only available in 64 and 128G versions. An M5 pro, any M5 pro (or any M* pro, or M3/M2 max chip) will be slower than an M1 max on LLM inference, and any Ultra chip, even the M1 Ultra, will be faster than any max chip, including the M5 max (though you may want the M2 ultra for bfloat16 support, maybe. It doesn't matter much for quantized models)
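
A rough way to see why bandwidth dominates token generation: each generated token has to stream the active weights from memory once, so bandwidth divided by the bytes of active weights gives an upper bound on tokens per second. The sketch below assumes a hypothetical model with 3B active parameters at 8 bits per weight, and it ignores KV-cache reads, compute, and runtime overhead, so real numbers will be lower. The bandwidth figures are the ones listed above.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          active_params_billions: float,
                          bits_per_weight: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params_billions * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Bandwidth numbers quoted in the comment above (GB/s).
chips = {"M3 Ultra": 819, "M5 Max (40-core GPU)": 610, "M1 Max": 400}

# Hypothetical model: 3B active parameters, 8-bit weights.
for name, bw in chips.items():
    print(f"{name}: <= {max_tokens_per_second(bw, 3, 8):.0f} tok/s")
```

Since the bound is linear in bandwidth, the ranking of chips under this estimate is just the bandwidth ranking.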


For comparison, most recent (consumer) NVIDIA GPUs released:

- 5050 - MSRP: 249 USD - 320 GB/s

- 5060 - MSRP: 299 USD - 448 GB/s

- 5060 Ti - MSRP: 379 USD - 448 GB/s

- 5070 - MSRP: 549 USD - 672 GB/s

- 5070 Ti - MSRP: 749 USD - 896 GB/s

- 5080 - MSRP: 999 USD - 960 GB/s

- 5090 - MSRP: 1999 USD - 1792 GB/s

M3 Ultra seems to come close to a ~5070 Ti more or less.


You should really list memory alongside the graphics cards, and the list above should include (unified) memory and prices as well, at particular price points.


I mean, what I (and maybe others) was curious about was how it compares to the parent's post, which is all about memory bandwidth, hence the comparison.


But it doesn't matter if you have 1000GB/s memory bandwidth if you only have 32GB of VRAM. Well, maybe for some applications it works out (image generation?), but it's not seriously competing with an ultra with 128 GB of unified memory or even a max with 64 GB of unified memory.


> but it's not seriously competing with an ultra with 128 GB of unified memory or even a max with 64 GB of unified memory.

No one is arguing that either, this sub-thread is quite literally about the memory bandwidth. Of course there are more things to care about in real-life applications of all this stuff, again, no one is claiming otherwise. My reply was adding additional context to the "What matters [...] is memory bandwidth" parent comment, nothing more, hence the added context of what other consumer hardware does in memory bandwidth.


If we are talking about Apple silicon, where we can configure the memory separately from the bandwidth (and the memory costs the same for each processor), we can say something like "it's all about bandwidth". If we switch to GPUs, where that is no longer true (NVIDIA won't let you buy a 5090 with more than 32GB of VRAM), then...we aren't comparing apples to apples anymore.


A 10GB 3080 still beats even an M2 Ultra with 192GB... memory bandwidth is not the only factor.

https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...


If the model is small enough to fit into 10GB of VRAM, the GPU can win.

But the bigger models are more useful, so that’s what people fixate on.
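
Whether a model fits can be estimated from its parameter count and quantization width. A minimal sketch, assuming the weights dominate memory use; KV cache, activations, and runtime overhead are ignored, so treat the result as a lower bound. The model sizes are illustrative.

```python
def weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8

def fits(params_billions: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given amount of memory."""
    return weight_size_gb(params_billions, bits_per_weight) <= memory_gb

# Illustrative cases on 10 GB of VRAM: a 27B model at 8 bits vs. a 4B model at 4 bits.
for params, bits in [(27, 8), (4, 4)]:
    size = weight_size_gb(params, bits)
    print(f"{params}B @ {bits}-bit: ~{size:.1f} GB, fits in 10 GB: {fits(params, bits, 10)}")
```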


There is also prompt processing, which is compute-bound, and for agentic workflows it can matter more than tg (token generation), especially if the model is not of the "thinking" type.
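
A back-of-the-envelope sketch of that split, using the common approximation of about 2 FLOPs per active parameter per token for a forward pass. All hardware and model numbers here are hypothetical placeholders, not measurements; the point is only that prompt processing scales with compute while token generation scales with bandwidth.

```python
def prefill_seconds(prompt_tokens: int, active_params: float, flops_per_s: float) -> float:
    """Compute-bound estimate: ~2 FLOPs per active parameter per prompt token."""
    return 2 * active_params * prompt_tokens / flops_per_s

def decode_seconds(new_tokens: int, weight_bytes: float, bytes_per_s: float) -> float:
    """Bandwidth-bound estimate: active weights are read once per generated token."""
    return new_tokens * weight_bytes / bytes_per_s

# Hypothetical setup: 3e9 active params (8-bit), 20 TFLOP/s usable compute, 400 GB/s bandwidth.
pp = prefill_seconds(prompt_tokens=50_000, active_params=3e9, flops_per_s=20e12)
tg = decode_seconds(new_tokens=500, weight_bytes=3e9, bytes_per_s=400e9)
print(f"prefill: {pp:.1f} s, decode: {tg:.1f} s")
```

With a long agentic context and a short reply, the prefill term can dominate under these assumptions.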


I am 50 and have been coding since ~12. Started with an Apple II, wrote my own editor in assembly for the BK-0010 (a Soviet computer) during my university years, then spent 30 years in computer networking, with some high-performance dataplane work more recently.

In the last few years it somehow felt like there was nothing new anymore, just the same 10 ideas being regurgitated with slight modifications. I tinkered with AI for the past 2 years, but it was mostly a “tool for writing boilerplate”. I tried a few ideas for agents but didn’t see how they could work.

That changed with Opus 4.6 and the subsequent wave of local models - now I try 10 ideas a day and it’s like magic! And if something doesn’t work - jumping into the code and debugging it is huge fun!

Understanding that the era of the almost-free cloud tokens might come to an end, I run my own harness pointing to my own GPUs running Qwen3.5-27B, and the last few days it has been very busy! :)

My harness doesn’t “pressure cook”, since it doesn’t make sense to do that with only one GPU (besides many other reasons). It runs everything in a linear fashion, including subagents, and logs everything. Reading the logs as they go by is another cool thing: sometimes I pick up interesting things from them!
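
The linear, log-everything design could look roughly like the sketch below. This is not the actual harness: `run_task`, the task names, and the log format are hypothetical stand-ins, and the model call is replaced by a stub.

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("harness")

def run_task(name: str, depth: int) -> list[str]:
    """Stub for a model call; returns names of subagent tasks to spawn (hypothetical)."""
    return [f"{name}.sub{i}" for i in range(2)] if depth == 0 else []

def run_linear(root: str) -> list[str]:
    """Run the root task and its subagents one at a time, logging each step."""
    order = []
    queue = deque([(root, 0)])
    while queue:
        name, depth = queue.popleft()
        log.info("start %s", name)
        for child in run_task(name, depth):
            queue.append((child, depth + 1))  # subagents wait their turn
        log.info("done %s", name)
        order.append(name)
    return order

print(run_linear("main"))
```

Because only one task runs at a time, a single GPU is never asked to serve concurrent requests, and the log reads as one continuous story.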

The distribution of people’s moods related to AI seems indeed bimodal. And I feel lucky somehow ending up in the “enthusiastic” rather than “depressed” part of it. To the folks in the other one: I am sorry. I don’t know why it is this way. If I knew I might have given unsolicited advice.


So you’ve tried at least a hundred ideas by now, care to share fifty of them? I’m very curious as to what they are. Opus is too slow to even complete one idea per day for me, and that’s fine, I don’t have hundreds of them :)


I don't have big ideas. Some of the more interesting ones that I ended up using but can’t share: a streaming radio for my MP3 collection (runs behind the VPN); a lightweight and self-contained WebRTC conference server for talking with my family; a process-level virtualization based on KVM.

Of the ones I can share:

Browser-based network tester using WebRTC unreliable data channels: https://netpoke.com - use magic code “DEMO” to see what it’s about - the source is at https://github.com/ayourtch/netpoke

A port of the SOTA speech generation model from Python to Rust:

https://github.com/ayourtch/fish-audio-experiment

A study on LLM prompting techniques:

https://github.com/ayourtch-llm/kindness

My own coding agent that I use with my locally hosted LLM for experiments:

https://github.com/ayourtch-llm/apchat

An LLM also helped with a lot of the code for my packet mangling library: https://github.com/ayourtch/oside - which, among other things, includes a now battle-tested SNMPv3 stack.

A true “stochastic parrot” using hash tables: https://github.com/ayourtch/hashmem

These are the ones I remember. Feel free to scout my GitHub for more. Edit: And of course it doesn’t need to be said that, out of the ideas I try, not all of them make it to GitHub. Many end up thrown away.
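
For readers unfamiliar with the idea, a “stochastic parrot” built on a hash table can be as small as a word-level Markov chain. The sketch below is a generic illustration and makes no claim about how hashmem itself works; the corpus and function names are made up.

```python
import random
from collections import defaultdict

def build_table(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that followed it in the text."""
    words = text.split()
    table = defaultdict(list)
    for current, following in zip(words, words[1:]):
        table[current].append(following)
    return dict(table)

def parrot(table: dict[str, list[str]], start: str, length: int, seed: int = 0) -> list[str]:
    """Generate up to `length` words by sampling successors from the table."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and out[-1] in table:
        out.append(rng.choice(table[out[-1]]))
    return out

corpus = "the cat sat on the mat and the dog sat on the cat"
table = build_table(corpus)
print(" ".join(parrot(table, "the", 8)))
```

Every adjacent word pair in the output was seen in the corpus, which is all the “knowledge” such a table holds.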


Just use iSH and use the local terminal on the iPhone from which you can connect to the Mac terminal. Works well over tailscale, too.


How do I know iSH app isn’t exfiltrating data?


You don’t know whether your C compiler isn’t doing that either.


Fixing a non-trivial bug is a great way to learn - assuming they don’t give up.

By virtue of being generators of subtly broken stuff, LLMs are well positioned to create very nice learning material.

Same thing about growing the project - having to deal with something too big for AI is a very valuable experience.

And, in my experience, some of the purely human made codebases are strictly worse than LLM-made :-)


Isn’t that how a lot of us learned: by typing the code out of the back of a magazine, then spending hours trying to debug a typo somewhere?

I didn’t realize how close LLMs are to the old magazines. Let it give you a seed, then use that springboard to learn everything else.

