Open Source isn't even within 50% of what the SOTA models are. Benchmarks are to...

twobitshifter · 2026-04-24T17:26:43 1777051603

Unless you are getting outside of your comfort zone and taking a month off from your $200 subscription every other month, I can’t see how you can make the universal claim that the open weights models are all 50% as good. Just today, DeepSeek released a new model, so nobody knows how that will compare; a week ago it was Gemma 4, etc. I’m okay with you making a comparison, but state the model and the timeframe in which it was tested that you are basing your conclusions on.

MostlyStable · 2026-04-24T17:02:24 1777050144

I think that there will come a point when open source models are "good enough" for many tasks (they probably already are for some tasks; or at least, some small number of people seem happy with them), but, as you suggest, it will likely always (for the foreseeable future at least) be the case that closed SOTA models are significantly ahead of open models, and any task which can still benefit from a smarter model (which will probably always remain some large subset of tasks) will be better done on a closed model.

The trick is going to be recognizing tasks which have some ceiling on what they need and which will therefore eventually be doable by open models, and those which can always be done better if you add a bit more intelligence.

bachmeier · 2026-04-24T17:35:37 1777052137

> Benchmarks are toys, real world use is vastly different...Why should anyone waste time on poorer results? I'd rather pay my $200/mo because my time matters.

This kind of rhetoric is not helpful. If you want to make a point, then make one, but this adds nothing to the conversation. Maybe open source models don't work for you. They work very well for me.

brazukadev · 2026-04-24T17:17:58 1777051078

> Open Source isn't even within 50% of what the SOTA models are

Who said so? GLM 5.1 is 90% of Opus, at least. Some people are quite happy with Kimi 2.6 too. I have not tried DeepSeek 4 yet, but I am also hearing it is as good as Opus. You might be confusing open source models with local models. It is not easy to run a 1.6T model locally, but they are not 50% of SOTA models.

oceanplexian · 2026-04-24T17:38:05 1777052285

> Benchmarks are toys, real world use is vastly different, and that's where they seriously lag.

I'm not disagreeing per se, but if you think the benchmarks are flawed and "my real world usage" is more reflective of model capabilities, why not write some benchmarks of your own?

You stand to make a lot of money and gain a lot of clout in the industry if you've figured out a better way to measure model capability, maybe the frontier labs would hire you.

bandrami · 2026-04-24T17:20:17 1777051217

> Why should anyone waste time on poorer results?

Because in almost no real-world project is "programming time" the limiting factor?

bdangubic · 2026-04-25T01:11:41 1777079501

amazing how often this is repeated on here as some sort of gospel SWEs pass down to one another to continue this charade. I have worked in this industry for 30+ years on countless projects, last decade+ as consultant - at every single project (every single one) programming time was the limiting factor. there is a whole industry inside our industry dealing with “processes” and “how to estimate” (apparently we are incapable of doing that) and whatnot, all because the actual programming time is always a limiting factor and there isn’t even a close 2nd

manny_rat · 2026-04-25T18:33:02 1777141982

Agreed, it's very strange. I'm sure there are many projects that are like they describe, but it's certainly not all of them. I have worked as a game dev for over 20 years, and probably 75% of that time my team and I have been coding. AI has been an incredible game changer for me over the past 6 months or so (I was using it quite a bit before then, but the capability became much higher lately). I actually have some free time in my days now while still hitting milestone dates, instead of endless crunching.

hypnoce_fr · 2026-04-25T08:46:22 1777106782

What counts as programming time? Writing? Reviewing? Compiling? Debugging? It also depends on the industry. From idea to production, the limiting factor is not always writing the code, and in my experience (15 years in fintech) it has almost never been. Discussion, alignment, compilation, heavy testing pipelines, shipping, all of this on a 30-million-line monorepo. On a greenfield 10k-line repo, yes, AI really shines. In other cases, it’s currently just a helper on very specific narrow tasks, which are not always programming.

bandrami · 2026-04-25T02:04:12 1777082652

That's just not my experience. Making the software in the first place is never even the cost center.

dymk · 2026-04-24T17:57:06 1777053426

No, it's the rate at which you can solve problems, and weaker models waste your time because they don't solve problems at the same speed.

hunterpayne · 2026-04-25T01:04:44 1777079084

No, it's the number of debug cycles you need to solve said problems. That's the major attribute that controls dev time. And models require far more than I need. You are paying money to take longer and produce worse code. If it's different for you, that's a you problem.

Someone1234 · 2026-04-24T17:07:22 1777050442

> Open Source isn't even within 50% of what the SOTA models are.

When was the last time you used any of them? Because a lot of people are actively using them for 9-5 work today; I count myself in that group. That opinion feels outdated, like it was formed a year ago+ and held onto. Or based on highly quantized versions and/or small non-Thinking models.

Do you really think Qwen3.6, as a specific example, is "50%" as good as Opus4.7? Opus4.7 is clearly and objectively better, no debate on that, but the gap isn't anywhere near that wide. I'd call "20%" hyperbole; the true difference is difficult to measure exactly, but sub-10% for their top-tier Thinking models is likely.

cwnyth · 2026-04-24T18:36:57 1777055817

Their opinion is behind on LibreOffice, too. I won't defend GIMP's monstrosity, but I finished a whole dissertation, do all my regular spreadsheet work (that isn't done via R), and have created plenty of visual mockups with LibreOffice. Plus, I don't have to deal with a spammy Windows environment.

Sure, we use Google Drive, too, but that's just for sharing documents across offices, not for everyday use. For that, the open source model is a clear winner in my book.

vlovich123 · 2026-04-24T17:17:47 1777051067

Qwen3.6 at which model size and quantization? I already think Opus 4.6 is usable but still dumb as bricks. A 20% cut off that feels like it would still be unusable. And that's not even getting to the annoyance of setting everything up to run locally and getting HW that can run it locally, which basically looks like a MacBook M4 these days, as the x86 side is ridiculously pricey to get decent performance out of models.

Someone1234 · 2026-04-24T20:40:14 1777063214

At their highest model size and quant. We are discussing price and quality at the top, not what you can run on the lower end.

So the starting point is Opus 4.7 pricing and we're contrasting alternatives near the top end (offered across multiple providers).

Also I said 20% was hyperbole, meaning far too high.

vlovich123 · 2026-04-24T21:09:01 1777064941

That makes no sense because the largest Qwen models are not even open weight so I’m not sure how that’s any different.

Someone1234 · 2026-04-25T00:45:58 1777077958

Right, which isn't what we're discussing, since I mentioned "across multiple providers" in every comment about this topic.

Those closed weight models aren't available like we're discussing. They're only available from the vendor that created them.

vlovich123 · 2026-04-25T06:04:43 1777097083

The largest Qwen model is similar, so I’m not sure what point you’re trying to make. The only ones available are the open weight ones, which are the smaller variants and nowhere near within 20% of the closed frontier models.

Someone1234 · 2026-04-25T12:55:41 1777121741

The largest open models are within 20%; they're likely within 10%. Go actually try them and stop making outdated assumptions. You don't need to invest a lot of money either, just pick your favorite vendor, and send out a few prompts.

lelanthran · 2026-04-24T21:24:36 1777065876

> Open Source isn't even within 50% of what the SOTA models are.

The gap has been shrinking with each release, and the SOTA has already run into diminishing returns for each extra unit of data+computation it uses.

Do you really want to bet that the gap will not eventually be a hair's breadth?

conrs · 2026-04-24T17:54:52 1777053292

IMO It's a different and new model. We're engineers, and we're rich. It's not going to be good enough for us. But the much larger market by far is all the people who used to HAVE to work with engineers. They now have optionality; the pendulum is going to swing.

swader999 · 2026-04-24T18:03:14 1777053794

Also, this space will be (and perhaps already is for some of us) an arms race. Sure, you can go local, but hosted will always be able to offer more, and if you want to be competitive, you'll need to be using the most capable.

kube-system · 2026-04-24T17:16:31 1777050991

There's going to be a day when we look back at $200/mo price tags and say "wow that was cheap".

The breakeven at this price is 6 minutes of productivity per work day for an engineer making $200k.

cheschire · 2026-04-24T18:12:57 1777054377

Okay, but then by that logic a person making only $20k would break even at about an hour.

Are you suggesting that someone making $20k should be spending $200/mo on Claude?

kube-system · 2026-04-24T19:02:45 1777057365

I'm talking about the cost of labor.

If you pay someone $20,000 for labor, and they save 65 minutes worth of labor per day using a $200/mo Claude subscription, you are better off buying the Claude subscription.
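The breakeven figures in this subthread can be reproduced with a rough sketch like the one below. It assumes 260 working days a year and 8-hour days, and it counts salary only (no benefits, taxes, or overhead); those assumptions and the function name are illustrative, not something stated in the comments.

```python
# Rough breakeven sketch. ASSUMPTIONS (not from the thread):
# 260 working days per year, 8 hours per day, salary is the only labor cost.
WORK_DAYS_PER_YEAR = 260
HOURS_PER_DAY = 8


def breakeven_minutes_per_day(annual_salary, monthly_subscription=200.0):
    """Minutes of labor per work day whose cost equals the subscription."""
    # Cost of one minute of this person's working time.
    cost_per_minute = annual_salary / (WORK_DAYS_PER_YEAR * HOURS_PER_DAY * 60)
    # Subscription cost spread over the working days in a year.
    subscription_per_day = monthly_subscription * 12 / WORK_DAYS_PER_YEAR
    return subscription_per_day / cost_per_minute


if __name__ == "__main__":
    for salary in (200_000, 20_000):
        minutes = breakeven_minutes_per_day(salary)
        print(f"${salary:,}/yr: about {minutes:.1f} minutes per work day")
```

Under these assumptions the $200k case comes out just under 6 minutes per work day and the $20k case just under an hour, so saving 65 minutes a day at $20k clears the bar. Loaded labor cost or API pricing would shift both numbers.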

kuboble · 2026-04-25T05:01:21 1777093281

I think if you (a company) pay someone for labor, your employees cannot use a personal subscription and you have to pay considerably higher API prices.

hrimfaxi · 2026-04-25T14:42:25 1777128145

Most companies don't provide a corporate cell phone and have no problems with answering emails from a personal account. Can't have it both ways.

kube-system · 2026-04-26T18:03:43 1777226623

You could; it’s just against the ToS.

But the specific numbers in my prior comments aren’t really relevant to my point. Adjust for whatever numbers you want.

kuboble · 2026-04-26T18:33:41 1777228421

But I think they are relevant because you compare two numbers and one is much lower.

I've done some napkin math, and CC makes me more efficient when I pay $200/month, but it wouldn't if I had to pay API prices.

kube-system · 2026-04-27T05:39:18 1777268358

Really? Are you using opus and letting it run for long periods? Curious as to what your workflow is.

The math is highly in favor of us using it at our company and we are paying API pricing. I don’t imagine there’s a lot of people using Claude without getting their money’s worth…?

kuboble · 2026-04-27T05:50:30 1777269030

Yes, recently I've been working on a research/optimization problem.

I would start Claude in YOLO mode and tell it to keep trying new ideas until it runs out of its 1M context. (Every day I give it a hint to explore different directions than the sessions before.)

Twice a day for a month, fits well into CC max plan.

I guess if I had to pay per token I would still use it but only for tasks where the value is clearer and immediate.

dragandj · 2026-04-25T10:57:41 1777114661

Who's gonna pay $20,000 for labor that can be done by anyone with a $200/mo subscription?

kube-system · 2026-04-25T14:22:15 1777126935

Nobody, but that doesn’t exist yet. Currently these solutions enhance the productivity of workers, but they can’t quite replace them.

echelon · 2026-04-24T18:30:29 1777055429

Everyone is arguing why I'm wrong or that I should have presented more data.

You've got the real insight with this claim.

This is the way the world is moving. Open source isn't even going where the ball is being tossed. There is no leadership here.

You're spot on.

If the cost to deliver a unit of business automation is:

    A. $1M with human labor

    B. $700k human labor + open source models

    C. $500k human labor + $10,000 in claude code max (duration of project)

    D. $250k with humans + $200k claude code "mythos ultra"

The one that will get picked is option "D".
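The comparison above is just a sum per option. A minimal sketch using the commenter's hypothetical figures follows; option B's tooling cost is not given in the comment and is assumed to be $0 here, and the dictionary layout is illustrative only.

```python
# Totals for the four hypothetical options listed above.
# ASSUMPTION: option B's open source model cost is unstated, taken as $0.
options = {
    "A": {"labor": 1_000_000, "tooling": 0},
    "B": {"labor": 700_000, "tooling": 0},
    "C": {"labor": 500_000, "tooling": 10_000},
    "D": {"labor": 250_000, "tooling": 200_000},
}

# Total cost to deliver one unit of business automation under each option.
totals = {name: cost["labor"] + cost["tooling"] for name, cost in options.items()}
cheapest = min(totals, key=totals.get)

if __name__ == "__main__":
    for name, total in sorted(totals.items(), key=lambda item: item[1]):
        print(f"Option {name}: ${total:,}")
    print(f"Cheapest: option {cheapest}")
```

With these inputs the totals are $1M, $700k, $510k, and $450k, so D is cheapest only as long as the assumed labor savings hold.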

Your poor college students and hobbyists will be on option "B". But this won't be as productive, as evidenced by the human labor input costs.

Option "C" will begin to disappear as models/compute get more expensive and capable.

Option "A" will be nonviable. Humans just won't be able to keep up.

Open source strictly depends on models decreasing their capability gap. But I'm not seeing it.

Targeting home hardware is the biggest smell. It's showing that this is non-serious, hobby tinkery and has no real role in business.

For open source to work and not to turn into a toy, the models need to target data center deployment.

hunterpayne · 2026-04-25T01:02:23 1777078943

You are assuming (imagining) a cost relationship which doesn't exist and when researched was the opposite of what you claim.

brazukadev · 2026-04-25T13:13:01 1777122781

This is you playing with imaginary numbers, like Sam Altman has been doing for a long time. It won't end well.

echelon · 2026-04-25T21:09:08 1777151348

I'm willing to bet that this is the shape of the future.

Wanna bet on it?

brazukadev · 2026-04-25T22:44:58 1777157098

It is not. Yeah, I'm betting already. AI is changing the software landscape, but it won't be captured by OpenAI and Anthropic.

kube-system · 2026-04-24T19:40:48 1777059648

Yeah, I don't wanna shit on open source, there will certainly be uses for all different kinds of models.

The real money in this market, though, is going to be made in the C suite, and they don't really care about the model. They don't care if it's open source, closed source, or what it is. They don't want to buy a model. They're interested in buying a solution to their problems. They're not going to be afraid of a software price tag -- any number they spend on labor is far more.

Labor is something like 50%+ of the Fortune 500's operating expenses -- capturing any chunk of this is a ridiculous sum of money.

nancyminusone · 2026-04-24T18:22:10 1777054930

People pirate Photoshop and Office if they don't want to pay for it, making it as "free" as GIMP. If there is a free option, people will use it. Never underestimate the cheapskates.

kardos · 2026-04-24T19:02:18 1777057338

If sharing all of your code with the closed providers is OK then it works. If that is a blocker, open weights becomes much more compelling...

joquarky · 2026-04-24T19:41:30 1777059690

What will you do when they stop burning cash and the $200 plan becomes $2000?

jawilson2 · 2026-04-24T21:22:22 1777065742

I think the problem is that we're all waiting for the patented Silicon Valley Rug Pull and ensuing enshittification, where there are a dozen tiers of products, you need 4 of them, and they now cost $2000/month. I want to hedge against that.