I never want to hear from developers again that they are not susceptible to mark...

Aurornis · 2026-05-30T15:25:04 1780154704

> We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

> Couldn’t tell.

Why would you expect them to be able to recognize the signature of a model from a pair of PRs? I don’t understand why you think this is a useful test for anything when we have numerous benchmarks that run 100s of tests on models and both GPT-5.5 and Opus-4.8 perform similarly.

I have subscriptions to both. I run both on max reasoning. It is interesting to see the relative strengths and weaknesses of each model. You won’t always see it if you’re just scanning code. Sometimes one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

antirez made a comment that he and others found GPT-5.5 to be better at the optimization tasks he was working on than Opus. There are other classes of tasks where GPT-5.5 consistently stumbles where Opus will get a solution quicker. Lately I’ve been working on some code where neither model comes up with a good solution. That’s just how LLMs go.

The only reason you have seen more activity about Claude is that they got there first. Codex has been a step behind and GPT couldn’t match Opus at first. You’re testing them after they’ve closed the gap.

vunderba · 2026-05-30T16:39:44 1780159184

Yup, OP is conflating so many things that the comparison has all the scientific rigor of the Pepsi Challenge.

For a developer using an LLM on a daily basis, the experience is about much more than just the resultant code.

There’s everything from:

- how often you had to manually steer the model

- how frequently you needed to course-correct

- how much detail you had to provide up front

- how was the interaction process (sycophantic, etc)

- how well did it handle MCP and external tooling?

- how effectively could it pull in additional information from external sources such as the web?

- how fast did it produce code?

- how much did it cost?

Many of my friends who are devs use things like OpenCode CLI with Openrouter because they switch between the various SOTA models so often. Just because you saw a Claude "meetup" doesn't prove anything other than somebody chose the name because it resonated more than "Generic LLM Meetup".

thefounder · 2026-05-30T17:36:51 1780162611

The answer is: I usually let it do its thing with bypass permission and I run the max plan so nothing really matters except the "result". I think Claude is faster and has better UX integration with vscode but I wouldn't use it without GPT 5.5 XHigh as reviewer. Claude is just sloppy. Eventually I think it will not matter much in 1-2 years. Most AI models will be good enough for most tasks so you may need the best of the best only if you do very complex stuff (i.e. optimizations etc)

vunderba · 2026-05-30T17:44:09 1780163049

I've actually settled on a very similar workflow - I mostly use Claude 4.6[1M] with adaptive reasoning disabled on High/Max for implementation, and then I'll do some combination of manual review in conjunction with GPT 5.5 xhigh.

Wowfunhappy · 2026-05-30T16:37:56 1780159076

Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at? One thing I'm still struggling with is figuring out what types of code LLMs can vs cannot write.

addaon · 2026-05-30T17:26:57 1780162017

C code formally proven correct with Frama-C WP has been... marginal. The models do better than I expected at the proof portion (with ChatGPT 5.5 seeming to have a meaningful lead), but they all have a hard time (a) writing really good C code to begin with and (b) with compliance around not modifying C code semantics or performance as a cheat to simplify proof obligations. They also tend to be insanely and consistently verbose on the first proof pass... e.g. 8 lines of C code might end up at 200+ lines annotated and proven, but after simplification passes end up at 40 lines. I find I spend 90%+ of tokens on those simplification passes, and haven't really found a way to avoid the over-annotate-and-then-optimize tides by being a bit more sane the first time around.

amalcon · 2026-05-30T18:59:18 1780167558

Rules of thumb:

The more your toolchain (compilers, linters, etc) can statically verify, the better agents will do.

The terser the code, the better agents will do.

The more often similar problems have been solved in open source, the better agents will do. Agents seem particularly good at plumbing together different pieces of software.

Anything that requires a judgement call, as opposed to having one obvious way to do it, will get worse results from an agent.

As the scope of the request grows, agents get worse at it. This can be mitigated somewhat using various techniques ("write a plan", "do step 1 of the plan", etc) but never fully resolved. At some point the task is so big that it's necessary to do large parts by hand.

Aurornis · 2026-05-30T18:32:22 1780165942

> Kind of orthogonal to the discussion, but could you broadly describe the code you're working on that both models are bad at?

Commonly, anything that hasn't already been done across 100 different projects on GitHub.

Making a React app with a CRUD backend: LLMs are great. They've been trained on this.

Doing new work on complex non-public codebases or in niche problems that aren't commonly solved: Completely different story. Sometimes they'll find enough information to piece together a path toward a solution, but that doesn't mean it's a good solution. I also have to feed in a lot more context and even stop them when they go down bad paths frequently.

For the complex work I don't have the LLMs write code, but I may have them do a proof of concept. I have to write and understand everything myself. There are times when I'll think the LLM output looks good until I go through it line by line and realize it's done something completely unnecessary, or happened to get the right result for the wrong reasons. For unknown problems they're good at getting something to work through brute force if you let them consume enough tokens, but it may rely on OS safety nets or other fallbacks instead of being a proper solution. I always chuckle when they encounter intermittent errors and the first idea is to add a retry mechanism so the error is ignored.
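
To make the retry point concrete, here is a minimal Python sketch. Everything in it is hypothetical (the function names, the attempt counts, the logger name); it only contrasts a retry that makes the error vanish with one that stays bounded, records every failure, and re-raises when it runs out of attempts.

```python
import logging
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")
log = logging.getLogger("retry_sketch")  # hypothetical logger name


def retry_swallow(op: Callable[[], T], attempts: int = 3) -> Optional[T]:
    """Anti-pattern: failures are silently dropped, so the root cause is never seen."""
    for _ in range(attempts):
        try:
            return op()
        except Exception:
            pass  # the intermittent error is simply ignored
    return None  # the caller cannot tell "failed" apart from "returned None"


def retry_visible(op: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Bounded retry that logs each failure and re-raises the last one."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception as exc:
            log.warning("attempt %d/%d failed: %r", attempt, attempts, exc)
            if attempt == attempts:
                raise  # surface the real error instead of hiding it
            time.sleep(delay_s)
    raise AssertionError("unreachable")
```

Neither version fixes the underlying bug; the second one just keeps it visible so someone still has a reason to go find it.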

ryandrake · 2026-05-30T16:42:20 1780159340

I think the subscription pricing model kind of incentivizes developers (at least hobby developers) to pick one and go all in on it. For someone who has probably never paid $20/mo for a piece of software in their life, $20/mo is kind of a big commitment, and the pay-per-token schemes are reportedly much more expensive for the equivalent blob of coding they enable. So you "pick one," plonk down the $20, and use it as much as you can in the month so it's worth it. If you want to try the other one, you don't renew next month, and plonk down another $20 for the other one.

You can go back and forth and compare since you pay for both subscriptions, but is that a usual case? I'd guess most developers picked one in 2025 and haven't gone back. Just like most people just pick a bank for their checking account and never change it.

Aurornis · 2026-05-30T18:35:28 1780166128

> For someone who has probably never paid $20/mo for a piece of software in their life, $20/mo is kind of a big commitment

I could see this being true for a high school student or college freshman eating rice and beans in their dorm room. Many of us have been there.

For someone in a software development career, $20/month for tools is a trivial expense.

I think some people have a strong aversion to paying for any tooling, but I don't think the people carrying around their $3000 MacBook Pros are going to avoid paying $20 for a month to try something new if they're using it daily.

osigurdson · 2026-05-30T15:44:25 1780155865

Exactly. Popular opinion is behind reality by several months. Claude used to be significantly better, now it is basically the same.

bluebands · 2026-05-30T15:53:19 1780156399

Claude has been behind since GPT-5. Claude Code just looked cool and had better marketing

cpchander · 2026-05-30T16:08:49 1780157329

Claude is more reliable in production, with fewer errors and a better understanding of instructions. That's why the valuation shifted: technical people are choosing Claude for actual shipped products.

saberience · 2026-05-30T16:21:40 1780158100

This isn’t the case at all, the most technical and best engineers are all using Codex now and have been for roughly six months.

It’s a known “secret” for a while now how much better Codex is than Claude. I’ve used both since they were released and I often implement in both to compare and 95% of the time Codex writes better code and also less code!

Claude is only really better at front end design.

chromadon · 2026-05-30T16:58:17 1780160297

How could you possibly know all the most "technical and best engineers"? Wait... are you a Codex instance?

saberience · 2026-05-30T21:02:10 1780174930

Just follow respected folks on X who all stated they use Codex now over Claude.

theossuary · 2026-05-30T16:54:58 1780160098

You're being silly. The actual technical people are using Claude for implementation and relying on MCP servers to use Codex 5.5 and Gemini 3.1 pro to build teams, councils, and long running senior engineer conversations within Claude to handle the technical bits that're too complicated for Claude.

7thpower · 2026-05-30T16:30:27 1780158627

You must not be talking about the Anthropic endpoint…

amazingamazing · 2026-05-30T15:41:23 1780155683

I am not sure why the past matters here. I am talking about now, it is a fast moving space.

As for the test, of course the output matters. Take image models for example. Differences are clear as day.

Should the fact that OpenAI existed before Anthropic did at all matter? No, imo. I would have used opus 4.8, but it only just came out. It's a fast-moving space.

jnovek · 2026-05-30T15:49:15 1780156155

Correct output is table stakes. Your test only shows that the products work as advertised, it doesn’t reveal reasons why people prefer one to the other.

You’re guessing that it’s a result of advertising, and I agree that that’s probably a component, but it’s a mistake to assume that they are interchangeable when you have people saying to you directly “I use both and they’re not.”

Maxatar · 2026-05-30T16:12:35 1780157555

This is an incredibly silly comparison. It amounts to claiming that a Ford Pinto is just as good of a car as a Rolls-Royce by simply observing that both cars got a person from point A to point B. After all, once someone reaches their destination you can hardly tell what vehicle they actually used to get there, but that doesn't mean there's no difference between vehicles.

What matters most in state of the art models isn't simply the final destination, it's the process of how one arrives to that destination.

vohk · 2026-05-30T16:44:05 1780159445

I think your analogy makes the opposite case better. A Rolls-Royce and a Pinto have the same real commute time because horsepower isn't the bottleneck, and they both get passengers from point to point. Sure the Pinto explodes a bit but much like the actuaries at Ford, you might well judge the cost of an occasional explosion to be a trade-off you can easily compensate for.

I would argue the process these days has more to do with the harness than the model, at least when we're talking about the SOTA options. Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively. Almost all of the out-of-the-box tutorials and skills and frameworks are built for Claude first, then Codex maybe.

I'd go further and say that CC and Codex are not even the best harnesses available, they just offer the most subsidized rate plans.

ethbr1 · 2026-05-30T17:32:45 1780162365

> Claude Code's biggest advantage isn't Opus, rather it's the shared knowledge the community has been building and sharing around using it effectively.

This. Never underestimate the ability of a large number of power users to substantially improve the actual utility of a complex software product.

They always have more time (and sometimes more skill) than a product's developers.

Sometimes the quantity of monkeys matters more than the quality of the typewriters.

amazingamazing · 2026-05-30T16:16:51 1780157811

In my test the prompt was the same and all suggestions were auto accepted so indeed there was no difference other than model and harness. The amount of characters typed and interaction with the harnesses were exactly the same.

kbenson · 2026-05-30T17:53:39 1780163619

To keep with the analogy, isn't that sort of like testing two cars by having them both drive the same few hundred foot stretch of new road at the posted speed limit of 35 MPH? You will test some things doing that, but not particularly well, and hardly all the things people find interesting and useful for comparing the performance of cars.

To bring this back to the discussion at hand (and to be redundant, as it's been mentioned here already), there are many aspects of using an LLM that are not purely about the output from a single or few well formed prompts. Additionally, if the end results are very similar, these other aspects will have an outsized influence on people's perspective of the tools, as they're the only differences worth choosing one model over another.

epistasis · 2026-05-30T16:04:10 1780157050

If I were to give one carpenter a set of fine hand tools, another a full workshop with power tools, and they both made a picnic table to the same spec, and at the end I wasn't able to tell which came from which, would you say I have come to a fair metric for which type of tooling to use for wood working?

amazingamazing · 2026-05-30T16:18:23 1780157903

If the effort was the same as was in my test, yes.

epistasis · 2026-05-30T17:23:44 1780161824

The amount of effort is an absolutely critical comparison, right? That's been left out, yet you keep on harping about how the outputs are the same, ignoring all the many many many comments that are talking about the amount and type of effort.

In fact, after seeing all these comments about the amount of effort, you redirected to calling that mere "vibes":

> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down

Which, again, is a highly emotional way to view people trying to say that the process matters too. Calling people "vibes based" or "highly susceptible to marketing" and saying they take part in "tupperware parties" rather than evaluating their experience with tools is quite a thing to see, a complete dismissal of professionals' core experience as "vibes" rather than something intrinsic to how they perform labor.

SiempreViernes · 2026-05-30T16:31:03 1780158663

Wouldn't the question be if they could tell the tables apart by quality (after insisting one of the two parties made things of superior quality)?

amazingamazing · 2026-05-30T16:40:29 1780159229

oreally · 2026-05-30T16:33:12 1780158792

To add on context, the experiment you're giving is called a *blind judging test*. Remove the branding and labels, and let judges sample the results and see if they can tell which is ranked correctly.

Some examples are blind wine tasting tests. There are instances whereby some journalists invited renowned/established wine tasters and subjected them to blind wine tasting tests. Turns out the judges couldn't tell which was which. Pretty embarrassing.

It speaks volumes as to how people can accurately judge the value of things. There is research by some network scientist that says you generally can't tell the 1% from the top, though you can tell the really bad from the generally good. What OP's experiment might tell us is that the LLM competitive advantage is so small no one can tell which is objectively better.
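
As a rough sketch of how such a blind test could be scored, assume each judge makes independent two-way guesses and that pure guessing is right 50% of the time. The function name and the counts below are hypothetical, not figures from OP's experiment.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= correct) for a judge who is only guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical counts: a judge labels 10 PRs and gets 7 right.
print(round(p_value_at_least(7, 10), 3))  # 0.172
```

Under these assumptions even 7 out of 10 is quite compatible with guessing, so a test with only a couple of PRs cannot separate "no difference" from "not enough trials".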

riedel · 2026-05-30T17:26:57 1780162017

Actually it would be fun to try to test the developer personality of the models.

Actually there is a nice body of work by Steven Clarke on cognitive dimensions of notations/APIs and the interaction with developer personalities.

I wonder if the same holds for AI models and harnesses.

kaydub · 2026-05-30T17:38:34 1780162714

I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

I flip between models all the time. Makes little difference. Sometimes one model is faster or better than another but there's no rhyme or reason why.

mpyne · 2026-05-30T17:47:20 1780163240

> I just don't believe non-deterministic tools can actually be benchmarked. It's all hoopla to me.

We benchmark non-deterministic things all the time and it's frankly not even that unusual or hard. You yourself indicate that one model outperforms another one in your experience on various facets, and that is itself a benchmark.

The more relevant question is probably how well does a given benchmark translate to improvement on a specific desired outcome or task. The military uses the ASVAB testing battery to benchmark potential new recruits for suitability in various career specialties, but the actual outcome the benchmark is meant to correlate with is later success in the training pipeline.

So every so often the various military branches have to go and compare ASVAB results against training results and make sure that they still have a predictive relationship.

And this is benchmarking real flesh-and-blood human beings where you get on the order of magnitude of a million data points or so per year. You can benchmark AIs much more efficiently than that, as non-deterministic as they are, and as long as the benchmark itself is reasonably predictive of outcome it's going to be useful information.

kaydub · 2026-05-30T18:18:33 1780165113

My anecdotal experience isn't a benchmark. Just because I feel like something is better or different doesn't mean it actually is.

mpyne · 2026-05-31T04:11:01 1780200661

> Just because I feel like something is better or different doesn't mean it actually is.

Of course, but it is a data point, and multiple such data points can be aggregated. This is true even if all you can do is compare two things.

The shape of that data will reveal something more about the thing you want to measure than the null hypothesis you'd otherwise have.
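
A minimal sketch of that kind of aggregation, assuming each data point is one person saying which of two tools they preferred. The counts are made up for illustration, and the Wilson score interval is just one common choice for putting an uncertainty range on a proportion.

```python
from math import sqrt


def wilson_interval(wins: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the share of comparisons won by tool A."""
    if total <= 0:
        raise ValueError("need at least one comparison")
    p = wins / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical: 40 developers each report a preference; 26 pick tool A.
low, high = wilson_interval(26, 40)
print(f"share preferring A is roughly between {low:.2f} and {high:.2f}")
```

A single "I feel like A is better" says little, but the width of the interval shrinks as more comparisons come in, which is what makes the aggregate informative.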

drawnwren · 2026-05-30T17:42:58 1780162978

All tools are non-deterministic on some reasonably specified input set.

fmbb · 2026-05-30T17:38:22 1780162702

> Sometimes one will spin for a long time on certain problems where the other has no problem finding the appropriate parts of the codebase and getting an efficient solution.

Surely this is just due to the random nature of these stochastic parrots?

Do you mean you have identified a class of problems Claude always stalls on and another class of problems Codex always stalls on? What identifies these different classes of problems you see? How would you say Claude is stronger than Codex and vice versa? Why?

epistasis · 2026-05-30T14:50:10 1780152610

Calling this a "tupper ware" seems a bit emotional, you're intentionally disregarding many things that matter for devs in order to try to claim equivalence, rather than paying attention to the actual process of software creation.

For example in your "test" you're only looking at output and ignoring the entire process of creation.

In addition to that process, you're ignoring that Claude Code was first and better for a long time, why would people switch for something that produces the same output? Claude Code has been way ahead in the process of agentic software creation for a long time, I still prefer its features. Even though I think that Opus 4.7 was a big step backwards, and I've been getting worse results seemingly every day with the churn of features at Claude Code, some of that may also be me testing the bounds of how little I can specify and still get acceptable results, so it's hard to know.

Calling all these concrete realities "marketing" is itself you trying to market Codex as "good enough" instead of paying attention to how we got where we are and where we will go in the future.

mold_aid · 2026-05-30T16:23:09 1780158189

>Calling this a "tupper ware" seems a bit emotional

Calling this "emotional" seems a little weird

amazingamazing · 2026-05-30T15:42:55 1780155775

Tupperware party is a particular thing about the social framework around promoting corporate goods.

epistasis · 2026-05-30T15:56:17 1780156577

You might want to expand on that a bit.

Tupper ware parties were a way for housewives to make a bit of money on a pyramid scheme, socialize, and have fun.

Are you suggesting that Anthropic is giving kickbacks to devs that talk about their positive experiences with Claude Code? Seems false, so I don't think that's it. Are you saying people are having fun talking about Claude Code socially as an escape from their everyday routine? Also seems false. Are you talking about how it's mere housewives that are supposedly easily susceptible to marketing? Or are you assuming that we all think housewives only bought Tupperware because they are mindless sheep? That seems to be what you are implying, but I don't agree with that characterization of housewives' tupper ware parties either, as it's merely an emotional, dismissive mischaracterization. And even if it were a correct characterization of Tupper ware parties, it's obviously nothing like anything I have seen anywhere with Claude Code, and I'm a freelancer with insight into several different sizes of companies and cultures over the past year.

amazingamazing · 2026-05-30T16:08:02 1780157282

https://claude.com/community/ambassadors

https://www.tupperware.com/pages/host-a-party?srsltid=AfmBOo...

It really is the same thing. You and others get more credits or gifts, social gatherings, expanded opportunities, etc.

epistasis · 2026-05-30T16:21:39 1780158099

I've seen OpenClaw events, but I've never seen an Anthropic event, anywhere.

Are you actually asserting that Claude Ambassadors are a significant fraction of the cause of adoption? If so, why have Codex Ambassadors been so much less successful?

https://developers.openai.com/community/codex-ambassadors

If you've met people that have been to these sorts of things, sure, I guess I can sort of understand your post, but come on, who has even heard of this sort of party on HN?

I've been going to Python data meetups, Machine Learning meetups, etc, back to the times when AI was an uncool word whose usage would mark the speaker as completely incompetent. I guess you could call them Tupperware marketing parties but come on, it's just an emotionally charged way of describing a normal way of exchanging information amongst professionals. Ambassador programs? Yes, cringe, but seriously who has even seen an actual "Ambassador" or taken them at their word rather than viewing them as a detriment to the thing they are advertising?

bluebands · 2026-05-30T15:56:10 1780156570

Claude Code was first by a few weeks and only better for those few weeks! Have you used Codex in 2025?

412876 · 2026-05-30T15:00:58 1780153258

No, Tupperware is the exact analogy. As you point out though, the multi level marketing applies to all models. Anthropic is just the most aggressive, especially here.

Software developers are the most susceptible of all population groups for amplifying their employers' new whims. There are true believers and useful idiots, but many are just mediocre and know that playing along will further their career for a couple of years.

In the end they will be fired anyway of course.

afavour · 2026-05-30T15:18:22 1780154302

You're overestimating the extent to which individual developers have a choice here. My employer signed up for a Claude Code membership, I use Claude Code. I cannot use Codex.

Anecdotally I hear of folks with workplace Claude Code subscriptions all the time. I'm not sure I've ever heard someone talk about their workplace Codex subscription. Anthropic clearly did a far better job chasing corporate customers while OpenAI was busy chasing consumers with Sora etc.

Aurornis · 2026-05-30T15:26:30 1780154790

The OP seems unaware that Claude had a lead in this space and captured market share and attention for that reason alone.

The test they (supposedly) ran with their coworkers to look at PRs from both is such a bad way to compare LLMs that I don’t think they’re very experienced with using them.

bluebands · 2026-05-30T15:54:39 1780156479

Claude had a lead for a few weeks. When Codex launched it was better and it has been (marginally, yes, but still) better since.

It's marketing

jnovek · 2026-05-30T15:59:16 1780156756

Having a lead in a new market even for a short time creates initial conditions that are in your favor. That’s why startups get all fired up over being first to market.

bluebands · 2026-05-30T16:06:55 1780157215

Sure, but OpenAI has caught up with such velocity (and frankly has the better models) such that it's kind of irrational to refuse it based on "vibes"

NewsaHackO · 2026-05-30T21:51:48 1780177908

I know it is irrational vibes, but how the whole Department of Defense situation was handled will always make me partial to Anthropic. Unless there's some huge shift, of course.

siva7 · 2026-05-31T00:35:17 1780187717

No, it wasn't, and no one who used Codex and Claude since release could seriously make this claim. It was an order of magnitude behind Claude and only caught up recently.

ignoramous · 2026-05-30T15:29:55 1780154995

> The OP seems unaware that Claude had a lead in this space

I remember using GitHub Copilot (OpenAI "Codex" mk1) in Aug 2021 (ChatGPT would launch a year later 2 weeks after Meta's botched Galactica release). Cursor & others took it and ran a mighty good race.

ixaxaar · 2026-05-30T15:57:46 1780156666

Okay so regardless of the model, platforms provide attestation from end to end that nothing is being logged from either input or output, including at the firmware and OS level to the extent that the customers have proof of the data never being saved. AFAIK, both GPUs and TPUs support this.

dboreham · 2026-05-30T16:41:01 1780159261

I have lots of choice (I own the company) but I'm still not going to switch from Claude until I see evidence that the alternative is meaningfully better. So far I don't see that evidence. In the past I've looked at using competitive products and it turned out to be a painful experience (Cursor didn't work at all on my computer, Google thing -- whatever it was at that time -- required dependencies I wasn't willing to install). I'm sure these issues have been resolved since but why would I spend time kicking the tires of another product just to have it work "as well"? Claude's cost to me is minimal so there's no cost savings to be made.

fwiw nobody "marketed to me". I picked Claude because friends were using it with great success and they helped me get started with suggestions on prompt style. Before that I'd played around with various LLMs for coding but not done any actual production work.

irthomasthomas · 2026-05-30T15:23:18 1780154598

Corporate accounts pay the full api price, so I don't know what is stopping them or you from also using codex on the same terms?

afavour · 2026-05-30T15:26:10 1780154770

Intellectual property. My employer has an agreement that our code will never end up as part of Claude's training data. At this point there are also now custom Claude integrations etc.

I'm sure they could also negotiate a similar deal with OpenAI but in my outsider experience it seems that negotiations around these kinds of corporate contracts take forever and when the selling point is "they're broadly pretty similar" I suspect the motivation isn't there.

bayarearefugee · 2026-05-30T17:48:27 1780163307

> My employer has an agreement that our code will never end up as part of Claude's training data.

Bit of a tangent but it is funny to me that so many companies talk about how some large percentage (sometimes 100%) of their new code is LLM-written and they still bother to worry about their code being used for training.

If an LLM is writing all your new code then your existing code certainly wasn't some unique special secret, and your new code came from the LLM to begin with.

Barbing · 2026-05-30T15:39:23 1780155563

> My employer has an agreement that our code will never end up as part of Claude's training data.

“Our competitive advantage is that we believe them,” I’ve read—wonder if that’s still a [prevailing] sentiment.

(Edit - context was probably using SotA models instead of being limited to local open source only)

mlsu · 2026-05-30T15:25:45 1780154745

I think the marketing campaign came first. Anthropic captured developer mindshare first, then they brought it to their companies.

epistasis · 2026-05-30T15:50:01 1780156201

Claude Code was a huge huge huge step up when it came out, absolutely massive.

It was barely marketed. I always turned copilot off, never found any benefit from Cursor. Claude Code was vastly different in conception, function, and capability, a product that defined an entirely new category of product.

Perhaps to others, that found copilot or cursor useful, it was merely marketing. But to me it was function and productivity, that I had never seen before.

People try to dismiss these things as LLM wrappers, but the LLM will be commoditized, and the wrapper will be where the real product design goes and where the real differentiation happens. Owning that unique process of communication between the dev and what the dev wants, figuring out the most stuff with the least complete spec, and maximizing every bit of the very tiny communication channel between the dev and the LLM and the code on disk, that's where 2026 and 2027 will be focused, until the next category defining product is created.

theptip · 2026-05-30T16:06:09 1780157169

Sorry, no. Claude Code was the product that brought coding agents to the mainstream. Sonnet 3.5 was the model and harness that created vibe coding.

This was a push of the technical frontier, not a marketing achievement.

dilyevsky · 2026-05-30T17:40:41 1780162841

I started using Claude web with sonnet over chatgpt before any of the coding tools came out and noticed other founders were using it too and the reason was pretty simple - it was much less likely to hallucinate non existing APIs than ChatGPT

jnovek · 2026-05-30T14:22:56 1780150976

I can’t tell the difference between code written in vim or vs code but it matters substantially to the person writing the code. There’s stuff beyond just the output that goes into tool choice.

neosat · 2026-05-30T14:39:29 1780151969

Your argument is fine but different from the claim the OP is making. You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output. Subjectively, people might still prefer one over the other due to anything from design to marketing, but that's very different from the claim that X is better than Y for coding (see: "A colleague was convinced Claude is better"). Basically, "I prefer Claude" is a different claim than "Claude is better", and the latter has a higher bar of proof.

spider-mario · 2026-05-30T14:50:05 1780152605

> You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output.

You definitely can in principle; that’s the entire point of the comment you are responding to. If one tool completes it in 10 minutes with little hand holding, and the other does it in one hour at 4× the cost and while needing a lot of steering, the former is arguably better even if the end result is the same.

Whether that’s specifically true and demonstrable of GPT and Claude is another question, but your blanket statement doesn’t hold as a general rule.

neosat · 2026-05-30T14:58:30 1780153110

That's a fair callout and I agree my statement was too general in just mentioning 'output', as you correctly pointed out. To define 'better' you would indeed need to agree on the dimensions you would evaluate candidates against.

I think a more appropriate rephrasing would be 'You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference on dimensions you care about'. In the case of the latest Claude Code vs Codex (with GPT 5.5), both are similar enough in the dimensions people will care about in evaluating (vs. differing wildly in cost or time taken).

runako · 2026-05-30T14:57:54 1780153074

This obviously correct take will get pushback, so let me add some other examples:

- which tool required more detailed goal-setting in the prompt?

- did one tool ask follow-up questions up front vs spread out over implementation?

- did either tool match existing coding styles?

- did either tool remind you about potential conflicts between what you asked it to build and other parts of the codebase?

There are a lot of ways to compare agents besides just the code. (Similarly, working engineers are not evaluated just on their code output.)

SiempreViernes · 2026-05-30T16:41:50 1780159310

The colleague implicitly agreed that comparing the output was a valid way to settle the matter as they took part in the test, so they weren't using "better" in the way you propose.

spider-mario · 2026-05-30T19:11:07 1780168267

I wasn’t really discussing the colleague, but either way, from:

> A colleague was convinced Claude is better so we played a game. We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code.

I don’t think it’s obvious that they specifically agreed that losing the game meant that. They might just have thought “sure, it might be fun”, if they even gave it that much thought.

“So we played a game” is rather vague and I feel it’s a bit of a leap to read it as: “as an explicit outcome of their claim that Claude is better, we made a formal bet as to whether they could tell the difference in the output, the failure of which would mean a full retraction of their statement”.

skillina · 2026-05-30T15:09:13 1780153753

Claude and Codex are tools. You can't tell the difference in the output between something that was done with a ratcheting wrench vs a standard combination wrench, but your mechanic certainly knows the ratcheting wrench is better (for most tasks).

I've not used Codex to compare against, so I'm not claiming X is better than Y, but comparing tools simply on their output is naive.

bluegatty · 2026-05-30T15:24:51 1780154691

"You cannot simply make a claim that (model + harness) X is better than Y, but then have no discernible difference in the output"

Sorry I think this misses the mark.

Because it's not the output but the process.

And sometimes the outcomes are not always discernible.

Codex and Claude are very different.

I use them for different things.

Their behaviour difference is obvious.

Of course it'd be impossible for anyone to tell by looking at my code base 'how it was written'.

neosat · 2026-05-30T16:06:17 1780157177

You need to see the response in light of the original discussion. Referencing here for clarity since I should have included it in the first place: "We used the claude code and codex harness and I implemented some prs they needed with gpt5.5 and opus4.7 and asked them to identify which came from which only from the code."

So the same person was using similarly competitive tools, and showing that the output was hard to discern (indirectly the implication was also that implementation was fairly trivial in both of those). A better analogy would not be different processes and widely different tools but, for example, two power drills. Sure, folks could still prefer one over the other, but that's a different claim than saying X is objectively better than Y when both are directly competing on very similar dimensions.

Assuming you meant Claude code: I'd love to learn more about "Codex and Claude are very different" because maybe I'm assuming just based on my use case where I use both of them interchangeably for the same thing (coding web and mobile apps)

bluegatty · 2026-05-30T18:04:33 1780164273

It's not reasonable to compare results from two different tool sets, especially as they are guided by humans.

The only way a reasonable comparison could be made, would be to compare completely automated results from either technology - that would be useful.

For example - creating a 'pre-baked script' and running it on both to see the output.

Codex and Claude are obviously very different, though it's hard to characterize how those differences might apply exactly to a given problem.

Two 'very different power saws' will ultimately build the same home.

jnovek · 2026-05-30T15:08:18 1780153698

> A colleague was convinced Claude is better

That’s actually what my comment was based on; raw code output isn’t the only measure of quality. Engineers write better code if they have the tools they prefer.

SiempreViernes · 2026-05-30T16:36:12 1780158972

The colleague participated in the test though, so apparently the colleague didn't object to "better" being interpreted as "makes better output".

SiempreViernes · 2026-05-30T16:39:58 1780159198

If you told someone "I think vim is better for writing code" and they proposed the comparison above as a way to prove it, would you accept and take part in the test?

Apparently the colleague did take part, so I think the evidence we have is that the colleague agreed with the interpretation that "better" was "produces discernibly better code".

amazingamazing · 2026-05-30T14:37:37 1780151857

> There’s stuff beyond just the output that goes into tool choice.

Yup, like billions of capex. Unlike vim.

grayhatter · 2026-05-30T14:40:24 1780152024

I'd bet I could tell with a result somewhat better than random chance.

While there is no meaningful difference in the ability to write code, vim has earned its reputation for having a learning curve. I'd argue that predisposition, that requirement for an additional investment of energy, will bias the results towards attention to detail and pure minimalism.

davidguetta · 2026-05-31T21:19:22 1780262362

yeah but you don't pretend vim is better

utopiah · 2026-05-30T14:30:29 1780151429

Ah that's always SO fun. It doesn't matter how "smart" the person actually is (or thinks they are): we are ALL susceptible to influence, and blind tests are shockingly simple to implement.

Convinced you can distinguish A from B? Ok! No problem, let's try! Can be at the dinner table for fancy wine or with agents, it's all the same, you try an option, another option, maybe all options from the same, and if you reliably can't tell well kudos, you are just like the rest of us!

It's easy to "know" in retrospect but blind test is where genuine difference can be found. Or not.

api · 2026-05-30T14:53:46 1780152826

It’s also true in every other realm. Governments, think tanks, political parties, and activist groups use propaganda because it works.

I sometimes wonder how much of what I believe is bullshit I was fed through intentional propaganda. I do think as I’ve gotten older I’ve gradually identified and challenged some of it.

MichaelZuo · 2026-05-30T15:17:23 1780154243

Isn’t this obvious?

Over half of HN commentators visibly struggle to piece 3 or more complex ideas together.

How could anyone, who spent more than 30 minutes reading HN, expect otherwise?

tempest_ · 2026-05-30T15:55:59 1780156559

HackerNews is social media and this is just representative of social media as a whole.

Critical thinking is at an all-time low to start with, but even if you attempt to think critically while using social media you cannot do it constantly. This is one of the problems with social media as a whole. You might notice one thing is not quite right and discard it, but you can't do that constantly and eventually you will absorb one of the 15 posts or comments.

brookst · 2026-05-30T14:17:53 1780150673

This is like saying you gave a Taylor Swift fan sheet music from 1984 and from Michael Jackson’s Thriller and they couldn’t tell the difference.

I have a strong affinity for Claude Code because of the interaction experience and overall tone / vibe / process. I am 100% willing to believe the code it produces is identical or possibly less good than Codex.

I enjoy working with Claude in a way I just don’t get from OpenAI. YMMV, you may feel just the opposite. But it’s a mistake to look at the produced code as the only dimension of these products.

tasuki · 2026-05-30T15:49:59 1780156199

I have a disaffinity for Claude Code because it's unnecessarily big, closed source (disregarding the leak), and I have a strong feeling it'll be shittified in the future because of all the investors waiting to cash out (and perhaps even earlier by vibe coding).

I have an affinity for small open source tools that do one thing and do it well. But those are just my preferences and I feel a little bit like an alien :)

brookst · 2026-05-30T18:20:57 1780165257

I hear you. I’m just in the camp of using the best tool available today, and if things change and another tool becomes better (either because the new tool is an improvement, or because the old tool gets worse) then I will switch.

Perhaps because I am notoriously terrible at predicting the future. I gave up on that after passionately and exhaustively trying to convince everybody I knew that OS/2 was the future.

dboreham · 2026-05-30T16:44:43 1780159483

> it'll be shittified in the future

This happens to everything from which a profit is extracted.

Perhaps there's a way to fund the training of "actually open source" models, but so far we don't have that (unless you count the Chinese government).

tasuki · 2026-05-30T17:57:44 1780163864

> This happens to everything from which a profit is extracted.

Yes, hence my affinity for small open source tools!

> Perhaps there's a way to fund the training of "actually open source" models

I meant Claude Code the agent harness, not the model. Models are an entirely different can of worms!

bluegatty · 2026-05-30T15:26:41 1780154801

If it were a matter of 'enjoyment' then the OP would have made his point.

There should be a material difference between the tools.

There is.

vim / emacs / jetbrains - different tools to produce code.

Codex and Claude are different.

matusp · 2026-05-30T16:43:02 1780159382

Can you give me some examples of these interactions / vibe?

brookst · 2026-05-31T12:45:41 1780231541

I’m working on an AI-assisted music composition and criticism tool (giant project, may or may not pan out). It covers audio (samples, levels, etc), theory (harmony, rhythm), genre (classical, Motown), song intent, etc.

Working on the melody model, I asked Claude to thread it through those dimensions, both for composition and analysis. It’s a tough problem because there are heuristics but not rules for melody, so you have to come at it in layers: pitch and dynamics for analysis, intent -> genre -> harmony for composition.

Lots of research and brainstorming, and I like that Claude will start implementing a plan we decided on, then say “hey wait this isn’t making sense” and pivot or change scope in sensible ways. For instance (btw its recent obsession with “honesty” is driving me batty, so that’s a good counter example right there):

> The honest fix: distinguishing shaped-from-aimless is genuinely profile-relative (it needs the declared intent) — so v1 should not verdict it. v1 classifies only what’s genre-safe to call without intent (static vs active), reports all the facts that feed the shaped/random judgment (step/leap, reversal, contour, alphabet), and defers the shaped-vs-random verdict to profile-relative phase 2b. This is the same honesty as performance deferring declared-profile grading to 2b — and it’s more deferred for melody because melody is more genre-relative than feel. Let me revise the lens accordingly.

bluebands · 2026-05-30T15:58:05 1780156685

talk about the quiet part out loud

"yea it's dumber but it's nicer to me and i like the cool flashing colors so i'll use that"

amazingamazing · 2026-05-30T14:22:10 1780150930

This is my point. The harness itself creates feelings that are positive, but the artifacts produced are similar.

It is like the employee who is slightly worse but is a brownnoser getting promoted more often.

And what do you know, that is what is happening. It is like the coke commercial with the nice music and beautiful person in the back.

Speaking of which, remember Pepsi Challenge? Coke lovers are like the claude code lovers.

hgoel · 2026-05-30T14:25:50 1780151150

But what they're pointing out is user experience, not marketing.

mewpmewp2 · 2026-05-30T14:24:47 1780151087

The creative output and time to direct, to deliver due to the flow will also be different.

And it really depends on the task. Is it a typical well-defined bug, or is it simple CRUD? Or does it require research, combining different sources of data in complex and creative ways?

This is also why benches never show reality, and the only real understanding comes if you actually try to build something.

9dev · 2026-05-30T14:28:21 1780151301

That's a weird way to look at it. Any car gets you to your destination, but some people prefer driving a sports car or an SUV. They get something out of it that isn't just a marketing delusion, but subjective joy from the interaction with one product over another.

tasuki · 2026-05-30T16:01:00 1780156860

Oh I've always thought people like SUVs just because they want a bigger and pricier toy than the other guy. Is it not so?

> isn't just a marketing delusion, but subjective joy

What is the difference? When a product is being marketed, isn't the subjective joy created by the marketing?

brookst · 2026-05-31T13:46:10 1780235170

The car metaphor isn’t working. SUVs range from $30k to $500k and up. Sports cars range from $30k to $500k and up. Sedans range from $30k to $500k and up.

I used to love sports cars because I’d get out to the track a few times a year, and enjoyed the performance engineering in them.

Today I enjoy and drive an SUV because I enjoy camping in remote locations that my old (awesome) sports car would not begin to be able to get to. In the future if I decide to build my own house, odds are I will find a pickup truck is a better fit than either sports car or SUV.

It’s weird to see these choices through the lens of marketing. Marketing may influence which vehicle I buy, but it’s hard to imagine sports car marketing tricking me into buying a Ferrari to go camping.

9dev · 2026-05-30T16:43:21 1780159401

Attributing all joy and personal preference people experience to marketing influence is a pretty cynical take, and also a pretty inaccurate and overly simplistic one: Some people might never be reached by a marketing effort, yet buy a specific car because they remember a childhood story, or like pronounced rounded shapes, or because the interior design appeals to their sense of fashion.

Or more specifically, for the case of coding agent harnesses, where many developers have experimented with a wide range of tools - someone might just favour the interactions with a specific one from their personal experience. Entirely unrelated to marketing.

tasuki · 2026-05-30T16:55:49 1780160149

Yes, of course not all personal preference is caused by marketing, but much of it is. Actual personal preference is whether you prefer running or cycling, the marketing part is which running shoes or which bike you buy.

> or because the interior design appeals to their sense of fashion.

Surely you'll grant me that the sense of fashion is mostly marketing in sheep's clothes?

Yes, I'll grant you that the choice of a coding agent harness is influenced by marketing to a much lesser degree than eg cars. I still think Anthropic does marketing way better than OpenAI!

[Edit:] I use the pi.dev agent. I was heavily influenced by its marketing: minimal and mit-licensed and espoused by the HN crowd. Do you think I read the source code and made an actual informed decision? Nah...

amazingamazing · 2026-05-30T14:30:23 1780151423

Luxury cars are indeed a good comparison. The subjective joy is a result of the delusion. That is why so much money is spent on such marketing to begin with. The analogous comparison would be if a blindfolded passenger turned out to prefer the Sienna to the 911.

wtetzner · 2026-05-30T16:04:45 1780157085

I suspect a blindfolded passenger might prefer the Sienna. I can imagine it might be easier to get carsick blindfolded in a 911.

But also, as a driver, there is a clear difference between a Sienna and a 911. The differences are objective, but of course the preferences are subjective.

mewpmewp2 · 2026-05-30T14:34:20 1780151660

I would actually say it is a luxury car where you have your personal driver and you are free to work on other tasks, and it gets you faster to the destination. Time to me is at least the most valuable thing.

hgoel · 2026-05-30T15:48:09 1780156089

I hope you can one day grow to understand that other people have preferences that are different to your own.

ctvo · 2026-05-30T14:46:37 1780152397

> The subjective joy is a result of the delusion.

Repeat after me:

_Other people can experience things you do not experience and it is still valid, and not a delusion_. They are not sheeple who fell for marketing.

tasuki · 2026-05-30T16:05:22 1780157122

I understand that's your opinion, but don't feel any inclination to repeat that.

Sure, the subjective joy is valid, and yet it was 100% induced by marketing.

> They are not sheeple who fell for marketing.

People generally fall for marketing. Why do you think these specific people didn't?

timfsu · 2026-05-30T16:31:08 1780158668

Imagine you try two products you’ve never heard of. You prefer one over the other. Was it marketing? That’s what’s happening here. Marketing can get you to try something you wouldn’t have otherwise, and it may suggest benefits you’d get if you tried it, but your preference of using one thing or the other is a subjective experience of your own.

tasuki · 2026-05-30T16:37:46 1780159066

> Imagine you try two products you’ve never heard of. You prefer one over the other. Was it marketing?

No, obviously not.

> That’s what’s happening here.

No, that's not what's happening here.

> Marketing can get you to try something you wouldn’t have otherwise, and it may suggest benefits you’d get if you tried it, but your preference of using one thing or the other is a subjective experience of your own.

Marketing can very much shape your preferences and create wishes you didn't have before. That's why companies invest so much money in marketing.

bibimsz · 2026-05-30T14:46:01 1780152361

this site is reddit 2.0

bilekas · 2026-05-30T14:14:25 1780150465

I think for developers the distinction is that ChatGPT is this commercial all-in-one solution for normies and Claude is specifically for developers. In reality, as you say, the results for normal developers are indistinguishable.

kube-system · 2026-05-30T14:35:15 1780151715

Maybe some people think that but there’s not really any meaningful difference in their offerings

FWIW most of the normies I know are using Claude

Frost1x · 2026-05-30T15:27:17 1780154837

The results are the same, but I’ve found the process to get to the results is just more pleasant with Claude. I can’t put my finger on it. Overall most of these models at the highest level are about the same in many respects, but the UI/UX for some is just more enjoyable, for lack of a better term.

Codex I feel the need to be very specific and precise with. Claude… I feel like I can be lazy, which I enjoy.

Both still need to be reviewed stringently, but I feel I can be more ambiguous with Claude and get better results than with Codex.

sebzim4500 · 2026-05-30T14:36:22 1780151782

I don't think it's marketing, for quite a long time Claude was clearly better and not everyone has adapted to the new reality where they have similar capabilities.

wincy · 2026-05-30T14:47:29 1780152449

I was really frustrated by GPT-5.4, but last night I really pulled out the stops and within a few hours I got path tracing and DLSS implemented on top of Godot, which doesn’t even support DLSS. Just to see if it could do it? And you know what, it did, which was absolutely mind blowing. It wrote like 5,000 lines of C++, I set up a mostly local asset production pipeline using GPT image gen, voiceovers using ElevenLabs API, and even background music using Suno via the chrome use extensions in Codex. I just wanted to see how far I could push this little dumb game my kids asked me to make, and my kids are like “wow our game looks so good!” These models are absolutely mind blowing. I didn’t want to go to sleep I was having so much fun.

slashdave · 2026-05-30T15:18:19 1780154299

Adapt to what? If they are the "same", there is no reason to move. Actually, there are reasons not to, if you care about OpenAI's behavior.

AnotherGoodName · 2026-05-30T14:53:07 1780152787

I don't think that's the only reason but you're spot on about OpenAI marketing being absolutely terrible. The primary product names of "Claude" vs "ChatGPT" highlights this remarkable difference. To the point where I'm seeing Claude completely take over the generic term for agent.

I do think OpenAI is doomed due to bad leadership. What you said (that the marketing is relatively terrible) and what others are saying here (that the product is worse) is damning isn't it? Are they really failing on all fronts?

notnullorvoid · 2026-05-30T16:29:58 1780158598

The marketing of Claude relies primarily on fear, and I don't think that will have lasting success. Using fear like that tends to backfire once people see past false talking points.

comboy · 2026-05-30T15:05:25 1780153525

1. It's 1 in 10 failures that can take half of your time or bugs that can take a long time to surface. Plus the way they change things largely depends on the current codebase (and how it was created)

2. In my case Codex seems to be writing more solid code, but I still use Claude most of the time because it's my witty rubber ducky and I can actually sometimes force some legit insights out of it. Codex is much worse at this. And whether that matters or not depends on the project.

regluous · 2026-05-30T14:13:07 1780150387

Everyone can be propagandised. It's a matter of pushing the right buttons.

slashdave · 2026-05-30T15:19:48 1780154388

Or pushing the wrong ones

2026-05-30T14:16:04 1780150564

[dead]

jnovek · 2026-05-30T14:21:24 1780150884

Seeing yourself as immune to propaganda probably makes you more susceptible to propaganda.

Edit: Oh they’re trolling, nm. :-/

site-packages1 · 2026-05-30T14:17:39 1780150659

I RAN to downvote this dunning kruger of a comment.

yoyohello13 · 2026-05-30T15:14:41 1780154081

I picked Anthropic way early on, before Claude Code even existed. Because they at least pay lip service to behaving morally. That’s the most you can hope for these days really.

notnullorvoid · 2026-05-30T16:39:40 1780159180

Before the DoD thing there wasn't much indication of positive moral stance, and they still have a rather negative moral behavior where their fear based marketing is concerned.

AndrewKemendo · 2026-05-30T15:18:40 1780154320

“…Hey but at least the tormentor in my panopticon gives you a high five after the skin harvesting”

This has to be in some Far Side gallery somewhere

jesse_dot_id · 2026-05-30T17:23:10 1780161790

It's a matter of what context is available to me at this time. I like LLMs. They improve my workflow to an insane degree. I think Sam Altman kind of sucks. I don't trust OpenAI. If they were the only kid on the block, I'd use Codex. It's entirely possible Anthropic sucks in the exact ways that OpenAI sucks but has better PR. I don't have time to deep dive to find out. I still like using LLMs. I started using Claude because Cursor, as a company, did something that I can't recall but gave me the ick. So I switched to Claude Code.

I still use Claude Code because I have the most experience with it now, and it's the harness that I understand on a granular level. If something comes along that is clearly better, or if it becomes clear that Codex is miles ahead, I'll try it and evaluate it. To your point, there doesn't seem to be much of a difference.

Arguing over this stuff feels kind of silly, like back in the day when my friends would give me shit for using mIRC instead of ircii or BitchX. I liked the GUI then because I did. I like Claude Code now because I do.

holistio · 2026-05-30T14:17:08 1780150628

Been to an Anthropic event in Paris last summer.

They served caviar. It probably had good ROI.

mgrunwald_ · 2026-05-30T15:00:08 1780153208

I don't think it's only marketing. OpenAI had the advantage of being first to the market, and in the beginning of the race it seemed that the future belongs to them. Then came the bad PR and unpredictable quality of their main product.

For general use, ChatGPT's answers have gotten worse over the last year. I abandoned it.

pyrale · 2026-05-30T16:31:18 1780158678

> I never want to hear from developers again that they are not susceptible to marketing.

Did you need to come to that conclusion?

Marketing has always been a significant part of new technology adoption. Whether it's for cloud adoption, for new programming languages, for new software development techniques, etc...

isityettime · 2026-05-30T15:08:03 1780153683

> i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down.

This is complicated by the way that the coding agents inject prompts that preempt and potentially undermine user instructions. I suspect that one of the reasons Codex works way better for me than Claude Code in certain projects is that the latter adds some garbage like "go ahead and write repetitive copy/paste code, keep it simple, take shortcuts" to every session. A fair test would have to hide but more or less still use the harnesses, not just the models.

scosman · 2026-05-30T15:41:09 1780155669

Benchmarking 1 or a few samples isn't ever going to yield anything but noise. The actual benchmarks use thousands of tasks.
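
To put a rough number on the small-sample point, here is a minimal sketch, assuming the "which model wrote this PR" game is treated as a series of independent guesses with a 50% chance of being right each time. The function name and the 14-of-20 figure are hypothetical and do not come from the thread.

```python
from math import comb


def p_value_at_least(correct: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial tail: P(X >= correct) under pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Two PRs, both labelled correctly: guessing alone manages this 25% of the time.
print(p_value_at_least(2, 2))  # 0.25

# Hypothetical larger run: 14 correct out of 20 blind guesses.
print(p_value_at_least(14, 20))
```

Under these assumptions even a perfect score on a pair of PRs says very little, and a miss says just as little, so any such game would need many more rounds before the result means much either way.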

GPT 5.5 genuinely was back on top for a while there, but if you look at the past 2 years, being on Claude was better than being on OpenAI most of the time. If you're going to pick a tool and not switch constantly it was the right choice. Not to mention their tooling has always been ahead, and that gets ecosystem benefits.

Are they close and interchangeable today? Sure. But Sonnet was genuinely way better than anything OpenAI offered for a long time -- the valuation reflects that, not any given moment in time.

bluebands · 2026-05-30T15:59:57 1780156797

okay what's a point in time where Claude was better? just give me a date

scosman · 2026-05-30T17:47:22 1780163242

until GPT 5.4, Claude always had a decent edge in benchmarks (any date before then). The gap was huge during the Sonnet 3.x vs GPT 4.x era.

dawnerd · 2026-05-30T14:39:54 1780151994

Pretty easy to tell depending on what the code is. GPT follows this pattern of using maybe_something names and uppercase constants by default. Claude is a little more natural but tends to include more fallbacks than gpt5.5

christophilus · 2026-05-30T15:08:53 1780153733

I find codex superior in speed and equal in quality, so it’s my preference. But Claude Code made prettier UIs last time I tested. Codex produces Microsoft-grade UIs. Very enterprise and ugly unless I actively steer it.

pflenker · 2026-05-30T15:35:19 1780155319

You confuse ease of using a tool with quality of output. A skilled carpenter can work both with high and with medium quality tools and prefer one over the other with no difference visible in the craft they produce.

mewpmewp2 · 2026-05-30T14:23:48 1780151028

I use both, enough to reach Codex highest personal sub limits and Claude is stronger to me specifically because of how the flow of building feels. So the PR for any random task would be irrelevant to me.

jjcm · 2026-05-30T15:08:35 1780153715

Very similar thing happened when I was at a design event a couple of days ago. I’d say it’s even worse on the design end - there was a big discussion around how to optimize your usage of Claude. Not optimize your usage of AI, but Claude specifically, as it was the only model literally all of them were using. The biggest issue is they were all hitting their usage limits. I asked whether they had tried other, lighter models (i.e. Gemini or Composer), and it was like I was speaking a foreign language.

dangus · 2026-05-30T15:19:23 1780154363

Maybe some of these companies will learn to stop appointing awful leadership then.

Having a sleazy CEO like Sam Altman or Elon Musk is a business risk. Many potential customers don’t like these people and they say abrasive and alienating things publicly.

Rolling over to the DoD’s desire for fully automated weaponry is more bad marketing. How many people switched from OpenAI to Anthropic over that? I sure did. Anthropic’s willingness to burn that bridge over an ethical stance said a lot about the company to me.

I’m not going to use OpenAI products for these reasons among others.

I’m also not going to use Cursor as xAI plans to acquire Cursor.

Maybe it’s foolish of me to avoid those companies for such petty reasons, but that’s not my problem. That’s their problem.

It takes years to build trust and hours to burn that trust to the ground. Customers can hold grudges for a lifetime.

This is especially true in a market with almost zero product differentiation.

Hippocrates · 2026-05-30T17:13:57 1780161237

The harness/UI that Claude Code brought was the thing that stole developer mindshare. That's when people stopped coding in IDEs. Nothing to do with the underlying model.

kaydub · 2026-05-30T17:37:00 1780162620

I've always interchangeably used the models.

I don't look at benchmarks.

It's a non-deterministic tool. A lot of the shit going on with LLMs just doesn't make sense to me. All the tooling around like MCPs, they're all just putting stuff into context. So to me the tools aren't really robust and they make little difference.

Lots of AI psychosis going on these days. And I say that as somebody that hasn't written a line of code since Sept 2025

bloggie · 2026-05-30T17:38:28 1780162708

Steam and other game stores are pretty much the same but Steam is more popular because every one of their competitors has decided to continually shoot themselves in the foot over and over.

Even if Claude and ChatGPT were exactly the same, Claude would be more popular because OpenAI has decided to make some very unpopular moves and try to make money where popularity isn't required. At the moment that popularity still seems to matter.

vr46 · 2026-05-30T14:50:25 1780152625

a) everyone is "susceptible" to marketing - so what

b) therefore a preference for Claude is marketing - complete bollocks

Either the tasks you chose were well below the capabilities of top models, or meaningful differences for preference are elsewhere, or both.

Your comment is probably energy-efficient and sustainable, however, because you could use it again and again when another comparison comes up, like Vim vs Emacs, or tea vs coffee

amazingamazing · 2026-05-30T14:51:37 1780152697

I can tell the difference between tea and coffee 100% of the time.

bwfan123 · 2026-05-30T15:53:00 1780156380

> Couldn’t tell.

add deepseek v4 to it, and it will be close at 1/10th the price. I use all three (codex, claude, and deepseek), and they are close.

api · 2026-05-30T14:53:01 1780152781

I have always found this field, especially in the last 10-15 years, to be incredibly fad driven to the point that it reminds me of things like fashion more than an engineering field.

It’s one of the things I don’t like about it. All humans are susceptible to herd behavior and influence but engineers should be at least a bit more hard nosed and reason more from first principles.

melenaboija · 2026-05-30T14:20:16 1780150816

Yes, which means that in the long run this looks ugly.

So much faith and money in this idea, and seeing how fragile it is, does not look good.

tim333 · 2026-05-31T10:21:08 1780222868

I don't think the success is due to marketing. They've been top of the LLM Arena leaderboard for most of the year which I think is blind AB testing. Most people on HN say they are best for code. I've never seen their marketing. Your post was the first time I even realised it exists.

__MatrixMan__ · 2026-05-30T16:40:45 1780159245

It seems we're moving past the point where it's all about model capability. opus4.7 behaves better for me than gpt5.5 because I'm familiar with its idiosyncrasies. Sounds like you've got a good balance between them.

At the end of the day what matters is which team is better, not which model. If Anthropic continues to feel like the good guy, relatively speaking, then people are gonna choose to spend more time getting to know its products and less time with OpenAI's, and on average Anthropic's will be the more capable teams.

I think vibes are gonna matter more and more going forward. The potential for bad behavior on the part of an AI company is severe. We're gonna have to tolerate whoever we enable in this space, so I propose that we make their marketing teams work as hard as possible to show us which will supply better vibes.

epolanski · 2026-05-30T14:35:13 1780151713

I don't think it's marketing, it's the "nobody got fired for buying IBM" effect applied to software developers choosing tools.

It's the same reason why most of the software out there keeps using bloated technologies that are most of the time the wrong fit for the product.

And the same applies to tooling. Nothing new.

jjice · 2026-05-30T16:58:44 1780160324

I found that the newest opus and 5.5 are definitely close enough where most of the work I do could be done with either. I've seen small differences in planning which I feel like Claude does do better, but I think both products are close enough where I wouldn't be upset if one disappeared.

unshavedyak · 2026-05-30T15:17:09 1780154229

> Edit: i bet 99% of people here, if presented with a test where i gave 5 models but all of the results came from one, would not be able to discern this. Just vibes all the way down.

I think you're missing one (or more) of the facets that go into what "better" means for each individual, subjectively.

Early on i hopped between all the providers. Code quality for SOTA at the time was pretty decent if you didn't ask it to solve challenging problems. However the thing i found most difficult is consistency in how it listened. Eg Gemini (i forget what version, not current) was super prone to focusing solely on the functionality/goal, but not any of the directions on how to write the code. It would throw in comments everywhere, document in a manner i didn't want, use abstractions i told it not to, etc.

How well a model would follow instructions to drop their horrible "isms" was the #1 criteria for me. If i have to constantly remind the model not to do X behavior then it's a terrible model.

With that said, that is why i chose Claude for the last N months. However i've stuck with Claude because dealing with these "isms" and their little behavioral nuances is a chore in itself. I've found you have to learn the model just as much as anything, and so the idea of hopping these days when i'm just trying to get shit done is not likely.

These days for me personally, Claude has to give me a reason to switch rather than me investing even more money (i'm on the 20x plan) in other providers. I'm definitely not committed to Claude Code, but i am tired of the LLM churn, tooling churn, subscription churn, and the general fear of which providers we can trust.

edit: In short, it's the interactive UX just as much as it is the final output.

_345 · 2026-05-30T15:56:21 1780156581

Agree wholeheartedly. I think that Anthropic has just invested more effort in creating a better DevEx than OpenAI, and so people just "feel" that claude code is better but they're about the same really, claude code might be 5% better at best.

duxup · 2026-05-30T18:10:15 1780164615

I certainly can’t tell.

I honestly think I’d need weeks of all workday testing to even form an opinion… and some in depth training before that to use each given tool right…

And then … I might decide I can’t tell the difference.

As it is I use Claude and I don’t have the time to properly compare.

andsoitis · 2026-05-30T15:37:39 1780155459

Instead of only having them evaluate the final output, you ought to also have a way to have them evaluate the process and agentic aspects in getting to said output. Claude Code outshines when you look at it end-to-end, in my experience.

illwrks · 2026-05-30T14:41:28 1780152088

Modern Tupperware party. 100% agree! That’s the best framing I’ve heard in a long time!

jrnichols · 2026-05-30T17:40:34 1780162834

The funny thing about Tupperware is that some of us have their products from many many years ago and they still work great.

I think we've had the same iced tea pitcher since I was 5 years old, for example. Solid.

Will we be able to say the same thing about Claude?

wongarsu · 2026-05-30T14:27:13 1780151233

Claude was the best for the longest time. GPT5.5 challenges that, but inertia is real

basilgohar · 2026-05-30T14:31:33 1780151493

You're comparing apples to oranges. Claude is a frontend overall product name, GPT5.5 is a specific model. Which model within Claude's offerings are you referring to? Opus 4.7, Sonnet 4.6, or something else?

wongarsu · 2026-05-30T14:48:57 1780152537

I am not referring to one specific model, I mean the entire Claude Opus line starting from about 4.5, vs the at-the-time equivalent OpenAI model

Google came pretty close at times

rjh29 · 2026-05-30T14:26:17 1780151177

It's crazy hearing devs on this site claim Claude is 10x better than all other AI solutions. I think it is fomo. Claude $LATEST_VERSION is perceived as the best and anything else is "missing out". New version comes out? Suddenly the old version is worthless, how on earth did anyone get work done with that?

Same reason people buy the RTX 4090 and 5090 cards - overpriced but they must have the "best". Never mind the diminishing returns trying to max out PC settings (3-4x performance hit for an almost imperceptible increase in graphics, ignoring DLSS) - it's the psychological cost of having to move a slider down a notch.

I've been using Google and now DeepSeek v4 and I am having absolutely no problems and it's a fraction of the cost. I'd love for Claude to be 10x better but it just isn't, for my use case anyway.

jnovek · 2026-05-30T14:29:27 1780151367

I’ve been using DeepSeek V4 in OpenCode exclusively for about a month.

I think it’s great, but coming from Claude Code it did feel like going back in time by ~6 months in model capabilities. This isn’t a big deal to me for what I do, but the difference is definitely there.

Leynos · 2026-05-30T14:48:01 1780152481

Deepseek v4 Pro is like Opus 4.5 or GPT 5.2, but costs pennies on the pound for API. Which is to say, I should definitely be using it more to let my Codex and Claude subs go further.

jnovek · 2026-05-30T15:00:04 1780153204

Opus 4.5 was definitely stronger than DeepSeek V4 for me, specifically with large context.

I’m being pedantic/splitting hairs, though. I’ve obviously switched to DeepSeek full-time because it makes more sense to me pragmatically — I spend a few more tokens to get the outcome I want, but the tokens are cheap as dirt and the API is faster.

Perhaps I should plug it into Claude Code and see how it performs? I haven’t tried that.

Leynos · 2026-06-01T09:14:32 1780305272

Which harness do you use at the moment?

rjh29 · 2026-05-30T20:05:31 1780171531

So my GPU comparison is pretty apt then. Paying 4-10x to be slightly ahead of the curve.

jnovek · 2026-05-31T20:34:40 1780259680

I think so, I think you’re getting downvoted because of how you’ve framed it. People don’t like it when you tell them that you think they’ve made a stupid purchase. :-P

I do think more expensive models are valuable in some cases. For example, I’ve noticed that Opus (even 4.7) is much better than DeepSeek V4 at noticing information with small amounts of representation in a large context history. You should pick Opus if you need to find needle(s) in a large haystack. I’ve never worked on a project with millions of lines of code, but I’m guessing it becomes relevant in those situations.

A big thing, too, is that it’s work to get a non-frontier coding stack setup. I’ve spent many hours of free time getting OpenCode to do what I want, but I enjoy it so it’s NBD. If you don’t like puttering around with your development tools, $100/mo for Claude Code really isn’t all that much and you can call it a day.

rjh29 · 2026-06-01T08:56:18 1780304178

On your last point since DeepSeek v4 supports Claude Code's API you can literally set a few environment variables and continue to use Claude Code as a harness.

Like you say a combination of frontier and cheaper models is probably the optimum and that would require setup if you didn't want to choose models manually each time.
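For anyone curious what "a few environment variables" might look like, here is a minimal sketch. It assumes the provider exposes an Anthropic-compatible endpoint and that Claude Code reads `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN` and `ANTHROPIC_MODEL`, which is my understanding of its settings; check the current docs for both sides. The URL and model name below are placeholders, not real values, and `claude_code_env` / `launch` are hypothetical helper names.

```python
import os
import subprocess


def claude_code_env(base_url, api_key, model, base_env=None):
    """Return a copy of the environment with the Claude Code overrides set.

    The three variable names are assumptions based on Claude Code's
    documented settings; verify them before relying on this.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url      # provider's Anthropic-compatible endpoint
    env["ANTHROPIC_AUTH_TOKEN"] = api_key     # provider API key
    env["ANTHROPIC_MODEL"] = model            # provider's model identifier
    return env


def launch(env):
    """Start the `claude` CLI with the modified environment (assumes it is on PATH)."""
    return subprocess.run(["claude"], env=env).returncode


# Placeholder values: substitute the provider's documented URL and model name.
example_env = claude_code_env(
    base_url="https://provider.example/anthropic",
    api_key="sk-placeholder",
    model="provider-model-name",
    base_env={"PATH": "/usr/bin"},
)
```

Calling `launch(claude_code_env(...))` with real values would then start the harness against the other provider; nothing in the sketch picks models automatically, so routing between frontier and cheaper models would still need extra setup.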

solenoid0937 · 2026-05-30T14:58:26 1780153106

Opus 4.8 and GPT 5.5 are the best models, but people don't care about "best" anymore. Until there is a big leap in capability I don't think anyone will care about point releases.

Vibes and tribalism will prevail until one of them emerges as clearly and unambiguously superior to the other.

Tenemo · 2026-05-30T15:44:51 1780155891

I get what you mean but the GPU comparison isn't the best here, I think. Money-is-no-object-I-want-the-best approach is questionable, definitely. But no one can argue that an old Nvidia card is objectively better for e.g. 4k gaming than a 4090 if you don't mind the wattage. You can just measure it.

With LLMs the problem is more complex, it's people getting used to how a model works and to the ecosystem. Sure, you can make all your skills harness-agnostic and deal with Anthropic's stubborn refusal to adopt the common naming/directory structure. But most people don't. So then you end up with something closer to the ancient Android vs iOS discussion. Can you prove, in isolation, that iOS is more energy efficient, the hardware is faster? Yeah. But that won't speak to someone who has been on Android for 10 years and would have to migrate and get used to iOS to experience that, first.

I've noticed myself how I get used to common failure modes of particular models in my projects. GPT5.5 tends to create some checks/booleans I don't need, it heavily overcorrects on error handling, etc. While Claude 4.7/4.8 doesn't do those as often but gets derailed on our E2E test suite, forgets to run linting despite guidance. So even assuming a fully harness-agnostic working setup, a new LLM with its own quirks can be a lot of friction for heavy users who might be used to Claude specifically and whose skills/guidance all pre-address its common failure modes.

E.g. I might be a Prius owner, then you gift me an objectively better, more efficient, safer, newer, same-size, physical knobs car ...and I might still swear by my Prius! I'm used to how it turns, how it feels, I can repair some issues myself. Isn't that a normal reaction then?

Aurornis · 2026-05-30T15:29:04 1780154944

> Same reason people buy the RTX 4090 and 5090 cards - overpriced but they must have the "best".

Or they need to run high VRAM apps like LLMs

Or they have 4K monitors and want smooth gameplay on them

Is this whole thread just dedicated to snark about other people’s personal preferences?

rjh29 · 2026-05-30T20:12:34 1780171954

The cost/performance is terrible for higher end cards. In a few years your card is now worth nothing because the lower end cards of the next gen are matching it and there's yet another new SOTA card out. So people end up buying that and chasing the dragon.

The 4k situation is a good point because nvidia deliberately don't provide 24GB except the 90 series, but ... you're too good for DLSS? You can't move a texture slider down from Ultra to Super High? It's your choice, just like it's your choice to pay for Claude. I am also allowed to think you're being stupid.

CamperBob2 · 2026-05-30T20:31:04 1780173064

Someone who bought a 4090 "a few years" ago can now sell it for more than they paid for it, but never mind that.

rjh29 · 2026-05-31T11:49:32 1780228172

Do you think GPUs are going to keep going up in value forever like houses? The current situation (game console price rises years into their lifecycle) is unprecedented and irrelevant to my point.

And if you need SOTA then you can sell your old card sure, but the next xx90 card is now 2x the price as the last gen. So you're not any better off.

CamperBob2 · 2026-05-31T18:23:34 1780251814

A lot depends on what happens with China. I don't think Xi will attack Taiwan, but then I didn't think Putin would attack Ukraine, either.

If Xi goes for Taiwan, then yes, GPUs will appreciate like real estate ("Buy now! They're not making any more of it, you know!") for the next 10 years.

rjh29 · 2026-06-01T09:00:52 1780304452

If GPUs appreciate like real estate then we'll probably see game graphics flatten in response. AI will continue to be a money sink no doubt, but if you have a good card in 2026 you're probably fine for gaming for the next 5 years.

Hamuko · 2026-05-30T14:28:36 1780151316

Hey, at least the superior performance of a 4090 or a 5090 can be objectively measured.

doctorwho42 · 2026-05-30T17:33:28 1780162408

But it's a matter of degrees better, not miles.

slashdave · 2026-05-30T15:18:51 1780154331

You're projecting

simianwords · 2026-05-30T17:01:41 1780160501

I find this pattern annoying and also commonplace: MY taste is correct. I AM right. The emergent properties of free market resulting from revealed preferences of free willed agents is WRONG.

Any name suitable to name this phenomenon?

vjvjvjvjghv · 2026-05-30T15:08:56 1780153736

The results may be the same but I personally find Claude nicer to work with. It seems to understand my intent better than GPT and needs less guidance. Maybe it’s just personal preference.

echelon · 2026-05-30T14:16:05 1780150565

> Couldn’t tell.

I can tell. It's night and day.

Last year I used a bunch of models to try to generate Rust code. They all sucked.

This February I tried again and used Claude to generate Rust code. I have never been more stunned in my life. It's just as good as I am, and 30x faster. No fluff, the code is verbatim just as I would have written.

I then tried other models. Total disappointment.

I've continued to repeat this experiment. Opus is the only model that can write Rust reasonably.

Codex produces junk to this day. It passes variables that aren't needed, it abuses pointers, it creates overly verbose monstrosities...

I don't want any single company to win. I want OpenAI to be competitive. I want open source models to win. But right now, Claude Code and Opus are it.

lunar_mycroft · 2026-05-30T14:48:46 1780152526

> This February I tried again and used Claude to generate Rust code. I have never been more stunned in my life. It's just as good as I am, and 30x faster. No fluff, the code is verbatim just as I would have written.

Having looked at a bunch of known or suspected (based on the intent of the code and/or what I know about the developer(s)) LLM generated rust, there's only a few explanations here:

1. You're way better at prompting than (virtually) anyone else.

2. You're vastly overestimating how good the rust code it produced is.

3. You handheld the model throughout and made lots of edits.

4. Your hand written rust code is very bad.

Because from every example I've seen, these models write horrible rust. Sure, it may technically pass all the tests, but it's horribly pessimized, badly organized, doesn't even attempt to use the type system, if there aren't bugs now there will be the second it tries to refactor or add a new feature, etc. etc.

(I also strongly suspect that the same would be true for other languages, but I can detect it in rust more easily because it's my main language)

amelius · 2026-05-30T14:30:00 1780151400

I recently tried with C# code and Avalonia on Linux. Total disaster. Could only get things to run after 10 attempts or so, and was only trying a very basic example. For some of the experiments I actually gave up.

lelanthran · 2026-05-30T14:28:19 1780151299

Should've used DeepSeek. That would have been interesting.

mpalmer · 2026-05-30T15:04:57 1780153497

Isn't the experience of interacting with the models appreciably different? It's not all about the outcome. Not to mention the harnesses are increasingly the real product.

tedivm · 2026-05-30T16:58:29 1780160309

So you both used Anthropic models (Opus 4.7 being from Anthropic)? I'm struggling to understand what your comparison really was here.

theptip · 2026-05-30T16:00:39 1780156839

Honestly I have no idea how you couldn’t tell. Reading a PR I can see the difference without even reading the words. (I doubt I could spot the difference just looking at the code diffs though.)

Claude commit messages - well structured test plan, readable.

Codex commit messages - wall of text, no structure.

The big difference though is sitting with the tools and using them for work. These are for sure vibes, but I’m sure you could pull out metrics for # steering re-prompts for example.

Codex just goes off and solves the problem, usually comes back with a solve; Claude more often gives up or needs input. Opus gives a broader design discussion, better at conversation. Codex finds deeper/better edge cases.

I think it’s like Emacs vs Vim - you can get your work done with both. There may be some tasks where one is way stronger. A strict “Better” is quite hard to justify.

Ultimately tool choice is a mix of science and art/taste; I want to feel joy using my tools, and fun little pixel explosions make me happy. If a different tool makes you happy, that is also fine.

shepherdjerred · 2026-05-30T17:47:50 1780163270

> I never want to hear from developers again that they are not susceptible to marketing.

It’s a really good signal of self-awareness/arrogance

tailscaler2026 · 2026-05-30T14:22:06 1780150926

for me personally it's two reasons:

1) Brockman ($25M) and Altman ($1M) both personally donated to Trump/MAGA.

2) Anthropic pushed back against DOD's demand for unrestricted use of AI to kill people while OpenAI eagerly said "please use ours!".

solenoid0937 · 2026-05-30T15:01:06 1780153266

Same. But even worse than all that: OAI erased Anthropic's red lines with the DOW, making it socially acceptable for every other AI company to do the same, creating a "race to the bottom."

I think OAI actually legitimately increased p(doom) for us all. Very strange behavior for a company that is supposedly concerned about x-risk.

PeterStuer · 2026-05-30T17:34:06 1780162446

So you black boxed a few 'success' tests, while the main difference between the two is the way they get to the result?

logdahl · 2026-05-30T15:59:50 1780156790

A lot is changing. Like 9 months ago, I was convinced Claude was best. I'm not so sure anymore :^)

micromacrofoot · 2026-05-30T14:56:50 1780153010

in my experience out of the box Claude Code is the better tool if you want to spend 0 time on config

pkilgore · 2026-05-30T15:05:03 1780153503

Sure, none of this is rational.

Some of it's timing: Claude Code was good before other harnesses and so behaviors (and contracts) were timed to lock in on that ecosystem.

Some of it was ethical/political: Anthropic fighting with the Trump admin about use of the model.

Some of it is social: Never overrate a CEO just being kind of perceived as a piece of shit by people who have power to influence decisions.

But switching costs are low! Because of the same models!

Let the race to the bottom commence. Hopefully before the monopoly/collusion starts.

joshspankit · 2026-05-31T02:14:38 1780193678

You attempted to create a deterministic test for an N-dimensional non-deterministic output
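One way to treat a "can you tell which model wrote this" test statistically rather than deterministically is an exact one-sided binomial test on the number of correct guesses. A minimal sketch, assuming each guess is an independent two-way choice with a 50% chance of being right by luck; `p_value_at_least` is a hypothetical helper name, not from any library.

```python
from math import comb


def p_value_at_least(correct, trials, chance=0.5):
    """Probability of getting `correct` or more right out of `trials` by pure guessing."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# A single pair of PRs, both guessed correctly: still a 1-in-4 fluke.
pair_of_prs = p_value_at_least(2, 2)

# Twenty blind comparisons with sixteen correct is much harder to get by luck.
twenty_trials = p_value_at_least(16, 20)
```

With only two trials even a perfect score has p = 0.25, so a pair of PRs cannot show much either way; it takes a couple dozen blind comparisons before "couldn't tell" or "could tell" starts to mean something.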

brazukadev · 2026-05-30T17:33:57 1780162437

That sounds like someone desperately trying to convince people Pepsi is better than Coke