Hacker Newsnew | past | comments | ask | show | jobs | submit | cube2222's commentslogin

Relatedly, I think it's worth noting that Anthropic models have consistently been top-scoring in BullshitBench[0], in a league of their own, really.

Not affiliated with the bench in any way, but I think it surfaces important differences between the behavior of the models from different labs.

TLDR: The benchmark is measuring pushback in response to nonsensical requests and questions, as opposed to going with it and hallucinating a nonsensical answer.

[0]: https://petergpt.github.io/bullshit-benchmark/viewer/index.v...


TBH this is the main thing that made me start trusting Claude enough to actually find it useful, and I'm surprised other models haven't caught up. I assumed they had and I just wasn't aware because I'm not using them in the same way.

> I found my interactions with Fable to be extremely impressive; it made other models, including GPT 5.5 and Opus 4.8, feel small and dumb.

> Anthropic models have consistently been top-scoring in BullshitBench[0]

eyeroll I find that Anthropic models feel big and dumber.

https://www.endorlabs.com/research/ai-code-security-benchmar... puts Fable 5th, which seems about right to me.

I'm interested in code utility and correctness, even if the majority of AI use is not focused on that.


I think this just proves anyone can pick a benchmark that supports their point so maybe we shouldn't use treat them as evidence at all.

Hey, I responded to you two months ago[0], the job post has been up for probably 4-5 years by now (in a continuously evolving form).

We’re doing well, are steadily growing, and have been actively hiring backend engineers ever since!

Is there any specific concern you have?

[0]: https://news.ycombinator.com/item?id=47797034


Spacelift | Remote (Europe) | Full-time | Senior Software Engineer | $80k-$110k+ (can go higher)

We're a VC-funded startup (recently raised $51M Series C) building an infrastructure orchestrator and collaborative management platform for Infrastructure-as-Code – from OpenTofu, Terraform, Terragrunt, CloudFormation, Pulumi, Kubernetes, to Ansible.

On the backend we're using 100% Go with AWS primitives. We're looking for backend developers who like doing DevOps'y stuff sometimes (because in a way it's the spirit of our company), or have experience with the cloud native ecosystem. Ideally you'd have experience working with an IaC tool, i.e. Terraform, Pulumi, Ansible, CloudFormation, Kubernetes, or SaltStack.

Overall we have a deeply technical product, trying to build something customers love to use, and have a lot of happy and satisfied customers. We promise interesting work, the ability to open source parts of the project which don't give us a business advantage, as well as healthy working hours.

If that sounds like fun to you, please apply at https://careers.spacelift.io/jobs/3006934-software-engineer-...

You can find out more about the product we're building at https://spacelift.io and also see our engineering blog for a few technical blog posts of ours: https://spacelift.io/blog/engineering


Love the writing style!

> Nothing beats an organic, pasture-raised, hand-written spec.

Hah, I strongly empathize with the wording. I’ve been starting my design docs for fellow humans with “100% hand-written, organic content”, I might steal a part of yours.

Overall, cool idea. I don’t see myself using your SaaS, but the approach of tagging the requirements and constraints to make them easier to find sounds good.

One project you didn’t mention which I think is also, I think, a cool perspective on this is codespeak.dev , but I haven’t given it a go yet.

All in all, I feel like maintaining specs, and having agents translate spec diffs into code diffs is a promising area for the future. Good thing I enjoy writing!


Spacelift | Remote (Europe) | Full-time | Senior Software Engineer | $80k-$110k+ (can go higher)

We're a VC-funded startup (recently raised $51M Series C) building an infrastructure orchestrator and collaborative management platform for Infrastructure-as-Code – from OpenTofu, Terraform, Terragrunt, CloudFormation, Pulumi, Kubernetes, to Ansible.

On the backend we're using 100% Go with AWS primitives. We're looking for backend developers who like doing DevOps'y stuff sometimes (because in a way it's the spirit of our company), or have experience with the cloud native ecosystem. Ideally you'd have experience working with an IaC tool, i.e. Terraform, Pulumi, Ansible, CloudFormation, Kubernetes, or SaltStack.

Overall we have a deeply technical product, trying to build something customers love to use, and have a lot of happy and satisfied customers. We promise interesting work, the ability to open source parts of the project which don't give us a business advantage, as well as healthy working hours.

If that sounds like fun to you, please apply at https://careers.spacelift.io/jobs/3006934-software-engineer-...

You can find out more about the product we're building at https://spacelift.io and also see our engineering blog for a few technical blog posts of ours: https://spacelift.io/blog/engineering


It’s worth noting that e.g. the Go stdlib has this hybrid construction built-in via crypto/hpke.


thank you!!! i shall be using this immediately


Small tip, at least for now you can switch back to Opus 4.6, both in the ui and in Claude Code.


I’ve never been one to complain about new models, and also didn’t experience most of the issues folks were citing about Claude Code over the last couple months. I’ve been using it since release, happy with almost each new update.

Until Opus 4.7 - this is the first time I rolled back to a previous model.

Personality-wise it’s the worst of AI, “it’s not x, it’s y”, strong short sentences, in general a bulshitty vibe, also gaslighting me that it fixed something even though it didn’t actually check.

I’m not sure what’s up, maybe it’s tuned for harnesses like Claude Design (which is great btw) where there’s an independent judge to check it, but for now, Opus 4.6 it is.


I noticed the difference, but coming from Gemini and xAI models it wasn’t that glaring. I still find that Opus makes much better plans than anything else I’ve tried, and it’s been very good at catching my mistakes in using public-key cryptography, also finding out why my crsqlite queries were failing despite no official documentation on the topic.

I’d never use such an expensive model for coding, so that might explain why I have little to complain about.


I've been using it with `/effort max` all the time, and it's been working better than ever.

I think here's part of the problem, it's hard to measure this, and you also don't know in which AB test cohorts you may currently be and how they are affecting results.


Agree. I keep effort max on Claude and xhigh on GPT for all tasks and keep tasks as scoped units of work instead of boil the ocean type prompts. It is hard to measure but ultimately the tasks are getting completed and I'm validating so I consider it "working as expected".


It works better, until you run out of tokens. Running out of tokens is something that used to never happen to me, but this month now regularly happens.

Maybe I could avoid running out of tokens by turning off 1M tokens and max effort, but that's a cure worse than the disease IMO.


I would risk a guess that people have a wrong intuition about the long-context pricing and are complaining because of that.

Yeah, the per-token price stays the same, even with large context. But that still means that you're spending 4x more cache-read tokens in a 400k context conversation, on each turn, than you would be in a 100k context conversation.


unless you always have run it on effort max, and see that it degraded.


Seems like it's not in Claude Code natively yet, but you can do an explicit `/model claude-opus-4-7` and it works.


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: