Hacker News | ammar_x's comments

Is there some sort of a leaderboard for this test? Like if you'd give each of Opus 4.8 and GPT 5.5 a score out of 100, what would the scores be?

There isn't, as I wasn't going for strictness, more like a playful challenge in the vein of Simon's SVG pelican.

Between the two, Opus 4.8 seems more capable. But, I suspect the harness plays a large role here. It's possible the result would be as good if Codex ran 10+ agents and spent an hour on it.

OpenAI and Anthropic usually fast-follow each other, so I wouldn't be surprised if Codex got the same capability in a couple of days (and even an update to the model), then it'll be a better test.

Sooo, let's say, winging it, vibes-based: 85% for Opus 4.8, 75% for GPT 5.5. Compare with GPT 5.3 (let's say 25%) here: https://senko.net/vibecode-bench/2026/rts-codex-5.3.html


Absolutely! We need new and better benchmarks like this.

I have a question: why not use the maximum available reasoning on each LLM? For example, I see that Opus 4.7 is at `max` reasoning but Sonnet 4.6 is at `high`. Wouldn't it be a fairer comparison if all were at max?



I usually do this for complex features:

- Opus 4.7 writes the code
- I have GPT-5.5 in Codex review it (given context)
- I provide the review back to Opus and ask it to verify the review findings
- I make Opus plan the fixes, then execute them
- I ask GPT-5.5 to review the fixes and check whether they solve the problems


Cool, but body font size is too small for comfortable reading!

Sounds like an authentic HN experience to me!

I keep it at 150% zoom level. That would still be on the small side if it was the default. But at least it's somewhat readable this way.

OP, I love the font size as is, have multiple options if you're going to change things! Remember the users that loved things as they were!

I did increase it in the meantime, after that comment was posted.

I whipped up a quick uBO rule to fix that (also makes meta-information lines readable):

    thefrontpage.dev##p.newspaper-copy:style(line-height: normal !important; font-size: 1rem !important;)
    thefrontpage.dev##p.article-meta:style(font-size: 1rem !important; font-weight: normal !important; letter-spacing: normal !important;)
EDIT: changed to 1rem as someone else suggested

I agree, but I think it's that small because otherwise, the justified text results in ridiculous spacing.

OP, consider reducing the number of columns from 4 to 3 (at least below very wide viewports), increasing the font size, and then also allowing hyphenation. I think the last will help a lot with the justification problem.


Or have a button that makes the text left-aligned for easier reading.

I think that very much defeats the point of making it look like a newspaper.

Which might be fine? Since web pages are not newspapers, one might say it's just not the ideal way of presenting information.

This entire submission is styled to look like a newspaper. If you just want information that's available at news.ycombinator.com.

An overridden `.newspaper-copy { font-size: 1rem; }` works well.

I've used Brave Search and found it better than Google's in some cases.

This looks great for quick audio operations without the need to use heavy apps.

One question: I tried the "Fade In" effect; is there a way to control its timing (i.e. the part of the clip where the effect is applied)?


You can click and drag to select part of the audio (and then drag the edges of the selection region that appears if you want to adjust it), and then apply the effect. All effects prioritize the current selection; if no selection is present, they are applied to the entire track.

You can use V4 Pro with Claude Code [1].

I tried it and it's impressive.

[1]: https://api-docs.deepseek.com/quick_start/agent_integrations...


I'm working on a custom launcher for hooking up Claude Code with various providers (it groups env variables into profiles), because DeepSeek doesn't have vision and sometimes I need browser use with screenshots or Opus reasoning. For other tasks it's fine: https://ccode.kronis.dev/

  # After installing (or when run portably with ./ccode)
  ccode init-config
  ccode edit-config
  
  # Run with default profile
  ccode
  # Run with named profile
  ccode --deepseek
  
  # Set default profile
  ccode set-default-profile deepseek
Also, it turns out that with a local proxy you can get Remote Control working and see the DeepSeek sessions in the desktop app (screenshots are on the page). Other than that, I'm happy that it works pretty well, and the discount is enough to make me consider going from Anthropic's Max subscription to Pro and using it only where DeepSeek is insufficient. With that proxy I eventually hope to be able to transparently switch models mid-task, if I need Opus for like 5 turns or something.
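To make the "env variables grouped in profiles" idea concrete, here is a minimal sketch in Python. It is not how ccode is implemented; the `PROFILES` table and the `build_env` helper are hypothetical names invented for illustration, and the variable names and DeepSeek URL are simply the ones quoted elsewhere in this thread.

```python
import os

# Hypothetical profiles: each one is just a named bundle of environment variables.
PROFILES = {
    "deepseek": {
        "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
        "ANTHROPIC_MODEL": "deepseek-v4-flash",
    },
    # Empty profile: leave the environment untouched and use the tool's defaults.
    "anthropic": {},
}


def build_env(profile_name, base_env=None):
    """Return a copy of base_env with the chosen profile's variables layered on top."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(PROFILES[profile_name])
    return env


if __name__ == "__main__":
    env = build_env("deepseek", base_env={"PATH": "/usr/bin"})
    print(env["ANTHROPIC_BASE_URL"])
```

A launcher built on this would pass the resulting dictionary as the `env` argument of `subprocess.run` when starting the CLI, so the profile never leaks into the parent shell.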

Overall though I'm not sure exactly how well Claude Code would stack up against OpenCode, since the latter overall feels a bit less hacky with 3rd party models and is even getting niche but nice features like a locally runnable web version: https://opencode.ai/docs/web/


I've been using V4 flash consistently with Claude. Pretty great, fast, and darn cheap. I use it about 3h/day and so far haven't crossed $1 USD/week.

FWIW, this is what I have in my settings.json:

  "env": {
    "ANTHROPIC_AUTH_TOKEN":"sk-nope_not_real",   
    "ANTHROPIC_BASE_URL": "https://api.deepseek.com/anthropic",
    "ANTHROPIC_MODEL": "deepseek-v4-flash",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "deepseek-v4-flash",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "deepseek-v4-flash",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "deepseek-v4-flash",
    "CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC": "1",
    "CLAUDE_CODE_EFFORT_LEVEL": "low",
    "CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING": "1",
    "CLAUDE_CODE_DISABLE_THINKING": "0",
    "CLAUDE_CODE_ENABLE_AWAY_SUMMARY": "0",
    "CLAUDE_CODE_SUBAGENT_MODEL": "deepseek-v4-flash",
    "CLAUDE_CODE_MAX_OUTPUT_TOKENS": "8000",
    "CLAUDE_CODE_FILE_READ_MAX_OUTPUT_TOKENS": "4000",
    "BASH_MAX_OUTPUT_LENGTH": "20000",
    "CLAUDE_AUTOCOMPACT_PCT_OVERRIDE": "60",
    "CLAUDE_CODE_AUTO_COMPACT_WINDOW": "200000",
    "CLAUDE_CODE_DISABLE_GIT_INSTRUCTIONS": "1"
  }

3h/day and how many parallel agents? 1/3/10?

I think out tokens would be a better metric.


Max 2 parallel agents, usually.

As for out tokens, it's about 200k/day
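Putting the two figures from this sub-thread together (under $1 USD/week, about 200k output tokens/day) gives a rough ceiling on spend per million output tokens. The sketch below is only that arithmetic; it lumps input and cache costs into the same number, so it is an all-in upper bound rather than a price quote, and the function name is made up for illustration.

```python
OUTPUT_TOKENS_PER_DAY = 200_000   # "about 200k/day", as reported above
WEEKLY_SPEND_CEILING_USD = 1.00   # "haven't crossed $1 USD/week", as reported above


def spend_ceiling_per_million_output_tokens(tokens_per_day, weekly_spend_usd):
    """Upper bound on total spend per one million output tokens."""
    weekly_tokens = tokens_per_day * 7
    return weekly_spend_usd / (weekly_tokens / 1_000_000)


if __name__ == "__main__":
    ceiling = spend_ceiling_per_million_output_tokens(
        OUTPUT_TOKENS_PER_DAY, WEEKLY_SPEND_CEILING_USD
    )
    # 1.00 / 1.4 million tokens per week
    print(f"under ${ceiling:.2f} of total spend per million output tokens")
```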


Why not use higher thinking effort?

Just cause that level seems to be working fine for me and it's usually faster.

Hi, is it comparable to Opus?

V4 Pro is between Sonnet and Opus. But it is cheap. Slow but very cheap. Very diligent.

I run a proxy that allows me switching back to Opus when necessary.

Deepseek isn't like Z.ai, which is a bit cheaper only on the surface. Or like Qwen 3.7 Max, which is Opus-level but very expensive.

Deepseek has been my favorite since V3, but V4 is definitely catching up to newer Anthropic models.


Thank you so much for sharing it.

How does the cost compare using the API vs the $20/month plans with other providers?

I did some back of the envelope calculations and it seems like you would pay $5/month using DeepSeek directly or $15-20 with OpenRouter or similar. But would be interested to hear real world usage.


It is still more expensive per-request than the common Anthropic and OpenAI subscriptions, but the math changes a lot based on your specific use case. https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...

But as usual, there are far cheaper subscriptions with higher limits than Anthropic and OpenAI, that also provide DeepSeek v4 Pro. So you should use those subscriptions first until you max them out, then look at a different subscription.


I don’t even use Claude that much and was hitting limits on the $20 plan using Sonnet. I’ve deposited $5 with DeepSeek and haven’t used it up after spending 60 million+ tokens. So no way it’s more expensive.

The link you shared is just a large table of data, which is hard to browse on a phone.

Could you please elaborate on the far cheaper subscriptions that we should be using?


I've been using it pretty extensively over a month and I'm at maybe $7. It thinks for quite a while, but the results have been better than Sonnet for me.

I'm curious what tasks you tested it on. I'm working on a coding agent that writes code dynamically on request for customers. I'd say the code itself is very simple, aggressively cached, and pattern-driven, e.g. we add lots of hints to the system.

The only model families that really worked were Claude and OpenAI. Surprisingly, for tasks that need more speed, GPT 5.4 is very impressive. DeepSeek was very average, performing somewhere in Gemini Flash 3.0 territory.


I am curious - Is there a way to switch between models depending on the task? Because I believe Deepseek V4 is not multimodal and it will be good to switch back to Claude if vision or other capabilities are required.

I was looking into something similar because I wanted to test a local model for doing basic coding and smart model (deepseek) for planning.

It's basically not possible with Claude Code: the API endpoint is a single environment variable, and whatever models are on that endpoint are what's available.

HOWEVER, if you run a proxy like LiteLLM, you can configure it to send requests to different API endpoints on the back end and expose them as different "models" on the front end, then configure Claude Code to switch between those virtual models.
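The routing idea itself is small enough to sketch in plain Python. This is not LiteLLM's configuration format or API; the `Backend` class, the `ROUTES` table, the virtual model names, and the local URL and model name are all hypothetical, and only the DeepSeek endpoint and model name are taken from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    base_url: str
    model: str


# Virtual model name (what the client asks for) -> real backend behind the proxy.
ROUTES = {
    "planner": Backend("https://api.deepseek.com/anthropic", "deepseek-v4-flash"),
    # Assumed local server and model name, purely for illustration.
    "local-coder": Backend("http://localhost:8000", "my-local-model"),
}


def resolve(virtual_model):
    """Look up the backend for a virtual model name, failing loudly if unknown."""
    try:
        return ROUTES[virtual_model]
    except KeyError:
        raise ValueError(f"unknown virtual model: {virtual_model!r}") from None


def rewrite_request(payload):
    """Return (upstream_base_url, payload copy with the real model name swapped in)."""
    backend = resolve(payload["model"])
    forwarded = {**payload, "model": backend.model}
    return backend.base_url, forwarded


if __name__ == "__main__":
    url, body = rewrite_request({"model": "planner", "messages": []})
    print(url, body["model"])
```

The client only ever sees one endpoint (the proxy) and picks a virtual name; the proxy decides where the request actually goes.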


Found this: https://github.com/farion1231/cc-switch

It allows for switching models in Claude Code.


Right that says it has a proxy feature so it can probably do what I was describing with LiteLLM

Check out the project called superpowers. It can use different models for different agents. I use it with opencode to have different models for research, planning, execution, testing, etc.

There is a tool called deepclaude, which runs a proxy in the background capable of doing this, by simply doing /model in Claude.

I've been trying that. In reality, every time you try to save money this way, it's not worth it: the cost of a mistake is so high that you can spend 2-3h on just one wrong assumption, and you lose your time and all the burned tokens.

It seems you can use the Claude Code CLI harness without a Claude Pro subscription now, which I don't think you could before?

I've been using Deepseek v4 with Cline in VS Code as a replacement for Github Copilot, and it's not been too bad.


The npm install of Claude Code has been deprecated since Feb 2026.

Surprised Anthropic hasn't done anything to restrict Claude Code from using other providers.

At this point in the AI wars, it is probably better to have more users of Claude code rather than restrict which LLMs it can connect to. Claude code is probably (currently at least) stickier than the LLM model itself. Getting people into the Claude code ecosystem is worth it.

Later, they can always lock it down more or add Claude LLM only features to it.


The value of Claude Code the harness isn't that great. There's a lot of other good harnesses out there.

And it gets dragged down by Anthropic actively injecting unhelpful things into prompts without telling users about them (https://github.com/anthropics/claude-code/issues/58262).

I thought so, and then I tried Opencode and Codex and started to appreciate Claude Code a lot more. They've actually done great work with the small details.

I actually haven't looked back since trying opencode. The ability to properly see what the agent is doing in tool calls and subagents is really unmatched. CC strips all reasoning and return values, only displaying tool calls, and you're unable to expand a single subagent: it's either expand everything and scroll endlessly, or show everything collapsed with basically no info at all (read x files, ran x commands). It just seems like extremely basic features are missing.

What’s your favourite harness? Are there any benchmarks for harnesses, like LLMs have with SWE-bench Verified?

There seem to be more and more harness benchmarks out there. A pretty interesting read:

https://neuralnoise.com/2026/harness-bench-wip/


You can check my profile for which one I like most :) I do think there have been efforts to benchmark different harnesses.

Personally I'm not going to choose one harness or another based on +/- a few percentage points in a benchmark. I'm going to use the one that I find the most ergonomic, that isn't too bloated, etc. The models are the primary lever, not the harness.


Good or better? Curious which would be in either bucket.

Probably a matter of taste. I prefer the harness I wrote, I don't want to go near Anthropic's bloated mess of a harness with a 10-meter pole.

IMHO the ergonomics of their tooling are not great. I'd rather use Codex or even OpenCode. Configuration alone is very arcane, and the documentation is lacking. The sandboxing/permission system is quite confusing too.

It went the other way, you can't use other harnesses to connect to the cheaper versions of Claude. So clearly they think their current moat is Claude Code use, not the LLM itself.

That's interesting. I thought Claude Code was not as good, and that's why people wanted to use Claude models with other alternatives. This is the other way around.

Which begs the question, regardless of the model, which Claude Code alternative is better? (I keep saying "Claude Code alternative" because I don't know the term... LLM CLI?)


AFAIK the two most popular open source harnesses right now are OpenCode and Pi. They take a pretty different approach, OpenCode includes a lot of features while Pi is very minimal by design and focused on extensibility, to the point where many people are just asking Pi to write a plugin for itself whenever they want it to have a new feature. I personally like Pi's philosophy more and I think its developer justified the choices really well in his blog post:

https://mariozechner.at/posts/2025-11-30-pi-coding-agent/#to... (the pi-coding-agent section)


Author blocks referrals from HN, weirdly dramatic, especially considering they have 1086 karma here. I wonder what we did to them.

Oh damn, I haven't noticed because my browser removes the referer header. But I think the image on the block page is a pretty good answer to why he did that.

What's the image trying to convey? Genuine question, I just come here to read nerd stuff and I'm not aware of any controversy

The image shows Garry Tan, the CEO of Y Combinator. He has lately been on a huge AI psychosis streak, bragging about things like "shipping 37000 lines of code every day" and "using Claude Code so much it burned out his USB-C power connectors". He's in a lobster suit because he's talking about OpenClaw, an AI agent assistant which those same AI psychosis types lean into too much by giving it full read-write access to all their life and then getting surprised when it accidentally deletes all of their emails.

Pi's developer is obviously not anti-AI, and he definitely doesn't hate OpenClaw, since it's based on Pi. But there's a growing number of people who take those things too far, and a lot of them are on HN. You can easily find them in the comments of any AI-related post here. I assume that's the type of people the image is portraying.


Thank you for the explanation!

The common term for a tool that wraps an LLM with a workflow is “harness”.

I've seen good results with opencode connected to GLM 5.1 on Ollama Cloud... for $20 a month you get performance similar to what you get with Opus 4.7.

I love oh-my-pi, but I'm not sure if it's "better". Maybe just as good.

I use DeepSeek v4 flash with Copilot and it works pretty well.

In my experience Claude Code is kind of shit.

Pi works very well with DeepSeek though.


My "trick" was to divide things into batches (which can be big with LLMs with larger context sizes) and classify the items in each batch, then take the resulting categories from each batch and feed them into an LLM to group semantically similar categories into groups with a representative category for each group. The representative category can be chosen from the group or created by the LLM. This is an over-simplification of the process but that's the gist of it.
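The two-pass process described above can be sketched roughly as follows. This is a simplified illustration, not the commenter's actual code: the LLM calls are replaced by plain callables, and `fake_classify_batch` and `fake_group_categories` are hypothetical stand-ins so the example runs without any API. It also assumes the items are unique strings.

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(items, batch_size, classify_batch):
    """Pass 1: classify each batch independently; returns {item: raw_category}."""
    labels = {}
    for batch in batched(items, batch_size):
        labels.update(classify_batch(batch))
    return labels


def consolidate(labels, group_categories):
    """Pass 2: map every raw category to its group's representative category."""
    raw_categories = sorted(set(labels.values()))
    representative = group_categories(raw_categories)
    return {item: representative.get(cat, cat) for item, cat in labels.items()}


# Stand-in for an LLM call: uses the first word of each item as its category.
def fake_classify_batch(batch):
    return {text: text.split()[0].lower() for text in batch}


# Stand-in for an LLM call: merges a few hard-coded synonyms into one category.
def fake_group_categories(categories):
    synonyms = {"bug": "defect", "crash": "defect", "defect": "defect"}
    return {cat: synonyms.get(cat, cat) for cat in categories}


if __name__ == "__main__":
    items = [
        "Bug in login",
        "Crash on start",
        "Feature request: dark mode",
        "Defect in export",
    ]
    raw = classify_in_batches(items, 2, fake_classify_batch)
    print(consolidate(raw, fake_group_categories))
```

Keeping the two passes as separate functions means the grouping step only ever sees the short list of distinct categories, not the full item set, which is what lets it fit in a single prompt.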


Language support is not mentioned in the repo. But from the paper, it offers extensive multilingual support (nearly 100 languages) which is good, but I need to test it to see how it compares to Gemini and Mistral OCR.


I suspect the number of languages it can do with reasonable accuracy is actually much smaller, probably <15.

