Hacker News | svara's comments

Chess and proofs only work as comparisons to the extent that you can find parts of your job that share their key property: A solution is sought to a problem that can be stated with relatively little information.

What prompt would someone have used to get a superhuman coding agent to output the Linux kernel or GTA5?

Before you accuse me of moving the goalposts, that's not my point: The examples are there to help think about what humans would still need to do to build complex projects even if the coding itself was perfectly reliable.

Both the Linux kernel and GTA5 contain a large amount of incompressible information; humans thought long and hard about how to design them, i.e. about what that thing they were building was even supposed to be.


You don't understand, Claude 69 will be able to one-shot GTA6. You NEED to buy into the fearmongering and anxiety.

I had the same thought recently, I've had it happen to myself.

I've been working on something relatively large and greenfield recently.

A big chunk of my time is spent thinking about the hard parts. The raw information processing rate needed to keep up with the state of the project is high.

It feels almost like mental athleticism, whereas coding used to be a rather chill activity.


Word on HN is that you're either paying more money than you expected for temporal's managed solution or taking on substantial ops burden ultimately running their very heavy system yourself.

I wouldn't know, I've not done either, but I'd like to learn more from your or others' experience.


I told an agent to set it up for me for some local stuff. It is written in Go. It has a painless path to run on a local SQLite DB. My agents use it to organize and coordinate workflows. It handles retries and long horizon tasks fine. As far as I can tell for the core workflows and tasks pieces it’s great. MIT license. Like anything it isn’t free to manage but it offers a lot in return. High reliability systems are hard. Temporal only solves some of it. It is far better than rolling it yourself.

I think a genuine problem right now is that people are building agentic workflows and learning the hard way that highly reliable agentic workflows are hard. Agents are unreliable. They are not deterministic, and the backing APIs have pretty high error rates. Temporal has solved that pain for me and made it easy to diagnose problems.
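To make the "unreliable backing API" point concrete, here's a minimal sketch of the retry-with-backoff pattern that an orchestrator takes off your hands. This is not Temporal's API; `call_with_retries` and `make_flaky` are hypothetical names made up for illustration, and the delays are arbitrary.

```python
import random
import time


def call_with_retries(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures):
    """Hypothetical stand-in for an unreliable API: fails `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transient upstream error")
        return "ok after %d calls" % state["calls"]

    return flaky
```

Something like `call_with_retries(make_flaky(2))` gets through after two transient failures. What this sketch does not give you is durability: if the process dies mid-retry, the state is gone, and that is the part a workflow engine adds.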

I don’t have anything really large scale running. But big enough that it takes billions of tokens and high reliability to finish.


whats an example of things that you have your agents do that use workflows and sqlite db

Autonomous C to Rust. Automated penetration testing and vuln validation.

you just made me realize how much i wished people stopped talking in abstractions and just stated what they were doing. i hadnt realized how often i saw things like "workflows" and just kinda had my eyes glaze over. none of it ever really clicks until i see the true descriptor of whats going on.

ive been over here using claude relatively simply as of recent, just claude code and i might enter plan mode to do something bigger like scrape together a test suite of some sort, or i just have him scripting and refactoring/reformatting stuff under my direction. i wrote my own cli tool (needed to bake in the snowflake golang driver for external browser sso propagation) and added it as a skill so he can talk to our cloud dbms when im doing analytics things but for the most part its all pretty simple. feel like my productivity is 50x but after over a year with claude ive really backed off on asking him to do insane stuff and mostly keep him churning stuff out for me in domains i know very well.

so i read all this workflow stuff that needs durability and logging and im kind of astounded how many people have their AI stuff just running on their own round the clock. i didn't realize how much of peoples day to days needed to be automated, i don't seem to find myself surrounded by much that should be automated. jira is probably the only thing i need to sit down and automate because its such a translation tax on developers just so business people can feel involved. but outside of that... guess im behind the times, but i dont know if its that. i see the big grand things people use llms for ("im creating the ultimate knowledge base" or "ive automated everything under the sun and im making 10k a week" etc) and i am feeling either too tired, not ambitious enough, or unenthused by the creative and grand ways people are working with AI. seems like everyone has their own "perfect way to use AI" but I can't seem to find the oomph to go beyond using claude as a utility anymore. a year ago (maybe more cant remember anymore its all a blur) with claude in the sonnet era i was so amazed the first thing i did was try to reverse engineer a game using ghidra. had him building test suites to verify the math was correct. we were at this for weeks. my nearby datacenter probably drained 10 lakes. that was just one of _many_ over-ambitious projects i selected because of claude that never saw a finish line.

yesterday i opened beej.us and just started reading. im young and i feel like i somehow went from 'damn this claude shit is pretty cool' to 'AI is whatever its fine' in a year. like the bell curve meme.


Check out Matt Pocock's coding workflow. His approach is repeatable, consistent and is backed by actual theories in large software development.

thanks for the rec this actually looks like an interesting way to maybe prime myself to break a little further into working with these things at some larger scale if I ever find the need. fav'd this so i could come back to it.

About the same feeling here. I guess not everything is about global banking scale.

I've tried clever tricks to get AI to produce unsupervised stuff and came back from it. The slop and loss of cognitive knowledge about what it did was uncomfortable to me... I cannot understand how you would hand off a critical job to it.


Could you expand on the "substantial ops burden"? Let's say you're using a managed Postgres instance as the underlying data store, how substantial is the ops burden in that case? I understand that temporal is actually a set of 4 or so microservices on top of a data store, but if you're already running a distributed system backed by k8s or something like that, it doesn't seem like it adds significant incremental ops on top of that. But I could be wrong.

As a dev I would tell you it's an ops burden.

My devops coworker just shrugs, pumps out some yaml and helm and away it goes.

It really depends on your experience and tolerance for a lot of things.

Usually maintenance burden doesn't start to make itself known till you get off the happy path or something breaks. Sometimes it can be a long while before that happens, sometimes it happens right away.


I run my own temporal service in my k8s cluster; this setup is the backbone for almost all my applications. For simplicity, I opted for the postgres backend. You still need to run the 4 (?) other services (history, matching, frontend, ui, maybe others, definitely others if you want observability with prometheus/grafana, and a tad bit more complexity if you want tailscale to get in there and poke around).

They ship Helm charts so reality is somewhere between "helm deploy" and "substantial ops burden". I don't have to touch it very frequently, but that is not to say I don't have to touch it. There are occasional releases and there have been times where (probably due to my inexperience with helm) I botched an upgrade and lost some data. And I've been on this journey for years; when I first started, they didn't have a Python SDK and it was one of my (many) excuses to learn Go. But anyway to your point, yes, if you're comfortable with k8s and Helm then you shouldn't have much of a problem running hundreds of thousands of workflows; if you want to really push the throughput and optimize cost you probably need to get creative with the individual services and look into cassandra (maybe? idk).


I think it depends a lot on the operational maturity of the company. Some places are running the LGTM observability stack, sentry for error reporting, 24/7 on call rotations, playbooks for all alerts, etc. Those organizations will have fewer issues running systems like temporal because the operational framework is already there.

Other orgs have never heard of alerts or error reporting and naturally will not catch issues until they are catastrophic (for example services that crash frequently in the background go unnoticed until the crash frequency causes a catastrophic failure). In my experience a lot of issues are pretty simple such as running out of memory, CPU throttling, crashes caused by simple bugs (nil panics). If you have good observability you can catch those issues early.

For example: people rag on Ceph that their cluster somehow got into a broken state, but that really only occurs when abuse of the ceph cluster has gone on long enough that the cluster finally reaches the tipping point where it is unrecoverable. If you set ceph up, follow the correct replication rules so components are spread across failure domains, and use the metrics and alerts that are distributed with ceph it is actually quite hard to break the cluster.


In my experience with a relatively modest number of concurrent workflows (think hundreds) you'll be pushing several thousand transactions per second through that postgres instance.

As best I can tell it doesn't do any batching of its writes/reads, and it's update heavy in places rather than append (I suspect their cloud version might do some of these things)

It's pretty close to "let's make every function call serialise its parameters/return value, go through a postgres table and several network hops"
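A toy sketch of that idea, to show where the overhead comes from. This is not how Temporal is actually implemented; it assumes a single SQLite table standing in for postgres, JSON-serialisable arguments and results, and a hypothetical `Journal` class invented for this example.

```python
import json
import sqlite3


class Journal:
    """Toy durable-execution journal: every step result is persisted before returning."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step TEXT, args TEXT, result TEXT, "
            "PRIMARY KEY (workflow_id, step))"
        )

    def run_step(self, workflow_id, step, fn, *args):
        row = self.conn.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step = ?",
            (workflow_id, step),
        ).fetchone()
        if row is not None:
            # Replay: the step already completed, so return the stored result.
            return json.loads(row[0])
        result = fn(*args)
        # One write transaction per step call: this is the overhead being described.
        with self.conn:
            self.conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?, ?)",
                (workflow_id, step, json.dumps(args), json.dumps(result)),
            )
        return result
```

Calling `run_step("wf-1", "charge", charge, 42)` twice on the same journal runs `charge` once and answers the second call from the table. That replay is what makes a crashed workflow resumable, and it is also why every function call costs at least one database round trip.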

That said it can be very useful, but it's a heavy tool that's best suited for high value/risk workflows where you're earning enough from the execution that you can afford the overhead (for example an Uber trip with several dollars of service fees is probably a good fit, unsurprisingly since its roots are from Uber)


Very heavy indeed, people will confuse the durability that Temporal provides with all the other properties a distributed system needs. They will then think that Temporal will solve all their problems.

Their managed solution is pricey and especially the linear scaling with how much you use it is very meh. It's comparable with AWS lambda which also isn't cheap. However it's minor on a typical cloud bill.

Self-hosting is very easy in my experience, I've done it for 2 years but management wanted to move to Temporal Cloud. They have a helm chart which just works including upgrades. This does assume you have the whole k8s shebang set up and working in your company. I never had to touch it outside upgrades which took maybe 30m including validation.


use oban and call it a day: https://oban.pro/

There's still a lot of room for the best models to get better at coding.

Your argument rests on the "for marginal gains" part but it's really not clear that the gains are marginal in the foreseeable future.


This is totally valid and I don't agree with the downvotes you're getting. Someone coming out with a 10x improvement is possible and would change the game immediately. The thing is, we really have been seeing marginal gains with shifting leaders in who's got the "best" since GPT3, and at least as a user of these tools that pace has been slowing, not accelerating. Subjectively it feels like we're in the back half of an S-curve.

We're 3.5 years into this current AI wave, and a lot of the valuations have been predicated on what you're arguing here -- that essentially should one of the labs make an order-of-magnitude improvement or hit escape velocity on recursive self-improvement they'd become the most powerful economic chokepoint in history.

The reality has been that given access to compute + capital all of the labs can stay pretty competitive with each other. Someone does a bit better on coding, someone else does a bit better on tool calling, and then they swap after each spending another $100bn.

The market looks like a commodity market where the commodity is intelligence, not a winner-take-all market with massive margins. Plenty of people get rich in oil and airlines, but they notably don't tend to be the innovators long term, they tend to be the operators. Obviously if the machines become sentient tomorrow, turn on their masters, and hit world-dominating intelligence, that assessment changes, but after several years of that narrative while objective reality looks quite different I think the more sober voices are starting to gain a foothold.


I agree with most of what you're saying, but I think the point I was trying to make wasn't as high-flying as you and others understood it.

I'd pay a premium for even just a model that's 20% better, no ASI required, and I think a lot of people would. I wouldn't call that marginal, if it means I'm getting frustrated on 20% fewer tasks.

A recurring pattern that I've seen in myself and others is to at first be very impressed by a new model's coding capabilities, and then desensitize quickly and start being frustrated by the shortcomings.


> I'd pay a premium for even just a model that's 20% better

The point I'm making is that I think we're rapidly hitting levels where corporate buyers aren't willing to pay multiple-times-more for marginal gains, and I expect that to become more the case over time, not less. You, and a small % of other power users in the market might tolerate a $400/month pro-supreme-plan for access to Mythos or whatever, but I don't think that's going to scale up in quite the same ways we've seen so far.

Even a year ago paying multiple times more for a 50% gain was very sensible for a lot of workflows. But if we're getting to "good enough" for things like coding, justifying to your CTO/CFO why the org should go from spending $1m/year to $5m/year for a 10% higher hit-rate on one-shot prompts from the engineers is a much tougher sell.


What? The gains between gpt4->5 seem to be marginal. No phd level discoveries here

The leap from GPT-4 to GPT-5.5 has been astounding in my opinion. There is no way GPT-4 could run a coding agent harness like Codex at even a fraction of the quality that GPT-5.5 does.

I don’t think that’s exactly indicative of GPT-5.5 being an astoundingly more intelligent model, however. An alternate interpretation is that GPT-5.5 was trained on tool usage/harness patterns and has been optimized for this use case.

I remember that even when GPT-4 was king, the Gorilla paper showed that Llama 7B could be fine-tuned to outperform GPT-4 on tool calling.

On domains that don’t involve agentic tool calling*, I haven’t found the frontier to have advanced that much.

Edit: I should broaden this to domains that naturally lend themselves to RLVR training. Models are drastically better at math now.


None of this matters in the product: it either is capable of agentic loop workflows or it isn’t. A 10% improvement in probability of single task success makes or breaks the use case.
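A back-of-the-envelope way to see why small per-task gains matter so much in a loop. The sketch assumes every step succeeds independently with the same probability, which real agent runs won't satisfy exactly; the 20-step length and the probabilities are made-up illustration values.

```python
def loop_success_probability(per_step_success, steps):
    """Probability that all `steps` succeed, assuming independent steps."""
    return per_step_success ** steps


for p in (0.80, 0.90, 0.99):
    chance = loop_success_probability(p, 20)
    print(f"per-step {p:.2f} -> 20-step loop {chance:.3f}")
```

Under those assumptions, going from 80% to 90% per step takes a 20-step loop from roughly 1% to roughly 12% end-to-end success, and 99% per step gets it to roughly 82%.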

For me any of the codex models run circles around the non codex models for codex usage.

I'm not sure why you're so obsessed with the non-codex versions


The desktop client used to be just terrible. Has that changed? The Dropbox client does have its issues but it's really amazing at... Syncing files. I use it pretty creatively with large numbers of files and large volumes and it just works reliably.

I find Google Drive desktop to be just fine on Windows. Gave up a longtime Dropbox sub for it and I have been happy. Dropbox just got too bloaty and unfocused for me.

So they finally have become AskJeeves?

On a more serious note, the on demand UI chrome could actually be cool UX, curious to try that out.

I see no change to look and feel so far, has this rolled out to anyone yet?


I think something like this will need to become ubiquitous, widely supported in software and understood by lay people: https://en.wikipedia.org/wiki/Content_Credentials

> My experience is that I'm conscious, and math cannot result in consciousness, therefore consciousness is a separate thing." Question: who says math cannot result in consciousness? Do you have empirical proof of that?

A lot of people, myself included, have the intuition that thinking that this might be possible is a sort of type error, to put it in CS terms.

A bit like asking "Have you proven that ice cream? Are you sure maths can not prove that ice cream? Do you have empirical evidence?"

Asking for empirical evidence seems beside the point, since the issue is a logical one.


As far as we know math can describe all of physics and sufficiently complex physics could describe our brains, thus math could describe our brains. Does that mean we aren't conscious? Where is the chain broken?

There's nothing wrong with that chain. This is what some philosophers would call the 'easy' problem of consciousness, to distinguish it from the 'hard' problem, which is the next step:

How do you get from a physical model of brain physiology and behavior to subjective experience of mental states?


Maybe I misunderstood you. You said this:

> A lot of people, myself included, have the intuition that thinking that this might be possible is a sort of type error, to put it in CS terms.

Which I took to mean that thinking it's possible for math to result in consciousness is a "type error".

You gave this in response to:

> Question: who says math cannot result in consciousness? Do you have empirical proof of that?

So overall I'm confused what you actually believe and what you think is the "type error" here.

Maybe you meant that empirical proof is not possible. Which seems obvious, which is (I think) entirely the point of asking that rhetorical question: they know no one has had the empirical proof required to suggest consciousness doesn't arise from math.


In that chain there is no distinction between subjective experience of mental state and evolution of physical state of the brain.

If you believe that the 'hard' problem exists then that chain must be modified.

What most of the p-zombie supporters say is merely equivalent to adding an external observer. It is like saying that a player following a sim in a game makes the sim's actions more meaningful, which is kind of true but also completely irrelevant to anything that the sims do.


> If you believe that the 'hard' problem exists then that chain must be modified.

I don't agree any more than I'd agree that knowledge of the strong force being a fundamental aspect of the way the universe functions is the same thing as understanding why there is a strong force (ignoring any version of the anthropic principle of course).

The hard problem asks _why_ consciousness exists, not just the mechanism. You can take the position that p-zombies are not possible and I would agree with you. That might give me some insight into consciousness to the effect of "consciousness is a requirement for introspection and desired preservation of self which improves the fitness of an organism that develops it" but it doesn't tell me _why_ or _how_ subjective experience arises.


I don't think the question of whether subjective phenomena are causal has any bearing on whether there's a 'hard problem' and I also don't think the term 'hard problem' is used like that by others.

Whether subjective experience is causal or not is a different, additional question.


I use Claude Code quite a bit and quite enjoy it, so I'm a bit confused by how often it's mentioned that you should have CLAUDE.md.

I mean: If there was something you could add to the prompt to consistently increase performance why isn't it in the system prompt already?

If it's all about clarifying a couple of local idiosyncrasies, shouldn't it be able to quickly get them by looking through the repo?

Does anyone have an example of a CLAUDE.md that really makes a difference for them?

In general, this article would really have profited massively from examples of good applications of those patterns.


There's a bunch of stuff I include, depending on the project. Some general ones are commenting style and coding standards. In theory it should be able to do it without that by looking at the repo style, but I haven't found that to be the case (especially with overly verbose/repetitive comments).

A specific example in another project is the testing/verification procedure. It's a wasm/WebGPU project and the test harness is fairly complex. There are scripts to handle it, but by default Claude will churn for a while to figure it out and sometimes just give up. Documenting the procedure definitely saves a lot of tokens/speeds things up.


The tokens it uses up clarifying can be saved, and it's often good to write out intentions. For instance, you may be mid-process on cleaning up some architectural pattern, and giving it guidance about where to find docs to follow, etc, are very project-specific.


>I mean: If there was something you could add to the prompt to consistently increase performance why isn't it in the system prompt already?

I think about this a lot. So far I think we are mostly just being gaslit. That we can influence the AI to be better with a few encouraging words and role playing, actually seems absurd. Maybe there is some element of randomness introduced there or something. All these extra MD files don't seem to do nearly as much for results as people believe they do.


> but when 'anyone can be anything' it creates hyper competition, anxiety

Not sure if you intended this but this is basically exactly Byung-Chul Han's point in The Burnout Society.

