Hacker News | jsemrau's comments

I ran a pair-programming comparison over 3 months between Codex 5.2 and Claude Sonnet. My subjective experience, judged by cost and by rollbacks to a previous commit, was that Claude is significantly better, especially in VS Code Copilot. I wrote a long Substack post about it. I would share it, but it's in the paywalled archive by now.

In my opinion, the bread example doesn't really work that well because it bridges into the physical domain, which most cognitive systems don't have access to. That said, for grounding context, and therefore creating truth, having a version of a world model is very important (see Yann LeCun's work). My experience is that, given the right world to operate in, an agent can indeed find flaws in recipes and fix them even though it has not been prompted explicitly to do so. This world, as far as I understand it now, is a combination of sequential memory (at which step am I in a process), conversational memory (what was talked about already, what have I done already), and context memory (what is the frame of reference or plane of existence).

Self-correcting agents are already here: https://jdsemrau.substack.com/p/hyperagents-and-self-correct...


Anthropic for sure. It's a useful professional product that I find many use-cases for in my professional and private life. OpenAI not so much.

I am currently working on deep context query, which uses dynamically generated regex to pull only the relevant context blocks. By using lightweight regex pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window.

https://jdsemrau.substack.com/p/tokenmaxxing-and-optimizing-...
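
A minimal sketch of the idea, not the implementation from the linked post. It assumes the context is already split into blocks and that the "dynamically generated" pattern is simply an alternation of the query's content words. `STOPWORDS`, `build_pattern`, and `select_blocks` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical stopword list; a real system would use something richer.
STOPWORDS = {"the", "a", "an", "of", "is", "what", "for", "in", "on", "to"}

def build_pattern(query):
    """Build a case-insensitive alternation regex from the query's content words."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    terms = [w for w in words if w not in STOPWORDS]
    if not terms:
        return None
    alternation = "|".join(re.escape(t) for t in terms)
    # \b anchors the match at the start of a word, so "budget" also hits "budgets".
    return re.compile(r"\b(?:" + alternation + r")", re.IGNORECASE)

def select_blocks(query, blocks):
    """Return only the context blocks that the generated pattern matches."""
    pattern = build_pattern(query)
    if pattern is None:
        return []
    return [b for b in blocks if pattern.search(b)]

blocks = [
    "## Budget\nQ3 budget is 40k; vendor invoices are due monthly.",
    "## Schedule\nLaunch deadline is 1 June.",
    "## Team\nFour engineers, one designer.",
]

# Only the matching blocks would be placed into the model's context window.
selected = select_blocks("What is the vendor budget?", blocks)
```

Here only the budget block survives the filter, so the other two never reach the context window.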


The more real-world use cases we see, the more we see a well-thought-out regex used as a bridge from probabilistic to deterministic.


Interesting approach.

> Prioritize recall over precision.

Have you tried stemming your regex? That would help you catch messages where a different form of your word appeared. For example instead of “story” you look for “stor” which catches “stories” as well.
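
As a rough illustration of the stemming suggestion, assuming a plain prefix match stands in for a real stemmer (the sample messages are made up):

```python
import re

# Crude prefix "stem": matches any word that starts with "stor".
pattern = re.compile(r"\bstor\w*", re.IGNORECASE)

messages = [
    "Tell me a story",
    "I like short stories",
    "The store was closed",    # false positive: "store" shares the prefix
    "Nothing relevant here",
]

# Keep every message containing a word with the stemmed prefix.
hits = [m for m in messages if pattern.search(m)]
```

The "store" hit shows the cost: a shorter stem raises recall and lowers precision, which fits the quoted guideline.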

Then you might think, could we do an even better job by figuring out the general semantic intent of the query and history? Let’s project them into a semantic vector space! That’s an embedding.

Then you want to query that, which means you need a vector database. So now we can take the query, embed it, query the vector DB with that embedding and retrieve the N closest history documents. You can use that to augment the generation of the response to your prompt.

This is RAG.
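
The retrieval step can be sketched in a few lines. This is a toy under loud assumptions: `embed` is a bag-of-words count vector standing in for a real embedding model, and a plain Python list stands in for the vector database. All function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, history, n=2):
    """Return the n history documents closest to the query."""
    q = embed(query)
    ranked = sorted(history, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:n]

history = [
    "We discussed the bedtime story about the dragon.",
    "The quarterly budget was approved on Monday.",
    "Another story idea: a dragon who bakes bread.",
]

question = "Tell me the dragon story again"
context = retrieve(question, history, n=2)

# The retrieved documents are prepended to the prompt to augment generation.
prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + question
```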

Anyway, interesting to see different degrees of sophistication here. Certainly a handful of naive regex are very snappy.

There’s probably a hybrid approach where you use sophisticated NLP and embedding techniques to robustly define topics, then train a regex to approximate that well.


That assumes one layer of memory. In my experience you need at least 4 layers of memory for this to work well, and each has different requirements for retrieval. Everything in short-term memory (state of the app, current conversation, current workspace artefact) requires low latency and high precision. For example, if you want to edit a segment in a financial analysis, a blog post, or a program, you only want to edit that segment. RAG on a VectorDB is overkill for that, in my opinion.


This is one of the most interesting comments I've read on this website.


Thank you.


I don't understand OpenAI's product strategy.


It seems pretty simple:

1) Keep getting investors to give them money.

2) Convince the right people that OpenAI is "critical to national security" so that when 1 runs out, they can get bailed out by the government.

Everything else is just set dressing.


People ask for a thing they want, someone there has codex build it. Repeat.


> I don't understand OpenAI's product strategy.

Neither does OpenAI.


What part is confusing?


The same applies to banks and lending standards. In the end it is a function of governance and professional conduct.


"You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong."

Problem 1: Cost and privacy constraints limit data collection.

Problem 2: It makes little sense to collect and store data that you already have. Yet at collection time you don't know whether the data will be useful.

Problem 3: P2P in urban settings fails at edge cases, which by definition are rare and therefore hard to collect.

All of these problems limit AV scaling.


When working with APIs it makes a lot of sense to filter only for relevant portions based on an intent-driven dynamic RegEx.
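
A small sketch of what that could look like, assuming the intent has already been detected upstream and that each intent maps to a key pattern. `INTENT_KEY_PATTERNS`, `filter_response`, and the sample payload are hypothetical.

```python
import re

# Hypothetical mapping from a detected intent to a pattern over field names.
INTENT_KEY_PATTERNS = {
    "pricing": r"price|cost|currency",
    "availability": r"stock|available|lead_time",
}

def filter_response(payload, intent):
    """Keep only the top-level fields whose key matches the intent's pattern."""
    pattern = re.compile(INTENT_KEY_PATTERNS[intent], re.IGNORECASE)
    return {key: value for key, value in payload.items() if pattern.search(key)}

# Made-up API response; only a fraction of it is relevant to a pricing question.
api_response = {
    "id": "sku-123",
    "unit_price": 19.99,
    "currency": "EUR",
    "stock_level": 42,
    "description": "A long marketing description ...",
}

relevant = filter_response(api_response, "pricing")
```

Only `unit_price` and `currency` are passed on to the model; the long description never costs any tokens.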


He is proof that LinkedIn doesn't matter.


Codex is much worse than Anthropic's model. My experience is that I burn 10x the tokens using Codex compared to Sonnet 4.6.

