I ran a pair-programming comparison between Codex 5.2 and Claude Sonnet over 3 months, and my subjective experience was that, judged by cost and by rollbacks to a previous commit, Claude is significantly better, especially in VS Code Copilot. I wrote a long Substack post about it. I would share it, but it's in the paywalled archive by now.
In my opinion, the bread example doesn't really work that well because it bridges into the physical domain, which most cognitive systems don't have access to. That said, for grounding context, and therefore establishing truth, having some version of a world model is very important (see Yann LeCun's work). My experience is that, given the right world to operate in, an agent can indeed find flaws in recipes and fix them, even though it has not been explicitly prompted to do so. This world, as far as I understand it now, is a combination of sequential memory (which step of a process am I at), conversational memory (what was talked about already / what have I done already), and context memory (what is the frame of reference / plane of existence).
I am currently working on deep context query, which uses dynamically generated regex to pull only the relevant context blocks. By using lightweight regex pattern matching to detect semantic intent and filter structured context sections accordingly, you avoid the attention degradation that comes from stuffing semantically redundant information into the window.
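To make that concrete, here is a minimal sketch of the idea, not my actual implementation. The section names, keyword lists, and the `build_pattern` / `select_context` helpers are all hypothetical; the only assumption is that context is stored as named blocks and that each block has a few trigger keywords from which a pattern is generated at query time.

```python
import re

# Hypothetical structured context: section name -> text block.
CONTEXT_BLOCKS = {
    "billing": "Invoices are issued on the 1st of each month.",
    "recipes": "The sourdough recipe proofs for 12 hours.",
    "deploy": "Deploys go through the staging cluster first.",
}

# Hypothetical trigger keywords per section (illustrative only).
INTENT_KEYWORDS = {
    "billing": ["invoice", "refund", "payment"],
    "recipes": ["recipe", "bake", "bread"],
    "deploy": ["deploy", "release", "rollback"],
}


def build_pattern(keywords):
    """Generate one case-insensitive regex from a keyword list."""
    alternation = "|".join(re.escape(k) for k in keywords)
    # \b anchors each keyword at a word start; \w* tolerates suffixes.
    return re.compile(rf"\b(?:{alternation})\w*", re.IGNORECASE)


def select_context(query):
    """Return only the context blocks whose pattern matches the query."""
    selected = []
    for section, keywords in INTENT_KEYWORDS.items():
        if build_pattern(keywords).search(query):
            selected.append(CONTEXT_BLOCKS[section])
    return selected


relevant = select_context("Can you fix the bread recipe?")
```

Only the blocks in `relevant` would go into the prompt; everything else stays out of the window.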
Have you tried stemming your regex? That would help you catch messages where a different form of your word appears. For example, instead of “story” you look for “stor”, which catches “stories” as well.
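A rough sketch of what I mean, using plain truncation rather than a real stemming algorithm (the `stem_pattern` helper is just an illustrative name):

```python
import re


def stem_pattern(stem):
    """Match any word that starts with the given stem."""
    # \b anchors the stem at a word start; \w* allows any suffix.
    return re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)


pattern = stem_pattern("stor")
matches = pattern.findall("Tell me a story, or two stories. The store is closed.")
print(matches)  # ['story', 'stories', 'store']
```

Note the trade-off: the truncated stem also picks up “store”, so you gain recall at the cost of some false positives.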
Then you might think, could we do an even better job by figuring out the general semantic intent of the query and history? Let’s project them into a semantic vector space! That’s an embedding.
Then you want to query that, which means you need a vector database. So now we can take the query, embed it, query the vector DB with that embedding and retrieve the N closest history documents. You can use that to augment the generation of the response to your prompt.
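As a toy illustration of that retrieval step, here is a sketch with no external dependencies. It assumes a bag-of-words count as a stand-in for a real embedding model and a sorted list as a stand-in for a vector database; `embed`, `cosine`, and `retrieve` are hypothetical names, not any library's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, n=2):
    """Return the n documents closest to the query (stand-in for a vector DB query)."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:n]


history = [
    "we discussed the bread recipe yesterday",
    "the deploy failed on staging",
    "invoice totals were wrong",
]
top = retrieve("bread recipe", history, n=1)
```

The documents in `top` are what you would prepend to the prompt before generating the response.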
This is RAG.
Anyway, it's interesting to see the different degrees of sophistication here. Certainly a handful of naive regexes are very snappy.
There’s probably a hybrid approach where you use sophisticated NLP and embedding techniques to robustly define topics, then train a regex to approximate that well.
That assumes one layer of memory. In my experience, you need at least 4 layers of memory for this to work well, and each of them has different requirements for retrieval.
Everything that is in short-term memory (state of the app, current conversation, current workspace artefact) requires low latency and high precision. For example, if you want to edit a segment in a financial analysis, a blog post, or a program, you only want to edit that segment. RAG on a vector DB is overkill in my opinion.
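For illustration, a minimal sketch of what precise short-term access can look like, assuming the workspace artefact is kept as segments keyed by ID. The `Workspace` class and the segment names are hypothetical, not taken from any real system.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical workspace artefact: ordered segments keyed by ID."""

    segments: dict[str, str] = field(default_factory=dict)

    def edit(self, segment_id, new_text):
        """Replace exactly one segment; every other segment is untouched."""
        if segment_id not in self.segments:
            raise KeyError(f"unknown segment: {segment_id}")
        self.segments[segment_id] = new_text

    def render(self):
        """Join the segments back into one document, in insertion order."""
        return "\n\n".join(self.segments.values())


doc = Workspace({
    "intro": "Q3 revenue rose.",
    "risks": "FX exposure is high.",
    "outlook": "Guidance unchanged.",
})
doc.edit("risks", "FX exposure is hedged.")
```

The lookup is a direct key access, so there is no similarity search and no chance of retrieving a neighbouring segment by mistake.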