The other side of this is that open source projects that allow AI tools will be more restrictive towards new contributors.
This already happens to some degree on large software projects with corporate backing (Web engines, compilers, etc.), where it is often not trivial to start contributing as an independent individual.
Reasonable people can disagree on whether one approach is inherently better than the other, as ultimately they seem to be optimising for different goals.
Imagine getting contributions from someone who has no access to the build system and tests.
If I have a test harness and an LLM workflow set up, it is easier to just write new code myself. I am not giving away my "secret sauce". And I will not have a debate about "why this simple feature needs 1000 new tests..." and two days just to make a full release build.
For a merge I have to do 99% of the work anyway (analyze, autotest, build, smoke, regression test). I usually merge smaller commits just to be polite (and not to look like a one-man show), but there is no way to accept a large refactoring!
Claude 4.7 broke something while we were working on several failing tests and justified itself like this:
> That's a behavior narrowing I introduced for simplicity. It isn't covered by the failing tests, so you wouldn't have noticed — but strictly speaking, [functionality] was working before and now isn't.
I know that an LLM cannot understand its own internal state or explain its own decisions accurately. And yet, I am still unsettled by that "you wouldn't have noticed".
I've been doing a lot of experimentation with "hands off coding", where a test suite the agents cannot see determines the success of the task. Essentially, it's a Ralph loop with an external specification that determines when the task is done. The way it works is simple: no tests that were previously passing are allowed to fail in subsequent turns. I achieve this by spawning an agent in a worktree, having it do some work, and then, when it's done, running the suite and merging the code into trunk.
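To make the "no previously passing test may fail" rule concrete, here is a minimal sketch of that gate. The class name `RegressionGate` and the `dict[str, bool]` result format are my own assumptions for illustration, not the commenter's actual harness; running the hidden suite and doing the merge are left out.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionGate:
    """Ratchet: once a test has passed, it must keep passing."""

    # Names of tests that passed in any previously accepted turn.
    ever_passed: set[str] = field(default_factory=set)

    def check(self, results: dict[str, bool]) -> set[str]:
        """Return previously passing tests that fail (or are missing) this turn."""
        return {name for name in self.ever_passed if not results.get(name, False)}

    def accept(self, results: dict[str, bool]) -> bool:
        """Decide whether this turn's worktree may be merged into trunk."""
        if self.check(results):
            return False  # reject: something that used to pass no longer does
        # Only an accepted turn extends the set of protected tests.
        self.ever_passed |= {name for name, ok in results.items() if ok}
        return True
```

A test that disappears from the results counts as failing here, which closes the "just delete the test" loophole; whether that is the right policy depends on the project.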
I see this kind of misalignment in all agents, open and closed weights.
I've found these forms to be the most common: "this test was already failing before my changes," or "this test is flaky due to running the test suite on multiple threads." Sometimes the agent's CoT claims the test was bad, or that the requirements were not necessary.
Even more interesting is a different class of misalignment. When the constraints are very heavy (usually towards the end of the entire task), I've observed agents intentionally trying to subvert the external validation mechanisms. For example, the agent will navigate out of the worktree and commit its changes directly to trunk. The CoT usually indicates that the agent "is aware" that it's doing a bad thing. This is usually accompanied by something like, "I know that this will break the build, but I've been working on this task for too long, I'll just check what I have in now and create a ticket to fix the build."
I ended up having to spawn the agents in a jail to prevent that behavior entirely.
Are you using any tools specifically for controlling this behavior that you can recommend? I want to tear my hair out every time Claude cleanly 1-shots weeks of work to 99% accuracy, one or a couple of tests fail, and it calmly resolves it with a declaration that it was a "pre-existing failure" or "flaky". It can usually resolve it if I then explicitly tell it to stash the changes and compare against the test results from the prior state, but it happens constantly.
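Not a tool recommendation, but the "compare against the prior state" check can be made mechanical instead of taking the agent's word for it. A rough sketch, assuming you have already run the same test several times on the prior state (changes stashed) and several times with the changes applied; the function name and the labels are made up for illustration.

```python
def classify_failure(baseline_runs: list[bool], current_runs: list[bool]) -> str:
    """Classify one test from repeated pass/fail outcomes.

    baseline_runs: outcomes on the prior state (e.g. with the changes stashed).
    current_runs: outcomes with the agent's changes applied.
    """
    if not baseline_runs or not current_runs:
        raise ValueError("need at least one run on each side")
    if all(current_runs):
        return "passing"
    if any(baseline_runs) and not all(baseline_runs):
        return "flaky"         # unstable even without the changes
    if not any(baseline_runs):
        return "pre-existing"  # never passed on the prior state
    # From here on, the baseline passed every time.
    if any(current_runs):
        return "newly-flaky"   # the changes introduced instability
    return "regression"        # passed before, fails consistently now
```

Only "flaky" and "pre-existing" back up the agent's excuse; "regression" and "newly-flaky" mean the changes are responsible.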
> strictly speaking, it was working before and now it isn't
I've been seeing more things like this lately. It's doing the weird kind of passive deflection that's very funny in the abstract and very frustrating when it happens to you.
The thing to remember is that LLMs deeply model human behavior. If you want them to do their best work, you need to treat them like a collaborator and get them "invested" in the work and the outcome. I use an onboarding process with every new context and maintain an environment where a human would likely feel invested in the work and the outcomes. For me, it prevents a host of failure modes, and code quality has markedly improved.
What gets me is when the tests are correct and match the spec/documentation for the behavior, but the LLM starts changing the tests and documentation instead of fixing the broken behavior... I then have to revert (git reset) and tell the agent that the test is correct and that I want the behavior to match the test and documentation, not the other way around.
I'm usually pretty particular about how I want my libraries structured and used in practice... Even for the projects I do myself, I'll often write the documentation for how to use it first, then fill in code to match the specified behavior.
Nowadays Japan’s fertility rate is higher than that of most of its neighbours. We are just used to picking it as an example because it started aging earlier than most other countries.
Japan's population is still over 120 million. Forecasts put it falling below 100 million at some point in the second half of this century.
Things will have to change in order to keep population stable in the long term, but the Japanese approach seems IMHO more sensible than that of other countries.
I mean you can still scale that? Ask a lighter model to go through every function to find vulnerabilities, take output to bigger model like Opus and classify the critical ones.
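Roughly what I mean, as a sketch. The two model calls are passed in as plain callables (`cheap_scan`, `strong_classify`) because I'm not assuming any particular vendor API; those names, and the choice to split Python source with `ast`, are placeholders for illustration.

```python
import ast
from typing import Callable


def extract_functions(source: str) -> dict[str, str]:
    """Map each function name in a Python source string to its source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node) or ""
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def triage(
    source: str,
    cheap_scan: Callable[[str], list[str]],
    strong_classify: Callable[[str, str], str],
) -> dict[str, list[tuple[str, str]]]:
    """Stage 1: the cheap model lists suspected issues per function.
    Stage 2: the stronger model labels each suspected issue."""
    report: dict[str, list[tuple[str, str]]] = {}
    for name, body in extract_functions(source).items():
        findings = cheap_scan(body)
        if findings:  # only flagged functions reach the expensive model
            report[name] = [(f, strong_classify(body, f)) for f in findings]
    return report
```

The expensive model only ever sees what the cheap one flagged, which is where the cost saving comes from. Functions that share a name overwrite each other in this toy version.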
Anthropic gave the model the whole codebase and told it to find a vulnerability in a specific file, iterating across sessions focusing on different files.
What happens then is that, for example, the model looks through that particular file, identifies potential problems, and works upwards through the codebase to check whether those could actually be hit.
“Hum, here we assume that the input has been validated, is there any way that might not be the case?”
This is not unique to Mythos. You can already do this with publicly available models. Mythos does appear to be significantly more capable, so it would get better results.
The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.
Mmm, Anthropic had a harness that had Mythos check each file as an entry point. That's not quite "here is a codebase, find vulns". A more sophisticated harness with a fast and cheap model could go function-by-function to do the same thing. Which is what this was validating.
> The research discussed here provided models with just a known buggy function, missing the whole process required to find that bug in the first place.
That process can be made part of a harness, again which is what they were validating.
I'm not sure why people are so hell-bent on disparaging open source models here. I get that some people can't get results from them, but that's just a skill issue - we should all be ecstatic that we don't need to rely on the unethical AI corps to allow us to do our jobs.
By that token, humans drove a great number of species to extinction long before the Industrial Revolution. So by that line of thinking we were already running into the limits of natural resources in the Neolithic.
Obviously we’re becoming better at extracting resources over time, but humans ran out of new land to exploit long before Europe's conquest of the Americas. Land only seemed empty because disease decimated native populations; people lived in San Francisco thousands of years ago.
Most of humanity survived on agriculture and sometimes hunting-gathering for the last 10k years. The number of people who survived on hunting whales is minuscule. Comparing those two is nonsensical.
But you seem to be missing the point: the parent is talking about the industrial-scale means to cause far more destruction to the environment, which the OP's point hinges on. The parent does not say humanity survived on hunting whales. Quite the opposite: once they had the means, people nearly drove whales to extinction.
The Industrial Revolution is generally understood to have started somewhere around 1760. Moby Dick took place in approximately 1830, about 10 years before what some historians mark as the end of the agrarian-to-industrial shift that is generally termed the Industrial Revolution.
I get sort of wishy-washy from 1830 on, because lots of people put the end of the Industrial Revolution at 1900, but 1840 is a defensible and commonly held position.
Most people in the US are immigrants, including most white people. If not their parents, then their grandparents or their parents. Very, very few Americans have a lineage to the Revolution.
Several European countries have already fallen into this trap. As pensioners comprise an increasingly large fraction of voters, pandering to them becomes far more politically attractive than investing in the future.
Bundle of tokens comes in, bundle of tokens comes out. If there is any trace of consciousness or subjectivity in there, it exists only while matrices are being multiplied.
LLM() is a pure function. The only "memory" is context_list. You can change it any way you like and LLM() will never know. It doesn't have time as an input.
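A toy version of the claim, to show what "pure function of the context" means here. This is obviously not a language model: `llm` just hashes its input, and `VOCAB` is made up. It stands in for a model with fixed weights and deterministic decoding; real inference usually samples, but the random seed is then one more explicit input.

```python
import hashlib

VOCAB = ("yes", "no", "maybe", "later")


def llm(context_list: tuple[str, ...]) -> str:
    """Toy stand-in for a fixed-weights model with deterministic decoding:
    the output depends on the context and nothing else (no clock, no state)."""
    joined = "\x1f".join(context_list).encode("utf-8")
    digest = hashlib.sha256(joined).digest()
    return VOCAB[digest[0] % len(VOCAB)]


genuine = ("user: hello", "assistant: hi", "user: still there?")
edited = ("user: hello", "assistant: go away", "user: still there?")

first = llm(genuine)
second = llm(genuine)   # same answer, whether called a second or a year later
forged = llm(edited)    # nothing inside llm() can tell this history was rewritten
```

The function has no way to distinguish a genuine history from an edited one, and no way to notice how much wall-clock time passed between calls.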
As opposed to what? There are still causal connections, which feel sufficient. A presentist would reject the concept of multiple "times" to begin with.
An LLM is not intrinsically affected by time. The model rests completely inert until a query comes in, regardless of whether that happens once per second, per minute, or per day. The model is not even aware of these gaps unless that information is provided externally.
It is like a crystal that shows beautiful colours when you shine a light through it. You can play with different kinds of lights and patterns, or you can put it in a drawer and forget about it: the crystal doesn’t care anyway.
So what? If a human were unconscious every 5 seconds for 100ms, would you say they are "less conscious"? Tokens are still causally connected, which feels sufficient.
If the human is killed every 5 seconds and replaced by a new human, they are indeed less conscious. The LLM doesn't even get 5 seconds; it's "killed" after its smallest unit of computation (which is also its largest unit of computation). And that computation is equivalent to reading the compressed form of a giant look-up table, not something essential to its behavior in a mathematical sense.
I'm not understanding how this is analogous to being killed every 5 seconds as opposed to being paused. Let's call it N seconds, unless you think length matters?
> And that computation is equivalent to reading the compressed form of a giant look-up table, not something essential to its behavior in a mathematical sense.
Because (during inference) the LLM is reset after every token. Every human thought changes the thinker, but inference has no consequences at all. From the LLM's "point of view", time doesn't exist. This is the same as being dead.
The "time" part is what I don't get. If you want to say that "resetting and reingesting all context fresh" somehow causes a problem, that I can see. If you want to say that the immutability of the weights is a problem, okay great I'm probably with you there too. "Time" seems irrelevant.
Something similar could be said of the brain? Bundles of inputs come in, bundles of outputs come out. It only exists while information is being processed. A brain cut from its body and frozen exists in a similar state to an LLM in ROM.
> the AI has been able to explore all these possibilities much more comprehensibly, and doing that it found a path, it found a way to the solution.
Finding a counterexample of a mathematical conjecture strikes me as not that different from finding a vulnerability in a complex codebase.