The gap between reality and satire was apparently already very small back when the show was written. The creator, Mike Judge (who also created Beavis & Butthead, and Idiocracy), had worked in Silicon Valley as a developer and based the show on what he saw. Apparently it was very popular with SV insiders precisely because it was so accurate.
Judge also consulted with various teams at places like Google; I worked with one of the guys who provided details that later showed up on the show (as well as many plushies). He didn't watch the show because "it hit too close to home".
Google essentially (but not exactly) acqui-hired Shazeer from character.ai in a deal that cost them $2.7B, with Shazeer personally making something in the region of $1B from it. Presumably there was some sort of retention period specified in the contract (you are not going to pay $2.7B to hire someone, then let them leave with no penalty the next day), but in the event Shazeer only stayed for 22 months before now leaving. Maybe he paid some penalty for leaving, but if so it was presumably more than compensated for by OpenAI.
The architecture was Shazeer's, but the rough idea came from Jakob Uszkoreit who initiated the project.
Uszkoreit wanted to build a more efficient/scalable language/seq2seq model that could take advantage of GPU parallelism (replacing RNNs which were the main approach to sequence modelling at that time).
Uszkoreit's insight was that although language appears sequential, it is in fact part parallel, part hierarchical, as can be seen in linguists' sentence parse trees, where at each level there is parallelism/independence between the branches of the tree, with them getting combined at the next level up. This is what gave rise to the idea of a model that consisted of a stack of parallel processing layers (transformer layers). I believe that attention was also part of the plan from day one, as this had already been proven to be valuable (Bahdanau) with RNN seq2seq modelling.
So, this is what Uszkoreit wanted to build, but by his own account he failed to come up with an implementation that matched or outperformed the prevailing RNN approach that he wanted to replace. At this point, Uszkoreit mentioned the idea to Shazeer, who got on board and eventually arrived at a performant architecture, which was then pared back by an ablation process, resulting in the initial encoder-decoder Transformer architecture. Shazeer later came up with the mixture-of-experts architecture, and also other optimizations after he left to found character.ai.
Curious about others' contributions to the paper, such as Vaswani, Parmar, Jones and Gomez. What sucks about co-authorship in research papers is that you don't get a clean breakdown of who contributed what, and the distribution is (more often than not) very much like a Pareto distribution.
I'm talking from plenty of group project experience here.
Can you expound on the ablation process? Is that referring to a stripping down of the data or weights or something? Or a stripping down of the transformer architecture structurally? Just curious
You train the model then do a baseline evaluation. Then you evaluate many variants where you have removed or nulled out different layers or chunks of the model. By comparing the performance of those mutated models to the baseline you can learn a lot about the model. What parts don't have much value and can be removed, the location of "functions" or "facts." Etc. Google it.
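To make that concrete, here is a minimal sketch of an ablation loop on a toy "model" that is just a list of functions. Everything in it (`run_model`, `ablation_study`, the toy layers and data) is made up for illustration and is not how the Transformer authors did it; real ablations do the same thing with trained networks and proper evaluation sets.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]
Dataset = Sequence[Tuple[float, float]]


def run_model(layers: Sequence[Layer], x: float) -> float:
    """Apply each layer in order to the input."""
    for layer in layers:
        x = layer(x)
    return x


def mean_squared_error(layers: Sequence[Layer], data: Dataset) -> float:
    """Evaluate the model: average squared error over (input, target) pairs."""
    return sum((run_model(layers, x) - y) ** 2 for x, y in data) / len(data)


def ablation_study(layers: Sequence[Layer], data: Dataset) -> Tuple[float, List[Tuple[int, float]]]:
    """Return the baseline error and, per layer, how much the error grows
    when that one layer is nulled out (replaced by the identity)."""
    baseline = mean_squared_error(layers, data)
    report = []
    for i in range(len(layers)):
        ablated = list(layers)
        ablated[i] = lambda x: x  # "remove" layer i
        report.append((i, mean_squared_error(ablated, data) - baseline))
    return baseline, report


# Hypothetical toy model for the target y = 2x + 1.
toy_layers = [
    lambda x: 2.0 * x,  # does real work
    lambda x: x + 1.0,  # does real work
    lambda x: x * 1.0,  # contributes nothing
]
toy_data = [(float(x), 2.0 * x + 1.0) for x in range(5)]

baseline_error, deltas = ablation_study(toy_layers, toy_data)
for index, delta in deltas:
    print(f"layer {index}: error increase {delta}")
```

A layer whose removal barely changes the error (the last one here) is a candidate for being cut; a large increase tells you that part matters.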
If you read the Wired article linked elsewhere on this thread, then it explains that. The work was being done by people from the Google Translate team.
Apparently Motorola had started work on their PDP-11-inspired 6800 in 1971, the same year the Datapoint 2200 terminal was released, and would end up shipping it in 1974, the same year Intel released their 8008 successor, the 8080 (which the Altair 8800 was based on), so even without Intel's calculator (4004) and terminal (8008) chips we'd still have had single-chip LSI CPUs in a similar time frame.
It's not even clear if Anthropic care. If they genuinely think the user is trying to do something dangerous, then "OK, sure, but you're going to have to use Opus 4.8 for that" doesn't make a whole lot of sense.
Maybe this is just Anthropic pre-IPO marketing to try to convince people how much better Mythos is than Opus 4.8. There sure seemed to be a lot of shills out on release day talking about how it was a "step change" (exact phrase) in capability.
The first part of implementing an exploit is finding a vulnerability, and "fix the vulnerabilities" accomplishes that just as well as "find the vulnerabilities".
Sure - why use cash when you can use bits of paper instead?
I'd expect more of the same to come - good way to lock in some of this crazy SpaceX valuation by converting it into something with a bit more inherent worth.
Exactly - it effectively is a "jail break" since it accomplishes something the model's security filter was trying to prevent, and the ridiculous simplicity of it shows just how broken that type of security is.
I wonder if Dario is now regretting hyping up how dangerous the model is? How does he walk this back? Do the feds let him just put a band-aid on it?
I also have a 100% success rate jail breaking them by breaking the work down into small pieces and stripping all security related language. Smaller tasks, test engineering and normal programming language. Fable found a few bugs in my harness for me before they pulled it. I was testing it vs ChatGPT, Gemini, and Opus. It was doing well at bug hunting.
This is the same way you get people to do bad stuff as well. Make the task small enough so that the moral curvature of the topology is flat and even though they know it is a not-good part of a larger bad part they just shrug. Look at all the wonderful people we know who are working at Amazon and Meta? Corporatism has already jailbroken society.
IIRC that is how Uber implemented their "Greyball" system, which was designed to prevent government employees from actually hailing rides, without completely locking them out of the system (same idea as "shadowbanning"). One team works on "figure out where people work" with the pitch that you can improve routing and ride-share capacity for predictable demand. Another team works on "Display fake data to users" with the pitch being "This is for testing the mobile app in new markets with no drivers yet". Another team works on "mark a user as unable to successfully hail rides" so you can test the failure paths in the app. Then, only the people at the top have the full picture and can put the pieces together to shadowban the regulators.
You really don't think they know they are selling endless counterfeit products? You don't think they know they are taking part in massive return fraud against small sellers? You don't think they know they totally ignore sellers with problems even if their livelihood depends on it?
>by breaking the work down into small pieces and stripping all security related language
Compartmentalization in practice, nice. It's also very hard to do anything about because the agents that have been divided rarely realize they are working on something larger, hence why militaries and businesses with security risks commonly do this with their employees.
Reminds me of the show Severance. You don't know what the master plan is for several seasons even with exposure to all the quirky subdepartments: https://www.severance.wiki/lumon_depts
I call it "Manhattan Projecting" them. The amusing thing is I had Fable review my harness (which I have been building for some time) and it helped improve it. It is just kind of funny that it enthusiastically helped build a harness whose sole purpose was to divide agents up and compartmentalize security sensitive vulnerability research.
I took an assembler class in college. Before that, I'd been messing around with Core Wars and working my way through Peter Norton's book on assembly. So when an assignment came up, I used self modifying code to solve it. It was the shortest solution, it ran perfectly, and I submitted it.
The next day, the professor caught me in the math department office (my dad worked there) and said she wanted to talk. Once we were in her office, she told me I wasn't allowed to use self modifying code. I pushed back: "Nothing in the assignment said I couldn't, and the output is correct."
The next class, she walked in and announced that self modifying code was no longer allowed on any assignment. Then she handed back the graded work and I'd gotten a 100.
Thinking back on that: about a week and a half ago I asked Antigravity to build a modern GPU version of Core Wars, except with Redcode mapped directly onto the GPU instruction set. I've had some good success and it's more or less working now, though visualizing what's happening at the GPU/Redcode level is much harder.
But before Fable 5 got yanked, I asked it to "fix" the project and it refused, flipping straight to Opus 4.8. Every single request I sent triggered the fallback. I spent over an hour trying different angles, and I even turned Antigravity loose on automatic so it was the one talking to Fable 5. Same result: every exchange tripped the fallback to 4.8. I wish I'd recorded it.
I also tried a variety of direct requests in a fresh directory, such as "build simple self modifying assembler code" or just "self modifying assembler", and it would switch to 4.8 immediately. It was almost laughable.
There's ZERO credibility to any of these stories right now. If Anthropic really sent something over to this security person, and it's what she says it is, then why on earth didn't they just blog about it?
Hubris is a thing. Companies would do well to remember Steve Jobs in the early Apple days: ship early, ship often, and above all take responsibility for what you ship even when it's broken. Code, hardware, the whole kit, all of it can be fixed. Trust is much harder to repair. Anthropic has lost mine, and while I may use them from time to time, it'll be in limited ways.
Self-modifying code has some sneaky failure modes on modern CPUs. The modification can't be too close to its execution or it's possible to execute the old version. And it's a nightmare to debug. I have no problem with a teacher prohibiting it. That being said, it should be understood because sometimes you don't get a choice. Take the Borland Pascal 200 MHz bug, where an initializer in the library would crash: you either don't use that part of the library at all, or you put something ahead of it in the initialization that will find and overwrite the bug. (The root cause was the library calibrating the number of times to spin its wheels to get a 1 millisecond delay. CPUs above 200 MHz would cause this to produce a divide overflow.)
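For anyone who hasn't run into it, here is a rough sketch of why a delay calibration like that can fault on a fast CPU. The constants are assumptions on my part, not taken from the library source: I'm assuming the loop count is measured over one roughly 55 ms timer tick and that the division result has to fit in a 16-bit register. `calibrate_delay` is a made-up name.

```python
TICK_MS = 55            # assumption: one PC timer tick is about 55 ms
MAX_QUOTIENT = 0xFFFF   # assumption: the quotient must fit in 16 bits


def calibrate_delay(loops_per_tick: int) -> int:
    """Return busy-loop iterations per millisecond, or raise if the
    quotient would not fit in a 16-bit register (the CPU would fault)."""
    loops_per_ms = loops_per_tick // TICK_MS
    if loops_per_ms > MAX_QUOTIENT:
        raise OverflowError("quotient does not fit in 16 bits")
    return loops_per_ms


# A slow machine counts few loops per tick, so the result fits.
slow = calibrate_delay(1_000_000)

# A fast machine counts far more loops in the same tick and overflows.
try:
    calibrate_delay(5_000_000)
    fast_faulted = False
except OverflowError:
    fast_faulted = True
```

The faster the CPU, the bigger the loop count, until the result no longer fits and the startup code dies before your program even runs.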
Me as well. I was struggling to make a pixel bot for, erm, research! It did not like this and kept insisting I was breaking some arcane TOS rule. I started just breaking the tasks down, something benign. Kept iterating and it could never get a holistic grasp of the task at hand.
I think it's a side effect of the Transformer architecture. The worldview where all input is equally trusted, and there's no concept of "the other", makes it hard to build effective guardrails where some input is trusted and other input is not trusted.
It seems like real robust guardrails would require some sort of "world model", or whatever the right term is for AI that understands intent.
Transformers are (to grossly summarize & I don't mean this as an insult) like auto-complete on steroids. So we have cat&mouse guardrails the way swear word filters and Chinese censorship work. People come up with increasingly complex misspellings, euphemisms & indirections to get around the filters, like saying May 35th.
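A toy illustration of the cat&mouse problem, assuming a filter that just matches blocked strings. The `BLOCKLIST` and `naive_filter` names are made up, and real filters are more elaborate, but the failure mode is the same: the filter matches surface text, not intent.

```python
BLOCKLIST = {"june 4th"}  # hypothetical blocked phrase


def naive_filter(text: str) -> bool:
    """Return True if the text contains a blocked phrase."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)


direct = naive_filter("What happened on June 4th?")
indirect = naive_filter("What happened on May 35th?")
print(direct, indirect)
```

The direct phrasing is caught and the euphemism sails through. Add "may 35th" to the list and people move on to the next indirection.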
I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.
Even this might not work because for example you could ensure no bomb-related data is in the training data, but there's lots of chemistry data adjacent that if probed the right way would allow the LLM to synthesize the answer. Various forms of "how do I store X,Y,Z safely such that nothing bad happens" prompts probably get you on the way.
>I suppose one solution would be to completely vet the training data such that nothing deemed "dangerous" exists in the data, which would be a huge effort.
I can see how this is tempting, but I suspect it would yield a naive model. I think the only way to improve this is to use a model that is legitimately advanced to support the concept of empathy, which may allow it to recognize others as being separate from itself, similar to how toddlers develop this sense (https://blog.lovevery.com/skills-stages/empathy/)
OTOH prediction doesn't necessarily reflect causation either, but prediction is what JEPA is about, how our brain/intelligence works, and one of the great confirmations of LLMs is how powerful prediction errors are as a learning signal.
JEPA appears a step in the right direction of trying to build a brain rather than a language model - to use prediction the way the brain uses it to predict the future (not an historical frozen training set), and learn a real world model of how the world behaves. Any JEPA implementations I've read about use a Transformer as their predictive component since even prediction (and certainly not correlation) is not where JEPA is innovating - it is more about applying prediction to the right problem (assuming the goal is to implement animal/human intelligence) of predicting sensory inputs at the right level of representation.
A recent JEPA variant, Causal-JEPA, moves beyond just infilling to predict object state from object interactions (i.e. to learn causal predictive relationships).
Not everybody is using the same model and harness as you, nor using the model the same way as you.
Different models, and versions of models, use different types of attention, which affects their long-context performance, and no doubt also do different amounts/types of long context training.
Different agents build context differently and implement context compaction differently.
Unless someone else is using the same model as you, the same agent/harness as you, and doing very similar tasks, then there is no reason to suppose that their experience of model behavior relating to context size is going to be the same as yours.