But that isn't that different from requesting the llms.txt version. Why not just make it so the useful content you want the LLM to focus on is easily retrievable from the same HTML the user's browser gets?
The sanity.io page writes:
> serving agents a bunch of HTML might just bloat their context window.
That's only true if you assume the agent can't extract the useful text before it goes into the model as tokens. Your browser's reader mode uses heuristics to identify what the actual content is in a large HTML response and strips away the rest.
To me this is a far better approach than worrying about an llms.txt file or looking at HTTP headers to see if Markdown is preferred. That effort could just as easily be directed at ensuring the useful content on your site carries the appropriate markup for an agent or any other tool to extract it. And it would require less work for the publisher of the content to implement.
I was using llms.txt as shorthand for the general idea of providing an alternative version of your content for agents, whether that's llms.txt for the entire site, my-article.md instead of my-article.html for a specific page, or content negotiation as your link prefers.
The content (HTML or Markdown) only becomes tokens when it is given to the model. Agents use parameters to limit the output from their tool calls all the time, precisely to reduce the number of tokens they have to pass to the model. So when an agent requests example.com/page and gets an 800KB response, those are not tokens yet. It could simply call a tool to extract the useful info before it gives the content to the model. That would produce roughly the same number of tokens as requesting example.com/page.md, or example.com/page with request headers preferring Markdown.
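To make that concrete, here is a minimal sketch of such an extraction tool using only Python's standard library. It assumes the page wraps its useful content in `<main>` or `<article>`, which is exactly the kind of markup I'm arguing publishers should provide. The names `MainTextExtractor`, `extract_main_text` and `SAMPLE_PAGE` are made up for illustration; a real agent would more likely use a reader-mode style library with better heuristics.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Collects text inside <main>/<article>, skipping non-content tags."""

    CONTENT_TAGS = {"main", "article"}
    SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.content_depth = 0  # > 0 while inside <main> or <article>
        self.skip_depth = 0     # > 0 while inside a tag we want to ignore
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.CONTENT_TAGS:
            self.content_depth += 1
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.CONTENT_TAGS and self.content_depth > 0:
            self.content_depth -= 1
        elif tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside content and not inside a skipped tag.
        if self.content_depth > 0 and self.skip_depth == 0:
            text = " ".join(data.split())  # collapse runs of whitespace
            if text:
                self.chunks.append(text)


def extract_main_text(html: str) -> str:
    """Return the text of the main content, one text node per line."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical page: navigation, tracking script and footer around the content.
SAMPLE_PAGE = """<html><head><title>t</title><style>body{}</style></head>
<body><nav><a href="/">Home</a></nav>
<main><h1>My Article</h1><p>The useful   content.</p><script>track()</script></main>
<footer>Copyright</footer></body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE_PAGE))
```

Only the heading and the paragraph text reach the model; the navigation, script and footer are dropped by the tool before anything is tokenized.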
So why not just make sure the useful info is easily extractable from the same HTML? Less work, no content negotiation on the server side, no worrying about maintaining two similar versions of the same content.
As an aside, I've always been against content negotiation for different representations of content. So if you really must maintain two different versions of your content (HTML and Markdown, say), give them different URLs. I agree with Roy Fielding on this[1]:
> It is a bad design trade-off to send a bunch of header fields on every request just to tell the server all of the possible variations of preference held by the user, particularly when there is a very small chance that any of those dimensions are applicable to the target resource. It has been a bad design trade-off ever since the very brief period in 1993-94 when folks didn't know which image format would be usable on all UAs and there was no CSS or javascript to allow for client-side adaptation.
> ...The caching impact of proactive negotiation is far worse than the one extra round trip per site for reactive negotiation, and even that round-trip isn't necessary in formats that support client-side adaptation.
On the caching impact, see this from Simon Willison[2]:
> ...you can’t deploy an application that uses content negotiation via the Accept header behind the Cloudflare CDN — for example serving JSON or HTML for the same URL depending on the incoming Accept header. If you do, Cloudflare may serve cached JSON to an HTML client or vice-versa.
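A toy model of the failure mode described in that quote, in case it helps. This is not how Cloudflare or any real CDN is implemented; `origin` and `ToyCache` are hypothetical names, and the only point is to show what happens when a cache keys on the URL alone while the origin varies its response on the Accept header.

```python
def origin(url: str, accept: str) -> str:
    """Hypothetical origin server that negotiates on the Accept header."""
    if "application/json" in accept:
        return '{"title": "Hello"}'
    return "<h1>Hello</h1>"


class ToyCache:
    """A deliberately simplistic cache sitting in front of origin()."""

    def __init__(self, vary_on_accept: bool):
        self.vary_on_accept = vary_on_accept
        self.store = {}

    def get(self, url: str, accept: str) -> str:
        # Keying on the URL alone ignores the negotiated representation.
        key = (url, accept) if self.vary_on_accept else url
        if key not in self.store:
            self.store[key] = origin(url, accept)
        return self.store[key]


if __name__ == "__main__":
    naive = ToyCache(vary_on_accept=False)
    naive.get("/page", "application/json")  # first request fills the cache
    print(naive.get("/page", "text/html"))  # HTML client is handed the JSON

    careful = ToyCache(vary_on_accept=True)
    careful.get("/page", "application/json")
    print(careful.get("/page", "text/html"))  # HTML client gets HTML
```

Keying on the raw Accept value fixes correctness in this toy, but every distinct header string becomes its own cache entry. With separate URLs for each representation, the problem doesn't come up at all.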
[Edited to add: if the source of truth is already Markdown in your system, by all means expose that. What I'm discussing here is related to efforts to produce new Markdown or plain text output, in addition to HTML, specifically for agents]