Even if they could practically restrict access to US citizens only, I would expect them not to - it would be hard to regain that once lost and they need a global market for growth.
Yeah, v1 is sloppy, then I tell the LLM to clean it up. Every 1 prompt of building tends to require 1-5 prompts of clean up. Simple, fast, clean good code.
The chasm between "Software Developer" and "Software Engineer" is getting wider. Articles like this and the comments under it give away who is an Engineer and who is just a coder.
> Every 1 prompt of building tends to require 1-5 prompts of clean up. Simple, fast, clean good code.
I have found this to be very effective as well. However, it's so easy to do, I can't imagine they won't build it in.
The harnesses will improve and the loop of "self-review, judge what needs clean up, do the refactoring, repeat until clean" will get included in the one-shot. They are already doing this somewhat, they'll just get a lot better at it and as the models get faster and cheaper to run, the refactoring churn at the end of each task won't even create a noticeable delay.
I do not think the high-level "taste" knowledge that I've built up -- when to break something off into its own service, what to put in the DB vs cache vs queues vs blob storage, how to isolate important logic in pure functional layers so it can be tested and validated independently -- is any more "unlearnable" to AI than the stuff I previously considered impressive that's now one-shottable like "write a Prolog implementation from scratch".
And yes, right now you still need the architectural and system design knowledge because the LLM will fuck that up. We'll all find out if that continues being needed in the future. From what I understand about LLMs and how they work, I doubt it, but also, yeah, I doubted it would've gotten this far when I think back 2+ years ago.
Also, maybe I should be clear, I pretty much never one-shot things. My sessions with claude or other cli tools always start with a bit of a conversation until we converge on a good plan, claude builds the code, we discuss some more, then we iterate.
I wish I had current-day AI (and a big credit card) for my previous job, they had a big legacy mess made by a productive but not very good developer, but my job was to rebuild it.
If I had AI tooling at the time I'd probably be more inclined to have it both refactor / optimize the existing application, add automated regression tests etc, and use it to extract all of the features and requirements for it for a potential rebuild.
But honestly I think if that application was properly designed and factored (instead of nesting JS in HTML in strings in JS or concatenating XML from query results only for it to be converted to JSON taking up 50% of response time) its lifetime could've been extended, especially if it was then containerized into a HHVM or similar php optimizer.
Any tips on how you unsloppify things? Are you using things like claude.md/copilot.md (or similar) to guide better, do you have specific types of prompts that you run, or do you adjust your code review practices in some way to more efficiently review lots of slop code?
One of my particular complaints is how code-gen LLMs tend to re-create the same code over and over again. Case in point, a use-case where a team name is generated from a list of team member names. The LLM re-generates this code in-line every time it needs to display the team name, rather than simply writing and reusing a utility style function.
I know I need to fix this. At this point I'm planning to just prompt something like "please list all the places where team names are generated/calculated", plus manually search through the codebase, then perform the abstraction myself. But I'm unsure how to prevent this (both this example, and other cases that could benefit from similar utility functions) continuing to occur in the future.
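For illustration, a minimal sketch of the kind of shared helper described above. The function name `team_display_name`, the `max_listed` parameter, and the naming rule itself (join the first few names, summarize the rest) are assumptions made up for this example, not the commenter's actual logic.

```python
def team_display_name(member_names: list[str], max_listed: int = 3) -> str:
    """Build a team's display name from its members' names.

    Hypothetical rule: list up to `max_listed` names, then summarize the rest.
    """
    # Ignore blank entries so stray whitespace does not leak into the label.
    cleaned = [name.strip() for name in member_names if name.strip()]
    if not cleaned:
        return "Unnamed team"

    listed = cleaned[:max_listed]
    remaining = len(cleaned) - len(listed)
    label = ", ".join(listed)
    if remaining > 0:
        label += f" +{remaining} more"
    return label
```

Every place that displays a team name would call this one function, so a change to the rule happens in a single spot and a search for the function name lists all call sites.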
I accept that for every prompt of building I'm going to have 1-5 prompts of refinement.
Once the LLM tells me "Okay, it's done, everything works" I always ask it to do a thorough review. I tell it to split up the work among sub-agents with each one taking on a specific responsibility (look for code smells, look for bad architecture, review the data access model, DUPLICATE CODE, testability and unit testing, etc.)
After a certain number of revisions and reviews you'll come to accept the shortcomings it comes back with. Usually there will be specific design decisions you made that the LLM keeps bringing up; once the review only brings up those and maybe some other minor issues, it's time to move on.
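A rough sketch of that review loop, assuming two hypothetical callables supplied by whatever agent tooling is in use: `run_review(prompt)` returns a list of findings and `apply_fixes(findings)` asks the agent to address them. Neither is a real API; the point is only the stopping rule, which ends the loop once nothing is left except design decisions already accepted.

```python
from typing import Callable

# One responsibility per sub-agent review, taken from the list above.
REVIEW_FOCUS_AREAS = [
    "code smells",
    "architecture",
    "data access model",
    "duplicate code",
    "testability and unit tests",
]


def build_review_prompts() -> list[str]:
    """One narrowly scoped review prompt per focus area."""
    return [
        f"Review the change with one responsibility only: {area}. "
        "List concrete findings."
        for area in REVIEW_FOCUS_AREAS
    ]


def review_until_settled(
    run_review: Callable[[str], list[str]],    # hypothetical: prompt -> findings
    apply_fixes: Callable[[list[str]], None],  # hypothetical: findings -> None
    accepted: set[str],                        # design decisions you stand by
    max_rounds: int = 5,
) -> list[str]:
    """Repeat review rounds until only accepted findings remain.

    Returns the findings still outstanding after the last review round
    (an empty list when the loop settled).
    """
    outstanding: list[str] = []
    for _ in range(max_rounds):
        findings: list[str] = []
        for prompt in build_review_prompts():
            findings.extend(run_review(prompt))
        # Findings that only repeat accepted decisions do not count.
        outstanding = [f for f in findings if f not in accepted]
        if not outstanding:
            break
        apply_fixes(outstanding)
    return outstanding
```

Matching findings by exact string is a simplification; in practice a person makes that judgment each round.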
I don't overly rely on markdown files and directions. I don't rely on tooling around it either. I just don't trust the LLM when it says "all done", tests pass, and deployment works. I make it do multiple reviews and iterations even when it thinks it's done.
I also ask it to explain the system back to me. Obviously it should understand the system just by reading the code. But somehow, explaining the system back to me seems to make it more effective. Then I'll ask it questions about how I should make changes to the system. Sometimes I'll agree, sometimes I'll disagree and offer an alternative and ask it to assess the alternative. Having this entire conversation in its context seems to make it way more effective at refactoring/unsloppifying code.
Understand what you're writing. If you never build up the mental model of what the code is doing you'll never be able to discern what is slop and what isn't. There are no shortcuts.
Piling more prompts on might get you to the same end result, but without understanding you'll never know when you're there.
Absolutely. I really don't think the future will be humans reading and picking apart an AI-generated codebase, there will be tech debt agents or whatever running overnight.
I think you misunderstand why tech debt lingers around. It's not a capacity or capability problem.
Organisations just don't want to deal with the accountability involved with "touching cold code". Whether it's a human or "AI agent" doesn't change the "It worked in prod, you touched it, you broke it, never touch anything again" dynamic.
That's one dimension of it, but in the context of this thread we are talking about how maintainable a codebase is for other humans. If your codebase is messy you depend on a few key employees and it might be hard to onboard new ones, so there have always been financial incentives to reduce tech debt.
Um, no, actually AI makes it better because the cost is lower now. I'm not sure what point you're trying to make here, obviously organizations already fight against tech debt all the time through a variety of means?
The point there is that it is MUCH easier to get corporate to agree to something when the cost is nebulous and being paid anyway. If you get a senior dev to clean up some tech debt, how much did that cost the company? The dev will have multiple things going on at the same time, so you can't cleanly assign a number of hours, and maybe multiple people are involved. It's practically unknowable. Practically, $0.
So your proposal to handle tech debt created by "AI" being unable to do good engineering is... throw more AI at it? There's a saying about the definition of insanity which comes to mind.
This is virtually impossible to build. Not just because all current "AI detector" systems are fake or outright scams with accuracy comparable to a coin-flip on frontier model output, but because even if someone did build a reliable detector and released it to the public, it could be used for adversarial training and it would become worthless pretty fast.
Pangram is legit. I don't work at pangram, we integrated it in our paper website and one of the cool emergent behaviors I've seen is that on AI papers with example rollouts, it will accurately mark the paper's main text as human generated and the rollouts as AI generated.
My understanding is that they strongly believe in no false positives, so it's definitely possible to slip something by them but if it marks something as AI, it very likely is.
> My understanding is that they strongly believe in no false positives
Who cares what they "believe" (or, more accurately, say they believe). What are the underlying processes that actually guarantee this, and what data supports it?
2: the third study they link there is based entirely around the assumption that Pangram is correct, and seems to have been a collaboration or something as they're included in the credits area.
Bot detectors are broken. Even human bot detectors are broken. When I'm in the right mood, I can be quite capable of writing with very good formatting, structure, and phrasing. When I actually take the time to do this, there seems to be about a 70% chance that some nimrod will crawl out of the woodwork just to accuse me of being a bot.
Even humans who deliberately use lazy formatting and leave obvious errors uncorrected to provide "proof" of being human aren't seeing the big picture, here.
---
That bigger picture is that it's easy to instruct a bot to be lazy, or to avoid the usual quirks. I hate when I'm working on a project and see a constant outflow of negation ("Don't do x, y, or w" is a recent hit) and unfounded exclusive confidence ("The correct answer" as if this is Highlander and there can be only one). Repetitious jargon like overuse of "gate" for things other than fences and skiing is something I can't stand. Plus the usual things — like overuse of unusual punctuation — that are obvious tells.
That stuff all drives me nuts.
But the bot just follows instructions, and my bot has been instructed to avoid those things. It generally performs very well, though the instructions do need re-hashed every now and then as models ebb and flow.
It's super easy to get the bot to write some python or perl that takes a body of text and intentionally misspells some words or loses a comma while making other errors and converting — into --.
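As a sketch of how little code that takes, here is one possible version in Python. The function name `roughen`, the rates, and the specific error types (swapping two adjacent letters, dropping a trailing comma) are assumptions chosen for illustration, not the commenter's script.

```python
import random
from typing import Optional


def roughen(
    text: str,
    typo_rate: float = 0.05,
    comma_drop_rate: float = 0.2,
    seed: Optional[int] = None,
) -> str:
    """Inject small, human-looking errors into text."""
    rng = random.Random(seed)  # pass a seed for repeatable output

    # Replace the em dash with a typed double hyphen.
    text = text.replace("—", "--")

    words = []
    for word in text.split(" "):
        # Occasionally lose a comma at the end of a word.
        if word.endswith(",") and rng.random() < comma_drop_rate:
            word = word[:-1]
        # Occasionally swap two adjacent letters in a longer plain word.
        if len(word) > 3 and word.isalpha() and rng.random() < typo_rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)
```

Setting both rates to 0 leaves only the dash conversion, which makes it easy to tune how rough the output gets.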
When it comes to human error in written language, we just aren't that hard to emulate.
Now, that all said: You'll just have to take my word for it, but I do not use the bot to help with writing English. But I do have every confidence that if I woke up tomorrow and actually started bulking up my comments using a bot, none of you would be able to tell.
I work somewhere that tries to do such detection (for fraud prevention) and it sort of feels impossible to me in the medium term. AI slop qualities are fleeting - I’ve seen Reddit AI posts that have misspelled words, no dashes, stilted sayings and so on.
I've been waiting for someone to say this. An agent will generally produce far more code than technically necessary for the task. It's a kind of over engineering which makes it increasingly harder to wrap your head around the codebase.
The issue is provenance. We need cameras and phones to digitally sign photos so we can easily verify an unadulterated image.
You also want to be able to chain signatures, so that for example a news reporter could take a photo, then the news outlet could attest its authenticity by adding their signature on top.
Same principle could be applied to video and text.
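A toy sketch of the chaining idea, under loud assumptions: it uses HMAC with shared secret keys from the Python standard library as a stand-in for real public-key signatures (a camera would need something like an embedded private key and a published certificate), and the record layout and function names are invented for this example.

```python
import hashlib
import hmac
import json


def _payload(image_sha256: str, earlier_links: list) -> bytes:
    # Each signer covers the image hash plus every earlier signature.
    return json.dumps(
        {"image_sha256": image_sha256, "chain": earlier_links}, sort_keys=True
    ).encode()


def attest(record: dict, signer: str, key: bytes) -> dict:
    """Return a new record with this signer's link appended to the chain."""
    payload = _payload(record["image_sha256"], record["chain"])
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {
        "image_sha256": record["image_sha256"],
        "chain": record["chain"] + [{"signer": signer, "sig": sig}],
    }


def verify(image: bytes, record: dict, keys: dict[str, bytes]) -> bool:
    """Check the image hash and every link in the chain, in order."""
    if hashlib.sha256(image).hexdigest() != record["image_sha256"]:
        return False
    if not record["chain"]:
        return False  # nobody has vouched for this image
    for i, link in enumerate(record["chain"]):
        key = keys.get(link["signer"])
        if key is None:
            return False
        payload = _payload(record["image_sha256"], record["chain"][:i])
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(link["sig"], expected):
            return False
    return True


# Usage: the camera signs first, then the outlet signs on top.
photo = b"raw image bytes"
keys = {"camera": b"camera-key", "news-outlet": b"outlet-key"}
record = {"image_sha256": hashlib.sha256(photo).hexdigest(), "chain": []}
record = attest(record, "camera", keys["camera"])
record = attest(record, "news-outlet", keys["news-outlet"])
```

Because the outlet's signature covers the camera's, removing or altering the earlier link invalidates the later one.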
Signing something doesn't verify that it's real, it just verifies that you claimed that it was real, which everyone was already aware of. You can either hack a camera, or use an unhacked camera to take a picture of a fake picture.