Decoded: GNU Coreutils (maizure.org)
256 points by bshanks on July 2, 2019 | 55 comments


Interesting project, this would make a good teaching tool for people interested in more operating system interaction.

On the website it would be nice if you had a way to flag which ones were fully annotated. But I like the main page design. The block graphics on the flow are very helpful.

You've done some very nice work here.


I had considered rewriting a few of them in Python as an exercise, but I’ve never really gotten around to it, and I’m not entirely sure it would be that useful, so I’ve held off.


Agreed, this would be very beneficial to students and people learning C.


I have used the source code for Plan 9's core utilities as a teaching aid for students learning the basics of C and operating systems to great effect. I have found GNU's tools to contain a lot of "cruft" to make them run fast, which is not necessarily conducive to understanding for those who are not familiar with why they work.


Busybox and Toybox also contain pretty simple and easy to follow implementations.


Busybox is a fun read on how they got all that functionality into such a small space. It always amazes me that my home router has all those functions because of Busybox being loaded.

OTOH, I shouldn't because I think my current router has more CPU / Memory than my first mainframe.


Yeah, plan 9 was my second thought when I saw this. The first one was suckless' core.


suckless seems like a spiritual successor to Plan 9 (albeit POSIX), to the point that it uses some of the same constructs, so I’m not surprised that you thought of it.


Huh, I'd never heard of FTS before, the functions used by Coreutils to traverse the filesystem:

http://man7.org/linux/man-pages/man3/fts.3.html


FWIW, the POSIX equivalents are `ftw` and `nftw`[1]. POSIX 2008 deprecates `ftw`.

[1]: https://pubs.opengroup.org/onlinepubs/9699919799//functions/...


ftw/nftw is a crap interface as it relies on C callbacks and isn't re-entrant.[1] Android/Bionic, Darwin/macOS, DragonflyBSD, FreeBSD, glibc, musl libc, OpenBSD, NetBSD, and even Solaris all support the FTS API.[2]

Of the extant Unix environments only AIX and QNX seem to lack it. HP-UX doesn't seem to support it, either, but I don't count HP-UX as extant. In any event it's trivial to copy an FTS implementation from a BSD.

I'm a strong advocate for adherence to POSIX, but in this case there's significant benefit from using FTS and very little, if any, cost.

[1] FTS uses a C callback for its comparator, which can be a headache but not nearly to the same extent as it is with nftw.

[2] It's originally from BSD Reno. It was adopted by glibc, which in turn has forced commercial vendors like Solaris to support it as they're chasing Linux/glibc API compatibility. I would expect AIX to add it eventually.
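
To make the callback complaint concrete, here is a rough Python sketch (not the C API) contrasting the two shapes. `walk_with_callback` and `iter_tree` are hypothetical names made up for this illustration: the first mimics the nftw style, where the walker drives the traversal and calls you back, and the second mimics the fts_read style, where the caller pulls one entry at a time. Since the nftw callback has no user-data argument, the state it updates usually ends up in a global in C; the closure below stands in for that.

```python
import os
import tempfile


def walk_with_callback(root, callback):
    """nftw-like shape: the walker owns the loop and calls back per entry."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            callback(os.path.join(dirpath, name))


def iter_tree(root):
    """fts_read-like shape: the caller owns the loop and pulls entries."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            yield os.path.join(dirpath, name)


def count_with_callback(root):
    # State has to live somewhere the callback can reach; in C that
    # usually means a global, which is what hurts re-entrancy.
    seen = []
    walk_with_callback(root, seen.append)
    return len(seen)


def count_with_iterator(root):
    # State is just a local variable in the caller's own loop.
    total = 0
    for _path in iter_tree(root):
        total += 1
    return total


def demo():
    """Build a tiny throwaway tree and count its entries both ways."""
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "sub"))
        for rel in ("a.txt", os.path.join("sub", "b.txt")):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("x")
        return count_with_callback(root), count_with_iterator(root)
```

Both counters see the same three entries in the demo tree (one directory, two files); the difference is only in who owns the loop and where the running state lives.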


Annoyingly FTS was completely broken on 64 bit platforms in glibc before 2016. This still affects some enterprise systems (RHEL 7 in my case). Coreutils uses gnulib to replace glibc FTS with a working version on these systems.


Ah, linux-man bites again: the Linux versions of `nftw` and `ftw` claim to be re-entrant, so I assumed that POSIX specified them as such. Looks like I was wrong about that.


Red Hat-based distributions include POSIX man pages out of the box.

    $ whatis ftw
    ftw (3)              - file tree walk
    ftw (3p)             - traverse (walk) a file tree

Run `man 3p ftw` to read the POSIX version.

On Ubuntu systems, the following packages can be installed to provide the POSIX man pages:

    manpages-posix
    manpages-posix-dev


Super interesting. I had a brief look at "head" and "df" and the flows seem to be complete.

The flow for "tr" is not completely right given that the second parameter ("string2") is optional with certain options.

The text for "yes" mentions how it fills a buffer with multiple copies of the string to achieve its very high performance. Nice!
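
A minimal Python sketch of that buffer-filling idea, for anyone who wants to see the shape without reading yes.c. This is not the coreutils code: `build_buffer`, the 8192-byte target, and the `max_writes` bound are all assumptions made for the illustration. The point is only that one large write of many copies replaces many tiny writes.

```python
import io


def build_buffer(line: bytes, target_size: int = 8192) -> bytes:
    """Repeat `line` as many whole times as fit in roughly target_size bytes."""
    copies = max(1, target_size // len(line))
    return line * copies


def yes(out, text: str = "y", max_writes: int = 4) -> int:
    """Write the pre-filled buffer `max_writes` times; return bytes written.

    A real `yes` keeps writing until the write fails; the bound here
    only keeps the sketch finite.
    """
    buf = build_buffer(text.encode() + b"\n")
    written = 0
    for _ in range(max_writes):
        written += out.write(buf)
    return written
```

Called as `yes(io.BytesIO(), "y", 2)`, this issues two writes of 4096 copies of `y\n` each rather than 8192 separate two-byte writes.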


Reminds me of a time at a large animation studio where I was working on performance issues with the in-house asset management system.

The asset management tool was fully commandline based, no GUI. An artist showed me what was involved in committing their scene: launch the command, type yes, hit return, wait about 30 seconds, type yes, hit return, repeat. I piped yes to stdin and it ran for 47 hours.

If you need a performance-optimized version of "yes" there are probably other aspects of your architecture you should look at first.


yes(1) takes a custom "expletive", so in the performance context it's usually used for flood filling in situations where e.g. /dev/zero or /dev/urandom are not appropriate sources for some reason.


yes outputs newlines, though, which makes it somewhat annoying to use for flood filling in practice.


I'm not particularly interested in coreutils per se but this is a super-interesting way to look at any piece of software. I just _love_ the block diagrams... why can't source code look like this?

Great work!


There are languages that model code like this. Using one for a while will likely teach you why. If you think "spaghetti" is an apt description for text code, wait until you see flow based code that's evolved in place for a while.


Presumably all the detail is shown at once? I'm picturing something more like Google Maps navigation where you zoom into the source diagram and the code appears.

Perhaps I'm imagining an alternative way to manage source trees, or modules, rather than the code in individual source files. A hierarchy like a traditional file system but with extra kinds of relationships. Perhaps it'd have to be curated like reference documentation, perhaps it could be automatically generated from source directives. I dunno, just idle thoughts...


In the best of those languages the user is getting input and output through the diagram in realtime. Also, syntax errors are either removed completely or converted into runtime errors.

These features allow the user to iterate rapidly through the development process to create a prototype, perhaps even gaining an average 100x development-speed increase over using a compiled text-based language. The spaghetti is evidence of that-- the user moved so quickly that they were able to hold the relevant abstractions in their short term memory to produce the final write-only diagram. And at least in my experience, the final write-only diagrams tend to work to produce the desired output, even if they were written a decade ago.

In fact if there were a button users could push to "abstractify" some pre-spaghetti sub-diagram I believe that would negatively impact the language. The point is that when you've got a mental model of all inputs and outputs within a single page visual cheat sheet, you can instantly connect any input to any output to take the next step in the development process. For certain domains like audio synthesis the fact that the user can leverage the diagram-based system to get any output at all is worth the seemingly prohibitive price of write-only programming.


Labview is one of these languages, which I’ve used for the last 5 years. It’s the only tool I use daily that makes me want to blow my brains out. I think diagrams like these are really useful as high level visualizations, but fail miserably as a low level implementation.

It’s my opinion that one of the things that turns visual code into spaghetti is that refactoring is difficult, because the code grows in 2d, as opposed to 1d (I don’t count line length). It’s sort of like if you could have infinitely many columns of concurrently executing code in a text-based function, where any line in a given column can be connected to any line in a different column.

I think what could be useful is a hybrid model, where you have something like text based modules whose communication is orchestrated by a visual diagram. For what it’s worth, Simulink can be used like this to some extent (e.g., hooking S-functions together) and I find it orders of magnitude more pleasant than labview.


Nice project. Just looked for the source code of "yes" (https://github.com/coreutils/coreutils/blob/master/src/yes.c), cause it seems to be the easiest code to understand.

I didn't get it. Damn. Looks still complicated.


Yeah, `yes` ended up complex in order to get better performance; if you want something simpler, maybe try https://www.maizure.org/projects/decoded-gnu-coreutils/env.h... ?


Oh, this is great! And it's one piece of something I've been looking for (so hopefully a good comment section to ask for more): a book on how (GNU/)Linux 'works'.

i.e. I'm not interested in a book of commands or cheat sheets, I can use `man` and SO, Arch wiki, etc. for that when needed.

I think part of the problem is that I don't know what I don't know - but I discovered namespaces (`/proc/<pid>/ns/<type>`) recently and don't know much about it but thought it was interesting, and managed to use it to do what I needed to overcome a problem I was having with `ip netns`. A similar 'aha' was with inodes a while ago.

So, any recommendations for a book on 'how Linux works'? I think it's a gap in my understanding (academically EE/CS - hardware, up to OS but not Linux-specific, theoretical CS; professionally software, 'using' Linux). Best candidate I've found is Brian Ward's 'How Linux Works: what every superuser should know'.
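
For anyone else who hasn't poked at the namespace files mentioned above, a small Python sketch of reading them. `list_namespaces` is a made-up helper name, and it assumes a Linux-style `/proc`; on other systems it just returns an empty dict. Each entry under `/proc/<pid>/ns` is a symlink whose target names the namespace type and an inode number, so two processes sharing a namespace show the same target.

```python
import os


def list_namespaces(pid="self"):
    """Map namespace type -> link target of the form 'net:[<inode>]'.

    Returns an empty dict where /proc/<pid>/ns is missing or unreadable
    (for example on non-Linux systems).
    """
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return result
    for name in names:
        try:
            result[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may need extra privileges to read; skip them.
            continue
    return result
```

Comparing `list_namespaces()` with `list_namespaces(some_pid)` is a quick way to check whether another process lives in a different network or mount namespace.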


This is probably going to be an unpopular opinion, but I don't agree with the criticism of the use of goto.

I know, I know, the classic "goto Considered Harmful". And, yes, it can be abused just like every other programming construct.

I see 2 generally good use cases for goto: 1) error handling to clean up on failure in C (in C++, RAII constructs should be used instead); 2) exiting nested loops, instead of a flag and a break.

I've also seen C++ code use do while loops with while(false) so that they could use "break" to exit the non-loop. It's very confusing and non-obvious, because you see the do, and thus expect a loop. Meanwhile (in the absurdly long function) the while(false) is on a separate screen, so it's not clear that the code doesn't actually loop. Just use the goto. It's clearer and avoids the abuse of a looping primitive (that doesn't actually loop).

Yes, goto can be abused, so can many other statements.

Sorry, /rant
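
For comparison, the same two cases in a language without goto. This is a Python sketch with made-up function names, offered only as an illustration of the shapes involved: `try`/`finally` plays the role of the single cleanup label, and returning from a small helper replaces the flag-and-break approach for nested loops.

```python
def process(resource_log):
    """Cleanup on failure: the finally block acts like a 'cleanup:' label."""
    resource_log.append("acquire")
    try:
        resource_log.append("work")
        raise ValueError("simulated failure")
    except ValueError:
        return False
    finally:
        # Runs on every exit path, like jumping to one cleanup label.
        resource_log.append("release")


def find_pair(grid, target):
    """Exit nested loops early by returning, with no flag variable."""
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value == target:
                return (r, c)
    return None
```

C has neither construct, which is why the goto version is the straightforward way to get the same structure there.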


that's fairly cool. Wish such sites were available for more open source packages.

Always wanted to know the algorithms used in sort.

So ls has more source lines than sort? Funny, but a bit depressing too.


ls is kind of complicated, has a billion flags, and has to print pretty so it has to do some ioctls to get the terminal width and such, then column things out nicely so they don't overflow. Glad I'm not implementing ls from scratch.
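
A stripped-down Python sketch of just the "get the terminal width, then column things out" part, to show why even that piece takes some care. `columnize` and `print_listing` are hypothetical names, and the layout is deliberately simpler than what a real ls does: every column gets the same width here.

```python
import shutil


def columnize(names, width):
    """Lay names out column-major in as many columns as fit in `width`."""
    if not names:
        return []
    col_width = max(len(n) for n in names) + 2  # two spaces of padding
    cols = max(1, width // col_width)
    rows = -(-len(names) // cols)  # ceiling division
    lines = []
    for r in range(rows):
        cells = names[r::rows]  # every rows-th name belongs to this row
        lines.append("".join(c.ljust(col_width) for c in cells).rstrip())
    return lines


def print_listing(names):
    # shutil.get_terminal_size falls back to 80 columns by default
    # when the size cannot be determined (e.g. output is piped).
    width = shutil.get_terminal_size().columns
    for line in columnize(sorted(names), width):
        print(line)
```

With five names and a width of 10, `columnize(["a", "bb", "ccc", "d", "ee"], 10)` gives three rows of up to two columns; narrow the width enough and it degrades to one name per line.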


exa[0] dubs itself a "modern" ls written in Rust. Interesting to compare. Still over 6,000 lines not including comments.

[0] https://github.com/ogham/exa


This guy has some really nice articles, but I'm unable to find any link to his RSS feed. Any help?


I've been trying to contribute additional flags for some coreutils commands, but I've been having trouble building on macOS. Should I use a certain VM or container instead? Does anyone have a link to instructions for building on a mac?


Do you have a modern GCC installed?


  $ gcc -v
  Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include/c++/4.2.1
  Apple LLVM version 10.0.1 (clang-1001.0.46.4)
  Target: x86_64-apple-darwin18.6.0
  Thread model: posix
  InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin


I would not at all be surprised to learn that GNU coreutils does not build with clang/llvm. Try installing a real GNU C compiler with homebrew (brew install gcc).


If you like this, the author posts a number of other 'Decoded' episodes: https://www.maizure.org/projects/


I never noticed `runcon` before. It and `chcon` seem decidedly out of place in coreutils, given that they're a) Linux-specific, b) SELinux-specific.


This is amazing. The author needs a donate button!


Are you doing this by manual inspection or in some semi-automated or automated way? In any case, very nice work.


this is very cool. Yet GNU utils have so many options that the interesting bits are difficult to find. The same display but for openbsd utils would be still more enlightening.


the diagram is really nice, is it done with dia?

I use dia once in a while, and wish freeplane could draw arrows between its nodes, since adding child nodes is easier in freeplane.


This is fantastic work!


It is true archeology.


Or, just browse the originals:

https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src
https://www.tuhs.org/cgi-bin/utree.pl?file=4.3BSD/usr/src

and some of their current descendants:

https://svnweb.freebsd.org/
http://cvsweb.openbsd.org/cgi-bin/cvsweb/
http://cvsweb.netbsd.org/bsdweb.cgi/src/?only_with_tag=MAIN
https://gitweb.dragonflybsd.org/dragonfly.git

conveniently all in one source tree, self hosting, and buildable with a single command.

it is almost as if they were designed as part of a single coherent operating system along with the very C language itself..

as discussed previously, typically with more intelligible sources:

https://news.ycombinator.com/item?id=14542938

and usually with better man pages (https://www.freebsd.org/cgi/man.cgi)

unix came first. gnu and posix came later.

the unix way is the standard, not posix, and not gnu.

from stallman himself (https://stallman.org/articles/posix.html):

" It seemed to me that nobody would ever say "IEEEIX", since the pronunciation would sound like a shriek of terror; rather, everyone would call it "Unix". That would have boosted AT&T, the GNU Project's rival, an outcome I did not want. So I looked for another name, but nothing natural suggested itself to me.

So I put the initials of "Portable Operating System" together with the same suffix "ix", and came up with "POSIX". It sounded good and I saw no reason not to use it, so I suggested it. Although it was just barely in time, the committee adopted it. "

so, as can be seen, posix is RMS's way of keeping people from calling unix unix, and as a result today, now that AT&T Unix is nearly dead, people lose track of the fact that BSD is Unix, and don't know what Unix actually is, and are forced to 'decode' the sometimes deliberately less clear cloned sources not knowing to look at the simpler original.


> Or, just browse the originals:

I believe that it's worth noting that doubts about the validity of Caldera's license (at least in some jurisdictions) have been raised recently[1].

[1] https://virtuallyfun.com/wordpress/2018/11/26/why-bsd-os-is-...


I think you missed the point of the website. It's not about the source code itself, but providing a higher-level overview of the utilities' architecture.


Did I?

"This resource is for novice programmers exploring the design of command-line utilities. It is best used as an accompaniment providing useful background while reading the source code of the utility you may be interested in. "

^ "best used as an accompaniment"

I posit that GNU's intentionally obfuscated code design[1]^ requires the use of such a site, whereas reading the originals and their descendants provides a similar level of understanding of program functionality, in addition to the actual historical context in which most of the tools were developed. That code also typically lives in the same source tree as the kernel, C library, and build toolchain implementing the OS side of the equation, so reading it is a better and more productive use of one's time, and one that will still apply to reading and understanding coreutils sources.

apologies if unclear. but kudos to the author for trying to do something positive to spread knowledge of system internals, lest I be misconstrued.

.. [1] https://www.gnu.org/prep/standards/standards.html#Reading-No...

.. ^: the design of a utility like 'cat'+ has a generally obvious direct implementation. coming up with something else to be different necessitates doing goofy stuff. see also other hn thread referenced.

.. +: see also: https://github.com/coreutils/coreutils/blob/master/src/cat.c vs https://www.tuhs.org/cgi-bin/utree.pl?file=V7/usr/src/cmd/ca...


I'm pretty curious how you conclude that GNU is "intentionally obfuscating code" from [1]. I read [1] as good advice to avoid inadvertently getting code into GNU that could be claimed by copyright. Focus on speed instead of memory; simplicity instead of speed. I don't see "make it different for difference's sake."

I also conclude the opposite from the cat implementations. I really don't see how the BSD cat is a more "pure" or straight forward implementation. I find the GNU cat to have way more options and easier to read source code. My guess is that novice programmers would find the GNU version easier to grok, which also probably aligns more to the goal of GNU. Or I just like to read obfuscated code.


> I really don't see how the BSD cat is a more "pure" or straight forward implementation. I find the GNU cat to have way more options

Way more options? Is "cat -v considered harmful" not thought of any more?


If you aren't trying to be Unix I would imagine you aren't thinking that. GNU probably doesn't subscribe to that notion.


> I find the GNU cat to have way more options and easier to read source code. My guess is that novice programmers would find the GNU version easier to grok, which also probably aligns more to the goal of GNU.

Perhaps, but, also, on a source based unix system (i argue the only real unix system, since this is what unix was from the v5 research days, and continued to be for source license holders through the dawn of PC-BSD and then in open-source BSD onwards), one can do something like this:

    $ which ls
    /bin/ls
    $ man ls > /dev/null # output hidden for example purposes
    $ ls /usr/src/bin/ls     
    CVS      cmp.c    ls.1     ls.h     obj      utf8.c
    Makefile extern.h ls.c     main.c   print.c  util.c
    $ grep include /usr/src/bin/ls/ls.c |sed -ne 2p     
    #include <sys/stat.h>
    $ ls /usr/src/sys/sys/stat.h                                              
    /usr/src/sys/sys/stat.h
    $ grep ^sys_statfs /usr/src/sys/kern/*.c    
    /usr/src/sys/kern/vfs_syscalls.c:sys_statfs(struct proc *p, void *v, register_t *retval)
    $ less /usr/src/sys/kern/vfs_syscalls.c  # oh so that's how it works in the kernel..
    $ (cd /usr/src/bin/ls && cp ls.c ls.c.orig && $EDITOR ls.c && make && make install && ls -Q)  # yay! my new option -Q works
    $ (cd /usr/src/bin/ls && diff -urw ls.c.orig ls.c |Mail -s 'new patch' developers@some-mailing-list.org)

whereas, for example, on a binary package based linux or proprietary UNIX, this would entail either separate download of N different tools and disparate compilation steps in the former, or not being able to do it in the latter. While there is nothing constraining a linux or proprietary unix from being made available in such a way, they quite simply aren't, and the 'culture' and concept of having a single binary+source+documentation tree which is fully self hosting with a single build system managing the build combined as a single unit as 'what the unix system is' is lost..


"If you have a vague recollection of the internals of a Unix program, this does not absolutely mean you can’t write an imitation of it, but do try to organize the imitation internally along different lines, because this is likely to make the details of the Unix version irrelevant and dissimilar to your results. "

on its own, not intentionally obfuscating. but when the naive implementation is straightforward and the only clear cut way, and everything else involves abstracting something, then yes.

see also the 'yes' example from the other thread:

https://github.com/coreutils/coreutils/blob/master/src/yes.c https://github.com/openbsd/src/blob/master/usr.bin/yes/yes.c

Really, my main beef is the suppression of history and misrepresentation of a broader culture as 'proprietary' to serve RMS's political aims, whereas in the actual Usenet/BSD culture, it quite simply wasn't. Unix wasn't just AT&T, it was the community around it as well, which lives on in the BSDs with a clear lineage and 'cultural context' which is also truly open source and built around software freedom, but the promotion of GNU devoid of context obscures this fact.


Rather than a flow chart, etc., it would be better to convert the utilities to simple C with no error handling or optimization. So `cat` would be about 5 lines. No mmap() or fancy stuff.
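
To give a feel for how small the "no error handling, no optimization" version gets, here is a sketch in Python rather than C. It is not derived from any existing cat source, and the function signature is just an assumption for the example: it copies each named file to an output stream in order, and does nothing else.

```python
import shutil


def cat(paths, out):
    """Copy each file to `out` in order; no options, no error handling."""
    for path in paths:
        with open(path, "rb") as src:
            shutil.copyfileobj(src, out)
```

Calling `cat(sys.argv[1:], sys.stdout.buffer)` from a script gives the command-line behaviour. A missing file simply raises an exception, and reading standard input when no files are named is left out, which is the kind of thing the real utilities spend their extra lines on.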


You might find source of Xv6 - educational version of unix - useful. Here is their implementation of cat, all 44 lines of it: https://github.com/mit-pdos/xv6-public/blob/master/cat.c


Many older operating systems do a pretty decent job at this: I would suggest taking a look at those. (Alternatively, if you're looking for something that's still modern, BSD usually cares less about optimization than GNU does.)



