[Novalug] Efficiency and creeping featurism [was: Interest probe Results]

Rich Kulawiec rsk@gsp.org
Wed May 31 06:45:47 EDT 2017


On Mon, May 29, 2017 at 07:00:02PM -0400, Peter Larsen via Novalug wrote:
> Ahh, this reminds me of some of the more heated discussions here on the
> mailing list.  Unfortunately structured data requires more than just a
> simple "grep" filter as data is passed from one to another. This is why
> tools like journalctl exist instead of using grep, head, tail and a few
> other utils to try to filter/interpret the files.  So while I do agree
> with the notion in theory - simple commands are good - that isn't always
> practical. Back in the mainframe days we could certainly justify
> splitting a process into multiple subprocesses and pass data between
> processes as files/pipes. But it's highly inefficient.

Efficiency is almost never a concern, and when it is, it's usually
secondary to quite a few others.  Security, maintainability, portability,
robustness, to mention some of those, are almost always more important.
Many folks who lack sufficiently broad and deep and painful experience
often focus on it and thereby create gratuitous complexity -- along
with the myriad issues that entails -- when all they really had to do was
sit on their hands and wait for computing power to increase, which it's
been reliably doing for decades.  Or just throw hardware at it --
which, unless you're working with spacecraft, is just about always
the easiest and cheapest solution.

	"Your scientists were so preoccupied with whether or not they
	could that they didn't stop to think if they should."

Let me give you an example from my own experience -- one that happens
to be in cache memory at the moment because I wrote about it elsewhere
recently, on another LUG list.

I'm one of the authors of stat(1) -- specifically, the one responsible
for the version that found modest popularity many years ago, and thus
for the output format that all extant versions still seem to be using.

Phil Hochstetler (then a student at Purdue, later of Sequent) wrote an
early bare-bones version under Research Unix v6, and some of us in the
neighborhood made use of it.  A few years later, I rewrote it from
scratch under 4.2 BSD because I needed it to support work being
done on dump(8) on non-quiescent filesystems.  That's when I picked
the cosmetic appearance that -- to my surprise -- has persisted to today.

I released the code to Usenet's net.sources in July 1985.  Judging from the
correspondence I received, a fair number of people found it a useful tool.
Contributions, fixes, and ports came from Bill Stoll, Stan Barber,
William J. Bogstad, Michael MacKenzie, Marty Leisner, my colleague
Kevin Braunsdorf, and other folks I'm no doubt overlooking (my apologies).

Around the same time and just across campus, Dave Curry ("Using C on
the UNIX System", among others) independently wrote a program called
"info", that did much the same thing and incorporated some of the
functionality of file(1) as well.  I considered going that way
with stat(1) circa 1988-1989, but nothing came of it -- after a lot of
thought and even some design work, I decided to stick with the software
tools philosophy: do one thing, do it well. 

I'm pretty sure that the contemporary Linux version of stat(1) is a
complete rewrite, so none of that code likely survives except in concept.
(It might survive in the BSD version -- I haven't looked.)  That's probably
a good thing: we've all learned a lot in the years since.

But the reason I'm telling you this story doesn't have to do with the
code I wrote to make stat(1) a useful tool that has managed to survive
in various incarnations for 30+ years.  It has to do with the code that
I DIDN'T write.  Back up two paragraphs and re-read.

I thought about it for a long time, because there are useful things
that could be done by combining stat(1) and file(1).  For example:
suppose that there's a file suffixed ".c".  File(1) was likely to
conclude that it's a C program.  But what if it's 10 bytes long?
What if it's 10M bytes long?  Is it *really* a C program?

Probably not.  And there are hundreds more examples where that one
came from.  The bottom line is that it would be useful to combine
the information that a stat(2) system call yields with heuristics
based on file names and file contents because more sophisticated
analysis could be done.  I know, I actually *did* code some of this
and tested it and played around with it for a year.

And then I stopped.  I decided not to proceed, because I'd thought
about it long enough to realize that this was probably not in the
best interest of the tools ecosystem.  *If* we needed better tools
to classify files based on their data and metadata, and it wasn't
clear that we did, then the best path forward was probably to write
a tool that used the outputs of file(1) and stat(1) in some fashion.
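By way of illustration, here is a minimal sketch of what such a glue tool might look like in modern shell: it consumes the outputs of stat(1) and file(1) rather than merging their code into one program. The function name and the size thresholds are hypothetical, and it assumes GNU stat's -c format option (BSD stat spells it -f).

```shell
# Hypothetical glue tool: flag files whose name ends in ".c" but whose
# stat(1) metadata or file(1) classification looks suspicious.
# Assumes GNU stat(1); the size thresholds are arbitrary illustrations.
suspect_c_file() {
    f="$1"
    size=$(stat -c '%s' "$f" 2>/dev/null) || return 2   # no such file
    kind=$(file -b -- "$f" 2>/dev/null) || kind="unknown"
    # A 10-byte "C program" or a 10 MB one deserves a second look,
    # as does a .c file that file(1) doesn't think is C source.
    if [ "$size" -lt 32 ] || [ "$size" -gt 10485760 ]; then
        echo "suspect: $f ($size bytes, $kind)"
    else
        case "$kind" in
            *"C source"*) : ;;                  # looks plausible
            *) echo "suspect: $f ($kind)" ;;
        esac
    fi
}
```

Nothing here touches the internals of either tool: each one keeps doing its one job, and the judgment lives in a few lines of shell that can be thrown away or rewritten without ceremony.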

That's the software tools approach.  Individual tools do one thing,
and do it well.  More complex tasks can often be handled by tools
in combination, perhaps with some glue, perhaps with some temporary
intermediate steps, perhaps with the addition of a new individual tool.
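As a concrete (and entirely hypothetical) example of that kind of glue: listing the largest files under a directory needs no new tool at all, just existing ones joined by a pipe. This sketch assumes GNU stat's -c option; BSD stat would use -f '%z %N' instead.

```shell
# Four tools -- find(1), stat(1), sort(1), head(1) -- and one line of
# glue, to list the N largest files under a directory (default N=3).
largest() {
    find "$1" -type f -exec stat -c '%s %n' {} + |
        sort -rn | head -n "${2:-3}"
}
```

Each stage can be tested, replaced, or reused on its own -- which is exactly the property a single combined program would give up.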

Of course that's much less efficient.  And of course it almost never
matters.  And even when it *does* matter, it almost never matters as
much as the other concerns I mentioned (security, maintainability,
portability, robustness) and all the ones I didn't.

With the advantage of many years of hindsight, I still think I made
the correct decision.  The software tools approach has served us
exceedingly well when we've exercised the judgment to use it properly,
and sometimes that means not doing something just because we can.

I'm not only a fan of the concepts in "Software Tools" but also
"The Elements of Programming Style" and the way these were often applied
at the late, great Bell Labs.  It is instructive to study things
like Research Unix v8 and Plan 9 in order to see how these concepts
can be used in practice.  It's not an exaggeration to say that parts
of those systems were beautiful.  And it's not an accident that those
same parts were secure, maintainable, etc.

One of the problems with some of the contemporary Linux distribution
environments is that far too many people have written far too much code
that should never have been created.  The operating architecture is
becoming grandiose and baroque, which is inevitably the precursor to
all kinds of problems.  There is far too much hubris and ego involved,
and far too little economy and elegance.  This would be a good time for
all concerned to take three big steps back, evaluate the situation,
throw away most if not all of the code, and start over -- using the
experience gained to craft far better and much smaller solutions.

But it's hard to kill your own creation.  Pride often prevents it.
Yet this is the mark, I think, of a seasoned programmer.

	"The management question, therefore, is not whether to build
	a pilot system and throw it away.  You WILL do that. [...]
	Hence plan to throw one away; you will, anyhow."
		--- Fred P. Brooks, The Mythical Man Month:
			Essays on Software Engineering

Fred is still right all these years later.  Except sometimes we should
throw away two or three.

---rsk


