[Novalug] Filename handling: correctness vs. convenience

bidwell bidwell@dead-city.org
Sat Feb 28 22:29:19 EST 2009


Michael Henry wrote:
> All,
> 
> In the Unix culture, there has been a historical bias against filenames
> with "odd" characters.  Filenames containing whitespace, punctuation, or
> other unusual characters are harder to deal with at the command line and
> in scripts because these unusual characters are often used as
> delimiters.  Unusual characters must be escaped in some way for the
> shell to deal with them correctly.
> 
> Because of the inconvenience caused by these unusual characters, they
> tend to be avoided in Unix filenames.  As a result of this scarcity of
> unusual filenames, there are often shortcuts that can be taken to avoid
> the hassle of fully general filename handling.  These shortcuts truly save
> a lot of time, and I use them myself frequently; however, to paraphrase
> a great philosopher[1], "A man's got to know his (tools') limitations".
> Many of the shortcuts are in essence buggy approximations to a robust
> general-case solution, but they are so much more convenient than the
> fully correct solution and they are perfectly correct in a number of
> interesting special cases.  It is my goal to point out some of these
> shortcuts and their limitations, as well as the more awkward but
> general-purpose solutions.
> 
> So to raise awareness of the general problem, I offer a couple of
> suggestions for your consideration.  I'll be using Bash for my examples,
> as that's the most common shell on Linux.  I'll try to keep my examples
> generic enough to apply to any Bourne-shell derivative, but almost all
> of my experience has been with Bash (and almost none with C-shell
> derivatives), so though I believe the concepts generalize to most
> shells, the details may vary if you are not using Bash.  You may want to
> follow along in a temporary directory.  Throughout, commands typed at
> the shell prompt begin with a dollar-sign::
> 
>     $ mkdir tmp
>     $ cd tmp
> 
> Consider the task of creating a new file named ``dummy_file``::
> 
>     $ touch dummy_file
>     $ ls -Q
>     "dummy_file"
> 
> Note that the ``-Q`` option to GNU ``ls`` puts quotation marks around
> the filenames to make the filename boundaries obvious.  The ``touch``
> command updates the time stamp on a file, creating a new file if
> necessary.  The shell splits the command line into the two words
> ``touch`` and ``dummy_file`` at the whitespace between the words.
> Because neither the command nor the filename contained whitespace, no
> special effort is required to invoke the command correctly.  But suppose
> the desired filename had contained a space, such as ``dummy file``.  The
> same simple invocation now produces an unintended result, creating two
> new files (``dummy`` and ``file``)::
> 
>     $ touch dummy file
>     $ ls -Q
>     "dummy"  "dummy_file"  "file"
> 
> Some characters are always treated literally by the shell and taken at
> "face value".  At a minimum, these include letters, numbers, and some
> punctuation characters like period (``.``), slash (``/``), and
> underscore (``_``).  Some other characters are always treated specially
> unless some kind of quoting or escaping mechanism is used; these special
> characters include whitespace, backslash (``\``), quotation marks, and
> others.
> 
> To express a filename containing a special character requires some kind
> of quoting or escaping.  One way to escape a special character is to use
> a backslash before the special character::
> 
>     $ touch dummy\ file
>     $ ls -Q
>     "dummy"  "dummy file"  "dummy_file"  "file"
> 
> Now the desired ``dummy file`` is properly created.  The backslash
> informs the shell to treat the escaped space character literally,
> causing ``dummy file`` to be passed as a single argument to ``touch``.
> 
> Failure to escape special characters can have disastrous effects.  If the
> goal is to remove ``dummy file``, the following command line is
> erroneous; it will incorrectly remove the unrelated files ``dummy`` and
> ``file``::
> 
>     $ rm dummy file
>     $ ls -Q
>     "dummy file"  "dummy_file"
> 
> As an alternative to using backslash to escape the space, the filename
> can be surrounded by quotes.  There is a distinction between single- and
> double-quotes.  In general, single-quotes are "more powerful" in the
> sense that all characters between a pair of single-quotes will be
> treated literally[2].  Within double-quotes, whitespace is treated
> literally but certain other characters are still treated specially; in
> particular, the dollar sign (``$``) is used to expand shell variables.
> For purely literal strings, I find single-quotes to be more convenient,
> but double-quotes are required when expanding shell variables as
> explained later.
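
The difference between the two quote styles can be seen in a small sketch.
The variable ``name`` below is made up for this illustration only:

```shell
# Hypothetical variable used only for this illustration.
name='dummy file'

single='$name'   # single-quotes: the dollar sign stays literal
double="$name"   # double-quotes: the variable is expanded, the space is kept

printf 'single: %s\n' "$single"
printf 'double: %s\n' "$double"
```

The first line printed shows the literal text ``$name``; the second shows
the expanded value with its space intact.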
> 
> To remove the special file ``dummy file``, quotes can be used to protect
> the space from special treatment::
> 
>     $ rm 'dummy file'
>     $ ls -Q
>     "dummy_file"
> 
> Notice that quoting a filename is permissible even when it's not
> necessary.  Though the filename ``dummy_file`` has no special
> characters, it's legal to quote it anyway in the following command to
> remove the file::
> 
>     $ rm 'dummy_file'
>     $ ls -Q
> 
> Things get more interesting when shell variables are used.  A shell
> variable can be assigned in many ways.  The following example sets the
> variable ``filename`` to the value ``dummy_file``.  The subsequent
> ``echo`` command shows the value of ``filename``::
> 
>     $ export filename=dummy_file
>     $ echo $filename
>     dummy_file
> 
> When the shell encounters the variable expansion ``$filename`` on the
> command line, it replaces it with the value of the variable.  Before the
> ``echo`` command is run, the shell updates that command line to become
> ``echo dummy_file``.  Similarly, the ``touch`` command could be used to
> create a file named according to the value of the ``filename`` variable::
> 
>     $ touch $filename
>     $ ls -Q
>     "dummy_file"
> 
> As before, the command line ``touch $filename`` is replaced by the shell
> with ``touch dummy_file`` before the ``touch`` command is run.
> Command-line parsing continues after the variable replacement, but in
> this case nothing interesting happens.  But suppose the variable held a
> filename with a space::
> 
>     $ export filename='dummy file'
>     $ echo $filename
>     dummy file
> 
> The previous ``touch`` command would now erroneously create the two files
> ``dummy`` and ``file`` instead of the desired file ``dummy file``::
> 
>     $ touch $filename
>     $ ls -Q
>     "dummy"  "dummy_file"  "file"
> 
> The bug is the lack of quoting around the variable expansion of
> ``$filename``.  After expanding the variable, the shell continues to
> parse the command line.  Because the expansion was not quoted, the space
> in the filename is treated as an argument delimiter, so the ``touch``
> command receives two distinct arguments, ``dummy`` and ``file``.  A
> corrected invocation of ``touch`` that creates the filename specified by
> the ``filename`` variable follows::
> 
>     $ touch "$filename"
>     $ ls -Q
>     "dummy"  "dummy file"  "dummy_file"  "file"
> 
> Notice the required quotes in the command.  Within double-quotes, the
> shell will expand shell variables such as ``$filename``, but the
> resulting string will not be further processed for argument splitting.
> Therefore, the expanded filename will be treated as a single argument to
> the ``touch`` command, yielding the correct result.
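
One way to watch the splitting happen without creating any files is to count
the arguments a command receives.  This is only a sketch: ``count_args`` is a
hypothetical helper, not a standard command, and the default ``IFS`` is
assumed:

```shell
# count_args is a hypothetical helper: it prints how many arguments it got.
count_args() { echo "$#"; }

filename='dummy file'

unquoted=$(count_args $filename)     # expansion is split at the space
quoted=$(count_args "$filename")     # expansion is kept as one argument

echo "unquoted: $unquoted, quoted: $quoted"
```

The unquoted expansion arrives as two arguments, the quoted one as a single
argument, which is exactly what ``touch`` saw in the examples above.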
> 
> Here might be a good place to point out the tension between correctness
> and convenience.  In the general case, a command such as ``touch
> $filename`` that lacks proper quoting around a shell variable is buggy,
> because in the most general case the script writer has no control over
> the naming of arbitrary files in the filesystem.  When a script must
> work correctly in the general case, such quoting is required.  But when
> the files of interest are known to contain no unusual characters,
> quoting is not required.  Especially for one-liners and "throw-away"
> scripts, the author often knows that the filenames contain no spaces or
> other characters requiring special processing, so for convenience the
> author leaves out the otherwise mandatory quoting.
> 
> In my view, such shortcuts are valuable productivity enhancers, as long
> as the author is aware that he is cutting corners and does not come to
> view the shortcut as the correct idiom for the general case.  Therefore,
> whenever I suggest a shortcut idiom to a fellow hacker, I like to point
> out where the shortcut fails in the general case.  This is especially
> important for quoting because it is so easy to overlook.  It also tends
> to work fine during testing due to the relative scarcity of Unix
> filenames with unusual characters, then fail miserably in the wild.  In
> addition, bad quoting can create some very serious security
> vulnerabilities.
> 
> A common idiom for processing files in a directory is the shell ``for``
> loop.  The ``for`` loop takes a list of space-separated words and
> iterates across them.  For example::
> 
>     $ for i in one two three; do echo $i; done
>     one
>     two
>     three
> 
> To iterate over words with unusual characters, the words must be
> quoted.  For example, to iterate over the two phrases ``first   phrase``
> and ``second   phrase`` (note the three spaces in each phrase)::
> 
>     $ for i in "first   phrase" "second   phrase"; do echo "$i"; done
>     first   phrase
>     second   phrase
> 
> Notice that for correctness, there must also now be quotes around the
> expansion of the variable ``i`` in the ``echo`` command, in order to
> preserve the three spaces in each phrase.  Without the quotes, the
> ``echo`` command will see each phrase as a list of words, and it will
> put only a single space between them::
> 
>     $ for i in "first   phrase" "second   phrase"; do echo $i; done
>     first phrase
>     second phrase
> 
> Frequently the list of words will be taken from a filename glob (a
> pattern to match filenames).  Here is an example that is analogous
> to ``ls *``::
> 
>     $ for i in *; do echo "$i"; done
>     dummy
>     dummy file
>     dummy_file
>     file
> 
> It's common to use such a ``for`` loop to do something to each file
> individually.  For example, to make a backup of each file::
> 
>     $ for i in *; do cp "$i" "$i".bak; done
>     $ ls -Q
>     "dummy"      "dummy file"  "dummy file.bak"  "file"
>     "dummy.bak"  "dummy_file"  "dummy_file.bak"  "file.bak"
> 
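> This is safe for arbitrary filenames because of the quoting of the
> variable expansions.  Filename globs do, however, run into trouble
> when there are too many files and the expanded list is handed to an
> external command (for example ``cp * /some/dir``).  Because the shell
> expands the glob in-place, the size of the command line grows.  The
> kernel limits the total size of the arguments passed to a new program
> (historically around 128 KBytes on Linux, and larger on recent
> kernels); past that limit the command fails with an "Argument list
> too long" error.  A ``for`` loop runs inside the shell itself, so it
> is not subject to this limit.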
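
The exact limit differs between systems and kernel versions.  As a rough
sketch, assuming the POSIX ``getconf`` utility is installed, the limit can be
queried like this:

```shell
# ARG_MAX is the system limit, in bytes, on the combined size of the
# arguments and environment passed to a newly executed program.
limit=$(getconf ARG_MAX)
echo "argument size limit: $limit bytes"
```

The number printed is an upper bound; the usable space is somewhat smaller
because the environment variables count against it too.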
> 
> One common technique to get around the command-line length limit is to
> pipe a list of filenames into another program.  For example::
> 
>     $ ls *.bak | while read i; do echo "$i"; done
>     dummy.bak
>     dummy file.bak
>     dummy_file.bak
>     file.bak
> 
> This uses the ``read`` command to read one line of input at a time,
> assigning the entire line to a shell variable.  This works for
> filenames with spaces and some other special characters, but in
> particular does not work correctly for filenames containing newlines.
> Though newlines are very rare in filenames, it's possible to create one,
> so this idiom is only a shortcut, not a fully general solution.  For
> example, consider the filename ``dummy\nfile``, where the embedded
> ``\n`` indicates a newline.  First, a little cleanup::
> 
>     $ rm *.bak
> 
> Now the following command will create a file with a newline::
> 
>     $ touch $'dummy\nfile'
>     $ ls -Q dummy*file
>     "dummy file"  "dummy_file"  "dummy\nfile"
> 
> Notice how the ``while read i`` idiom fails to treat ``dummy\nfile`` as
> a single file::
> 
>     $ ls dummy*file | while read i; do echo The file is: "$i"; done
>     The file is: dummy file
>     The file is: dummy_file
>     The file is: dummy
>     The file is: file
> 
> Whereas the filename with the space is treated correctly,
> ``dummy\nfile`` is erroneously treated as the two separate files
> ``dummy`` and ``file``.  
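
A variant of the ``while read`` idiom that survives newlines is possible in
Bash if the filenames are separated by NUL bytes instead of newlines.  The
sketch below assumes Bash (for ``read -d ''`` and ``<(...)``) and a ``find``
that supports ``-print0``, which ends each name with a NUL byte.  It works in
a scratch directory with made-up files so nothing real is touched:

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch 'dummy file' "$(printf 'dummy\nfile')"

count=0
# IFS= keeps leading and trailing whitespace, -r keeps backslashes literal,
# and -d '' (a Bash extension) makes read stop at a NUL byte, not a newline.
while IFS= read -r -d '' i; do
    count=$((count + 1))
    printf 'The file is: %s\n' "$i"
done < <(find . -name 'dummy*file' -print0)

echo "files seen: $count"

# Clean up the scratch directory.
cd - >/dev/null && rm -rf "$dir"
```

The loop body runs once per file, so the name with the embedded newline is
reported as a single (two-line) filename rather than as two files.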
> 
> Another approach for dodging the command line length limit is the use of
> ``find`` and ``xargs``.  ``find`` writes a list of matched filenames
> to standard output, and ``xargs`` then splits that list at
> whitespace and uses them as command-line arguments to another program.
> As an example::
> 
>     $ find -name 'dummy*file' | xargs ls -Q
>     "./dummy"  "./dummy"  "./dummy_file"  "file"  "file"
> 
> Notice that both filenames containing whitespace were split incorrectly
> and treated as the pair of filenames ``dummy`` and ``file``.  To correct
> this erroneous behavior, you can use the ``-print0`` option to GNU
> ``find`` and the corresponding ``-0`` option to GNU ``xargs`` (these
> switches are unfortunately not portable to all older Unix systems, but
> they work on many modern Unix systems like Linux).  Correct behavior is
> achieved using this idiom::
> 
>     $ find -name 'dummy*file' -print0 | xargs -0 ls -Q
>     "./dummy file"  "./dummy_file"  "./dummy\nfile"
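
Where ``-print0`` and ``-0`` are not available, another option is to let
``find`` run the command itself with ``-exec ... {} +``, which is specified
by POSIX and supported by GNU ``find``.  The following is a sketch in a
scratch directory; the inline ``sh`` script is only there to report how many
arguments it was handed:

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch 'dummy file' dummy_file "$(printf 'dummy\nfile')"

# "{} +" makes find append as many filenames as fit onto one command line,
# passing each name as its own argument; no text stream is ever parsed.
seen=$(find . -name 'dummy*file' -exec sh -c 'echo "$#"' sh {} +)
echo "arguments received: $seen"

# Clean up the scratch directory.
cd - >/dev/null && rm -rf "$dir"
```

Because the names never pass through a pipe, there is no delimiter to get
wrong: each of the three files arrives as exactly one argument.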
> 
> I hope this discussion has brought some light to what I feel are
> some historically dark corners in the Unix culture.  It's surprisingly
> easy to write scripts that behave properly in every tested case but which
> fail spectacularly when presented with unusual filenames in the wild
> (especially when black hats have the chance to choose the filenames).
> 
> Michael Henry
> 
> [1]: http://www.imdb.com/title/tt0070355/quotes
> 
> [2]: Note that you can't use single-quotes to quote another
> single-quote.
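
The usual workaround for the limitation in footnote [2] is to close the
single-quoted string, add a backslash-escaped quote, and reopen it.  A short
sketch with a made-up filename:

```shell
# Close the single-quoted string, add an escaped quote, then reopen it.
name='it'\''s a file'
# Alternatively, double-quotes can enclose a single-quote directly.
same="it's a file"

printf '%s\n' "$name"
```

Both assignments produce the same value; which form is more convenient
depends on what else the string contains.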
> 

Michael,
Thanks for the very informative email.  I've been using ls every day
for so long I never would have thought to look at the man page
and find the -Q option.

Matt


