[Novalug] Fwd: perl help | filename spaces

Jon LaBadie novalugml@jgcomp.com
Sat Mar 20 10:01:32 EDT 2010


Wow Michael, great and extensive research.

Thanks for extending much of my knowledge and correcting
several misconceptions!!

On Sat, Mar 20, 2010 at 07:57:25AM -0400, Michael Henry wrote:
> On 03/19/2010 11:32 AM, Jon LaBadie wrote:
> 
> > When I saw your results my (likely biased) view was different
> > than yours.  IIRC you suggested that with the + delimiter
> > find being single threaded had to work in a linear fashion
> > while the pipelined version with xargs naturally allowed
> > for multi-processing efficiencies.
> 
> I don't think ``find`` is required to block during the exec.
> The POSIX specification says this:
> 
>   http://www.opengroup.org/onlinepubs/009695399/utilities/find.html
> 
>   Each invocation shall begin after the last pathname in the set
>   is aggregated, and shall be completed before the find utility
>   exits and before the first pathname in the next set (if any)
>   is aggregated for this primary, but it is otherwise
>   unspecified whether the invocation occurs before, during, or
>   after the evaluations of other primaries.
> 
> I'm still a bit unclear what it means for the paths to be
> aggregated (is this just packing up the arguments at the last
> minute right before exec'ing the program?), but it seems clear
> that they are leaving room for at least some processing overlap
> between the exec'ed program and ``find``.

It seems the standard's authors meant for only one of find or its
invoked utility to operate at a time.  But you are correct that
wiggle room is left.  I can see an interpretation that says creating
the total list of "found" files is a separate activity from
aggregating them into sets shorter than ARG_MAX and exec'ing
the utility.  So one find thread could make the big list while a
second thread does the aggregate/fork-exec/wait.

Your later research clearly shows this approach was not taken
in GNU's find.

> 
> It might be an interesting suggestion to the GNU folks.  I
> grabbed the source for findutils-4.4.2
> (http://ftp.gnu.org/pub/gnu/findutils/)  and took a look.  They
> are using ``fork``/``exec`` and immediately waiting for the
> child process to complete.  See the function ``launch`` in
> ``find/pred.c``.

Use the source, Luke!  :)

> As a quick hack for testing, I modified
> ``launch`` to defer waiting for the child process until the next
> invocation of ``-exec`` (see below for patch).  The patch does
> not wait for the final child to return, so it's useful only for
> testing, but since in my test dataset ``-exec`` is invoked 380
> times, I think allowing ``find`` to exit before the last child
> completes is not skewing the results significantly.  The results
> line up with expectations.  The patched ``find`` now performs on
> par with the ``xargs``-based solution for my dataset:

Great test.  I think a standards-conforming find is barred from
exiting before the last invocation completes, since find's exit status
must be non-zero if any utility invocation exits non-zero.
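
The deferred-wait idea, combined with the exit-status constraint above,
can be sketched in plain shell.  This is only an illustration, not the
findutils code: run_batch is a hypothetical stand-in for the utility
that find would exec on one aggregated set of pathnames, and batch 2 is
made to fail on purpose.  Each child is collected just before the next
one is launched, and the final child is collected before finishing so a
late failure is not lost.

```shell
#!/bin/sh
# run_batch is a hypothetical stand-in for the utility find would exec
# on one aggregated set of pathnames; batch 2 fails deliberately.
run_batch() {
    echo "processing batch $1"
    [ "$1" != 2 ]
}

prev_pid=
failed=0
for batch in 1 2 3; do
    # A real find would have been gathering this batch's pathnames while
    # the previous child was still running; collect that child only now.
    if [ -n "$prev_pid" ]; then
        wait "$prev_pid" || failed=1
    fi
    run_batch "$batch" &
    prev_pid=$!
done

# Unlike the quick test patch, also wait for the final child so that a
# failure in the last invocation is reflected in the overall status.
wait "$prev_pid" || failed=1
echo "overall status: $failed"
```

At most one child is outstanding at any time, so the ordering rule
quoted from the spec (each invocation completes before the next set is
handed off) is roughly preserved while the gathering step is free to
overlap with the running child.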

> 
>   ; Using xargs
>   time find `cat 100dot` -type f -print0 | xargs -0 grep bigteststring
> 
>   real    0m15.923s
>   user    0m7.543s
>   sys     0m13.646s
> 
>   ; Using unmodified find with ``-exec ... +``
>   time find `cat 100dot` -type f -exec grep bigteststring {} +
> 
>   real    0m20.668s
>   user    0m6.766s
>   sys     0m13.772s
> 
>   ; Using hacked find with ``-exec ... +``
>   time findhack `cat 100dot` -type f -exec grep bigteststring {} +
> 
>   real    0m15.830s
>   user    0m7.163s
>   sys     0m14.302s
> 
> > I'll still recommend the -exec + versions.  I think it is
> > sufficiently widespread, probably more than -print0 | xargs -0
> > if you consider UNIX systems also.  Xargs also has some
> > problems dealing with multi-byte character sets and commands
> > where the filenames are not the last items on the command line.
> 
> It certainly seems portable enough, and I can imagine having
> multi-byte character set problems with ``xargs`` (I'm certainly
> no expert on MBCS issues).  And as you say, for commands where
> the path arguments are not last on the line, you can't use
> ``xargs``.  In fairness, though, you can't use ``-exec ... +``
> either, because the ``+`` variant of ``-exec`` is only valid
> when the ``{}`` is just before the ``+``.  From the spec:
> 
>   Only a plus sign that follows an argument containing the two
>   characters "{}" shall punctuate the end of the primary
>   expression. Other uses of the plus sign shall not be treated
>   as special.
> 

Darn, you poked another hole in one of my long-held 'mis'assumptions.
I knew that -exec terminated with ';' could use multiple {}'s and
simply assumed that -exec terminated with + could also.

And you made me read GNU's find manpage to learn that the single
{} must immediately precede the +; another 'mis'assumption killed.
Looking back at Solaris' find manpage, the same is true there,
but less clearly stated.  GNU's find also does another thing better.
Invoked with two {}'s it clearly complains with "Only one instance
of {} is supported with -exec ... +" and exits with a 1 status.
Solaris 9's find is silent, executes nothing, and exits with a
"successful" status of 0.  This could lead one to think nothing
was wrong and that there were simply no files matching the find criteria.
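
For commands where the pathnames cannot come last, one common
workaround is to have -exec ... + launch a small sh -c wrapper, which
receives the aggregated pathnames as "$@" and can place them anywhere.
The sketch below is a minimal example under stated assumptions: the
scratch directories and file names are made up for illustration, and
mktemp -d is assumed to be available (it is not required by POSIX).

```shell
#!/bin/sh
# Scratch directories and file names are illustrative assumptions.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/plain.txt" "$src/name with spaces.txt"

# {} still sits immediately before +, as the spec requires.  Inside the
# wrapper, the first argument is the destination and the remaining
# arguments are the pathnames supplied by find, so cp can take the
# pathnames in the middle of its command line.
find "$src" -type f -exec sh -c '
    target=$1
    shift
    cp "$@" "$target"
' sh "$dest" {} +

ls "$dest"
```

Because the pathnames travel as separate arguments and are expanded
with a quoted "$@", file names containing spaces survive intact.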

Thanks for an interesting discussion.
I learned several things, hope others did also.

Jon
-- 
Jon H. LaBadie                  jon@jgcomp.com
 JG Computing
 12027 Creekbend Drive		(703) 787-0884
 Reston, VA  20194		(703) 787-0922 (fax)


