[Novalug] Fwd: perl help | filename spaces
Jon LaBadie
novalugml@jgcomp.com
Sat Mar 20 10:01:32 EDT 2010
Wow Michael, great and extensive research.
Thanks for extending much of my knowledge and correcting
several misconceptions!!
On Sat, Mar 20, 2010 at 07:57:25AM -0400, Michael Henry wrote:
> On 03/19/2010 11:32 AM, Jon LaBadie wrote:
>
> > When I saw your results my (likely biased) view was different
> > than yours. IIRC you suggested that with the + delimiter
> > find being single threaded had to work in a linear fashion
> > while the pipelined version with xargs naturally allowed
> > for multi-processing efficiencies.
>
> I don't think ``find`` is required to block during the exec.
> The POSIX specification says this:
>
> http://www.opengroup.org/onlinepubs/009695399/utilities/find.html
>
>     Each invocation shall begin after the last pathname in the set
>     is aggregated, and shall be completed before the find utility
>     exits and before the first pathname in the next set (if any)
>     is aggregated for this primary, but it is otherwise
>     unspecified whether the invocation occurs before, during, or
>     after the evaluations of other primaries.
>
> I'm still a bit unclear what it means for the paths to be
> aggregated (is this just packing up the arguments at the last
> minute right before exec'ing the program?), but it seems clear
> that they are leaving room for at least some processing overlap
> between the exec'ed program and ``find``.
It seems the standard authors meant for only one of find or its
invoked utility to operate at a time. But you are correct that
wiggle room is left. I can see an interpretation that says creating
the total list of "found" files is a separate activity from
aggregating them into sets less than ARG_MAX in length and exec'ing
the utility. So one find thread could make the big list while a
second thread does the aggregate/fork-exec/wait.
Your later research clearly shows this approach was not taken
in GNU's find.
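
The "aggregate into sets" step can be seen on a small scale with
xargs. This is only a sketch: it assumes an xargs that supports -0 and
-n (GNU and BSD versions do), and the limit of two names per set is an
artificially tiny stand-in for the real ARG_MAX bound.

```shell
#!/bin/sh
# Five NUL-terminated names are aggregated two at a time.
# Each line of output corresponds to one invocation of echo.
batches=$(printf '%s\0' a b c d e | xargs -0 -n 2 echo)
printf '%s\n' "$batches"
```

With five names and a limit of two, echo is invoked three times, the
last time with the single leftover name.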
>
> It might be an interesting suggestion to the GNU folks. I
> grabbed the source for findutils-4.4.2
> (http://ftp.gnu.org/pub/gnu/findutils/) and took a look. They
> are using ``fork``/``exec`` and immediately waiting for the
> child process to complete. See the function ``launch`` in
> ``find/pred.c``.
Use the source, Luke! :)
> As a quick hack for testing, I modified
> ``launch`` to defer waiting for the child process until the next
> invocation of ``-exec`` (see below for patch). The patch does
> not wait for the final child to return, so it's useful only for
> testing, but since in my test dataset ``-exec`` is invoked 380
> times, I think allowing ``find`` to exit before the last child
> completes is not skewing the results significantly. The results
> line up with expectations. The patched ``find`` now performs on
> par with the ``xargs``-based solution for my dataset:
Great test. I think a standards-based find is constrained from
exiting before the last invocation completes, as find's exit status
must be non-zero if any utility invocation exits non-zero.
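
The deferred wait, together with the final wait that the exit-status
rule seems to require, can be sketched in shell. This is an
illustration only, not the findutils code: run_batches is a
hypothetical helper, and each argument is a command string standing in
for one invocation of the utility on an aggregated set.

```shell
#!/bin/sh
# Hypothetical sketch: let the "search" continue while the previous child
# runs, but still reap every child so no non-zero exit status is lost.
run_batches() {
    prev_pid=""
    status=0
    for batch in "$@"; do
        # Before starting the next invocation, wait for the previous one.
        if [ -n "$prev_pid" ]; then
            wait "$prev_pid" || status=1
        fi
        # Start the utility in the background (stand-in for fork/exec).
        sh -c "$batch" &
        prev_pid=$!
    done
    # The final wait, which a quick test hack might skip.
    if [ -n "$prev_pid" ]; then
        wait "$prev_pid" || status=1
    fi
    return "$status"
}
```

For example, run_batches 'true' 'false' 'true' returns a non-zero
status because one of the three invocations failed, even though the
last one succeeded.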
>
> ; Using xargs
> time find `cat 100dot` -type f -print0 | xargs -0 grep bigteststring
>
> real 0m15.923s
> user 0m7.543s
> sys 0m13.646s
>
> ; Using unmodified find with ``-exec ... +``
> time find `cat 100dot` -type f -exec grep bigteststring {} +
>
> real 0m20.668s
> user 0m6.766s
> sys 0m13.772s
>
> ; Using hacked find with ``-exec ... +``
> time findhack `cat 100dot` -type f -exec grep bigteststring {} +
>
> real 0m15.830s
> user 0m7.163s
> sys 0m14.302s
>
> > I'll still recommend the -exec + versions. I think it is
> > sufficiently wide spread, probably more than -print0 | xargs -0
> > if you consider UNIX systems also. Xargs also has some
> > problems dealing with multi-byte character sets and commands
> > where the filenames are not the last items on the command line.
>
> It certainly seems portable enough, and I can imagine having
> multi-byte character set problems with ``xargs`` (I'm certainly
> no expert on MBCS issues). And as you say, for commands where
> the path arguments are not last on the line, you can't use
> ``xargs``. In fairness, though, you can't use ``-exec ... +``
> either, because the ``+`` variant of ``-exec`` is only valid
> when the ``{}`` is just before the ``+``. From the spec:
>
>     Only a plus sign that follows an argument containing the two
>     characters "{}" shall punctuate the end of the primary
>     expression. Other uses of the plus sign shall not be treated
>     as special.
>
Darn, you poked another hole in one of my long-held 'mis'assumptions.
I knew that -exec terminated with ';' could use multiple {}'s and
simply assumed that -exec terminated with + could also.
And you made me read GNU's find manpage to learn that the single
{} must immediately precede the +; another 'mis'assumption killed.
Looking back at Solaris' find manpage, it is true there as well,
but less clearly stated. GNU's find also does another thing better.
Invoked with two {}'s it clearly complains with "Only one instance
of {} is supported with -exec ... +" and exits with a 1 status.
Solaris 9's find is silent, executes nothing, and exits with a
"successful" status of 0. This could lead one to think nothing
was wrong and that there were simply no files matching the find criteria.
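
For commands where the pathnames cannot be last, one commonly used
workaround keeps the single {} immediately before the + by handing
each set to a small sh -c wrapper, which can then place "$@" wherever
the command needs it. A minimal sketch, assuming a POSIX sh; the
scratch directories are made up for the demonstration, and cp is just
one example of a command that wants its destination last.

```shell
#!/bin/sh
# Scratch directories used only for this demonstration.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/plain.txt" "$src/with space.txt"

# {} stays directly before the +. Inside the wrapper, the first argument
# is the destination and the remaining "$@" are the pathnames in the set,
# so cp receives the pathnames first and the destination last.
find "$src" -type f -exec sh -c 'dest=$1; shift; cp "$@" "$dest"' sh "$dest" {} +

ls "$dest"
```

The extra "sh" after the script fills $0, so the destination and the
pathnames arrive as the positional parameters.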
Thanks for an interesting discussion.
I learned several things, hope others did also.
Jon
--
Jon H. LaBadie jon@jgcomp.com
JG Computing
12027 Creekbend Drive (703) 787-0884
Reston, VA 20194 (703) 787-0922 (fax)