[Novalug] Fwd: perl help | filename spaces

Michael Henry lug-user@drmikehenry.com
Fri Mar 19 04:52:56 EDT 2010


On 03/18/2010 12:36 PM, Jon LaBadie wrote:
> On Thu, Mar 18, 2010 at 06:39:20AM -0400, Michael Henry wrote:
>> But that
>> downside was removed with the addition of the "+" delimiter for
>> ``-exec`` (I'm not sure how long ago).
>
> Quite a long time ago.

After a quick look, I think it was adopted into POSIX around
2004, and GNU find picked it up around 2005.  From the NEWS file
in GNU findutils::

  * Major changes in release 4.2.2, 2004-10-24
  *** "find ... -exec {}+" is not yet supported.
  [...]
  * Major changes in release 4.2.12, 2005-01-22
  *** -exec ... {} + now works.

The Single Unix Specification from 1997 doesn't have it:

http://www.opengroup.org/onlinepubs/007908799/xcu/find.html

But the 1003.1 2004 Edition page has it:
http://www.opengroup.org/onlinepubs/009695399/utilities/find.html

It also includes the following:

  A feature of SVR4's find utility was the -exec primary's +
  terminator. [...] The "-exec ... {} +" syntax adopted was a
  result of IEEE PASC Interpretation 1003.2 #210. It should be
  noted that this is an incompatible change to the ISO/IEC
  9945-2:1993 standard.

So perhaps SVR4 had it much earlier, but POSIX hasn't had it for
a decade yet :-) I'm just curious about the timing.  Regardless,
I didn't hear about it myself until fairly recently.
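
For anyone who hasn't run across the "+" form, the practical
difference is in how many child processes get spawned.  A
minimal sketch (the directory name ``testdir`` and the pattern
are placeholders):

  # One grep process per file found -- the traditional form:
  find testdir -type f -exec grep pattern {} \;

  # Filenames are batched onto as few command lines as will
  # fit, much like xargs, so only a few greps ever run:
  find testdir -type f -exec grep pattern {} +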

>> Here's a quick benchmark that's repeatable on my box:
>
> Timings of such a short duration are nearly meaningless.

As I mentioned, the test results are fully repeatable.
Benchmarking can take a lot of time, and I wasn't trying to give
numbers that demonstrate expected average time ratios.  I ran
each individual test multiple times in a row to ensure the
entire data set (156 MB in my case) was cached.  Timings would
of course be different if I flushed the cache explicitly before
each run.
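
For what it's worth, flushing the cache between runs can be done
explicitly on Linux; a sketch, assuming root access and a kernel
with the /proc/sys/vm/drop_caches interface:

  # Write out dirty pages, then drop the page cache plus
  # dentries and inodes so the next run starts cold:
  sync
  echo 3 | sudo tee /proc/sys/vm/drop_caches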

In any event, I believe the timings are not meaningless, but
they are naturally a strong function of the dataset and the
operation provided to -exec.  Using my same dataset, I can scale
up to a much longer test.  For example, if the same directory is
provided multiple times to the ``find`` command, the original
dataset can be traversed multiple times, keeping the cache warm
on the data.  With file ``10dot`` containing 10 lines of "."
(one way to generate it is sketched after the timings), I get
these results:

  time find `cat 10dot` -type f -exec grep bigteststring {} +

  real    0m2.035s
  user    0m0.693s
  sys     0m1.337s

  time find `cat 10dot` -type f -print0 | xargs -0 grep bigteststring

  real    0m1.573s
  user    0m0.793s
  sys     0m1.377s
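
(For reference, one way to generate the ``10dot`` and ``100dot``
helper files, each just "." repeated once per line:)

  # find traverses the same tree once per "." argument:
  yes . | head -10  > 10dot
  yes . | head -100 > 100dot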

Similarly, for 100 copies of the same dataset, I get the
following numbers, which show a consistent ratio of execution
times:

  time find `cat 100dot` -type f -exec grep bigteststring {} +

  real    0m20.399s
  user    0m7.000s
  sys     0m13.342s

  time find `cat 100dot` -type f -print0 | xargs -0 grep bigteststring

  real    0m16.436s
  user    0m7.683s
  sys     0m14.346s

I also copied my dataset into separate numbered directories to
see how things might change with larger datasets (one way to
make the copies is sketched below).  So, for example, 10 unique
copies of the dataset yield these results:

  time find projects-{1..10} -type f -exec grep bigteststring {} +

  real    0m30.759s
  user    0m0.963s
  sys     0m5.790s

  time find projects-{1..10} -type f -print0 | xargs -0 grep bigteststring

  real    0m27.185s
  user    0m1.070s
  sys     0m5.840s

For 10 unique copies, the run was substantially slower than 10
traversals of the same copy, and the two timings are much closer
together, probably because the larger dataset doesn't stay fully
cached the entire time.
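
(The numbered copies can be made along these lines; the source
directory name ``projects`` is assumed for illustration:)

  # Ten independent copies, so each traversal touches
  # distinct inodes and file data:
  for i in {1..10}; do
      cp -a projects projects-$i
  done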

Instead of grepping through the entire file, consider a
different operation: reading just the first line of each file.
For 10 independent copies, I get:

  time find projects-{1..10} -type f -exec head -1 {} + > /dev/null

  real    0m1.113s
  user    0m0.233s
  sys     0m0.873s

  time find projects-{1..10} -type f -print0 | xargs -0 head -1 > /dev/null

  real    0m0.727s
  user    0m0.277s
  sys     0m1.003s
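
One caveat with this pair: when GNU head is handed more than one
filename, as happens with both ``{} +`` and ``xargs``, it prints
a ``==> filename <==`` header before each file's output.  The
redirect to /dev/null hides that here, but when the output
matters, the headers can be suppressed:

  # -q suppresses the per-file "==> name <==" headers that
  # GNU head prints when given multiple file arguments:
  find projects-{1..10} -type f -exec head -q -n 1 {} + > /dev/null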

As another interesting comparison, the time taken for the
``find`` operation by itself is:

  time find projects-{1..10} -type f > /dev/null

  real    0m0.543s
  user    0m0.103s
  sys     0m0.433s

When the dataset and operation are such that merely finding the
files is a significant fraction of the overall processing time,
using xargs is a bigger win.  Grepping through large file data
most likely dominates the time required to find the files in the
first place, so your benchmarks that show almost no difference
between the two are very believable given your dataset.
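
A rough way to read those numbers is to subtract the find-only
baseline from each pipeline's wall time, as a crude estimate of
the cost of running head (crude especially for the xargs case,
since find and head run concurrently there rather than back to
back):

  -exec head -1 {} + :  1.113s - 0.543s = 0.570s beyond bare find
  xargs -0 head -1   :  0.727s - 0.543s = 0.184s beyond bare find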

Finally, consider the operation "chmod +r".  It doesn't need to
read the file data at all, so it should be a fairly quick
operation:

  time find projects-{1..10} -type f -exec chmod +r {} +

  real    0m1.606s
  user    0m0.187s
  sys     0m0.947s

  time find projects-{1..10} -type f -print0 | xargs -0 chmod +r

  real    0m1.203s
  user    0m0.253s
  sys     0m1.060s
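
Part of the pipe's consistent edge is structural: find and the
child command run concurrently on either side of the pipe, while
``-exec`` suspends the traversal during each batch.  GNU xargs
can also parallelize the child command, which ``-exec`` has no
equivalent for; a sketch:

  # -P 4 lets up to four chmods run at once (GNU xargs):
  find projects-{1..10} -type f -print0 | xargs -0 -P 4 chmod +r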

Anyway, my point was not that it's a significant performance
difference to be routinely worried about (unlike ``-exec ... ;``),
but that with certain datasets it's a real, measurable effect
that's different between the two idiomatic techniques.

Michael Henry



