[Novalug] Fwd: perl help | filename spaces
Michael Henry
lug-user@drmikehenry.com
Fri Mar 19 04:52:56 EDT 2010
On 03/18/2010 12:36 PM, Jon LaBadie wrote:
> On Thu, Mar 18, 2010 at 06:39:20AM -0400, Michael Henry wrote:
>> But that
>> downside was removed with the addition of the "+" delimiter for
>> ``-exec`` (I'm not sure how long ago).
>
> Quite a long time ago.
After a quick look, I think it was adopted into POSIX around
2004, and GNU find picked it up around 2005. From the NEWS file
in GNU findutils::
* Major changes in release 4.2.2, 2004-10-24
*** "find ... -exec {}+" is not yet supported.
[...]
* Major changes in release 4.2.12, 2005-01-22
*** -exec ... {} + now works.
The Single Unix Specification from 1997 doesn't have it:
http://www.opengroup.org/onlinepubs/007908799/xcu/find.html
But the 1003.1 2004 Edition page has it:
http://www.opengroup.org/onlinepubs/009695399/utilities/find.html
It also includes the following:
A feature of SVR4's find utility was the -exec primary's +
terminator. [...] The "-exec ... {} +" syntax adopted was a
result of IEEE PASC Interpretation 1003.2 #210. It should be
noted that this is an incompatible change to the ISO/IEC
9945-2:1993 standard.
So perhaps SVR4 had it much earlier, but POSIX hasn't had it for
a decade yet :-) I'm just curious about the timing. Regardless,
I didn't hear about it myself until fairly recently.
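Since this thread started with filename spaces, it's worth noting
that the ``+`` terminator batches arguments the same way
``-print0 | xargs -0`` does, with spaces intact. Here's a minimal
sketch (the scratch directory and files exist only for the demo):

```shell
# Demo that '-exec ... {} +' passes filenames with spaces safely,
# just like -print0 | xargs -0.  Scratch files created for the demo.
tmp=$(mktemp -d)
touch "$tmp/plain" "$tmp/has space"

# Both forms deliver exactly two arguments, spaces preserved:
find "$tmp" -type f -exec sh -c 'echo "batch of $#"' sh {} +
find "$tmp" -type f -print0 | xargs -0 sh -c 'echo "batch of $#"' sh

rm -rf "$tmp"
```

Both lines report a batch of 2; a plain (non ``-0``) xargs would
instead split "has space" into two bogus arguments.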
>> Here's a quick benchmark that's repeatable on my box:
>
> Timings of such a short duration are nearly meaningless.
As I mentioned, the test results are fully repeatable.
Benchmarking can take a lot of time, and I wasn't trying to give
numbers that demonstrate expected average time ratios. I ran
each individual test multiple times in a row to ensure the
entire data set (156 MB in my case) was cached. Timings would
of course be different if I flushed the cache explicitly before
each run.
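For the record, the warm-cache methodology I used looks roughly
like this sketch (``prime_and_time`` is a made-up helper name, and
``projects`` stands in for the real dataset; for cold-cache
numbers one could instead drop the Linux page cache between runs
as root):

```shell
# Warm-cache timing sketch: one priming run pulls the data into the
# page cache, then a second run is timed.  prime_and_time is a
# hypothetical helper name, not a standard tool.
prime_and_time() {
    "$@" > /dev/null 2>&1    # priming run; output and errors discarded
    time "$@" > /dev/null    # timed run against a warm cache
}

# Example usage against a hypothetical dataset directory:
prime_and_time find projects -type f -exec grep bigteststring {} +

# Cold-cache alternative (Linux, requires root):
#   sync; echo 3 > /proc/sys/vm/drop_caches
```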
In any event, I believe the timings are not meaningless, but
they are naturally a strong function of the dataset and the
operation provided to -exec. Using my same dataset, I can scale
up to a much longer test. For example, if the same directory is
provided multiple times to the ``find`` command, the original
dataset can be traversed multiple times, keeping the cache warm
on the data. With file ``10dot`` containing 10 lines of ".", I
get these results:
time find `cat 10dot` -type f -exec grep bigteststring {} +
real 0m2.035s
user 0m0.693s
sys 0m1.337s
time find `cat 10dot` -type f -print0 | xargs -0 grep bigteststring
real 0m1.573s
user 0m0.793s
sys 0m1.377s
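In case it's useful, here's one way to generate the ``10dot``
helper file (ten lines, each containing a single dot, so that
``cat 10dot`` hands find the same starting directory ten times):

```shell
# Build a file of ten "." lines; any equivalent loop works.
for i in $(seq 10); do echo .; done > 10dot
wc -l 10dot
```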
Similarly, traversing the same dataset 100 times (via a file
``100dot``), I get the following results, which show a
consistent ratio of execution times:
time find `cat 100dot` -type f -exec grep bigteststring {} +
real 0m20.399s
user 0m7.000s
sys 0m13.342s
time find `cat 100dot` -type f -print0 | xargs -0 grep bigteststring
real 0m16.436s
user 0m7.683s
sys 0m14.346s
I also copied my dataset into separate numbered directories to
see how things might change with larger datasets. So, for
example, 10 unique copies of the dataset yield these results:
time find projects-{1..10} -type f -exec grep bigteststring {} +
real 0m30.759s
user 0m0.963s
sys 0m5.790s
time find projects-{1..10} -type f -print0 | xargs -0 grep bigteststring
real 0m27.185s
user 0m1.070s
sys 0m5.840s
With 10 unique copies, both commands run substantially slower
than 10 traversals of a single copy, and the two timings are
much closer to each other, probably because the larger dataset
doesn't stay fully cached the entire time.
Instead of grepping through each entire file, consider a
different operation: reading just the first line of each file.
For the 10 independent copies, the results are:
time find projects-{1..10} -type f -exec head -1 {} + > /dev/null
real 0m1.113s
user 0m0.233s
sys 0m0.873s
time find projects-{1..10} -type f -print0 | xargs -0 head -1 > /dev/null
real 0m0.727s
user 0m0.277s
sys 0m1.003s
As another interesting comparison, the time taken for the
``find`` operation by itself is:
time find projects-{1..10} -type f > /dev/null
real 0m0.543s
user 0m0.103s
sys 0m0.433s
When the dataset and operation are such that merely finding the
files is a significant fraction of the overall processing time,
using xargs is a bigger win. Grepping through large file data
most likely dominates the time required to find the files in the
first place, so your benchmarks that show almost no difference
between the two are very believable given your dataset.
Finally, consider "chmod +r". This operation doesn't need to
read the file data at all, so it should complete fairly
quickly.
time find projects-{1..10} -type f -exec chmod +r {} +
real 0m1.606s
user 0m0.187s
sys 0m0.947s
time find projects-{1..10} -type f -print0 | xargs -0 chmod +r
real 0m1.203s
user 0m0.253s
sys 0m1.060s
Anyway, my point was not that it's a significant performance
difference to be routinely worried about (unlike ``-exec ... ;``),
but that with certain datasets it's a real, measurable effect
that's different between the two idiomatic techniques.
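To see how much worse the per-file ``-exec ... ;`` form can get,
here's a quick self-contained comparison on a scratch directory
of empty files (the directory and the file count are arbitrary
choices for the demo):

```shell
# Compare one fork/exec per file (';') against batched invocation ('+').
# The scratch directory exists only for this demo.
tmp=$(mktemp -d)
for i in $(seq 1000); do : > "$tmp/f$i"; done

time find "$tmp" -type f -exec true {} \;   # ~1000 process launches
time find "$tmp" -type f -exec true {} +    # one or a few launches

rm -r "$tmp"
```

On my understanding, the first form pays a fork/exec per file, so
its cost grows linearly with file count even when the command
itself does nothing.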
Michael Henry