[Novalug] Fwd: perl help | filename spaces

jecottrell3@comcast.net jecottrell3@comcast.net
Fri Mar 19 15:47:14 EDT 2010


I'm torn. Having made a career out of blasting -exec, and using it as an example of why you should learn xargs, I am reluctant to change my ways. However, the + form of -exec is clearly better integrated (no -0 kluges needed) and conceptually cleaner.
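
For anyone who has not hit the filename-spaces problem, here is a minimal sketch of the three variants under discussion. The filenames are made up, and it assumes the temp directory path itself contains no whitespace. It only counts how many arguments each variant hands to the command.

```shell
# Scratch directory with one awkward filename (names are made up for the demo).
dir=$(mktemp -d)
touch "$dir/a b.txt" "$dir/c.txt"

# Plain pipe: xargs splits on whitespace, so "a b.txt" arrives as two arguments.
plain_count=$(( $(find "$dir" -type f | xargs printf '%s\n' | wc -l) ))

# The -print0 / -0 pair: NUL-delimited names survive intact.
null_count=$(( $(find "$dir" -type f -print0 | xargs -0 printf '%s\n' | wc -l) ))

# -exec ... {} +: find builds the argument list itself, no delimiter needed.
exec_count=$(( $(find "$dir" -type f -exec printf '%s\n' {} + | wc -l) ))

echo "plain=$plain_count print0=$null_count exec+=$exec_count"
rm -rf "$dir"
```

Under those assumptions the plain pipe should report 3 arguments for 2 files, while the other two report 2.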

OK, just to show y'all I'm not a stick in the mud and can roll with the (good) changes...

We Accept You, One Of Us!

Having said all that, both sides of the pipe (real or conceptual) are likely to be I/O bound, and so splitting into two processes will allow the CPU work of one process to overlap the I/O wait time of the other.

And while the single-file version of -exec returns an exit status for each file, it's not clear that the multiple-file version does, meaning that find could indeed fork and execute as two processes. Or two threads. Fibers?

Actually, I just ran some tests that indicate that exit status on the multiple-file version is ignored.
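
This is not the test described above, just one possible way to check the same thing, assuming a find that supports the + form. It uses "false" as the command and looks at whether a following -print still fires.

```shell
dir=$(mktemp -d)
touch "$dir/one" "$dir/two" "$dir/three"

# ';' form: -exec is true only if the command exits 0, so "false" blocks -print.
semi_count=$(( $(find "$dir" -type f -exec false {} \; -print | wc -l) ))

# '+' form: -exec always evaluates as true, so -print fires for every file.
# Some find versions still exit non-zero overall here; "|| true" hides that
# so the pipeline behaves the same everywhere.
plus_count=$(( $( { find "$dir" -type f -exec false {} + -print || true; } | wc -l ) ))

echo "semicolon form printed $semi_count names, plus form printed $plus_count"
rm -rf "$dir"
```

If the + form really does ignore the command's status as a predicate, the first count is 0 and the second is 3.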

I also agree with whoever mentioned that short running times indicate little.

JIM

----- Original Message -----
From: "Jon LaBadie" <novalugml@jgcomp.com>
To: novalug@calypso.tux.org
Sent: Friday, March 19, 2010 11:32:50 AM GMT -05:00 US/Canada Eastern
Subject: Re: [Novalug] Fwd:  perl help | filename spaces

On Fri, Mar 19, 2010 at 04:52:56AM -0400, Michael Henry wrote:
> On 03/18/2010 12:36 PM, Jon LaBadie wrote:
> > On Thu, Mar 18, 2010 at 06:39:20AM -0400, Michael Henry wrote:
> >> But that
> >> downside was removed with the addition of the "+" delimiter for
> >> ``-exec`` (I'm not sure how long ago).
> >
> > Quite a long time ago.
> 
> After a quick look, I think it was adopted into POSIX around
> 2004, and GNU find picked it up around 2005.  From the NEWS file
> in GNU findutils::

Agreed.  I'm colored by having used it since the late 80's or early
90's when it was adopted in SVR4 as you note.  Quite a few systems
implemented the feature but did not document it.  I was unaware
of this until I checked with my UNIX history guru, Sven Mascheck.
His page on the '+' delimiter is:

http://www.in-ulm.de/~mascheck/various/find/#xargs

> >> Here's a quick benchmark that's repeatable on my box:
> >
> > Timings of such a short duration are nearly meaningless.
> 
> As I mentioned, the test results are fully repeatable.
> Benchmarking can take a lot of time, and I wasn't trying to give
> numbers that demonstrate expected average time ratios.

When I saw your results my (likely biased) view was different
than yours.  IIRC you suggested that with the + delimiter,
find, being single threaded, had to work in a linear fashion,
while the pipelined version with xargs naturally allowed
for multi-processing efficiencies.

Perhaps you are correct, and there are reasons why find must
wait for the exec'ed process to finish before continuing to
execute its find activity, but if I were writing find, I'd
strive very hard to fork off that process and continue finding.

So I looked for alternative explanations for the longer clock
(real) times you noted in your test.  My hypothesis is that they
come from the kernel's cpu (core) allocation algorithm on a
multi-cpu system like yours.

Note, all your timings of the xargs version are less "efficient"
than the + versions as they show higher cpu usage (user + sys)
than the corresponding + versions.  It is clear that xargs and
find were often running on different cpus as their real time
is less than their cpu usage.  In contrast, the + versions
basically have identical cpu usage and clock time indicating
a saturation of a single cpu.
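
The comparison above can be sketched as a small helper.  The
function name cpu_overlap and the figures fed to it are made up
for illustration; they are NOT the numbers from the benchmark.

```shell
# cpu_overlap REAL USER SYS
# Prints "overlapped" when user+sys exceeds real (work ran on more than
# one cpu at once), otherwise "serial". awk handles the decimal math.
cpu_overlap() {
    awk -v real="$1" -v user="$2" -v sys="$3" 'BEGIN {
        print ((user + sys > real) ? "overlapped" : "serial")
    }'
}

# Made-up figures in seconds, NOT the numbers from the benchmark.
xargs_case=$(cpu_overlap 1.00 0.80 0.50)   # cpu 1.30 > real 1.00
plus_case=$(cpu_overlap 1.00 0.60 0.38)    # cpu 0.98 <= real 1.00
echo "xargs-like: $xargs_case, plus-like: $plus_case"
```

Feed it the real, user and sys values that the shell's "time"
reports for each variant.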

I wonder if the cpu allocation algorithm initially allocates
separate cpus to processes started at essentially the same
time (as in a pipeline) but for a single fork uses the current
cpu, as some metadata is already there and does not need to
be transferred to the other cpu's cache.  For similar reasons,
a process may be partially constrained to the same cpu so
as to avoid the inefficiency of moving/duplicating the
process metadata.


Whether my analysis is reasonable or not, I agree with you
that whatever activity is exec'ed or xarg'ed is likely to
dominate the timings.

BTW I tried your "chmod" example on a mono-core system and
saw no repeatable differences between the + version and
the xargs version.  Not a surprise.
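
On a multi-core Linux box, one rough way to approximate the
mono-core case is to pin the whole pipeline to a single cpu.
This is only a sketch: it assumes taskset from util-linux is
installed, falls back to an unpinned run if it is not, and uses
"chmod u+x" on made-up files as a stand-in for the real test.

```shell
dir=$(mktemp -d)
touch "$dir/a b" "$dir/c"

# The pipeline under test; the directory is passed to sh as $0.
pipeline='find "$0" -type f -print0 | xargs -0 chmod u+x'

if command -v taskset >/dev/null 2>&1 && taskset -c 0 true 2>/dev/null; then
    # Pin find, xargs and chmod to cpu 0 so they cannot spread across cores.
    taskset -c 0 sh -c "$pipeline" "$dir"
    pinned=yes
else
    # No usable taskset: run unpinned so the sketch still does something.
    sh -c "$pipeline" "$dir"
    pinned=no
fi

# Count the files that now carry the owner-execute bit.
changed=$(( $(find "$dir" -type f -perm -100 | wc -l) ))
echo "pinned=$pinned changed=$changed"
rm -rf "$dir"
```

Timing the pinned and unpinned runs (for example with the
shell's "time") would show whether the xargs advantage goes
away when everything shares one cpu.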

I'll still recommend the -exec + versions.  I think it is
sufficiently widespread, probably more so than -print0 | xargs -0
if you consider UNIX systems also.  Xargs also has some
problems dealing with multi-byte character sets and commands
where the filenames are not the last items on the command line.
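
For the "filenames not last" case, one workaround I am aware of
(it is not from this thread) is a small sh wrapper that reorders
the arguments.  A minimal sketch, with made-up filenames and cp
as the example command:

```shell
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/my notes.txt" "$src/todo.txt"

# cp wants the destination last, but {} must sit right before the "+".
# The small sh wrapper receives the destination as $0 and the batch of
# filenames as "$@", then reorders them for cp.
find "$src" -type f -exec sh -c 'cp "$@" "$0"' "$dest" {} +

copied=$(( $(find "$dest" -type f | wc -l) ))
echo "copied $copied files"
rm -rf "$src" "$dest"
```

The same wrapper can sit behind xargs -0 as well, so this shows
the idiom rather than an advantage of either tool.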

-- 
Jon H. LaBadie                  jon@jgcomp.com
 JG Computing
 12027 Creekbend Drive		(703) 787-0884
 Reston, VA  20194		(703) 787-0922 (fax)