[Novalug] ls -d

William Sutton william@trilug.org
Mon Oct 17 16:56:25 EDT 2016


As an alternative to multiple |grep -v (or |grep -vi) filters, you can 
also |egrep -vi "pattern1|pattern2|pattern3|...etc..."
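
Here's a minimal sketch of that, spelled as grep -E (the POSIX name for
what egrep does).  The sample lines and the three patterns are made up
for illustration; they just stand in for real file(1) output.

```shell
# Sample lines standing in for file(1) output (hypothetical data).
sample='foo/a.txt: ASCII text
foo/b.bin: ELF 64-bit LSB executable
foo/c.img: data
foo/d.html: HTML document, ascii text'

# One grep -E call replaces three chained "grep -vi" filters:
# drop every line that matches any alternative, ignoring case.
filtered=$(printf '%s\n' "$sample" | grep -Evi 'ascii text|executable|html document')

printf '%s\n' "$filtered"
```

Adding another type to exclude then means adding one more |alternative
inside the quotes rather than one more stage in the pipeline.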

William Sutton

On Mon, 17 Oct 2016, Rich Kulawiec via Novalug wrote:

>
> A few comments.
>
> First, there's no such thing as a "hidden file" on a Linux or Unix
> filesystem.  (There can be hidden volumes on Truecrypt'd devices,
> but that's a different animal.)
>
> Second, find(1) is your friend.  It is enormously powerful and well
> worth the time it takes to learn to use it well.  For many of
> the tasks described in this thread, it's a much better choice than ls(1).
>
> Third, file(1) doesn't know everything.  There are rather a lot of
> medical image formats that it's unaware of, at least in part because
> they're manufacturer-specific. [1]
>
> file(1) also makes mistakes.  For example:
>
> 	find mail -type f -a -exec file {} ';'
>
> run in my home directory on this system yields:
>
> 	mail/f1: HTML document, ASCII text
> 	mail/f2: ASCII text, with very long lines, with LF, NEL line terminators
> 	mail/f3: ASCII text
> 	mail/f4: ISO-8859 text
> 	mail/f5: C source, ASCII text
> 	mail/f6: ASCII text, with very long lines
> 	mail/f7: HTML document, Non-ISO extended-ASCII text
> 	mail/f8: HTML document, ASCII text, with very long lines
> 	mail/f9: UTF-8 Unicode text, with very long lines
>
> Every one of those is wrong.  All of those files are standard Unix
> mboxes -- a format that's been around for decades.  So not only
> is file(1) wrong, it's inconsistently wrong.  (I should probably
> write this up as a bug report.)  (I should probably include the
> code to fix it. ;) )
>
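
One rough way to double-check file(1) on suspected mboxes: a classic
mbox starts with a "From " separator line ("From", a space, then the
envelope sender).  The sketch below tests only that first line.  The
function name is_probably_mbox and the two demo files are made up, and
this is a heuristic of mine, not what file(1) does internally.

```shell
# Hypothetical helper: succeed if the first line of "$1" starts with "From "
# (the mbox message separator).  Heuristic only; other files can match too.
is_probably_mbox() {
    head -n 1 -- "$1" | grep -q '^From '
}

# Demo on two throwaway files (made-up contents).
tmp=$(mktemp -d)
printf 'From someone@example.org Mon Oct 17 16:00:00 2016\nSubject: hi\n\nbody\n' > "$tmp/f1"
printf '<html><body>not mail</body></html>\n' > "$tmp/f2"

for f in "$tmp/f1" "$tmp/f2"; do
    if is_probably_mbox "$f"; then
        echo "$f: looks like an mbox"
    else
        echo "$f: not an mbox by this test"
    fi
done
```
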
> Fourth, given the task of identifying an unknown number of medical image
> files of unknown format somewhere in a directory/tree of unknown depth
> and containing an unknown number of other files of unknown types,
> I'd probably try something like this:
>
> 	find foo -type f -a -exec file {} ';'
> 	find foo -type f -a -exec file {} ';' | grep -v ftype1
> 	find foo -type f -a -exec file {} ';' | grep -v ftype1 | grep -v ftype2
>
> and so on -- each time extending the pipeline by adding more file types
> that I'm pretty sure the images are *not*, e.g.:
>
> 	find foo -type f -a -exec file {} ';'
> 	find foo -type f -a -exec file {} ';' | grep -v "ASCII text"
> 	find foo -type f -a -exec file {} ';' | grep -v "ASCII text" | grep -v "executable"
>
> Obviously this has its issues.  If you guess wrong, you may miss the
> files you're looking for.  If there are 300 different kinds of files,
> this pipeline will become untenably long.  So here's another approach:
>
> 	find foo -type f -a -exec ls -s {} ';' | sort -n
>
> This will report the size of every file (in blocks), sorted in
> ascending order.  Chances are reasonably good that the medical image
> files will be the largest ones, that they'll all be about the same
> size, and that they'll all have the same suffix (if any).  So:
>
> 	find foo -type f -a -exec ls -s {} ';' | sort -n | tail -200
>
> (where I chose 200 as a plausible first guess) might suffice.
>
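
If one exec per file gets slow on a big tree, POSIX find can hand many
names to a single ls by ending -exec with + instead of ';'.  A sketch on
a throwaway directory follows; the file names and sizes are made up, and
the block counts you see will depend on the filesystem.

```shell
# Build a scratch tree with files of clearly different sizes.
tmp=$(mktemp -d)
mkdir -p "$tmp/foo/sub"
dd if=/dev/zero of="$tmp/foo/small.txt"      bs=1024 count=1    2>/dev/null
dd if=/dev/zero of="$tmp/foo/sub/medium.dat" bs=1024 count=64   2>/dev/null
dd if=/dev/zero of="$tmp/foo/sub/big.img"    bs=1024 count=1024 2>/dev/null

# "-exec ... {} +" runs ls once per batch of names, not once per file.
# ls -s prints the size in blocks first, so sort -n orders by size.
listing=$(find "$tmp/foo" -type f -exec ls -s {} + | sort -n)

printf '%s\n' "$listing"
```

Tack on "| tail -200" as before to keep only the largest entries.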
> You could also use find(1)'s built-in "-ls" action, which will
> save you a ton of exec() calls, but then you'll need to sort on the
> second field (the size in blocks):
>
> 	find foo -type f -ls | sort -k 2,2n
>
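
Block counts are coarse, so small files tend to tie.  The -ls output
also carries the size in bytes, in field 7 (after inode, blocks,
permissions, link count, owner and group), so you can sort on that
instead.  A sketch with made-up files; it assumes owner and group names
contain no spaces, since that would shift the fields.

```shell
# Scratch tree: one 1-byte file and one 64 KiB file (made-up names).
tmp=$(mktemp -d)
mkdir -p "$tmp/foo"
printf 'x' > "$tmp/foo/tiny"
dd if=/dev/zero of="$tmp/foo/big.img" bs=1024 count=64 2>/dev/null

# Field 7 of "find -ls" output is the size in bytes; sort on it alone.
by_size=$(find "$tmp/foo" -type f -ls | sort -k 7,7n)

printf '%s\n' "$by_size"
```
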
> ---rsk
>
> [1] Yes, there are standardized formats, see "DICOM", but that doesn't
> mean that they're universally used.
> **********************************************************************
> The Novalug mailing list is hosted by firemountain.net.
>
> To unsubscribe or change delivery options:
> http://www.firemountain.net/mailman/listinfo/novalug
>


