[Novalug] question about grep in perl

John Holland jholland@vin-dit.org
Wed Mar 29 15:09:15 EDT 2017


So,
When this runs, it will read the file line by line and, for each line, check every word for a match. I assume the Perl regex engine will stop trying the remaining lookahead chunks as soon as one of them fails.
It's well done.
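
For anyone following along, here is a minimal sketch of how that pattern gets built and used. The sub name build_and_regex and the sample lines are made up for illustration. The \Q...\E escaping and the leading ^ anchor are my assumptions about how one might harden it; the quoted script below interpolates the words unescaped and does not anchor.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build one regex that requires
# every word to appear somewhere in the line, in any order.
sub build_and_regex {
    my @words = @_;
    # One lookahead per word. \Q...\E escapes regex metacharacters in the
    # word (an addition; the quoted script interpolates words as-is).
    my $pattern = join '', map { "(?=.*\Q$_\E)" } @words;
    # The leading ^ is also an addition: without it the engine retries the
    # whole pattern at every offset of a line that cannot match.
    return qr/^$pattern/i;
}

my $re    = build_and_regex(qw(bark biscuit));
my @lines = ("Biscuit McBarky\n", "Bark Bark Bark\n", "McBarky Biscuit\n");

# Keeps only the lines that contain both words, case-insensitively.
print grep { $_ =~ $re } @lines;
```

Each lookahead is zero-width, so every one of them starts checking from the same position, which is what makes the chain behave like an AND.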
I would just observe that my code does the same thing, but very explicitly: a while loop over the lines of the file, containing a for loop over the words to check, skipping ahead to the next line as soon as a match fails.
I couldn't make it happen on one line though!!!!
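
Roughly, the explicit version looks like the sketch below. This is a reconstruction for illustration, not the original script: the sub name lines_matching_all, the use of index() for literal substring matching, and the in-memory test data are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical reconstruction of the explicit approach described above.
# Returns the lines from $fh that contain every word in @words.
sub lines_matching_all {
    my ($fh, @words) = @_;
    my @hits;
    LINE: while (my $line = <$fh>) {
        for my $word (@words) {
            # index() is a literal substring search; lc() on both sides
            # makes it case-insensitive. Skip the line on the first miss.
            next LINE if index(lc $line, lc $word) < 0;
        }
        push @hits, $line;    # reached only if every word was found
    }
    return @hits;
}

# Demo on an in-memory filehandle so the sketch runs without a real file.
my $data = "Fancy Dog Nose Face\nDog Nose Biscuit\nSniffy Furball\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print lines_matching_all($fh, qw(dog nose));
close $fh;
```

The labeled next is what does the "skipping ahead": it abandons the inner for loop and moves the while loop on to the next line.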


> On Mar 29, 2017, at 2:56 PM, William Sutton via Novalug <novalug@firemountain.net> wrote:
> 
> I am in awe.
> 
> William Sutton
> 
>> On Tue, 28 Mar 2017, Sean McGowan via Novalug wrote:
>> 
>> it has been a few years, so my perl-fu is rusty (apparently I have
>> successfully deprogrammed myself from always appending semicolons) and this
>> is not fully tested, but it should work and be pretty quick...
>> 
>> #!/usr/bin/env perl
>> # ^^generally this shebang will be more cross-platform
>> #
>> use strict;
>> use warnings;
>> # will match set of distinct strings to a long input line
>> 
>> print "enter file name: ";
>> my $filename=<STDIN>;
>> chomp($filename);
>> # open file first so we fail fast
>> # use the three arg open
>> open my $fh, '<', $filename
>> or die "Boo! The file won't open!:  $!";
>> 
>> 
>> print "enter search words: ";
>> my $string=<STDIN>;
>> chomp($string);
>> 
>> # although this is neither here nor there
>> # you can one-liner this if you want e.g.:
>> # $string =~ s/^\s+|\s+$//g
>> $string=~ s/^\s+//; #strips leading white space
>> $string =~ s/\s+$//; #strips trailing white space
>> 
>> 
>> # i think you are ANDing all your matches... e.g:
>> # string contains 'fox' AND string contains 'fell' (case-insensitive)
>> # so you can use a positive lookahead.  The whole expression is
>> # non-consuming
>> # so it works as if you were ANDing any number of expressions.
>> # E.g. /(?=.*expr0)(?=.*expr1)(?=.*expr2)/
>> #
>> my $pattern = join "", map( "(?=.*$_)", split(/\s+/, $string));
>> #
>> # breaking this down, we *split* the string.  you had done this as well
>> # to return an array.  I used /\s+/ instead of ' ', just in case.
>> #
>> # Anyhow, *split* returns an array which we *map* to an expression.
>> # the expression here returns a string and *map* returns an array of
>> # those strings.  we then *join* that array and store it in $pattern.
>> 
>> # compile the regex so we don't have to do it every time
>> # note the 'i' at the end gives you your case-insensitivity.
>> my $re = qr/$pattern/i;
>> 
>> # filter the list with grep
>> print join "", grep( /$re/, <$fh>);
>> 
>> 
>> On Tue, Mar 28, 2017 at 8:59 AM, Rich Kulawiec via Novalug <
>> novalug@firemountain.net> wrote:
>> 
>>>> On Tue, Mar 28, 2017 at 06:03:40AM -0400, William Sutton via Novalug wrote:
>>>> 2. I think that searching through a 65k line csv file is probably not the
>>>> best solution; I think you really should be storing this in a database and
>>>> using SQL to query a full text index... but that's just me
>>> 
>>> My goodness no.
>>> 
>>> The use of relational databases when flat files would easily suffice
>>> is one of the principal design blunders I've witnessed over the past
>>> several decades.  Relational (and other) databases have their uses,
>>> but this sure isn't one of them: the overhead -- in terms of computational
>>> cost, complexity, and storage -- is much too high.
>>> 
>>> The easiest, fastest, simplest way to solve this problem is to read the
>>> entire file into memory and search it there.  I just wrote a dumb little
>>> Perl script to do exactly that.
>>> 
>>> Here's the test file ("dogs") I used:
>>> 
>>>        Fancy Dog Nose Face
>>>        Biscuit McBarky
>>>        Dog Nose Biscuit
>>>        Bark Bark Bark
>>>        Sniffy Furball
>>>        McBarky Biscuit
>>> 
>>> Here's the Perl script ("foo.pl"):
>>> 
>>>        #!/usr/bin/perl
>>> 
>>>        @dogs  = <STDIN>;
>>> 
>>>        while ($#ARGV>=0) {
>>>                my $arg=shift(@ARGV);
>>>                @dogs = grep($_ =~ $arg, @dogs);
>>>        }
>>> 
>>>        print @dogs;
>>> 
>>> Here's a demo:
>>> 
>>>        % perl foo.pl Bark Biscuit < dogs
>>>        Biscuit McBarky
>>>        McBarky Biscuit
>>> 
>>>        % perl foo.pl Biscuit Bark < dogs
>>>        Biscuit McBarky
>>>        McBarky Biscuit
>>> 
>>> Of course error-checking, case-insensitive matching, etc., could all
>>> be added, but the basic idea is to read the entire file into an array,
>>> then step through the arguments while clobbering the array with only
>>> the entries that match.  This is six lines of code, it's stupidly fast,
>>> and you can throw it away when you're done.
>>> 
>>> ---rsk
>>> **********************************************************************
>>> The Novalug mailing list is hosted by firemountain.net.
>>> 
>>> To unsubscribe or change delivery options:
>>> http://www.firemountain.net/mailman/listinfo/novalug
>>> 
>> 
>> 
>> 
>> -- 
>> Sean McGowan <spmcgowan@gmail.com>
>> 
>> -Give a man a fire, warm him for a day, light a man on fire and warm him
>> for the rest of his life.