[Novalug] question about grep in perl
John Holland
jholland@vin-dit.org
Wed Mar 29 15:09:15 EDT 2017
So,
When this runs, it will read the file line by line and, for each line, check every word for a match. I assume the Perl regex engine will stop trying the lookahead chunks as soon as one fails.
It's well done.
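A minimal sketch of how the lookahead pattern behaves on individual lines, for anyone following along. The helper name build_and_pattern and the sample lines are made up for illustration, and the \Q...\E quoting is an addition that is not in Sean's script (it keeps a search word such as "a.b" from being treated as a regex).

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: build one case-insensitive regex that matches
# only if every word appears somewhere in the line.
sub build_and_pattern {
    my @words = @_;
    # Each (?=.*word) is a zero-width lookahead, so all of them are
    # tested from the same starting position in the line.
    my $pattern = join '', map { "(?=.*\Q$_\E)" } @words;
    return qr/$pattern/i;
}

my $re = build_and_pattern('fox', 'fell');

# Only the lines containing both words (in any order, any case) print.
for my $line ("The fox fell over\n", "The fox ran off\n", "Fell trees, no FOX\n") {
    print $line if $line =~ $re;
}
```

Within one starting position the lookaheads are tried left to right and the attempt is abandoned at the first failure. Because the pattern is not anchored, the engine may still retry from later positions on a line that does not match; prefixing the pattern with ^ is one way to avoid that.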
I would just observe that my code does the same thing, but very explicitly: a while loop over the lines of the file, containing a for loop over the words to check, skipping ahead in the while loop as soon as a match fails.
I couldn't make it happen on one line though!!!!
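For comparison, a minimal sketch of the explicit approach described above. This is not the code posted earlier in the thread: the sub name lines_with_all_words and the sample data are hypothetical, and it assumes a plain case-insensitive substring test with index() rather than a regex.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: return the lines from $fh that contain every word
# (case-insensitive substring match).
sub lines_with_all_words {
    my ($fh, @words) = @_;
    my @matches;
    LINE: while (my $line = <$fh>) {
        for my $word (@words) {
            # Give up on this line as soon as one word is missing.
            next LINE if index(lc $line, lc $word) < 0;
        }
        push @matches, $line;
    }
    return @matches;
}

# Sample data read through an in-memory filehandle instead of a real file.
my $data = "Fancy Dog Nose Face\nBiscuit McBarky\nDog Nose Biscuit\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print lines_with_all_words($fh, 'biscuit', 'dog');
close $fh;
```

With this sample data only "Dog Nose Biscuit" is printed, since it is the only line containing both words.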
> On Mar 29, 2017, at 2:56 PM, William Sutton via Novalug <novalug@firemountain.net> wrote:
>
> I am in awe.
>
> William Sutton
>
>> On Tue, 28 Mar 2017, Sean McGowan via Novalug wrote:
>>
>> it has been a few years, so my perl-fu is rusty (apparently I have
>> successfully deprogrammed myself from always appending semicolons) and this
>> is not fully tested, but it should work and be pretty quick...
>>
>> #!/usr/bin/env perl
>> # ^^generally this shebang will be more cross-platform
>> #
>> use strict;
>> use warnings;
>> # will match set of distinct strings to a long input line
>>
>> print "enter file name: ";
>> my $filename=<STDIN>;
>> chomp($filename);
>> # open file first so we fail fast
>> # use the three arg open
>> open my $fh, '<', $filename
>> or die "Boo! The file won't open!: $!";
>>
>>
>> print "enter search words: ";
>> my $string=<STDIN>;
>> chomp($string);
>>
>> # although this is neither here nor there
>> # you can one-liner this if you want e.g.:
>> # $string =~ s/^\s+|\s+$//g
>> $string=~ s/^\s+//; #strips leading white space
>> $string =~ s/\s+$//; #strips trailing white space
>>
>>
>> # i think you are ANDing all your matches... e.g:
>> # string contains 'fox' AND string contains 'fell' (case-insensitive)
>> # so you can use a positive lookahead. The whole expression is
>> # non-consuming, so it works as if you were ANDing any number of
>> # expressions.
>> # E.g. /(?=.*expr0)(?=.*expr1)(?=.*expr2)/
>> #
>> my $pattern = join "", map( "(?=.*$_)", split(/\s+/, $string));
>> #
>> # breaking this down, we *split* the string. you had done this as well
>> # to return an array. I used /\s+/ instead of ' ', just in case.
>> #
>> # Anyhow, *split* returns an array which we *map* to an expression.
>> # the expression here returns a string and *map* returns an array of
>> # those strings. we then *join* that array and store it in $pattern.
>>
>> # compile the regex so we don't have to do it every time
>> # note the 'i' at the end gives you your case-insensitivity.
>> my $re = qr/$pattern/i;
>>
>> # filter the list with grep
>> print join "", grep( /$re/, <$fh>);
>>
>>
>> On Tue, Mar 28, 2017 at 8:59 AM, Rich Kulawiec via Novalug <novalug@firemountain.net> wrote:
>>
>>>> On Tue, Mar 28, 2017 at 06:03:40AM -0400, William Sutton via Novalug wrote:
>>>> 2. I think that searching through a 65k line csv file is probably not the
>>>> best solution; I think you really should be storing this in a database and
>>>> using SQL to query a full text index... but that's just me
>>>
>>> My goodness no.
>>>
>>> The use of relational databases when flat files would easily suffice
>>> is one of the principal design blunders I've witnessed over the past
>>> several decades. Relational (and other) databases have their uses,
>>> but this sure isn't one of them: the overhead -- in terms of computational
>>> cost, complexity, and storage -- is much too high.
>>>
>>> The easiest, fastest, simplest way to solve this problem is to read the
>>> entire file into memory and search it there. I just wrote a dumb little
>>> Perl script to do exactly that.
>>>
>>> Here's the test file ("dogs") I used:
>>>
>>> Fancy Dog Nose Face
>>> Biscuit McBarky
>>> Dog Nose Biscuit
>>> Bark Bark Bark
>>> Sniffy Furball
>>> McBarky Biscuit
>>>
>>> Here's the Perl script ("foo.pl"):
>>>
>>> #!/usr/bin/perl
>>>
>>> @dogs = <STDIN>;
>>>
>>> while ($#ARGV>=0) {
>>> my $arg=shift(@ARGV);
>>> @dogs = grep($_ =~ $arg, @dogs);
>>> }
>>>
>>> print @dogs;
>>>
>>> Here's a demo:
>>>
>>> % perl foo.pl Bark Biscuit < dogs
>>> Biscuit McBarky
>>> McBarky Biscuit
>>>
>>> % perl foo.pl Biscuit Bark < dogs
>>> Biscuit McBarky
>>> McBarky Biscuit
>>>
>>> Of course error-checking, case-insensitive matching, etc., could all
>>> be added, but the basic idea is to read the entire file into an array,
>>> then step through the arguments while clobbering the array with only
>>> the entries that match. This is six lines of code, it's stupidly fast,
>>> and you can throw it away when you're done.
>>>
>>> ---rsk
>>> **********************************************************************
>>> The Novalug mailing list is hosted by firemountain.net.
>>>
>>> To unsubscribe or change delivery options:
>>> http://www.firemountain.net/mailman/listinfo/novalug
>>>
>>
>>
>>
>> --
>> Sean McGowan <spmcgowan@gmail.com>
>>
>> -Give a man a fire, warm him for a day, light a man on fire and warm him
>> for the rest of his life.