[Novalug] [OT] File and Database Questions

Bonnie Dalzell bdalzell@qis.net
Sun Dec 13 19:51:51 EST 2009


On Sun, 13 Dec 2009, Jon LaBadie wrote:

Thanks for the answers. They are on point for what I was interested 
in.

I was also going to have a bunch of indices so that common searches would 
usually be accomplished via the indices.

For example, if I want a report on the full sibs of an individual, I would 
already have built a combined full-sib index file in which each record 
line contains the identifiers for one set of sibs.
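
A minimal sketch of how such a full-sib index could be built, assuming each
dog record can be reduced to a (dog id, sire id, dam id) triple. The field
layout, the function names, and the space-separated index line format are
all hypothetical, since the actual record format is not given here.

```python
from collections import defaultdict

def build_full_sib_index(dogs):
    """Group dog ids by (sire, dam) pair.

    dogs: iterable of (dog_id, sire_id, dam_id) tuples (assumed layout).
    Returns a dict mapping (sire_id, dam_id) to a sorted list of dog ids,
    keeping only pairs that produced at least two dogs.
    """
    litters = defaultdict(list)
    for dog_id, sire_id, dam_id in dogs:
        if sire_id and dam_id:  # skip dogs with an unknown parent
            litters[(sire_id, dam_id)].append(dog_id)
    return {parents: sorted(ids)
            for parents, ids in litters.items() if len(ids) > 1}

def index_lines(index):
    """One index record per line: the ids of one set of full sibs."""
    return [" ".join(ids) for _, ids in sorted(index.items())]

# Made-up sample data, for illustration only.
dogs = [
    ("D1", "S1", "M1"),
    ("D2", "S1", "M1"),
    ("D3", "S1", "M2"),  # half sib of D1 and D2 (same sire, different dam)
    ("D4", "", "M1"),    # sire unknown, so not indexed
]
index = build_full_sib_index(dogs)
for line in index_lines(index):
    print(line)
```

With the index written out once, a full-sib report only needs to find the
one index line containing the dog's identifier instead of rescanning every
record.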


> On Sun, Dec 13, 2009 at 04:15:10PM -0500, Bonnie Dalzell wrote:
>>
>> Pardon me if this sounds like a dumb inquiry but here goes to elicit some
>> discussion:
>>
>> I have this large dataset (54000 dogs) of almost all ascii formatted
>> information about dogs which is the basis for the pedigree program I am
>> working on.
>>
>> A given dog record varies from 190 bytes to 400 bytes. Most of
>> them are the smaller size.
>
> Allowing an ave size of 300 bytes, that is only 16 MB.
>
>>
>> I have been trying to decide the best way to store this information in
>> a way that will minimize getting records confused and make editing
>> information within the record simple.
>
> Is the problem that you are using a text editor to enter and modify
> the individual records?
>
>>
>> I have a program I have written which can make display pedigrees from
>> the records.
>>
>> So I just did a little experiment and saved a set of 10 records each
>> to its own uniquely named file. The set of separate files seems to
>> occupy the same amount of disk space as the combined file of the 10
>> records.
>>
>> Given the ability in Linux to set up subdirectories in the manner
>> they do on cspan of w/wi/william, why not keep my individual records in
>> this manner rather than all mashed together in one giant file?
>
> I'll play and offer some considerations.
>
> With a single file, probably after the initial opening and a little
> bit of access, the entire file will be cached in memory and further
> access will not involve the disk.
>
> Accessing a single entry when its filename is discernable will
> probably be quick in the multi-file dataset.
>
> Searching for an entry based on the record's content will involve
> an average of over 25,000 file and directory openings and
> lots of disk access.
>
>
>
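
The trade-off described in the quoted reply can be sketched roughly as
follows. This assumes records are keyed by a name and that the combined
file holds one record per line; `shard_path`, `search_records`, and the
`dogs` root directory are hypothetical names used only for illustration.

```python
from pathlib import PurePosixPath

def shard_path(name, root="dogs"):
    """Map a record name to a w/wi/william style path under root."""
    key = name.strip().lower()
    if not key:
        raise ValueError("empty name")
    # First letter, then first two letters, then the full key.
    return PurePosixPath(root, key[:1], key[:2], key)

def search_records(lines, needle):
    """Content search over a combined file already read into memory."""
    needle = needle.lower()
    return [line for line in lines if needle in line.lower()]

# Lookup by name: the path is computed directly, so no searching is needed.
print(shard_path("William"))

# Lookup by content: with one combined file this is a single in-memory
# scan; with one file per record, every file would have to be opened.
records = ["1|Borzoi Alpha", "2|Borzoi Beta"]  # made-up sample lines
print(search_records(records, "beta"))
```

At roughly 54000 records of about 300 bytes each, the combined file is
small enough to read into memory in one pass, which is what makes the
content scan cheap.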

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                        Bonnie Dalzell, MA
mail:5100 Hydes Rd PO Box 60, Hydes,MD,USA 21082-0060|EMAIL:bdalzell@qis.net
Freelance anatomist, vertebrate paleontologist, writer, illustrator, dog
breeder, computer nerd & iconoclast... Borzoi info at www.borzois.com.
HOME www.batw.net    ART bdalzellart.batw.net  BUSINESS www.boardingatwedge.com



