[Novalug] need help: server freezing -- How to troubleshoot

jerry w jerrywone@gmail.com
Fri Oct 23 07:29:47 EDT 2009


Richard, et al.:

A later thought was heat: does the time to lockup
depend on whether it was a cold or a warm boot?
I.e., short times when rebooting (already warm),
and long times when booting really cold...

Plus vibration: disks are rotating little buggers.
So besides wiggling the cables (IDE / SATA / power),
it would help to know whether you reused your old drives,
and if so to look for bad sectors (or the RAID equivalent;
I've been getting BIOS errors recently myself,
after the altercations on list: CMOS verification
errors on the RAID controller...).
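
For the bad-sector check, here is a rough sketch (mine, not anything
tested on Richard's box) that picks the worrying rows out of saved
`smartctl -A` output. The function name and the shortlist of attributes
are my own assumptions; it only relies on the usual attribute table layout
where the name is the second column and the raw value is the tenth.

```python
# Attributes that often point at failing media.
# This shortlist is an assumption, not an official list.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def suspicious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes
    whose raw value is a nonzero integer."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # headers and blank lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            found[name] = int(raw)
    return found
```

To use it, save the output of something like `smartctl -A /dev/sda` to a
file for each drive and pass the file contents to the function. An empty
result does not prove a drive is healthy; it only means none of the
watched counters were nonzero.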

Software might be more difficult to troubleshoot,
but as others have said you [may] have logs,
/etc config files, lsmod output, SMART disk warnings,
etc.
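
If the logs are long, a small filter can narrow them down before posting.
This is only a sketch under my own assumptions: the function name is made
up, and the patterns are illustrative guesses at what disk trouble tends
to look like in kernel messages, so adjust them to what your system
actually writes.

```python
import re

# Illustrative patterns only (an assumption); tune them for your kernel and drivers.
PATTERNS = [
    # libata-style messages, e.g. "ata3.00: exception ..."
    re.compile(r"\bata\d+(\.\d+)?: .*(exception|error|failed|timeout|reset)",
               re.IGNORECASE),
    # block-layer read/write failures
    re.compile(r"I/O error", re.IGNORECASE),
    # SMART warnings
    re.compile(r"\bSMART\b.*(fail|error)", re.IGNORECASE),
]


def disk_related_lines(log_text):
    """Return the log lines that match any of the disk-related patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(p.search(line) for p in PATTERNS)
    ]
```

Feed it the text of /var/log/messages or saved dmesg output. As Jay notes
below, a hard lockup may leave nothing in the log at all, so no matches
does not rule out hardware.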

Przemek presented on LVM on Wednesday night,
and I'm still trying to get the slides for posting on
DCLUG.Tux.org and learning what I was too tired
to even hear late in the day.

Others on the DCLUG (a very low-volume list
that could use some messages / knowledgeable
hardware posters, if apropos) may have ideas,
as may MA-Linux and BWBUG (Baltimore Washington /
Beowulf cluster lists), since you are doing a
rather interesting project, even for home use...


On Thu, Oct 22, 2009 at 9:56 PM, Jay Hart <jhart@kevla.org> wrote:
> Richard,
>
> Please post your messages file here.  Paul has a good idea, but hardware
> problems are not always captured in log files if the *Sg&S(# PC locks up prior
> to entry being written.
>
> I have successfully troubleshot hundreds of PCs, and the first thing I always
> try to do is go bare bones and see if problem still exists, then start adding
> one thing back into the system at a time.  Used this type of method in the
> Nuclear Navy to great effect.
>
> So post your messages file, so we can look it over.  Dmesg on startup would be
> nice too.  If you post the DMESG file, go with a full up configured PC.
>
> Jay
>
>> Richard Ertel wrote:
>>> *sigh*
>>>
>>> ok, so my fileserver is locking up. seems to always happen, anywhere
>>> from 1 minute to 4 hours after booting.  if i disconnect all four SATA
>>> hard drives (all for storage) and just have the boot drive (PATA)
>>> connected, it seems to stay up indefinitely.
>>>
>>> i've ran the SATA drives that i thought were problematic through
>>> Seagate's SeaTools, and they passed all tests.
>>>
>>> i've looked through /var/log/messages for entries when the lockup
>>> occurred, but nothing looks odd (to me, what do i know?)
>>>
>>> can anyone tell me where to start troubleshooting to get to the bottom of
>>> this?
>>>
>>> Ubuntu Server 8.04.3, all updates as of this morning.
>>
>> Rich Ertel,
>>
>>       On the one hand, the responses that you have received from Jay Hart and
>> Gerald Williams are not bad, indeed, they are good ideas. On the other
>> hand, that is not the best way to troubleshoot. IMO, blindly guessing as
>> to the cause (of a problem) is rarely the best way to troubleshoot. When
>> troubleshooting, you should always _first_ attempt to generate
>> diagnostic information (DI), and, in order to do that, you must identify
>> the tools (e.g., software packages) that will help you to generate DI.
>> Some of these tools are built into a standard Linux distribution, but
>> others must be installed. I cannot recall the names of any such tools
>> for HDD's and HDD controllers, but I am certain that they exist.
>>
>> Sincerely,
>> Paul Bain
>>
>> _______________________________________________
>> Novalug mailing list
>> Novalug@calypso.tux.org
>> http://calypso.tux.org/mailman/listinfo/novalug
>>
>
>



-- 
Jerry W


