Before You Think
It's often tempting to start a debugging session by jumping straight to what you think is the cause of the problem. But sometimes every attempt to come right is stymied by something else that undermines the platform you are working on, so there's a case to be made for first excluding the probable causes of such problems.
This applies particularly with PCs you haven't seen before, or that are known to be bad, or where the stakes of collateral damage are high - in the last two cases, I'd suggest the formal approach to the bad PC as your safety net.
Hardware: PSU and processor fans
Hardware: Test RAM diagnostics
Hardware: Physical hard disk
Malware: Formal scan
Hard drive free space
File system sanity
Temp file clutter
Underfootware
And then there are these two factors...
Testing the untestables
Defragging the file system
Hardware: PSU and processor fans
PSU fan failure pattern: Spontaneous resets, hard lockups, "sufferin' silicon" smell, parts of case may be warm to the touch
Processor fan failure pattern: Spontaneous hard lockups, possibly resets and error messages; single periodic pauses that stop mouse and keyboard activity (thermal protection as used in Pentium-233, etc.)
You can drape your hand over the back of the case to feel for moving air from the PSU fan, but you need to pop the case to check the processor fan. You may need to exclude warranty or ownership prohibitions on opening that case before you do that, though. Some systems may have an additional fan on the display card's chipset.
Hardware: Test RAM diagnostics
RAM failure pattern: Miscellaneous Windows errors and BSoDs, Windows Protection Error or "registry is corrupt" on startup, file system corruption, failed installs, crashes, lockups and resets
General diagnostic software had a pretty bad record on RAM testing, in that RAM that tested "OK" would often take problems with it when swap-tested across PCs. I've had better results with the free RAM testers MemTest86, and DocMem from www.simmtester.com.
There are three caveats involved with SIMMTester; it won't run from a WinME boot diskette (requires older version), it will understandably crash if the DOS startup has been less than totally clean, and it won't run except from diskette, though you can run it after a clean DOS mode boot from hard drive. MemTest86 will only run as booted from the diskette the download creates, an approach followed by recent SIMMTester downloads.
Hardware: Physical hard disk
HD failure pattern: Apparent hard lockups with the HD activity LED constantly lit, file system corruption, failure to boot up, explicit error messages, data loss, startup directly to shutdown with no desktop access between, crashes etc.
For best results, use an OS-agnostic diagnostic; often your hard drive manufacturer will have free diagnostics for download. If you don't know who made your hard drive and don't want to strip the PC to find out, there are free DOS-based utilities such as IDEID that can give you that information. Always be careful to stick to non-destructive tests!
An OS-agnostic diagnostic will ignore clusters marked as bad in the FAT and file system logic errors, which are not what you are interested in yet, and will check disk areas occupied by NTFS and non-MS partitions as well. You can make do with DOS mode Scandisk surface check in "look-don't-touch" mode, but this won't do a surface scan until it has been allowed to "fix everything" at the file system logic level first - and you may not want to do that, depending on the nature and type of file system errors it finds.
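To make the "look-don't-touch" idea concrete, here is a minimal Python sketch of a read-only pass that reads a file or disk image block by block and notes any offsets the operating system refuses to read. It is only an illustration: the function name and block size are my own assumptions, it works through the OS rather than beneath it, and it is no substitute for the manufacturer's OS-agnostic diagnostics.

```python
import os


def read_only_scan(path, block_size=64 * 1024):
    """Read `path` block by block without writing anything.

    Returns (blocks_read, bad_offsets). Hypothetical helper for illustration.
    """
    blocks_read = 0
    bad_offsets = []
    # "rb" opens strictly for reading; buffering=0 avoids Python-level caching.
    with open(path, "rb", buffering=0) as src:
        size = src.seek(0, os.SEEK_END)  # total bytes available to read
        for offset in range(0, size, block_size):
            try:
                src.seek(offset)
                src.read(min(block_size, size - offset))
                blocks_read += 1
            except OSError:
                # Note the unreadable block and carry on; never try to "fix" it.
                bad_offsets.append(offset)
    return blocks_read, bad_offsets
```

Because the file is only ever opened for reading, the pass cannot alter the file system it is looking at; that is the property to insist on in any real diagnostic too.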
Malware: Formal scan
Malware failure pattern: Can mimic RAM (errors, etc.), HD (boot failure, data loss), slow performance, and peculiarities from intended effects or unintended bugginess
As per formal virus check criteria. I include virus scanning early on with the hardware phase of my approach, as some malware can pervade both sides of the DOS mode and Windows fence - the only other software that is always common to both is IO.sys and pre-file-system boot record code.
Hard drive free space
HD free space failure pattern: Slow performance, Windows errors (BSoDs rare), crashes, file system corruption
Windows can hardly breathe without scribbling on the hard drive, and it's best to have at least 50M free space where Windows locates the swap file, Temp directory, Internet cache etc., which is usually the C: volume. The Dir command will give you a best-case but accurate reading in DOS mode, but if checking a FAT32 volume in Windows, you may need to not only "refresh" via the F5 key, but also do a ScanDisk logical check to ensure the free space count is accurate.
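As a rough illustration of the free space check, the following Python sketch compares a volume's reported free space against the 50M floor suggested above. The helper names are hypothetical, and treating "50M" as 50 MiB is my assumption. The figure comes from whatever the operating system reports, so it can be as stale as the Windows reading described above.

```python
import shutil

# Floor taken from the text's "at least 50M"; read here as 50 MiB (an assumption).
MIN_FREE_BYTES = 50 * 1024 * 1024


def has_enough_free_space(free_bytes, minimum=MIN_FREE_BYTES):
    """Return True when the free byte count meets the minimum."""
    return free_bytes >= minimum


def check_volume(path):
    """Return (free_bytes, ok) for the volume holding `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return usage.free, has_enough_free_space(usage.free)


if __name__ == "__main__":
    free, ok = check_volume(".")
    print(f"{free / (1024 * 1024):.0f} MiB free; enough: {ok}")
```

Point `check_volume` at the volume that holds the swap file, Temp directory and Internet cache, since that is where the shortage bites.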
File system sanity
File system failure pattern: Various focal software or activity failures, failure to boot up, explicit error messages, data loss, startup directly to shutdown with no desktop access between, apparent lack of free disk space, crashes etc.
You can use ScanDisk in DOS mode to check the file system for errors, but may not want to allow it to "fix" what it finds - and never use "auto-fix" if you are likely to be held responsible for the system you are working on! Show-stopper errors that are better served by formal data recovery include mismatched FAT tables, bad sectors, significant cross-linked files and damaged directories. A huge amount of data in lost cluster chains would also be a cue to eject.
If any abnormalities are found (especially if any repairs are allowed), you should save the results to the log file, appending to what is already there.
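If you also keep your own notes alongside the tool's log, the same append-don't-overwrite habit applies. A minimal Python sketch, assuming a plain text notes file; the function name and entry layout are hypothetical, not anything ScanDisk itself produces:

```python
from datetime import datetime


def append_log(log_path, tool, findings):
    """Append one timestamped entry; mode "a" never truncates earlier entries."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"[{stamp}] {tool}\n")
        for line in findings:
            log.write(f"    {line}\n")
```

Opening in append mode is the whole point: earlier findings stay in place, so the file becomes a running history of what was found and when.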
You may notice that I've said very little about Windows as yet. Until you know the hardware is OK and file system is sane, you daren't risk running it, because Windows:
Temp file clutter
Temp clutter failure pattern: Slow performance, application or activity errors, file system corruption; associated with crashes and shutdown failures
It is the number of directory entries, rather than the bulk of bytes within the files, that bloats up the Temp directory - especially when the need to create unique file names requires the whole of this directory to be searched to exclude matches. As the directory grows, it will tend to become fragmented, thus further slowing down access; the longer update period increases the risk of file system corruption and system instability.
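Since the entry count matters more than the byte count, it helps to measure both. A minimal Python sketch, assuming you only care about the immediate contents of one directory; the function name is hypothetical:

```python
import os


def summarize_directory(path):
    """Return (entry_count, total_bytes) for the immediate contents of `path`."""
    entries = 0
    total_bytes = 0
    with os.scandir(path) as it:
        for entry in it:
            entries += 1  # every file and subdirectory costs a directory entry
            if entry.is_file(follow_symlinks=False):
                total_bytes += entry.stat(follow_symlinks=False).st_size
    return entries, total_bytes
```

Running it on the directory returned by `tempfile.gettempdir()` gives both figures at once; thousands of entries adding up to only a few megabytes is exactly the clutter pattern described above.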
Temporary file locations include not only Temp, but also web browser caches and the spooler. Some applications create temp files in the same location as data files being edited, or even in the Windows base directory. By default, Internet Explorer will set aside ludicrous amounts of disk space for its browser cache.
Underfootware
Underfootware failure pattern: Slow performance, Windows or application errors, other
All sorts of interactions between underfootware and other tasks are possible: contention for a finite hardware resource, version soup effects when sharing code, or other software-level conflicts. Safe Mode suppresses both the startup axis and the use of PnP and most 3rd-party drivers. Compare this with disabling the startup axis alone, which is easiest to do on Windows 98 or later via the MSConfig utility.
Testing the untestables
Untestables failure pattern: Lockups that may persist through use of the reset button, resets, Windows or application errors, oddities on screen display if SVGA is involved, stack overrun if interrupt flooding
Whereas RAM and hard drive diagnostics are usually effective, there may be no useful diagnostics for other components, e.g. power supply, motherboard, display chipset or other cards. When these fail, the result is usually a hard lockup or spontaneous reset.
So you have to physically swap these out to test them - which can precipitate several complications! Windows Plug and Play will detect different hardware and want to install drivers for it, and will shuffle hardware resource allocations around as components come and go. In Windows XP, or if you have installed other recent commercial MSware, you may be stabbed in the back by "Product Activation"; when this detects the hardware has changed "too much", afflicted software may precipitously and permanently refuse to work until you phone Microsoft and beg.
For these reasons, it is better to avoid Windows when swap testing or reduction testing hardware. However, testing in DOS mode won't include functionalities that are wholly (e.g. USB, WinModems) or partially (e.g. UIDE modes of hard disk access, acceleration of SVGA or 3D) absent in this environment.
In dire cases of instability, it's often useful to pull the hard drive out of the sick system for data evacuation and malware scanning on a known-good PC, while the rest of the sick system does RAM diagnostics and swap testing. This removes the risk of data corruption from flaky hardware other than the hard drive, and keeps the effects of swap testing from being noticed by Windows. Needless to say, the sick system's hard drive should not be allowed to boot itself up in the other PC!
Defragging the file system
File fragmentation failure pattern: Variable mild to moderate system slowdown
Think of Defrag as a strenuous work-out in the gym: something to make a healthy system fitter, but dangerous when the system is sick. Specifically, Defrag potentially reads every file off the hard disk into RAM, and then writes it back again - so any flaky hardware is likely to cause massive data corruption or interrupt the process, losing data either way.
A fragmented file system causes slower performance, and as the critical period for file writes is increased, the risks of data corruption and instability are increased. But if all else is well, file system fragmentation alone is unlikely to cause crashes, lockups, resets or software errors of any kind.
(C) Chris Quirke, all rights reserved - December 2002, link massages April 2003, July 2004