Case Story: The "OK" hard drive

"Gee, you techs are lucky, having so much cool stuff lying around, like this spare hard drive..."
"Erm, yes, well, let me tell you about that one..."
<cue swimmy back-in-time effect...>

History

The system was one of a few PCs at a new site that was to be networked and put on broadband.  That meant hardening the PCs first, i.e. checking that they were malware-clean, and then patching and risk-managing them to stay that way.

One of the PCs was known to be flaky, so that one got the full Scandisk surface scan treatment as well as the single-pass RAM check, eyeball test for bad motherboard capacitors, and formal virus scan that they all got.  All PCs passed the logical file system Scandisk check OK, and were generally fine.

A few days later, I got a call that one of the PCs - and not the one noted to be flaky before - had fallen over and couldn't get up.

Troubleshooting

Scandisk's logical check showed mismatched FATs, so I aborted that and went straight into Norton's DiskEdit.  When DiskEdit starts up, it builds a list of file system errors that you can jump to; these included errors in the XP installation's pagefile, which might have been the showstopper that prevented XP from booting.

The FAT mismatch was a minor one, i.e. one FAT showed a file's chain continuing sequentially into another's, while the other FAT showed a terminating EOF (End Of File) marker at that point.  You might think such FAT mismatch errors are common; in practice I seldom see them, compared with far more seriously garbaged FAT contents.
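For readers who want to see what such a mismatch looks like in data terms, here is a minimal Python sketch of comparing two FAT copies entry by entry.  It assumes FAT32-style entries (4 bytes, little-endian, low 28 bits significant, values from 0x0FFFFFF8 up meaning EOF); the function names and the tiny sample FATs are made up for illustration, and this is not how DiskEdit itself works.

```python
import struct

FAT32_MASK = 0x0FFFFFFF      # only the low 28 bits of an entry are significant
FAT32_EOF_MIN = 0x0FFFFFF8   # values from here up mark the end of a chain
FAT32_BAD = 0x0FFFFFF7       # cluster marked bad


def parse_fat32(raw):
    """Turn raw FAT bytes into a list of masked 28-bit entries."""
    count = len(raw) // 4
    values = struct.unpack("<%dI" % count, raw[:count * 4])
    return [v & FAT32_MASK for v in values]


def describe(entry):
    """Human-readable meaning of one FAT entry."""
    if entry == 0:
        return "free"
    if entry == FAT32_BAD:
        return "bad"
    if entry >= FAT32_EOF_MIN:
        return "EOF"
    return "next=%d" % entry


def find_fat_mismatches(raw1, raw2):
    """Return (cluster, entry_in_fat1, entry_in_fat2) wherever the copies differ."""
    fat1, fat2 = parse_fat32(raw1), parse_fat32(raw2)
    return [(c, a, b) for c, (a, b) in enumerate(zip(fat1, fat2)) if a != b]


# Hypothetical sample: file A occupies clusters 2-3-4, file B occupies 5-6.
good = [0x0FFFFFF8, 0x0FFFFFFF, 3, 4, 0x0FFFFFFF, 6, 0x0FFFFFFF]
damaged = list(good)
damaged[4] = 5   # file A's chain now runs on into file B instead of ending

fat_copy_1 = struct.pack("<7I", *good)
fat_copy_2 = struct.pack("<7I", *damaged)

mismatches = find_fat_mismatches(fat_copy_1, fat_copy_2)
for cluster, a, b in mismatches:
    print("cluster %d: FAT1 says %s, FAT2 says %s" % (cluster, describe(a), describe(b)))
```

A comparison like this only tells you where the two copies disagree; deciding which copy is right still takes a look at the file's length and the surrounding chains.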

So I ballpointed myself an undo path and then tried to write an EOF to match the FATs, intending to then continue comparing the two FATs for further mismatches.  Here's what I got instead:

O..K... tip toe away quietly (Esc out of DiskEdit, switch off) and proceed directly to data recovery, which went fine.  Courtesy HD used to get the PC back into use, and everyone's happy.  The warranty replacement of this drive can then be done as and when convenient, as the PC's not waiting for the replacement drive.

At this point, I thanked my client's lucky stars they were not using NTFS, which would have done nothing to avoid the problem (disk errors are at a lower abstraction layer where the file system is irrelevant) and everything to screw up data recovery.  For NTFS there are no interactive file system or surface checking tools and no DiskEdit, and the file system's complexity and lack of documentation make the prospect of raw hex editing rather unattractive.

Further testing

Once the client had been running awhile without any "argh, where's my *.* file!" anguish, I tested the hard drive again.  As the bad sector was in FAT rather than cluster space, a surface scan may miss it, so I FDisk'd a 2G FAT16 primary that would likely position the defect within cluster space (FAT16 FATs have a far shorter footprint on disk). 
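The footprint claim can be checked with some rough arithmetic, sketched here in Python.  The 80 GB drive size and the 32K cluster size are assumptions chosen for illustration (the actual drive's size isn't the point), and the sum ignores reserved sectors and the root directory, so treat the figures as ballpark only.

```python
def fat_sectors(volume_bytes, cluster_bytes, entry_bytes, sector_bytes=512):
    """Approximate sectors needed for ONE copy of the FAT.

    Ignores reserved sectors and the root directory; the +2 covers the
    two reserved entries at the start of every FAT.
    """
    clusters = volume_bytes // cluster_bytes
    fat_bytes = (clusters + 2) * entry_bytes
    return -(-fat_bytes // sector_bytes)   # ceiling division


# 2G FAT16 volume, 32K clusters, 2-byte entries
fat16 = fat_sectors(2 * 1024**3, 32 * 1024, 2)

# Hypothetical 80 GB FAT32 volume, 32K clusters, 4-byte entries
fat32 = fat_sectors(80 * 10**9, 32 * 1024, 4)

print("FAT16: %d sectors per FAT, %d for both copies" % (fat16, 2 * fat16))
print("FAT32: %d sectors per FAT, %d for both copies" % (fat32, 2 * fat32))
```

On these assumptions the two FAT16 copies take up a few hundred sectors at the start of the volume, while the two FAT32 copies would have sprawled over tens of thousands, so a defect that sat somewhere in the old FAT area is very likely to land in cluster space after the re-partition.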

Surface scan passed without errors or latency, so it was on to Western Digital's own diagnostics:

So according to that, the hard drive is "OK".  But OK hard drives don't throw write errors when writing to core internal file system structures.  Would you trust your client's data on such a drive?

And so the "lucky" tech's "cool stuff" collection grows...

 

(C) Chris Quirke, all rights reserved, April 2005
