Hard Drive Data Corruption

Various things can cause data on the hard drive to be corrupted in various ways. With an understanding of the FATxx file system and the problems reported by ScanDisk, you can not only troubleshoot file system errors, but also determine the cause of the problem (so that you can solve this and avoid more of the same).

This page will consider various causes of data corruption, and the patterns of damage they cause.

Cause: Bad exits from Windows

This happens if the PC is reset or switched off without shutting down Windows properly, e.g.:

Unless the cause is a crash that deviates from sane file management, these situations will corrupt data by interrupting sane file system operations. The only damage you should see are:

In theory, it is possible to interrupt updates to the FAT so that mismatched FAT errors occur, but in practice I have hardly ever seen this - the two FAT copies are updated so soon after each other that the critical window is small.

In particular, you should never see crosslinked files, corrupted directories, corrupted media byte or physical bad clusters. Any of the first four of these suggests insane file operations, which in turn implies deranged software or flaky hardware. The fifth problem suggests a failing hard drive.

Microsoft seems to assume bad exits to be the only form of data corruption you will see. Thus the journalling and rollback feature of NTFS is touted as making data corruption a thing of the past, and thus the assumption that mismatched FAT errors are safe to auto-fix by blindly copying one FAT copy over the other.
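As an alternative to blindly copying one FAT over the other, the two copies can be compared entry by entry so that a human decides which side is sane. The following is a minimal Python sketch, not a repair tool. It assumes a FAT16 volume (2-byte little-endian entries) and that both FAT copies have already been read into memory; the function name is hypothetical.

```python
import struct

def diff_fat16_copies(fat1: bytes, fat2: bytes):
    """Return (cluster, value_in_fat1, value_in_fat2) for each mismatched entry.

    Assumes FAT16: every entry is a 16-bit little-endian value.
    Only the span common to both buffers is compared.
    """
    n = min(len(fat1), len(fat2)) // 2
    a = struct.unpack_from(f"<{n}H", fat1)
    b = struct.unpack_from(f"<{n}H", fat2)
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]
```

A handful of isolated mismatches would fit an interrupted update; whole sectors of mismatched entries point elsewhere, and in that case neither copy should be trusted without inspection.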

Cause: Hard drive disk defects

There are various patterns of hard drive failure:

In all but the first of these, it should be possible to evacuate most or all data to safety before the drive fails completely.

Catastrophic disk failure typically presents as a drive that can no longer be detected by CMOS setup, or that fails to pass BIOS POST so that the operating system never sees it. Often the POST failure will follow a long pause, during which you may hear no hard drive seek activity, or a cyclical repetitive pattern of seek activity such as rapid clk-clk-clk... or slower seek-to-end nyyyyyyyakk-nyyyyyyyakk-nyyyyyyyakk noises.

Unless you have access to clean-room facilities (and the know-how to wield these effectively), the story ends there; it's not possible to perform logic-level data recovery on a hard drive that does not spin, does not seek, or is not seen and accepted as functional by BIOS.

The one exception is if the logic board of the hard drive has failed, and a matching replacement can be swapped into place. Note that "matching" may require not only the same drive capacity and model number, but also the same logic board and firmware revision numbers.

Gradual disk failure (or partial failure) may present in the following ways:

Modern hard drives are manufactured with extra sectors that are not used, but are held in reserve. If a sector is found to be unreadable, or requires too many attempts before the error-detection checksum passes, then the contents of that sector are copied to one of the spares, and the spare is set to be used at the same physical sector address as the failed sector (which is no longer used).

If the sector is detected as failing before it has failed completely, then this process is "harmless" (other than hiding the fact that the hard drive is failing, and possibly delaying appropriate warranty management). However, if the sector cannot be read, or is mis-read as garbage, then silent data loss can occur.

This internal hard drive defect management happens at the raw sector level, below the awareness of operating systems or diagnostic utilities. S.M.A.R.T. allows some insight as to what the hard drive's self-management is doing.

It is only when the drive's defect management is unable to "fix" a failing sector - either because the failure is too sudden and complete, or because all the spare sectors have been used up for previous defects - that bad sectors will become visible to diagnostics such as ScanDisk. That is why any hard drive with "just one bad sector" should be treated with suspicion, and immediately replaced under warranty if possible.

A sector may require multiple attempts to read it successfully, yet still be considered "good". When this happens, you may notice a drastic slowdown of system performance, with these notable features:

In particular, if a DOS mode ScanDisk surface scan shows erratic pauses in the tested sector count, this is always significant. It's harder to evaluate this when doing the same test within Windows, because the count is incremented less frequently and by larger amounts, and because there may be other running tasks that slow the process.
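The same idea of watching for erratic pauses can be applied to any list of per-block read times. This Python sketch assumes the timings have already been collected by some other means (how they are measured is outside its scope), and the factor of 10 is an arbitrary threshold chosen for illustration.

```python
from statistics import median

def slow_reads(timings, factor=10.0):
    """Return (index, seconds) for reads far slower than the typical read.

    'Typical' is taken as the median, so a few very slow reads
    do not drag the baseline upward.
    """
    if not timings:
        return []
    typical = median(timings)
    return [(i, t) for i, t in enumerate(timings) if t > factor * typical]
```

Outliers that recur at the same block positions on repeated passes are more suspicious than ones that move around, which may just be other tasks competing for the drive.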

Finally, a sector that is marked as "bad" by Format or ScanDisk surface scan will be ignored (not tested) by subsequent ScanDisk surface scans. This certainly does not make the drive fit for use!

It may be that "general surface wear" is often or usually due to pollution of the sealed air space within which the disks and heads operate. This may result from failure of the air filter or seal, opening of the unit (always a baaad idea), or debris kicked up by a head crash. I have seen claims of spurious sector failure in newsgroups and elsewhere, on the basis of some sort of "soft" corruption that invalidates the checksum or other internal data, but remain sceptical. The only "bad sector" artefacts I've seen have been where incorrect geometry or capacity assumptions cause attempts to read beyond the last sector that is physically present.

Cause: Other flaky hardware

Anything that corrupts data in RAM, or on the busses to and from the hard drive, can cause corruption of the hard drive contents, e.g.:

These failures can have two broad effects: corrupting what is written to the disk, and corrupting where things are written to the disk. Both can be disastrous!

Cause: Deranged software

The trend in operating systems is for all direct disk access to be managed by the operating system and the drivers for the particular hardware in use. Typically, no other process is allowed direct disk access. Windows NT (including Windows 2000 and XP) enforces this protection with some vigor, as does Windows 9x - though rogue software may bypass this in Windows 9x if disk access is in DOS compatibility mode.

This makes it unlikely for application software to cause data corruption via wild writes or insane file operations. What is written within the file's data clusters is up to the application, but it is unlikely to write outside the file's data clusters, over internal file system structures, or outside the bounds of the volume or partition.

However, flaky operating system and device driver code can cause, and has caused, data corruption through wild writes and file system insanity. This is more likely to happen at the device driver level, as the smaller user base may delay detection and correction of problems. Message to VIA: There is no place in the real world for beta hard drive controller drivers!

Cause: Malware

Traditional virus payloads caused severe data corruption by overwriting arbitrary areas of raw disk with garbage. This can still happen where the malware is running before the operating system is in control (e.g. in the pre-file-system boot code or real-mode phases of startup) or where operating system control is bypassed or foiled.

Today's malware is more likely to "paint within the lines", i.e. overwrite data through legitimate file operations. Not only is this easier, but it is just as effective in destroying data (if not more so). Antivirus heuristics look for attempts to perform low-level disk access, so malware can avoid this risk by sticking to normal "open for write, write data" methods.

Several malware families cause characteristic patterns of corruption, e.g. the CIH payload overwrites the first 1M of the first physical hard drive, etc. Malware reference sites such as www.f-secure.com/v-descs et al are the best place to research these, and readers with more than a passing interest in such matters would do well to read such sites regularly.

Pattern: Silent data loss

Sometimes you will hear stories of files "just disappearing", with no errors or crashes. Before you refer the user to psychiatric care, consider these possibilities:

You should do a formal malware scan, check startup axis for RATs, and apply risk management with particular reference to network shares and binding of file and print sharing to Internet access. Then look at the C:\Scandisk.log to see what has been going on, run ScanDisk to make sure "automatically fix errors" is not checked, then check ScanDisk.ini to see what auto-ScanDisk is being allowed to do. Ask about shutdown procedures and co-users of the system.

You should also look at the directories in raw disk form, to see whether files have been erased (erased directory entries start with the E5h character). ScanDisk or internal defect management fixing will leave no erased directory entries, and an after-the-fact Defrag may clear these also. As usual, you need to avoid all writes to the volume to avoid the risk of overwriting the data clusters before these lost files can be unerased.
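To make the E5h check concrete, here is a small Python sketch that walks a raw directory image and lists erased short-name entries. It assumes the standard 32-byte FAT directory entry layout on a FAT12/FAT16 volume (start cluster at offset 26, size at offset 28) and that the directory's sectors have already been copied into a bytes object without writing to the volume. The function name is hypothetical.

```python
import struct

ENTRY_SIZE = 32  # every FAT directory entry is 32 bytes

def erased_entries(raw_dir: bytes):
    """Return (name, start_cluster, size) for each erased short-name entry."""
    found = []
    for off in range(0, len(raw_dir) - ENTRY_SIZE + 1, ENTRY_SIZE):
        entry = raw_dir[off:off + ENTRY_SIZE]
        first = entry[0]
        if first == 0x00:
            break      # never-used entry marks the end of the directory
        if first != 0xE5:
            continue   # live entry, not erased
        if entry[11] == 0x0F:
            continue   # long-file-name slot, skip
        # The first character was overwritten by E5h, so show it as "?"
        name = "?" + entry[1:8].decode("ascii", "replace").rstrip()
        ext = entry[8:11].decode("ascii", "replace").rstrip()
        start_cluster, size = struct.unpack_from("<HI", entry, 26)
        found.append((name + ("." + ext if ext else ""), start_cluster, size))
    return found
```

An empty result on a directory that users insist has lost files is itself a clue, as it fits the cases above where no erased entries are left behind.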

Pattern: Snakebite

Some flaky hardware issues can cause odd bytes of data to get bent - perhaps one quantum out of a billion or so, which translates to steady bit-rot over months of use. Most of the time this will affect what is written to disk, but once in a while it may affect where a sector is written to disk (and therefore cause more obvious damage). As PCI, processor registers and RAM busses work with at least 32 bits at a time, the corruption will typically take the form of 4 consecutive bad bytes.

I've seen this happen with perfectly healthy hardware for no reason other than that the hard drive shell had no metal to metal grounding contact with the PC case; a problem I was able to reproduce. For this reason, I recommend grounding the chassis of hard drives to case when doing casual data transfers or using removable drive brackets.
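Where a known-good copy of a file exists (a backup, or the original on CD), the snakebite pattern can be looked for by comparing the two byte for byte. The Python sketch below is only an illustration under that assumption; both function names are hypothetical, and the 4-byte limit simply mirrors the 32-bit bus width mentioned above.

```python
def diff_runs(good: bytes, suspect: bytes):
    """Return (offset, length) for each run of consecutive differing bytes."""
    runs = []
    start = None
    n = min(len(good), len(suspect))
    for i in range(n):
        if good[i] != suspect[i]:
            if start is None:
                start = i          # a new run of differences begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, n - start))  # run extends to the end of the data
    return runs

def looks_like_snakebite(runs, width=4):
    """True if there is damage and every damaged run is at most `width` bytes."""
    return bool(runs) and all(length <= width for _, length in runs)
```

Runs shorter than 4 bytes are still consistent with the pattern, because a corrupted byte can happen to match the original by chance.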

Pattern: Sector freckling

This can cause havoc and make data recovery "interesting", and is usually caused by flaky hardware. It is possible for a failing hard drive to present this way, if internal hard drive defect management or ScanDisk surface scan have attempted to relocate data from a failing sector, but were unable to read the failing sector correctly.

You find random sectors of garbage in the middle of files, directories, within file system structures such as the FATs, or outside the bounds of the volume or partition. This implies deranged file operations or a deliberate (malware) attempt to attack the drive's contents.

The contents of the freckles may shed light on what is going on. Disk formatting processes or FDisk probe writes tend to fill sectors with ASCII "divide-by" characters or nulls, whereas malware may fill with characteristic text. Arbitrary contents such as pieces of files, FAT, directory etc. suggest flaky hardware is causing the right contents to be written to the wrong place, but this can also happen if malware doesn't set any particular content to be written.
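A first-pass triage of freckle contents can be automated before eyeballing them. This Python sketch assumes the "divide-by" character is byte F6h (its code in the DOS code page 437) and uses an arbitrary 90% threshold for "mostly text"; the function name and the category labels are made up for illustration.

```python
from collections import Counter

def classify_sector_fill(sector: bytes) -> str:
    """Give a rough label for what a suspect sector contains."""
    if not sector:
        return "empty"
    value, count = Counter(sector).most_common(1)[0]
    if count == len(sector):
        # Every byte in the sector is identical
        if value == 0x00:
            return "nulls"
        if value == 0xF6:
            return "format fill (F6h)"
        return f"uniform fill ({value:02X}h)"
    # Count printable ASCII plus tab, line feed and carriage return
    printable = sum(1 for b in sector
                    if 0x20 <= b < 0x7F or b in (0x09, 0x0A, 0x0D))
    if printable / len(sector) > 0.9:
        return "mostly text"
    return "mixed content"
```

Anything labelled "mostly text" or "mixed content" still needs a human look to tell a malware signature from a misplaced piece of file or directory.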

Where the File Allocation Tables are freckled, it is crucial to repair this intelligently! If you leave it to a "blind" utility such as ScanDisk, you may well end up with the garbage faithfully copied over the only remaining true data left.

Pattern: Byte creep

Once in a while I've seen a directory considered "too damaged to repair" by ScanDisk (and indeed, looking totally insane after a Dir command) that turned out to have been shifted by one byte - as if the first byte of the sector had been deleted somehow. I presume this was the result of some sort of hardware flakiness.
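One cheap way to test for this kind of shift in a subdirectory is to look for the ".." entry, which normally sits at offset 32 of the directory's first sector. The Python sketch below assumes a subdirectory (the root directory has no "." and ".." entries) and assumes the ".." entry carries the plain directory attribute 10h; the function name is hypothetical.

```python
# ".." padded with spaces to the 11-byte name field, then the directory attribute
DOTDOT = b"..         \x10"
EXPECTED_OFFSET = 32  # ".." is the second 32-byte entry in a subdirectory

def dotdot_displacement(first_sector: bytes):
    """Return how far the '..' entry sits from its expected offset.

    0 means aligned, -1 means everything has slid one byte toward the
    start of the sector, None means no '..' entry was found at all.
    """
    pos = first_sector.find(DOTDOT)
    if pos < 0:
        return None
    return pos - EXPECTED_OFFSET
```

A result of -1 would match the one-byte creep described above, and suggests trying a realigned copy of the sector before writing the directory off.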

Pattern: Stuck bit

Sometimes an insane value makes sense if you just flip one bit. Stuck bit errors are much easier to spot when looking at plain text. The cause is likely to be some sort of hardware issue that can affect one bit; typically the same bit is bent throughout the system.
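When you can guess what the text should have said, XOR makes a stuck bit stand out. This is a minimal Python sketch, assuming you have both the expected and the observed bytes to hand; the function name is hypothetical.

```python
def single_bit_flips(expected: bytes, observed: bytes):
    """Return (offset, bit_index, observed_bit) for bytes differing in exactly one bit."""
    flips = []
    for off, (e, o) in enumerate(zip(expected, observed)):
        x = e ^ o                       # set bits mark where the bytes differ
        if x and (x & (x - 1)) == 0:    # exactly one bit set
            bit = x.bit_length() - 1
            flips.append((off, bit, (o >> bit) & 1))
    return flips

# Bit 5 stuck high turns upper-case ASCII letters into lower case
flips = single_bit_flips(b"HELLO WORLD", b"hello world")
stuck_bits = {(bit, value) for _, bit, value in flips}
```

If every flip reports the same bit index and the same observed value, as in this example, that fits a single stuck line rather than random damage.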

Pattern: Quicksand

Sometimes things just ain't what they used to be, even as they were five minutes ago. You look at a raw disk sector, page up to somewhere else, page down back again and find it's changed. The first thing to do if this happens is to attempt to preserve the state (preferably multiple times) by doing a dumb image dump onto some other hard drive, using a known-good PC as the host. The cause is likely to be failing circuitry somewhere, as flaky disk surface is more likely to be error-detected as such.
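A quick way to confirm quicksand, rather than doubt your own eyes, is to read the same sector several times and compare digests. The Python sketch below assumes a caller-supplied read_sector(lba) function that returns the raw bytes of one sector; that function is hypothetical, and the sketch says nothing about how it reaches the disk. Any caching between the disk and that function will hide the problem, so it should bypass caches if it can.

```python
import hashlib

def read_is_stable(read_sector, lba, attempts=5):
    """True if repeated reads of the same sector all return identical data.

    read_sector is a caller-supplied callable (an assumption of this
    sketch) taking a sector number and returning that sector's bytes.
    """
    digests = {hashlib.sha256(read_sector(lba)).hexdigest()
               for _ in range(attempts)}
    return len(digests) == 1
```

A False result is the cue to stop exploring and start imaging from a known-good host, as described above.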

As you can see, the most "interesting" corruption patterns are often consequent to flaky hardware other than the hard drive itself - a good reason to avoid overclocking and dodgy hardware, and to carry (and use) multiple RAM diagnostic utilities.

The other lesson out of all this is to preserve forensics as much as possible, and eyeball error loci via a raw disk editor whenever you can. Once you get a feel for the pattern(s) you are dealing with, you know how to guess more successfully, and can be forewarned as to whether you should take special precautions. For example, a "quicksand" means "get off this hardware and motor-drive some snapshots now!", while "freckles" means "don't trust the system not to stray beyond the bounds of this volume or partition", etc.

 

(C) Chris Quirke, all rights reserved - November 2002
