Data integrity

Transferring data every day, one assumes that it is transferred and stored correctly and that, if nothing goes wrong, it stays in place for years. Here are some statistics and occurrences that shook my confidence.

Transfer errors

Who says that the data leaving one end of a transmission is necessarily the data stored at the other end?

Mainboard to disk transfer errors

Using FreeNAS I had a setup consisting of 3 HDDs in a RAID5 array as well as a trayless enclosure to swap backup HDDs. While creating a backup I was monitoring the SMART output of the backup disk and noticed an exorbitant number of transfer errors between the operating system and the HDD. The number was somewhere between 10^6 and 10^7 corrected bits. Due to the nature of the hard disk enclosure there is an extra piece between the HDD's SATA connector and the SATA cable coming off the mainboard. My assumption is that this extra piece causes the transmission errors.
This means that data coming from the operating system suffers corruption on its way to the hard disk. Thanks to the transmission protocol and its error correction the corruption is repaired. However, it left me wondering how likely it was that, under unlucky circumstances, one error or another would slip through unnoticed.
So to check whether I was seeing unnoticed file corruption I generated MD5 hashes for several folders on the source and compared them to the hashes on the target. For the 1000 files I checked there was no difference, but it still left me uneasy.
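
To get a feeling for the odds, here is a back-of-the-envelope sketch. It rests on assumptions that the SMART output itself does not confirm: that every counted error is one corrupted frame caught by a 32-bit CRC (which is what the SATA link layer uses to protect its frames), and that the corruption is random, so a damaged frame passes the check with a chance of about 1 in 2^32. The function name is made up for illustration.

```shell
# Rough estimate only. Assumptions (not taken from the SMART data):
#   - each counted error is one corrupted frame caught by a 32-bit CRC
#   - corruption is random, so a bad frame passes the CRC with chance 1/2^32
undetected_estimate() {
    # $1 = number of corrupted frames counted so far
    awk -v n="$1" 'BEGIN { printf "%.4f\n", n / 2^32 }'
}

# Expected number of errors that slipped through for ten million caught ones
undetected_estimate 10000000
```

Under these assumptions, ten million caught errors correspond to roughly 0.002 expected undetected ones. That is small, but not zero, and it grows with every further backup run over the same bad connection.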

Attached is the code to perform the MD5 comparison:
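# /mnt/pool :: this is where the original data resides
# /mnt/backup :: here is the backup copy
# /mnt/temp :: scratch directory for the hash lists (must exist)

# code starts here
# Hash from inside each tree so the lists contain relative paths;
# with absolute paths every line would differ between pool and backup.
(cd /mnt/pool && find . -type f -exec md5 "{}" \;) > /mnt/temp/source.md5
(cd /mnt/backup && find . -type f -exec md5 "{}" \;) > /mnt/temp/backup.md5
sort -d /mnt/temp/source.md5 > /mnt/temp/source.sort.md5
sort -d /mnt/temp/backup.md5 > /mnt/temp/backup.sort.md5
# An empty md5result.txt means all hashes match.
diff -u /mnt/temp/backup.sort.md5 /mnt/temp/source.sort.md5 > /mnt/temp/md5result.txt
# code ends here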
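
For repeated checks, the same idea can be wrapped in a small function. The following is only a sketch: the name compare_trees is made up, and it swaps MD5 for the POSIX cksum tool (a CRC, good enough to spot accidental corruption but not deliberate tampering) so that it also runs on systems without the md5 command. File names containing newlines are not handled.

```shell
# Hypothetical helper, not part of the original script.
# Prints every checksum entry that differs or exists on only one side.
# Returns 0 if both trees match, 1 otherwise.
compare_trees() {
    src=$1
    dst=$2
    tmp=$(mktemp -d) || return 1
    # cksum prints "<crc> <size> <path>"; relative paths keep the lists comparable
    (cd "$src" && find . -type f -exec cksum {} + | sort -k 3) > "$tmp/src.sum"
    (cd "$dst" && find . -type f -exec cksum {} + | sort -k 3) > "$tmp/dst.sum"
    # diff marks lines found in only one list with "<" (source) or ">" (backup)
    diff "$tmp/src.sum" "$tmp/dst.sum" | grep '^[<>]'
    status=$?
    rm -rf "$tmp"
    # grep exits 0 when it printed differences, so invert that for "trees match"
    [ "$status" -ne 0 ]
}
```

Called as `compare_trees /mnt/pool /mnt/backup`, it prints nothing when the two trees agree, so it can be used directly in an `if` or followed by `&& echo "backup verified"`.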

PC to PC networking errors

to be continued

Administrative handling error

to be continued

Physical handling error

Just how quickly can you destroy 1 TB of data? Pretty quickly if you're not careful. This is the story of the second 1 TB hard disk failure I've had in the last couple of years. As I was swapping hard disks from one PC to another I noticed that the cage in which the HDD was mounted did not have the rails that usually prevent the HDD from moving vertically. I thought I'd have to be careful when removing the last screw, lest the HDD fall to the bottom of the cage. Of course, by the time I removed the last screw I had forgotten about this again (it was late and I wanted to get done quickly).
With a PLOOK the HDD fell the 8-10 cm to the bottom of the cage, which was lying on the hard parquet floor.
I feared this would be the end of the HDD, and that fear was confirmed when I powered up the device. It spun up with the correct sound, but instantly there was this smell coming out of the hard disk... the smell of terminated data. My assumption is that the fall released some particles that got in between the read heads and the media, which led to the end of the disk.