The following is a guest post by Barry Wheeler, Digital Projects Coordinator, Office of Strategic Initiatives.
With the large size and number of files in my personal digital archives, my archiving problem may be a bit extreme, but I think a description of my archiving system may be helpful to many people who want to save and preserve their digital files. I am a photographer and I have tens of thousands of pictures, including both professional and personal work, to archive each year. Add in my routine papers, spreadsheets, and data files and I often have over 500 GB of digital files to preserve at the end of each year.
I do maintain backups – but in my system, backups cannot be considered part of my archive. Backups preserve current work. Over the years, as drives fill up or I change computers, old files are removed from the primary drive on my main computer. I want to preserve all these files that are no longer part of my backups. My backups are also compressed and maintained by proprietary software. I’d prefer to have my archive in common standard formats that will be readable in the future – or at least will be easy to convert to an accessible format. Therefore I have developed a system of yearly archiving. Again, this system is mainly for those with above-average amounts of digital material. If you have less material, do not be discouraged! At the end of this post, I’ve provided a simpler version for those with less material.
I prepared for my archiving task well in advance. All my content files are stored in directories by broad general topic, then in subdirectories named by date and specific subject. (All of my data subdirectories begin with the date in YYYYMMDD format so they sort automatically most recent to oldest.) Once the files are organized, I follow these six steps in my end-of-the-year archive processing:
First, I purchase a new external hard disk drive each year. I name the new drive – physically with a tape labeler and electronically in my drive properties tool. Thus, this year’s drive is ARCHIVE11. I also name a top-level directory “archive11”.
Second, I copy all documents for the past year to the new drive into the “archive11” directory. This should take about 33% of the drive space.
Third, I set up a powered USB hub connected to my primary computer and connect each of my yearly archive drives to the hub – each drive should appear on my computer desktop as ARCHIVEXX. (By now, I have 6 USB drives! A forest of drives and a tangle of cables as Figure 1 above shows.)
Fourth, I check the available space on last year’s drive. If I used approximately 33% of the space last year I should have enough space, so I create another top-level directory named “archive11”. Again, I copy all documents for the past year into this “archive11” directory. I now have two copies of my past year’s documents, each on a separate drive. As part of my archive plan, I also keep a copy of my very best images on a remote, “cloud” site, but that’s another blog. (Figure 2 shows my computer desktop, directory listing, and file info panel.)
Fifth, I use a disk utility to check (and repair if necessary) the integrity of each external drive. Then I verify, and repair if needed, the file permissions on each external archive drive. I also randomly select and open a number of files from each drive and each archive directory.
Finally, I import both the new archive directories into my cataloging database (again, perhaps a suitable subject for another blog) and then power down and store each drive with its power supply until next year. At this point I can delete all archived files I do not intend to work with immediately from my working drive. (Figure 3 shows my drives ready for storage.) My yearly archive processing is now done!
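For readers who would rather script the copy steps, here is a minimal Python sketch of steps two, four and five. It is not the process described above, which is done by hand with a disk utility and spot checks. The function names are hypothetical, the space check simply compares bytes needed against bytes free, and the SHA-256 comparison of the two copies is an added technique that complements, rather than replaces, opening files by eye. The demo runs on temporary directories standing in for the ARCHIVE drives.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_size(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def copy_year(source: Path, drive: Path, folder: str) -> Path:
    """Copy source to drive/folder, refusing if the drive lacks room."""
    needed = tree_size(source)
    free = shutil.disk_usage(drive).free
    if needed > free:
        raise OSError(f"{drive}: {free} bytes free, {needed} needed")
    # copytree raises FileExistsError if the target already exists,
    # so an earlier archive directory is never overwritten.
    return Path(shutil.copytree(source, drive / folder))


def digest(path: Path) -> str:
    """SHA-256 of one file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def manifest(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to its digest."""
    return {
        p.relative_to(root).as_posix(): digest(p)
        for p in root.rglob("*")
        if p.is_file()
    }


def mismatches(a: Path, b: Path) -> list[str]:
    """Relative paths that are missing from one copy or differ in content."""
    ma, mb = manifest(a), manifest(b)
    return sorted(k for k in ma.keys() | mb.keys() if ma.get(k) != mb.get(k))


# Demo on temporary directories standing in for the real drives.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp, "work")
    new_drive = Path(tmp, "ARCHIVE11")
    old_drive = Path(tmp, "ARCHIVE10")
    for d in (work, new_drive, old_drive):
        d.mkdir()
    (work / "20110608 birthday").mkdir()
    (work / "20110608 birthday" / "img001.txt").write_text("pixels")

    first = copy_year(work, new_drive, "archive11")
    second = copy_year(work, old_drive, "archive11")
    print(mismatches(first, second))  # an empty list means the copies agree
```

A checksum comparison only shows that the two copies match each other. It does not replace the drive-level integrity check or the habit of actually opening a sample of files.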
As promised, this process can be simplified and used by anyone who wants to archive their most important digital documents. The basic process a user can follow is:
- Gather all your important data files into one master directory.
- Arrange them by year – especially if you archive tax files.
- Make copies of the files. If the total directory size fits on a CD (i.e., less than 600 MB), then make separate copies on two archival gold CDs. (The life expectancy of an inexpensive or standard CD is uncertain so I think the archival gold CDs are worth the extra expense.)
- Continue copying each year, on two new gold CDs.
- Check all CDs yearly. If your document collection is too big for CDs, use external hard disk drives – and be even more vigilant in checking the drives!
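The size test in the list above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating "600 MB" as 600 million bytes is an assumption chosen to stay on the cautious side.

```python
from pathlib import Path

# Assumption: the "less than 600 MB" rule of thumb, read as decimal megabytes.
CD_LIMIT_BYTES = 600 * 1000 * 1000


def directory_size(root: Path) -> int:
    """Total size in bytes of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def choose_media(total_bytes: int, limit: int = CD_LIMIT_BYTES) -> str:
    """Suggest a storage medium for one year's master directory."""
    if total_bytes < limit:
        return "two archival gold CDs"
    return "external hard disk drives"


print(choose_media(250 * 1000 * 1000))   # a small collection of documents
print(choose_media(500 * 1000 ** 3))     # roughly 500 GB of photographs
```

In practice you would pass `directory_size(Path("..."))` for your own master directory instead of the fixed numbers used here.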
(See Barry Wheeler’s previous post for The Signal, on photo sharing sites.)
Barry, thanks for posting this article. It is very helpful to learn from professional photographers how archiving can be done. Just a few comments here. 1) Thank you very much for pointing out that backups and archives are different things! That is critical for many photographers to understand. Your methodology clearly shows how different they are and how each should be done correctly. 2) It is also very important to know how you complete the catalog. I think of that as a post-archive process that updates your archive with the status after the annual archive is completed. I wish you would share that process someday to benefit the rest of us. So I vote yes in advance to both your backup and catalog blog posts. 3) I also noticed that you are using single external disk drives as the archive storage. Did you ever consider adding redundancy on the archive device side to reduce the risk of hard disk drive failure for that archive? 4) You mentioned that you are using third-party software to do the archiving. Why not use the Time Machine that comes with Mac OS for this purpose? (I noticed from your images that you are using Mac OS.)
5) Since Mac OS is by nature a Unix operating system, have you tried using command-line tools such as “tar” or “rsync” to do the archiving in a more flexible way? Also, did you know that you can add these commands to a “cron” job so that they run at a scheduled time and complete the job automatically? If you have, what are your comments on that versus the third-party software? 6) If you have to work on archived images, what is your re-archive policy or naming strategy for those images? Where do they stay: on the current local hard disk, or do you need to deposit them back into their original archive and rename them with updated dates? What are your thoughts on the pros and cons? 7) If you archive 500 GB yearly, your local hard disk drive must be much larger, like 1 TB or 2 TB. How do you prepare for a disaster on your local computer, and how do you configure your local storage to access a large quantity of images? Are you already using the new technology from Apple called Thunderbolt? Any comments? I just realized that the trend in digital photography is turning the photographer into a computer engineer and a system administrator in some ways, so I hope you won’t mind that I have drifted so far from your archive topic. Thanks! –Liyun
Hi Liyun –
Thank you for your comments and questions. You’ve actually helped me clarify my own thinking and document my process more accurately! Now that I’ve thought through the process more clearly I’ll try to answer each of your questions.
I may try to explain the difference between backups and archives more thoroughly in a future blog post – I think the topic may be helpful to many people. And I’ll definitely do an article on cataloging my archives.
I try to have a reasonable redundancy built into my process. I try to purchase a new drive each year that is large enough so that the current year’s archive takes only one third of the drive space. That way the second copy of next year’s archive will fit (hopefully) on this drive. By purchasing one drive a year and following this process I always have two copies on two separate drives. Hopefully both drives won’t fail in the same year. Because I check each drive thoroughly each year, if one drive fails I can purchase a replacement drive and use the remaining copy to restore my two-drive system.
I only use a 3rd party application (Time Machine) for backup. I do not use 3rd party software for archiving. I do manual copying for my archiving to maintain everything exactly as it is on my primary working drive. That means I can carry my archive disk to any machine (Windows, Mac, Unix), anywhere, and expect that I can read and work with my archival files.
I could automate the process – and did when I was responsible for archiving materials as one of my professional tasks. But the scale was far beyond what I have to do for my personal files. As long as the work load is reasonable, I have a much higher comfort level doing everything myself. Besides, I couldn’t write a good blog post if I had to explain my Unix commands!
Anytime I retrieve materials from my archive, I create a new directory on my primary drive and copy the needed files into the directory. I never change my archive content. Then the new directory and all its files are archived in the next year’s processing. The duplication hasn’t been a problem so far. My cataloging ties the files together.
Finally, I monitor the space remaining on my primary working drive. If my working disk gets too full, I simply archive a set of files earlier than scheduled. At the end of the year, my directory naming scheme makes it easy to see what I’ve already archived.
I hope my comments help. If you want to continue this conversation feel free to email me directly at [email protected]
I also use a yyyymmdd format to name all files. However, I additionally name all files with descriptive text so that I can quickly scan by eye for the subject. For example: 20110608 Joe William Cameron – 12th Birthday – Sarasota Florida. This enables me to do a quick name or date search for all of Cameron.
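The name-or-date search that this naming scheme allows can be sketched in Python as follows. The function name is hypothetical, and apart from the birthday example quoted above, the file names in the list are invented for illustration.

```python
def find(names: list[str], date_prefix: str = "", text: str = "") -> list[str]:
    """Return names that start with date_prefix and contain text (case-insensitive)."""
    needle = text.lower()
    return [n for n in names if n.startswith(date_prefix) and needle in n.lower()]


names = [
    "20110608 Joe William Cameron – 12th Birthday – Sarasota Florida",
    "20110704 Fireworks – Sarasota Florida",    # invented example
    "20100912 Joe William Cameron – Soccer",    # invented example
]

print(find(names, text="cameron"))        # every file about Cameron
print(find(names, date_prefix="2011"))    # everything from 2011
print(find(names, date_prefix="201106"))  # everything from June 2011
```

Because the date comes first and is zero-padded, a prefix such as `2011` or `201106` narrows the search to a year or a month without any date parsing.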
Thank you for the emphasis on backward and forward redundancy. Keeping multiple off-site copies of files can be a valuable, anxiety-relieving effort.