I was talking to one of my archivist colleagues about a collection he was processing and the challenges he was having identifying file types based on their extensions. The collection does go back several decades, but some of the file extensions were unrecognizable.
This was when I confessed: for a period of my life, I ignored file extensions, sometimes changing them to suit my own whims.
There were a few reasons for this. The very first computer that I owned was an original 1984 Macintosh. Since file extensions were not visible in the file browser, and my applications magically opened associated files, I didn’t really even know they existed. In my first work environment where we had DOS IBM PCs (prior to that I used terminals and mainframes), all my files were on a handful of single-sided 5.25″ floppy disks. I would insert the floppy into the drive and open the file from within the application, since that was the easiest method. I had no awareness of the registry or the existence of file format-application mapping at that time.
I had the misguided notion that it would be easiest to manage my files if I knew what type of content it was, not what type of file it was: .LET for letter, .MAN or .GDE for documentation, .ENV for envelope and so on. In some cases, the weird file extensions were created when I moved a Mac file with a long file name over to a DOS PC with its 8.3 character file naming restrictions. That’s how I ended up with files with names like “PHOTOLA.BEL” and “LJMCNEMA.ILI”.
But for the life of me, I cannot fathom what I meant by some of these file extensions 30 years later. What could I have meant by “.OB”? I can guess what my work process was when I created “.WK1” and “.CHG” files. I created them, and I don’t know what I meant. How would an archivist fare?
This definitely came back to haunt me when I needed to access and/or migrate both my Mac and PC files later. I kept my original files with their original names, migrated off their original media, including the original 1984 Mac floppy disk that came with my Mac. And yes, I do still also have the original media. I ran all the files through a commercial file conversion tool, making copies and converting to more recent versions with much (but not complete) success. In some cases the files _without_ any extension fared the best, because operating systems and conversion tools weren’t misled by the extension, getting their information from the file itself. The files with the crazy extensions (a mix of mostly WordPerfect, MS Word, Aldus PageMaker and Adobe Illustrator files) were a mixed bag. In some cases when the conversion failed, I made educated guesses and changed the file extensions on the copied files and was at least able to read the content, even if it was not formatted 100% correctly. The biggest failure at the time? MacWrite 1.0 files, but that was a tool issue, not a file extension issue.
I did this circa 2003 before our community practices and tools had evolved to where they are now. I want to run this experiment again to see how successful it is. And I want to caution all content creators to carefully watch your use of file extensions, because you never know what the legacy of your files might be.
Comments (5)
I’ll quibble with your title. The decision to use a non-standard file extension was often influenced by the fact that in days gone by, people were limited to 8.3. It was reasonable to use the extension as a second level hierarchy. Book.1, Book.2, Book.3 for the chapters of a book.
Moreover, in early days, not all programs had requirements for specific extensions. If memory serves (always a sketchy proposition), extensions became important with GUIs. With a CLI, you launched the program and specified the file as an argument (or opened it internally within the program). It wasn’t until you could double-click a file name that the OS needed to know which program to use — and inferred that from the extension. (As opposed to information in the file header, a la Unix and later Mac.)
You can absolutely quibble, Richard. Yes, in some cases I am sure I came up with extensions because of the limits of the 8.3 file naming. But as I did this and I have no idea what I meant by some of my extensions (trust me, there are many more than I mentioned), and I have to deal with it, I feel justified in calling what I did silly. And absolutely, working from the command line did influence that as well. I made mention of that in an earlier draft of this post that I edited out; I should have kept it.
I’m not sure I agree with you, Leslie.
There are a lot of programs today that write files to disk with the same extension but very different internal structures. This of course was even worse back when most programs didn’t use file extensions at all, or didn’t use them consistently. Because of this, it is very rarely a good idea to rely on the file extension for any sort of identification purpose. And because of that, people should not worry too much about what their file extensions are. Digital preservation practitioners are rarely going to be able to trust them anyway. Instead they will have to use pattern matching techniques to identify the types of files they have based on a number of factors, including the (hopefully) unique internal patterns (or signatures) that exist within the files.
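To make the idea of internal signatures concrete, here is a minimal sketch in Python that ignores the file name entirely and looks only at the leading bytes. The signature table is a small illustrative sample of well-known magic numbers, not a real registry, and the function names `identify_bytes` and `identify_file` are made up for this example. Real identification tools match far richer patterns than this.

```python
from typing import Optional

# Illustrative sample only: (offset, magic bytes, label).
# A real signature registry is far larger and more precise.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (0, b"PK\x03\x04", "ZIP container"),
]


def identify_bytes(data: bytes) -> Optional[str]:
    """Return a label for the first signature found in data, else None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str) -> Optional[str]:
    """Identify a file from its first bytes, ignoring its extension."""
    with open(path, "rb") as fh:
        return identify_bytes(fh.read(64))
```

A file named “REPORT.LET” whose content starts with `%PDF-` would come back as a PDF document under this sketch, while a file with no recognizable leading bytes comes back as `None` and needs some other evidence.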
Hi from Wellington NZ,
This is a really pertinent conversation for us, Leslie, here at the National Library of New Zealand. For a long time we would not and did not change file extensions. It went against concepts of the “original” object. Recently we’ve revisited that decision and decided that we are happy to change things such as file extensions before the object is taken into the preservation system. We class this as “preconditioning”. We have a policy that governs such change. Essentially it says that you can make changes to files (and not keep the “original”) as long as those changes do not affect the intellectual message being conveyed, that the change is reversible, and that there is adequate provenance information recorded. [As an aside, if we do keep the original, then this would be classed as a preservation action and done within the system, therefore the original is ingested first, then acted upon.]
In the specific case of file extensions, we now change crazy file extensions (and we get a fair number of crazy ones) into ones that the format standard says are permitted.
As for ID’ing by file extension: it is less ideal than doing it by internal signature, but sometimes it is the only way to do it. For the content we receive, we do use ID by file extension where there is no DROID signature. We do this automatically where we know the producer, and manually when we do not (so we can have a better degree of confidence about what we’re identifying).
Cheers,
Pete
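The policy Pete describes (a reversible change with provenance recorded) could be sketched roughly as below. This is only an illustration in Python, not the National Library of New Zealand’s actual tooling: `RenameRecord`, `normalize_extension`, and `revert` are hypothetical names, and the file name and target extension in the usage note are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import PurePath


@dataclass(frozen=True)
class RenameRecord:
    """Provenance for one extension change (hypothetical structure)."""
    original_name: str
    new_name: str
    reason: str
    recorded_at: str  # ISO 8601 timestamp, UTC


def normalize_extension(name: str, permitted_ext: str, reason: str) -> RenameRecord:
    """Plan a rename to a permitted extension and record what changed.

    Only the name is computed here; nothing on disk is touched.
    permitted_ext must start with a dot, e.g. ".doc".
    """
    new_name = str(PurePath(name).with_suffix(permitted_ext))
    stamp = datetime.now(timezone.utc).isoformat()
    return RenameRecord(name, new_name, reason, stamp)


def revert(record: RenameRecord) -> str:
    """The change is reversible because the original name is kept."""
    return record.original_name
```

For example, `normalize_extension("LETTER.LET", ".doc", "format identified by signature")` would yield a record with `new_name` set to `"LETTER.doc"`, and `revert` on that record returns `"LETTER.LET"`.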
The three-letter file extension is a leftover from the days of the CP/M operating system, and should have been abolished thirty years ago. As should the idea that each file format should be linked to some particular program.
As you say, the Mac did not use them; nor did the Amiga at the same period. Only Microsoft clung to this obsolete format. (And Mac OS X unfortunately followed suit.)
Almost all files contain headers from which the format can be identified. The Amiga OS had a library called “datatypes” which identified files from their headers.