I first encountered Jason Scott in mid- to late 2010 through a colleague who informed me that if I did not know who he was, I had better learn. Since then I have become a big fan of his passion for digital archiving and his drive to save collections and content that few organizations have considered part of their collecting scope, let alone something that required preservation. In 2011 Jason became affiliated with the Internet Archive, where he has been gathering a huge array of content – open source software, shareware, and conference videos, but also the output of entire communities that was at risk of disappearing completely with little notice.
I recently had the opportunity to ask Jason some questions about his work.
You describe yourself as a “rogue archivist,” with a sense of outrage over the potential loss of history in digital form. You work with individuals and organizations internationally to preserve born-digital content and at-risk digital materials. What drove you to this madness?
I’d always been a collector of data, from the time I was very young – when I called dial-up bulletin boards in the 1980s, I would print out the messages or run to the file sections to download textfiles and binary files, saving them on stacks of floppies, and then keeping them around. When, in my late 20s, I realized the internet didn’t have any sites I could find dedicated to these old files, I put them up myself – the resulting waves of fan mail and appreciation told me there was something to this whole “saving data for later” thing. Over time, my mission expanded from saving dial-up BBS history (which I still do frequently, contacting and being contacted by people with items) to saving most any data related to underappreciated or non-contemporary online cultures, and it has continued in that fashion to this day. I find the work very rewarding, and the effort is so minimal (copy-paste, arranging scripts) for such a large benefit that I see no downsides at all.
Tell me about textfiles, and how it came to be.
As I said above, I was an avid collector of files from BBSes, but for some reason, the textfiles from these places were particularly compelling. On the BBSes, there would often be a section called “G-Files”, short for “[G]eneral files”, which was supposed to hold what we now think of as the help section or FAQs, but which a lot of BBSes would use for hot info, how-tos, rants or humor, and I just found the brilliance of them really fascinating – when I called a new place, I’d rush to that section and download all of them. Over time, I had amassed thousands of these files, and when I finally put up that website of old BBS material, I named it TEXTFILES.COM because that described the vast majority of them – files of text. Many were passed around and gained little additions and subtractions, so I kept all versions up, and it’s been interesting in the decade-plus since to see how many people have emotional connections and memories of specific versions, or have found meaning in the little add-ons around the files. Over this decade-plus, I’ve been encountering files that were more than just “text files”, and so the mission expanded greatly – I now have shareware CDs, ANSI art files, sounds and audio, even PDFs and video. I consider them all equally important, just expressing the meaning and experiences of their time in a different way.
What is the Archive Team?
The Archive Team was an idea I dreamed up in 2009 when I realized how many free hosting sites were beginning to shut down, taking all the data with them. The inherent drive was the realization that saving this data from complete, corporate-motivated deletion was trivial – as long as people were trying to do so. If a team of archivists could rush in and duplicate materials, even a portion, then the history and important information and teachings of this user-generated data would be, in some small way, saved. This small idea caught fire, and three years later Archive Team is a force to be reckoned with, a rollicking set of maniacs who are sure to get in your face if your site announces a shutdown with scant warning or a lack of export tools. We run a wiki at http://www.archiveteam.org, and are working on dozens of simultaneous projects, ranging from individual site mirroring to building and promoting tools that help ensure data preservation becomes a strong option, and not just an afterthought.
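At its core, the site mirroring Archive Team describes means fetching pages, saving them, and following their links before a site disappears. The sketch below is only an illustration of that idea in Python, not Archive Team’s actual tooling (at scale they rely on crawlers such as wget and derivatives); the URL and HTML here are invented for the example.

```python
# Illustrative core of a site mirror: extract a page's links so a crawler
# can queue them for download. (A real mirror would also fetch each page,
# e.g. with urllib.request, save the bytes, and restrict links to one host.)
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute href targets found in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content, for illustration only
parser = LinkExtractor("http://example.com/dir/")
parser.feed('<a href="page2.html">next</a> <a href="/top.html">top</a>')
print(parser.links)
# → ['http://example.com/dir/page2.html', 'http://example.com/top.html']
```

The design point is the loop this enables: download a page, extract its links, queue the unseen ones, repeat – racing the shutdown clock.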
What should people do when faced with impending data loss? Or do to avoid a data-loss crisis?
I think the number one turn-of-mind that’s required right now is for people to realize they even have data to lose. The transparent nature of working with computers in the modern era is to think of what you put into your computer as being “in there, somewhere”. Not enough effort has been put into the engineering of data loss prevention – nowhere near the amount put into UI or making flashy interfaces that beckon you. We still, for example, tend to “save a file”, which really means “put your data on a spinning piece of metal sprayed with a coating of magnetic atoms that another moving metal head will check now and again”.

While waiting for the industry to catch up and really improve this, the best end-users can do for their own sanity is to think of the documents and writings that are their own, be it e-mails, essays, photos or movies, and get those onto at least a few USB sticks (cheap and easy to get) and put them somewhere away from the computer. Once that idea is in your head, you can start looking at everywhere else you put yourself, be it a website you run, a weblog you write for, a flickr or a livejournal or what have you, and ask yourself “if this was gone tomorrow, how would that affect me?” If the answer is “not that much, frankly”, great. But for a lot of people it is probably “this would really, really bum me out”, and if so, it’s worth setting aside a day to just pull things out and put them on the USB sticks, so you can at least get things back if you need to.

The fact is, sites go down all the time, and sometimes with very ineffective warnings. While we don’t sit around worrying (or having to worry) that our data packets will get to the websites or computers we’re connecting to, we DO have to worry about our data – that which makes our presence unique and known to others online – and its safety. That’s the single biggest hurdle, I think – the rest is just procedure and fire drill.
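The “pull things out and put them on the USB sticks” drill above amounts to a few lines of code. Here is a minimal sketch in Python; the `back_up` helper is hypothetical, and the destination path stands in for wherever your USB stick actually mounts.

```python
# Minimal sketch of the backup drill: copy personal files to a removable
# drive. The destination path is an assumption -- substitute your own.
import shutil
from pathlib import Path

def back_up(sources, dest):
    """Copy each existing source file into dest, keeping its file name."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if src.is_file():
            target = dest / src.name
            shutil.copy2(src, target)  # copy2 preserves timestamps
            copied.append(target)
    return copied

# Example (hypothetical paths):
#   back_up([Path.home() / "essays" / "essay.txt"], "/media/usb-stick/backup")
```

Nothing here is clever, which is the point: the barrier is deciding to do it, not the procedure itself.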
How can organizations like the Library of Congress work with efforts like yours?
The LoC is slow, methodical, and grinds fine. That works great for a lot of things in this world, but the firefly-like existence of the web is not one of them. That said, it’s obvious that the Library of Congress represents the Carnegie Hall for data – once you make it there, you’re in the big time. Over time, I expect the LoC to reach out its steady, intense hand toward this dataset or another, and withdraw a copy into its bowels for the permanent future. I’m looking forward to a lot of what I and many others are doing getting there at some point.