The following is a guest post from Christie Moffatt, an archivist in the History of Medicine Division and Program Manager of the Digital Manuscripts Program at the National Library of Medicine, and Jennifer Marill, Chief of the Technical Services Division for NLM.
The National Library of Medicine has a mandate to collect, preserve and make accessible the scholarly biomedical literature as well as resources that illustrate a diversity of philosophical and cultural perspectives not found in the technical literature. New forms of publication on the web, such as blogs authored by doctors and patients, illuminate health care thought and practice in the 21st century. In June 2011 the NLM Web Collecting and Archiving Working Group engaged in a pilot project to understand better the processes and challenges of collecting born-digital web content and to expand the Library’s collecting strategy for digital formats.
The NLM working group gained a practical understanding of web archiving workflows and began a health and medicine blog collection. So far, the authors of these blogs are physicians, nurses, hospital administrators and other health professionals in different stages of their careers. NLM also collected blogs of patients who are chronicling their experiences with conditions such as cancer, diabetes and multiple sclerosis.
NLM has collected the following examples:
E-patient Dave is the blog of Dave deBronkart, a cancer patient and blogger who has become a noted activist for healthcare transformation through participatory medicine and personal health data rights. Mr. deBronkart writes in this post as a newly diagnosed skin cancer patient who is taking action to make his treatment as cost-effective as possible.
Life as a Health Care CIO is the blog of Dr. John Halamka, Chief Information Officer of Beth Israel Deaconess Medical Center in Boston, Massachusetts. He is also a professor at Harvard Medical School and a practicing emergency physician. In this captured blog post Dr. Halamka reflects on his work in Japan on the implementation of health care IT to support earthquake/tsunami response.
Wheelchair Kamikaze is written by Marc, who has multiple sclerosis. He drives his wheelchair at full speed and takes videos and still photos using a camera that he has attached to his chair. He posts these images on his blog and writes about his experience living with MS. This is a screenshot of his first blog post on February 27, 2009.
During the pilot, NLM crawled selected blogs monthly over the course of a year using the Internet Archive’s Archive-It service. NLM staff conducted monthly quality control reviews of the archived pages and made adjustments to the crawling instructions to better capture the look, feel and functionality of the content. Throughout this effort the working group explored issues of selection, quality control, metadata, copyright and the workflows needed to develop web-based collections.
Through a learn-by-doing approach, the group found that it was able to capture selected blogs fairly well despite known limitations to web archiving. One significant challenge was dealing with frequent links to outside, “out of scope” sources, which raised questions about what it means to completely capture a blog and the extent to which linked content should be preserved. Other challenges included capturing some types of video files and content protected by passwords or blocked by robots.txt. NLM learned that capturing web content remains a moving target and that with each new post, and certainly with overall structural changes to a blog, problems can quickly arise. The group’s experience confirmed the value of early and thoughtful attention to scope, crawling frequency and crawling duration, as well as the importance of thorough quality review. Test crawls were very helpful for identifying and addressing problems in advance.
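To make the scope and robots.txt questions concrete, here is a minimal sketch of how a link found on a blog might be sorted into "capture", "out of scope" or "blocked by robots.txt". It is an illustration only, not how Archive-It or NLM's crawls actually make these decisions. The function names, the `example-crawler` user agent, the `blog.example.org` host and the rule that only the seed blog's own host counts as in scope are all assumptions made for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def in_scope(url: str, seed_host: str) -> bool:
    """Assumed scope rule: a link is in scope only if it stays on the
    seed blog's host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == seed_host or host.endswith("." + seed_host)


def allowed_by_robots(url: str, robots_txt: str,
                      agent: str = "example-crawler") -> bool:
    """Check a URL against robots.txt text that has already been fetched.
    The user-agent name is hypothetical."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


def classify(url: str, seed_host: str, robots_txt: str) -> str:
    """Sort a discovered link into one of three review categories."""
    if not in_scope(url, seed_host):
        return "out of scope"
    if not allowed_by_robots(url, robots_txt):
        return "blocked by robots.txt"
    return "capture"


# Hypothetical robots.txt content and links, for illustration only.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
for link in (
    "https://blog.example.org/2011/06/post.html",
    "https://blog.example.org/private/draft.html",
    "https://other.example.com/article",
):
    print(link, "->", classify(link, "blog.example.org", ROBOTS))
```

A real crawl would need a more nuanced scope rule, since the group's experience suggests that some linked outside content may be worth preserving. A list like this is mainly useful as input to the kind of quality review and test crawls described above.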
Some of the biggest challenges were non-technical. They included determining collection scope (the “big picture” question of which blogs should be captured), handling permissions (taking into account that these blogs can be quite personal) and monitoring when blogs end, change focus, or move to a new URL. NLM staff learned the importance of both curatorial and technical expertise and the need to keep up with new tools to get a better handle on working with this content. Perhaps most significantly, NLM gained first-hand appreciation for the importance of acting now, despite the imperfect methods of collection.
The Working Group has recommended that NLM expand its traditional collecting capabilities to include born-digital web information and participate in collaborative efforts to capture at-risk web content. As NLM moves forward, other areas of interest include:
- Capturing web-only grey literature, especially content from small “at-risk” organizational web sites that do not already have affiliations with repositories or that lack the resources to archive their web content themselves
- Developing thematic collections, such as the intersection of medicine and art on the web
- Collecting around events (for example, both official and non-official responses to epidemics, or to the aftermath of a disaster)
- Collecting web content that complements traditional manuscript collecting (laboratory web sites, online laboratory notebooks)