The Library of Congress Web Archiving Team Goes Agile

Today’s guest post is from Grace Thomas, Senior Digital Collections Specialist on the Library of Congress Web Archiving Team. You can read more about the Web Archiving Team right here on the Signal.

In the web archiving community, we build the plane and fly it simultaneously. While this pattern is present in most disciplines, web archiving is unique because web archives are massive in scale (Petabytes (PB) of data), but diligently sustained by a small, geographically disparate community. In the twenty years of the Library of Congress Web Archiving Program’s existence, the archive has grown from zero to 2.3 PB, while the size of the team has remained virtually the same: approximately five full-time staff members.

Over the past twenty years, the Library’s small and mighty Web Archiving Team (WAT) has sustained enormous accomplishments, often making the impossible happen through sheer will. Sometimes that meant securing a dedicated Terabyte (just one!) of storage for the burgeoning archive, releasing 8,000+ catalogue records, or training fifty web archives collection leaders on a new process to review the Library’s ongoing collections.

You may be thinking, “with such a small team and large amount of data, why don’t you just automate?” Fair question! Automation is a shared dream among organizations in the web archiving community to some extent, and has happened in pieces: automated indexing, improving crawlers and access tools, sharing seedlists for crawls at national or international scales, automated cataloguing. However, building automated processes takes human skills and human time. Time that is allocated towards flying the plane, even though we consistently field researcher requests for bulk data and hear lamentations of our lack of full-text search.

For these particular aspects, we have excellent role models in the community. We clearly see what is possible to do with web archives, and we imagine a near future dream world where everything is automated and the coffee is hot in the morning. The Library’s WAT needed our own way to make strides toward building that new world, but also manage expectations of how soon it could become our reality.

Enter: Scrum

Scrum is a lightweight Agile framework that allows teams to sort through complex projects and consistently deliver high-value results while minimizing waste. Scrum is rooted in software development and has been widely adopted in the technology domain, but its concepts are applicable elsewhere, even to libraries!

You can read a detailed Scrum Guide, but two core concepts are iteration and increments. During a period of time called a sprint (e.g., two weeks), work is planned, carried out, and reviewed, all by the team. Planning work only for the next two weeks forces a team to be realistic, prioritize the work, be transparent, and be flexible, considering ever-changing external priorities and new information. The team must also break tasks down into manageable sizes. For example, instead of a task being “create full-text indexes of the 2.3 PB archive” for one sprint, maybe a first task is “test one full-text indexing script on one file.”

This Is How We Do It

No two organizations use Scrum in exactly the same way. The WAT even runs our Scrum increments on a different schedule and intensity than our parent section, Digital Content Management, which adopted Scrum in 2018 and provided inspiration to our team. After two permanent WAT members left in Spring 2020, the team held a workshop in May to examine our new reality, what we wanted for the future, and our streams of work. We brainstormed and prioritized goals, with every team member having an equal voice in the shaping of our collective future.

From this workshop, we created the basis of our Scrum backlog. Streams of work and goals were translated into tickets and assigned to real people. Afterwards, we held our Sprint Zero to test two-week sprint lengths and utilized all four Scrum events. Since then, we have successfully completed sixteen two-week sprint cycles.

So, Does It Work?

During our impromptu work-from-home reality, the Scrum framework has been a uniting force for our highly communicative team, with consistent meetings providing space for team discussion and problem solving. Since we are a talkative and opinionated bunch, we might need to take it upon ourselves to speak up, but the space to do so exists.

We can also see Scrum functioning through our completion of research and development projects, while still performing ongoing tasks. Our team recently completed an expansive project where we provided reports about all twenty Library units’ web archiving activity to Library leadership. This project had been on our minds for years, but Scrum, along with the team’s propensity toward collaboration, gave us the framework to finally tackle it.

The team recently held a virtual, seven-month retrospective workshop to discuss whether Scrum is useful. We decided it works for us, but there are ways we can iterate and make it even better going forward. We are working on breaking down our tickets into smaller chunks, prioritizing our work, and wrangling a “definition of done” for our tasks.

Aside from the work itself, and whether we are truly producing “more” or “better” things, we finally feel in control of our work. We can provide greater focus on the things that matter now and leave tasks of lesser priority in the backlog for another day. We no longer have to feel overwhelmed that our entire plane is not yet built, but we glide along just the same.
