
The Library of Congress Web Archiving Team Goes Agile


Today’s guest post is from Grace Thomas, Senior Digital Collections Specialist on the Library of Congress Web Archiving Team. You can read more about the Web Archiving Team right here on the Signal.


In the web archiving community, we build the plane and fly it simultaneously. While this pattern is present in most disciplines, web archiving is unique because web archives are massive in scale (petabytes (PB) of data), but diligently sustained by a small, geographically dispersed community. In the twenty years of the Library of Congress Web Archiving Program’s existence, the archive has grown from zero to 2.3 PB, while the size of the team has remained virtually the same: approximately five full-time staff members.

Over the past twenty years, the Library’s small and mighty Web Archiving Team (WAT) has achieved a great deal, often making the impossible happen through sheer will. Sometimes that meant securing a dedicated terabyte (just one!) of storage for the burgeoning archive, releasing 8,000+ catalogue records on loc.gov, or training fifty web archives collection leaders on a new process to review the Library’s ongoing collections.

You may be thinking, “with such a small team and large amount of data, why don’t you just automate?” Fair question! Automation is a shared dream among organizations in the web archiving community to some extent, and has happened in pieces: automated indexing, improving crawlers and access tools, sharing seedlists for crawls at national or international scales, automated cataloguing. However, building automated processes takes human skills and human time. Time that is allocated towards flying the plane, even though we consistently field researcher requests for bulk data and hear lamentations of our lack of full-text search.

For these particular aspects, we have excellent role models in the community. We clearly see what is possible to do with web archives, and we imagine a near future dream world where everything is automated and the coffee is hot in the morning. The Library’s WAT needed our own way to make strides toward building that new world, but also manage expectations of how soon it could become our reality.

Enter: Scrum

Scrum is a lightweight Agile framework that allows teams to sort through complex projects and consistently deliver high-value results while minimizing waste. Scrum is rooted in software development and has been widely adopted in the technology domain, but its concepts are applicable elsewhere, even to libraries!

You can read a detailed Scrum Guide, but two core concepts are iteration and increments. During a period of time called a sprint (e.g., two weeks), work is planned, carried out, and reviewed, all by the team. Planning work only for the next two weeks forces a team to be realistic, prioritize the work, be transparent, and be flexible, considering ever-changing external priorities and new information. The team must also break tasks down into manageable sizes. For example, instead of a task being “create full-text indexes of the 2.3 PB archive” for one sprint, maybe a first task is “test one full-text indexing script on one file.”

This Is How We Do It

No two organizations use Scrum in exactly the same way. The WAT even runs our Scrum increments on a different schedule and intensity than our parent section, Digital Content Management, which adopted Scrum in 2018 and provided inspiration to our team. After two permanent WAT members left in Spring 2020, the team held a workshop in May to examine our new reality, what we wanted for the future, and our streams of work. We brainstormed and prioritized goals, with every team member having an equal voice in the shaping of our collective future.

From this workshop, we created the basis of our Scrum backlog. Streams of work and goals were translated into tickets and assigned to real people. Afterwards, we held our Sprint Zero to test two-week sprint lengths and utilized all four Scrum events. Since then, we have successfully completed sixteen two-week sprint cycles.

So, Does It Work?

During our impromptu work-from-home reality, the Scrum framework has been a uniting force for our highly communicative team, with consistent meetings providing space for team discussion and problem solving. Since we are a talkative and opinionated bunch, we might need to take it upon ourselves to speak up, but the space to do so exists.

We can also see Scrum functioning through our completion of research and development projects, while still performing ongoing tasks. Our team recently completed an expansive project where we provided reports about all twenty Library units’ web archiving activity to Library leadership. This project had been on our minds for years, but Scrum, along with the team’s propensity toward collaboration, gave us the framework to finally tackle it.

The team recently held a virtual, seven-month retrospective workshop to discuss whether Scrum is useful. We decided it works for us, but there are ways we can iterate and make it even better going forward. We are working on breaking down our tickets into smaller chunks, prioritizing our work, and wrangling a “definition of done” for our tasks.

Aside from the work itself, and whether we are truly producing “more” or “better” things, we finally feel in control of our work. We can provide greater focus on the things that matter now and leave tasks of lesser priority in the backlog for another day. We no longer have to feel overwhelmed that our entire plane is not yet built, but we glide along just the same.
