This is a guest post from Kathleen Kenney, Digitization Specialist, Digital Information Management Program, State Library of North Carolina.
The State Library of North Carolina, in collaboration with the North Carolina State Archives, has been archiving North Carolina state agency web sites since 2005 and social media since 2009. Since then, we have crawled over 82 million documents and archived 6 terabytes of data.
With each quarterly crawl we capture over 5,000 hosts, many of which are out of scope. As a government agency, it is our responsibility to ensure that we are not archiving any sites that are inappropriate due to their content or copyright status, or simply because they are not related to North Carolina government. Managing the crawl budget is also a priority. Performing a crawl analysis allows us to prevent out-of-scope hosts from being crawled, and it often surfaces new seeds that should be actively captured. A seed is any URL that we want to capture in our crawl.
In the past, a crawl analysis involved downloading the crawl reports, importing the data into a spreadsheet, and comparing the current list to previously reviewed URLs in order to delete any duplicates. After de-duping, the team would manually review over 3,000 URLs in order to constrain any out-of-scope web sites from the next crawl. It was an onerous task, but a necessary one.
We developed a new open source tool, Constraint Analysis, to streamline this process dramatically. The tool works in conjunction with the Archive-It web archiving service. Now, instead of using a spreadsheet to manage the crawl reports and remove duplicates, we use a series of automated database queries. And, rather than manually reviewing each remaining URL, a .txt file of the URLs is uploaded into the tool. The tool sends a request to a free third-party screen scraper service, which generates a .png image of each site's home page. On the URL Listings page, the URLs and home page images are displayed, 100 per page.
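To make the workflow concrete, here is a minimal sketch of what the de-duplication step and the screenshot request might look like. The database schema, table and function names, and the screenshot service endpoint are illustrative assumptions, not the Constraint Analysis tool's actual implementation.

```python
# Minimal sketch: drop previously reviewed hosts, then fetch home page thumbnails.
# The "reviewed" table, file names, and SCREENSHOT_API endpoint are hypothetical.
import sqlite3
import urllib.parse
import urllib.request

DB_PATH = "crawl_review.db"                                 # hypothetical review database
SCREENSHOT_API = "https://screenshot.example.com/capture"   # placeholder third-party service

def new_hosts(current_hosts):
    """Return only the hosts that were not reviewed in a previous crawl analysis."""
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS reviewed (host TEXT PRIMARY KEY, status TEXT)")
    seen = {row[0] for row in conn.execute("SELECT host FROM reviewed")}
    conn.close()
    return [h for h in current_hosts if h not in seen]

def screenshot(url, out_path):
    """Ask the screen scraper service for a .png image of the site's home page."""
    request_url = SCREENSHOT_API + "?" + urllib.parse.urlencode({"url": url, "format": "png"})
    with urllib.request.urlopen(request_url) as resp, open(out_path, "wb") as fh:
        fh.write(resp.read())

# Read the uploaded .txt file of hosts, skip duplicates, and fetch one thumbnail per new host.
with open("current_crawl_hosts.txt") as fh:
    hosts = [line.strip() for line in fh if line.strip()]

for host in new_hosts(hosts):
    screenshot("http://" + host, host.replace("/", "_") + ".png")
```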
We are able to quickly scan the images and scroll through 100 at a time, pausing only to click through to a URL, change a web site’s status from its default “constrain,” flag a site as a possible seed, or shorten a URL to constrain at the root, if necessary. Once all of the pages of images are reviewed, a .txt file of the data is exported for easy updating.
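The export step might look something like the sketch below, which reuses the same hypothetical "reviewed" table; the column names and status values are assumptions for illustration only.

```python
# Minimal sketch of exporting reviewed hosts and their statuses to a .txt file.
import sqlite3

conn = sqlite3.connect("crawl_review.db")  # same hypothetical database as above
with open("constraint_analysis_export.txt", "w") as fh:
    for host, status in conn.execute("SELECT host, status FROM reviewed ORDER BY host"):
        # status in this sketch is one of: "constrain", "seed", or "in-scope"
        fh.write(f"{host}\t{status}\n")
conn.close()
```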
A modest change to the process and a simple tool reduced the time required for a crawl analysis from 16 hours to 4 hours. The Constraint Analysis tool is useful for crawl analysis and can also be modified for other quality assurance tasks such as analyzing seed lists or reviewing sites blocked by a robots.txt file. The tool is free and available for download from GitHub.