This is a guest post from Kathleen Kenney, Digitization Specialist, Digital Information Management Program, State Library of North Carolina.
The State Library of North Carolina, in collaboration with the North Carolina State Archives, has been archiving North Carolina state agency web sites since 2005 and social media since 2009. Since then, we have crawled over 82 million documents and archived 6 terabytes of data.
With each quarterly crawl we are capturing over 5,000 hosts, many of which are out-of-scope. As a government agency it is our responsibility to ensure that we are not archiving any sites that are inappropriate due to their content, copyright status, or simply because they are not related to North Carolina government. Managing the crawl budget is also a priority. Performing a crawl analysis allows us to prevent out-of-scope hosts from being crawled, and at the same time we often encounter new seeds that should be actively captured. A seed is any URL that we want to capture in our crawl.
In the past, a crawl analysis involved downloading the crawl reports, importing the data into a spreadsheet, and comparing the current list to previously reviewed URLs in order to delete any duplicates. After de-duping, the team would manually review over 3,000 URLs in order to constrain any out-of-scope web sites from the next crawl. An onerous task, but necessary nonetheless.
We developed a new open source tool, Constraint Analysis, to streamline this process dramatically. The tool works in conjunction with the Archive-It web archiving service. Now, instead of using a spreadsheet to manage the crawl reports and remove duplicates, we use a series of automated database queries. And, rather than manually reviewing each remaining URL, we upload a .txt file of the URLs into the tool. The tool sends a request to a free third-party screen scraper service, which generates a .png image of each home page. The URL Listings page then displays the URLs alongside their home page images, 100 per page.
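The de-duplication and paging steps described above can be sketched roughly as follows. This is a minimal illustration in Python, not the tool's actual code: the table names, the function names `new_urls` and `paginate`, and the use of an in-memory SQLite database are all assumptions made for the example.

```python
import sqlite3


def new_urls(crawled, reviewed):
    """Return crawled URLs that have not been reviewed before, without duplicates.

    Table and column names are hypothetical; the real tool's schema may differ.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE crawled (url TEXT)")
    conn.execute("CREATE TABLE reviewed (url TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO crawled VALUES (?)", [(u,) for u in crawled])
    conn.executemany(
        "INSERT OR IGNORE INTO reviewed VALUES (?)", [(u,) for u in reviewed]
    )
    # Anti-join: keep only crawled URLs with no match in the reviewed table.
    rows = conn.execute(
        "SELECT DISTINCT c.url FROM crawled c "
        "LEFT JOIN reviewed r ON r.url = c.url "
        "WHERE r.url IS NULL ORDER BY c.url"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def paginate(urls, per_page=100):
    """Split the remaining URLs into pages of at most per_page entries."""
    return [urls[i:i + per_page] for i in range(0, len(urls), per_page)]
```

Each page returned by `paginate` would correspond to one screen of URLs and screenshots for a reviewer to scan.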
We are able to quickly scan the images and scroll through 100 images at a time, pausing only to click through to a URL if necessary, change a web site’s status from its default “constrain,” flag a site as a possible seed, or shorten a URL to constrain at the root, if necessary. Once all of the pages of images are reviewed, a .txt file of the data is exported to allow easy updating.
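As a rough sketch of the review actions above, the snippet below models one reviewed URL with the default "constrain" status, shortens a URL to its root, and builds lines for a text export. The class name, field names, and tab-delimited layout are assumptions for illustration, and "root" is taken here to mean the scheme plus host; the tool's real export format is not described in this post.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class ReviewedUrl:
    """One row on the review page (hypothetical structure)."""
    url: str
    status: str = "constrain"    # default status, changed only when needed
    possible_seed: bool = False  # flagged when the site looks worth capturing


def constrain_at_root(url):
    """Shorten a URL to scheme and host so the whole site is constrained."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        raise ValueError("expected an absolute URL, got %r" % url)
    return "%s://%s/" % (parts.scheme, parts.netloc)


def export_lines(records):
    """Build tab-delimited lines for a .txt export (assumed layout)."""
    return [
        "%s\t%s\t%d" % (r.url, r.status, int(r.possible_seed)) for r in records
    ]
```

For example, `constrain_at_root("http://spam.example.com/a/b?x=1")` yields `http://spam.example.com/`, which would then cover every page on that host.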
A modest change to the process and a simple tool reduced the time required for a crawl analysis from 16 hours to 4 hours. The Constraint Analysis tool is useful for crawl analysis and can also be modified for other quality assurance tasks such as analyzing seed lists or reviewing sites blocked by a robots.txt file. The tool is free and available for download from GitHub.
That’s a creative use of the wimg webpage screenshot service to do something extremely practical. It’s awesome to see the State Library of North Carolina putting their tools out on GitHub.
A tool like wkhtmltopdf allows you to convert a webpage into an image file using WebKit, which is the open source browser framework used by Safari on OS X. Something like wkhtmltopdf should make it possible to generate the images as part of the Constraint Analysis application, if more control over the image generation is needed.
Thanks for the nice words, Ed. We looked at wkhtmltopdf, but our ISP didn’t want to add it to their setup. We now have a dedicated server so I’ve been thinking of adding it. There’s also a nice wrapper setup on GitHub for using wkhtmltopdf with PHP (https://github.com/knplabs/snappy).
Some other tools for the screenshot job:
The one I like best is PhantomJS.
Thanks for the link to PhantomJS. The Wimg service has instituted a limit of 3 concurrent screenshot requests from the same IP, severely limiting the usefulness of the current approach. I’ve been testing wkhtmltopdf, but it’s not without its issues, at least on Windows.
Could you please tell me what license this tool is available under?
Hi, Maarten. All of the tools created by the State Library of North Carolina are in the public domain. Here is our rights statement, http://digital.ncdcr.gov/u?/p249901coll22,63754. Thank you for your interest in our Constraint Analysis tool.
The tool has been moved to a new location on GitHub. The new link is: https://github.com/SLNC-DIMP/Constraint-Analysis