Abstract
Distributed collections consist of metadata entries that reference artifacts not controlled by the collection's curators. Such collections usually change in limited ways; for digital distributed collections, change consists primarily of the creation and deletion of resources. However, one class of digital collection undergoes additional kinds of change. These collections consist of resources that are distributed across the Internet and brought together via hyperlinking, and the resources themselves can be expected to change over time. Part of the difficulty in maintaining such a collection is determining whether a changed page is still a valid member of the collection. Others have addressed this problem by defining a maximum allowed threshold of change; however, these methods treat change as a potential problem and treat web content as static despite its intrinsic dynamism. Instead, we acknowledge change as a normal part of a web document and determine the difference between what a maintainer expects a page to do and what it actually does. In this work, we evaluate options for extractors and analyzers from a suite of techniques against a human-generated ground-truth set of blog changes. The results show a statistically significant improvement over traditional threshold techniques for our collection.