Publication

A Comparison of Techniques for Detecting Abnormal Change in Blogs...

by Richard Furuta, Frank Shipman, Paul L. Bogen II
Publication Type
Conference Paper
Conference Name
12th ACM/IEEE Joint Conference on Digital Libraries
Conference Location
Washington, Virginia, United States of America

Distributed collections are made of metadata entries that contain references to artifacts not controlled by the collection curators. These collections often undergo only limited forms of change; for digital distributed collections, these are primarily the creation and deletion of additional resources. However, there exists a class of digital collection that undergoes additional kinds of change. These collections consist of resources that are distributed across the Internet and brought together via hyperlinking. Resources in these collections can be expected to change over time. Part of the difficulty in maintaining these collections is determining whether a changed page is still a valid member of the collection. Others have tried to address this by defining a maximum allowed threshold of change; however, these methods treat change as a potential problem and treat web content as static despite its intrinsic dynamism. Instead, we acknowledge change as a normal part of a web document and determine the difference between what a maintainer expects a page to do and what it actually does. In this work, we evaluate options for extractors and analyzers from a suite of techniques against a human-generated ground-truth set of blog changes. The results show a statistically significant improvement over traditional threshold techniques for our collection.
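
The contrast drawn in the abstract, between a fixed maximum threshold of change and a comparison against a page's expected behavior, can be illustrated with a minimal sketch. This is not the paper's method: the function names, the change score in the range 0 to 1, the default threshold, and the use of a z-score over past change scores are all assumptions made purely for illustration.

```python
from statistics import mean, stdev


def exceeds_fixed_threshold(change, max_change=0.5):
    """Traditional approach: flag any change larger than a fixed limit.

    `change` is a hypothetical score in [0, 1] describing how much a page
    differs from its previous version; `max_change` is an assumed limit.
    """
    return change > max_change


def deviates_from_expectation(history, change, tolerance=3.0):
    """Expectation-based approach: flag a change that is unusual for this page.

    `history` holds past change scores for the same page (at least two are
    needed to compute a sample standard deviation). The z-score test is an
    illustrative stand-in, not the analyzer evaluated in the paper.
    """
    expected = mean(history)
    spread = stdev(history)
    if spread == 0:
        # The page has always changed by exactly the same amount.
        return change != expected
    return abs(change - expected) / spread > tolerance


# A blog that normally changes substantially between visits.
blog_history = [0.6, 0.7, 0.65, 0.7, 0.6]

# A typical update: large in absolute terms, but normal for this blog.
typical_update = 0.66
# The blog suddenly barely changes, which is unusual for it.
stalled_update = 0.05

fixed_typical = exceeds_fixed_threshold(typical_update)
expected_typical = deviates_from_expectation(blog_history, typical_update)
fixed_stalled = exceeds_fixed_threshold(stalled_update)
expected_stalled = deviates_from_expectation(blog_history, stalled_update)
```

In this toy data, the fixed threshold flags the typical update and misses the stalled one, while the expectation-based check does the opposite. The sketch only shows why treating change as normal can alter which pages are flagged; it says nothing about the extractors, analyzers, or results reported in the paper.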