Abstract
The KDD community has described a multitude of methods for knowledge discovery on large datasets. We consider some of these methods and integrate them into an analyst’s workflow that proceeds from the data-centric descriptive level to the model-centric causal level. Examples of the workflow are shown for the Health Indicators Warehouse, which is a public database for community health information that is a potent resource for conducting data science on a medium scale. We demonstrate the potential of HIW as a source of serious visual analytics efforts by showing correlation matrix visualizations, multivariate outlier analysis, multiple linear regression of Medicare costs, and scatterplot matrices for a broad set of health indicators. We conclude by sketching the first steps toward a causal dependence hypothesis.